Skip to content

fix: address top-7 code-review findings across packages#721

Merged
mchmarny merged 4 commits into
mainfrom
chore/review-fixes-top7
Apr 30, 2026
Merged

fix: address top-7 code-review findings across packages#721
mchmarny merged 4 commits into
mainfrom
chore/review-fixes-top7

Conversation

@mchmarny
Copy link
Copy Markdown
Member

Summary

Applies seven hardening items identified in a cross-package code review covering error handling, ingress safety, retry/backoff for OCI push, atomic file writes, K8s helper extraction, and collector parallelism.

Motivation / Context

A cross-package review surfaced consistent gaps: unbounded HTTP POST bodies, an os/exec bash collector lacking project conventions, a non-atomic state file write, missing retries on registry pushes, duplicated K8s utilities that bypassed pkg/k8s/pod, sequential sub-collectors that could parallelize, and handler logs missing request-ID correlation. This PR addresses all seven prioritized findings.

Fixes: N/A
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • Refactoring (no functional changes)

Component(s) Affected

  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Other: pkg/build, pkg/oci, pkg/evidence, pkg/recipe, pkg/defaults

Implementation Notes

  1. Evidence collector — aggregate per-section errors via stderrors.Join, bound bash subprocess with context.WithTimeout, capture stdout/stderr into bounded buffers and emit via slog, probe bash/kubectl once at start, add WithNoCluster() short-circuit, capture os.RemoveAll cleanup error, ctx-check renderer index step.
  2. HTTP handlershttp.MaxBytesReader on bundler (8 MiB) and recipe (1 MiB) POSTs; *http.MaxBytesError → 413 with ErrCodeInvalidRequest. New server.RequestIDFromContext accessor; per-request slog logger with requestID in all three handlers.
  3. OCI push — per-attempt RegistryPushTimeout=10m, 3-attempt retry with exponential backoff + jitter on transient errors only (5xx/429/timeouts), slog.Warn on InsecureTLS=true, transport from defaults.NewHTTPTransport(), explicit oras.CopyOptions.Concurrency=3.
  4. Pod helpers — new pod.GetPodForJob, pod.WaitForTermination (with one-shot watch retry), pod.WaitForJobTerminal (Failed-as-completion mode). pkg/validator/job refactored to delegate; bare returns in EnsureRBAC wrapped; CleanupRBAC switched from string-joined error to stderrors.Join + WrapWithContext (preserves underlying error codes for errors.Is/As).
  5. Build spec — atomic WriteBack via temp+fsync+rename, MaxSpecFileBytes=1MiB cap on LoadSpec, KnownFields(true) YAML decoder, errors.Is(fs.ErrNotExist) replacing os.IsNotExist.
  6. (Folded into 1 and 2.)
  7. K8s collector — sub-collectors run via errgroup.WithContext, collectContainerImages paginated at K8sPodListPageSize=500, dynamic client cached via sync.Once.

Behavior change: Deployer.WaitForPodTermination now returns error (was no return). Single in-tree caller updated.

Testing

make test       # all in-scope packages pass with -race
make lint       # 0 issues across all modified packages
  • 31 files changed (26 modified + 5 new).
  • golangci-lint run -c .golangci.yaml on all modified packages: 0 issues.
  • Per-package test deltas: pkg/evidence 59.4% → 78.1%, pkg/k8s/pod 81.6%, pkg/server 94.0%, pkg/oci 77.6%, pkg/recipe 90.7%, pkg/validator/job 83.8%, pkg/build (with new tests for atomic write, oversize file, unknown YAML field).
  • Aggregate coverage on modified scope: 77.3% (above the 70% project floor).
  • Pre-existing sandbox-only failures in pkg/bundler/deployer/helm (mktemp on /var/folders) and pkg/trust (writes to ~/.sigstore) are untouched by this PR.

Risk Assessment

  • Medium — Touches multiple components with broader impact

Rollout notes:

  • Single API change: Deployer.WaitForPodTermination now returns error. Internal-only; sole caller (pkg/validator/validator.go) updated to log a warning on the new return value.
  • New pkg/defaults constants (MaxBundlePOSTBytes, MaxRecipePOSTBytes, RegistryPushTimeout, RegistryPushRetries, RegistryPushBackoff, OCIPushConcurrency, EvidenceSectionTimeout, EvidenceMaxOutputBytes, K8sPodListPageSize, MaxSpecFileBytes, SpecFileMode).
  • Behavior tightened (size caps, timeouts, retries) — no compatibility shims required; clients exceeding the caps now get explicit 413.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed (no user-facing surface — internal hardening only)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

Apply seven hardening items identified in a cross-package review:

1. Evidence collector — aggregate per-section errors via stderrors.Join,
   bound bash subprocess with context.WithTimeout, capture stdout/stderr
   into bounded buffers and emit via slog, probe bash/kubectl in PATH
   once at start, add WithNoCluster() short-circuit for test isolation,
   capture os.RemoveAll cleanup error, and ctx-check renderer index step.

2. HTTP handlers — wrap POST bodies with http.MaxBytesReader (8 MiB
   bundle, 1 MiB recipe), translate *http.MaxBytesError to 413 with
   structured error code, and tag every handler slog call with requestID
   via new server.RequestIDFromContext accessor.

3. OCI push — add per-attempt context.WithTimeout, bounded retry with
   exponential backoff + jitter on transient errors, slog.Warn when
   InsecureTLS=true, base transport on defaults.NewHTTPTransport(), and
   set explicit oras.CopyOptions.Concurrency.

4. K8s pod helpers — extract pod.GetPodForJob, pod.WaitForTermination
   (with one-shot watch retry), and pod.WaitForJobTerminal (Failed-as-
   completion mode). Refactor pkg/validator/job to delegate. Wrap bare
   returns in EnsureRBAC and replace stringified CleanupRBAC join with
   stderrors.Join + WrapWithContext.

5. Build spec — make WriteBack atomic via temp+fsync+rename, cap
   LoadSpec input at MaxSpecFileBytes (1 MiB), add KnownFields(true)
   to YAML decode, and replace os.IsNotExist with errors.Is(fs.ErrNotExist).

6. Folded into items 1 and 2.

7. K8s collector — parallelize sub-collectors via errgroup, paginate
   collectContainerImages at K8sPodListPageSize=500, and cache the
   dynamic client via sync.Once.

Tests added/updated for each change. golangci-lint passes 0 issues
across all modified packages; coverage on changed scope is 77.3%
(above the 70% project floor). Pre-existing unrelated sandbox-only
test failures in pkg/bundler/deployer/helm and pkg/trust are not
touched by this PR.
@github-actions

This comment was marked as resolved.

@coderabbitai

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

Apply 12 inline review comments:

- pkg/build/spec.go: use os.CreateTemp for unique per-write tmp file so
  concurrent WriteBack callers cannot clobber each other; fsync the
  parent directory after rename for crash-safety; add a renameFn package
  seam so tests can deterministically exercise the rename failure path.
  Drop redundant chmod (CreateTemp opens with 0o600 by default).
- pkg/build/spec_test.go: rename misleading test from
  RenameFailurePreservesOriginal -> TempCreateFailurePreservesOriginal
  to reflect what it actually exercises (read-only directory blocks
  CreateTemp). Add real RenameFailurePreservesOriginal that uses the
  renameFn seam. Update glob checks for the new ".tmp-*" pattern.

- pkg/k8s/pod/job.go: in WaitForJobCompletion and WaitForJobTerminal,
  preserve ErrCodeTimeout classification when the watch channel closes
  after the deadline (previously surfaced as ErrCodeInternal). Handle
  watch.Error events explicitly with ErrCodeInternal instead of silently
  continuing on the type-assert miss.

- pkg/evidence/collector.go: check ctx.Err() before each section dispatch
  so cancellation stops creating temp work after the caller aborts.
  Surface section subprocess timeouts as ErrCodeTimeout (was
  ErrCodeInternal). Aggregate-level error preserves ErrCodeTimeout when
  any section reported a timeout.

- pkg/validator/job/deployer.go: WaitForCompletion now propagates
  pod.WaitForJobTerminal errors as-is so structured codes
  (ErrCodeTimeout, ErrCodeUnavailable) survive for retry classification.
  WaitForPodTermination only swallows ErrCodeNotFound; other errors
  (RBAC, transient API failures) propagate so cleanup can race-aware.

- pkg/oci/push.go: clone existing TLSClientConfig (or allocate fresh)
  and only mutate InsecureSkipVerify, preserving any future hardening
  defaults from defaults.NewHTTPTransport (MinVersion, cipher suites).

- pkg/bundler/handler_test.go, pkg/recipe/handler_test.go,
  pkg/recipe/handler_query_test.go: tighten 413 assertions to verify
  exact limit_bytes value matches defaults.MaxBundlePOSTBytes /
  MaxRecipePOSTBytes. Use a valid JSON envelope for the recipe and
  query tests so MaxBytesReader detection is deterministic rather than
  coupled to early JSON syntax errors.

Tests added for the new error-handling paths:
- WaitForJobCompletion_WatchError, _WatchClosedAfterTimeout
- WaitForJobTerminal_WatchError, _WatchClosedAfterTimeout
- evidence: TestRunPreservesTimeoutCode,
  TestRunStopsOnContextCancellation
- validator/job: TestWaitForPodTerminationPropagatesNonNotFound

golangci-lint passes 0 issues across all modified packages.
coderabbitai[bot]

This comment was marked as resolved.

mchmarny and others added 2 commits April 30, 2026 13:20
CI's pinned golangci-lint v2.10.1 with gosec flags os.Remove(tmp) on
the WriteBack failure-cleanup paths because tmp's filename derives from
the path argument via os.CreateTemp. The tmp variable is the return
value of f.Name() from a CreateTemp call earlier in the same function;
removing it is the inverse of that creation, not a tainted-path op.

Reproduced with the pinned linter version (local v2.11.4 didn't flag).
coderabbitai[bot]

This comment was marked as resolved.

Copy link
Copy Markdown
Contributor

@lockwobr lockwobr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mchmarny mchmarny merged commit de84b0f into main Apr 30, 2026
30 checks passed
@mchmarny mchmarny deleted the chore/review-fixes-top7 branch April 30, 2026 21:05
@mchmarny mchmarny restored the chore/review-fixes-top7 branch May 1, 2026 16:30
@mchmarny mchmarny deleted the chore/review-fixes-top7 branch May 1, 2026 16:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants