Skip to content

ci: retry grype install on transient github 502s#701

Merged
mchmarny merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix-grype-install-retry
Apr 28, 2026
Merged

ci: retry grype install on transient github 502s#701
mchmarny merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix-grype-install-retry

Conversation

@yuanchen8911
Copy link
Copy Markdown
Contributor

Summary

Wrap tools/setup-tools grype install in a quadratic-backoff retry loop so transient github.com 502s during release-metadata lookups do not fail the merge-gate E2E job.

Motivation / Context

Recent merge-gate runs have failed spuriously with:

[error] received HTTP status=502 for url='https://github.com/anchore/grype/releases/v0.107.0'
##[error]Process completed with exit code 1.

Example: https://github.com/NVIDIA/aicr/actions/runs/25031579265/job/73314217949

The first download (raw.githubusercontent.com install.sh) succeeds; the script's subsequent github.com release-metadata query is what 502s. With set -e the step dies immediately on the first failure.

This change retries the anchore install.sh up to 3 times with 5s/20s backoff, mirroring the pattern already used in kwok/scripts/run-all-recipes.sh and pkg/bundler/deployer/helm/templates/deploy.sh.tmpl.

Fixes: N/A
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: setup-tools / dev-environment installer

Implementation Notes

  • Three attempts with quadratic backoff (5s, 20s); third failure aborts the step.
  • Cleans up GRYPE_TMP before exiting on terminal failure.
  • Uses the existing log_warning / log_error functions from tools/common.

Testing

  • bash -n tools/setup-tools — syntax OK.
  • actionlint/yamllint not applicable (shell-only change).
  • Full make qualify was not run because this is a tooling-script-only change with no Go, YAML, docs-sidebar, or recipe impact.

Risk Assessment

Blast radius is limited to the linux grype install path of tools/setup-tools. macOS install (brew install grype) is unchanged. Happy path (no 502) is identical — no extra latency in the success case. Worst-case added latency on persistent 502: ~25s before failing the same way it does today.

Checklist

  • My code follows the project's coding conventions
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have run make qualify and all checks pass

The anchore install.sh queries github.com for release metadata and
intermittently fails with 502 Bad Gateway during high-load periods, which
causes the merge-gate E2E job to fail spuriously. Wrap the install with
three quadratic-backoff retries (5s, 20s) so a brief CDN blip does not
cancel the whole CI step.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 28, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 00950486-0c12-49e3-9a77-e235281cc17f

📥 Commits

Reviewing files that changed from the base of the PR and between 3be21ed and c4a80c9.

📒 Files selected for processing (1)
  • tools/setup-tools

📝 Walkthrough

Walkthrough

The grype Linux installation process in tools/setup-tools has been modified to include retry logic. The anchore install.sh invocation now executes within an until loop with a maximum of 3 attempts. On each failure, the attempt counter increments, a warning log is emitted with the computed quadratic backoff delay (attempt² × 5 seconds), and the script sleeps before retrying. After exceeding the maximum attempts, an error is logged, the temporary installation directory is removed, and the script exits with status 1. On success, the grype binary is moved to /usr/local/bin as before.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'ci: retry grype install on transient github 502s' clearly and concisely summarizes the main change: adding retry logic to the grype installation step for handling transient GitHub failures.
Description check ✅ Passed The description comprehensively explains the motivation (transient 502 errors), the solution (quadratic-backoff retry loop), implementation details, testing performed, and risk assessment, all directly related to the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@mchmarny mchmarny merged commit 3b7e970 into NVIDIA:main Apr 28, 2026
66 of 67 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants