fix(pipeline): timeout + retry InstallPlaywrightModule so a hung download fails fast #5889

Merged
thomhurst merged 1 commit into main from fix/ci-playwright-install-retry
May 11, 2026

@thomhurst
Owner

Summary

npx playwright install --with-deps ran with no per-attempt timeout and no retry. When the network stalled mid-download (~500 MB of browser binaries), the module hung until the outer 30-minute module budget expired, and the failure then cascaded into RunPlaywrightTestsModule, UploadToNuGetModule, and CreateReleaseModule (run 25238011861).

Adds a 10-minute per-attempt timeout via a linked CancellationTokenSource plus the same ModuleConfiguration.WithRetryCount(2) pattern already used by TestNugetPackageModule. Happy path (~2 min on a warm runner) is unaffected; a hang now fails at 10 min and retries instead of burning the full 30.
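The linked-token timeout described above can be sketched in isolation. This is a minimal illustration, not the module's actual code: FakeInstallAsync and RunWithAttemptTimeoutAsync are hypothetical names, and a Task.Delay stands in for the real npx playwright install command. Only CancellationTokenSource.CreateLinkedTokenSource and CancelAfter are the APIs the PR relies on.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the install command: finishes after `duration` unless cancelled.
async Task<string> FakeInstallAsync(TimeSpan duration, CancellationToken token)
{
    await Task.Delay(duration, token);
    return "installed";
}

// Runs one attempt under its own deadline while still honouring the caller's token.
async Task<string> RunWithAttemptTimeoutAsync(
    Func<CancellationToken, Task<string>> attempt,
    TimeSpan perAttemptTimeout,
    CancellationToken cancellationToken)
{
    // The linked source cancels when the outer token cancels OR when the deadline passes.
    using var attemptCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
    attemptCts.CancelAfter(perAttemptTimeout);
    return await attempt(attemptCts.Token);
}

// Happy path: the work finishes well inside the deadline.
var fast = await RunWithAttemptTimeoutAsync(
    t => FakeInstallAsync(TimeSpan.FromMilliseconds(10), t),
    TimeSpan.FromSeconds(5),
    CancellationToken.None);
Console.WriteLine(fast); // prints "installed"

// Hang: the work would take 30 s, but the attempt is cancelled after 50 ms.
try
{
    await RunWithAttemptTimeoutAsync(
        t => FakeInstallAsync(TimeSpan.FromSeconds(30), t),
        TimeSpan.FromMilliseconds(50),
        CancellationToken.None);
}
catch (OperationCanceledException)
{
    Console.WriteLine("attempt timed out"); // prints "attempt timed out"
}
```

The durations are scaled down so the sketch runs in well under a second; in the PR the deadline is 10 minutes against a roughly 2-minute happy path.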

Notes

  • The repo has no root package.json, so the workflow's actions/cache@v5 step (.github/workflows/dotnet.yml:78-88) keys the cache against docs/**/package.json and Directory.Packages.props, and neither tracks the Playwright version. The cache will hit reliably across PR runs, but a new Playwright release silently changes which browser builds are required, so the cached binaries no longer match and a fresh download is triggered. Out of scope to fix here.

Test plan

  • dotnet build TUnit.Pipeline passes
  • Workflow runs across all three OS legs without regression

…load fails fast

`npx playwright install --with-deps` ran with no per-attempt timeout
and no retry. When the network stalled mid-download (~500 MB of
browser binaries), the module sat at the outer 30-minute module
budget before failing, cascading into RunPlaywrightTestsModule,
UploadToNuGetModule, and CreateReleaseModule (run 25238011861).

Adds a 10-minute per-attempt timeout via a linked CancellationTokenSource
plus the standard ModuleConfiguration.WithRetryCount(2). Happy path
(~2 min on a warm runner) is unaffected; a hang now fails at 10 min
and retries instead of burning the full 30.
Contributor

@claude claude Bot left a comment


Code Review

Overall: The design is correct and addresses a real production flakiness issue cleanly. Two points worth considering.


Concern: OperationCanceledException and retry behaviour

The linked CTS approach is correct for propagating external cancellation, but it introduces a subtle risk: when attemptCts fires after 10 minutes, BashCommandOptions execution throws OperationCanceledException. Whether ModularPipelines retries on OperationCanceledException depends on how its retry logic distinguishes between:

  • Timeout-caused OCE (from attemptCts.Token) → should retry
  • External-cancellation-caused OCE (from outer cancellationToken) → should not retry

Because attemptCts is a linked source, attemptCts.Token != cancellationToken, so cancellationToken.IsCancellationRequested is false when the timeout fires — the framework should retry. But if ModularPipelines' retry handler catches OperationCanceledException and simply rethrows without inspecting the token, the retry won't happen and the module fails after one attempt despite WithRetryCount(2).

Suggested check: Add a quick smoke-test where the timeout fires before the command finishes and confirm the framework actually retries. Alternatively, consider wrapping the cancellation and rethrowing as a non-OCE exception explicitly:

catch (OperationCanceledException) when (attemptCts.IsCancellationRequested && !cancellationToken.IsCancellationRequested)
{
    throw new TimeoutException($"Playwright install timed out after {PerAttemptTimeout.TotalMinutes} minutes");
}

This turns the timeout into a TimeoutException (which the retry framework should treat as an ordinary failure and retry), while an OCE caused by external cancellation fails the exception filter and propagates unchanged.
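A rough version of the suggested smoke test is sketched here. Everything in it is an assumption made for illustration: RunAttemptAsync and RunWithRetriesAsync are hypothetical helpers, and the retry loop is a deliberately simple stand-in that refuses to retry OperationCanceledException. It is not ModularPipelines' retry implementation, whose behaviour still needs to be confirmed against the real framework.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// One attempt: a per-attempt deadline surfaces as TimeoutException,
// while cancellation of the outer token still surfaces as OperationCanceledException.
async Task RunAttemptAsync(
    Func<CancellationToken, Task> work,
    TimeSpan perAttemptTimeout,
    CancellationToken cancellationToken)
{
    using var attemptCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
    attemptCts.CancelAfter(perAttemptTimeout);
    try
    {
        await work(attemptCts.Token);
    }
    catch (OperationCanceledException)
        when (attemptCts.IsCancellationRequested && !cancellationToken.IsCancellationRequested)
    {
        throw new TimeoutException($"Attempt timed out after {perAttemptTimeout.TotalMilliseconds} ms");
    }
}

// Hypothetical retry loop (NOT the framework's): retries any failure except
// OperationCanceledException, up to `retryCount` extra attempts. Returns attempts used.
async Task<int> RunWithRetriesAsync(
    Func<CancellationToken, Task> work,
    int retryCount,
    TimeSpan perAttemptTimeout,
    CancellationToken cancellationToken)
{
    var attempts = 0;
    while (true)
    {
        attempts++;
        try
        {
            await RunAttemptAsync(work, perAttemptTimeout, cancellationToken);
            return attempts;
        }
        catch (Exception ex) when (ex is not OperationCanceledException && attempts <= retryCount)
        {
            // Fall through to the next iteration; RunAttemptAsync builds a fresh linked CTS.
        }
    }
}

// Work that hangs on the first call and completes quickly afterwards.
var calls = 0;
Func<CancellationToken, Task> flaky = async token =>
{
    calls++;
    var duration = calls == 1 ? TimeSpan.FromSeconds(30) : TimeSpan.FromMilliseconds(1);
    await Task.Delay(duration, token);
};

var used = await RunWithRetriesAsync(flaky, 2, TimeSpan.FromMilliseconds(200), CancellationToken.None);
Console.WriteLine(used); // prints 2
```

Under this stand-in loop, a raw OperationCanceledException from the timeout would never be retried, which is exactly the failure mode the review warns about; converting it to TimeoutException is what lets the second attempt run.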


Minor: multi-line comment violates project style

CLAUDE.md specifies "Never write multi-paragraph docstrings or multi-line comment blocks — one short line max." The 3-line block comment on PerAttemptTimeout violates this.

The WHY here is non-obvious enough to warrant a comment, but it should be a single line — the PR description already documents the full reasoning:

// 10 min: far beyond happy-path (~2 min) but short enough to retry instead of burning the 30-min module budget.
private static readonly TimeSpan PerAttemptTimeout = TimeSpan.FromMinutes(10);

What's good

  • Correctly mirrors the WithRetryCount pattern from AbstractTestNugetPackageModule.
  • using var attemptCts + CancelAfter is the right modern pattern (no new CancellationTokenSource(TimeSpan) anti-pattern).
  • Each retry gets a fresh CancellationTokenSource because ExecuteAsync is re-entered per attempt — that part is correct.
  • 10-minute per-attempt ceiling is a sensible choice given the stated ~2 min happy path and 30-min outer budget.

@codacy-production

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues


@thomhurst thomhurst merged commit 74c6c46 into main May 11, 2026
13 of 14 checks passed
@thomhurst thomhurst deleted the fix/ci-playwright-install-retry branch May 11, 2026 22:07
@claude claude Bot mentioned this pull request May 14, 2026