fix(pipeline): timeout + retry InstallPlaywrightModule so a hung download fails fast #5889
…load fails fast `npx playwright install --with-deps` ran with no per-attempt timeout and no retry. When the network stalled mid-download (~500 MB of browser binaries), the module sat at the outer 30-minute module budget before failing, cascading into RunPlaywrightTestsModule, UploadToNuGetModule, and CreateReleaseModule (run 25238011861). Adds a 10-minute per-attempt timeout via a linked CancellationTokenSource plus the standard ModuleConfiguration.WithRetryCount(2). Happy path (~2 min on a warm runner) is unaffected; a hang now fails at 10 min and retries instead of burning the full 30.
Code Review
Overall: The design is correct and addresses a real production flakiness issue cleanly. Two points worth considering.
Concern: OperationCanceledException and retry behaviour
The linked CTS approach is correct for propagating external cancellation, but it introduces a subtle risk: when attemptCts fires after 10 minutes, BashCommandOptions execution throws OperationCanceledException. Whether ModularPipelines retries on OperationCanceledException depends on how its retry logic distinguishes between:
- Timeout-caused OCE (from attemptCts.Token) → should retry
- External-cancellation-caused OCE (from outer cancellationToken) → should not retry
Because attemptCts is a linked source, attemptCts.Token != cancellationToken, so cancellationToken.IsCancellationRequested is false when the timeout fires — the framework should retry. But if ModularPipelines' retry handler catches OperationCanceledException and simply rethrows without inspecting the token, the retry won't happen and the module fails after one attempt despite WithRetryCount(2).
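The token relationship described above can be checked in isolation. The following is a minimal sketch, not the module's code: it assumes a .NET 6+ top-level program, uses a 50 ms timeout in place of the 10-minute one, and simulates the hung download with Task.Delay. The names outerCts and timedOut are illustrative only.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Stand-in for the outer token the pipeline would pass to the module (assumption).
using var outerCts = new CancellationTokenSource();
CancellationToken cancellationToken = outerCts.Token;

// Per-attempt source linked to the outer token; 50 ms stands in for the 10-minute timeout.
using var attemptCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
attemptCts.CancelAfter(TimeSpan.FromMilliseconds(50));

bool timedOut = false;
try
{
    // Simulates a hung download: never completes unless the token is cancelled.
    await Task.Delay(Timeout.InfiniteTimeSpan, attemptCts.Token);
}
catch (OperationCanceledException)
{
    // Timeout case: the linked source fired, but the outer token was never cancelled.
    timedOut = attemptCts.IsCancellationRequested && !cancellationToken.IsCancellationRequested;
}

Console.WriteLine($"attempt token cancelled: {attemptCts.IsCancellationRequested}");
Console.WriteLine($"outer token cancelled:   {cancellationToken.IsCancellationRequested}");
Console.WriteLine($"classified as timeout:   {timedOut}");
```

Cancelling a linked source never cancels its parent, which is why the outer token stays uncancelled here. Whether ModularPipelines inspects the token in this way is exactly the open question; the sketch only confirms the .NET side.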
Suggested check: Add a quick smoke-test where the timeout fires before the command finishes and confirm the framework actually retries. Alternatively, consider wrapping the cancellation and rethrowing as a non-OCE exception explicitly:
catch (OperationCanceledException) when (attemptCts.IsCancellationRequested && !cancellationToken.IsCancellationRequested)
{
    throw new TimeoutException($"Playwright install timed out after {PerAttemptTimeout.TotalMinutes} minutes");
}
This makes the timeout a TimeoutException (which the retry framework will definitely retry), and lets the original OCE propagate unchanged only when external cancellation was the cause.
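A smoke test along these lines could look like the sketch below. It is written under stated assumptions: RunWithAttemptTimeoutAsync and RunWithRetriesAsync are hypothetical helpers, and the retry loop is a stand-in for the framework's retry handling, not ModularPipelines' actual implementation. The fake command hangs on its first call and completes on its second.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical per-attempt wrapper: a timeout becomes TimeoutException,
// while external cancellation propagates as OperationCanceledException.
static async Task RunWithAttemptTimeoutAsync(
    Func<CancellationToken, Task> command,
    TimeSpan perAttemptTimeout,
    CancellationToken cancellationToken)
{
    using var attemptCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
    attemptCts.CancelAfter(perAttemptTimeout);
    try
    {
        await command(attemptCts.Token);
    }
    catch (OperationCanceledException)
        when (attemptCts.IsCancellationRequested && !cancellationToken.IsCancellationRequested)
    {
        throw new TimeoutException($"Attempt timed out after {perAttemptTimeout.TotalMilliseconds} ms");
    }
}

// Hypothetical stand-in for the framework's retry handler (assumption):
// retries any exception except OperationCanceledException, up to retryCount times.
static async Task<int> RunWithRetriesAsync(Func<Task> attempt, int retryCount)
{
    for (int attemptNumber = 1; ; attemptNumber++)
    {
        try
        {
            await attempt();
            return attemptNumber;
        }
        catch (Exception ex) when (ex is not OperationCanceledException && attemptNumber <= retryCount)
        {
            // Swallow the failure and fall through to the next attempt.
        }
    }
}

int calls = 0;

// Fake command: hangs on the first call (until cancelled), completes immediately afterwards.
Func<CancellationToken, Task> flakyCommand = token =>
    ++calls == 1 ? Task.Delay(Timeout.InfiniteTimeSpan, token) : Task.CompletedTask;

int attemptsUsed = await RunWithRetriesAsync(
    () => RunWithAttemptTimeoutAsync(flakyCommand, TimeSpan.FromMilliseconds(50), CancellationToken.None),
    retryCount: 2);

Console.WriteLine($"succeeded on attempt {attemptsUsed}");
```

With the TimeoutException conversion in place, the stand-in handler retries after the first attempt times out. Against the real framework, the same fake command would show whether a bare OperationCanceledException is also retried.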
Minor: multi-line comment violates project style
CLAUDE.md specifies "Never write multi-paragraph docstrings or multi-line comment blocks — one short line max." The 3-line block comment on PerAttemptTimeout violates this.
The WHY here is non-obvious enough to warrant a comment, but it should be a single line — the PR description already documents the full reasoning:
// 10 min: far beyond happy-path (~2 min) but short enough to retry instead of burning the 30-min module budget.
private static readonly TimeSpan PerAttemptTimeout = TimeSpan.FromMinutes(10);
What's good
- Correctly mirrors the WithRetryCount pattern from AbstractTestNugetPackageModule.
- using var attemptCts + CancelAfter is the right modern pattern (no new CancellationTokenSource(TimeSpan) anti-pattern).
- Each retry gets a fresh CancellationTokenSource because ExecuteAsync is re-entered per attempt; that part is correct.
- 10-minute per-attempt ceiling is a sensible choice given the stated ~2 min happy path and 30-min outer budget.
Summary
`npx playwright install --with-deps` ran with no per-attempt timeout and no retry. When the network stalled mid-download (~500 MB of browser binaries), the module sat at the outer 30-minute module budget before failing, cascading into RunPlaywrightTestsModule, UploadToNuGetModule, and CreateReleaseModule (run 25238011861).
Adds a 10-minute per-attempt timeout via a linked CancellationTokenSource plus the same ModuleConfiguration.WithRetryCount(2) pattern already used by TestNugetPackageModule. Happy path (~2 min on a warm runner) is unaffected; a hang now fails at 10 min and retries instead of burning the full 30.
Notes
package.json, so the workflow's actions/cache@v5 step (.github/workflows/dotnet.yml:78-88) keys the cache against docs/**/package.json and Directory.Packages.props; neither tracks the Playwright version. Cache will hit reliably across PR runs, but a Playwright release will silently invalidate which browsers are needed and trigger a fresh download. Out of scope to fix here.
Test plan
dotnet build TUnit.Pipeline passes