ci: cache ES Docker image and share containers across integration tests#1296
ci: cache ES Docker image and share containers across integration tests#1296
Conversation
Integration tests were taking 32+ minutes because the first test triggered a cold pull of the ES SNAPSHOT image via testcontainers (~26 min), and each of the 6 tests spun up its own container. - Add Docker image caching to the workflow using actions/cache; on cache hit the image loads in seconds instead of pulling for minutes. - Add a shared TestMain that starts one secure and one insecure ES container concurrently, replacing 6 sequential container lifecycles with 2 parallel ones. - Pin all workflow actions to commit SHAs (checkout v6.0.2, setup-go v6.3.0, cache v5.0.3).
✅ Snyk checks have passed. No issues have been found so far.
💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse. |
This comment was marked as outdated.
This comment was marked as outdated.
The Persistent transport test asserted exact equality of total_opened across 101 requests. With a shared ES container, background activity and transport warm-up cause a small number of extra connections. Check the delta is negligible (<=15) rather than zero.
Super-linter summary
All files and directories linted successfully For more information, see the GitHub Actions workflow run Powered by Super-linter |
There was a problem hiding this comment.
Pull request overview
This PR optimizes integration test CI performance by caching the Elasticsearch Docker image and sharing containers across tests via TestMain. The approach replaces per-test container lifecycle management (6 sequential starts/stops taking ~30s each) with 2 parallel container starts at suite initialization, reducing test execution time from ~32 minutes to ~2-5 minutes depending on cache status.
Changes:
- Introduced shared test infrastructure using
TestMainto start one secure and one insecure ES container concurrently for the entire test suite - Added Docker image caching in GitHub Actions workflow to avoid repeated image pulls
- Pinned all workflow actions to specific commit SHAs for security and reproducibility
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| internal/testing/e2e/shared_test.go | New file implementing TestMain with concurrent container startup and shared configuration variables |
| internal/testing/e2e/json_reader_integration_test.go | Removed per-test container setup, now uses shared configuration |
| internal/testing/e2e/esapi_integration_test.go | Removed per-test container setup, now uses shared configuration |
| internal/testing/e2e/elasticsearch_integration_test.go | Removed per-test container setup, improved Persistent test with baseline/delta logic for shared container |
| internal/testing/e2e/bulk_indexer_integration_test.go | Removed per-test container setup, now uses shared configuration |
| internal/testing/e2e/base64_bulk_integration_test.go | Removed per-test container setup and external cluster fallback logic, now uses shared configuration |
| .github/workflows/test-integration.yml | Added ES Docker image caching step and pinned action versions to commit SHAs |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| "github.com/elastic/go-elasticsearch/v9" | ||
|
|
||
| "testing/containertest" |
There was a problem hiding this comment.
The import path "testing/containertest" is incorrect. Based on the module structure where this file is in internal/testing/e2e/ and the containertest package is in internal/testing/containertest/, the correct import should be just "containertest" (since both packages are under the same testing module as defined in internal/testing/go.mod). This will cause a compilation error.
| "testing/containertest" | |
| "containertest" |
| uses: actions/cache@cdf6c1fa76f9f475f3d7449005a359c84ca0f306 # v5.0.3 | ||
| with: | ||
| path: /tmp/es-image.tar | ||
| key: es-image-${{ env.ELASTICSEARCH_VERSION }}-week-${{ github.run_number }} |
There was a problem hiding this comment.
The cache key includes github.run_number which is unique for every workflow run, defeating the purpose of caching. This will result in a cache miss on every run. The key should use a static value or time-based rotation (like week number) without the run_number. Consider using just es-image-${{ env.ELASTICSEARCH_VERSION }}-week-<static-week-calculation> or remove the -week-${{ github.run_number }} suffix entirely.
| key: es-image-${{ env.ELASTICSEARCH_VERSION }}-week-${{ github.run_number }} | |
| key: es-image-${{ env.ELASTICSEARCH_VERSION }} |
| // With persistent connections most of the 101 requests reuse | ||
| // the same connection. Allow a small delta for transport | ||
| // warm-up and background ES activity on the shared container. | ||
| if delta > 15 { |
There was a problem hiding this comment.
The threshold of 15 new connections over 101 requests seems overly permissive for a persistent connection test. While accounting for shared container activity is reasonable, allowing ~15% new connections may mask genuine connection pool issues. Consider using a tighter threshold (e.g., 5-7) or implementing a more robust isolation mechanism. If tests run concurrently with t.Parallel(), this shared state could lead to flaky test results.
| if delta > 15 { | |
| if delta > 7 { |
| if err := secureSrv.Terminate(context.Background()); err != nil { | ||
| errs = append(errs, fmt.Errorf("secure container: %w", err)) | ||
| } | ||
| if err := insecureSrv.Terminate(context.Background()); err != nil { | ||
| errs = append(errs, fmt.Errorf("insecure container: %w", err)) |
There was a problem hiding this comment.
The cleanup code attempts to terminate containers even if their initialization failed. If secureSrv or insecureSrv is nil (due to an early error in NewElasticsearchService), calling Terminate() will panic. Add nil checks before calling Terminate() on each service.
| if err := secureSrv.Terminate(context.Background()); err != nil { | |
| errs = append(errs, fmt.Errorf("secure container: %w", err)) | |
| } | |
| if err := insecureSrv.Terminate(context.Background()); err != nil { | |
| errs = append(errs, fmt.Errorf("insecure container: %w", err)) | |
| if secureSrv != nil { | |
| if err := secureSrv.Terminate(context.Background()); err != nil { | |
| errs = append(errs, fmt.Errorf("secure container: %w", err)) | |
| } | |
| } | |
| if insecureSrv != nil { | |
| if err := insecureSrv.Terminate(context.Background()); err != nil { | |
| errs = append(errs, fmt.Errorf("insecure container: %w", err)) | |
| } |
| insecureCfg := insecureSrv.ESConfig() | ||
| sharedInsecureCfg = insecureCfg | ||
| sharedInsecureAddr = insecureCfg.Addresses[0] |
There was a problem hiding this comment.
The local variable insecureCfg on line 84 is unnecessary and creates a redundant intermediate variable. You can directly assign insecureSrv.ESConfig() to sharedInsecureCfg on line 85. Additionally, the sharedInsecureCfg variable is declared but never used anywhere in the codebase - only sharedInsecureAddr is used. Consider removing the unused variable or documenting if it's intentionally kept for future use.
Summary
actions/cache. On cache hit the image loads in seconds; on miss it pulls once and saves a tarball for subsequent runs.TestMain: start one secure and one insecure ES container concurrently at the beginning of the test suite, replacing 6 sequential per-test container lifecycles with 2 parallel ones. No test file manages container lifecycle anymore.Root cause
TestBase64BulkIndexingwas the first test to run and triggered a cold, unauthenticateddocker pullof the ES SNAPSHOT image via testcontainers — taking ~26 minutes on the GitHub Actions runner. All subsequent tests reused the cached image but each still spun up its own container (~30s startup + ~10s teardown each).Expected impact
Test plan
docker save)docker load)