ci: cache ES Docker image and share containers across integration tests #1296

Draft
MattDevy wants to merge 2 commits into main from ci/cache-docker-images

@MattDevy
Contributor

Summary

  • Cache the ES Docker image in CI using actions/cache. On cache hit the image loads in seconds; on miss it pulls once and saves a tarball for subsequent runs.
  • Share containers via TestMain: start one secure and one insecure ES container concurrently at the beginning of the test suite, replacing 6 sequential per-test container lifecycles with 2 parallel ones. No test file manages container lifecycle anymore.
  • Pin all workflow actions to commit SHAs (checkout v6.0.2, setup-go v6.3.0, cache v5.0.3).

Root cause

TestBase64BulkIndexing was the first test to run and triggered a cold, unauthenticated docker pull of the ES SNAPSHOT image via testcontainers — taking ~26 minutes on the GitHub Actions runner. All subsequent tests reused the cached image but each still spun up its own container (~30s startup + ~10s teardown each).

Expected impact

| Scenario         | Before  | After    |
|------------------|---------|----------|
| Cold (no cache)  | ~32 min | ~4-5 min |
| Warm (cache hit) | ~32 min | ~2 min   |

Test plan

  • Verify the integration workflow passes on this PR
  • Confirm Docker image cache is populated on first run (docker save)
  • Confirm subsequent runs load from cache (docker load)
  • Verify all 6 integration tests still pass with shared containers

Integration tests were taking 32+ minutes because the first test
triggered a cold pull of the ES SNAPSHOT image via testcontainers
(~26 min), and each of the 6 tests spun up its own container.

- Add Docker image caching to the workflow using actions/cache;
  on cache hit the image loads in seconds instead of pulling for
  minutes.
- Add a shared TestMain that starts one secure and one insecure
  ES container concurrently, replacing 6 sequential container
  lifecycles with 2 parallel ones.
- Pin all workflow actions to commit SHAs (checkout v6.0.2,
  setup-go v6.3.0, cache v5.0.3).

prodsecmachine commented Feb 26, 2026

Snyk checks have passed. No issues have been found so far.

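| Scanner              | Critical | High | Medium | Low | Total (0) |
|----------------------|----------|------|--------|-----|-----------|
| Open Source Security | 0        | 0    | 0      | 0   | 0 issues  |
| Licenses             | 0        | 0    | 0      | 0   | 0 issues  |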

The Persistent transport test asserted exact equality of total_opened
across 101 requests. With a shared ES container, background activity
and transport warm-up cause a small number of extra connections.
Check the delta is negligible (<=15) rather than zero.
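
The baseline/delta check described in this commit message can be sketched as a small helper. This is an illustrative assumption rather than the test's actual code: `checkConnectionDelta` is a hypothetical name, and the counters are passed in as plain integers instead of being read from the cluster's total_opened statistic.

```go
package main

import "fmt"

// checkConnectionDelta compares a cumulative counter sampled before and
// after a batch of requests and reports an error if more than maxDelta
// new connections were opened in between.
func checkConnectionDelta(before, after, maxDelta int) error {
	delta := after - before
	if delta < 0 {
		// A cumulative counter should not decrease; treat it as a bad sample.
		return fmt.Errorf("counter went backwards: before=%d after=%d", before, after)
	}
	if delta > maxDelta {
		return fmt.Errorf("opened %d new connections, want at most %d", delta, maxDelta)
	}
	return nil
}

func main() {
	// Hypothetical samples taken around 101 requests on a shared container.
	fmt.Println(checkConnectionDelta(120, 123, 15)) // within tolerance
	fmt.Println(checkConnectionDelta(120, 140, 15)) // exceeds tolerance
}
```

Sampling a baseline first means connections opened earlier by other tests on the shared container do not count against this test; only the growth during its own requests is compared with the tolerance.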

@github-actions
Contributor

Super-linter summary

| Language                   | Validation result |
|----------------------------|-------------------|
| GITHUB_ACTIONS             | Pass ✅           |
| GITLEAKS                   | Pass ✅           |
| GIT_MERGE_CONFLICT_MARKERS | Pass ✅           |
| SPELL_CODESPELL            | Pass ✅           |
| YAML                       | Pass ✅           |
| YAML_PRETTIER              | Pass ✅           |

All files and directories linted successfully

For more information, see the GitHub Actions workflow run

Powered by Super-linter

Contributor

Copilot AI left a comment

Pull request overview

This PR optimizes integration test CI performance by caching the Elasticsearch Docker image and sharing containers across tests via TestMain. The approach replaces per-test container lifecycle management (6 sequential lifecycles, each with ~30s startup plus ~10s teardown) with 2 parallel container starts at suite initialization, reducing test execution time from ~32 minutes to ~2-5 minutes depending on cache status.

Changes:

  • Introduced shared test infrastructure using TestMain to start one secure and one insecure ES container concurrently for the entire test suite
  • Added Docker image caching in GitHub Actions workflow to avoid repeated image pulls
  • Pinned all workflow actions to specific commit SHAs for security and reproducibility

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

| File | Description |
|------|-------------|
| internal/testing/e2e/shared_test.go | New file implementing TestMain with concurrent container startup and shared configuration variables |
| internal/testing/e2e/json_reader_integration_test.go | Removed per-test container setup, now uses shared configuration |
| internal/testing/e2e/esapi_integration_test.go | Removed per-test container setup, now uses shared configuration |
| internal/testing/e2e/elasticsearch_integration_test.go | Removed per-test container setup, improved Persistent test with baseline/delta logic for shared container |
| internal/testing/e2e/bulk_indexer_integration_test.go | Removed per-test container setup, now uses shared configuration |
| internal/testing/e2e/base64_bulk_integration_test.go | Removed per-test container setup and external cluster fallback logic, now uses shared configuration |
| .github/workflows/test-integration.yml | Added ES Docker image caching step and pinned action versions to commit SHAs |


"github.com/elastic/go-elasticsearch/v9"

"testing/containertest"

Copilot AI Feb 26, 2026

The import path "testing/containertest" is incorrect. Based on the module structure where this file is in internal/testing/e2e/ and the containertest package is in internal/testing/containertest/, the correct import should be just "containertest" (since both packages are under the same testing module as defined in internal/testing/go.mod). This will cause a compilation error.

Suggested change
- "testing/containertest"
+ "containertest"

Copilot uses AI. Check for mistakes.

uses: actions/cache@cdf6c1fa76f9f475f3d7449005a359c84ca0f306 # v5.0.3
with:
  path: /tmp/es-image.tar
  key: es-image-${{ env.ELASTICSEARCH_VERSION }}-week-${{ github.run_number }}

Copilot AI Feb 26, 2026

The cache key includes github.run_number, which is unique for every workflow run, so the key never matches one saved by an earlier run and every run is a cache miss. This defeats the purpose of caching. The key should use a static value or a time-based rotation (such as a week number) instead of the run number. Consider using es-image-${{ env.ELASTICSEARCH_VERSION }}-week-<static-week-calculation>, or removing the -week-${{ github.run_number }} suffix entirely.

Suggested change
- key: es-image-${{ env.ELASTICSEARCH_VERSION }}-week-${{ github.run_number }}
+ key: es-image-${{ env.ELASTICSEARCH_VERSION }}

// With persistent connections most of the 101 requests reuse
// the same connection. Allow a small delta for transport
// warm-up and background ES activity on the shared container.
if delta > 15 {

Copilot AI Feb 26, 2026

The threshold of 15 new connections over 101 requests seems overly permissive for a persistent connection test. While accounting for shared container activity is reasonable, allowing ~15% new connections may mask genuine connection pool issues. Consider using a tighter threshold (e.g., 5-7) or implementing a more robust isolation mechanism. If tests run concurrently with t.Parallel(), this shared state could lead to flaky test results.

Suggested change
- if delta > 15 {
+ if delta > 7 {

Comment on lines +91 to +95
if err := secureSrv.Terminate(context.Background()); err != nil {
	errs = append(errs, fmt.Errorf("secure container: %w", err))
}
if err := insecureSrv.Terminate(context.Background()); err != nil {
	errs = append(errs, fmt.Errorf("insecure container: %w", err))

Copilot AI Feb 26, 2026

The cleanup code attempts to terminate containers even if their initialization failed. If secureSrv or insecureSrv is nil (due to an early error in NewElasticsearchService), calling Terminate() will panic. Add nil checks before calling Terminate() on each service.

Suggested change
- if err := secureSrv.Terminate(context.Background()); err != nil {
- 	errs = append(errs, fmt.Errorf("secure container: %w", err))
- }
- if err := insecureSrv.Terminate(context.Background()); err != nil {
- 	errs = append(errs, fmt.Errorf("insecure container: %w", err))
+ if secureSrv != nil {
+ 	if err := secureSrv.Terminate(context.Background()); err != nil {
+ 		errs = append(errs, fmt.Errorf("secure container: %w", err))
+ 	}
+ }
+ if insecureSrv != nil {
+ 	if err := insecureSrv.Terminate(context.Background()); err != nil {
+ 		errs = append(errs, fmt.Errorf("insecure container: %w", err))
+ 	}

Comment on lines +84 to +86
insecureCfg := insecureSrv.ESConfig()
sharedInsecureCfg = insecureCfg
sharedInsecureAddr = insecureCfg.Addresses[0]

Copilot AI Feb 26, 2026

The local variable insecureCfg on line 84 is a redundant intermediate: insecureSrv.ESConfig() can be assigned directly to sharedInsecureCfg on line 85. In addition, sharedInsecureCfg is declared but never used anywhere in the codebase; only sharedInsecureAddr is used. Consider removing the unused variable, or documenting it if it is intentionally kept for future use.
