New file: `enhancements/test-platform/e2e-observer-pods.md` (119 additions)
---
title: e2e-observer-pods
authors:
- "@deads2k"
reviewers:
approvers:
- "@stevek"
creation-date: yyyy-mm-dd
last-updated: yyyy-mm-dd
status: provisional|implementable|implemented|deferred|rejected|withdrawn|replaced
see-also:
replaces:
superseded-by:
---

# e2e Observer Pods

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

e2e tests have multiple dimensions:
1. which tests are running (parallel, serial, conformance, operator-specific, etc.)
2. which platform (gcp, aws, azure, etc.)
3. which configuration of that platform (proxy, fips, ovs, etc.)
4. which version is running (4.4, 4.5, 4.6, etc.)
5. which version change is running (single upgrade, double upgrade, minor upgrade, etc.)

Each of our individual jobs is an intersection of these dimensions, and that intersectionality drove the development of
steps to help handle the complexity.

There is a kind of CI test observer that wants to run across all of these intersections by leveraging the uniformity of
OCP across all the different dimensions above.
Such an observer wants to start an agent of some kind for every CI cluster ever created and provide on the order of 100M
of data back to be included in the overall CI job.
Making this work through steps would require integrating across multiple dimensions and lead to gaps in coverage.

## Motivation

We need to run the following tools to debug our existing install and upgrade failures:
1. e2e monitor - doesn't cover installation
2. resourcewatcher - doesn't cover installation today if we wire it directly into tests.
3. loki log collection (as created by group-b) - hasn't been able to work in upgrades at all.
4. loki as provided by the logging team (future)

### Goals

1. allow a cluster-observing tool to run in the CI cluster (this avoids restarts during upgrades)
2. allow a cluster observer to provide data (including junit) to be collected by the CI job
3. collect data from a cluster observer regardless of whether the job succeeded or failed
4. allow multiple observers per CI cluster

### Non-Goals

1. allow a cluster observer to impact success or failure of a job
2. allow a cluster observer to impact the running test. If it does, that is a failure by the observer author.
3. provide ANY dimension-specific data. If an observer needs this, they need to integrate differently.
4. allow multiple instances of a single cluster observer to run against one CI cluster

## Proposal

The overall goal:
1. have an e2e-observer process that is expected to be running before a kubeconfig exists
2. provide the kubeconfig to the e2e-observer at the earliest possible time, even before it can be used
3. have the e2e-observer process detect the presence of the kubeconfig itself
4. have the e2e-observer process accept a signal indicating that teardown has begun
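
A minimal sketch of this lifecycle contract, assuming (hypothetically) that the observer learns the mounted kubeconfig
path from `$KUBECONFIG`; the env var, poll interval, and structure are illustrative assumptions, not part of this
proposal. The process is already running before the file exists, detects the file on its own, and treats SIGTERM as
the start of teardown.

```go
package main

import (
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Assumed to be mounted from secret/<job>-kubeconfig; may be empty for a long time.
	kubeconfigPath := os.Getenv("KUBECONFIG")

	teardown := make(chan os.Signal, 1)
	signal.Notify(teardown, syscall.SIGTERM)

	// Poll for the kubeconfig; it may appear late or never appear at all.
	kubeconfigReady := make(chan struct{})
	go func() {
		for {
			if info, err := os.Stat(kubeconfigPath); err == nil && info.Size() > 0 {
				close(kubeconfigReady)
				return
			}
			time.Sleep(10 * time.Second)
		}
	}()

	select {
	case <-kubeconfigReady:
		// Observe the cluster (omitted here) until teardown begins.
		<-teardown
	case <-teardown:
		// Teardown began before a kubeconfig ever appeared; still exit cleanly.
	}
	// Flush whatever was gathered before exiting (see the requirements on
	// e2e-observer authors below).
}
```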

### Changes to CI
This is a sketch of a possible path.
1. Allow a non-dptp developer to produce a manifest, outside of any existing resource, that can define
    1. image
    2. process
    3. potentially a bash entrypoint
    4. env vars
   and that mounts a `secret/<job>-kubeconfig` which will later contain a kubeconfig.
   This happens to align neatly with a PodTemplate, but any file not tied to a particular CI dimension can work.
2. Allow a developer to bind configured observer(s) to a job by one of the following mechanisms:
    1. attach the observer(s) to a step, so that any job which runs that step will run the observer
    2. attach the observer(s) to a workflow, so that any job which runs the workflow will run the observer
    3. attach the observer(s) to a literal test configuration, so that an observer could be added in a repo's test stanza
3. Allow a developer to opt out of running named observer(s) by one of the following mechanisms:
    1. opt out of the observer(s) in a workflow, so that any job which runs the workflow will not run the observer
    2. opt out of the observer(s) in a literal test configuration, so that an observer could be excluded in a repo's test stanza
4. Before a kubeconfig is present, create an instance of each binary from #1 and an empty `secret/<job>-kubeconfig`.
5. As soon as a kubeconfig is available (these go into a known location in the setup container today), write that
   `kubeconfig` into every `secret/<job>-kubeconfig` (you probably want to label them; see the sketch after this list).
6. When it is time for collection, the existing pod (I think it's the teardown container) issues a SIGTERM to the process.
7. Ten minutes after teardown begins, something in CI gathers a well-known directory,
   `/var/e2e-observer`, which may contain `/var/e2e-observer/junit` and `/var/e2e-observer/artifacts`. These contents
   are placed in some reasonable spot.

   This could be optimized with a file write in the other direction, but the naive approach doesn't require it.
8. All resources are cleaned up.
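
For step 5, a minimal sketch of what the CI side might do with client-go, under illustrative assumptions: the label
`ci.openshift.io/e2e-observer=true`, the secret key `kubeconfig`, and the function name `distributeKubeconfig` are all
hypothetical, not existing ci-operator API.

```go
package kubeconfigsync

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// distributeKubeconfig copies the freshly written cluster kubeconfig into every
// observer secret in the job's namespace, selected by a hypothetical label.
func distributeKubeconfig(ctx context.Context, buildClusterKubeconfig, namespace, clusterKubeconfigPath string) error {
	kubeconfigBytes, err := os.ReadFile(clusterKubeconfigPath)
	if err != nil {
		return err
	}

	cfg, err := clientcmd.BuildConfigFromFlags("", buildClusterKubeconfig)
	if err != nil {
		return err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}

	// Find every observer kubeconfig secret for this job by label.
	secrets, err := client.CoreV1().Secrets(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "ci.openshift.io/e2e-observer=true",
	})
	if err != nil {
		return err
	}

	for i := range secrets.Items {
		secret := &secrets.Items[i]
		if secret.Data == nil {
			secret.Data = map[string][]byte{}
		}
		secret.Data["kubeconfig"] = kubeconfigBytes
		if _, err := client.CoreV1().Secrets(namespace).Update(ctx, secret, metav1.UpdateOptions{}); err != nil {
			return fmt.Errorf("updating %s: %w", secret.Name, err)
		}
	}
	return nil
}
```

Updating by label keeps the CI side unaware of which observers exist; it only needs to know the labeling convention.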

### Requirements on e2e-observer authors
1. Your pod must be able to run against *every* dimension. No exceptions. If your pod needs to quietly no-op, it can do that.
2. Your pod must handle the case of a kubeconfig file that isn't present when the pod starts and appears afterward,
   or never appears at all.
3. Your pod must be able to tolerate a kubeconfig that doesn't work.
   The kubeconfig may point to a cluster that never comes up or that hasn't come up yet. Your pod must not fail.
4. If your pod fails, it will not fail the e2e job. If you need to fail an e2e job reliably, you need something else.
5. Your pod must terminate when asked.
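
The sketch below, under the same illustrative assumptions as the earlier skeleton, shows the behavior requirements 2-4
imply: every attempt to use the kubeconfig may fail without crashing the pod, and whatever was gathered is written
under the well-known `/var/e2e-observer` layout for CI to collect. `observeOnce` and `writeResults` are hypothetical
names, not a prescribed API.

```go
package observer

import (
	"fmt"
	"os"
	"path/filepath"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// observeOnce makes a single best-effort observation of the cluster.
// A missing or broken kubeconfig is recorded as an observation, not treated as
// a fatal error, so the observer never fails the pod.
func observeOnce(kubeconfigPath string) string {
	if _, err := os.Stat(kubeconfigPath); err != nil {
		return "kubeconfig not present yet"
	}
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfigPath)
	if err != nil {
		return fmt.Sprintf("kubeconfig not usable yet: %v", err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return fmt.Sprintf("could not build client: %v", err)
	}
	version, err := client.Discovery().ServerVersion()
	if err != nil {
		return fmt.Sprintf("cluster not reachable yet: %v", err)
	}
	return fmt.Sprintf("observed server version %s", version.GitVersion)
}

// writeResults leaves everything in the well-known layout CI collects:
// /var/e2e-observer/junit and /var/e2e-observer/artifacts.
func writeResults(outputDir string, lines []string) error {
	if err := os.MkdirAll(filepath.Join(outputDir, "junit"), 0o755); err != nil {
		return err
	}
	if err := os.MkdirAll(filepath.Join(outputDir, "artifacts"), 0o755); err != nil {
		return err
	}
	body := ""
	for _, line := range lines {
		body += line + "\n"
	}
	return os.WriteFile(filepath.Join(outputDir, "artifacts", "observations.txt"), []byte(body), 0o644)
}
```

Recording failures as observations rather than exiting is what lets a single observer image run against every
dimension without ever being able to fail the job.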

## Alternatives

### Modify each type of job
Given the matrix of dimensions, this seems impractical.
Even today, we have a wide variance in the quality of artifacts from different jobs.

### Modify the openshift-tests command
This is easy for some developers, BUT as e2e monitor shows us, it doesn't provide enough information.
We need information from before the test command is run to find many of our problems.
Missing logs and missing intermediate resource states (intermediate operator status) are the most egregious so far.
---

**@wking** commented on Aug 7, 2020:
> The vanilla e2e suite already records intermediate operator status in "Monitor cluster while tests execute". It does not seem like that big a reach to me to ask the installer to also monitor and log operator status, instead of its current "just collect this at the end if the install failed" behavior. And that would also be useful information to have in the log for installs that fail in the wild, which will not benefit from CI-specific monitoring suites.
>
> Log preservation is unlikely to be covered by either the installer or the "Monitor cluster while tests execute" test-case. I think that's a valid use-case for this enhancement. But being able to configure all clusters, CI and otherwise, to be able to stream logs to an external logging aggregator would be nice too, and would cover both CI and real-world use-cases.