Create and activate a virtual environment (example shown for bash-compatible shells):
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e .Optional tooling used by the eval harness (Braintrust uploads and semantic classifier helpers):
python -m pip install braintrust openai autoevalsRUN_LIVE=true \
MODEL=azure/gpt-4.1 \
CLASSIFIER_MODEL=azure/gpt-4o \
AKS_AGENT_RESOURCE_GROUP=<rg> \
AKS_AGENT_CLUSTER=<cluster> \
KUBECONFIG=<path-to-kubeconfig> \
AZURE_API_KEY=<key> \
AZURE_API_BASE=<endpoint> \
AZURE_API_VERSION=<version> \
python -m pytest azext_aks_agent/tests/evals/test_ask_agent.py -k 01_list_all_nodes -m aks_evalPer-scenario overrides (resource_group, cluster_name, kubeconfig, test_env_vars) still apply. Use --skip-setup or --skip-cleanup to bypass hooks. Expect the test to log iteration progress, classifier scores, and (on the final iteration) a Braintrust link when uploads are enabled.
Example output (live run with classifier)
[iteration 1/3] running setup commands for 01_list_all_nodes
[iteration 1/3] invoking AKS Agent CLI for 01_list_all_nodes
[iteration 1/3] classifier score for 01_list_all_nodes: 1
[iteration 2/3] invoking AKS Agent CLI for 01_list_all_nodes
[iteration 2/3] classifier score for 01_list_all_nodes: 1
[iteration 3/3] invoking AKS Agent CLI for 01_list_all_nodes
[iteration 3/3] classifier score for 01_list_all_nodes: 1
...
🔍 Braintrust: https://www.braintrust.dev/app/<org>/p/aks-agent/experiments/aks-agent/...
# Generate fresh mocks from a live run
RUN_LIVE=true GENERATE_MOCKS=true \
MODEL=azure/gpt-4.1 \
AKS_AGENT_RESOURCE_GROUP=<rg> \
AKS_AGENT_CLUSTER=<cluster> \
KUBECONFIG=<path-to-kubeconfig> \
AZURE_API_KEY=<key> \
AZURE_API_BASE=<endpoint> \
AZURE_API_VERSION=<version> \
python -m pytest azext_aks_agent/tests/evals/test_ask_agent.py -k 02_list_clusters -m aks_eval
# Re-run offline using the recorded response
python -m pytest azext_aks_agent/tests/evals/test_ask_agent.py -k 02_list_clusters -m aks_evalIf a mock is missing, pytest skips the scenario with instructions to regenerate it.
Regression guardrails
- Mocked answers make iterations deterministic, so you can update parsing or prompts without waiting on live infrastructure.
- If you check in a new mock after behavior changes, reviewers see the exact diff in
mocks/response.txt, making regressions obvious. - CI can run
RUN_LIVEoff by default, catching logical regressions early without needing cluster credentials.
Example skip (no mock present)
azext_aks_agent/tests/evals/test_ask_agent.py::test_ask_agent_live[02_list_clusters]
SKIPPED: Mock response missing for scenario 02_list_clusters; rerun with RUN_LIVE=true GENERATE_MOCKS=true
Set the following environment variables to push results:
BRAINTRUST_API_KEYandBRAINTRUST_ORG(required)- Optional overrides:
BRAINTRUST_PROJECT(defaultaks-agent),BRAINTRUST_DATASET(defaultaks-agent/ask),EXPERIMENT_ID
Each iteration logs to Braintrust; the console prints a clickable link (for aware terminals) when uploads succeed.
Tips
- Leave
EXPERIMENT_IDunset to generate a fresh experiment name each run (aks-agent/<model>/<run-id>). - Use
BRAINTRUST_RUN_ID=<custom>if you want deterministic experiment names across retries. - The upload payload includes classifier score, rationale, raw CLI output, cluster, and resource group metadata for later filtering.
-
Enabled by default; set
ENABLE_CLASSIFIER=falseto opt out. -
Requires Azure OpenAI credentials:
AZURE_API_BASE,AZURE_API_KEY,AZURE_API_VERSION, and a classifier deployment specified viaCLASSIFIER_MODEL(e.g.azure/<deployment>). Defaults to the same deployment asMODELwhen not provided. -
Install classifier dependencies when online (see Environment Setup above if not already installed).
-
Scenarios can override the grading style by adding:
evaluation: correctness: type: loose # or strict (default)
Classifier scores and rationales are attached to Braintrust uploads and printed in the pytest output metadata.
Debugging classifiers
python -m pytest ... -o log_cli=true -o log_cli_level=DEBUG -s
Look for classifier score ... lines to confirm the semantic judge executed.
ITERATIONS=<n>repeats every scenario, useful for non-deterministic models.- Filter suites with pytest markers:
-m aks_eval,-m easy,-m medium, etc.
- Missing mocks: rerun with
RUN_LIVE=true GENERATE_MOCKS=true. - Cleanup always executes unless
--skip-cleanupis provided; check the[cleanup]log line. - Braintrust disabled messages mean credentials or the SDK are missing.
- Classifier disabled messages usually indicate missing Azure settings (
AZURE_API_BASE,AZURE_API_KEY,AZURE_API_VERSION).
- Install dependencies inside a virtual environment (
python -m pip install -e .) and, if needed, the optional tooling (python -m pip install braintrust openai autoevals). RUN_LIVE=true: set Azure creds (AZURE_API_KEY,AZURE_API_BASE,AZURE_API_VERSION),MODEL, kubeconfig, and optional Braintrust vars.RUN_LIVEunset/false: ensure each scenario directory hasmocks/response.txt.- Classifier overrides:
CLASSIFIER_MODEL(defaults toMODEL) and per-scenarioevaluation.correctness.type. - Optional:
BRAINTRUST_RUN_ID=<identifier>to reuse experiment names across retries.