feat: error message annotation step #4385

nv-nmailhot · 2025-11-17T00:56:00Z

Overview

This PR enhances the GitHub Actions workflow error reporting system by implementing intelligent error extraction using Salesforce LogAI and creating rich GitHub annotations with detailed diagnostic information. The changes ensure that deployment and test failures are immediately visible with actionable context, while also addressing security vulnerabilities in JSON payload construction.

Key Improvements:

🔍 Intelligent Error Detection: LogAI-powered error extraction with fallback to regex patterns
📍 Rich Annotations: GitHub Check Runs API integration with detailed error messages, pod status, and Kubernetes events

Details

1. Error Annotation System (`container-validation-backends.yml`)

What Changed:

Added multi-step error handling workflow for deploy-operator and deploy-test-* jobs
Each deployment job now has 5 steps when failures occur:
1. Main deployment step (with continue-on-error: true)
2. Setup Python for Log Analysis
3. Install LogAI
4. Extract Errors from Logs using LogAI
5. Check for Job Failure and Create Annotation (renamed from "Create GitHub Annotation on Failure")

Error Handling Flow:

1. Main step captures ERROR_MESSAGE on failure
2. LogAI analyzes log files (deploy-operator.log, test-output.log)
3. Script combines LogAI extraction + manual error captures + Kubernetes diagnostics
4. Script creates a GitHub Check Run via the API with rich annotations
5. Script creates an ::error workflow command annotation
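
For reviewers, the last two steps of this flow can be sketched in Python as follows. This is a minimal sketch, not the code in this PR: the function names (`build_check_run_payload`, `error_command`, `escape_data`, `escape_property`) and the check-run title format are assumptions made for illustration. It builds the request body for the Check Runs endpoint (`POST /repos/{owner}/{repo}/check-runs`) with `json.dumps` rather than string interpolation, so quotes and newlines in the error text cannot break the JSON, and it percent-escapes the `::error` command so multi-line messages survive.

```python
import json


def escape_data(value: str) -> str:
    """Escape the message part of a workflow command ('%', CR, LF)."""
    return value.replace("%", "%25").replace("\r", "%0D").replace("\n", "%0A")


def escape_property(value: str) -> str:
    """Escape a workflow-command property value (also ':' and ',')."""
    return escape_data(value).replace(":", "%3A").replace(",", "%2C")


def build_check_run_payload(job_name, head_sha, error_message, workflow_path, line):
    """Return the JSON body for a completed, failed check run with one annotation.

    All argument names are hypothetical; only the payload field names come
    from the Check Runs REST API.
    """
    payload = {
        "name": f"{job_name} failure",
        "head_sha": head_sha,
        "status": "completed",
        "conclusion": "failure",
        "output": {
            "title": f"{job_name} failed",
            "summary": error_message,
            "annotations": [
                {
                    "path": workflow_path,
                    "start_line": line,
                    "end_line": line,
                    "annotation_level": "failure",
                    "message": error_message,
                }
            ],
        },
    }
    # json.dumps handles quoting and newlines, unlike hand-built JSON strings.
    return json.dumps(payload)


def error_command(title: str, message: str) -> str:
    """Build the ::error workflow command line to print to stdout."""
    return f"::error title={escape_property(title)}::{escape_data(message)}"
```

Printing the result of `error_command(...)` from a step creates the workflow-level annotation; the JSON body would be sent with an authenticated POST, which is omitted here.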

2. LogAI Error Extraction Script (`extract_log_errors.py`)

What Changed:

Robust LogAI import handling with graceful fallback to regex
Enhanced regex patterns for Kubernetes-specific errors:
- timed out waiting for the condition
- no matching resources found
- Failed to pull image
- CrashLoopBackOff
Error explanations dictionary with actionable insights
Proper formatting for multi-line error messages
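
The fallback path described above can be sketched roughly as follows. This is a hedged illustration, not the contents of `extract_log_errors.py`: the names `LOGAI_AVAILABLE`, `K8S_ERROR_PATTERNS`, and `extract_with_regex` are assumptions, and apart from the timeout explanation (taken from the example output in this description) the explanation strings are illustrative wording.

```python
import importlib.util
import re

# True only when the optional `logai` package is importable; checking the
# spec avoids importing the package just to test for its presence.
LOGAI_AVAILABLE = importlib.util.find_spec("logai") is not None

# Pattern -> explanation. The four patterns are the ones listed in this PR;
# the mapping structure itself is an assumption.
K8S_ERROR_PATTERNS = {
    r"timed out waiting for the condition":
        "Pods failed to become ready within the timeout period.",
    r"no matching resources found":
        "The selector or resource name matched nothing in the cluster.",
    r"Failed to pull image":
        "The container image could not be pulled.",
    r"CrashLoopBackOff":
        "A container is repeatedly crashing and being restarted.",
}


def extract_with_regex(log_text):
    """Return one finding per distinct log line that matches a known pattern."""
    findings = []
    seen = set()
    for line_number, raw_line in enumerate(log_text.splitlines(), start=1):
        line = raw_line.strip()
        if not line or line in seen:
            continue
        for pattern, explanation in K8S_ERROR_PATTERNS.items():
            if re.search(pattern, line, re.IGNORECASE):
                seen.add(line)  # deduplicate identical error lines
                findings.append(
                    {"line": line_number, "message": line, "explanation": explanation}
                )
                break  # first matching pattern wins for this line
    return findings
```

A caller would try the LogAI path only when `LOGAI_AVAILABLE` is true and use `extract_with_regex` otherwise.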

Example Output:

Primary Error: timed out waiting for the condition on pods/sglang-agg-0-frontend-t2sb8

💡 Explanation: This indicates pods failed to become ready within the timeout period.
Common causes: insufficient resources, image pull errors, container crashes.

Context:
+ kubectl wait --for=condition=ready pod...
pod/sglang-agg-0-decode-nm947 condition met
error: timed out waiting for the condition on pods/sglang-agg-0-frontend-t2sb8

Where Should the Reviewer Start?

1️⃣ Start with the Error Extraction Script (`extract_log_errors.py`)

Review the LogAI integration logic (lines 13-27)
Check the enhanced regex patterns and error explanations (lines 35-120)
Understand the fallback mechanism if LogAI is unavailable

2️⃣ Review the Annotation Workflow (`container-validation-backends.yml`)

Look at the deploy-operator job's error handling flow (lines 634-735)
Focus on the "Check for Job Failure and Create Annotation" step (lines 673-735)
Note how it combines LogAI + manual errors + Kubernetes diagnostics

🔍 Testing Recommendations:

Trigger a failure in deploy-operator and verify:
- LogAI extraction runs
- GitHub annotation appears with pod diagnostics
- Check Run is created in the Checks tab

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

closes GitHub issue: #xxx

Summary by CodeRabbit

New Features
- Enhanced failure diagnostics in deployment workflows with detailed error reporting, structured logging, and GitHub annotations pointing to specific issues.
Chores
- Improved CI/CD pipeline observability with comprehensive logging and error extraction infrastructure for better troubleshooting.

coderabbitai · 2025-11-17T01:09:57Z

Walkthrough

Introduces comprehensive error extraction and observability infrastructure: a new Python script for extracting errors from logs using optional LogAI integration with regex fallback, a workflow for testing LogAI installation, and extensive hardening of the container validation workflow with runtime logging, diagnostics capture, and GitHub annotations.

Changes

| Cohort / File(s) | Change Summary |
|---|---|
| CI/CD Infrastructure & Trigger `.github/.trigger`, `.github/workflows/test-logai.yml` | Added a trigger file to force workflow runs and introduced a new workflow for validating LogAI installation, including Python setup, package inspection, import verification, error extraction script testing, and fallback mode simulation. |
| Error Extraction & Logging `.github/scripts/extract_log_errors.py` | New Python script implementing `LogErrorExtractor` class that extracts errors from logs via optional LogAI integration (with dynamic imports and component checks) and falls back to regex-based extraction; includes deduplication, sorting, CLI entry point with JSON output support, and human-readable summaries. |
| Container Validation Workflow Enhancement `.github/workflows/container-validation-backends.yml` | Extended workflow with runtime hardening across deploy-operator and test phases: redirects step output to log files, validates Helm availability, captures diagnostics on failures (pods, events, helm status), integrates error extraction via the new script, and adds GitHub annotations with detailed error reporting via REST API; replaces conditional gating with unconditional execution (TODO-annotated for restoration). |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

extract_log_errors.py: New script with dense logic including LogAI integration with component checks, dynamic imports, comprehensive regex-based fallback with multiple error patterns (general and Kubernetes-specific), deduplication/sorting, and dual output modes (JSON/human-readable).
container-validation-backends.yml: Substantial workflow modifications with multiple nested error handling flows, diagnostics capture patterns, environment variable propagation, GitHub annotation construction, and REST API integration across deploy and test phases; coordination between multiple layers of logging and error extraction requires careful cross-step validation.
Integration points: Verify correct interaction between the error extraction script and workflow steps, especially environment variable passing, annotation payload formatting, and fallback execution paths.

Poem

🐰 A script hops in with log-sniffing might,
With LogAI grace and regex-fallback light;
The workflows now catch each stumble and slip,
With diagnostics captured on each failed trip,
Annotations guide us when deployments don't fly—
Better observability makes errors say "hi!" 🐇✨

Pre-merge checks

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The pull request description largely follows the template structure with Overview, Details, Where should the reviewer start, and Related Issues sections, but the Related Issues section is incomplete. | Replace 'closes GitHub issue: #xxx' with an actual GitHub issue number or remove the placeholder if no related issue exists. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'feat: error message annotation step' is specific and directly related to the main objective of adding error message annotation functionality, matching the substantial workflow and script changes in the PR. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%. |


coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

.github/workflows/container-validation-backends.yml (1)
827-843: Fix the /v1/models jq filter to check if any model matches, not just the last

The check .data[]?.id == $MODEL_NAME with jq -e only succeeds when the last model's id matches. Since jq -e bases its exit code on the final output value, only the last boolean matters. If the matching model appears anywhere else in the list, the check fails.

Replace with:
if echo "$MODELS_RESPONSE" | jq -e --arg MODEL_NAME "$MODEL_NAME" \
  'any(.data[]?; .id == $MODEL_NAME)' >/dev/null 2>&1; then
This is the idiomatic jq pattern for "check if any element matches a condition."
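
To make the difference concrete, here is a small Python sketch of the two behaviours. It is an illustration only: `last_match` mimics how `jq -e` judges success from the final output value of `.data[]?.id == $MODEL_NAME`, `any_match` mirrors the suggested `any(...)` filter, and both function names are hypothetical.

```python
def last_match(models_response, model_name):
    """Mimic `jq -e '.data[]?.id == $MODEL_NAME'`.

    The filter emits one boolean per model, and `jq -e` derives its exit
    status from the last emitted value only (no output counts as failure).
    """
    results = [m.get("id") == model_name for m in models_response.get("data", [])]
    return bool(results) and results[-1]


def any_match(models_response, model_name):
    """Mimic `jq -e 'any(.data[]?; .id == $MODEL_NAME)'`."""
    return any(m.get("id") == model_name for m in models_response.get("data", []))


response = {"data": [{"id": "model-a"}, {"id": "model-b"}]}
# "model-a" is served, but it is not the last entry in the list.
print(last_match(response, "model-a"), any_match(response, "model-a"))
```

With the original filter, the health check would report `model-a` as missing even though the endpoint serves it.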

🧹 Nitpick comments (4)

.github/scripts/extract_log_errors.py (3)
37-73: Optionally mark class attributes as ClassVar to satisfy RUF012

The mutable class attributes (ERROR_PATTERNS, ERROR_EXPLANATIONS, CONTEXT_PATTERNS) are effectively constants and flagged by Ruff RUF012. You can make that intent explicit:
-from typing import List, Dict, Any
+from typing import Any, ClassVar, Dict, List
@@
-    ERROR_PATTERNS = [
+    ERROR_PATTERNS: ClassVar[List[str]] = [
@@
-    ERROR_EXPLANATIONS = {
+    ERROR_EXPLANATIONS: ClassVar[Dict[str, str]] = {
@@
-    CONTEXT_PATTERNS = [
+    CONTEXT_PATTERNS: ClassVar[List[str]] = [
This is mostly stylistic but keeps linters quiet and documents that instances shouldn’t mutate these.

101-104: Use a secure temporary file instead of a fixed /tmp/analysis.log

Hard-coding /tmp/analysis.log can trip security scanners (S108) and risks collisions if multiple runs overlap.

Consider using tempfile.NamedTemporaryFile:
+import tempfile
@@
-            # Write log content to a temporary file for LogAI processing
-            temp_log = Path("/tmp/analysis.log")
-            temp_log.write_text(self.log_content)
+            # Write log content to a secure temporary file for LogAI processing
+            with tempfile.NamedTemporaryFile("w+", delete=False) as temp_log:
+                temp_log.write(self.log_content)
+                temp_log_path = temp_log.name
@@
-            logrecord = dataloader.load_data(str(temp_log))
+            logrecord = dataloader.load_data(temp_log_path)
Optionally, you can clean up the temp file after parsing if FileDataLoader doesn’t need it afterward.
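
If cleanup is wanted, one possible shape is sketched below. It assumes the loader only needs the file while it is being parsed; `with_temp_log` and its `process` callback are hypothetical names, with `process` standing in for a call such as `dataloader.load_data`.

```python
import os
import tempfile


def with_temp_log(log_content, process):
    """Write log_content to a private temp file, run process(path), then delete it."""
    # delete=False lets the file be reopened by path after this handle closes.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".log", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write(log_content)
        path = tmp.name
    try:
        return process(path)
    finally:
        # Remove the file even if process() raises.
        os.unlink(path)
```

The `try`/`finally` keeps `/tmp` clean on both the success path and the regex-fallback path.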

136-139: Broad except Exception hides unexpected failures

Catching Exception here guarantees fallback behavior, but it also hides non-LogAI issues (e.g., coding bugs).

If you know which exceptions LogAI can raise, prefer limiting this to those; otherwise, at least log a bit more context (e.g., type + repr of the exception) so it’s easier to debug when something truly unexpected happens.
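
A hedged sketch of that suggestion follows. The exception tuple is a placeholder, since the exceptions LogAI actually raises are not established in this review, and `extract_with_fallback`, `primary`, and `fallback` are hypothetical names.

```python
import logging

logger = logging.getLogger("extract_log_errors")

# Placeholder tuple: replace with the exceptions LogAI is known to raise.
EXPECTED_LOGAI_ERRORS = (ImportError, ValueError, OSError)


def extract_with_fallback(primary, fallback, expected=EXPECTED_LOGAI_ERRORS):
    """Run primary(); on an expected failure, log it and return fallback()."""
    try:
        return primary()
    except expected as exc:
        # Log type and repr so the cause of the fallback is visible in CI output.
        logger.warning(
            "LogAI extraction failed (%s: %r); falling back to regex",
            type(exc).__name__,
            exc,
        )
        return fallback()
```

Exceptions outside the tuple, such as a `KeyError` from a coding bug, propagate instead of being silently converted into a fallback run.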
.github/workflows/container-validation-backends.yml (1)
634-646: Update actions/setup-python to current major (v6) for log analysis steps

The current recommended major version of actions/setup-python for GitHub Actions workflows is v6. The log-analysis steps currently use actions/setup-python@v4, which should be updated:

Lines 634–639: Setup Python for Log Analysis in deploy-operator.

Lines 889–894: Setup Python for Log Analysis in deploy-test-*.
-    - name: Setup Python for Log Analysis
-      if: always() && steps.deploy-operator-step.outcome == 'failure'
-      uses: actions/setup-python@v4
+    - name: Setup Python for Log Analysis
+      if: always() && steps.deploy-operator-step.outcome == 'failure'
+      uses: actions/setup-python@v6
       with:
         python-version: '3.10'
(and similarly for the deploy-test job).

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ed8cd59 and 8704fef.

📒 Files selected for processing (4)

.github/.trigger (1 hunks)
.github/scripts/extract_log_errors.py (1 hunks)
.github/workflows/container-validation-backends.yml (14 hunks)
.github/workflows/test-logai.yml (1 hunks)

🧰 Additional context used

🪛 actionlint (1.7.8)

.github/workflows/test-logai.yml

21-21: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

.github/workflows/container-validation-backends.yml

636-636: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

743-743: label "cpu-amd-m5-2xlarge" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)

891-891: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4196/merge) by nv-nmailhot.

.github/workflows/test-logai.yml

[error] 1-1: Trailing whitespace detected and fixed by pre-commit in this file.

.github/workflows/container-validation-backends.yml

[error] 1-1: Trailing whitespace detected and fixed by pre-commit in this file.

.github/scripts/extract_log_errors.py

[error] 1-1: pre-commit: isort and black hooks modified this file. Run git commit to apply changes.

[error] 15-15: Ruff: F401 'logai' imported but unused.

[error] 22-22: Ruff: F401 'logai.information_extraction.log_parser.LogParser' imported but unused.

[error] 23-23: Ruff: F401 'logai.preprocess.preprocessor.Preprocessor' imported but unused.

[error] 1-1: Check-shebang-scripts-are-executable: Script has a shebang but is not executable. Run 'chmod +x .github/scripts/extract_log_errors.py'.

[error] 1-1: Trailing whitespace: fix trailing spaces in the file (or rerun pre-commit to auto-fix).

[error] 1-1: pre-commit: trailing-whitespace, ruff, and other hooks reported failures.

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4385/merge) by nv-nmailhot.

.github/workflows/container-validation-backends.yml

[error] 1-1: trailing-whitespace: Fixing trailing whitespace in container-validation-backends.yml

.github/scripts/extract_log_errors.py

[error] 1-1: pre-commit: isort modified this file during commit (Fixing /home/runner/work/dynamo/dynamo/.github/scripts/extract_log_errors.py)

[error] 1-1: pre-commit: black reformatted this file (reformatted .github/scripts/extract_log_errors.py)

[error] 15-15: ruff: F401 'logai' imported but unused

[error] 22-22: ruff: F401 'logai.information_extraction.log_parser.LogParser' imported but unused

[error] 1-1: check-shebang-scripts-are-executable: .github/scripts/extract_log_errors.py has a shebang but is not marked executable

🪛 GitHub Check: deploy-test-sglang (agg_router)

.github/workflows/container-validation-backends.yml

[failure] 626-626: Deploy Test Failed (LogAI): sglang (agg_router)

[Line 70] timed out waiting for the condition

🪛 GitHub Check: deploy-test-sglang (agg)

.github/workflows/container-validation-backends.yml

[failure] 626-626: Deploy Test Failed (LogAI): sglang (agg)

[Line 66] waiting for the condition on pods/sglang-agg-0-decode-x56vh

🪛 GitHub Check: deploy-test-trtllm (agg_router)

.github/workflows/container-validation-backends.yml

[failure] 626-626: Deploy Test Failed (LogAI): trtllm (agg_router)

[Line 66] waiting for the condition on pods/trtllm-agg-router-0-frontend-8l6gv

🪛 GitHub Check: deploy-test-trtllm (agg)

.github/workflows/container-validation-backends.yml

[failure] 626-626: Deploy Test Failed (LogAI): trtllm (agg)

[Line 63] timed out waiting for the condition

🪛 GitHub Check: deploy-test-trtllm (disagg_router)

.github/workflows/container-validation-backends.yml

[failure] 626-626: Deploy Test Failed (LogAI): trtllm (disagg_router)

[Line 95] waiting for the condition on pods/trtllm-v1-disagg-router-0-trtllmdecodeworker-m7hwl

🪛 GitHub Check: deploy-test-vllm (agg_router)

.github/workflows/container-validation-backends.yml

[failure] 626-626: Deploy Test Failed (LogAI): vllm (agg_router)

[Line 61] waiting for the condition on pods/vllm-agg-router-0-frontend-2qmwz

🪛 GitHub Check: deploy-test-vllm (agg)

.github/workflows/container-validation-backends.yml

[failure] 626-626: Deploy Test Failed (LogAI): vllm (agg)

[Line 59] timed out waiting for the condition

🪛 GitHub Check: deploy-test-vllm (disagg_router)

.github/workflows/container-validation-backends.yml

[failure] 626-626: Deploy Test Failed (LogAI): vllm (disagg_router)

[Line 166] creating error stream for port 8000 -> 8000: Timeout occurred"

🪛 Ruff (0.14.4)

.github/scripts/extract_log_errors.py

1-1: Shebang is present but file is not executable

(EXE001)

38-57: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

60-65: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

68-73: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

102-102: Probable insecure usage of temporary file or directory: "/tmp/analysis.log"

(S108)

136-136: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)

GitHub Check: vllm (amd64)
GitHub Check: sglang (amd64)
GitHub Check: operator (amd64)
GitHub Check: trtllm (amd64)
GitHub Check: Build and Test - dynamo

🔇 Additional comments (3)

.github/.trigger (1)

1-3: Trigger file looks fine

This is a harmless, self-documenting trigger file; no functional concerns from a CI perspective.
.github/workflows/test-logai.yml (1)
20-23: Update actions/setup-python to v6

actions/setup-python@v4 is outdated. Use the current recommended major version:
      - name: Setup Python
-       uses: actions/setup-python@v4
+       uses: actions/setup-python@v6
        with:
          python-version: '3.10'
The current recommended major version is v6, referenced by the major tag (not a branch/commit) to receive safe non-breaking updates within that major version.

Likely an incorrect or invalid review comment.
.github/scripts/extract_log_errors.py (1)

13-31: Suggested fix is correct, but review justification conflicts with actual configuration

The proposed simplification of the LogAI import block is sound for code cleanliness, but the stated reason requires clarification. F401 is explicitly in the ignore list in ruff.toml, so F401 violations are globally suppressed and not causing pipeline failures.

However, the unused top-level imports at lines 21–22 are genuinely unused (re-imported inside extract_with_logai() at lines 91–92 where they're actually used), and removing them aligns with best practices for optional dependency handling.

Additionally, the file has a shebang but is not executable (-rw-r--r--), which triggers the EXE rule. If CI validates executability via pre-commit, this could be a separate failure point.

The simplified structure (marking only import logai with # noqa: F401) is the correct approach—verify pre-commit passes after applying this fix and confirm the executable bit is corrected if needed.


nv-nmailhot added 9 commits November 7, 2025 15:09

add error message propogation

959ec2b

extract error with ai

11e20fd

always run deploy test to test temporarily

ada3992

change rules to test

38cc68d

force trigger always

4e01be8

test log ai

50926d8

logai fixes

556308d

Merge branch 'main' into nmailhot/ai-logs

a591a54

add more error identifiers

8704fef

nv-nmailhot requested a review from a team as a code owner November 17, 2025 00:56

pull-request-size bot added the size/XL label Nov 17, 2025

github-actions bot added the feat label Nov 17, 2025

coderabbitai bot reviewed Nov 17, 2025


nv-nmailhot added 4 commits November 17, 2025 09:39

remove unneeded files

9cf62b9

change step name and revert temporary testing changes

810dc25

safely encode error message

d8b1601

fix precommit issues

55a2966

copy-pr-bot bot temporarily deployed to GITLAB November 17, 2025 17:50 Inactive

minor cleanup

b87a752

copy-pr-bot bot temporarily deployed to GITLAB November 17, 2025 17:51 Inactive

nv-nmailhot requested review from nv-anants and pvijayakrish November 17, 2025 17:52

copy-pr-bot bot temporarily deployed to GITLAB November 17, 2025 17:57 Inactive

Merge branch 'main' into nmailhot/final-error-log

a2c2469

copy-pr-bot bot temporarily deployed to GITLAB November 21, 2025 01:06 Inactive

copy-pr-bot bot temporarily deployed to GITLAB November 21, 2025 01:07 Inactive

Merge branch 'main' into nmailhot/final-error-log

c147af1

copy-pr-bot bot had a problem deploying to GITLAB December 1, 2025 20:26 Failure
