@nv-nmailhot nv-nmailhot commented Nov 17, 2025

Overview

This PR improves error reporting in the GitHub Actions workflows: it extracts errors from logs with Salesforce LogAI and creates GitHub annotations that carry detailed diagnostic information. Deployment and test failures become visible immediately, with actionable context. The changes also address security vulnerabilities in JSON payload construction.

Key Improvements:

  • 🔍 Intelligent Error Detection: LogAI-powered error extraction with fallback to regex patterns
  • 📍 Rich Annotations: GitHub Check Runs API integration with detailed error messages, pod status, and Kubernetes events

Details

1. Error Annotation System (container-validation-backends.yml)

What Changed:

  • Added multi-step error handling workflow for deploy-operator and deploy-test-* jobs
  • Each deployment job now has 5 steps when failures occur:
    1. Main deployment step (with continue-on-error: true)
    2. Setup Python for Log Analysis
    3. Install LogAI
    4. Extract Errors from Logs using LogAI
    5. Check for Job Failure and Create Annotation (renamed from "Create GitHub Annotation on Failure")

Error Handling Flow:

  • Main step captures ERROR_MESSAGE on failure
  • LogAI analyzes log files (deploy-operator.log, test-output.log)
  • Script combines LogAI extraction + manual error captures + Kubernetes diagnostics
  • Creates GitHub Check Run via API with rich annotations
  • Creates ::error workflow command annotation
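
A minimal sketch of the last two steps of this flow, written in Python for illustration only. The function names (`build_check_run_payload`, `error_command`, and the two escape helpers) are hypothetical, and the workflow in this PR may construct the request differently. The payload fields follow the GitHub Check Runs REST API; serializing with `json.dumps` instead of interpolating log text into a string is one way to keep quotes and newlines in an error message from corrupting the JSON body.

```python
import json


def build_check_run_payload(name, head_sha, path, line, title, message):
    """Return the JSON body for POST /repos/{owner}/{repo}/check-runs."""
    payload = {
        "name": name,
        "head_sha": head_sha,
        "status": "completed",
        "conclusion": "failure",
        "output": {
            "title": title,
            # Use the first line of the message as a short summary.
            "summary": message.splitlines()[0] if message else title,
            "annotations": [
                {
                    "path": path,
                    "start_line": line,
                    "end_line": line,
                    "annotation_level": "failure",
                    "title": title,
                    "message": message,
                }
            ],
        },
    }
    # json.dumps escapes quotes, backslashes, and newlines in the log text.
    return json.dumps(payload)


def escape_data(value):
    """Escape the message part of a workflow command."""
    return value.replace("%", "%25").replace("\r", "%0D").replace("\n", "%0A")


def escape_property(value):
    """Escape a property value (file, title) of a workflow command."""
    return escape_data(value).replace(":", "%3A").replace(",", "%2C")


def error_command(path, line, title, message):
    """Format an ::error workflow command that GitHub renders as an annotation."""
    return (
        f"::error file={escape_property(path)},line={line},"
        f"title={escape_property(title)}::{escape_data(message)}"
    )
```

Printing the result of `error_command(...)` to standard output inside a step is enough for the runner to pick it up; the Check Run body would still have to be sent with an authenticated HTTP request, which this sketch leaves out.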

2. LogAI Error Extraction Script (extract_log_errors.py)

What Changed:

  • Robust LogAI import handling with graceful fallback to regex
  • Enhanced regex patterns for Kubernetes-specific errors:
    • timed out waiting for the condition
    • no matching resources found
    • Failed to pull image
    • CrashLoopBackOff
  • Error explanations dictionary with actionable insights
  • Proper formatting for multi-line error messages

Example Output:

Primary Error: timed out waiting for the condition on pods/sglang-agg-0-frontend-t2sb8

💡 Explanation: This indicates pods failed to become ready within the timeout period.
Common causes: insufficient resources, image pull errors, container crashes.

Context:
+ kubectl wait --for=condition=ready pod...
pod/sglang-agg-0-decode-nm947 condition met
error: timed out waiting for the condition on pods/sglang-agg-0-frontend-t2sb8
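
The regex fallback described above can be sketched roughly as follows. This is not the code from `extract_log_errors.py`: the names `LOGAI_AVAILABLE`, `K8S_PATTERNS`, and `extract_errors_regex` are hypothetical, the explanation strings other than the timeout one are illustrative wording, and the LogAI path itself is omitted because its API usage is not shown in this description.

```python
import re

try:
    import logai  # noqa: F401  (optional; only its presence is checked here)

    LOGAI_AVAILABLE = True
except ImportError:
    LOGAI_AVAILABLE = False

# Each entry pairs a Kubernetes-specific pattern with a short explanation.
K8S_PATTERNS = [
    (
        re.compile(r"timed out waiting for the condition", re.IGNORECASE),
        "Pods failed to become ready within the timeout period.",
    ),
    (
        re.compile(r"no matching resources found", re.IGNORECASE),
        "The selector passed to kubectl matched no resources.",
    ),
    (
        re.compile(r"Failed to pull image", re.IGNORECASE),
        "The image name, tag, or registry credentials may be wrong.",
    ),
    (
        re.compile(r"CrashLoopBackOff"),
        "A container keeps crashing shortly after it starts.",
    ),
]


def extract_errors_regex(log_text):
    """Return deduplicated matches as dicts with line, message, explanation."""
    seen = set()
    results = []
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        for pattern, explanation in K8S_PATTERNS:
            if pattern.search(line):
                message = line.strip()
                if message not in seen:  # keep the first occurrence only
                    seen.add(message)
                    results.append(
                        {"line": lineno, "message": message, "explanation": explanation}
                    )
                break  # one match per log line is enough
    return results
```

A caller would try the LogAI path only when `LOGAI_AVAILABLE` is true and use `extract_errors_regex` otherwise, so the script still produces output on runners where LogAI failed to install.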

Where Should the Reviewer Start?

1️⃣ Start with the Error Extraction Script (extract_log_errors.py)

  • Review the LogAI integration logic (lines 13-27)
  • Check the enhanced regex patterns and error explanations (lines 35-120)
  • Understand the fallback mechanism if LogAI is unavailable

2️⃣ Review the Annotation Workflow (container-validation-backends.yml)

  • Look at the deploy-operator job's error handling flow (lines 634-735)
  • Focus on the "Check for Job Failure and Create Annotation" step (lines 673-735)
  • Note how it combines LogAI + manual errors + Kubernetes diagnostics

🔍 Testing Recommendations:

  1. Trigger a failure in deploy-operator and verify:
    • LogAI extraction runs
    • GitHub annotation appears with pod diagnostics
    • Check Run is created in the Checks tab

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features

    • Enhanced failure diagnostics in deployment workflows with detailed error reporting, structured logging, and GitHub annotations pointing to specific issues.
  • Chores

    • Improved CI/CD pipeline observability with comprehensive logging and error extraction infrastructure for better troubleshooting.

@nv-nmailhot nv-nmailhot requested a review from a team as a code owner November 17, 2025 00:56
@github-actions github-actions bot added the feat label Nov 17, 2025

coderabbitai bot commented Nov 17, 2025

Walkthrough

Introduces comprehensive error extraction and observability infrastructure: a new Python script for extracting errors from logs using optional LogAI integration with regex fallback, a workflow for testing LogAI installation, and extensive hardening of the container validation workflow with runtime logging, diagnostics capture, and GitHub annotations.

Changes

| Cohort / File(s) | Change Summary |
|---|---|
| CI/CD Infrastructure & Trigger: .github/.trigger, .github/workflows/test-logai.yml | Added a trigger file to force workflow runs and introduced a new workflow for validating LogAI installation, including Python setup, package inspection, import verification, error extraction script testing, and fallback mode simulation. |
| Error Extraction & Logging: .github/scripts/extract_log_errors.py | New Python script implementing LogErrorExtractor class that extracts errors from logs via optional LogAI integration (with dynamic imports and component checks) and falls back to regex-based extraction; includes deduplication, sorting, CLI entry point with JSON output support, and human-readable summaries. |
| Container Validation Workflow Enhancement: .github/workflows/container-validation-backends.yml | Extended workflow with runtime hardening across deploy-operator and test phases: redirects step output to log files, validates Helm availability, captures diagnostics on failures (pods, events, helm status), integrates error extraction via the new script, and adds GitHub annotations with detailed error reporting via REST API; replaces conditional gating with unconditional execution (TODO-annotated for restoration). |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • extract_log_errors.py: New script with dense logic including LogAI integration with component checks, dynamic imports, comprehensive regex-based fallback with multiple error patterns (general and Kubernetes-specific), deduplication/sorting, and dual output modes (JSON/human-readable).
  • container-validation-backends.yml: Substantial workflow modifications with multiple nested error handling flows, diagnostics capture patterns, environment variable propagation, GitHub annotation construction, and REST API integration across deploy and test phases; coordination between multiple layers of logging and error extraction requires careful cross-step validation.
  • Integration points: Verify correct interaction between the error extraction script and workflow steps, especially environment variable passing, annotation payload formatting, and fallback execution paths.

Poem

🐰 A script hops in with log-sniffing might,
With LogAI grace and regex-fallback light;
The workflows now catch each stumble and slip,
With diagnostics captured on each failed trip,
Annotations guide us when deployments don't fly—
Better observability makes errors say "hi!" 🐇✨

Pre-merge checks

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The pull request description largely follows the template structure with Overview, Details, Where should the reviewer start, and Related Issues sections, but the Related Issues section is incomplete. | Replace 'closes GitHub issue: #xxx' with an actual GitHub issue number or remove the placeholder if no related issue exists. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'feat: error message annotation step' is specific and directly related to the main objective of adding error message annotation functionality, matching the substantial workflow and script changes in the PR. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/container-validation-backends.yml (1)

827-843: Fix the /v1/models jq filter to check if any model matches, not just the last

The check .data[]?.id == $MODEL_NAME with jq -e only succeeds when the last model's id matches. Since jq -e bases its exit code on the final output value, only the last boolean matters. If the matching model appears anywhere else in the list, the check fails.

Replace with:

if echo "$MODELS_RESPONSE" | jq -e --arg MODEL_NAME "$MODEL_NAME" \
  'any(.data[]?; .id == $MODEL_NAME)' >/dev/null 2>&1; then

This is the idiomatic jq pattern for "check if any element matches a condition."
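
The difference between the two filters can be modeled in a few lines of Python. This is only an analogy for the jq behavior described above, not code from the workflow, and both function names are hypothetical: `last_value_check` mirrors how `jq -e` judges a stream of booleans by its final value, while `any_check` mirrors `any(.data[]?; .id == $MODEL_NAME)`.

```python
def last_value_check(models, name):
    """Model of `.data[]?.id == $MODEL_NAME` under `jq -e`.

    The filter emits one boolean per model, and only the last one
    decides the exit status. An empty stream also counts as failure.
    """
    results = [m.get("id") == name for m in models]
    return bool(results) and results[-1]


def any_check(models, name):
    """Model of `any(.data[]?; .id == $MODEL_NAME)`: a single boolean."""
    return any(m.get("id") == name for m in models)


models = [{"id": "target-model"}, {"id": "other-model"}]
print(last_value_check(models, "target-model"))  # the match is not last
print(any_check(models, "target-model"))
```

With the matching model listed first, the first check reports failure and the second reports success, which is the case the suggested fix addresses.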

🧹 Nitpick comments (4)
.github/scripts/extract_log_errors.py (3)

37-73: Optionally mark class attributes as ClassVar to satisfy RUF012

The mutable class attributes (ERROR_PATTERNS, ERROR_EXPLANATIONS, CONTEXT_PATTERNS) are effectively constants and flagged by Ruff RUF012. You can make that intent explicit:

-from typing import List, Dict, Any
+from typing import Any, ClassVar, Dict, List
@@
-    ERROR_PATTERNS = [
+    ERROR_PATTERNS: ClassVar[List[str]] = [
@@
-    ERROR_EXPLANATIONS = {
+    ERROR_EXPLANATIONS: ClassVar[Dict[str, str]] = {
@@
-    CONTEXT_PATTERNS = [
+    CONTEXT_PATTERNS: ClassVar[List[str]] = [

This is mostly stylistic but keeps linters quiet and documents that instances shouldn’t mutate these.


101-104: Use a secure temporary file instead of a fixed /tmp/analysis.log

Hard-coding /tmp/analysis.log can trip security scanners (S108) and risks collisions if multiple runs overlap.

Consider using tempfile.NamedTemporaryFile:

+import tempfile
@@
-            # Write log content to a temporary file for LogAI processing
-            temp_log = Path("/tmp/analysis.log")
-            temp_log.write_text(self.log_content)
+            # Write log content to a secure temporary file for LogAI processing
+            with tempfile.NamedTemporaryFile("w+", delete=False) as temp_log:
+                temp_log.write(self.log_content)
+                temp_log_path = temp_log.name
@@
-            logrecord = dataloader.load_data(str(temp_log))
+            logrecord = dataloader.load_data(temp_log_path)

Optionally, you can clean up the temp file after parsing if FileDataLoader doesn’t need it afterward.
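
A self-contained sketch of the same idea with cleanup included. The helper name `with_temp_log` and its `process` callback are hypothetical stand-ins for whatever consumes the file (for example, a loader that needs a path); they are not part of the script under review.

```python
import os
import tempfile


def with_temp_log(log_content, process):
    """Write log_content to a uniquely named temp file, run process(path),
    and remove the file afterwards, even if process raises."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".log", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write(log_content)
        tmp_path = tmp.name  # the file is closed on leaving the with block
    try:
        return process(tmp_path)
    finally:
        os.unlink(tmp_path)
```

Using `delete=False` and unlinking in `finally` keeps the file available after it is closed, which matters when the consumer reopens it by path, while still avoiding a fixed name that overlapping runs could collide on.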


136-139: Broad except Exception hides unexpected failures

Catching Exception here guarantees fallback behavior, but it also hides non-LogAI issues (e.g., coding bugs).

If you know which exceptions LogAI can raise, prefer limiting this to those; otherwise, at least log a bit more context (e.g., type + repr of the exception) so it’s easier to debug when something truly unexpected happens.
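
One way to follow this suggestion, sketched under assumptions: the exception tuple below is a placeholder, since the exceptions LogAI can actually raise are not listed in this review, and `extract_with_fallback`, `primary`, and `fallback` are hypothetical names rather than functions from the script.

```python
import logging

logger = logging.getLogger("extract_log_errors")


def extract_with_fallback(
    primary, fallback, log_text, expected=(ImportError, ValueError, OSError)
):
    """Run primary(log_text); on an expected failure, log it and use fallback.

    Exceptions outside `expected` (for example a KeyError caused by a coding
    bug) propagate instead of being silently turned into fallback output.
    """
    try:
        return primary(log_text)
    except expected as exc:
        # Record the type and repr so unexpected causes are easier to debug.
        logger.warning(
            "Primary extraction failed (%s: %r); falling back to regex",
            type(exc).__name__,
            exc,
        )
        return fallback(log_text)
```

If the full set of exceptions cannot be pinned down, keeping the broad handler but logging `type(exc).__name__` and `repr(exc)` as shown is the smaller change the comment asks for.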

.github/workflows/container-validation-backends.yml (1)

634-646: Update actions/setup-python to current major (v6) for log analysis steps

The current recommended major version of actions/setup-python for GitHub Actions workflows is v6. The log-analysis steps currently use actions/setup-python@v4, which should be updated:

  • Lines 634–639: Setup Python for Log Analysis in deploy-operator.
  • Lines 889–894: Setup Python for Log Analysis in deploy-test-*.
-    - name: Setup Python for Log Analysis
-      if: always() && steps.deploy-operator-step.outcome == 'failure'
-      uses: actions/setup-python@v4
+    - name: Setup Python for Log Analysis
+      if: always() && steps.deploy-operator-step.outcome == 'failure'
+      uses: actions/setup-python@v6
       with:
         python-version: '3.10'

(and similarly for the deploy-test job).

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ed8cd59 and 8704fef.

📒 Files selected for processing (4)
  • .github/.trigger (1 hunks)
  • .github/scripts/extract_log_errors.py (1 hunks)
  • .github/workflows/container-validation-backends.yml (14 hunks)
  • .github/workflows/test-logai.yml (1 hunks)
🧰 Additional context used
🪛 actionlint (1.7.8)
.github/workflows/test-logai.yml

21-21: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

.github/workflows/container-validation-backends.yml

636-636: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)


743-743: label "cpu-amd-m5-2xlarge" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


891-891: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4196/merge) by nv-nmailhot.
.github/workflows/test-logai.yml

[error] 1-1: Trailing whitespace detected and fixed by pre-commit in this file.

.github/workflows/container-validation-backends.yml

[error] 1-1: Trailing whitespace detected and fixed by pre-commit in this file.

.github/scripts/extract_log_errors.py

[error] 1-1: pre-commit: isort and black hooks modified this file. Run git commit to apply changes.


[error] 15-15: Ruff: F401 'logai' imported but unused.


[error] 22-22: Ruff: F401 'logai.information_extraction.log_parser.LogParser' imported but unused.


[error] 23-23: Ruff: F401 'logai.preprocess.preprocessor.Preprocessor' imported but unused.


[error] 1-1: Check-shebang-scripts-are-executable: Script has a shebang but is not executable. Run 'chmod +x .github/scripts/extract_log_errors.py'.


[error] 1-1: Trailing whitespace: fix trailing spaces in the file (or rerun pre-commit to auto-fix).


[error] 1-1: pre-commit: trailing-whitespace, ruff, and other hooks reported failures.

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4385/merge) by nv-nmailhot.
.github/workflows/container-validation-backends.yml

[error] 1-1: trailing-whitespace: Fixing trailing whitespace in container-validation-backends.yml

.github/scripts/extract_log_errors.py

[error] 1-1: pre-commit: isort modified this file during commit (Fixing /home/runner/work/dynamo/dynamo/.github/scripts/extract_log_errors.py)


[error] 1-1: pre-commit: black reformatted this file (reformatted .github/scripts/extract_log_errors.py)


[error] 15-15: ruff: F401 'logai' imported but unused


[error] 22-22: ruff: F401 'logai.information_extraction.log_parser.LogParser' imported but unused


[error] 1-1: check-shebang-scripts-are-executable: .github/scripts/extract_log_errors.py has a shebang but is not marked executable

🪛 GitHub Check: deploy-test-sglang (agg_router)
.github/workflows/container-validation-backends.yml

[failure] 626-626: Deploy Test Failed (LogAI): sglang (agg_router)

  1. [Line 70] timed out waiting for the condition
🪛 GitHub Check: deploy-test-sglang (agg)
.github/workflows/container-validation-backends.yml

[failure] 626-626: Deploy Test Failed (LogAI): sglang (agg)

  1. [Line 66] waiting for the condition on pods/sglang-agg-0-decode-x56vh
🪛 GitHub Check: deploy-test-trtllm (agg_router)
.github/workflows/container-validation-backends.yml

[failure] 626-626: Deploy Test Failed (LogAI): trtllm (agg_router)

  1. [Line 66] waiting for the condition on pods/trtllm-agg-router-0-frontend-8l6gv
🪛 GitHub Check: deploy-test-trtllm (agg)
.github/workflows/container-validation-backends.yml

[failure] 626-626: Deploy Test Failed (LogAI): trtllm (agg)

  1. [Line 63] timed out waiting for the condition
🪛 GitHub Check: deploy-test-trtllm (disagg_router)
.github/workflows/container-validation-backends.yml

[failure] 626-626: Deploy Test Failed (LogAI): trtllm (disagg_router)

  1. [Line 95] waiting for the condition on pods/trtllm-v1-disagg-router-0-trtllmdecodeworker-m7hwl
🪛 GitHub Check: deploy-test-vllm (agg_router)
.github/workflows/container-validation-backends.yml

[failure] 626-626: Deploy Test Failed (LogAI): vllm (agg_router)

  1. [Line 61] waiting for the condition on pods/vllm-agg-router-0-frontend-2qmwz
🪛 GitHub Check: deploy-test-vllm (agg)
.github/workflows/container-validation-backends.yml

[failure] 626-626: Deploy Test Failed (LogAI): vllm (agg)

  1. [Line 59] timed out waiting for the condition
🪛 GitHub Check: deploy-test-vllm (disagg_router)
.github/workflows/container-validation-backends.yml

[failure] 626-626: Deploy Test Failed (LogAI): vllm (disagg_router)

  1. [Line 166] creating error stream for port 8000 -> 8000: Timeout occurred"
🪛 Ruff (0.14.4)
.github/scripts/extract_log_errors.py

1-1: Shebang is present but file is not executable

(EXE001)


38-57: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


60-65: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


68-73: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


102-102: Probable insecure usage of temporary file or directory: "/tmp/analysis.log"

(S108)


136-136: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: vllm (amd64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: operator (amd64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (3)
.github/.trigger (1)

1-3: Trigger file looks fine

This is a harmless, self-documenting trigger file; no functional concerns from a CI perspective.

.github/workflows/test-logai.yml (1)

20-23: Update actions/setup-python to v6

actions/setup-python@v4 is outdated. Use the current recommended major version:

      - name: Setup Python
-       uses: actions/setup-python@v4
+       uses: actions/setup-python@v6
        with:
          python-version: '3.10'

The current recommended major version is v6, referenced by the major tag (not a branch/commit) to receive safe non-breaking updates within that major version.

Likely an incorrect or invalid review comment.

.github/scripts/extract_log_errors.py (1)

13-31: Suggested fix is correct, but review justification conflicts with actual configuration

The proposed simplification of the LogAI import block is sound for code cleanliness, but the stated reason requires clarification. F401 is explicitly in the ignore list in ruff.toml, so F401 violations are globally suppressed and not causing pipeline failures.

However, the unused top-level imports at lines 21–22 are genuinely unused (re-imported inside extract_with_logai() at lines 91–92 where they're actually used), and removing them aligns with best practices for optional dependency handling.

Additionally, the file has a shebang but is not executable (-rw-r--r--), which triggers the EXE rule. If CI validates executability via pre-commit, this could be a separate failure point.

The simplified structure (marking only import logai with # noqa: F401) is the correct approach—verify pre-commit passes after applying this fix and confirm the executable bit is corrected if needed.
