-
Notifications
You must be signed in to change notification settings - Fork 738
feat: error message annotation step #4385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
WalkthroughIntroduces comprehensive error extraction and observability infrastructure: a new Python script for extracting errors from logs using optional LogAI integration with regex fallback, a workflow for testing LogAI installation, and extensive hardening of the container validation workflow with runtime logging, diagnostics capture, and GitHub annotations. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes
Poem
Pre-merge checks❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Tip 📝 Customizable high-level summaries are now available in beta!You can now customize how CodeRabbit generates the high-level summary in your pull requests — including its content, structure, tone, and formatting.
Example:
Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/workflows/container-validation-backends.yml (1)
827-843: Fix the/v1/modelsjq filter to check if any model matches, not just the lastThe check
.data[]?.id == $MODEL_NAMEwithjq -eonly succeeds when the last model's id matches. Sincejq -ebases its exit code on the final output value, only the last boolean matters. If the matching model appears anywhere else in the list, the check fails.Replace with:
if echo "$MODELS_RESPONSE" | jq -e --arg MODEL_NAME "$MODEL_NAME" \ 'any(.data[]?; .id == $MODEL_NAME)' >/dev/null 2>&1; thenThis is the idiomatic jq pattern for "check if any element matches a condition."
🧹 Nitpick comments (4)
.github/scripts/extract_log_errors.py (3)
37-73: Optionally mark class attributes asClassVarto satisfy RUF012The mutable class attributes (
ERROR_PATTERNS,ERROR_EXPLANATIONS,CONTEXT_PATTERNS) are effectively constants and flagged by Ruff RUF012. You can make that intent explicit:-from typing import List, Dict, Any +from typing import Any, ClassVar, Dict, List @@ - ERROR_PATTERNS = [ + ERROR_PATTERNS: ClassVar[List[str]] = [ @@ - ERROR_EXPLANATIONS = { + ERROR_EXPLANATIONS: ClassVar[Dict[str, str]] = { @@ - CONTEXT_PATTERNS = [ + CONTEXT_PATTERNS: ClassVar[List[str]] = [This is mostly stylistic but keeps linters quiet and documents that instances shouldn’t mutate these.
101-104: Use a secure temporary file instead of a fixed/tmp/analysis.logHard-coding
/tmp/analysis.logcan trip security scanners (S108) and risks collisions if multiple runs overlap.Consider using
tempfile.NamedTemporaryFile:+import tempfile @@ - # Write log content to a temporary file for LogAI processing - temp_log = Path("/tmp/analysis.log") - temp_log.write_text(self.log_content) + # Write log content to a secure temporary file for LogAI processing + with tempfile.NamedTemporaryFile("w+", delete=False) as temp_log: + temp_log.write(self.log_content) + temp_log_path = temp_log.name @@ - logrecord = dataloader.load_data(str(temp_log)) + logrecord = dataloader.load_data(temp_log_path)Optionally, you can clean up the temp file after parsing if
FileDataLoaderdoesn’t need it afterward.
136-139: Broadexcept Exceptionhides unexpected failuresCatching
Exceptionhere guarantees fallback behavior, but it also hides non-LogAI issues (e.g., coding bugs).If you know which exceptions LogAI can raise, prefer limiting this to those; otherwise, at least log a bit more context (e.g., type + repr of the exception) so it’s easier to debug when something truly unexpected happens.
.github/workflows/container-validation-backends.yml (1)
634-646: Updateactions/setup-pythonto current major (v6) for log analysis stepsThe current recommended major version of
actions/setup-pythonfor GitHub Actions workflows is v6. The log-analysis steps currently useactions/setup-python@v4, which should be updated:
- Lines 634–639:
Setup Python for Log Analysisindeploy-operator.- Lines 889–894:
Setup Python for Log Analysisindeploy-test-*.- - name: Setup Python for Log Analysis - if: always() && steps.deploy-operator-step.outcome == 'failure' - uses: actions/setup-python@v4 + - name: Setup Python for Log Analysis + if: always() && steps.deploy-operator-step.outcome == 'failure' + uses: actions/setup-python@v6 with: python-version: '3.10'(and similarly for the deploy-test job).
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
.github/.trigger(1 hunks).github/scripts/extract_log_errors.py(1 hunks).github/workflows/container-validation-backends.yml(14 hunks).github/workflows/test-logai.yml(1 hunks)
🧰 Additional context used
🪛 actionlint (1.7.8)
.github/workflows/test-logai.yml
21-21: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
.github/workflows/container-validation-backends.yml
636-636: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
743-743: label "cpu-amd-m5-2xlarge" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
891-891: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4196/merge) by nv-nmailhot.
.github/workflows/test-logai.yml
[error] 1-1: Trailing whitespace detected and fixed by pre-commit in this file.
.github/workflows/container-validation-backends.yml
[error] 1-1: Trailing whitespace detected and fixed by pre-commit in this file.
.github/scripts/extract_log_errors.py
[error] 1-1: pre-commit: isort and black hooks modified this file. Run git commit to apply changes.
[error] 15-15: Ruff: F401 'logai' imported but unused.
[error] 22-22: Ruff: F401 'logai.information_extraction.log_parser.LogParser' imported but unused.
[error] 23-23: Ruff: F401 'logai.preprocess.preprocessor.Preprocessor' imported but unused.
[error] 1-1: Check-shebang-scripts-are-executable: Script has a shebang but is not executable. Run 'chmod +x .github/scripts/extract_log_errors.py'.
[error] 1-1: Trailing whitespace: fix trailing spaces in the file (or rerun pre-commit to auto-fix).
[error] 1-1: pre-commit: trailing-whitespace, ruff, and other hooks reported failures.
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4385/merge) by nv-nmailhot.
.github/workflows/container-validation-backends.yml
[error] 1-1: trailing-whitespace: Fixing trailing whitespace in container-validation-backends.yml
.github/scripts/extract_log_errors.py
[error] 1-1: pre-commit: isort modified this file during commit (Fixing /home/runner/work/dynamo/dynamo/.github/scripts/extract_log_errors.py)
[error] 1-1: pre-commit: black reformatted this file (reformatted .github/scripts/extract_log_errors.py)
[error] 15-15: ruff: F401 'logai' imported but unused
[error] 22-22: ruff: F401 'logai.information_extraction.log_parser.LogParser' imported but unused
[error] 1-1: check-shebang-scripts-are-executable: .github/scripts/extract_log_errors.py has a shebang but is not marked executable
🪛 GitHub Check: deploy-test-sglang (agg_router)
.github/workflows/container-validation-backends.yml
[failure] 626-626: Deploy Test Failed (LogAI): sglang (agg_router)
- [Line 70] timed out waiting for the condition
🪛 GitHub Check: deploy-test-sglang (agg)
.github/workflows/container-validation-backends.yml
[failure] 626-626: Deploy Test Failed (LogAI): sglang (agg)
- [Line 66] waiting for the condition on pods/sglang-agg-0-decode-x56vh
🪛 GitHub Check: deploy-test-trtllm (agg_router)
.github/workflows/container-validation-backends.yml
[failure] 626-626: Deploy Test Failed (LogAI): trtllm (agg_router)
- [Line 66] waiting for the condition on pods/trtllm-agg-router-0-frontend-8l6gv
🪛 GitHub Check: deploy-test-trtllm (agg)
.github/workflows/container-validation-backends.yml
[failure] 626-626: Deploy Test Failed (LogAI): trtllm (agg)
- [Line 63] timed out waiting for the condition
🪛 GitHub Check: deploy-test-trtllm (disagg_router)
.github/workflows/container-validation-backends.yml
[failure] 626-626: Deploy Test Failed (LogAI): trtllm (disagg_router)
- [Line 95] waiting for the condition on pods/trtllm-v1-disagg-router-0-trtllmdecodeworker-m7hwl
🪛 GitHub Check: deploy-test-vllm (agg_router)
.github/workflows/container-validation-backends.yml
[failure] 626-626: Deploy Test Failed (LogAI): vllm (agg_router)
- [Line 61] waiting for the condition on pods/vllm-agg-router-0-frontend-2qmwz
🪛 GitHub Check: deploy-test-vllm (agg)
.github/workflows/container-validation-backends.yml
[failure] 626-626: Deploy Test Failed (LogAI): vllm (agg)
- [Line 59] timed out waiting for the condition
🪛 GitHub Check: deploy-test-vllm (disagg_router)
.github/workflows/container-validation-backends.yml
[failure] 626-626: Deploy Test Failed (LogAI): vllm (disagg_router)
- [Line 166] creating error stream for port 8000 -> 8000: Timeout occurred"
🪛 Ruff (0.14.4)
.github/scripts/extract_log_errors.py
1-1: Shebang is present but file is not executable
(EXE001)
38-57: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
60-65: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
68-73: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
102-102: Probable insecure usage of temporary file or directory: "/tmp/analysis.log"
(S108)
136-136: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: vllm (amd64)
- GitHub Check: sglang (amd64)
- GitHub Check: operator (amd64)
- GitHub Check: trtllm (amd64)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (3)
.github/.trigger (1)
1-3: Trigger file looks fineThis is a harmless, self-documenting trigger file; no functional concerns from a CI perspective.
.github/workflows/test-logai.yml (1)
20-23: Updateactions/setup-pythonto v6
actions/setup-python@v4is outdated. Use the current recommended major version:- name: Setup Python - uses: actions/setup-python@v4 + uses: actions/setup-python@v6 with: python-version: '3.10'The current recommended major version is v6, referenced by the major tag (not a branch/commit) to receive safe non-breaking updates within that major version.
Likely an incorrect or invalid review comment.
.github/scripts/extract_log_errors.py (1)
13-31: Suggested fix is correct, but review justification conflicts with actual configurationThe proposed simplification of the LogAI import block is sound for code cleanliness, but the stated reason requires clarification. F401 is explicitly in the ignore list in ruff.toml, so F401 violations are globally suppressed and not causing pipeline failures.
However, the unused top-level imports at lines 21–22 are genuinely unused (re-imported inside
extract_with_logai()at lines 91–92 where they're actually used), and removing them aligns with best practices for optional dependency handling.Additionally, the file has a shebang but is not executable (
-rw-r--r--), which triggers the EXE rule. If CI validates executability via pre-commit, this could be a separate failure point.The simplified structure (marking only
import logaiwith# noqa: F401) is the correct approach—verify pre-commit passes after applying this fix and confirm the executable bit is corrected if needed.
Overview
This PR enhances the GitHub Actions workflow error reporting system by implementing intelligent error extraction using Salesforce LogAI and creating rich GitHub annotations with detailed diagnostic information. The changes ensure that deployment and test failures are immediately visible with actionable context, while also addressing security vulnerabilities in JSON payload construction.
Key Improvements:
Details
1. Error Annotation System (
container-validation-backends.yml)What Changed:
deploy-operatoranddeploy-test-*jobscontinue-on-error: true)Error Handling Flow:
2. LogAI Error Extraction Script (
extract_log_errors.py)What Changed:
timed out waiting for the conditionno matching resources foundFailed to pull imageCrashLoopBackOffExample Output:
Where Should the Reviewer Start?
1️⃣ Start with the Error Extraction Script (
extract_log_errors.py)2️⃣ Review the Annotation Workflow (
container-validation-backends.yml)deploy-operatorjob's error handling flow (lines 634-735)🔍 Testing Recommendations:
deploy-operatorand verify:Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
New Features
Chores