This Python script, nr-alert-analyzer.py, queries New Relic's GraphQL API (NerdGraph) directly to fetch NrAiIncident events. It performs statistical analysis to help Site Reliability Engineers (SREs) and DevOps teams separate signal from noise in their alerting strategy.
The script reports on:
- Temporal Patterns: Identifies if noise is constant or spiking at specific times.
- Severity Breakdown: The ratio of Critical vs. Warning alerts.
- Root Cause: Which Alert Policies and specific Conditions are generating the most volume (including Priority).
- Entity Hotspots: Which specific hosts, apps, or targets are the "noisiest," with a drill-down into exactly which conditions are failing on them.
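The NerdGraph call itself is a GraphQL query wrapping an NRQL `SELECT` against `NrAiIncident`. As a hedged sketch (the exact NRQL string and attribute list used by nr-alert-analyzer.py may differ), the request payload can be built like this:

```python
import json

# US region endpoint; EU-region accounts use api.eu.newrelic.com/graphql
NERDGRAPH_URL = "https://api.newrelic.com/graphql"

def build_incident_query(account_id, since, until):
    """Build a NerdGraph payload that runs an NRQL query for NrAiIncident
    events. The NRQL here is illustrative -- the real script may select
    different attributes or use a different LIMIT."""
    nrql = (
        "SELECT * FROM NrAiIncident "
        f"SINCE '{since}' UNTIL '{until}' LIMIT 2000"
    )
    graphql = """
    query($accountId: Int!, $nrql: Nrql!) {
      actor {
        account(id: $accountId) {
          nrql(query: $nrql) { results }
        }
      }
    }
    """
    return {"query": graphql, "variables": {"accountId": account_id, "nrql": nrql}}

# The script would then POST this with requests, e.g.:
#   requests.post(NERDGRAPH_URL, json=payload, headers={"API-Key": api_key})
payload = build_incident_query(1234567, "2023-10-01 00:00:00", "2023-10-02 00:00:00")
print(json.dumps(payload["variables"], indent=2))
```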
The script requires the following:
- Python 3.7+
- pandas: Used for data aggregation and statistical analysis.
- requests
It is highly recommended to run this script within a Python virtual environment to manage dependencies cleanly.
From your terminal, navigate to the directory where you saved nr-alert-analyzer.py and create a virtual environment:
```shell
# For macOS and Linux
python3 -m venv venv

# For Windows
python -m venv venv
```
You must activate the environment in your terminal session before installing dependencies or running the script.
```shell
# For macOS and Linux
source venv/bin/activate

# For Windows (Command Prompt)
.\venv\Scripts\activate.bat

# For Windows (PowerShell)
.\venv\Scripts\Activate.ps1
```
Your terminal prompt should change to show (venv) at the beginning.
With your virtual environment active, install the required libraries:
```shell
pip install pandas requests
```
Run the script from your terminal. You must provide your New Relic User API Key and Account ID.
By default, the script analyzes the last 7 days of data.
```shell
python nr-alert-analyzer.py --api_key "NRAK-YOUR-KEY" --account_id 1234567
```
You can define a custom window using the YYYY-MM-DD HH:MM:SS format:

```shell
python nr-alert-analyzer.py \
  --api_key "NRAK-..." \
  --account_id 1234567 \
  --start_time "2023-10-01 00:00:00" \
  --end_time "2023-10-02 00:00:00"
```
| Argument | Required | Description | Default |
|---|---|---|---|
| --api_key | Yes | Your New Relic User Key (starts with NRAK-). | None |
| --account_id | Yes | The New Relic Account ID to query. | None |
| --start_time | No | Start of analysis window (YYYY-MM-DD HH:MM:SS). | 7 days ago |
| --end_time | No | End of analysis window (YYYY-MM-DD HH:MM:SS). | Now (UTC) |
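The defaults in the table amount to a few lines of `datetime` arithmetic. This sketch assumes the script computes its window in UTC, as the `--end_time` default suggests:

```python
from datetime import datetime, timedelta, timezone

def default_window(days=7):
    """Return (start, end) strings in the YYYY-MM-DD HH:MM:SS format the
    script accepts, covering the last `days` days ending now (UTC)."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(days=days)
    fmt = "%Y-%m-%d %H:%M:%S"
    return start.strftime(fmt), end.strftime(fmt)

start, end = default_window()
print(f'--start_time "{start}" --end_time "{end}"')
```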
The script prints its analysis directly to the terminal in specific sections.
Confirms the connection to New Relic and the number of events fetched.
- Note: The script currently fetches a maximum of 2,000 incidents per query.
Helps you distinguish between "always on" noise and "acute" incidents.
- Daily Breakdown: Shows incident volume per day.
- Temporal Peak: Identifies the specific hour of the day with the highest volume.
Shows the ratio of Critical vs. Warning violations.
- Tip: If you have 90% Warnings, your alert thresholds are likely too sensitive.
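The severity ratio is a simple value count over the incident priority field. A minimal stdlib sketch (the field name follows the NrAiIncident export's `priority` column):

```python
from collections import Counter

def severity_breakdown(priorities):
    """Return each priority's share of total incident volume as a percentage."""
    counts = Counter(p.lower() for p in priorities)
    total = sum(counts.values())
    return {sev: round(100 * n / total, 1) for sev, n in counts.items()}

# Toy data: 9 warnings for every critical -- the over-sensitive pattern
# the tip above describes.
sample = ["warning"] * 90 + ["critical"] * 10
print(severity_breakdown(sample))  # {'warning': 90.0, 'critical': 10.0}
```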
This groups alerts by Policy, Condition, and Priority.
- What it finds: The specific configuration rules that are generating the most noise.
- Example: [150] Priority: critical | Policy: 'Database' -> Condition: 'High CPU'
This groups alerts by the Entity (Target Name).
- What it finds: Specific hosts, pods, or applications that are failing.
- Nested Detail: Under each entity, it lists the specific conditions triggering on that host.
- Example: host-prod-01 might be triggering "High CPU" (Critical) and "Disk Full" (Warning) simultaneously.
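Since pandas is already a dependency, the nested entity/condition drill-down is essentially a two-level groupby. A hedged sketch (column names assumed from the NrAiIncident export, e.g. `targetName` and `conditionName`; the real script's aggregation may differ):

```python
import pandas as pd

# Toy incident data mirroring the example above.
df = pd.DataFrame({
    "targetName":    ["host-prod-01", "host-prod-01", "host-prod-01", "host-prod-02"],
    "conditionName": ["High CPU", "High CPU", "Disk Full", "High CPU"],
    "priority":      ["critical", "critical", "warning", "critical"],
})

# Rank entities by incident volume, then list the conditions firing on each.
by_entity = df.groupby("targetName").size().sort_values(ascending=False)
for entity, total in by_entity.items():
    print(f"[{total}] {entity}")
    detail = (df[df["targetName"] == entity]
              .groupby(["conditionName", "priority"]).size())
    for (cond, prio), n in detail.items():
        print(f"    [{n}] {cond} ({prio})")
```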
After extracting incident data with the download script, you can generate a professional Alert Quality Management (AQM) analysis report from the raw CSV. The report generator performs all analysis at runtime — every table, metric, and finding is computed directly from the incident export with zero hardcoded content.
The script produces two outputs:
- A formatted `.docx` report with data tables, KPI metrics, and template narrative
- A structured prompt file (`.md`) designed to be fed to your LLM of choice for polished interpretive paragraphs and prioritized recommendations
This two-stage design keeps the data analysis deterministic and reproducible, while letting an LLM add the contextual narrative that makes the report worth thousands of dollars in consulting fees.
```shell
pip install pandas python-docx
```

```shell
# Basic usage (generates two files alongside the CSV)
python3 generate_aqm_report.py --csv incidents.csv --account "ACME Corp"
```
```shell
# With custom analyst name and output path
python3 generate_aqm_report.py \
  --csv incidents.csv \
  --account "Contoso Financial" \
  --analyst "Jane Doe, Senior Solution Architect" \
  --output reports/Contoso_AQM_Report.docx
```

| Argument | Required | Default | Description |
|---|---|---|---|
| --csv | Yes | — | Path to the raw NrAiIncident CSV export |
| --account | Yes | — | Customer / account display name for the cover page |
| --analyst | No | Jim Hagan, Principal Solution Architect | Analyst name and title for the cover page |
| --output | No | AQM_Analysis_{account}.docx | Output file path for the docx report |
The prompt file is automatically generated alongside the docx with the same base name
and a _prompt.md suffix (e.g., AQM_Analysis_ACME_Corp_prompt.md).
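That naming rule can be derived from the docx path with `pathlib`. A sketch of the likely logic (the helper name here is hypothetical, not from the script):

```python
from pathlib import Path

def prompt_path(docx_path):
    """Derive the companion prompt filename: same directory and base name,
    with a _prompt.md suffix replacing the .docx extension."""
    p = Path(docx_path)
    return str(p.with_name(p.stem + "_prompt.md"))

print(prompt_path("AQM_Analysis_ACME_Corp.docx"))  # AQM_Analysis_ACME_Corp_prompt.md
```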
The .docx report contains 10 sections and 3 appendices, all data-driven:
| Section | Title | Contents |
|---|---|---|
| 1 | Executive Summary | KPI metrics, date range, data model verification, severity breakdown |
| 2 | Noisiest Alert Conditions | Top 20 conditions by open-event count with target counts and severity |
| 3 | Noisiest Alert Policies | Top 15 policies; condition replication analysis (duplicated names) |
| 4 | Flappiness Analysis | Duration distribution; top 15 flappiest conditions (% under 5 min) |
| 5 | Re-Open Pattern Analysis | Close-to-reopen gap by condition+target; aggregated by condition |
| 6 | Expiration & VTL Configuration | VTL distribution, close causes, long-running incident inventory |
| 7 | Noisiest Entities | Top 15 entities (entity.name with targetName fallback) |
| 8 | Noisiest Signal Targets | Top targets (all, then excluding dominant condition); entity mapping |
| 9 | Noise by Entity Type | Entity type distribution with percentage breakdown |
| 10 | Prioritized Recommendations | Placeholder — populated via the prompt file workflow |
| A | Workshop Session Guide | Template agenda for a 2-hour AQM workshop |
| B | Methodology | Auto-generated: row counts, date range, event pairing stats |
| C | Field Analysis | All columns with population rates, types, and top values |
The _prompt.md file contains a structured prompt with all analysis results
formatted for an LLM. It includes instructions, terminology guidance, and every data
point the LLM needs to write interpretive paragraphs and generate the Top 10
Recommendations (Section 10). Feed it to your LLM of choice like this:
```shell
# Copy the prompt file contents and paste into your LLM, or:
cat AQM_Analysis_ACME_Corp_prompt.md | pbcopy   # macOS
```

The script runs a deterministic analysis pipeline:
- Load & validate — reads the CSV, verifies required columns (`timestamp`, `event`, `incidentId`, `conditionName`, `policyName`, `durationSeconds`, `targetName`, `priority`), converts timestamps
- Separate events — splits into open/close sets; pairs by `incidentId` to verify lifecycle completeness (both, open-only, close-only counts)
- Auto-detect dominant noise source — identifies the condition with the highest open count; if it exceeds 50% of volume, automatically excludes it from entity/target/re-open tables to prevent it from obscuring other patterns
- Compute noise rankings — top conditions, policies, entities, and targetNames
- Calculate flappiness — from `durationSeconds` on close events; identifies conditions with the highest % of incidents closing under 5 minutes
- Detect re-open patterns — measures close-to-next-open gaps per condition+target pair (configurable gap threshold, default 600s / 10 min)
- Analyze configuration — VTL distribution, close causes, signal expiration settings, long-running incidents (>12h, >24h)
- Profile every field — population rates, data types, unique counts, top values, zero percentages, anomaly flags
- Extract severity — parses SEV1/SEV2/SEV3 from policy naming conventions
- Render — writes the `.docx` report and the `_prompt.md` file
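The flappiness calculation in the pipeline reduces to a per-condition fraction of close events under the 5-minute threshold. A stdlib sketch, assuming `durationSeconds` is taken from close events as described (the function name is illustrative):

```python
from collections import defaultdict

FLAP_THRESHOLD_S = 300   # closes in under 5 minutes count as flaps
MIN_FLAP = 20            # minimum closed incidents to qualify (mirrors the script's constant)

def flappiness(close_events):
    """close_events: iterable of (conditionName, durationSeconds) from close rows.
    Returns {condition: percent closing under 5 min} for conditions with
    at least MIN_FLAP closed incidents."""
    totals, flaps = defaultdict(int), defaultdict(int)
    for cond, duration in close_events:
        totals[cond] += 1
        if duration < FLAP_THRESHOLD_S:
            flaps[cond] += 1
    return {c: round(100 * flaps[c] / n, 1)
            for c, n in totals.items() if n >= MIN_FLAP}

# Toy data: "High CPU" flaps 18 of 24 times; "Disk Full" has too few closes to rank.
events = [("High CPU", 60)] * 18 + [("High CPU", 900)] * 6 + [("Disk Full", 30)] * 5
print(flappiness(events))  # {'High CPU': 75.0}
```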
The script expects a standard NrAiIncident CSV export. Required columns:
```
timestamp, event, incidentId, conditionName, policyName,
durationSeconds, targetName, priority
```
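The required-column check can be as simple as a set difference against the header row. A minimal sketch of that validation (the function name and error handling are assumptions, not the script's actual code):

```python
import csv
import io

REQUIRED = {"timestamp", "event", "incidentId", "conditionName",
            "policyName", "durationSeconds", "targetName", "priority"}

def missing_columns(csv_text):
    """Return the set of required columns absent from the CSV header row."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return REQUIRED - set(header)

# A header missing the priority column is flagged before analysis begins.
sample = "timestamp,event,incidentId,conditionName,policyName,durationSeconds,targetName\n"
print(missing_columns(sample))  # {'priority'}
```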
Optional columns used when present (graceful fallback if absent):
```
entity.name, entity.type, entity.guid, conditionId, policyId,
threshold, thresholdDuration, thresholdOccurrences, operator,
nrqlQuery, nrqlEventType, evaluationType, aggregationMethod,
aggregationDuration, fillOption, delay, slideBySeconds,
violationTimeLimitSeconds, expirationDuration,
closeViolationsOnExpiration, openViolationOnExpiration,
closeCause, closeTime, recoveryTime, openTime, muted,
runbookUrl, description, title, signalId, accountId
```
If your export was split into chunks (e.g., via `split`), reassemble first:

```shell
cat incidents_chunk_* > incidents_full.csv
```

Internal constants can be adjusted at the top of the script:
| Constant | Default | Purpose |
|---|---|---|
| `HEADER_BG` | `00AC69` | Table header background color (hex) |
| `ROW_ALT_BG` | `E8F8F0` | Alternating row background color (hex) |
| `REOPEN_GAP` | `600` | Re-open threshold in seconds (incidents closing and re-opening within this window are counted) |
| `MIN_FLAP` | `20` | Minimum closed incidents for a condition to appear in the flappiness table |
```shell
$ python3 generate_aqm_report.py --csv incidents.csv --account "Contoso Financial"
Loading incidents.csv...
Rows: 500,000, Incidents: 262,852, Range: 2026-02-05 to 2026-03-30
Analyzing...
Analysis complete.
Report: AQM_Analysis_Contoso_Financial.docx
Prompt: AQM_Analysis_Contoso_Financial_prompt.md
Done! Feed AQM_Analysis_Contoso_Financial_prompt.md to your LLM of choice for narrative polish.
```