Commit 10b4901

dt-benedict and Benedict Bleimschein authored
Add comprehensive observability analysis rules and DQL patterns (#74)
* Updated rules files with observability use cases: problem flow and investigation
* Apply prettier formatting to rule files
  - Fix formatting issues identified by prettier
  - Ensure consistent code style across all markdown files
* Address PR review feedback
  - Fix issues identified in code review
  - Update documentation and rule files based on feedback
* Apply prettier formatting
* Update .gitignore
* Added disclaimer to rules-readme
* Run prettier on rules-readme
* Remove .DS_Store file from dynatrace-agent-rules folder

---------

Co-authored-by: Benedict Bleimschein <[email protected]>
1 parent d7c84ff commit 10b4901

File tree: 8 files changed (+2540, -98 lines)


.gitignore

Lines changed: 1 addition & 0 deletions
```diff
@@ -3,5 +3,6 @@ dist/*
 node_modules/
 /.vscode/*
 !/.vscode/mcp.json
+.DS_Store

 tsconfig.tsbuildinfo
```
Lines changed: 384 additions & 0 deletions
# Dynatrace Log Analysis with DQL

## Overview

This guide covers comprehensive log analysis using DQL (Dynatrace Query Language) for troubleshooting, debugging, and root cause analysis. Logs are crucial for understanding application behavior, especially during incidents and deployments.

## Core DQL Pattern for Logs

### Basic Log Query Structure

```dql
fetch logs, from:now() - 2h
| filter loglevel == "ERROR" or loglevel == "WARN"
| fields timestamp, content, loglevel, k8s.pod.name, k8s.namespace.name
| sort timestamp desc
| limit 20
```

## Available Data Points

### Primary Fields

- **content** - The actual log message content
- **loglevel** - Log level (ERROR, WARN, INFO, DEBUG, TRACE)
- **timestamp** - When the log entry was created
- **message** - Structured message field (alternative to content)
- **log.source** - Source of the log (e.g., "Container Output")

### Kubernetes Context Fields

- **k8s.pod.name** - Pod name generating the log
- **k8s.namespace.name** - Kubernetes namespace
- **k8s.container.name** - Container name within the pod
- **k8s.cluster.name** - Kubernetes cluster name
- **k8s.node.name** - Node where the pod is running
- **k8s.workload.name** - Workload (deployment/statefulset) name

### Dynatrace Entity Fields

- **dt.entity.service** - Service entity ID
- **dt.entity.process_group_instance** - Process group instance entity ID
- **dt.entity.host** - Host entity ID
- **dt.entity.kubernetes_cluster** - Kubernetes cluster entity ID

### Trace Correlation Fields

- **trace_id** - Distributed trace ID
- **span_id** - Span ID within the trace
- **dt.trace_id** - Dynatrace trace ID
- **dt.span_id** - Dynatrace span ID

### Error Context Fields

- **exception.message** - Exception message text
- **exception.type** - Exception type/class
- **exception.stack_trace** - Full stack trace
- **status** - Log status (often mirrors loglevel)

## Common Query Patterns

### 1. Error Logs from Specific Service

```dql
fetch logs, from:now() - 4h
| filter dt.entity.service == "SERVICE-YOUR-ID"
| filter loglevel == "ERROR"
| fields timestamp, content, exception.message, trace_id
| sort timestamp desc
| limit 15
```

### 2. Application Errors by Pod

```dql
fetch logs, from:now() - 2h
| filter matchesPhrase(k8s.pod.name, "payment")
| filter loglevel == "ERROR" or matchesPhrase(content, "error") or matchesPhrase(content, "exception")
| fields timestamp, content, k8s.pod.name, k8s.container.name
| sort timestamp desc
| limit 20
```

### 3. Logs with Stack Traces

```dql
fetch logs, from:now() - 6h
| filter exception.stack_trace != ""
| fields timestamp, exception.message, exception.type, exception.stack_trace, k8s.pod.name
| sort timestamp desc
| limit 10
```

### 4. Deployment-Related Logs

```dql
fetch logs, from:now() - 1h
| filter matchesPhrase(content, "deployment") or matchesPhrase(content, "restart") or matchesPhrase(content, "startup")
| fields timestamp, content, k8s.pod.name, k8s.namespace.name
| sort timestamp desc
| limit 25
```

### 5. High-Frequency Error Analysis

```dql
fetch logs, from:now() - 2h
| filter loglevel == "ERROR"
| fields timestamp, exception.message, k8s.pod.name
| summarize error_count = count(), by: {exception.message, k8s.pod.name}
| sort error_count desc
| limit 15
```

### 6. Trace-Correlated Logs

```dql
fetch logs, from:now() - 1h
| filter trace_id == "your-trace-id"
| fields timestamp, content, span_id, k8s.pod.name
| sort timestamp asc
```

## Business Logic Error Detection

### Credit Card Processing Errors (Real Example)

```dql
fetch logs, from:now() - 4h
| filter matchesPhrase(content, "credit card") or matchesPhrase(content, "payment")
| filter loglevel == "WARN" or loglevel == "ERROR"
| fields timestamp, content, exception.message, k8s.pod.name
| sort timestamp desc
| limit 20
```

### Payment Service Specific Analysis

```dql
fetch logs, from:now() - 4h
| filter matchesPhrase(k8s.pod.name, "payment") and matchesPhrase(content, "American Express")
| fields timestamp, content, exception.message, trace_id
| sort timestamp desc
| limit 10
```

## Integration with Problem Analysis

### Correlating Logs with Problems

```dql
fetch logs, from:now() - 2h
| filter loglevel == "ERROR" or loglevel == "WARN"
| filter matchesPhrase(k8s.namespace.name, "astroshop")
| fields timestamp, content, k8s.pod.name, dt.entity.service
| sort timestamp desc
| limit 30
```

### Problem Timeline Correlation

When analyzing a specific problem timeframe (e.g., 11:54 AM - 12:29 PM):

```dql
fetch logs, from:"2025-07-24T11:54:00Z", to:"2025-07-24T12:29:00Z"
| filter matchesPhrase(k8s.pod.name, "payment-6977fffc7-2r2hb")
| filter loglevel == "WARN" or loglevel == "ERROR"
| fields timestamp, content, exception.message
| sort timestamp desc
| limit 20
```

## Advanced Analysis Patterns

### Log Rate Analysis

```dql
fetch logs, from:now() - 2h
| filter loglevel == "ERROR"
| fieldsAdd time_bucket = bin(timestamp, 5m)
| summarize log_count = count(), by: {time_bucket, k8s.pod.name}
| sort time_bucket desc
```

### Multi-Service Error Correlation

```dql
fetch logs, from:now() - 2h
| filter loglevel == "ERROR"
| filter matchesPhrase(k8s.namespace.name, "astroshop")
| summarize error_count = count(), latest_error = max(timestamp), by: {k8s.pod.name, exception.message}
| sort error_count desc
```

### Performance Issue Detection

```dql
fetch logs, from:now() - 1h
| filter matchesPhrase(content, "timeout") or matchesPhrase(content, "slow") or matchesPhrase(content, "performance")
| fields timestamp, content, k8s.pod.name, k8s.container.name
| sort timestamp desc
| limit 15
```

## String Matching Best Practices

### ✅ Correct String Operations for Logs

```dql
| filter matchesPhrase(content, "payment")       // Text search in content
| filter loglevel == "ERROR"                     // Exact level match
| filter startsWith(k8s.pod.name, "payment-")    // Pod prefix match
| filter endsWith(exception.type, "Exception")   // Exception type suffix
```

### ❌ Unsupported Operations

```dql
| filter contains(content, "error")              // NOT supported
| filter content like "%payment%"                // NOT supported
```

### Content Search Techniques

```dql
// Multiple keyword search
| filter matchesPhrase(content, "error") or matchesPhrase(content, "exception") or matchesPhrase(content, "failed")

// Case variations
| filter matchesPhrase(content, "Error") or matchesPhrase(content, "ERROR") or matchesPhrase(content, "error")

// Specific error patterns
| filter matchesPhrase(content, "cannot process") or matchesPhrase(content, "validation failed")
```

## Real-World Investigation Examples

### Payment Service American Express Bug Investigation

**Context**: 57.95% error rate during deployment
**Timeline**: 11:54 AM - 12:29 PM

```dql
fetch logs, from:now() - 4h
| filter matchesPhrase(k8s.pod.name, "payment")
| filter loglevel == "WARN" and matchesPhrase(content, "American Express")
| fields timestamp, content, exception.message, trace_id
| sort timestamp desc
| limit 10
```

**Key Findings from Real Data**:

- **Error Location**: `/usr/src/app/charge.js:73:11`
- **Business Logic Bug**: "Sorry, we cannot process American Express credit cards. Only Visa or Mastercard or American Express are accepted."
- **Trace Correlation**: Each error has an associated trace_id for transaction tracking
- **Pod Consistency**: All errors come from the same pod `payment-6977fffc7-2r2hb`

### Deployment Impact Analysis

```dql
fetch logs, from:now() - 4h
| filter matchesPhrase(k8s.namespace.name, "astroshop")
| filter matchesPhrase(content, "ArgoCD") or matchesPhrase(content, "deployment") or matchesPhrase(content, "git#")
| fields timestamp, content, k8s.pod.name
| sort timestamp desc
| limit 15
```

## Structured Log Analysis

### JSON Log Parsing

For structured JSON logs, the content field contains JSON that can be analyzed:

```dql
fetch logs, from:now() - 2h
| filter matchesPhrase(content, "level")
| fields timestamp, content, k8s.pod.name
| sort timestamp desc
| limit 10
```

### Extracting Values from JSON Logs

```dql
fetch logs, from:now() - 1h
| filter matchesPhrase(content, "\"level\":\"warn\"")
| fields timestamp, content, exception.message, k8s.pod.name
| sort timestamp desc
```

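The query above only filters on the raw JSON text. To work with individual JSON attributes, the content field can be parsed first. The following is a minimal sketch assuming the Dynatrace Pattern Language `JSON` matcher and bracket-style record access; the attribute names `level` and `message` are placeholders, so verify the query with `verify_dql` before running it:

```dql
fetch logs, from:now() - 1h
| filter matchesPhrase(content, "\"level\":\"warn\"")
| parse content, "JSON:log_json"                  // parse raw JSON into a record (assumed DPL JSON matcher)
| fieldsAdd parsed_level = log_json["level"], parsed_message = log_json["message"]   // attribute names are placeholders
| fields timestamp, parsed_level, parsed_message, k8s.pod.name
| sort timestamp desc
| limit 10
```
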
## Performance Considerations

### Optimizing Log Queries

1. **Always include timeframe**: `from:now() - 2h` (avoid overly broad searches)
2. **Filter early**: Apply restrictive filters first
3. **Use entity filters**: Filter by specific pods/services when possible
4. **Limit results**: Always include reasonable limits
5. **Sort efficiently**: Sort by timestamp desc for recent logs (all five guidelines are combined in the sketch below)

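A minimal sketch applying all five guidelines in one query; the namespace and service ID are placeholders carried over from earlier examples and should be replaced with values from your own environment:

```dql
fetch logs, from:now() - 30m                              // 1. narrow, explicit timeframe
| filter matchesPhrase(k8s.namespace.name, "astroshop")   // 2. restrictive filter applied early (placeholder namespace)
| filter dt.entity.service == "SERVICE-YOUR-ID"           // 3. entity filter (placeholder ID)
| filter loglevel == "ERROR"
| fields timestamp, content, exception.message, k8s.pod.name
| sort timestamp desc                                     // 5. most recent logs first
| limit 10                                                // 4. small, bounded result set
```
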
### Query Timeframe Recommendations

- **Real-time debugging**: 15-30 minutes
- **Incident investigation**: 2-4 hours
- **Deployment analysis**: 1-2 hours around deployment time
- **Pattern analysis**: 24 hours maximum
- **Historical research**: Use specific time windows, not broad ranges

## Troubleshooting Log Queries

### Common Issues

1. **Empty Results**: Check timeframe, pod names, and filter criteria
2. **Performance Issues**: Reduce timeframe or add more specific filters
3. **Missing Logs**: Verify log ingestion and entity IDs
4. **Field Access**: Use `| limit 1` to explore available fields

### Field Discovery Technique

```dql
fetch logs, from:now() - 1h
| filter matchesPhrase(k8s.pod.name, "your-pod-name")
| limit 1
```

### Debugging Query Performance

```dql
fetch logs, from:now() - 30m                 // Shorter timeframe
| filter k8s.pod.name == "specific-pod-name" // Exact match
| filter loglevel == "ERROR"                 // Specific level
| limit 10                                   // Small result set
```

## Integration with MCP Tools

### Using Logs After Entity Discovery

After finding entities with MCP tools, use their entity IDs in log queries:

```dql
fetch logs, from:now() - 2h
| filter dt.entity.service == "SERVICE-BECA49FB15C72B6A" // From MCP entity lookup
| filter loglevel == "ERROR"
| fields timestamp, content, exception.message
| sort timestamp desc
| limit 15
```

### Cross-Referencing with Problems

1. **Find problems** with `fetch events | filter event.kind == "DAVIS_PROBLEM"` (see the sketch below)
2. **Extract affected entities** from problem results
3. **Query logs** for those specific entities during the problem timeframe
4. **Correlate trace IDs** between problems and logs

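A minimal sketch for step 1, expanding the `DAVIS_PROBLEM` filter quoted above. Apart from `event.kind`, the projected field names (`event.name`, `event.status`, `affected_entity_ids`) are assumptions about the event schema, so confirm them with `verify_dql` or a `| limit 1` exploration first:

```dql
fetch events, from:now() - 24h
| filter event.kind == "DAVIS_PROBLEM"
// Field names below are assumed; verify them against your event schema
| fields timestamp, event.name, event.status, affected_entity_ids
| sort timestamp desc
| limit 10
```
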
## Follow-up Analysis Workflows

### After Finding Error Logs

1. **Extract stack traces** for deeper code analysis
2. **Find related spans** using trace IDs (a sketch follows this list)
3. **Check entity health** for affected services
4. **Analyze deployment correlation** with timestamps
5. **Search for similar errors** across other pods/services

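For step 2, a hedged sketch of a span lookup using a trace ID copied from an error log. The span table may name fields differently from the log table (for example `trace.id` rather than `trace_id`), and the `toUid` conversion is an assumption, so verify the exact field names with `verify_dql` or a `| limit 1` query:

```dql
fetch spans, from:now() - 1h
| filter trace.id == toUid("your-trace-id")   // trace ID taken from the error log; field name and conversion assumed
| sort timestamp asc
| limit 20
```
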
### Log-Driven Root Cause Analysis Process

1. **Start with problem timeframe** from Davis AI analysis
2. **Filter logs to error level** during that period
3. **Identify error patterns** and affected components
4. **Trace correlation** using trace/span IDs
5. **Business logic validation** through error message analysis
6. **Infrastructure correlation** with deployment events

## Integration Notes

- **Always verify DQL syntax** with `verify_dql` before execution
- **Use MCP entity tools** to get precise entity IDs for log filtering
- **Combine with spans** for complete transaction analysis
- **Correlate with metrics** for performance context
- **Reference problem events** for incident context
- **Consider log sampling** for high-volume environments
