| 1 | +# Dynatrace Log Analysis with DQL |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +This guide covers comprehensive log analysis using DQL (Dynatrace Query Language) for troubleshooting, debugging, and root cause analysis. Logs are crucial for understanding application behavior, especially during incidents and deployments. |
| 6 | + |
| 7 | +## Core DQL Pattern for Logs |
| 8 | + |
| 9 | +### Basic Log Query Structure |
| 10 | + |
| 11 | +```dql |
| 12 | +fetch logs, from:now() - 2h |
| 13 | +| filter loglevel == "ERROR" or loglevel == "WARN" |
| 14 | +| fields timestamp, content, loglevel, k8s.pod.name, k8s.namespace.name |
| 15 | +| sort timestamp desc |
| 16 | +| limit 20 |
| 17 | +``` |
| 18 | + |
| 19 | +## Available Data Points |
| 20 | + |
| 21 | +### Primary Fields |
| 22 | + |
| 23 | +- **content** - The actual log message content |
| 24 | +- **loglevel** - Log level (ERROR, WARN, INFO, DEBUG, TRACE) |
| 25 | +- **timestamp** - When the log entry was created |
| 26 | +- **message** - Structured message field (alternative to content) |
| 27 | +- **log.source** - Source of the log (e.g., "Container Output") |
| 28 | + |
| 29 | +### Kubernetes Context Fields |
| 30 | + |
| 31 | +- **k8s.pod.name** - Pod name generating the log |
| 32 | +- **k8s.namespace.name** - Kubernetes namespace |
| 33 | +- **k8s.container.name** - Container name within the pod |
| 34 | +- **k8s.cluster.name** - Kubernetes cluster name |
| 35 | +- **k8s.node.name** - Node where pod is running |
| 36 | +- **k8s.workload.name** - Workload (deployment/statefulset) name |
| 37 | + |
| 38 | +### Dynatrace Entity Fields |
| 39 | + |
| 40 | +- **dt.entity.service** - Service entity ID |
| 41 | +- **dt.entity.process_group_instance** - Process group instance entity ID |
| 42 | +- **dt.entity.host** - Host entity ID |
| 43 | +- **dt.entity.kubernetes_cluster** - Kubernetes cluster entity ID |
| 44 | + |
| 45 | +### Trace Correlation Fields |
| 46 | + |
| 47 | +- **trace_id** - Distributed trace ID |
| 48 | +- **span_id** - Span ID within the trace |
| 49 | +- **dt.trace_id** - Dynatrace trace ID |
| 50 | +- **dt.span_id** - Dynatrace span ID |
| 51 | + |
| 52 | +### Error Context Fields |
| 53 | + |
| 54 | +- **exception.message** - Exception message text |
| 55 | +- **exception.type** - Exception type/class |
| 56 | +- **exception.stack_trace** - Full stack trace |
| 57 | +- **status** - Log status (often mirrors loglevel) |
| 58 | + |
| 59 | +## Common Query Patterns |
| 60 | + |
| 61 | +### 1. Error Logs from Specific Service |
| 62 | + |
| 63 | +```dql |
| 64 | +fetch logs, from:now() - 4h |
| 65 | +| filter dt.entity.service == "SERVICE-YOUR-ID" |
| 66 | +| filter loglevel == "ERROR" |
| 67 | +| fields timestamp, content, exception.message, trace_id |
| 68 | +| sort timestamp desc |
| 69 | +| limit 15 |
| 70 | +``` |
| 71 | + |
| 72 | +### 2. Application Errors by Pod |
| 73 | + |
| 74 | +```dql |
| 75 | +fetch logs, from:now() - 2h |
| 76 | +| filter matchesPhrase(k8s.pod.name, "payment") |
| 77 | +| filter loglevel == "ERROR" or matchesPhrase(content, "error") or matchesPhrase(content, "exception") |
| 78 | +| fields timestamp, content, k8s.pod.name, k8s.container.name |
| 79 | +| sort timestamp desc |
| 80 | +| limit 20 |
| 81 | +``` |
| 82 | + |
| 83 | +### 3. Logs with Stack Traces |
| 84 | + |
| 85 | +```dql |
| 86 | +fetch logs, from:now() - 6h |
| 87 | +| filter exception.stack_trace != "" |
| 88 | +| fields timestamp, exception.message, exception.type, exception.stack_trace, k8s.pod.name |
| 89 | +| sort timestamp desc |
| 90 | +| limit 10 |
| 91 | +``` |
| 92 | + |
| 93 | +### 4. Deployment-Related Logs |
| 94 | + |
| 95 | +```dql |
| 96 | +fetch logs, from:now() - 1h |
| 97 | +| filter matchesPhrase(content, "deployment") or matchesPhrase(content, "restart") or matchesPhrase(content, "startup") |
| 98 | +| fields timestamp, content, k8s.pod.name, k8s.namespace.name |
| 99 | +| sort timestamp desc |
| 100 | +| limit 25 |
| 101 | +``` |
| 102 | + |
| 103 | +### 5. High-Frequency Error Analysis |
| 104 | + |
| 105 | +```dql |
| 106 | +fetch logs, from:now() - 2h |
| 107 | +| filter loglevel == "ERROR" |
| 108 | +| fields timestamp, exception.message, k8s.pod.name |
| 109 | +| summarize error_count = count(), by: {exception.message, k8s.pod.name} |
| 110 | +| sort error_count desc |
| 111 | +| limit 15 |
| 112 | +``` |
| 113 | + |
| 114 | +### 6. Trace-Correlated Logs |
| 115 | + |
| 116 | +```dql |
| 117 | +fetch logs, from:now() - 1h |
| 118 | +| filter trace_id == "your-trace-id" |
| 119 | +| fields timestamp, content, span_id, k8s.pod.name |
| 120 | +| sort timestamp asc |
| 121 | +``` |
| 122 | + |
| 123 | +## Business Logic Error Detection |
| 124 | + |
| 125 | +### Credit Card Processing Errors (Real Example) |
| 126 | + |
| 127 | +```dql |
| 128 | +fetch logs, from:now() - 4h |
| 129 | +| filter matchesPhrase(content, "credit card") or matchesPhrase(content, "payment") |
| 130 | +| filter loglevel == "WARN" or loglevel == "ERROR" |
| 131 | +| fields timestamp, content, exception.message, k8s.pod.name |
| 132 | +| sort timestamp desc |
| 133 | +| limit 20 |
| 134 | +``` |
| 135 | + |
| 136 | +### Payment Service Specific Analysis |
| 137 | + |
| 138 | +```dql |
| 139 | +fetch logs, from:now() - 4h |
| 140 | +| filter matchesPhrase(k8s.pod.name, "payment") and matchesPhrase(content, "American Express") |
| 141 | +| fields timestamp, content, exception.message, trace_id |
| 142 | +| sort timestamp desc |
| 143 | +| limit 10 |
| 144 | +``` |
| 145 | + |
| 146 | +## Integration with Problem Analysis |
| 147 | + |
| 148 | +### Correlating Logs with Problems |
| 149 | + |
| 150 | +```dql |
| 151 | +fetch logs, from:now() - 2h |
| 152 | +| filter loglevel == "ERROR" or loglevel == "WARN" |
| 153 | +| filter matchesPhrase(k8s.namespace.name, "astroshop") |
| 154 | +| fields timestamp, content, k8s.pod.name, dt.entity.service |
| 155 | +| sort timestamp desc |
| 156 | +| limit 30 |
| 157 | +``` |
| 158 | + |
| 159 | +### Problem Timeline Correlation |
| 160 | + |
| 161 | +When analyzing a specific problem timeframe (e.g., 11:54 AM - 12:29 PM): |
| 162 | + |
| 163 | +```dql |
| 164 | +fetch logs, from:"2025-07-24T11:54:00Z", to:"2025-07-24T12:29:00Z" |
| 165 | +| filter matchesPhrase(k8s.pod.name, "payment-6977fffc7-2r2hb") |
| 166 | +| filter loglevel == "WARN" or loglevel == "ERROR" |
| 167 | +| fields timestamp, content, exception.message |
| 168 | +| sort timestamp desc |
| 169 | +| limit 20 |
| 170 | +``` |
| 171 | + |
| 172 | +## Advanced Analysis Patterns |
| 173 | + |
| 174 | +### Log Rate Analysis |
| 175 | + |
| 176 | +```dql |
| 177 | +fetch logs, from:now() - 2h |
| 178 | +| filter loglevel == "ERROR" |
| 179 | +| fieldsAdd time_bucket = bin(timestamp, 5m) |
| 180 | +| summarize log_count = count(), by: {time_bucket, k8s.pod.name} |
| 181 | +| sort time_bucket desc |
| 182 | +``` |
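|  | + |
|  | +As an alternative sketch, the same 5-minute bucketing can be expressed with `makeTimeseries`, which returns one series per pod that is ready for charting. The `error_count` name is illustrative, and the command is assumed to be available for log records in your environment: |
|  | + |
|  | +```dql |
|  | +fetch logs, from:now() - 2h |
|  | +| filter loglevel == "ERROR" |
|  | +| makeTimeseries error_count = count(), by: {k8s.pod.name}, interval: 5m |
|  | +``` |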
| 183 | + |
| 184 | +### Multi-Service Error Correlation |
| 185 | + |
| 186 | +```dql |
| 187 | +fetch logs, from:now() - 2h |
| 188 | +| filter loglevel == "ERROR" |
| 189 | +| filter matchesPhrase(k8s.namespace.name, "astroshop") |
| 190 | +| summarize error_count = count(), latest_error = max(timestamp), by: {k8s.pod.name, exception.message} |
| 191 | +| sort error_count desc |
| 192 | +``` |
| 193 | + |
| 194 | +### Performance Issue Detection |
| 195 | + |
| 196 | +```dql |
| 197 | +fetch logs, from:now() - 1h |
| 198 | +| filter matchesPhrase(content, "timeout") or matchesPhrase(content, "slow") or matchesPhrase(content, "performance") |
| 199 | +| fields timestamp, content, k8s.pod.name, k8s.container.name |
| 200 | +| sort timestamp desc |
| 201 | +| limit 15 |
| 202 | +``` |
| 203 | + |
| 204 | +## String Matching Best Practices |
| 205 | + |
| 206 | +### ✅ Correct String Operations for Logs |
| 207 | + |
| 208 | +```dql |
| 209 | +| filter matchesPhrase(content, "payment") // Text search in content |
| 210 | +| filter loglevel == "ERROR" // Exact level match |
| 211 | +| filter startsWith(k8s.pod.name, "payment-") // Pod prefix match |
| 212 | +| filter endsWith(exception.type, "Exception") // Exception type suffix |
| 213 | +``` |
| 214 | + |
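| 215 | +### ❌ Operations to Avoid |
| 216 | + |
| 217 | +```dql |
| 218 | +| filter contains(content, "error") // Valid DQL substring match, but matchesPhrase is the preferred choice for log content |
| 219 | +| filter content like "%payment%" // NOT supported: SQL-style infix LIKE is not DQL syntax |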
| 220 | +``` |
| 221 | + |
| 222 | +### Content Search Techniques |
| 223 | + |
| 224 | +```dql |
| 225 | +// Multiple keyword search |
| 226 | +| filter matchesPhrase(content, "error") or matchesPhrase(content, "exception") or matchesPhrase(content, "failed") |
| 227 | + |
| 228 | +// Case variations are not needed: matchesPhrase is case-insensitive |
| 229 | +| filter matchesPhrase(content, "error") |
| 230 | + |
| 231 | +// Specific error patterns |
| 232 | +| filter matchesPhrase(content, "cannot process") or matchesPhrase(content, "validation failed") |
| 233 | +``` |
| 234 | + |
| 235 | +## Real-World Investigation Examples |
| 236 | + |
| 237 | +### Payment Service American Express Bug Investigation |
| 238 | + |
| 239 | +**Context**: 57.95% error rate during deployment |
| 240 | +**Timeline**: 11:54 AM - 12:29 PM |
| 241 | + |
| 242 | +```dql |
| 243 | +fetch logs, from:now() - 4h |
| 244 | +| filter matchesPhrase(k8s.pod.name, "payment") |
| 245 | +| filter loglevel == "WARN" and matchesPhrase(content, "American Express") |
| 246 | +| fields timestamp, content, exception.message, trace_id |
| 247 | +| sort timestamp desc |
| 248 | +| limit 10 |
| 249 | +``` |
| 250 | + |
| 251 | +**Key Findings from Real Data**: |
| 252 | + |
| 253 | +- **Error Location**: `/usr/src/app/charge.js:73:11` |
| 254 | +- **Business Logic Bug**: "Sorry, we cannot process American Express credit cards. Only Visa or Mastercard or American Express are accepted." |
| 255 | +- **Trace Correlation**: Each error has associated trace_id for transaction tracking |
| 256 | +- **Pod Consistency**: All errors from same pod `payment-6977fffc7-2r2hb` |
| 257 | + |
| 258 | +### Deployment Impact Analysis |
| 259 | + |
| 260 | +```dql |
| 261 | +fetch logs, from:now() - 4h |
| 262 | +| filter matchesPhrase(k8s.namespace.name, "astroshop") |
| 263 | +| filter matchesPhrase(content, "ArgoCD") or matchesPhrase(content, "deployment") or matchesPhrase(content, "git#") |
| 264 | +| fields timestamp, content, k8s.pod.name |
| 265 | +| sort timestamp desc |
| 266 | +| limit 15 |
| 267 | +``` |
| 268 | + |
| 269 | +## Structured Log Analysis |
| 270 | + |
| 271 | +### JSON Log Parsing |
| 272 | + |
| 273 | +For structured JSON logs, the content field holds the raw JSON text. A phrase filter on a key name such as `level` is a quick way to find those records: |
| 274 | + |
| 275 | +```dql |
| 276 | +fetch logs, from:now() - 2h |
| 277 | +| filter matchesPhrase(content, "level") |
| 278 | +| fields timestamp, content, k8s.pod.name |
| 279 | +| sort timestamp desc |
| 280 | +| limit 10 |
| 281 | +``` |
| 282 | + |
| 283 | +### Extracting Values from JSON Logs |
| 284 | + |
| 285 | +```dql |
| 286 | +fetch logs, from:now() - 1h |
| 287 | +| filter matchesPhrase(content, "\"level\":\"warn\"") |
| 288 | +| fields timestamp, content, exception.message, k8s.pod.name |
| 289 | +| sort timestamp desc |
| 290 | +``` |
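|  | + |
|  | +To pull individual values out of the JSON rather than only matching on it, the `parse` command can be applied to `content`. A minimal sketch, assuming each matching record's `content` is a single JSON object with `level` and `message` keys (both key names are assumptions about your log format): |
|  | + |
|  | +```dql |
|  | +fetch logs, from:now() - 1h |
|  | +| filter matchesPhrase(content, "\"level\":\"warn\"") |
|  | +| parse content, "JSON:parsed" // parse the whole content field into a record named parsed |
|  | +| fieldsAdd json_level = parsed[level], json_message = parsed[message] // hypothetical keys |
|  | +| fields timestamp, json_level, json_message, k8s.pod.name |
|  | +| sort timestamp desc |
|  | +| limit 10 |
|  | +``` |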
| 291 | + |
| 292 | +## Performance Considerations |
| 293 | + |
| 294 | +### Optimizing Log Queries |
| 295 | + |
| 296 | +1. **Always include a timeframe**: `from:now() - 2h` (avoid overly broad searches) |
| 297 | +2. **Filter early**: Apply restrictive filters first |
| 298 | +3. **Use entity filters**: Filter by specific pods/services when possible |
| 299 | +4. **Limit results**: Always include reasonable limits |
| 300 | +5. **Sort efficiently**: Sort by timestamp desc for recent logs |
| 301 | + |
| 302 | +### Query Timeframe Recommendations |
| 303 | + |
| 304 | +- **Real-time debugging**: 15-30 minutes |
| 305 | +- **Incident investigation**: 2-4 hours |
| 306 | +- **Deployment analysis**: 1-2 hours around deployment time |
| 307 | +- **Pattern analysis**: 24 hours maximum |
| 308 | +- **Historical research**: Use specific time windows, not broad ranges |
| 309 | + |
| 310 | +## Troubleshooting Log Queries |
| 311 | + |
| 312 | +### Common Issues |
| 313 | + |
| 314 | +1. **Empty Results**: Check timeframe, pod names, and filter criteria |
| 315 | +2. **Performance Issues**: Reduce timeframe or add more specific filters |
| 316 | +3. **Missing Logs**: Verify log ingestion and entity IDs |
| 317 | +4. **Field Access**: Use `| limit 1` to explore available fields |
| 318 | + |
| 319 | +### Field Discovery Technique |
| 320 | + |
| 321 | +```dql |
| 322 | +fetch logs, from:now() - 1h |
| 323 | +| filter matchesPhrase(k8s.pod.name, "your-pod-name") |
| 324 | +| limit 1 |
| 325 | +``` |
| 326 | + |
| 327 | +### Debugging Query Performance |
| 328 | + |
| 329 | +```dql |
| 330 | +fetch logs, from:now() - 30m // Shorter timeframe |
| 331 | +| filter k8s.pod.name == "specific-pod-name" // Exact match |
| 332 | +| filter loglevel == "ERROR" // Specific level |
| 333 | +| limit 10 // Small result set |
| 334 | +``` |
| 335 | + |
| 336 | +## Integration with MCP Tools |
| 337 | + |
| 338 | +### Using Logs After Entity Discovery |
| 339 | + |
| 340 | +After finding entities with MCP tools, use their entity IDs in log queries: |
| 341 | + |
| 342 | +```dql |
| 343 | +fetch logs, from:now() - 2h |
| 344 | +| filter dt.entity.service == "SERVICE-BECA49FB15C72B6A" // From MCP entity lookup |
| 345 | +| filter loglevel == "ERROR" |
| 346 | +| fields timestamp, content, exception.message |
| 347 | +| sort timestamp desc |
| 348 | +| limit 15 |
| 349 | +``` |
| 350 | + |
| 351 | +### Cross-Referencing with Problems |
| 352 | + |
| 353 | +1. **Find problems** with `fetch events | filter event.kind == "DAVIS_PROBLEM"` |
| 354 | +2. **Extract affected entities** from problem results |
| 355 | +3. **Query logs** for those specific entities during problem timeframe |
| 356 | +4. **Correlate trace IDs** between problems and logs |
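|  | + |
|  | +Steps 1 and 2 might look like the sketch below. The selected field names (`display_id`, `event.name`, `event.status`, `affected_entity_ids`) are assumptions about the problem event schema; confirm them with `| limit 1` before relying on them: |
|  | + |
|  | +```dql |
|  | +fetch events, from:now() - 24h |
|  | +| filter event.kind == "DAVIS_PROBLEM" |
|  | +| fields timestamp, display_id, event.name, event.status, affected_entity_ids |
|  | +| sort timestamp desc |
|  | +| limit 10 |
|  | +``` |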
| 357 | + |
| 358 | +## Follow-up Analysis Workflows |
| 359 | + |
| 360 | +### After Finding Error Logs |
| 361 | + |
| 362 | +1. **Extract stack traces** for deeper code analysis |
| 363 | +2. **Find related spans** using trace IDs |
| 364 | +3. **Check entity health** for affected services |
| 365 | +4. **Analyze deployment correlation** with timestamps |
| 366 | +5. **Search for similar errors** across other pods/services |
| 367 | + |
| 368 | +### Log-Driven Root Cause Analysis Process |
| 369 | + |
| 370 | +1. **Start with problem timeframe** from Davis AI analysis |
| 371 | +2. **Filter logs to error level** during that period |
| 372 | +3. **Identify error patterns** and affected components |
| 373 | +4. **Trace correlation** using trace/span IDs |
| 374 | +5. **Business logic validation** through error message analysis |
| 375 | +6. **Infrastructure correlation** with deployment events |
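|  | + |
|  | +Steps 1 to 4 can be sketched as a single query. The time window reuses the example problem timeframe from this guide (replace it with the window reported for your problem), and `first_seen`, `last_seen`, and `sample_trace` are illustrative names: |
|  | + |
|  | +```dql |
|  | +fetch logs, from:"2025-07-24T11:54:00Z", to:"2025-07-24T12:29:00Z" |
|  | +| filter loglevel == "ERROR" |
|  | +| summarize error_count = count(), first_seen = min(timestamp), last_seen = max(timestamp), sample_trace = takeAny(trace_id), by: {k8s.pod.name, exception.message} |
|  | +| sort error_count desc |
|  | +| limit 20 |
|  | +``` |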
| 376 | + |
| 377 | +## Integration Notes |
| 378 | + |
| 379 | +- **Always verify DQL syntax** with `verify_dql` before execution |
| 380 | +- **Use MCP entity tools** to get precise entity IDs for log filtering |
| 381 | +- **Combine with spans** for complete transaction analysis |
| 382 | +- **Correlate with metrics** for performance context |
| 383 | +- **Reference problem events** for incident context |
| 384 | +- **Consider log sampling** for high-volume environments |