Merged
Set severity to NodeCPUHighUsage to info
Signed-off-by: Vitaly Zhuravlev <[email protected]>
v-zhuravlev committed Jun 29, 2023
commit b7dfb32bfc1e20bf8c7493427ac085d550589c7e
16 changes: 9 additions & 7 deletions docs/node-mixin/alerts/alerts.libsonnet
@@ -312,15 +312,17 @@
{
alert: 'NodeCPUHighUsage',
Member

High CPU usage is not a problem in itself and can just be an indicator of properly utilizing your machine, so I'd remove these.

Contributor Author

Perhaps, as long as we can alert on high system load (saturation).

Member

Yeah, CPU usage is good. :) I mean, this would be a case for the "info" level alerts that I like to promote, but I don't think we have them here in the mixin.

(Info level alerts notify nobody, but you could look at the alerts page while troubleshooting. They point to things that are not problems per se and might be OK, but which you might be interested in while there is an actual incident happening.)

Member

Yeah, I'd be fine with an 'info' level severity. No reason not to just introduce that now that we're on this.

Contributor Author

I'll make it an info according to this guideline:

info for alerts that do not require any action by itself but mark something as “out of the ordinary”. Those alerts aren’t usually routed anywhere, but can be inspected during troubleshooting.
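
One way to get the "not routed anywhere" behavior this guideline describes is an Alertmanager route that sends `severity="info"` alerts to a receiver with no notification integrations. The sketch below is an illustration under stated assumptions, not part of this PR or of the mixin: the receiver names `team-pager` and `null` are hypothetical, and the config is written as Jsonnet that renders to an Alertmanager-style YAML document.

```jsonnet
// Hypothetical Alertmanager routing sketch (not part of the node mixin).
local alertmanagerConfig = {
  route: {
    receiver: 'team-pager',  // hypothetical default receiver
    routes: [
      {
        // Info alerts stay visible on the alerts page but notify nobody.
        matchers: ['severity="info"'],
        receiver: 'null',
      },
    ],
  },
  receivers: [
    { name: 'team-pager' },  // integrations (email, webhook, ...) omitted here
    { name: 'null' },  // no integrations configured, so nothing is sent
  ],
};

// Render the object as a YAML document string.
std.manifestYamlDoc(alertmanagerConfig)
```

Evaluating this with `jsonnet -S` prints the YAML as plain text. Whether info alerts should be dropped at routing time or simply left unmatched is a deployment choice that this PR does not prescribe.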

@jcpunk May 5, 2023

Perhaps a warning if the usage stays above 98% for 1h would be viable? That would be a case where the host is at capacity and scheduling more tasks there would result in performance degradation. It is a risk folks can accept but something that should be considered as part of the capacity plan.
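
The suggestion above could be sketched as a separate rule that reuses the same expression with a higher threshold and a longer `for` duration. This is only an illustration: the alert name `NodeCPUSustainedHighUsage`, the config key `cpuSustainedHighUsageThreshold`, and the `job="node"` selector are hypothetical and do not exist in the mixin.

```jsonnet
// Sketch of a sustained-usage warning; names below are hypothetical.
local config = {
  nodeExporterSelector: 'job="node"',  // assumed selector
  cpuSustainedHighUsageThreshold: 98,  // hypothetical config key
};

{
  alert: 'NodeCPUSustainedHighUsage',  // hypothetical alert name
  expr: |||
    sum without(mode) (avg without (cpu) (rate(node_cpu_seconds_total{%(nodeExporterSelector)s, mode!="idle"}[2m]))) * 100 > %(cpuSustainedHighUsageThreshold)d
  ||| % config,
  // Fire only after the condition has held for a full hour.
  'for': '1h',
  labels: {
    severity: 'warning',
  },
  annotations: {
    summary: 'CPU has been near capacity for an hour.',
    description: |||
      CPU usage at {{ $labels.instance }} has been above %(cpuSustainedHighUsageThreshold)d%% for the last hour, is currently at {{ printf "%%.2f" $value }}%%.
    ||| % config,
  },
}
```

Keeping this as a second rule, rather than raising the severity of `NodeCPUHighUsage`, would leave the info-level signal available during troubleshooting while still flagging hosts that are at capacity.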

expr: |||
- sum without(mode) (avg without (cpu) (rate(node_cpu_seconds_total{%(nodeExporterSelector)s, mode!="idle"}[2m]))) > 0.8
+ sum without(mode) (avg without (cpu) (rate(node_cpu_seconds_total{%(nodeExporterSelector)s, mode!="idle"}[2m]))) * 100 > %(cpuHighUsageThreshold)d
||| % $._config,
'for': '15m',
labels: {
- severity: 'warning',
+ severity: 'info',
},
annotations: {
summary: 'High CPU usage.',
- description: 'CPU usage at {{ $labels.instance }} has been above 80% for the last 15 minutes, is currently at {{ printf "%.2f" $value }}%.',
+ description: |||
+ CPU usage at {{ $labels.instance }} has been above %(cpuHighUsageThreshold)d%% for the last 15 minutes, is currently at {{ printf "%%.2f" $value }}%%.
+ ||| % $._config,
},
},
{
@@ -336,7 +338,7 @@
annotations: {
summary: 'System saturated, load per core is very high.',
description: |||
- System load per core at {{ $labels.instance }} has been above %(systemSaturationPerCoreThreshold)d for the last 15 minutes, is currently at {{ printf "%.2f" $value }}.
+ System load per core at {{ $labels.instance }} has been above %(systemSaturationPerCoreThreshold)d for the last 15 minutes, is currently at {{ printf "%%.2f" $value }}.
This might indicate this instance resources saturation and can cause it becoming unresponsive.
||| % $._config,
},
@@ -353,7 +355,7 @@
annotations: {
summary: 'Memory major page faults are occurring at very high rate.',
description: |||
- Memory major pages are occurring at very high rate at {{ $labels.instance }}, %(memoryMajorPagesFaultsThreshold)d major page faults per second for the last 15 minutes, is currently at {{ printf "%.2f" $value }}.
+ Memory major pages are occurring at very high rate at {{ $labels.instance }}, %(memoryMajorPagesFaultsThreshold)d major page faults per second for the last 15 minutes, is currently at {{ printf "%%.2f" $value }}.
Please check that there is enough memory available at this instance.
||| % $._config,
},
@@ -370,7 +372,7 @@
annotations: {
summary: 'Host is running out of memory.',
description: |||
- Memory is filling up at {{ $labels.instance }}, has been above %(memoryHighUtilizationThreshold)d%% for the last 15 minutes, is currently at {{ printf "%.2f" $value }}%.
+ Memory is filling up at {{ $labels.instance }}, has been above %(memoryHighUtilizationThreshold)d%% for the last 15 minutes, is currently at {{ printf "%%.2f" $value }}%%.
||| % $._config,
},
},
@@ -386,7 +388,7 @@
annotations: {
summary: 'Disk IO queue is high.',
description: |||
- Disk IO queue (aqu-sq) is high on {{ $labels.device }} at {{ $labels.instance }}, has been above %(diskIOSaturationThreshold)d for the last 15 minutes, is currently at {{ printf "%.2f" $value }}.
+ Disk IO queue (aqu-sq) is high on {{ $labels.device }} at {{ $labels.instance }}, has been above %(diskIOSaturationThreshold)d for the last 15 minutes, is currently at {{ printf "%%.2f" $value }}.
This symptom might indicate disk saturation.
||| % $._config,
},
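
The `%%` changes in the hunks above follow from how Jsonnet's `%` operator works: it applies Python-style formatting to the whole string, so any literal percent sign that must survive until Prometheus evaluates the template (such as the one inside the Go template's `printf "%.2f"`) has to be doubled. A minimal sketch, using a stand-in `config` object rather than the mixin's real `$._config`:

```jsonnet
local config = { cpuHighUsageThreshold: 90 };  // stand-in for $._config

{
  // '%(cpuHighUsageThreshold)d' is substituted by Jsonnet at build time.
  // '%%' renders as a single literal '%', which is left for Prometheus
  // to interpret when it expands the annotation template.
  description: 'above %(cpuHighUsageThreshold)d%%, currently at {{ printf "%%.2f" $value }}%%.' % config,
}
```

The rendered `description` is `above 90%, currently at {{ printf "%.2f" $value }}%.`. An unescaped `%.2f` would instead be parsed by Jsonnet as a format directive of its own, which is not what the annotation intends.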
4 changes: 3 additions & 1 deletion docs/node-mixin/config.libsonnet
@@ -43,7 +43,9 @@
// just a warning for K8s nodes.
nodeCriticalSeverity: 'critical',

-
+ // CPU utilization (%) on which to trigger the
+ // 'NodeCPUHighUsage' alert.
+ cpuHighUsageThreshold: 90,
// Load average 1m (per core) on which to trigger the
// 'NodeSystemSaturation' alert.
systemSaturationPerCoreThreshold: 2,
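
Because the threshold now lives in `_config`, a consumer of the mixin can presumably override it without touching the alert definitions. The sketch below uses a small stand-in object in place of the real mixin import, and the `thresholdInUse` field exists only to make the effect visible; both are assumptions for illustration.

```jsonnet
// Stand-in for the mixin; a real setup would import the node-mixin instead.
local mixin = {
  _config+:: {
    cpuHighUsageThreshold: 90,  // default introduced by this commit
  },
  // Illustrative field only: shows which value the rules would see.
  thresholdInUse: $._config.cpuHighUsageThreshold,
};

// Override the default for a particular deployment.
mixin {
  _config+:: {
    cpuHighUsageThreshold: 95,
  },
}
```

Since `$` is late-bound, `thresholdInUse` evaluates to 95 here, and the same mechanism would let the `%(cpuHighUsageThreshold)d` placeholders in the alert expression and description pick up the overridden value.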