@discordianfish
Member

This collector exports the following metrics:

  • raid_drive_temperature: drive temperature
  • raid_drive_count: drive error and event counters
  • raid_adapter_disk_presence: disk presence per adapter

I still have to see if everything is working as expected, but feel free to review already :)
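For reference, this is roughly what the three metrics look like on the wire. It is a minimal sketch of the Prometheus text exposition format, not the collector's code: client_golang normally handles this encoding, and the label names `adapter`, `slot` and `type` (and the `media_error` value) are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sample is one labelled value within a metric family.
type sample struct {
	labels map[string]string
	value  float64
}

// renderFamily formats one metric family in the Prometheus text exposition
// format: a HELP line, a TYPE line, then one line per sample. Label values are
// quoted with %q, which matches the format's escaping for plain ASCII values.
func renderFamily(name, help, typ string, samples []sample) string {
	var b strings.Builder
	fmt.Fprintf(&b, "# HELP %s %s\n", name, help)
	fmt.Fprintf(&b, "# TYPE %s %s\n", name, typ)
	for _, s := range samples {
		keys := make([]string, 0, len(s.labels))
		for k := range s.labels {
			keys = append(keys, k)
		}
		sort.Strings(keys) // stable label order for readable, diffable output
		pairs := make([]string, 0, len(keys))
		for _, k := range keys {
			pairs = append(pairs, fmt.Sprintf("%s=%q", k, s.labels[k]))
		}
		fmt.Fprintf(&b, "%s{%s} %g\n", name, strings.Join(pairs, ","), s.value)
	}
	return b.String()
}

func main() {
	// Hypothetical labels: error counters are typed as a counter,
	// disk presence (like temperature) as a gauge.
	fmt.Print(renderFamily("raid_drive_count", "Drive error and event counters.", "counter",
		[]sample{{labels: map[string]string{"adapter": "0", "slot": "1", "type": "media_error"}, value: 0}}))
	fmt.Print(renderFamily("raid_adapter_disk_presence", "Disk presence per adapter.", "gauge",
		[]sample{{labels: map[string]string{"adapter": "0"}, value: 1}}))
}
```

The TYPE line is what tells Prometheus whether a series may only go up (counter) or may move freely (gauge).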

Member

Fahrenheit, Celsius, Kelvin? ;) Please include a unit suffix in the metric name.
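A minimal sketch of the unit handling asked for here. It assumes megacli reports temperature in a line like `Drive Temperature :33C (91.40 F)` (the exact layout may differ between megacli versions), and the name `raid_drive_temperature_celsius` and the helper `parseCelsius` are hypothetical. Prometheus naming conventions put the base unit (Celsius for temperature) in the metric name as a suffix.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// temperatureMetric carries its unit in the name (hypothetical name).
const temperatureMetric = "raid_drive_temperature_celsius"

// parseCelsius extracts the Celsius reading from a field like
// "33C (91.40 F)". This field layout is an assumption about megacli output.
func parseCelsius(field string) (float64, error) {
	field = strings.TrimSpace(field)
	i := strings.IndexByte(field, 'C')
	if i <= 0 {
		return 0, fmt.Errorf("no Celsius reading in %q", field)
	}
	return strconv.ParseFloat(strings.TrimSpace(field[:i]), 64)
}

func main() {
	line := "Drive Temperature :33C (91.40 F)" // assumed megacli line
	_, field, found := strings.Cut(line, ":")
	if !found {
		panic("no key/value separator")
	}
	c, err := parseCelsius(field)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s %g\n", temperatureMetric, c) // raid_drive_temperature_celsius 33
}
```

Parsing the Celsius part and discarding the Fahrenheit one keeps a single, unambiguous unit per metric.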

Member Author

I knew it! :)

@juliusv
Member

juliusv commented Jul 8, 2014

👍 otherwise, though I admit I didn't look too closely since it seems like quite a specialized collector module :)

@discordianfish
Member Author

@juliusv Well, it's not a beauty: the megacli output is really, really ugly, but it's the only way to get RAID stats for the most common hardware RAID controllers. They're the same RAID controllers you guys are using, btw, so this could come in handy for you as well.
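Since the megacli output is loosely structured `Key: Value` text, one way to tame it is to split it into one map per physical drive. This is a minimal sketch, not the collector's implementation: it assumes each drive section in `-PDList`-style output starts with an `Enclosure Device ID` line, and the sample output in `main` is made up for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseDrives splits megacli-style "Key: Value" output into one map per
// drive. Assumption: every drive section begins with "Enclosure Device ID".
// Keys seen before the first drive section are ignored.
func parseDrives(output string) ([]map[string]string, error) {
	var drives []map[string]string
	var cur map[string]string
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		key, value, ok := strings.Cut(sc.Text(), ":")
		if !ok {
			continue // blank lines and banners like "Adapter #0" have no pair
		}
		// Trim both sides: some keys are written as "Drive Temperature :".
		key, value = strings.TrimSpace(key), strings.TrimSpace(value)
		if key == "Enclosure Device ID" {
			cur = map[string]string{}
			drives = append(drives, cur)
		}
		if cur != nil {
			cur[key] = value
		}
	}
	return drives, sc.Err()
}

func main() {
	// Made-up sample output for illustration only.
	out := `Adapter #0

Enclosure Device ID: 32
Slot Number: 0
Media Error Count: 0
Drive Temperature :33C (91.40 F)

Enclosure Device ID: 32
Slot Number: 1
Media Error Count: 2
Drive Temperature :35C (95.00 F)
`
	drives, err := parseDrives(out)
	if err != nil {
		panic(err)
	}
	for _, d := range drives {
		fmt.Printf("slot=%s media_errors=%s temp=%s\n",
			d["Slot Number"], d["Media Error Count"], d["Drive Temperature"])
	}
}
```

Cutting at the first colon only keeps values that themselves contain colons intact.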

Member

You want CounterOpts. (Sorry, other way round, updated my previous comment.)

Member

CounterOpts instead of GaugeOpts... (didn't it get my update?)
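Why the type matters: `rate()` and `increase()` in Prometheus assume monotonically increasing input and interpret any drop as a counter reset (for example after a controller or exporter restart), so they only give meaningful results on counters. Below is a simplified sketch of that reset handling; it omits the extrapolation Prometheus applies at window boundaries, and the sample values are hypothetical.

```go
package main

import "fmt"

// increase sums the growth of a counter across successive samples, treating
// any drop in value as a reset to zero. Equivalent to Prometheus's
// "last - first + value before each reset", without extrapolation.
func increase(samples []float64) float64 {
	total := 0.0
	for i := 1; i < len(samples); i++ {
		prev, cur := samples[i-1], samples[i]
		if cur < prev {
			total += cur // reset: counter restarted from zero
		} else {
			total += cur - prev
		}
	}
	return total
}

func main() {
	// Hypothetical media-error counts scraped over time, reset after 5.
	fmt.Println(increase([]float64{3, 4, 5, 1, 2})) // 4
}
```

Applied to a gauge such as temperature, the same logic would misread every legitimate drop as a reset, which is why the error counters should use CounterOpts.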

@beorn7
Member

beorn7 commented Jul 9, 2014

@juliusv is confident this is good to go. I just discovered the small inconsistency above.

@beorn7
Member

beorn7 commented Jul 9, 2014

👍

discordianfish added a commit that referenced this pull request Jul 9, 2014
@discordianfish discordianfish merged commit 50c6691 into master Jul 9, 2014
@discordianfish discordianfish deleted the add-megaraid-metrics branch July 9, 2014 12:56
pgier pushed a commit to pgier/node_exporter that referenced this pull request Jan 15, 2019
…r-promu

Install promu package for OCP multistage builds
v-zhuravlev added a commit to v-zhuravlev/node_exporter that referenced this pull request Apr 14, 2023
* Add mountpoint to NodeFilesystem alerts

This helps to identify alerting filesystem.

Signed-off-by: Vitaly Zhuravlev <[email protected]>

* Decrease NodeFilesystem pending time to 15m

30m is too long and there is a risk of running out of disk space/inodes completely if something is filling up disk very fast (like log file).

Signed-off-by: Vitaly Zhuravlev <[email protected]>

* Add CPU and memory alerts

Signed-off-by: Vitaly Zhuravlev <[email protected]>

* Add failed systemd service alert

Signed-off-by: Vitaly Zhuravlev <[email protected]>

* Decrease NodeNetwork*Errs pending period

Signed-off-by: Vitaly Zhuravlev <[email protected]>

* Set 'at' everywhere as preposition for instance

Signed-off-by: Vitaly Zhuravlev <[email protected]>

* Add NodeDiskIOSaturation alert

Signed-off-by: Vitaly Zhuravlev <[email protected]>

* Add %(nodeExporterSelector)s to Network and conntrack alerts

Signed-off-by: Vitaly Zhuravlev <[email protected]>

* Add diskDevice selector

Signed-off-by: Vitaly Zhuravlev <[email protected]>

* Fix NodeMemoryHighUtilization alert

Signed-off-by: Vitaly Zhuravlev <[email protected]>

* Add NodeSystemSaturation and NodeMemoryMajorPagesFaults

Signed-off-by: Vitaly Zhuravlev <[email protected]>

* Decrease NodeSystemdServiceFailed severity to warning

Signed-off-by: Vitaly Zhuravlev <[email protected]>

* Extend alert description

Signed-off-by: Vitaly Zhuravlev <[email protected]>

* Add comma after 'mounted on'

Signed-off-by: Vitaly Zhuravlev <[email protected]>

* Add thresholds for memory alerts

Signed-off-by: Vitaly Zhuravlev <[email protected]>

* Add thresholds for memory, disk and system alerts

Signed-off-by: Vitaly Zhuravlev <[email protected]>

* Set severity of NodeCPUHighUsage to info

Signed-off-by: Vitaly Zhuravlev <[email protected]>

* Convert graph panels to timeseries panel

...With default style (opacity, tooltip etc).
Also:
Change 'logical core' line style to dotted
Update Disk I/O time metric to dots

Signed-off-by: Vitaly Zhuravlev <[email protected]>

* Move dashboard parameters to config

Signed-off-by: Vitaly Zhuravlev <[email protected]>

* Lint mixin

Signed-off-by: Vitaly Zhuravlev <[email protected]>

* Add overview row

* Add Cpu Usage stat panel

* Add network dash

* Improve network dash

- Add interfaces overview panel
- Add oper status timeline
- Add common lib with reused elements (templates, queries)
- Add common panels with shared style to be used across this mixin

* Remove external panels lib

* Add fleet dashboard

* Update fleet dash

* Add CPU and memory to fleet

* Add common cpu/memory/disk/network panels on fleet

* add network errors panel as points

* Fix alerts column in fleet table

* Add support for multiple group and instance labels

* Add sockstat to network dashboard

* Add netstat to network dashboard

* Change span to gridPos. Make overview row smaller.

gridPos supports tiny panels height.

* add reboot annotation

* Add system dashboard

* add filesystem row

* Add disk and fs dashboard

* Update mixin

* make fmt

* Add memory dashboard

* Add memory generic counters to memory dashboard

* Update common lib

* Update OOM killer panel

Signed-off-by: Vitaly Zhuravlev <[email protected]>

* Add common annotations: kernelChange, OOMkill

Signed-off-by: Vitaly Zhuravlev <[email protected]>

* Remove unused import

* Add ability to set custom dashboardUID

Required when multiple mixins are loaded based on node-mixin

Signed-off-by: Vitaly Zhuravlev <[email protected]>

* Fix OOMkill panel

Signed-off-by: Vitaly Zhuravlev <[email protected]>

* Remove systemd panel

systemd collector is disabled by default

Signed-off-by: Vitaly Zhuravlev <[email protected]>

* Add some lint exclusions.
Add UIDs to all dashboards.
Add units and descriptions to all panels which were missing them.
Modify alerts descriptions and summaries as needed for linting.

Signed-off-by: Ryan J. Geyer <[email protected]>

* Add multi-cluster dashboard lint exclusions

* Fix broken diskSpaceUsage link

* Fix cpuIdle panel units

* Change cpuUsage to use $__rate_interval

* Fix cpu usage (replace with nodeQuerySelector)

* Fix units (seconds->s)

* Fix iops units

* Add %(nodeQuerySelector)s to alerts queries

* Remove trailing space

* Add support for multi in job

* Fix Pagesout metric

* Add memory descriptions

* Add total and available memory metrics

* Update context switches description

* Add network descriptions

* Change pipe to | from / in AxisLabel

* Update changes

* Remove , in dashboards.jsonnet

* Remove code comments

* Update network descriptions

* Add timezone metric

* Add disk description

---------

Signed-off-by: Vitaly Zhuravlev <[email protected]>
Signed-off-by: Ryan J. Geyer <[email protected]>
philipgough pushed a commit to philipgough/node_exporter that referenced this pull request Jan 8, 2025
4 participants