Add MegaCLI collector #18
collector/megacli.go
Fahrenheit, Celsius, Kelvin? ;) Include unit suffix please.
I knew it! :)
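To illustrate the unit-suffix request above, here is a minimal sketch of naming the temperature metric with its unit. The metric name `raid_drive_temperature_celsius`, the `drive` label, and the helper functions are hypothetical, and the sample input line is an assumption about what megacli prints, not taken from this PR.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// tempRE matches an assumed megacli line such as
// "Drive Temperature :36C (96.80 F)". The format is an assumption.
var tempRE = regexp.MustCompile(`^Drive Temperature\s*:\s*(\d+)C`)

// parseCelsius extracts the Celsius reading from one line of output.
// It reports false when the line does not look like a temperature line.
func parseCelsius(line string) (float64, bool) {
	m := tempRE.FindStringSubmatch(line)
	if m == nil {
		return 0, false
	}
	v, err := strconv.ParseFloat(m[1], 64)
	if err != nil {
		return 0, false
	}
	return v, true
}

// formatSample renders one sample as a line of Prometheus text exposition
// output, with a single label.
func formatSample(name, labelName, labelValue string, v float64) string {
	return fmt.Sprintf("%s{%s=%q} %g", name, labelName, labelValue, v)
}

func main() {
	if c, ok := parseCelsius("Drive Temperature :36C (96.80 F)"); ok {
		// The "_celsius" suffix makes the unit explicit in the metric name.
		fmt.Println(formatSample("raid_drive_temperature_celsius", "drive", "0", c))
	}
}
```

With the unit in the name, nobody reading a dashboard or alert rule has to guess between Fahrenheit, Celsius, and Kelvin.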
👍 otherwise, though I admit I didn't look too closely since it seems like quite a specialized collector module :)
@juliusv Well, it's not a beauty: the megacli output is really, really ugly, but it's the only way to get RAID stats for the most common hardware RAID controllers. These are the same RAID controllers you guys are using, by the way, so it could come in handy for you as well.
collector/megacli.go
You want CounterOpts. (Sorry, other way round, updated my previous comment.)
CounterOpts instead of GaugeOpts... (didn't it get my update?)
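The distinction behind this comment can be sketched without any library: a counter only ever goes up (for example, accumulated drive errors), while a gauge may move in either direction (for example, a temperature). The `Counter` and `Gauge` types below are hypothetical stand-ins written for illustration; they are not the client library's types, and the real `CounterOpts`/`GaugeOpts` structs are only the option sets used to construct each kind of metric.

```go
package main

import (
	"errors"
	"fmt"
)

// Counter is a hypothetical stand-in for a monotonically increasing metric.
type Counter struct{ value float64 }

// Add increases the counter and rejects negative deltas, since a counter
// must never decrease.
func (c *Counter) Add(delta float64) error {
	if delta < 0 {
		return errors.New("counter cannot decrease")
	}
	c.value += delta
	return nil
}

// Gauge is a hypothetical stand-in for a metric that can rise and fall.
type Gauge struct{ value float64 }

// Set overwrites the gauge with the latest observed value.
func (g *Gauge) Set(v float64) { g.value = v }

func main() {
	var mediaErrors Counter // error and event counts only grow
	var temperature Gauge   // a temperature can go up or down

	_ = mediaErrors.Add(3)
	if err := mediaErrors.Add(-1); err != nil {
		fmt.Println("rejected:", err)
	}

	temperature.Set(36)
	temperature.Set(34)

	fmt.Println(mediaErrors.value, temperature.value)
}
```

Choosing the counter type for error and event counts lets queries apply rate-style functions to them, which would be misleading on a gauge.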
@juliusv is confident this is good to go. I just discovered the small inconsistency above.
This collector exports the following metrics:
- raid_drive_temperature: drive temperature
- raid_drive_count: drive error and event counters
- raid_adapter_disk_presence: disk presence per adapter
👍
…r-promu Install promu package for OCP multistage builds
* Add mountpoint to NodeFilesystem alerts This helps to identify alerting filesystem. Signed-off-by: Vitaly Zhuravlev <[email protected]> * Decrease NodeFilesystem pending time to 15m 30m is too long and there is a risk of running out of disk space/inodes completely if something is filling up disk very fast (like log file). Signed-off-by: Vitaly Zhuravlev <[email protected]> * Add CPU and memory alerts Signed-off-by: Vitaly Zhuravlev <[email protected]> * Add failed systemd service alert Signed-off-by: Vitaly Zhuravlev <[email protected]> * Decrease NodeNetwork*Errs pending period Signed-off-by: Vitaly Zhuravlev <[email protected]> * Set 'at' everywhere as preposition for instance Signed-off-by: Vitaly Zhuravlev <[email protected]> * Add NodeDiskIOSaturation alert Signed-off-by: Vitaly Zhuravlev <[email protected]> * Add %(nodeExporterSelector)s to Network and conntrack alerts Signed-off-by: Vitaly Zhuravlev <[email protected]> * Add diskDevice selector Signed-off-by: Vitaly Zhuravlev <[email protected]> * Fix NodeMemoryHighUtilization alert Signed-off-by: Vitaly Zhuravlev <[email protected]> * Add NodeSystemSaturation and NodeMemoryMajorPagesFaults Signed-off-by: Vitaly Zhuravlev <[email protected]> * Decrease NodeSystemdServiceFailed severity to warning Signed-off-by: Vitaly Zhuravlev <[email protected]> * Extend alert description Signed-off-by: Vitaly Zhuravlev <[email protected]> * Add comma after 'mounted on' Signed-off-by: Vitaly Zhuravlev <[email protected]> * Add thresholds for memory alerts Signed-off-by: Vitaly Zhuravlev <[email protected]> * Add thresholds for memory, disk and system alerts Signed-off-by: Vitaly Zhuravlev <[email protected]> * Set severity to NodeCPUHighUsage to info Signed-off-by: Vitaly Zhuravlev <[email protected]> * Convert graph panels to timeseries panel ...With default style (opacity, tooltip etc). 
Also: Change 'logical core' line style to dotted Update Disk I/O time metric to dots Signed-off-by: Vitaly Zhuravlev <[email protected]> * Move dashboard paramaters to config Signed-off-by: Vitaly Zhuravlev <[email protected]> * Lint mixin Signed-off-by: Vitaly Zhuravlev <[email protected]> * Add overview row * Add Cpu Usage stat panel * Add network dash * Improve network dash - Add interfaces overview panel - Add oper status timeline - Add common lib with reused elements (templates, queries) - Add common panels with shared style to be used accross this mixin * Remove external panels lib * Add fleet dashboard * Update fleet dash * Add CPU and memory to fleet * Add common cpu/memory/disk/network panels on fleet * add network errors panel as points * Fix alerts column in fleet table * Add support for multiple group and instance labels * Add sockstat to network dashboard * Add netstat to network dashboard * Change span to gridPod. Make overview row smaller. gridPos supports tiny panels height. * add reboot annotation * Add system dashboard * add filesystem row * Add disk and fs dashboard * Update mixin * make fmt * Add memory dashboard * Add memory generic counters to memory dashboard * Update common lib * Update OOM killer panel Signed-off-by: Vitaly Zhuravlev <[email protected]> * Add common annotations: kernelChange, OOMkill Signed-off-by: Vitaly Zhuravlev <[email protected]> * Add mountpoint to NodeFilesystem alerts This helps to identify alerting filesystem. Signed-off-by: Vitaly Zhuravlev <[email protected]> * Decrease NodeFilesystem pending time to 15m 30m is too long and there is a risk of running out of disk space/inodes completely if something is filling up disk very fast (like log file). 
Signed-off-by: Vitaly Zhuravlev <[email protected]> * Add CPU and memory alerts Signed-off-by: Vitaly Zhuravlev <[email protected]> * Add failed systemd service alert Signed-off-by: Vitaly Zhuravlev <[email protected]> * Decrease NodeNetwork*Errs pending period Signed-off-by: Vitaly Zhuravlev <[email protected]> * Set 'at' everywhere as preposition for instance Signed-off-by: Vitaly Zhuravlev <[email protected]> * Add NodeDiskIOSaturation alert Signed-off-by: Vitaly Zhuravlev <[email protected]> * Add %(nodeExporterSelector)s to Network and conntrack alerts Signed-off-by: Vitaly Zhuravlev <[email protected]> * Add diskDevice selector Signed-off-by: Vitaly Zhuravlev <[email protected]> * Fix NodeMemoryHighUtilization alert Signed-off-by: Vitaly Zhuravlev <[email protected]> * Add NodeSystemSaturation and NodeMemoryMajorPagesFaults Signed-off-by: Vitaly Zhuravlev <[email protected]> * Decrease NodeSystemdServiceFailed severity to warning Signed-off-by: Vitaly Zhuravlev <[email protected]> * Remove unused import * Add ability to set custom dashboardUID Required when multiple mixins are loaded based on node-mixin Signed-off-by: Vitaly Zhuravlev <[email protected]> * Add mountpoint to NodeFilesystem alerts This helps to identify alerting filesystem. Signed-off-by: Vitaly Zhuravlev <[email protected]> * Add failed systemd service alert Signed-off-by: Vitaly Zhuravlev <[email protected]> * Set 'at' everywhere as preposition for instance Signed-off-by: Vitaly Zhuravlev <[email protected]> * Add NodeDiskIOSaturation alert Signed-off-by: Vitaly Zhuravlev <[email protected]> * Add diskDevice selector Signed-off-by: Vitaly Zhuravlev <[email protected]> * Fix OOMkill panel Signed-off-by: Vitaly Zhuravlev <[email protected]> * Remove systemd panel systemd collector is disabled by default Signed-off-by: Vitaly Zhuravlev <[email protected]> * Add some lint exclusions. Add UIDs to all dashboards. Add units and descriptions to all panels which were missing them. 
Modify alerts descriptions and summaries as needed for linting. Signed-off-by: Ryan J. Geyer <[email protected]> * Add multi-cluster dashboard lint exclusions * Extend alert description Signed-off-by: Vitaly Zhuravlev <[email protected]> * Add comma after 'mounted on' Signed-off-by: Vitaly Zhuravlev <[email protected]> * Add thresholds for memory alerts Signed-off-by: Vitaly Zhuravlev <[email protected]> * Add thresholds for memory, disk and system alerts Signed-off-by: Vitaly Zhuravlev <[email protected]> * Set severity to NodeCPUHighUsage to info Signed-off-by: Vitaly Zhuravlev <[email protected]> * Fix broken diskSpaceUsage link * Fix cpuIdle panel units * Change cpuUsage to use $__rate_interval * Fix cpu usage (replace with nodeQuerySelector) * Fix units (seconds->s) * Fix iops units * Add %(nodeQuerySelector)s to alerts queries * Remove trailing space * Add support for multi in job * Fix Pagesout metric * Add memory desciptions * Add total and available memory metrics * Update context switches description * Add network descriptions * Change pipe to | from / in AxisLabel * Update changes * Remove , in dashboards.jsonnet * Remove code comments * Update network descriptions * Add timezone metric * Add disk description --------- Signed-off-by: Vitaly Zhuravlev <[email protected]> Signed-off-by: Ryan J. Geyer <[email protected]>
Signed-off-by: dislbenn <[email protected]> Signed-off-by: dislbenn <[email protected]>
This collector exports the following metrics:
I still have to see if everything is working as expected, but feel free to review already :)