feat(core): add cluster health check tool #434

ropatil010 · 2025-11-03T18:05:36Z

Hi team,

Could you please take a look at this PR?

About this PR:
Implements a comprehensive cluster health check tool that examines:

Cluster operators (OpenShift)
Nodes (readiness, schedulability, resource pressure)
Pods (failures, crash loops, image pull errors, high restarts)
Workload controllers (Deployments, StatefulSets, DaemonSets)
Storage (PVC status)
Recent warning events
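
To make the pod checks in the list above concrete, here is a minimal sketch of how a single pod might be classified. It is not the code from this PR: the `PodStatus` type, the `classifyPod` function, and the restart threshold of 5 are hypothetical, and the choice to treat crash loops and image pull errors as critical is an assumption. A real tool would read these fields from the Kubernetes API.

```go
package main

import "fmt"

// PodStatus is a hypothetical, simplified view of a pod. A real tool would
// fill these fields from the Kubernetes API instead.
type PodStatus struct {
	Namespace     string
	Name          string
	Phase         string // e.g. "Running", "Pending", "Failed"
	WaitingReason string // e.g. "CrashLoopBackOff", "ImagePullBackOff"
	Restarts      int
}

// restartThreshold is an assumed cutoff for "high restarts".
const restartThreshold = 5

// classifyPod returns a severity ("critical", "warning", or "healthy")
// and a short reason for a single pod.
func classifyPod(p PodStatus) (severity, reason string) {
	switch {
	case p.Phase == "Failed" || p.Phase == "Pending":
		return "critical", "pod is in " + p.Phase + " phase"
	case p.WaitingReason == "CrashLoopBackOff":
		return "critical", "container is crash looping"
	case p.WaitingReason == "ImagePullBackOff" || p.WaitingReason == "ErrImagePull":
		return "critical", "image pull error"
	case p.Restarts > restartThreshold:
		return "warning", fmt.Sprintf("%d restarts", p.Restarts)
	default:
		return "healthy", ""
	}
}

func main() {
	pods := []PodStatus{
		{Namespace: "demo", Name: "api", Phase: "Running"},
		{Namespace: "demo", Name: "worker", Phase: "Running", Restarts: 12},
		{Namespace: "demo", Name: "job", Phase: "Failed"},
	}
	for _, p := range pods {
		severity, reason := classifyPod(p)
		fmt.Printf("%s/%s: %s %s\n", p.Namespace, p.Name, severity, reason)
	}
}
```

Ordering the cases from most to least severe means a pod that is both failed and frequently restarted is reported once, at its highest severity.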

Features:

Text and JSON output formats
Verbose mode for detailed diagnostics
Configurable event checking
Clear severity indicators (critical/warning/healthy)

ropatil010 · 2025-11-03T18:07:55Z

Sample output (text format):

===============================================
Cluster Health Check Report

Cluster Type: OpenShift
Cluster Version: 4.20.0
Check Time: 2025-11-03T17:33:34Z

Checking Cluster Operators...
✅ All cluster operators healthy (34/34)
Checking Node Health...
✅ All nodes healthy (6)
Checking Pod Health...
❌ CRITICAL: 1 pod(s) in failed/pending state

openshift-xxxx/nodename [Failed]
Checking Workload Controllers...
✅ All deployments healthy
✅ All statefulsets healthy
✅ All daemonsets healthy
Checking Storage...
✅ All PVCs bound
Checking Recent Events...
✅ No recent warning events

===============================================
Summary

Critical Issues: 1
Warnings: 0

❌ Cluster has CRITICAL issues requiring immediate attention

matzew · 2025-11-05T07:35:20Z

pkg/toolsets/core/health_check.go

+	// Fallback: Get server version
+	// Note: This would require access to discovery client which isn't exposed in params
+	// For now, return empty string
+	return "", fmt.Errorf("unable to determine cluster version")


So this only works on OCP (OpenShift)?
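
For context on the question above: a Kubernetes API server serves a `/version` endpoint, which is what a discovery client reads, so a fallback for non-OpenShift clusters is possible in principle. The sketch below uses only the standard library and is not the PR's code. The `serverVersion` and `parseGitVersion` helpers are hypothetical, and it assumes an `http.Client` that already carries the TLS and auth setup the cluster needs. The `http://localhost:8001` address assumes a local `kubectl proxy` on its default port.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// versionInfo mirrors only the field this sketch needs from the
// API server's /version response.
type versionInfo struct {
	GitVersion string `json:"gitVersion"`
}

// parseGitVersion extracts gitVersion from a /version response body.
func parseGitVersion(body []byte) (string, error) {
	var info versionInfo
	if err := json.Unmarshal(body, &info); err != nil {
		return "", fmt.Errorf("decode /version response: %w", err)
	}
	if info.GitVersion == "" {
		return "", fmt.Errorf("unable to determine cluster version")
	}
	return info.GitVersion, nil
}

// serverVersion queries baseURL + "/version" with the given client.
func serverVersion(client *http.Client, baseURL string) (string, error) {
	resp, err := client.Get(baseURL + "/version")
	if err != nil {
		return "", fmt.Errorf("query /version: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("query /version: unexpected status %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", fmt.Errorf("read /version response: %w", err)
	}
	return parseGitVersion(body)
}

func main() {
	// Assumption: `kubectl proxy` is running locally on its default port.
	v, err := serverVersion(http.DefaultClient, "http://localhost:8001")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("Cluster Version:", v)
}
```

In the project itself, the equivalent would more likely go through the discovery client mentioned in the code comment, once it is exposed in the tool parameters.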

manusa · 2025-11-05T07:59:31Z

I'm not really sure about this one; it seems quite opinionated.
Given the raw tools, the LLMs should be able to perform the operations themselves.

In https://github.com/Flux159/mcp-server-kubernetes they're implementing similar functionality via a prompt, which doesn't seem like a bad idea.

Similarly, in https://github.com/GoogleCloudPlatform/kubectl-ai (AFAIU) they're leveraging the system prompt for this purpose: https://github.com/GoogleCloudPlatform/kubectl-ai/blob/main/pkg/agent/systemprompt_template_default.txt

IMO if we want to proceed with this feature we should either:

Reimplement it as a prompt for the core toolset.
If we really want this as a tool, add it to a specific opt-in toolset (diagnostics or something similar).

This is also a good case to test evals and see if they can be used to make a better decision on how to implement this feature.

Signed-off-by: Rohit Patil <[email protected]>

ropatil010 force-pushed the cluster-health-check branch 2 times, most recently from 2e02dc9 to 767c1db on November 4, 2025 05:23

Cali0707 requested a review from manusa November 4, 2025 15:39

matzew reviewed Nov 5, 2025

ropatil010 force-pushed the cluster-health-check branch from 1d70e69 to 5fd5987 on November 5, 2025 15:18

ropatil010 added 3 commits November 5, 2025 20:56

check cluster health

e310902

Signed-off-by: Rohit Patil <[email protected]>

implement with prompt type

37b8e88

Signed-off-by: Rohit Patil <[email protected]>

update issue

afb7d63

Signed-off-by: Rohit Patil <[email protected]>

ropatil010 force-pushed the cluster-health-check branch from 5fd5987 to afb7d63 on November 5, 2025 16:03
