@ropatil010

Hi Team,

Could you please take a look at this PR?

This PR implements a comprehensive cluster health check tool that examines:

  • Cluster operators (OpenShift)
  • Nodes (readiness, schedulability, resource pressure)
  • Pods (failures, crash loops, image pull errors, high restarts)
  • Workload controllers (Deployments, StatefulSets, DaemonSets)
  • Storage (PVC status)
  • Recent warning events

Features:

  • Text and JSON output formats
  • Verbose mode for detailed diagnostics
  • Configurable event checking
  • Clear severity indicators (critical/warning/healthy)
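
As a rough illustration of how the pod checks and severity indicators listed above could fit together, here is a minimal sketch. It is not the code from this PR: `PodStatus`, `ClassifyPod`, and `restartThreshold` are hypothetical names, the struct is a simplified stand-in for what the Kubernetes API returns, and treating crash loops and image pull errors as critical (and high restarts as a warning) is an assumption.

```go
package main

import "fmt"

// Severity mirrors the three indicators listed above.
type Severity string

const (
	Healthy  Severity = "healthy"
	Warning  Severity = "warning"
	Critical Severity = "critical"
)

// PodStatus is a hypothetical, simplified view of a pod. A real tool
// would populate these fields from the Kubernetes API.
type PodStatus struct {
	Namespace     string
	Name          string
	Phase         string // e.g. "Running", "Pending", "Failed"
	WaitingReason string // e.g. "CrashLoopBackOff", "ImagePullBackOff"
	Restarts      int
}

// restartThreshold is an assumed cut-off for "high restarts".
const restartThreshold = 5

// ClassifyPod maps a pod's status to a severity.
// The mapping itself is an assumption made for illustration.
func ClassifyPod(p PodStatus) Severity {
	switch {
	case p.Phase == "Failed", p.Phase == "Pending":
		return Critical
	case p.WaitingReason == "CrashLoopBackOff",
		p.WaitingReason == "ImagePullBackOff",
		p.WaitingReason == "ErrImagePull":
		return Critical
	case p.Restarts > restartThreshold:
		return Warning
	default:
		return Healthy
	}
}

func main() {
	// Placeholder pods, not real cluster data.
	pods := []PodStatus{
		{Namespace: "demo", Name: "api", Phase: "Running"},
		{Namespace: "demo", Name: "worker", Phase: "Running", Restarts: 12},
		{Namespace: "demo", Name: "job", Phase: "Failed"},
	}
	for _, p := range pods {
		fmt.Printf("%s/%s: %s\n", p.Namespace, p.Name, ClassifyPod(p))
	}
}
```

Keeping the classification in a pure function like this makes the thresholds easy to test without a live cluster.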

@ropatil010
Author

Output format:

===============================================
Cluster Health Check Report

Cluster Type: OpenShift
Cluster Version: 4.20.0
Check Time: 2025-11-03T17:33:34Z

Checking Cluster Operators...
✅ All cluster operators healthy (34/34)
Checking Node Health...
✅ All nodes healthy (6)
Checking Pod Health...
❌ CRITICAL: 1 pod(s) in failed/pending state

  • openshift-xxxx/nodename [Failed]

Checking Workload Controllers...
✅ All deployments healthy
✅ All statefulsets healthy
✅ All daemonsets healthy
Checking Storage...
✅ All PVCs bound
Checking Recent Events...
✅ No recent warning events

===============================================
Summary

Critical Issues: 1
Warnings: 0

❌ Cluster has CRITICAL issues requiring immediate attention
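
The summary section of the report above can be derived from the collected issues. The following is only a sketch of that idea, covering both the text and JSON output formats: the `Report` type, its field names, and the JSON keys are hypothetical, and the verdict wording for the warning-only and healthy cases is assumed (only the critical line appears in the sample output).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Report is a hypothetical shape for the health check result.
type Report struct {
	ClusterType    string   `json:"clusterType"`
	ClusterVersion string   `json:"clusterVersion"`
	Critical       []string `json:"critical"`
	Warnings       []string `json:"warnings"`
}

// Verdict derives the closing line from the issue counts.
func (r Report) Verdict() string {
	switch {
	case len(r.Critical) > 0:
		return "Cluster has CRITICAL issues requiring immediate attention"
	case len(r.Warnings) > 0:
		return "Cluster has warnings" // assumed wording
	default:
		return "Cluster is healthy" // assumed wording
	}
}

// Render returns the summary as indented JSON when format is "json",
// and as plain text otherwise.
func (r Report) Render(format string) (string, error) {
	if format == "json" {
		b, err := json.MarshalIndent(r, "", "  ")
		if err != nil {
			return "", err
		}
		return string(b), nil
	}
	return fmt.Sprintf("Critical Issues: %d\nWarnings: %d\n\n%s\n",
		len(r.Critical), len(r.Warnings), r.Verdict()), nil
}

func main() {
	r := Report{
		ClusterType:    "OpenShift",
		ClusterVersion: "4.20.0",
		Critical:       []string{"openshift-xxxx/nodename [Failed]"},
	}
	for _, format := range []string{"text", "json"} {
		out, err := r.Render(format)
		if err != nil {
			fmt.Println("render error:", err)
			return
		}
		fmt.Println(out)
	}
}
```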

@ropatil010 force-pushed the cluster-health-check branch 2 times, most recently from 2e02dc9 to 767c1db on November 4, 2025 05:23
@Cali0707 requested a review from manusa on November 4, 2025 15:39
// Fallback: get the server version.
// Note: this would require access to the discovery client, which isn't exposed in params.
// For now, return an empty string and an error.
return "", fmt.Errorf("unable to determine cluster version")
Collaborator

So this only works on OCP (OpenShift)?
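
For context on the question above, one way a non-OpenShift fallback could be structured is to try several version sources in order. This is a sketch under stated assumptions, not the PR's implementation: `versionSource`, `clusterVersion`, `openShiftVersion`, and `serverVersion` are hypothetical names, both sources are stubs with placeholder values, and a real second source would need the discovery client access that the code comment says is not currently exposed in params.

```go
package main

import (
	"errors"
	"fmt"
)

// versionSource is a hypothetical abstraction over anything that can
// report a cluster version.
type versionSource func() (string, error)

// clusterVersion tries each source in order and returns the first
// non-empty version. If none succeeds, it wraps the last error seen.
func clusterVersion(sources ...versionSource) (string, error) {
	lastErr := errors.New("no version source succeeded")
	for _, src := range sources {
		v, err := src()
		if err != nil {
			lastErr = err
			continue
		}
		if v != "" {
			return v, nil
		}
	}
	return "", fmt.Errorf("unable to determine cluster version: %w", lastErr)
}

// openShiftVersion is a stub standing in for the OpenShift-specific
// lookup. Here it simulates running against a non-OpenShift cluster.
func openShiftVersion() (string, error) {
	return "", errors.New("ClusterVersion resource not found")
}

// serverVersion is a stub standing in for a generic Kubernetes server
// version lookup. The returned value is a placeholder.
func serverVersion() (string, error) {
	return "v1.30.0", nil
}

func main() {
	v, err := clusterVersion(openShiftVersion, serverVersion)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("cluster version:", v)
}
```

With this shape, the OpenShift lookup stays the preferred source and plain Kubernetes clusters still get a version instead of an error.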

@manusa
Member

manusa commented Nov 5, 2025

I'm not really sure about this one; it seems quite opinionated.
Given the raw tools, the LLMs should be able to perform these operations themselves.

In https://github.com/Flux159/mcp-server-kubernetes they're implementing similar functionality via a prompt, which doesn't seem like a bad idea.

Similarly, in https://github.com/GoogleCloudPlatform/kubectl-ai (AFAIU) they're leveraging the system prompt for this purpose: https://github.com/GoogleCloudPlatform/kubectl-ai/blob/main/pkg/agent/systemprompt_template_default.txt

IMO if we want to proceed with this feature we should either:

  1. Reimplement it as a prompt for the core toolset.
  2. If we really want this as a tool, add it to a specific opt-in toolset (diagnostics or something similar).

This is also a good case to test evals and see if they can be used to make a better decision on how to implement this feature.

@ropatil010 force-pushed the cluster-health-check branch from 1d70e69 to 5fd5987 on November 5, 2025 15:18
Signed-off-by: Rohit Patil <[email protected]>
@ropatil010 force-pushed the cluster-health-check branch from 5fd5987 to afb7d63 on November 5, 2025 16:03