…fixes and full docs
- Add PlaybookExecutor: LOW risk executes via MCP immediately, MEDIUM/HIGH routes through ApprovalManager
- Fix watchloop stale _known_issues: nodes and deployments cleared on recovery
- Hoist PlaybookRegistry/PlaybookExecutor as startup singletons in main.py
- Wire _on_cluster_event to call playbook_executor.execute()
- Add Prometheus/Alertmanager/Grafana config and docker-compose services
- Add scripts/start_production.sh with kubeconfig proxy-url patching
- Fix k8s/client.py: colon-separated KUBECONFIG, _init_attempted guard
- Add docs/aiops.md with 6 Mermaid sequence diagrams
- Update docs/sequence-diagrams.md diagrams 7-11 (AIOps flows)
- Update docs/architecture.md with AIOps pipeline section
- Rewrite README.md with full feature matrix, AIOps, MCP/K8s docs
- Add docs/hld.d2 and docs/hld.svg (D2 traffic-flow HLD diagram)
- Add LICENSE, CHANGELOG.md, CONTRIBUTING.md, SECURITY.md
- Add .dockerignore, expand .gitignore, update .env.example
- Add .github/workflows/ci.yml (lint, test, docker build, ghcr.io push)
- Add .github PR template and issue templates
- Update Dockerfile OCI labels
- Rename project from clawbot to simple-ai-agent throughout
- Update pyproject.toml (name, version 0.4.0), src/config.py, src/main.py
- Add slack_manifest.yml (manifest schema v2) with:
  - bot_user: simple-ai-agent, always_online
  - App Home with Messages tab enabled (DM support)
  - OAuth scopes: app_mentions:read, chat:write, im:history, im:write, users:read, channels:history, groups:history
  - Event subscriptions: app_mention, message.im
  - Interactivity and socket mode disabled
  - request_url placeholder: https://YOUR-DOMAIN/api/webhook/slack
- Rewrite docs/slack-setup.md to lead with manifest approach:
  - 6-step manifest-based setup as primary method
  - Manual setup retained as Method 2 alternative
  - Added ngrok local dev tip
  - Updated architecture diagram and security table
  - Added manifest update/rotation API commands
Add slack_manifest.*.yml to .gitignore so private/enterprise manifests (e.g. slack_manifest.htunn-enterprise.yml) are never committed to the public repository.
- Add cloudflare/cloudflared:latest service with restart policy
- Route https://slack.simpleportchecker.com -> http://app:8000
- Token injected via CLOUDFLARE_TUNNEL_TOKEN env var (gitignored .env)
- Add CLOUDFLARE_TUNNEL_TOKEN placeholder to .env.example
- Tunnel health-depends on app service_healthy
- Logging: json-file, 5m max-size, 2 files
- Read raw body once for both HMAC verification and JSON parsing
- Verify signature before spawning background task
- Fire event processing with asyncio.create_task (non-blocking), preventing Slack's 3-second timeout from triggering retries
- Deduplicate Slack retries via event_id stored in Redis (5 min TTL)
- Gracefully degrade if Redis is unavailable (no dedup, no crash)
- Clean up inline imports: json, asyncio now at module level

Root cause: awaiting the full AI pipeline (5-10s) before returning 200 caused Slack to retry, spawning two concurrent executions. One failed (DB/Redis cold start) and sent the error message; the other succeeded.
Scale intent detection:
- Normalise word numbers before parsing ('one' -> 1, 'two' -> 2, ...)
- Add dedicated scale/resize branch checked BEFORE pod/deployment branches, so 'scale down X pod to 1 replica' is never mistaken for a pod-list request
- Two-pass name extraction: verb-adjacent token first, then fallback pattern for 'X pod to N replica' phrasing
- Retries without namespace if the first attempt fails
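The number normalisation and two-pass name extraction listed above might look something like the sketch below. The regular expressions, the stop-word list, and the function names are illustrative assumptions, not the handler's actual patterns.

```python
import re
from typing import Optional, Tuple

WORD_NUMBERS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}
_WORD_RE = re.compile(r"\b(" + "|".join(WORD_NUMBERS) + r")\b", re.IGNORECASE)
_SCALE_VERB = re.compile(r"\b(?:scale|resize)\b", re.IGNORECASE)
# Pass 1: token directly after the verb, e.g. "scale down web-api ..."
_VERB_ADJACENT = re.compile(
    r"\b(?:scale|resize)\s+(?:up\s+|down\s+)?([a-z0-9][a-z0-9.-]*)", re.IGNORECASE
)
# Pass 2: fallback for "X pod to N replica" phrasing
_POD_TO_N = re.compile(
    r"([a-z0-9][a-z0-9.-]*)\s+(?:pods?|deployments?)\s+to\s+\d+", re.IGNORECASE
)
_REPLICAS = re.compile(r"\bto\s+(\d+)\b", re.IGNORECASE)
_STOPWORDS = {"the", "my", "pod", "pods", "deployment", "deployments"}


def normalise_numbers(text: str) -> str:
    """Replace word numbers ('one', 'two', ...) with digits."""
    return _WORD_RE.sub(lambda m: str(WORD_NUMBERS[m.group(1).lower()]), text)


def parse_scale_intent(text: str) -> Optional[Tuple[str, int]]:
    """Return (deployment_name, replicas), or None if this is not a scale request."""
    text = normalise_numbers(text)
    if not _SCALE_VERB.search(text):
        return None
    count = _REPLICAS.search(text)
    if count is None:
        return None
    name = None
    first = _VERB_ADJACENT.search(text)
    if first and first.group(1).lower() not in _STOPWORDS:
        name = first.group(1)
    else:
        second = _POD_TO_N.search(text)
        if second:
            name = second.group(1)
    return (name, int(count.group(1))) if name else None
```

Checking this branch before the pod/deployment branches is what stops a phrase containing the word "pod" from falling through to the pod-list handler.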
Error-pods-only replies for Slack:
- Default pod listing now shows ONLY problem pods (Error, CrashLoopBackOff, ImagePullBackOff, Pending, Failed, Terminating, ContainerCreating, OOMKilled, degraded-ready-ratio)
- Summary: '(N issue(s), M healthy)' keeps context without noise
- Replies '✅ All N pod(s) are healthy' when nothing is broken
Scale namespace auto-discovery:
- Before scaling, run kubectl get deployment --all-namespaces to find which namespace the deployment lives in automatically
- Scale superadmin-frontend works without specifying namespace

Slack-friendly command prefix:
- Messages starting with ! are normalised to / internally so !k8s !help !incident !alert all work from Slack
- Slack intercepts / commands before the bot sees them

Help text cleanup:
- All examples updated to use ! prefix in Slack context
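As a small illustration of the prefix normalisation, the hypothetical helper below rewrites a leading `!` to `/` only when it is followed by a letter, so ordinary messages that merely contain an exclamation mark are left alone. The exact rule used in the adapter is not shown in the commit, so treat the letter check as an assumption.

```python
def normalise_command_prefix(text: str) -> str:
    """Map '!command' to '/command'; leave every other message unchanged."""
    stripped = text.lstrip()
    if len(stripped) > 1 and stripped[0] == "!" and stripped[1].isalpha():
        return "/" + stripped[1:]
    return text
```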
- Delete src/channels/discord_adapter.py
- Remove Discord from channels __init__, router, config, prompt_manager
- Remove Discord webhook stub endpoint from api/webhooks.py
- Remove discord.py from requirements.txt
- Remove 'discord' from pyproject.toml keywords
- Remove DISCORD_TOKEN from docker-compose.yml and .env* files
- Remove all Discord nodes and flows from docs/hld.d2; regenerate hld.svg
- Update README, CHANGELOG, SETUP, scripts/README with Telegram+Slack only
- Update all docs/*.md (architecture, component-diagram, sequence-diagrams, database-architecture, aiops, kubernetes-integration): Discord removed
- Update .github/ISSUE_TEMPLATE/bug_report.md
- Update scripts/init_db.py default channel seeds (telegram only)
- Update src/services/kubernetes_handler.py docstring
…, test suite 49/49
- src/monitoring/metrics.py: new custom Prometheus counters/gauges/histograms (aiagent_messages_*, aiagent_ai_requests_*, aiagent_k8s_*, aiagent_aiops_*, aiagent_mcp_*, aiagent_webhook_*, aiagent_build_info)
- src/main.py: import metrics module on startup to register all metrics; add /metrics endpoint via prometheus_client
- src/utils/logger.py: auto-select JSONRenderer (non-TTY / LOG_FORMAT=json) vs ConsoleRenderer (interactive dev), enabling structured JSON logs in Docker
- src/k8s/client.py: add missing list_namespaces() method (fixed AttributeError in /health endpoint)
- requirements.txt: add prometheus-client==0.21.1
- tests/test_production_readiness.py: full 14-section, 49-test production suite covering HTTP, Slack, Alertmanager, rate limiting, Postgres, Redis, K8s, AIOps, MCP, AI client, channels, security, observability; all 49/49 pass
- Fix IndentationError at line 550: status_filter problem block was mis-indented outside the for-loop body, breaking all elif branches
- Fix _format_kubectl_table: detect --all-namespaces 6-col output to prevent namespace name (e.g. velero) appearing as pod name
- Add NLP fix pods intent handler + _fix_problem_pods() method for auto-remediation of Error/CrashLoopBackOff/OOMKilled/Failed pods
- Add /k8s fix [namespace] subcommand routing
- Add OOMKilled to problem pod status filter list
- Add fix pods hint to problem pods display output
- Telegram webhook: implement HMAC secret-token validation (was TODO)
- Alertmanager webhook: return alerts_ingested count in response
- docker-compose: bind Redis/Postgres to 127.0.0.1 (not 0.0.0.0)
- docker-compose: remove Prometheus --web.enable-admin-api flag
- config.py: add telegram_webhook_secret field
- tests: fix RuleEngine event_type (pod_crash_loop -> crash_loop)
- tests: add 5 new security tests (secret leak, webhook auth, port binding)
- tests: replace datetime.utcnow() with datetime.now(timezone.utc)

All 54/54 production readiness tests passing
Pull request overview
This PR expands the agent into a more production-oriented AIOps platform: adding proactive Kubernetes monitoring (watchloop), human-in-the-loop approvals, Prometheus/Grafana/Alertmanager integration, Slack-focused channel support, and operational tooling/docs for deployment and contribution.
Changes:
- Added AIOps subsystem (watchloop, rule engine, playbooks, approval manager, RCA/log analysis) and supporting DB tables/migrations.
- Added observability stack + endpoints/config (Prometheus metrics endpoint, Prometheus/Grafana/Alertmanager clients and compose/config files).
- Streamlined channel support toward Telegram/Slack (removing Discord) and updated docs/templates/CI/deployment scripts accordingly.
Reviewed changes
Copilot reviewed 62 out of 65 changed files in this pull request and generated 18 comments.
| File | Description |
|---|---|
| src/utils/logger.py | Switches structlog renderer to JSON in non-TTY/LOG_FORMAT=json contexts. |
| src/services/kubernetes_handler.py | Updates module docstring to reflect supported channels. |
| src/services/approval_manager.py | Adds Redis-backed human approval workflow for remediation actions. |
| src/monitoring/watchloop.py | Adds background polling loop to detect common K8s anomalies and emit events. |
| src/monitoring/prometheus.py | Adds async Prometheus client plus convenience health queries. |
| src/monitoring/metrics.py | Defines application Prometheus metrics (counters/gauges/info). |
| src/monitoring/grafana.py | Adds Grafana client for incident annotations. |
| `src/monitoring/__init__.py` | Exposes monitoring clients/watchloop as package exports. |
| src/mcp/kubernetes_server.py | Adds self-healing / AIOps kubectl tools and handlers to MCP server. |
| src/main.py | Initializes AIOps components at startup; adds /metrics; updates app name. |
| src/k8s/client.py | Adds kubernetes-asyncio based singleton client wrapper. |
| `src/k8s/__init__.py` | Exports K8s client accessor(s). |
| src/database/models.py | Adds AIOps persistence models (incidents/alerts/remediations/audit/snapshots). |
| src/database/migrations/versions/002_aiops_tables.py | Adds Alembic migration for new AIOps tables. |
| src/config.py | Adds AIOps/observability/watchloop/approval settings; updates DB default. |
| src/channels/router.py | Removes Discord adapter registration; keeps Telegram/Slack. |
| src/channels/discord_adapter.py | Removes Discord adapter implementation. |
| `src/channels/__init__.py` | Removes Discord export; adds Slack export. |
| src/api/webhooks.py | Adds Telegram secret validation; Slack async processing + dedupe; adds Alertmanager webhook. |
| src/api/health.py | Expands health response with K8s/Prometheus/watchloop/AIOps counters and adds /health/aiops. |
| src/aiops/rule_engine.py | Adds in-memory rule engine mapping events to playbooks. |
| src/aiops/rca_engine.py | Adds AI-powered (and fallback) RCA report generation. |
| src/aiops/playbooks.py | Adds playbook registry and executor with risk-gated steps. |
| src/aiops/log_analyzer.py | Adds regex-based log pattern analysis (+ optional AI enrichment). |
| `src/aiops/__init__.py` | Exports AIOps modules as a package. |
| src/ai/prompt_manager.py | Updates channel prompts and help text for Slack + AIOps commands. |
| src/ai/model_selector.py | Updates channel type docs to match Telegram/Slack. |
| slack_manifest.yml | Adds Slack App Manifest for easier Slack setup. |
| scripts/start_production.sh | Adds production bootstrap script for compose + data dirs + kubeconfig patching. |
| scripts/init_db.py | Removes Discord default channel seeding. |
| scripts/README.md | Updates scripts docs to remove Discord token reference. |
| requirements.txt | Removes discord.py; adds kubernetes-asyncio + prometheus-client and related deps. |
| pyproject.toml | Updates project metadata/version and classifiers/keywords. |
| docs/slack-setup.md | Rewrites Slack setup docs (manifest-first) and updates usage/troubleshooting. |
| docs/sequence-diagrams.md | Updates diagrams to Telegram/Slack + adds AIOps sequences. |
| docs/kubernetes-integration.md | Updates docs to remove Discord references. |
| docs/hld.d2 | Adds high-level design diagram for new architecture components. |
| docs/database-architecture.md | Updates Redis/session examples and channel references (Telegram/Slack). |
| docs/component-diagram.md | Updates component diagram to remove Discord and add Slack. |
| docs/architecture.md | Updates architecture docs; adds AIOps architecture section. |
| docker-compose.yml | Updates env handling, mounts, extra_hosts; adds cloudflared + Prometheus/Alertmanager/Grafana stack. |
| config/prometheus.yml | Adds Prometheus scrape config for app + stack. |
| config/grafana/provisioning/datasources/prometheus.yml | Provisions Prometheus datasource in Grafana. |
| config/grafana/provisioning/dashboards/dashboard.yml | Adds Grafana dashboard provisioning config. |
| config/alertmanager.yml | Adds Alertmanager routing config to webhook receiver. |
| config/alert_rules.yml | Adds Prometheus alerting rules for app/K8s/infrastructure signals. |
| SETUP.md | Removes Discord setup instructions; keeps Telegram/Slack. |
| SECURITY.md | Adds security policy and deployment best practices. |
| LICENSE | Adds MIT license. |
| Dockerfile | Updates OCI labels and description; retains kubectl install for runtime/build. |
| CONTRIBUTING.md | Adds contributing guide and conventions. |
| CHANGELOG.md | Adds changelog and release notes including AIOps/ops features. |
| .gitignore | Expands ignores for secrets, data dirs, caches, kubeconfig, etc. |
| .github/workflows/ci.yml | Adds CI workflow for lint/type/test/docker build/publish. |
| .github/PULL_REQUEST_TEMPLATE.md | Adds PR template. |
| .github/ISSUE_TEMPLATE/feature_request.md | Adds feature request template. |
| .github/ISSUE_TEMPLATE/bug_report.md | Adds bug report template. |
| .env.production.example | Expands production env template for AIOps/monitoring. |
| .env.example | Overhauls env template with grouped sections and added variables. |
| .dockerignore | Adds docker build context exclusions for secrets/data/dev artifacts. |

    def approval_message(self) -> str:
        risk_emoji = {"low": "🟡", "medium": "🟠", "high": "🔴"}[self.risk_level.value]
        lines = [
            f"{risk_emoji} **Approval Required** [{self.risk_level.value.upper()}]",
            f"",
            f"**Action:** {self.description}",
            f"**Tool:** `{self.tool_name}`",
            f"**Parameters:** `{json.dumps(self.tool_params, indent=2)}`",
            f"",
            f"Reply with **`approve {self.approval_id[:8]}`** to proceed or **`reject {self.approval_id[:8]}`** to cancel.",
            f"This request expires in {settings.approval_timeout_seconds // 60} minutes.",
        ]
        if self.risk_level == RiskLevel.HIGH:
            lines.insert(0, "⚠️ **HIGH RISK ACTION — Review carefully before approving**\n")

RiskLevel is defined separately in src/aiops/playbooks.py and src/services/approval_manager.py. Because they are different Enum classes, comparisons like self.risk_level == RiskLevel.HIGH in PendingApproval.approval_message() can silently fail if a playbook RiskLevel instance is passed in, which can suppress the HIGH-risk warning. Prefer importing and using a single shared RiskLevel enum across both modules (or normalize to strings at the boundary).
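The failure mode is easy to reproduce in isolation. The sketch below uses two stand-in enums (the names are hypothetical, and it assumes both are plain `Enum` subclasses rather than `str` mixins) to show that members with identical values from different classes do not compare equal, and how normalising by value at the boundary restores the comparison.

```python
from enum import Enum


class PlaybookRiskLevel(Enum):  # stand-in for the enum in src/aiops/playbooks.py
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


class ApprovalRiskLevel(Enum):  # stand-in for the enum in src/services/approval_manager.py
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


def to_approval_risk(level: Enum) -> ApprovalRiskLevel:
    """Normalise any risk enum to the approval manager's enum via its value."""
    return ApprovalRiskLevel(level.value)


# Same value, different classes: the comparison is False, with no error raised.
mismatch = PlaybookRiskLevel.HIGH == ApprovalRiskLevel.HIGH
# After normalising at the boundary the check behaves as intended.
normalised = to_approval_risk(PlaybookRiskLevel.HIGH) == ApprovalRiskLevel.HIGH
```

Importing one shared enum from a single module avoids the conversion step entirely and is the simpler fix when both modules can depend on it.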

    # 1. Scan for crashloop and OOMKilled pods
    try:
        crash_pods = await self._k8s.get_crashloop_pods()
        for pod in crash_pods:
            key = f"pod/{pod['namespace']}/{pod['name']}"
            if key not in self._known_issues:
                self._known_issues[key] = datetime.now(timezone.utc)
                status = pod.get("status", "CrashLoopBackOff")
                severity = "critical" if "CrashLoop" in status or "OOM" in status else "warning"
                events.append(ClusterEvent(
                    event_type="crash_loop" if "OOM" not in status else "oom_killed",
                    severity=severity,
                    namespace=pod["namespace"],
                    resource_kind="Pod",
                    resource_name=pod["name"],
                    message=f"Pod {pod['name']} in {pod['namespace']} is {status} (restarts: {pod.get('restarts', 0)})",
                    labels=pod.get("labels", {}),
                ))
            # Clear resolved pods from known issues
            elif pod.get("status") not in ("CrashLoopBackOff", "Error", "OOMKilled"):
                self._known_issues.pop(key, None)

Crash-loop pod deduplication never clears: get_crashloop_pods() only returns pods currently in CrashLoop/Error/OOMKilled, so recovered pods won’t appear in crash_pods and the elif ... not in (...) branch will never run. This means _known_issues entries for pods will persist forever, preventing future alerts for the same pod and causing unbounded growth. Track the set of currently failing pod keys and remove any pod/... keys not present each tick (similar to the node/deployment cleanup).
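One way to implement the suggested cleanup is to rebuild the set of failing pod keys on every tick and prune any `pod/...` entry that is no longer in it. The helper below is a hypothetical sketch of that idea, working on a plain dict in place of `self._known_issues`; only the key format is taken from the snippet above.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Mapping


def reconcile_pod_issues(
    known_issues: Dict[str, datetime],
    failing_pods: Iterable[Mapping[str, str]],
) -> List[str]:
    """Update known_issues in place and return the keys that are newly failing."""
    current = {f"pod/{p['namespace']}/{p['name']}" for p in failing_pods}
    new_keys = sorted(current.difference(known_issues))
    now = datetime.now(timezone.utc)
    for key in new_keys:
        known_issues[key] = now
    # Prune pods that recovered; node/... and deployment/... keys are left untouched.
    recovered = [k for k in known_issues if k.startswith("pod/") and k not in current]
    for key in recovered:
        del known_issues[key]
    return new_keys
```

Because recovered pods are removed, a pod that starts crashing again later produces a fresh event, and the dict stays bounded by the number of pods currently failing.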

    try:
        namespaces_resp = await self._k8s._core_v1.list_namespace()  # type: ignore[union-attr]
        current_failed_deployments: set[str] = set()

This tick uses self._k8s._core_v1.list_namespace() (a private attribute). Prefer a public method on KubernetesClient (or add one) to avoid coupling the watchloop to internal client implementation details.

    try:
        redis_client = get_redis()
        keys = await redis_client.keys("approval:*")
        result["pending_approvals"] = [k.split(":")[-1] for k in keys]
    except Exception:

This uses redis.keys("approval:*") again for /health/aiops. For the same reason as /health, prefer SCAN/scan_iter (or a stored counter) to avoid blocking Redis under load.
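A sketch of the SCAN-based alternative follows. It relies on the async `scan_iter(match=..., count=...)` generator that redis-py's asyncio client provides; `FakeRedis` is a hypothetical stand-in included only so the example runs without a Redis server, and the helper name is an assumption.

```python
import asyncio
from typing import AsyncIterator, List


async def pending_approval_ids(client, pattern: str = "approval:*", batch: int = 100) -> List[str]:
    """Collect approval IDs by iterating SCAN cursors rather than calling KEYS."""
    ids: List[str] = []
    async for key in client.scan_iter(match=pattern, count=batch):
        if isinstance(key, bytes):  # clients created without decode_responses return bytes
            key = key.decode()
        ids.append(key.split(":")[-1])
    return ids


class FakeRedis:
    """Hypothetical in-memory stand-in exposing only scan_iter."""

    def __init__(self, keys: List[str]) -> None:
        self._keys = keys

    async def scan_iter(self, match: str, count: int) -> AsyncIterator[str]:
        prefix = match.rstrip("*")
        for key in self._keys:
            if key.startswith(prefix):
                yield key


found = asyncio.run(
    pending_approval_ids(FakeRedis(["approval:abc", "session:1", "approval:def"]))
)
```

SCAN walks the keyspace in small batches, so each call holds the server only briefly, whereas KEYS examines every key in a single blocking command.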

    # CLOUDFLARE_TUNNEL_TOKEN must be set in .env (never commit the real value).
    # ---------------------------------------------------------------------------
    cloudflared:
      image: cloudflare/cloudflared:latest

cloudflared uses the cloudflare/cloudflared:latest tag. Pinning to a specific version/digest is safer for production deployments (avoids unexpected breaking changes on redeploy) while still allowing planned upgrades.
Suggested change:

    - image: cloudflare/cloudflared:latest
    + image: cloudflare/cloudflared:${CLOUDFLARED_VERSION:-2024.2.1}


    # Optional HMAC secret validation
    if settings.alertmanager_webhook_secret:
        body_bytes = await request.body()
        expected = hmac.new(
            settings.alertmanager_webhook_secret.encode(),
            body_bytes,
            hashlib.sha256,
        ).hexdigest()
        received = x_alertmanager_secret or ""
        if not hmac.compare_digest(expected, received):
            logger.warning("alertmanager_webhook_invalid_secret")
            raise HTTPException(status_code=403, detail="Invalid webhook secret")
        body = __import__("json").loads(body_bytes)

Alertmanager secret validation computes an HMAC of the request body and expects the header to match it, but Alertmanager typically can only send a static header value (or basic auth), not a per-payload HMAC. As written, setting ALERTMANAGER_WEBHOOK_SECRET will likely reject all legitimate alerts. Consider treating X-Alertmanager-Secret as a shared static secret (direct string compare), and configure Alertmanager to send that header, or switch to basic auth/TLS client auth.
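A minimal sketch of the static shared-secret check is shown below, assuming the header carries the secret itself rather than a per-payload digest. The helper name is hypothetical; `hmac.compare_digest` is kept so the comparison remains constant-time. How Alertmanager is configured to send the header is outside this sketch.

```python
import hmac
from typing import Optional


def is_valid_static_secret(configured: str, received: Optional[str]) -> bool:
    """Compare the header value with the configured secret in constant time."""
    if not configured:
        return True  # validation stays optional, as in the snippet above
    # Encoding to bytes lets compare_digest accept non-ASCII input as well.
    return hmac.compare_digest(configured.encode(), (received or "").encode())
```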

    async def _kubectl_drain_node(self, args: Dict[str, Any]) -> str:
        """Drain all pods from a node."""
        node_name = args["node_name"]
        ignore_daemonsets = args.get("ignore_daemonsets", True)
        cmd = ["drain", node_name, "--delete-emissary-data", "--timeout=120s"]
        if ignore_daemonsets:
            cmd.append("--ignore-daemonsets")
        return await self._run_kubectl(cmd)

kubectl drain is invoked with --delete-emissary-data, which is not a valid kubectl flag (the standard flag is --delete-emptydir-data). This will cause drains to fail at runtime. Update the arguments to valid kubectl drain options (and consider exposing/controlling any destructive flags via input params).
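One possible shape for the corrected argument list is sketched below. It uses only flags that `kubectl drain` accepts (`--timeout`, `--ignore-daemonsets`, `--delete-emptydir-data`) and makes the destructive flag opt-in. The `delete_emptydir_data` and `timeout_seconds` parameters are hypothetical additions for illustration and are not part of the tool's current input schema.

```python
from typing import Any, Dict, List


def build_drain_args(args: Dict[str, Any]) -> List[str]:
    """Build the kubectl argument list for draining a node."""
    node_name = args["node_name"]
    timeout = int(args.get("timeout_seconds", 120))  # hypothetical parameter
    cmd = ["drain", node_name, f"--timeout={timeout}s"]
    if args.get("ignore_daemonsets", True):
        cmd.append("--ignore-daemonsets")
    # Destructive: removes pods using emptyDir volumes, so require explicit opt-in.
    if args.get("delete_emptydir_data", False):  # hypothetical parameter
        cmd.append("--delete-emptydir-data")
    return cmd
```

Note that the original snippet always passed the data-deleting flag; defaulting it to off here is a design choice following the reviewer's suggestion, not the existing behaviour.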

    from prometheus_client import Counter, Gauge, Histogram, Info, REGISTRY, CollectorRegistry
    from prometheus_client.core import GaugeMetricFamily

This module imports several Prometheus classes that are never used (REGISTRY, CollectorRegistry, GaugeMetricFamily). With default ruff/pyflakes settings, unused imports will fail CI. Remove unused imports or use them.

    webhook_endpoints: "POST /api/webhook/{slack,telegram}" {
      shape: rectangle
      style.fill: "#c8e6c9"
    }
    alertmanager_hook: "POST /api/alert/webhook\n(Alertmanager ingest)" {
      shape: rectangle
      style.fill: "#c8e6c9"
    }
    rate_limiter: "Rate Limiter\n(slowapi)" {

The HLD diagram documents the Alertmanager webhook endpoint as POST /api/alert/webhook, but the actual FastAPI route is POST /api/webhook/alertmanager. This mismatch can cause incorrect deployments and failed alert delivery; update the diagram to match the implemented endpoint.

    if self._approval:
        try:
            await self._approval.request_approval(
                tool_name=step.tool_name,
                tool_params=params,
                risk_level=step.risk_level,
                description=step.description,
                requested_by=requested_by,
                channel_type=channel_type,
                channel_target=channel_target,
                playbook_run_id=run.run_id,
            )

When requesting approval for MEDIUM/HIGH steps, request_approval() is called without send_message_callback, so the user never receives the approval prompt (and thus has no approval ID to respond with). Wire a send callback here (e.g., via the router) so an approval request message is actually posted to channel_target.
This pull request introduces foundational improvements for project maintainability, developer onboarding, and operational clarity. It adds essential documentation files, standardizes environment configuration templates, establishes issue and pull request templates, and sets up a robust CI workflow. The changes also enhance observability and AIOps configuration, making it easier to deploy, monitor, and contribute to the project.
Documentation and Contribution Process:
- CONTRIBUTING.md outlining code quality, commit conventions, PR workflow, and extension guidelines for channels, MCP servers, and AIOps playbooks.
- CHANGELOG.md following Keep a Changelog and Semantic Versioning, documenting all major features and changes for transparency.

Environment and Configuration:
- .env.example with grouped sections, detailed comments, and expanded variables for all supported features (Kubernetes, AIOps, observability, security, etc.), improving clarity for setup and deployment.
- .env.production.example to include all monitoring, AIOps, and network/data settings for production readiness.

CI/CD and Issue Templates:

Operational Improvements:
- .dockerignore to optimize Docker build context and exclude unnecessary files, speeding up image builds and reducing risk of leaking secrets or dev artifacts.