
Enhance AIOps platform with production readiness, Slack integration, and metrics #1

Merged: Htunn merged 12 commits into main from develop on Mar 3, 2026

Conversation

Htunn (Owner) commented Mar 3, 2026

This pull request introduces foundational improvements for project maintainability, developer onboarding, and operational clarity. It adds essential documentation files, standardizes environment configuration templates, establishes issue and pull request templates, and sets up a robust CI workflow. The changes also enhance observability and AIOps configuration, making it easier to deploy, monitor, and contribute to the project.

Documentation and Contribution Process:

  • Added comprehensive CONTRIBUTING.md outlining code quality, commit conventions, PR workflow, and extension guidelines for channels, MCP servers, and AIOps playbooks.
  • Added CHANGELOG.md following Keep a Changelog and Semantic Versioning, documenting all major features and changes for transparency.

Environment and Configuration:

  • Overhauled .env.example with grouped sections, detailed comments, and expanded variables for all supported features (Kubernetes, AIOps, observability, security, etc.), improving clarity for setup and deployment.
  • Updated .env.production.example to include all monitoring, AIOps, and network/data settings for production readiness.

CI/CD and Issue Templates:

  • Added GitHub Actions workflow for linting, type checking, testing (with coverage), Docker build, and image publishing to streamline code quality and deployment.
  • Introduced issue templates for bugs and feature requests, and a pull request template to standardize reporting and review.

Operational Improvements:

  • Added .dockerignore to optimize Docker build context and exclude unnecessary files, speeding up image builds and reducing risk of leaking secrets or dev artifacts.

Htunn added 12 commits February 27, 2026 23:09
…fixes and full docs

- Add PlaybookExecutor: LOW risk executes via MCP immediately, MEDIUM/HIGH routes through ApprovalManager
- Fix watchloop stale _known_issues: nodes and deployments cleared on recovery
- Hoist PlaybookRegistry/PlaybookExecutor as startup singletons in main.py
- Wire _on_cluster_event to call playbook_executor.execute()
- Add Prometheus/Alertmanager/Grafana config and docker-compose services
- Add scripts/start_production.sh with kubeconfig proxy-url patching
- Fix k8s/client.py: colon-separated KUBECONFIG, _init_attempted guard
- Add docs/aiops.md with 6 Mermaid sequence diagrams
- Update docs/sequence-diagrams.md diagrams 7-11 (AIOps flows)
- Update docs/architecture.md with AIOps pipeline section
- Rewrite README.md with full feature matrix, AIOps, MCP/K8s docs
- Add docs/hld.d2 and docs/hld.svg (D2 traffic-flow HLD diagram)
- Add LICENSE, CHANGELOG.md, CONTRIBUTING.md, SECURITY.md
- Add .dockerignore, expand .gitignore, update .env.example
- Add .github/workflows/ci.yml (lint, test, docker build, ghcr.io push)
- Add .github PR template and issue templates
- Update Dockerfile OCI labels
- Rename project from clawbot to simple-ai-agent throughout
- Update pyproject.toml (name, version 0.4.0), src/config.py, src/main.py
- Add slack_manifest.yml (manifest schema v2) with:
  - bot_user: simple-ai-agent, always_online
  - App Home with Messages tab enabled (DM support)
  - OAuth scopes: app_mentions:read, chat:write, im:history,
    im:write, users:read, channels:history, groups:history
  - Event subscriptions: app_mention, message.im
  - Interactivity and socket mode disabled
  - request_url placeholder: https://YOUR-DOMAIN/api/webhook/slack
- Rewrite docs/slack-setup.md to lead with manifest approach:
  - 6-step manifest-based setup as primary method
  - Manual setup retained as Method 2 alternative
  - Added ngrok local dev tip
  - Updated architecture diagram and security table
  - Added manifest update/rotation API commands
Add slack_manifest.*.yml to .gitignore so private/enterprise
manifests (e.g. slack_manifest.htunn-enterprise.yml) are never
committed to the public repository.
- Add cloudflare/cloudflared:latest service with restart policy
- Route https://slack.simpleportchecker.com -> http://app:8000
- Token injected via CLOUDFLARE_TUNNEL_TOKEN env var (gitignored .env)
- Add CLOUDFLARE_TUNNEL_TOKEN placeholder to .env.example
- Tunnel health-depends on app service_healthy
- Logging: json-file, 5m max-size, 2 files
- Read raw body once for both HMAC verification and JSON parsing
- Verify signature before spawning background task
- Fire event processing with asyncio.create_task (non-blocking)
  preventing Slack's 3-second timeout from triggering retries
- Deduplicate Slack retries via event_id stored in Redis (5 min TTL)
- Gracefully degrade if Redis is unavailable (no dedup, no crash)
- Clean up inline imports: json, asyncio now at module level

Root cause: awaiting the full AI pipeline (5-10s) before returning 200
caused Slack to retry, spawning two concurrent executions - one failed
(DB/Redis cold start) and sent the error message, the other succeeded.
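
The webhook flow described in this commit (verify the signature on the raw body, acknowledge at once, process in the background, drop retries by event_id) can be sketched roughly as below. This is a minimal standard-library sketch, not the PR's code: `verify_slack_signature`, `SeenEvents`, and `handle_slack_event` are hypothetical names, and the in-memory store stands in for the Redis key with a 5-minute TTL. The signing scheme is Slack's documented `v0` HMAC-SHA256 over `v0:{timestamp}:{body}`.

```python
from __future__ import annotations

import asyncio
import hashlib
import hmac
import json
import time
from typing import Awaitable, Callable, Mapping

# Keep references so fire-and-forget tasks are not garbage-collected early.
_background_tasks: set[asyncio.Task] = set()


def verify_slack_signature(signing_secret: str, timestamp: str, raw_body: bytes,
                           signature: str, max_age_seconds: int = 300) -> bool:
    """Check Slack's v0 signature over the raw request body."""
    try:
        age = abs(time.time() - int(timestamp))
    except ValueError:
        return False
    if age > max_age_seconds:
        return False  # stale timestamp, possibly a replay
    base = b"v0:" + timestamp.encode() + b":" + raw_body
    expected = "v0=" + hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode(), signature.encode())


class SeenEvents:
    """In-memory stand-in for the Redis event_id store (hypothetical helper)."""

    def __init__(self, ttl_seconds: int = 300) -> None:
        self._ttl = ttl_seconds
        self._seen: dict[str, float] = {}

    def first_time(self, event_id: str) -> bool:
        now = time.monotonic()
        # Drop expired entries so the store does not grow without bound.
        self._seen = {k: t for k, t in self._seen.items() if now - t < self._ttl}
        if event_id in self._seen:
            return False
        self._seen[event_id] = now
        return True


async def handle_slack_event(raw_body: bytes, headers: Mapping[str, str], signing_secret: str,
                             seen: SeenEvents,
                             process: Callable[[dict], Awaitable[None]]) -> int:
    """Return an HTTP status quickly; the slow work runs in a background task."""
    ok = verify_slack_signature(
        signing_secret,
        headers.get("X-Slack-Request-Timestamp", ""),
        raw_body,  # the same bytes are used for the HMAC and for JSON parsing
        headers.get("X-Slack-Signature", ""),
    )
    if not ok:
        return 403
    event = json.loads(raw_body)
    event_id = event.get("event_id")
    if event_id and not seen.first_time(event_id):
        return 200  # Slack retry of an event already accepted
    task = asyncio.create_task(process(event))
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)
    return 200
```

Verifying before scheduling the task means an unauthenticated request never reaches the AI pipeline, and returning before the task finishes keeps the response inside Slack's 3-second window.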
Scale intent detection:
- Normalise word numbers before parsing ('one' -> 1, 'two' -> 2, ...)
- Add dedicated scale/resize branch checked BEFORE pod/deployment
  branches, so 'scale down X pod to 1 replica' is never mistaken
  for a pod-list request
- Two-pass name extraction: verb-adjacent token first, then
  fallback pattern for 'X pod to N replica' phrasing
- Retries without namespace if the first attempt fails

Error-pods-only replies for Slack:
- Default pod listing now shows ONLY problem pods (Error,
  CrashLoopBackOff, ImagePullBackOff, Pending, Failed, Terminating,
  ContainerCreating, OOMKilled, degraded-ready-ratio)
- Summary: '(N issue(s), M healthy)' keeps context without noise
- Replies '✅ All N pod(s) are healthy' when nothing is broken
Scale namespace auto-discovery:
- Before scaling, run kubectl get deployment --all-namespaces to
  find which namespace the deployment lives in automatically
- Scale superadmin-frontend works without specifying namespace

Slack-friendly command prefix:
- Messages starting with ! are normalised to / internally
  so !k8s !help !incident !alert all work from Slack
- Slack intercepts / commands before the bot sees them
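
The prefix normalisation described above could look roughly like this. It is a minimal sketch: the function name is hypothetical, and requiring a letter after the `!` (so that a bare "!" or "!!" is left alone) is an assumption, not something the commit states.

```python
def normalise_command_prefix(text: str) -> str:
    """Map a leading '!' command to its '/' form; leave other messages unchanged."""
    stripped = text.lstrip()
    # Assumption: only treat '!' as a command prefix when a letter follows it.
    if len(stripped) > 1 and stripped[0] == "!" and stripped[1].isalpha():
        return "/" + stripped[1:]
    return text


print(normalise_command_prefix("!k8s pods"))
print(normalise_command_prefix("all good!"))
```

Doing the rewrite once at the channel boundary lets the existing `/` command router stay unchanged.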

Help text cleanup:
- All examples updated to use ! prefix in Slack context
- Delete src/channels/discord_adapter.py
- Remove Discord from channels __init__, router, config, prompt_manager
- Remove Discord webhook stub endpoint from api/webhooks.py
- Remove discord.py from requirements.txt
- Remove 'discord' from pyproject.toml keywords
- Remove DISCORD_TOKEN from docker-compose.yml and .env* files
- Remove all Discord nodes and flows from docs/hld.d2; regenerate hld.svg
- Update README, CHANGELOG, SETUP, scripts/README with Telegram+Slack only
- Update all docs/*.md (architecture, component-diagram, sequence-diagrams,
  database-architecture, aiops, kubernetes-integration) — Discord removed
- Update .github/ISSUE_TEMPLATE/bug_report.md
- Update scripts/init_db.py default channel seeds (telegram only)
- Update src/services/kubernetes_handler.py docstring
…, test suite 49/49

- src/monitoring/metrics.py: new custom Prometheus counters/gauges/histograms
  (aiagent_messages_*, aiagent_ai_requests_*, aiagent_k8s_*, aiagent_aiops_*,
   aiagent_mcp_*, aiagent_webhook_*, aiagent_build_info)
- src/main.py: import metrics module on startup to register all metrics;
  add /metrics endpoint via prometheus_client
- src/utils/logger.py: auto-select JSONRenderer (non-TTY / LOG_FORMAT=json)
  vs ConsoleRenderer (interactive dev), enabling structured JSON logs in Docker
- src/k8s/client.py: add missing list_namespaces() method (fixed AttributeError
  in /health endpoint)
- requirements.txt: add prometheus-client==0.21.1
- tests/test_production_readiness.py: full 14-section, 49-test production suite
  covering HTTP, Slack, Alertmanager, rate limiting, Postgres, Redis, K8s,
  AIOps, MCP, AI client, channels, security, observability — all 49/49 pass
- Fix IndentationError at line 550: status_filter problem block was
  mis-indented outside the for-loop body, breaking all elif branches
- Fix _format_kubectl_table: detect --all-namespaces 6-col output to
  prevent namespace name (e.g. velero) appearing as pod name
- Add NLP fix pods intent handler + _fix_problem_pods() method for
  auto-remediation of Error/CrashLoopBackOff/OOMKilled/Failed pods
- Add /k8s fix [namespace] subcommand routing
- Add OOMKilled to problem pod status filter list
- Add fix pods hint to problem pods display output
- Telegram webhook: implement HMAC secret-token validation (was TODO)
- Alertmanager webhook: return alerts_ingested count in response
- docker-compose: bind Redis/Postgres to 127.0.0.1 (not 0.0.0.0)
- docker-compose: remove Prometheus --web.enable-admin-api flag
- config.py: add telegram_webhook_secret field
- tests: fix RuleEngine event_type (pod_crash_loop -> crash_loop)
- tests: add 5 new security tests (secret leak, webhook auth, port binding)
- tests: replace datetime.utcnow() with datetime.now(timezone.utc)

All 54/54 production readiness tests passing
@Htunn Htunn requested a review from Copilot March 3, 2026 07:15
@Htunn Htunn merged commit 241467d into main Mar 3, 2026
9 of 11 checks passed

Copilot AI left a comment

Pull request overview

This PR expands the agent into a more production-oriented AIOps platform: adding proactive Kubernetes monitoring (watchloop), human-in-the-loop approvals, Prometheus/Grafana/Alertmanager integration, Slack-focused channel support, and operational tooling/docs for deployment and contribution.

Changes:

  • Added AIOps subsystem (watchloop, rule engine, playbooks, approval manager, RCA/log analysis) and supporting DB tables/migrations.
  • Added observability stack + endpoints/config (Prometheus metrics endpoint, Prometheus/Grafana/Alertmanager clients and compose/config files).
  • Streamlined channel support toward Telegram/Slack (removing Discord) and updated docs/templates/CI/deployment scripts accordingly.

Reviewed changes

Copilot reviewed 62 out of 65 changed files in this pull request and generated 18 comments.

Show a summary per file
| File | Description |
| --- | --- |
| `src/utils/logger.py` | Switches structlog renderer to JSON in non-TTY/LOG_FORMAT=json contexts. |
| `src/services/kubernetes_handler.py` | Updates module docstring to reflect supported channels. |
| `src/services/approval_manager.py` | Adds Redis-backed human approval workflow for remediation actions. |
| `src/monitoring/watchloop.py` | Adds background polling loop to detect common K8s anomalies and emit events. |
| `src/monitoring/prometheus.py` | Adds async Prometheus client plus convenience health queries. |
| `src/monitoring/metrics.py` | Defines application Prometheus metrics (counters/gauges/info). |
| `src/monitoring/grafana.py` | Adds Grafana client for incident annotations. |
| `src/monitoring/__init__.py` | Exposes monitoring clients/watchloop as package exports. |
| `src/mcp/kubernetes_server.py` | Adds self-healing / AIOps kubectl tools and handlers to MCP server. |
| `src/main.py` | Initializes AIOps components at startup; adds /metrics; updates app name. |
| `src/k8s/client.py` | Adds kubernetes-asyncio based singleton client wrapper. |
| `src/k8s/__init__.py` | Exports K8s client accessor(s). |
| `src/database/models.py` | Adds AIOps persistence models (incidents/alerts/remediations/audit/snapshots). |
| `src/database/migrations/versions/002_aiops_tables.py` | Adds Alembic migration for new AIOps tables. |
| `src/config.py` | Adds AIOps/observability/watchloop/approval settings; updates DB default. |
| `src/channels/router.py` | Removes Discord adapter registration; keeps Telegram/Slack. |
| `src/channels/discord_adapter.py` | Removes Discord adapter implementation. |
| `src/channels/__init__.py` | Removes Discord export; adds Slack export. |
| `src/api/webhooks.py` | Adds Telegram secret validation; Slack async processing + dedupe; adds Alertmanager webhook. |
| `src/api/health.py` | Expands health response with K8s/Prometheus/watchloop/AIOps counters and adds /health/aiops. |
| `src/aiops/rule_engine.py` | Adds in-memory rule engine mapping events to playbooks. |
| `src/aiops/rca_engine.py` | Adds AI-powered (and fallback) RCA report generation. |
| `src/aiops/playbooks.py` | Adds playbook registry and executor with risk-gated steps. |
| `src/aiops/log_analyzer.py` | Adds regex-based log pattern analysis (+ optional AI enrichment). |
| `src/aiops/__init__.py` | Exports AIOps modules as a package. |
| `src/ai/prompt_manager.py` | Updates channel prompts and help text for Slack + AIOps commands. |
| `src/ai/model_selector.py` | Updates channel type docs to match Telegram/Slack. |
| `slack_manifest.yml` | Adds Slack App Manifest for easier Slack setup. |
| `scripts/start_production.sh` | Adds production bootstrap script for compose + data dirs + kubeconfig patching. |
| `scripts/init_db.py` | Removes Discord default channel seeding. |
| `scripts/README.md` | Updates scripts docs to remove Discord token reference. |
| `requirements.txt` | Removes discord.py; adds kubernetes-asyncio + prometheus-client and related deps. |
| `pyproject.toml` | Updates project metadata/version and classifiers/keywords. |
| `docs/slack-setup.md` | Rewrites Slack setup docs (manifest-first) and updates usage/troubleshooting. |
| `docs/sequence-diagrams.md` | Updates diagrams to Telegram/Slack + adds AIOps sequences. |
| `docs/kubernetes-integration.md` | Updates docs to remove Discord references. |
| `docs/hld.d2` | Adds high-level design diagram for new architecture components. |
| `docs/database-architecture.md` | Updates Redis/session examples and channel references (Telegram/Slack). |
| `docs/component-diagram.md` | Updates component diagram to remove Discord and add Slack. |
| `docs/architecture.md` | Updates architecture docs; adds AIOps architecture section. |
| `docker-compose.yml` | Updates env handling, mounts, extra_hosts; adds cloudflared + Prometheus/Alertmanager/Grafana stack. |
| `config/prometheus.yml` | Adds Prometheus scrape config for app + stack. |
| `config/grafana/provisioning/datasources/prometheus.yml` | Provisions Prometheus datasource in Grafana. |
| `config/grafana/provisioning/dashboards/dashboard.yml` | Adds Grafana dashboard provisioning config. |
| `config/alertmanager.yml` | Adds Alertmanager routing config to webhook receiver. |
| `config/alert_rules.yml` | Adds Prometheus alerting rules for app/K8s/infrastructure signals. |
| `SETUP.md` | Removes Discord setup instructions; keeps Telegram/Slack. |
| `SECURITY.md` | Adds security policy and deployment best practices. |
| `LICENSE` | Adds MIT license. |
| `Dockerfile` | Updates OCI labels and description; retains kubectl install for runtime/build. |
| `CONTRIBUTING.md` | Adds contributing guide and conventions. |
| `CHANGELOG.md` | Adds changelog and release notes including AIOps/ops features. |
| `.gitignore` | Expands ignores for secrets, data dirs, caches, kubeconfig, etc. |
| `.github/workflows/ci.yml` | Adds CI workflow for lint/type/test/docker build/publish. |
| `.github/PULL_REQUEST_TEMPLATE.md` | Adds PR template. |
| `.github/ISSUE_TEMPLATE/feature_request.md` | Adds feature request template. |
| `.github/ISSUE_TEMPLATE/bug_report.md` | Adds bug report template. |
| `.env.production.example` | Expands production env template for AIOps/monitoring. |
| `.env.example` | Overhauls env template with grouped sections and added variables. |
| `.dockerignore` | Adds docker build context exclusions for secrets/data/dev artifacts. |

Comment on lines +73 to +86
def approval_message(self) -> str:
    risk_emoji = {"low": "🟡", "medium": "🟠", "high": "🔴"}[self.risk_level.value]
    lines = [
        f"{risk_emoji} **Approval Required** [{self.risk_level.value.upper()}]",
        f"",
        f"**Action:** {self.description}",
        f"**Tool:** `{self.tool_name}`",
        f"**Parameters:** `{json.dumps(self.tool_params, indent=2)}`",
        f"",
        f"Reply with **`approve {self.approval_id[:8]}`** to proceed or **`reject {self.approval_id[:8]}`** to cancel.",
        f"This request expires in {settings.approval_timeout_seconds // 60} minutes.",
    ]
    if self.risk_level == RiskLevel.HIGH:
        lines.insert(0, "⚠️ **HIGH RISK ACTION — Review carefully before approving**\n")

Copilot AI Mar 3, 2026

RiskLevel is defined separately in src/aiops/playbooks.py and src/services/approval_manager.py. Because they are different Enum classes, comparisons like self.risk_level == RiskLevel.HIGH in PendingApproval.approval_message() can silently fail if a playbook RiskLevel instance is passed in, which can suppress the HIGH-risk warning. Prefer importing and using a single shared RiskLevel enum across both modules (or normalize to strings at the boundary).

Copilot uses AI. Check for mistakes.
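
The mismatch described in this comment can be reproduced in isolation. The sketch below assumes both modules define plain `Enum` classes (a `str`-mixin enum would compare by string value instead); the class and function names are hypothetical stand-ins, not the PR's identifiers.

```python
from enum import Enum
from typing import Union


class PlaybookRiskLevel(Enum):
    """Stands in for the enum defined in src/aiops/playbooks.py (assumed plain Enum)."""
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


class ApprovalRiskLevel(Enum):
    """Stands in for the enum defined in src/services/approval_manager.py."""
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


def to_approval_risk(level: Union[Enum, str]) -> ApprovalRiskLevel:
    """Normalise any risk value to one enum by its string value (boundary conversion)."""
    value = level.value if isinstance(level, Enum) else level
    return ApprovalRiskLevel(str(value).lower())


# Members of two different plain Enum classes never compare equal,
# even when their values are identical.
print(PlaybookRiskLevel.HIGH == ApprovalRiskLevel.HIGH)
print(to_approval_risk(PlaybookRiskLevel.HIGH) is ApprovalRiskLevel.HIGH)
```

Importing one shared enum removes the need for the conversion entirely; the helper is only useful if the two modules must stay decoupled.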
Comment on lines +128 to +148
# 1. Scan for crashloop and OOMKilled pods
try:
    crash_pods = await self._k8s.get_crashloop_pods()
    for pod in crash_pods:
        key = f"pod/{pod['namespace']}/{pod['name']}"
        if key not in self._known_issues:
            self._known_issues[key] = datetime.now(timezone.utc)
            status = pod.get("status", "CrashLoopBackOff")
            severity = "critical" if "CrashLoop" in status or "OOM" in status else "warning"
            events.append(ClusterEvent(
                event_type="crash_loop" if "OOM" not in status else "oom_killed",
                severity=severity,
                namespace=pod["namespace"],
                resource_kind="Pod",
                resource_name=pod["name"],
                message=f"Pod {pod['name']} in {pod['namespace']} is {status} (restarts: {pod.get('restarts', 0)})",
                labels=pod.get("labels", {}),
            ))
        # Clear resolved pods from known issues
        elif pod.get("status") not in ("CrashLoopBackOff", "Error", "OOMKilled"):
            self._known_issues.pop(key, None)

Copilot AI Mar 3, 2026

Crash-loop pod deduplication never clears: get_crashloop_pods() only returns pods currently in CrashLoop/Error/OOMKilled, so recovered pods won’t appear in crash_pods and the elif ... not in (...) branch will never run. This means _known_issues entries for pods will persist forever, preventing future alerts for the same pod and causing unbounded growth. Track the set of currently failing pod keys and remove any pod/... keys not present each tick (similar to the node/deployment cleanup).
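
The suggested cleanup (compare the keys seen this tick against the tracked keys) might look like the sketch below. `reconcile_pod_issues` is a hypothetical helper, and the `pod/<namespace>/<name>` key format is taken from the quoted snippet; non-pod keys are left alone on the assumption that node and deployment cleanup happens elsewhere.

```python
from __future__ import annotations

from datetime import datetime, timezone


def reconcile_pod_issues(known_issues: dict[str, datetime], crash_pods: list[dict]) -> list[str]:
    """Record newly failing pods, forget recovered ones, and return the new keys."""
    current = {f"pod/{p['namespace']}/{p['name']}" for p in crash_pods}
    new_keys = sorted(current - set(known_issues))
    for key in new_keys:
        known_issues[key] = datetime.now(timezone.utc)
    # A tracked pod key missing from this tick's scan is treated as recovered,
    # so a later failure of the same pod alerts again and the dict stays bounded.
    for key in [k for k in known_issues if k.startswith("pod/") and k not in current]:
        del known_issues[key]
    return new_keys
```

Only the keys returned by the helper need a `ClusterEvent`; everything else is either already alerted or already cleared.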

Comment on lines +180 to +182
try:
    namespaces_resp = await self._k8s._core_v1.list_namespace()  # type: ignore[union-attr]
    current_failed_deployments: set[str] = set()

Copilot AI Mar 3, 2026

This tick uses self._k8s._core_v1.list_namespace() (a private attribute). Prefer a public method on KubernetesClient (or add one) to avoid coupling the watchloop to internal client implementation details.

Comment thread src/api/health.py
Comment on lines +181 to +185
try:
    redis_client = get_redis()
    keys = await redis_client.keys("approval:*")
    result["pending_approvals"] = [k.split(":")[-1] for k in keys]
except Exception:

Copilot AI Mar 3, 2026

This uses redis.keys("approval:*") again for /health/aiops. For the same reason as /health, prefer SCAN/scan_iter (or a stored counter) to avoid blocking Redis under load.
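
An incremental scan along the lines suggested here could be sketched as follows. The helper name is hypothetical; it assumes a client exposing an async `scan_iter(match=..., count=...)` generator, which is what redis-py's asyncio client provides. `FakeRedis` exists only so the sketch runs without a server.

```python
from __future__ import annotations

import asyncio
from typing import Any


async def list_pending_approvals(client: Any, batch: int = 100) -> list[str]:
    """Collect approval IDs with cursor-based SCAN instead of a blocking KEYS call."""
    ids: list[str] = []
    async for key in client.scan_iter(match="approval:*", count=batch):
        if isinstance(key, bytes):  # clients without decode_responses return bytes
            key = key.decode()
        ids.append(key.split(":")[-1])
    return ids


class FakeRedis:
    """Tiny in-memory stand-in for demonstration only (hypothetical)."""

    def __init__(self, keys: list[str]) -> None:
        self._keys = keys

    async def scan_iter(self, match: str, count: int):
        prefix = match.rstrip("*")
        for key in self._keys:
            if key.startswith(prefix):
                yield key


pending = asyncio.run(list_pending_approvals(FakeRedis(["approval:abc", "session:1", "approval:def"])))
print(pending)
```

SCAN still visits the whole keyspace, but in small batches, so other commands can run between iterations; a stored counter avoids the scan altogether.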

Comment thread docker-compose.yml
  # CLOUDFLARE_TUNNEL_TOKEN must be set in .env (never commit the real value).
  # ---------------------------------------------------------------------------
  cloudflared:
    image: cloudflare/cloudflared:latest

Copilot AI Mar 3, 2026

cloudflared uses the cloudflare/cloudflared:latest tag. Pinning to a specific version/digest is safer for production deployments (avoids unexpected breaking changes on redeploy) while still allowing planned upgrades.

Suggested change
-    image: cloudflare/cloudflared:latest
+    image: cloudflare/cloudflared:${CLOUDFLARED_VERSION:-2024.2.1}

Comment thread src/api/webhooks.py
Comment on lines +185 to +197
# Optional HMAC secret validation
if settings.alertmanager_webhook_secret:
    body_bytes = await request.body()
    expected = hmac.new(
        settings.alertmanager_webhook_secret.encode(),
        body_bytes,
        hashlib.sha256,
    ).hexdigest()
    received = x_alertmanager_secret or ""
    if not hmac.compare_digest(expected, received):
        logger.warning("alertmanager_webhook_invalid_secret")
        raise HTTPException(status_code=403, detail="Invalid webhook secret")
    body = __import__("json").loads(body_bytes)

Copilot AI Mar 3, 2026

Alertmanager secret validation computes an HMAC of the request body and expects the header to match it, but Alertmanager typically can only send a static header value (or basic auth), not a per-payload HMAC. As written, setting ALERTMANAGER_WEBHOOK_SECRET will likely reject all legitimate alerts. Consider treating X-Alertmanager-Secret as a shared static secret (direct string compare), and configure Alertmanager to send that header, or switch to basic auth/TLS client auth.
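
A static shared-secret check, as proposed in this comment, might be sketched like this. The function name is hypothetical; the comparison still goes through `hmac.compare_digest` so it runs in constant time, but nothing is derived from the request body.

```python
import hmac
from typing import Optional


def alertmanager_secret_is_valid(configured_secret: str, received_header: Optional[str]) -> bool:
    """Compare the header against the configured secret directly (no per-payload HMAC)."""
    if not configured_secret:
        # Assumption: an empty setting disables the check, mirroring the
        # "optional" validation in the quoted snippet.
        return True
    received = received_header or ""
    return hmac.compare_digest(configured_secret.encode(), received.encode())


print(alertmanager_secret_is_valid("s3cret", "s3cret"))
print(alertmanager_secret_is_valid("s3cret", None))
```

A caller would raise the 403 when this returns `False`; because the secret travels in a header, this scheme only makes sense over TLS or a private network.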

Comment on lines +761 to +768
async def _kubectl_drain_node(self, args: Dict[str, Any]) -> str:
    """Drain all pods from a node."""
    node_name = args["node_name"]
    ignore_daemonsets = args.get("ignore_daemonsets", True)
    cmd = ["drain", node_name, "--delete-emissary-data", "--timeout=120s"]
    if ignore_daemonsets:
        cmd.append("--ignore-daemonsets")
    return await self._run_kubectl(cmd)

Copilot AI Mar 3, 2026

kubectl drain is invoked with --delete-emissary-data, which is not a valid kubectl flag (the standard flag is --delete-emptydir-data). This will cause drains to fail at runtime. Update the arguments to valid kubectl drain options (and consider exposing/controlling any destructive flags via input params).
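
One way to act on both halves of this comment is to build the argument list in a small pure function, with the destructive flag opt-in. This is a sketch: `build_drain_command` and its parameter names are hypothetical, while `--delete-emptydir-data`, `--ignore-daemonsets`, and `--timeout` are standard `kubectl drain` flags.

```python
from __future__ import annotations


def build_drain_command(
    node_name: str,
    ignore_daemonsets: bool = True,
    delete_emptydir_data: bool = False,
    timeout_seconds: int = 120,
) -> list[str]:
    """Return kubectl arguments for draining a node."""
    cmd = ["drain", node_name, f"--timeout={timeout_seconds}s"]
    if ignore_daemonsets:
        cmd.append("--ignore-daemonsets")
    if delete_emptydir_data:
        # Destructive: pods using emptyDir volumes lose that data when evicted.
        cmd.append("--delete-emptydir-data")
    return cmd


print(build_drain_command("node-1"))
print(build_drain_command("node-1", delete_emptydir_data=True))
```

Keeping the builder separate from `_run_kubectl` makes the flag set unit-testable without a cluster.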

Comment thread src/monitoring/metrics.py
Comment on lines +7 to +8
from prometheus_client import Counter, Gauge, Histogram, Info, REGISTRY, CollectorRegistry
from prometheus_client.core import GaugeMetricFamily

Copilot AI Mar 3, 2026

This module imports several Prometheus classes that are never used (REGISTRY, CollectorRegistry, GaugeMetricFamily). With default ruff/pyflakes settings, unused imports will fail CI. Remove unused imports or use them.

Comment thread docs/hld.d2
Comment on lines +106 to +114
webhook_endpoints: "POST /api/webhook/{slack,telegram}" {
  shape: rectangle
  style.fill: "#c8e6c9"
}
alertmanager_hook: "POST /api/alert/webhook\n(Alertmanager ingest)" {
  shape: rectangle
  style.fill: "#c8e6c9"
}
rate_limiter: "Rate Limiter\n(slowapi)" {

Copilot AI Mar 3, 2026

The HLD diagram documents the Alertmanager webhook endpoint as POST /api/alert/webhook, but the actual FastAPI route is POST /api/webhook/alertmanager. This mismatch can cause incorrect deployments and failed alert delivery; update the diagram to match the implemented endpoint.

Comment thread src/aiops/playbooks.py
Comment on lines +333 to +344
if self._approval:
    try:
        await self._approval.request_approval(
            tool_name=step.tool_name,
            tool_params=params,
            risk_level=step.risk_level,
            description=step.description,
            requested_by=requested_by,
            channel_type=channel_type,
            channel_target=channel_target,
            playbook_run_id=run.run_id,
        )

Copilot AI Mar 3, 2026

When requesting approval for MEDIUM/HIGH steps, request_approval() is called without send_message_callback, so the user never receives the approval prompt (and thus has no approval ID to respond with). Wire a send callback here (e.g., via the router) so an approval request message is actually posted to channel_target.
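
The effect of the missing callback can be illustrated with a stripped-down stand-in. Everything below is hypothetical: `ApprovalManagerSketch`, the `(channel_target, text)` callback shape, and the prompt wording are assumptions for illustration, with only the `send_message_callback` parameter name taken from this comment.

```python
from __future__ import annotations

import asyncio
import uuid
from typing import Awaitable, Callable, Optional

# (channel_target, text) -> None; a hypothetical callback shape for this sketch.
SendMessage = Callable[[str, str], Awaitable[None]]


class ApprovalManagerSketch:
    """Hypothetical stand-in for the PR's ApprovalManager."""

    async def request_approval(
        self,
        *,
        description: str,
        channel_target: str,
        send_message_callback: Optional[SendMessage] = None,
    ) -> str:
        approval_id = uuid.uuid4().hex
        if send_message_callback is not None:
            prompt = f"Approval required: {description}. Reply with approve {approval_id[:8]}"
            await send_message_callback(channel_target, prompt)
        return approval_id


async def demo() -> list[tuple[str, str]]:
    sent: list[tuple[str, str]] = []

    async def send(target: str, text: str) -> None:
        sent.append((target, text))  # a real router would post to Slack/Telegram here

    manager = ApprovalManagerSketch()
    # Without a callback the approval is created but nobody is told about it.
    await manager.request_approval(description="restart deployment web", channel_target="C123")
    # With a callback the prompt, including the short approval ID, reaches the channel.
    await manager.request_approval(
        description="restart deployment web",
        channel_target="C123",
        send_message_callback=send,
    )
    return sent


messages = asyncio.run(demo())
print(len(messages))
```

In the executor, the callback would typically be bound once at construction time (for example from the channel router) rather than passed at every call site.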
