Logging, health checks, system status, and cost visibility for HomeAgent.
All logging uses structlog with JSON output in production and human-readable console output in development. Plain print() and logging.info() are not used directly — all log calls go through structlog.
```json
{
  "timestamp": "2026-03-01T08:32:11.123Z",
  "level": "info",
  "event": "agent_run_complete",
  "user_id": "abc123",
  "household_id": "xyz",
  "model": "claude-sonnet-4-5",
  "duration_ms": 1243,
  "tokens_input": 2847,
  "tokens_output": 312,
  "tools_called": ["homey_device_set_capability"],
  "trace_id": "uuid"
}
```

The console format is human-readable with colour highlighting, enabled by LOG_FORMAT=console in .env.
| Level | When to use |
|---|---|
| DEBUG | Detailed internals — prompt assembly, tool args, memory retrieval scores |
| INFO | Normal operations — agent run, tool call, message received |
| WARNING | Degraded state — Homey unreachable, stale cache, memory retrieval skipped |
| ERROR | Failures requiring attention — DB write failed, LLM provider down, scheduler crash |
| CRITICAL | Service-level failures — both LLM providers down, DB unreadable |
Set via LOG_LEVEL in .env.
Each incoming webhook request is assigned a trace_id (UUID) at the FastAPI middleware layer. All log entries within that request share the same trace_id, making it easy to follow a single conversation turn through the logs.
GET /health returns component status. Used by Docker healthcheck and optionally by external monitoring.
Healthy response (HTTP 200):
```json
{
  "status": "healthy",
  "version": "0.1.0",
  "uptime_seconds": 86400,
  "components": {
    "db_users": "ok",
    "db_memory": "ok",
    "db_cache": "ok",
    "mcp_homey": "ok",
    "mcp_prom": "ok",
    "mcp_tools": "ok",
    "scheduler": "ok"
  }
}
```

Degraded response (HTTP 200, status = "degraded"):
```json
{
  "status": "degraded",
  "components": {
    "db_users": "ok",
    "db_memory": "ok",
    "db_cache": "ok",
    "mcp_homey": "disconnected",
    "mcp_prom": "ok",
    "mcp_tools": "ok",
    "scheduler": "ok"
  }
}
```

The implementation returns "healthy" when all three DBs are reachable and Homey MCP is connected. It returns "degraded" if any DB is unreachable or if Homey MCP is disconnected.
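The healthy/degraded decision can be sketched as a pure function over the component map. The component names follow the example payloads above; the function name overall_status is an assumption for illustration:

```python
def overall_status(components: dict[str, str]) -> str:
    # "healthy" requires all three DBs reachable and Homey MCP connected;
    # any other combination is reported as "degraded".
    critical = ("db_users", "db_memory", "db_cache", "mcp_homey")
    if all(components.get(name) == "ok" for name in critical):
        return "healthy"
    return "degraded"
```

Note that the endpoint returns HTTP 200 in both cases, so monitoring should inspect the status field rather than the status code alone.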
Docker Compose healthcheck:
```yaml
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
  interval: 30s
  timeout: 5s
  retries: 3
  start_period: 15s
```

Any admin can send /status to get a real-time summary in Telegram:
```
Status:
Scheduler      : ok
Homey MCP      : ok
Prometheus MCP : ok
Tools MCP      : ok
```
Every agent run logs tokens_used: {input: N, output: N} and model_used to agent_run_log. This is the raw data for cost estimates and run analysis.
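Turning the per-run token counts into a dollar figure is simple multiplication against the per-token rates configured in .env (see below). A minimal sketch; the function name estimate_cost_usd is an assumption:

```python
def estimate_cost_usd(
    tokens_input: int,
    tokens_output: int,
    rate_input: float,
    rate_output: float,
) -> float:
    # Rates are $ per token, matching the COST_ESTIMATE_* values in .env.
    return tokens_input * rate_input + tokens_output * rate_output


# The example log entry above: 2847 input / 312 output tokens at Sonnet rates.
cost = estimate_cost_usd(2847, 312, 0.000003, 0.000015)  # ≈ $0.0132
```

Summing this over all rows in agent_run_log for a week yields the figures shown in the weekly summary.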
A scheduled job runs every Monday at 08:00 (household timezone) and sends a summary to all admins:
```
HomeAgent weekly summary — week of 24 Feb

Conversations: 47 (↑12 from last week)
LLM calls: 83
  Claude Sonnet 4.5: 61 calls, ~1.2M tokens
  Claude Haiku 4.5: 22 calls, ~340K tokens
  GPT-4o: 0 calls (fallback unused)
Estimated cost: ~$1.84
  (Claude: ~$1.61, OpenAI embeddings: ~$0.23)

Home actions: 34
  Confirmed: 5, Immediate: 29
  Failures: 0

Top users this week:
  Kristian: 28 messages
  Emma: 12 messages
  Sofie: 7 messages
```
Cost estimates use fixed per-token rates configured in .env. They are estimates only — check your Anthropic and OpenAI dashboards for exact billing.
```
COST_ESTIMATE_CLAUDE_SONNET_INPUT=0.000003    # $ per token
COST_ESTIMATE_CLAUDE_SONNET_OUTPUT=0.000015
COST_ESTIMATE_CLAUDE_HAIKU_INPUT=0.00000025
COST_ESTIMATE_CLAUDE_HAIKU_OUTPUT=0.00000125
COST_ESTIMATE_GPT4O_INPUT=0.0000025
COST_ESTIMATE_GPT4O_OUTPUT=0.00001
COST_ESTIMATE_EMBEDDING=0.00000002
```

The admin dashboard receives real-time events via Server-Sent Events. Key event types:
| Event type | Payload | When |
|---|---|---|
| run.start | {user_id, household_id} | Agent run begins |
| run.complete | {model, tokens_input, tokens_output, duration_ms} | Agent run finishes |
| run.background_error | {task, error} | Fire-and-forget background task failed |
| world.update | {entity_type, action, name} | World model entity created/updated |
| job.scheduled | {job_id, trigger_time} | Scheduler job added |
| memory.stored | {importance, source_run_id} | Episodic memory saved |
The ring buffer holds the last 150 events. Subscriber queues (maxsize=200) log a warning on overflow rather than silently dropping.
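The ring-buffer-plus-bounded-queues design can be sketched with a deque and asyncio queues. This is a minimal sketch; the class name EventBus and the logging call are assumptions (the real service logs via structlog):

```python
import asyncio
import logging
from collections import deque

logger = logging.getLogger("homeagent.events")  # real code would use structlog


class EventBus:
    """Fan-out bus: ring buffer for replay, bounded per-subscriber queues."""

    def __init__(self, ring_size: int = 150, queue_size: int = 200) -> None:
        self.ring: deque = deque(maxlen=ring_size)  # oldest events fall off
        self.queue_size = queue_size
        self.subscribers: list[asyncio.Queue] = []

    def subscribe(self) -> asyncio.Queue:
        q: asyncio.Queue = asyncio.Queue(maxsize=self.queue_size)
        self.subscribers.append(q)
        return q

    def publish(self, event: dict) -> None:
        self.ring.append(event)
        for q in self.subscribers:
            try:
                q.put_nowait(event)
            except asyncio.QueueFull:
                # Overflow is logged, not silently dropped.
                logger.warning("subscriber queue full, dropping event")
```

A new SSE subscriber can first be replayed the ring buffer's contents, then stream live events from its own queue.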
Logs are written to stdout and captured by Docker. Structured runtime events are also persisted in event_log and agent_run_log in cache.db, with cleanup jobs handling retention.
For file-based log archiving, configure Docker's log driver (e.g. json-file with max-size and max-file options in docker-compose.yml).
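A possible docker-compose.yml fragment for bounded log files, assuming the default json-file driver (sizes are illustrative):

```yaml
services:
  homeagent:
    logging:
      driver: json-file
      options:
        max-size: "10m"   # rotate after 10 MB
        max-file: "5"     # keep at most 5 rotated files
```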
If you want uptime alerting without checking Telegram, the /health endpoint can be polled by:
- UptimeRobot (free tier, checks every 5 min, alerts via email/Telegram)
- Healthchecks.io (ping-based, great for scheduled jobs too)
- Prometheus + Grafana (overkill for home use, but possible)
A GET /metrics endpoint in Prometheus format is not implemented by default but can be added via the prometheus-fastapi-instrumentator library if needed.