Benchmarking Long-Running Agents on Project-Scale Software Engineering
Benchmarks for AI coding agents, such as SWE-Bench, focus on relatively small, well-scoped tasks like fixing individual pull requests. While useful, they fall short of capturing real-world software engineering, where projects span long time horizons and require continuous, structured development.
Hackathon-24h is a benchmark designed to evaluate agents in project-scale software engineering settings. Instead of short, isolated tasks, Hackathon-24h challenges agents to operate continuously for 24 hours or more, tackling complex objectives such as building a database system from scratch. Along the way, agents are expected not only to make progress, but also to deliver periodic, usable releases, mirroring real development workflows.
Peak scores (%) achieved within a 24-hour window. All runs use Claude Code as the harness framework. Number in parentheses = total test points per task.
| Model | Harness | Database (48) | Message Queue (30) | HTTP Server (47) | Redis KV (132) |
|---|---|---|---|---|---|
| Kimi-K2.5 | Claude Code | 4.2% | 26.7% | 87.2% | 89.4% |
| Qwen3.5-Plus | Claude Code | 14.6% | 16.7% | 72.3% | 93.2% |
| MiniMax-M2.5 | Claude Code | 0% | 0% | 70.2% | 84.1% |
| Feature | Description |
|---|---|
| Four System Types | Database, message queue, HTTP server, Redis KV store -- each with protocol-level complexity |
| Pluggable Agent Harnesses | Codex, Claude Code, OpenCode, Cline |
| Universal Provider Support | Native APIs, gateways, and proxy providers with automatic key rotation |
| Tiered Benchmarks | Progressive difficulty tiers (L0--L6) with industry-standard evaluation suites |
| Docker Isolation | Sandboxed agent execution with minimal privileges |
| Automated Feedback Loop | Anonymized benchmark briefs drive iterative improvement |
| Extensible Design | Add new system types, benchmarks, or agents via registry pattern |
- Python 3.10+ (< 3.13)
- Docker
git clone https://github.com/zzq2000/Hackathon-24h-Bench.git && cd Hackathon-24h-Bench && bash ./setup.shcp .env.template .env # fill in your API key(s)./start_system_docker.sh --agent claude --system-type database --model <your-model>The Docker image builds automatically on first run. To monitor or stop a run:
./stop_system_docker.sh --listHackathon-24h currently comprises four tasks. We chose these systems deliberately: each comes with a precise, publicly documented specification, has established industrial test suites as external judges, supports incremental progress rather than a single pass-or-fail outcome, and is complex enough that regressions are easy to introduce but hard to detect.
Each task is organised into a tier ladder of increasing difficulty, scored against established industrial benchmarks. We wrote none of the test cases ourselves.
| Task | Tiers | Test Cases | Evaluation Suites | Parameter |
|---|---|---|---|---|
| Database | L0--L3 | 48 | Sysbench, TPC-C, TPC-H | --system-type database |
| Message Queue | L0--L2 | 30 | pika, OMQ, RabbitMQ PerfTest | --system-type message_queue |
| HTTP Server | L0--L4 | 47 | h1spec, CISPA, TFB | --system-type http_server |
| Redis KV Store | L0--L6 | 132 | Redis TCL, YCSB, memtier | --system-type redis_kvstore |
The agent loop runs inside a Docker container: a coding agent receives a reference specification and an initial prompt (without starter code), then continuously builds the system with no time or token limits. The test loop runs on the host: it periodically snapshots the workspace, runs benchmarks in isolation, and feeds back an anonymised brief. The two loops communicate only through the shared workspace and the brief; they never run in the same environment.
The agent learns which areas passed or failed, but never sees specific test cases. This prevents hacking specific tests and instead pushes the agent to develop real capability. The feedback is intentionally slightly stale, so the agent cannot rely on it blindly -- it must reconcile this delayed signal with its own local observations and reasoning.
| Agent | CLI Tool | Providers |
|---|---|---|
| Codex | codex |
OpenAI, LiteLLM, OpenRouter |
| Claude Code | claude |
Anthropic, Z.AI Gateway, Bailian Gateway, Moonshot Gateway, MiniMax Gateway, LiteLLM, OpenRouter |
| OpenCode | opencode |
OpenAI, Anthropic, Google, X.AI, OpenRouter, DashScope, Bailian Gateway, LiteLLM, Azure OpenAI, vLLM, SGLang |
| Cline | cline |
Anthropic, OpenAI, Bailian OpenAI Gateway, LiteLLM, OpenRouter, vLLM, SGLang |
We evaluated three models (Kimi-K2.5, Qwen3.5-Plus, MiniMax-M2.5) across all four tasks using Claude Code as the agent framework. Each run had a 24-hour window. We also compared three harness frameworks on the HTTP task with the same underlying model (Kimi-K2.5). In the figures below, legend entries follow the format harness_model (e.g. claude_kimi-k2.5 means Claude Code + Kimi-K2.5).
Developing a database system is the hardest task. Qwen3.5-Plus peaked at 14.6%; Kimi-K2.5 reached 4.2% but regressed to 0%.
Kimi-K2.5 spent 14.5 hours at 0% before climbing to 26.7%. MiniMax-M2.5 completed 85 iterations and marked 28 plan tasks "DONE" while scoring 0% on all 47 test cycles.
Kimi-K2.5 reached 87.2% and recovered from regressions. MiniMax-M2.5 peaked at 70.2% then regressed to 55.3% without noticing.
The easiest task. Qwen3.5-Plus reached 93.2% in only 29 iterations. All three models scored above 84%.
To isolate the impact of the agent framework itself, we run the same underlying model across multiple harnesses under identical conditions. Despite sharing the exact same model, we observe a ~13-point gap in final 24-hour scores: Claude Code reaches 85.1%, OpenCode 80.9%, and Cline 72.3%.
| Agent | Iterations (24h) | Peak Score | Final Score |
|---|---|---|---|
| Claude Code | 188 | 87.2% (41/47) | 85.1% (40/47) |
| OpenCode | 180 | 80.9% (38/47) | 80.9% (38/47) |
| Cline | 40 | 72.3% (34/47) | 72.3% (34/47) |
Hackathon-24h is open source and actively growing. We welcome contributions in several areas:
- LLM API sponsorship: Help expand coverage by sponsoring API access for evaluating more models
- Harness integrations: Add support for new agent frameworks and execution environments
- Model evaluations: Run your own models and contribute results to the leaderboard
- Benchmark improvements: Extend the benchmark with new tasks or improve the evaluation and scoring system
Please contact Ziqin or Zewen for further contribution details.
Create a .env file (never commit this) with your API keys. The .env.template generated by setup.sh contains all supported variables. Below is a summary by category:
| Variable | Agent | Description |
|---|---|---|
OPENAI_API_KEY |
Codex, OpenCode, Cline | OpenAI native API key |
ANTHROPIC_API_KEY |
Claude, OpenCode, Cline | Anthropic native API key |
OPENCODE_API_KEY |
OpenCode | OpenCode API key (for Bailian, OpenRouter, Moonshot, MiniMax) |
KIMI_API_KEY |
Claude, Cline | Moonshot / Kimi API key |
MINIMAX_API_KEY |
Claude, Cline | MiniMax API key |
| Variable | Gateway / Platform | Used By |
|---|---|---|
ANTHROPIC_AUTH_TOKEN |
Bailian / Z.AI / Moonshot / MiniMax Claude Gateway | Claude |
ANTHROPIC_BASE_URL |
Gateway endpoint override | Claude |
DASHSCOPE_API_KEY |
DashScope / Bailian | Claude, Cline, OpenCode |
CLINE_API_KEY |
Cline generic key override | Cline |
CLINE_BASE_URL |
Cline base URL override | Cline |
CLINE_PROVIDER |
Cline provider selector (anthropic, openai, bailian) |
Cline |
| Variable | Provider |
|---|---|
OPENROUTER_API_KEY |
OpenRouter |
LITELLM_API_KEY / LITELLM_BASE_URL |
LiteLLM Proxy |
VLLM_API_KEY / VLLM_BASE_URL |
vLLM |
SGLANG_API_KEY / SGLANG_BASE_URL |
SGLang |
| File | Purpose |
|---|---|
config/global.yaml |
Scheduler timing, agent settings, test parameters |
config/providers.yaml |
Provider/model definitions, API routing |
config/system.<type>.yaml |
Per-system benchmark defaults and tier ladders |
| Variable | Description | Default |
|---|---|---|
CODE_AGENT_WAIT_TIME |
Seconds between improvement iterations | 60 |
CODE_AGENT_MAX_RETRIES |
Retry count for agent commands | from config |
CODE_AGENT_COMMAND_TIMEOUT |
Timeout per agent command (seconds) | from config |
TEST_INTERVAL |
Test runner interval (seconds, start-to-start) | 900 |
BENCHMARK_TIMEOUT |
Per-benchmark timeout (seconds) | 900 |
Hackathon-24h-Bench/
├── sut/ # System Under Test abstraction layer
├── benchmarks/ # Benchmark runner implementations
├── feedback/ # Anonymized feedback formatting
├── test_runner/ # Test orchestration and scheduling
├── code_agent_controller/ # AI agent controller and agent implementations
│ ├── agents/ # Agent adapters (claude, codex, opencode, cline, ...)
│ └── providers/ # Provider registry and key rotation
├── metrics/ # Metrics aggregation and reporting
├── config/ # YAML configuration files
├── workspace/ # Agent working directories (per system/run)
├── start_system_docker.sh # Docker-mode startup
└── stop_system_docker.sh # Docker-mode shutdown
- Never commit API keys to version control -- use .env (gitignored) for all secrets
- Hackathon-24h runs are resource-intensive and consume a large number of tokens -- monitor your API usage and set spending limits before starting a 24-hour run
- Agents should only execute inside Docker containers -- never run code agents outside of Docker to avoid unintended modifications to your host system
If you use Hackathon-24h in your research, please cite:
@misc{zhu2026hackathon24hbench,
title = {Hackathon-24h Bench: Benchmarking Long-Running Agents
on Project-Scale Software Engineering},
author = {Zhu, Ziqin and Chi, Zewen and Witbrock, Michael and Dong, Li and Wei, Furu
and Liu, Qian},
year = {2026},
howpublished = {\url{https://zhuziqin-blogs.notion.site/hackathon-24h-bench}}
}This project is licensed under the Apache License 2.0.





