---

## 📊 Experimental Results

We evaluate **DeepCode** on the [*PaperBench*](https://openai.com/index/paperbench/) benchmark (released by OpenAI), a rigorous testbed requiring AI agents to independently reproduce 20 ICML 2024 papers from scratch. The benchmark comprises 8,316 gradable components assessed using SimpleJudge with hierarchical weighting.
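
To give an intuition for how hierarchical weighting aggregates thousands of small grades into one replication score, here is a minimal, illustrative sketch. It assumes a rubric tree whose leaves are binary-graded requirements and whose parent scores are weighted averages of their children; the `RubricNode` class, its fields, and the toy weights below are hypothetical and are not PaperBench's or SimpleJudge's actual API.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class RubricNode:
    """One node of a (hypothetical) grading rubric tree."""
    name: str
    weight: float = 1.0                    # relative weight among siblings
    grade: Optional[float] = None          # 0/1 grade for leaf requirements
    children: List["RubricNode"] = field(default_factory=list)

    def score(self) -> float:
        if not self.children:              # leaf: an individually graded component
            return self.grade or 0.0
        total = sum(c.weight for c in self.children)
        # A parent's score is the weighted average of its children's scores.
        return sum(c.weight * c.score() for c in self.children) / total

# Toy rubric with unequally weighted criteria (illustrative numbers only).
rubric = RubricNode("paper", children=[
    RubricNode("code_development", weight=3, children=[
        RubricNode("data_loader", grade=1.0),
        RubricNode("training_loop", grade=0.0),
    ]),
    RubricNode("documentation", weight=1, children=[
        RubricNode("readme", grade=1.0),
    ]),
])

print(f"replication score: {rubric.score():.3f}")  # (3 * 0.5 + 1 * 1.0) / 4 = 0.625
```

Partial credit on any leaf propagates proportionally up the tree, which is why a benchmark with 8,316 components can still be summarized as a single percentage per paper.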

Our experiments compare DeepCode against four baseline categories: **(1) Human Experts**, **(2) State-of-the-Art Commercial Code Agents**, **(3) Scientific Code Agents**, and **(4) LLM-Based Agents**.

### ① 🧠 Human Expert Performance (Top Machine Learning PhD)

**DeepCode: 75.9% vs. Top Machine Learning PhD: 72.4% (+3.5 pp)**

DeepCode achieves **75.9%** on the 3-paper human evaluation subset, surpassing the best-of-3 human expert baseline (**72.4%**) by **+3.5 percentage points**. This demonstrates that our framework not only matches but exceeds expert-level code reproduction capabilities, representing a significant milestone in autonomous scientific software engineering.

### ② 💼 State-of-the-Art Commercial Code Agents

**DeepCode: 84.8% vs. Best Commercial Agent: 58.7% (+26.1 pp)**

On the 5-paper subset, DeepCode substantially outperforms leading commercial coding tools:
- Cursor: 58.4%
- Claude Code: 58.7%
- Codex: 40.0%
- **DeepCode: 84.8%**

This represents a **+26.1 percentage-point improvement** over the best commercial agent (Claude Code). All commercial agents utilize Claude Sonnet 4.5-thinking or GPT-5 Codex-high, highlighting that **DeepCode's superior architecture**, rather than base model capability, drives this performance gap.

### ③ 🔬 Scientific Code Agents

**DeepCode: 73.5% vs. PaperCoder: 51.1% (+22.4 pp)**

Compared to PaperCoder (**51.1%**), the state-of-the-art scientific code reproduction framework, DeepCode achieves **73.5%**, an improvement of **+22.4 percentage points**. This substantial margin validates our multi-module architecture combining planning, hierarchical task decomposition, code generation, and iterative debugging over simpler pipeline-based approaches.

### ④ 🤖 LLM-Based Agents

**DeepCode: 73.5% vs. Best LLM Agent: 43.3% (+30.2 pp)**

DeepCode significantly outperforms all tested LLM agents:
- Claude 3.5 Sonnet + IterativeAgent: 27.5%
- o1 + IterativeAgent (36 hours): 42.4%
- o1 + BasicAgent: 43.3%
- **DeepCode: 73.5%**

The **+30.2 percentage-point improvement** over the best-performing LLM agent demonstrates that sophisticated agent scaffolding, rather than extended inference time or larger models alone, is critical for complex code reproduction tasks.

<div align="center">
<img src='./assets/result_main.jpg' /><br>
</div>
<br/>

---

### 🎯 **Autonomous Self-Orchestrating Multi-Agent Architecture**

**The Challenges**: