|
136 | 136 | - [📰 News](#-news) |
137 | 137 | - [🚀 Key Features](#-key-features) |
138 | 138 | - [🏗️ Architecture](#️-architecture) |
| 139 | +- [📊 Experimental Results](#-experimental-results) |
139 | 140 | - [🚀 Quick Start](#-quick-start) |
140 | 141 | - [💡 Examples](#-examples) |
141 | 142 | - [🎬 Live Demonstrations](#-live-demonstrations) |
|
154 | 155 | - 🔬 **Advances Scientific Code Generation**: **+22.4%** improvement over PaperCoder, the previous SOTA scientific code agent |
155 | 156 | - 🚀 **Beats LLM-Based Agents**: **+30.2%** improvement over best LLM agent frameworks, demonstrating the power of sophisticated agent architecture |
156 | 157 |
|
157 | | -<div align="center"> |
158 | | - <img src='./assets/result_main.jpg' /><br> |
159 | | -</div> |
160 | | - |
161 | 158 | --- |
162 | 159 |
|
163 | 160 | ## 🚀 Key Features |
@@ -462,6 +459,33 @@ Implementation Generation • Testing • Documentation |
462 | 459 |
|
463 | 460 | --- |
464 | 461 |
|
| 462 | + |
| 463 | +## 📊 Experimental Results |
| 464 | + |
| 465 | +We evaluate **DeepCode** on the PaperBench Code-Dev benchmark, a rigorous testbed requiring AI agents to independently reproduce 20 ICML 2024 papers from scratch. The benchmark comprises 8,316 gradable components assessed using SimpleJudge with hierarchical weighting. Our experiments compare DeepCode against four baseline categories: (1) Human Experts, (2) Commercial Code Agents, (3) Scientific Code Agents, and (4) LLM-Based Agents. |
| 466 | + |
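| | +To make the scoring scheme concrete, the sketch below shows hierarchical weighting in miniature: leaf requirements are graded individually, and every internal node rolls its children up as a weight-normalized average. It is an illustrative example only, not the actual SimpleJudge implementation; all node names, weights, and leaf scores are hypothetical. |
| | + |
| | +```python |
| | +from dataclasses import dataclass, field |
| | +from typing import List, Optional |
| | + |
| | +@dataclass |
| | +class RubricNode: |
| | +    """One node of a hierarchical grading rubric.""" |
| | +    name: str |
| | +    weight: float = 1.0             # relative weight among siblings |
| | +    score: Optional[float] = None   # leaf score in [0, 1]; None for internal nodes |
| | +    children: List["RubricNode"] = field(default_factory=list) |
| | + |
| | +def aggregate(node: RubricNode) -> float: |
| | +    """Leaves return their own score; internal nodes return the |
| | +    weight-normalized average of their children's aggregated scores.""" |
| | +    if not node.children: |
| | +        return node.score if node.score is not None else 0.0 |
| | +    total = sum(child.weight for child in node.children) |
| | +    return sum(child.weight * aggregate(child) for child in node.children) / total |
| | + |
| | +# Hypothetical miniature rubric for a single paper reproduction |
| | +paper = RubricNode("paper", children=[ |
| | +    RubricNode("data pipeline", weight=1.0, children=[ |
| | +        RubricNode("dataset loading", score=1.0), |
| | +        RubricNode("preprocessing", score=0.5), |
| | +    ]), |
| | +    RubricNode("core method", weight=2.0, children=[ |
| | +        RubricNode("model architecture", score=1.0), |
| | +        RubricNode("training loop", score=0.0), |
| | +    ]), |
| | +]) |
| | + |
| | +print(f"Replication score: {aggregate(paper):.1%}")  # 58.3% |
| | +``` |
| | + |
| | +Real rubrics in the benchmark are far deeper than this toy tree; the 8,316 gradable components mentioned above correspond, roughly, to the leaves of such rubric trees across all 20 papers. |
| | +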
| 467 | +### ① Human Expert Performance (Top ML PhD) |
| 468 | + |
| 469 | +DeepCode achieves **75.9%** on the 3-paper human evaluation subset (with Claude Sonnet 4.5-thinking), surpassing the best-of-3 human expert baseline (**72.4%**) by **+3.5 percentage points**. This demonstrates that our framework not only matches but exceeds expert-level code reproduction capabilities, representing a significant milestone in autonomous scientific software engineering. |
| 470 | + |
| 471 | +### ② Commercial Code Agents |
| 472 | + |
| 473 | +On the 5-paper subset, DeepCode achieves **84.8%**, substantially outperforming leading commercial coding tools: **Cursor** (58.4%), **Claude Code** (58.7%), and **Codex** (40.0%). This is a margin of **+26.1 percentage points** over the best commercial agent (Claude Code). Cursor and Claude Code run on Claude Sonnet 4.5-thinking and Codex on GPT-5 Codex-high, highlighting that DeepCode's agent architecture, rather than base model capability, drives the performance gap. |
| 474 | + |
| 475 | +### ③ Scientific Code Agents |
| 476 | + |
| 477 | +Compared to PaperCoder (**51.1%**), previously the state-of-the-art scientific code reproduction framework, DeepCode achieves **73.5%**, an improvement of **+22.4 percentage points**. This substantial margin validates our multi-module architecture, which combines planning, hierarchical task decomposition, code generation, and iterative debugging, over simpler pipeline-based approaches. |
| 478 | + |
| 479 | +### ④ LLM-Based Agents |
| 480 | + |
| 481 | +DeepCode (**73.5%**) significantly outperforms all tested LLM agents, including Claude 3.5 Sonnet with IterativeAgent (27.5%), o1 with IterativeAgent at 36 hours (42.4%), and o1 with BasicAgent (43.3%). The **+30.2 percentage point** improvement over the best-performing LLM agent demonstrates that sophisticated agent scaffolding, rather than extended inference time or larger models alone, is critical for complex code reproduction tasks. |
| 482 | + |
| 483 | +<div align="center"> |
| 484 | + <img src='./assets/result_main.jpg' /><br> |
| 485 | +</div> |
| 486 | +<br/> |
| 487 | + |
| 488 | + |
465 | 489 | ## 🚀 Quick Start |
466 | 490 |
|
467 | 491 |
|
|