
Commit b3f414e

Update README.md
1 parent 914343e commit b3f414e

File tree

1 file changed (+28, -4 lines)


README.md

Lines changed: 28 additions & 4 deletions
@@ -136,6 +136,7 @@
 - [📰 News](#-news)
 - [🚀 Key Features](#-key-features)
 - [🏗️ Architecture](#️-architecture)
+- [📊 Experimental Results](#-experimental-results)
 - [🚀 Quick Start](#-quick-start)
 - [💡 Examples](#-examples)
 - [🎬 Live Demonstrations](#-live-demonstrations)
@@ -154,10 +155,6 @@
 - 🔬 **Advances Scientific Code Generation**: **+22.4%** improvement over PaperCoder, the previous SOTA scientific code agent
 - 🚀 **Beats LLM-Based Agents**: **+30.2%** improvement over best LLM agent frameworks, demonstrating the power of sophisticated agent architecture

-<div align="center">
-<img src='./assets/result_main.jpg' /><br>
-</div>
-
 ---

 ## 🚀 Key Features
@@ -462,6 +459,33 @@ Implementation Generation • Testing • Documentation

 ---

+
+## 📊 Experimental Results
+
+We evaluate **DeepCode** on the PaperBench Code-Dev benchmark, a rigorous testbed requiring AI agents to independently reproduce 20 ICML 2024 papers from scratch. The benchmark comprises 8,316 gradable components assessed using SimpleJudge with hierarchical weighting. Our experiments compare DeepCode against four baseline categories: (1) Human Experts, (2) Commercial Code Agents, (3) Scientific Code Agents, and (4) LLM-Based Agents.
+
+### ① Human Expert Performance (Top ML PhD)
+
+DeepCode achieves **75.9%** on the 3-paper human evaluation subset (with Claude Sonnet 4.5-thinking), surpassing the best-of-3 human expert baseline (**72.4%**) by **+3.5 percentage points**. This demonstrates that our framework not only matches but exceeds expert-level code reproduction capabilities, a significant milestone in autonomous scientific software engineering.
+
+### ② Commercial Code Agents
+
+On the 5-paper subset, DeepCode substantially outperforms leading commercial coding tools: **Cursor** (58.4%), **Claude Code** (58.7%), and **Codex** (40.0%). DeepCode achieves **84.8%**, a **+26.1% improvement** over the best commercial agent (Claude Code). Cursor and Claude Code run on Claude Sonnet 4.5-thinking, and Codex runs on GPT-5 Codex-high, highlighting that DeepCode's architecture, rather than base model capability, drives this performance gap.
+
+### ③ Scientific Code Agent
+
+Compared to PaperCoder (**51.1%**), the state-of-the-art scientific code reproduction framework, DeepCode achieves **73.5%**, a **+22.4% improvement**. This substantial margin validates our multi-module architecture, which combines planning, hierarchical task decomposition, code generation, and iterative debugging, over simpler pipeline-based approaches.
+
+### ④ LLM-Based Agents
+
+DeepCode (**73.5%**) significantly outperforms all tested LLM agents, including Claude 3.5 Sonnet with IterativeAgent (27.5%), o1 with IterativeAgent at 36 hours (42.4%), and o1 BasicAgent (43.3%). The **+30.2% improvement** over the best-performing LLM agent demonstrates that sophisticated agent scaffolding, rather than extended inference time or larger models alone, is critical for complex code reproduction tasks.
+
+<div align="center">
+<img src='./assets/result_main.jpg' /><br>
+</div>
+<br/>
+
+
 ## 🚀 Quick Start

