---

## 📊 Experimental Results

We evaluate **DeepCode** on the [*PaperBench*](https://openai.com/index/paperbench/) benchmark (released by OpenAI), a rigorous testbed requiring AI agents to independently reproduce 20 ICML 2024 papers from scratch. The benchmark comprises 8,316 gradable components assessed using SimpleJudge with hierarchical weighting.
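
To make the hierarchical weighting concrete, the sketch below shows one way a weighted rubric tree of gradable components could be rolled up into a single reproduction score. It is an illustrative assumption on our part: the node names, weights, and scoring logic are invented and do not reproduce the actual PaperBench rubric or the SimpleJudge implementation.

```python
# Illustrative sketch of hierarchical weighted scoring over gradable components.
# The rubric structure and weights below are invented for illustration only;
# they are NOT the real PaperBench rubric or SimpleJudge.
from dataclasses import dataclass, field

@dataclass
class RubricNode:
    name: str
    weight: float                        # relative weight among siblings
    score: float | None = None           # leaf score in [0, 1] assigned by a judge
    children: list["RubricNode"] = field(default_factory=list)

def aggregate(node: RubricNode) -> float:
    """Leaves keep their judged score; internal nodes take the
    weight-normalized average of their children."""
    if not node.children:
        return node.score or 0.0
    total_weight = sum(child.weight for child in node.children)
    return sum(child.weight * aggregate(child) for child in node.children) / total_weight

# Tiny example rubric: one paper with two weighted sub-requirements.
paper = RubricNode("paper", 1.0, children=[
    RubricNode("code development", 2.0, score=0.9),
    RubricNode("experiment setup", 1.0, score=0.6),
])
print(f"Reproduction score: {aggregate(paper):.1%}")  # -> 80.0%
```

In this scheme, a judge scores only the leaves; every internal node is the weight-normalized average of its children, so heavily weighted requirements dominate the final percentage.
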

Our experiments compare DeepCode against four baseline categories: **(1) Human Experts**, **(2) State-of-the-Art Commercial Code Agents**, **(3) Scientific Code Agents**, and **(4) LLM-Based Agents**.

### ① 🧠 Human Expert Performance (Top Machine Learning PhD)

**DeepCode: 75.9% vs. Top Machine Learning PhD: 72.4% (+3.5%)**

DeepCode achieves **75.9%** on the 3-paper human evaluation subset (with Claude Sonnet 4.5-thinking), **surpassing the best-of-3 human expert baseline (72.4%) by +3.5 percentage points**. This demonstrates that our framework not only matches but exceeds expert-level code reproduction capabilities, representing a significant milestone in autonomous scientific software engineering.

### ② 💼 State-of-the-Art Commercial Code Agents

**DeepCode: 84.8% vs. Best Commercial Agent: 58.7% (+26.1%)**

On the 5-paper subset, DeepCode substantially outperforms leading commercial coding tools:
- Cursor: 58.4%
- Claude Code: 58.7%
- Codex: 40.0%
- **DeepCode: 84.8%**

This represents a **+26.1 percentage-point improvement** over the best commercial agent (Claude Code). All commercial agents utilize Claude Sonnet 4.5-thinking (Cursor and Claude Code) or GPT-5 Codex-high (Codex), highlighting that **DeepCode's superior architecture**, rather than base model capability, drives this performance gap.

### ③ 🔬 Scientific Code Agents

**DeepCode: 73.5% vs. PaperCoder: 51.1% (+22.4%)**

Compared to PaperCoder (**51.1%**), the state-of-the-art scientific code reproduction framework, DeepCode achieves **73.5%**, an improvement of **+22.4 percentage points**. This substantial margin validates our multi-module architecture combining planning, hierarchical task decomposition, code generation, and iterative debugging over simpler pipeline-based approaches.
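
To give a feel for what such a plan, decompose, generate, and debug loop looks like, here is a minimal self-contained sketch. The stage functions and the `Task` type are hypothetical placeholders of our own, not DeepCode's actual API.

```python
# Hypothetical sketch of a plan -> decompose -> generate -> debug pipeline.
# Stage functions and the Task type are placeholders, NOT DeepCode's real API.
from dataclasses import dataclass

@dataclass
class Task:
    description: str
    code: str = ""
    passed: bool = False

def plan(paper_text: str) -> list[Task]:
    # In a real system, an LLM would read the paper and emit a decomposed task list.
    return [Task("implement model"), Task("implement training loop")]

def generate(task: Task) -> Task:
    # Placeholder for LLM code generation for a single decomposed task.
    task.code = f"# code for: {task.description}"
    return task

def debug(task: Task, max_rounds: int = 3) -> Task:
    # Placeholder iterative-debugging loop: a real harness would execute the
    # generated code or its tests and feed failures back to the generator.
    for _ in range(max_rounds):
        task.passed = True
        if task.passed:
            break
    return task

def reproduce(paper_text: str) -> list[Task]:
    return [debug(generate(t)) for t in plan(paper_text)]

print([t.description for t in reproduce("paper text ...")])
```

The point of the sketch is the control flow: each decomposed task passes through generation and an iterative debugging loop before the next task is attempted, mirroring the multi-module structure described above.
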

### ④ 🤖 LLM-Based Agents

**DeepCode: 73.5% vs. Best LLM Agent: 43.3% (+30.2%)**

DeepCode significantly outperforms all tested LLM agents:
- Claude 3.5 Sonnet + IterativeAgent: 27.5%
- o1 + IterativeAgent (36 hours): 42.4%
- o1 BasicAgent: 43.3%
- **DeepCode: 73.5%**

The **+30.2 percentage-point improvement** over the best-performing LLM agent (o1 BasicAgent) demonstrates that sophisticated agent scaffolding, rather than extended inference time or larger models alone, is critical for complex code reproduction tasks.

<div align="center">
<img src='./assets/result_main.jpg' /><br>
</div>
<br/>

---

### 🎯 **Autonomous Self-Orchestrating Multi-Agent Architecture**

**The Challenges**:
