An evaluation suite for agentic models in real MCP tool environments (Notion / GitHub / Filesystem / Postgres / Playwright).
MCPMark provides a reproducible, extensible benchmark for researchers and engineers: one-command tasks, isolated sandboxes, auto-resume for failures, unified metrics, and aggregated reports.
- Evaluate real tool usage across multiple MCP services: Notion, GitHub, Filesystem, Postgres, Playwright.
- Use ready-to-run tasks covering practical workflows, each with strict automated verification.
- Reliable and reproducible: isolated environments that do not pollute your accounts or data; failed tasks auto-retry and resume.
- Unified metrics and aggregation: single- and multi-run metrics (pass@k, avg@k, etc.) with automated results aggregation.
- Flexible deployment: local or Docker; fully validated on macOS and Linux.
 
```bash
git clone https://github.com/eval-sys/mcpmark.git
cd mcpmark
```

Configure environment variables: only set what you need, and add service credentials when running tasks for that service.
```bash
# Example: OpenAI
OPENAI_BASE_URL="https://api.openai.com/v1"
OPENAI_API_KEY="sk-..."

# Optional: Notion (only for Notion tasks)
SOURCE_NOTION_API_KEY="your-source-notion-api-key"
EVAL_NOTION_API_KEY="your-eval-notion-api-key"
EVAL_PARENT_PAGE_TITLE="MCPMark Eval Hub"
PLAYWRIGHT_BROWSER="chromium"   # chromium | firefox
PLAYWRIGHT_HEADLESS="True"

# Optional: GitHub (only for GitHub tasks)
GITHUB_TOKENS="token1,token2"   # token pooling for rate limits
GITHUB_EVAL_ORG="your-eval-org"

# Optional: Postgres (only for Postgres tasks)
POSTGRES_HOST="localhost"
POSTGRES_PORT="5432"
POSTGRES_USERNAME="postgres"
POSTGRES_PASSWORD="password"
```

See docs/introduction.md and the service guides below for more details.
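If you script your own runs, a quick preflight check can catch missing credentials early. The sketch below is only illustrative: the per-service variable groups mirror the example above, but the mapping itself and the helper name are assumptions, not an official MCPMark interface.

```python
import os

# Illustrative mapping from MCP service to the env vars shown above
# (assumption for this sketch; adjust to the services you actually run).
REQUIRED_ENV = {
    "notion": ["SOURCE_NOTION_API_KEY", "EVAL_NOTION_API_KEY", "EVAL_PARENT_PAGE_TITLE"],
    "github": ["GITHUB_TOKENS", "GITHUB_EVAL_ORG"],
    "postgres": ["POSTGRES_HOST", "POSTGRES_PORT", "POSTGRES_USERNAME", "POSTGRES_PASSWORD"],
    "filesystem": [],  # zero-configuration
}

def check_service_env(service: str) -> list[str]:
    """Return the names of required variables that are missing or empty."""
    return [name for name in REQUIRED_ENV.get(service, []) if not os.environ.get(name)]

if __name__ == "__main__":
    for service in REQUIRED_ENV:
        missing = check_service_env(service)
        status = "ready" if not missing else f"missing: {', '.join(missing)}"
        print(f"{service:<10} {status}")
```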
Local (Recommended)
```bash
pip install -e .
# If you'll use browser-based tasks, install Playwright browsers first
playwright install
```

MCPMark defaults to the built-in orchestration agent (MCPMarkAgent). To experiment with the ReAct-style agent, pass --agent react to pipeline.py (other settings stay the same).
Docker
```bash
./build-docker.sh
```

Run a filesystem task (no external accounts required):
```bash
# --k 1 runs each task once for a quick start; use any model you have configured for --models
python -m pipeline \
  --mcp filesystem \
  --k 1 \
  --models gpt-5 \
  --tasks file_property/size_classification
```

Results are saved to ./results/{exp_name}/{model}__{mcp}/run-*/... (e.g., ./results/test-run/gpt-5__filesystem/run-1/...).
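If you want to inspect outputs programmatically, results can be located with a simple glob over that layout. The snippet below only lists what it finds and makes no assumption about the files inside each run directory.

```python
from pathlib import Path

# Walk ./results/{exp_name}/{model}__{mcp}/run-*/ and report what each run produced.
for run_dir in sorted(Path("results").glob("*/*__*/run-*")):
    exp_name, model_and_mcp, run = run_dir.parts[-3:]
    files = [p.name for p in run_dir.rglob("*") if p.is_file()]
    print(f"{exp_name} | {model_and_mcp} | {run}: {len(files)} file(s)")
```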
```bash
# Run ALL tasks for a service
python -m pipeline --exp-name exp --mcp notion --tasks all --models MODEL --k 1

# Run a task group
python -m pipeline --exp-name exp --mcp notion --tasks online_resume --models MODEL --k 1

# Run a specific task
python -m pipeline --exp-name exp --mcp notion --tasks online_resume/daily_itinerary_overview --models MODEL --k 1

# Evaluate multiple models
python -m pipeline --exp-name exp --mcp notion --tasks all --models MODEL1,MODEL2,MODEL3 --k 1
```

```bash
# Run k=4 to compute stability metrics (requires --exp-name to aggregate final results)
python -m pipeline --exp-name exp --mcp notion --tasks all --models MODEL

# Aggregate results (pass@1 / pass@k / pass^k / avg@k)
python -m src.aggregators.aggregate_results --exp-name exp
```

```bash
# Run all tasks for a service
./run-task.sh --mcp notion --models MODEL --exp-name exp --tasks all

# Cross-service benchmark
./run-benchmark.sh --models MODEL --exp-name exp --docker
```

Please visit docs/introduction.md for choices of MODEL.
Tip: MCPMark supports auto-resume. When re-running, only unfinished tasks will execute. Failures matching our retryable patterns (see RETRYABLE_PATTERNS) are retried automatically. Models may emit different error strings—if you encounter a new resumable error, please open a PR or issue.
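For reference, retry decisions of this kind usually reduce to matching the error message against a list of regular expressions. The sketch below only illustrates that idea; the real list lives in the codebase as RETRYABLE_PATTERNS, and the example patterns and helper name here are assumptions.

```python
import re

# Hypothetical examples only; the real list is defined as RETRYABLE_PATTERNS in the MCPMark source.
EXAMPLE_RETRYABLE_PATTERNS = [
    r"rate.?limit",
    r"timed? ?out",
    r"connection (reset|refused)",
    r"429|502|503",
]

def is_retryable(error_message: str, patterns=EXAMPLE_RETRYABLE_PATTERNS) -> bool:
    """Return True if the error message matches any retryable pattern (case-insensitive)."""
    return any(re.search(p, error_message, re.IGNORECASE) for p in patterns)

print(is_retryable("Request timed out after 60s"))   # True
print(is_retryable("Invalid API key"))               # False
```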
| Service | Setup summary | Docs |
|---|---|---|
| Notion | Environment isolation (Source Hub / Eval Hub), integration creation and grants, browser login verification. | Guide |
| GitHub | Multi-account token pooling recommended; import pre-exported repo state if needed. | Guide |
| Postgres | Start via Docker and import sample databases. | Setup |
| Playwright | Install browsers before first run; defaults to chromium. | Setup |
| Filesystem | Zero-configuration, run directly. | Config |
You can also follow Quickstart for the shortest end-to-end path.
- Results are organized under ./results/{exp_name}/{model}__{mcp}/run-*/ (JSON + CSV per task).
- Generate a summary with:
 
```bash
# Basic usage
python -m src.aggregators.aggregate_results --exp-name exp

# For k-run experiments with single-run models
python -m src.aggregators.aggregate_results --exp-name exp --k 4 --single-run-models claude-opus-4-1
```

- Only models with complete results across all tasks and runs are included in the final summary.
- Includes multi-run metrics (pass@k, pass^k) for stability comparisons when k > 1 (a sketch of how such metrics can be computed follows below).
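As a rough illustration of how such multi-run metrics can be defined, the sketch below computes them from per-task run outcomes, assuming pass@k counts a task as solved if at least one of its k runs succeeded, pass^k only if all k runs succeeded, and avg@k as the mean success rate over individual runs. It is a simplified illustration under those assumed definitions, not the aggregator's actual implementation.

```python
def aggregate_runs(task_outcomes: dict[str, list[bool]]) -> dict[str, float]:
    """Compute pass@k, pass^k, and avg@k from per-task lists of k run outcomes."""
    tasks = list(task_outcomes.values())
    n_tasks = len(tasks)
    n_runs = sum(len(runs) for runs in tasks)
    return {
        "pass@k": sum(any(runs) for runs in tasks) / n_tasks,   # solved in at least one run
        "pass^k": sum(all(runs) for runs in tasks) / n_tasks,   # solved in every run
        "avg@k": sum(sum(runs) for runs in tasks) / n_runs,     # mean per-run success rate
    }

outcomes = {
    "online_resume/daily_itinerary_overview": [True, True, False, True],
    "file_property/size_classification": [True, True, True, True],
}
print(aggregate_runs(outcomes))
# {'pass@k': 1.0, 'pass^k': 0.5, 'avg@k': 0.875}
```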
 
- Model support: MCPMark calls models via LiteLLM (see the LiteLLM Doc). For Anthropic (Claude) extended thinking mode (enabled via --reasoning-effort), we use Anthropic's native API.
- See docs/introduction.md for details and configuration of supported models in MCPMark.
- To add a new model, edit src/model_config.py. Before adding, check the models/providers supported by LiteLLM (see the LiteLLM Doc).
- Task design principles are described in docs/datasets/task.md. Each task ships with an automated verify.py for objective, reproducible evaluation; see docs/task.md for details. A hypothetical verifier sketch follows below.
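As an illustration of what such a verifier can look like, here is a hypothetical verify.py for a filesystem-style task. The checked directory layout, the environment variable, and the exit-code convention are assumptions made for this sketch, not MCPMark's actual task interface.

```python
"""Hypothetical verify.py sketch for a filesystem-style task (not an actual MCPMark task)."""
import os
import sys
from pathlib import Path

def verify() -> bool:
    # Assumed convention: the task's sandbox root is passed via an environment variable.
    root = Path(os.environ.get("TASK_ROOT", "."))
    # Example check: the agent was asked to sort files into small/ and large/ by size.
    small, large = root / "small", root / "large"
    if not (small.is_dir() and large.is_dir()):
        return False
    threshold = 1024 * 1024  # 1 MiB
    return (all(f.stat().st_size < threshold for f in small.iterdir() if f.is_file())
            and all(f.stat().st_size >= threshold for f in large.iterdir() if f.is_file()))

if __name__ == "__main__":
    # Assumed convention: exit code 0 means the task state passes verification.
    sys.exit(0 if verify() else 1)
```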
Contributions are welcome:
- Add a new task under tasks/<category_id>/<task_id>/ with meta.json, description.md, and verify.py (a scaffolding sketch follows below).
- Ensure local checks pass and open a PR.
- See docs/contributing/make-contribution.md.
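To get the three files in place quickly, a small scaffolding script like the following can help. The meta.json fields written here are purely hypothetical placeholders; replace them with whatever the contribution guide specifies.

```python
"""Scaffold a new task directory; the meta.json fields are placeholders, not MCPMark's schema."""
import json
import sys
from pathlib import Path

def scaffold(category_id: str, task_id: str) -> Path:
    task_dir = Path("tasks") / category_id / task_id
    task_dir.mkdir(parents=True, exist_ok=False)
    # Placeholder metadata; consult docs/contributing/make-contribution.md for the real fields.
    (task_dir / "meta.json").write_text(json.dumps({"category": category_id, "task": task_id}, indent=2))
    (task_dir / "description.md").write_text(f"# {task_id}\n\nDescribe the task for the agent here.\n")
    (task_dir / "verify.py").write_text("import sys\n\n# TODO: implement checks\nsys.exit(1)\n")
    return task_dir

if __name__ == "__main__":
    print(f"Created {scaffold(sys.argv[1], sys.argv[2])}")
```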
If you find our work useful for your research, please consider citing:
```bibtex
@misc{wu2025mcpmark,
      title={MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use},
      author={Zijian Wu and Xiangyan Liu and Xinyuan Zhang and Lingjun Chen and Fanqing Meng and Lingxiao Du and Yiran Zhao and Fanshi Zhang and Yaoqi Ye and Jiawei Wang and Zirui Wang and Jinjie Ni and Yufan Yang and Arvin Xu and Michael Qizhe Shieh},
      year={2025},
      eprint={2509.24002},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.24002},
}
```

This project is licensed under the Apache License 2.0 — see LICENSE.
