Zephyr Bench is a comprehensive benchmark for evaluating coding agents on real software tasks. This system allows you to create realistic coding scenarios, test different AI agents, and measure their performance across multiple evaluation criteria.
Environment Setup: Create a `.env` file in the project root directory (`ze-benchmarks/.env`) with your API keys. See the Environment Variables section below for details.
# Clone and install
git clone https://github.com/your-org/ze-benchmarks.git
cd ze-benchmarks
pnpm install
# Edit .env and add your API keys
# Required for Anthropic Claude
ANTHROPIC_API_KEY=your_anthropic_api_key_here
# Optional for OpenRouter models
OPENROUTER_API_KEY=your_openrouter_api_key_here
# Worker configuration (already set for local development)
ZE_BENCHMARKS_WORKER_URL=https://bench-api-dev.zephyr-cloud.io
ZE_PUBLIC_API_URL=https://bench-api-dev.zephyr-cloud.io
ZE_PUBLIC_VITE_WORKER_URL=https://bench-api-dev.zephyr-cloud.io
ZE_BENCHMARKS_API_KEY=dev-local-key
Get API Keys:
- Anthropic Claude: https://console.anthropic.com/settings/keys
- OpenRouter: https://openrouter.ai/keys
# 2. Run benchmark with enrichment
pnpm bench next.js 001-server-component \
--tier L1 \
--specialist nextjs-specialist \
--enrich-template templates/nextjs-specialist-template.json5
# This will:
# 1. Load the specialist template: templates/nextjs-specialist-template.json5
# 2. Auto-detect agent/model from template's preferred_models
# 3. Run the benchmark once (default iterations: 1)
# 4. Store results in the database
# 5. After benchmark completes, enrich the template:
# - Analyze each documentation entry
# - Generate metadata using LLM (default: anthropic/claude-3.5-haiku via OpenRouter)
# - Save enriched template to templates/enriched/<version>/enriched-XXX.json5
For statistical confidence, run multiple iterations:
pnpm bench next.js 000-app-router-migration-simple \
--tier L1-basic \
--specialist nextjs-specialist \
--iterations 3 \
--enrich-template templates/nextjs-specialist-template.json5
# This will:
# 1. Run the benchmark 3 times
# 2. Store all 3 runs in a batch
# 3. After all runs complete, enrich the template once
A suite is a collection of related benchmarks. Each suite contains:
- Scenarios: Individual test cases within the suite
- Prompts: Difficulty tiers (L0-L3, Lx) for each scenario
- Repository Fixtures: Real codebases with intentional issues
- next.js: Next.js App Router scenarios (migrations, components, etc.)
- update-deps: Dependency update scenarios (React, TypeScript, etc.)
- test-suite: Basic test scenarios with limited tiers
Each scenario requires:
- scenario.yaml: Configuration and evaluation criteria
- oracle-answers.json: Expected outcomes for validation
- repo-fixture/: Complete codebase with intentional issues
- Prompts: L0-L3 and Lx difficulty tiers
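For orientation, here is a minimal sketch of scaffolding a new scenario by hand. It assumes next.js mirrors the update-deps layout shown under Project Structure below, and the scenario name and file contents are illustrative only, not the real schema; see docs/ADDING-BENCHMARKS.md for the actual format.

```bash
# Hypothetical scaffold for a new scenario (paths and field names are assumptions).
mkdir -p suites/next.js/scenarios/002-my-scenario/repo-fixture
cd suites/next.js/scenarios/002-my-scenario

# scenario.yaml: configuration and evaluation criteria
cat > scenario.yaml <<'EOF'
name: 002-my-scenario
description: Short description of the task the agent must complete
evaluators:
  - build
  - lint
  - typecheck
EOF

# oracle-answers.json: expected outcomes used for validation
cat > oracle-answers.json <<'EOF'
{
  "expectedOutcomes": []
}
EOF
```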
| Agent | Description | API Key Required |
|---|---|---|
| Echo | Test agent that echoes the prompt | No |
| Anthropic Claude | Direct Claude API integration | ANTHROPIC_API_KEY |
| OpenRouter | Multiple LLM providers via OpenRouter | OPENROUTER_API_KEY |
| Claude Code | Claude Code CLI tool integration | ANTHROPIC_API_KEY |
Note: When using --specialist, you don't need to specify --agent - it's auto-detected from the template.
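As a quick sanity check of your setup, you can run against the Echo agent, which needs no API key. The `--agent` value below is an assumption about the CLI's agent identifier; check the harness documentation for the exact names.

```bash
# Explicit agent selection (the "echo" identifier is assumed, not verified):
pnpm bench next.js 001-server-component --tier L1 --agent echo

# With --specialist, the agent/model comes from the template, so --agent can be omitted:
pnpm bench next.js 001-server-component --tier L1 --specialist nextjs-specialist
```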
The system automatically scans available difficulty tiers:
| Tier | Description | Use Case |
|---|---|---|
| L0 | Minimal context | Tests discovery and inference skills |
| L1 | Basic context | Standard user scenarios |
| L2 | Directed guidance | Complex tasks with detailed instructions |
| L3 | Migration specific | Technology transitions and upgrades |
| Lx | Adversarial | Edge cases and challenging scenarios |
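To compare how much context an agent needs, the same scenario can be run at several tiers; only the `--tier` value changes. This sketch reuses the commands above and assumes the scenario ships prompts for each tier listed.

```bash
# Same scenario, different difficulty tiers:
pnpm bench next.js 001-server-component --tier L0 --specialist nextjs-specialist
pnpm bench next.js 001-server-component --tier Lx --specialist nextjs-specialist
```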
- BuildEvaluator: Validates project builds successfully
- LintEvaluator: Ensures code quality standards
- TypecheckEvaluator: Verifies TypeScript type correctness
- CompanionAlignmentEvaluator: Ensures companion packages align
- NamespaceMigrationEvaluator: Handles package namespace changes
- AI-Powered Assessment: Quality evaluation using language models
- Weighted Scoring: Configurable evaluation weights
- Detailed Feedback: Comprehensive reasoning and suggestions
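Evaluation weights are configurable per benchmark. The snippet below is only an illustrative sketch of what a weighted-scoring section could look like, assuming it lives in scenario.yaml; the real keys and evaluator names are documented in docs/ADDING-EVALUATORS.md.

```bash
# Illustrative only: hypothetical weighted-scoring section appended to a scenario.yaml.
# Key names, evaluator identifiers, and weights are assumptions, not the real schema.
cat >> scenario.yaml <<'EOF'
evaluation:
  weights:
    build: 0.4
    typecheck: 0.2
    lint: 0.1
    llm_judge: 0.3
EOF
```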
ze-benchmarks/
├── packages/
│ ├── harness/ # CLI and execution engine
│ ├── agent-adapters/ # Agent implementations
│ ├── evaluators/ # Evaluation logic and LLM Judge
│ └── database/ # SQLite database management
├── suites/ # Benchmark scenarios
│ ├── next.js/ # Next.js App Router scenarios
│ ├── update-deps/ # Dependency update scenarios
│ │ ├── prompts/ # L0-L3 and Lx prompts
│ │ └── scenarios/ # Scenario configs and fixtures
│ └── test-suite/ # Basic test scenarios
├── templates/ # Specialist templates
│ ├── nextjs-specialist-template.json5
│ ├── shadcn-specialist-template.json5
│ └── enriched/ # Enriched templates (auto-generated)
│ └── <version>/
│ └── enriched-XXX.json5
├── apps/benchmark-report/ # Web dashboard (React + Rsbuild)
│ └── public/ # Database and static assets
└── docs/ # Comprehensive documentation
├── ADDING-BENCHMARKS.md # Guide for creating benchmarks
├── ADDING-EVALUATORS.md # Guide for creating evaluators
└── templates/ # Ready-to-use templates
# Start interactive CLI
pnpm cli
# Choose from:
# - Run benchmarks (select suite, scenario, specialist)
# - View statistics and history
# - Compare model performance
# - Analyze batch results
# Run specific benchmark with specialist
pnpm bench next.js 001-server-component --tier L1 --specialist nextjs-specialist
# With enrichment
pnpm bench next.js 001-server-component \
--tier L1 \
--specialist nextjs-specialist \
--enrich-template templates/nextjs-specialist-template.json5
# View results
pnpm stats suite next.js
pnpm batches
# View comprehensive statistics
pnpm stats
# Compare batches
pnpm compare-batches
# View batch details
pnpm batch-details <batch-id>
We welcome contributions! Whether you want to add new benchmarks, create evaluators, or improve documentation, we have comprehensive guides to help you get started.
- Contributing Guide - Complete contribution guidelines
- Adding Benchmarks - Step-by-step benchmark creation
- Adding Evaluators - Evaluator development guide
Use our GitHub issue template to propose new benchmarks:
- Go to GitHub Issues
- Click "New Issue" → "New Benchmark Proposal"
- Fill out the template with your benchmark idea
- We'll review and help you implement it!
Check out our Contributing Guide for detailed instructions on:
- Setting up your development environment
- Creating new benchmarks and evaluators
- Submitting pull requests
- Code quality standards
The system uses environment variables for configuration. Create a .env file from the example:
- ANTHROPIC_API_KEY: Required for the Anthropic Claude agent and Claude Code
- OPENROUTER_API_KEY: Required for the OpenRouter agent
- ZE_BENCHMARKS_WORKER_URL, ZE_PUBLIC_API_URL, ZE_PUBLIC_VITE_WORKER_URL: Worker API URLs (see the Environment Setup section above for the development values)
- DEBUG: Enable debug logging (default: false)
- PORT: Web dashboard port (default: 3000)
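For example, to turn on debug logging and move the dashboard off its default port, append overrides to your `.env` (the values shown are just examples):

```bash
# Optional overrides; the defaults are DEBUG=false and PORT=3000.
echo "DEBUG=true" >> .env
echo "PORT=4000" >> .env
```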
- Storage: Cloudflare D1 (SQLite) via Worker API
- API Endpoint: http://localhost:8787 (local development)
- Authentication: Bearer token authentication for write operations
- Real-time: All data fetched fresh on each request
- GET /api/runs - List all benchmark runs
- GET /api/runs/:id - Get run details with evaluations and telemetry
- GET /api/batches - List all batches
- GET /api/batches/:id - Get batch details with runs
- GET /api/stats - Get global statistics
- POST /api/results - Submit benchmark run (authenticated)
- POST /api/results/batch - Submit batch (authenticated)
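A minimal sketch of talking to the local Worker API with curl, assuming the worker is running at http://localhost:8787 and ZE_BENCHMARKS_API_KEY is exported as in the `.env` above. The POST body shape is not documented in this README, so only the authentication pattern is shown.

```bash
# Read endpoints need no authentication:
curl http://localhost:8787/api/runs
curl http://localhost:8787/api/stats

# Write endpoints require a bearer token (payload shape omitted here):
curl -X POST http://localhost:8787/api/results \
  -H "Authorization: Bearer $ZE_BENCHMARKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "...": "..." }'
```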
- Runs: Individual benchmark executions with detailed metadata
- Batches: Groups of runs with aggregate statistics
- Evaluations: Detailed evaluator results and scores
- Telemetry: Tool calls, tokens, costs, duration
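For ad-hoc inspection, a run record can be pulled straight from the API. The exact response fields are not documented here, so the jq path below is an assumption about the shape.

```bash
# Fetch one run and inspect its telemetry (tool calls, tokens, costs, duration).
# The `.telemetry` field name is assumed, not confirmed by this README.
curl -s http://localhost:8787/api/runs/<run-id> | jq '.telemetry'
```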
- Real-time Updates: Fetches data directly from Worker API
- Interactive Charts: Performance visualization with proper 0-1 scoring
- Batch Analytics: Comprehensive batch-level statistics
- Run Details: Individual run analysis and debugging
- Failure Analysis: Detailed failure reasons and patterns
Ready to create benchmarks? Check out our Contributing Guide for comprehensive instructions!