A Go CLI tool for benchmarking OpenAI-compatible API servers. Automatically discovers available models and measures comprehensive performance metrics including requests per second, tokens per second, and response latency.
| Provider | Model | Tokens/sec | Success Rate | Avg Latency | Notes |
|---|---|---|---|---|---|
| Cerebras | qwen-3-coder-480b | 91.93 | 26.8% | 996ms | Highest throughput but rate limited |
| Alibaba Cloud | qwen3-coder-plus | 38.83 | 88.9% | 3.5s | Best reliability |
| Custom Server | qwen3-coder | 33.88 | 87.5% | 4.2s | Most consistent |
Winner: Cerebras for raw speed, Alibaba Cloud for production reliability
- Cerebras API delivers exceptional performance (91.93 tokens/sec) but has aggressive rate limiting (~73% error rate)
- Alibaba Cloud offers the best balance for production use with 38.83 tokens/sec and 88.9% success rate
- Custom server provides predictable performance ideal for development/testing
Benchmarked on 2025-08-01 with identical prompts and conditions
- Auto-discovery: Automatically discovers available models via the `/v1/models` endpoint
- Comprehensive metrics: RPS, tokens/sec, latency percentiles, TTFT, error rates
- Concurrent testing: Auto-detects optimal concurrency or accepts custom values
- Multiple prompt sizes: Test with small, medium, large, or mixed prompts
- Real-time progress: Live progress bar with dynamic tokens/sec and smart status updates
- Streaming support: Test both regular and streaming completions
- Rich output: Colored terminal output and JSON export options
- Warmup phase: Optional warmup to stabilize server performance
- API Key support: Built-in authentication for cloud providers
- Smart URL handling: Automatic HTTPS for domains, HTTP for local IPs (see the sketch after this list)
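The URL handling can be pictured with a small Go sketch. This is illustrative only; the `normalizeBaseURL` helper is a hypothetical name, not the tool's actual code:

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// normalizeBaseURL is a hypothetical sketch of the "smart URL handling"
// described above: bare domains get https://, bare IPs and localhost get
// http://, and an explicit scheme is left untouched.
func normalizeBaseURL(target string) string {
	if strings.HasPrefix(target, "http://") || strings.HasPrefix(target, "https://") {
		return target // caller already chose a scheme
	}
	host := target
	if h, _, err := net.SplitHostPort(target); err == nil {
		host = h // strip a :port suffix before inspecting the host
	}
	if host == "localhost" || net.ParseIP(host) != nil {
		return "http://" + target // local/IP targets default to plain HTTP
	}
	return "https://" + target // public domains default to HTTPS
}

func main() {
	fmt.Println(normalizeBaseURL("86.38.182.74:8000"))   // http://86.38.182.74:8000
	fmt.Println(normalizeBaseURL("api.openai.com"))      // https://api.openai.com
	fmt.Println(normalizeBaseURL("https://example.com")) // unchanged
}
```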
```bash
# Clone and build
git clone <repo-url>
cd aibench
go build -o aibench

# Or install directly
go install
```

```bash
# Test a server with default settings (30s duration)
./aibench 86.38.182.74:8000

# Test specific server with custom duration
./aibench --duration 60s localhost:8000

# Test with specific concurrency
./aibench --concurrency 10 --duration 45s api.server.com:8080

# Test specific models only
./aibench --models gpt-4,gpt-3.5-turbo server:8000

# Force concurrency 1 (optimal for vLLM/resource-limited servers)
./aibench --concurrency -1 server:8000

# Test with API key authentication
./aibench --api-key "your-key-here" api.provider.com

# Test streaming responses
./aibench --streaming --duration 30s server:8000

# Test with large prompts and longer timeout
./aibench --prompt-size large --timeout 60s server:8000

# Export results as JSON
./aibench --format json server:8000 > results.json

# Disable colors and warmup
./aibench --no-color --warmup 0 server:8000
```

```bash
# Alibaba DashScope (Qwen models)
./aibench --api-key "sk-your-key" --models qwen3-coder-plus dashscope-intl.aliyuncs.com/compatible-mode

# OpenAI API
./aibench --api-key "sk-your-key" --models gpt-4 api.openai.com

# Any OpenAI-compatible provider
./aibench --api-key "your-key" your-provider.com/v1
```

| Flag | Default | Description |
|---|---|---|
| `-d, --duration` | `30s` | How long to run the benchmark |
| `-c, --concurrency` | `0` (auto) | Concurrent requests (0=auto, -1=force 1) |
| `-m, --models` | all discovered | Specific models to test |
| `-k, --api-key` | `""` | API key (or use `OPENAI_API_KEY` env var) |
| `-t, --timeout` | `30s` | Request timeout |
| `-w, --warmup` | `5s` | Warmup duration (0 to disable) |
| `--streaming` | `false` | Test streaming responses |
| `--prompt-size` | `medium` | Prompt size: small\|medium\|large\|all |
| `--skip-discovery` | `false` | Skip model discovery, use specified models |
| `-f, --format` | `table` | Output format: table\|json |
| `-v, --verbose` | `false` | Verbose output |
| `--no-color` | `false` | Disable colored output |
The benchmark shows live performance metrics while running:
```
[████████████████████] 67.3% | RPS: 45.2 | TPS: 1.2k | Reqs: 1356/1400 | ETA: 9s
```
- Adaptive status: Shows different info based on benchmark phase
  - `Waiting for first response...` - shown when no requests have completed yet
  - `Warming up...` - shown during the initial 2 seconds
  - `RPS: X | TPS: Y` - live performance metrics
- Color-coded metrics: Green (good), yellow (medium), red (low/slow)
- Recent vs average: Shows instantaneous speed after warmup
- Smart progress bar: Visual completion percentage with ETA
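As a rough illustration of how a live progress line like the one above can be produced, here is a minimal Go sketch; the `renderProgress` helper is hypothetical, not the tool's display code:

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// renderProgress sketches a single-line status update: the bar tracks
// elapsed time against the configured duration, while RPS/TPS and request
// counts are passed in from whatever metrics tracker the caller uses.
func renderProgress(elapsed, total time.Duration, rps, tps float64, done, expected int) string {
	frac := elapsed.Seconds() / total.Seconds()
	if frac > 1 {
		frac = 1
	}
	filled := int(frac * 20)
	bar := strings.Repeat("█", filled) + strings.Repeat("░", 20-filled)
	eta := total - elapsed
	if eta < 0 {
		eta = 0
	}
	// \r rewrites the same terminal line in place on each tick
	return fmt.Sprintf("\r[%s] %.1f%% | RPS: %.1f | TPS: %.1fk | Reqs: %d/%d | ETA: %ds",
		bar, frac*100, rps, tps/1000, done, expected, int(eta.Seconds()))
}

func main() {
	// 20.19s into a 30s run reproduces the 67.3% / ETA 9s line above
	fmt.Println(renderProgress(20190*time.Millisecond, 30*time.Second, 45.2, 1200, 1356, 1400))
}
```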
- Requests/sec: Successful requests per second
- Tokens/sec: Total tokens processed per second
- Input/Output Tokens/sec: Breakdown by token type
- Average/Min/Max Latency: Response time statistics
- P95/P99 Latency: 95th and 99th percentile latencies (see the sketch after this list)
- TTFT: Time to first token (streaming only)
- Success Rate: Percentage of successful requests
- Error Rate: Percentage of failed requests
- Error Breakdown: Detailed error categorization
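For the percentile metrics, a minimal Go sketch of the nearest-rank method is shown below; the tool's actual statistics code may differ:

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// percentile derives a Pxx latency from recorded samples using the
// nearest-rank method: the smallest value covering p percent of samples.
func percentile(latencies []time.Duration, p float64) time.Duration {
	if len(latencies) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), latencies...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	idx := int(math.Ceil(p/100*float64(len(sorted)))) - 1
	if idx < 0 {
		idx = 0
	}
	return sorted[idx]
}

func main() {
	samples := []time.Duration{
		89 * time.Millisecond, 120 * time.Millisecond, 177 * time.Millisecond,
		298 * time.Millisecond, 567 * time.Millisecond, 945 * time.Millisecond,
	}
	fmt.Println("P95:", percentile(samples, 95))
	fmt.Println("P99:", percentile(samples, 99))
}
```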
```
aibench - OpenAI API Benchmarking Tool
Server: localhost:8000
Models: gpt-4, gpt-3.5-turbo
────────────────────────────────────────────────────────────

Benchmarking model: gpt-4
Finding optimal concurrency...
Optimal concurrency: 8
Warming up (5s)...
Running benchmark (concurrency: 8, duration: 30s)...
[████████████████████] 100.0% | RPS: 45.2 | TPS: 1.2k | Reqs: 1356/1356 | ETA: 0s

Benchmark Results
Model: gpt-4
──────────────────────────────────────────────────
Duration: 30.0s
Total Requests: 1356
Successful: 1356 (100.0% success rate)
Requests/sec: 45.2
Tokens/sec: 1.2k
Input Tokens/sec: 890.5
Output Tokens/sec: 312.1
Avg Latency: 177ms
Min Latency: 89ms
Max Latency: 945ms
P95 Latency: 298ms
P99 Latency: 567ms
Avg TTFT: 45ms
```
This tool works with any OpenAI-compatible API server including:
- OpenAI API
- Local LLM servers (Ollama, LM Studio, etc.)
- Cloud providers (Azure OpenAI, AWS Bedrock, Alibaba DashScope, etc.)
- Custom inference servers
Endpoints used:
- `POST /v1/chat/completions` - Chat completions (required)
- `GET /v1/models` - Model discovery (optional)
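For reference, a chat completion request against such a server looks roughly like the following standard-library Go sketch; the server URL and model name are placeholders, and the payload shape follows the OpenAI convention:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"os"
)

// Minimal sketch of the request an aibench-style tool sends to an
// OpenAI-compatible server. Model name and server URL are placeholders.
func main() {
	body := []byte(`{
		"model": "gpt-3.5-turbo",
		"messages": [{"role": "user", "content": "Say hello"}]
	}`)

	req, err := http.NewRequest("POST", "http://localhost:8000/v1/chat/completions", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	if key := os.Getenv("OPENAI_API_KEY"); key != "" {
		req.Header.Set("Authorization", "Bearer "+key) // cloud providers need this
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status)
	fmt.Println(string(out))
}
```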
Important: Some providers have incomplete or missing model discovery endpoints. Common issues:
- Missing Models: The `/v1/models` endpoint may not list all available models
- Authentication Issues: Discovery may fail even when the completion endpoint works
- Provider Limitations: Some APIs don't implement the models endpoint
Solutions:
- Specify models directly: Use `--models model-name` to bypass discovery
- Skip discovery: Use `--skip-discovery` when you know the model exists
- Check provider docs: Consult your API provider's documentation for correct model names
Smart Behavior: When you specify models with `--models`, discovery is automatically skipped. This eliminates the need for a separate flag and handles provider limitations seamlessly.
```bash
# This may fail if discovery is broken
./aibench --api-key "key" provider.com

# This automatically skips discovery (recommended)
./aibench --api-key "key" --models qwen3-coder-plus provider.com

# Multiple models also skip discovery
./aibench --api-key "key" --models model1,model2,model3 provider.com

# Only uses discovery when no models are specified
./aibench --api-key "key" provider.com  # Will discover all available models
```

Best Practice: Always specify model names when working with cloud providers to avoid discovery issues and get faster startup times.
- Client: HTTP client with connection pooling for API interactions
- Benchmarker: Coordinates the entire benchmark process
- Metrics: Thread-safe performance tracking and statistical calculations (see the sketch after this list)
- Display: Real-time progress and formatted result output
- Workers: Concurrent request generators with different prompt strategies
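As an illustration of the Metrics component's thread-safety requirement, here is a minimal Go sketch of a mutex-guarded tracker; the field and method names are assumptions, not the tool's actual types:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Metrics sketches a tracker that worker goroutines can update concurrently.
type Metrics struct {
	mu        sync.Mutex
	latencies []time.Duration
	tokens    int
	errors    int
}

// Record is called by workers after each request completes.
func (m *Metrics) Record(latency time.Duration, tokens int, err error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if err != nil {
		m.errors++
		return
	}
	m.latencies = append(m.latencies, latency)
	m.tokens += tokens
}

// Snapshot returns aggregate counters for display without racing writers.
func (m *Metrics) Snapshot() (requests, tokens, errors int) {
	m.mu.Lock()
	defer m.mu.Unlock()
	return len(m.latencies), m.tokens, m.errors
}

func main() {
	var m Metrics
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			m.Record(120*time.Millisecond, 42, nil) // simulated successful request
		}()
	}
	wg.Wait()
	fmt.Println(m.Snapshot()) // 4 42*4 0
}
```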
MIT License