diff --git a/benchmarks/README.md b/benchmarks/README.md
index dcb24dfdbf..3ea151ed89 100644
--- a/benchmarks/README.md
+++ b/benchmarks/README.md
@@ -69,6 +69,17 @@ This directory also includes lightweight tools for:
 
 Detailed information is provided in the `prefix_data_generator` directory.
 
+## Benchmarking Tools
+
+This directory contains several benchmarking tools for different use cases:
+
+- **[General Benchmarking Framework](utils/)** - Python-based wrapper around AIPerf for benchmarking any HTTP endpoint (see the [complete guide](../docs/benchmarks/benchmarking.md))
+- **[LLM Benchmarking Scripts](llm/)** - Specialized scripts (`perf.sh` and `plot_pareto.py`) for detailed LLM performance analysis and Pareto frontier visualization
+- **[Router Benchmarking](router/)** - Scripts for benchmarking the Dynamo router with prefix caching
+- **[Profiler Tools](profiler/)** - Performance profiling utilities for SLA-driven deployments
+
 ## Comprehensive Guide
 
 For detailed documentation, configuration options, and advanced usage, see the [complete benchmarking guide](../docs/benchmarks/benchmarking.md).
+
+For LLM-specific benchmarking with Pareto plots, see the [LLM Benchmarking Guide](llm/README.md).
diff --git a/benchmarks/llm/README.md b/benchmarks/llm/README.md
index 9f5400655d..f230e24d7a 100644
--- a/benchmarks/llm/README.md
+++ b/benchmarks/llm/README.md
@@ -1,10 +1,13 @@
-Coming soon.
+# LLM Deployment Benchmarking Guide
+
+This guide provides detailed steps for benchmarking Large Language Models (LLMs) using the `perf.sh` and `plot_pareto.py` scripts in single- and multi-node configurations. These scripts use [AIPerf](https://github.com/triton-inference-server/perf_analyzer) to collect performance metrics and generate Pareto frontier visualizations.
+
+> [!Note]
+> This workflow works with all Dynamo backends (vLLM, SGLang, TensorRT-LLM) since they all expose the same OpenAI-compatible HTTP API. The examples in this guide use vLLM for clarity, but you can benchmark SGLang or TensorRT-LLM deployments using the same workflow; just deploy your workers with `python -m dynamo.sglang` or `python -m dynamo.trtllm` instead.
+
+## Overview
+
+The benchmarking tools in this directory help you:
+
+- **Benchmark LLM deployments** at various concurrency levels
+- **Compare performance** between aggregated and disaggregated serving modes
+- **Generate Pareto plots** to visualize throughput vs latency trade-offs
+- **Evaluate different configurations** (tensor parallelism, data parallelism, etc.)
+
+### Scripts
+
+- **`perf.sh`**: Bash script that runs AIPerf benchmarks across multiple concurrency levels
+- **`plot_pareto.py`**: Python script that generates Pareto efficiency graphs from benchmark results
+- **`nginx.conf`**: NGINX configuration template for load balancing (used for baseline comparisons)
+
+## Prerequisites
+
+> [!Important]
+> At least one 8xH100-80GB node is recommended for the following instructions. Different hardware configurations may yield different results.
+
+1. **Dynamo installed** - Follow the [installation guide](../../README.md#installation) to set up Dynamo
+2. **Model downloaded** - Download the model you want to benchmark:
+   ```bash
+   huggingface-cli download Qwen/Qwen3-0.6B
+   ```
+3. **NATS and etcd running** - Start the required services:
+
+   ```bash
+   # Using docker-compose (recommended)
+   docker compose -f deploy/docker-compose.yml up -d
+
+   # Or manually:
+   # Start etcd: ./etcd
+   # Start NATS with JetStream: nats-server -js
+   ```
+
+4. **AIPerf installed** - Install AIPerf for benchmarking:
+   ```bash
+   pip install aiperf
+   ```
+5. **Python dependencies for plotting**:
+   ```bash
+   pip install matplotlib seaborn pandas numpy
+   ```
+
+> [!NOTE]
+> This guide was tested on node(s) with the following hardware configuration:
+>
+> - **GPUs**: 8xH100-80GB-HBM3 (GPU memory bandwidth 3.2 TB/s)
+> - **CPU**: 2 x Intel Sapphire Rapids, Intel(R) Xeon(R) Platinum 8480CL E5, 112 cores (56 cores per CPU), 2.00 GHz (base), 3.8 GHz (max boost), PCIe Gen5
+> - **NVLink**: NVLink 4th Generation, 900 GB/s (GPU-to-GPU NVLink bidirectional bandwidth), 18 links per GPU
+> - **InfiniBand**: 8x400 Gbit/s (compute links), 2x400 Gbit/s (storage links)
+>
+> Benchmarking with a different hardware configuration may yield different results.
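+
+If you started NATS and etcd as described above, you can optionally confirm they are reachable before launching any workers. The check below is a minimal sketch that assumes the default local ports (etcd on 2379, NATS on 4222); adjust it if your services listen elsewhere.
+
+```bash
+# Optional sanity check: confirm etcd and NATS are reachable on their default ports.
+curl -sf http://localhost:2379/health && echo "etcd is healthy"
+nc -z localhost 4222 && echo "NATS is accepting connections"
+```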
+
+## Deployment Options
+
+You can benchmark Dynamo deployments in two ways:
+
+1. **Kubernetes Deployment** - Deploy using DynamoGraphDeployment (recommended for production)
+2. **Local Deployment** - Run components directly on your machine (useful for development/testing)
+
+Choose the method that best fits your use case. The benchmarking scripts work with either approach as long as you have an HTTP endpoint accessible at the specified URL.
+
+## Benchmarking Disaggregated Single Node Deployment
+
+> [!Important]
+> One 8xH100-80GB node is required for the following instructions.
+
+In this setup, we compare Dynamo disaggregated vLLM performance to a native vLLM aggregated baseline on a single node. We use 4 prefill workers (TP=1 each) and 1 decode worker (TP=4).
+
+### Option 1: Kubernetes Deployment
+
+1. **Deploy DynamoGraphDeployment**:
+
+   ```bash
+   # Set namespace
+   export NAMESPACE=dynamo-system
+
+   # Create namespace if needed
+   kubectl create namespace ${NAMESPACE}
+
+   # Deploy disaggregated configuration
+   kubectl apply -f examples/backends/vllm/deploy/disagg.yaml -n ${NAMESPACE}
+
+   # Wait for deployment to be ready
+   kubectl wait --for=condition=ready dynamographdeployment/vllm-disagg -n ${NAMESPACE} --timeout=10m
+   ```
+
+2. **Port-forward the frontend service**:
+
+   ```bash
+   kubectl port-forward -n ${NAMESPACE} svc/vllm-disagg-frontend 8000:8000 > /dev/null 2>&1 &
+   ```
+
+3. **Run the benchmark**:
+   ```bash
+   bash -x benchmarks/llm/perf.sh \
+     --mode disaggregated \
+     --deployment-kind dynamo_vllm \
+     --prefill-tensor-parallelism 1 \
+     --prefill-data-parallelism 4 \
+     --decode-tensor-parallelism 4 \
+     --decode-data-parallelism 1 \
+     --url http://localhost:8000
+   ```
+
+### Option 2: Local Deployment
+
+1. **Start the frontend**:
+
+   ```bash
+   python -m dynamo.frontend --http-port 8000 > frontend.log 2>&1 &
+   ```
+
+2. **Start prefill workers** (4 workers, each with TP=1):
+
+   ```bash
+   CUDA_VISIBLE_DEVICES=0 python -m dynamo.vllm \
+     --model Qwen/Qwen3-0.6B \
+     --is-prefill-worker > prefill_0.log 2>&1 &
+
+   CUDA_VISIBLE_DEVICES=1 python -m dynamo.vllm \
+     --model Qwen/Qwen3-0.6B \
+     --is-prefill-worker > prefill_1.log 2>&1 &
+
+   CUDA_VISIBLE_DEVICES=2 python -m dynamo.vllm \
+     --model Qwen/Qwen3-0.6B \
+     --is-prefill-worker > prefill_2.log 2>&1 &
+
+   CUDA_VISIBLE_DEVICES=3 python -m dynamo.vllm \
+     --model Qwen/Qwen3-0.6B \
+     --is-prefill-worker > prefill_3.log 2>&1 &
+   ```
+
+3. **Start decode worker** (TP=4):
+
+   ```bash
+   CUDA_VISIBLE_DEVICES=4,5,6,7 python -m dynamo.vllm \
+     --model Qwen/Qwen3-0.6B \
+     --tensor-parallel-size 4 > decode.log 2>&1 &
+   ```
+
+4. **Wait for services to be ready** - Check the logs to ensure all services are fully started before benchmarking.
+
+5. **Run the benchmark**:
+   ```bash
+   bash -x benchmarks/llm/perf.sh \
+     --mode disaggregated \
+     --deployment-kind dynamo_vllm \
+     --prefill-tensor-parallelism 1 \
+     --prefill-data-parallelism 4 \
+     --decode-tensor-parallelism 4 \
+     --decode-data-parallelism 1 \
+     --url http://localhost:8000
+   ```
+
+> [!Important]
+> The parallelism settings in `perf.sh` must accurately reflect your deployment configuration. In the above command, we specify:
+>
+> - 4 prefill workers with TP=1 each (prefill-data-parallelism=4, prefill-tensor-parallelism=1)
+> - 1 decode worker with TP=4 (decode-data-parallelism=1, decode-tensor-parallelism=4)
+>
+> See `perf.sh --help` for more information about these options.
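+
+Before kicking off a full sweep with `perf.sh`, it can also help to send a single request to confirm the deployment is actually serving the model. The following is a minimal sketch against the OpenAI-compatible API exposed by the frontend; adjust the URL and model name to match your deployment.
+
+```bash
+# Smoke test: send one chat completion request to the frontend before benchmarking.
+# Assumes the frontend is on localhost:8000 and serves Qwen/Qwen3-0.6B; adjust as needed.
+curl -s http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}' \
+  | python3 -m json.tool
+```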
+
+## Benchmarking Disaggregated Multinode Deployment
+
+> [!Important]
+> Two 8xH100-80GB nodes are required for the following instructions.
+
+In this setup, we benchmark Dynamo disaggregated vLLM performance across two nodes. We use 8 prefill workers (TP=1 each) and 1 decode worker (TP=8).
+
+### Setup
+
+1. **On Node 0** - Start NATS and etcd:
+
+   ```bash
+   docker compose -f deploy/docker-compose.yml up -d
+   # Or start manually: ./etcd and nats-server -js
+   ```
+
+2. **On Node 1** - Configure NATS and etcd endpoints:
+   ```bash
+   export NATS_SERVER="nats://<node0-ip>:4222"
+   export ETCD_ENDPOINTS="<node0-ip>:2379"
+   ```
+
+> [!Important]
+> Node 1 must be able to reach Node 0 over the network for the above services.
+
+### Deployment
+
+**Option 1: Kubernetes (Recommended)**
+
+Deploy a multi-node DynamoGraphDeployment following the [multinode deployment guide](../../docs/kubernetes/deployment/multinode-deployment.md), then port-forward and benchmark as shown in the single-node example.
+
+**Option 2: Local**
+
+1. **On Node 0** - Start frontend and decode worker:
+
+   ```bash
+   # Start frontend
+   python -m dynamo.frontend --http-port 8000 > frontend.log 2>&1 &
+
+   # Start decode worker (TP=8, using all 8 GPUs)
+   python -m dynamo.vllm \
+     --model Qwen/Qwen3-0.6B \
+     --tensor-parallel-size 8 > decode.log 2>&1 &
+   ```
+
+2. **On Node 1** - Start prefill workers:
+
+   ```bash
+   # Set environment variables for Node 0 connectivity
+   export NATS_SERVER="nats://<node0-ip>:4222"
+   export ETCD_ENDPOINTS="<node0-ip>:2379"
+
+   # Start 8 prefill workers (one per GPU)
+   for i in {0..7}; do
+     CUDA_VISIBLE_DEVICES=$i python -m dynamo.vllm \
+       --model Qwen/Qwen3-0.6B \
+       --is-prefill-worker > prefill_${i}.log 2>&1 &
+   done
+   ```
+
+3. **Run the benchmark** (from Node 0 or any machine with access to the frontend):
+   ```bash
+   bash -x benchmarks/llm/perf.sh \
+     --mode disaggregated \
+     --deployment-kind dynamo_vllm \
+     --prefill-tensor-parallelism 1 \
+     --prefill-data-parallelism 8 \
+     --decode-tensor-parallelism 8 \
+     --decode-data-parallelism 1 \
+     --url http://<node0-ip>:8000
+   ```
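+
+If workers on Node 1 fail to register or the benchmark stalls, a quick connectivity check from Node 1 toward Node 0 can save time. The sketch below assumes the default etcd (2379) and NATS (4222) ports and uses `<node0-ip>` as a placeholder for Node 0's address.
+
+```bash
+# Run on Node 1: confirm that Node 0's etcd, NATS, and frontend are reachable.
+# <node0-ip> is a placeholder; default ports are assumed.
+curl -sf http://<node0-ip>:2379/health && echo "etcd reachable"
+nc -z <node0-ip> 4222 && echo "NATS reachable"
+curl -sf http://<node0-ip>:8000/v1/models && echo "frontend reachable"
+```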
+
+## Benchmarking vLLM Aggregated Baseline
+
+> [!Important]
+> One (or two) 8xH100-80GB nodes are required for the following instructions.
+
+This section shows how to benchmark native vLLM aggregated serving for comparison with Dynamo disaggregated deployments.
+
+### Single Node
+
+1. **Start vLLM servers** (2 instances, each with TP=4):
+
+   ```bash
+   CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve Qwen/Qwen3-0.6B \
+     --block-size 128 \
+     --max-model-len 3500 \
+     --max-num-batched-tokens 3500 \
+     --tensor-parallel-size 4 \
+     --gpu-memory-utilization 0.95 \
+     --disable-log-requests \
+     --port 8001 > vllm_0.log 2>&1 &
+
+   CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve Qwen/Qwen3-0.6B \
+     --block-size 128 \
+     --max-model-len 3500 \
+     --max-num-batched-tokens 3500 \
+     --tensor-parallel-size 4 \
+     --gpu-memory-utilization 0.95 \
+     --disable-log-requests \
+     --port 8002 > vllm_1.log 2>&1 &
+   ```
+
+2. **Set up NGINX load balancer**:
+
+   ```bash
+   sudo apt update && sudo apt install -y nginx
+   sudo cp benchmarks/llm/nginx.conf /etc/nginx/nginx.conf
+   sudo service nginx restart
+   ```
+
+3. **Run the benchmark**:
+   ```bash
+   bash -x benchmarks/llm/perf.sh \
+     --mode aggregated \
+     --deployment-kind vllm_serve \
+     --tensor-parallelism 4 \
+     --data-parallelism 2 \
+     --url http://localhost:8000
+   ```
+
+### Two Nodes
+
+For two nodes, use `--tensor-parallel-size 8` and run one `vllm serve` instance per node. Update the `nginx.conf` upstream configuration to include the second node's IP address.
+
+## Using perf.sh
+
+The `perf.sh` script runs AIPerf benchmarks across multiple concurrency levels and stores results in an artifacts directory.
+
+### Basic Usage
+
+```bash
+bash benchmarks/llm/perf.sh [OPTIONS]
+```
+
+### Command-Line Options
+
+```bash
+Options:
+  --tensor-parallelism, --tp                    Tensor parallelism (for aggregated mode)
+  --data-parallelism, --dp                      Data parallelism (for aggregated mode)
+  --prefill-tensor-parallelism, --prefill-tp    Prefill tensor parallelism (for disaggregated mode)
+  --prefill-data-parallelism, --prefill-dp      Prefill data parallelism (for disaggregated mode)
+  --decode-tensor-parallelism, --decode-tp      Decode tensor parallelism (for disaggregated mode)
+  --decode-data-parallelism, --decode-dp        Decode data parallelism (for disaggregated mode)
+  --model                                       Hugging Face model ID to benchmark (default: Qwen/Qwen3-0.6B)
+  --input-sequence-length, --isl                Input sequence length (default: 3000)
+  --output-sequence-length, --osl               Output sequence length (default: 150)
+  --url                                         Target URL for inference requests (default: http://localhost:8000)
+  --concurrency                                 Comma-separated concurrency levels (default: 1,2,4,8,16,32,64,128,256)
+  --mode                                        Serving mode (default: aggregated)
+  --artifacts-root-dir                          Root directory to store benchmark results (default: artifacts_root)
+  --deployment-kind                             Deployment tag used for Pareto chart labels (default: dynamo)
+  --help                                        Show this help message and exit
+```
+
+### Examples
+
+**Custom model and sequence lengths**:
+
+```bash
+bash benchmarks/llm/perf.sh \
+  --mode aggregated \
+  --deployment-kind vllm_serve \
+  --tensor-parallelism 4 \
+  --data-parallelism 2 \
+  --model Qwen/Qwen3-0.6B \
+  --input-sequence-length 2000 \
+  --output-sequence-length 256 \
+  --concurrency 1,2,4,8,16,32,64
+```
+
+**Single concurrency level**:
+
+```bash
+bash benchmarks/llm/perf.sh \
+  --mode disaggregated \
+  --deployment-kind dynamo_vllm \
+  --prefill-tensor-parallelism 1 \
+  --prefill-data-parallelism 4 \
+  --decode-tensor-parallelism 4 \
+  --decode-data-parallelism 1 \
+  --concurrency 64
+```
+
+### Output Structure
+
+The script creates an `artifacts_root` directory (or your specified directory) with the following structure:
+
+```
+artifacts_root/
+├── artifacts_0/
+│   ├── deployment_config.json           # Deployment configuration metadata
+│   ├── -concurrency1/
+│   │   └── profile_export_aiperf.json   # AIPerf results for concurrency=1
+│   ├── -concurrency2/
+│   │   └── profile_export_aiperf.json
+│   └── ...
+├── artifacts_1/                         # Next benchmark run
+│   └── ...
+```
+
+Each `artifacts_N` directory contains results from one benchmark run. The script automatically increments the index to avoid overwriting previous results.
+
+> [!Tip]
+> Start with a clean `artifacts_root` directory when beginning a new comparison experiment to ensure you only include results from the runs you want to compare.
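+
+To sanity-check a run before plotting, you can list the result files that were produced and inspect the recorded deployment metadata. This is a minimal sketch that relies only on the file names shown above; adjust the paths if you used a custom `--artifacts-root-dir`.
+
+```bash
+# List the AIPerf result files collected so far.
+find artifacts_root -name profile_export_aiperf.json | sort
+
+# Pretty-print the deployment metadata recorded for the first run.
+python3 -m json.tool artifacts_root/artifacts_0/deployment_config.json
+```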
+
+## Using plot_pareto.py
+
+The `plot_pareto.py` script generates Pareto frontier plots from benchmark results, helping you visualize the trade-off between throughput and latency.
+
+### Basic Usage
+
+```bash
+python3 benchmarks/llm/plot_pareto.py --artifacts-root-dir artifacts_root
+```
+
+### Command-Line Options
+
+```bash
+Options:
+  --artifacts-root-dir    Root directory containing artifact directories (required)
+  --title                 Title for the Pareto graph (default: "Single Node")
+```
+
+### Examples
+
+**Single node comparison**:
+
+```bash
+python3 benchmarks/llm/plot_pareto.py --artifacts-root-dir artifacts_root
+```
+
+**Two node comparison**:
+
+```bash
+python3 benchmarks/llm/plot_pareto.py \
+  --artifacts-root-dir artifacts_root \
+  --title "Two Nodes"
+```
+
+### Output
+
+The script generates:
+
+- **`pareto_plot.png`**: Pareto frontier visualization
+- **`results.csv`**: Detailed results in CSV format
+
+## Interpreting Results
+
+### Understanding Pareto Graphs
+
+Pareto graphs help answer: **How much can output token throughput be improved by switching from aggregated to disaggregated serving when both operate under similar inter-token latency?**
+
+**Axes:**
+
+- **X-axis (tokens/s/user)**: Higher values indicate lower latency per user
+- **Y-axis (tokens/s/gpu avg)**: Average throughput per GPU
+
+**Pareto Frontier:**
+
+- The dashed line connects Pareto-efficient points
+- A point is Pareto-efficient if no other point has both higher throughput AND lower latency
+- Points on the frontier represent optimal trade-offs
+
+**Example Interpretation:**
+At 45 tokens/s/user, if the disaggregated line shows 145 tokens/s/gpu and the baseline shows 80 tokens/s/gpu, the improvement is:
+
+- **Absolute improvement**: 145 - 80 = 65 tokens/s/gpu
+- **Relative improvement**: 145 / 80 = 1.81x speedup
+
+### Metrics Explained
+
+- **Output Token Throughput**: Total tokens generated per second across all requests
+- **Output Token Throughput per User**: Average tokens per second per concurrent user (roughly the reciprocal of inter-token latency)
+- **Output Token Throughput per GPU**: Average tokens per second per GPU (efficiency metric)
+- **Time to First Token (TTFT)**: Latency from request to first token
+- **Inter Token Latency**: Average time between consecutive tokens
+
+## Comparing Multiple Deployments
+
+To compare different deployment configurations:
+
+1. **Run benchmarks for each configuration**:
+
+   ```bash
+   # Benchmark configuration A
+   bash benchmarks/llm/perf.sh --mode aggregated --deployment-kind vllm_serve --tp 4 --dp 2
+
+   # Benchmark configuration B
+   bash benchmarks/llm/perf.sh --mode disaggregated --deployment-kind dynamo_vllm --prefill-tp 1 --prefill-dp 4 --decode-tp 4 --decode-dp 1
+   ```
+
+2. **Generate comparison plot**:
+   ```bash
+   python3 benchmarks/llm/plot_pareto.py --artifacts-root-dir artifacts_root
+   ```
+
+The plot will show both configurations on the same graph, making it easy to compare their Pareto frontiers.
+
+> [!Important]
+> Ensure the `--deployment-kind` values are different for each configuration so they appear as separate series in the plot.
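+
+Alongside the comparison plot, the `results.csv` written by `plot_pareto.py` can be inspected directly to spot-check individual concurrency points. A minimal sketch, assuming the CSV is written to the current working directory:
+
+```bash
+# Render the CSV as an aligned table for a quick visual scan.
+column -s, -t results.csv | head -20
+```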
+
+## Supporting Additional Models
+
+The instructions above can be used for nearly any Hugging Face-compatible model. The key requirements are:
+
+1. **Model must be accessible** - Either downloaded locally or accessible via Hugging Face
+2. **Deployment must match** - Your deployment configuration must match the parallelism settings specified in `perf.sh`
+3. **Endpoint must be accessible** - The HTTP endpoint must be reachable at the specified URL
+
+For more complex setups or different frameworks, refer to:
+
+- [Dynamo Kubernetes Deployment Guide](../../docs/kubernetes/README.md)
+- [vLLM Backend Documentation](../../examples/backends/vllm/deploy/README.md)
+- [TensorRT-LLM Backend Documentation](../../examples/backends/trtllm/deploy/README.md)
+- [SGLang Backend Documentation](../../examples/backends/sglang/deploy/README.md)
+
+## Monitoring Deployment Readiness
+
+Ensure all workers are ready before starting a benchmark. For Dynamo deployments, you can check worker readiness:
+
+**Kubernetes:**
+
+```bash
+kubectl get pods -n <namespace> -l app=<deployment-name>
+# Check that all pods are in "Running" state
+```
+
+**Local:**
+
+- Check the logs of each worker to ensure they've finished loading the model
+- Send a test request to verify the endpoint is responding:
+  ```bash
+  curl http://localhost:8000/v1/models
+  ```
+
+## Troubleshooting
+
+### Common Issues
+
+1. **Benchmark fails with connection errors**
+
+   - Verify the endpoint URL is correct and accessible
+   - Check that NATS and etcd are running
+   - Ensure the frontend service is running and healthy
+
+2. **Incorrect parallelism settings**
+
+   - The parallelism settings in `perf.sh` must match your actual deployment
+   - Verify your deployment configuration (Kubernetes YAML or worker command-line args)
+   - Check GPU allocation matches your expectations
+
+3. **Plot generation fails**
+
+   - Ensure all required Python packages are installed: `pip install matplotlib seaborn pandas numpy`
+   - Verify the artifacts directory contains `deployment_config.json` files
+   - Check that `profile_export_aiperf.json` files exist in the concurrency subdirectories
+
+4. **Low throughput or high latency**
+   - Verify all workers are actually processing requests (check logs)
+   - Ensure network connectivity between nodes (for multinode)
+   - Check GPU utilization to confirm resources are being used
+   - Review the [Performance Tuning Guide](../../docs/performance/tuning.md) for optimization tips
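+
+For the low-throughput case, watching GPU utilization while the benchmark runs is often the fastest way to confirm whether the workers are actually busy. A minimal sketch using `nvidia-smi`:
+
+```bash
+# Sample GPU utilization and memory usage every 2 seconds during a benchmark run.
+nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv -l 2
+```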
+
+### Interconnect Configuration (Multinode)
+
+For multinode deployments, ensure the fastest interconnect is being used. Misconfiguration can cause significant latency overhead (e.g., TCP instead of RDMA for KV cache transfer).
+
+- Verify network configuration between nodes
+- Check that NIXL is using the optimal transport
+- Review backend-specific debug options if experiencing abnormal latency
+
+## Additional Resources
+
+- **[AIPerf Documentation](https://github.com/triton-inference-server/perf_analyzer/blob/main/genai-perf/docs/tutorial.md)** - Learn more about AIPerf benchmarking
+- **[Dynamo Benchmarking Guide](../../docs/benchmarks/benchmarking.md)** - General benchmarking framework documentation
+- **[Performance Tuning Guide](../../docs/performance/tuning.md)** - Optimize your deployment configuration
+- **[Metrics and Visualization](../../docs/observability/prometheus-grafana.md)** - Monitor deployments with Prometheus and Grafana