docs: Address review comments - use Qwen/Qwen3-0.6B model and reword hardware note

- Replace DeepSeek-R1-Distill-Llama-70B-FP8-dynamic with Qwen/Qwen3-0.6B throughout
  (a smaller model better suited for examples and testing)
- Change 'suboptimal results' to 'different results' for less judgmental wording

Addresses review comments from PR #4234
AsadShahid04 committed Nov 13, 2025
commit 1dbe6bd228e73e28cd3092025af03ec2bb763bcc
24 changes: 12 additions & 12 deletions benchmarks/llm/README.md
@@ -42,7 +42,7 @@
1. **Dynamo installed** - Follow the [installation guide](../../README.md#installation) to set up Dynamo
2. **Model downloaded** - Download the model you want to benchmark:
```bash
-huggingface-cli download neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
+huggingface-cli download Qwen/Qwen3-0.6B
```
3. **NATS and etcd running** - Start the required services:
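The startup commands themselves sit in the collapsed part of this diff. As a rough, non-authoritative sketch, a local setup along these lines is one common way to bring both services up; the image tags, ports, and flags below are assumptions rather than anything this README specifies:

```bash
# Assumed local setup for the two services Dynamo depends on (not taken from this README).

# NATS with JetStream enabled, on its default client port 4222
docker run -d --name nats -p 4222:4222 nats:latest -js

# Single-node etcd on its default client port 2379
docker run -d --name etcd -p 2379:2379 quay.io/coreos/etcd:v3.5.12 \
  /usr/local/bin/etcd \
  --listen-client-urls http://0.0.0.0:2379 \
  --advertise-client-urls http://0.0.0.0:2379
```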

@@ -72,7 +72,7 @@
> - **NVLink**: NVLink 4th Generation, 900 GB/s (GPU to GPU NVLink bidirectional bandwidth), 18 Links per GPU
> - **InfiniBand**: 8x400Gbit/s (Compute Links), 2x400Gbit/s (Storage Links)
>
-> Benchmarking with a different hardware configuration may yield suboptimal results.
+> Benchmarking with a different hardware configuration may yield different results.

## Deployment Options

@@ -138,27 +138,27 @@

```bash
CUDA_VISIBLE_DEVICES=0 python -m dynamo.vllm \
-    --model neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
+    --model Qwen/Qwen3-0.6B \
--is-prefill-worker > prefill_0.log 2>&1 &

CUDA_VISIBLE_DEVICES=1 python -m dynamo.vllm \
-    --model neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
+    --model Qwen/Qwen3-0.6B \
--is-prefill-worker > prefill_1.log 2>&1 &

CUDA_VISIBLE_DEVICES=2 python -m dynamo.vllm \
-    --model neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
+    --model Qwen/Qwen3-0.6B \
--is-prefill-worker > prefill_2.log 2>&1 &

CUDA_VISIBLE_DEVICES=3 python -m dynamo.vllm \
-    --model neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
+    --model Qwen/Qwen3-0.6B \
--is-prefill-worker > prefill_3.log 2>&1 &
```

3. **Start decode worker** (TP=4):

```bash
CUDA_VISIBLE_DEVICES=4,5,6,7 python -m dynamo.vllm \
-    --model neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic > decode.log 2>&1 &
+    --model Qwen/Qwen3-0.6B > decode.log 2>&1 &
```

4. **Wait for services to be ready** - Check the logs to ensure all services are fully started before benchmarking.
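A quick way to check readiness is to tail the worker logs and poll the frontend until it answers. This is an illustrative sketch only: the log file names match the commands above, but the `/v1/models` path and the `localhost:8000` address are assumptions based on the default URL listed later in this README, not something this step specifies.

```bash
# Watch the worker logs started above until they report the model is loaded
tail -f prefill_*.log decode.log

# Or poll the frontend (assumed default: http://localhost:8000) until it responds
until curl -sf http://localhost:8000/v1/models > /dev/null; do
  echo "waiting for the frontend to come up..."
  sleep 5
done
echo "frontend is ready"
```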
@@ -224,7 +224,7 @@

# Start decode worker (TP=8, using all 8 GPUs)
python -m dynamo.vllm \
-    --model neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic > decode.log 2>&1 &
+    --model Qwen/Qwen3-0.6B > decode.log 2>&1 &
```

2. **On Node 1** - Start prefill workers:
@@ -237,7 +237,7 @@
# Start 8 prefill workers (one per GPU)
for i in {0..7}; do
CUDA_VISIBLE_DEVICES=$i python -m dynamo.vllm \
-    --model neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
+    --model Qwen/Qwen3-0.6B \
--is-prefill-worker > prefill_${i}.log 2>&1 &
done
```
@@ -266,7 +266,7 @@
1. **Start vLLM servers** (2 instances, each with TP=4):

```bash
-CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
+CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve Qwen/Qwen3-0.6B \
--block-size 128 \
--max-model-len 3500 \
--max-num-batched-tokens 3500 \
@@ -275,7 +275,7 @@
--disable-log-requests \
--port 8001 > vllm_0.log 2>&1 &

-CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
+CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve Qwen/Qwen3-0.6B \
--block-size 128 \
--max-model-len 3500 \
--max-num-batched-tokens 3500 \
@@ -327,7 +327,7 @@
--prefill-data-parallelism, --prefill-dp <int> Prefill data parallelism (for disaggregated mode)
--decode-tensor-parallelism, --decode-tp <int> Decode tensor parallelism (for disaggregated mode)
--decode-data-parallelism, --decode-dp <int> Decode data parallelism (for disaggregated mode)
-  --model <model_id> Hugging Face model ID to benchmark (default: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic)
+  --model <model_id> Hugging Face model ID to benchmark (default: Qwen/Qwen3-0.6B)
--input-sequence-length, --isl <int> Input sequence length (default: 3000)
--output-sequence-length, --osl <int> Output sequence length (default: 150)
--url <http://host:port> Target URL for inference requests (default: http://localhost:8000)
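The listing above shows only the flags. As a hedged illustration, an invocation combining them might look like the following, where `./benchmark.sh` is a hypothetical stand-in for the actual entry point (its real name falls outside this hunk) and the parallelism values are arbitrary:

```bash
# Hypothetical invocation: the script name is a placeholder; only the flags
# come from the options listing above.
./benchmark.sh \
  --model Qwen/Qwen3-0.6B \
  --prefill-dp 4 \
  --decode-tp 4 \
  --isl 3000 \
  --osl 150 \
  --url http://localhost:8000
```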
@@ -560,4 +560,4 @@
- **[AIPerf Documentation](https://github.com/triton-inference-server/perf_analyzer/blob/main/genai-perf/docs/tutorial.md)** - Learn more about AIPerf benchmarking
- **[Dynamo Benchmarking Guide](../../docs/benchmarks/benchmarking.md)** - General benchmarking framework documentation
- **[Performance Tuning Guide](../../docs/performance/tuning.md)** - Optimize your deployment configuration
- **[Metrics and Visualization](../../deploy/metrics/k8s/README.md)** - Monitor deployments with Prometheus and Grafana

Check failure on line 563 in benchmarks/llm/README.md (GitHub Actions / Check for broken markdown links):
Broken link: [Metrics and Visualization](../../deploy/metrics/k8s/README.md) - View: https://github.com/ai-dynamo/dynamo/blob/HEAD/benchmarks/llm/README.md?plain=1#L563