docs: Address review comments - use Qwen/Qwen3-0.6B model and reword hardware note

- Replace DeepSeek-R1-Distill-Llama-70B-FP8-dynamic with Qwen/Qwen3-0.6B throughout
  (a smaller model better suited for examples and testing)
- Change 'suboptimal results' to 'different results' for less judgmental wording

Addresses review comments from PR #4234
AsadShahid04 committed Nov 13, 2025
commit 1dbe6bd228e73e28cd3092025af03ec2bb763bcc
24 changes: 12 additions & 12 deletions benchmarks/llm/README.md
@@ -42,7 +42,7 @@
1. **Dynamo installed** - Follow the [installation guide](../../README.md#installation) to set up Dynamo
2. **Model downloaded** - Download the model you want to benchmark:
```bash
-huggingface-cli download neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
+huggingface-cli download Qwen/Qwen3-0.6B
```
3. **NATS and etcd running** - Start the required services:
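The startup commands themselves sit in the collapsed part of this diff. As a rough, non-authoritative sketch, a local setup along these lines is one common way to bring both services up; the image tags, ports, and flags below are assumptions rather than anything this README specifies:

```bash
# Assumed local setup for the two services Dynamo depends on (not taken from this README).

# NATS with JetStream enabled, on its default client port 4222
docker run -d --name nats -p 4222:4222 nats:latest -js

# Single-node etcd on its default client port 2379
docker run -d --name etcd -p 2379:2379 quay.io/coreos/etcd:v3.5.12 \
  /usr/local/bin/etcd \
  --listen-client-urls http://0.0.0.0:2379 \
  --advertise-client-urls http://0.0.0.0:2379
```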

@@ -72,7 +72,7 @@
> - **NVLink**: NVLink 4th Generation, 900 GB/s (GPU to GPU NVLink bidirectional bandwidth), 18 Links per GPU
> - **InfiniBand**: 8x400Gbit/s (Compute Links), 2x400Gbit/s (Storage Links)
>
-> Benchmarking with a different hardware configuration may yield suboptimal results.
+> Benchmarking with a different hardware configuration may yield different results.

## Deployment Options

@@ -138,27 +138,27 @@

```bash
CUDA_VISIBLE_DEVICES=0 python -m dynamo.vllm \
-    --model neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
+    --model Qwen/Qwen3-0.6B \
--is-prefill-worker > prefill_0.log 2>&1 &

CUDA_VISIBLE_DEVICES=1 python -m dynamo.vllm \
-    --model neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
+    --model Qwen/Qwen3-0.6B \
--is-prefill-worker > prefill_1.log 2>&1 &

CUDA_VISIBLE_DEVICES=2 python -m dynamo.vllm \
-    --model neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
+    --model Qwen/Qwen3-0.6B \
--is-prefill-worker > prefill_2.log 2>&1 &

CUDA_VISIBLE_DEVICES=3 python -m dynamo.vllm \
-    --model neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
+    --model Qwen/Qwen3-0.6B \
--is-prefill-worker > prefill_3.log 2>&1 &
```

3. **Start decode worker** (TP=4):

```bash
CUDA_VISIBLE_DEVICES=4,5,6,7 python -m dynamo.vllm \
-    --model neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic > decode.log 2>&1 &
+    --model Qwen/Qwen3-0.6B > decode.log 2>&1 &
```

4. **Wait for services to be ready** - Check the logs to ensure all services are fully started before benchmarking.
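A quick way to check readiness is to tail the worker logs and poll the frontend until it answers. This is an illustrative sketch only: the log file names match the commands above, but the `/v1/models` path and the `localhost:8000` address are assumptions based on the default URL listed later in this README, not something this step specifies.

```bash
# Watch the worker logs started above until they report the model is loaded
tail -f prefill_*.log decode.log

# Or poll the frontend (assumed default: http://localhost:8000) until it responds
until curl -sf http://localhost:8000/v1/models > /dev/null; do
  echo "waiting for the frontend to come up..."
  sleep 5
done
echo "frontend is ready"
```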
@@ -224,7 +224,7 @@

# Start decode worker (TP=8, using all 8 GPUs)
python -m dynamo.vllm \
-    --model neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic > decode.log 2>&1 &
+    --model Qwen/Qwen3-0.6B > decode.log 2>&1 &
```

2. **On Node 1** - Start prefill workers:
@@ -237,7 +237,7 @@
# Start 8 prefill workers (one per GPU)
for i in {0..7}; do
CUDA_VISIBLE_DEVICES=$i python -m dynamo.vllm \
-    --model neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
+    --model Qwen/Qwen3-0.6B \
--is-prefill-worker > prefill_${i}.log 2>&1 &
done
```
@@ -266,7 +266,7 @@
1. **Start vLLM servers** (2 instances, each with TP=4):

```bash
-CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
+CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve Qwen/Qwen3-0.6B \
--block-size 128 \
--max-model-len 3500 \
--max-num-batched-tokens 3500 \
@@ -275,7 +275,7 @@
--disable-log-requests \
--port 8001 > vllm_0.log 2>&1 &

-CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
+CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve Qwen/Qwen3-0.6B \
--block-size 128 \
--max-model-len 3500 \
--max-num-batched-tokens 3500 \
@@ -327,7 +327,7 @@
--prefill-data-parallelism, --prefill-dp <int> Prefill data parallelism (for disaggregated mode)
--decode-tensor-parallelism, --decode-tp <int> Decode tensor parallelism (for disaggregated mode)
--decode-data-parallelism, --decode-dp <int> Decode data parallelism (for disaggregated mode)
-  --model <model_id> Hugging Face model ID to benchmark (default: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic)
+  --model <model_id> Hugging Face model ID to benchmark (default: Qwen/Qwen3-0.6B)
--input-sequence-length, --isl <int> Input sequence length (default: 3000)
--output-sequence-length, --osl <int> Output sequence length (default: 150)
--url <http://host:port> Target URL for inference requests (default: http://localhost:8000)
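The listing above shows only the flags. As a hedged illustration, an invocation combining them might look like the following, where `./benchmark.sh` is a hypothetical stand-in for the actual entry point (its real name falls outside this hunk) and the parallelism values are arbitrary:

```bash
# Hypothetical invocation: the script name is a placeholder; only the flags
# come from the options listing above.
./benchmark.sh \
  --model Qwen/Qwen3-0.6B \
  --prefill-dp 4 \
  --decode-tp 4 \
  --isl 3000 \
  --osl 150 \
  --url http://localhost:8000
```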
@@ -560,4 +560,4 @@
- **[AIPerf Documentation](https://github.com/triton-inference-server/perf_analyzer/blob/main/genai-perf/docs/tutorial.md)** - Learn more about AIPerf benchmarking
- **[Dynamo Benchmarking Guide](../../docs/benchmarks/benchmarking.md)** - General benchmarking framework documentation
- **[Performance Tuning Guide](../../docs/performance/tuning.md)** - Optimize your deployment configuration
- **[Metrics and Visualization](../../deploy/metrics/k8s/README.md)** - Monitor deployments with Prometheus and Grafana

Check failure on line 563 in benchmarks/llm/README.md (GitHub Actions / Check for broken markdown links):
Broken link: [Metrics and Visualization](../../deploy/metrics/k8s/README.md) - View: https://github.com/ai-dynamo/dynamo/blob/HEAD/benchmarks/llm/README.md?plain=1#L563