Closed
Description
Commit 01612b7 decreased token generation performance when the model is split across multiple GPUs. The behaviour is very similar to issue #13751.

cmake options: `-DGGML_CUDA=ON -DLLAMA_CURL=OFF`
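To compare the two builds, the setup can be sketched roughly like this (a sketch, not the reporter's exact commands: the clone URL and build-directory names are assumptions; the commit hashes and cmake flags come from this report):

```shell
# Build the last-good and first-bad commits side by side (hashes from this report).
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

for commit in 086cf81e 01612b74; do
    git checkout "$commit"
    cmake -B "build-$commit" -DGGML_CUDA=ON -DLLAMA_CURL=OFF
    cmake --build "build-$commit" --config Release
done

# Then run the same benchmark against each build, e.g.:
# build-086cf81e/bin/Release/llama-bench.exe -m "Qwen3-32B-UD-Q4_K_XL.gguf" -mmp 0 -fa 1
```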
Performance before the commit:
Two GPUs:
```
> .\086cf81\bin\Release\llama-bench.exe -m "Qwen3-32B-UD-Q4_K_XL.gguf" -mmp 0 -fa 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
  Device 1: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen3 32B Q4_K - Medium        |  18.64 GiB |    32.76 B | CUDA       |  99 |  1 |    0 |           pp512 |      3650.47 ± 3.07 |
| qwen3 32B Q4_K - Medium        |  18.64 GiB |    32.76 B | CUDA       |  99 |  1 |    0 |           tg128 |        61.53 ± 0.20 |

build: 086cf81e (5921)
```
Single GPU:
```
> .\086cf81\bin\Release\llama-bench.exe -m "Qwen3-32B-UD-Q4_K_XL.gguf" -mmp 0 -fa 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen3 32B Q4_K - Medium        |  18.64 GiB |    32.76 B | CUDA       |  99 |  1 |    0 |           pp512 |     3688.40 ± 20.60 |
| qwen3 32B Q4_K - Medium        |  18.64 GiB |    32.76 B | CUDA       |  99 |  1 |    0 |           tg128 |        62.50 ± 0.02 |

build: 086cf81e (5921)
```
Performance after the commit:
Two GPUs (reduced performance):
```
> .\01612b7\bin\Release\llama-bench.exe -m "Qwen3-32B-UD-Q4_K_XL.gguf" -mmp 0 -fa 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
  Device 1: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen3 32B Q4_K - Medium        |  18.64 GiB |    32.76 B | CUDA       |  99 |  1 |    0 |           pp512 |      3632.41 ± 2.00 |
| qwen3 32B Q4_K - Medium        |  18.64 GiB |    32.76 B | CUDA       |  99 |  1 |    0 |           tg128 |        41.26 ± 0.09 |

build: 01612b74 (5922)
```
Single GPU:
```
> .\01612b7\bin\Release\llama-bench.exe -m "Qwen3-32B-UD-Q4_K_XL.gguf" -mmp 0 -fa 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen3 32B Q4_K - Medium        |  18.64 GiB |    32.76 B | CUDA       |  99 |  1 |    0 |           pp512 |     3688.09 ± 20.76 |
| qwen3 32B Q4_K - Medium        |  18.64 GiB |    32.76 B | CUDA       |  99 |  1 |    0 |           tg128 |        62.44 ± 0.10 |

build: 01612b74 (5922)
```
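For a quick sense of scale (a sketch using only the two-GPU means from the tables above): prompt processing is essentially flat between the builds, while token generation drops by roughly a third:

```shell
# Relative change between builds 086cf81e and 01612b74 (two-GPU means above);
# negative values are slowdowns.
awk 'BEGIN {
    printf "pp512: %+.1f%%\n", (3632.41 - 3650.47) / 3650.47 * 100
    printf "tg128: %+.1f%%\n", (41.26 - 61.53) / 61.53 * 100
}'
```

This reports about -0.5% for pp512 and -32.9% for tg128; the single-GPU numbers are unchanged within noise, which points at the multi-GPU path.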
Panchovix