Closed
Description
Commit 01612b7 decreased token generation performance when the model is split across multiple GPUs. The behaviour is very similar to issue #13751.

cmake options: `-DGGML_CUDA=ON -DLLAMA_CURL=OFF`
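To compare the two builds, the setup can be sketched roughly like this (a sketch, not the reporter's exact commands: the clone URL and build-directory names are assumptions; the commit hashes and cmake flags come from this report):

```shell
# Build the last-good and first-bad commits side by side (hashes from this report).
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

for commit in 086cf81e 01612b74; do
    git checkout "$commit"
    cmake -B "build-$commit" -DGGML_CUDA=ON -DLLAMA_CURL=OFF
    cmake --build "build-$commit" --config Release
done

# Then run the same benchmark against each build, e.g.:
# build-086cf81e/bin/Release/llama-bench.exe -m "Qwen3-32B-UD-Q4_K_XL.gguf" -mmp 0 -fa 1
```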
Performance before the commit:
Two GPUs:
```
> .\086cf81\bin\Release\llama-bench.exe -m "Qwen3-32B-UD-Q4_K_XL.gguf" -mmp 0 -fa 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
  Device 1: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen3 32B Q4_K - Medium        |  18.64 GiB |    32.76 B | CUDA       |  99 |  1 |    0 |           pp512 |      3650.47 ± 3.07 |
| qwen3 32B Q4_K - Medium        |  18.64 GiB |    32.76 B | CUDA       |  99 |  1 |    0 |           tg128 |        61.53 ± 0.20 |

build: 086cf81e (5921)
```
Single GPU:
```
> .\086cf81\bin\Release\llama-bench.exe -m "Qwen3-32B-UD-Q4_K_XL.gguf" -mmp 0 -fa 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen3 32B Q4_K - Medium        |  18.64 GiB |    32.76 B | CUDA       |  99 |  1 |    0 |           pp512 |     3688.40 ± 20.60 |
| qwen3 32B Q4_K - Medium        |  18.64 GiB |    32.76 B | CUDA       |  99 |  1 |    0 |           tg128 |        62.50 ± 0.02 |

build: 086cf81e (5921)
```
Performance after the commit:
Two GPUs (reduced performance):
```
> .\01612b7\bin\Release\llama-bench.exe -m "Qwen3-32B-UD-Q4_K_XL.gguf" -mmp 0 -fa 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
  Device 1: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen3 32B Q4_K - Medium        |  18.64 GiB |    32.76 B | CUDA       |  99 |  1 |    0 |           pp512 |      3632.41 ± 2.00 |
| qwen3 32B Q4_K - Medium        |  18.64 GiB |    32.76 B | CUDA       |  99 |  1 |    0 |           tg128 |        41.26 ± 0.09 |

build: 01612b74 (5922)
```
Single GPU:
```
> .\01612b7\bin\Release\llama-bench.exe -m "Qwen3-32B-UD-Q4_K_XL.gguf" -mmp 0 -fa 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen3 32B Q4_K - Medium        |  18.64 GiB |    32.76 B | CUDA       |  99 |  1 |    0 |           pp512 |     3688.09 ± 20.76 |
| qwen3 32B Q4_K - Medium        |  18.64 GiB |    32.76 B | CUDA       |  99 |  1 |    0 |           tg128 |        62.44 ± 0.10 |

build: 01612b74 (5922)
```
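For a quick sense of scale (a sketch using only the two-GPU means from the tables above): prompt processing is essentially flat between the builds, while token generation drops by roughly a third:

```shell
# Relative change between builds 086cf81e and 01612b74 (two-GPU means above);
# negative values are slowdowns.
awk 'BEGIN {
    printf "pp512: %+.1f%%\n", (3632.41 - 3650.47) / 3650.47 * 100
    printf "tg128: %+.1f%%\n", (41.26 - 61.53) / 61.53 * 100
}'
```

This reports about -0.5% for pp512 and -32.9% for tg128; the single-GPU numbers are unchanged within noise, which points at the multi-GPU path.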
Panchovix