Releases: ggml-org/llama.cpp
b7868
CUDA: refactor topk-moe to enable more models (GLM 4.7, Nemotron etc.) (#19126)
macOS/iOS:
Linux:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler:
b7867
sycl: fix norm kernels (l2_norm, group_norm, rms_norm) by removing the assert to support more cases (#19154)
Co-authored-by: Neo Zhang Jianyu
b7865
Vulkan Flash Attention Coopmat1 Refactor (#19075)
- vulkan: use coopmat for flash attention p*v matrix multiplication
- fix P loading issue
- fix barrier position
- remove reduction that is no longer needed
- move max thread reduction into loop
- remove osh padding
- add bounds checks and padding
- remove unused code
- fix shmem sizes, loop duration and accesses
- don't overwrite Qf, add new shared psh buffer instead
- add missing bounds checks
- use subgroup reductions
- optimize
- move bounds check, reduce barriers
- support other Bc values and other subgroup sizes
- remove D_split
- replace Of register array with shared memory Ofsh array
- parallelize HSV across the rowgroups
- go back to Of in registers, not shmem
- vectorize sfsh
- don't store entire K tile in shmem
- fixes
- load large k tiles to shmem on Nvidia
- adapt shared memory host check function to shader changes
- remove Bc 32 case
- remove unused variable
- fix missing mask reduction tmspsh barrier
- fix mask bounds check
- fix rowmax f16 under/overflow to inf
- fix flash_attn_cm2 BLOCK_SIZE preprocessor directives
b7864
spec : add self‑speculative decoding (no draft model required) + refactor (#18471)
- server: introduce self-speculative decoding
- server: moved self-call into speculative.cpp
- can_speculate() includes self-speculation
Co-authored-by: Georgi Gerganov
- server: can_speculate() tests self-spec
- server: replace can_speculate() with slot.can_speculate()
Co-authored-by: Sigbjørn Skjæret
- common: use %zu format specifier for size_t in logging
Co-authored-by: Sigbjørn Skjæret
- server: can_speculate() requires a task instance
- common: ngram map, config self-speculative decoding
- common: add enum common_speculative_type
- common: add vector of speculative states
- common: add option --spec-draftless
- server: cleanup (remove slot.batch_spec, rename)
- common: moved self-spec impl to ngram-map
- common: cleanup (use common_speculative_state_draft)
- spec : refactor
- cont : naming
- spec: remove --spec-config
- doc: (draftless) speculative decoding
- common: print performance in spec decoding
- minor : cleanup
- common : better names
- minor : cleanup + fix build
- minor: comments
- CODEOWNERS: add common/ngram-map.* (#18471)
- common : rename speculative.draftless_type -> speculative.type
- ngram-map : fix uninitialized values
- ngram-map : take into account the input can become shorter
- ngram-map : revert len check for now
- arg : change --spec-draftless -> --spec-type
- spec : add common_speculative_state::accept()
- spec : refactor + add common_speculative_begin()
- spec : fix begin() call with mtmd
- spec : additional refactor + remove common_speculative_params
Co-authored-by: Georgi Gerganov
Co-authored-by: Sigbjørn Skjæret
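For readers unfamiliar with draftless speculation, the general idea behind an n-gram lookup can be sketched as follows. This is only an illustration under assumptions, not the code in common/ngram-map.*: the function name `draft_from_history`, the `token` alias, and the parameters `n` and `n_draft` are hypothetical. It finds the most recent earlier occurrence of the last `n` tokens in the token history and proposes the tokens that followed it as a draft, which the target model would then verify as in ordinary speculative decoding.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using token = std::int32_t; // hypothetical token id type for this sketch

// Propose up to n_draft tokens by locating the most recent earlier occurrence
// of the trailing n-gram of `history` and copying the tokens that followed it.
// Returns an empty draft when there is no earlier match.
std::vector<token> draft_from_history(const std::vector<token> & history,
                                      std::size_t n, std::size_t n_draft) {
    std::vector<token> draft;
    if (n == 0 || history.size() <= n) {
        return draft;
    }
    const std::size_t tail = history.size() - n; // start index of the trailing n-gram

    // Scan candidate start positions from most recent to oldest, strictly before `tail`.
    for (std::size_t start = tail; start-- > 0; ) {
        bool match = true;
        for (std::size_t j = 0; j < n; ++j) {
            if (history[start + j] != history[tail + j]) {
                match = false;
                break;
            }
        }
        if (!match) {
            continue;
        }
        // Copy the continuation that followed the matched n-gram.
        for (std::size_t k = start + n; k < history.size() && draft.size() < n_draft; ++k) {
            draft.push_back(history[k]);
        }
        break;
    }
    return draft;
}
```

Because the draft comes from the sequence itself, no second model is loaded. The trade-off is that drafts are only useful when the output repeats earlier text, for example when quoting or editing the prompt.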
b7862
ggml-sycl: remove unused syclcompat header (#19140)
The syclcompat/math.hpp header is not used anymore. The change that introduced it was successfully reverted (#17826).
This include path will become obsolete and be dropped in oneAPI 2026.0, effectively breaking ggml-sycl builds.
b7861
jinja : undefined should be treated as sequence/iterable (return string/array) by filters/tests (#19147)
- undefined is treated as iterable (string/array) by filters
  tojson is not a supported undefined filter
- add tests
- add sequence and iterable tests
  keep it DRY and fix some types
b7860
vulkan: handle device dedup on macOS + Vega II Duo cards (#19058)
Deduplication here relied on the assumption that Vulkan returns a unique
UUID for each physical GPU. At the moment that is not always the case.
On a Mac Pro 2019 running macOS with 2 Vega II Duo cards (so, 4 GPUs in total),
MoltenVK assigns the same UUID to pairs of GPUs unless they
are connected with Infinity Fabric.
See more details here: KhronosGroup/MoltenVK#2683.
The right way is to fix that in MoltenVK, but until it is fixed,
llama.cpp would only recognize 2 of the 4 GPUs in such a configuration.
The deduplication logic here is changed to filter out a GPU only if its UUID is
the same but its driver is different.
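The rule described above can be sketched roughly as follows. This is a simplified illustration, not the ggml-vulkan implementation: the `GpuInfo` struct and the `dedup_devices` function are hypothetical names, and the sketch keeps whichever device it sees first, whereas the real code may choose between drivers differently.

```cpp
#include <array>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified device record (not the actual ggml-vulkan types).
struct GpuInfo {
    std::array<std::uint8_t, 16> uuid; // device UUID as reported through Vulkan
    std::string driver;                // driver identifier, e.g. "MoltenVK"
};

// Keep a device unless an already-kept device has the same UUID but a
// different driver. Devices that share both UUID and driver are all kept,
// so GPUs that report identical UUIDs under one driver remain visible.
std::vector<GpuInfo> dedup_devices(const std::vector<GpuInfo> & devices) {
    std::vector<GpuInfo> kept;
    for (const GpuInfo & dev : devices) {
        bool duplicate = false;
        for (const GpuInfo & other : kept) {
            if (other.uuid == dev.uuid && other.driver != dev.driver) {
                duplicate = true; // same physical GPU exposed by another driver
                break;
            }
        }
        if (!duplicate) {
            kept.push_back(dev);
        }
    }
    return kept;
}
```

With this rule, four devices that share two UUIDs under a single driver all survive, while the same UUID exposed by two different drivers still collapses to one entry.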
b7858
ggml: new backend for Virglrenderer API Remoting acceleration (v2) (#18718)
b7857
ggml-cpu: arm64: Q4_K scale unroll and vectorization (#19108)
b7856
cuda : fix "V is K view" check for non-unified KV cache (#19145)