Releases · ggml-org/llama.cpp

29 Jan 04:16

github-actions

b7868

3bcc990

b7868 Latest

CUDA: refactor topk-moe to enable more models (GLM 4.7, Nemotron etc.) (#19126)

macOS/iOS:

Linux:

Windows:

openEuler:

Assets 22

cudart-llama-bin-win-cuda-12.4-x64.zip
sha256:8c79a9b226de4b3cacfd1f83d24f962d0773be79f1e7b75c6af4ded7e32ae1d6
373 MB, 2026-01-29T04:16:57Z

cudart-llama-bin-win-cuda-13.1-x64.zip
sha256:f96935e7e385e3b2d0189239077c10fe8fd7e95690fea4afec455b1b6c7e3f18
384 MB, 2026-01-29T04:17:07Z

llama-b7868-bin-310p-openEuler-aarch64.tar.gz
sha256:d195733de7ddbc9649507913d0977e6d051018551a39cfc2ed5e76218e2d908d
52.5 MB, 2026-01-29T04:17:16Z

llama-b7868-bin-310p-openEuler-x86.tar.gz
sha256:ebd224586bb9e8fe5092feeab2c3edc0ec212dbbf40e2353749fc01704aff87c
58.2 MB, 2026-01-29T04:17:18Z

llama-b7868-bin-910b-openEuler-aarch64-aclgraph.tar.gz
sha256:26ba6c4979d9738de92c03b4ba8c16c2690e6a7793d898e126bcc8c683a116c7
52.5 MB, 2026-01-29T04:17:20Z

llama-b7868-bin-910b-openEuler-x86-aclgraph.tar.gz
sha256:c1b9225005ea7428f2db583ae66ca086098c481ca4e01ae47fb220ae9abc5e66
58.2 MB, 2026-01-29T04:17:22Z

llama-b7868-bin-macos-arm64.tar.gz
sha256:01daa88cad856c5fea88e89b21e41d6776ce5299e89c1a7706645cf115ed479e
28.5 MB, 2026-01-29T04:17:24Z

llama-b7868-bin-macos-x64.tar.gz
sha256:aad7d28385becaf5ae16a220eeb7f7f42c821749a7fce0ae37c0f4097b5a2e7d
80.6 MB, 2026-01-29T04:17:26Z

llama-b7868-bin-ubuntu-s390x.tar.gz
sha256:be350d7e417d0c6429acb23aeb0d9eb1a518cf8d73c1aa2d7438a6f4000fedcd
24.1 MB, 2026-01-29T04:17:28Z

llama-b7868-bin-ubuntu-vulkan-x64.tar.gz
sha256:4f8df98cd4f1f5f6531b3a5122511e12dafee62d200f010b8e5e41e7f7e58be5
39.2 MB, 2026-01-29T04:17:30Z

Source code (zip)
2026-01-29T02:31:28Z

Source code (tar.gz)
2026-01-29T02:31:28Z
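
Each binary asset above is listed with a `sha256:` digest. A minimal sketch of checking a downloaded archive against that digest, using only Python's standard `hashlib`; the helper names `sha256_of` and `matches_release_digest` are illustrative and not part of llama.cpp or GitHub tooling:

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_release_digest(path, published):
    """Compare a file against a digest written as 'sha256:<hex>' (hypothetical helper)."""
    expected = published.strip().lower()
    if expected.startswith("sha256:"):
        expected = expected[len("sha256:"):]
    return sha256_of(path) == expected
```

To use it, pass the path of the downloaded archive and the `sha256:…` string copied from the asset list; a `False` result means the file should not be trusted.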

29 Jan 02:51

github-actions

b7867

d4964a7

b7867

sycl: fix norm kernels (l2_norm, group_norm, rms_norm) by removing asserts to support more cases (#19154)

Co-authored-by: Neo Zhang Jianyu

macOS/iOS:

Linux:

Windows:

openEuler:

Assets 22

28 Jan 21:49

github-actions

b7865

f6b533d

b7865

Vulkan Flash Attention Coopmat1 Refactor (#19075)

vulkan: use coopmat for flash attention p*v matrix multiplication
fix P loading issue
fix barrier position
remove reduction that is no longer needed
move max thread reduction into loop
remove osh padding
add bounds checks and padding
remove unused code
fix shmem sizes, loop duration and accesses
don't overwrite Qf, add new shared psh buffer instead
add missing bounds checks
use subgroup reductions
optimize
move bounds check, reduce barriers
support other Bc values and other subgroup sizes
remove D_split
replace Of register array with shared memory Ofsh array
parallelize HSV across the rowgroups
go back to Of in registers, not shmem
vectorize sfsh
don't store entire K tile in shmem
fixes
load large k tiles to shmem on Nvidia
adapt shared memory host check function to shader changes
remove Bc 32 case
remove unused variable
fix missing mask reduction tmspsh barrier
fix mask bounds check
fix rowmax f16 under/overflow to inf
fix flash_attn_cm2 BLOCK_SIZE preprocessor directives
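
Several of the commits above concern the running row maximum and the P*V multiplication. As background, here is a plain-Python sketch of the general flash-attention recurrence for a single query row, processed one K/V tile at a time. It is not the Vulkan shader: the function name, the `block` parameter, and the list-based layout are assumptions made for illustration.

```python
import math


def attention_row_streaming(q, keys, values, block=2):
    """Softmax(q.K^T / sqrt(d)) @ V for one query row, one K/V tile at a time."""
    scale = 1.0 / math.sqrt(len(q))
    row_max = float("-inf")          # running maximum of the scores seen so far
    denom = 0.0                      # running softmax denominator
    acc = [0.0] * len(values[0])     # running, unnormalised P*V accumulator

    for start in range(0, len(keys), block):
        k_tile = keys[start:start + block]
        v_tile = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]

        new_max = max(row_max, max(scores))
        # Rescale earlier partial sums to the new maximum (exp(-inf) is 0.0).
        correction = math.exp(row_max - new_max)
        denom *= correction
        acc = [a * correction for a in acc]

        for s, v in zip(scores, v_tile):
            p = math.exp(s - new_max)
            denom += p
            acc = [a + p * x for a, x in zip(acc, v)]
        row_max = new_max

    return [a / denom for a in acc]
```

Subtracting the running maximum before exponentiating keeps the intermediate values bounded, which is why the row maximum matters for low-precision types such as f16.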

macOS/iOS:

Linux:

Windows:

openEuler:

Assets 22

28 Jan 21:43

github-actions

b7864

72d3b18

b7864

spec : add self‑speculative decoding (no draft model required) + refactor (#18471)

server: introduce self-speculative decoding
server: moved self-call into speculative.cpp
can_speculate() includes self-speculation

Co-authored-by: Georgi Gerganov

server: can_speculate() tests self-spec
server: replace can_speculate() with slot.can_speculate()

Co-authored-by: Sigbjørn Skjæret

common: use %zu format specifier for size_t in logging

Co-authored-by: Sigbjørn Skjæret

server: can_speculate() requires a task instance
common: ngram map, config self-speculative decoding
common: add enum common_speculative_type
common: add vector of speculative states
common: add option --spec-draftless
server: cleanup (remove slot.batch_spec, rename)
common: moved self-spec impl to ngram-map
common: cleanup (use common_speculative_state_draft)
spec : refactor
cont : naming
spec: remove --spec-config
doc: (draftless) speculative decoding
common: print performance in spec decoding
minor : cleanup
common : better names
minor : cleanup + fix build
minor: comments
CODEOWNERS: add common/ngram-map.* (#18471)
common : rename speculative.draftless_type -> speculative.type
ngram-map : fix uninitialized values
ngram-map : take into account the input can become shorter
ngram-map : revert len check for now
arg : change --spec-draftless -> --spec-type
spec : add common_speculative_state::accept()
spec : refactor + add common_speculative_begin()
spec : fix begin() call with mtmd
spec : additional refactor + remove common_speculative_params

Co-authored-by: Georgi Gerganov
Co-authored-by: Sigbjørn Skjæret
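
The commits above describe draftless (self-speculative) decoding built on an n-gram map. The general idea can be sketched as follows: look up the most recent earlier occurrence of the trailing n-gram in the token history and propose the tokens that followed it as the draft. This is an illustration of the technique only, not the code in `common/ngram-map.*`; the function names and the `ngram_size` and `max_draft` parameters are hypothetical.

```python
def draft_from_history(tokens, ngram_size=3, max_draft=4):
    """Propose draft tokens from the sequence's own history (no draft model)."""
    if len(tokens) <= ngram_size:
        return []
    key = tuple(tokens[-ngram_size:])
    # Scan backwards for an earlier occurrence, skipping the trailing n-gram itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tuple(tokens[start:start + ngram_size]) == key:
            follow = start + ngram_size
            return list(tokens[follow:follow + max_draft])
    return []


def accepted_prefix_length(draft, verified):
    """Count the leading draft tokens that agree with the target model's tokens."""
    count = 0
    for proposed, actual in zip(draft, verified):
        if proposed != actual:
            break
        count += 1
    return count
```

The target model then verifies the draft in a single batch and keeps only the matching prefix, so a wrong guess costs throughput but never changes the output.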

macOS/iOS:

Linux:

Windows:

openEuler:

Assets 22

28 Jan 21:29

github-actions

b7862

0cd7032

b7862

ggml-sycl: remove unused syclcompat header (#19140)

The syclcompat/math.hpp header is no longer used. The change that introduced it was successfully reverted (#17826).
This include path will become obsolete and be dropped in oneAPI 2026.0, effectively breaking ggml-sycl builds.

macOS/iOS:

Linux:

Windows:

openEuler:

Assets 22

28 Jan 21:05

github-actions

b7861

60368e1

b7861

jinja : undefined should be treated as sequence/iterable (return string/array) by filters/tests (#19147)

undefined is treated as iterable (string/array) by filters

tojson is not a supported undefined filter

add tests
add sequence and iterable tests

keep it DRY and fix some types
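
To make the behaviour described above concrete, here is a small Python model of an undefined value that filters and tests can treat as an empty string or array. It is a sketch of the semantics only: the class and helper names are hypothetical, it is not llama.cpp's C++ implementation, and raising an error for `tojson` is an assumption about what "not supported" means.

```python
class Undefined:
    """Stand-in for a missing template variable that acts like an empty sequence."""

    def __iter__(self):
        return iter(())

    def __len__(self):
        return 0

    def __str__(self):
        return ""


def is_iterable(value):
    """Model of an 'iterable' test: true if the value can be iterated."""
    try:
        iter(value)
    except TypeError:
        return False
    return True


def is_sequence(value):
    """Model of a 'sequence' test: iterable and has a length."""
    return is_iterable(value) and hasattr(value, "__len__")


def join_filter(value, separator=""):
    """Model of a 'join' filter; an undefined input yields an empty string."""
    return separator.join(str(item) for item in value)


def tojson_filter(value):
    """Model of 'tojson'; undefined is rejected (assumed behaviour)."""
    if isinstance(value, Undefined):
        raise TypeError("undefined cannot be serialised to JSON")
    import json
    return json.dumps(value)
```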

macOS/iOS:

Linux:

Windows:

openEuler:

Assets 22

28 Jan 19:33

github-actions

b7860

88d23ad

b7860

vulkan: handle device dedup on MacOS + Vega II Duo cards (#19058)

Deduplication here relied on Vulkan returning a unique UUID for each
physical GPU, which is currently not always the case.
On a Mac Pro 2019 running macOS with 2 Vega II Duo cards (4 GPUs in total),
MoltenVK assigns the same UUID to pairs of GPUs unless they
are connected with Infinity Fabric.

See KhronosGroup/MoltenVK#2683 for more details.

The right fix belongs in MoltenVK, but until that happens,
llama.cpp would recognize only 2 of the 4 GPUs in such a configuration.

The deduplication logic is changed to filter out a GPU only if its UUID is
the same as another's but its driver is different.
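
The rule in the last paragraph can be sketched as follows. This is a Python illustration, not the ggml-vulkan code: the `Device` record, its field names, and the choice to keep the first device seen are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    uuid: str
    driver: str


def dedup_devices(devices):
    """Drop a device only when a kept device has the same UUID and a different driver."""
    kept = []
    for dev in devices:
        duplicate = any(
            other.uuid == dev.uuid and other.driver != dev.driver
            for other in kept
        )
        # Assumption: the first device seen wins; a real backend may prefer one driver.
        if not duplicate:
            kept.append(dev)
    return kept
```

With this rule, four GPUs that share two UUIDs under a single driver are all kept, while the same physical GPU exposed through two different drivers is still collapsed to one entry.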

macOS/iOS:

Linux:

Windows:

openEuler:

Assets 22

28 Jan 17:49

github-actions

b7858

b7feacf

b7858

ggml: new backend for Virglrenderer API Remoting acceleration (v2) (#18718)

macOS/iOS:

Linux:

Windows:

openEuler:

Assets 22

28 Jan 14:43

github-actions

b7857

6ad70c5

b7857

ggml-cpu: arm64: Q4_K scale unroll and vectorization (#19108)

macOS/iOS:

Linux:

Windows:

openEuler:

Assets 22

28 Jan 14:08

github-actions

b7856

631cbfc

b7856

cuda : fix "V is K view" check for non-unified KV cache (#19145)

macOS/iOS:

Linux:

Windows:

openEuler:

Assets 22
