Releases: ggml-org/llama.cpp

b7868

29 Jan 04:16
3bcc990

b7867

29 Jan 02:51
d4964a7

b7865

28 Jan 21:49
f6b533d

Vulkan Flash Attention Coopmat1 Refactor (#19075)

  • vulkan: use coopmat for flash attention p*v matrix multiplication

  • fix P loading issue

  • fix barrier position

  • remove reduction that is no longer needed

  • move max thread reduction into loop

  • remove osh padding

  • add bounds checks and padding

  • remove unused code

  • fix shmem sizes, loop duration and accesses

  • don't overwrite Qf, add new shared psh buffer instead

  • add missing bounds checks

  • use subgroup reductions

  • optimize

  • move bounds check, reduce barriers

  • support other Bc values and other subgroup sizes

  • remove D_split

  • replace Of register array with shared memory Ofsh array

  • parallelize HSV across the rowgroups

  • go back to Of in registers, not shmem

  • vectorize sfsh

  • don't store entire K tile in shmem

  • fixes

  • load large k tiles to shmem on Nvidia

  • adapt shared memory host check function to shader changes

  • remove Bc 32 case

  • remove unused variable

  • fix missing mask reduction tmspsh barrier

  • fix mask bounds check

  • fix rowmax f16 under/overflow to inf

  • fix flash_attn_cm2 BLOCK_SIZE preprocessor directives
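Several of the bullets above ("move max thread reduction into loop", "fix rowmax f16 under/overflow to inf", the P*V accumulation) refer to the streaming-softmax update at the heart of flash attention. A minimal NumPy sketch of that update, with illustrative names not taken from the shader:

```python
import numpy as np

def flash_attention(Q, K, V, tile=4):
    """Compute softmax(Q K^T) V one KV tile at a time, keeping a running
    row max and row sum so the full attention matrix is never materialized."""
    n, d = Q.shape
    O = np.zeros((n, V.shape[1]))
    row_max = np.full(n, -np.inf)   # running row maximum (prevents exp overflow)
    row_sum = np.zeros(n)           # running softmax denominator
    for start in range(0, K.shape[0], tile):
        Kt, Vt = K[start:start + tile], V[start:start + tile]
        S = Q @ Kt.T                            # scores for this KV tile
        new_max = np.maximum(row_max, S.max(axis=1))
        scale = np.exp(row_max - new_max)       # rescale previous partial results
        P = np.exp(S - new_max[:, None])        # tile probabilities (the "P" in P*V)
        O = O * scale[:, None] + P @ Vt         # accumulate P*V
        row_sum = row_sum * scale + P.sum(axis=1)
        row_max = new_max
    return O / row_sum[:, None]
```

The per-tile `P @ Vt` product is the multiplication the refactor moves onto cooperative matrices; the running-max rescaling is what the rowmax under/overflow fix protects in f16.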


b7864

28 Jan 21:43
72d3b18

spec : add self‑speculative decoding (no draft model required) + refactor (#18471)

  • server: introduce self-speculative decoding

  • server: moved self-call into speculative.cpp

  • can_speculate() includes self-speculation

Co-authored-by: Georgi Gerganov [email protected]

  • server: can_speculate() tests self-spec

  • server: replace can_speculate() with slot.can_speculate()

Co-authored-by: Sigbjørn Skjæret [email protected]

  • common: use %zu format specifier for size_t in logging

Co-authored-by: Sigbjørn Skjæret [email protected]

  • server: can_speculate() requires a task instance

  • common: ngram map, config self-speculative decoding

  • common: add enum common_speculative_type

  • common: add vector of speculative states

  • common: add option --spec-draftless

  • server: cleanup (remove slot.batch_spec, rename)

  • common: moved self-spec impl to ngram-map

  • common: cleanup (use common_speculative_state_draft)

  • spec : refactor

  • cont : naming

  • spec: remove --spec-config

  • doc: (draftless) speculative decoding

  • common: print performance in spec decoding

  • minor : cleanup

  • common : better names

  • minor : cleanup + fix build

  • minor: comments

  • CODEOWNERS: add common/ngram-map.* (#18471)

  • common : rename speculative.draftless_type -> speculative.type

  • ngram-map : fix uninitialized values

  • ngram-map : take into account the input can become shorter

  • ngram-map : revert len check for now

  • arg : change --spec-draftless -> --spec-type

  • spec : add common_speculative_state::accept()

  • spec : refactor + add common_speculative_begin()

  • spec : fix begin() call with mtmd

  • spec : additional refactor + remove common_speculative_params


Co-authored-by: Georgi Gerganov [email protected]
Co-authored-by: Sigbjørn Skjæret [email protected]
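The idea behind draftless self-speculation, as an illustrative sketch: draft tokens are proposed by matching the most recent n-gram against earlier context, so no separate draft model is needed. Names here are hypothetical, not the `common/ngram-map.*` API:

```python
def draft_from_context(tokens, n=3, max_draft=4):
    """If the last n tokens occurred earlier in the context, propose the
    tokens that followed that earlier occurrence as the speculative draft."""
    if len(tokens) < n:
        return []
    tail = tokens[-n:]
    # scan backwards for the most recent earlier occurrence of the tail n-gram
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == tail:
            return tokens[i + n : i + n + max_draft]
    return []
```

The drafted tokens are then verified by the main model in a single batch; accepted tokens advance the context, and a miss simply falls back to normal decoding.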


b7862

28 Jan 21:29
0cd7032

ggml-sycl: remove unused syclcompat header (#19140)

The syclcompat/math.hpp header is no longer used; the change that introduced it was successfully reverted (#17826).
This include path will become obsolete and be dropped in oneAPI 2026.0, effectively breaking ggml-sycl builds.


b7861

28 Jan 21:05
60368e1

jinja : undefined should be treated as sequence/iterable (return string/array) by filters/tests (#19147)

  • undefined is treated as iterable (string/array) by filters

tojson is not a supported undefined filter

  • add tests

  • add sequence and iterable tests

keep it DRY and fix some types
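The intended semantics can be illustrated with a small Python sketch (this mimics the behaviour described above; it is not llama.cpp's jinja implementation): an undefined template value behaves like an empty string/sequence for filters and tests instead of raising.

```python
class Undefined:
    """Hypothetical undefined value: empty-sequence/empty-string semantics."""
    def __iter__(self):   # iterable, like an empty array
        return iter(())
    def __str__(self):    # renders as an empty string
        return ""
    def __len__(self):    # sequence-like, with length 0
        return 0

u = Undefined()
```

With these semantics, filters such as `join`, `length`, or `list` applied to an undefined value yield the empty result rather than an error, which is the behaviour the tests above lock in.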


b7860

28 Jan 19:33
88d23ad

vulkan: handle device dedup on MacOS + Vega II Duo cards (#19058)

Deduplication here relied on Vulkan returning a unique UUID for each physical GPU. That is not always the case at the moment: on a Mac Pro 2019 running macOS with two Vega II Duo cards (4 GPUs total), MoltenVK assigns the same UUID to each pair of GPUs unless they are connected with Infinity Fabric.

See more details here: KhronosGroup/MoltenVK#2683.

The right fix belongs in MoltenVK, but until it lands, llama.cpp would only recognize 2 of the 4 GPUs in such a configuration.

The deduplication logic is therefore changed to filter out a GPU only if its UUID matches another device's but the driver differs.
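A sketch of the revised rule (device tuples are illustrative): a device is dropped only when an already-kept device reports the same UUID under a *different* driver; the same-UUID/same-driver case is the MoltenVK quirk above, so those GPUs are kept as distinct.

```python
def dedup_devices(devices):
    """devices: list of (uuid, driver, name) tuples.
    Drop a device only when a kept device has the same UUID
    but a different driver (same physical GPU, two drivers)."""
    kept = []
    for uuid, driver, name in devices:
        if any(k[0] == uuid and k[1] != driver for k in kept):
            continue  # duplicate: same GPU exposed through another driver
        kept.append((uuid, driver, name))
    return kept
```

Under the old rule (dedupe on UUID alone), the two same-UUID Vega II GPUs would have collapsed into one; under this rule both survive.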


b7858

28 Jan 17:49
b7feacf

b7857

28 Jan 14:43
6ad70c5

b7856

28 Jan 14:08
631cbfc
