
Conversation

ggerganov (Member) commented May 3, 2023

Implementation of #1241

Avoid unnecessary bit shuffling by packing the quants in a better way (a sketch of the layout change follows the checklist below).
Requires model re-quantization.

  • Q4_0

    • quantize
    • dequantize
    • dot ARM NEON
    • dot AVX
  • Q4_1

    • quantize
    • dequantize
    • dot ARM NEON
    • dot AVX
  • Q5_0

    • quantize
    • dequantize
    • dot ARM NEON
    • dot AVX
    • dot WASM SIMD
  • Q5_1

    • quantize
    • dequantize
    • dot ARM NEON
    • dot AVX
    • dot WASM SIMD
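
To make "packing the quants in a better way" concrete, here is a minimal sketch of the old and new nibble layouts for a 32-element 4-bit block; it is a simplified illustration with made-up helper names, not the exact ggml code:

```c
#include <stdint.h>

#define QK 32  // elements per block

// Old layout: byte j holds two *adjacent* elements (2j and 2j+1).
// SIMD nibble extraction (mask for the low nibbles, shift for the high ones)
// then produces the even and odd elements separately, which forces a
// shuffle/interleave step before the dot product.
static void pack_old(const uint8_t q[QK], uint8_t out[QK/2]) {
    for (int j = 0; j < QK/2; ++j) {
        out[j] = (q[2*j] & 0x0F) | ((q[2*j + 1] & 0x0F) << 4);
    }
}

// New layout: byte j holds elements j and j + QK/2.
// Masking the low nibbles yields elements 0..15 in order and shifting yields
// 16..31 in order, so the dot product needs no shuffling at all.
static void pack_new(const uint8_t q[QK], uint8_t out[QK/2]) {
    for (int j = 0; j < QK/2; ++j) {
        out[j] = (q[j] & 0x0F) | ((q[j + QK/2] & 0x0F) << 4);
    }
}
```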

New timings:

| Model | Measure      | F16  | Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 |
|-------|--------------|------|------|------|------|------|------|
| 7B    | ms/tok @ 4th | 127  | 49   | 56   | 89   | 92   | 74   |
| 7B    | ms/tok @ 8th | 120  | 44   | 52   | 49   | 52   | 70   |
| 13B   | ms/tok @ 4th | 261* | 91   | 103  | 173  | 177  | 139  |
| 13B   | ms/tok @ 8th | 316* | 81   | 95   | 103  | 113  | 134  |

\* these numbers vary a lot since the model is right at the 32 GB limit of my MacBook

Old timings:

| Model | Measure      | F16 | Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 |
|-------|--------------|-----|------|------|------|------|------|
| 7B    | ms/tok @ 4th | 128 | 56   | 61   | 91   | 95   | 75   |
| 7B    | ms/tok @ 8th | 128 | 47   | 55   | 53   | 59   | 75   |
| 13B   | ms/tok @ 4th | 239 | 104  | 113  | 176  | 185  | 141  |
| 13B   | ms/tok @ 8th | 240 | 85   | 99   | 108  | 117  | 147  |

Overall, all these numbers seem to have about +/- 10% variability from run to run. Not an ideal benchmark, but I'm not sure what else to do.

ggerganov added the performance (Speed related topics) and breaking change (Changes that break ABIs, APIs, file formats, or other forms of backwards compatibility) labels on May 3, 2023
ggerganov changed the title from "ggml : remove bit shufling" to "ggml : remove bit shuffling" on May 3, 2023
ggerganov (Member, Author) commented May 4, 2023

Unfortunately, Q4_2 does not fit into this pattern unless we introduce a Q8_2 block with a size of 16 elements instead of 32.
Any suggestions on how to proceed? Maybe drop Q4_2 support?
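
For context, a rough sketch of the block sizes involved; the struct layouts below are assumptions for illustration, not the exact ggml definitions:

```c
#include <stdint.h>

#define QK4_2 16  // Q4_2 covers 16 elements per block
#define QK8_0 32  // Q8_0 covers 32 elements per block

typedef struct {
    uint16_t d;              // scale (fp16 bits)
    uint8_t  qs[QK4_2 / 2];  // 16 x 4-bit quants -> 8 bytes
} sketch_block_q4_2;

typedef struct {
    float  d;                // scale
    int8_t qs[QK8_0];        // 32 x 8-bit quants
} sketch_block_q8_0;

// One Q8_0 block spans two Q4_2 blocks, so a shuffle-free 1:1 pairing in the
// dot product would need a hypothetical 16-element Q8_2 block instead.
```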

sw (Contributor) commented May 6, 2023

A few remarks:

  • quantize_row_q4_1_reference is broken; it's missing the i*qk offset in x[...] (see the indexing sketch after this list)
  • Q5 interleaves the nibbles but not the single MSBs, which could be confusing. The scalar implementation seems to be broken; maybe that's the reason?
  • I still think this could be done in a non-breaking manner for Q4, by simply changing the order of Q8
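
For the first remark, here is a simplified sketch of the block indexing in question, assuming a block size of QK = 32; it is an illustration only, not the actual quantize_row_q4_1_reference from ggml.c:

```c
#include <float.h>
#include <stdint.h>

#define QK 32  // elements per block (assumed)

typedef struct {
    float   d;         // scale
    float   m;         // min
    uint8_t qs[QK/2];  // 4-bit quants, two per byte
} block_q4_1_sketch;   // illustrative struct, not the real ggml block_q4_1

static void quantize_row_q4_1_sketch(const float * x, block_q4_1_sketch * y, int k) {
    const int nb = k / QK;

    for (int i = 0; i < nb; i++) {
        float min =  FLT_MAX;
        float max = -FLT_MAX;

        // the i*QK block offset is the part the remark says is missing:
        // without it, every block would re-read the first QK values of the row
        for (int j = 0; j < QK; j++) {
            const float v = x[i*QK + j];
            if (v < min) min = v;
            if (v > max) max = v;
        }

        const float d  = (max - min) / 15.0f;
        const float id = d ? 1.0f/d : 0.0f;

        y[i].d = d;
        y[i].m = min;

        for (int j = 0; j < QK/2; j++) {
            const uint8_t q0 = (uint8_t)((x[i*QK +        j] - min)*id + 0.5f);
            const uint8_t q1 = (uint8_t)((x[i*QK + QK/2 + j] - min)*id + 0.5f);
            y[i].qs[j] = q0 | (q1 << 4);  // no-shuffle packing: element j in the low nibble,
                                          // element j + QK/2 in the high nibble
        }
    }
}
```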

ggerganov (Member, Author) replied to @sw:

> I still think this could be done in a non-breaking manner for Q4, by simply changing the order of Q8

Yes, I'm still hesitating. But I think Q8 quantization will be sub-optimal this way.
Not much, but still.

ggerganov (Member, Author) commented May 7, 2023

> Q5 interleaves the nibbles but not the single MSBs, which could be confusing. The scalar implementation seems to be broken; maybe that's the reason?

Somehow perplexity computation with Q5_0 + cuBLAS is currently broken and I don't see the error.
It works on M1.

Edit: fixed

digiwombat (Contributor) commented May 9, 2023

"End users" here are not state bureaucrats with IE8

Firstly, I agree very much in spirit with your statement and with the speedup the new version will bring. I would also offer that it is fairly early in this project's lifecycle, which is an argument in favor of simply putting the breaking change through and letting it be.

On the other hand, ggml (especially via llama.cpp) is being used pretty widely, and I would ask that a thought be spared for the maintainers of supporting projects such as @LostRuins with Koboldcpp or the fine folks doing llama-cpp-python (and, downstream of it, oobabooga's webui), among others who will likely bear the brunt of user confusion over these issues. The change spreads the burden over a wider radius that the user-facing software has little control over, since those projects generally don't control the model repos and can't update the files themselves.

That's all. Just wanted to offer a bit of an explanation, since the nature of the users was raised. Generally speaking, I think the front-end projects may see their target userbase quite differently than core llama.cpp does.

Green-Sky (Collaborator) commented May 9, 2023

https://github.com/ggerganov/llama.cpp/blob/0e48eb6f6b3588d78656267b3b8029b7711f6cdf/ggml.h#L191

Can we increment this value by 1?

Edit: oh, it was all in llama.h/.cpp

sw (Contributor) commented May 9, 2023

> Can we increment this value by 1?

That would make the unaffected formats (F16, Q8) incompatible as well. The clean way would be to define new formats Q4_4, Q4_5, etc., but that gets unwieldy quickly.

LostRuins (Collaborator) commented:

@sw It doesn't have to be, though: during loading, exceptions can be added in llama.cpp to treat the old F16 and Q8 formats with either file version 1 or 2 as forward compatible.
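
A rough sketch of the kind of loading exception being suggested; the function and constant names are hypothetical, not the actual llama.cpp loader code:

```c
// Hypothetical check: accept the unchanged formats from either file version,
// but require the new version for the re-packed 4/5-bit formats.
enum sketch_ftype {
    SKETCH_FTYPE_F16,
    SKETCH_FTYPE_Q4_0,
    SKETCH_FTYPE_Q4_1,
    SKETCH_FTYPE_Q5_0,
    SKETCH_FTYPE_Q5_1,
    SKETCH_FTYPE_Q8_0,
};

static int sketch_ftype_is_loadable(int file_version, enum sketch_ftype ftype) {
    // F16 and Q8_0 tensor data kept their layout, so either version is fine.
    if (ftype == SKETCH_FTYPE_F16 || ftype == SKETCH_FTYPE_Q8_0) {
        return file_version == 1 || file_version == 2;
    }
    // The 4- and 5-bit formats were re-packed: only the new version loads correctly.
    return file_version == 2;
}
```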

ggerganov (Member, Author) commented:

Closing in favor of #1405

ProfessorSparrs commented:

Great, I finally compiled it on my PC (no AVX2 support) AND with CUDA support, and now this change makes none of my models load :(. I don't know how to quantize things; I've read a lot about it and I doubt I even have the PC resources to do it.

Green-Sky (Collaborator) commented:

@ProfessorSparrs if you have the F16 files, quantizing is very easy and WAY less resource-intensive than running the model. :) (check the quantize executable)
