
Conversation

ggerganov (Member) commented May 3, 2023

Implementation of #1241

Avoid unnecessary bit shuffling by packing the quants in a better way (a sketch of the layout change follows the checklist below).
Requires model re-quantization.

  • Q4_0

    • quantize
    • dequantize
    • dot ARM NEON
    • dot AVX
  • Q4_1

    • quantize
    • dequantize
    • dot ARM NEON
    • dot AVX
  • Q5_0

    • quantize
    • dequantize
    • dot ARM NEON
    • dot AVX
    • dot WASM SIMD
  • Q5_1

    • quantize
    • dequantize
    • dot ARM NEON
    • dot AVX
    • dot WASM SIMD
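
To make "packing the quants in a better way" concrete, here is a minimal sketch of the old and new nibble layouts for a 32-element 4-bit block; it is a simplified illustration with made-up helper names, not the exact ggml code:

```c
#include <stdint.h>

#define QK 32  // elements per block

// Old layout: byte j holds two *adjacent* elements (2j and 2j+1).
// SIMD nibble extraction (mask for the low nibbles, shift for the high ones)
// then produces the even and odd elements separately, which forces a
// shuffle/interleave step before the dot product.
static void pack_old(const uint8_t q[QK], uint8_t out[QK/2]) {
    for (int j = 0; j < QK/2; ++j) {
        out[j] = (q[2*j] & 0x0F) | ((q[2*j + 1] & 0x0F) << 4);
    }
}

// New layout: byte j holds elements j and j + QK/2.
// Masking the low nibbles yields elements 0..15 in order and shifting yields
// 16..31 in order, so the dot product needs no shuffling at all.
static void pack_new(const uint8_t q[QK], uint8_t out[QK/2]) {
    for (int j = 0; j < QK/2; ++j) {
        out[j] = (q[j] & 0x0F) | ((q[j + QK/2] & 0x0F) << 4);
    }
}
```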

New timings:

| Model | Measure      | F16  | Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 |
|-------|--------------|------|------|------|------|------|------|
| 7B    | ms/tok @ 4th | 127  | 49   | 56   | 89   | 92   | 74   |
| 7B    | ms/tok @ 8th | 120  | 44   | 52   | 49   | 52   | 70   |
| 13B   | ms/tok @ 4th | 261* | 91   | 103  | 173  | 177  | 139  |
| 13B   | ms/tok @ 8th | 316* | 81   | 95   | 103  | 113  | 134  |

\* these numbers vary a lot since the model is right at the 32 GB limit of my MacBook

Old timings:

| Model | Measure      | F16 | Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 |
|-------|--------------|-----|------|------|------|------|------|
| 7B    | ms/tok @ 4th | 128 | 56   | 61   | 91   | 95   | 75   |
| 7B    | ms/tok @ 8th | 128 | 47   | 55   | 53   | 59   | 75   |
| 13B   | ms/tok @ 4th | 239 | 104  | 113  | 176  | 185  | 141  |
| 13B   | ms/tok @ 8th | 240 | 85   | 99   | 108  | 117  | 147  |

Overall, all these numbers seem to have about +/- 10% variability from run to run. Not an ideal benchmark, but I'm not sure what else to do.

ggerganov added the performance (Speed related topics) and breaking change (Changes that break ABIs, APIs, file formats, or other forms of backwards compatibility) labels on May 3, 2023
ggerganov changed the title from "ggml : remove bit shufling" to "ggml : remove bit shuffling" on May 3, 2023
ggerganov (Member, Author) commented May 4, 2023

Unfortunately, Q4_2 does not fit into this pattern unless we introduce a Q8_2 block with a size of 16 elements instead of 32.
Any suggestions on how to proceed? Maybe drop Q4_2 support?
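
For context, a rough sketch of the block sizes involved; the struct layouts below are assumptions for illustration, not the exact ggml definitions:

```c
#include <stdint.h>

#define QK4_2 16  // Q4_2 covers 16 elements per block
#define QK8_0 32  // Q8_0 covers 32 elements per block

typedef struct {
    uint16_t d;              // scale (fp16 bits)
    uint8_t  qs[QK4_2 / 2];  // 16 x 4-bit quants -> 8 bytes
} sketch_block_q4_2;

typedef struct {
    float  d;                // scale
    int8_t qs[QK8_0];        // 32 x 8-bit quants
} sketch_block_q8_0;

// One Q8_0 block spans two Q4_2 blocks, so a shuffle-free 1:1 pairing in the
// dot product would need a hypothetical 16-element Q8_2 block instead.
```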

sw (Contributor) commented May 6, 2023

A few remarks:

  • quantize_row_q4_1_reference is broken; it's missing the i*qk offset in x[...] (see the indexing sketch after this list)
  • Q5 interleaves the nibbles but not the single MSBs, which could be confusing. The scalar implementation seems to be broken; maybe that's the reason?
  • I still think this could be done in a non-breaking manner for Q4, by simply changing the order of Q8
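
For the first remark, here is a simplified sketch of the block indexing in question, assuming a block size of QK = 32; it is an illustration only, not the actual quantize_row_q4_1_reference from ggml.c:

```c
#include <float.h>
#include <stdint.h>

#define QK 32  // elements per block (assumed)

typedef struct {
    float   d;         // scale
    float   m;         // min
    uint8_t qs[QK/2];  // 4-bit quants, two per byte
} block_q4_1_sketch;   // illustrative struct, not the real ggml block_q4_1

static void quantize_row_q4_1_sketch(const float * x, block_q4_1_sketch * y, int k) {
    const int nb = k / QK;

    for (int i = 0; i < nb; i++) {
        float min =  FLT_MAX;
        float max = -FLT_MAX;

        // the i*QK block offset is the part the remark says is missing:
        // without it, every block would re-read the first QK values of the row
        for (int j = 0; j < QK; j++) {
            const float v = x[i*QK + j];
            if (v < min) min = v;
            if (v > max) max = v;
        }

        const float d  = (max - min) / 15.0f;
        const float id = d ? 1.0f/d : 0.0f;

        y[i].d = d;
        y[i].m = min;

        for (int j = 0; j < QK/2; j++) {
            const uint8_t q0 = (uint8_t)((x[i*QK +        j] - min)*id + 0.5f);
            const uint8_t q1 = (uint8_t)((x[i*QK + QK/2 + j] - min)*id + 0.5f);
            y[i].qs[j] = q0 | (q1 << 4);  // no-shuffle packing: element j in the low nibble,
                                          // element j + QK/2 in the high nibble
        }
    }
}
```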

ggerganov (Member, Author) replied to @sw:

> I still think this could be done in a non-breaking manner for Q4, by simply changing the order of Q8

Yes, I'm still hesitating. But I think Q8 quantization will be sub-optimal this way.
Not much, but still.

ggerganov (Member, Author) commented May 7, 2023

> Q5 interleaves the nibbles but not the single MSBs, which could be confusing. The scalar implementation seems to be broken; maybe that's the reason?

Somehow perplexity computation with Q5_0 + cuBLAS is currently broken and I don't see the error.
It works on M1.

Edit: fixed

digiwombat (Contributor) commented May 9, 2023

"End users" here are not state bureaucrats with IE8

Firstly, I agree very much in spirit with your statement and with the speedup the new version will bring. I would also offer that it is fairly early in this project's lifecycle, which is an argument in favor of simply putting the breaking change through and letting it be.

On the other hand, ggml (especially via llama.cpp) is being used pretty widely, and I would ask that a thought be spared for the maintainers of supporting projects such as @LostRuins with Koboldcpp or the fine folks doing llama-cpp-python (and, downstream of it, oobabooga's webui), among others who will likely bear the brunt of user confusion over these issues. The change spreads the burden over a wider radius that the user-facing software has little control over, since those projects generally don't control the model repos and can't update the files themselves.

That's all. Just wanted to offer a bit of an explanation, since the nature of the users was raised. Generally speaking, I think the front-end projects may see their target userbase quite differently than core llama.cpp does.

Green-Sky (Collaborator) commented May 9, 2023

https://github.com/ggerganov/llama.cpp/blob/0e48eb6f6b3588d78656267b3b8029b7711f6cdf/ggml.h#L191

Can we increment this value by 1?

Edit: oh, it was all in llama.h/.cpp

sw (Contributor) commented May 9, 2023

> Can we increment this value by 1?

That would make the unaffected formats (F16, Q8) incompatible as well. The clean way would be to define new formats Q4_4, Q4_5, etc., but that gets unwieldy quickly.

LostRuins (Collaborator) commented:

@sw It doesn't have to be, though: during loading, exceptions can be added in llama.cpp to treat the old F16 and Q8 formats with either file version 1 or 2 as forward compatible.
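
A rough sketch of the kind of loading exception being suggested; the function and constant names are hypothetical, not the actual llama.cpp loader code:

```c
// Hypothetical check: accept the unchanged formats from either file version,
// but require the new version for the re-packed 4/5-bit formats.
enum sketch_ftype {
    SKETCH_FTYPE_F16,
    SKETCH_FTYPE_Q4_0,
    SKETCH_FTYPE_Q4_1,
    SKETCH_FTYPE_Q5_0,
    SKETCH_FTYPE_Q5_1,
    SKETCH_FTYPE_Q8_0,
};

static int sketch_ftype_is_loadable(int file_version, enum sketch_ftype ftype) {
    // F16 and Q8_0 tensor data kept their layout, so either version is fine.
    if (ftype == SKETCH_FTYPE_F16 || ftype == SKETCH_FTYPE_Q8_0) {
        return file_version == 1 || file_version == 2;
    }
    // The 4- and 5-bit formats were re-packed: only the new version loads correctly.
    return file_version == 2;
}
```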

ggerganov (Member, Author) commented:

Closing in favor of #1405

ProfessorSparrs commented:

Great, I finally compiled it on my PC (no AVX2 support) AND with CUDA support, and now this change makes none of my models load :(. I don't know how to quantize things; I've read a lot about it and I doubt I even have the PC resources to do it.

Green-Sky (Collaborator) commented:

@ProfessorSparrs if you have the F16 files, quantizing is very easy and WAY less resource-intensive than running the model. :) (check the quantize executable)
