ggml : remove bit shuffling #1305
Conversation
Unfortunately,
A few remarks:
Yes, I'm still hesitating. But I think
Somehow perplexity computation with Edit: fixed
Firstly, I agree very much in spirit with your statement and with the speedup the new version will bring. I would also offer that it is fairly early in the lifecycle of this project, which is an argument in favor of just putting the breaking change through and letting it be. On the other hand, ggml (especially via llama.cpp) is being used pretty widely, and I would ask that a thought be spared for the maintainers of supporting projects like @LostRuins with Koboldcpp or the fine folks doing llama-cpp-python (and, downstream of it, oobabooga's webui), among others who will likely bear the brunt of user confusion over these issues. It places a burden on a wide radius of user-facing software whose maintainers have little control over the situation, since they generally don't control the model repos and can't update the files themselves. That's all. Just wanted to toss out a bit of an explanation since the nature of the users was raised. Generally speaking, I think the front-end projects may see their target userbase much differently than core llama.cpp does.
Can we increment this value by 1? Edit: oh, it was all in llama.h/.cpp
That would make the unaffected formats (F16, Q8) incompatible. The clean way would be to define new formats Q4_4, Q4_5, etc., but that gets unwieldy quickly.
@sw It doesn't have to be, though: during loading, exceptions can be added in llama.cpp to treat the old F16 and Q8 formats with either file version 1 or 2 as forward compatible.
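As an illustration of that idea (a minimal sketch with hypothetical names, not the actual llama.cpp loader code or its real enums), the loader could gate acceptance of old files on whether a tensor format's packing actually changed:

```c
// Sketch only: hypothetical identifiers, not the real llama.cpp ones.
// The idea: after bumping the file version for the new quant packing,
// old files are still accepted for formats whose layout did not change.

enum file_version  { FILE_VERSION_1 = 1, FILE_VERSION_2 = 2 };
enum tensor_format { FMT_F16, FMT_Q4_0, FMT_Q4_1, FMT_Q5_0, FMT_Q5_1, FMT_Q8_0 };

static int format_is_loadable(enum file_version ver, enum tensor_format fmt) {
    if (ver == FILE_VERSION_2) {
        return 1; // current version: every format uses the new packing
    }
    if (ver == FILE_VERSION_1) {
        // version-1 files are only accepted for formats whose packing
        // is identical in both versions
        return fmt == FMT_F16 || fmt == FMT_Q8_0;
    }
    return 0; // unknown version
}
```

That way F16 and Q8 files would keep working, while the re-packed 4-bit/5-bit formats get a clear "please re-quantize" error instead of silently loading garbage.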
Closing in favor of #1405
Great, I finally compiled it on my PC (no AVX2 support) AND with CUDA support. And this change makes none of my models load :(. I don't know how to quantize things; I've read a lot about it and I doubt I even have the PC resources to do it.
@ProfessorSparrs If you have the F16 files, quantizing is very easy and WAY less resource intensive than running the model. :) (check the
Implementation of #1241
Avoid unnecessary bit shuffling by packing the quants in a better way.
Requires model re-quantization
Q4_0
Q4_1
Q5_0
Q5_1
New timings:
Old timings:
Overall, all these numbers seem to have about +/- 10% variability from run to run. Not an ideal benchmark, but I'm not sure what else to do.
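For anyone trying to picture the packing change, here is a rough sketch of the old vs. new Q4_0 nibble layout. The struct and function names are simplified stand-ins, not the actual ggml definitions:

```c
// Sketch only: simplified block layout, not the actual ggml structs.
#include <stdint.h>

#define QK 32 // values per block

typedef struct {
    float   d;          // block scale
    uint8_t qs[QK / 2]; // 4-bit quants, two per byte
} block_q4_0_sketch;

// Old layout: adjacent values share a byte (q[2j] in the low nibble,
// q[2j+1] in the high nibble), so SIMD dequantization has to shuffle
// bits/bytes to restore the original value order.
static void dequantize_old(const block_q4_0_sketch * b, float * y) {
    for (int j = 0; j < QK/2; ++j) {
        y[2*j + 0] = ((b->qs[j] & 0x0F) - 8) * b->d;
        y[2*j + 1] = ((b->qs[j] >>   4) - 8) * b->d;
    }
}

// New layout: byte j holds q[j] in the low nibble and q[j + QK/2] in the
// high nibble. The two halves of the block come out contiguously, so wide
// loads need no shuffling at all.
static void dequantize_new(const block_q4_0_sketch * b, float * y) {
    for (int j = 0; j < QK/2; ++j) {
        y[j]        = ((b->qs[j] & 0x0F) - 8) * b->d;
        y[j + QK/2] = ((b->qs[j] >>   4) - 8) * b->d;
    }
}
```

The same low-half/high-half packing applies to the other affected formats as well; they just add a per-block min (Q4_1, Q5_1) and/or a separately stored fifth bit (Q5_0, Q5_1) on top of the nibbles.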