RMS normalization kernel optimization by fusing the reduction kernel and context mapping kernel #77
base: main
Conversation
\rerun

Current issues:
Update: I have tested with both the FP16 and Q8 models, using both the OpenCL and PTX backends. The only models I haven't tested are the llama3.2 and qwen3 8B FP16 models, since there isn't enough VRAM on the 5080 for them. I am unable to reproduce the error/gibberish output from the Mistral 7B FP16/Q8 models (they work on my side), and I am also not able to reproduce the gibberish output from the Phi3 Q8 model (it works on my side). The Phi3 FP16 model still seems bugged in the baseline without my modifications. I am using the latest TornadoVM and Tornado llama builds.
The gibberish output from the qwen3 4B FP16 and Q8 models is inconsistent on my side: sometimes I can reproduce it, sometimes I cannot. I need to take a closer look.
The main optimization is to use all threads, instead of only the first global thread, to calculate the scaling factor. This avoids thread divergence and removes the need to synchronize threads across work groups, since in context mapping all threads are used rather than just global thread 0 (synchronizing threads across work groups is difficult to implement in OpenCL because it requires atomic operations and locks, though it may be doable in CUDA with grid sync). As a result, the reduction kernel and the context mapping kernel can be merged into one, reducing kernel launch overhead. A sketch of the fused pattern is shown below.
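Conceptually, the fused kernel follows the pattern below. This is a minimal sketch, not the actual code from this PR: the class and method names (`FusedRmsNormSketch`, `fusedRmsNormMap`), the parameter names, and the assumptions of a work-group size of at most 256 and one output element per thread are all illustrative, and it assumes TornadoVM's `KernelContext`, `FloatArray`, and `TornadoMath` APIs as used in TornadoVM's documented reduction examples. Each work group redundantly reduces the full sum of squares in local memory, so every thread can compute the scaling factor itself and immediately apply the context mapping, with no cross-work-group synchronization and no separate reduction kernel launch.

```java
import uk.ac.manchester.tornado.api.KernelContext;
import uk.ac.manchester.tornado.api.math.TornadoMath;
import uk.ac.manchester.tornado.api.types.arrays.FloatArray;

public class FusedRmsNormSketch {

    // Illustrative fused RMS-norm reduction + context mapping.
    // Every work group redundantly computes the full sum of squares, so each
    // thread derives the scaling factor itself; no cross-work-group sync and
    // no separate reduction kernel launch are required.
    public static void fusedRmsNormMap(KernelContext context, FloatArray out,
                                       FloatArray x, FloatArray weights,
                                       int size, float epsilon) {
        int gid = context.globalIdx;
        int lid = context.localIdx;
        int groupSize = context.localGroupSizeX;

        // Local scratch for the work-group tree reduction (assumes groupSize <= 256).
        float[] localSums = context.allocateFloatLocalArray(256);

        // Each thread accumulates a strided partial sum of squares over the whole vector.
        float partial = 0.0f;
        for (int i = lid; i < size; i += groupSize) {
            float v = x.get(i);
            partial += v * v;
        }
        localSums[lid] = partial;

        // Standard tree reduction inside the work group.
        for (int stride = groupSize / 2; stride > 0; stride /= 2) {
            context.localBarrier();
            if (lid < stride) {
                localSums[lid] += localSums[lid + stride];
            }
        }
        context.localBarrier();

        // All threads read the same total and compute the scaling factor,
        // instead of leaving this to global thread 0 only.
        float scale = 1.0f / TornadoMath.sqrt(localSums[0] / size + epsilon);

        // Context mapping: each thread normalizes its own element.
        if (gid < size) {
            out.set(gid, weights.get(gid) * (scale * x.get(gid)));
        }
    }
}
```

The trade-off is that every work group repeats the same reduction, but for typical hidden sizes this redundant arithmetic is cheaper than the extra kernel launch and the cross-work-group synchronization it replaces.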
Tested with the following models: beehive-llama-3.2-1b-instruct-fp16.gguf, beehive-llama-3.2-3b-instruct-fp16.gguf, DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf, Qwen2.5-0.5B-Instruct-f16.gguf, qwen2.5-1.5b-instruct-fp16.gguf, Qwen3-0.6B-f16.gguf, Qwen3-1.7B-f16.gguf, and Qwen3-4B-f16.gguf. 7B and 8B models were not tested due to hardware limitations (the 5080 only has 16 GB of VRAM).
End-to-end performance (in tok/s) improves by 5% to 22% on a 5080 mobile GPU with the TornadoVM OpenCL backend.