RMS normalization kernel optimization by fusing the reduction kernel and context mapping kernel #77
base: main
Conversation
\rerun

Current issues:
Update: I have tested with both the FP16 and Q8 models, using both the OpenCL and PTX backends. The only models I haven't tested are the llama3.2 and qwen3 8B FP16 models, since there isn't enough VRAM on the 5080 for them. I am unable to reproduce the error/gibberish output from the Mistral 7B FP16/Q8 models (they work on my side), and I am also not able to reproduce the gibberish output from the Phi3 Q8 model (it works on my side). The Phi3 FP16 model still seems bugged in the baseline without my modifications. I am using the latest TornadoVM and Tornado llama builds.
The gibberish output from the qwen3 4B FP16 and Q8 models is inconsistent on my side: sometimes I can reproduce it, sometimes I cannot. I need to take a closer look.
The main optimization is to use all threads, instead of only the first global thread, to calculate the scaling factor. This avoids thread divergence and removes the need to synchronize threads across work groups, since in context mapping all threads are used rather than just global thread 0 (synchronizing threads across work groups is difficult to implement in OpenCL because it requires atomic operations and locks, though it may be doable in CUDA with grid sync). As a result, the reduction kernel and the context mapping kernel can be merged into one, reducing kernel launch overhead. A sketch of the fused pattern is shown below.
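Conceptually, the fused kernel follows the pattern below. This is a minimal sketch, not the actual code from this PR: the class and method names (`FusedRmsNormSketch`, `fusedRmsNormMap`), the parameter names, and the assumptions of a work-group size of at most 256 and one output element per thread are all illustrative, and it assumes TornadoVM's `KernelContext`, `FloatArray`, and `TornadoMath` APIs as used in TornadoVM's documented reduction examples. Each work group redundantly reduces the full sum of squares in local memory, so every thread can compute the scaling factor itself and immediately apply the context mapping, with no cross-work-group synchronization and no separate reduction kernel launch.

```java
import uk.ac.manchester.tornado.api.KernelContext;
import uk.ac.manchester.tornado.api.math.TornadoMath;
import uk.ac.manchester.tornado.api.types.arrays.FloatArray;

public class FusedRmsNormSketch {

    // Illustrative fused RMS-norm reduction + context mapping.
    // Every work group redundantly computes the full sum of squares, so each
    // thread derives the scaling factor itself; no cross-work-group sync and
    // no separate reduction kernel launch are required.
    public static void fusedRmsNormMap(KernelContext context, FloatArray out,
                                       FloatArray x, FloatArray weights,
                                       int size, float epsilon) {
        int gid = context.globalIdx;
        int lid = context.localIdx;
        int groupSize = context.localGroupSizeX;

        // Local scratch for the work-group tree reduction (assumes groupSize <= 256).
        float[] localSums = context.allocateFloatLocalArray(256);

        // Each thread accumulates a strided partial sum of squares over the whole vector.
        float partial = 0.0f;
        for (int i = lid; i < size; i += groupSize) {
            float v = x.get(i);
            partial += v * v;
        }
        localSums[lid] = partial;

        // Standard tree reduction inside the work group.
        for (int stride = groupSize / 2; stride > 0; stride /= 2) {
            context.localBarrier();
            if (lid < stride) {
                localSums[lid] += localSums[lid + stride];
            }
        }
        context.localBarrier();

        // All threads read the same total and compute the scaling factor,
        // instead of leaving this to global thread 0 only.
        float scale = 1.0f / TornadoMath.sqrt(localSums[0] / size + epsilon);

        // Context mapping: each thread normalizes its own element.
        if (gid < size) {
            out.set(gid, weights.get(gid) * (scale * x.get(gid)));
        }
    }
}
```

The trade-off is that every work group repeats the same reduction, but for typical hidden sizes this redundant arithmetic is cheaper than the extra kernel launch and the cross-work-group synchronization it replaces.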
Tested with the following models: beehive-llama-3.2-1b-instruct-fp16.gguf, beehive-llama-3.2-3b-instruct-fp16.gguf, DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf, Qwen2.5-0.5B-Instruct-f16.gguf, qwen2.5-1.5b-instruct-fp16.gguf, Qwen3-0.6B-f16.gguf, Qwen3-1.7B-f16.gguf, and Qwen3-4B-f16.gguf. 7B and 8B models were not tested due to hardware limitations (the 5080 only has 16 GB of VRAM).
End-to-end performance (in tok/s) improves by 5% to 22% on a 5080 mobile GPU with the TornadoVM OpenCL backend.