
Conversation

yrq0208 commented Dec 1, 2025

The main optimization is to have all threads, rather than only the first global thread, compute the scaling factor. This avoids thread divergence and removes the need to synchronize threads across work groups, since the context-mapping step already uses every thread instead of just global thread 0 (cross-work-group synchronization is difficult to implement in OpenCL because it requires atomic operations and locks, though it may be doable in CUDA with grid sync). As a result, the reduction kernel and the context-mapping kernel can be merged, reducing kernel launch overhead.
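
For context, a minimal OpenCL C sketch of the idea. The kernel, the buffer names (`partialSums`, `in`, `out`), and the definition of the scaling factor as the reciprocal of a partially reduced sum are illustrative assumptions, not the PR's actual generated kernels:

```c
/*
 * Hypothetical fused kernel. Assumes a per-work-group partial reduction has
 * already been written to `partialSums` (one value per work group).
 */
__kernel void fused_scale_and_map(__global const float *partialSums, // one partial sum per work group
                                  const int numGroups,
                                  __global const float *in,          // context-mapping input
                                  __global float *out,               // context-mapping output
                                  const int n) {
    int gid = (int) get_global_id(0);

    // Before: only global thread 0 summed `partialSums` and wrote the final
    // scale to global memory, which forced a separate kernel launch so that
    // other work groups could safely read it. Here every thread redoes this
    // small final sum itself, so no cross-work-group synchronization (and no
    // second launch) is needed.
    float sum = 0.0f;
    for (int g = 0; g < numGroups; g++) {
        sum += partialSums[g];
    }
    float scale = 1.0f / sum;

    // Context mapping, fused into the same kernel and applied with the scale.
    if (gid < n) {
        out[gid] = in[gid] * scale;
    }
}
```

The trade-off is that every thread redundantly loops over the per-work-group partial sums, but that loop is short (one element per work group) and is typically cheaper than an extra kernel launch plus a round trip through global memory for the scale.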

Tested with the following models: beehive-llama-3.2-1b-instruct-fp16.gguf, beehive-llama-3.2-3b-instruct-fp16.gguf, DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf, Qwen2.5-0.5B-Instruct-f16.gguf, qwen2.5-1.5b-instruct-fp16.gguf, Qwen3-0.6B-f16.gguf, Qwen3-1.7B-f16.gguf, Qwen3-4B-f16.gguf. The 7B and 8B models were not tested due to hardware limitations (the 5080 only has 16 GB of VRAM).

End-to-end performance (tok/s) improves by 5%–22% on a mobile 5080 with the TornadoVM OpenCL backend.

CLAassistant commented Dec 1, 2025

CLA assistant check
All committers have signed the CLA.

mikepapadim requested review from Copilot, mairooni, mikepapadim and orionpapadakis and removed the request for Copilot and mikepapadim on December 1, 2025 at 14:56.
mikepapadim (Member) commented:

\rerun

yrq0208 (Author) commented Dec 1, 2025

Current issues:

- OpenCL: Qwen3-4B-f16.gguf, Mistral-7B-Instruct-v0.3.fp16.gguf, Phi-3-mini-4k-instruct-fp16.gguf, Phi-3-mini-4k-instruct-Q8_0.gguf, Mistral-7B-Instruct-v0.3.Q8_0.gguf
- PTX: Qwen3-4B-f16.gguf, Phi-3-mini-4k-instruct-fp16.gguf, Qwen3-0.6B-Q8_0.gguf, Phi-3-mini-4k-instruct-Q8_0.gguf

yrq0208 (Author) commented Dec 2, 2025

Update: I have tested both the FP16 and Q8 models with both the OpenCL and PTX backends. The only models I haven't tested are the Llama 3.2 and Qwen3 8B FP16 models, since the 5080 does not have enough VRAM for them.

I am unable to reproduce the error/gibberish output from the Mistral 7B FP16/Q8 models (they work on my side), and I am also unable to reproduce the gibberish output from the Phi3 Q8 model (it works on my side). The Phi3 FP16 model still appears bugged in the baseline without my modifications. I am using the latest TornadoVM and Tornado llama builds.

yrq0208 (Author) commented Dec 2, 2025

The gibberish output from the Qwen3 4B FP16 and Q8 models seems inconsistent on my side: sometimes I can reproduce it, sometimes I cannot. I need to take a closer look.
