[FP16] Improved performance by fusing dequantize with compute in kernels: 20-30% Inference Speedup #78
Open: mikepapadim wants to merge 33 commits into main from feat/deq-n-compute
+2,658 −544
Commits
…optimized matrix-vector kernels, and SiLU-GLU activation
…6 task graph setup
…ce overhead, improve cache utilization, and update task graph setup to integrate fused kernel.
…k graph to integrate `ropeRotationWithCacheCopy` kernel, and remove redundant kernels (`rope` and `copyToCaches`).
…eprecate `mapContext` and `quantizeXb`
…ids, and deprecate redundant tasks in FP16 layer.
…r grid assignments, and enhance attention and FFN block configurations.
…e kernel setup, and enhance FP16 task processing.
…rrays and `mapContextWithQuantizeLogits` kernel, enhancing FP16 computation capabilities
…tailed data flow, task breakdown, and fusion points
…incorporate fused RMS normalization, gate, and up-projection
…FFN task graphs by removing deprecated tasks, consolidating RMS normalization and FFN operations into `rms_ffn_gate_up`.
…FN layers to optimize worker grid configuration.
…tmul`, and `fusedRmsNormQKVMatmul`. Refactor workers and task graphs to utilize new computations and streamline layer configurations for improved performance and reduced memory transfers.
….java into feat/deq-n-compute
…te Q/K RMSNorm into a single operation. Cleanup deprecated workers, update task names, and streamline layer configuration.
…e task graphs with fused kernels, reorganize attention and FFN block mapping, and integrate final normalization for non-NVIDIA devices. Add detailed Transformer layer task flow documentation.
…fixes, improved attention computation logic, and optimized handling of large models. Update task graph to revert to `processHeadsFlashAttention` for compatibility.
…it TaskGraph type with `var`, streamline task graph configuration by removing unused temp variables.
…ith fused kernels, update worker grid configurations, and streamline data transfer logic.
…consolidate Q/K/V bias addition into a single operation, and update worker grid configurations. Streamline attention block with optimized task mapping and detailed layer flow documentation.
…yers and update grid scheduler configuration
…c worker grid, update RoPE task configuration, and streamline layer setup.
…update Phi3 FP16 FFN layers with optimized worker grid configurations, fused workflows for attention and FFN blocks, and detailed task flow documentation.
…r Phi3 FP16 FFN layers to consolidate QKV projection tasks, and update worker grid/task configurations.
… Phi3 FP16 FFN layers to streamline task configuration and clean up commented code.
…lace `rms_ffn_gate_up` and `gateUpSiLU` tasks with a single fused task, streamline task graph and update documentation.
…ented code, and streamline Phi3 FP16 FFN layer configurations.
…ting line breaks in data transfer logic and disabling formatter for consistent formatting.
…id scheduler logic, and improve readability by adjusting formatting.
Summary
Implements fused dequantize-and-compute patterns for quantized matrix-vector operations,
eliminating intermediate memory round-trips during inference.
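A minimal plain-Java sketch of the fused pattern (the PR's actual kernels run on the GPU via task graphs; the class, method names, and row-major FP16 layout here are illustrative assumptions, with Java 20's `Float.float16ToFloat` standing in for the in-kernel FP16 conversion):

```java
// Fused dequantize + matrix-vector product: each FP16 weight is widened
// to FP32 in a register and consumed immediately by the accumulation,
// so no dequantized copy of the matrix is ever written to memory.
// Plain-Java illustration only; names and layout are assumptions.
final class FusedMatVec {

    // w: row-major FP16 weight matrix stored as raw shorts (rows x cols)
    // x: FP32 input vector (cols); y: FP32 output vector (rows)
    static void fusedMatVec(short[] w, float[] x, float[] y, int rows, int cols) {
        for (int r = 0; r < rows; r++) {
            float acc = 0.0f;
            int base = r * cols;
            for (int c = 0; c < cols; c++) {
                // Dequantize in-register: FP16 bits -> FP32 value (Java 20+).
                float wv = Float.float16ToFloat(w[base + c]);
                acc += wv * x[c];   // consumed immediately, never stored
            }
            y[r] = acc;
        }
    }
}
```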
Changes
Quantized matrix-vector kernels now dequantize weights in-register and feed them straight into the accumulation, avoiding the previous dequantize → store → load → compute pipeline; a baseline sketch follows below. The fusion targets the memory-bound decode phase, where matrix-vector products dominate. Per the commit history, related fusions include fused RMS-norm + QKV projection (`fusedRmsNormQKVMatmul`), RoPE rotation with cache copy (`ropeRotationWithCacheCopy`), and a fused gate/up SiLU-GLU FFN kernel.
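For contrast, a sketch of the baseline being removed, under the same illustrative assumptions as above: the dequantized matrix makes a full store/load round-trip through an intermediate buffer.

```java
// Unfused baseline: dequantize the whole matrix into a scratch buffer
// (store), then read it back for the matvec (load), two extra full
// passes over rows*cols floats that the fused kernel eliminates.
final class UnfusedMatVec {

    static void unfusedMatVec(short[] w, float[] x, float[] y, int rows, int cols) {
        float[] deq = new float[rows * cols];        // FP32 intermediate
        for (int i = 0; i < rows * cols; i++) {
            deq[i] = Float.float16ToFloat(w[i]);     // dequantize -> store
        }
        for (int r = 0; r < rows; r++) {             // load -> compute
            float acc = 0.0f;
            for (int c = 0; c < cols; c++) {
                acc += deq[r * cols + c] * x[c];
            }
            y[r] = acc;
        }
    }
}
```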
Benchmarks (Llama 3.2 1B FP16)
End-to-end inference runs 20-30% faster with the fused kernels.
Why This Works
Single-token generation is memory-bandwidth bound: decode is dominated by matrix-vector products that stream the full weight matrices for every token. Fusing dequantization with the compute hides the dequantization overhead by keeping values in registers rather than writing them back to memory between operations.
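A rough back-of-the-envelope illustration (these numbers are mine, not from the PR): for a d×d FP16 weight matrix, the fused path reads 2d² bytes of weights per matrix-vector product. The unfused pipeline reads the same 2d² bytes, writes 4d² bytes of FP32 intermediates, and reads them back, roughly 10d² bytes in total, about 5× the traffic, assuming the scratch buffer spills past cache. With decode throughput limited by bytes moved, that reduction shows up directly in tokens per second.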