[FP16] Improved performance by fusing dequantize with compute in kernels: 20-30% Inference Speedup #78
Open: mikepapadim wants to merge 33 commits into main from feat/deq-n-compute
+2,658 −544
Commits
…optimized matrix-vector kernels, and SiLU-GLU activation
…6 task graph setup
…ce overhead, improve cache utilization, and update task graph setup to integrate fused kernel.
…k graph to integrate `ropeRotationWithCacheCopy` kernel, and remove redundant kernels (`rope` and `copyToCaches`).
…eprecate `mapContext` and `quantizeXb`
…ids, and deprecate redundant tasks in FP16 layer.
…r grid assignments, and enhance attention and FFN block configurations.
…e kernel setup, and enhance FP16 task processing.
…rrays and `mapContextWithQuantizeLogits` kernel, enhancing FP16 computation capabilities
…tailed data flow, task breakdown, and fusion points
…incorporate fused RMS normalization, gate, and up-projection
…FFN task graphs by removing deprecated tasks, consolidating RMS normalization and FFN operations into `rms_ffn_gate_up`.
…FN layers to optimize worker grid configuration.
…tmul`, and `fusedRmsNormQKVMatmul`. Refactor workers and task graphs to utilize new computations and streamline layer configurations for improved performance and reduced memory transfers.
….java into feat/deq-n-compute
…te Q/K RMSNorm into a single operation. Cleanup deprecated workers, update task names, and streamline layer configuration.
…e task graphs with fused kernels, reorganize attention and FFN block mapping, and integrate final normalization for non-NVIDIA devices. Add detailed Transformer layer task flow documentation.
…fixes, improved attention computation logic, and optimized handling of large models. Update task graph to revert to `processHeadsFlashAttention` for compatibility.
…it TaskGraph type with `var`, streamline task graph configuration by removing unused temp variables.
…ith fused kernels, update worker grid configurations, and streamline data transfer logic.
…consolidate Q/K/V bias addition into a single operation, and update worker grid configurations. Streamline attention block with optimized task mapping and detailed layer flow documentation.
…yers and update grid scheduler configuration
…c worker grid, update RoPE task configuration, and streamline layer setup.
…update Phi3 FP16 FFN layers with optimized worker grid configurations, fused workflows for attention and FFN blocks, and detailed task flow documentation.
…r Phi3 FP16 FFN layers to consolidate QKV projection tasks, and update worker grid/task configurations.
… Phi3 FP16 FFN layers to streamline task configuration and clean up commented code.
…lace `rms_ffn_gate_up` and `gateUpSiLU` tasks with a single fused task, streamline task graph and update documentation.
…ented code, and streamline Phi3 FP16 FFN layer configurations.
…ting line breaks in data transfer logic and disabling formatter for consistent formatting.
…id scheduler logic, and improve readability by adjusting formatting.
Summary
Implements fused dequantize-and-compute patterns for quantized matrix-vector operations,
eliminating intermediate memory round-trips during inference.
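A minimal plain-Java sketch of the fused pattern (the PR's actual kernels run on the GPU via task graphs; the class, method names, and row-major FP16 layout here are illustrative assumptions, with Java 20's `Float.float16ToFloat` standing in for the in-kernel FP16 conversion):

```java
// Fused dequantize + matrix-vector product: each FP16 weight is widened
// to FP32 in a register and consumed immediately by the accumulation,
// so no dequantized copy of the matrix is ever written to memory.
// Plain-Java illustration only; names and layout are assumptions.
final class FusedMatVec {

    // w: row-major FP16 weight matrix stored as raw shorts (rows x cols)
    // x: FP32 input vector (cols); y: FP32 output vector (rows)
    static void fusedMatVec(short[] w, float[] x, float[] y, int rows, int cols) {
        for (int r = 0; r < rows; r++) {
            float acc = 0.0f;
            int base = r * cols;
            for (int c = 0; c < cols; c++) {
                // Dequantize in-register: FP16 bits -> FP32 value (Java 20+).
                float wv = Float.float16ToFloat(w[base + c]);
                acc += wv * x[c];   // consumed immediately, never stored
            }
            y[r] = acc;
        }
    }
}
```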
Changes
Quantized matrix-vector kernels now dequantize weights in-register and feed them straight into the accumulation, avoiding the previous dequantize → store → load → compute pipeline; a baseline sketch follows below. The fusion targets the memory-bound decode phase, where matrix-vector products dominate. Per the commit history, related fusions include fused RMS-norm + QKV projection (`fusedRmsNormQKVMatmul`), RoPE rotation with cache copy (`ropeRotationWithCacheCopy`), and a fused gate/up SiLU-GLU FFN kernel.
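For contrast, a sketch of the baseline being removed, under the same illustrative assumptions as above: the dequantized matrix makes a full store/load round-trip through an intermediate buffer.

```java
// Unfused baseline: dequantize the whole matrix into a scratch buffer
// (store), then read it back for the matvec (load), two extra full
// passes over rows*cols floats that the fused kernel eliminates.
final class UnfusedMatVec {

    static void unfusedMatVec(short[] w, float[] x, float[] y, int rows, int cols) {
        float[] deq = new float[rows * cols];        // FP32 intermediate
        for (int i = 0; i < rows * cols; i++) {
            deq[i] = Float.float16ToFloat(w[i]);     // dequantize -> store
        }
        for (int r = 0; r < rows; r++) {             // load -> compute
            float acc = 0.0f;
            for (int c = 0; c < cols; c++) {
                acc += deq[r * cols + c] * x[c];
            }
            y[r] = acc;
        }
    }
}
```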
Benchmarks (Llama 3.2 1B FP16)
End-to-end inference runs 20-30% faster with the fused kernels.
Why This Works
Single-token generation is memory-bandwidth bound: decode is dominated by matrix-vector products that stream the full weight matrices for every token. Fusing dequantization with the compute hides the dequantization overhead by keeping values in registers rather than writing them back to memory between operations.
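A rough back-of-the-envelope illustration (these numbers are mine, not from the PR): for a d×d FP16 weight matrix, the fused path reads 2d² bytes of weights per matrix-vector product. The unfused pipeline reads the same 2d² bytes, writes 4d² bytes of FP32 intermediates, and reads them back, roughly 10d² bytes in total, about 5× the traffic, assuming the scratch buffer spills past cache. With decode throughput limited by bytes moved, that reduction shows up directly in tokens per second.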