
Commit a47b20f

slight fix
1 parent 2f34577 commit a47b20f

File tree

1 file changed: +2 −2 lines


_posts/2024-01-03-introduce-flashinfer.md

Lines changed: 2 additions & 2 deletions
@@ -107,7 +107,7 @@ We report achieved TFLops/s for prefill & append attention kernels, and GPU memo
 
 ### Prefill Kernels
 
-For prefill (multi-query) attention, we reimplemented the FlashAttention 2 algorithm in pure CUDA with some additional optimizations. Standard FlashAttention implementation uses Tensor Cores with fp16 input and fp32 accumulator, however, RTX 4090 GPUs has lower Tensor Cores performance with fp32 accumulator, we observe that the $\frac{\mathbf{q}\cdot \mathbf{k}^{T}}{\sqrt(d)}$ phase in attention computation have small range and can be accumulated with fp16 (because the head dimension is always small: e.g. 128), FlashInfer provides an `allow_fp16_qk_reduction` option to allow this optimization (but still use fp32 accumulation for $\mathbf{score} \cdot \mathbf{v}$), this optimization could bring 50% speedup on RTX 4090. Below is the performance comparison of FlashInfer 0.0.1 and FlashAttention 2.4.2 on different GPUs:
+For prefill (multi-query) attention, we reimplemented the FlashAttention 2 algorithm in pure CUDA with some additional optimizations. Standard FlashAttention implementation uses Tensor Cores with fp16 input and fp32 accumulator, however, RTX 4090 GPUs has lower Tensor Cores performance with fp32 accumulator, we observe that the $\frac{\mathbf{q}\cdot \mathbf{k}^{T}}{\sqrt(d)}$ phase in attention computation have small value range and can be accumulated with fp16 (because the head dimension is always small: e.g. 128), FlashInfer provides an `allow_fp16_qk_reduction` option to allow this optimization (but still use fp32 accumulation for $\mathbf{score} \cdot \mathbf{v}$), this optimization could bring 50% speedup on RTX 4090. Below is the performance comparison of FlashInfer 0.0.1 and FlashAttention 2.4.2 on different GPUs:
 
 <p align="center">
 <img src="/assets/imgs/single-prefill-benchmark.png" alt="single prefill kernel benchmarks" width="800"/>
@@ -242,5 +242,5 @@ This blog post is written by [Zihao Ye](https://homes.cs.washington.edu/~zhye/).
 
 We also thank Masahiro Masuda (OctoAI), Yixin Dong (UW & SJTU), Roy Lu (UW), Chien-Yu Lin (UW), Ying Sheng (Stanford & LMSys) and Lianmin Zheng (Berkeley & LMSys) for their valuable feedbacks and discussions.
 
-## References
+## Footnotes
 [^1]: [Dissecting Batching Effects in GPT Inference](https://le.qun.ch/en/blog/2023/05/13/transformer-batching/) by Lequn Chen
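The `allow_fp16_qk_reduction` option described in the diff above can be exercised from Python. The snippet below is a minimal sketch, not part of this commit: it assumes flashinfer exposes a `single_prefill_with_kv_cache` entry point with `causal` and `allow_fp16_qk_reduction` keyword arguments and NHD-shaped fp16 tensors, and it simply compares the default fp32 q·kᵀ accumulation against the fp16-reduction path.

```python
# Hypothetical micro-check (assumed API, see lead-in above): compare the default
# fp32 q*k^T accumulation with the allow_fp16_qk_reduction fast path.
import torch
import flashinfer

qo_len, kv_len, num_heads, head_dim = 4096, 4096, 32, 128
q = torch.randn(qo_len, num_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(kv_len, num_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_heads, head_dim, dtype=torch.float16, device="cuda")

# Baseline: fp32 accumulation for both q*k^T and score*v.
out_fp32 = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True)

# fp16 accumulation for the q*k^T reduction only; score*v still accumulates in fp32.
out_fp16_qk = flashinfer.single_prefill_with_kv_cache(
    q, k, v, causal=True, allow_fp16_qk_reduction=True
)

# The reduced-precision path should stay close to the fp32 baseline.
print(torch.max(torch.abs(out_fp32 - out_fp16_qk)))
```

The post's rationale: each q·kᵀ dot product sums only `head_dim` (e.g. 128) fp16 products, so the partial sums stay within fp16's range, while the softmax-weighted score·v accumulation keeps fp32.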
