Closed
Commits
36 commits
09b012e
[Draft] DeepGEMM Blackwell integration
Barry-Delaney Jul 13, 2025
ec400ab
Clean up fused_moe_deepgemm.py
Barry-Delaney Jul 13, 2025
d9a85ac
Moving permute space allocation to GPU
Barry-Delaney Jul 15, 2025
7c4045c
optimize padding in deepgemm moe.
lfr-0531 Jul 16, 2025
20b2592
add torch compile to per_token_cast_to_fp8_e8m0 and rm the two sync.
lfr-0531 Jul 16, 2025
c74a31a
Improve bmm.
yuxianq Jul 16, 2025
d3e1797
Online resmooth for fp8 checkpoint on Blackwell. (#2)
yuxianq Jul 16, 2025
d83cc25
Fix OOM issue for fp8 resmooth. (#4)
yuxianq Jul 17, 2025
e1e96fd
Enbale masked grouped GEMM (#5)
Barry-Delaney Jul 18, 2025
09b0465
Pin DeepGEMM's version to commit cc416ee. (#6)
yuxianq Jul 18, 2025
35b4e23
Improve resmooth. (#7)
yuxianq Jul 18, 2025
dce291f
Add compile for quantization kernels (#8)
Barry-Delaney Jul 18, 2025
b3ab47d
Move SF transform to TRTLLM (#11)
Barry-Delaney Jul 21, 2025
65d05d6
Use local barrier to avoid multi-node hang issue. (#12)
yuxianq Jul 21, 2025
d65bdac
optimize the masked index copy and index gather (#13)
lfr-0531 Jul 21, 2025
0af69ac
Fix adp for deepgemm moe backend (#10)
zongfeijing Jul 21, 2025
6f431f6
Use DeepGEMM main branch instead.
yuxianq Jul 21, 2025
481fd50
Revert "Use DeepGEMM main branch instead."
Barry-Delaney Jul 21, 2025
ab7175f
Use DeepGEMM main branch and disable ue8m0 cast. (#16)
yuxianq Jul 21, 2025
97a21fd
fuse maskec index_copy and grouped fp8 quantization.
lfr-0531 Jul 21, 2025
f668fa7
fix quantization accuracy issue.
lfr-0531 Jul 21, 2025
c6b8985
Fuse swiglu and quant 2 (#18)
Barry-Delaney Jul 21, 2025
11053b7
Opt gather kernel (#19)
zongfeijing Jul 24, 2025
0173836
optimize the perf of masked_index_copy_group_quant_fp8.
lfr-0531 Jul 23, 2025
bd94e37
fix duplicate load.
lfr-0531 Jul 23, 2025
f1d3115
fuse scaling factor transform to _masked_index_copy_group_quant_fp8.
lfr-0531 Jul 24, 2025
acd4381
fix.
lfr-0531 Jul 24, 2025
2d5beab
add another for loop on the group dim.
lfr-0531 Jul 24, 2025
5653eea
Remove SFB transform from forward process (#23)
Barry-Delaney Jul 25, 2025
49dcb98
change deepgeem to a new commit that with torch dependency. (#24)
lfr-0531 Jul 25, 2025
9997006
fix format and rebase bug.
lfr-0531 Jul 25, 2025
d8ae02c
fix dummy requests when estimate kv cache with attention DP enabled t…
lfr-0531 Jul 28, 2025
fb3e467
Fuse quantize and transform e8m0 scales (#26)
Barry-Delaney Jul 28, 2025
9107cfa
Revert "Fuse quantize and transform e8m0 scales (#26)" (#27)
Barry-Delaney Jul 28, 2025
59b3957
Fix CI install error for DeepGEMM. (#28)
yuxianq Jul 28, 2025
3c413be
Reapply "Fuse quantize and transform e8m0 scales (#26)" (#27) (#29)
Barry-Delaney Jul 28, 2025
fix quantization accuracy issue.
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
lfr-0531 committed Jul 25, 2025
commit f668fa78ba5b3c0a78a4e893f6c4cec2813afb1b
@@ -57,10 +57,10 @@ def _masked_index_copy_group_quant_fp8(
 # quant
 _absmax = tl.maximum(tl.max(tl.abs(input)), eps)
 output_s = _absmax / 448.0
+output_s = tl.exp2(tl.ceil(tl.log2(tl.abs(output_s))))
 output_s_inv = 1.0 / output_s
 output_q = tl.clamp(input * output_s_inv, -448.0,
                     448.0).to(out_q_ptr.dtype.element_ty)
-output_s = tl.exp2(tl.ceil(tl.log2(tl.abs(output_s))))
 
 # write output
 s_dim_size = dim_size // group_size
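Reading the hunk as moving the power-of-two rounding of `output_s` ahead of the `output_s_inv` computation (which the commit title suggests), the effect can be illustrated outside Triton with a minimal plain-Python sketch. The helper names `quantize_round_first` and `quantize_round_last` are hypothetical, and the final cast to FP8 is deliberately omitted, so this only shows the scale mismatch and not FP8 rounding error.

```python
import math

FP8_MAX = 448.0  # same constant the kernel uses for the clamp range


def _raw_scale(group, eps=1e-10):
    # Per-group scale before rounding: absmax / 448, floored by eps.
    absmax = max(max(abs(x) for x in group), eps)
    return absmax / FP8_MAX


def _round_up_pow2(scale):
    # Mirrors tl.exp2(tl.ceil(tl.log2(tl.abs(scale)))).
    return 2.0 ** math.ceil(math.log2(abs(scale)))


def _clamp(x):
    return min(max(x, -FP8_MAX), FP8_MAX)


def quantize_round_first(group):
    # Round the scale first, then quantize with the rounded scale.
    scale = _round_up_pow2(_raw_scale(group))
    inv = 1.0 / scale
    return [_clamp(x * inv) for x in group], scale


def quantize_round_last(group):
    # Quantize with the unrounded scale, but store the rounded one.
    raw = _raw_scale(group)
    inv = 1.0 / raw
    return [_clamp(x * inv) for x in group], _round_up_pow2(raw)


def dequantize(values, scale):
    return [v * scale for v in values]


group = [0.5, -3.0, 1.25, 2.0]
recon_first = dequantize(*quantize_round_first(group))
recon_last = dequantize(*quantize_round_last(group))
```

With round-first ordering, the stored scale is the one actually used to quantize, so dequantizing reproduces the inputs (up to the FP8 cast this sketch leaves out). With round-last ordering, the values are quantized with one scale and dequantized with another, so every element is off by the ratio between the rounded and unrounded scales.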