Description
First of all, thank you for the amazing work on DeepGEMM — it's been extremely helpful.
While integrating DeepGEMM into a backward pass implementation, I encountered a reproducible crash when running the k-grouped FP8 GEMM with N = 768.
❗ Error
Running DeepGEMM with the following shape causes a CUDA illegal instruction error:
RuntimeError: CUDA driver error (csrc/apis/../jit_kernels/impls/sm90_fp8_gemm_1d1d.hpp:65): 715
(CUDA_ERROR_ILLEGAL_INSTRUCTION, an illegal instruction was encountered)
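For readers unfamiliar with the numeric code in the message, here is a minimal sketch that maps a few CUDA driver error codes to their symbolic names. The table is deliberately partial and the helper name `describe_cuda_error` is hypothetical (it is not part of DeepGEMM or PyTorch); the values themselves come from the CUDA driver API's `CUresult` enum.

```python
# Partial lookup of CUDA driver API result codes (CUresult).
# Only a handful of launch-failure codes are listed; this is not exhaustive.
CUDA_DRIVER_ERRORS = {
    700: "CUDA_ERROR_ILLEGAL_ADDRESS",
    715: "CUDA_ERROR_ILLEGAL_INSTRUCTION",
    716: "CUDA_ERROR_MISALIGNED_ADDRESS",
    719: "CUDA_ERROR_LAUNCH_FAILED",
}


def describe_cuda_error(code: int) -> str:
    """Return the symbolic name for a driver error code, if it is in the table."""
    return CUDA_DRIVER_ERRORS.get(code, f"unlisted CUDA driver error ({code})")


# The code reported in the traceback above.
print(describe_cuda_error(715))
```

The lookup only confirms that 715 is the illegal-instruction code quoted in the traceback; it says nothing about which instruction in the kernel is at fault.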
🔁 Reproduction
The issue reproduces consistently by adding the following shape to enumerate_k_grouped_contiguous:
(128, 2048, 768, 4096)
This triggers the following call path:
- k_grouped_fp8_gemm_nt_contiguous
- → FP8 kernel selection
- → SM90 kernel dispatch
- → crash with CUDA illegal instruction
Notably, the same configuration works correctly when N = 1536, so the issue appears to be specific to N = 768.
🧩 Expected Behavior
The kernel should run successfully for (groups=128, M=2048, N=768, K=4096) without causing an illegal instruction.
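To make the two configurations easy to compare, here is a small sketch that names the fields of the shape tuple, following the (groups, M, N, K) reading used in this report. The `KGroupedShape` class is purely illustrative and is not a DeepGEMM type; it does not call the library.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class KGroupedShape:
    """Illustrative container for one k-grouped test shape (not a DeepGEMM type)."""
    groups: int
    m: int
    n: int
    k: int

    @classmethod
    def from_tuple(cls, shape):
        # Tuple order assumed to match this report: (groups, M, N, K).
        groups, m, n, k = shape
        return cls(groups, m, n, k)


# Shape that crashes with the illegal instruction error.
failing = KGroupedShape.from_tuple((128, 2048, 768, 4096))

# Same shape with only N changed, which runs correctly.
working = replace(failing, n=1536)

print("failing:", failing)
print("working:", working)
```

The two shapes differ only in N, which is why the problem looks specific to the N = 768 case.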
🧪 Environment (if helpful)
- GPU: H100 (SM90)
- CUDA Toolkit: CUDA 12.9
- Driver Version: 535.161.08
- PyTorch version: 2.8.0
🙏 Additional Notes
If you need further logs or want me to test a patch, I’m happy to help.
Thanks again for the excellent work on DeepGEMM!