Conversation

@SohamGovande (Contributor) commented Mar 5, 2025

Summary

This 4-line code change brings the transposed matmul kernel (H100_mma_ABt) to performance parity with the non-transposed kernel (H100_mma) by dispatching the largest available tensor core instruction (a wgmma of size 64x16x256). Previously, the transposed kernel ran approximately 75-80 TFLOPS slower than its non-transposed counterpart.
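
To illustrate why this is such a small diff, here is a minimal sketch of the idea (the constant names are illustrative, not the actual ThunderKittens code): the wgmma tile shape is a compile-time constant, so widening the instruction is essentially a one-line change per call site, and the wider shape covers the same output tile with 4x fewer instruction issues.

```cpp
#include <cstdio>

// Hypothetical tile-shape constants; the real kernel encodes these in its
// wgmma operand descriptors.
constexpr int WGMMA_OLD[3] = {64, 16, 64};   // previous instruction shape
constexpr int WGMMA_NEW[3] = {64, 16, 256};  // largest available shape

int main() {
    // With the wider instruction, each warpgroup issues 4x fewer wgmma ops
    // to cover the same output tile (256 / 64 = 4).
    std::printf("instructions per tile reduced by %dx\n",
                WGMMA_NEW[2] / WGMMA_OLD[2]);
}
```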

Changes

  • Changed wgmma instruction size from 64x16x64 to 64x16x256.
  • Fixed the strides for the column-major B tensor, ensuring correctness when N != K and resolving the earlier correctness-check failures (see the sketch after this list).
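
A sketch of the stride logic the fix addresses (illustrative, not the PR's exact code): for C = A * B^T with B stored as an N x K row-major matrix, element B^T[k][n] = B[n][k] lives at offset n * K + k. Using N as the row stride here is only wrong when N != K, which is why square benchmarks masked the bug.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical helper: offset of B^T[k][n] in a row-major N x K buffer for B.
// The leading dimension must be K (B's row length), not N.
inline std::size_t bt_offset(std::size_t n, std::size_t k, std::size_t ldb) {
    return n * ldb + k;
}

int main() {
    const std::size_t N = 4096, K = 8192;
    // Correct: stride along B's rows is K. Substituting N here would index
    // the wrong elements whenever N != K, the failure mode this PR fixes.
    std::printf("B^T[1][1] offset: %zu\n", bt_offset(1, 1, K));
    (void)N;
}
```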

Benchmark Changes

  • Updated the benchmark dimensions from square (N=4096) to rectangular (M=2048, N=4096, K=8192) to exercise and validate both performance and correctness on non-square inputs; the FLOP accounting is sketched below.
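
For reference, this is how the reported TFLOPS figures follow from the new rectangular shape (the harness details are assumed; the formula is the standard GEMM FLOP count):

```cpp
#include <cstdio>

int main() {
    const double M = 2048, N = 4096, K = 8192;
    const double flops = 2.0 * M * N * K;  // one FMA = 2 FLOPs per MAC
    const double seconds = 1.0e-3;         // example: 1 ms per GEMM
    std::printf("%.1f TFLOPS\n", flops / seconds / 1e12);  // ~137.4 TFLOPS
}
```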

Testing

Verified correctness and the performance improvements through internal benchmarks; results were stable and at parity with H100_mma.
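
The actual harness is internal, but a minimal sketch of the kind of check involved is a comparison against a naive reference under a tolerance suited to fp16 accumulation (function name and tolerances are illustrative):

```cpp
#include <cmath>
#include <vector>

// Returns true if every element of the kernel output matches the reference
// within a combined absolute/relative tolerance.
bool allclose(const std::vector<float>& out, const std::vector<float>& ref,
              float rtol = 1e-2f, float atol = 1e-3f) {
    for (std::size_t i = 0; i < out.size(); ++i)
        if (std::fabs(out[i] - ref[i]) > atol + rtol * std::fabs(ref[i]))
            return false;
    return true;
}
```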

@DanFu09 DanFu09 merged commit 419d813 into HazyResearch:main Jun 16, 2025