
Conversation

@am17an am17an commented Dec 10, 2025

Currently WIP, trying to add native FP4 support for Blackwell and beyond. To compile, `-DCMAKE_CUDA_ARCHITECTURES="120a"` is required.

Blackwell has an m16n8k64 instruction for 4-bit types (mxfp4, nvfp4 and int4) which advertises 2x throughput compared to the int8 tensor cores. However, at the moment this PR is ~~10% slower than master~~ 25% faster than master on PP. The other issue is that we quantize the activations to mxfp4 instead of q8, which leads to failures in test-backend-ops; PPL tests are okay with this change, though correctness issues are not ruled out.

TODO:

  • Figure out why we don't see better results
  • Address the NMSE error between q8_0 and mxfp4

Master

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp512 | 10564.11 ± 81.35 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp1024 | 10766.92 ± 72.32 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp2048 | 10893.41 ± 54.63 |

This PR:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp512 | 12833.61 ± 83.54 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp1024 | 13006.15 ± 75.36 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp2048 | 13258.88 ± 25.34 |

Note: This PR was developed on @JohannesGaessler's server with a 5090 provided by NVIDIA. So thanks to them both!

@github-actions github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Dec 10, 2025
@am17an am17an marked this pull request as draft December 10, 2025 10:59

easyfab commented Dec 10, 2025

Nice speedup!

Master:

Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 0 | pp512 | 5614.78 ± 40.21 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 0 | pp2048 | 4729.89 ± 10.28 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 0 | tg128 | 204.28 ± 0.53 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp512 | 6460.61 ± 65.01 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp2048 | 6624.29 ± 24.83 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg128 | 221.47 ± 0.25 |

PR:

Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 0 | pp512 | 6473.65 ± 37.97 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 0 | pp2048 | 5346.78 ± 4.23 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 0 | tg128 | 205.29 ± 0.30 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp512 | 7754.67 ± 53.15 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp2048 | 7917.86 ± 20.30 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg128 | 221.23 ± 0.21 |

Comment on lines 740 to 744
if (sign > 0.0f) {
return static_cast<uint8_t>(best_i); // 0..7
} else {
return static_cast<uint8_t>(best_i | 0x8); // 8..15
}

I think it would be slightly more optimal to extract the sign bit from x, do a bit shift, and a logical and.

More generally, there are FP4 conversion intrinsics in the CUDA math API but I'm not sure whether they would be of use.

Comment on lines 824 to 827
x_qs[i * MMQ_MMA_TILE_X_K_FP4 + k0 + 0] = compress(aux_q4[1]) << 16 | compress(aux_q4[0]);
x_qs[i * MMQ_MMA_TILE_X_K_FP4 + k0 + 1] = compress(aux_q4[3]) << 16 | compress(aux_q4[2]);
x_qs[i * MMQ_MMA_TILE_X_K_FP4 + k0 + 2] = compress(aux_q4[1] >> 4) << 16 | compress(aux_q4[0] >> 4);
x_qs[i * MMQ_MMA_TILE_X_K_FP4 + k0 + 3] = compress(aux_q4[3] >> 4) << 16 | compress(aux_q4[2] >> 4);

At this point in the code you should be suffering from a 4-way shared memory bank conflict.
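
For what it's worth, one reading of that remark (assuming MMQ_TILE_NE_K == 32 and QI8_0 == 8, so MMQ_MMA_TILE_X_K_FP4 works out to 36 ints): since 36 ≡ 4 (mod 32), every int a warp writes in one of these stores lands on one of only 8 of the 32 shared-memory banks, so 32 lanes serialize into 4-way conflicts. A small host-side check of that arithmetic, with an assumed lane-to-(row, block) mapping, purely as an illustration:

```cpp
// Host-side sketch (illustrative, with an assumed thread->(row, kbx) mapping):
// count how many lanes of one warp hit each shared-memory bank for a single
// 32-bit store of the form x_qs[i*STRIDE + kbx*4 + c].
#include <cstdio>

int main() {
    const int STRIDE = 36;            // assumed: MMQ_TILE_NE_K (32) + MMQ_TILE_NE_K/QI8_0 (4)
    const int c      = 0;             // which of the four stores; the result is the same for 0..3
    int hits[32] = {0};
    for (int t = 0; t < 32; ++t) {    // one warp
        const int i   = t / 8;        // assumed mapping of lanes to rows
        const int kbx = t % 8;        // and to 16-byte blocks within a row
        hits[(i * STRIDE + kbx * 4 + c) % 32]++;
    }
    int worst = 0;
    for (int b = 0; b < 32; ++b) { if (hits[b] > worst) worst = hits[b]; }
    printf("worst-case conflict: %d-way\n", worst);   // prints 4-way for STRIDE = 36
    return 0;
}
```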

return 0;
}

const uint8_t sign_bit = x < 0.0f ? 0x8 : 0;

I don't know if the compiler is smart enough to do this optimization, but what I meant was to transplant the sign bit directly, without any conditional statements at all: cast the float to an unsigned integer, shift 28 bits to the right, and apply & 0x8.
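
A minimal sketch of that idea (the helper name is mine, not the PR's):

```cpp
#include <cstdint>

// Branch-free sign extraction: the IEEE-754 sign bit is bit 31, so shifting
// the raw bits right by 28 moves it to bit 3, which is where the E2M1 sign
// bit lives; masking with 0x8 keeps only that bit.
static __device__ __forceinline__ uint8_t fp4_sign_bit(const float x) {
    return (uint8_t) ((__float_as_uint(x) >> 28) & 0x8);
}
```

The quantized nibble would then be `fp4_sign_bit(x) | best_i`, replacing the `sign > 0.0f` branch quoted above.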

Comment on lines +722 to +726
// Saturate to max representable magnitude
if (ax > pos_lut[7]) {
ax = pos_lut[7];
}


Suggested change (remove these lines):
// Saturate to max representable magnitude
if (ax > pos_lut[7]) {
ax = pos_lut[7];
}

It should be fine to remove this: values > 6 will automatically use the last LUT entry, since it has the smallest error.
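
To spell the reasoning out, here is a standalone sketch of a nearest-value search over the eight positive E2M1 magnitudes (the helper name is mine; only `pos_lut` is taken from the PR's snippet above):

```cpp
// For ax > 6.0f the error |ax - pos_lut[i]| is minimized at i == 7, so the
// search saturates on its own and the explicit clamp is redundant.
static __device__ __forceinline__ int fp4_best_index(const float ax) {
    const float pos_lut[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};
    int   best_i   = 0;
    float best_err = fabsf(ax - pos_lut[0]);
    for (int i = 1; i < 8; ++i) {
        const float err = fabsf(ax - pos_lut[i]);
        if (err < best_err) {
            best_err = err;
            best_i   = i;
        }
    }
    return best_i;  // e.g. fp4_best_index(100.0f) == 7, same result as with the clamp
}
```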

}

#define MMQ_MMA_TILE_X_K_Q8_0 (2*MMQ_TILE_NE_K + 2*MMQ_TILE_NE_K/QI8_0 + 4)
#define MMQ_MMA_TILE_X_K_FP4 (MMQ_TILE_NE_K + MMQ_TILE_NE_K / QI8_0)

The resulting value is correct; I just don't think you should calculate it like this, since it will be confusing. It would be better to use something like MMQ_TILE_NE_K + 4, though ideally you would replace the hardcoded value with something that indicates where it comes from.
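
Purely as a sketch of that suggestion (the trailing comment is a guess at what the extra 4 ints hold, not taken from the PR):

```cpp
// Hypothetical rewrite of the define; the +4 should be documented with
// whatever the 4 extra ints actually are (e.g. per-row scale data or padding).
#define MMQ_MMA_TILE_X_K_FP4 (MMQ_TILE_NE_K + 4)
```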

case GGML_TYPE_MXFP4: return MMQ_MMA_TILE_X_K_FP4;
#else
case GGML_TYPE_MXFP4: return MMQ_MMA_TILE_X_K_Q8_1;
#endif

Suggested change: replace `#endif` with `#endif // BLACKWELL_MMA_AVAILABLE`.

Comment on lines +808 to +809
const int k0 = kbx * 4;
memcpy(x_qs + i * MMQ_MMA_TILE_X_K_FP4 + k0, bxi->qs, 16);

This needs a comment mentioning that the data is permuted vs. the q8_0 path and that this is handled via permutation in quantize_mmq_mxfp4.

}

offset_y += (col_low + jt*mmq_x)*(sizeof(block_q8_1_mmq)/sizeof(int));
constexpr size_t sz = type == GGML_TYPE_MXFP4 ? sizeof(block_fp4_mmq) : sizeof(block_q8_1_mmq);

This also needs a check for BLACKWELL_MMA_AVAILABLE.
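
A rough sketch of such a guard, assuming BLACKWELL_MMA_AVAILABLE is a preprocessor macro as in the #else/#endif block above (not the PR's actual code):

```cpp
// Hypothetical: only use the FP4 block layout when the Blackwell MMA path is
// compiled in; otherwise keep the q8_1 layout.
#if defined(BLACKWELL_MMA_AVAILABLE)
    constexpr size_t sz = type == GGML_TYPE_MXFP4 ? sizeof(block_fp4_mmq) : sizeof(block_q8_1_mmq);
#else
    constexpr size_t sz = sizeof(block_q8_1_mmq);
#endif // BLACKWELL_MMA_AVAILABLE
```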

}

offset_y += (col_low + jt*mmq_x)*(sizeof(block_q8_1_mmq)/sizeof(int));
constexpr size_t sz = type == GGML_TYPE_MXFP4 ? sizeof(block_fp4_mmq) : sizeof(block_q8_1_mmq);

Same as above.

Comment on lines +127 to +150
const uint8_t q_lo_0 = __shfl_sync(0xFFFFFFFF, q_val, base, WARP_SIZE);
const uint8_t q_lo_1 = __shfl_sync(0xFFFFFFFF, q_val, base + 1, WARP_SIZE);
const uint8_t q_hi_0 = __shfl_sync(0xFFFFFFFF, q_val, base + 16, WARP_SIZE);
const uint8_t q_hi_1 = __shfl_sync(0xFFFFFFFF, q_val, base + 17, WARP_SIZE);

This needs a comment to explain the permutation.
