
Conversation

@am17an am17an commented Dec 10, 2025

Currently WIP, trying to add native FP4 support for Blackwell and beyond. To compile, `-DCMAKE_CUDA_ARCHITECTURES="120a"` is required.

Blackwell has an m16n8k64 instruction for 4-bit types (mxfp4, nvfp4 and int4) which advertises 2x throughput compared to the int8 tensor cores. However, at the moment this PR is ~~10% slower than master~~ 25% faster than master on PP. The other issue is that we quantize the activations to mxfp4 instead of q8, which leads to failures in test-backend-ops; PPL tests are okay with this change, though correctness issues are not ruled out.

TODO:

  • Figure out why we don't see better results
  • Address the NMSE error between q8_0 and mxfp4

Master

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp512 | 10564.11 ± 81.35 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp1024 | 10766.92 ± 72.32 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp2048 | 10893.41 ± 54.63 |

This PR:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp512 | 12833.61 ± 83.54 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp1024 | 13006.15 ± 75.36 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp2048 | 13258.88 ± 25.34 |

Note: This PR was developed on @JohannesGaessler's server with a 5090 provided by NVIDIA. So thanks to them both!

@github-actions github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Dec 10, 2025
@am17an am17an marked this pull request as draft December 10, 2025 10:59

easyfab commented Dec 10, 2025

Nice speedup!

Master:

Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 0 | pp512 | 5614.78 ± 40.21 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 0 | pp2048 | 4729.89 ± 10.28 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 0 | tg128 | 204.28 ± 0.53 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp512 | 6460.61 ± 65.01 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp2048 | 6624.29 ± 24.83 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg128 | 221.47 ± 0.25 |

PR:

Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 0 | pp512 | 6473.65 ± 37.97 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 0 | pp2048 | 5346.78 ± 4.23 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 0 | tg128 | 205.29 ± 0.30 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp512 | 7754.67 ± 53.15 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp2048 | 7917.86 ± 20.30 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg128 | 221.23 ± 0.21 |

Comment on lines 740 to 744
if (sign > 0.0f) {
return static_cast<uint8_t>(best_i); // 0..7
} else {
return static_cast<uint8_t>(best_i | 0x8); // 8..15
}

I think it would be slightly more optimal to extract the sign bit from x, do a bit shift, and a logical and.

More generally, there are FP4 conversion intrinsics in the CUDA math API but I'm not sure whether they would be of use.

Comment on lines 824 to 827
x_qs[i * MMQ_MMA_TILE_X_K_FP4 + k0 + 0] = compress(aux_q4[1]) << 16 | compress(aux_q4[0]);
x_qs[i * MMQ_MMA_TILE_X_K_FP4 + k0 + 1] = compress(aux_q4[3]) << 16 | compress(aux_q4[2]);
x_qs[i * MMQ_MMA_TILE_X_K_FP4 + k0 + 2] = compress(aux_q4[1] >> 4) << 16 | compress(aux_q4[0] >> 4);
x_qs[i * MMQ_MMA_TILE_X_K_FP4 + k0 + 3] = compress(aux_q4[3] >> 4) << 16 | compress(aux_q4[2] >> 4);

At this point in the code you should be suffering from a 4-way shared memory bank conflict.
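
For what it's worth, one reading of that remark (assuming MMQ_TILE_NE_K == 32 and QI8_0 == 8, so MMQ_MMA_TILE_X_K_FP4 works out to 36 ints): since 36 ≡ 4 (mod 32), every int a warp writes in one of these stores lands on one of only 8 of the 32 shared-memory banks, so 32 lanes serialize into 4-way conflicts. A small host-side check of that arithmetic, with an assumed lane-to-(row, block) mapping, purely as an illustration:

```cpp
// Host-side sketch (illustrative, with an assumed thread->(row, kbx) mapping):
// count how many lanes of one warp hit each shared-memory bank for a single
// 32-bit store of the form x_qs[i*STRIDE + kbx*4 + c].
#include <cstdio>

int main() {
    const int STRIDE = 36;            // assumed: MMQ_TILE_NE_K (32) + MMQ_TILE_NE_K/QI8_0 (4)
    const int c      = 0;             // which of the four stores; the result is the same for 0..3
    int hits[32] = {0};
    for (int t = 0; t < 32; ++t) {    // one warp
        const int i   = t / 8;        // assumed mapping of lanes to rows
        const int kbx = t % 8;        // and to 16-byte blocks within a row
        hits[(i * STRIDE + kbx * 4 + c) % 32]++;
    }
    int worst = 0;
    for (int b = 0; b < 32; ++b) { if (hits[b] > worst) worst = hits[b]; }
    printf("worst-case conflict: %d-way\n", worst);   // prints 4-way for STRIDE = 36
    return 0;
}
```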

return 0;
}

const uint8_t sign_bit = x < 0.0f ? 0x8 : 0;

I don't know if the compiler is smart enough to do this optimization, but what I meant was to transplant the sign bit directly, without any conditional statements at all: cast the float to an unsigned integer, shift 28 bits to the right, and apply & 0x8.
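
A minimal sketch of that idea (the helper name is mine, not the PR's):

```cpp
#include <cstdint>

// Branch-free sign extraction: the IEEE-754 sign bit is bit 31, so shifting
// the raw bits right by 28 moves it to bit 3, which is where the E2M1 sign
// bit lives; masking with 0x8 keeps only that bit.
static __device__ __forceinline__ uint8_t fp4_sign_bit(const float x) {
    return (uint8_t) ((__float_as_uint(x) >> 28) & 0x8);
}
```

The quantized nibble would then be `fp4_sign_bit(x) | best_i`, replacing the `sign > 0.0f` branch quoted above.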

Comment on lines +722 to +726
// Saturate to max representable magnitude
if (ax > pos_lut[7]) {
ax = pos_lut[7];
}


Suggested change (remove these lines):
// Saturate to max representable magnitude
if (ax > pos_lut[7]) {
ax = pos_lut[7];
}

It should be fine to remove this: values > 6 will automatically use the last LUT entry, since it has the smallest error.
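
To spell the reasoning out, here is a standalone sketch of a nearest-value search over the eight positive E2M1 magnitudes (the helper name is mine; only `pos_lut` is taken from the PR's snippet above):

```cpp
// For ax > 6.0f the error |ax - pos_lut[i]| is minimized at i == 7, so the
// search saturates on its own and the explicit clamp is redundant.
static __device__ __forceinline__ int fp4_best_index(const float ax) {
    const float pos_lut[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};
    int   best_i   = 0;
    float best_err = fabsf(ax - pos_lut[0]);
    for (int i = 1; i < 8; ++i) {
        const float err = fabsf(ax - pos_lut[i]);
        if (err < best_err) {
            best_err = err;
            best_i   = i;
        }
    }
    return best_i;  // e.g. fp4_best_index(100.0f) == 7, same result as with the clamp
}
```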

}

#define MMQ_MMA_TILE_X_K_Q8_0 (2*MMQ_TILE_NE_K + 2*MMQ_TILE_NE_K/QI8_0 + 4)
#define MMQ_MMA_TILE_X_K_FP4 (MMQ_TILE_NE_K + MMQ_TILE_NE_K / QI8_0)

The resulting value is correct; I just don't think you should calculate it like this, since it will be confusing. It would be better to use something like MMQ_TILE_NE_K + 4, though ideally you would replace the hardcoded value with something that indicates where it comes from.
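
Purely as a sketch of that suggestion (the trailing comment is a guess at what the extra 4 ints hold, not taken from the PR):

```cpp
// Hypothetical rewrite of the define; the +4 should be documented with
// whatever the 4 extra ints actually are (e.g. per-row scale data or padding).
#define MMQ_MMA_TILE_X_K_FP4 (MMQ_TILE_NE_K + 4)
```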

case GGML_TYPE_MXFP4: return MMQ_MMA_TILE_X_K_FP4;
#else
case GGML_TYPE_MXFP4: return MMQ_MMA_TILE_X_K_Q8_1;
#endif

Suggested change: replace `#endif` with `#endif // BLACKWELL_MMA_AVAILABLE`.

Comment on lines +808 to +809
const int k0 = kbx * 4;
memcpy(x_qs + i * MMQ_MMA_TILE_X_K_FP4 + k0, bxi->qs, 16);

This needs a comment mentioning that the data is permuted vs. the q8_0 path and that this is handled via permutation in quantize_mmq_mxfp4.

}

offset_y += (col_low + jt*mmq_x)*(sizeof(block_q8_1_mmq)/sizeof(int));
constexpr size_t sz = type == GGML_TYPE_MXFP4 ? sizeof(block_fp4_mmq) : sizeof(block_q8_1_mmq);

This also needs a check for BLACKWELL_MMA_AVAILABLE.
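
A rough sketch of such a guard, assuming BLACKWELL_MMA_AVAILABLE is a preprocessor macro as in the #else/#endif block above (not the PR's actual code):

```cpp
// Hypothetical: only use the FP4 block layout when the Blackwell MMA path is
// compiled in; otherwise keep the q8_1 layout.
#if defined(BLACKWELL_MMA_AVAILABLE)
    constexpr size_t sz = type == GGML_TYPE_MXFP4 ? sizeof(block_fp4_mmq) : sizeof(block_q8_1_mmq);
#else
    constexpr size_t sz = sizeof(block_q8_1_mmq);
#endif // BLACKWELL_MMA_AVAILABLE
```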

}

offset_y += (col_low + jt*mmq_x)*(sizeof(block_q8_1_mmq)/sizeof(int));
constexpr size_t sz = type == GGML_TYPE_MXFP4 ? sizeof(block_fp4_mmq) : sizeof(block_q8_1_mmq);

Same as above.

Comment on lines +127 to +150
const uint8_t q_lo_0 = __shfl_sync(0xFFFFFFFF, q_val, base, WARP_SIZE);
const uint8_t q_lo_1 = __shfl_sync(0xFFFFFFFF, q_val, base + 1, WARP_SIZE);
const uint8_t q_hi_0 = __shfl_sync(0xFFFFFFFF, q_val, base + 16, WARP_SIZE);
const uint8_t q_hi_1 = __shfl_sync(0xFFFFFFFF, q_val, base + 17, WARP_SIZE);

This needs a comment to explain the permutation.
