
Conversation

@chraac (Contributor) commented Dec 10, 2025

Changes

  • Q4_0 Dot Product Optimization:
    • Implemented hvx_vec_load_and_mul_d_rx2 and hvx_vec_load_and_mul_d_r2x2 helper functions to streamline vector loading and multiplication.
    • Refactored vec_dot_q4x4x2_q8x4x2_rx2 and vec_dot_q8x4x2_q8x4x2_rx2 to improve instruction pipelining and reduce overhead in the main loops.

Performance

The following performance comparison shows significant improvements for MUL_MAT(type_a=q4_0, type_b=f32) across various batch sizes (n), with ~30% speedup observed for n >= 2.

Device: 8Gen3
Baseline: 4d3726278
Current: 00d5fb31b

| Operation (q4_0) | Baseline (GFLOPS) | Current (GFLOPS) | Speedup |
|---|---|---|---|
| n=2 | 238.32 | 316.59 | +32% |
| n=3 | 242.05 | 323.53 | +33% |
| n=4 | 244.17 | 327.72 | +34% |
| n=5 | 245.33 | 329.64 | +34% |
| n=8 | 247.10 | 333.06 | +34% |

}

HVX_Vector_x4 r_dd =
    hvx_vec_load_and_mul_d_r2x2(r0_x_d + i * x_dblk_size, r1_x_d + i * x_dblk_size, y_d + i * y_dblk_size);
@chraac (Contributor, Author) commented Dec 10, 2025

Optimized the scale multiplication step. The previous implementation only processed 32xf16 elements (half the vector width). This change enables 64xf16 multiplication to fully utilize the HVX vector capacity.
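
For reference, a minimal sketch of the full-width idea, assuming the Hexagon SDK HVX intrinsics (Q6_Wqf32_vmpy_VhfVhf and friends). The result struct, the function name mul_scales_64xhf and the load pattern here are illustrative only, not the PR's actual hvx_vec_load_and_mul_d_r2x2:

```c
#include <stdint.h>
#include <hexagon_types.h>
#include <hvx_hexagon_protos.h>

/* Illustrative sketch: one full-width (64 x fp16) scale multiply. */
typedef struct {
    HVX_Vector lo;   /* 32 x qf32 */
    HVX_Vector hi;   /* 32 x qf32 */
} qf32x2_t;

static inline qf32x2_t mul_scales_64xhf(const uint8_t * x_d, const uint8_t * y_d) {
    /* Unaligned loads: the scales are not guaranteed to sit on a 128-byte
     * boundary (see the alignment discussion further down). */
    HVX_Vector vx = *(const HVX_UVector *) x_d;   /* 64 fp16 scales from src0 */
    HVX_Vector vy = *(const HVX_UVector *) y_d;   /* 64 fp16 scales from src1 */

    /* A single vmpy covers all 64 fp16 lanes and yields 64 qf32 products as a
     * vector pair; the point of the change is to use both halves of that pair
     * instead of discarding the upper one. */
    HVX_VectorPair prod = Q6_Wqf32_vmpy_VhfVhf(vx, vy);

    qf32x2_t out = { Q6_V_lo_W(prod), Q6_V_hi_W(prod) };
    return out;
}
```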

@max-krasnyansky (Collaborator) commented:

I'm getting garbled output for all models.
Also, ultimately we end up with the INT32 accumulator for each block (32 elements).
In order to multiply it with the FP16 scale we need to convert both (accumulator and scale) into FP32 (QF32). This means that we still need to do the same number of multiplies and use the same number of HVX registers either way.
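
A scalar reference of the per-block math, just to make that conversion step explicit (illustrative only; in the real kernel the scales arrive as fp16 and the accumulator lives in an HVX register):

```c
#include <stdint.h>

/* Scalar reference for one q4_0 x q8 block (32 elements). q4[i] are the 4-bit
 * weights already unpacked to [-8, 7], q8[i] are int8 activations, and d_x/d_y
 * are the per-block scales already converted from fp16 to fp32. */
static inline float block_dot_q4_0_q8(const int8_t * q4, const int8_t * q8,
                                      float d_x, float d_y) {
    int32_t acc = 0;    /* integer accumulator for the 32 products */
    for (int i = 0; i < 32; ++i) {
        acc += (int32_t) q4[i] * (int32_t) q8[i];
    }
    /* The integer sum and both scales have to meet in floating point before the
     * final multiply -- this is the int32 -> fp32 (QF32) conversion being
     * discussed here. */
    return (float) acc * d_x * d_y;
}
```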

@chraac (Contributor, Author) commented:

> I'm getting garbled output for all models.

Reverted scale loading to handle unaligned scales, as alignment cannot be ensured for all tensor shapes.

I think this resolves the garbled output issues. Tested on:

  • llama3-1b: log
  • qwen3-1.7b: log

@chraac (Contributor, Author) commented:

> Also, ultimately we end up with the INT32 accumulator for each block (32 elements).
> In order to multiply it with the FP16 scale we need to convert both (accumulator and scale) into FP32 (QF32).

  • Regarding the scales utilization: The original source uses 2 Q6_Wqf32_vmpy_VhfVhf instructions for 2 rows but ignores the upper half. This PR aims to fully utilize the results of both multiplications.

  • As for the accumulator width: For Q4_0, an INT32 accumulator is likely excessive. Since src0 (4-bit) * src1 (8-bit) fits in 12 bits, accumulating 32 elements only requires 17 bits total. A 32-bit accumulator is far larger than what is strictly required.
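
A quick worst-case check of those bounds (illustrative only, not part of this PR):

```c
#include <assert.h>
#include <stdint.h>

int main(void) {
    /* q4_0 weights after the -8 offset lie in [-8, 7]; q8 activations in [-128, 127]. */
    const int32_t max_product   = 8 * 128;           /* 1024: fits in 12 signed bits  */
    const int32_t max_block_sum = 32 * max_product;  /* 32768: fits in 17 signed bits */
    assert(max_product   <= 2047);                   /* 12-bit signed max             */
    assert(max_block_sum <= 65535);                  /* 17-bit signed max             */
    return 0;
}
```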

@chraac marked this pull request as draft December 10, 2025 13:36
const uint8_t * restrict y_d) {
    HVX_Vector vy_d = *(const HVX_Vector *) y_d;
    HVX_Vector r0_d = *(const HVX_Vector *) r0_x_d;
    HVX_Vector r1_d = *(const HVX_Vector *) r1_x_d;
@chraac (Contributor, Author) commented Dec 10, 2025

QQ: Given the update to 64xf16, is it safe to assume that x_d and y_d are now aligned to HVX_Vector?

@max-krasnyansky (Collaborator) commented:

I don't think so. The current REPACK format is all quants (4-bit nibbles) followed by the scales.
For models where the number of elements per row is a multiple of 128, the scales will be aligned, but for something like gpt-oss-20b with 2880-element rows they will not be.
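
A quick arithmetic check of those two cases (illustrative; 4096 is just an arbitrary multiple of 128):

```c
#include <stdio.h>

int main(void) {
    /* Per the repack rule above, the scale area is aligned only when the number
     * of elements per row is a multiple of 128. */
    const int rows[] = { 4096, 2880 };   /* multiple of 128 vs gpt-oss-20b rows */
    for (int i = 0; i < 2; ++i) {
        printf("ne_per_row=%d -> remainder %d -> scales %saligned\n",
               rows[i], rows[i] % 128, rows[i] % 128 == 0 ? "" : "not ");
    }
    return 0;
}
```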

@chraac (Contributor, Author) commented:

Reverting to unaligned scale loading.
Though, since the scales currently follow the quantization layout, it may be worth implementing an aligned load path for specific row element counts for better performance.
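
As a sketch of what such a dispatch could look like (the predicate, the helper name and where the check would be hoisted are assumptions on my side):

```c
#include <stdint.h>
#include <hexagon_types.h>

/* Illustrative: choose the aligned path only when the scale pointer is on a
 * 128-byte boundary, otherwise fall back to an unaligned (vmemu) load. In
 * practice the check would be hoisted out of the inner loop, e.g. decided once
 * per row or once per tensor shape. */
static inline HVX_Vector load_scales_64xhf(const uint8_t * d) {
    if (((uintptr_t) d & 127u) == 0) {
        return *(const HVX_Vector *) d;    /* aligned load   */
    }
    return *(const HVX_UVector *) d;       /* unaligned load */
}
```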

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Dec 10, 2025
@chraac (Contributor, Author) commented Dec 10, 2025

@max-krasnyansky, I'd like to open a discussion regarding matvec and matmul. Currently, we issue the new DMA row read after the vec_dot operation, which seems suboptimal.

Since the DMA engine can run in parallel with the HVX SIMD unit, I propose implementing a VTCM double-buffering strategy. This would allow us to overlap DMA loading with the vec_dot calculation.

@max-krasnyansky (Collaborator) commented:

> @max-krasnyansky, I'd like to open a discussion regarding matvec and matmul. Currently, we issue the new DMA row read after the vec_dot operation, which seems suboptimal.
>
> Since the DMA engine can run in parallel with the HVX SIMD unit, I propose implementing a VTCM double-buffering strategy. This would allow us to overlap DMA loading with the vec_dot calculation.

Actually the DMA is fully asynchronous and it already overlaps with vec_dot.
If you look at the overall outer loop, it works like this:

  1. Prefill the scratchpad with 16 rows --> issue 8x DMA requests (2 rows per request) for rows 0 ... 15
  2. Wait for the first DMA request to complete (rows 0,1)
  3. VecDot for rows 0,1 (DMAs for rows 2... are in flight)
  4. Issue DMA request for rows 16,17 (will overwrite rows 0,1)
  5. Wait for the second DMA request to complete (rows 2,3) -- will not actually wait because the DMA should be done by now
  6. VecDot for rows 2,3 (DMAs for rows 4... are in flight)
  7. Issue DMA request for rows 18,19 (will overwrite rows 2,3)
    ...

You get the idea. It's fully pipelined. Typically all the waits are no-ops except for the first one.
Also, if you just comment out the vec_dot calls in the outer loop you'll see that we're fully DMA/DDR bound (i.e. you'll get about the same token rate).

Prompt processing, on the other hand, is compute bound, and I'm working on redoing the matvec to optimize out the number of reductions that are needed (i.e. those rmpy_x8 functions can be improved, but that needs data layout/repack changes).
And of course we'll need to enable HMX to fully utilize the TOPs but that is a bit tricky and might take some time :)
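
For readers following along, a compressed sketch of the pipeline described above; the dma_queue_*, spad_slot, src_row_pair and vec_dot_row_pair helpers are hypothetical stand-ins, not the actual ggml-hexagon API:

```c
#include <stddef.h>

/* Hypothetical stand-ins for the real scratchpad/DMA interface (names assumed). */
typedef struct dma_queue dma_queue;
void         dma_queue_push(dma_queue * q, void * dst_vtcm, const void * src_ddr, size_t len);
void         dma_queue_wait(dma_queue * q, int slot);
void *       spad_slot(int slot);                      /* VTCM buffer for one row pair */
const void * src_row_pair(const void * x, int pair);   /* DDR address of a row pair    */
void         vec_dot_row_pair(float * dst, const void * rows, const void * vy);

enum { SPAD_ROW_PAIRS = 8 };   /* 16 scratchpad rows, 2 rows per DMA request */

static void matvec_outer_loop(dma_queue * q, float * dst, const void * x, const void * vy,
                              int n_row_pairs, size_t pair_bytes) {
    /* 1. Prefill: issue 8 DMA requests covering row pairs (0,1) ... (14,15). */
    for (int p = 0; p < SPAD_ROW_PAIRS && p < n_row_pairs; ++p) {
        dma_queue_push(q, spad_slot(p), src_row_pair(x, p), pair_bytes);
    }
    for (int p = 0; p < n_row_pairs; ++p) {
        const int slot = p % SPAD_ROW_PAIRS;
        /* 2./5. Wait for the request covering this pair; after the first iteration
         *       this is normally a no-op because the transfer already finished. */
        dma_queue_wait(q, slot);
        /* 3./6. Compute on this pair while the other requests are still in flight. */
        vec_dot_row_pair(dst + 2 * p, spad_slot(slot), vy);
        /* 4./7. Reuse the slot for the pair 16 rows ahead. */
        if (p + SPAD_ROW_PAIRS < n_row_pairs) {
            dma_queue_push(q, spad_slot(slot), src_row_pair(x, p + SPAD_ROW_PAIRS), pair_bytes);
        }
    }
}
```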

@chraac (Contributor, Author) commented Dec 11, 2025

> Actually the DMA is fully asynchronous and it already overlaps with vec_dot. If you look at the overall outer loop, it works like this:
>
>   1. Prefill the scratchpad with 16 rows --> issue 8x DMA requests (2 rows per request) for rows 0 ... 15
>   2. Wait for the first DMA request to complete (rows 0,1)
>   3. VecDot for rows 0,1 (DMAs for rows 2... are in flight)
>   4. Issue DMA request for rows 16,17 (will overwrite rows 0,1)
>   5. Wait for the second DMA request to complete (rows 2,3) -- will not actually wait because the DMA should be done by now
>   6. VecDot for rows 2,3 (DMAs for rows 4... are in flight)
>   7. Issue DMA request for rows 18,19 (will overwrite rows 2,3)

Thanks. I was referring to swapping the order so we issue the DMA request (step 4) before vec_dot (step 3).
That would require a second buffer (double buffering). I haven't tested it yet, but I'm going to comment out vec_dot first to verify whether that improves performance.
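
For what it's worth, one way the reordered version could look, reusing the hypothetical helpers from the sketch above. Doubling the scratchpad to 16 slots means the refill issued before vec_dot always targets a slot that was either never used yet or consumed 8 iterations earlier, never the pair about to be read. Untested and purely illustrative:

```c
enum { PAIRS_IN_FLIGHT = 8, SPAD_SLOTS = 16 };   /* 2x the slots of the current loop */

static void matvec_outer_loop_db(dma_queue * q, float * dst, const void * x, const void * vy,
                                 int n_row_pairs, size_t pair_bytes) {
    /* Prefill pairs 0..7 into slots 0..7, as before. */
    for (int p = 0; p < PAIRS_IN_FLIGHT && p < n_row_pairs; ++p) {
        dma_queue_push(q, spad_slot(p), src_row_pair(x, p), pair_bytes);
    }
    for (int p = 0; p < n_row_pairs; ++p) {
        /* Old step 4 moved first: pair p+8 goes into slot (p+8) % 16, which is not
         * the slot holding pair p, so the DMA can start before the compute. */
        if (p + PAIRS_IN_FLIGHT < n_row_pairs) {
            dma_queue_push(q, spad_slot((p + PAIRS_IN_FLIGHT) % SPAD_SLOTS),
                           src_row_pair(x, p + PAIRS_IN_FLIGHT), pair_bytes);
        }
        /* Old steps 2/3: wait for this pair's request, then compute on it. */
        dma_queue_wait(q, p % SPAD_SLOTS);
        vec_dot_row_pair(dst + 2 * p, spad_slot(p % SPAD_SLOTS), vy);
    }
}
```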
