
Conversation

@chraac (Contributor) commented Dec 10, 2025

Changes

  • Q4_0 Dot Product Optimization:
    • Implemented hvx_vec_load_and_mul_d_rx2 and hvx_vec_load_and_mul_d_r2x2 helper functions to streamline vector loading and multiplication.
    • Refactored vec_dot_q4x4x2_q8x4x2_rx2 and vec_dot_q8x4x2_q8x4x2_rx2 to improve instruction pipelining and reduce overhead in the main loops.

Performance

The following performance comparison shows significant improvements for MUL_MAT(type_a=q4_0, type_b=f32) across various batch sizes (n), with ~30% speedup observed for n >= 2.

Device: 8Gen3
Baseline: 4d3726278
Current: 00d5fb31b

| Operation (q4_0) | Baseline (GFLOPS) | Current (GFLOPS) | Speedup |
|---|---|---|---|
| n=2 | 238.32 | 316.59 | +32% |
| n=3 | 242.05 | 323.53 | +33% |
| n=4 | 244.17 | 327.72 | +34% |
| n=5 | 245.33 | 329.64 | +34% |
| n=8 | 247.10 | 333.06 | +34% |

}

HVX_Vector_x4 r_dd =
    hvx_vec_load_and_mul_d_r2x2(r0_x_d + i * x_dblk_size, r1_x_d + i * x_dblk_size, y_d + i * y_dblk_size);
@chraac (Contributor, Author) commented Dec 10, 2025

Optimized the scale multiplication step. The previous implementation only processed 32xf16 elements (half the vector width). This change enables 64xf16 multiplication to fully utilize the HVX vector capacity.
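
For reference, a minimal sketch of the full-width idea, assuming the Hexagon SDK HVX intrinsics (Q6_Wqf32_vmpy_VhfVhf and friends). The result struct, the function name mul_scales_64xhf and the load pattern here are illustrative only, not the PR's actual hvx_vec_load_and_mul_d_r2x2:

```c
#include <stdint.h>
#include <hexagon_types.h>
#include <hvx_hexagon_protos.h>

/* Illustrative sketch: one full-width (64 x fp16) scale multiply. */
typedef struct {
    HVX_Vector lo;   /* 32 x qf32 */
    HVX_Vector hi;   /* 32 x qf32 */
} qf32x2_t;

static inline qf32x2_t mul_scales_64xhf(const uint8_t * x_d, const uint8_t * y_d) {
    /* Unaligned loads: the scales are not guaranteed to sit on a 128-byte
     * boundary (see the alignment discussion further down). */
    HVX_Vector vx = *(const HVX_UVector *) x_d;   /* 64 fp16 scales from src0 */
    HVX_Vector vy = *(const HVX_UVector *) y_d;   /* 64 fp16 scales from src1 */

    /* A single vmpy covers all 64 fp16 lanes and yields 64 qf32 products as a
     * vector pair; the point of the change is to use both halves of that pair
     * instead of discarding the upper one. */
    HVX_VectorPair prod = Q6_Wqf32_vmpy_VhfVhf(vx, vy);

    qf32x2_t out = { Q6_V_lo_W(prod), Q6_V_hi_W(prod) };
    return out;
}
```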

@max-krasnyansky (Collaborator) commented:

I'm getting garbled output for all models.
Also, ultimately we end up with the INT32 accumulator for each block (32 elements).
In order to multiply it with the FP16 scale we need to convert both (accumulator and scale) into FP32 (QF32). This means that we still need to do the same number of multiplies and use the same number of HVX registers either way.
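
A scalar reference of the per-block math, just to make that conversion step explicit (illustrative only; in the real kernel the scales arrive as fp16 and the accumulator lives in an HVX register):

```c
#include <stdint.h>

/* Scalar reference for one q4_0 x q8 block (32 elements). q4[i] are the 4-bit
 * weights already unpacked to [-8, 7], q8[i] are int8 activations, and d_x/d_y
 * are the per-block scales already converted from fp16 to fp32. */
static inline float block_dot_q4_0_q8(const int8_t * q4, const int8_t * q8,
                                      float d_x, float d_y) {
    int32_t acc = 0;    /* integer accumulator for the 32 products */
    for (int i = 0; i < 32; ++i) {
        acc += (int32_t) q4[i] * (int32_t) q8[i];
    }
    /* The integer sum and both scales have to meet in floating point before the
     * final multiply -- this is the int32 -> fp32 (QF32) conversion being
     * discussed here. */
    return (float) acc * d_x * d_y;
}
```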

@chraac (Contributor, Author) commented:

> I'm getting garbled output for all models.

Reverted scale loading to handle unaligned scales, as alignment cannot be ensured for all tensor shapes.

I think this resolves the garbled output issues. Tested on:

  • llama3-1b: log
  • qwen3-1.7b: log

@chraac (Contributor, Author) commented:

> Also, ultimately we end up with the INT32 accumulator for each block (32 elements).
> In order to multiply it with the FP16 scale we need to convert both (accumulator and scale) into FP32 (QF32).

  • Regarding the scales utilization: The original source uses 2 Q6_Wqf32_vmpy_VhfVhf instructions for 2 rows but ignores the upper half. This PR aims to fully utilize the results of both multiplications.

  • As for the accumulator width: For Q4_0, an INT32 accumulator is likely excessive. Since src0 (4-bit) * src1 (8-bit) fits in 12 bits, accumulating 32 elements only requires 17 bits total. A 32-bit accumulator is far larger than what is strictly required.
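
A quick worst-case check of those bounds (illustrative only, not part of this PR):

```c
#include <assert.h>
#include <stdint.h>

int main(void) {
    /* q4_0 weights after the -8 offset lie in [-8, 7]; q8 activations in [-128, 127]. */
    const int32_t max_product   = 8 * 128;           /* 1024: fits in 12 signed bits  */
    const int32_t max_block_sum = 32 * max_product;  /* 32768: fits in 17 signed bits */
    assert(max_product   <= 2047);                   /* 12-bit signed max             */
    assert(max_block_sum <= 65535);                  /* 17-bit signed max             */
    return 0;
}
```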

@chraac marked this pull request as draft December 10, 2025 13:36
const uint8_t * restrict y_d) {
    HVX_Vector vy_d = *(const HVX_Vector *) y_d;
    HVX_Vector r0_d = *(const HVX_Vector *) r0_x_d;
    HVX_Vector r1_d = *(const HVX_Vector *) r1_x_d;
@chraac (Contributor, Author) commented Dec 10, 2025

QQ: Given the update to 64xf16, is it safe to assume that x_d and y_d are now aligned to HVX_Vector?

@max-krasnyansky (Collaborator) commented:

I don't think so. The current REPACK format is all quants (4-bit nibbles) followed by the scales.
For models where the number of elements per row is a multiple of 128, the scales will be aligned, but for something like gpt-oss-20b with 2880-element rows they will not be.
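
A quick arithmetic check of those two cases (illustrative; 4096 is just an arbitrary multiple of 128):

```c
#include <stdio.h>

int main(void) {
    /* Per the repack rule above, the scale area is aligned only when the number
     * of elements per row is a multiple of 128. */
    const int rows[] = { 4096, 2880 };   /* multiple of 128 vs gpt-oss-20b rows */
    for (int i = 0; i < 2; ++i) {
        printf("ne_per_row=%d -> remainder %d -> scales %saligned\n",
               rows[i], rows[i] % 128, rows[i] % 128 == 0 ? "" : "not ");
    }
    return 0;
}
```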

@chraac (Contributor, Author) commented:

Reverting to unaligned scale loading.
Though, since the scales currently follow the quantization layout, it may be worth implementing an aligned load path for specific row element counts for better performance.
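
As a sketch of what such a dispatch could look like (the predicate, the helper name and where the check would be hoisted are assumptions on my side):

```c
#include <stdint.h>
#include <hexagon_types.h>

/* Illustrative: choose the aligned path only when the scale pointer is on a
 * 128-byte boundary, otherwise fall back to an unaligned (vmemu) load. In
 * practice the check would be hoisted out of the inner loop, e.g. decided once
 * per row or once per tensor shape. */
static inline HVX_Vector load_scales_64xhf(const uint8_t * d) {
    if (((uintptr_t) d & 127u) == 0) {
        return *(const HVX_Vector *) d;    /* aligned load   */
    }
    return *(const HVX_UVector *) d;       /* unaligned load */
}
```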

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Dec 10, 2025
@chraac (Contributor, Author) commented Dec 10, 2025

@max-krasnyansky, I'd like to open a discussion regarding matvec and matmul. Currently, we issue the new DMA row read after the vec_dot operation, which seems suboptimal.

Since the DMA engine can run in parallel with the HVX SIMD unit, I propose implementing a VTCM double-buffering strategy. This would allow us to overlap DMA loading with the vec_dot calculation.

@max-krasnyansky (Collaborator) commented:

> @max-krasnyansky, I'd like to open a discussion regarding matvec and matmul. Currently, we issue the new DMA row read after the vec_dot operation, which seems suboptimal.
>
> Since the DMA engine can run in parallel with the HVX SIMD unit, I propose implementing a VTCM double-buffering strategy. This would allow us to overlap DMA loading with the vec_dot calculation.

Actually the DMA is fully asynchronous and it already overlaps with vec_dot.
If you look at the overall outer loop, it works like this:

  1. Prefill the scratchpad with 16 rows --> issue 8x DMA requests (2 rows per request) for rows 0 ... 15
  2. Wait for the first DMA request to complete (rows 0,1)
  3. VecDot for rows 0,1 (DMAs for rows 2... are in flight)
  4. Issue DMA request for rows 16,17 (will overwrite rows 0,1)
  5. Wait for the second DMA request to complete (rows 2,3) -- will not actually wait because the DMA should be done by now
  6. VecDot for rows 2,3 (DMAs for rows 4... are in flight)
  7. Issue DMA request for rows 18,19 (will overwrite rows 2,3)
    ...

You get the idea. It's fully pipelined. Typically all the waits are no-ops except for the first one.
Also, if you just comment out the vec_dot calls in the outer loop you'll see that we're fully DMA/DDR bound (i.e. you'll get about the same token rate).

Prompt processing, on the other hand, is compute bound, and I'm working on redoing the matvec to optimize out the number of reductions that are needed (i.e. those rmpy_x8 functions can be improved, but that needs data layout/repack changes).
And of course we'll need to enable HMX to fully utilize the TOPs but that is a bit tricky and might take some time :)
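
For readers following along, a compressed sketch of the pipeline described above; the dma_queue_*, spad_slot, src_row_pair and vec_dot_row_pair helpers are hypothetical stand-ins, not the actual ggml-hexagon API:

```c
#include <stddef.h>

/* Hypothetical stand-ins for the real scratchpad/DMA interface (names assumed). */
typedef struct dma_queue dma_queue;
void         dma_queue_push(dma_queue * q, void * dst_vtcm, const void * src_ddr, size_t len);
void         dma_queue_wait(dma_queue * q, int slot);
void *       spad_slot(int slot);                      /* VTCM buffer for one row pair */
const void * src_row_pair(const void * x, int pair);   /* DDR address of a row pair    */
void         vec_dot_row_pair(float * dst, const void * rows, const void * vy);

enum { SPAD_ROW_PAIRS = 8 };   /* 16 scratchpad rows, 2 rows per DMA request */

static void matvec_outer_loop(dma_queue * q, float * dst, const void * x, const void * vy,
                              int n_row_pairs, size_t pair_bytes) {
    /* 1. Prefill: issue 8 DMA requests covering row pairs (0,1) ... (14,15). */
    for (int p = 0; p < SPAD_ROW_PAIRS && p < n_row_pairs; ++p) {
        dma_queue_push(q, spad_slot(p), src_row_pair(x, p), pair_bytes);
    }
    for (int p = 0; p < n_row_pairs; ++p) {
        const int slot = p % SPAD_ROW_PAIRS;
        /* 2./5. Wait for the request covering this pair; after the first iteration
         *       this is normally a no-op because the transfer already finished. */
        dma_queue_wait(q, slot);
        /* 3./6. Compute on this pair while the other requests are still in flight. */
        vec_dot_row_pair(dst + 2 * p, spad_slot(slot), vy);
        /* 4./7. Reuse the slot for the pair 16 rows ahead. */
        if (p + SPAD_ROW_PAIRS < n_row_pairs) {
            dma_queue_push(q, spad_slot(slot), src_row_pair(x, p + SPAD_ROW_PAIRS), pair_bytes);
        }
    }
}
```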

@chraac (Contributor, Author) commented Dec 11, 2025

> Actually the DMA is fully asynchronous and it already overlaps with vec_dot. If you look at the overall outer loop, it works like this:
>
>   1. Prefill the scratchpad with 16 rows --> issue 8x DMA requests (2 rows per request) for rows 0 ... 15
>   2. Wait for the first DMA request to complete (rows 0,1)
>   3. VecDot for rows 0,1 (DMAs for rows 2... are in flight)
>   4. Issue DMA request for rows 16,17 (will overwrite rows 0,1)
>   5. Wait for the second DMA request to complete (rows 2,3) -- will not actually wait because the DMA should be done by now
>   6. VecDot for rows 2,3 (DMAs for rows 4... are in flight)
>   7. Issue DMA request for rows 18,19 (will overwrite rows 2,3)

Thanks. I was referring to swapping the order so we issue the DMA request (step 4) before vec_dot (step 3).
That would require a second buffer (double buffering). I haven't tested it yet, but I'm going to comment out vec_dot first to verify whether that improves performance.
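
For what it's worth, one way the reordered version could look, reusing the hypothetical helpers from the sketch above. Doubling the scratchpad to 16 slots means the refill issued before vec_dot always targets a slot that was either never used yet or consumed 8 iterations earlier, never the pair about to be read. Untested and purely illustrative:

```c
enum { PAIRS_IN_FLIGHT = 8, SPAD_SLOTS = 16 };   /* 2x the slots of the current loop */

static void matvec_outer_loop_db(dma_queue * q, float * dst, const void * x, const void * vy,
                                 int n_row_pairs, size_t pair_bytes) {
    /* Prefill pairs 0..7 into slots 0..7, as before. */
    for (int p = 0; p < PAIRS_IN_FLIGHT && p < n_row_pairs; ++p) {
        dma_queue_push(q, spad_slot(p), src_row_pair(x, p), pair_bytes);
    }
    for (int p = 0; p < n_row_pairs; ++p) {
        /* Old step 4 moved first: pair p+8 goes into slot (p+8) % 16, which is not
         * the slot holding pair p, so the DMA can start before the compute. */
        if (p + PAIRS_IN_FLIGHT < n_row_pairs) {
            dma_queue_push(q, spad_slot((p + PAIRS_IN_FLIGHT) % SPAD_SLOTS),
                           src_row_pair(x, p + PAIRS_IN_FLIGHT), pair_bytes);
        }
        /* Old steps 2/3: wait for this pair's request, then compute on it. */
        dma_queue_wait(q, p % SPAD_SLOTS);
        vec_dot_row_pair(dst + 2 * p, spad_slot(p % SPAD_SLOTS), vy);
    }
}
```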
