[WIP] ggml-hexagon: Q4_0 mm opt #17907
Conversation
```c
HVX_Vector_x4 r_dd =
    hvx_vec_load_and_mul_d_r2x2(r0_x_d + i * x_dblk_size, r1_x_d + i * x_dblk_size, y_d + i * y_dblk_size);
```
Optimized the scale multiplication step. The previous implementation only processed 32xf16 elements (half the vector width). This change enables 64xf16 multiplication to fully utilize the HVX vector capacity.
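For reference, a scalar sketch of what the full-width scale multiply computes per vector of block scales. This is illustrative C only, not the HVX kernel; the lane count of 64 comes from 128-byte HVX vectors holding 64 f16 values, and the function and pointer names simply mirror the diff above.

```c
#include <stddef.h>

// Illustrative scalar reference only; the real kernel does this with HVX
// vectors (e.g. Q6_Wqf32_vmpy_VhfVhf), not a loop.
typedef _Float16 fp16_t;   // requires a compiler with _Float16 support

// Multiply the per-block f16 scales of two x rows with the matching y scales.
// One HVX vector covers 64 f16 lanes, so nblk would be 64 per iteration.
static void scale_mul_ref(const fp16_t *r0_x_d, const fp16_t *r1_x_d,
                          const fp16_t *y_d, float *out0, float *out1,
                          size_t nblk) {
    for (size_t i = 0; i < nblk; i++) {
        out0[i] = (float) r0_x_d[i] * (float) y_d[i];  // row 0 scale * y scale
        out1[i] = (float) r1_x_d[i] * (float) y_d[i];  // row 1 scale * y scale
    }
}
```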
I'm getting garbled output for all models.
Also, ultimately we end up with the INT32 accumulator for each block (32 elements).
In order to multiply it with the FP16 scale we need to convert both (accumulator and scale) into FP32 (QF32). This means that we still need to do the same number of multiplies and use the same number of HVX registers either way.
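For clarity, a scalar sketch of the per-block math being described (illustrative only, not the backend code): the 32-element integer dot product stays in an int32 accumulator and is then scaled by the two f16 block scales after promotion to float (QF32 on HVX).

```c
#include <stdint.h>

typedef _Float16 fp16_t;   // f16 block scales

// One Q4_0/Q8 block pair: 32 quantized values per side plus one scale each.
static float block_dot_ref(const int8_t xq[32],  // q4 values widened to int8 (-8..7)
                           const int8_t yq[32],  // q8 values
                           fp16_t xd, fp16_t yd) {
    int32_t acc = 0;
    for (int i = 0; i < 32; i++) {
        acc += (int32_t) xq[i] * (int32_t) yq[i];  // integer multiply-accumulate
    }
    // both the accumulator and the f16 scales are promoted to float before scaling
    return (float) acc * (float) xd * (float) yd;
}
```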
> Also, ultimately we end up with the INT32 accumulator for each block (32 elements).
> In order to multiply it with the FP16 scale we need to convert both (accumulator and scale) into FP32 (QF32).
- Regarding the scales utilization: the original source uses 2 `Q6_Wqf32_vmpy_VhfVhf` instructions for 2 rows but ignores the upper half. This PR aims to fully utilize the results of both multiplications.
- As for the accumulator width: for `Q4_0`, an INT32 accumulator is likely excessive. Since `src0` (4-bit) * `src1` (8-bit) fits in 12 bits, accumulating 32 elements only requires 17 bits in total; a 32-bit accumulator is far larger than strictly required (see the arithmetic sketch below).
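A quick arithmetic check of the bit-width claim above (a standalone sketch, not kernel code):

```c
#include <assert.h>
#include <stdint.h>

int main(void) {
    // Q4_0 values span -8..7, Q8 values span -128..127.
    int32_t max_prod = 8 * 128;        // 1024: fits in 12 signed bits (-2048..2047)
    int32_t max_sum  = 32 * max_prod;  // 32768: fits in 17 signed bits (-65536..65535)
    assert(max_prod <  (1 << 11));
    assert(max_sum  <= (1 << 15));
    return 0;
}
```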
```c
const uint8_t * restrict y_d) {
    HVX_Vector vy_d = *(const HVX_Vector *) y_d;
    HVX_Vector r0_d = *(const HVX_Vector *) r0_x_d;
    HVX_Vector r1_d = *(const HVX_Vector *) r1_x_d;
```
QQ: Given the update to 64xf16, is it safe to assume that `x_d` and `y_d` are now aligned to `HVX_Vector`?
I don't think so. The current REPACK format is all quants (4-bit nibbles) followed by the scales.
For models where the number of elements per row is a multiple of 128, the scales will be aligned, but for something like gpt-oss-20b with 2880-element rows the scales will not be aligned.
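To make the criterion concrete, here is a small standalone check. It is a sketch that assumes the nibble area for a packed row pair occupies `ne` bytes (two rows of `ne/2` nibble-bytes each) with the f16 scales immediately after it; the real REPACK offsets live in the backend.

```c
#include <stdbool.h>
#include <stdio.h>

#define HVX_VEC_BYTES 128

// Hypothetical layout for illustration: quants (4-bit nibbles) first,
// f16 scales right after. The scales land on an HVX_Vector (128-byte)
// boundary only when the nibble area ends on one.
static bool scales_aligned(int ne /* elements per row */) {
    return (ne % HVX_VEC_BYTES) == 0;
}

int main(void) {
    printf("4096-element rows aligned: %d\n", scales_aligned(4096));  // 1: multiple of 128
    printf("2880-element rows aligned: %d\n", scales_aligned(2880));  // 0: gpt-oss-20b case
    return 0;
}
```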
Reverting to unaligned scale loading.
Since the scales currently follow the quantization layout, I think it may be worth implementing an aligned load path for specific row element counts to get better performance.
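A rough sketch of what such a split could look like in plain C (a hypothetical helper, not the existing backend code): check the pointer at runtime and take the direct vector load only when it is 128-byte aligned, otherwise fall back to a memcpy-style unaligned load.

```c
#include <stdint.h>
#include <string.h>

#define HVX_VEC_BYTES 128

typedef struct { uint8_t b[HVX_VEC_BYTES]; } vec128_t;  // stand-in for HVX_Vector

// Hypothetical dual-path scale load, illustration only.
static inline vec128_t load_scales(const uint8_t *p) {
    vec128_t v;
    if (((uintptr_t) p % HVX_VEC_BYTES) == 0) {
        v = *(const vec128_t *) p;   // aligned fast path
    } else {
        memcpy(&v, p, sizeof v);     // unaligned fallback
    }
    return v;
}
```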
@max-krasnyansky, I'd like to open a discussion: since the DMA engine can run in parallel with the HVX SIMD unit, I propose implementing a VTCM double-buffering strategy. This would allow us to overlap DMA loading with the vec_dot computation.
Actually the DMA is fully asynchronous and it already overlaps with vec_dot.
You get the idea. It's fully pipelined. Typically all the waits are no-ops except for the first one. Prompt processing, on the other hand, is compute bound, and I'm working on redoing the matvec to optimize out the number of reductions that are needed (i.e. those rmpy_x8 functions can be improved, but that needs data layout/repack changes).
Thanks. I was referring to swapping the order so we issue the DMA request (step 4) before the vec_dot computation.
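For context, a generic double-buffering loop of the kind being discussed might look like the sketch below. The `dma_issue()`/`dma_wait()`/`vec_dot_chunk()` helpers and the chunk bookkeeping are hypothetical placeholders rather than the ggml-hexagon API; the point is only the ordering: the DMA request for the next chunk goes out before the compute on the current one, so the wait at the top of each iteration is usually a no-op.

```c
#include <stddef.h>

typedef struct { void *vtcm; size_t size; } chunk_t;

// Hypothetical async-DMA interface, for illustration only.
void dma_issue(chunk_t *dst, size_t chunk_idx);   // start a transfer, returns immediately
void dma_wait(chunk_t *dst);                      // block until the transfer lands
void vec_dot_chunk(const chunk_t *src);           // compute on data already in VTCM

void process_pipelined(size_t n_chunks) {
    chunk_t buf[2];                    // two VTCM buffers, used ping-pong
    dma_issue(&buf[0], 0);             // prefetch the first chunk

    for (size_t i = 0; i < n_chunks; i++) {
        chunk_t *cur  = &buf[i % 2];
        chunk_t *next = &buf[(i + 1) % 2];

        if (i + 1 < n_chunks) {
            dma_issue(next, i + 1);    // kick off the next transfer early
        }
        dma_wait(cur);                 // typically a no-op after the first chunk
        vec_dot_chunk(cur);            // compute overlaps with the in-flight DMA
    }
}
```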
Changes
- Added `hvx_vec_load_and_mul_d_rx2` and `hvx_vec_load_and_mul_d_r2x2` helper functions to streamline vector loading and multiplication.
- Reworked `vec_dot_q4x4x2_q8x4x2_rx2` and `vec_dot_q8x4x2_q8x4x2_rx2` to improve instruction pipelining and reduce overhead in the main loops.

Performance
The following performance comparison shows significant improvements for `MUL_MAT(type_a=q4_0, type_b=f32)` across various batch sizes (n), with a ~30% speedup observed for n >= 2.

Device: 8Gen3
Baseline: 4d3726278
Current: 00d5fb31b

(Benchmark table for q4_0 at n = 2, 3, 4, 5, 8; values not shown.)