HIP: enable mmf for RDNA3 #17879
|
Added op-level perf data collected on Windows. The Windows numbers are unstable, but this is the only RDNA3 setup I have. It would be very helpful if anyone could test other RDNA3 GPUs on Linux, thank you.
MUL_MAT
MUL_MAT_ID
|
|
I finally got Ubuntu 22.04 working, so I added data collected there with ROCm 7.1.0. Unlike my 9070XT, the 7900XTX does appear to get a perf improvement from mul_mat_f, which is why I suspect the ROCm compiler is not optimizing for RDNA4.
MUL_MAT
MUL_MAT_ID
|
|
I see a remarkable boost for MUL_MAT_ID on the entry-level 780M iGPU on Linux.
Largest gains: Largest losses:
|
|
Added data from real models. The odd thing is the perf difference between the bf16 and fp16 versions of DeepSeek.
DeepSeek-R1-Distill-Qwen-1.5B_f16
DeepSeek-R1-Distill-Qwen-1.5B_bf16
granite-3.1-1b-a400m-instruct_f16
granite-3.1-1b-a400m-instruct_bf16
|
Performance on a Strix Halo system
For RDNA 3.5 this PR seems to be universally faster. I was hoping that RDNA 3 and RDNA 3.5 would be largely interchangeable in terms of kernel selection logic but this seems to be not the case. So I would ask you to expand the definitions in |
|
Honestly, there might be more compiler optimization for gfx1151 since DGX Spark exists; I don't see AMD putting any extra resources into dGPUs right now. For RDNA 3.5, the deep learning part is similar to RDNA3; it just adds some features for gaming.
|
Adjusted the mmf and mmvf shape conditions: cases with ne11 > 8 now go to hipBLAS, and cases with ne11 > 3 go to mmf.
MUL_MAT
MUL_MAT_ID
DeepSeek-R1-Distill-Qwen-1.5B_f16
DeepSeek-R1-Distill-Qwen-1.5B_bf16
granite-3.1-1b-a400m-instruct_f16
granite-3.1-1b-a400m-instruct_bf16
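For readers following the thresholds in this comment, here is a minimal sketch of how such a dispatch on ne11 could look. It is not the code from this PR: `MatMulPath` and `select_path_rdna3` are hypothetical names, the sketch assumes ne11 <= 3 stays on mmvf, and it ignores everything else the real selection logic may depend on (data type, other tensor dimensions, architecture checks).

```cpp
#include <cstdint>

// Hypothetical enum for illustration only; not an identifier from the PR.
enum class MatMulPath { MMVF, MMF, HIPBLAS };

// Sketch of the shape-based dispatch described in the comment above.
// ne11 is the second dimension of src1 (the batch/token count).
//   ne11 > 8       -> hipBLAS
//   3 < ne11 <= 8  -> mmf
//   ne11 <= 3      -> mmvf (assumption: the small-batch vector kernel keeps these)
constexpr MatMulPath select_path_rdna3(int64_t ne11) {
    if (ne11 > 8) {
        return MatMulPath::HIPBLAS;
    }
    if (ne11 > 3) {
        return MatMulPath::MMF;
    }
    return MatMulPath::MMVF;
}

// Human-readable name for logging or debugging.
constexpr const char * path_name(MatMulPath p) {
    return p == MatMulPath::HIPBLAS ? "hipBLAS"
         : p == MatMulPath::MMF     ? "mmf"
                                    : "mmvf";
}
```

Under these assumptions, `select_path_rdna3(4)` and `select_path_rdna3(8)` pick mmf, while `select_path_rdna3(9)` falls through to hipBLAS.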
|
Enable mmf for RDNA3. All mul_mat_f-related test cases should pass; I am still collecting the perf data.
There is also a perf regression in mul_mat_f on my 7900XTX; I assume it is a similar issue to ROCm/ROCm#5727.
If anyone can help collect MUL_MAT perf data on other RDNA3 GPUs, that would be very helpful. If there is a perf improvement, I will keep mul_mat_f enabled on RDNA3 and ask ROCm to improve the perf; otherwise I will suggest disabling mul_mat_f on RDNA3.
MUL_MAT_ID_FUSION_rdna3_test.txt
MUL_MAT_ID_rdna3_test.txt
MUL_MAT_rdna3_test.txt
MUL_MAT_ID_FUSION_rdna4_test.txt
MUL_MAT_ID_rdna4_test.txt
MUL_MAT_rdna4_test.txt