Conversation

@joeldushouyu

Summary

This PR allows running vision models (tested with Gemma 3 4B) on the Hexagon NPU.

For now, it only supports using the CDSP for FP16xFP32 matrix multiplication.
Note: I am fully aware that the current FP16xFP32 implementation is not the most optimal. For example, we could easily reduce unnecessary data repetition by using the VTCM as a cache, but I think that should go into a separate PR that focuses solely on optimization.
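For reference, the semantics the CDSP kernel has to reproduce are just a mixed-precision matmul: decode each FP16 weight to FP32 and accumulate in FP32. Below is a minimal scalar sketch of that contract (the function names are illustrative, not the actual HVX kernel in this PR):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Decode an IEEE-754 binary16 value to float (zero, subnormals, normals,
// inf/NaN). Conceptually, each fp16 weight goes through this before the
// fp32 multiply-accumulate.
static float fp16_to_fp32(uint16_t h) {
    uint32_t sign = (uint32_t)(h >> 15) << 31;
    uint32_t exp  = (h >> 10) & 0x1F;
    uint32_t mant = h & 0x3FF;
    uint32_t bits;
    if (exp == 0) {
        if (mant == 0) {
            bits = sign;                              // +/- zero
        } else {                                      // subnormal: renormalize
            exp = 1;
            while (!(mant & 0x400)) { mant <<= 1; exp--; }
            mant &= 0x3FF;
            bits = sign | ((exp + 112) << 23) | (mant << 13);
        }
    } else if (exp == 31) {
        bits = sign | 0x7F800000u | (mant << 13);     // inf / NaN
    } else {
        bits = sign | ((exp + 112) << 23) | (mant << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

// Scalar reference: C[m x n] = A[m x k] (fp16) * B[k x n] (fp32),
// accumulating in fp32. The real kernel vectorizes this with HVX.
static void matmul_f16_f32(const uint16_t *A, const float *B, float *C,
                           int m, int n, int k) {
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += fp16_to_fp32(A[(size_t)i * k + p]) * B[(size_t)p * n + j];
            C[(size_t)i * n + j] = acc;
        }
}
```

The VTCM-caching idea mentioned above would keep decoded FP16 tiles resident on-chip so the same weights are not re-fetched for every output column.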

Test

I used the f16 vision weights and q40 language weights from unsloth.

1. build hexagon in docker

cmake --preset arm64-android-snapdragon-release -B build-snapdragon
cmake --build build-snapdragon
cmake --install build-snapdragon --prefix pkg-adb/llama.cpp

2. push the weights to the phone (tested with a Samsung S25 Ultra)

adb push mmproj-F16.gguf /data/local/tmp/gguf
adb push gemma-3-4b-it-Q4_0.gguf /data/local/tmp/gguf
adb push hydro_1.png /data/local/tmp/gguf   # image for testing

3. run the run-mtmd script

E=1 NDEV=1 D=HTP0 MTMD_DEVICE=HTP0 PROF=1 V=1 M=gemma-3-4b-it-Q4_0.gguf MMPROJ=mmproj-F16.gguf IMG=hydro_1.png ./scripts/snapdragon/adb/run-mtmd.sh -p '"What is in this image."'

@joeldushouyu joeldushouyu changed the title Mtmd hexagon ggml-hexagon: mm for mtmd Dec 9, 2025
@joeldushouyu joeldushouyu marked this pull request as ready for review December 9, 2025 22:27
@github-actions github-actions bot added script Script related ggml changes relating to the ggml tensor library for machine learning labels Dec 10, 2025
@joeldushouyu

As I mentioned earlier, I think there is still a lot of room to optimize the FP16xFP32 kernel by taking advantage of features like VTCM and DMA. That said, I am trying to figure out whether there is any publicly available documentation on how to use the HMX instructions, the built-in matrix-multiplication hardware on the CDSP.

I noticed in the Hexagon SDK docs that the qhl_hmx library was removed starting with SDK 6.0. Is there a specific reason for its removal, and is there any plan to introduce a replacement or an updated HMX library? My impression is that VTCM can help reduce data redundancy, but the HMX systolic core should still offer better compute throughput than implementing matrix multiplies with HVX vector dot products.

@joeldushouyu

joeldushouyu commented Dec 10, 2025

Note: commit c73a2c0 is a patch to pass the ggml test cases, mainly because the src0 data memory is non-contiguous in some of the test cases. Verified by running:

HB=0 ./scripts/snapdragon/adb/run-tool.sh test-backend-ops -b HTP0 -o MUL_MAT
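The non-contiguous case comes down to ggml tensors carrying per-dimension byte strides, so a kernel that assumes dense rows cannot index src0 directly. A small sketch of detecting that condition and compacting the data first (the struct and helper names here are illustrative, not ggml's actual API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Simplified mirror of ggml's 2-D layout: ne[] holds element counts
// (ne[0] = elements per row), nb[] holds byte strides (nb[0] = stride
// between elements, nb[1] = stride between rows).
typedef struct {
    int64_t     ne[2];
    size_t      nb[2];
    const char *data;
} tensor2d;

// A tensor is contiguous when elements are densely packed and rows
// follow each other with no padding or striding.
static bool is_contiguous(const tensor2d *t, size_t elem_size) {
    return t->nb[0] == elem_size &&
           t->nb[1] == elem_size * (size_t)t->ne[0];
}

// Gather a strided tensor into a dense buffer before handing it to a
// kernel that assumes contiguous rows.
static void make_contiguous(const tensor2d *t, size_t elem_size, char *dst) {
    for (int64_t r = 0; r < t->ne[1]; ++r)
        for (int64_t c = 0; c < t->ne[0]; ++c)
            memcpy(dst + ((size_t)(r * t->ne[0] + c)) * elem_size,
                   t->data + (size_t)r * t->nb[1] + (size_t)c * t->nb[0],
                   elem_size);
}
```

A view tensor (e.g. the first two columns of a wider matrix) fails the `is_contiguous` check because its row stride exceeds `elem_size * ne[0]`, which is the kind of src0 the patched test cases exercise.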
