
@patryk-kaiser-ARM
Contributor

@patryk-kaiser-ARM commented Aug 15, 2025

Key changes
This PR integrates KleidiAI SME1 FP32 kernels into the existing kleidiai_sgemm.cpp implementation.

Adds an SME2 flag in onnxruntime/core/common/cpuid_info.h and onnxruntime/core/common/cpuid_info.cc.
The previously integrated SME2 kernels were gated on the SME(1) check; with this change, the code correctly distinguishes when the SME1 kernels and when the SME2 kernels should be used.
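To illustrate why a separate SME2 flag matters, the selection rule can be sketched as a small dispatch function. This is a minimal sketch under stated assumptions: CpuFeatures, SgemmPath and SelectSgemmPath are hypothetical names, not the actual CPUIDInfo or MLAS API, and the real dispatch in kleidiai_sgemm.cpp may be structured differently.

```cpp
// Hypothetical sketch: these names are illustrative and are NOT the
// actual onnxruntime CPUIDInfo / MLAS API.

// CPU features as reported at runtime (hypothetical structure).
struct CpuFeatures {
  bool has_sme = false;   // FEAT_SME present
  bool has_sme2 = false;  // FEAT_SME2 present (SME2 builds on SME)
};

// Which SGEMM implementation to run (hypothetical enum).
enum class SgemmPath { Default, KleidiAiSme1, KleidiAiSme2 };

// Pick the SME2 kernels only when SME2 is reported; an SME-only CPU gets
// the SME1 kernels instead of SME2 kernels it cannot execute.
constexpr SgemmPath SelectSgemmPath(const CpuFeatures& f) {
  if (f.has_sme2) return SgemmPath::KleidiAiSme2;
  if (f.has_sme) return SgemmPath::KleidiAiSme1;
  return SgemmPath::Default;
}

// Compile-time checks of the dispatch rule.
static_assert(SelectSgemmPath(CpuFeatures{true, true}) == SgemmPath::KleidiAiSme2,
              "SME2 CPU uses SME2 kernels");
static_assert(SelectSgemmPath(CpuFeatures{true, false}) == SgemmPath::KleidiAiSme1,
              "SME-only CPU uses SME1 kernels");
static_assert(SelectSgemmPath(CpuFeatures{}) == SgemmPath::Default,
              "no SME falls back to the default path");
```

The order of the checks matters in this sketch: a CPU with SME2 also reports SME, so testing has_sme first would route it to the SME1 kernels.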

Bumps the KleidiAI version to 1.10.0.

Indicative performance data
Single-threaded runs on a Mac mini (M4) across various models, using: onnxruntime_perf_test -v -e cpu -I -m times -x 1 -y 1 -r 1

Next steps
Follow-up commits will address the outstanding to-do items from the previous PR:
KleidiAI SGEMM/IGEMM/Quantized MatMul - Modular MLAS API Changes for KleidiAI #25187

@patryk-kaiser-ARM
Contributor Author

@microsoft-github-policy-service agree company="Arm"

@patryk-kaiser-ARM marked this pull request as draft on August 21, 2025 10:06
@patryk-kaiser-ARM marked this pull request as ready for review on August 25, 2025 12:10
@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 5 pipeline(s).

@patryk-kaiser-ARM force-pushed the SME1_sgemm_integration branch 2 times, most recently from 443a898 to d165482 on September 1, 2025 12:03
@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 5 pipeline(s).

@patryk-kaiser-ARM force-pushed the SME1_sgemm_integration branch 2 times, most recently from 40526c4 to 9800e11 on September 9, 2025 12:50
@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 5 pipeline(s).

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 5 pipeline(s).

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 5 pipeline(s).

@hariharans29
Member

/azp run Windows ARM64 QNN CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@edgchen1 merged commit ec3bf7f into microsoft:main on Sep 12, 2025
155 of 163 checks passed
@hariharans29 added a commit that referenced this pull request on Sep 16, 2025