Add support for QMoE in CPU #25558
Merged
Conversation
Force-pushed from fb8c2ae to 6a36b9a
tianleiwu reviewed on Jul 28, 2025
Force-pushed from b0fe68a to 731021d
tianleiwu reviewed on Jul 30, 2025
tianleiwu reviewed on Jul 30, 2025
Review comment on onnxruntime/contrib_ops/cpu/quantization/moe_quantization_cpu.cc (outdated, resolved)
tianleiwu reviewed on Jul 30, 2025
14 further review comments on onnxruntime/contrib_ops/cpu/quantization/moe_quantization_cpu.cc (outdated, resolved)
This reverts commit 1675eae.
Force-pushed from d3762dd to bd721db
tianleiwu previously approved these changes on Aug 2, 2025
kunal-vaishnavi previously approved these changes on Aug 2, 2025
271cba4
kunal-vaishnavi approved these changes on Aug 2, 2025
sophies927 pushed a commit that referenced this pull request on Aug 2, 2025
This pull request introduces significant updates to ONNX Runtime's handling of quantized Mixture-of-Experts (MoE) operations. The changes include adjustments to tensor type constraints, new kernel definitions, and the implementation of a new `QMoE` operator for CPU execution. These updates enhance support for quantized MoE operations and improve validation of input tensors and scales.

### Documentation Updates
* Updated tensor type constraints for `fc1_scales`, `fc2_scales`, and `fc3_scales` in `docs/ContribOperators.md` to use `T2` instead of `T`.
* Added descriptions for the new `QMoE` operator in `docs/OperatorKernels.md`. [[1]](diffhunk://#diff-a44f0272e7668a044f15119b6efb44d562b873a7bee23c6b753b2c47d7697135R565) [[2]](diffhunk://#diff-a44f0272e7668a044f15119b6efb44d562b873a7bee23c6b753b2c47d7697135L960-R961)

### Operator Enhancements
* Introduced a new `QMoE` operator for quantized Mixture-of-Experts in CPU kernels (`onnxruntime/contrib_ops/cpu/cpu_contrib_kernels.cc`). [[1]](diffhunk://#diff-fd949b2a9885f634c37c2048da9e35d227ed20adf1d7baf5de488f304a78bde9R109) [[2]](diffhunk://#diff-fd949b2a9885f634c37c2048da9e35d227ed20adf1d7baf5de488f304a78bde9R275)
* Registered the `QMoE` operator in the kernel registry.

### Codebase Additions
* Added a `MoEBaseCPU` class in `onnxruntime/contrib_ops/cpu/moe/moe_base_cpu.h` to provide shared functionality for MoE operations, including input validation and scale checking.
* Implemented the `QMoE` operator in `onnxruntime/contrib_ops/cpu/quantization/moe_quantization_cpu.h` with support for quantized tensor types and activation types.

### CUDA and Graph Updates
* Updated type constraints for `T2` in the CUDA implementation of `QMoE`.
* Adjusted schema definitions for `fc1_scales` and `fc2_scales` to use `T2` in `onnxruntime/core/graph/contrib_ops/contrib_defs.cc`. [[1]](diffhunk://#diff-81f57d9adc2cce94f85a2949a895b7ff82efcc13d05e23ee6567661f0fecb7c0L1443-R1443) [[2]](diffhunk://#diff-81f57d9adc2cce94f85a2949a895b7ff82efcc13d05e23ee6567661f0fecb7c0L1452-R1452)

These changes collectively improve the framework's ability to handle quantized MoE operations efficiently while ensuring robust validation of input tensors and scales.

Co-authored-by: Kunal Vaishnavi <[email protected]>
Co-authored-by: Tianlei Wu <[email protected]>
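To make the description concrete, here is a minimal NumPy sketch of what a QMoE-style operator computes: router logits select the top-k experts per token, and each expert's quantized FC1/FC2 weights are dequantized with per-row scales before the feed-forward pass. Everything here (function names, tensor layouts, the ReLU activation, symmetric per-output-channel quantization) is an assumption for illustration, not the actual ONNX Runtime kernel.

```python
import numpy as np

def dequantize(q, scales):
    """Symmetric per-output-channel dequantization: w = q * scale."""
    return q.astype(np.float32) * scales[:, None]

def qmoe_forward(x, router_logits, fc1_q, fc1_scales, fc2_q, fc2_scales, top_k=2):
    """x: [tokens, hidden]; fc1_q: [experts, inter, hidden]; fc2_q: [experts, hidden, inter]."""
    out = np.zeros_like(x)
    # Softmax over router logits, then route each token to its top-k experts.
    probs = np.exp(router_logits - router_logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    topk_idx = np.argsort(-probs, axis=1)[:, :top_k]
    for t in range(x.shape[0]):
        sel = topk_idx[t]
        weights = probs[t, sel] / probs[t, sel].sum()  # renormalize over selected experts
        for e, w in zip(sel, weights):
            w1 = dequantize(fc1_q[e], fc1_scales[e])   # [inter, hidden]
            w2 = dequantize(fc2_q[e], fc2_scales[e])   # [hidden, inter]
            h = np.maximum(x[t] @ w1.T, 0.0)           # FC1 + activation
            out[t] += w * (h @ w2.T)                   # FC2, weighted by router prob
    return out
```

The sketch also shows why the schema change matters: the scales stay in a float type (`T2`) while the weights are in a quantized integer type, so they cannot share one type constraint `T`.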
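The "input validation and scale checking" that the description attributes to `MoEBaseCPU` can be sketched in the same spirit: before running the kernel, each scale tensor's shape must agree with its quantized weight tensor. The function name and the expected shapes below are hypothetical stand-ins, not the actual C++ helper.

```python
import numpy as np

def check_scales(weights_q, scales, name):
    """weights_q: [num_experts, out_dim, in_dim]; scales: [num_experts, out_dim]."""
    if scales.ndim != 2:
        raise ValueError(f"{name}: expected 2-D scales [num_experts, out_dim], got {scales.ndim}-D")
    if scales.shape != weights_q.shape[:2]:
        raise ValueError(f"{name}: scales shape {scales.shape} does not match "
                         f"weight shape {weights_q.shape[:2]}")
    if not np.all(np.isfinite(scales)):
        raise ValueError(f"{name}: scales contain non-finite values")
```

Failing fast here, with the tensor name in the message, is what makes shape mismatches between `fc1_scales`/`fc2_scales` and the expert weights debuggable at session setup rather than surfacing as garbage output.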
sanketkaleoss pushed a commit to sanketkaleoss/onnxruntime that referenced this pull request on Aug 11, 2025
gedoensmax pushed a commit to gedoensmax/onnxruntime that referenced this pull request on Sep 2, 2025
tianleiwu added a commit that referenced this pull request on Sep 4, 2025
jywu-msft pushed a commit that referenced this pull request on Sep 5, 2025
### Description
Cherry-pick the following PRs:
#25943 #25937 #25917 #25909 #25898 #25897 #25888 #25881 #25830 #25619 #25575 #25572 #25558 #25530 #25474 #25455 #25110

Also two dependent PRs for qMoE cpu: #25877 #25822

Co-authored-by: xiaomsft <[email protected]>
Co-authored-by: Xiaoyan Hu <[email protected]>
Co-authored-by: Akshay Sonawane <[email protected]>
Co-authored-by: Kunal Vaishnavi <[email protected]>
Co-authored-by: Pradeep Sakhamoori <[email protected]>
Co-authored-by: mingyue <[email protected]>
Co-authored-by: Maximilian Müller <[email protected]>
Co-authored-by: Adrian Lizarraga <[email protected]>
Co-authored-by: Dmitri Smirnov <[email protected]>
Co-authored-by: Emmanuel <[email protected]>
Co-authored-by: Emmanuel Assumang <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: praneshgo <[email protected]>
Co-authored-by: Hariharan Seshadri <[email protected]>
Co-authored-by: Jing Fang <[email protected]>
Co-authored-by: Ishwar Raut <[email protected]>