Add support for QMoE in CPU #25558
Merged
Conversation
Force-pushed from fb8c2ae to 6a36b9a
tianleiwu reviewed on Jul 28, 2025
Force-pushed from b0fe68a to 731021d
tianleiwu reviewed on Jul 30, 2025
tianleiwu reviewed on Jul 30, 2025
Review comment on onnxruntime/contrib_ops/cpu/quantization/moe_quantization_cpu.cc (outdated, resolved)
tianleiwu reviewed on Jul 30, 2025
14 further review comments on onnxruntime/contrib_ops/cpu/quantization/moe_quantization_cpu.cc (outdated, resolved)
This reverts commit 1675eae.
Force-pushed from d3762dd to bd721db
tianleiwu previously approved these changes on Aug 2, 2025
kunal-vaishnavi previously approved these changes on Aug 2, 2025
271cba4
kunal-vaishnavi approved these changes on Aug 2, 2025
sophies927 pushed a commit that referenced this pull request on Aug 2, 2025
This pull request introduces significant updates to ONNX Runtime's handling of quantized Mixture-of-Experts (MoE) operations. The changes include adjustments to tensor type constraints, new kernel definitions, and the implementation of a new `QMoE` operator for CPU execution. These updates enhance support for quantized MoE operations and improve validation of input tensors and scales.

### Documentation Updates
* Updated tensor type constraints for `fc1_scales`, `fc2_scales`, and `fc3_scales` in `docs/ContribOperators.md` to use `T2` instead of `T`.
* Added descriptions for the new `QMoE` operator in `docs/OperatorKernels.md`. [[1]](diffhunk://#diff-a44f0272e7668a044f15119b6efb44d562b873a7bee23c6b753b2c47d7697135R565) [[2]](diffhunk://#diff-a44f0272e7668a044f15119b6efb44d562b873a7bee23c6b753b2c47d7697135L960-R961)

### Operator Enhancements
* Introduced a new `QMoE` operator for quantized Mixture-of-Experts in CPU kernels (`onnxruntime/contrib_ops/cpu/cpu_contrib_kernels.cc`). [[1]](diffhunk://#diff-fd949b2a9885f634c37c2048da9e35d227ed20adf1d7baf5de488f304a78bde9R109) [[2]](diffhunk://#diff-fd949b2a9885f634c37c2048da9e35d227ed20adf1d7baf5de488f304a78bde9R275)
* Registered the `QMoE` operator in the kernel registry.

### Codebase Additions
* Added a `MoEBaseCPU` class in `onnxruntime/contrib_ops/cpu/moe/moe_base_cpu.h` to provide shared functionality for MoE operations, including input validation and scale checking.
* Implemented the `QMoE` operator in `onnxruntime/contrib_ops/cpu/quantization/moe_quantization_cpu.h` with support for quantized tensor types and activation types.

### CUDA and Graph Updates
* Updated type constraints for `T2` in the CUDA implementation of `QMoE`.
* Adjusted schema definitions for `fc1_scales` and `fc2_scales` to use `T2` in `onnxruntime/core/graph/contrib_ops/contrib_defs.cc`. [[1]](diffhunk://#diff-81f57d9adc2cce94f85a2949a895b7ff82efcc13d05e23ee6567661f0fecb7c0L1443-R1443) [[2]](diffhunk://#diff-81f57d9adc2cce94f85a2949a895b7ff82efcc13d05e23ee6567661f0fecb7c0L1452-R1452)

These changes collectively improve the framework's ability to handle quantized MoE operations efficiently while ensuring robust validation of input tensors and scales.

Co-authored-by: Kunal Vaishnavi <[email protected]>
Co-authored-by: Tianlei Wu <[email protected]>
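To make the description concrete, here is a minimal NumPy sketch of what a QMoE-style operator computes: router logits select the top-k experts per token, and each expert's quantized FC1/FC2 weights are dequantized with per-row scales before the feed-forward pass. Everything here (function names, tensor layouts, the ReLU activation, symmetric per-output-channel quantization) is an assumption for illustration, not the actual ONNX Runtime kernel.

```python
import numpy as np

def dequantize(q, scales):
    """Symmetric per-output-channel dequantization: w = q * scale."""
    return q.astype(np.float32) * scales[:, None]

def qmoe_forward(x, router_logits, fc1_q, fc1_scales, fc2_q, fc2_scales, top_k=2):
    """x: [tokens, hidden]; fc1_q: [experts, inter, hidden]; fc2_q: [experts, hidden, inter]."""
    out = np.zeros_like(x)
    # Softmax over router logits, then route each token to its top-k experts.
    probs = np.exp(router_logits - router_logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    topk_idx = np.argsort(-probs, axis=1)[:, :top_k]
    for t in range(x.shape[0]):
        sel = topk_idx[t]
        weights = probs[t, sel] / probs[t, sel].sum()  # renormalize over selected experts
        for e, w in zip(sel, weights):
            w1 = dequantize(fc1_q[e], fc1_scales[e])   # [inter, hidden]
            w2 = dequantize(fc2_q[e], fc2_scales[e])   # [hidden, inter]
            h = np.maximum(x[t] @ w1.T, 0.0)           # FC1 + activation
            out[t] += w * (h @ w2.T)                   # FC2, weighted by router prob
    return out
```

The sketch also shows why the schema change matters: the scales stay in a float type (`T2`) while the weights are in a quantized integer type, so they cannot share one type constraint `T`.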
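The "input validation and scale checking" that the description attributes to `MoEBaseCPU` can be sketched in the same spirit: before running the kernel, each scale tensor's shape must agree with its quantized weight tensor. The function name and the expected shapes below are hypothetical stand-ins, not the actual C++ helper.

```python
import numpy as np

def check_scales(weights_q, scales, name):
    """weights_q: [num_experts, out_dim, in_dim]; scales: [num_experts, out_dim]."""
    if scales.ndim != 2:
        raise ValueError(f"{name}: expected 2-D scales [num_experts, out_dim], got {scales.ndim}-D")
    if scales.shape != weights_q.shape[:2]:
        raise ValueError(f"{name}: scales shape {scales.shape} does not match "
                         f"weight shape {weights_q.shape[:2]}")
    if not np.all(np.isfinite(scales)):
        raise ValueError(f"{name}: scales contain non-finite values")
```

Failing fast here, with the tensor name in the message, is what makes shape mismatches between `fc1_scales`/`fc2_scales` and the expert weights debuggable at session setup rather than surfacing as garbage output.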
sanketkaleoss pushed a commit to sanketkaleoss/onnxruntime that referenced this pull request on Aug 11, 2025
gedoensmax pushed a commit to gedoensmax/onnxruntime that referenced this pull request on Sep 2, 2025
tianleiwu added a commit that referenced this pull request on Sep 4, 2025
jywu-msft pushed a commit that referenced this pull request on Sep 5, 2025
### Description
Cherry-pick the following PRs:
#25943 #25937 #25917 #25909 #25898 #25897 #25888 #25881 #25830 #25619 #25575 #25572 #25558 #25530 #25474 #25455 #25110

Also two dependent PRs for qMoE cpu: #25877 #25822

Co-authored-by: xiaomsft <[email protected]>
Co-authored-by: Xiaoyan Hu <[email protected]>
Co-authored-by: Akshay Sonawane <[email protected]>
Co-authored-by: Kunal Vaishnavi <[email protected]>
Co-authored-by: Pradeep Sakhamoori <[email protected]>
Co-authored-by: mingyue <[email protected]>
Co-authored-by: Maximilian Müller <[email protected]>
Co-authored-by: Adrian Lizarraga <[email protected]>
Co-authored-by: Dmitri Smirnov <[email protected]>
Co-authored-by: Emmanuel <[email protected]>
Co-authored-by: Emmanuel Assumang <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: praneshgo <[email protected]>
Co-authored-by: Hariharan Seshadri <[email protected]>
Co-authored-by: Jing Fang <[email protected]>
Co-authored-by: Ishwar Raut <[email protected]>