@apsonawane apsonawane commented Jul 28, 2025

This pull request introduces significant updates to the ONNX Runtime's handling of quantized Mixture-of-Experts (MoE) operations. The changes include adjustments to tensor type constraints, the addition of new kernel definitions, and the implementation of a new QMoE operator for CPU execution. These updates aim to enhance support for quantized MoE operations and improve validation mechanisms for input tensors and scales.

Documentation Updates:

  • Updated tensor type constraints for fc1_scales, fc2_scales, and fc3_scales in docs/ContribOperators.md to use T2 instead of T.
  • Added descriptions for the new QMoE operator in docs/OperatorKernels.md.

Operator Enhancements:

  • Introduced a new QMoE operator for quantized Mixture-of-Experts in CPU kernels (onnxruntime/contrib_ops/cpu/cpu_contrib_kernels.cc).
  • Registered the QMoE operator in the kernel registry.
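As background for what an MoE operator does before it ever touches the quantized weights, here is a minimal sketch of top-k expert routing in plain Python. The function name and layout are illustrative assumptions, not the actual routing code in this PR:

```python
import math

def route_tokens(logits, k):
    """Pick the top-k experts for one token from its gating logits and
    renormalize their softmax weights to sum to 1 (a common MoE routing
    scheme; illustrative only, not ORT's implementation)."""
    # Numerically stable softmax over the expert logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Indices of the k largest probabilities.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# A token whose gate favors expert 1, then expert 2.
routed = route_tokens([1.0, 3.0, 2.0, 0.5], 2)
```

Each selected expert then runs its FFN on the token, and the outputs are blended with the renormalized weights.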

Codebase Additions:

  • Added MoEBaseCPU class in onnxruntime/contrib_ops/cpu/moe/moe_base_cpu.h to provide shared functionality for MoE operations, including input validation and scale checking.
  • Implemented the QMoE operator in onnxruntime/contrib_ops/cpu/quantization/moe_quantization_cpu.h with support for quantized tensor types and activation types.
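To make the "input validation and scale checking" concrete, here is a hedged sketch of the kind of shape check a base class like MoEBaseCPU might perform. The expected scale shapes below are an assumption for illustration and do not necessarily match the operator's actual contract:

```python
def check_moe_scales(num_experts, hidden_size, inter_size,
                     fc1_scales_shape, fc2_scales_shape):
    """Reject scale tensors whose shapes do not match the expert layout.
    ASSUMPTION: one scale per expert per output column; the real
    MoEBaseCPU contract may differ."""
    expected_fc1 = (num_experts, inter_size)   # fc1: hidden -> intermediate
    expected_fc2 = (num_experts, hidden_size)  # fc2: intermediate -> hidden
    if tuple(fc1_scales_shape) != expected_fc1:
        raise ValueError(
            f"fc1_scales: expected {expected_fc1}, got {tuple(fc1_scales_shape)}")
    if tuple(fc2_scales_shape) != expected_fc2:
        raise ValueError(
            f"fc2_scales: expected {expected_fc2}, got {tuple(fc2_scales_shape)}")
    return True
```

Failing fast on mismatched scale shapes turns a silent numerical corruption into an immediate, diagnosable error at session setup.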

CUDA and Graph Updates:

  • Updated type constraints for T2 in CUDA implementation of QMoE.
  • Adjusted schema definitions for fc1_scales and fc2_scales to use T2 in onnxruntime/core/graph/contrib_ops/contrib_defs.cc.

These changes collectively improve the framework's ability to handle quantized MoE operations efficiently while ensuring robust validation for input tensors and scales.
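The arithmetic at the heart of a quantized MoE expert is a matmul against integer weights that are rescaled back to float. As a reference sketch only (real kernels fuse the dequantization into the GEMM and use block- or tensor-level scales; the per-column layout here is an assumption):

```python
def qmatmul(x, qweight, scales):
    """y = x @ (qweight * scales): float activations times a symmetric
    int8 weight matrix with one scale per output column.
    ASSUMPTION: per-output-column scales, chosen for illustration."""
    rows, cols = len(qweight), len(qweight[0])
    out = []
    for xi in x:                      # each input row (token)
        row = []
        for j in range(cols):         # each output column
            acc = 0.0
            for i in range(rows):
                # Dequantize on the fly: int8 value times its column scale.
                acc += xi[i] * qweight[i][j] * scales[j]
            row.append(acc)
        out.append(row)
    return out

y = qmatmul([[1.0, 2.0]], [[2, -1], [4, 3]], [0.5, 1.0])  # -> [[5.0, 5.0]]
```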

@apsonawane apsonawane requested a review from tianleiwu July 28, 2025 04:43
@apsonawane apsonawane changed the title from Asonawane/qmoe to Add support for QMoE in CPU Jul 28, 2025
@apsonawane apsonawane force-pushed the asonawane/qmoe branch 3 times, most recently from b0fe68a to 731021d on July 29, 2025 04:17
tianleiwu
tianleiwu previously approved these changes Aug 2, 2025
kunal-vaishnavi
kunal-vaishnavi previously approved these changes Aug 2, 2025
@tianleiwu tianleiwu dismissed stale reviews from kunal-vaishnavi and themself via 271cba4 August 2, 2025 03:07
@tianleiwu tianleiwu merged commit 4004a15 into main Aug 2, 2025
99 of 106 checks passed
@tianleiwu tianleiwu deleted the asonawane/qmoe branch August 2, 2025 06:20
sophies927 pushed a commit that referenced this pull request Aug 2, 2025
sanketkaleoss pushed a commit to sanketkaleoss/onnxruntime that referenced this pull request Aug 11, 2025
gedoensmax pushed a commit to gedoensmax/onnxruntime that referenced this pull request Sep 2, 2025
tianleiwu added a commit that referenced this pull request Sep 4, 2025
@tianleiwu tianleiwu added the cherry-picked (Cherry-picked for a cherrypicks branch) label and removed the release:1.23.0 label Sep 4, 2025
jywu-msft pushed a commit that referenced this pull request Sep 5, 2025
### Description
Cherry-pick the following PRs:
#25943
#25937 
#25917
#25909
#25898
#25897
#25888
#25881
#25830
#25619
#25575
#25572
#25558
#25530
#25474
#25455
#25110

Also includes two dependent PRs for QMoE CPU:
#25877
#25822

---------

Co-authored-by: xiaomsft <[email protected]>
Co-authored-by: Xiaoyan Hu <[email protected]>
Co-authored-by: Akshay Sonawane <[email protected]>
Co-authored-by: Kunal Vaishnavi <[email protected]>
Co-authored-by: Pradeep Sakhamoori <[email protected]>
Co-authored-by: mingyue <[email protected]>
Co-authored-by: Maximilian Müller <[email protected]>
Co-authored-by: Adrian Lizarraga <[email protected]>
Co-authored-by: Dmitri Smirnov <[email protected]>
Co-authored-by: Emmanuel <[email protected]>
Co-authored-by: Emmanuel Assumang <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: praneshgo <[email protected]>
Co-authored-by: Hariharan Seshadri <[email protected]>
Co-authored-by: Jing Fang <[email protected]>
Co-authored-by: Ishwar Raut <[email protected]>