Update MoE and qMoE spec #25619

tianleiwu · 2025-08-01T00:19:40Z

Weight Shape Update

Make sure the documented weight shape reflects the actual memory layout. The weights are stored in column-major order.
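
As a minimal sketch of why column-major storage changes the shape that should be documented, assume a logical K x N weight matrix. The helper names and the tiny 2 x 3 matrix below are hypothetical and do not come from the operator spec.

```python
def to_column_major(matrix):
    """Flatten a logical (rows x cols) matrix column by column."""
    rows, cols = len(matrix), len(matrix[0])
    return [matrix[r][c] for c in range(cols) for r in range(rows)]


def column_major_offset(row, col, num_rows):
    """Offset of logical element (row, col) in a column-major buffer."""
    return col * num_rows + row


# Hypothetical logical weight with K=2 rows and N=3 columns.
W = [[1, 2, 3],
     [4, 5, 6]]
flat = to_column_major(W)

# The same buffer read in row-major order is the transposed (N x K) matrix,
# so a shape that mirrors memory lists N before K.
W_t_row_major = [flat[n * 2:(n + 1) * 2] for n in range(3)]
```

Under this assumption, a spec shape that matches memory is the transposed shape of the logical matrix.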

Add support for SwiGLU activation attributes

Add a spec for the new activation type SwiGLU (Swish-Gated Linear Unit) by introducing a few new attributes. For reference, see the Triton kernel implementation (https://github.com/triton-lang/triton/blob/main/python/triton_kernels/triton_kernels/swiglu.py).

New Attributes for SwiGLU

swiglu_fusion:
- 0: Not fused — two separate GEMMs (FC1 and FC3).
- 1: Fused GEMMs using interleaved format (g and l are interleaved per row).
- 2: Fused GEMMs using non-interleaved (concatenated) format.
swiglu_limit: Clamp threshold applied to g and l.
activation_alpha: Scalar multiplier applied to g before sigmoid.
activation_beta: Added to l before the final output computation.

SwiGLU Activation Function

The SwiGLU function is defined as:

g = xW + b
l = xV + c
G = min(g, limit)
L = max(min(l, limit), -limit)
swiglu = G * sigmoid(alpha * G) * (L + beta)

x: Input
W, V: Weight matrices
b, c: Bias vectors
alpha, beta, limit: Float constants
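
The formula above can be sketched for a single (g, l) pair as follows. This is an illustrative scalar reference only, not the CUDA kernel or the test reference implementation. The function names are hypothetical, and the values used for alpha, beta and limit in the usage note are arbitrary rather than the operator's defaults.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swiglu(g: float, l: float, alpha: float, beta: float, limit: float) -> float:
    """Apply the clamped SwiGLU formula to one (gate, linear) pair."""
    G = min(g, limit)                # gate: clamped from above only
    L = max(min(l, limit), -limit)   # linear: clamped to [-limit, limit]
    return G * sigmoid(alpha * G) * (L + beta)
```

For example, with the arbitrary values alpha=1.0, beta=1.0, limit=7.0, `swiglu(10.0, 0.0, 1.0, 1.0, 7.0)` gives the same result as `swiglu(7.0, 0.0, 1.0, 1.0, 7.0)` because the gate is clamped at the limit. A large negative gate is left unclamped, as in the formula.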

Fusion Behavior

When swiglu_fusion = 0:
- Two GEMMs are computed independently.
- FC1 → computes g, FC3 → computes l.
When swiglu_fusion = 1:
- g and l are computed in a single fused GEMM (FC1).
- Output is interleaved per row as: gate, linear, gate, linear, ....
When swiglu_fusion = 2:
- g and l are computed in a single GEMM (FC1).
- Output is concatenated per row: [g | l].
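
The two fused layouts can be illustrated by splitting one FC1 output row back into its gate and linear parts. This is a sketch under the layout description above; `split_fc1_row` is a hypothetical helper, not a function in ONNX Runtime.

```python
from typing import List, Tuple


def split_fc1_row(row: List[float], swiglu_fusion: int) -> Tuple[List[float], List[float]]:
    """Split one fused FC1 output row into (gate, linear) lists."""
    if len(row) % 2 != 0:
        raise ValueError("a fused row must hold an even number of values")
    if swiglu_fusion == 1:
        # Interleaved: gate, linear, gate, linear, ...
        return row[0::2], row[1::2]
    if swiglu_fusion == 2:
        # Concatenated: [g | l]
        half = len(row) // 2
        return row[:half], row[half:]
    # With swiglu_fusion = 0 there is no fused row: FC1 and FC3 are separate.
    raise ValueError("swiglu_fusion must be 1 or 2 for a fused FC1 output")


gate_a, linear_a = split_fc1_row([1.0, 10.0, 2.0, 20.0], swiglu_fusion=1)
gate_b, linear_b = split_fc1_row([1.0, 2.0, 10.0, 20.0], swiglu_fusion=2)
```

Both calls recover the same gate values `[1.0, 2.0]` and linear values `[10.0, 20.0]`; only the position of the values within the row differs.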

Implement swiglu_limit for CUDA

Update the CUDA kernel to use the default swiglu limit.
Update test_moe_cuda.py so that its reference implementation uses the same logic.

Remaining Work

The main purpose of this PR is to update the spec rather than to implement it.
Note that the MoE/qMoE ops and tests still use hard-coded parameters; they will be changed later to read from these attributes.

Column-wise symmetric quantization is used for qMoE. We will add more quantization details when we add support for block-wise quantization.
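
As a generic illustration of column-wise symmetric quantization (one scale per column, zero point fixed at zero), the sketch below quantizes a small matrix. The bit width of 8, the function name and the sample weights are assumptions made for this example. The PR does not describe the bit width or packing that qMoE actually uses.

```python
def quantize_columns_symmetric(weight, bits=8):
    """Quantize each column with its own scale and no zero point."""
    qmax = 2 ** (bits - 1) - 1
    rows, cols = len(weight), len(weight[0])
    scales = []
    for c in range(cols):
        max_abs = max(abs(weight[r][c]) for r in range(rows))
        # Avoid a zero scale for an all-zero column.
        scales.append(max_abs / qmax if max_abs > 0 else 1.0)
    quantized = [
        [max(-qmax, min(qmax, round(weight[r][c] / scales[c]))) for c in range(cols)]
        for r in range(rows)
    ]
    return quantized, scales


def dequantize(quantized, scales):
    """Recover approximate weights: w is roughly q * scale of its column."""
    return [[q * scales[c] for c, q in enumerate(row)] for row in quantized]


# Hypothetical 2 x 2 weight; each column gets an independent scale.
w = [[4.0, -0.5],
     [-1.0, 0.125]]
q, s = quantize_columns_symmetric(w)
w_hat = dequantize(q, s)
```

Because the scale is chosen per column, a column with small values keeps its resolution even when another column has a much larger range.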

github-actions

You can commit the suggested changes from lintrunner.

onnxruntime/contrib_ops/cpu/quantization/moe_helper.h

onnxruntime/core/graph/contrib_ops/contrib_defs.cc

onnxruntime/test/python/transformers/test_moe_cuda.py

Description

Cherry-pick the following PRs: #25943 #25937 #25917 #25909 #25898 #25897 #25888 #25881 #25830 #25619 #25575 #25572 #25558 #25530 #25474 #25455 #25110

Also two dependent PRs for qMoE cpu: #25877 #25822

Co-authored-by: xiaomsft, Xiaoyan Hu, Akshay Sonawane, Kunal Vaishnavi, Pradeep Sakhamoori, mingyue, Maximilian Müller, Adrian Lizarraga, Dmitri Smirnov, Emmanuel, Emmanuel Assumang, github-actions[bot], praneshgo, Hariharan Seshadri, Jing Fang, Ishwar Raut

update moe spec

e7cb844

tianleiwu marked this pull request as draft August 1, 2025 00:21

github-actions bot reviewed Aug 1, 2025

onnxruntime/contrib_ops/cpu/quantization/moe_helper.h (outdated)

github-advanced-security bot found potential problems Aug 1, 2025

onnxruntime/contrib_ops/cpu/quantization/moe_helper.h (fixed)

update doc

bd36de4

github-actions bot reviewed Aug 1, 2025

onnxruntime/contrib_ops/cpu/quantization/moe_helper.h (outdated)

tianleiwu added 11 commits August 1, 2025 09:18

format

1d70f69

add swiglu limit

451814f

Merge branch 'main' into tlwu/moe_spec

fa0224c

CPU change from apsonawane

4a0d84f

use moe_helper in CPU

03a6146

remove MoEQuantType

6a5871e

Fix build

c4eb332

Add swiglu parameters

08c3114

Merge branch 'main' into tlwu/moe_spec

b7de4a7

update doc

edd065c

improve backward compatible

1b72088

tianleiwu marked this pull request as ready for review August 2, 2025 07:14

tianleiwu added 3 commits August 2, 2025 00:30

Revert "emsdk" change

b1562dd

refacotring

37abf5d

Disable cpu qmoe test

0635f11

kunal-vaishnavi reviewed Aug 2, 2025

onnxruntime/core/graph/contrib_ops/contrib_defs.cc (resolved)

kunal-vaishnavi reviewed Aug 2, 2025

onnxruntime/test/python/transformers/test_moe_cuda.py (resolved)

kunal-vaishnavi approved these changes Aug 2, 2025

tianleiwu merged commit 562760a into main Aug 2, 2025
92 checks passed

tianleiwu deleted the tlwu/moe_spec branch August 2, 2025 23:31

kunal-vaishnavi mentioned this pull request Aug 14, 2025

Add OpenAI's gpt-oss to ONNX Runtime GenAI microsoft/onnxruntime-genai#1678

Merged

jywu-msft added the release:1.23.0 label Aug 29, 2025

tianleiwu mentioned this pull request Sep 4, 2025

cherry picks for 1.23.0 release #25959

Merged

tianleiwu added the cherry-picked label (Cherry-picked for a cherrypicks branch) and removed the release:1.23.0 label Sep 4, 2025
