@tianleiwu commented Sep 4, 2025

Description

Cherry-pick the following PRs:
#25943
#25937
#25917
#25909
#25898
#25897
#25888
#25881
#25830
#25619
#25575
#25572
#25558
#25530
#25474
#25455
#25110

Also two dependent PRs for qMoE cpu:
#25877
#25822

tianleiwu and others added 20 commits September 4, 2025 15:08
### Description

This implements the SwiGLU activation for MoE and qMoE. The activation corresponds to
https://github.com/triton-lang/triton/blob/main/python/triton_kernels/triton_kernels/swiglu.py.

Also updates test_parity_moe.py to enable qMoE tests in CI pipelines.

### Motivation and Context

This is a naive implementation of the activation. Since the activation reduces each row length by half, we cannot directly use the GEMM epilogue; the current implementation needs an extra buffer to run the SwiGLU kernel.

In the future, we might look at alternatives that do not need the extra buffer.
Add support for bfloat16 in the MoE and qMoE CUDA ops.
### Description
This PR implements the GatherBlockQuantized operator for the CUDA EP with 4-bit and 8-bit data support.


### Motivation and Context
The GatherBlockQuantized operator is essential for MoE models' expert selection, especially when the model has been statically quantized.
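
A rough numpy sketch of the dequantize-then-gather idea for 8-bit data (the function name, shapes, and block layout here are illustrative assumptions, not the operator's exact spec):

```python
import numpy as np

def gather_block_quantized_ref(data_u8, scales, zero_points, indices, block_size=32):
    """Dequantize block-quantized 8-bit rows, then gather the requested rows.

    data_u8:     (rows, cols) uint8 quantized values
    scales:      (rows, cols // block_size) per-block scales
    zero_points: (rows, cols // block_size) per-block zero points
    indices:     1-D row indices to gather (e.g. selected experts)
    """
    rows, cols = data_u8.shape
    # Broadcast each block's scale / zero point across its block_size columns.
    s = np.repeat(scales, block_size, axis=1)[:, :cols]
    z = np.repeat(zero_points, block_size, axis=1)[:, :cols]
    dequant = (data_u8.astype(np.float32) - z.astype(np.float32)) * s
    return dequant[indices]  # gather along axis 0
```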

---------

Co-authored-by: Xiaoyan Hu <[email protected]>
This pull request introduces significant updates to the ONNX Runtime's
handling of quantized Mixture-of-Experts (MoE) operations. The changes
include adjustments to tensor type constraints, the addition of new
kernel definitions, and the implementation of a new `QMoE` operator for
CPU execution. These updates aim to enhance support for quantized MoE
operations and improve validation mechanisms for input tensors and
scales.

### Documentation Updates:
* Updated tensor type constraints for `fc1_scales`, `fc2_scales`, and
`fc3_scales` in `docs/ContribOperators.md` to use `T2` instead of `T`.
* Added descriptions for the new `QMoE` operator in
`docs/OperatorKernels.md`.

### Operator Enhancements:
* Introduced a new `QMoE` operator for quantized Mixture-of-Experts in
CPU kernels (`onnxruntime/contrib_ops/cpu/cpu_contrib_kernels.cc`).
* Registered the `QMoE` operator in the kernel registry.

### Codebase Additions:
* Added `MoEBaseCPU` class in
`onnxruntime/contrib_ops/cpu/moe/moe_base_cpu.h` to provide shared
functionality for MoE operations, including input validation and scale
checking.
* Implemented the `QMoE` operator in
`onnxruntime/contrib_ops/cpu/quantization/moe_quantization_cpu.h` with
support for quantized tensor types and activation types.

### CUDA and Graph Updates:
* Updated type constraints for `T2` in CUDA implementation of `QMoE`.
* Adjusted schema definitions for `fc1_scales` and `fc2_scales` to use
`T2` in `onnxruntime/core/graph/contrib_ops/contrib_defs.cc`.

These changes collectively improve the framework's ability to handle
quantized MoE operations efficiently while ensuring robust validation
for input tensors and scales.
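
For context, here is a rough, non-quantized numpy sketch of the top-k MoE forward pass that kernels like these implement (router softmax, per-token expert selection, expert FFN, weighted combine). It is a generic illustration, not the QMoE CPU code; the ReLU inside the expert and the shapes are assumptions.

```python
import numpy as np

def moe_forward_ref(x, router_w, w1, w2, top_k=2):
    """x: (tokens, hidden), router_w: (hidden, experts),
    w1: (experts, hidden, inter), w2: (experts, inter, hidden)."""
    logits = x @ router_w                                  # (tokens, experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)                  # softmax routing weights
    chosen = np.argsort(-probs, axis=-1)[:, :top_k]        # top-k experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in chosen[t]:
            h = np.maximum(x[t] @ w1[e], 0.0)              # expert FFN (ReLU as a placeholder)
            out[t] += probs[t, e] * (h @ w2[e])            # routing-weighted combine
    return out
```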

---------

Co-authored-by: Kunal Vaishnavi <[email protected]>
Co-authored-by: Tianlei Wu <[email protected]>
### Weight Shape Update
Make sure the shape reflects the actual memory layout. The weight is stored in column-major order.

### Add support for SwiGLU activation attributes
Add the spec for the new activation type SwiGLU (Swish-Gated Linear Unit) by introducing a few new attributes. For reference, see the [Triton kernel implementation](https://github.com/triton-lang/triton/blob/main/python/triton_kernels/triton_kernels/swiglu.py).


#### New Attributes for SwiGLU

* **`swiglu_fusion`**:
  * `0`: Not fused — two separate GEMMs (FC1 and FC3).
  * `1`: Fused GEMMs using **interleaved** format (`g` and `l` are interleaved per row).
  * `2`: Fused GEMMs using **non-interleaved** (concatenated) format.

* **`swiglu_limit`**: Clamp threshold applied to `g` and `l`.

* **`activation_alpha`**: Scalar multiplier applied to `g` before the sigmoid.

* **`activation_beta`**: Added to `l` before the final output computation.

---

### SwiGLU Activation Function

The SwiGLU function is defined as:

```
g = xW + b
l = xV + c
G = min(g, limit)
L = max(min(l, limit), -limit)
swiglu = G * sigmoid(alpha * G) * (L + beta)
```

* `x`: Input
* `W`, `V`: Weight matrices
* `b`, `c`: Bias vectors
* `alpha`, `beta`, `limit`: Float constants
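
A direct numpy transcription of the formula (a reference sketch only; the default `alpha`, `beta`, and `limit` values below are illustrative assumptions, not the operator defaults):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swiglu_ref(x, W, b, V, c, alpha=1.702, beta=1.0, limit=7.0):
    g = x @ W + b                  # gate branch
    l = x @ V + c                  # linear branch
    G = np.minimum(g, limit)       # clamp the gate from above only
    L = np.clip(l, -limit, limit)  # clamp the linear part on both sides
    return G * sigmoid(alpha * G) * (L + beta)
```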

---

### Fusion Behavior

* When `swiglu_fusion = 0`:
  * Two GEMMs are computed independently.
  * FC1 → computes `g`, FC3 → computes `l`.

* When `swiglu_fusion = 1`:
  * `g` and `l` are computed in a **single fused GEMM** (FC1).
  * Output is **interleaved** per row as: `gate, linear, gate, linear, ...`.

* When `swiglu_fusion = 2`:
  * `g` and `l` are computed in a single GEMM (FC1).
  * Output is **concatenated** per row: `[g | l]`.
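
A small sketch of how `g` and `l` could be recovered from the fused FC1 output for the two fused layouts (an illustration of the layout convention described above, not kernel code):

```python
import numpy as np

def split_fused_fc1(y, swiglu_fusion):
    """y: (rows, 2 * inter) fused FC1 output."""
    if swiglu_fusion == 1:    # interleaved: gate, linear, gate, linear, ...
        return y[:, 0::2], y[:, 1::2]
    if swiglu_fusion == 2:    # concatenated: [g | l]
        half = y.shape[1] // 2
        return y[:, :half], y[:, half:]
    # swiglu_fusion == 0: g and l come from two separate GEMMs (FC1 and FC3)
    raise ValueError("mode 0 does not produce a fused output")
```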

### Implement swiglu_limit for CUDA
Update the CUDA kernel to use the default swiglu limit.
Update test_moe_cuda.py to use the same logic in its reference implementation.

### Remaining Work
The main purpose of this PR is to update the spec rather than implement it.
Note that the MoE/qMoE ops and tests still use hard-coded parameters; they will be changed later to read from these attributes.

Column-wise symmetric quantization is used for qMoE. We will add more quantization details when we add support for block-wise quantization soon.
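
A minimal numpy sketch of column-wise symmetric 4-bit quantization as described, assuming values are stored as unsigned codes with a mid-range zero point of 8 (the packing of two codes per byte is omitted):

```python
import numpy as np

def quantize_columns_symmetric_int4(w):
    """w: (rows, cols) float weights -> uint8 codes in [0, 15] and one scale per column."""
    scales = np.max(np.abs(w), axis=0) / 7.0
    scales = np.where(scales == 0.0, 1.0, scales)  # avoid divide-by-zero for all-zero columns
    codes = np.clip(np.round(w / scales) + 8, 0, 15).astype(np.uint8)
    return codes, scales

def dequantize_columns_int4(codes, scales):
    return (codes.astype(np.float32) - 8.0) * scales
```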
This pull request introduces several improvements and refactorings to
the quantized Mixture-of-Experts (QMoE) operator in ONNX Runtime,
focusing on enhanced support for FP32 mode, improved SwiGLU activation
handling, and better test coverage. The most important changes are
grouped below by theme.

### Operator Registration and Type Support

- Added explicit registration and support for `QMoE` operator with both
`MLFloat16` and `float` data types, enabling FP32 (non-quantized) mode
in addition to quantized modes. This includes updates to kernel
registration and schema/type constraints.

### SwiGLU Activation Improvements

- Refactored `ApplySwiGLUActivation` to accept configurable
`activation_alpha` and `activation_beta` parameters, matching CUDA
behavior and allowing flexibility in activation function tuning. Also dropped support for non-interleaved memory layouts (these are no longer implemented).
- Now reads `activation_alpha` and `activation_beta` attributes from
operator parameters, defaulting to values appropriate for SwiGLU.

### QMoE Operator Implementation Refactor

- Refactored the QMoE operator to clarify separation between quantized
and FP32 implementations, and restructured internal methods for better
maintainability. Added template parameterization for data types and
improved handling of expert weights and biases.

### Shape Checking and Layout

- Removed legacy shape/layout support in QMoE input validation,
enforcing only the new memory layout for expert weights and improving
consistency and forward compatibility.

### Test and Documentation Updates

- Updated unit tests for QMoE to use correct zero-point values for
quantized weights (e.g., 0x88 for int4, 128 for int8), ensuring that
test cases accurately reflect expected zero-output behavior for zero
weights. Also clarified comments and expected outputs for SwiGLU and
quantized scenarios.
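
A short check of why those values encode zero weights, assuming two 4-bit codes are packed per byte and a mid-range zero point:

```python
packed = 0x88
low, high = packed & 0xF, packed >> 4  # two int4 codes per byte -> 8 and 8
scale = 0.02                           # arbitrary scale for the check
assert (low - 8) * scale == 0.0        # int4 code 8 dequantizes to exactly 0
assert (128 - 128) * scale == 0.0      # int8 code 128 likewise
```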

These changes collectively improve the flexibility, correctness, and
maintainability of the QMoE operator in ONNX Runtime.


Unit test result
```
sRunning test: batch_size=1, sequence_length=8, quant_bits=4, use_swiglu=True, swiglu_interleaved=True
Parity check - SwiGLU(interleaved=True) 4-bit: max_diff = 0.000372
.Running test: batch_size=1, sequence_length=8, quant_bits=8, use_swiglu=True, swiglu_interleaved=True
Parity check - SwiGLU(interleaved=True) 8-bit: max_diff = 0.000392
.Running test: batch_size=1, sequence_length=32, quant_bits=4, use_swiglu=True, swiglu_interleaved=True
Parity check - SwiGLU(interleaved=True) 4-bit: max_diff = 0.000470
.Running test: batch_size=1, sequence_length=32, quant_bits=8, use_swiglu=True, swiglu_interleaved=True
Parity check - SwiGLU(interleaved=True) 8-bit: max_diff = 0.000442
.Running test: batch_size=4, sequence_length=8, quant_bits=4, use_swiglu=True, swiglu_interleaved=True
Parity check - SwiGLU(interleaved=True) 4-bit: max_diff = 0.000470
.Running test: batch_size=4, sequence_length=8, quant_bits=8, use_swiglu=True, swiglu_interleaved=True
Parity check - SwiGLU(interleaved=True) 8-bit: max_diff = 0.000442
.Running test: batch_size=4, sequence_length=32, quant_bits=4, use_swiglu=True, swiglu_interleaved=True
Parity check - SwiGLU(interleaved=True) 4-bit: max_diff = 0.000609
.Running test: batch_size=4, sequence_length=32, quant_bits=8, use_swiglu=True, swiglu_interleaved=True
Parity check - SwiGLU(interleaved=True) 8-bit: max_diff = 0.000702
.
----------------------------------------------------------------------
Ran 9 tests in 46.754s

OK (skipped=1)
```

---------

Co-authored-by: Tianlei Wu <[email protected]>
This change skips the QMoE CPU tests when running on the TensorRT or CUDA EP.
In the QMoE kernel there was a memory-overwrite bug in the accumulation part; fixing it makes the Python tests pass again.
## Summary
Adds EP metadata library path support to enable custom ops DLL
registration with proper path resolution.

## Changes
- Added `library_path` metadata key to EP metadata infrastructure
- Pass resolved library path directly to `EpLibraryProviderBridge`
constructor
- Simplified implementation per reviewer feedback (removed virtual
method complexity)
- Added `#include <utility>` for std::move compliance

## Purpose
Enables downstream applications (like onnxruntime-genai) to resolve
relative custom ops library paths using EP metadata, improving DLL
registration reliability.

## Files Modified
- `plugin_ep/ep_factory_provider_bridge.h`
- `plugin_ep/ep_library.h` 
- `plugin_ep/ep_library_plugin.h`
- `plugin_ep/ep_library_provider_bridge.cc`
- `plugin_ep/ep_library_provider_bridge.h`
- `utils.cc`
…25881)

### Description
Fix illegal memory access in GetInputIndices with optional inputs

### Motivation and Context
When an input is optional, its ValueInfo may be nullptr.
The current implementation directly calls InputValueInfo->GetName(), leading to illegal memory access.

Update the logic to skip optional inputs when the ValueInfo is nullptr.
This PR adds a missing sync method and fixes the Linux CI.
### Description

Change from fread to mmap to save system memory. This also accelerated the load time of a ~4GB model by 1.5x in my testing.
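
For illustration only, a minimal Python analogue of the technique (the actual change is in ORT's C++ model-loading path; the file name is a placeholder):

```python
import mmap

with open("model.onnx", "rb") as f:
    # The OS maps pages lazily and can share them between processes,
    # instead of copying the whole multi-GB file into process memory up front.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        header = m[:16]  # only the touched pages are read from disk
```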
### Description

Runtime caches can reduce JIT time when deserializing a TensorRT RTX engine. Here we introduce per-engine caching in a user-specified folder. The cache file is named after the fused node name, which will also be the node name of the EP context node.

@chilo-ms we would like to pick this to 1.23
### Description
- Disables graph optimizations by default when using the explicit
compiling API.
- Adds `ModelCompilationOptions_SetGraphOptimizationLevel` to allow the
user to set an optimization level.
- Adds C++, Python, and C# bindings for the new API function.
- Updates `ModelCompilationOptions_SetFlags` to take in a `uint32_t
flags` parameter instead of `size_t flags` to ensure the same size
across platforms. This API is not yet in a public ORT release, so it is safe to modify.



### Motivation and Context
When compiling, prefer allowing the EP to do the optimizations instead
of ORT.
### Description
Create or augment the existing C++ API for new entry points.

### Motivation and Context
Enable exception-safe coding in the C++ codebase.
### Description
- Added `VERIFIED_PUBLIC_MODELS` (20 unique models) and
`ONNX_ZOO_MODELS` (158 total)
- Implemented `IsModelAllowed()` function with O(1) hash lookup
- Added comprehensive name mapping for backwards compatibility
- Maintained all existing EP provider configurations



### Motivation and Context
- Part of ONNX Runtime Phase 1 migration initiative
- Addresses security requirements for public CI pipelines
- Prepares for ORT 1.23.0 release compliance

---------

Co-authored-by: Emmanuel Assumang <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description
Some compilers we use in our pipeline do not support non-const `string::data()`.

### Description
- Adds `ModelCompilationOptions_SetOutputModelWriteFunc` to the compile
API to allow writing the output model ONNX bytes to a user-provided
write function (i.e., for streaming).
- Adds `ModelCompilationOptions_SetOutputModelHandleInitializerFunc` to
the compile API to allow the user to write individual initializers to
some destination. Also allows specifying if an initializer should be
embedded within the ONNX model or written to a custom file.
- Adds C++, Python, and C# bindings for the new APIs.

A follow-up PR adds a write function for EPContext node binary data:
#25471

### Example
`ModelCompilationOptions_SetOutputModelWriteFunc`:
https://github.com/microsoft/onnxruntime/blob/c62ed23c328cbbfefd3083c1f7a6ced604772c19/onnxruntime/test/providers/qnn/qnn_ep_context_test.cc#L2075-L2131

`ModelCompilationOptions_SetOutputModelHandleInitializerFunc`:

https://github.com/microsoft/onnxruntime/blob/c62ed23c328cbbfefd3083c1f7a6ced604772c19/onnxruntime/test/providers/qnn/qnn_ep_context_test.cc#L2160-L2292

### Motivation and Context
Add output streaming capabilities when saving compiled models.
…ng cudastream by value in createNotification API (#25937)

Fixing the stream parameter in CopyTensors API and passing cudastream by
value in createNotification API

### Description
Fix the stream parameter in the CopyTensors API to pass the application-provided stream instead of nullptr. Pass the cudaStream by value in the createNotification API, as passing a pointer was leading to dangling-reference issues.
Can you please make sure that this goes into 1.23? @chilo-ms 

### Motivation and Context
- Without this change, tensor copies always happen synchronously even when the app specifies a different stream.
- Passing a pointer for the cudaStream in the EP API leads to dangling-reference issues, hence we switched to passing it by value.
### Description
Enable 8-bit weights Gemm on ARM64 via MLAS

1. Supports two flavors of the 8-bit Gemm kernel: one uses `vdotq` (U8U8) and the other uses `vusdotq` (U8S8) on platforms where I8MM is supported.

2. Provides access to these new MLAS Gemm kernels via the `MatmulNBits` contrib operator (a small quantization sketch follows the test list below).

3. Tests:

   **MLAS** (3 new sets of tests):
   - `SQ8BitQuantA`: Tests the dynamic activation quantization MLAS kernel (`fp32 -> uint8_t`, or `fp32 -> int8_t` on I8MM platforms)
   - `SQ8BitPrepack`: Tests the prepacking of the weights for the 8-bit Gemm kernels
   - `SQ8BitGemm`: Tests the 8-bit Gemm kernels

   **MatmulNBits contrib tests**
   - Enables the 8-bit Gemm tests on ARM64 (previously only enabled on x86)
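
As referenced above, a rough numpy sketch of 8-bit block-wise weight quantization plus a dequantize-then-matmul reference for parity checks (the block size, zero point of 128, and layout are illustrative assumptions, not the MLAS prepacked format):

```python
import numpy as np

def quantize_weights_8bit_blockwise(w, block_size=32):
    """w: (K, N) float weights, K divisible by block_size."""
    K, N = w.shape
    blocks = w.reshape(K // block_size, block_size, N)
    scales = np.max(np.abs(blocks), axis=1) / 127.0  # (K // block_size, N)
    scales = np.where(scales == 0.0, 1.0, scales)
    codes = np.clip(np.round(blocks / scales[:, None, :]) + 128, 0, 255).astype(np.uint8)
    return codes.reshape(K, N), scales

def matmul_8bit_ref(a, codes, scales, block_size=32):
    """Dequantize-then-matmul reference for checking the quantized Gemm output."""
    s = np.repeat(scales, block_size, axis=0)        # broadcast scales back to (K, N)
    w = (codes.astype(np.float32) - 128.0) * s
    return a @ w
```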

### Motivation and Context
Enable 8-bit weights Gemm on ARM64 via MLAS

Based on work and contribution by @fajin-corp 


Phi-4-mini-instruct perf numbers (before and after this change):

<img width="593" height="179" alt="image"
src="https://github.com/user-attachments/assets/d81b9059-b8db-407c-8c0f-527099f9358c"
/>

---------

Co-authored-by: Jing Fang <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description
The EP will reject nodes with unsupported data types.


### Motivation and Context

The user faces a crash if a model with an unsupported data type is used.
@jywu-msft merged commit 491f0c1 into rel-1.23.0 Sep 5, 2025
80 checks passed
@jywu-msft deleted the tlwu/1.23_cherry_pick branch September 5, 2025 18:12
@snnn mentioned this pull request Sep 16, 2025