@tianleiwu commented Sep 4, 2025

Description

Cherry-pick the following PRs:
#25943
#25937
#25917
#25909
#25898
#25897
#25888
#25881
#25830
#25619
#25575
#25572
#25558
#25530
#25474
#25455
#25110

Also two dependent PRs for qMoE cpu:
#25877
#25822

tianleiwu and others added 20 commits September 4, 2025 15:08
### Description

This implements the SwiGLU activation for MoE and qMoE. The activation corresponds to
https://github.com/triton-lang/triton/blob/main/python/triton_kernels/triton_kernels/swiglu.py.

Also updates test_parity_moe.py to enable qMoE tests in CI pipelines.

### Motivation and Context

This is a naive implementation of the activation. Since the activation reduces each row length by half, we cannot directly use the GEMM epilogue; the current implementation needs an extra buffer to run the SwiGLU kernel.

In the future, we might look at alternatives that do not need the extra buffer.
Add support for bfloat16 in the MoE and qMoE CUDA ops.
### Description
This PR implements the GatherBlockQuantized operator for the CUDA EP with 4-bit and 8-bit data support.


### Motivation and Context
The GatherBlockQuantized operator is essential for MoE models' expert selection, especially when the model has been statically quantized.
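
A rough numpy sketch of the dequantize-then-gather idea for 8-bit data (the function name, shapes, and block layout here are illustrative assumptions, not the operator's exact spec):

```python
import numpy as np

def gather_block_quantized_ref(data_u8, scales, zero_points, indices, block_size=32):
    """Dequantize block-quantized 8-bit rows, then gather the requested rows.

    data_u8:     (rows, cols) uint8 quantized values
    scales:      (rows, cols // block_size) per-block scales
    zero_points: (rows, cols // block_size) per-block zero points
    indices:     1-D row indices to gather (e.g. selected experts)
    """
    rows, cols = data_u8.shape
    # Broadcast each block's scale / zero point across its block_size columns.
    s = np.repeat(scales, block_size, axis=1)[:, :cols]
    z = np.repeat(zero_points, block_size, axis=1)[:, :cols]
    dequant = (data_u8.astype(np.float32) - z.astype(np.float32)) * s
    return dequant[indices]  # gather along axis 0
```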

---------

Co-authored-by: Xiaoyan Hu <[email protected]>
This pull request introduces significant updates to the ONNX Runtime's
handling of quantized Mixture-of-Experts (MoE) operations. The changes
include adjustments to tensor type constraints, the addition of new
kernel definitions, and the implementation of a new `QMoE` operator for
CPU execution. These updates aim to enhance support for quantized MoE
operations and improve validation mechanisms for input tensors and
scales.

### Documentation Updates:
* Updated tensor type constraints for `fc1_scales`, `fc2_scales`, and
`fc3_scales` in `docs/ContribOperators.md` to use `T2` instead of `T`.
* Added descriptions for the new `QMoE` operator in
`docs/OperatorKernels.md`.

### Operator Enhancements:
* Introduced a new `QMoE` operator for quantized Mixture-of-Experts in
CPU kernels (`onnxruntime/contrib_ops/cpu/cpu_contrib_kernels.cc`).
* Registered the `QMoE` operator in the kernel registry.

### Codebase Additions:
* Added `MoEBaseCPU` class in
`onnxruntime/contrib_ops/cpu/moe/moe_base_cpu.h` to provide shared
functionality for MoE operations, including input validation and scale
checking.
* Implemented the `QMoE` operator in
`onnxruntime/contrib_ops/cpu/quantization/moe_quantization_cpu.h` with
support for quantized tensor types and activation types.

### CUDA and Graph Updates:
* Updated type constraints for `T2` in CUDA implementation of `QMoE`.
* Adjusted schema definitions for `fc1_scales` and `fc2_scales` to use
`T2` in `onnxruntime/core/graph/contrib_ops/contrib_defs.cc`.

These changes collectively improve the framework's ability to handle
quantized MoE operations efficiently while ensuring robust validation
for input tensors and scales.
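
For context, here is a rough, non-quantized numpy sketch of the top-k MoE forward pass that kernels like these implement (router softmax, per-token expert selection, expert FFN, weighted combine). It is a generic illustration, not the QMoE CPU code; the ReLU inside the expert and the shapes are assumptions.

```python
import numpy as np

def moe_forward_ref(x, router_w, w1, w2, top_k=2):
    """x: (tokens, hidden), router_w: (hidden, experts),
    w1: (experts, hidden, inter), w2: (experts, inter, hidden)."""
    logits = x @ router_w                                  # (tokens, experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)                  # softmax routing weights
    chosen = np.argsort(-probs, axis=-1)[:, :top_k]        # top-k experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in chosen[t]:
            h = np.maximum(x[t] @ w1[e], 0.0)              # expert FFN (ReLU as a placeholder)
            out[t] += probs[t, e] * (h @ w2[e])            # routing-weighted combine
    return out
```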

---------

Co-authored-by: Kunal Vaishnavi <[email protected]>
Co-authored-by: Tianlei Wu <[email protected]>
### Weight Shape Update
Make sure the shape reflects the actual memory layout. The weight is stored in column-major order.

### Add support for SwiGLU activation attributes
Add the spec for the new activation type SwiGLU (Swish-Gated Linear Unit) by introducing a few new attributes. For reference, see the [Triton kernel implementation](https://github.com/triton-lang/triton/blob/main/python/triton_kernels/triton_kernels/swiglu.py).


#### New Attributes for SwiGLU

* **`swiglu_fusion`**:
  * `0`: Not fused — two separate GEMMs (FC1 and FC3).
  * `1`: Fused GEMMs using **interleaved** format (`g` and `l` are interleaved per row).
  * `2`: Fused GEMMs using **non-interleaved** (concatenated) format.

* **`swiglu_limit`**: Clamp threshold applied to `g` and `l`.

* **`activation_alpha`**: Scalar multiplier applied to `g` before the sigmoid.

* **`activation_beta`**: Added to `l` before the final output computation.

---

### SwiGLU Activation Function

The SwiGLU function is defined as:

```
g = xW + b
l = xV + c
G = min(g, limit)
L = max(min(l, limit), -limit)
swiglu = G * sigmoid(alpha * G) * (L + beta)
```

* `x`: Input
* `W`, `V`: Weight matrices
* `b`, `c`: Bias vectors
* `alpha`, `beta`, `limit`: Float constants
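
A direct numpy transcription of the formula (a reference sketch only; the default `alpha`, `beta`, and `limit` values below are illustrative assumptions, not the operator defaults):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swiglu_ref(x, W, b, V, c, alpha=1.702, beta=1.0, limit=7.0):
    g = x @ W + b                  # gate branch
    l = x @ V + c                  # linear branch
    G = np.minimum(g, limit)       # clamp the gate from above only
    L = np.clip(l, -limit, limit)  # clamp the linear part on both sides
    return G * sigmoid(alpha * G) * (L + beta)
```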

---

### Fusion Behavior

* When `swiglu_fusion = 0`:
  * Two GEMMs are computed independently.
  * FC1 → computes `g`, FC3 → computes `l`.

* When `swiglu_fusion = 1`:
  * `g` and `l` are computed in a **single fused GEMM** (FC1).
  * Output is **interleaved** per row as: `gate, linear, gate, linear, ...`.

* When `swiglu_fusion = 2`:
  * `g` and `l` are computed in a single GEMM (FC1).
  * Output is **concatenated** per row: `[g | l]`.
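
A small sketch of how `g` and `l` could be recovered from the fused FC1 output for the two fused layouts (an illustration of the layout convention described above, not kernel code):

```python
import numpy as np

def split_fused_fc1(y, swiglu_fusion):
    """y: (rows, 2 * inter) fused FC1 output."""
    if swiglu_fusion == 1:    # interleaved: gate, linear, gate, linear, ...
        return y[:, 0::2], y[:, 1::2]
    if swiglu_fusion == 2:    # concatenated: [g | l]
        half = y.shape[1] // 2
        return y[:, :half], y[:, half:]
    # swiglu_fusion == 0: g and l come from two separate GEMMs (FC1 and FC3)
    raise ValueError("mode 0 does not produce a fused output")
```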

### Implement swiglu_limit for CUDA
Update the CUDA kernel to use the default swiglu limit.
Update test_moe_cuda.py to use the same logic in its reference implementation.

### Remaining Work
The main purpose of this PR is to update the spec rather than implement it.
Note that the MoE/qMoE ops and tests still use hard-coded parameters; they will be changed later to read from these attributes.

Column-wise symmetric quantization is used for qMoE. We will add more quantization details when we add support for block-wise quantization soon.
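
A minimal numpy sketch of column-wise symmetric 4-bit quantization as described, assuming values are stored as unsigned codes with a mid-range zero point of 8 (the packing of two codes per byte is omitted):

```python
import numpy as np

def quantize_columns_symmetric_int4(w):
    """w: (rows, cols) float weights -> uint8 codes in [0, 15] and one scale per column."""
    scales = np.max(np.abs(w), axis=0) / 7.0
    scales = np.where(scales == 0.0, 1.0, scales)  # avoid divide-by-zero for all-zero columns
    codes = np.clip(np.round(w / scales) + 8, 0, 15).astype(np.uint8)
    return codes, scales

def dequantize_columns_int4(codes, scales):
    return (codes.astype(np.float32) - 8.0) * scales
```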
This pull request introduces several improvements and refactorings to
the quantized Mixture-of-Experts (QMoE) operator in ONNX Runtime,
focusing on enhanced support for FP32 mode, improved SwiGLU activation
handling, and better test coverage. The most important changes are
grouped below by theme.

### Operator Registration and Type Support

- Added explicit registration and support for `QMoE` operator with both
`MLFloat16` and `float` data types, enabling FP32 (non-quantized) mode
in addition to quantized modes. This includes updates to kernel
registration and schema/type constraints.

### SwiGLU Activation Improvements

- Refactored `ApplySwiGLUActivation` to accept configurable
`activation_alpha` and `activation_beta` parameters, matching CUDA
behavior and allowing flexibility in activation function tuning. Also dropped support for non-interleaved memory layouts (these are no longer implemented).
- Now reads `activation_alpha` and `activation_beta` attributes from
operator parameters, defaulting to values appropriate for SwiGLU.

### QMoE Operator Implementation Refactor

- Refactored the QMoE operator to clarify separation between quantized
and FP32 implementations, and restructured internal methods for better
maintainability. Added template parameterization for data types and
improved handling of expert weights and biases.

### Shape Checking and Layout

- Removed legacy shape/layout support in QMoE input validation,
enforcing only the new memory layout for expert weights and improving
consistency and forward compatibility.

### Test and Documentation Updates

- Updated unit tests for QMoE to use correct zero-point values for
quantized weights (e.g., 0x88 for int4, 128 for int8), ensuring that
test cases accurately reflect expected zero-output behavior for zero
weights. Also clarified comments and expected outputs for SwiGLU and
quantized scenarios.
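
A short check of why those values encode zero weights, assuming two 4-bit codes are packed per byte and a mid-range zero point:

```python
packed = 0x88
low, high = packed & 0xF, packed >> 4  # two int4 codes per byte -> 8 and 8
scale = 0.02                           # arbitrary scale for the check
assert (low - 8) * scale == 0.0        # int4 code 8 dequantizes to exactly 0
assert (128 - 128) * scale == 0.0      # int8 code 128 likewise
```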

These changes collectively improve the flexibility, correctness, and
maintainability of the QMoE operator in ONNX Runtime.


Unit test result
```
sRunning test: batch_size=1, sequence_length=8, quant_bits=4, use_swiglu=True, swiglu_interleaved=True
Parity check - SwiGLU(interleaved=True) 4-bit: max_diff = 0.000372
.Running test: batch_size=1, sequence_length=8, quant_bits=8, use_swiglu=True, swiglu_interleaved=True
Parity check - SwiGLU(interleaved=True) 8-bit: max_diff = 0.000392
.Running test: batch_size=1, sequence_length=32, quant_bits=4, use_swiglu=True, swiglu_interleaved=True
Parity check - SwiGLU(interleaved=True) 4-bit: max_diff = 0.000470
.Running test: batch_size=1, sequence_length=32, quant_bits=8, use_swiglu=True, swiglu_interleaved=True
Parity check - SwiGLU(interleaved=True) 8-bit: max_diff = 0.000442
.Running test: batch_size=4, sequence_length=8, quant_bits=4, use_swiglu=True, swiglu_interleaved=True
Parity check - SwiGLU(interleaved=True) 4-bit: max_diff = 0.000470
.Running test: batch_size=4, sequence_length=8, quant_bits=8, use_swiglu=True, swiglu_interleaved=True
Parity check - SwiGLU(interleaved=True) 8-bit: max_diff = 0.000442
.Running test: batch_size=4, sequence_length=32, quant_bits=4, use_swiglu=True, swiglu_interleaved=True
Parity check - SwiGLU(interleaved=True) 4-bit: max_diff = 0.000609
.Running test: batch_size=4, sequence_length=32, quant_bits=8, use_swiglu=True, swiglu_interleaved=True
Parity check - SwiGLU(interleaved=True) 8-bit: max_diff = 0.000702
.
----------------------------------------------------------------------
Ran 9 tests in 46.754s

OK (skipped=1)
```

---------

Co-authored-by: Tianlei Wu <[email protected]>
This change skips the QMoE CPU tests when running on the TensorRT or CUDA EP.
In the QMoE kernel there was a memory-overwrite bug in the accumulation part; fixing it makes the Python tests pass again.
## Summary
Adds EP metadata library path support to enable custom ops DLL
registration with proper path resolution.

## Changes
- Added `library_path` metadata key to EP metadata infrastructure
- Pass resolved library path directly to `EpLibraryProviderBridge`
constructor
- Simplified implementation per reviewer feedback (removed virtual
method complexity)
- Added `#include <utility>` for std::move compliance

## Purpose
Enables downstream applications (like onnxruntime-genai) to resolve
relative custom ops library paths using EP metadata, improving DLL
registration reliability.

## Files Modified
- `plugin_ep/ep_factory_provider_bridge.h`
- `plugin_ep/ep_library.h` 
- `plugin_ep/ep_library_plugin.h`
- `plugin_ep/ep_library_provider_bridge.cc`
- `plugin_ep/ep_library_provider_bridge.h`
- `utils.cc`
…25881)

### Description
Fix illegal memory access in GetInputIndices with optional inputs

### Motivation and Context
When an input is optional, its ValueInfo may be nullptr.
The current implementation directly calls InputValueInfo->GetName(), leading to illegal memory access.

Update the logic to skip optional inputs when the ValueInfo is nullptr.
This PR adds a missing sync method and fixes the Linux CI.
### Description

Change from fread to mmap to save system memory. This also accelerated the load time of a ~4GB model by 1.5x in my testing.
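
For illustration only, a minimal Python analogue of the technique (the actual change is in ORT's C++ model-loading path; the file name is a placeholder):

```python
import mmap

with open("model.onnx", "rb") as f:
    # The OS maps pages lazily and can share them between processes,
    # instead of copying the whole multi-GB file into process memory up front.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        header = m[:16]  # only the touched pages are read from disk
```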
### Description

Runtime caches can reduce JIT time when deserializing a TensorRT RTX engine. Here we introduce per-engine caching in a user-specified folder. The cache file is named after the fused node name, which will also be the node name of the EP context node.

@chilo-ms we would like to pick this to 1.23
### Description
- Disables graph optimizations by default when using the explicit
compiling API.
- Adds `ModelCompilationOptions_SetGraphOptimizationLevel` to allow the
user to set an optimization level.
- Adds C++, Python, and C# bindings for the new API function.
- Updates `ModelCompilationOptions_SetFlags` to take in a `uint32_t
flags` parameter instead of `size_t flags` to ensure the same size
across platforms. This API is not yet in a public ORT release, so it is safe to modify.



### Motivation and Context
When compiling, prefer allowing the EP to do the optimizations instead
of ORT.
### Description
Create or augment the existing C++ API for new entry points.

### Motivation and Context
Enable exception-safe coding in the C++ codebase.
### Description
- Added `VERIFIED_PUBLIC_MODELS` (20 unique models) and
`ONNX_ZOO_MODELS` (158 total)
- Implemented `IsModelAllowed()` function with O(1) hash lookup
- Added comprehensive name mapping for backwards compatibility
- Maintained all existing EP provider configurations



### Motivation and Context
- Part of ONNX Runtime Phase 1 migration initiative
- Addresses security requirements for public CI pipelines
- Prepares for ORT 1.23.0 release compliance

---------

Co-authored-by: Emmanuel Assumang <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description
Some compilers we use in our pipeline do not support non-const `string::data()`.

### Description
- Adds `ModelCompilationOptions_SetOutputModelWriteFunc` to the compile
API to allow writing the output model ONNX bytes to a user-provided
write function (i.e., for streaming).
- Adds `ModelCompilationOptions_SetOutputModelHandleInitializerFunc` to
the compile API to allow the user to write individual initializers to
some destination. Also allows specifying if an initializer should be
embedded within the ONNX model or written to a custom file.
- Adds C++, Python, and C# bindings for the new APIs.

A follow-up PR adds a write function for EPContext node binary data:
#25471

### Example
`ModelCompilationOptions_SetOutputModelWriteFunc`:
https://github.com/microsoft/onnxruntime/blob/c62ed23c328cbbfefd3083c1f7a6ced604772c19/onnxruntime/test/providers/qnn/qnn_ep_context_test.cc#L2075-L2131

`ModelCompilationOptions_SetOutputModelHandleInitializerFunc`:

https://github.com/microsoft/onnxruntime/blob/c62ed23c328cbbfefd3083c1f7a6ced604772c19/onnxruntime/test/providers/qnn/qnn_ep_context_test.cc#L2160-L2292

### Motivation and Context
Add output streaming capabilities when saving compiled models.
…ng cudastream by value in createNotification API (#25937)

Fixing the stream parameter in CopyTensors API and passing cudastream by
value in createNotification API

### Description
Fix the stream parameter in the CopyTensors API to pass the application-provided stream instead of nullptr. Pass the cudaStream by value in the createNotification API, as passing a pointer was leading to dangling-reference issues.
Can you please make sure that this goes into 1.23? @chilo-ms 

### Motivation and Context
- Without this change, tensor copies always happen synchronously even when the app specifies a different stream.
- Passing a pointer for the cudaStream in the EP API leads to dangling-reference issues, hence we switched to passing it by value.
### Description
Enable 8-bit weights Gemm on ARM64 via MLAS

1. Supports two flavors of the 8-bit Gemm kernel: one uses `vdotq` (U8U8) and the other uses `vusdotq` (U8S8) on platforms where I8MM is supported.

2. Provides access to these new MLAS Gemm kernels via the `MatmulNBits` contrib operator (a small quantization sketch follows the test list below).

3. Tests:

   **MLAS** (3 new sets of tests):
   - `SQ8BitQuantA`: Tests the dynamic activation quantization MLAS kernel (`fp32 -> uint8_t`, or `fp32 -> int8_t` on I8MM platforms)
   - `SQ8BitPrepack`: Tests the prepacking of the weights for the 8-bit Gemm kernels
   - `SQ8BitGemm`: Tests the 8-bit Gemm kernels

   **MatmulNBits contrib tests**
   - Enables the 8-bit Gemm tests on ARM64 (previously only enabled on x86)
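
As referenced above, a rough numpy sketch of 8-bit block-wise weight quantization plus a dequantize-then-matmul reference for parity checks (the block size, zero point of 128, and layout are illustrative assumptions, not the MLAS prepacked format):

```python
import numpy as np

def quantize_weights_8bit_blockwise(w, block_size=32):
    """w: (K, N) float weights, K divisible by block_size."""
    K, N = w.shape
    blocks = w.reshape(K // block_size, block_size, N)
    scales = np.max(np.abs(blocks), axis=1) / 127.0  # (K // block_size, N)
    scales = np.where(scales == 0.0, 1.0, scales)
    codes = np.clip(np.round(blocks / scales[:, None, :]) + 128, 0, 255).astype(np.uint8)
    return codes.reshape(K, N), scales

def matmul_8bit_ref(a, codes, scales, block_size=32):
    """Dequantize-then-matmul reference for checking the quantized Gemm output."""
    s = np.repeat(scales, block_size, axis=0)        # broadcast scales back to (K, N)
    w = (codes.astype(np.float32) - 128.0) * s
    return a @ w
```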

### Motivation and Context
Enable 8-bit weights Gemm on ARM64 via MLAS

Based on work and contribution by @fajin-corp 


Phi-4-mini-instruct perf numbers (before and after this change):

<img width="593" height="179" alt="image"
src="https://github.com/user-attachments/assets/d81b9059-b8db-407c-8c0f-527099f9358c"
/>

---------

Co-authored-by: Jing Fang <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description
The EP will reject nodes with unsupported data types.


### Motivation and Context

The user faces a crash if a model with an unsupported data type is used.
@jywu-msft merged commit 491f0c1 into rel-1.23.0 Sep 5, 2025
80 checks passed
@jywu-msft deleted the tlwu/1.23_cherry_pick branch September 5, 2025 18:12
@snnn mentioned this pull request Sep 16, 2025