Conversation

@snnn snnn commented Aug 28, 2025

fs-eire and others added 12 commits July 31, 2025 17:55
### Description
We have a big packaging pipeline that builds the nuget/java/nodejs packages and then runs their tests. This PR splits the tests into a dedicated pipeline and refactors the code to use Maven to download dependencies instead of fetching them over direct HTTP. The new approach allows us to use Azure DevOps Artifacts as an internal mirror to meet network isolation requirements. This PR also enables WebGPU and CoreML EP tests for the Java package on macOS.

This PR also updates tools/python/run_packaging_pipelines.py a bit to add support for RC releases.

### Motivation and Context
Make the packaging pipelines smaller and easier to use.
…gth (#25594)

### Description
#25372 adds sliding window support for Group Query Attention, disabling
Flash Attention as it's not yet supported.

This PR adds a check for the sliding window and applies Flash Attention
when the window size exceeds the KV cache length or total sequence
length.
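
A minimal sketch of the dispatch condition described above (illustrative names, not the actual GQA implementation):

```cpp
// Hedged sketch: Flash Attention has no sliding-window support in this path,
// but a window at least as large as the total sequence is effectively a
// no-op, so Flash Attention remains safe to use in that case.
inline bool CanUseFlashAttention(int local_window_size, int total_sequence_length) {
  return local_window_size < 0 ||                     // no sliding window requested
         local_window_size >= total_sequence_length;  // window covers everything
}
```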

### Motivation and Context
See above.
…5673)

### Description
Relax WeightBiasQuantization constraint for larger QDQ node group

### Motivation and Context
The `WeightBiasQuantization` transformer quantizes float weights in the `Q -> DQ -> Conv/ConvTranspose/Gemm's Weights -> Q -> DQ` sequence. The check on `Weights -> Q` (`children_nodes.size() != 1 || children_nodes[0]->OpType() != QDQ::QOpName`) is an issue because it skips quantization for many common patterns, such as an unfused activation following `Conv` (`DQ -> Conv -> ReLU -> Q`).

Checking for the trailing `Q` here is actually unnecessary (the fold can happen anyway without changing model semantics). However, to minimize the behavior change, this PR simply extends the pattern to also accept a single-path (no branches), type-preserving chain leading to `Q`, enabling quantization in more cases.
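A hedged sketch of the relaxed chain walk (illustrative, not the actual transformer code; the type-preservation test is omitted for brevity):

```cpp
#include "core/graph/graph.h"  // assumption: onnxruntime graph/Node API

// Follow a single-consumer chain from the candidate node; accept the pattern
// only if the branch-free path eventually reaches a QuantizeLinear.
bool LeadsToQThroughSinglePath(const onnxruntime::Node* node) {
  while (node != nullptr) {
    if (node->OpType() == "QuantizeLinear") return true;
    if (node->GetOutputEdgesCount() != 1) return false;  // branch: reject
    node = &*node->OutputNodesBegin();                   // follow the chain
  }
  return false;
}
```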
### Description
This change adds CUDA Graph support to the NV TensorRT RTX Execution
Provider (EP).

### Motivation and Context
Integrating CUDA Graphs into the NV TRT RTX EP provides:
- Lower latency by minimizing per-kernel launch overhead.
- Better throughput for repeated inference runs.
- Improved efficiency on GPUs that are sensitive to kernel launch overhead.
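
As a rough illustration of the mechanism (a generic CUDA Graphs capture/replay sketch, assuming CUDA 12's `cudaGraphInstantiate` overload; not the EP's actual integration code):

```cpp
#include <cuda_runtime.h>

// Hedged sketch: capture the work once, then replay it; each replay is a
// single launch call instead of one launch per kernel.
void RunWithCudaGraph(cudaStream_t stream, int iterations) {
  cudaGraph_t graph = nullptr;
  cudaGraphExec_t graph_exec = nullptr;

  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  // ... enqueue the inference kernels on `stream` here ...
  cudaStreamEndCapture(stream, &graph);

  cudaGraphInstantiate(&graph_exec, graph, /*flags=*/0);
  for (int i = 0; i < iterations; ++i) {
    cudaGraphLaunch(graph_exec, stream);  // one launch per inference run
  }
  cudaStreamSynchronize(stream);

  cudaGraphExecDestroy(graph_exec);
  cudaGraphDestroy(graph);
}
```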

---------

Co-authored-by: Maximilian Mueller <[email protected]>
Co-authored-by: Gaurav Garg <[email protected]>
### Description
1. A small change to use the shared allocator in the Python binding.
2. Remove FP64 support from the EP.


### Motivation and Context

The Python GPU IO binding is necessary for performance, and this change enables the shared allocator for GPU allocations.
FP64 was actually running FP32 inference, so removing it aligns the EP with what TensorRT RTX supports.

---------

Co-authored-by: Gaurav Garg <[email protected]>
### Description
This change fixes correctness issues in two areas that were causing
failures in onnxruntime_test_all:

- DynamicQuantizeMatMul.WithConstantBInputs
- AttentionTest.Attention3DDefault
- AttentionTest.Attention3DWithPastAndPresentQkMatmul

What was wrong and how it’s fixed
1) DynamicQuantizeMatMul.WithConstantBInputs
- Root cause: The Kleidi dynamic quantization GEMM path could be selected even when the B scales contained invalid values (zero, negative, or non-finite). That violates kernel assumptions and can lead to incorrect results.
- Fix: In
`onnxruntime/contrib_ops/cpu/quantization/dynamic_quantize_matmul.cc`,
we now explicitly validate that all B scales are finite and strictly
positive before enabling the Kleidi/MLAS dynamic path. If any scale is
invalid, we disable that path.
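
A minimal sketch of that validation (assuming a simple span of B scales; the real check lives in `dynamic_quantize_matmul.cc`):

```cpp
#include <cmath>
#include <gsl/span>

// Hedged sketch: the Kleidi/MLAS dynamic path assumes every B scale is
// finite and strictly positive; otherwise take the fallback path.
bool AllBScalesValid(gsl::span<const float> b_scales) {
  for (float s : b_scales) {
    if (!std::isfinite(s) || s <= 0.0f) return false;
  }
  return true;
}
```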

2) Attention tests (Attention3DDefault,
Attention3DWithPastAndPresentQkMatmul)
- Root causes in `onnxruntime/core/mlas/lib/kleidiai/sgemm_kleidiai.cpp`:
  - Incorrect handling of GEMM corner cases for alpha/beta and K==0 (e.g., not respecting C = beta*C when alpha==0 or K==0).
  - Unnecessary or premature fallbacks for small shapes.
- Fixes:
  - Add early-outs for degenerate sizes: if M==0 or N==0, return handled.
  - Correctly implement alpha/beta semantics, as sketched below:
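
A minimal sketch of the intended semantics (scalar reference behavior for `C = alpha*A*B + beta*C`; illustrative, not the KleidiAI kernel):

```cpp
#include <cstddef>

// Hedged sketch: when alpha == 0 or K == 0 the A*B term vanishes, so only
// C = beta*C must be applied (beta == 0 meaning overwrite with zeros).
void SgemmCornerCases(size_t M, size_t N, size_t K,
                      float alpha, float beta, float* C, size_t ldc) {
  if (M == 0 || N == 0) return;  // degenerate size: nothing to do, handled
  if (alpha == 0.0f || K == 0) {
    for (size_t i = 0; i < M; ++i)
      for (size_t j = 0; j < N; ++j)
        C[i * ldc + j] = (beta == 0.0f) ? 0.0f : beta * C[i * ldc + j];
    return;
  }
  // ... otherwise dispatch to the optimized GEMM path ...
}
```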

---------

Signed-off-by: Jonathan Clohessy <[email protected]>
### Description
While memory profiling some models I noticed multiple file mapping failures.
`WindowsEnv::MapFileIntoMemory()` properly checks that the mapping offset is allocation-granularity aligned, but it actually computes the offset as page aligned.
Also, when saving external tensors we do not need to align big tensors to the Windows allocation granularity or anything else platform dependent; set the alignment to 4096 for all platforms. Granularity matters only for calculating the mapping address.
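
A minimal sketch of the alignment being described (standard Win32 pattern; illustrative, not the exact `WindowsEnv` code):

```cpp
#include <windows.h>
#include <cstdint>

// Hedged sketch: MapViewOfFile requires the file offset to be a multiple of
// the allocation granularity (commonly 64 KiB), not the page size (4 KiB).
void* MapAtOffset(HANDLE mapping, uint64_t desired_offset, size_t length) {
  SYSTEM_INFO si;
  GetSystemInfo(&si);
  const uint64_t granularity = si.dwAllocationGranularity;  // not dwPageSize
  const uint64_t aligned = desired_offset - (desired_offset % granularity);
  const size_t adjust = static_cast<size_t>(desired_offset - aligned);
  char* view = static_cast<char*>(MapViewOfFile(
      mapping, FILE_MAP_READ,
      static_cast<DWORD>(aligned >> 32),
      static_cast<DWORD>(aligned & 0xFFFFFFFFu),
      length + adjust));
  return view ? view + adjust : nullptr;  // hand back the requested offset
}
```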

### Motivation and Context
Multiple file mapping failures for certain models.
The change saves hundreds of MBs for some models.
…at info (#25841)

### Description
This PR adds a new API that applications can use to verify compatibility
of a precompiled model with the underlying system, using only the
compatibility info string from the model's metadata.

### Motivation and Context
- This is a feature to enable apps to check compatibility of a
precompiled model without necessarily having the model locally on the
device. This enables precompiled models to be stored remotely and
downloaded once the application has been able to confirm the validity of
a given model with EPs on the device.

### Testing
- New unit tests pass 
- For regression testing, built a private version of WinML + AMD NPU EP
with these changes. Ran the Cpp Selfcontained Desktop sample
successfully; ran with compilation and also re-ran using the
already-compiled model to verify that session initialization continued
to work as expected.

---------

Co-authored-by: Aditya Rastogi <[email protected]>
…ile build (#25849)

### Description
`ABSL_FLAGS_STRIP_NAMES` is set to 1 by default to disable flag
registration when building for Android, iPhone, and "embedded devices".
As a result, running onnxruntime_perf_test on Android fails because its
flags are not registered.

<img width="872" height="182" alt="image (2)"
src="https://github.com/user-attachments/assets/eb6a6772-cdff-4d60-a3c7-4352477e956c"
/>

Set `ABSL_FLAGS_STRIP_NAMES` to 0 by default for all builds.
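
For context, the macro's effect in a translation unit (a minimal sketch; in the build the value is set by the build system so it stays consistent across all of Abseil):

```cpp
// Hedged sketch: with ABSL_FLAGS_STRIP_NAMES=1, Abseil strips flag names and
// help text at compile time, so command-line lookups like --num_runs cannot
// resolve. Defining it to 0 keeps registration intact.
#define ABSL_FLAGS_STRIP_NAMES 0

#include "absl/flags/flag.h"
#include "absl/flags/parse.h"

ABSL_FLAG(int, num_runs, 10, "Number of inference runs.");

int main(int argc, char** argv) {
  absl::ParseCommandLine(argc, argv);  // resolves --num_runs only if names are kept
  return absl::GetFlag(FLAGS_num_runs) > 0 ? 0 : 1;
}
```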
### Description
Fix packaging pipelines


### Motivation and Context
During CI and local builds, `Ort::Status()` (the default constructor) is inherited from the base class via using directives; however, that does not work in the packaging pipelines.
Having a default ctor is also important for storing `Status` in containers if needed.
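
A minimal sketch of the container point (illustrative; `std::vector<T>::resize` requires `T` to be default constructible):

```cpp
#include <vector>
#include "onnxruntime_cxx_api.h"

// Hedged sketch: without a default ctor on Ort::Status, this resize call
// fails to compile.
void CollectStatuses(size_t n) {
  std::vector<Ort::Status> statuses;
  statuses.resize(n);  // requires Ort::Status()
}
```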
snnn commented Aug 29, 2025

All the cherry-picks merged cleanly, but there are some strange build errors (missing symbols):

onnxruntime\test\platform\file_io_test.cc(171,5): Error C3861: 'ASSERT_STATUS_OK': identifier not found

Investigating ...

snnn commented Aug 29, 2025

Added a line of code to fix the missing include issue.

derdeljan-msft and others added 5 commits August 29, 2025 13:13
### Description

When using the attention bias input for the GQA op with FP16, platforms
that don't natively support FP16 math need a cast to FP32, and thus a
temporary buffer to store the FP32 values. The issue was that this
temporary buffer was being allocated and deallocated inside a loop, once
for every token processed. Refactored the implementation so that the
allocation takes place only once.

Phi model throughput increased by 15%.
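
A minimal sketch of the hoisting pattern described above (illustrative names; not the actual GQA kernel):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hedged sketch: allocate the FP32 scratch buffer once, outside the token
// loop, instead of allocating and freeing it per token.
void ProcessTokens(size_t num_tokens, size_t bias_len, const uint16_t* fp16_bias) {
  std::vector<float> fp32_bias(bias_len);  // hoisted: one allocation total
  for (size_t t = 0; t < num_tokens; ++t) {
    // Convert the FP16 bias into the reused FP32 buffer, then run the
    // attention math for token t (conversion and math elided).
  }
}
```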
### Description
This change builds on top of #25841 and adds the scaffolding necessary
to call into this API from C++ / C# / Python.
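
As a rough illustration of the intended call pattern (every name below is a placeholder, not the real API surface added by these PRs):

```cpp
#include <string>

// Hedged sketch with hypothetical names: given only the compatibility info
// string from a precompiled model's metadata, ask whether the model would
// run on this device before downloading it.
enum class CompatStatus { kCompatible, kNeedsRecompile, kNotSupported };

// Placeholder declaration standing in for the API exposed in #25841.
CompatStatus CheckModelCompatibility(const std::string& ep_name,
                                     const std::string& compat_info);

bool ShouldDownloadModel(const std::string& compat_info) {
  return CheckModelCompatibility("SomeNpuEp", compat_info) ==
         CompatStatus::kCompatible;
}
```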

### Motivation and Context
#25454 talks more about the broader notion of precompiled model
compatibility. This change is directed at app developers whose apps may
want to determine if a particular precompiled model (e.g. on a server
somewhere) is compatible with the device where the application is
running. There is functionality in `OrtEpFactory` for making this
determination, which was exposed as a C API in #25841, and this change
makes the API more broadly available in other languages.

### Testing and Validation
Introduced new unit test cases across each language, and verified that
the API was being called and returned the correct result for the default
CPU EP.

---------

Co-authored-by: Aditya Rastogi <[email protected]>
### Description
This update introduces multiple improvements, fixes, and feature enhancements to the OpenVINO Execution Provider (OVEP) and related components in ONNX Runtime:

#### Configuration & Properties

- Updated `load_config` mapping to act as a passthrough to OpenVINO properties (see the sketch below).
- Added support for providing layout information to inputs/outputs in OpenVINO.
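
A minimal usage sketch for the passthrough (assuming the provider-options form of enabling OVEP; treat the option values as illustrative):

```cpp
#include <string>
#include <unordered_map>
#include "onnxruntime_cxx_api.h"

// Hedged sketch: `load_config` points at a JSON file whose entries are
// forwarded to OpenVINO as device properties.
Ort::Session MakeOvSession(Ort::Env& env, const ORTCHAR_T* model_path) {
  Ort::SessionOptions so;
  std::unordered_map<std::string, std::string> ov_options{
      {"device_type", "GPU"},
      {"load_config", "ov_properties.json"},  // passthrough to OV properties
  };
  so.AppendExecutionProvider_OpenVINO_V2(ov_options);
  return Ort::Session(env, model_path, so);
}
```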

#### Inference & Tensor Handling

- Improved OVInferRequest::SetTensor to correctly handle cached binding shape mismatches.
- Added support for self-detecting on-the-fly bfloat16 → float16 conversion.
- Fixed issues with input ONNX models when used with shared execution contexts.

#### Model Handling & Operator Support

- Fixed model copying behavior for QDQ stripping.
- Updated operator support status for OpenVINO 2025.2.

#### Platform & Integration Fixes

- Applied multiple PSU Lora fixes and related updates.
- Resolved filename confusion issues with wrapped OVIRs in EPCtx.
- Enabled memory-mapped native binaries for OpenVINO 2025.3.

#### Quality & Maintenance

- Addressed linting issues.
- Fixed coverage gaps in OVEP.
- Added a new test script for OpenVINO with ORT ABI integration.

---------

Co-authored-by: Ankit Maheshkar <[email protected]>
Co-authored-by: Ryan Metcalfe <[email protected]>
Co-authored-by: Klimenko, Mikhail <[email protected]>
Co-authored-by: sfatimar <[email protected]>
Co-authored-by: Garth Long <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: MayureshV1 <[email protected]>
Co-authored-by: Eric Crawford <[email protected]>
Co-authored-by: jatinwadhwa921 <[email protected]>
Co-authored-by: Vishnudas Thaniel S <[email protected]>
Co-authored-by: Javier Martinez <[email protected]>
…ting Node_GetTensorAttributeAsOrtValue (#25886)

### Description
Replace `Node_GetTensorAttributeAsOrtValue` with
`OpAttr_GetTensorAttributeAsOrtValue`.
Change the API signature to make it one of the `OpAttr` interfaces
instead of the `OrtNode` interface.

The original API was added
[here](#25566).
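
A rough before/after of the call shape (hedged; the exact signatures live in the C API headers, and the lines below are illustrative):

```cpp
// Hedged sketch: the tensor attribute is now read off the OrtOpAttr handle
// directly instead of through the owning OrtNode, so callers no longer need
// the node just to materialize the attribute as an OrtValue.
//
//   before: api->Node_GetTensorAttributeAsOrtValue(node, attr, &value);
//   after:  api->OpAttr_GetTensorAttributeAsOrtValue(attr, &value);
```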
### Description
1. Check the process exit code when running 7z.exe. Previously, errors
were silently ignored.
2. Add the snld20 flag to the 7z.exe commands, which is needed for
compatibility with the latest 7z release.
@snnn snnn changed the title users/snnn/rel 1.23.0 Cherry-picks for 1.23.0 release Aug 29, 2025
@snnn snnn merged commit 30612fb into rel-1.23.0 Aug 29, 2025
126 of 144 checks passed
@snnn snnn deleted the users/snnn/rel-1.23.0 branch August 29, 2025 23:26
@snnn snnn mentioned this pull request Sep 16, 2025