# Cherry-picks for 1.23.0 release #25889
Merged
## Conversation
snnn (Member) commented on Aug 28, 2025
- Relax WeightBiasQuantization constraint for larger QDQ node group (#25673)
- Add cuda graph implementation for NV TRT RTX EP (#25787)
- python GPU IO Bindings for NVIDIA (#25776)
- Fixes for DynamicQuantizeMatMul and Attention3D tests (#25814)
- Fix a long standing bug on file memory mapping on windows. (#25833)
- Add API for precompiled model compatibility check using just the compat info (#25841)
- Enable ABSL_FLAGS flag registration for onnxruntime_perf_test for mobile build (#25849)
- Add default constructor to Ort::Status. (#25860)
### Description
We have a big packaging pipeline that builds nuget/java/nodejs packages and then runs tests on them. This PR splits the tests into a dedicated pipeline and refactors the code to use Maven to download dependencies instead of direct HTTP fetches. The new approach allows us to use Azure DevOps Artifacts as an internal mirror to meet network isolation requirements. This PR also enables WebGPU and CoreML EP tests for the java package on macOS, and updates tools/python/run_packaging_pipelines.py to add support for RC releases.

### Motivation and Context
Make the packaging pipelines smaller and easier to use.
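For illustration, the Maven-based approach boils down to letting `mvn` resolve artifacts through whatever repository is configured (such as an Azure DevOps Artifacts feed acting as a mirror) rather than fetching URLs directly. A minimal sketch, with a placeholder artifact coordinate:

```python
import subprocess

# Hedged sketch: the artifact coordinate below is a placeholder, not one of
# the actual dependencies. Maven resolves it through the repositories in
# settings.xml, so an Azure DevOps Artifacts feed can act as an internal mirror.
subprocess.run(
    ["mvn", "dependency:get", "-Dartifact=com.example:some-dep:1.0.0"],
    check=True,  # fail the pipeline step if resolution fails
)
```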
…gth (#25594)

### Description
#25372 adds sliding window support for Group Query Attention, disabling Flash Attention as it's not yet supported. This PR adds a check for the sliding window and applies Flash Attention when the window size exceeds the KV cache length or total sequence length.

### Motivation and Context
See above.
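As a rough illustration of the added check (names are hypothetical, not the actual ORT internals), the dispatch amounts to:

```python
# Hypothetical sketch of the dispatch described above; parameter names are
# illustrative. local_window_size == -1 conventionally means "no sliding window".
def can_use_flash_attention(local_window_size: int, total_sequence_length: int) -> bool:
    # Flash Attention is safe when there is no sliding window, or when the
    # window already covers the whole KV cache / total sequence.
    return local_window_size == -1 or local_window_size >= total_sequence_length
```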
…5673)

### Description
Relax WeightBiasQuantization constraint for larger QDQ node group.

### Motivation and Context
The transformer `WeightBiasQuantization` quantizes float weights on the `Q -> DQ -> Conv/ConvTranspose/Gemm's Weights -> Q -> DQ` sequence. The check on `Weights -> Q` (`children_nodes.size() != 1 || children_nodes[0]->OpType() != QDQ::QOpName`) is an issue because it skips quantization for many common patterns, such as an unfused activation following `Conv` (`DQ -> Conv -> ReLU -> Q`). Checking for the ending Q is actually unnecessary (the fold can happen anyway without changing model semantics). However, to minimize the behavior change, this PR simply extends the pattern to accept a single-path (no branch), type-preserving path leading to `Q`, enabling more quantization support.
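For concreteness, here is a minimal sketch (hypothetical shapes and names) of the kind of graph the relaxed check now covers, built with the `onnx` helper API:

```python
import numpy as np
import onnx
from onnx import TensorProto, helper, numpy_helper

# DQ -> Conv(float weights) -> ReLU -> Q, where ReLU is the single-path,
# type-preserving node between Conv and the ending Q.
graph = helper.make_graph(
    nodes=[
        helper.make_node("DequantizeLinear", ["x_q", "s", "zp"], ["x"]),
        helper.make_node("Conv", ["x", "w"], ["c"]),   # "w" is float: the transformer's target
        helper.make_node("Relu", ["c"], ["r"]),        # previously caused the pattern to be skipped
        helper.make_node("QuantizeLinear", ["r", "s", "zp"], ["y_q"]),
    ],
    name="qdq_conv_relu_pattern",
    inputs=[helper.make_tensor_value_info("x_q", TensorProto.UINT8, [1, 1, 4, 4])],
    outputs=[helper.make_tensor_value_info("y_q", TensorProto.UINT8, [1, 1, 4, 4])],
    initializer=[
        numpy_helper.from_array(np.ones((1, 1, 1, 1), np.float32), "w"),
        numpy_helper.from_array(np.array(0.1, np.float32), "s"),
        numpy_helper.from_array(np.array(0, np.uint8), "zp"),
    ],
)
model = helper.make_model(graph)
onnx.checker.check_model(model)
```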
### Description
This change adds CUDA Graph support to the NV TensorRT RTX Execution Provider (EP).

### Motivation and Context
Integrating CUDA Graphs into the NV TRT RTX EP provides:
- Lower latency by minimizing per-kernel launch overhead.
- Better throughput for repeated inference runs.
- Improved efficiency on GPUs that are sensitive to kernel launch overhead.

---------

Co-authored-by: Maximilian Mueller <[email protected]>
Co-authored-by: Gaurav Garg <[email protected]>
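From the Python API, enabling the feature would presumably mirror the CUDA EP's option; in the following hedged sketch, both the provider name string and the `enable_cuda_graph` key are assumptions to check against the EP's documentation:

```python
import onnxruntime as ort

# Assumption: the NV TensorRT RTX EP exposes an "enable_cuda_graph" provider
# option like the CUDA EP does; the provider name string is also an assumption.
sess = ort.InferenceSession(
    "model.onnx",  # placeholder model path
    providers=[("NvTensorRTRTXExecutionProvider", {"enable_cuda_graph": "1"})],
)
# Note: CUDA Graph capture/replay generally requires stable input shapes and
# buffer addresses across runs, so pair this with IO binding in practice.
```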
### Description
1. A small change to use the shared allocator in the Python binding.
2. Remove FP64 support from the EP.

### Motivation and Context
The Python GPU IO binding is necessary for performance. The change enables the shared allocator for GPU allocation. FP64 was falling back to FP32 inference, so removing it aligns with TRT RTX support.

---------

Co-authored-by: Gaurav Garg <[email protected]>
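For reference, a minimal GPU IO-binding sketch with the Python API (model and tensor names are placeholders; shown here with the CUDA EP, though the same pattern applies to other GPU EPs):

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

# Place input and output on the GPU so run() performs no host/device copies.
x = ort.OrtValue.ortvalue_from_numpy(np.zeros((1, 3, 224, 224), np.float32), "cuda", 0)
y = ort.OrtValue.ortvalue_from_shape_and_type((1, 1000), np.float32, "cuda", 0)

io = sess.io_binding()
io.bind_ortvalue_input("input", x)    # placeholder input name
io.bind_ortvalue_output("output", y)  # placeholder output name
sess.run_with_iobinding(io)
result = y.numpy()  # explicit copy back to host only when needed
```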
### Description
This change fixes correctness issues in two areas that were causing failures in onnxruntime_test_all:
- DynamicQuantizeMatMul.WithConstantBInputs
- AttentionTest.Attention3DDefault
- AttentionTest.Attention3DWithPastAndPresentQkMatmul

What was wrong and how it's fixed:
1) DynamicQuantizeMatMul.WithConstantBInputs
   - Root cause: The Kleidi dynamic quantization GEMM path could be selected even when the B scales contained invalid values (zero, negative, or non-finite). That violates kernel assumptions and can lead to incorrect results.
   - Fix: In `onnxruntime/contrib_ops/cpu/quantization/dynamic_quantize_matmul.cc`, we now explicitly validate that all B scales are finite and strictly positive before enabling the Kleidi/MLAS dynamic path; if any scale is invalid, we disable that path, as sketched below.
2) Attention tests (Attention3DDefault, Attention3DWithPastAndPresentQkMatmul)
   - Root causes in `onnxruntime/core/mlas/lib/kleidiai/sgemm_kleidiai.cpp`:
     - Incorrect handling of GEMM corner cases for alpha/beta and K==0 (e.g., not respecting C = beta*C when alpha==0 or K==0).
     - Unnecessary or premature fallbacks for small shapes.
   - Fixes:
     - Add early-outs for degenerate sizes: if M==0 or N==0, return handled.
     - Correctly implement alpha/beta semantics.

---------

Signed-off-by: Jonathan Clohessy <[email protected]>
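The B-scale guard mentioned in (1) reduces to a check along these lines (a Python sketch, not the actual C++ code):

```python
import numpy as np

# Minimal sketch of the guard described above: the fast dynamic-quantization
# path is only taken when every B scale is finite and strictly positive.
def can_use_dynamic_quant_path(b_scales: np.ndarray) -> bool:
    return bool(np.all(np.isfinite(b_scales)) and np.all(b_scales > 0))
```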
### Description
While memory profiling some models, I noticed multiple file mapping failures in `WindowsEnv::MapFileIntoMemory()`. While it properly checks that the mapping offset is granularity aligned, it calculates the offset as page aligned. Also, when saving external tensors we do not need to align big tensors to the Windows allocation granularity or anything else platform dependent; set the alignment to 4096 for all platforms. Granularity matters only for calculating the mapping address.

### Motivation and Context
Multiple file mapping failures for certain models. This saves some hundreds of MBs for some models.
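The distinction between the two alignments can be sketched as follows (constants are illustrative; on Windows the allocation granularity is typically 64 KiB while pages are 4 KiB):

```python
ALLOCATION_GRANULARITY = 64 * 1024  # typical dwAllocationGranularity on Windows
PAGE_SIZE = 4 * 1024                # typical page size

def aligned_mapping_offset(file_offset: int) -> int:
    # MapViewOfFile requires the offset to be a multiple of the allocation
    # granularity, not merely page aligned; the remainder is re-added to the
    # returned pointer to locate the requested bytes inside the view.
    return file_offset - (file_offset % ALLOCATION_GRANULARITY)
```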
…at info (#25841)

### Description
This PR adds a new API that applications can use to verify compatibility of a precompiled model with the underlying system, using only the compatibility info string from the model's metadata.

### Motivation and Context
This feature enables apps to check compatibility of a precompiled model without necessarily having the model locally on the device. Precompiled models can therefore be stored remotely and downloaded only once the application has confirmed the validity of a given model with the EPs on the device.

### Testing
- New unit tests pass.
- For regression testing, built a private version of WinML + AMD NPU EP with these changes. Ran the Cpp Selfcontained Desktop sample successfully; ran with compilation and also re-ran using the already-compiled model to verify that session initialization continued to work as expected.

---------

Co-authored-by: Aditya Rastogi <[email protected]>
…ile build (#25849)

### Description
`ABSL_FLAGS_STRIP_NAMES` is set to 1 by default to disable flag registration when building for Android, iPhone, and "embedded devices". As a result, running onnxruntime_perf_test on Android fails because the flags are not registered.

<img width="872" height="182" alt="onnxruntime_perf_test reporting unregistered flags" src="https://github.com/user-attachments/assets/eb6a6772-cdff-4d60-a3c7-4352477e956c" />

Set `ABSL_FLAGS_STRIP_NAMES` to 0 by default for all builds.
### Description
Fix packaging pipelines.

### Motivation and Context
During CIs and local builds, Ort::Status() gets inherited from the base class via using directives; however, that does not work for the packaging pipelines. Having a default ctor is also important for storing Status in containers if needed.
All the cherry-picks merged cleanly, but there are some strange build errors (missing symbols). Investigating ...
…onnxruntime into users/snnn/rel-1.23.0
Added a line of code to fix the missing include issue.
### Description
When using the attention bias input for the GQA op with FP16 on platforms that don't natively support FP16 math, a cast to FP32 needs to be performed, and thus a temporary buffer needs to be created to store the FP32 values. The issue is that this temporary buffer was being allocated and deallocated inside a loop, for every token being processed. Refactored the implementation so that the allocation takes place only once. Phi model throughput increased by 15%.
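The shape of the fix is the classic hoist-out-of-loop pattern; a sketch with hypothetical names and shapes:

```python
import numpy as np

def process_tokens(attn_bias_fp16: np.ndarray, num_tokens: int) -> None:
    # Before: a new fp32 buffer was allocated (and freed) on every iteration.
    # After: allocate the scratch buffer once and reuse it for every token.
    scratch_fp32 = np.empty(attn_bias_fp16.shape, dtype=np.float32)
    for _ in range(num_tokens):
        np.copyto(scratch_fp32, attn_bias_fp16)  # fp16 -> fp32 cast into the reused buffer
        ...  # attention math for this token consumes scratch_fp32
```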
### Description
This change builds on top of #25841 and adds the scaffolding necessary to call into this API from C++ / C# / Python.

### Motivation and Context
#25454 talks more about the broader notion of precompiled model compatibility. This change is directed at app developers whose apps may want to determine whether a particular precompiled model (e.g. on a server somewhere) is compatible with the device where the application is running. `OrtEpFactory` contains functionality for making this determination, which was exposed as a C API in #25841; this change makes the API more broadly available in other languages.

### Testing and Validation
Introduced new unit test cases for each language, and verified that the API was called and returned the correct result for the default CPU EP.

---------

Co-authored-by: Aditya Rastogi <[email protected]>
### Description
This update introduces multiple improvements, fixes, and feature enhancements to the OpenVINO Execution Provider (OVEP) and related components in ONNX Runtime:

#### Configuration & Properties
- Updated load_config mapping to act as a passthrough to OpenVINO properties.
- Added support for providing layout information to inputs/outputs in OpenVINO.

#### Inference & Tensor Handling
- Improved OVInferRequest::SetTensor to correctly handle cached binding shape mismatches.
- Added support for self-detecting on-the-fly bfloat16 → float16 conversion.
- Fixed issues with input ONNX models when used with shared execution contexts.

#### Model Handling & Operator Support
- Fixed model copying behavior for QDQ stripping.
- Updated operator support status for OpenVINO 2025.2.

#### Platform & Integration Fixes
- Applied multiple PSU Lora fixes and related updates.
- Resolved filename confusion issues with wrapped OVIRs in EPCtx.
- Enabled memory-mapped native binaries for OpenVINO 2025.3.

#### Quality & Maintenance
- Addressed linting issues.
- Fixed coverage gaps in OVEP.
- Added a new test script for OpenVINO with ORT ABI integration.

---------

Co-authored-by: Ankit Maheshkar <[email protected]>
Co-authored-by: Ryan Metcalfe <[email protected]>
Co-authored-by: Klimenko, Mikhail <[email protected]>
Co-authored-by: sfatimar <[email protected]>
Co-authored-by: Garth Long <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: MayureshV1 <[email protected]>
Co-authored-by: Eric Crawford <[email protected]>
Co-authored-by: jatinwadhwa921 <[email protected]>
Co-authored-by: Vishnudas Thaniel S <[email protected]>
Co-authored-by: Javier Martinez <[email protected]>
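As an example of the load_config passthrough, the provider option points at a JSON file whose entries are forwarded to OpenVINO properties (the file name and model path here are illustrative):

```python
import onnxruntime as ort

# Assumption: "load_config" takes a path to a JSON file mapping device names
# to OpenVINO properties, which OVEP forwards as a passthrough.
sess = ort.InferenceSession(
    "model.onnx",
    providers=[("OpenVINOExecutionProvider", {"load_config": "ov_config.json"})],
)
```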
### Description
1. Check the process exit code when running 7z.exe. Previously, errors were silently ignored.
2. Add the snld20 flag to the 7z.exe commands, which is needed for compatibility with the latest 7z release.
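In Python terms, the two fixes amount to something like the following (archive and output paths are placeholders):

```python
import subprocess

# check=True surfaces a nonzero 7z exit code as an exception instead of
# silently ignoring it; -snld20 is the switch the description above says
# the latest 7z release needs.
subprocess.run(
    ["7z.exe", "x", "package.7z", "-snld20", "-oC:\\extract"],
    check=True,
)
```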
adrianlizarraga approved these changes on Aug 29, 2025
hanbitmyths approved these changes on Aug 29, 2025
jywu-msft approved these changes on Aug 29, 2025