Conversation

@snnn snnn commented Aug 28, 2025

fs-eire and others added 12 commits July 31, 2025 17:55
### Description
We have a big packaging pipeline that builds the nuget/java/nodejs packages and then runs their tests. This PR splits the tests into a dedicated pipeline and refactors the code to use Maven to download dependencies instead of fetching them over direct HTTP. The new approach allows us to use Azure DevOps Artifacts as an internal mirror to meet network isolation requirements. This PR also enables WebGPU and CoreML EP tests for the Java package on macOS.

This PR also updates tools/python/run_packaging_pipelines.py a bit to add support for RC releases.

### Motivation and Context
Make the packaging pipelines smaller and easier to use.
…gth (#25594)

### Description
#25372 adds sliding window support for Group Query Attention, disabling
Flash Attention as it's not yet supported.

This PR adds a check for the sliding window and applies Flash Attention
when the window size exceeds the KV cache length or total sequence
length.
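
A minimal sketch of the dispatch condition described above (illustrative names, not the actual GQA implementation):

```cpp
// Hedged sketch: Flash Attention has no sliding-window support in this path,
// but a window at least as large as the total sequence is effectively a
// no-op, so Flash Attention remains safe to use in that case.
inline bool CanUseFlashAttention(int local_window_size, int total_sequence_length) {
  return local_window_size < 0 ||                     // no sliding window requested
         local_window_size >= total_sequence_length;  // window covers everything
}
```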

### Motivation and Context
See above.
…5673)

### Description
Relax WeightBiasQuantization constraint for larger QDQ node group

### Motivation and Context
The `WeightBiasQuantization` transformer quantizes float weights in the `Q -> DQ -> Conv/ConvTranspose/Gemm's Weights -> Q -> DQ` sequence. The check on `Weights -> Q` (`children_nodes.size() != 1 || children_nodes[0]->OpType() != QDQ::QOpName`) is an issue because it skips quantization for many common patterns, such as an unfused activation following `Conv` (`DQ -> Conv -> ReLU -> Q`).

Checking for the trailing `Q` here is actually unnecessary (the fold can happen anyway without changing model semantics). However, to minimize the behavior change, this PR simply extends the pattern to also accept a single-path (no branches), type-preserving chain leading to `Q`, enabling quantization in more cases.
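A hedged sketch of the relaxed chain walk (illustrative, not the actual transformer code; the type-preservation test is omitted for brevity):

```cpp
#include "core/graph/graph.h"  // assumption: onnxruntime graph/Node API

// Follow a single-consumer chain from the candidate node; accept the pattern
// only if the branch-free path eventually reaches a QuantizeLinear.
bool LeadsToQThroughSinglePath(const onnxruntime::Node* node) {
  while (node != nullptr) {
    if (node->OpType() == "QuantizeLinear") return true;
    if (node->GetOutputEdgesCount() != 1) return false;  // branch: reject
    node = &*node->OutputNodesBegin();                   // follow the chain
  }
  return false;
}
```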
### Description
This change adds CUDA Graph support to the NV TensorRT RTX Execution
Provider (EP).

### Motivation and Context
Integrating CUDA Graphs into the NV TRT RTX EP provides:
- Lower latency by minimizing per-kernel launch overhead.
- Better throughput for repeated inference runs.
- Improved efficiency on GPUs that are sensitive to kernel launch overhead.
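
As a rough illustration of the mechanism (a generic CUDA Graphs capture/replay sketch, assuming CUDA 12's `cudaGraphInstantiate` overload; not the EP's actual integration code):

```cpp
#include <cuda_runtime.h>

// Hedged sketch: capture the work once, then replay it; each replay is a
// single launch call instead of one launch per kernel.
void RunWithCudaGraph(cudaStream_t stream, int iterations) {
  cudaGraph_t graph = nullptr;
  cudaGraphExec_t graph_exec = nullptr;

  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  // ... enqueue the inference kernels on `stream` here ...
  cudaStreamEndCapture(stream, &graph);

  cudaGraphInstantiate(&graph_exec, graph, /*flags=*/0);
  for (int i = 0; i < iterations; ++i) {
    cudaGraphLaunch(graph_exec, stream);  // one launch per inference run
  }
  cudaStreamSynchronize(stream);

  cudaGraphExecDestroy(graph_exec);
  cudaGraphDestroy(graph);
}
```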

---------

Co-authored-by: Maximilian Mueller <[email protected]>
Co-authored-by: Gaurav Garg <[email protected]>
### Description
1. A small change to use the shared allocator in the Python binding.
2. Remove FP64 support from the EP.


### Motivation and Context

The Python GPU IO binding is necessary for performance, and this change enables the shared allocator for GPU allocations.
FP64 was actually running FP32 inference, so removing it aligns the EP with what TensorRT RTX supports.

---------

Co-authored-by: Gaurav Garg <[email protected]>
### Description
This change fixes correctness issues in two areas that were causing
failures in onnxruntime_test_all:

- DynamicQuantizeMatMul.WithConstantBInputs
- AttentionTest.Attention3DDefault
- AttentionTest.Attention3DWithPastAndPresentQkMatmul

What was wrong and how it’s fixed
1) DynamicQuantizeMatMul.WithConstantBInputs
- Root cause: The Kleidi dynamic quantization GEMM path could be selected even when the B scales contained invalid values (zero, negative, or non-finite). That violates kernel assumptions and can lead to incorrect results.
- Fix: In
`onnxruntime/contrib_ops/cpu/quantization/dynamic_quantize_matmul.cc`,
we now explicitly validate that all B scales are finite and strictly
positive before enabling the Kleidi/MLAS dynamic path. If any scale is
invalid, we disable that path.
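
A minimal sketch of that validation (assuming a simple span of B scales; the real check lives in `dynamic_quantize_matmul.cc`):

```cpp
#include <cmath>
#include <gsl/span>

// Hedged sketch: the Kleidi/MLAS dynamic path assumes every B scale is
// finite and strictly positive; otherwise take the fallback path.
bool AllBScalesValid(gsl::span<const float> b_scales) {
  for (float s : b_scales) {
    if (!std::isfinite(s) || s <= 0.0f) return false;
  }
  return true;
}
```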

2) Attention tests (Attention3DDefault,
Attention3DWithPastAndPresentQkMatmul)
- Root causes in `onnxruntime/core/mlas/lib/kleidiai/sgemm_kleidiai.cpp`:
  - Incorrect handling of GEMM corner cases for alpha/beta and K==0 (e.g., not respecting C = beta*C when alpha==0 or K==0).
  - Unnecessary or premature fallbacks for small shapes.
- Fixes:
  - Add early-outs for degenerate sizes: if M==0 or N==0, return handled.
  - Correctly implement alpha/beta semantics, as sketched below:
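
A minimal sketch of the intended semantics (scalar reference behavior for `C = alpha*A*B + beta*C`; illustrative, not the KleidiAI kernel):

```cpp
#include <cstddef>

// Hedged sketch: when alpha == 0 or K == 0 the A*B term vanishes, so only
// C = beta*C must be applied (beta == 0 meaning overwrite with zeros).
void SgemmCornerCases(size_t M, size_t N, size_t K,
                      float alpha, float beta, float* C, size_t ldc) {
  if (M == 0 || N == 0) return;  // degenerate size: nothing to do, handled
  if (alpha == 0.0f || K == 0) {
    for (size_t i = 0; i < M; ++i)
      for (size_t j = 0; j < N; ++j)
        C[i * ldc + j] = (beta == 0.0f) ? 0.0f : beta * C[i * ldc + j];
    return;
  }
  // ... otherwise dispatch to the optimized GEMM path ...
}
```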

---------

Signed-off-by: Jonathan Clohessy <[email protected]>
### Description
While memory profiling some models I noticed multiple file mapping failures.
`WindowsEnv::MapFileIntoMemory()` properly checks that the mapping offset is allocation-granularity aligned, but it actually computes the offset as page aligned.
Also, when saving external tensors we do not need to align big tensors to the Windows allocation granularity or anything else platform dependent; set the alignment to 4096 for all platforms. Granularity matters only for calculating the mapping address.
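
A minimal sketch of the alignment being described (standard Win32 pattern; illustrative, not the exact `WindowsEnv` code):

```cpp
#include <windows.h>
#include <cstdint>

// Hedged sketch: MapViewOfFile requires the file offset to be a multiple of
// the allocation granularity (commonly 64 KiB), not the page size (4 KiB).
void* MapAtOffset(HANDLE mapping, uint64_t desired_offset, size_t length) {
  SYSTEM_INFO si;
  GetSystemInfo(&si);
  const uint64_t granularity = si.dwAllocationGranularity;  // not dwPageSize
  const uint64_t aligned = desired_offset - (desired_offset % granularity);
  const size_t adjust = static_cast<size_t>(desired_offset - aligned);
  char* view = static_cast<char*>(MapViewOfFile(
      mapping, FILE_MAP_READ,
      static_cast<DWORD>(aligned >> 32),
      static_cast<DWORD>(aligned & 0xFFFFFFFFu),
      length + adjust));
  return view ? view + adjust : nullptr;  // hand back the requested offset
}
```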

### Motivation and Context
Multiple file mapping failures for certain models.
The change saves hundreds of MBs for some models.
…at info (#25841)

### Description
This PR adds a new API that applications can use to verify compatibility
of a precompiled model with the underlying system, using only the
compatibility info string from the model's metadata.

### Motivation and Context
- This is a feature to enable apps to check compatibility of a
precompiled model without necessarily having the model locally on the
device. This enables precompiled models to be stored remotely and
downloaded once the application has been able to confirm the validity of
a given model with EPs on the device.

### Testing
- New unit tests pass 
- For regression testing, built a private version of WinML + AMD NPU EP
with these changes. Ran the Cpp Selfcontained Desktop sample
successfully; ran with compilation and also re-ran using the
already-compiled model to verify that session initialization continued
to work as expected.

---------

Co-authored-by: Aditya Rastogi <[email protected]>
…ile build (#25849)

### Description
`ABSL_FLAGS_STRIP_NAMES` is set to 1 by default to disable flag
registration when building for Android, iPhone, and "embedded devices".
As a result, running onnxruntime_perf_test on Android fails because its
flags are not registered.

<img width="872" height="182" alt="image (2)"
src="https://github.com/user-attachments/assets/eb6a6772-cdff-4d60-a3c7-4352477e956c"
/>

Set `ABSL_FLAGS_STRIP_NAMES` to 0 by default for all builds.
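
For context, the macro's effect in a translation unit (a minimal sketch; in the build the value is set by the build system so it stays consistent across all of Abseil):

```cpp
// Hedged sketch: with ABSL_FLAGS_STRIP_NAMES=1, Abseil strips flag names and
// help text at compile time, so command-line lookups like --num_runs cannot
// resolve. Defining it to 0 keeps registration intact.
#define ABSL_FLAGS_STRIP_NAMES 0

#include "absl/flags/flag.h"
#include "absl/flags/parse.h"

ABSL_FLAG(int, num_runs, 10, "Number of inference runs.");

int main(int argc, char** argv) {
  absl::ParseCommandLine(argc, argv);  // resolves --num_runs only if names are kept
  return absl::GetFlag(FLAGS_num_runs) > 0 ? 0 : 1;
}
```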
### Description
Fix packaging pipelines


### Motivation and Context
During CI and local builds, `Ort::Status()` (the default constructor) is inherited from the base class via using directives; however, that does not work in the packaging pipelines.
Having a default ctor is also important for storing `Status` in containers if needed.
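
A minimal sketch of the container point (illustrative; `std::vector<T>::resize` requires `T` to be default constructible):

```cpp
#include <vector>
#include "onnxruntime_cxx_api.h"

// Hedged sketch: without a default ctor on Ort::Status, this resize call
// fails to compile.
void CollectStatuses(size_t n) {
  std::vector<Ort::Status> statuses;
  statuses.resize(n);  // requires Ort::Status()
}
```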
snnn commented Aug 29, 2025

All the cherry-picks merged cleanly, but there are some strange build errors (missing symbols):

onnxruntime\test\platform\file_io_test.cc(171,5): Error C3861: 'ASSERT_STATUS_OK': identifier not found

Investigating ...

snnn commented Aug 29, 2025

Added a line of code to fix the missing include issue.

derdeljan-msft and others added 5 commits August 29, 2025 13:13
### Description

When using the attention bias input for the GQA op with FP16, platforms
that don't natively support FP16 math need a cast to FP32, and thus a
temporary buffer to store the FP32 values. The issue was that this
temporary buffer was being allocated and deallocated inside a loop, once
for every token processed. Refactored the implementation so that the
allocation takes place only once.

Phi model throughput increased by 15%.
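
A minimal sketch of the hoisting pattern described above (illustrative names; not the actual GQA kernel):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hedged sketch: allocate the FP32 scratch buffer once, outside the token
// loop, instead of allocating and freeing it per token.
void ProcessTokens(size_t num_tokens, size_t bias_len, const uint16_t* fp16_bias) {
  std::vector<float> fp32_bias(bias_len);  // hoisted: one allocation total
  for (size_t t = 0; t < num_tokens; ++t) {
    // Convert the FP16 bias into the reused FP32 buffer, then run the
    // attention math for token t (conversion and math elided).
  }
}
```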
### Description
This change builds on top of #25841 and adds the scaffolding necessary
to call into this API from C++ / C# / Python.
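
As a rough illustration of the intended call pattern (every name below is a placeholder, not the real API surface added by these PRs):

```cpp
#include <string>

// Hedged sketch with hypothetical names: given only the compatibility info
// string from a precompiled model's metadata, ask whether the model would
// run on this device before downloading it.
enum class CompatStatus { kCompatible, kNeedsRecompile, kNotSupported };

// Placeholder declaration standing in for the API exposed in #25841.
CompatStatus CheckModelCompatibility(const std::string& ep_name,
                                     const std::string& compat_info);

bool ShouldDownloadModel(const std::string& compat_info) {
  return CheckModelCompatibility("SomeNpuEp", compat_info) ==
         CompatStatus::kCompatible;
}
```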

### Motivation and Context
#25454 talks more about the broader notion of precompiled model
compatibility. This change is directed at app developers whose apps may
want to determine if a particular precompiled model (e.g. on a server
somewhere) is compatible with the device where the application is
running. There is functionality in `OrtEpFactory` for making this
determination, which was exposed as a C API in #25841, and this change
makes the API more broadly available in other languages.

### Testing and Validation
Introduced new unit test cases across each language, and verified that
the API was being called and returned the correct result for the default
CPU EP.

---------

Co-authored-by: Aditya Rastogi <[email protected]>
### Description
This update introduces multiple improvements, fixes, and feature enhancements to the OpenVINO Execution Provider (OVEP) and related components in ONNX Runtime:

#### Configuration & Properties

- Updated `load_config` mapping to act as a passthrough to OpenVINO properties (see the sketch below).
- Added support for providing layout information to inputs/outputs in OpenVINO.
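
A minimal usage sketch for the passthrough (assuming the provider-options form of enabling OVEP; treat the option values as illustrative):

```cpp
#include <string>
#include <unordered_map>
#include "onnxruntime_cxx_api.h"

// Hedged sketch: `load_config` points at a JSON file whose entries are
// forwarded to OpenVINO as device properties.
Ort::Session MakeOvSession(Ort::Env& env, const ORTCHAR_T* model_path) {
  Ort::SessionOptions so;
  std::unordered_map<std::string, std::string> ov_options{
      {"device_type", "GPU"},
      {"load_config", "ov_properties.json"},  // passthrough to OV properties
  };
  so.AppendExecutionProvider_OpenVINO_V2(ov_options);
  return Ort::Session(env, model_path, so);
}
```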

#### Inference & Tensor Handling

- Improved OVInferRequest::SetTensor to correctly handle cached binding shape mismatches.
- Added support for self-detecting on-the-fly bfloat16 → float16 conversion.
- Fixed issues with input ONNX models when used with shared execution contexts.

#### Model Handling & Operator Support

- Fixed model copying behavior for QDQ stripping.
- Updated operator support status for OpenVINO 2025.2.

#### Platform & Integration Fixes

- Applied multiple PSU Lora fixes and related updates.
- Resolved filename confusion issues with wrapped OVIRs in EPCtx.
- Enabled memory-mapped native binaries for OpenVINO 2025.3.

#### Quality & Maintenance

- Addressed linting issues.
- Fixed coverage gaps in OVEP.
- Added a new test script for OpenVINO with ORT ABI integration.

---------

Co-authored-by: Ankit Maheshkar <[email protected]>
Co-authored-by: Ryan Metcalfe <[email protected]>
Co-authored-by: Klimenko, Mikhail <[email protected]>
Co-authored-by: sfatimar <[email protected]>
Co-authored-by: Garth Long <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: MayureshV1 <[email protected]>
Co-authored-by: Eric Crawford <[email protected]>
Co-authored-by: jatinwadhwa921 <[email protected]>
Co-authored-by: Vishnudas Thaniel S <[email protected]>
Co-authored-by: Javier Martinez <[email protected]>
…ting Node_GetTensorAttributeAsOrtValue (#25886)

### Description
Replace `Node_GetTensorAttributeAsOrtValue` with
`OpAttr_GetTensorAttributeAsOrtValue`.
Change the API signature to make it one of the `OpAttr` interfaces
instead of the `OrtNode` interface.

The original API was added
[here](#25566).
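
A rough before/after of the call shape (hedged; the exact signatures live in the C API headers, and the lines below are illustrative):

```cpp
// Hedged sketch: the tensor attribute is now read off the OrtOpAttr handle
// directly instead of through the owning OrtNode, so callers no longer need
// the node just to materialize the attribute as an OrtValue.
//
//   before: api->Node_GetTensorAttributeAsOrtValue(node, attr, &value);
//   after:  api->OpAttr_GetTensorAttributeAsOrtValue(attr, &value);
```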
### Description
1. Check the process exit code when running 7z.exe. Previously, errors
were silently ignored.
2. Add the snld20 flag to the 7z.exe commands, which is needed for
compatibility with the latest 7z release.
@snnn snnn changed the title users/snnn/rel 1.23.0 Cherry-picks for 1.23.0 release Aug 29, 2025
@snnn snnn merged commit 30612fb into rel-1.23.0 Aug 29, 2025
126 of 144 checks passed
@snnn snnn deleted the users/snnn/rel-1.23.0 branch August 29, 2025 23:26
@snnn snnn mentioned this pull request Sep 16, 2025