@gedoensmax
Contributor

This reconfiguration stops the arena from allocating tensors with an exactly matching size. With that strategy, a tensor always triggers a new allocation in the arena instead of reusing memory, because a freed block is reused only if its size matches the request exactly.
This became a big problem with ORT GenAI: the arena grew constantly when prompting with different prompt lengths, and no arena shrinkage was triggered to return the memory of older tensors. @skottmckay I am happy to be educated on a better way to use the allocators.
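
To illustrate the growth pattern described above, here is a minimal sketch in Python. It is a deliberately simplified toy model and not ONNX Runtime's arena implementation: `ToyArena`, `simulate`, the prompt lengths, and `bytes_per_token` are all hypothetical names and values chosen for illustration. The toy arena never returns memory to the device and reuses a freed chunk only when its size equals the requested size, optionally after rounding the request up to the next power of two.

```python
def next_pow2(n: int) -> int:
    """Smallest power of two >= n (n must be positive)."""
    return 1 << (n - 1).bit_length()


class ToyArena:
    """Toy arena: freed chunks are kept for reuse and never returned to the device."""

    def __init__(self, round_up: bool):
        self.round_up = round_up          # True: bucket sizes to powers of two
        self.free_chunks: list[int] = []  # sizes of chunks that are currently free
        self.reserved_bytes = 0           # total bytes ever requested from the device

    def alloc(self, nbytes: int) -> int:
        size = next_pow2(nbytes) if self.round_up else nbytes
        if size in self.free_chunks:
            # A free chunk of exactly this size exists, so reuse it.
            self.free_chunks.remove(size)
        else:
            # No matching chunk: the arena has to grow.
            self.reserved_bytes += size
        return size

    def free(self, size: int) -> None:
        self.free_chunks.append(size)


def simulate(round_up: bool, prompt_lengths: list[int], bytes_per_token: int = 4096) -> int:
    """Allocate and free one tensor per prompt; return the arena's final size."""
    arena = ToyArena(round_up)
    for length in prompt_lengths:
        chunk = arena.alloc(length * bytes_per_token)
        arena.free(chunk)
    return arena.reserved_bytes


# Hypothetical prompt lengths (in tokens), all different from each other.
prompt_lengths = [100, 117, 90, 128, 101, 75, 120]
exact = simulate(False, prompt_lengths)
bucketed = simulate(True, prompt_lengths)
print(exact, bucketed)
```

In this toy run the exact-size arena ends up holding 2,994,176 bytes because none of the seven freed chunks is ever reused, while the bucketed arena stays at 524,288 bytes because every request lands in the same power-of-two bucket. The real allocator is more sophisticated, so this only shows the shape of the problem, not actual numbers.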

Issues with this:
Since the arena is no longer used for workspace allocations (they use reserve instead), it will likely not be possible in the future to allocate on a stream and free the memory immediately after an enqueue call. That could have enabled workspace sharing in a multi-model pipeline very nicely.

@chilo-ms can you help merge this?

@chilo-ms
Contributor

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 5 pipeline(s).

@chilo-ms added the ep:NvRTX (NV RTX execution provider) and release:1.23.0 labels on Aug 20, 2025
@jywu-msft merged commit 63c1d1a into microsoft:main on Aug 22, 2025
86 checks passed
adrianlizarraga pushed a commit that referenced this pull request Aug 22, 2025
adrianlizarraga added a commit that referenced this pull request Aug 25, 2025
### Description
Cherry-pick the following PRs into the `rel-1.23.0` branch:
- #25592
- #25622
- #25688
- #25729
- #25743
- #25769
- #25745
- #25761
- #25751
- #25716
- #25228
- #25768
- #25788
- #25747
- #25800
- #25818
- #25762
- #25749
- #25831


---------

Co-authored-by: quic-tirupath <[email protected]>
Co-authored-by: quic-calvnguy <[email protected]>
Co-authored-by: qti-kromero <[email protected]>
Co-authored-by: Jeff Kilpatrick <[email protected]>
Co-authored-by: Scott McKay <[email protected]>
Co-authored-by: David Fan <[email protected]>
Co-authored-by: kuanyul-qti <[email protected]>
Co-authored-by: Dmitri Smirnov <[email protected]>
Co-authored-by: Chi Lo <[email protected]>
Co-authored-by: Edward Chen <[email protected]>
Co-authored-by: Chunye Wang@AMD <[email protected]>
Co-authored-by: minfhong-qti <[email protected]>
Co-authored-by: Vishal Agarwal <[email protected]>
Co-authored-by: Maximilian Müller <[email protected]>
Co-authored-by: Maximilian Müller <[email protected]>
Co-authored-by: Changming Sun <[email protected]>
Co-authored-by: adrastogi <[email protected]>
Co-authored-by: Aditya Rastogi <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
gedoensmax added a commit to gedoensmax/onnxruntime that referenced this pull request Sep 2, 2025
…rosoft#25800)
