Save much memory at model loading time by converting weights to OrtValues early #26345
Conversation
This reverts commit 3ca49d8. It also makes adjustments for the current source code state.
Pull Request Overview
This PR saves significant memory during model loading by converting weight initializers to OrtValues early in the graph construction process, rather than later during graph transformation. The changes revert the previous logic that deferred this conversion and implement early weight conversion at graph initialization time. The PR demonstrates dramatic memory savings during the optimization phase (from 8.1GB to 3.5GB for the Phi-4 Instruct model) by enabling reuse of OrtValues during constant folding operations.
Key changes:
- Early conversion of large initializers to OrtValues during graph construction
- Update of all graph transformation code to use AddInitializerWithExternalData instead of AddInitializer
- Removal of deferred initializer conversion logic from the session inference flow
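To illustrate the API shift in the second bullet, here is a minimal self-contained C++ sketch. The types below (`MiniOrtValue`, `MiniGraph`) are mocks invented for this example; only the method names mirror the real `Graph::AddInitializer` / `Graph::AddInitializerWithExternalData` calls, and the actual onnxruntime signatures differ.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Mock of an OrtValue: a reference-counted tensor buffer.
struct MiniOrtValue {
  std::shared_ptr<std::vector<float>> data;
};

// Mock graph holding named initializers.
struct MiniGraph {
  std::map<std::string, MiniOrtValue> initializers;

  // Old-style: copies the weight bytes into graph-owned storage (extra allocation).
  void AddInitializer(const std::string& name, const std::vector<float>& weights) {
    initializers[name] = {std::make_shared<std::vector<float>>(weights)};  // deep copy
  }

  // New-style: adopts the existing buffer, so no duplicate allocation is made.
  void AddInitializerWithExternalData(const std::string& name, MiniOrtValue value) {
    initializers[name] = std::move(value);  // transfers shared ownership only
  }
};
```

The point of the sketch is ownership: the old-style call duplicates the weight data, while the new-style call shares the buffer that already exists as an OrtValue.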
Reviewed Changes
Copilot reviewed 35 out of 35 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| onnxruntime/core/graph/graph.cc | Implements early initializer-to-OrtValue conversion during graph construction |
| include/onnxruntime/core/graph/graph.h | Removes ConvertInitializersIntoOrtValues method declaration |
| onnxruntime/core/session/inference_session.cc | Removes deferred initializer conversion call from transform pipeline |
| onnxruntime/core/optimizer/*.cc | Updates optimizer classes to use AddInitializerWithExternalData for new initializers |
| orttraining/orttraining/core/optimizer/*.cc | Updates training optimizer classes to use AddInitializerWithExternalData |
| onnxruntime/test/ir/graph_test.cc | Removes test code that called the now-removed conversion method |
| onnxruntime/test/framework/cuda/fence_cuda_test.cc | Updates test utility to use AddInitializerWithExternalData |
Description
Converts weights early and reverts "Properly remove in-memory references (#25652)"
This reverts commit 3ca49d8 and makes appropriate adjustments for the current state of the code.
This PR is made possible by, and follows on the heels of:
#26263
#25833.
Previous history:
#23979
#25320
#25626
#25652
The first change (#26263) allows us to convert initializers to OrtValues early and save lots of memory at model loading time.
Specifically, for the Phi-4-mini-instruct-INT4 model, the before and after memory profiles look like this:
Before:

*(memory profile screenshot)*

After:
The two peaks represent memory usage at optimization time (8.1GB before) and after weights memory mapping (6.5GB).
After this change, the corresponding numbers are 3.5GB and 4.7GB respectively.
Most of the savings during the optimization phase come from ConstantFolding, where we are able to reuse the resulting OrtValues directly for the new initializers.

This PR concludes a series of PRs converting initializers to OrtValues.
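The constant-folding reuse described above can be sketched as follows. This is a self-contained illustration, not onnxruntime code: the buffer type and the three functions are hypothetical stand-ins for folding a node and materializing its result as a new initializer.

```cpp
#include <cassert>
#include <memory>
#include <vector>

using Buffer = std::shared_ptr<std::vector<float>>;

// Mock constant folding: evaluates an Add node and returns the result buffer.
Buffer FoldAdd(const Buffer& a, const Buffer& b) {
  auto out = std::make_shared<std::vector<float>>(a->size());
  for (size_t i = 0; i < a->size(); ++i) (*out)[i] = (*a)[i] + (*b)[i];
  return out;
}

// Old path: serialize the result into a proto-like copy, then re-load it.
// Two extra allocations, so the folded data exists three times at the peak.
Buffer OldStyleInitializer(const Buffer& folded) {
  std::vector<float> proto_copy(*folded);                   // copy #1 (TensorProto-like)
  return std::make_shared<std::vector<float>>(proto_copy);  // copy #2 (reloaded weight)
}

// New path: the folded value itself becomes the initializer, zero extra copies.
Buffer NewStyleInitializer(Buffer folded) {
  return folded;  // shared ownership; no duplication
}
```

The design point is that the folding result is already tensor data in the right layout, so handing the same buffer to the graph avoids the round-trip through a serialized copy.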
Memory consumption before the conversion effort began was 9.3GB and 6.7GB respectively. We are saving almost 6GB during the optimization phase and 2GB in the steady state.
The model also loads about 12 seconds faster.
An example of ConstantFolding being one of the top memory contributors: memory is duplicated, producing a higher peak, until Resolve removes the no-longer-used initializers.

*(profiler screenshot)*
Motivation and Context
Reduce memory usage.