From 591aea38b2ae742949142ae4446415b70c03a991 Mon Sep 17 00:00:00 2001 From: Vishal Agarwal Date: Tue, 2 Sep 2025 23:38:26 +0530 Subject: [PATCH 01/15] initial changes --- docs/build/eps.md | 46 +++-- .../TensorRTRTX-ExecutionProvider.md | 172 ++++++------------ 2 files changed, 87 insertions(+), 131 deletions(-) diff --git a/docs/build/eps.md b/docs/build/eps.md index b55f0e381c8bd..6e6fed2b3777d 100644 --- a/docs/build/eps.md +++ b/docs/build/eps.md @@ -161,6 +161,37 @@ Dockerfile instructions are available [here](https://github.com/microsoft/onnxru --- +## NVIDIA TensorRT RTX + +See more information on the TensorRT RTX Execution Provider [here](../execution-providers/TensorRTRTX-ExecutionProvider.md). + +## Minimum requirements + +| ONNX Runtime | TensorRT-RTX | CUDA Toolkit | +| :----------- | :----------- | :------------- | +| main branch | 1.1 | 12.x | +| 1.22 | 1.0 | 12.x | + +### Pre-requisites +* Download latest [CUDA toolkit](https://developer.nvidia.com/cuda-toolkit) + * The path to the CUDA installation must be set via the `CUDA_HOME` environment variable, or use the `--cuda_home` arg in the build command. The installation directory should contain `bin`, `include` and `lib` sub-directories. +* Download [TensorRT RTX](https://developer.nvidia.com/tensorrt-rtx) +* Visual Studio Build Tools - https://visualstudio.microsoft.com/downloads/ + +### Windows + +```ps +.\build.bat --config Release --parallel --use_nv_tensorrt_rtx --tensorrt_rtx_home path\to\tensorrt-rtx" --cuda_home "path\to\cuda\home" --cmake_generator "Visual Studio 17 2022" --build_shared_lib --skip_tests --build --update --use_vcpkg +``` + +### Linux + +```sh +./build.sh --config Release --parallel --use_nv_tensorrt_rtx --tensorrt_rtx_home path/to/tensorrt-rtx" --cuda_home "path/to/cuda/home" --build_shared_lib --skip_tests --build --update +``` + +--- + ## NVIDIA Jetson TX1/TX2/Nano/Xavier/Orin ### Build Instructions @@ -235,20 +266,7 @@ These instructions are for the latest [JetPack SDK](https://developer.nvidia.com * For a portion of Jetson devices like the Xavier series, higher power mode involves more cores (up to 6) to compute but it consumes more resource when building ONNX Runtime. Set `--parallel 1` in the build command if OOM happens and system is hanging. -## TensorRT-RTX - -See more information on the NV TensorRT RTX Execution Provider [here](../execution-providers/TensorRTRTX-ExecutionProvider.md). - -### Prerequisites -{: .no_toc } - - * Follow [instructions for CUDA execution provider](#cuda) to install CUDA and setup environment variables. - * Intall TensorRT for RTX from nvidia.com (TODO: add link when available) - -### Build Instructions -{: .no_toc } -`build.bat --config Release --parallel 32 --build_dir _build --build_shared_lib --use_nv_tensorrt_rtx --tensorrt_home "C:\dev\TensorRT-RTX-1.1.0.3" --cuda_home "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.9" --cmake_generator "Visual Studio 17 2022" --use_vcpkg` -Replace the --tensorrt_home and --cuda_home with correct paths to CUDA and TensorRT-RTX installations. 
+--- ## oneDNN diff --git a/docs/execution-providers/TensorRTRTX-ExecutionProvider.md b/docs/execution-providers/TensorRTRTX-ExecutionProvider.md index 4c97b79c60534..78ab35682b844 100644 --- a/docs/execution-providers/TensorRTRTX-ExecutionProvider.md +++ b/docs/execution-providers/TensorRTRTX-ExecutionProvider.md @@ -1,23 +1,26 @@ --- title: NVIDIA - TensorRT RTX -description: Instructions to execute ONNX Runtime on NVIDIA RTX GPUs with the Nvidia TensorRT RTX execution provider +description: Instructions to execute ONNX Runtime on NVIDIA RTX GPUs with the NVIDIA TensorRT RTX execution provider parent: Execution Providers nav_order: 17 redirect_from: /docs/reference/execution-providers/TensorRTRTX-ExecutionProvider --- -# Nvidia TensorRT RTX Execution Provider +# NVIDIA TensorRT RTX Execution Provider {: .no_toc } -Nvidia TensorRT RTX execution provider is the preferred execution provider for GPU acceleration on consumer hardware (RTX PCs). It is more straightforward to use than the datacenter focused legacy TensorRT Execution provider and more performant than CUDA EP. -Just some of the things that make it a better fit on RTX PCs than our legacy TensorRT Execution Provider: -* Much smaller footprint -* Much faster model compile/load times. -* Better usability in terms of use of cached models across multiple RTX GPUs. +NVIDIA TensorRT RTX Execution Provider (EP) is the **recommended** choice for GPU acceleration on NVIDIA consumer hardware (RTX PCs). It offers a more lightweight experience than the datacenter-focused TensorRT (TRT) EP and delivers superior performance compared to the other EPs. -The Nvidia TensorRT RTX execution provider in the ONNX Runtime makes use of NVIDIA's [TensorRT](https://developer.nvidia.com/tensorrt) RTX Deep Learning inferencing engine (TODO: correct link to TRT RTX documentation once available) to accelerate ONNX models on RTX GPUs. Microsoft and NVIDIA worked closely to integrate the TensorRT RTX execution provider with ONNX Runtime. +Here's why it's a better fit for RTX PCs than the legacy TensorRT EP: +* **Smaller package footprint:** Optimizes resource usage. +* **Faster model compile and load times:** Get up and running quicker. +* **Enhanced usability:** Seamlessly use cached models across multiple RTX GPUs. -Currently TensorRT RTX supports RTX GPUs from Ampere or later architectures. Support for Turing GPUs is coming soon. +The TensorRT RTX EP leverages NVIDIA's new deep learning inference engine, [TensorRT RTX](https://developer.nvidia.com/tensorrt-rtx), to accelerate ONNX models on RTX GPUs. Microsoft and NVIDIA collaborated closely to integrate the TensorRT RTX execution provider with ONNX Runtime. + +Currently, TensorRT RTX supports RTX GPUs based on Ampere and later architectures. Support for Turing GPUs is coming soon. + +For compatibility and support matrix, please refer to [this](https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/getting-started/support-matrix.html) page. ## Contents {: .no_toc } @@ -26,19 +29,15 @@ Currently TensorRT RTX supports RTX GPUs from Ampere or later architectures. Sup {:toc} ## Install -Please select the Nvidia TensorRT RTX version of Onnx Runtime: https://onnxruntime.ai/docs/install. (TODO!) -## Build from source -See [Build instructions](../build/eps.md#tensorrt-rtx). +Currently, TensorRT RTX EP can be only built from source code. Support for installation from package managers, such as PyPi and NuGet, is coming soon. 
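As a quick sanity check of a local build, the installed package should list the TensorRT RTX EP among its available providers (a minimal sketch, assuming the wheel built from source is installed in the current Python environment):

```python
# Minimal sanity check: the locally built/installed package should expose the EP.
import onnxruntime as ort

print(ort.__version__)
providers = ort.get_available_providers()
print(providers)  # the NV TensorRT RTX execution provider should appear in this list
```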
-## Requirements +## Build from source -| ONNX Runtime | TensorRT-RTX | CUDA | -| :----------- | :----------- | :------------- | -| main | 1.0 | 12.0-12.9 | -| 1.22 | 1.0 | 12.0-12.9 | +Information on how to build from source for TensorRT RTX EP can be found [here](../build/eps.md#nvidia-tensorrt-rtx). ## Usage + ### C/C++ ```c++ const auto& api = Ort::GetApi(); @@ -47,20 +46,36 @@ api.SessionOptionsAppendExecutionProvider(session_options, "NvTensorRtRtx", null Ort::Session session(env, model_path, session_options); ``` -The C API details are [here](../get-started/with-c.md). - ### Python -To use TensorRT RTX execution provider, you must explicitly register TensorRT RTX execution provider when instantiating the `InferenceSession`. +With Python APIs, you must explicitly register the TensorRT RTX EP when instantiating the `InferenceSession`. ```python import onnxruntime as ort sess = ort.InferenceSession('model.onnx', providers=['NvTensorRtRtxExecutionProvider']) ``` -## Configurations -TensorRT RTX settings can be configured via [TensorRT Execution Provider Session Option](./TensorRTRTX-ExecutionProvider.md#execution-provider-options). +## Execution Provider Options +TensorRT RTX EP provides the following user configurable options with the [Execution Provider Options](./TensorRTRTX-ExecutionProvider.md#execution-provider-options) + + +| Parameter | Type | Description | Default | +|-----------|------|-------------|---------| +| device_id | `int` | GPU device identifier | 0 | +| user_compute_stream | `str` | Specify compute stream to run GPU workload | "" | +| nv_max_workspace_size | `int` | Maximum TensorRT engine workspace (bytes) | 0 (auto) | +| nv_max_shared_mem_size | `int` | Maximum TensorRT engine workspace (bytes) | 0 (auto) | +| nv_dump_subgraphs | `bool` | Enable subgraph dumping for debugging | false | +| nv_detailed_build_log | `bool` | Enable detailed build logging | false | +| enable_cuda_graph | `bool` | Enable [CUDA graph](https://developer.nvidia.com/blog/cuda-graphs/) to reduce inference overhead. Helpful for smaller models | false | +| profile_min_shapes | `str` | Comma-separated list of input tensor shapes for the minimum optimization profile. Format: `"input1:dim1xdim2x...,input2:dim1xdim2x..."` | "" (auto) | +| profile_max_shapes | `str` | Comma-separated list of input tensor shapes for the maximum optimization profile. Format: `"input1:dim1xdim2x...,input2:dim1xdim2x..."` | "" (auto) | +| profile_opt_shapes | `str` | Comma-separated list of input tensor shapes for the optimal optimization profile. Format: `"input1:dim1xdim2x...,input2:dim1xdim2x..."` | "" (auto) | +| nv_multi_profile_enable | `bool` | Enable support for multiple optimization profiles in TensorRT engine. Allows dynamic input shapes for different inference requests | false | +| nv_use_external_data_initializer | `bool` | Use external data initializer for model weights. 
Useful for EP context large models with external data files | false | + +* + -Here are examples and different [scenarios](./TensorRTRTX-ExecutionProvider.md#scenario) to set NV TensorRT RTX EP session options: #### Click below for Python API example: @@ -115,92 +130,14 @@ Ort::ThrowOnError(api.SessionOptionsAppendExecutionProvider(session_options, "Nv -### Scenario - -| Scenario | NV TensorRT RTX EP Session Option | Type | -| :------------------------------------------------- | :----------------------------------------------------------------------------------------- | :----- | -| Specify GPU id for execution | [device_id](./TensorRTRTX-ExecutionProvider.md#device_id) | int | -| Set custom compute stream for GPU operations | [user_compute_stream](./TensorRTRTX-ExecutionProvider.md#user_compute_stream) | string | -| Set TensorRT RTX EP GPU memory usage limit | [nv_max_workspace_size](./TensorRTRTX-ExecutionProvider.md#nv_max_workspace_size) | int | -| Dump optimized subgraphs for debugging | [nv_dump_subgraphs](./TensorRTRTX-ExecutionProvider.md#nv_dump_subgraphs) | bool | -| Capture CUDA graph for reduced launch overhead | [nv_cuda_graph_enable](./TensorRTRTX-ExecutionProvider.md#nv_cuda_graph_enable) | bool | -| Enable detailed logging of build steps | [nv_detailed_build_log](./TensorRTRTX-ExecutionProvider.md#nv_detailed_build_log) | bool | -| Define min shapes | [nv_profile_min_shapes](./TensorRTRTX-ExecutionProvider.md#nv_profile_min_shapes) | string | -| Define max shapes | [nv_profile_max_shapes](./TensorRTRTX-ExecutionProvider.md#nv_profile_max_shapes) | string | -| Define optimal shapes | [nv_profile_opt_shapes](./TensorRTRTX-ExecutionProvider.md#nv_profile_opt_shapes) | string | - -> Note: for bool type options, assign them with **True**/**False** in python, or **1**/**0** in C++. - -### Execution Provider Options - -TensorRT RTX configurations can be set by execution provider options. It's useful when each model and inference session have their own configurations. All configurations should be set explicitly, otherwise default value will be taken. - -##### device_id - -* Description: GPU device ID. -* Default value: 0 - -##### user_compute_stream - -* Description: define the compute stream for the inference to run on. It implicitly sets the `has_user_compute_stream` option. The stream handle needs to be printed on a string as decimal number and passed down to the session options as shown in the example above. -* This can also be set using the python API. - * i.e The cuda stream captured from pytorch can be passed into ORT-NV TensorRT RTX EP. Click below to check sample code: -
- - ```python - import onnxruntime as ort - import torch - ... - sess = ort.InferenceSession('model.onnx') - if torch.cuda.is_available(): - s = torch.cuda.Stream() - provider_options = { - 'device_id': 0, - 'user_compute_stream': str(s.cuda_stream) - } - sess = ort.InferenceSession( - model_path, - providers=[('NvTensorRtRtxExecutionProvider', provider_options)] - ) - - options = sess.get_provider_options() - assert "NvTensorRtRtxExecutionProvider" in options - assert options["NvTensorRtRtxExecutionProvider"].get("user_compute_stream", "") == str(s.cuda_stream) - ... - ``` - -
- -* To take advantage of user compute stream, it is recommended to use [I/O Binding](https://onnxruntime.ai/docs/performance/device-tensor.html) to bind inputs and outputs to tensors in device. - -##### nv_max_workspace_size - -* Description: maximum workspace size in bytes for TensorRT RTX engine. - -* Default value: 0 (lets TensorRT pick the optimal). - -##### nv_dump_subgraphs - -* Description: dumps the subgraphs if the ONNX was split across multiple execution providers. - * This can help debugging subgraphs, e.g. by using `trtexec --onnx subgraph_1.onnx` and check the outputs of the parser. - -##### nv_detailed_build_log - -* Description: enable detailed build step logging on NV TensorRT RTX EP with timing for each engine build. - -##### nv_cuda_graph_enable - -* Description: this will capture a [CUDA graph](https://developer.nvidia.com/blog/cuda-graphs/) which can drastically help for a network with many small layers as it reduces launch overhead on the CPU. - -##### nv_profile_min_shapes +> Note: for bool type options, assign them with **True**/**False** in python, or **1**/**0** in C++. -##### nv_profile_max_shapes -##### nv_profile_opt_shapes +#### Profile shape options * Description: build with explicit dynamic shapes using a profile with the min/max/opt shapes provided. * By default TensorRT RTX engines will support dynamic shapes, for perofmance improvements it is possible to specify one or multiple explicit ranges of shapes. @@ -209,27 +146,28 @@ TensorRT RTX configurations can be set by execution provider options. It's usefu * Note that multiple TensorRT RTX profiles can be enabled by passing multiple shapes for the same input tensor. * Check TensorRT doc [optimization profiles](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#opt_profiles) for more details. -## NV TensorRT RTX EP Caches -There are two major TRT RTX EP caches: -* Embedded engine model / EPContext model -* Internal TensorRT RTX cache -The internal TensorRT RTX cache is automatically managed by the EP. The user only needs to manage EPContext caching. -**Caching is important to help reduce session creation time drastically.** +## Cache + +TensorRT RTX EP supports two types of model cache. These helps in reducing model load time significantly. + +1. EP context model +2. Runtime cache + + +### EP context model + +TensorRT RTX separates compilation into two phases - ahead of time (AOT) and just in time (JIT) compilation. In AOT phase, the ONNX model is compiled to an optimized binary blob and stored as an EP context model. This model will be compatible across multiple GPU generations. + +During inference, we only use the compiled EP context model. When loaded, TensorRT RTX will JIT compile the binary blob (engine) to fit to the used GPU. This JIT process is accelerated by TensorRT RTX's internal cache. -TensorRT RTX separates compilation into an ahead of time (AOT) compiled engine and a just in time (JIT) compilation. The AOT compilation can be stored as EPcontext model, this model will be compatible across multiple GPU generations. -Upon loading such an EPcontext model TensorRT RTX will just in time compile the engine to fit to the used GPU. This JIT process is accelerated by TensorRT RTX's internal cache. For an example usage see: https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/test/providers/nv_tensorrt_rtx/nv_basic_test.cc -### More about Embedded engine model / EPContext model -* TODO: decide on a plan for using weight-stripped engines by default. 
Fix the EP implementation to enable that. Explain the motivation and provide example on how to use the right options in this document. -* EPContext models also **enable packaging an externally compiled engine** using e.g. `trtexec`. A [python script](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/tensorrt/gen_trt_engine_wrapper_onnx_model.py) that is capable of packaging such a precompiled engine into an ONNX file is included in the python tools. (TODO: document how this works with weight-stripped engines). -## Performance Tuning -For performance tuning, please see guidance on this page: [ONNX Runtime Perf Tuning](./../performance/tune-performance/index.md) +## Performance test -When/if using [onnxruntime_perf_test](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/test/perftest#onnxruntime-performance-test), use the flag `-e nvtensorrttrx`. +When using [onnxruntime_perf_test](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/test/perftest#onnxruntime-performance-test), use the flag `-e nvtensorrttrx`. ### TensorRT RTX Plugins Support From fd2e1fdfd60f9e6ae11a3df7eb885564a7487c97 Mon Sep 17 00:00:00 2001 From: Vishal Agarwal Date: Wed, 3 Sep 2025 12:12:34 +0530 Subject: [PATCH 02/15] refactor --- docs/build/eps.md | 47 +++++++++--- .../TensorRTRTX-ExecutionProvider.md | 73 +++++++------------ 2 files changed, 63 insertions(+), 57 deletions(-) diff --git a/docs/build/eps.md b/docs/build/eps.md index 6e6fed2b3777d..9545273b799ee 100644 --- a/docs/build/eps.md +++ b/docs/build/eps.md @@ -169,25 +169,50 @@ See more information on the TensorRT RTX Execution Provider [here](../execution- | ONNX Runtime | TensorRT-RTX | CUDA Toolkit | | :----------- | :----------- | :------------- | -| main branch | 1.1 | 12.x | -| 1.22 | 1.0 | 12.x | +| main branch | 1.1 | 12.9 | +| 1.22 | 1.0 | 12.8 | ### Pre-requisites -* Download latest [CUDA toolkit](https://developer.nvidia.com/cuda-toolkit) - * The path to the CUDA installation must be set via the `CUDA_HOME` environment variable, or use the `--cuda_home` arg in the build command. The installation directory should contain `bin`, `include` and `lib` sub-directories. 
-* Download [TensorRT RTX](https://developer.nvidia.com/tensorrt-rtx) -* Visual Studio Build Tools - https://visualstudio.microsoft.com/downloads/ +* Install git, cmake, Python 3.12 +* Install latest [NVIDIA driver](https://www.nvidia.com/en-us/drivers/) +* Install [CUDA toolkit 12.9](https://developer.nvidia.com/cuda-12-9-1-download-archive) +* Install [TensorRT RTX](https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/installing-tensorrt-rtx/installing.html) +* For Windows only, Visual Studio - https://visualstudio.microsoft.com/downloads/ +* Set TensorRT-RTX dlls in `PATH` or put it in same folder as application exe -### Windows -```ps -.\build.bat --config Release --parallel --use_nv_tensorrt_rtx --tensorrt_rtx_home path\to\tensorrt-rtx" --cuda_home "path\to\cuda\home" --cmake_generator "Visual Studio 17 2022" --build_shared_lib --skip_tests --build --update --use_vcpkg +```sh +git clone https://github.com/microsoft/onnxruntime.git +cd onnxruntime ``` -### Linux +### C/C++ APIs + +#### Windows + +```powershell +.\build.bat --config Release --build_dir build --parallel --use_nv_tensorrt_rtx --tensorrt_rtx_home "path\to\tensorrt-rtx" --cuda_home "path\to\cuda\home" --cmake_generator "Visual Studio 17 2022" --build_shared_lib --skip_tests --build --update --use_vcpkg +``` + +#### Linux ```sh -./build.sh --config Release --parallel --use_nv_tensorrt_rtx --tensorrt_rtx_home path/to/tensorrt-rtx" --cuda_home "path/to/cuda/home" --build_shared_lib --skip_tests --build --update +./build.sh --config Release --build_dir build --parallel --use_nv_tensorrt_rtx --tensorrt_rtx_home "path/to/tensorrt-rtx" --cuda_home "path/to/cuda/home" --build_shared_lib --skip_tests --build --update +``` + +#### Run unit test +```powershell +.\build\Release\Release\onnxruntime_test_all.exe --gtest_filter=*NvExecutionProviderTest.* +``` + +### Python wheel + +```powershell +# build the python wheel +.\build.bat --config Release --build_dir build --parallel --use_nv_tensorrt_rtx --tensorrt_rtx_home "path\to\tensorrt-rtx" --cuda_home "path\to\cuda\home" --cmake_generator "Visual Studio 17 2022" --build_shared_lib --skip_tests --build_wheel + +# install +pip install "build\Release\Release\dist\onnxruntime-1.23.0-cp312-cp312-win_amd64.whl" ``` --- diff --git a/docs/execution-providers/TensorRTRTX-ExecutionProvider.md b/docs/execution-providers/TensorRTRTX-ExecutionProvider.md index 78ab35682b844..ce6a8bf7228ba 100644 --- a/docs/execution-providers/TensorRTRTX-ExecutionProvider.md +++ b/docs/execution-providers/TensorRTRTX-ExecutionProvider.md @@ -40,9 +40,9 @@ Information on how to build from source for TensorRT RTX EP can be found [here]( ### C/C++ ```c++ -const auto& api = Ort::GetApi(); +Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "SampleApp"); Ort::SessionOptions session_options; -api.SessionOptionsAppendExecutionProvider(session_options, "NvTensorRtRtx", nullptr, nullptr, 0); +session_options.AppendExecutionProvider(onnxruntime::kNvTensorRTRTXExecutionProvider, {}); Ort::Session session(env, model_path, session_options); ``` @@ -51,9 +51,19 @@ With Python APIs, you must explicitly register the TensorRT RTX EP when instanti ```python import onnxruntime as ort -sess = ort.InferenceSession('model.onnx', providers=['NvTensorRtRtxExecutionProvider']) +sess = ort.InferenceSession(model_path, providers=['NvTensorRtRtxExecutionProvider']) ``` +## Features + +### CUDA Graph + +### EP context model + + + +### Runtime cache + ## Execution Provider Options TensorRT RTX EP provides the following user configurable 
options with the [Execution Provider Options](./TensorRTRTX-ExecutionProvider.md#execution-provider-options) @@ -73,8 +83,6 @@ TensorRT RTX EP provides the following user configurable options with the [Execu | nv_multi_profile_enable | `bool` | Enable support for multiple optimization profiles in TensorRT engine. Allows dynamic input shapes for different inference requests | false | | nv_use_external_data_initializer | `bool` | Use external data initializer for model weights. Useful for EP context large models with external data files | false | -* - #### Click below for Python API example: @@ -94,12 +102,12 @@ provider_options = { 'user_compute_stream': stream_handle } -sess_opt = ort.SessionOptions() -sess = ort.InferenceSession(model_path, sess_options=sess_opt, providers=[('NvTensorRTRTXExecutionProvider', provider_options)]) +sesion_options = ort.SessionOptions() +sess = ort.InferenceSession(model_path, sess_options=sesion_options, providers=[('NvTensorRTRTXExecutionProvider', provider_options)]) ``` - + #### Click below for C++ API example:
@@ -107,33 +115,24 @@ sess = ort.InferenceSession(model_path, sess_options=sess_opt, providers=[('NvTe ```c++ Ort::SessionOptions session_options; +// define a cuda stream cudaStream_t cuda_stream; cudaStreamCreate(&cuda_stream); -// Need to put the CUDA stream handle in a string -char streamHandle[32]; -sprintf_s(streamHandle, "%lld", (uint64_t)cuda_stream); - -const auto& api = Ort::GetApi(); -std::vector option_keys = { - "device_id", - "user_compute_stream", // this implicitly sets "has_user_compute_stream" -}; -std::vector option_values = { - "1", - streamHandle -}; +char stream_handle[32]; +sprintf_s(stream_handle, "%lld", (uint64_t)cuda_stream); -Ort::ThrowOnError(api.SessionOptionsAppendExecutionProvider(session_options, "NvTensorRtRtx", option_keys.data(), option_values.data(), option_keys.size())); +std::unordered_map provider_options; +provider_options[onnxruntime::nv::provider_option_names::kDeviceId] = "1"; +provider_options[onnxruntime::nv::provider_option_names::kUserComputeStream] = stream_handle; +session_options.AppendExecutionProvider(onnxruntime::kNvTensorRTRTXExecutionProvider, provider_options); ```
- - > Note: for bool type options, assign them with **True**/**False** in python, or **1**/**0** in C++. @@ -144,31 +143,13 @@ Ort::ThrowOnError(api.SessionOptionsAppendExecutionProvider(session_options, "Nv * The format of the profile shapes is `input_tensor_1:dim_1xdim_2x...,input_tensor_2:dim_3xdim_4x...,...` * These three flags should all be provided in order to enable explicit profile shapes feature. * Note that multiple TensorRT RTX profiles can be enabled by passing multiple shapes for the same input tensor. - * Check TensorRT doc [optimization profiles](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#opt_profiles) for more details. - - -## Cache - -TensorRT RTX EP supports two types of model cache. These helps in reducing model load time significantly. - -1. EP context model -2. Runtime cache - - -### EP context model - -TensorRT RTX separates compilation into two phases - ahead of time (AOT) and just in time (JIT) compilation. In AOT phase, the ONNX model is compiled to an optimized binary blob and stored as an EP context model. This model will be compatible across multiple GPU generations. - -During inference, we only use the compiled EP context model. When loaded, TensorRT RTX will JIT compile the binary blob (engine) to fit to the used GPU. This JIT process is accelerated by TensorRT RTX's internal cache. - -For an example usage see: -https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/test/providers/nv_tensorrt_rtx/nv_basic_test.cc + * Check TensorRT doc [optimization profiles](https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/inference-library/work-with-dynamic-shapes.html) for more details. ## Performance test -When using [onnxruntime_perf_test](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/test/perftest#onnxruntime-performance-test), use the flag `-e nvtensorrttrx`. +When using [onnxruntime_perf_test](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/test/perftest#onnxruntime-performance-test), use the flag `-e nvtensorrttrx` -### TensorRT RTX Plugins Support -TensorRT RTX doesn't support plugins. 
\ No newline at end of file +### Plugins Support +TensorRT RTX doesn't support plugins \ No newline at end of file From 9e0a34168ec662460f3e22b3a0815ac7ee906e91 Mon Sep 17 00:00:00 2001 From: Vishal Agarwal Date: Wed, 3 Sep 2025 12:15:27 +0530 Subject: [PATCH 03/15] update nav --- docs/execution-providers/TensorRTRTX-ExecutionProvider.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/execution-providers/TensorRTRTX-ExecutionProvider.md b/docs/execution-providers/TensorRTRTX-ExecutionProvider.md index ce6a8bf7228ba..a7afa6926d17b 100644 --- a/docs/execution-providers/TensorRTRTX-ExecutionProvider.md +++ b/docs/execution-providers/TensorRTRTX-ExecutionProvider.md @@ -2,7 +2,7 @@ title: NVIDIA - TensorRT RTX description: Instructions to execute ONNX Runtime on NVIDIA RTX GPUs with the NVIDIA TensorRT RTX execution provider parent: Execution Providers -nav_order: 17 +nav_order: 3 redirect_from: /docs/reference/execution-providers/TensorRTRTX-ExecutionProvider --- From 4b2d1c0bd2d0d142a2bc9f07e2d035602b6a1b26 Mon Sep 17 00:00:00 2001 From: Vishal Agarwal Date: Wed, 3 Sep 2025 12:17:48 +0530 Subject: [PATCH 04/15] update nav --- docs/execution-providers/TensorRTRTX-ExecutionProvider.md | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/docs/execution-providers/TensorRTRTX-ExecutionProvider.md b/docs/execution-providers/TensorRTRTX-ExecutionProvider.md index a7afa6926d17b..94eb039a39137 100644 --- a/docs/execution-providers/TensorRTRTX-ExecutionProvider.md +++ b/docs/execution-providers/TensorRTRTX-ExecutionProvider.md @@ -2,7 +2,7 @@ title: NVIDIA - TensorRT RTX description: Instructions to execute ONNX Runtime on NVIDIA RTX GPUs with the NVIDIA TensorRT RTX execution provider parent: Execution Providers -nav_order: 3 +nav_order: 2 redirect_from: /docs/reference/execution-providers/TensorRTRTX-ExecutionProvider --- @@ -60,7 +60,12 @@ sess = ort.InferenceSession(model_path, providers=['NvTensorRtRtxExecutionProvid ### EP context model +TensorRT RTX separates compilation into two phases - ahead of time (AOT) and just in time (JIT) compilation. In AOT phase, the ONNX model is compiled to an optimized binary blob and stored as an EP context model. This model will be compatible across multiple GPU generations. +During inference, we only use the compiled EP context model. When loaded, TensorRT RTX will JIT compile the binary blob (engine) to fit to the used GPU. This JIT process is accelerated by TensorRT RTX's internal cache. + +For an example usage see: +https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/test/providers/nv_tensorrt_rtx/nv_basic_test.cc ### Runtime cache From 759ed215d792ff0f6003981e6a26ad4a4c43f994 Mon Sep 17 00:00:00 2001 From: Vishal Agarwal Date: Wed, 3 Sep 2025 13:43:48 +0530 Subject: [PATCH 05/15] update EP context --- .../TensorRTRTX-ExecutionProvider.md | 58 +++++++++++++++++-- 1 file changed, 53 insertions(+), 5 deletions(-) diff --git a/docs/execution-providers/TensorRTRTX-ExecutionProvider.md b/docs/execution-providers/TensorRTRTX-ExecutionProvider.md index 94eb039a39137..4079d7086b01d 100644 --- a/docs/execution-providers/TensorRTRTX-ExecutionProvider.md +++ b/docs/execution-providers/TensorRTRTX-ExecutionProvider.md @@ -58,19 +58,65 @@ sess = ort.InferenceSession(model_path, providers=['NvTensorRtRtxExecutionProvid ### CUDA Graph +CUDA Graph Text + ### EP context model -TensorRT RTX separates compilation into two phases - ahead of time (AOT) and just in time (JIT) compilation. 
In AOT phase, the ONNX model is compiled to an optimized binary blob and stored as an EP context model. This model will be compatible across multiple GPU generations. +In ONNXRuntime, Execution Providers are responsible for converting ONNX models into the graph format required by its specific backend SDK and subsequently compiling them into a format compatible with the target hardware. In large models like LLMs and Diffusion models, this conversion and compilation process can be resource-intensive and time-consuming, often extending to tens of minutes. This overhead significantly impacts the user experience during session creation. + +To mitigate the repetitive nature of model conversion and compilation, the ONNX models can be pre-compiled model as a binary file and persisted in an "EP Context" Model. This pre-compiled model can then be loaded directly by the EP, bypassing the initial compilation steps and enabling immediate execution on the target device. This optimization substantially reduces session creation time and enhances overall operational efficiency. + +TensorRT RTX simplifies this approach by separating compilation into two distinct phases: +* Ahead-of-Time (AOT) Compilation: The ONNX model is compiled into an optimized binary blob, which is then stored as an EP context model. This generated model is designed for compatibility across multiple generations of GPUs. +* Just-in-Time (JIT) Compilation: During inference, the compiled EP context model is loaded. TensorRT RTX then performs a JIT compilation of the binary blob (engine) to precisely adapt it to the specific GPU in use. + +The primary benefit of this multi-phase compilation workflow is a significant reduction in model load times. + +#### Generating EP Context Models with ORT 1.22 + +ONNX Runtime 1.22 introduced dedicated [Compile APIs](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/session/compile_api.h) to simplify the generation of EP context models: + +```cpp +Ort::ModelCompilationOptions compile_options(env, session_options); +compile_options.SetInputModelPath(input_model_path); +compile_options.SetOutputModelPath(compile_model_path); + +Ort::Status status = Ort::CompileModel(env, compile_options); +``` + +After successful generation, the EP context model can be directly loaded for inference: + +```cpp +Ort::Session session(env, compile_model_path, session_options); +``` -During inference, we only use the compiled EP context model. When loaded, TensorRT RTX will JIT compile the binary blob (engine) to fit to the used GPU. This JIT process is accelerated by TensorRT RTX's internal cache. +This approach leads to a considerable reduction in session creation time, thereby improving the overall user experience. + +For a practical example of usage, please refer to: +* EP context samples +* EP context [unit tests](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/test/providers/nv_tensorrt_rtx/nv_ep_context_test.cc) + + +#### NVIDIA recommended settings + +* disable ORT graph optimization +```cpp +session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_DISABLE_ALL); +``` + +* For models > 2GB, set embed_mode = 0 in model compilation options. 
If binary blob is embedded within the EP context, it fails for > 2GB models due to protobuf limitations +```cpp +Ort::ModelCompilationOptions compile_options(env, session_options); +compile_options.SetEpContextEmbedMode(0); +``` -For an example usage see: -https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/test/providers/nv_tensorrt_rtx/nv_basic_test.cc ### Runtime cache + + ## Execution Provider Options -TensorRT RTX EP provides the following user configurable options with the [Execution Provider Options](./TensorRTRTX-ExecutionProvider.md#execution-provider-options) +TensorRT RTX EP provides the following user configurable options with the [Execution Provider Options](https://github.com/microsoft/onnxruntime/blob/main/include/onnxruntime/core/providers/nv_tensorrt_rtx/nv_provider_options.h) | Parameter | Type | Description | Default | @@ -92,6 +138,7 @@ TensorRT RTX EP provides the following user configurable options with the [Execu #### Click below for Python API example: +
```python @@ -115,6 +162,7 @@ sess = ort.InferenceSession(model_path, sess_options=sesion_options, providers=[ #### Click below for C++ API example: +
```c++ From 702e92876d33064a4e40b0d407d1ffd08301f071 Mon Sep 17 00:00:00 2001 From: Vishal Agarwal Date: Wed, 3 Sep 2025 17:31:37 +0530 Subject: [PATCH 06/15] update doc --- .../TensorRTRTX-ExecutionProvider.md | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/docs/execution-providers/TensorRTRTX-ExecutionProvider.md b/docs/execution-providers/TensorRTRTX-ExecutionProvider.md index 4079d7086b01d..5036268ea9898 100644 --- a/docs/execution-providers/TensorRTRTX-ExecutionProvider.md +++ b/docs/execution-providers/TensorRTRTX-ExecutionProvider.md @@ -97,6 +97,20 @@ For a practical example of usage, please refer to: * EP context [unit tests](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/test/providers/nv_tensorrt_rtx/nv_ep_context_test.cc) +There are two other ways to quick generate an EP context model + +1. **ONNX Runtime Perf Test** + +```sh +onnxruntime_perf_test.exe -e nvtensorrtrtx -I -r 1 "/path/to/model.onnx" --compile_ep_context --compile_model_path "/path/to/model_ctx.onnx" +``` + +2. **Python Script** + +```sh +python tools/python/compile_ep_context_model.py -i "path/to/model.onnx" -o "/path/to/model_ctx.onnx" +``` + #### NVIDIA recommended settings * disable ORT graph optimization From 37f561e2e20d4ae26921207c97f80ef9037a28e9 Mon Sep 17 00:00:00 2001 From: Vishal Agarwal Date: Wed, 3 Sep 2025 17:34:11 +0530 Subject: [PATCH 07/15] update doc --- docs/execution-providers/TensorRTRTX-ExecutionProvider.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/execution-providers/TensorRTRTX-ExecutionProvider.md b/docs/execution-providers/TensorRTRTX-ExecutionProvider.md index 5036268ea9898..0ccac970592bd 100644 --- a/docs/execution-providers/TensorRTRTX-ExecutionProvider.md +++ b/docs/execution-providers/TensorRTRTX-ExecutionProvider.md @@ -26,7 +26,7 @@ For compatibility and support matrix, please refer to [this](https://docs.nvidia {: .no_toc } * TOC placeholder -{:toc} +{:toc toc_levels=1..4} ## Install From e1464d87b68a74941717e1400ba1e90e926dd3ab Mon Sep 17 00:00:00 2001 From: Vishal Agarwal Date: Wed, 3 Sep 2025 17:40:23 +0530 Subject: [PATCH 08/15] update doc --- _config.yml | 2 +- docs/execution-providers/TensorRTRTX-ExecutionProvider.md | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/_config.yml b/_config.yml index 3079a71fb07e6..ce33a13b3e1b2 100644 --- a/_config.yml +++ b/_config.yml @@ -10,7 +10,7 @@ plugins: - jekyll-redirect-from kramdown: parse_block_html: true - toc_levels: '2' + toc_levels: '4' logo: '/images/ONNX-Runtime-logo.svg' aux_links: 'ONNX Runtime': diff --git a/docs/execution-providers/TensorRTRTX-ExecutionProvider.md b/docs/execution-providers/TensorRTRTX-ExecutionProvider.md index 0ccac970592bd..2b4f8821e276f 100644 --- a/docs/execution-providers/TensorRTRTX-ExecutionProvider.md +++ b/docs/execution-providers/TensorRTRTX-ExecutionProvider.md @@ -99,13 +99,13 @@ For a practical example of usage, please refer to: There are two other ways to quick generate an EP context model -1. **ONNX Runtime Perf Test** +**ONNX Runtime Perf Test** ```sh onnxruntime_perf_test.exe -e nvtensorrtrtx -I -r 1 "/path/to/model.onnx" --compile_ep_context --compile_model_path "/path/to/model_ctx.onnx" ``` -2. 
**Python Script** +**Python Script** ```sh python tools/python/compile_ep_context_model.py -i "path/to/model.onnx" -o "/path/to/model_ctx.onnx" From 9ad9d33e493c4ace57f6dfe8067aff751370500d Mon Sep 17 00:00:00 2001 From: Vishal Agarwal Date: Wed, 3 Sep 2025 17:42:41 +0530 Subject: [PATCH 09/15] update doc --- _config.yml | 2 +- docs/execution-providers/TensorRTRTX-ExecutionProvider.md | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/_config.yml b/_config.yml index ce33a13b3e1b2..7a01b8863dc32 100644 --- a/_config.yml +++ b/_config.yml @@ -10,7 +10,7 @@ plugins: - jekyll-redirect-from kramdown: parse_block_html: true - toc_levels: '4' + toc_levels: [2, 3, 4] logo: '/images/ONNX-Runtime-logo.svg' aux_links: 'ONNX Runtime': diff --git a/docs/execution-providers/TensorRTRTX-ExecutionProvider.md b/docs/execution-providers/TensorRTRTX-ExecutionProvider.md index 2b4f8821e276f..9fe4870938b2c 100644 --- a/docs/execution-providers/TensorRTRTX-ExecutionProvider.md +++ b/docs/execution-providers/TensorRTRTX-ExecutionProvider.md @@ -150,7 +150,7 @@ TensorRT RTX EP provides the following user configurable options with the [Execu -#### Click below for Python API example: +Click below for Python API example:
@@ -174,7 +174,7 @@ sess = ort.InferenceSession(model_path, sess_options=sesion_options, providers=[
-#### Click below for C++ API example: +Click below for C++ API example:
From 76d3cebe0c7cfa740e00ea9996c30afc94f3e7f7 Mon Sep 17 00:00:00 2001 From: Vishal Agarwal Date: Thu, 4 Sep 2025 15:21:34 +0530 Subject: [PATCH 10/15] add runtime cache and cuda graph --- .../TensorRTRTX-ExecutionProvider.md | 54 ++++++++++++++++++- 1 file changed, 53 insertions(+), 1 deletion(-) diff --git a/docs/execution-providers/TensorRTRTX-ExecutionProvider.md b/docs/execution-providers/TensorRTRTX-ExecutionProvider.md index 9fe4870938b2c..0721d8eb2ff01 100644 --- a/docs/execution-providers/TensorRTRTX-ExecutionProvider.md +++ b/docs/execution-providers/TensorRTRTX-ExecutionProvider.md @@ -58,7 +58,56 @@ sess = ort.InferenceSession(model_path, providers=['NvTensorRtRtxExecutionProvid ### CUDA Graph -CUDA Graph Text +CUDA Graph is a representation of a sequence of GPU operations, such as kernel launches and memory copies, captured from a CUDA stream. Instead of the CPU launching each operation individually, the entire sequence is captured once and then replayed on the GPU. This process significantly reduces CPU overhead and improves GPU utilization. Find out more details about CUDA Graphs from this [blog](https://developer.nvidia.com/blog/cuda-graphs/) + +**Key Benefits** + +* **Reduced CPU Overhead**: The most significant benefit is the reduction in CPU-side work. Instead of the CPU having to schedule and dispatch hundreds or thousands of individual kernels for each inference, it only issues one command to replay the entire graph. +* **Lower Latency**: By eliminating the gaps between kernel launches, CUDA Graphs enable the GPU to work more continuously, leading to lower and more predictable end-to-end latency. +* **Improved Scalability**: This reduced overhead makes multi-threaded workloads more efficient, as the contention for CPU resources to launch kernels is minimized. + +#### Usage + +For models where input shapes don't change. e.g. convolutional models, CUDA Graph can be enabled by setting a provider option during the creation of the InferenceSession. By default, ORT uses a graph annotation ID of 0 and starts capturing with this. Users can control the annotation ID at runtime by setting the run option `gpu_graph_id`. If we have `gpu_graph_id` as -1, it indicates that the graph will not be captured for that specific run. + +**Python** + +```python +providers = [('NvTensorRTRTXExecutionProvider', {'enable_cuda_graph': True})] +session = ort.InferenceSession("model.onnx", providers=providers) +``` + +**C/C++** +```cpp +const auto& api = Ort::GetApi(); +Ort::SessionOptions session_options; +const char* keys[] = {onnxruntime::nv::provider_option_names::kCudaGraphEnable}; +const char* values[] = {"1"}; +OrtStatus* status = api.SessionOptionsAppendExecutionProvider(session_options, onnxruntime::kNvTensorRTRTXExecutionProvider, keys, values, 1); +Ort::Session session(env, model_path, session_options); +``` + +**ONNXRuntime Perf Test** +```sh +onnxruntime_perf_test.exe -I -t 5 -e nvtensorrtrtx -i "enable_cuda_graph|1" "model.onnx" +``` + + +**Where to use?** + +Enabling CUDA Graph is advantageous in scenarios characterized by static execution patterns and numerous small GPU kernels, as this reduces CPU overhead and improves GPU utilization. +* Static-shaped models: Models with fixed input dimensions, such as many convolutional neural networks (CNNs) used for image classification, are ideal candidates. +* LLMs with stable shapes: For Large Language Models, CUDA Graphs are primarily utilized to optimize the decoding phase, where tokens are generated sequentially. 
This phase involves a repetitive sequence of identical GPU kernel launches, making it well-suited for graph capture and replay. Although the prefill phase is less suitable due to its variable input size, capturing a new graph for each recurring shape enables the decoder to achieve significant speedups and reduced CPU overhead. +* Workloads with frequent identical executions: Applications that repeatedly perform the same sequence of GPU operations benefit from performance improvements, as the initial cost of capturing the graph is amortized over many replays. + +**Where not to use?** + +Enabling CUDA Graph should be avoided or approached with caution in scenarios where the execution pattern is not stable or where the overhead outweighs the benefits. +* Models with conditional flow or loops: Models that use control-flow operators such as loops or conditionals can disrupt the CUDA Graph capture process. +* Highly variable input shapes: For dynamic-shaped models where the input shape changes with every request and there is no repetition, CUDA Graph provides no benefit. In these cases, each run would require a new graph capture, which is slower than regular execution, and the replay mechanism would not be utilized. +* Workloads with short-lived executions: The initial capture phase incurs a cost. If an application performs only one or two inferences, the overhead of capturing the graph may exceed any performance benefit from replaying it. +* Models dominated by very large kernels: If a model's total execution time is primarily spent on a few very large, long-running kernels, the CPU launch overhead is already negligible. In such cases, the benefits of CUDA Graph are minimal. + ### EP context model @@ -127,6 +176,9 @@ compile_options.SetEpContextEmbedMode(0); ### Runtime cache +Runtime caches help with JIT compilation time. So if you compiled an EP context not and load the produced node model for the first time specialized CUDA kernels for your GPU will be produced. +By specifying a directory as "nv_runtime_cache_path" a cache will be created for every TensorRT RTX engine in an EP context node, upon the second load this cache will be loaded and ensure the optimal kernels are already precompiled and can be deserialized rather than compiled. Especially on large networks with diverse operators this can have significant impact e.g. SD 1.5 which is a mixture of many Conv and MatMul operators. +Nor information about the graph structure nor weights will be serialized to this cache. 
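A minimal sketch of enabling the runtime cache from Python (the option name is the one described above; the model and directory paths are placeholders):

```python
# Load a precompiled EP context model with a runtime cache directory so the
# JIT-compiled kernels are reused on subsequent loads.
import onnxruntime as ort

provider_options = {
    'nv_runtime_cache_path': './trt_rtx_runtime_cache'  # placeholder cache directory
}

session = ort.InferenceSession(
    'model_ctx.onnx',  # placeholder EP context model produced ahead of time
    providers=[('NvTensorRTRTXExecutionProvider', provider_options)]
)
```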
## Execution Provider Options From 35878ca712533691ce07d1384edb72bd51854d6b Mon Sep 17 00:00:00 2001 From: Vishal Agarwal Date: Thu, 4 Sep 2025 16:24:09 +0530 Subject: [PATCH 11/15] update content --- docs/build/eps.md | 5 +- .../TensorRTRTX-ExecutionProvider.md | 75 ++++++++++--------- 2 files changed, 42 insertions(+), 38 deletions(-) diff --git a/docs/build/eps.md b/docs/build/eps.md index 9545273b799ee..3664fd8922430 100644 --- a/docs/build/eps.md +++ b/docs/build/eps.md @@ -170,6 +170,7 @@ See more information on the TensorRT RTX Execution Provider [here](../execution- | ONNX Runtime | TensorRT-RTX | CUDA Toolkit | | :----------- | :----------- | :------------- | | main branch | 1.1 | 12.9 | +| 1.23 | 1.1 | 12.9 | | 1.22 | 1.0 | 12.8 | ### Pre-requisites @@ -177,7 +178,7 @@ See more information on the TensorRT RTX Execution Provider [here](../execution- * Install latest [NVIDIA driver](https://www.nvidia.com/en-us/drivers/) * Install [CUDA toolkit 12.9](https://developer.nvidia.com/cuda-12-9-1-download-archive) * Install [TensorRT RTX](https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/installing-tensorrt-rtx/installing.html) -* For Windows only, Visual Studio - https://visualstudio.microsoft.com/downloads/ +* For Windows only, install [Visual Studio](https://visualstudio.microsoft.com/downloads/) * Set TensorRT-RTX dlls in `PATH` or put it in same folder as application exe @@ -214,6 +215,8 @@ cd onnxruntime # install pip install "build\Release\Release\dist\onnxruntime-1.23.0-cp312-cp312-win_amd64.whl" ``` +{: .note } +TensorRT-RTX .dll or .so are in `PATH` or in the same folder as the application --- diff --git a/docs/execution-providers/TensorRTRTX-ExecutionProvider.md b/docs/execution-providers/TensorRTRTX-ExecutionProvider.md index 0721d8eb2ff01..289424b696469 100644 --- a/docs/execution-providers/TensorRTRTX-ExecutionProvider.md +++ b/docs/execution-providers/TensorRTRTX-ExecutionProvider.md @@ -9,16 +9,16 @@ redirect_from: /docs/reference/execution-providers/TensorRTRTX-ExecutionProvider # NVIDIA TensorRT RTX Execution Provider {: .no_toc } -NVIDIA TensorRT RTX Execution Provider (EP) is the **recommended** choice for GPU acceleration on NVIDIA consumer hardware (RTX PCs). It offers a more lightweight experience than the datacenter-focused TensorRT (TRT) EP and delivers superior performance compared to the other EPs. +The NVIDIA TensorRT RTX Execution Provider (EP) is designed for GPU acceleration on NVIDIA consumer hardware - RTX PCs and Pro workstations. It provides a lighter-weight alternative to the datacenter-oriented TensorRT (TRT) EP and generally offers better performance than other available EPs. -Here's why it's a better fit for RTX PCs than the legacy TensorRT EP: +The following are some advantages of using it on RTX PCs compared to the legacy TensorRT EP: * **Smaller package footprint:** Optimizes resource usage. * **Faster model compile and load times:** Get up and running quicker. * **Enhanced usability:** Seamlessly use cached models across multiple RTX GPUs. The TensorRT RTX EP leverages NVIDIA's new deep learning inference engine, [TensorRT RTX](https://developer.nvidia.com/tensorrt-rtx), to accelerate ONNX models on RTX GPUs. Microsoft and NVIDIA collaborated closely to integrate the TensorRT RTX execution provider with ONNX Runtime. -Currently, TensorRT RTX supports RTX GPUs based on Ampere and later architectures. Support for Turing GPUs is coming soon. +Currently, TensorRT RTX supports RTX GPUs based on Ampere and later architectures only. 
For compatibility and support matrix, please refer to [this](https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/getting-started/support-matrix.html) page. @@ -26,11 +26,11 @@ For compatibility and support matrix, please refer to [this](https://docs.nvidia {: .no_toc } * TOC placeholder -{:toc toc_levels=1..4} +{:toc} ## Install -Currently, TensorRT RTX EP can be only built from source code. Support for installation from package managers, such as PyPi and NuGet, is coming soon. +Currently, TensorRT RTX EP can be only built from source code. Support for installation from package managers, such as PyPi and NuGet, is coming soon. See the [WinML install section](../install/#cccwinml-installs) for WinML-related installation instructions. ## Build from source @@ -47,18 +47,19 @@ Ort::Session session(env, model_path, session_options); ``` ### Python -With Python APIs, you must explicitly register the TensorRT RTX EP when instantiating the `InferenceSession`. + +When using the Python API, register the TensorRT RTX Execution Provider by specifying it in the `providers` argument when creating an `InferenceSession`. ```python import onnxruntime as ort -sess = ort.InferenceSession(model_path, providers=['NvTensorRtRtxExecutionProvider']) +session = ort.InferenceSession(model_path, providers=['NvTensorRtRtxExecutionProvider']) ``` ## Features ### CUDA Graph -CUDA Graph is a representation of a sequence of GPU operations, such as kernel launches and memory copies, captured from a CUDA stream. Instead of the CPU launching each operation individually, the entire sequence is captured once and then replayed on the GPU. This process significantly reduces CPU overhead and improves GPU utilization. Find out more details about CUDA Graphs from this [blog](https://developer.nvidia.com/blog/cuda-graphs/) +CUDA Graph is a representation of a sequence of GPU operations, such as kernel launches and memory copies, captured from a CUDA stream. Instead of the CPU launching each operation individually, the entire sequence is captured once and then replayed on the GPU. This process significantly reduces CPU overhead and improves GPU utilization. Find out more details about CUDA Graphs from [this blog](https://developer.nvidia.com/blog/cuda-graphs/). **Key Benefits** @@ -66,14 +67,15 @@ CUDA Graph is a representation of a sequence of GPU operations, such as kernel l * **Lower Latency**: By eliminating the gaps between kernel launches, CUDA Graphs enable the GPU to work more continuously, leading to lower and more predictable end-to-end latency. * **Improved Scalability**: This reduced overhead makes multi-threaded workloads more efficient, as the contention for CPU resources to launch kernels is minimized. -#### Usage +**Usage** -For models where input shapes don't change. e.g. convolutional models, CUDA Graph can be enabled by setting a provider option during the creation of the InferenceSession. By default, ORT uses a graph annotation ID of 0 and starts capturing with this. Users can control the annotation ID at runtime by setting the run option `gpu_graph_id`. If we have `gpu_graph_id` as -1, it indicates that the graph will not be captured for that specific run. +For models where input shapes don't change. e.g. convolutional models, CUDA Graph can be enabled by setting a provider option. By default, ORT uses a graph annotation ID of 0 and starts capturing with this. Users can control the annotation ID at runtime by setting the run option `gpu_graph_id`. 
If we have `gpu_graph_id` as -1, it indicates that the graph will not be captured for that specific run. **Python** ```python -providers = [('NvTensorRTRTXExecutionProvider', {'enable_cuda_graph': True})] +trt_rtx_provider_options = {'enable_cuda_graph': True} +providers = [('NvTensorRTRTXExecutionProvider', trt_rtx_provider_options)] session = ort.InferenceSession("model.onnx", providers=providers) ``` @@ -96,36 +98,37 @@ onnxruntime_perf_test.exe -I -t 5 -e nvtensorrtrtx -i "enable_cuda_graph|1" "mod **Where to use?** Enabling CUDA Graph is advantageous in scenarios characterized by static execution patterns and numerous small GPU kernels, as this reduces CPU overhead and improves GPU utilization. -* Static-shaped models: Models with fixed input dimensions, such as many convolutional neural networks (CNNs) used for image classification, are ideal candidates. -* LLMs with stable shapes: For Large Language Models, CUDA Graphs are primarily utilized to optimize the decoding phase, where tokens are generated sequentially. This phase involves a repetitive sequence of identical GPU kernel launches, making it well-suited for graph capture and replay. Although the prefill phase is less suitable due to its variable input size, capturing a new graph for each recurring shape enables the decoder to achieve significant speedups and reduced CPU overhead. -* Workloads with frequent identical executions: Applications that repeatedly perform the same sequence of GPU operations benefit from performance improvements, as the initial cost of capturing the graph is amortized over many replays. +* **Static-shaped models**: Models with fixed input dimensions, such as many convolutional neural networks (CNNs) used for image classification, are ideal candidates. +* **LLMs with stable shapes**: For Large Language Models, CUDA Graphs are primarily utilized to optimize the decoding phase, where tokens are generated sequentially. This phase involves a repetitive sequence of identical GPU kernel launches, making it well-suited for graph capture and replay. Although the prefill phase is less suitable due to its variable input size, capturing a new graph for each recurring shape enables the decoder to achieve significant speedups and reduced CPU overhead. +* **Workloads with frequent identical executions**: Applications that repeatedly perform the same sequence of GPU operations benefit from performance improvements, as the initial cost of capturing the graph is amortized over many replays. **Where not to use?** Enabling CUDA Graph should be avoided or approached with caution in scenarios where the execution pattern is not stable or where the overhead outweighs the benefits. -* Models with conditional flow or loops: Models that use control-flow operators such as loops or conditionals can disrupt the CUDA Graph capture process. -* Highly variable input shapes: For dynamic-shaped models where the input shape changes with every request and there is no repetition, CUDA Graph provides no benefit. In these cases, each run would require a new graph capture, which is slower than regular execution, and the replay mechanism would not be utilized. -* Workloads with short-lived executions: The initial capture phase incurs a cost. If an application performs only one or two inferences, the overhead of capturing the graph may exceed any performance benefit from replaying it. 
-* Models dominated by very large kernels: If a model's total execution time is primarily spent on a few very large, long-running kernels, the CPU launch overhead is already negligible. In such cases, the benefits of CUDA Graph are minimal. +* **Models with conditional flow or loops**: Models that use control-flow operators such as loops or conditionals can disrupt the CUDA Graph capture process. +* **Highly variable input shapes**: For dynamic-shaped models where the input shape changes with every request and there is no repetition, CUDA Graph provides no benefit. In these cases, each run would require a new graph capture, which is slower than regular execution, and the replay mechanism would not be utilized. +* **Workloads with short-lived executions**: The initial capture phase incurs a cost. If an application performs only one or two inferences, the overhead of capturing the graph may exceed any performance benefit from replaying it. +* **Models dominated by very large kernels**: If a model's total execution time is primarily spent on a few very large, long-running kernels, the CPU launch overhead is already negligible. In such cases, the benefits of CUDA Graph are minimal. ### EP context model -In ONNXRuntime, Execution Providers are responsible for converting ONNX models into the graph format required by its specific backend SDK and subsequently compiling them into a format compatible with the target hardware. In large models like LLMs and Diffusion models, this conversion and compilation process can be resource-intensive and time-consuming, often extending to tens of minutes. This overhead significantly impacts the user experience during session creation. +In ONNX Runtime, Execution Providers (EPs) handle the transformation of ONNX models into the specific graph format required by their backend SDKs, followed by compilation for the target hardware. For large-scale models such as LLMs and Diffusion models, this process can be both computationally expensive and time-consuming, resulting in longer session startup times. -To mitigate the repetitive nature of model conversion and compilation, the ONNX models can be pre-compiled model as a binary file and persisted in an "EP Context" Model. This pre-compiled model can then be loaded directly by the EP, bypassing the initial compilation steps and enabling immediate execution on the target device. This optimization substantially reduces session creation time and enhances overall operational efficiency. +To improve this workflow, ONNX models can be pre-compiled into a binary format and stored as an "EP Context" model. By loading this pre-compiled context, the EP can bypass the initial conversion and compilation phases, enabling immediate execution on the device. This approach greatly accelerates session creation and improves overall efficiency. TensorRT RTX simplifies this approach by separating compilation into two distinct phases: -* Ahead-of-Time (AOT) Compilation: The ONNX model is compiled into an optimized binary blob, which is then stored as an EP context model. This generated model is designed for compatibility across multiple generations of GPUs. -* Just-in-Time (JIT) Compilation: During inference, the compiled EP context model is loaded. TensorRT RTX then performs a JIT compilation of the binary blob (engine) to precisely adapt it to the specific GPU in use. +* **Ahead-of-Time (AOT)**: The ONNX model is compiled into an optimized binary blob, which is then stored as an EP context model. 
This generated model is designed for compatibility across multiple generations of GPUs. +* **Just-in-Time (JIT)**: At inference time, the EP context model is loaded and TensorRT RTX dynamically compiles the binary blob (engine) to optimize it for the exact GPU hardware being used. The primary benefit of this multi-phase compilation workflow is a significant reduction in model load times. -#### Generating EP Context Models with ORT 1.22 +**Generating EP Context Models** ONNX Runtime 1.22 introduced dedicated [Compile APIs](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/session/compile_api.h) to simplify the generation of EP context models: ```cpp +// AOT phase Ort::ModelCompilationOptions compile_options(env, session_options); compile_options.SetInputModelPath(input_model_path); compile_options.SetOutputModelPath(compile_model_path); @@ -136,19 +139,22 @@ Ort::Status status = Ort::CompileModel(env, compile_options); After successful generation, the EP context model can be directly loaded for inference: ```cpp +// JIT phase Ort::Session session(env, compile_model_path, session_options); ``` This approach leads to a considerable reduction in session creation time, thereby improving the overall user experience. -For a practical example of usage, please refer to: +The JIT time can also be accelerated using runtime cache. A runtime cache directory is created in which a per model cache will be produced that stores the compiled CUDA kernels and can further reduce setup time. More details about it [here](#runtime-cache). + +For a practical example of usage for EP context, please refer to: * EP context samples * EP context [unit tests](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/test/providers/nv_tensorrt_rtx/nv_ep_context_test.cc) There are two other ways to quick generate an EP context model -**ONNX Runtime Perf Test** +**ONNXRuntime Perf Test** ```sh onnxruntime_perf_test.exe -e nvtensorrtrtx -I -r 1 "/path/to/model.onnx" --compile_ep_context --compile_model_path "/path/to/model_ctx.onnx" @@ -160,12 +166,7 @@ onnxruntime_perf_test.exe -e nvtensorrtrtx -I -r 1 "/path/to/model.onnx" --compi python tools/python/compile_ep_context_model.py -i "path/to/model.onnx" -o "/path/to/model_ctx.onnx" ``` -#### NVIDIA recommended settings - -* disable ORT graph optimization -```cpp -session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_DISABLE_ALL); -``` +**NVIDIA recommended settings** * For models > 2GB, set embed_mode = 0 in model compilation options. If binary blob is embedded within the EP context, it fails for > 2GB models due to protobuf limitations ```cpp @@ -176,9 +177,7 @@ compile_options.SetEpContextEmbedMode(0); ### Runtime cache -Runtime caches help with JIT compilation time. So if you compiled an EP context not and load the produced node model for the first time specialized CUDA kernels for your GPU will be produced. -By specifying a directory as "nv_runtime_cache_path" a cache will be created for every TensorRT RTX engine in an EP context node, upon the second load this cache will be loaded and ensure the optimal kernels are already precompiled and can be deserialized rather than compiled. Especially on large networks with diverse operators this can have significant impact e.g. SD 1.5 which is a mixture of many Conv and MatMul operators. -Nor information about the graph structure nor weights will be serialized to this cache. +Runtime caches help to reduce JIT compilation time. 
When a user compiles an EP context and loads the resulting model for the first time, the system generates specialized CUDA kernels for the GPU. By setting the provider option `"nv_runtime_cache_path"` to a directory, a cache is created for each TensorRT RTX engine in an EP context node. On subsequent loads, this cache allows the system to quickly deserialize precompiled kernels instead of compiling them again. This is especially helpful for large models with many different operators, such as SD 1.5, which includes a mix of Conv and MatMul operations. The cache only contains compiled kernels. No information about the model’s graph structure or weights is stored. ## Execution Provider Options @@ -199,6 +198,7 @@ TensorRT RTX EP provides the following user configurable options with the [Execu | profile_opt_shapes | `str` | Comma-separated list of input tensor shapes for the optimal optimization profile. Format: `"input1:dim1xdim2x...,input2:dim1xdim2x..."` | "" (auto) | | nv_multi_profile_enable | `bool` | Enable support for multiple optimization profiles in TensorRT engine. Allows dynamic input shapes for different inference requests | false | | nv_use_external_data_initializer | `bool` | Use external data initializer for model weights. Useful for EP context large models with external data files | false | +| nv_runtime_cache_path | `str` | Path to store runtime cache. Setting this enables faster model loading by caching JIT compiled kernels for each TensorRT RTX engine. | "" (disabled) | @@ -210,7 +210,7 @@ Click below for Python API example: ```python import onnxruntime as ort -model_path = '' +model_path = '/path/to/model' # note: for bool type options in python API, set them as False/True provider_options = { @@ -221,7 +221,7 @@ provider_options = { } sesion_options = ort.SessionOptions() -sess = ort.InferenceSession(model_path, sess_options=sesion_options, providers=[('NvTensorRTRTXExecutionProvider', provider_options)]) +session = ort.InferenceSession(model_path, sess_options=sesion_options, providers=[('NvTensorRTRTXExecutionProvider', provider_options)]) ```
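+
+For the `nv_runtime_cache_path` option added above, a minimal standalone sketch is shown below; the model path and cache directory are placeholders, and pairing the cache with a pre-compiled EP context model is where it saves the most load time:
+
+```python
+import onnxruntime as ort
+
+# assumed paths: any ONNX or EP context model and any writable directory work here
+provider_options = {'nv_runtime_cache_path': './nv_runtime_cache'}
+session = ort.InferenceSession("/path/to/model_ctx.onnx",
+                               providers=[('NvTensorRTRTXExecutionProvider', provider_options)])
+```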
@@ -252,7 +252,8 @@ session_options.AppendExecutionProvider(onnxruntime::kNvTensorRTRTXExecutionProv -> Note: for bool type options, assign them with **True**/**False** in python, or **1**/**0** in C++. +{: .note } +For bool type options, assign them with **True**/**False** in python, or **1**/**0** in C++. #### Profile shape options From a14018977f8ced26265968d62f1270da6ee5a83d Mon Sep 17 00:00:00 2001 From: Vishal Agarwal Date: Thu, 4 Sep 2025 16:39:26 +0530 Subject: [PATCH 12/15] fix header --- docs/build/eps.md | 2 +- docs/execution-providers/TensorRTRTX-ExecutionProvider.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/build/eps.md b/docs/build/eps.md index 3664fd8922430..3980b779df16d 100644 --- a/docs/build/eps.md +++ b/docs/build/eps.md @@ -165,7 +165,7 @@ Dockerfile instructions are available [here](https://github.com/microsoft/onnxru See more information on the TensorRT RTX Execution Provider [here](../execution-providers/TensorRTRTX-ExecutionProvider.md). -## Minimum requirements +### Minimum requirements | ONNX Runtime | TensorRT-RTX | CUDA Toolkit | | :----------- | :----------- | :------------- | diff --git a/docs/execution-providers/TensorRTRTX-ExecutionProvider.md b/docs/execution-providers/TensorRTRTX-ExecutionProvider.md index 289424b696469..8a461ca4b1ae7 100644 --- a/docs/execution-providers/TensorRTRTX-ExecutionProvider.md +++ b/docs/execution-providers/TensorRTRTX-ExecutionProvider.md @@ -157,7 +157,7 @@ There are two other ways to quick generate an EP context model **ONNXRuntime Perf Test** ```sh -onnxruntime_perf_test.exe -e nvtensorrtrtx -I -r 1 "/path/to/model.onnx" --compile_ep_context --compile_model_path "/path/to/model_ctx.onnx" +onnxruntime_perf_test.exe -e nvtensorrtrtx -I -r 1 --compile_ep_context --compile_model_path "/path/to/model_ctx.onnx" "/path/to/model.onnx" ``` **Python Script** From cd8f6b3950f70b793c2b44cc87ffd51a918cb57b Mon Sep 17 00:00:00 2001 From: Vishal Agarwal Date: Thu, 4 Sep 2025 16:41:48 +0530 Subject: [PATCH 13/15] fix header --- docs/build/eps.md | 12 +++++------- .../TensorRTRTX-ExecutionProvider.md | 3 +-- 2 files changed, 6 insertions(+), 9 deletions(-) diff --git a/docs/build/eps.md b/docs/build/eps.md index 3980b779df16d..4cae8d94d9032 100644 --- a/docs/build/eps.md +++ b/docs/build/eps.md @@ -187,21 +187,19 @@ git clone https://github.com/microsoft/onnxruntime.git cd onnxruntime ``` -### C/C++ APIs - -#### Windows +### Windows ```powershell .\build.bat --config Release --build_dir build --parallel --use_nv_tensorrt_rtx --tensorrt_rtx_home "path\to\tensorrt-rtx" --cuda_home "path\to\cuda\home" --cmake_generator "Visual Studio 17 2022" --build_shared_lib --skip_tests --build --update --use_vcpkg ``` -#### Linux +### Linux ```sh ./build.sh --config Release --build_dir build --parallel --use_nv_tensorrt_rtx --tensorrt_rtx_home "path/to/tensorrt-rtx" --cuda_home "path/to/cuda/home" --build_shared_lib --skip_tests --build --update ``` -#### Run unit test +### Run unit test ```powershell .\build\Release\Release\onnxruntime_test_all.exe --gtest_filter=*NvExecutionProviderTest.* ``` @@ -215,8 +213,8 @@ cd onnxruntime # install pip install "build\Release\Release\dist\onnxruntime-1.23.0-cp312-cp312-win_amd64.whl" ``` -{: .note } -TensorRT-RTX .dll or .so are in `PATH` or in the same folder as the application + +> NOTE: TensorRT-RTX .dll or .so are in `PATH` or in the same folder as the application --- diff --git a/docs/execution-providers/TensorRTRTX-ExecutionProvider.md 
b/docs/execution-providers/TensorRTRTX-ExecutionProvider.md
index 8a461ca4b1ae7..3b7ad93212a2b 100644
--- a/docs/execution-providers/TensorRTRTX-ExecutionProvider.md
+++ b/docs/execution-providers/TensorRTRTX-ExecutionProvider.md
@@ -252,8 +252,7 @@ session_options.AppendExecutionProvider(onnxruntime::kNvTensorRTRTXExecutionProv
 
 
 
-{: .note }
-For bool type options, assign them with **True**/**False** in python, or **1**/**0** in C++.
+> NOTE: For bool type options, assign them with **True**/**False** in python, or **1**/**0** in C++.
 
 #### Profile shape options
 

From 7928b1f3db1b4f618f1ad3a95ab4b8b2aa476c34 Mon Sep 17 00:00:00 2001
From: Vishal Agarwal
Date: Wed, 10 Sep 2025 15:42:09 +0530
Subject: [PATCH 14/15] update content

---
 .../TensorRTRTX-ExecutionProvider.md          | 86 ++++++++-----------
 1 file changed, 35 insertions(+), 51 deletions(-)

diff --git a/docs/execution-providers/TensorRTRTX-ExecutionProvider.md b/docs/execution-providers/TensorRTRTX-ExecutionProvider.md
index 3b7ad93212a2b..46d6aa47ca2be 100644
--- a/docs/execution-providers/TensorRTRTX-ExecutionProvider.md
+++ b/docs/execution-providers/TensorRTRTX-ExecutionProvider.md
@@ -9,18 +9,19 @@ redirect_from: /docs/reference/execution-providers/TensorRTRTX-ExecutionProvider
 # NVIDIA TensorRT RTX Execution Provider
 {: .no_toc }
 
-The NVIDIA TensorRT RTX Execution Provider (EP) is designed for GPU acceleration on NVIDIA consumer hardware - RTX PCs and Pro workstations. It provides a lighter-weight alternative to the datacenter-oriented TensorRT (TRT) EP and generally offers better performance than other available EPs.
+The NVIDIA TensorRT-RTX Execution Provider (EP) is an inference deployment solution designed specifically for NVIDIA RTX GPUs. It is optimized for client-centric use cases.
 
-The following are some advantages of using it on RTX PCs compared to the legacy TensorRT EP:
-* **Smaller package footprint:** Optimizes resource usage.
-* **Faster model compile and load times:** Get up and running quicker.
-* **Enhanced usability:** Seamlessly use cached models across multiple RTX GPUs.
+TensorRT RTX EP provides the following benefits:
 
-The TensorRT RTX EP leverages NVIDIA's new deep learning inference engine, [TensorRT RTX](https://developer.nvidia.com/tensorrt-rtx), to accelerate ONNX models on RTX GPUs. Microsoft and NVIDIA collaborated closely to integrate the TensorRT RTX execution provider with ONNX Runtime.
+* **Small package footprint:** Optimized resource usage on end-user systems at just under 200 MB.
+* **Faster model compile and load times:** Leverages just-in-time compilation techniques to build RTX hardware-optimized engines on end-user devices in seconds.
+* **Portability:** Seamlessly use cached models across multiple RTX GPUs.
 
-Currently, TensorRT RTX supports RTX GPUs based on Ampere and later architectures only.
+The TensorRT RTX EP leverages NVIDIA’s new deep learning inference engine, [TensorRT for RTX](https://developer.nvidia.com/tensorrt-rtx), to accelerate ONNX models on RTX GPUs. Microsoft and NVIDIA collaborated closely to integrate the TensorRT RTX EP with ONNX Runtime.
 
-For compatibility and support matrix, please refer to [this](https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/getting-started/support-matrix.html) page.
+Currently, TensorRT RTX supports RTX GPUs based on Ampere and later architectures.
+
+For a full compatibility and support matrix, please refer to [this](https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/getting-started/support-matrix.html) page.
## Contents {: .no_toc } @@ -30,7 +31,7 @@ For compatibility and support matrix, please refer to [this](https://docs.nvidia ## Install -Currently, TensorRT RTX EP can be only built from source code. Support for installation from package managers, such as PyPi and NuGet, is coming soon. See the [WinML install section](../install/#cccwinml-installs) for WinML-related installation instructions. +Currently, TensorRT RTX EP can be built from the source code. Support for installation from package managers, such as PyPi and NuGet, is coming soon. See the [WinML install section](../install/#cccwinml-installs) for WinML-related installation instructions. ## Build from source @@ -48,7 +49,7 @@ Ort::Session session(env, model_path, session_options); ### Python -When using the Python API, register the TensorRT RTX Execution Provider by specifying it in the `providers` argument when creating an `InferenceSession`. +Register the TensorRT RTX EP by specifying it in the providers argument when creating an InferenceSession. ```python import onnxruntime as ort @@ -59,17 +60,11 @@ session = ort.InferenceSession(model_path, providers=['NvTensorRtRtxExecutionPro ### CUDA Graph -CUDA Graph is a representation of a sequence of GPU operations, such as kernel launches and memory copies, captured from a CUDA stream. Instead of the CPU launching each operation individually, the entire sequence is captured once and then replayed on the GPU. This process significantly reduces CPU overhead and improves GPU utilization. Find out more details about CUDA Graphs from [this blog](https://developer.nvidia.com/blog/cuda-graphs/). - -**Key Benefits** - -* **Reduced CPU Overhead**: The most significant benefit is the reduction in CPU-side work. Instead of the CPU having to schedule and dispatch hundreds or thousands of individual kernels for each inference, it only issues one command to replay the entire graph. -* **Lower Latency**: By eliminating the gaps between kernel launches, CUDA Graphs enable the GPU to work more continuously, leading to lower and more predictable end-to-end latency. -* **Improved Scalability**: This reduced overhead makes multi-threaded workloads more efficient, as the contention for CPU resources to launch kernels is minimized. +CUDA Graph is a representation of a sequence of GPU operations, such as kernel launches and memory copies, captured from a CUDA stream. Instead of the CPU launching each operation individually, the entire sequence is captured at once and then replayed on the GPU. This process significantly reduces CPU overhead and improves GPU utilization. Find out more details about CUDA Graphs from [this blog](https://developer.nvidia.com/blog/cuda-graphs/). **Usage** -For models where input shapes don't change. e.g. convolutional models, CUDA Graph can be enabled by setting a provider option. By default, ORT uses a graph annotation ID of 0 and starts capturing with this. Users can control the annotation ID at runtime by setting the run option `gpu_graph_id`. If we have `gpu_graph_id` as -1, it indicates that the graph will not be captured for that specific run. +CUDA Graph can be enabled by setting a provider option. By default, ONNX Runtime uses a graph annotation ID of 0 and starts capturing graphs. Users can control the annotation ID at runtime by setting the run option `gpu_graph_id`. If we have `gpu_graph_id` as \-1, it indicates that the graph will not be captured for that specific run. 
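+
+As a hedged illustration of the `gpu_graph_id` run option described above (the model path, input name and input shape below are placeholders, not taken from a specific model), an annotation ID can be set per run through `RunOptions`:
+
+```python
+import numpy as np
+import onnxruntime as ort
+
+session = ort.InferenceSession("model.onnx",
+                               providers=[('NvTensorRTRTXExecutionProvider', {'enable_cuda_graph': True})])
+feed = {session.get_inputs()[0].name: np.zeros((1, 3, 224, 224), dtype=np.float32)}
+
+ro = ort.RunOptions()
+ro.add_run_config_entry("gpu_graph_id", "1")    # capture/replay this run under annotation ID 1
+session.run(None, feed, ro)
+
+ro_skip = ort.RunOptions()
+ro_skip.add_run_config_entry("gpu_graph_id", "-1")  # -1: do not capture a CUDA Graph for this run
+session.run(None, feed, ro_skip)
+```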
 
 **Python**
 
@@ -94,35 +89,25 @@ onnxruntime_perf_test.exe -I -t 5 -e nvtensorrtrtx -i "enable_cuda_graph|1" "model.onnx"
 ```
 
+**Effectively Using CUDA Graphs**
 
-**Where to use?**
+CUDA Graph can be beneficial when execution patterns are static and involve many small GPU kernels. This feature helps reduce CPU overhead and improve GPU utilization, particularly for static execution plans run more than twice.
 
-Enabling CUDA Graph is advantageous in scenarios characterized by static execution patterns and numerous small GPU kernels, as this reduces CPU overhead and improves GPU utilization.
-* **Static-shaped models**: Models with fixed input dimensions, such as many convolutional neural networks (CNNs) used for image classification, are ideal candidates.
-* **LLMs with stable shapes**: For Large Language Models, CUDA Graphs are primarily utilized to optimize the decoding phase, where tokens are generated sequentially. This phase involves a repetitive sequence of identical GPU kernel launches, making it well-suited for graph capture and replay. Although the prefill phase is less suitable due to its variable input size, capturing a new graph for each recurring shape enables the decoder to achieve significant speedups and reduced CPU overhead.
-* **Workloads with frequent identical executions**: Applications that repeatedly perform the same sequence of GPU operations benefit from performance improvements, as the initial cost of capturing the graph is amortized over many replays.
+Avoid enabling CUDA Graph or proceed with caution if:
 
-**Where not to use?**
-
-Enabling CUDA Graph should be avoided or approached with caution in scenarios where the execution pattern is not stable or where the overhead outweighs the benefits.
-* **Models with conditional flow or loops**: Models that use control-flow operators such as loops or conditionals can disrupt the CUDA Graph capture process.
-* **Highly variable input shapes**: For dynamic-shaped models where the input shape changes with every request and there is no repetition, CUDA Graph provides no benefit. In these cases, each run would require a new graph capture, which is slower than regular execution, and the replay mechanism would not be utilized.
-* **Workloads with short-lived executions**: The initial capture phase incurs a cost. If an application performs only one or two inferences, the overhead of capturing the graph may exceed any performance benefit from replaying it.
-* **Models dominated by very large kernels**: If a model's total execution time is primarily spent on a few very large, long-running kernels, the CPU launch overhead is already negligible. In such cases, the benefits of CUDA Graph are minimal.
+* Input shapes or device bindings frequently change.
+* The control flow is conditional and data-dependent.
 
 ### EP context model
 
-In ONNX Runtime, Execution Providers (EPs) handle the transformation of ONNX models into the specific graph format required by their backend SDKs, followed by compilation for the target hardware. For large-scale models such as LLMs and Diffusion models, this process can be both computationally expensive and time-consuming, resulting in longer session startup times.
+EP context nodes are precompiled optimized formats that are execution provider specific. They allow a standard ONNX model to be compiled once and make any subsequent load of the same unchanged model as fast as possible.
-To improve this workflow, ONNX models can be pre-compiled into a binary format and stored as an "EP Context" model. By loading this pre-compiled context, the EP can bypass the initial conversion and compilation phases, enabling immediate execution on the device. This approach greatly accelerates session creation and improves overall efficiency. +TensorRT RTX handle compilation into two distinct phases: -TensorRT RTX simplifies this approach by separating compilation into two distinct phases: -* **Ahead-of-Time (AOT)**: The ONNX model is compiled into an optimized binary blob, which is then stored as an EP context model. This generated model is designed for compatibility across multiple generations of GPUs. +* **Ahead-of-Time (AOT)**: The ONNX model is compiled into an optimized binary blob, and stored as an EP context model. * **Just-in-Time (JIT)**: At inference time, the EP context model is loaded and TensorRT RTX dynamically compiles the binary blob (engine) to optimize it for the exact GPU hardware being used. -The primary benefit of this multi-phase compilation workflow is a significant reduction in model load times. - **Generating EP Context Models** ONNX Runtime 1.22 introduced dedicated [Compile APIs](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/session/compile_api.h) to simplify the generation of EP context models: @@ -143,16 +128,16 @@ After successful generation, the EP context model can be directly loaded for inf Ort::Session session(env, compile_model_path, session_options); ``` -This approach leads to a considerable reduction in session creation time, thereby improving the overall user experience. +This leads to a considerable reduction in session creation time, improving the overall user experience. -The JIT time can also be accelerated using runtime cache. A runtime cache directory is created in which a per model cache will be produced that stores the compiled CUDA kernels and can further reduce setup time. More details about it [here](#runtime-cache). +The JIT time can be further improved using runtime cache. A runtime cache directory with a per model cache is created. This cache stores the compiled CUDA kernels and reduces session load time. Learn more about the process [here](#runtime-cache). For a practical example of usage for EP context, please refer to: -* EP context samples -* EP context [unit tests](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/test/providers/nv_tensorrt_rtx/nv_ep_context_test.cc) +* EP context samples +* EP context [unit tests](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/test/providers/nv_tensorrt_rtx/nv_ep_context_test.cc) -There are two other ways to quick generate an EP context model +There are two other ways to quick generate an EP context model: **ONNXRuntime Perf Test** @@ -177,7 +162,7 @@ compile_options.SetEpContextEmbedMode(0); ### Runtime cache -Runtime caches help to reduce JIT compilation time. When a user compiles an EP context and loads the resulting model for the first time, the system generates specialized CUDA kernels for the GPU. By setting the provider option `"nv_runtime_cache_path"` to a directory, a cache is created for each TensorRT RTX engine in an EP context node. On subsequent loads, this cache allows the system to quickly deserialize precompiled kernels instead of compiling them again. 
This is especially helpful for large models with many different operators, such as SD 1.5, which includes a mix of Conv and MatMul operations. The cache only contains compiled kernels. No information about the model’s graph structure or weights is stored. +Runtime caches help reduce JIT compilation time. When a user compiles an EP context and loads the resulting model for the first time, the system generates specialized CUDA kernels for the GPU. By setting the provider option `"nv_runtime_cache_path"` to a directory, a cache is created for each TensorRT RTX engine in an EP context node. On subsequent loads, this cache allows the system to quickly deserialize precompiled kernels instead of compiling them again. This is especially helpful for large models with many different operators, such as SD 1.5, which includes a mix of Conv and MatMul operations. The cache only contains compiled kernels. No information about the model’s graph structure or weights is stored. ## Execution Provider Options @@ -257,18 +242,17 @@ session_options.AppendExecutionProvider(onnxruntime::kNvTensorRTRTXExecutionProv #### Profile shape options -* Description: build with explicit dynamic shapes using a profile with the min/max/opt shapes provided. - * By default TensorRT RTX engines will support dynamic shapes, for perofmance improvements it is possible to specify one or multiple explicit ranges of shapes. - * The format of the profile shapes is `input_tensor_1:dim_1xdim_2x...,input_tensor_2:dim_3xdim_4x...,...` - * These three flags should all be provided in order to enable explicit profile shapes feature. - * Note that multiple TensorRT RTX profiles can be enabled by passing multiple shapes for the same input tensor. - * Check TensorRT doc [optimization profiles](https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/inference-library/work-with-dynamic-shapes.html) for more details. - +* Description: build with explicit dynamic shapes using a profile with the min/max/opt shapes provided. + * By default TensorRT RTX engines support dynamic shapes. For additional performance improvements, you can specify one or multiple explicit ranges of shapes. + * The format of the profile shapes is `input_tensor_1:dim_1xdim_2x...,input_tensor_2:dim_3xdim_4x...,...` + * These three flags must be provided in order to enable explicit profile shapes. + * Note that multiple TensorRT RTX profiles can be enabled by passing multiple shapes for the same input tensor. + * Check TensorRT for RTX doc [optimization profiles](https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/inference-library/work-with-dynamic-shapes.html) for more details. 
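+
+As an illustrative sketch of the three profile options above (the tensor name `input_ids` and the dimensions are placeholders, and the min/max keys are assumed to mirror the `profile_opt_shapes` entry in the provider options table), an explicit shape range can be passed as provider options:
+
+```python
+import onnxruntime as ort
+
+provider_options = {
+    'profile_min_shapes': 'input_ids:1x1',    # smallest shape the engine must handle
+    'profile_opt_shapes': 'input_ids:1x128',  # shape the engine is tuned for
+    'profile_max_shapes': 'input_ids:4x512',  # largest shape the engine must handle
+}
+session = ort.InferenceSession("/path/to/model.onnx",
+                               providers=[('NvTensorRTRTXExecutionProvider', provider_options)])
+```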
## Performance test When using [onnxruntime_perf_test](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/test/perftest#onnxruntime-performance-test), use the flag `-e nvtensorrttrx` - ### Plugins Support -TensorRT RTX doesn't support plugins \ No newline at end of file + +TensorRT RTX doesn’t support plugins \ No newline at end of file From 614f4e5ebce7b11bbaf70d324503a4701d11a397 Mon Sep 17 00:00:00 2001 From: Vishal Agarwal Date: Mon, 15 Sep 2025 11:14:45 +0530 Subject: [PATCH 15/15] update doc --- docs/execution-providers/TensorRTRTX-ExecutionProvider.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/execution-providers/TensorRTRTX-ExecutionProvider.md b/docs/execution-providers/TensorRTRTX-ExecutionProvider.md index 46d6aa47ca2be..17230459c7e36 100644 --- a/docs/execution-providers/TensorRTRTX-ExecutionProvider.md +++ b/docs/execution-providers/TensorRTRTX-ExecutionProvider.md @@ -253,6 +253,6 @@ session_options.AppendExecutionProvider(onnxruntime::kNvTensorRTRTXExecutionProv When using [onnxruntime_perf_test](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/test/perftest#onnxruntime-performance-test), use the flag `-e nvtensorrttrx` -### Plugins Support +## Plugins Support TensorRT RTX doesn’t support plugins \ No newline at end of file