diff --git a/_config.yml b/_config.yml
index 3079a71fb07e6..7a01b8863dc32 100644
--- a/_config.yml
+++ b/_config.yml
@@ -10,7 +10,7 @@ plugins:
   - jekyll-redirect-from
 kramdown:
   parse_block_html: true
-  toc_levels: '2'
+  toc_levels: [2, 3, 4]
 logo: '/images/ONNX-Runtime-logo.svg'
 aux_links:
   'ONNX Runtime':
diff --git a/docs/build/eps.md b/docs/build/eps.md
index b55f0e381c8bd..4cae8d94d9032 100644
--- a/docs/build/eps.md
+++ b/docs/build/eps.md
@@ -161,6 +161,63 @@ Dockerfile instructions are available [here](https://github.com/microsoft/onnxru
 
 ---
 
+## NVIDIA TensorRT RTX
+
+See more information on the TensorRT RTX Execution Provider [here](../execution-providers/TensorRTRTX-ExecutionProvider.md).
+
+### Minimum requirements
+
+| ONNX Runtime | TensorRT-RTX | CUDA Toolkit |
+| :----------- | :----------- | :------------- |
+| main branch | 1.1 | 12.9 |
+| 1.23 | 1.1 | 12.9 |
+| 1.22 | 1.0 | 12.8 |
+
+### Prerequisites
+* Install git, cmake, and Python 3.12
+* Install the latest [NVIDIA driver](https://www.nvidia.com/en-us/drivers/)
+* Install [CUDA toolkit 12.9](https://developer.nvidia.com/cuda-12-9-1-download-archive)
+* Install [TensorRT RTX](https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/installing-tensorrt-rtx/installing.html)
+* For Windows only, install [Visual Studio](https://visualstudio.microsoft.com/downloads/)
+* Add the TensorRT-RTX DLLs to `PATH`, or place them in the same folder as the application executable
+
+
+```sh
+git clone https://github.com/microsoft/onnxruntime.git
+cd onnxruntime
+```
+
+### Windows
+
+```powershell
+.\build.bat --config Release --build_dir build --parallel --use_nv_tensorrt_rtx --tensorrt_rtx_home "path\to\tensorrt-rtx" --cuda_home "path\to\cuda\home" --cmake_generator "Visual Studio 17 2022" --build_shared_lib --skip_tests --build --update --use_vcpkg
+```
+
+### Linux
+
+```sh
+./build.sh --config Release --build_dir build --parallel --use_nv_tensorrt_rtx --tensorrt_rtx_home "path/to/tensorrt-rtx" --cuda_home "path/to/cuda/home" --build_shared_lib --skip_tests --build --update
+```
+
+### Run unit test
+```powershell
+.\build\Release\Release\onnxruntime_test_all.exe --gtest_filter=*NvExecutionProviderTest.*
+```
+
+### Python wheel
+
+```powershell
+# build the python wheel
+.\build.bat --config Release --build_dir build --parallel --use_nv_tensorrt_rtx --tensorrt_rtx_home "path\to\tensorrt-rtx" --cuda_home "path\to\cuda\home" --cmake_generator "Visual Studio 17 2022" --build_shared_lib --skip_tests --build_wheel
+
+# install
+pip install "build\Release\Release\dist\onnxruntime-1.23.0-cp312-cp312-win_amd64.whl"
+```
+
+> NOTE: The TensorRT-RTX .dll (Windows) or .so (Linux) files must be in `PATH` or in the same folder as the application.
+
+---
+
 ## NVIDIA Jetson TX1/TX2/Nano/Xavier/Orin
 
 ### Build Instructions
@@ -235,20 +292,7 @@ These instructions are for the latest [JetPack SDK](https://developer.nvidia.com
 
 * For a portion of Jetson devices like the Xavier series, higher power mode involves more cores (up to 6) to compute but it consumes more resource when building ONNX Runtime. Set `--parallel 1` in the build command if OOM happens and system is hanging.
 
-## TensorRT-RTX
-
-See more information on the NV TensorRT RTX Execution Provider [here](../execution-providers/TensorRTRTX-ExecutionProvider.md).
-
-### Prerequisites
-{: .no_toc }
-
- * Follow [instructions for CUDA execution provider](#cuda) to install CUDA and setup environment variables.
- * Intall TensorRT for RTX from nvidia.com (TODO: add link when available)
-
-### Build Instructions
-{: .no_toc }
-`build.bat --config Release --parallel 32 --build_dir _build --build_shared_lib --use_nv_tensorrt_rtx --tensorrt_home "C:\dev\TensorRT-RTX-1.1.0.3" --cuda_home "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.9" --cmake_generator "Visual Studio 17 2022" --use_vcpkg`
-Replace the --tensorrt_home and --cuda_home with correct paths to CUDA and TensorRT-RTX installations.
+---
 
 ## oneDNN
 
diff --git a/docs/execution-providers/TensorRTRTX-ExecutionProvider.md b/docs/execution-providers/TensorRTRTX-ExecutionProvider.md
index 4c97b79c60534..17230459c7e36 100644
--- a/docs/execution-providers/TensorRTRTX-ExecutionProvider.md
+++ b/docs/execution-providers/TensorRTRTX-ExecutionProvider.md
@@ -1,23 +1,27 @@
 ---
 title: NVIDIA - TensorRT RTX
-description: Instructions to execute ONNX Runtime on NVIDIA RTX GPUs with the Nvidia TensorRT RTX execution provider
+description: Instructions to execute ONNX Runtime on NVIDIA RTX GPUs with the NVIDIA TensorRT RTX execution provider
 parent: Execution Providers
-nav_order: 17
+nav_order: 2
 redirect_from: /docs/reference/execution-providers/TensorRTRTX-ExecutionProvider
 ---
 
-# Nvidia TensorRT RTX Execution Provider
+# NVIDIA TensorRT RTX Execution Provider
 {: .no_toc }
 
-Nvidia TensorRT RTX execution provider is the preferred execution provider for GPU acceleration on consumer hardware (RTX PCs). It is more straightforward to use than the datacenter focused legacy TensorRT Execution provider and more performant than CUDA EP.
-Just some of the things that make it a better fit on RTX PCs than our legacy TensorRT Execution Provider:
-* Much smaller footprint
-* Much faster model compile/load times.
-* Better usability in terms of use of cached models across multiple RTX GPUs.
+The NVIDIA TensorRT-RTX Execution Provider (EP) is an inference deployment solution designed specifically for NVIDIA RTX GPUs. It is optimized for client-centric use cases.
 
-The Nvidia TensorRT RTX execution provider in the ONNX Runtime makes use of NVIDIA's [TensorRT](https://developer.nvidia.com/tensorrt) RTX Deep Learning inferencing engine (TODO: correct link to TRT RTX documentation once available) to accelerate ONNX models on RTX GPUs. Microsoft and NVIDIA worked closely to integrate the TensorRT RTX execution provider with ONNX Runtime.
+TensorRT RTX EP provides the following benefits:
 
-Currently TensorRT RTX supports RTX GPUs from Ampere or later architectures. Support for Turing GPUs is coming soon.
+* **Small package footprint:** Optimized resource usage on end-user systems at just under 200 MB.
+* **Faster model compile and load times:** Leverages just-in-time compilation techniques to build RTX hardware-optimized engines on end-user devices in seconds.
+* **Portability:** Seamlessly use cached models across multiple RTX GPUs.
+
+The TensorRT RTX EP leverages NVIDIA’s new deep learning inference engine, [TensorRT for RTX](https://developer.nvidia.com/tensorrt-rtx), to accelerate ONNX models on RTX GPUs. Microsoft and NVIDIA collaborated closely to integrate the TensorRT RTX EP with ONNX Runtime.
+
+Currently, TensorRT RTX supports RTX GPUs based on Ampere and later architectures.
+
+For a full compatibility and support matrix, please refer to [this](https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/getting-started/support-matrix.html) page.
 ## Contents
 {: .no_toc }
 
@@ -26,211 +30,229 @@ Currently TensorRT RTX supports RTX GPUs from Ampere or later architectures. Sup
 {:toc}
 
 ## Install
-Please select the Nvidia TensorRT RTX version of Onnx Runtime: https://onnxruntime.ai/docs/install. (TODO!)
-## Build from source
-See [Build instructions](../build/eps.md#tensorrt-rtx).
+Currently, the TensorRT RTX EP can be built from source. Support for installation from package managers, such as PyPI and NuGet, is coming soon. See the [WinML install section](../install/#cccwinml-installs) for WinML-related installation instructions.
 
-## Requirements
+## Build from source
 
-| ONNX Runtime | TensorRT-RTX | CUDA |
-| :----------- | :----------- | :------------- |
-| main | 1.0 | 12.0-12.9 |
-| 1.22 | 1.0 | 12.0-12.9 |
+Information on how to build the TensorRT RTX EP from source can be found [here](../build/eps.md#nvidia-tensorrt-rtx).
 
 ## Usage
+
 ### C/C++
 ```c++
-const auto& api = Ort::GetApi();
+Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "SampleApp");
 Ort::SessionOptions session_options;
-api.SessionOptionsAppendExecutionProvider(session_options, "NvTensorRtRtx", nullptr, nullptr, 0);
+session_options.AppendExecutionProvider(onnxruntime::kNvTensorRTRTXExecutionProvider, {});
 Ort::Session session(env, model_path, session_options);
 ```
-The C API details are [here](../get-started/with-c.md).
-
 ### Python
-To use TensorRT RTX execution provider, you must explicitly register TensorRT RTX execution provider when instantiating the `InferenceSession`.
+
+Register the TensorRT RTX EP by specifying it in the `providers` argument when creating an `InferenceSession`.
 
 ```python
 import onnxruntime as ort
-sess = ort.InferenceSession('model.onnx', providers=['NvTensorRtRtxExecutionProvider'])
+session = ort.InferenceSession(model_path, providers=['NvTensorRtRtxExecutionProvider'])
 ```
-## Configurations
-TensorRT RTX settings can be configured via [TensorRT Execution Provider Session Option](./TensorRTRTX-ExecutionProvider.md#execution-provider-options).
+## Features
 
-Here are examples and different [scenarios](./TensorRTRTX-ExecutionProvider.md#scenario) to set NV TensorRT RTX EP session options:
+### CUDA Graph
 
-#### Click below for Python API example:
+CUDA Graph is a representation of a sequence of GPU operations, such as kernel launches and memory copies, captured from a CUDA stream. Instead of the CPU launching each operation individually, the entire sequence is captured at once and then replayed on the GPU. This process significantly reduces CPU overhead and improves GPU utilization. Find out more details about CUDA Graphs from [this blog](https://developer.nvidia.com/blog/cuda-graphs/).
-
+
+**Usage**
+
+CUDA Graph can be enabled by setting a provider option. By default, ONNX Runtime uses a graph annotation ID of 0 and starts capturing graphs. Users can control the annotation ID at runtime by setting the run option `gpu_graph_id`. Setting `gpu_graph_id` to -1 indicates that no graph will be captured for that specific run.
+
+**Python**
 ```python
-import onnxruntime as ort
+import onnxruntime as ort
+
+trt_rtx_provider_options = {'enable_cuda_graph': True}
+providers = [('NvTensorRTRTXExecutionProvider', trt_rtx_provider_options)]
+session = ort.InferenceSession("model.onnx", providers=providers)
+```
+
+**C/C++**
+```cpp
+const auto& api = Ort::GetApi();
+Ort::SessionOptions session_options;
+const char* keys[] = {onnxruntime::nv::provider_option_names::kCudaGraphEnable};
+const char* values[] = {"1"};
+OrtStatus* status = api.SessionOptionsAppendExecutionProvider(session_options, onnxruntime::kNvTensorRTRTXExecutionProvider, keys, values, 1);
+Ort::Session session(env, model_path, session_options);
+```
-
-model_path = ''
+
+**ONNXRuntime Perf Test**
+```sh
+onnxruntime_perf_test.exe -I -t 5 -e nvtensorrtrtx -i "enable_cuda_graph|1" "model.onnx"
+```
-
-# note: for bool type options in python API, set them as False/True
-provider_options = {
-    'device_id': 0,
-    'nv_dump_subgraphs': False,
-    'nv_detailed_build_log': True,
-    'user_compute_stream': stream_handle
-}
+
+**Effectively Using CUDA Graphs**
+
+CUDA Graph can be beneficial when execution patterns are static and involve many small GPU kernels. This feature helps reduce CPU overhead and improve GPU utilization, particularly for static execution plans that are run more than twice.
+
+Avoid enabling CUDA Graph, or proceed with caution, if:
+
+* Input shapes or device bindings frequently change.
+* The control flow is conditional and data-dependent.
+
+
+### EP context model
+
+EP context nodes are a precompiled, execution-provider-specific optimized format. They make it possible to compile a standard ONNX model once and keep every subsequent load of the same unchanged model as fast as possible.
+
+TensorRT RTX handles compilation in two distinct phases:
+
+* **Ahead-of-Time (AOT)**: The ONNX model is compiled into an optimized binary blob and stored as an EP context model.
+* **Just-in-Time (JIT)**: At inference time, the EP context model is loaded and TensorRT RTX dynamically compiles the binary blob (engine) to optimize it for the exact GPU hardware being used.
+
+**Generating EP Context Models**
-
-sess_opt = ort.SessionOptions()
-sess = ort.InferenceSession(model_path, sess_options=sess_opt, providers=[('NvTensorRTRTXExecutionProvider', provider_options)])
+
+ONNX Runtime 1.22 introduced dedicated [Compile APIs](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/session/compile_api.h) to simplify the generation of EP context models:
+
+```cpp
+// AOT phase
+Ort::ModelCompilationOptions compile_options(env, session_options);
+compile_options.SetInputModelPath(input_model_path);
+compile_options.SetOutputModelPath(compile_model_path);
+
+Ort::Status status = Ort::CompileModel(env, compile_options);
 ```
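+
+For Python workflows, the same AOT step can be sketched with ONNX Runtime's generic EP context session options. This is a minimal, hedged sketch: it assumes the TensorRT RTX EP honors the standard `ep.context_*` session configuration entries; the Compile API above and the Python script shown later on this page are the documented paths.
+
+```python
+import onnxruntime as ort
+
+# Assumption: the generic EP context session options are honored by this EP.
+session_options = ort.SessionOptions()
+session_options.add_session_config_entry('ep.context_enable', '1')
+session_options.add_session_config_entry('ep.context_file_path', 'model_ctx.onnx')
+session_options.add_session_config_entry('ep.context_embed_mode', '0')
+
+# Creating the session performs the AOT compilation and writes the EP context model.
+ort.InferenceSession('model.onnx', sess_options=session_options,
+                     providers=['NvTensorRTRTXExecutionProvider'])
+```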
+After successful generation, the EP context model can be directly loaded for inference: -#### Click below for C++ API example: +```cpp +// JIT phase +Ort::Session session(env, compile_model_path, session_options); +``` -
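+
+The equivalent JIT step in Python is simply creating a session over the compiled model (the file name here is assumed from the sketches above):
+
+```python
+import onnxruntime as ort
+
+# JIT phase: loading the EP context model lets TensorRT RTX specialize the
+# engine for the GPU that is actually present.
+session = ort.InferenceSession('model_ctx.onnx',
+                               providers=['NvTensorRTRTXExecutionProvider'])
+```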
+This leads to a considerable reduction in session creation time, improving the overall user experience.
+
-```c++
-Ort::SessionOptions session_options;
+The JIT time can be further improved by using a runtime cache. A runtime cache directory with a per-model cache is created. This cache stores the compiled CUDA kernels and reduces session load time. Learn more about the process [here](#runtime-cache).
+
-cudaStream_t cuda_stream;
-cudaStreamCreate(&cuda_stream);
+For practical examples of EP context usage, please refer to:
+
-// Need to put the CUDA stream handle in a string
-char streamHandle[32];
-sprintf_s(streamHandle, "%lld", (uint64_t)cuda_stream);
+* EP context samples
+* EP context [unit tests](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/test/providers/nv_tensorrt_rtx/nv_ep_context_test.cc)
+
-const auto& api = Ort::GetApi();
-std::vector option_keys = {
-    "device_id",
-    "user_compute_stream", // this implicitly sets "has_user_compute_stream"
-};
-std::vector option_values = {
-    "1",
-    streamHandle
-};
+There are two other ways to quickly generate an EP context model:
+
-Ort::ThrowOnError(api.SessionOptionsAppendExecutionProvider(session_options, "NvTensorRtRtx", option_keys.data(), option_values.data(), option_keys.size()));
+**ONNXRuntime Perf Test**
+```sh
+onnxruntime_perf_test.exe -e nvtensorrtrtx -I -r 1 --compile_ep_context --compile_model_path "/path/to/model_ctx.onnx" "/path/to/model.onnx"
 ```
+**Python Script**
-### Scenario
+
+```sh
+python tools/python/compile_ep_context_model.py -i "path/to/model.onnx" -o "/path/to/model_ctx.onnx"
+```
-| Scenario | NV TensorRT RTX EP Session Option | Type |
-| :------------------------------------------------- | :----------------------------------------------------------------------------------------- | :----- |
-| Specify GPU id for execution | [device_id](./TensorRTRTX-ExecutionProvider.md#device_id) | int |
-| Set custom compute stream for GPU operations | [user_compute_stream](./TensorRTRTX-ExecutionProvider.md#user_compute_stream) | string |
-| Set TensorRT RTX EP GPU memory usage limit | [nv_max_workspace_size](./TensorRTRTX-ExecutionProvider.md#nv_max_workspace_size) | int |
-| Dump optimized subgraphs for debugging | [nv_dump_subgraphs](./TensorRTRTX-ExecutionProvider.md#nv_dump_subgraphs) | bool |
-| Capture CUDA graph for reduced launch overhead | [nv_cuda_graph_enable](./TensorRTRTX-ExecutionProvider.md#nv_cuda_graph_enable) | bool |
-| Enable detailed logging of build steps | [nv_detailed_build_log](./TensorRTRTX-ExecutionProvider.md#nv_detailed_build_log) | bool |
-| Define min shapes | [nv_profile_min_shapes](./TensorRTRTX-ExecutionProvider.md#nv_profile_min_shapes) | string |
-| Define max shapes | [nv_profile_max_shapes](./TensorRTRTX-ExecutionProvider.md#nv_profile_max_shapes) | string |
-| Define optimal shapes | [nv_profile_opt_shapes](./TensorRTRTX-ExecutionProvider.md#nv_profile_opt_shapes) | string |
+
+**NVIDIA recommended settings**
-> Note: for bool type options, assign them with **True**/**False** in python, or **1**/**0** in C++.
+
+* For models larger than 2 GB, set `embed_mode = 0` in the model compilation options. If the binary blob is embedded within the EP context node, models larger than 2 GB fail due to protobuf limitations.
+
+```cpp
+Ort::ModelCompilationOptions compile_options(env, session_options);
+compile_options.SetEpContextEmbedMode(0);
+```
-### Execution Provider Options
-TensorRT RTX configurations can be set by execution provider options. It's useful when each model and inference session have their own configurations. All configurations should be set explicitly, otherwise default value will be taken.
+
+### Runtime cache
-##### device_id
+
+Runtime caches help reduce JIT compilation time. When a user compiles an EP context and loads the resulting model for the first time, the system generates specialized CUDA kernels for the GPU. By setting the provider option `"nv_runtime_cache_path"` to a directory, a cache is created for each TensorRT RTX engine in an EP context node. On subsequent loads, this cache allows the system to quickly deserialize precompiled kernels instead of compiling them again. This is especially helpful for large models with many different operators, such as SD 1.5, which includes a mix of Conv and MatMul operations. The cache only contains compiled kernels. No information about the model’s graph structure or weights is stored.
-* Description: GPU device ID.
-* Default value: 0
-##### user_compute_stream
+
+## Execution Provider Options
+
+The TensorRT RTX EP provides the following user-configurable options through the [Execution Provider Options](https://github.com/microsoft/onnxruntime/blob/main/include/onnxruntime/core/providers/nv_tensorrt_rtx/nv_provider_options.h):
-* Description: define the compute stream for the inference to run on. It implicitly sets the `has_user_compute_stream` option.
The stream handle needs to be printed on a string as decimal number and passed down to the session options as shown in the example above.
-* This can also be set using the python API.
-  * i.e The cuda stream captured from pytorch can be passed into ORT-NV TensorRT RTX EP. Click below to check sample code:
+
+| Parameter | Type | Description | Default |
+|-----------|------|-------------|---------|
+| device_id | `int` | GPU device identifier | 0 |
+| user_compute_stream | `str` | Specify compute stream to run GPU workload | "" |
+| nv_max_workspace_size | `int` | Maximum TensorRT engine workspace (bytes) | 0 (auto) |
+| nv_max_shared_mem_size | `int` | Maximum shared memory size the TensorRT engine may use (bytes) | 0 (auto) |
+| nv_dump_subgraphs | `bool` | Enable subgraph dumping for debugging | false |
+| nv_detailed_build_log | `bool` | Enable detailed build logging | false |
+| enable_cuda_graph | `bool` | Enable [CUDA graph](https://developer.nvidia.com/blog/cuda-graphs/) to reduce inference overhead. Helpful for smaller models | false |
+| profile_min_shapes | `str` | Comma-separated list of input tensor shapes for the minimum optimization profile. Format: `"input1:dim1xdim2x...,input2:dim1xdim2x..."` | "" (auto) |
+| profile_max_shapes | `str` | Comma-separated list of input tensor shapes for the maximum optimization profile. Format: `"input1:dim1xdim2x...,input2:dim1xdim2x..."` | "" (auto) |
+| profile_opt_shapes | `str` | Comma-separated list of input tensor shapes for the optimal optimization profile. Format: `"input1:dim1xdim2x...,input2:dim1xdim2x..."` | "" (auto) |
+| nv_multi_profile_enable | `bool` | Enable support for multiple optimization profiles in the TensorRT engine. Allows dynamic input shapes for different inference requests | false |
+| nv_use_external_data_initializer | `bool` | Use external data initializer for model weights. Useful for large EP context models with external data files | false |
+| nv_runtime_cache_path | `str` | Path to store the runtime cache. Setting this enables faster model loading by caching JIT-compiled kernels for each TensorRT RTX engine | "" (disabled) |
+
+Click below for Python API example:
- - ```python - import onnxruntime as ort - import torch - ... - sess = ort.InferenceSession('model.onnx') - if torch.cuda.is_available(): - s = torch.cuda.Stream() - provider_options = { - 'device_id': 0, - 'user_compute_stream': str(s.cuda_stream) - } - sess = ort.InferenceSession( - model_path, - providers=[('NvTensorRtRtxExecutionProvider', provider_options)] - ) +Click below for Python API example: - options = sess.get_provider_options() - assert "NvTensorRtRtxExecutionProvider" in options - assert options["NvTensorRtRtxExecutionProvider"].get("user_compute_stream", "") == str(s.cuda_stream) - ... - ``` - -
-* To take advantage of user compute stream, it is recommended to use [I/O Binding](https://onnxruntime.ai/docs/performance/device-tensor.html) to bind inputs and outputs to tensors in device. +
-##### nv_max_workspace_size +```python +import onnxruntime as ort -* Description: maximum workspace size in bytes for TensorRT RTX engine. +model_path = '/path/to/model' -* Default value: 0 (lets TensorRT pick the optimal). +# note: for bool type options in python API, set them as False/True +provider_options = { + 'device_id': 0, + 'nv_dump_subgraphs': False, + 'nv_detailed_build_log': True, + 'user_compute_stream': stream_handle +} -##### nv_dump_subgraphs +sesion_options = ort.SessionOptions() +session = ort.InferenceSession(model_path, sess_options=sesion_options, providers=[('NvTensorRTRTXExecutionProvider', provider_options)]) +``` +
-* Description: dumps the subgraphs if the ONNX was split across multiple execution providers. - * This can help debugging subgraphs, e.g. by using `trtexec --onnx subgraph_1.onnx` and check the outputs of the parser. -##### nv_detailed_build_log +Click below for C++ API example: -* Description: enable detailed build step logging on NV TensorRT RTX EP with timing for each engine build. -##### nv_cuda_graph_enable +
-* Description: this will capture a [CUDA graph](https://developer.nvidia.com/blog/cuda-graphs/) which can drastically help for a network with many small layers as it reduces launch overhead on the CPU. +```c++ +Ort::SessionOptions session_options; + +// define a cuda stream +cudaStream_t cuda_stream; +cudaStreamCreate(&cuda_stream); -##### nv_profile_min_shapes +char stream_handle[32]; +sprintf_s(stream_handle, "%lld", (uint64_t)cuda_stream); -##### nv_profile_max_shapes +std::unordered_map provider_options; +provider_options[onnxruntime::nv::provider_option_names::kDeviceId] = "1"; +provider_options[onnxruntime::nv::provider_option_names::kUserComputeStream] = stream_handle; + +session_options.AppendExecutionProvider(onnxruntime::kNvTensorRTRTXExecutionProvider, provider_options); +``` + +
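+
+For Python, a comparable sketch (assuming PyTorch with CUDA is available) passes a stream captured from PyTorch through `user_compute_stream`; it mirrors the PyTorch-stream example that previously appeared on this page. The stream handle is passed as a decimal string.
+
+```python
+import onnxruntime as ort
+import torch
+
+if torch.cuda.is_available():
+    s = torch.cuda.Stream()
+    provider_options = {
+        'device_id': 0,
+        # pass the raw CUDA stream handle as a decimal string
+        'user_compute_stream': str(s.cuda_stream),
+    }
+    session = ort.InferenceSession(
+        'model.onnx',
+        providers=[('NvTensorRTRTXExecutionProvider', provider_options)],
+    )
+```
+
+When a user compute stream is supplied, it is recommended to use [I/O Binding](https://onnxruntime.ai/docs/performance/device-tensor.html) to bind inputs and outputs to tensors on the device.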
-##### nv_profile_opt_shapes -* Description: build with explicit dynamic shapes using a profile with the min/max/opt shapes provided. - * By default TensorRT RTX engines will support dynamic shapes, for perofmance improvements it is possible to specify one or multiple explicit ranges of shapes. - * The format of the profile shapes is `input_tensor_1:dim_1xdim_2x...,input_tensor_2:dim_3xdim_4x...,...` - * These three flags should all be provided in order to enable explicit profile shapes feature. - * Note that multiple TensorRT RTX profiles can be enabled by passing multiple shapes for the same input tensor. - * Check TensorRT doc [optimization profiles](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#opt_profiles) for more details. -## NV TensorRT RTX EP Caches -There are two major TRT RTX EP caches: -* Embedded engine model / EPContext model -* Internal TensorRT RTX cache +> NOTE: For bool type options, assign them with **True**/**False** in python, or **1**/**0** in C++. -The internal TensorRT RTX cache is automatically managed by the EP. The user only needs to manage EPContext caching. -**Caching is important to help reduce session creation time drastically.** -TensorRT RTX separates compilation into an ahead of time (AOT) compiled engine and a just in time (JIT) compilation. The AOT compilation can be stored as EPcontext model, this model will be compatible across multiple GPU generations. -Upon loading such an EPcontext model TensorRT RTX will just in time compile the engine to fit to the used GPU. This JIT process is accelerated by TensorRT RTX's internal cache. -For an example usage see: -https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/test/providers/nv_tensorrt_rtx/nv_basic_test.cc +#### Profile shape options -### More about Embedded engine model / EPContext model -* TODO: decide on a plan for using weight-stripped engines by default. Fix the EP implementation to enable that. Explain the motivation and provide example on how to use the right options in this document. -* EPContext models also **enable packaging an externally compiled engine** using e.g. `trtexec`. A [python script](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/tensorrt/gen_trt_engine_wrapper_onnx_model.py) that is capable of packaging such a precompiled engine into an ONNX file is included in the python tools. (TODO: document how this works with weight-stripped engines). +* Description: build with explicit dynamic shapes using a profile with the min/max/opt shapes provided. + * By default TensorRT RTX engines support dynamic shapes. For additional performance improvements, you can specify one or multiple explicit ranges of shapes. + * The format of the profile shapes is `input_tensor_1:dim_1xdim_2x...,input_tensor_2:dim_3xdim_4x...,...` + * These three flags must be provided in order to enable explicit profile shapes. + * Note that multiple TensorRT RTX profiles can be enabled by passing multiple shapes for the same input tensor. + * Check TensorRT for RTX doc [optimization profiles](https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/inference-library/work-with-dynamic-shapes.html) for more details. 
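+
+Building on the profile shape options described above, a minimal sketch (the input name `input_ids` and its dimensions are hypothetical; substitute your model's actual inputs):
+
+```python
+import onnxruntime as ort
+
+# Hypothetical tensor name and shape ranges in the documented
+# "input:dim1xdim2x..." format; all three options are set together.
+provider_options = {
+    'profile_min_shapes': 'input_ids:1x1',
+    'profile_opt_shapes': 'input_ids:1x128',
+    'profile_max_shapes': 'input_ids:4x256',
+}
+session = ort.InferenceSession(
+    'model.onnx',
+    providers=[('NvTensorRTRTXExecutionProvider', provider_options)],
+)
+```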
-## Performance Tuning
-For performance tuning, please see guidance on this page: [ONNX Runtime Perf Tuning](./../performance/tune-performance/index.md)
+## Performance test
 
-When/if using [onnxruntime_perf_test](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/test/perftest#onnxruntime-performance-test), use the flag `-e nvtensorrttrx`.
+When using [onnxruntime_perf_test](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/test/perftest#onnxruntime-performance-test), use the flag `-e nvtensorrtrtx`.
 
+## Plugins Support
 
-### TensorRT RTX Plugins Support
-TensorRT RTX doesn't support plugins.
\ No newline at end of file
+TensorRT RTX doesn't support plugins.
\ No newline at end of file