Conversation

@thevishalagarwal (Owner) commented Sep 2, 2025

Updated docs - github pages

@thevishalagarwal changed the title from "TRT RTX documentation update" to "[Draft] TRT RTX documentation update" on Sep 3, 2025
# NVIDIA TensorRT RTX Execution Provider
{: .no_toc }

The NVIDIA TensorRT RTX execution provider is the preferred execution provider for GPU acceleration on consumer hardware (RTX PCs). It is more straightforward to use than the datacenter-focused legacy TensorRT Execution Provider and more performant than the CUDA EP.

Collaborator

I personally don't like making recommended bold.

Owner Author

Updated it. Does this read better now?


Currently, TensorRT RTX supports RTX GPUs based on Ampere and later architectures. Support for Turing GPUs is coming soon.

For the compatibility and support matrix, please refer to [this](https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/getting-started/support-matrix.html) page.

Collaborator

That is not quite correct since we disabled Turing support for now.

Owner Author

Removed Turing mention now.


## Build from source
See [Build instructions](../build/eps.md#tensorrt-rtx).
Currently, the TensorRT RTX EP can only be built from source. Support for installation from package managers such as PyPI and NuGet is coming soon.

Collaborator

Should we mention that it is available through Windows ML without a build from source?

Owner Author

I have updated it. Does this look right?


```python
import onnxruntime as ort
```

In ONNX Runtime, each Execution Provider is responsible for converting an ONNX model into the graph format required by its backend SDK and then compiling it into a format compatible with the target hardware. For large models such as LLMs and diffusion models, this conversion and compilation can be resource-intensive and time-consuming, often extending to tens of minutes, which significantly impacts the user experience during session creation.

Collaborator

Tens of minutes should hopefully not be true for TRT RTX. Besides that these are docs and the shorter the better :)
"""
EP context nodes are precompiled, optimized formats that are execution-provider specific. They make it possible to compile a standard ONNX model once and make any subsequent load of the same unchanged model as fast as possible.
"""

Owner Author

Does this read better now?

Comment on lines 154 to 155
```sh
onnxruntime_perf_test.exe -e nvtensorrtrtx -I -r 1 "/path/to/model.onnx" --compile_ep_context --compile_model_path "/path/to/model_ctx.onnx"
```

Collaborator

I do not see the flag --compile_ep_context.

Owner Author

I think you might be seeing an older commit

Comment on lines 66 to 72
* **Reduced CPU Overhead**: The most significant benefit is the reduction in CPU-side work. Instead of the CPU having to schedule and dispatch hundreds or thousands of individual kernels for each inference, it only issues one command to replay the entire graph.
* **Lower Latency**: By eliminating the gaps between kernel launches, CUDA Graphs enable the GPU to work more continuously, leading to lower and more predictable end-to-end latency.
* **Improved Scalability**: This reduced overhead makes multi-threaded workloads more efficient, as the contention for CPU resources to launch kernels is minimized.

**Usage**

For models where input shapes don't change, e.g. convolutional models, CUDA Graph can be enabled by setting a provider option. By default, ORT uses a graph annotation ID of 0 and starts capturing with it. Users can control the annotation ID at runtime by setting the run option `gpu_graph_id`. A `gpu_graph_id` of -1 indicates that the graph will not be captured for that specific run.
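
As a rough illustration of the run option described above, a short Python sketch, assuming a `session` created with `enable_cuda_graph` (as in the examples further down) and an `inputs` feed dict:

```python
import onnxruntime as ort

# Capture/replay the CUDA Graph under a non-default annotation ID.
opts = ort.RunOptions()
opts.add_run_config_entry("gpu_graph_id", "1")
outputs = session.run(None, inputs, opts)

# A gpu_graph_id of -1 skips graph capture/replay for this particular run.
skip = ort.RunOptions()
skip.add_run_config_entry("gpu_graph_id", "-1")
outputs = session.run(None, inputs, skip)
```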

Collaborator

I think the key benefit should just be reducing CPU overhead. We do not want to emphasize multi threading the CUDA workload submission I think.

The lower latency point has no actionable information or insights - it does not sound like a bullet point for documentation.

Comment on lines 76 to 95
**Python**
```python
import onnxruntime as ort

trt_rtx_provider_options = {'enable_cuda_graph': True}
providers = [('NvTensorRTRTXExecutionProvider', trt_rtx_provider_options)]
session = ort.InferenceSession("model.onnx", providers=providers)
```

```python
model_path = '<path to model>'

# note: for bool type options in the Python API, set them as False/True
provider_options = {
    'device_id': 0,
    'nv_dump_subgraphs': False,
    'nv_detailed_build_log': True,
    'user_compute_stream': stream_handle
}
```

**C/C++**
```cpp
const auto& api = Ort::GetApi();
Ort::SessionOptions session_options;
const char* keys[] = {onnxruntime::nv::provider_option_names::kCudaGraphEnable};
const char* values[] = {"1"};
OrtStatus* status = api.SessionOptionsAppendExecutionProvider(session_options, onnxruntime::kNvTensorRTRTXExecutionProvider, keys, values, 1);
Ort::Session session(env, model_path, session_options);
```

**ONNXRuntime Perf Test**
```sh
onnxruntime_perf_test.exe -I -t 5 -e nvtensorrtrtx -i "enable_cuda_graph|1" "model.onnx"
```

Collaborator

We show how to set execution provider options at the API examples below the option table. Happy to add ORT perf test to that table as well, but this feels like duplication.

Comment on lines 98 to 111
**Where to use?**

Enabling CUDA Graph is advantageous in scenarios characterized by static execution patterns and numerous small GPU kernels, as this reduces CPU overhead and improves GPU utilization.
* **Static-shaped models**: Models with fixed input dimensions, such as many convolutional neural networks (CNNs) used for image classification, are ideal candidates.
* **LLMs with stable shapes**: For Large Language Models, CUDA Graphs are primarily utilized to optimize the decoding phase, where tokens are generated sequentially. This phase involves a repetitive sequence of identical GPU kernel launches, making it well-suited for graph capture and replay. Although the prefill phase is less suitable due to its variable input size, capturing a new graph for each recurring shape enables the decoder to achieve significant speedups and reduced CPU overhead.
* **Workloads with frequent identical executions**: Applications that repeatedly perform the same sequence of GPU operations benefit from performance improvements, as the initial cost of capturing the graph is amortized over many replays.

**Where not to use?**

Enabling CUDA Graph should be avoided or approached with caution in scenarios where the execution pattern is not stable or where the overhead outweighs the benefits.
* **Models with conditional flow or loops**: Models that use control-flow operators such as loops or conditionals can disrupt the CUDA Graph capture process.
* **Highly variable input shapes**: For dynamic-shaped models where the input shape changes with every request and there is no repetition, CUDA Graph provides no benefit. In these cases, each run would require a new graph capture, which is slower than regular execution, and the replay mechanism would not be utilized.
* **Workloads with short-lived executions**: The initial capture phase incurs a cost. If an application performs only one or two inferences, the overhead of capturing the graph may exceed any performance benefit from replaying it.
* **Models dominated by very large kernels**: If a model's total execution time is primarily spent on a few very large, long-running kernels, the CPU launch overhead is already negligible. In such cases, the benefits of CUDA Graph are minimal.

Collaborator

Where to use:

* static execution plans that are executed more than 2 times

Where not to use:

* often changing input shapes or often changing input device bindings
* conditional data-dependent control flows

Why do we want to guide away from CUDA Graphs for large kernels? Yes, the benefit is smaller, but there is no harm in enabling it.

Owner Author

Updated the text

Comment on lines +187 to +188
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|

Collaborator

I believe the type value does not make sense since we do not support setting the actual type but only support strings. The same is true for the default value; let's give a default string.

Owner Author

In C/C++, it is a string, and for Python we use the actual types.
Does it work with all strings in Python as well?

Comment on lines -230 to +270
For performance tuning, please see guidance on this page: [ONNX Runtime Perf Tuning](./../performance/tune-performance/index.md)
## Performance test
When using [onnxruntime_perf_test](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/test/perftest#onnxruntime-performance-test), use the flag `-e nvtensorrtrtx`.
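
For example, one possible invocation that also sets a provider option, reusing only the `-e`, `-I`, `-t`, and `-i` flags shown earlier in these docs (a sketch; exact flag support depends on the perf test build):

```sh
onnxruntime_perf_test.exe -e nvtensorrtrtx -I -t 5 -i "enable_cuda_graph|1" "/path/to/model.onnx"
```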

Collaborator

Here we could show how to use EP context or EP options with the perf test.
