Conversation

@thevishalagarwal (Owner) commented Sep 2, 2025

Updated docs - github pages

@thevishalagarwal changed the title from "TRT RTX documentation update" to "[Draft] TRT RTX documentation update" on Sep 3, 2025
# NVIDIA TensorRT RTX Execution Provider
{: .no_toc }

The NVIDIA TensorRT RTX execution provider is the preferred execution provider for GPU acceleration on consumer hardware (RTX PCs). It is more straightforward to use than the datacenter-focused legacy TensorRT Execution Provider and more performant than the CUDA EP.

Collaborator

I personally don't like making recommended bold.

Owner Author

Updated it. Does this read better now?


Currently, TensorRT RTX supports RTX GPUs based on Ampere and later architectures. Support for Turing GPUs is coming soon.

For the compatibility and support matrix, please refer to [this](https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/getting-started/support-matrix.html) page.

Collaborator

That is not quite correct since we disabled Turing support for now.

Owner Author

Removed Turing mention now.


## Build from source
See [Build instructions](../build/eps.md#tensorrt-rtx).
Currently, the TensorRT RTX EP can only be built from source. Support for installation from package managers such as PyPI and NuGet is coming soon.

Collaborator

Should we mention that it is available through Windows ML without a build from source?

Owner Author

I have updated it. Does this look right?


```python
import onnxruntime as ort
```

In ONNX Runtime, each Execution Provider is responsible for converting an ONNX model into the graph format required by its backend SDK and then compiling it into a format compatible with the target hardware. For large models such as LLMs and diffusion models, this conversion and compilation can be resource-intensive and time-consuming, often extending to tens of minutes, which significantly impacts the user experience during session creation.

Collaborator

Tens of minutes should hopefully not be true for TRT RTX. Besides that these are docs and the shorter the better :)
"""
EP context nodes are precompiled, optimized formats that are execution-provider specific. They make it possible to compile a standard ONNX model once and make any subsequent load of the same unchanged model as fast as possible.
"""

Owner Author

Does this read better now?

Comment on lines 154 to 155
```sh
onnxruntime_perf_test.exe -e nvtensorrtrtx -I -r 1 "/path/to/model.onnx" --compile_ep_context --compile_model_path "/path/to/model_ctx.onnx"
```

Collaborator

I do not see the flag --compile_ep_context.

Owner Author

I think you might be seeing an older commit

Comment on lines 66 to 72
* **Reduced CPU Overhead**: The most significant benefit is the reduction in CPU-side work. Instead of the CPU having to schedule and dispatch hundreds or thousands of individual kernels for each inference, it only issues one command to replay the entire graph.
* **Lower Latency**: By eliminating the gaps between kernel launches, CUDA Graphs enable the GPU to work more continuously, leading to lower and more predictable end-to-end latency.
* **Improved Scalability**: This reduced overhead makes multi-threaded workloads more efficient, as the contention for CPU resources to launch kernels is minimized.

**Usage**

For models where input shapes don't change, e.g. convolutional models, CUDA Graph can be enabled by setting a provider option. By default, ORT uses a graph annotation ID of 0 and starts capturing with it. Users can control the annotation ID at runtime by setting the run option `gpu_graph_id`. A `gpu_graph_id` of -1 indicates that the graph will not be captured for that specific run.
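
As a rough illustration of the run option described above, a short Python sketch, assuming a `session` created with `enable_cuda_graph` (as in the examples further down) and an `inputs` feed dict:

```python
import onnxruntime as ort

# Capture/replay the CUDA Graph under a non-default annotation ID.
opts = ort.RunOptions()
opts.add_run_config_entry("gpu_graph_id", "1")
outputs = session.run(None, inputs, opts)

# A gpu_graph_id of -1 skips graph capture/replay for this particular run.
skip = ort.RunOptions()
skip.add_run_config_entry("gpu_graph_id", "-1")
outputs = session.run(None, inputs, skip)
```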

Collaborator

I think the key benefit should just be reducing CPU overhead. We do not want to emphasize multi threading the CUDA workload submission I think.

The lower latency point has no actionable information or insights - it does not sound like a bullet point for documentation.

Comment on lines 76 to 95
**Python**
```python
import onnxruntime as ort

trt_rtx_provider_options = {'enable_cuda_graph': True}
providers = [('NvTensorRTRTXExecutionProvider', trt_rtx_provider_options)]
session = ort.InferenceSession("model.onnx", providers=providers)
```

```python
model_path = '<path to model>'

# note: for bool type options in the Python API, set them as False/True
provider_options = {
    'device_id': 0,
    'nv_dump_subgraphs': False,
    'nv_detailed_build_log': True,
    'user_compute_stream': stream_handle
}
```

**C/C++**
```cpp
const auto& api = Ort::GetApi();
Ort::SessionOptions session_options;
const char* keys[] = {onnxruntime::nv::provider_option_names::kCudaGraphEnable};
const char* values[] = {"1"};
OrtStatus* status = api.SessionOptionsAppendExecutionProvider(session_options, onnxruntime::kNvTensorRTRTXExecutionProvider, keys, values, 1);
Ort::Session session(env, model_path, session_options);
```

**ONNXRuntime Perf Test**
```sh
onnxruntime_perf_test.exe -I -t 5 -e nvtensorrtrtx -i "enable_cuda_graph|1" "model.onnx"
```

Collaborator

We show how to set execution provider options at the API examples below the option table. Happy to add ORT perf test to that table as well, but this feels like duplication.

Comment on lines 98 to 111
**Where to use?**

Enabling CUDA Graph is advantageous in scenarios characterized by static execution patterns and numerous small GPU kernels, as this reduces CPU overhead and improves GPU utilization.
* **Static-shaped models**: Models with fixed input dimensions, such as many convolutional neural networks (CNNs) used for image classification, are ideal candidates.
* **LLMs with stable shapes**: For Large Language Models, CUDA Graphs are primarily utilized to optimize the decoding phase, where tokens are generated sequentially. This phase involves a repetitive sequence of identical GPU kernel launches, making it well-suited for graph capture and replay. Although the prefill phase is less suitable due to its variable input size, capturing a new graph for each recurring shape enables the decoder to achieve significant speedups and reduced CPU overhead.
* **Workloads with frequent identical executions**: Applications that repeatedly perform the same sequence of GPU operations benefit from performance improvements, as the initial cost of capturing the graph is amortized over many replays.

**Where not to use?**

Enabling CUDA Graph should be avoided or approached with caution in scenarios where the execution pattern is not stable or where the overhead outweighs the benefits.
* **Models with conditional flow or loops**: Models that use control-flow operators such as loops or conditionals can disrupt the CUDA Graph capture process.
* **Highly variable input shapes**: For dynamic-shaped models where the input shape changes with every request and there is no repetition, CUDA Graph provides no benefit. In these cases, each run would require a new graph capture, which is slower than regular execution, and the replay mechanism would not be utilized.
* **Workloads with short-lived executions**: The initial capture phase incurs a cost. If an application performs only one or two inferences, the overhead of capturing the graph may exceed any performance benefit from replaying it.
* **Models dominated by very large kernels**: If a model's total execution time is primarily spent on a few very large, long-running kernels, the CPU launch overhead is already negligible. In such cases, the benefits of CUDA Graph are minimal.

Collaborator

Where to use:

* static execution plans that are executed more than 2 times

Where not to use:

* often changing input shapes or often changing input device bindings
* conditional data-dependent control flows

Why do we want to guide away from CUDA Graphs for large kernels? Yes, the benefit is smaller, but there is no harm in enabling it.

Owner Author

Updated the text

Comment on lines +187 to +188
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|

Collaborator

I believe the type value does not make sense since we do not support setting the actual type but only support strings. The same is true for the default value; let's give a default string.

Owner Author

In C/C++, it is a string, and for Python we use the actual types.
Does it work with all strings in Python as well?

Comment on lines -230 to +270
For performance tuning, please see guidance on this page: [ONNX Runtime Perf Tuning](./../performance/tune-performance/index.md)
## Performance test
When using [onnxruntime_perf_test](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/test/perftest#onnxruntime-performance-test), use the flag `-e nvtensorrtrtx`.
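
For example, one possible invocation that also sets a provider option, reusing only the `-e`, `-I`, `-t`, and `-i` flags shown earlier in these docs (a sketch; exact flag support depends on the perf test build):

```sh
onnxruntime_perf_test.exe -e nvtensorrtrtx -I -t 5 -i "enable_cuda_graph|1" "/path/to/model.onnx"
```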

Collaborator

Here we could show how to use EP context or EP options with the perf test.
