[Draft] TRT RTX documentation update #6
base: doc
Conversation
| # NVIDIA TensorRT RTX Execution Provider | ||
| {: .no_toc } | ||
| The NVIDIA TensorRT RTX execution provider is the preferred execution provider for GPU acceleration on consumer hardware (RTX PCs). It is more straightforward to use than the datacenter-focused legacy TensorRT execution provider and more performant than the CUDA EP. |
I personally don't like making recommended bold.
Updated it. Does this read better now?
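For context while reading the thread, here is a minimal usage sketch of what the intro paragraph describes. This is an illustrative sketch, not part of the diff: it assumes an ORT build that includes this EP, takes the provider name `NvTensorRTRTXExecutionProvider` from the examples later in the diff, and uses a placeholder model path.

```python
import onnxruntime as ort

# Prefer the TensorRT RTX EP and fall back to CPU if it is not
# available in the current build. "model.onnx" is a placeholder path.
providers = [
    ("NvTensorRTRTXExecutionProvider", {}),
    ("CPUExecutionProvider", {}),
]
session = ort.InferenceSession("model.onnx", providers=providers)

# Shows which providers were actually registered for this session.
print(session.get_providers())
```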
| Currently, TensorRT RTX supports RTX GPUs based on Ampere and later architectures. Support for Turing GPUs is coming soon. | ||
| For the compatibility and support matrix, please refer to [this](https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/getting-started/support-matrix.html) page. |
That is not quite correct since we disabled Turing support for now.
Removed Turing mention now.
| ## Build from source | ||
| See [Build instructions](../build/eps.md#tensorrt-rtx). | ||
| Currently, the TensorRT RTX EP can only be built from source code. Support for installation from package managers, such as PyPI and NuGet, is coming soon. |
Should we mention that it is available through Windows ML without a build from source?
I have updated it. Does this look right?
| ```python | ||
| import onnxruntime as ort | ||
| In ONNX Runtime, execution providers are responsible for converting ONNX models into the graph format required by their specific backend SDK and subsequently compiling them into a format compatible with the target hardware. In large models like LLMs and diffusion models, this conversion and compilation process can be resource-intensive and time-consuming, often extending to tens of minutes. This overhead significantly impacts the user experience during session creation. |
Tens of minutes should hopefully not be true for TRT RTX. Besides, these are docs and the shorter the better :)
"""
EP context nodes are precompiled, execution-provider-specific optimized formats. They make it possible to compile a standard ONNX model once and to load the same unchanged model as fast as possible on every subsequent run.
"""
Does this read better now?
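As a programmatic counterpart to the perf-test command below, here is a sketch of generating an EP context model through session options. This is an assumption-laden illustration, not part of the diff: it presumes the generic ORT `ep.context_*` session config keys apply to this EP, and the file names are placeholders.

```python
import onnxruntime as ort

so = ort.SessionOptions()
# Ask ORT to dump a precompiled EP context model while creating the session.
so.add_session_config_entry("ep.context_enable", "1")
so.add_session_config_entry("ep.context_file_path", "model_ctx.onnx")

providers = [("NvTensorRTRTXExecutionProvider", {})]

# First session: compiles the model and writes model_ctx.onnx.
ort.InferenceSession("model.onnx", sess_options=so, providers=providers)

# Subsequent sessions: load the precompiled context model directly,
# skipping the expensive conversion/compilation step.
session = ort.InferenceSession("model_ctx.onnx", providers=providers)
```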
| onnxruntime_perf_test.exe -e nvtensorrtrtx -I -r 1 "/path/to/model.onnx" --compile_ep_context --compile_model_path "/path/to/model_ctx.onnx" | ||
| ``` |
I do not see the flag --compile_ep_context.
I think you might be seeing an older commit
| * **Reduced CPU Overhead**: The most significant benefit is the reduction in CPU-side work. Instead of the CPU having to schedule and dispatch hundreds or thousands of individual kernels for each inference, it only issues one command to replay the entire graph. | ||
| * **Lower Latency**: By eliminating the gaps between kernel launches, CUDA Graphs enable the GPU to work more continuously, leading to lower and more predictable end-to-end latency. | ||
| * **Improved Scalability**: This reduced overhead makes multi-threaded workloads more efficient, as the contention for CPU resources to launch kernels is minimized. | ||
| **Usage** | ||
| For models where input shapes don't change, e.g. convolutional models, CUDA Graph can be enabled by setting a provider option. By default, ORT uses a graph annotation ID of 0 and starts capturing with it. Users can control the annotation ID at runtime by setting the run option `gpu_graph_id`. Setting `gpu_graph_id` to -1 indicates that the graph will not be captured for that specific run. |
I think the key benefit should just be reducing CPU overhead. We do not want to emphasize multi-threading the CUDA workload submission.
The lower latency point has no actionable information or insights - it does not sound like a bullet point for documentation.
| ```python | ||
| import onnxruntime as ort | ||
| trt_rtx_provider_options = {'enable_cuda_graph': True} | ||
| providers = [('NvTensorRTRTXExecutionProvider', trt_rtx_provider_options)] | ||
| session = ort.InferenceSession("model.onnx", providers=providers) | ||
| ``` | ||
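To complement the snippet above, here is a sketch of the `gpu_graph_id` run option mentioned in the usage paragraph. This is illustrative only: it assumes the standard `RunOptions.add_run_config_entry` Python API, and the model path, input name, and shape are placeholders.

```python
import numpy as np
import onnxruntime as ort

# Session with CUDA Graph enabled, as in the snippet above.
trt_rtx_provider_options = {'enable_cuda_graph': True}
providers = [('NvTensorRTRTXExecutionProvider', trt_rtx_provider_options)]
session = ort.InferenceSession("model.onnx", providers=providers)

# Placeholder input; in practice, input/output buffers should keep stable
# addresses across replays (e.g. via IOBinding) for capture/replay to pay off.
inputs = {"input": np.zeros((1, 3, 224, 224), dtype=np.float32)}

# Default: annotation ID 0 is captured on the first run and replayed on later runs.
session.run(None, inputs)

# Opt a specific run out of capture/replay by setting gpu_graph_id to -1.
ro = ort.RunOptions()
ro.add_run_config_entry("gpu_graph_id", "-1")
session.run(None, inputs, run_options=ro)
```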
| model_path = '<path to model>' | ||
| **C/C++** | ||
| ```cpp | ||
| const auto& api = Ort::GetApi(); | ||
| Ort::SessionOptions session_options; | ||
| const char* keys[] = {onnxruntime::nv::provider_option_names::kCudaGraphEnable}; | ||
| const char* values[] = {"1"}; | ||
| OrtStatus* status = api.SessionOptionsAppendExecutionProvider(session_options, onnxruntime::kNvTensorRTRTXExecutionProvider, keys, values, 1); | ||
| Ort::Session session(env, model_path, session_options); | ||
| ``` | ||
| # note: for bool type options in python API, set them as False/True | ||
| provider_options = { | ||
| 'device_id': 0, | ||
| 'nv_dump_subgraphs': False, | ||
| 'nv_detailed_build_log': True, | ||
| 'user_compute_stream': stream_handle | ||
| } | ||
| **ONNXRuntime Perf Test** | ||
| ```sh | ||
| onnxruntime_perf_test.exe -I -t 5 -e nvtensorrtrtx -i "enable_cuda_graph|1" "model.onnx" | ||
| ``` |
We show how to set execution provider options at the API examples below the option table. Happy to add ORT perf test to that table as well, but this feels like duplication.
| **Where to use?** | ||
| Enabling CUDA Graph is advantageous in scenarios characterized by static execution patterns and numerous small GPU kernels, as this reduces CPU overhead and improves GPU utilization. | ||
| * **Static-shaped models**: Models with fixed input dimensions, such as many convolutional neural networks (CNNs) used for image classification, are ideal candidates. | ||
| * **LLMs with stable shapes**: For Large Language Models, CUDA Graphs are primarily utilized to optimize the decoding phase, where tokens are generated sequentially. This phase involves a repetitive sequence of identical GPU kernel launches, making it well-suited for graph capture and replay. Although the prefill phase is less suitable due to its variable input size, capturing a new graph for each recurring shape enables the decoder to achieve significant speedups and reduced CPU overhead. | ||
| * **Workloads with frequent identical executions**: Applications that repeatedly perform the same sequence of GPU operations benefit from performance improvements, as the initial cost of capturing the graph is amortized over many replays. | ||
| **Where not to use?** | ||
| Enabling CUDA Graph should be avoided or approached with caution in scenarios where the execution pattern is not stable or where the overhead outweighs the benefits. | ||
| * **Models with conditional flow or loops**: Models that use control-flow operators such as loops or conditionals can disrupt the CUDA Graph capture process. | ||
| * **Highly variable input shapes**: For dynamic-shaped models where the input shape changes with every request and there is no repetition, CUDA Graph provides no benefit. In these cases, each run would require a new graph capture, which is slower than regular execution, and the replay mechanism would not be utilized. | ||
| * **Workloads with short-lived executions**: The initial capture phase incurs a cost. If an application performs only one or two inferences, the overhead of capturing the graph may exceed any performance benefit from replaying it. | ||
| * **Models dominated by very large kernels**: If a model's total execution time is primarily spent on a few very large, long-running kernels, the CPU launch overhead is already negligible. In such cases, the benefits of CUDA Graph are minimal. |
Where to use:
- static execution plans that are executed more than 2 times

Where not to use:
- often changing input shapes or often changing input device bindings
- conditional, data-dependent control flow

Why do we want to guide away from CUDA Graphs for large kernels? Yes, the benefit is smaller, but there is no harm in enabling them.
Updated the text
| | Parameter | Type | Description | Default | | ||
| |-----------|------|-------------|---------| |
I believe the type value does not make sense since we do not support setting the actual type but only support strings. The same is true for the default value; let's give a default string.
In C/C++, it is a string, and for Python, we use the actual types.
Does it work with all strings in Python as well?
| For performance tuning, please see guidance on this page: [ONNX Runtime Perf Tuning](./../performance/tune-performance/index.md) | ||
| ## Performance test | ||
| When/if using [onnxruntime_perf_test](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/test/perftest#onnxruntime-performance-test), use the flag `-e nvtensorrtrtx`. | ||
| When using [onnxruntime_perf_test](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/test/perftest#onnxruntime-performance-test), use the flag `-e nvtensorrtrtx`. |
Here we could show how to use EP context or EP options with the perf test.
I have shown the ort perf test usage above in the EP context section - https://thevishalagarwal.github.io/onnxruntime/docs/execution-providers/TensorRTRTX-ExecutionProvider.html#ep-context-model
Updated docs - github pages