diff --git a/README.md b/README.md
index bb58d309a57..5ab7fb51b7f 100644
--- a/README.md
+++ b/README.md
@@ -253,5 +253,5 @@ Deprecation is used to inform developers that some APIs and tools are no longer
 ## Useful Links
 - [Quantized models on Hugging Face](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4): A growing collection of quantized (e.g., FP8, FP4) and optimized LLMs, including [DeepSeek FP4](https://huggingface.co/nvidia/DeepSeek-R1-FP4), ready for fast inference with TensorRT-LLM.
 - [NVIDIA Dynamo](https://github.com/ai-dynamo/dynamo): A datacenter scale distributed inference serving framework that works seamlessly with TensorRT-LLM.
-- [AutoDeploy](./examples/auto_deploy/README.md): An experimental backend for TensorRT-LLM to simplify and accelerate the deployment of PyTorch models.
+- [AutoDeploy](./examples/auto_deploy/README.md): A prototype backend for TensorRT-LLM to simplify and accelerate the deployment of PyTorch models.
 - [WeChat Discussion Group](https://github.com/NVIDIA/TensorRT-LLM/issues/5359): A real-time channel for TensorRT-LLM Q&A and news.
diff --git a/docs/source/advanced/disaggregated-service.md b/docs/source/advanced/disaggregated-service.md
index e5c4a19ba4b..d8e376d62cb 100644
--- a/docs/source/advanced/disaggregated-service.md
+++ b/docs/source/advanced/disaggregated-service.md
@@ -1,10 +1,10 @@
 (disaggregated-service)=
 
-# Disaggregated-Service (Experimental)
+# Disaggregated-Service (Prototype)
 
 ```{note}
 Note:
-This feature is currently experimental, and the related API is subjected to change in future versions.
+This feature is currently a prototype, and the related API is subject to change in future versions.
 ```
 
 Currently TRT-LLM supports `disaggregated-service`, where the context and generation phases of a request can run on different executors. TRT-LLM's disaggregated service relies on the executor API, please make sure to read the [executor page](executor.md) before reading the document.
diff --git a/docs/source/advanced/gpt-attention.md b/docs/source/advanced/gpt-attention.md
index 9fa1ae9b436..760637aed47 100644
--- a/docs/source/advanced/gpt-attention.md
+++ b/docs/source/advanced/gpt-attention.md
@@ -112,8 +112,6 @@ printed.
 #### XQA Optimization
 
 Another optimization for MQA/GQA in generation phase called XQA optimization.
-It is still experimental feature and support limited configurations. LLAMA2 70B
-is one model that it supports.
 
 Support matrix of the XQA optimization:
 - FP16 / BF16 compute data type.
diff --git a/docs/source/advanced/speculative-decoding.md b/docs/source/advanced/speculative-decoding.md
index 85a87ae0624..5b52c8e8a70 100644
--- a/docs/source/advanced/speculative-decoding.md
+++ b/docs/source/advanced/speculative-decoding.md
@@ -168,7 +168,7 @@ TensorRT-LLM implements the ReDrafter model such that logits prediction, beam se
 
 The EAGLE approach enhances the single-model Medusa method by predicting and verifying tokens using the same model. Similarly to ReDrafter, it predicts draft tokens using a recurrent predictor where each draft token depends on the previous one. However, unlike ReDrafter, it uses a single-layer transformer model to predict draft tokens from previous hidden states and decoded tokens. In the EAGLE-1 decoding tree needs to be known during the decoding. In the EAGLE-2 this tree is asssembled during the execution by searching for the most probable hypothesis along the beam.
 
-Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft tokens acceptance and draft token generation are performed inside of the TensorRT engine. EAGLE-1 and EAGLE-2 are both supported, while EAGLE-2 is currently in the experimental stage. Please, visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
+Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft token acceptance, and draft token generation are performed inside the TensorRT engine (EAGLE-1 and EAGLE-2 are both supported). Please visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
 
 ### Disaggregated Serving
 
diff --git a/docs/source/architecture/model-weights-loader.md b/docs/source/architecture/model-weights-loader.md
index eb393d4a7db..361c3853499 100644
--- a/docs/source/architecture/model-weights-loader.md
+++ b/docs/source/architecture/model-weights-loader.md
@@ -249,7 +249,7 @@ for tllm_key, param in tqdm(trtllm_model.named_parameters()):
 In this mode, every precision require user's own support.
 
 ## Trouble shooting
-The weights loader is an experimental feature for now, and is enabled for LLaMA family models and Qwen models by default.
+The weights loader is enabled by default for LLaMA-family and Qwen models, and applies to the TensorRT flow only.
 If users are encountered with failure caused by `ModelWeightsLoader`, a workaround is passing environmental variable `TRTLLM_DISABLE_UNIFIED_CONVERTER=1` to disable the model weights loader and fallback to the legacy path.
 
diff --git a/docs/source/performance/perf-benchmarking.md b/docs/source/performance/perf-benchmarking.md
index 814e27b3d38..a7ecc86f269 100644
--- a/docs/source/performance/perf-benchmarking.md
+++ b/docs/source/performance/perf-benchmarking.md
@@ -236,15 +236,6 @@ The following command builds an FP8 quantized engine by specifying the engine tu
 trtllm-bench --model meta-llama/Llama-3.1-8B build --quantization FP8 --max_seq_len 4096 --max_batch_size 1024 --max_num_tokens 2048
 ```
 
-- [Experimental] Build engine with target ISL/OSL for optimization:
-In this experimental mode, you can provide hints to `trtllm-bench`'s tuning heuristic to optimize the engine on specific ISL and OSL targets.
-Generally, the target ISL and OSL aligns with the average ISL and OSL of the dataset, but you can experiment with different values to optimize the engine using this mode.
-The following command builds an FP8 quantized engine and optimizes for ISL:OSL targets of 128:128.
-
-```shell
-trtllm-bench --model meta-llama/Llama-3.1-8B build --quantization FP8 --max_seq_len 4096 --target_isl 128 --target_osl 128
-```
-
 #### Parallelism Mapping Support
 
 The `trtllm-bench build` subcommand supports combinations of tensor-parallel (TP) and pipeline-parallel (PP) mappings as long as the world size (`tp_size x pp_size`) `<=` `8`. The parallelism mapping in build subcommad is controlled by `--tp_size` and `--pp_size` options. The following command builds an engine with TP2-PP2 mapping.
diff --git a/docs/source/reference/precision.md b/docs/source/reference/precision.md
index 2d30c9053a4..b31eff6d623 100644
--- a/docs/source/reference/precision.md
+++ b/docs/source/reference/precision.md
@@ -103,8 +103,7 @@ Python function, for details.
 
 This release includes examples of applying GPTQ to [GPT-NeoX](source:examples/models/core/gpt) and
 [LLaMA-v2](source:examples/models/core/llama), as well as an example of using AWQ with
-[GPT-J](source:examples/models/contrib/gpt). Those examples are experimental implementations and
-are likely to evolve in a future release.
+[GPT-J](source:examples/models/contrib/gptj).
 
 ## FP8 (Hopper)
 
diff --git a/docs/source/torch.md b/docs/source/torch.md
index b04c98db1d9..c3283b52909 100644
--- a/docs/source/torch.md
+++ b/docs/source/torch.md
@@ -2,10 +2,9 @@
 
 ```{note}
 Note:
-This feature is currently experimental, and the related API is subjected to change in future versions.
+This feature is currently in beta, and the related API is subject to change in future versions.
 ```
-
-To enhance the usability of the system and improve developer efficiency, TensorRT-LLM launches a new experimental backend based on PyTorch.
+To enhance the usability of the system and improve developer efficiency, TensorRT-LLM introduces a new backend based on PyTorch.
 The PyTorch backend of TensorRT-LLM is available in version 0.17 and later. You can try it via importing `tensorrt_llm._torch`.
 
diff --git a/examples/auto_deploy/README.md b/examples/auto_deploy/README.md
index 399d31ce36b..cba226e7310 100644
--- a/examples/auto_deploy/README.md
+++ b/examples/auto_deploy/README.md
@@ -6,7 +6,7 @@