diff --git a/README.md b/README.md index bb58d309a57..5ab7fb51b7f 100644 --- a/README.md +++ b/README.md @@ -253,5 +253,5 @@ Deprecation is used to inform developers that some APIs and tools are no longer ## Useful Links - [Quantized models on Hugging Face](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4): A growing collection of quantized (e.g., FP8, FP4) and optimized LLMs, including [DeepSeek FP4](https://huggingface.co/nvidia/DeepSeek-R1-FP4), ready for fast inference with TensorRT-LLM. - [NVIDIA Dynamo](https://github.com/ai-dynamo/dynamo): A datacenter scale distributed inference serving framework that works seamlessly with TensorRT-LLM. -- [AutoDeploy](./examples/auto_deploy/README.md): An experimental backend for TensorRT-LLM to simplify and accelerate the deployment of PyTorch models. +- [AutoDeploy](./examples/auto_deploy/README.md): A prototype backend for TensorRT-LLM to simplify and accelerate the deployment of PyTorch models. - [WeChat Discussion Group](https://github.com/NVIDIA/TensorRT-LLM/issues/5359): A real-time channel for TensorRT-LLM Q&A and news. diff --git a/docs/source/advanced/disaggregated-service.md b/docs/source/advanced/disaggregated-service.md index e5c4a19ba4b..d8e376d62cb 100644 --- a/docs/source/advanced/disaggregated-service.md +++ b/docs/source/advanced/disaggregated-service.md @@ -1,10 +1,10 @@ (disaggregated-service)= -# Disaggregated-Service (Experimental) +# Disaggregated-Service (Prototype) ```{note} Note: -This feature is currently experimental, and the related API is subjected to change in future versions. +This feature is currently in prototype, and the related API is subject to change in future versions. ``` Currently TRT-LLM supports `disaggregated-service`, where the context and generation phases of a request can run on different executors. TRT-LLM's disaggregated service relies on the executor API, please make sure to read the [executor page](executor.md) before reading the document. diff --git a/docs/source/advanced/gpt-attention.md b/docs/source/advanced/gpt-attention.md index 9fa1ae9b436..760637aed47 100644 --- a/docs/source/advanced/gpt-attention.md +++ b/docs/source/advanced/gpt-attention.md @@ -112,8 +112,6 @@ printed. #### XQA Optimization Another optimization for MQA/GQA in generation phase called XQA optimization. -It is still experimental feature and support limited configurations. LLAMA2 70B -is one model that it supports. Support matrix of the XQA optimization: - FP16 / BF16 compute data type. diff --git a/docs/source/advanced/speculative-decoding.md b/docs/source/advanced/speculative-decoding.md index 85a87ae0624..5b52c8e8a70 100644 --- a/docs/source/advanced/speculative-decoding.md +++ b/docs/source/advanced/speculative-decoding.md @@ -168,7 +168,7 @@ TensorRT-LLM implements the ReDrafter model such that logits prediction, beam se The EAGLE approach enhances the single-model Medusa method by predicting and verifying tokens using the same model. Similarly to ReDrafter, it predicts draft tokens using a recurrent predictor where each draft token depends on the previous one. However, unlike ReDrafter, it uses a single-layer transformer model to predict draft tokens from previous hidden states and decoded tokens. In the EAGLE-1 decoding tree needs to be known during the decoding. In the EAGLE-2 this tree is asssembled during the execution by searching for the most probable hypothesis along the beam.
-Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft tokens acceptance and draft token generation are performed inside of the TensorRT engine. EAGLE-1 and EAGLE-2 are both supported, while EAGLE-2 is currently in the experimental stage. Please, visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model. +Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft token acceptance and draft token generation are performed inside the TensorRT engine (EAGLE-1 and EAGLE-2 are both supported). Please visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model. ### Disaggregated Serving diff --git a/docs/source/architecture/model-weights-loader.md b/docs/source/architecture/model-weights-loader.md index eb393d4a7db..361c3853499 100644 --- a/docs/source/architecture/model-weights-loader.md +++ b/docs/source/architecture/model-weights-loader.md @@ -249,7 +249,7 @@ for tllm_key, param in tqdm(trtllm_model.named_parameters()): In this mode, every precision require user's own support. ## Trouble shooting -The weights loader is an experimental feature for now, and is enabled for LLaMA family models and Qwen models by default. +The weights loader is enabled by default for LLaMA family models and Qwen models, and applies to the TensorRT flow only. If users are encountered with failure caused by `ModelWeightsLoader`, a workaround is passing environmental variable `TRTLLM_DISABLE_UNIFIED_CONVERTER=1` to disable the model weights loader and fallback to the legacy path. diff --git a/docs/source/performance/perf-benchmarking.md b/docs/source/performance/perf-benchmarking.md index 814e27b3d38..a7ecc86f269 100644 --- a/docs/source/performance/perf-benchmarking.md +++ b/docs/source/performance/perf-benchmarking.md @@ -236,15 +236,6 @@ The following command builds an FP8 quantized engine by specifying the engine tu trtllm-bench --model meta-llama/Llama-3.1-8B build --quantization FP8 --max_seq_len 4096 --max_batch_size 1024 --max_num_tokens 2048 ``` -- [Experimental] Build engine with target ISL/OSL for optimization: -In this experimental mode, you can provide hints to `trtllm-bench`'s tuning heuristic to optimize the engine on specific ISL and OSL targets. -Generally, the target ISL and OSL aligns with the average ISL and OSL of the dataset, but you can experiment with different values to optimize the engine using this mode. -The following command builds an FP8 quantized engine and optimizes for ISL:OSL targets of 128:128. - -```shell -trtllm-bench --model meta-llama/Llama-3.1-8B build --quantization FP8 --max_seq_len 4096 --target_isl 128 --target_osl 128 -``` - #### Parallelism Mapping Support The `trtllm-bench build` subcommand supports combinations of tensor-parallel (TP) and pipeline-parallel (PP) mappings as long as the world size (`tp_size x pp_size`) `<=` `8`. The parallelism mapping in build subcommad is controlled by `--tp_size` and `--pp_size` options. The following command builds an engine with TP2-PP2 mapping. diff --git a/docs/source/reference/precision.md b/docs/source/reference/precision.md index 2d30c9053a4..b31eff6d623 100644 --- a/docs/source/reference/precision.md +++ b/docs/source/reference/precision.md @@ -103,8 +103,7 @@ Python function, for details.
This release includes examples of applying GPTQ to [GPT-NeoX](source:examples/models/core/gpt) and [LLaMA-v2](source:examples/models/core/llama), as well as an example of using AWQ with -[GPT-J](source:examples/models/contrib/gpt). Those examples are experimental implementations and -are likely to evolve in a future release. +[GPT-J](source:examples/models/contrib/gptj). ## FP8 (Hopper) diff --git a/docs/source/torch.md b/docs/source/torch.md index b04c98db1d9..c3283b52909 100644 --- a/docs/source/torch.md +++ b/docs/source/torch.md @@ -2,10 +2,9 @@ ```{note} Note: -This feature is currently experimental, and the related API is subjected to change in future versions. +This feature is currently in beta, and the related API is subject to change in future versions. ``` - -To enhance the usability of the system and improve developer efficiency, TensorRT-LLM launches a new experimental backend based on PyTorch. +To enhance the usability of the system and improve developer efficiency, TensorRT-LLM launches a new backend based on PyTorch. The PyTorch backend of TensorRT-LLM is available in version 0.17 and later. You can try it via importing `tensorrt_llm._torch`. diff --git a/examples/auto_deploy/README.md b/examples/auto_deploy/README.md index 399d31ce36b..cba226e7310 100644 --- a/examples/auto_deploy/README.md +++ b/examples/auto_deploy/README.md @@ -6,7 +6,7 @@
-AutoDeploy is an experimental feature in beta stage designed to simplify and accelerate the deployment of PyTorch models, including off-the-shelf models like those from Hugging Face, to TensorRT-LLM. It automates graph transformations to integrate inference optimizations such as tensor parallelism, KV-caching and quantization. AutoDeploy supports optimized in-framework deployment, minimizing the amount of manual modification needed. +AutoDeploy is a prototype feature in the beta stage, designed to simplify and accelerate the deployment of PyTorch models, including off-the-shelf models like those from Hugging Face, to TensorRT-LLM. It automates graph transformations to integrate inference optimizations such as tensor parallelism, KV-caching and quantization. AutoDeploy supports optimized in-framework deployment, minimizing the amount of manual modification needed. ______________________________________________________________________ @@ -450,4 +450,4 @@ the current progress in AutoDeploy and where you can help. ## Disclaimer -This project is in active development and is currently in an early (beta) stage. The code is experimental, subject to change, and may include backward-incompatible updates. While we strive for correctness, we provide no guarantees regarding functionality, stability, or reliability. Use at your own risk. +This project is in active development and is currently in an early (beta) stage. The code is a prototype, subject to change, and may include backward-incompatible updates. While we strive for correctness, we provide no guarantees regarding functionality, stability, or reliability. Use at your own risk. diff --git a/examples/disaggregated/README.md b/examples/disaggregated/README.md index 99bd3de2084..713e69e6be2 100644 --- a/examples/disaggregated/README.md +++ b/examples/disaggregated/README.md @@ -83,7 +83,7 @@ Or using the provided client parsing the prompts from a file and sending request python3 ./clients/disagg_client.py -c disagg_config.yaml -p ./clients/prompts.json -e chat ``` -## Dynamic scaling (Experimental) +## Dynamic scaling (Prototype) Currently, trtllm supports dynamic addition and removal of servers by leveraging ETCD. To enable this feature, you should start the context and generation servers with an additional flag ```--metadata_server_config_file``` and ```--server_role```. Before launching the context and generation servers, you should first start the ETCD server. By default, the ETCD server listens for client requests at ```localhost:2379```. diff --git a/examples/eagle/README.md b/examples/eagle/README.md index 637223afb91..0b103ca40ed 100644 --- a/examples/eagle/README.md +++ b/examples/eagle/README.md @@ -98,7 +98,6 @@ To run non-greedy sampling and use typical acceptance, set `--eagle_posterior_th `--temperature` can be specified as well. When no `--eagle_posterior_threshold` is specified or `--temperature=0.0` is set, greedy sampling is used. #### Run EAGLE-2 -**EAGLE-2 is still under the experimental stage.** EAGLE-2 can be enabled with 2 runtime flags (`--eagle_use_dynamic_tree` and `--eagle_dynamic_tree_max_top_k=N`). The same engine can be used for EAGLE-1 and EAGLE-2. Eagle choices must not be set in case of EAGLE-2. EAGLE-2 will generate the tree corresponding to choices dynamically in the runtime. For more details, please refer to [EAGLE-2 paper](https://arxiv.org/pdf/2406.16858).
diff --git a/examples/models/core/deepseek_v3/README.md b/examples/models/core/deepseek_v3/README.md index 3f053588059..2efe14b986d 100644 --- a/examples/models/core/deepseek_v3/README.md +++ b/examples/models/core/deepseek_v3/README.md @@ -30,7 +30,7 @@ Please refer to [this guide](https://nvidia.github.io/TensorRT-LLM/installation/ - [trtllm-serve](#trtllm-serve) - [Disaggregated Serving](#disaggregated-serving) - [Dynamo](#dynamo) - - [tensorrtllm\_backend for triton inference server (Experimental)](#tensorrtllm_backend-for-triton-inference-server-experimental) + - [tensorrtllm\_backend for triton inference server (Prototype)](#tensorrtllm_backend-for-triton-inference-server-prototype) - [Advanced Usages](#advanced-usages) - [Multi-node](#multi-node) - [mpirun](#mpirun) @@ -392,8 +392,8 @@ settings for your specific use case. NVIDIA Dynamo is a high-throughput low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments. Dynamo supports TensorRT-LLM as one of its inference engine. For details on how to use TensorRT-LLM with Dynamo please refer to [LLM Deployment Examples using TensorRT-LLM](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/README.md) -### tensorrtllm_backend for triton inference server (Experimental) -To serve the model using [tensorrtllm_backend](https://github.com/triton-inference-server/tensorrtllm_backend.git), make sure the version is v0.19+ in which the pytorch path is added as an experimental feature. +### tensorrtllm_backend for triton inference server (Prototype) +To serve the model using [tensorrtllm_backend](https://github.com/triton-inference-server/tensorrtllm_backend.git), make sure the version is v0.19+, in which the PyTorch path is added as a prototype feature. The model configuration file is located at https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/llmapi/tensorrt_llm/1/model.yaml diff --git a/examples/models/core/llama/README.md b/examples/models/core/llama/README.md index bef4f60123a..b888b287b01 100644 --- a/examples/models/core/llama/README.md +++ b/examples/models/core/llama/README.md @@ -676,7 +676,7 @@ trtllm-build --checkpoint_dir ./tllm_checkpoint_2gpu_fp8 \ The peak GPU memory consumption when doing FP8 quantizaton is more than 210GB (there is also some activation memory occupation when doing calibration). So you need a node with at least 4 H100(A100) to run the quantization command. After quantization, 2 GPUs are okay to for building and run. -Experimental: use FP8 GEMV to optimize performance in FP8 small-batch-size cases. +Note: Use FP8 GEMV to optimize performance in FP8 small-batch-size cases. ```bash # Quantize HF LLaMA 7B into FP8 and export trtllm checkpoint @@ -694,7 +694,7 @@ trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp8 \ --gemm_plugin fp8 ``` -**Note**: FP8 gemm plugin is an experimental feature aimed to improve performance in small-batch-size cases(e.g. BS<=4). Although inputs with batch size larger than 4 can be correctly inferenced, the performance may decrease as batch size grows. +**Note**: The FP8 GEMV plugin runs on CUDA cores, in contrast to the Tensor Core GEMM kernels in cuBLAS. Over the last year, cuBLAS has significantly improved its performance for small-M cases on Hopper (sm90), so the FP8 GEMV kernel may or may not outperform cuBLAS, depending on the specific GEMM problem shape. Nonetheless, we still strongly recommend the FP8 GEMV kernel for Ada (sm89), where cuBLAS still falls behind GEMV. ### Groupwise quantization (AWQ/GPTQ) One can enable AWQ/GPTQ INT4 weight only quantization with these options when building engine with `trtllm-build`: diff --git a/examples/sample_weight_stripping/README.md b/examples/sample_weight_stripping/README.md index bd28a60b840..a005f0904b1 100644 --- a/examples/sample_weight_stripping/README.md +++ b/examples/sample_weight_stripping/README.md @@ -12,7 +12,7 @@ * [Llama-7b FP16 + WoQ INT8](#llama-7b-fp16-woq-int8) * [Llama2-70b FP8 with TP=2](#llama2-70b-fp8-with-tp2) - [Engine Plan File Size Results](#engine-plan-file-size-results) -- [Experimental](#experimental) +- [Prototype](#prototype) * [Checkpoint Pruner](#checkpoint-pruner) * [Pruning a TensorRT-LLM Checkpoint](#pruning-a-tensorrt-llm-checkpoint) @@ -239,7 +239,7 @@ python3 ../summarize.py --engine_dir engines/llama2-70b-hf-fp8-tp2.refit \ |llama-7b FP16 + WoQ INT8 | 6.54GB | 28.69MB | |llama2-70b FP8 + TP=2 | 64.78GB | 60.61MB | -## Experimental +## Prototype ### Checkpoint Pruner The checkpoint pruner allows you to strip `Conv` and `Gemm` weights out of a TensorRT-LLM [checkpoint](https://nvidia.github.io/TensorRT-LLM/latest/architecture/checkpoint.html). Since these make up the vast majority of weights, the pruner will decrease the size of your checkpoint up to 99%.