All published functionality in the Release Notes has been fully tested and verified, and any known limitations are documented. To share feedback about this release, visit our [NVIDIA Developer Forum](https://forums.developer.nvidia.com/).

## TensorRT-LLM Release 0.21.0

### Key Features and Enhancements
- **Model Support**
  - Added Gemma3 VLM support
- **Features**
  - Added large-scale EP support
  - Integrated NIXL into the communication layer of the disaggregated service
  - Added fabric memory support for KV cache transfer
  - Added MCP support in ScaffoldingLLM
  - Added support for `w4a8_mxfp4_fp8` quantization
  - Added support for FP8 rowwise quantization (see the quantization sketch after this list)
  - Added generation logits support in the TRTLLM Sampler
  - Added log probs support in the TRTLLM Sampler (see the sampler sketch after this list)
  - Optimized TRTLLM Sampler performance for the single-beam, single-step case
  - Enabled disaggregated serving for Qwen-3
  - Added EAGLE3 support for Qwen-3
  - Fused finalize and allreduce for the Qwen-MoE model
  - Refactored the Fused MoE module
  - Added support for chunked attention on Blackwell and Hopper
  - Introduced sliding-window attention kernels for the generation phase on Blackwell
  - Updated DeepSeek FP8 TRT-LLM Gen cubins to improve performance at large batch sizes
  - Added FP8 block-scale GEMM support on SM89
  - Enabled the overlap scheduler between draft forwards
  - Added piecewise CUDA graph support for MLA
  - Added model-agnostic one-engine EAGLE3
  - Enabled Finalize + AllReduce + add + RMSNorm fusion
  - Integrated TRT-LLM Gen FP8 block-scale MoE with the PyTorch workflow kernel autotuner
  - Added support for EAGLE3 + disaggregated serving in the two-model speculative decoding flow
  - Validated Llama 3.1 models on H200 NVL
- **Benchmark**
  - Added an `all_reduce.py` benchmark script for testing
  - Added beam width to the `trtllm-bench` latency command
  - Fixed `trtllm-bench` `iter_stats` and `cuda_graph_batch_sizes` errors
  - Enabled `trtllm-bench` to run LoRA and added basic end-to-end perf testing capability for LoRA
  - Added `post_proc` support for the benchmark
  - Added a `no_kv_cache_reuse` option and streaming support for `trtllm serve bench`
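
The new TRTLLM Sampler outputs are surfaced through the LLM API. Below is a minimal sketch of requesting log probs and generation logits, assuming the `logprobs` and `return_generation_logits` knobs on `SamplingParams` and the standard `LLM.generate` flow; exact field names can vary between versions, so treat this as illustrative rather than canonical.

```python
from tensorrt_llm import LLM, SamplingParams

# Assumption: `logprobs` and `return_generation_logits` are the SamplingParams
# options that map to the new TRTLLM Sampler features; check your version's docs.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(
    max_tokens=32,
    logprobs=1,                     # request per-token log probs
    return_generation_logits=True,  # request generation-phase logits
)

for output in llm.generate(["The capital of France is"], params):
    completion = output.outputs[0]
    print(completion.text)
    print(completion.logprobs)  # per-token log prob entries, if supported
```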
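
Similarly, the new quantization modes are selected through the LLM API's quantization config. The sketch below assumes the `QuantConfig`/`QuantAlgo` interface; `QuantAlgo.FP8` stands in for the exact enum values of the `w4a8_mxfp4_fp8` and FP8-rowwise variants, whose names should be confirmed against the installed release.

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo

# Assumption: the new w4a8_mxfp4_fp8 and FP8-rowwise modes are exposed as
# QuantAlgo members alongside QuantAlgo.FP8 (shown here); verify the enum names.
quant_config = QuantConfig(quant_algo=QuantAlgo.FP8)
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quant_config=quant_config)
print(llm.generate(["Hello"])[0].outputs[0].text)
```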

### Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:25.05-py3`.
- The base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:25.05-py3`.
- The dependent public PyTorch version is updated to 2.7.1 (see the version check after this list).
- The dependent TensorRT version is updated to 10.11.
- The dependent NVIDIA ModelOpt version is updated to 0.31.
- The dependent NCCL version is updated to 2.27.5.
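
A quick way to confirm that an environment picked up the updated dependencies, using the standard `__version__` attributes:

```python
# Print the installed versions to compare against the dependency updates above.
import torch
import tensorrt_llm

print("TensorRT-LLM:", tensorrt_llm.__version__)  # expected: 0.21.0
print("PyTorch:", torch.__version__)              # expected: 2.7.1
```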

### API Changes
- Set `_AutoDeployLlmArgs` as the primary config object
- Removed the decoder request from the decoder interface
- Enhanced the `torch_compile_config` in the LLM args (see the sketch after this list)
- Removed the redundant `use_kv_cache` field from `PytorchConfig`
- Moved `allreduce_strategy` from the committed API to reference
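
As a rough illustration of the enhanced `torch_compile_config`, the sketch below passes it through the LLM args. The nested option name `enable_fullgraph` is a hypothetical placeholder, not a confirmed field; consult the LLM args reference for the real schema.

```python
from tensorrt_llm import LLM

# Hypothetical: `enable_fullgraph` is an assumed option used only for
# illustration; the real torch_compile_config schema is defined by the
# LLM args reference for your installed version.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    torch_compile_config={"enable_fullgraph": True},
)
```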

### Fixed Issues
- Fixed a disaggregated service hang when MNNVL two-shot AllReduce is enabled (#4678)
- Fixed the EP load balancer with the MTP layer and route offset by EP rank (#4767)
- Fixed CUDA graph padding for speculative decoding (#4853)
- Fixed a Llama 4 long-context issue (#4809)
- Fixed the `max_num_sequences` calculation with overlap scheduling (#4532)
- Fixed chunked prefill combined with overlap scheduling (#5761)
- Fixed a `trtllm-bench` hang caused by LLM API IPC (#4798)
- Fixed an index-out-of-bounds error in speculative decoding (#5954)
- Fixed an MTP illegal memory access during CUDA graph warmup (#5947)
- Fixed a "no free slots" error with speculative decoding + disaggregated serving (#5975)
- Fixed an off-by-one attention window size for Gemma3 1B (#5564)

### Known Issues
- The `accuracy/test_cli_flow::TestGpt2::test_beam_search_large` test is broken.
- Enabling disaggregated serving, MTP, and the overlap scheduler at the same time can lead to accuracy problems.

## TensorRT-LLM Release 0.20.0

### Key Features and Enhancements