
Commit 41fe3a6

Update TensorRT-LLM backend release branch (#359)

* Update TensorRT-LLM backend

1 parent 92493ef


50 files changed: +1436 −606 lines
Lines changed: 117 additions & 0 deletions

@@ -0,0 +1,117 @@
+name: "Bug Report"
+description: Submit a bug report to help us improve TensorRT-LLM backend
+labels: [ "bug" ]
+body:
+  - type: textarea
+    id: system-info
+    attributes:
+      label: System Info
+      description: Please share your system info with us.
+      placeholder: |
+        - CPU architecture (e.g., x86_64, aarch64)
+        - CPU/Host memory size (if known)
+        - GPU properties
+          - GPU name (e.g., NVIDIA H100, NVIDIA A100, NVIDIA L40S)
+          - GPU memory size (if known)
+          - Clock frequencies used (if applicable)
+        - Libraries
+          - TensorRT-LLM branch or tag (e.g., main, v0.7.1)
+          - TensorRT-LLM commit (if known)
+          - Versions of TensorRT, AMMO, CUDA, cuBLAS, etc. used
+          - Container used (if running TensorRT-LLM in a container)
+        - NVIDIA driver version
+        - OS (Ubuntu 22.04, CentOS 7, Windows 10)
+        - Docker image version
+        - Any other information that may be useful in reproducing the bug
+    validations:
+      required: true
+
+  - type: textarea
+    id: who-can-help
+    attributes:
+      label: Who can help?
+      description: |
+        To expedite the response to your issue, it would be helpful if you could identify the appropriate person
+        to tag using the **@** symbol. Here is a general guideline on **whom to tag**.
+
+        Rest assured that all issues are reviewed by the core maintainers. If you are unsure about whom to tag,
+        you can leave it blank, and a core maintainer will make sure to involve the appropriate person.
+
+        Please tag fewer than 3 people.
+
+        Quantization: @Tracin
+
+        Documentation: @juney-nvidia
+
+        Feature request: @ncomly-nvidia
+
+        Performance: @kaiyux
+
+        Others: @byshiue @schetlur-nv
+
+      placeholder: "@Username ..."
+
+  - type: checkboxes
+    id: information-scripts-examples
+    attributes:
+      label: Information
+      description: 'The problem arises when using:'
+      options:
+        - label: "The official example scripts"
+        - label: "My own modified scripts"
+
+  - type: checkboxes
+    id: information-tasks
+    attributes:
+      label: Tasks
+      description: "The tasks I am working on are:"
+      options:
+        - label: "An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)"
+        - label: "My own task or dataset (give details below)"
+
+  - type: textarea
+    id: reproduction
+    validations:
+      required: true
+    attributes:
+      label: Reproduction
+      description: |
+        Kindly share a code example that demonstrates the issue you encountered. It is recommended to provide a code snippet directly.
+        Additionally, if you have any error messages or stack traces related to the problem, please include them here.
+
+        Remember to use code tags to properly format your code. You can refer to the
+        link https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting for guidance on code formatting.
+
+        Please refrain from using screenshots, as they can be difficult to read and prevent others from copying and pasting your code.
+        It would be most helpful if we could reproduce your issue by simply copying and pasting your scripts and code.
+
+      placeholder: |
+        Steps to reproduce the behavior:
+
+        1.
+        2.
+        3.
+
+  - type: textarea
+    id: expected-behavior
+    validations:
+      required: true
+    attributes:
+      label: Expected behavior
+      description: "Provide a brief summary of the expected behavior of the software. Provide output files or examples if possible."
+
+  - type: textarea
+    id: actual-behavior
+    validations:
+      required: true
+    attributes:
+      label: Actual behavior
+      description: "Describe the actual behavior of the software and how it deviates from the expected behavior. Provide output files or examples if possible."
+
+  - type: textarea
+    id: additional-notes
+    validations:
+      required: true
+    attributes:
+      label: Additional notes
+      description: "Provide any additional context here you think might be useful for the TensorRT-LLM team to help debug this issue (such as experiments done, potential things to investigate)."

README.md

Lines changed: 87 additions & 50 deletions
@@ -1,5 +1,5 @@
<!--
-# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -41,70 +41,58 @@ available in the main [server](https://github.com/triton-inference-server/server
repo. If you don't find your answer there you can ask questions on the
[issues page](https://github.com/triton-inference-server/tensorrtllm_backend/issues).

-## Building the TensorRT-LLM Backend
+## Accessing the TensorRT-LLM Backend

There are several ways to access the TensorRT-LLM Backend.

-**Before Triton 23.10 release, please use [Option 3 to build TensorRT-LLM backend via Docker](#option-3-build-via-docker)**
+**Before Triton 23.10 release, please use [Option 3 to build TensorRT-LLM backend via Docker](#option-3-build-via-docker).**

-### Option 1. Run the Docker Container
+### Run the Pre-built Docker Container

Starting with Triton 23.10 release, Triton includes a container with the TensorRT-LLM
Backend and Python Backend. This container should have everything to run a
TensorRT-LLM model. You can find this container on the
[Triton NGC page](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver).

-### Option 2. Build via the build.py Script in Server Repo
+### Build the Docker Container
+
+#### Option 1. Build via the `build.py` Script in Server Repo

Starting with Triton 23.10 release, you can follow steps described in the
[Building With Docker](https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/build.md#building-with-docker)
guide and use the
[build.py](https://github.com/triton-inference-server/server/blob/main/build.py)
-script to build the TRT-LLM backend.
+script.

-The below commands will build the same Triton TRT-LLM container as the one on the NGC.
+A sample command to build a Triton Server container with all options enabled is
+shown below, which will build the same TRT-LLM container as the one on the NGC.

```bash
-# Prepare the TRT-LLM base image using the dockerfile from tensorrtllm_backend.
-cd tensorrtllm_backend
-# Specify the build args for the dockerfile.
-BASE_IMAGE=nvcr.io/nvidia/tritonserver:24.01-py3-min
-TRT_VERSION=9.2.0.5
-TRT_URL_x86=https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/9.2.0/tensorrt-9.2.0.5.linux.x86_64-gnu.cuda-12.2.tar.gz
-TRT_URL_ARM=https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/9.2.0/tensorrt-9.2.0.5.Ubuntu-22.04.aarch64-gnu.cuda-12.2.tar.gz
-
-docker build -t trtllm_base \
-    --build-arg BASE_IMAGE="${BASE_IMAGE}" \
-    --build-arg TRT_VER="${TRT_VERSION}" \
-    --build-arg RELEASE_URL_TRT_x86="${TRT_URL_x86}" \
-    --build-arg RELEASE_URL_TRT_ARM="${TRT_URL_ARM}" \
-    -f dockerfile/Dockerfile.triton.trt_llm_backend .
-
-# Run the build script from Triton Server repo. The flags for some features or
-# endpoints can be removed if not needed. Please refer to the support matrix to
-# see the aligned versions: https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html
-TRTLLM_BASE_IMAGE=trtllm_base
-TENSORRTLLM_BACKEND_REPO_TAG=v0.7.2
-PYTHON_BACKEND_REPO_TAG=r24.01
-
-cd server
+BASE_CONTAINER_IMAGE_NAME=nvcr.io/nvidia/tritonserver:23.10-py3-min
+TENSORRTLLM_BACKEND_REPO_TAG=release/0.5.0
+PYTHON_BACKEND_REPO_TAG=r23.10
+
+# Run the build script. The flags for some features or endpoints can be removed if not needed.
./build.py -v --no-container-interactive --enable-logging --enable-stats --enable-tracing \
    --enable-metrics --enable-gpu-metrics --enable-cpu-metrics \
    --filesystem=gcs --filesystem=s3 --filesystem=azure_storage \
    --endpoint=http --endpoint=grpc --endpoint=sagemaker --endpoint=vertex-ai \
    --backend=ensemble --enable-gpu --endpoint=http --endpoint=grpc \
-    --image=base,${TRTLLM_BASE_IMAGE} \
+    --image=base,${BASE_CONTAINER_IMAGE_NAME} \
    --backend=tensorrtllm:${TENSORRTLLM_BACKEND_REPO_TAG} \
    --backend=python:${PYTHON_BACKEND_REPO_TAG}
```

-The `TRTLLM_BASE_IMAGE` is the base image that will be used to build the
-container. The `TENSORRTLLM_BACKEND_REPO_TAG` and `PYTHON_BACKEND_REPO_TAG` are
-the tags of the TensorRT-LLM backend and Python backend repositories that will
-be used to build the container. You can also remove the features or endpoints
-that you don't need by removing the corresponding flags.
+The `BASE_CONTAINER_IMAGE_NAME` is the base image that will be used to build the
+container. By default it is set to the most recent Triton min image on NGC that
+matches the Triton release you are building for. You can change it to a
+different image if needed by setting the `--image` flag as in the command above.
+The `TENSORRTLLM_BACKEND_REPO_TAG` and `PYTHON_BACKEND_REPO_TAG` are the tags of
+the TensorRT-LLM backend and Python backend repositories that will be used
+to build the container. You can also remove the features or endpoints that you
+don't need by removing the corresponding flags.

-### Option 3. Build via Docker
+#### Option 2. Build via Docker

The version of Triton Server used in this build option can be found in the
[Dockerfile](./dockerfile/Dockerfile.trt_llm_backend).
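
As a companion to the pre-built container option above, here is a minimal sketch of pulling and entering that container. The `23.10-trtllm-python-py3` tag and the mount path are assumptions for illustration; confirm the exact image name and tag for your release on the Triton NGC page.

```bash
# Hypothetical tag; confirm the exact image name/tag on the Triton NGC page.
docker pull nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3

# Start an interactive shell with GPU access and the backend repo mounted.
docker run --rm -it --gpus all --net host --shm-size=2g \
    -v $(pwd)/tensorrtllm_backend:/tensorrtllm_backend \
    nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 bash
```
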
@@ -168,7 +156,6 @@ python3 build.py --model_dir=./c-model/gpt2/4-gpu/ \
    --paged_kv_cache \
    --use_gemm_plugin float16 \
    --remove_input_padding \
-    --use_layernorm_plugin float16 \
    --hidden_act gelu \
    --parallel_build \
    --output_dir=engines/fp16/4-gpu
@@ -210,7 +197,7 @@ cp tensorrt_llm/examples/gpt/engines/fp16/4-gpu/* triton_model_repo/tensorrt_llm
```

### Modify the model configuration
-The following table shows the fields that need to be modified before deployment:
+The following table shows the fields that may need to be modified before deployment:

*triton_model_repo/preprocessing/config.pbtxt*

@@ -223,17 +210,18 @@ The following table shows the fields that need to be modified before deployment:

| Name | Description
| :----------------------: | :-----------------------------: |
-| `decoupled` | Controls streaming. Decoupled mode must be set to `True` if using the streaming option from the client. |
-| `max_beam_width` | The maximum beam width that any request may ask for when using beam search |
-| `gpt_model_type` | Set to `inflight_fused_batching` when enabling in-flight batching support. To disable in-flight batching, set to `V1` |
-| `gpt_model_path` | Path to the TensorRT-LLM engines for deployment. In this example, the path should be set to `/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1` as the tensorrtllm_backend directory will be mounted to `/tensorrtllm_backend` within the container |
-| `max_tokens_in_paged_kv_cache` | The maximum size of the KV cache in number of tokens |
-| `max_attention_window_size` | When using techniques like sliding window attention, the maximum number of tokens that are attended to generate one token. Defaults to maximum sequence length |
-| `batch_scheduler_policy` | Set to `max_utilization` to greedily pack as many requests as possible in each current in-flight batching iteration. This maximizes the throughput but may result in overheads due to request pause/resume if KV cache limits are reached during execution. Set to `guaranteed_no_evict` to guarantee that a started request is never paused. |
-| `kv_cache_free_gpu_mem_fraction` | Set to a number between 0 and 1 to indicate the maximum fraction of GPU memory (after loading the model) that may be used for KV cache |
-| `max_num_sequences` | Maximum number of sequences that the in-flight batching scheme can maintain state for. Defaults to `max_batch_size` if `enable_trt_overlap` is `false` and to `2 * max_batch_size` if `enable_trt_overlap` is `true`, where `max_batch_size` is the TRT engine maximum batch size. |
-| `enable_trt_overlap` | Set to `true` to partition available requests into 2 'microbatches' that can be run concurrently to hide exposed CPU runtime |
-| `exclude_input_in_output` | Set to `true` to only return completion tokens in a response. Set to `false` to return the prompt tokens concatenated with the generated tokens |
+| `gpt_model_type` | Mandatory. Set to `inflight_fused_batching` when enabling in-flight batching support. To disable in-flight batching, set to `V1` |
+| `gpt_model_path` | Mandatory. Path to the TensorRT-LLM engines for deployment. In this example, the path should be set to `/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1` as the tensorrtllm_backend directory will be mounted to `/tensorrtllm_backend` within the container |
+| `batch_scheduler_policy` | Mandatory. Set to `max_utilization` to greedily pack as many requests as possible in each current in-flight batching iteration. This maximizes the throughput but may result in overheads due to request pause/resume if KV cache limits are reached during execution. Set to `guaranteed_no_evict` to guarantee that a started request is never paused. |
+| `decoupled` | Optional (default=`false`). Controls streaming. Decoupled mode must be set to `True` if using the streaming option from the client. |
+| `max_beam_width` | Optional (default=1). The maximum beam width that any request may ask for when using beam search. |
+| `max_tokens_in_paged_kv_cache` | Optional (default=unspecified). The maximum size of the KV cache in number of tokens. If unspecified, the value is interpreted as 'infinite'. The KV cache allocation is the minimum of `max_tokens_in_paged_kv_cache` and the value derived from `kv_cache_free_gpu_mem_fraction` below. |
+| `max_attention_window_size` | Optional (default=max_sequence_length). When using techniques like sliding window attention, the maximum number of tokens that are attended to generate one token. The default attends to all tokens in the sequence. |
+| `kv_cache_free_gpu_mem_fraction` | Optional (default=0.9). Set to a number between 0 and 1 to indicate the maximum fraction of GPU memory (after loading the model) that may be used for KV cache. |
+| `enable_trt_overlap` | Optional (default=`false`). Set to `true` to partition available requests into 2 'microbatches' that can be run concurrently to hide exposed CPU runtime |
+| `exclude_input_in_output` | Optional (default=`false`). Set to `true` to only return completion tokens in a response. Set to `false` to return the prompt tokens concatenated with the generated tokens |
+| `normalize_log_probs` | Optional (default=`true`). Set to `false` to skip normalization of `output_log_probs` |
+| `enable_chunked_context` | Optional (default=`false`). Set to `true` to enable context chunking. |

*triton_model_repo/postprocessing/config.pbtxt*
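
For reference, here is a hypothetical excerpt showing how a few of the fields from the *tensorrt_llm* table above could look in `triton_model_repo/tensorrt_llm/config.pbtxt`. The values are placeholders only, and `decoupled` is set through the model transaction policy rather than the `parameters` section; check the shipped config template for the exact layout.

```
# Hypothetical excerpt of triton_model_repo/tensorrt_llm/config.pbtxt; values are placeholders.
model_transaction_policy {
  decoupled: true
}

parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_fused_batching"
  }
}
parameters: {
  key: "gpt_model_path"
  value: {
    string_value: "/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1"
  }
}
parameters: {
  key: "batch_scheduler_policy"
  value: {
    string_value: "guaranteed_no_evict"
  }
}
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    string_value: "0.9"
  }
}
```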

@@ -357,6 +345,7 @@ He was a member of the French Academy of Sciences and the French Academy of Arts
Soyer was a member of the French Academy of Sciences and
```

+#### Early stopping
You can also stop the generation process early by using the `--stop-after-ms`
option to send a stop request after a few milliseconds:

@@ -368,6 +357,54 @@ You will find that the generation process is stopped early and therefore the
number of generated tokens is lower than 200. You can have a look at the
client code to see how early stopping is achieved.

+#### Return context logits and/or generation logits
+If you want to get context logits and/or generation logits, you need to enable `--gather_context_logits` and/or `--gather_generation_logits` when building the engine (or `--gather_all_token_logits` to enable both at the same time). For more details about these two flags, please refer to [build.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/gpt/build.py) or [gpt_runtime](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/gpt_runtime.md).
+
+After launching the server, you can get the logits output by passing the corresponding parameters `--return-context-logits` and/or `--return-generation-logits` in the client scripts (`end_to_end_grpc_client.py` and `inflight_batcher_llm_client.py`). For example:
+```bash
+python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py --request-output-len 20 --tokenizer-dir /path/to/tokenizer/ \
+    --return-context-logits \
+    --return-generation-logits
+```
+
+The result should be similar to the following:
+```
+Input sequence: [28524, 287, 5093, 12, 23316, 4881, 11, 30022, 263, 8776, 355, 257]
+Got completed request
+Input: Born in north-east France, Soyer trained as a
+Output beam 0: has since worked in restaurants in London,
+Output sequence: [21221, 878, 3867, 284, 3576, 287, 262, 1903, 6303, 82, 13, 679, 468, 1201, 3111, 287, 10808, 287, 3576, 11]
+context_logits.shape: (1, 12, 50257)
+context_logits: [[[ -65.9822   -62.267445  -70.08991  ...  -76.16964   -78.8893    -65.90678 ]
+  [-103.40278  -102.55243  -106.119026 ... -108.925415 -109.408585 -101.37687 ]
+  [ -63.971176  -64.03466   -67.58809  ...  -72.141235  -71.16892   -64.23846 ]
+  ...
+  [ -80.776375  -79.1815    -85.50916  ...  -87.07368   -88.02817   -79.28435 ]
+  [ -10.551408   -7.786484  -14.524468 ...  -13.805856  -15.767286   -7.9322424]
+  [-106.33096  -105.58956  -111.44852  ... -111.04858  -111.994194 -105.40376 ]]]
+generation_logits.shape: (1, 1, 20, 50257)
+generation_logits: [[[[-106.33096  -105.58956  -111.44852  ... -111.04858  -111.994194 -105.40376 ]
+   [ -77.867424  -76.96638   -83.119095 ...  -87.82542   -88.53957   -75.64877 ]
+   [-136.92282  -135.02484  -140.96051  ... -141.78284  -141.55045  -136.01668 ]
+   ...
+   [-100.03721   -98.98237  -105.25507  ... -108.49254  -109.45882   -98.95136 ]
+   [-136.78777  -136.16165  -139.13437  ... -142.21495  -143.57468  -134.94667 ]
+   [  19.222942   19.127287   14.804495 ...   10.556551    9.685863   19.625107]]]]
+```
+
+
### Launch Triton server *within Slurm based clusters*

#### Prepare some scripts

all_models/gpt/ensemble/config.pbtxt

Lines changed: 10 additions & 0 deletions
@@ -76,6 +76,12 @@ input [
    dims: [ 1 ]
    optional: true
  },
+  {
+    name: "frequency_penalty"
+    data_type: TYPE_FP32
+    dims: [ 1 ]
+    optional: true
+  },
  {
    name: "random_seed"
    data_type: TYPE_UINT64
@@ -187,6 +193,10 @@ ensemble_scheduling {
    key: "presence_penalty"
    value: "presence_penalty"
  }
+  input_map {
+    key: "frequency_penalty"
+    value: "frequency_penalty"
+  }
  input_map {
    key: "random_seed"
    value: "random_seed"
