description: Submit a bug report to help us improve TensorRT-LLM backend
labels: [ "bug" ]
body:
  - type: textarea
    id: system-info
    attributes:
      label: System Info
      description: Please share your system info with us.
      placeholder: |
        - CPU architecture (e.g., x86_64, aarch64)
        - CPU/Host memory size (if known)
        - GPU properties
          - GPU name (e.g., NVIDIA H100, NVIDIA A100, NVIDIA L40S)
          - GPU memory size (if known)
          - Clock frequencies used (if applicable)
        - Libraries
          - TensorRT-LLM branch or tag (e.g., main, v0.7.1)
          - TensorRT-LLM commit (if known)
          - Versions of TensorRT, AMMO, CUDA, cuBLAS, etc. used
          - Container used (if running TensorRT-LLM in a container)
          - NVIDIA driver version
        - OS (Ubuntu 22.04, CentOS 7, Windows 10)
        - Docker image version
        - Any other information that may be useful in reproducing the bug
    validations:
      required: true

  - type: textarea
    id: who-can-help
    attributes:
      label: Who can help?
      description: |
        To expedite the response to your issue, it would be helpful if you could identify the appropriate person
        to tag using the **@** symbol. Here is a general guideline on **whom to tag**.

        Rest assured that all issues are reviewed by the core maintainers. If you are unsure about whom to tag,
        you can leave it blank, and a core maintainer will make sure to involve the appropriate person.

        Please tag fewer than 3 people.

        Quantization: @Tracin

        Documentation: @juney-nvidia

        Feature request: @ncomly-nvidia

        Performance: @kaiyux

        Others: @byshiue @schetlur-nv

      placeholder: "@Username ..."

  - type: checkboxes
    id: information-scripts-examples
    attributes:
      label: Information
      description: 'The problem arises when using:'
      options:
        - label: "The official example scripts"
        - label: "My own modified scripts"

  - type: checkboxes
    id: information-tasks
    attributes:
      label: Tasks
      description: "The tasks I am working on are:"
      options:
        - label: "An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)"
        - label: "My own task or dataset (give details below)"

  - type: textarea
    id: reproduction
    validations:
      required: true
    attributes:
      label: Reproduction
      description: |
        Kindly share a code example that demonstrates the issue you encountered. It is recommended to provide a code snippet directly.
        Additionally, if you have any error messages or stack traces related to the problem, please include them here.

        Remember to use code tags to properly format your code. You can refer to
        https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting for guidance on code formatting.

        Please refrain from using screenshots, as they can be difficult to read and prevent others from copying and pasting your code.
        It would be most helpful if we could reproduce your issue by simply copying and pasting your scripts and code.

      placeholder: |
        Steps to reproduce the behavior:

        1.
        2.
        3.

  - type: textarea
    id: expected-behavior
    validations:
      required: true
    attributes:
      label: Expected behavior
      description: "Provide a brief summary of the expected behavior of the software. Provide output files or examples if possible."

  - type: textarea
    id: actual-behavior
    validations:
      required: true
    attributes:
      label: Actual behavior
      description: "Describe the actual behavior of the software and how it deviates from the expected behavior. Provide output files or examples if possible."

  - type: textarea
    id: additional-notes
    validations:
      required: true
    attributes:
      label: Additional notes
      description: "Provide any additional context you think might be useful for the TensorRT-LLM team to help debug this issue (such as experiments done, potential things to investigate)."
There are several ways to access the TensorRT-LLM Backend.

**Before the Triton 23.10 release, please use [Option 3 to build the TensorRT-LLM backend via Docker](#option-3-build-via-docker).**

### Run the Pre-built Docker Container

Starting with the Triton 23.10 release, Triton includes a container with the
TensorRT-LLM Backend and the Python Backend. This container should have
everything needed to run a TensorRT-LLM model. You can find this container on the
[Triton NGC page](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver).
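
The container can be pulled directly from NGC, for example (the exact image tag below is an assumption; check the NGC page above for the tag that matches your Triton release):

```bash
# Assumed tag for the 23.10 release that bundles the TensorRT-LLM and Python backends.
docker pull nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3
```
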
### Build the Docker Container

#### Option 1. Build via the `build.py` Script in Server Repo

Starting with the Triton 23.10 release, you can follow the steps described in the
[Building With Docker](https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/build.md#building-with-docker) guide.
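
As a rough sketch of what such an invocation can look like (the flags and backend repo tags below are assumptions rather than the exact command; follow the guide above for the authoritative steps):

```bash
# Hedged sketch: -v, --enable-gpu, and --backend=<name>:<tag> follow the server
# repo's build.py conventions; the repo tags shown here are assumptions.
./build.py -v --enable-gpu \
    --backend=python:r23.10 \
    --backend=tensorrtllm:release/0.5.0
```
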
| Name | Description |
| :--- | :--- |
|`gpt_model_type`| Mandatory. Set to `inflight_fused_batching` when enabling in-flight batching support. To disable in-flight batching, set to `V1`. |
|`gpt_model_path`| Mandatory. Path to the TensorRT-LLM engines for deployment. In this example, the path should be set to `/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1`, as the `tensorrtllm_backend` directory will be mounted to `/tensorrtllm_backend` within the container. |
|`batch_scheduler_policy`| Mandatory. Set to `max_utilization` to greedily pack as many requests as possible into each in-flight batching iteration. This maximizes throughput but may incur overhead due to request pause/resume if KV cache limits are reached during execution. Set to `guaranteed_no_evict` to guarantee that a started request is never paused. |
|`decoupled`| Optional (default=`false`). Controls streaming. Decoupled mode must be set to `True` if using the streaming option from the client. |
|`max_beam_width`| Optional (default=1). The maximum beam width that any request may ask for when using beam search. |
|`max_tokens_in_paged_kv_cache`| Optional (default=unspecified). The maximum size of the KV cache in number of tokens. If unspecified, the value is interpreted as 'infinite'. The KV cache allocation is the minimum of `max_tokens_in_paged_kv_cache` and the value derived from `kv_cache_free_gpu_mem_fraction` below. |
|`max_attention_window_size`| Optional (default=max_sequence_length). When using techniques like sliding window attention, the maximum number of tokens attended to in order to generate one token. The default attends to all tokens in the sequence. |
|`kv_cache_free_gpu_mem_fraction`| Optional (default=0.9). Set to a number between 0 and 1 to indicate the maximum fraction of GPU memory (after loading the model) that may be used for the KV cache. |
|`enable_trt_overlap`| Optional (default=`false`). Set to `true` to partition available requests into 2 'microbatches' that can be run concurrently to hide exposed CPU runtime. |
|`exclude_input_in_output`| Optional (default=`false`). Set to `true` to only return completion tokens in a response. Set to `false` to return the prompt tokens concatenated with the generated tokens. |
|`normalize_log_probs`| Optional (default=`true`). Set to `false` to skip normalization of `output_log_probs`. |
|`enable_chunked_context`| Optional (default=`false`). Set to `true` to enable context chunking. |
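
These options are typically set as entries in the `parameters` section of the `tensorrt_llm` model's `config.pbtxt`. A minimal sketch of two such entries, using values from the table above (the rest of the file is omitted, and the engine path shown is the one assumed by this example):

```
# Hedged sketch: only the parameter keys/values from the table above are shown.
parameters: {
  key: "gpt_model_type"
  value: { string_value: "inflight_fused_batching" }
}
parameters: {
  key: "gpt_model_path"
  value: { string_value: "/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1" }
}
```
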
*triton_model_repo/postprocessing/config.pbtxt*

#### Early stopping

You can also stop the generation process early by using the `--stop-after-ms`
option to send a stop request after a few milliseconds:
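
A minimal sketch of such a request, assuming the in-flight batcher client script referenced later in this document (the script path and the requested output length of 200 are assumptions; only `--stop-after-ms` comes from this section):

```bash
# Hedged sketch: only --stop-after-ms is documented above; the script path and
# the --request-output-len value are assumptions.
python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py \
    --request-output-len 200 \
    --stop-after-ms 200
```
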
You will find that the generation process is stopped early and therefore the
number of generated tokens is lower than 200. You can have a look at the
client code to see how early stopping is achieved.

If you want to get context logits and/or generation logits, you need to enable `--gather_context_logits` and/or `--gather_generation_logits` when building the engine (or enable `--gather_all_token_logits` to enable both at the same time). For more details about these two flags, please refer to [build.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/gpt/build.py) or [gpt_runtime](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/gpt_runtime.md).
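
For instance, an engine build might pass these flags roughly as follows (a real build needs additional model and output arguments, omitted here; only the two gather flags come from the text above):

```bash
# Hedged sketch: only the --gather_* flags are documented above; a real
# invocation of examples/gpt/build.py requires further model arguments.
python3 examples/gpt/build.py \
    --gather_context_logits \
    --gather_generation_logits
```
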
After launching the server, you can retrieve the logits by passing the corresponding parameters `--return-context-logits` and/or `--return-generation-logits` in the client scripts (`end_to_end_grpc_client.py` and `inflight_batcher_llm_client.py`). For example:
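
The sketch below shows the general shape of such a call; apart from the two `--return-*` flags and the script name, the arguments are assumptions:

```bash
# Hedged sketch: only the --return-* flags and the script name are documented
# above; --request-output-len and its value are assumptions.
python3 inflight_batcher_llm_client.py \
    --request-output-len 20 \
    --return-context-logits \
    --return-generation-logits
```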