[chore] Disable block reuse when draft model speculation is being used #5448

mikeiovine · 2025-06-24T20:29:39Z

Description

The issue is as follows:

The draft and target models use separate KV cache managers so that, in the general case, they can differ in head size, dtype, etc.
This line sets context_current_position > 0 if there are cached blocks: https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/pyexecutor/resource_manager.py#L310. In doing so, it mutates the LLM request.
As a result, when we then try to allocate KV cache pages for the draft model, is_first_context_chunk returns False and no pages are allocated.
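
To make the failure mode concrete, here is a toy reproduction. It is only a sketch: `ToyRequest`, `ToyKVCacheManager`, `prepare`, and `reusable_tokens` are hypothetical stand-ins, and the definition of `is_first_context_chunk` as "position is still 0" is an assumption, not the real implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRequest:
    """Hypothetical stand-in for the LLM request (not the real class)."""
    prompt_len: int
    context_current_position: int = 0

    @property
    def is_first_context_chunk(self) -> bool:
        # Assumption: "first chunk" means no context has been consumed yet.
        return self.context_current_position == 0


@dataclass
class ToyKVCacheManager:
    """Hypothetical manager that allocates pages on the first context chunk."""
    tokens_per_page: int
    reusable_tokens: int = 0  # tokens already present in cached blocks
    pages: dict = field(default_factory=dict)

    def prepare(self, req_id: int, req: ToyRequest) -> None:
        if not req.is_first_context_chunk:
            return  # pages are assumed to exist already, so nothing is allocated
        # Ceiling division: pages needed to hold the whole prompt.
        self.pages[req_id] = -(-req.prompt_len // self.tokens_per_page)
        # The mutation: cached tokens are folded into the shared request.
        req.context_current_position = self.reusable_tokens


target = ToyKVCacheManager(tokens_per_page=4, reusable_tokens=8)
draft = ToyKVCacheManager(tokens_per_page=4)
req = ToyRequest(prompt_len=10)

target.prepare(0, req)  # allocates pages and bumps the position to 8
draft.prepare(0, req)   # sees a non-first chunk and allocates nothing
```

With `reusable_tokens=0` on the target manager, both managers allocate pages, which is why the problem only shows up when block reuse finds cached blocks.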

The root of this issue is the mutation. I think we should refactor this so that the LLM request is more agnostic to its KV cache manager:

Stop including the number of cached tokens in context_current_position.
Add a method get_num_cached_tokens to the KV cache manager.
Have the runtime compute position_id = req.context_current_position + kv_cache_manager.get_num_cached_tokens(req).
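
A rough sketch of what that refactor could look like. Only `context_current_position` and `get_num_cached_tokens` come from the proposal; the classes, the `cached_tokens` dict, and `first_position_id` are hypothetical, and the real change would touch far more code.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRequest:
    """Hypothetical request that no longer stores cache-reuse state."""
    request_id: int
    context_current_position: int = 0


@dataclass
class ToyKVCacheManager:
    """Hypothetical manager that owns its own reuse bookkeeping."""
    cached_tokens: dict = field(default_factory=dict)  # request_id -> token count

    def get_num_cached_tokens(self, req: ToyRequest) -> int:
        # Read-only query: the request is never mutated here.
        return self.cached_tokens.get(req.request_id, 0)


def first_position_id(req: ToyRequest, kv_cache_manager: ToyKVCacheManager) -> int:
    # The runtime combines the two values for whichever manager it is using.
    return req.context_current_position + kv_cache_manager.get_num_cached_tokens(req)


req = ToyRequest(request_id=0)
target = ToyKVCacheManager(cached_tokens={0: 8})  # target found 8 reusable tokens
draft = ToyKVCacheManager()                       # draft found none

target_pos = first_position_id(req, target)
draft_pos = first_position_id(req, draft)
```

Because each manager answers for itself, the target and draft models can disagree about how many tokens were reused without either one affecting the other's view of the request.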

I'm not sure how feasible the above refactor is. For now, this PR just disables KV cache reuse and logs a warning whenever a drafter is required. This is consistent with what we already do when the attention backend doesn't support block reuse.
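
The shape of the workaround is roughly the following. This is a sketch under assumptions, not the actual change in py_executor_creator.py: `ToyKVCacheConfig`, `resolve_kv_cache_config`, and the flag names are hypothetical.

```python
import logging
from dataclasses import dataclass, replace

logger = logging.getLogger("toy_executor")


@dataclass(frozen=True)
class ToyKVCacheConfig:
    """Hypothetical config object; the field name is an assumption."""
    enable_block_reuse: bool = True


def resolve_kv_cache_config(
    config: ToyKVCacheConfig,
    has_draft_model: bool,
    backend_supports_reuse: bool = True,
) -> ToyKVCacheConfig:
    """Return a config with block reuse turned off where it is unsafe."""
    if not config.enable_block_reuse:
        return config  # nothing to disable
    if has_draft_model:
        logger.warning("Disabling KV cache block reuse: a draft model is in use.")
        return replace(config, enable_block_reuse=False)
    if not backend_supports_reuse:
        logger.warning("Disabling KV cache block reuse: unsupported attention backend.")
        return replace(config, enable_block_reuse=False)
    return config


with_drafter = resolve_kv_cache_config(ToyKVCacheConfig(), has_draft_model=True)
without_drafter = resolve_kv_cache_config(ToyKVCacheConfig(), has_draft_model=False)
```

Returning a new config rather than mutating the caller's object keeps the override visible at the call site, in the same spirit as the refactor proposed above.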

Test Coverage

Existing tests.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

mikeiovine · 2025-06-24T20:39:07Z

/bot run

tensorrt-cicd · 2025-06-24T20:44:58Z

PR_Github #9753 [ run ] triggered by Bot

tensorrt-cicd · 2025-06-24T22:50:14Z

PR_Github #9753 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #7187 completed with status: 'FAILURE'

mikeiovine · 2025-06-25T14:43:29Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-06-25T14:48:44Z

PR_Github #9889 [ run ] triggered by Bot

tensorrt-cicd · 2025-06-25T17:25:49Z

PR_Github #9889 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #7298 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

mikeiovine · 2025-06-25T19:21:44Z

/bot reuse-pipeline

tensorrt-cicd · 2025-06-25T19:31:27Z

PR_Github #9909 [ reuse-pipeline ] triggered by Bot

mikeiovine · 2025-06-25T19:37:02Z

/bot reuse-pipeline

tensorrt-cicd · 2025-06-25T19:43:25Z

PR_Github #9910 [ reuse-pipeline ] triggered by Bot

tensorrt-cicd · 2025-06-25T19:43:26Z

PR_Github #9909 [ reuse-pipeline ] completed with state ABORTED
Can't reuse PR_Github #9889 with status: SUCCESS

tensorrt-cicd · 2025-06-25T19:51:19Z

PR_Github #9910 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #9889 for commit 13da78b

mikeiovine requested a review from lfr-0531 June 24, 2025 20:29

mikeiovine requested review from a team as code owners June 24, 2025 20:29

mikeiovine requested a review from yuxianq June 24, 2025 20:29

mikeiovine force-pushed the prevent-crash branch 2 times, most recently from 0db19e5 to 7ec4aa0 June 24, 2025 20:38

yuxianq reviewed Jun 25, 2025

tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (outdated, resolved)

yuxianq approved these changes Jun 25, 2025

mikeiovine force-pushed the prevent-crash branch from 7ec4aa0 to dbac0db June 25, 2025 14:43

[chore] Disable block reuse when draft model speculation is being used

fdb22a0

Signed-off-by: Mike Iovine <[email protected]>

mikeiovine force-pushed the prevent-crash branch from dbac0db to fdb22a0 June 25, 2025 19:21

Merge branch 'main' into prevent-crash

13da78b

mikeiovine enabled auto-merge (squash) June 25, 2025 19:37

mikeiovine merged commit 5bc8c89 into NVIDIA:main Jun 25, 2025
3 checks passed

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 9, 2025

[chore] Disable block reuse when draft model speculation is being used (NVIDIA#5448)

23ff21d

Signed-off-by: Mike Iovine <[email protected]>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 10, 2025

[chore] Disable block reuse when draft model speculation is being used (NVIDIA#5448)

5c29ab5

Signed-off-by: Mike Iovine <[email protected]>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 10, 2025

[chore] Disable block reuse when draft model speculation is being used (NVIDIA#5448)

dfdc963

Signed-off-by: Mike Iovine <[email protected]>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 10, 2025

[chore] Disable block reuse when draft model speculation is being used (NVIDIA#5448)

3f8311b

Signed-off-by: Mike Iovine <[email protected]>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 10, 2025

[chore] Disable block reuse when draft model speculation is being used (NVIDIA#5448)

3ad260a

Signed-off-by: Mike Iovine <[email protected]>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 11, 2025

[chore] Disable block reuse when draft model speculation is being used (NVIDIA#5448)

46717a6

Signed-off-by: Mike Iovine <[email protected]>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 11, 2025

[chore] Disable block reuse when draft model speculation is being used (NVIDIA#5448)

aa524b2

Signed-off-by: Mike Iovine <[email protected]>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 11, 2025

[chore] Disable block reuse when draft model speculation is being used (NVIDIA#5448)

bfd156a

Signed-off-by: Mike Iovine <[email protected]>

ziyixiong-nv mentioned this pull request Jul 17, 2025

[TRTLLM-6452][feat]: Two-model engine KV cache reuse support #6133

Merged

mikeiovine deleted the prevent-crash branch July 23, 2025 18:00

3 participants