
Conversation

@lfr-0531 (Collaborator) commented May 30, 2025

Description

  • Enable the overlap scheduler between different draft forwards.
  • Add a new Eagle3ResourceManager to manage the hidden states and remove the extra model input (see the sketch after this list).
  • Move the Eagle3 FC layer into the model forward.
  • Move the host-to-device (h2d) copy to the end of _prepare_tp_inputs to hide the CPU time.
  • Disable CUDA graph for the first draft forward.
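
To make the Eagle3ResourceManager item concrete, here is a minimal sketch of what a hidden-state resource manager along these lines could look like. Only the class name comes from this PR; the buffer layout, slot bookkeeping, and method names below are illustrative assumptions, not the actual TensorRT-LLM implementation.

```python
import torch

class Eagle3ResourceManager:
    """Illustrative sketch: keeps per-request hidden states between the
    target forward and the draft forwards, so they no longer need to be
    threaded through as an extra model input."""

    def __init__(self, max_num_requests: int, hidden_size: int,
                 dtype: torch.dtype = torch.bfloat16, device: str = "cuda"):
        # Preallocated slots avoid per-iteration allocations, which would
        # add CPU overhead and interfere with CUDA graph capture.
        self.hidden_states = torch.empty(
            (max_num_requests, hidden_size), dtype=dtype, device=device)
        self.free_slots = list(range(max_num_requests))
        self.slot_of_request: dict[int, int] = {}

    def prepare_resources(self, request_id: int) -> int:
        # Assign a slot when a request is scheduled.
        slot = self.free_slots.pop()
        self.slot_of_request[request_id] = slot
        return slot

    def write(self, request_id: int, hidden: torch.Tensor) -> None:
        # Called from the target model's forward to stash the hidden
        # states that the Eagle3 draft forwards will consume.
        self.hidden_states[self.slot_of_request[request_id]].copy_(
            hidden, non_blocking=True)

    def read(self, request_id: int) -> torch.Tensor:
        return self.hidden_states[self.slot_of_request[request_id]]

    def free_resources(self, request_id: int) -> None:
        self.free_slots.append(self.slot_of_request.pop(request_id))
```

Preallocating the slots keeps the buffer addresses stable across iterations, which is what lets the draft forwards read the hidden states in place instead of receiving them as a model input.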

I'll collect more performance and accuracy data and update the results here.

@lfr-0531 lfr-0531 force-pushed the user/fanrongl/reduce_eagle3_cpu_overhead branch 4 times, most recently from 652be47 to 275a6c3 Compare June 6, 2025 10:13
@lfr-0531 lfr-0531 marked this pull request as ready for review June 6, 2025 10:15
@lfr-0531 lfr-0531 requested review from a team as code owners June 6, 2025 10:15
@lfr-0531 lfr-0531 requested review from hyukn, juney-nvidia and mikeiovine and removed request for hyukn and juney-nvidia June 6, 2025 10:15
@lfr-0531 lfr-0531 changed the title from "draft: enable overlap scheduler between draft forwards" to "[TRTLLM-4983] feat: enable overlap scheduler between draft forwards" Jun 6, 2025
@lfr-0531 (Collaborator Author) commented Jun 6, 2025

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #7897 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #7897 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #5706 completed with status: 'FAILURE'

@mikeiovine (Collaborator) left a comment

Shall we measure the performance gain? Since our initial goal is to hit parity with vLLM, I think using Llama 3.3 70B on Hopper makes sense. I got these numbers on 8xH200 today.

| Framework | Max Batch Size | OSL | Eagle (draft len = 3)? | Output tok/sec |
| --------- | -------------- | --- | ---------------------- | -------------- |
| TRTLLM    | 8              | 256 | No                     | 876            |
| TRTLLM    | 8              | 256 | Yes                    | 1154           |
| vLLM      | 8              | 256 | No                     | 715            |
| vLLM      | 8              | 256 | Yes                    | 1430           |

A note about the dataset: I used gsm8k from here. For the Llama 3.3 Eagle drafters we have from the paper authors, you have to apply the tokenizer's chat template, or the acceptance rate (AR) will drop significantly and performance will regress (see the sketch below). I will send you a preprocessed dataset that is compatible with trtllm-bench.
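
For reference, applying the chat template with a Hugging Face tokenizer looks roughly like this. The model ID and the helper function are assumptions for illustration, not the exact preprocessing used for the shared dataset.

```python
from transformers import AutoTokenizer

# Assumed model ID for illustration; the gated repo requires access approval.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")

def to_chat_prompt(question: str) -> str:
    # Wrap a raw gsm8k question in the model's chat template. Skipping this
    # step makes the Eagle drafter's acceptance rate drop sharply, because
    # the draft heads were trained on chat-formatted inputs.
    return tokenizer.apply_chat_template(
        [{"role": "user", "content": question}],
        tokenize=False,
        add_generation_prompt=True,
    )
```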

@lfr-0531 lfr-0531 force-pushed the user/fanrongl/reduce_eagle3_cpu_overhead branch 3 times, most recently from affc695 to 095d2c2 Compare June 10, 2025 11:56
@lfr-0531 (Collaborator Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #8288 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #8288 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #6000 completed with status: 'FAILURE'

@lfr-0531 lfr-0531 force-pushed the user/fanrongl/reduce_eagle3_cpu_overhead branch from 095d2c2 to 52328f9 Compare June 10, 2025 15:42
@lfr-0531 (Collaborator Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #8319 [ run ] triggered by Bot

@lfr-0531 (Collaborator Author)

/bot kill

@lfr-0531 lfr-0531 force-pushed the user/fanrongl/reduce_eagle3_cpu_overhead branch 2 times, most recently from 775cd4c to 5267115 Compare June 11, 2025 02:38
@tensorrt-cicd (Collaborator)

PR_Github #8377 [ kill ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #8802 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #8802 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #6390 completed with status: 'FAILURE'

@mikeiovine (Collaborator)

@lfr-0531 That plan sounds good to me. It can still be useful for other use cases (@IzzyPutterman's draft/target speculative decoding should land soon, and it'll be useful there since the draft models are a bit bigger than EAGLE's).

@IzzyPutterman (Collaborator)

Do we already have logic to make sure that the extra overlap pass doesn't run (and infringe on the verify pass)? I think I have an internal MR that does this from a while ago.

@IzzyPutterman (Collaborator)

Something like this: #5211

@lfr-0531 lfr-0531 force-pushed the user/fanrongl/reduce_eagle3_cpu_overhead branch from 9ec2aa9 to be5b542 Compare June 15, 2025 00:35
@lfr-0531 (Collaborator Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #8907 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #8907 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #6492 completed with status: 'FAILURE'

@lfr-0531 (Collaborator Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #8925 [ run ] triggered by Bot

@lfr-0531 (Collaborator Author)

> Do we already have logic to make sure that the extra overlap pass doesn't run (and infringe on the verify pass)? I think I have an internal MR that does this from a while ago.

We didn't add that in this PR; this PR only overlaps the different forwards within the same iteration (roughly as sketched below). Once PR #5211 lands, the last iteration, including _prepare_draft_tokens, can be skipped.
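
As a rough illustration of what overlapping forwards within one iteration buys: while the GPU executes draft step i, the CPU already prepares the inputs for step i+1, and the h2d copy is issued last with non_blocking=True so it hides under the running kernels. Everything named below (scheduler, prepare_inputs, draft_model) is a hypothetical stand-in, not the executor's real API.

```python
import torch

def run_draft_loop(draft_model, scheduler, num_draft_steps: int):
    # Sketch of the overlap: GPU kernels for step i run while the CPU
    # builds the (pinned, CPU-side) input tensor for step i + 1.
    cpu_inputs = scheduler.prepare_inputs(step=0)
    for step in range(num_draft_steps):
        # Issue the h2d copy last, then enqueue the forward; both calls
        # return quickly, leaving the GPU busy with the draft kernels.
        gpu_inputs = cpu_inputs.to("cuda", non_blocking=True)
        draft_model.forward(gpu_inputs)
        if step + 1 < num_draft_steps:
            # Overlapped CPU work: sampled draft tokens stay on the
            # device, so no synchronization is needed here.
            cpu_inputs = scheduler.prepare_inputs(step=step + 1)
    torch.cuda.synchronize()
```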

@tensorrt-cicd (Collaborator)

PR_Github #8925 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #6509 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@lfr-0531 lfr-0531 merged commit 39bba63 into NVIDIA:main Jun 15, 2025
3 checks passed
@lfr-0531 lfr-0531 deleted the user/fanrongl/reduce_eagle3_cpu_overhead branch June 27, 2025 12:43
@jhaotingc (Collaborator)

Hi @lfr-0531,
So, is the overlap scheduler supported in the Eagle3 two-model path yet? (support_overlap_scheduler)
cc @mikeiovine

Thanks!
