[TRTLLM-4983] feat: enable overlap scheduler between draft forwards #4802
Conversation
mikeiovine left a comment:
Shall we measure the performance gain? Since our initial goal is to hit parity with vLLM, I think using Llama 3.3 70B on Hopper makes sense. I got these numbers on 8xH200 today.
| Framework | Max Batch Size | OSL (output seq len) | EAGLE (draft len = 3)? | Output tok/sec |
|---|---|---|---|---|
| TRTLLM | 8 | 256 | No | 876 |
| TRTLLM | 8 | 256 | Yes | 1154 |
| vLLM | 8 | 256 | No | 715 |
| vLLM | 8 | 256 | Yes | 1430 |
A note about the dataset: I used gsm8k from here. For the Llama 3.3 EAGLE drafters we have from the paper authors, you have to apply the tokenizer's chat template, or the acceptance rate (AR) will drop significantly and performance will regress. I will send you a preprocessed dataset that is compatible with trtllm-bench.
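For reference, a minimal preprocessing sketch (not part of this PR) that applies the tokenizer's chat template to gsm8k questions before benchmarking. The tokenizer name and the output fields `task_id`, `prompt`, and `output_tokens` are assumptions, not a confirmed trtllm-bench schema:

```python
# Hypothetical preprocessing sketch: render gsm8k questions with the model's
# chat template so the EAGLE drafter sees prompts in the format it was trained on.
import json

from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed target model / tokenizer; swap in whichever checkpoint you benchmark.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")
dataset = load_dataset("openai/gsm8k", "main", split="test")

with open("gsm8k_chat_template.jsonl", "w") as f:
    for i, sample in enumerate(dataset):
        # apply_chat_template wraps the question in the model's chat markup.
        prompt = tokenizer.apply_chat_template(
            [{"role": "user", "content": sample["question"]}],
            tokenize=False,
            add_generation_prompt=True,
        )
        # Field names below are assumptions; match them to the schema your
        # trtllm-bench version expects.
        f.write(json.dumps({"task_id": i, "prompt": prompt, "output_tokens": 256}) + "\n")
```

The resulting JSONL would then be passed to trtllm-bench as the benchmark dataset; check the tool's documentation for the exact fields it expects.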
@lfr-0531 That plan sounds good to me. It can still be useful for other use cases (@IzzyPutterman's draft/target speculative decoding should land soon; it'll be useful there since the draft models are a bit bigger than EAGLE).
Do we already have logic to make sure that the extra overlap pass doesn't run (and infringe on the verify pass)? I think I have an internal MR that does this from a while ago.
Something like this: #5211
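For illustration only, here is a toy sketch of the kind of guard being discussed: never keep an overlapped draft forward in flight past the last draft step, so it cannot run into the target model's verify pass. Every name below is a placeholder, not a TensorRT-LLM API, and this is not the actual change in #5211:

```python
# Toy, runnable stand-in: a ThreadPoolExecutor plays the role of an async
# stream, and the "forwards" are fake functions. Only the guard logic matters.
from concurrent.futures import ThreadPoolExecutor

gpu = ThreadPoolExecutor(max_workers=1)          # stand-in for an async stream

def draft_forward(step):                         # fake draft-model forward
    return [f"tok{step}"]

def verify_forward(draft_tokens):                # fake target-model verify
    return f"verified {len(draft_tokens)} draft tokens"

def run_iteration(max_draft_len=3):
    draft_tokens, in_flight = [], None
    for step in range(max_draft_len):
        handle = gpu.submit(draft_forward, step)   # launch this draft step
        if in_flight is not None:
            draft_tokens += in_flight.result()     # overlapped with `handle`
        # Guard: only keep the forward in flight if another draft step follows.
        if step + 1 < max_draft_len:
            in_flight = handle
        else:
            draft_tokens += handle.result()        # drain before verifying
    # The verify pass starts only after all draft work has finished.
    return verify_forward(draft_tokens)

print(run_iteration())
```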
We didn't add it in this PR. This PR only overlaps the different draft forwards within the same iteration. After we have PR-5211, the last iteration, including the …
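To make the intended gain concrete, here is a toy, self-contained timing comparison (pure Python stand-ins, not TensorRT-LLM code): the serial loop synchronizes after every draft forward, while the overlapped loop processes the previous step's outputs while the next draft forward is already running:

```python
# Toy contrast of the scheduling pattern: serial draft loop vs. overlapping
# host-side output processing with the next draft forward.
import time
from concurrent.futures import ThreadPoolExecutor

gpu = ThreadPoolExecutor(max_workers=1)  # stand-in for an async CUDA stream

def draft_forward(step):
    time.sleep(0.010)                    # simulated GPU time for one draft forward
    return f"logits_{step}"

def process_outputs(logits):
    time.sleep(0.005)                    # simulated host-side sampling/bookkeeping

def draft_loop_serial(num_steps):
    for step in range(num_steps):
        logits = gpu.submit(draft_forward, step).result()  # sync every step
        process_outputs(logits)                            # GPU idles here

def draft_loop_overlapped(num_steps):
    pending = None
    for step in range(num_steps):
        handle = gpu.submit(draft_forward, step)  # launch the next forward first
        if pending is not None:
            process_outputs(pending.result())     # overlaps with `handle`
        pending = handle
    process_outputs(pending.result())             # drain the final step

for loop in (draft_loop_serial, draft_loop_overlapped):
    start = time.perf_counter()
    loop(8)
    print(f"{loop.__name__}: {time.perf_counter() - start:.3f}s")
```

With these made-up costs the overlapped loop hides most of the host-side processing behind the GPU work, which is the effect this PR targets for the real draft forwards.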
Hi @lfr-0531, Thanks!
Description
I'll collect more performance and accuracy data and update the results here.