feat: Parallel prefill (#846) #991
@@ -69,6 +69,12 @@ def parse_vllm_args(service_name, prefix) -> AsyncEngineArgs:
         default=3,
         help="Maximum queue size for remote prefill. If the prefill queue size is greater than this value, prefill phase of the incoming request will be executed locally.",
     )
+    parser.add_argument(
+        "--max-batched-prefill-tokens",
+        type=int,
+        default=2048,
+        help="Maximum number of tokens to prefill in a single batch. If the number of tokens to prefill is greater than this value, prefill phase will execute as bs=1.",
+    )
Comment on lines +74 to +79
Contributor
Is this name/description accurate? With the NATS queue, we can only guarantee that we stop dequeuing once the number of batched prefill tokens exceeds this value. In other words, the actual number of batched prefill tokens can always end up larger than this value.
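To make the concern concrete, here is a minimal sketch of a token-budgeted dequeue loop, assuming a blocking `queue.dequeue()` on the NATS-backed prefill queue and a `request.prompt_token_ids` field; these names are illustrative, not the PR's actual API:

```python
def collect_prefill_batch(queue, max_batched_prefill_tokens: int):
    """Dequeue prefill requests until the token budget is reached.

    Because a request's size is only known after it has been dequeued,
    the request that trips the stop condition is already in the batch,
    so the batch can exceed the budget.
    """
    batch, num_tokens = [], 0
    while num_tokens < max_batched_prefill_tokens:
        request = queue.dequeue()   # hypothetical blocking dequeue from the prefill queue
        if request is None:         # queue drained
            break
        batch.append(request)       # once dequeued, it has to be sent to the engine
        num_tokens += len(request.prompt_token_ids)
    # num_tokens can end up > max_batched_prefill_tokens here: the flag acts as a
    # stop-dequeuing threshold rather than a hard cap on batched prefill tokens.
    return batch
```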
Contributor (Author)
Alternatively, this could be a hard max, where we dequeue one extra request (the one we find exceeds the max), hold it back from the current batch, and carry it over to the next batch.
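A rough sketch of that carry-over variant, using the same illustrative queue/request names as above (not the PR's implementation):

```python
def collect_prefill_batch_carryover(queue, max_batched_prefill_tokens: int, carried=None):
    """Variant where the flag is a hard max: the request that would exceed it
    is held back and prepended to the next batch."""
    batch, num_tokens, carry_over = [], 0, None
    if carried is not None:                    # request held back from the previous batch
        batch.append(carried)
        num_tokens += len(carried.prompt_token_ids)
    while num_tokens < max_batched_prefill_tokens:
        request = queue.dequeue()
        if request is None:
            break
        request_tokens = len(request.prompt_token_ids)
        if batch and num_tokens + request_tokens > max_batched_prefill_tokens:
            carry_over = request               # would exceed the max: defer to the next batch
            break
        batch.append(request)
        num_tokens += request_tokens
    return batch, carry_over
```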
Contributor
I don't think that's a good idea. We should either send whatever we dequeue to the engine immediately, or not dequeue at all and let other prefill workers pick it up. How about naming it …
     parser = AsyncEngineArgs.add_cli_args(parser)
     args = parser.parse_args(vllm_args)
     engine_args = AsyncEngineArgs.from_cli_args(args)

@@ -78,4 +84,5 @@ def parse_vllm_args(service_name, prefix) -> AsyncEngineArgs:
     engine_args.conditional_disagg = args.conditional_disagg
     engine_args.max_local_prefill_length = args.max_local_prefill_length
     engine_args.max_prefill_queue_size = args.max_prefill_queue_size
+    engine_args.max_batched_prefill_tokens = args.max_batched_prefill_tokens
     return engine_args
nit: How about defining this inside the PrefillWorker class, so that if we have a batched generate in the future we can swap it in directly?

Hmm. That would mean we'd also need to bring the PrefillWorker.generate call into the prefill queue, which might get a bit weird.
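For illustration only, the nit above might look roughly like the following; PrefillWorker, its constructor, and engine.prefill are assumed names based on the discussion, not the repo's actual classes or calls:

```python
class PrefillWorker:
    """Hypothetical sketch: keep the batching knob on the worker itself."""

    def __init__(self, engine, max_batched_prefill_tokens: int = 2048):
        self.engine = engine
        # Kept on the worker so a future batched generate() can read it directly.
        self.max_batched_prefill_tokens = max_batched_prefill_tokens

    async def generate(self, batch):
        # Today this would run request-by-request; a future batched prefill path
        # could consult self.max_batched_prefill_tokens here instead of an engine arg.
        for request in batch:
            await self.engine.prefill(request)   # placeholder engine call
```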