feat: Parallel prefill (#846) #991
base: main
Conversation
Some unfortunate findings on the benchmarks: with the current approach, the batched prefills have a TTFT 1-30% longer than the prior approach. So far, I've only tested with genai-perf at ISL 1000, OSL 256, and concurrencies 5, 10, and 20.
@jthomson04 We need prefill to be short and lagging (i.e., unable to catch up with incoming requests) to be able to see the improvement. I would suggest trying a shorter ISL and more requests, e.g., ISL=250 with num_req=100-500 dumped all at t=0 (you can achieve this by setting conc=num_requests in GAP). Also, I recommend turning off conditional remote prefill for benchmarking to make sure the prefill engine is doing all the prefill.
Updated benchmarks show a significant reduction in TTFT, as well as a slight increase in output token throughput with the batched prefills. ISL=256, OSL=128, concurrency 500, conditional remote prefill disabled.
Walkthrough
Batch processing for prefill requests was introduced, with new logic to dequeue and process multiple requests at once, constrained by a configurable maximum token count. Supporting methods and configuration parameters were added or updated across the codebase to enable this batching, including changes to queue handling, argument parsing, and configuration files.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant PrefillQueue
    participant PrefillWorker
    Client->>PrefillQueue: Enqueue prefill requests
    loop Batch Processing
        PrefillWorker->>PrefillQueue: dequeue_prefill_request_batch(max_tokens, block_size)
        PrefillQueue-->>PrefillWorker: Batch of requests (<= max_tokens)
        PrefillWorker->>PrefillWorker: For each request, create generator
        PrefillWorker->>PrefillWorker: wrap_generator for each generator
        PrefillWorker->>PrefillWorker: asyncio.gather(all futures)
    end
    PrefillWorker-->>Client: Processed results
```
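To make the flow above concrete, here is a minimal Python sketch of the worker loop the diagram describes. `dequeue_prefill_request_batch` and `wrap_generator` come from this PR; `engine.generate_prefill`, the polling interval, and the loop structure are illustrative assumptions, not the actual implementation.

```python
import asyncio


async def wrap_generator(generator):
    """Drain an async generator, discarding the yielded values."""
    async for _ in generator:
        pass


async def prefill_batch_loop(prefill_queue, engine, max_batched_prefill_tokens, block_size):
    """Dequeue a token-bounded batch of prefill requests and run them concurrently."""
    while True:
        reqs = await prefill_queue.dequeue_prefill_request_batch(
            max_batched_prefill_tokens, block_size
        )
        if not reqs:
            await asyncio.sleep(0.01)  # queue empty; try again shortly
            continue
        # One generator per request (engine.generate_prefill is a hypothetical name);
        # gathering them lets the engine schedule the prefills together.
        generators = [engine.generate_prefill(req) for req in reqs]
        await asyncio.gather(*(wrap_generator(gen) for gen in generators))
```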
Actionable comments posted: 0
♻️ Duplicate comments (2)
examples/llm/utils/vllm.py (1)
72-77: Address past review feedback on parameter naming.

Based on the previous discussion, the parameter name `--max-batched-prefill-tokens` may be misleading since the actual behavior allows exceeding this value. The help text also mentions "If the number of tokens to prefill is greater than this value, prefill phase will execute as bs=1", but this doesn't accurately reflect the batching logic implemented.

Consider renaming to `--min-batched-prefill-tokens` and updating the help text to clarify that this is a threshold for batch formation, not a hard maximum.

```diff
-    parser.add_argument(
-        "--max-batched-prefill-tokens",
-        type=int,
-        default=2048,
-        help="Maximum number of tokens to prefill in a single batch. If the number of tokens to prefill is greater than this value, prefill phase will execute as bs=1.",
-    )
+    parser.add_argument(
+        "--min-batched-prefill-tokens",
+        type=int,
+        default=2048,
+        help="Minimum token threshold for batching prefill requests. Batches are formed until this threshold is exceeded.",
+    )
```

🧰 Tools
🪛 Pylint (3.3.7)
[convention] 76-76: Line too long (169/100)
(C0301)
examples/llm/utils/prefill_queue.py (1)
54-60: Remove unnecessary else clause to reduce nesting.

The else clause is unnecessary after the return statement and can be removed for better readability.

```diff
         encoded_request = await self.dequeue_task(timeout)
         if encoded_request is not None:
             prefill_request = msgspec.json.decode(
                 encoded_request, type=RemotePrefillRequest
             )
             return prefill_request
-        else:
-            return None
+        return None
```

🧰 Tools
🪛 Pylint (3.3.7)
[refactor] 54-60: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
🧹 Nitpick comments (6)
examples/llm/utils/vllm.py (1)
76-76: Fix line length violation.

The help text exceeds the 100-character limit as indicated by pylint.

```diff
-        help="Maximum number of tokens to prefill in a single batch. If the number of tokens to prefill is greater than this value, prefill phase will execute as bs=1.",
+        help="Maximum number of tokens to prefill in a single batch. "
+        "If exceeded, prefill phase will execute as bs=1.",
```

🧰 Tools
🪛 Pylint (3.3.7)
[convention] 76-76: Line too long (169/100)
(C0301)
examples/llm/components/prefill_worker.py (2)
42-48: Improve code quality and address static analysis hints.

The `wrap_generator` function has several quality issues that should be addressed:
- Missing docstring
- Using `__anext__()` instead of the `anext()` builtin function

```diff
-async def wrap_generator(generator):
-    while True:
-        try:
-            await generator.__anext__()
-        except StopAsyncIteration:
-            break
+async def wrap_generator(generator):
+    """Fully consume an async generator by iterating through all values."""
+    while True:
+        try:
+            await anext(generator)
+        except StopAsyncIteration:
+            break
```

Additionally, consider the previous suggestion to define this as a method within the `PrefillWorker` class for better encapsulation, especially if batched generation methods are added in the future.

🧰 Tools
🪛 Pylint (3.3.7)
[convention] 42-42: Missing function or method docstring
(C0116)
[convention] 45-45: Unnecessarily calls dunder method __anext__. Use anext built-in function.
(C2801)
169-169: Use lazy formatting in logging statement.

The logging statement should use lazy % formatting for better performance.

```diff
-        logger.debug(f"Running batch of {len(reqs)} prefill requests")
+        logger.debug("Running batch of %d prefill requests", len(reqs))
```

🧰 Tools
🪛 Pylint (3.3.7)
[warning] 169-169: Use lazy % formatting in logging functions
(W1203)
examples/llm/utils/prefill_queue.py (3)
50-53: Add docstring for the updated method.

The optional timeout parameter is a good addition for flexibility. However, the method is missing a docstring to document the new parameter.

```diff
     async def dequeue_prefill_request(
         self, timeout: Optional[float] = None
     ) -> Optional[RemotePrefillRequest]:
+        """
+        Dequeue a single prefill request from the queue.
+
+        Args:
+            timeout: Optional timeout for the dequeue operation. If None, uses default timeout.
+
+        Returns:
+            A RemotePrefillRequest if available, None otherwise.
+        """
```

🧰 Tools
🪛 Pylint (3.3.7)
[convention] 50-50: Missing function or method docstring
(C0116)
62-106: Add comprehensive docstring and consider edge cases.

The batch dequeue logic is well-implemented and correctly handles overflow requests. However, the method lacks documentation and there are a few considerations:
- Missing docstring explaining the batching strategy
- The TODO comment suggests potential performance optimization opportunities

```diff
     async def dequeue_prefill_request_batch(
         self, max_batched_prefill_tokens: int, block_size: int
     ) -> Optional[List[RemotePrefillRequest]]:
+        """
+        Dequeue a batch of prefill requests constrained by token budget.
+
+        This method accumulates requests until the total token count would exceed
+        max_batched_prefill_tokens. Requests that don't fit are stored in self.pending
+        for the next batch.
+
+        Args:
+            max_batched_prefill_tokens: Maximum total tokens allowed in a batch
+            block_size: Size of each cached block for token calculation
+
+        Returns:
+            A list of RemotePrefillRequest objects, or None if no requests available
+        """
```

The batching logic correctly:
- Prioritizes pending requests from previous batches
- Handles single large requests that exceed the token limit
- Prevents request dropping by storing overflow in `self.pending`
- Uses efficient token calculation based on new tokens only
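A rough, method-shaped sketch of that batching strategy (the signature matches the review diff above; `self._new_token_count`, the timeout value, and the control flow are hypothetical placeholders rather than the PR's actual code):

```python
from typing import List, Optional


async def dequeue_prefill_request_batch(
    self, max_batched_prefill_tokens: int, block_size: int
) -> Optional[List["RemotePrefillRequest"]]:
    """Accumulate requests until the token budget would be exceeded; park the overflow."""
    batch: List["RemotePrefillRequest"] = []
    total_tokens = 0
    # A request parked by the previous call is served first.
    if self.pending is not None:
        batch.append(self.pending)
        total_tokens += self._new_token_count(self.pending, block_size)  # hypothetical helper
        self.pending = None
    while total_tokens < max_batched_prefill_tokens:
        req = await self.dequeue_prefill_request(timeout=0.1)
        if req is None:
            break
        tokens = self._new_token_count(req, block_size)  # counts only non-cached tokens
        if batch and total_tokens + tokens > max_batched_prefill_tokens:
            # Does not fit: keep it for the next batch rather than dropping it.
            self.pending = req
            break
        batch.append(req)
        total_tokens += tokens
    return batch or None
```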
🧰 Tools
🪛 Pylint (3.3.7)
[warning] 87-87: TODO: We might want to double-buffer this process
(W0511)
[convention] 62-62: Missing function or method docstring
(C0116)
87-87: Consider addressing the TODO for production readiness.

The TODO comment suggests double-buffering to reduce NATS dequeue overhead. While not critical for functionality, this could be important for production performance.
Do you want me to help design a double-buffering strategy or open an issue to track this performance optimization?
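One possible shape for that double-buffering, purely as an illustration (only `dequeue_prefill_request_batch` is from this PR; the generator wrapper and polling interval are assumptions):

```python
import asyncio


async def double_buffered_batches(prefill_queue, max_tokens, block_size):
    """Prefetch the next batch from NATS while the current batch is being processed."""
    next_batch = asyncio.create_task(
        prefill_queue.dequeue_prefill_request_batch(max_tokens, block_size)
    )
    while True:
        batch = await next_batch
        # Kick off the next dequeue immediately so NATS round-trips overlap with prefill compute.
        next_batch = asyncio.create_task(
            prefill_queue.dequeue_prefill_request_batch(max_tokens, block_size)
        )
        if batch:
            yield batch
        else:
            await asyncio.sleep(0.01)  # nothing queued yet
```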
🧰 Tools
🪛 Pylint (3.3.7)
[warning] 87-87: TODO: We might want to double-buffer this process
(W0511)
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- examples/llm/components/prefill_worker.py (2 hunks)
- examples/llm/configs/disagg.yaml (1 hunks)
- examples/llm/configs/disagg_router.yaml (1 hunks)
- examples/llm/utils/prefill_queue.py (2 hunks)
- examples/llm/utils/vllm.py (2 hunks)
🧰 Additional context used
🪛 Pylint (3.3.7)
examples/llm/utils/prefill_queue.py
[warning] 87-87: TODO: We might want to double-buffer this process
(W0511)
[convention] 44-44: Missing function or method docstring
(C0116)
[convention] 50-50: Missing function or method docstring
(C0116)
[refactor] 54-60: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
[convention] 62-62: Missing function or method docstring
(C0116)
examples/llm/components/prefill_worker.py
[convention] 42-42: Missing function or method docstring
(C0116)
[convention] 45-45: Unnecessarily calls dunder method anext. Use anext built-in function.
(C2801)
[warning] 169-169: Use lazy % formatting in logging functions
(W1203)
examples/llm/utils/vllm.py
[convention] 76-76: Line too long (169/100)
(C0301)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build and Test - vllm
🔇 Additional comments (5)
examples/llm/configs/disagg_router.yaml (1)
50-50: Ensure configuration parameter naming consistency.

If the parameter name is changed from `max-batched-prefill-tokens` to `min-batched-prefill-tokens` based on the discussion in vllm.py, this configuration file will need to be updated accordingly.

The configuration value of 2048 is consistent with the default in the argument parser. Monitor for any parameter name changes that would require updating this configuration key.
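For reference, the relevant line in the config would look roughly like this (the surrounding section name is assumed, not copied from the repo):

```yaml
# examples/llm/configs/disagg_router.yaml (excerpt, structure assumed)
PrefillWorker:
  max-batched-prefill-tokens: 2048  # would become min-batched-prefill-tokens if renamed
```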
examples/llm/configs/disagg.yaml (1)
43-43: Configuration parameter is consistent across deployment configs.

The addition of `max-batched-prefill-tokens: 2048` matches the configuration in `disagg_router.yaml` and the default value in the argument parser, ensuring consistency across different deployment scenarios.

Note: If the parameter name is updated based on previous feedback (to `min-batched-prefill-tokens`), both configuration files will need to be updated accordingly.

examples/llm/components/prefill_worker.py (1)
163-174: Batch processing implementation looks correct.

The implementation successfully addresses the requirements by:
- Using the refactored `dequeue_prefill_request_batch` method from PrefillQueue
- Processing requests concurrently using `asyncio.gather`
- Respecting the configurable token limit
The batching logic has been properly moved to the PrefillQueue as suggested in previous reviews, and the concurrent processing should improve throughput as intended.
🧰 Tools
🪛 Pylint (3.3.7)
[warning] 169-169: Use lazy % formatting in logging functions
(W1203)
examples/llm/utils/prefill_queue.py (2)
17-17: LGTM: Import addition is appropriate.

The addition of `List` to the typing imports is necessary for the new batch functionality.
42-42: LGTM: Pending request tracking added.

The `pending` instance variable correctly handles overflow requests that cannot fit in the current batch, ensuring no requests are dropped.
This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
That example has moved. @jthomson04 Can this be closed? Or maybe re-targeted to wherever this is now?
Oh boy. I haven't thought about this MR in a long time... We can probably close this. |
Overview:
Enable parallel prefill for disagg serving. The updated process dequeues multiple requests until a token threshold is reached; all requests in that batch are then completed before a new set of requests is dequeued.
Summary by CodeRabbit
New Features
Improvements