feat: Parallel prefill (#846) #991
base: main
Conversation
Some unfortunate findings on the benchmarks: with the current approach, the batched prefills have a TTFT 1-30% longer than the prior approach. So far, I've only tested with genai-perf at ISL 1000, OSL 256, and concurrencies 5, 10, and 20.
@jthomson04 We need prefill to be short and lagging (i.e., unable to catch up with incoming requests) to be able to see the improvement. I would suggest trying a shorter ISL and more requests, e.g., ISL=250 with num_req=100-500 dumped all at t=0 (you can achieve this by setting conc=num_requests in GAP). Also, I recommend turning off conditional remote prefill for benchmarking to make sure the prefill engine is doing all the prefill.
Updated benchmarks show a significant reduction in TTFT, as well as a slight increase in output token throughput with the batched prefills. ISL=256, OSL=128, concurrency 500, conditional remote prefill disabled.
Walkthrough
Batch processing for prefill requests was introduced, with new logic to dequeue and process multiple requests at once, constrained by a configurable maximum token count. Supporting methods and configuration parameters were added or updated across the codebase to enable this batching, including changes to queue handling, argument parsing, and configuration files.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant PrefillQueue
    participant PrefillWorker
    Client->>PrefillQueue: Enqueue prefill requests
    loop Batch Processing
        PrefillWorker->>PrefillQueue: dequeue_prefill_request_batch(max_tokens, block_size)
        PrefillQueue-->>PrefillWorker: Batch of requests (<= max_tokens)
        PrefillWorker->>PrefillWorker: For each request, create generator
        PrefillWorker->>PrefillWorker: wrap_generator for each generator
        PrefillWorker->>PrefillWorker: asyncio.gather(all futures)
    end
    PrefillWorker-->>Client: Processed results
```
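To make the flow above concrete, here is a minimal Python sketch of the worker loop the diagram describes. `dequeue_prefill_request_batch` and `wrap_generator` come from this PR; `engine.generate_prefill`, the polling interval, and the loop structure are illustrative assumptions, not the actual implementation.

```python
import asyncio


async def wrap_generator(generator):
    """Drain an async generator, discarding the yielded values."""
    async for _ in generator:
        pass


async def prefill_batch_loop(prefill_queue, engine, max_batched_prefill_tokens, block_size):
    """Dequeue a token-bounded batch of prefill requests and run them concurrently."""
    while True:
        reqs = await prefill_queue.dequeue_prefill_request_batch(
            max_batched_prefill_tokens, block_size
        )
        if not reqs:
            await asyncio.sleep(0.01)  # queue empty; try again shortly
            continue
        # One generator per request (engine.generate_prefill is a hypothetical name);
        # gathering them lets the engine schedule the prefills together.
        generators = [engine.generate_prefill(req) for req in reqs]
        await asyncio.gather(*(wrap_generator(gen) for gen in generators))
```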
Actionable comments posted: 0
♻️ Duplicate comments (2)
examples/llm/utils/vllm.py (1)
72-77: Address past review feedback on parameter naming.

Based on the previous discussion, the parameter name `--max-batched-prefill-tokens` may be misleading since the actual behavior allows exceeding this value. The help text also mentions "If the number of tokens to prefill is greater than this value, prefill phase will execute as bs=1", but this doesn't accurately reflect the batching logic implemented.

Consider renaming to `--min-batched-prefill-tokens` and updating the help text to clarify that this is a threshold for batch formation, not a hard maximum.

```diff
-    parser.add_argument(
-        "--max-batched-prefill-tokens",
-        type=int,
-        default=2048,
-        help="Maximum number of tokens to prefill in a single batch. If the number of tokens to prefill is greater than this value, prefill phase will execute as bs=1.",
-    )
+    parser.add_argument(
+        "--min-batched-prefill-tokens",
+        type=int,
+        default=2048,
+        help="Minimum token threshold for batching prefill requests. Batches are formed until this threshold is exceeded.",
+    )
```

🧰 Tools
🪛 Pylint (3.3.7)
[convention] 76-76: Line too long (169/100)
(C0301)
examples/llm/utils/prefill_queue.py (1)
54-60: Remove unnecessary else clause to reduce nesting.

The else clause is unnecessary after the return statement and can be removed for better readability.

```diff
         encoded_request = await self.dequeue_task(timeout)
         if encoded_request is not None:
             prefill_request = msgspec.json.decode(
                 encoded_request, type=RemotePrefillRequest
             )
             return prefill_request
-        else:
-            return None
+        return None
```

🧰 Tools
🪛 Pylint (3.3.7)
[refactor] 54-60: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
🧹 Nitpick comments (6)
examples/llm/utils/vllm.py (1)
76-76: Fix line length violation.

The help text exceeds the 100-character limit as indicated by pylint.

```diff
-        help="Maximum number of tokens to prefill in a single batch. If the number of tokens to prefill is greater than this value, prefill phase will execute as bs=1.",
+        help="Maximum number of tokens to prefill in a single batch. "
+        "If exceeded, prefill phase will execute as bs=1.",
```

🧰 Tools
🪛 Pylint (3.3.7)
[convention] 76-76: Line too long (169/100)
(C0301)
examples/llm/components/prefill_worker.py (2)
42-48: Improve code quality and address static analysis hints.

The `wrap_generator` function has several quality issues that should be addressed:
- Missing docstring
- Using `__anext__()` instead of the `anext()` builtin function

```diff
-async def wrap_generator(generator):
-    while True:
-        try:
-            await generator.__anext__()
-        except StopAsyncIteration:
-            break
+async def wrap_generator(generator):
+    """Fully consume an async generator by iterating through all values."""
+    while True:
+        try:
+            await anext(generator)
+        except StopAsyncIteration:
+            break
```

Additionally, consider the previous suggestion to define this as a method within the `PrefillWorker` class for better encapsulation, especially if batched generation methods are added in the future.

🧰 Tools
🪛 Pylint (3.3.7)
[convention] 42-42: Missing function or method docstring
(C0116)
[convention] 45-45: Unnecessarily calls dunder method __anext__. Use anext built-in function.
(C2801)
169-169: Use lazy formatting in logging statement.

The logging statement should use lazy % formatting for better performance.

```diff
-        logger.debug(f"Running batch of {len(reqs)} prefill requests")
+        logger.debug("Running batch of %d prefill requests", len(reqs))
```

🧰 Tools
🪛 Pylint (3.3.7)
[warning] 169-169: Use lazy % formatting in logging functions
(W1203)
examples/llm/utils/prefill_queue.py (3)
50-53: Add docstring for the updated method.

The optional timeout parameter is a good addition for flexibility. However, the method is missing a docstring to document the new parameter.

```diff
     async def dequeue_prefill_request(
         self, timeout: Optional[float] = None
     ) -> Optional[RemotePrefillRequest]:
+        """
+        Dequeue a single prefill request from the queue.
+
+        Args:
+            timeout: Optional timeout for the dequeue operation. If None, uses default timeout.
+
+        Returns:
+            A RemotePrefillRequest if available, None otherwise.
+        """
```

🧰 Tools
🪛 Pylint (3.3.7)
[convention] 50-50: Missing function or method docstring
(C0116)
62-106: Add comprehensive docstring and consider edge cases.

The batch dequeue logic is well-implemented and correctly handles overflow requests. However, the method lacks documentation and there are a few considerations:
- Missing docstring explaining the batching strategy
- The TODO comment suggests potential performance optimization opportunities

```diff
     async def dequeue_prefill_request_batch(
         self, max_batched_prefill_tokens: int, block_size: int
     ) -> Optional[List[RemotePrefillRequest]]:
+        """
+        Dequeue a batch of prefill requests constrained by token budget.
+
+        This method accumulates requests until the total token count would exceed
+        max_batched_prefill_tokens. Requests that don't fit are stored in self.pending
+        for the next batch.
+
+        Args:
+            max_batched_prefill_tokens: Maximum total tokens allowed in a batch
+            block_size: Size of each cached block for token calculation
+
+        Returns:
+            A list of RemotePrefillRequest objects, or None if no requests available
+        """
```

The batching logic correctly:
- Prioritizes pending requests from previous batches
- Handles single large requests that exceed the token limit
- Prevents request dropping by storing overflow in `self.pending`
- Uses efficient token calculation based on new tokens only
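A rough, method-shaped sketch of that batching strategy (the signature matches the review diff above; `self._new_token_count`, the timeout value, and the control flow are hypothetical placeholders rather than the PR's actual code):

```python
from typing import List, Optional


async def dequeue_prefill_request_batch(
    self, max_batched_prefill_tokens: int, block_size: int
) -> Optional[List["RemotePrefillRequest"]]:
    """Accumulate requests until the token budget would be exceeded; park the overflow."""
    batch: List["RemotePrefillRequest"] = []
    total_tokens = 0
    # A request parked by the previous call is served first.
    if self.pending is not None:
        batch.append(self.pending)
        total_tokens += self._new_token_count(self.pending, block_size)  # hypothetical helper
        self.pending = None
    while total_tokens < max_batched_prefill_tokens:
        req = await self.dequeue_prefill_request(timeout=0.1)
        if req is None:
            break
        tokens = self._new_token_count(req, block_size)  # counts only non-cached tokens
        if batch and total_tokens + tokens > max_batched_prefill_tokens:
            # Does not fit: keep it for the next batch rather than dropping it.
            self.pending = req
            break
        batch.append(req)
        total_tokens += tokens
    return batch or None
```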
🧰 Tools
🪛 Pylint (3.3.7)
[warning] 87-87: TODO: We might want to double-buffer this process
(W0511)
[convention] 62-62: Missing function or method docstring
(C0116)
87-87: Consider addressing the TODO for production readiness.

The TODO comment suggests double-buffering to reduce NATS dequeue overhead. While not critical for functionality, this could be important for production performance.
Do you want me to help design a double-buffering strategy or open an issue to track this performance optimization?
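One possible shape for that double-buffering, purely as an illustration (only `dequeue_prefill_request_batch` is from this PR; the generator wrapper and polling interval are assumptions):

```python
import asyncio


async def double_buffered_batches(prefill_queue, max_tokens, block_size):
    """Prefetch the next batch from NATS while the current batch is being processed."""
    next_batch = asyncio.create_task(
        prefill_queue.dequeue_prefill_request_batch(max_tokens, block_size)
    )
    while True:
        batch = await next_batch
        # Kick off the next dequeue immediately so NATS round-trips overlap with prefill compute.
        next_batch = asyncio.create_task(
            prefill_queue.dequeue_prefill_request_batch(max_tokens, block_size)
        )
        if batch:
            yield batch
        else:
            await asyncio.sleep(0.01)  # nothing queued yet
```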
🧰 Tools
🪛 Pylint (3.3.7)
[warning] 87-87: TODO: We might want to double-buffer this process
(W0511)
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- examples/llm/components/prefill_worker.py (2 hunks)
- examples/llm/configs/disagg.yaml (1 hunks)
- examples/llm/configs/disagg_router.yaml (1 hunks)
- examples/llm/utils/prefill_queue.py (2 hunks)
- examples/llm/utils/vllm.py (2 hunks)
🧰 Additional context used
🪛 Pylint (3.3.7)
examples/llm/utils/prefill_queue.py
[warning] 87-87: TODO: We might want to double-buffer this process
(W0511)
[convention] 44-44: Missing function or method docstring
(C0116)
[convention] 50-50: Missing function or method docstring
(C0116)
[refactor] 54-60: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
[convention] 62-62: Missing function or method docstring
(C0116)
examples/llm/components/prefill_worker.py
[convention] 42-42: Missing function or method docstring
(C0116)
[convention] 45-45: Unnecessarily calls dunder method anext. Use anext built-in function.
(C2801)
[warning] 169-169: Use lazy % formatting in logging functions
(W1203)
examples/llm/utils/vllm.py
[convention] 76-76: Line too long (169/100)
(C0301)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build and Test - vllm
🔇 Additional comments (5)
examples/llm/configs/disagg_router.yaml (1)
50-50: Ensure configuration parameter naming consistency.

If the parameter name is changed from `max-batched-prefill-tokens` to `min-batched-prefill-tokens` based on the discussion in vllm.py, this configuration file will need to be updated accordingly.

The configuration value of 2048 is consistent with the default in the argument parser. Monitor for any parameter name changes that would require updating this configuration key.
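For reference, the relevant line in the config would look roughly like this (the surrounding section name is assumed, not copied from the repo):

```yaml
# examples/llm/configs/disagg_router.yaml (excerpt, structure assumed)
PrefillWorker:
  max-batched-prefill-tokens: 2048  # would become min-batched-prefill-tokens if renamed
```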
examples/llm/configs/disagg.yaml (1)
43-43: Configuration parameter is consistent across deployment configs.

The addition of `max-batched-prefill-tokens: 2048` matches the configuration in `disagg_router.yaml` and the default value in the argument parser, ensuring consistency across different deployment scenarios.

Note: If the parameter name is updated based on previous feedback (to `min-batched-prefill-tokens`), both configuration files will need to be updated accordingly.

examples/llm/components/prefill_worker.py (1)
163-174: Batch processing implementation looks correct.

The implementation successfully addresses the requirements by:
- Using the refactored `dequeue_prefill_request_batch` method from PrefillQueue
- Processing requests concurrently using `asyncio.gather`
- Respecting the configurable token limit
The batching logic has been properly moved to the PrefillQueue as suggested in previous reviews, and the concurrent processing should improve throughput as intended.
🧰 Tools
🪛 Pylint (3.3.7)
[warning] 169-169: Use lazy % formatting in logging functions
(W1203)
examples/llm/utils/prefill_queue.py (2)
17-17: LGTM: Import addition is appropriate.

The addition of `List` to the typing imports is necessary for the new batch functionality.
42-42: LGTM: Pending request tracking added.

The `pending` instance variable correctly handles overflow requests that cannot fit in the current batch, ensuring no requests are dropped.
This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
That example has moved. @jthomson04 Can this be closed? Or maybe re-targeted to wherever this is now?
Oh boy. I haven't thought about this MR in a long time... We can probably close this. |
Overview:
Enable parallel prefill for disagg serving. The updated process dequeues multiple requests until a token threshold is reached; all requests in that batch are then completed before a new set of requests is dequeued.
Summary by CodeRabbit
New Features
Improvements