
Conversation

@jthomson04
Contributor

@jthomson04 jthomson04 commented May 7, 2025

Overview:

Enable parallel prefill for disaggregated serving. The updated flow dequeues multiple requests until a token threshold is reached, then completes all of them before dequeuing the next set.
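In sketch form, the loop looks roughly like this (simplified; prefill_queue_handler, dequeue_prefill_request_batch, and wrap_generator are named in the change summary below, while generate_prefill is a placeholder for whatever creates the per-request prefill generator):

import asyncio

async def wrap_generator(generator):
    # Drain an async generator to completion; prefill yields nothing we need to keep here.
    async for _ in generator:
        pass

async def prefill_queue_handler(prefill_queue, generate_prefill, max_batched_prefill_tokens, block_size):
    # Simplified sketch of the batched prefill loop, not the actual worker code.
    while True:
        # Dequeue requests until the accumulated token count reaches the threshold
        # (or the queue is momentarily empty).
        batch = await prefill_queue.dequeue_prefill_request_batch(
            max_batched_prefill_tokens, block_size
        )
        if not batch:
            continue
        # Prefill every request in the batch concurrently; the next batch is only
        # dequeued after all of these complete.
        await asyncio.gather(*(wrap_generator(generate_prefill(req)) for req in batch))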

Summary by CodeRabbit

  • New Features

    • Introduced batch processing for prefill requests, allowing multiple requests to be handled together for improved efficiency.
    • Added a configurable parameter to set the maximum number of tokens processed in a prefill batch via command-line arguments and configuration files.
  • Improvements

    • Enhanced queue handling to support batch dequeuing of prefill requests based on token limits.
    • Added timeout support for dequeuing individual prefill requests.

@jthomson04 jthomson04 changed the title from "Parallel prefill" to "feat: Parallel prefill (#846)" May 7, 2025
Base automatically changed from hzhou/serve_cleanup to main May 7, 2025 23:35
@pull-request-size pull-request-size bot added size/L and removed size/M labels May 7, 2025
@jthomson04 jthomson04 marked this pull request as ready for review May 8, 2025 00:02
@jthomson04
Contributor Author

Some unfortunate findings on the benchmarks: with the current approach, batched prefills have a TTFT 1-30% longer than the prior approach. So far, I've only tested with genai-perf at ISL 1000, OSL 256, and concurrencies 5, 10, and 20.

@tedzhouhk
Contributor

Some unfortunate findings on the benchmarks: with the current approach, batched prefills have a TTFT 1-30% longer than the prior approach. So far, I've only tested with genai-perf at ISL 1000, OSL 256, and concurrencies 5, 10, and 20.

@jthomson04 We need prefill to be short and lagging (i.e., unable to catch up with incoming requests) to be able to see the improvement. I would suggest trying a shorter ISL and more requests, e.g., ISL=250 and num_req=100-500 dumped all at t=0 (you can achieve this by setting conc=num_requests in GAP). I also recommend turning off conditional remote prefill for benchmarking, to make sure the prefill engine is doing all the prefill work.

@jthomson04
Contributor Author

jthomson04 commented May 17, 2025

Updated benchmarks show a significant reduction in TTFT, as well as a slight increase in output token throughput with the batched prefills. ISL=256, OSL=128, concurrency 500, conditional remote prefill disabled.

@coderabbitai
Contributor

coderabbitai bot commented May 30, 2025

Walkthrough

Batch processing for prefill requests was introduced, with new logic to dequeue and process multiple requests at once, constrained by a configurable maximum token count. Supporting methods and configuration parameters were added or updated across the codebase to enable this batching, including changes to queue handling, argument parsing, and configuration files.

Changes

File(s) and change summary:

  • examples/llm/components/prefill_worker.py: Added async helper wrap_generator. Updated prefill_queue_handler to batch-dequeue and process prefill requests.
  • examples/llm/utils/prefill_queue.py: Added dequeue_prefill_request_batch for batched dequeuing; updated the single dequeue to accept a timeout; new pending logic.
  • examples/llm/utils/vllm.py: Added --max-batched-prefill-tokens CLI argument and support in parse_vllm_args and AsyncEngineArgs.
  • examples/llm/configs/disagg.yaml, examples/llm/configs/disagg_router.yaml: Added max-batched-prefill-tokens: 2048 to the PrefillWorker config sections.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant PrefillQueue
    participant PrefillWorker

    Client->>PrefillQueue: Enqueue prefill requests
    loop Batch Processing
        PrefillWorker->>PrefillQueue: dequeue_prefill_request_batch(max_tokens, block_size)
        PrefillQueue-->>PrefillWorker: Batch of requests (<= max_tokens)
        PrefillWorker->>PrefillWorker: For each request, create generator
        PrefillWorker->>PrefillWorker: wrap_generator for each generator
        PrefillWorker->>PrefillWorker: asyncio.gather(all futures)
    end
    PrefillWorker-->>Client: Processed results

Poem

In the meadow of code, requests now batch,
Prefill tokens counted, no single match.
With queues that gather, and futures that run,
Rabbits process together—more swiftly, more fun!
Configs now tuned for a token-rich spree,
Hopping through batches, as happy as can be! 🐇✨



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
examples/llm/utils/vllm.py (1)

72-77: Address past review feedback on parameter naming.

Based on the previous discussion, the parameter name --max-batched-prefill-tokens may be misleading since the actual behavior allows exceeding this value. The help text also mentions "If the number of tokens to prefill is greater than this value, prefill phase will execute as bs=1" but this doesn't accurately reflect the batching logic implemented.

Consider renaming to --min-batched-prefill-tokens and updating the help text to clarify that this is a threshold for batch formation, not a hard maximum.

-    parser.add_argument(
-        "--max-batched-prefill-tokens",
-        type=int,
-        default=2048,
-        help="Maximum number of tokens to prefill in a single batch. If the number of tokens to prefill is greater than this value, prefill phase will execute as bs=1.",
-    )
+    parser.add_argument(
+        "--min-batched-prefill-tokens", 
+        type=int,
+        default=2048,
+        help="Minimum token threshold for batching prefill requests. Batches are formed until this threshold is exceeded.",
+    )
🧰 Tools
🪛 Pylint (3.3.7)

[convention] 76-76: Line too long (169/100)

(C0301)

examples/llm/utils/prefill_queue.py (1)

54-60: Remove unnecessary else clause to reduce nesting.

The else clause after the return statement is unnecessary and can be removed to reduce nesting and improve readability.

     encoded_request = await self.dequeue_task(timeout)
     if encoded_request is not None:
         prefill_request = msgspec.json.decode(
             encoded_request, type=RemotePrefillRequest
         )
         return prefill_request
-    else:
-        return None
+    return None
🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 54-60: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

🧹 Nitpick comments (6)
examples/llm/utils/vllm.py (1)

76-76: Fix line length violation.

The help text exceeds the 100-character limit as indicated by pylint.

-        help="Maximum number of tokens to prefill in a single batch. If the number of tokens to prefill is greater than this value, prefill phase will execute as bs=1.",
+        help="Maximum number of tokens to prefill in a single batch. "
+             "If exceeded, prefill phase will execute as bs=1.",
🧰 Tools
🪛 Pylint (3.3.7)

[convention] 76-76: Line too long (169/100)

(C0301)

examples/llm/components/prefill_worker.py (2)

42-48: Improve code quality and address static analysis hints.

The wrap_generator function has several quality issues that should be addressed:

  1. Missing docstring
  2. Using __anext__() instead of the anext() builtin function
-async def wrap_generator(generator):
-    while True:
-        try:
-            await generator.__anext__()
-        except StopAsyncIteration:
-            break
+async def wrap_generator(generator):
+    """Fully consume an async generator by iterating through all values."""
+    while True:
+        try:
+            await anext(generator)
+        except StopAsyncIteration:
+            break

Additionally, consider the previous suggestion to define this as a method within the PrefillWorker class for better encapsulation, especially if batched generation methods are added in the future.

🧰 Tools
🪛 Pylint (3.3.7)

[convention] 42-42: Missing function or method docstring

(C0116)


[convention] 45-45: Unnecessarily calls dunder method anext. Use anext built-in function.

(C2801)


169-169: Use lazy formatting in logging statement.

The logging statement should use lazy % formatting for better performance.

-                    logger.debug(f"Running batch of {len(reqs)} prefill requests")
+                    logger.debug("Running batch of %d prefill requests", len(reqs))
🧰 Tools
🪛 Pylint (3.3.7)

[warning] 169-169: Use lazy % formatting in logging functions

(W1203)

examples/llm/utils/prefill_queue.py (3)

50-53: Add docstring for the updated method.

The optional timeout parameter is a good addition for flexibility. However, the method is missing a docstring to document the new parameter.

 async def dequeue_prefill_request(
     self, timeout: Optional[float] = None
 ) -> Optional[RemotePrefillRequest]:
+    """
+    Dequeue a single prefill request from the queue.
+    
+    Args:
+        timeout: Optional timeout for the dequeue operation. If None, uses default timeout.
+        
+    Returns:
+        A RemotePrefillRequest if available, None otherwise.
+    """
🧰 Tools
🪛 Pylint (3.3.7)

[convention] 50-50: Missing function or method docstring

(C0116)


62-106: Add comprehensive docstring and consider edge cases.

The batch dequeue logic is well-implemented and correctly handles overflow requests. However, the method lacks documentation and there are a few considerations:

  1. Missing docstring explaining the batching strategy
  2. The TODO comment suggests potential performance optimization opportunities
 async def dequeue_prefill_request_batch(
     self, max_batched_prefill_tokens: int, block_size: int
 ) -> Optional[List[RemotePrefillRequest]]:
+    """
+    Dequeue a batch of prefill requests constrained by token budget.
+    
+    This method accumulates requests until the total token count would exceed
+    max_batched_prefill_tokens. Requests that don't fit are stored in self.pending
+    for the next batch.
+    
+    Args:
+        max_batched_prefill_tokens: Maximum total tokens allowed in a batch
+        block_size: Size of each cached block for token calculation
+        
+    Returns:
+        A list of RemotePrefillRequest objects, or None if no requests available
+    """

The batching logic correctly:

  • Prioritizes pending requests from previous batches
  • Handles single large requests that exceed the token limit
  • Prevents request dropping by storing overflow in self.pending
  • Uses efficient token calculation based on new tokens only
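For readers without the diff open, the accumulation described above looks roughly like the following sketch (simplified; _new_token_count is a hypothetical helper standing in for the block-size-aware token accounting, and the timeout value is assumed):

async def dequeue_prefill_request_batch(self, max_batched_prefill_tokens, block_size):
    # Sketch only: accumulate requests until the token budget would be exceeded.
    batch, total_tokens = [], 0
    # Serve a request parked by the previous call first.
    if self.pending is not None:
        batch.append(self.pending)
        total_tokens += self._new_token_count(self.pending, block_size)  # hypothetical helper
        self.pending = None
    while total_tokens < max_batched_prefill_tokens:
        request = await self.dequeue_prefill_request(timeout=0.1)  # timeout value assumed
        if request is None:
            break
        request_tokens = self._new_token_count(request, block_size)
        # When the batch is empty, even an oversized request is accepted as a batch of one.
        if batch and total_tokens + request_tokens > max_batched_prefill_tokens:
            # Doesn't fit: park it for the next batch instead of dropping it.
            self.pending = request
            break
        batch.append(request)
        total_tokens += request_tokens
    return batch or None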
🧰 Tools
🪛 Pylint (3.3.7)

[warning] 87-87: TODO: We might want to double-buffer this process

(W0511)


[convention] 62-62: Missing function or method docstring

(C0116)


87-87: Consider addressing the TODO for production readiness.

The TODO comment suggests double-buffering to reduce NATS dequeue overhead. While not critical for functionality, this could be important for production performance.

Do you want me to help design a double-buffering strategy or open an issue to track this performance optimization?
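For context, one shape such a double-buffering scheme could take (purely illustrative; process_batch is a placeholder and none of this is part of the PR):

import asyncio

async def double_buffered_handler(prefill_queue, process_batch, max_batched_prefill_tokens, block_size):
    # Illustrative only: start dequeuing the next batch from NATS while the
    # current batch is still being prefilled.
    next_batch = asyncio.create_task(
        prefill_queue.dequeue_prefill_request_batch(max_batched_prefill_tokens, block_size)
    )
    while True:
        batch = await next_batch
        # Kick off the fetch of the following batch before running this one.
        next_batch = asyncio.create_task(
            prefill_queue.dequeue_prefill_request_batch(max_batched_prefill_tokens, block_size)
        )
        if batch:
            await process_batch(batch)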

🧰 Tools
🪛 Pylint (3.3.7)

[warning] 87-87: TODO: We might want to double-buffer this process

(W0511)

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6ea0830 and a7d3ef2.

📒 Files selected for processing (5)
  • examples/llm/components/prefill_worker.py (2 hunks)
  • examples/llm/configs/disagg.yaml (1 hunks)
  • examples/llm/configs/disagg_router.yaml (1 hunks)
  • examples/llm/utils/prefill_queue.py (2 hunks)
  • examples/llm/utils/vllm.py (2 hunks)
🧰 Additional context used
🪛 Pylint (3.3.7)
examples/llm/utils/prefill_queue.py

[warning] 87-87: TODO: We might want to double-buffer this process

(W0511)


[convention] 44-44: Missing function or method docstring

(C0116)


[convention] 50-50: Missing function or method docstring

(C0116)


[refactor] 54-60: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)


[convention] 62-62: Missing function or method docstring

(C0116)

examples/llm/components/prefill_worker.py

[convention] 42-42: Missing function or method docstring

(C0116)


[convention] 45-45: Unnecessarily calls dunder method anext. Use anext built-in function.

(C2801)


[warning] 169-169: Use lazy % formatting in logging functions

(W1203)

examples/llm/utils/vllm.py

[convention] 76-76: Line too long (169/100)

(C0301)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build and Test - vllm
🔇 Additional comments (5)
examples/llm/configs/disagg_router.yaml (1)

50-50: Ensure configuration parameter naming consistency.

If the parameter name is changed from max-batched-prefill-tokens to min-batched-prefill-tokens based on the discussion in vllm.py, this configuration file will need to be updated accordingly.

The configuration value of 2048 is consistent with the default in the argument parser. Monitor for any parameter name changes that would require updating this configuration key.

examples/llm/configs/disagg.yaml (1)

43-43: Configuration parameter is consistent across deployment configs.

The addition of max-batched-prefill-tokens: 2048 matches the configuration in disagg_router.yaml and the default value in the argument parser, ensuring consistency across different deployment scenarios.

Note: If the parameter name is updated based on previous feedback (to min-batched-prefill-tokens), both configuration files will need to be updated accordingly.

examples/llm/components/prefill_worker.py (1)

163-174: Batch processing implementation looks correct.

The implementation successfully addresses the requirements by:

  • Using the refactored dequeue_prefill_request_batch method from PrefillQueue
  • Processing requests concurrently using asyncio.gather
  • Respecting the configurable token limit

The batching logic has been properly moved to the PrefillQueue as suggested in previous reviews, and the concurrent processing should improve throughput as intended.

🧰 Tools
🪛 Pylint (3.3.7)

[warning] 169-169: Use lazy % formatting in logging functions

(W1203)

examples/llm/utils/prefill_queue.py (2)

17-17: LGTM: Import addition is appropriate.

The addition of List to the typing imports is necessary for the new batch functionality.


42-42: LGTM: Pending request tracking added.

The pending instance variable correctly handles overflow requests that cannot fit in the current batch, ensuring no requests are dropped.

@github-actions

This PR is stale because it has been open for 30 days with no activity. Remove the stale label or add a comment, or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Jun 30, 2025
@grahamking
Contributor

That example has moved. @jthomson04 Can this be closed? Or maybe re-targeted to wherever this is now?

@jthomson04
Contributor Author

Oh boy. I haven't thought about this MR in a long time... We can probably close this.
