@PeaBrane PeaBrane commented Jul 10, 2025

Overview:

Additionally:

  1. Guard find_best_matches so that only one request can run it at a time. This trades some performance for routing that is empirically closer to optimal.
  2. Set a threshold at which TimerManager rebuilds its binary heap when the heap grows too large (too many stale entries). This should not normally be needed unless the timer duration is set very long.
  3. Cosmetic cleanups

Summary by CodeRabbit

  • Documentation

    • Updated the user guide for the CLI tool to include a new optional argument for KV event handling, with clearer explanations and improved formatting for related options.
  • New Features

    • Added a configuration option to control whether the router listens to KV events or uses an approximate prediction method for cached blocks.
  • Improvements

    • Enhanced internal logic for managing cached block routing, including improved concurrency control and more efficient handling of stale entries.

No changes to public APIs outside of the new configuration option.

@PeaBrane PeaBrane requested a review from jthomson04 July 10, 2025 20:41
@PeaBrane PeaBrane marked this pull request as ready for review July 10, 2025 20:42

coderabbitai bot commented Jul 10, 2025

Walkthrough

The changes introduce a new use_kv_events flag to the Dynamo KV router, allowing users to choose between the original event-driven KvIndexer and a new approximate ApproxKvIndexer. The router and its configuration are updated to support this flag, with unified indexer handling, concurrency improvements, and documentation updates reflecting the new option.

Changes

| File(s) | Change Summary |
| --- | --- |
| docs/guides/dynamo_run.md | Updated documentation to describe the new --use-kv-events CLI argument, explain its behavior, and clarify related options. |
| launch/dynamo-run/src/flags.rs | Added the use_kv_events flag to CLI flags and passed it to router configuration. |
| lib/llm/src/discovery/model_manager.rs | Simplified logic to use the use_kv_events flag from config directly when constructing the router. |
| lib/llm/src/kv_router.rs | Added use_kv_events to KvRouterConfig, unified KvIndexer and ApproxKvIndexer under an Indexer enum, added mutex for concurrency, and updated router logic to select indexer based on the flag. |
| lib/llm/src/kv_router/approx.rs | Enhanced TimerManager with a threshold and heap rebuild logic, updated ApproxKvIndexer to use new timer manager signature, and adjusted tests accordingly. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI
    participant Flags
    participant RouterConfig
    participant KvRouter
    participant KvIndexer
    participant ApproxKvIndexer

    User->>CLI: Run with --use-kv-events=[true|false]
    CLI->>Flags: Parse arguments
    Flags->>RouterConfig: Pass use_kv_events flag
    RouterConfig->>KvRouter: Instantiate with config
    alt use_kv_events = true
        KvRouter->>KvIndexer: Create KvIndexer
    else use_kv_events = false
        KvRouter->>ApproxKvIndexer: Create ApproxKvIndexer
    end
    KvRouter->>KvRouter: Route requests using selected indexer


Poem

In the warren of code, a new flag appears,
Choose your indexer—let's all give three cheers!
With events or a guess, the router will know,
Which blocks are cached, where data should go.
🐇✨ Now the routing is clever and neat—
A hop, a skip, and requests can't be beat!



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
lib/llm/src/kv_router/approx.rs (1)

196-197: Consider making the threshold configurable.

While the fixed threshold of 50 is reasonable, consider making it configurable through the constructor parameters for better flexibility in different deployment scenarios.

lib/llm/src/kv_router.rs (2)

108-125: Clean abstraction over indexer implementations.

The Indexer enum effectively unifies the interface. Regarding the TODO comment: Rust doesn't auto-derive trait implementations for enums, but you could use a macro crate like enum_dispatch to automate this boilerplate.
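
For readers who have not opened the diff, the enum-based unification might look roughly like the sketch below. This is not the PR's code: the two indexer structs are simplified stand-ins, and the method names apply_event, record_routing_decision and overlap, along with the WorkerId and BlockHash aliases, are hypothetical names chosen for illustration. Only the Indexer enum, its two variants, and the use_kv_events flag come from the PR.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical aliases, used only in this sketch.
type WorkerId = u64;
type BlockHash = u64;

/// Stand-in for the event-driven indexer: state comes from worker KV events.
#[derive(Default)]
struct KvIndexer {
    cached: HashMap<WorkerId, HashSet<BlockHash>>,
}

impl KvIndexer {
    fn apply_event(&mut self, worker: WorkerId, block: BlockHash) {
        self.cached.entry(worker).or_default().insert(block);
    }

    fn overlap(&self, worker: WorkerId, blocks: &[BlockHash]) -> usize {
        count_overlap(self.cached.get(&worker), blocks)
    }
}

/// Stand-in for the approximate indexer: state is predicted from routing decisions.
#[derive(Default)]
struct ApproxKvIndexer {
    predicted: HashMap<WorkerId, HashSet<BlockHash>>,
}

impl ApproxKvIndexer {
    fn record_routing_decision(&mut self, worker: WorkerId, blocks: &[BlockHash]) {
        self.predicted
            .entry(worker)
            .or_default()
            .extend(blocks.iter().copied());
    }

    fn overlap(&self, worker: WorkerId, blocks: &[BlockHash]) -> usize {
        count_overlap(self.predicted.get(&worker), blocks)
    }
}

/// Counts how many of the requested blocks appear in the given set.
fn count_overlap(set: Option<&HashSet<BlockHash>>, blocks: &[BlockHash]) -> usize {
    match set {
        Some(s) => blocks.iter().filter(|b| s.contains(*b)).count(),
        None => 0,
    }
}

/// One enum over both implementations; each shared method is a manual match.
enum Indexer {
    KvIndexer(KvIndexer),
    ApproxKvIndexer(ApproxKvIndexer),
}

impl Indexer {
    /// Selects the implementation from the use_kv_events flag.
    fn new(use_kv_events: bool) -> Self {
        if use_kv_events {
            Indexer::KvIndexer(KvIndexer::default())
        } else {
            Indexer::ApproxKvIndexer(ApproxKvIndexer::default())
        }
    }

    fn overlap(&self, worker: WorkerId, blocks: &[BlockHash]) -> usize {
        match self {
            Indexer::KvIndexer(i) => i.overlap(worker, blocks),
            Indexer::ApproxKvIndexer(i) => i.overlap(worker, blocks),
        }
    }
}
```

The match arms in overlap are the boilerplate that a crate like enum_dispatch would generate; with only two variants and a handful of methods, writing them by hand is arguably just as readable.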


160-165: Consider making the TTL configurable.

The 120-second TTL for ApproxKvIndexer is hardcoded. Consider making this configurable through KvRouterConfig to allow tuning based on deployment characteristics.

Apply this diff to make TTL configurable:

 pub struct KvRouterConfig {
     pub overlap_score_weight: f64,
     pub router_temperature: f64,
     pub use_kv_events: bool,
+    pub approx_indexer_ttl_secs: u64,
     pub max_num_batched_tokens: u32,
 }

 impl Default for KvRouterConfig {
     fn default() -> Self {
         Self {
             overlap_score_weight: 1.0,
             router_temperature: 0.5,
             use_kv_events: true,
+            approx_indexer_ttl_secs: 120,
             max_num_batched_tokens: 8192,
         }
     }
 }

Then use it in the instantiation:

-                Duration::from_secs(120),
+                Duration::from_secs(config.approx_indexer_ttl_secs),
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8cd9411 and 5321a2a.

📒 Files selected for processing (5)
  • docs/guides/dynamo_run.md (2 hunks)
  • launch/dynamo-run/src/flags.rs (2 hunks)
  • lib/llm/src/discovery/model_manager.rs (1 hunks)
  • lib/llm/src/kv_router.rs (11 hunks)
  • lib/llm/src/kv_router/approx.rs (6 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.
docs/guides/dynamo_run.md (1)
Learnt from: nnshah1
PR: ai-dynamo/dynamo#1444
File: tests/fault_tolerance/utils/metrics.py:30-32
Timestamp: 2025-07-01T13:55:03.940Z
Learning: The `@dynamo_worker()` decorator in the dynamo codebase returns a wrapper that automatically injects the `runtime` parameter before calling the wrapped function. This means callers only need to provide the non-runtime parameters, while the decorator handles injecting the runtime argument automatically. For example, a function with signature `async def get_metrics(runtime, log_dir)` decorated with `@dynamo_worker()` can be called as `get_metrics(log_dir)` because the decorator wrapper injects the runtime parameter.
lib/llm/src/discovery/model_manager.rs (5)
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.
Learnt from: alec-flowers
PR: ai-dynamo/dynamo#1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions `create_stored_blocks` and `create_stored_block_from_parts` are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
Learnt from: jthomson04
PR: ai-dynamo/dynamo#1429
File: lib/runtime/src/utils/leader_worker_barrier.rs:69-72
Timestamp: 2025-06-08T03:12:03.985Z
Learning: In the leader-worker barrier implementation in lib/runtime/src/utils/leader_worker_barrier.rs, the `wait_for_key_count` function correctly uses exact equality (`==`) instead of greater-than-or-equal (`>=`) because worker IDs must be unique (enforced by etcd create-only operations), ensuring exactly the expected number of workers can register.
lib/llm/src/kv_router/approx.rs (1)
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
lib/llm/src/kv_router.rs (8)
Learnt from: alec-flowers
PR: ai-dynamo/dynamo#1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions `create_stored_blocks` and `create_stored_block_from_parts` are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1236
File: lib/llm/src/mocker/engine.rs:140-161
Timestamp: 2025-06-17T00:50:44.845Z
Learning: In Rust async code, when an Arc<Mutex<_>> is used solely to transfer ownership of a resource (like a channel receiver) into a spawned task rather than for sharing between multiple tasks, holding the mutex lock across an await is not problematic since there's no actual contention.
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
Learnt from: oandreeva-nv
PR: ai-dynamo/dynamo#1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets `None`. The `start_batching_publisher` function in `lib/llm/tests/block_manager.rs` demonstrates this pattern: when the `KVBMDynamoRuntimeComponent` is dropped, its `batch_tx` sender is dropped, causing `rx.recv()` to return `None`, which triggers cleanup and task termination.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Mirror Repository to GitLab
  • GitHub Check: Build and Test - vllm
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (.)
🔇 Additional comments (13)
launch/dynamo-run/src/flags.rs (2)

131-136: LGTM! Well-documented CLI flag addition.

The new use_kv_events flag is properly implemented with clear documentation explaining its purpose and default behavior.


224-224: Correctly passes the new flag to router configuration.

The use_kv_events flag is properly propagated to the KvRouterConfig constructor.

docs/guides/dynamo_run.md (2)

11-11: Documentation accurately reflects the new CLI option.

The usage example correctly shows the new --use-kv-events flag with its default value.


204-210: Excellent documentation of the KV routing options.

The reformatted bullet points improve readability, and the explanation of the --use-kv-events flag clearly describes when to use each indexer type.

lib/llm/src/discovery/model_manager.rs (2)

215-215: Appropriate use of clone for ownership transfer.

The change from borrowing to cloning is correct since DefaultWorkerSelector::new needs ownership of the config. The KvRouterConfig struct is small and derives Clone, making this an efficient operation.


220-223: Good simplification of the configuration logic.

Directly using the use_kv_events flag from the config is cleaner than the previous approach of deriving it from other fields.

lib/llm/src/kv_router/approx.rs (3)

84-97: Good addition of threshold mechanism for heap management.

The threshold field is well-documented and will help prevent unbounded growth of stale entries in the expiration heap.


100-108: Efficient heap rebuild implementation.

The rebuild_heap method correctly reconstructs the heap from the authoritative timers map, effectively removing all stale entries.


127-131: Smart threshold-based rebuild trigger.

The condition self.expirations.len() > self.timers.len() * self.threshold effectively triggers rebuilds when stale entries accumulate beyond the threshold multiplier.
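
As a rough illustration of the lazy-deletion pattern this condition protects, here is a minimal sketch. It is an assumption-laden simplification rather than the PR's TimerManager: keys and times are plain u64 ticks instead of real key types and Instant, and the ttl field and the insert and pop_expired method names are hypothetical. The timers, expirations and threshold fields and the rebuild_heap method follow the names mentioned in this review.

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap};

/// Simplified timer manager sketch (hypothetical; times are u64 ticks).
struct TimerManager {
    /// Authoritative map: key -> current expiry tick.
    timers: HashMap<u64, u64>,
    /// Min-heap of (expiry, key); may hold stale entries for refreshed keys.
    expirations: BinaryHeap<Reverse<(u64, u64)>>,
    /// Time-to-live in ticks (hypothetical field).
    ttl: u64,
    /// Rebuild once the heap holds more than threshold times the live timers.
    threshold: usize,
}

impl TimerManager {
    fn new(ttl: u64, threshold: usize) -> Self {
        TimerManager {
            timers: HashMap::new(),
            expirations: BinaryHeap::new(),
            ttl,
            threshold,
        }
    }

    /// Rebuilds the heap from the authoritative map, dropping stale entries.
    fn rebuild_heap(&mut self) {
        self.expirations = self
            .timers
            .iter()
            .map(|(&key, &expiry)| Reverse((expiry, key)))
            .collect();
    }

    /// Inserts or refreshes a timer. The old heap entry is left in place
    /// (lazy deletion), so refreshes make the heap grow.
    fn insert(&mut self, key: u64, now: u64) {
        let expiry = now + self.ttl;
        self.timers.insert(key, expiry);
        self.expirations.push(Reverse((expiry, key)));
        if self.expirations.len() > self.timers.len() * self.threshold {
            self.rebuild_heap();
        }
    }

    /// Pops every key whose authoritative expiry is at or before `now`.
    fn pop_expired(&mut self, now: u64) -> Vec<u64> {
        let mut expired = Vec::new();
        while let Some(&Reverse((expiry, key))) = self.expirations.peek() {
            if expiry > now {
                break;
            }
            self.expirations.pop();
            // A heap entry only counts if it matches the authoritative map;
            // otherwise it is a stale leftover from an earlier refresh.
            if self.timers.get(&key) == Some(&expiry) {
                self.timers.remove(&key);
                expired.push(key);
            }
        }
        expired
    }
}
```

With a threshold of 2, refreshing the same key three times pushes the heap to three entries against one live timer, which trips the rebuild and shrinks the heap back to one entry. With the threshold of 50 mentioned above, the same sequence would leave the stale entries in place until pop_expired discards them.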

lib/llm/src/kv_router.rs (4)

71-71: Well-structured configuration extension.

The use_kv_events field is properly integrated into the config struct with a sensible default value of true, maintaining backward compatibility.

Also applies to: 82-82, 94-101


137-139: Consider the performance implications of the mutex.

The mutex serializes all find_best_match calls. As your TODO suggests, benchmark whether making the subroutines synchronous would be more efficient than async with a mutex. This could be a bottleneck under high concurrent load.

Also applies to: 220-222
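
To make the tradeoff concrete, the sketch below shows why serializing selection can improve routing quality: when picking a worker and updating its state happen under one lock, every request sees the effect of the previous decision. This is a hypothetical simplification, not the router's code. It uses std::sync::Mutex and OS threads rather than an async mutex, a plain per-worker load counter rather than overlap scores, and the Router struct and route_concurrently helper are invented for illustration.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical router holding one load counter per worker.
struct Router {
    /// Guarded so that selection and update form one critical section.
    loads: Mutex<Vec<usize>>,
}

impl Router {
    fn new(num_workers: usize) -> Self {
        Router {
            loads: Mutex::new(vec![0; num_workers]),
        }
    }

    /// Picks the least-loaded worker and records the decision while still
    /// holding the lock, so no two requests decide on the same stale state.
    fn find_best_match(&self) -> usize {
        let mut loads = self.loads.lock().unwrap();
        let best = loads
            .iter()
            .enumerate()
            .min_by_key(|(_, load)| **load)
            .map(|(worker, _)| worker)
            .expect("router must have at least one worker");
        loads[best] += 1;
        best
    }
}

/// Issues `requests` routing calls from separate threads and collects the picks.
fn route_concurrently(router: Arc<Router>, requests: usize) -> Vec<usize> {
    let handles: Vec<_> = (0..requests)
        .map(|_| {
            let router = Arc::clone(&router);
            thread::spawn(move || router.find_best_match())
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

With 4 workers and 8 concurrent requests, the final loads come out evenly spread regardless of thread scheduling, because each decision is made against up-to-date state. The cost is that requests queue on the lock, which is the bottleneck the benchmark suggested above would quantify.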


227-228: Good refactoring and proper state management.

Using the compute_block_hash_for_seq helper function improves code reuse, and properly notifying ApproxKvIndexer about routing decisions is essential for maintaining its internal state.

Also applies to: 241-246


208-208: Correct mutex initialization.

The mutex is properly initialized for synchronization purposes.

Co-authored-by: Hongkuan Zhou <[email protected]>
Signed-off-by: Yan Ru Pei <[email protected]>
@PeaBrane PeaBrane enabled auto-merge (squash) July 10, 2025 22:20
@PeaBrane PeaBrane merged commit 13640e1 into main Jul 10, 2025
13 of 14 checks passed