-
Notifications
You must be signed in to change notification settings - Fork 767
feat: allow using ApproxKvIndexer for routing via use_kv_events flag #1869
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThe changes introduce a new Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant CLI
participant Flags
participant RouterConfig
participant KvRouter
participant KvIndexer
participant ApproxKvIndexer
User->>CLI: Run with --use-kv-events=[true|false]
CLI->>Flags: Parse arguments
Flags->>RouterConfig: Pass use_kv_events flag
RouterConfig->>KvRouter: Instantiate with config
alt use_kv_events = true
KvRouter->>KvIndexer: Create KvIndexer
else use_kv_events = false
KvRouter->>ApproxKvIndexer: Create ApproxKvIndexer
end
KvRouter->>KvRouter: Route requests using selected indexer
Possibly related PRs
Poem
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
lib/llm/src/kv_router/approx.rs (1)
196-197: Consider making the threshold configurable.While the fixed threshold of 50 is reasonable, consider making it configurable through the constructor parameters for better flexibility in different deployment scenarios.
lib/llm/src/kv_router.rs (2)
108-125: Clean abstraction over indexer implementations.The
Indexerenum effectively unifies the interface. Regarding the TODO comment: Rust doesn't auto-derive trait implementations for enums, but you could use a macro crate likeenum_dispatchto automate this boilerplate.
160-165: Consider making the TTL configurable.The 120-second TTL for
ApproxKvIndexeris hardcoded. Consider making this configurable throughKvRouterConfigto allow tuning based on deployment characteristics.Apply this diff to make TTL configurable:
pub struct KvRouterConfig { pub overlap_score_weight: f64, pub router_temperature: f64, pub use_kv_events: bool, + pub approx_indexer_ttl_secs: u64, pub max_num_batched_tokens: u32, } impl Default for KvRouterConfig { fn default() -> Self { Self { overlap_score_weight: 1.0, router_temperature: 0.5, use_kv_events: true, + approx_indexer_ttl_secs: 120, max_num_batched_tokens: 8192, } } }Then use it in the instantiation:
- Duration::from_secs(120), + Duration::from_secs(config.approx_indexer_ttl_secs),
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
docs/guides/dynamo_run.md(2 hunks)launch/dynamo-run/src/flags.rs(2 hunks)lib/llm/src/discovery/model_manager.rs(1 hunks)lib/llm/src/kv_router.rs(11 hunks)lib/llm/src/kv_router/approx.rs(6 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.
docs/guides/dynamo_run.md (1)
Learnt from: nnshah1
PR: ai-dynamo/dynamo#1444
File: tests/fault_tolerance/utils/metrics.py:30-32
Timestamp: 2025-07-01T13:55:03.940Z
Learning: The `@dynamo_worker()` decorator in the dynamo codebase returns a wrapper that automatically injects the `runtime` parameter before calling the wrapped function. This means callers only need to provide the non-runtime parameters, while the decorator handles injecting the runtime argument automatically. For example, a function with signature `async def get_metrics(runtime, log_dir)` decorated with `@dynamo_worker()` can be called as `get_metrics(log_dir)` because the decorator wrapper injects the runtime parameter.
lib/llm/src/discovery/model_manager.rs (5)
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.
Learnt from: alec-flowers
PR: ai-dynamo/dynamo#1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions `create_stored_blocks` and `create_stored_block_from_parts` are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
Learnt from: jthomson04
PR: ai-dynamo/dynamo#1429
File: lib/runtime/src/utils/leader_worker_barrier.rs:69-72
Timestamp: 2025-06-08T03:12:03.985Z
Learning: In the leader-worker barrier implementation in lib/runtime/src/utils/leader_worker_barrier.rs, the `wait_for_key_count` function correctly uses exact equality (`==`) instead of greater-than-or-equal (`>=`) because worker IDs must be unique (enforced by etcd create-only operations), ensuring exactly the expected number of workers can register.
lib/llm/src/kv_router/approx.rs (1)
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
lib/llm/src/kv_router.rs (8)
Learnt from: alec-flowers
PR: ai-dynamo/dynamo#1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions `create_stored_blocks` and `create_stored_block_from_parts` are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1236
File: lib/llm/src/mocker/engine.rs:140-161
Timestamp: 2025-06-17T00:50:44.845Z
Learning: In Rust async code, when an Arc<Mutex<_>> is used solely to transfer ownership of a resource (like a channel receiver) into a spawned task rather than for sharing between multiple tasks, holding the mutex lock across an await is not problematic since there's no actual contention.
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
Learnt from: oandreeva-nv
PR: ai-dynamo/dynamo#1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets `None`. The `start_batching_publisher` function in `lib/llm/tests/block_manager.rs` demonstrates this pattern: when the `KVBMDynamoRuntimeComponent` is dropped, its `batch_tx` sender is dropped, causing `rx.recv()` to return `None`, which triggers cleanup and task termination.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Mirror Repository to GitLab
- GitHub Check: Build and Test - vllm
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (.)
🔇 Additional comments (13)
launch/dynamo-run/src/flags.rs (2)
131-136: LGTM! Well-documented CLI flag addition.The new
use_kv_eventsflag is properly implemented with clear documentation explaining its purpose and default behavior.
224-224: Correctly passes the new flag to router configuration.The
use_kv_eventsflag is properly propagated to theKvRouterConfigconstructor.docs/guides/dynamo_run.md (2)
11-11: Documentation accurately reflects the new CLI option.The usage example correctly shows the new
--use-kv-eventsflag with its default value.
204-210: Excellent documentation of the KV routing options.The reformatted bullet points improve readability, and the explanation of the
--use-kv-eventsflag clearly describes when to use each indexer type.lib/llm/src/discovery/model_manager.rs (2)
215-215: Appropriate use of clone for ownership transfer.The change from borrowing to cloning is correct since
DefaultWorkerSelector::newneeds ownership of the config. TheKvRouterConfigstruct is small and derivesClone, making this an efficient operation.
220-223: Good simplification of the configuration logic.Directly using the
use_kv_eventsflag from the config is cleaner than the previous approach of deriving it from other fields.lib/llm/src/kv_router/approx.rs (3)
84-97: Good addition of threshold mechanism for heap management.The threshold field is well-documented and will help prevent unbounded growth of stale entries in the expiration heap.
100-108: Efficient heap rebuild implementation.The
rebuild_heapmethod correctly reconstructs the heap from the authoritativetimersmap, effectively removing all stale entries.
127-131: Smart threshold-based rebuild trigger.The condition
self.expirations.len() > self.timers.len() * self.thresholdeffectively triggers rebuilds when stale entries accumulate beyond the threshold multiplier.lib/llm/src/kv_router.rs (4)
71-71: Well-structured configuration extension.The
use_kv_eventsfield is properly integrated into the config struct with a sensible default value of true, maintaining backward compatibility.Also applies to: 82-82, 94-101
137-139: Consider the performance implications of the mutex.The mutex serializes all
find_best_matchcalls. As your TODO suggests, benchmark whether making the subroutines synchronous would be more efficient than async with a mutex. This could be a bottleneck under high concurrent load.Also applies to: 220-222
227-228: Good refactoring and proper state management.Using the
compute_block_hash_for_seqhelper function improves code reuse, and properly notifyingApproxKvIndexerabout routing decisions is essential for maintaining its internal state.Also applies to: 241-246
208-208: Correct mutex initialization.The mutex is properly initialized for synchronization purposes.
Co-authored-by: Hongkuan Zhou <[email protected]> Signed-off-by: Yan Ru Pei <[email protected]>
Overview:
Additionally:
find_best_matchessuch that only one request can run it at a time. Performance tradeoff for more optimal routing empirically.TimerManagerto rebuild the binary heap when it gets too large (too many stale entries). Should not be normally needed unless the time duration is set very longSummary by CodeRabbit
Documentation
New Features
Improvements
No changes to public APIs outside of the new configuration option.