-
Notifications
You must be signed in to change notification settings - Fork 738
fix: KV-Router: degrade to empty overlaps when indexer offline #4428
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
👋 Hi vladnosiv! Thank you for contributing to ai-dynamo/dynamo. Just a reminder: The 🚀 |
WalkthroughChanges to the KV router indexer error handling replace panic-prone operations and error propagation with graceful degradation—returning empty results with warning logs when the indexer is offline or requests fail. Tests added to verify empty OverlapScores are returned when offline. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes
Poem
Pre-merge checks❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
lib/llm/src/kv_router/indexer.rs(3 hunks)
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2840
File: lib/llm/src/kv_router/sequence.rs:86-88
Timestamp: 2025-09-03T19:31:32.621Z
Learning: PeaBrane chose to defer fixing the corner case where a single late-arriving request might never expire in the ActiveSequences expiry mechanism (lib/llm/src/kv_router/sequence.rs). They prefer to avoid adding a background loop for periodic cleanup at this time, accepting the technical debt to keep the current PR scope contained.
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3095
File: lib/llm/src/kv_router/indexer.rs:0-0
Timestamp: 2025-09-17T20:55:06.333Z
Learning: When PeaBrane encounters a complex implementation issue that would significantly expand PR scope (like the remove_worker_sender method in lib/llm/src/kv_router/indexer.rs that required thread-safe map updates and proper shard targeting), they prefer to remove the problematic implementation entirely rather than rush a partial fix, deferring the proper solution to a future PR.
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3077
File: lib/llm/src/kv_router/subscriber.rs:334-336
Timestamp: 2025-09-17T01:00:50.937Z
Learning: PeaBrane identified that reordering tokio::select! arms in the indexer (moving dump_rx.recv() to be after event_rx.recv()) creates a natural barrier that ensures RouterEvents are always processed before dump requests, solving the ack-before-commit race condition. This leverages the existing biased directive and requires minimal code changes, aligning with their preference for contained solutions.
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2756
File: tests/router/test_router_e2e_with_mockers.py:961-974
Timestamp: 2025-08-29T09:53:45.266Z
Learning: Indexer dumps in the KV router system are designed to never contain remove or clear events - they only contain "stored" events. Therefore, code that processes indexer dump events can safely assume the presence of event["event"]["data"]["stored"] structure without additional error handling for other event types.
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2756
File: lib/bindings/python/rust/llm/kv.rs:401-436
Timestamp: 2025-08-29T10:08:18.434Z
Learning: In the Python KvIndexer bindings (lib/bindings/python/rust/llm/kv.rs), the hardcoded reset_states=true parameter passed to start_kv_router_background is intentional behavior, not an oversight that needs to be made configurable.
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.
📚 Learning: 2025-05-30T06:38:09.630Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.
Applied to files:
lib/llm/src/kv_router/indexer.rs
📚 Learning: 2025-06-05T01:02:15.318Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.
Applied to files:
lib/llm/src/kv_router/indexer.rs
📚 Learning: 2025-10-14T00:58:05.744Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3597
File: lib/llm/src/kv_router/indexer.rs:437-441
Timestamp: 2025-10-14T00:58:05.744Z
Learning: In lib/llm/src/kv_router/indexer.rs, when a KvCacheEventData::Cleared event is received, the system intentionally clears all dp_ranks for the given worker_id by calling clear_all_blocks(worker.worker_id). This is the desired behavior and should not be scoped to individual dp_ranks.
Applied to files:
lib/llm/src/kv_router/indexer.rs
📚 Learning: 2025-09-17T01:00:50.937Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3077
File: lib/llm/src/kv_router/subscriber.rs:334-336
Timestamp: 2025-09-17T01:00:50.937Z
Learning: PeaBrane identified that reordering tokio::select! arms in the indexer (moving dump_rx.recv() to be after event_rx.recv()) creates a natural barrier that ensures RouterEvents are always processed before dump requests, solving the ack-before-commit race condition. This leverages the existing biased directive and requires minimal code changes, aligning with their preference for contained solutions.
Applied to files:
lib/llm/src/kv_router/indexer.rs
📚 Learning: 2025-05-29T00:02:35.018Z
Learnt from: alec-flowers
Repo: ai-dynamo/dynamo PR: 1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions `create_stored_blocks` and `create_stored_block_from_parts` are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.
Applied to files:
lib/llm/src/kv_router/indexer.rs
📚 Learning: 2025-09-17T20:55:06.333Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3095
File: lib/llm/src/kv_router/indexer.rs:0-0
Timestamp: 2025-09-17T20:55:06.333Z
Learning: When PeaBrane encounters a complex implementation issue that would significantly expand PR scope (like the remove_worker_sender method in lib/llm/src/kv_router/indexer.rs that required thread-safe map updates and proper shard targeting), they prefer to remove the problematic implementation entirely rather than rush a partial fix, deferring the proper solution to a future PR.
Applied to files:
lib/llm/src/kv_router/indexer.rs
📚 Learning: 2025-09-11T03:24:47.820Z
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3004
File: lib/runtime/src/pipeline/network/ingress/push_handler.rs:271-277
Timestamp: 2025-09-11T03:24:47.820Z
Learning: In lib/runtime/src/pipeline/network/ingress/push_handler.rs, the maintainer prefers to keep the existing error comparison logic using format!("{:?}", err) == STREAM_ERR_MSG unchanged until proper error types are implemented, even though it has technical debt. Avoid suggesting changes to working legacy code that will be refactored later.
Applied to files:
lib/llm/src/kv_router/indexer.rs
🧬 Code graph analysis (1)
lib/llm/src/kv_router/indexer.rs (3)
lib/bindings/python/src/dynamo/_core.pyi (8)
OverlapScores(483-508)scores(490-497)metrics(88-95)metrics(117-124)metrics(160-167)CancellationToken(60-71)KvIndexer(583-619)frequencies(500-508)lib/llm/src/kv_router.rs (3)
new(134-155)new(221-314)new(486-491)lib/llm/src/kv_router/approx.rs (3)
new(124-132)new(268-435)sequence(469-473)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: Build and Test - dynamo
- GitHub Check: tests (lib/bindings/python)
- GitHub Check: tests (lib/runtime/examples)
- GitHub Check: clippy (lib/bindings/python)
- GitHub Check: clippy (launch/dynamo-run)
- GitHub Check: tests (launch/dynamo-run)
- GitHub Check: tests (.)
- GitHub Check: clippy (.)
🔇 Additional comments (4)
lib/llm/src/kv_router/indexer.rs (4)
996-1002: LGTM! Graceful degradation aligns with best-effort semantics.The change correctly handles indexer unavailability by returning empty overlap scores with a warning instead of propagating an error. This aligns with the PR objective that choosing an optimal instance is a best-effort operation and should not cause frontend failures.
1004-1013: LGTM! Consistent graceful degradation for response failures.The receive failure handling is consistent with the send failure handling above. Both cases now degrade gracefully by returning empty scores when the indexer is offline, preventing cascading failures to the frontend.
1234-1244: LGTM! Sharded implementation consistent with non-sharded version.The broadcast failure handling is correctly applied to the sharded indexer, maintaining consistency with the non-sharded
KvIndexerimplementation. When the broadcast fails (all shards offline or channel closed), returning empty scores immediately is the right approach.
2159-2187: LGTM! Good test coverage for the new behavior.The tests correctly verify that both
KvIndexerandKvIndexerShardedreturn emptyOverlapScores(with emptyscores,frequencies, andtree_sizes) when the indexer is offline. This provides good coverage of the graceful degradation behavior.
|
@PeaBrane Could you take a look? |
|
/ok to test 2c4c8d8 |
bd86ea3 to
27fe75c
Compare
Signed-off-by: Vladislav Nosivskoy <[email protected]>
Signed-off-by: Vladislav Nosivskoy <[email protected]>
Signed-off-by: Vladislav Nosivskoy <[email protected]>
Signed-off-by: Vladislav Nosivskoy <[email protected]>
Signed-off-by: Vladislav Nosivskoy <[email protected]>
Signed-off-by: Vladislav Nosivskoy <[email protected]>
27fe75c to
a17a2cf
Compare
Signed-off-by: Vladislav Nosivskoy <[email protected]>
|
Hey, @PeaBrane ! |
Gentle bump @nnshah1 @rmccorm4. |
|
/ok to test f7c2cdc |
@rmccorm4, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/ |
|
@vladnosiv After discussing with the team, our understanding is that the indexer should only go offline if it panics. Some panics are recoverable (e.g., invalid KV events), while others may be unrecoverable—though it's not yet clear what would fall into that category. For recoverable panics, we should not shut down the KV indexer; for truly unrecoverable ones, we should. What are the scenarios of recoverable and unrecoverable panics you have in mind (or you have encountered)? In general, we should aim to handle panics gracefully rather than shutting the indexer down indiscriminately. What do you think? |
|
@PeaBrane I completely agree with those points. However, I’d like to note that they are somewhat orthogonal to the proposal in this PR. It seems to me that this is a game from two sides. On the one hand, as you wrote, an indexer should recover from his problems. On the other hand, as reflected in this PR, the frontend as a whole, using the indexer as a component, should be tolerant of its fall (it seems to me). The incident that prompted these changes demonstrates exactly this: the indexer crashed (due to an issue that was unrecoverable via automatic restarts on that code version). This caused a complete service unavailable, even though the failing component was non-critical cause the frontend can continue processing requests, albeit with a lower cache hit rate. Without this fix, such an incident leads to downtime for the Dynamo consumer service, lasting simply as long as it takes the on-call engineer to respond, disable the KV-indexing env flag, and redeploy the frontend. So, my stance remains that the system must be tolerant for an indexer failure. I am certainly not suggesting we let it crash on purpose, and I am fully in favor of improving its reliability as well. |
|
/ok to test 1980391 |
Overview:
Now, when indexer is unavailable (for example, due to a bug here #4394) frontend doesn't process any requests, ending them with a status of 500.
Since determining the optimal instance is a best-effort operation, this is not expected.
Details:
As part of this PR, if the indexer is unavailable, the behavior degrades to the case when nothing is cached.
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Partially Resolves: #4394
Summary by CodeRabbit
Bug Fixes
Tests