fix: KV-Router: degrade to empty overlaps when indexer offline #4428

vladnosiv · 2025-11-18T11:32:47Z

Overview:

Now, when indexer is unavailable (for example, due to a bug here #4394) frontend doesn't process any requests, ending them with a status of 500.

Since determining the optimal instance is a best-effort operation, this is not expected.

Details:

As part of this PR, if the indexer is unavailable, the behavior degrades to the case when nothing is cached.

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Partially Resolves: #4394

Summary by CodeRabbit

Bug Fixes
- Improved system stability by gracefully handling offline indexer components; empty results now returned instead of errors when indexer becomes unavailable.
Tests
- Added test coverage to verify consistent error handling when indexer components are offline or disconnected.

copy-pr-bot · 2025-11-18T11:32:53Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2025-11-18T11:32:56Z

👋 Hi vladnosiv! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors.Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

coderabbitai · 2025-11-18T11:37:12Z

Walkthrough

Changes to the KV router indexer error handling replace panic-prone operations and error propagation with graceful degradation—returning empty results with warning logs when the indexer is offline or requests fail. Tests added to verify empty OverlapScores are returned when offline.

Changes

Cohort / File(s)	Summary
KV Indexer error handling & tests `lib/llm/src/kv_router/indexer.rs`	Updated error handling to log warnings and return empty OverlapScores instead of propagating errors when match requests fail, responses fail to await, or broadcast operations fail. Added tests verifying KvIndexer and KvIndexerSharded return empty OverlapScores after shutdown (offline state).

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Verify the error handling change from panic/error propagation to graceful empty-result returns aligns with system reliability goals
Confirm test coverage adequately validates offline behavior for both single and sharded indexer variants
Review warning log messages for clarity and debuggability

Poem

A rabbit once crashed when blocks went awry,
With RefCells borrowed and panics so nigh.
But warnings and grace saved the day with a bound,
Now empty results, not crashes, abound! 🐰✨

Pre-merge checks

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 71.43% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately and concisely summarizes the main change: KV-Router degrades gracefully by returning empty overlaps when the indexer is offline instead of propagating errors.
Description check	✅ Passed	The description follows the template with all required sections (Overview, Details, Related Issues) and provides clear context about the problem, solution, and related issue.
Linked Issues check	✅ Passed	The PR partially addresses issue #4394 by implementing graceful degradation when the indexer is offline (falling back to empty overlaps), which directly supports the objective of maintaining frontend availability even when KV routing is unavailable.
Out of Scope Changes check	✅ Passed	All changes are scoped to the indexer offline behavior: error handling modifications, control flow adjustments for broadcast failures, and added tests for offline scenarios. These are directly related to the PR objective and linked issue requirements.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fc2ad4e and ac027b9.

📒 Files selected for processing (1)

lib/llm/src/kv_router/indexer.rs (3 hunks)

🧰 Additional context used

🧠 Learnings (8)

📓 Common learnings

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2840
File: lib/llm/src/kv_router/sequence.rs:86-88
Timestamp: 2025-09-03T19:31:32.621Z
Learning: PeaBrane chose to defer fixing the corner case where a single late-arriving request might never expire in the ActiveSequences expiry mechanism (lib/llm/src/kv_router/sequence.rs). They prefer to avoid adding a background loop for periodic cleanup at this time, accepting the technical debt to keep the current PR scope contained.

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3095
File: lib/llm/src/kv_router/indexer.rs:0-0
Timestamp: 2025-09-17T20:55:06.333Z
Learning: When PeaBrane encounters a complex implementation issue that would significantly expand PR scope (like the remove_worker_sender method in lib/llm/src/kv_router/indexer.rs that required thread-safe map updates and proper shard targeting), they prefer to remove the problematic implementation entirely rather than rush a partial fix, deferring the proper solution to a future PR.

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3077
File: lib/llm/src/kv_router/subscriber.rs:334-336
Timestamp: 2025-09-17T01:00:50.937Z
Learning: PeaBrane identified that reordering tokio::select! arms in the indexer (moving dump_rx.recv() to be after event_rx.recv()) creates a natural barrier that ensures RouterEvents are always processed before dump requests, solving the ack-before-commit race condition. This leverages the existing biased directive and requires minimal code changes, aligning with their preference for contained solutions.

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2756
File: tests/router/test_router_e2e_with_mockers.py:961-974
Timestamp: 2025-08-29T09:53:45.266Z
Learning: Indexer dumps in the KV router system are designed to never contain remove or clear events - they only contain "stored" events. Therefore, code that processes indexer dump events can safely assume the presence of event["event"]["data"]["stored"] structure without additional error handling for other event types.

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2756
File: lib/bindings/python/rust/llm/kv.rs:401-436
Timestamp: 2025-08-29T10:08:18.434Z
Learning: In the Python KvIndexer bindings (lib/bindings/python/rust/llm/kv.rs), the hardcoded reset_states=true parameter passed to start_kv_router_background is intentional behavior, not an oversight that needs to be made configurable.

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.

📚 Learning: 2025-05-30T06:38:09.630Z

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.

Applied to files:

lib/llm/src/kv_router/indexer.rs

📚 Learning: 2025-06-05T01:02:15.318Z

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.

Applied to files:

lib/llm/src/kv_router/indexer.rs

📚 Learning: 2025-10-14T00:58:05.744Z

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3597
File: lib/llm/src/kv_router/indexer.rs:437-441
Timestamp: 2025-10-14T00:58:05.744Z
Learning: In lib/llm/src/kv_router/indexer.rs, when a KvCacheEventData::Cleared event is received, the system intentionally clears all dp_ranks for the given worker_id by calling clear_all_blocks(worker.worker_id). This is the desired behavior and should not be scoped to individual dp_ranks.

Applied to files:

lib/llm/src/kv_router/indexer.rs

📚 Learning: 2025-09-17T01:00:50.937Z

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3077
File: lib/llm/src/kv_router/subscriber.rs:334-336
Timestamp: 2025-09-17T01:00:50.937Z
Learning: PeaBrane identified that reordering tokio::select! arms in the indexer (moving dump_rx.recv() to be after event_rx.recv()) creates a natural barrier that ensures RouterEvents are always processed before dump requests, solving the ack-before-commit race condition. This leverages the existing biased directive and requires minimal code changes, aligning with their preference for contained solutions.

Applied to files:

lib/llm/src/kv_router/indexer.rs

📚 Learning: 2025-05-29T00:02:35.018Z

Learnt from: alec-flowers
Repo: ai-dynamo/dynamo PR: 1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions `create_stored_blocks` and `create_stored_block_from_parts` are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.

Applied to files:

lib/llm/src/kv_router/indexer.rs

📚 Learning: 2025-09-17T20:55:06.333Z

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3095
File: lib/llm/src/kv_router/indexer.rs:0-0
Timestamp: 2025-09-17T20:55:06.333Z
Learning: When PeaBrane encounters a complex implementation issue that would significantly expand PR scope (like the remove_worker_sender method in lib/llm/src/kv_router/indexer.rs that required thread-safe map updates and proper shard targeting), they prefer to remove the problematic implementation entirely rather than rush a partial fix, deferring the proper solution to a future PR.

Applied to files:

lib/llm/src/kv_router/indexer.rs

📚 Learning: 2025-09-11T03:24:47.820Z

Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3004
File: lib/runtime/src/pipeline/network/ingress/push_handler.rs:271-277
Timestamp: 2025-09-11T03:24:47.820Z
Learning: In lib/runtime/src/pipeline/network/ingress/push_handler.rs, the maintainer prefers to keep the existing error comparison logic using format!("{:?}", err) == STREAM_ERR_MSG unchanged until proper error types are implemented, even though it has technical debt. Avoid suggesting changes to working legacy code that will be refactored later.

Applied to files:

lib/llm/src/kv_router/indexer.rs

🧬 Code graph analysis (1)

lib/llm/src/kv_router/indexer.rs (3)

lib/bindings/python/src/dynamo/_core.pyi (8)

OverlapScores (483-508)

scores (490-497)

metrics (88-95)

metrics (117-124)

metrics (160-167)

CancellationToken (60-71)

KvIndexer (583-619)

frequencies (500-508)

lib/llm/src/kv_router.rs (3)

new (134-155)

new (221-314)

new (486-491)

lib/llm/src/kv_router/approx.rs (3)

new (124-132)

new (268-435)

sequence (469-473)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)

GitHub Check: Build and Test - dynamo
GitHub Check: tests (lib/bindings/python)
GitHub Check: tests (lib/runtime/examples)
GitHub Check: clippy (lib/bindings/python)
GitHub Check: clippy (launch/dynamo-run)
GitHub Check: tests (launch/dynamo-run)
GitHub Check: tests (.)
GitHub Check: clippy (.)

🔇 Additional comments (4)

lib/llm/src/kv_router/indexer.rs (4)

996-1002: LGTM! Graceful degradation aligns with best-effort semantics.

The change correctly handles indexer unavailability by returning empty overlap scores with a warning instead of propagating an error. This aligns with the PR objective that choosing an optimal instance is a best-effort operation and should not cause frontend failures.

1004-1013: LGTM! Consistent graceful degradation for response failures.

The receive failure handling is consistent with the send failure handling above. Both cases now degrade gracefully by returning empty scores when the indexer is offline, preventing cascading failures to the frontend.

1234-1244: LGTM! Sharded implementation consistent with non-sharded version.

The broadcast failure handling is correctly applied to the sharded indexer, maintaining consistency with the non-sharded KvIndexer implementation. When the broadcast fails (all shards offline or channel closed), returning empty scores immediately is the right approach.

2159-2187: LGTM! Good test coverage for the new behavior.

The tests correctly verify that both KvIndexer and KvIndexerSharded return empty OverlapScores (with empty scores, frequencies, and tree_sizes) when the indexer is offline. This provides good coverage of the graceful degradation behavior.

lib/llm/src/kv_router/indexer.rs

grahamking · 2025-11-18T16:23:05Z

@PeaBrane Could you take a look?

grahamking · 2025-11-18T16:23:56Z

/ok to test 2c4c8d8

lib/llm/src/kv_router/approx.rs

lib/llm/src/kv_router/indexer.rs

Signed-off-by: Vladislav Nosivskoy <[email protected]>

vladnosiv · 2025-12-01T15:58:11Z

Hey, @PeaBrane !
Did you get a chance to sync with the team about whether we still want the frontend to tolerate indexer failures instead of panicking? I’ve rebased onto the latest main just in case.

PeaBrane · 2025-12-01T17:57:05Z

@nnshah1 @rmccorm4 how do you guys feel about falling back to round robin if the indexer fails. My concern here is that this would be a bit too "silent" for the user, and they may keep using round robin unintentionally without restarting the router or digging into the root cause

vladnosiv · 2025-12-05T16:50:10Z

@nnshah1 @rmccorm4 how do you guys feel about falling back to round robin if the indexer fails. My concern here is that this would be a bit too "silent" for the user, and they may keep using round robin unintentionally without restarting the router or digging into the root cause

Gentle bump @nnshah1 @rmccorm4.
Regarding the concern about the fallback being too "silent": perhaps we could introduce (in this pr) a specific metric that increments whenever the fallback triggers? I can also update the documentation to recommend setting up an alert for this metric.
Let me know if this approach sounds reasonable to you!

rmccorm4 · 2025-12-09T18:09:50Z

/ok to test f7c2cdc

copy-pr-bot · 2025-12-09T18:09:53Z

/ok to test f7c2cdc

@rmccorm4, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

PeaBrane · 2025-12-09T20:46:33Z

@vladnosiv After discussing with the team, our understanding is that the indexer should only go offline if it panics. Some panics are recoverable (e.g., invalid KV events), while others may be unrecoverable—though it's not yet clear what would fall into that category. For recoverable panics, we should not shut down the KV indexer; for truly unrecoverable ones, we should.

What are the scenarios of recoverable and unrecoverable panics you have in mind (or you have encountered)? In general, we should aim to handle panics gracefully rather than shutting the indexer down indiscriminately. What do you think?

vladnosiv · 2025-12-09T21:22:53Z

@PeaBrane I completely agree with those points. However, I’d like to note that they are somewhat orthogonal to the proposal in this PR.

It seems to me that this is a game from two sides. On the one hand, as you wrote, an indexer should recover from his problems. On the other hand, as reflected in this PR, the frontend as a whole, using the indexer as a component, should be tolerant of its fall (it seems to me).

The incident that prompted these changes demonstrates exactly this: the indexer crashed (due to an issue that was unrecoverable via automatic restarts on that code version). This caused a complete service unavailable, even though the failing component was non-critical cause the frontend can continue processing requests, albeit with a lower cache hit rate.

Without this fix, such an incident leads to downtime for the Dynamo consumer service, lasting simply as long as it takes the on-call engineer to respond, disable the KV-indexing env flag, and redeploy the frontend.

So, my stance remains that the system must be tolerant for an indexer failure. I am certainly not suggesting we let it crash on purpose, and I am fully in favor of improving its reliability as well.

PeaBrane · 2025-12-10T01:10:38Z

@rmccorm4 @nnshah1
I approved the changes to unblock. The code itself looks Kosher. But will hand off the merge button to one of you guys.
Maybe we can look into making the Indexer panic-tolerant in future PRs

rmccorm4 · 2025-12-10T21:27:14Z

/ok to test 1980391

vladnosiv requested a review from a team as a code owner November 18, 2025 11:32

pull-request-size bot added the size/M label Nov 18, 2025

github-actions bot added external-contribution Pull request is from an external contributor fix labels Nov 18, 2025

vladnosiv mentioned this pull request Nov 18, 2025

[BUG]: KV-Router indexer receives invalid self-referential blocks #4394

Closed

coderabbitai bot reviewed Nov 18, 2025

View reviewed changes

lib/llm/src/kv_router/indexer.rs Show resolved Hide resolved

pull-request-size bot added size/L and removed size/M labels Nov 18, 2025

grahamking requested a review from PeaBrane November 18, 2025 16:23

copy-pr-bot bot temporarily deployed to GITLAB November 18, 2025 16:24 Inactive

copy-pr-bot bot temporarily deployed to GITLAB November 18, 2025 16:28 Inactive

PeaBrane reviewed Nov 18, 2025

View reviewed changes

lib/llm/src/kv_router/approx.rs Outdated Show resolved Hide resolved

PeaBrane reviewed Nov 18, 2025

View reviewed changes

lib/llm/src/kv_router/indexer.rs Outdated Show resolved Hide resolved

PeaBrane reviewed Nov 18, 2025

View reviewed changes

lib/llm/src/kv_router/indexer.rs Show resolved Hide resolved

vladnosiv requested a review from a team as a code owner November 19, 2025 12:46

vladnosiv force-pushed the indexer-is-best-effort branch from bd86ea3 to 27fe75c Compare November 20, 2025 13:15

vladnosiv added 6 commits December 1, 2025 18:30

degrade to empty overlaps when indexer offline

0b5a059

Signed-off-by: Vladislav Nosivskoy <[email protected]>

error level

8a82891

Signed-off-by: Vladislav Nosivskoy <[email protected]>

add panic tests

362560e

Signed-off-by: Vladislav Nosivskoy <[email protected]>

fix fmt

977a53f

Signed-off-by: Vladislav Nosivskoy <[email protected]>

fix clippy

9f595a9

Signed-off-by: Vladislav Nosivskoy <[email protected]>

revert panic tests

a17a2cf

Signed-off-by: Vladislav Nosivskoy <[email protected]>

vladnosiv force-pushed the indexer-is-best-effort branch from 27fe75c to a17a2cf Compare December 1, 2025 15:52

pull-request-size bot added size/M and removed size/L labels Dec 1, 2025

fix log level

c3e83e2

Signed-off-by: Vladislav Nosivskoy <[email protected]>

vladnosiv and others added 2 commits December 9, 2025 17:52

Merge branch 'main' into indexer-is-best-effort

f7c2cdc

Merge branch 'main' into indexer-is-best-effort

b9ec7ed

PeaBrane approved these changes Dec 10, 2025

View reviewed changes

Merge branch 'main' into indexer-is-best-effort

1980391

copy-pr-bot bot temporarily deployed to GITLAB December 10, 2025 21:27 Inactive

copy-pr-bot bot temporarily deployed to GITLAB December 10, 2025 21:29 Inactive

fix: KV-Router: degrade to empty overlaps when indexer offline #4428

Are you sure you want to change the base?

fix: KV-Router: degrade to empty overlaps when indexer offline #4428

Uh oh!

Conversation

vladnosiv commented Nov 18, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview:

Details:

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

Uh oh!

copy-pr-bot bot commented Nov 18, 2025

Uh oh!

github-actions bot commented Nov 18, 2025

Uh oh!

coderabbitai bot commented Nov 18, 2025

Walkthrough

Changes

Estimated code review effort

Poem

Pre-merge checks

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

grahamking commented Nov 18, 2025

Uh oh!

grahamking commented Nov 18, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

vladnosiv commented Dec 1, 2025

Uh oh!

PeaBrane commented Dec 1, 2025

Uh oh!

vladnosiv commented Dec 5, 2025

Uh oh!

rmccorm4 commented Dec 9, 2025

Uh oh!

copy-pr-bot bot commented Dec 9, 2025

Uh oh!

PeaBrane commented Dec 9, 2025

Uh oh!

vladnosiv commented Dec 9, 2025

Uh oh!

PeaBrane commented Dec 10, 2025

Uh oh!

rmccorm4 commented Dec 10, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

vladnosiv commented Nov 18, 2025 •

edited by coderabbitai bot

Loading