feat: KVBM improved external block matching #1714

jthomson04 · 2025-07-01T05:20:13Z

@ziqif-nv

Summary by CodeRabbit

New Features

Introduced a distributed Key-Value Block Manager (KVBM) with leader-worker architecture for scalable block management.
Added advanced block transfer system supporting multiple storage types (device, host, disk) and locality-aware transfers (local, logical, remote).
Implemented Python bindings and integration for vLLM KV cache management, including cache manager, request, and block list types.
Added support for new block layouts, including per-layer separate storage and fully contiguous layouts.
Enabled asynchronous and remote block transfer strategies, including CUDA, memcpy, and NIXL-based transfers.
Provided detailed Rust and Python APIs for slot/block management, offloading, onboarding, and block lifecycle control.

Bug Fixes

Improved error handling and validation throughout block transfer, offload, and onboarding workflows.
Enhanced test coverage and robustness for distributed and local block management operations.

Documentation

Added comprehensive documentation and test plans for block lifecycle, offload management, slot/block manager workflows, and distributed messaging.
Updated and expanded Python type stubs and module-level documentation.

Tests

Introduced extensive Rust and Python test suites covering distributed messaging, block transfer, slot/block management, and vLLM cache integration.
Added parameterized and integration tests for new block layouts and distributed workflows.

Chores

Updated development environment configurations for improved Python linting and container builds.
Refactored and organized codebase for clarity, modularity, and extensibility.

copy-pr-bot · 2025-07-01T05:20:17Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2025-07-01T05:36:07Z

Caution

Review failed

Failed to post review comments.

Walkthrough

This change introduces a comprehensive overhaul and extension of the LLM block manager system, spanning Rust and Python bindings. Major additions include locality-aware block abstractions, distributed leader-worker block management, a new block transfer subsystem (supporting memcpy, CUDA, and NIXL), new block layouts, and extensive vLLM integration. The update covers new traits, structs, methods, macros, documentation, and test plans, with significant refactoring for modularity and extensibility.

Changes

| File(s) / Path(s) | Change Summary |
|---|---|
| `.devcontainer/devcontainer.json`, `dynamo.code-workspace` | Devcontainer now builds from Dockerfile; VSCode Python linting enhanced; workspace Python analysis paths extended. |
| `lib/bindings/python/Cargo.toml`, `lib/llm/Cargo.toml` | Rust crates updated: new features, dependencies, and dev-dependencies added. |
| `lib/bindings/python/rust/lib.rs`, `lib/bindings/python/rust/llm.rs`, ... | Python Rust bindings refactored: distributed leader-worker model, new vLLM integration, block manager API updated, async/await and locality support. |
| `lib/bindings/python/rust/llm/block_manager/distributed.rs`, `leader.rs`, ... | New distributed block manager leader/worker Python bindings and Rust implementations, supporting async block transfers and resource management. |
| `lib/bindings/python/rust/llm/block_manager/vllm.rs`, `block_list.rs`, ... | vLLM cache manager and slot management Python bindings and Rust backend added, with slot/block tracking and update APIs. |
| `lib/bindings/python/src/dynamo/_core.pyi`, `lib/bindings/python/src/dynamo/llm/__init__.py` | Python stubs and imports updated for new KvbmCacheManager, KvbmRequest, KvbmLeader, KvbmWorker. |
| `lib/bindings/python/src/dynamo/llm/vllm_integration/` (multiple files) | New Python vLLM integration: cache manager, block utilities, Rust loader, and protocol-compliant APIs. |
| `lib/bindings/python/tests/test_kvbm.py` | New async Python test for KVBM cache manager and vLLM integration. |
| `lib/llm/src/block_manager.rs`, `block_manager.md`, `offload.rs`, ... | Block manager refactored for locality, async, and distributed support; new documentation and locality-aware offload manager. |
| `lib/llm/src/block_manager/block.rs`, `block_next.rs`, `block_v2.rs`, ... | Block abstractions refactored: locality, metadata, mutability, storage, and NIXL integration; new traits and error handling. |
| `lib/llm/src/block_manager/block/data.rs`, `local.rs`, `logical.rs`, ... | New block data modules: locality, logical resources, distributed leader-worker, and null resource implementations. |
| `lib/llm/src/block_manager/block/factory.rs`, `local.rs`, `logical.rs` | Block factory traits and implementations for local and logical block creation. |
| `lib/llm/src/block_manager/block/locality.rs` | LocalityProvider trait and implementations for local and logical block transfers. |
| `lib/llm/src/block_manager/block/state.rs` | BlockState enum extended with apply_token_block method. |
| `lib/llm/src/block_manager/block/transfer.rs`, `transfer_next.rs`, ... | Major refactor: new block transfer system supporting memcpy, CUDA, NIXL; new traits, strategies, and error handling. |
| `lib/llm/src/block_manager/block/transfer_v2.rs`, `executors.rs`, `macros.rs`, ... | New modular transfer v2 system: traits, coordinators, executors, macros for block transfers across locality/storage types. |
| `lib/llm/src/block_manager/block/transfer_v3.rs` | New transfer v3 file: block layer descriptors and type abstractions. |
| `lib/llm/src/block_manager/layout.rs`, `nixl.rs`, `utils.rs`, ... | New block layouts (LayerSeparate), trait refactors, NIXL serialization/deserialization, and layout validation utilities. |
| `lib/llm/src/block_manager/distributed.rs`, `active_message.rs`, `zmq.rs`, ... | Distributed block manager with async active message system, ZMQ-based leader-worker comms, and block transfer handlers. |
| `lib/llm/src/block_manager/config.rs` | BlockParallelismStrategy and logical parallelism added to manager config. |
| Documentation: `block_manager.md`, `README.md`, test plans, ... | New and updated documentation: block lifecycle, offload management, distributed messaging, and test plans for slots and slot manager. |
| Tests: `worker_test.rs`, `test_kvbm.py`, ... | New and updated tests for distributed messaging, concurrency, resource capture, vLLM integration, and block manager behaviors. |

Sequence Diagram(s)

sequenceDiagram
    participant PythonUser
    participant KvbmCacheManager as KvbmCacheManager (Python)
    participant RustKvbmCacheManager
    participant BlockManager
    participant KvbmLeader
    participant KvbmWorker

    PythonUser->>KvbmCacheManager: allocate_slots(request, tokens)
    KvbmCacheManager->>RustKvbmCacheManager: allocate_slots(update)
    RustKvbmCacheManager->>BlockManager: update_slot(...)
    BlockManager->>KvbmLeader: (if needed) transfer_blocks_request
    BlockManager->>KvbmWorker: allocate blocks, onboard/offload
    RustKvbmCacheManager-->>KvbmCacheManager: result (block states)
    KvbmCacheManager-->>PythonUser: result (block IDs)

sequenceDiagram
    participant Leader
    participant Worker
    participant ZMQ
    participant BlockTransferHandler

    Leader->>ZMQ: broadcast transfer_blocks_request
    ZMQ->>Worker: deliver message
    Worker->>BlockTransferHandler: handle transfer request
    BlockTransferHandler->>BlockTransferHandler: get source/target blocks
    BlockTransferHandler->>BlockTransferHandler: perform transfer (memcpy/CUDA/NIXL)
    BlockTransferHandler-->>Worker: notify completion
    Worker->>ZMQ: send ACK
    ZMQ->>Leader: receive ACK
    Leader-->>Leader: unblock on all ACKs
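
The leader/worker handshake in the second diagram can be sketched roughly as follows. This is only an illustration of the "broadcast, then unblock on all ACKs" pattern: it uses in-process `asyncio` queues as a stand-in for ZMQ, and the names `worker`, `leader_transfer`, and the message fields are hypothetical, not taken from the KVBM code.

```python
import asyncio


async def worker(worker_id: int, inbox: asyncio.Queue, acks: asyncio.Queue) -> None:
    """Receive one transfer request, 'perform' it, then ACK to the leader."""
    request = await inbox.get()
    # Stand-in for the real memcpy/CUDA/NIXL transfer (assumption: no real I/O here).
    await asyncio.sleep(0)
    await acks.put((worker_id, request["request_id"]))


async def leader_transfer(num_workers: int, request_id: int = 1) -> list[int]:
    """Broadcast a request and block until every worker has ACKed it."""
    acks: asyncio.Queue = asyncio.Queue()
    inboxes = [asyncio.Queue() for _ in range(num_workers)]
    tasks = [
        asyncio.create_task(worker(i, inboxes[i], acks))
        for i in range(num_workers)
    ]

    # Broadcast: every worker receives the same request.
    for inbox in inboxes:
        await inbox.put({"request_id": request_id, "op": "transfer_blocks_request"})

    # Unblock only once all workers have ACKed this request id.
    pending = set(range(num_workers))
    acked = []
    while pending:
        worker_id, acked_id = await acks.get()
        if acked_id == request_id and worker_id in pending:
            pending.discard(worker_id)
            acked.append(worker_id)

    await asyncio.gather(*tasks)
    return sorted(acked)


if __name__ == "__main__":
    print(asyncio.run(leader_transfer(3)))
```

Tracking a `pending` set rather than a bare counter means a duplicate or stale ACK cannot unblock the leader early.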

Possibly related PRs

ai-dynamo/dynamo#1141: Adds async Python bindings and a new Layer class, directly related to this PR's refactor and extension of block manager Python bindings and async support.

Poem

In fields of code where blocks now roam,
Local or logical, each finds a home.
Leaders and workers, in sync they chat,
With CUDA and NIXL, and tests to combat.
From Python to Rust, the system’s grown vast—
A rabbit hops forward, the future amassed!
🐇✨


zhaohaidao · 2025-07-01T06:51:48Z

@jthomson04 Hi,
I have a question. The changes only involve two files. Why does the coderabbitai summary involve so much content?

tzulingk · 2025-07-01T15:19:35Z

lib/bindings/python/src/dynamo/llm/vllm_integration/kv_cache_manager.py


-        need_to_allocate = num_external_hit_tokens - num_computed_tokens

        # In a full-prompt-hit case, we need to recompute the last token


Just curious: why do we need to recompute the last token when the full prompt is hit?

I learned from https://github.com/LMCache/LMCache/blob/dev/lmcache/integration/vllm/vllm_v1_adapter.py#L832

Essentially, to generate the next token we need not only the KV cache but also the logits, so we need to let vLLM recompute the last token.

Thanks for the explanation. Would you mind adding your explanation as part of the comment?
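
A minimal sketch of the adjustment discussed in this thread, assuming the full-prompt-hit case is detected by comparing the external hit length against the prompt length. The function name `tokens_to_load` and the parameter `num_prompt_tokens` are hypothetical; only `need_to_allocate`, `num_external_hit_tokens`, and `num_computed_tokens` appear in the diff.

```python
def tokens_to_load(
    num_prompt_tokens: int,
    num_computed_tokens: int,
    num_external_hit_tokens: int,
) -> int:
    """Number of externally cached tokens to onboard for a request."""
    need_to_allocate = num_external_hit_tokens - num_computed_tokens

    # In a full-prompt-hit case, we need to recompute the last token:
    # the KV cache alone does not give us the logits needed to sample
    # the next token, so leave one prompt token for the engine to compute.
    if num_external_hit_tokens == num_prompt_tokens:
        need_to_allocate -= 1

    # Assumption: never report a negative amount of work.
    return max(need_to_allocate, 0)
```

For example, with a 64-token prompt, 16 tokens already computed, and all 64 tokens hit externally, this sketch loads 47 tokens and leaves the final one to be recomputed.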

tzulingk · 2025-07-01T15:33:22Z

lib/bindings/python/src/dynamo/llm/vllm_integration/kv_cache_manager.py

        sequence_hashes = self._create_slot(request)

-        host_owned_blocks, disk_owned_blocks = self.cache_manager.get_offloaded_computed_blocks(sequence_hashes)
+        remaining_sequence_hashes = sequence_hashes[num_computed_tokens // self.block_size:]


I am new to the team. Do you know why my dynamo workspace doesn't have lib/bindings/python/src/dynamo/llm/vllm_integration/kv_cache_manager.py?

tzulingk@66d8878-lcedt:~/workspace_venv/dynamo$ ls lib/bindings/python/src/dynamo/llm/
__init__.py  __pycache__/  tensorrtllm/

Are you on the right branch?

That is because the KVBM-related Python bindings you pointed to are not in the main/release branch yet; they only exist in our private branch. We will merge them to main once they are ready.
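
For reference, the added line in the diff above drops the hashes of blocks that are already computed before looking for offloaded matches. A small sketch of that slicing, assuming one sequence hash per full block; the standalone function `remaining_hashes` is hypothetical and exists only for illustration.

```python
def remaining_hashes(
    sequence_hashes: list[int],
    num_computed_tokens: int,
    block_size: int,
) -> list[int]:
    """Return the hashes of blocks not yet covered by computed tokens."""
    # Integer division gives the number of fully computed blocks;
    # a partially computed block is still included in the remainder.
    num_computed_blocks = num_computed_tokens // block_size
    return sequence_hashes[num_computed_blocks:]
```

With `block_size=16` and 40 computed tokens, the first two blocks are skipped and matching starts from the third hash.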

jthomson04 · 2025-07-01T15:37:24Z

> @jthomson04 Hi, I have a question. The changes only involve two files. Why does the coderabbitai summary involve so much content?

Because when I created the MR, I accidentally set the target branch to main

zhaohaidao · 2025-07-01T16:51:05Z

> > @jthomson04 Hi, I have a question. The changes only involve two files. Why does the coderabbitai summary involve so much content?
>
> Because when I created the MR, I accidentally set the target branch to main

Got it. Thanks for the explanation

Hierarchical reuse

cbabc40

jthomson04 requested review from a team, GuanLuo, PeaBrane, alec-flowers, biswapanda, grahamking, ishandhanani, kkranen, nnshah1, oandreeva-nv, paulhendricks, piotrm-nvidia, ptarasiewiczNV, rmccorm4, ryanolson, tanmayv25, tedzhouhk and tmonty12 as code owners July 1, 2025 05:20

github-actions bot added the feat label Jul 1, 2025

jthomson04 changed the base branch from main to ziqif/add-get-g2g3-reuse July 1, 2025 05:20

pull-request-size bot added the size/XXL label, then added size/S and removed size/XXL, Jul 1, 2025

tzulingk reviewed Jul 1, 2025

View reviewed changes

jthomson04 closed this Jul 1, 2025


5 participants
