
Conversation

@jthomson04
Contributor

@jthomson04 jthomson04 commented Jul 1, 2025

@ziqif-nv

Summary by CodeRabbit

New Features

  • Introduced a distributed Key-Value Block Manager (KVBM) with leader-worker architecture for scalable block management.
  • Added advanced block transfer system supporting multiple storage types (device, host, disk) and locality-aware transfers (local, logical, remote).
  • Implemented Python bindings and integration for vLLM KV cache management, including cache manager, request, and block list types.
  • Added support for new block layouts, including per-layer separate storage and fully contiguous layouts.
  • Enabled asynchronous and remote block transfer strategies, including CUDA, memcpy, and NIXL-based transfers.
  • Provided detailed Rust and Python APIs for slot/block management, offloading, onboarding, and block lifecycle control.

Bug Fixes

  • Improved error handling and validation throughout block transfer, offload, and onboarding workflows.
  • Enhanced test coverage and robustness for distributed and local block management operations.

Documentation

  • Added comprehensive documentation and test plans for block lifecycle, offload management, slot/block manager workflows, and distributed messaging.
  • Updated and expanded Python type stubs and module-level documentation.

Tests

  • Introduced extensive Rust and Python test suites covering distributed messaging, block transfer, slot/block management, and vLLM cache integration.
  • Added parameterized and integration tests for new block layouts and distributed workflows.

Chores

  • Updated development environment configurations for improved Python linting and container builds.
  • Refactored and organized codebase for clarity, modularity, and extensibility.

@copy-pr-bot

copy-pr-bot bot commented Jul 1, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the feat label Jul 1, 2025
@jthomson04 jthomson04 changed the base branch from main to ziqif/add-get-g2g3-reuse July 1, 2025 05:20
@coderabbitai
Contributor

coderabbitai bot commented Jul 1, 2025

Caution

Review failed

Failed to post review comments.

Walkthrough

This change introduces a comprehensive overhaul and extension of the LLM block manager system, spanning Rust and Python bindings. Major additions include locality-aware block abstractions, distributed leader-worker block management, a new block transfer subsystem (supporting memcpy, CUDA, and NIXL), new block layouts, and extensive vLLM integration. The update covers new traits, structs, methods, macros, documentation, and test plans, with significant refactoring for modularity and extensibility.

Changes

| File(s) / Path(s) | Change Summary |
|---|---|
| `.devcontainer/devcontainer.json`, `dynamo.code-workspace` | Devcontainer now builds from Dockerfile; VSCode Python linting enhanced; workspace Python analysis paths extended. |
| `lib/bindings/python/Cargo.toml`, `lib/llm/Cargo.toml` | Rust crates updated: new features, dependencies, and dev-dependencies added. |
| `lib/bindings/python/rust/lib.rs`, `lib/bindings/python/rust/llm.rs`, ... | Python Rust bindings refactored: distributed leader-worker model, new vLLM integration, block manager API updated, async/await and locality support. |
| `lib/bindings/python/rust/llm/block_manager/distributed.rs`, `leader.rs`, ... | New distributed block manager leader/worker Python bindings and Rust implementations, supporting async block transfers and resource management. |
| `lib/bindings/python/rust/llm/block_manager/vllm.rs`, `block_list.rs`, ... | vLLM cache manager and slot management Python bindings and Rust backend added, with slot/block tracking and update APIs. |
| `lib/bindings/python/src/dynamo/_core.pyi`, `lib/bindings/python/src/dynamo/llm/__init__.py` | Python stubs and imports updated for new KvbmCacheManager, KvbmRequest, KvbmLeader, KvbmWorker. |
| `lib/bindings/python/src/dynamo/llm/vllm_integration/` (multiple files) | New Python vLLM integration: cache manager, block utilities, Rust loader, and protocol-compliant APIs. |
| `lib/bindings/python/tests/test_kvbm.py` | New async Python test for KVBM cache manager and vLLM integration. |
| `lib/llm/src/block_manager.rs`, `block_manager.md`, `offload.rs`, ... | Block manager refactored for locality, async, and distributed support; new documentation and locality-aware offload manager. |
| `lib/llm/src/block_manager/block.rs`, `block_next.rs`, `block_v2.rs`, ... | Block abstractions refactored: locality, metadata, mutability, storage, and NIXL integration; new traits and error handling. |
| `lib/llm/src/block_manager/block/data.rs`, `local.rs`, `logical.rs`, ... | New block data modules: locality, logical resources, distributed leader-worker, and null resource implementations. |
| `lib/llm/src/block_manager/block/factory.rs`, `local.rs`, `logical.rs` | Block factory traits and implementations for local and logical block creation. |
| `lib/llm/src/block_manager/block/locality.rs` | LocalityProvider trait and implementations for local and logical block transfers. |
| `lib/llm/src/block_manager/block/state.rs` | BlockState enum extended with apply_token_block method. |
| `lib/llm/src/block_manager/block/transfer.rs`, `transfer_next.rs`, ... | Major refactor: new block transfer system supporting memcpy, CUDA, NIXL; new traits, strategies, and error handling. |
| `lib/llm/src/block_manager/block/transfer_v2.rs`, `executors.rs`, `macros.rs`, ... | New modular transfer v2 system: traits, coordinators, executors, macros for block transfers across locality/storage types. |
| `lib/llm/src/block_manager/block/transfer_v3.rs` | New transfer v3 file: block layer descriptors and type abstractions. |
| `lib/llm/src/block_manager/layout.rs`, `nixl.rs`, `utils.rs`, ... | New block layouts (LayerSeparate), trait refactors, NIXL serialization/deserialization, and layout validation utilities. |
| `lib/llm/src/block_manager/distributed.rs`, `active_message.rs`, `zmq.rs`, ... | Distributed block manager with async active message system, ZMQ-based leader-worker comms, and block transfer handlers. |
| `lib/llm/src/block_manager/config.rs` | BlockParallelismStrategy and logical parallelism added to manager config. |
| Documentation: `block_manager.md`, `README.md`, test plans, ... | New and updated documentation: block lifecycle, offload management, distributed messaging, and test plans for slots and slot manager. |
| Tests: `worker_test.rs`, `test_kvbm.py`, ... | New and updated tests for distributed messaging, concurrency, resource capture, vLLM integration, and block manager behaviors. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant PythonUser
    participant PyKvbmCacheManager as KvbmCacheManager (Python)
    participant RustKvbmCacheManager
    participant BlockManager
    participant KvbmLeader
    participant KvbmWorker

    PythonUser->>PyKvbmCacheManager: allocate_slots(request, tokens)
    PyKvbmCacheManager->>RustKvbmCacheManager: allocate_slots(update)
    RustKvbmCacheManager->>BlockManager: update_slot(...)
    BlockManager->>KvbmLeader: (if needed) transfer_blocks_request
    BlockManager->>KvbmWorker: allocate blocks, onboard/offload
    RustKvbmCacheManager-->>PyKvbmCacheManager: result (block states)
    PyKvbmCacheManager-->>PythonUser: result (block IDs)
```
```mermaid
sequenceDiagram
    participant Leader
    participant Worker
    participant ZMQ
    participant BlockTransferHandler

    Leader->>ZMQ: broadcast transfer_blocks_request
    ZMQ->>Worker: deliver message
    Worker->>BlockTransferHandler: handle transfer request
    BlockTransferHandler->>BlockTransferHandler: get source/target blocks
    BlockTransferHandler->>BlockTransferHandler: perform transfer (memcpy/CUDA/NIXL)
    BlockTransferHandler-->>Worker: notify completion
    Worker->>ZMQ: send ACK
    ZMQ->>Leader: receive ACK
    Leader-->>Leader: unblock on all ACKs
```
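The ACK barrier in the leader/worker diagram above can be sketched in a few lines. This is an illustrative stand-in only, not the actual Rust/ZMQ implementation: plain queues and threads substitute for the PUB/PULL sockets, and all names (`worker`, `leader_broadcast_and_wait`) are hypothetical. The leader broadcasts one transfer request to every worker, then blocks until an ACK has arrived from each of them.

```python
# Hypothetical sketch of the leader/worker ACK barrier; queues stand in
# for ZMQ sockets, and names are invented for illustration.
import queue
import threading

def worker(inbox: queue.Queue, ack_queue: queue.Queue, worker_id: int) -> None:
    request = inbox.get()                # broadcast message is delivered
    # The real BlockTransferHandler would perform the memcpy/CUDA/NIXL
    # transfer here before acknowledging.
    ack_queue.put((worker_id, request))  # send ACK back to the leader

def leader_broadcast_and_wait(request, inboxes, ack_queue):
    for inbox in inboxes:                # broadcast transfer_blocks_request
        inbox.put(request)
    # Block until every worker has ACKed, then unblock.
    return [ack_queue.get() for _ in inboxes]

inboxes = [queue.Queue() for _ in range(3)]
acks: queue.Queue = queue.Queue()
threads = [threading.Thread(target=worker, args=(q, acks, i))
           for i, q in enumerate(inboxes)]
for t in threads:
    t.start()
result = leader_broadcast_and_wait("transfer_blocks_request", inboxes, acks)
for t in threads:
    t.join()
print(len(result))  # 3
```

The barrier is the key property: the leader cannot proceed until all workers have completed their transfers, which is what "unblock on all ACKs" denotes in the diagram.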

Possibly related PRs

  • ai-dynamo/dynamo#1141: Adds async Python bindings and a new Layer class, directly related to this PR's refactor and extension of block manager Python bindings and async support.

Poem

In fields of code where blocks now roam,
Local or logical, each finds a home.
Leaders and workers, in sync they chat,
With CUDA and NIXL, and tests to combat.
From Python to Rust, the system’s grown vast—
A rabbit hops forward, the future amassed!
🐇✨



@zhaohaidao
Contributor

@jthomson04 Hi,
I have a question. The changes only involve two files. Why does the coderabbitai summary involve so much content?


```python
need_to_allocate = num_external_hit_tokens - num_computed_tokens

# In a full-prompt-hit case, we need to recompute the last token
```
Contributor

just curious: Why do we need to recompute the last token if a full-prompt is hit?

Contributor

I learned from https://github.com/LMCache/LMCache/blob/dev/lmcache/integration/vllm/vllm_v1_adapter.py#L832

Essentially, to generate the next token we need not only the KV cache but also the logits, so we need to let vLLM recompute the last token.
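The adjustment being discussed can be sketched as follows. This is a hypothetical illustration, not the actual KvbmCacheManager code: when every prompt token is an external cache hit, the hit count is backed off by one so vLLM recomputes the last token and produces the logits needed to sample the next one.

```python
# Hypothetical sketch of the full-prompt-hit adjustment; function and
# parameter names are invented for illustration.
def tokens_to_allocate(num_prompt_tokens: int,
                       num_external_hit_tokens: int,
                       num_computed_tokens: int) -> int:
    if num_external_hit_tokens == num_prompt_tokens:
        # Full prompt hit: the KV cache alone cannot yield next-token
        # logits, so schedule the last token for recomputation.
        num_external_hit_tokens -= 1
    return num_external_hit_tokens - num_computed_tokens

print(tokens_to_allocate(8, 8, 4))  # full hit: 7 - 4 = 3
print(tokens_to_allocate(8, 6, 4))  # partial hit: 6 - 4 = 2
```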

Contributor

Thanks for the explanation. Would you mind adding your explanation as part of the comment?

```python
sequence_hashes = self._create_slot(request)

host_owned_blocks, disk_owned_blocks = self.cache_manager.get_offloaded_computed_blocks(sequence_hashes)
remaining_sequence_hashes = sequence_hashes[num_computed_tokens // self.block_size:]
```
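As a small worked example of the slicing above (illustrative values only, not taken from the PR): computed tokens are consumed in whole blocks, so integer-dividing by the block size gives the number of leading sequence hashes that are already covered.

```python
# Illustrative values only: one hash per block of block_size tokens.
block_size = 16
sequence_hashes = ["h0", "h1", "h2", "h3"]
num_computed_tokens = 32  # two full blocks already computed

# Skip the hashes for the blocks that are already computed.
remaining_sequence_hashes = sequence_hashes[num_computed_tokens // block_size:]
print(remaining_sequence_hashes)  # ['h2', 'h3']
```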
Contributor

I am new to the team; do you know why my dynamo workspace doesn't have lib/bindings/python/src/dynamo/llm/vllm_integration/kv_cache_manager.py?

```console
tzulingk@66d8878-lcedt:~/workspace_venv/dynamo$ ls lib/bindings/python/src/dynamo/llm/
__init__.py  __pycache__/ tensorrtllm/
```

Contributor Author

Are you on the right branch?

Contributor

It's because the KVBM-related Python bindings you pointed to aren't in the main/release branch yet; they only exist in our private branch. We will merge to main once they're ready.

Contributor

Thank you.

@jthomson04
Contributor Author

> @jthomson04 Hi, I have a question. The changes only involve two files. Why does the coderabbitai summary involve so much content?

Because when I created the MR, I accidentally set the target branch to main

@zhaohaidao
Contributor

> @jthomson04 Hi, I have a question. The changes only involve two files. Why does the coderabbitai summary involve so much content?
>
> Because when I created the MR, I accidentally set the target branch to main

Got it. Thanks for the explanation

@jthomson04 jthomson04 closed this Jul 1, 2025