feat: KVBM improved external block matching #1714
Caution: Review failed. Failed to post review comments.
Walkthrough
This change introduces a comprehensive overhaul and extension of the LLM block manager system, spanning Rust and Python bindings. Major additions include locality-aware block abstractions, distributed leader-worker block management, a new block transfer subsystem (supporting memcpy, CUDA, and NIXL), new block layouts, and extensive vLLM integration. The update covers new traits, structs, methods, macros, documentation, and test plans, with significant refactoring for modularity and extensibility.
Sequence Diagram(s)
sequenceDiagram
participant PythonUser
participant KvbmCacheManager (Python)
participant RustKvbmCacheManager
participant BlockManager
participant KvbmLeader
participant KvbmWorker
PythonUser->>KvbmCacheManager: allocate_slots(request, tokens)
KvbmCacheManager->>RustKvbmCacheManager: allocate_slots(update)
RustKvbmCacheManager->>BlockManager: update_slot(...)
BlockManager->>KvbmLeader: (if needed) transfer_blocks_request
BlockManager->>KvbmWorker: allocate blocks, onboard/offload
RustKvbmCacheManager-->>KvbmCacheManager: result (block states)
KvbmCacheManager-->>PythonUser: result (block IDs)
sequenceDiagram
participant Leader
participant Worker
participant ZMQ
participant BlockTransferHandler
Leader->>ZMQ: broadcast transfer_blocks_request
ZMQ->>Worker: deliver message
Worker->>BlockTransferHandler: handle transfer request
BlockTransferHandler->>BlockTransferHandler: get source/target blocks
BlockTransferHandler->>BlockTransferHandler: perform transfer (memcpy/CUDA/NIXL)
BlockTransferHandler-->>Worker: notify completion
Worker->>ZMQ: send ACK
ZMQ->>Leader: receive ACK
Leader-->>Leader: unblock on all ACKs
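The second diagram's broadcast-then-wait-for-all-ACKs pattern can be sketched in a few lines of Python. This is only an illustration of the control flow, not the PR's implementation: in-process queues stand in for ZMQ, the block transfer itself is a placeholder, and `run_transfer_round` and the `block_ids` message field are hypothetical names.

```python
import queue
import threading


def run_transfer_round(num_workers: int, request: dict) -> list[int]:
    """Broadcast one request to every worker and block until all have ACKed."""
    # Assumption: one inbox per worker stands in for the ZMQ broadcast channel.
    inboxes = [queue.Queue() for _ in range(num_workers)]
    # Assumption: a single shared queue stands in for the ACK channel.
    acks: queue.Queue = queue.Queue()

    def worker(worker_id: int) -> None:
        msg = inboxes[worker_id].get()   # receive the broadcast message
        _ = list(msg["block_ids"])       # placeholder for the real transfer
        acks.put(worker_id)              # notify completion

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(num_workers)
    ]
    for t in threads:
        t.start()

    # Leader: broadcast the same request to every worker.
    for inbox in inboxes:
        inbox.put(request)

    # Leader: unblock only once every worker has ACKed.
    acked = sorted(acks.get(timeout=5) for _ in range(num_workers))

    for t in threads:
        t.join()
    return acked
```

The leader collects exactly `num_workers` ACKs before returning, so a single slow worker holds up the round; the timeout here is an arbitrary safeguard for the sketch.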
@jthomson04 Hi,

need_to_allocate = num_external_hit_tokens - num_computed_tokens
# In a full-prompt-hit case, we need to recompute the last token
Just curious: why do we need to recompute the last token when the full prompt is hit?
I learned this from https://github.com/LMCache/LMCache/blob/dev/lmcache/integration/vllm/vllm_v1_adapter.py#L832
Essentially, to generate the next token we need not only the KV cache but also the logits, so we need to let vLLM recompute the last token.
Thanks for the explanation. Would you mind adding your explanation as part of the comment?
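A minimal sketch of the full-prompt-hit adjustment discussed in this thread, under stated assumptions: `tokens_to_load` and the `num_prompt_tokens` parameter are hypothetical names, and clamping the result at zero is an added safeguard that the quoted diff does not show. Only the subtraction `num_external_hit_tokens - num_computed_tokens` comes from the PR.

```python
def tokens_to_load(
    num_prompt_tokens: int,
    num_external_hit_tokens: int,
    num_computed_tokens: int,
) -> int:
    """Return how many externally cached tokens still need to be loaded."""
    # Full-prompt hit: leave the last token for the engine to recompute,
    # because sampling the next token needs its logits, not just KV cache.
    if num_external_hit_tokens == num_prompt_tokens:
        num_external_hit_tokens -= 1

    # Assumption: never report a negative count if more tokens are already
    # computed locally than were matched externally.
    return max(num_external_hit_tokens - num_computed_tokens, 0)
```

For a 16-token prompt fully present in the external cache with nothing computed locally, this gives 15 tokens to load and leaves one token for the forward pass.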
sequence_hashes = self._create_slot(request)
host_owned_blocks, disk_owned_blocks = self.cache_manager.get_offloaded_computed_blocks(sequence_hashes)
remaining_sequence_hashes = sequence_hashes[num_computed_tokens // self.block_size:]
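To illustrate the slice on the last line above, here is a small standalone sketch. It assumes one sequence hash per block and uses a hypothetical helper name, `remaining_hashes`; the real code operates on `self.block_size` inside the cache manager.

```python
def remaining_hashes(
    sequence_hashes: list[int],
    num_computed_tokens: int,
    block_size: int,
) -> list[int]:
    """Drop the hashes of blocks already fully covered by computed tokens."""
    # Floor division counts only complete blocks, so a partially computed
    # block is still treated as remaining.
    num_computed_blocks = num_computed_tokens // block_size
    return sequence_hashes[num_computed_blocks:]
```

With four block hashes, a block size of 16, and 40 computed tokens, two full blocks are skipped and the last two hashes remain, since tokens 32 to 39 only partially fill the third block.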
I am new to the team. Do you know why my dynamo workspace doesn't have lib/bindings/python/src/dynamo/llm/vllm_integration/kv_cache_manager.py?
tzulingk@66d8878-lcedt:~/workspace_venv/dynamo$ ls lib/bindings/python/src/dynamo/llm/
__init__.py __pycache__/ tensorrtllm/
Are you on the right branch?
That's because the KVBM-related Python bindings you pointed to are not in the main/release branch yet; they only exist in our private branch. We will merge them to main once they are ready.
Thank you.
Because when I created the MR, I accidentally set the target branch to main
Got it. Thanks for the explanation
@ziqif-nv