Hierarchical reuse
jthomson04 committed Jul 1, 2025
commit cbabc401863cf24d6a35510aceee8b35788c3d1b
2 changes: 1 addition & 1 deletion lib/bindings/python/rust/llm/block_manager/vllm.rs
@@ -136,7 +136,7 @@ impl KvbmCacheManager {
        };

        let disk_blocks = if let Some(disk) = self.block_manager().disk() {
-           disk.match_sequence_hashes_blocking(&sequence_hashes)
+           disk.match_sequence_hashes_blocking(&sequence_hashes[host_blocks.len()..])
                .map_err(to_pyerr)?
        } else {
            vec![]
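To see what the one-line Rust change buys, here is a toy Python sketch (both helpers are hypothetical stand-ins, not the block manager's API). Because sequence hashes are prefix-dependent, each tier matches a contiguous run from the front of the sequence; starting the disk lookup after the host hits makes the two runs disjoint, so no block is counted by both tiers.

```python
def longest_prefix_match(index: set[str], hashes: list[str]) -> list[str]:
    """Toy stand-in for match_sequence_hashes_blocking: take hashes from the
    front of the list for as long as the tier's index contains them."""
    matched = []
    for h in hashes:
        if h not in index:
            break
        matched.append(h)
    return matched


def match_tiers(host_index: set[str], disk_index: set[str], hashes: list[str]):
    host_blocks = longest_prefix_match(host_index, hashes)
    # The fix above, mirroring &sequence_hashes[host_blocks.len()..]: only ask
    # the disk tier about hashes the host tier did not already match.
    disk_blocks = longest_prefix_match(disk_index, hashes[len(host_blocks):])
    return host_blocks, disk_blocks
```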
lib/bindings/python/src/dynamo/llm/vllm_integration/kv_cache_manager.py

@@ -117,19 +117,22 @@ def get_offloaded_computed_blocks(

        sequence_hashes = self._create_slot(request)

-       host_owned_blocks, disk_owned_blocks = self.cache_manager.get_offloaded_computed_blocks(sequence_hashes)
+       remaining_sequence_hashes = sequence_hashes[num_computed_tokens // self.block_size:]
Contributor:

I am new to the team; do you know why my dynamo workspace doesn't have lib/bindings/python/src/dynamo/llm/vllm_integration/kv_cache_manager.py?

tzulingk@66d8878-lcedt:~/workspace_venv/dynamo$ ls lib/bindings/python/src/dynamo/llm/
__init__.py  __pycache__/ tensorrtllm/ 

Contributor Author:

Are you on the right branch?

Contributor:

It's because the kvbm-related Python bindings you pointed to are not in the main/release branch yet; they only exist in our private branch. We will merge to main once ready.

Contributor:

Thank you.


+       host_owned_blocks, disk_owned_blocks = self.cache_manager.get_offloaded_computed_blocks(remaining_sequence_hashes)
        host_block_count = host_owned_blocks.block_count()
        disk_block_count = disk_owned_blocks.block_count()

        num_host_computed_tokens = host_block_count * self.block_size
        num_disk_computed_tokens = disk_block_count * self.block_size

-       num_external_hit_tokens = max(num_disk_computed_tokens, num_host_computed_tokens)
+       num_external_hit_tokens = num_host_computed_tokens + num_disk_computed_tokens

-       need_to_allocate = num_external_hit_tokens
+       need_to_allocate = num_external_hit_tokens - num_computed_tokens
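A quick walk-through of the new accounting with invented numbers (the block size and per-tier hit counts below are assumptions for illustration). After the slicing changes, the host and disk tiers match disjoint runs of blocks, so their token counts add; with the old overlapping lookups, `max()` was the conservative combiner.

```python
block_size = 16
num_computed_tokens = 32      # 2 blocks already computed on the GPU
num_blocks = 8                # an 8-block (128-token) prompt

# Skip the hashes for blocks the GPU already holds (the new slicing above).
num_remaining_blocks = num_blocks - num_computed_tokens // block_size  # 6

# Suppose the host matches the first 3 remaining blocks and the disk the
# next 2; the Rust-side offset guarantees the two runs are disjoint.
host_block_count, disk_block_count = 3, 2

num_host_computed_tokens = host_block_count * block_size   # 48
num_disk_computed_tokens = disk_block_count * block_size   # 32

# Disjoint runs add up: 80 externally cached tokens, where max() would
# have reported only 48.
num_external_hit_tokens = num_host_computed_tokens + num_disk_computed_tokens
assert num_external_hit_tokens == 80
```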

        # In a full-prompt-hit case, we need to recompute the last token
Contributor:

Just curious: why do we need to recompute the last token if the full prompt is hit?

Contributor Author:

I learned from https://github.com/LMCache/LMCache/blob/dev/lmcache/integration/vllm/vllm_v1_adapter.py#L832

Essentially, to generate the next token we need not only the KV cache but also the logits, so we need to let vLLM recompute the last token.

Contributor:

Thanks for the explanation. Would you mind adding your explanation as part of the comment?
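Distilling the thread above into a hedged sketch (the function and its names are hypothetical, not the project's API): the KV cache stores per-layer keys and values but not logits, and sampling the first output token requires the logits of the last prompt token, so on a full-prompt hit one token fewer is reported and vLLM runs that token's forward pass.

```python
def reportable_hit_tokens(num_prompt_tokens: int,
                          num_computed_tokens: int,
                          num_external_hit_tokens: int) -> int:
    """On a full-prompt hit, hold back the last token: its forward pass can
    reuse every cached K/V entry, but it must still run to produce logits."""
    hit = num_external_hit_tokens
    if num_computed_tokens + hit == num_prompt_tokens:
        hit -= 1  # recompute the last prompt token to get its logits
    return hit
```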

-       if num_external_hit_tokens == request.num_tokens:
+       if num_computed_tokens + num_external_hit_tokens == request.num_tokens:
            need_to_allocate -= 1

        # TODO: add stats for offloaded computed tokens