
Conversation


@kthui kthui commented May 20, 2025

Overview:

  • Add async Python binding methods for the KV Block Manager.
  • Introduce the outer dimension.
  • Introduce the Layer class.
  • Refactor the DLPack implementation.
  • Add tests for the Layer class.

Details:

N/A

Where should the reviewer start?

Start with the async Python bindings for the block manager. Then review the test cases for the Layer class, and finally the Layer class implementation.

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

N/A

Summary by CodeRabbit

  • New Features

    • Introduced a Layer class with DLPack interoperability for efficient data sharing.
    • Added asynchronous methods to allocate host and device blocks.
    • Enhanced Block objects to support Python container and iterator protocols, allowing iteration, indexing, and conversion to lists of layers.
    • Updated tensor shapes to include an additional dimension for improved data representation.
  • Bug Fixes

    • Improved error handling for block allocation and configuration failures, returning Python runtime errors instead of panicking.
  • Tests

    • Added and updated tests to cover asynchronous block allocation, layer access, iteration, and data copying between host and device blocks, including support for an additional tensor dimension.
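The summary above describes the new API surface. The sketch below is a pure-Python stand-in (the real Layer/Block/BlockManager live in the Rust bindings; these mock classes only mirror the call pattern of `allocate_host_blocks`, `Block.__getitem__`, and `to_list` as described in the summary):

```python
import asyncio

# Stand-in classes sketching the API shape described above; the real
# implementations are PyO3-exposed Rust types, not these mocks.
class Layer:
    def __init__(self, index):
        self.index = index

class Block:
    """Supports the container protocol added in this PR."""
    def __init__(self, num_layers):
        self._layers = [Layer(i) for i in range(num_layers)]

    def __len__(self):
        return len(self._layers)

    def __getitem__(self, idx):
        return self._layers[idx]

    def to_list(self):
        return list(self._layers)

class BlockManager:
    async def allocate_host_blocks(self, count):
        # The real method awaits the Rust-side allocator; this just
        # fabricates blocks so the call pattern can be shown.
        return [Block(num_layers=2) for _ in range(count)]

async def main():
    manager = BlockManager()
    blocks = await manager.allocate_host_blocks(3)
    # __getitem__ alone makes Block iterable via the sequence protocol
    layer_indices = [layer.index for block in blocks for layer in block]
    print(len(blocks), layer_indices)  # → 3 [0, 1, 0, 1, 0, 1]

asyncio.run(main())
```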


copy-pr-bot bot commented May 20, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@kthui kthui self-assigned this May 20, 2025
@github-actions github-actions bot added the feat label May 20, 2025
@pull-request-size pull-request-size bot added size/L and removed size/S labels May 21, 2025
@kthui kthui force-pushed the jacky-kvbm-py-async branch from 3306ec8 to 867883d Compare May 21, 2025 01:42
@kthui kthui marked this pull request as ready for review May 21, 2025 01:48
@kthui kthui enabled auto-merge (squash) May 21, 2025 01:48
@kthui kthui force-pushed the jacky-kvbm-py-async branch 2 times, most recently from e04422e to 70ad4cf Compare May 28, 2025 23:12

coderabbitai bot commented May 28, 2025

Walkthrough

The changes introduce a new Python-exposed Layer class representing a block layer with DLPack support, enhance the Block class with Python container and iterator protocols, and add asynchronous allocation methods to BlockManager. DLPack interoperability is refactored for both blocks and layers. Tests and type hints are updated to reflect new tensor shapes, async APIs, and layer-level access.

Changes

File(s) and change summary:

  • lib/bindings/python/rust/llm/block_manager.rs: Refactored to use a shared Tokio runtime; improved error handling; added async methods allocate_host_blocks and allocate_device_blocks; exposed the new layer::Layer class; removed the tokio_runtime field.
  • lib/bindings/python/rust/llm/block_manager/block.rs: Refactored DLPack logic directly into Block; added a field for Python iterator state; implemented Python container and iterator protocols; simplified device info retrieval; removed the DlPackTensor struct.
  • lib/bindings/python/rust/llm/block_manager/block_list.rs: Updated to_list and __iter__ method signatures to use explicit Python GIL tokens and mutable references, simplifying GIL handling and iterator reset logic.
  • lib/bindings/python/rust/llm/block_manager/dlpack.rs: New file implementing DLPack tensor and device info interoperability for blocks and layers, including helper functions for DLPack capsule and device tuple creation.
  • lib/bindings/python/rust/llm/block_manager/layer.rs: New file defining the Layer struct with DLPack support and Python bindings, including methods for DLPack capsule and device info retrieval.
  • lib/bindings/python/src/dynamo/_core.pyi: Added the Layer class; updated Block with container/iterator protocols and to_list; added async allocation methods to BlockManager; updated docstrings for DLPack support and exception behavior.
  • lib/bindings/python/tests/test_block_manager.py: Refactored for the new tensor shape with an extra dimension; updated to use async allocation methods; added tests for layer access, iteration, and layer-level copying; revised permutation and data integrity checks; introduced a new fixture for fresh managers.

Sequence Diagram(s)

sequenceDiagram
    participant Python as Python User
    participant BlockManager as BlockManager (Rust)
    participant BlockList as BlockList
    participant Block as Block
    participant Layer as Layer

    Python->>BlockManager: await allocate_host_blocks(count)
    BlockManager->>BlockList: create BlockList of Blocks
    BlockList-->>Python: BlockList

    Python->>BlockList: __iter__()
    BlockList-->>Python: iterator

    loop over BlockList
        Python->>Block: __getitem__(index)
        Block-->>Layer: return Layer
        Python->>Layer: __dlpack__()
        Layer-->>Python: DLPack capsule
    end

Poem

In the meadow of memory blocks,
Layers now hop with DLPack socks.
Async bunnies leap to allocate,
Iterators nibble—oh, how great!
With shapes that stretch and tests anew,
This rabbit’s code just grew and grew.
🐇✨


📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2a26025 and 8baca8e.

📒 Files selected for processing (1)
  • lib/bindings/python/rust/llm/block_manager.rs (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • lib/bindings/python/rust/llm/block_manager.rs
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: Build and Test - vllm

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.


@kthui kthui force-pushed the jacky-kvbm-py-async branch from 70ad4cf to 0d03d94 Compare May 29, 2025 00:03

@kthui kthui force-pushed the jacky-kvbm-py-async branch from 0d03d94 to 1690afc Compare May 29, 2025 00:20
@kthui kthui force-pushed the jacky-kvbm-py-async branch from 1690afc to 42bfbb7 Compare May 29, 2025 00:22

@coderabbitai coderabbitai bot left a comment


Caution

Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 1

♻️ Duplicate comments (3)
lib/bindings/python/rust/llm/block_manager/dlpack.rs (1)

114-129: ⚠️ Potential issue

unwrap() panics and bare PyTuple still break compilation – please propagate errors and return Ok

The exact same problems flagged in the previous review are still present:

  1. unwrap() on every Python‐level call (eval, import, getattr, call1) will abort the interpreter when Python raises.
  2. The function promises PyResult<…> but returns a raw PyTuple, producing a mismatched types compilation error.
-    let dev_type_list = py.eval(c_str!("[('CPU', 1), ...)]"), None, None).unwrap();
+    // `eval` expects &str; pass a string slice and propagate errors
+    let dev_type_list = py.eval(
+        "[('CPU', 1), ('CUDA', 2), ('CPU_PINNED', 3), ('OPENCL', 4), \
+          ('VULKAN', 7), ('METAL', 8), ('VPI', 9), ('ROCM', 10)]",
+        None,
+        None,
+    )?;

-    let dev_type_enum = py
-        .import("enum")
-        .unwrap()
-        .getattr("Enum")
-        .unwrap()
-        .call1(("DLDeviceType", dev_type_list))
-        .unwrap();
+    let dev_type_enum = py
+        .import("enum")?
+        .getattr("Enum")?
+        .call1(("DLDeviceType", dev_type_list))?;

 ...
-    PyTuple::new(py, dev)
+    Ok(PyTuple::new(py, dev))

Fixing these two issues will unblock compilation and ensure Python exceptions are surfaced as PyErr rather than hard-panics.

lib/bindings/python/rust/llm/block_manager/block.rs (1)

47-49: 🛠️ Refactor suggestion

Iterator state inside Block is still racy – use a dedicated iterator object

Keeping py_itr_idx inside the shared Block means:

  • Nested loops (for l1 in block: … for l2 in block: …) corrupt each other.
  • Two async tasks iterating the same Block race on the counter.

The earlier review recommended moving the index into a separate BlockIter object; the code still uses the in-place counter. Please refactor as suggested:

-    // Python iterator state
-    py_itr_idx: usize,
+    // (removed)
 #[pymethods]
 impl Block {
-    fn __iter__(mut slf: PyRefMut<'_, Self>) -> PyResult<PyRefMut<'_, Self>> {
-        slf.py_itr_idx = 0;
-        Ok(slf)
-    }
-
-    fn __next__(&mut self) -> PyResult<layer::Layer> {
-        if self.py_itr_idx >= self.num_layers() {
-            return Err(pyo3::exceptions::PyStopIteration::new_err(()));
-        }
-        let layer = …;
-        self.py_itr_idx += 1;
-        Ok(layer)
-    }
+    fn __iter__(slf: PyRef<Self>) -> BlockIter {
+        BlockIter { block: slf.into(), idx: 0 }
+    }
 }
+
+#[pyclass]
+struct BlockIter {
+    block: Block,
+    idx: usize,
+}
+
+#[pymethods]
+impl BlockIter {
+    fn __iter__(slf: PyRef<Self>) -> PyRef<Self> { slf }
+    fn __next__(&mut self) -> PyResult<layer::Layer> {
+        if self.idx >= self.block.num_layers() {
+            return Err(pyo3::exceptions::PyStopIteration::new_err(()));
+        }
+        let layer = self.block.__getitem__(self.idx)?;
+        self.idx += 1;
+        Ok(layer)
+    }
+}

This isolates iteration state, eliminates data races, and matches Python’s iterator semantics.

Also applies to: 112-133
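The same hazard can be shown in pure Python (this is an illustration of the review suggestion, not the PR's Rust code): keeping the cursor on the container corrupts nested loops, while a dedicated iterator object keeps each loop independent.

```python
class SharedCursorBlock:
    """Anti-pattern: __iter__ resets a counter stored on the block itself."""
    def __init__(self, layers):
        self.layers = layers
        self._idx = 0

    def __iter__(self):
        self._idx = 0  # the inner loop of a nested iteration resets this too
        return self

    def __next__(self):
        if self._idx >= len(self.layers):
            raise StopIteration
        layer = self.layers[self._idx]
        self._idx += 1
        return layer

class BlockIter:
    """Dedicated iterator: one cursor per loop, as the suggested BlockIter."""
    def __init__(self, block):
        self._block = block
        self._idx = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self._idx >= len(self._block.layers):
            raise StopIteration
        layer = self._block.layers[self._idx]
        self._idx += 1
        return layer

class Block:
    def __init__(self, layers):
        self.layers = layers

    def __iter__(self):
        return BlockIter(self)

shared = SharedCursorBlock(["a", "b"])
broken = [(x, y) for x in shared for y in shared]  # inner loop clobbers cursor
block = Block(["a", "b"])
correct = [(x, y) for x in block for y in block]
print(broken)   # → [('a', 'a'), ('a', 'b')]  — half the pairs are lost
print(correct)  # → [('a', 'a'), ('a', 'b'), ('b', 'a'), ('b', 'b')]
```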

lib/bindings/python/tests/test_block_manager.py (1)

382-396: Custom main() still bypasses pytest – skip markers are ignored

Invoking the test functions manually runs GPU tests unconditionally, even on CPU-only machines, negating the @pytest.mark.skipif decorators flagged earlier. Please delete the main() block and rely on pytest:

-if __name__ == "__main__":
-    asyncio.run(main())
+# Intentionally left blank – run with `pytest -q`
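The point that skip decorators only take effect under a test runner can be demonstrated with the stdlib unittest module (used here as a stand-in for pytest, whose markers behave analogously; CUDA_AVAILABLE stands in for torch.cuda.is_available()):

```python
import unittest

CUDA_AVAILABLE = False  # stand-in for torch.cuda.is_available()

class BlockTests(unittest.TestCase):
    def test_cpu_block_access(self):
        # Host-only logic: no CUDA guard needed, runs on CPU-only CI too.
        self.assertEqual(len([0] * 4), 4)

    @unittest.skipUnless(CUDA_AVAILABLE, "CUDA not available")
    def test_device_copy(self):
        self.fail("would touch the GPU")

# Only a test runner honors the skip decorator; invoking the test
# function directly (as a hand-rolled main() would) ignores it entirely.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(BlockTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.testsRun, len(result.skipped))  # → 2 1
```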
🧹 Nitpick comments (4)
lib/bindings/python/rust/llm/block_manager/dlpack.rs (2)

43-56: Pinned memory is reported as generic CPU – potential interoperability mismatch

Device::CPU is returned for BlockType::Pinned, but many frameworks (PyTorch, CuPy) expose a distinct CUDA-host / pinned device (kDLCPUHost, DLDeviceType::kDLCUDAHost). Mapping everything to plain CPU may silently fall back to pageable memory and degrade performance.

If dlpack / dlpark supports CudaHost, return that instead; otherwise document the limitation clearly so downstream users are not surprised.
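The suggested mapping can be sketched in Python terms. The numeric codes come from the DLPack DLDeviceType enum (kDLCPU = 1, kDLCUDA = 2, kDLCUDAHost = 3); the block-type names are assumptions mirroring the Rust enum discussed above:

```python
# DLPack device-type codes (from the DLDeviceType enum).
DL_DEVICE_TYPE = {"CPU": 1, "CUDA": 2, "CUDA_HOST": 3}

def dlpack_device(block_type: str, device_id: int = 0) -> tuple:
    """Map a block's storage kind to a DLPack (device_type, device_id) pair.

    The block_type values here are hypothetical stand-ins for the
    Rust-side BlockType variants.
    """
    if block_type == "Pinned":
        # Pinned host memory should advertise kDLCUDAHost, not plain CPU,
        # so consumers like PyTorch/CuPy recognize it as page-locked.
        return (DL_DEVICE_TYPE["CUDA_HOST"], device_id)
    if block_type == "Device":
        return (DL_DEVICE_TYPE["CUDA"], device_id)
    return (DL_DEVICE_TYPE["CPU"], 0)

print(dlpack_device("Pinned"))  # → (3, 0)
```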


58-76: Enum mapping lacks exhaustiveness checks – future dtypes may trigger UB

The match is non-exhaustive with respect to the upstream enum but relies on the compiler to detect missing variants. Add a wildcard arm returning a sensible default (unreachable!() or DataType::F32) or derive #[non_exhaustive] on the upstream enum to make future additions compile-time errors.

             dynamo_llm::common::dtype::DType::I64 => DataType::I64,
+            // Prevent future UB if new variants are added
+            _ => unreachable!("Unhandled DType variant in DlPack mapping"),
lib/bindings/python/tests/test_block_manager.py (2)

108-136: CPU-only tests are unnecessarily skipped when CUDA is absent

test_cpu_block_access and test_block_list_iteration operate solely on host memory but are wrapped in @skipif(not torch.cuda.is_available()). This prevents valuable coverage on CPU-only CI jobs.

Remove the CUDA guard for purely host tests to keep them runnable everywhere.


38-55: new_block_manager() creates device blocks by default – tests that only need host blocks pay the cost

Most CPU-side tests don’t use GPU allocations, yet new_block_manager() always requests DEVICE_NUM_BLOCKS. Consider adding parameters or a second fixture that provisions only the resources required for the test, reducing startup time and VRAM usage.

🛑 Comments failed to post (1)
lib/bindings/python/rust/llm/block_manager/block.rs (1)

166-195: ⚠️ Potential issue

Unsafe pointer may dangle after releasing the lock

ptr is captured while the mutex is held but used after the lock is released when the DLPack capsule is consumed in Python. If another thread resizes / frees the block before the consumer finishes, the pointer may dangle.

Consider:

  • Keeping the mutex guard alive inside DlPackTensor (store the MutexGuard) so memory cannot be mutated while a DLPack view exists, or
  • Exposing a read-only view and document that the block must outlive all exported tensors.

Without this, users can trigger use-after-free through concurrency.

🤖 Prompt for AI Agents
In lib/bindings/python/rust/llm/block_manager/block.rs around lines 166 to 195,
the unsafe pointer `ptr` is obtained while holding a mutex lock but used after
the lock is released, risking a dangling pointer if the block is mutated
concurrently. To fix this, modify the code to keep the mutex guard alive for the
lifetime of the DlPackTensor by storing the MutexGuard inside it, ensuring the
memory cannot be mutated or freed while the DLPack view exists. Alternatively,
expose a read-only view and clearly document that the block must outlive all
exported tensors to prevent use-after-free errors.
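The guard-lifetime fix has a direct Python-level analogue: CPython refuses to resize a buffer while an exported memoryview exists, which is the same invariant "keep the lock alive while a DLPack view is out" would enforce on the Rust side.

```python
buf = bytearray(b"block-data")
view = memoryview(buf)  # comparable to a DLPack capsule over block memory

try:
    buf.extend(b"!")    # mutating the backing storage while the view lives
    blocked = False
except BufferError:     # CPython blocks the resize instead of dangling
    blocked = True

view.release()          # drop the view (the "guard"); mutation now succeeds
buf.extend(b"!")
print(blocked, bytes(buf))  # → True b'block-data!'
```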

@kthui kthui changed the title feat: Add async Python bindings to KVBM feat: KVBM async Python bindings and Layer class May 29, 2025
@kthui kthui requested a review from PeaBrane as a code owner May 29, 2025 17:08
@kthui kthui disabled auto-merge May 29, 2025 17:08
@kthui kthui enabled auto-merge (squash) May 29, 2025 17:15
@kthui kthui merged commit 7677f74 into main May 29, 2025
9 checks passed
@kthui kthui deleted the jacky-kvbm-py-async branch May 29, 2025 17:49
@coderabbitai coderabbitai bot mentioned this pull request Aug 2, 2025