feat: Support Dynamo KVBM with TRTLLM Disagg #3527
Walkthrough: This pull request adds KV cache connector support to the TensorRT-LLM engine integration.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks: ❌ Failed checks (2 warnings) | ✅ Passed checks (1 passed)
Actionable comments posted: 1
🧹 Nitpick comments (5)
lib/bindings/kvbm/src/block_manager/vllm/connector/trtllm_leader.rs (1)
328-338: Per-request `scheduled_tokens` wiring looks correct; consider tightening missing-key handling

Using `scheduler_output.num_scheduled_tokens.get(request_id).unwrap_or(&0)` for both new and cached requests aligns with the Python-side dict keyed by stringified `request_id`, and passing `scheduled_tokens` into `slot.apply_scheduler_output` matches the new slot API.

One thing to consider: silently defaulting to `0` if the key is missing will mask any mismatch between the Python and Rust schedulers. A lightweight guard like a `debug_assert!(scheduler_output.num_scheduled_tokens.contains_key(request_id))` (or at least a debug log) would make integration issues much easier to diagnose while keeping release behavior unchanged.

Also applies to: 364-374
components/src/dynamo/trtllm/main.py (1)
32-32: Verify `connector_module` path and `kv_connector_config` wiring semantics

The conditional construction of `KvCacheConnectorConfig` from `config.connector` and passing `kv_connector_config` through `arg_map` is a clean way to make connector support optional and fail fast on invalid values.

Two details worth double-checking:

1. Module path consistency: `build_kv_connector_config` uses `connector_module="dynamo.llm.trtllm_integration.connector"`, while the existing deterministic tests for aggregated mode reference `"kvbm.trtllm_integration.connector"` as the module. Please confirm that `dynamo.llm.trtllm_integration.connector` actually exposes `DynamoKVBMConnectorLeader` / `DynamoKVBMConnectorWorker`, or consider aligning this string with the tested path to avoid runtime import errors.
2. `None` semantics for `kv_connector_config`: when `config.connector` is unset, the key is still present in `arg_map` with a value of `None`. That should be acceptable if `kv_connector_config` is an optional field in the underlying LLM args, but if tensorrt-llm distinguishes between "key absent" and "key present with `None`", it may be slightly safer to omit the key entirely in that case (a sketch follows below).

Neither point blocks the overall design, but they're worth verifying against the tensorrt-llm API and the existing kvbm integration tests.

Also applies to: 106-117, 192-193, 210-210
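On the second point, a minimal sketch of the omit-when-unset approach, assuming `arg_map` is a plain dict; `build_kv_connector_config` is the PR's helper, and the surrounding structure is illustrative, not the actual code:

```python
def build_arg_map(config):
    """Illustrative fragment; the real arg_map carries many more engine options."""
    arg_map = {}
    # Only attach kv_connector_config when a connector was actually requested,
    # so tensorrt-llm never sees an explicit None for the key.
    if config.connector:
        arg_map["kv_connector_config"] = build_kv_connector_config(config)
    return arg_map
```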
tests/kvbm_integration/test_determinism_disagg.py (3)
173-181: Use of fixed `/tmp/...yaml` paths is fine for tests but can be made more robust

The prefill/decode configs are written to fixed filenames under `/tmp`:

```python
prefill_config_path = os.environ.get(..., "/tmp/kvbm_llm_api_prefill_config.yaml")
decode_config_path = os.environ.get(..., "/tmp/kvbm_llm_api_decode_config.yaml")
...
yaml.dump(..., default_flow_style=False, sort_keys=False)
```

For CI and single-test runs this is usually acceptable, but it does open you up to path collisions if multiple test processes run concurrently on the same host.

If this becomes an issue, you could switch to `tempfile.NamedTemporaryFile` or include something like the PID or port in the default filenames; otherwise, this is probably fine as-is for an internal integration test.

Also applies to: 244-247
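If collisions ever become a problem, a sketch of the `tempfile`-based alternative might look like this; the helper name is illustrative, and the env-var fallback behavior is kept from the test:

```python
import os
import tempfile

def _unique_config_path(env_var: str, prefix: str) -> str:
    """Illustrative helper: prefer the env var override, else a unique temp file."""
    path = os.environ.get(env_var)
    if path:
        return path
    # delete=False keeps the file on disk so the launched server can read it;
    # the unique name avoids collisions between concurrent test processes.
    with tempfile.NamedTemporaryFile(
        mode="w", prefix=prefix, suffix=".yaml", delete=False
    ) as f:
        return f.name

prefill_config_path = _unique_config_path(
    "KVBM_TRTLLM_LLMAPI_PREFILL_CONFIG_PATH", "kvbm_llm_api_prefill_config_"
)
```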
424-456: Improved health logging is helpful; consider tightening control flow slightly

The extra logging around the health check and model endpoint status, plus the explicit `requests.exceptions.RequestException` handling, should make TRTLLM startup issues much easier to diagnose.

If you want to mirror static-analysis expectations, you could move `return response.status_code == 200` into an `else` block under the second `if response.status_code != 200` for slightly clearer control flow, but functionally the current structure is sound.
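For illustration, the suggested shape might look roughly like this; the endpoint, timeout, and logging details are assumptions, since only the status-code handling is visible in the diff:

```python
import requests

def model_endpoint_ready(url: str) -> bool:
    """Illustrative sketch of the suggested control flow; not the actual test code."""
    try:
        response = requests.get(url, timeout=5)
    except requests.exceptions.RequestException as exc:
        print(f"Health check request failed: {exc}")
        return False
    if response.status_code != 200:
        print(f"Model endpoint not ready: HTTP {response.status_code}")
        return False
    else:
        # Happy-path return lives in the else block, per the static-analysis hint.
        return response.status_code == 200
```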
507-512: Server-type selection now supports TRTLLM; skip message could be updated

Falling back to `ServerType.trtllm` when `vllm` is absent but `tensorrt_llm` is present is a sensible way to reuse the same test harness.

The only minor nit is the skip message:

```python
else:
    pytest.skip("vllm module is not available in the current environment.")
```

If neither backend is present, this message is slightly misleading now that TRTLLM is also a valid option; you may want to reword it to mention both backends, or say something like "No supported LLM backend (vllm or tensorrt_llm) is available".
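A sketch of the reworded selection; probing via `importlib` and the `ServerType.vllm` member name are assumptions, since the real fixture's detection logic is not shown here:

```python
import importlib.util
import pytest

from tests.kvbm_integration.common import ServerType  # import path assumed

def pick_server_type():
    """Illustrative backend selection with a skip message covering both backends."""
    if importlib.util.find_spec("vllm") is not None:
        return ServerType.vllm
    elif importlib.util.find_spec("tensorrt_llm") is not None:
        return ServerType.trtllm
    else:
        pytest.skip(
            "No supported LLM backend (vllm or tensorrt_llm) is available "
            "in the current environment."
        )
```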
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- components/src/dynamo/trtllm/main.py (4 hunks)
- components/src/dynamo/trtllm/utils/trtllm_utils.py (3 hunks)
- lib/bindings/kvbm/python/kvbm/trtllm_integration/connector/kvbm_connector_leader.py (1 hunks)
- lib/bindings/kvbm/src/block_manager/vllm/connector/leader/slot.rs (0 hunks)
- lib/bindings/kvbm/src/block_manager/vllm/connector/trtllm_leader.rs (2 hunks)
- tests/kvbm_integration/test_determinism_disagg.py (6 hunks)
💤 Files with no reviewable changes (1)
- lib/bindings/kvbm/src/block_manager/vllm/connector/leader/slot.rs
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: oandreeva-nv
Repo: ai-dynamo/dynamo PR: 2989
File: lib/llm/src/block_manager/distributed/transfer.rs:6-6
Timestamp: 2025-09-18T21:47:44.143Z
Learning: For PR ai-dynamo/dynamo#2989, the ConnectorTransferBatcher architectural issues will be addressed in a follow-up PR by removing the duplicate batching logic and integrating distributed transfers with the existing TransferBatcher + LocalTransferManager pipeline, rather than adding bounded concurrency primitives like Semaphore.
🧬 Code graph analysis (4)
tests/kvbm_integration/test_determinism_disagg.py (2)
- tests/kvbm_integration/common.py (1)
  - ServerType (133-135)
- tests/kvbm_integration/test_determinism_agg.py (1)
  - _set_up_trtllm_config (122-164)
lib/bindings/kvbm/src/block_manager/vllm/connector/trtllm_leader.rs (1)
- lib/bindings/kvbm/src/block_manager/vllm/connector/leader/slot.rs (2)
  - request_id (95-95)
  - request_id (460-462)
lib/bindings/kvbm/python/kvbm/trtllm_integration/connector/kvbm_connector_leader.py (2)
- lib/bindings/kvbm/src/block_manager/vllm/connector.rs (1)
  - add_num_scheduled_tokens (79-82)
- lib/bindings/kvbm/src/block_manager/vllm/connector/leader/slot.rs (2)
  - request_id (95-95)
  - request_id (460-462)
components/src/dynamo/trtllm/main.py (2)
components/src/dynamo/trtllm/utils/trtllm_utils.py (1)
Config(29-97)components/src/dynamo/vllm/args.py (1)
Config(30-78)
🪛 Ruff (0.14.5)
tests/kvbm_integration/test_determinism_disagg.py
175-175: Probable insecure usage of temporary file or directory: "/tmp/kvbm_llm_api_prefill_config.yaml"
(S108)
180-180: Probable insecure usage of temporary file or directory: "/tmp/kvbm_llm_api_decode_config.yaml"
(S108)
228-235: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
237-242: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
452-452: Consider moving this statement to an else block
(TRY300)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
- GitHub Check: operator (arm64)
- GitHub Check: operator (amd64)
- GitHub Check: vllm (arm64)
- GitHub Check: trtllm (arm64)
- GitHub Check: trtllm (amd64)
- GitHub Check: sglang (arm64)
- GitHub Check: Mirror Repository to GitLab
- GitHub Check: clippy (lib/bindings/python)
- GitHub Check: tests (launch/dynamo-run)
- GitHub Check: clippy (launch/dynamo-run)
- GitHub Check: tests (lib/bindings/python)
- GitHub Check: tests (lib/runtime/examples)
- GitHub Check: tests (.)
- GitHub Check: clippy (.)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (2)
lib/bindings/kvbm/python/kvbm/trtllm_integration/connector/kvbm_connector_leader.py (1)
76-82: `num_scheduled_tokens` export is consistent with the Rust API

The dict comprehension that merges `scheduler_output.new_requests` and `scheduler_output.cached_requests` into a `{request_id: num_scheduled_tokens}` mapping is exactly what `RustSchedulerOutput.add_num_scheduled_tokens` expects, and the key format (`str(req.request_id)`) matches what the Rust side uses for lookups.

Just make sure that every entry in both `new_requests` and `cached_requests` always has `num_scheduled_tokens` populated; otherwise the Rust side will quietly treat the request as having `0` scheduled tokens due to the `unwrap_or(&0)` default.
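For reference, a sketch of the shape being described; the field and class names are taken from the review text above, and the assertion is an illustrative guard, not part of the PR:

```python
def export_num_scheduled_tokens(scheduler_output, rust_scheduler_output):
    """Illustrative: mirrors the comprehension described above, plus a guard."""
    # Merge new and cached requests into {request_id: num_scheduled_tokens},
    # keyed by the stringified request_id that the Rust lookup expects.
    num_scheduled_tokens = {
        str(req.request_id): req.num_scheduled_tokens
        for req in (*scheduler_output.new_requests, *scheduler_output.cached_requests)
    }
    # Illustrative guard: fail loudly here instead of letting the Rust side's
    # unwrap_or(&0) silently treat a missing value as zero scheduled tokens.
    assert all(v is not None for v in num_scheduled_tokens.values())
    rust_scheduler_output.add_num_scheduled_tokens(num_scheduled_tokens)
```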
components/src/dynamo/trtllm/utils/trtllm_utils.py (1)

63-63: Connector option plumbing is straightforward and consistent

Adding `connector` to `Config`, exposing it via `--connector` (with explicit choices), and wiring `config.connector = args.connector` keeps the configuration path simple and type-safe. This should compose cleanly with the new `build_kv_connector_config` helper in `main.py` and is easy to extend if more connectors are added later.

Also applies to: 279-285, 368-368
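A minimal sketch of that pattern; only the `--connector` flag name and the `kvbm` choice come from the PR, while the stand-in `Config` class is illustrative:

```python
import argparse

class Config:
    """Stand-in for the real Config in trtllm_utils (which has many more fields)."""

    connector = None

parser = argparse.ArgumentParser()
# Explicit choices make an invalid --connector value fail at parse time
# instead of surfacing later as a confusing runtime import error.
parser.add_argument("--connector", choices=["kvbm"], default=None)

config = Config()
config.connector = parser.parse_args(["--connector", "kvbm"]).connector
print(config.connector)  # -> "kvbm"
```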
```python
def _set_up_trtllm_config(self, gpu_cache_blocks):
    # Mostly the same parameters here as in the
    prefill_config_path = os.environ.get(
        "KVBM_TRTLLM_LLMAPI_PREFILL_CONFIG_PATH",
        "/tmp/kvbm_llm_api_prefill_config.yaml",
    )

    decode_config_path = os.environ.get(
        "KVBM_TRTLLM_LLMAPI_DECODE_CONFIG_PATH",
        "/tmp/kvbm_llm_api_decode_config.yaml",
    )

    llm_api_config: Dict[str, Any] = {}
    llm_api_config["kv_cache_config"] = {
        "enable_partial_reuse": False,
        "free_gpu_memory_fraction": 0.10,
        "tokens_per_block": 16,
    }

    # GPU blocks override
    if gpu_cache_blocks is not None:
        del llm_api_config["kv_cache_config"]["free_gpu_memory_fraction"]
        llm_api_config["kv_cache_config"]["max_tokens"] = (
            int(gpu_cache_blocks) * 32
        )  # TRTLLM defaults 32 tokens per block

    prefill_config = deepcopy(llm_api_config)
    prefill_config["disable_overlap_scheduler"] = True
    prefill_config["cache_transceiver_config"] = {
        "backend": "DEFAULT",
        "max_tokens_in_buffer": 16384,
    }
    prefill_config["cuda_graph_config"] = None

    decode_config = deepcopy(llm_api_config)
    decode_config["disable_overlap_scheduler"] = False
    decode_config["cache_transceiver_config"] = {
        "backend": "DEFAULT",
        "max_tokens_in_buffer": 65536,
    }

    model = os.environ.get(
        "KVBM_MODEL_ID", "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
    )

    cmd_root = [
        "python3",
        "-m",
        "dynamo.trtllm",
        "--model",
        model,
        "--kv-block-size",
        "16",
        "--max-num-tokens",
        "8000",
    ]

    self.prefiller_cmd = cmd_root + [
        "--extra-engine-args",
        prefill_config_path,
        "--disaggregation-mode",
        "prefill",
        "--connector",
        "kvbm",
    ]

    self.decoder_cmd = cmd_root + [
        "--extra-engine-args",
        decode_config_path,
        "--disaggregation-mode",
        "decode",
    ]

    with open(prefill_config_path, "w") as f:
        yaml.dump(prefill_config, f, default_flow_style=False, sort_keys=False)
    with open(decode_config_path, "w") as f:
        yaml.dump(decode_config, f, default_flow_style=False, sort_keys=False)
```
🧩 Analysis chain
GPU cache blocks override likely doubles effective blocks due to inconsistent multiplier
In `_set_up_trtllm_config`, you configure:

```python
llm_api_config["kv_cache_config"] = {
    "enable_partial_reuse": False,
    "free_gpu_memory_fraction": 0.10,
    "tokens_per_block": 16,
}
...
if gpu_cache_blocks is not None:
    del llm_api_config["kv_cache_config"]["free_gpu_memory_fraction"]
    llm_api_config["kv_cache_config"]["max_tokens"] = int(gpu_cache_blocks) * 32  # TRTLLM defaults 32 tokens per block
```

With `tokens_per_block` explicitly set to 16, this `* 32` multiplier makes the effective number of blocks `max_tokens / tokens_per_block = 2 * gpu_cache_blocks`, which doesn't match the name of the override or the comment (that comment matches the aggregated-mode helper, where `tokens_per_block` stayed at the TRTLLM default of 32).
To keep the override semantics intuitive, consider deriving max_tokens from the configured tokens_per_block, e.g.:
```diff
- llm_api_config["kv_cache_config"]["max_tokens"] = (
-     int(gpu_cache_blocks) * 32
- )  # TRTLLM defaults 32 tokens per block
+ tokens_per_block = llm_api_config["kv_cache_config"]["tokens_per_block"]
+ llm_api_config["kv_cache_config"]["max_tokens"] = int(gpu_cache_blocks) * tokens_per_block
```

This keeps the block count aligned with `gpu_cache_blocks` regardless of what `tokens_per_block` you choose for the test.
I'll verify the review comment by examining the actual code and its context.
🏁 Script executed:
```bash
#!/bin/bash
# Locate and examine the test file to verify the GPU cache blocks override logic
find . -name "test_determinism_disagg.py" -type f | head -1 | xargs -I {} cat {}
```

Length of output: 20913
🏁 Script executed:
```bash
#!/bin/bash
# Search for other usages of tokens_per_block to understand the intended semantics
rg -n "tokens_per_block" --type py -C 3
```

Length of output: 2376
🏁 Script executed:
```bash
#!/bin/bash
# Search for other TRTLLM config patterns or gpu_cache_blocks references
rg -n "gpu_cache_blocks|max_tokens" --type py -C 2
```

Length of output: 68469
Fix `max_tokens` multiplier to use the configured `tokens_per_block` instead of hard-coded 32

In `_set_up_trtllm_config` (lines 171–195), the code explicitly sets `tokens_per_block` to 16 but then multiplies `gpu_cache_blocks` by 32 when computing `max_tokens`. This creates an effective block count of 2× the parameter value:

```python
llm_api_config["kv_cache_config"]["tokens_per_block"] = 16
...
llm_api_config["kv_cache_config"]["max_tokens"] = int(gpu_cache_blocks) * 32
# → effective blocks = (gpu_cache_blocks * 32) / 16 = 2 * gpu_cache_blocks
```

The comment references "TRTLLM defaults 32 tokens per block," but that default is overridden here. Use the configured value instead:
```diff
-llm_api_config["kv_cache_config"]["max_tokens"] = (
-    int(gpu_cache_blocks) * 32
-) # TRTLLM defaults 32 tokens per block
+tokens_per_block = llm_api_config["kv_cache_config"]["tokens_per_block"]
+llm_api_config["kv_cache_config"]["max_tokens"] = int(gpu_cache_blocks) * tokens_per_block
```

This ensures the override semantics remain intuitive and the block count stays aligned with the `gpu_cache_blocks` parameter.
🤖 Prompt for AI Agents
In tests/kvbm_integration/test_determinism_disagg.py around lines 171-247, the
code sets kv_cache_config["tokens_per_block"]=16 but then computes max_tokens
using a hard-coded 32 multiplier, effectively doubling the intended blocks;
change the computation to use the configured tokens_per_block value (read
tokens_per_block from kv_cache_config or a local variable) when calculating
max_tokens so max_tokens = int(gpu_cache_blocks) * tokens_per_block (and remove
the misleading comment about TRTLLM default or update it to reflect using the
configured value).
please update https://github.com/ai-dynamo/dynamo/blob/main/docs/kvbm/trtllm-setup.md to reflect disagg is now supported
```rust
        num_scheduled_tokens: usize,
    ) -> Result<(), SlotError>;

    // TRT-LLM does not include scheduled tokens in the scheduler output.
```
so trtllm 1.2.0 includes scheduled tokens now?
Yes, 1.2.0rc2 supports it
we can run connector with trtllm starting 1.1.0rc5, if we remove this part, what happens with 1.1.0rc5?
1.1.0rc5 would break with this MR. We could (in theory) detect the TRTLLM version and fall back to the non-scheduled-tokens implementation, but that could be super ugly.
But the scheduled-tokens output in rc2 (as well as the scheduled-tokens handling on the KVBM-side) is required for Dynamo + kvbm to work.
we'd need to handle it in some way, even if it's simple detect trtllm version -> fail if incompatible
we'd need to also adjust docs, where 1.1.0rc5 is mentioned as supported
Updated docs, and added a little check that the num_scheduled_tokens field exists; throws an error if it doesn't.
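For context, the guard described here might look roughly like this on the Python side; the attribute location and message wording are assumptions, so see the PR for the actual check:

```python
def require_scheduled_tokens_support(scheduler_output) -> None:
    """Illustrative: fail fast on TRT-LLM builds that predate num_scheduled_tokens."""
    for req in scheduler_output.new_requests:
        if not hasattr(req, "num_scheduled_tokens"):
            raise RuntimeError(
                "KVBM requires TensorRT-LLM >= 1.2.0rc2: scheduler output "
                "entries must expose num_scheduled_tokens (older releases "
                "such as 1.1.0rc5 are not supported)."
            )
```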
@oandreeva-nv Can you please re-review?
Summary by CodeRabbit
New Features
- `--connector` command-line argument for connector selection

Tests