-
Notifications
You must be signed in to change notification settings - Fork 738
feat: add frontend based prefill request routing for sglang #4635
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Sean Choi <[email protected]>
WalkthroughThis PR introduces disaggregated serving with bootstrap optimization across a distributed LLM inference system. Changes add concurrent endpoint startup with readiness gates, bootstrap endpoint configuration, and conditional prefill-decode decoupling with background task execution to enable optimized worker placement. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Areas requiring extra attention:
Poem
Pre-merge checks✅ Passed checks (3 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (4)
components/src/dynamo/sglang/request_handlers/llm/prefill_handler.py (1)
87-96: Logging at DEBUG level may miss important bootstrap_room source info.When router-provided
bootstrap_roomis used (line 92), the log is at DEBUG level. If this is an important diagnostic for disaggregated serving issues, consider using INFO level for the router-provided case to distinguish it from locally-generated rooms.if isinstance(extra_args, dict): bootstrap_room = extra_args.get("bootstrap_room") - logging.debug(f"Using router-provided bootstrap_room: {bootstrap_room}") + if bootstrap_room is not None: + logging.info(f"Using router-provided bootstrap_room: {bootstrap_room}") if bootstrap_room is None: bootstrap_room = self._generate_bootstrap_room() logging.debug(f"Generated bootstrap_room locally: {bootstrap_room}")lib/bindings/python/rust/llm/local_model.rs (1)
120-129: Consider early return when both arguments are None.The setter always creates a
DisaggregatedEndpointeven when bothbootstrap_hostandbootstrap_portareNone, resulting inSome(DisaggregatedEndpoint { bootstrap_host: None, bootstrap_port: None }). This may be intentional to distinguish "explicitly set to empty" from "never set", but if the intent is to clear the endpoint when both are None, consider:fn set_disaggregated_endpoint( &mut self, bootstrap_host: Option<String>, bootstrap_port: Option<u16>, ) { + if bootstrap_host.is_none() && bootstrap_port.is_none() { + self.inner.disaggregated_endpoint = None; + return; + } self.inner.disaggregated_endpoint = Some(RsDisaggregatedEndpoint { bootstrap_host, bootstrap_port, }); }components/src/dynamo/sglang/request_handlers/llm/decode_handler.py (1)
166-174: Potential KeyError ifdisaggregated_paramsis present but missing expected keys.The code checks for
disaggregated_paramsexistence but doesn't validate that it contains the required keys (bootstrap_host,bootstrap_port,bootstrap_room). If a malformed response is received, theengine.async_generatecall at lines 180-182 will raise aKeyError.Consider adding validation or using
.get()with defaults:async for info in prefill_stream: data = info.data() if data and "disaggregated_params" in data: - bootstrap_info = data["disaggregated_params"] + params = data["disaggregated_params"] + if all(k in params for k in ("bootstrap_host", "bootstrap_port", "bootstrap_room")): + bootstrap_info = params breaklib/llm/src/kv_router/scheduler.rs (1)
168-173: Consider reducing log verbosity.Logging at
INFOlevel every time a runtime config is found during monitoring updates (line 170-171) may produce excessive logs in production when configs change frequently. Consider usingDEBUGorTRACElevel instead.for worker_id in &new_instance_ids { let config = new_configs.get(worker_id).cloned(); if config.is_some() { - tracing::info!("Runtime config found for worker_id: {}", worker_id); + tracing::debug!("Runtime config found for worker_id: {}", worker_id); } new_workers_with_configs.insert(*worker_id, config); }
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
components/src/dynamo/sglang/main.py(1 hunks)components/src/dynamo/sglang/register.py(2 hunks)components/src/dynamo/sglang/request_handlers/llm/decode_handler.py(1 hunks)components/src/dynamo/sglang/request_handlers/llm/prefill_handler.py(1 hunks)lib/bindings/python/rust/llm/local_model.rs(2 hunks)lib/llm/src/kv_router.rs(1 hunks)lib/llm/src/kv_router/prefill_router.rs(5 hunks)lib/llm/src/kv_router/scheduler.rs(4 hunks)lib/llm/src/local_model/runtime_config.rs(3 hunks)lib/llm/src/protocols/common/preprocessor.rs(2 hunks)
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-06-05T01:02:15.318Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.
Applied to files:
lib/llm/src/kv_router.rs
📚 Learning: 2025-05-30T06:38:09.630Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.
Applied to files:
lib/llm/src/kv_router.rs
📚 Learning: 2025-09-02T16:46:54.015Z
Learnt from: GuanLuo
Repo: ai-dynamo/dynamo PR: 2714
File: lib/llm/src/discovery/model_entry.rs:38-42
Timestamp: 2025-09-02T16:46:54.015Z
Learning: In lib/llm/src/discovery/model_entry.rs, GuanLuo prefers not to add serde defaults for model_type and model_input fields to keep the specification explicit and avoid user errors, relying on atomic deployment strategy to avoid backward compatibility issues.
Applied to files:
lib/llm/src/local_model/runtime_config.rslib/bindings/python/rust/llm/local_model.rslib/llm/src/protocols/common/preprocessor.rs
📚 Learning: 2025-09-17T01:00:50.937Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3077
File: lib/llm/src/kv_router/subscriber.rs:334-336
Timestamp: 2025-09-17T01:00:50.937Z
Learning: PeaBrane identified that reordering tokio::select! arms in the indexer (moving dump_rx.recv() to be after event_rx.recv()) creates a natural barrier that ensures RouterEvents are always processed before dump requests, solving the ack-before-commit race condition. This leverages the existing biased directive and requires minimal code changes, aligning with their preference for contained solutions.
Applied to files:
lib/llm/src/kv_router/prefill_router.rs
📚 Learning: 2025-08-21T17:23:02.836Z
Learnt from: michaelfeil
Repo: ai-dynamo/dynamo PR: 2591
File: lib/bindings/python/rust/http.rs:0-0
Timestamp: 2025-08-21T17:23:02.836Z
Learning: In lib/bindings/python/rust/http.rs, the enable_endpoint method uses EndpointType::all() to dynamically support all available endpoint types with case-insensitive matching, which is more maintainable than hardcoded match statements for endpoint type mappings.
Applied to files:
lib/bindings/python/rust/llm/local_model.rs
📚 Learning: 2025-06-24T20:59:35.725Z
Learnt from: ishandhanani
Repo: ai-dynamo/dynamo PR: 1626
File: lib/llm/src/preprocessor.rs:238-239
Timestamp: 2025-06-24T20:59:35.725Z
Learning: In lib/llm/src/preprocessor.rs, the `sampling_options` call in the `preprocess_request` method is placed in the common section after the match statement on `request.prompt_input_type()`, meaning it applies to both `PromptInput::Tokens` and `PromptInput::Text` request types.
Applied to files:
lib/llm/src/protocols/common/preprocessor.rs
🧬 Code graph analysis (8)
lib/llm/src/kv_router.rs (1)
lib/llm/src/kv_router/scheduler.rs (1)
get_disaggregated_endpoint(355-364)
lib/bindings/python/rust/llm/local_model.rs (2)
lib/llm/src/entrypoint.rs (1)
local_model(69-76)lib/llm/src/local_model.rs (2)
runtime_config(173-176)runtime_config(394-396)
lib/llm/src/protocols/common/preprocessor.rs (1)
lib/llm/src/preprocessor.rs (1)
builder(202-247)
components/src/dynamo/sglang/register.py (1)
lib/bindings/python/rust/llm/local_model.rs (3)
bootstrap_port(140-145)bootstrap_host(132-137)set_disaggregated_endpoint(120-129)
lib/llm/src/kv_router/scheduler.rs (4)
lib/llm/src/local_model.rs (2)
runtime_config(173-176)runtime_config(394-396)lib/bindings/python/src/dynamo/_core.pyi (1)
ModelRuntimeConfig(426-447)lib/llm/src/kv_router.rs (1)
get_disaggregated_endpoint(484-489)lib/llm/src/block_manager/block.rs (1)
worker_id(1155-1157)
components/src/dynamo/sglang/request_handlers/llm/decode_handler.py (2)
components/src/dynamo/sglang/request_handlers/llm/prefill_handler.py (1)
generate(54-126)components/src/dynamo/sglang/protocol.py (1)
DisaggPreprocessedRequest(63-66)
components/src/dynamo/sglang/main.py (2)
lib/bindings/python/src/dynamo/_core.pyi (4)
serve_endpoint(127-139)generate(1372-1411)ModelInput(1003-1005)ModelType(1007-1014)components/src/dynamo/sglang/register.py (1)
register_llm_with_readiness_gate(181-221)
components/src/dynamo/sglang/request_handlers/llm/prefill_handler.py (2)
components/src/dynamo/sglang/request_handlers/handler_base.py (2)
_generate_bootstrap_room(90-96)_get_input_param(70-87)lib/bindings/python/rust/llm/local_model.rs (2)
bootstrap_host(132-137)bootstrap_port(140-145)
🪛 Ruff (0.14.5)
components/src/dynamo/sglang/register.py
99-99: Consider moving this statement to an else block
(TRY300)
100-100: Do not catch blind exception: Exception
(BLE001)
components/src/dynamo/sglang/request_handlers/llm/decode_handler.py
174-174: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: trtllm (arm64)
- GitHub Check: sglang (arm64)
- GitHub Check: operator (amd64)
- GitHub Check: vllm (amd64)
- GitHub Check: clippy (lib/bindings/python)
- GitHub Check: clippy (.)
- GitHub Check: clippy (launch/dynamo-run)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (22)
components/src/dynamo/sglang/register.py (2)
68-102: Looks good, but consider narrowing exception handling.The helper function correctly extracts bootstrap information from the SGLang engine. A few observations:
The broad
except Exception(Ruff BLE001) is acceptable here since this is a best-effort config extraction, but consider catching more specific exceptions if known failure modes emerge.Potential DNS resolution issue:
socket.gethostbyname()on line 93-94 can raisesocket.gaierrorif the hostname can't be resolved. This is currently caught by the broad exception handler, which is fine.Per Ruff TRY300, line 99 could move to an
elseblock, but the current structure is readable.
123-130: LGTM - Bootstrap endpoint registration integrates well.The conditional registration of the disaggregated endpoint is clean. The
[OPTIMIZATION]log prefix helps identify this as part of the new prefill routing optimization during debugging.lib/llm/src/protocols/common/preprocessor.rs (2)
102-105: LGTM - Field properly integrated intoPreprocessedRequest.The
bootstrap_infofield uses appropriate attributes:
#[builder(default)]for optional builder usage#[serde(default, skip_serializing_if = "Option::is_none")]for clean serialization
13-23: TheDefaultderive concern is not an issue in practice—BootstrapInfo::default()is never called directly in the codebase. All instances ofBootstrapInfoare explicitly constructed viabuild_bootstrap_info()inprefill_router.rs(lines 225–228) with all fields explicitly set. TheDefaultderive is safe but technically unnecessary given the usage pattern.lib/llm/src/kv_router/prefill_router.rs (4)
180-183: LGTM - Simple random room ID generation.Using
rand::rng().random()for u64 provides sufficient uniqueness for bootstrap room IDs.
185-231: Verify handling whenfind_best_matchfails or endpoint lookup returnsNone.The function gracefully returns
Noneon failures, triggering the fallback path. However, the_dp_rank(line 204) is extracted but unused. If data parallelism is relevant for bootstrap info, consider whether this should be included.Also,
chooser.get_disaggregated_endpoint(worker_id)could returnNoneif the worker hasn't published its endpoint yet (race condition during startup). The fallback to the original path handles this correctly.
384-413: Optimization path correctly injectsbootstrap_roomintoextra_argsand spawns background prefill.The logic is sound:
- Tries
build_bootstrap_infooptimization first- If successful, injects
bootstrap_roomintoextra_argsfor the prefill worker- Forces routing to specific worker via
backend_instance_id- Spawns prefill as background task
- Returns
(None, Some(worker_id), Some(bootstrap_info))to proceed to decode immediatelyOne observation: The
extra_argsmutation (lines 392-395) modifies a clonedprefill_req, which is correct.
426-442: LGTM - Decode request correctly updated with prefill result and bootstrap info.The conditional update of
prefill_result(only in original path) and injection ofbootstrap_info(only in optimization path) is correct and aligns with the two-path design.components/src/dynamo/sglang/request_handlers/llm/prefill_handler.py (3)
68-85: Request format handling supports both legacy and new formats.The dual-format support (checking for
"request"key) ensures backward compatibility. The sampling_params extraction fromsampling_optionsandstop_conditionscorrectly filters outNonevalues.
104-111: Output format aligns withPrefillRouterexpectations.The
disaggregated_paramsstructure matches whatcall_prefillinprefill_router.rsexpects to extract. The placeholder fields (token_ids: [],text: None,finish_reason: None) are appropriate for this bootstrap info response.
113-122: The code properly supports both request format paths with appropriate structure compatibility.The review comment's concern about
inner_requeststructure compatibility is valid to verify, but the code is actually sound. Here's what I confirmed:
Two request types are supported:
PreprocessedRequest(whenskip_tokenizer_init=True): containstoken_idsfieldChatCompletionRequest(whenskip_tokenizer_init=False): containsmessagesfield per OpenAI standard
skip_tokenizer_initis configured globally at startup (args.py lines 483-492), ensuring requests match the tokenizer mode.Both request format paths in prefill_handler.py properly pass
inner_request:
- DisaggPreprocessedRequest format:
inner_request = request["request"](unwraps the wrapper)- Direct request format:
inner_request = request
_get_input_paramcorrectly routes both cases (handler_base.py lines 79-87):
- When
skip_tokenizer_init=True: accessesrequest["token_ids"]- When
skip_tokenizer_init=False: accessesrequest["messages"]The inner_request structure is guaranteed to match
_get_input_paramexpectations because the tokenizer mode is set at startup and enforced throughout the request lifecycle.lib/llm/src/kv_router.rs (1)
482-489: LGTM - Clean delegation to scheduler for endpoint lookup.The method correctly delegates to
self.scheduler.get_disaggregated_endpoint(worker_id)and the docstring clearly explains its purpose for prefill worker bootstrap lookup.lib/bindings/python/rust/llm/local_model.rs (1)
131-145: LGTM!The getters correctly use
and_thenfor safe Option chaining to access nested optional fields.components/src/dynamo/sglang/request_handlers/llm/decode_handler.py (2)
119-129: LGTM - Pre-computed bootstrap_info path.The optimization correctly checks for pre-computed
bootstrap_infoin the request first, avoiding the network round-trip to prefill workers when the router has already provided this information.
135-164: Based on my verification, I can now provide the final assessment:instance_ids() returns an empty list (not None) when no instances are available; the fallback path executes correctly as intended.
Analysis confirms:
instance_ids()method returnslist[int](verified intests/router/common.py:216)- Empty list is falsy in Python, so the condition evaluates to
Falsewhen no instances exist- The fallback path at line 156 (
await self.prefill_client.generate(...)) executes as intended- No issue exists; the code behaves correctly
lib/llm/src/local_model/runtime_config.rs (2)
10-17: LGTM - Well-structured DisaggregatedEndpoint type.The struct correctly uses optional fields with appropriate serde attributes for serialization. Deriving
Defaultenables convenient initialization.
48-51: LGTM - Clean integration into ModelRuntimeConfig.The new
disaggregated_endpointfield is properly documented, uses consistent serde attributes with other optional fields, and is correctly initialized toNonein theDefaultimplementation.Also applies to: 69-69
components/src/dynamo/sglang/main.py (2)
234-253: LGTM - Concurrent startup pattern.Using
asyncio.gatherto start the endpoint and register concurrently is a good pattern. The registration publishes the runtime_config with bootstrap endpoint information, enabling the prefill routing optimization.
231-253: Theready_eventis created but not used for request gating—request queuing is not yet implemented.The
ready_eventis passed toregister_llm_with_readiness_gate, which sets it after registration succeeds (line 219 in register.py). However,serve_endpointdoes not receive theready_eventparameter and therefore cannot gate or queue requests based on it. The TODO comment at line 161 confirms this: "Requests queue until ready_event is set (TODO: Part of new PR)".Requests arriving before registration completes may not receive the bootstrap endpoint information needed for optimization. Either implement request queuing in
serve_endpointby passing and awaitingready_event, or document this limitation explicitly.lib/llm/src/kv_router/scheduler.rs (3)
77-87: LGTM - Correct single-send semantics with take().Changing
respondto&mut selfand usingtake()properly enforces that the response channel can only be used once. The error handling for double-respond is appropriate.
112-123: LGTM - Proper initialization of workers_with_configs.The initialization correctly merges instance IDs with their runtime configs, handling the case where a worker may not have a config yet.
355-364: LGTM - Clean endpoint lookup implementation.The
get_disaggregated_endpointmethod correctly uses async read lock and proper Option chaining to retrieve the disaggregated endpoint for a given worker.
|
@sshchoi thanks! Can you take a look at the rabbit and CIs? I'll give it a look in the meantime Can you also share some e2e test / benchmarking results on this in the PR desc if you have some |
components/src/dynamo/sglang/request_handlers/llm/decode_handler.py
Outdated
Show resolved
Hide resolved
components/src/dynamo/sglang/request_handlers/llm/prefill_handler.py
Outdated
Show resolved
Hide resolved
|
Most of my comments are nits. Before this is merged in we need to run the SA workloads. This specifically means running
I will share info on how to do this via slack |
Signed-off-by: Sean Choi <[email protected]>
Signed-off-by: Sean Choi <[email protected]>
…prefill_router.rs (#4728) Signed-off-by: PeaBrane <[email protected]>
PeaBrane
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to unblock, but please confirm with @ishandhanani before merging
Signed-off-by: PeaBrane <[email protected]>
Overview:
Implements a prefill optimization for disaggregated serving that eliminates the extra network round-trip decode workers previously needed to learn bootstrap connection info from prefill workers
Details:
New Data Structures:
Prefill Worker Registration:
ModelRuntimeConfig.set_disaggregated_endpoint()Router-Side Optimization:
PrefillRouter::build_bootstrap_info()queries the best worker upfront before starting prefillbootstrap_infoso decode can start connectingScheduler Changes:
get_disaggregated_endpoint(worker_id)to look up bootstrap endpointsDecode Worker Fallback:
Where should the reviewer start?
lib/llm/src/kv_router/prefill_router.rs- Core optimization logic in build_bootstrap_info()lib/llm/src/local_model/runtime_config.rs- New DisaggregatedEndpoint structcomponents/src/dynamo/sglang/register.py- Bootstrap endpoint extraction and publishingcomponents/src/dynamo/sglang/request_handlers/llm/decode_handler.py- Pre-computed bootstrap_info usageRelated Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
New Features
Refactor
✏️ Tip: You can customize this high-level summary in your review settings.