feat: add frontend based prefill request routing for sglang #4635

sshchoi · 2025-11-26T18:15:25Z

Overview:

Implements a prefill optimization for disaggregated serving that eliminates the extra network round-trip decode workers previously needed to learn bootstrap connection info from prefill workers

Details:

New Data Structures:

Added DisaggregatedEndpoint struct to ModelRuntimeConfig containing bootstrap_host and bootstrap_port
Added BootstrapInfo struct to PreprocessedRequest containing host, port, and room

Prefill Worker Registration:

Prefill workers now extract their bootstrap endpoint from the SGLang engine during startup and publish it to discovery via ModelRuntimeConfig.set_disaggregated_endpoint()
Added Python binding set_disaggregated_endpoint() backed by Rust implementation

Router-Side Optimization:

PrefillRouter::build_bootstrap_info() queries the best worker upfront before starting prefill
Looks up the worker's bootstrap endpoint from the scheduler's cached runtime configs
Generates a bootstrap_room and injects it into the request
Spawns prefill as a background task
Returns immediately with pre-built bootstrap_info so decode can start connecting

Scheduler Changes:

KvScheduler now maintains a workers_with_configs map synced with discovery updates
Added get_disaggregated_endpoint(worker_id) to look up bootstrap endpoints

Decode Worker Fallback:

DecodeWorkerHandler checks for pre-computed bootstrap_info first, falling back to the legacy prefill-fetch path if unavailable

Where should the reviewer start?

lib/llm/src/kv_router/prefill_router.rs - Core optimization logic in build_bootstrap_info()
lib/llm/src/local_model/runtime_config.rs - New DisaggregatedEndpoint struct
components/src/dynamo/sglang/register.py - Bootstrap endpoint extraction and publishing
components/src/dynamo/sglang/request_handlers/llm/decode_handler.py - Pre-computed bootstrap_info usage

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

closes GitHub issue: [FEATURE]: Support new frontend based prefill routing with SGLang #3978

Summary by CodeRabbit

New Features
- Added support for disaggregated serving with configurable bootstrap endpoints for distributed request processing
- Introduced bootstrap pre-computation to optimize request routing efficiency
- Implemented concurrent model initialization with readiness semantics for faster startup
Refactor
- Enhanced internal request handling to support bootstrap information propagation through the serving pipeline

_{✏️ Tip: You can customize this high-level summary in your review settings.}

Signed-off-by: Sean Choi <[email protected]>

coderabbitai · 2025-11-26T18:19:42Z

Walkthrough

This PR introduces disaggregated serving with bootstrap optimization across a distributed LLM inference system. Changes add concurrent endpoint startup with readiness gates, bootstrap endpoint configuration, and conditional prefill-decode decoupling with background task execution to enable optimized worker placement.

Changes

Cohort / File(s)	Summary
Concurrent Startup & Readiness `components/src/dynamo/sglang/main.py`	Replaces sequential startup with concurrent `asyncio.gather` approach: starts endpoint immediately while registering model in parallel using `ready_event` as a readiness gate to delay request handling until readiness is confirmed.
Bootstrap Configuration `components/src/dynamo/sglang/register.py`	Adds `_get_bootstrap_info_for_config` helper to extract bootstrap host/port from SGLang engine configuration; conditionally configures disaggregated serving endpoint in runtime config with error handling.
Request Handler Updates `components/src/dynamo/sglang/request_handlers/llm/decode_handler.py`, `components/src/dynamo/sglang/request_handlers/llm/prefill_handler.py`	Decode handler: accepts bootstrap_info from requests or prefill stream, validates availability, uses for engine calls. Prefill handler: supports dual input formats (DisaggPreprocessedRequest and raw), emits bootstrap info via disaggregated_params, uses router-provided or locally-generated bootstrap_room.
Rust Runtime Config `lib/llm/src/local_model/runtime_config.rs`	Introduces new `DisaggregatedEndpoint` struct with optional bootstrap_host and bootstrap_port; adds `disaggregated_endpoint` field to `ModelRuntimeConfig` with serde support.
Rust Bindings `lib/bindings/python/rust/llm/local_model.rs`	Adds setter `set_disaggregated_endpoint` and getters `bootstrap_host`, `bootstrap_port` to `ModelRuntimeConfig` for Python API access.
Router Disaggregated Endpoint Lookup `lib/llm/src/kv_router.rs`	New async method `get_disaggregated_endpoint(worker_id)` delegates to scheduler for worker bootstrap endpoint retrieval.
Scheduler Per-Worker Configuration `lib/llm/src/kv_router/scheduler.rs`	Adds `workers_with_configs: Arc<RwLock<HashMap>>` field for per-worker runtime configs; new `get_disaggregated_endpoint` method; changes `SchedulingRequest::respond` signature from `&self` to `&mut self` with `take()` semantics on response channel.
Prefill Router Bootstrap Optimization `lib/llm/src/kv_router/prefill_router.rs`	Implements bootstrap optimization path: `build_bootstrap_info` queries best worker and resolves endpoint, `spawn_prefill_task` launches background prefill; main flow conditionally routes to bootstrap path with context linking; decode request updated with bootstrap_info and router config override when available.
Protocol Bootstrap Structure `lib/llm/src/protocols/common/preprocessor.rs`	Introduces `BootstrapInfo` struct (bootstrap_host, bootstrap_port, bootstrap_room) and optional `bootstrap_info` field in `PreprocessedRequest` with serde support.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring extra attention:

lib/llm/src/kv_router/prefill_router.rs — Bootstrap optimization introduces conditional control flow with background task spawning; verify correctness of context linking, error handling in spawned tasks, and state consistency between bootstrap and fallback paths.
lib/llm/src/kv_router/scheduler.rs — Mutable reference change to SchedulingRequest::respond with take() semantics may have cascading effects on callers; ensure single-send guarantee is preserved throughout the call chain.
Cross-file bootstrap flow — Trace bootstrap_info propagation from registration through decode handler to verify all code paths (direct request, prefill stream, fallback) are correct and logging accurately reflects execution path.
Python/Rust boundary — Verify Python register.py correctly marshals bootstrap config through Rust bindings and that serde annotations properly serialize/deserialize in both directions.

Poem

🐰 A distributed dream takes flight,
Bootstrap rooms emerge in the night,
Prefill and decode, now apart yet aligned,
With readiness gates and worker designs.
Fast as a rabbit, we optimize the way! 🚀

Pre-merge checks

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description check	✅ Passed	The description comprehensively covers all template sections: Overview explains the optimization goal, Details describes data structures and implementation changes across multiple components, Where should the reviewer start provides specific file paths, and Related Issues references the GitHub issue #3978.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Title check	✅ Passed	The title clearly summarizes the main change: adding frontend-based prefill request routing optimization for sglang to reduce network round-trips in disaggregated serving.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (4)

components/src/dynamo/sglang/request_handlers/llm/prefill_handler.py (1)
87-96: Logging at DEBUG level may miss important bootstrap_room source info.

When router-provided bootstrap_room is used (line 92), the log is at DEBUG level. If this is an important diagnostic for disaggregated serving issues, consider using INFO level for the router-provided case to distinguish it from locally-generated rooms.
         if isinstance(extra_args, dict):
             bootstrap_room = extra_args.get("bootstrap_room")
-            logging.debug(f"Using router-provided bootstrap_room: {bootstrap_room}")
+            if bootstrap_room is not None:
+                logging.info(f"Using router-provided bootstrap_room: {bootstrap_room}")

         if bootstrap_room is None:
             bootstrap_room = self._generate_bootstrap_room()
             logging.debug(f"Generated bootstrap_room locally: {bootstrap_room}")
lib/bindings/python/rust/llm/local_model.rs (1)
120-129: Consider early return when both arguments are None.

The setter always creates a DisaggregatedEndpoint even when both bootstrap_host and bootstrap_port are None, resulting in Some(DisaggregatedEndpoint { bootstrap_host: None, bootstrap_port: None }). This may be intentional to distinguish "explicitly set to empty" from "never set", but if the intent is to clear the endpoint when both are None, consider:
 fn set_disaggregated_endpoint(
     &mut self,
     bootstrap_host: Option<String>,
     bootstrap_port: Option<u16>,
 ) {
+    if bootstrap_host.is_none() && bootstrap_port.is_none() {
+        self.inner.disaggregated_endpoint = None;
+        return;
+    }
     self.inner.disaggregated_endpoint = Some(RsDisaggregatedEndpoint {
         bootstrap_host,
         bootstrap_port,
     });
 }
components/src/dynamo/sglang/request_handlers/llm/decode_handler.py (1)
166-174: Potential KeyError if disaggregated_params is present but missing expected keys.

The code checks for disaggregated_params existence but doesn't validate that it contains the required keys (bootstrap_host, bootstrap_port, bootstrap_room). If a malformed response is received, the engine.async_generate call at lines 180-182 will raise a KeyError.

Consider adding validation or using .get() with defaults:
             async for info in prefill_stream:
                 data = info.data()
                 if data and "disaggregated_params" in data:
-                    bootstrap_info = data["disaggregated_params"]
+                    params = data["disaggregated_params"]
+                    if all(k in params for k in ("bootstrap_host", "bootstrap_port", "bootstrap_room")):
+                        bootstrap_info = params
                 break
lib/llm/src/kv_router/scheduler.rs (1)
168-173: Consider reducing log verbosity.

Logging at INFO level every time a runtime config is found during monitoring updates (line 170-171) may produce excessive logs in production when configs change frequently. Consider using DEBUG or TRACE level instead.
                 for worker_id in &new_instance_ids {
                     let config = new_configs.get(worker_id).cloned();
                     if config.is_some() {
-                        tracing::info!("Runtime config found for worker_id: {}", worker_id);
+                        tracing::debug!("Runtime config found for worker_id: {}", worker_id);
                     }
                     new_workers_with_configs.insert(*worker_id, config);
                 }

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 26eb14c and 71c0efd.

📒 Files selected for processing (10)

components/src/dynamo/sglang/main.py (1 hunks)
components/src/dynamo/sglang/register.py (2 hunks)
components/src/dynamo/sglang/request_handlers/llm/decode_handler.py (1 hunks)
components/src/dynamo/sglang/request_handlers/llm/prefill_handler.py (1 hunks)
lib/bindings/python/rust/llm/local_model.rs (2 hunks)
lib/llm/src/kv_router.rs (1 hunks)
lib/llm/src/kv_router/prefill_router.rs (5 hunks)
lib/llm/src/kv_router/scheduler.rs (4 hunks)
lib/llm/src/local_model/runtime_config.rs (3 hunks)
lib/llm/src/protocols/common/preprocessor.rs (2 hunks)

🧰 Additional context used

🧠 Learnings (6)

📚 Learning: 2025-06-05T01:02:15.318Z

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.

Applied to files:

lib/llm/src/kv_router.rs

📚 Learning: 2025-05-30T06:38:09.630Z

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.

Applied to files:

lib/llm/src/kv_router.rs

📚 Learning: 2025-09-02T16:46:54.015Z

Learnt from: GuanLuo
Repo: ai-dynamo/dynamo PR: 2714
File: lib/llm/src/discovery/model_entry.rs:38-42
Timestamp: 2025-09-02T16:46:54.015Z
Learning: In lib/llm/src/discovery/model_entry.rs, GuanLuo prefers not to add serde defaults for model_type and model_input fields to keep the specification explicit and avoid user errors, relying on atomic deployment strategy to avoid backward compatibility issues.

Applied to files:

lib/llm/src/local_model/runtime_config.rs
lib/bindings/python/rust/llm/local_model.rs
lib/llm/src/protocols/common/preprocessor.rs

📚 Learning: 2025-09-17T01:00:50.937Z

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3077
File: lib/llm/src/kv_router/subscriber.rs:334-336
Timestamp: 2025-09-17T01:00:50.937Z
Learning: PeaBrane identified that reordering tokio::select! arms in the indexer (moving dump_rx.recv() to be after event_rx.recv()) creates a natural barrier that ensures RouterEvents are always processed before dump requests, solving the ack-before-commit race condition. This leverages the existing biased directive and requires minimal code changes, aligning with their preference for contained solutions.

Applied to files:

lib/llm/src/kv_router/prefill_router.rs

📚 Learning: 2025-08-21T17:23:02.836Z

Learnt from: michaelfeil
Repo: ai-dynamo/dynamo PR: 2591
File: lib/bindings/python/rust/http.rs:0-0
Timestamp: 2025-08-21T17:23:02.836Z
Learning: In lib/bindings/python/rust/http.rs, the enable_endpoint method uses EndpointType::all() to dynamically support all available endpoint types with case-insensitive matching, which is more maintainable than hardcoded match statements for endpoint type mappings.

Applied to files:

lib/bindings/python/rust/llm/local_model.rs

📚 Learning: 2025-06-24T20:59:35.725Z

Learnt from: ishandhanani
Repo: ai-dynamo/dynamo PR: 1626
File: lib/llm/src/preprocessor.rs:238-239
Timestamp: 2025-06-24T20:59:35.725Z
Learning: In lib/llm/src/preprocessor.rs, the `sampling_options` call in the `preprocess_request` method is placed in the common section after the match statement on `request.prompt_input_type()`, meaning it applies to both `PromptInput::Tokens` and `PromptInput::Text` request types.

Applied to files:

lib/llm/src/protocols/common/preprocessor.rs

🧬 Code graph analysis (8)

lib/llm/src/kv_router.rs (1)

lib/llm/src/kv_router/scheduler.rs (1)

get_disaggregated_endpoint (355-364)

lib/bindings/python/rust/llm/local_model.rs (2)

lib/llm/src/entrypoint.rs (1)

local_model (69-76)

lib/llm/src/local_model.rs (2)

runtime_config (173-176)

runtime_config (394-396)

lib/llm/src/protocols/common/preprocessor.rs (1)

lib/llm/src/preprocessor.rs (1)

builder (202-247)

components/src/dynamo/sglang/register.py (1)

lib/bindings/python/rust/llm/local_model.rs (3)

bootstrap_port (140-145)

bootstrap_host (132-137)

set_disaggregated_endpoint (120-129)

lib/llm/src/kv_router/scheduler.rs (4)

lib/llm/src/local_model.rs (2)

runtime_config (173-176)

runtime_config (394-396)

lib/bindings/python/src/dynamo/_core.pyi (1)

ModelRuntimeConfig (426-447)

lib/llm/src/kv_router.rs (1)

get_disaggregated_endpoint (484-489)

lib/llm/src/block_manager/block.rs (1)

worker_id (1155-1157)

components/src/dynamo/sglang/request_handlers/llm/decode_handler.py (2)

components/src/dynamo/sglang/request_handlers/llm/prefill_handler.py (1)

generate (54-126)

components/src/dynamo/sglang/protocol.py (1)

DisaggPreprocessedRequest (63-66)

components/src/dynamo/sglang/main.py (2)

lib/bindings/python/src/dynamo/_core.pyi (4)

serve_endpoint (127-139)

generate (1372-1411)

ModelInput (1003-1005)

ModelType (1007-1014)

components/src/dynamo/sglang/register.py (1)

register_llm_with_readiness_gate (181-221)

components/src/dynamo/sglang/request_handlers/llm/prefill_handler.py (2)

components/src/dynamo/sglang/request_handlers/handler_base.py (2)

_generate_bootstrap_room (90-96)

_get_input_param (70-87)

lib/bindings/python/rust/llm/local_model.rs (2)

bootstrap_host (132-137)

bootstrap_port (140-145)

🪛 Ruff (0.14.5)

components/src/dynamo/sglang/register.py

99-99: Consider moving this statement to an else block

(TRY300)

100-100: Do not catch blind exception: Exception

(BLE001)

components/src/dynamo/sglang/request_handlers/llm/decode_handler.py

174-174: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)

GitHub Check: trtllm (arm64)
GitHub Check: sglang (arm64)
GitHub Check: operator (amd64)
GitHub Check: vllm (amd64)
GitHub Check: clippy (lib/bindings/python)
GitHub Check: clippy (.)
GitHub Check: clippy (launch/dynamo-run)
GitHub Check: Build and Test - dynamo

🔇 Additional comments (22)

components/src/dynamo/sglang/register.py (2)

68-102: Looks good, but consider narrowing exception handling.

The helper function correctly extracts bootstrap information from the SGLang engine. A few observations:

The broad except Exception (Ruff BLE001) is acceptable here since this is a best-effort config extraction, but consider catching more specific exceptions if known failure modes emerge.

Potential DNS resolution issue: socket.gethostbyname() on line 93-94 can raise socket.gaierror if the hostname can't be resolved. This is currently caught by the broad exception handler, which is fine.

Per Ruff TRY300, line 99 could move to an else block, but the current structure is readable.

123-130: LGTM - Bootstrap endpoint registration integrates well.

The conditional registration of the disaggregated endpoint is clean. The [OPTIMIZATION] log prefix helps identify this as part of the new prefill routing optimization during debugging.

lib/llm/src/protocols/common/preprocessor.rs (2)

102-105: LGTM - Field properly integrated into PreprocessedRequest.

The bootstrap_info field uses appropriate attributes:

#[builder(default)] for optional builder usage

#[serde(default, skip_serializing_if = "Option::is_none")] for clean serialization

13-23: The Default derive concern is not an issue in practice—BootstrapInfo::default() is never called directly in the codebase. All instances of BootstrapInfo are explicitly constructed via build_bootstrap_info() in prefill_router.rs (lines 225–228) with all fields explicitly set. The Default derive is safe but technically unnecessary given the usage pattern.

lib/llm/src/kv_router/prefill_router.rs (4)

180-183: LGTM - Simple random room ID generation.

Using rand::rng().random() for u64 provides sufficient uniqueness for bootstrap room IDs.

185-231: Verify handling when find_best_match fails or endpoint lookup returns None.

The function gracefully returns None on failures, triggering the fallback path. However, the _dp_rank (line 204) is extracted but unused. If data parallelism is relevant for bootstrap info, consider whether this should be included.

Also, chooser.get_disaggregated_endpoint(worker_id) could return None if the worker hasn't published its endpoint yet (race condition during startup). The fallback to the original path handles this correctly.

384-413: Optimization path correctly injects bootstrap_room into extra_args and spawns background prefill.

The logic is sound:

Tries build_bootstrap_info optimization first

If successful, injects bootstrap_room into extra_args for the prefill worker

Forces routing to specific worker via backend_instance_id

Spawns prefill as background task

Returns (None, Some(worker_id), Some(bootstrap_info)) to proceed to decode immediately

One observation: The extra_args mutation (lines 392-395) modifies a cloned prefill_req, which is correct.

426-442: LGTM - Decode request correctly updated with prefill result and bootstrap info.

The conditional update of prefill_result (only in original path) and injection of bootstrap_info (only in optimization path) is correct and aligns with the two-path design.

components/src/dynamo/sglang/request_handlers/llm/prefill_handler.py (3)

68-85: Request format handling supports both legacy and new formats.

The dual-format support (checking for "request" key) ensures backward compatibility. The sampling_params extraction from sampling_options and stop_conditions correctly filters out None values.

104-111: Output format aligns with PrefillRouter expectations.

The disaggregated_params structure matches what call_prefill in prefill_router.rs expects to extract. The placeholder fields (token_ids: [], text: None, finish_reason: None) are appropriate for this bootstrap info response.

113-122: The code properly supports both request format paths with appropriate structure compatibility.

The review comment's concern about inner_request structure compatibility is valid to verify, but the code is actually sound. Here's what I confirmed:

Two request types are supported:

PreprocessedRequest (when skip_tokenizer_init=True): contains token_ids field

ChatCompletionRequest (when skip_tokenizer_init=False): contains messages field per OpenAI standard

skip_tokenizer_init is configured globally at startup (args.py lines 483-492), ensuring requests match the tokenizer mode.

Both request format paths in prefill_handler.py properly pass inner_request:

DisaggPreprocessedRequest format: inner_request = request["request"] (unwraps the wrapper)

Direct request format: inner_request = request

_get_input_param correctly routes both cases (handler_base.py lines 79-87):

When skip_tokenizer_init=True: accesses request["token_ids"]

When skip_tokenizer_init=False: accesses request["messages"]

The inner_request structure is guaranteed to match _get_input_param expectations because the tokenizer mode is set at startup and enforced throughout the request lifecycle.

lib/llm/src/kv_router.rs (1)

482-489: LGTM - Clean delegation to scheduler for endpoint lookup.

The method correctly delegates to self.scheduler.get_disaggregated_endpoint(worker_id) and the docstring clearly explains its purpose for prefill worker bootstrap lookup.

lib/bindings/python/rust/llm/local_model.rs (1)

131-145: LGTM!

The getters correctly use and_then for safe Option chaining to access nested optional fields.

components/src/dynamo/sglang/request_handlers/llm/decode_handler.py (2)

119-129: LGTM - Pre-computed bootstrap_info path.

The optimization correctly checks for pre-computed bootstrap_info in the request first, avoiding the network round-trip to prefill workers when the router has already provided this information.

135-164: Based on my verification, I can now provide the final assessment:

instance_ids() returns an empty list (not None) when no instances are available; the fallback path executes correctly as intended.

Analysis confirms:

instance_ids() method returns list[int] (verified in tests/router/common.py:216)

Empty list is falsy in Python, so the condition evaluates to False when no instances exist

The fallback path at line 156 (await self.prefill_client.generate(...)) executes as intended

No issue exists; the code behaves correctly

lib/llm/src/local_model/runtime_config.rs (2)

10-17: LGTM - Well-structured DisaggregatedEndpoint type.

The struct correctly uses optional fields with appropriate serde attributes for serialization. Deriving Default enables convenient initialization.

48-51: LGTM - Clean integration into ModelRuntimeConfig.

The new disaggregated_endpoint field is properly documented, uses consistent serde attributes with other optional fields, and is correctly initialized to None in the Default implementation.

Also applies to: 69-69

components/src/dynamo/sglang/main.py (2)

234-253: LGTM - Concurrent startup pattern.

Using asyncio.gather to start the endpoint and register concurrently is a good pattern. The registration publishes the runtime_config with bootstrap endpoint information, enabling the prefill routing optimization.

231-253: The ready_event is created but not used for request gating—request queuing is not yet implemented.

The ready_event is passed to register_llm_with_readiness_gate, which sets it after registration succeeds (line 219 in register.py). However, serve_endpoint does not receive the ready_event parameter and therefore cannot gate or queue requests based on it. The TODO comment at line 161 confirms this: "Requests queue until ready_event is set (TODO: Part of new PR)".

Requests arriving before registration completes may not receive the bootstrap endpoint information needed for optimization. Either implement request queuing in serve_endpoint by passing and awaiting ready_event, or document this limitation explicitly.

lib/llm/src/kv_router/scheduler.rs (3)

77-87: LGTM - Correct single-send semantics with take().

Changing respond to &mut self and using take() properly enforces that the response channel can only be used once. The error handling for double-respond is appropriate.

112-123: LGTM - Proper initialization of workers_with_configs.

The initialization correctly merges instance IDs with their runtime configs, handling the case where a worker may not have a config yet.

355-364: LGTM - Clean endpoint lookup implementation.

The get_disaggregated_endpoint method correctly uses async read lock and proper Option chaining to retrieve the disaggregated endpoint for a given worker.

lib/llm/src/kv_router/prefill_router.rs

PeaBrane · 2025-11-26T19:38:02Z

@sshchoi thanks! Can you take a look at the rabbit and CIs? I'll give it a look in the meantime

Can you also share some e2e test / benchmarking results on this in the PR desc if you have some

lib/llm/src/kv_router/prefill_router.rs

components/src/dynamo/sglang/request_handlers/llm/decode_handler.py

lib/llm/src/kv_router/prefill_router.rs

components/src/dynamo/sglang/request_handlers/llm/prefill_handler.py

components/src/dynamo/sglang/register.py

lib/llm/src/kv_router/prefill_router.rs

ishandhanani · 2025-11-27T20:25:31Z

Most of my comments are nits.

Before this is merged in we need to run the SA workloads. This specifically means running

Gb200 fp8

I will share info on how to do this via slack

Signed-off-by: Sean Choi <[email protected]>

…prefill_router.rs (#4728) Signed-off-by: PeaBrane <[email protected]>

PeaBrane

to unblock, but please confirm with @ishandhanani before merging

Signed-off-by: PeaBrane <[email protected]>

Add new request routing flow for sglang

71c0efd

Signed-off-by: Sean Choi <[email protected]>

sshchoi requested a review from ishandhanani November 26, 2025 18:15

sshchoi requested review from a team as code owners November 26, 2025 18:15

pull-request-size bot added the size/L label Nov 26, 2025

sshchoi added the backend::sglang Relates to the sglang backend label Nov 26, 2025

coderabbitai bot reviewed Nov 26, 2025

View reviewed changes

lib/llm/src/kv_router/prefill_router.rs Outdated Show resolved Hide resolved

sshchoi self-assigned this Nov 26, 2025

sshchoi requested a review from PeaBrane November 26, 2025 18:20

PeaBrane reviewed Nov 26, 2025

View reviewed changes

lib/llm/src/kv_router/prefill_router.rs Outdated Show resolved Hide resolved

PeaBrane reviewed Nov 26, 2025

View reviewed changes

components/src/dynamo/sglang/request_handlers/llm/decode_handler.py Outdated Show resolved Hide resolved

PeaBrane reviewed Nov 26, 2025

View reviewed changes

lib/llm/src/kv_router/prefill_router.rs Outdated Show resolved Hide resolved

ishandhanani reviewed Nov 27, 2025

View reviewed changes

components/src/dynamo/sglang/request_handlers/llm/prefill_handler.py Outdated Show resolved Hide resolved

ishandhanani reviewed Nov 27, 2025

View reviewed changes

components/src/dynamo/sglang/register.py Outdated Show resolved Hide resolved

ishandhanani reviewed Nov 27, 2025

View reviewed changes

components/src/dynamo/sglang/register.py Outdated Show resolved Hide resolved

ishandhanani reviewed Nov 27, 2025

View reviewed changes

lib/llm/src/kv_router/prefill_router.rs Show resolved Hide resolved

fix dp_rank, add decode timeout

ffe3ece

Signed-off-by: Sean Choi <[email protected]>

copy-pr-bot bot had a problem deploying to GITLAB December 2, 2025 00:29 Failure

Merge branch 'main' into sechoi/sglang_prefill_router

508473b

copy-pr-bot bot had a problem deploying to GITLAB December 2, 2025 00:32 Failure

sshchoi changed the title ~~Add frontend based prefill request routing for sglang~~ feat: add frontend based prefill request routing for sglang Dec 2, 2025

github-actions bot added the feat label Dec 2, 2025

formatting

8e31e27

Signed-off-by: Sean Choi <[email protected]>

copy-pr-bot bot had a problem deploying to GITLAB December 2, 2025 01:48 Failure

Merge branch 'main' into sechoi/sglang_prefill_router

187d63c

copy-pr-bot bot had a problem deploying to GITLAB December 2, 2025 07:45 Failure

copy-pr-bot bot had a problem deploying to GITLAB December 2, 2025 18:18 Failure

Merge branch 'main' into sechoi/sglang_prefill_router

03dd93c

copy-pr-bot bot temporarily deployed to GITLAB December 3, 2025 01:37 Inactive

copy-pr-bot bot temporarily deployed to GITLAB December 3, 2025 01:43 Inactive

Merge branch 'main' into sechoi/sglang_prefill_router

5d99390

copy-pr-bot bot temporarily deployed to GITLAB December 3, 2025 01:50 Inactive

copy-pr-bot bot temporarily deployed to GITLAB December 3, 2025 01:54 Inactive

Merge branch 'main' into sechoi/sglang_prefill_router

865e99d

copy-pr-bot bot temporarily deployed to GITLAB December 3, 2025 06:18 Inactive

copy-pr-bot bot temporarily deployed to GITLAB December 3, 2025 06:22 Inactive

chore: sharing code paths for call_prefill and spawn_prefill_task in …

348cb17

…prefill_router.rs (#4728) Signed-off-by: PeaBrane <[email protected]>

copy-pr-bot bot had a problem deploying to GITLAB December 3, 2025 19:59 Failure

Merge branch 'main' into sechoi/sglang_prefill_router

7fd9f4c

copy-pr-bot bot had a problem deploying to GITLAB December 3, 2025 19:59 Failure

PeaBrane approved these changes Dec 5, 2025

View reviewed changes

Merge branch 'main' into sechoi/sglang_prefill_router

0c0a50d

copy-pr-bot bot temporarily deployed to GITLAB December 5, 2025 08:56 Inactive

copy-pr-bot bot temporarily deployed to GITLAB December 5, 2025 09:02 Inactive

Merge branch 'main' into sechoi/sglang_prefill_router

0efe049

copy-pr-bot bot had a problem deploying to GITLAB December 8, 2025 01:55 Failure

update disagg_router.sh

7e95560

Signed-off-by: PeaBrane <[email protected]>

copy-pr-bot bot had a problem deploying to GITLAB December 9, 2025 18:34 Failure

Merge branch 'main' into sechoi/sglang_prefill_router

2798bdf

copy-pr-bot bot had a problem deploying to GITLAB December 9, 2025 19:28 Failure

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: add frontend based prefill request routing for sglang #4635

feat: add frontend based prefill request routing for sglang #4635

Uh oh!

sshchoi commented Nov 26, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Nov 26, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

PeaBrane commented Nov 26, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ishandhanani commented Nov 27, 2025 •

edited

Loading

Uh oh!

PeaBrane left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

feat: add frontend based prefill request routing for sglang #4635

Are you sure you want to change the base?

feat: add frontend based prefill request routing for sglang #4635

Uh oh!

Conversation

sshchoi commented Nov 26, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Nov 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Pre-merge checks

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

PeaBrane commented Nov 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ishandhanani commented Nov 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

PeaBrane left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

sshchoi commented Nov 26, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Nov 26, 2025 •

edited

Loading

PeaBrane commented Nov 26, 2025 •

edited

Loading

ishandhanani commented Nov 27, 2025 •

edited

Loading