Conversation

@sshchoi
Contributor

@sshchoi sshchoi commented Nov 26, 2025

Overview:

Implements a prefill optimization for disaggregated serving: the router now supplies bootstrap connection info up front, which eliminates the extra network round-trip that decode workers previously needed to learn it from prefill workers.

Details:

New Data Structures:

  • Added DisaggregatedEndpoint struct to ModelRuntimeConfig containing bootstrap_host and bootstrap_port
  • Added BootstrapInfo struct to PreprocessedRequest containing bootstrap_host, bootstrap_port, and bootstrap_room
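
For readers on the Python side, the two structures can be pictured roughly as follows. This is a minimal sketch and not the PR's Rust code: the field names follow the walkthrough (bootstrap_host, bootstrap_port, bootstrap_room), while the dataclass layout and the `to_wire` helper are assumptions made for illustration.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class DisaggregatedEndpoint:
    """Illustrative mirror of the Rust struct; both fields are optional."""

    bootstrap_host: Optional[str] = None
    bootstrap_port: Optional[int] = None


@dataclass
class BootstrapInfo:
    """What a decode worker needs in order to reach a prefill worker."""

    bootstrap_host: str
    bootstrap_port: int
    bootstrap_room: int


def to_wire(info: BootstrapInfo) -> dict:
    # Hypothetical helper: a plain dict, similar in spirit to serialized output.
    return asdict(info)
```

A decode request carrying `to_wire(BootstrapInfo("10.0.0.1", 8998, 42))` would then hold all three values the decode worker needs, with no further lookup.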

Prefill Worker Registration:

  • Prefill workers now extract their bootstrap endpoint from the SGLang engine during startup and publish it to discovery via ModelRuntimeConfig.set_disaggregated_endpoint()
  • Added Python binding set_disaggregated_endpoint() backed by Rust implementation

Router-Side Optimization:

  • PrefillRouter::build_bootstrap_info() queries the best worker upfront before starting prefill
  • Looks up the worker's bootstrap endpoint from the scheduler's cached runtime configs
  • Generates a bootstrap_room and injects it into the request
  • Spawns prefill as a background task
  • Returns immediately with pre-built bootstrap_info so decode can start connecting
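
The router-side steps above can be sketched in Python with asyncio. This is an illustration of the control flow only, not the Rust implementation: `EndpointTable`, `route`, and the `prefill` callback are hypothetical names, and worker selection is assumed to have already happened.

```python
import asyncio
import random
from typing import Awaitable, Callable, Dict, Optional, Tuple

# Hypothetical stand-in for the scheduler's cached runtime configs:
# worker_id -> (bootstrap_host, bootstrap_port), or None if not yet published.
EndpointTable = Dict[int, Optional[Tuple[str, int]]]


def build_bootstrap_info(worker_id: int, endpoints: EndpointTable) -> Optional[dict]:
    """Return bootstrap info for the chosen worker, or None to signal fallback."""
    endpoint = endpoints.get(worker_id)
    if endpoint is None:
        return None
    host, port = endpoint
    return {
        "bootstrap_host": host,
        "bootstrap_port": port,
        "bootstrap_room": random.getrandbits(63),  # random room id (assumed width)
    }


async def route(
    worker_id: int,
    endpoints: EndpointTable,
    prefill: Callable[[int, Optional[int]], Awaitable[None]],
):
    info = build_bootstrap_info(worker_id, endpoints)
    if info is None:
        # Legacy path: wait for prefill; decode learns the endpoint from it.
        await prefill(worker_id, None)
        return None, None
    # Optimized path: start prefill in the background and return at once.
    task = asyncio.create_task(prefill(worker_id, info["bootstrap_room"]))
    return info, task
```

The point of the design is visible in the last two lines: the caller gets `info` back before the prefill coroutine has run, so decode can begin connecting while prefill is still in flight.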

Scheduler Changes:

  • KvScheduler now maintains a workers_with_configs map synced with discovery updates
  • Added get_disaggregated_endpoint(worker_id) to look up bootstrap endpoints
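
A rough Python analogue of the scheduler-side cache is shown below. The class name `WorkerConfigCache`, the `sync` method, and the dict-shaped configs are assumptions for illustration; the real code keeps the map behind an async read-write lock in Rust.

```python
from typing import Dict, Iterable, Optional


class WorkerConfigCache:
    """Per-worker runtime configs, replaced wholesale on each discovery update."""

    def __init__(self) -> None:
        self._configs: Dict[int, Optional[dict]] = {}

    def sync(self, instance_ids: Iterable[int], configs: Dict[int, dict]) -> None:
        # Workers that are live but have not published a config map to None.
        self._configs = {wid: configs.get(wid) for wid in instance_ids}

    def get_disaggregated_endpoint(self, worker_id: int) -> Optional[dict]:
        config = self._configs.get(worker_id)
        if config is None:
            return None
        return config.get("disaggregated_endpoint")
```

Returning None for both "unknown worker" and "worker without an endpoint" keeps the caller's fallback decision to a single check.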

Decode Worker Fallback:

  • DecodeWorkerHandler checks for pre-computed bootstrap_info first, falling back to the legacy prefill-fetch path if unavailable
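
The check-then-fall-back order can be sketched as a small helper. It is not the handler's actual code: `resolve_bootstrap_info` and the `fetch_from_prefill` callback are hypothetical, and the key validation is an added safeguard rather than something the PR description states.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Optional

REQUIRED_KEYS = ("bootstrap_host", "bootstrap_port", "bootstrap_room")


async def resolve_bootstrap_info(
    request: Dict[str, Any],
    fetch_from_prefill: Callable[[Dict[str, Any]], Awaitable[Optional[Dict[str, Any]]]],
) -> Dict[str, Any]:
    # Prefer the info the router attached to the request.
    info = request.get("bootstrap_info")
    source = "router"
    if info is None:
        # Legacy path: ask a prefill worker, at the cost of one round-trip.
        info = await fetch_from_prefill(request)
        source = "prefill"
    if info is None or any(key not in info for key in REQUIRED_KEYS):
        raise RuntimeError(f"no usable bootstrap info (source: {source})")
    return info
```

Validating the keys in one place means a malformed payload fails with a clear message instead of a KeyError deeper in the engine call.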

Where should the reviewer start?

  • lib/llm/src/kv_router/prefill_router.rs - Core optimization logic in build_bootstrap_info()
  • lib/llm/src/local_model/runtime_config.rs - New DisaggregatedEndpoint struct
  • components/src/dynamo/sglang/register.py - Bootstrap endpoint extraction and publishing
  • components/src/dynamo/sglang/request_handlers/llm/decode_handler.py - Pre-computed bootstrap_info usage

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

  • New Features

    • Added support for disaggregated serving with configurable bootstrap endpoints for distributed request processing
    • Introduced bootstrap pre-computation to optimize request routing efficiency
    • Implemented concurrent model initialization with readiness semantics for faster startup
  • Refactor

    • Enhanced internal request handling to support bootstrap information propagation through the serving pipeline

@sshchoi sshchoi requested a review from ishandhanani November 26, 2025 18:15
@sshchoi sshchoi requested review from a team as code owners November 26, 2025 18:15
@sshchoi sshchoi added the backend::sglang Relates to the sglang backend label Nov 26, 2025
@coderabbitai
Contributor

coderabbitai bot commented Nov 26, 2025

Walkthrough

This PR introduces disaggregated serving with bootstrap optimization across a distributed LLM inference system. Changes add concurrent endpoint startup with readiness gates, bootstrap endpoint configuration, and conditional prefill-decode decoupling with background task execution to enable optimized worker placement.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Concurrent Startup & Readiness: components/src/dynamo/sglang/main.py | Replaces sequential startup with concurrent asyncio.gather approach: starts endpoint immediately while registering model in parallel using ready_event as a readiness gate to delay request handling until readiness is confirmed. |
| Bootstrap Configuration: components/src/dynamo/sglang/register.py | Adds _get_bootstrap_info_for_config helper to extract bootstrap host/port from SGLang engine configuration; conditionally configures disaggregated serving endpoint in runtime config with error handling. |
| Request Handler Updates: components/src/dynamo/sglang/request_handlers/llm/decode_handler.py, components/src/dynamo/sglang/request_handlers/llm/prefill_handler.py | Decode handler: accepts bootstrap_info from requests or prefill stream, validates availability, uses for engine calls. Prefill handler: supports dual input formats (DisaggPreprocessedRequest and raw), emits bootstrap info via disaggregated_params, uses router-provided or locally-generated bootstrap_room. |
| Rust Runtime Config: lib/llm/src/local_model/runtime_config.rs | Introduces new DisaggregatedEndpoint struct with optional bootstrap_host and bootstrap_port; adds disaggregated_endpoint field to ModelRuntimeConfig with serde support. |
| Rust Bindings: lib/bindings/python/rust/llm/local_model.rs | Adds setter set_disaggregated_endpoint and getters bootstrap_host, bootstrap_port to ModelRuntimeConfig for Python API access. |
| Router Disaggregated Endpoint Lookup: lib/llm/src/kv_router.rs | New async method get_disaggregated_endpoint(worker_id) delegates to scheduler for worker bootstrap endpoint retrieval. |
| Scheduler Per-Worker Configuration: lib/llm/src/kv_router/scheduler.rs | Adds workers_with_configs: Arc<RwLock<HashMap>> field for per-worker runtime configs; new get_disaggregated_endpoint method; changes SchedulingRequest::respond signature from &self to &mut self with take() semantics on response channel. |
| Prefill Router Bootstrap Optimization: lib/llm/src/kv_router/prefill_router.rs | Implements bootstrap optimization path: build_bootstrap_info queries best worker and resolves endpoint, spawn_prefill_task launches background prefill; main flow conditionally routes to bootstrap path with context linking; decode request updated with bootstrap_info and router config override when available. |
| Protocol Bootstrap Structure: lib/llm/src/protocols/common/preprocessor.rs | Introduces BootstrapInfo struct (bootstrap_host, bootstrap_port, bootstrap_room) and optional bootstrap_info field in PreprocessedRequest with serde support. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring extra attention:

  • lib/llm/src/kv_router/prefill_router.rs — Bootstrap optimization introduces conditional control flow with background task spawning; verify correctness of context linking, error handling in spawned tasks, and state consistency between bootstrap and fallback paths.
  • lib/llm/src/kv_router/scheduler.rs — Mutable reference change to SchedulingRequest::respond with take() semantics may have cascading effects on callers; ensure single-send guarantee is preserved throughout the call chain.
  • Cross-file bootstrap flow — Trace bootstrap_info propagation from registration through decode handler to verify all code paths (direct request, prefill stream, fallback) are correct and logging accurately reflects execution path.
  • Python/Rust boundary — Verify Python register.py correctly marshals bootstrap config through Rust bindings and that serde annotations properly serialize/deserialize in both directions.

Poem

🐰 A distributed dream takes flight,
Bootstrap rooms emerge in the night,
Prefill and decode, now apart yet aligned,
With readiness gates and worker designs.
Fast as a rabbit, we optimize the way! 🚀

Pre-merge checks

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description check | ✅ Passed | The description comprehensively covers all template sections: Overview explains the optimization goal, Details describes data structures and implementation changes across multiple components, Where should the reviewer start provides specific file paths, and Related Issues references the GitHub issue #3978. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Title check | ✅ Passed | The title clearly summarizes the main change: adding frontend-based prefill request routing optimization for sglang to reduce network round-trips in disaggregated serving. |

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (4)
components/src/dynamo/sglang/request_handlers/llm/prefill_handler.py (1)

87-96: Logging at DEBUG level may miss important bootstrap_room source info.

When router-provided bootstrap_room is used (line 92), the log is at DEBUG level. If this is an important diagnostic for disaggregated serving issues, consider using INFO level for the router-provided case to distinguish it from locally-generated rooms.

         if isinstance(extra_args, dict):
             bootstrap_room = extra_args.get("bootstrap_room")
-            logging.debug(f"Using router-provided bootstrap_room: {bootstrap_room}")
+            if bootstrap_room is not None:
+                logging.info(f"Using router-provided bootstrap_room: {bootstrap_room}")

         if bootstrap_room is None:
             bootstrap_room = self._generate_bootstrap_room()
             logging.debug(f"Generated bootstrap_room locally: {bootstrap_room}")
lib/bindings/python/rust/llm/local_model.rs (1)

120-129: Consider early return when both arguments are None.

The setter always creates a DisaggregatedEndpoint even when both bootstrap_host and bootstrap_port are None, resulting in Some(DisaggregatedEndpoint { bootstrap_host: None, bootstrap_port: None }). This may be intentional to distinguish "explicitly set to empty" from "never set", but if the intent is to clear the endpoint when both are None, consider:

 fn set_disaggregated_endpoint(
     &mut self,
     bootstrap_host: Option<String>,
     bootstrap_port: Option<u16>,
 ) {
+    if bootstrap_host.is_none() && bootstrap_port.is_none() {
+        self.inner.disaggregated_endpoint = None;
+        return;
+    }
     self.inner.disaggregated_endpoint = Some(RsDisaggregatedEndpoint {
         bootstrap_host,
         bootstrap_port,
     });
 }
components/src/dynamo/sglang/request_handlers/llm/decode_handler.py (1)

166-174: Potential KeyError if disaggregated_params is present but missing expected keys.

The code checks for disaggregated_params existence but doesn't validate that it contains the required keys (bootstrap_host, bootstrap_port, bootstrap_room). If a malformed response is received, the engine.async_generate call at lines 180-182 will raise a KeyError.

Consider adding validation or using .get() with defaults:

             async for info in prefill_stream:
                 data = info.data()
                 if data and "disaggregated_params" in data:
-                    bootstrap_info = data["disaggregated_params"]
+                    params = data["disaggregated_params"]
+                    if all(k in params for k in ("bootstrap_host", "bootstrap_port", "bootstrap_room")):
+                        bootstrap_info = params
                 break
lib/llm/src/kv_router/scheduler.rs (1)

168-173: Consider reducing log verbosity.

Logging at INFO level every time a runtime config is found during monitoring updates (line 170-171) may produce excessive logs in production when configs change frequently. Consider using DEBUG or TRACE level instead.

                 for worker_id in &new_instance_ids {
                     let config = new_configs.get(worker_id).cloned();
                     if config.is_some() {
-                        tracing::info!("Runtime config found for worker_id: {}", worker_id);
+                        tracing::debug!("Runtime config found for worker_id: {}", worker_id);
                     }
                     new_workers_with_configs.insert(*worker_id, config);
                 }
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 26eb14c and 71c0efd.

📒 Files selected for processing (10)
  • components/src/dynamo/sglang/main.py (1 hunks)
  • components/src/dynamo/sglang/register.py (2 hunks)
  • components/src/dynamo/sglang/request_handlers/llm/decode_handler.py (1 hunks)
  • components/src/dynamo/sglang/request_handlers/llm/prefill_handler.py (1 hunks)
  • lib/bindings/python/rust/llm/local_model.rs (2 hunks)
  • lib/llm/src/kv_router.rs (1 hunks)
  • lib/llm/src/kv_router/prefill_router.rs (5 hunks)
  • lib/llm/src/kv_router/scheduler.rs (4 hunks)
  • lib/llm/src/local_model/runtime_config.rs (3 hunks)
  • lib/llm/src/protocols/common/preprocessor.rs (2 hunks)
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-06-05T01:02:15.318Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.

Applied to files:

  • lib/llm/src/kv_router.rs
📚 Learning: 2025-05-30T06:38:09.630Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.

Applied to files:

  • lib/llm/src/kv_router.rs
📚 Learning: 2025-09-02T16:46:54.015Z
Learnt from: GuanLuo
Repo: ai-dynamo/dynamo PR: 2714
File: lib/llm/src/discovery/model_entry.rs:38-42
Timestamp: 2025-09-02T16:46:54.015Z
Learning: In lib/llm/src/discovery/model_entry.rs, GuanLuo prefers not to add serde defaults for model_type and model_input fields to keep the specification explicit and avoid user errors, relying on atomic deployment strategy to avoid backward compatibility issues.

Applied to files:

  • lib/llm/src/local_model/runtime_config.rs
  • lib/bindings/python/rust/llm/local_model.rs
  • lib/llm/src/protocols/common/preprocessor.rs
📚 Learning: 2025-09-17T01:00:50.937Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3077
File: lib/llm/src/kv_router/subscriber.rs:334-336
Timestamp: 2025-09-17T01:00:50.937Z
Learning: PeaBrane identified that reordering tokio::select! arms in the indexer (moving dump_rx.recv() to be after event_rx.recv()) creates a natural barrier that ensures RouterEvents are always processed before dump requests, solving the ack-before-commit race condition. This leverages the existing biased directive and requires minimal code changes, aligning with their preference for contained solutions.

Applied to files:

  • lib/llm/src/kv_router/prefill_router.rs
📚 Learning: 2025-08-21T17:23:02.836Z
Learnt from: michaelfeil
Repo: ai-dynamo/dynamo PR: 2591
File: lib/bindings/python/rust/http.rs:0-0
Timestamp: 2025-08-21T17:23:02.836Z
Learning: In lib/bindings/python/rust/http.rs, the enable_endpoint method uses EndpointType::all() to dynamically support all available endpoint types with case-insensitive matching, which is more maintainable than hardcoded match statements for endpoint type mappings.

Applied to files:

  • lib/bindings/python/rust/llm/local_model.rs
📚 Learning: 2025-06-24T20:59:35.725Z
Learnt from: ishandhanani
Repo: ai-dynamo/dynamo PR: 1626
File: lib/llm/src/preprocessor.rs:238-239
Timestamp: 2025-06-24T20:59:35.725Z
Learning: In lib/llm/src/preprocessor.rs, the `sampling_options` call in the `preprocess_request` method is placed in the common section after the match statement on `request.prompt_input_type()`, meaning it applies to both `PromptInput::Tokens` and `PromptInput::Text` request types.

Applied to files:

  • lib/llm/src/protocols/common/preprocessor.rs
🧬 Code graph analysis (8)
lib/llm/src/kv_router.rs (1)
lib/llm/src/kv_router/scheduler.rs (1)
  • get_disaggregated_endpoint (355-364)
lib/bindings/python/rust/llm/local_model.rs (2)
lib/llm/src/entrypoint.rs (1)
  • local_model (69-76)
lib/llm/src/local_model.rs (2)
  • runtime_config (173-176)
  • runtime_config (394-396)
lib/llm/src/protocols/common/preprocessor.rs (1)
lib/llm/src/preprocessor.rs (1)
  • builder (202-247)
components/src/dynamo/sglang/register.py (1)
lib/bindings/python/rust/llm/local_model.rs (3)
  • bootstrap_port (140-145)
  • bootstrap_host (132-137)
  • set_disaggregated_endpoint (120-129)
lib/llm/src/kv_router/scheduler.rs (4)
lib/llm/src/local_model.rs (2)
  • runtime_config (173-176)
  • runtime_config (394-396)
lib/bindings/python/src/dynamo/_core.pyi (1)
  • ModelRuntimeConfig (426-447)
lib/llm/src/kv_router.rs (1)
  • get_disaggregated_endpoint (484-489)
lib/llm/src/block_manager/block.rs (1)
  • worker_id (1155-1157)
components/src/dynamo/sglang/request_handlers/llm/decode_handler.py (2)
components/src/dynamo/sglang/request_handlers/llm/prefill_handler.py (1)
  • generate (54-126)
components/src/dynamo/sglang/protocol.py (1)
  • DisaggPreprocessedRequest (63-66)
components/src/dynamo/sglang/main.py (2)
lib/bindings/python/src/dynamo/_core.pyi (4)
  • serve_endpoint (127-139)
  • generate (1372-1411)
  • ModelInput (1003-1005)
  • ModelType (1007-1014)
components/src/dynamo/sglang/register.py (1)
  • register_llm_with_readiness_gate (181-221)
components/src/dynamo/sglang/request_handlers/llm/prefill_handler.py (2)
components/src/dynamo/sglang/request_handlers/handler_base.py (2)
  • _generate_bootstrap_room (90-96)
  • _get_input_param (70-87)
lib/bindings/python/rust/llm/local_model.rs (2)
  • bootstrap_host (132-137)
  • bootstrap_port (140-145)
🪛 Ruff (0.14.5)
components/src/dynamo/sglang/register.py

99-99: Consider moving this statement to an else block

(TRY300)


100-100: Do not catch blind exception: Exception

(BLE001)

components/src/dynamo/sglang/request_handlers/llm/decode_handler.py

174-174: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: operator (amd64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: clippy (lib/bindings/python)
  • GitHub Check: clippy (.)
  • GitHub Check: clippy (launch/dynamo-run)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (22)
components/src/dynamo/sglang/register.py (2)

68-102: Looks good, but consider narrowing exception handling.

The helper function correctly extracts bootstrap information from the SGLang engine. A few observations:

  1. The broad except Exception (Ruff BLE001) is acceptable here since this is a best-effort config extraction, but consider catching more specific exceptions if known failure modes emerge.

  2. Potential DNS resolution issue: socket.gethostbyname() on line 93-94 can raise socket.gaierror if the hostname can't be resolved. This is currently caught by the broad exception handler, which is fine.

  3. Per Ruff TRY300, line 99 could move to an else block, but the current structure is readable.
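
If the failure modes turn out to be limited to name resolution, the narrower handling hinted at above might look like the sketch below. `resolve_bootstrap_endpoint` is a hypothetical helper, not the PR's `_get_bootstrap_info_for_config`; it only shows catching `socket.gaierror` specifically and using the `else` block that TRY300 suggests.

```python
import logging
import socket
from typing import Optional, Tuple


def resolve_bootstrap_endpoint(host: str, port: Optional[int]) -> Optional[Tuple[str, int]]:
    """Best-effort resolution; returns None when the endpoint is unusable."""
    if port is None:
        return None
    try:
        ip = socket.gethostbyname(host)
    except socket.gaierror as exc:
        # Only DNS failures are swallowed; anything else propagates.
        logging.warning("Could not resolve bootstrap host %r: %s", host, exc)
        return None
    else:
        return ip, port
```

The trade-off is that unexpected exceptions now surface instead of being logged and ignored, which may or may not be desirable for a best-effort startup path.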


123-130: LGTM - Bootstrap endpoint registration integrates well.

The conditional registration of the disaggregated endpoint is clean. The [OPTIMIZATION] log prefix helps identify this as part of the new prefill routing optimization during debugging.

lib/llm/src/protocols/common/preprocessor.rs (2)

102-105: LGTM - Field properly integrated into PreprocessedRequest.

The bootstrap_info field uses appropriate attributes:

  • #[builder(default)] for optional builder usage
  • #[serde(default, skip_serializing_if = "Option::is_none")] for clean serialization

13-23: The Default derive concern is not an issue in practice—BootstrapInfo::default() is never called directly in the codebase. All instances of BootstrapInfo are explicitly constructed via build_bootstrap_info() in prefill_router.rs (lines 225–228) with all fields explicitly set. The Default derive is safe but technically unnecessary given the usage pattern.

lib/llm/src/kv_router/prefill_router.rs (4)

180-183: LGTM - Simple random room ID generation.

Using rand::rng().random() for u64 provides sufficient uniqueness for bootstrap room IDs.
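
To put a number on "sufficient uniqueness", a birthday-bound estimate is a reasonable rough guide. It assumes room IDs are drawn uniformly and independently from all 2^64 values and that n rooms are live at the same time; both are assumptions for illustration, not statements about the implementation.

```latex
P(\text{collision}) \approx \frac{n(n-1)}{2 \cdot 2^{64}}
\qquad\Rightarrow\qquad
n = 10^{6}:\quad P \approx \frac{10^{12}}{3.7 \times 10^{19}} \approx 2.7 \times 10^{-8}
```

Even with a million concurrent rooms, the chance of any two sharing an ID stays on the order of a few in a hundred million under these assumptions.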


185-231: Verify handling when find_best_match fails or endpoint lookup returns None.

The function gracefully returns None on failures, triggering the fallback path. However, the _dp_rank (line 204) is extracted but unused. If data parallelism is relevant for bootstrap info, consider whether this should be included.

Also, chooser.get_disaggregated_endpoint(worker_id) could return None if the worker hasn't published its endpoint yet (race condition during startup). The fallback to the original path handles this correctly.


384-413: Optimization path correctly injects bootstrap_room into extra_args and spawns background prefill.

The logic is sound:

  1. Tries build_bootstrap_info optimization first
  2. If successful, injects bootstrap_room into extra_args for the prefill worker
  3. Forces routing to specific worker via backend_instance_id
  4. Spawns prefill as background task
  5. Returns (None, Some(worker_id), Some(bootstrap_info)) to proceed to decode immediately

One observation: The extra_args mutation (lines 392-395) modifies a cloned prefill_req, which is correct.


426-442: LGTM - Decode request correctly updated with prefill result and bootstrap info.

The conditional update of prefill_result (only in original path) and injection of bootstrap_info (only in optimization path) is correct and aligns with the two-path design.

components/src/dynamo/sglang/request_handlers/llm/prefill_handler.py (3)

68-85: Request format handling supports both legacy and new formats.

The dual-format support (checking for "request" key) ensures backward compatibility. The sampling_params extraction from sampling_options and stop_conditions correctly filters out None values.
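
The unwrapping described here can be pictured with a short sketch. The request shapes are assumptions: the wrapped form is taken to carry "request" and "sampling_params" keys, and the raw form "sampling_options" and "stop_conditions"; `unwrap_request` is a hypothetical helper, not the handler's code.

```python
from typing import Any, Dict, Tuple


def unwrap_request(request: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (inner_request, sampling_params) for either input format."""
    if "request" in request:
        # Wrapped (legacy) format: sampling params travel next to the inner request.
        inner = request["request"]
        sampling = dict(request.get("sampling_params") or {})
    else:
        # Raw format: merge the two option groups and drop unset (None) values.
        inner = request
        merged = {
            **(request.get("sampling_options") or {}),
            **(request.get("stop_conditions") or {}),
        }
        sampling = {key: value for key, value in merged.items() if value is not None}
    return inner, sampling
```

Filtering None values before passing the dict on avoids overriding engine defaults with explicit nulls.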


104-111: Output format aligns with PrefillRouter expectations.

The disaggregated_params structure matches what call_prefill in prefill_router.rs expects to extract. The placeholder fields (token_ids: [], text: None, finish_reason: None) are appropriate for this bootstrap info response.


113-122: The code properly supports both request format paths with appropriate structure compatibility.

The review comment's concern about inner_request structure compatibility is valid to verify, but the code is actually sound. Here's what I confirmed:

  1. Two request types are supported:

    • PreprocessedRequest (when skip_tokenizer_init=True): contains token_ids field
    • ChatCompletionRequest (when skip_tokenizer_init=False): contains messages field per OpenAI standard
  2. skip_tokenizer_init is configured globally at startup (args.py lines 483-492), ensuring requests match the tokenizer mode.

  3. Both request format paths in prefill_handler.py properly pass inner_request:

    • DisaggPreprocessedRequest format: inner_request = request["request"] (unwraps the wrapper)
    • Direct request format: inner_request = request
  4. _get_input_param correctly routes both cases (handler_base.py lines 79-87):

    • When skip_tokenizer_init=True: accesses request["token_ids"]
    • When skip_tokenizer_init=False: accesses request["messages"]

The inner_request structure is guaranteed to match _get_input_param expectations because the tokenizer mode is set at startup and enforced throughout the request lifecycle.

lib/llm/src/kv_router.rs (1)

482-489: LGTM - Clean delegation to scheduler for endpoint lookup.

The method correctly delegates to self.scheduler.get_disaggregated_endpoint(worker_id) and the docstring clearly explains its purpose for prefill worker bootstrap lookup.

lib/bindings/python/rust/llm/local_model.rs (1)

131-145: LGTM!

The getters correctly use and_then for safe Option chaining to access nested optional fields.

components/src/dynamo/sglang/request_handlers/llm/decode_handler.py (2)

119-129: LGTM - Pre-computed bootstrap_info path.

The optimization correctly checks for pre-computed bootstrap_info in the request first, avoiding the network round-trip to prefill workers when the router has already provided this information.


135-164: Based on my verification, I can now provide the final assessment:

instance_ids() returns an empty list (not None) when no instances are available; the fallback path executes correctly as intended.

Analysis confirms:

  • instance_ids() method returns list[int] (verified in tests/router/common.py:216)
  • Empty list is falsy in Python, so the condition evaluates to False when no instances exist
  • The fallback path at line 156 (await self.prefill_client.generate(...)) executes as intended
  • No issue exists; the code behaves correctly
lib/llm/src/local_model/runtime_config.rs (2)

10-17: LGTM - Well-structured DisaggregatedEndpoint type.

The struct correctly uses optional fields with appropriate serde attributes for serialization. Deriving Default enables convenient initialization.


48-51: LGTM - Clean integration into ModelRuntimeConfig.

The new disaggregated_endpoint field is properly documented, uses consistent serde attributes with other optional fields, and is correctly initialized to None in the Default implementation.

Also applies to: 69-69

components/src/dynamo/sglang/main.py (2)

234-253: LGTM - Concurrent startup pattern.

Using asyncio.gather to start the endpoint and register concurrently is a good pattern. The registration publishes the runtime_config with bootstrap endpoint information, enabling the prefill routing optimization.


231-253: The ready_event is created but not used for request gating—request queuing is not yet implemented.

The ready_event is passed to register_llm_with_readiness_gate, which sets it after registration succeeds (line 219 in register.py). However, serve_endpoint does not receive the ready_event parameter and therefore cannot gate or queue requests based on it. The TODO comment at line 161 confirms this: "Requests queue until ready_event is set (TODO: Part of new PR)".

Requests arriving before registration completes may not receive the bootstrap endpoint information needed for optimization. Either implement request queuing in serve_endpoint by passing and awaiting ready_event, or document this limitation explicitly.
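
One way the missing gate could work is sketched below with a plain `asyncio.Event`. The names `gated_handler`, `register_then_signal`, and `start` are hypothetical, and this does not reflect how `serve_endpoint` is actually wired; it only shows requests waiting until registration has signalled readiness.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def gated_handler(
    ready_event: asyncio.Event,
    handler: Callable[[Any], Awaitable[Any]],
    request: Any,
) -> Any:
    # Requests that arrive early park here until registration completes.
    await ready_event.wait()
    return await handler(request)


async def register_then_signal(
    ready_event: asyncio.Event,
    register: Callable[[], Awaitable[None]],
) -> None:
    await register()
    ready_event.set()  # opens the gate for all waiting requests


async def start(serve: Callable[[], Awaitable[None]], register: Callable[[], Awaitable[None]]) -> asyncio.Event:
    ready_event = asyncio.Event()
    # Endpoint and registration start concurrently, as in the PR's startup change.
    await asyncio.gather(serve(), register_then_signal(ready_event, register))
    return ready_event
```

With this shape the endpoint can accept connections immediately, while no request is processed before the bootstrap endpoint has been published.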

lib/llm/src/kv_router/scheduler.rs (3)

77-87: LGTM - Correct single-send semantics with take().

Changing respond to &mut self and using take() properly enforces that the response channel can only be used once. The error handling for double-respond is appropriate.
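
For readers less familiar with Rust's `Option::take`, a loose Python analogue of the single-send pattern is shown below. It is only an illustration: the class is a stand-in for the Rust `SchedulingRequest`, a `concurrent.futures.Future` replaces the response channel, and returning False on a second call is an assumed behaviour.

```python
from concurrent.futures import Future
from typing import Any, Optional


class SchedulingRequest:
    """Holds a one-shot response slot that is consumed on first use."""

    def __init__(self, response: Future) -> None:
        self._response: Optional[Future] = response

    def respond(self, value: Any) -> bool:
        # "Take" the slot: read it and leave None behind in one step.
        response, self._response = self._response, None
        if response is None:
            return False  # already responded; nothing is sent twice
        response.set_result(value)
        return True
```

Because the slot is emptied before the value is sent, a second `respond` call cannot reach the channel at all, which is the guarantee the mutable-reference change is meant to enforce.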


112-123: LGTM - Proper initialization of workers_with_configs.

The initialization correctly merges instance IDs with their runtime configs, handling the case where a worker may not have a config yet.


355-364: LGTM - Clean endpoint lookup implementation.

The get_disaggregated_endpoint method correctly uses async read lock and proper Option chaining to retrieve the disaggregated endpoint for a given worker.

@sshchoi sshchoi self-assigned this Nov 26, 2025
@sshchoi sshchoi requested a review from PeaBrane November 26, 2025 18:20
@PeaBrane
Contributor

PeaBrane commented Nov 26, 2025

@sshchoi thanks! Can you take a look at the rabbit and CIs? I'll give it a look in the meantime

Can you also share some e2e test / benchmarking results in the PR description, if you have any?

@ishandhanani
Contributor

ishandhanani commented Nov 27, 2025

Most of my comments are nits.

Before this is merged, we need to run the SA workloads. Specifically, this means running:

  1. GB200 FP8

I will share info on how to do this via Slack.

@sshchoi sshchoi changed the title from "Add frontend based prefill request routing for sglang" to "feat: add frontend based prefill request routing for sglang" Dec 2, 2025
@github-actions github-actions bot added the feat label Dec 2, 2025
Signed-off-by: Sean Choi <[email protected]>
Contributor

@PeaBrane PeaBrane left a comment

to unblock, but please confirm with @ishandhanani before merging

Signed-off-by: PeaBrane <[email protected]>
Labels

backend::sglang Relates to the sglang backend feat size/L

4 participants