
Conversation

@ishandhanani (Collaborator) commented Nov 24, 2025

Will merge this in after the next release.

This multi-stage Dockerfile splits the SGLang build into base, framework, and runtime stages. The runtime stage cuts the image size roughly in half:


REPOSITORY   TAG              IMAGE ID       SIZE
sglang       framework-test   be66a8e51a09   39.3GB
sglang       runtime-test     a4dac91fe030   20GB
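As a rough sketch of the stage layout (stage names are from the PR description; the base image tags, paths, and commands below are illustrative assumptions, not the actual Dockerfile):

```dockerfile
# --- base: shared CUDA toolchain + system dependencies (assumed base image) ---
FROM nvidia/cuda:12.9.1-devel-ubuntu24.04 AS base
RUN apt-get update \
    && apt-get install -y --no-install-recommends build-essential curl git python3-pip \
    && rm -rf /var/lib/apt/lists/*

# --- framework: full build environment; compiles SGLang and its native deps ---
FROM base AS framework
WORKDIR /sgl-workspace
COPY . sglang
RUN pip install --no-cache-dir -e sglang

# --- runtime: lightweight production image; copies only built artifacts ---
FROM nvidia/cuda:12.9.1-runtime-ubuntu24.04 AS runtime
COPY --from=framework /usr/local/lib/python3.12/dist-packages /usr/local/lib/python3.12/dist-packages
```

Individual stages can then be built with `--target`, e.g. `docker build --target runtime -t sglang:runtime-test .`.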

Tests

  1. cu13 arm - https://github.com/ishandhanani/srt-slurm/blob/main/recipies/gb300-fp4/1p2.yaml
  2. cu12 arm - https://github.com/ishandhanani/srt-slurm/blob/main/recipies/gb200-fp4/1p2d.yaml
  3. x86 - all simple agg/disagg examples

@gemini-code-assist (Contributor)

Summary of Changes

Hello @ishandhanani, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a comprehensive refactoring of the Dockerfile to implement a multi-stage build strategy. The core objective is to create a highly optimized and lightweight runtime image for SGLang, distinct from the development and build environment. This change enhances build performance, ensures correct CUDA architecture compatibility, and improves the overall reliability of the Docker image creation process.

Highlights

  • Multi-stage Dockerfile Implementation: The Dockerfile has been refactored into a multi-stage build process, introducing distinct base, framework, and runtime stages. This significantly improves image size and build efficiency by separating build-time dependencies from the final runtime environment.
  • Lightweight Runtime Image: A new runtime stage has been added, designed to be a lightweight production image. It includes only essential runtime dependencies and copies pre-built artifacts from the framework stage, drastically reducing the final image footprint for deployment.
  • Optimized Dependency Management and Caching: Dependency installations across all stages have been reorganized for better Docker layer caching. RUN commands are consolidated, and system dependencies are categorized for clarity and efficiency. Redundant packages are removed from earlier stages if not strictly necessary.
  • CUDA Architecture Compatibility Fix: The DeepEP build process now correctly handles CUDA architecture targeting. Specifically, CUDA 12.8.1 builds no longer attempt to target the Blackwell (10.3) architecture, which is only supported by CUDA 12.9.1 and later, resolving potential compilation failures.
  • Enhanced Build Robustness: Retry logic (curl --retry 3 --retry-delay 2) has been added to various download commands for external dependencies (e.g., GDRCopy, DeepEP, Triton, clangd, CMake, just, oh-my-zsh plugins), making the build process more resilient to transient network issues.
  • SGL-Router Integration and Size Optimization: The sgl-router component is now built and installed within the framework stage. Crucially, the Rust toolchain used for its compilation is removed immediately after the build, ensuring that it does not contribute to the final image size.
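The download retry behavior comes directly from curl's `--retry 3 --retry-delay 2` flags; as a sketch of the same idea at the shell level, a generic retry wrapper (a hypothetical helper, not something in the PR) could look like:

```shell
# retry ATTEMPTS DELAY CMD... : run CMD up to ATTEMPTS times,
# sleeping DELAY seconds between failed attempts.
retry() {
  attempts=$1; shift
  delay=$1; shift
  n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      return 1  # give up after the final attempt
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Example: retry a download up to 3 times with a 2-second pause,
# equivalent in spirit to `curl --retry 3 --retry-delay 2`.
# retry 3 2 curl -fsSL https://example.com/file -o file
```

Letting curl handle retries itself is still preferable inside a Dockerfile, since it avoids an extra layer of shell logic in each `RUN`.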

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces significant improvements to the Dockerfile by refactoring it into a multi-stage build with base, framework, and runtime stages. This is a great change that improves maintainability, reduces image size for production, and optimizes the build process. The introduction of a lightweight runtime stage is particularly valuable. Other notable improvements include adding retry logic for downloads, fixing a critical CUDA architecture compilation bug, and better organization of dependencies.

I have one suggestion to further optimize the runtime stage by combining apt operations to reduce layers and remove redundant commands. Overall, this is an excellent contribution.
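The suggestion refers to the common pattern of folding apt operations into a single `RUN` so the package index never persists in an image layer; an illustrative fragment (package names are placeholders):

```dockerfile
# One layer: update, install, and clean the apt cache together,
# so the package lists from `apt-get update` never land in the image.
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        libnuma1 zsh \
    && rm -rf /var/lib/apt/lists/*
```

Splitting these across multiple `RUN` commands would bake the intermediate apt metadata into its own layer, inflating the runtime image the PR is trying to shrink.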

@slin1237 (Collaborator)

PR looks good to me. We are currently using this image to launch the router and the engine in the same container, and I think some other people are doing this too. Since we are also releasing a framework image, just make sure to note this on the release page by the time we do the release.

@nv-tusharma left a comment

I've left a few comments, so feel free to check those out, but overall LGTM.

@Fridge003 (Collaborator)

We need to add some explanation of the runtime image here:
https://github.com/sgl-project/sglang/blob/main/docs/get_started/install.md

@github-actions bot added the `documentation` label (Improvements or additions to documentation) Dec 5, 2025
@ishandhanani ishandhanani merged commit 498ea41 into main Dec 5, 2025
45 checks passed
@ishandhanani ishandhanani deleted the ishan/dockerfile-opt branch December 5, 2025 08:28
@hnyls2002 (Collaborator)

In the new Docker image:

    $ zsh
    bash: zsh: command not found

yuchengz816-bot pushed a commit to yuchengz816-bot/sglang that referenced this pull request Dec 8, 2025
Kevin-XiongC pushed a commit to novitalabs/sglang that referenced this pull request Dec 9, 2025
dcampora pushed a commit to dcampora/sglang that referenced this pull request Dec 15, 2025
GuoYechang pushed a commit to GuoYechang/sglang that referenced this pull request Jan 13, 2026
ZhengdQin added a commit to ZhengdQin/sglang that referenced this pull request Jan 16, 2026
Co-authored-by: khalilzhk <[email protected]>
Co-authored-by: Zhiyu <[email protected]>
Co-authored-by: wentx <[email protected]>
Co-authored-by: Nicholas <[email protected]>
Co-authored-by: Binyao Jiang <[email protected]>
Co-authored-by: yhyang201 <[email protected]>
Co-authored-by: Muqi Li <[email protected]>
Co-authored-by: Xinyuan Tong <[email protected]>
Co-authored-by: Prozac614 <[email protected]>
Co-authored-by: Yibo Cai <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: yctseng0211 <[email protected]>
Co-authored-by: Francis <[email protected]>
Co-authored-by: PiteXChen <[email protected]>
Co-authored-by: Zhiqiang Xie <[email protected]>
Co-authored-by: kk <[email protected]>
Co-authored-by: Jimmy <[email protected]>
Co-authored-by: Even Zhou <[email protected]>
Co-authored-by: Yineng Zhang <[email protected]>
Co-authored-by: Byron Hsu <[email protected]>
Co-authored-by: kun-llfl <[email protected]>
Co-authored-by: zhanghaotong <[email protected]>
Co-authored-by: yrk111222 <[email protected]>
Co-authored-by: yudian0504 <[email protected]>
Co-authored-by: Douglas Yang <[email protected]>
Co-authored-by: Ethan (Yusheng) Su <[email protected]>
Co-authored-by: Beichen-Ma <[email protected]>
Co-authored-by: MingxuZh <[email protected]>
Co-authored-by: ShawnY112358 <[email protected]>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: TomerBN-Nvidia <[email protected]>
Co-authored-by: Peng Zhang <[email protected]>
Co-authored-by: Hecate0821 <[email protected]>
Co-authored-by: eternally-z <[email protected]>
Co-authored-by: Wilboludriver <[email protected]>
Co-authored-by: Wilbolu <[email protected]>
Co-authored-by: Ke Bao <[email protected]>
Co-authored-by: ybyang <[email protected]>
Co-authored-by: liupeng374 <[email protected]>
Co-authored-by: Li Jinliang <[email protected]>
Co-authored-by: Liangsheng Yin <[email protected]>
Co-authored-by: Jue Wang <[email protected]>
Co-authored-by: Praneth Paruchuri <[email protected]>
Co-authored-by: Siyuan Chen <[email protected]>
Co-authored-by: michael-amd <[email protected]>
Co-authored-by: Trang Do <[email protected]>
Co-authored-by: Cheng Wan <[email protected]>
Co-authored-by: yuchengz816-bot <[email protected]>
Co-authored-by: Runkai Tao <[email protected]>
Co-authored-by: Kangyan-Zhou <[email protected]>
Co-authored-by: Tiance Wang <[email protected]>
Co-authored-by: wangtiance <[email protected]>
Co-authored-by: shicanwei.scw <[email protected]>
Co-authored-by: Shangming Cai <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: liupeng374 <[email protected]>
ZhengdQin added a commit to ZhengdQin/sglang that referenced this pull request Jan 16, 2026
Labels

documentation Improvements or additions to documentation
6 participants