feat: add dockerfile for gb200 + mooncake by ishandhanani · Pull Request #7556 · sgl-project/sglang

ishandhanani · 2025-06-26T08:21:23Z

TODO

E2E test on GB200 (requires IBGDA)
Confirm that I am installing NVSHMEM correctly, specifically this line: -DCMAKE_CUDA_ARCHITECTURES=100
Is there a Blackwell equivalent to #RUN ln -sf /usr/lib/x86_64-linux-gnu/libmlx5.so.1 /usr/lib/x86_64-linux-gnu/libmlx5.so?

Update 6/30

I can send a single request and get output, but the next request crashes. This is with the NIXL backend.

Update 7/10

I can now send multiple requests! No perf benchmarking yet.

Update 7/14

The SGL team updated the instructions. Dockerfile has been updated to a pinned version of SGL + DeepGEMM

gemini-code-assist

Summary of Changes

This pull request's primary objective is to provide a dedicated Docker image for the NVIDIA GB200 platform. It sets up a comprehensive and specialized development environment, integrating various system libraries, Python frameworks, and high-performance computing tools like NVSHMEM and DeepEP, to facilitate the development and deployment of AI and HPC applications optimized for the GB200 architecture.

Highlights

New Dockerfile for GB200: This pull request introduces a new Dockerfile located at docker/Dockerfile.gb200, specifically designed to set up a development environment tailored for the NVIDIA GB200 platform.
Comprehensive Environment Setup: The Dockerfile establishes a robust environment starting from nvidia/cuda:12.8.0-cudnn-devel-ubuntu22.04, installing a wide array of system dependencies, Python tools, and specialized libraries required for high-performance computing and deep learning.
Integration of HPC/AI Frameworks: Key components like sglang, sgl_kernel, GDRCopy, NVSHMEM, and DeepEP are cloned, built, and configured within the Docker image, indicating a focus on advanced AI model development and efficient GPU communication.
Python Package Management with UV: The Dockerfile leverages uv for Python package installation and virtual environment management, ensuring efficient and reproducible dependency resolution for Python components, including nightly PyTorch builds.

gemini-code-assist

Code Review

This pull request introduces a new Dockerfile specifically tailored for the GB200 architecture, building upon a CUDA 12.8 base image. The changes include comprehensive system dependency installations, setup of a Python virtual environment using uv, and the cloning and building of sglang, PyTorch, sgl_kernel, NVSHMEM, and DeepEP. The review identifies several areas for improvement related to Dockerfile best practices, including removing unused arguments, enhancing security by avoiding direct curl | sh execution, optimizing image size by removing unnecessary DKMS packages, and improving maintainability by consolidating source code modifications into patches rather than using sed.

docker/Dockerfile.gb200

yicwang · 2025-07-08T17:07:46Z

docker/Dockerfile.gb200

+
+# --- Fix DeepEp IGBDA symlink ---
+# Not sure what the alternative here is
+#RUN ln -sf /usr/lib/x86_64-linux-gnu/libmlx5.so.1 /usr/lib/x86_64-linux-gnu/libmlx5.so


This path is x86_64-specific, so it is probably not needed on GB200 (aarch64). On my setup the symlink is already correct out of the box:
root@GPU-195:/usr/lib/aarch64-linux-gnu# ll libmlx*
-rw-r--r-- 1 root root 77686 Sep 8 2024 libmlx4.a
lrwxrwxrwx 1 root root 12 Sep 8 2024 libmlx4.so -> libmlx4.so.1
lrwxrwxrwx 1 root root 19 Sep 8 2024 libmlx4.so.1 -> libmlx4.so.1.0.54.0
-rw-r--r-- 1 root root 47032 Sep 8 2024 libmlx4.so.1.0.54.0
-rw-r--r-- 1 root root 777254 Sep 8 2024 libmlx5.a
lrwxrwxrwx 1 root root 12 Sep 8 2024 libmlx5.so -> libmlx5.so.1
lrwxrwxrwx 1 root root 20 Sep 8 2024 libmlx5.so.1 -> libmlx5.so.1.25.54.0
-rw-r--r-- 1 root root 520576 Sep 8 2024 libmlx5.so.1.25.54.0
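
If the symlink step is ever needed again, one way to avoid hard-coding the x86_64 path is to pick the library directory from the machine architecture and create the link only when it is missing. This is a minimal sketch, not what the PR does: the function names `multiarch_libdir` and `ensure_unversioned_symlink` are hypothetical, and the only directories it knows about are the two that appear in this thread.

```shell
#!/bin/sh
# Hypothetical helper: map a machine name (as printed by `uname -m`)
# to the multiarch library directory seen in this thread.
multiarch_libdir() {
    case "$1" in
        x86_64)  echo "/usr/lib/x86_64-linux-gnu" ;;
        aarch64) echo "/usr/lib/aarch64-linux-gnu" ;;
        *)       echo "unsupported architecture: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: given a directory and a versioned soname such as
# libmlx5.so.1, create the unversioned link (libmlx5.so) if it is absent.
ensure_unversioned_symlink() {
    dir="$1"
    soname="$2"
    link="${soname%.*}"          # strip the last ".N": libmlx5.so.1 -> libmlx5.so
    if [ -e "$dir/$link" ]; then
        return 0                 # already present, nothing to do
    fi
    if [ ! -e "$dir/$soname" ]; then
        echo "missing $dir/$soname" >&2
        return 1
    fi
    ln -s "$soname" "$dir/$link" # relative link, like the ones in the listing
}
```

It could be called as `ensure_unversioned_symlink "$(multiarch_libdir "$(uname -m)")" libmlx5.so.1`. On the aarch64 setup listed above this would do nothing, since libmlx5.so already exists there.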

yicwang · 2025-07-08T18:02:53Z

docker/Dockerfile.gb200

+    make -j && \
+    make -j install-strip && \
+    ldconfig
+


Maybe we should add a cleanup once everything is installed?
rm -rf /usr/local/src/ucx

yicwang · 2025-07-08T21:49:20Z

Were you able to compile NIXL successfully? I tried different combinations (NIXL master/0.3.1, UCX 1.8.0, 1.8.1, 1.9.0-rc1) and got the same error each time:

FAILED: subprojects/abseil-cpp-20240722.0/libabsl_time.a.p/absl_time_duration.cc.o
ccache c++ -Isubprojects/abseil-cpp-20240722.0/libabsl_time.a.p -Isubprojects/abseil-cpp-20240722.0 -I../subprojects/abseil-cpp-20240722.0 -fdiagnostics-color=always -D_GLIBCXX_ASSERTIONS=1 -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -Werror -std=c++17 -O3 -Wno-sign-compare -Wno-error=maybe-uninitialized -Wno-maybe-uninitialized -fPIC -MD -MQ subprojects/abseil-cpp-20240722.0/libabsl_time.a.p/absl_time_duration.cc.o -MF subprojects/abseil-cpp-20240722.0/libabsl_time.a.p/absl_time_duration.cc.o.d -o subprojects/abseil-cpp-20240722.0/libabsl_time.a.p/absl_time_duration.cc.o -c ../subprojects/abseil-cpp-20240722.0/absl/time/duration.cc
In file included from /usr/include/c++/12/string:40,
                 from ../subprojects/abseil-cpp-20240722.0/absl/time/duration.cc:66:
In static member function ‘static std::char_traits<char>::char_type* std::char_traits<char>::copy(char_type*, const char_type*, std::size_t)’,
    inlined from ‘static void std::__cxx11::basic_string<_CharT, _Traits, _Alloc>::_S_copy(_CharT*, const _CharT*, size_type) [with _CharT = char; _Traits = std::char_traits<char>; _Alloc = std::allocator<char>]’ at /usr/include/c++/12/bits/basic_string.h:431:21,
    inlined from ‘std::__cxx11::basic_string<_CharT, _Traits, _Allocator>& std::__cxx11::basic_string<_CharT, _Traits, _Alloc>::_M_replace(size_type, size_type, const _CharT*, size_type) [with _CharT = char; _Traits = std::char_traits<char>; _Alloc = std::allocator<char>]’ at /usr/include/c++/12/bits/basic_string.tcc:532:22,
    inlined from ‘std::__cxx11::basic_string<_CharT, _Traits, _Alloc>& std::__cxx11::basic_string<_CharT, _Traits, _Alloc>::assign(const _CharT*) [with _CharT = char; _Traits = std::char_traits<char>; _Alloc = std::allocator<char>]’ at /usr/include/c++/12/bits/basic_string.h:1655:19,
    inlined from ‘std::__cxx11::basic_string<_CharT, _Traits, _Alloc>& std::__cxx11::basic_string<_CharT, _Traits, _Alloc>::operator=(const _CharT*) [with _CharT = char; _Traits = std::char_traits<char>; _Alloc = std::allocator<char>]’ at /usr/include/c++/12/bits/basic_string.h:823:28,
    inlined from ‘std::string absl::lts_20240722::FormatDuration(Duration)’ at ../subprojects/abseil-cpp-20240722.0/absl/time/duration.cc:802:9:
/usr/include/c++/12/bits/char_traits.h:435:56: error: ‘void* __builtin_memcpy(void*, const void*, long unsigned int)’ accessing 9223372036854775810 or more bytes at offsets [2, 9223372036854775807] and 1 may overlap up to 9223372036854775813 bytes at offset -3 [-Werror=restrict]
  435 |         return static_cast<char_type*>(__builtin_memcpy(__s1, __s2, __n));
      |                                        ~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~
cc1plus: all warnings being treated as errors

ishandhanani · 2025-07-11T00:11:13Z

Hm, I have not hit this myself, sorry. This is a dumb question, but are you sure you are building for ARM correctly?

yicwang · 2025-07-11T05:25:52Z

I built it on the host, and ended up passing "-Dwarning_level=0" to meson as a workaround. Inside the container, though, things just work as expected. BTW, I think it might be better to use "git checkout 0.3.1", since that is now an official NIXL tag.

Thanks very much!

Swipe4057 · 2025-08-05T19:48:16Z

Try to build the image with the improvements that I provided in this PR:
#8761

I'm not a reviewer, but I have a few questions:

Maybe it's better to change the default argument to CUDA_VERSION=12.9.1?
The image installs the nvidia-dkms-550 driver, which supports CUDA versions no newer than 12.4. Maybe it's better to upgrade to nvidia-dkms-575?
On Ubuntu 22.04, some apt packages are already no longer updated and have reached end-of-life, and the Python version is 3.10. I tried upgrading to Ubuntu 24.04 on my side, and you could try that too!
It might resolve some of your image build errors.

init untested

5b09a48

ishandhanani requested review from ByronHsu, HaiShaw and zhyncs as code owners June 26, 2025 08:21

gemini-code-assist bot reviewed Jun 26, 2025

vim

14ffdb3

ishandhanani mentioned this pull request Jul 1, 2025 in [Roadmap] Blackwell Support and Optimizations #7227 (closed, 7 tasks)

bump

87f3b76

yicwang reviewed Jul 8, 2025

bump

23755e8

ishandhanani changed the title from "[WIP] feat: add dockerfile for gb200" to "feat: add dockerfile for gb200 + NIXL" on Jul 11, 2025

dockerfile is now for mooncake for repro

a3c148b

ishandhanani changed the title from "feat: add dockerfile for gb200 + NIXL" to "feat: add dockerfile for gb200 + mooncake" on Jul 16, 2025

Merge branch 'sgl-project:main' into ishan/gbdockerfile

db26dd3

ishandhanani closed this Aug 11, 2025

feat: add dockerfile for gb200 + mooncake #7556
ishandhanani wants to merge 6 commits into sgl-project:main from ishandhanani:ishan/gbdockerfile

3 participants
