
feat: add dockerfile for gb200 + mooncake #7556

Closed
ishandhanani wants to merge 6 commits into sgl-project:main from ishandhanani:ishan/gbdockerfile
Conversation

@ishandhanani
Collaborator

@ishandhanani ishandhanani commented Jun 26, 2025

TODO

  • E2E test on GB200 (requires IBGDA)
  • Confirm I am installing NVSHMEM correctly, specifically this line: -DCMAKE_CUDA_ARCHITECTURES=100
  • Is there a Blackwell equivalent to #RUN ln -sf /usr/lib/x86_64-linux-gnu/libmlx5.so.1 /usr/lib/x86_64-linux-gnu/libmlx5.so?
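
For the architecture question, a minimal sketch, assuming the target GPU reports compute capability 10.0 (which is what Blackwell B200/GB200 parts report, so 100 looks right): the value CMake expects is the capability with the dot removed. `cap_to_arch` is a hypothetical helper, and the `nvidia-smi` query in the comments assumes a reasonably recent driver.

```shell
# Convert a compute capability such as "10.0" into the value CMake expects
# in CMAKE_CUDA_ARCHITECTURES ("100"). cap_to_arch is a hypothetical helper.
cap_to_arch() {
    printf '%s\n' "$1" | tr -d '.'
}

# On a machine with a GPU, the capability can be queried instead of hard-coded:
#   cap="$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n1)"
#   cmake -S . -B build -DCMAKE_CUDA_ARCHITECTURES="$(cap_to_arch "$cap")"
cap_to_arch "10.0"   # prints 100
```

Inside a Dockerfile build there is usually no GPU visible, so hard-coding 100 as the PR does is the simpler choice; the query is mainly useful for checking the value on the target host.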

Update 6/30

  • I can send a single request and get output, but the next request crashes. This is with the NIXL backend.

Update 7/10

  • Can send multiple requests in! No perf benchmarking yet

Update 7/14

  • The SGL team updated the instructions. Dockerfile has been updated to a pinned version of SGL + DeepGEMM

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @ishandhanani, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request's primary objective is to provide a dedicated Docker image for the NVIDIA GB200 platform. It sets up a comprehensive and specialized development environment, integrating various system libraries, Python frameworks, and high-performance computing tools like NVSHMEM and DeepEP, to facilitate the development and deployment of AI and HPC applications optimized for the GB200 architecture.

Highlights

  • New Dockerfile for GB200: This pull request introduces a new Dockerfile located at docker/Dockerfile.gb200, specifically designed to set up a development environment tailored for the NVIDIA GB200 platform.
  • Comprehensive Environment Setup: The Dockerfile establishes a robust environment starting from nvidia/cuda:12.8.0-cudnn-devel-ubuntu22.04, installing a wide array of system dependencies, Python tools, and specialized libraries required for high-performance computing and deep learning.
  • Integration of HPC/AI Frameworks: Key components like sglang, sgl_kernel, GDRCopy, NVSHMEM, and DeepEP are cloned, built, and configured within the Docker image, indicating a focus on advanced AI model development and efficient GPU communication.
  • Python Package Management with UV: The Dockerfile leverages uv for Python package installation and virtual environment management, ensuring efficient and reproducible dependency resolution for Python components, including nightly PyTorch builds.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new Dockerfile specifically tailored for the GB200 architecture, building upon a CUDA 12.8 base image. The changes include comprehensive system dependency installations, setup of a Python virtual environment using uv, and the cloning and building of sglang, PyTorch, sgl_kernel, NVSHMEM, and DeepEP. The review identifies several areas for improvement related to Dockerfile best practices, including removing unused arguments, enhancing security by avoiding direct curl | sh execution, optimizing image size by removing unnecessary DKMS packages, and improving maintainability by consolidating source code modifications into patches rather than using sed.
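
One way to address the `curl | sh` point is to download the installer to a file, check it against a pinned checksum, and only then run it. The sketch below is an assumption about how that could look, not what the PR does: `verify_sha256` is a hypothetical helper, and the URL and checksum variables in the comments are placeholders.

```shell
# verify_sha256 FILE EXPECTED_HEX: succeed only if FILE hashes to EXPECTED_HEX.
# Hypothetical helper; relies on sha256sum from coreutils.
verify_sha256() {
    actual="$(sha256sum "$1" | awk '{print $1}')"
    [ "$actual" = "$2" ]
}

# Intended use inside a RUN step (URL and checksum are placeholders):
#   curl -fsSL -o /tmp/install.sh "$INSTALLER_URL"
#   verify_sha256 /tmp/install.sh "$INSTALLER_SHA256" && sh /tmp/install.sh
```

Pinning the checksum also makes the build reproducible, since a changed upstream script fails the build instead of silently altering the image.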


# --- Fix DeepEp IGBDA symlink ---
# Not sure what the alternative here is
#RUN ln -sf /usr/lib/x86_64-linux-gnu/libmlx5.so.1 /usr/lib/x86_64-linux-gnu/libmlx5.so
Contributor


This path is x86_64-specific, so it is probably not needed on GB200 (aarch64). On my setup, the symlinks are already correct out of the box:
root@GPU-195:/usr/lib/aarch64-linux-gnu# ll libmlx*
-rw-r--r-- 1 root root 77686 Sep 8 2024 libmlx4.a
lrwxrwxrwx 1 root root 12 Sep 8 2024 libmlx4.so -> libmlx4.so.1
lrwxrwxrwx 1 root root 19 Sep 8 2024 libmlx4.so.1 -> libmlx4.so.1.0.54.0
-rw-r--r-- 1 root root 47032 Sep 8 2024 libmlx4.so.1.0.54.0
-rw-r--r-- 1 root root 777254 Sep 8 2024 libmlx5.a
lrwxrwxrwx 1 root root 12 Sep 8 2024 libmlx5.so -> libmlx5.so.1
lrwxrwxrwx 1 root root 20 Sep 8 2024 libmlx5.so.1 -> libmlx5.so.1.25.54.0
-rw-r--r-- 1 root root 520576 Sep 8 2024 libmlx5.so.1.25.54.0
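
If the symlink step is kept at all, it could be made architecture-aware and idempotent. This is a minimal sketch under stated assumptions: both helper names are hypothetical, only the two multiarch directories seen in this thread are covered, and the link is created only when it is missing.

```shell
# multiarch_libdir ARCH: map a `uname -m` value to the Debian/Ubuntu multiarch
# library directory. Hypothetical helper covering only the two cases discussed.
multiarch_libdir() {
    case "$1" in
        x86_64)  echo "/usr/lib/x86_64-linux-gnu" ;;
        aarch64) echo "/usr/lib/aarch64-linux-gnu" ;;
        *)       echo "unsupported architecture: $1" >&2; return 1 ;;
    esac
}

# ensure_mlx5_symlink LIBDIR: create libmlx5.so only when it is missing and
# libmlx5.so.1 exists, so a correct out-of-the-box setup is left untouched.
ensure_mlx5_symlink() {
    if [ ! -e "$1/libmlx5.so" ] && [ -e "$1/libmlx5.so.1" ]; then
        ln -s libmlx5.so.1 "$1/libmlx5.so"
    fi
}

# Usage: ensure_mlx5_symlink "$(multiarch_libdir "$(uname -m)")"
```

On a system like the one listed above, where libmlx5.so already points at libmlx5.so.1, the call is a no-op.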

make -j && \
make -j install-strip && \
ldconfig

Contributor


Maybe we should add a cleanup once everything is installed?
rm -rf /usr/local/src/ucx
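
One detail worth noting: the cleanup only shrinks the image if it runs in the same RUN step as the build, because files deleted in a later layer still exist in the earlier one. A minimal sketch, assuming the UCX sources live at the path mentioned above; `cleanup_sources` is a hypothetical helper, and a plain `rm -rf` appended to the same RUN chain is equivalent.

```shell
# cleanup_sources DIR...: remove build trees that are no longer needed.
# Hypothetical helper; refuses empty or "/" arguments as a safety net.
cleanup_sources() {
    for dir in "$@"; do
        case "$dir" in
            ""|"/") echo "refusing to remove '$dir'" >&2; return 1 ;;
        esac
        rm -rf -- "$dir"
    done
}

# In the Dockerfile the same effect comes from extending the existing chain:
#   make -j && make -j install-strip && ldconfig && rm -rf /usr/local/src/ucx
```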

@yicwang
Contributor

yicwang commented Jul 8, 2025

Are you able to get NIXL to compile successfully? I tried different combinations (NIXL master/0.3.1, UCX 1.8.0, 1.8.1, 1.9.0-rc1), and they all hit the same error:

FAILED: subprojects/abseil-cpp-20240722.0/libabsl_time.a.p/absl_time_duration.cc.o
ccache c++ -Isubprojects/abseil-cpp-20240722.0/libabsl_time.a.p -Isubprojects/abseil-cpp-20240722.0 -I../subprojects/abseil-cpp-20240722.0 -fdiagnostics-color=always -D_GLIBCXX_ASSERTIONS=1 -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -Werror -std=c++17 -O3 -Wno-sign-compare -Wno-error=maybe-uninitialized -Wno-maybe-uninitialized -fPIC -MD -MQ subprojects/abseil-cpp-20240722.0/libabsl_time.a.p/absl_time_duration.cc.o -MF subprojects/abseil-cpp-20240722.0/libabsl_time.a.p/absl_time_duration.cc.o.d -o subprojects/abseil-cpp-20240722.0/libabsl_time.a.p/absl_time_duration.cc.o -c ../subprojects/abseil-cpp-20240722.0/absl/time/duration.cc
In file included from /usr/include/c++/12/string:40,
                 from ../subprojects/abseil-cpp-20240722.0/absl/time/duration.cc:66:
In static member function ‘static std::char_traits<char>::char_type* std::char_traits<char>::copy(char_type*, const char_type*, std::size_t)’,
    inlined from ‘static void std::__cxx11::basic_string<_CharT, _Traits, _Alloc>::_S_copy(_CharT*, const _CharT*, size_type) [with _CharT = char; _Traits = std::char_traits<char>; _Alloc = std::allocator<char>]’ at /usr/include/c++/12/bits/basic_string.h:431:21,
    inlined from ‘std::__cxx11::basic_string<_CharT, _Traits, _Allocator>& std::__cxx11::basic_string<_CharT, _Traits, _Alloc>::_M_replace(size_type, size_type, const _CharT*, size_type) [with _CharT = char; _Traits = std::char_traits<char>; _Alloc = std::allocator<char>]’ at /usr/include/c++/12/bits/basic_string.tcc:532:22,
    inlined from ‘std::__cxx11::basic_string<_CharT, _Traits, _Alloc>& std::__cxx11::basic_string<_CharT, _Traits, _Alloc>::assign(const _CharT*) [with _CharT = char; _Traits = std::char_traits<char>; _Alloc = std::allocator<char>]’ at /usr/include/c++/12/bits/basic_string.h:1655:19,
    inlined from ‘std::__cxx11::basic_string<_CharT, _Traits, _Alloc>& std::__cxx11::basic_string<_CharT, _Traits, _Alloc>::operator=(const _CharT*) [with _CharT = char; _Traits = std::char_traits<char>; _Alloc = std::allocator<char>]’ at /usr/include/c++/12/bits/basic_string.h:823:28,
    inlined from ‘std::string absl::lts_20240722::FormatDuration(Duration)’ at ../subprojects/abseil-cpp-20240722.0/absl/time/duration.cc:802:9:
/usr/include/c++/12/bits/char_traits.h:435:56: error: ‘void* __builtin_memcpy(void*, const void*, long unsigned int)’ accessing 9223372036854775810 or more bytes at offsets [2, 9223372036854775807] and 1 may overlap up to 9223372036854775813 bytes at offset -3 [-Werror=restrict]
  435 |         return static_cast<char_type*>(__builtin_memcpy(__s1, __s2, __n));
      |                                        ~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~
cc1plus: all warnings being treated as errors

@ishandhanani changed the title from "[WIP] feat: add dockerfile for gb200" to "feat: add dockerfile for gb200 + NIXL" on Jul 11, 2025
@ishandhanani
Collaborator Author

> Are you able to get NIXL to compile successfully? I tried different combinations (NIXL master/0.3.1, UCX 1.8.0, 1.8.1, 1.9.0-rc1), and they all hit the same error (abseil-cpp's duration.cc failing with -Werror=restrict).

Hm - I have not hit this myself, sorry. This may be a dumb question, but are you sure you are building for ARM correctly?

@yicwang
Contributor

yicwang commented Jul 11, 2025

> Hm - I have not hit this myself, sorry. This may be a dumb question, but are you sure you are building for ARM correctly?

I built it on the host and ended up passing "-Dwarning_level=0" to meson as a workaround. Inside the container, though, things just work as expected. BTW, I think it might be better to use "git checkout 0.3.1", as it is now an official NIXL tag.

Thanks very much!
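
The host workaround can be sketched as follows. The failing diagnostic in the log is -Wrestrict, which GCC enables through -Wall, so dropping meson's warning level to 0 removes it even though -Werror stays on. `nixl_meson_flags` is a hypothetical helper, and the checkout directory name in the comments is an assumption.

```shell
# nixl_meson_flags WHERE: print extra meson flags for a NIXL build.
# Hypothetical helper: "host" applies the workaround reported above,
# anything else (for example "container") adds nothing.
nixl_meson_flags() {
    if [ "$1" = "host" ]; then
        echo "-Dwarning_level=0"
    fi
}

# Sketch of the build itself (the checkout directory name is an assumption):
#   cd nixl && git checkout 0.3.1
#   meson setup build $(nixl_meson_flags host)
#   ninja -C build
nixl_meson_flags host   # prints -Dwarning_level=0
```

Since the container build reportedly works without the flag, keeping the workaround out of the Dockerfile and applying it only on affected hosts seems the safer default.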

@ishandhanani changed the title from "feat: add dockerfile for gb200 + NIXL" to "feat: add dockerfile for gb200 + mooncake" on Jul 16, 2025
@Swipe4057
Contributor

Try to build the image with the improvements that I provided in this PR:
#8761

I'm not a reviewer, but I have a few questions:

  • Maybe it's better to change the default argument to CUDA_VERSION=12.9.1?
  • The image installs the nvidia-dkms-550 driver, which supports CUDA versions no newer than 12.4. Maybe it's better to upgrade to nvidia-dkms-575?
  • On Ubuntu 22.04, some apt packages are no longer updated and have reached end-of-life, and the Python version is 3.10. I tried upgrading to Ubuntu 24.04 on my side, and you could try that too! It might resolve some of your image build errors.
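
To try these suggestions without editing the Dockerfile, the versions could be passed at build time. A minimal sketch, assuming the Dockerfile exposes CUDA_VERSION as a build ARG (as the comment above implies) and that the base image follows the tag pattern quoted in the PR summary; `cuda_base_image` is a hypothetical helper, and whether a given tag exists should be checked on the registry.

```shell
# cuda_base_image CUDA_VERSION UBUNTU_VERSION: build a base image reference
# following the nvidia/cuda tag pattern quoted in the PR summary.
# Hypothetical helper; it does not check that the tag is actually published.
cuda_base_image() {
    echo "nvidia/cuda:$1-cudnn-devel-ubuntu$2"
}

# Example build invocation (assumes CUDA_VERSION is declared as an ARG):
#   docker build -f docker/Dockerfile.gb200 --build-arg CUDA_VERSION=12.9.1 .
cuda_base_image 12.9.1 24.04   # prints nvidia/cuda:12.9.1-cudnn-devel-ubuntu24.04
```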
