feat: add dockerfile for gb200 + mooncake #7556
ishandhanani wants to merge 6 commits into sgl-project:main from ishandhanani:ishan/gbdockerfile
Conversation
Summary of Changes
Hello @ishandhanani, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request's primary objective is to provide a dedicated Docker image for the NVIDIA GB200 platform. It sets up a comprehensive and specialized development environment, integrating various system libraries, Python frameworks, and high-performance computing tools like NVSHMEM and DeepEP, to facilitate the development and deployment of AI and HPC applications optimized for the GB200 architecture.
Highlights
- **New Dockerfile for GB200:** This pull request introduces a new Dockerfile at `docker/Dockerfile.gb200`, specifically designed to set up a development environment tailored for the NVIDIA GB200 platform.
- **Comprehensive Environment Setup:** The Dockerfile establishes a robust environment starting from `nvidia/cuda:12.8.0-cudnn-devel-ubuntu22.04`, installing a wide array of system dependencies, Python tools, and specialized libraries required for high-performance computing and deep learning.
- **Integration of HPC/AI Frameworks:** Key components such as `sglang`, `sgl_kernel`, GDRCopy, NVSHMEM, and DeepEP are cloned, built, and configured within the Docker image, indicating a focus on advanced AI model development and efficient GPU communication.
- **Python Package Management with UV:** The Dockerfile leverages `uv` for Python package installation and virtual environment management, ensuring efficient and reproducible dependency resolution for Python components, including nightly PyTorch builds.
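To make the last two highlights concrete, a minimal sketch of a uv-managed environment with a nightly PyTorch install might look like the following. The venv path, Python version, and nightly index URL here are illustrative assumptions, not values taken from this PR:

```dockerfile
# Sketch only: venv path, Python version, and nightly index URL are
# assumptions for illustration, not copied from docker/Dockerfile.gb200.
RUN pip install --no-cache-dir uv && \
    uv venv /opt/venv --python 3.10
ENV PATH="/opt/venv/bin:$PATH"
# Install nightly PyTorch wheels from the CUDA 12.8 nightly index
RUN uv pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128
```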
Code Review
This pull request introduces a new Dockerfile specifically tailored for the GB200 architecture, building upon a CUDA 12.8 base image. The changes include comprehensive system dependency installations, setup of a Python virtual environment using uv, and the cloning and building of sglang, PyTorch, sgl_kernel, NVSHMEM, and DeepEP. The review identifies several areas for improvement related to Dockerfile best practices, including removing unused arguments, enhancing security by avoiding direct curl | sh execution, optimizing image size by removing unnecessary DKMS packages, and improving maintainability by consolidating source code modifications into patches rather than using sed.
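On the `curl | sh` point raised in the review, one hedged alternative is to install `uv` from PyPI with a pinned version instead of piping a remote installer script into a shell. The exact version pin below is a placeholder assumption:

```dockerfile
# Install uv from PyPI instead of `curl ... | sh`, so the artifact comes
# from a package index and can be version-pinned. The pin shown here is a
# placeholder assumption; use a known-good release in practice.
RUN pip install --no-cache-dir "uv==0.7.*"
```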
docker/Dockerfile.gb200 (Outdated)

```dockerfile
# --- Fix DeepEp IGBDA symlink ---
# Not sure what the alternative here is
#RUN ln -sf /usr/lib/x86_64-linux-gnu/libmlx5.so.1 /usr/lib/x86_64-linux-gnu/libmlx5.so
```
This is an x86_64 path - is it needed on GB200 at all? On my setup, the symlink seems to be correct out of the box:
```
root@GPU-195:/usr/lib/aarch64-linux-gnu# ll libmlx*
-rw-r--r-- 1 root root  77686 Sep  8  2024 libmlx4.a
lrwxrwxrwx 1 root root     12 Sep  8  2024 libmlx4.so -> libmlx4.so.1
lrwxrwxrwx 1 root root     19 Sep  8  2024 libmlx4.so.1 -> libmlx4.so.1.0.54.0
-rw-r--r-- 1 root root  47032 Sep  8  2024 libmlx4.so.1.0.54.0
-rw-r--r-- 1 root root 777254 Sep  8  2024 libmlx5.a
lrwxrwxrwx 1 root root     12 Sep  8  2024 libmlx5.so -> libmlx5.so.1
lrwxrwxrwx 1 root root     20 Sep  8  2024 libmlx5.so.1 -> libmlx5.so.1.25.54.0
-rw-r--r-- 1 root root 520576 Sep  8  2024 libmlx5.so.1.25.54.0
```
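One way to sidestep the hardcoded x86_64 path discussed above is to create the dev symlink only when it is actually missing, deriving the multiarch lib directory from the current architecture. This is a sketch assuming the Debian/Ubuntu multiarch layout; the function name is hypothetical:

```shell
# Sketch (assumption: Debian/Ubuntu multiarch layout under /usr/lib).
# Creates the unversioned libmlx5.so dev symlink only when the runtime
# library exists and the symlink does not, instead of unconditionally
# forcing an x86_64 path. Accepts an optional lib dir for testing.
ensure_mlx5_symlink() {
    libdir="${1:-/usr/lib/$(uname -m)-linux-gnu}"
    if [ -e "$libdir/libmlx5.so.1" ] && [ ! -e "$libdir/libmlx5.so" ]; then
        # Relative link, matching the distro's own symlink style
        ln -s "libmlx5.so.1" "$libdir/libmlx5.so"
    fi
}
```

On a GB200 (aarch64) image where the package already ships the symlink, as in the listing above, this is a no-op.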
docker/Dockerfile.gb200 (Outdated)

```dockerfile
make -j && \
make -j install-strip && \
ldconfig
```
Maybe we should add a cleanup once everything is installed?

```
rm -rf /usr/local/src/ucx
```
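Worth noting: because of layer caching, a `rm -rf` in a separate RUN does not actually shrink the image; the deletion has to happen in the same RUN instruction that created the files. A sketch, assuming the UCX sources live under `/usr/local/src/ucx`:

```dockerfile
# Build, install, and delete the UCX source tree in a SINGLE layer so the
# intermediate files never persist in the image (source path is an assumption).
RUN cd /usr/local/src/ucx && \
    make -j && \
    make -j install-strip && \
    ldconfig && \
    cd / && rm -rf /usr/local/src/ucx
```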
Are you able to get NIXL compiled successfully? I tried different combinations (NIXL master/0.3.1, UCX 1.8.0, 1.8.1, 1.9.0-rc1) and hit the same error:
Hm - I have not hit this myself, sorry. This is a dumb question, but are you sure you are building for ARM correctly?
I did it on the host and ended up passing "-Dwarning_level=0" to meson as a workaround. But inside the container, things just work as expected. BTW, I think it might be better to use "git checkout 0.3.1", as it is now an official tag for NIXL. Thanks very much!
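For reference, the tag pin and meson workaround mentioned above could be combined roughly as follows. The repo URL, source path, and build-directory layout are assumptions; adjust to NIXL's documented build instructions:

```dockerfile
# Check out the official 0.3.1 tag rather than a branch or commit hash,
# and pass -Dwarning_level=0 as the workaround reported for ARM host
# builds (URL and paths are assumptions, not taken from this PR).
RUN git clone https://github.com/ai-dynamo/nixl.git /usr/local/src/nixl && \
    cd /usr/local/src/nixl && \
    git checkout 0.3.1 && \
    meson setup build -Dwarning_level=0 && \
    ninja -C build && \
    ninja -C build install
```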
Try to build the image with the improvements that I provided in this PR:

I'm not a reviewer, but I have a few questions: maybe it's better to change the default argument to CUDA_VERSION=12.9.1?
TODO

- `-DCMAKE_CUDA_ARCHITECTURES=100`
- `#RUN ln -sf /usr/lib/x86_64-linux-gnu/libmlx5.so.1 /usr/lib/x86_64-linux-gnu/libmlx5.so`

Update 6/30
Update 7/10
Update 7/14