-
Notifications
You must be signed in to change notification settings - Fork 749
fix: install rdma libs in runtime image. #2163
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
👋 Hi karya0! Thank you for contributing to ai-dynamo/dynamo. Just a reminder: The 🚀 |
WalkthroughThe Dockerfile for the runtime image was updated to install additional packages, including Changes
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~2 minutes Possibly related PRs
Poem
Note ⚡️ Unit Test Generation is now available in beta!Learn more here, or try it out under "Finishing Touches" below. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
container/Dockerfile.vllm (2)
440-441: Pin or document thecurlversion to avoid future breakage
curlis often backward-compatible, yet sporadic regressions (e.g., HTTP/2 or TLS quirks) do happen between minor releases. Consider either:
- Pinning to the current known-good version (e.g.
curl=8.5.0-1) or- Adding a comment that runtime compatibility is manually verified on every base-image update.
That keeps the container deterministic and shields CI from upstream repo surprises.
442-444: Dev headers unnecessarily inflate the runtime image
libibverbs-dev,libibumad-dev,librdmacm-dev,libnuma-devship headers and static libs that are only needed at build time.
The runtime layer merely needs the shared libraries, which live in the non--devpackages (libibverbs1,libibumad3,librdmacm1,libnuma1).
Dropping the-devvariants trims several MB and speeds up security-scans without impacting UCX at runtime.Example adjustment:
- libibverbs-dev rdma-core ibverbs-utils libibumad-dev \ - libnuma-dev librdmacm-dev ibverbs-providers \ + libibverbs1 rdma-core ibverbs-utils libibumad3 \ + libnuma1 librdmacm1 ibverbs-providers \If headers are genuinely required (e.g., for on-device JIT of UCX plugins), clarify that in a comment to avoid accidental removal later.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
container/Dockerfile.vllm(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
container/Dockerfile.vllm (1)
Learnt from: grahamking
PR: #1177
File: container/Dockerfile.vllm:102-105
Timestamp: 2025-05-28T22:54:46.875Z
Learning: In Dockerfiles, when appending to environment variables that may not exist in the base image, Docker validation will fail if you reference undefined variables with ${VARIABLE} syntax. In such cases, setting the environment variable directly (e.g., ENV CPATH=/usr/include) rather than appending is the appropriate approach.
|
Maybe can add vim too? Found myself installing it a few times. |
Good idea. I forgot about that. Added now. |
ishandhanani
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this also required in the SGL runtime container?
CC @GuanLuo who debugged an issue with RDMA not being setup/installed correctly in the past, leading to unexpected high TTFT times. I think we're overdue for some sanity tests that all the NIXL/IB/RDMA related things are installed correctly in the container -- it seems like there is a lot of thrash on these containers where people make changes that unexpectedly break some aspect of these networking components and we don't find out until later. Any thoughts @karya0 @aranadive? |
|
The change looks reasonable to me. Given the fact that we have done the same dance in NIXL installation part, I would suggest to pull this out as "nixl_dependency" file (similar to requirements.txt in Python or just installation script similar to |
Having a install script like For now, how long before we can merge this PR? It looks like there are some unrelated trtllm failures in the gitlab ci. |
@rmccorm4 @karya0
I think we need to introduce two stages in these dockerfile with regard to NIXL and its dependencies.
This would make it easier to track nixl dependencies and add to the stages there to keep everything in one place. |
The base build images are different in all flavors: VLLM (nvcr.io/nvidia/cuda-dl-base) and TRTLLM (nvcr.io/nvidia/pytorch) SGLang (nvcr.io/nvidia/cuda-dl-base) . The runtime image is the same though: nvcr.io/nvidia/cuda Having said that, I am not an expert/user of TRTLLM/SGLang so don't know if the omission is intentional or oversight. |
|
Updated to make NIXL dependency installation/declaration similar in all three variants |
Signed-off-by: Kapil Arya <[email protected]>
Overview:
The runtime container image for VLLM is missing packages related to ibverbs, etc., which are essential for UCX to discover underlying RDMA devices. Further, curl is rather helpful in communicating with the dynamo endpoint and polling other service and their health endpoints.
The impact on container size is relatively negligible (~28 MB).
Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
curlfor improved endpoint polling capabilities.