cleanup
nv-anants committed Jul 7, 2025
commit e3bb46b6b249a292b71810db51b5b6b6ca2ad739
42 changes: 23 additions & 19 deletions container/Dockerfile.tensorrt_llm
@@ -397,8 +397,8 @@ COPY --from=build /usr/local/bin/etcd/ /usr/local/bin/etcd/
COPY --from=build /usr/local/ucx /usr/local/ucx
# Copy NIXL source from build image (required for NIXL plugins)
COPY --from=build /usr/local/nixl /usr/local/nixl
-# Copy HPCX from base image
-COPY --from=build /opt/hpcx /opt/hpcx
+# Copy OpenMPI from build image
+COPY --from=build /opt/hpcx/ompi /opt/hpcx/ompi
# Copy NUMA library from build image
COPY --from=build /usr/lib/x86_64-linux-gnu/libnuma.so* /usr/lib/x86_64-linux-gnu/

@@ -408,21 +408,22 @@ RUN uv venv $VIRTUAL_ENV --python 3.12 && \
echo "source $VIRTUAL_ENV/bin/activate" >> ~/.bashrc

# Common dependencies
# TODO: Remove extra install and use pyproject.toml to define all dependencies
RUN --mount=type=bind,source=./container/deps/requirements.txt,target=/tmp/requirements.txt \
uv pip install --requirement /tmp/requirements.txt

# Install test dependencies
-#TODO: Remove this once we have a functional dev image built on top of the runtime image
+# TODO: Remove this once we have a functional CI image built on top of the runtime image
RUN --mount=type=bind,source=./container/deps/requirements.test.txt,target=/tmp/requirements.txt \
uv pip install --requirement /tmp/requirements.txt

# Copy CUDA toolkit components needed for nvcc, cudafe, cicc etc.
Contributor:

I wonder if we will run into an issue where the BASE container is not on the same CUDA version as the RUNTIME container, thus causing issues down the line with straight copying the libs. I would highlight this as a risk rather than a hard stop.

Contributor (author):

Yes, but it's the same case with other packages. I believe torch and other packages also expect a certain version of CUDA, so they would also need to match between the BASE and RUNTIME containers. I can try using the cuda-devel image, which comes with all these CUDA binaries, but I would need to check the size impact of that.
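
A build-time guard could turn this risk into a fast failure instead of a runtime surprise. The sketch below is hypothetical: it assumes /usr/local/cuda/version.json is present (standard in NVIDIA CUDA images) and introduces a BASE_CUDA_VERSION build arg that does not exist in this Dockerfile.

```dockerfile
# Hypothetical check: fail the build if the copied CUDA toolkit's version
# does not match the version the runtime image was built against.
ARG BASE_CUDA_VERSION
RUN copied="$(grep -oP '"version"\s*:\s*"\K[0-9]+\.[0-9]+' /usr/local/cuda/version.json | head -n1)" && \
    if [ -n "$BASE_CUDA_VERSION" ] && [ "$copied" != "$BASE_CUDA_VERSION" ]; then \
        echo "CUDA mismatch: copied toolkit is $copied, expected $BASE_CUDA_VERSION" >&2; \
        exit 1; \
    fi
```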

-COPY --from=build /usr/local/cuda/bin/ /usr/local/cuda/bin/
-COPY --from=build /usr/local/cuda/include /usr/local/cuda/include
+COPY --from=build /usr/local/cuda/bin/nvcc /usr/local/cuda/bin/nvcc
+COPY --from=build /usr/local/cuda/bin/cudafe++ /usr/local/cuda/bin/cudafe++
+COPY --from=build /usr/local/cuda/bin/ptxas /usr/local/cuda/bin/ptxas
+COPY --from=build /usr/local/cuda/bin/fatbinary /usr/local/cuda/bin/fatbinary
+COPY --from=build /usr/local/cuda/include/ /usr/local/cuda/include/
COPY --from=build /usr/local/cuda/lib64/libcudart.so* /usr/local/cuda/lib64/
COPY --from=build /usr/local/cuda/lib64/libnvvm.so* /usr/local/cuda/lib64/
COPY --from=build /usr/local/cuda/lib64/libnvvmx.so* /usr/local/cuda/lib64/
COPY --from=build /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/
COPY --from=build /usr/local/cuda/nvvm /usr/local/cuda/nvvm

# Copy pytorch installation from NGC PyTorch
Contributor:

Would this all be better defined as a pip install ... rather than copying the installed packages themselves? We can still pin the dependencies, and it will be easier to read (and maintain) in the future. It also makes explicit what is unique in the trtllm container and what came from upstream. I'm open to hearing arguments on this.

I get why you did it this way to begin with: to prove we have all the packages needed for trtllm. Now we have something to compare the next iteration against to make it better.

Contributor (author):

Some of these are internal versions, so I don't think there's an easy way to pip install them. I could be wrong here.
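
For the packages that do live on public indexes, the reviewer's approach might look roughly like the sketch below; it reuses the version ARGs already defined in this Dockerfile, but whether each public wheel matches the NGC-internal build is exactly the open question, and flash-attn in particular may try to compile from source.

```dockerfile
# Hypothetical alternative to copying dist-packages: pin and install the
# publicly published packages. Internal NGC builds (torch, torchvision,
# pytorch_triton) are not covered and would still need COPY or an internal index.
RUN uv pip install \
    "jinja2==${JINJA2_VER}" \
    "networkx==${NETWORKX_VER}" \
    "sympy==${SYMPY_VER}" \
    "packaging==${PACKAGING_VER}" \
    "mpmath==${MPMATH_VER}" \
    "flash-attn==${FLASH_ATTN_VER}"
```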

@@ -435,7 +436,6 @@ ARG NETWORKX_VER=3.4.2
ARG SYMPY_VER=1.14.0
ARG PACKAGING_VER=23.2
ARG FLASH_ATTN_VER=2.7.3
-ARG MPI4PY_VER
ARG MPMATH_VER=1.3.0
COPY --from=build /usr/local/lib/lib* /usr/local/lib/
COPY --from=build /usr/local/lib/python3.12/dist-packages/torch /usr/local/lib/python3.12/dist-packages/torch
@@ -447,8 +447,8 @@ COPY --from=build /usr/local/lib/python3.12/dist-packages/torchvision.libs /usr/
COPY --from=build /usr/local/lib/python3.12/dist-packages/setuptools /usr/local/lib/python3.12/dist-packages/setuptools
COPY --from=build /usr/local/lib/python3.12/dist-packages/setuptools-${SETUPTOOLS_VER}.dist-info /usr/local/lib/python3.12/dist-packages/setuptools-${SETUPTOOLS_VER}.dist-info
COPY --from=build /usr/local/lib/python3.12/dist-packages/functorch /usr/local/lib/python3.12/dist-packages/functorch
-COPY --from=build /usr/local/lib/python3.12/dist-packages/pytorch_triton-${PYTORCH_TRITON_VER}.dist-info /usr/local/lib/python3.12/dist-packages/pytorch_triton-${PYTORCH_TRITON_VER}.dist-info
COPY --from=build /usr/local/lib/python3.12/dist-packages/triton /usr/local/lib/python3.12/dist-packages/triton
+COPY --from=build /usr/local/lib/python3.12/dist-packages/pytorch_triton-${PYTORCH_TRITON_VER}.dist-info /usr/local/lib/python3.12/dist-packages/pytorch_triton-${PYTORCH_TRITON_VER}.dist-info
COPY --from=build /usr/local/lib/python3.12/dist-packages/jinja2 /usr/local/lib/python3.12/dist-packages/jinja2
COPY --from=build /usr/local/lib/python3.12/dist-packages/jinja2-${JINJA2_VER}.dist-info /usr/local/lib/python3.12/dist-packages/jinja2-${JINJA2_VER}.dist-info
COPY --from=build /usr/local/lib/python3.12/dist-packages/networkx /usr/local/lib/python3.12/dist-packages/networkx
@@ -460,21 +460,16 @@ COPY --from=build /usr/local/lib/python3.12/dist-packages/packaging-${PACKAGING_
COPY --from=build /usr/local/lib/python3.12/dist-packages/flash_attn /usr/local/lib/python3.12/dist-packages/flash_attn
COPY --from=build /usr/local/lib/python3.12/dist-packages/flash_attn-${FLASH_ATTN_VER}.dist-info /usr/local/lib/python3.12/dist-packages/flash_attn-${FLASH_ATTN_VER}.dist-info
COPY --from=build /usr/local/lib/python3.12/dist-packages/flash_attn_2_cuda.cpython-312-*-linux-gnu.so /usr/local/lib/python3.12/dist-packages/
-COPY --from=build /usr/local/lib/python3.12/dist-packages/mpmath /usr/local/lib/python3.12/dist-packages/mpmath
-COPY --from=build /usr/local/lib/python3.12/dist-packages/mpmath-${MPMATH_VER}.dist-info /usr/local/lib/python3.12/dist-packages/mpmath-${MPMATH_VER}.dist-info
+# COPY --from=build /usr/local/lib/python3.12/dist-packages/mpmath /usr/local/lib/python3.12/dist-packages/mpmath
+# COPY --from=build /usr/local/lib/python3.12/dist-packages/mpmath-${MPMATH_VER}.dist-info /usr/local/lib/python3.12/dist-packages/mpmath-${MPMATH_VER}.dist-info

# Setup environment variables
ARG ARCH_ALT
ENV NIXL_PLUGIN_DIR=/usr/local/nixl/lib/${ARCH_ALT}-linux-gnu/plugins
ENV LD_LIBRARY_PATH=/usr/local/nixl/lib/${ARCH_ALT}-linux-gnu:/usr/local/nixl/lib/${ARCH_ALT}-linux-gnu/plugins:/usr/local/ucx/lib:/opt/hpcx/ompi/lib:$LD_LIBRARY_PATH
ENV OMPI_HOME=/opt/hpcx/ompi
-ENV PATH=/opt/hpcx/ompi/bin:/usr/local/bin/etcd/:$PATH
+ENV PATH=/opt/hpcx/ompi/bin:/usr/local/bin/etcd/:/usr/local/cuda/nvvm/bin:$PATH
ENV OPAL_PREFIX=/opt/hpcx/ompi

-#TODO: Remove this once we have a functional dev image built on top of the runtime image
-COPY . /workspace
-RUN uv pip install /workspace/benchmarks

# Install TensorRT-LLM (same as in build stage)
ARG HAS_TRTLLM_CONTEXT=0
ARG TENSORRTLLM_PIP_WHEEL="tensorrt-llm"
@@ -490,8 +485,17 @@ RUN uv pip install --index-url "${TENSORRTLLM_INDEX_URL}" \
uv pip install ai-dynamo --find-links wheelhouse && \
uv pip install nixl --find-links wheelhouse
Member:

In the future I would consider installing nixl first. I get that nixl is an optional dependency of ai-dynamo at the moment, but I would not want to get the wrong version should that change.

Contributor (author):

Good point, but my concern with installing nixl first is that if it brings in any dependency that is common and not version-controlled in ai-dynamo, the one from nixl will be installed and used. I think we should do a single install; options there could be (sketched below):

  1. A single command: pip install ai-dynamo nixl ..., or
  2. Adding an ai-dynamo[nixl] extra to the dynamo install.
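
Both options as a rough sketch, assuming the same wheelhouse layout as above; the [nixl] extra is hypothetical and would first need to be declared in ai-dynamo's pyproject.toml:

```dockerfile
# Option 1 (hypothetical): one invocation, so the resolver computes a single
# consistent set of shared dependencies for both packages.
RUN uv pip install ai-dynamo nixl --find-links wheelhouse

# Option 2 (hypothetical): declare nixl as an extra of ai-dynamo, e.g.
#   [project.optional-dependencies]
#   nixl = ["nixl"]
# and then install with:
RUN uv pip install "ai-dynamo[nixl]" --find-links wheelhouse
```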


-# Copy TensorRT-LLM environment setup script
+# Setup TRTLLM environment variables, same as in dev image
+ENV TRTLLM_USE_UCX_KVCACHE=1
COPY --from=dev /usr/local/bin/set_trtllm_env.sh /usr/local/bin/set_trtllm_env.sh
RUN echo 'source /usr/local/bin/set_trtllm_env.sh' >> /root/.bashrc

+# Copy benchmarks, examples and tests for CI
+# TODO: Remove this once we have a functional CI image built on top of the runtime image
+COPY tests /workspace/tests
+COPY benchmarks /workspace/benchmarks
+COPY examples /workspace/examples
+RUN uv pip install /workspace/benchmarks

# Copy launch banner
RUN --mount=type=bind,source=./container/launch_message.txt,target=/workspace/launch_message.txt \