
Conversation

@BatshevaBlack BatshevaBlack commented Aug 17, 2025

Summary by CodeRabbit

  • Bug Fixes

    • Improved CUDA library discovery and safer LD_LIBRARY_PATH handling; environment restored after use and persistent LD_LIBRARY_PATH export added.
  • Chores

    • Upgraded NIXL to 0.5.0 and UCX to 1.19.x with a full rebuild/reinstall.
    • CI image tags updated to use staging Triton-devel images.
  • Tests / Breaking Changes

    • Transfer tests now use a separate sync-message delivery path; related public API call pattern simplified.
    • Test config: reduced free GPU memory fraction from 0.5 to 0.25.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
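For example (illustrative only, using a pipeline id of the kind reported later in this thread):

/bot run --reuse-test 15531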

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force-run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can cause the top of tree to break.

@BatshevaBlack BatshevaBlack requested review from a team as code owners August 17, 2025 08:38
coderabbitai bot commented Aug 17, 2025

📝 Walkthrough


Bumps the NIXL and UCX installer versions and adjusts NIXL libcuda discovery and LD_LIBRARY_PATH handling; updates four LLM Jenkins image tag variables to tritondevel staging tags; moves SyncMessage delivery out of TransferRequest into a new notifySyncMessage API and updates tests; lowers free_gpu_memory_fraction in a disaggregated test config; and changes test usage of TorchLLM to a context manager.

Changes

Cohort / File(s) Summary of Changes
NIXL installer updates
docker/common/install_nixl.sh
Bump NIXL_VERSION to 0.5.0. Add OLD_LD_LIBRARY_PATH and CUDA_SO_PATH. Search /usr/local for libcuda.so.1, exit if missing. Stop creating a symlink; derive lib directory via dirname of libcuda.so.1, prepend that directory to LD_LIBRARY_PATH, export it, restore OLD_LD_LIBRARY_PATH at script end, and append an export LD_LIBRARY_PATH=... line to ENV.
UCX installer updates
docker/common/install_ucx.sh
Bump UCX_VERSION to v1.19.x. Remove the installation guard and always rm -rf the install path, then clone, build (autogen/configure/make) and install UCX. Preserve appending LD_LIBRARY_PATH updates to ENV.
Jenkins image tag updates
jenkins/current_image_tags.properties
Update four LLM-related docker image variables to tensorrt-llm-staging/tritondevel platform-specific tags, replacing prior ubuntu/rocky/pytorch or cuda-based image tags.
Transfer agent API & tests
cpp/tests/unit_tests/executor/transferAgentTest.cpp, tensorrt_llm/executor/transferAgent.h (API)
Move SyncMessage delivery out of TransferRequest: add notifySyncMessage(recipient, message) API and remove syncMessage parameter from TransferRequest constructor (now 4 args). Tests updated to call notifySyncMessage before constructing TransferRequest and to verify notified sync messages.
Integration test config
tests/integration/defs/disaggregated/test_configs/disagg_config_diff_max_tokens.yaml
Reduce free_gpu_memory_fraction from 0.5 to 0.25.
Unit test usage
tests/unittest/llmapi/test_llm_args.py
Change test_runtime_sizes to use TorchLLM as a context manager and move assertions inside the with-block.

Sequence Diagram(s)

sequenceDiagram
    participant Test
    participant AgentA
    participant AgentB

    Test->>AgentA: notifySyncMessage(recipient=AgentB, message)
    AgentA->>AgentB: deliver SyncMessage
    Test->>AgentA: create TransferRequest(op, srcDescs, dstDescs, recipient)
    AgentA->>AgentB: perform data transfer (TransferRequest)
    AgentB-->>Test: observe SyncMessage and transfer completion
sequenceDiagram
    participant User
    participant install_nixl.sh
    participant FS
    participant ENV

    User->>install_nixl.sh: run script
    install_nixl.sh->>install_nixl.sh: save OLD_LD_LIBRARY_PATH
    install_nixl.sh->>FS: search /usr/local for libcuda.so.1
    alt found
        install_nixl.sh->>install_nixl.sh: set CUDA_SO_PATH and lib dir
        install_nixl.sh->>install_nixl.sh: export LD_LIBRARY_PATH with cuda dir
        install_nixl.sh->>ENV: append export LD_LIBRARY_PATH line
        install_nixl.sh->>install_nixl.sh: restore OLD_LD_LIBRARY_PATH
    else not found
        install_nixl.sh-->>User: print "libcuda.so.1 not found" and exit
    end
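For orientation, the flow in the diagram above boils down to a shell pattern like the following. This is a minimal sketch, not the literal script contents; ARCH_NAME and ENV are assumed to be provided by the surrounding Dockerfile/build context, as in the real installer.

#!/bin/bash
set -euo pipefail

# Save the caller's LD_LIBRARY_PATH so it can be restored after the build.
OLD_LD_LIBRARY_PATH="${LD_LIBRARY_PATH:-}"

# Locate the CUDA driver library; fail fast if it cannot be found.
CUDA_SO_PATH=$(find /usr/local -name "libcuda.so.1" 2>/dev/null | head -n1)
if [[ -z "$CUDA_SO_PATH" ]]; then
    echo "libcuda.so.1 not found"
    exit 1
fi

# Make the library's directory visible while NIXL is built.
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$(dirname -- "$CUDA_SO_PATH")"

# ... build and install NIXL here ...

# Restore the original environment, then persist the runtime search path.
export LD_LIBRARY_PATH="$OLD_LD_LIBRARY_PATH"
echo "export LD_LIBRARY_PATH=/opt/nvidia/nvda_nixl/lib/${ARCH_NAME}:/opt/nvidia/nvda_nixl/lib64:\$LD_LIBRARY_PATH" >> "${ENV}"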

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • Shixiaowei02
  • chzblych
  • chuangz0
  • pcastonguay
  • nv-guomingz


@BatshevaBlack

/bot run --stage-list "Build-Docker-Images"


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
docker/common/install_nixl.sh (1)

45-45: Ensure ENV file is defined (dup).

Same note as in install_ucx.sh: confirm ENV is set to a valid path before appending.

🧹 Nitpick comments (7)
docker/common/install_ucx.sh (5)

5-5: Consider pinning to an exact tag or commit for reproducibility.

Using a moving branch like v1.19.x can change over time. Prefer a specific tag (e.g., v1.19.1) or a commit SHA to make builds deterministic.

If you choose to pin, apply:

-UCX_VERSION="v1.19.x"
+UCX_VERSION="v1.19.1"

10-12: Harden clone/install steps and quote paths.

  • Quote paths to avoid word-splitting.
  • Use --single-branch and explicit --branch for clarity.
  • Optional: add basic retry to reduce transient network flakiness.

Apply:

-rm -rf ${UCX_INSTALL_PATH}
-git clone --depth 1 -b ${UCX_VERSION} ${UCX_REPO}
+rm -rf "${UCX_INSTALL_PATH}"
+git clone --depth 1 --single-branch --branch "${UCX_VERSION}" "${UCX_REPO}"
 cd ucx

Optionally, add a simple retry wrapper around git clone if CI flakes are a concern.


26-26: Shellcheck SC2046: quote -j argument.

Prevents accidental word-splitting; also matches static analysis hint.

Apply:

-make install -j$(nproc)
+make install -j"$(nproc)"

29-29: Avoid double slash in path and ensure ENV is set.

  • UCX_INSTALL_PATH ends with a slash; ${UCX_INSTALL_PATH}/lib yields //lib. Use ${UCX_INSTALL_PATH%/} to normalize.
  • Verify that ENV is defined to avoid writing to an empty path.

Apply:

-echo "export LD_LIBRARY_PATH=${UCX_INSTALL_PATH}/lib:\$LD_LIBRARY_PATH" >> "${ENV}"
+echo "export LD_LIBRARY_PATH=${UCX_INSTALL_PATH%/}/lib:\$LD_LIBRARY_PATH" >> "${ENV}"

Please confirm that ENV is always set in the Docker build context; if not, default it or fail fast.


10-29: Unconditional reinstall acknowledged.

The remove-and-reinstall flow aligns with the PR objective to always rebuild UCX. Just note this increases build time; caching or a version-guard could be reintroduced later if CI time becomes an issue.
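If a guard is reintroduced later, one possible shape is sketched below (illustrative only; the marker-file name is an assumption, not something the current script uses):

# Skip the rebuild when the requested UCX version is already installed.
MARKER="${UCX_INSTALL_PATH%/}/.installed_ucx_version"
if [[ -f "$MARKER" && "$(cat "$MARKER")" == "$UCX_VERSION" ]]; then
    echo "UCX ${UCX_VERSION} already installed; skipping rebuild"
    exit 0
fi

# ... clone, autogen/configure/make, and install as the script does today ...

echo "$UCX_VERSION" > "$MARKER"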

docker/common/install_nixl.sh (2)

9-9: Preserve empty LD_LIBRARY_PATH safely and quote.

Avoids leading/trailing colons and word-splitting later.

Apply:

-OLD_LD_LIBRARY_PATH=$LD_LIBRARY_PATH
+OLD_LD_LIBRARY_PATH="${LD_LIBRARY_PATH:-}"

43-43: Restore LD_LIBRARY_PATH with quotes.

Prevents word-splitting and preserves emptiness faithfully.

Apply:

-export LD_LIBRARY_PATH=$OLD_LD_LIBRARY_PATH
+export LD_LIBRARY_PATH="$OLD_LD_LIBRARY_PATH"
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 1e72721 and a164e44.

📒 Files selected for processing (2)
  • docker/common/install_nixl.sh (3 hunks)
  • docker/common/install_ucx.sh (1 hunks)
🧰 Additional context used
🪛 Shellcheck (0.10.0)
docker/common/install_ucx.sh

[warning] 26-26: Quote this to prevent word splitting.

(SC2046)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
docker/common/install_nixl.sh (2)

7-7: Version bump to 0.5.0 looks good.

No issues spotted with the new NIXL version reference.


45-45: Prevent premature expansion of LD_LIBRARY_PATH in persisted ENV.

As written, ${LD_LIBRARY_PATH} expands at build time, baking a fixed value. Escape it so runtime shells inherit/extend their current LD_LIBRARY_PATH.

Apply:

-echo "export LD_LIBRARY_PATH=/opt/nvidia/nvda_nixl/lib/${ARCH_NAME}:/opt/nvidia/nvda_nixl/lib64:${LD_LIBRARY_PATH}" >> "${ENV}"
+echo "export LD_LIBRARY_PATH=/opt/nvidia/nvda_nixl/lib/${ARCH_NAME}:/opt/nvidia/nvda_nixl/lib64:\$LD_LIBRARY_PATH" >> "${ENV}"

Likely an incorrect or invalid review comment.

Comment on lines +22 to 27
CUDA_SO_PATH=$(find "/usr/local" -name "libcuda.so.1" 2>/dev/null | head -n1)

if [[ -z "$CUDA_SO_PATH" ]]; then
    echo "libcuda.so.1 not found"
    exit 1
fi

🛠️ Refactor suggestion

Broaden CUDA library discovery beyond /usr/local.

libcuda.so.1 commonly lives under host driver mounts like /usr/lib/x86_64-linux-gnu or /usr/lib/aarch64-linux-gnu. Searching only /usr/local is likely to fail on many systems.

Apply:

-CUDA_SO_PATH=$(find "/usr/local" -name "libcuda.so.1" 2>/dev/null | head -n1)
+CUDA_SO_PATH=$(
+  { find /usr/local /usr -type f -name "libcuda.so.1" 2>/dev/null; \
+    find "/usr/local/cuda/compat" -type f -name "libcuda.so.1" 2>/dev/null; } \
+  | head -n1
+)
 
-if [[ -z "$CUDA_SO_PATH" ]]; then
+if [[ -z "$CUDA_SO_PATH" ]]; then
     echo "libcuda.so.1 not found"
     exit 1
 fi

Notes:

  • Includes common locations: /usr/lib/x86_64-linux-gnu, /usr/lib/aarch64-linux-gnu, /usr/local/cuda/compat.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-CUDA_SO_PATH=$(find "/usr/local" -name "libcuda.so.1" 2>/dev/null | head -n1)
-if [[ -z "$CUDA_SO_PATH" ]]; then
-    echo "libcuda.so.1 not found"
-    exit 1
-fi
+CUDA_SO_PATH=$(
+  { find /usr/local /usr -type f -name "libcuda.so.1" 2>/dev/null; \
+    find "/usr/local/cuda/compat" -type f -name "libcuda.so.1" 2>/dev/null; } \
+  | head -n1
+)
+if [[ -z "$CUDA_SO_PATH" ]]; then
+    echo "libcuda.so.1 not found"
+    exit 1
+fi
🤖 Prompt for AI Agents
In docker/common/install_nixl.sh around lines 22 to 27, the script only searches
/usr/local for libcuda.so.1 which misses common host driver mounts; update the
discovery to search additional standard locations (/usr/lib/x86_64-linux-gnu,
/usr/lib/aarch64-linux-gnu, /usr/local/cuda/compat) and fall back to a broader
search (e.g., under /usr or root) so libcuda.so.1 is found on systems where
drivers are mounted outside /usr/local; ensure the new search still returns the
first match and preserves the existing error handling if nothing is found.

Comment on lines +29 to +31
CUDA_SO_PATH=$(dirname $CUDA_SO_PATH)

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_SO_PATH

🛠️ Refactor suggestion

Quote dirname input and avoid leading colon when appending.

  • Use dirname with -- and quotes.
  • Only insert a colon if LD_LIBRARY_PATH is non-empty.

Apply:

-CUDA_SO_PATH=$(dirname $CUDA_SO_PATH)
-
-export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_SO_PATH
+CUDA_SO_DIR=$(dirname -- "$CUDA_SO_PATH")
+
+export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$CUDA_SO_DIR"

Also update subsequent references to use CUDA_SO_DIR if any (none outside these lines).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-CUDA_SO_PATH=$(dirname $CUDA_SO_PATH)
-export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_SO_PATH
+CUDA_SO_DIR=$(dirname -- "$CUDA_SO_PATH")
+export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$CUDA_SO_DIR"
🤖 Prompt for AI Agents
In docker/common/install_nixl.sh around lines 29 to 31, change the use of
dirname to safely handle paths by using dirname -- "$CUDA_SO_PATH" and store the
result in a new variable CUDA_SO_DIR; then append to LD_LIBRARY_PATH without
producing a leading colon by only adding ":$CUDA_SO_DIR" when LD_LIBRARY_PATH is
non-empty (otherwise set LD_LIBRARY_PATH="$CUDA_SO_DIR"). Also replace any later
references to CUDA_SO_PATH with CUDA_SO_DIR if present.

@tensorrt-cicd

PR_Github #15531 [ run ] triggered by Bot

@BatshevaBlack BatshevaBlack changed the title from "[None][infra] Upgrade UCX to v1.19.x and NIXL to 0.5.0 #6900" to "[None][infra] Upgrade UCX to v1.19.x and NIXL to 0.5.0" Aug 17, 2025
@BatshevaBlack BatshevaBlack requested a review from a team as a code owner August 17, 2025 09:43
@BatshevaBlack

/bot run --add-multi-gpu-test


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
jenkins/current_image_tags.properties (2)

14-17: Prefer digest-pinned images for reproducibility

Staging tags are mutable and can be GC’ed; pinning by digest (image@sha256:…) prevents drift and makes PR reruns reproducible.

If keeping tags is required for ergonomics, consider adding companion DIGEST variables and having the pipeline prefer digests when present. I can generate a follow-up patch if you provide the digests.


14-17: Verify images actually contain UCX v1.19.x and NIXL 0.5.0

Given the PR objective, ensure the staging images embed the intended versions so CI exercises the upgrade.

Suggested minimal verification step (to add in CI or a pre-build script):

  • ucx_info -v (should show 1.19.x)
  • strings /opt/nvidia/nvda_nixl/lib*/libnixl.so | grep -E '(^| )0.5.0( |$)' or nixl version command if available

I can prep a small CI snippet to print these versions early in the logs for quick triage.
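A minimal sketch of such a snippet, assembled from the commands above (it assumes ucx_info is on PATH inside the image and NIXL is installed under /opt/nvidia/nvda_nixl):

#!/bin/bash
# Print the UCX and NIXL versions early in the CI logs for quick triage.
echo "== UCX =="
if command -v ucx_info >/dev/null 2>&1; then
    ucx_info -v | head -n 2
else
    echo "ucx_info not available"
fi

echo "== NIXL =="
strings /opt/nvidia/nvda_nixl/lib*/libnixl.so 2>/dev/null | grep -E '(^| )0\.5\.0( |$)' \
    || echo "expected 0.5.0 marker not found in libnixl.so"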

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between a164e44 and 8f62260.

📒 Files selected for processing (1)
  • jenkins/current_image_tags.properties (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
jenkins/current_image_tags.properties (1)

14-17: Rename tooling and consumers support the new tag shape

I’ve verified that:

  • The IMAGE_MAPPING in scripts/rename_docker_images.py has been updated to produce the new tensorrt-llm-staging/tritondevel: prefixes and tag patterns.
  • All downstream consumers (Jenkins pipelines, Makefiles, performance scripts, etc.) source the LLM_*_DOCKER_IMAGE variables and treat them as opaque image URIs—there are no regex‐based parses of the old tag shape.
  • No hard-coded references to the old tensorrt-llm: prefix or redundant substrings remain in the rename script or in the codebase.

The rename script will generate and propagate the updated tags correctly, and no tooling assumptions have been broken.

Comment on lines +14 to +17
LLM_DOCKER_IMAGE=urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm-staging/tritondevel:x86_64-tritondevel-torch_skip-a164e44-github-pr-6966-704
LLM_SBSA_DOCKER_IMAGE=urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm-staging/tritondevel:sbsa-tritondevel-torch_skip-a164e44-github-pr-6966-704
LLM_ROCKYLINUX8_PY310_DOCKER_IMAGE=urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm-staging/tritondevel:x86_64-rockylinux8-torch_skip-py310-a164e44-github-pr-6966-704
LLM_ROCKYLINUX8_PY312_DOCKER_IMAGE=urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm-staging/tritondevel:x86_64-rockylinux8-torch_skip-py312-a164e44-github-pr-6966-704

💡 Verification agent

🧩 Analysis chain

Confirm “torch_skip” images won’t break stages expecting PyTorch

Tags contain torch_skip; if any PR stages rely on torch presence (builds/tests importing torch or using torch CUDA toolchain), they’ll fail or silently skip coverage.

Please validate stage mappings and adjust bot invocation if needed (e.g., skip torch-dependent stages) or switch to non-skip images for those stages. To audit, list stage mappings and references:

If specific stages require torch, I can help craft a stage-list for the /bot run that avoids them or switches images conditionally.


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Test-to-stage mapping (if present):"
fd -a test_to_stage_mapping.py -x sh -c 'echo "--- {} ---"; sed -n "1,200p" "{}"' sh || true

echo
echo "Search for torch usage within CI scripts and Jenkinsfiles:"
rg -n -C2 -P '\bimport\s+torch\b|torch\.' scripts Jenkinsfile jenkins || true

Length of output: 18194


🚨 Ensure PyTorch support in CI for torch_skip images

We’ve confirmed the Jenkins pipelines include PyTorch-specific stages and tests (e.g.
– tests/unittest/llmapi/test_llm_pytorch.py in L0_MergeRequest.groovy
– perf stages in L0_Test.groovy that install torch via pip3)
The new *_torch_skip* images omit PyTorch and will cause these stages to fail or silently skip coverage.

Action items:

  • In jenkins/current_image_tags.properties (lines 14–17), introduce or revert to non-skip image tags for PyTorch-dependent stages (e.g.
    LLM_DOCKER_IMAGE_PYTORCH=…).
  • In Jenkinsfiles:
    • jenkins/L0_MergeRequest.groovy
    • jenkins/L0_Test.groovy
    update stage-to-image selection so that any stage with “PyTorch” uses a full image, or explicitly skip those stages when running on a torch-free image.

These CI fixes are required to prevent test failures or missing coverage in PyTorch stages.

🤖 Prompt for AI Agents
In jenkins/current_image_tags.properties around lines 14 to 17, the current
LLM_*_DOCKER_IMAGE values point to "torch_skip" images which lack PyTorch and
will break or skip PyTorch-dependent CI stages; add explicit non-skip PyTorch
image entries (for example LLM_DOCKER_IMAGE_PYTORCH,
LLM_SBSA_DOCKER_IMAGE_PYTORCH, LLM_ROCKYLINUX8_PY310_DOCKER_IMAGE_PYTORCH,
LLM_ROCKYLINUX8_PY312_DOCKER_IMAGE_PYTORCH) referencing corresponding images
that include torch, and then update Jenkinsfiles (jenkins/L0_MergeRequest.groovy
and jenkins/L0_Test.groovy) so PyTorch stages select the *_PYTORCH images (or
explicitly skip those stages when the selected image is a torch-free
"torch_skip" variant).

@tensorrt-cicd

PR_Github #15532 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #15531 [ run ] completed with state ABORTED

@tensorrt-cicd

PR_Github #15532 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11697 completed with status: 'FAILURE'

@BatshevaBlack

/bot run --add-multi-gpu-test

@tensorrt-cicd

PR_Github #15536 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #15536 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11701 completed with status: 'FAILURE'

@bo-nv bo-nv self-requested a review August 18, 2025 07:47
@BatshevaBlack

/bot run --add-multi-gpu-test


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🔭 Outside diff range comments (1)
cpp/tests/unit_tests/executor/transferAgentTest.cpp (1)

299-304: Wrong agent used in checkRemoteDescs; likely causes a tight infinite loop.

After connecting nixlAgent1 to agent0 (Line 287), the remote-desc check should be initiated from nixlAgent1 against agent0. Currently it checks from nixlAgent0 to agent1 with a descriptor owned by agent0, which is inconsistent and will likely never succeed.

Apply this diff:

-    bool checked2 = false;
-    do
-    {
-        checked2 = nixlAgent0->checkRemoteDescs(agent1, regMem1.getDescs());
-    } while (!checked2);
+    bool checked2 = false;
+    do
+    {
+        checked2 = nixlAgent1->checkRemoteDescs(agent0, regMem1.getDescs());
+    } while (!checked2);
🧹 Nitpick comments (2)
cpp/tests/unit_tests/executor/transferAgentTest.cpp (2)

262-266: Replace unbounded busy-waits with bounded wait + sleep to avoid CPU spin and test timeouts.

These loops can spin forever and burn CPU. Add a small sleep and a bounded retry budget to make tests robust on slower CI.

Apply these diffs:

-    for (std::size_t i = 0; i < MAX_QUERY_TIMES && notif.size() == 0; i++)
+    for (int i = 0; i < 5000 && notif.empty(); ++i)
     {
+        std::this_thread::sleep_for(std::chrono::milliseconds(1));
         notif = nixlAgent1->getNotifiedSyncMessages();
     }
-    for (std::size_t i = 0; i < MAX_QUERY_TIMES && notif2.size() == 0; i++)
+    for (int i = 0; i < 5000 && notif2.empty(); ++i)
     {
+        std::this_thread::sleep_for(std::chrono::milliseconds(1));
         notif2 = nixlAgent1->getNotifiedSyncMessages();
     }
-    for (std::size_t i = 0; i < MAX_QUERY_TIMES && notif3.size() == 0; i++)
+    for (int i = 0; i < 5000 && notif3.empty(); ++i)
     {
+        std::this_thread::sleep_for(std::chrono::milliseconds(1));
         notif3 = nixlAgent0->getNotifiedSyncMessages();
     }
-    for (std::size_t i = 0; i < MAX_QUERY_TIMES && notif4.size() == 0; i++)
+    for (int i = 0; i < 5000 && notif4.empty(); ++i)
     {
+        std::this_thread::sleep_for(std::chrono::milliseconds(1));
         notif4 = nixlAgent0->getNotifiedSyncMessages();
     }
-    for (size_t i = 0; i < MAX_QUERY_TIMES && notif5.size() == 0; i++)
+    for (int i = 0; i < 5000 && notif5.empty(); ++i)
     {
+        std::this_thread::sleep_for(std::chrono::milliseconds(1));
         notif5 = nixlAgent1->getNotifiedSyncMessages();
     }

Additionally, add the necessary includes once at the top of the file:

#include <thread>
#include <chrono>

Also applies to: 276-280, 291-294, 310-313, 330-333


146-152: Wrap CUDA calls with TLLM_CUDA_CHECK for consistent error handling.

cudaMalloc/cudaMemcpy are unchecked, while cudaFree is checked. Align all CUDA calls to fail fast in tests.

Apply these diffs:

-    cudaMalloc(&dev_ptr0, size);
-    cudaMalloc(&dev_ptr1, size);
+    TLLM_CUDA_CHECK(cudaMalloc(&dev_ptr0, size));
+    TLLM_CUDA_CHECK(cudaMalloc(&dev_ptr1, size));
-    cudaMemcpy(dev_ptr0, memory0.data(), size, cudaMemcpyHostToDevice);
-    cudaMemcpy(dev_ptr1, memory1.data(), size, cudaMemcpyHostToDevice);
+    TLLM_CUDA_CHECK(cudaMemcpy(dev_ptr0, memory0.data(), size, cudaMemcpyHostToDevice));
+    TLLM_CUDA_CHECK(cudaMemcpy(dev_ptr1, memory1.data(), size, cudaMemcpyHostToDevice));
-    cudaMemcpy(memory0.data(), dev_ptr0, size, cudaMemcpyDeviceToHost);
-    cudaMemcpy(memory1.data(), dev_ptr1, size, cudaMemcpyDeviceToHost);
+    TLLM_CUDA_CHECK(cudaMemcpy(memory0.data(), dev_ptr0, size, cudaMemcpyDeviceToHost));
+    TLLM_CUDA_CHECK(cudaMemcpy(memory1.data(), dev_ptr1, size, cudaMemcpyDeviceToHost));

Also applies to: 169-171

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 8f62260 and ab547ce.

📒 Files selected for processing (1)
  • cpp/tests/unit_tests/executor/transferAgentTest.cpp (2 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tests/unit_tests/executor/transferAgentTest.cpp
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tests/unit_tests/executor/transferAgentTest.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tests/unit_tests/executor/transferAgentTest.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tests/unit_tests/executor/transferAgentTest.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
cpp/tests/unit_tests/executor/transferAgentTest.cpp (2)

258-260: New notifySyncMessage + 4-arg TransferRequest usage looks correct.

Decoupling sync-message delivery from the transfer payload is exercised properly here. The 4-argument TransferRequest matches the updated API.


306-308: Second notifySyncMessage + write flow looks correct.

Assuming the remote-desc readiness check is corrected (Lines 299-304), this sequence aligns with the new API and intended semantics.

@BatshevaBlack

/bot run --add-multi-gpu-test


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🔭 Outside diff range comments (1)
cpp/tests/unit_tests/executor/transferAgentTest.cpp (1)

299-308: Wrong agent in checkRemoteDescs before write from agent1 → agent0.

You connect agent1 to agent0 and then issue a write from agent1 to agent0, but the readiness check is performed by agent0 against agent1 and uses local descriptors. This is logically inverted and may mask issues.

-    bool checked2 = false;
-    do
-    {
-        checked2 = nixlAgent0->checkRemoteDescs(agent1, regMem1.getDescs());
-    } while (!checked2);
+    bool checked2 = false;
+    do
+    {
+        // agent1 is about to write to agent0; ensure agent0's descriptors are known to agent1
+        checked2 = nixlAgent1->checkRemoteDescs(agent0, regMem1.getDescs());
+    } while (!checked2);
♻️ Duplicate comments (3)
docker/common/install_nixl.sh (2)

22-27: Broaden libcuda discovery beyond /usr/local (likely failure on many hosts).

Restricting to /usr/local will commonly miss libcuda.so.1 under /usr/lib/*-linux-gnu or /usr/local/cuda/compat, causing the script to exit. Reuse the previously suggested multi-path search.

-CUDA_SO_PATH=$(find "/usr/local" -name "libcuda.so.1" 2>/dev/null | head -n1)
-
-if [[ -z "$CUDA_SO_PATH" ]]; then
-    echo "libcuda.so.1 not found"
-    exit 1
-fi
+CUDA_SO_PATH=$(
+  {
+    find /usr/local /usr -type f -name "libcuda.so.1" 2>/dev/null;
+    find "/usr/local/cuda/compat" -type f -name "libcuda.so.1" 2>/dev/null;
+  } | head -n1
+)
+if [[ -z "$CUDA_SO_PATH" ]]; then
+  echo "libcuda.so.1 not found"
+  exit 1
+fi

29-31: Quote dirname input and avoid leading colon when appending.

Use dirname safely and only prepend a colon when LD_LIBRARY_PATH is non-empty.

-CUDA_SO_PATH=$(dirname $CUDA_SO_PATH)
-
-export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_SO_PATH
+CUDA_SO_DIR=$(dirname -- "$CUDA_SO_PATH")
+export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$CUDA_SO_DIR"
jenkins/current_image_tags.properties (1)

14-17: torch_skip images likely break PyTorch stages; provide torch-enabled variants or adjust stage mapping.

CI pipelines contain PyTorch-dependent stages; using torch_skip images will fail those stages or skip coverage. Either:

  • introduce explicit torch-enabled image variables for PyTorch stages and update Jenkinsfiles to use them, or
  • gate/skip PyTorch stages when a torch_skip image is selected.

I can help patch Jenkinsfiles if you confirm the strategy. To audit current PyTorch usage quickly:

#!/bin/bash
set -euo pipefail
# Look for torch usage in CI configs and tests
rg -n -C2 -P 'pip(\S+)?\s+install\s+(\S*torch\S*)|import\s+torch\b|torch\.' jenkins scripts tests || true
# Show where these image vars are referenced
rg -n -C2 -P 'LLM(_SBSA)?(_ROCKYLINUX8_.*)?_DOCKER_IMAGE' jenkins || true

If you prefer adding torch-enabled images alongside current ones, append after Line 17:

LLM_DOCKER_IMAGE_PYTORCH=...<full torch image tag>...
LLM_SBSA_DOCKER_IMAGE_PYTORCH=...<full torch image tag>...
LLM_ROCKYLINUX8_PY310_DOCKER_IMAGE_PYTORCH=...<full torch image tag>...
LLM_ROCKYLINUX8_PY312_DOCKER_IMAGE_PYTORCH=...<full torch image tag>...

Then update PyTorch stages in Jenkinsfiles to select the *_PYTORCH variables.

🧹 Nitpick comments (5)
docker/common/install_ucx.sh (4)

5-5: Prefer a pinned UCX tag or commit for reproducible builds.

Using a moving branch (v1.19.x) can yield different artifacts over time. Consider pinning to a specific tag (e.g., v1.19.0/v1.19.1) or a commit SHA to ensure deterministic CI.

-UCX_VERSION="v1.19.x"
+# Example: pin to a specific version or commit for reproducibility
+UCX_VERSION="v1.19.1"

10-10: Harden rm -rf with quoting and trailing-slash guard.

Quote the path and strip a trailing slash to avoid accidental globbing/path mishaps.

-rm -rf ${UCX_INSTALL_PATH}
+rm -rf -- "${UCX_INSTALL_PATH%/}"

26-26: Shellcheck SC2046: quote and place -j before the target.

Quote $(nproc) and pass -j to make, not the install target.

-make install -j$(nproc)
+make -j"$(nproc)" install

29-29: Ensure idempotent LD_LIBRARY_PATH export in install_ucx.sh

Repeated runs of install_ucx.sh will currently append duplicate export lines, and although Dockerfile.multi sets ENV and install_base.sh creates the file, it’s safer to guard against both an unset file and duplicate entries. Update the snippet as follows:

-echo "export LD_LIBRARY_PATH=${UCX_INSTALL_PATH}/lib:\$LD_LIBRARY_PATH" >> "${ENV}"
+mkdir -p "$(dirname "$ENV")"
+touch "$ENV"
+# Append only once; keep LD_LIBRARY_PATH dynamic at runtime
+grep -qsF "${UCX_INSTALL_PATH%/}/lib" "$ENV" \
+  || echo "export LD_LIBRARY_PATH=${UCX_INSTALL_PATH%/}/lib:\$LD_LIBRARY_PATH" >> "$ENV"

• Keeps using the $ENV path defined by Dockerfile.multi
• Ensures the target file exists before appending
• Prevents duplicate exports on subsequent runs

cpp/tests/unit_tests/executor/transferAgentTest.cpp (1)

257-271: Avoid unbounded busy-wait loops; add a bounded wait with small sleeps.

MAX_QUERY_TIMES is set to size_t::max and loops spin without sleep. If notifications are delayed or lost, this can burn CPU indefinitely.

Example refactor (outside this hunk):

#include <chrono>
#include <thread>

// ...
for (std::size_t i = 0; i < 1000 && notif.empty(); ++i) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
    notif = nixlAgent1->getNotifiedSyncMessages();
}
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between ab547ce and 855cd3a.

📒 Files selected for processing (4)
  • cpp/tests/unit_tests/executor/transferAgentTest.cpp (2 hunks)
  • docker/common/install_nixl.sh (3 hunks)
  • docker/common/install_ucx.sh (1 hunks)
  • jenkins/current_image_tags.properties (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tests/unit_tests/executor/transferAgentTest.cpp
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tests/unit_tests/executor/transferAgentTest.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tests/unit_tests/executor/transferAgentTest.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tests/unit_tests/executor/transferAgentTest.cpp
🪛 Shellcheck (0.10.0)
docker/common/install_ucx.sh

[warning] 26-26: Quote this to prevent word splitting.

(SC2046)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

rm $CUDA_PATH/lib64/libcuda.so.1
export LD_LIBRARY_PATH=$OLD_LD_LIBRARY_PATH

echo "export LD_LIBRARY_PATH=/opt/nvidia/nvda_nixl/lib/${ARCH_NAME}:/opt/nvidia/nvda_nixl/lib64:\$LD_LIBRARY_PATH" >> "${ENV}"

💡 Verification agent

🧩 Analysis chain

Make LD_LIBRARY_PATH persistence safe and idempotent; handle unset ENV.

As with UCX, ensure ENV is defined, create its directory, and avoid duplicate lines on repeated runs.

-echo "export LD_LIBRARY_PATH=/opt/nvidia/nvda_nixl/lib/${ARCH_NAME}:/opt/nvidia/nvda_nixl/lib64:\$LD_LIBRARY_PATH" >> "${ENV}"
+ENV=${ENV:-/etc/profile.d/trtllm-env.sh}
+mkdir -p "$(dirname -- "$ENV")"
+grep -qsF "/opt/nvidia/nvda_nixl/lib/${ARCH_NAME}" "$ENV" || \
+  echo 'export LD_LIBRARY_PATH=/opt/nvidia/nvda_nixl/lib/${ARCH_NAME}:/opt/nvidia/nvda_nixl/lib64:$LD_LIBRARY_PATH' >> "$ENV"

Optional runtime verification in the built image:

  • ldd on a NIXL .so resolves against host driver’s libcuda without needing temporary LD_LIBRARY_PATH.
  • If not, consider persisting the discovered CUDA_SO_DIR as a fallback.

🏁 Script executed:

#!/bin/bash
set -e
# Display the first 200 lines of the install_nixl script to inspect ENV usage
sed -n '1,200p' docker/common/install_nixl.sh

Length of output: 1354


Ensure ENV is defined and make LD_LIBRARY_PATH update idempotent

The script currently ends with:

echo "export LD_LIBRARY_PATH=/opt/nvidia/nvda_nixl/lib/${ARCH_NAME}:/opt/nvidia/nvda_nixl/lib64:\$LD_LIBRARY_PATH" >> "${ENV}"

ENV isn’t set in this script (causing a redirection error), and repeated runs will keep appending duplicates. Please update as follows:

- echo "export LD_LIBRARY_PATH=/opt/nvidia/nvda_nixl/lib/${ARCH_NAME}:/opt/nvidia/nvda_nixl/lib64:\$LD_LIBRARY_PATH" >> "${ENV}"
+ # Ensure ENV is defined, its directory exists, and avoid duplicate entries
+ ENV=${ENV:-/etc/profile.d/trtllm-env.sh}
+ mkdir -p "$(dirname -- "$ENV")"
+ if ! grep -qF "/opt/nvidia/nvda_nixl/lib/${ARCH_NAME}" "$ENV"; then
+   echo "export LD_LIBRARY_PATH=/opt/nvidia/nvda_nixl/lib/${ARCH_NAME}:/opt/nvidia/nvda_nixl/lib64:\$LD_LIBRARY_PATH" >> "$ENV"
+ fi

Optional container‐image check: run ldd on one of the NIXL .so files to confirm it picks up the host’s libcuda.so without needing a transient LD_LIBRARY_PATH, or fall back to persisting the discovered CUDA_SO_PATH if necessary.
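For the optional check, something along these lines could be run inside the built image (a sketch; the library path follows the install location referenced above):

# Verify the installed NIXL library resolves libcuda from the host driver,
# with no transient LD_LIBRARY_PATH set for the check itself.
env -u LD_LIBRARY_PATH ldd /opt/nvidia/nvda_nixl/lib*/libnixl.so | grep -i libcuda \
    || echo "libcuda not resolved; consider persisting the discovered CUDA directory"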

🤖 Prompt for AI Agents
In docker/common/install_nixl.sh around line 45, the script appends an
LD_LIBRARY_PATH export to "${ENV}" even though ENV may be undefined and repeated
runs will duplicate the line; update the script to ensure ENV is defined (use a
documented default like /etc/profile.d/nvidia_nixl.sh if not provided and create
the file if missing) and make the write idempotent by replacing any existing
NIXL LD_LIBRARY_PATH export (or removing matching lines) before writing a single
canonical export line; additionally, as an optional follow-up, run ldd on one of
the installed NIXL .so files to see if system libcuda.so is already picked up
and, if so, avoid persisting LD_LIBRARY_PATH or otherwise persist the discovered
CUDA_SO_PATH instead.

@BatshevaBlack

/bot run --add-multi-gpu-test


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (6)
docker/common/install_ucx.sh (6)

10-10: Guard rm -rf against empty or unexpected paths

A defensive guard prevents accidental deletions if UCX_INSTALL_PATH is ever empty/mis-set. Low risk inside Docker, but cheap insurance.

Apply:

-rm -rf ${UCX_INSTALL_PATH}
+: "${UCX_INSTALL_PATH:?UCX_INSTALL_PATH must be set}"
+[[ "${UCX_INSTALL_PATH}" != "/" ]] || { echo "Refusing to remove /"; exit 1; }
+rm -rf -- "${UCX_INSTALL_PATH%/}"

11-11: Use --single-branch and quote vars in git clone

Slightly reduces fetch size and hardens against word-splitting.

-git clone --depth 1 -b ${UCX_VERSION} ${UCX_REPO}
+git clone --depth 1 --single-branch -b "${UCX_VERSION}" "${UCX_REPO}"

4-4: Remove unused GITHUB_URL or wire it into UCX_REPO

Currently unused; either drop it or use to allow mirroring.

Option A (remove):

-GITHUB_URL="https://github.com"

Option B (use it):

-GITHUB_URL="https://github.com"
-UCX_REPO="https://github.com/openucx/ucx.git"
+GITHUB_URL="https://github.com"
+UCX_REPO="${GITHUB_URL}/openucx/ucx.git"

26-26: Quote $(nproc) and avoid parallelized install target

ShellCheck SC2046: quote to prevent word splitting. Also, many projects recommend building in parallel, but running “install” serially to avoid race conditions.

-make install -j$(nproc)
+# Build in parallel, then install serially
+make -j"$(nproc)"
+make install

If you prefer a minimal change:

-make install -j$(nproc)
+make install -j"$(nproc)"

22-23: Fail early if CUDA_PATH is missing, or auto-discover CUDA

The hardcoded /usr/local/cuda usually exists, but a clear error helps debugging CI failures.

-CUDA_PATH="/usr/local/cuda"
+CUDA_PATH="/usr/local/cuda"
+if [[ ! -d "${CUDA_PATH}" ]]; then
+  echo "CUDA_PATH not found at ${CUDA_PATH}" >&2
+  exit 1
+fi

If you want to auto-discover (similar to the NIXL script changes), we can add that in a follow-up.
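For reference, one possible auto-discovery fallback (a sketch only, assuming the conventional /usr/local/cuda-* layout):

# Prefer the conventional symlink; otherwise fall back to the newest versioned install.
if [[ -d /usr/local/cuda ]]; then
    CUDA_PATH=/usr/local/cuda
else
    CUDA_PATH=$(ls -d /usr/local/cuda-* 2>/dev/null | sort -V | tail -n 1)
fi
if [[ -z "${CUDA_PATH:-}" || ! -d "$CUDA_PATH" ]]; then
    echo "No CUDA installation found under /usr/local" >&2
    exit 1
fi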


5-5: docker/common/install_ucx.sh:5 – Confirm UCX version pinning

Verified that the branch v1.19.x exists upstream and that the only stable tag available today is v1.19.0 (in addition to RC releases). Using

UCX_VERSION="v1.19.x"

will float to future patch releases when they are published.

If you require fully reproducible Docker images, consider pinning to the current stable tag:

- UCX_VERSION="v1.19.x"
+ UCX_VERSION="v1.19.0"

Otherwise, leaving v1.19.x is acceptable to automatically pick up patch updates.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 855cd3a and cd53b11.

📒 Files selected for processing (4)
  • cpp/tests/unit_tests/executor/transferAgentTest.cpp (2 hunks)
  • docker/common/install_nixl.sh (3 hunks)
  • docker/common/install_ucx.sh (1 hunks)
  • jenkins/current_image_tags.properties (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • cpp/tests/unit_tests/executor/transferAgentTest.cpp
  • docker/common/install_nixl.sh
  • jenkins/current_image_tags.properties
🧰 Additional context used
🪛 Shellcheck (0.10.0)
docker/common/install_ucx.sh

[warning] 26-26: Quote this to prevent word splitting.

(SC2046)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
docker/common/install_ucx.sh (3)

13-25: Build flags preserved and look reasonable for UCX 1.19

Keeping shared libs, optimizations, verbs, DM, CMA, CUDA, and mt is consistent with prior configs. No concerns here.


28-28: Good: source tree cleanup to reduce image size

Removing the UCX source after install keeps images lean.


29-29: No changes needed: ENV is always defined and the script runs only once
We verified that in docker/Dockerfile.multi, ENV ENV=${ENV:-/etc/shinit_v2} is set before install_ucx.sh is invoked, and that install_ucx.sh is only executed once per build stage. Therefore:

  • $ENV will never be unset or empty under set -e.
  • There’s no risk of duplicate export LD_LIBRARY_PATH… lines across rebuilds, since the script only runs a single time.

@tensorrt-cicd

PR_Github #15618 [ run ] triggered by Bot

@BatshevaBlack

/bot run --add-multi-gpu-test

@tensorrt-cicd

PR_Github #15624 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #15618 [ run ] completed with state ABORTED

@tensorrt-cicd

PR_Github #15624 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11762 completed with status: 'FAILURE'

@BatshevaBlack

/bot run --add-multi-gpu-test

@tensorrt-cicd

PR_Github #15648 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #15648 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11780 completed with status: 'FAILURE'

@BatshevaBlack

/bot run --add-multi-gpu-test


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/unittest/llmapi/test_llm_args.py (1)

443-446: Avoid asserting against a private attribute; prefer a public accessor if available.

Accessing llm._executor_config couples the test to internal details and risks breakage on refactors. If a stable public API exists (e.g., llm.executor_config or a getter), prefer that. To keep tests resilient while remaining backward-compatible, consider this small tweak:

-            assert llm._executor_config.max_beam_width == 1
-            assert llm._executor_config.max_num_tokens == 256
-            assert llm._executor_config.max_seq_len == 128
-            assert llm._executor_config.max_batch_size == 8
+            _cfg = getattr(llm, "executor_config", llm._executor_config)
+            assert _cfg.max_beam_width == 1
+            assert _cfg.max_num_tokens == 256
+            assert _cfg.max_seq_len == 128
+            assert _cfg.max_batch_size == 8
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 3b37f68 and fa689a8.

📒 Files selected for processing (1)
  • tests/unittest/llmapi/test_llm_args.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tests/unittest/llmapi/test_llm_args.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tests/unittest/llmapi/test_llm_args.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tests/unittest/llmapi/test_llm_args.py (1)

431-437: Good switch to a context-managed LLM lifecycle.

Using TorchLLM as a context manager is the right move here. It ensures deterministic teardown of GPU/UCX resources and avoids handle leaks; placing the assertions inside the with-block validates state before cleanup.

@tensorrt-cicd

PR_Github #15661 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #15661 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #11786 completed with status: 'FAILURE'

@BatshevaBlack

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd

PR_Github #15729 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #15729 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11822 completed with status: 'SUCCESS'
Pipeline passed with automatically retried tests. Check the rerun report for details.

@BatshevaBlack BatshevaBlack deleted the upgrade-ucx-nixl branch September 1, 2025 13:18