[None][infra] Upgrade UCX to v1.19.x and NIXL to 0.5.0 #6966
Conversation
📝 Walkthrough

Bumps NIXL and UCX installer versions and adjusts NIXL libcuda discovery and LD_LIBRARY_PATH handling; updates four LLM Jenkins image tag variables to tritondevel staging tags; moves SyncMessage delivery out of TransferRequest into a new notifySyncMessage API and updates tests; lowers free_gpu_memory_fraction in a disaggregated test config; and changes test usage of TorchLLM to a context manager.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Test
    participant AgentA
    participant AgentB
    Test->>AgentA: notifySyncMessage(recipient=AgentB, message)
    AgentA->>AgentB: deliver SyncMessage
    Test->>AgentA: create TransferRequest(op, srcDescs, dstDescs, recipient)
    AgentA->>AgentB: perform data transfer (TransferRequest)
    AgentB-->>Test: observe SyncMessage and transfer completion
```

```mermaid
sequenceDiagram
    participant User
    participant install_nixl.sh
    participant FS
    participant ENV
    User->>install_nixl.sh: run script
    install_nixl.sh->>install_nixl.sh: save OLD_LD_LIBRARY_PATH
    install_nixl.sh->>FS: search /usr/local for libcuda.so.1
    alt found
        install_nixl.sh->>install_nixl.sh: set CUDA_SO_PATH and lib dir
        install_nixl.sh->>install_nixl.sh: export LD_LIBRARY_PATH with cuda dir
        install_nixl.sh->>ENV: append export LD_LIBRARY_PATH line
        install_nixl.sh->>install_nixl.sh: restore OLD_LD_LIBRARY_PATH
    else not found
        install_nixl.sh-->>User: print "libcuda.so.1 not found" and exit
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
/bot run --stage-list "Build-Docker-Images"
Actionable comments posted: 2
♻️ Duplicate comments (1)
docker/common/install_nixl.sh (1)
45-45: Ensure ENV file is defined (dup).

Same note as in install_ucx.sh: confirm ENV is set to a valid path before appending.
🧹 Nitpick comments (7)
docker/common/install_ucx.sh (5)
5-5: Consider pinning to an exact tag or commit for reproducibility.

Using a moving branch like v1.19.x can change over time. Prefer a specific tag (e.g., v1.19.1) or a commit SHA to make builds deterministic.
If you choose to pin, apply:
```diff
-UCX_VERSION="v1.19.x"
+UCX_VERSION="v1.19.1"
```
10-12: Harden clone/install steps and quote paths.
- Quote paths to avoid word-splitting.
- Use --single-branch and explicit --branch for clarity.
- Optional: add basic retry to reduce transient network flakiness.
Apply:
```diff
-rm -rf ${UCX_INSTALL_PATH}
-git clone --depth 1 -b ${UCX_VERSION} ${UCX_REPO}
+rm -rf "${UCX_INSTALL_PATH}"
+git clone --depth 1 --single-branch --branch "${UCX_VERSION}" "${UCX_REPO}"
 cd ucx
```

Optionally, add a simple retry wrapper around git clone if CI flakes are a concern.
26-26: Shellcheck SC2046: quote -j argument.

Prevents accidental word-splitting; also matches static analysis hint.
Apply:
```diff
-make install -j$(nproc)
+make install -j"$(nproc)"
```
29-29: Avoid double slash in path and ensure ENV is set.
- UCX_INSTALL_PATH ends with a slash; `${UCX_INSTALL_PATH}/lib` yields `//lib`. Use `${UCX_INSTALL_PATH%/}` to normalize.
- Verify that ENV is defined to avoid writing to an empty path.
Apply:
-echo "export LD_LIBRARY_PATH=${UCX_INSTALL_PATH}/lib:\$LD_LIBRARY_PATH" >> "${ENV}" +echo "export LD_LIBRARY_PATH=${UCX_INSTALL_PATH%/}/lib:\$LD_LIBRARY_PATH" >> "${ENV}"Please confirm that ENV is always set in the Docker build context; if not, default it or fail fast.
10-29: Unconditional reinstall acknowledged.

The remove-and-reinstall flow aligns with the PR objective to always rebuild UCX. Just note this increases build time; caching or a version-guard could be reintroduced later if CI time becomes an issue.
docker/common/install_nixl.sh (2)
9-9: Preserve empty LD_LIBRARY_PATH safely and quote.

Avoids leading/trailing colons and word-splitting later.
Apply:
```diff
-OLD_LD_LIBRARY_PATH=$LD_LIBRARY_PATH
+OLD_LD_LIBRARY_PATH="${LD_LIBRARY_PATH:-}"
```
43-43: Restore LD_LIBRARY_PATH with quotes.

Prevents word-splitting and preserves emptiness faithfully.
Apply:
```diff
-export LD_LIBRARY_PATH=$OLD_LD_LIBRARY_PATH
+export LD_LIBRARY_PATH="$OLD_LD_LIBRARY_PATH"
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- docker/common/install_nixl.sh (3 hunks)
- docker/common/install_ucx.sh (1 hunks)
🧰 Additional context used
🪛 Shellcheck (0.10.0)
docker/common/install_ucx.sh
[warning] 26-26: Quote this to prevent word splitting.
(SC2046)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
docker/common/install_nixl.sh (2)
7-7: Version bump to 0.5.0 looks good.

No issues spotted with the new NIXL version reference.
45-45: Prevent premature expansion of LD_LIBRARY_PATH in persisted ENV.

As written, ${LD_LIBRARY_PATH} expands at build time, baking a fixed value. Escape it so runtime shells inherit/extend their current LD_LIBRARY_PATH.
Apply:
-echo "export LD_LIBRARY_PATH=/opt/nvidia/nvda_nixl/lib/${ARCH_NAME}:/opt/nvidia/nvda_nixl/lib64:${LD_LIBRARY_PATH}" >> "${ENV}" +echo "export LD_LIBRARY_PATH=/opt/nvidia/nvda_nixl/lib/${ARCH_NAME}:/opt/nvidia/nvda_nixl/lib64:\$LD_LIBRARY_PATH" >> "${ENV}"Likely an incorrect or invalid review comment.
```bash
CUDA_SO_PATH=$(find "/usr/local" -name "libcuda.so.1" 2>/dev/null | head -n1)

if [[ -z "$CUDA_SO_PATH" ]]; then
    echo "libcuda.so.1 not found"
    exit 1
fi
```
🛠️ Refactor suggestion
Broaden CUDA library discovery beyond /usr/local.
libcuda.so.1 commonly lives under host driver mounts like /usr/lib/x86_64-linux-gnu or /usr/lib/aarch64-linux-gnu. Searching only /usr/local is likely to fail on many systems.
Apply:
```diff
-CUDA_SO_PATH=$(find "/usr/local" -name "libcuda.so.1" 2>/dev/null | head -n1)
+CUDA_SO_PATH=$(
+  { find /usr/local /usr -type f -name "libcuda.so.1" 2>/dev/null; \
+    find "/usr/local/cuda/compat" -type f -name "libcuda.so.1" 2>/dev/null; } \
+  | head -n1
+)
 if [[ -z "$CUDA_SO_PATH" ]]; then
   echo "libcuda.so.1 not found"
   exit 1
 fi
```

Notes:
- Includes common locations: /usr/lib/x86_64-linux-gnu, /usr/lib/aarch64-linux-gnu, /usr/local/cuda/compat.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```bash
CUDA_SO_PATH=$(
  { find /usr/local /usr -type f -name "libcuda.so.1" 2>/dev/null; \
    find "/usr/local/cuda/compat" -type f -name "libcuda.so.1" 2>/dev/null; } \
  | head -n1
)
if [[ -z "$CUDA_SO_PATH" ]]; then
  echo "libcuda.so.1 not found"
  exit 1
fi
```
🤖 Prompt for AI Agents
In docker/common/install_nixl.sh around lines 22 to 27, the script only searches
/usr/local for libcuda.so.1 which misses common host driver mounts; update the
discovery to search additional standard locations (/usr/lib/x86_64-linux-gnu,
/usr/lib/aarch64-linux-gnu, /usr/local/cuda/compat) and fall back to a broader
search (e.g., under /usr or root) so libcuda.so.1 is found on systems where
drivers are mounted outside /usr/local; ensure the new search still returns the
first match and preserves the existing error handling if nothing is found.
```bash
CUDA_SO_PATH=$(dirname $CUDA_SO_PATH)

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_SO_PATH
```
🛠️ Refactor suggestion
Quote dirname input and avoid leading colon when appending.
- Use dirname with -- and quotes.
- Only insert a colon if LD_LIBRARY_PATH is non-empty.
Apply:
```diff
-CUDA_SO_PATH=$(dirname $CUDA_SO_PATH)
-
-export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_SO_PATH
+CUDA_SO_DIR=$(dirname -- "$CUDA_SO_PATH")
+
+export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$CUDA_SO_DIR"
```

Also update subsequent references to use CUDA_SO_DIR if any (none outside these lines).
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```bash
CUDA_SO_DIR=$(dirname -- "$CUDA_SO_PATH")

export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$CUDA_SO_DIR"
```
🤖 Prompt for AI Agents
In docker/common/install_nixl.sh around lines 29 to 31, change the use of
dirname to safely handle paths by using dirname -- "$CUDA_SO_PATH" and store the
result in a new variable CUDA_SO_DIR; then append to LD_LIBRARY_PATH without
producing a leading colon by only adding ":$CUDA_SO_DIR" when LD_LIBRARY_PATH is
non-empty (otherwise set LD_LIBRARY_PATH="$CUDA_SO_DIR"). Also replace any later
references to CUDA_SO_PATH with CUDA_SO_DIR if present.
PR_Github #15531 [ run ] triggered by Bot
/bot run --add-multi-gpu-test |
Actionable comments posted: 1
🧹 Nitpick comments (2)
jenkins/current_image_tags.properties (2)
14-17: Prefer digest-pinned images for reproducibility

Staging tags are mutable and can be GC'ed; pinning by digest (image@sha256:…) prevents drift and makes PR reruns reproducible.
If keeping tags is required for ergonomics, consider adding companion DIGEST variables and having the pipeline prefer digests when present. I can generate a follow-up patch if you provide the digests.
14-17: Verify images actually contain UCX v1.19.x and NIXL 0.5.0

Given the PR objective, ensure the staging images embed the intended versions so CI exercises the upgrade.
Suggested minimal verification step (to add in CI or a pre-build script):
- ucx_info -v (should show 1.19.x)
- strings /opt/nvidia/nvda_nixl/lib*/libnixl.so | grep -E '(^| )0.5.0( |$)' or nixl version command if available
I can prep a small CI snippet to print these versions early in the logs for quick triage.
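A minimal sketch of such a check (assuming the install paths above; NIXL may expose a dedicated version command instead of the strings probe):

```bash
#!/usr/bin/env bash
# Print toolchain versions early in CI logs for quick triage.
set -euo pipefail

echo "== UCX =="
ucx_info -v | head -n1   # expect a 1.19.x version line

echo "== NIXL =="
# No official NIXL version CLI is assumed here; grep the library for the version string.
strings /opt/nvidia/nvda_nixl/lib*/libnixl.so 2>/dev/null \
  | grep -m1 -E '(^| )0\.5\.0( |$)' || echo "NIXL 0.5.0 marker not found"
```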
📒 Files selected for processing (1)
- jenkins/current_image_tags.properties (1 hunks)
🔇 Additional comments (1)
jenkins/current_image_tags.properties (1)
14-17: Rename tooling and consumers support the new tag shape

I've verified that:
- The IMAGE_MAPPING in scripts/rename_docker_images.py has been updated to produce the new tensorrt-llm-staging/tritondevel: prefixes and tag patterns.
- All downstream consumers (Jenkins pipelines, Makefiles, performance scripts, etc.) source the LLM_*_DOCKER_IMAGE variables and treat them as opaque image URIs; there are no regex-based parses of the old tag shape.
- No hard-coded references to the old tensorrt-llm: prefix or redundant substrings remain in the rename script or in the codebase.

The rename script will generate and propagate the updated tags correctly, and no tooling assumptions have been broken.
```properties
LLM_DOCKER_IMAGE=urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm-staging/tritondevel:x86_64-tritondevel-torch_skip-a164e44-github-pr-6966-704
LLM_SBSA_DOCKER_IMAGE=urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm-staging/tritondevel:sbsa-tritondevel-torch_skip-a164e44-github-pr-6966-704
LLM_ROCKYLINUX8_PY310_DOCKER_IMAGE=urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm-staging/tritondevel:x86_64-rockylinux8-torch_skip-py310-a164e44-github-pr-6966-704
LLM_ROCKYLINUX8_PY312_DOCKER_IMAGE=urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm-staging/tritondevel:x86_64-rockylinux8-torch_skip-py312-a164e44-github-pr-6966-704
```
💡 Verification agent
🧩 Analysis chain
Confirm “torch_skip” images won’t break stages expecting PyTorch
Tags contain torch_skip; if any PR stages rely on torch presence (builds/tests importing torch or using torch CUDA toolchain), they’ll fail or silently skip coverage.
Please validate stage mappings and adjust bot invocation if needed (e.g., skip torch-dependent stages) or switch to non-skip images for those stages. To audit, list stage mappings and references:
If specific stages require torch, I can help craft a stage-list for the /bot run that avoids them or switches images conditionally.
🏁 Script executed:
```bash
#!/bin/bash
set -euo pipefail
echo "Test-to-stage mapping (if present):"
fd -a test_to_stage_mapping.py -x sh -c 'echo "--- {} ---"; sed -n "1,200p" "{}"' sh || true
echo
echo "Search for torch usage within CI scripts and Jenkinsfiles:"
rg -n -C2 -P '\bimport\s+torch\b|torch\.' scripts Jenkinsfile jenkins || true
```

Length of output: 18194
🚨 Ensure PyTorch support in CI for torch_skip images
We’ve confirmed the Jenkins pipelines include PyTorch-specific stages and tests (e.g.
– tests/unittest/llmapi/test_llm_pytorch.py in L0_MergeRequest.groovy
– perf stages in L0_Test.groovy that install torch via pip3)
The new *_torch_skip* images omit PyTorch and will cause these stages to fail or silently skip coverage.
Action items:
- In jenkins/current_image_tags.properties (lines 14–17), introduce or revert to non-skip image tags for PyTorch-dependent stages (e.g. LLM_DOCKER_IMAGE_PYTORCH=…).
- In Jenkinsfiles:
  • jenkins/L0_MergeRequest.groovy
  • jenkins/L0_Test.groovy
  update stage-to-image selection so that any stage with "PyTorch" uses a full image, or explicitly skip those stages when running on a torch-free image.
These CI fixes are required to prevent test failures or missing coverage in PyTorch stages.
🤖 Prompt for AI Agents
In jenkins/current_image_tags.properties around lines 14 to 17, the current
LLM_*_DOCKER_IMAGE values point to "torch_skip" images which lack PyTorch and
will break or skip PyTorch-dependent CI stages; add explicit non-skip PyTorch
image entries (for example LLM_DOCKER_IMAGE_PYTORCH,
LLM_SBSA_DOCKER_IMAGE_PYTORCH, LLM_ROCKYLINUX8_PY310_DOCKER_IMAGE_PYTORCH,
LLM_ROCKYLINUX8_PY312_DOCKER_IMAGE_PYTORCH) referencing corresponding images
that include torch, and then update Jenkinsfiles (jenkins/L0_MergeRequest.groovy
and jenkins/L0_Test.groovy) so PyTorch stages select the *_PYTORCH images (or
explicitly skip those stages when the selected image is a torch-free
"torch_skip" variant).
PR_Github #15532 [ run ] triggered by Bot

PR_Github #15531 [ run ] completed with state

PR_Github #15532 [ run ] completed with state
/bot run --add-multi-gpu-test |
PR_Github #15536 [ run ] triggered by Bot

PR_Github #15536 [ run ] completed with state
/bot run --add-multi-gpu-test |
Actionable comments posted: 0
🔭 Outside diff range comments (1)
cpp/tests/unit_tests/executor/transferAgentTest.cpp (1)
299-304: Wrong agent used in checkRemoteDescs; likely causes a tight infinite loop.

After connecting nixlAgent1 to agent0 (Line 287), the remote-desc check should be initiated from nixlAgent1 against agent0. Currently it checks from nixlAgent0 to agent1 with a descriptor owned by agent0, which is inconsistent and will likely never succeed.
Apply this diff:
```diff
-    bool checked2 = false;
-    do
-    {
-        checked2 = nixlAgent0->checkRemoteDescs(agent1, regMem1.getDescs());
-    } while (!checked2);
+    bool checked2 = false;
+    do
+    {
+        checked2 = nixlAgent1->checkRemoteDescs(agent0, regMem1.getDescs());
+    } while (!checked2);
```
🧹 Nitpick comments (2)
cpp/tests/unit_tests/executor/transferAgentTest.cpp (2)
262-266: Replace unbounded busy-waits with bounded wait + sleep to avoid CPU spin and test timeouts.

These loops can spin forever and burn CPU. Add a small sleep and a bounded retry budget to make tests robust on slower CI.
Apply these diffs:
```diff
-    for (std::size_t i = 0; i < MAX_QUERY_TIMES && notif.size() == 0; i++)
+    for (int i = 0; i < 5000 && notif.empty(); ++i)
     {
+        std::this_thread::sleep_for(std::chrono::milliseconds(1));
         notif = nixlAgent1->getNotifiedSyncMessages();
     }

-    for (std::size_t i = 0; i < MAX_QUERY_TIMES && notif2.size() == 0; i++)
+    for (int i = 0; i < 5000 && notif2.empty(); ++i)
     {
+        std::this_thread::sleep_for(std::chrono::milliseconds(1));
         notif2 = nixlAgent1->getNotifiedSyncMessages();
     }

-    for (std::size_t i = 0; i < MAX_QUERY_TIMES && notif3.size() == 0; i++)
+    for (int i = 0; i < 5000 && notif3.empty(); ++i)
     {
+        std::this_thread::sleep_for(std::chrono::milliseconds(1));
         notif3 = nixlAgent0->getNotifiedSyncMessages();
     }

-    for (std::size_t i = 0; i < MAX_QUERY_TIMES && notif4.size() == 0; i++)
+    for (int i = 0; i < 5000 && notif4.empty(); ++i)
     {
+        std::this_thread::sleep_for(std::chrono::milliseconds(1));
         notif4 = nixlAgent0->getNotifiedSyncMessages();
     }

-    for (size_t i = 0; i < MAX_QUERY_TIMES && notif5.size() == 0; i++)
+    for (int i = 0; i < 5000 && notif5.empty(); ++i)
     {
+        std::this_thread::sleep_for(std::chrono::milliseconds(1));
         notif5 = nixlAgent1->getNotifiedSyncMessages();
     }
```

Additionally, add the necessary includes once at the top of the file:

```cpp
#include <thread>
#include <chrono>
```

Also applies to: 276-280, 291-294, 310-313, 330-333
146-152: Wrap CUDA calls with TLLM_CUDA_CHECK for consistent error handling.

cudaMalloc/cudaMemcpy are unchecked, while cudaFree is checked. Align all CUDA calls to fail fast in tests.
Apply these diffs:
```diff
-    cudaMalloc(&dev_ptr0, size);
-    cudaMalloc(&dev_ptr1, size);
+    TLLM_CUDA_CHECK(cudaMalloc(&dev_ptr0, size));
+    TLLM_CUDA_CHECK(cudaMalloc(&dev_ptr1, size));

-    cudaMemcpy(dev_ptr0, memory0.data(), size, cudaMemcpyHostToDevice);
-    cudaMemcpy(dev_ptr1, memory1.data(), size, cudaMemcpyHostToDevice);
+    TLLM_CUDA_CHECK(cudaMemcpy(dev_ptr0, memory0.data(), size, cudaMemcpyHostToDevice));
+    TLLM_CUDA_CHECK(cudaMemcpy(dev_ptr1, memory1.data(), size, cudaMemcpyHostToDevice));

-    cudaMemcpy(memory0.data(), dev_ptr0, size, cudaMemcpyDeviceToHost);
-    cudaMemcpy(memory1.data(), dev_ptr1, size, cudaMemcpyDeviceToHost);
+    TLLM_CUDA_CHECK(cudaMemcpy(memory0.data(), dev_ptr0, size, cudaMemcpyDeviceToHost));
+    TLLM_CUDA_CHECK(cudaMemcpy(memory1.data(), dev_ptr1, size, cudaMemcpyDeviceToHost));
```

Also applies to: 169-171
📒 Files selected for processing (1)
- cpp/tests/unit_tests/executor/transferAgentTest.cpp (2 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tests/unit_tests/executor/transferAgentTest.cpp
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tests/unit_tests/executor/transferAgentTest.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tests/unit_tests/executor/transferAgentTest.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tests/unit_tests/executor/transferAgentTest.cpp
🔇 Additional comments (2)
cpp/tests/unit_tests/executor/transferAgentTest.cpp (2)
258-260: New notifySyncMessage + 4-arg TransferRequest usage looks correct.

Decoupling sync-message delivery from the transfer payload is exercised properly here. The 4-argument TransferRequest matches the updated API.
306-308: Second notifySyncMessage + write flow looks correct.

Assuming the remote-desc readiness check is corrected (Lines 299-304), this sequence aligns with the new API and intended semantics.
Force-pushed from ab547ce to 855cd3a (Compare)
/bot run --add-multi-gpu-test
Actionable comments posted: 1
🔭 Outside diff range comments (1)
cpp/tests/unit_tests/executor/transferAgentTest.cpp (1)
299-308: Wrong agent in checkRemoteDescs before write from agent1 → agent0.

You connect agent1 to agent0 and then issue a write from agent1 to agent0, but the readiness check is performed by agent0 against agent1 and uses local descriptors. This is logically inverted and may mask issues.
```diff
-    bool checked2 = false;
-    do
-    {
-        checked2 = nixlAgent0->checkRemoteDescs(agent1, regMem1.getDescs());
-    } while (!checked2);
+    bool checked2 = false;
+    do
+    {
+        // agent1 is about to write to agent0; ensure agent0's descriptors are known to agent1
+        checked2 = nixlAgent1->checkRemoteDescs(agent0, regMem1.getDescs());
+    } while (!checked2);
```
♻️ Duplicate comments (3)
docker/common/install_nixl.sh (2)
22-27: Broaden libcuda discovery beyond /usr/local (likely failure on many hosts).

Restricting to /usr/local will commonly miss libcuda.so.1 under /usr/lib/*-linux-gnu or /usr/local/cuda/compat, causing the script to exit. Reuse the previously suggested multi-path search.
```diff
-CUDA_SO_PATH=$(find "/usr/local" -name "libcuda.so.1" 2>/dev/null | head -n1)
-
-if [[ -z "$CUDA_SO_PATH" ]]; then
-  echo "libcuda.so.1 not found"
-  exit 1
-fi
+CUDA_SO_PATH=$(
+  {
+    find /usr/local /usr -type f -name "libcuda.so.1" 2>/dev/null;
+    find "/usr/local/cuda/compat" -type f -name "libcuda.so.1" 2>/dev/null;
+  } | head -n1
+)
+if [[ -z "$CUDA_SO_PATH" ]]; then
+  echo "libcuda.so.1 not found"
+  exit 1
+fi
```
29-31: Quote dirname input and avoid leading colon when appending.

Use dirname safely and only prepend a colon when LD_LIBRARY_PATH is non-empty.
```diff
-CUDA_SO_PATH=$(dirname $CUDA_SO_PATH)
-
-export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_SO_PATH
+CUDA_SO_DIR=$(dirname -- "$CUDA_SO_PATH")
+export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$CUDA_SO_DIR"
```

jenkins/current_image_tags.properties (1)
14-17: torch_skip images likely break PyTorch stages; provide torch-enabled variants or adjust stage mapping.

CI pipelines contain PyTorch-dependent stages; using torch_skip images will fail those stages or skip coverage. Either:
- introduce explicit torch-enabled image variables for PyTorch stages and update Jenkinsfiles to use them, or
- gate/skip PyTorch stages when a torch_skip image is selected.
I can help patch Jenkinsfiles if you confirm the strategy. To audit current PyTorch usage quickly:
#!/bin/bash set -euo pipefail # Look for torch usage in CI configs and tests rg -n -C2 -P 'pip(\S+)?\s+install\s+(\S*torch\S*)|import\s+torch\b|torch\.' jenkins scripts tests || true # Show where these image vars are referenced rg -n -C2 -P 'LLM(_SBSA)?(_ROCKYLINUX8_.*)?_DOCKER_IMAGE' jenkins || trueIf you prefer adding torch-enabled images alongside current ones, append after Line 17:
```properties
LLM_DOCKER_IMAGE_PYTORCH=...<full torch image tag>...
LLM_SBSA_DOCKER_IMAGE_PYTORCH=...<full torch image tag>...
LLM_ROCKYLINUX8_PY310_DOCKER_IMAGE_PYTORCH=...<full torch image tag>...
LLM_ROCKYLINUX8_PY312_DOCKER_IMAGE_PYTORCH=...<full torch image tag>...
```

Then update PyTorch stages in Jenkinsfiles to select the *_PYTORCH variables.
🧹 Nitpick comments (5)
docker/common/install_ucx.sh (4)
5-5: Prefer a pinned UCX tag or commit for reproducible builds.

Using a moving branch (v1.19.x) can yield different artifacts over time. Consider pinning to a specific tag (e.g., v1.19.0/v1.19.1) or a commit SHA to ensure deterministic CI.
```diff
-UCX_VERSION="v1.19.x"
+# Example: pin to a specific version or commit for reproducibility
+UCX_VERSION="v1.19.1"
```
10-10: Harden rm -rf with quoting and trailing-slash guard.

Quote the path and strip a trailing slash to avoid accidental globbing/path mishaps.
```diff
-rm -rf ${UCX_INSTALL_PATH}
+rm -rf -- "${UCX_INSTALL_PATH%/}"
```
26-26: Shellcheck SC2046: quote and place -j before the target.

Quote $(nproc) and pass -j to make, not the install target.
```diff
-make install -j$(nproc)
+make -j"$(nproc)" install
```
29-29: Ensure idempotent LD_LIBRARY_PATH export in install_ucx.sh

Repeated runs of install_ucx.sh will currently append duplicate export lines, and although Dockerfile.multi sets ENV and install_base.sh creates the file, it's safer to guard against both an unset file and duplicate entries. Update the snippet as follows:

```diff
-echo "export LD_LIBRARY_PATH=${UCX_INSTALL_PATH}/lib:\$LD_LIBRARY_PATH" >> "${ENV}"
+mkdir -p "$(dirname "$ENV")"
+touch "$ENV"
+# Append only once; keep LD_LIBRARY_PATH dynamic at runtime
+grep -qsF "${UCX_INSTALL_PATH%/}/lib" "$ENV" \
+  || echo "export LD_LIBRARY_PATH=${UCX_INSTALL_PATH%/}/lib:\$LD_LIBRARY_PATH" >> "$ENV"
```

• Keeps using the $ENV path defined by Dockerfile.multi
• Ensures the target file exists before appending
• Prevents duplicate exports on subsequent runs

cpp/tests/unit_tests/executor/transferAgentTest.cpp (1)
257-271: Avoid unbounded busy-wait loops; add a bounded wait with small sleeps.

MAX_QUERY_TIMES is set to size_t::max and loops spin without sleep. If notifications are delayed or lost, this can burn CPU indefinitely.
Example refactor (outside this hunk):
```cpp
#include <chrono>
#include <thread>
// ...
for (std::size_t i = 0; i < 1000 && notif.empty(); ++i)
{
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
    notif = nixlAgent1->getNotifiedSyncMessages();
}
```
📒 Files selected for processing (4)
- cpp/tests/unit_tests/executor/transferAgentTest.cpp (2 hunks)
- docker/common/install_nixl.sh (3 hunks)
- docker/common/install_ucx.sh (1 hunks)
- jenkins/current_image_tags.properties (1 hunks)
🧰 Additional context used
🪛 Shellcheck (0.10.0)
docker/common/install_ucx.sh
[warning] 26-26: Quote this to prevent word splitting.
(SC2046)
```bash
rm $CUDA_PATH/lib64/libcuda.so.1
export LD_LIBRARY_PATH=$OLD_LD_LIBRARY_PATH

echo "export LD_LIBRARY_PATH=/opt/nvidia/nvda_nixl/lib/${ARCH_NAME}:/opt/nvidia/nvda_nixl/lib64:\$LD_LIBRARY_PATH" >> "${ENV}"
```
💡 Verification agent
🧩 Analysis chain
Make LD_LIBRARY_PATH persistence safe and idempotent; handle unset ENV.
As with UCX, ensure ENV is defined, create its directory, and avoid duplicate lines on repeated runs.
-echo "export LD_LIBRARY_PATH=/opt/nvidia/nvda_nixl/lib/${ARCH_NAME}:/opt/nvidia/nvda_nixl/lib64:\$LD_LIBRARY_PATH" >> "${ENV}"
+ENV=${ENV:-/etc/profile.d/trtllm-env.sh}
+mkdir -p "$(dirname -- "$ENV")"
+grep -qsF "/opt/nvidia/nvda_nixl/lib/${ARCH_NAME}" "$ENV" || \
+ echo 'export LD_LIBRARY_PATH=/opt/nvidia/nvda_nixl/lib/${ARCH_NAME}:/opt/nvidia/nvda_nixl/lib64:$LD_LIBRARY_PATH' >> "$ENV"Optional runtime verification in the built image:
- ldd on a NIXL .so resolves against host driver’s libcuda without needing temporary LD_LIBRARY_PATH.
- If not, consider persisting the discovered CUDA_SO_DIR as a fallback.
🏁 Script executed:
```bash
#!/bin/bash
set -e
# Display the first 200 lines of the install_nixl script to inspect ENV usage
sed -n '1,200p' docker/common/install_nixl.sh
```

Length of output: 1354
Ensure ENV is defined and make LD_LIBRARY_PATH update idempotent
The script currently ends with:
echo "export LD_LIBRARY_PATH=/opt/nvidia/nvda_nixl/lib/${ARCH_NAME}:/opt/nvidia/nvda_nixl/lib64:\$LD_LIBRARY_PATH" >> "${ENV}"– ENV isn’t set in this script (causing a redirection error), and repeated runs will keep appending duplicates. Please update as follows:
- echo "export LD_LIBRARY_PATH=/opt/nvidia/nvda_nixl/lib/${ARCH_NAME}:/opt/nvidia/nvda_nixl/lib64:\$LD_LIBRARY_PATH" >> "${ENV}"
+ # Ensure ENV is defined, its directory exists, and avoid duplicate entries
+ ENV=${ENV:-/etc/profile.d/trtllm-env.sh}
+ mkdir -p "$(dirname -- "$ENV")"
+ if ! grep -qF "/opt/nvidia/nvda_nixl/lib/${ARCH_NAME}" "$ENV"; then
+ echo "export LD_LIBRARY_PATH=/opt/nvidia/nvda_nixl/lib/${ARCH_NAME}:/opt/nvidia/nvda_nixl/lib64:\$LD_LIBRARY_PATH" >> "$ENV"
+ fiOptional container‐image check: run ldd on one of the NIXL .so files to confirm it picks up the host’s libcuda.so without needing a transient LD_LIBRARY_PATH, or fall back to persisting the discovered CUDA_SO_PATH if necessary.
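A minimal sketch of that ldd check (assuming the NIXL libraries link libcuda directly; the find pattern is a guess):

```bash
#!/usr/bin/env bash
# Confirm a NIXL shared object resolves libcuda from the host driver mount.
set -euo pipefail
NIXL_SO=$(find /opt/nvidia/nvda_nixl -name 'libnixl*.so*' 2>/dev/null | head -n1)
# Require an actual resolved path ("=> /..."), not a "not found" entry.
if ldd "$NIXL_SO" | grep -E 'libcuda\.so[^ ]* => /'; then
    echo "libcuda resolved without a transient LD_LIBRARY_PATH"
else
    echo "libcuda unresolved; consider persisting the discovered CUDA_SO_PATH"
fi
```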
🤖 Prompt for AI Agents
In docker/common/install_nixl.sh around line 45, the script appends an
LD_LIBRARY_PATH export to "${ENV}" even though ENV may be undefined and repeated
runs will duplicate the line; update the script to ensure ENV is defined (use a
documented default like /etc/profile.d/nvidia_nixl.sh if not provided and create
the file if missing) and make the write idempotent by replacing any existing
NIXL LD_LIBRARY_PATH export (or removing matching lines) before writing a single
canonical export line; additionally, as an optional follow-up, run ldd on one of
the installed NIXL .so files to see if system libcuda.so is already picked up
and, if so, avoid persisting LD_LIBRARY_PATH or otherwise persist the discovered
CUDA_SO_PATH instead.
Signed-off-by: Batsheva Black <[email protected]>
Force-pushed from 855cd3a to cd53b11 (Compare)
/bot run --add-multi-gpu-test
Actionable comments posted: 0
🧹 Nitpick comments (6)
docker/common/install_ucx.sh (6)
10-10: Guard rm -rf against empty or unexpected paths

A defensive guard prevents accidental deletions if UCX_INSTALL_PATH is ever empty/mis-set. Low risk inside Docker, but cheap insurance.
Apply:
```diff
-rm -rf ${UCX_INSTALL_PATH}
+: "${UCX_INSTALL_PATH:?UCX_INSTALL_PATH must be set}"
+[[ "${UCX_INSTALL_PATH}" != "/" ]] || { echo "Refusing to remove /"; exit 1; }
+rm -rf -- "${UCX_INSTALL_PATH%/}"
```
11-11: Use --single-branch and quote vars in git clone

Slightly reduces fetch size and hardens against word-splitting.
```diff
-git clone --depth 1 -b ${UCX_VERSION} ${UCX_REPO}
+git clone --depth 1 --single-branch -b "${UCX_VERSION}" "${UCX_REPO}"
```
4-4: Remove unused GITHUB_URL or wire it into UCX_REPO

Currently unused; either drop it or use it to allow mirroring.
Option A (remove):

```diff
-GITHUB_URL="https://github.com"
```

Option B (use it):

```diff
-GITHUB_URL="https://github.com"
-UCX_REPO="https://github.com/openucx/ucx.git"
+GITHUB_URL="https://github.com"
+UCX_REPO="${GITHUB_URL}/openucx/ucx.git"
```
26-26: Quote $(nproc) and avoid parallelized install target

ShellCheck SC2046: quote to prevent word splitting. Also, many projects recommend building in parallel, but running "install" serially to avoid race conditions.
```diff
-make install -j$(nproc)
+# Build in parallel, then install serially
+make -j"$(nproc)"
+make install
```

If you prefer a minimal change:

```diff
-make install -j$(nproc)
+make install -j"$(nproc)"
```
22-23: Fail early if CUDA_PATH is missing, or auto-discover CUDA

The hardcoded /usr/local/cuda usually exists, but a clear error helps debugging CI failures.
```diff
-CUDA_PATH="/usr/local/cuda"
+CUDA_PATH="/usr/local/cuda"
+if [[ ! -d "${CUDA_PATH}" ]]; then
+  echo "CUDA_PATH not found at ${CUDA_PATH}" >&2
+  exit 1
+fi
```

If you want to auto-discover (similar to the NIXL script changes), we can add that in a follow-up.
5-5: docker/common/install_ucx.sh:5 – Confirm UCX version pinning

Verified that the branch v1.19.x exists upstream and that the only stable tag available today is v1.19.0 (in addition to RC releases). Using UCX_VERSION="v1.19.x" will float to future patch releases when they are published.
If you require fully reproducible Docker images, consider pinning to the current stable tag:

```diff
-UCX_VERSION="v1.19.x"
+UCX_VERSION="v1.19.0"
```

Otherwise, leaving v1.19.x is acceptable to automatically pick up patch updates.
📒 Files selected for processing (4)
- cpp/tests/unit_tests/executor/transferAgentTest.cpp (2 hunks)
- docker/common/install_nixl.sh (3 hunks)
- docker/common/install_ucx.sh (1 hunks)
- jenkins/current_image_tags.properties (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- cpp/tests/unit_tests/executor/transferAgentTest.cpp
- docker/common/install_nixl.sh
- jenkins/current_image_tags.properties
🧰 Additional context used
🪛 Shellcheck (0.10.0)
docker/common/install_ucx.sh
[warning] 26-26: Quote this to prevent word splitting.
(SC2046)
🔇 Additional comments (3)
docker/common/install_ucx.sh (3)
13-25: Build flags preserved and look reasonable for UCX 1.19

Keeping shared libs, optimizations, verbs, DM, CMA, CUDA, and mt is consistent with prior configs. No concerns here.
28-28: Good: source tree cleanup to reduce image size

Removing the UCX source after install keeps images lean.
29-29: No changes needed: ENV is always defined and the script runs only once

We verified that in docker/Dockerfile.multi, `ENV ENV=${ENV:-/etc/shinit_v2}` is set before install_ucx.sh is invoked, and that install_ucx.sh is only executed once per build stage. Therefore:
- `$ENV` will never be unset or empty under `set -e`.
- There's no risk of duplicate `export LD_LIBRARY_PATH…` lines across rebuilds, since the script only runs a single time.
PR_Github #15618 [ run ] triggered by Bot
Signed-off-by: Batsheva Black <[email protected]>
Signed-off-by: Batsheva Black <[email protected]>
Force-pushed from cd53b11 to 4389aeb (Compare)
/bot run --add-multi-gpu-test

PR_Github #15624 [ run ] triggered by Bot

PR_Github #15618 [ run ] completed with state

PR_Github #15624 [ run ] completed with state

/bot run --add-multi-gpu-test

PR_Github #15648 [ run ] triggered by Bot

PR_Github #15648 [ run ] completed with state

/bot run --add-multi-gpu-test
Actionable comments posted: 0
🧹 Nitpick comments (1)
tests/unittest/llmapi/test_llm_args.py (1)
443-446: Avoid asserting against a private attribute; prefer a public accessor if available.

Accessing llm._executor_config couples the test to internal details and risks breakage on refactors. If a stable public API exists (e.g., llm.executor_config or a getter), prefer that. To keep tests resilient while remaining backward-compatible, consider this small tweak:
```diff
-    assert llm._executor_config.max_beam_width == 1
-    assert llm._executor_config.max_num_tokens == 256
-    assert llm._executor_config.max_seq_len == 128
-    assert llm._executor_config.max_batch_size == 8
+    _cfg = getattr(llm, "executor_config", llm._executor_config)
+    assert _cfg.max_beam_width == 1
+    assert _cfg.max_num_tokens == 256
+    assert _cfg.max_seq_len == 128
+    assert _cfg.max_batch_size == 8
```
📒 Files selected for processing (1)
- tests/unittest/llmapi/test_llm_args.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tests/unittest/llmapi/test_llm_args.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tests/unittest/llmapi/test_llm_args.py
🔇 Additional comments (1)
tests/unittest/llmapi/test_llm_args.py (1)
431-437: Good switch to a context-managed LLM lifecycle.

Using TorchLLM as a context manager is the right move here. It ensures deterministic teardown of GPU/UCX resources and avoids handle leaks; placing the assertions inside the with-block validates state before cleanup.
PR_Github #15661 [ run ] triggered by Bot

PR_Github #15661 [ run ] completed with state

Force-pushed from fa689a8 to 4389aeb (Compare)

/bot run --add-multi-gpu-test --disable-fail-fast

PR_Github #15729 [ run ] triggered by Bot

PR_Github #15729 [ run ] completed with state
Summary by CodeRabbit
Bug Fixes
Chores
Tests / Breaking Changes
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
Details
`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.
- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
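As one illustrative combination of the documented flags (not a command taken from this thread):

```
/bot run --disable-fail-fast --gpu-type "A30, H100_PCIe" --add-multi-gpu-test
```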
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill

`kill`

Kill all running builds associated with pull request.
skip

`skip --comment COMMENT`

Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.