Skip to content

Conversation

@chuangz0
Copy link
Collaborator

@chuangz0 chuangz0 commented Jul 16, 2025

PR title

ucx establish connection with zmq

need apt install -y libzmq3-dev

We used UCX listener to establish UCX connection before. https://github.com/NVIDIA/TensorRT-LLM/blob/fbee27990917affc73d2050384dd7e33d594a2[…]/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp
UCX listener bind to port and need UCX enable TCP.
If user set UCX_NET_DEVICES and exclude specific nic interface or set UCX_TLS=^tcp ,then the UCX listener won't work.
if UCX_NET_DEVICES include nic interface, kv cache transfer may use tcp ,which is very slow.

Please write the PR title by following template:

[JIRA ticket link/nvbug link/github issue link][fix/feat/doc/infra/...] <summary of this PR>

For example, assume I have a PR hope to support a new feature about cache manager of Jira TRTLLM-1000 ticket, it would be like

[TRTLLM-1000][feat] Support a new feature about cache manager

Description

Please explain the issue and the solution in short.

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

Summary by CodeRabbit

  • New Features

    • Introduced ZeroMQ-based communication for connection management, enabling more flexible and robust connection handling.
    • Added support for exchanging worker addresses using ZeroMQ messaging.
  • Refactor

    • Replaced previous UCX listener callback mechanism with a ZeroMQ protocol for connection establishment and management.
    • Updated connection methods to use worker address strings instead of UCX connection request handles.
  • Chores

    • Integrated the cppzmq library as a new third-party dependency.
    • Added ZeroMQ development libraries to Ubuntu and Rocky Linux installation scripts.
    • Updated Docker image tags for LLM environments to new repository paths and tagging format.
  • Bug Fixes

    • Enhanced error messages when loading the UCX wrapper library to include detailed dynamic linker error information.

@Shunkangz
Copy link
Collaborator

From my side, no more question for this patch. What I am thinking is how we can organize the zeromq to better leverage it in other places.

@chuangz0 chuangz0 force-pushed the ucx_with_zmq_connection branch from 5399a0e to b672ae4 Compare August 4, 2025 02:11
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Aug 4, 2025

📝 Walkthrough

Walkthrough

ZeroMQ (ZMQ) support is integrated into the UCX cache communication utility. The build system is updated to include the cppzmq submodule and link ZMQ dependencies. The connection management logic is refactored to use ZeroMQ messaging for exchanging worker addresses and managing connections, replacing the previous UCX listener mechanism. Additionally, package installation scripts and CI image tags are updated accordingly. Error reporting for UCX wrapper library loading is also improved.

Changes

Cohort / File(s) Change Summary
Submodule Addition
.gitmodules, 3rdparty/cppzmq
Added the cppzmq third-party library as a git submodule for ZeroMQ C++ bindings at a specific commit.
Build System Update
cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/CMakeLists.txt
Updated CMake configuration to include cppzmq headers and link against libzmq using pkg-config, integrating ZMQ into the build process.
UCX-ZMQ Communication Refactor
cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp, cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.h
Replaced UCX listener-based connection management with a ZeroMQ-based protocol. Introduced a message class, new helper functions, and changed connection logic to use ZMQ messaging for exchanging worker addresses and managing connections. Updated class members and method signatures accordingly.
Docker Package Installation Update
docker/common/install_base.sh
Added installation of ZeroMQ development packages (libzmq3-dev for Ubuntu and zeromq-devel for Rocky Linux) to the respective OS package installation scripts.
CI Image Tag Update
jenkins/current_image_tags.properties
Updated LLM Docker image tags to new timestamped build identifiers reflecting recent staging builds with commit and PR metadata.
Improved Error Diagnostics
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
Enhanced error message when loading the UCX wrapper library by appending dynamic linker error details from dlerror().

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant ZMQ_REQ_Socket
    participant Server
    participant ZMQ_REP_Socket
    participant UCX

    Client->>ZMQ_REQ_Socket: Send GET_WORKER_ADDRESS (with own worker address)
    ZMQ_REQ_Socket->>ZMQ_REP_Socket: Connect and send message
    ZMQ_REP_Socket->>Server: Notify of incoming request
    Server->>UCX: Get server worker address
    Server->>ZMQ_REP_Socket: Reply with SERVER_WORKER_ADDRESS (with server worker address)
    ZMQ_REP_Socket->>ZMQ_REQ_Socket: Send response
    ZMQ_REQ_Socket->>Client: Receive server worker address
    Client->>UCX: Create endpoint using server worker address
    Server->>UCX: Create endpoint using client worker address
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Note

⚡️ Unit Test Generation is now available in beta!

Learn more here, or try it out under "Finishing Touches" below.


📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 188ad04 and 91a644c.

📒 Files selected for processing (8)
  • .gitmodules (1 hunks)
  • 3rdparty/cppzmq (1 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (1 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/CMakeLists.txt (2 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp (7 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.h (3 hunks)
  • docker/common/install_base.sh (2 hunks)
  • jenkins/current_image_tags.properties (1 hunks)
✅ Files skipped from review due to trivial changes (2)
  • .gitmodules
  • jenkins/current_image_tags.properties
🚧 Files skipped from review as they are similar to previous changes (6)
  • 3rdparty/cppzmq
  • cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
  • docker/common/install_base.sh
  • cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/CMakeLists.txt
  • cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.h
  • cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai or @coderabbitai title anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (4)
.gitmodules (1)

26-28: Consider pinning cppzmq to a stable tag/branch for reproducible builds

Every CI run will fetch whatever the default branch of cppzmq happens to be at the time the submodule is updated. To avoid unexpected breakage, add an explicit branch = <tag-or-branch> (similar to cxxopts above) or rely on a specific commit hash in the super-project.

[submodule "3rdparty/cppzmq"]
 	path = 3rdparty/cppzmq
 	url = https://github.com/zeromq/cppzmq.git
+	# e.g. lock to latest release tag
+	branch = v4.10.0

This keeps builds deterministic and eases future upgrades because the diff will clearly show when you move to a newer cppzmq version.

3rdparty/cppzmq (1)

1-1: Consider declarative dependency management instead of a git sub-module

Using a CMake FetchContent or CPM declaration keeps third-party code out of the repo, simplifies updates, and avoids sub-module headaches. Worth considering if you don’t need to modify cppzmq locally.

cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/CMakeLists.txt (1)

12-12: Fix incorrect comment

The comment refers to "NIXL wrapper target" but should likely refer to ZMQ or remain generic.

-  # Add the NIXL wrapper target
+  # Link ZMQ libraries
cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp (1)

368-369: Consider setting socket options for reliability

Add timeout and linger options to the REQ socket to handle network issues gracefully.

 auto reqSocket = zmq::socket_t(mZmqContext, zmq::socket_type::req);
+reqSocket.set(zmq::sockopt::rcvtimeo, 30000); // 30 second timeout
+reqSocket.set(zmq::sockopt::linger, 0); // Don't wait on close
 reqSocket.connect(build_zmq_endpoint(ip, port));
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7547a7d and b672ae4.

📒 Files selected for processing (5)
  • .gitmodules (1 hunks)
  • 3rdparty/cppzmq (1 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/CMakeLists.txt (2 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp (7 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.h (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.h
🔇 Additional comments (4)
cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.h (1)

59-62: No thread‐safety issue detected for the ZMQ REP socket
All mZmqRepSocket operations (bind, recv, send) are confined to the dedicated mZmqRepThread. The main thread never calls into mZmqRepSocket while the thread is running—it only performs join() and then close() after the thread exits. This satisfies ZeroMQ’s single‐threaded socket requirement, so no additional synchronization is needed.

cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp (3)

34-74: LGTM! Well-structured message protocol

The UcxCmMessage class provides a clean protocol for ZMQ communication with proper serialization/deserialization support.


132-146: LGTM! Comprehensive endpoint parsing

The function correctly handles both IPv4 and IPv6 ZMQ endpoint formats with appropriate regex patterns.


285-303: LGTM! Clean shutdown mechanism

The destructor correctly uses a separate REQ socket to send the STOP message, avoiding thread-safety issues and ensuring graceful shutdown with acknowledgment.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🔭 Outside diff range comments (1)
docker/common/install_base.sh (1)

120-134: Install pkgconf alongside zeromq-devel for pkg-config support

On Rocky Linux (RHEL 8), the pkg-config functionality is provided by the pkgconf package, not “pkgconfig.” Add pkgconf before zeromq-devel so CMake/autotools can find the .pc metadata.

• File: docker/common/install_base.sh
• Lines: 120–134

   dnf install \
     openmpi \
     openmpi-devel \
     pigz \
-    rdma-core-devel \
-    zeromq-devel \
+    rdma-core-devel \
+    pkgconf \
+    zeromq-devel \
     -y
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b672ae4 and ea094ba.

📒 Files selected for processing (1)
  • docker/common/install_base.sh (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@chuangz0 chuangz0 force-pushed the ucx_with_zmq_connection branch from ea094ba to f7375c1 Compare August 4, 2025 03:00
@chuangz0
Copy link
Collaborator Author

chuangz0 commented Aug 4, 2025

/bot run --stage-list "Build-Docker-Images"

@tensorrt-cicd
Copy link
Collaborator

PR_Github #13906 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #13906 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #10469 (Partly Tested) completed with status: 'FAILURE'

@chuangz0
Copy link
Collaborator Author

chuangz0 commented Aug 4, 2025

/bot run --stage-list "Build-Docker-Images" --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #13911 [ run ] triggered by Bot

@chuangz0
Copy link
Collaborator Author

chuangz0 commented Aug 4, 2025

/bot kill

@chuangz0
Copy link
Collaborator Author

chuangz0 commented Aug 4, 2025

/bot run --add-multi-gpu-test

@chuangz0 chuangz0 force-pushed the ucx_with_zmq_connection branch from 87f20ee to 2704423 Compare August 4, 2025 06:19
@tensorrt-cicd
Copy link
Collaborator

PR_Github #13947 [ kill ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #13911 [ run ] completed with state ABORTED

@tensorrt-cicd
Copy link
Collaborator

PR_Github #13947 [ kill ] completed with state SUCCESS
Successfully killed previous jobs for commit 2704423

@tensorrt-cicd
Copy link
Collaborator

PR_Github #13948 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #13948 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #10503 completed with status: 'FAILURE'

@chuangz0 chuangz0 force-pushed the ucx_with_zmq_connection branch from 2704423 to 4552450 Compare August 4, 2025 08:47
@chuangz0
Copy link
Collaborator Author

chuangz0 commented Aug 4, 2025

/bot run --add-multi-gpu-test

@tensorrt-cicd
Copy link
Collaborator

PR_Github #13974 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #13974 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #10526 completed with status: 'FAILURE'

@chuangz0 chuangz0 force-pushed the ucx_with_zmq_connection branch from 4552450 to b7a3184 Compare August 4, 2025 11:27
@chuangz0
Copy link
Collaborator Author

chuangz0 commented Aug 4, 2025

/bot run --add-multi-gpu-test

@tensorrt-cicd
Copy link
Collaborator

PR_Github #13985 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #13985 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #10536 completed with status: 'FAILURE'

@chuangz0 chuangz0 force-pushed the ucx_with_zmq_connection branch from b7a3184 to 1b47732 Compare August 4, 2025 13:30
@chuangz0
Copy link
Collaborator Author

chuangz0 commented Aug 4, 2025

/bot run --add-multi-gpu-test

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp (2)

229-268: Add error handling in ZMQ message processing thread

The ZMQ REP thread should handle exceptions to prevent silent failures. Wrap the thread logic in try-catch to log errors before terminating.

 mZmqRepThread = std::thread(
     [this]()
     {
+        try
+        {
             while (true)
             {
                 zmq::message_t message;
                 auto ret = mZmqRepSocket.recv(message);
                 TLLM_CHECK_WITH_INFO(ret, "mZmqRepSocket.recv failed");
                 // ... rest of the loop ...
             }
+        }
+        catch (std::exception const& e)
+        {
+            TLLM_LOG_ERROR(mpi::MpiComm::world().getRank(), 
+                "ZMQ REP thread error: %s", e.what());
+        }
     });

343-344: Improve IPv6 address detection

The current regex pattern is too permissive and may incorrectly identify non-IPv6 addresses as IPv6. Consider using a more robust check.

-    std::regex ipv6_regex(R"([0-9a-fA-F]*:[0-9a-fA-F]*:[0-9a-fA-F]*.*)");
-    if (std::regex_match(ip, ipv6_regex) && ip.find(':') != std::string::npos)
+    // Check if IP contains ':' and doesn't contain '.' (simple heuristic for IPv6)
+    if (ip.find(':') != std::string::npos && ip.find('.') == std::string::npos)

Alternatively, use inet_pton to validate:

struct sockaddr_in6 sa6;
if (inet_pton(AF_INET6, ip.c_str(), &sa6.sin6_addr) == 1)
{
    oss << "tcp://[" << ip << "]:" << port;
}
🧹 Nitpick comments (1)
cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp (1)

368-385: Add explicit socket closure for proper resource management.

The ZMQ REQ socket should be explicitly closed after use to ensure proper resource cleanup, even though it will be destroyed when going out of scope.

             std::string serverWorkerAddress = serverMessage.mWorkerAddress.value();
             auto serverWorkerAddressPtr = ucxx::createAddressFromString(serverWorkerAddress);
             auto newEp = mWorkersPool.front()->createEndpointFromWorkerAddress(serverWorkerAddressPtr, true);
+            reqSocket.close();
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4552450 and 1b47732.

📒 Files selected for processing (7)
  • .gitmodules (1 hunks)
  • 3rdparty/cppzmq (1 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/CMakeLists.txt (2 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp (7 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.h (3 hunks)
  • docker/common/install_base.sh (2 hunks)
  • jenkins/current_image_tags.properties (1 hunks)
✅ Files skipped from review due to trivial changes (2)
  • .gitmodules
  • jenkins/current_image_tags.properties
🚧 Files skipped from review as they are similar to previous changes (4)
  • 3rdparty/cppzmq
  • docker/common/install_base.sh
  • cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/CMakeLists.txt
  • cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.h
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #defines whenever possible, as the latter are not visible to the compiler.
A variable that is not modified after its initialization should be declared as const.
Except 0 (only used in comparison for checking signness/existence/emptiness) and nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in the compilation of a target must have filenames that are case-insensitive unique.
All types (including class names) are camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal).
Locally visible static variable uses camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;).
Class member variables use camel case prefixed with an 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope and function-...

Files:

  • cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp
🧠 Learnings (2)
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp
📚 Learning: applies to **/*.{cpp,h,hpp,cc,cxx} : exceptions must not be thrown across library boundaries....
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-04T02:12:17.582Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : Exceptions must not be thrown across library boundaries.

Applied to files:

  • cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp (3)

132-146: LGTM!

The endpoint parsing logic correctly handles both IPv4 and IPv6 formats using appropriate regex patterns. The optional return type properly handles parsing failures.


285-307: LGTM!

The destructor properly implements graceful shutdown by sending a STOP message to the ZMQ REP thread, waiting for acknowledgment, and cleaning up all ZMQ resources. The sequence ensures proper thread termination before resource cleanup.


311-337: LGTM!

The refactored addConnection method correctly adapts to the new ZeroMQ-based workflow by creating UCX endpoints from worker address strings instead of connection requests. The async connection creation pattern is preserved.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14005 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14005 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10555 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@chuangz0 chuangz0 requested a review from ZhanruiSunCh August 5, 2025 03:21
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (1)

162-163: Good improvement for error diagnostics, but add null check for dlerror().

The addition of dlerror() to provide more detailed diagnostic information is excellent for debugging library loading issues. However, dlerror() can return nullptr in some cases.

Consider this safer approach:

-        TLLM_CHECK_WITH_INFO(
-            mWrapperLibHandle != nullptr, "UCX wrapper library is not open correctly. error : %s", dlerror());
+        const char* dlError = dlerror();
+        TLLM_CHECK_WITH_INFO(mWrapperLibHandle != nullptr, 
+            "UCX wrapper library is not open correctly. error : %s", dlError ? dlError : "unknown");
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e94d101 and 188ad04.

📒 Files selected for processing (1)
  • cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #defines whenever possible, as the latter are not visible to the compiler.
A variable that is not modified after its initialization should be declared as const.
Except 0 (only used in comparison for checking signness/existence/emptiness) and nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in the compilation of a target must have filenames that are case-insensitive unique.
All types (including class names) are camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal).
Locally visible static variable uses camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;).
Class member variables use camel case prefixed with an 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope and function-...

Files:

  • cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
🧠 Learnings (1)
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

Signed-off-by: Chuang Zhu <[email protected]>
Signed-off-by: Chuang Zhu <[email protected]>
Signed-off-by: Chuang Zhu <[email protected]>
Signed-off-by: Chuang Zhu <[email protected]>
@chuangz0 chuangz0 force-pushed the ucx_with_zmq_connection branch from 188ad04 to 91a644c Compare August 5, 2025 06:21
@chuangz0 chuangz0 changed the title ucx establish connection with zmq [None][chore] ucx establish connection with zmq Aug 5, 2025
@chuangz0 chuangz0 enabled auto-merge (squash) August 5, 2025 06:23
@chuangz0
Copy link
Collaborator Author

chuangz0 commented Aug 5, 2025

/bot skip --comment " all ci test has passed"

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14094 [ skip ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14094 [ skip ] completed with state SUCCESS
Skipping testing for commit 91a644c

@chuangz0 chuangz0 merged commit 4d040b5 into NVIDIA:main Aug 5, 2025
4 of 5 checks passed
lancelly pushed a commit to lancelly/TensorRT-LLM that referenced this pull request Aug 6, 2025
jain-ria pushed a commit to jain-ria/TensorRT-LLM that referenced this pull request Aug 7, 2025
@rmccorm4
Copy link
Contributor

Hi @chuangz0 can you add context to the PR description for this feature? For example, why was zmq added? Is it for performance reasons? Was there an observed % speedup?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants