[None][chore] ucx establish connection with zmq #6090
From my side, no more questions on this patch. What I am wondering is how we can organize the ZeroMQ code so that it can be reused in other places. |
5399a0e to
b672ae4
Compare
📝 Walkthrough
ZeroMQ (ZMQ) support is integrated into the UCX cache communication utility. The build system is updated to include the cppzmq submodule and link ZMQ dependencies. The connection management logic is refactored to use ZeroMQ messaging for exchanging worker addresses and managing connections, replacing the previous UCX listener mechanism. Additionally, package installation scripts and CI image tags are updated accordingly. Error reporting for UCX wrapper library loading is also improved.
Sequence Diagram(s)
sequenceDiagram
participant Client
participant ZMQ_REQ_Socket
participant Server
participant ZMQ_REP_Socket
participant UCX
Client->>ZMQ_REQ_Socket: Send GET_WORKER_ADDRESS (with own worker address)
ZMQ_REQ_Socket->>ZMQ_REP_Socket: Connect and send message
ZMQ_REP_Socket->>Server: Notify of incoming request
Server->>UCX: Get server worker address
Server->>ZMQ_REP_Socket: Reply with SERVER_WORKER_ADDRESS (with server worker address)
ZMQ_REP_Socket->>ZMQ_REQ_Socket: Send response
ZMQ_REQ_Socket->>Client: Receive server worker address
Client->>UCX: Create endpoint using server worker address
Server->>UCX: Create endpoint using client worker address
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
|
Actionable comments posted: 3
🧹 Nitpick comments (4)
.gitmodules (1)
26-28: Consider pinning cppzmq to a stable tag/branch for reproducible builds
Every CI run will fetch whatever the default branch of cppzmq happens to be at the time the submodule is updated. To avoid unexpected breakage, add an explicit branch = <tag-or-branch> (similar to cxxopts above) or rely on a specific commit hash in the super-project.
      [submodule "3rdparty/cppzmq"]
          path = 3rdparty/cppzmq
          url = https://github.com/zeromq/cppzmq.git
    +     # e.g. lock to latest release tag
    +     branch = v4.10.0
This keeps builds deterministic and eases future upgrades because the diff will clearly show when you move to a newer cppzmq version.
3rdparty/cppzmq (1)
1-1: Consider declarative dependency management instead of a git sub-module
Using a CMake FetchContent or CPM declaration keeps third-party code out of the repo, simplifies updates, and avoids sub-module headaches. Worth considering if you don’t need to modify cppzmq locally.
cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/CMakeLists.txt (1)
12-12: Fix incorrect comment
The comment refers to "NIXL wrapper target" but should likely refer to ZMQ or remain generic.
    - # Add the NIXL wrapper target
    + # Link ZMQ libraries
cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp (1)
368-369: Consider setting socket options for reliability
Add timeout and linger options to the REQ socket to handle network issues gracefully.
      auto reqSocket = zmq::socket_t(mZmqContext, zmq::socket_type::req);
    + reqSocket.set(zmq::sockopt::rcvtimeo, 30000); // 30 second timeout
    + reqSocket.set(zmq::sockopt::linger, 0); // Don't wait on close
      reqSocket.connect(build_zmq_endpoint(ip, port));
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- .gitmodules (1 hunks)
- 3rdparty/cppzmq (1 hunks)
- cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/CMakeLists.txt (2 hunks)
- cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp (7 hunks)
- cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.h (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.h
🔇 Additional comments (4)
cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.h (1)
59-62: No thread-safety issue detected for the ZMQ REP socket
All mZmqRepSocket operations (bind, recv, send) are confined to the dedicated mZmqRepThread. The main thread never calls into mZmqRepSocket while the thread is running; it only performs join() and then close() after the thread exits. This satisfies ZeroMQ’s single-threaded socket requirement, so no additional synchronization is needed.
cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp (3)
34-74: LGTM! Well-structured message protocol
The UcxCmMessage class provides a clean protocol for ZMQ communication with proper serialization/deserialization support.
132-146: LGTM! Comprehensive endpoint parsing
The function correctly handles both IPv4 and IPv6 ZMQ endpoint formats with appropriate regex patterns.
285-303: LGTM! Clean shutdown mechanism
The destructor correctly uses a separate REQ socket to send the STOP message, avoiding thread-safety issues and ensuring graceful shutdown with acknowledgment.
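The UcxCmMessage source is not shown in this thread, so the following is only a minimal sketch of what such a message protocol could look like. The names ControlMessage, MessageType, serialize and deserialize, as well as the wire layout, are assumptions made for illustration; only the message labels (GET_WORKER_ADDRESS, SERVER_WORKER_ADDRESS, STOP) and the mWorkerAddress member name come from the discussion above.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical message types, named after the labels used in the sequence diagram.
enum class MessageType : std::uint8_t
{
    kGetWorkerAddress = 0,
    kServerWorkerAddress = 1,
    kStop = 2,
};

// Hypothetical stand-in for UcxCmMessage; the real class layout may differ.
struct ControlMessage
{
    MessageType mType;
    std::optional<std::string> mWorkerAddress; // opaque bytes, may contain '\0'
};

// Assumed wire format: [type:1][hasAddress:1][length:4, host byte order][address bytes].
constexpr std::size_t kHeaderSize = 2;

std::string serialize(ControlMessage const& message)
{
    std::string buffer;
    buffer.push_back(static_cast<char>(message.mType));
    buffer.push_back(message.mWorkerAddress.has_value() ? 1 : 0);
    if (message.mWorkerAddress.has_value())
    {
        auto const length = static_cast<std::uint32_t>(message.mWorkerAddress->size());
        char lengthBytes[sizeof(length)];
        std::memcpy(lengthBytes, &length, sizeof(length));
        buffer.append(lengthBytes, sizeof(length));
        buffer.append(*message.mWorkerAddress);
    }
    return buffer;
}

ControlMessage deserialize(std::string const& buffer)
{
    if (buffer.size() < kHeaderSize)
    {
        throw std::invalid_argument("message too short");
    }
    ControlMessage message{static_cast<MessageType>(static_cast<std::uint8_t>(buffer[0])), std::nullopt};
    if (buffer[1] != 0)
    {
        std::uint32_t length = 0;
        if (buffer.size() < kHeaderSize + sizeof(length))
        {
            throw std::invalid_argument("missing address length");
        }
        std::memcpy(&length, buffer.data() + kHeaderSize, sizeof(length));
        if (buffer.size() != kHeaderSize + sizeof(length) + length)
        {
            throw std::invalid_argument("address length mismatch");
        }
        message.mWorkerAddress = buffer.substr(kHeaderSize + sizeof(length), length);
    }
    return message;
}
```

In this sketch the client would send serialize({MessageType::kGetWorkerAddress, ownAddress}) over its REQ socket and the server would reply with a kServerWorkerAddress message. Keeping the address as a length-prefixed byte string matters because UCX worker addresses are binary and cannot be treated as null-terminated text.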
Actionable comments posted: 1
🔭 Outside diff range comments (1)
docker/common/install_base.sh (1)
120-134: Install pkgconf alongside zeromq-devel for pkg-config support
On Rocky Linux (RHEL 8), the pkg-config functionality is provided by the pkgconf package, not “pkgconfig.” Add pkgconf before zeromq-devel so CMake/autotools can find the .pc metadata.
• File: docker/common/install_base.sh
• Lines: 120–134
      dnf install \
          openmpi \
          openmpi-devel \
          pigz \
    -     rdma-core-devel \
    -     zeromq-devel \
    +     rdma-core-devel \
    +     pkgconf \
    +     zeromq-devel \
          -y
📜 Review details
📒 Files selected for processing (1)
- docker/common/install_base.sh (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
ea094ba to
f7375c1
Compare
|
/bot run --stage-list "Build-Docker-Images" |
|
PR_Github #13906 [ run ] triggered by Bot |
|
PR_Github #13906 [ run ] completed with state |
|
/bot run --stage-list "Build-Docker-Images" --disable-fail-fast |
|
PR_Github #13911 [ run ] triggered by Bot |
|
/bot kill |
|
/bot run --add-multi-gpu-test |
87f20ee to
2704423
Compare
|
PR_Github #13947 [ kill ] triggered by Bot |
|
PR_Github #13911 [ run ] completed with state |
|
PR_Github #13947 [ kill ] completed with state |
|
PR_Github #13948 [ run ] triggered by Bot |
|
PR_Github #13948 [ run ] completed with state |
2704423 to
4552450
Compare
|
/bot run --add-multi-gpu-test |
|
PR_Github #13974 [ run ] triggered by Bot |
|
PR_Github #13974 [ run ] completed with state |
4552450 to
b7a3184
Compare
|
/bot run --add-multi-gpu-test |
|
PR_Github #13985 [ run ] triggered by Bot |
|
PR_Github #13985 [ run ] completed with state |
b7a3184 to
1b47732
Compare
|
/bot run --add-multi-gpu-test |
Actionable comments posted: 1
♻️ Duplicate comments (2)
cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp (2)
229-268: Add error handling in ZMQ message processing thread
The ZMQ REP thread should handle exceptions to prevent silent failures. Wrap the thread logic in try-catch to log errors before terminating.
      mZmqRepThread = std::thread(
          [this]()
          {
    +         try
    +         {
              while (true)
              {
                  zmq::message_t message;
                  auto ret = mZmqRepSocket.recv(message);
                  TLLM_CHECK_WITH_INFO(ret, "mZmqRepSocket.recv failed");
                  // ... rest of the loop ...
              }
    +         }
    +         catch (std::exception const& e)
    +         {
    +             TLLM_LOG_ERROR(mpi::MpiComm::world().getRank(),
    +                 "ZMQ REP thread error: %s", e.what());
    +         }
          });
343-344: Improve IPv6 address detection
The current regex pattern is too permissive and may incorrectly identify non-IPv6 addresses as IPv6. Consider using a more robust check.
    - std::regex ipv6_regex(R"([0-9a-fA-F]*:[0-9a-fA-F]*:[0-9a-fA-F]*.*)");
    - if (std::regex_match(ip, ipv6_regex) && ip.find(':') != std::string::npos)
    + // Check if IP contains ':' and doesn't contain '.' (simple heuristic for IPv6)
    + if (ip.find(':') != std::string::npos && ip.find('.') == std::string::npos)
Alternatively, use inet_pton to validate:
    struct sockaddr_in6 sa6;
    if (inet_pton(AF_INET6, ip.c_str(), &sa6.sin6_addr) == 1)
    {
        oss << "tcp://[" << ip << "]:" << port;
    }
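The inet_pton alternative suggested above can be turned into a small self-contained helper. This is a sketch under assumptions, not the code in the PR: the names isIpv6Literal and buildZmqEndpoint are hypothetical (the PR's own helper appears in the review as build_zmq_endpoint), and it relies on the POSIX inet_pton call.

```cpp
#include <arpa/inet.h>  // inet_pton (POSIX)
#include <netinet/in.h> // in6_addr
#include <sys/socket.h> // AF_INET6

#include <sstream>
#include <string>

// Returns true only if `ip` parses as an IPv6 literal according to inet_pton.
bool isIpv6Literal(std::string const& ip)
{
    in6_addr parsed{};
    return inet_pton(AF_INET6, ip.c_str(), &parsed) == 1;
}

// Hypothetical helper: builds a ZeroMQ TCP endpoint, bracketing IPv6 literals.
std::string buildZmqEndpoint(std::string const& ip, int port)
{
    std::ostringstream oss;
    if (isIpv6Literal(ip))
    {
        oss << "tcp://[" << ip << "]:" << port;
    }
    else
    {
        oss << "tcp://" << ip << ":" << port;
    }
    return oss.str();
}
```

Unlike the "contains ':' but no '.'" heuristic, this also brackets IPv4-mapped addresses such as ::ffff:192.168.0.1. One caveat: inet_pton typically rejects scoped addresses with a zone suffix (for example fe80::1%eth0), so those would need separate handling if they are expected.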
🧹 Nitpick comments (1)
cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp (1)
368-385: Add explicit socket closure for proper resource management.
The ZMQ REQ socket should be explicitly closed after use to ensure proper resource cleanup, even though it will be destroyed when going out of scope.
      std::string serverWorkerAddress = serverMessage.mWorkerAddress.value();
      auto serverWorkerAddressPtr = ucxx::createAddressFromString(serverWorkerAddress);
      auto newEp = mWorkersPool.front()->createEndpointFromWorkerAddress(serverWorkerAddressPtr, true);
    + reqSocket.close();
📜 Review details
📒 Files selected for processing (7)
- .gitmodules (1 hunks)
- 3rdparty/cppzmq (1 hunks)
- cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/CMakeLists.txt (2 hunks)
- cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp (7 hunks)
- cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.h (3 hunks)
- docker/common/install_base.sh (2 hunks)
- jenkins/current_image_tags.properties (1 hunks)
✅ Files skipped from review due to trivial changes (2)
- .gitmodules
- jenkins/current_image_tags.properties
🚧 Files skipped from review as they are similar to previous changes (4)
- 3rdparty/cppzmq
- docker/common/install_base.sh
- cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/CMakeLists.txt
- cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.h
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp,cc,cxx}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #defines whenever possible, as the latter are not visible to the compiler.
A variable that is not modified after its initialization should be declared as const.
Except 0 (only used in comparison for checking signness/existence/emptiness) and nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in the compilation of a target must have filenames that are case-insensitive unique.
All types (including class names) are camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal).
Locally visible static variable uses camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;).
Class member variables use camel case prefixed with an 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope and function-...
Files:
cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp
🧠 Learnings (2)
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp
📚 Learning: applies to **/*.{cpp,h,hpp,cc,cxx} : exceptions must not be thrown across library boundaries....
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-04T02:12:17.582Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : Exceptions must not be thrown across library boundaries.
Applied to files:
cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp
🔇 Additional comments (3)
cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp (3)
132-146: LGTM!
The endpoint parsing logic correctly handles both IPv4 and IPv6 formats using appropriate regex patterns. The optional return type properly handles parsing failures.
285-307: LGTM!
The destructor properly implements graceful shutdown by sending a STOP message to the ZMQ REP thread, waiting for acknowledgment, and cleaning up all ZMQ resources. The sequence ensures proper thread termination before resource cleanup.
311-337: LGTM!
The refactored addConnection method correctly adapts to the new ZeroMQ-based workflow by creating UCX endpoints from worker address strings instead of connection requests. The async connection creation pattern is preserved.
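The endpoint parsing described in the first comment above (regex patterns for IPv4 and IPv6, with an optional return for failures) is not reproduced in this thread. A minimal sketch of that approach might look like the following; the function name parseZmqEndpoint, the exact patterns, and the (host, port) return type are assumptions for illustration rather than the PR's actual code.

```cpp
#include <optional>
#include <regex>
#include <string>
#include <utility>

// Hypothetical parser: splits "tcp://host:port" or "tcp://[ipv6]:port" into (host, port).
// Returns std::nullopt when the endpoint matches neither form.
std::optional<std::pair<std::string, int>> parseZmqEndpoint(std::string const& endpoint)
{
    // IPv6 literals are wrapped in brackets; '.' is allowed for IPv4-mapped forms.
    static std::regex const sIpv6Pattern(R"(tcp://\[([0-9a-fA-F:.]+)\]:(\d{1,5}))");
    // Anything without ':' or brackets is treated as an IPv4 address or host name.
    static std::regex const sIpv4Pattern(R"(tcp://([^:\[\]]+):(\d{1,5}))");

    std::smatch match;
    if (std::regex_match(endpoint, match, sIpv6Pattern) || std::regex_match(endpoint, match, sIpv4Pattern))
    {
        return std::make_pair(match[1].str(), std::stoi(match[2].str()));
    }
    return std::nullopt;
}
```

The port is limited to five digits so that std::stoi cannot overflow; a caller that needs strict validation would additionally check that the value is at most 65535.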
|
PR_Github #14005 [ run ] triggered by Bot |
|
PR_Github #14005 [ run ] completed with state |
Actionable comments posted: 0
🧹 Nitpick comments (1)
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (1)
162-163: Good improvement for error diagnostics, but add null check for dlerror().
The addition of dlerror() to provide more detailed diagnostic information is excellent for debugging library loading issues. However, dlerror() can return nullptr in some cases.
Consider this safer approach:
    - TLLM_CHECK_WITH_INFO(
    -     mWrapperLibHandle != nullptr, "UCX wrapper library is not open correctly. error : %s", dlerror());
    + const char* dlError = dlerror();
    + TLLM_CHECK_WITH_INFO(mWrapperLibHandle != nullptr,
    +     "UCX wrapper library is not open correctly. error : %s", dlError ? dlError : "unknown");
📜 Review details
📒 Files selected for processing (1)
- cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp,cc,cxx}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Files:
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Files:
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
🧠 Learnings (1)
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
Signed-off-by: Chuang Zhu <[email protected]>
188ad04 to
91a644c
Compare
|
/bot skip --comment " all ci test has passed" |
|
PR_Github #14094 [ skip ] triggered by Bot |
|
PR_Github #14094 [ skip ] completed with state |
Signed-off-by: Chuang Zhu <[email protected]> Signed-off-by: Lanyu Liao <[email protected]>
Signed-off-by: Chuang Zhu <[email protected]>
|
Hi @chuangz0 can you add context to the PR description for this feature? For example, why was zmq added? Is it for performance reasons? Was there an observed % speedup? |
PR title
ucx establish connection with zmq
Requires:
    apt install -y libzmq3-dev
Previously, we used a UCX listener to establish UCX connections: https://github.com/NVIDIA/TensorRT-LLM/blob/fbee27990917affc73d2050384dd7e33d594a2[…]/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp
The UCX listener binds to a port and requires TCP to be enabled in UCX.
If the user sets UCX_NET_DEVICES so that a specific NIC interface is excluded, or sets UCX_TLS=^tcp, the UCX listener won't work.
If UCX_NET_DEVICES includes the NIC interface, the KV cache transfer may use TCP, which is very slow.
This PR exchanges worker addresses over ZeroMQ instead, so establishing a connection no longer depends on the UCX listener.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message.
See details below for each supported subcommand.
Details
run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]
Launch build/test pipelines. All previously running jobs will be killed.
- --disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- --skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- --stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.
- --gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- --only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- --disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- --add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.
- --post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- --extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.