14 changes: 14 additions & 0 deletions cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.cpp
@@ -83,6 +83,8 @@ communicator* UserBufferAllocator::comm()
return mUbComm;
}

#if NCCL_VERSION_CODE >= NCCL_VERSION(2, 27, 0)

🛠️ Refactor suggestion

Missing platform headers for dynamic loading.

dlopen/dlsym/dlclose require <dlfcn.h>, and LoadLibraryA/GetProcAddress require <windows.h>. Add the appropriate conditional include near the top of the file.

Outside this hunk:

#ifdef _WIN32
#include <windows.h>
#else
#include <dlfcn.h>
#endif
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.cpp around line 86, the
code uses dlopen/dlsym/dlclose and Windows LoadLibraryA/GetProcAddress without
including the platform headers; add a conditional include near the top of the
file (with the other includes) that #ifdef _WIN32 include <windows.h> else
include <dlfcn.h>, so the dynamic-loading symbols are declared on each platform
and compilation errors are avoided.
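
For illustration, a minimal sketch (not code from this PR) of the conditional include together with hypothetical loadLibraryHandle/loadSymbol wrappers, showing which header each dynamic-loading call comes from:

// Sketch only; the wrapper names are illustrative, not from the PR.
#ifdef _WIN32
#include <windows.h>
#else
#include <dlfcn.h>
#endif

namespace
{
// Open a shared library and return an opaque handle (nullptr on failure).
void* loadLibraryHandle(char const* name)
{
#ifdef _WIN32
    return reinterpret_cast<void*>(LoadLibraryA(name));
#else
    return dlopen(name, RTLD_NOW | RTLD_GLOBAL);
#endif
}

// Resolve a symbol from a previously opened library.
void* loadSymbol(void* handle, char const* symbol)
{
#ifdef _WIN32
    return reinterpret_cast<void*>(GetProcAddress(static_cast<HMODULE>(handle), symbol));
#else
    return dlsym(handle, symbol);
#endif
}
} // namespace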


void NCCLUserBufferAllocator::initialize(tensorrt_llm::runtime::WorldConfig const& worldConfig)
{
if (!isInitialized())
@@ -244,6 +246,18 @@ bool NCCLHelper::isLoaded() const
return mIsLoaded;
}

#else

void NCCLUserBufferAllocator::initialize(tensorrt_llm::runtime::WorldConfig const& worldConfig)
{
TLLM_CHECK_WITH_INFO(false,
"NCCL symmetric is not supported for nccl version < 2.27. Please upgrade nccl to 2.27 or higher and rebuild "
"tensorrt_llm or disable nccl symmetric");
return;
}

#endif

bool UserBufferAllocator::use_nccl_symmetric = false;

}; // namespace tensorrt_llm::runtime::ub
20 changes: 19 additions & 1 deletion cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.h
@@ -35,13 +35,22 @@ struct UBBuffer
void* addr;
int handle;
size_t size;
#if NCCL_VERSION_CODE >= NCCL_VERSION(2, 27, 0)
ncclWindow_t window;
#endif

UBBuffer(void* a = nullptr, int h = -1, size_t s = 0, ncclWindow_t w = nullptr)
UBBuffer(void* a = nullptr, int h = -1, size_t s = 0
#if NCCL_VERSION_CODE >= NCCL_VERSION(2, 27, 0)
,
ncclWindow_t w = nullptr
#endif
)
: addr(a)
, handle(h)
, size(s)
#if NCCL_VERSION_CODE >= NCCL_VERSION(2, 27, 0)
, window(w)
#endif
{
}

@@ -77,6 +86,8 @@ class UserBufferAllocator
tensorrt_llm::runtime::WorldConfig mWorldConfig;
};

#if NCCL_VERSION_CODE >= NCCL_VERSION(2, 27, 0)

class NCCLHelper
{
public:
@@ -125,6 +136,13 @@ class NCCLUserBufferAllocator : public UserBufferAllocator
std::shared_ptr<ncclComm_t> mComm;

⚠️ Potential issue

shared_ptr<ncclComm_t> is incorrect and dangerous

ncclComm_t is a pointer type; shared_ptr<ncclComm_t> manages a ncclComm_t* (double pointer) and will call delete by default — undefined behavior. Use a raw handle or a smart pointer with a proper deleter.

Apply one of:

Option A (simplest; recommended unless shared ownership is required):

-    std::shared_ptr<ncclComm_t> mComm;
+    ncclComm_t mComm{}; // raw NCCL handle; destroy via ncclCommDestroy in .cpp

Option B (owned handle with RAII deleter):

+    using NcclCommHandle = std::unique_ptr<std::remove_pointer_t<ncclComm_t>, void(*)(ncclComm_t)>;
-    std::shared_ptr<ncclComm_t> mComm;
+    NcclCommHandle mComm{nullptr, nullptr};

Note: Option B requires #include <type_traits> and setting the deleter to ncclCommDestroy at initialization.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
std::shared_ptr<ncclComm_t> mComm;
// Was:
// std::shared_ptr<ncclComm_t> mComm;
//
// Use a plain NCCL handle and destroy it in the .cpp via ncclCommDestroy():
ncclComm_t mComm{}; // raw NCCL handle; destroy via ncclCommDestroy in .cpp
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.h at line 136, mComm is
declared as std::shared_ptr<ncclComm_t> which is incorrect because ncclComm_t is
already a pointer type and shared_ptr would manage a double pointer and call
delete (UB); fix by replacing the shared_ptr with either a raw handle or an RAII
smart pointer with the correct deleter: Option A (recommended): use ncclComm_t
mComm; and manage lifetime explicitly where created/destroyed; Option B: use a
unique_ptr with the nccl destroy function as deleter, e.g.
std::unique_ptr<std::remove_pointer_t<ncclComm_t>, decltype(&ncclCommDestroy)>
mComm{nullptr, &ncclCommDestroy} and include the appropriate headers before use;
ensure creation/assignment matches the chosen ownership model.
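
For reference, a minimal sketch of Option B, assuming nccl.h is already included; decltype(&ncclCommDestroy) is used for the deleter type because ncclCommDestroy returns ncclResult_t rather than void, so it cannot bind to a void(*)(ncclComm_t) deleter directly:

// Sketch only; not code from this PR.
#include <memory>
#include <type_traits>

// Owning handle: unique_ptr over the pointee type of ncclComm_t, destroyed via ncclCommDestroy.
using NcclCommHandle = std::unique_ptr<std::remove_pointer_t<ncclComm_t>, decltype(&ncclCommDestroy)>;

// Member declaration inside NCCLUserBufferAllocator:
//     NcclCommHandle mComm{nullptr, &ncclCommDestroy};
//
// Taking ownership after the communicator is created (illustrative variable names, error handling omitted):
//     ncclComm_t rawComm = nullptr;
//     ncclCommInitRank(&rawComm, worldSize, uniqueId, rank);
//     mComm.reset(rawComm);

This keeps destruction tied to ncclCommDestroy and avoids the double-pointer ownership problem described above.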

static std::unique_ptr<NCCLHelper> mNCCLHelper;
};
#else
class NCCLUserBufferAllocator : public UserBufferAllocator
{
public:
void initialize(tensorrt_llm::runtime::WorldConfig const& world_config) override;
};
#endif

#else
using communicator = void;
5 changes: 2 additions & 3 deletions docker/common/install_tensorrt.sh
@@ -9,9 +9,8 @@ CUDA_VER="12.9" # 12.9.1
# Keep the installation for cuDNN if users want to install PyTorch with source codes.
# PyTorch 2.x can compile with cuDNN v9.
CUDNN_VER="9.10.2.21-1"
# NGC PyTorch 25.06 image uses NCCL 2.27.3, while NCCL 2.27.5 resolves a perf regression issue.
# Use NCCL version 2.27.5 instead.
NCCL_VER="2.27.5-1+cuda12.9"
# Downgrade NCCL version to 2.25.1 temporarily
NCCL_VER="2.25.1-1+cuda12.8"
CUBLAS_VER="12.9.1.4-1"
# Align with the pre-installed CUDA / NVCC / NVRTC versions from
# https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
8 changes: 4 additions & 4 deletions jenkins/current_image_tags.properties
@@ -12,7 +12,7 @@
# NB: Typically, the suffix indicates the PR whose CI pipeline generated the images. In case that
# images are adopted from PostMerge pipelines, the abbreviated commit hash is used instead.
IMAGE_NAME=urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm
LLM_DOCKER_IMAGE=urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:pytorch-25.06-py3-x86_64-ubuntu24.04-trt10.11.0.33-skip-tritondevel-202508201630-pre-test
LLM_SBSA_DOCKER_IMAGE=urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:pytorch-25.06-py3-aarch64-ubuntu24.04-trt10.11.0.33-skip-tritondevel-202508201630-pre-test
LLM_ROCKYLINUX8_PY310_DOCKER_IMAGE=urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:cuda-12.9.1-devel-rocky8-x86_64-rocky8-py310-trt10.11.0.33-skip-tritondevel-202508201630-pre-test
LLM_ROCKYLINUX8_PY312_DOCKER_IMAGE=urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:cuda-12.9.1-devel-rocky8-x86_64-rocky8-py312-trt10.11.0.33-skip-tritondevel-202508201630-pre-test
LLM_DOCKER_IMAGE=urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:pytorch-25.06-py3-x86_64-ubuntu24.04-trt10.11.0.33-skip-tritondevel-202509051530-7556
LLM_SBSA_DOCKER_IMAGE=urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:pytorch-25.06-py3-aarch64-ubuntu24.04-trt10.11.0.33-skip-tritondevel-202509051530-7556
LLM_ROCKYLINUX8_PY310_DOCKER_IMAGE=urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:cuda-12.9.1-devel-rocky8-x86_64-rocky8-py310-trt10.11.0.33-skip-tritondevel-202509051530-7556
LLM_ROCKYLINUX8_PY312_DOCKER_IMAGE=urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:cuda-12.9.1-devel-rocky8-x86_64-rocky8-py312-trt10.11.0.33-skip-tritondevel-202509051530-7556
2 changes: 2 additions & 0 deletions tensorrt_llm/_torch/pyexecutor/model_engine.py
@@ -2334,6 +2334,8 @@ def _init_userbuffers(self, hidden_size):
if not ub.ub_supported():
return False
use_nccl_symmetric = self.pytorch_backend_config.allreduce_strategy == "NCCL_SYMMETRIC"
if use_nccl_symmetric:
return False
ub.initialize_userbuffers_manager(
self.mapping.tp_size, self.mapping.pp_size, self.mapping.cp_size,
self.mapping.rank, self.mapping.gpus_per_node,
3 changes: 3 additions & 0 deletions tests/unittest/_torch/multi_gpu/test_mnnvl_allreduce.py
@@ -184,6 +184,9 @@ def test_row_linear_residual_norm_fusion(seq_len, hidden_size, dtype, strategy,
def test_row_linear_residual_norm_fusion(seq_len, hidden_size, dtype, strategy,
fusion):

if strategy == AllReduceStrategy.NCCL_SYMMETRIC:
pytest.skip("NCCL symmetric is not supported for nccl version < 2.27.")

Comment on lines +187 to +189

⚠️ Potential issue

Make the skip conditional on NCCL version instead of unconditional.

Right now this always skips NCCL_SYMMETRIC. Gate it on runtime NCCL < 2.27.

-    if strategy == AllReduceStrategy.NCCL_SYMMETRIC:
-        pytest.skip("NCCL symmetric is not supported for nccl version < 2.27.")
+    if strategy == AllReduceStrategy.NCCL_SYMMETRIC:
+        try:
+            import torch.cuda.nccl as nccl
+            nccl_ver = nccl.version()  # e.g., 2708 for 2.7.8; returns int like 22700 for 2.27.0 in newer builds
+        except Exception:
+            nccl_ver = 0
+        # Treat values < 22700 as older than 2.27.0
+        if nccl_ver < 22700:
+            pytest.skip("NCCL symmetric is not supported for nccl version < 2.27.")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if strategy == AllReduceStrategy.NCCL_SYMMETRIC:
pytest.skip("NCCL symmetric is not supported for nccl version < 2.27.")
if strategy == AllReduceStrategy.NCCL_SYMMETRIC:
try:
import torch.cuda.nccl as nccl
nccl_ver = nccl.version() # e.g., 2708 for 2.7.8; returns int like 22700 for 2.27.0 in newer builds
except Exception:
nccl_ver = 0
# Treat values < 22700 as older than 2.27.0
if nccl_ver < 22700:
pytest.skip("NCCL symmetric is not supported for nccl version < 2.27.")
🤖 Prompt for AI Agents
In tests/unittest/_torch/multi_gpu/test_mnnvl_allreduce.py around lines 187-189,
the current code always skips when strategy == AllReduceStrategy.NCCL_SYMMETRIC;
change this to only skip when the runtime NCCL version is less than 2.27. Query
the runtime NCCL version (e.g., via torch.cuda.nccl.version() or an equivalent
helper used in the test suite), normalize it to a comparable numeric form (e.g.,
2.27 -> 227 or parse a tuple), and only call pytest.skip(...) when the detected
version is < 2.27; otherwise let the test run. Ensure you import/handle the
torch API used and account for None/unsupported returns by treating them as
“skip” if appropriate.

torch.manual_seed(42)
tensor_parallel_size = 2
