Conversation

@lancelly
Collaborator

@lancelly lancelly commented Aug 22, 2025

Since we can’t reproduce the issue on the local cluster but it appears in the post-merge CI, we’ll add some logging first. When the OOM recurs in post-merge CI, we’ll have more information; once the problem is located, we’ll revert these logs.

Summary by CodeRabbit

  • New Features

    • Adds comprehensive GPU memory profiling and human-readable logging across CUDA graph capture, model load/compile, warmup, and execution; reports totals, free, inside-PyTorch vs outside-PyTorch, and per-stage deltas with contextual metadata (batch size, execution path, layer).
    • Instruments memory logging inside model internals and executor worker initialization for finer-grained diagnostics; new memory-tracking helpers are exposed.
  • Tests

    • Removed a test-waive entry from integration test lists (adjusts skip configuration).

@lancelly lancelly requested a review from a team as a code owner August 22, 2025 07:55
@coderabbitai
Contributor

coderabbitai bot commented Aug 22, 2025

📝 Walkthrough

Adds GPU memory instrumentation and logging across CUDA-graph capture, PyTorch execution paths, decoder MoE/MLP branches, and executor worker init. Introduces MemoryStats NamedTuple and helpers (get_memory_stats, log_memory_stats, log_memory_comparison); logs per-stage memory snapshots and deltas. Removes one test waive entry.

Changes

Cohort / File(s) and Summary of Changes

  • CUDA graph runner instrumentation (tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py): Added MemoryStats and helpers (get_memory_stats, log_memory_stats, log_memory_comparison); imported device_memory_info and logger; records and logs memory at capture start, before/after warmup, before/after capture, and at the end using device and PyTorch queries; expanded the capture inputs payload; preserved existing capture flags/weakrefs.
  • Model engine memory profiling (tensorrt_llm/_torch/pyexecutor/model_engine.py): Added MemoryStats and helpers; instrumented memory sampling/logging around model load, torch.compile (before/after), disable-optimization flows, CUDA-graph warmup/capture, and the forward execution paths; logs contextual inside/outside/delta metrics (batch_size, draft_len, execution_path).
  • Decoder layer MoE/MLP instrumentation (tensorrt_llm/_torch/models/modeling_deepseekv3.py): Inserted get_memory_stats and log_memory_stats calls before and after the MoE/MLP computations in the decoder layer forward; per-layer labeled memory snapshots.
  • Executor init instrumentation (tensorrt_llm/executor/executor.py, tensorrt_llm/executor/worker.py): Local imports of the memory helpers and single-point logging of memory before GenerationExecutor/worker initialization (no control-flow changes).
  • Test waives update (tests/integration/test_lists/waives.txt): Removed one waive entry: accuracy/test_llm_api_pytorch.py::TestDeepSeekR1::test_nvfp4_multi_gpus[throughput_tp8] SKIP (https://nvbugs/5455140).
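
As a quick orientation before the diagrams, the intended usage pattern of the new helpers is sketched below. The helper names and import path come from this PR; the wrapper function capture_with_memory_logs and its label strings are illustrative stand-ins, not code from the diff.

# Illustrative sketch only: how the new helpers bracket a stage with per-stage deltas.
from tensorrt_llm._torch.pyexecutor.model_engine import (
    get_memory_stats, log_memory_comparison, log_memory_stats)


def capture_with_memory_logs(capture_fn, batch_size):
    """Run a capture-like stage and log memory snapshots and deltas around it."""
    stats_start = get_memory_stats()
    log_memory_stats(f"CUDA graph capture START (batch_size={batch_size})",
                     stats_start)

    result = capture_fn()  # warmup runs and torch.cuda.graph capture happen inside

    stats_end = get_memory_stats()
    log_memory_comparison(
        f"CUDA graph capture COMPLETE (batch_size={batch_size})",
        stats_start, stats_end)
    return result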

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Caller
  participant ME as ModelEngine
  participant CGR as CudaGraphRunner
  participant T as Torch
  participant G as GPU

  Caller->>ME: forward(inputs, ...)
  ME->>G: device_memory_info()
  ME->>T: torch.cuda.mem_get_info()
  ME->>T: torch.cuda.memory_stats()
  ME->>ME: get_memory_stats()  -- record BEFORE

  alt CUDA-graph capture path
    ME->>CGR: capture(forward_fn, inputs...)
    CGR->>G: device_memory_info() / torch.cuda.mem_get_info()  -- START
    CGR->>T: torch.cuda.memory_stats()  -- START
    CGR->>CGR: log_memory_stats("START")
    CGR->>T: warmup run (x2)
    CGR->>G: device_memory_info() / torch.cuda.mem_get_info()  -- AFTER WARMUP
    CGR->>T: torch.cuda.memory_stats()  -- AFTER WARMUP
    CGR->>CGR: log_memory_comparison("warmup", before, after)
    CGR->>T: capture graph
    CGR->>G: device_memory_info() / torch.cuda.mem_get_info()  -- AFTER CAPTURE
    CGR->>T: torch.cuda.memory_stats()  -- AFTER CAPTURE
    CGR->>CGR: log_memory_comparison("capture", before, after)
    CGR->>CGR: set weakref(output)
  else Regular execution
    ME->>T: run forward normally
  end

  ME->>G: device_memory_info() / torch.cuda.mem_get_info()  -- END
  ME->>T: torch.cuda.memory_stats()  -- END
  ME->>ME: log_memory_comparison("COMPLETE", before, after)
  ME-->>Caller: outputs
sequenceDiagram
  autonumber
  participant GC as PythonGC
  participant CGR as CudaGraphRunner
  participant T as Torch
  participant G as GPU

  GC->>CGR: __del__()
  alt graph exists
    CGR->>G: device_memory_info()  -- before reset
    CGR->>T: torch.cuda.mem_get_info()  -- before reset
    CGR->>CGR: try graph.reset()
    CGR->>G: device_memory_info()  -- after reset
    CGR->>T: torch.cuda.mem_get_info()  -- after reset
    CGR->>CGR: log_memory_comparison("graph reset", before, after)
  else no graph
    CGR->>CGR: safe cleanup path
  end
  CGR-->>GC: return

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • Superjomn
  • venkywonka
  • yuxianq
  • dongxuy04
  • chzblych


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (6)
tensorrt_llm/_torch/pyexecutor/model_engine.py (5)

32-33: Remove duplicate import of device_memory_info.

device_memory_info is imported twice; keep a single import to avoid re-import warnings and maintain clarity.

Apply this diff (remove the second import at Line 65):

-from tensorrt_llm.profiler import device_memory_info

Also applies to: 64-65


755-782: Tighten warmup memory logs: fix unused variable and line-length; keep semantics unchanged.

  • Replace unused mem_total with _ to satisfy linters.
  • Shorten overly long log message fragments to pass E501 without losing signal.

Apply these diffs:

-                        mem_used_before, mem_free_before, mem_total = device_memory_info(
+                        mem_used_before, mem_free_before, _ = device_memory_info(
                         )
-                        logger.info(
-                            f"Memory stats BEFORE graph capture forward (batch_size={bs}, draft_len={draft_len}): "
-                            f"memory used inside torch: {mem_used_inside_torch_before / 1024**3:.2f} GB, "
-                            f"memory used outside torch (method1 - device_memory_info): {mem_used_outside_torch_method1_before / 1024**3:.2f} GB, "
-                            f"memory used outside torch (method2 - mem_get_info): {mem_used_outside_torch_method2_before / 1024**3:.2f} GB, "
-                            f"total memory used: {mem_used_before / 1024**3:.2f} GB, "
-                            f"free memory: {mem_free_before / 1024**3:.2f} GB")
+                        logger.info(
+                            f"Memory BEFORE graph-capture forward (bs={bs}, draft_len={draft_len}): "
+                            f"torch_alloc: {mem_used_inside_torch_before / 1024**3:.2f} GB, "
+                            f"outside_m1: {mem_used_outside_torch_method1_before / 1024**3:.2f} GB, "
+                            f"outside_m2: {mem_used_outside_torch_method2_before / 1024**3:.2f} GB, "
+                            f"total: {mem_used_before / 1024**3:.2f} GB, free: {mem_free_before / 1024**3:.2f} GB"
+                        )
-                        mem_used_after, mem_free_after, mem_total = device_memory_info(
+                        mem_used_after, mem_free_after, _ = device_memory_info(
                         )
-                        logger.info(
-                            f"Memory stats AFTER graph capture forward (batch_size={bs}, draft_len={draft_len}): "
-                            f"memory used inside torch: {mem_used_inside_torch_after / 1024**3:.2f} GB, "
-                            f"memory used outside torch (method1 - device_memory_info): {mem_used_outside_torch_method1_after / 1024**3:.2f} GB, "
-                            f"memory used outside torch (method2 - mem_get_info): {mem_used_outside_torch_method2_after / 1024**3:.2f} GB, "
-                            f"total memory used: {mem_used_after / 1024**3:.2f} GB, "
-                            f"free memory: {mem_free_after / 1024**3:.2f} GB")
+                        logger.info(
+                            f"Memory AFTER graph-capture forward (bs={bs}, draft_len={draft_len}): "
+                            f"torch_alloc: {mem_used_inside_torch_after / 1024**3:.2f} GB, "
+                            f"outside_m1: {mem_used_outside_torch_method1_after / 1024**3:.2f} GB, "
+                            f"outside_m2: {mem_used_outside_torch_method2_after / 1024**3:.2f} GB, "
+                            f"total: {mem_used_after / 1024**3:.2f} GB, free: {mem_free_after / 1024**3:.2f} GB"
+                        )
-                        logger.info(
-                            f"Memory CHANGES during forward (batch_size={bs}, draft_len={draft_len}): "
-                            f"inside torch change: {inside_torch_change / 1024**3:.2f} GB, "
-                            f"outside torch change (method1): {outside_torch_change_method1 / 1024**3:.2f} GB, "
-                            f"outside torch change (method2): {outside_torch_change_method2 / 1024**3:.2f} GB, "
-                            f"total memory change: {mem_change / 1024**3:.2f} GB"
-                        )
+                        logger.info(
+                            f"Memory CHANGES during forward (bs={bs}, draft_len={draft_len}): "
+                            f"torch_alloc: {inside_torch_change / 1024**3:.2f} GB, "
+                            f"outside_m1: {outside_torch_change_method1 / 1024**3:.2f} GB, "
+                            f"outside_m2: {outside_torch_change_method2 / 1024**3:.2f} GB, "
+                            f"total: {mem_change / 1024**3:.2f} GB"
+                        )

Also applies to: 789-828


2178-2195: Minor: replace unused mem_total with underscore in forward pre-sample.

Keeps the linter quiet; no behavior change.

-        mem_used_before, mem_free_before, mem_total = device_memory_info()
+        mem_used_before, mem_free_before, _ = device_memory_info()

2231-2271: Synchronize before sampling post-forward memory in no-cache path.

Without a synchronize, allocator/free ops may still be in flight and the numbers can be misleading. Sync only around the profiling to avoid perturbing normal execution.

Apply this diff:

             with MoeLoadBalancerIterContext(moe_load_balancer):
                 outputs = self._forward_step(inputs, gather_ids,
                                              gather_context_logits)

-            # Memory statistics after forward (no cache path)
+            # Ensure kernels finished before sampling memory for accurate numbers
+            torch.cuda.synchronize()
+            # Memory statistics after forward (no cache path)
-            mem_used_after, mem_free_after, mem_total = device_memory_info()
+            mem_used_after, mem_free_after, _ = device_memory_info()

Verification idea:

  • Re-run the CI job that intermittently OOMs and compare the “CHANGES” totals before/after this sync; you should observe more stable deltas across runs at identical batch sizes.

2320-2359: Synchronize before sampling post-forward memory (regular/cuda_graph path).

Same rationale as the no-cache case: make the “AFTER” and “CHANGES” numbers deterministic across runs.

Apply this diff:

             self._execute_logit_post_processors(scheduled_requests, outputs)

-            # Memory statistics after forward
+            # Ensure kernels finished before sampling memory for accurate numbers
+            torch.cuda.synchronize()
+            # Memory statistics after forward
-            mem_used_after, mem_free_after, mem_total = device_memory_info()
+            mem_used_after, mem_free_after, _ = device_memory_info()
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)

312-315: Guard mrope_position_deltas copy when self.mrope_position_deltas is None.

If use_mrope=False, self.mrope_position_deltas is None; copying unconditionally will raise. Guard on both the key and the allocated tensor.

Apply this diff:

-        if "mrope_position_deltas" in inputs:
-            self.mrope_position_deltas[:self.batch_size].copy_(
-                inputs["mrope_position_deltas"])
+        if "mrope_position_deltas" in inputs and self.mrope_position_deltas is not None:
+            self.mrope_position_deltas[:self.batch_size].copy_(inputs["mrope_position_deltas"])

If you have a CI repro with multimodal inputs but non-mrope rope scaling, please sanity-check this path.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 253af9f and bff23e9.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (4 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (6 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/model_engine.py

777-777: Line too long (147 > 120)

(E501)


778-778: Line too long (141 > 120)

(E501)


809-809: Line too long (131 > 120)

(E501)


810-810: Line too long (131 > 120)

(E501)


815-815: Line too long (146 > 120)

(E501)


816-816: Line too long (140 > 120)

(E501)

🔇 Additional comments (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

2298-2310: LGTM: capture function wiring is clean and side-effect free.

Passing capture_forward_fn into maybe_graph.capture and reusing the returned pool aligns with the CUDA graph lifecycle and keeps eager vs. captured paths symmetric.

tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)

130-153: LGTM: capture-start memory sampling and dual accounting look correct.

Using both device_memory_info and mem_get_info provides useful cross-checks for outside-PyTorch usage. Good placement before any warmup/capture side effects.

Signed-off-by: Lanyu Liao <[email protected]>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

1-1: Add NVIDIA copyright header (2025) per coding guidelines.

The repository guidelines require a copyright header on all Python/C++ sources. Please prepend the standard NVIDIA header at the top of this file.

Example:

+ # Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)

1-1: Add NVIDIA copyright header (2025) per coding guidelines.

Please prepend the standard NVIDIA header at the top of this file.

Example:

+ # Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)

78-80: Fix unsafe __del__: calling reset() when _graph is None can raise; wrap in a safe guard.

This was flagged previously as well; we should avoid exceptions during teardown and tolerate None.

Apply:

-    def __del__(self):
-        self._graph.reset()
+    def __del__(self):
+        graph = getattr(self, "_graph", None)
+        if graph is not None:
+            try:
+                graph.reset()
+            except Exception as e:
+                # Avoid raising from __del__
+                logger.warning(f"Error during CUDA graph reset in __del__: {e}")
🧹 Nitpick comments (11)
tensorrt_llm/_torch/pyexecutor/model_engine.py (5)

754-781: Warmup: minor cleanups and line-length fix for the log.

  • mem_total is unused; replace with “_” to avoid unused var warnings.
  • Several logger.info lines exceed the 120-char limit flagged by Ruff; split the long f-strings for E501 compliance.

Apply:

-                        mem_used_before, mem_free_before, mem_total = device_memory_info(
+                        mem_used_before, mem_free_before, _ = device_memory_info(
                         )

And rewrap the log for line-length compliance:

-                        logger.info(
-                            f"Memory stats BEFORE graph capture forward (batch_size={bs}, draft_len={draft_len}): "
-                            f"memory used inside torch: {mem_used_inside_torch_before / 1024**3:.2f} GB, "
-                            f"memory used outside torch (method1 - device_memory_info): {mem_used_outside_torch_method1_before / 1024**3:.2f} GB, "
-                            f"memory used outside torch (method2 - mem_get_info): {mem_used_outside_torch_method2_before / 1024**3:.2f} GB, "
-                            f"total memory used: {mem_used_before / 1024**3:.2f} GB, "
-                            f"free memory: {mem_free_before / 1024**3:.2f} GB")
+                        logger.info(
+                            (
+                                f"Memory stats BEFORE graph capture forward "
+                                f"(batch_size={bs}, draft_len={draft_len}): "
+                                f"inside torch: {mem_used_inside_torch_before / 1024**3:.2f} GB, "
+                                f"outside torch (method1): {mem_used_outside_torch_method1_before / 1024**3:.2f} GB, "
+                                f"outside torch (method2): {mem_used_outside_torch_method2_before / 1024**3:.2f} GB, "
+                                f"total: {mem_used_before / 1024**3:.2f} GB, "
+                                f"free: {mem_free_before / 1024**3:.2f} GB"
+                            )
+                        )

788-827: Warmup-after: minor cleanups and line-length fix for the logs.

  • mem_total is unused; replace with “_”.
  • Split the two logger.info calls to satisfy E501.

Apply:

-                        mem_used_after, mem_free_after, mem_total = device_memory_info(
+                        mem_used_after, mem_free_after, _ = device_memory_info(
                         )

And rewrap both logger.info calls similarly to the previous hunk to keep each physical line under 120 characters.


2177-2194: Before-forward memory snapshot: tiny tidy-up and optional gating.

  • mem_total is unused; replace with “_”.
  • Optional: gate this pre-forward snapshot under an env flag (e.g., TLLM_ENABLE_MEM_LOGS=1) to keep hot-path overhead minimal when not debugging CI OOMs.

Apply:

-        mem_used_before, mem_free_before, mem_total = device_memory_info()
+        mem_used_before, mem_free_before, _ = device_memory_info()

If you want gating:

if os.environ.get("TLLM_ENABLE_MEM_LOGS") == "1":
    # collect and log memory stats

2232-2269: Synchronize before post-forward memory read to avoid skewed stats.

Kernels may still be in-flight; without a sync, the “after” snapshot can under-report usage or make deltas noisy. Synchronize once before measuring to improve accuracy for the CI diagnostics.

Apply:

-            # Memory statistics after forward (no cache path)
-            mem_used_after, mem_free_after, mem_total = device_memory_info()
+            # Memory statistics after forward (no cache path)
+            torch.cuda.synchronize()
+            mem_used_after, mem_free_after, _ = device_memory_info()

Also consider rewrapping the logger.info into multiple shorter segments to satisfy E501 (same style as earlier suggestion).


2319-2357: Graph/regular path: add one synchronize before “after” stats and fix E501.

  • Add torch.cuda.synchronize() before reading the “after” memory to keep the deltas consistent.
  • mem_total is unused; replace with “_”.
  • Split the logger.info string for E501.

Apply:

-            # Memory statistics after forward
-            mem_used_after, mem_free_after, mem_total = device_memory_info()
+            # Memory statistics after forward
+            torch.cuda.synchronize()
+            mem_used_after, mem_free_after, _ = device_memory_info()

And rewrap the logger.info similar to the warmup hunk to keep physical lines < 120 chars.

tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (6)

86-110: Capture start: small cleanups and E501 compliance.

  • mem_total is unused; replace with “_”.
  • Split the logger.info string to satisfy Ruff E501.

Apply:

-        mem_used_start, mem_free_start, mem_total = device_memory_info()
+        mem_used_start, mem_free_start, _ = device_memory_info()

Rewrap the logger.info similar to:

-        logger.info(
-            f"CUDA graph capture START (batch_size={self.batch_size}): "
-            f"inside torch: {mem_used_inside_torch_start / 1024**3:.2f} GB, "
-            f"outside torch (method1): {mem_used_outside_torch_method1_start / 1024**3:.2f} GB, "
-            f"outside torch (method2): {mem_used_outside_torch_method2_start / 1024**3:.2f} GB, "
-            f"total: {mem_used_start / 1024**3:.2f} GB, free: {mem_free_start / 1024**3:.2f} GB"
-        )
+        logger.info(
+            (
+                f"CUDA graph capture START (batch_size={self.batch_size}): "
+                f"inside torch: {mem_used_inside_torch_start / 1024**3:.2f} GB, "
+                f"outside torch (method1): {mem_used_outside_torch_method1_start / 1024**3:.2f} GB, "
+                f"outside torch (method2): {mem_used_outside_torch_method2_start / 1024**3:.2f} GB, "
+                f"total: {mem_used_start / 1024**3:.2f} GB, free: {mem_free_start / 1024**3:.2f} GB"
+            )
+        )

121-125: Before warmup snapshot: no change needed.

The third return value of device_memory_info() is already discarded here (mem_used_before_warmup, _, _ = device_memory_info()), so this snapshot is fine as-is if the linter allows it.

135-164: After warmup: add synchronize before reading, and rewrap logs.

  • Insert torch.cuda.synchronize() prior to reading memory after the two warmup iterations to ensure kernels are complete.
  • Consider rewrapping the logger.info to satisfy E501.

Apply:

-        # Memory statistics after warmup
-        mem_used_after_warmup, mem_free_after_warmup, _ = device_memory_info()
+        # Memory statistics after warmup
+        torch.cuda.synchronize()
+        mem_used_after_warmup, mem_free_after_warmup, _ = device_memory_info()

165-169: Before capture: no change needed.

The unused total is already ignored with “_”; no functional issues.


174-202: After capture: add synchronize and rewrap log.

  • Synchronize once after exiting the torch.cuda.graph context to ensure all allocations are reflected in the snapshot.
  • Split the logger.info for E501.

Apply:

-        # Memory statistics after graph capture
-        mem_used_after_capture, mem_free_after_capture, _ = device_memory_info()
+        # Memory statistics after graph capture
+        torch.cuda.synchronize()
+        mem_used_after_capture, mem_free_after_capture, _ = device_memory_info()

208-241: Capture end: add synchronize, replace unused mem_total, and rewrap log.

  • Synchronize before the “end” snapshot to keep totals consistent.
  • Replace mem_total with “_”.
  • Split the logger.info to satisfy E501.

Apply:

-        # Memory statistics at capture end
-        mem_used_end, mem_free_end, mem_total = device_memory_info()
+        # Memory statistics at capture end
+        torch.cuda.synchronize()
+        mem_used_end, mem_free_end, _ = device_memory_info()
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between bff23e9 and 348ec5a.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (4 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/model_engine.py

777-777: Line too long (147 > 120)

(E501)


778-778: Line too long (141 > 120)

(E501)


809-809: Line too long (131 > 120)

(E501)


810-810: Line too long (131 > 120)

(E501)


815-815: Line too long (146 > 120)

(E501)


816-816: Line too long (140 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)

32-32: Import of device_memory_info looks good.

Importing device_memory_info centralizes device memory queries and aligns with the added logging. No issues spotted.


2229-2231: No-cache forward call path looks fine.

The refactor to pass gather_context_logits through is correct and preserves behavior.

tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)

6-8: New imports for logging/instrumentation are appropriate.

Bringing in logger and device_memory_info is consistent with the PR objective and mirrors model_engine.py.

Signed-off-by: Lanyu Liao <[email protected]>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)

146-148: Fix unsafe __del__: guard None and avoid raising during teardown.

Calling reset() unconditionally can raise when _graph is None or during interpreter shutdown. This was flagged previously; please restore a safe guard.

-    def __del__(self):
-        self._graph.reset()
+    def __del__(self):
+        graph = getattr(self, "_graph", None)
+        if graph is not None:
+            try:
+                graph.reset()
+            except Exception as e:
+                logger.warning(f"Error resetting CUDA graph in __del__: {e}")
🧹 Nitpick comments (7)
tensorrt_llm/_torch/pyexecutor/model_engine.py (4)

2279-2285: Synchronize before sampling “after forward” memory (no-cache path).

Without a barrier, kernels may still be in-flight, making memory numbers noisy. Add a sync immediately before sampling.

             with MoeLoadBalancerIterContext(moe_load_balancer):
                 outputs = self._forward_step(inputs, gather_ids,
                                              gather_context_logits)
 
-            # Memory statistics after forward (no cache path)
+            # Memory statistics after forward (no cache path)
+            torch.cuda.synchronize()
             stats_after_forward = get_memory_stats()

2335-2342: Synchronize before sampling “after forward” memory (regular/cuda_graph).

Same rationale as the no-cache path; ensures the reported deltas are trustworthy.

             self._execute_logit_post_processors(scheduled_requests, outputs)
 
-            # Memory statistics after forward
+            # Memory statistics after forward
+            torch.cuda.synchronize()
             stats_after_forward = get_memory_stats()
             execution_path = "cuda_graph" if maybe_graph is not None else "regular"
             batch_size = len(scheduled_requests.all_requests())
             log_memory_comparison(
                 f"Forward memory stats [{execution_path}] (batch_size={batch_size})",
                 stats_before_forward, stats_after_forward)

1-1: Add NVIDIA copyright header.

Per coding guidelines, prepend the 2025 NVIDIA copyright header.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

68-75: Centralize MemoryStats and Helpers into a Shared Module

We’ve confirmed that the MemoryStats NamedTuple and its accompanying functions (get_memory_stats, log_memory_stats, log_memory_comparison) are duplicated verbatim in both:

  • tensorrt_llm/_torch/pyexecutor/model_engine.py (lines 68–134)
  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (lines 14–62)

Duplicated logic increases maintenance burden and the risk of drift when enhancements (e.g., device‐awareness or syncing) are applied in only one location. To consolidate:

• Extract MemoryStats, get_memory_stats, log_memory_stats, and log_memory_comparison into a new module, e.g.:
tensorrt_llm/_torch/pyexecutor/memory_utils.py

• In both model_engine.py and cuda_graph_runner.py, replace the local definitions with imports:

-from tensorrt_llm.profiler import device_memory_info
+from tensorrt_llm.profiler import device_memory_info
+from .memory_utils import (
+    MemoryStats,
+    get_memory_stats,
+    log_memory_stats,
+    log_memory_comparison,
+)

• Optionally enhance get_memory_stats signature to accept a device: Optional[int] and sync: bool parameter for explicit multi‐GPU support and stable debug measurements.

Please refactor these files to import the shared utilities and remove the in‐file duplicates.
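
To make the proposal concrete, here is a minimal sketch of what the shared module could contain. The MemoryStats fields and the three helper names are taken from this PR; the device/sync parameters and the exact formulas behind the two "outside torch" estimates are assumptions for illustration and should be aligned with the current in-file implementations.

# tensorrt_llm/_torch/pyexecutor/memory_utils.py (proposed; illustrative sketch)
from typing import NamedTuple, Optional

import torch

from tensorrt_llm.logger import logger
from tensorrt_llm.profiler import device_memory_info


class MemoryStats(NamedTuple):
    """GPU memory usage statistics, in bytes."""
    inside_torch: int           # allocated by the PyTorch caching allocator
    outside_torch_method1: int  # device used minus torch-allocated (device_memory_info based)
    outside_torch_method2: int  # device used minus torch-reserved (mem_get_info based)
    total_used: int
    total_free: int


def get_memory_stats(device: Optional[int] = None,
                     sync: bool = False) -> MemoryStats:
    """Sample current GPU memory usage; optionally synchronize first for stable readings."""
    if sync:
        torch.cuda.synchronize(device)
    mem_used, mem_free, _ = device_memory_info()
    inside_torch = torch.cuda.memory_allocated(device)
    free_dev, total_dev = torch.cuda.mem_get_info(device)
    # The two "outside torch" estimates are assumptions about the method1/method2
    # split described in this PR; align them with the current in-file math.
    outside_method1 = mem_used - inside_torch
    outside_method2 = (total_dev - free_dev) - torch.cuda.memory_reserved(device)
    return MemoryStats(inside_torch, outside_method1, outside_method2,
                       mem_used, mem_free)


def log_memory_stats(description: str, stats: MemoryStats) -> None:
    gib = 1024**3
    logger.info(f"{description}: "
                f"inside torch: {stats.inside_torch / gib:.2f} GB, "
                f"outside torch (method1): {stats.outside_torch_method1 / gib:.2f} GB, "
                f"outside torch (method2): {stats.outside_torch_method2 / gib:.2f} GB, "
                f"total: {stats.total_used / gib:.2f} GB, "
                f"free: {stats.total_free / gib:.2f} GB")


def log_memory_comparison(description: str, before: MemoryStats,
                          after: MemoryStats) -> None:
    gib = 1024**3
    logger.info(f"{description}: "
                f"inside torch change: {(after.inside_torch - before.inside_torch) / gib:.2f} GB, "
                f"outside torch change (method1): "
                f"{(after.outside_torch_method1 - before.outside_torch_method1) / gib:.2f} GB, "
                f"total change: {(after.total_used - before.total_used) / gib:.2f} GB")

Both call sites would then import these symbols as shown in the diff above and keep their existing log call sites unchanged.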

tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (3)

182-186: Synchronize before sampling “after warmup” memory.

Warmup launches kernels; synchronize to ensure stable “after” readings.

         for _ in range(2):
             forward_fn(inputs)
 
-        # Memory statistics after warmup
+        # Memory statistics after warmup
+        torch.cuda.synchronize()
         stats_after_warmup = get_memory_stats()
         log_memory_comparison("CUDA graph warmup completed",
                               stats_before_warmup, stats_after_warmup)

14-21: Deduplicate MemoryStats and logging utilities; import from one place.

Same recommendation as model_engine.py: centralize these helpers to prevent drift and ease future tweaks (e.g., device-aware sampling, extra metrics like reserved_bytes).

@@
-from tensorrt_llm.logger import logger
-from tensorrt_llm.profiler import device_memory_info
+from tensorrt_llm.logger import logger
+from tensorrt_llm.profiler import device_memory_info
+from .memory_utils import (
+    MemoryStats,
+    get_memory_stats,
+    log_memory_stats,
+    log_memory_comparison,
+)
@@
-class MemoryStats(NamedTuple):
-    """Memory usage statistics"""
-    inside_torch: int
-    outside_torch_method1: int
-    outside_torch_method2: int
-    total_used: int
-    total_free: int
-
-
-def get_memory_stats() -> MemoryStats:
-    """Get current GPU memory usage statistics"""
-    # Get memory info from device_memory_info
-    mem_used, mem_free, mem_total = device_memory_info()
-    ...
-    return MemoryStats(...)
-
-
-def log_memory_stats(description: str, stats: MemoryStats):
-    ...
-
-
-def log_memory_comparison(description: str, before: MemoryStats,
-                          after: MemoryStats):
-    ...
+# (Removed local MemoryStats utilities; use shared ones)

Also applies to: 23-47, 49-58, 60-80


1-1: Add NVIDIA copyright header.

Per coding guidelines, prepend the 2025 NVIDIA copyright header.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 348ec5a and 0d83b91.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (4 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (15 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

395-399: LGTM: Instrumentation points are well-chosen and minimally invasive.

Coverage before/after compile, warmups, autotuner, CUDA graph capture, piecewise warmups, model load, and forward paths looks good for chasing OOMs.

Also applies to: 427-431, 754-758, 784-789, 791-795, 815-819, 855-860, 867-873, 875-880, 903-908, 1225-1227, 2239-2241

tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)

155-159: LGTM: Capture lifecycle logging is thorough and targeted.

Start, warmup, pre/post capture, and completion deltas will be useful when the OOM reappears.

Also applies to: 170-172, 187-189, 193-197, 203-208

@lancelly
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16169 [ run ] triggered by Bot

Signed-off-by: Lanyu Liao <[email protected]>
@lancelly
Collaborator Author

/bot kill

@lancelly
Collaborator Author

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-2, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-3"

@tensorrt-cicd
Collaborator

PR_Github #16182 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16169 [ run ] completed with state ABORTED

@tensorrt-cicd
Collaborator

PR_Github #16180 [ kill ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16182 [ run ] completed with state ABORTED

@tensorrt-cicd
Collaborator

PR_Github #16180 [ kill ] completed with state SUCCESS
Successfully killed previous jobs for commit 9d554e0

@lancelly
Collaborator Author

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-2, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-3"

@lancelly lancelly changed the title [https://nvbugs/5455140][chore] Add some CUDA graph memory logs to help locate an occasional OOM issue. [DO NOT MERGE][TEST] Add some CUDA graph memory logs to help locate an occasional OOM issue. Aug 23, 2025
@tensorrt-cicd
Collaborator

PR_Github #16262 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16262 [ run ] completed with state SUCCESS
/LLM/release-1.0/L0_MergeRequest_PR pipeline #275 (Partly Tested) completed with status: 'SUCCESS'

@lancelly
Collaborator Author

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-2, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-3"

@tensorrt-cicd
Collaborator

PR_Github #16273 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16273 [ run ] completed with state SUCCESS
/LLM/release-1.0/L0_MergeRequest_PR pipeline #279 (Partly Tested) completed with status: 'SUCCESS'

@lancelly lancelly changed the title [DO NOT MERGE][TEST] Add some CUDA graph memory logs to help locate an occasional OOM issue. [https://nvbugs/5455140][chore] Add some CUDA graph memory logs to help locate an occasional OOM issue. Aug 24, 2025
@lancelly
Collaborator Author

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-2, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-3"

@tensorrt-cicd
Collaborator

PR_Github #16288 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16288 [ run ] completed with state SUCCESS
/LLM/release-1.0/L0_MergeRequest_PR pipeline #280 (Partly Tested) completed with status: 'SUCCESS'

@lancelly
Collaborator Author

/bot help

@lancelly
Collaborator Author

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-2, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-3" --disable-reuse-test

@tensorrt-cicd
Collaborator

PR_Github #16359 [ kill ] completed with state SUCCESS
Successfully killed previous jobs for commit b5e586f

@tensorrt-cicd
Collaborator

PR_Github #16362 [ run ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9d554e0 and b5e586f.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
  • get_memory_stats (77-100)
  • log_memory_stats (103-111)
tensorrt_llm/_torch/modules/gated_mlp.py (1)
  • GatedMLP (19-171)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)

761-767: Review: Ensure DeepseekV3DecoderLayer Memory Logging Is Safe Under CUDA Graph Capture

We didn’t find any special guard around the memory-tracking calls in DeepseekV3DecoderLayer.forward, and there are CUDA graph capture paths in our model_engine (e.g. maybe_graph.capture(capture_forward_fn, …)) that wrap the entire decoder forward. Calling the raw CUDA memory APIs during a graph capture can invalidate or corrupt the capture.

Please verify:

  • Location of potential issue
    • File: tensorrt_llm/_torch/models/modeling_deepseekv3.py
    • Lines 761–767 (the get_memory_stats() / log_memory_stats() snippet)

  • Definitions of the logging helpers
    get_memory_stats and log_memory_stats in tensorrt_llm/_torch/pyexecutor/model_engine.py
    – Ensure they no-op or defer when a CUDA graph is active

  • Capture paths in model_engine
    • The DecodingCUDAGraphRunner.capture() call used in model_engine._maybe_get_cuda_graph() / capture_forward_fn
    – Confirm that any submodule forwards (including DeepseekV3DecoderLayer) are either outside the captured region or have their memory logging guarded

Suggested action: wrap the memory-stats calls in forward with a guard such as:

from tensorrt_llm._torch.pyexecutor.cuda_graph_runner import is_graph_capturing

if not is_graph_capturing():
    stats = get_memory_stats()
    log_memory_stats(f"… layer {self.layer_idx}", stats)

This will ensure the layer’s forward remains safe under CUDA graph captures.
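
If cuda_graph_runner does not already expose an is_graph_capturing helper, a minimal sketch could look like the following (helper name taken from the suggestion above; the implementation is an assumption built on the public PyTorch API):

import torch


def is_graph_capturing() -> bool:
    """Best-effort check for an active CUDA graph capture on the current stream."""
    # torch.cuda.is_current_stream_capturing() exists in recent PyTorch releases;
    # fall back to False on older versions so callers simply keep logging.
    check = getattr(torch.cuda, "is_current_stream_capturing", None)
    return bool(check()) if check is not None else False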

Comment on lines +761 to +767
        # Memory tracking before MoE/MLP
        from tensorrt_llm._torch.pyexecutor.model_engine import (
            get_memory_stats, log_memory_stats)
        stats_before_moe_mlp = get_memory_stats()
        log_memory_stats(f"Memory stats BEFORE MoE/MLP layer {self.layer_idx}",
                         stats_before_moe_mlp)

Contributor

🛠️ Refactor suggestion

Guard memory sampling; skip during CUDA Graph capture; log only on rank 0.

Calling device memory APIs per layer on every forward can be expensive and may interfere with CUDA Graph capture. Please gate the sampling:

  • behind an env flag (so CI can enable it without impacting prod),
  • skip when the current stream is being captured,
  • and reduce log noise to rank 0.

Apply this diff:

-        # Memory tracking before MoE/MLP
-        from tensorrt_llm._torch.pyexecutor.model_engine import (
-            get_memory_stats, log_memory_stats)
-        stats_before_moe_mlp = get_memory_stats()
-        log_memory_stats(f"Memory stats BEFORE MoE/MLP layer {self.layer_idx}",
-                         stats_before_moe_mlp)
+        # Memory tracking before MoE/MLP (guarded; CI-only)
+        should_log_mem = os.getenv("TRTLLM_MEMLOG", "0") == "1" and getattr(self.mapping, "rank", 0) == 0
+        # Avoid logging inside CUDA Graph capture to prevent capture failures
+        is_capturing = getattr(torch.cuda, "is_current_stream_capturing", lambda: False)()
+        if should_log_mem and not is_capturing:
+            # Keep import local to avoid potential import-time cycles
+            from tensorrt_llm._torch.pyexecutor import model_engine as _memlog
+            try:
+                stats_before_moe_mlp = _memlog.get_memory_stats()
+                _memlog.log_memory_stats(
+                    f"Memory stats BEFORE MoE/MLP layer {self.layer_idx}",
+                    stats_before_moe_mlp,
+                )
+            except Exception as e:
+                logger.debug(f"Memory logging (BEFORE) skipped: {e}")

Additional nitpick: to align with our Python guidelines (“maintain module namespace in imports”), prefer importing the module (as shown) over from ... import func.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/models/modeling_deepseekv3.py around lines 761 to 767,
the per-layer device memory sampling/logging is unguarded; wrap it so it only
runs when an env var (e.g., TLLM_ENABLE_MEM_SAMPLING) is set, skip sampling when
the current CUDA stream is being captured (use
torch.cuda.is_current_stream_capturing()), and emit logs only on rank 0 (use
torch.distributed.get_rank() == 0 if torch.distributed.is_initialized() else True).
Also change the import to import the module (import
tensorrt_llm._torch.pyexecutor.model_engine as model_engine) and call
model_engine.get_memory_stats()/model_engine.log_memory_stats() so the sampling
is cheap and safe during normal inference and does not interfere with CUDA Graph
capture.

Comment on lines +769 to 781
            result = self.forward_MoE(
                hidden_states=hidden_states,
                attn_metadata=attn_metadata,
                residual=residual,
            )

            # Memory tracking after MoE
            stats_after_moe = get_memory_stats()
            log_memory_stats(f"Memory stats AFTER MoE layer {self.layer_idx}",
                             stats_after_moe)

            return result
        else:
Contributor

🛠️ Refactor suggestion

Post‑MoE logging should be gated the same way.

Mirror the same capture/rank/env gating after the MoE path to avoid hot‑path overhead and log storms.

Apply this diff:

             result = self.forward_MoE(
                 hidden_states=hidden_states,
                 attn_metadata=attn_metadata,
                 residual=residual,
             )
 
-            # Memory tracking after MoE
-            stats_after_moe = get_memory_stats()
-            log_memory_stats(f"Memory stats AFTER MoE layer {self.layer_idx}",
-                             stats_after_moe)
+            # Memory tracking after MoE (guarded)
+            if 'should_log_mem' in locals() and should_log_mem and not is_capturing:
+                from tensorrt_llm._torch.pyexecutor import model_engine as _memlog
+                try:
+                    stats_after_moe = _memlog.get_memory_stats()
+                    _memlog.log_memory_stats(
+                        f"Memory stats AFTER MoE layer {self.layer_idx}",
+                        stats_after_moe,
+                    )
+                except Exception as e:
+                    logger.debug(f"Memory logging (AFTER MoE) skipped: {e}")
 
             return result
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
             result = self.forward_MoE(
                 hidden_states=hidden_states,
                 attn_metadata=attn_metadata,
                 residual=residual,
             )
 
-            # Memory tracking after MoE
-            stats_after_moe = get_memory_stats()
-            log_memory_stats(f"Memory stats AFTER MoE layer {self.layer_idx}",
-                             stats_after_moe)
+            # Memory tracking after MoE (guarded)
+            if 'should_log_mem' in locals() and should_log_mem and not is_capturing:
+                from tensorrt_llm._torch.pyexecutor import model_engine as _memlog
+                try:
+                    stats_after_moe = _memlog.get_memory_stats()
+                    _memlog.log_memory_stats(
+                        f"Memory stats AFTER MoE layer {self.layer_idx}",
+                        stats_after_moe,
+                    )
+                except Exception as e:
+                    logger.debug(f"Memory logging (AFTER MoE) skipped: {e}")
 
             return result
         else:
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/models/modeling_deepseekv3.py around lines 769 to 781,
the post-MoE memory capture and log_memory_stats call are unconditional; mirror
the same capture/rank/environment gating used elsewhere in the non-MoE path to
avoid hot-path overhead and log storms. Wrap the stats_after_moe =
get_memory_stats() and log_memory_stats(...) calls in the same conditional (the
capture flag and rank/env check used for the other branch) so memory is only
captured and logged when that gating is true.

Comment on lines +783 to +794
            result = self.forward_mlp(
                hidden_states=hidden_states,
                residual=residual,
            )

            # Memory tracking after MLP
            stats_after_mlp = get_memory_stats()
            log_memory_stats(f"Memory stats AFTER MLP layer {self.layer_idx}",
                             stats_after_mlp)

            return result

Contributor

🛠️ Refactor suggestion

Post‑MLP logging should be gated the same way.

Same rationale as MoE branch: keep it CI‑only, skip during capture, and limit to rank 0.

Apply this diff:

             result = self.forward_mlp(
                 hidden_states=hidden_states,
                 residual=residual,
             )
 
-            # Memory tracking after MLP
-            stats_after_mlp = get_memory_stats()
-            log_memory_stats(f"Memory stats AFTER MLP layer {self.layer_idx}",
-                             stats_after_mlp)
+            # Memory tracking after MLP (guarded)
+            if 'should_log_mem' in locals() and should_log_mem and not is_capturing:
+                from tensorrt_llm._torch.pyexecutor import model_engine as _memlog
+                try:
+                    stats_after_mlp = _memlog.get_memory_stats()
+                    _memlog.log_memory_stats(
+                        f"Memory stats AFTER MLP layer {self.layer_idx}",
+                        stats_after_mlp,
+                    )
+                except Exception as e:
+                    logger.debug(f"Memory logging (AFTER MLP) skipped: {e}")
 
             return result
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
             result = self.forward_mlp(
                 hidden_states=hidden_states,
                 residual=residual,
             )
 
-            # Memory tracking after MLP
-            stats_after_mlp = get_memory_stats()
-            log_memory_stats(f"Memory stats AFTER MLP layer {self.layer_idx}",
-                             stats_after_mlp)
+            # Memory tracking after MLP (guarded)
+            if 'should_log_mem' in locals() and should_log_mem and not is_capturing:
+                from tensorrt_llm._torch.pyexecutor import model_engine as _memlog
+                try:
+                    stats_after_mlp = _memlog.get_memory_stats()
+                    _memlog.log_memory_stats(
+                        f"Memory stats AFTER MLP layer {self.layer_idx}",
+                        stats_after_mlp,
+                    )
+                except Exception as e:
+                    logger.debug(f"Memory logging (AFTER MLP) skipped: {e}")
 
             return result
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/models/modeling_deepseekv3.py around lines 783 to 794,
the post-MLP memory logging is unconditional; wrap the get_memory_stats and
log_memory_stats calls with the same guard used in the MoE branch (only run in
CI, skip during capture, and only on rank 0). Modify the code so the memory-stat
capture and log are executed only when the CI flag is set, the code is not in
capture mode, and the current process is rank 0, mirroring the exact conditional
used earlier for the MoE branch.
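
For concreteness, here is a minimal sketch of how such a gate could be computed. The names should_log_mem / is_capturing, the helper itself, and the TLLM_CI_MEMLOG variable are illustrative assumptions rather than the exact guard in modeling_deepseekv3.py; global_mpi_rank is assumed available from tensorrt_llm._utils, as in the worker.py suggestion later in this review.

import os
from typing import Tuple

import torch

from tensorrt_llm._utils import global_mpi_rank  # assumed helper, as used in the worker.py suggestion


def _memlog_gate() -> Tuple[bool, bool]:
    """Return (should_log_mem, is_capturing) for per-layer memory logging."""
    # CI-only: enable via an environment variable so normal runs stay quiet.
    ci_enabled = os.environ.get("TLLM_CI_MEMLOG", "0") == "1"
    # Rank 0 only: avoid one log line per rank in multi-GPU jobs.
    is_rank0 = global_mpi_rank() == 0
    # Skip while a CUDA graph is being captured; snapshots taken during
    # capture are both noisy and misleading.
    is_capturing = torch.cuda.is_current_stream_capturing()
    return ci_enabled and is_rank0, is_capturing


# Usage at the top of the decoder layer's forward:
#     should_log_mem, is_capturing = _memlog_gate()
# after which the post-MoE / post-MLP blocks above can test
#     if should_log_mem and not is_capturing: ...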

@lancelly
Copy link
Collaborator Author

/bot run

@lancelly
Copy link
Collaborator Author

/bot kill

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16421 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16362 [ run ] completed with state ABORTED

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16422 [ kill ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16421 [ run ] completed with state ABORTED

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16422 [ kill ] completed with state SUCCESS
Successfully killed previous jobs for commit b5e586f

Signed-off-by: Lanyu Liao <[email protected]>
@lancelly
Copy link
Collaborator Author

/bot kill

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16432 [ kill ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16432 [ kill ] completed with state SUCCESS
Successfully killed previous jobs for commit 2a688ed

@lancelly
Copy link
Collaborator Author

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-2, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-3" --disable-reuse-test --detailed-log

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (7)
tensorrt_llm/executor/executor.py (1)

389-395: Gate main-process memory log and make it robust.

This “pre-GenerationExecution” snapshot runs in the main process (often not the process that will own the GPU). It is useful as a coarse baseline, but it can be noisy or misleading on CPU-only launchers or when device 0 is not the device the worker will later use.

  • Gate behind CUDA availability and optionally an env var to avoid flooding.
  • Make it leader-only to reduce noise in multi-process launchers that reuse the main process.

Apply this diff:

@@
-        from tensorrt_llm._torch.pyexecutor.model_engine import (
-            get_memory_stats, log_memory_stats)
-        stats_before_worker_init = get_memory_stats()
-        log_memory_stats(
-            "Memory stats BEFORE GenerationExecution initialization",
-            stats_before_worker_init)
+        from tensorrt_llm._torch.pyexecutor.model_engine import (
+            get_memory_stats, log_memory_stats)
+        if torch.cuda.is_available() and os.environ.get("TLLM_MEMLOG_EXECUTOR", "1") == "1":
+            try:
+                stats_before_worker_init = get_memory_stats()
+                log_memory_stats("Memory stats BEFORE GenerationExecution initialization",
+                                 stats_before_worker_init)
+            except Exception as _e:
+                logger.debug(f"Skipped executor pre-init memlog: {_e}")
tensorrt_llm/_torch/pyexecutor/model_engine.py (6)

68-75: Make get_memory_stats device-aware and resilient.

  • The current implementation implicitly uses the current device. Exposing an optional device parameter improves attribution and lets callers snapshot specific devices without first mutating global device state.
  • Add a light guard for non-CUDA environments to avoid surprising crashes in CPU-only test stubs.
  • Note that the diff below assumes contextlib (for nullcontext) and typing.Optional are already imported in model_engine.py; add those imports if they are not present.

Apply this diff:

@@
-class MemoryStats(NamedTuple):
+class MemoryStats(NamedTuple):
     """Memory usage statistics"""
     inside_torch: int
     outside_torch_method1: int
     outside_torch_method2: int
     total_used: int
     total_free: int
 
 
-def get_memory_stats() -> MemoryStats:
+def get_memory_stats(device: Optional[int] = None) -> MemoryStats:
     """Get current GPU memory usage statistics"""
-    # Get memory info from device_memory_info
-    mem_used, mem_free, mem_total = device_memory_info()
+    if not torch.cuda.is_available():
+        return MemoryStats(0, 0, 0, 0, 0)
+    # Respect optional device parameter
+    dev_cm = torch.cuda.device(device) if device is not None else contextlib.nullcontext()
+    with dev_cm:
+        # Get memory info from device_memory_info
+        mem_used, mem_free, mem_total = device_memory_info()
 
-    # Get PyTorch allocated memory
-    mem_inside_torch = torch.cuda.memory_stats()["allocated_bytes.all.current"]
+        # Get PyTorch allocated memory
+        mem_inside_torch = torch.cuda.memory_stats()["allocated_bytes.all.current"]
@@
-    # Method 2: using torch.cuda.mem_get_info
-    mem_free_cuda, total_gpu_memory = torch.cuda.mem_get_info()
+        # Method 2: using torch.cuda.mem_get_info
+        mem_free_cuda, total_gpu_memory = torch.cuda.mem_get_info()
         total_used_bytes = total_gpu_memory - mem_free_cuda
         mem_outside_torch_method2 = (
             total_used_bytes -
             mem_inside_torch) if total_used_bytes > mem_inside_torch else 0
 
-    return MemoryStats(inside_torch=mem_inside_torch,
-                       outside_torch_method1=mem_outside_torch_method1,
-                       outside_torch_method2=mem_outside_torch_method2,
-                       total_used=mem_used,
-                       total_free=mem_free)
+    return MemoryStats(
+        inside_torch=mem_inside_torch,
+        outside_torch_method1=mem_outside_torch_method1,
+        outside_torch_method2=mem_outside_torch_method2,
+        total_used=mem_used,
+        total_free=mem_free,
+    )

Also applies to: 77-101


103-111: Include device id in log line for unambiguous attribution.

When reading CI logs from many ranks/devices, including device id in each line saves time correlating.

Apply this diff:

 def log_memory_stats(description: str, stats: MemoryStats):
     """Log memory statistics with a description"""
-    logger.info(
+    dev = torch.cuda.current_device() if torch.cuda.is_available() else -1
+    logger.info(
         f"{description}: "
+        f"device: {dev}, "
         f"inside torch: {stats.inside_torch / 1024**3:.2f} GB, "
         f"outside torch (method1): {stats.outside_torch_method1 / 1024**3:.2f} GB, "
         f"outside torch (method2): {stats.outside_torch_method2 / 1024**3:.2f} GB, "
         f"total: {stats.total_used / 1024**3:.2f} GB, "
         f"free: {stats.total_free / 1024**3:.2f} GB")

114-134: Also log free-memory delta for quicker leak/regression eyeballing.

The comparison omits free-memory change; adding it completes the picture and helps spot fragmentation or allocator behavior changes.

Apply this diff:

 def log_memory_comparison(description: str, before: MemoryStats,
                           after: MemoryStats):
     """Log memory statistics comparison with current values and changes"""
     # Calculate changes
     inside_torch_change = after.inside_torch - before.inside_torch
     outside_torch_change_method1 = after.outside_torch_method1 - before.outside_torch_method1
     outside_torch_change_method2 = after.outside_torch_method2 - before.outside_torch_method2
     total_change = after.total_used - before.total_used
+    free_change = after.total_free - before.total_free
 
-    logger.info(
+    dev = torch.cuda.current_device() if torch.cuda.is_available() else -1
+    logger.info(
         f"{description}: "
-        f"CURRENT - inside torch: {after.inside_torch / 1024**3:.2f} GB, "
+        f"CURRENT - device: {dev}, inside torch: {after.inside_torch / 1024**3:.2f} GB, "
         f"outside torch (method1): {after.outside_torch_method1 / 1024**3:.2f} GB, "
         f"outside torch (method2): {after.outside_torch_method2 / 1024**3:.2f} GB, "
         f"total: {after.total_used / 1024**3:.2f} GB, "
-        f"free: {after.total_free / 1024**3:.2f} GB | "
+        f"free: {after.total_free / 1024**3:.2f} GB | "
         f"CHANGES - inside torch: {inside_torch_change / 1024**3:.2f} GB, "
         f"outside torch (method1): {outside_torch_change_method1 / 1024**3:.2f} GB, "
         f"outside torch (method2): {outside_torch_change_method2 / 1024**3:.2f} GB, "
-        f"total: {total_change / 1024**3:.2f} GB")
+        f"total: {total_change / 1024**3:.2f} GB, "
+        f"free: {free_change / 1024**3:.2f} GB")

397-401: Sync and reduce verbosity for warmup/autotune markers.

  • Consider calling torch.cuda.synchronize() immediately before “BEFORE … warmup/autotuner” snapshots to stabilize readings; warmup tends to enqueue kernels.
  • To keep CI logs readable, gate these info-level logs behind an env var, e.g., TLLM_MEMLOG_VERBOSE=1.

Example minimal guarding pattern (applies to all these blocks):

-# Memory statistics before torch compile warmup
-stats_before_warmup = get_memory_stats()
-log_memory_stats("Memory stats BEFORE torch compile warmup", stats_before_warmup)
+if os.environ.get("TLLM_MEMLOG_VERBOSE", "1") == "1":
+    torch.cuda.synchronize()
+    stats_before_warmup = get_memory_stats()
+    log_memory_stats("Memory stats BEFORE torch compile warmup", stats_before_warmup)

Also applies to: 429-433, 756-760, 786-791, 793-797, 816-821


857-875: Capture path: sync before “before-capture” snapshot for tighter deltas.

Right before measuring “BEFORE graph capture forward”, add a synchronize to flush any prior work; you already sync after the forward.

Apply this diff:

@@
-                        # Memory statistics before CUDA graph capture
-                        stats_before_graph_capture = get_memory_stats()
+                        # Memory statistics before CUDA graph capture
+                        torch.cuda.synchronize()
+                        stats_before_graph_capture = get_memory_stats()

Also applies to: 869-875


876-910: Piecewise CUDA graph warmup logs: consider gating or sampling.

This loop can emit many lines across multiple seq_lens. If the CI OOM is rare, logging at every seq_len may be excessive. Suggest gating behind TLLM_MEMLOG_VERBOSE or sampling every N iterations.
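
If gating is preferred over dropping these logs, a sampled wrapper along the following lines would keep the warmup loop quiet by default. TLLM_MEMLOG_VERBOSE, TLLM_MEMLOG_EVERY_N, and the loop variable piecewise_seq_lens are hypothetical names, not existing TensorRT-LLM options; get_memory_stats / log_memory_stats are the helpers this PR adds to model_engine.py.

import itertools
import os

from tensorrt_llm._torch.pyexecutor.model_engine import (get_memory_stats,
                                                          log_memory_stats)

_MEMLOG_VERBOSE = os.environ.get("TLLM_MEMLOG_VERBOSE", "0") == "1"
_MEMLOG_EVERY_N = int(os.environ.get("TLLM_MEMLOG_EVERY_N", "4"))
_memlog_counter = itertools.count()


def maybe_log_memory(tag: str) -> None:
    """Emit a snapshot for roughly 1/N of the calls when verbose logging is on."""
    if not _MEMLOG_VERBOSE:
        return
    if next(_memlog_counter) % _MEMLOG_EVERY_N != 0:
        return
    log_memory_stats(tag, get_memory_stats())


# In the piecewise warmup loop, one sampled line instead of one per seq_len:
#     for seq_len in piecewise_seq_lens:
#         ...
#         maybe_log_memory(f"Piecewise CUDA graph warmup (seq_len={seq_len})")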

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between b5e586f and 2a688ed.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (16 hunks)
  • tensorrt_llm/executor/executor.py (1 hunks)
  • tensorrt_llm/executor/worker.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
tensorrt_llm/executor/worker.py (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
  • get_memory_stats (77-100)
  • log_memory_stats (103-111)
tensorrt_llm/executor/executor.py (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
  • get_memory_stats (77-100)
  • log_memory_stats (103-111)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)

359-361: LGTM: model-loading memory snapshots are well placed.

Pre/post model loading snapshots give useful bookends around checkpoint hydration and CUDA initialization. No functional concerns.

Also applies to: 1227-1230
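
As a reminder of the pattern, the bookends reduce to a before/after snapshot pair fed to log_memory_comparison; load_model() below is only a placeholder for the engine's actual loading step, not a real function in model_engine.py.

from tensorrt_llm._torch.pyexecutor.model_engine import (get_memory_stats,
                                                          log_memory_comparison)

stats_before_load = get_memory_stats()
model = load_model()  # placeholder for the engine's real model-loading call
stats_after_load = get_memory_stats()
log_memory_comparison("Model loading memory stats", stats_before_load,
                      stats_after_load)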


32-33: Cross-check device_memory_info semantics vs mem_get_info.

device_memory_info() and torch.cuda.mem_get_info() may not agree 1:1 on certain driver/runtime revisions (e.g., accounting for BAR1/page-locked allocations). You’re already exposing both “outside_torch” methods — good. If CI OOMs persist, consider logging both allocated_bytes and reserved_bytes (allocator) to spot fragmentation.

Would you like me to add a minimal “reserved_bytes” field to MemoryStats and thread it through the log lines?
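
For reference, a minimal sketch of what that could look like, kept separate from the PR's MemoryStats so nothing here is mistaken for existing code; the type and function names are illustrative.

from typing import NamedTuple

import torch


class MemoryStatsWithReserved(NamedTuple):
    inside_torch: int      # allocated_bytes.all.current
    reserved_torch: int    # reserved_bytes.all.current (allocator pool, incl. cached blocks)
    outside_torch: int     # memory held beyond the PyTorch allocator pool
    total_used: int
    total_free: int


def get_memory_stats_with_reserved() -> MemoryStatsWithReserved:
    stats = torch.cuda.memory_stats()
    allocated = stats["allocated_bytes.all.current"]
    reserved = stats["reserved_bytes.all.current"]
    free_cuda, total = torch.cuda.mem_get_info()
    used = total - free_cuda
    outside = max(used - reserved, 0)
    return MemoryStatsWithReserved(allocated, reserved, outside, used, free_cuda)


# A large reserved-vs-allocated gap with little "outside" memory points to
# allocator caching/fragmentation rather than a non-PyTorch consumer.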

Comment on lines +2241 to +2243
# Memory statistics before forward
stats_before_forward = get_memory_stats()

Copy link
Contributor

🛠️ Refactor suggestion

Avoid per-forward memstats in the hot path unless explicitly enabled.

These snapshots execute on every forward, which:

  • adds non-trivial overhead (torch.cuda.memory_stats builds a large dict),
  • floods logs in steady-state generation.

Recommend gating behind an env var or sampling (e.g., every N iters).

Apply this diff:

@@
-        # Memory statistics before forward
-        stats_before_forward = get_memory_stats()
+        # Memory statistics before forward (gated)
+        memlog_enabled = os.environ.get("TLLM_MEMLOG_FORWARD", "0") == "1"
+        if memlog_enabled:
+            stats_before_forward = get_memory_stats()
@@
-            # Memory statistics after forward (no cache path)
-            stats_after_forward = get_memory_stats()
-            batch_size = len(scheduled_requests.all_requests())
-            log_memory_comparison(
-                f"Forward memory stats [no_cache] (batch_size={batch_size})",
-                stats_before_forward, stats_after_forward)
+            if memlog_enabled:
+                stats_after_forward = get_memory_stats()
+                batch_size = len(scheduled_requests.all_requests())
+                log_memory_comparison(
+                    f"Forward memory stats [no_cache] (batch_size={batch_size})",
+                    stats_before_forward, stats_after_forward)
@@
-            # Memory statistics after forward
-            stats_after_forward = get_memory_stats()
-            execution_path = "cuda_graph" if maybe_graph is not None else "regular"
-            batch_size = len(scheduled_requests.all_requests())
-            log_memory_comparison(
-                f"Forward memory stats [{execution_path}] (batch_size={batch_size})",
-                stats_before_forward, stats_after_forward)
+            if memlog_enabled:
+                stats_after_forward = get_memory_stats()
+                execution_path = "cuda_graph" if maybe_graph is not None else "regular"
+                batch_size = len(scheduled_requests.all_requests())
+                log_memory_comparison(
+                    f"Forward memory stats [{execution_path}] (batch_size={batch_size})",
+                    stats_before_forward, stats_after_forward)

Also applies to: 2281-2289, 2337-2344

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/model_engine.py around lines 2241-2243 (and likewise at 2281-2289 and 2337-2344), the code currently captures full memory statistics on every forward via get_memory_stats(), which is expensive and noisy. Gate these memstats snapshots behind a configurable switch (an environment variable such as ENABLE_MEM_STATS) or a sampling mechanism (e.g., capture only 1/N iterations using an iteration counter or random sampling) so that memstats are collected only when enabled or sampled, and update the surrounding logic to skip get_memory_stats() and the related logging when the gate is off. Read the env var once (cached) and keep the sampling check lightweight to avoid adding hot-path overhead.
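
A sketch of that cached-flag-plus-sampling pattern is shown below; ENABLE_MEM_STATS, MEM_STATS_SAMPLE_EVERY, and the _forward_counter attribute are illustrative names, not existing options or fields in model_engine.py.

import os

# Read once at import time so the hot path only does cheap int/bool checks.
_MEM_STATS_ENABLED = os.environ.get("ENABLE_MEM_STATS", "0") == "1"
_MEM_STATS_SAMPLE_EVERY = int(os.environ.get("MEM_STATS_SAMPLE_EVERY", "100"))


def should_sample_mem_stats(forward_counter: int) -> bool:
    """True only when logging is enabled and this is an N-th forward call."""
    return _MEM_STATS_ENABLED and forward_counter % _MEM_STATS_SAMPLE_EVERY == 0


# Inside forward():
#     self._forward_counter += 1                      # illustrative instance counter
#     sample = should_sample_mem_stats(self._forward_counter)
#     if sample:
#         stats_before_forward = get_memory_stats()
#     ... run the model ...
#     if sample:
#         log_memory_comparison("Forward memory stats", stats_before_forward,
#                               get_memory_stats())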

Comment on lines +63 to +69
# Memory tracking before worker initialization
from tensorrt_llm._torch.pyexecutor.model_engine import (
    get_memory_stats, log_memory_stats)
stats_before_worker_init = get_memory_stats()
log_memory_stats("Memory stats BEFORE worker initialization",
                 stats_before_worker_init)

Copy link
Contributor

🛠️ Refactor suggestion

Set CUDA device and logger rank before taking memory snapshots.

Right now, the memory snapshot runs before:

  • selecting the device for this worker (torch defaults to device 0), and
  • setting the logger rank prefix.

In multi-GPU/MPI runs, all workers will likely report device 0, producing misleading data right at the most critical “before init” point and making logs hard to attribute.

Recommend:

  • set logger rank early, and
  • set the CUDA device based on global rank before calling get_memory_stats().

Apply this diff:

@@
     ) -> None:
-        # Memory tracking before worker initialization
-        from tensorrt_llm._torch.pyexecutor.model_engine import (
-            get_memory_stats, log_memory_stats)
-        stats_before_worker_init = get_memory_stats()
-        log_memory_stats("Memory stats BEFORE worker initialization",
-                         stats_before_worker_init)
+        # Establish rank prefix and device before any memory logging
+        try:
+            grank = global_mpi_rank()
+            if global_mpi_size() > 1:
+                logger.set_rank(grank)
+            if torch.cuda.is_available():
+                device_id = grank % max(1, torch.cuda.device_count())
+                torch.cuda.set_device(device_id)
+        except Exception as _e:
+            logger.warning(f"Failed to set rank/device before memlog: {_e}")
+
+        # Memory tracking before worker initialization (per-rank, per-device)
+        from tensorrt_llm._torch.pyexecutor.model_engine import (
+            get_memory_stats, log_memory_stats)
+        stats_before_worker_init = get_memory_stats()
+        log_memory_stats("Memory stats BEFORE worker initialization", stats_before_worker_init)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Original:
# Memory tracking before worker initialization
from tensorrt_llm._torch.pyexecutor.model_engine import (
    get_memory_stats, log_memory_stats)
stats_before_worker_init = get_memory_stats()
log_memory_stats("Memory stats BEFORE worker initialization",
                 stats_before_worker_init)

Suggested:
# Establish rank prefix and device before any memory logging
try:
    grank = global_mpi_rank()
    if global_mpi_size() > 1:
        logger.set_rank(grank)
    if torch.cuda.is_available():
        device_id = grank % max(1, torch.cuda.device_count())
        torch.cuda.set_device(device_id)
except Exception as _e:
    logger.warning(f"Failed to set rank/device before memlog: {_e}")

# Memory tracking before worker initialization (per-rank, per-device)
from tensorrt_llm._torch.pyexecutor.model_engine import (
    get_memory_stats, log_memory_stats)
stats_before_worker_init = get_memory_stats()
log_memory_stats("Memory stats BEFORE worker initialization", stats_before_worker_init)
🤖 Prompt for AI Agents
In tensorrt_llm/executor/worker.py around lines 63-69, the memory snapshot is taken before the worker sets its logger rank prefix and selects its CUDA device, producing misleading "before init" stats in multi-GPU/MPI runs. Move the logger rank setup and CUDA device selection ahead of the get_memory_stats() call: set the logger rank prefix from the process/global rank (e.g., from MPI/dist or the existing global_rank variable), call torch.cuda.set_device() with a device index derived from that rank and torch.cuda.device_count() (making sure torch is imported and the rank lookup is available), and only then call get_memory_stats() and log_memory_stats().

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16434 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16434 [ run ] completed with state SUCCESS
/LLM/release-1.0/L0_MergeRequest_PR pipeline #297 (Partly Tested) completed with status: 'FAILURE'

@lancelly
Copy link
Collaborator Author

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-2, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-3" --disable-reuse-test --detailed-log

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16453 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16453 [ run ] completed with state SUCCESS
/LLM/release-1.0/L0_MergeRequest_PR pipeline #299 (Partly Tested) completed with status: 'SUCCESS'

@lancelly
Copy link
Collaborator Author

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-2, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-3" --disable-reuse-test --detailed-log

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16473 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16473 [ run ] completed with state FAILURE
/LLM/release-1.0/L0_MergeRequest_PR pipeline #302 (Partly Tested) completed with status: 'FAILURE'

@lancelly
Copy link
Collaborator Author

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-2, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-3" --disable-reuse-test --detailed-log

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16489 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16489 [ run ] completed with state SUCCESS
/LLM/release-1.0/L0_MergeRequest_PR pipeline #306 (Partly Tested) completed with status: 'SUCCESS'

@lancelly
Copy link
Collaborator Author

Closing this since we have identified the root cause.

@lancelly lancelly closed this Aug 27, 2025