[https://nvbugs/5455140][chore] Add some CUDA graph memory logs to help locate an occasional OOM issue. #7156
Conversation
📝 Walkthrough
Adds GPU memory instrumentation and logging across CUDA-graph capture, PyTorch execution paths, decoder MoE/MLP branches, and executor worker initialization. Introduces a MemoryStats NamedTuple and helpers (get_memory_stats, log_memory_stats, log_memory_comparison); logs per-stage memory snapshots and deltas. Removes one test waive entry.
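For orientation, here is a minimal sketch of the helpers named above, reconstructed from the review excerpts further down in this thread (field names match the later diffs; the log format is abbreviated and not the exact implementation in the PR):

```python
from typing import NamedTuple

import torch

from tensorrt_llm.logger import logger
from tensorrt_llm.profiler import device_memory_info


class MemoryStats(NamedTuple):
    """One GPU memory snapshot, all values in bytes."""
    inside_torch: int            # allocated by the PyTorch caching allocator
    outside_torch_method1: int   # device_memory_info() total used minus torch-allocated
    outside_torch_method2: int   # torch.cuda.mem_get_info() total used minus torch-allocated
    total_used: int
    total_free: int


def get_memory_stats() -> MemoryStats:
    """Sample the current device from both the driver view and the CUDA runtime view."""
    mem_used, mem_free, _ = device_memory_info()
    inside_torch = torch.cuda.memory_stats()["allocated_bytes.all.current"]
    free_cuda, total_gpu = torch.cuda.mem_get_info()
    return MemoryStats(
        inside_torch=inside_torch,
        outside_torch_method1=max(mem_used - inside_torch, 0),
        outside_torch_method2=max((total_gpu - free_cuda) - inside_torch, 0),
        total_used=mem_used,
        total_free=mem_free,
    )


def log_memory_comparison(description: str, before: MemoryStats,
                          after: MemoryStats) -> None:
    """Log the current snapshot plus per-field deltas against an earlier snapshot."""
    delta = MemoryStats(*(a - b for a, b in zip(after, before)))
    gib = 1024**3
    logger.info(f"{description}: total {after.total_used / gib:.2f} GB "
                f"({delta.total_used / gib:+.2f} GB), inside torch "
                f"{after.inside_torch / gib:.2f} GB ({delta.inside_torch / gib:+.2f} GB)")
```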
Sequence Diagram(s)
sequenceDiagram
autonumber
actor Caller
participant ME as ModelEngine
participant CGR as CudaGraphRunner
participant T as Torch
participant G as GPU
Caller->>ME: forward(inputs, ...)
ME->>G: device_memory_info()
ME->>T: torch.cuda.mem_get_info()
ME->>T: torch.cuda.memory_stats()
ME->>ME: get_memory_stats() -- record BEFORE
alt CUDA-graph capture path
ME->>CGR: capture(forward_fn, inputs...)
CGR->>G: device_memory_info() / torch.cuda.mem_get_info() -- START
CGR->>T: torch.cuda.memory_stats() -- START
CGR->>CGR: log_memory_stats("START")
CGR->>T: warmup run (x2)
CGR->>G: device_memory_info() / torch.cuda.mem_get_info() -- AFTER WARMUP
CGR->>T: torch.cuda.memory_stats() -- AFTER WARMUP
CGR->>CGR: log_memory_comparison("warmup", before, after)
CGR->>T: capture graph
CGR->>G: device_memory_info() / torch.cuda.mem_get_info() -- AFTER CAPTURE
CGR->>T: torch.cuda.memory_stats() -- AFTER CAPTURE
CGR->>CGR: log_memory_comparison("capture", before, after)
CGR->>CGR: set weakref(output)
else Regular execution
ME->>T: run forward normally
end
ME->>G: device_memory_info() / torch.cuda.mem_get_info() -- END
ME->>T: torch.cuda.memory_stats() -- END
ME->>ME: log_memory_comparison("COMPLETE", before, after)
ME-->>Caller: outputs
sequenceDiagram
autonumber
participant GC as PythonGC
participant CGR as CudaGraphRunner
participant T as Torch
participant G as GPU
GC->>CGR: __del__()
alt graph exists
CGR->>G: device_memory_info() -- before reset
CGR->>T: torch.cuda.mem_get_info() -- before reset
CGR->>CGR: try graph.reset()
CGR->>G: device_memory_info() -- after reset
CGR->>T: torch.cuda.mem_get_info() -- after reset
CGR->>CGR: log_memory_comparison("graph reset", before, after)
else no graph
CGR->>CGR: safe cleanup path
end
CGR-->>GC: return
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 1
🧹 Nitpick comments (6)
tensorrt_llm/_torch/pyexecutor/model_engine.py (5)
32-33: Remove duplicate import of device_memory_info.
device_memory_info is imported twice; keep a single import to avoid re-import warnings and maintain clarity.
Apply this diff (remove the second import at Line 65):
-from tensorrt_llm.profiler import device_memory_info
Also applies to: 64-65
755-782: Tighten warmup memory logs: fix unused variable and line-length; keep semantics unchanged.
- Replace unused mem_total with _ to satisfy linters.
- Shorten overly long log message fragments to pass E501 without losing signal.
Apply these diffs:
-        mem_used_before, mem_free_before, mem_total = device_memory_info(
+        mem_used_before, mem_free_before, _ = device_memory_info(
         )
-        logger.info(
-            f"Memory stats BEFORE graph capture forward (batch_size={bs}, draft_len={draft_len}): "
-            f"memory used inside torch: {mem_used_inside_torch_before / 1024**3:.2f} GB, "
-            f"memory used outside torch (method1 - device_memory_info): {mem_used_outside_torch_method1_before / 1024**3:.2f} GB, "
-            f"memory used outside torch (method2 - mem_get_info): {mem_used_outside_torch_method2_before / 1024**3:.2f} GB, "
-            f"total memory used: {mem_used_before / 1024**3:.2f} GB, "
-            f"free memory: {mem_free_before / 1024**3:.2f} GB")
+        logger.info(
+            f"Memory BEFORE graph-capture forward (bs={bs}, draft_len={draft_len}): "
+            f"torch_alloc: {mem_used_inside_torch_before / 1024**3:.2f} GB, "
+            f"outside_m1: {mem_used_outside_torch_method1_before / 1024**3:.2f} GB, "
+            f"outside_m2: {mem_used_outside_torch_method2_before / 1024**3:.2f} GB, "
+            f"total: {mem_used_before / 1024**3:.2f} GB, free: {mem_free_before / 1024**3:.2f} GB"
+        )
-        mem_used_after, mem_free_after, mem_total = device_memory_info(
+        mem_used_after, mem_free_after, _ = device_memory_info(
         )
-        logger.info(
-            f"Memory stats AFTER graph capture forward (batch_size={bs}, draft_len={draft_len}): "
-            f"memory used inside torch: {mem_used_inside_torch_after / 1024**3:.2f} GB, "
-            f"memory used outside torch (method1 - device_memory_info): {mem_used_outside_torch_method1_after / 1024**3:.2f} GB, "
-            f"memory used outside torch (method2 - mem_get_info): {mem_used_outside_torch_method2_after / 1024**3:.2f} GB, "
-            f"total memory used: {mem_used_after / 1024**3:.2f} GB, "
-            f"free memory: {mem_free_after / 1024**3:.2f} GB")
+        logger.info(
+            f"Memory AFTER graph-capture forward (bs={bs}, draft_len={draft_len}): "
+            f"torch_alloc: {mem_used_inside_torch_after / 1024**3:.2f} GB, "
+            f"outside_m1: {mem_used_outside_torch_method1_after / 1024**3:.2f} GB, "
+            f"outside_m2: {mem_used_outside_torch_method2_after / 1024**3:.2f} GB, "
+            f"total: {mem_used_after / 1024**3:.2f} GB, free: {mem_free_after / 1024**3:.2f} GB"
+        )
-        logger.info(
-            f"Memory CHANGES during forward (batch_size={bs}, draft_len={draft_len}): "
-            f"inside torch change: {inside_torch_change / 1024**3:.2f} GB, "
-            f"outside torch change (method1): {outside_torch_change_method1 / 1024**3:.2f} GB, "
-            f"outside torch change (method2): {outside_torch_change_method2 / 1024**3:.2f} GB, "
-            f"total memory change: {mem_change / 1024**3:.2f} GB"
-        )
+        logger.info(
+            f"Memory CHANGES during forward (bs={bs}, draft_len={draft_len}): "
+            f"torch_alloc: {inside_torch_change / 1024**3:.2f} GB, "
+            f"outside_m1: {outside_torch_change_method1 / 1024**3:.2f} GB, "
+            f"outside_m2: {outside_torch_change_method2 / 1024**3:.2f} GB, "
+            f"total: {mem_change / 1024**3:.2f} GB"
+        )
Also applies to: 789-828
2178-2195: Minor: replace unused mem_total with underscore in forward pre-sample.
Keeps the linter quiet; no behavior change.
- mem_used_before, mem_free_before, mem_total = device_memory_info()
+ mem_used_before, mem_free_before, _ = device_memory_info()
2231-2271: Synchronize before sampling post-forward memory in no-cache path.
Without a synchronize, allocator/free ops may still be in flight and the numbers can be misleading. Sync only around the profiling to avoid perturbing normal execution.
Apply this diff:
 with MoeLoadBalancerIterContext(moe_load_balancer):
     outputs = self._forward_step(inputs, gather_ids, gather_context_logits)
-    # Memory statistics after forward (no cache path)
+    # Ensure kernels finished before sampling memory for accurate numbers
+    torch.cuda.synchronize()
+    # Memory statistics after forward (no cache path)
-    mem_used_after, mem_free_after, mem_total = device_memory_info()
+    mem_used_after, mem_free_after, _ = device_memory_info()
Verification idea:
- Re-run the CI job that intermittently OOMs and compare the “CHANGES” totals before/after this sync; you should observe more stable deltas across runs at identical batch sizes.
2320-2359: Synchronize before sampling post-forward memory (regular/cuda_graph path).
Same rationale as the no-cache case: make the “AFTER” and “CHANGES” numbers deterministic across runs.
Apply this diff:
 self._execute_logit_post_processors(scheduled_requests, outputs)
-    # Memory statistics after forward
+    # Ensure kernels finished before sampling memory for accurate numbers
+    torch.cuda.synchronize()
+    # Memory statistics after forward
-    mem_used_after, mem_free_after, mem_total = device_memory_info()
+    mem_used_after, mem_free_after, _ = device_memory_info()
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)
312-315: Guard mrope_position_deltas copy when self.mrope_position_deltas is None.
If use_mrope=False, self.mrope_position_deltas is None; copying unconditionally will raise. Guard on both the key and the allocated tensor.
Apply this diff:
- if "mrope_position_deltas" in inputs: - self.mrope_position_deltas[:self.batch_size].copy_( - inputs["mrope_position_deltas"]) + if "mrope_position_deltas" in inputs and self.mrope_position_deltas is not None: + self.mrope_position_deltas[:self.batch_size].copy_(inputs["mrope_position_deltas"])If you have a CI repro with multimodal inputs but non-mrope rope scaling, please sanity-check this path.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (4 hunks)
tensorrt_llm/_torch/pyexecutor/model_engine.py (6 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/model_engine.py
777-777: Line too long (147 > 120)
(E501)
778-778: Line too long (141 > 120)
(E501)
809-809: Line too long (131 > 120)
(E501)
810-810: Line too long (131 > 120)
(E501)
815-815: Line too long (146 > 120)
(E501)
816-816: Line too long (140 > 120)
(E501)
🔇 Additional comments (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
2298-2310: LGTM: capture function wiring is clean and side-effect free.
Passing capture_forward_fn into maybe_graph.capture and reusing the returned pool aligns with the CUDA graph lifecycle and keeps eager vs. captured paths symmetric.
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)
130-153: LGTM: capture-start memory sampling and dual accounting look correct.
Using both device_memory_info and mem_get_info provides useful cross-checks for outside-PyTorch usage. Good placement before any warmup/capture side effects.
Signed-off-by: Lanyu Liao <[email protected]>
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
1-1: Add NVIDIA copyright header (2025) per coding guidelines.
The repository guidelines require a copyright header on all Python/C++ sources. Please prepend the standard NVIDIA header at the top of this file.
Example:
+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)
1-1: Add NVIDIA copyright header (2025) per coding guidelines.
Please prepend the standard NVIDIA header at the top of this file.
Example:
+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)
78-80: Fix unsafe __del__: calling reset() when _graph is None can raise; wrap in a safe guard.
This was flagged previously as well; we should avoid exceptions during teardown and tolerate None.
Apply:
-    def __del__(self):
-        self._graph.reset()
+    def __del__(self):
+        graph = getattr(self, "_graph", None)
+        if graph is not None:
+            try:
+                graph.reset()
+            except Exception as e:
+                # Avoid raising from __del__
+                logger.warning(f"Error during CUDA graph reset in __del__: {e}")
🧹 Nitpick comments (11)
tensorrt_llm/_torch/pyexecutor/model_engine.py (5)
754-781: Warmup: minor cleanups and line-length fix for the log.
- mem_total is unused; replace with “_” to avoid unused var warnings.
- Several logger.info lines exceed the 120-char limit flagged by Ruff; split the long f-strings for E501 compliance.
Apply:
- mem_used_before, mem_free_before, mem_total = device_memory_info( + mem_used_before, mem_free_before, _ = device_memory_info( )And rewrap the log for line-length compliance:
- logger.info( - f"Memory stats BEFORE graph capture forward (batch_size={bs}, draft_len={draft_len}): " - f"memory used inside torch: {mem_used_inside_torch_before / 1024**3:.2f} GB, " - f"memory used outside torch (method1 - device_memory_info): {mem_used_outside_torch_method1_before / 1024**3:.2f} GB, " - f"memory used outside torch (method2 - mem_get_info): {mem_used_outside_torch_method2_before / 1024**3:.2f} GB, " - f"total memory used: {mem_used_before / 1024**3:.2f} GB, " - f"free memory: {mem_free_before / 1024**3:.2f} GB") + logger.info( + ( + f"Memory stats BEFORE graph capture forward " + f"(batch_size={bs}, draft_len={draft_len}): " + f"inside torch: {mem_used_inside_torch_before / 1024**3:.2f} GB, " + f"outside torch (method1): {mem_used_outside_torch_method1_before / 1024**3:.2f} GB, " + f"outside torch (method2): {mem_used_outside_torch_method2_before / 1024**3:.2f} GB, " + f"total: {mem_used_before / 1024**3:.2f} GB, " + f"free: {mem_free_before / 1024**3:.2f} GB" + ) + )
788-827: Warmup-after: minor cleanups and line-length fix for the logs.
- mem_total is unused; replace with “_”.
- Split the two logger.info calls to satisfy E501.
Apply:
- mem_used_after, mem_free_after, mem_total = device_memory_info(
+ mem_used_after, mem_free_after, _ = device_memory_info(
  )
And rewrap both logger.info calls similarly to the previous hunk to keep each physical line under 120 characters.
2177-2194: Before-forward memory snapshot: tiny tidy-up and optional gating.
- mem_total is unused; replace with “_”.
- Optional: gate this pre-forward snapshot under an env flag (e.g., TLLM_ENABLE_MEM_LOGS=1) to keep hot-path overhead minimal when not debugging CI OOMs.
Apply:
- mem_used_before, mem_free_before, mem_total = device_memory_info()
+ mem_used_before, mem_free_before, _ = device_memory_info()
If you want gating:
if os.environ.get("TLLM_ENABLE_MEM_LOGS") == "1":
    # collect and log memory stats
2232-2269: Synchronize before post-forward memory read to avoid skewed stats.
Kernels may still be in-flight; without a sync, the “after” snapshot can under-report usage or make deltas noisy. Synchronize once before measuring to improve accuracy for the CI diagnostics.
Apply:
-        # Memory statistics after forward (no cache path)
-        mem_used_after, mem_free_after, mem_total = device_memory_info()
+        # Memory statistics after forward (no cache path)
+        torch.cuda.synchronize()
+        mem_used_after, mem_free_after, _ = device_memory_info()
Also consider rewrapping the logger.info into multiple shorter segments to satisfy E501 (same style as earlier suggestion).
2319-2357: Graph/regular path: add one synchronize before “after” stats and fix E501.
- Add torch.cuda.synchronize() before reading the “after” memory to keep the deltas consistent.
- mem_total is unused; replace with “_”.
- Split the logger.info string for E501.
Apply:
-        # Memory statistics after forward
-        mem_used_after, mem_free_after, mem_total = device_memory_info()
+        # Memory statistics after forward
+        torch.cuda.synchronize()
+        mem_used_after, mem_free_after, _ = device_memory_info()
And rewrap the logger.info similar to the warmup hunk to keep physical lines < 120 chars.
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (6)
86-110: Capture start: small cleanups and E501 compliance.
- mem_total is unused; replace with “_”.
- Split the logger.info string to satisfy Ruff E501.
Apply:
- mem_used_start, mem_free_start, mem_total = device_memory_info()
+ mem_used_start, mem_free_start, _ = device_memory_info()
- logger.info( - f"CUDA graph capture START (batch_size={self.batch_size}): " - f"inside torch: {mem_used_inside_torch_start / 1024**3:.2f} GB, " - f"outside torch (method1): {mem_used_outside_torch_method1_start / 1024**3:.2f} GB, " - f"outside torch (method2): {mem_used_outside_torch_method2_start / 1024**3:.2f} GB, " - f"total: {mem_used_start / 1024**3:.2f} GB, free: {mem_free_start / 1024**3:.2f} GB" - ) + logger.info( + ( + f"CUDA graph capture START (batch_size={self.batch_size}): " + f"inside torch: {mem_used_inside_torch_start / 1024**3:.2f} GB, " + f"outside torch (method1): {mem_used_outside_torch_method1_start / 1024**3:.2f} GB, " + f"outside torch (method2): {mem_used_outside_torch_method2_start / 1024**3:.2f} GB, " + f"total: {mem_used_start / 1024**3:.2f} GB, free: {mem_free_start / 1024**3:.2f} GB" + ) + )
121-125: Before warmup snapshot: minor tidy-up.
Replace unused mem_total with “_”.
- mem_used_before_warmup, _, _ = device_memory_info()
+ mem_used_before_warmup, _, _ = device_memory_info()
# (The third return is already ignored; keeping as-is is fine if the linter allows it.)
135-164: After warmup: add synchronize before reading, and rewrap logs.
- Insert torch.cuda.synchronize() prior to reading memory after the two warmup iterations to ensure kernels are complete.
- Consider rewrapping the logger.info to satisfy E501.
Apply:
-        # Memory statistics after warmup
-        mem_used_after_warmup, mem_free_after_warmup, _ = device_memory_info()
+        # Memory statistics after warmup
+        torch.cuda.synchronize()
+        mem_used_after_warmup, mem_free_after_warmup, _ = device_memory_info()
165-169: Before capture: minor tidy-up.
Replace unused mem_total with “_” (already ignored). No functional issues.
174-202: After capture: add synchronize and rewrap log.
- Synchronize once after exiting the torch.cuda.graph context to ensure all allocations are reflected in the snapshot.
- Split the logger.info for E501.
Apply:
-        # Memory statistics after graph capture
-        mem_used_after_capture, mem_free_after_capture, _ = device_memory_info()
+        # Memory statistics after graph capture
+        torch.cuda.synchronize()
+        mem_used_after_capture, mem_free_after_capture, _ = device_memory_info()
208-241: Capture end: add synchronize, replace unused mem_total, and rewrap log.
- Synchronize before the “end” snapshot to keep totals consistent.
- Replace mem_total with “_”.
- Split the logger.info to satisfy E501.
Apply:
-        # Memory statistics at capture end
-        mem_used_end, mem_free_end, mem_total = device_memory_info()
+        # Memory statistics at capture end
+        torch.cuda.synchronize()
+        mem_used_end, mem_free_end, _ = device_memory_info()
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (4 hunks)
tensorrt_llm/_torch/pyexecutor/model_engine.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/model_engine.py
777-777: Line too long (147 > 120)
(E501)
778-778: Line too long (141 > 120)
(E501)
809-809: Line too long (131 > 120)
(E501)
810-810: Line too long (131 > 120)
(E501)
815-815: Line too long (146 > 120)
(E501)
816-816: Line too long (140 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
32-32: Import of device_memory_info looks good.
Importing device_memory_info centralizes device memory queries and aligns with the added logging. No issues spotted.
2229-2231: No-cache forward call path looks fine.
The refactor to pass gather_context_logits through is correct and preserves behavior.
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)
6-8: New imports for logging/instrumentation are appropriate.
Bringing in logger and device_memory_info is consistent with the PR objective and mirrors model_engine.py.
Signed-off-by: Lanyu Liao <[email protected]>
Actionable comments posted: 0
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)
146-148: Fix unsafe __del__: guard None and avoid raising during teardown.
Calling reset() unconditionally can raise when _graph is None or during interpreter shutdown. This was flagged previously; please restore a safe guard.
-    def __del__(self):
-        self._graph.reset()
+    def __del__(self):
+        graph = getattr(self, "_graph", None)
+        if graph is not None:
+            try:
+                graph.reset()
+            except Exception as e:
+                logger.warning(f"Error resetting CUDA graph in __del__: {e}")
🧹 Nitpick comments (7)
tensorrt_llm/_torch/pyexecutor/model_engine.py (4)
2279-2285: Synchronize before sampling “after forward” memory (no-cache path).
Without a barrier, kernels may still be in-flight, making memory numbers noisy. Add a sync immediately before sampling.
 with MoeLoadBalancerIterContext(moe_load_balancer):
     outputs = self._forward_step(inputs, gather_ids, gather_context_logits)
-    # Memory statistics after forward (no cache path)
+    # Memory statistics after forward (no cache path)
+    torch.cuda.synchronize()
     stats_after_forward = get_memory_stats()
2335-2342: Synchronize before sampling “after forward” memory (regular/cuda_graph).
Same rationale as the no-cache path; ensures the reported deltas are trustworthy.
 self._execute_logit_post_processors(scheduled_requests, outputs)
-    # Memory statistics after forward
+    # Memory statistics after forward
+    torch.cuda.synchronize()
     stats_after_forward = get_memory_stats()
     execution_path = "cuda_graph" if maybe_graph is not None else "regular"
     batch_size = len(scheduled_requests.all_requests())
     log_memory_comparison(
         f"Forward memory stats [{execution_path}] (batch_size={batch_size})",
         stats_before_forward, stats_after_forward)
1-1: Add NVIDIA copyright header.
Per coding guidelines, prepend the 2025 NVIDIA copyright header.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
68-75: Centralize MemoryStats and helpers into a shared module.
We’ve confirmed that the MemoryStats NamedTuple and its accompanying functions (get_memory_stats, log_memory_stats, log_memory_comparison) are duplicated verbatim in both:
• tensorrt_llm/_torch/pyexecutor/model_engine.py (lines 68–134)
• tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (lines 14–62)
Duplicated logic increases maintenance burden and the risk of drift when enhancements (e.g., device-awareness or syncing) are applied in only one location. To consolidate:
• Extract MemoryStats, get_memory_stats, log_memory_stats, and log_memory_comparison into a new module, e.g. tensorrt_llm/_torch/pyexecutor/memory_utils.py (a possible skeleton follows below).
• In both model_engine.py and cuda_graph_runner.py, replace the local definitions with imports:
-from tensorrt_llm.profiler import device_memory_info
+from tensorrt_llm.profiler import device_memory_info
+from .memory_utils import (
+    MemoryStats,
+    get_memory_stats,
+    log_memory_stats,
+    log_memory_comparison,
+)
• Optionally enhance the get_memory_stats signature to accept a device: Optional[int] and sync: bool parameter for explicit multi-GPU support and stable debug measurements.
Please refactor these files to import the shared utilities and remove the in-file duplicates.
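A possible skeleton for the proposed memory_utils.py, including the optional device/sync parameters mentioned above (hypothetical module, not part of this PR; it assumes device_memory_info() reports the currently selected device):

```python
# tensorrt_llm/_torch/pyexecutor/memory_utils.py -- proposed shared helper, not in this PR
from typing import NamedTuple, Optional

import torch

from tensorrt_llm.profiler import device_memory_info


class MemoryStats(NamedTuple):
    inside_torch: int
    outside_torch_method1: int
    outside_torch_method2: int
    total_used: int
    total_free: int


def get_memory_stats(device: Optional[int] = None,
                     sync: bool = False) -> MemoryStats:
    """Single shared implementation with the optional device/sync extensions."""
    dev = torch.cuda.current_device() if device is None else device
    if sync:
        torch.cuda.synchronize(dev)  # stabilize readings when debugging; off by default
    with torch.cuda.device(dev):
        # Assumption: device_memory_info() reports the currently selected device
        mem_used, mem_free, _ = device_memory_info()
        inside_torch = torch.cuda.memory_stats(dev)["allocated_bytes.all.current"]
        free_cuda, total_gpu = torch.cuda.mem_get_info(dev)
    return MemoryStats(inside_torch,
                       max(mem_used - inside_torch, 0),
                       max((total_gpu - free_cuda) - inside_torch, 0),
                       mem_used, mem_free)
```

log_memory_stats and log_memory_comparison would move into the same module unchanged, so both call sites keep their current behavior.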
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (3)
182-186: Synchronize before sampling “after warmup” memory.
Warmup launches kernels; synchronize to ensure stable “after” readings.
 for _ in range(2):
     forward_fn(inputs)
-    # Memory statistics after warmup
+    # Memory statistics after warmup
+    torch.cuda.synchronize()
     stats_after_warmup = get_memory_stats()
     log_memory_comparison("CUDA graph warmup completed",
                           stats_before_warmup, stats_after_warmup)
14-21: Deduplicate MemoryStats and logging utilities; import from one place.
Same recommendation as model_engine.py: centralize these helpers to prevent drift and ease future tweaks (e.g., device-aware sampling, extra metrics like reserved_bytes).
@@ -from tensorrt_llm.logger import logger -from tensorrt_llm.profiler import device_memory_info +from tensorrt_llm.logger import logger +from tensorrt_llm.profiler import device_memory_info +from .memory_utils import ( + MemoryStats, + get_memory_stats, + log_memory_stats, + log_memory_comparison, +) @@ -class MemoryStats(NamedTuple): - """Memory usage statistics""" - inside_torch: int - outside_torch_method1: int - outside_torch_method2: int - total_used: int - total_free: int - - -def get_memory_stats() -> MemoryStats: - """Get current GPU memory usage statistics""" - # Get memory info from device_memory_info - mem_used, mem_free, mem_total = device_memory_info() - ... - return MemoryStats(...) - - -def log_memory_stats(description: str, stats: MemoryStats): - ... - - -def log_memory_comparison(description: str, before: MemoryStats, - after: MemoryStats): - ... +# (Removed local MemoryStats utilities; use shared ones)Also applies to: 23-47, 49-58, 60-80
1-1: Add NVIDIA copyright header.
Per coding guidelines, prepend the 2025 NVIDIA copyright header.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (4 hunks)
tensorrt_llm/_torch/pyexecutor/model_engine.py (15 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
395-399: LGTM: Instrumentation points are well-chosen and minimally invasive.
Coverage before/after compile, warmups, autotuner, CUDA graph capture, piecewise warmups, model load, and forward paths looks good for chasing OOMs.
Also applies to: 427-431, 754-758, 784-789, 791-795, 815-819, 855-860, 867-873, 875-880, 903-908, 1225-1227, 2239-2241
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)
155-159: LGTM: Capture lifecycle logging is thorough and targeted.
Start, warmup, pre/post capture, and completion deltas will be useful when the OOM reappears.
Also applies to: 170-172, 187-189, 193-197, 203-208
/bot run
PR_Github #16169 [ run ] triggered by Bot
Signed-off-by: Lanyu Liao <[email protected]>
/bot kill
/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-2, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-3"
PR_Github #16182 [ run ] triggered by Bot
PR_Github #16169 [ run ] completed with state
PR_Github #16180 [ kill ] triggered by Bot
PR_Github #16182 [ run ] completed with state
PR_Github #16180 [ kill ] completed with state
/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-2, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-3"
PR_Github #16262 [ run ] triggered by Bot
PR_Github #16262 [ run ] completed with state
/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-2, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-3"
PR_Github #16273 [ run ] triggered by Bot
PR_Github #16273 [ run ] completed with state
/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-2, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-3"
PR_Github #16288 [ run ] triggered by Bot
PR_Github #16288 [ run ] completed with state
/bot help
/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-2, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-3" --disable-reuse-test
PR_Github #16359 [ kill ] completed with state
PR_Github #16362 [ run ] triggered by Bot
Actionable comments posted: 3
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/models/modeling_deepseekv3.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/models/modeling_deepseekv3.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
get_memory_stats (77-100)
log_memory_stats (103-111)
tensorrt_llm/_torch/modules/gated_mlp.py (1)
GatedMLP(19-171)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)
761-767: Review: Ensure DeepseekV3DecoderLayer Memory Logging Is Safe Under CUDA Graph Capture.
We didn’t find any special guard around the memory-tracking calls in DeepseekV3DecoderLayer.forward, and there are CUDA graph capture paths in our model_engine (e.g. maybe_graph.capture(capture_forward_fn, …)) that wrap the entire decoder forward. Calling the raw CUDA memory APIs during a graph capture can invalidate or corrupt the capture. Please verify:
Location of potential issue
• File: tensorrt_llm/_torch/models/modeling_deepseekv3.py
• Lines 761–767 (the get_memory_stats()/log_memory_stats() snippet)
Definitions of the logging helpers
• get_memory_stats and log_memory_stats in tensorrt_llm/_torch/pyexecutor/model_engine.py
– Ensure they no-op or defer when a CUDA graph is active
Capture paths in model_engine
• The DecodingCUDAGraphRunner.capture() call used in model_engine._maybe_get_cuda_graph() / capture_forward_fn
– Confirm that any submodule forwards (including DeepseekV3DecoderLayer) are either outside the captured region or have their memory logging guarded
Suggested action: wrap the memory-stats calls in forward with a guard such as:
from tensorrt_llm._torch.pyexecutor.cuda_graph_runner import is_graph_capturing

if not is_graph_capturing():
    stats = get_memory_stats()
    log_memory_stats(f"… layer {self.layer_idx}", stats)
This will ensure the layer’s forward remains safe under CUDA graph captures.
# Memory tracking before MoE/MLP
from tensorrt_llm._torch.pyexecutor.model_engine import (
    get_memory_stats, log_memory_stats)
stats_before_moe_mlp = get_memory_stats()
log_memory_stats(f"Memory stats BEFORE MoE/MLP layer {self.layer_idx}",
                 stats_before_moe_mlp)
🛠️ Refactor suggestion
Guard memory sampling; skip during CUDA Graph capture; log only on rank 0.
Calling device memory APIs per layer on every forward can be expensive and may interfere with CUDA Graph capture. Please gate the sampling:
- behind an env flag (so CI can enable it without impacting prod),
- skip when the current stream is being captured,
- and reduce log noise to rank 0.
Apply this diff:
- # Memory tracking before MoE/MLP
- from tensorrt_llm._torch.pyexecutor.model_engine import (
- get_memory_stats, log_memory_stats)
- stats_before_moe_mlp = get_memory_stats()
- log_memory_stats(f"Memory stats BEFORE MoE/MLP layer {self.layer_idx}",
- stats_before_moe_mlp)
+ # Memory tracking before MoE/MLP (guarded; CI-only)
+ should_log_mem = os.getenv("TRTLLM_MEMLOG", "0") == "1" and getattr(self.mapping, "rank", 0) == 0
+ # Avoid logging inside CUDA Graph capture to prevent capture failures
+ is_capturing = getattr(torch.cuda, "is_current_stream_capturing", lambda: False)()
+ if should_log_mem and not is_capturing:
+ # Keep import local to avoid potential import-time cycles
+ from tensorrt_llm._torch.pyexecutor import model_engine as _memlog
+ try:
+ stats_before_moe_mlp = _memlog.get_memory_stats()
+ _memlog.log_memory_stats(
+ f"Memory stats BEFORE MoE/MLP layer {self.layer_idx}",
+ stats_before_moe_mlp,
+ )
+ except Exception as e:
+ logger.debug(f"Memory logging (BEFORE) skipped: {e}")Additional nitpick: to align with our Python guidelines (“maintain module namespace in imports”), prefer importing the module (as shown) over from ... import func.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/models/modeling_deepseekv3.py around lines 761 to 767,
the per-layer device memory sampling/logging is unguarded; wrap it so it only
runs when an env var (e.g., TLLM_ENABLE_MEM_SAMPLING) is set, skip sampling when
the current CUDA stream is being captured (use
torch.cuda.current_stream().is_capturing()), and emit logs only on rank 0 (use
torch.distributed.is_initialized() ? torch.distributed.get_rank() == 0 : True).
Also change the import to import the module (import
tensorrt_llm._torch.pyexecutor.model_engine as model_engine) and call
model_engine.get_memory_stats()/model_engine.log_memory_stats() so the sampling
is cheap and safe during normal inference and does not interfere with CUDA Graph
capture.
result = self.forward_MoE(
    hidden_states=hidden_states,
    attn_metadata=attn_metadata,
    residual=residual,
)

# Memory tracking after MoE
stats_after_moe = get_memory_stats()
log_memory_stats(f"Memory stats AFTER MoE layer {self.layer_idx}",
                 stats_after_moe)

return result
else:
🛠️ Refactor suggestion
Post‑MoE logging should be gated the same way.
Mirror the same capture/rank/env gating after the MoE path to avoid hot‑path overhead and log storms.
Apply this diff:
result = self.forward_MoE(
hidden_states=hidden_states,
attn_metadata=attn_metadata,
residual=residual,
)
- # Memory tracking after MoE
- stats_after_moe = get_memory_stats()
- log_memory_stats(f"Memory stats AFTER MoE layer {self.layer_idx}",
- stats_after_moe)
+ # Memory tracking after MoE (guarded)
+ if 'should_log_mem' in locals() and should_log_mem and not is_capturing:
+ from tensorrt_llm._torch.pyexecutor import model_engine as _memlog
+ try:
+ stats_after_moe = _memlog.get_memory_stats()
+ _memlog.log_memory_stats(
+ f"Memory stats AFTER MoE layer {self.layer_idx}",
+ stats_after_moe,
+ )
+ except Exception as e:
+ logger.debug(f"Memory logging (AFTER MoE) skipped: {e}")
 return result
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/models/modeling_deepseekv3.py around lines 769 to 781,
the post-MoE memory capture and log_memory_stats call are unconditional; mirror
the same capture/rank/environment gating used elsewhere in the non-MoE path to
avoid hot-path overhead and log storms. Wrap the stats_after_moe =
get_memory_stats() and log_memory_stats(...) calls in the same conditional (the
capture flag and rank/env check used for the other branch) so memory is only
captured and logged when that gating is true.
result = self.forward_mlp(
    hidden_states=hidden_states,
    residual=residual,
)

# Memory tracking after MLP
stats_after_mlp = get_memory_stats()
log_memory_stats(f"Memory stats AFTER MLP layer {self.layer_idx}",
                 stats_after_mlp)

return result
🛠️ Refactor suggestion
Post‑MLP logging should be gated the same way.
Same rationale as MoE branch: keep it CI‑only, skip during capture, and limit to rank 0.
Apply this diff:
result = self.forward_mlp(
hidden_states=hidden_states,
residual=residual,
)
- # Memory tracking after MLP
- stats_after_mlp = get_memory_stats()
- log_memory_stats(f"Memory stats AFTER MLP layer {self.layer_idx}",
- stats_after_mlp)
+ # Memory tracking after MLP (guarded)
+ if 'should_log_mem' in locals() and should_log_mem and not is_capturing:
+ from tensorrt_llm._torch.pyexecutor import model_engine as _memlog
+ try:
+ stats_after_mlp = _memlog.get_memory_stats()
+ _memlog.log_memory_stats(
+ f"Memory stats AFTER MLP layer {self.layer_idx}",
+ stats_after_mlp,
+ )
+ except Exception as e:
+ logger.debug(f"Memory logging (AFTER MLP) skipped: {e}")
 return result
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/models/modeling_deepseekv3.py around lines 783 to 794,
the post-MLP memory logging is unconditional; wrap the get_memory_stats and
log_memory_stats calls with the same guard used in the MoE branch (only run in
CI, skip during capture, and only on rank 0). Modify the code so the memory-stat
capture and log are executed only when the CI flag is set, the code is not in
capture mode, and the current process is rank 0, mirroring the exact conditional
used earlier for the MoE branch.
/bot run
/bot kill
PR_Github #16421 [ run ] triggered by Bot
PR_Github #16362 [ run ] completed with state
PR_Github #16422 [ kill ] triggered by Bot
PR_Github #16421 [ run ] completed with state
PR_Github #16422 [ kill ] completed with state
Signed-off-by: Lanyu Liao <[email protected]>
/bot kill
PR_Github #16432 [ kill ] triggered by Bot
PR_Github #16432 [ kill ] completed with state
/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-2, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-3" --disable-reuse-test --detailed-log
Actionable comments posted: 2
🧹 Nitpick comments (7)
tensorrt_llm/executor/executor.py (1)
389-395: Gate main-process memory log and make it robust.
This “pre-GenerationExecution” snapshot runs in the main process (often not the process that will own the GPU). It is useful as a coarse baseline, but it can be noisy or misleading on CPU-only launchers or when device 0 is not the later worker device.
- Gate behind CUDA availability and optionally an env var to avoid flooding.
- Make it leader-only to reduce noise in multi-process launchers that reuse the main process.
Apply this diff:
@@
-        from tensorrt_llm._torch.pyexecutor.model_engine import (
-            get_memory_stats, log_memory_stats)
-        stats_before_worker_init = get_memory_stats()
-        log_memory_stats(
-            "Memory stats BEFORE GenerationExecution initialization",
-            stats_before_worker_init)
+        from tensorrt_llm._torch.pyexecutor.model_engine import (
+            get_memory_stats, log_memory_stats)
+        if torch.cuda.is_available() and os.environ.get("TLLM_MEMLOG_EXECUTOR", "1") == "1":
+            try:
+                stats_before_worker_init = get_memory_stats()
+                log_memory_stats("Memory stats BEFORE GenerationExecution initialization",
+                                 stats_before_worker_init)
+            except Exception as _e:
+                logger.debug(f"Skipped executor pre-init memlog: {_e}")
tensorrt_llm/_torch/pyexecutor/model_engine.py (6)
68-75: Make get_memory_stats device-aware and resilient.
- The current implementation implicitly uses the current device. Exposing an optional device parameter improves attribution and lets callers snapshot specific devices without first mutating global device state.
- Add a light guard for non-CUDA environments to avoid surprising crashes in CPU-only test stubs.
Apply this diff:
@@ -class MemoryStats(NamedTuple): +class MemoryStats(NamedTuple): """Memory usage statistics""" inside_torch: int outside_torch_method1: int outside_torch_method2: int total_used: int total_free: int -def get_memory_stats() -> MemoryStats: +def get_memory_stats(device: Optional[int] = None) -> MemoryStats: """Get current GPU memory usage statistics""" - # Get memory info from device_memory_info - mem_used, mem_free, mem_total = device_memory_info() + if not torch.cuda.is_available(): + return MemoryStats(0, 0, 0, 0, 0) + # Respect optional device parameter + dev_cm = torch.cuda.device(device) if device is not None else contextlib.nullcontext() + with dev_cm: + # Get memory info from device_memory_info + mem_used, mem_free, mem_total = device_memory_info() - # Get PyTorch allocated memory - mem_inside_torch = torch.cuda.memory_stats()["allocated_bytes.all.current"] + # Get PyTorch allocated memory + mem_inside_torch = torch.cuda.memory_stats()["allocated_bytes.all.current"] @@ - # Method 2: using torch.cuda.mem_get_info - mem_free_cuda, total_gpu_memory = torch.cuda.mem_get_info() + # Method 2: using torch.cuda.mem_get_info + mem_free_cuda, total_gpu_memory = torch.cuda.mem_get_info() total_used_bytes = total_gpu_memory - mem_free_cuda mem_outside_torch_method2 = ( total_used_bytes - mem_inside_torch) if total_used_bytes > mem_inside_torch else 0 - return MemoryStats(inside_torch=mem_inside_torch, - outside_torch_method1=mem_outside_torch_method1, - outside_torch_method2=mem_outside_torch_method2, - total_used=mem_used, - total_free=mem_free) + return MemoryStats( + inside_torch=mem_inside_torch, + outside_torch_method1=mem_outside_torch_method1, + outside_torch_method2=mem_outside_torch_method2, + total_used=mem_used, + total_free=mem_free, + )Also applies to: 77-101
103-111: Include device id in log line for unambiguous attribution.
When reading CI logs from many ranks/devices, including the device id in each line saves time correlating.
Apply this diff:
def log_memory_stats(description: str, stats: MemoryStats): """Log memory statistics with a description""" - logger.info( + dev = torch.cuda.current_device() if torch.cuda.is_available() else -1 + logger.info( f"{description}: " + f"device: {dev}, " f"inside torch: {stats.inside_torch / 1024**3:.2f} GB, " f"outside torch (method1): {stats.outside_torch_method1 / 1024**3:.2f} GB, " f"outside torch (method2): {stats.outside_torch_method2 / 1024**3:.2f} GB, " f"total: {stats.total_used / 1024**3:.2f} GB, " f"free: {stats.total_free / 1024**3:.2f} GB")
114-134: Also log free-memory delta for quicker leak/regression eyeballing.
The comparison omits the free-memory change; adding it completes the picture and helps spot fragmentation or allocator behavior changes.
Apply this diff:
def log_memory_comparison(description: str, before: MemoryStats, after: MemoryStats): """Log memory statistics comparison with current values and changes""" # Calculate changes inside_torch_change = after.inside_torch - before.inside_torch outside_torch_change_method1 = after.outside_torch_method1 - before.outside_torch_method1 outside_torch_change_method2 = after.outside_torch_method2 - before.outside_torch_method2 total_change = after.total_used - before.total_used + free_change = after.total_free - before.total_free - logger.info( + dev = torch.cuda.current_device() if torch.cuda.is_available() else -1 + logger.info( f"{description}: " - f"CURRENT - inside torch: {after.inside_torch / 1024**3:.2f} GB, " + f"CURRENT - device: {dev}, inside torch: {after.inside_torch / 1024**3:.2f} GB, " f"outside torch (method1): {after.outside_torch_method1 / 1024**3:.2f} GB, " f"outside torch (method2): {after.outside_torch_method2 / 1024**3:.2f} GB, " f"total: {after.total_used / 1024**3:.2f} GB, " - f"free: {after.total_free / 1024**3:.2f} GB | " + f"free: {after.total_free / 1024**3:.2f} GB | " f"CHANGES - inside torch: {inside_torch_change / 1024**3:.2f} GB, " f"outside torch (method1): {outside_torch_change_method1 / 1024**3:.2f} GB, " f"outside torch (method2): {outside_torch_change_method2 / 1024**3:.2f} GB, " - f"total: {total_change / 1024**3:.2f} GB") + f"total: {total_change / 1024**3:.2f} GB, " + f"free: {free_change / 1024**3:.2f} GB")
397-401: Sync and reduce verbosity for warmup/autotune markers.
- Consider calling torch.cuda.synchronize() immediately before “BEFORE … warmup/autotuner” snapshots to stabilize readings; warmup tends to enqueue kernels.
- To keep CI logs readable, gate these info-level logs behind an env var, e.g., TLLM_MEMLOG_VERBOSE=1.
Example minimal guarding pattern (applies to all these blocks):
-# Memory statistics before torch compile warmup
-stats_before_warmup = get_memory_stats()
-log_memory_stats("Memory stats BEFORE torch compile warmup", stats_before_warmup)
+if os.environ.get("TLLM_MEMLOG_VERBOSE", "1") == "1":
+    torch.cuda.synchronize()
+    stats_before_warmup = get_memory_stats()
+    log_memory_stats("Memory stats BEFORE torch compile warmup", stats_before_warmup)
Also applies to: 429-433, 756-760, 786-791, 793-797, 816-821
857-875: Capture path: sync before “before-capture” snapshot for tighter deltas.
Right before measuring “BEFORE graph capture forward”, add a synchronize to flush any prior work; you already sync after the forward.
Apply this diff:
@@
-    # Memory statistics before CUDA graph capture
-    stats_before_graph_capture = get_memory_stats()
+    # Memory statistics before CUDA graph capture
+    torch.cuda.synchronize()
+    stats_before_graph_capture = get_memory_stats()
Also applies to: 869-875
876-910: Piecewise CUDA graph warmup logs: consider gating or sampling.
This loop can emit many lines across multiple seq_lens values. If CI’s OOM is rare, logging for every seq_lens value may be excessive. Suggest gating behind TLLM_MEMLOG_VERBOSE or sampling every N iterations.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
tensorrt_llm/_torch/pyexecutor/model_engine.py(16 hunks)tensorrt_llm/executor/executor.py(1 hunks)tensorrt_llm/executor/worker.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
tensorrt_llm/executor/worker.py (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
get_memory_stats (77-100)
log_memory_stats (103-111)
tensorrt_llm/executor/executor.py (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
get_memory_stats (77-100)
log_memory_stats (103-111)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
359-361: LGTM: model-loading memory snapshots are well placed.
Pre/post model loading snapshots give useful bookends around checkpoint hydration and CUDA initialization. No functional concerns.
Also applies to: 1227-1230
32-33: Cross-check device_memory_info semantics vs mem_get_info.
device_memory_info() and torch.cuda.mem_get_info() may not agree 1:1 on certain driver/runtime revisions (e.g., accounting for BAR1/page-locked allocations). You’re already exposing both “outside_torch” methods, which is good. If CI OOMs persist, consider logging both allocated_bytes and reserved_bytes (allocator) to spot fragmentation.
Would you like me to add a minimal “reserved_bytes” field to MemoryStats and thread it through the log lines?
# Memory statistics before forward
stats_before_forward = get_memory_stats()
🛠️ Refactor suggestion
Avoid per-forward memstats in the hot path unless explicitly enabled.
These snapshots execute on every forward, which:
- adds non-trivial overhead (torch.cuda.memory_stats builds a large dict),
- floods logs in steady-state generation.
Recommend gating behind an env var or sampling (e.g., every N iters).
Apply this diff:
@@
- # Memory statistics before forward
- stats_before_forward = get_memory_stats()
+ # Memory statistics before forward (gated)
+ memlog_enabled = os.environ.get("TLLM_MEMLOG_FORWARD", "0") == "1"
+ if memlog_enabled:
+ stats_before_forward = get_memory_stats()
@@
- # Memory statistics after forward (no cache path)
- stats_after_forward = get_memory_stats()
- batch_size = len(scheduled_requests.all_requests())
- log_memory_comparison(
- f"Forward memory stats [no_cache] (batch_size={batch_size})",
- stats_before_forward, stats_after_forward)
+ if memlog_enabled:
+ stats_after_forward = get_memory_stats()
+ batch_size = len(scheduled_requests.all_requests())
+ log_memory_comparison(
+ f"Forward memory stats [no_cache] (batch_size={batch_size})",
+ stats_before_forward, stats_after_forward)
@@
- # Memory statistics after forward
- stats_after_forward = get_memory_stats()
- execution_path = "cuda_graph" if maybe_graph is not None else "regular"
- batch_size = len(scheduled_requests.all_requests())
- log_memory_comparison(
- f"Forward memory stats [{execution_path}] (batch_size={batch_size})",
- stats_before_forward, stats_after_forward)
+ if memlog_enabled:
+ stats_after_forward = get_memory_stats()
+ execution_path = "cuda_graph" if maybe_graph is not None else "regular"
+ batch_size = len(scheduled_requests.all_requests())
+ log_memory_comparison(
+ f"Forward memory stats [{execution_path}] (batch_size={batch_size})",
+ stats_before_forward, stats_after_forward)

Also applies to: 2281-2289, 2337-2344
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/model_engine.py around lines 2241-2243 (and
also apply same change at 2281-2289 and 2337-2344): the code currently captures
full memory statistics on every forward (get_memory_stats()) which is expensive
and noisy; gate these memstats snapshots behind a configurable switch
(environment variable like ENABLE_MEM_STATS) or a sampling mechanism (e.g.,
capture only 1/N iterations using an iteration counter or random sampling) so
that memstats are only collected when enabled or sampled, and update the
surrounding logic to skip calling get_memory_stats() and related logging when
the gate is off. Ensure the env var is read once (cached) and the sampling uses
a lightweight check to avoid adding hot-path overhead.
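One way to satisfy the "read the env var once, keep the hot-path check lightweight" part of that prompt. A sketch only: `TLLM_MEMLOG_FORWARD` comes from the diff above, while `TLLM_MEMLOG_FORWARD_EVERY_N` and the function name are hypothetical.

```python
import itertools
import os

# Read the gate and sampling interval once at import time so the per-forward
# check is just a boolean test plus a modulo.
_MEMLOG_FORWARD = os.environ.get("TLLM_MEMLOG_FORWARD", "0") == "1"
_MEMLOG_EVERY_N = max(1, int(os.environ.get("TLLM_MEMLOG_FORWARD_EVERY_N", "100")))
_forward_counter = itertools.count()


def should_log_forward_memstats() -> bool:
    """Return True on roughly 1 out of every _MEMLOG_EVERY_N forwards when enabled."""
    if not _MEMLOG_FORWARD:
        return False
    return next(_forward_counter) % _MEMLOG_EVERY_N == 0
```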
# Memory tracking before worker initialization
from tensorrt_llm._torch.pyexecutor.model_engine import (
    get_memory_stats, log_memory_stats)
stats_before_worker_init = get_memory_stats()
log_memory_stats("Memory stats BEFORE worker initialization",
                 stats_before_worker_init)
🛠️ Refactor suggestion
Set CUDA device and logger rank before taking memory snapshots.
Right now, the memory snapshot runs before:
- selecting the device for this worker (torch defaults to device 0), and
- setting the logger rank prefix.
In multi-GPU/MPI runs, all workers will likely report device 0, producing misleading data right at the most critical “before init” point and making logs hard to attribute.
Recommend:
- set logger rank early, and
- set the CUDA device based on global rank before calling get_memory_stats().
Apply this diff:
@@
) -> None:
- # Memory tracking before worker initialization
- from tensorrt_llm._torch.pyexecutor.model_engine import (
- get_memory_stats, log_memory_stats)
- stats_before_worker_init = get_memory_stats()
- log_memory_stats("Memory stats BEFORE worker initialization",
- stats_before_worker_init)
+ # Establish rank prefix and device before any memory logging
+ try:
+ grank = global_mpi_rank()
+ if global_mpi_size() > 1:
+ logger.set_rank(grank)
+ if torch.cuda.is_available():
+ device_id = grank % max(1, torch.cuda.device_count())
+ torch.cuda.set_device(device_id)
+ except Exception as _e:
+ logger.warning(f"Failed to set rank/device before memlog: {_e}")
+
+ # Memory tracking before worker initialization (per-rank, per-device)
+ from tensorrt_llm._torch.pyexecutor.model_engine import (
+ get_memory_stats, log_memory_stats)
+ stats_before_worker_init = get_memory_stats()
+ log_memory_stats("Memory stats BEFORE worker initialization", stats_before_worker_init)

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-    # Memory tracking before worker initialization
-    from tensorrt_llm._torch.pyexecutor.model_engine import (
-        get_memory_stats, log_memory_stats)
-    stats_before_worker_init = get_memory_stats()
-    log_memory_stats("Memory stats BEFORE worker initialization",
-                     stats_before_worker_init)
+    # Establish rank prefix and device before any memory logging
+    try:
+        grank = global_mpi_rank()
+        if global_mpi_size() > 1:
+            logger.set_rank(grank)
+        if torch.cuda.is_available():
+            device_id = grank % max(1, torch.cuda.device_count())
+            torch.cuda.set_device(device_id)
+    except Exception as _e:
+        logger.warning(f"Failed to set rank/device before memlog: {_e}")
+
+    # Memory tracking before worker initialization (per-rank, per-device)
+    from tensorrt_llm._torch.pyexecutor.model_engine import (
+        get_memory_stats, log_memory_stats)
+    stats_before_worker_init = get_memory_stats()
+    log_memory_stats("Memory stats BEFORE worker initialization", stats_before_worker_init)
```
🤖 Prompt for AI Agents
In tensorrt_llm/executor/worker.py around lines 63-69, the memory snapshot is
taken before the worker sets its logger rank prefix and selects its CUDA device,
causing misleading "before init" stats in multi-GPU/MPI runs; move the logger
rank setup and CUDA device selection to run before calling get_memory_stats():
set the logger rank prefix using the process/global rank (e.g., from MPI/dist or
the existing global_rank variable) and call
torch.cuda.set_device(appropriate_device_index based on that rank and
torch.cuda.device_count()) (ensure torch is imported and any rank lookup is
available) and only then call get_memory_stats() and log_memory_stats().
PR_Github #16434 [ run ] triggered by Bot

PR_Github #16434 [ run ] completed with state

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-2, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-3" --disable-reuse-test --detailed-log

PR_Github #16453 [ run ] triggered by Bot

PR_Github #16453 [ run ] completed with state

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-2, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-3" --disable-reuse-test --detailed-log

PR_Github #16473 [ run ] triggered by Bot

PR_Github #16473 [ run ] completed with state

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-2, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-3" --disable-reuse-test --detailed-log

PR_Github #16489 [ run ] triggered by Bot

PR_Github #16489 [ run ] completed with state

Closing this since we have identified the root cause.
Since we can’t reproduce the issue on the local cluster but it appears in the post-merge CI, we’ll add some logging first. When the OOM recurs in post-merge CI, we’ll have more information; once the problem is located, we’ll revert these logs.
Summary by CodeRabbit
New Features
Tests