Feat/add lora cuda graph support #7336
Conversation
Signed-off-by: Shahar Mor <[email protected]>
📝 Walkthrough

Adds LoRA support across the PyTorch executor stack and CUDA Graph path: propagates per-request LoRA params, introduces LoRA prefetch via a manager, updates the dummy-request flow, and wires initialization from PyExecutor/worker. Also tweaks devcontainer config and comments out a Docker volume mount. Adds tests covering LoRA with CUDA Graph.

Changes
Sequence Diagram(s)

sequenceDiagram
autonumber
participant Client
participant Worker
participant PyExecutor
participant ModelEngine
participant ResourceMgr as ResourceManager/KVCacheMgr
participant GraphRunner as DecodingCUDAGraphRunner
participant Model
rect rgb(245,248,255)
note over PyExecutor,ResourceMgr: Initialization and LoRA prefetch
Client->>Worker: Start engine
Worker->>PyExecutor: Initialize backend
PyExecutor->>ModelEngine: set_lora_manager(resource_manager)
PyExecutor->>ModelEngine: prefetch_lora_dirs()
ModelEngine->>ResourceMgr: Access PEFT cache / LoRA manager
ResourceMgr-->>ModelEngine: LoRA adapters info
end
rect rgb(245,255,245)
note over Client,Model: Inference with CUDA Graph
Client->>Worker: Submit requests (may include LoRA)
Worker->>PyExecutor: Generate(batch with LoRA)
PyExecutor->>ModelEngine: forward(batch, resource_manager)
alt Graph not captured
ModelEngine->>ResourceMgr: add_dummy_requests(lora_request=...)
ResourceMgr-->>ModelEngine: Dummy LlmRequests
ModelEngine->>GraphRunner: Construct(..., lora_params=from requests)
ModelEngine->>GraphRunner: Capture forward(inputs+lora_params)
else Graph captured
ModelEngine->>GraphRunner: Run with updated lora_params
end
GraphRunner->>Model: forward(..., lora_params)
Model-->>GraphRunner: logits/tokens
GraphRunner-->>ModelEngine: outputs
ModelEngine-->>PyExecutor: responses
PyExecutor-->>Worker: responses
Worker-->>Client: responses
end
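To make the diagram concrete, below is a minimal, generic sketch of the capture-then-replay pattern the graph runner relies on, using plain torch.cuda.CUDAGraph. The buffer names and the lora_params stand-in are illustrative assumptions, not the actual DecodingCUDAGraphRunner API.

```python
import torch

# Hypothetical static buffers; the real runner keeps its own preallocated inputs.
static_input = torch.zeros(8, 4096, device="cuda")
static_lora_weights = torch.zeros(8, 64, device="cuda")  # stand-in for lora_params tensors


def forward(x, lora_w):
    # Placeholder for a model forward that consumes LoRA weights.
    return x @ x.t() + lora_w.sum()


# Warm up on a side stream before capture (as recommended by the PyTorch docs).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        forward(static_input, static_lora_weights)
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = forward(static_input, static_lora_weights)

# Replay: copy new per-request data into the captured buffers, then replay.
static_input.copy_(torch.randn(8, 4096, device="cuda"))
static_lora_weights.copy_(torch.randn(8, 64, device="cuda"))
graph.replay()
print(static_output.shape)
```

The key point, which the review comments below return to repeatedly, is that only in-place updates to the captured buffers are visible on replay.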
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers
Actionable comments posted: 15
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
tensorrt_llm/executor/worker.py (1)
166-170: Wire C++ PeftCacheManager into LoraManager.
- In worker.py, immediately after `self._lora_manager = self.engine.get_lora_manager()`, invoke `self._lora_manager.set_cpp_peft_cache_manager(self.engine.resource_manager.get_cpp_peft_cache_manager())`.
- In `LoraManager.__init__`, change `self._cpp_peft_cache_manager: Optional[...] = None` to `self._cpp_peft_cache_manager = cpp_peft_cache_manager` so the constructor-passed cache manager is retained.
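A minimal sketch of the intended wiring, assuming the attribute and method names quoted above; the surrounding class structure here is illustrative, not the actual files:

```python
from typing import Any, Optional


class LoraManager:
    def __init__(self, cpp_peft_cache_manager: Optional[Any] = None):
        # Retain the constructor-passed cache manager instead of resetting it to None.
        self._cpp_peft_cache_manager = cpp_peft_cache_manager

    def set_cpp_peft_cache_manager(self, cpp_peft_cache_manager: Any) -> None:
        """Inject the C++ PEFT cache manager used for CPU-cache lookups."""
        self._cpp_peft_cache_manager = cpp_peft_cache_manager


# In the worker, after obtaining the engine-owned manager (names as quoted in the comment above):
# self._lora_manager = self.engine.get_lora_manager()
# self._lora_manager.set_cpp_peft_cache_manager(
#     self.engine.resource_manager.get_cpp_peft_cache_manager())
```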
tests/unittest/llmapi/test_llm_pytorch.py (1)

883-923: Batch test: add guards, determinism, and resource management.

-def test_lora_graph_multiple_requests():
+@force_ampere
+@skip_gpu_memory_less_than_40gb
+def test_lora_graph_multiple_requests():
@@
-    llm = LLM(
-        model=f"{llm_models_root()}/llama-models/llama-7b-hf",
-        lora_config=lora_config,
-        # cuda_graph_config=None)
-        cuda_graph_config=CudaGraphConfig(max_batch_size=2))
+    with LLM(
+            model=f"{llm_models_root()}/llama-models/llama-7b-hf",
+            lora_config=lora_config,
+            cuda_graph_config=CudaGraphConfig(max_batch_size=2)) as llm:
@@
-    sampling_params = SamplingParams(max_tokens=20)
+    sampling_params = SamplingParams(max_tokens=20, temperature=0.0)
@@
-    outputs = llm.generate(prompts, sampling_params, lora_request=lora_requests)
+    outputs = llm.generate(prompts, sampling_params, lora_request=lora_requests)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)

430-486: Handle None entries and tighten typing for lora_request.
- Current code assumes lora_request[i] is non-None; guard against None to avoid AttributeError.
- Use a broader Sequence[Any] type to clarify expectations and avoid TODO.

-    def add_dummy_requests(
+    def add_dummy_requests(
         self,
         request_ids: List[int],
@@
-        max_beam_width: int = 1,
-        lora_request: Optional[List] = None,  # TODO smor fill type hint
+        max_beam_width: int = 1,
+        lora_request: Optional[Sequence[Any]] = None,
     ):
@@
-            if lora_request is not None and i < len(lora_request):
-                lora_task_id = lora_request[i].task_id
-                lora_weights = lora_request[i].weights
-                lora_config = lora_request[i].config
+            if lora_request is not None and i < len(lora_request):
+                req_i = lora_request[i]
+                if req_i is not None:
+                    lora_task_id = req_i.task_id
+                    lora_weights = req_i.weights
+                    lora_config = req_i.config

Add imports at file top if not present:

from typing import Any, Sequence
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)

576-885: Call LoRA setup/prefetch in warmup to guarantee availability.

Warmup relies on has_lora_prefetched but nothing here triggers set_lora_manager/prefetch. Consider invoking them if a prefetch list is present.

def get_cuda_graph_warmup_request(batch_size, draft_len):

Upfront in warmup (after kv_cache_manager/spec_resource_manager retrieval):

+        # Ensure LoRA manager and prefetch are ready before graph capture.
+        try:
+            self.set_lora_manager(resource_manager)
+            if not self.has_lora_prefetched and self.lora_prefetch_requests_list:
+                self.prefetch_lora_dirs()
+        except Exception:
+            logger.exception("LoRA initialization/prefetch failed; proceeding without LoRA in warmup.")
1685-1713: Ensure lora_params shape stability under CUDA graph
CUDA graphs require fixed tensor shapes across replays; `_get_lora_params_from_requests` currently builds a dynamic layout per batch (varying (layer_id, module_id) keys), which will break graph captures when the union of adapters changes. Capture the full superset of adapters once (during warmup) and pad missing entries with zeros at runtime.
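A minimal sketch of the fixed-layout idea: freeze the (layer_id, module_id) key set at warmup and zero-pad slots whose adapter is absent in the current batch. The dictionary layout and tensor shapes below are illustrative assumptions, not the actual `_get_lora_params_from_requests` output.

```python
import torch


def build_fixed_lora_layout(all_keys, max_rank, hidden, device="cuda"):
    # Preallocate one weight slot per (layer_id, module_id) seen at warmup.
    return {key: torch.zeros(max_rank, hidden, device=device) for key in all_keys}


def fill_lora_params(layout, batch_adapters):
    # Zero everything, then copy in place only the adapters active in this batch,
    # so tensor identities and shapes stay stable across CUDA graph replays.
    for key, buf in layout.items():
        buf.zero_()
        weights = batch_adapters.get(key)
        if weights is not None:
            buf[: weights.shape[0]].copy_(weights)
    return layout


# Hypothetical usage: layout captured once at warmup, refilled per batch (CPU here for illustration).
layout = build_fixed_lora_layout([(0, 0), (0, 1), (1, 0)], max_rank=8, hidden=16, device="cpu")
fill_lora_params(layout, {(0, 1): torch.ones(4, 16)})
```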
🧹 Nitpick comments (13)
tensorrt_llm/lora_helper.py (1)
17-17: Tighten typing and avoid importing Any when a precise type can be used.
`Any` is only needed for `lora_request`. Prefer a forward-declared protocol/type to document the contract (e.g., "LoRARequest") and use a sequence type. Apply:

-from typing import Any, Dict, List, Optional
+from typing import Dict, List, Optional, Sequence

tensorrt_llm/lora_manager.py (2)
705-711: Add a concise docstring and validate setter input.
`set_cpp_peft_cache_manager` is public; document purpose and add a lightweight type check to fail early on incorrect injection. Apply:

 def set_cpp_peft_cache_manager(
     self, cpp_peft_cache_manager: tb_internal.batch_manager.PeftCacheManager
 ):
-    self._cpp_peft_cache_manager = cpp_peft_cache_manager
+    """Inject CPP PEFT cache manager used by is_adapter_in_cpu_cache()."""
+    if not hasattr(cpp_peft_cache_manager, "is_task_cached"):
+        raise TypeError("cpp_peft_cache_manager must expose is_task_cached(adapter_uid: int) -> bool")
+    self._cpp_peft_cache_manager = cpp_peft_cache_manager
712-722: Document known race and make return explicitly boolean.

Return a strict bool and reference the known CPU-cache race so call sites treat it as a best-effort optimization. Apply:

-    return (
-        self._cpp_peft_cache_manager.is_task_cached(adapter_uid)
-        if self._cpp_peft_cache_manager
-        else False
-    )
+    # Best-effort: known race in CPU cache presence checks; callers must handle false positives/negatives.
+    return bool(self._cpp_peft_cache_manager and self._cpp_peft_cache_manager.is_task_cached(adapter_uid))

Also consider adding a short comment referencing the limitation so it isn't regressed.
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)
21-22: Add NVIDIA copyright header for 2025. This Python file lacks the required header per repo guidelines.
Add at file top (outside these hunks):
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Also applies to: 58-58, 79-80, 113-115
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
220-222: Guard LoRA prefetch on manager availability. Call prefetch only when a LoRA manager is set to avoid surprises if the PEFT cache is not present.

Apply:

-        self.model_engine.set_lora_manager(self.resource_manager)
-        self.model_engine.prefetch_lora_dirs()
+        self.model_engine.set_lora_manager(self.resource_manager)
+        if getattr(self.model_engine, "lora_manager", None) is not None:
+            self.model_engine.prefetch_lora_dirs()
1-1: Add NVIDIA copyright header. All source files must prepend the NVIDIA copyright header for 2025.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
30-30: Prefer module-level import to preserve namespace. Guideline says “preserve module namespaces.” Import the module and qualify symbols.

-from tensorrt_llm.lora_helper import LoraConfig, LoraManager
+import tensorrt_llm.lora_helper as lora_helper

You’ll also need to update annotations/usages:
- lora_config: Optional[lora_helper.LoraConfig]
- self.lora_manager: Optional[lora_helper.LoraManager]
852-852: Remove debug comment. “SMOR suppose to fail here, verify” should not be in production.

-    torch.cuda.synchronize()  # SMOR suppose to fail here, verify
+    torch.cuda.synchronize()
1324-1326: New parameters lack doc coverage. Add brief doc entries for cache_indirection_buffer and lora_params to explain expectations and shape constraints during CUDA graph replay.
2220-2241: _prepare_inputs signature extended; propagate doc and ensure None-safe defaults. Add a docstring for lora_params and confirm callers always pass None or a stable-layout dict; keep cache_indirection_buffer optional.
2101-2217: DoRA flag handling may be misleading. You retain is_dora based on scaling_vec_pointer == 0. Prior flow removed DoRA usage; keep it but don’t convert to tensors (good), and ensure downstream ignores it to avoid overhead.
If any consumer uses is_dora in PyTorch flow, please point to it; otherwise we should drop it from the packed payload for clarity.
275-277: Type annotations and Python 3.8 compatibility. If you adopt module-level import for lora_helper, also update the init arg annotation to lora_helper.LoraConfig. Additionally, builtin generics like list[str] require Python 3.9+ unless using `from __future__ import annotations`. If this file must run on 3.8, consider typing.List or enabling future annotations at the top.

+from __future__ import annotations

or migrate to typing.List/typing.Dict.
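For illustration, a small self-contained example of the two options mentioned; the class and field names are placeholders, not the actual project code:

```python
from __future__ import annotations  # PEP 563: annotations stay unevaluated, so list[str] is accepted on 3.8

from typing import Dict, List, Optional


class LoraModelConfigExample:  # placeholder name for illustration
    # Option A: builtin generics are fine in annotations once __future__.annotations is active.
    lora_target_modules: list[str]

    # Option B: typing generics work on 3.8 with or without the future import.
    trtllm_modules_to_hf_modules: Optional[Dict[str, str]] = None

    def __init__(self, modules: List[str]) -> None:
        self.lora_target_modules = modules
        self.trtllm_modules_to_hf_modules = {}
```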
488-506: Add basic unit tests for LoRA+CUDA graph paths.
- Warmup with prefetch on and off
- Mixed context+generation batch
- Replay after changing adapter set (ensure fixed layout)
I can draft pytest cases targeting tests/unittest/llmapi/test_llm_pytorch.py to cover these scenarios.
Also applies to: 592-609, 1082-1122, 2220-2241, 2284-2294
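Following up on the list above, a rough pytest sketch for the warmup-with-and-without-CUDA-graph case. Import paths and constructor arguments are assumptions based on the symbols cited elsewhere in this review (LLM, SamplingParams, LoRARequest, CudaGraphConfig, LoraConfig), and the model/adapter paths are placeholders:

```python
import pytest

from tensorrt_llm.executor.request import LoRARequest
from tensorrt_llm.llmapi import LLM, SamplingParams
from tensorrt_llm.llmapi.llm_args import CudaGraphConfig
from tensorrt_llm.lora_helper import LoraConfig

LLAMA_DIR = "<path to llama-7b-hf>"    # placeholder; the real tests use llm_models_root()
LORA_DIR = "<path to a LoRA adapter>"  # placeholder


@pytest.mark.parametrize("use_cuda_graph", [True, False])
def test_lora_warmup_with_and_without_cuda_graph(use_cuda_graph):
    lora_config = LoraConfig(lora_dir=[LORA_DIR], max_lora_rank=8)
    cuda_graph_config = CudaGraphConfig(max_batch_size=2) if use_cuda_graph else None

    with LLM(model=LLAMA_DIR,
             lora_config=lora_config,
             cuda_graph_config=cuda_graph_config) as llm:
        outputs = llm.generate(
            ["What is the capital city of France?"],
            SamplingParams(max_tokens=20, temperature=0.0),
            lora_request=[LoRARequest("task-0", 0, LORA_DIR)])

    # Both configurations should produce one completion per prompt.
    assert len(outputs) == 1
```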
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (10)
- .devcontainer/devcontainer.json (1 hunks)
- .devcontainer/docker-compose.yml (1 hunks)
- tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (4 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (14 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (3 hunks)
- tensorrt_llm/executor/worker.py (1 hunks)
- tensorrt_llm/lora_helper.py (2 hunks)
- tensorrt_llm/lora_manager.py (2 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Code must target Python 3.8+
Indent with 4 spaces; do not use tabs
Preserve module namespaces in imports: import the subpackage/module, not the symbol (from package.subpackage import foo; foo.SomeClass())
Naming: files snake_case; classes PascalCase; functions/methods snake_case; local variables snake_case (k_ prefix if starting with a number); globals G_ + UPPER_SNAKE_CASE; constants UPPER_SNAKE_CASE
Avoid shadowing outer-scope variables; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; reserve comments for function-internal or file-local interfaces
Use Google-style docstrings for classes and functions; inline docstrings for attributes/variables are allowed
Avoid reflection when straightforward code suffices (e.g., prefer explicit parameters over dict(**locals()))
Use narrow except clauses (e.g., catch FileNotFoundError instead of bare except)
For duck-typing try/except, keep try body minimal and use else for the main logic
Files:
tensorrt_llm/executor/worker.py, tensorrt_llm/_torch/pyexecutor/py_executor.py, tensorrt_llm/lora_manager.py, tensorrt_llm/lora_helper.py, tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py, tests/unittest/llmapi/test_llm_pytorch.py, tensorrt_llm/_torch/pyexecutor/resource_manager.py, tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header with current year to all source files
Files:
tensorrt_llm/executor/worker.py, tensorrt_llm/_torch/pyexecutor/py_executor.py, tensorrt_llm/lora_manager.py, tensorrt_llm/lora_helper.py, tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py, tests/unittest/llmapi/test_llm_pytorch.py, tensorrt_llm/_torch/pyexecutor/resource_manager.py, tensorrt_llm/_torch/pyexecutor/model_engine.py
🧠 Learnings (9)
📓 Common learnings
Learnt from: shaharmor98
PR: NVIDIA/TensorRT-LLM#7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.
📚 Learning: 2025-08-26T06:07:02.166Z
Learnt from: shaharmor98
PR: NVIDIA/TensorRT-LLM#7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.
Applied to files:
tensorrt_llm/executor/worker.py, tensorrt_llm/_torch/pyexecutor/py_executor.py, tensorrt_llm/_torch/pyexecutor/model_engine.py
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/executor/worker.py, tensorrt_llm/lora_manager.py, tensorrt_llm/_torch/pyexecutor/model_engine.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.
Applied to files:
tensorrt_llm/executor/worker.py, tensorrt_llm/_torch/pyexecutor/model_engine.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/unittest/llmapi/test_llm_pytorch.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
PR: NVIDIA/TensorRT-LLM#7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.
Applied to files:
tests/unittest/llmapi/test_llm_pytorch.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
PR: NVIDIA/TensorRT-LLM#7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.
Applied to files:
tests/unittest/llmapi/test_llm_pytorch.py
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
PR: NVIDIA/TensorRT-LLM#6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.
Applied to files:
tests/unittest/llmapi/test_llm_pytorch.py
📚 Learning: 2025-08-19T12:45:35.429Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:2086-2092
Timestamp: 2025-08-19T12:45:35.429Z
Learning: DoRA (Delta Orthogonal Rank Adaptation) functionality has been removed from the PyTorch flow in tensorrt_llm/_torch/pyexecutor/model_engine.py. The is_dora field is computed but not used downstream in the PyTorch flow, so converting it to a tensor would be wasteful overhead.
Applied to files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
🧬 Code graph analysis (6)
tensorrt_llm/executor/worker.py (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
get_lora_manager(301-302)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
get_lora_manager(1091-1093)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
set_lora_manager(488-492), prefetch_lora_dirs(494-505)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
get_lora_manager(1091-1093)
tensorrt_llm/lora_manager.py (2)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
PeftCacheManager(1018-1126)
cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp (1)
PeftCacheManager(231-255)
tensorrt_llm/lora_helper.py (1)
tensorrt_llm/_torch/models/modeling_phi4mm.py (1)
lora_request(635-656)
tests/unittest/llmapi/test_llm_pytorch.py (4)
tensorrt_llm/llmapi/llm_args.py (1)
CudaGraphConfig(106-163)
tensorrt_llm/executor/request.py (1)
LoRARequest(24-53)
tensorrt_llm/lora_helper.py (1)
LoraConfig(83-102)
tests/unittest/utils/util.py (1)
similar(371-373)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (4)
tensorrt_llm/_torch/models/modeling_phi4mm.py (2)
lora_request(635-656), lora_config(612-632)
tensorrt_llm/lora_manager.py (1)
lora_weights(1157-1158)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
LlmRequest(271-411)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
get_lora_manager(301-302)
🪛 Biome (2.1.2)
.devcontainer/devcontainer.json
[error] 16-16: unexpected character #
(parse)
[error] 16-16: String values must be double quoted.
(parse)
[error] 16-16: Minus must be followed by a digit
(parse)
[error] 16-16: String values must be double quoted.
(parse)
[error] 16-16: String values must be double quoted.
(parse)
🔇 Additional comments (8)
.devcontainer/docker-compose.yml (1)
26-26: Confirm impact of removing the HF cache volume. Commenting out the HuggingFace cache mount can increase cold-start latency and container disk usage (HF_HOME still points to /huggingface). If this was intentional for CI isolation, consider documenting it; otherwise, restore the mount.
tensorrt_llm/lora_manager.py (1)
11-11: Use Python 3.8-compatible typing in tensorrt_llm/lora_manager.py.
- LoraModelConfig: change

  lora_target_modules: list[str]
  trtllm_modules_to_hf_modules: dict[str, str]

  to

  from typing import List, Dict
  lora_target_modules: List[str]
  trtllm_modules_to_hf_modules: Dict[str, str]

- Constructor signature: replace

  cpp_peft_cache_manager: PeftCacheManager | None = None

  with

  from typing import Optional
  cpp_peft_cache_manager: Optional[PeftCacheManager] = None

- Ensure List, Dict, and Optional are imported from typing at the top.

⛔ Skipped due to learnings
Learnt from: shaharmor98
PR: NVIDIA/TensorRT-LLM#7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.

Learnt from: ixlmar
PR: NVIDIA/TensorRT-LLM#7294
File: tensorrt_llm/_torch/modules/rms_norm.py:17-17
Timestamp: 2025-08-27T14:23:55.545Z
Learning: The TensorRT-LLM project requires Python 3.10+ as evidenced by the use of TypeAlias from typing module, match/case statements, and union type | syntax throughout the codebase, despite some documentation still mentioning Python 3.8+.

tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)
58-58: Storing lora_params on the instance is fine, but only if they are preallocated tensors. Ensure callers pass preallocated device tensors; otherwise, changing LoRA per run won’t affect the captured graph.
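A minimal sketch of the preallocated-buffer pattern being asked for here; the names are illustrative, as the real runner stores its own input and lora_params buffers:

```python
import torch


class GraphLoraBuffers:
    """Holds device tensors captured by the graph; updated in place before each replay."""

    def __init__(self, num_slots: int, rank: int, hidden: int):
        # Allocated once, before capture; the CUDA graph records these exact storages.
        self.lora_weights = torch.zeros(num_slots, rank, hidden, device="cuda")

    def update(self, new_weights: torch.Tensor) -> None:
        # copy_ keeps the same storage, so the captured graph sees the new values on replay.
        self.lora_weights.copy_(new_weights)


# Rebinding (self.lora_weights = new_tensor) instead of copy_ would leave the graph
# replaying against the old, now-orphaned buffer.
```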
tensorrt_llm/executor/worker.py (1)
166-170: Good switch to Engine-owned LoRA manager. Using self.engine.get_lora_manager() under the PyTorch backend keeps ownership localized to the engine and removes CPP coupling. Looks correct given the new PyExecutor API.
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
301-303: Accessor is fine. The get_lora_manager passthrough is straightforward and matches the worker’s usage.
tests/unittest/llmapi/test_llm_pytorch.py (1)
8-8: Import change LGTM. Importing CudaGraphConfig (and PeftCacheConfig from the same module) aligns with where these types live.
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
1120-1122: DecodingCUDAGraphRunner supports lora_params in init and run/capture
The constructor signature includes lora_params: Optional[dict]; it is assigned to self.lora_params, included in the metadata dict, and updated in the capture path. No further changes required.
619-633: No changes needed: add_dummy_requests already supports lora_request parameter.
Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
Details
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
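For example, a typical invocation combining several of the documented flags (values illustrative) would be posted as a PR comment:

/bot run --disable-fail-fast --stage-list "A10-PyTorch-1" --test-backend "pytorch"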
kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.