NOT MEANT FOR REVIEW YET [TRTLLM-6121] TRTLLM Sampler PP support #6415
Conversation
Note: Reviews paused. Use the following commands to manage reviews:
📝 Walkthrough
This update systematically renames configuration parameters, method arguments, and internal variables from `enable_trtllm_sampler` to `use_torch_sampler` across the codebase.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI/User
    participant LLM Args Parser
    participant LLM Config
    participant LLM Constructor
    participant Sampler Instantiator
    CLI/User->>LLM Args Parser: Pass --use_torch_sampler flag
    LLM Args Parser->>LLM Config: Set use_torch_sampler in config
    LLM Config->>LLM Constructor: Pass use_torch_sampler param
    LLM Constructor->>Sampler Instantiator: Instantiate sampler (Torch or TRTLLM)
    Sampler Instantiator-->>LLM Constructor: Return sampler instance
```
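A rough Python sketch of the selection step the diagram ends with, using hypothetical stand-ins for the real TensorRT-LLM classes and config (only the flag name `use_torch_sampler` comes from the PR):

```python
from dataclasses import dataclass


class TorchSampler:      # stand-in for the PyTorch-based sampler
    pass


class TRTLLMSampler:     # stand-in for the C++-backed TRTLLM sampler
    pass


@dataclass
class PyTorchConfig:     # hypothetical slice of the backend config
    use_torch_sampler: bool = False   # default keeps the TRTLLM sampler


def instantiate_sampler(config: PyTorchConfig):
    # True selects the Torch sampler; False (the default) keeps the TRTLLM sampler.
    return TorchSampler() if config.use_torch_sampler else TRTLLMSampler()


print(type(instantiate_sampler(PyTorchConfig(use_torch_sampler=True))).__name__)  # TorchSampler
```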
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20–25 minutes

Suggested reviewers
/bot run --only-multi-gpu-test
Actionable comments posted: 1
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/_util.py (1)
589-589: Fix line length violations for better readability. Static analysis indicates lines exceed the 120-character limit. Consider breaking these long lines for better code readability.
For line 589:
```diff
-    if pytorch_backend_config.use_torch_sampler or pytorch_backend_config.enable_mixed_sampler or engine.spec_config is not None:
+    if (pytorch_backend_config.use_torch_sampler or
+            pytorch_backend_config.enable_mixed_sampler or
+            engine.spec_config is not None):
```

For line 659:

```diff
-    "Model is built with 'explicit draft tokens' decoding, but decoding mode is something else. Overwriting decoding mode."
+    "Model is built with 'explicit draft tokens' decoding, but decoding mode is "
+    "something else. Overwriting decoding mode."
```

Also applies to: 659-659
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (19)
- examples/llm-api/quickstart_advanced.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (6 hunks)
- tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/sampler.py (3 hunks)
- tensorrt_llm/llmapi/llm_args.py (2 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (3 hunks)
- tests/integration/defs/disaggregated/test_configs/disagg_config_ngram.yaml (1 hunks)
- tests/integration/defs/disaggregated/test_configs/disagg_config_trtllm_sampler.yaml (2 hunks)
- tests/integration/defs/test_e2e.py (1 hunks)
- tests/integration/test_lists/waives.txt (0 hunks)
- tests/unittest/_torch/modeling/test_modeling_nemotron_h.py (1 hunks)
- tests/unittest/_torch/speculative/test_draft_target.py (1 hunks)
- tests/unittest/_torch/speculative/test_eagle3.py (1 hunks)
- tests/unittest/_torch/test_beam_search.py (0 hunks)
- tests/unittest/_torch/test_overlap_scheduler.py (4 hunks)
- tests/unittest/_torch/test_return_logits.py (4 hunks)
- tests/unittest/_torch/test_trtllm_sampler.py (0 hunks)
- tests/unittest/api_stability/references/llm.yaml (1 hunks)
- tests/unittest/llmapi/apps/_test_openai_misc.py (2 hunks)
💤 Files with no reviewable changes (3)
- tests/unittest/_torch/test_beam_search.py
- tests/unittest/_torch/test_trtllm_sampler.py
- tests/integration/test_lists/waives.txt
🧰 Additional context used
🧠 Learnings (1)
tensorrt_llm/_torch/pyexecutor/_util.py (1)
Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
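A hedged sketch of the pattern that note describes; the two method names are taken from the note itself, while the request object and its fields are assumptions:

```python
def prepare_multimodal_data(request, is_context_phase: bool):
    """Recover shared tensors once (context phase), then only strip for generation."""
    if is_context_phase:
        # Context phase: recover tensor data from the shared-tensor handle.
        request.multimodal_data = request.multimodal_data.from_shared_tensor()
    else:
        # Generation phase: tensors were already recovered during context, so just
        # drop the multimodal payload that generation no longer needs.
        request.multimodal_data = request.multimodal_data.strip_for_generation()
    return request
```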
🧬 Code Graph Analysis (2)
tests/unittest/_torch/test_overlap_scheduler.py (2)
- tests/unittest/_torch/test_trtllm_sampler.py (3)
  - create_llm (25-38)
  - model_path (21-22)
  - test_case (15-17)
- tests/integration/defs/disaggregated/test_disaggregated_single_gpu.py (1)
  - model_path (31-44)
tensorrt_llm/_torch/pyexecutor/_util.py (3)
- tensorrt_llm/_torch/pyexecutor/sampler.py (3)
  - TorchSampler (208-457)
  - EarlyStopSampler (70-97)
  - TRTLLMSampler (486-956)
- tensorrt_llm/logger.py (1)
  - warning (131-132)
- cpp/include/tensorrt_llm/executor/types.h (3)
  - DecodingMode (532-574)
  - DecodingMode (813-816)
  - Eagle (582-591)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/_util.py
589-589: Line too long (129 > 120)
(E501)
659-659: Line too long (131 > 120)
(E501)
🔇 Additional comments (27)
tests/integration/defs/test_e2e.py (2)
2030-2030: LGTM: Test data update reflects model behavior change. The keyword change from "clouds" to "waves" appears to be a legitimate update to match the actual output behavior of the qwen2.5-vl-7b-instruct model.
2032-2035: LGTM: Improved traffic keyword expectations. The updated keywords improve the test by:
- Removing the duplicate "traffic" entry
- Replacing specific terms like "bus" and "police" with more general traffic concepts
- Using more accurate descriptors like "lanes", "congestion", and "road"
This appears to reflect improved model behavior that produces more general traffic descriptions.
tests/integration/defs/disaggregated/test_configs/disagg_config_trtllm_sampler.yaml (2)
14-14: LGTM: Standardized sampler configuration parameter. The change from `enable_trtllm_sampler: True` to `use_torch_sampler: False` maintains the same behavior (using the TRTLLM sampler) while aligning with the codebase-wide standardization effort.
30-30: LGTM: Consistent sampler configuration across server types. The generation_servers section correctly uses the same standardized `use_torch_sampler: False` parameter, maintaining consistency with the context_servers configuration.
tests/unittest/_torch/speculative/test_draft_target.py (1)
44-44: LGTM: Explicit sampler configuration improves test determinism. Adding `use_torch_sampler=True` to the common configuration ensures both the speculative and reference LLM instances use the same sampler, making the test behavior explicit and deterministic.
tests/integration/defs/disaggregated/test_configs/disagg_config_ngram.yaml (1)
15-15: LGTM: Standardized sampler configuration for NGram test. The addition of `use_torch_sampler: True` to the context_servers configuration follows the codebase standardization effort and explicitly specifies the sampler choice for this test scenario.
tests/unittest/_torch/speculative/test_eagle3.py (1)
63-63: LGTM: Consistent sampler configuration for Eagle3 test. Adding `use_torch_sampler=True` to the common configuration ensures both speculative and reference LLM instances use the same sampler, providing consistent and deterministic test behavior across all Eagle3 test parameter combinations.
tests/unittest/api_stability/references/llm.yaml (1)
106-109: LGTM: Parameter renaming aligns with standardization effort. The renaming from `enable_trtllm_sampler` to `use_torch_sampler` correctly reflects the new semantics, and the status promotion from `prototype` to `beta` indicates the feature is becoming more stable.
tests/unittest/_torch/modeling/test_modeling_nemotron_h.py (1)
44-44: LGTM: Semantic equivalence maintained. The change from `enable_trtllm_sampler=True` to `use_torch_sampler=False` correctly maintains the same behavior (using the TRTLLM sampler) while adopting the new parameter naming convention.
tensorrt_llm/_torch/pyexecutor/config.py (1)
57-60: LGTM: Clean parameter renaming with updated documentation. The field renaming from `enable_trtllm_sampler` to `use_torch_sampler` is semantically correct, and the updated docstring clearly explains the new behavior. The default value of `False` maintains backward compatibility by defaulting to the TRTLLM sampler.
examples/llm-api/quickstart_advanced.py (2)
57-59: LGTM: Command-line argument updated consistently. The argument parser correctly updates from `--enable_trtllm_sampler` to `--use_torch_sampler` while maintaining the same default behavior.
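A minimal sketch of how such a flag is typically declared with argparse and forwarded; the constructor call and helper names here are illustrative, not the exact quickstart code:

```python
import argparse


def parse_args():
    parser = argparse.ArgumentParser(description="sampler flag sketch")
    # Absent -> False (keep the TRTLLM sampler); passed -> True (use the Torch sampler).
    parser.add_argument(
        "--use_torch_sampler",
        action="store_true",
        default=False,
        help="Use the Torch sampler instead of the TRTLLM sampler.",
    )
    return parser.parse_args()


def build_llm_kwargs(args):
    # Hypothetical forwarding step mirroring the reviewed change: the parsed flag
    # is passed straight through to the LLM constructor keyword of the same name.
    return {"use_torch_sampler": args.use_torch_sampler}


if __name__ == "__main__":
    print(build_llm_kwargs(parse_args()))
```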
208-208: LGTM: LLM constructor parameter updated consistently. The LLM constructor call correctly uses the new `use_torch_sampler` parameter, maintaining consistency with the updated command-line argument.
tests/integration/defs/accuracy/test_llm_api_pytorch.py (3)
189-189: LGTM! Parameter renaming aligns with codebase refactoring. The change from `enable_trtllm_sampler=True` to `use_torch_sampler=True` is consistent with the broader parameter standardization effort described in the PR objectives.
222-222: LGTM! Parameter renaming maintains test intent. The change from `enable_trtllm_sampler=False` to `use_torch_sampler=False` correctly maintains the original test behavior while aligning with the parameter standardization.
648-649: LGTM! Reasonable test configuration addition. The addition of the `max_batch_size=64` parameter aligns with similar configurations used in other tests and likely optimizes the test setup for the Qwen3-8B model.
tests/unittest/llmapi/apps/_test_openai_misc.py (2)
15-15: LGTM! Model update for test optimization. The change to use "Qwen3/Qwen3-0.6B-Base" instead of the TinyLlama model is a reasonable test configuration update, likely providing better performance or compatibility for the test suite.
28-32: LGTM! Well-documented parameter adjustment. The `max_seq_len` update to "32768" correctly aligns with the new model's max_position_embeddings. The added comment provides valuable context for future maintainers.
tests/unittest/_torch/test_return_logits.py (2)
19-44: LGTM! Consistent parameter renaming with correct logic adaptation. The changes from `enable_trtllm_sampler` to `use_torch_sampler` are well-executed:
- Parameter names updated in decorator, function signature, and LLM instantiation
- Conditional logic correctly inverted from `if not enable_trtllm_sampler` to `if use_torch_sampler` (see the sketch after this list)
- Test behavior remains consistent with the new parameter semantics
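A small self-contained sketch of why the inversion preserves behavior under the renamed flag (branch names are illustrative):

```python
def branch_old(enable_trtllm_sampler: bool) -> str:
    # Old test logic: the special-case path ran whenever the TRTLLM sampler was NOT enabled.
    return "torch-path" if not enable_trtllm_sampler else "trtllm-path"


def branch_new(use_torch_sampler: bool) -> str:
    # New test logic: the same path, expressed directly in terms of the renamed flag.
    return "torch-path" if use_torch_sampler else "trtllm-path"


# The rename flips the flag's meaning (use_torch_sampler == not enable_trtllm_sampler),
# so both versions pick the same branch for every configuration.
for enable_trtllm in (True, False):
    assert branch_old(enable_trtllm) == branch_new(not enable_trtllm)
print("branches match")
```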
86-111: LGTM! Consistent parameter updates in async test function. The parameter renaming in the async test function follows the same correct pattern as the synchronous version, maintaining test consistency while aligning with the codebase refactoring.
tests/unittest/_torch/test_overlap_scheduler.py (1)
24-77: LGTM! Comprehensive and consistent parameter refactoring. All aspects of the parameter renaming from `enable_trtllm_sampler` to `use_torch_sampler` are correctly implemented:
- Function signature and dictionary key updated consistently
- Pytest parameterization and test function parameter updated
- Logic inversion for the `stop_words` setting correctly maintains original behavior
- All function calls updated with new parameter name
The changes maintain test functionality while aligning with the broader codebase standardization effort.
tensorrt_llm/llmapi/llm_args.py (2)
1898-1902: LGTM! Improved field naming and documentation. The renaming from `enable_trtllm_sampler` to `use_torch_sampler` with updated semantics makes the configuration more intuitive. The description clearly indicates that `True` means using the Torch sampler instead of the TRTLLM sampler, and the status upgrade to "beta" reflects increased stability.
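A minimal sketch of what such a field could look like, assuming a Pydantic-style model; the class name, the status metadata, and the exact description text are assumptions, not the real llm_args.py declaration:

```python
from pydantic import BaseModel, Field


class SamplerArgs(BaseModel):
    # Hypothetical stand-in for the LLM argument model.
    use_torch_sampler: bool = Field(
        default=False,
        description="If True, use the Torch sampler instead of the TRTLLM sampler.",
        # The review mentions a status promotion to "beta"; recording it as extra
        # schema metadata is an assumption made for this sketch.
        json_schema_extra={"status": "beta"},
    )


print(SamplerArgs().use_torch_sampler)  # False: the TRTLLM sampler stays the default
```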
2198-2198: LGTM! Consistent parameter name update. The parameter name change from `enable_trtllm_sampler` to `use_torch_sampler` in the `PyTorchConfig` constructor call correctly reflects the field rename and maintains consistency across the codebase.
tensorrt_llm/_torch/pyexecutor/sampler.py (2)
481-482: LGTM! Well-documented optional field enhancement. Making `finalize_events` optional with a clear docstring explanation improves the flexibility of `SampleStateTRTLLM` creation, especially for the `_forward_step_inter_pp` use case mentioned.
894-894: LGTM! Proper defensive null check. The addition of `finalize_events is not None` prevents potential errors when `finalize_events` is `None`, which aligns well with the earlier change making this field optional.
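A simplified sketch of the reviewed pattern; the real `SampleStateTRTLLM` carries CUDA events and more fields, so the types here are placeholders:

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SampleStateTRTLLM:
    # Optional so callers such as _forward_step_inter_pp can build a state
    # without finalize events; None means "nothing to finalize later".
    finalize_events: Optional[Dict[str, object]] = None


def finalize(state: SampleStateTRTLLM) -> None:
    # Defensive check mirroring the reviewed change: only iterate the events
    # when they were actually provided.
    if state.finalize_events is not None:
        for name, event in state.finalize_events.items():
            print(f"finalizing {name}: {event}")


finalize(SampleStateTRTLLM())                               # safe no-op
finalize(SampleStateTRTLLM(finalize_events={"beam": 0}))    # finalizing beam: 0
```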
tensorrt_llm/_torch/pyexecutor/_util.py (3)
589-590: LGTM! Flag renaming aligns with PR objectives. The change from `enable_trtllm_sampler` to `use_torch_sampler` correctly reflects the standardized configuration parameter naming throughout the codebase. The broadened condition appropriately selects `TorchSampler` in more scenarios.
594-596: LGTM! Correct fallback sampler for generation models. The fallback to `TRTLLMSampler` with proper parameters aligns with the reversed semantics of the `use_torch_sampler` flag. When the flag is `False` (default), using `TRTLLMSampler` is the appropriate behavior.
618-619: LGTM! Consistent defensive attribute access pattern. The systematic replacement of direct attribute access with `getattr(executor_config.speculative_config, "attribute_name", False)` improves robustness by preventing `AttributeError` when attributes are missing. The consistent use of `False` as the default value across all cases maintains logical coherence.
Also applies to: 626-627, 637-638, 645-646, 656-657, 664-666, 676-677, 684-685, 695-696
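A short illustration of the `getattr` pattern the comment endorses; the attribute names below are made up for the example:

```python
from types import SimpleNamespace

# A speculative_config that defines some flags but not others.
speculative_config = SimpleNamespace(spec_dec_mode="eagle")

# Direct access would raise AttributeError for a missing attribute:
#   speculative_config.use_draft_tokens  ->  AttributeError
# The defensive form falls back to False instead, so downstream boolean logic
# keeps working for configs that predate the newer attributes.
use_draft_tokens = getattr(speculative_config, "use_draft_tokens", False)
spec_dec_mode = getattr(speculative_config, "spec_dec_mode", False)

print(use_draft_tokens, spec_dec_mode)  # False eagle
```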
PR_Github #13213 [ run ] triggered by Bot
PR_Github #13213 [ run ] completed with state
/bot run --only-multi-gpu-test
PR_Github #13215 [ run ] triggered by Bot
PR_Github #13215 [ run ] completed with state
/bot run --only-multi-gpu-test
PR_Github #13359 [ run ] triggered by Bot
PR_Github #13359 [ run ] completed with state
/bot run --only-multi-gpu-test --disable-fail-fast
PR_Github #13381 [ run ] triggered by Bot
PR_Github #13381 [ run ] completed with state
/bot run --only-multi-gpu-test --disable-fail-fast
Signed-off-by: Daniel Campora <[email protected]>
…ition_embeddings. Signed-off-by: Daniel Campora <[email protected]>
Signed-off-by: Daniel Campora <[email protected]>
Signed-off-by: Daniel Campora <[email protected]>
Signed-off-by: Daniel Campora <[email protected]>
Signed-off-by: Daniel Campora <[email protected]>
Signed-off-by: Daniel Campora <[email protected]>
Signed-off-by: Daniel Campora <[email protected]>
Signed-off-by: Daniel Campora <[email protected]>
Signed-off-by: Daniel Campora <[email protected]>
…coder classes - Updated the parameter names and related comments in the DecoderState and GptDecoder classes to reflect the change from maxBatchSize to maxNumSequences. - Adjustments were made in the setup methods, member variables, and associated bindings in the Python interface. - This change improves clarity regarding the number of sequences being processed. Signed-off-by: Robin Kobus <[email protected]>
`Optional` to accommodate `_forward_step_inter_pp` which creates a `SampleState` without `finalize_events` Signed-off-by: Netanel Haber <[email protected]>
Signed-off-by: Netanel Haber <[email protected]> something Signed-off-by: Netanel Haber <[email protected]>
Signed-off-by: Daniel Campora <[email protected]>
Signed-off-by: Daniel Campora <[email protected]>
Signed-off-by: Netanel Haber <[email protected]>
400138a to dacf557
/bot run --only-multi-gpu-test --disable-fail-fast
PR_Github #14185 [ run ] triggered by Bot
Signed-off-by: Netanel Haber <[email protected]>
/bot run --only-multi-gpu-test --disable-fail-fast
PR_Github #14189 [ run ] triggered by Bot
PR_Github #14185 [ run ] completed with state
PR_Github #14189 [ run ] completed with state
Signed-off-by: Netanel Haber <[email protected]>
…p-support Signed-off-by: Netanel Haber <[email protected]>
/bot run --only-multi-gpu-test --disable-fail-fast
PR_Github #14267 [ run ] triggered by Bot
PR_Github #14267 [ run ] completed with state
THIS IS A DRAFT PR AND IS NOT CURRENTLY MEANT FOR REVIEW.
It branches off @dcampora's PR, which actually adds these changes, so please review there. I opened it against TRTLLM main so I could run the CI.
How to get this branch: