
Conversation

@mikeiovine (Collaborator) commented Dec 10, 2025

Description

Implement sampling on 1-model EAGLE. The core sampling implementation is taken from #6245 (thanks to @pathorn and @Shang-Pin from the DeepInfra team for the original implementation), though much of it has been reimplemented here. Some key notes:

  • Draft tokens are still sampled greedily for performance and ease of implementation. This does not appear to have much impact on the acceptance rate (AR).
  • Rejection sampling will have to be implemented in a follow-up. Some performance is left on the table right now because the AR is a bit lower without it.
  • In theory this should port easily to MTP. I've left that as a follow-up for now to limit the testing scope of this PR.
  • The feature is opt-in, since the CUDA-graph-compatible sampling kernel is slower than plain argmax for greedy requests and I didn't want to regress any existing use cases. A minimal usage sketch follows.
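
A minimal opt-in sketch with the LLM API. The allow_advanced_sampling flag is the one added by this PR; the other EagleDecodingConfig fields, paths, and values are assumptions for illustration only.

from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import EagleDecodingConfig

# Paths and most field values below are placeholders;
# allow_advanced_sampling is the opt-in flag introduced by this PR.
spec_config = EagleDecodingConfig(
    max_draft_len=3,
    speculative_model_dir="/models/eagle3-draft",  # assumed path
    eagle3_one_model=True,                         # 1-model EAGLE3 path
    allow_advanced_sampling=True,
)

llm = LLM(model="/models/target-model", speculative_config=spec_config)

# With the flag enabled, non-greedy sampling params take effect on the 1-model path.
outputs = llm.generate(
    ["Explain speculative decoding in one sentence."],
    SamplingParams(temperature=0.8, top_p=0.95, top_k=40, max_tokens=64),
)
print(outputs[0].outputs[0].text)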

Test Coverage

TestGPTOSS already has accuracy tests with non-greedy sampling params. I also tested GPQA (79%) and AIME25 (91%) at high reasoning effort; both match the reference numbers from OpenAI.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for advanced sampling (temperature, top-k, top-p) during speculative decoding via a new --allow_advanced_sampling flag and configuration option for one-model speculative decoding paths.
  • Tests

    • Updated speculative decoding tests to verify advanced sampling functionality.


@mikeiovine (Collaborator Author)

/bot run --disable-fail-fast

@mikeiovine (Collaborator Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #27752 [ run ] triggered by Bot. Commit: 343862c

@mikeiovine marked this pull request as ready for review December 10, 2025 22:26
@mikeiovine requested review from a team as code owners December 10, 2025 22:27
@tensorrt-cicd (Collaborator)

PR_Github #27754 [ run ] triggered by Bot. Commit: 343862c

@coderabbitai (Contributor) commented Dec 10, 2025

📝 Walkthrough

The PR adds advanced (non-greedy) sampling support to Eagle3 one-model speculative decoding via a new allow_advanced_sampling flag. It introduces PyTorch-native sampling utilities and integrates per-request temperature/top-k/top-p parameters throughout the speculative decoding pipeline, falling back to greedy sampling when the flag is disabled.
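
As an illustration of what CUDA-graph-friendly masking of this kind typically looks like (fixed tensor shapes, no host synchronization), here is a rough sketch. The function names echo the walkthrough, but the bodies are assumptions, not the PR's actual code.

import torch

def apply_temperature(logits: torch.Tensor, temperatures: torch.Tensor) -> torch.Tensor:
    # temperatures: [batch], broadcast over the vocab dimension.
    # Greedy requests can simply carry the neutral value 1.0.
    return logits / temperatures.unsqueeze(-1)

def apply_top_k_top_p(logits: torch.Tensor, top_k: torch.Tensor, top_p: torch.Tensor) -> torch.Tensor:
    # Sort once, then build masks with fixed-shape tensor ops only
    # (CUDA-graph friendly: no .item(), no data-dependent branching).
    logits_sort, logits_idx = logits.sort(dim=-1, descending=False)
    vocab = logits.shape[-1]

    # Top-k: keep only the k largest entries per row (k may differ per request).
    ranks = torch.arange(vocab, device=logits.device).expand_as(logits_sort)
    top_k_mask = ranks < (vocab - top_k.unsqueeze(-1))
    logits_sort = logits_sort.masked_fill(top_k_mask, float("-inf"))

    # Top-p: drop the low-probability tail whose cumulative mass is <= 1 - p.
    probs_sort = logits_sort.softmax(dim=-1)
    probs_cum = probs_sort.cumsum(dim=-1)
    top_p_mask = probs_cum <= (1.0 - top_p.unsqueeze(-1))
    top_p_mask[..., -1] = False  # always keep the most probable token
    logits_sort = logits_sort.masked_fill(top_p_mask, float("-inf"))

    # Undo the sort so the masked logits are back in vocab order.
    return torch.empty_like(logits_sort).scatter_(dim=-1, index=logits_idx, src=logits_sort)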

Changes

  • Configuration & CLI (examples/llm-api/quickstart_advanced.py, tensorrt_llm/llmapi/llm_args.py): Added an allow_advanced_sampling boolean field to DecodingBaseConfig and a new CLI flag --allow_advanced_sampling to enable advanced sampling mode in speculative decoding configurations.
  • Sampling Implementation (tensorrt_llm/_torch/speculative/one_model_sampler.py): New module providing PyTorch-native sampling utilities (forward_native, random_sample, apply_temperature, apply_top_k_top_p) and the public entry point sampling_batch_spec_dec_one_model for CUDA-graph-compatible batch sampling with temperature and top-k/top-p masking.
  • Metadata & Interface (tensorrt_llm/_torch/speculative/interface.py, tensorrt_llm/_torch/speculative/utils.py): Extended the SpecMetadata dataclass with new fields (allow_advanced_sampling, temperatures, top_ks, top_ps) and added a populate_sampling_params_for_one_model() method to derive per-request sampling parameters from requests. Updated Eagle3OneModelSpecMetadata construction to pass allow_advanced_sampling from the config.
  • Eagle3 Integration (tensorrt_llm/_torch/speculative/eagle3.py): Added a _sample_tokens_for_batch() method to conditionally apply advanced or greedy sampling for draft tokens based on spec metadata settings; refactored sample_and_accept_draft_tokens() to use the new sampling path.
  • Engine & Executor (tensorrt_llm/_torch/pyexecutor/model_engine.py, tensorrt_llm/_torch/pyexecutor/py_executor_creator.py): Updated the model engine to import and handle Eagle3OneModelSpecMetadata, calling populate_sampling_params_for_one_model() on scheduled requests. Added runtime warnings in executor creation when advanced sampling constraints are detected (greedy fallback or MTP mode unsupported).
  • Tests (tests/integration/defs/accuracy/test_llm_api_pytorch.py): Updated the Eagle3 4-GPU test configuration to instantiate EagleDecodingConfig with allow_advanced_sampling=True.

Sequence Diagram

sequenceDiagram
    participant CLI as CLI/Config
    participant Engine as Model Engine
    participant Eagle3 as Eagle3 Worker
    participant Sampler as Sampling Module
    participant Device as GPU Device

    CLI->>Engine: create with allow_advanced_sampling=True
    Engine->>Engine: populate_sampling_params_for_one_model()<br/>(derive per-request temps, top-k, top-p)
    Note over Engine: Create SpecMetadata with sampling tensors
    
    Eagle3->>Eagle3: process_batch()
    Eagle3->>Eagle3: check allow_advanced_sampling flag
    
    alt Advanced Sampling Enabled
        Eagle3->>Sampler: _sample_tokens_for_batch()<br/>(logits, temps, top_k, top_p)
        Sampler->>Device: apply_temperature()<br/>apply_top_k_top_p()
        Device->>Sampler: masked logits
        Sampler->>Device: forward_native()<br/>(softmax + random_sample)
        Device->>Sampler: sampled tokens
        Sampler->>Eagle3: return sampled tokens
    else Greedy Sampling (fallback)
        Eagle3->>Eagle3: argmax(logits)
        Eagle3->>Eagle3: return greedy tokens
    end
    
    Eagle3->>Eagle3: sample_and_accept_draft_tokens()
    Note over Eagle3: Use sampled tokens for draft acceptance

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20–25 minutes

  • Areas requiring extra attention:
    • tensorrt_llm/_torch/speculative/one_model_sampler.py — New sampling implementation with in-place tensor operations, sorting-based masking, and the random_sample function using the exponential trick; verify numerical stability and CUDA graph compatibility.
    • tensorrt_llm/_torch/speculative/interface.py — New method populate_sampling_params_for_one_model() with tensor initialization, determinism setup, and per-request parameter accumulation; validate tensor shape handling and edge cases.
    • tensorrt_llm/_torch/speculative/eagle3.py — Integration of _sample_tokens_for_batch() into the draft token sampling path; confirm correct branching logic based on allow_advanced_sampling.
    • tensorrt_llm/_torch/pyexecutor/model_engine.py — Type checking and control flow for handling both Eagle3SpecMetadata and Eagle3OneModelSpecMetadata variants.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 50.00%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title check (✅ Passed): The PR title clearly and specifically describes the main change: implementing sampling functionality for 1-model EAGLE3 speculative decoding.
  • Description check (✅ Passed): The PR description includes all key template sections: a clear description of the feature with important implementation notes, explicit test coverage details (TestGPTOSS tests and manual validation results), and a completed PR checklist confirming adherence to guidelines.

@coderabbitai (Contributor) bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (4)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)

4293-4296: Clarify whether allow_advanced_sampling should be gated on one_model

Here allow_advanced_sampling=True is set for both 1‑model and 2‑model Eagle3 (one_model is parametrized). Today the flag is only consumed by the one‑model path, so this is probably a no‑op for the 2‑model case, but it could be confusing or start toggling behavior if 2‑model support is added later.

Consider either:

  • Passing allow_advanced_sampling=one_model, or
  • Adding a short comment that advanced sampling is only effective for the one‑model configuration.
tensorrt_llm/_torch/speculative/utils.py (1)

1-5: Consider adding the standard NVIDIA SPDX header

This file doesn’t have the NVIDIA copyright/SPDX header required by the coding guidelines for *.py sources. If you touch this file again, it may be worth adding the standard header at the top for consistency.

tensorrt_llm/_torch/speculative/one_model_sampler.py (1)

63-65: Minor: redundant src parameter.

The src=logits_sort is redundant since logits_sort is already the tensor being scattered. While functionally correct, it's clearer without the redundant argument.

     # Re-sort the probabilities.
-    logits = logits_sort.scatter(dim=-1, index=logits_idx, src=logits_sort)
+    logits = torch.empty_like(logits_sort).scatter_(dim=-1, index=logits_idx, src=logits_sort)
     return logits

Alternatively, keep as-is since scatter returns a new tensor anyway and the current form is valid.
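
For context, a tiny standalone check of the re-sorting step being discussed (illustrative only, not the file's code): scattering the sorted values back through the sort indices restores the original vocab order.

import torch

logits = torch.tensor([[0.1, 2.0, -1.0, 0.5]])
logits_sort, logits_idx = logits.sort(dim=-1, descending=True)
# scatter places src[i][j] at position index[i][j], i.e. back at its original slot.
restored = torch.empty_like(logits_sort).scatter_(dim=-1, index=logits_idx, src=logits_sort)
assert torch.equal(restored, logits)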

tensorrt_llm/_torch/speculative/interface.py (1)

275-276: Consider adding forward reference import for type checking.

The static analyzer flags LlmRequest as undefined because it's used as a forward reference in the type hint. While the runtime import inside the method works correctly, adding a TYPE_CHECKING import would satisfy static analysis tools.

Add at the top of the file with other imports:

from typing import TYPE_CHECKING, List, Optional, Type

if TYPE_CHECKING:
    from ..pyexecutor.llm_request import LlmRequest

Then update the method signature (no quotes needed):

     def populate_sampling_params_for_one_model(
-            self, requests: list["LlmRequest"]) -> None:
+            self, requests: list[LlmRequest]) -> None:
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2c0293c and 343862c.

📒 Files selected for processing (9)
  • examples/llm-api/quickstart_advanced.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1 hunks)
  • tensorrt_llm/_torch/speculative/eagle3.py (4 hunks)
  • tensorrt_llm/_torch/speculative/interface.py (2 hunks)
  • tensorrt_llm/_torch/speculative/one_model_sampler.py (1 hunks)
  • tensorrt_llm/_torch/speculative/utils.py (1 hunks)
  • tensorrt_llm/llmapi/llm_args.py (1 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used (e.g., use from package.subpackage import foo and then foo.SomeClass() instead of from package.subpackage.foo import SomeClass)
Python filenames should use snake_case (e.g., some_file.py)
Python class names should use PascalCase (e.g., class SomeClass)
Python function and method names should use snake_case (e.g., def my_awesome_function():)
Python local variable names should use snake_case, with prefix k for variable names that start with a number (e.g., k_99th_percentile = ...)
Python global variables should use upper snake_case with prefix G (e.g., G_MY_GLOBAL = ...)
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description (e.g., self.x = 5 followed by """<type>: Description of 'x'""" )
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of specific errors possible instead of catching all exceptions
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block to implement the logic

Files:

  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/_torch/speculative/utils.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tensorrt_llm/_torch/speculative/one_model_sampler.py
  • tensorrt_llm/_torch/speculative/eagle3.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/speculative/interface.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  • examples/llm-api/quickstart_advanced.py
**/*.{cpp,h,cu,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header that includes the current year at the top

Files:

  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/_torch/speculative/utils.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tensorrt_llm/_torch/speculative/one_model_sampler.py
  • tensorrt_llm/_torch/speculative/eagle3.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/speculative/interface.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  • examples/llm-api/quickstart_advanced.py
🧠 Learnings (7)
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.

Applied to files:

  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/_torch/speculative/interface.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.

Applied to files:

  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
📚 Learning: 2025-08-27T15:03:57.149Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:368-392
Timestamp: 2025-08-27T15:03:57.149Z
Learning: In TensorRT-LLM's sampler.py, int32 usage for softmax_indices and related tensor indexing is intentional and should not be changed to int64. The torch.IntTensor type hint is correct for the sample() function's softmax_indices parameter.

Applied to files:

  • tensorrt_llm/_torch/speculative/one_model_sampler.py
📚 Learning: 2025-08-28T10:25:22.370Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:887-891
Timestamp: 2025-08-28T10:25:22.370Z
Learning: In tensorrt_llm/_torch/pyexecutor/sampler.py, the draft_probs and target_probs tensors have shapes [1, steps] not [steps, vocab_size] as might be expected, making the .squeeze(0) operations appropriate for removing the batch dimension of size 1.

Applied to files:

  • tensorrt_llm/_torch/speculative/one_model_sampler.py
📚 Learning: 2025-08-28T10:22:02.288Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:1191-1197
Timestamp: 2025-08-28T10:22:02.288Z
Learning: In tensorrt_llm/_torch/pyexecutor/sampler.py, the object identity comparison `softmax_req_indices is not group_req_indices_cuda` on line ~1191 is intentional and used as an optimization to determine whether to reuse an existing indexer or create a new one, based on which code path was taken during tensor assignment.

Applied to files:

  • tensorrt_llm/_torch/speculative/one_model_sampler.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
🧬 Code graph analysis (5)
tensorrt_llm/_torch/speculative/one_model_sampler.py (1)
tensorrt_llm/functional.py (2)
  • softmax (2639-2667)
  • argmax (3303-3347)
tensorrt_llm/_torch/speculative/eagle3.py (1)
tensorrt_llm/_torch/speculative/one_model_sampler.py (1)
  • sampling_batch_spec_dec_one_model (76-91)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
tensorrt_llm/_torch/speculative/eagle3.py (1)
  • Eagle3OneModelSpecMetadata (280-352)
tensorrt_llm/_torch/speculative/interface.py (1)
  • populate_sampling_params_for_one_model (275-353)
tensorrt_llm/_torch/speculative/interface.py (2)
cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1)
  • LlmRequestState (47-210)
tensorrt_llm/sampling_params.py (1)
  • params_imply_greedy_decoding (339-351)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (2)
tensorrt_llm/llmapi/llm_args.py (4)
  • spec_dec_mode (731-738)
  • spec_dec_mode (873-878)
  • spec_dec_mode (927-930)
  • spec_dec_mode (1059-1066)
tensorrt_llm/_torch/speculative/interface.py (2)
  • use_one_engine (64-65)
  • is_mtp_one_model (49-50)
🪛 Ruff (0.14.8)
tensorrt_llm/_torch/speculative/interface.py

276-276: Undefined name LlmRequest

(F821)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (16)
examples/llm-api/quickstart_advanced.py (2)

141-143: LGTM! Well-structured CLI flag addition.

The new --allow_advanced_sampling flag follows standard argparse patterns and correctly defaults to False for opt-in behavior.


211-212: LGTM! Correct scope for Eagle3 configuration.

The flag is appropriately passed only to EagleDecodingConfig. The MTP branch correctly omits this parameter since advanced sampling for MTP is not yet supported (as noted in the runtime warnings added to py_executor_creator.py).

tensorrt_llm/llmapi/llm_args.py (1)

617-619: LGTM! Clear documentation and appropriate default.

The new allow_advanced_sampling field is well-documented with clear scope (1-model paths only). The default value of False ensures the feature is opt-in, avoiding regressions for existing use cases as mentioned in the PR description.

tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)

284-293: Clear warnings added for one-engine speculative decoding modes.

The logic correctly handles two scenarios:

  1. When advanced sampling is disabled: informs users they can enable it
  2. When it's enabled for MTP one-model: warns that MTP support is incomplete

The warning-only approach (no runtime enforcement) allows for testing partial implementations, which aligns with the PR description stating the feature is "portable to MTP in theory" but with work deferred.

tensorrt_llm/_torch/speculative/utils.py (1)

69-80: Propagating allow_advanced_sampling into one‑model metadata looks correct

Wiring allow_advanced_sampling=spec_config.allow_advanced_sampling into Eagle3OneModelSpecMetadata cleanly exposes the flag to the one‑model Eagle3 path without affecting other speculative modes.

tensorrt_llm/_torch/pyexecutor/model_engine.py (2)

51-52: LGTM!

The import follows the coding guidelines by maintaining namespace when importing (from ..speculative.eagle3 import), and the addition of Eagle3OneModelSpecMetadata properly supports the new one-model sampling path.


2093-2095: LGTM!

The new branch correctly handles Eagle3OneModelSpecMetadata by populating per-request sampling parameters. The isinstance check ensures the method is only called for the appropriate metadata type, and scheduled_requests.all_requests() correctly includes both context and generation requests as expected by the implementation.

tensorrt_llm/_torch/speculative/eagle3.py (4)

17-17: LGTM!

The import follows the project's namespace import pattern and brings in the required sampling function for the new advanced sampling path.


497-529: LGTM!

The _sample_tokens_for_batch method cleanly implements the conditional sampling logic:

  • Uses per-request sampling parameters when allow_advanced_sampling is enabled
  • Falls back to efficient argmax for greedy decoding
  • The token count calculation num_contexts + num_gens * (max_draft_len + 1) correctly matches the layout of sampling parameters populated in populate_sampling_params_for_one_model
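
A hedged sketch of how a branch like this could look. The method signature, the slicing, and the sampler call's argument order are assumptions based on the names mentioned in this review, not the PR's exact code.

import torch
from tensorrt_llm._torch.speculative.one_model_sampler import sampling_batch_spec_dec_one_model

def _sample_tokens_for_batch(logits, spec_metadata, num_contexts, num_gens, max_draft_len):
    # Context requests contribute 1 token each; generation requests contribute
    # 1 + max_draft_len tokens each (the accepted position plus draft positions).
    num_tokens = num_contexts + num_gens * (max_draft_len + 1)
    if spec_metadata.allow_advanced_sampling:
        # Argument order here is assumed for illustration.
        return sampling_batch_spec_dec_one_model(
            logits[:num_tokens],
            spec_metadata.temperatures[:num_tokens],
            spec_metadata.top_ks[:num_tokens],
            spec_metadata.top_ps[:num_tokens],
        )
    # Greedy fallback: plain argmax is cheaper than the CUDA-graph-compatible kernel.
    return torch.argmax(logits[:num_tokens], dim=-1)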

552-554: LGTM!

The integration correctly delegates token sampling to the new method, passing the appropriate parameters from the local scope.


596-598: LGTM!

Good documentation explaining the design decision. Using greedy sampling for draft tokens is a reasonable trade-off for performance and implementation simplicity, especially since the PR description notes negligible impact on acceptance rate.

tensorrt_llm/_torch/speculative/one_model_sampler.py (1)

21-30: LGTM!

Good implementation using the Gumbel-max trick to avoid CPU-GPU synchronization from torch.multinomial. The approach argmax(prob / exponential) is mathematically equivalent to sampling from the categorical distribution.
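
For reference, the exponential-race form of that trick fits in a few lines. This is a generic sketch, not necessarily the exact body of random_sample in this PR.

import torch

def random_sample(probs: torch.Tensor) -> torch.Tensor:
    # Divide each probability by an independent Exp(1) draw and take the argmax:
    # P(argmax == i) equals probs[i] / probs.sum(), so this samples from the
    # categorical distribution without torch.multinomial and without any
    # host-device synchronization.
    q = torch.empty_like(probs).exponential_(1.0)
    return probs.div(q).argmax(dim=-1)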

tensorrt_llm/_torch/speculative/interface.py (4)

232-237: LGTM!

The new fields for advanced sampling are well-typed with appropriate defaults. The allow_advanced_sampling flag provides the opt-in mechanism described in the PR objectives.


287-289: Verify seed placement for multi-iteration scenarios.

The torch.manual_seed(0) is set only when self.temperatures is None (first call). This ensures deterministic sampling initialization, but subsequent calls won't reset the seed. Verify this is the intended behavior - if the same batch position should produce reproducible results across iterations, the seed might need to be set more frequently.


301-327: LGTM!

The per-request parameter extraction and greedy detection logic is correct:

  • Properly extracts first element from sampling config arrays
  • Uses SamplingParams.params_imply_greedy_decoding for consistent greedy determination
  • Applies appropriate disable values for greedy requests
  • Token count correctly differentiates context (1) vs generation (1 + max_draft_len) requests

329-353: LGTM!

The tensor initialization and async copy pattern is efficient:

  • Lazy allocation on first use
  • Pre-allocated tensors sized for max capacity
  • Uses pinned memory for CPU tensors enabling async copies
  • Non-blocking copies to overlap with computation
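
A generic sketch of that staging pattern; buffer names, sizes, and dtypes here are illustrative, not the actual code in interface.py.

import torch

# One-time lazy allocation: a pinned host staging buffer plus a device buffer,
# both sized for the maximum number of tokens the metadata may ever hold.
MAX_TOKENS = 8192
host_temperatures = torch.ones(MAX_TOKENS, dtype=torch.float32, pin_memory=True)
device_temperatures = torch.ones(MAX_TOKENS, dtype=torch.float32, device="cuda")

def update_temperatures(per_request_temps):
    # Write into the pinned buffer on the CPU, then issue a non-blocking H2D copy
    # so the transfer can overlap with other work on the stream.
    n = len(per_request_temps)
    host_temperatures[:n] = torch.tensor(per_request_temps, dtype=torch.float32)
    device_temperatures[:n].copy_(host_temperatures[:n], non_blocking=True)

update_temperatures([0.8, 1.0, 0.7])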

@jhaotingc (Collaborator)

cc @nvxuanyuc for review

@jhaotingc (Collaborator)

Hi @mikeiovine, thanks for the work!

Would you mind elaborating on what specifically has been re-implemented from #6245? Is there any perf or feature improvement over #6245? #6245 should support (1) sampling only on the target, (2) no rejection sampling, (3) MTP, and (4) CUDA graphs.

Thanks a lot!

@tensorrt-cicd (Collaborator)

PR_Github #27754 [ run ] completed with state SUCCESS. Commit: 343862c
/LLM/main/L0_MergeRequest_PR pipeline #21181 completed with status: 'ABORTED'

@mikeiovine (Collaborator Author)

@jhaotingc The only thing missing from that list is MTP support, which I will add in a follow-up shortly after this one is merged. I also dropped the min_p support since it is not required right now.

@tensorrt-cicd (Collaborator)

PR_Github #27887 [ run ] completed with state FAILURE. Commit: 0b4c288
/LLM/main/L0_MergeRequest_PR pipeline #21294 completed with status: 'FAILURE'

@mikeiovine (Collaborator Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #27904 [ run ] triggered by Bot. Commit: 0b4c288

@mikeiovine (Collaborator Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #27915 [ run ] triggered by Bot. Commit: 0b4c288

@tensorrt-cicd (Collaborator)

PR_Github #27915 [ run ] completed with state SUCCESS. Commit: 0b4c288
/LLM/main/L0_MergeRequest_PR pipeline #21315 completed with status: 'FAILURE'

@mikeiovine (Collaborator Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #28046 [ run ] triggered by Bot. Commit: 0b4c288

@tensorrt-cicd (Collaborator)

PR_Github #28046 [ run ] completed with state SUCCESS. Commit: 0b4c288
/LLM/main/L0_MergeRequest_PR pipeline #21419 completed with status: 'FAILURE'

@mikeiovine (Collaborator Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #28078 [ run ] triggered by Bot. Commit: 0b4c288

@tensorrt-cicd (Collaborator)

PR_Github #28078 [ run ] completed with state SUCCESS. Commit: 0b4c288
/LLM/main/L0_MergeRequest_PR pipeline #21450 completed with status: 'SUCCESS'

mikeiovine and others added 2 commits December 13, 2025 10:10
@mikeiovine (Collaborator Author)

/bot --reuse-pipeline


@mikeiovine enabled auto-merge (squash) December 13, 2025 15:15
@mikeiovine (Collaborator Author)

/bot reuse-pipeline

@tensorrt-cicd (Collaborator)

PR_Github #28112 [ reuse-pipeline ] triggered by Bot. Commit: 434df69

@tensorrt-cicd (Collaborator)

PR_Github #28112 [ reuse-pipeline ] completed with state SUCCESS. Commit: 434df69
Reusing PR_Github #28078 for commit 434df69

@mikeiovine merged commit 383b13e into NVIDIA:main Dec 13, 2025
5 checks passed
@mikeiovine deleted the 1-model-sampling branch December 13, 2025 16:39
sherry-1001 pushed a commit to sherry-1001/TensorRT-LLM that referenced this pull request Dec 16, 2025
codego7250 pushed a commit to codego7250/TensorRT-LLM that referenced this pull request Dec 19, 2025