
Conversation

@mikeiovine (Collaborator) commented Dec 10, 2025

Description

Implement sampling on 1-model EAGLE. The core sampling implementation is taken from #6245 (thanks to @pathorn and @Shang-Pin from the DeepInfra team for the original implementation), though much of it has been reimplemented here. Some key notes:

  • Draft tokens are still sampled greedily for performance and ease of implementation. This does not appear to have much impact on the acceptance rate (AR).
  • Rejection sampling will have to be implemented in a follow-up. Some performance is left on the table right now because the AR is a bit lower without it.
  • In theory this should port easily to MTP. I've left that as a follow-up for now to limit the testing scope of this PR.
  • The feature is opt-in, since the CUDA-graph-compatible sampling kernel is slower than plain argmax for greedy requests and I didn't want to regress any existing use cases. A minimal usage sketch follows.
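
A minimal opt-in sketch with the LLM API. The allow_advanced_sampling flag is the one added by this PR; the other EagleDecodingConfig fields, paths, and values are assumptions for illustration only.

from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import EagleDecodingConfig

# Paths and most field values below are placeholders;
# allow_advanced_sampling is the opt-in flag introduced by this PR.
spec_config = EagleDecodingConfig(
    max_draft_len=3,
    speculative_model_dir="/models/eagle3-draft",  # assumed path
    eagle3_one_model=True,                         # 1-model EAGLE3 path
    allow_advanced_sampling=True,
)

llm = LLM(model="/models/target-model", speculative_config=spec_config)

# With the flag enabled, non-greedy sampling params take effect on the 1-model path.
outputs = llm.generate(
    ["Explain speculative decoding in one sentence."],
    SamplingParams(temperature=0.8, top_p=0.95, top_k=40, max_tokens=64),
)
print(outputs[0].outputs[0].text)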

Test Coverage

TestGPTOSS already has accuracy tests with non-greedy sampling params. I also tested GPQA (79%) and AIME25 (91%) at high reasoning effort; both match the reference numbers from OpenAI.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for advanced sampling (temperature, top-k, top-p) during speculative decoding via a new --allow_advanced_sampling flag and configuration option for one-model speculative decoding paths.
  • Tests

    • Updated speculative decoding tests to verify advanced sampling functionality.


@mikeiovine (Collaborator Author)

/bot run --disable-fail-fast

@mikeiovine (Collaborator Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #27752 [ run ] triggered by Bot. Commit: 343862c

@mikeiovine marked this pull request as ready for review December 10, 2025 22:26
@mikeiovine requested review from a team as code owners December 10, 2025 22:27
@tensorrt-cicd (Collaborator)

PR_Github #27754 [ run ] triggered by Bot. Commit: 343862c

@coderabbitai (Contributor) commented Dec 10, 2025

📝 Walkthrough

The PR adds advanced (non-greedy) sampling support to Eagle3 one-model speculative decoding via a new allow_advanced_sampling flag. It introduces PyTorch-native sampling utilities and integrates per-request temperature/top-k/top-p parameters throughout the speculative decoding pipeline, falling back to greedy sampling when the flag is disabled.
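
As an illustration of what CUDA-graph-friendly masking of this kind typically looks like (fixed tensor shapes, no host synchronization), here is a rough sketch. The function names echo the walkthrough, but the bodies are assumptions, not the PR's actual code.

import torch

def apply_temperature(logits: torch.Tensor, temperatures: torch.Tensor) -> torch.Tensor:
    # temperatures: [batch], broadcast over the vocab dimension.
    # Greedy requests can simply carry the neutral value 1.0.
    return logits / temperatures.unsqueeze(-1)

def apply_top_k_top_p(logits: torch.Tensor, top_k: torch.Tensor, top_p: torch.Tensor) -> torch.Tensor:
    # Sort once, then build masks with fixed-shape tensor ops only
    # (CUDA-graph friendly: no .item(), no data-dependent branching).
    logits_sort, logits_idx = logits.sort(dim=-1, descending=False)
    vocab = logits.shape[-1]

    # Top-k: keep only the k largest entries per row (k may differ per request).
    ranks = torch.arange(vocab, device=logits.device).expand_as(logits_sort)
    top_k_mask = ranks < (vocab - top_k.unsqueeze(-1))
    logits_sort = logits_sort.masked_fill(top_k_mask, float("-inf"))

    # Top-p: drop the low-probability tail whose cumulative mass is <= 1 - p.
    probs_sort = logits_sort.softmax(dim=-1)
    probs_cum = probs_sort.cumsum(dim=-1)
    top_p_mask = probs_cum <= (1.0 - top_p.unsqueeze(-1))
    top_p_mask[..., -1] = False  # always keep the most probable token
    logits_sort = logits_sort.masked_fill(top_p_mask, float("-inf"))

    # Undo the sort so the masked logits are back in vocab order.
    return torch.empty_like(logits_sort).scatter_(dim=-1, index=logits_idx, src=logits_sort)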

Changes

  • Configuration & CLI (examples/llm-api/quickstart_advanced.py, tensorrt_llm/llmapi/llm_args.py): Added an allow_advanced_sampling boolean field to DecodingBaseConfig and a new CLI flag --allow_advanced_sampling to enable advanced sampling mode in speculative decoding configurations.
  • Sampling Implementation (tensorrt_llm/_torch/speculative/one_model_sampler.py): New module providing PyTorch-native sampling utilities (forward_native, random_sample, apply_temperature, apply_top_k_top_p) and the public entry point sampling_batch_spec_dec_one_model for CUDA-graph-compatible batch sampling with temperature and top-k/top-p masking.
  • Metadata & Interface (tensorrt_llm/_torch/speculative/interface.py, tensorrt_llm/_torch/speculative/utils.py): Extended the SpecMetadata dataclass with new fields (allow_advanced_sampling, temperatures, top_ks, top_ps) and added a populate_sampling_params_for_one_model() method to derive per-request sampling parameters from requests. Updated Eagle3OneModelSpecMetadata construction to pass allow_advanced_sampling from the config.
  • Eagle3 Integration (tensorrt_llm/_torch/speculative/eagle3.py): Added a _sample_tokens_for_batch() method to conditionally apply advanced or greedy sampling for draft tokens based on spec metadata settings; refactored sample_and_accept_draft_tokens() to use the new sampling path.
  • Engine & Executor (tensorrt_llm/_torch/pyexecutor/model_engine.py, tensorrt_llm/_torch/pyexecutor/py_executor_creator.py): Updated the model engine to import and handle Eagle3OneModelSpecMetadata, calling populate_sampling_params_for_one_model() on scheduled requests. Added runtime warnings in executor creation when advanced sampling constraints are detected (greedy fallback or MTP mode unsupported).
  • Tests (tests/integration/defs/accuracy/test_llm_api_pytorch.py): Updated the Eagle3 4-GPU test configuration to instantiate EagleDecodingConfig with allow_advanced_sampling=True.

Sequence Diagram

sequenceDiagram
    participant CLI as CLI/Config
    participant Engine as Model Engine
    participant Eagle3 as Eagle3 Worker
    participant Sampler as Sampling Module
    participant Device as GPU Device

    CLI->>Engine: create with allow_advanced_sampling=True
    Engine->>Engine: populate_sampling_params_for_one_model()<br/>(derive per-request temps, top-k, top-p)
    Note over Engine: Create SpecMetadata with sampling tensors
    
    Eagle3->>Eagle3: process_batch()
    Eagle3->>Eagle3: check allow_advanced_sampling flag
    
    alt Advanced Sampling Enabled
        Eagle3->>Sampler: _sample_tokens_for_batch()<br/>(logits, temps, top_k, top_p)
        Sampler->>Device: apply_temperature()<br/>apply_top_k_top_p()
        Device->>Sampler: masked logits
        Sampler->>Device: forward_native()<br/>(softmax + random_sample)
        Device->>Sampler: sampled tokens
        Sampler->>Eagle3: return sampled tokens
    else Greedy Sampling (fallback)
        Eagle3->>Eagle3: argmax(logits)
        Eagle3->>Eagle3: return greedy tokens
    end
    
    Eagle3->>Eagle3: sample_and_accept_draft_tokens()
    Note over Eagle3: Use sampled tokens for draft acceptance

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20–25 minutes

  • Areas requiring extra attention:
    • tensorrt_llm/_torch/speculative/one_model_sampler.py — New sampling implementation with in-place tensor operations, sorting-based masking, and the random_sample function using the exponential trick; verify numerical stability and CUDA graph compatibility.
    • tensorrt_llm/_torch/speculative/interface.py — New method populate_sampling_params_for_one_model() with tensor initialization, determinism setup, and per-request parameter accumulation; validate tensor shape handling and edge cases.
    • tensorrt_llm/_torch/speculative/eagle3.py — Integration of _sample_tokens_for_batch() into the draft token sampling path; confirm correct branching logic based on allow_advanced_sampling.
    • tensorrt_llm/_torch/pyexecutor/model_engine.py — Type checking and control flow for handling both Eagle3SpecMetadata and Eagle3OneModelSpecMetadata variants.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 50.00%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title check (✅ Passed): The PR title clearly and specifically describes the main change: implementing sampling functionality for 1-model EAGLE3 speculative decoding.
  • Description check (✅ Passed): The PR description includes all key template sections: a clear description of the feature with important implementation notes, explicit test coverage details (TestGPTOSS tests and manual validation results), and a completed PR checklist confirming adherence to guidelines.

@coderabbitai (Contributor) bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (4)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)

4293-4296: Clarify whether allow_advanced_sampling should be gated on one_model

Here allow_advanced_sampling=True is set for both 1‑model and 2‑model Eagle3 (one_model is parametrized). Today the flag is only consumed by the one‑model path, so this is probably a no‑op for the 2‑model case, but it could be confusing or start toggling behavior if 2‑model support is added later.

Consider either:

  • Passing allow_advanced_sampling=one_model, or
  • Adding a short comment that advanced sampling is only effective for the one‑model configuration.
tensorrt_llm/_torch/speculative/utils.py (1)

1-5: Consider adding the standard NVIDIA SPDX header

This file doesn’t have the NVIDIA copyright/SPDX header required by the coding guidelines for *.py sources. If you touch this file again, it may be worth adding the standard header at the top for consistency.

tensorrt_llm/_torch/speculative/one_model_sampler.py (1)

63-65: Minor: redundant src parameter.

The src=logits_sort is redundant since logits_sort is already the tensor being scattered. While functionally correct, it's clearer without the redundant argument.

     # Re-sort the probabilities.
-    logits = logits_sort.scatter(dim=-1, index=logits_idx, src=logits_sort)
+    logits = torch.empty_like(logits_sort).scatter_(dim=-1, index=logits_idx, src=logits_sort)
     return logits

Alternatively, keep as-is since scatter returns a new tensor anyway and the current form is valid.
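
For context, a tiny standalone check of the re-sorting step being discussed (illustrative only, not the file's code): scattering the sorted values back through the sort indices restores the original vocab order.

import torch

logits = torch.tensor([[0.1, 2.0, -1.0, 0.5]])
logits_sort, logits_idx = logits.sort(dim=-1, descending=True)
# scatter places src[i][j] at position index[i][j], i.e. back at its original slot.
restored = torch.empty_like(logits_sort).scatter_(dim=-1, index=logits_idx, src=logits_sort)
assert torch.equal(restored, logits)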

tensorrt_llm/_torch/speculative/interface.py (1)

275-276: Consider adding forward reference import for type checking.

The static analyzer flags LlmRequest as undefined because it's used as a forward reference in the type hint. While the runtime import inside the method works correctly, adding a TYPE_CHECKING import would satisfy static analysis tools.

Add at the top of the file with other imports:

from typing import TYPE_CHECKING, List, Optional, Type

if TYPE_CHECKING:
    from ..pyexecutor.llm_request import LlmRequest

Then update the method signature (no quotes needed):

     def populate_sampling_params_for_one_model(
-            self, requests: list["LlmRequest"]) -> None:
+            self, requests: list[LlmRequest]) -> None:
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2c0293c and 343862c.

📒 Files selected for processing (9)
  • examples/llm-api/quickstart_advanced.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1 hunks)
  • tensorrt_llm/_torch/speculative/eagle3.py (4 hunks)
  • tensorrt_llm/_torch/speculative/interface.py (2 hunks)
  • tensorrt_llm/_torch/speculative/one_model_sampler.py (1 hunks)
  • tensorrt_llm/_torch/speculative/utils.py (1 hunks)
  • tensorrt_llm/llmapi/llm_args.py (1 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used (e.g., use from package.subpackage import foo and then foo.SomeClass() instead of from package.subpackage.foo import SomeClass)
Python filenames should use snake_case (e.g., some_file.py)
Python class names should use PascalCase (e.g., class SomeClass)
Python function and method names should use snake_case (e.g., def my_awesome_function():)
Python local variable names should use snake_case, with prefix k for variable names that start with a number (e.g., k_99th_percentile = ...)
Python global variables should use upper snake_case with prefix G (e.g., G_MY_GLOBAL = ...)
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description (e.g., self.x = 5 followed by """<type>: Description of 'x'""" )
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of specific errors possible instead of catching all exceptions
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block to implement the logic

Files:

  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/_torch/speculative/utils.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tensorrt_llm/_torch/speculative/one_model_sampler.py
  • tensorrt_llm/_torch/speculative/eagle3.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/speculative/interface.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  • examples/llm-api/quickstart_advanced.py
**/*.{cpp,h,cu,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header that includes the current year at the top

Files:

  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/_torch/speculative/utils.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tensorrt_llm/_torch/speculative/one_model_sampler.py
  • tensorrt_llm/_torch/speculative/eagle3.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/speculative/interface.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  • examples/llm-api/quickstart_advanced.py
🧠 Learnings (7)
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.

Applied to files:

  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/_torch/speculative/interface.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.

Applied to files:

  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
📚 Learning: 2025-08-27T15:03:57.149Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:368-392
Timestamp: 2025-08-27T15:03:57.149Z
Learning: In TensorRT-LLM's sampler.py, int32 usage for softmax_indices and related tensor indexing is intentional and should not be changed to int64. The torch.IntTensor type hint is correct for the sample() function's softmax_indices parameter.

Applied to files:

  • tensorrt_llm/_torch/speculative/one_model_sampler.py
📚 Learning: 2025-08-28T10:25:22.370Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:887-891
Timestamp: 2025-08-28T10:25:22.370Z
Learning: In tensorrt_llm/_torch/pyexecutor/sampler.py, the draft_probs and target_probs tensors have shapes [1, steps] not [steps, vocab_size] as might be expected, making the .squeeze(0) operations appropriate for removing the batch dimension of size 1.

Applied to files:

  • tensorrt_llm/_torch/speculative/one_model_sampler.py
📚 Learning: 2025-08-28T10:22:02.288Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:1191-1197
Timestamp: 2025-08-28T10:22:02.288Z
Learning: In tensorrt_llm/_torch/pyexecutor/sampler.py, the object identity comparison `softmax_req_indices is not group_req_indices_cuda` on line ~1191 is intentional and used as an optimization to determine whether to reuse an existing indexer or create a new one, based on which code path was taken during tensor assignment.

Applied to files:

  • tensorrt_llm/_torch/speculative/one_model_sampler.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
🧬 Code graph analysis (5)
tensorrt_llm/_torch/speculative/one_model_sampler.py (1)
tensorrt_llm/functional.py (2)
  • softmax (2639-2667)
  • argmax (3303-3347)
tensorrt_llm/_torch/speculative/eagle3.py (1)
tensorrt_llm/_torch/speculative/one_model_sampler.py (1)
  • sampling_batch_spec_dec_one_model (76-91)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
tensorrt_llm/_torch/speculative/eagle3.py (1)
  • Eagle3OneModelSpecMetadata (280-352)
tensorrt_llm/_torch/speculative/interface.py (1)
  • populate_sampling_params_for_one_model (275-353)
tensorrt_llm/_torch/speculative/interface.py (2)
cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1)
  • LlmRequestState (47-210)
tensorrt_llm/sampling_params.py (1)
  • params_imply_greedy_decoding (339-351)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (2)
tensorrt_llm/llmapi/llm_args.py (4)
  • spec_dec_mode (731-738)
  • spec_dec_mode (873-878)
  • spec_dec_mode (927-930)
  • spec_dec_mode (1059-1066)
tensorrt_llm/_torch/speculative/interface.py (2)
  • use_one_engine (64-65)
  • is_mtp_one_model (49-50)
🪛 Ruff (0.14.8)
tensorrt_llm/_torch/speculative/interface.py

276-276: Undefined name LlmRequest

(F821)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (16)
examples/llm-api/quickstart_advanced.py (2)

141-143: LGTM! Well-structured CLI flag addition.

The new --allow_advanced_sampling flag follows standard argparse patterns and correctly defaults to False for opt-in behavior.


211-212: LGTM! Correct scope for Eagle3 configuration.

The flag is appropriately passed only to EagleDecodingConfig. The MTP branch correctly omits this parameter since advanced sampling for MTP is not yet supported (as noted in the runtime warnings added to py_executor_creator.py).

tensorrt_llm/llmapi/llm_args.py (1)

617-619: LGTM! Clear documentation and appropriate default.

The new allow_advanced_sampling field is well-documented with clear scope (1-model paths only). The default value of False ensures the feature is opt-in, avoiding regressions for existing use cases as mentioned in the PR description.

tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)

284-293: Clear warnings added for one-engine speculative decoding modes.

The logic correctly handles two scenarios:

  1. When advanced sampling is disabled: informs users they can enable it
  2. When it's enabled for MTP one-model: warns that MTP support is incomplete

The warning-only approach (no runtime enforcement) allows for testing partial implementations, which aligns with the PR description stating the feature is "portable to MTP in theory" but with work deferred.

tensorrt_llm/_torch/speculative/utils.py (1)

69-80: Propagating allow_advanced_sampling into one‑model metadata looks correct

Wiring allow_advanced_sampling=spec_config.allow_advanced_sampling into Eagle3OneModelSpecMetadata cleanly exposes the flag to the one‑model Eagle3 path without affecting other speculative modes.

tensorrt_llm/_torch/pyexecutor/model_engine.py (2)

51-52: LGTM!

The import follows the coding guidelines by maintaining namespace when importing (from ..speculative.eagle3 import), and the addition of Eagle3OneModelSpecMetadata properly supports the new one-model sampling path.


2093-2095: LGTM!

The new branch correctly handles Eagle3OneModelSpecMetadata by populating per-request sampling parameters. The isinstance check ensures the method is only called for the appropriate metadata type, and scheduled_requests.all_requests() correctly includes both context and generation requests as expected by the implementation.

tensorrt_llm/_torch/speculative/eagle3.py (4)

17-17: LGTM!

The import follows the project's namespace import pattern and brings in the required sampling function for the new advanced sampling path.


497-529: LGTM!

The _sample_tokens_for_batch method cleanly implements the conditional sampling logic:

  • Uses per-request sampling parameters when allow_advanced_sampling is enabled
  • Falls back to efficient argmax for greedy decoding
  • The token count calculation num_contexts + num_gens * (max_draft_len + 1) correctly matches the layout of sampling parameters populated in populate_sampling_params_for_one_model
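
A hedged sketch of how a branch like this could look. The method signature, the slicing, and the sampler call's argument order are assumptions based on the names mentioned in this review, not the PR's exact code.

import torch
from tensorrt_llm._torch.speculative.one_model_sampler import sampling_batch_spec_dec_one_model

def _sample_tokens_for_batch(logits, spec_metadata, num_contexts, num_gens, max_draft_len):
    # Context requests contribute 1 token each; generation requests contribute
    # 1 + max_draft_len tokens each (the accepted position plus draft positions).
    num_tokens = num_contexts + num_gens * (max_draft_len + 1)
    if spec_metadata.allow_advanced_sampling:
        # Argument order here is assumed for illustration.
        return sampling_batch_spec_dec_one_model(
            logits[:num_tokens],
            spec_metadata.temperatures[:num_tokens],
            spec_metadata.top_ks[:num_tokens],
            spec_metadata.top_ps[:num_tokens],
        )
    # Greedy fallback: plain argmax is cheaper than the CUDA-graph-compatible kernel.
    return torch.argmax(logits[:num_tokens], dim=-1)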

552-554: LGTM!

The integration correctly delegates token sampling to the new method, passing the appropriate parameters from the local scope.


596-598: LGTM!

Good documentation explaining the design decision. Using greedy sampling for draft tokens is a reasonable trade-off for performance and implementation simplicity, especially since the PR description notes negligible impact on acceptance rate.

tensorrt_llm/_torch/speculative/one_model_sampler.py (1)

21-30: LGTM!

Good implementation using the Gumbel-max trick to avoid CPU-GPU synchronization from torch.multinomial. The approach argmax(prob / exponential) is mathematically equivalent to sampling from the categorical distribution.
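
For reference, the exponential-race form of that trick fits in a few lines. This is a generic sketch, not necessarily the exact body of random_sample in this PR.

import torch

def random_sample(probs: torch.Tensor) -> torch.Tensor:
    # Divide each probability by an independent Exp(1) draw and take the argmax:
    # P(argmax == i) equals probs[i] / probs.sum(), so this samples from the
    # categorical distribution without torch.multinomial and without any
    # host-device synchronization.
    q = torch.empty_like(probs).exponential_(1.0)
    return probs.div(q).argmax(dim=-1)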

tensorrt_llm/_torch/speculative/interface.py (4)

232-237: LGTM!

The new fields for advanced sampling are well-typed with appropriate defaults. The allow_advanced_sampling flag provides the opt-in mechanism described in the PR objectives.


287-289: Verify seed placement for multi-iteration scenarios.

The torch.manual_seed(0) is set only when self.temperatures is None (first call). This ensures deterministic sampling initialization, but subsequent calls won't reset the seed. Verify this is the intended behavior - if the same batch position should produce reproducible results across iterations, the seed might need to be set more frequently.


301-327: LGTM!

The per-request parameter extraction and greedy detection logic is correct:

  • Properly extracts first element from sampling config arrays
  • Uses SamplingParams.params_imply_greedy_decoding for consistent greedy determination
  • Applies appropriate disable values for greedy requests
  • Token count correctly differentiates context (1) vs generation (1 + max_draft_len) requests

329-353: LGTM!

The tensor initialization and async copy pattern is efficient:

  • Lazy allocation on first use
  • Pre-allocated tensors sized for max capacity
  • Uses pinned memory for CPU tensors enabling async copies
  • Non-blocking copies to overlap with computation
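
A generic sketch of that staging pattern; buffer names, sizes, and dtypes here are illustrative, not the actual code in interface.py.

import torch

# One-time lazy allocation: a pinned host staging buffer plus a device buffer,
# both sized for the maximum number of tokens the metadata may ever hold.
MAX_TOKENS = 8192
host_temperatures = torch.ones(MAX_TOKENS, dtype=torch.float32, pin_memory=True)
device_temperatures = torch.ones(MAX_TOKENS, dtype=torch.float32, device="cuda")

def update_temperatures(per_request_temps):
    # Write into the pinned buffer on the CPU, then issue a non-blocking H2D copy
    # so the transfer can overlap with other work on the stream.
    n = len(per_request_temps)
    host_temperatures[:n] = torch.tensor(per_request_temps, dtype=torch.float32)
    device_temperatures[:n].copy_(host_temperatures[:n], non_blocking=True)

update_temperatures([0.8, 1.0, 0.7])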

@jhaotingc (Collaborator)

cc @nvxuanyuc for review

@jhaotingc (Collaborator)

Hi @mikeiovine, thanks for the work!

Would you mind elaborating on what specifically has been re-implemented from #6245? Is there any perf or feature improvement over #6245? #6245 should support (1) sampling only on the target, (2) no rejection sampling, (3) MTP, and (4) CUDA graphs.

Thanks a lot!

@tensorrt-cicd (Collaborator)

PR_Github #27754 [ run ] completed with state SUCCESS. Commit: 343862c
/LLM/main/L0_MergeRequest_PR pipeline #21181 completed with status: 'ABORTED'

@mikeiovine (Collaborator Author)

@jhaotingc The only thing missing from that list is MTP support, which I will add in a follow-up shortly after this one is merged. I also dropped the min_p support since it is not required right now.

@tensorrt-cicd (Collaborator)

PR_Github #27887 [ run ] completed with state FAILURE. Commit: 0b4c288
/LLM/main/L0_MergeRequest_PR pipeline #21294 completed with status: 'FAILURE'

@mikeiovine (Collaborator Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #27904 [ run ] triggered by Bot. Commit: 0b4c288

@mikeiovine (Collaborator Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #27915 [ run ] triggered by Bot. Commit: 0b4c288

@tensorrt-cicd (Collaborator)

PR_Github #27915 [ run ] completed with state SUCCESS. Commit: 0b4c288
/LLM/main/L0_MergeRequest_PR pipeline #21315 completed with status: 'FAILURE'

@mikeiovine (Collaborator Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #28046 [ run ] triggered by Bot. Commit: 0b4c288

@tensorrt-cicd (Collaborator)

PR_Github #28046 [ run ] completed with state SUCCESS. Commit: 0b4c288
/LLM/main/L0_MergeRequest_PR pipeline #21419 completed with status: 'FAILURE'

@mikeiovine (Collaborator Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #28078 [ run ] triggered by Bot. Commit: 0b4c288

@tensorrt-cicd (Collaborator)

PR_Github #28078 [ run ] completed with state SUCCESS. Commit: 0b4c288
/LLM/main/L0_MergeRequest_PR pipeline #21450 completed with status: 'SUCCESS'

mikeiovine and others added 2 commits December 13, 2025 10:10
@mikeiovine (Collaborator Author)

/bot --reuse-pipeline


@mikeiovine enabled auto-merge (squash) December 13, 2025 15:15
@mikeiovine (Collaborator Author)

/bot reuse-pipeline

@tensorrt-cicd (Collaborator)

PR_Github #28112 [ reuse-pipeline ] triggered by Bot. Commit: 434df69

@tensorrt-cicd (Collaborator)

PR_Github #28112 [ reuse-pipeline ] completed with state SUCCESS. Commit: 434df69
Reusing PR_Github #28078 for commit 434df69

@mikeiovine merged commit 383b13e into NVIDIA:main Dec 13, 2025
5 checks passed
@mikeiovine deleted the 1-model-sampling branch December 13, 2025 16:39
sherry-1001 pushed a commit to sherry-1001/TensorRT-LLM that referenced this pull request Dec 16, 2025
codego7250 pushed a commit to codego7250/TensorRT-LLM that referenced this pull request Dec 19, 2025