[None][feat] Implement sampling on 1-model EAGLE3 #9885
Conversation
/bot run --disable-fail-fast

/bot run --disable-fail-fast

PR_Github #27752 [ run ] triggered by Bot. Commit:

PR_Github #27754 [ run ] triggered by Bot. Commit:
📝 Walkthrough
The PR adds advanced (non-greedy) sampling support to Eagle3 one-model speculative decoding via a new `allow_advanced_sampling` option and a dedicated one-model sampling module.
Changes
Sequence Diagram
sequenceDiagram
participant CLI as CLI/Config
participant Engine as Model Engine
participant Eagle3 as Eagle3 Worker
participant Sampler as Sampling Module
participant Device as GPU Device
CLI->>Engine: create with allow_advanced_sampling=True
Engine->>Engine: populate_sampling_params_for_one_model()<br/>(derive per-request temps, top-k, top-p)
Note over Engine: Create SpecMetadata with sampling tensors
Eagle3->>Eagle3: process_batch()
Eagle3->>Eagle3: check allow_advanced_sampling flag
alt Advanced Sampling Enabled
Eagle3->>Sampler: _sample_tokens_for_batch()<br/>(logits, temps, top_k, top_p)
Sampler->>Device: apply_temperature()<br/>apply_top_k_top_p()
Device->>Sampler: masked logits
Sampler->>Device: forward_native()<br/>(softmax + random_sample)
Device->>Sampler: sampled tokens
Sampler->>Eagle3: return sampled tokens
else Greedy Sampling (fallback)
Eagle3->>Eagle3: argmax(logits)
Eagle3->>Eagle3: return greedy tokens
end
Eagle3->>Eagle3: sample_and_accept_draft_tokens()
Note over Eagle3: Use sampled tokens for draft acceptance
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20–25 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
Actionable comments posted: 2
🧹 Nitpick comments (4)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)
4293-4296: Clarify whether `allow_advanced_sampling` should be gated on `one_model`
Here `allow_advanced_sampling=True` is set for both 1-model and 2-model Eagle3 (`one_model` is parametrized). Today the flag is only consumed by the one-model path, so this is probably a no-op for the 2-model case, but it could be confusing or start toggling behavior if 2-model support is added later. Consider either:
- Passing `allow_advanced_sampling=one_model`, or
- Adding a short comment that advanced sampling is only effective for the one-model configuration.
tensorrt_llm/_torch/speculative/utils.py (1)
1-5: Consider adding the standard NVIDIA SPDX header
This file doesn't have the NVIDIA copyright/SPDX header required by the coding guidelines for `*.py` sources. If you touch this file again, it may be worth adding the standard header at the top for consistency.
tensorrt_llm/_torch/speculative/one_model_sampler.py (1)
63-65: Minor: redundant `src` parameter.
The `src=logits_sort` is redundant since `logits_sort` is already the tensor being scattered. While functionally correct, it's clearer without the redundant argument.

```diff
  # Re-sort the probabilities.
- logits = logits_sort.scatter(dim=-1, index=logits_idx, src=logits_sort)
+ logits = torch.empty_like(logits_sort).scatter_(dim=-1, index=logits_idx, src=logits_sort)
  return logits
```

Alternatively, keep as-is since scatter returns a new tensor anyway and the current form is valid.
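For readers without the file open, the function under discussion follows the common sort → mask → scatter pattern for top-k/top-p filtering. Below is a minimal, self-contained sketch of that pattern; the per-row `k`/`p` tensor layout and the ascending-sort convention are assumptions, not the literal TensorRT-LLM code:

```python
import torch

def apply_top_k_top_p_sketch(logits: torch.Tensor, k: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    """Mask logits outside per-row top-k / top-p, then restore vocabulary order.

    Assumes 1 <= k <= vocab_size and 0 < p <= 1 for every row.
    """
    logits_sort, logits_idx = logits.sort(dim=-1, descending=False)

    # Top-k: find the k-th largest value per row and mask everything below it.
    cutoff_idx = (logits_sort.size(-1) - k.to(torch.long)).unsqueeze(-1)
    cutoff_val = logits_sort.gather(-1, cutoff_idx)
    logits_sort.masked_fill_(logits_sort < cutoff_val, float("-inf"))

    # Top-p: drop the low-probability tail whose cumulative mass stays below 1 - p.
    probs_sort = logits_sort.softmax(dim=-1)
    probs_cum = probs_sort.cumsum(dim=-1)
    top_p_mask = probs_cum <= (1.0 - p).unsqueeze(-1)
    top_p_mask[..., -1] = False  # always keep the most likely token
    logits_sort.masked_fill_(top_p_mask, float("-inf"))

    # Re-sort back to the original order -- the scatter touched by the diff above.
    return torch.empty_like(logits_sort).scatter_(dim=-1, index=logits_idx, src=logits_sort)
```

Masking on the sorted copy lets both filters share a single sort, and the final scatter is exactly the re-sort step the suggested diff rewrites.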
tensorrt_llm/_torch/speculative/interface.py (1)
275-276: Consider adding forward reference import for type checking.
The static analyzer flags `LlmRequest` as undefined because it's used as a forward reference in the type hint. While the runtime import inside the method works correctly, adding a `TYPE_CHECKING` import would satisfy static analysis tools.
Add at the top of the file with other imports:

```python
from typing import TYPE_CHECKING, List, Optional, Type

if TYPE_CHECKING:
    from ..pyexecutor.llm_request import LlmRequest
```

Then update the method signature (no quotes needed):

```diff
 def populate_sampling_params_for_one_model(
-        self, requests: list["LlmRequest"]) -> None:
+        self, requests: list[LlmRequest]) -> None:
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- examples/llm-api/quickstart_advanced.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1 hunks)
- tensorrt_llm/_torch/speculative/eagle3.py (4 hunks)
- tensorrt_llm/_torch/speculative/interface.py (2 hunks)
- tensorrt_llm/_torch/speculative/one_model_sampler.py (1 hunks)
- tensorrt_llm/_torch/speculative/utils.py (1 hunks)
- tensorrt_llm/llmapi/llm_args.py (1 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used (e.g., use `from package.subpackage import foo` and then `foo.SomeClass()` instead of `from package.subpackage.foo import SomeClass`)
Python filenames should use snake_case (e.g., `some_file.py`)
Python class names should use PascalCase (e.g., `class SomeClass`)
Python function and method names should use snake_case (e.g., `def my_awesome_function():`)
Python local variable names should use snake_case, with prefix `k` for variable names that start with a number (e.g., `k_99th_percentile = ...`)
Python global variables should use upper snake_case with prefix `G` (e.g., `G_MY_GLOBAL = ...`)
Python constants should use upper snake_case (e.g., `MY_CONSTANT = ...`)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description (e.g., `self.x = 5` followed by `"""<type>: Description of 'x'"""`)
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of specific errors possible instead of catching all exceptions
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block to implement the logic
Files:
- tensorrt_llm/llmapi/llm_args.py
- tensorrt_llm/_torch/speculative/utils.py
- tests/integration/defs/accuracy/test_llm_api_pytorch.py
- tensorrt_llm/_torch/speculative/one_model_sampler.py
- tensorrt_llm/_torch/speculative/eagle3.py
- tensorrt_llm/_torch/pyexecutor/model_engine.py
- tensorrt_llm/_torch/speculative/interface.py
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
- examples/llm-api/quickstart_advanced.py
**/*.{cpp,h,cu,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header that includes the current year at the top
Files:
- tensorrt_llm/llmapi/llm_args.py
- tensorrt_llm/_torch/speculative/utils.py
- tests/integration/defs/accuracy/test_llm_api_pytorch.py
- tensorrt_llm/_torch/speculative/one_model_sampler.py
- tensorrt_llm/_torch/speculative/eagle3.py
- tensorrt_llm/_torch/pyexecutor/model_engine.py
- tensorrt_llm/_torch/speculative/interface.py
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
- examples/llm-api/quickstart_advanced.py
🧠 Learnings (7)
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.
Applied to files:
- tensorrt_llm/llmapi/llm_args.py
- tensorrt_llm/_torch/speculative/interface.py
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.
Applied to files:
- tests/integration/defs/accuracy/test_llm_api_pytorch.py
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
📚 Learning: 2025-08-27T15:03:57.149Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:368-392
Timestamp: 2025-08-27T15:03:57.149Z
Learning: In TensorRT-LLM's sampler.py, int32 usage for softmax_indices and related tensor indexing is intentional and should not be changed to int64. The torch.IntTensor type hint is correct for the sample() function's softmax_indices parameter.
Applied to files:
tensorrt_llm/_torch/speculative/one_model_sampler.py
📚 Learning: 2025-08-28T10:25:22.370Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:887-891
Timestamp: 2025-08-28T10:25:22.370Z
Learning: In tensorrt_llm/_torch/pyexecutor/sampler.py, the draft_probs and target_probs tensors have shapes [1, steps] not [steps, vocab_size] as might be expected, making the .squeeze(0) operations appropriate for removing the batch dimension of size 1.
Applied to files:
tensorrt_llm/_torch/speculative/one_model_sampler.py
📚 Learning: 2025-08-28T10:22:02.288Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:1191-1197
Timestamp: 2025-08-28T10:22:02.288Z
Learning: In tensorrt_llm/_torch/pyexecutor/sampler.py, the object identity comparison `softmax_req_indices is not group_req_indices_cuda` on line ~1191 is intentional and used as an optimization to determine whether to reuse an existing indexer or create a new one, based on which code path was taken during tensor assignment.
Applied to files:
tensorrt_llm/_torch/speculative/one_model_sampler.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.
Applied to files:
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
🧬 Code graph analysis (5)
tensorrt_llm/_torch/speculative/one_model_sampler.py (1)
tensorrt_llm/functional.py (2)
softmax (2639-2667)
argmax (3303-3347)
tensorrt_llm/_torch/speculative/eagle3.py (1)
tensorrt_llm/_torch/speculative/one_model_sampler.py (1)
sampling_batch_spec_dec_one_model(76-91)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
tensorrt_llm/_torch/speculative/eagle3.py (1)
Eagle3OneModelSpecMetadata (280-352)
tensorrt_llm/_torch/speculative/interface.py (1)
populate_sampling_params_for_one_model(275-353)
tensorrt_llm/_torch/speculative/interface.py (2)
cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1)
LlmRequestState (47-210)
tensorrt_llm/sampling_params.py (1)
params_imply_greedy_decoding(339-351)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (2)
tensorrt_llm/llmapi/llm_args.py (4)
spec_dec_mode (731-738), spec_dec_mode (873-878), spec_dec_mode (927-930), spec_dec_mode (1059-1066)
tensorrt_llm/_torch/speculative/interface.py (2)
use_one_engine (64-65)
is_mtp_one_model (49-50)
🪛 Ruff (0.14.8)
tensorrt_llm/_torch/speculative/interface.py
276-276: Undefined name LlmRequest
(F821)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (16)
examples/llm-api/quickstart_advanced.py (2)
141-143: LGTM! Well-structured CLI flag addition.
The new `--allow_advanced_sampling` flag follows standard argparse patterns and correctly defaults to `False` for opt-in behavior.
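As a rough illustration of the pattern being described (not the PR's actual code, and the help text here is invented):

```python
import argparse

parser = argparse.ArgumentParser()
# store_true gives opt-in semantics: the flag is False unless explicitly passed.
parser.add_argument(
    "--allow_advanced_sampling",
    action="store_true",
    default=False,
    help="Enable non-greedy (temperature/top-k/top-p) sampling for 1-model speculative decoding.",
)

args = parser.parse_args([])
assert args.allow_advanced_sampling is False  # default stays off
```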
211-212: LGTM! Correct scope for Eagle3 configuration.
The flag is appropriately passed only to `EagleDecodingConfig`. The MTP branch correctly omits this parameter since advanced sampling for MTP is not yet supported (as noted in the runtime warnings added to py_executor_creator.py).
tensorrt_llm/llmapi/llm_args.py (1)
617-619: LGTM! Clear documentation and appropriate default.
The new `allow_advanced_sampling` field is well-documented with clear scope (1-model paths only). The default value of `False` ensures the feature is opt-in, avoiding regressions for existing use cases as mentioned in the PR description.
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)
284-293: Clear warnings added for one-engine speculative decoding modes.
The logic correctly handles two scenarios:
- When advanced sampling is disabled: informs users they can enable it
- When it's enabled for MTP one-model: warns that MTP support is incomplete
The warning-only approach (no runtime enforcement) allows for testing partial implementations, which aligns with the PR description stating the feature is "portable to MTP in theory" but with work deferred.
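A hypothetical sketch of that warning-only gating follows; the message wording and the exact guard are assumptions, while `use_one_engine()` and `is_mtp_one_model()` come from the code-graph section earlier in this review:

```python
import logging

logger = logging.getLogger(__name__)

def warn_about_advanced_sampling(spec_config) -> None:
    """Hypothetical sketch of the warning-only gating; not the literal creator code."""
    if spec_config is None or not spec_config.spec_dec_mode.use_one_engine():
        return
    if not spec_config.allow_advanced_sampling:
        logger.warning(
            "One-model speculative decoding will sample greedily; set "
            "allow_advanced_sampling=True to enable temperature/top-k/top-p sampling.")
    elif spec_config.spec_dec_mode.is_mtp_one_model():
        logger.warning(
            "allow_advanced_sampling is not fully supported for MTP one-model yet.")
```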
tensorrt_llm/_torch/speculative/utils.py (1)
69-80: Propagating `allow_advanced_sampling` into one-model metadata looks correct
Wiring `allow_advanced_sampling=spec_config.allow_advanced_sampling` into `Eagle3OneModelSpecMetadata` cleanly exposes the flag to the one-model Eagle3 path without affecting other speculative modes.
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
51-52: LGTM!
The import follows the coding guidelines by maintaining namespace when importing (`from ..speculative.eagle3 import`), and the addition of `Eagle3OneModelSpecMetadata` properly supports the new one-model sampling path.
2093-2095: LGTM!
The new branch correctly handles `Eagle3OneModelSpecMetadata` by populating per-request sampling parameters. The `isinstance` check ensures the method is only called for the appropriate metadata type, and `scheduled_requests.all_requests()` correctly includes both context and generation requests as expected by the implementation.
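A toy, runnable model of that branch, with stub classes and placeholder requests standing in for the real metadata and `LlmRequest` objects:

```python
class FakeSpecMetadata:
    """Stand-in for the generic speculative metadata base class."""

class FakeEagle3OneModelSpecMetadata(FakeSpecMetadata):
    """Stand-in for Eagle3OneModelSpecMetadata; only this type carries sampling tensors."""
    def populate_sampling_params_for_one_model(self, requests):
        print(f"populating per-request sampling params for {len(requests)} requests")

spec_metadata = FakeEagle3OneModelSpecMetadata()
all_requests = ["ctx_req", "gen_req_1", "gen_req_2"]  # context + generation requests

# The dispatch described above: only the one-model Eagle3 metadata gets populated.
if isinstance(spec_metadata, FakeEagle3OneModelSpecMetadata):
    spec_metadata.populate_sampling_params_for_one_model(all_requests)
```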
tensorrt_llm/_torch/speculative/eagle3.py (4)
17-17: LGTM!
The import follows the project's namespace import pattern and brings in the required sampling function for the new advanced sampling path.
497-529: LGTM!
The `_sample_tokens_for_batch` method cleanly implements the conditional sampling logic:
- Uses per-request sampling parameters when `allow_advanced_sampling` is enabled
- Falls back to efficient `argmax` for greedy decoding
- The token count calculation `num_contexts + num_gens * (max_draft_len + 1)` correctly matches the layout of sampling parameters populated in `populate_sampling_params_for_one_model`
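A quick worked example of that layout (the counts are arbitrary):

```python
# Each context request contributes 1 sampled token; each generation request
# contributes 1 + max_draft_len tokens (the verified token plus the draft positions).
num_contexts, num_gens, max_draft_len = 2, 3, 4
num_sampled_tokens = num_contexts + num_gens * (max_draft_len + 1)
assert num_sampled_tokens == 17
```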
552-554: LGTM!
The integration correctly delegates token sampling to the new method, passing the appropriate parameters from the local scope.
596-598: LGTM!
Good documentation explaining the design decision. Using greedy sampling for draft tokens is a reasonable trade-off for performance and implementation simplicity, especially since the PR description notes negligible impact on acceptance rate.
tensorrt_llm/_torch/speculative/one_model_sampler.py (1)
21-30: LGTM!
Good implementation using the Gumbel-max trick to avoid CPU-GPU synchronization from `torch.multinomial`. The approach `argmax(prob / exponential)` is mathematically equivalent to sampling from the categorical distribution.
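For reference, a minimal self-contained sketch of the trick; the function name and the sanity check are illustrative, not taken from the PR:

```python
import torch

def random_sample_sketch(probs: torch.Tensor) -> torch.Tensor:
    """Draw one sample per row from Categorical(probs) without torch.multinomial.

    With E_i ~ Exp(1) i.i.d., argmax_i(p_i / E_i) is distributed as Categorical(p),
    so the whole draw stays on-device with no host synchronization.
    """
    noise = torch.empty_like(probs).exponential_(1.0)
    return probs.div(noise).argmax(dim=-1)

# Rough empirical check: sample frequencies should approach the target probabilities.
p = torch.tensor([[0.1, 0.2, 0.7]]).repeat(10_000, 1)
counts = torch.bincount(random_sample_sketch(p), minlength=3)
print(counts / 10_000)  # approximately tensor([0.1, 0.2, 0.7])
```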
tensorrt_llm/_torch/speculative/interface.py (4)
232-237: LGTM!
The new fields for advanced sampling are well-typed with appropriate defaults. The `allow_advanced_sampling` flag provides the opt-in mechanism described in the PR objectives.
287-289: Verify seed placement for multi-iteration scenarios.
The `torch.manual_seed(0)` is set only when `self.temperatures is None` (first call). This ensures deterministic sampling initialization, but subsequent calls won't reset the seed. Verify this is the intended behavior - if the same batch position should produce reproducible results across iterations, the seed might need to be set more frequently.
301-327: LGTM!
The per-request parameter extraction and greedy detection logic is correct:
- Properly extracts first element from sampling config arrays
- Uses `SamplingParams.params_imply_greedy_decoding` for consistent greedy determination
- Applies appropriate disable values for greedy requests
- Token count correctly differentiates context (1) vs generation (1 + max_draft_len) requests
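To make those bullets concrete, here is a toy, self-contained model of the flattening and greedy-fallback logic; the request fields, the greedy test, and the "disable" values are stand-ins rather than the real LlmRequest / SamplingParams API:

```python
import dataclasses
import torch

@dataclasses.dataclass
class FakeRequest:
    """Stand-in for an LlmRequest with only the fields this sketch needs."""
    temperature: float
    top_k: int
    top_p: float
    is_context: bool

def collect_sampling_params(requests, max_draft_len):
    temps, top_ks, top_ps = [], [], []
    for req in requests:
        greedy = req.temperature == 0.0 or req.top_k == 1  # simplified greedy test
        # Stand-in "disable" values: top_k=1 leaves only the argmax token alive,
        # so batched random sampling returns it deterministically for greedy requests.
        t = 1.0 if greedy else req.temperature
        k = 1 if greedy else req.top_k
        p = 1.0 if greedy else req.top_p
        # Context requests contribute 1 token; generation requests 1 + max_draft_len.
        repeat = 1 if req.is_context else 1 + max_draft_len
        temps.extend([t] * repeat)
        top_ks.extend([k] * repeat)
        top_ps.extend([p] * repeat)
    return torch.tensor(temps), torch.tensor(top_ks), torch.tensor(top_ps)

reqs = [FakeRequest(0.0, 1, 1.0, True), FakeRequest(0.8, 50, 0.95, False)]
t, k, p = collect_sampling_params(reqs, max_draft_len=3)
assert t.numel() == 1 + (1 + 3)  # 1 context token + 4 generation tokens
```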
329-353: LGTM!
The tensor initialization and async copy pattern is efficient:
- Lazy allocation on first use
- Pre-allocated tensors sized for max capacity
- Uses pinned memory for CPU tensors enabling async copies
- Non-blocking copies to overlap with computation
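The pattern those bullets describe looks roughly like the following; buffer names and sizes are illustrative, and a CUDA device is assumed:

```python
import torch

MAX_TOKENS = 4096  # illustrative capacity, allocated once for the worst case

# Pre-allocated buffers: pinned host memory enables truly asynchronous H2D copies.
temperatures_host = torch.ones(MAX_TOKENS, dtype=torch.float32, pin_memory=True)
temperatures_device = torch.empty(MAX_TOKENS, dtype=torch.float32, device="cuda")

def update_temperatures(per_token_temps):
    n = len(per_token_temps)
    # Stage values in the pinned buffer, then launch a non-blocking copy so the
    # transfer can overlap with whatever the CPU does next.
    temperatures_host[:n] = torch.tensor(per_token_temps, dtype=torch.float32)
    temperatures_device[:n].copy_(temperatures_host[:n], non_blocking=True)

update_temperatures([0.7, 1.0, 0.9])
```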
cc @nvxuanyuc for review

Hi @mikeiovine, thanks for the work! Would you mind elaborating more specifically on what's being re-implemented from #6245? Or is there any perf or feature improvement over #6245? #6245 should support (1) sampling only on the target, (2) no rejection sampling, (3) MTP support, and (4) CUDA graph support. Thanks a lot!

PR_Github #27754 [ run ] completed with state

@jhaotingc The only thing that's missing from that list is MTP support, which I will add in a follow-up shortly after this one is merged. I also got rid of the min_p support since it is not required right now.
343862c to 920e9ba — Compare
PR_Github #27887 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #27904 [ run ] triggered by Bot. Commit:

/bot run --disable-fail-fast

PR_Github #27915 [ run ] triggered by Bot. Commit:

PR_Github #27915 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #28046 [ run ] triggered by Bot. Commit:

PR_Github #28046 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #28078 [ run ] triggered by Bot. Commit:

PR_Github #28078 [ run ] completed with state
0b4c288 to 434df69 — Compare
/bot --reuse-pipeline
GitHub Bot Help
Provide a user friendly way for developers to interact with a Jenkins server. Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
Details
run
Launch build/test pipelines. All previously running jobs will be killed.
kill
Kill all running builds associated with pull request.
skip
Skip testing for latest commit on pull request.
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
/bot reuse-pipeline

PR_Github #28112 [ reuse-pipeline ] triggered by Bot. Commit:

PR_Github #28112 [ reuse-pipeline ] completed with state
Description
Implement sampling on 1-model EAGLE. The core sampling implementation is taken from #6245 (thanks to @pathorn and @Shang-Pin from the DeepInfra team for the original implementation). However, many things have been reimplemented. Some key notes:
Test Coverage
`TestGPTOSS` already has accuracy tests with non-greedy sampling params. I also tested GPQA (79%) and AIME25 (91%) on high reasoning, which matches the reference from OpenAI.
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`
Provide a user friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message.
See details below for each supported subcommand.
Details
`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]`
Launch build/test pipelines. All previously running jobs will be killed.
- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.
kill
`kill`
Kill all running builds associated with pull request.
skip
`skip --comment COMMENT`
Skip testing for latest commit on pull request.
`--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
`reuse-pipeline`
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
Summary by CodeRabbit
Release Notes
New Features
- `--allow_advanced_sampling` flag and configuration option for one-model speculative decoding paths.
Tests