63 changes: 37 additions & 26 deletions tensorrt_llm/_torch/models/modeling_speculative.py
@@ -796,6 +796,7 @@ def __init__(self, model: TModel, model_config: ModelConfig[TConfig]):
assert key in model_config.extra_attrs
model_config.extra_attrs[key].update(value)
self.layer_idx = -1
self.enable_cuda_graph_for_draft_model = spec_config.enable_cuda_graph_for_draft_model

⚠️ Potential issue | 🔴 Critical

Potential AttributeError when spec_config is None.

The spec_config variable can be None (assigned via getattr(model_config, 'spec_config', None) on line 741). Accessing spec_config.enable_cuda_graph_for_draft_model directly without a null check will raise an AttributeError.

🔎 Proposed fix
         self.layer_idx = -1
-        self.enable_cuda_graph_for_draft_model = spec_config.enable_cuda_graph_for_draft_model
+        self.enable_cuda_graph_for_draft_model = spec_config.enable_cuda_graph_for_draft_model if spec_config else True

Note: Defaulting to True preserves backward-compatible behavior (CUDA graph capture enabled by default).

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/models/modeling_speculative.py around line 799
(spec_config originates from getattr(..., 'spec_config', None) on line 741),
accessing spec_config.enable_cuda_graph_for_draft_model can raise AttributeError
when spec_config is None; change the assignment to read the attribute safely,
e.g. set self.enable_cuda_graph_for_draft_model = getattr(spec_config,
'enable_cuda_graph_for_draft_model', True) or check if spec_config is not None
before accessing and default to True to preserve backward-compatible behavior.
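
As a quick, hedged illustration of the null-safe pattern described in the prompt above, the sketch below uses a hypothetical SpecConfig stand-in (not the real TensorRT-LLM config class) to show how getattr with a default preserves the backward-compatible value when spec_config is None.

from dataclasses import dataclass
from typing import Optional


@dataclass
class SpecConfig:
    # Hypothetical stand-in for the real spec config; only the flag matters here.
    enable_cuda_graph_for_draft_model: bool = True


def resolve_draft_graph_flag(spec_config: Optional[SpecConfig]) -> bool:
    # getattr with a default returns True (CUDA graph capture enabled) when
    # spec_config is None instead of raising AttributeError.
    return getattr(spec_config, "enable_cuda_graph_for_draft_model", True)


assert resolve_draft_graph_flag(None) is True
assert resolve_draft_graph_flag(SpecConfig(enable_cuda_graph_for_draft_model=False)) is False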


def forward(
self,
@@ -823,33 +824,15 @@ def forward(
if attn_metadata.padded_num_tokens is not None:
hidden_states = hidden_states[:attn_metadata.num_tokens]

is_capturing = torch.cuda.is_current_stream_capturing()

if self.draft_model is not None:
# get logits
logits = self.logits_processor.forward(
hidden_states[spec_metadata.gather_ids],
self.lm_head,
attn_metadata,
True,
)
mtp_input_ids = input_ids
mtp_position_ids = position_ids
if attn_metadata.padded_num_tokens is not None:
if input_ids is not None:
# Slice along the first dimension
mtp_input_ids = input_ids[:attn_metadata.num_tokens]
if position_ids is not None:
# Slice along the last dimension
mtp_position_ids = position_ids[:, :attn_metadata.
num_tokens]

# get accepted tokens and next draft tokens
return self.spec_worker(input_ids=mtp_input_ids,
position_ids=mtp_position_ids,
hidden_states=hidden_states,
logits=logits,
attn_metadata=attn_metadata,
spec_metadata=spec_metadata,
draft_model=self.draft_model)
if is_capturing and not self.enable_cuda_graph_for_draft_model:
return hidden_states
else:
return self.forward_draft(hidden_states, input_ids,
position_ids, attn_metadata,
spec_metadata)
Comment on lines +827 to +835

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain (collapsed verification log): the reviewer ran a series of repository scripts against NVIDIA/TensorRT-LLM to locate the forward and forward_draft methods in modeling_speculative.py, inspect the spec_worker implementations (MTPWorker, Eagle3OneModelWorker) and their return values, and trace how PyTorchModelEngine consumes the model's forward output.

Fix return type annotation.

The method's return type is annotated as torch.Tensor, but it can actually return either a torch.Tensor (when is_capturing and not self.enable_cuda_graph_for_draft_model) or a dict (from self.spec_worker). The spec_worker returns a dict with keys like logits, new_tokens, new_tokens_lens, etc., not a tensor. While PyTorchModelEngine._forward_step correctly handles both types using isinstance(outputs, dict), the method's return type annotation should be updated to reflect the actual return types (e.g., Union[torch.Tensor, Dict[str, Any]]).

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/models/modeling_speculative.py around lines 827 to 835,
the method is annotated to return torch.Tensor but actually may return either a
torch.Tensor or a dict from the spec_worker; update the return type annotation
to Union[torch.Tensor, Dict[str, Any]] (or typing.Any as appropriate),
add/import Union, Dict, Any from typing at the top of the file if not present,
and adjust any related type comments or stubs so the signature matches both
possible return types.
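
A minimal sketch of the widened annotation, assuming a simplified stand-in for the wrapper's forward rather than the real method in modeling_speculative.py; the dict keys follow the ones named in the comment above.

from typing import Any, Dict, Union

import torch


def forward_stub(is_capturing: bool,
                 enable_cuda_graph_for_draft_model: bool
                 ) -> Union[torch.Tensor, Dict[str, Any]]:
    # Mirrors the two return shapes noted above: raw hidden states when
    # capturing with the draft loop excluded, otherwise a spec_worker-style dict.
    hidden_states = torch.zeros(2, 8)
    if is_capturing and not enable_cuda_graph_for_draft_model:
        return hidden_states
    return {"logits": torch.zeros(2, 4), "new_tokens": torch.zeros(2, 1)}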

else:
logits = self.logits_processor.forward(
hidden_states,
@@ -860,6 +843,34 @@ def forward(

return logits

def forward_draft(self, hidden_states, input_ids, position_ids,
attn_metadata, spec_metadata):
# get logits
logits = self.logits_processor.forward(
hidden_states[spec_metadata.gather_ids],
self.lm_head,
attn_metadata,
True,
)
mtp_input_ids = input_ids
mtp_position_ids = position_ids
if attn_metadata.padded_num_tokens is not None:
if input_ids is not None:
# Slice along the first dimension
mtp_input_ids = input_ids[:attn_metadata.num_tokens]
if position_ids is not None:
# Slice along the last dimension
mtp_position_ids = position_ids[:, :attn_metadata.num_tokens]

# get accepted tokens and next draft tokens
return self.spec_worker(input_ids=mtp_input_ids,
position_ids=mtp_position_ids,
hidden_states=hidden_states,
logits=logits,
attn_metadata=attn_metadata,
spec_metadata=spec_metadata,
draft_model=self.draft_model)

def load_weights(self,
weights: Dict,
weight_mapper: Optional[BaseWeightMapper] = None,
7 changes: 7 additions & 0 deletions tensorrt_llm/_torch/pyexecutor/model_engine.py
@@ -338,6 +338,7 @@ def __init__(
) or self.model_is_wrapped
self.max_draft_len = spec_config.max_draft_len
self.max_total_draft_tokens = spec_config.max_total_draft_tokens
self.enable_cuda_graph_for_draft_model = spec_config.enable_cuda_graph_for_draft_model

⚠️ Potential issue | 🔴 Critical

Guard enable_cuda_graph_for_draft_model and forward_draft usage for non-speculative executors

Right now enable_cuda_graph_for_draft_model is only assigned when spec_config is not None, but the CUDA-graph replay path reads it unconditionally and also assumes inputs['spec_metadata'] exists. For non-speculative executors (no spec_config) that still use CUDA graphs, this can lead to:

  • AttributeError on self.enable_cuda_graph_for_draft_model.
  • Or, once the attribute is initialized, an invalid attempt to call model.forward_draft without spec_metadata and on models that don’t implement that method.

You likely only intend to run forward_draft in speculative mode when the new flag is False. Suggest initializing the flag for all cases and gating the forward_draft call on self.enable_spec_decode as well.

Proposed fix: initialize flag safely and gate the `forward_draft` call
@@
-        self.llm_args = llm_args
-        self.original_max_draft_len = spec_config.max_draft_len if spec_config is not None else 0
-        self.original_max_total_draft_tokens = spec_config.max_total_draft_tokens if spec_config is not None else 0
+        self.llm_args = llm_args
+        self.original_max_draft_len = spec_config.max_draft_len if spec_config is not None else 0
+        self.original_max_total_draft_tokens = spec_config.max_total_draft_tokens if spec_config is not None else 0
@@
-        self.spec_config = spec_config
-        self.is_spec_decode = spec_config is not None
+        self.spec_config = spec_config
+        self.is_spec_decode = spec_config is not None
+        # Default to True so non-speculative executors never take the draft-only path.
+        self.enable_cuda_graph_for_draft_model = (
+            spec_config.enable_cuda_graph_for_draft_model
+            if spec_config is not None else True
+        )
         self.sparse_attention_config = None if is_draft_model else llm_args.sparse_attention_config
         self.enable_spec_decode = self.is_spec_decode
         self.is_draft_model = is_draft_model
@@
-                    else:
-                        with MoeLoadBalancerIterContext(moe_load_balancer):
-                            outputs = self.cuda_graph_runner.replay(key, inputs)
-                            if not self.enable_cuda_graph_for_draft_model:
-                                outputs = self.model.forward_draft(
-                                    outputs, inputs['input_ids'],
-                                    inputs['position_ids'],
-                                    inputs['attn_metadata'],
-                                    inputs['spec_metadata'])
+                    else:
+                        with MoeLoadBalancerIterContext(moe_load_balancer):
+                            outputs = self.cuda_graph_runner.replay(key, inputs)
+                            # When speculative decoding is enabled but we opted out of
+                            # capturing the draft loop in the CUDA graph, run the
+                            # draft-only pass after replay.
+                            if (self.enable_spec_decode
+                                    and not self.enable_cuda_graph_for_draft_model):
+                                outputs = self.model.forward_draft(
+                                    outputs,
+                                    inputs['input_ids'],
+                                    inputs['position_ids'],
+                                    inputs['attn_metadata'],
+                                    inputs['spec_metadata'],
+                                )

This keeps non-speculative flows and non-Eagle3 models on the existing path while enabling the new “draft outside CUDA graph” behavior only where spec_config and forward_draft are defined.

Also applies to: 3269-3274
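
For illustration only, a self-contained sketch of the post-replay gating the diff above proposes; the function and its arguments are hypothetical stand-ins, not the real PyTorchModelEngine code.

from typing import Any, Dict


def run_after_replay(outputs: Any,
                     inputs: Dict[str, Any],
                     model: Any,
                     enable_spec_decode: bool,
                     enable_cuda_graph_for_draft_model: bool) -> Any:
    # Take the draft-only path only when speculative decoding is active and
    # the draft loop was deliberately left out of the captured CUDA graph.
    if enable_spec_decode and not enable_cuda_graph_for_draft_model:
        outputs = model.forward_draft(outputs,
                                      inputs['input_ids'],
                                      inputs['position_ids'],
                                      inputs['attn_metadata'],
                                      inputs['spec_metadata'])
    return outputs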

else:
self.without_logits = False
self.max_draft_len = 0
@@ -3265,6 +3266,12 @@ def capture_postprocess_fn(inputs: Dict[str, Any]):
else:
with MoeLoadBalancerIterContext(moe_load_balancer):
outputs = self.cuda_graph_runner.replay(key, inputs)
if not self.enable_cuda_graph_for_draft_model:
outputs = self.model.forward_draft(
outputs, inputs['input_ids'],
inputs['position_ids'],
inputs['attn_metadata'],
inputs['spec_metadata'])

if self.forward_pass_callable is not None:
self.forward_pass_callable()
7 changes: 7 additions & 0 deletions tensorrt_llm/llmapi/llm_args.py
@@ -858,6 +858,8 @@ class EagleDecodingConfig(DecodingBaseConfig):
# The model architecture of the eagle3 model.
# choices: llama3, mistral_large3
eagle3_model_arch: str = "llama3"
# Whether the draft model forward pass is captured in the CUDA graph
enable_cuda_graph_for_draft_model: Optional[bool] = True

def __init__(self, **kwargs):
super().__init__()
@@ -912,6 +914,11 @@ def __init__(self, **kwargs):
assert self.dynamic_tree_max_topK is not None and self.dynamic_tree_max_topK > 0, "dynamic_tree_max_topK should be provided, which indicates the number of nodes to expand each time"
assert self.max_total_draft_tokens is not None and self.max_total_draft_tokens > 0, "max_total_draft_tokens should be provided, which indicates the total nodes of the final draft tree. (exclude the root node)"

if self.enable_cuda_graph_for_draft_model == False and self.eagle3_one_model == False:
raise ValueError(
"enable_cuda_graph_for_draft_model can be false only when eagle3_one_model is True"
)

@classmethod
def from_dict(cls, data: dict):
return cls(**data)
4 changes: 2 additions & 2 deletions tests/integration/defs/.test_durations
@@ -309,8 +309,8 @@
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_bfloat16[attn_backend=TRTLLM-torch_compile=True]": 166.85348949534819,
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_chunked_prefill[attn_backend=FLASHINFER]": 167.15153613401344,
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_chunked_prefill[attn_backend=TRTLLM]": 90.12104846700095,
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[eagle3_one_model=False-overlap_scheduler=False]": 1112.0988524899585,
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[eagle3_one_model=True-overlap_scheduler=True]": 979.2759481471148,
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[enable_cuda_graph_for_draft_model=False-eagle3_one_model=False-overlap_scheduler=False]": 1112.0988524899585,
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[enable_cuda_graph_for_draft_model=False-eagle3_one_model=True-overlap_scheduler=True]": 979.2759481471148,
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=False-attn_backend=FLASHINFER-torch_compile=False]": 237.24446990108117,
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=False-attn_backend=FLASHINFER-torch_compile=True]": 226.39608797896653,
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=False-attn_backend=TRTLLM-torch_compile=False]": 174.38962662010454,
15 changes: 11 additions & 4 deletions tests/integration/defs/accuracy/test_llm_api_pytorch.py
@@ -266,8 +266,13 @@ def test_fp8_llm_sampler(self):
@parametrize_with_ids("overlap_scheduler", [True, False])
@parametrize_with_ids("eagle3_one_model", [True, False])
@parametrize_with_ids("sampler_async_worker", [True, False])
@parametrize_with_ids("enable_cuda_graph_for_draft_model", [True, False])
def test_eagle3(self, overlap_scheduler, eagle3_one_model,
sampler_async_worker):
sampler_async_worker, enable_cuda_graph_for_draft_model):
if enable_cuda_graph_for_draft_model == False and eagle3_one_model == False:
pytest.skip(
"enable_cuda_graph_for_draft_model can be false only when eagle3_one_model is True"
)
pytorch_config = dict(
max_batch_size=
1, # add max_batch_size to avoid error in overlap scheduler
@@ -284,9 +289,11 @@ def test_eagle3(self, overlap_scheduler, eagle3_one_model,
target_model_dir = f"{llm_models_root()}/llama-3.1-model/Llama-3.1-8B-Instruct"

draft_len = 4
spec_config = EagleDecodingConfig(max_draft_len=draft_len,
speculative_model_dir=eagle_model_dir,
eagle3_one_model=eagle3_one_model)
spec_config = EagleDecodingConfig(
max_draft_len=draft_len,
speculative_model_dir=eagle_model_dir,
eagle3_one_model=eagle3_one_model,
enable_cuda_graph_for_draft_model=enable_cuda_graph_for_draft_model)

with LLM(model=target_model_dir,
**pytorch_config,
7 changes: 4 additions & 3 deletions tests/integration/test_lists/qa/llm_digits_func.txt
@@ -16,9 +16,10 @@ test_e2e.py::test_ptp_quickstart_advanced[Mistral-Nemo-12b-Base-Mistral-Nemo-Bas
test_e2e.py::test_ptp_quickstart_advanced[DeepSeek-R1-Distill-Qwen-32B-DeepSeek-R1/DeepSeek-R1-Distill-Qwen-32B]

accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8_llm_sampler
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[sampler_async_worker=False-eagle3_one_model=False-overlap_scheduler=False]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[sampler_async_worker=False-eagle3_one_model=True-overlap_scheduler=True]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[sampler_async_worker=True-eagle3_one_model=True-overlap_scheduler=True]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[enable_cuda_graph_for_draft_model=False-sampler_async_worker=False-eagle3_one_model=False-overlap_scheduler=False]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[enable_cuda_graph_for_draft_model=False-sampler_async_worker=False-eagle3_one_model=True-overlap_scheduler=True]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[enable_cuda_graph_for_draft_model=False-sampler_async_worker=True-eagle3_one_model=True-overlap_scheduler=True]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[enable_cuda_graph_for_draft_model=True-sampler_async_worker=True-eagle3_one_model=True-overlap_scheduler=True]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_ngram

accuracy/test_llm_api_pytorch_multimodal.py::TestNVILA_8B::test_auto_dtype
7 changes: 4 additions & 3 deletions tests/integration/test_lists/qa/llm_function_core.txt
@@ -390,9 +390,10 @@ accuracy/test_llm_api_pytorch.py::TestLlama3_1_8B::test_nvfp4
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_chunked_prefill[attn_backend=FLASHINFER]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_chunked_prefill[attn_backend=TRTLLM]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8_llm_sampler
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[sampler_async_worker=False-eagle3_one_model=True-overlap_scheduler=True]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[sampler_async_worker=False-eagle3_one_model=False-overlap_scheduler=False]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[sampler_async_worker=True-eagle3_one_model=True-overlap_scheduler=True]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[enable_cuda_graph_for_draft_model=False-sampler_async_worker=False-eagle3_one_model=False-overlap_scheduler=False]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[enable_cuda_graph_for_draft_model=False-sampler_async_worker=False-eagle3_one_model=True-overlap_scheduler=True]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[enable_cuda_graph_for_draft_model=False-sampler_async_worker=True-eagle3_one_model=True-overlap_scheduler=True]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[enable_cuda_graph_for_draft_model=True-sampler_async_worker=True-eagle3_one_model=True-overlap_scheduler=True]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_ngram
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_guided_decoding[xgrammar]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_guided_decoding[llguidance]
7 changes: 4 additions & 3 deletions tests/integration/test_lists/qa/llm_function_core_sanity.txt
@@ -128,9 +128,10 @@ accuracy/test_llm_api_pytorch.py::TestKimiK2::test_nvfp4[4gpus]
accuracy/test_llm_api_pytorch.py::TestKimiK2::test_nvfp4[8gpus]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8B::test_nvfp4
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_chunked_prefill[attn_backend=FLASHINFER]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[sampler_async_worker=False-eagle3_one_model=False-overlap_scheduler=False]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[sampler_async_worker=False-eagle3_one_model=True-overlap_scheduler=True]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[sampler_async_worker=True-eagle3_one_model=True-overlap_scheduler=True]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[enable_cuda_graph_for_draft_model=False-sampler_async_worker=False-eagle3_one_model=False-overlap_scheduler=False]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[enable_cuda_graph_for_draft_model=False-sampler_async_worker=False-eagle3_one_model=True-overlap_scheduler=True]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[enable_cuda_graph_for_draft_model=False-sampler_async_worker=True-eagle3_one_model=True-overlap_scheduler=True]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[enable_cuda_graph_for_draft_model=True-sampler_async_worker=True-eagle3_one_model=True-overlap_scheduler=True]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8_llm_sampler
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_guided_decoding_4gpus[llguidance]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_guided_decoding_4gpus[xgrammar]
7 changes: 4 additions & 3 deletions tests/integration/test_lists/qa/llm_function_l20.txt
@@ -21,9 +21,10 @@ accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_chunked_prefill[
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8_llm_sampler
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8_beam_search[enable_cuda_graph=False-enable_padding=False-disable_overlap_scheduler=False-sampler_async_worker=False]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8_beam_search[enable_cuda_graph=False-enable_padding=False-disable_overlap_scheduler=False-sampler_async_worker=True]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[sampler_async_worker=False-eagle3_one_model=True-overlap_scheduler=True]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[sampler_async_worker=False-eagle3_one_model=False-overlap_scheduler=False]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[sampler_async_worker=True-eagle3_one_model=True-overlap_scheduler=True]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[enable_cuda_graph_for_draft_model=False-sampler_async_worker=False-eagle3_one_model=False-overlap_scheduler=False]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[enable_cuda_graph_for_draft_model=False-sampler_async_worker=False-eagle3_one_model=True-overlap_scheduler=True]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[enable_cuda_graph_for_draft_model=False-sampler_async_worker=True-eagle3_one_model=True-overlap_scheduler=True]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[enable_cuda_graph_for_draft_model=True-sampler_async_worker=True-eagle3_one_model=True-overlap_scheduler=True]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_ngram
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_guided_decoding[xgrammar]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_guided_decoding[llguidance]
7 changes: 4 additions & 3 deletions tests/integration/test_lists/qa/llm_function_rtx6k.txt
@@ -59,9 +59,10 @@ accuracy/test_llm_api_pytorch.py::TestLlama3_1_8B::test_nvfp4
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_chunked_prefill[attn_backend=FLASHINFER]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_chunked_prefill[attn_backend=TRTLLM]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8_llm_sampler
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[sampler_async_worker=False-eagle3_one_model=True-overlap_scheduler=True]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[sampler_async_worker=False-eagle3_one_model=False-overlap_scheduler=False]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[sampler_async_worker=True-eagle3_one_model=True-overlap_scheduler=True]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[enable_cuda_graph_for_draft_model=False-sampler_async_worker=False-eagle3_one_model=False-overlap_scheduler=False]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[enable_cuda_graph_for_draft_model=False-sampler_async_worker=False-eagle3_one_model=True-overlap_scheduler=True]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[enable_cuda_graph_for_draft_model=False-sampler_async_worker=True-eagle3_one_model=True-overlap_scheduler=True]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[enable_cuda_graph_for_draft_model=True-sampler_async_worker=True-eagle3_one_model=True-overlap_scheduler=True]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_ngram
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_guided_decoding[xgrammar]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_guided_decoding[llguidance]