[None][doc] update feature_combination_matrix of disaggregated and chunked prefill

Signed-off-by: leslie-fang25 <leslief@nvidia.com>
leslie-fang25 committed Aug 19, 2025
commit aa5e86c517604da9d731428749f2608a4c4ea4bf
8 changes: 4 additions & 4 deletions docs/source/torch/features/feature_combination_matrix.md
@@ -6,13 +6,13 @@
 | CUDA Graph | Yes | --- | | | | | | | | | | | | |
 | Attention Data Parallelism | Yes | Yes | --- | | | | | | | | | | | |
 | Disaggregated Serving | Yes | Yes | Yes | --- | | | | | | | | | | |
-| Chunked Prefill | Yes | Yes | Yes | Untested | --- | | | | | | | | | |
+| Chunked Prefill | Yes | Yes | Yes | Yes | --- | | | | | | | | | |
 | MTP | Yes | Yes | Yes | Yes | Yes | --- | | | | | | | | |
 | EAGLE-3(One Model Engine) | Yes | Yes | Yes | Yes | Yes | No | --- | | | | | | | |
-| EAGLE-3(Two Model Engine) | NO | Yes | Yes | Yes | Yes | No | No | --- | | | | | | |
+| EAGLE-3(Two Model Engine) | No | Yes | Yes | Yes | Yes | No | No | --- | | | | | | |
 | Torch Sampler | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | --- | | | | | |
 | TLLM C++ Sampler | Yes | Yes | Yes | Yes | Yes | No | No | No | No | --- | | | | |
-| KV Cache Reuse | Yes | Yes | Yes | Untested | Yes | Untested | Yes | No | Yes | Yes | --- | | | |
-| Slide Window Attention | Yes | Yes | Yes | Untested | No | Untested | Untested | Untested | Yes | Yes | WIP | --- | | |
+| KV Cache Reuse | Yes | Yes | Yes | Yes | Yes | Untested | Yes | No | Yes | Yes | --- | | | |
+| Slide Window Attention | Yes | Yes | Yes | Yes | No | Untested | Untested | Untested | Yes | Yes | WIP | --- | | |
 | Logits Post Processor | No | Yes | Yes | No | Yes | No | No | No | Yes | Yes | Yes | Yes | --- | |
 | Guided Decoding | Yes | Yes | Yes | Yes | Yes | No | No | Yes | Yes | Yes | Yes | Yes | Yes | --- |
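The matrix is lower-triangular, so a cell is easiest to read by pairing a row with the column whose diagonal `---` belongs to another row. The sketch below is only an illustration of that reading, not part of the change: it assumes the columns follow the same order as the rows (the header row lies outside this hunk), and the helper names `parse_row`, `build_matrix`, and `compatibility` are hypothetical.

```python
# Rows copied verbatim from the updated side of the hunk above.
UPDATED_ROWS = [
    "| Disaggregated Serving | Yes | Yes | Yes | --- | | | | | | | | | | |",
    "| Chunked Prefill | Yes | Yes | Yes | Yes | --- | | | | | | | | | |",
    "| KV Cache Reuse | Yes | Yes | Yes | Yes | Yes | Untested | Yes | No | Yes | Yes | --- | | | |",
    "| Slide Window Attention | Yes | Yes | Yes | Yes | No | Untested | Untested | Untested | Yes | Yes | WIP | --- | | |",
]


def parse_row(line):
    """Split one Markdown table row into (feature name, list of status cells)."""
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    return cells[0], cells[1:]


def build_matrix(rows):
    """Return (column index of each feature, status cells of each feature).

    Assumption: a feature's column is the position of its diagonal `---`.
    """
    parsed = [parse_row(row) for row in rows]
    column_of = {name: statuses.index("---") for name, statuses in parsed}
    status = {name: statuses for name, statuses in parsed}
    return column_of, status


def compatibility(row_feature, column_feature, column_of, status):
    """Look up the cell where `row_feature` meets `column_feature`."""
    return status[row_feature][column_of[column_feature]]


if __name__ == "__main__":
    column_of, status = build_matrix(UPDATED_ROWS)
    for feature in ("Chunked Prefill", "KV Cache Reuse", "Slide Window Attention"):
        cell = compatibility(feature, "Disaggregated Serving", column_of, status)
        print(f"{feature} x Disaggregated Serving: {cell}")
```

Read this way, the three cells this change flips from `Untested` to `Yes` are all in the Disaggregated Serving column.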
37 changes: 37 additions & 0 deletions tests/integration/defs/accuracy/test_disaggregated_serving.py
@@ -797,6 +797,43 @@ def test_auto_dtype(self, overlap_scheduler):
             task = MMLU(self.MODEL_NAME)
             task.evaluate(llm)

+    def test_chunked_prefill(self):
+        ctx_server_config = {
+            "disable_overlap_scheduler": True,
+            "cuda_graph_config": None,
+            "cache_transceiver_config": {
+                "backend": "DEFAULT"
+            },
+            "enable_chunked_prefill": True,
+            "max_num_tokens": 256,
+        }
+        gen_server_config = {
+            "cuda_graph_config": None,
+            "cache_transceiver_config": {
+                "backend": "DEFAULT"
+            }
+        }
+        disaggregated_server_config = {
+            "hostname": "localhost",
+            "port": 8000,
+            "backend": "pytorch",
+            "context_servers": {
+                "num_instances": 1,
+                "urls": ["localhost:8001"]
+            },
+            "generation_servers": {
+                "num_instances": 1,
+                "urls": ["localhost:8002"]
+            }
+        }
+        with launch_disaggregated_llm(disaggregated_server_config,
+                                      ctx_server_config, gen_server_config,
+                                      self.MODEL_PATH) as llm:
+            task = GSM8K(self.MODEL_NAME)
+            task.evaluate(llm)
+            task = MMLU(self.MODEL_NAME)
+            task.evaluate(llm)
+

 @skip_pre_blackwell
 @pytest.mark.timeout(3600)
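The context server in the new test sets `enable_chunked_prefill` with `max_num_tokens` of 256, presumably so that longer prompts need more than one prefill pass. As a rough illustration of what that implies, here is a simplified sketch that splits a single prompt into chunks under a fixed token budget. It is an assumption-laden model, not the TensorRT-LLM scheduler: the real scheduler may share the budget across requests or align chunks to KV cache blocks, and `prefill_chunk_sizes` is a hypothetical helper.

```python
def prefill_chunk_sizes(prompt_len, max_num_tokens):
    """Return the chunk sizes for one prompt under a per-pass token budget.

    Simplifying assumption: the whole budget goes to this one request.
    """
    if prompt_len < 0:
        raise ValueError("prompt_len must be non-negative")
    if max_num_tokens <= 0:
        raise ValueError("max_num_tokens must be positive")
    full_chunks, remainder = divmod(prompt_len, max_num_tokens)
    sizes = [max_num_tokens] * full_chunks
    if remainder:
        sizes.append(remainder)  # final, partially filled chunk
    return sizes


if __name__ == "__main__":
    # A 1000-token prompt with the budget used in the test config.
    print(prefill_chunk_sizes(1000, 256))  # [256, 256, 256, 232]
```

Under this model, any prompt longer than 256 tokens takes at least two context passes.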
1 change: 1 addition & 0 deletions tests/integration/test_lists/test-db/l0_dgx_h100.yml
@@ -41,6 +41,7 @@ l0_dgx_h100:
 - accuracy/test_disaggregated_serving.py::TestLlama3_1_8BInstruct::test_ngram
 - accuracy/test_disaggregated_serving.py::TestQwen3_8B::test_auto_dtype[False]
 - accuracy/test_disaggregated_serving.py::TestQwen3_8B::test_auto_dtype[True]
+- accuracy/test_disaggregated_serving.py::TestQwen3_8B::test_chunked_prefill
 - accuracy/test_disaggregated_serving.py::TestLlama3_1_8BInstruct::test_eagle3[eagle3_one_model=False-overlap_scheduler=False]
 - accuracy/test_disaggregated_serving.py::TestLlama3_1_8BInstruct::test_eagle3[eagle3_one_model=True-overlap_scheduler=True]
 - accuracy/test_disaggregated_serving.py::TestLlama3_1_8BInstruct::test_guided_decoding[xgrammar]
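The test-list entries follow the pytest node ID shape `file::Class::test[params]`. The new entry has no bracketed suffix because `test_chunked_prefill` is not parametrized. A small sketch of how such an entry decomposes, with `split_node_id` as a hypothetical helper that is not part of the repository:

```python
def split_node_id(node_id):
    """Split a pytest-style node ID into file, class, test name, and params."""
    path, *names = node_id.split("::")
    test = names[-1] if names else None
    params = None
    if test and test.endswith("]") and "[" in test:
        # Parametrized IDs carry their parameters in a trailing [...] suffix.
        test, params = test[:-1].split("[", 1)
    return {
        "file": path,
        "class": names[0] if len(names) > 1 else None,
        "test": test,
        "params": params,
    }


if __name__ == "__main__":
    entry = "accuracy/test_disaggregated_serving.py::TestQwen3_8B::test_chunked_prefill"
    print(split_node_id(entry))
```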