feat: Enable AutoDeploy to llm-eval example #4020
Conversation
lucaslie left a comment
Please make sure to update the usage of lm_eval_ad.py as well, for example, in the integration tests
why is this not set anymore?
where is this coming from?
Hi @meenchen, does it support
All run with Llama3.1 and the number is
I cannot run TP>1 with
Force-pushed ff6bcc9 to 731b278 (compare)
@Fridah-nv, I am able to run TP>2 for gsm8k. My command:
What's the error you encounter?
Force-pushed 4b4e476 to 33fe071 (compare)
@Fridah-nv any updates?
@meenchen will wrap up this PR. Please let me know if there are further questions.
Signed-off-by: weimingc <[email protected]>
Force-pushed 33fe071 to 4b6a7fd (compare)
📝 Walkthrough
The changes introduce expanded backend support and chat templating in the model evaluation harness, update documentation and test configurations to reflect new usage patterns, and adjust testing to use the new backend and parameter formats. Some tests are disabled, and dependency versions are updated. Minor code cleanup and configuration adjustments are also included.
Sequence Diagram(s)
sequenceDiagram
participant User
participant LM_Eval_Harness
participant TRTLLMEvalBase
participant LLM_Backend
User->>LM_Eval_Harness: Run evaluation with CLI args (model, backend, etc.)
LM_Eval_Harness->>TRTLLMEvalBase: Initialize with backend and parameters
TRTLLMEvalBase->>LLM_Backend: Configure backend (torch/pytorch/autodeploy)
TRTLLMEvalBase->>TRTLLMEvalBase: Apply chat template (if needed)
TRTLLMEvalBase->>LLM_Backend: Generate or compute loglikelihood with temperature
LLM_Backend-->>TRTLLMEvalBase: Return results
TRTLLMEvalBase-->>LM_Eval_Harness: Return evaluation results
LM_Eval_Harness-->>User: Output metrics/results
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~15–25 minutes
Actionable comments posted: 2
♻️ Duplicate comments (2)
examples/llm-eval/lm-eval-harness/lm_eval_tensorrt_llm.py (2)
177-178: Clarify why max_length is no longer set. A previous reviewer asked why self.max_length is no longer being set. Please provide clarification on this change.
188-203: Clarify the source of the chat template implementation. A previous reviewer asked about the origin of this code. Please provide context on whether this is adapted from another implementation or newly written.
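For background on what such a chat-template path usually does, here is a minimal sketch of the common pattern, assuming a Hugging Face tokenizer; the model name and messages are placeholders, and this is not necessarily how the PR implements it:

```python
# Minimal sketch of applying a chat template before generation.
# Assumes a Hugging Face tokenizer; the model name and messages below are
# placeholders, not taken from this PR.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
]

# Render the conversation into a single prompt string using the tokenizer's
# built-in chat template, appending the assistant turn header for generation.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```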
🧹 Nitpick comments (3)
examples/llm-eval/lm-eval-harness/lm_eval_tensorrt_llm.py (2)
71-77: Consider adding docstrings for new parameters. The new parameters attn_backend, tokenized_requests, and temperature would benefit from documentation explaining their purpose and valid values.
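A possible shape for that documentation, following the Google docstring style called out in the coding guidelines below; the class stub, defaults, and wording here are illustrative assumptions rather than the PR's actual signature:

```python
class TRTLLMEvalBase:  # illustrative stub only, not the class from this PR
    def __init__(
        self,
        attn_backend: str = "TRTLLM",      # assumed default, for illustration
        tokenized_requests: bool = False,  # assumed default
        temperature: float = 0.0,          # assumed default
        **kwargs,
    ):
        """Initialize the evaluation wrapper.

        Args:
            attn_backend: Name of the attention backend forwarded to the LLM
                configuration (for example "TRTLLM" or "FlashInfer").
            tokenized_requests: Whether incoming requests are already tokenized
                before being handed to the backend.
            temperature: Sampling temperature applied to all generation
                requests; 0 yields deterministic (greedy) decoding.
            **kwargs: Remaining backend-specific options.
        """
        self.attn_backend = attn_backend
        self.tokenized_requests = tokenized_requests
        self.temperature = temperature
```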
180-187: Fix docstring formatting issues. The docstring needs a blank line after the summary, and line 184 exceeds the character limit.

     @property
     def tokenizer_name(self) -> str:
         """Must be defined for LM subclasses which implement Chat Templating.
    +    Should return the name of the tokenizer or chat template used.
    -    Used only to properly fingerprint caches when requests are being cached with `--cache_requests`, otherwise not used.
    +    Used only to properly fingerprint caches when requests are being cached
    +    with `--cache_requests`, otherwise not used.
         """
         return ""

tests/unittest/_torch/auto_deploy/integration/test_lm_eval.py (1)
61-124: Consider using pytest.mark.skip instead of commenting out test cases. Rather than commenting out test parameters, consider keeping them active with pytest.mark.skip decorators. This preserves the test definitions while clearly marking them as temporarily disabled.
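A minimal sketch of that pattern; the parameter values and skip reason are placeholders rather than the actual entries in test_lm_eval.py:

```python
import pytest

@pytest.mark.parametrize(
    "model_name",
    [
        "model_a",  # placeholder: an active test case
        pytest.param(
            "model_b",  # placeholder: a temporarily disabled case
            marks=pytest.mark.skip(reason="temporarily disabled in this PR"),
        ),
    ],
)
def test_lm_eval_accuracy(model_name):
    # Skipped parameters still appear in the test report as "skipped"
    # instead of silently vanishing from the suite.
    ...
```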
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- examples/auto_deploy/README.md (1 hunks)
- examples/auto_deploy/build_and_run_flux.py (0 hunks)
- examples/llm-eval/lm-eval-harness/lm_eval_tensorrt_llm.py (8 hunks)
- examples/llm-eval/lm-eval-harness/requirements.txt (1 hunks)
- tests/unittest/_torch/auto_deploy/integration/test_lm_eval.py (3 hunks)
- tests/unittest/pytest.ini (1 hunks)
💤 Files with no reviewable changes (1)
- examples/auto_deploy/build_and_run_flux.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a class in the constructor in Python.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
- examples/llm-eval/lm-eval-harness/lm_eval_tensorrt_llm.py
- tests/unittest/_torch/auto_deploy/integration/test_lm_eval.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
- examples/llm-eval/lm-eval-harness/lm_eval_tensorrt_llm.py
- tests/unittest/_torch/auto_deploy/integration/test_lm_eval.py
🧠 Learnings (7)
📓 Common learnings
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
- tests/unittest/pytest.ini
- examples/auto_deploy/README.md
- examples/llm-eval/lm-eval-harness/lm_eval_tensorrt_llm.py
- tests/unittest/_torch/auto_deploy/integration/test_lm_eval.py
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Applied to files:
- tests/unittest/pytest.ini
- examples/auto_deploy/README.md
- examples/llm-eval/lm-eval-harness/lm_eval_tensorrt_llm.py
- tests/unittest/_torch/auto_deploy/integration/test_lm_eval.py
📚 Learning: in tensorrt-llm, test files (files under tests/ directories) do not require nvidia copyright headers...
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
- examples/auto_deploy/README.md
- examples/llm-eval/lm-eval-harness/lm_eval_tensorrt_llm.py
- tests/unittest/_torch/auto_deploy/integration/test_lm_eval.py
📚 Learning: applies to **/*.py : the code developed for tensorrt-llm should conform to python 3.8+....
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T08:45:40.701Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.
Applied to files:
- examples/auto_deploy/README.md
- examples/llm-eval/lm-eval-harness/lm_eval_tensorrt_llm.py
- tests/unittest/_torch/auto_deploy/integration/test_lm_eval.py
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Applied to files:
- examples/auto_deploy/README.md
- examples/llm-eval/lm-eval-harness/lm_eval_tensorrt_llm.py
- tests/unittest/_torch/auto_deploy/integration/test_lm_eval.py
📚 Learning: ministral is a valid model name from mistral ai, distinct from the regular mistral models. in tensor...
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid model name from Mistral AI, distinct from the regular Mistral models. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".
Applied to files:
- examples/auto_deploy/README.md
🪛 Ruff (0.12.2)
examples/llm-eval/lm-eval-harness/lm_eval_tensorrt_llm.py
128-128: Line too long (121 > 120)
(E501)
131-131: Local variable pytorch_config is assigned to but never used
Remove assignment to unused variable pytorch_config
(F841)
182-185: 1 blank line required between summary line and description
(D205)
184-184: Line too long (124 > 120)
(E501)
🔇 Additional comments (8)
examples/llm-eval/lm-eval-harness/requirements.txt (1)
1-1: LGTM! The patch version update from 0.4.7 to 0.4.8 is appropriate for the new backend support features.
examples/llm-eval/lm-eval-harness/lm_eval_tensorrt_llm.py (3)
46-52: LGTM! The JsonChatStr utility class is well-designed for handling JSON-encoded chat prompts.
324-326: LGTM! The temperature override implementation allows for consistent generation behavior across all requests.
139-151: LGTM! The LLM instantiation correctly supports the new backend architecture.
tests/unittest/pytest.ini (1)
8-8: LGTM! The pythonpath update correctly reflects the new location of the evaluation harness module.
examples/auto_deploy/README.md (1)
195-200: LGTM! The documentation updates correctly reflect the new evaluation harness usage with autodeploy backend support. The example command and reference link are helpful for users.
tests/unittest/_torch/auto_deploy/integration/test_lm_eval.py (2)
12-12: LGTM! The import correctly references the new evaluation harness module.
133-140: LGTM! The model arguments correctly configure the autodeploy backend with deterministic generation settings appropriate for testing.
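As an illustration of those settings, here is a sketch of how a deterministic autodeploy configuration maps onto the comma-separated model_args string that lm-eval-harness expects; it mirrors the command under Test Coverage further below, and the checkpoint path is a placeholder:

```python
# Illustrative model_args for the autodeploy backend; the checkpoint path is a
# placeholder, and the keys mirror the command shown under "Test Coverage".
model_args = {
    "model": "/path/to/Meta-Llama-3.1-8B",  # placeholder checkpoint location
    "backend": "autodeploy",                # select the AutoDeploy backend
    "max_context_length": 2048,             # required for the autodeploy backend
    "max_gen_toks": 128,
    "tp": 1,                                # tensor-parallel size
    "temperature": 0,                       # greedy decoding for reproducible numbers
}

# lm-eval-harness takes model_args as a single comma-separated string on the CLI.
model_args_str = ",".join(f"{key}={value}" for key, value in model_args.items())
print(model_args_str)
```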
    if self.backend == 'pytorch':
        if hasattr(PyTorchConfig, "moe_backend"):
            pytorch_config_params["moe_backend"] = self.moe_backend
            print(f"Info: moe_backend is set to {self.moe_backend}")
        pytorch_config_params["attn_backend"] = attn_backend.upper()
        print(f"Info: attn_backend is set to {attn_backend}")
        pytorch_config = PyTorchConfig(**pytorch_config_params)
Remove duplicate moe_backend configuration.
The moe_backend is already configured at lines 116-118. This duplicate block is redundant.
     if self.backend == 'pytorch':
    -    if hasattr(PyTorchConfig, "moe_backend"):
    -        pytorch_config_params["moe_backend"] = self.moe_backend
    -        print(f"Info: moe_backend is set to {self.moe_backend}")
         pytorch_config_params["attn_backend"] = attn_backend.upper()
         print(f"Info: attn_backend is set to {attn_backend}")
         pytorch_config = PyTorchConfig(**pytorch_config_params)

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    if self.backend == 'pytorch':
        pytorch_config_params["attn_backend"] = attn_backend.upper()
        print(f"Info: attn_backend is set to {attn_backend}")
        pytorch_config = PyTorchConfig(**pytorch_config_params)
🤖 Prompt for AI Agents
In examples/llm-eval/lm-eval-harness/lm_eval_tensorrt_llm.py around lines 120 to
126, remove the duplicate setting of moe_backend in the pytorch_config_params
dictionary since it is already configured at lines 116-118. Keep only the unique
configuration lines such as setting attn_backend and creating the PyTorchConfig
instance, eliminating the redundant moe_backend assignment and its print
statement.
    elif self.backend == 'autodeploy':
        assert self.max_context_length is not None, "max_context_length must be specified for autodeploy backend"
        # Only FlashInfer is supported for autodeploy backend.
        pytorch_config_params["attn_backend"] = "FlashInfer"
        pytorch_config = AutoDeployConfig(**pytorch_config_params)
Fix unused variable and line length issues.
The pytorch_config variable is created but not used. Also, line 128 exceeds the 120 character limit.
     elif self.backend == 'autodeploy':
    -    assert self.max_context_length is not None, "max_context_length must be specified for autodeploy backend"
    +    assert self.max_context_length is not None, \
    +        "max_context_length must be specified for autodeploy backend"
         # Only FlashInfer is supported for autodeploy backend.
         pytorch_config_params["attn_backend"] = "FlashInfer"
    -    pytorch_config = AutoDeployConfig(**pytorch_config_params)
    +    # AutoDeployConfig is passed directly through kwargs

Since the config is passed via kwargs, there's no need to assign it to pytorch_config.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    elif self.backend == 'autodeploy':
        assert self.max_context_length is not None, \
            "max_context_length must be specified for autodeploy backend"
        # Only FlashInfer is supported for autodeploy backend.
        pytorch_config_params["attn_backend"] = "FlashInfer"
        # AutoDeployConfig is passed directly through kwargs
🧰 Tools
🪛 Ruff (0.12.2)
128-128: Line too long (121 > 120)
(E501)
131-131: Local variable pytorch_config is assigned to but never used
Remove assignment to unused variable pytorch_config
(F841)
🤖 Prompt for AI Agents
In examples/llm-eval/lm-eval-harness/lm_eval_tensorrt_llm.py around lines 127 to
131, remove the assignment to the unused variable pytorch_config and instead
directly call AutoDeployConfig with the unpacked pytorch_config_params. Also,
break the long line 128 into multiple lines or simplify it to ensure it does not
exceed 120 characters.
PR title
Description
Test Coverage
python lm_eval_tensorrt_llm.py --model trt-llm --model_args model=/home/scratch.omniml_data_1/models/llama3.1/Meta-Llama-3.1-8B,backend=autodeploy,max_context_length=2048,max_gen_toks=128,tp=1,temperature=0 --tasks mmlu --batch_size 4
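For reference, a hedged sketch of driving the same evaluation through lm-eval-harness's Python API instead of the CLI; it assumes the custom trt-llm model wrapper from lm_eval_tensorrt_llm.py has already been imported and registered with the harness, and the checkpoint path is a placeholder:

```python
import lm_eval

# Programmatic equivalent of the CLI command above. Assumes the "trt-llm"
# model wrapper defined in lm_eval_tensorrt_llm.py has been registered with
# lm-eval-harness; the checkpoint path is a placeholder.
results = lm_eval.simple_evaluate(
    model="trt-llm",
    model_args=(
        "model=/path/to/Meta-Llama-3.1-8B,"  # placeholder checkpoint path
        "backend=autodeploy,"
        "max_context_length=2048,"
        "max_gen_toks=128,"
        "tp=1,"
        "temperature=0"
    ),
    tasks=["mmlu"],
    batch_size=4,
)
print(results["results"])
```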
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user-friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
Details
run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]
Launch build/test pipelines. All previously running jobs will be killed.
- --disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.
- --skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- --stage-list "A10-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.
- --gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- --only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- --disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- --add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests. Will also run the L0 pre-merge pipeline.
- --post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- --extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

kill
kill
Kill all running builds associated with the pull request.

skip
skip --comment COMMENT
Skip testing for the latest commit on the pull request.
--comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Chores