-
Notifications
You must be signed in to change notification settings - Fork 2k
[TRTLLM-6471] Infra: unwaive nixl tests and some disagg-serve tests #6095
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
/bot run --add-multi-gpu-test --disable-fail-fast |
|
PR_Github #12089 [ run ] triggered by Bot |
|
PR_Github #12089 [ run ] completed with state |
d228974 to
341c14d
Compare
WalkthroughThe changes re-enable NIXL backend cache transceiver tests in the multi-GPU integration test suite and expand test coverage for the DeepSeek-V3-Lite-fp8 model variant. Multiple test list files are updated to include new test cases for various configurations, communication backends, and executor types, enhancing integration and sanity testing. Changes
Sequence Diagram(s)sequenceDiagram
participant TestRunner
participant NIXLBackend
participant CacheTransceiver
TestRunner->>NIXLBackend: Initialize NIXL backend
NIXLBackend->>CacheTransceiver: Start cache transceiver tests
CacheTransceiver-->>NIXLBackend: Return test results
NIXLBackend-->>TestRunner: Report test outcomes
sequenceDiagram
participant TestList
participant TestRunner
participant DisaggregatedTest
participant Backend
TestList->>TestRunner: Provide new DeepSeek-V3-Lite-fp8 test entries
TestRunner->>DisaggregatedTest: Run test with specified config
DisaggregatedTest->>Backend: Execute backend-specific logic (UCX/NIXL)
Backend-->>DisaggregatedTest: Return results
DisaggregatedTest-->>TestRunner: Report pass/fail
Suggested reviewers
Poem
📜 Recent review detailsConfiguration used: .coderabbit.yaml 📒 Files selected for processing (5)
💤 Files with no reviewable changes (1)
🚧 Files skipped from review as they are similar to previous changes (4)
✨ Finishing Touches
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
|
/bot run --add-multi-gpu-test --disable-fail-fast |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
tests/integration/test_lists/test-db/l0_dgx_b200.yml (1)
67-76: YAML list grows large – consider logical grouping / aliasesTen new DeepSeek-V3-Lite tests were appended verbatim.
For maintainability you may want to:
- Collapse identical prefixes via YAML anchors / aliases, or
- Group by
{ucx|nixl|mpi}back-end to keep diffs minimal.Not a blocker; purely a readability nit.
tests/integration/test_lists/test-db/l0_dgx_h100.yml (1)
135-144: Newnixl_kvcachecases added – runtime budget may overflowThe H100 pre-merge block is already the longest in the repo.
Adding six morenixl_kvcachepermutations (symmetric & asymmetric) may push wall-time beyond the 90-minute guardrail.Consider:
- - cpp/test_multi_gpu.py::TestDisagg::test_symmetric_executor[llama-8proc-nixl_kvcache-90] + - cpp/test_multi_gpu.py::TestDisagg::test_symmetric_executor[llama-8proc-nixl_kvcache-60] # cut timeoutor mark a subset as
post_mergeif coverage overlap is acceptable.Also applies to: 151-153
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
tests/integration/defs/cpp/test_multi_gpu.py(0 hunks)tests/integration/test_lists/qa/examples_test_list.txt(1 hunks)tests/integration/test_lists/qa/llm_sanity_test.txt(1 hunks)tests/integration/test_lists/test-db/l0_dgx_b200.yml(1 hunks)tests/integration/test_lists/test-db/l0_dgx_h100.yml(1 hunks)
💤 Files with no reviewable changes (1)
- tests/integration/defs/cpp/test_multi_gpu.py
🔇 Additional comments (2)
tests/integration/test_lists/qa/examples_test_list.txt (1)
594-594: Confirm NIXL backend is enabled in CI before mergingThe new entry relies on the
nixlcomms stack. Please double-check that the CI images used by the QA stage already have the NIXL runtime & env-vars pre-installed/exported (e.g.NV_NIXL_ENABLE=1).
Otherwise this line will systematically xfail / hang.tests/integration/test_lists/qa/llm_sanity_test.txt (1)
63-63: Same NIXL availability concern applies to the sanity suiteEnsure the pre-merge runners for the sanity job mirror the QA configuration, or this addition will create an asymmetry between suites.
|
PR_Github #12134 [ run ] triggered by Bot |
|
PR_Github #12134 [ run ] completed with state |
|
@litaotju @Shixiaowei02 @ZhanruiSunCh please help review/merge the PR, thanks! |
341c14d to
6be1357
Compare
|
/bot run --add-multi-gpu-test --disable-fail-fast |
|
PR_Github #12254 [ run ] triggered by Bot |
|
PR_Github #12254 [ run ] completed with state |
Signed-off-by: Bo Deng <[email protected]>
Signed-off-by: Bo Deng <[email protected]>
b0937a1 to
97a7286
Compare
|
/bot run --add-multi-gpu-test --disable-fail-fast |
|
PR_Github #12280 [ run ] triggered by Bot |
|
PR_Github #12280 [ run ] completed with state |
|
/bot run --add-multi-gpu-test |
|
PR_Github #12311 [ run ] triggered by Bot |
|
PR_Github #12311 [ run ] completed with state |
…VIDIA#6095) Signed-off-by: Bo Deng <[email protected]>
…VIDIA#6095) Signed-off-by: Bo Deng <[email protected]>
…VIDIA#6095) Signed-off-by: Bo Deng <[email protected]> Signed-off-by: Shreyas Misra <[email protected]>
Description
unwaive nixl tests and some disagg-serve tests
Summary by CodeRabbit