infra: [TRTLLM-6499] Split L0_Test into two pipeline by single GPU and multi GPU(For SBSA) #6132
Walkthrough
The changes update Jenkins pipeline scripts to distinguish between x86_64 and SBSA architectures when handling multi-GPU test requirements. The build description markers and conditional logic for launching single- and multi-GPU test stages are now architecture-specific, with explicit gating and error handling for SBSA multi-GPU tests based on test outcomes and job type.
Sequence Diagram(s)
sequenceDiagram
    participant Jenkins
    participant L0_MergeRequest
    participant L0_Test
    Jenkins->>L0_MergeRequest: Start pipeline
    L0_MergeRequest->>L0_Test: Launch single-GPU test (x86_64 or SBSA)
    L0_Test->>L0_MergeRequest: Report result, update parent job description with arch-specific marker if needed
    alt Multi-GPU required (x86_64 or SBSA)
        L0_MergeRequest->>L0_Test: Launch multi-GPU test (only if single-GPU passes or post-merge)
        L0_Test->>L0_MergeRequest: Report multi-GPU test result
    end
Actionable comments posted: 0
🧹 Nitpick comments (2)
jenkins/L0_Test.groovy (2)
7-11: Duplicate KubernetesManager import - remove one occurrence
com.nvidia.bloom.KubernetesManager is imported twice (lines 7 and 10). Duplicate imports add noise and may confuse static-analysis tools.
-import com.nvidia.bloom.KubernetesManager
2308-2311: Guard against unknown architectures & keep marker values explicit
archStr currently defaults to "SBSA" for every non-x86_64 value. If env.targetArch is unset or a future arch constant is introduced, the marker string will silently mis-classify the build. Consider explicit branching with a safe fallback.
- def archStr = env.targetArch == X86_64_TRIPLE ? "x86_64" : "SBSA"
+ def archStr
+ switch (env.targetArch) {
+     case X86_64_TRIPLE:
+         archStr = "x86_64"
+         break
+     case AARCH64_TRIPLE:
+         archStr = "SBSA"
+         break
+     default:
+         archStr = "unknown"
+ }
This makes later consumers aware of unsupported targets instead of treating everything as SBSA.
Would you like a follow-up grep script to ensure every pipeline that parses the build description recognises the new "unknown" marker?
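The suggested branching can be tried outside Jenkins as a small standalone sketch. The triple strings below are assumed placeholder values, not the real X86_64_TRIPLE and AARCH64_TRIPLE constants from the pipeline scripts, and the ArchMarker class and archLabel method are hypothetical names used only for illustration.

```groovy
// Standalone sketch of the reviewer's suggestion; not the pipeline's actual code.
class ArchMarker {
    // Assumed placeholder values; the real constants may differ.
    static final String X86_64_TRIPLE = "x86_64-linux-gnu"
    static final String AARCH64_TRIPLE = "aarch64-linux-gnu"

    // Map a target triple to the label used in the build description marker.
    // Anything unrecognised (including null) falls through to "unknown".
    static String archLabel(String targetArch) {
        switch (targetArch) {
            case X86_64_TRIPLE:
                return "x86_64"
            case AARCH64_TRIPLE:
                return "SBSA"
            default:
                return "unknown"
        }
    }
}

println ArchMarker.archLabel(ArchMarker.AARCH64_TRIPLE)
println ArchMarker.archLabel(null)
```

With an explicit default, an unset env.targetArch yields "unknown" rather than being reported as SBSA, which is the mis-classification the comment warns about.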
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
jenkins/L0_MergeRequest.groovy (4 hunks)
jenkins/L0_Test.groovy (1 hunk)
🔇 Additional comments (4)
jenkins/L0_MergeRequest.groovy (4)
975-975: Architecture-specific multi-GPU testing marker looks good.
The change from generic "Require Multi-GPU Testing" to "Require x86_64 Multi-GPU Testing" properly distinguishes between architectures, aligning with the PR's objective to split test pipelines.
1033-1078: SBSA single-GPU test stage implementation is well-structured.
The new stage follows the established pattern from x86_64 testing with proper:
- Error handling that respects SBSA_TEST_CHOICE configuration
- Failure tracking via singleGpuTestFailed for multi-GPU gating
- Clear stage naming and job invocation
1080-1099: Multi-GPU gating logic properly handles test dependencies.
The implementation correctly:
- Checks for SBSA-specific multi-GPU testing marker
- Blocks multi-GPU tests in pre-merge when single-GPU fails (preventing resource waste)
- Allows multi-GPU tests to proceed in post-merge for comprehensive validation
- Mirrors the x86_64 pattern for consistency
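The gating rule described in this comment can be sketched as a plain Groovy function. This is an illustration under stated assumptions, not the code in L0_MergeRequest.groovy: the function name shouldRunMultiGpu and its parameters are hypothetical, and the SBSA marker text is assumed to follow the same "Require <arch> Multi-GPU Testing" pattern as the x86_64 marker quoted above.

```groovy
// Hypothetical helper illustrating the gating rule; names are not from the PR.
boolean shouldRunMultiGpu(String buildDescription, String arch,
                          boolean singleGpuTestFailed, boolean isPostMerge) {
    // Assumed marker format, modelled on "Require x86_64 Multi-GPU Testing".
    String marker = "Require ${arch} Multi-GPU Testing"

    // No marker in the build description means multi-GPU testing was not requested.
    if (!buildDescription?.contains(marker)) {
        return false
    }

    // Post-merge always proceeds; pre-merge proceeds only if single-GPU tests passed.
    return isPostMerge || !singleGpuTestFailed
}

println shouldRunMultiGpu("Require SBSA Multi-GPU Testing", "SBSA", true, false)
println shouldRunMultiGpu("Require SBSA Multi-GPU Testing", "SBSA", true, true)
```

Keeping the decision in one boolean function makes the pre-merge and post-merge behaviour easy to check without launching any downstream job.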
1101-1114: Clear naming for SBSA multi-GPU stage and job.
The changes improve clarity by:
- Explicitly naming the stage as "[Test-SBSA-Multi-GPU]"
- Using the descriptive job name "L0_Test-SBSA-Multi-GPU"
This aligns perfectly with the PR's goal of splitting single and multi-GPU test pipelines.
/bot run --skip-test --add-multi-gpu-test
PR_Github #12173 [ run ] triggered by Bot
PR_Github #12173 [ run ] completed with state
67040a2 to 48b068d
/bot run --skip-test --add-multi-gpu-test
PR_Github #12833 [ run ] triggered by Bot
PR_Github #12833 [ run ] completed with state
@coderabbitai review
✅ Actions performed
Review triggered.
/bot run
PR_Github #12975 [ run ] triggered by Bot
PR_Github #12975 [ run ] completed with state
/bot run
PR_Github #13111 [ run ] triggered by Bot
PR_Github #13111 [ run ] completed with state
48b068d to c1048e9
/bot reuse-pipeline
PR_Github #13188 [ reuse-pipeline ] triggered by Bot
PR_Github #13188 [ reuse-pipeline ] completed with state
/bot reuse-pipeline
PR_Github #13190 [ reuse-pipeline ] triggered by Bot
PR_Github #13190 [ reuse-pipeline ] completed with state
…d multi GPU(For SBSA) Signed-off-by: ZhanruiSunCh <[email protected]>
Signed-off-by: ZhanruiSunCh <[email protected]>
Signed-off-by: ZhanruiSunCh <[email protected]>
c1048e9 to f707744
/bot reuse-pipeline
PR_Github #13216 [ reuse-pipeline ] triggered by Bot
PR_Github #13216 [ reuse-pipeline ] completed with state
/bot reuse-pipeline 9808
PR_Github #13265 Bot args parsing error: usage: /bot [-h]
/bot skip --comment "/LLM/main/L0_MergeRequest_PR pipeline #9808 completed with status: 'SUCCESS'"
PR_Github #13275 [ skip ] triggered by Bot
PR_Github #13275 [ skip ] completed with state
…d multi GPU(For SBSA) (NVIDIA#6132) Signed-off-by: ZhanruiSunCh <[email protected]> Signed-off-by: Lanyu Liao <[email protected]>
…d multi GPU(For SBSA)
Summary by CodeRabbit
New Features
Improvements
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message.
See details below for each supported subcommand.
Details
run
run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]
Launch build/test pipelines. All previously running jobs will be killed.
- --disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- --skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- --stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.
- --gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- --only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- --disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- --add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.
- --post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- --extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.