DeepEP LL support variable hidden size and tokens num #6141
Conversation
Walkthrough

The updates modify the DeepEP dependency version in the build configuration, replace the fixed maximum token count with the actual per-rank token count in the low-latency dispatch and combine paths, and relax the dispatch buffer's token-count validation from strict equality to an upper-bound check.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant FusedMoEWideEP
    participant VariableLengthLowLatencyBuffer
    participant DeepEP
    User->>FusedMoEWideEP: Call forward(...)
    FusedMoEWideEP->>VariableLengthLowLatencyBuffer: low_latency_dispatch(all_rank_max_num_tokens)
    VariableLengthLowLatencyBuffer->>VariableLengthLowLatencyBuffer: Assert num_experts consistency
    VariableLengthLowLatencyBuffer->>DeepEP: Dispatch tokens
    DeepEP-->>VariableLengthLowLatencyBuffer: Return results
    FusedMoEWideEP->>FusedMoEWideEP: Combine results (direct call, no adapter)
    FusedMoEWideEP-->>User: Return output
```
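For orientation, here is a minimal Python sketch of the caller-side flow in the diagram above; the class, attribute, and helper names are assumptions for illustration, and the expert computation is stubbed out:

```python
class LowLatencyMoEDispatchSketch:
    """Hypothetical caller-side flow mirroring the sequence diagram."""

    def __init__(self, deep_ep_buffer, deep_ep_max_num_tokens, num_slots):
        self.deep_ep_buffer = deep_ep_buffer                   # wraps the DeepEP low-latency buffer
        self.deep_ep_max_num_tokens = deep_ep_max_num_tokens   # configured buffer capacity
        self.num_slots = num_slots                             # total expert slots

    def forward(self, x, topk_idx, topk_weights, all_rank_max_num_tokens):
        # The actual per-iteration token count must fit the preallocated buffer.
        assert all_rank_max_num_tokens <= self.deep_ep_max_num_tokens

        # Dispatch with the actual per-rank maximum rather than the configured cap,
        # so no padded copy up to deep_ep_max_num_tokens is required.
        recv_x, recv_expert_count, handle = self.deep_ep_buffer.low_latency_dispatch(
            x, topk_idx, all_rank_max_num_tokens, self.num_slots)

        expert_out = self._run_experts(recv_x, recv_expert_count)

        # Combine is called directly on the buffer ("direct call, no adapter").
        return self.deep_ep_buffer.low_latency_combine(
            expert_out, topk_idx, topk_weights, handle)

    def _run_experts(self, recv_x, recv_expert_count):
        # Stand-in for the fused MoE expert computation.
        return recv_x
```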
Actionable comments posted: 0
🧹 Nitpick comments (2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
465-469: Good safety check and more accurate token count usage.

The assertion ensures we don't exceed the configured buffer size, and using `all_rank_max_num_tokens` instead of the fixed maximum aligns with the actual dispatch requirements. However, line 467 exceeds the 120-character limit.

```diff
-        x, recv_expert_count, deep_ep_handle = \
-            self.deep_ep_buffer.low_latency_dispatch(x, deep_ep_topk_idx, all_rank_max_num_tokens, self.num_slots)
+        x, recv_expert_count, deep_ep_handle = self.deep_ep_buffer.low_latency_dispatch(
+            x, deep_ep_topk_idx, all_rank_max_num_tokens, self.num_slots)
```
623-626: Consistent safety check for postquant dispatch.

Good to see the same assertion and token count handling applied here for consistency. Line 626 exceeds the 120-character limit.

```diff
-        fp4_packed_tensor, recv_expert_count, deep_ep_handle = \
-            self.deep_ep_buffer.low_latency_dispatch(fp4_packed_tensor, deep_ep_topk_idx, all_rank_max_num_tokens, self.num_slots)
+        fp4_packed_tensor, recv_expert_count, deep_ep_handle = self.deep_ep_buffer.low_latency_dispatch(
+            fp4_packed_tensor, deep_ep_topk_idx, all_rank_max_num_tokens, self.num_slots)
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- cpp/tensorrt_llm/deep_ep/CMakeLists.txt (1 hunks)
- tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (1 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
467-467: Line too long (126 > 120)
(E501)
626-626: Line too long (138 > 120)
(E501)
🔇 Additional comments (2)
cpp/tensorrt_llm/deep_ep/CMakeLists.txt (1)
1-1: DeepEP dependency updated to newer commit. The update aligns with the refined token handling changes in the Python modules.
tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (1)
142-142: Good improvement to validation logic. Relaxing from strict equality to a less-than-or-equal check allows more flexible dispatch token counts while still preventing buffer overflow. This change enables dispatching with variable token counts up to the configured maximum.
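For illustration, a minimal standalone sketch of the relaxed bound check described above; the function and parameter names are hypothetical, not the actual deep_ep_utils.py code:

```python
def check_dispatch_token_count(num_max_dispatch_tokens_per_rank: int,
                               configured_max_tokens: int) -> None:
    # Before (sketch): only the exact configured size was accepted.
    #   assert num_max_dispatch_tokens_per_rank == configured_max_tokens
    # After: any count up to the configured buffer capacity is allowed,
    # permitting variable token counts while still preventing overflow.
    assert num_max_dispatch_tokens_per_rank <= configured_max_tokens, (
        f"dispatch token count {num_max_dispatch_tokens_per_rank} exceeds "
        f"buffer capacity {configured_max_tokens}")
```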
/bot run

PR_Github #12199 [ run ] triggered by Bot

/bot run
Actionable comments posted: 1
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- cpp/tensorrt_llm/deep_ep/CMakeLists.txt (1 hunks)
- tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (1 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
- cpp/tensorrt_llm/deep_ep/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py
🧰 Additional context used
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
467-467: Line too long (126 > 120)
(E501)
626-626: Line too long (138 > 120)
(E501)
🔇 Additional comments (2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
465-468: LGTM: Assertion and dispatch parameter update look correct.

The assertion ensures `all_rank_max_num_tokens` doesn't exceed the configured limit before dispatch, and updating the dispatch call to use the actual maximum tokens per rank instead of the hardcoded limit is a good improvement for handling variable token counts.
711-717: low_latency_combine signature & reshape usage verified

The `low_latency_combine` method in `deep_ep_utils.py` is defined as:

```python
def low_latency_combine(self, hidden_states: torch.Tensor,
                        topk_idx: torch.Tensor, topk_weights: torch.Tensor,
                        handle: Tuple): …
```

This matches the call in `tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py` (lines 711–717):

```python
num_tokens_per_expert_for_fused_moe = (
    self.mapping.moe_ep_size * all_rank_max_num_tokens
)
final_hidden_states = final_hidden_states.view(
    self.expert_size_per_partition,
    num_tokens_per_expert_for_fused_moe,
    self.hidden_size,
)
final_hidden_states = self.deep_ep_buffer.low_latency_combine(
    final_hidden_states,
    deep_ep_topk_idx,
    deep_ep_topk_weights,
    deep_ep_handle,
)
```

Reshaping with `all_rank_max_num_tokens` produces the expected `(expert_size_per_partition, num_tokens_per_expert, hidden_size)` tensor, and all arguments align with the combine implementation. No changes needed.
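As a quick sanity check of the reshape arithmetic above, a small self-contained example with made-up sizes (the numbers are illustrative only, not from the PR):

```python
import torch

# Illustrative sizes only.
expert_size_per_partition = 4      # experts owned by this rank
moe_ep_size = 8                    # expert-parallel world size
all_rank_max_num_tokens = 16       # actual per-rank max for this iteration
hidden_size = 512

num_tokens_per_expert = moe_ep_size * all_rank_max_num_tokens  # 8 * 16 = 128

# Flat combine input of shape (experts * tokens, hidden) reshaped to 3D.
flat = torch.zeros(expert_size_per_partition * num_tokens_per_expert, hidden_size)
reshaped = flat.view(expert_size_per_partition, num_tokens_per_expert, hidden_size)
assert reshaped.shape == (4, 128, 512)
```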
PR_Github #12200 [ run ] triggered by Bot

PR_Github #12199 [ run ] completed with state

PR_Github #12200 [ run ] completed with state

/bot run

PR_Github #12268 [ run ] triggered by Bot

/bot run

PR_Github #12275 [ run ] triggered by Bot

PR_Github #12268 [ run ] completed with state

/bot run
Actionable comments posted: 0
♻️ Duplicate comments (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
625-628: Consistent implementation with remaining formatting issue.

The assertion and dispatch call update are correctly implemented, consistent with the earlier occurrence in the file. However, the line length violation on line 628 still needs to be addressed.

Apply this formatting fix:

```diff
-            self.deep_ep_buffer.low_latency_dispatch(fp4_packed_tensor, deep_ep_topk_idx, all_rank_max_num_tokens, self.num_slots)
+        self.deep_ep_buffer.low_latency_dispatch(
+            fp4_packed_tensor, deep_ep_topk_idx, all_rank_max_num_tokens, self.num_slots)
```
🧹 Nitpick comments (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
466-468: Logic changes look good, but fix line length violation.

The assertion correctly validates the token count constraint, and the dispatch call update aligns with the PR objective of using actual token counts instead of hardcoded maximums. However, there's a line length violation on line 468. Apply this formatting fix:

```diff
-            self.deep_ep_buffer.low_latency_dispatch(x, deep_ep_topk_idx, all_rank_max_num_tokens, self.num_slots)
+        self.deep_ep_buffer.low_latency_dispatch(
+            x, deep_ep_topk_idx, all_rank_max_num_tokens, self.num_slots)
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- cpp/tensorrt_llm/deep_ep/CMakeLists.txt (1 hunks)
- tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (3 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
- cpp/tensorrt_llm/deep_ep/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py
🧰 Additional context used
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
468-468: Line too long (126 > 120)
(E501)
628-628: Line too long (138 > 120)
(E501)
PR_Github #12299 [ run ] triggered by Bot

PR_Github #12275 [ run ] completed with state

/bot run
Actionable comments posted: 0
♻️ Duplicate comments (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
625-628: Fix line length violation - duplicate formatting issue.

This segment has the same line length violation as the previous dispatch call. The assertion and parameter logic are correct, but the formatting needs to be consistent.

Apply this diff to fix the line length violation:

```diff
-        assert all_rank_max_num_tokens <= self.deep_ep_max_num_tokens
-        fp4_packed_tensor, recv_expert_count, deep_ep_handle = \
-            self.deep_ep_buffer.low_latency_dispatch(fp4_packed_tensor, deep_ep_topk_idx, all_rank_max_num_tokens, self.num_slots)
+        assert all_rank_max_num_tokens <= self.deep_ep_max_num_tokens
+        fp4_packed_tensor, recv_expert_count, deep_ep_handle = self.deep_ep_buffer.low_latency_dispatch(
+            fp4_packed_tensor, deep_ep_topk_idx, all_rank_max_num_tokens, self.num_slots
+        )
```
🧹 Nitpick comments (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
466-470: Fix line length violation while maintaining correct logic.

The assertion and dispatch parameter changes are correct - using `all_rank_max_num_tokens` instead of the hardcoded maximum aligns with the PR's goal of supporting variable token counts. However, there's a formatting issue that needs to be addressed. Apply this diff to fix the line length violation:

```diff
-        assert all_rank_max_num_tokens <= self.deep_ep_max_num_tokens
-        x, recv_expert_count, deep_ep_handle = \
-            self.deep_ep_buffer.low_latency_dispatch(x, deep_ep_topk_idx, all_rank_max_num_tokens, self.num_slots)
+        assert all_rank_max_num_tokens <= self.deep_ep_max_num_tokens
+        x, recv_expert_count, deep_ep_handle = self.deep_ep_buffer.low_latency_dispatch(
+            x, deep_ep_topk_idx, all_rank_max_num_tokens, self.num_slots
+        )
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- cpp/tensorrt_llm/deep_ep/CMakeLists.txt (1 hunks)
- tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (3 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
- cpp/tensorrt_llm/deep_ep/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py
🧰 Additional context used
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
468-468: Line too long (126 > 120)
(E501)
628-628: Line too long (138 > 120)
(E501)
PR_Github #12316 [ run ] triggered by Bot

PR_Github #12299 [ run ] completed with state

PR_Github #12316 [ run ] completed with state
… and avoid copy Signed-off-by: Yilin Zhang <[email protected]>
Signed-off-by: Yilin Zhang <[email protected]>
/bot run
Actionable comments posted: 0
♻️ Duplicate comments (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
618-620: Line length violation persists - reformat the dispatch call.

The assertion and logic changes are correct, but the line length violation on line 620 needs to be fixed by reformatting the dispatch call arguments. This appears to be the same formatting issue identified in previous reviews. Please apply the formatting fix:

```diff
-        assert all_rank_max_num_tokens <= self.deep_ep_max_num_tokens
-        fp4_packed_tensor, recv_expert_count, deep_ep_handle = \
-            self.deep_ep_buffer.low_latency_dispatch(fp4_packed_tensor, deep_ep_topk_idx, all_rank_max_num_tokens, self.num_slots)
+        assert all_rank_max_num_tokens <= self.deep_ep_max_num_tokens
+        fp4_packed_tensor, recv_expert_count, deep_ep_handle = self.deep_ep_buffer.low_latency_dispatch(
+            fp4_packed_tensor,
+            deep_ep_topk_idx,
+            all_rank_max_num_tokens,
+            self.num_slots,
+        )
```
🧹 Nitpick comments (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
466-469: Fix line length violation and verify assertion logic.

The assertion correctly validates token count limits before dispatch, and using `all_rank_max_num_tokens` aligns with the PR objective of supporting variable token numbers. However, there's a line length violation that needs to be addressed. Apply this diff to fix the line length violation:

```diff
-        assert all_rank_max_num_tokens <= self.deep_ep_max_num_tokens
-        x, recv_expert_count, deep_ep_handle = \
-            self.deep_ep_buffer.low_latency_dispatch(x, deep_ep_topk_idx, all_rank_max_num_tokens, self.num_slots)
+        assert all_rank_max_num_tokens <= self.deep_ep_max_num_tokens
+        x, recv_expert_count, deep_ep_handle = self.deep_ep_buffer.low_latency_dispatch(
+            x, deep_ep_topk_idx, all_rank_max_num_tokens, self.num_slots
+        )
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- cpp/tensorrt_llm/deep_ep/CMakeLists.txt (1 hunks)
- tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (3 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
- cpp/tensorrt_llm/deep_ep/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py
🧰 Additional context used
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
468-468: Line too long (126 > 120)
(E501)
620-620: Line too long (138 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
466-620: Implementation correctly supports variable token counts with proper validation.

The changes successfully implement the PR objective of supporting variable hidden sizes and token numbers by:

- Adding safety assertions - Both dispatch calls now validate that `all_rank_max_num_tokens` doesn't exceed `deep_ep_max_num_tokens` before dispatching
- Using dynamic token counts - Replacing hardcoded `deep_ep_max_num_tokens` with `all_rank_max_num_tokens` for actual dispatch operations
- Maintaining consistency - Both DeepEPLowLatency dispatch paths follow the same pattern
This aligns with the PR summary's goal of avoiding overhead from unnecessary data copying and supporting variable token numbers.
The logic changes are sound and improve the flexibility of the DeepEP implementation as intended.
PR_Github #12366 [ run ] triggered by Bot

PR_Github #12366 [ run ] completed with state
Signed-off-by: Yilin Zhang <[email protected]>
Signed-off-by: Yilin Zhang <[email protected]>
Signed-off-by: Yilin Zhang <[email protected]> Signed-off-by: Shreyas Misra <[email protected]>
Signed-off-by: Yilin Zhang <[email protected]> Signed-off-by: Ransiki Zhang <[email protected]>
DeepEP diff: https://github.com/deepseek-ai/DeepEP/compare/eb3f072664251c05074c3ecc3c3f5dad179c29a9...7b15af835942675df041eca2dcb9930b880287e1?expand=1
I fixed the address of dispatch_rdma_recv_count_buffer to avoid cleaning it after each change in hidden_size or token_num. This eliminates the need to call the low_latency_buffer twice (before and after the LL dispatch). Additionally, we can use all_rank_max_num_tokens instead of self.deep_ep_max_num_tokens for dispatching and combining, which avoids the copy overhead.
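As a rough sketch of the copy-avoidance idea described above (the padding shown in the commented-out "old approach" and the helper name are assumptions for illustration, not the actual TensorRT-LLM code):

```python
import torch

def dispatch_without_padding(buffer, x, topk_idx, all_rank_max_num_tokens,
                             deep_ep_max_num_tokens, num_slots):
    """Hypothetical helper contrasting the old and new dispatch sizing."""
    # Old approach (sketch): pad the input up to the configured maximum, which
    # forces an extra copy of x into a larger staging tensor every iteration.
    #   padded = torch.zeros(deep_ep_max_num_tokens, x.shape[-1],
    #                        dtype=x.dtype, device=x.device)
    #   padded[: x.shape[0]] = x   # <- the copy the PR avoids
    #   return buffer.low_latency_dispatch(padded, topk_idx,
    #                                      deep_ep_max_num_tokens, num_slots)

    # New approach: pass the actual per-rank maximum; the preallocated buffer
    # is still large enough, so only a bound check is needed.
    assert all_rank_max_num_tokens <= deep_ep_max_num_tokens
    return buffer.low_latency_dispatch(x, topk_idx, all_rank_max_num_tokens, num_slots)
```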
Summary by CodeRabbit
Bug Fixes
Refactor
Chores