DeepEP LL support variable hidden size and tokens num #6141
Conversation
Walkthrough

The updates modify the DeepEP dependency version in the build configuration, replace the fixed maximum token count with the actual per-rank token count in the low-latency dispatch and combine paths, and relax the dispatch buffer's token-count validation from strict equality to an upper-bound check.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant FusedMoEWideEP
    participant VariableLengthLowLatencyBuffer
    participant DeepEP
    User->>FusedMoEWideEP: Call forward(...)
    FusedMoEWideEP->>VariableLengthLowLatencyBuffer: low_latency_dispatch(all_rank_max_num_tokens)
    VariableLengthLowLatencyBuffer->>VariableLengthLowLatencyBuffer: Assert num_experts consistency
    VariableLengthLowLatencyBuffer->>DeepEP: Dispatch tokens
    DeepEP-->>VariableLengthLowLatencyBuffer: Return results
    FusedMoEWideEP->>FusedMoEWideEP: Combine results (direct call, no adapter)
    FusedMoEWideEP-->>User: Return output
```
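For orientation, here is a minimal Python sketch of the caller-side flow in the diagram above; the class, attribute, and helper names are assumptions for illustration, and the expert computation is stubbed out:

```python
class LowLatencyMoEDispatchSketch:
    """Hypothetical caller-side flow mirroring the sequence diagram."""

    def __init__(self, deep_ep_buffer, deep_ep_max_num_tokens, num_slots):
        self.deep_ep_buffer = deep_ep_buffer                   # wraps the DeepEP low-latency buffer
        self.deep_ep_max_num_tokens = deep_ep_max_num_tokens   # configured buffer capacity
        self.num_slots = num_slots                             # total expert slots

    def forward(self, x, topk_idx, topk_weights, all_rank_max_num_tokens):
        # The actual per-iteration token count must fit the preallocated buffer.
        assert all_rank_max_num_tokens <= self.deep_ep_max_num_tokens

        # Dispatch with the actual per-rank maximum rather than the configured cap,
        # so no padded copy up to deep_ep_max_num_tokens is required.
        recv_x, recv_expert_count, handle = self.deep_ep_buffer.low_latency_dispatch(
            x, topk_idx, all_rank_max_num_tokens, self.num_slots)

        expert_out = self._run_experts(recv_x, recv_expert_count)

        # Combine is called directly on the buffer ("direct call, no adapter").
        return self.deep_ep_buffer.low_latency_combine(
            expert_out, topk_idx, topk_weights, handle)

    def _run_experts(self, recv_x, recv_expert_count):
        # Stand-in for the fused MoE expert computation.
        return recv_x
```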
Actionable comments posted: 0
🧹 Nitpick comments (2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
465-469: Good safety check and more accurate token count usage.

The assertion ensures we don't exceed the configured buffer size, and using `all_rank_max_num_tokens` instead of the fixed maximum aligns with the actual dispatch requirements. However, line 467 exceeds the 120-character limit.

```diff
-        x, recv_expert_count, deep_ep_handle = \
-            self.deep_ep_buffer.low_latency_dispatch(x, deep_ep_topk_idx, all_rank_max_num_tokens, self.num_slots)
+        x, recv_expert_count, deep_ep_handle = self.deep_ep_buffer.low_latency_dispatch(
+            x, deep_ep_topk_idx, all_rank_max_num_tokens, self.num_slots)
```
623-626: Consistent safety check for postquant dispatch.

Good to see the same assertion and token count handling applied here for consistency. Line 626 exceeds the 120-character limit.

```diff
-        fp4_packed_tensor, recv_expert_count, deep_ep_handle = \
-            self.deep_ep_buffer.low_latency_dispatch(fp4_packed_tensor, deep_ep_topk_idx, all_rank_max_num_tokens, self.num_slots)
+        fp4_packed_tensor, recv_expert_count, deep_ep_handle = self.deep_ep_buffer.low_latency_dispatch(
+            fp4_packed_tensor, deep_ep_topk_idx, all_rank_max_num_tokens, self.num_slots)
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- cpp/tensorrt_llm/deep_ep/CMakeLists.txt (1 hunks)
- tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (1 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
467-467: Line too long (126 > 120)
(E501)
626-626: Line too long (138 > 120)
(E501)
🔇 Additional comments (2)
cpp/tensorrt_llm/deep_ep/CMakeLists.txt (1)
1-1: DeepEP dependency updated to newer commit. The update aligns with the refined token handling changes in the Python modules.
tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (1)
142-142: Good improvement to validation logic. Relaxing from strict equality to a less-than-or-equal check allows more flexible dispatch token counts while still preventing buffer overflow. This change enables dispatching with variable token counts up to the configured maximum.
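For illustration, a minimal standalone sketch of the relaxed bound check described above; the function and parameter names are hypothetical, not the actual deep_ep_utils.py code:

```python
def check_dispatch_token_count(num_max_dispatch_tokens_per_rank: int,
                               configured_max_tokens: int) -> None:
    # Before (sketch): only the exact configured size was accepted.
    #   assert num_max_dispatch_tokens_per_rank == configured_max_tokens
    # After: any count up to the configured buffer capacity is allowed,
    # permitting variable token counts while still preventing overflow.
    assert num_max_dispatch_tokens_per_rank <= configured_max_tokens, (
        f"dispatch token count {num_max_dispatch_tokens_per_rank} exceeds "
        f"buffer capacity {configured_max_tokens}")
```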
/bot run

PR_Github #12199 [ run ] triggered by Bot

/bot run
Actionable comments posted: 1
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- cpp/tensorrt_llm/deep_ep/CMakeLists.txt (1 hunks)
- tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (1 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
- cpp/tensorrt_llm/deep_ep/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py
🧰 Additional context used
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
467-467: Line too long (126 > 120)
(E501)
626-626: Line too long (138 > 120)
(E501)
🔇 Additional comments (2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
465-468: LGTM: Assertion and dispatch parameter update look correct.

The assertion ensures `all_rank_max_num_tokens` doesn't exceed the configured limit before dispatch, and updating the dispatch call to use the actual maximum tokens per rank instead of the hardcoded limit is a good improvement for handling variable token counts.
711-717: low_latency_combine signature & reshape usage verified

The `low_latency_combine` method in `deep_ep_utils.py` is defined as:

```python
def low_latency_combine(self, hidden_states: torch.Tensor,
                        topk_idx: torch.Tensor, topk_weights: torch.Tensor,
                        handle: Tuple): …
```

This matches the call in `tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py` (lines 711–717):

```python
num_tokens_per_expert_for_fused_moe = (
    self.mapping.moe_ep_size * all_rank_max_num_tokens
)
final_hidden_states = final_hidden_states.view(
    self.expert_size_per_partition,
    num_tokens_per_expert_for_fused_moe,
    self.hidden_size,
)
final_hidden_states = self.deep_ep_buffer.low_latency_combine(
    final_hidden_states,
    deep_ep_topk_idx,
    deep_ep_topk_weights,
    deep_ep_handle,
)
```

Reshaping with `all_rank_max_num_tokens` produces the expected `(expert_size_per_partition, num_tokens_per_expert, hidden_size)` tensor, and all arguments align with the combine implementation. No changes needed.
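As a quick sanity check of the reshape arithmetic above, a small self-contained example with made-up sizes (the numbers are illustrative only, not from the PR):

```python
import torch

# Illustrative sizes only.
expert_size_per_partition = 4      # experts owned by this rank
moe_ep_size = 8                    # expert-parallel world size
all_rank_max_num_tokens = 16       # actual per-rank max for this iteration
hidden_size = 512

num_tokens_per_expert = moe_ep_size * all_rank_max_num_tokens  # 8 * 16 = 128

# Flat combine input of shape (experts * tokens, hidden) reshaped to 3D.
flat = torch.zeros(expert_size_per_partition * num_tokens_per_expert, hidden_size)
reshaped = flat.view(expert_size_per_partition, num_tokens_per_expert, hidden_size)
assert reshaped.shape == (4, 128, 512)
```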
PR_Github #12200 [ run ] triggered by Bot

PR_Github #12199 [ run ] completed with state

PR_Github #12200 [ run ] completed with state

/bot run

PR_Github #12268 [ run ] triggered by Bot

/bot run

PR_Github #12275 [ run ] triggered by Bot

PR_Github #12268 [ run ] completed with state

/bot run
Actionable comments posted: 0
♻️ Duplicate comments (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
625-628: Consistent implementation with remaining formatting issue.

The assertion and dispatch call update are correctly implemented, consistent with the earlier occurrence in the file. However, the line length violation on line 628 still needs to be addressed.

Apply this formatting fix:

```diff
-            self.deep_ep_buffer.low_latency_dispatch(fp4_packed_tensor, deep_ep_topk_idx, all_rank_max_num_tokens, self.num_slots)
+        self.deep_ep_buffer.low_latency_dispatch(
+            fp4_packed_tensor, deep_ep_topk_idx, all_rank_max_num_tokens, self.num_slots)
```
🧹 Nitpick comments (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
466-468: Logic changes look good, but fix line length violation.

The assertion correctly validates the token count constraint, and the dispatch call update aligns with the PR objective of using actual token counts instead of hardcoded maximums. However, there's a line length violation on line 468. Apply this formatting fix:

```diff
-            self.deep_ep_buffer.low_latency_dispatch(x, deep_ep_topk_idx, all_rank_max_num_tokens, self.num_slots)
+        self.deep_ep_buffer.low_latency_dispatch(
+            x, deep_ep_topk_idx, all_rank_max_num_tokens, self.num_slots)
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- cpp/tensorrt_llm/deep_ep/CMakeLists.txt (1 hunks)
- tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (3 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
- cpp/tensorrt_llm/deep_ep/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py
🧰 Additional context used
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
468-468: Line too long (126 > 120)
(E501)
628-628: Line too long (138 > 120)
(E501)
PR_Github #12299 [ run ] triggered by Bot

PR_Github #12275 [ run ] completed with state

/bot run
Actionable comments posted: 0
♻️ Duplicate comments (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
625-628: Fix line length violation - duplicate formatting issue.

This segment has the same line length violation as the previous dispatch call. The assertion and parameter logic are correct, but the formatting needs to be consistent.

Apply this diff to fix the line length violation:

```diff
-        assert all_rank_max_num_tokens <= self.deep_ep_max_num_tokens
-        fp4_packed_tensor, recv_expert_count, deep_ep_handle = \
-            self.deep_ep_buffer.low_latency_dispatch(fp4_packed_tensor, deep_ep_topk_idx, all_rank_max_num_tokens, self.num_slots)
+        assert all_rank_max_num_tokens <= self.deep_ep_max_num_tokens
+        fp4_packed_tensor, recv_expert_count, deep_ep_handle = self.deep_ep_buffer.low_latency_dispatch(
+            fp4_packed_tensor, deep_ep_topk_idx, all_rank_max_num_tokens, self.num_slots
+        )
```
🧹 Nitpick comments (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
466-470: Fix line length violation while maintaining correct logic.

The assertion and dispatch parameter changes are correct - using `all_rank_max_num_tokens` instead of the hardcoded maximum aligns with the PR's goal of supporting variable token counts. However, there's a formatting issue that needs to be addressed. Apply this diff to fix the line length violation:

```diff
-        assert all_rank_max_num_tokens <= self.deep_ep_max_num_tokens
-        x, recv_expert_count, deep_ep_handle = \
-            self.deep_ep_buffer.low_latency_dispatch(x, deep_ep_topk_idx, all_rank_max_num_tokens, self.num_slots)
+        assert all_rank_max_num_tokens <= self.deep_ep_max_num_tokens
+        x, recv_expert_count, deep_ep_handle = self.deep_ep_buffer.low_latency_dispatch(
+            x, deep_ep_topk_idx, all_rank_max_num_tokens, self.num_slots
+        )
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- cpp/tensorrt_llm/deep_ep/CMakeLists.txt (1 hunks)
- tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (3 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
- cpp/tensorrt_llm/deep_ep/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py
🧰 Additional context used
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
468-468: Line too long (126 > 120)
(E501)
628-628: Line too long (138 > 120)
(E501)
PR_Github #12316 [ run ] triggered by Bot

PR_Github #12299 [ run ] completed with state

PR_Github #12316 [ run ] completed with state
… and avoid copy Signed-off-by: Yilin Zhang <[email protected]>
Signed-off-by: Yilin Zhang <[email protected]>
/bot run
Actionable comments posted: 0
♻️ Duplicate comments (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
618-620: Line length violation persists - reformat the dispatch call.

The assertion and logic changes are correct, but the line length violation on line 620 needs to be fixed by reformatting the dispatch call arguments. This appears to be the same formatting issue identified in previous reviews. Please apply the formatting fix:

```diff
-        assert all_rank_max_num_tokens <= self.deep_ep_max_num_tokens
-        fp4_packed_tensor, recv_expert_count, deep_ep_handle = \
-            self.deep_ep_buffer.low_latency_dispatch(fp4_packed_tensor, deep_ep_topk_idx, all_rank_max_num_tokens, self.num_slots)
+        assert all_rank_max_num_tokens <= self.deep_ep_max_num_tokens
+        fp4_packed_tensor, recv_expert_count, deep_ep_handle = self.deep_ep_buffer.low_latency_dispatch(
+            fp4_packed_tensor,
+            deep_ep_topk_idx,
+            all_rank_max_num_tokens,
+            self.num_slots,
+        )
```
🧹 Nitpick comments (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
466-469: Fix line length violation and verify assertion logic.

The assertion correctly validates token count limits before dispatch, and using `all_rank_max_num_tokens` aligns with the PR objective of supporting variable token numbers. However, there's a line length violation that needs to be addressed. Apply this diff to fix the line length violation:

```diff
-        assert all_rank_max_num_tokens <= self.deep_ep_max_num_tokens
-        x, recv_expert_count, deep_ep_handle = \
-            self.deep_ep_buffer.low_latency_dispatch(x, deep_ep_topk_idx, all_rank_max_num_tokens, self.num_slots)
+        assert all_rank_max_num_tokens <= self.deep_ep_max_num_tokens
+        x, recv_expert_count, deep_ep_handle = self.deep_ep_buffer.low_latency_dispatch(
+            x, deep_ep_topk_idx, all_rank_max_num_tokens, self.num_slots
+        )
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- cpp/tensorrt_llm/deep_ep/CMakeLists.txt (1 hunks)
- tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (3 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
- cpp/tensorrt_llm/deep_ep/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py
🧰 Additional context used
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
468-468: Line too long (126 > 120)
(E501)
620-620: Line too long (138 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
466-620: Implementation correctly supports variable token counts with proper validation.

The changes successfully implement the PR objective of supporting variable hidden sizes and token numbers by:

- Adding safety assertions - Both dispatch calls now validate that `all_rank_max_num_tokens` doesn't exceed `deep_ep_max_num_tokens` before dispatching
- Using dynamic token counts - Replacing hardcoded `deep_ep_max_num_tokens` with `all_rank_max_num_tokens` for actual dispatch operations
- Maintaining consistency - Both DeepEPLowLatency dispatch paths follow the same pattern
This aligns with the PR summary's goal of avoiding overhead from unnecessary data copying and supporting variable token numbers.
The logic changes are sound and improve the flexibility of the DeepEP implementation as intended.
PR_Github #12366 [ run ] triggered by Bot

PR_Github #12366 [ run ] completed with state
Signed-off-by: Yilin Zhang <[email protected]>
Signed-off-by: Yilin Zhang <[email protected]>
Signed-off-by: Yilin Zhang <[email protected]> Signed-off-by: Shreyas Misra <[email protected]>
Signed-off-by: Yilin Zhang <[email protected]> Signed-off-by: Ransiki Zhang <[email protected]>
DeepEP diff: https://github.com/deepseek-ai/DeepEP/compare/eb3f072664251c05074c3ecc3c3f5dad179c29a9...7b15af835942675df041eca2dcb9930b880287e1?expand=1
I fixed the address of dispatch_rdma_recv_count_buffer to avoid cleaning it after each change in hidden_size or token_num. This eliminates the need to call the low_latency_buffer twice (before and after the LL dispatch). Additionally, we can use all_rank_max_num_tokens instead of self.deep_ep_max_num_tokens for dispatching and combining, which avoids the copy overhead.
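As a rough sketch of the copy-avoidance idea described above (the padding shown in the commented-out "old approach" and the helper name are assumptions for illustration, not the actual TensorRT-LLM code):

```python
import torch

def dispatch_without_padding(buffer, x, topk_idx, all_rank_max_num_tokens,
                             deep_ep_max_num_tokens, num_slots):
    """Hypothetical helper contrasting the old and new dispatch sizing."""
    # Old approach (sketch): pad the input up to the configured maximum, which
    # forces an extra copy of x into a larger staging tensor every iteration.
    #   padded = torch.zeros(deep_ep_max_num_tokens, x.shape[-1],
    #                        dtype=x.dtype, device=x.device)
    #   padded[: x.shape[0]] = x   # <- the copy the PR avoids
    #   return buffer.low_latency_dispatch(padded, topk_idx,
    #                                      deep_ep_max_num_tokens, num_slots)

    # New approach: pass the actual per-rank maximum; the preallocated buffer
    # is still large enough, so only a bound check is needed.
    assert all_rank_max_num_tokens <= deep_ep_max_num_tokens
    return buffer.low_latency_dispatch(x, topk_idx, all_rank_max_num_tokens, num_slots)
```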
Summary by CodeRabbit
Bug Fixes
Refactor
Chores