[Question] Question on using num_worst_tokens #505

Hi DeepSeek team,

I noticed that `num_worst_tokens` is included in the arguments of dispatch, which makes sense for CUDA Graph support. However, it seems that the values written to `moe_recv_counter` and `moe_recv_expert_counter` are no longer directly accessible. From the code, it looks like these counters can be recomputed from `recv_topk_idx`, since padded entries are set to `-1`.
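
For reference, here is a minimal sketch of the recomputation I have in mind, written in PyTorch. The function name `recompute_recv_counters` is hypothetical and not part of DeepEP. It assumes `recv_topk_idx` has shape `[num_worst_tokens, num_topk]` with local expert indices, that padded rows have every slot set to `-1`, and that every real received token has at least one non-negative slot. It ignores any expert alignment that the original counters might apply.

```python
import torch


def recompute_recv_counters(recv_topk_idx: torch.Tensor, num_local_experts: int):
    """Recompute receive counters on device (hypothetical helper, not a DeepEP API).

    Assumes padded rows are filled with -1 in every top-k slot.
    """
    # A row counts as a received token if at least one slot holds a valid expert.
    num_recv_tokens = (recv_topk_idx >= 0).any(dim=1).sum()

    # Shift indices by one so that -1 lands in bin 0, which is discarded afterwards.
    # scatter_add_ keeps the output shape static, so no host synchronization is
    # needed to size the result.
    flat = (recv_topk_idx.to(torch.int64) + 1).reshape(-1)
    counts = torch.zeros(num_local_experts + 1, dtype=torch.int64, device=flat.device)
    counts.scatter_add_(0, flat, torch.ones_like(flat))

    # Drop the bin that collected the -1 entries.
    return num_recv_tokens, counts[1:]


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Two real tokens followed by two padded rows, with three local experts.
    idx = torch.tensor([[0, -1], [1, 0], [-1, -1], [-1, -1]], device=device)
    n_tokens, per_expert = recompute_recv_counters(idx, num_local_experts=3)
    print(int(n_tokens), per_expert.tolist())
```

Both results stay on the same device as `recv_topk_idx`, so nothing has to be copied back to the host, but this is an extra pass over data that the dispatch kernel has already seen.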

The [prefill trace](https://github.com/deepseek-ai/profile-data/blob/main/prefill.json) also doesn’t show any H2D memcpy after the dispatch kernel, so I assume you are recomputing those counters inside another kernel?

Would it be possible to write these counters into CUDA tensors directly? For example, open-source projects like SGLang simply copy `num_recv_tokens_per_expert_list` to device memory. Having these values available on CUDA would make it easier to switch to CPU-async mode and may also help avoid redundant memory accesses.

Thanks!
