Hi DeepSeek team,
I noticed that num_worst_tokens is included in the arguments of dispatch, which makes sense for CUDA Graph support. However, it seems that the values written to moe_recv_counter and moe_recv_expert_counter are no longer directly accessible. From the code, it looks like these counters can be recomputed from recv_topk_idx, since padded entries are set to -1.
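To make sure I understand the semantics, here is a minimal pure-Python sketch of the recomputation I have in mind. It is only a reference for the logic, not DeepEP code: the helper name recompute_recv_counters, the row-of-lists layout, and the assumption that every slot of a padded row is -1 are mine, and it ignores any alignment padding that may be applied to the per-expert counts.

```python
def recompute_recv_counters(recv_topk_idx, num_local_experts):
    """Recompute receive counters from a padded top-k index table.

    recv_topk_idx: list of rows, one per (possibly padded) token slot.
    Each row holds local expert ids; -1 marks a padded or unused entry.
    This layout is an assumption made for illustration.
    """
    num_recv_tokens = 0
    per_expert = [0] * num_local_experts
    for row in recv_topk_idx:
        valid = [e for e in row if e >= 0]
        # A slot counts as a received token if at least one entry is valid.
        if valid:
            num_recv_tokens += 1
        # Each valid entry contributes one token to that expert.
        for e in valid:
            per_expert[e] += 1
    return num_recv_tokens, per_expert


# Five slots (playing the role of num_worst_tokens), top-2 routing,
# four local experts; the last two rows are padding.
recv_topk_idx = [
    [0, 2],
    [1, -1],
    [2, -1],
    [-1, -1],
    [-1, -1],
]
num_recv_tokens, per_expert = recompute_recv_counters(
    recv_topk_idx, num_local_experts=4
)
```

On the device, the per-expert part could presumably be expressed as a masked torch.bincount over the flattened indices, but that still requires an extra pass over recv_topk_idx, which is the redundant memory access I would like to avoid.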
The prefill trace also doesn’t show any H2D memcpy after the dispatch kernel, so I assume those counters are being recomputed inside another kernel. Is that correct?
Would it be possible to write these counters into CUDA tensors directly? For example, open-source projects like SGLang simply copy num_recv_tokens_per_expert_list to device memory. Having these values available on the device would make it easier to switch to CPU-async mode and might also avoid redundant memory accesses.
Thanks!