[Question] Question on using num_worst_tokens #505

@yuhyao

Description

Hi DeepSeek team,

I noticed that `num_worst_tokens` is included in the arguments of `dispatch`, which makes sense for CUDA Graph support. However, it seems that the values written to `moe_recv_counter` and `moe_recv_expert_counter` are no longer directly accessible. From the code, it looks like these counters can be recomputed from `recv_topk_idx`, since padded entries are set to -1.
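For clarity, the recomputation I mean can be sketched as follows. This is a minimal host-side illustration, assuming `recv_topk_idx` is a `[num_tokens, num_topk]` array where padded entries (and fully padded rows) are -1; the function name and exact layout are my assumptions, not DeepEP's actual API, and the real version would presumably run as a device kernel:

```python
def recompute_recv_counters(recv_topk_idx, num_experts):
    """Recover the dispatch counters from recv_topk_idx.

    Hypothetical sketch: counting non-negative expert ids recovers the
    per-expert counter (moe_recv_expert_counter), and counting rows with
    at least one valid id recovers the total (moe_recv_counter).
    """
    per_expert = [0] * num_experts
    num_recv_tokens = 0
    for row in recv_topk_idx:
        valid = [e for e in row if e >= 0]  # -1 marks padding
        if valid:
            num_recv_tokens += 1  # row holds a real received token
        for e in valid:
            per_expert[e] += 1
    return num_recv_tokens, per_expert
```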

The prefill trace also doesn't show any H2D memcpy after the `dispatch` kernel, so I assume you are recomputing those counters inside another kernel?

Would it be possible to write these counters into CUDA tensors directly? For example, open-source projects like SGLang simply copy `num_recv_tokens_per_expert_list` to device memory. Having these values available on CUDA would make it easier to switch to CPU-async mode and may also help avoid redundant memory accesses.

Thanks!
