Looking at torchtitan traces, the mxfp8 a2a dispatch is ~2x faster than bf16 (1681us vs 3487us), but the mxfp8 a2a combine (1) takes roughly the same duration as the bf16 a2a combine, and (2) both the bf16 and mxfp8 a2a combine impls take roughly 23x longer than the a2a dispatch. This is unexpected.
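As a sanity check on the dispatch numbers above, the speedup ratio can be computed directly from the two reported durations (this is just arithmetic on the values quoted from the traces, not a re-measurement):

```python
# Durations taken from the torchtitan traces quoted above (microseconds).
bf16_dispatch_us = 3487
mxfp8_dispatch_us = 1681

# mxfp8 dispatch speedup over bf16 dispatch.
speedup = bf16_dispatch_us / mxfp8_dispatch_us
print(f"mxfp8 a2a dispatch speedup vs bf16: {speedup:.2f}x")  # ~2.07x
```

This confirms the "~2x faster" figure for dispatch; the open question is why the combine does not see a similar speedup, and why combine is so much slower than dispatch in both dtypes.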
Tlparse: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpHhVPih/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
Trace for llama4 EP=4, using default a2a_impl
Trace for llama4 EP=4, using to_mxfp8_a2a_dequant
