support fp8 per token quant for deepep low latency two stage #76863
Your PR was submitted successfully. Thank you for contributing to this open-source project!
| sizeof(int4) + (kUseFP8 ? (kHidden + kNumScales * sizeof(float)) | ||
| : (kHidden * sizeof(nv_bfloat16))); | ||
| sizeof(int4) + (kUseFP8 | ||
| ? (kHidden + (kNumScales + 3) / 4 * 4 * sizeof(float)) |
It would be better not to hard-code the alignment to 4. Define something like constexpr ALIGN_ELEMS=xxx and align to that instead.
The same applies to the other places.
It would be better not to hard-code the alignment to 4. Define something like constexpr ALIGN_ELEMS=xxx and align to that instead.
Done.
The reason is that each load reads 16 bytes, i.e. 4 floats, so the scales must be aligned to 4 floats. Do not use an environment variable for this.
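The alignment discussed above can be sketched as follows. This is a minimal illustration, not the code merged in the PR: the names `kLoadBytes`, `kScaleAlignElems`, `align_up`, and `padded_scale_bytes` are hypothetical, and the sketch assumes a 4-byte `float`. It derives the element alignment from the 16-byte load width instead of writing the literal 4.

```cpp
#include <cstddef>

// Hypothetical names for illustration; none of them come from the PR.
// One vectorized load moves 16 bytes (the size of an int4), i.e. 4 floats.
constexpr std::size_t kLoadBytes = 16;
constexpr std::size_t kScaleAlignElems = kLoadBytes / sizeof(float);

// Round n up to the next multiple of align (align must be non-zero).
constexpr std::size_t align_up(std::size_t n, std::size_t align) {
    return (n + align - 1) / align * align;
}

// Bytes taken by the scale region of one FP8 message after padding,
// the named-constant equivalent of (kNumScales + 3) / 4 * 4 * sizeof(float).
constexpr std::size_t padded_scale_bytes(std::size_t num_scales) {
    return align_up(num_scales, kScaleAlignElems) * sizeof(float);
}

// Per-token quantization has a single scale, which still pads to one full load.
static_assert(padded_scale_bytes(1) == kLoadBytes, "1 scale pads to 16 bytes");
```

Deriving the alignment from the load width keeps it a compile-time constant, which matches the reviewer's point that it should not be configurable at run time.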
| auto num_tokens = static_cast<int>(x.size(0)), | ||
| hidden = static_cast<int>(x.size(1)); | ||
| auto num_scales = hidden / 128, num_topk = static_cast<int>(topk_idx.size(1)); | ||
| auto num_scales = num_per_channel == -1 ? 1 : hidden / 128, |
If num_per_channel is introduced, shouldn't this be changed to hidden / num_per_channel?
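If num_per_channel is introduced, shouldn't this be changed to hidden / num_per_channel?
In that case, per-token quantization would have to pass hidden_size as num_per_channel, which makes the parameters somewhat cumbersome.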
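The trade-off in this exchange can be sketched as two small helpers. Both function names are hypothetical and exist only for illustration, and the hidden size 7168 used in the checks is an example value, not one stated in the PR. The first helper mirrors the expression in the diff, where -1 is a sentinel for per-token quantization. The second is the reviewer's generalized form, which needs the caller to pass the hidden size itself to get a single scale per token.

```cpp
// Hypothetical helpers for illustration; names and signatures are assumptions.
constexpr int kPerTokenSentinel = -1;  // the -1 value used in the diff
constexpr int kBlockSize = 128;        // the divisor used in the diff

// Form used in the diff: the sentinel selects one scale per token,
// any other value falls back to one scale per 128 hidden elements.
constexpr int num_scales_with_sentinel(int hidden, int num_per_channel) {
    return num_per_channel == kPerTokenSentinel ? 1 : hidden / kBlockSize;
}

// Form suggested in review: always divide by the group width.
// Per-token mode then requires num_per_channel == hidden.
constexpr int num_scales_generalized(int hidden, int num_per_channel) {
    return hidden / num_per_channel;
}

// Both forms agree once the caller supplies the matching argument.
static_assert(num_scales_with_sentinel(7168, -1) ==
                  num_scales_generalized(7168, 7168),
              "per-token mode yields one scale in both forms");
```

The sentinel form keeps the call site short for per-token mode, at the cost of ignoring any group width other than 128. The generalized form supports arbitrary widths but pushes the hidden size into the argument list.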
LGTM
PR Category
Performance Optimization
PR Types
New features
Description
Adapt the DeepEP low-latency two-stage path to support per-token quantization of activations.