[OMNIML-2932] Fusing pre_quant_scale for NVFP4 AWQ #421
Conversation
Codecov Report: ✅ All modified and coverable lines are covered by tests.

Coverage Diff (main vs. #421):

| | main | #421 | +/- |
|---|---:|---:|---:|
| Coverage | 74.43% | 74.43% | |
| Files | 182 | 182 | |
| Lines | 18234 | 18234 | |
| Hits | 13572 | 13572 | |
| Misses | 4662 | 4662 | |
Force-pushed from 6da3636 to cd036ed
Force-pushed from cd036ed to c5d9682
Force-pushed from d9dfc39 to a5a6e39
Force-pushed from ae2a32c to 6020e94
@@ -0,0 +1,193 @@
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
No, that test will still pass.
Force-pushed from a591330 to 234b7c2
from .plugins import export_spec_ckpt_config, export_spec_ckpt_state_dict, spec_opt_only
from .quant_utils import (
    fuse_prequant_layernorm,
    fuse_prequant_to_linear,
Can fuse_prequant_to_linear and fuse_prequant_layernorm be combined, or are they mutually exclusive?
They are quite different: fuse_prequant_to_linear is a rule-based fusion and doesn't need graph tracing.
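For context, a minimal sketch of the rule-based idea (hypothetical helper, not the actual fuse_prequant_to_linear implementation): when a consumer linear's input is produced by another linear, the consumer's per-channel pre_quant_scale can be folded into the producer's output channels and then dropped, with no graph tracing involved.

```python
import torch

def fold_pqs_into_producer(producer: torch.nn.Linear, pre_quant_scale: torch.Tensor) -> None:
    """Fold a consumer's pre_quant_scale (one value per producer output channel)
    into the producer's weight rows, so the consumer no longer needs the scale."""
    with torch.no_grad():
        # producer.weight has shape (out_features, in_features); scale each output row.
        producer.weight.mul_(pre_quant_scale.unsqueeze(1))
        if producer.bias is not None:
            producer.bias.mul_(pre_quant_scale)
```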
modelopt/torch/export/quant_utils.py (outdated)
layernorm_module.weight = torch.nn.Parameter(
    layernorm_module.weight * getattr(modules[0].input_quantizer, "_pre_quant_scale")
)
if hasattr(layernorm_module, "bias"):
Do we need to handle bias now (we didn't before) because of some new model support, or is it NVFP4 AWQ related?
No, this is just for future-proofing.
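For reference, a hypothetical sketch of what the bias branch would do for a norm layer that actually carries a bias (RMSNorm has none, hence the future-proofing). The `pre_quant_scale` argument stands in for the same tensor pulled from the downstream module's input quantizer in the diff above; this is not the PR's actual code.

```python
import torch

def scale_layernorm_bias(layernorm_module: torch.nn.LayerNorm, pre_quant_scale: torch.Tensor) -> None:
    # LayerNorm(x) = normalize(x) * weight + bias, so a scale fused into the
    # weight must also be applied to the bias to keep the output unchanged.
    if getattr(layernorm_module, "bias", None) is not None:
        layernorm_module.bias = torch.nn.Parameter(layernorm_module.bias * pre_quant_scale)
```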
        mtq.NVFP4_AWQ_LITE_CFG,
    ],
)
def test_pattern_fuse_prequant_moe(quant_config):
Could we also cover a test case for a BMM-style MoE, like in llama4 or gpt-oss?
The current implementation does not work for BMM-style MoE, but we can add support later.
    .expand(num_kv_heads, n_rep, kv_head_dim)
    .reshape(-1)
)
# Update o_proj's pre_quant_scale
So this update is about o_proj's PQS, so we can just take the first head and apply it to v, right?
Yes, this updates o_proj's PQS so that input channels of o_proj associated with the same query group (i.e., the same output channel of v) have the same pre_quant_scale.
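A hedged illustration of the reshape pattern in the diff above (example shapes; the reduction is an assumption, the actual code may take the first head in each group rather than averaging): reduce o_proj's per-channel scale to one value per KV-head channel, then broadcast it back so every query head in the same group shares that value.

```python
import torch

num_kv_heads, n_rep, kv_head_dim = 8, 4, 128           # example GQA shapes
pqs = torch.rand(num_kv_heads * n_rep * kv_head_dim)   # o_proj's pre_quant_scale

# One scale per (kv_head, channel); averaging is one possible reduction.
group_scale = pqs.view(num_kv_heads, n_rep, kv_head_dim).mean(dim=1)

# Broadcast back so channels in the same query group share one value, matching
# the .expand(num_kv_heads, n_rep, kv_head_dim).reshape(-1) pattern above.
fused_pqs = (
    group_scale.unsqueeze(1)
    .expand(num_kv_heads, n_rep, kv_head_dim)
    .reshape(-1)
)
assert fused_pqs.shape == pqs.shape
```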
cjluo-nv left a comment:
Thanks for implementing this.
Force-pushed from d8528f1 to 986824e
Signed-off-by: weimingc <[email protected]>
Force-pushed from 986824e to f55baad
## What does this PR do?

**Type of change:** ?

**Overview:** This PR and NVIDIA/TensorRT-LLM#8698 enable NVFP4 AWQ deployment for TRT-LLM. Specifically, this PR fuses pre_quant_scale in the following two cases:

* For MLP, the pre_quant_scale of the gate_proj layer is fused into up_proj's weight, so downstream fused MoE kernels do not need an extra handle for it.
* For attention, we try to fuse the pre_quant_scale of o_proj into v_proj if their dimensions match, which means fusion is skipped for MQA/GQA models.
## Usage

```python
# Add a code snippet demonstrating how to use this
```
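A minimal usage sketch, assuming the standard ModelOpt quantize-then-export flow (the fusion added in this PR lives in the export path, so it would presumably be applied during checkpoint export; `calib_loader` and the export directory below are illustrative, not from this PR):

```python
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

# model: a Hugging Face model such as Qwen3 (dense or MoE); calib_loader: a small
# calibration dataloader. Both are assumed to exist already.
def forward_loop(m):
    for batch in calib_loader:
        m(**batch)

# Quantize with the NVFP4 AWQ (lite) config used in the tests of this PR.
model = mtq.quantize(model, mtq.NVFP4_AWQ_LITE_CFG, forward_loop)

# Export the quantized checkpoint; pre_quant_scale fusion for MLP/attention is
# expected to happen as part of this export step.
export_hf_checkpoint(model, export_dir="qwen3-nvfp4-awq")
```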
## Testing

Unit tests, plus e2e tests for the Qwen3 dense and MoE models.
## Before your PR is "*Ready for review*"

- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes/No
- **Did you write any new necessary tests?**: Yes/No
- **Did you add or update any necessary documentation?**: Yes/No
- **Did you update [Changelog](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CHANGELOG.rst)?**: Yes/No

## Additional Information