
Conversation

@meenchen
Contributor

@meenchen meenchen commented Oct 9, 2025

What does this PR do?

Type of change: ?

Overview:

This PR and NVIDIA/TensorRT-LLM#8698 enable NVFP4 AWQ deployment for TRT-LLM. Specifically, this PR fuses pre_quant_scale in the following two cases (a minimal sketch of the folding math follows the list):

  • For MLP, the pre_quant_scale of the gate_proj layer is fused into up_proj's weight, so downstream fused MoE kernels don't need extra handling for it.
  • For attention, we try to fuse the pre_quant_scale of o_proj into v_proj when their dimensions match, which means fusion is skipped for MQA/GQA models.
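
For context, here is a minimal sketch of the folding idea, using hypothetical layer names and shapes rather than this PR's actual code: applying a consumer linear's per-channel pre_quant_scale to its input is mathematically equivalent to scaling the output channels of the layer that produces that input, which removes the runtime pre-multiply that fused kernels would otherwise have to implement.

```python
import torch

# Hypothetical shapes and modules for illustration only.
in_features, hidden = 16, 32
producer = torch.nn.Linear(in_features, hidden, bias=True)  # layer whose output feeds the AWQ-quantized linear
pre_quant_scale = torch.rand(hidden)                         # consumer's per-input-channel AWQ scale

x = torch.randn(4, in_features)
ref = producer(x) * pre_quant_scale  # reference: scale applied to the consumer's input at runtime

# Fusion: fold the scale into the producer's output channels, then drop the runtime multiply.
with torch.no_grad():
    producer.weight.mul_(pre_quant_scale.unsqueeze(1))  # scale each output row
    producer.bias.mul_(pre_quant_scale)

fused = producer(x)
assert torch.allclose(ref, fused, atol=1e-5)
```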

Usage

# Add a code snippet demonstrating how to use this
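
A hedged sketch of the expected workflow, assuming the standard ModelOpt PTQ flow and that the new fusion runs automatically inside the unified HF export path (where the helpers are imported in this PR); the model/calibration names and exact export arguments below are assumptions, not taken from this PR:

```python
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

# `model` is an HF model (e.g. Qwen3) and `calib_loader` a small calibration set; both are assumed here.
def forward_loop(m):
    for batch in calib_loader:
        m(**batch)

# NVFP4 AWQ calibration attaches pre_quant_scale to the input quantizers.
model = mtq.quantize(model, mtq.NVFP4_AWQ_LITE_CFG, forward_loop)

# Export a unified HF checkpoint; eligible pre_quant_scales
# (gate_proj -> up_proj in MLPs, o_proj -> v_proj in attention when shapes allow)
# are expected to be fused during export.
export_hf_checkpoint(model, export_dir="qwen3-nvfp4-awq")
```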

Testing

Unit tests, plus e2e tests for Qwen3 dense and MoE models.

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

@copy-pr-bot

copy-pr-bot bot commented Oct 9, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Oct 9, 2025

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@codecov

codecov bot commented Oct 9, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.43%. Comparing base (c02de17) to head (f55baad).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #421   +/-   ##
=======================================
  Coverage   74.43%   74.43%           
=======================================
  Files         182      182           
  Lines       18234    18234           
=======================================
  Hits        13572    13572           
  Misses       4662     4662           


@meenchen meenchen self-assigned this Oct 14, 2025
@meenchen meenchen requested a review from cjluo-nv October 27, 2025 19:29
@meenchen meenchen changed the title from "Pattern-based fusion for pre_quant_scale" to "Fusing pre_quant_scale for NVFP4 AWQ" Nov 3, 2025
@meenchen meenchen changed the title from "Fusing pre_quant_scale for NVFP4 AWQ" to "[OMNIML-2932] Fusing pre_quant_scale for NVFP4 AWQ" Nov 3, 2025
@meenchen meenchen marked this pull request as ready for review November 3, 2025 23:39
@meenchen meenchen requested a review from a team as a code owner November 3, 2025 23:39
@@ -0,0 +1,193 @@
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Collaborator

Contributor Author

No, that test will still pass.

from .plugins import export_spec_ckpt_config, export_spec_ckpt_state_dict, spec_opt_only
from .quant_utils import (
    fuse_prequant_layernorm,
    fuse_prequant_to_linear,
Contributor

Can fuse_prequant_to_linear and fuse_prequant_layernorm be combined, or are they mutually exclusive?

Contributor Author

They are quite different: fuse_prequant_to_linear is rule-based fusion and doesn't need graph tracing.

layernorm_module.weight = torch.nn.Parameter(
    layernorm_module.weight * getattr(modules[0].input_quantizer, "_pre_quant_scale")
)
if hasattr(layernorm_module, "bias"):
Contributor

Do we need to handle bias now (and not before) because of some new model support, or is it NVFP4 AWQ related?

Contributor Author

No, this is just for future-proofing.

mtq.NVFP4_AWQ_LITE_CFG,
],
)
def test_pattern_fuse_prequant_moe(quant_config):
Contributor

Could we also cover a test case for BMM-style MoE, like in llama4 or gpt-oss?

Contributor Author

The current implementation does not work for BMM-style MoE, but we can add that support later.

@meenchen meenchen requested a review from a team as a code owner November 8, 2025 00:23
.expand(num_kv_heads, n_rep, kv_head_dim)
.reshape(-1)
)
# Update o_proj's pre_quant_scale
Collaborator

So this update changes o_proj's PQS so that we can just take the first head's scale and apply it to v, right?

Contributor Author

Yes, this updates o_proj's PQS so that the input channels of o_proj associated with the same query group (output channel) of v share the same pre_quant_scale.
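
To make that concrete, here is a small illustrative sketch of producing a per-query-group uniform scale from o_proj's per-channel pre_quant_scale, mirroring the expand/reshape shown in the diff above; the shapes and the use of a mean reduction are assumptions for illustration, not necessarily the PR's exact reduction:

```python
import torch

# Hypothetical GQA shapes: 8 query heads grouped into 4 kv heads.
num_kv_heads, n_rep, kv_head_dim = 4, 2, 8
pqs = torch.rand(num_kv_heads * n_rep * kv_head_dim)  # o_proj pre_quant_scale, one value per input channel

# Reduce over the query heads that share a kv head, then broadcast back, so every
# input channel of o_proj in the same query group carries an identical scale.
per_group = pqs.view(num_kv_heads, n_rep, kv_head_dim).mean(dim=1, keepdim=True)
uniform_pqs = per_group.expand(num_kv_heads, n_rep, kv_head_dim).reshape(-1)

# With a uniform per-group scale, it can be folded into the matching v_proj output channels.
assert uniform_pqs.shape == pqs.shape
```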

Collaborator

@cjluo-nv cjluo-nv left a comment


Thanks for implementing this.

Signed-off-by: weimingc <[email protected]>
@meenchen meenchen enabled auto-merge (squash) November 19, 2025 19:18
@meenchen meenchen merged commit 1d0ee04 into main Nov 19, 2025
27 checks passed
@meenchen meenchen deleted the weimingc/fuse_pqs branch November 19, 2025 21:16
yeyu-nvidia pushed a commit that referenced this pull request Dec 8, 2025
soodoshll pushed a commit to soodoshll/TensorRT-Model-Optimizer that referenced this pull request Dec 8, 2025
