Mixed precision export support for gptq quantized model #1853
base: main
Conversation
@baijumeswani Please review. Thanks!
src/python/py/models/builder.py
Outdated
    and not self.attention_attrs["k_norm"]
)

if "use_packed_matmul" in self.extra_options:
This is an optimization opportunity that should be auto-detected by the model builder. We should not need to push this responsibility onto the user.
Hi Kunal, there are cases where, to improve accuracy, we want Q, K, and V in different precisions. For those cases, we need unpacked q, k, v MatMuls.
This can still be done inside the model builder. For example:

if Q.dtype != K.dtype or K.dtype != V.dtype or Q.dtype != V.dtype:
    make_separate_matmuls(...)
else:
    make_packed_matmul(...)

In any case, we don't expect to use a packed MatMul where the Q, K, and V tensors have different precisions. A packed MatMul should use the same precision for all three tensors.
onnxruntime-genai/src/python/py/models/builder.py
Lines 1066 to 1071 in d4eabac

class PackedMatMul:
    def __init__(self):
        if q_matmul.bits != k_matmul.bits or q_matmul.bits != v_matmul.bits:
            raise ValueError("All MatMuls must have the same bits for packed MatMul.")
        if q_matmul.group_size != k_matmul.group_size or q_matmul.group_size != v_matmul.group_size:
            raise ValueError("All MatMuls must have the same group size for packed MatMul.")
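For illustration, here is a minimal, self-contained sketch of the decision described above. The helper name should_pack_qkv is hypothetical and is not part of builder.py; it only demonstrates that packing is valid when, and only when, Q, K, and V share a precision.

import torch

# Hypothetical helper, for illustration only: choose between one packed QKV
# MatMul and three separate MatMuls based on the weight dtypes.
def should_pack_qkv(q_weight: torch.Tensor, k_weight: torch.Tensor, v_weight: torch.Tensor) -> bool:
    # Packing concatenates the three weights into one tensor, so they must
    # share a dtype (and, for quantized weights, the same bits and group size).
    return q_weight.dtype == k_weight.dtype == v_weight.dtype

q = torch.randn(64, 64, dtype=torch.float16)
k = torch.randn(64, 64, dtype=torch.float16)
v = torch.randn(64, 64, dtype=torch.float32)  # mixed precision

print(should_pack_qkv(q, k, k))  # True  -> packed MatMul is safe
print(should_pack_qkv(q, k, v))  # False -> emit separate MatMuls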
@kunal-vaishnavi Thanks for the suggestion. I have addressed the changes accordingly.
f74cae5 to e6ff697
src/python/py/models/builder.py
Outdated
# Make MatMul nodes
- if self.attention_attrs["use_packed_matmul"]:
+ if self.attention_attrs["use_packed_matmul"] and (self.quant_type is None or (attention.q_proj.bits == attention.k_proj.bits == attention.v_proj.bits)):
The value of use_packed_matmul should already consider whether the bits are equal or not. Can we find a way to use the bits check to set the value of use_packed_matmul?
onnxruntime-genai/src/python/py/models/builder.py
Lines 368 to 375 in e6ff697

# Some EPs don't support packed Q/K/V for GQA yet
# Packed MatMul with LoRA/QLoRA is not currently supported
self.attention_attrs["use_packed_matmul"] = (
    self.ep not in ["dml", "webgpu"]
    and not self.matmul_attrs["use_lora"]
    and not self.attention_attrs["q_norm"]
    and not self.attention_attrs["k_norm"]
)
We should be able to re-use that check in other locations within the model builder as needed (e.g. below).
onnxruntime-genai/src/python/py/models/builder.py
Lines 1066 to 1071 in d4eabac

class PackedMatMul:
    def __init__(self):
        if q_matmul.bits != k_matmul.bits or q_matmul.bits != v_matmul.bits:
            raise ValueError("All MatMuls must have the same bits for packed MatMul.")
        if q_matmul.group_size != k_matmul.group_size or q_matmul.group_size != v_matmul.group_size:
            raise ValueError("All MatMuls must have the same group size for packed MatMul.")
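One way to reuse the quoted checks elsewhere, as suggested above, is to factor them into a small boolean helper. The name and placement below are hypothetical and only illustrate the idea; they are not the actual change in this PR.

# Hypothetical helper (illustrative): the same bits/group_size equality checks
# as in PackedMatMul.__init__, exposed as a boolean so the per-layer packing
# decision can reuse them instead of relying on the raised ValueError.
def same_quant_config(q_matmul, k_matmul, v_matmul) -> bool:
    same_bits = q_matmul.bits == k_matmul.bits == v_matmul.bits
    same_group_size = q_matmul.group_size == k_matmul.group_size == v_matmul.group_size
    return same_bits and same_group_size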
Currently, "use_packed_matmul" is a global check, i.e. it applies to all layers. I think that is fine for feature-based conditions (use_lora, the DML EP) that apply to all layers. However, there are cases where we want the first few sensitive layers in mixed precision and the remaining layers in the same low-bit precision, so I want to do this check per layer instead of setting it globally, which is more restrictive.
Can we follow a style similar to the one used below for the bits checks?
onnxruntime-genai/src/python/py/models/builder.py
Lines 2025 to 2031 in e6ff697

# Make Add nodes (if bias exists)
q_bias_exists = attention.q_proj.bias is not None and torch.count_nonzero(attention.q_proj.bias) > 0
k_bias_exists = attention.k_proj.bias is not None and torch.count_nonzero(attention.k_proj.bias) > 0
v_bias_exists = attention.v_proj.bias is not None and torch.count_nonzero(attention.v_proj.bias) > 0
any_bias_exists = q_bias_exists or k_bias_exists or v_bias_exists
if self.attention_attrs["use_packed_matmul"] and any_bias_exists:
Something like the following should work.
# Get dtype used for MatMul ops
q_dtype = getattr(attention.q_proj, "bits", attention.q_proj.weight.dtype)
k_dtype = getattr(attention.k_proj, "bits", attention.k_proj.weight.dtype)
v_dtype = getattr(attention.v_proj, "bits", attention.v_proj.weight.dtype)
all_dtype_equal = q_dtype == k_dtype == v_dtype
if self.attention_attrs["use_packed_matmul"] and all_dtype_equal:

I would like to keep the boolean expression for the if condition as simple as possible.
I have made the changes accordingly.
e6ff697 to f6386c2
f6386c2 to 10059d2
# Get dtype used for MatMul ops
q_dtype = getattr(attention.q_proj, "bits", None) or getattr(attention.q_proj.weight, "dtype", None)
k_dtype = getattr(attention.k_proj, "bits", None) or getattr(attention.k_proj.weight, "dtype", None)
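For illustration, here is how that bits-or-dtype lookup resolves for an unquantized projection versus a quantized one. FakeQuantProj is a stand-in class, not the builder's actual quantized module type.

import torch

float_proj = torch.nn.Linear(64, 64, bias=False)   # no "bits" attribute

class FakeQuantProj:        # stand-in for a quantized projection module
    bits = 4                # quantized: report the bit width
    weight = None

def resolve_dtype(proj):
    # Same pattern as the quoted lines: prefer "bits" when present,
    # otherwise fall back to the float weight's dtype.
    return getattr(proj, "bits", None) or getattr(proj.weight, "dtype", None)

print(resolve_dtype(float_proj))       # torch.float32
print(resolve_dtype(FakeQuantProj()))  # 4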
Can we change the predicate order to check for dtype first and bits second? Most models generated via the model builder have not been quantized in advance, so the latter predicate is true more often than the former.
1. Changes in OGA to support mixed-precision export of models quantized with GPTQModel.
2. Changes to decide whether to use a packed MatMul based on the Q, K, and V precisions.
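Taken together, the per-layer decision discussed in the review looks roughly like the sketch below. It is condensed from the snippets quoted above and written as a standalone function for illustration; the function name and return values are hypothetical, and this is not a verbatim copy of the merged code.

# Condensed from the review discussion: per-layer packed-vs-separate decision.
def qkv_layout_for_layer(use_packed_matmul_globally, q_proj, k_proj, v_proj):
    # Prefer the quantized bit width when present, else the float weight dtype.
    q_dtype = getattr(q_proj, "bits", None) or getattr(q_proj.weight, "dtype", None)
    k_dtype = getattr(k_proj, "bits", None) or getattr(k_proj.weight, "dtype", None)
    v_dtype = getattr(v_proj, "bits", None) or getattr(v_proj.weight, "dtype", None)
    all_dtype_equal = q_dtype == k_dtype == v_dtype

    # Pack only when the global conditions (EP support, no LoRA, no Q/K norm)
    # allow it and this layer's Q/K/V share a precision; otherwise fall back
    # to three separate MatMuls.
    return "packed" if use_packed_matmul_globally and all_dtype_equal else "separate"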