Conversation


@rM-planet commented Nov 3, 2025

1. Changes in OGA to support mixed-precision export of models quantized with GPTQModel.
2. Changes to decide whether or not to use a packed MatMul based on the Q/K/V precisions (a rough sketch of this decision follows below).
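
A minimal sketch of the decision in item 2, assuming the quantized projections expose a "bits" attribute; the helper names make_packed_matmul and make_separate_matmuls are hypothetical, not the actual model builder API:

# Sketch only: pack Q/K/V into a single MatMul only when this layer's
# Q, K, and V projections share the same quantization precision.
q_bits = getattr(attention.q_proj, "bits", None)
k_bits = getattr(attention.k_proj, "bits", None)
v_bits = getattr(attention.v_proj, "bits", None)

if self.attention_attrs["use_packed_matmul"] and q_bits == k_bits == v_bits:
    make_packed_matmul(attention.q_proj, attention.k_proj, attention.v_proj)  # hypothetical helper
else:
    make_separate_matmuls(attention.q_proj, attention.k_proj, attention.v_proj)  # hypothetical helper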

@rM-planet (Author)

@baijumeswani Please review. Thanks!

    and not self.attention_attrs["k_norm"]
)

if "use_packed_matmul" in self.extra_options:

Contributor:

This is an optimization opportunity that should be auto-detected by the model builder. We should not need to put this responsibility on the user.

Author:

Hi Kunal, there are some cases where, in order to improve accuracy, we want Q, K, and V to be in different precisions. For those cases, we want unpacked Q/K/V MatMuls.

Contributor:

This can still be done inside the model builder. For example:

if Q.dtype != K.dtype or K.dtype != V.dtype or Q.dtype != V.dtype:
    make_separate_matmuls(...)
else:
    make_packed_matmul(...)

In any case, we don't expect to use a packed MatMul when the Q, K, and V tensors have different precisions. A packed MatMul should use the same precision for all three tensors.

class PackedMatMul:
    def __init__(self):
        if q_matmul.bits != k_matmul.bits or q_matmul.bits != v_matmul.bits:
            raise ValueError("All MatMuls must have the same bits for packed MatMul.")
        if q_matmul.group_size != k_matmul.group_size or q_matmul.group_size != v_matmul.group_size:
            raise ValueError("All MatMuls must have the same group size for packed MatMul.")

Author:

@kunal-vaishnavi Thanks for the suggestion. I have addressed the changes accordingly.

@rM-planet force-pushed the gptqmodel_mixed_precision branch from f74cae5 to e6ff697 on November 6, 2025 01:39

# Make MatMul nodes
- if self.attention_attrs["use_packed_matmul"]:
+ if self.attention_attrs["use_packed_matmul"] and (self.quant_type is None or (attention.q_proj.bits == attention.k_proj.bits == attention.v_proj.bits)):

Contributor:

The value of use_packed_matmul should already consider whether the bits are equal or not. Can we find a way to use the bits check to set the value of use_packed_matmul?

# Some EPs don't support packed Q/K/V for GQA yet
# Packed MatMul with LoRA/QLoRA is not currently supported
self.attention_attrs["use_packed_matmul"] = (
    self.ep not in ["dml", "webgpu"]
    and not self.matmul_attrs["use_lora"]
    and not self.attention_attrs["q_norm"]
    and not self.attention_attrs["k_norm"]
)

We should be able to re-use that check in other locations within the model builder as needed (e.g. below).

class PackedMatMul:
    def __init__(self):
        if q_matmul.bits != k_matmul.bits or q_matmul.bits != v_matmul.bits:
            raise ValueError("All MatMuls must have the same bits for packed MatMul.")
        if q_matmul.group_size != k_matmul.group_size or q_matmul.group_size != v_matmul.group_size:
            raise ValueError("All MatMuls must have the same group size for packed MatMul.")

Author (@rM-planet) commented Nov 6, 2025:

Currently, "use_packed_matmul" is a global check, i.e. it applies to all layers. That works well for feature-based conditions ("use_lora", the DML EP) that apply to all layers. However, there are cases where we want the first few accuracy-sensitive layers in mixed precision and the remaining layers in the same low-bit precision, so I want to perform this check per layer instead of setting it globally, which would be more restrictive.
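
As a hypothetical illustration of the per-layer check described here (the helper name and the direct use of "bits" are assumptions, not the final code):

# Global flag: feature support (EP, LoRA, Q/K norm), set once for the whole model.
# Per-layer check: whether this particular layer's Q/K/V precisions match.
def can_pack_qkv(self, attention):  # hypothetical helper
    same_precision = attention.q_proj.bits == attention.k_proj.bits == attention.v_proj.bits
    return self.attention_attrs["use_packed_matmul"] and same_precision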

Contributor:

Can we leverage a similar style used below for the bits checks?

# Make Add nodes (if bias exists)
q_bias_exists = attention.q_proj.bias is not None and torch.count_nonzero(attention.q_proj.bias) > 0
k_bias_exists = attention.k_proj.bias is not None and torch.count_nonzero(attention.k_proj.bias) > 0
v_bias_exists = attention.v_proj.bias is not None and torch.count_nonzero(attention.v_proj.bias) > 0
any_bias_exists = q_bias_exists or k_bias_exists or v_bias_exists
if self.attention_attrs["use_packed_matmul"] and any_bias_exists:

Something like the following should work.

# Get dtype used for MatMul ops
q_dtype = getattr(attention.q_proj, "bits", attention.q_proj.weight.dtype)
k_dtype = getattr(attention.k_proj, "bits", attention.k_proj.weight.dtype)
v_dtype = getattr(attention.v_proj, "bits", attention.v_proj.weight.dtype)
all_dtype_equal = q_dtype == k_dtype == v_dtype

if self.attention_attrs["use_packed_matmul"] and all_dtype_equal:

I would like to keep the boolean expression for the if condition as simple as possible.

Author:

I have made the changes accordingly.

@gtonpe commented Nov 7, 2025

In Review

@rM-planet force-pushed the gptqmodel_mixed_precision branch from e6ff697 to f6386c2 on November 7, 2025 23:26
@rM-planet force-pushed the gptqmodel_mixed_precision branch from f6386c2 to 10059d2 on November 7, 2025 23:41

# Get dtype used for MatMul ops
q_dtype = getattr(attention.q_proj, "bits", None) or getattr(attention.q_proj.weight, "dtype", None)
k_dtype = getattr(attention.k_proj, "bits", None) or getattr(attention.k_proj.weight, "dtype", None)

Contributor:

Can we change the predicate order to check for dtype first and then bits second? Most models generated via the model builder have not been quantized in advance, so the dtype predicate is true more often than the bits predicate.
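
A hedged sketch of the suggested reordering (same attribute names as the snippet above; only the predicate order changes):

# Check the floating-point weight dtype first (the common, unquantized case),
# then fall back to the quantization bit width for pre-quantized models.
q_dtype = getattr(attention.q_proj.weight, "dtype", None) or getattr(attention.q_proj, "bits", None)
k_dtype = getattr(attention.k_proj.weight, "dtype", None) or getattr(attention.k_proj, "bits", None)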
