Mixed precision export support for gptq quantized model #1853
base: main
Conversation
@baijumeswani Please review. Thanks!
src/python/py/models/builder.py
Outdated
    and not self.attention_attrs["k_norm"]
)

if "use_packed_matmul" in self.extra_options:
This is an optimization opportunity that should be auto-detected by the model builder. We should not need to push this responsibility onto the user.
Hi Kunal, there are cases where, to improve accuracy, we want Q, K, and V in different precisions. For those cases, we need unpacked q, k, v MatMuls.
This can still be done inside the model builder. For example:

if Q.dtype != K.dtype or K.dtype != V.dtype or Q.dtype != V.dtype:
    make_separate_matmuls(...)
else:
    make_packed_matmul(...)

In any case, we don't expect to use a packed MatMul where the Q, K, and V tensors have different precisions. A packed MatMul should use the same precision for all three tensors.
onnxruntime-genai/src/python/py/models/builder.py
Lines 1066 to 1071 in d4eabac

class PackedMatMul:
    def __init__(self):
        if q_matmul.bits != k_matmul.bits or q_matmul.bits != v_matmul.bits:
            raise ValueError("All MatMuls must have the same bits for packed MatMul.")
        if q_matmul.group_size != k_matmul.group_size or q_matmul.group_size != v_matmul.group_size:
            raise ValueError("All MatMuls must have the same group size for packed MatMul.")
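For illustration, here is a minimal, self-contained sketch of the decision described above. The helper name should_pack_qkv is hypothetical and is not part of builder.py; it only demonstrates that packing is valid when, and only when, Q, K, and V share a precision.

import torch

# Hypothetical helper, for illustration only: choose between one packed QKV
# MatMul and three separate MatMuls based on the weight dtypes.
def should_pack_qkv(q_weight: torch.Tensor, k_weight: torch.Tensor, v_weight: torch.Tensor) -> bool:
    # Packing concatenates the three weights into one tensor, so they must
    # share a dtype (and, for quantized weights, the same bits and group size).
    return q_weight.dtype == k_weight.dtype == v_weight.dtype

q = torch.randn(64, 64, dtype=torch.float16)
k = torch.randn(64, 64, dtype=torch.float16)
v = torch.randn(64, 64, dtype=torch.float32)  # mixed precision

print(should_pack_qkv(q, k, k))  # True  -> packed MatMul is safe
print(should_pack_qkv(q, k, v))  # False -> emit separate MatMuls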
@kunal-vaishnavi Thanks for the suggestion. I have addressed the changes accordingly.
f74cae5 to e6ff697
src/python/py/models/builder.py
Outdated
# Make MatMul nodes
- if self.attention_attrs["use_packed_matmul"]:
+ if self.attention_attrs["use_packed_matmul"] and (self.quant_type is None or (attention.q_proj.bits == attention.k_proj.bits == attention.v_proj.bits)):
The value of use_packed_matmul should already consider whether the bits are equal or not. Can we find a way to use the bits check to set the value of use_packed_matmul?
onnxruntime-genai/src/python/py/models/builder.py
Lines 368 to 375 in e6ff697

# Some EPs don't support packed Q/K/V for GQA yet
# Packed MatMul with LoRA/QLoRA is not currently supported
self.attention_attrs["use_packed_matmul"] = (
    self.ep not in ["dml", "webgpu"]
    and not self.matmul_attrs["use_lora"]
    and not self.attention_attrs["q_norm"]
    and not self.attention_attrs["k_norm"]
)
We should be able to re-use that check in other locations within the model builder as needed (e.g. below).
onnxruntime-genai/src/python/py/models/builder.py
Lines 1066 to 1071 in d4eabac

class PackedMatMul:
    def __init__(self):
        if q_matmul.bits != k_matmul.bits or q_matmul.bits != v_matmul.bits:
            raise ValueError("All MatMuls must have the same bits for packed MatMul.")
        if q_matmul.group_size != k_matmul.group_size or q_matmul.group_size != v_matmul.group_size:
            raise ValueError("All MatMuls must have the same group size for packed MatMul.")
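One way to reuse the quoted checks elsewhere, as suggested above, is to factor them into a small boolean helper. The name and placement below are hypothetical and only illustrate the idea; they are not the actual change in this PR.

# Hypothetical helper (illustrative): the same bits/group_size equality checks
# as in PackedMatMul.__init__, exposed as a boolean so the per-layer packing
# decision can reuse them instead of relying on the raised ValueError.
def same_quant_config(q_matmul, k_matmul, v_matmul) -> bool:
    same_bits = q_matmul.bits == k_matmul.bits == v_matmul.bits
    same_group_size = q_matmul.group_size == k_matmul.group_size == v_matmul.group_size
    return same_bits and same_group_size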
Currently, "use_packed_matmul" is a global check, i.e. it applies to all layers. I think that is fine for feature-based conditions (use_lora, the DML EP) that apply to all layers. However, there are cases where we want the first few sensitive layers in mixed precision and the remaining layers in the same low-bit precision, so I want to do this check per layer instead of setting it globally, which is more restrictive.
Can we follow a style similar to the one used below for the bits checks?
onnxruntime-genai/src/python/py/models/builder.py
Lines 2025 to 2031 in e6ff697

# Make Add nodes (if bias exists)
q_bias_exists = attention.q_proj.bias is not None and torch.count_nonzero(attention.q_proj.bias) > 0
k_bias_exists = attention.k_proj.bias is not None and torch.count_nonzero(attention.k_proj.bias) > 0
v_bias_exists = attention.v_proj.bias is not None and torch.count_nonzero(attention.v_proj.bias) > 0
any_bias_exists = q_bias_exists or k_bias_exists or v_bias_exists
if self.attention_attrs["use_packed_matmul"] and any_bias_exists:
Something like the following should work.
# Get dtype used for MatMul ops
q_dtype = getattr(attention.q_proj, "bits", attention.q_proj.weight.dtype)
k_dtype = getattr(attention.k_proj, "bits", attention.k_proj.weight.dtype)
v_dtype = getattr(attention.v_proj, "bits", attention.v_proj.weight.dtype)
all_dtype_equal = q_dtype == k_dtype == v_dtype
if self.attention_attrs["use_packed_matmul"] and all_dtype_equal:

I would like to keep the boolean expression for the if condition as simple as possible.
I have made the changes accordingly.
e6ff697 to f6386c2
f6386c2 to 10059d2
# Get dtype used for MatMul ops
q_dtype = getattr(attention.q_proj, "bits", None) or getattr(attention.q_proj.weight, "dtype", None)
k_dtype = getattr(attention.k_proj, "bits", None) or getattr(attention.k_proj.weight, "dtype", None)
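For illustration, here is how that bits-or-dtype lookup resolves for an unquantized projection versus a quantized one. FakeQuantProj is a stand-in class, not the builder's actual quantized module type.

import torch

float_proj = torch.nn.Linear(64, 64, bias=False)   # no "bits" attribute

class FakeQuantProj:        # stand-in for a quantized projection module
    bits = 4                # quantized: report the bit width
    weight = None

def resolve_dtype(proj):
    # Same pattern as the quoted lines: prefer "bits" when present,
    # otherwise fall back to the float weight's dtype.
    return getattr(proj, "bits", None) or getattr(proj.weight, "dtype", None)

print(resolve_dtype(float_proj))       # torch.float32
print(resolve_dtype(FakeQuantProj()))  # 4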
Can we change the predicate order to check for dtype first and bits second? Most models generated via the model builder have not been quantized in advance, so the latter predicate is true more often than the former.
1. Changes in OGA to support mixed-precision export of models quantized with GPTQModel.
2. Changes to decide whether to use a packed MatMul based on the Q, K, and V precisions.
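Taken together, the per-layer decision discussed in the review looks roughly like the sketch below. It is condensed from the snippets quoted above and written as a standalone function for illustration; the function name and return values are hypothetical, and this is not a verbatim copy of the merged code.

# Condensed from the review discussion: per-layer packed-vs-separate decision.
def qkv_layout_for_layer(use_packed_matmul_globally, q_proj, k_proj, v_proj):
    # Prefer the quantized bit width when present, else the float weight dtype.
    q_dtype = getattr(q_proj, "bits", None) or getattr(q_proj.weight, "dtype", None)
    k_dtype = getattr(k_proj, "bits", None) or getattr(k_proj.weight, "dtype", None)
    v_dtype = getattr(v_proj, "bits", None) or getattr(v_proj.weight, "dtype", None)
    all_dtype_equal = q_dtype == k_dtype == v_dtype

    # Pack only when the global conditions (EP support, no LoRA, no Q/K norm)
    # allow it and this layer's Q/K/V share a precision; otherwise fall back
    # to three separate MatMuls.
    return "packed" if use_packed_matmul_globally and all_dtype_equal else "separate"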