
Conversation


@jixiongdeng commented on Nov 3, 2025:

Problem

The current model builder doesn't support shared embedding layers with 4-bit quantized weights, which takes up more disk space and hurts the compression rate. builder.py also doesn't provide flexible options to toggle graph construction and quantization configs, such as unpacked/packed MatMul, RTN, k_quant, etc.

Solution

  • Calculated flat_dim in a more generic way on the Reshape node before GatherBlockQuantized (supports both 4-bit and 8-bit).
  • Added CUDA kernel support in ORT #26484.
  • Added more extra_options to enable different quantization configs and packing options.

Running examples:
Unpacked qkv_proj and shared 4-bit RTN on Llama 3.2 1B Instruct:

python src/python/py/models/builder.py -m meta-llama/Llama-3.2-1B-Instruct -p int4 -e cuda -o export_model/llama32_1bi_rtn_4_4_unpacked_tied --extra_options int4_is_symmetric=false unpack_matmul=true int4_algo_config=rtn

Shared 4-bit k_quant on Phi-4-Mini-Instruct:

python src/python/py/models/builder.py -m microsoft/Phi-4-Mini-Instruct -p int4 -e cuda -o export_model/phi4mini_i_kquant_4_4_tied --extra_options int4_is_symmetric=false unpack_matmul=true int4_algo_config=k_quant

Changes

Modified Files

  • src/python/py/models/builder.py

Key Modifications

  1. Computed flat_dim in a generic manner before feeding it into GatherBlockQuantized (see the sketch after this list).
  2. Explicitly defined gather_axis and quantize_axis for clarity.
  3. Added an unpack_matmul option to separate qkv_proj if needed.
  4. Added rtn_last, analogous to k_quant_last, as a new mixed-precision option.
  5. Added k_quant, analogous to rtn, as a new 4-bit quantizer option.
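
As a rough sketch of modification 1 (a hypothetical helper, not the actual builder.py code): the number of packed uint8 columns in a block-quantized weight row follows directly from the bit width, so the Reshape feeding GatherBlockQuantized can derive it instead of assuming 4-bit.

# Hypothetical sketch, not the builder.py implementation.
def packed_flat_dim(hidden_size: int, bits: int) -> int:
    # A uint8 element holds 8 // bits quantized values: 2 for 4-bit, 1 for 8-bit.
    elems_per_byte = 8 // bits
    assert bits in (4, 8) and hidden_size % elems_per_byte == 0
    return hidden_size // elems_per_byte

# Example: a row of 2048 values packs into 1024 uint8 columns at 4-bit and 2048 at 8-bit.
assert packed_flat_dim(2048, 4) == 1024
assert packed_flat_dim(2048, 8) == 2048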

@jixiongdeng (Author) commented:

@jixiongdeng please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"

Contributor License Agreement

@microsoft-github-policy-service agree company="Microsoft"

@jixiongdeng requested a review from jambayk on November 3, 2025 at 23:27.

The review comments below are on src/python/py/models/builder.py; the code above each comment is the diff context it is attached to.
)

# Allow extra_options to override use_packed_matmul
if "unpack_matmul" in extra_options:

Contributor:

This is an optimization opportunity that should be auto-detected by the model builder; we should not need to push this responsibility onto the user. See the review comments on this PR for more details.


- elif quant_method in {"k_quant_mixed", "k_quant_last"}:
+ elif quant_method in {"k_quant", "k_quant_mixed", "k_quant_last"}:
      from onnxruntime.quantization.matmul_nbits_quantizer import KQuantWeightOnlyQuantConfig

Contributor:

Let's move this import up. It was previously here because it was not part of a stable release.

from onnxruntime.quantization.matmul_nbits_quantizer import (
    MatMulNBitsQuantizer,
    QuantFormat,
    RTNWeightOnlyQuantConfig,
)
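
Taken literally, the suggestion would fold the new import into that existing top-level group (assuming KQuantWeightOnlyQuantConfig now ships in a stable onnxruntime release):

from onnxruntime.quantization.matmul_nbits_quantizer import (
    KQuantWeightOnlyQuantConfig,
    MatMulNBitsQuantizer,
    QuantFormat,
    RTNWeightOnlyQuantConfig,
)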


- if quant_method == "rtn":
-     int4_algo_config = RTNWeightOnlyQuantConfig()
+ if quant_method in {"rtn", "rtn_last"}:

Contributor:

I think this can be simplified to the following.

if quant_method in {"rtn", "rtn_last"}:
    if quant_method == "rtn_last":
        customized_weight_config["/lm_head/MatMul"] = {"bits": 8}
    int4_algo_config = RTNWeightOnlyQuantConfig(customized_weight_config=customized_weight_config)

+     int4_algo_config = RTNWeightOnlyQuantConfig(customized_weight_config=customized_weight_config)

- elif quant_method in {"k_quant_mixed", "k_quant_last"}:
+ elif quant_method in {"k_quant", "k_quant_mixed", "k_quant_last"}:

Contributor:

I think this can be simplified to the following.

elif quant_method in {"k_quant", "k_quant_mixed", "k_quant_last"}:
    if quant_method != "k_quant":
        customized_weight_config["/lm_head/MatMul"] = {"bits": 8}

    if quant_method == "k_quant_mixed":
        # k_quant_mixed is from llama.cpp.
        # Reference: https://github.com/ggml-org/llama.cpp/blob/36667c8edcded08063ed51c7d57e9e086bbfc903/src/llama-quant.cpp#L136
        # We also consider that some MatMuls are more sensitive to quantization than others.
        layers_to_exclude = [
            i
            for i in range(self.num_layers)
            if i < self.num_layers / 8 or i >= 7 * self.num_layers / 8 or (i - round(self.num_layers / 8)) % 3 == 2
        ]
        for i in layers_to_exclude:
            customized_weight_config["/model/layers." + str(i) + "/attn/qkv_proj/MatMul"] = {"bits": 8}
            customized_weight_config["/model/layers." + str(i) + "/attn/v_proj/MatMul"] = {"bits": 8}
            customized_weight_config["/model/layers." + str(i) + "/mlp/down_proj/MatMul"] = {"bits": 8}

    int4_algo_config = KQuantWeightOnlyQuantConfig(customized_weight_config=customized_weight_config)
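
For reference, with this selection and num_layers = 16, layers_to_exclude evaluates to [0, 1, 4, 7, 10, 13, 14, 15]: the first and last eighth of the layers, plus every third layer in between, keep 8-bit weights for those MatMuls.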

- self.int8_lm_head = extra_options.get("int4_algo_config", "default") in {"k_quant_mixed", "k_quant_last"}
- if not self.int8_lm_head:
+ self.int8_lm_head = extra_options.get("int4_algo_config", "default") in {"k_quant_mixed", "k_quant_last", "rtn_last"}
+ if not self.int8_lm_head and extra_options.get("int4_algo_config", "default") not in {"rtn", "k_quant"}:

Contributor:

Can we rewrite the above section and the if condition to just match on the conditions needed for tied embeddings to be true and otherwise set it to false?

Something like this:

self.int8_lm_head = extra_options.get("int4_algo_config", "default") in {"k_quant_mixed", "k_quant_last", "rtn_last"}
self.int4_tied_embeddings = extra_options.get("int4_tied_embeddings", config.tie_word_embeddings if hasattr(config, "tie_word_embeddings") and config.tie_word_embeddings is not None else False)

# matmul_nbits_quantizer.py has a different naming for default quantization, so lm_head.MatMul.weight_Q{}G{} does not match.
# tied_embeddings lm_head.MatMul.weight_Q{}G{} only works with rtn&k_quant on 4bit
self.int4_tied_embeddings = <boolean expression>

@kunal-vaishnavi (Contributor):

Can you update the options for int4_algo_config here and add their descriptions?

int4_algo_config = Method for int4 quantization. Default is 'default'.
Currently supported options are: 'default', 'rtn', 'k_quant_mixed', 'k_quant_last'.
k_quant_mixed = k_quant algorithm with mixed precision (int4 + int8).
k_quant_last = k_quant algorithm where only the last MatMul (/lm_head/MatMul) is quantized as int8. Other MatMuls are quantized as int4.
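
A possible updated description, sketched from the options added in this PR (wording for the author to adapt):

int4_algo_config = Method for int4 quantization. Default is 'default'.
Currently supported options are: 'default', 'rtn', 'rtn_last', 'k_quant', 'k_quant_mixed', 'k_quant_last'.
rtn = RTN (round-to-nearest) algorithm where all MatMuls are quantized as int4.
rtn_last = RTN algorithm where only the last MatMul (/lm_head/MatMul) is quantized as int8. Other MatMuls are quantized as int4.
k_quant = k_quant algorithm where all MatMuls are quantized as int4.
k_quant_mixed = k_quant algorithm with mixed precision (int4 + int8).
k_quant_last = k_quant algorithm where only the last MatMul (/lm_head/MatMul) is quantized as int8. Other MatMuls are quantized as int4.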
