Add CUDA implementation of GatherBlockQuantized operator #25575

xiaomsft · 2025-07-29T08:35:55Z

Description

This PR implements the GatherBlockQuantized operator for the CUDA EP, with support for 4-bit and 8-bit data.

Motivation and Context

The GatherBlockQuantized operator is essential for expert selection in MoE models, especially when the model has been statically quantized.
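For readers unfamiliar with the idea, the following is a minimal CPU-side sketch of what "gather, then block-wise dequantize" can look like for 4-bit data. It is not the code from this PR and does not reproduce the contrib op's exact contract. It assumes unsigned 4-bit values packed two per byte with the low nibble first, gathering along the first axis, and one scale and one zero point per block along the last axis. The function names `unpack_uint4` and `gather_block_dequantize` are hypothetical.

```python
from typing import List, Sequence


def unpack_uint4(packed: Sequence[int]) -> List[int]:
    """Split each byte into two 4-bit values, low nibble first (assumed layout)."""
    values = []
    for byte in packed:
        values.append(byte & 0x0F)
        values.append((byte >> 4) & 0x0F)
    return values


def gather_block_dequantize(
    packed_rows: Sequence[Sequence[int]],
    scales: Sequence[Sequence[float]],
    zero_points: Sequence[Sequence[int]],
    indices: Sequence[int],
    block_size: int,
) -> List[List[float]]:
    """Gather rows by index, then dequantize each element as (q - zp) * scale.

    scales[r][b] and zero_points[r][b] apply to block b of row r, where a
    block covers `block_size` consecutive unpacked elements (an assumption).
    """
    result = []
    for idx in indices:
        quantized = unpack_uint4(packed_rows[idx])
        row = []
        for col, q in enumerate(quantized):
            block = col // block_size
            row.append((q - zero_points[idx][block]) * scales[idx][block])
        result.append(row)
    return result


# Two rows of four 4-bit values each, packed into two bytes per row.
packed_rows = [[0x10, 0x32], [0x98, 0xBA]]
scales = [[1.0, 0.5], [2.0, 0.25]]
zero_points = [[0, 2], [8, 10]]
selected = gather_block_dequantize(packed_rows, scales, zero_points, [1, 0], block_size=2)
```

Only the gathered rows are dequantized, which is why this pattern suits expert selection: the full quantized table stays packed, and work is done only for the selected indices.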

onnxruntime/contrib_ops/cuda/quantization/gather_block_quantized.cc

onnxruntime/contrib_ops/cuda/quantization/gather_block_quantized.cu

onnxruntime/contrib_ops/cuda/quantization/gather_block_quantized.cuh

tianleiwu · 2025-07-29T15:53:55Z

Please update onnxruntime/test/contrib_ops/gather_block_quantized_op_test.cc to test CUDA EP when it is available.

xiaomsft · 2025-07-29T18:23:31Z

> Please update onnxruntime/test/contrib_ops/gather_block_quantized_op_test.cc to test CUDA EP when it is available.

Working on it

xiaomsft · 2025-07-29T20:52:54Z

@microsoft-github-policy-service agree [company="{Microsoft}"]

xiaomsft · 2025-07-29T20:54:22Z

@microsoft-github-policy-service agree company="Microsoft"

onnxruntime/test/contrib_ops/gather_block_quantized_op_test.cc

include/onnxruntime/core/framework/op_kernel.h

tianleiwu · 2025-08-01T15:51:19Z

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows x64 QNN CI Pipeline

azure-pipelines · 2025-08-01T15:51:39Z

Azure Pipelines successfully started running 5 pipeline(s).

azure-pipelines · 2025-08-01T15:51:54Z

Azure Pipelines successfully started running 5 pipeline(s).

tianleiwu · 2025-08-01T19:17:34Z

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows x64 QNN CI Pipeline

azure-pipelines · 2025-08-01T19:17:54Z

Azure Pipelines successfully started running 5 pipeline(s).

### Description

This PR implements the GatherBlockQuantized operator for the CUDA EP, with support for 4-bit and 8-bit data.

### Motivation and Context

The GatherBlockQuantized operator is essential for expert selection in MoE models, especially when the model has been statically quantized.

---------

Co-authored-by: Xiaoyan Hu <[email protected]>

### Description

Cherry-pick the following PRs: #25943 #25937 #25917 #25909 #25898 #25897 #25888 #25881 #25830 #25619 #25575 #25572 #25558 #25530 #25474 #25455 #25110

Also two dependent PRs for qMoE CPU: #25877 #25822

---------

Co-authored-by: xiaomsft <[email protected]>
Co-authored-by: Xiaoyan Hu <[email protected]>
Co-authored-by: Akshay Sonawane <[email protected]>
Co-authored-by: Kunal Vaishnavi <[email protected]>
Co-authored-by: Pradeep Sakhamoori <[email protected]>
Co-authored-by: mingyue <[email protected]>
Co-authored-by: Maximilian Müller <[email protected]>
Co-authored-by: Adrian Lizarraga <[email protected]>
Co-authored-by: Dmitri Smirnov <[email protected]>
Co-authored-by: Emmanuel <[email protected]>
Co-authored-by: Emmanuel Assumang <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: praneshgo <[email protected]>
Co-authored-by: Hariharan Seshadri <[email protected]>
Co-authored-by: Jing Fang <[email protected]>
Co-authored-by: Ishwar Raut <[email protected]>

github-advanced-security bot found potential problems Jul 29, 2025

onnxruntime/contrib_ops/cuda/quantization/gather_block_quantized.cc (fixed)

onnxruntime/contrib_ops/cuda/quantization/gather_block_quantized.cu (fixed)

onnxruntime/contrib_ops/cuda/quantization/gather_block_quantized.cuh (fixed)