Add CUDA implementation of GatherBlockQuantized operator #25575
|
Please update onnxruntime/test/contrib_ops/gather_block_quantized_op_test.cc to test CUDA EP when it is available. |
Working on it |
onnxruntime/contrib_ops/cuda/quantization/gather_block_quantized.cc
|
@microsoft-github-policy-service agree [company="{Microsoft}"] |
|
@microsoft-github-policy-service agree company="Microsoft" |
c8d2587 to 3e352e4
onnxruntime/contrib_ops/cuda/quantization/gather_block_quantized.cu
3e352e4 to bb04d4c
|
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows x64 QNN CI Pipeline |
|
Azure Pipelines successfully started running 5 pipeline(s). |
7be0d43 to 0c7938e
|
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows x64 QNN CI Pipeline |
|
Azure Pipelines successfully started running 5 pipeline(s). |
### Description This PR implements the GatherBlockQuantized operator for the CUDA EP, with support for 4-bit and 8-bit data. ### Motivation and Context The GatherBlockQuantized operator is essential for expert selection in MoE models, especially when the model has been statically quantized. --------- Co-authored-by: Xiaoyan Hu <[email protected]>
### Description Cherry-pick the following PRs: #25943 #25937 #25917 #25909 #25898 #25897 #25888 #25881 #25830 #25619 #25575 #25572 #25558 #25530 #25474 #25455 #25110 Also two dependent PRs for qMoE cpu: #25877 #25822 --------- Co-authored-by: xiaomsft <[email protected]> Co-authored-by: Xiaoyan Hu <[email protected]> Co-authored-by: Akshay Sonawane <[email protected]> Co-authored-by: Kunal Vaishnavi <[email protected]> Co-authored-by: Pradeep Sakhamoori <[email protected]> Co-authored-by: mingyue <[email protected]> Co-authored-by: Maximilian Müller <[email protected]> Co-authored-by: Adrian Lizarraga <[email protected]> Co-authored-by: Dmitri Smirnov <[email protected]> Co-authored-by: Emmanuel <[email protected]> Co-authored-by: Emmanuel Assumang <[email protected]> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: praneshgo <[email protected]> Co-authored-by: Hariharan Seshadri <[email protected]> Co-authored-by: Jing Fang <[email protected]> Co-authored-by: Ishwar Raut <[email protected]>
Description
This PR implements the GatherBlockQuantized operator for the CUDA EP, with support for 4-bit and 8-bit data.
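For readers unfamiliar with the operator, the following is a minimal pure-Python sketch of the general idea: gather rows from a block-quantized table and dequantize them with per-block scales. It is not the CUDA kernel from this PR. The function names (`unpack_uint4`, `gather_block_dequantize`), the restriction to gathering along the first axis with blocks along the last axis, the low-nibble-first packing order, and the default zero point of 0 are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def unpack_uint4(packed: Sequence[int]) -> List[int]:
    """Split each byte into two 4-bit values.

    Assumption for this sketch: the low nibble holds the first element.
    """
    values = []
    for byte in packed:
        values.append(byte & 0x0F)         # low nibble
        values.append((byte >> 4) & 0x0F)  # high nibble
    return values


def gather_block_dequantize(
    data: Sequence[Sequence[int]],
    scales: Sequence[Sequence[float]],
    indices: Sequence[int],
    block_size: int,
    zero_points: Optional[Sequence[Sequence[int]]] = None,
    default_zero_point: int = 0,  # hypothetical default, chosen for the sketch
) -> List[List[float]]:
    """Gather rows of `data` and dequantize them block by block.

    data[r][c] is an already-unpacked quantized value; scales[r][b] and
    zero_points[r][b] apply to the block b = c // block_size of row r.
    """
    output = []
    for row in indices:
        if row < 0:
            row += len(data)  # allow negative indices, counted from the end
        dequantized = []
        for col, q in enumerate(data[row]):
            block = col // block_size
            if zero_points is not None:
                zp = zero_points[row][block]
            else:
                zp = default_zero_point
            dequantized.append((q - zp) * scales[row][block])
        output.append(dequantized)
    return output


# Usage: a 3 x 4 table quantized in blocks of 2 along the last axis.
table = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
table_scales = [[0.5, 1.0], [2.0, 0.25], [1.0, 1.0]]
result = gather_block_dequantize(table, table_scales, [2, 0], block_size=2)
```

Only the gathered rows are dequantized, which is the appeal of fusing the two steps: the full table can stay in its compact quantized form.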
Motivation and Context
The GatherBlockQuantized operator is essential for expert selection in MoE models, especially when the model has been statically quantized.