@lovedheart commented Dec 8, 2025

before (Win11):
Device description: AMD Radeon 780M Graphics
Device memory: 73642 MB (69960 MB free)

  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                4260 runs -   277.47 us/run - 117.44 MFLOP/run - 423.25 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                3834 runs -   290.91 us/run - 234.88 MFLOP/run - 807.40 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                2556 runs -   432.13 us/run - 352.32 MFLOP/run - 815.32 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                1065 runs -  1080.99 us/run - 469.76 MFLOP/run - 434.57 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 342 runs -  3527.08 us/run - 587.20 MFLOP/run - 166.48 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 428 runs -  2444.39 us/run - 939.52 MFLOP/run - 384.36 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                88 runs - 11373.06 us/run -  60.13 GFLOP/run -   5.29 TFLOPS

after (Win11):
Device description: AMD Radeon 780M Graphics
Device memory: 73642 MB (69960 MB free)

  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                4260 runs -   260.74 us/run - 117.44 MFLOP/run - 450.41 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                2556 runs -   432.51 us/run - 234.88 MFLOP/run - 543.07 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                2840 runs -   383.47 us/run - 352.32 MFLOP/run - 918.78 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                2343 runs -   446.02 us/run - 469.76 MFLOP/run -   1.05 TFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                2052 runs -   512.37 us/run - 587.20 MFLOP/run -   1.15 TFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                1498 runs -   701.03 us/run - 939.52 MFLOP/run -   1.34 TFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                88 runs - 11421.74 us/run -  60.13 GFLOP/run -   5.26 TFLOPS

before (Ubuntu 24.04):

  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                6816 runs -   163.45 us/run - 117.44 MFLOP/run - 718.50 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                4260 runs -   236.40 us/run - 234.88 MFLOP/run - 993.55 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                3692 runs -   274.70 us/run - 352.32 MFLOP/run -   1.28 TFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                2769 runs -   372.82 us/run - 469.76 MFLOP/run -   1.26 TFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                1710 runs -   585.04 us/run - 587.20 MFLOP/run -   1.00 TFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 321 runs -  3479.14 us/run - 939.52 MFLOP/run - 270.04 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               100 runs - 10022.90 us/run -  60.13 GFLOP/run -   6.00 TFLOPS

after (Ubuntu 24.04):

  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                5964 runs -   181.51 us/run - 117.44 MFLOP/run - 647.01 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                4686 runs -   232.84 us/run - 234.88 MFLOP/run -   1.01 TFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                3408 runs -   311.76 us/run - 352.32 MFLOP/run -   1.13 TFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                2982 runs -   347.19 us/run - 469.76 MFLOP/run -   1.35 TFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                2565 runs -   393.64 us/run - 587.20 MFLOP/run -   1.49 TFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                1926 runs -   542.61 us/run - 939.52 MFLOP/run -   1.73 TFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               102 runs -  9966.51 us/run -  60.13 GFLOP/run -   6.03 TFLOPS
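To read the Vulkan before/after tables at a glance, the per-case speedup is the ratio of the two `us/run` values. A minimal sketch, with the timings copied by hand from the logs above; the dictionary layout and the `speedup` helper are illustrative only and not part of this PR:

```python
# us/run for the Vulkan backend, copied from the logs above.
# Key: n (number of columns in the f32 operand). Value: (before, after).
WIN11 = {
    1: (277.47, 260.74),
    2: (290.91, 432.51),
    3: (432.13, 383.47),
    4: (1080.99, 446.02),
    5: (3527.08, 512.37),
    8: (2444.39, 701.03),
    512: (11373.06, 11421.74),
}
UBUNTU = {
    1: (163.45, 181.51),
    2: (236.40, 232.84),
    3: (274.70, 311.76),
    4: (372.82, 347.19),
    5: (585.04, 393.64),
    8: (3479.14, 542.61),
    512: (10022.90, 9966.51),
}


def speedup(before_us, after_us):
    """Return before/after; values above 1.0 mean the change is faster."""
    return before_us / after_us


for name, table in (("Win11", WIN11), ("Ubuntu 24.04", UBUNTU)):
    for n, (before, after) in table.items():
        print(f"{name} n={n}: {speedup(before, after):.2f}x")
```

Ratios below 1.0 (for example n=2 on Win11) mark cases that ran slower after the change in these particular runs.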

Reference (ROCm, Ubuntu 24.04):

Backend 1/2: ROCm0
  Device description: AMD Radeon 780M Graphics
  Device memory: 110592 MB (107636 MB free)

  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                5964 runs -   181.06 us/run - 117.44 MFLOP/run - 648.64 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                4686 runs -   220.70 us/run - 234.88 MFLOP/run -   1.06 TFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                3408 runs -   303.75 us/run - 352.32 MFLOP/run -   1.16 TFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                2769 runs -   381.01 us/run - 469.76 MFLOP/run -   1.23 TFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                2736 runs -   385.29 us/run - 587.20 MFLOP/run -   1.52 TFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                1712 runs -   596.95 us/run - 939.52 MFLOP/run -   1.57 TFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):               112 runs -  8988.31 us/run -  60.13 GFLOP/run -   6.69 TFLOPS
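The throughput column in these logs can be cross-checked from the other two columns: FLOP per run divided by time per run. A small sketch of that arithmetic (the `gflops` helper name is made up for illustration; the inputs are the rounded values printed in the logs, so the last digit may differ slightly):

```python
def gflops(mflop_per_run, us_per_run):
    """Convert MFLOP per run and microseconds per run into GFLOPS."""
    flop = mflop_per_run * 1e6      # MFLOP -> FLOP
    seconds = us_per_run * 1e-6     # us -> s
    return flop / seconds / 1e9     # FLOP/s -> GFLOPS


# First "before (Win11)" row: 117.44 MFLOP/run at 277.47 us/run.
print(f"{gflops(117.44, 277.47):.2f} GFLOPS")
```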

Test:

  MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): OK
  MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=2,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): OK
  MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=3,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): OK
  MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=4,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): OK
  MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=5,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): OK
  MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=6,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): OK
  MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=7,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): OK
  MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=8,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): OK
  MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=9,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): OK
  MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): OK
  10/10 tests passed
  Backend Vulkan0: OK
  MUL_MAT_ID(type_a=iq1_s,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=1,k=256): OK
  MUL_MAT_ID(type_a=iq1_s,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256): OK
  2/2 tests passed
  Backend Vulkan0: OK

@lovedheart changed the title from "Improve mul_mat_vec_iq1_s speed" to "Vulkan: Improve mul_mat_vec_iq1_s speed" Dec 8, 2025
@lovedheart marked this pull request as ready for review December 8, 2025 23:09
@lovedheart requested a review from 0cc4m as a code owner December 8, 2025 23:09
@github-actions bot added the labels Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) Dec 9, 2025