@SongXiaoXi
Contributor

  • Removed redundant _mm256_permute2f128_ps instructions for lane swapping.
  • Reordered final output assignments to match the expected layout directly, simplifying downstream processing.
  • This change reduces register pressure and improves instruction efficiency without altering the computation logic.
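For readers unfamiliar with the intrinsic named in the first bullet, here is a minimal sketch of what a 128-bit lane swap with `_mm256_permute2f128_ps` does. It is not the kernel code from this PR; the helper names `swap_lanes` and `swap_lanes_avx` are hypothetical, and the immediate `0x01` is just one way to express the swap when both operands are the same register.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{_mm256_loadu_ps, _mm256_permute2f128_ps, _mm256_storeu_ps};

/// Hypothetical helper: swap the two 128-bit halves of an 8 x f32 vector.
/// Must only be called when the CPU supports AVX.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn swap_lanes_avx(x: [f32; 8]) -> [f32; 8] {
    let mut out = [0.0f32; 8];
    unsafe {
        let v = _mm256_loadu_ps(x.as_ptr());
        // imm8 = 0x01: low half of the result <- high half of `v`,
        //              high half of the result <- low half of `v`.
        let swapped = _mm256_permute2f128_ps::<0x01>(v, v);
        _mm256_storeu_ps(out.as_mut_ptr(), swapped);
    }
    out
}

/// Safe wrapper: returns None when AVX is not available at runtime.
#[cfg(target_arch = "x86_64")]
fn swap_lanes(x: [f32; 8]) -> Option<[f32; 8]> {
    if is_x86_feature_detected!("avx") {
        Some(unsafe { swap_lanes_avx(x) })
    } else {
        None
    }
}

/// Fallback so the example still compiles on other architectures.
#[cfg(not(target_arch = "x86_64"))]
fn swap_lanes(_x: [f32; 8]) -> Option<[f32; 8]> {
    None
}

fn main() {
    let x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0];
    match swap_lanes(x) {
        Some(y) => println!("{:?} -> {:?}", x, y),
        None => println!("AVX not available on this machine"),
    }
}
```

Each such permute is an extra instruction that crosses the 128-bit lane boundary, so dropping the ones whose effect can instead be absorbed into the order of the final stores plausibly accounts for the savings described above.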

Performance Benchmarks:

Tested on an Intel Core i7-10700K (single core)

| Matrix Size | Before Optimization (ns/iter) | After Optimization (ns/iter) | Improvement (%) |
|---|---|---|---|
| m004 | 177 (+/- 1) | 174 (+/- 17) | 1.69% |
| m006 | 209 (+/- 19) | 202 (+/- 20) | 3.35% |
| m008 | 200 (+/- 0) | 191 (+/- 1) | 4.50% |
| m012 | 267 (+/- 2) | 243 (+/- 4) | 8.99% |
| m016 | 278 (+/- 6) | 233 (+/- 10) | 16.19% |
| m032 | 1,344 (+/- 14) | 989 (+/- 41) | 26.39% |
| m064 | 8,974 (+/- 169) | 6,050 (+/- 17) | 32.60% |
| m127 | 64,466 (+/- 1,468) | 41,704 (+/- 78) | 35.32% |
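
The Improvement column appears to be the relative reduction in time per iteration, `(before - after) / before`. The following is a small sketch under that assumption; the function name `improvement_pct` is made up for illustration. Recomputing from the rounded ns/iter values can differ from the reported figure in the last digit for some rows, presumably because the original numbers came from unrounded timings.

```rust
/// Relative reduction in time per iteration, in percent.
/// Assumes "improvement" means (before - after) / before.
fn improvement_pct(before_ns: f64, after_ns: f64) -> f64 {
    (before_ns - after_ns) / before_ns * 100.0
}

fn main() {
    // (label, before ns/iter, after ns/iter) taken from the benchmark results above.
    let rows = [
        ("m008", 200.0, 191.0),
        ("m016", 278.0, 233.0),
        ("m127", 64_466.0, 41_704.0),
    ];
    for (name, before, after) in rows {
        println!("{name}: {:.2}% faster", improvement_pct(before, after));
    }
}
```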

@SongXiaoXi changed the title from "sgemm: Reduce unnecessary AVX regiser permutations" to "sgemm: Reduce unnecessary AVX register permutations" on May 11, 2025
- Removed redundant `_mm256_permute2f128_ps` instructions for lane swapping.
- Consolidated `bv_lh` usage for upper and lower halves, reducing the number of separate permutes.
- Reordered final output assignments to match the expected layout directly, simplifying downstream processing.
- This change reduces register pressure and improves instruction efficiency without altering the computation logic.
@bluss
Owner

bluss commented May 11, 2025

If you are curious, larger matrix benchmarks are available in the other benchmark script.

For example (piping through xsv is optional; it only formats the output as a table):

./benches/benchloop.py  -t f32 -s 384 450 512 | xsv table

@SongXiaoXi
Contributor Author

I ran your script and obtained the following results:
Before:

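| m | k | n | layout | type | average_ns | minimum_ns | median_ns | samples | GFLOPS | nc | kc | mc | threads |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 384 | 384 | 384 | FCC | f32 | 1,618,910 | 1,617,871 | 1,618,362 | 1560 | 69.95213322544181 | | | | 0 |
| 450 | 450 | 450 | FCC | f32 | 2,650,805 | 2,647,933 | 2,648,578 | 1560 | 68.75269965161526 | | | | 0 |
| 512 | 512 | 512 | FCC | f32 | 3,834,974 | 3,821,840 | 3,824,136 | 310 | 69.99668211570665 | | | | 0 |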

After:

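| m | k | n | layout | type | average_ns | minimum_ns | median_ns | samples | GFLOPS | nc | kc | mc | threads |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 384 | 384 | 384 | FCC | f32 | 974,601 | 973,492 | 974,549 | 1560 | 116.1975085188708 | | | | 0 |
| 450 | 450 | 450 | FCC | f32 | 1,594,266 | 1,584,060 | 1,590,237 | 1560 | 114.31592971310936 | | | | 0 |
| 512 | 512 | 512 | FCC | f32 | 2,334,892 | 2,289,947 | 2,357,830 | 1560 | 114.96696892190303 | | | | 0 |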
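
The GFLOPS column is consistent with counting `2 * m * k * n` floating-point operations per product and dividing by `average_ns`; for the first row before the change this gives about 69.95. The sketch below assumes that formula (the benchmark script itself is not shown here), and the helper names `gflops` and `speedup` are hypothetical.

```rust
/// GFLOPS for an m x k x n matrix product, assuming 2*m*k*n floating-point
/// operations (one multiply and one add per inner-product term).
fn gflops(m: u64, k: u64, n: u64, average_ns: f64) -> f64 {
    let flop = 2.0 * (m * k * n) as f64;
    // Operations per nanosecond is numerically the same as GFLOP per second.
    flop / average_ns
}

/// Ratio of the time before the change to the time after it.
fn speedup(before_ns: f64, after_ns: f64) -> f64 {
    before_ns / after_ns
}

fn main() {
    // (size, average_ns before, average_ns after) from the results above.
    let rows = [
        (384u64, 1_618_910.0, 974_601.0),
        (450, 2_650_805.0, 1_594_266.0),
        (512, 3_834_974.0, 2_334_892.0),
    ];
    for (s, before, after) in rows {
        println!(
            "{s}: {:.2} -> {:.2} GFLOPS, speedup {:.2}x",
            gflops(s, s, s, before),
            gflops(s, s, s, after),
            speedup(before, after)
        );
    }
}
```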

@bluss
Owner

bluss commented May 11, 2025

Thanks for this huge improvement. Please ping me if I haven't released this within the next week.

@bluss bluss merged commit 9753008 into bluss:master May 11, 2025
9 of 13 checks passed