[CPU] Optimize GQA attention bias application for FP16 (#25871)
### Description
When the attention bias input is used with the GQA op in FP16, platforms
without native FP16 math support must cast the bias values to fp32, which
requires a temporary buffer to hold them. The issue was that this temporary
buffer was allocated and deallocated inside a loop, once for every token
being processed. This change refactors the implementation so that the
allocation happens only once and the buffer is reused across tokens.
Phi model throughput increased by 15%.
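
For illustration, here is a minimal C++ sketch of the hoisting pattern. All names here (`ApplyAttentionBias`, `HalfToFloat`, the flat row layout) are hypothetical stand-ins, not the actual ONNX Runtime GQA code, which uses its own FP16 type, allocator, and kernel structure:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Minimal FP16 -> FP32 widening (normal values and zero only; subnormal,
// inf, and NaN handling omitted for brevity).
static float HalfToFloat(uint16_t h) {
  const uint32_t sign = (static_cast<uint32_t>(h) & 0x8000u) << 16;
  const uint32_t exp = (h >> 10) & 0x1Fu;
  const uint32_t mant = h & 0x3FFu;
  uint32_t bits = sign;
  if (exp != 0) {
    bits |= ((exp - 15u + 127u) << 23) | (mant << 13);
  }
  float f;
  std::memcpy(&f, &bits, sizeof f);
  return f;
}

// Applies an FP16 attention bias to fp32 scores, one row per token.
// The fix: the fp32 scratch buffer is allocated once, outside the
// per-token loop, instead of on every iteration.
void ApplyAttentionBias(const uint16_t* bias_fp16,  // [num_tokens * row_len]
                        float* scores,              // [num_tokens * row_len]
                        size_t num_tokens,
                        size_t row_len) {
  // Hoisted allocation: previously a buffer like this was created and
  // destroyed inside the loop below, once per token.
  std::vector<float> bias_fp32(row_len);

  for (size_t t = 0; t < num_tokens; ++t) {
    const uint16_t* src = bias_fp16 + t * row_len;
    float* dst = scores + t * row_len;
    // Cast this token's bias row into the reused fp32 buffer, then apply it.
    for (size_t i = 0; i < row_len; ++i) {
      bias_fp32[i] = HalfToFloat(src[i]);
    }
    for (size_t i = 0; i < row_len; ++i) {
      dst[i] += bias_fp32[i];
    }
  }
}
```

Since the row length is fixed for the whole sequence, a single buffer sized once up front is sufficient; only the cast and the bias application remain inside the per-token loop.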