Conversation

@siddartha-RE
Contributor

@siddartha-RE siddartha-RE commented Aug 25, 2023

Enable use_cache support with Flash-Attention

As of flash-attn 2.1.0, the library handles the case of q_len != kv_len with causal attention, so flash attention can be used even when past key-values are supplied, without padding q. This covers the inference use case, so the patch is now useful at both inference time and training time.
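A minimal sketch of the behavior this relies on (not the PR's code; the shapes and sizes below are illustrative only): with flash-attn >= 2.1, causal=True aligns the mask to the bottom-right when q_len != kv_len, so a single decoded token can attend to the entire KV cache plus itself without padding q up to kv_len.

```python
# Sketch only: decoding one new token against a KV cache with flash-attn >= 2.1.
# Requires a CUDA device and fp16/bf16 tensors; shapes here are made up.
import torch
from flash_attn import flash_attn_func

batch, n_heads, head_dim = 1, 32, 128
past_len = 512  # tokens already held in the KV cache

q = torch.randn(batch, 1, n_heads, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn(batch, past_len + 1, n_heads, head_dim, device="cuda", dtype=torch.float16)
v = torch.randn(batch, past_len + 1, n_heads, head_dim, device="cuda", dtype=torch.float16)

# With flash-attn < 2.1 this required padding q up to kv_len; with >= 2.1 the
# causal mask is bottom-right aligned, so the new token attends to the whole
# cache plus itself, which is exactly what decoding needs.
out = flash_attn_func(q, k, v, causal=True)  # (batch, 1, n_heads, head_dim)
```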

Checks

  • I've run format.sh to lint the changes in this PR.
  • I've included any doc changes needed.
  • I've made sure the relevant tests are passing (if applicable).

@merrymercy
Member

merrymercy commented Aug 27, 2023

Can we try to merge these two https://github.com/lm-sys/FastChat/blob/main/fastchat/train/llama2_flash_attn_monkey_patch.py?
@siddartha-RE
Contributor Author

> Can we try to merge these two https://github.com/lm-sys/FastChat/blob/main/fastchat/train/llama2_flash_attn_monkey_patch.py?

I don't follow. This PR is an update to that file; it takes advantage of the fixed causal behavior in the latest flash-attn so that flash attention can also be used at inference time.

@merrymercy force-pushed the main branch 2 times, most recently from bf7aa7e to a81a04c on August 28, 2023 at 01:36
@merrymercy
Member

Do we also need to apply these changes to llama2_flash_attn_monkey_patch.py?

@siddartha-RE
Contributor Author

> Do we also need to apply these changes to llama2_flash_attn_monkey_patch.py?

That's the only file I have added it to. I submitted the initial implementation to handle past KV through brute-force padding of the q tensor (which of course carries a significant performance penalty). Now that flash-attn 2.1.0 supports the correct causal handling of q_len != kv_len, I have updated the implementation to remove the padding. To prevent people from getting incorrect results, I have added an assert on the flash-attention version to make sure the corrected behavior is available.
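The version guard could be as simple as the following sketch (the exact check in the patch may differ):

```python
# Sketch of a version assert: refuse to use the patch unless flash-attn is new
# enough to handle causal attention with q_len != kv_len correctly.
from packaging.version import Version
import flash_attn

assert Version(flash_attn.__version__) >= Version("2.1.0"), (
    "flash-attn >= 2.1.0 is required for correct causal attention with a KV cache"
)
```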

Member

Why do you store past_key_value in a transposed layout? I feel it introduces some redundant memory movement.

Contributor Author

Very sorry for losing track of this question. The reason is that code in the LlamaModel (in the transformers implementation) assumes a particular memory layout when looking up the past KV length; that's why we have to store it in this order. Fortunately, this is a zero-copy operation: the transpose only reorganizes the tensor metadata and does not cause any memory movement.
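To illustrate the zero-copy point (a standalone sketch, not the patch code), a transpose only rewrites strides; the underlying storage is shared:

```python
# transpose() returns a view: same storage, different strides/metadata.
import torch

kv = torch.randn(2, 8, 16, 64)   # e.g. (batch, seq_len, n_heads, head_dim)
kv_t = kv.transpose(1, 2)        # (batch, n_heads, seq_len, head_dim)

print(kv.data_ptr() == kv_t.data_ptr())  # True: same underlying memory
print(kv_t.is_contiguous())              # False: only the view metadata changed
```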

I was recently testing another repo that depends on this and hit this error, so it would be great to have this merged so that other people don't run into it as well.

Again sorry about not answering this sooner.

@merrymercy merrymercy merged this pull request into lm-sys:main Sep 29, 2023
@merrymercy
Member

@siddartha-RE Thanks! It is merged.

@arvindsun

Looks like the merge commit never made it to the main branch?

491818f

@siddartha-RE
Contributor Author

Not sure what happened here. Possibly the main branch was rewritten and the commit was dropped. I will try raising the PR again.
