
Conversation

SimJeg (Collaborator) commented Nov 25, 2024

TOVA press as requested in #2

SimJeg linked an issue Nov 25, 2024 that may be closed by this pull request
SimJeg requested a review from maxjeblick November 25, 2024 10:55
maxjeblick (Collaborator) left a comment


Thanks a lot for adding TOVA, LGTM!

hassidm commented Nov 26, 2024

Thanks a lot for adding TOVA!

Just a small clarification:
Are the attn_weights at this line of shape [batch, heads, seq_len, seq_len] or [batch, heads, window_size, seq_len]? For TOVA in the prefilling phase, the former should be implemented (and if I understand correctly, the former is what is implemented, according to link or link).
After computing the attention weights with shape [batch, heads, seq_len, seq_len], the compression should correspond to the last token's attention weights (averaged across heads):
torch.mean(attn_weights[:, :, -1:, :], dim=1) (which is implemented correctly in the PR :) )

Thanks again, and sorry for the misunderstanding :)

SimJeg (Collaborator, Author) commented Nov 26, 2024

After #L58, the shape is [batch, heads, window_size, seq_len] and not [batch, heads, seq_len, seq_len], because we only compute attention for the last window_size tokens.

There is no need to compute the full attention weight matrix, which would cost a lot of memory. We only need attn_weights = full_attn_weights[:, :, -window_size:, :].

So in our code:

```python
# attn_weights -> shape [batch, heads, window_size=1, seq_len]
scores = attn_weights.mean(1)                # -> shape [batch, window_size=1, seq_len]
scores = scores.repeat(1, keys.shape[1], 1)  # -> shape [batch, num_key_value_heads, seq_len]
```
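
For completeness, here is a quick sanity check with random tensors (dimensions and variable names are illustrative, this is not code from the repo) showing that computing attention only for the last window_size queries gives exactly the rows you would get by building the full matrix and slicing it:

```python
import torch

batch, heads, seq_len, head_dim, window_size = 2, 8, 16, 64, 1
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)

# Full attention matrix, then keep the last row(s): [batch, heads, window_size, seq_len]
full_attn = torch.softmax(q @ k.transpose(-1, -2) / head_dim**0.5, dim=-1)
sliced = full_attn[:, :, -window_size:, :]

# Attention computed only for the last window_size queries: same values, far less memory
windowed = torch.softmax(q[:, :, -window_size:] @ k.transpose(-1, -2) / head_dim**0.5, dim=-1)

assert torch.allclose(sliced, windowed, atol=1e-6)

# TOVA scores in both cases: the last token's attention averaged over heads
scores = windowed[:, :, -1:, :].mean(dim=1)  # [batch, 1, seq_len]
```

(The causal mask is omitted here since, with window_size=1, the last query attends to all keys anyway.)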

Can you confirm everything is as you expect?

hassidm commented Nov 26, 2024

Thanks for the quick response!

I'm sorry, but if I understand correctly, this is not exactly the same computation for the deeper layers (it is the same for the first attention layer).

For the prefilling case, if token i attended to previous tokens in earlier attention layers, its hidden representation will differ from the one obtained when the same token did not attend to any other tokens in those layers. In the current implementation, the non-window tokens (which are all tokens but one) never attend to any other tokens, which results in non-contextualized representations for the non-window tokens and probably leads to (at least) slightly worse performance.

Regarding the memory cost, I agree that this approach is more costly, as we need to store (for a short period) the whole attention matrix of a specific layer, but the KV cache memory stays compressed.
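
For reference, a rough back-of-the-envelope estimate of that extra memory (hypothetical sizes, fp16, per layer and per sequence, not measured):

```python
heads, seq_len, window_size, bytes_per_el = 32, 32_768, 1, 2  # assumed sizes, fp16

full_matrix = heads * seq_len * seq_len * bytes_per_el   # [heads, seq_len, seq_len]
windowed = heads * window_size * seq_len * bytes_per_el  # [heads, window_size, seq_len]

print(f"full attention matrix: {full_matrix / 2**30:.0f} GiB")  # 64 GiB
print(f"windowed weights:      {windowed / 2**20:.0f} MiB")     # 2 MiB
```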

SimJeg (Collaborator, Author) commented Nov 26, 2024

The current code does exactly the same thing as what you first posted in the issue here.

> In the current implementation, the non-window tokens (which are all tokens but one) never attend to any other tokens

No, the press is applied after the forward pass (in a PyTorch hook) and only modifies the KV cache, not the hidden states. So whatever you prune in the first layers, the input hidden states of the current layer are not impacted. Hidden states during pre-filling are 100% independent of the press.

I hope this clarifies things.
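
To make the mechanism concrete, here is a rough sketch of what such a hook can look like (kv_cache, the scoring, and the registration call are placeholders, not the repo's actual API): the hook fires after the layer's forward pass, so it can only prune the cached keys/values, and the hidden states computed during pre-filling are never touched.

```python
import torch

def make_press_hook(compression_ratio):
    def hook(module, inputs, output):
        cache = getattr(module, "kv_cache", None)  # hypothetical attribute holding (keys, values)
        if cache is None:
            return output
        keys, values = cache                        # each [batch, kv_heads, seq_len, head_dim]
        scores = torch.rand(keys.shape[:-1])        # stand-in for TOVA scores: [batch, kv_heads, seq_len]
        n_kept = int(keys.shape[2] * (1 - compression_ratio))
        idx = scores.topk(n_kept, dim=-1).indices.unsqueeze(-1).expand(-1, -1, -1, keys.shape[-1])
        module.kv_cache = (keys.gather(2, idx), values.gather(2, idx))
        return output                               # hidden states pass through unchanged
    return hook

# Hypothetical usage: attention_layer.register_forward_hook(make_press_hook(0.5))
```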

hassidm commented Nov 26, 2024

Thank you for clarifying it, and I apologize for the misunderstanding. As I'm not very familiar with this repository, I wanted to ensure that the implemented code functions as intended.

LGTM, and thanks again!

SimJeg merged commit 3ca0ce4 into main Nov 26, 2024
2 checks passed
SimJeg deleted the simon/tova-press branch November 26, 2024 12:56