@daijh
Contributor

@daijh daijh commented Jul 30, 2025

Description

#25372 added sliding window support for Group Query Attention and disabled Flash Attention when a sliding window is used, because Flash Attention does not support it yet.

This PR adds a check for the sliding window and applies Flash Attention when the window size exceeds the KV cache length or total sequence length.
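
The check described above can be sketched as follows. This is only an illustration, not the code from this PR: the function names, the parameter names, and the use of a negative window size to mean "no sliding window" are all assumptions made for the example. The second helper shows why the fallback is safe: once the window is larger than the sequence, the sliding-window causal mask is identical to the plain causal mask, so Flash Attention without window support gives the same result.

```python
def sliding_window_is_noop(window_size: int,
                           total_sequence_length: int,
                           kv_cache_length: int) -> bool:
    """Return True when the sliding window cannot exclude any cached token.

    Hypothetical helper: a negative window_size is assumed to mean that
    no sliding window is configured.
    """
    if window_size < 0:
        return True
    # The window only matters if it is small enough to cut off some keys.
    return window_size > kv_cache_length or window_size > total_sequence_length


def causal_mask(seq_len, window_size=None):
    """mask[q][k] is True when query position q may attend to key position k."""
    return [
        [k <= q and (window_size is None or q - k < window_size)
         for k in range(seq_len)]
        for q in range(seq_len)
    ]


# A window wider than the sequence leaves the causal mask unchanged,
# while a narrower window removes the oldest keys for late queries.
same_as_full = causal_mask(6, window_size=8) == causal_mask(6)
differs_when_small = causal_mask(6, window_size=3) != causal_mask(6)
```

Under these assumptions, a caller would take the Flash Attention path only when `sliding_window_is_noop(...)` returns `True`, and keep the existing non-flash path otherwise.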

Motivation and Context

See above.

@daijh
Contributor Author

daijh commented Jul 30, 2025

@guschmue @fs-eire @qjia7

@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Jul 30, 2025
qjia7
qjia7 previously approved these changes Jul 31, 2025
Contributor

@qjia7 qjia7 left a comment

LGTM, thanks.

@guschmue
Contributor

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 5 pipeline(s).

@daijh
Contributor Author

daijh commented Aug 1, 2025

From the logs, the failures look like a CI infrastructure issue. Please help re-run the checks.

@guschmue guschmue merged commit 7cc93cf into microsoft:main Aug 1, 2025
87 of 91 checks passed
snnn pushed a commit that referenced this pull request Aug 1, 2025
…gth (#25594)

snnn added a commit that referenced this pull request Aug 1, 2025
This PR cherry-picks some pipeline changes from the main branch to the
1.23.0 release branch.


- **[build] disable CodeQL for NPM Packaging Pipeline (#25614)**
- **Refactor Java Test Pipeline (#25608)**
- **[build] upgrade Node.js for NPM packaging pipeline (#25568)**

And a WebGPU change:

- **[webgpu] Apply Flash Attention if sliding window exceeds KV cache
length (#25594)**
@daijh daijh deleted the supports-sliding-window-for-flash-attention branch August 2, 2025 00:52
sanketkaleoss pushed a commit to sanketkaleoss/onnxruntime that referenced this pull request Aug 11, 2025
…gth (microsoft#25594)

gedoensmax pushed a commit to gedoensmax/onnxruntime that referenced this pull request Sep 2, 2025
…gth (microsoft#25594)

Labels

ep:WebGPU ort-web webgpu provider

3 participants