chronos-2: Address Batch Size Scalability Issue in Group Attention for Faster Training/Inference #442
Open
li-jinpeng wants to merge 2 commits into amazon-science:main
Conversation
added 2 commits on December 22, 2025 at 21:03
Signed-off-by: xymli <xymli@tencent.com>
Background
Chronos-2 introduces group attention, which enables the foundation time-series model to handle multivariate inputs. However, as the linked code shows, the computational cost of group attention grows with the square of the batch size, so training and inference throughput degrades significantly as the batch size increases.
Optimization
To address this, this PR optimizes the computation of group attention. Consider an input with group_ids = [0, 0, 1, 1, 1, 1, 2, 2, 2, 3]. The original group attention matrix is structured as shown in the first diagram below, where a substantial amount of computation is wasted on cross-group pairs that are masked out anyway. The optimization decomposes this large matrix into four smaller per-group attention matrices, as illustrated in the second diagram, eliminating most of the wasted computation.
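To make the idea concrete: for the group_ids above, the full 10×10 attention matrix has 100 score entries, but only 4 + 16 + 9 + 1 = 30 of them are within-group. The following is a minimal sketch (not the PR's actual code) contrasting the masked full-batch computation with the per-group decomposition; tensor shapes and function names are illustrative.

```python
import torch
import torch.nn.functional as F

def grouped_attention_naive(q, k, v, group_ids):
    # One big (B, B) attention over the whole batch; cross-group pairs are
    # masked to -inf, so their compute is wasted. Cost grows with B^2.
    mask = group_ids[:, None] == group_ids[None, :]          # (B, B) boolean
    scores = (q @ k.T) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

def grouped_attention_fast(q, k, v, group_ids):
    # Decomposed version: attend independently inside each group, replacing
    # the big block-diagonal matrix with one small matrix per group.
    out = torch.empty_like(v)
    for g in group_ids.unique():
        idx = (group_ids == g).nonzero(as_tuple=True)[0]
        qs, ks, vs = q[idx], k[idx], v[idx]
        scores = (qs @ ks.T) / q.shape[-1] ** 0.5
        out[idx] = F.softmax(scores, dim=-1) @ vs
    return out

torch.manual_seed(0)
group_ids = torch.tensor([0, 0, 1, 1, 1, 1, 2, 2, 2, 3])
q, k, v = (torch.randn(10, 8) for _ in range(3))
print(torch.allclose(grouped_attention_naive(q, k, v, group_ids),
                     grouped_attention_fast(q, k, v, group_ids), atol=1e-5))
```

The two paths produce the same output (up to floating-point rounding); only the amount of wasted compute differs.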
Usage
This optimization can be enabled by setting the environment variable CHRONOS2_USE_FAST_GROUP_ATTENTION=1.
Experimentation & Results
Experiments were conducted on a single NVIDIA H20 GPU using the following test code:
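The original benchmark script is not reproduced in this excerpt. A minimal timing harness in the same spirit might look like the following; `benchmark` and `predict` are illustrative names, not from the PR, and the dummy `predict` stands in for a call to the Chronos-2 pipeline so the harness itself is runnable.

```python
import time

def benchmark(predict, batch_sizes, repeats=3):
    """Return per-sample latency (s/sample) for each batch size."""
    results = {}
    for bs in batch_sizes:
        batch = [None] * bs              # placeholder; real code builds input tensors
        predict(batch)                   # warm-up call (excluded from timing)
        start = time.perf_counter()
        for _ in range(repeats):
            predict(batch)
        elapsed = (time.perf_counter() - start) / repeats
        results[bs] = elapsed / bs       # normalize to seconds per sample
    return results

# Dummy predictor whose cost scales linearly with batch size.
res = benchmark(lambda batch: time.sleep(0.001 * len(batch)), [20, 40])
print(res)
```

With a real model in place of the dummy, the baseline implementation would show per-sample latency growing with batch size, while the fast path stays roughly flat.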
The final experimental results are summarized in the table below:
Chronos2 Inference Speed Test
Device: cuda
Sequence length: 1440
Number of features: 20
Forecast horizon: 96
✓ Model loaded successfully from /apdcephfs_fsgm/share_304079515/hunyuan_test/DualWeaver-1214/hf_ltm/chronos-2
✓ Baseline and fast implementations produce identical results
=== Benchmark: Baseline vs Fast Implementation ===

| Batch size | Baseline (s/sample) | Fast (s/sample) |
|-----------:|--------------------:|----------------:|
| 20 | 0.0228 | 0.0203 |
| 40 | 0.0303 | 0.0202 |
| 60 | 0.0358 | 0.0198 |
| 80 | 0.0415 | 0.0199 |
| 100 | 0.0471 | 0.0200 |
=== Benchmark Summary ===
The results demonstrate a significant improvement in end-to-end inference speed, particularly at larger batch sizes: at batch size 100, per-sample latency drops from 0.0471 s to 0.0200 s, roughly a 2.4x speedup.
Summary
This PR optimizes the computation of group attention to accelerate the end-to-end inference speed of Chronos-2 (training is similarly affected). The optimization effect is particularly significant when the batch size is large.
For the computation of group attention, certain `flash_attn` operators, such as `flash_attn_varlen_func`, are a natural fit and could further improve computational efficiency. Note, however, that `flash_attn_varlen_func` currently supports only the bf16 and fp16 data types. I look forward to integrating more efficient operators into Chronos-2 in the future, enabling it to be applied more broadly.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.