Fix word-level timestamp overflow in Whisper chunked transcription #1483
+76
−3
Summary
Clamp raw token timestamps to chunk_len to prevent timestamps from exceeding audio duration
Root Cause
When using chunk_length_s=30 with word-level timestamps (return_timestamps: 'word'), the Whisper model can output timestamps up to ~29.98s (the maximum representable timestamp given time_precision = 30/1500 = 0.02). For a final chunk shorter than 30s, these timestamps would be added to the accumulated time_offset, causing the final timestamps to exceed the actual audio duration.
Solution
Track the actual chunk_len from the stride information and clamp raw token timestamps before adding time_offset. This ensures word-level timestamps never exceed the audio duration while preserving the existing behavior for segment-level timestamps.
Test plan
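As a sanity check on the clamp described under Solution, here is a minimal sketch in plain JavaScript. The helper clampAndOffset and the 40 s scenario (one full 30 s chunk followed by a 10 s final chunk, with no overlap) are hypothetical and are not taken from this PR's diff; only time_precision = 30/1500 and the idea of clamping to chunk_len before adding time_offset come from the description.

```javascript
// Hypothetical helper, not the code from this PR: it only illustrates clamping
// raw timestamps to the chunk length before the running offset is added.
const TIME_PRECISION = 30 / 1500; // 0.02 s per timestamp step, as stated in Root Cause

/**
 * @param {number[]} rawTimestamps Timestamps in seconds, relative to the chunk start.
 * @param {number} chunkLen Actual length of this chunk in seconds.
 * @param {number} timeOffset Seconds of audio accumulated before this chunk.
 * @returns {number[]} Absolute timestamps, none larger than timeOffset + chunkLen.
 */
function clampAndOffset(rawTimestamps, chunkLen, timeOffset) {
  return rawTimestamps.map((t) => Math.min(t, chunkLen) + timeOffset);
}

// Assumed scenario: 40 s of audio, so the final chunk is only 10 s long.
const audioDuration = 40;
const timeOffset = 30;
const chunkLen = audioDuration - timeOffset; // 10 s

// The last raw value is the largest representable timestamp (about 29.98 s).
const raw = [0.5, 9.8, 1499 * TIME_PRECISION];

// Without the clamp, the last timestamp lands far past the end of the audio.
const unclamped = raw.map((t) => t + timeOffset);

// With the clamp, it is pinned to the end of the audio instead.
const clamped = clampAndOffset(raw, chunkLen, timeOffset);

console.log({ unclamped, clamped });
```

In this sketch the in-range timestamps pass through unchanged and only the out-of-range one is pinned to the chunk boundary, which mirrors the stated goal of leaving existing behavior alone wherever the model output is already valid.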