@neonwatty

Summary

  • Fixes #1357: whisper-large-v3-turbo_timestamped has broken timestamps
  • Clamps word-level timestamps to the actual chunk_len so they can never exceed the audio duration
  • Previously, when the model output timestamps near the 30s boundary (e.g., 29.98s) for a shorter final chunk, the resulting word timestamps incorrectly exceeded the audio duration

Root Cause

When using chunk_length_s=30 with word-level timestamps (return_timestamps: 'word'), the Whisper model can output timestamps up to ~29.98s, the maximum representable value given time_precision = 30/1500 = 0.02s. For a final chunk shorter than 30s, these near-boundary timestamps are still added to the accumulated time_offset, so the final word timestamps can exceed the actual audio duration.
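For context, here is a minimal sketch of how the bug surfaces from the user side. The package name assumes transformers.js v3, and the model id is taken from the issue title; both are assumptions, not part of this PR:

```js
import { pipeline } from '@huggingface/transformers';

// 65s of silence at 16kHz, enough to force a short (<30s) final chunk.
const audio = new Float32Array(16_000 * 65);

const transcriber = await pipeline(
  'automatic-speech-recognition',
  'onnx-community/whisper-large-v3-turbo_timestamped', // model id assumed from the issue title
);

const output = await transcriber(audio, {
  chunk_length_s: 30,        // process audio in 30s chunks
  return_timestamps: 'word', // request word-level timestamps
});

// Before this fix, the last words could report timestamps past 65s,
// because a near-boundary token time (~29.98s) was added to time_offset.
console.log(output.chunks.at(-1));
```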

Solution

Track the actual chunk_len from the stride information and clamp raw token timestamps to it before adding time_offset. This ensures word-level timestamps never exceed the audio duration while preserving the existing behavior for segment-level timestamps.
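A minimal sketch of the clamping idea; the identifiers below are illustrative, not the actual names used in the patch:

```js
const TIME_PRECISION = 30 / 1500; // 0.02s per timestamp token

// Hypothetical helper: convert a raw token timestamp within a chunk to an
// absolute time. Clamping to chunkLen (the chunk's real audio length,
// tracked from the stride information) keeps words in a short final chunk
// from landing past the end of the audio.
function toAbsoluteTime(rawTimestamp, chunkLen, timeOffset) {
  return Math.min(rawTimestamp, chunkLen) + timeOffset;
}

// Maximum representable raw timestamp: 1499 * 0.02 = 29.98s.
const maxRaw = 1499 * TIME_PRECISION;
console.log(toAbsoluteTime(maxRaw, 15, 50)); // 65, not 79.98
```

Segment-level timestamps bypass this clamp, so their behavior is unchanged.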

Test plan

  • Added a unit test that simulates the bug case (65s audio with a 15s final chunk); a numeric sketch of this case follows the list
  • Verified fix with actual whisper-large-v3-turbo model on HuggingFace GPU
  • All existing tests pass
  • Build succeeds
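For reference, a hedged numeric sketch of the bug case from the first bullet; the time_offset value is illustrative, and the real unit test exercises the decoding path itself:

```js
// 65s of audio where the final chunk holds only 15s of real audio.
const chunkLen = 15;        // seconds of real audio in the final chunk
const timeOffset = 50;      // assumed accumulated offset of earlier chunks
const rawTimestamp = 29.98; // near-boundary token timestamp from the model

const before = timeOffset + rawTimestamp;                    // 79.98s > 65s
const after = timeOffset + Math.min(rawTimestamp, chunkLen); // 65s

console.assert(after <= 65, 'word timestamp must not exceed audio duration');
```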

…uggingface#1357)

Clamp word-level timestamps to the actual chunk_len to prevent timestamps
from exceeding audio duration when the model outputs timestamps near the
30s boundary for shorter final chunks.
