
Fix duplicated subtitle issue--core deduplication logic and screen-display part#1448

Open
TransZAllen wants to merge 1 commit into TeamNewPipe:dev from TransZAllen:duplicated_subtitle_8_on_newest_dev

Conversation

@TransZAllen

  • [ √ ] I carefully read the contribution guidelines and agree to them.
  • [ √ ] I have tested the API against NewPipe.
  • [ √ ] I agree to create a pull request for NewPipe as soon as possible to make it compatible with the changed API.

@TransZAllen
Author

TransZAllen commented Jan 30, 2026

Related issue

Scope of changes

This PR involves two repositories:

  • NewPipeExtractor (main changes)
    Implements subtitle deduplication logic in
    SubtitleDeduplicator.

  • NewPipe (supporting changes)
    Initializes cache/subtitle_cache directory and ensures locally cached
    subtitle files can still be manually downloaded as '*.srt'.

Reproduction case

Android device, duplicated subtitles visible during playback

YouTube video used for testing:
https://www.youtube.com/watch?v=b7vmW_5HSpE

Subtitle cache location

Cached subtitle files (*.ttml) are stored at:

/storage/emulated/0/Android/data/<package_name>/cache/subtitle_cache

The directory name corresponds to subCacheDir
defined in SubtitleDeduplicator.

Cache file naming

Cached subtitle filenames are intentionally descriptive,
so their meaning can be understood without reading the code
(e.g. source, format, origin, subtitle state). For example:
cache/subtitle_cache $ ls -l
total 48
-rw-rw---- 1 u0_a579 sdcard_rw 1214 2026-01-29 17:44 b7vmW_5HSpE--en--auto_generated--original.ttml
-rw-rw---- 1 u0_a579 sdcard_rw 42426 2026-01-29 17:44 b7vmW_5HSpE--en-GB--human_provided--deduplicated.ttml

Cache lifecycle & storage impact

Do cached subtitle files need to be deleted?

No.

SubtitleDeduplicator does not delete cached subtitle files,
regardless of whether duplication is detected.

Why keep cached subtitles?

  1. If a remote subtitle download fails, a previously cached version
    can be reused.

  2. In practice, download failures are rare:

    • User-uploaded and auto-generated YouTube subtitles were
      consistently downloadable in tests.
    • Auto-translated subtitles showed a higher failure rate,
      but that feature is not merged into the dev branch and is out of scope here.

Storage considerations

  • Subtitle files are small in size.
  • Even with many cached subtitles, storage usage grows slowly.
  • In the worst case, Android will notify users of low storage
    and suggest clearing app cache, which includes NewPipe’s subtitle cache.

Unit tests

extractor/src/test/java/org/schabi/newpipe/extractor/utils/SubtitleDeduplicatorTest.java

Tests focus on the core deduplication logic:
detecting duplicated adjacent subtitle segments and verifying
the resulting output.

Why SubtitleDeduplicator operates on raw TTML text

SubtitleDeduplicator intentionally operates on raw TTML text, before XML entity decoding.
Deduplication is limited to lightweight, string-level normalization, so the subtitle text
is not parsed a second time on top of the screen-display and manual SRT download layers.

This design is intended to be practical and simple. At this stage, the goal is only to detect obviously
duplicated subtitle segments from the same TTML source, not to fully interpret
or normalize subtitle semantics.
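To make the string-level approach concrete, here is a minimal, self-contained sketch of the idea under discussion: treat each `<p>...</p>` block as an opaque string, derive a normalized key from its timestamps and visible text (style spans stripped, whitespace collapsed), and drop blocks whose key has already been seen. The class and method names are illustrative only, not the actual SubtitleDeduplicator API.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of string-level TTML deduplication: no XML parsing,
// no entity decoding, only regex-based normalization of each <p> block.
public final class TtmlDedupSketch {
    private static final Pattern P_BLOCK =
            Pattern.compile("<p\\b[^>]*>.*?</p>", Pattern.DOTALL);
    private static final Pattern TIMESTAMPS =
            Pattern.compile("begin=\"([^\"]+)\"[^>]*end=\"([^\"]+)\"");
    private static final Pattern SPAN_TAGS = Pattern.compile("</?span[^>]*>");

    // Key = begin|end|normalized visible text. <br> tags are preserved.
    static String key(final String pBlock) {
        final Matcher t = TIMESTAMPS.matcher(pBlock);
        final String times = t.find() ? t.group(1) + "|" + t.group(2) : "";
        final String text = SPAN_TAGS.matcher(pBlock).replaceAll("")
                .replaceAll("<p\\b[^>]*>|</p>", "")
                .replaceAll("\\s+", " ")
                .trim();
        return times + "|" + text;
    }

    public static String deduplicate(final String ttml) {
        final Set<String> seen = new HashSet<>();
        final StringBuilder out = new StringBuilder();
        final Matcher m = P_BLOCK.matcher(ttml);
        int last = 0;
        while (m.find()) {
            out.append(ttml, last, m.start());
            if (seen.add(m.group() == null ? "" : key(m.group()))) {
                out.append(m.group()); // first occurrence: keep
            }
            last = m.end();
        }
        out.append(ttml.substring(last));
        return out.toString();
    }
}
```

Because the key ignores style attributes, two `<p>` blocks with identical timestamps and the same visible text are collapsed even when their `<span style="...">` tags differ.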

Difference from SrtFromTtmlWriter

These two components serve different purposes:

  • SrtFromTtmlWriter

    • Performs full TTML XML parsing
    • Decodes entities
    • Resolves tags (<span>, <br>, etc.)
    • Generates SRT for manual download
  • SubtitleDeduplicator

    • A lightweight pre-processing utility
    • Does not parse XML
    • Does not decode entities
    • Performs minimal string-level normalization only

Note on subtitle caching

SubtitleDeduplicator always fetches remote subtitle content to ensure the latest version is used when detecting duplicated entries.

During playback, however, ExoPlayer may serve subtitle data from its internal cache (cache/exoplayer) if a cached version is available. As a result, there is a potential inconsistency where the subtitle content displayed to the user may not immediately reflect a recently updated remote subtitle.

This is intentional and won't change for now:

  • YouTube subtitles don’t update often, so while users might encounter outdated cached subtitles, the chance is really low.
  • ExoPlayer’s caching is part of the player, and I’m not sure how much code changing it would require, but that’s outside the scope of this PR.
  • This PR is all about fixing subtitle duplication, not touching the caching setup between the extractor and player.

@TransZAllen
Author

TransZAllen commented Jan 30, 2026

The fix has been tested with a YouTube video link: https://www.youtube.com/watch?v=b7vmW_5HSpE

Before the fix, the subtitle is shown as follows:

After applying the fix, the subtitle is displayed as follows:

Member

@AudricV left a comment


I think we don't want NewPipe Extractor to download files directly, so your approach must be changed, especially as you do not delete files. Also, I would avoid downloading each subtitle to avoid reaching rate limits.

The extractor is not an Android library, therefore Android specific comments should be removed.

If YouTube provides incorrect subtitles, it should not be up to the extractor to fix them, in my opinion. To me, it makes more sense to fix this with a custom ExoPlayer component on the app side.

@AudricV added labels: bug (Issue is related to a bug), youtube (service, https://www.youtube.com/) on Jan 30, 2026
@TransZAllen
Author

@AudricV

Thanks for the feedback, it’s helpful for me to better understand the intended boundaries of NewPipeExtractor.

I’m preparing some follow-up comments to explain these commits, especially around subtitle downloading. I’m also taking some time to think about whether this design makes sense.

I’ll add more comments soon.

@TransZAllen
Author

TransZAllen commented Feb 1, 2026

About "we don't want NewPipe Extractor to download files directly":

@AudricV

Just to make sure I understand correctly: currently, the extractor only provides
subtitle URLs, and the actual downloading is done later on the app side
(either by ExoPlayer or by the manual subtitle download feature), right?

My original idea was to fix duplicated subtitles as early as possible and
in a centralized place — at the source where subtitle URLs are produced.
That’s why I chose getSubtitles(final MediaFormat format) in
YoutubeStreamExtractor.java. If the source is deduplicated, all later code
would receive clean subtitles.

However, I now realize that my changes effectively moved the subtitle downloading
responsibility. Previously, subtitle URLs were passed through the extractor and
only downloaded on the app side. With this change, subtitles are downloaded
inside SubtitleDeduplicator instead.

At first, I thought this was acceptable since subtitles are eventually downloaded
anyway, and this could even reduce network requests by avoiding separate downloads
for playback (via ExoPlayer) and for manual SRT downloads. But after tracing the
code path more, I see that from getSubtitles() in NewPipeExtractor to
VideoPlaybackResolver.resolve() on the app side, subtitles are still handled
purely as URLs, without any download happening in the extractor.

So, performing file downloads inside NewPipeExtractor crosses its intended boundary, right?

@sonarqubecloud

sonarqubecloud bot commented Feb 5, 2026

@TransZAllen
Author

TransZAllen commented Mar 22, 2026

Hi,

Just a small update.

Today I found a YouTube video where the subtitle still shows duplicated lines, so the current deduplication logic does not cover this case yet. I think I found the reason, and the logic probably needs a bit more adjustment.

===========

On the app side, I have already moved the deduplication logic (with network download and file I/O) from the extractor, and after some trial and error, I now have a better approach.

I will update the code soon and leave a message here.

Thanks.

… subtitle URL parameters.

- Add `V`, `LANG`, `TLANG` constants to `YoutubeParsingHelper`
- Implement `extractVideoId()`, `extractLanguageCode()`, `extractTranslationCode()`
- Add `extractQueryParam()` utility in `Utils.java`
@TransZAllen force-pushed the duplicated_subtitle_8_on_newest_dev branch from ec00c79 to e944f81 on April 1, 2026 at 14:57
@sonarqubecloud

sonarqubecloud bot commented Apr 1, 2026

@TransZAllen
Author

TransZAllen commented Apr 1, 2026

Hi,

Just a quick note.

The previous commits from the first review were overwritten when updating this PR.
To avoid losing that context, I have kept a backup here:
https://github.com/TransZAllen/NewPipeExtractor/tree/duplicated_subtitle_8_on_newest_dev-REVIEW_1
https://github.com/TransZAllen/NewPipe/tree/duplicated_subtitle_5_on_newest_dev-REVIEW_1

@TransZAllen
Author

Here is the updated description for this round of changes:

Background

As suggested in the previous review, the subtitle deduplication logic has been moved from the extractor side to the app side.
During this refactoring, several approaches were explored before arriving at the current solution.

Attempts and Design Decisions

1. Deduplication inside StreamInfoTag

StreamInfo (which contains the original subtitle data from the extractor) is defined inside StreamInfoTag:

private final StreamInfo streamInfo;

Initially, deduplication was implemented inside the constructor of StreamInfoTag (player/mediaitem/StreamInfoTag.java).

Advantages:

1) Centralized entry point for subtitle processing
2) Affects both on-screen subtitles and manual subtitle downloads (SRT)

Issues:

StreamInfo originates from the extractor, but is modified in the app layer
→ This effectively mutates upstream data (data pollution across layers)

Observed during testing:
a) The StreamInfoTag constructor is called twice
b) SubtitleDeduplicator is executed 4 times, so remote subtitle downloading is also triggered 4 times (root cause not fully investigated)

👉 This approach was abandoned.

2. Introducing AppStreamInfo

A second attempt introduced a new domain class AppStreamInfo to hold processed subtitle data in the app layer.

Goals:

  • Keep processed (deduplicated) data isolated from extractor data
  • Provide a unified data source within the app

Current status:

  • Used in on-screen subtitle rendering
  • NOT used in manual SRT download
  • Deduplication occurs independently in on-screen subtitle rendering and in the manual SRT download.

Reason:
The download pipeline cannot easily share AppStreamInfo with the UI layer.
Even if a separate AppStreamInfo instance is created for download, the following issue remains: the download flow has multiple entry points:

  • DownloadInitializer.java
  • DownloadRunnable.java
  • DownloadMissionRecover.java

Covering all these paths would require broad and invasive changes.

👉 Therefore, AppStreamInfo was not used in the download part.

3. Final Approach for Download

The deduplication logic is now applied inside TtmlConverter.process().
This method internally calls SrtFromTtmlWriter.build(...) to convert downloaded TTML subtitles into SRT subtitles.

Current solution:
Deduplicate the TTML content before calling SrtFromTtmlWriter.build(...).
In other words: first produce deduplicated TTML, then convert it to SRT.

Advantages:
Deduplication is applied at the end of the remote subtitle download, which avoids modifying multiple download entry points.

@TransZAllen
Author

Additional Observation / Question

While working on this, I found that ChunkFileInputStream.read() returns 0 at EOF instead of -1.
This was confirmed using debug logs in the new method TtmlConverter.readSharpStreamToString():

NewPipe$ git diff app/src/main/java/us/shandian/giga/postprocessing/TtmlConverter.java
diff --git a/app/src/main/java/us/shandian/giga/postprocessing/TtmlConverter.java b/app/src/main/java/us/shandian/giga/postprocessing/TtmlConverter.java
index 935eacf59..0f229944a 100644
--- a/app/src/main/java/us/shandian/giga/postprocessing/TtmlConverter.java
+++ b/app/src/main/java/us/shandian/giga/postprocessing/TtmlConverter.java
@@ -57,7 +57,7 @@ class TtmlConverter extends Postprocessing {
     }
 
     private static String readSharpStreamToString(final SharpStream stream) throws IOException {
-
+        Log.d(TAG, "SharpStream implementation: " + stream.getClass().getName());
         final ByteArrayOutputStream out = new ByteArrayOutputStream();
         final byte[] buffer = new byte[8192];
 
@@ -77,8 +77,10 @@ class TtmlConverter extends Postprocessing {
         //   can safely be switched back to `read != -1`. Keeping `> 0` is
         //   also safe and will continue to work.
         while ((read = stream.read(buffer)) > 0) {
+            Log.d(TAG, "read bytes: " + read);
             out.write(buffer, 0, read);
         }
+        Log.d(TAG, "read loop finished with value: " + read);
 
         final String result = out.toString(StandardCharsets.UTF_8);
 

However, according to standard Java InputStream behavior:

read() should return -1 when reaching end-of-file (EOF)

This is also noted in the comment inside TtmlConverter.readSharpStreamToString().


Investigation

Search results in the NewPipe repository:

grep -rn -i "ChunkFileInputStream" . --include=*.{java,kt}

Result shows:

ChunkFileInputStream is mainly used in:

Postprocessing.java
TtmlConverter.java

The read() behavior is explicitly handled in TtmlConverter.java ('read > 0' instead of '!= -1')

Questions

Is returning 0 at EOF an intentional design in NewPipe?
Would it be safe to change it to the standard Java behavior (-1)?
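For reference, here is a minimal, self-contained read loop written the same way as the `read > 0` guard in TtmlConverter: it stops on the standard -1 EOF value and also on a nonstandard 0, at the cost of also stopping on a legitimate zero-byte read (which `InputStream.read(byte[])` never produces for a non-empty buffer, per its contract). The class name is illustrative.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Demonstrates the defensive read loop: `> 0` terminates on both -1
// (standard Java EOF) and 0 (the value observed from ChunkFileInputStream).
public final class ReadLoopDemo {
    public static String readAll(final InputStream in) throws IOException {
        final StringBuilder sb = new StringBuilder();
        final byte[] buf = new byte[8192];
        int read;
        while ((read = in.read(buf)) > 0) {
            sb.append(new String(buf, 0, read, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }
}
```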

@TransZAllen
Author

Observed Cases of Duplicate Subtitle Entries in YouTube TTML Files

The subtitle format involved is TTML.

A subtitle entry (paragraph) consists of three parts: a begin timestamp, an end timestamp, and text content. TTML supports visual styling of subtitle text (e.g., colors, font sizes) via <span style="..."> tags. Since NewPipe (using the ExoPlayer module) does not support styled subtitle rendering, two entries are considered duplicates if their timestamps are identical and their visible text content is the same after stripping style tags.

Text comparison is strict: it compares normalized plain text only and requires exact character equality (no semantic analysis).

All cases below assume the two entries have identical begin and end timestamps.


Case 1: Two subtitle entries are completely identical

Both the text content and all style attributes are exactly the same.

Example from https://www.youtube.com/watch?v=7w3jBGX7UcY (en-GB language):

<p begin="00:00:01.352" end="00:00:01.852" style="s2">
  <span style="s3">​</span>
  <span style="s4">​ ​Land </span>
  <span style="s5">mines​ ​</span>
  <span style="s3"><br/></span>
  <span style="s5">​ ​I've left them everywhere​ ​</span>
  <span style="s3">​</span>
</p>
<p begin="00:00:01.352" end="00:00:01.852" style="s2">
  <span style="s3">​</span>
  <span style="s4">​ ​Land </span>
  <span style="s5">mines​ ​</span>
  <span style="s3"><br/></span>
  <span style="s5">​ ​I've left them everywhere​ ​</span>
  <span style="s3">​</span>
</p>

Related issues and videos:

Note: As of around 2026-03-03, the subtitle languages that previously had duplication issues for all of the above videos no longer appear in the captions list. The historical analysis in the linked issues documents the original duplication behavior.


Case 2: Same visible text, different style attributes

The text content is identical after stripping <span style="..."> tags, but the style attributes differ between entries.

Simple example — the same word styled differently across two entries:

<p begin="00:00:11.452" end="00:00:14.388" style="s2">
  <span style="s3">Magic</span>
</p>
<p begin="00:00:11.452" end="00:00:14.388" style="s2">
  <span style="s11">Magic</span>
</p>

Complex example — the same sentence, but one entry applies a single style to the whole sentence, while the other applies a different style to each individual word (or even each character):

<p begin="00:00:05.000" end="00:00:08.000" style="s2">
  <span style="s3">Hello world today</span>
</p>
<p begin="00:00:05.000" end="00:00:08.000" style="s2">
  <span style="s4">Hello </span>
  <span style="s5">world </span>
  <span style="s6">today</span>
</p>

After stripping all <span> tags, both entries reduce to the same text and are treated as duplicates.

Related videos:

  • https://www.youtube.com/watch?v=nPF7lit7Z00 — en, hi-Latn, zh languages
  • https://www.youtube.com/watch?v=7w3jBGX7UcY — zh-Hans language

Case 3: Same text after stripping styles, differing only in whitespace

After removing style tags, the visible characters are identical, but the entries differ in leading/trailing spaces or runs of consecutive spaces.

Example (constructed from unit tests):

<p begin="00:00:01.000" end="00:00:02.000">  Hello world  </p>
<p begin="00:00:01.000" end="00:00:02.000">Hello      world</p>

SubtitleDeduplicator handles this by collapsing all runs of whitespace into a single space and trimming leading/trailing whitespace before comparison.

In addition to whitespace, the following invisible characters are also stripped before comparison, as they can cause two visually identical entries to appear different at the byte level:

  • Non-breaking space: U+00A0 (normalized to a regular space)

Note:

TTML may contain <br> tags to represent line breaks.

In the current implementation, <br> is not removed or normalized. Because:

  • it is part of the TTML structure, not regular whitespace
  • in observed cases, identical subtitle entries do not mix <br> with other forms of special characters (e.g., plain spaces), for example:
  <p>Hello<br/>world</p>
  <p>Hello world</p>

Therefore, <br> is preserved in text comparison.
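A minimal sketch of the Case 3 normalization described above, assuming the implementation uses plain string operations (the class and method names are illustrative): NBSP is mapped to a regular space, whitespace runs are collapsed, leading/trailing whitespace is trimmed, and `<br>` tags are left untouched.

```java
// Hypothetical sketch of whitespace normalization before comparison.
public final class WhitespaceNorm {
    public static String normalize(final String text) {
        return text.replace('\u00A0', ' ')  // NBSP -> regular space
                .replaceAll("\\s+", " ")    // collapse runs of whitespace
                .trim();                    // drop leading/trailing spaces
        // <br> tags are intentionally not touched by this step.
    }
}
```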


Case 4: Visually identical entries that are not detected as duplicates

If two entries look identical to the human eye but are not flagged as duplicates by the deduplication logic, the most likely cause is invisible Unicode characters embedded in the text that are not covered by the current normalization rules.

The following invisible characters are currently normalized:

  • Zero-width spaces and related characters: U+200B to U+200D
  • Directionality control characters: U+200E, U+200F
  • Directionality formatting characters: U+202A to U+202E
  • Byte Order Mark (BOM): U+FEFF
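The list above can be expressed as a single character class; the sketch below removes exactly those code points before comparison (the class name is illustrative, not the actual API):

```java
// Hypothetical sketch: strip the invisible Unicode characters listed above
// (zero-width U+200B..U+200D, directionality U+200E/U+200F and
// U+202A..U+202E, and the BOM U+FEFF) before comparing entries.
public final class InvisibleCharNorm {
    public static String strip(final String text) {
        return text.replaceAll(
                "[\\u200B-\\u200D\\u200E\\u200F\\u202A-\\u202E\\uFEFF]", "");
    }
}
```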

Case 5: Duplicate entries that are not adjacent

The two duplicate entries are not consecutive — at least one other subtitle entry appears between them.

Example from https://www.youtube.com/watch?v=7w3jBGX7UcY (zh-Hans subtitles):

<p begin="00:00:01.642" end="00:00:03.244" style="s2">
  <span style="s3">原作​</span>​<span style="s3"> おジャ魔女どれみ​</span>
</p>
<p begin="00:00:01.642" end="00:00:03.244" style="s2">
  <span style="s4">原唱​</span>​<span style="s4"> MAHO堂​</span>...
</p>
<p begin="00:00:01.642" end="00:00:03.244" style="s2">
  <span style="s3">原作 ​</span>​<span style="s7">おジャ魔女どれみ​</span>
</p>

The first and third entries are duplicates (same text after normalization), separated by a non-duplicate entry. SubtitleDeduplicator handles this correctly because it uses a hash set to track all previously seen entries, not just the immediately preceding one.
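The seen-set behavior can be illustrated with a few lines: because every previously seen key is retained, a duplicate is caught even when a non-duplicate entry sits between the two occurrences. This is a toy demonstration over precomputed keys, not the actual SubtitleDeduplicator code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Keeps the first occurrence of each key; later occurrences are dropped
// regardless of how far apart they are.
public final class SeenSetDemo {
    public static List<String> dedupe(final List<String> keys) {
        final Set<String> seen = new HashSet<>();
        final List<String> kept = new ArrayList<>();
        for (final String k : keys) {
            if (seen.add(k)) { // add() returns false if already present
                kept.add(k);
            }
        }
        return kept;
    }
}
```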
