
Fix duplicated subtitle issue--core deduplication logic and screen-display part#1448

Open
TransZAllen wants to merge 1 commit into TeamNewPipe:dev from TransZAllen:duplicated_subtitle_8_on_newest_dev

Conversation

@TransZAllen

  • [ √ ] I carefully read the contribution guidelines and agree to them.
  • [ √ ] I have tested the API against NewPipe.
  • [ √ ] I agree to create a pull request for NewPipe as soon as possible to make it compatible with the changed API.

@TransZAllen
Author

TransZAllen commented Jan 30, 2026

Related issue

Scope of changes

This PR involves two repositories:

  • NewPipeExtractor (main changes)
    Implements subtitle deduplication logic in
    SubtitleDeduplicator.

  • NewPipe (supporting changes)
    Initializes cache/subtitle_cache directory and ensures locally cached
    subtitle files can still be manually downloaded as '*.srt'.

Reproduction case

Android device, duplicated subtitles visible during playback

YouTube video used for testing:
https://www.youtube.com/watch?v=b7vmW_5HSpE

Subtitle cache location

Cached subtitle files (*.ttml) are stored at:

/storage/emulated/0/Android/data/<package_name>/cache/subtitle_cache

The directory name corresponds to subCacheDir
defined in SubtitleDeduplicator.

Cache file naming

Cached subtitle filenames are intentionally descriptive,
so their meaning can be understood without reading the code
(e.g. source, format, origin, subtitle state). For example:
cache/subtitle_cache $ ls -l
total 48
-rw-rw---- 1 u0_a579 sdcard_rw 1214 2026-01-29 17:44 b7vmW_5HSpE--en--auto_generated--original.ttml
-rw-rw---- 1 u0_a579 sdcard_rw 42426 2026-01-29 17:44 b7vmW_5HSpE--en-GB--human_provided--deduplicated.ttml

Cache lifecycle & storage impact

Do cached subtitle files need to be deleted?

No.

SubtitleDeduplicator does not delete cached subtitle files,
regardless of whether duplication is detected.

Why keep cached subtitles?

  1. If a remote subtitle download fails, a previously cached version
    can be reused.

  2. In practice, download failures are rare:

    • User-uploaded and auto-generated YouTube subtitles were
      consistently downloadable in tests.
    • Auto-translated subtitles showed a higher failure rate,
      but that feature is not merged into the dev branch and is out of scope here.

Storage considerations

  • Subtitle files are small in size.
  • Even with many cached subtitles, storage usage grows slowly.
  • In the worst case, Android will notify users of low storage
    and suggest clearing app cache, which includes NewPipe’s subtitle cache.

Unit tests

extractor/src/test/java/org/schabi/newpipe/extractor/utils/SubtitleDeduplicatorTest.java

Tests focus on the core deduplication logic:
detecting duplicated adjacent subtitle segments and verifying
the resulting output.

Why SubtitleDeduplicator operates on raw TTML text

SubtitleDeduplicator intentionally operates on raw TTML text, before XML entity decoding.
Deduplication is limited to lightweight, string-level normalization, so the subtitle text
is not parsed a second time on top of the screen-display and manual SRT download layers.

This design is intended to be practical and simple. At this stage, the goal is only to detect obviously
duplicated subtitle segments from the same TTML source, not to fully interpret
or normalize subtitle semantics.
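To make the string-level approach concrete, here is a minimal, self-contained sketch of the idea under discussion: treat each `<p>...</p>` block as an opaque string, derive a normalized key from its timestamps and visible text (style spans stripped, whitespace collapsed), and drop blocks whose key has already been seen. The class and method names are illustrative only, not the actual SubtitleDeduplicator API.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of string-level TTML deduplication: no XML parsing,
// no entity decoding, only regex-based normalization of each <p> block.
public final class TtmlDedupSketch {
    private static final Pattern P_BLOCK =
            Pattern.compile("<p\\b[^>]*>.*?</p>", Pattern.DOTALL);
    private static final Pattern TIMESTAMPS =
            Pattern.compile("begin=\"([^\"]+)\"[^>]*end=\"([^\"]+)\"");
    private static final Pattern SPAN_TAGS = Pattern.compile("</?span[^>]*>");

    // Key = begin|end|normalized visible text. <br> tags are preserved.
    static String key(final String pBlock) {
        final Matcher t = TIMESTAMPS.matcher(pBlock);
        final String times = t.find() ? t.group(1) + "|" + t.group(2) : "";
        final String text = SPAN_TAGS.matcher(pBlock).replaceAll("")
                .replaceAll("<p\\b[^>]*>|</p>", "")
                .replaceAll("\\s+", " ")
                .trim();
        return times + "|" + text;
    }

    public static String deduplicate(final String ttml) {
        final Set<String> seen = new HashSet<>();
        final StringBuilder out = new StringBuilder();
        final Matcher m = P_BLOCK.matcher(ttml);
        int last = 0;
        while (m.find()) {
            out.append(ttml, last, m.start());
            if (seen.add(m.group() == null ? "" : key(m.group()))) {
                out.append(m.group()); // first occurrence: keep
            }
            last = m.end();
        }
        out.append(ttml.substring(last));
        return out.toString();
    }
}
```

Because the key ignores style attributes, two `<p>` blocks with identical timestamps and the same visible text are collapsed even when their `<span style="...">` tags differ.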

Difference from SrtFromTtmlWriter

These two components serve different purposes:

  • SrtFromTtmlWriter

    • Performs full TTML XML parsing
    • Decodes entities
    • Resolves tags (<span>, <br>, etc.)
    • Generates SRT for manual download
  • SubtitleDeduplicator

    • A lightweight pre-processing utility
    • Does not parse XML
    • Does not decode entities
    • Performs minimal string-level normalization only

Note on subtitle caching

SubtitleDeduplicator always fetches remote subtitle content to ensure the latest version is used when detecting duplicated entries.

During playback, however, ExoPlayer may serve subtitle data from its internal cache (cache/exoplayer) if a cached version is available. As a result, there is a potential inconsistency where the subtitle content displayed to the user may not immediately reflect a recently updated remote subtitle.

This is intentional and won't change for now:

  • YouTube subtitles don’t update often, so while users might encounter outdated cached subtitles, the chance is really low.
  • ExoPlayer’s caching is part of the player, and I’m not sure how much code changing it would require, but that’s outside the scope of this PR.
  • This PR is all about fixing subtitle duplication, not touching the caching setup between the extractor and player.

@TransZAllen
Author

TransZAllen commented Jan 30, 2026

The fix has been tested with a YouTube video link: https://www.youtube.com/watch?v=b7vmW_5HSpE

Before the fix, the subtitle is shown as follows:

After applying the fix, the subtitle is displayed as follows:

Member

@AudricV left a comment


I think we don't want NewPipe Extractor to download files directly, so your approach must be changed, especially as you do not delete files. Also, I would avoid downloading each subtitle to avoid reaching rate limits.

The extractor is not an Android library, therefore Android specific comments should be removed.

If YouTube provides incorrect subtitles, it should not be up to the extractor to fix them, in my opinion. To me, it makes more sense to fix this with a custom ExoPlayer component on the app side.

@AudricV added labels: bug (Issue is related to a bug), youtube (service, https://www.youtube.com/) on Jan 30, 2026
@TransZAllen
Author

@AudricV

Thanks for the feedback, it’s helpful for me to better understand the intended boundaries of NewPipeExtractor.

I’m preparing some follow-up comments to explain these commits, especially around subtitle downloading. I’m also taking some time to think about whether this design makes sense.

I’ll add more comments soon.

@TransZAllen
Author

TransZAllen commented Feb 1, 2026

About "we don't want NewPipe Extractor to download files directly":

@AudricV

Just to make sure I understand correctly: currently, the extractor only provides
subtitle URLs, and the actual downloading is done later on the app side
(either by ExoPlayer or by the manual subtitle download feature), right?

My original idea was to fix duplicated subtitles as early as possible and
in a centralized place — at the source where subtitle URLs are produced.
That’s why I chose getSubtitles(final MediaFormat format) in
YoutubeStreamExtractor.java. If the source is deduplicated, all later code
would receive clean subtitles.

However, I now realize that my changes effectively moved the subtitle downloading
responsibility. Previously, subtitle URLs were passed through the extractor and
only downloaded on the app side. With this change, subtitles are downloaded
inside SubtitleDeduplicator instead.

At first, I thought this was acceptable since subtitles are eventually downloaded
anyway, and this could even reduce network requests by avoiding separate downloads
for playback (via ExoPlayer) and for manual SRT downloads. But after tracing the
code path more, I see that from getSubtitles() in NewPipeExtractor to
VideoPlaybackResolver.resolve() on the app side, subtitles are still handled
purely as URLs, without any download happening in the extractor.

So, performing file downloads inside NewPipeExtractor crosses its intended boundary, right?

@sonarqubecloud

sonarqubecloud bot commented Feb 5, 2026

@TransZAllen
Author

TransZAllen commented Mar 22, 2026

Hi,

Just a small update.

Today I found a YouTube video where the subtitle still shows duplicated lines, so the current deduplication logic does not cover this case yet. I think I found the reason, and the logic probably needs a bit more adjustment.

===========

On the app side, I have already moved the deduplication logic (with network download and file I/O) from the extractor, and after some trial and error, I now have a better approach.

I will update the code soon and leave a message here.

Thanks.

… subtitle URL parameters.

- Add `V`, `LANG`, `TLANG` constants to `YoutubeParsingHelper`
- Implement `extractVideoId()`, `extractLanguageCode()`, `extractTranslationCode()`
- Add `extractQueryParam()` utility in `Utils.java`
@TransZAllen force-pushed the duplicated_subtitle_8_on_newest_dev branch from ec00c79 to e944f81 on April 1, 2026 at 14:57
@sonarqubecloud

sonarqubecloud bot commented Apr 1, 2026

@TransZAllen
Author

TransZAllen commented Apr 1, 2026

Hi,

Just a quick note.

The previous commits from the first review were overwritten when updating this PR.
To avoid losing that context, I have kept a backup here:
https://github.com/TransZAllen/NewPipeExtractor/tree/duplicated_subtitle_8_on_newest_dev-REVIEW_1
https://github.com/TransZAllen/NewPipe/tree/duplicated_subtitle_5_on_newest_dev-REVIEW_1

@TransZAllen
Author

Here is the updated description for this round of changes:

Background

As suggested in the previous review, the subtitle deduplication logic has been moved from the extractor side to the app side.
During this refactoring, several approaches were explored before arriving at the current solution.

Attempts and Design Decisions

1. Deduplication inside StreamInfoTag

StreamInfo (which contains the original subtitle data from the extractor) is defined inside StreamInfoTag:

private final StreamInfo streamInfo;

Initially, deduplication was implemented inside the constructor of StreamInfoTag (player/mediaitem/StreamInfoTag.java).

Advantages:

1) Centralized entry point for subtitle processing
2) Affects both on-screen subtitles and manual subtitle downloads (SRT)

Issues:

StreamInfo originates from the extractor, but is modified in the app layer
→ This effectively mutates upstream data (data pollution across layers)

Observed during testing:
a) The StreamInfoTag constructor is called twice
b) SubtitleDeduplicator is executed 4 times, so remote subtitle downloading is also triggered 4 times (root cause not fully investigated)

👉 This approach was abandoned.

2. Introducing AppStreamInfo

A second attempt introduced a new domain class AppStreamInfo to hold processed subtitle data in the app layer.

Goals:

  • Keep processed (deduplicated) data isolated from extractor data
  • Provide a unified data source within the app

Current status:

  • Used in on-screen subtitle rendering
  • NOT used in manual SRT download
  • Deduplication occurs independently in on-screen subtitle rendering and in the manual SRT download.

Reason:
The download pipeline cannot easily share AppStreamInfo with the UI layer.
Even if a separate AppStreamInfo instance is created for download, the following issue remains: the download flow has multiple entry points:

  • DownloadInitializer.java
  • DownloadRunnable.java
  • DownloadMissionRecover.java

Covering all these paths would require broad and invasive changes.

👉 Therefore, AppStreamInfo was not used in the download part.

3. Final Approach for Download

The deduplication logic is now applied inside TtmlConverter.process().
This method internally calls SrtFromTtmlWriter.build(...) to convert downloaded TTML subtitles into SRT subtitles.

Current solution:
Deduplicate the TTML content before calling SrtFromTtmlWriter.build(...).
In other words: first produce deduplicated TTML, then convert it to SRT.

Advantages:
Deduplication is applied at the end of the remote subtitle download, which avoids modifying multiple download entry points.

@TransZAllen
Author

Additional Observation / Question

While working on this, I found that ChunkFileInputStream.read() returns 0 at EOF instead of -1.
This was confirmed using debug logs in the new method TtmlConverter.readSharpStreamToString():

NewPipe$ git diff app/src/main/java/us/shandian/giga/postprocessing/TtmlConverter.java
diff --git a/app/src/main/java/us/shandian/giga/postprocessing/TtmlConverter.java b/app/src/main/java/us/shandian/giga/postprocessing/TtmlConverter.java
index 935eacf59..0f229944a 100644
--- a/app/src/main/java/us/shandian/giga/postprocessing/TtmlConverter.java
+++ b/app/src/main/java/us/shandian/giga/postprocessing/TtmlConverter.java
@@ -57,7 +57,7 @@ class TtmlConverter extends Postprocessing {
     }
 
     private static String readSharpStreamToString(final SharpStream stream) throws IOException {
-
+        Log.d(TAG, "SharpStream implementation: " + stream.getClass().getName());
         final ByteArrayOutputStream out = new ByteArrayOutputStream();
         final byte[] buffer = new byte[8192];
 
@@ -77,8 +77,10 @@ class TtmlConverter extends Postprocessing {
         //   can safely be switched back to `read != -1`. Keeping `> 0` is
         //   also safe and will continue to work.
         while ((read = stream.read(buffer)) > 0) {
+            Log.d(TAG, "read bytes: " + read);
             out.write(buffer, 0, read);
         }
+        Log.d(TAG, "read loop finished with value: " + read);
 
         final String result = out.toString(StandardCharsets.UTF_8);
 

However, according to standard Java InputStream behavior:

read() should return -1 when reaching end-of-file (EOF)

This is also noted in the comment inside TtmlConverter.readSharpStreamToString().


Investigation

Search results in the NewPipe repository:

grep -rn -i "ChunkFileInputStream" . --include=*.{java,kt}

Result shows:

ChunkFileInputStream is mainly used in:

Postprocessing.java
TtmlConverter.java

The read() behavior is explicitly handled in TtmlConverter.java ('read > 0' instead of '!= -1')

Questions

Is returning 0 at EOF an intentional design in NewPipe?
Would it be safe to change it to the standard Java behavior (-1)?
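For reference, here is a minimal, self-contained read loop written the same way as the `read > 0` guard in TtmlConverter: it stops on the standard -1 EOF value and also on a nonstandard 0, at the cost of also stopping on a legitimate zero-byte read (which `InputStream.read(byte[])` never produces for a non-empty buffer, per its contract). The class name is illustrative.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Demonstrates the defensive read loop: `> 0` terminates on both -1
// (standard Java EOF) and 0 (the value observed from ChunkFileInputStream).
public final class ReadLoopDemo {
    public static String readAll(final InputStream in) throws IOException {
        final StringBuilder sb = new StringBuilder();
        final byte[] buf = new byte[8192];
        int read;
        while ((read = in.read(buf)) > 0) {
            sb.append(new String(buf, 0, read, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }
}
```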

@TransZAllen
Author

Observed Cases of Duplicate Subtitle Entries in YouTube TTML Files

The subtitle format involved is TTML.

A subtitle entry (paragraph) consists of three parts: a begin timestamp, an end timestamp, and text content. TTML supports visual styling of subtitle text (e.g., colors, font sizes) via <span style="..."> tags. Since NewPipe (using the ExoPlayer module) does not support styled subtitle rendering, two entries are considered duplicates if their timestamps are identical and their visible text content is the same after stripping style tags.

Text comparison is strict: it compares normalized plain text only and requires exact character equality (no semantic analysis).

All cases below assume the two entries have identical begin and end timestamps.


Case 1: Two subtitle entries are completely identical

Both the text content and all style attributes are exactly the same.

Example from https://www.youtube.com/watch?v=7w3jBGX7UcY (en-GB language):

<p begin="00:00:01.352" end="00:00:01.852" style="s2">
  <span style="s3">​</span>
  <span style="s4">​ ​Land </span>
  <span style="s5">mines​ ​</span>
  <span style="s3"><br/></span>
  <span style="s5">​ ​I've left them everywhere​ ​</span>
  <span style="s3">​</span>
</p>
<p begin="00:00:01.352" end="00:00:01.852" style="s2">
  <span style="s3">​</span>
  <span style="s4">​ ​Land </span>
  <span style="s5">mines​ ​</span>
  <span style="s3"><br/></span>
  <span style="s5">​ ​I've left them everywhere​ ​</span>
  <span style="s3">​</span>
</p>

Related issues and videos:

Note: As of around 2026-03-03, the subtitle languages that previously had duplication issues for all of the above videos no longer appear in the captions list. The historical analysis in the linked issues documents the original duplication behavior.


Case 2: Same visible text, different style attributes

The text content is identical after stripping <span style="..."> tags, but the style attributes differ between entries.

Simple example — the same word styled differently across two entries:

<p begin="00:00:11.452" end="00:00:14.388" style="s2">
  <span style="s3">Magic</span>
</p>
<p begin="00:00:11.452" end="00:00:14.388" style="s2">
  <span style="s11">Magic</span>
</p>

Complex example — the same sentence, but one entry applies a single style to the whole sentence, while the other applies a different style to each individual word (or even each character):

<p begin="00:00:05.000" end="00:00:08.000" style="s2">
  <span style="s3">Hello world today</span>
</p>
<p begin="00:00:05.000" end="00:00:08.000" style="s2">
  <span style="s4">Hello </span>
  <span style="s5">world </span>
  <span style="s6">today</span>
</p>

After stripping all <span> tags, both entries reduce to the same text and are treated as duplicates.

Related videos:

  • https://www.youtube.com/watch?v=nPF7lit7Z00 — en, hi-Latn, zh languages
  • https://www.youtube.com/watch?v=7w3jBGX7UcY — zh-Hans language

Case 3: Same text after stripping styles, differing only in whitespace

After removing style tags, the visible characters are identical, but the entries differ in leading/trailing spaces or runs of consecutive spaces.

Example (constructed from unit tests):

<p begin="00:00:01.000" end="00:00:02.000">  Hello world  </p>
<p begin="00:00:01.000" end="00:00:02.000">Hello      world</p>

SubtitleDeduplicator handles this by collapsing all runs of whitespace into a single space and trimming leading/trailing whitespace before comparison.

In addition to whitespace, the following invisible characters are also stripped before comparison, as they can cause two visually identical entries to appear different at the byte level:

  • Non-breaking space: U+00A0 (normalized to a regular space)

Note:

TTML may contain <br> tags to represent line breaks.

In the current implementation, <br> is not removed or normalized. Because:

  • it is part of the TTML structure, not regular whitespace
  • in observed cases, identical subtitle entries do not mix <br> with other forms of special characters (e.g., plain spaces), for example:
  <p>Hello<br/>world</p>
  <p>Hello world</p>

Therefore, <br> is preserved in text comparison.
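A minimal sketch of the Case 3 normalization described above, assuming the implementation uses plain string operations (the class and method names are illustrative): NBSP is mapped to a regular space, whitespace runs are collapsed, leading/trailing whitespace is trimmed, and `<br>` tags are left untouched.

```java
// Hypothetical sketch of whitespace normalization before comparison.
public final class WhitespaceNorm {
    public static String normalize(final String text) {
        return text.replace('\u00A0', ' ')  // NBSP -> regular space
                .replaceAll("\\s+", " ")    // collapse runs of whitespace
                .trim();                    // drop leading/trailing spaces
        // <br> tags are intentionally not touched by this step.
    }
}
```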


Case 4: Visually identical entries that are not detected as duplicates

If two entries look identical to the human eye but are not flagged as duplicates by the deduplication logic, the most likely cause is invisible Unicode characters embedded in the text that are not covered by the current normalization rules.

The following invisible characters are currently normalized:

  • Zero-width spaces and related characters: U+200B to U+200D
  • Directionality control characters: U+200E, U+200F
  • Directionality formatting characters: U+202A to U+202E
  • Byte Order Mark (BOM): U+FEFF
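The list above can be expressed as a single character class; the sketch below removes exactly those code points before comparison (the class name is illustrative, not the actual API):

```java
// Hypothetical sketch: strip the invisible Unicode characters listed above
// (zero-width U+200B..U+200D, directionality U+200E/U+200F and
// U+202A..U+202E, and the BOM U+FEFF) before comparing entries.
public final class InvisibleCharNorm {
    public static String strip(final String text) {
        return text.replaceAll(
                "[\\u200B-\\u200D\\u200E\\u200F\\u202A-\\u202E\\uFEFF]", "");
    }
}
```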

Case 5: Duplicate entries that are not adjacent

The two duplicate entries are not consecutive — at least one other subtitle entry appears between them.

Example from https://www.youtube.com/watch?v=7w3jBGX7UcY (zh-Hans subtitles):

<p begin="00:00:01.642" end="00:00:03.244" style="s2">
  <span style="s3">原作​</span>​<span style="s3"> おジャ魔女どれみ​</span>
</p>
<p begin="00:00:01.642" end="00:00:03.244" style="s2">
  <span style="s4">原唱​</span>​<span style="s4"> MAHO堂​</span>...
</p>
<p begin="00:00:01.642" end="00:00:03.244" style="s2">
  <span style="s3">原作 ​</span>​<span style="s7">おジャ魔女どれみ​</span>
</p>

The first and third entries are duplicates (same text after normalization), separated by a non-duplicate entry. SubtitleDeduplicator handles this correctly because it uses a hash set to track all previously seen entries, not just the immediately preceding one.
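The seen-set behavior can be illustrated with a few lines: because every previously seen key is retained, a duplicate is caught even when a non-duplicate entry sits between the two occurrences. This is a toy demonstration over precomputed keys, not the actual SubtitleDeduplicator code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Keeps the first occurrence of each key; later occurrences are dropped
// regardless of how far apart they are.
public final class SeenSetDemo {
    public static List<String> dedupe(final List<String> keys) {
        final Set<String> seen = new HashSet<>();
        final List<String> kept = new ArrayList<>();
        for (final String k : keys) {
            if (seen.add(k)) { // add() returns false if already present
                kept.add(k);
            }
        }
        return kept;
    }
}
```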
