Fix duplicated subtitle issue--core deduplication logic and screen-display part #1448
TransZAllen wants to merge 1 commit into TeamNewPipe:dev
TransZAllen
commented
Jan 30, 2026
- [ √ ] I carefully read the contribution guidelines and agree to them.
- [ √ ] I have tested the API against NewPipe.
- [ √ ] I agree to create a pull request for NewPipe as soon as possible to make it compatible with the changed API.
Related issue

Scope of changes
This PR involves two repositories:

Reproduction case
Android device, duplicated subtitles visible during playback
YouTube video used for testing:

Subtitle cache location
Cached subtitle files (
The directory name corresponds to

Cache file naming
Cached subtitle filenames are intentionally descriptive,

Cache lifecycle & storage impact
Do cached subtitle files need to be deleted?
No.
Why keep cached subtitles?
Storage considerations

Unit tests
Tests focus on the core deduplication logic:

Why SubtitleDeduplicator operates on raw TTML text
This design is intended to be practical and simple. At this stage, the goal is only to detect obviously

Difference from
The fix has been tested with a YouTube video: https://www.youtube.com/watch?v=b7vmW_5HSpE
Before the fix, the subtitle is shown as follows:
After applying the fix, the subtitle is displayed as follows:
AudricV
left a comment
I think we don't want NewPipe Extractor to download files directly, so your approach must be changed, especially as you do not delete the files. I would also avoid downloading each subtitle, so that we do not hit rate limits.
The extractor is not an Android library, so Android-specific comments should be removed.
If YouTube provides incorrect subtitles, it should not be up to the extractor to fix them, in my opinion. To me, it makes more sense to fix this with a custom ExoPlayer component on the app side.

Thanks for the feedback; it helps me better understand the intended boundaries of NewPipeExtractor.
I’m preparing some follow-up comments to explain these commits, especially around subtitle downloading. I’m also taking some time to think about whether this design makes sense. I’ll add more comments soon.
About
: Just to make sure I understand correctly: currently, the extractor only provides
My original idea was to fix duplicated subtitles as early as possible and
However, I now realize that my changes effectively moved the subtitle downloading
At first, I thought this was acceptable since subtitles are eventually downloaded
So, performing file downloads inside NewPipeExtractor crosses its intended boundary, right?

Hi,
Just a small update. Today I found a YouTube video where the subtitles still show duplicated lines, so the current deduplication logic does not cover this case yet. I think I found the reason, and the logic probably needs a bit more adjustment.

I have already moved the deduplication logic (with network download and file I/O) from the extractor to the app side, and after some trial and error, I now have a better approach. I will update the code soon and leave a message here.
Thanks.

… subtitle URL parameters.
- Add `V`, `LANG`, `TLANG` constants to `YoutubeParsingHelper`
- Implement `extractVideoId()`, `extractLanguageCode()`, `extractTranslationCode()`
- Add `extractQueryParam()` utility in `Utils.java`
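To illustrate the kind of helper this commit message describes, here is a minimal sketch of reading a query parameter from a subtitle URL. It is not the code from the commit: the class name, the method signature, the constant values ("v", "lang", "tlang") and the example URL are all assumptions made for illustration.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; names and values are assumptions, not the PR's code.
final class QueryParamSketch {
    // Assumed values for the V, LANG and TLANG constants named in the commit.
    static final String V = "v";
    static final String LANG = "lang";
    static final String TLANG = "tlang";

    // Returns the decoded value of the first parameter called `name`,
    // or null when the URL has no query or no such parameter.
    static String extractQueryParam(final String url, final String name) {
        final String query = URI.create(url).getRawQuery();
        if (query == null) {
            return null;
        }
        for (final String pair : query.split("&")) {
            final int eq = pair.indexOf('=');
            final String key = eq < 0 ? pair : pair.substring(0, eq);
            if (key.equals(name)) {
                final String raw = eq < 0 ? "" : pair.substring(eq + 1);
                return URLDecoder.decode(raw, StandardCharsets.UTF_8);
            }
        }
        return null;
    }

    public static void main(final String[] args) {
        // Made-up URL shape, used only to exercise the helper.
        final String url = "https://example.invalid/timedtext?v=b7vmW_5HSpE&lang=en&tlang=fr";
        System.out.println(extractQueryParam(url, V));
        System.out.println(extractQueryParam(url, LANG));
        System.out.println(extractQueryParam(url, TLANG));
    }
}
```

Under these assumptions, `extractVideoId()`, `extractLanguageCode()` and `extractTranslationCode()` could each be a one-line call to such a helper with the matching constant.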
ec00c79 to
e944f81

Hi,
Just a quick note. The previous commits from the first review were overwritten when updating this PR.

Here is the updated description for this round of changes:

Background
As suggested in the previous review, the subtitle deduplication logic has been moved from the extractor side to the app side.

Attempts and Design Decisions

1. Deduplication inside StreamInfoTag
StreamInfo (which contains the original subtitle data from the extractor) is defined inside StreamInfoTag:
private final StreamInfo streamInfo;
Initially, deduplication was implemented inside the constructor of StreamInfoTag (player/mediaitem/StreamInfoTag.java).
Advantages:
1) Centralized entry point for subtitle processing
Issues:
StreamInfo originates from the extractor, but is modified in the app layer
Observed during testing:
👉 This approach was abandoned.

2. Introducing AppStreamInfo
A second attempt introduced a new domain class
Goals:
Current status:
Reason:
👉 Therefore, AppStreamInfo was not used in the download part.

3. Final Approach for Download
The deduplication logic is now applied inside TtmlConverter.process()
Current solution:
Advantages:

Additional Observation / Question
While working on this, I found that ChunkFileInputStream.read() returns 0 at EOF.
NewPipe$ git diff app/src/main/java/us/shandian/giga/postprocessing/TtmlConverter.java
diff --git a/app/src/main/java/us/shandian/giga/postprocessing/TtmlConverter.java b/app/src/main/java/us/shandian/giga/postprocessing/TtmlConverter.java
index 935eacf59..0f229944a 100644
--- a/app/src/main/java/us/shandian/giga/postprocessing/TtmlConverter.java
+++ b/app/src/main/java/us/shandian/giga/postprocessing/TtmlConverter.java
@@ -57,7 +57,7 @@ class TtmlConverter extends Postprocessing {
}
private static String readSharpStreamToString(final SharpStream stream) throws IOException {
-
+ Log.d(TAG, "SharpStream implementation: " + stream.getClass().getName());
final ByteArrayOutputStream out = new ByteArrayOutputStream();
final byte[] buffer = new byte[8192];
@@ -77,8 +77,10 @@ class TtmlConverter extends Postprocessing {
// can safely be switched back to `read != -1`. Keeping `> 0` is
// also safe and will continue to work.
while ((read = stream.read(buffer)) > 0) {
+ Log.d(TAG, "read bytes: " + read);
out.write(buffer, 0, read);
}
+ Log.d(TAG, "read loop finished with value: " + read);
final String result = out.toString(StandardCharsets.UTF_8);
However, according to standard Java InputStream behavior:
This is also noted in the comment inside

Investigation
Search results in the NewPipe repository:
grep -rn -i "ChunkFileInputStream" . --include=*.{java,kt}
Result shows:
ChunkFileInputStream is mainly used in:
Postprocessing.java
The read() behavior is explicitly handled in TtmlConverter.java ('read > 0' instead of '!= -1')

Questions
Is returning 0 at EOF an intentional design in NewPipe?
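For comparison, the standard contract can be checked with a small sketch. `InputStream.read(byte[])` is documented to return -1 at end of stream, and 0 only when the buffer has length zero. The class and method names below are hypothetical and not part of NewPipe; the loop mirrors the `> 0` condition from the diff above, which stops on either value.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Illustrative only: ReadLoopSketch and readAll are hypothetical names.
final class ReadLoopSketch {

    // Drains the stream. The `> 0` test ends the loop on -1 (standard EOF)
    // and also on 0 (the value ChunkFileInputStream is reported to return).
    static byte[] readAll(final InputStream in) throws IOException {
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        final byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) > 0) {
            out.write(buffer, 0, read);
        }
        return out.toByteArray();
    }

    public static void main(final String[] args) throws IOException {
        final byte[] data = "<tt/>".getBytes(StandardCharsets.UTF_8);
        final InputStream in = new ByteArrayInputStream(data);
        System.out.println(readAll(in).length);    // 5
        System.out.println(in.read(new byte[16])); // -1, the standard EOF signal
    }
}
```

With a non-empty buffer, a standard stream never returns 0, so `> 0` and `!= -1` behave the same there. The two conditions only differ for a stream that reports EOF as 0, where `!= -1` would loop forever.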
Observed Cases of Duplicate Subtitle Entries in YouTube TTML Files
The subtitle format involved is TTML. A subtitle entry (paragraph) consists of three parts: a begin timestamp, an end timestamp, and text content. TTML supports visual styling of subtitle text (e.g., colors, font sizes) via
Text comparison is strict: it compares normalized plain text only and requires exact character equality (no semantic analysis). All cases below assume the two entries have identical begin and end timestamps.

Case 1: Two subtitle entries are completely identical
Both the text content and all style attributes are exactly the same.
Example from
<p begin="00:00:01.352" end="00:00:01.852" style="s2">
<span style="s3"></span>
<span style="s4"> Land </span>
<span style="s5">mines </span>
<span style="s3"><br/></span>
<span style="s5"> I've left them everywhere </span>
<span style="s3"></span>
</p>
<p begin="00:00:01.352" end="00:00:01.852" style="s2">
<span style="s3"></span>
<span style="s4"> Land </span>
<span style="s5">mines </span>
<span style="s3"><br/></span>
<span style="s5"> I've left them everywhere </span>
<span style="s3"></span>
</p>
Related issues and videos:

Case 2: Same visible text, different style attributes
The text content is identical after stripping
Simple example: the same word styled differently across two entries:
<p begin="00:00:11.452" end="00:00:14.388" style="s2">
<span style="s3">Magic</span>
</p>
<p begin="00:00:11.452" end="00:00:14.388" style="s2">
<span style="s11">Magic</span>
</p>
Complex example: the same sentence, but one entry applies a single style to the whole sentence, while the other applies a different style to each individual word (or even each character):
<p begin="00:00:05.000" end="00:00:08.000" style="s2">
<span style="s3">Hello world today</span>
</p>
<p begin="00:00:05.000" end="00:00:08.000" style="s2">
<span style="s4">Hello </span>
<span style="s5">world </span>
<span style="s6">today</span>
</p>
After stripping all
Related videos:

Case 3: Same text after stripping styles, differing only in whitespace
After removing style tags, the visible characters are identical, but the entries differ in leading/trailing spaces or runs of consecutive spaces.
Example (constructed from unit tests):
<p begin="00:00:01.000" end="00:00:02.000"> Hello world </p>
<p begin="00:00:01.000" end="00:00:02.000">Hello world</p>
In addition to whitespace, the following invisible characters are also stripped before comparison, as they can cause two visually identical entries to appear different at the byte level:
Note: TTML may contain
In the current implementation,
<p>Hello<br/>world</p>
<p>Hello world</p>
Therefore,

Case 4: Visually identical entries that are not detected as duplicates
If two entries look identical to the human eye but are not flagged as duplicates by the deduplication logic, the most likely cause is invisible Unicode characters embedded in the text that are not covered by the current normalization rules.
The following invisible characters are currently normalized:

Case 5: Duplicate entries that are not adjacent
The two duplicate entries are not consecutive: at least one other subtitle entry appears between them.
Example from
<p begin="00:00:01.642" end="00:00:03.244" style="s2">
<span style="s3">原作</span><span style="s3"> おジャ魔女どれみ</span>
</p>
<p begin="00:00:01.642" end="00:00:03.244" style="s2">
<span style="s4">原唱</span><span style="s4"> MAHO堂</span>...
</p>
<p begin="00:00:01.642" end="00:00:03.244" style="s2">
<span style="s3">原作 </span><span style="s7">おジャ魔女どれみ</span>
</p>
The first and third entries are duplicates (same text after normalization), separated by a non-duplicate entry.
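
The cases above can be summarized in a rough sketch that works on raw TTML text. It is only an illustration under stated assumptions, not the SubtitleDeduplicator from this PR. The class name, the regular expression, and the two invisible characters it removes (U+200B and U+FEFF) are assumptions. It also removes every tag, including `<br/>`, without inserting a space, which may differ from the actual implementation.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the PR's SubtitleDeduplicator.
final class TtmlDedupSketch {
    // Matches one <p begin=".." end=".." ...>...</p> entry in raw TTML text.
    private static final Pattern ENTRY = Pattern.compile(
            "<p\\s+begin=\"([^\"]*)\"\\s+end=\"([^\"]*)\"[^>]*>(.*?)</p>",
            Pattern.DOTALL);

    // Removes all tags (Cases 1 and 2), removes two assumed invisible
    // characters (Case 4), then collapses and trims whitespace (Case 3).
    static String normalize(final String body) {
        return body.replaceAll("<[^>]+>", "")
                .replaceAll("[\\u200B\\uFEFF]", "")
                .replaceAll("\\s+", " ")
                .trim();
    }

    // Keeps the first entry for each (begin, end, normalized text) key.
    // A set of seen keys also catches non-adjacent duplicates (Case 5).
    static List<String> uniqueEntries(final String ttml) {
        final Set<String> seen = new HashSet<>();
        final List<String> kept = new ArrayList<>();
        final Matcher m = ENTRY.matcher(ttml);
        while (m.find()) {
            final String key =
                    m.group(1) + '|' + m.group(2) + '|' + normalize(m.group(3));
            if (seen.add(key)) {
                kept.add(m.group());
            }
        }
        return kept;
    }
}
```

Because the key includes both timestamps, entries with the same text but different timing are never merged, which matches the assumption stated at the top of this comment.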




