Ensuring 404 PDF is not parsed into texts #1126

Merged
jamesbraza merged 4 commits into main from bad-pdf-denial
Oct 6, 2025

@jamesbraza jamesbraza commented Oct 6, 2025

Let's ensure PDFs like:

<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>nginx</center>
</body>
</html>
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->

Have two behaviors:

  • Don't crash us (the case for PyPDF)
  • Don't get considered valid by our system (the case for PyMuPDF)
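
A minimal sketch of those two behaviors, assuming a parser is any callable that turns a file's bytes into extracted text. The names `is_usable_document`, `crashing_parser`, and `lenient_parser` are hypothetical stand-ins, not paper-qa, PyPDF, or PyMuPDF APIs, and the 20-character floor is borrowed from this PR's diff:

```python
from collections.abc import Callable

# A truncated copy of the 404 body from the PR description.
NOT_FOUND_PAGE = b"<html>\n<head><title>404 Not Found</title></head>\n</html>\n"

MIN_CHARS = 20  # assumption: mirrors the threshold in this PR's diff


def crashing_parser(content: bytes) -> str:
    """Hypothetical stand-in for a parser that raises on non-PDF input."""
    if not content.startswith(b"%PDF-"):
        raise ValueError("not a PDF")
    return content.decode(errors="ignore")


def lenient_parser(content: bytes) -> str:
    """Hypothetical stand-in for a parser that returns mostly blank output."""
    if not content.startswith(b"%PDF-"):
        return "\n" * 15
    return content.decode(errors="ignore")


def is_usable_document(parser: Callable[[bytes], str], content: bytes) -> bool:
    """Return False instead of raising, and reject near-empty extractions."""
    try:
        text = parser(content)
    except Exception:  # behavior 1: a bad file must not crash the caller
        return False
    # behavior 2: a "successful" parse with almost no text is still invalid
    return len(text.replace("\n", "")) >= MIN_CHARS


print(is_usable_document(crashing_parser, NOT_FOUND_PAGE))  # False
print(is_usable_document(lenient_parser, NOT_FOUND_PAGE))  # False
```

Both stand-ins end in the same place: the 404 page is rejected rather than crashing the caller or being added as a document.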

@jamesbraza jamesbraza self-assigned this Oct 6, 2025
Copilot AI review requested due to automatic review settings October 6, 2025 18:10
@jamesbraza jamesbraza added the bug Something isn't working label Oct 6, 2025
@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Oct 6, 2025

dosubot bot commented Oct 6, 2025

Related Documentation

Checked 1 published document(s). No updates required.

Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds protection against malformed PDF files (specifically 404 HTML error pages being served as PDFs) by increasing the minimum text length validation for documents. The change also includes a test to verify that such invalid PDFs are properly rejected.

Changes

  • Increased minimum text length requirement from 10 to 20 characters (after removing newlines) to better filter out malformed PDFs
  • Added comprehensive test coverage for invalid PDF handling with two different PDF parsers

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/paperqa/docs.py | Modified validation logic to increase minimum text length requirement |
| tests/test_paperqa.py | Added test case to verify rejection of 404 HTML pages masquerading as PDFs |

cursor[bot]

This comment was marked as outdated.

@dosubot dosubot bot added size:S This PR changes 10-29 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Oct 6, 2025
@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels Oct 6, 2025
 if metadata.parse_type != "image" and (
     not texts
-    or len(texts[0].text) < 10  # noqa: PLR2004
+    or len(texts[0].text.replace("\n", "")) < 20  # noqa: PLR2004
Collaborator

Can you add a comment describing what motivates 20, and why we care that there are at least that many non-newline characters?

Collaborator Author

Yeah, just did so. I also moved it into the disable_doc_valid_check check; it probably makes more sense there.
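
For illustration, the check discussed in this thread might look like the sketch below. `has_enough_text` and its `min_chars` parameter are hypothetical names; only the newline-stripping comparison against 20 and the `disable_doc_valid_check` name come from this PR, and how that setting is wired up in paper-qa is an assumption here.

```python
def has_enough_text(
    first_text: str,
    disable_doc_valid_check: bool = False,
    min_chars: int = 20,
) -> bool:
    """Hypothetical helper: decide whether the first parsed text is plausible."""
    if disable_doc_valid_check:
        # Assumption: with the check disabled, any parse result is accepted.
        return True
    # Count only non-newline characters, so output that is mostly blank
    # lines cannot satisfy the minimum on newlines alone.
    return len(first_text.replace("\n", "")) >= min_chars


mostly_newlines = "404\n" * 4  # 16 characters, 12 once newlines are removed
print(len(mostly_newlines) >= 10)  # True: the old 10-character check passes
print(has_enough_text(mostly_newlines))  # False
```

Stripping newlines before counting is what separates the two thresholds: a short extraction padded with blank lines can clear a raw length of 10 while still carrying almost no content.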

@jamesbraza
Collaborator Author

@cursor review

Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.


@cursor cursor bot left a comment

✅ Bugbot reviewed your changes and found no bugs!


@jamesbraza jamesbraza merged commit 294c9b0 into main Oct 6, 2025
6 checks passed
@jamesbraza jamesbraza deleted the bad-pdf-denial branch October 6, 2025 19:42

dosubot bot commented Oct 6, 2025

Documentation Updates

Checked 1 published document(s). No updates required.

Labels

bug Something isn't working size:M This PR changes 30-99 lines, ignoring generated files.

3 participants