Ensuring 404 PDF is not parsed into texts#1126
Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR adds protection against malformed PDF files (specifically 404 HTML error pages being served as PDFs) by increasing the minimum text length validation for documents. The change also includes a test to verify that such invalid PDFs are properly rejected.
Changes
- Increased minimum text length requirement from 10 to 20 characters (after removing newlines) to better filter out malformed PDFs
- Added comprehensive test coverage for invalid PDF handling with two different PDF parsers
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/paperqa/docs.py | Modified validation logic to increase minimum text length requirement |
| tests/test_paperqa.py | Added test case to verify rejection of 404 HTML pages masquerading as PDFs |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
1112243 to
bc6cfb3
Compare
bc6cfb3 to
467bdfe
Compare
src/paperqa/docs.py
Outdated
| if metadata.parse_type != "image" and ( | ||
| not texts | ||
| or len(texts[0].text) < 10 # noqa: PLR2004 | ||
| or len(texts[0].text.replace("\n", "")) < 20 # noqa: PLR2004 |
There was a problem hiding this comment.
Can you add a comment describing what motivates 20, and why we care that there are at least that many non-newline characters?
There was a problem hiding this comment.
Yeah just did so. I also moved it into the disable_doc_valid_check check, it probably makes more sense there
467bdfe to
09c8132
Compare
…nted its rationale
09c8132 to
ef02046
Compare
|
@cursor review |
There was a problem hiding this comment.
Pull Request Overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
Let's ensure PDFs like:
Have two behaviors: