Filtering out null byte from table ParsedMedia.text #1088

Merged
jamesbraza merged 1 commit into main from filtering-null-byte on Sep 15, 2025

jamesbraza (Collaborator) commented Sep 15, 2025

Seen in logs when using multimodal paper-qa + PyMuPDF reader + PostgreSQL DB:

Failed to create XYZ given m.text='|Col1|Col2|Col3|Col4|Col5|Col6|Col7|Col8|\n|---|---|---|---|---|---|---|---|\n||\x02\x03<br>|\x04\x05\x06\x07\x08<br> <br>|\x07\x08\x08<br>\n\x08<br>\x0e\x0f<br>\x17\x18\x18\x08<br>|\x02<br>\x0c\x10<br>\x11<br>\x19\r\x02\x1a\x00\x01\x02\x03<br>|\x11<br>\x12\x06\x05<br>\x0e\x13\x14\x15<br>\x04\x05\x06\x07<br>|\x05\x08<br>\x0c\x10<br>\x12\x06\x05<br>\x0e\x16\x13<br>|\x05\x08<br>\x0c\x10<br>\x12\x06\x05<br>\x0e\x16\x13<br>|' ...

Traceback (most recent call last):
  ...
  File "/srv/.venv/lib/python3.13/site-packages/asyncpg/connection.py", line 748, in fetchrow
    data = await self._execute(
  File "/srv/.venv/lib/python3.13/site-packages/asyncpg/connection.py", line 1864, in _execute
    result, _ = await self.__execute(
  File "/srv/.venv/lib/python3.13/site-packages/asyncpg/connection.py", line 1961, in __execute
    result, stmt = await self._do_execute(
  File "/srv/.venv/lib/python3.13/site-packages/asyncpg/connection.py", line 2024, in _do_execute
    result = await executor(stmt, None)
  File "asyncpg/protocol/protocol.pyx", line 206, in bind_execute
asyncpg.exceptions.CharacterNotInRepertoireError: invalid byte sequence for encoding "UTF8": 0x00

This PR fixes the issue within PaperQA by filtering out table Markdown text that contains a null byte (0x00), which PostgreSQL rejects.
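The failure mode can be reproduced without a database. The following is a minimal sketch, not the PR's code: `contains_null_byte` is a hypothetical helper, and `sample` is an abridged excerpt of the logged `m.text`. It shows that Python encodes the null character to UTF-8 without complaint, so the error only surfaces once PostgreSQL receives the value, since its text types cannot store the character with code zero.

```python
def contains_null_byte(text: str) -> bool:
    """Hypothetical helper: True if the text holds a null character (U+0000)."""
    return "\x00" in text


# Abridged excerpt of the table Markdown from the log above (an illustration only).
sample = "|Col1|Col2|\n|---|---|\n|\x02\x03<br>|\x19\r\x02\x1a\x00\x01\x02\x03<br>|"

# Encoding succeeds in Python: a null character is a valid UTF-8 code point.
encoded = sample.encode("utf-8")

# The check flags the text before it would ever reach the database.
has_null = contains_null_byte(sample)
```

Because the encoding step raises nothing, a check like this has to run explicitly before the insert.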

@jamesbraza jamesbraza self-assigned this Sep 15, 2025
Copilot AI review requested due to automatic review settings September 15, 2025 21:12
@jamesbraza jamesbraza added the bug Something isn't working label Sep 15, 2025
@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Sep 15, 2025
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR fixes a PostgreSQL database encoding error caused by null bytes in table markdown text extracted from PDFs. The solution filters out invalid control characters (specifically null bytes) from table markdown text before storing it in the database.

  • Adds a regex pattern to detect null bytes, which PostgreSQL rejects in UTF-8 text, in table markdown
  • Modifies table parsing to set text to None when invalid characters are detected
  • Adds comprehensive test coverage to verify the filtering behavior
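The behavior summarized in the list above might look roughly like the sketch below. The actual pattern and function names in `reader.py` are not shown in this conversation, so `INVALID_TEXT_PATTERN` and `table_text_or_none` are hypothetical, and the pattern is assumed to match only the null byte.

```python
import re
from typing import Optional

# Assumption: only the null byte is treated as invalid; other control
# characters (e.g. \x02) are left alone in this sketch.
INVALID_TEXT_PATTERN = re.compile(r"\x00")


def table_text_or_none(markdown: str) -> Optional[str]:
    """Return the table Markdown, or None if it contains an invalid character."""
    if INVALID_TEXT_PATTERN.search(markdown):
        return None  # Drop the text rather than store a value PostgreSQL rejects
    return markdown


clean = table_text_or_none("|Col1|Col2|\n|---|---|\n|a|b|")
dirty = table_text_or_none("|Col1|Col2|\n|---|---|\n|\x02|\x00|")
```

Returning `None` rather than stripping the offending bytes matches the summary's "set text to None"; it avoids storing table text that is likely garbled anyway, at the cost of losing the table's text entirely.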

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| packages/paper-qa-pymupdf/src/paperqa_pymupdf/reader.py | Implements null byte filtering logic with regex pattern and conditional text assignment |
| packages/paper-qa-pymupdf/tests/test_paperqa_pymupdf.py | Adds test case with mocked table data containing null bytes to verify filtering works correctly |

dosubot bot commented Sep 15, 2025

Related Documentation

1 document(s) may need updating based on files changed in this PR


@jamesbraza jamesbraza merged commit a064ab7 into main Sep 15, 2025
5 checks passed
@jamesbraza jamesbraza deleted the filtering-null-byte branch September 15, 2025 21:46
dosubot bot commented Sep 15, 2025

Documentation Updates

1 document(s) were updated by changes in this PR
