Skip to content

Fixing flaky PaSa figure 1 read assertions#1312

Merged
jamesbraza merged 1 commit intomainfrom
fixing-flaky-test
Mar 4, 2026
Merged

Fixing flaky PaSa figure 1 read assertions#1312
jamesbraza merged 1 commit intomainfrom
fixing-flaky-test

Conversation

@jamesbraza
Copy link
Copy Markdown
Collaborator

Seen in this CI run:

FAILED packages/paper-qa-pypdf/tests/test_paperqa_pypdf.py::test_parse_pdf_to_pages - AssertionError: Expected raw_answer_no_citations='There is one User Query blue box in the diagram. It is connected to the Paper Queue and the Selector .' to have ('two', '2') present

This PR generalizes our assertions to allow for the answer saying "paper queue" then "selector" OR "selector" then "paper queue".

@jamesbraza jamesbraza self-assigned this Mar 4, 2026
Copilot AI review requested due to automatic review settings March 4, 2026 05:02
@jamesbraza jamesbraza added the bug Something isn't working label Mar 4, 2026
@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Mar 4, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses flaky PDF figure-question assertions by updating the tests to accept multiple valid phrasings/orderings in the model’s extracted answer for the PaSa Figure 1 question.

Changes:

  • Refactors per-question assertions from substring-count tuples into a more general “answer checks” list supporting regex checks.
  • Updates the “User Query blue boxes” question to use a regex intended to be order-independent for “paper queue” vs “selector”.
  • Applies the same assertion refactor across pypdf, pymupdf, docling, and nemotron test suites.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
packages/paper-qa-pypdf/tests/test_paperqa_pypdf.py Refactors answer assertions to allow regex-based checks for the figure Q&A test.
packages/paper-qa-pymupdf/tests/test_paperqa_pymupdf.py Mirrors the assertion refactor and regex-based check for the same flaky Q&A scenario.
packages/paper-qa-nemotron/tests/test_paperqa_nemotron.py Applies the same “answer checks” approach to the Nemotron-backed parser tests.
packages/paper-qa-docling/tests/test_paperqa_docling.py Applies the same generalized assertions to the Docling-backed parser tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

(
"How many User Query blue boxes are there, and what are they connected to?",
[(("two", "2"), 1), (("crawler", "selector"), 2)],
[r"two|2|(?=.*paper queue)(?=.*selector)"],
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regex two|2|(?=.*paper queue)(?=.*selector) is too permissive: because of the | alternation it will pass if the answer contains just "two" or "2" even if it never mentions both "paper queue" and "selector". If the intent is to require the count and both connections (in either order), split into separate checks (one for count, one requiring both substrings) or use a single pattern that requires all conditions via lookaheads (e.g., require (two|2) plus both substrings).

Suggested change
[r"two|2|(?=.*paper queue)(?=.*selector)"],
[r"(?=.*\b(two|2)\b)(?=.*paper queue)(?=.*selector)"],

Copilot uses AI. Check for mistakes.
(
"How many User Query blue boxes are there, and what are they connected to?",
[(("two", "2"), 1), (("crawler", "selector"), 2)],
[r"two|2|(?=.*paper queue)(?=.*selector)"],
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regex two|2|(?=.*paper queue)(?=.*selector) is too permissive: it will succeed if the answer contains only "two" or "2", without requiring that both "paper queue" and "selector" are present. To keep the test meaningful while allowing either ordering, use two separate checks (count + both substrings) or a lookahead-only pattern that requires all conditions (including the count).

Suggested change
[r"two|2|(?=.*paper queue)(?=.*selector)"],
[
(("two", "2"), 1),
(("paper queue", "selector"), 2),
],

Copilot uses AI. Check for mistakes.
(
"How many User Query blue boxes are there, and what are they connected to?",
[(("two", "2"), 1), (("crawler", "selector"), 2)],
[r"two|2|(?=.*paper queue)(?=.*selector)"],
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regex two|2|(?=.*paper queue)(?=.*selector) is logically weaker than the old assertions: because of alternation it will pass on "two"/"2" alone, even if the answer doesn't mention both "paper queue" and "selector". Consider splitting into separate checks (count + presence of both substrings regardless of order) or changing the regex to require all three conditions via lookaheads (including (two|2)).

Suggested change
[r"two|2|(?=.*paper queue)(?=.*selector)"],
[r"(?=.*\b(?:two|2)\b)(?=.*paper queue)(?=.*selector)"],

Copilot uses AI. Check for mistakes.
(
"How many User Query blue boxes are there, and what are they connected to?",
[(("two", "2"), 1), (("crawler", "selector"), 2)],
[r"two|2|(?=.*paper queue)(?=.*selector)"],
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pattern two|2|(?=.*paper queue)(?=.*selector) uses | alternation, so the assertion will pass if the answer merely includes "two"/"2" without mentioning both "paper queue" and "selector". If the goal is to allow either ordering of those two connections while still checking them, split into separate checks (count + both substrings) or make the regex require all conditions (e.g., via lookaheads that include (two|2)).

Suggested change
[r"two|2|(?=.*paper queue)(?=.*selector)"],
[r"two|2", (("paper queue", "selector"), 2)],

Copilot uses AI. Check for mistakes.
@jamesbraza jamesbraza merged commit 17a2bb8 into main Mar 4, 2026
9 of 11 checks passed
@jamesbraza jamesbraza deleted the fixing-flaky-test branch March 4, 2026 05:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working size:M This PR changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants