
Supporting signed GCS links in ParsedMedia#1307

Merged
jamesbraza merged 4 commits into main from media-url on Feb 27, 2026
Conversation

@jamesbraza
Collaborator

This PR extends ParsedMedia to support URLs such as signed GCS links by:

  • Adding a url field to ParsedMedia
  • Integrating it into the Docs.aget_evidence workflow, with tests

@jamesbraza jamesbraza self-assigned this Feb 27, 2026
Copilot AI review requested due to automatic review settings February 27, 2026 01:26
@jamesbraza jamesbraza added the enhancement New feature or request label Feb 27, 2026
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Feb 27, 2026
Contributor

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@dosubot

dosubot bot commented Feb 27, 2026

Related Documentation

1 document may need updating based on the files changed in this PR:

paper-qa

Multimodal Support in PaperQA
View Suggested Changes
@@ -6,15 +6,39 @@
 ## Multimodal Capabilities
 **Note:** If a table's markdown text contains invalid control characters (such as null bytes or orphaned UTF-16 surrogate code points in the range U+D800–U+DFFF), the markdown text will be omitted from the `ParsedMedia` object for that table. This prevents downstream encoding errors (e.g., with PostgreSQL databases or LLM prompt construction) and ensures only valid markdown is included. The image representation of the table remains available regardless of markdown validity. Similarly, if formula text is missing or invalid, the original LaTeX source is used as a fallback for text extraction.
 
+### ParsedMedia Structure
+`ParsedMedia` objects represent media extracted from documents. Each instance can hold media in one of two mutually exclusive ways:
+
+- **`data` field (bytes)**: Contains raw image bytes (e.g., PNG or JPEG data). This field defaults to empty bytes (`b""`) if not provided.
+- **`url` field (str | None)**: Contains an HTTP(S) URL to the media, such as a signed GCS link or other cloud storage URL. This field defaults to `None`.
+
+**Validation**: Exactly one of `data` or `url` must be set. You cannot provide both (ambiguous state) or neither (no media). If both are provided or both are empty, validation will raise a `ValueError`. This XOR constraint ensures clear semantics: media is either embedded as bytes or referenced by URL, never both.
+
+### ParsedMedia Methods
+`ParsedMedia` provides several methods for working with media, and their behavior varies depending on whether the instance holds bytes or a URL:
+
+- **`to_image_url()`**: Returns a URL suitable for LLM image content. If the `url` field is set, it returns that HTTP(S) URL directly. Otherwise, it converts the `data` bytes into an RFC 2397 base64 data URL (e.g., `data:image/png;base64,...`).
+
+- **`to_id()`**: Generates a UUID4 suitable for database IDs by hashing the image bytes and text. This method **raises a `ValueError`** for URL-only `ParsedMedia` instances (when `data` is empty), as ID generation requires image bytes.
+
+- **`save(path)`**: Saves the image bytes to the specified file path. This method **raises a `ValueError`** for URL-only `ParsedMedia` instances, as there are no local bytes to write for media that only holds a URL reference.
+
+- **Equality comparison (`__eq__`)**: Compares two `ParsedMedia` instances. Only bytes-to-bytes and URL-to-URL comparisons are meaningful; mixed types (one with `data`, one with `url`) are considered incompatible and return `False`. If you need to compare mixed types, resolve the URL to bytes or generate a URL from bytes first.
+
+### Helper Function: `create_multimodal_message()`
+To support URL-based media in evidence workflows, PaperQA provides the `create_multimodal_message()` helper function. This function constructs OpenAI-format multimodal messages that support both HTTP(S) URLs (such as signed GCS links) and RFC 2397 data URLs.
+
+The function bypasses aviary's `Message.create_message()` base64 image validation, which rejects HTTP(S) URLs. Instead, it directly constructs the message content list, allowing signed cloud storage links alongside data URLs. This is used internally during evidence gathering to include media in LLM prompts, regardless of whether the media is embedded as bytes or referenced by URL.
+
 ## Integration of ParsedMedia into Docs
 The `Docs` object manages collections of documents and their associated media. Media are stored using the `ParsedMedia` abstraction, which supports a one-to-many relationship between media and document chunks: a single media object can be referenced by multiple text chunks, and the same image, formula, or logo may appear in several places (for example, a logo or formula on each page of a PDF or Office document). To avoid redundant storage and repeated inclusion of identical images, formulae, or tables, PaperQA deduplicates media using a robust hash based on the media's metadata (including a pixel-tolerant bounding box). This deduplication ensures that repeated images, formulae, tables, or logos—such as recurring logos or identical figures—are only stored and referenced once, even if they appear on multiple pages or in multiple contexts. As a result, downstream components access a unique set of relevant visual data for retrieval and evidence generation, regardless of how often or where it appears in the document [[PR 1153]](https://github.com/Future-House/paper-qa/pull/1153), [[PR 1047]](https://github.com/Future-House/paper-qa/pull/1047), [[PR 1046]](https://github.com/Future-House/paper-qa/pull/1046), [[PR 1169]](https://github.com/Future-House/paper-qa/pull/1169), [[PR 1225]](https://github.com/Future-House/paper-qa/pull/1225)).
 
 ## Multimodal Data in Evidence Gathering
-During evidence gathering, PaperQA includes both text and associated media when generating contextual summaries for each chunk. The summary LLM receives the chunk's text and any linked media (such as images, tables, or formulae), regardless of whether the source is a PDF, Office document, or other supported format. For tables, the system formats the table content as markdown and includes it in the prompt; for images and formulae, it attaches image URLs to the LLM message. To avoid redundant information and reduce prompt size, PaperQA deduplicates media before including them in the context: only unique images, tables, formulae, or logos (as determined by their metadata and a pixel-tolerant hash) are included, even if the same media appears multiple times in the document or chunk. This deduplication ensures that repeated figures, logos, formulae, or tables do not inflate the prompt or distract the LLM, while still providing all necessary visual context.
+During evidence gathering, PaperQA includes both text and associated media when generating contextual summaries for each chunk. The summary LLM receives the chunk's text and any linked media (such as images, tables, or formulae), regardless of whether the source is a PDF, Office document, or other supported format. For tables, the system formats the table content as markdown and includes it in the prompt; for images and formulae, it attaches image URLs to the LLM message using the `to_image_url()` method, which returns either HTTP(S) URLs (for URL-based media like signed GCS links) or RFC 2397 data URLs (for bytes-based media). To avoid redundant information and reduce prompt size, PaperQA deduplicates media before including them in the context: only unique images, tables, formulae, or logos (as determined by their metadata and a pixel-tolerant hash) are included, even if the same media appears multiple times in the document or chunk. This deduplication ensures that repeated figures, logos, formulae, or tables do not inflate the prompt or distract the LLM, while still providing all necessary visual context.
 
 The output summary remains text-only, but the LLM has access to the full, deduplicated multimodal context when generating its response.
 
-If the LLM call fails due to media inclusion (for example, if the model cannot process the attached images or if an image is corrupt or unsupported), PaperQA now handles these failures gracefully: the problematic context is skipped, and evidence gathering continues for the remaining contexts. This ensures that a handful of bad or unsupported media items do not cause the entire evidence gathering process to fail. If many contexts fail, the system may still fail the tool call, depending on future configuration. Additionally, if configured, the system can retry context creation using only text when media-related issues are detected. This logic applies to all supported document types, including Office files [[PR 1153]](https://github.com/Future-House/paper-qa/pull/1153), [[PR 1046]](https://github.com/Future-House/paper-qa/pull/1046), [[PR 1169]](https://github.com/Future-House/paper-qa/pull/1169), [[PR 1225]](https://github.com/Future-House/paper-qa/pull/1225)).
+If the LLM call fails due to media inclusion (for example, if the model cannot process the attached images or if an image is corrupt or unsupported), PaperQA handles these failures gracefully: the problematic context is skipped, and evidence gathering continues for the remaining contexts. This ensures that a handful of bad or unsupported media items do not cause the entire evidence gathering process to fail. If many contexts fail, the system may still fail the tool call, depending on future configuration. Additionally, if configured, the system can retry context creation using only text when media-related issues are detected. This logic applies to all supported document types, including Office files, and works with both bytes-based and URL-based media [[PR 1153]](https://github.com/Future-House/paper-qa/pull/1153), [[PR 1046]](https://github.com/Future-House/paper-qa/pull/1046), [[PR 1169]](https://github.com/Future-House/paper-qa/pull/1169), [[PR 1225]](https://github.com/Future-House/paper-qa/pull/1225), [[PR 1307]](https://github.com/Future-House/paper-qa/pull/1307).
 
 ## Citation Peeks Are Text-Only
 Citation peeks in PaperQA are always generated using only the text content of documents, regardless of whether multimodal parsing is enabled. Images and other media are not included in citation peeks. This ensures that citation previews remain consistent and focused on textual evidence, even when documents contain embedded images or tables.
@@ -37,3 +61,5 @@
 To ensure robust deduplication, PaperQA includes dedicated tests and custom test files (such as `duplicate_media.pdf` and Office samples) that contain repeated images, tables, formulae, and logos across multiple pages or sheets. These tests verify that only unique media are included in context creation and LLM prompts, and that deduplication logic correctly identifies and collapses repeated figures, formulae, logos, or tables. The tests assert that the number of unique images, formulae, or tables included in evidence contexts is less than the total number of appearances in the document, confirming that deduplication is effective [[PR 1153]](https://github.com/Future-House/paper-qa/pull/1153), [[PR 1047]](https://github.com/Future-House/paper-qa/pull/1047), [[PR 1046]](https://github.com/Future-House/paper-qa/pull/1046), [[PR 1169]](https://github.com/Future-House/paper-qa/pull/1169), [[PR 1225]](https://github.com/Future-House/paper-qa/pull/1225)).
 
 For example, a test might check that a PDF or Office document containing tables and formulae results in both image and markdown (or LaTeX) representations being present in the parsed output, and that these are correctly linked to the relevant text chunks. If a table contains invalid characters in its markdown (such as null bytes or orphaned surrogate code points), the test verifies that the markdown is omitted and only the image is present in the parsed output. Similarly, if formula text is missing or invalid, the test verifies that the original LaTeX source is used as a fallback. If an image or formula is corrupt or unsupported, the system skips creating a context for that media and continues processing the rest of the evidence.
+
+Additional tests verify the behavior of URL-based media: tests check that `ParsedMedia` instances can be created with either raw image bytes or an HTTP(S) URL (but not both), that `to_image_url()` returns the URL directly for URL-based media, and that `to_id()` and `save()` raise errors for URL-only instances. Integration tests confirm that signed GCS links are correctly passed through the evidence gathering workflow and included in LLM prompts via `create_multimodal_message()` [[PR 1307]](https://github.com/Future-House/paper-qa/pull/1307).
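The suggested documentation describes `create_multimodal_message()` as building an OpenAI-format content list that accepts both HTTP(S) and data URLs. As a rough, hedged illustration of that message shape, the sketch below constructs such a message by hand; the helper's name comes from this PR, but this function's signature, scheme check, and error handling are assumptions, not paper-qa's actual implementation.

```python
def build_multimodal_message(text: str, image_urls: list[str]) -> dict:
    """Hypothetical sketch: build an OpenAI-format user message mixing
    text and image URLs, accepting both HTTP(S) URLs (e.g. signed GCS
    links) and RFC 2397 data URLs rather than base64-only images."""
    content: list[dict] = [{"type": "text", "text": text}]
    for url in image_urls:
        # Accept signed cloud-storage links and data URLs alike.
        if not url.startswith(("http://", "https://", "data:")):
            raise ValueError(f"Unsupported image URL scheme: {url!r}")
        content.append({"type": "image_url", "image_url": {"url": url}})
    return {"role": "user", "content": content}
```

A message built this way carries the chunk text first, followed by one `image_url` entry per attached media item, which matches the evidence-gathering flow described above.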


@jamesbraza jamesbraza requested a review from Copilot February 27, 2026 01:31
Contributor

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.





@jabra jabra left a comment


LGTM

@jamesbraza jamesbraza merged commit 730d4d2 into main Feb 27, 2026
11 of 14 checks passed
@jamesbraza jamesbraza deleted the media-url branch February 27, 2026 22:58

3 participants