
Supporting signed GCS links in ParsedMedia#1307

Merged
jamesbraza merged 4 commits into main from media-url on Feb 27, 2026
Conversation

@jamesbraza
Collaborator

This PR extends ParsedMedia to support URLs such as signed GCS links by:

  • Adding a url field to ParsedMedia
  • Integrating it into the Docs.aget_evidence workflow, with tests

@jamesbraza jamesbraza self-assigned this Feb 27, 2026
Copilot AI review requested due to automatic review settings February 27, 2026 01:26
@jamesbraza jamesbraza added the enhancement New feature or request label Feb 27, 2026
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Feb 27, 2026
Contributor

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@dosubot

dosubot bot commented Feb 27, 2026

Related Documentation

1 document may need updating based on the files changed in this PR:

paper-qa

Multimodal Support in PaperQA
View Suggested Changes
@@ -6,15 +6,39 @@
 ## Multimodal Capabilities
 **Note:** If a table's markdown text contains invalid control characters (such as null bytes or orphaned UTF-16 surrogate code points in the range U+D800–U+DFFF), the markdown text will be omitted from the `ParsedMedia` object for that table. This prevents downstream encoding errors (e.g., with PostgreSQL databases or LLM prompt construction) and ensures only valid markdown is included. The image representation of the table remains available regardless of markdown validity. Similarly, if formula text is missing or invalid, the original LaTeX source is used as a fallback for text extraction.
 
+### ParsedMedia Structure
+`ParsedMedia` objects represent media extracted from documents. Each instance can hold media in one of two mutually exclusive ways:
+
+- **`data` field (bytes)**: Contains raw image bytes (e.g., PNG or JPEG data). This field defaults to empty bytes (`b""`) if not provided.
+- **`url` field (str | None)**: Contains an HTTP(S) URL to the media, such as a signed GCS link or other cloud storage URL. This field defaults to `None`.
+
+**Validation**: Exactly one of `data` or `url` must be set. You cannot provide both (ambiguous state) or neither (no media). If both are provided or both are empty, validation will raise a `ValueError`. This XOR constraint ensures clear semantics: media is either embedded as bytes or referenced by URL, never both.
+
+### ParsedMedia Methods
+`ParsedMedia` provides several methods for working with media, and their behavior varies depending on whether the instance holds bytes or a URL:
+
+- **`to_image_url()`**: Returns a URL suitable for LLM image content. If the `url` field is set, it returns that HTTP(S) URL directly. Otherwise, it converts the `data` bytes into an RFC 2397 base64 data URL (e.g., `data:image/png;base64,...`).
+
+- **`to_id()`**: Generates a UUID4 suitable for database IDs by hashing the image bytes and text. This method **raises a `ValueError`** for URL-only `ParsedMedia` instances (when `data` is empty), as ID generation requires image bytes.
+
+- **`save(path)`**: Saves the image bytes to the specified file path. This method **raises a `ValueError`** for URL-only `ParsedMedia` instances, as there are no local bytes to write for media that only holds a URL reference.
+
+- **Equality comparison (`__eq__`)**: Compares two `ParsedMedia` instances. Only bytes-to-bytes and URL-to-URL comparisons are meaningful; mixed types (one with `data`, one with `url`) are considered incompatible and return `False`. If you need to compare mixed types, resolve the URL to bytes or generate a URL from bytes first.
+
+### Helper Function: `create_multimodal_message()`
+To support URL-based media in evidence workflows, PaperQA provides the `create_multimodal_message()` helper function. This function constructs OpenAI-format multimodal messages that support both HTTP(S) URLs (such as signed GCS links) and RFC 2397 data URLs.
+
+The function bypasses aviary's `Message.create_message()` base64 image validation, which rejects HTTP(S) URLs. Instead, it directly constructs the message content list, allowing signed cloud storage links alongside data URLs. This is used internally during evidence gathering to include media in LLM prompts, regardless of whether the media is embedded as bytes or referenced by URL.
+
 ## Integration of ParsedMedia into Docs
 The `Docs` object manages collections of documents and their associated media. Media are stored using the `ParsedMedia` abstraction, which supports a one-to-many relationship between media and document chunks: a single media object can be referenced by multiple text chunks, and the same image, formula, or logo may appear in several places (for example, a logo or formula on each page of a PDF or Office document). To avoid redundant storage and repeated inclusion of identical images, formulae, or tables, PaperQA deduplicates media using a robust hash based on the media's metadata (including a pixel-tolerant bounding box). This deduplication ensures that repeated images, formulae, tables, or logos—such as recurring logos or identical figures—are only stored and referenced once, even if they appear on multiple pages or in multiple contexts. As a result, downstream components access a unique set of relevant visual data for retrieval and evidence generation, regardless of how often or where it appears in the document [[PR 1153]](https://github.com/Future-House/paper-qa/pull/1153), [[PR 1047]](https://github.com/Future-House/paper-qa/pull/1047), [[PR 1046]](https://github.com/Future-House/paper-qa/pull/1046), [[PR 1169]](https://github.com/Future-House/paper-qa/pull/1169), [[PR 1225]](https://github.com/Future-House/paper-qa/pull/1225)).
 
 ## Multimodal Data in Evidence Gathering
-During evidence gathering, PaperQA includes both text and associated media when generating contextual summaries for each chunk. The summary LLM receives the chunk's text and any linked media (such as images, tables, or formulae), regardless of whether the source is a PDF, Office document, or other supported format. For tables, the system formats the table content as markdown and includes it in the prompt; for images and formulae, it attaches image URLs to the LLM message. To avoid redundant information and reduce prompt size, PaperQA deduplicates media before including them in the context: only unique images, tables, formulae, or logos (as determined by their metadata and a pixel-tolerant hash) are included, even if the same media appears multiple times in the document or chunk. This deduplication ensures that repeated figures, logos, formulae, or tables do not inflate the prompt or distract the LLM, while still providing all necessary visual context.
+During evidence gathering, PaperQA includes both text and associated media when generating contextual summaries for each chunk. The summary LLM receives the chunk's text and any linked media (such as images, tables, or formulae), regardless of whether the source is a PDF, Office document, or other supported format. For tables, the system formats the table content as markdown and includes it in the prompt; for images and formulae, it attaches image URLs to the LLM message using the `to_image_url()` method, which returns either HTTP(S) URLs (for URL-based media like signed GCS links) or RFC 2397 data URLs (for bytes-based media). To avoid redundant information and reduce prompt size, PaperQA deduplicates media before including them in the context: only unique images, tables, formulae, or logos (as determined by their metadata and a pixel-tolerant hash) are included, even if the same media appears multiple times in the document or chunk. This deduplication ensures that repeated figures, logos, formulae, or tables do not inflate the prompt or distract the LLM, while still providing all necessary visual context.
 
 The output summary remains text-only, but the LLM has access to the full, deduplicated multimodal context when generating its response.
 
-If the LLM call fails due to media inclusion (for example, if the model cannot process the attached images or if an image is corrupt or unsupported), PaperQA now handles these failures gracefully: the problematic context is skipped, and evidence gathering continues for the remaining contexts. This ensures that a handful of bad or unsupported media items do not cause the entire evidence gathering process to fail. If many contexts fail, the system may still fail the tool call, depending on future configuration. Additionally, if configured, the system can retry context creation using only text when media-related issues are detected. This logic applies to all supported document types, including Office files [[PR 1153]](https://github.com/Future-House/paper-qa/pull/1153), [[PR 1046]](https://github.com/Future-House/paper-qa/pull/1046), [[PR 1169]](https://github.com/Future-House/paper-qa/pull/1169), [[PR 1225]](https://github.com/Future-House/paper-qa/pull/1225)).
+If the LLM call fails due to media inclusion (for example, if the model cannot process the attached images or if an image is corrupt or unsupported), PaperQA handles these failures gracefully: the problematic context is skipped, and evidence gathering continues for the remaining contexts. This ensures that a handful of bad or unsupported media items do not cause the entire evidence gathering process to fail. If many contexts fail, the system may still fail the tool call, depending on future configuration. Additionally, if configured, the system can retry context creation using only text when media-related issues are detected. This logic applies to all supported document types, including Office files, and works with both bytes-based and URL-based media [[PR 1153]](https://github.com/Future-House/paper-qa/pull/1153), [[PR 1046]](https://github.com/Future-House/paper-qa/pull/1046), [[PR 1169]](https://github.com/Future-House/paper-qa/pull/1169), [[PR 1225]](https://github.com/Future-House/paper-qa/pull/1225), [[PR 1307]](https://github.com/Future-House/paper-qa/pull/1307).
 
 ## Citation Peeks Are Text-Only
 Citation peeks in PaperQA are always generated using only the text content of documents, regardless of whether multimodal parsing is enabled. Images and other media are not included in citation peeks. This ensures that citation previews remain consistent and focused on textual evidence, even when documents contain embedded images or tables.
@@ -37,3 +61,5 @@
 To ensure robust deduplication, PaperQA includes dedicated tests and custom test files (such as `duplicate_media.pdf` and Office samples) that contain repeated images, tables, formulae, and logos across multiple pages or sheets. These tests verify that only unique media are included in context creation and LLM prompts, and that deduplication logic correctly identifies and collapses repeated figures, formulae, logos, or tables. The tests assert that the number of unique images, formulae, or tables included in evidence contexts is less than the total number of appearances in the document, confirming that deduplication is effective [[PR 1153]](https://github.com/Future-House/paper-qa/pull/1153), [[PR 1047]](https://github.com/Future-House/paper-qa/pull/1047), [[PR 1046]](https://github.com/Future-House/paper-qa/pull/1046), [[PR 1169]](https://github.com/Future-House/paper-qa/pull/1169), [[PR 1225]](https://github.com/Future-House/paper-qa/pull/1225)).
 
 For example, a test might check that a PDF or Office document containing tables and formulae results in both image and markdown (or LaTeX) representations being present in the parsed output, and that these are correctly linked to the relevant text chunks. If a table contains invalid characters in its markdown (such as null bytes or orphaned surrogate code points), the test verifies that the markdown is omitted and only the image is present in the parsed output. Similarly, if formula text is missing or invalid, the test verifies that the original LaTeX source is used as a fallback. If an image or formula is corrupt or unsupported, the system skips creating a context for that media and continues processing the rest of the evidence.
+
+Additional tests verify the behavior of URL-based media: tests check that `ParsedMedia` instances can be created with either raw image bytes or an HTTP(S) URL (but not both), that `to_image_url()` returns the URL directly for URL-based media, and that `to_id()` and `save()` raise errors for URL-only instances. Integration tests confirm that signed GCS links are correctly passed through the evidence gathering workflow and included in LLM prompts via `create_multimodal_message()` [[PR 1307]](https://github.com/Future-House/paper-qa/pull/1307).
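The suggested documentation describes `create_multimodal_message()` as building an OpenAI-format content list that accepts both HTTP(S) and data URLs. As a rough, hedged illustration of that message shape, the sketch below constructs such a message by hand; the helper's name comes from this PR, but this function's signature, scheme check, and error handling are assumptions, not paper-qa's actual implementation.

```python
def build_multimodal_message(text: str, image_urls: list[str]) -> dict:
    """Hypothetical sketch: build an OpenAI-format user message mixing
    text and image URLs, accepting both HTTP(S) URLs (e.g. signed GCS
    links) and RFC 2397 data URLs rather than base64-only images."""
    content: list[dict] = [{"type": "text", "text": text}]
    for url in image_urls:
        # Accept signed cloud-storage links and data URLs alike.
        if not url.startswith(("http://", "https://", "data:")):
            raise ValueError(f"Unsupported image URL scheme: {url!r}")
        content.append({"type": "image_url", "image_url": {"url": url}})
    return {"role": "user", "content": content}
```

A message built this way carries the chunk text first, followed by one `image_url` entry per attached media item, which matches the evidence-gathering flow described above.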


@jamesbraza jamesbraza requested a review from Copilot February 27, 2026 01:31
Contributor

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.





@jabra jabra left a comment


LGTM

@jamesbraza jamesbraza merged commit 730d4d2 into main Feb 27, 2026
11 of 14 checks passed
@jamesbraza jamesbraza deleted the media-url branch February 27, 2026 22:58

3 participants