Merged
Changes from 19 commits
Commits
23 commits
18c58db
feat: Add multimodal support for Office documents
takeruhukushima Nov 1, 2025
95e6e25
solve dependencies problem
takeruhukushima Nov 1, 2025
2f32c8a
fix bug
takeruhukushima Nov 1, 2025
f7cc49d
[pre-commit.ci lite] apply automatic fixes
pre-commit-ci-lite[bot] Nov 1, 2025
e42e636
Merge remote-tracking branch 'origin/feature/new-doc-types'
takeruhukushima Nov 1, 2025
c801103
fix:pre-commit fail src/paperqa/readers.py
takeruhukushima Nov 2, 2025
5e49ca1
add .docx,.pptx,.xlsx in settings.py
takeruhukushima Nov 2, 2025
65c5097
refactor(chunks):consolidating the chunk code for office and pdf
takeruhukushima Nov 2, 2025
5ead779
edit README.md:add .docx, .xlsx, .pptx, and code files (e.g., .py, .t…
takeruhukushima Nov 2, 2025
a7c5e3a
refactor: Unify chunking algorithm name for PDF and office documents
takeruhukushima Nov 2, 2025
5fe86c1
feat: Add unstructured version to office document parsing metadata
takeruhukushima Nov 2, 2025
d4619bd
feat: Implement lazy import for unstructured in office document parsing
takeruhukushima Nov 2, 2025
0be33fe
feat: Add unit test for office document parsing
takeruhukushima Nov 2, 2025
0775523
[pre-commit.ci lite] apply automatic fixes
pre-commit-ci-lite[bot] Nov 2, 2025
84c6bf2
feat: Update test_parse_office_doc for Gemini models and RAG query
takeruhukushima Nov 2, 2025
e218d91
add mailmap takerufukushima
takeruhukushima Nov 2, 2025
4933f16
fix pre-commit error
takeruhukushima Nov 2, 2025
77b95cc
Merge branch 'feature/new-doc-types' of https://github.com/takeruhuku…
takeruhukushima Nov 2, 2025
280c381
Fix: Address linting issues in test_paperqa.py
takeruhukushima Nov 2, 2025
d95ed9f
feat: Improve questions and assertions in test_parse_office_doc
takeruhukushima Nov 3, 2025
057c7f8
feat: Enhance office document parsing tests and assertions
takeruhukushima Nov 3, 2025
f4975fc
Minor tweaks to test_parse_office_doc
jamesbraza Nov 3, 2025
7bfad14
Updating assertions in other tests for this PR's changes
jamesbraza Nov 3, 2025
1 change: 1 addition & 0 deletions .mailmap
Original file line number Diff line number Diff line change
@@ -12,3 +12,4 @@ Michael Skarlinski <[email protected]> mskarlin <12701035+mskarlin@use
Odhran O'Donoghue <[email protected]> odhran-o-d <[email protected]>
Odhran O'Donoghue <[email protected]> <[email protected]>
Samantha Cox <[email protected]> <[email protected]>
takeru fukushima <[email protected]><[email protected]>
6 changes: 3 additions & 3 deletions README.md
@@ -8,7 +8,7 @@
![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)
![PyPI Python Versions](https://img.shields.io/pypi/pyversions/paper-qa)

PaperQA2 is a package for doing high-accuracy retrieval augmented generation (RAG) on PDFs or text files,
PaperQA2 is a package for doing high-accuracy retrieval augmented generation (RAG) on PDFs, text files, Microsoft Office documents, and source code files,
with a focus on the scientific literature.
See our [recent 2024 paper](https://paper.wikicrow.ai)
to see examples of PaperQA2's superhuman performance in scientific tasks like
@@ -395,7 +395,7 @@ It just removes the automation associated with an agent picking the documents to
```python
from paperqa import Docs, Settings

# valid extensions include .pdf, .txt, .md, and .html
# valid extensions include .pdf, .txt, .md, .html, .docx, .xlsx, .pptx, and code files (e.g., .py, .ts, .yaml)
doc_paths = ("myfile.pdf", "myotherfile.pdf")

# Prepare the Docs object by adding a bunch of documents
@@ -438,7 +438,7 @@ from paperqa import Docs

async def main() -> None:
    docs = Docs()
    # valid extensions include .pdf, .txt, .md, and .html
    # valid extensions include .pdf, .txt, .md, .html, .docx, .xlsx, .pptx, and code files (e.g., .py, .ts, .yaml)
    for doc in ("myfile.pdf", "myotherfile.pdf"):
        await docs.aadd(doc)

5 changes: 4 additions & 1 deletion pyproject.toml
@@ -63,7 +63,7 @@ dev = [
"ipython>=8", # Pin to keep recent
"litellm>=1.71", # Lower pin for aiohttp transport adoption
"mypy>=1.8", # Pin for mutable-override
"paper-qa[docling,image,ldp,memory,pypdf-media,pymupdf,typing,zotero,local,qdrant]",
"paper-qa[docling,image,ldp,memory,pypdf-media,pymupdf,typing,zotero,local,qdrant,office]",
"prek",
"pydantic~=2.11", # Pin for start of model_fields deprecation
"pylint-pydantic",
@@ -92,6 +92,9 @@ memory = [
"paper-qa[ldp]",
"usearch>=2.16.4", # Pin for Python 3.13 support
]
office = [
"unstructured[docx,xlsx,pptx]",
]
openreview = [
"openreview-py",
]
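
Because `office` is an optional extra, code that depends on it usually guards the import and points the user at the extra when the import fails. The following is a minimal sketch of that pattern using only the standard library. The helper names `require_optional` and `installed_version` are hypothetical and do not appear in this PR.

```python
from __future__ import annotations

from importlib import import_module
from importlib.metadata import PackageNotFoundError, version
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency, or raise an ImportError naming the extra."""
    try:
        return import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"Could not import `{module_name}`. "
            f"Please install with `pip install paper-qa[{extra}]`."
        ) from exc


def installed_version(dist_name: str) -> str | None:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None
```

Deferring the import to call time means users who never touch Office files do not need the extra installed.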
66 changes: 63 additions & 3 deletions src/paperqa/readers.py
@@ -3,6 +3,7 @@
import asyncio
import os
from collections.abc import Awaitable, Callable
from importlib.metadata import version
from math import ceil
from pathlib import Path
from typing import Literal, Protocol, cast, overload, runtime_checkable
@@ -171,6 +172,61 @@ def parse_text(
)


def parse_office_doc(
Collaborator

Can you add a unit test for this in test_paperqa.py? Feel free to use an LLM provider other than OpenAI (e.g. Anthropic, OpenRouter) for your testing.

Contributor Author

[success] 90.64% tests/test_paperqa.py::test_parse_office_doc[dummy.docx]: 1.5548s
[success] 6.33% tests/test_paperqa.py::test_parse_office_doc[dummy.xlsx]: 0.1086s
[success] 3.03% tests/test_paperqa.py::test_parse_office_doc[dummy.pptx]: 0.0520s

Results (5.19s):
3 passed

I'm not fully confident in it, but the tests pass, so I'll commit it.

Contributor Author

Also, sorry that the dummy .docx and .xlsx files are written in Japanese.

    path: str | os.PathLike,
    **kwargs,
) -> ParsedText:
    """Parse office documents (.docx, .xlsx, .pptx) using unstructured, extracting text and images."""
    try:
        import unstructured
        from unstructured.documents.elements import Image, Table
        from unstructured.partition.auto import partition
    except ImportError as exc:
        raise ImportError(
            "Could not import `unstructured` dependencies. "
            "Please install with `pip install paper-qa[office]`."
        ) from exc
    UNSTRUCTURED_VERSION = version(unstructured.__name__)
    elements = partition(str(path), **kwargs)

    content_dict = {}
    media_list: list[ParsedMedia] = []
    current_text = ""
    media_index = 0

    for el in elements:
        if isinstance(el, Image):
            image_data = el.metadata.image_data
            # Create a ParsedMedia object
            parsed_media = ParsedMedia(
                index=media_index,
                data=image_data,
                info={"suffix": el.metadata.image_mime_type},
            )
            media_list.append(parsed_media)
            media_index += 1
        elif isinstance(el, Table):
            # For tables, we could get the HTML representation for better structure
            if el.metadata.text_as_html:
                current_text += el.metadata.text_as_html + "\n\n"
        else:
            current_text += str(el) + "\n\n"

    # For office docs, we can treat the whole document as a single "page"
    content_dict["1"] = (current_text, media_list)

    return ParsedText(
        content=content_dict,
        metadata=ParsedMetadata(
            parsing_libraries=[f"{unstructured.__name__} ({UNSTRUCTURED_VERSION})"],
            paperqa_version=pqa_version,
            total_parsed_text_length=len(current_text),
            count_parsed_media=len(media_list),
            name=f"office_doc|path={path}",
        ),
    )
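
The element loop above can be illustrated without installing `unstructured`. This is a rough sketch, not the PR's implementation: the `Text`, `Table`, and `Image` dataclasses are hypothetical stand-ins for the library's element types, and `flatten` is an invented name. It is also an assumption of this sketch that a table with no HTML form falls back to its plain text.

```python
from __future__ import annotations

from dataclasses import dataclass


# Hypothetical stand-ins for unstructured's element types, so the sketch
# runs without the `office` extra installed.
@dataclass
class Text:
    text: str


@dataclass
class Table:
    text: str
    text_as_html: str | None = None


@dataclass
class Image:
    data: bytes
    mime_type: str


def flatten(elements: list) -> tuple[str, list[dict]]:
    """Collapse elements into one text blob plus a list of media records."""
    parts: list[str] = []
    media: list[dict] = []
    for el in elements:
        if isinstance(el, Image):
            # Images are collected separately and indexed in encounter order.
            media.append(
                {"index": len(media), "data": el.data, "suffix": el.mime_type}
            )
        elif isinstance(el, Table):
            # Prefer the HTML form, which preserves row and column structure.
            parts.append(el.text_as_html or el.text)
        else:
            parts.append(el.text)
    # Each text part is followed by a blank line, mirroring the "\n\n" joins.
    return "".join(f"{part}\n\n" for part in parts), media
```

Storing the result under a single key such as `{"1": (text, media)}` then treats the whole document as one page.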


def chunk_text(
parsed_text: ParsedText,
doc: Doc,
@@ -276,7 +332,7 @@ def chunk_code_text(

IMAGE_EXTENSIONS = tuple({".png", ".jpg", ".jpeg"})
# When HTML reader supports images, add here
ENRICHMENT_EXTENSIONS = tuple({".pdf", *IMAGE_EXTENSIONS})
ENRICHMENT_EXTENSIONS = tuple({".pdf", ".docx", ".xlsx", ".pptx", *IMAGE_EXTENSIONS})


@overload
@@ -383,6 +439,9 @@ async def read_doc( # noqa: PLR0912
)
elif str_path.endswith(IMAGE_EXTENSIONS):
parsed_text = await parse_image(path, **parser_kwargs)
elif str_path.endswith((".docx", ".xlsx", ".pptx")):
# TODO: Make parse_office_doc async
parsed_text = await asyncio.to_thread(parse_office_doc, path, **parser_kwargs)
else:
parsed_text = await asyncio.to_thread(
parse_text, path, split_lines=True, **parser_kwargs
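
The dispatch in this hunk combines two standard-library behaviors: `str.endswith` accepts a tuple of suffixes, and `asyncio.to_thread` runs a blocking function in a worker thread. The sketch below shows both in isolation. `slow_parse` and `read` are hypothetical names, with `slow_parse` standing in for a synchronous parser.

```python
import asyncio
import time

OFFICE_EXTENSIONS = (".docx", ".xlsx", ".pptx")


def slow_parse(path: str) -> str:
    """Hypothetical blocking parser standing in for a synchronous library call."""
    time.sleep(0.01)
    return f"parsed:{path}"


async def read(path: str) -> str:
    # str.endswith accepts a tuple, so one check covers every office suffix.
    if path.endswith(OFFICE_EXTENSIONS):
        # Run the blocking parser in a worker thread so the event loop stays free.
        return await asyncio.to_thread(slow_parse, path)
    return f"text:{path}"


async def main() -> list[str]:
    # gather preserves argument order in its result list.
    return await asyncio.gather(read("a.docx"), read("b.pptx"), read("notes.txt"))


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Offloading to a thread keeps other coroutines responsive, but the parser itself still runs synchronously, which is presumably what the TODO about making `parse_office_doc` async refers to.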
@@ -412,15 +471,15 @@ async def read_doc( # noqa: PLR0912
f"|reduction=cl100k_base{enrichment_summary}"
),
)
elif str_path.endswith(".pdf"):
elif str_path.endswith((".pdf", ".docx", ".xlsx", ".pptx")):
chunked_text = chunk_pdf(
parsed_text, doc, chunk_chars=chunk_chars, overlap=overlap
)
chunk_metadata = ChunkMetadata(
size=chunk_chars,
overlap=overlap,
name=(
f"paper-qa={pqa_version}|algorithm=overlap-pdf"
f"paper-qa={pqa_version}|algorithm=overlap-document"
f"|size={chunk_chars}|overlap={overlap}{enrichment_summary}"
),
)
@@ -445,6 +504,7 @@ async def read_doc( # noqa: PLR0912
f"|size={chunk_chars}|overlap={overlap}{enrichment_summary}"
),
)

else:
chunked_text = chunk_code_text(
parsed_text, doc, chunk_chars=chunk_chars, overlap=overlap
2 changes: 1 addition & 1 deletion src/paperqa/settings.py
@@ -598,7 +598,7 @@ class IndexSettings(BaseModel):
default=lambda f: (
f.suffix
# TODO: add images after embeddings are supported
in {".txt", ".pdf", ".html", ".md"}
in {".txt", ".pdf", ".html", ".md", ".xlsx", ".docx", ".pptx"}
),
exclude=True,
description=(
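
The default filter in this hunk is a plain suffix-membership check on a `Path`. As a small sketch of how such a predicate behaves (the names `INDEXED_SUFFIXES` and `files_filter` are illustrative, not taken from the PR):

```python
from pathlib import Path

INDEXED_SUFFIXES = {".txt", ".pdf", ".html", ".md", ".xlsx", ".docx", ".pptx"}


def files_filter(f: Path) -> bool:
    """Return True when the file's final suffix is one of the indexed types."""
    # Path.suffix is only the last extension and preserves case.
    return f.suffix in INDEXED_SUFFIXES
```

Since `Path.suffix` preserves case, a file named `REPORT.DOCX` would not match a check like this unless the suffix were lowercased first.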
Binary file added tests/stub_data/dummy.docx
Binary file not shown.
Binary file added tests/stub_data/dummy.pptx
Binary file not shown.
Binary file added tests/stub_data/dummy.xlsx
Binary file not shown.
28 changes: 28 additions & 0 deletions tests/test_paperqa.py
@@ -3108,3 +3108,31 @@ async def test_reader_config_propagation(stub_data_dir: Path) -> None:
assert mock_read_doc.call_args.kwargs["chunk_chars"] == 2000
assert mock_read_doc.call_args.kwargs["overlap"] == 50
assert mock_read_doc.call_args.kwargs["dpi"] == 144


@pytest.mark.asyncio
@pytest.mark.parametrize("filename", ["dummy.docx", "dummy.pptx", "dummy.xlsx"])
async def test_parse_office_doc(stub_data_dir: Path, filename: str) -> None:
    file_path = stub_data_dir / filename
    if not file_path.exists():
        pytest.skip(f"{filename} not found in stub_data")

    docs = Docs()

    settings = Settings(
        llm="gemini/gemini-2.5-flash",
        embedding="gemini/text-embedding-004",
        summary_llm="gemini/gemini-2.5-flash",
        agent={"agent_llm": "gemini/gemini-2.5-flash"},
        parsing=ParsingSettings(use_doc_details=False, disable_doc_valid_check=True),
Collaborator

Suggested change
parsing=ParsingSettings(use_doc_details=False, disable_doc_valid_check=True),
parsing=ParsingSettings(use_doc_details=False),

These docs should be valid, so we don't need disable_doc_valid_check=True.

    )
    docname = await docs.aadd(
        file_path,
        "dummy citation",
        docname=filename,
        settings=settings,
    )
    assert docname is not None
    assert docs.texts
    session = await docs.aquery("What is the RAG system?", settings=settings)
    assert session.answer
Collaborator

Okay, I just ran these tests and noticed that the question "What is the RAG system?" only applies to dummy.docx.

Can you either:

  • Adjust the question to match each document
  • Change the .pptx and .xlsx to also have content for "What is the RAG system?"

Let's make the assertions:

session = await docs.aquery("What is the RAG system?", settings=settings)
assert session.used_contexts
assert len(session.answer) > 10, "Expected an answer"
assert CANNOT_ANSWER_PHRASE not in session.answer, (
    "Expected the system to be sure"
)
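
One way to picture the first option together with the stricter assertions is the sketch below. Everything in it is an assumption for illustration: the `CANNOT_ANSWER_PHRASE` value is a placeholder for the real constant in paperqa, the questions are invented and would have to match each stub file's actual content, and `StubSession` is a hypothetical stand-in that exposes only the two fields the assertions read.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Placeholder value for this sketch; the real constant is defined in paperqa.
CANNOT_ANSWER_PHRASE = "I cannot answer"

# Hypothetical per-file questions; real ones must match each stub file's content.
QUESTIONS = {
    "dummy.docx": "What is the RAG system?",
    "dummy.pptx": "What does the presentation cover?",
    "dummy.xlsx": "What data does the spreadsheet contain?",
}


@dataclass
class StubSession:
    """Hypothetical stand-in exposing only the fields the assertions read."""

    answer: str
    used_contexts: set[str] = field(default_factory=set)


def check_session(session: StubSession) -> None:
    """Apply the three assertions suggested in the review comment."""
    assert session.used_contexts
    assert len(session.answer) > 10, "Expected an answer"
    assert CANNOT_ANSWER_PHRASE not in session.answer, (
        "Expected the system to be sure"
    )
```

In a real test, the mapping could feed `pytest.mark.parametrize` with `(filename, question)` pairs so that each document is queried about its own content.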
