Merged
Changes from 1 commit
23 commits
18c58db
feat: Add multimodal support for Office documents
takeruhukushima Nov 1, 2025
95e6e25
solve dependencies problem
takeruhukushima Nov 1, 2025
2f32c8a
fix bug
takeruhukushima Nov 1, 2025
f7cc49d
[pre-commit.ci lite] apply automatic fixes
pre-commit-ci-lite[bot] Nov 1, 2025
e42e636
Merge remote-tracking branch 'origin/feature/new-doc-types'
takeruhukushima Nov 1, 2025
c801103
fix:pre-commit fail src/paperqa/readers.py
takeruhukushima Nov 2, 2025
5e49ca1
add .docx,.pptx,.xlsx in settings.py
takeruhukushima Nov 2, 2025
65c5097
refactor(chunks):consolidating the chunk code for office and pdf
takeruhukushima Nov 2, 2025
5ead779
edit README.md:add .docx, .xlsx, .pptx, and code files (e.g., .py, .t…
takeruhukushima Nov 2, 2025
a7c5e3a
refactor: Unify chunking algorithm name for PDF and office documents
takeruhukushima Nov 2, 2025
5fe86c1
feat: Add unstructured version to office document parsing metadata
takeruhukushima Nov 2, 2025
d4619bd
feat: Implement lazy import for unstructured in office document parsing
takeruhukushima Nov 2, 2025
0be33fe
feat: Add unit test for office document parsing
takeruhukushima Nov 2, 2025
0775523
[pre-commit.ci lite] apply automatic fixes
pre-commit-ci-lite[bot] Nov 2, 2025
84c6bf2
feat: Update test_parse_office_doc for Gemini models and RAG query
takeruhukushima Nov 2, 2025
e218d91
add mailmap takerufukushima
takeruhukushima Nov 2, 2025
4933f16
fix pre-commit error
takeruhukushima Nov 2, 2025
77b95cc
Merge branch 'feature/new-doc-types' of https://github.com/takeruhuku…
takeruhukushima Nov 2, 2025
280c381
Fix: Address linting issues in test_paperqa.py
takeruhukushima Nov 2, 2025
d95ed9f
feat: Improve questions and assertions in test_parse_office_doc
takeruhukushima Nov 3, 2025
057c7f8
feat: Enhance office document parsing tests and assertions
takeruhukushima Nov 3, 2025
f4975fc
Minor tweaks to test_parse_office_doc
jamesbraza Nov 3, 2025
7bfad14
Updating assertions in other tests for this PR's changes
jamesbraza Nov 3, 2025
feat: Add unit test for office document parsing
    - Add a unit test for  to verify parsing of .docx, .pptx, and .xlsx files.
    - Add dummy office files to  for testing purposes.
    - Update test configuration to use OpenRouter and a non-OpenAI embedding model to avoid authentication
     issues during testing.
takeruhukushima committed Nov 2, 2025
commit 0be33feb09eb4d73f533452ee7fcb6d8b464bbb0
Binary file added tests/stub_data/dummy.docx
Binary file not shown.
Binary file added tests/stub_data/dummy.pptx
Binary file not shown.
Binary file added tests/stub_data/dummy.xlsx
Binary file not shown.
27 changes: 27 additions & 0 deletions tests/test_paperqa.py
@@ -3108,3 +3108,30 @@ async def test_reader_config_propagation(stub_data_dir: Path) -> None:
assert mock_read_doc.call_args.kwargs["chunk_chars"] == 2000
assert mock_read_doc.call_args.kwargs["overlap"] == 50
assert mock_read_doc.call_args.kwargs["dpi"] == 144


@pytest.mark.asyncio
@pytest.mark.parametrize("filename", ["dummy.docx", "dummy.pptx", "dummy.xlsx"])
async def test_parse_office_doc(stub_data_dir: Path, filename: str) -> None:
    # This test requires the user to create dummy office files in tests/stub_data
    # For example:
    # touch tests/stub_data/dummy.docx
    # touch tests/stub_data/dummy.pptx
    # touch tests/stub_data/dummy.xlsx
    file_path = stub_data_dir / filename
    if not file_path.exists():
        pytest.skip(f"{filename} not found in stub_data")

    docs = Docs()
    settings = Settings(
        llm="openrouter/google/gemma-7b-it",
        llm_config={"api_key": os.environ.get("OPEN_ROUTER_API_KEY")},
Review comment (Collaborator), suggested change (remove this line):
-        llm_config={"api_key": os.environ.get("OPEN_ROUTER_API_KEY")},

I think you shouldn't need this; litellm should automatically check OPENROUTER_API_KEY: https://docs.litellm.ai/docs/providers/openrouter

So maybe update your local env to set OPENROUTER_API_KEY instead.
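
The rename suggested above can be sketched as a small local shim. This is only an illustration, not code from this PR: `normalize_openrouter_env` is a hypothetical helper, and the sketch assumes the only goal is to expose a key stored under the older `OPEN_ROUTER_API_KEY` spelling as `OPENROUTER_API_KEY`.

```python
import os
from collections.abc import MutableMapping
from typing import Optional


def normalize_openrouter_env(env: Optional[MutableMapping] = None) -> bool:
    """Hypothetical helper: make sure OPENROUTER_API_KEY is populated.

    Returns True if the variable is set afterwards, False otherwise.
    """
    env = os.environ if env is None else env
    # Prefer the variable name mentioned in the review comment.
    if env.get("OPENROUTER_API_KEY"):
        return True
    # Fall back to the spelling used in the diff and copy it over.
    legacy = env.get("OPEN_ROUTER_API_KEY")
    if legacy:
        env["OPENROUTER_API_KEY"] = legacy
        return True
    return False
```

Calling `normalize_openrouter_env()` once before the tests run (for example from a local `conftest.py`) would let the explicit `llm_config` entry be dropped without touching the shell profile; renaming the variable in the local environment achieves the same thing with no code at all.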

        parsing=ParsingSettings(
            use_doc_details=False, disable_doc_valid_check=True, defer_embedding=True
Review comment (Collaborator), suggested change:
-            use_doc_details=False, disable_doc_valid_check=True, defer_embedding=True
+            use_doc_details=False, disable_doc_valid_check=True

I think you don't need to defer embeddings; we can embed right away.

        ),
    )
    docname = await docs.aadd(
        file_path, "dummy citation", docname=filename, dockey="dummy_doc", settings=settings
    )
    assert docname is not None
    assert len(docs.texts) > 0
Review comment (Collaborator):

Can you look at the lint CI to fix this part?
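
A side note on the `touch` hint in the test's comment: .docx, .pptx, and .xlsx files are ZIP packages, so an empty file created with `touch` would exist on disk but would not be a parseable Office document. A minimal sketch of a stricter check follows, using only the standard library. `is_ooxml_package` is a hypothetical helper, not part of this PR, and it only confirms the container shape rather than the document contents.

```python
import zipfile
from pathlib import Path


def is_ooxml_package(path: Path) -> bool:
    """Hypothetical helper: cheap sanity check for an Office Open XML file."""
    # An empty or missing file is not a ZIP archive at all.
    if not path.is_file() or not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        # Every Open Packaging Conventions archive carries this manifest.
        return "[Content_Types].xml" in archive.namelist()
```

In the test, the skip condition could then read `if not is_ooxml_package(file_path): pytest.skip(...)`, so a placeholder file leads to a skip rather than a parser error. Since this commit adds real binary stubs, this matters mainly if someone follows the `touch` instructions literally.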
