@helojo
Contributor

@helojo helojo commented May 7, 2025

Summary

Fix: pict-type images in docx files were not processed.

@dosubot dosubot bot added the size:M (This PR changes 30-99 lines, ignoring generated files.) and 🐞 bug (Something isn't working) labels May 7, 2025
Member

@crazywoola crazywoola left a comment

Please do not modify the PR template, and link an existing issue in the description.

@helojo
Contributor Author

helojo commented May 7, 2025

Please do not modify the PR template, and link an existing issue in the description.

The comments have been changed to English, and there is no existing issue that covers this change. The modification is relatively simple: it adds support for pict-type images when processing docx files.

@crazywoola
Member

Please fix the lint errors.

@helojo helojo requested a review from crazywoola May 7, 2025 04:11
@crazywoola crazywoola requested review from JohnJyong and laipz8200 May 7, 2025 13:29
@crazywoola crazywoola requested a review from Copilot July 3, 2025 03:08
Contributor

Copilot AI left a comment

Pull Request Overview

This PR enhances parse_paragraph to correctly extract VML-based pict images and prevents duplicate processing when both drawing and pict tags are present.

  • Introduce a has_drawing flag to skip pict extraction if a drawing has already been processed
  • Add handling for <w:pict> shapes by looking up binData and VML imagedata relationships
  • Ensure pict images are appended only when they exist in image_map
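
The three points above can be illustrated with a minimal sketch. This is not the code from the PR: it uses only the standard-library `xml.etree.ElementTree` rather than the extractor's own objects, and the function name `images_for_run`, the sample XML, and the shape of `image_map` (relationship id mapped to an image placeholder string) are assumptions made for illustration. It also covers only the VML `imagedata` relationship lookup, not the `binData` path.

```python
import xml.etree.ElementTree as ET

# Standard OOXML / VML namespace URIs.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
V = "urn:schemas-microsoft-com:vml"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
A = "http://schemas.openxmlformats.org/drawingml/2006/main"


def images_for_run(run, image_map):
    """Hypothetical helper: collect image placeholders for one <w:r> element.

    image_map is assumed to map relationship ids (e.g. "rId5") to the
    text that should stand in for the image.
    """
    found = []
    has_drawing = False

    # DrawingML path: <a:blip r:embed="..."> nested inside <w:drawing>.
    for blip in run.iter(f"{{{A}}}blip"):
        rel_id = blip.get(f"{{{R}}}embed")
        if rel_id in image_map:
            has_drawing = True
            found.append(image_map[rel_id])

    # VML path: only consulted when no drawing was processed, so a run
    # that carries both representations does not yield the image twice.
    if not has_drawing:
        for pict in run.iter(f"{{{W}}}pict"):
            for imagedata in pict.iter(f"{{{V}}}imagedata"):
                rel_id = imagedata.get(f"{{{R}}}id")
                # Append only when the relationship is known to image_map.
                if rel_id in image_map:
                    found.append(image_map[rel_id])
    return found


# Illustrative run containing only a VML pict shape.
PICT_ONLY_XML = f"""
<w:r xmlns:w="{W}" xmlns:v="{V}" xmlns:r="{R}">
  <w:pict>
    <v:shape>
      <v:imagedata r:id="rId5"/>
    </v:shape>
  </w:pict>
</w:r>
"""

run = ET.fromstring(PICT_ONLY_XML)
print(images_for_run(run, {"rId5": "![image](pict.png)"}))
```

Keeping the `has_drawing` check separate from the pict loop makes the fallback order explicit: the drawing representation is preferred, and the pict shape is used only when nothing was resolved from a drawing.
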
Comments suppressed due to low confidence (2)

api/core/rag/extractor/word_extractor.py:261

  • No tests appear to cover VML-based pict extraction. Add a unit test with a .docx containing a <w:pict> shape to ensure this code path is exercised.
                    shape_elements = run.element.findall(

api/core/rag/extractor/word_extractor.py:266

  • Verify that the {http://schemas.openxmlformats.org/wordprocessingml/2006/main}binData namespace is correct for VML pict elements; in some documents binData may reside in a different namespace or part.
                        shape_image = shape.find(

@crazywoola crazywoola merged commit e7d80bf into langgenius:main Jul 17, 2025
6 checks passed
@dosubot dosubot bot added the lgtm (This PR has been approved by a maintainer) label Jul 17, 2025
tutkun pushed a commit to tutkun/dify that referenced this pull request Aug 15, 2025