Fix: the pict type picture was not processed in the docx #19305

helojo · 2025-05-07T02:08:30Z

Summary

Fix: the pict type picture was not processed in the docx

api/core/rag/extractor/word_extractor.py

crazywoola

Please do not modify the PR template, and link an existing issue in the description.

helojo · 2025-05-07T02:29:13Z

Please do not modify the PR template, and link an existing issue in the description.

The comments have been translated into English. There is no existing issue that covers this. The change is relatively simple: it adds support for pict-type images when processing docx files.

crazywoola · 2025-05-07T03:10:20Z

Please fix the lint errors.

Copilot

Pull Request Overview

This PR enhances parse_paragraph to correctly extract VML-based pict images and prevents duplicate processing when both drawing and pict tags are present.

Introduce a has_drawing flag to skip pict extraction if a drawing has already been processed
Add handling for <w:pict> shapes by looking up binData and VML imagedata relationships
Ensure pict images are appended only when they exist in image_map
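The three points above can be illustrated with a minimal sketch. It is not the code from this PR: the helper name `images_for_run`, the use of `xml.etree.ElementTree` on a raw run string, and the assumption that `image_map` is keyed by relationship id are all hypothetical, and the binData lookup is omitted so that only the `v:imagedata` relationship path is shown.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by WordprocessingML, DrawingML, relationships, and VML.
NS = {
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
    "v": "urn:schemas-microsoft-com:vml",
}
R_EMBED = f"{{{NS['r']}}}embed"
R_ID = f"{{{NS['r']}}}id"
XMLNS = " ".join(f'xmlns:{prefix}="{uri}"' for prefix, uri in NS.items())


def images_for_run(run_xml, image_map):
    """Hypothetical helper: return image_map entries referenced by one <w:r>.

    Assumes image_map maps relationship ids (e.g. "rId5") to whatever the
    extractor emits for an image.
    """
    run = ET.fromstring(run_xml)
    found = []
    has_drawing = False

    # DrawingML path: <w:drawing> ... <a:blip r:embed="rIdN"/>
    for drawing in run.findall(".//w:drawing", NS):
        has_drawing = True
        for blip in drawing.findall(".//a:blip", NS):
            rel_id = blip.get(R_EMBED)
            if rel_id in image_map:
                found.append(image_map[rel_id])

    # VML path: only consulted when no drawing was seen, so a run carrying
    # both representations of one picture is not counted twice.
    if not has_drawing:
        for pict in run.findall(".//w:pict", NS):
            for imagedata in pict.findall(".//v:imagedata", NS):
                rel_id = imagedata.get(R_ID)
                if rel_id in image_map:  # append only known images
                    found.append(image_map[rel_id])
    return found


# Simplified sample runs (real documents nest these elements more deeply).
PICT_RUN = (
    f"<w:r {XMLNS}><w:pict><v:shape>"
    '<v:imagedata r:id="rId5"/></v:shape></w:pict></w:r>'
)
BOTH_RUN = (
    f'<w:r {XMLNS}><w:drawing><a:blip r:embed="rId4"/></w:drawing>'
    '<w:pict><v:shape><v:imagedata r:id="rId5"/></v:shape></w:pict></w:r>'
)
IMAGE_MAP = {"rId4": "drawing.png", "rId5": "pict.png"}

print(images_for_run(PICT_RUN, IMAGE_MAP))
print(images_for_run(BOTH_RUN, IMAGE_MAP))
```

In this sketch the flag is set as soon as a `w:drawing` element is seen, whether or not its image is in `image_map`; the merged code may make that decision differently.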

Comments suppressed due to low confidence (2)

api/core/rag/extractor/word_extractor.py:261

No tests appear to cover VML-based pict extraction. Add a unit test with a .docx containing a <w:pict> shape to ensure this code path is exercised.

                    shape_elements = run.element.findall(

api/core/rag/extractor/word_extractor.py:266

Verify that the {http://schemas.openxmlformats.org/wordprocessingml/2006/main}binData namespace is correct for VML pict elements—in some docs binData may reside in a different namespace or part.

                        shape_image = shape.find(
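One way to act on the namespace question above is to list which namespaces actually occur under `<w:pict>` in a given document. The following is a rough diagnostic sketch, not part of the PR; the function name `namespaces_under_pict` is hypothetical, and it relies only on the standard library.

```python
import xml.etree.ElementTree as ET
from collections import Counter

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def namespaces_under_pict(document_xml):
    """Count (namespace, local name) pairs for elements nested in any <w:pict>."""
    root = ET.fromstring(document_xml)
    counts = Counter()
    for pict in root.iter(f"{{{W_NS}}}pict"):
        for element in pict.iter():
            if element is pict:
                continue  # skip the <w:pict> wrapper itself
            tag = element.tag
            if tag.startswith("{"):
                namespace, local = tag[1:].split("}", 1)
            else:
                namespace, local = "", tag
            counts[(namespace, local)] += 1
    return counts


# For a real file, the XML could be read with:
#   zipfile.ZipFile(path).read("word/document.xml")
SAMPLE = (
    f'<w:document xmlns:w="{W_NS}" xmlns:v="urn:schemas-microsoft-com:vml" '
    'xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">'
    "<w:body><w:p><w:r><w:pict><v:shape>"
    '<v:imagedata r:id="rId5"/></v:shape></w:pict></w:r></w:p></w:body>'
    "</w:document>"
)

for (namespace, local), count in sorted(namespaces_under_pict(SAMPLE).items()):
    print(f"{local}: {namespace} (x{count})")
```

Running this over a few sample documents would show whether `binData` ever appears under `<w:pict>`, and in which namespace, before the lookup is hard-coded.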

api/core/rag/extractor/word_extractor.py

…19305) Co-authored-by: zqgame <[email protected]>

Fix: the pict type picture was not processed in the docx

65d09f9

dosubot bot added the size:M (This PR changes 30-99 lines, ignoring generated files.) and 🐞 bug (Something isn't working) labels May 7, 2025

crazywoola reviewed May 7, 2025

api/core/rag/extractor/word_extractor.py (outdated, resolved)

crazywoola requested changes May 7, 2025

Fix: the pict type picture was not processed in the docx

17caa92

python style

7de6cae

helojo requested a review from crazywoola May 7, 2025 04:11

crazywoola requested review from JohnJyong and laipz8200 May 7, 2025 13:29

crazywoola requested a review from Copilot July 3, 2025 03:08

Copilot AI reviewed Jul 3, 2025

api/core/rag/extractor/word_extractor.py (resolved)

api/core/rag/extractor/word_extractor.py (resolved)

crazywoola approved these changes Jul 17, 2025

crazywoola merged commit e7d80bf into langgenius:main Jul 17, 2025
6 checks passed

dosubot bot added the lgtm (This PR has been approved by a maintainer) label Jul 17, 2025

Nov1c444 mentioned this pull request Jul 23, 2025

chore(version): bump to 1.7.0 #22830

Merged

tutkun pushed a commit to tutkun/dify that referenced this pull request Aug 15, 2025

Fix: the pict type picture was not processed in the docx (langgenius#…

2ea65ea

…19305) Co-authored-by: zqgame <[email protected]>
