Skip to content

BUG: Fix test_watermarking_reportlab_rendering()#2203

Merged
MartinThoma merged 3 commits intomainfrom
fix-test_watermarking_reportlab_rendering
Sep 24, 2023
Merged

BUG: Fix test_watermarking_reportlab_rendering()#2203
MartinThoma merged 3 commits intomainfrom
fix-test_watermarking_reportlab_rendering

Conversation

@Lucas-C
Copy link
Member

@Lucas-C Lucas-C commented Sep 19, 2023

This fixes the issue spotted in #2191

The solution was to re-introduce calls to PageObject._push_pop_gs(),
in PageObject._merge_page & PageObject._merge_page_writer(),
but to optimize PageObject._push_pop_gs() by introducing a ContentsStream.isolate_graphics_state() method.

@Lucas-C
Copy link
Member Author

Lucas-C commented Sep 19, 2023

I wonder if we could use this opportunity to consider that use_original=True is always passed to PageObject._push_pop_gs(), and hence never re-recreate a ContentStream object, which could both simplify the code and improve the performances...

Edit: I did that in the 2nd commit of this PR, and it seems to work fine.

@Lucas-C Lucas-C force-pushed the fix-test_watermarking_reportlab_rendering branch from f8e7fad to c256b74 Compare September 19, 2023 19:52
@Lucas-C Lucas-C force-pushed the fix-test_watermarking_reportlab_rendering branch from c256b74 to a8ba231 Compare September 19, 2023 20:08
@pubpub-zz
Copy link
Collaborator

pubpub-zz commented Sep 19, 2023

I wonder if we could use this opportunity to consider that use_original=True is always passed to PageObject._push_pop_gs(), and hence never re-recreate a ContentStream object, which could both simplify the code and improve the performances...

What about to do the merge just work on the decode byte streams? to conversion to operations takes some times and we just need something like:
Newcontent.decodedstream.set_data( b"q\n" + page1.content.get_data() +b"Q\nq\n" + page2.rotation( if required) + page2.content.get_data() +b"Q\n")

no ?

@codecov
Copy link

codecov bot commented Sep 19, 2023

Codecov Report

Patch coverage: 92.30% and project coverage change: +0.01% 🎉

Comparison is base (34c6875) 94.37% compared to head (bef49af) 94.38%.
Report is 4 commits behind head on main.

❗ Current head bef49af differs from pull request most recent head 22fb6c5. Consider uploading reports for the commit 22fb6c5 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2203      +/-   ##
==========================================
+ Coverage   94.37%   94.38%   +0.01%     
==========================================
  Files          43       43              
  Lines        7588     7588              
  Branches     1496     1497       +1     
==========================================
+ Hits         7161     7162       +1     
+ Misses        263      262       -1     
  Partials      164      164              
Files Changed Coverage Δ
pypdf/generic/_data_structures.py 91.88% <83.33%> (-0.07%) ⬇️
pypdf/_page.py 94.35% <100.00%> (+0.18%) ⬆️
pypdf/_utils.py 98.58% <100.00%> (ø)

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Lucas-C
Copy link
Member Author

Lucas-C commented Sep 20, 2023

What about to do the merge just work on the decode byte streams? to conversion to operations takes some times and we just need something like:

Sorry, I don't quite understand your suggestion 😅
Where are you suggesting to perform that change exactly?
What is Newcontent.decodedstream in the line of code you mentioned?
Overall, isn't what you suggest already part of this PR?

@Lucas-C Lucas-C requested a review from MartinThoma September 20, 2023 05:35
@pubpub-zz
Copy link
Collaborator

Sorry, I don't quite understand your suggestion 😅

It is quite a draft. I will make a test and prepare a PR if performance is good 😉

@Lucas-C
Copy link
Member Author

Lucas-C commented Sep 20, 2023

It is quite a draft. I will make a test and prepare a PR if performance is good 😉

Alright! 🙂

In the meantime, would you or @MartinThoma be OK to merge this PR?

@pubpub-zz
Copy link
Collaborator

for me, this PR is completely valid and what I have in mind will not modify this PR.

@stefan6419846
Copy link
Collaborator

stefan6419846 commented Sep 21, 2023

In PyMuPDF, double-wrapping will be prevented: https://github.com/pymupdf/PyMuPDF/blob/1c2f1da3eb2541f1c8dbd50acb3a916939c99d3e/src/__init__.py#L9141-L9146 (the is_wrapped check uses additionally some caching mechanism to increase performance, see https://github.com/pymupdf/PyMuPDF/blob/1c2f1da3eb2541f1c8dbd50acb3a916939c99d3e/src/__init__.py#L8864-L8876 as well).

Would this be a possible enhancement here as well? Otherwise, running pypdf on the same PDF page multiple times might generate lots of "useless" wrappers.

@MartinThoma
Copy link
Member

Thank you for taking care of this @Lucas-C 🙏 I love that you're so actively contributing now 🎉

To me the PR looks good. @stefan6419846 left two interesting comments. I will take care of the docstring part, but the double-wrapping might be something interesting for another PR.

@MartinThoma MartinThoma changed the title Fix test_watermarking_reportlab_rendering() BUG: Fix test_watermarking_reportlab_rendering() Sep 24, 2023
@MartinThoma MartinThoma merged commit 91b6dcd into main Sep 24, 2023
@MartinThoma MartinThoma deleted the fix-test_watermarking_reportlab_rendering branch September 24, 2023 09:39
@Lucas-C
Copy link
Member Author

Lucas-C commented Sep 24, 2023

To me the PR looks good. @stefan6419846 left two interesting comments. I will take care of the docstring part, but the double-wrapping might be something interesting for another PR.

I agree! 🙂

If I understand correctly, your suggestion @stefan6419846 would be to add a .is_wrapped attribute to ContentStream, in order to avoid re-wrapping in q ... Q it if it's already done?
I wonder if that can really happen often in practice: have you witnessed it?
Could you maybe provide a minimal test case reproducing this?

@stefan6419846
Copy link
Collaborator

If I understand correctly, your suggestion @stefan6419846 would be to add a .is_wrapped attribute to ContentStream, in order to avoid re-wrapping in q ... Q it if it's already done?

Yes, something like this.

I wonder if that can really happen often in practice: have you witnessed it?
Could you maybe provide a minimal test case reproducing this?

It generally is good practice to already have an isolated graphics state for each page of the PDF file, so a "clean" PDF file would already have wrapped pages. pypdf will currently add another wrapping layer each time.

An easy example to reproduce this is by adding two different watermarks/overlays/backgrounds for example:

from pypdf import PdfReader, PdfWriter


watermark1 = PdfReader('watermark1.pdf').pages[0]
watermark2 = PdfReader('watermark2.pdf').pages[0]

writer = PdfWriter(clone_from='file.pdf')
for page in writer.pages:
    page.merge_page(watermark1)
    page.merge_page(watermark2)

Suppose we start with self._data being q\n 841.680[...]0 Do\nQ\n\n \n. Running the above example leads to self._data being q\nq\nq\n 841[...]ET\nQ\nQ\n\nQ\n, thus we have isolated the graphics state three times, although once should be enough.

@pubpub-zz
Copy link
Collaborator

@MartinThoma I've found a "regression" in the test the skipping decorator if ghostscript is not present is missing
fixed in reference PR

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this pull request Sep 24, 2023
MartinThoma pushed a commit that referenced this pull request Sep 24, 2023
MartinThoma added a commit that referenced this pull request Sep 24, 2023
## What's new

### Bug Fixes (BUG)
-  PDF size increases because of too high float writing precision (#2213) by @pubpub-zz
-  Fix test_watermarking_reportlab_rendering() (#2203) by @LucasCimon

### Documentation (DOC)
-  Fix typos and add a paragraph to ViewerPreferences docs (#2199) by @marcstober
-  How to install pypi from any branch (#2209) by @pubpub-zz
-  Update copyright footer in docs (#2207) by @marcstober

### Developer Experience (DEV)
-  Let dependabot update Github Actions by @MartinThoma

### Maintenance (MAINT)
-  Update .pre-commit-config.yaml by @MartinThoma

[Full Changelog](3.16.1...3.16.2)
@MartinThoma MartinThoma mentioned this pull request Sep 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants