File: docs/user/extract-text.md (19 additions & 6 deletions)
@@ -27,14 +27,27 @@ Refer to [extract\_text](../modules/PageObject.html#pypdf._page.PageObject.extra
27
27
You can use visitor-functions to control which part of a page you want to process and extract. The visitor-functions you provide will get called for each operator or for each text fragment.
28
28
29
29
The function provided in argument visitor_text of function extract_text has five arguments:
30
-
text, current transformation matrix, text matrix, font-dictionary and font-size.
31
-
In most cases the x and y coordinates of the current position
32
-
are in index 4 and 5 of the current transformation matrix.
30
+
* text: the current text (as long as possible, can be up to a full line)
31
+
* user_matrix: the current transformation matrix (CTM), used to move from user coordinate space
32
+
* tm_matrix: current matrix from text coordinate space
33
+
* font-dictionary: full font dictionary
34
+
* font-size: the size (in text coordinate space)
35
+
36
+
Each matrix stores 6 parameters. The first four provide the rotation/scaling matrix and the last two provide the translation (horizontal/vertical).
37
+
It is recommended to use the user_matrix as it takes all transformations into account.
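As a minimal sketch of the matrix layout described above, the translation can be read from the last two entries of a 6-element matrix. The helper name `translation_of` and the sample values are hypothetical, chosen only for illustration; they are not part of pypdf.

```python
def translation_of(matrix):
    """Return the (x, y) translation stored in a 6-element PDF matrix.

    Hypothetical helper: the matrix is laid out as [a, b, c, d, e, f],
    where a..d hold the rotation/scaling part and e, f hold the
    horizontal and vertical translation.
    """
    a, b, c, d, e, f = matrix
    return e, f


# Made-up user_matrix: uniform scale by 2, translated to (72, 700).
user_matrix = [2.0, 0.0, 0.0, 2.0, 72.0, 700.0]
x, y = translation_of(user_matrix)
print(x, y)
```

Inside a visitor function this is equivalent to reading `cm[4]` and `cm[5]` directly.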
38
+
39
+
Notes:
40
+
41
+
- as indicated in the PDF 1.7 reference (page 204), the user matrix applies to text space, image space, form space and pattern space.
42
+
- if you want to get the full transformation from text to user space, you can use the `mult` function (available in global import) as follows:
43
+
`txt2user = mult(tm, cm)`
44
+
The font-size is the raw text size, which is affected by the `user_matrix`.
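The composition of the text matrix with the user matrix can be sketched without pypdf as follows. The function `concat_matrices` is a hypothetical stand-in written for this example; it assumes the row-vector convention of the PDF reference, where `[a, b, c, d, e, f]` represents the 3x3 matrix with rows `[a b 0]`, `[c d 0]`, `[e f 1]`. Whether pypdf's `mult` uses exactly this argument order should be checked against the library.

```python
def concat_matrices(m, n):
    """Multiply two 6-element PDF matrices (m applied first, then n).

    Hypothetical helper for illustration only. Each list [a, b, c, d, e, f]
    stands for the 3x3 matrix [[a, b, 0], [c, d, 0], [e, f, 1]].
    """
    return [
        m[0] * n[0] + m[1] * n[2],
        m[0] * n[1] + m[1] * n[3],
        m[2] * n[0] + m[3] * n[2],
        m[2] * n[1] + m[3] * n[3],
        m[4] * n[0] + m[5] * n[2] + n[4],
        m[4] * n[1] + m[5] * n[3] + n[5],
    ]


# Made-up values: text positioned at (10, 20) in text space,
# user matrix scaling by 2 and translating by (100, 200).
tm = [1.0, 0.0, 0.0, 1.0, 10.0, 20.0]
cm = [2.0, 0.0, 0.0, 2.0, 100.0, 200.0]
txt2user = concat_matrices(tm, cm)
print(txt2user[4], txt2user[5])  # prints 120.0 240.0
```

In this sketch the text position is first scaled by the user matrix and then shifted by its translation, which is why the combined matrix reports (120, 240) rather than (10, 20).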
45
+
33
46
34
47
The font-dictionary may be None in case of unknown fonts.
35
48
If not None it may e.g. contain key "/BaseFont" with value "/Arial,Bold".
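Because the font-dictionary may be None, a visitor function should guard against that before looking up keys. The sketch below uses a hypothetical helper, `base_font_name`, and plain Python dicts in place of real font dictionaries; it assumes the font dictionary behaves like a mapping.

```python
def base_font_name(font_dict):
    """Return the "/BaseFont" value, or None when it is unavailable.

    Hypothetical helper: handles both a missing font dictionary (None)
    and a dictionary that has no "/BaseFont" key.
    """
    if font_dict is None:
        return None
    return font_dict.get("/BaseFont")


# Plain dicts stand in for font dictionaries in this example.
print(base_font_name({"/BaseFont": "/Arial,Bold"}))  # prints /Arial,Bold
print(base_font_name(None))                          # prints None
```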
36
49
37
-
**Caveat**: In complicated documents the calculated positions might be wrong.
50
+
**Caveat**: In complicated documents the calculated positions may be difficult to determine (if you move from multiple forms to page user space, for example).
38
51
39
52
The function provided in argument visitor_operand_before has four arguments:
40
53
operator, operand-arguments, current transformation matrix and text matrix.
@@ -53,7 +66,7 @@ parts = []
53
66
54
67
55
68
def visitor_body(text, cm, tm, font_dict, font_size):
56
-
    y = tm[5]
69
+
    y = cm[5]
57
70
    if y > 50 and y < 720:
58
71
        parts.append(text)
59
72
@@ -88,7 +101,7 @@ def visitor_svg_rect(op, args, cm, tm):
88
101
89
102
90
103
def visitor_svg_text(text, cm, tm, fontDict, fontSize):