Commit 0ab1a64

dmsnell and adamziel authored
Expand class-level documentation for WP_HTML_Tag_Processor (#44478)
Also fixes `get_tag()` to return the upper-case variant of tag names, in accordance with the inline documentation and with other DOM interfaces.

Co-authored-by: Adam Zielinski <adam@adamziel.com>
1 parent 66e39ae commit 0ab1a64

2 files changed: +175 -26 lines changed

lib/experimental/html/class-wp-html-tag-processor.php

Lines changed: 164 additions & 15 deletions
Original file line number | Diff line number | Diff line change
@@ -17,6 +17,9 @@
1717
* E.g. match having class `1<"2` needs to recognize `class="1&lt;&quot;2"`.
1818
* @TODO: Decode character references in `get_attribute()`
1919
* @TODO: Properly escape attribute value in `set_attribute()`
20+
* @TODO: Add slow mode to escape character entities in CSS class names?
21+
* (This requires a custom decoder since `html_entity_decode()`
22+
* doesn't handle attribute character reference decoding rules.)
2023
*
2124
* @package WordPress
2225
* @subpackage HTML
@@ -28,6 +31,152 @@
2831
* of patches to that input. Tokenizes HTML but does not fully
2932
* parse the input document.
3033
*
34+
* ## Usage
35+
*
36+
* Use of this class requires three steps:
37+
*
38+
* 1. Create a new class instance with your input HTML document.
39+
* 2. Find the tag(s) you are looking for.
40+
* 3. Request changes to the attributes in those tag(s).
41+
*
42+
* Example:
43+
* ```php
44+
* $tags = new WP_HTML_Tag_Processor( $html );
45+
* if ( $tags->next_tag( [ 'tag_name' => 'option' ] ) ) {
46+
* $tags->set_attribute( 'selected', true );
47+
* }
48+
* ```
49+
*
50+
* ### Finding tags
51+
*
52+
* The `next_tag()` function moves the internal cursor through
53+
* your input HTML document until it finds a tag meeting all of
54+
* the supplied restrictions in the optional query argument. If
55+
* no argument is provided then it will find the next HTML tag,
56+
* regardless of what kind it is.
57+
*
58+
* If you want to _find whatever the next tag is_:
59+
* ```php
60+
* $tags->next_tag();
61+
* ```
62+
*
63+
* | Goal | Query |
64+
* |-----------------------------------------------------------|----------------------------------------------------------------------------|
65+
* | Find any tag. | `$tags->next_tag();` |
66+
* | Find next image tag. | `$tags->next_tag( [ 'tag_name' => 'img' ] );` |
67+
* | Find next tag containing the `fullwidth` CSS class. | `$tags->next_tag( [ 'class_name' => 'fullwidth' ] );` |
68+
* | Find next image tag containing the `fullwidth` CSS class. | `$tags->next_tag( [ 'tag_name' => 'img', 'class_name' => 'fullwidth' ] );` |
69+
*
70+
* If a tag was found meeting your criteria then `next_tag()`
71+
* will return `true` and you can proceed to modify it. If it
72+
* returns `false`, however, it failed to find the tag and
73+
* moved the cursor to the end of the file.
74+
*
75+
* Once the cursor reaches the end of the file the processor
76+
* is done and if you want to reach an earlier tag you will
77+
* need to recreate the processor and start over. The internal
78+
* cursor can only proceed forward, never backing up.
79+
*
80+
* #### Custom queries
81+
*
82+
* Sometimes it's necessary to inspect an HTML tag more closely than
83+
* the query syntax here permits. In these cases one may further
84+
* inspect the search results using the read-only functions
85+
* provided by the processor or external state or variables.
86+
*
87+
* Example:
88+
* ```php
89+
* // Paint up to the first five DIV or SPAN tags marked with the "jazzy" style.
90+
* $remaining_count = 5;
91+
* while ( $remaining_count > 0 && $tags->next_tag() ) {
92+
* if (
93+
* ( 'DIV' === $tags->get_tag() || 'SPAN' === $tags->get_tag() ) &&
94+
* 'jazzy' === $tags->get_attribute( 'data-style' )
95+
* ) {
96+
* $tags->add_class( 'theme-style-everest-jazz' );
97+
* $remaining_count--;
98+
* }
99+
* }
100+
* ```
101+
*
102+
* `get_attribute()` will return `null` if the attribute wasn't present
103+
* on the tag when it was called. It may return `""` (the empty string)
104+
* in cases where the attribute was present but its value was empty.
105+
* For boolean attributes, those whose name is present but no value is
106+
* given, it will return `true` (the only way to set `false` for an
107+
* attribute is to remove it).
108+
*
109+
* ### Modifying HTML attributes for a found tag
110+
*
111+
* Once you've found the start of an opening tag you can modify
112+
* any number of the attributes on that tag. You can set a new
113+
* value for an attribute, remove the entire attribute, or do
114+
* nothing and move on to the next opening tag.
115+
*
116+
* Example:
117+
* ```php
118+
* if ( $tags->next_tag( [ 'class_name' => 'wp-group-block' ] ) ) {
119+
* $tags->set_attribute( 'title', 'This groups the contained content.' );
120+
* $tags->remove_attribute( 'data-test-id' );
121+
* }
122+
* ```
123+
*
124+
* If `set_attribute()` is called for an existing attribute it will
125+
* overwrite the existing value. Similarly, calling `remove_attribute()`
126+
* for a non-existing attribute has no effect on the document. Both
127+
* of these methods are safe to call without knowing if a given attribute
128+
* exists beforehand.
129+
*
130+
* ### Modifying CSS classes for a found tag
131+
*
132+
* The tag processor treats the `class` attribute as a special case.
133+
* Because it's a common operation to add or remove CSS classes you
134+
* can do so using this interface.
135+
*
136+
* As with attribute values, adding or removing CSS classes is a safe
137+
* operation that doesn't require checking if the attribute or class
138+
* exists before making changes. If removing the only class then the
139+
* entire `class` attribute will be removed.
140+
*
141+
* Example:
142+
* ```php
143+
* // from `<span>Yippee!</span>`
144+
* // to `<span class="is-active">Yippee!</span>`
145+
* $tags->add_class( 'is-active' );
146+
*
147+
* // from `<span class="excited">Yippee!</span>`
148+
* // to `<span class="excited is-active">Yippee!</span>`
149+
* $tags->add_class( 'is-active' );
150+
*
151+
* // from `<span class="is-active heavy-accent">Yippee!</span>`
152+
* // to `<span class="is-active heavy-accent">Yippee!</span>`
153+
* $tags->add_class( 'is-active' );
154+
*
155+
* // from `<input type="text" class="is-active rugby not-disabled" length="24">`
156+
* // to `<input type="text" class="is-active not-disabled" length="24">`
157+
* $tags->remove_class( 'rugby' );
158+
*
159+
* // from `<input type="text" class="rugby" length="24">`
160+
* // to `<input type="text" length="24">`
161+
* $tags->remove_class( 'rugby' );
162+
*
163+
* // from `<input type="text" length="24">`
164+
* // to `<input type="text" length="24">`
165+
* $tags->remove_class( 'rugby' );
166+
* ```
167+
*
168+
* ## Design limitations
169+
*
170+
* @TODO: Expand this section
171+
*
172+
* - no nesting: cannot match open and close tag
173+
* - only move forward, never backward
174+
* - class names not decoded if they contain character references
175+
* - only secures against HTML escaping issues; requires
176+
* manually sanitizing or escaping values based on the needs of
177+
* each individual attribute, since different attributes have
178+
* different needs.
179+
*
31180
* @since 6.2.0
32181
*/
33182
class WP_HTML_Tag_Processor {
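
The `add_class()` / `remove_class()` examples in the docblock above can be modeled on the raw `class` attribute value alone. The following is only a minimal sketch of those documented semantics, not the processor's implementation: `sketch_add_class()` and `sketch_remove_class()` are hypothetical names, `null` stands for "attribute absent", and the sketch assumes whitespace can be normalized and class names compared case-sensitively, which the real class may handle differently.

```php
<?php
/**
 * Hypothetical model: adds a class to a raw `class` attribute value.
 * Passing null means the tag has no `class` attribute yet.
 */
function sketch_add_class( ?string $class_attr, string $name ): string {
	$classes = null === $class_attr
		? array()
		: preg_split( '/\s+/', $class_attr, -1, PREG_SPLIT_NO_EMPTY );

	// Adding a class that is already present changes nothing.
	if ( ! in_array( $name, $classes, true ) ) {
		$classes[] = $name;
	}

	return implode( ' ', $classes );
}

/**
 * Hypothetical model: removes a class from a raw `class` attribute value.
 * Returns null when the whole attribute should be removed.
 */
function sketch_remove_class( ?string $class_attr, string $name ): ?string {
	// Removing from a missing attribute is a no-op.
	if ( null === $class_attr ) {
		return null;
	}

	$classes = preg_split( '/\s+/', $class_attr, -1, PREG_SPLIT_NO_EMPTY );
	$kept    = array();
	foreach ( $classes as $class ) {
		if ( $class !== $name ) {
			$kept[] = $class;
		}
	}

	// Removing the only class removes the entire attribute.
	return count( $kept ) > 0 ? implode( ' ', $kept ) : null;
}
```

Each branch corresponds to one of the six from/to pairs in the docblock: adding is idempotent, and removing the last class drops the attribute instead of leaving `class=""` behind.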
@@ -136,16 +285,16 @@ class WP_HTML_Tag_Processor {
136285
* // and stops after recognizing the `id` attribute
137286
* // <div id="test-4" class=outline title="data:text/plain;base64=asdk3nk1j3fo8">
138287
* // ^ parsing will continue from this point
139-
* $this->attributes = array(
288+
* $this->attributes = [
140289
* 'id' => new WP_HTML_Attribute_Match( 'id', null, 6, 17 )
141-
* );
290+
* ];
142291
*
143292
* // when picking up parsing again, or when asking to find the
144293
* // `class` attribute we will continue and add to this array
145-
* $this->attributes = array(
146-
* 'id' => new WP_HTML_Attribute_Match( 'id', null, 6, 17 ),
294+
* $this->attributes = [
295+
* 'id' => new WP_HTML_Attribute_Match( 'id', null, 6, 17 ),
147296
* 'class' => new WP_HTML_Attribute_Match( 'class', 'outline', 18, 32 )
148-
* );
297+
* ];
149298
*
150299
* // Note that only the `class` attribute value is stored in the index.
151300
* // That's because it is the only value used by this class at the moment.
@@ -170,11 +319,11 @@ class WP_HTML_Tag_Processor {
170319
* Example:
171320
* <code>
172321
* // Add the `WP-block-group` class, remove the `WP-group` class.
173-
* $class_changes = array(
322+
* $class_changes = [
174323
* // Indexed by a comparable class name
175324
* 'wp-block-group' => new WP_Class_Name_Operation( 'WP-block-group', WP_Class_Name_Operation::ADD ),
176325
* 'wp-group' => new WP_Class_Name_Operation( 'WP-group', WP_Class_Name_Operation::REMOVE )
177-
* );
326+
* ];
178327
* </code>
179328
*
180329
* @since 6.2.0
@@ -206,9 +355,9 @@ class WP_HTML_Tag_Processor {
206355
*
207356
* // Correspondingly, something like this
208357
* // will appear in the replacements array.
209-
* $replacements = array(
358+
* $replacements = [
210359
* WP_HTML_Text_Replacement( 14, 28, 'https://my-site.my-domain/wp-content/uploads/2014/08/kittens.jpg' )
211-
* );
360+
* ];
212361
* </code>
213362
*
214363
* @since 6.2.0
@@ -270,9 +419,9 @@ public function next_tag( $query = null ) {
270419
if ( 's' === $t || 'S' === $t || 't' === $t || 'T' === $t ) {
271420
$tag_name = $this->get_tag();
272421

273-
if ( 'script' === $tag_name ) {
422+
if ( 'SCRIPT' === $tag_name ) {
274423
$this->skip_script_data();
275-
} elseif ( 'textarea' === $tag_name || 'title' === $tag_name ) {
424+
} elseif ( 'TEXTAREA' === $tag_name || 'TITLE' === $tag_name ) {
276425
$this->skip_rcdata( $tag_name );
277426
}
278427
}
@@ -318,7 +467,7 @@ private function skip_rcdata( $tag_name ) {
318467
$tag_char = $tag_name[ $i ];
319468
$html_char = $html[ $at + $i ];
320469

321-
if ( $html_char !== $tag_char && strtolower( $html_char ) !== $tag_char ) {
470+
if ( $html_char !== $tag_char && strtoupper( $html_char ) !== $tag_char ) {
322471
$at += $i;
323472
continue 2;
324473
}
@@ -937,7 +1086,7 @@ public function get_tag() {
9371086

9381087
$tag_name = substr( $this->html, $this->tag_name_starts_at, $this->tag_name_length );
9391088

940-
return strtolower( $tag_name );
1089+
return strtoupper( $tag_name );
9411090
}
9421091

9431092
/**
@@ -1189,7 +1338,7 @@ private function matches() {
11891338

11901339
/*
11911340
* Otherwise we have to check for each character if they
1192-
* are the same, and only `strtolower()` if we have to.
1341+
* are the same, and only `strtoupper()` if we have to.
11931342
* Presuming that most people will supply lowercase tag
11941343
* names and most HTML will contain lowercase tag names,
11951344
* most of the time this runs we shouldn't expect to
@@ -1199,7 +1348,7 @@ private function matches() {
11991348
$html_char = $this->html[ $this->tag_name_starts_at + $i ];
12001349
$tag_char = $this->sought_tag_name[ $i ];
12011350

1202-
if ( $html_char !== $tag_char && strtolower( $html_char ) !== $tag_char ) {
1351+
if ( $html_char !== $tag_char && strtoupper( $html_char ) !== $tag_char ) {
12031352
return false;
12041353
}
12051354
}
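
The comparison changed in `skip_rcdata()` and `matches()` above tries an exact character match first and only calls `strtoupper()` on a mismatch. A standalone sketch of that idea follows; `sketch_tag_name_matches()` is a hypothetical helper, and it assumes the sought name has already been normalized to upper case (the diff shows this for `skip_rcdata()`, which receives the result of `get_tag()`, but not for `matches()`).

```php
<?php
/**
 * Hypothetical helper, not part of WP_HTML_Tag_Processor.
 *
 * Compares a tag name found in the HTML against a sought name that is
 * assumed to be upper-case already.
 */
function sketch_tag_name_matches( string $found_name, string $sought_upper ): bool {
	$length = strlen( $sought_upper );

	// Names of different lengths can never match.
	if ( strlen( $found_name ) !== $length ) {
		return false;
	}

	for ( $i = 0; $i < $length; $i++ ) {
		$html_char = $found_name[ $i ];
		$tag_char  = $sought_upper[ $i ];

		// Try the cheap exact comparison first; upper-case only on a mismatch.
		if ( $html_char !== $tag_char && strtoupper( $html_char ) !== $tag_char ) {
			return false;
		}
	}

	return true;
}

var_dump( sketch_tag_name_matches( 'TextArea', 'TEXTAREA' ) ); // bool(true)
var_dump( sketch_tag_name_matches( 'title', 'TEXTAREA' ) );    // bool(false)
```

Under this scheme a lower-case sought name such as `div` would not match `<DIV>` in the input, because neither `'D'` nor `strtoupper( 'D' )` equals `'d'`. That is why the normalization assumption matters once the comparison switches from `strtolower()` to `strtoupper()`.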

phpunit/html/wp-html-tag-processor-standalone-test.php

Lines changed: 11 additions & 11 deletions
Original file line number | Diff line number | Diff line change
@@ -68,7 +68,7 @@ public function test_get_tag_returns_null_when_not_in_open_tag() {
6868
public function test_get_tag_returns_open_tag_name() {
6969
$p = new WP_HTML_Tag_Processor( '<div>Test</div>' );
7070
$this->assertTrue( $p->next_tag( 'div' ), 'Querying an existing tag did not return true' );
71-
$this->assertSame( 'div', $p->get_tag(), 'Accessing an existing tag name did not return "div"' );
71+
$this->assertSame( 'DIV', $p->get_tag(), 'Accessing an existing tag name did not return "div"' );
7272
}
7373

7474
/**
@@ -841,7 +841,7 @@ public function test_setting_a_boolean_attribute_to_a_string_value_adds_explicit
841841
public function test_unclosed_script_tag_should_not_cause_an_infinite_loop() {
842842
$p = new WP_HTML_Tag_Processor( '<script>' );
843843
$p->next_tag();
844-
$this->assertSame( 'script', $p->get_tag() );
844+
$this->assertSame( 'SCRIPT', $p->get_tag() );
845845
$p->next_tag();
846846
}
847847

@@ -855,9 +855,9 @@ public function test_unclosed_script_tag_should_not_cause_an_infinite_loop() {
855855
public function test_next_tag_ignores_the_contents_of_a_script_tag( $script_then_div ) {
856856
$p = new WP_HTML_Tag_Processor( $script_then_div );
857857
$p->next_tag();
858-
$this->assertSame( 'script', $p->get_tag(), 'The first found tag was not "script"' );
858+
$this->assertSame( 'SCRIPT', $p->get_tag(), 'The first found tag was not "script"' );
859859
$p->next_tag();
860-
$this->assertSame( 'div', $p->get_tag(), 'The second found tag was not "∂iv"' );
860+
$this->assertSame( 'DIV', $p->get_tag(), 'The second found tag was not "div"' );
861861
}
862862

863863
/**
@@ -934,7 +934,7 @@ public function test_next_tag_ignores_the_contents_of_a_rcdata_tag( $rcdata_then
934934
$p->next_tag();
935935
$this->assertSame( $rcdata_tag, $p->get_tag(), "The first found tag was not '$rcdata_tag'" );
936936
$p->next_tag();
937-
$this->assertSame( 'div', $p->get_tag(), "The second found tag was not 'div'" );
937+
$this->assertSame( 'DIV', $p->get_tag(), "The second found tag was not 'div'" );
938938
}
939939

940940
/**
@@ -951,32 +951,32 @@ public function data_rcdata_state() {
951951
$examples = array();
952952
$examples['Simple textarea'] = array(
953953
'<textarea><span class="d-none d-md-inline">Back to notifications</span></textarea><div></div>',
954-
'textarea',
954+
'TEXTAREA',
955955
);
956956

957957
$examples['Simple title'] = array(
958958
'<title><span class="d-none d-md-inline">Back to notifications</title</span></title><div></div>',
959-
'title',
959+
'TITLE',
960960
);
961961

962962
$examples['Comment opener inside a textarea tag should be ignored'] = array(
963963
'<textarea class="d-md-none"><!--</textarea><div></div>-->',
964-
'textarea',
964+
'TEXTAREA',
965965
);
966966

967967
$examples['Textarea closer with another textarea tag in closer attributes'] = array(
968968
'<textarea><span class="d-none d-md-inline">Back to notifications</title</span></textarea <textarea><div></div>',
969-
'textarea',
969+
'TEXTAREA',
970970
);
971971

972972
$examples['Textarea closer with attributes'] = array(
973973
'<textarea class="d-md-none"><span class="d-none d-md-inline">Back to notifications</span></textarea id="test"><div></div>',
974-
'textarea',
974+
'TEXTAREA',
975975
);
976976

977977
$examples['Textarea opener with title closer inside'] = array(
978978
'<textarea class="d-md-none"></title></textarea><div></div>',
979-
'textarea',
979+
'TEXTAREA',
980980
);
981981
return $examples;
982982
}
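
The updated assertions expect upper-case names regardless of how the tag is written in the input. As a rough illustration, assuming a much simpler scanner than the real one, the hypothetical function `sketch_first_tag_name()` below extracts the name of the first tag opener and upper-cases it the way the changed `get_tag()` does. It only looks at the first `<` in the input and ignores comments, closing tags, and the other cases the processor handles.

```php
<?php
/**
 * Hypothetical helper, not part of WP_HTML_Tag_Processor.
 *
 * Returns the upper-cased name of the tag opened at the first `<`,
 * or null if no tag name starts there.
 */
function sketch_first_tag_name( string $html ): ?string {
	$at = strpos( $html, '<' );
	if ( false === $at || $at + 1 >= strlen( $html ) ) {
		return null;
	}

	$name_starts_at = $at + 1;
	$first_char     = $html[ $name_starts_at ];

	// In this sketch a tag name must begin with an ASCII letter.
	$is_letter = ( 'a' <= $first_char && $first_char <= 'z' ) || ( 'A' <= $first_char && $first_char <= 'Z' );
	if ( ! $is_letter ) {
		return null;
	}

	// The name runs until whitespace, a slash, or the closing bracket.
	$name_length = strcspn( $html, " \t\f\r\n/>", $name_starts_at );
	$tag_name    = substr( $html, $name_starts_at, $name_length );

	return strtoupper( $tag_name );
}

var_dump( sketch_first_tag_name( '<div>Test</div>' ) ); // string(3) "DIV"
var_dump( sketch_first_tag_name( '<ScRiPt src=x>' ) );  // string(6) "SCRIPT"
```

Code that previously compared `get_tag()` against lower-case strings such as `'div'` would need to compare against `'DIV'` after this commit, as the changed test expectations show.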
