HTML API: Provide mechanism to scan all tokens in an HTML document, not only the tags. #5683

dmsnell · 2023-11-17T22:46:47Z

Trac ticket: Core-60170
Companion port into Gutenberg: WordPress/gutenberg#58107 (contains additional porting code)

This PR provides full tokenization scanning of an HTML document. This is being added into the Tag Processor and will be a necessary component for a number of related changes to the HTML API:

Reading and modifying inner and outer content.
Serializing HTML.
Methods to transform text content while preserving or stripping away markup.

Enables syntax-aware processing such as wp_truncate_html() [gist]
Replaces/incorporates chunked/extended processing in #5050
Replaces/incorporates stopping at comments in dmsnell#7
Provides critical functionality for inner/outer getter/setter in [dmsnell#10, #4965]

Depends on #5721 ✅
Depends on #5725 ✅

Todo

Design Changes

In this change we're introducing two features stemming from two internal changes:

next_token() provides the ability to scan every token in the HTML stream.
it is possible to parse HTML in chunks without having the entire document in memory.

The internal changes powering this are:

internal state adopts a new parsing mode which allows resuming from the middle of a previous match.
the new concept of modifiable text and a token proper tracks the bounds of the currently-matched token as well as a safe region that can be changed without impacting the document syntax, if one exists.

For example, when encountering an HTML comment the parser will track the following token information:

This <!-- is a comment -->.
     │   │            │  └ end of token
     │   │            └─── end of text
     │   └──────────────── start of text
     └──────────────────── start of token

Not every token will have a text region, but it's important to track the entire token and any text region because similar tokens may have different syntax. For example, an invalid comment is still a comment.

This <? is also a comment --!>.
     │ │                 │   └ end of token
     │ │                 └──── end of text
     │ └────────────────────── start of text
     └──────────────────────── start of token

This holds for tokens whose entire content is text, such as with the #text node.

<div>This is text.</div>
     │           ├ end of token
     │           └ end of text
     ├──────────── start of text
     └──────────── start of token

Special HTML tags have modifiable text that isn't part of .textContent or .innerText. For example, the TITLE element contains no HTML inside of it: everything is plaintext, and its contents don't appear in the page. The same is true for TEXTAREA, SCRIPT, STYLE, and a few more elements.

<title>This is text <em>Not HTML</em>.</title>
│      │                             │       └ end of token
│      └ start of text               └ end of text
└──────────── start of token
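The four boundaries in the diagrams above can be sketched in plain PHP. This is only an illustration, not the Tag Processor's implementation: `find_comment_span()` is a hypothetical helper, it handles only well-formed `<!-- ... -->` comments, and it reports each region as an (offset, length) pair so the result can be passed straight to `substr()`.

```php
<?php
/**
 * Hypothetical helper, not part of the HTML API: locates the first
 * well-formed `<!-- ... -->` comment and reports the token span and
 * the text span as (offset, length) pairs.
 */
function find_comment_span( string $html ): ?array {
	$token_starts_at = strpos( $html, '<!--' );
	if ( false === $token_starts_at ) {
		return null;
	}

	// The text region begins right after the four-byte opener.
	$text_starts_at = $token_starts_at + 4;
	$closer_at      = strpos( $html, '-->', $text_starts_at );
	if ( false === $closer_at ) {
		// Unclosed or abruptly-closed comments are out of scope for this sketch.
		return null;
	}

	return array(
		'token_starts_at' => $token_starts_at,
		'token_length'    => $closer_at + 3 - $token_starts_at,
		'text_starts_at'  => $text_starts_at,
		'text_length'     => $closer_at - $text_starts_at,
	);
}

$html = 'This <!-- is a comment -->.';
$span = find_comment_span( $html );

// Prints the whole token, then only the text region inside it.
echo substr( $html, $span['token_starts_at'], $span['token_length'] ), "\n";
echo substr( $html, $span['text_starts_at'], $span['text_length'] ), "\n";
```

Replacing only the text region leaves the `<!--` and `-->` syntax untouched, which is the property the "safe region" is meant to guarantee.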

Scanning tokens

In order to keep the next_tag() interface and use clear, it is left unchanged. For operations needing access to the token stream, there is no built-in query mechanism and querying ought to be performed inside a next_token() loop. get_token_type() indicates what kind of token is currently matched, get_token_name() returns something that more closely matches what a DOM API would return, and get_modifiable_text() returns the modifiable text if available.

function wp_strip_all_tags( $html, $remove_breaks ) {
	$processor = new WP_HTML_Tag_Processor( $html );

	$text_content = '';
	while ( $processor->next_token() ) {
		if ( '#text' === $processor->get_token_name() ) {
			$text_content .= $processor->get_modifiable_text();
		}
	}

	return $remove_breaks
		? trim( preg_replace( '/[\r\n\t ]+/', ' ', $text_content ) )
		: $text_content;
}

Most tags have no modifiable content.
For special tags whose contents cannot contain HTML, the inner contents are their modifiable content. The inner contents of these tags are not rendered in the page.
- IFRAME
- NOEMBED, NOFRAMES
- SCRIPT
- STYLE
- TEXTAREA [character references are decoded]
- TITLE [character references are decoded]
- XMP
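The difference between the decoded and the verbatim entries in this list can be sketched as follows. `decode_special_tag_text()` is a hypothetical helper written for this illustration, and PHP's `html_entity_decode()` is used only as an approximation of HTML character reference decoding; it does not follow every rule of the HTML specification.

```php
<?php
/**
 * Hypothetical helper, not part of the HTML API: returns the text of a
 * special element as a consumer would read it. Only TEXTAREA and TITLE
 * have their character references decoded; every other tag is verbatim.
 */
function decode_special_tag_text( string $tag_name, string $raw_text ): string {
	$decoded_tags = array( 'TEXTAREA', 'TITLE' );

	if ( in_array( strtoupper( $tag_name ), $decoded_tags, true ) ) {
		// Approximation of HTML decoding, assumed good enough for a sketch.
		return html_entity_decode( $raw_text, ENT_QUOTES | ENT_HTML5, 'UTF-8' );
	}

	// SCRIPT, STYLE, IFRAME, NOEMBED, NOFRAMES, XMP: left untouched.
	return $raw_text;
}

echo decode_special_tag_text( 'TITLE', 'Fish &amp; Chips' ), "\n";
echo decode_special_tag_text( 'STYLE', 'a::after { content: "&amp;"; }' ), "\n";
```

The first call prints the decoded title text, while the second prints the STYLE contents exactly as written.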

TODO

Add next_token() method to scan each token.
Stop at RCDATA sections and SCRIPT, STYLE, TITLE, TEXTAREA, etc…
Stop at text nodes.
Indicate a continuation state to support resumable parsing. This is necessary for stopping at SCRIPT tags and other tags with special closing rules. These are currently handled by skipping to the end of the element when finding the starting tag, but this has introduced a few challenges and bugs (for example, the Tag Processor fails to stop at a <title> tag if the document ends before the </title> closer is found).
Add rewind() method to reverse to the start of the document.

…, end)

This patch follows up on earlier design questions around how to represent spans of strings inside the class. It's relevant now as preparation for WordPress#5683.

The mixture of (offset, length) and (start, end) coordinates becomes confusing at times and all final string operations are performed with the (offset, length) pair, since these feed into `substr()`.

In preparation for exposing all tokens within an HTML document this change:

- Unifies the representation throughout the class.
- It creates `token_starts_at` to track the start of the current token.
- It replaces `tag_ends_at` with `token_length` for re-use with other token types.

There should be no functional or behavioral changes in this patch.

For the internal helper classes this patch introduces breaking changes, but those classes are marked private and should not be used outside of the HTML API itself.

github-actions · 2023-12-21T21:19:19Z

Test using WordPress Playground

The changes in this pull request can be previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

The Plugin and Theme Directories cannot be accessed within Playground.
All changes will be lost when closing a tab with a Playground instance.
All changes will be lost when refreshing the page.
A fresh instance is created each time the link below is clicked.
Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance, it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

src/wp-includes/html-api/class-wp-html-tag-processor.php

sirreal

I really like the shape this is taking. I've left several thoughts, comments, and questions from this first pass. I'd like to take a look at the processing of comments because I think we can fix that @todo in this PR.

I also want to see what feedback the html5lib tests can give us so I'll take some time to see what it looks like to run them against this PR with additional handling of more node types (one of your todo items in the description).

I haven't gone through everything yet, only the main implementation changes.

src/wp-includes/html-api/class-wp-html-tag-processor.php

dmsnell · 2024-01-15T17:57:15Z

src/wp-includes/html-api/class-wp-html-tag-processor.php

+ *  - `#text` nodes, whose entire token _is_ the modifiable text.
+ *  - Comment nodes and nodes that became comments because of some syntax error. The
+ *    text for these nodes is the portion of the comment inside of the syntax. E.g. for
+ *    `<!-- comment -->` the text is `" comment "` (note that the spaces are part of it).
+ *  - `CDATA` sections, whose text is the content inside of the section itself. E.g. for
+ *    `<![CDATA[some content]]>` the text is `"some content"`.
+ *  - "Funky comments," which are a special case of invalid closing tags whose name is
+ *    invalid. The text for these nodes is the text that a browser would transform into
+ *    an HTML comment when parsing. E.g. for `</%post_author>` the text is `%post_author`.
+ *
+ * And there are non-elements which are atomic in nature but have no modifiable text.
+ *  - `DOCTYPE` nodes like `<!DOCTYPE html>` which have no closing tag.
+ *  - The empty end tag `</>` which is ignored in the browser and DOM but exposed
+ *    to the HTML API.


@sirreal I'm not sure yet on what to do here. can we tag this for follow-up after merge?

I'm a bit concerned about using specific property names here because this is supposed to be the explanatory section of the documentation and I don't want to couple the description to our own terms; I want it to read comfortably for someone coming in with an HTML background - that is, leave things a bit loose here to guide an understanding without pinning it to one specific technicality.

nonetheless I've taken another pass at the comment to update it based on how this has developed.

src/wp-includes/html-api/class-wp-html-tag-processor.php

+	 * | *Text node*     | Found a #text node; this is plaintext and modifiable.                |
+	 * | *CDATA node*    | Found a CDATA section; this is modifiable.                           |
+	 * | *Comment*       | Found a comment or bogus comment; this is modifiable.                |


src/wp-includes/html-api/class-wp-html-tag-processor.php

sirreal · 2024-01-11T18:32:21Z

src/wp-includes/html-api/class-wp-html-tag-processor.php

+			case self::STATE_DOCTYPE:
+				return '#doctype';


Why do we return #doctype here? The html value we'd get from get_token_name is confusing but aligns with what the browser does.

this is specifically to expose what kind of token it is. I don't like conflating it with the HTML tag name for an Element, even though one is lower-case and the other upper-case.

in my own explorations I found it helpful to have both functions: one to say the node name like the browser would, and one to say the node type (also like how the browser would). I've also been trying to balance the use of longer constants against clearly-searchable text values since this is a more consumer-oriented function.

switch ( $processor->get_token_type() ) { case WP_HTML_Processor::NODE_TYPE_DOCUMENT_TYPE: case '#doctype': … }

at this point I'm assuming people will use the string value even if the constant exists. also I started with get_node_type() and get_node_name() but then renamed to _token_ because I wanted to support a slightly different set of kinds; I'm doubting this since discovering the challenge of partial documents with invalid comment syntaxes, but haven't completely abandoned the idea yet, particularly because of the support for presumptuous tags and funky comments, which aren't in the DOM API.

Thanks, I think this makes sense 👍

src/wp-includes/html-api/class-wp-html-tag-processor.php

sirreal

Thanks! I think this is ready to merge.

sirreal · 2024-01-16T12:38:11Z

tests/phpunit/tests/html-api/wpHtmlTagProcessor-token-scanning.php

+		$processor = WP_HTML_Processor::create_fragment( '<![CDATA[this is a comment]]>' );
+		$processor->next_token();
+
+		$this->assertSame(
+			'#cdata-section',
+			$processor->get_token_name(),
+			"Should have found CDATA section name but found {$processor->get_token_name()} instead."
+		);


I don't want to get hung up on CDATA and PI handling, this doesn't block merging this PR.

I think this behavior is what you described in Slack here:

…it finds those comments (to the first >), and then if it ends in ]]> and starts with <![CDATA[ we can safely say, "this is a CDATA node" … though the actual rules for those are more complicated and we can only support a subset now

We can discuss this more in a follow-up, but I'm reluctant to diverge from the specification. This is not a cdata-section with the text content `this is a comment` (unless we were in svg or math foreign content); this is a comment with the text content `[CDATA[this is a comment]]`.

yeah I'm torn a bit but also I find that things are slightly different since we're not building a DOM here. when considering the intentionality behind some HTML string, I think it's evident that if someone writes <![CDATA[something]]> then they clearly meant to produce what they consider a CDATA section, and WordPress itself still creates these for legacy reasons (even though it may be the case outside of WordPress's XML outputs that those aren't needed anymore).

so this does conflate with a comment whose text is [CDATA[this is a comment]], but if we only indicate that, we have also lost the ability to differentiate these two strings, which in my opinion have divergent histories and intents:

<!--[CDATA[this is a comment]]-->
<![CDATA[this is a comment]]>

what I see as the potential failure here is that we hold fixed a comment structure someone can't get rid of because we're only allowing adjustment inside the [ and ], but ultimately in the browser they both disappear as comments.

the case I was far more concerned with is the one we fixed, which is when we think that the inner text is `5 > 3` or `[CDATA[5 > 3]]` when in fact it's truly `5` or `[CDATA[5`, since these represent a divergence in token boundaries from the browser (which we still have somewhat at play inside foreign elements).

I'm having similar vibes about representing  because right now we're not exposing those as changeable comments. again, someone might miss these because of the representation, but they won't cause the parser to get off track and they won't change the rendered view of the page.

let's keep talking because I'd like to push this as far as possible. I really want it to work that we expose these as separate entities. a possible compromise is to maintain a separate indicator specifying type_of_comment which could be BOGUS_COMMENT, CDATA, VALID_COMMENT, etc…, but that also introduces more API surface so I want to have a good feeling that it's necessary before putting it there.

tests/phpunit/tests/html-api/wpHtmlTagProcessor-token-scanning.php

+		);
+
+		$processor->next_token();
+


tests/phpunit/tests/html-api/wpHtmlTagProcessor-token-scanning.php

sirreal · 2024-01-16T13:47:26Z

src/wp-includes/html-api/class-wp-html-tag-processor.php

+			case self::STATE_DOCTYPE:
+				return '#doctype';


Thanks, I think this makes sense 👍

src/wp-includes/html-api/class-wp-html-tag-processor.php

sirreal · 2024-01-22T12:07:34Z

src/wp-includes/html-api/class-wp-html-tag-processor.php

+	 *
+	 *     <!-->
+	 *     <!--->
+	 *     <!---->


This one is not abruptly closed: we have the start `<!--` and end `-->`, with empty text content.

Here are two examples from the html5lib-tests. There's no comment error with `<!---->`, but there is one with `<!--->`.

thank you. the comment was wrong but the code appears to have been good.

I actually had a last-minute mini-panic thinking that we aren't detecting <!---!> as an abruptly-closed comment, but the Tag Processor is already right! it's not, and the comment continues. thankfully the code in this branch and in trunk handles it properly
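The cases discussed in this thread can be summarized in a small sketch. `classify_comment_opening()` is a hypothetical function written for this illustration and is not how the Tag Processor is implemented; it assumes the input starts at the comment opener and follows the HTML tokenizer's rule that `<!-->` and `<!--->` close an empty comment abruptly, while any other comment runs until the next `-->` or `--!>`.

```php
<?php
/**
 * Hypothetical classifier, not part of the HTML API: describes how a
 * string that starts with a comment opener is closed.
 */
function classify_comment_opening( string $html ): string {
	if ( 0 !== strncmp( $html, '<!--', 4 ) ) {
		return 'not-a-comment';
	}

	$rest = substr( $html, 4 );

	// `<!-->` and `<!--->` end the comment immediately with empty text.
	if ( 0 === strncmp( $rest, '>', 1 ) || 0 === strncmp( $rest, '->', 2 ) ) {
		return 'abruptly-closed';
	}

	// Otherwise the comment continues until a normal or "bang" closer.
	if ( false !== strpos( $rest, '-->' ) || false !== strpos( $rest, '--!>' ) ) {
		return 'closed';
	}

	return 'unclosed';
}

foreach ( array( '<!-->', '<!--->', '<!---->', '<!---!>' ) as $example ) {
	echo $example, ' => ', classify_comment_opening( $example ), "\n";
}
```

Under these assumptions `<!---->` is an ordinary closed comment with empty text, and `<!---!>` is not closed at all, so the comment continues into whatever follows it.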

sirreal

I've reviewed all of the recent changes and they look good to me. I left a few suggestions and I want to make sure we add LISTING to the special handling that removes starting newlines for PRE and TEXTAREA content.

src/wp-includes/html-api/class-wp-html-tag-processor.php

Since its introduction in WordPress 6.2 the HTML Tag Processor has provided a way to scan through all of the HTML tags in a document and then read and modify their attributes. In order to reliably do this, it also needed to be aware of other kinds of HTML syntax, but it didn't expose those syntax tokens to consumers of the API.

In this patch the Tag Processor introduces a new scanning method and a few helper methods to read information about or from each token. Most significantly, this introduces the ability to read `#text` nodes in the document.

What's new in the Tag Processor?
================================

- `next_token()` visits every distinct syntax token in a document.
- `get_token_type()` indicates what kind of token it is.
- `get_token_name()` returns something akin to `DOMNode.nodeName`.
- `get_modifiable_text()` returns the text associated with a token.
- `get_comment_type()` indicates why a token represents an HTML comment.

Example usage.
==============

```php
function strip_all_tags( $html ) {
	$text_content = '';
	$processor    = new WP_HTML_Tag_Processor( $html );

	while ( $processor->next_token() ) {
		if ( '#text' !== $processor->get_token_type() ) {
			continue;
		}

		$text_content .= $processor->get_modifiable_text();
	}

	return $text_content;
}
```

What changes in the Tag Processor?
==================================

Previously, the Tag Processor would scan the opening and closing tag of every HTML element separately. Now, however, there are special tags which it only visits once, as if those elements were void tags without a closer. These are special tags because their content contains no other HTML or markup, only non-HTML content.

- SCRIPT elements contain raw text which is isolated from the rest of the HTML document and fed separately into a JavaScript engine. There are complicated rules to avoid escaping the script context in the HTML. The contents are left verbatim, and character references are not decoded.
- TEXTAREA and TITLE elements contain plain text which is decoded before display, e.g. transforming `&amp;` into `&`. Any markup which resembles tags is treated as verbatim text and not a tag.
- IFRAME, NOEMBED, NOFRAMES, STYLE, and XMP elements are similar to the textarea and title elements, but no character references are decoded. For example, `&amp;` inside a STYLE element is passed to the CSS engine as the literal string `&amp;` and _not_ as `&`.

Because it's important not to treat this inner content separately from the elements containing it, the Tag Processor combines them when scanning into a single match and makes their content available as modifiable text (see below).

This means that the Tag Processor will no longer visit a closing tag for any of these elements unless that tag is unexpected.

    <title>There is only a single token in this line</title>
    <title>There are two tokens in this line></title></title>
    </title><title>There are still two tokens in this line></title>

What are tokens?
================

The term "token" here is a parsing term, which means a primitive unit in HTML. There are only a few kinds of tokens in HTML:

- a tag has a name, attributes, and a closing or self-closing flag.
- a text node, or `#text` node contains plain text which is displayed in a browser and which is decoded before display.
- a DOCTYPE declaration indicates how to parse the document.
- a comment is hidden from the display on a page but present in the HTML.

There are a few more kinds of tokens that the HTML Tag Processor will recognize, some of which don't exist as concepts in HTML. These mostly comprise XML syntax elements that aren't part of HTML (such as CDATA and processing instructions) and invalid HTML syntax that transforms into comments.

What is a funky comment?
========================

This patch treats a specific kind of invalid comment in a special way. A closing tag with an invalid name is considered a "funky comment." In the browser these become HTML comments just like any other, but their syntax is convenient for representing a variety of bits of information in a well-defined way and which cannot be nested or recursive, given the parsing rules handling this invalid syntax.

- `</1>`
- `</%avatar_url>`
- `</{"wp_bit": {"type": "post-author"}}>`
- `</[post-author]>`
- `</__( 'Save Post' );>`

All of these examples become HTML comments in the browser. The content inside the funky comment is easily parsable, whereby the only rule is that it starts at the `<` and continues until the nearest `>`. There can be no funky comment inside another, because that would imply having a `>` inside of one, which would actually terminate the first one.

What is modifiable text?
========================

Modifiable text is similar to the `innerText` property of a DOM node. It represents the span of text for a given token which may be modified without changing the structure of the HTML document or the token.

There is currently no mechanism to change the modifiable text, but this is planned to arrive in a later patch.

Tags
====

Most tags have no modifiable text because they have child nodes where text nodes are found. Only the special tags mentioned above have modifiable text.

    <div class="post">Another day in HTML</div>
    └─ tag ──────────┘└─ text node ─────┘└────┴─ tag

    <title>Is <img> &gt; <image>?</title>
    │      └ modifiable text ───┘       │ "Is <img> > <image>?"
    └─ tag ─────────────────────────────┘

Text nodes
==========

Text nodes are entirely modifiable text.

    This HTML document has no tags.
    └─ modifiable text ───────────┘

Comments
========

The modifiable text inside a comment is the portion of the comment that doesn't form its syntax. This applies to a number of invalid comments.

    <!-- this is inside a comment -->
    │   └─ modifiable text ──────┘  │
    └─ comment token ───────────────┘

    <? this is an invalid comment -->
    │ └─ modifiable text ────────┘  │
    └─ comment token ───────────────┘

    <![CDATA[this is an invalid comment]]>
    │        └─ modifiable text ──────┘  │
    └─ comment token ────────────────────┘

Other token types also have modifiable text. Consult the code or tests for further information.
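The funky comment rule described in the commit message (start at `</`, continue to the nearest `>`) can be sketched with plain string functions. `find_funky_comments()` is a hypothetical helper for illustration only. It assumes a funky comment is a `</` followed by any character other than an ASCII letter or `>`, and it ignores context, so a `</` inside a SCRIPT element, an attribute value, or a regular comment would be misreported.

```php
<?php
/**
 * Hypothetical scanner, not part of the HTML API: collects the text of
 * every "funky comment" (a closing tag with an invalid name).
 */
function find_funky_comments( string $html ): array {
	$texts = array();
	$at    = 0;

	while ( false !== ( $at = strpos( $html, '</', $at ) ) ) {
		$text_starts_at = $at + 2;
		if ( $text_starts_at >= strlen( $html ) ) {
			break;
		}

		$first = $html[ $text_starts_at ];

		// A letter starts a normal closing tag; `</>` is the empty end tag.
		if ( 1 === preg_match( '/[A-Za-z]/', $first ) || '>' === $first ) {
			$at = $text_starts_at;
			continue;
		}

		// The comment always ends at the nearest `>`, so nesting is impossible.
		$closer_at = strpos( $html, '>', $text_starts_at );
		if ( false === $closer_at ) {
			break;
		}

		$texts[] = substr( $html, $text_starts_at, $closer_at - $text_starts_at );
		$at      = $closer_at + 1;
	}

	return $texts;
}

print_r( find_funky_comments( '<p>Hi</p></%post_author></1></><b>' ) );
```

For the input above the sketch collects `%post_author` and `1`, and skips both the real `</p>` closer and the empty `</>` end tag.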

dmsnell · 2024-01-24T23:39:12Z

Merged in 57348
616e673

dmsnell force-pushed the html-api/scan-all-tokens branch from 3dede00 to a25b57a Compare November 28, 2023 21:45

dmsnell mentioned this pull request Nov 30, 2023

HTML API: Track spans of text with (offset, length) instead of (start, end) #5721

Closed

dmsnell force-pushed the html-api/scan-all-tokens branch 2 times, most recently from 156a31e to c22cd4b Compare December 21, 2023 21:06

dmsnell force-pushed the html-api/scan-all-tokens branch 2 times, most recently from f19a5cb to cc96ed2 Compare January 1, 2024 03:42

dmsnell marked this pull request as ready for review January 1, 2024 03:43

dmsnell force-pushed the html-api/scan-all-tokens branch 2 times, most recently from 51f432c to 103a556 Compare January 11, 2024 03:27

dmsnell mentioned this pull request Jan 11, 2024

HTML API: Introduce HTML templating dmsnell/wordpress-develop#12

Closed

4 tasks

dlh01 reviewed Jan 11, 2024


sirreal reviewed Jan 11, 2024


dmsnell force-pushed the html-api/scan-all-tokens branch 2 times, most recently from 9f29920 to 1098c19 Compare January 15, 2024 17:22

sirreal approved these changes Jan 16, 2024


dmsnell force-pushed the html-api/scan-all-tokens branch from 0ca080a to f502153 Compare January 16, 2024 15:36

sirreal reviewed Jan 19, 2024


sirreal mentioned this pull request Jan 19, 2024

HTML API: Add PLAINTEXT tag handling #5905

Draft

dmsnell force-pushed the html-api/scan-all-tokens branch from 4959837 to 633804a Compare January 20, 2024 00:20

sirreal reviewed Jan 22, 2024


dmsnell force-pushed the html-api/scan-all-tokens branch 2 times, most recently from 5065bee to 7d9786d Compare January 23, 2024 05:21

dmsnell mentioned this pull request Jan 23, 2024

HTML API: Backport updates from Core WordPress/gutenberg#58107

Merged

sirreal approved these changes Jan 23, 2024


dmsnell force-pushed the html-api/scan-all-tokens branch 3 times, most recently from faf9cef to 9d01322 Compare January 24, 2024 21:38

dmsnell force-pushed the html-api/scan-all-tokens branch from 9d01322 to 30991d7 Compare January 24, 2024 21:47

dmsnell closed this Jan 24, 2024

dmsnell mentioned this pull request Jan 30, 2024

HTML API: Fix void tag nesting with next_token #5975

Closed

dmsnell deleted the html-api/scan-all-tokens branch February 1, 2024 00:15

dmsnell mentioned this pull request Feb 2, 2024

HTML API: Reset parser state after seeking to bookmark. #6021

Closed
