@dmsnell dmsnell commented Nov 17, 2023

Trac ticket: Core-60170
Companion port into Gutenberg: WordPress/gutenberg#58107 (contains additional porting code)

This PR provides full tokenization scanning of an HTML document. This is being added into the Tag Processor and will be a necessary component for a number of related changes to the HTML API:

  • Reading and modifying inner and outer content.
  • Serializing HTML.
  • Methods to transform text content while preserving or stripping away markup.

Enables syntax-aware processing such as wp_truncate_html() [gist]
Replaces/incorporates chunked/extended processing in #5050
Replaces/incorporates stopping at comments in dmsnell#7
Provides critical functionality for inner/outer getter/setter in [dmsnell#10, #4965]

Depends on #5721
Depends on #5725

Todo

  • Change CDATA sections and PI Nodes into comments with a new "comment type" flag.
  • Ensure HTML Processor can seek around without messing up.
  • Review all $this->bytes_already_parsed assignments and make sure they are proper. I think half of them are off by one.
  • Add function docblocks.
  • Review what this enables in the html5lib test suite.
  • Use this internally in the HTML Processor to ensure that breadcrumbs are generated.
  • Explore using combinable bit flags for the token types instead of string constants. This would allow for things like MATCHED_TAG | TEXT_NODE and INCOMPLETE | COMPLETE, which could simplify some logic that's spread in if statements.
    • For this PR it's not worth moving to boolean logic like this. The existing code is clearer for review.
  • Add test suite.
  • Distinguish too-short HTML comments that may cause trouble when modifying. E.g. <!--->.
  • Discuss what to do about PI Nodes and CDATA sections
    • It's possible to identify these after identifying the bogus comment span, but we can't look for the ending syntax of these sections because HTML stipulates that they end at the first >, not the closing ]]> or the closing ?>. So we can find all HTML comments, and then determine if they would have been a CDATA or PI Node if HTML supported those.
    • We can also ignore them all, but we lose knowledge about the HTML stream that we could recover (e.g. distinguish <?for-each?> from <!--for-each-->).
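
The "find the bogus comment first, then decide what it resembles" idea from the last Todo item could look roughly like the sketch below. It is a standalone illustration, not code from this PR: classify_bogus_comment() and the kind labels are hypothetical names, it assumes PHP 8 for str_starts_with()/str_ends_with(), and it does not handle normal `<!--` comments.

```php
<?php
/**
 * Hypothetical helper (not part of the HTML API): given HTML that starts with
 * a bogus comment opener such as `<!` or `<?`, find the span a browser would
 * treat as a comment and report what the author probably meant.
 */
function classify_bogus_comment( string $html ): ?array {
	// Bogus comments end at the first `>`, regardless of any `]]>` or `?>`.
	$closer_at = strpos( $html, '>' );
	if ( false === $closer_at ) {
		return null; // Incomplete input: no closing `>` found yet.
	}

	$token = substr( $html, 0, $closer_at + 1 );

	if ( strlen( $token ) >= 12 && str_starts_with( $token, '<![CDATA[' ) && str_ends_with( $token, ']]>' ) ) {
		$kind = 'cdata-lookalike';
	} elseif ( strlen( $token ) >= 4 && str_starts_with( $token, '<?' ) && str_ends_with( $token, '?>' ) ) {
		$kind = 'pi-lookalike';
	} else {
		$kind = 'bogus-comment';
	}

	return array(
		'kind'  => $kind,
		'token' => $token,
	);
}
```

Note that `<![CDATA[5 > 3]]>` is cut short at the first `>` and therefore comes back as a plain bogus comment, which is the token boundary a browser would use outside of foreign content.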

Design Changes

In this change we're introducing two features stemming from two internal changes:

  • next_token() provides the ability to scan every token in the HTML stream.
  • it is possible to parse HTML in chunks without having the entire document in memory.

The internal changes powering this are:

  • internal state adopts a new parsing mode which allows resuming from the middle of a previous match.
  • the new concept of modifiable text and a token proper tracks the bounds of the currently-matched token as well as a safe region that can be changed without impacting the document syntax, if one exists.

For example, when encountering an HTML comment the parser will track the following token information:

This <!-- is a comment -->.
     │   │            │  └ end of token
     │   │            └─── end of text
     │   └──────────────── start of text
     └──────────────────── start of token

Not every token will have a text region, but it's important to track the entire token and any text region because similar tokens may have different syntax. For example, an invalid comment is still a comment.

This <? is also a comment --!>.
     │ │                 │   └ end of token
     │ │                 └──── end of text
     │ └────────────────────── start of text
     └──────────────────────── start of token

This holds for tokens whose entire content is text, such as with the #text node.

<div>This is text.</div>
     │           ├ end of token
     │           └ end of text
     ├──────────── start of text
     └──────────── start of token

Special HTML tags have modifiable text that isn't part of .textContent or .innerText. For example, the TITLE element contains no HTML: everything inside of it is plaintext, and its contents don't appear in the page. The same is true for TEXTAREA, SCRIPT, STYLE, and a few more elements.

<title>This is text <em>Not HTML</em>.</title>
│      │                             │       └ end of token
│      └ start of text               └ end of text
└──────────── start of token
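
To make the four boundaries in the diagrams above concrete, here is a minimal standalone sketch for the simplest case, a well-formed `<!-- … -->` comment. find_comment_spans() and its array keys are hypothetical names chosen for illustration (only token_starts_at and token_length echo names used in this work), and the sketch deliberately ignores invalid and abruptly-closed comments.

```php
<?php
/**
 * Hypothetical illustration: locate the first well-formed `<!-- ... -->`
 * comment and report the token and text boundaries as (offset, length) pairs.
 */
function find_comment_spans( string $html ): ?array {
	$token_starts_at = strpos( $html, '<!--' );
	if ( false === $token_starts_at ) {
		return null;
	}

	$text_starts_at = $token_starts_at + 4; // Skip past the `<!--` opener.
	$closer_at      = strpos( $html, '-->', $text_starts_at );
	if ( false === $closer_at ) {
		return null; // Not handled here: unclosed or abruptly-closed comments.
	}

	return array(
		'token_starts_at' => $token_starts_at,
		'token_length'    => $closer_at + 3 - $token_starts_at,
		'text_starts_at'  => $text_starts_at,
		'text_length'     => $closer_at - $text_starts_at,
	);
}

$html  = 'This <!-- is a comment -->.';
$spans = find_comment_spans( $html );

// The whole token, including its syntax.
echo substr( $html, $spans['token_starts_at'], $spans['token_length'] ), "\n";
// Only the region that could be changed without breaking the syntax.
echo substr( $html, $spans['text_starts_at'], $spans['text_length'] ), "\n";
```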

Scanning tokens

In order to keep the next_tag() interface and use clear, it is left unchanged. For operations needing access to the token stream, there is no built-in query mechanism and querying ought to be performed inside a next_token() loop. get_token_type() indicates what kind of token is currently matched, get_token_name() returns something that more closely matches what a DOM API would return, and get_modifiable_text() returns the modifiable text if available.

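function wp_strip_all_tags( $html, $remove_breaks = false ) {
	// The Tag Processor is enough here: collecting text needs no document structure.
	$processor = new WP_HTML_Tag_Processor( $html );

	$text_content = '';
	while ( $processor->next_token() ) {
		// Only #text nodes contribute to the output; tags and comments are skipped.
		if ( '#text' === $processor->get_token_name() ) {
			$text_content .= $processor->get_modifiable_text();
		}
	}

	// Optionally collapse runs of whitespace into single spaces.
	return $remove_breaks
		? trim( preg_replace( '/[\r\n\t ]+/', ' ', $text_content ) )
		: $text_content;
}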
  • Most tags have no modifiable content.
  • The inner content of special tags, which cannot contain HTML, is their modifiable content. The inner contents of these tags are not rendered in the page.
    • IFRAME
    • NOEMBED, NOFRAMES
    • SCRIPT
    • STYLE
    • TEXTAREA [character references are decoded]
    • TITLE [character references are decoded]
    • XMP

TODO

  • Add next_token() method to scan each token.
  • Stop at RCDATA sections and SCRIPT, STYLE, TITLE, TEXTAREA, etc…
  • Stop at text nodes.
  • Indicate a continuation state to support resumable parsing. This is necessary for stopping at SCRIPT tags and other tags with special closing rules. These are currently handled by skipping to the end of the element when finding the starting tag, but this has introduced a few challenges and bugs (for example, the Tag Processor fails to stop at a <title> tag if the document ends before the </title> closer is found).
  • Add rewind() method to reverse to the start of the document.

@dmsnell dmsnell force-pushed the html-api/scan-all-tokens branch from 3dede00 to a25b57a Compare November 28, 2023 21:45
dmsnell added a commit to dmsnell/wordpress-develop that referenced this pull request Nov 30, 2023
…, end)

This patch follows-up with earlier design questions around how to represent
spans of strings inside the class. It's relevant now as preparation for WordPress#5683.

The mixture of (offset, length) and (start, end) coordinates becomes confusing
at times and all final string operations are performed with the (offset, length)
pair, since these feed into `substr()`.

In preparation for exposing all tokens within an HTML document this change:
 - Unifies the representation throughout the class.
 - Creates `token_starts_at` to track the start of the current token.
 - Replaces `tag_ends_at` with `token_length` for re-use with other token types.

There should be no functional or behavioral changes in this patch.

For the internal helper classes this patch introduces breaking changes, but those
classes are marked private and should not be used outside of the HTML API itself.
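
As a small illustration of why the commit message above prefers (offset, length) pairs, the standalone sketch below extracts the same span both ways. The variable names and the sample string are made up for this example and are not taken from the patch.

```php
<?php
$html = '<div>This is text.</div>';

// (start, end) coordinates: `end` points one byte past the end of the span,
// so every extraction needs a subtraction before calling substr().
$start          = 5;
$end            = 18;
$from_start_end = substr( $html, $start, $end - $start );

// (offset, length) coordinates feed into substr() directly.
$offset             = 5;
$length             = 13;
$from_offset_length = substr( $html, $offset, $length );

echo $from_start_end, "\n";
echo $from_offset_length, "\n";
```

Both calls yield the same text; the second form simply avoids converting between coordinate systems at each call site.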
dmsnell added a commit to dmsnell/wordpress-develop that referenced this pull request Dec 10, 2023
@dmsnell dmsnell force-pushed the html-api/scan-all-tokens branch 2 times, most recently from 156a31e to c22cd4b Compare December 21, 2023 21:06
@dmsnell dmsnell force-pushed the html-api/scan-all-tokens branch 2 times, most recently from f19a5cb to cc96ed2 Compare January 1, 2024 03:42
@dmsnell dmsnell marked this pull request as ready for review January 1, 2024 03:43
@dmsnell dmsnell force-pushed the html-api/scan-all-tokens branch 2 times, most recently from 51f432c to 103a556 Compare January 11, 2024 03:27
@sirreal sirreal left a comment

I really like the shape this is taking. I've left several thoughts, comments, and questions from this first pass. I'd like to take a look at the processing of comments because I think we can fix that @todo in this PR.

I also want to see what feedback the html5lib tests can give us so I'll take some time to see what it looks like to run them against this PR with additional handling of more node types (one of your todo items in the description).

I haven't gone through everything yet, only the main implementation changes.

Comment on lines 312 to 326
* - `#text` nodes, whose entire token _is_ the modifiable text.
* - Comment nodes and nodes that became comments because of some syntax error. The
* text for these nodes is the portion of the comment inside of the syntax. E.g. for
* `<!-- comment -->` the text is `" comment "` (note that the spaces are part of it).
* - `CDATA` sections, whose text is the content inside of the section itself. E.g. for
* `<![CDATA[some content]]>` the text is `"some content"`.
* - "Funky comments," which are a special case of invalid closing tags whose name is
* invalid. The text for these nodes is the text that a browser would transform into
* an HTML comment when parsing. E.g. for `</%post_author>` the text is `%post_author`.
*
* And there are non-elements which are atomic in nature but have no modifiable text.
* - `DOCTYPE` nodes like `<!DOCTYPE html>` which have no closing tag.
* - The empty end tag `</>` which is ignored in the browser and DOM but exposed
* to the HTML API.

This comment was marked as resolved.

Member Author

@sirreal I'm not sure yet on what to do here. can we tag this for follow-up after merge?

I'm a bit concerned about using specific property names here because this is supposed to be the explanatory section of the documentation and I don't want to couple the description to our own terms; I want it to read comfortably for someone coming in with an HTML background - that is, leave things a bit loose here to guide an understanding without pinning it to one specific technicality.

nonetheless I've taken another pass at the comment to update it based on how this has developed.

Comment on lines +483 to +497
* | *Text node* | Found a #text node; this is plaintext and modifiable. |
* | *CDATA node* | Found a CDATA section; this is modifiable. |
* | *Comment* | Found a comment or bogus comment; this is modifiable. |

This comment was marked as resolved.

Comment on lines +2541 to +2651
case self::STATE_DOCTYPE:
return '#doctype';
Member

Why do we return #doctype here? The html value we'd get from get_token_name is confusing but aligns with what the browser does.

Member Author

this is specifically to expose what kind of token it is. I don't like conflating it with the HTML tag name for an Element, even though one is lower-case and the other upper-case.

in my own explorations I found it helpful to have both functions: one to say the node name like the browser would, and one to say the node type (also like how the browser would). I've also been trying to balance the use of longer constants against clearly-searchable text values since this is a more consumer-oriented function.

switch ( $processor->get_token_type() ) {
	case WP_HTML_Processor::NODE_TYPE_DOCUMENT_TYPE:
	case '#doctype':
		…
}

at this point I'm assuming people will use the string value even if the constant exists. also I started with get_node_type() and get_node_name() but then renamed to _token_ because I wanted to support a slightly different set of kinds; I'm doubting this since discovering the challenge of partial documents with invalid comment syntaxes, but haven't completely abandoned the idea yet, particularly because of the support for presumptuous tags and funky comments, which aren't in the DOM API.

Member

Thanks, I think this makes sense 👍

@dmsnell dmsnell force-pushed the html-api/scan-all-tokens branch 2 times, most recently from 9f29920 to 1098c19 Compare January 15, 2024 17:22
@sirreal sirreal left a comment

Thanks! I think this is ready to merge.

Comment on lines 318 to 331
$processor = WP_HTML_Processor::create_fragment( '<![CDATA[this is a comment]]>' );
$processor->next_token();

$this->assertSame(
'#cdata-section',
$processor->get_token_name(),
"Should have found CDATA section name but found {$processor->get_token_name()} instead."
);
Member

I don't want to get hung up on CDATA and PI handling, this doesn't block merging this PR.

I think this behavior is what you described in Slack here:

…it finds those comments (to the first >), and then if it ends in ]]> and starts with <![CDATA[ we can safely say, "this is a CDATA node" … though the actual rules for those are more complicated and we can only support a subset now

We can discuss this more in a follow-up, but I'm reluctant to diverge from the specification. This is not a cdata-section with the text content "this is a comment" (unless we were in svg or math foreign content); it is a comment with the text content "[CDATA[this is a comment]]".

Member Author

yeah I'm torn a bit but also I find that things are slightly different since we're not building a DOM here. when considering the intentionality behind some HTML string, I think it's evident that if someone writes <![CDATA[something]]> then they clearly meant to produce what they consider a CDATA section, and WordPress itself still creates these for legacy reasons (even though it may be the case outside of WordPress's XML outputs that those aren't needed anymore).

so this does conflate with a comment whose text is [CDATA[this is a comment]], but if we only indicate that it's a comment then we have also lost the ability to differentiate these two strings, which in my opinion have divergent histories and intents:

  • <!--[CDATA[this is a comment]]-->
  • <![CDATA[this is a comment]]>

what I see as the potential failure here is that we hold fixed a comment structure someone can't get rid of because we're only allowing adjustment inside the [ and ], but ultimately in the browser they both disappear as comments.

the case I was far more concerned with is the one we fixed, which is when we think that the inner text is 5 > 3 or [CDATA[5 > 3]] when in fact it's truly 5 or [CDATA[5 since these represent a divergence in token boundaries from the browser (which we still have somewhat at play inside foreign elements).

I'm having similar vibes about representing <!--> and <!---> because right now we're not exposing those as changeable comments. again, someone might miss these because of the representation, but they won't cause the parser to get off track and they won't change the rendered view of the page.

let's keep talking because I'd like to push this as far as possible. I really want it to work that we expose these as separate entities. a possible compromise is to maintain a separate indicator specifying type_of_comment which could be BOGUS_COMMENT, CDATA, VALID_COMMENT, etc…, but that also introduces more API surface so I want to have a good feeling that it's necessary before putting it there.

@dmsnell dmsnell force-pushed the html-api/scan-all-tokens branch from 0ca080a to f502153 Compare January 16, 2024 15:36
*
* <!-->
* <!--->
* <!---->
Member

This one is not abruptly closed, we have start <!-- and end -->, with empty text content.

Here are two examples from the html5lib-tests. There's no comment error with <!---->, but with <!---> there's an "abrupt-closing-of-empty-comment" error.

Suggested change
* <!---->

Member Author

thank you. the comment was wrong but the code appears to have been good.

Member Author

I actually had a last-minute mini-panic thinking that we aren't detecting <!---!> as an abruptly-closed comment, but the Tag Processor is already right! it's not, and the comment continues. thankfully the code in this branch and in trunk handles it properly
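
For readers following this thread, the comment-closing cases discussed above (`<!-->`, `<!--->`, `<!---->`, and `<!---!>`) can be sketched in isolation as follows. This is a simplified standalone illustration with a hypothetical function name, not the Tag Processor's implementation, and it assumes PHP 8 for str_starts_with().

```php
<?php
/**
 * Hypothetical helper: given HTML that starts with `<!--`, return the byte
 * length of the comment token, or null if the comment is never closed.
 */
function comment_token_length( string $html ): ?int {
	if ( ! str_starts_with( $html, '<!--' ) ) {
		return null;
	}

	// Abruptly-closed empty comments end right after the opener.
	if ( str_starts_with( $html, '<!-->' ) ) {
		return 5;
	}
	if ( str_starts_with( $html, '<!--->' ) ) {
		return 6;
	}

	// Otherwise the comment runs to the nearest `-->` or `--!>` after the opener.
	if ( 1 !== preg_match( '/--!?>/', $html, $match, PREG_OFFSET_CAPTURE, 4 ) ) {
		return null;
	}

	return $match[0][1] + strlen( $match[0][0] );
}
```

With this sketch `<!---->` is an ordinary empty comment of seven bytes, while `<!---!>` is not closed at all and the comment continues until a later `-->` or `--!>`.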

@dmsnell dmsnell force-pushed the html-api/scan-all-tokens branch 2 times, most recently from 5065bee to 7d9786d Compare January 23, 2024 05:21
@sirreal sirreal left a comment

I've reviewed all of the recent changes and they look good to me. I left a few suggestions and I want to make sure we add LISTING to the special handling that removes starting newlines for PRE and TEXTAREA content.

@dmsnell dmsnell force-pushed the html-api/scan-all-tokens branch 3 times, most recently from faf9cef to 9d01322 Compare January 24, 2024 21:38
Since its introduction in WordPress 6.2 the HTML Tag Processor has
provided a way to scan through all of the HTML tags in a document and
then read and modify their attributes. In order to reliably do this, it
also needed to be aware of other kinds of HTML syntax, but it didn't
expose those syntax tokens to consumers of the API.

In this patch the Tag Processor introduces a new scanning method and a
few helper methods to read information about or from each token. Most
significantly, this introduces the ability to read `#text` nodes in the
document.

What's new in the Tag Processor?
================================

 - `next_token()` visits every distinct syntax token in a document.
 - `get_token_type()` indicates what kind of token it is.
 - `get_token_name()` returns something akin to `DOMNode.nodeName`.
 - `get_modifiable_text()` returns the text associated with a token.
 - `get_comment_type()` indicates why a token represents an HTML comment.

Example usage.
==============

```php
function strip_all_tags( $html ) {
        $text_content = '';
        $processor    = new WP_HTML_Tag_Processor( $html );

        while ( $processor->next_token() ) {
                if ( '#text' !== $processor->get_token_type() ) {
                        continue;
                }

                $text_content .= $processor->get_modifiable_text();
        }

        return $text_content;
}
```

What changes in the Tag Processor?
==================================

Previously, the Tag Processor would scan the opening and closing tag of
every HTML element separately. Now, however, there are special tags
which it only visits once, as if those elements were void tags without
a closer.

These are special tags because their content contains no other HTML or
markup, only non-HTML content.

 - SCRIPT elements contain raw text which is isolated from the rest of
   the HTML document and fed separately into a JavaScript engine. There
   are complicated rules to avoid escaping the script context in the HTML.
   The contents are left verbatim, and character references are not decoded.

 - TEXTAREA and TITLE elements contain plain text which is decoded
   before display, e.g. transforming `&amp;` into `&`. Any markup which
   resembles tags is treated as verbatim text and not a tag.

 - IFRAME, NOEMBED, NOFRAMES, STYLE, and XMP elements are similar to the
   textarea and title elements, but no character references are decoded.
   For example, `&amp;` inside a STYLE element is passed to the CSS engine
   as the literal string `&amp;` and _not_ as `&`.

Because it's important not to treat this inner content separately from the
elements containing it, the Tag Processor combines them when scanning
into a single match and makes their content available as modifiable
text (see below).

This means that the Tag Processor will no longer visit a closing tag for
any of these elements unless that tag is unexpected.

    <title>There is only a single token in this line</title>
    <title>There are two tokens in this line></title></title>
    </title><title>There are still two tokens in this line></title>

What are tokens?
================

The term "token" here is a parsing term, which means a primitive unit in
HTML. There are only a few kinds of tokens in HTML:

 - a tag has a name, attributes, and a closing or self-closing flag.
 - a text node, or `#text` node contains plain text which is displayed
   in a browser and which is decoded before display.
 - a DOCTYPE declaration indicates how to parse the document.
 - a comment is hidden from the display on a page but present in the HTML.

There are a few more kinds of tokens that the HTML Tag Processor will
recognize, some of which don't exist as concepts in HTML. These mostly
comprise XML syntax elements that aren't part of HTML (such as CDATA and
processing instructions) and invalid HTML syntax that transforms into
comments.

What is a funky comment?
========================

This patch treats a specific kind of invalid comment in a special way.
A closing tag with an invalid name is considered a "funky comment." In
the browser these become HTML comments just like any other, but their
syntax is convenient for representing a variety of bits of information
in a well-defined way and which cannot be nested or recursive, given
the parsing rules handling this invalid syntax.

 - `</1>`
 - `</%avatar_url>`
 - `</{"wp_bit": {"type": "post-author"}}>`
 - `</[post-author]>`
 - `</__( 'Save Post' );>`

All of these examples become HTML comments in the browser. The content
inside the funky comment is easily parsable, whereby the only rule is
that it starts at the `<` and continues until the nearest `>`. There
can be no funky comment inside another, because that would imply having
a `>` inside of one, which would actually terminate the first one.
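
The parsing rule described above is simple enough to sketch in a few lines. The function below is a hypothetical standalone illustration, not the Tag Processor's code: it accepts input that starts with `</`, is followed by a byte that is neither an ASCII letter nor `>`, and it stops at the nearest `>`.

```php
<?php
/**
 * Hypothetical helper: if `$html` starts with a funky comment, return the
 * text between `</` and the nearest `>`; otherwise return null.
 */
function read_funky_comment( string $html ): ?string {
	// An ASCII letter after `</` would start a real tag name, and an
	// immediate `>` would form the empty end tag `</>`.
	if ( 1 !== preg_match( '~^</([^a-zA-Z>][^>]*)>~', $html, $match ) ) {
		return null;
	}

	return $match[1];
}

echo read_funky_comment( '</%avatar_url>' ), "\n";
echo read_funky_comment( '</[post-author]> trailing text' ), "\n";
```

Because the match stops at the first `>`, there is no way to nest one funky comment inside another.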

What is modifiable text?
========================

Modifiable text is similar to the `innerText` property of a DOM node.
It represents the span of text for a given token which may be modified
without changing the structure of the HTML document or the token.

There is currently no mechanism to change the modifiable text, but this
is planned to arrive in a later patch.

Tags
====

Most tags have no modifiable text because they have child nodes where
text nodes are found. Only the special tags mentioned above have
modifiable text.

    <div class="post">Another day in HTML</div>
    └─ tag ──────────┘└─ text node ─────┘└────┴─ tag

    <title>Is <img> &gt; <image>?</title>
    │      └ modifiable text ───┘       │ "Is <img> > <image>?"
    └─ tag ─────────────────────────────┘
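
The difference between decoded and verbatim modifiable text can be approximated outside of WordPress with the standalone sketch below. It uses PHP's html_entity_decode() purely as a stand-in; whether the HTML API decodes character references the same way in every edge case is not something this example claims.

```php
<?php
$raw = 'Is <img> &gt; <image>?';

// TITLE and TEXTAREA: character references are decoded, tag-like text stays text.
$title_text = html_entity_decode( $raw, ENT_QUOTES | ENT_HTML5, 'UTF-8' );

// SCRIPT, STYLE, and the other raw-text elements: contents are left verbatim.
$style_text = $raw;

echo $title_text, "\n";
echo $style_text, "\n";
```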

Text nodes
==========

Text nodes are entirely modifiable text.

    This HTML document has no tags.
    └─ modifiable text ───────────┘

Comments
========

The modifiable text inside a comment is the portion of the comment that
doesn't form its syntax. This applies for a number of invalid comments.

    <!-- this is inside a comment -->
    │   └─ modifiable text ──────┘  │
    └─ comment token ───────────────┘

    <!-->
    This invalid comment has no modifiable text.

    <? this is an invalid comment -->
    │ └─ modifiable text ────────┘  │
    └─ comment token ───────────────┘

    <![CDATA[this is an invalid comment]]>
    │        └─ modifiable text ──────┘  │
    └─ comment token ────────────────────┘

Other token types also have modifiable text. Consult the code or tests
for further information.
@dmsnell dmsnell force-pushed the html-api/scan-all-tokens branch from 9d01322 to 30991d7 Compare January 24, 2024 21:47
dmsnell commented Jan 24, 2024

Merged in 57348
616e673
