HTML API: Provide mechanism to scan all tokens in an HTML document, not only the tags. #5683
…, end)
This patch follows up on earlier design questions around how to represent spans of strings inside the class. It's relevant now as preparation for WordPress#5683.
The mixture of (offset, length) and (start, end) coordinates becomes confusing at times and all final string operations are performed with the (offset, length) pair, since these feed into `strlen()`.
In preparation for exposing all tokens within an HTML document this change:
- Unifies the representation throughout the class.
- Creates `token_starts_at` to track the start of the current token.
- Replaces `tag_ends_at` with `token_length` for re-use with other token types.
There should be no functional or behavioral changes in this patch.
For the internal helper classes this patch introduces breaking changes, but those classes are marked private and should not be used outside of the HTML API itself.
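As a rough illustration of why a single coordinate pair is easier to work with, the sketch below slices the same span both ways. It assumes the final slicing happens through PHP's `substr()`; the sample string and the `$start`/`$end` variables are hypothetical, and `$token_starts_at`/`$token_length` only borrow the property names from this patch as plain variables.

```php
$html = '<div>text</div>';

// (start, end) form: the end offset has to be converted to a length
// before it can be handed to substr().
$start  = 5;
$end    = 9; // Exclusive end offset (hypothetical).
$text_a = substr( $html, $start, $end - $start );

// (offset, length) form: the pair feeds substr() directly.
$token_starts_at = 5;
$token_length    = 4;
$text_b          = substr( $html, $token_starts_at, $token_length );
```

Both variables end up holding the same slice; the second form simply avoids the subtraction at every call site.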
sirreal
left a comment
I really like the shape this is taking. I've left several thoughts, comments, and questions from this first pass. I'd like to take a look at the processing of comments because I think we can fix that @todo in this PR.
I also want to see what feedback the html5lib tests can give us so I'll take some time to see what it looks like to run them against this PR with additional handling of more node types (one of your todo items in the description).
I haven't gone through everything yet, only the main implementation changes.
 * - `#text` nodes, whose entire token _is_ the modifiable text.
 * - Comment nodes and nodes that became comments because of some syntax error. The
 *   text for these nodes is the portion of the comment inside of the syntax. E.g. for
 *   `<!-- comment -->` the text is `" comment "` (note that the spaces are part of it).
 * - `CDATA` sections, whose text is the content inside of the section itself. E.g. for
 *   `<![CDATA[some content]]>` the text is `"some content"`.
 * - "Funky comments," which are a special case of invalid closing tags whose name is
 *   invalid. The text for these nodes is the text that a browser would transform into
 *   an HTML comment when parsing. E.g. for `</%post_author>` the text is `%post_author`.
 *
 * And there are non-elements which are atomic in nature but have no modifiable text.
 * - `DOCTYPE` nodes like `<DOCTYPE html>` which have no closing tag.
 * - The empty end tag `</>` which is ignored in the browser and DOM but exposed
 *   to the HTML API.
@sirreal I'm not sure yet on what to do here. can we tag this for follow-up after merge?
I'm a bit concerned about using specific property names here because this is supposed to be the explanatory section of the documentation and I don't want to couple the description to our own terms; I want it to read comfortably for someone coming in with an HTML background - that is, leave things a bit loose here to guide an understanding without pinning it to one specific technicality.
nonetheless I've taken another pass at the comment to update it based on how this has developed.
 * | *Text node*  | Found a #text node; this is plaintext and modifiable. |
 * | *CDATA node* | Found a CDATA section; this is modifiable.            |
 * | *Comment*    | Found a comment or bogus comment; this is modifiable. |
case self::STATE_DOCTYPE:
	return '#doctype';
Why do we return #doctype here? The html value we'd get from get_token_name is confusing but aligns with what the browser does.
this is specifically to expose what kind of token it is. I don't like conflating it with the HTML tag name for an Element, even though one is lower-case and the other upper-case.
in my own explorations I found it helpful to have both functions: one to say the node name like the browser would, and one to say the node type (also like how the browser would). I've also been trying to balance the use of longer constants against clearly-searchable text values since this is a more consumer-oriented function.
switch ( $processor->get_token_type() ) {
	case WP_HTML_Processor::NODE_TYPE_DOCUMENT_TYPE:
	case '#doctype':
		…
}
at this point I'm assuming people will use the string value even if the constant exists. also I started with get_node_type() and get_node_name() but then renamed to _token_ because I wanted to support a slightly different set of kinds; I'm doubting this since discovering the challenge of partial documents with invalid comment syntaxes, but haven't completely abandoned the idea yet, particularly because of the support for presumptuous tags and funky comments, which aren't in the DOM API.
Thanks, I think this makes sense 👍
sirreal
left a comment
Thanks! I think this is ready to merge.
$processor = WP_HTML_Processor::create_fragment( '<![CDATA[this is a comment]]>' );
$processor->next_token();

$this->assertSame(
	'#cdata-section',
	$processor->get_token_name(),
	"Should have found CDATA section name but found {$processor->get_token_name()} instead."
);
I don't want to get hung up on CDATA and PI handling, this doesn't block merging this PR.
I think this behavior is what you described in Slack here:
…it finds those comments (to the first `>`), and then if it ends in `]]>` and starts with `<![CDATA[` we can safely say, "this is a CDATA node" … though the actual rules for those are more complicated and we can only support a subset now
We can discuss this more in a follow-up, but I'm reluctant to diverge from the specification. This is not a cdata-section with the text content `this is a comment` (unless we were in svg or math foreign content), this is a comment with the text content `[CDATA[this is a comment]]`.
yeah I'm torn a bit but also I find that things are slightly different since we're not building a DOM here. when considering the intentionality behind some HTML string, I think it's evident that if someone writes `<![CDATA[something]]>` then they clearly meant to produce what they consider a CDATA section, and WordPress itself still creates these for legacy reasons (even though it may be the case outside of WordPress's XML outputs that those aren't needed anymore).
so this does conflate with a comment whose text is `[CDATA[this is a comment]]`, but if we only indicate that, we have also lost the ability to differentiate these two strings, which in my opinion have divergent histories and intents:
<!--[CDATA[this is a comment]]-->
<![CDATA[this is a comment]]>
what I see as the potential failure here is that we hold fixed a comment structure someone can't get rid of because we're only allowing adjustment inside the `[` and `]`, but ultimately in the browser they both disappear as comments.
the case I was far more concerned with is the one we fixed, which is when we think that the inner text is `5 > 3` or `[CDATA[5 > 3]]` when in fact it's truly `5 ` or `[CDATA[5 ` since these represent a divergence in token boundaries from the browser (which we still have somewhat at play inside foreign elements).
I'm having similar vibes about representing `<!-->` and `<!--->` because right now we're not exposing those as changeable comments. again, someone might miss these because of the representation, but they won't cause the parser to get off track and they won't change the rendered view of the page.
let's keep talking because I'd like to push this as far as possible. I really want it to work that we expose these as separate entities. a possible compromise is to maintain a separate indicator specifying type_of_comment which could be BOGUS_COMMENT, CDATA, VALID_COMMENT, etc…, but that also introduces more API surface so I want to have a good feeling that it's necessary before putting it there.
);

$processor->next_token();
 *
 *     <!-->
 *     <!--->
 *     <!---->
This one is not abruptly closed, we have start <!-- and end -->, with empty text content.
Here are two examples from the html5lib-tests. There's no comment error with <!---->, but with <!---> there's an "abrupt-closing-of-empty-comment" error.
 *     <!---->
thank you. the comment was wrong but the code appears to have been good.
I actually had a last-minute mini-panic thinking that we aren't detecting <!---!> as an abruptly-closed comment, but the Tag Processor is already right! it's not, and the comment continues. thankfully the code in this branch and in trunk handles it properly
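The distinction discussed in this thread can be sketched with a toy classifier. This is a simplified illustration under stated assumptions, not the Tag Processor's code: `classify_comment_opening()` is a hypothetical helper, it expects its input to begin with `<!--`, and it looks only for the `-->` and `--!>` closers.

```php
/**
 * Toy classifier for a string that begins with `<!--`.
 * Hypothetical helper, not the Tag Processor's implementation.
 */
function classify_comment_opening( string $html ): string {
	// `<!-->` and `<!--->` end the comment before any text is read.
	if ( 0 === strpos( $html, '<!-->' ) || 0 === strpos( $html, '<!--->' ) ) {
		return 'abruptly-closed';
	}

	// Otherwise the comment runs until `-->` or `--!>` after the opener.
	$after_opener = substr( $html, 4 );
	if ( false !== strpos( $after_opener, '-->' ) || false !== strpos( $after_opener, '--!>' ) ) {
		return 'closed';
	}

	// No closer was found, so the comment continues past this input.
	return 'unclosed';
}
```

Under these rules `<!---->` counts as a normally closed comment with empty text, while `<!---!>` contains no closer and is reported as continuing.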
sirreal
left a comment
I've reviewed all of the recent changes and they look good to me. I left a few suggestions and I want to make sure we add LISTING to the special handling that removes starting newlines for PRE and TEXTAREA content.
Since its introduction in WordPress 6.2 the HTML Tag Processor has
provided a way to scan through all of the HTML tags in a document and
then read and modify their attributes. In order to reliably do this, it
also needed to be aware of other kinds of HTML syntax, but it didn't
expose those syntax tokens to consumers of the API.
In this patch the Tag Processor introduces a new scanning method and a
few helper methods to read information about or from each token. Most
significantly, this introduces the ability to read `#text` nodes in the
document.
What's new in the Tag Processor?
================================
- `next_token()` visits every distinct syntax token in a document.
- `get_token_type()` indicates what kind of token it is.
- `get_token_name()` returns something akin to `DOMNode.nodeName`.
- `get_modifiable_text()` returns the text associated with a token.
- `get_comment_type()` indicates why a token represents an HTML comment.
Example usage.
==============
```php
function strip_all_tags( $html ) {
$text_content = '';
$processor = new WP_HTML_Tag_Processor( $html );
while ( $processor->next_token() ) {
if ( '#text' !== $processor->get_token_type() ) {
continue;
}
$text_content .= $processor->get_modifiable_text();
}
return $text_content;
}
```
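A second sketch, in the same spirit, records what each of the new getters reports for every token, which can help when exploring how a given document is tokenized. The function name `list_tokens()` and the array keys are hypothetical; the only assumption about `$processor` is that it exposes the four methods listed above.

```php
/**
 * Collects one row per token: its type, name, and modifiable text.
 * Hypothetical helper; $processor is assumed to expose the methods
 * introduced in this patch.
 */
function list_tokens( $processor ) {
	$rows = array();

	while ( $processor->next_token() ) {
		$rows[] = array(
			'type' => $processor->get_token_type(),
			'name' => $processor->get_token_name(),
			'text' => $processor->get_modifiable_text(),
		);
	}

	return $rows;
}

// Assumed usage inside WordPress:
// $rows = list_tokens( new WP_HTML_Tag_Processor( $html ) );
```

Because the function only relies on method names, it can also be exercised with a stand-in object in tests.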
What changes in the Tag Processor?
==================================
Previously, the Tag Processor would scan the opening and closing tag of
every HTML element separately. Now, however, there are special tags
which it only visits once, as if those elements were void tags without
a closer.
These are special tags because their content contains no other HTML or
markup, only non-HTML content.
- SCRIPT elements contain raw text which is isolated from the rest of
the HTML document and fed separately into a JavaScript engine. There
are complicated rules to avoid escaping the script context in the HTML.
The contents are left verbatim, and character references are not decoded.
- TEXTAREA and TITLE elements contain plain text which is decoded
before display, e.g. transforming `&amp;` into `&`. Any markup which
resembles tags is treated as verbatim text and not a tag.
- IFRAME, NOEMBED, NOFRAMES, STYLE, and XMP elements are similar to the
textarea and title elements, but no character references are decoded.
For example, `&amp;` inside a STYLE element is passed to the CSS engine
as the literal string `&amp;` and _not_ as `&`.
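The difference between the decoded and verbatim groups can be sketched as follows. This is only an illustration: `interpret_special_content()` is a hypothetical helper, and PHP's `html_entity_decode()` stands in for whatever character reference decoding the HTML API actually performs.

```php
/**
 * Hypothetical helper: how the raw content of a special element would be
 * interpreted, following the groups in the list above.
 */
function interpret_special_content( string $tag_name, string $raw ): string {
	$decoded_elements = array( 'TEXTAREA', 'TITLE' );

	if ( in_array( strtoupper( $tag_name ), $decoded_elements, true ) ) {
		// Assumption: html_entity_decode() approximates the real decoder.
		return html_entity_decode( $raw, ENT_QUOTES | ENT_HTML5, 'UTF-8' );
	}

	// SCRIPT, STYLE, IFRAME, NOEMBED, NOFRAMES, and XMP: left verbatim.
	return $raw;
}
```

For a TITLE element the character reference is decoded, while the same bytes inside a STYLE element are returned untouched.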
Because it's important not to treat this inner content separately from the
elements containing it, the Tag Processor combines them when scanning
into a single match and makes their content available as modifiable
text (see below).
This means that the Tag Processor will no longer visit a closing tag for
any of these elements unless that tag is unexpected.
<title>There is only a single token in this line</title>
<title>There are two tokens in this line></title></title>
</title><title>There are still two tokens in this line></title>
What are tokens?
================
The term "token" here is a parsing term, which means a primitive unit in
HTML. There are only a few kinds of tokens in HTML:
- a tag has a name, attributes, and a closing or self-closing flag.
- a text node, or `#text` node contains plain text which is displayed
in a browser and which is decoded before display.
- a DOCTYPE declaration indicates how to parse the document.
- a comment is hidden from the display on a page but present in the HTML.
There are a few more kinds of tokens that the HTML Tag Processor will
recognize, some of which don't exist as concepts in HTML. These mostly
comprise XML syntax elements that aren't part of HTML (such as CDATA and
processing instructions) and invalid HTML syntax that transforms into
comments.
What is a funky comment?
========================
This patch treats a specific kind of invalid comment in a special way.
A closing tag with an invalid name is considered a "funky comment." In
the browser these become HTML comments just like any other, but their
syntax is convenient for representing a variety of bits of information
in a well-defined way and which cannot be nested or recursive, given
the parsing rules handling this invalid syntax.
- `</1>`
- `</%avatar_url>`
- `</{"wp_bit": {"type": "post-author"}}>`
- `</[post-author]>`
- `</__( 'Save Post' );>`
All of these examples become HTML comments in the browser. The content
inside the funky comment is easily parsable, whereby the only rule is
that it starts at the `<` and continues until the nearest `>`. There
can be no funky comment inside another, because that would imply having
a `>` inside of one, which would actually terminate the first one.
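The parsing rule can be approximated with a single regular expression. This is a toy sketch, not how the Tag Processor finds these tokens: `find_funky_comments()` is a hypothetical helper, and it scans the raw string without any awareness of context, so it would also match inside SCRIPT content, attribute values, or real comments.

```php
/**
 * Hypothetical helper returning the inner text of every funky comment
 * found by a context-free scan of the input.
 */
function find_funky_comments( string $html ): array {
	// `</` followed by a character that is neither an ASCII letter nor `>`,
	// then everything up to the nearest `>`.
	preg_match_all( '~</([^a-zA-Z>][^>]*)>~', $html, $matches );

	return $matches[1];
}
```

Ordinary closing tags such as `</p>` and the empty end tag `</>` are skipped because the character after `</` is a letter or `>`.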
What is modifiable text?
========================
Modifiable text is similar to the `innerText` property of a DOM node.
It represents the span of text for a given token which may be modified
without changing the structure of the HTML document or the token.
There is currently no mechanism to change the modifiable text, but this
is planned to arrive in a later patch.
Tags
====
Most tags have no modifiable text because they have child nodes where
text nodes are found. Only the special tags mentioned above have
modifiable text.
<div class="post">Another day in HTML</div>
└─ tag ──────────┘└─ text node ─────┘└────┴─ tag
<title>Is <img> &gt; <image>?</title>
│      └ modifiable text ───┘       │ "Is <img> > <image>?"
└─ tag ─────────────────────────────┘
Text nodes
==========
Text nodes are entirely modifiable text.
This HTML document has no tags.
└─ modifiable text ───────────┘
Comments
========
The modifiable text inside a comment is the portion of the comment that
doesn't form its syntax. This applies for a number of invalid comments.
<!-- this is inside a comment -->
│   └─ modifiable text ──────┘  │
└─ comment token ───────────────┘
<!-->
This invalid comment has no modifiable text.
<? this is an invalid comment -->
│ └─ modifiable text ────────┘  │
└─ comment token ───────────────┘
<![CDATA[this is an invalid comment]]>
│        └─ modifiable text ──────┘  │
└─ comment token ────────────────────┘
Other token types also have modifiable text. Consult the code or tests
for further information.
Trac ticket: Core-60170
Companion port into Gutenberg: WordPress/gutenberg#58107 (contains additional porting code)
This PR provides full tokenization scanning of an HTML document. This is being added into the Tag Processor and will be a necessary component for a number of related changes to the HTML API:
- Enables syntax-aware processing such as `wp_truncate_html()` [gist]
- Replaces/incorporates chunked/extended processing in #5050
- Replaces/incorporates stopping at comments in dmsnell#7
- Provides critical functionality for inner/outer getter/setter in [dmsnell#10, #4965]
Depends on #5721 ✅
Depends on #5725 ✅
Todo
- `$this->bytes_already_parsed` assignments and make sure they are proper. I think half of them are one off.
- `MATCHED_TAG | TEXT_NODE` and `INCOMPLETE | COMPLETE`, which could simplify some logic that's spread in `if` statements.
- `<!--->`.
- `>`, not the closing `]]>` or the closing `?>`. So we can find all HTML comments, and then determine if they would have been a CDATA or PI Node if HTML supported those.
- `<?for-each?>` from `<--for-each-->`.
Design Changes
In this change we're introducing two features stemming from two internal changes:
- `next_token()` provides the ability to scan every token in the HTML stream.
The internal changes powering this are:
For example, when encountering an HTML comment the parser will track the following token information:
Not every token will have a text region, but it's important to track the entire token and any text region because similar tokens may have different syntax. For example, an invalid comment is still a comment.
This holds for tokens whose entire content is text, such as with the `#text` node.
Special HTML tags have modifiable text and that isn't part of `.textContent` or `.innerText`. For example, the `TITLE` element contains no HTML inside of it and everything is plaintext and its contents don't appear in the page. The same is true for `TEXTAREA` and `SCRIPT` and `STYLE` and a few more elements.
Scanning tokens
In order to keep the `next_tag()` interface and use clear, it is left unchanged. For operations needing access to the token stream, there is no built-in query mechanism and querying ought to be performed inside a `next_token()` loop.
`get_token_type()` indicates what kind of token is currently matched, `get_token_name()` returns something that more closely matches what a DOM API would return, and `get_modifiable_text()` returns the modifiable text if available.
TODO
- `next_token()` method to scan each token.
- `SCRIPT`, `STYLE`, `TITLE`, `TEXTAREA`, etc…
- `SCRIPT` tags and other tags with special closing rules. These are currently handled by skipping to the end of the element when finding the starting tag, but this has introduced a few challenges and bugs (for example, the Tag Processor fails to stop at a `<title>` tag if the document ends before the `</title>` closer is found).
- `rewind()` method to reverse to the start of the document.