Hi,
I'd like to suggest that conformant parsers no longer be required to report the exact number of parse errors. The tree-construction README currently says:
Then there must be a line that says "#errors". It must be followed by one line per parse error that a conformant checker would return. It doesn't matter what those lines are, although they can't be "#new-errors", "#document-fragment", "#document", "#script-off", "#script-on", or empty, the only thing that matters is that there be the right number of parse errors.
Source: https://github.com/html5lib/html5lib-tests/blob/master/tree-construction/README.md
I suggest prefixing this requirement with "Optional: " or similar, to relax it.
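To make the proposal concrete, here is a rough sketch of how a test runner might treat the error count as opt-in while still always comparing the tree. The function `check_case` and its `strict_errors` flag are hypothetical names for illustration; they are not part of html5lib-tests or any existing runner.

```python
from typing import List, Optional


def check_case(
    expected_tree: str,
    actual_tree: str,
    expected_errors: List[str],
    actual_errors: Optional[List[str]] = None,
    strict_errors: bool = False,
) -> List[str]:
    """Return failure messages; an empty list means the case passed.

    Hypothetical helper, not an html5lib-tests API.
    """
    failures = []
    # The serialized tree must always match.
    if expected_tree != actual_tree:
        failures.append("tree mismatch")
    # The error count is compared only when the runner opts in
    # and the parser actually reports errors.
    if strict_errors and actual_errors is not None:
        if len(expected_errors) != len(actual_errors):
            failures.append(
                f"expected {len(expected_errors)} errors, got {len(actual_errors)}"
            )
    return failures
```

With this shape, a parser that batches errors differently still passes by default, and runners that want the current behavior can pass `strict_errors=True`.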
Example 1: <table>ABC</table>
When the parser sees A, B, C while in the “in table” insertion mode, it must foster parent those characters (move them outside the table in a specific way). But the error reporting granularity is inherently arbitrary:
- Implementation A (perfectly reasonable): emits one parse error for the whole run of “unexpected character(s) in table”.
- Implementation B (also reasonable, and what html5lib-style tests often assume): emits one per character → 3 parse errors (A, B, C).
- Implementation C (also reasonable): emits one per token, where the tokenizer may have produced a single Character token "ABC" → 1 parse error.
All three implementations build the same DOM per the algorithm, and all three can give users a useful signal. The number (1 vs 3) is just a policy decision about batching/deduping, not a semantic requirement.
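The three policies can be sketched as counting functions over the character tokens that get foster parented. This is only an illustration: the function names and the list-of-strings token model are assumptions, not how any particular parser represents tokens.

```python
from typing import List


def count_per_run(char_tokens: List[str]) -> int:
    """Implementation A: one error for the whole run of unexpected text."""
    return 1 if char_tokens else 0


def count_per_character(char_tokens: List[str]) -> int:
    """Implementation B: one error per foster-parented character."""
    return sum(len(token) for token in char_tokens)


def count_per_token(char_tokens: List[str]) -> int:
    """Implementation C: one error per character token from the tokenizer."""
    return len(char_tokens)


# A tokenizer that emits the text inside <table>ABC</table> as one token.
batched = ["ABC"]
# A tokenizer that emits one token per character.
unbatched = ["A", "B", "C"]

for tokens in (batched, unbatched):
    print(
        tokens,
        count_per_run(tokens),
        count_per_character(tokens),
        count_per_token(tokens),
    )
```

Note that under Implementation C the count even depends on how the tokenizer happens to batch characters, which is an internal detail that does not affect the resulting DOM.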
Example 2: <div><span></div>
Tree-building must recover from the misnesting. There are multiple “plausible” error-reporting strategies:
- Emit an error when </div> closes while <span> is still open (end-tag-too-early style).
- Emit an error later at EOF for the missing </span> (an “expected closing tag but got EOF” style).
- Emit both (two errors), or collapse them into one (“misnested tags”).
The recovery behavior can be identical, while the count varies based on whether you report at the first detectable symptom, at EOF, or both.
Example 3: &notit; (entity prefix situation)
Tokenization can decode this as ¬it; (because &not is a legacy named reference that is recognized without a trailing semicolon), but error reporting could be:
- one “missing semicolon after character reference” error,
- or none (parser chooses not to report),
- or a different error taxonomy (some implementations treat it as “named entity used without semicolon”).
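The decoding half of this example can be checked with Python's standard-library `html.unescape`, which applies the HTML5 character reference rules. It returns only the decoded text and reports no parse errors at all, so it is an instance of the "none" strategy above. The helper name `decode` is just for this sketch.

```python
import html


def decode(text: str) -> str:
    """Decode character references using the standard library's HTML5 rules."""
    return html.unescape(text)


# "&not" is matched as a prefix; the rest ("it;") is kept as literal text.
print(decode("&notit;"))  # → ¬it;

# When a longer name with a semicolon matches, it is used instead.
print(decode("&notin;"))  # → ∉
```

The decoded output is fully determined, while the number of reported errors is whatever the implementation chooses to surface.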
Since the specification does not prescribe an error count for any of these cases, I don't think it makes sense for the test suite to require errors to match, even in number.