Merged
Commits
43 commits
bf4c1e1
Parser: Propose new hand-coded PHP parser
dmsnell Jul 20, 2018
bb7ff54
Fix issue with containing the nested innerHTML
dmsnell Jul 20, 2018
a905f58
Also handle newlines as whitespace
dmsnell Jul 20, 2018
1bebf95
Use classes for some static typing
dmsnell Jul 20, 2018
e0256b3
add type hints
dmsnell Jul 20, 2018
070b4f2
remove needless comment
dmsnell Jul 20, 2018
987b6e6
space where space is due
dmsnell Jul 20, 2018
92c110d
meaningless rename
dmsnell Jul 20, 2018
21132d3
remove needless function call
dmsnell Jul 20, 2018
474eab3
harmonize with spec parser
dmsnell Jul 20, 2018
4501e9a
don't forget freeform HTML before blocks
dmsnell Jul 20, 2018
6ed9e50
account for oddity in spec-parser
dmsnell Jul 20, 2018
029feb0
add some polish, fix a thing
dmsnell Jul 21, 2018
5230045
comment it
dmsnell Jul 21, 2018
760ad75
add JS version too
dmsnell Jul 21, 2018
ce42f86
Change `.` to `[^]` because `/s` isn't well supported in JS
dmsnell Jul 23, 2018
3ed3424
Move code into `/packages` directory, prepare for review
dmsnell Aug 24, 2018
a448817
take out names from RegExp pattern to not fail tests
dmsnell Aug 24, 2018
ed917f3
Fix bug in parser: store HTML soup in stack frames while parsing
dmsnell Aug 25, 2018
b440a86
fix whitespace
dmsnell Aug 25, 2018
76c8d50
fix oddity in spec
dmsnell Aug 25, 2018
e9bd804
match styles
dmsnell Aug 26, 2018
1e91266
use class name filter on server-side parser class
dmsnell Aug 26, 2018
e80a6d9
fix whitespace
dmsnell Aug 26, 2018
45d7c7b
Document extensibility
dmsnell Aug 27, 2018
1b7592a
fix typo in example code
dmsnell Aug 27, 2018
c60b95d
Push failing parsing test
mcsf Aug 29, 2018
10a2097
fix lazy/greedy bug in parser regexp
dmsnell Aug 29, 2018
6a232a4
Docs: Fix typos, links, tweak style.
mcsf Aug 29, 2018
cb13b54
update from PR feedback
dmsnell Aug 30, 2018
ce1864f
trim docs
dmsnell Aug 30, 2018
f5b97a6
Load default block parser, replacing PEG-generated one
mcsf Aug 31, 2018
9c72d5e
Expand `?:` shorthand for PHP 5.2 compat
mcsf Sep 1, 2018
a57e448
add fixtures test for default parser
dmsnell Sep 3, 2018
20e6131
spaces to tabs
dmsnell Sep 3, 2018
0bd5e71
could we need no assoc?
dmsnell Sep 3, 2018
08015d7
fill out return array
dmsnell Sep 3, 2018
1004cbe
put that assoc back in there
dmsnell Sep 3, 2018
3dc74fd
isometrize
dmsnell Sep 3, 2018
22f10de
rename and add 0
dmsnell Sep 3, 2018
a41a995
Conditionally include the parser class
jorgefilipecosta Sep 4, 2018
fe98a4a
Add docblocks
dmsnell Sep 5, 2018
9463906
Standardize the package configuration
gziolo Sep 6, 2018
comment it
dmsnell committed Sep 5, 2018
commit 523004578639bd00ba3b7bde50234b824d1acf6b
69 changes: 69 additions & 0 deletions lib/parser-rd-trampoline.php
@@ -1,5 +1,74 @@
<?php

/**
* Implements the formal specification for parsing Gutenberg documents
* serialized into HTML (nominally in `post_content` of a WordPress post)
*
* @see https://github.com/WordPress/gutenberg/tree/master/packages/block-serialization-spec-parser
*
* ## What is different about this one from the spec-parser?
*
* This is a recursive-descent parser that scans linearly once through the input document.
* Instead of directly recursing, it uses a trampoline mechanism to prevent stack overflow.
* To minimize data copying and passing, it's built into a class with class properties.
* Between every token (a block comment delimiter) we can instrument the parser and intervene.
*
* The spec parser is defined via a _Parsing Expression Grammar_ (PEG), which inherently
* answers many questions that we must answer explicitly in this parser. The goal for this
* implementation is to match the characteristics of the PEG so that it can replace the
* spec parser directly, with the only changes being better runtime performance and memory usage.
*
* ## How does it work?
*
* It's pretty self-explanatory...haha
*
* Every Gutenberg document is nominally an HTML document which in addition to normal HTML may
* also contain specially designed HTML comments - the block comment delimiters - which separate
* and isolate the blocks which are serialized in the document.
*
* This parser attempts to create a kind of state-machine around the transitions triggered from
* those delimiters - the "tokens" of the grammar. Every time we find one we should only be doing
* one of a small set of actions:
*
* - enter a new block
* - exit out of a block
*
* Those actions have different effects depending on the context; for instance, when we exit a
* block we either need to add it to the output block list _or_ we need to append it as the
* next `innerBlock` on the parent block below it in the block stack (the place where we track
* open blocks). The details are documented below.
*
* The biggest challenge in this parser is correctly accounting for the indices required
* to construct the `innerHTML` values for each block at every level of nesting depth. We take
* a simple approach:
*
* - start each newly-opened block with an empty `innerHTML`
* - whenever we push the first block into the `innerBlocks` list, add the content from
* where the content of the parent block started to where this inner block starts
* - whenever we push another block into the `innerBlocks` list, add the content from
* where the previous inner block ended to where this inner block starts
* - when we close out an open block we add the content from where the last inner block
* ended to where the closing block delimiter starts
* - if there are no inner blocks then we take the entire content between the opening and
* closing block comment delimiters as the `innerHTML`
*
* ## I meant, how does it perform?
*
* This parser operates much faster than the generated parser from the specification.
* Because we know more about the parsing than the PEG does, we can take advantage of several
* tricks to improve our speed and memory usage:
*
* - we only have one or two distinct tokens, depending on how you look at it, and they are
* all readily matched via a regular expression. instead of parsing on a
* character-per-character basis we can let the PCRE RegExp engine skip over large swaths
* of the document for us in order to find those tokens.
* - since `preg_match()` takes an `offset` parameter we can crawl through the input
* without passing copies of the input text on every step. we can track our position
* in the string and only pass a number instead
* - not copying all those strings means that we'll also skip many memory allocations
*
*/

function rdt_parse( $document ) {
static $parser;

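Reviewer's sketch of the approach described in the docblock above. This is not the code from this commit: the function name `sketch_parse_blocks`, the frame keys `start` and `prev_end`, and the simplified delimiter pattern (no JSON attributes, no void blocks, no freeform HTML between top-level blocks, no check that a closer's name matches its opener) are all assumptions made for illustration. It only tries to show two ideas from the comment: scanning with the `offset` argument of `preg_match()` instead of copying the input, and building `innerHTML` from tracked indices while a plain loop with an explicit stack stands in for recursion.

```php
<?php
// Illustrative sketch only; all names are hypothetical and the grammar is simplified.
function sketch_parse_blocks( $document ) {
	// Simplified token: "<!-- wp:name -->" opens a block, "<!-- /wp:name -->" closes one.
	$token  = '~<!--\s+(/)?wp:([a-z][a-z0-9_/-]*)\s+-->~';
	$output = array(); // completed top-level blocks
	$stack  = array(); // currently open blocks, innermost last
	$offset = 0;       // scan position; only this integer advances between matches

	// A loop with an explicit stack, so nesting depth cannot overflow the call stack.
	while ( 1 === preg_match( $token, $document, $m, PREG_OFFSET_CAPTURE, $offset ) ) {
		$start  = $m[0][1];                    // byte offset where the delimiter starts
		$offset = $start + strlen( $m[0][0] ); // resume right after the delimiter

		if ( '/' !== $m[1][0] ) {
			// Opener: push a frame that starts with an empty innerHTML.
			$stack[] = array(
				'blockName'   => $m[2][0],
				'innerHTML'   => '',
				'innerBlocks' => array(),
				'start'       => $start,  // where this block's opener begins
				'prev_end'    => $offset, // where not-yet-collected content begins
			);
			continue;
		}

		if ( empty( $stack ) ) {
			continue; // stray closer: ignored in this sketch
		}

		// Closer: collect content from the last inner block (or the opener) up to here.
		$frame = array_pop( $stack );
		$frame['innerHTML'] .= substr( $document, $frame['prev_end'], $start - $frame['prev_end'] );
		$block = array(
			'blockName'   => $frame['blockName'],
			'innerHTML'   => $frame['innerHTML'],
			'innerBlocks' => $frame['innerBlocks'],
		);

		if ( empty( $stack ) ) {
			$output[] = $block; // top level: goes to the output list
			continue;
		}

		// Nested: hand the parent the HTML preceding this child, then the child itself.
		$parent = &$stack[ count( $stack ) - 1 ];
		$parent['innerHTML']    .= substr( $document, $parent['prev_end'], $frame['start'] - $parent['prev_end'] );
		$parent['innerBlocks'][] = $block;
		$parent['prev_end']      = $offset;
		unset( $parent ); // drop the reference before the next iteration
	}

	return $output; // blocks still open at the end of input are dropped in this sketch
}
```

Under these assumptions, parsing `<!-- wp:a -->x<!-- wp:b -->y<!-- /wp:b -->z<!-- /wp:a -->` yields one `a` block whose `innerHTML` is `xz`, containing one inner block `b` whose `innerHTML` is `y`. Tracking `prev_end` per frame lets the first-inner-block, later-inner-block, and no-inner-block cases from the docblock share the same two `substr()` calls.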