Refactor HTML Parser #803
Merged
Changes from 35 commits
67 commits
- 585f207 Refactor HTML Parser (waylan)
- 77baade fix silly error (waylan)
- d4c8951 Add some new tests (waylan)
- 356f5c3 More tests. (waylan)
- ff0f8f2 Round out tests of valid markup. (waylan)
- 6efe8d5 Some cleanup and bugfixes (waylan)
- e5f9ca4 Some minor tweaks (waylan)
- 95e8498 comments partially fixed. (waylan)
- ea98546 Support 0-3 spaces of indent for raw HTML blocks (waylan)
- 23e41d3 Remove need to wrap raw in blank lines (waylan)
- 46b3a1b More tests passing (waylan)
- 8a17794 All handle_* methods are now defined and tested (waylan)
- 845637a Some test cleanup (waylan)
- eee4e49 Monkeypatch HTMLParser piclose (waylan)
- b8f70b7 unknown_decl is not a handle method (waylan)
- 7a8a6b5 Switch back to a preprocessor (waylan)
- 22151c7 Start audit of legacy tests (waylan)
- a0c37e1 More legacy test audits. (waylan)
- 0e4a545 More test audits (waylan)
- 49c187d Fix amperstand handling (waylan)
- 3bc2960 preserve actual closing tags (waylan)
- 4953272 More bugs fixed (waylan)
- 29cc7ba Account for code spans at start of line. (waylan)
- d09d602 Code spans at start of line 2nd attempt. (waylan)
- 1e16fd0 Drop py2 and cleanup after rebase. (waylan)
- 9fe2473 First attempt at md in raw. (waylan)
- e4a8796 Support markdown=1 (waylan)
- 1d17525 Eliminate extra blank lines. (waylan)
- 6b4b351 Add more tests (waylan)
- c0194f3 Track index of containing tag in stack. (waylan)
- 23375a5 Minor tweaks. (waylan)
- 9ffead5 break md_in_html out into subclass of HTML parser. (waylan)
- e3ff368 Only put raw tags in stack. (waylan)
- c96efad Refactor and simplify logic. (waylan)
- 37ff86a Disable 'incomplete' entity handling of HTMLParser. (waylan)
- f02b427 Fixed whitespace issues. (waylan)
- efa36c8 Import copy of html.parser so our monkeypatches don't break user's code. (waylan)
- a8145f8 Handle raw blocks in tail of previous block. (waylan)
- 70d2624 Account for extra whitespace on blank lines. (waylan)
- 335816e Handle inline raw html in tail. (waylan)
- 5776e97 Update md_in_html with recent htmlparser changes. (waylan)
- 4888464 Add test_md_in_html.py (waylan)
- aae6676 More tests (waylan)
- 183537f Handle markdown=1 attrs. (waylan)
- 7783d48 Fix some bugs. (waylan)
- cae2ef0 track mdstate down and back up nested elements. (waylan)
- 56111c4 fix nested multiline paragraphs. (waylan)
- dda2755 Move link reference handling to block parser. (waylan)
- 370d601 Move abbr reference handling to block parser. (waylan)
- 81ac09d Move footnote reference handling to block parser. (waylan)
- 6b068e3 Cleanup (waylan)
- 7a85397 Remove reference to comments and PIs in TreeBuilder as unused. (waylan)
- 42299a8 Remove other reference to comments and PIs in TreeBuilder. (waylan)
- fbae484 Rewrite extension docs. (waylan)
- 097f52c Fix normalization docs to match behavior. (waylan)
- df14000 Update spelling dict with unclosed (waylan)
- f61eb28 Address some coverage. (waylan)
- 2d8ce54 Ensure extension doesn't break default behavior. (waylan)
- 4856e86 update abbr tests (waylan)
- 07c9267 add basic link ref tests. (waylan)
- 82b97e5 flake8 cleanup (waylan)
- 1a0a893 footnote tests. 100% patch coverage (waylan)
- 46ac436 Add test for case in #1012. (waylan)
- 9cfbf20 Add release notes. (waylan)
- 1eb9fd3 Avoid duplicate tests. (waylan)
- 6f3b417 Fix a broken link (waylan)
- 15b431a Final cleanup. (waylan)

@@ -0,0 +1,175 @@

    """
    Python Markdown

    A Python implementation of John Gruber's Markdown.

    Documentation: https://python-markdown.github.io/
    GitHub: https://github.com/Python-Markdown/markdown/
    PyPI: https://pypi.org/project/Markdown/

    Started by Manfred Stienstra (http://www.dwerg.net/).
    Maintained for a few years by Yuri Takhteyev (http://www.freewisdom.org).
    Currently maintained by Waylan Limberg (https://github.com/waylan),
    Dmitry Shachnev (https://github.com/mitya57) and Isaac Muse (https://github.com/facelessuser).

    Copyright 2007-2020 The Python Markdown Project (v. 1.7 and later)
    Copyright 2004, 2005, 2006 Yuri Takhteyev (v. 0.2-1.6b)
    Copyright 2004 Manfred Stienstra (the original version)

    License: BSD (see LICENSE.md for details).
    """

    from html import parser
    import re

    # Monkeypatch HTMLParser to only accept `?>` to close Processing Instructions.
    parser.piclose = re.compile(r'\?>')
    # Monkeypatch HTMLParser to only recognize entity references with a closing semicolon.
    parser.entityref = re.compile(r'&([a-zA-Z][-.a-zA-Z0-9]*);')
    # Monkeypatch HTMLParser to no longer support partial entities. We are always feeding a complete block,
    # so the 'incomplete' functionality is unnecessary. As the entityref regex is run right before incomplete,
    # and the two regex are the same, then incomplete will simply never match and we avoid the logic within.
    parser.incomplete = parser.entityref
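
To see what the `piclose` and `entityref` monkeypatches change, the stricter patterns can be compared with looser ones outside the parser. This is a minimal sketch using only `re`. The two "stock" patterns are written from memory of CPython's `html.parser` module and are an assumption, so check them against your Python version. The sample strings are made up for illustration.

```python
import re

# Assumed stock patterns (from memory of CPython's html.parser; verify locally).
stock_piclose = re.compile('>')
stock_entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')

# The patched patterns, copied from the diff above.
patched_piclose = re.compile(r'\?>')
patched_entityref = re.compile(r'&([a-zA-Z][-.a-zA-Z0-9]*);')

# A processing instruction that contains a bare '>' before its real end.
pi = '<?php echo "a > b"; ?>'
# Search from index 2, i.e. just past the '<?' opener.
stock_end = stock_piclose.search(pi, 2).end()
patched_end = patched_piclose.search(pi, 2).end()

# Text with a bare ampersand ('AT&T') and a real entity ('&amp;').
text = 'AT&T and &amp; more'
stock_names = stock_entityref.findall(text)
patched_names = patched_entityref.findall(text)

print(pi[:stock_end])     # the PI as the looser pattern would cut it
print(pi[:patched_end])   # the PI as the stricter pattern would cut it
print(stock_names, patched_names)
```

With the looser pattern the processing instruction is cut at the first `>`, and `&T ` is treated as an entity reference. The stricter patterns require `?>` and a closing semicolon respectively.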


    class HTMLExtractor(parser.HTMLParser):
        """
        Extract raw HTML from text.

        The raw HTML is stored in the `htmlStash` of the Markdown instance passed
        to `md` and the remaining text is stored in `cleandoc` as a list of strings.
        """

        def __init__(self, md, *args, **kwargs):
            if 'convert_charrefs' not in kwargs:
                kwargs['convert_charrefs'] = False
            # This calls self.reset
            super().__init__(*args, **kwargs)
            self.md = md

        def reset(self):
            """Reset this instance. Loses all unprocessed data."""
            self.inraw = False
            self.stack = []  # When inraw==True, stack contains a list of tags
            self._cache = []
            self.cleandoc = []
            super().reset()

        def close(self):
            """Handle any buffered data."""
            super().close()
            # Handle any unclosed tags.
            if len(self._cache):
                self.cleandoc.append(self.md.htmlStash.store(''.join(self._cache)))
                self._cache = []

        @property
        def line_offset(self):
            """Returns char index in self.rawdata for the start of the current line. """
            if self.lineno > 1:
                return re.match(r'([^\n]*\n){{{}}}'.format(self.lineno-1), self.rawdata).end()
            return 0

        def at_line_start(self):
            """
            Returns True if current position is at start of line.

            Allows for up to three blank spaces at start of line.
            """
            if self.offset == 0:
                return True
            if self.offset > 3:
                return False
            # Confirm up to first 3 chars are whitespace
            return self.rawdata[self.line_offset:self.line_offset + self.offset].strip() == ''

        def get_endtag_text(self, tag):
            """
            Returns the text of the end tag.

            If it fails to extract the actual text from the raw data, it builds a closing tag with `tag`.
            """
            # Attempt to extract actual tag from raw source text
            start = self.line_offset + self.offset
            m = parser.endendtag.search(self.rawdata, start)
            if m:
                return self.rawdata[start:m.end()]
            else:
                # Failed to extract from raw data. Assume well formed and lowercase.
                return '</{}>'.format(tag)
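
The `line_offset` and `at_line_start` helpers above turn the parser's (line number, column) position into an index into the raw text. The sketch below restates the same logic as standalone functions so it can be tried without a parser instance. The function signatures and the sample text are hypothetical. In the class, the values come from `self.rawdata`, `self.lineno` (1-based) and `self.offset` (0-based column).

```python
import re


def line_offset(rawdata, lineno):
    """Return the char index in `rawdata` where 1-based line `lineno` starts."""
    if lineno > 1:
        # Match exactly lineno-1 complete lines from the start of the text;
        # the end of that match is the first character of the wanted line.
        return re.match(r'([^\n]*\n){{{}}}'.format(lineno - 1), rawdata).end()
    return 0


def at_line_start(rawdata, lineno, offset):
    """Return True if column `offset` is preceded only by 0-3 whitespace chars."""
    if offset == 0:
        return True
    if offset > 3:
        return False
    start = line_offset(rawdata, lineno)
    return rawdata[start:start + offset].strip() == ''


sample = 'first\n  <div>\nx <p>'

print(line_offset(sample, 2))       # index where line 2 begins
print(at_line_start(sample, 2, 2))  # '<div>' preceded by two spaces
print(at_line_start(sample, 3, 2))  # '<p>' preceded by 'x '
```

Note that the regex rescans the text from the beginning on every call, so the cost grows with the line number. That is a property of this approach, not something the diff addresses.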

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)

            if self.at_line_start() and self.md.is_block_level(tag) and not self.inraw:
                # Started a new raw block. Prepare stack.
                self.inraw = True
                self.cleandoc.append('\n')

            text = self.get_starttag_text()
            if self.inraw:
                self.stack.append(tag)
                self._cache.append(text)
            else:
                self.cleandoc.append(text)

        def handle_endtag(self, tag):
            text = self.get_endtag_text(tag)

            if self.inraw:
                self._cache.append(text)
                if tag in self.stack:
                    # Remove tag from stack
                    while self.stack:
                        if self.stack.pop() == tag:
                            break
                if len(self.stack) == 0:
                    # End of raw block. Reset stack.
                    self.inraw = False
                    self.cleandoc.append(self.md.htmlStash.store(''.join(self._cache)))
                    # Insert blank line between this and next line.
                    self.cleandoc.append('\n\n')
                    self._cache = []
            else:
                self.cleandoc.append(text)

        def handle_data(self, data):
            if self.inraw:
                self._cache.append(data)
            else:
                self.cleandoc.append(data)

        def handle_empty_tag(self, data, is_block):
            """ Handle empty tags (`<data>`). """
            if self.inraw:
                # Append this to the existing raw block
                self._cache.append(data)
            elif self.at_line_start() and is_block:
                # Handle this as a standalone raw block
                self.cleandoc.append(self.md.htmlStash.store(data))
                # Insert blank line between this and next line.
                self.cleandoc.append('\n\n')
            else:
                self.cleandoc.append(data)

        def handle_startendtag(self, tag, attrs):
            self.handle_empty_tag(self.get_starttag_text(), is_block=self.md.is_block_level(tag))

        def handle_charref(self, name):
            self.handle_empty_tag('&#{};'.format(name), is_block=False)

        def handle_entityref(self, name):
            self.handle_empty_tag('&{};'.format(name), is_block=False)

        def handle_comment(self, data):
            self.handle_empty_tag('<!--{}-->'.format(data), is_block=True)

        def handle_decl(self, data):
            self.handle_empty_tag('<!{}>'.format(data), is_block=True)

        def handle_pi(self, data):
            self.handle_empty_tag('<?{}?>'.format(data), is_block=True)

        def unknown_decl(self, data):
            end = ']]>' if data.startswith('CDATA[') else ']>'
            self.handle_empty_tag('<![{}{}'.format(data, end), is_block=True)
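
The stack handling in `handle_starttag` and `handle_endtag` can be tried in isolation with a stripped-down parser. The following is only a sketch of that idea, not the class from this diff. `RawBlockSketch`, `BLOCK_TAGS` and the `{RAW:n}` placeholder are hypothetical names. A plain list stands in for `md.htmlStash`, and the line-start check and `get_endtag_text` are omitted.

```python
from html.parser import HTMLParser

# Hypothetical, abbreviated stand-in for md.is_block_level().
BLOCK_TAGS = {'div', 'p', 'table', 'blockquote'}


class RawBlockSketch(HTMLParser):
    """Collect block-level HTML into a stash; pass everything else through."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.stack = []  # open tags of the raw block being collected
        self.cache = []  # source pieces of the raw block being collected
        self.clean = []  # text left for Markdown, with placeholders
        self.stash = []  # stand-in for md.htmlStash

    def handle_starttag(self, tag, attrs):
        text = self.get_starttag_text()
        if self.stack or tag in BLOCK_TAGS:
            # Either already inside a raw block, or starting a new one.
            self.stack.append(tag)
            self.cache.append(text)
        else:
            self.clean.append(text)

    def handle_endtag(self, tag):
        text = '</{}>'.format(tag)
        if not self.stack:
            self.clean.append(text)
            return
        self.cache.append(text)
        if tag in self.stack:
            # Pop back to the matching open tag, dropping unclosed children.
            while self.stack:
                if self.stack.pop() == tag:
                    break
        if not self.stack:
            # The outermost tag was closed: stash the block, leave a placeholder.
            self.stash.append(''.join(self.cache))
            self.clean.append('{{RAW:{}}}'.format(len(self.stash) - 1))
            self.cache = []

    def handle_data(self, data):
        (self.cache if self.stack else self.clean).append(data)


extractor = RawBlockSketch()
extractor.feed('Hello <em>world</em>\n<div><p>raw <b>html</p></div>\nBye')
extractor.close()
cleaned = ''.join(extractor.clean)
print(cleaned)
print(extractor.stash)
```

The unclosed `<b>` in the sample shows why the loop pops until it finds the matching tag: closing `</p>` also discards the dangling `b` entry, so the block still ends when `</div>` empties the stack.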