Skip to content

Conversation

@htdat
Copy link
Member

@htdat htdat commented Aug 15, 2019

Fixes #9546

Changes proposed in this Pull Request:

  • Use wp_strip_all_tags() to remove all HTML tags.

Testing instructions:

  • Upload a new video to Media Libary
  • Add the description of the video, which includes HTML tags like <strong>Description</strong> with <a href="https://github.com/Automattic/jetpack/issues/9546">HTML tags</a>.
  • Check https://domain.com/video-sitemap-1.xml
  • Make sure the description of this video is Description with HTML tags.

Proposed changelog entry for your changes:

  • Remove HTML tags in the description column of video sitemaps

@htdat htdat requested a review from a team August 15, 2019 12:44
@matticbot
Copy link
Contributor

Caution: This PR has changes that must be merged to WordPress.com
Hello htdat! These changes need to be synced to WordPress.com - If you 're an a11n, please commandeer, review, and approve D31578-code before merging this PR. Thank you!

@jetpackbot
Copy link
Collaborator

Warnings
⚠️

The PR is missing at least one [Status] label. Suggestions: [Status] In Progress, [Status] Needs Review

This is an automated check which relies on PULL_REQUEST_TEMPLATE. We encourage you to follow that template as it helps Jetpack maintainers do their job. If you think 'Testing instructions' or 'Proposed changelog entry' are not needed for your PR - please explain why you think so. Thanks for cooperation 🤖

Generated by 🚫 dangerJS against e2c0004

@htdat htdat added the [Status] Needs Review This PR is ready for review. label Aug 16, 2019
@mdawaffe
Copy link
Member

The spec says that both title and description can contain HTML. The spec also says that any HTML must be be HTML-escaped or wrapped in CDATA blocks.

If you view the source of the video-sitemap-1.xml, you'll see that the description element correctly escapes all HTML as the spec requires.

So the sitemap is being generated correctly.

In one sense, that's all that really matters :) Since the sitemap is meant to be read by a machine and not a human, it doesn't really matter what it looks like to a human.

But, Jetpack's sitemaps include an XSL Stylesheet to make the sitemap look pretty if a human does come along to view it. That's where the reported bug comes from: that humans see encoded HTML, which doesn't look pretty.

I see three potential fixes:

  1. Do nothing :) Does it matter that humans see the HTML? The sitemap is still correct.
  2. As in this PR, remove all HTML, so that it looks the same to humans and machines, but we'd lose the richness that the additional HTML markup provides.
  3. Instead of this PR, do something like the following to change the XSL Stylesheet to display the HTML. This would mean machines can see the HTML (as they do now), and humans would see the the HTML interpreted in the DOM rather than the raw HTML.
diff --git a/modules/sitemaps/sitemap-stylist.php b/modules/sitemaps/sitemap-stylist.php
index ce883d7f1..08d38b0a9 100644
--- a/modules/sitemaps/sitemap-stylist.php
+++ b/modules/sitemaps/sitemap-stylist.php
@@ -549,3 +549,3 @@ $css
 					<td>
-						<xsl:value-of select='video:video/video:description'/>
+						<xsl:value-of select='video:video/video:description' disable-output-escaping="yes"/>
 					</td>

Thoughts?

(Note that, for the title, although HTML is allowed, WordPress strips all tags during the_title_rss, so we don't see the behavior reported here in the title element.)

@mdawaffe mdawaffe added [Status] Needs Author Reply We need more details from you. This label will be auto-added until the PR meets all requirements. and removed [Status] Needs Review This PR is ready for review. labels Aug 17, 2019
@htdat htdat added [Status] In Progress and removed [Status] Needs Author Reply We need more details from you. This label will be auto-added until the PR meets all requirements. labels Aug 19, 2019
@htdat
Copy link
Member Author

htdat commented Aug 19, 2019

@mdawaffe Thanks a lot for the thorough review.

The spec says that both title and description can contain HTML. The spec also says that any HTML must be be HTML-escaped or wrapped in CDATA blocks.

Definitely, this makes sense. I did not double check the official guide about this.

Instead of this PR, do something like the following to change the XSL Stylesheet to display the HTML. This would mean machines can see the HTML (as they do now), and humans would see the the HTML interpreted in the DOM rather than the raw HTML.

This makes the most sense for me. I will test and add another commit for this.

@htdat
Copy link
Member Author

htdat commented Aug 19, 2019

Close this PR in favor of this #13261

@htdat htdat closed this Aug 19, 2019
@htdat htdat deleted the fix/9546 branch August 19, 2019 15:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Video Sitemap: description should filter out HTML tags

6 participants