x/net/html: Tokenizer cannot round-trip <script> tag contents #7929

Open
gopherbot opened this issue May 3, 2014 · 3 comments
@gopherbot

by martin@probst.io:

I'm not sure if this is a bug or working as intended according to the HTML5 parsing
algorithm, but it seems at least problematic from a user's perspective.

When parsing an HTML document that contains <script> tags, writing out the tokens
received will double escape any contained entities, thus <script> tags don't
round-trip through the tokenizer. See the attached patch which adds two tests for
<script>"</script> (which leads to &#34; as the contents) and
<script>&#34;</script>, which leads to &amp;#34;.

That means re-parsing the output of tokenization adds more and more double escaping.

There is a test for <style> just below the one I added that makes this look
intentional. But this is a real problem: using go.net/html to parse and re-serialize
documents breaks the documents.
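
To make the compounding concrete, here is a minimal sketch that uses only the standard library. It assumes that printing a text token escapes its data the same way as html.EscapeString does, and that the tokenizer returns <script> contents unchanged; reserialize is a hypothetical helper that models one parse-and-print pass under those assumptions, not code from the package.

```go
package main

import (
	"fmt"
	"html"
)

// reserialize models one parse-and-print pass over the contents of a
// <script> element: the text comes back from the tokenizer unchanged,
// and printing it HTML-escapes it.
// Assumption: the printer escapes like the standard html.EscapeString.
func reserialize(scriptBody string) string {
	return html.EscapeString(scriptBody)
}

func main() {
	// The two script bodies from the report.
	for _, body := range []string{`"`, `&#34;`} {
		fmt.Printf("%-6s", body)
		cur := body
		// Each further pass escapes the '&' produced by the previous one.
		for pass := 1; pass <= 3; pass++ {
			cur = reserialize(cur)
			fmt.Printf(" -> %s", cur)
		}
		fmt.Println()
	}
}
```

Under these assumptions, each pass adds one more level of escaping, so the script body never stabilizes.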

Attachments:

  1. script_tags_test.diff (494 bytes)
@bradfitz
Contributor

bradfitz commented May 5, 2014

Comment 1:

Labels changed: added repo-net.

Owner changed to @nigeltao.

Status changed to Accepted.

@andybalholm
Contributor

Comment 2:

I'm pretty sure that the problem isn't in the tokenization but in the printing.

@mikioh mikioh changed the title code.google.com/p/go.net/html: Tokenizer cannot round-trip <script> tag contents x/net/html: Tokenizer cannot round-trip <script> tag contents Dec 23, 2014
@mikioh mikioh added repo-net and removed repo-net labels Dec 23, 2014
@mikioh mikioh changed the title x/net/html: Tokenizer cannot round-trip <script> tag contents html: Tokenizer cannot round-trip <script> tag contents Jan 4, 2015
@rsc rsc added this to the Unplanned milestone Apr 10, 2015
@rsc rsc changed the title html: Tokenizer cannot round-trip <script> tag contents x/net/html: Tokenizer cannot round-trip <script> tag contents Apr 14, 2015
@rsc rsc modified the milestones: Unreleased, Unplanned Apr 14, 2015
@rsc rsc removed the repo-net label Apr 14, 2015
evanj added a commit to evanj/kubewebproxy that referenced this issue Jan 20, 2020
@evanj
Contributor

evanj commented Jan 20, 2020

The workaround I'm using is to use token.Data instead of token.String() for text tokens:

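// t is the current html.Token (e.g. from z.Token()); tokenType is its type.
var content string
if tokenType == html.TextToken {
  // Data holds the text as tokenized; String() would HTML-escape it.
  content = t.Data
} else {
  content = t.String()
}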
