x/net/html: Tokenizer cannot round-trip <script> tag contents #7929

gopherbot · 2014-05-03T12:20:48Z

I'm not sure if this is a bug or working as intended according to the HTML5 parsing
algorithm, but it seems at least problematic from a user's perspective.

When parsing an HTML document that contains <script> tags, writing out the tokens
received will double escape any contained entities, thus <script> tags don't
round-trip through the tokenizer. See the attached patch which adds two tests for
<script>"</script> (which leads to &#34; as the contents) and
<script>&#34;</script>, which leads to &amp;#34;.

That means re-parsing the output of tokenization adds more and more double escaping.

There is a test for <style> just below the one I added that makes this look
intentional. But this is a real problem: using go.net/html to parse and re-serialize
documents breaks the documents.
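The compounding effect described above can be sketched with the standard library alone. This is a minimal illustration, not the package's actual code: `printText` and `parseScriptText` are hypothetical stand-ins, and the sketch assumes that text tokens are escaped on output while <script> contents are handed back without entity decoding.

```go
package main

import (
	"fmt"
	"html"
)

// printText is a hypothetical stand-in for serializing a text token.
// Assumption: the serializer escapes the token's data on output.
func printText(data string) string {
	return html.EscapeString(data)
}

// parseScriptText is a hypothetical stand-in for tokenizing <script>
// contents. Assumption: raw text is returned without entity decoding.
func parseScriptText(contents string) string {
	return contents
}

// roundTrip runs n parse/print cycles over the script contents.
func roundTrip(contents string, n int) string {
	for i := 0; i < n; i++ {
		contents = printText(parseScriptText(contents))
	}
	return contents
}

func main() {
	// Each cycle adds one more layer of escaping to a single quote character.
	for n := 0; n <= 3; n++ {
		fmt.Printf("%d: %s\n", n, roundTrip(`"`, n))
	}
}
```

Because the parse step never decodes what the print step encoded, every cycle wraps the previous output in another layer of `&amp;`.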

Attachments:

script_tags_test.diff (494 bytes)

bradfitz · 2014-05-05T13:54:08Z

Comment 1:

Labels changed: added repo-net.

Owner changed to @nigeltao.

Status changed to Accepted.

andybalholm · 2014-07-08T02:02:11Z

Comment 2:

I'm pretty sure that the problem isn't in the tokenization but in the printing.

evanj · 2020-01-20T14:43:45Z

The workaround I'm using is to use token.Data instead of token.String() for text tokens:

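// Assumed context: tokenType comes from Tokenizer.Next() and t from Tokenizer.Token().
var content string
if tokenType == html.TextToken {
  // Write the text as-is; t.String() would escape it again.
  content = t.Data
} else {
  content = t.String()
}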
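A possible refinement of this workaround is to skip escaping only while inside a raw text element and to keep escaping ordinary text. The sketch below is self-contained and does not use the real package: the `token` type, its constants, `isRawText`, and `render` are hypothetical stand-ins for `html.Token` and a serializer loop, and treating only `script` and `style` as raw text is an assumption.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

type tokenType int

const (
	textToken tokenType = iota
	startTagToken
	endTagToken
)

// token is a hypothetical, simplified stand-in for html.Token.
type token struct {
	Type tokenType
	Data string // tag name for tag tokens, text for text tokens
}

// isRawText reports whether text inside the named element is written verbatim.
// Assumption: only script and style are handled here.
func isRawText(tag string) bool {
	return tag == "script" || tag == "style"
}

// render serializes tokens, escaping text except inside raw text elements.
func render(tokens []token) string {
	var b strings.Builder
	raw := false
	for _, t := range tokens {
		switch t.Type {
		case startTagToken:
			b.WriteString("<" + t.Data + ">")
			raw = isRawText(t.Data)
		case endTagToken:
			b.WriteString("</" + t.Data + ">")
			raw = false
		case textToken:
			if raw {
				b.WriteString(t.Data) // verbatim: no escaping inside <script>/<style>
			} else {
				b.WriteString(html.EscapeString(t.Data))
			}
		}
	}
	return b.String()
}

func main() {
	tokens := []token{
		{startTagToken, "p"}, {textToken, "1 < 2"}, {endTagToken, "p"},
		{startTagToken, "script"}, {textToken, `var s = "a";`}, {endTagToken, "script"},
	}
	fmt.Println(render(tokens))
}
```

With the real tokenizer the same idea would mean remembering the most recent start tag name and choosing between the raw data and the escaped string for each text token accordingly.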

gopherbot added accepted labels Jul 8, 2014

mikioh changed the title ~~code.google.com/p/go.net/html: Tokenizer cannot round-trip <script> tag contents~~ x/net/html: Tokenizer cannot round-trip <script> tag contents Dec 23, 2014

mikioh added repo-net and removed repo-net labels Dec 23, 2014

mikioh changed the title ~~x/net/html: Tokenizer cannot round-trip <script> tag contents~~ html: Tokenizer cannot round-trip <script> tag contents Jan 4, 2015

rsc added this to the Unplanned milestone Apr 10, 2015

rsc removed release-none labels Apr 10, 2015

rsc changed the title ~~html: Tokenizer cannot round-trip <script> tag contents~~ x/net/html: Tokenizer cannot round-trip <script> tag contents Apr 14, 2015

rsc modified the milestones: Unreleased, Unplanned Apr 14, 2015

rsc removed the repo-net label Apr 14, 2015

evanj added a commit to evanj/kubewebproxy that referenced this issue Jan 20, 2020

HTML rewriting: Don't escape script tag contents

e5be406

Workaround for golang/go#7929

evanj mentioned this issue Jan 20, 2020

HTML rewriting: Don't escape script tag contents evanj/kubewebproxy#6

Merged

evanj added a commit to evanj/kubewebproxy that referenced this issue Jan 20, 2020

HTML rewriting: Don't escape script tag contents

c78fa12

Workaround for golang/go#7929

evanj added a commit to evanj/kubewebproxy that referenced this issue Jan 20, 2020

HTML rewriting: Don't escape script tag contents

ad80da9

Workaround for golang/go#7929
