html: UnescapeString unescapes HTML character references without a final semicolon #21563

stjj89 · 2017-08-22T22:46:43Z

What version of Go are you using (`go version`)?

go version go1.9rc2_cl165246139 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (`go env`)?

linux/amd64

What did you do?

html.UnescapeString treats HTML character references that are missing a final ; as valid character references and unescapes them. For example, &#58 is unescaped to :.

https://play.golang.org/p/oyPAjmj0s_

The HTML5 specification states that all valid character references must be terminated by a ; character.

https://www.w3.org/TR/html5/syntax.html#character-references

Therefore, character references such as &#58 that are missing this semicolon should not be unescaped.

Note: the authors of this function probably intended to accept unterminated character references (see this test case). This was probably to handle an edge case mentioned in the HTML4 spec (https://www.w3.org/TR/html4/charset.html#entities):

In SGML, it is possible to eliminate the final ";" after a character reference in some cases (e.g., at a line break or immediately before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the ";" in all cases to avoid problems with user agents that require this character to be present.


mikesamuel · 2017-08-24T12:38:22Z

This bug is invalid.

You can see this by running

<!doctype html><html><head><title>Test</title></head>
<body>&#65 &#x41  &Aacute</body></html>

through https://validator.w3.org/nu/#textarea which yields

Error: Character reference was not terminated by a semicolon.
At line 1, column 64
head><body>&#65 &#x41  &Aacute
Error: Character reference was not terminated by a semicolon.
At line 1, column 70
body>&#65 &#x41  &Aacute</body
Error: Named character reference was not terminated by a semicolon. (Or & should have been escaped as &amp;.)
At line 1, column 79
 &#x41  &Aacute</body></html>

and loading the same in your browser which displays "A A Á"

To understand why, we have to look at the spec more closely.

The canonical HTML5 specification is the WHATWG one, not the w3.org one.

named-character-references lists a number of character references that are recognized without semicolons for compatibility.

The first two rows of that table

| name | codepoint | glyph |
| --- | --- | --- |
| Aacute; | U+000C1 | Á |
| Aacute | U+000C1 | Á |

indicate that the Á can appear with or without a semicolon.

named-character-reference-state and numeric-character-reference-state are the relevant portions of the parser.

Note in decimal-character-reference-state which is reached after parsing &#1 starting from the RCDATA state:

12.2.5.79 Decimal character reference state

U+003B SEMICOLON
Switch to the numeric character reference end state.

Anything else
This is a missing-semicolon-after-character-reference parse error. Reconsume in the numeric character reference end state.

So whether there is a semicolon or not, we transition to the numeric character reference end state which appends the character. The only difference is that an error is issued.

This means that

<html><body>&#1;</body></html>

and

<html><body>&#1</body></html>

parse to the same document, but the second does not validate.

There is similar language in the hexadecimal character reference state.

And the relevant text for named character references is

Otherwise:

1. If the last character matched is not a U+003B SEMICOLON character (;), then this is a missing-semicolon-after-character-reference parse error.

2. Set the temporary buffer to the empty string. Append one or two characters corresponding to the character reference name (as given by the second column of the named character references table) to the temporary buffer.

3. Flush code points consumed as a character reference. Switch to the return state.

Note that (1) does not transition directly to the return state but proceeds to (2).

stjj89 · 2017-08-24T19:29:46Z

Ah, that is extremely subtle. Thanks for the detailed explanation. I'll close this issue then.

ianlancetaylor added this to the Go1.10 milestone Aug 23, 2017

stjj89 closed this as completed Aug 24, 2017

golang locked and limited conversation to collaborators Aug 24, 2018

gopherbot added the FrozenDueToAge label Aug 24, 2018
