html: UnescapeString unescapes HTML character references without a final semicolon #21563

Closed
stjj89 opened this issue Aug 22, 2017 · 2 comments

@stjj89
Contributor

stjj89 commented Aug 22, 2017

What version of Go are you using (go version)?

go version go1.9rc2_cl165246139 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

linux/amd64

What did you do?

html.UnescapeString treats HTML character references that are missing a final ; as valid character references and unescapes them. For example, &#58 is unescaped to :.

https://play.golang.org/p/oyPAjmj0s_
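
The playground snippet is not reproduced in this thread. As a minimal sketch of the behavior described above (the helper name `unescapeBoth` is made up for illustration and is not part of any package), the following compares the terminated and unterminated forms of the same reference:

```go
package main

import (
	"fmt"
	"html"
)

// unescapeBoth is a hypothetical helper for this illustration: it unescapes
// the same numeric reference with and without the trailing semicolon.
func unescapeBoth() (terminated, unterminated string) {
	return html.UnescapeString("&#58;"), html.UnescapeString("&#58")
}

func main() {
	terminated, unterminated := unescapeBoth()
	// Both values are printed quoted so they can be compared directly.
	fmt.Printf("%q %q\n", terminated, unterminated)
}
```

Both calls should yield the same string, which is the behavior this report is about.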

The HTML5 specification states that all valid character references must be terminated by a ; character.

https://www.w3.org/TR/html5/syntax.html#character-references

Therefore, character references such as &#58 that are missing this semicolon should not be unescaped.

Note: the authors of this function probably intended to accept unterminated character references (see this test case). This was probably to handle an edge case mentioned in the HTML4 spec (https://www.w3.org/TR/html4/charset.html#entities):

In SGML, it is possible to eliminate the final ";" after a character reference in some cases (e.g., at a line break or immediately before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the ";" in all cases to avoid problems with user agents that require this character to be present.

@ianlancetaylor ianlancetaylor added this to the Go1.10 milestone Aug 23, 2017
@mikesamuel
Contributor

This bug is invalid.

You can see this by running

<!doctype html><html><head><title>Test</title></head>
<body>&#65 &#x41  &Aacute</body></html>

through https://validator.w3.org/nu/#textarea which yields

Error: Character reference was not terminated by a semicolon.
At line 1, column 64
head><body>&#65 &#x41  &Aacute
Error: Character reference was not terminated by a semicolon.
At line 1, column 70
body>&#65 &#x41  &Aacute</body
Error: Named character reference was not terminated by a semicolon. (Or & should have been escaped as &amp;.)
At line 1, column 79
 &#x41  &Aacute</body></html>

and then loading the same document in your browser, which displays "A A Á".
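
The same body text can also be passed to Go's html.UnescapeString to compare it with what the browser shows. This is only a sketch: the constant `body` and the helper `unescapeBody` are names invented here. Unlike a browser, the function does not collapse whitespace, so the double space in the input survives.

```go
package main

import (
	"fmt"
	"html"
)

// body mirrors the text inside <body> in the test document above.
const body = "&#65 &#x41  &Aacute"

// unescapeBody is a hypothetical helper that unescapes the test text.
func unescapeBody() string {
	return html.UnescapeString(body)
}

func main() {
	// None of the three references carries a semicolon.
	fmt.Printf("%q\n", unescapeBody())
}
```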


To understand why, we have to look at the spec more closely.

The canonical HTML5 specification is the WHATWG one, not the w3.org one.

named-character-references lists a number of character references that are recognized without semicolons for compatibility.

The first two rows of that table

name codepoint glyph
Aacute; U+000C1 Á
Aacute U+000C1 Á

indicate that the Aacute reference can appear with or without a semicolon.

named-character-reference-state and numeric-character-reference-state are the relevant portions of the parser.

Note in decimal-character-reference-state which is reached after parsing &#1 starting from the RCDATA state:

12.2.5.79 Decimal character reference state

U+003B SEMICOLON
Switch to the numeric character reference end state.

Anything else
This is a missing-semicolon-after-character-reference parse error. Reconsume in the numeric character reference end state.

So whether there is a semicolon or not, we transition to the numeric character reference end state which appends the character. The only difference is that an error is issued.

This means that

<html><body>&#1;</body></html>

and

<html><body>&#1</body></html>

parse to the same document, but the second does not validate.
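
The decimal state quoted above can be sketched as a small function. This is a simplified illustration of the quoted rules, not the tokenizer from the spec or Go's implementation: `decimalRef` is a hypothetical name, it expects the text that follows "&#", and it ignores overflow and the range checks done in the numeric character reference end state.

```go
package main

import "fmt"

// decimalRef is a hypothetical helper. It reads the decimal digits at the
// start of s and returns the code point, the number of bytes consumed,
// whether the semicolon was missing, and whether any digits were found.
func decimalRef(s string) (cp rune, n int, missingSemicolon bool, ok bool) {
	for n < len(s) && s[n] >= '0' && s[n] <= '9' {
		cp = cp*10 + rune(s[n]-'0')
		n++
	}
	if n == 0 {
		return 0, 0, false, false // no digits: not a numeric reference
	}
	if n < len(s) && s[n] == ';' {
		return cp, n + 1, false, true // U+003B SEMICOLON: consume it
	}
	// Anything else: flag the parse error, keep the same code point,
	// and leave the next character unconsumed.
	return cp, n, true, true
}

func main() {
	for _, in := range []string{"65;</body>", "65</body>"} {
		cp, n, missing, ok := decimalRef(in)
		fmt.Printf("%q: cp=%q consumed=%d missingSemicolon=%v ok=%v\n", in, cp, n, missing, ok)
	}
}
```

Both inputs produce the same code point; only the error flag and the number of consumed bytes differ.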


There is similar language in the hexadecimal character reference state.


And the relevant text for named character references is

Otherwise:

  1. If the last character matched is not a U+003B SEMICOLON character (;), then this is a missing-semicolon-after-character-reference parse error.
  2. Set the temporary buffer to the empty string. Append one or two characters corresponding to the character reference name (as given by the second column of the named character references table) to the temporary buffer.
  3. Flush code points consumed as a character reference. Switch to the return state.

Note that (1) does not transition directly to the return state but proceeds to (2).
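
The named case can be sketched the same way. The table below is a tiny assumed excerpt (the real one is far longer), `namedRefs` and `matchNamed` are hypothetical names, and the sketch leaves out the attribute-value special case that precedes the quoted steps in the spec. It tries the longest matching name first and records the missing semicolon without discarding the match.

```go
package main

import (
	"fmt"
	"strings"
)

// namedRefs is an illustrative excerpt of the named character references
// table, with entries both with and without the trailing semicolon.
var namedRefs = map[string]string{
	"Aacute;": "\u00C1",
	"Aacute":  "\u00C1",
	"amp;":    "&",
	"amp":     "&",
}

// matchNamed is a hypothetical helper. Given the text that follows '&', it
// returns the replacement text, the bytes consumed, whether the semicolon
// was missing, and whether any name matched.
func matchNamed(s string) (text string, n int, missingSemicolon bool, ok bool) {
	for end := len(s); end > 0; end-- { // longest candidate first
		name := s[:end]
		if v, found := namedRefs[name]; found {
			// Step 1 only records the error; steps 2 and 3 still apply.
			return v, end, !strings.HasSuffix(name, ";"), true
		}
	}
	return "", 0, false, false
}

func main() {
	for _, in := range []string{"Aacute;</body>", "Aacute</body>"} {
		text, n, missing, ok := matchNamed(in)
		fmt.Printf("%q: text=%q consumed=%d missingSemicolon=%v ok=%v\n", in, text, n, missing, ok)
	}
}
```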

@stjj89
Contributor Author

stjj89 commented Aug 24, 2017

Ah, that is extremely subtle. Thanks for the detailed explanation. I'll close this issue then.

@stjj89 stjj89 closed this as completed Aug 24, 2017
@golang golang locked and limited conversation to collaborators Aug 24, 2018