Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
linux/amd64
What did you do?
html.UnescapeString treats HTML character references that are missing a final ; as valid character references and unescapes them. For example, a reference to the colon character that lacks its final ; is still unescaped to :.
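The behavior described above can be reproduced with a short program. This is a minimal sketch: the input strings are assumptions chosen for illustration (the exact references from the original report did not survive on this page), and `unescapePair` is a hypothetical helper, not part of the `html` package.

```go
package main

import (
	"fmt"
	"html"
)

// unescapePair is a hypothetical helper: it unescapes one input whose
// character reference ends with ";" and one whose reference does not.
func unescapePair(terminated, unterminated string) (string, string) {
	return html.UnescapeString(terminated), html.UnescapeString(unterminated)
}

func main() {
	// Assumed inputs: a decimal reference to ":" with and without ";".
	a, b := unescapePair("x&#58;y", "x&#58y")
	fmt.Printf("%q %q\n", a, b) // prints "x:y" "x:y"

	// Assumed input: decimal, hexadecimal, and named references, all unterminated.
	_, c := unescapePair("", "&#65 &#x41 &Aacute")
	fmt.Printf("%q\n", c) // prints "A A Á"
}
```

Both forms produce the same output, which is the behavior the report questions.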
From the HTML4 specification's note on character references:
In SGML, it is possible to eliminate the final ";" after a character reference in some cases (e.g., at a line break or immediately before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the ";" in all cases to avoid problems with user agents that require this character to be present.
Error: Character reference was not terminated by a semicolon.
At line 1, column 64
head><body>A A Á
Error: Character reference was not terminated by a semicolon.
At line 1, column 70
body>A A Á</body
Error: Named character reference was not terminated by a semicolon. (Or & should have been escaped as &amp;.)
At line 1, column 79
A Á</body></html>
Loading the same document in your browser displays "A A Á".
To understand why, we have to look at the spec more closely.
U+003B SEMICOLON
Switch to the numeric character reference end state.
Anything else
This is a missing-semicolon-after-character-reference parse error. Reconsume in the numeric character reference end state.
So whether or not there is a semicolon, we transition to the numeric character reference end state, which appends the character. The only difference is that a parse error is reported.
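The two branches quoted above can be sketched in Go. This is an illustration only, not the spec's full algorithm and not the `html` package's implementation: `parseDecimalRef` is a hypothetical name, it handles only decimal digits, and it skips the range checks that the numeric character reference end state performs.

```go
package main

import "fmt"

// parseDecimalRef is a hypothetical helper mimicking the decimal branch of
// the numeric character reference states. s must start right after "&#".
// It returns the decoded rune, the number of bytes consumed, and whether a
// missing-semicolon-after-character-reference parse error would be reported.
// Simplifications: no overflow handling and no end-state range checks.
func parseDecimalRef(s string) (r rune, n int, missingSemicolon bool) {
	code := 0
	for n < len(s) && s[n] >= '0' && s[n] <= '9' {
		code = code*10 + int(s[n]-'0')
		n++
	}
	if n < len(s) && s[n] == ';' {
		n++ // U+003B SEMICOLON: consume it, then go to the end state.
	} else {
		missingSemicolon = true // Anything else: parse error, reconsume in the end state.
	}
	// Either way the end state appends the same character.
	return rune(code), n, missingSemicolon
}

func main() {
	for _, in := range []string{"65;<", "65<"} {
		r, n, missing := parseDecimalRef(in)
		fmt.Printf("%q -> %q consumed=%d missingSemicolon=%v\n", in, r, n, missing)
	}
}
```

The decoded rune is identical in both cases; only the error flag and the number of consumed bytes differ.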
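This means that two documents of the form
<html><body>...</body></html>
whose bodies differ only in whether the numeric character reference ends with a semicolon parse to the same document, but the one without the semicolon does not validate.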
And the relevant text for named character references is
Otherwise:
1. If the last character matched is not a U+003B SEMICOLON character (;), then this is a missing-semicolon-after-character-reference parse error.
2. Set the temporary buffer to the empty string. Append one or two characters corresponding to the character reference name (as given by the second column of the named character references table) to the temporary buffer.
3. Flush code points consumed as a character reference. Switch to the return state.
Note that (1) does not transition directly to the return state but proceeds to (2).
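The same flow for named references can be sketched as follows. This is a simplified illustration under stated assumptions: `namedRefs` is a tiny hypothetical excerpt of the named character references table, `matchNamedRef` is a hypothetical helper, and the attribute-value special case that precedes "Otherwise" in the spec is omitted.

```go
package main

import "fmt"

// namedRefs is a tiny, hypothetical excerpt of the named character
// references table. The real table is far larger, and only some legacy
// names are listed without a trailing ";".
var namedRefs = map[string]string{
	"Aacute;": "Á",
	"Aacute":  "Á",
	"amp;":    "&",
	"amp":     "&",
}

// matchNamedRef looks for the longest prefix of s (the text after "&")
// that names a reference. It returns the replacement text, the number of
// bytes consumed, whether the missing-semicolon parse error applies, and
// whether any name matched.
func matchNamedRef(s string) (text string, n int, missingSemicolon, ok bool) {
	for end := len(s); end > 0; end-- {
		if v, found := namedRefs[s[:end]]; found {
			// Step 1 only records the error; steps 2 and 3 still emit v.
			return v, end, s[end-1] != ';', true
		}
	}
	return "", 0, false, false
}

func main() {
	for _, in := range []string{"Aacute;</body>", "Aacute</body>"} {
		text, n, missing, ok := matchNamedRef(in)
		fmt.Printf("%q -> %q consumed=%d missingSemicolon=%v ok=%v\n", in, text, n, missing, ok)
	}
}
```

As with numeric references, the missing semicolon changes the error flag but not the character that is emitted.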
What version of Go are you using (go version)?
go version go1.9rc2_cl165246139 linux/amd64
Playground example: https://play.golang.org/p/oyPAjmj0s_
The HTML5 specification states that all valid character references must be terminated by a ; character: https://www.w3.org/TR/html5/syntax.html#character-references
Therefore, character references that are missing this semicolon, such as the colon reference in the example, should not be unescaped.
Note: the authors of this function probably intended to accept unterminated character references (see this test case). This was probably to handle an edge case mentioned in the HTML4 spec (https://www.w3.org/TR/html4/charset.html#entities), which allows the final ";" to be omitted in some cases.