x/net/html: not all HTML entities are escaped #38008
Comments
I did some reading through the HTML standards spec. The expectation is that within a specific ASCII range only certain characters need to be escaped, which I believe are the set defined here. Characters outside of this ASCII range should always be encoded. This tool and demo highlight this well: https://github.com/mathiasbynens/he https://mothereff.in/html-entities
At least, I don't think this behavior disappoints current expectations.
As you can see from above, the characters targeted for escape are limited. And, as you can see in many other languages, this behavior is sufficient for security. Is there any reason why the
Btw, your patch can be improved like the following:
diff --git a/html/escape.go b/html/escape.go
index d856139..3be3540 100644
--- a/html/escape.go
+++ b/html/escape.go
@@ -193,7 +193,7 @@ func lower(b []byte) []byte {
 	return b
 }
 
-const escapedChars = "&'<>\"\r"
+const escapedChars = "&'<>\"\r“”"
 
 func escape(w writer, s string) error {
 	i := strings.IndexAny(s, escapedChars)
@@ -218,7 +218,19 @@ func escape(w writer, s string) error {
 		case '\r':
 			esc = "&#13;"
 		default:
-			panic("unrecognized escape character")
+			if len(s) >= 3 {
+				switch s[i : i+3] {
+				case "“":
+					esc = "&ldquo;"
+					i += 2
+				case "”":
+					esc = "&rdquo;"
+					i += 2
+				}
+			}
+			if esc == "" {
+				panic("unrecognized escape character")
+			}
 		}
 		s = s[i+1:]
 		if _, err := w.WriteString(esc); err != nil {
There is no reason for it to be escaped beyond the fact that it's unexpected behavior I hit when manipulating some HTML, which resulted in the HTML being rendered incorrectly.
How is the HTML being rendered incorrectly? The HTML is correct in both cases: with or without the U+201C being escaped to an
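A quick way to check the claim that both forms carry the same text is to decode each and compare. The sketch below uses the standard library's html.UnescapeString as a stand-in for an HTML parser; sameText is a hypothetical helper, and the comparison assumes the document's bytes are read as UTF-8.

```go
package main

import (
	"fmt"
	"html"
)

// sameText (hypothetical helper) reports whether two snippets of HTML text
// decode to the same string once character references are resolved.
func sameText(a, b string) bool {
	return html.UnescapeString(a) == html.UnescapeString(b)
}

func main() {
	// Named references versus the literal U+201C / U+201D characters.
	fmt.Println(sameText("&ldquo;hello&rdquo;", "“hello”")) // true

	// Numeric references resolve to the same characters as well.
	fmt.Println(sameText("&#8220;hello&#8221;", "“hello”")) // true
}
```

If the two forms render differently in practice, one thing worth checking is whether the page is actually being served and declared as UTF-8, since the unescaped form depends on that.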
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Unsure
What operating system and processor architecture are you using (go env)?
What did you do?
I've been working on a static site using Hugo and I have written a tool that reads the resulting HTML file, makes some alterations and writes a new HTML file.
I noticed that some HTML entities were being read, parsed into the appropriate character but the output would be the character that should be a HTML entity.
You can reproduce it with the following test:
This is the test from https://github.com/golang/net/blob/master/html/render_test.go#L12 with a change of the third TextNode.
From:
To:
What did you expect to see vs what did you see?
The output is
0&lt;1 “hello”
when I was expecting
0&lt;1 &ldquo;hello&rdquo;
Looking at the code that escapes HTML entities, it seems like it's only able to encode single-byte characters.
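The same behavior can be observed with the standard library's html.EscapeString, which is used here only as a stand-in: it is a different code path from the x/net/html renderer, but it also escapes a fixed set of ASCII characters and leaves everything else alone. escapeText is a hypothetical wrapper added for illustration.

```go
package main

import (
	"fmt"
	"html"
)

// escapeText (hypothetical wrapper) escapes s with the standard library,
// which rewrites only <, >, &, ' and ".
func escapeText(s string) string {
	return html.EscapeString(s)
}

func main() {
	// '<' becomes &lt; while the curly quotes pass through unchanged.
	fmt.Println(escapeText("0<1 “hello”")) // 0&lt;1 “hello”
}
```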
You can get the correct output with something akin to:
But that loses all of the optimizations in https://github.com/golang/net/blob/master/html/escape.go#L198