Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/net/html: not all HTML entities are escaped #38008

Closed
gauntface opened this issue Mar 22, 2020 · 5 comments
Closed

x/net/html: not all HTML entities are escaped #38008

gauntface opened this issue Mar 22, 2020 · 5 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@gauntface
Copy link

What version of Go are you using (go version)?

$ go version
go version go1.13.4 linux/amd64

Does this issue reproduce with the latest release?

Unsure

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/_/.cache/go-build"
GOENV="/home/_/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/_/_/_/go/"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/_/_/_/net/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build646558793=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I've been working on a static site using Hugo and I have written a tool that reads the resulting HTML file, makes some alterations and writes a new HTML file.

I noticed that some HTML entities were being read, parsed into the appropriate character but the output would be the character that should be a HTML entity.

You can reproduce it with the following test:

func TestRenderer(t *testing.T) {
	nodes := [...]*Node{
		0: {
			Type: ElementNode,
			Data: "html",
		},
		1: {
			Type: ElementNode,
			Data: "head",
		},
		2: {
			Type: ElementNode,
			Data: "body",
		},
		3: {
			Type: TextNode,
			Data: "0<1 “hello”",
		},
		4: {
			Type: ElementNode,
			Data: "p",
			Attr: []Attribute{
				{
					Key: "id",
					Val: "A",
				},
				{
					Key: "foo",
					Val: `abc"def`,
				},
			},
		},
		5: {
			Type: TextNode,
			Data: "2",
		},
		6: {
			Type: ElementNode,
			Data: "b",
			Attr: []Attribute{
				{
					Key: "empty",
					Val: "",
				},
			},
		},
		7: {
			Type: TextNode,
			Data: "3",
		},
		8: {
			Type: ElementNode,
			Data: "i",
			Attr: []Attribute{
				{
					Key: "backslash",
					Val: `\`,
				},
			},
		},
		9: {
			Type: TextNode,
			Data: "&4",
		},
		10: {
			Type: TextNode,
			Data: "5",
		},
		11: {
			Type: ElementNode,
			Data: "blockquote",
		},
		12: {
			Type: ElementNode,
			Data: "br",
		},
		13: {
			Type: TextNode,
			Data: "6",
		},
		14: {
			Type: CommentNode,
			Data: "comm",
		},
		15: {
			Type: RawNode,
			Data: "7<pre>8</pre>9",
		},
	}

	// Build a tree out of those nodes, based on a textual representation.
	// Only the ".\t"s are significant. The trailing HTML-like text is
	// just commentary. The "0:" prefixes are for easy cross-reference with
	// the nodes array.
	treeAsText := [...]string{
		0: `<html>`,
		1: `.	<head>`,
		2: `.	<body>`,
		3: `.	.	"0&lt;1 &ldquo;hello&rdquo;"`,
		4: `.	.	<p id="A" foo="abc&#34;def">`,
		5: `.	.	.	"2"`,
		6: `.	.	.	<b empty="">`,
		7: `.	.	.	.	"3"`,
		8: `.	.	.	<i backslash="\">`,
		9: `.	.	.	.	"&amp;4"`,
		10: `.	.	"5"`,
		11: `.	.	<blockquote>`,
		12: `.	.	<br>`,
		13: `.	.	"6"`,
		14: `.	.	"<!--comm-->"`,
		15: `.	.	"7<pre>8</pre>9"`,
	}
	if len(nodes) != len(treeAsText) {
		t.Fatal("len(nodes) != len(treeAsText)")
	}
	var stack [8]*Node
	for i, line := range treeAsText {
		level := 0
		for line[0] == '.' {
			// Strip a leading ".\t".
			line = line[2:]
			level++
		}
		n := nodes[i]
		if level == 0 {
			if stack[0] != nil {
				t.Fatal("multiple root nodes")
			}
			stack[0] = n
		} else {
			stack[level-1].AppendChild(n)
			stack[level] = n
			for i := level + 1; i < len(stack); i++ {
				stack[i] = nil
			}
		}
		// At each stage of tree construction, we check all nodes for consistency.
		for j, m := range nodes {
			if err := checkNodeConsistency(m); err != nil {
				t.Fatalf("i=%d, j=%d: %v", i, j, err)
			}
		}
	}

	want := `<html><head></head><body>0&lt;1 &ldquo;hello&rdquo;<p id="A" foo="abc&#34;def">` +
		`2<b empty="">3</b><i backslash="\">&amp;4</i></p>` +
		`5<blockquote></blockquote><br/>6<!--comm-->7<pre>8</pre>9</body></html>`
	b := new(bytes.Buffer)
	if err := Render(b, nodes[0]); err != nil {
		t.Fatal(err)
	}
	if got := b.String(); got != want {
		t.Errorf("got vs want:\n%s\n%s\n", got, want)
	}
}

This is the test from https://github.com/golang/net/blob/master/html/render_test.go#L12 with a change of the third TextNode.

From:

		3: {
			Type: TextNode,
			Data: "0<1",
		},

To:

		3: {
			Type: TextNode,
			Data: "0<1 “hello”",
		},

What did you expect to see vs what did you see?

The output is 0&lt;1 “hello” when I was expecting 0&lt;1 &ldquo;hello&rdquo;.

Looking at the code to parse HTML entities it seems like it's only able to encode single byte runes.

You can get the correct output with something akin to:

func escape(w writer, st string) error {
	for _, s := range st {
		var esc string = string(s)
		switch s {
		case '&':
			esc = "&amp;"
		case '\'':
			// "&#39;" is shorter than "&apos;" and apos was not in HTML until HTML5.
			esc = "&#39;"
		case '<':
			esc = "&lt;"
		case '>':
			esc = "&gt;"
		case '"':
			// "&#34;" is shorter than "&quot;".
			esc = "&#34;"
		case '\r':
			esc = "&#13;"
		case '“':
			esc = "&ldquo;"
		case '”':
			esc = "&rdquo;"
		}
		if _, err := w.WriteString(esc); err != nil {
			return err
		}
	}
	return nil
}

But that loses all of the optimizations in https://github.com/golang/net/blob/master/html/escape.go#L198

@gopherbot gopherbot added this to the Unreleased milestone Mar 22, 2020
@andybons andybons changed the title x/net:html not all HTML entities are escaped x/net/html: not all HTML entities are escaped Mar 23, 2020
@andybons
Copy link
Member

@andybons andybons added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Mar 23, 2020
@gauntface
Copy link
Author

I did some reading through the html standards spec, the expectation is the between a specific ascii range only certain characters need to be escaped, which I believe are the set defined here. Characters outside of this ascii range should always be encoded.

This tool and demo highlights this well: https://github.com/mathiasbynens/he https://mothereff.in/html-entities

@namusyaka
Copy link
Member

At least, I don't think this behavior disappoints current expectations.
Take a look at the comments on EscapeString and UnescapeString:

EscapeString escapes special characters like "<" to become "&lt;". It escapes only five such characters: <, >, &, ' and ". UnescapeString(EscapeString(s)) == s always holds, but the converse isn't always true.
UnescapeString unescapes entities like "&lt;" to become "<". It unescapes a larger range of entities than EscapeString escapes. For example, "&aacute;" unescapes to "á", as does "&#225;" and "&xE1;". UnescapeString(EscapeString(s)) == s always holds, but the converse isn't always true.

As you can see from above, the characters targeted for escape are limited. And, as you can see in many other languages, this behavior is sufficient for security.

Is there any reason why the &ldquo; and &rdquo; you are showing should be specially escaped?


Btw, your patch can be improved like the following:

diff --git a/html/escape.go b/html/escape.go
index d856139..3be3540 100644
--- a/html/escape.go
+++ b/html/escape.go
@@ -193,7 +193,7 @@ func lower(b []byte) []byte {
        return b
 }

-const escapedChars = "&'<>\"\r"
+const escapedChars = "&'<>\"\r“”"

 func escape(w writer, s string) error {
        i := strings.IndexAny(s, escapedChars)
@@ -218,7 +218,19 @@ func escape(w writer, s string) error {
                case '\r':
                        esc = "&#13;"
                default:
-                       panic("unrecognized escape character")
+                       if len(s) >= 3 {
+                               switch s[i : i+3] {
+                               case "“":
+                                       esc = "&ldquo;"
+                                       i += 2
+                               case "”":
+                                       esc = "&rdquo;"
+                                       i += 2
+                               }
+                       }
+                       if esc == "" {
+                               panic("unrecognized escape character")
+                       }
                }
                s = s[i+1:]
                if _, err := w.WriteString(esc); err != nil {

@gauntface
Copy link
Author

There is no reason for it being escaped beyond it's an unexpected behavior I hit when manipulating some HTML resulting in HTML being rendered incorrectly.

@drj11
Copy link

drj11 commented Feb 16, 2022

How is the HTML being rendered incorrectly? The HTML is correct in both cases: with or without the U+201C being escaped to an &ldquo; entity or not. If a browser is rendered that incorrectly, that sounds like a bug downstream of the HTML. For example, browser not using UTF-8 encoding, or the file-on-disk has been written not with UTF-8.

@golang golang locked and limited conversation to collaborators May 3, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

5 participants