x/net/html: html.Parse should document that it silently ignores invalid/unexpected nodes #26973

empijei · 2018-08-13T21:37:08Z

What version of Go are you using (`go version`)?

go1.9.4 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (`go env`)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/clap/go"
GORACE=""
GOROOT="/usr/lib/go-1.9"
GOTOOLDIR="/usr/lib/go-1.9/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build659916345=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"

What did you do?

package main
import (
	"fmt"
	"strings"
	"golang.org/x/net/html"
)
func main() {
	n, _ := html.Parse(strings.NewReader(`<html><head></head><body><tr><td><pre>some text</pre></td></tr></body></html>`))
	fmt.Println(n.FirstChild.Data) // html
	fmt.Println(n.FirstChild.FirstChild.Data) // head
	fmt.Println(n.FirstChild.FirstChild.NextSibling.Data) // body
	// unexpected <tr> is skipped
	// unexpected <td> is skipped
	fmt.Println(n.FirstChild.FirstChild.NextSibling.FirstChild.Data) // pre
}

What did you expect to see?

A line in the documentation warning the users that html.Parse skips and ignores nodes.

Would also be nice to know when and how this happens (at least have an approximative description for this behavior.)

What did you see instead?

Code silently fixing the html

The text was updated successfully, but these errors were encountered:

magical · 2018-08-13T22:32:24Z

FWIW, this is also how browsers interpret that snippet of HTML. <tr> and <td> are not meaningful outside of <table> so they are ignored.

I believe Go is doing the right thing here.

FMNSSun · 2018-08-14T08:29:42Z

Parsing invalid HTML is a mess. Either you try to fix the errors in HTML (which browsers obviously do and do a great job at) or you return a parser error. I'd always advocate for "reject invalid inputs completely" rather than trying to fix it but in the case of HTML there are just so so many invalid HTML documents around that if your parser outright rejects invalid HTML then you restrict its usage.

FWIW: The parser generally ignores tokens if they aren't supposed to show up in that context. In the body context the following are skipped: caption, col, colgroup, head, tbody, td, tfoot, th, thead, tr.

<tr> as per spec is only allowed inside <table>, <thead>, <tbody>, <tfoot>.

In my opinion documenting all these would probably be overkill and it should suffice to document broadly that the parser may choose to ignore elements in invalid HTML documents.

jimmyfrasche · 2018-08-14T15:13:11Z

Parsing html is a mess in the sense that it's complicated, but html5 codifies all of the ways in which it is messy and has very specific rules for handling malformed documents to make sure that everyone interprets them the same way.

If these are not supported, it is not an html5 parser. It would be a bug if it did not ignore those tags.

The behavior shown in this issue is documented here https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody (but you have to scroll for a while because there are a lot of rules). I got to this link through the docs for the package.

So this is working as intended and as documented—albeit somewhat implicitly and indirectly.

I agree with @FMNSSun that it would be better for the docs to broadly mention that Parse ignores tags as required by the spec.

I'm going to re-purpose this bug for making the documentation more explicit.

empijei · 2018-08-16T20:18:30Z

I agree with you, Go is doing the right thing here, and pointing it out in the doc is probably the best way to approach this.

Do you think referring to the document you linked would make sense?

jimmyfrasche · 2018-08-16T20:27:40Z

It is albeit indirectly. I got to that link from the links in the docs. The html5 spec has many pages and we could find reasons to point to all of them.

I'm unsure of the best way to document this.

gopherbot · 2018-08-31T03:33:11Z

Change https://golang.org/cl/132535 mentions this issue: html: document html.Parse behaviour for invalid/unexpected nodes

gopherbot · 2018-08-31T03:43:49Z

Change https://golang.org/cl/132536 mentions this issue: html: document html.Parse behaviour for invalid/unexpected nodes

gopherbot · 2018-09-06T07:12:09Z

Change https://golang.org/cl/133695 mentions this issue: html: add more comments to Parse and ParseFragment

They implement the HTML5 parsing algorithm, which is very complicated. Fixes golang/go#26973 Change-Id: I83a5753ab00fe84f73797fcecd309306d9f24819 Reviewed-on: https://go-review.googlesource.com/133695 Reviewed-by: Kunpei Sakai <namusyaka@gmail.com> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>

gopherbot added this to the Unreleased milestone Aug 13, 2018

jimmyfrasche added Documentation help wanted NeedsFix The path to resolution is known, but the work has not been done. labels Aug 14, 2018

jimmyfrasche changed the title ~~x/net/html: html.Parse silently ignores invalid/unexpected nodes~~ x/net/html: html.Parse should document that it silently ignores invalid/unexpected nodes Aug 14, 2018

gopherbot closed this as completed in golang/net@161cd47 Sep 6, 2018

golang locked and limited conversation to collaborators Sep 6, 2019

gopherbot added the FrozenDueToAge label Sep 6, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

x/net/html: html.Parse should document that it silently ignores invalid/unexpected nodes #26973

x/net/html: html.Parse should document that it silently ignores invalid/unexpected nodes #26973

empijei commented Aug 13, 2018

magical commented Aug 13, 2018

FMNSSun commented Aug 14, 2018 •

edited

jimmyfrasche commented Aug 14, 2018

empijei commented Aug 16, 2018

jimmyfrasche commented Aug 16, 2018

gopherbot commented Aug 31, 2018

gopherbot commented Aug 31, 2018

gopherbot commented Sep 6, 2018

x/net/html: html.Parse should document that it silently ignores invalid/unexpected nodes #26973

x/net/html: html.Parse should document that it silently ignores invalid/unexpected nodes #26973

Comments

empijei commented Aug 13, 2018

What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (go env)?

What did you do?

What did you expect to see?

What did you see instead?

magical commented Aug 13, 2018

FMNSSun commented Aug 14, 2018 • edited

jimmyfrasche commented Aug 14, 2018

empijei commented Aug 16, 2018

jimmyfrasche commented Aug 16, 2018

gopherbot commented Aug 31, 2018

gopherbot commented Aug 31, 2018

gopherbot commented Sep 6, 2018

What version of Go are you using (`go version`)?

What operating system and processor architecture are you using (`go env`)?

FMNSSun commented Aug 14, 2018 •

edited