Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/net/html: html.Parse should document that it silently ignores invalid/unexpected nodes #26973

Closed
empijei opened this issue Aug 13, 2018 · 8 comments
Labels
Documentation FrozenDueToAge help wanted NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@empijei
Copy link
Contributor

empijei commented Aug 13, 2018

What version of Go are you using (go version)?

go1.9.4 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/clap/go"
GORACE=""
GOROOT="/usr/lib/go-1.9"
GOTOOLDIR="/usr/lib/go-1.9/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build659916345=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"

What did you do?

package main
import (
	"fmt"
	"strings"
	"golang.org/x/net/html"
)
func main() {
	n, _ := html.Parse(strings.NewReader(`<html><head></head><body><tr><td><pre>some text</pre></td></tr></body></html>`))
	fmt.Println(n.FirstChild.Data) // html
	fmt.Println(n.FirstChild.FirstChild.Data) // head
	fmt.Println(n.FirstChild.FirstChild.NextSibling.Data) // body
	// unexpected <tr> is skipped
	// unexpected <td> is skipped
	fmt.Println(n.FirstChild.FirstChild.NextSibling.FirstChild.Data) // pre
}

What did you expect to see?

A line in the documentation warning the users that html.Parse skips and ignores nodes.

Would also be nice to know when and how this happens (at least have an approximative description for this behavior.)

What did you see instead?

Code silently fixing the html

@gopherbot gopherbot added this to the Unreleased milestone Aug 13, 2018
@magical
Copy link
Contributor

magical commented Aug 13, 2018

FWIW, this is also how browsers interpret that snippet of HTML. <tr> and <td> are not meaningful outside of <table> so they are ignored.

Screenshot of firefox dev tools showing a DOM with pre but not tr or td

I believe Go is doing the right thing here.

@FMNSSun
Copy link

FMNSSun commented Aug 14, 2018

Parsing invalid HTML is a mess. Either you try to fix the errors in HTML (which browsers obviously do and do a great job at) or you return a parser error. I'd always advocate for "reject invalid inputs completely" rather than trying to fix it but in the case of HTML there are just so so many invalid HTML documents around that if your parser outright rejects invalid HTML then you restrict its usage.

FWIW: The parser generally ignores tokens if they aren't supposed to show up in that context. In the body context the following are skipped: caption, col, colgroup, head, tbody, td, tfoot, th, thead, tr.

<tr> as per spec is only allowed inside <table>, <thead>, <tbody>, <tfoot>.

In my opinion documenting all these would probably be overkill and it should suffice to document broadly that the parser may choose to ignore elements in invalid HTML documents.

@jimmyfrasche
Copy link
Member

Parsing html is a mess in the sense that it's complicated, but html5 codifies all of the ways in which it is messy and has very specific rules for handling malformed documents to make sure that everyone interprets them the same way.

If these are not supported, it is not an html5 parser. It would be a bug if it did not ignore those tags.

The behavior shown in this issue is documented here https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody (but you have to scroll for a while because there are a lot of rules). I got to this link through the docs for the package.

So this is working as intended and as documented—albeit somewhat implicitly and indirectly.

I agree with @FMNSSun that it would be better for the docs to broadly mention that Parse ignores tags as required by the spec.

I'm going to re-purpose this bug for making the documentation more explicit.

@jimmyfrasche jimmyfrasche added Documentation help wanted NeedsFix The path to resolution is known, but the work has not been done. labels Aug 14, 2018
@jimmyfrasche jimmyfrasche changed the title x/net/html: html.Parse silently ignores invalid/unexpected nodes x/net/html: html.Parse should document that it silently ignores invalid/unexpected nodes Aug 14, 2018
@empijei
Copy link
Contributor Author

empijei commented Aug 16, 2018

I agree with you, Go is doing the right thing here, and pointing it out in the doc is probably the best way to approach this.

Do you think referring to the document you linked would make sense?

@jimmyfrasche
Copy link
Member

It is albeit indirectly. I got to that link from the links in the docs. The html5 spec has many pages and we could find reasons to point to all of them.

I'm unsure of the best way to document this.

@gopherbot
Copy link

Change https://golang.org/cl/132535 mentions this issue: html: document html.Parse behaviour for invalid/unexpected nodes

@gopherbot
Copy link

Change https://golang.org/cl/132536 mentions this issue: html: document html.Parse behaviour for invalid/unexpected nodes

@gopherbot
Copy link

Change https://golang.org/cl/133695 mentions this issue: html: add more comments to Parse and ParseFragment

@golang golang locked and limited conversation to collaborators Sep 6, 2019
tmm1 pushed a commit to fancybits/go-net that referenced this issue Sep 12, 2022
They implement the HTML5 parsing algorithm, which is very complicated.

Fixes golang/go#26973

Change-Id: I83a5753ab00fe84f73797fcecd309306d9f24819
Reviewed-on: https://go-review.googlesource.com/133695
Reviewed-by: Kunpei Sakai <namusyaka@gmail.com>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Documentation FrozenDueToAge help wanted NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

5 participants