x/net/html: html.Parse should document that it silently ignores invalid/unexpected nodes #26973
Comments
Parsing invalid HTML is a mess. Either you try to fix the errors in the HTML (which browsers do, and do well) or you return a parser error. I'd normally advocate rejecting invalid input completely rather than trying to fix it, but there are so many invalid HTML documents around that a parser which rejects them outright would be of limited use. FWIW: the parser generally ignores tokens if they aren't supposed to show up in that context. In the
In my opinion, documenting all of these cases would probably be overkill; it should suffice to document broadly that the parser may choose to ignore elements in invalid HTML documents.
Parsing HTML is a mess in the sense that it's complicated, but HTML5 codifies all of the ways in which it is messy and has very specific rules for handling malformed documents, so that everyone interprets them the same way. A parser that does not follow these rules is not an HTML5 parser; it would be a bug if it did not ignore those tags. The behavior shown in this issue is documented at https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody (you have to scroll for a while because there are a lot of rules). I got to this link through the docs for the package. So this is working as intended and as documented, albeit somewhat implicitly and indirectly. I agree with @FMNSSun that it would be better for the docs to broadly mention that Parse ignores tags as required by the spec. I'm going to re-purpose this bug for making the documentation more explicit.
I agree with you: Go is doing the right thing here, and pointing it out in the docs is probably the best way to approach this. Do you think referring to the document you linked would make sense?
It is, albeit indirectly. I got to that link from the links in the docs. The HTML5 spec has many pages and we could find reasons to point to all of them. I'm unsure of the best way to document this.
Change https://golang.org/cl/132535 mentions this issue:
Change https://golang.org/cl/132536 mentions this issue:
Change https://golang.org/cl/133695 mentions this issue:
They implement the HTML5 parsing algorithm, which is very complicated.

Fixes golang/go#26973

Change-Id: I83a5753ab00fe84f73797fcecd309306d9f24819
Reviewed-on: https://go-review.googlesource.com/133695
Reviewed-by: Kunpei Sakai <namusyaka@gmail.com>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
What version of Go are you using (go version)?
go1.9.4 linux/amd64
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
What did you do?
What did you expect to see?
A line in the documentation warning the users that html.Parse skips and ignores nodes.
It would also be nice to know when and how this happens (or at least to have an approximate description of this behavior).
What did you see instead?
Code silently fixing the HTML.