Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/net/html: void element <link> has child nodes #10535

Open
dvyukov opened this issue Apr 22, 2015 · 10 comments
Open

x/net/html: void element <link> has child nodes #10535

dvyukov opened this issue Apr 22, 2015 · 10 comments
Milestone

Comments

@dvyukov
Copy link
Member

dvyukov commented Apr 22, 2015

The following program crashes:

package main

import (
    "golang.org/x/net/html"
    "io/ioutil"
    "strings"
)

func main() {
    nodes, err := html.ParseFragment(strings.NewReader("<svg ><link >0"), nil)
    if err != nil {
        return
    }
    for _, n := range nodes {
        if err := html.Render(ioutil.Discard, n); err != nil {
            panic(err)
        }
    }
}
panic: html: void element <link> has child nodes

Render must be able handle Parse output. Or otherwise Parse must not accept the input as valid.

On commit 6f62f426de90c0ed6a55207b51476115fcb17237.

@dvyukov dvyukov added this to the Go1.5 milestone Apr 22, 2015
@dvyukov
Copy link
Member Author

dvyukov commented Apr 22, 2015

Similar crash for "<svg ><input >0"

@dvyukov
Copy link
Member Author

dvyukov commented Apr 22, 2015

Similar crash for `"

@dvyukov
Copy link
Member Author

dvyukov commented Apr 23, 2015

the same for area element

@rsc rsc modified the milestones: Unreleased, Go1.5 Apr 26, 2015
@andybalholm
Copy link
Contributor

A "correct" parse according to the HTML5 spec can produce a parse tree that is not well-formed. But what should we do about it in Render? My initial thought would be that if a void element has child nodes, we should just render it as if it weren't a void element. Of course that is garbage, but it is in keeping with the principle of garbage in, garbage out.

@dvyukov
Copy link
Member Author

dvyukov commented Jul 27, 2015

If an input is parsed, then I would expect it to be serialized into its original form. The next HTML5-complaint parser in the chain should be able to parse it successfully again, right?

@andybalholm
Copy link
Contributor

That sounds good, but it is very far from the case with HTML5. HTML5 takes Postel's principle to the extreme: any random string of data can be parsed as an HTML5 document. (It probably won't validate, but it will parse.) There are places in the spec where parsers are allowed to return errors, but they also specify what the result should be if they don't. In all those cases, we chose to have the html package continue parsing instead of returning an error.

Having the rendered output exactly the same as the input is actually quite rare, because of things like capitalization and attribute quoting. <INPUT TYPE=TEXT> renders as <input type="text" />. But where things get really interesting is in situations like this where the parse tree is not well-formed. See http://godoc.org/golang.org/x/net/html#Render for more details.

@dvyukov
Copy link
Member Author

dvyukov commented Jul 27, 2015

Well, at least it should be parsed by net/html parser again and maybe by some subset of other parsers.
I see two use cases:

  1. analyze the HTML in memory
    or 2. analyze, modify and serialize HTML
    And the second case does not work now for semi-invalid documents.
    Assuming that the original document is parsed successfully by the target HTML parser (e.g. a permissive browser parser), if we serialize an invalid document the same way it was in the original document, it still will be parsed by the target parser, so we did not make things worse. We just allow to modify and serialize semi-invalid documents.

@andybalholm
Copy link
Contributor

I agree that returning an error from Render is probably not what we want to do in this case. And we should do what we reasonably can to make invalid trees render sensibly. (Actually I suspect that this tree is valid; does SVG allow children for those elements?)

But we don't want to fill the Render function with special-case heuristics that attempt to reconstruct the markup underlying invalid parse trees.

For your use case #2, it would be very desirable for Render to never return an error unless the destination Writer returns one. That would make its operation parallel to Parse, which accepts all possible inputs. @nigeltao, what do you think? Should we eliminate errors from Render, and just make it produce the least-wrong output practical?

@nigeltao
Copy link
Contributor

I think this is WAI. Render's doc comment already says, "Rendering is done on a 'best effort' basis".

The HTML5 parsing algorithm (https://html.spec.whatwg.org/multipage/syntax.html#tree-construction) is an enormous nest of special cases. It prints at 130 pages. I don't think that it guarantees to be idempotent, or even self-consistent (e.g. "void elements" are separately listed in a separate document at http://www.w3.org/TR/html5/syntax.html#void-elements), and I can't see an obvious proof that it is either.

Given that, the Rendering algorithm is naive, and doesn't promise to render everything you can parse. In any case, the error in the original post is indeed accurate: the void element has child nodes, and it shouldn't.

You could argue that Render could return a special RenderingError, separate from I/O errors, that callers are free to ignore. But in general, the space of "semi-invalid documents" is ill-defined, and as I said earlier, it's not clear that the parser algorithm guarantees to produce a "valid document", so I don't think Render should never return errors (other than I/O errors).

You could argue, then that the parser shouldn't have produced a tree that was like that, but the parser is what it is, specified by a 130-page spec that's already too complicated. It would be nice if HTML5 was based on sensible, consistent foundations, but it isn't.

@andybalholm
Copy link
Contributor

As I think about this more, I think the real issue is that we are applying the HTML void element list to SVG elements. I'm pretty sure that the input element that has children is an svg:input (though input doesn't mean anything that I know of in that namespace), not a regular HTML input. So maybe we should check the namespace before we check the void element list, since the concept of void elements doesn't apply to foreign content.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants