x/net/html: nested <a> parsing issue #18865

Closed
Marqin opened this issue Jan 31, 2017 · 10 comments

Comments

@Marqin

Marqin commented Jan 31, 2017

What version of Go are you using (go version)?

go version go1.7.4 darwin/amd64

What operating system and processor architecture are you using (go env)?

OS X 10.11.6 (amd64)

What did you do?

I was using a library for selecting HTML nodes (based on jQuery-style selector strings), and it always returned duplicated <a> nodes whenever an <a> contained a <div> with another <a> inside it (this is how the web page I'm scanning is built; I cannot change it). The library's author told me (PuerkitoBio/goquery#150) that this happens because x/net/html acts weirdly on my input, and that I should report it here.

My code - https://play.golang.org/p/MJ33wgyDjG

What did you expect to see?

<html><head></head><body>
<a class="myclass" href="/foo/bar">
  ...here is some stuff...
  </a><div><a href="/baz/baz">some stuff</a></div>
  ...here is also some stuff...


</body></html>

What did you see instead?

<html><head></head><body>
<a class="myclass" href="/foo/bar">
  ...here is some stuff...
  </a><div><a class="myclass" href="/foo/bar"></a><a href="/baz/baz">some stuff</a></div>
  ...here is also some stuff...


</body></html>

Conclusion

As you can see, the parser duplicates my <a class="myclass" href="/foo/bar"> node while parsing.

@bradfitz bradfitz added this to the Unreleased milestone Jan 31, 2017
@nigeltao
Contributor

nigeltao commented Feb 2, 2017

I can't remember whether the original HTML is malformed because of the nested <a>s or because of the <div> inside the outer <a>, but I believe that duplicating that <a> node is what the HTML5 parsing algorithm specifies.

It's not totally obvious, but start with the "in body" insertion mode at https://www.w3.org/TR/2011/WD-html5-20110113/tokenization.html#parsing-main-inbody and note two cases in the spec:

  • A start tag whose tag name is one of: ..., "div", ...
  • A start tag whose tag name is "a"

In particular, the second one gives this example:

In the non-conforming stream <a href="a">a<table><a href="b">b</table>x, the
first a element would be closed upon seeing the second one, and the "x"
character would be inside a link to "b", not to "a". This is despite the fact
that the outer a element is not in table scope (meaning that a regular </a> end
tag at the start of the table wouldn't close the outer a element). The result
is that the two a elements are indirectly nested inside each other —
non-conforming markup will often result in non-conforming DOMs when parsed.

which sounds similar.

In any case, the parse result (the DOM) is also what the Chrome browser yields. See the attached screenshot with Chrome's "inspect element mode" on:

[screenshot: Chrome DOM inspector]

@nigeltao nigeltao closed this as completed Feb 2, 2017
@Marqin
Author

Marqin commented Feb 2, 2017

@nigeltao No. Maybe you have an older Chrome, because the newest one (56.0.2924.87 (64-bit) on OS X) doesn't insert that <a>:
[screenshot, 2017-02-02 10:37:07]

@Marqin
Author

Marqin commented Feb 2, 2017

@nigeltao I've also just noticed that the original site is not HTML5 but XHTML 1.0, so I've changed my test code to:

package main

import (
	"fmt"
	"os"
	"strings"

	"golang.org/x/net/html"
)

const data = `
<html xmlns="http://www.w3.org/1999/xhtml"><head></head><body>
<a class="myclass" href="/foo/bar">
  ...here is some stuff...s
  <div><a href="/baz/baz">some stuff</a></div>
  ...here is also some stuff...
</a>
</body></html>
`

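func main() {
	// Parse applies the HTML5 tree-construction algorithm to the input.
	root, err := html.Parse(strings.NewReader(data))
	if err != nil {
		fmt.Println(err)
		return
	}
	// Render the resulting tree so that any inserted nodes are visible.
	if err := html.Render(os.Stdout, root); err != nil {
		fmt.Println(err)
	}
}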

and it still shows the bug (a duplicated <a>).

Here is what Chrome (56.0.2924.87 (64-bit) on OS X) shows:
[screenshot, 2017-02-02 10:45:42]

This is clearly a bug on your side: as you can see, even Chrome renders it properly without duplicating the node.

So please reopen this issue, @nigeltao.

@nigeltao
Contributor

nigeltao commented Feb 3, 2017

Huh, that's weird. My Chrome version is also "56.0.2924.87 (Official Build) (64-bit)". Mine is Linux, not OS X, but I'd be very surprised if that's a significant difference.

How are you loading that HTML file into Chrome? Are you loading a file:///foo/bar URL, or are you serving it over HTTP (with some HTTP headers)?

@nigeltao nigeltao reopened this Feb 3, 2017
@nigeltao
Contributor

nigeltao commented Feb 3, 2017

Firefox (version 51.0.1, released January 26, 2017) also inserts an <a> tag:

[screenshot: Firefox DOM inspector]

@nigeltao
Contributor

nigeltao commented Feb 3, 2017

html5lib (version 0.999) also inserts an <a> tag:

[screenshot: html5lib parse result]

@nigeltao
Contributor

nigeltao commented Feb 3, 2017

Textual version:

$ cat x.py
#!/usr/bin/python

import html5lib

def dump(node, indent):
    s = indent + node.tag
    for t in node.items():
        s += " " + t[0] + "=" + t[1]
    print s
    indent += "    "
    for child in node:
        dump(child, indent)

src = """<html><head></head><body>
<a class="myclass" href="/foo/bar">
  ...here is some stuff...
  <div><a href="/baz/baz">some stuff</a></div>
  ...here is also some stuff...
</a>
</body></html>"""

print html5lib.__version__
dump(html5lib.parse(src), "")
$ 
$ 
$ ./x.py 
0.999
{http://www.w3.org/1999/xhtml}html
    {http://www.w3.org/1999/xhtml}head
    {http://www.w3.org/1999/xhtml}body
        {http://www.w3.org/1999/xhtml}a href=/foo/bar class=myclass
        {http://www.w3.org/1999/xhtml}div
            {http://www.w3.org/1999/xhtml}a href=/foo/bar class=myclass
            {http://www.w3.org/1999/xhtml}a href=/baz/baz

@nigeltao
Contributor

nigeltao commented Feb 3, 2017

A colleague of mine loaded up that HTML page in Chrome on Mac (version 56.0.2924.87). Once again, it also inserts an <a> tag:

[screenshot: bug-18865-chrome-56]

@nigeltao
Contributor

nigeltao commented Feb 3, 2017

It's a mystery why your Chrome's DOM tree is showing something different. Can you attach (maybe as a .zip file?) the exact HTML file you're loading in your browser?

@Marqin Marqin closed this as completed Feb 3, 2017
@Marqin
Author

Marqin commented Feb 3, 2017

OK, I've found out that there was an additional "" in my test HTML; now Chrome renders it the same way.

Thanks for your support. I'll just have to put my results into a set rather than a list to get rid of the duplicates.
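
For anyone landing here with the same problem, the set-based workaround could look roughly like the sketch below. It assumes each matched <a> has already been reduced to its class and href attributes; the link type and the dedupe function are hypothetical names used for illustration and are not part of x/net/html or goquery.

```go
package main

import "fmt"

// link is a hypothetical record holding the attributes of one matched <a>.
type link struct {
	Class string
	Href  string
}

// dedupe drops links whose (Class, Href) pair was already seen,
// keeping the first occurrence and the original order.
func dedupe(links []link) []link {
	seen := make(map[link]struct{}, len(links))
	out := make([]link, 0, len(links))
	for _, l := range links {
		if _, ok := seen[l]; ok {
			continue // attribute-identical clone inserted by the parser
		}
		seen[l] = struct{}{}
		out = append(out, l)
	}
	return out
}

func main() {
	// Assumed input: what a selector for every <a> might yield on the
	// parsed tree from this issue, including the cloned outer <a>.
	matches := []link{
		{Class: "myclass", Href: "/foo/bar"},
		{Class: "myclass", Href: "/foo/bar"},
		{Href: "/baz/baz"},
	}
	for _, l := range dedupe(matches) {
		fmt.Printf("class=%q href=%q\n", l.Class, l.Href)
	}
}
```

Note that this treats two nodes as the same link whenever their attributes match. The cloned <a> in this issue is empty, so nothing is lost here, but on other inputs a clone may carry part of the original element's content.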
