x/net/html: nested <a> parsing issue #18865

Marqin · 2017-01-31T14:50:24Z

What version of Go are you using (`go version`)?

go version go1.7.4 darwin/amd64

What operating system and processor architecture are you using (`go env`)?

OS X 10.11.6 (amd64)

What did you do?

I was using a library for selecting HTML nodes (based on jQuery-style selector strings), and it always returned duplicated <a> nodes whenever an <a> contained a <div> with another <a> inside it (this is how the web page I'm scanning is built; I cannot change it). The creator of that library told me (PuerkitoBio/goquery#150) that this happens because x/net/html acts weirdly on my input, and that I should report it here.

My code - https://play.golang.org/p/MJ33wgyDjG

What did you expect to see?

<html><head></head><body>
<a class="myclass" href="/foo/bar">
  ...here is some stuff...
  </a><div><a href="/baz/baz">some stuff</a></div>
  ...here is also some stuff...


</body></html>

What did you see instead?

<html><head></head><body>
<a class="myclass" href="/foo/bar">
  ...here is some stuff...
  </a><div><a class="myclass" href="/foo/bar"></a><a href="/baz/baz">some stuff</a></div>
  ...here is also some stuff...


</body></html>

Conclusion

As you can see, your parser duplicates my <a class="myclass" href="/foo/bar"> node while parsing.


nigeltao · 2017-02-02T04:35:05Z

I can't remember if that original HTML is malformed because there are nested <a>s or because there's a <div> inside the outer <a>, but I believe that duplicating that <a> node is what the HTML5 parsing algorithm specifies to do.

It's not totally obvious, but start in The "in body" insertion mode at https://www.w3.org/TR/2011/WD-html5-20110113/tokenization.html#parsing-main-inbody and note two cases in the spec:

A start tag whose tag name is one of: ..., "div", ...
A start tag whose tag name is "a"

In particular, the second one gives this example:

In the non-conforming stream <a href="a">a<table><a href="b">b</table>x, the
first a element would be closed upon seeing the second one, and the "x"
character would be inside a link to "b", not to "a". This is despite the fact
that the outer a element is not in table scope (meaning that a regular </a> end
tag at the start of the table wouldn't close the outer a element). The result
is that the two a elements are indirectly nested inside each other —
non-conforming markup will often result in non-conforming DOMs when parsed.

which sounds similar.

In any case, the parse result (the DOM) is also what the Chrome browser yields. See the attached screenshot with Chrome's "inspect element mode" on:

Marqin · 2017-02-02T09:42:55Z

@nigeltao No. Maybe you have some old Chrome, because the newest one (56.0.2924.87 (64-bit) on OS X) doesn't insert that <a>:

Marqin · 2017-02-02T09:46:28Z

@nigeltao I've just noticed also that the original site is not HTML5, but XHTML 1.0, so I've changed my test code to:

package main

import (
	"fmt"
	"os"
	"strings"

	"golang.org/x/net/html"
)

const data = `
<html xmlns="http://www.w3.org/1999/xhtml"><head></head><body>
<a class="myclass" href="/foo/bar">
  ...here is some stuff...s
  <div><a href="/baz/baz">some stuff</a></div>
  ...here is also some stuff...
</a>
</body></html>
`

func main() {
	root, e := html.Parse(strings.NewReader(data))
	if e != nil {
		fmt.Println(e)
		return
	}
	html.Render(os.Stdout, root)
}

and it's still bugged (duplicated <a>).

Here is what Chrome (56.0.2924.87 (64-bit) on OS X) shows:

This is clearly a bug on your side: as you can see, even Chrome renders it properly without duplicating.

So please reopen this issue @nigeltao

nigeltao · 2017-02-03T00:56:58Z

Huh, that's weird. My Chrome version is also "56.0.2924.87 (Official Build) (64-bit)". Mine is Linux, not OS X, but I'd be very surprised if that's a significant difference.

How are you loading that HTML file into Chrome? Are you loading a file:///foo/bar URL, or are you serving it over HTTP (with some HTTP headers)?

nigeltao · 2017-02-03T01:11:14Z

Firefox (version 51.0.1, released January 26, 2017) also inserts an <a> tag:

nigeltao · 2017-02-03T01:36:53Z

html5lib (version 0.999) also inserts an <a> tag:

nigeltao · 2017-02-03T01:38:41Z

Textual version:

$ cat x.py
#!/usr/bin/python

import html5lib

def dump(node, indent):
    s = indent + node.tag
    for t in node.items():
        s += " " + t[0] + "=" + t[1]
    print s
    indent += "    "
    for child in node:
        dump(child, indent)

src = """<html><head></head><body>
<a class="myclass" href="/foo/bar">
  ...here is some stuff...
  <div><a href="/baz/baz">some stuff</a></div>
  ...here is also some stuff...
</a>
</body></html>"""

print html5lib.__version__
dump(html5lib.parse(src), "")
$ 
$ 
$ ./x.py 
0.999
{http://www.w3.org/1999/xhtml}html
    {http://www.w3.org/1999/xhtml}head
    {http://www.w3.org/1999/xhtml}body
        {http://www.w3.org/1999/xhtml}a href=/foo/bar class=myclass
        {http://www.w3.org/1999/xhtml}div
            {http://www.w3.org/1999/xhtml}a href=/foo/bar class=myclass
            {http://www.w3.org/1999/xhtml}a href=/baz/baz

nigeltao · 2017-02-03T02:26:21Z

A colleague of mine loaded up that HTML page in Chrome on Mac (version 56.0.2924.87). Once again, it also inserts an <a> tag:

nigeltao · 2017-02-03T02:27:04Z

It's a mystery why your Chrome's DOM tree is showing something different. Can you attach (maybe as a .zip file?) the exact HTML file you're loading in your browser?

Marqin · 2017-02-03T10:07:42Z

OK, I've found that there was an additional "" in my test HTML. Now Chrome renders it like that too.

Thanks for your support. I'll just have to put my results into a set rather than a list to get rid of the duplicates.
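
A minimal sketch of that set-based workaround, assuming the selected <a> nodes have already been reduced to their class and href attributes. The `Link` type and `dedupeLinks` function are hypothetical names used only for illustration; they are not part of x/net/html or goquery. Since the parser's inserted <a> carries the same attributes as the original, keying on those attributes collapses the pair into one entry:

```go
package main

import "fmt"

// Link is a hypothetical record holding the attributes of one selected <a> node.
type Link struct {
	Class string
	Href  string
}

// dedupeLinks returns the links with duplicates removed, keeping the first
// occurrence of each (Class, Href) pair and preserving the original order.
func dedupeLinks(links []Link) []Link {
	seen := make(map[Link]struct{}, len(links)) // map used as a set
	unique := make([]Link, 0, len(links))
	for _, l := range links {
		if _, ok := seen[l]; ok {
			continue // already recorded, skip the duplicate
		}
		seen[l] = struct{}{}
		unique = append(unique, l)
	}
	return unique
}

func main() {
	// Shape of the parsed result from this issue: the outer <a>, the copy
	// the parser inserts inside the <div>, and the inner <a>.
	links := []Link{
		{Class: "myclass", Href: "/foo/bar"},
		{Class: "myclass", Href: "/foo/bar"},
		{Class: "", Href: "/baz/baz"},
	}
	for _, l := range dedupeLinks(links) {
		fmt.Println(l.Href)
	}
}
```

Note that this treats any two links with identical attributes as the same link, so it would also merge genuinely separate <a> elements that happen to share a class and href.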

bradfitz added this to the Unreleased milestone Jan 31, 2017

bradfitz assigned nigeltao Jan 31, 2017

nigeltao closed this as completed Feb 2, 2017

nigeltao reopened this Feb 3, 2017

Marqin closed this as completed Feb 3, 2017

Marqin mentioned this issue Feb 3, 2017

selector duplicating entry on nested <a>, when second <a> is in <div> PuerkitoBio/goquery#150

Closed

golang locked and limited conversation to collaborators Feb 3, 2018

gopherbot added the FrozenDueToAge label Feb 3, 2018

rsc unassigned nigeltao Jun 23, 2022
