Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/net/html: provide convenience methods FirstElementChild/NextElementSibling #15996

Open
RayfenWindspear opened this issue Jun 7, 2016 · 9 comments
Milestone

Comments

@RayfenWindspear
Copy link

RayfenWindspear commented Jun 7, 2016

Please answer these questions before submitting your issue. Thanks!

  1. What version of Go are you using (go version)?

go version go1.6.1 linux/amd64

  1. What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/user/go"
GORACE=""
GOROOT="/usr/lib/go-1.6"
GOTOOLDIR="/usr/lib/go-1.6/pkg/tool/linux_amd64"
GO15VENDOREXPERIMENT="1"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0"
CXX="g++"
CGO_ENABLED="1"

  1. What did you do?
    If possible, provide a recipe for reproducing the error.
    A complete runnable program is good.
    A link on play.golang.org is best.

Here is a link to golang playground, but note that this library doesn't actually run on the playground. I am providing a playground link because I cannot format the code correctly here given the string formatting I am using.

https://play.golang.org/p/DrERE6s0tT

  1. What did you expect to see?

The last printed div is supposed to have children as per the html string.

  1. What did you see instead?

Instead it is parsed as a Text NodeType, and thus has no children, or even data.

@RayfenWindspear
Copy link
Author

One other thing to note is that calling html.Render will reproduce the missing elements, despite them not being represented as Nodes.

@kennygrant
Copy link
Contributor

kennygrant commented Jun 11, 2016

Nodes also contain TextNodes for their content (including whitespace), which is what you're seeing here. You can test this with a simpler version of the html, with some whitespace before the last section tag, which becomes the last child node:

<section><div><div></div></div> </section>

Adding a test for section.LastChild.Type to your example shows it is of type TextNode (either with this simpler html or with your example). Removing the whitespace from your example means it works as you expect.

https://play.golang.org/p/uGfE2LTsjg

So I don't think this is a bug?

@RayfenWindspear
Copy link
Author

RayfenWindspear commented Jun 11, 2016

Isn't whitespace supposed to be ignored as part of the HTML specification? I'm not sure if this is the right reference, but this portion of the HTML 5 spec is about space separated tokens, which I'm assuming should include tokenization of tags.

https://html.spec.whatwg.org/multipage/infrastructure.html#space-separated-tokens

@RayfenWindspear
Copy link
Author

RayfenWindspear commented Jun 11, 2016

This may be a better resource. It specifically mentions inter-element-whitespace. It states that these Text nodes consisting of only whitespace must be ignored. The paragraph above defines inter-element-whitespace.

https://www.w3.org/TR/html/dom.html#inter-element-whitespace

With the html package behaving this way, it doesn't match the way the html is parsed into a DOM by a browser. If the browser had recognized these nodes as Text Nodes, there would be raw html bleeding through all over the web due to a few misplaced whitespaces. And there would be no way to access these children through either raw JS or jQuery.

@kennygrant
Copy link
Contributor

kennygrant commented Jun 11, 2016

Can you reproduce your expected behaviour in a browser using the DOM? If I try this in Chrome or Safari on your snippet or the simpler document (NB whitespace):

<html><head></head><body><section><div><div></div></div> </section></body></html>

using a similar query:

document.firstChild.lastChild.firstChild.lastChild

I get a textNode with whitespace content, which mirrors what this parser is doing. In your second link it does say they should parse whitespace as text nodes, but then says these should be ignored in the green box (which seems to contradict what browsers do). The DOM does have separate firstElementChild methods to ignore comments and whitespace, but firstChild etc do return white space.

I will leave it to the maintainers to comment further but I hope this helps a little.

@RayfenWindspear
Copy link
Author

RayfenWindspear commented Jun 12, 2016

Interesting. I think I see what is going on here. It's not that the browser doesn't support this, it supplies it in a different way. Here is the functionality I have been expecting document.firstElementChild, which skips the empty Text Nodes as in the html specification. However, the net/html package only supplies the functionality of the form document.firstChild, which does not.

Given this realization, yes you are right to say that the package isn't malfunctioning. So in effect, this is actually a feature request rather than a bug report.

EDIT: Hah, I read your example and tried it before I finished reading your post and ended up discovering the same new methods.

@RayfenWindspear
Copy link
Author

RayfenWindspear commented Jun 12, 2016

Chose not to edit previous post.

So the functionality I am looking for is easily provided by a for loop skipping over Text Nodes. Doing this manually provides me with the data that I am looking for as in this example.

https://play.golang.org/p/ZNKgZATPUI

Should I close the issue and go through the process to be able to add the functionality myself (I know pull requests on github are not granted), or is someone willing to add the feature for me?

Basically what is working for me amounts to this

`func firstElementChild(n *html.Node) *html.Node {
for c := n.FirstChild; c != nil; c = c.NextSibling {
if c.Type == html.ElementNode {
return c
}
}
return nil
}

func nextElementSibling(n *html.Node) *html.Node {
for s := n.NextSibling; s != nil; s = s.NextSibling {
if s.Type == html.ElementNode {
return s
}
}
return nil
}`

@quentinmit quentinmit changed the title golang.org/x/net/html Not parsing all Nodes of html string. x/net/html: Not parsing all Nodes of html string. Jun 16, 2016
@quentinmit quentinmit added this to the Unreleased milestone Jun 16, 2016
@quentinmit
Copy link
Contributor

@RayfenWindspear You can use this issue for tracking that addition. I don't know how likely it is to be accepted, though, since as you demonstrated it is easy to implement yourself.

/cc @adg @nigeltao Are we likely to add these methods?

@quentinmit quentinmit changed the title x/net/html: Not parsing all Nodes of html string. x/net/html: provide convenience methods FirstElementChild/NextElementSibling Jun 16, 2016
@nigeltao
Copy link
Contributor

Yes, whitespace is ignored, but at rendering time, not at parse time. kennygrant gave a javascript example, but for a Python example, this program:

#!/usr/bin/python

import html5lib

h = '''<section id="job-snapshot-section">
    <h2>Job Snapshot</h2>
    <div class="section-body">

        <div class="snap-line">
            <strong><span id="CBBody_Snapshot__ctl10_SnapshotKey">Base Pay</span></strong>
            <span><span id="CBBody_Snapshot__ctl10_SnapshotValue">$150,000.00 - $160,000.00 /Year</span></span>
        </div>                                        
    </div>
</section>'''

doc = html5lib.parse(h)
for t in doc.itertext():
    print "[" + t + "]"

prints

[
    ]
[Job Snapshot]
[
    ]
[

        ]
[
            ]
[Base Pay]
[
            ]
[$150,000.00 - $160,000.00 /Year]
[
        ]
[                                        
    ]
[
]

because those (whitespace) text nodes are part of the DOM tree.

As for a FirstElementChild method, I suppose it's possible, although it's straightforward to provide that as a function (in your package). It doesn't need to be a method (in package html).

Or, if you want to loop only over element children, it's trivial:

for c := n.FirstChild; c != nil; c = c.NextSibling {
  if c.Type != html.ElementNode {
    continue
  }
  etc
}

so I'm not convinced yet for the need for FirstElementChild and NextElementSibling methods.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants