Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: x/net/html: add Node.FindAll #62219

Open
tebeka opened this issue Aug 22, 2023 · 10 comments
Open

proposal: x/net/html: add Node.FindAll #62219

tebeka opened this issue Aug 22, 2023 · 10 comments
Labels
Milestone

Comments

@tebeka
Copy link
Contributor

tebeka commented Aug 22, 2023

Adding node.FindAll(pred func(*Node) []*Node) function will make using the code less lower level.

Example usage:

isTr := func(node *html.Node) bool {
	return node.Type == html.ElementNode && node.Data == "tr"
}
for _, tr := doc.FindAll(isTr) {
	fmt.Println(tr)
}

A possible implementation is:

// FindAll returns all child nodes (including the current node) matching pred.
func (n *Node) FindAll(pred func(*html.Node) bool) []*html.Node {
	var out []*html.Node
	stack := []*html.Node{node}

	for len(stack) != 0 {
		n := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if pred(n) {
			out = append(out, n)
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			stack = append(stack, c)
		}
	}

	return out
}
@gopherbot gopherbot added this to the Proposal milestone Aug 22, 2023
@tebeka
Copy link
Contributor Author

tebeka commented Aug 22, 2023

See also #62113

@seankhliao
Copy link
Member

how common is this?

@tebeka
Copy link
Contributor Author

tebeka commented Aug 23, 2023

@seankhliao According to github code search there are about 2,600 places where the for loop from the example in the documentation (for c := n.FirstChild; c != nil; c = c.NextSibling {) is used.

For me personally, it's the first time I'm parsing HTML with Go. It reduced code size once I wrote it.

@seankhliao
Copy link
Member

that search result might support #62113, but doesn't say much about needing a find variant

@tebeka
Copy link
Contributor Author

tebeka commented Aug 23, 2023

If I add if c.Type == html.ElementNode to the code search query, it goes does to about 170 files. Not sure if this is the right search.

@earthboundkid
Copy link
Contributor

earthboundkid commented Aug 23, 2023

I think if #62113 is added, then this becomes unnecessary because it would be trivial to write slices.Collect(iter.Filter(doc.All(), pred)). Of course, it is still a good idea for a helper library that smooths out x/net/html and makes it more convenient, but for x/net/html itself, it's just providing the core HTML parsing abilities and other capabilities can be added on top in third party libraries like https://pkg.go.dev/github.com/go-shiori/dom, https://pkg.go.dev/github.com/yhat/scrape, and my own xhtml package.

@tebeka
Copy link
Contributor Author

tebeka commented Aug 24, 2023

OTOH, doc.All is a special case of doc.FindAll with a predicate that always returns true.

@earthboundkid
Copy link
Contributor

No, as proposed, doc.All() is an iterator, not a slice, so it’s quite different. It doesn’t allocate a slot for every child.

@tebeka
Copy link
Contributor Author

tebeka commented Aug 24, 2023

Once we have iterators, we can implement doc.FindAll as iterator as well.
Since we don't know when (if?) iterators will come, I think we can start with a sliced based implementation.

@earthboundkid
Copy link
Contributor

Iterators are on track to release in February of 2024. The Go team aren't going to change the signature of a method in x/net/html after using it ~six months.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Incoming
Development

No branches or pull requests

4 participants