proposal: x/net/html: add ability to parse from a tokenizer #63177

Open
MagicalTux opened this issue Sep 23, 2023 · 1 comment

@MagicalTux

Package html implements an HTML5-compliant tokenizer and parser.

The two are exposed separately, even though the parser uses the tokenizer internally. It would sometimes be useful to supply a custom tokenizer, or to filter some tokens out of the tokenizer's output before they reach the parser, but the package does not currently allow this.

This can be implemented easily, without affecting existing code, by adding a parse function that accepts a tokenizer interface instead of a reader.

In my specific use case, I want to filter out some tags that x/net/html does not handle well. I expect other people have use cases where the parser would be much more useful if it could be given a tokenizer.

The parser only uses the following methods from the tokenizer:

// TokenizerInterface is the minimal set of methods the parser needs from a tokenizer.
type TokenizerInterface interface {
        AllowCDATA(allowCDATA bool)
        Err() error
        NextIsNotRawText()
        Next() TokenType
        Token() Token
}

With that interface, the change is as small as this:

diff --git a/html/parse.go b/html/parse.go
index 46a89ed..46a308b 100644
--- a/html/parse.go
+++ b/html/parse.go
@@ -17,7 +17,7 @@ import (
 // https://html.spec.whatwg.org/multipage/syntax.html#tree-construction
 type parser struct {
        // tokenizer provides the tokens for the parser.
-       tokenizer *Tokenizer
+       tokenizer TokenizerInterface
        // tok is the most recently read token.
        tok Token
        // Self-closing tags like <hr/> are treated as start tags, except that
@@ -2368,8 +2368,14 @@ func ParseOptionEnableScripting(enable bool) ParseOption {
 
 // ParseWithOptions is like Parse, with options.
 func ParseWithOptions(r io.Reader, opts ...ParseOption) (*Node, error) {
+       return ParseTokenizer(NewTokenizer(r), opts...)
+}
+
+// ParseTokenizer is like Parse or ParseWithOptions, but uses an already instantiated
+// tokenizer instead of a reader.
+func ParseTokenizer(tok TokenizerInterface, opts ...ParseOption) (*Node, error) {
        p := &parser{
-               tokenizer: NewTokenizer(r),
+               tokenizer: tok,
                doc: &Node{
                        Type: DocumentNode,
                },
@gopherbot gopherbot added this to the Proposal milestone Sep 23, 2023
@ianlancetaylor
Contributor

CC @neild @bradfitz

@seankhliao seankhliao changed the title from "proposal: x/net/html: Add ability to parse from a tokenizer" to "proposal: x/net/html: add ability to parse from a tokenizer" Dec 8, 2023
Projects
Status: Incoming