proposal: x/net/html: add ability to parse from a tokenizer #63177

MagicalTux · 2023-09-23T12:13:10Z

Package html implements an HTML5-compliant tokenizer and parser.

The two are exposed as separate APIs, but the parser always constructs its own tokenizer internally. It would sometimes be helpful to supply a different tokenizer, or to filter some tags out of the token stream before they reach the parser, but the package currently does not allow this.

This can be implemented easily, and without affecting existing code, by adding a parse function that accepts a tokenizer interface instead of a reader.

In my specific use case I am looking to filter out some tags that x/net/html does not handle well. I'm sure other people will have other use cases where the parser would be a lot more useful with the ability to specify a tokenizer.

The parser only uses the following methods from the tokenizer:

// TokenizerInterface is the minimum implementation required for the parser to be able to work
type TokenizerInterface interface {
        AllowCDATA(allowCDATA bool)
        Err() error
        NextIsNotRawText()
        Next() TokenType
        Token() Token
}

Which makes an implementation as simple as this:

diff --git a/html/parse.go b/html/parse.go
index 46a89ed..46a308b 100644
--- a/html/parse.go
+++ b/html/parse.go
@@ -17,7 +17,7 @@ import (
 // https://html.spec.whatwg.org/multipage/syntax.html#tree-construction
 type parser struct {
        // tokenizer provides the tokens for the parser.
-       tokenizer *Tokenizer
+       tokenizer TokenizerInterface
        // tok is the most recently read token.
        tok Token
        // Self-closing tags like <hr/> are treated as start tags, except that
@@ -2368,8 +2368,14 @@ func ParseOptionEnableScripting(enable bool) ParseOption {
 
 // ParseWithOptions is like Parse, with options.
 func ParseWithOptions(r io.Reader, opts ...ParseOption) (*Node, error) {
+       return ParseTokenizer(NewTokenizer(r), opts...)
+}
+
+// ParseTokenizer is like Parse or ParseWithOptions, but using an already instantiated
+// tokenizer instead of a reader
+func ParseTokenizer(tok TokenizerInterface, opts ...ParseOption) (*Node, error) {
        p := &parser{
-               tokenizer: NewTokenizer(r),
+               tokenizer: tok,
                doc: &Node{
                        Type: DocumentNode,
                },


ianlancetaylor · 2023-09-26T04:17:29Z

CC @neild @bradfitz

MagicalTux added the Proposal label Sep 23, 2023

gopherbot added this to the Proposal milestone Sep 23, 2023

seankhliao changed the title ~~proposal: x/net/html: Add ability to parse from a tokenizer~~ proposal: x/net/html: add ability to parse from a tokenizer Dec 8, 2023
