Package html implements an HTML5-compliant tokenizer and parser.
The two are exposed separately, even though the parser uses the tokenizer internally. Sometimes it would be helpful to parse with a custom tokenizer, or to filter some tags out of the token stream while parsing; however, the package does not allow this.
It can, however, be implemented easily by adding a parse function that accepts a generic tokenizer interface, without affecting existing code.
In my specific use case I am looking to filter out some tags that x/net/html does not handle well. I'm sure other people will have other use cases where the parser would be a lot more useful with the ability to specify a tokenizer.
The parser only uses the following methods from the tokenizer:
// TokenizerInterface is the minimum implementation required for the parser to be able to work
type TokenizerInterface interface {
	AllowCDATA(allowCDATA bool)
	Err() error
	NextIsNotRawText()
	Next() TokenType
	Token() Token
}
Which makes an implementation as simple as this:
diff --git a/html/parse.go b/html/parse.go
index 46a89ed..46a308b 100644
--- a/html/parse.go
+++ b/html/parse.go
@@ -17,7 +17,7 @@ import (
// https://html.spec.whatwg.org/multipage/syntax.html#tree-construction
type parser struct {
// tokenizer provides the tokens for the parser.
-	tokenizer *Tokenizer
+	tokenizer TokenizerInterface
// tok is the most recently read token.
tok Token
// Self-closing tags like <hr/> are treated as start tags, except that
@@ -2368,8 +2368,14 @@ func ParseOptionEnableScripting(enable bool) ParseOption {
// ParseWithOptions is like Parse, with options.
func ParseWithOptions(r io.Reader, opts ...ParseOption) (*Node, error) {
+	return ParseTokenizer(NewTokenizer(r), opts...)
+}
+
+// ParseTokenizer is like Parse or ParseWithOptions, but using an already instantiated
+// tokenizer instead of a reader
+func ParseTokenizer(tok TokenizerInterface, opts ...ParseOption) (*Node, error) {
p := &parser{
-		tokenizer: NewTokenizer(r),
+		tokenizer: tok,
doc: &Node{
Type: DocumentNode,
},
seankhliao changed the title from "proposal: x/net/html: Add ability to parse from a tokenizer" to "proposal: x/net/html: add ability to parse from a tokenizer" on Dec 8, 2023