x/net/html: tokenizer error #34281

Closed
pidario opened this issue Sep 13, 2019 · 5 comments
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.

@pidario

pidario commented Sep 13, 2019

What version of Go are you using (go version)?

go version go1.13 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
GO111MODULE=""
GOARCH="amd64"
GOBIN="/home/dario/.local/bin"
GOCACHE="/home/dario/.cache/go-build"
GOENV="/home/dario/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/dario/.local/lib/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/dario/gows/testhtmltoken/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build579414839=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I was extracting titles and meta tags from web pages and found that, if the text of a title element ends with a <, the tokenizer cannot tell where the text ends and the closing tag starts.

https://play.golang.org/p/KO2-PEfpccQ

What did you expect to see?

tag: 'title'
text: 'title ko<'
tag: 'title'
text: 'title ok'

What did you see instead?

tag: 'title'
text: 'title ko<</title>'
tag: 'title'
text: 'title ok'
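
For illustration, the expected behavior can be sketched without the library. The snippet below is not the token.go code and not the fix from any CL; `splitRCDATA` is a hypothetical helper, and the input string is an assumption reconstructed from the output above (the playground source is not reproduced here). It follows the rule in the WHATWG tokenization spec for RCDATA elements such as title: a `<` ends the text only when it starts `</title` (ASCII case-insensitive) followed by whitespace, `/`, or `>`. Any other `<` is ordinary text.

```go
package main

import (
	"fmt"
	"strings"
)

// splitRCDATA is a hypothetical helper, not part of golang.org/x/net/html.
// Given the input that follows an opening tag such as <title>, it returns
// the text content and the remaining input, starting at the matching end tag.
// A '<' ends the text only when it begins "</name" (case-insensitive)
// followed by whitespace, '/', or '>'. If no end tag is found, all of s is
// returned as text and rest is empty.
func splitRCDATA(s, name string) (text, rest string) {
	for i := 0; i < len(s); i++ {
		if s[i] != '<' {
			continue
		}
		after := s[i+1:]
		if len(after) < 1+len(name) || after[0] != '/' {
			continue // a lone '<' is plain text
		}
		if !strings.EqualFold(after[1:1+len(name)], name) {
			continue // an end tag for some other element is plain text too
		}
		tail := after[1+len(name):]
		if tail == "" {
			continue // "</name" at end of input has no terminator
		}
		switch tail[0] {
		case ' ', '\t', '\n', '\f', '/', '>':
			return s[:i], s[i:]
		}
	}
	return s, ""
}

func main() {
	// Assumed input: the content after the first <title> start tag.
	input := "title ko<</title><title>title ok</title>"
	text, rest := splitRCDATA(input, "title")
	fmt.Printf("text: %q\n", text)
	fmt.Printf("rest: %q\n", rest)
}
```

With the assumed input, the first `<` is kept as text and the second one starts the end tag, so the text comes out as `title ko<`, matching the expected output above.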
@gopherbot gopherbot added this to the Unreleased milestone Sep 13, 2019
@dmitshur dmitshur changed the title from "x/net/html tokenizer error" to "x/net/html: tokenizer error" Sep 14, 2019
@toothrot toothrot added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Sep 16, 2019
@toothrot
Contributor

/cc @namusyaka @nigeltao

@nigeltao
Contributor

Yeah, it's probably a bug in the HTML tokenizer.

It's been a while since I've looked at https://www.w3.org/TR/html52/syntax.html#tokenization. Somebody would need to figure out how it maps back to the token.go code and therefore what the spec-compliant fix is.

I don't have a lot of spare time right now. Sorry.

@pidario
Author

pidario commented Sep 19, 2019

Thanks for the feedback.

Neither do I, but I'll see if I can find some time to work on it.

@gopherbot

Change https://golang.org/cl/196620 mentions this issue: html: fix tokenizer error

@namusyaka
Member

namusyaka commented Sep 24, 2019

@nigeltao Since the tokenizer implementation does not yet conform to the spec, I've reviewed the CL and suggested a quick fix based on the WHATWG spec. It is strictly a best-effort fix.

@namusyaka namusyaka added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Sep 24, 2019
@golang golang locked and limited conversation to collaborators Oct 1, 2020