Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/net/html/charset: DetermineEncoding returns encoding.Nop and "utf-8" as name #46343

Open
Rikkuru opened this issue May 24, 2021 · 2 comments
Open
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@Rikkuru
Copy link

Rikkuru commented May 24, 2021

What version of Go are you using (go version)?

$ go version
go version go1.13 linux/amd64

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/d.fedorova/.cache/go-build"
GOENV="/home/d.fedorova/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/d.fedorova/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/lib/golang"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/golang/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/d.fedorova/mailapi/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build166701564=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I called DetermineEncoding for a utf8 invalid string that has invalid bytes after first 1024 bytes.
https://play.golang.org/p/M0LeleRdj16

What did you expect to see?

I expect encoding returned by DetermineEncoding to decode my content into valid utf8 string even if certain == false (with replacement characters if needed).
I expected https://pkg.go.dev/golang.org/x/text/encoding/unicode#UTF8 to be returned with utf-8 name.
It seems wrong to me to return encoding.Nop if we only checked prefix of all content.
https://github.com/golang/net/blob/master/html/charset/charset.go#L98

if hasHighBit && utf8.Valid(content) {
    return encoding.Nop, "utf-8", false
}

What did you see instead?

encoding.Nop is returned with name utf-8 (https://pkg.go.dev/golang.org/x/text@v0.3.5/encoding#Nop)
And NewDecoder did not return valid utf8 string

@gopherbot gopherbot added this to the Unreleased milestone May 24, 2021
@mknyszek
Copy link
Contributor

Since x/net/html doesn't have a direct owner, bubbling this up to the x/net owners.

CC @ianlancetaylor @bradfitz via https://dev.golang.org/owners for x/net.

@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label May 24, 2021
@mknyszek mknyszek changed the title x/net/html/charset: DetermineEncoding returnes encoding.Nop with utf-8 name x/net/html/charset: DetermineEncoding returns encoding.Nop and "utf-8" as name May 24, 2021
@Rikkuru
Copy link
Author

Rikkuru commented May 24, 2021

UPD: I noticed encoding.Nop is also returned in prescan for unf-16* encoding names. It will lead to the same problem if len (content) > 1024
https://github.com/golang/net/blob/master/html/charset/charset.go#L206

if strings.HasPrefix(name, "utf-16") {
    name = "utf-8"
    e = encoding.Nop
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

3 participants