x/net/html/charset: BOM cannot be processed correctly #30736

Open
wzshiming opened this issue Mar 11, 2019 · 2 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
@wzshiming

What version of Go are you using (go version)?

$ go version go1.12 darwin/amd64

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/zsm/Library/Caches/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/zsm/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/Cellar/go/1.12/libexec"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.12/libexec/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/6v/7stmg2756wlfk9c_qnv1hnbm0000gn/T/go-build194117381=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

package main

import (
	"bytes"
	"fmt"
	"io/ioutil"

	"golang.org/x/net/html/charset"
	"golang.org/x/text/encoding"
	"golang.org/x/text/transform"
)

func main() {
	raw := []byte("\xEF\xBB\xBFhello")
	fmt.Println(raw)
	if e, _, _ := charset.DetermineEncoding(raw, ""); e != encoding.Nop {
		tmp := transform.NewReader(bytes.NewBuffer(raw), e.NewDecoder())
		dist, _ := ioutil.ReadAll(tmp)
		fmt.Println(dist)
	}
}

What did you expect to see?

Either the BOM is removed from the decoded output, or DetermineEncoding returns encoding.Nop.

What did you see instead?

[239 187 191 104 101 108 108 111]
[239 187 191 104 101 108 108 111]
gopherbot added this to the Unreleased milestone Mar 11, 2019
@bcmills
Contributor

bcmills commented Apr 12, 2019

CC @mpvl

bcmills added the NeedsInvestigation label Apr 12, 2019
@namusyaka
Member

@wzshiming Since the given bytes start with 0xEF, 0xBB, 0xBF, detecting the encoding as UTF-8 looks correct as per: https://encoding.spec.whatwg.org/#bom-sniff

However, the current implementation does not return a BOM-aware encoding. In this case you get the plain UTF8 encoding, whose decoder leaves the BOM in the output.
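For illustration, the BOM sniff step from the linked spec can be sketched with the standard library alone. This is a rough sketch, not the code used by x/net/html/charset; the names sniffBOM and bomTable are hypothetical and exist only for this example.

```go
package main

import (
	"bytes"
	"fmt"
)

// bomTable lists the byte order marks checked by the WHATWG "BOM sniff"
// step, paired with the encoding label each one implies.
// (Hypothetical name, introduced for this sketch.)
var bomTable = []struct {
	bom  []byte
	name string
}{
	{[]byte{0xEF, 0xBB, 0xBF}, "utf-8"},
	{[]byte{0xFE, 0xFF}, "utf-16be"},
	{[]byte{0xFF, 0xFE}, "utf-16le"},
}

// sniffBOM (hypothetical helper) returns the encoding label implied by a
// leading BOM together with the BOM length in bytes, or ("", 0) when the
// input does not start with a known BOM.
func sniffBOM(b []byte) (name string, bomLen int) {
	for _, e := range bomTable {
		if bytes.HasPrefix(b, e.bom) {
			return e.name, len(e.bom)
		}
	}
	return "", 0
}

func main() {
	name, n := sniffBOM([]byte("\xEF\xBB\xBFhello"))
	fmt.Println(name, n) // prints "utf-8 3"
}
```

Note that sniffing only identifies the encoding; it says nothing about whether the decoder later drops the BOM bytes, which is the gap this issue describes.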

On the other hand, to remove the BOM programmatically, you'll need to do something like:

package main

import (
	"bytes"
	"fmt"
	"io/ioutil"

	"golang.org/x/net/html/charset"
	"golang.org/x/text/encoding"
	"golang.org/x/text/encoding/unicode"
	"golang.org/x/text/transform"
)

func main() {
	raw := []byte("\xEF\xBB\xBFhello")
	fmt.Println(raw)
	if e, _, _ := charset.DetermineEncoding(raw, ""); e != encoding.Nop {
		tmp := transform.NewReader(bytes.NewBuffer(raw), unicode.UTF8BOM.NewDecoder())
		dist, _ := ioutil.ReadAll(tmp)
		fmt.Println(dist)
	}
}

@mpvl Any thoughts on this? unicode.UTF8BOM does not currently seem to be linked from maps such as var encodings in x/text/encoding/htmlindex. Is there any chance of making this package return encodings like unicode.UTF8BOM (and/or adding identifiers like utf8-bom)?
