encoding/xml: unable to handle utf-16-encoded file without manual manipulation of source bytes #38335

Open
mccolljr opened this issue Apr 9, 2020 · 2 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
mccolljr commented Apr 9, 2020

What version of Go are you using (go version)?

$ go version
go version go1.14 darwin/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/mccolljr/Library/Caches/go-build"
GOENV="/Users/mccolljr/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/mccolljr/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/Users/mccolljr/go/src/github.com/golang/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/Users/mccolljr/go/src/github.com/golang/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/mccolljr/go/src/github.com/orthly/3oxz/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/4g/0y_btbcj46v3x478swzt64140000gn/T/go-build229162746=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I have a file like the following, encoded in utf-16 (BOM is little endian) on disk:

<?xml version="1.0" encoding="utf-16"?>
<SomeValidXML></SomeValidXML>

When I read the file from disk, all of the bytes (including the line containing <?xml version="1.0" encoding="utf-16"?>) are encoded in utf-16.
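To make the byte layout concrete, here is a minimal sketch that builds the UTF-16LE form of the declaration line in memory. The helper name encodeUTF16LE is made up for this illustration and is not part of any library. It shows that the stream starts with the byte order mark FF FE and interleaves NUL bytes, so it cannot pass as UTF-8.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// encodeUTF16LE (hypothetical helper) returns s as UTF-16LE bytes,
// prefixed with the little-endian byte order mark FF FE.
func encodeUTF16LE(s string) []byte {
	out := []byte{0xFF, 0xFE}
	for _, u := range utf16.Encode([]rune(s)) {
		// Low byte first, then high byte: little-endian order.
		out = append(out, byte(u), byte(u>>8))
	}
	return out
}

func main() {
	data := encodeUTF16LE(`<?xml version="1.0" encoding="utf-16"?>`)
	fmt.Printf("% x\n", data[:8])  // ff fe 3c 00 3f 00 78 00
	fmt.Println(utf8.Valid(data)) // false
}
```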

I wanted to parse the full file, with no modification, using the encoding/xml package.

What did you expect to see?

I expected to be able to either (A) transform the file's bytes to utf-8 and pass that reader to xml.NewDecoder to successfully parse the utf-8 data as XML, or (B) pass the utf-16-encoded bytes to xml.NewDecoder and provide a CharsetReader to successfully parse the utf-16 data as XML.

What did you see instead?

There were a couple of error cases.

  1. When I pass the resultOfOsOpen directly to xml.NewDecoder, with or without setting CharsetReader to charset.NewReaderLabel: XML syntax error on line 1: invalid UTF-8
  2. When I pass the utf-8 reader returned by charset.NewReader(resultOfOsOpen, "text/xml") to xml.NewDecoder: xml: encoding "utf-16" declared but Decoder.CharsetReader is nil
  3. When I pass the utf-8 reader returned by charset.NewReader(resultOfOsOpen, "text/xml") to xml.NewDecoder, AND set CharsetReader to charset.NewReaderLabel: The (now utf-8-encoded) data is interpreted as utf-16 and the decoder reads the file as gibberish.

It seems to me that the encoding/xml package expects the line containing <?xml version="1.0" encoding="utf-16"?> to be in some encoding that resembles valid utf-8-encoded text in order to read the encoding line and properly parse the rest of the file, OR for the line to be removed if manual transformation of the input is done beforehand (like with utf-16, which cannot be read as valid utf-8 text).
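The first two error cases can be reproduced with the standard library alone. The sketch below is an approximation of the setup described above, not the original program: it builds the sample document in memory instead of reading it from disk, and the names root, encodeUTF16LE, and parse are invented for the example. Running it should print two errors that correspond to cases 1 and 2.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"unicode/utf16"
)

// doc is the sample document from the report, as a Go (UTF-8) string.
const doc = `<?xml version="1.0" encoding="utf-16"?>
<SomeValidXML></SomeValidXML>`

// root (hypothetical) matches the sample document's root element.
type root struct {
	XMLName xml.Name `xml:"SomeValidXML"`
}

// encodeUTF16LE (hypothetical helper) returns s as UTF-16LE with a BOM.
func encodeUTF16LE(s string) []byte {
	out := []byte{0xFF, 0xFE}
	for _, u := range utf16.Encode([]rune(s)) {
		out = append(out, byte(u), byte(u>>8))
	}
	return out
}

// parse decodes data with a default xml.Decoder, leaving CharsetReader nil.
func parse(data []byte) error {
	var v root
	return xml.NewDecoder(bytes.NewReader(data)).Decode(&v)
}

func main() {
	// Case 1: raw UTF-16 bytes go straight to the decoder.
	fmt.Println("case 1:", parse(encodeUTF16LE(doc)))
	// Case 2: the same document already in UTF-8, declaration untouched.
	fmt.Println("case 2:", parse([]byte(doc)))
}
```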

Am I missing something? Is there a way to do this without modifying the input bytes?

@andybons
Member

@rsc

@andybons andybons added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Apr 10, 2020
@andybons andybons added this to the Unplanned milestone Apr 10, 2020
@stevenh
Contributor

stevenh commented Jun 21, 2022

I know this is old, but does the following help, @mccolljr?

dec := xml.NewDecoder(reader)
dec.CharsetReader = func(charset string, r io.Reader) (io.Reader, error) {
        enc, err := ianaindex.IANA.Encoding(charset)
        if err != nil {
                return nil, fmt.Errorf("charset %s: %w", charset, err)
        }
        if enc == nil {
                // Assume it's compatible with (a subset of) UTF-8 encoding
                // Bug: https://github.com/golang/go/issues/19421
                return r, nil
        }
        return enc.NewDecoder().Reader(r), nil
}
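
For raw UTF-16 input, one possible workaround is to transcode the whole stream to UTF-8 first and then install a pass-through CharsetReader, so that the decoder does not try to decode the already-converted bytes a second time when it meets the utf-16 declaration. The following is only a sketch under those assumptions, using the standard library; utf16ToUTF8, parseUTF16, encodeUTF16LE, and root are names invented here. In real code, the UTF-16 decoders in golang.org/x/text/encoding/unicode would probably be a better fit than the hand-rolled conversion.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"errors"
	"fmt"
	"io"
	"unicode/utf16"
)

// root (hypothetical) matches the sample document's root element.
type root struct {
	XMLName xml.Name `xml:"SomeValidXML"`
}

// encodeUTF16LE (hypothetical helper) builds demo input: UTF-16LE with a BOM.
func encodeUTF16LE(s string) []byte {
	out := []byte{0xFF, 0xFE}
	for _, u := range utf16.Encode([]rune(s)) {
		out = append(out, byte(u), byte(u>>8))
	}
	return out
}

// utf16ToUTF8 transcodes BOM-prefixed UTF-16 to UTF-8 and drops the BOM.
func utf16ToUTF8(b []byte) ([]byte, error) {
	if len(b) < 2 || len(b)%2 != 0 {
		return nil, errors.New("input is not BOM-prefixed UTF-16")
	}
	var bigEndian bool
	switch {
	case b[0] == 0xFF && b[1] == 0xFE: // little endian
	case b[0] == 0xFE && b[1] == 0xFF:
		bigEndian = true
	default:
		return nil, errors.New("missing UTF-16 byte order mark")
	}
	units := make([]uint16, 0, len(b)/2-1)
	for i := 2; i < len(b); i += 2 {
		if bigEndian {
			units = append(units, uint16(b[i])<<8|uint16(b[i+1]))
		} else {
			units = append(units, uint16(b[i+1])<<8|uint16(b[i]))
		}
	}
	return []byte(string(utf16.Decode(units))), nil
}

// parseUTF16 transcodes first, then tells the decoder to use the bytes as-is.
func parseUTF16(raw []byte) (root, error) {
	var v root
	utf8Data, err := utf16ToUTF8(raw)
	if err != nil {
		return v, err
	}
	dec := xml.NewDecoder(bytes.NewReader(utf8Data))
	// The stream is already UTF-8, so the declared charset is ignored.
	dec.CharsetReader = func(_ string, r io.Reader) (io.Reader, error) {
		return r, nil
	}
	err = dec.Decode(&v)
	return v, err
}

func main() {
	raw := encodeUTF16LE("<?xml version=\"1.0\" encoding=\"utf-16\"?>\n<SomeValidXML></SomeValidXML>")
	v, err := parseUTF16(raw)
	fmt.Println(v.XMLName.Local, err)
}
```

The source bytes on disk stay untouched with this approach; only the in-memory stream handed to the decoder is converted.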
