encoding/xml: Support for alternate encodings #8937

Closed
gopherbot opened this issue Oct 15, 2014 · 4 comments
Comments

@gopherbot

by pico303:

In Go 1.3.3, the XML parser is locked into UTF-8 encoding.  In
encoding/xml/xml.go (around line 576), there are the lines:

    enc := procInstEncoding(string(data))
    if enc != "" && enc != "utf-8" && enc != "UTF-8" {

For documents with:

    <?xml version="1.0" encoding="ISO-8859-1"?>

you get this error message:

    Invalid body content: xml: encoding "ISO-8859-1" declared but Decoder.CharsetReader is nil
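The failure described above can be reproduced with a few lines. This is a minimal sketch, not code from the report: the `Doc` type, the sample document, and the `parseWithUnmarshal` helper are hypothetical names chosen for illustration. It only assumes that `xml.Unmarshal` gives the caller no place to set a `CharsetReader`.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Doc is a hypothetical target type used only for this illustration.
type Doc struct {
	Value string `xml:"value"`
}

// parseWithUnmarshal decodes doc with xml.Unmarshal, which builds its own
// Decoder internally and therefore leaves CharsetReader nil.
func parseWithUnmarshal(doc []byte) (Doc, error) {
	var d Doc
	err := xml.Unmarshal(doc, &d)
	return d, err
}

func main() {
	// The declared encoding is what triggers the error; the body is plain ASCII.
	doc := []byte(`<?xml version="1.0" encoding="ISO-8859-1"?><doc><value>hello</value></doc>`)
	_, err := parseWithUnmarshal(doc)
	fmt.Println(err)
}
```

The same document with `encoding="UTF-8"` (or no declaration at all) should decode without an error.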

You can override the reader to support alternative encodings, but this means pre-parsing
the XML []byte yourself to find the proper encoding, setting up the reader, then parsing the XML.

Could the package be adapted somehow so you could provide alternate readers ahead of
time, based on the encoding value?  Something like this (pseudocode):

    func init() {
        // Proposed API; xml.AddCharsetReader does not exist in encoding/xml.
        xml.AddCharsetReader("iso-8859-1", ISO8859Reader)
    }

    func Parse(doc []byte) (SomeStruct, error) {
        var myobj SomeStruct
        if err := xml.Unmarshal(doc, &myobj); err != nil {
            return SomeStruct{}, err
        }
        return myobj, nil
    }

@ianlancetaylor
Contributor

Comment 1:

Labels changed: added repo-main, release-none.

@bradfitz
Contributor

Comment 2:

This hook already exists.
Use xml.Decoder, not xml.Unmarshal, and set Decoder.CharsetReader, as the error message
says.

Labels changed: added performance.

Status changed to WorkingAsIntended.

@gopherbot
Author

Comment 3 by pico303:

Except that to do that, you have to know the encoding ahead of time. Our servers get
messages in either UTF-8 or ISO-8859-1. So we basically have to parse the incoming
stream for the encoding parameter, load the correct reader, and unmarshal.  Feels clunky.

@bradfitz
Contributor

Comment 4:

Look at the docs:
        // CharsetReader, if non-nil, defines a function to generate
        // charset-conversion readers, converting from the provided
        // non-UTF-8 charset into UTF-8. If CharsetReader is nil or
        // returns an error, parsing stops with an error. One of the
        // CharsetReader's result values must be non-nil.
        CharsetReader func(charset string, input io.Reader) (io.Reader, error)
Your hook gets passed in the charset. You don't need to parse it yourself.
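
For the mixed UTF-8 / ISO-8859-1 case from Comment 3, the approach could look roughly like the sketch below. Everything except `xml.NewDecoder`, `Decoder.CharsetReader`, and `Decoder.Decode` is an assumption made for illustration: the `Message` type, the `Parse` and `charsetReader` functions, and the hand-rolled `latin1ToUTF8` converter are hypothetical. The converter buffers the whole remaining input for simplicity; a production server would more likely stream, or use an existing charset package.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// Message is a hypothetical payload type used only for this illustration.
type Message struct {
	Body string `xml:"body"`
}

// latin1ToUTF8 re-encodes ISO-8859-1 input as UTF-8. Every ISO-8859-1 byte
// has the same numeric value as its Unicode code point, so a byte-to-rune
// conversion is enough. It reads the rest of the input into memory.
func latin1ToUTF8(r io.Reader) (io.Reader, error) {
	raw, err := io.ReadAll(r)
	if err != nil {
		return nil, err
	}
	runes := make([]rune, len(raw))
	for i, b := range raw {
		runes[i] = rune(b)
	}
	return strings.NewReader(string(runes)), nil
}

// charsetReader matches the Decoder.CharsetReader signature. The decoder
// passes in the declared charset, so no pre-parsing is needed here.
func charsetReader(label string, input io.Reader) (io.Reader, error) {
	switch strings.ToLower(strings.TrimSpace(label)) {
	case "iso-8859-1", "latin1":
		return latin1ToUTF8(input)
	}
	return nil, fmt.Errorf("unsupported charset %q", label)
}

// Parse decodes a document declared as either UTF-8 or ISO-8859-1.
func Parse(doc []byte) (Message, error) {
	var msg Message
	dec := xml.NewDecoder(bytes.NewReader(doc))
	dec.CharsetReader = charsetReader
	if err := dec.Decode(&msg); err != nil {
		return Message{}, err
	}
	return msg, nil
}

func main() {
	latin1 := []byte("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><msg><body>caf\xe9</body></msg>")
	utf8Doc := []byte(`<?xml version="1.0" encoding="UTF-8"?><msg><body>café</body></msg>`)
	for _, doc := range [][]byte{latin1, utf8Doc} {
		msg, err := Parse(doc)
		fmt.Println(msg.Body, err)
	}
}
```

The decoder only calls the hook when the declared encoding is not UTF-8, so UTF-8 documents take the normal path and the same `Parse` function serves both kinds of message.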

@golang golang locked and limited conversation to collaborators Jun 25, 2016
This issue was closed.