encoding/xml: too restrictive in char encoding #3794

Closed
gopherbot opened this issue Jul 3, 2012 · 4 comments

@gopherbot

by borman@google.com:

go version go1.0.2

Around line 928 of xml.go there is a block of code starting with:

        // Inspect each rune for being a disallowed character.

that was introduced by CL 2967041 in response to issue #1259. This makes it impossible
to use the XML decoder for input whose character data is not valid UTF-8. This strict input
check does not conform to the networking tenet of "be liberal in what you accept
and conservative in what you send". In particular,
<string>non-utf8</string> should be possible. One solution is to make it
possible to disable this check, akin to the Strict flag that is already present.
Alternatively, a utf8.ValidReader object could be created that returns a read error when
an invalid code point is received:

    d := xml.NewDecoder(utf8.ValidReader(r))

ValidReader would return an error when a read encounters an invalid UTF-8 sequence,
lifting the responsibility from xml and enabling it to be used in other locations as
well.
@bradfitz
Contributor

Comment 1:

XML requires a known encoding.
Can't you use xml.Decoder.CharsetReader?
    // CharsetReader, if non-nil, defines a function to generate
    // charset-conversion readers, converting from the provided
    // non-UTF-8 charset into UTF-8. If CharsetReader is nil or
    // returns an error, parsing stops with an error. One of the
    // CharsetReader's result values must be non-nil.
    CharsetReader func(charset string, input io.Reader) (io.Reader, error)

@gopherbot
Author

Comment 3 by borman@google.com:

What XML requires and what the world does are not always the same thing.
The problem with xml.Decoder.CharsetReader is that it causes you to modify the data. 
You no longer get the actual value that was in the tag, you get something else.  Ideally
this would never be produced.  The world is not ideal.

@rsc
Contributor

rsc commented Jul 29, 2012

Comment 4:

I'm pretty sure this is Status: Unfortunate, but I will leave it open for a little
longer to make sure.
A reasonable workaround would be to write a Reader that converts Latin-1 input to UTF-8,
and then when you pull out individual strings from the UTF-8, convert them from UTF-8
back to Latin-1. That will preserve the original input without having to add clumsy
workarounds to the XML code just because you've found something that generates invalid
XML.
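
A minimal sketch of that round trip, assuming the input really is Latin-1 and has no XML declaration: latin1ToUTF8 and utf8ToLatin1 are hypothetical helper names, and the whole document is converted in memory for brevity.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// latin1ToUTF8 maps each Latin-1 byte to the code point with the same value.
func latin1ToUTF8(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

// utf8ToLatin1 reverses latin1ToUTF8. It fails on any code point above 0xFF,
// which cannot be represented in Latin-1.
func utf8ToLatin1(s string) ([]byte, error) {
	out := make([]byte, 0, len(s))
	for _, r := range s {
		if r > 0xFF {
			return nil, fmt.Errorf("code point %U does not fit in Latin-1", r)
		}
		out = append(out, byte(r))
	}
	return out, nil
}

func main() {
	// Latin-1 input: the 0xE9 byte makes it invalid as UTF-8.
	raw := []byte("<string>caf\xe9</string>")

	var s string
	if err := xml.Unmarshal([]byte(latin1ToUTF8(raw)), &s); err != nil {
		fmt.Println("unmarshal:", err)
		return
	}

	// Convert the extracted value back to recover the original bytes.
	orig, err := utf8ToLatin1(s)
	fmt.Printf("% x %v\n", orig, err)
}
```

Because the conversion happens in the calling code, encoding/xml only ever sees valid UTF-8.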

Labels changed: added priority-later, removed priority-triage.

Status changed to Thinking.

@rsc
Contributor

rsc commented Sep 12, 2012

Comment 5:

The XML parser assumes Unicode at a very fundamental level. If there were never any
unquoting to do, then maybe we could treat the input as uninterpreted 8-bit ASCII++
(let's call it Latin-1). However, when we see &#xFF; (code point 0xFF) we need to know
how to turn that into a byte sequence in the encoding of the surrounding document. If we
accept Latin-1 in the surrounding document but assume UTF-8 here (as we really must),
you'll end up with hybrid output that is part Latin-1 and part UTF-8.
If you want an accurate conversion, you really do need to do the Latin-1 to UTF-8 and
back in the calling code.
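
To make the unquoting point concrete, this small sketch shows how encoding/xml expands a numeric character reference for code point 0xFF. decodeCharData is a hypothetical helper name used only here.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// decodeCharData returns the raw bytes of the character data in a document
// whose root element contains only text.
func decodeCharData(doc string) ([]byte, error) {
	var s string
	err := xml.Unmarshal([]byte(doc), &s)
	return []byte(s), err
}

func main() {
	// The reference names code point 0xFF. The decoder emits it as UTF-8,
	// so the result is two bytes rather than the single Latin-1 byte 0xFF.
	b, err := decodeCharData("<string>&#xFF;</string>")
	fmt.Printf("% x %v\n", b, err)
}
```

If raw Latin-1 bytes were passed through untouched next to this expansion, a single value could contain both encodings at once, which is the hybrid output described above.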

Status changed to Unfortunate.

@golang golang locked and limited conversation to collaborators Jun 24, 2016
This issue was closed.