encoding/xml: too restrictive in char encoding #3794
XML requires a known encoding. Can't you use xml.Decoder.CharsetReader?

    // CharsetReader, if non-nil, defines a function to generate
    // charset-conversion readers, converting from the provided
    // non-UTF-8 charset into UTF-8. If CharsetReader is nil or
    // returns an error, parsing stops with an error. One of
    // the CharsetReader's result values must be non-nil.
    CharsetReader func(charset string, input io.Reader) (io.Reader, error)
Comment 3 by borman@google.com:
What XML requires and what the world does are not always the same thing. The problem with xml.Decoder.CharsetReader is that it causes you to modify the data. You no longer get the actual value that was in the tag; you get something else. Ideally, XML like this would never be produced. The world is not ideal.
I'm pretty sure this is Status: Unfortunate, but I will leave it open for a little longer to make sure. A reasonable workaround would be to write a Reader that converts Latin-1 input to UTF-8, and then when you pull out individual strings from the UTF-8, convert them from UTF-8 back to Latin-1. That will preserve the original input without having to add clumsy workarounds to the XML code just because you've found something that generates invalid XML.

Labels changed: added priority-later, removed priority-triage.
Status changed to Thinking.
The XML parser assumes Unicode at a very fundamental level. If there were never any unquoting to do, then maybe we could treat the input as uninterpreted 8-bit ASCII++ (let's call it Latin-1). However, when we see &#xFF; (a reference to code point 0xFF) we need to know how to turn that into a byte sequence in the encoding of the surrounding document. If we accept Latin-1 in the surrounding document but assume UTF-8 here (as we really must), you'll end up with hybrid output that is part Latin-1 and part UTF-8. If you want an accurate conversion, you really do need to do the Latin-1 to UTF-8 and back in the calling code.

Status changed to Unfortunate.
This issue was closed.
by borman@google.com: