New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
x/text/encoding/unicode: Add UTF-32 support #15876
Comments
The use UTF-32 has some issues, including security-related. It is generally advised not to support UTF-32 as an interchange format. I'm not aware of UTF-32 being used for attacks (like BOCU-1), but it would seem to me it is a tricky one. Is there any particular use case you need this for? |
I am writing an application that accept and process user uploaded text files. I know use of the UTF-32 encoding is rare but would like my processing to be robust against the odd UTF-32 encoded file which could come by. I was not aware that UTF-32 had security-related issues. Can you point me in a direction where I can find more information? |
Security issues are minor, but it is a combination of easy to get things wrong and then nobody caring to fix bugs as pretty much nobody is using UTF-32 and nobody cares. One has to be particularly careful with encodings that are not compatible with ASCII in general. Either way. If you have a real need for it, I'm not adverse to adding or supporting it, albeit a caveat: |
I can see pros and cons of adding UTF-32 support to a new utf32 package versus the existent unicode package. Not changing the existing BOM detection algorithm makes sense to me, since it seems to be intended HTML5 decoding as per the W3C recommendation. Adding UTF-32 support to it would not be possible in a backwards compatible manner. It would, however, seem logical to me that unicode.All included all supported Unicode encodings. On the other hand I am not sure what the use case for unicode.All is and thus don't know the implications of adding UTF-32 to it for consistency. If you know, would you mind elaborating a little on it, just so I can follow? The points you make about ASCII compatibility and that it is easy to get things wrong apply to UTF-16 as well. Putting UTF-32 in its own package for me thus solely appears to be to ensure that lack of maintenance won't affect the more widely used Unicode encodings. Alas, whether it should be in one package or the other depends on the benefits of having the UTF-32 encodings in unicode.All. Independent of where UTF-32 is added, what disclaimer do you have in mind? It is not yet entirely clear to me. Thanks! :-) |
regarding unicode.All, I agree it would be logical to include it, but it is not desirable. The All vars that are defined are for easy of use to users who want to select a default set of encodings to support without too much effort, more or less. It is often recommended, such as by WhatWG and W3C, to not support UTF-32. Including UTF-32 in unicode.All would make it too easy to pull it in. I would pick pragmatic over logical here. The consequence of not including it in unicode.All is that people will often "forget" to pull in UTF-32, which I consider a good thing. It forces people to be explicit to support UTF-32. So it is not a matter of maintenance, but setting up the right defaults. You're right about UTF-16 being problematic in an analogous way. But it is such a common format that it is more likely for bugs to be ironed out soon. As for the disclaimer I would mention that the use of UTF-32 is extremely rare, a very inefficient encoding, and that its support is often discouraged and then quote/mention the recommendations of WhatWG/W3C, for example. Another issue with putting in UTF-32 is that (as for now) the code will be unconditionally pulled in. I believe this is going to be fixed, but I don't really want to count on this. Finally, I consider UTF-32 to be more or less on the same level as BOCU-1 an SCSU. It would be too much to put all of these in the unicode package, give the wrong impression of their importance, and will make it harder to point potential caveats with their use. |
Thanks for elaborating on the issues related to supporting UTF-32 and why it is desirable to isolate it in its own package. The rationale is now clear to me and I agree it is the right solution. I will refactor my current work to match your recommendations. |
thanks! feel free to send code reviews my way. |
I have now refactored my work, written tests and added the desired disclaimer. Let me know what you think — unless you'd rather have me invite to review straight through Gerrit? |
@PhilipBorgesen, I'm wondering how did you push to golang/text
without going through Gerrit?
|
@minux that link compares @PhilipBorgesen's fork with golang/text. |
@PhilipBorgesen: I prefer to review through Gerrit. As a first point of review, though, I would put the utf32-specific tests in the utf32 package. I know this was often done differently in encoding, but turned out to be inconvenient, IMO. |
I have moved the utf32-specific tests (i.e. TestUTF32) to the new utf32 package and submitted my changes for review. See https://go-review.googlesource.com/23806. |
CL https://golang.org/cl/23806 mentions this issue. |
I would like the x/text/encoding/unicode package to be complete with support for the UTF-32 encoding.
I am filing this issue in accordance with the contributing document, and have already begun writing an implementation of the encoding here: https://github.com/PhilipBorgesen/golang-text/blob/master/encoding/unicode/unicode.go.
My current approach to the API is:
Currently I haven't updated the BOMOverride function as its comments say that it shouldn't support UTF-32 by default.
What do you think?
The text was updated successfully, but these errors were encountered: