Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/text/encoding/unicode: Add UTF-32 support #15876

Closed
PhilipBorgesen opened this issue May 29, 2016 · 13 comments
Closed

x/text/encoding/unicode: Add UTF-32 support #15876

PhilipBorgesen opened this issue May 29, 2016 · 13 comments

Comments

@PhilipBorgesen
Copy link
Contributor

PhilipBorgesen commented May 29, 2016

I would like the x/text/encoding/unicode package to be complete with support for the UTF-32 encoding.

I am filing this issue in accordance with the contributing document, and have already begun writing an implementation of the encoding here: https://github.com/PhilipBorgesen/golang-text/blob/master/encoding/unicode/unicode.go.

My current approach to the API is:

  • Port the UTF-16 API and make a UTF32 function, reusing the BOMPolicy and Endianness types. No RFC seems to exist for UTF-32, so make the UTF-32 implementation behave similar to the UTF-16 implementation for the sake of consistency.
  • Add the UTF-32, UTF-32BE and UTF-32LE encodings to the 'All' slice.

Currently I haven't updated the BOMOverride function as its comments say that it shouldn't support UTF-32 by default.

What do you think?

@ianlancetaylor ianlancetaylor added this to the Unreleased milestone May 30, 2016
@mpvl
Copy link
Contributor

mpvl commented May 31, 2016

The use UTF-32 has some issues, including security-related. It is generally advised not to support UTF-32 as an interchange format. I'm not aware of UTF-32 being used for attacks (like BOCU-1), but it would seem to me it is a tricky one.

Is there any particular use case you need this for?

@PhilipBorgesen
Copy link
Contributor Author

I am writing an application that accept and process user uploaded text files. I know use of the UTF-32 encoding is rare but would like my processing to be robust against the odd UTF-32 encoded file which could come by.
I could easily implement the encoding in a private module but thought I wasn't the only one missing this functionality, for which reason I am offering to add it.

I was not aware that UTF-32 had security-related issues. Can you point me in a direction where I can find more information?

@mpvl
Copy link
Contributor

mpvl commented Jun 1, 2016

Security issues are minor, but it is a combination of easy to get things wrong and then nobody caring to fix bugs as pretty much nobody is using UTF-32 and nobody cares.

One has to be particularly careful with encodings that are not compatible with ASCII in general.

Either way. If you have a real need for it, I'm not adverse to adding or supporting it, albeit a caveat:
The default should still be to not support UTF-32. So the BOM detection, using unicode.All, etc. etc. should not include UTF-32. It is therefore better to put the functionality in package unicode/utf32 with a big disclaimer. If one really must one could add a BOM detection algorithm variant that supports UTF-32 in this package.

@PhilipBorgesen
Copy link
Contributor Author

I can see pros and cons of adding UTF-32 support to a new utf32 package versus the existent unicode package.

Not changing the existing BOM detection algorithm makes sense to me, since it seems to be intended HTML5 decoding as per the W3C recommendation. Adding UTF-32 support to it would not be possible in a backwards compatible manner.

It would, however, seem logical to me that unicode.All included all supported Unicode encodings. On the other hand I am not sure what the use case for unicode.All is and thus don't know the implications of adding UTF-32 to it for consistency. If you know, would you mind elaborating a little on it, just so I can follow?

The points you make about ASCII compatibility and that it is easy to get things wrong apply to UTF-16 as well. Putting UTF-32 in its own package for me thus solely appears to be to ensure that lack of maintenance won't affect the more widely used Unicode encodings.

Alas, whether it should be in one package or the other depends on the benefits of having the UTF-32 encodings in unicode.All.

Independent of where UTF-32 is added, what disclaimer do you have in mind? It is not yet entirely clear to me.

Thanks! :-)

@mpvl
Copy link
Contributor

mpvl commented Jun 2, 2016

regarding unicode.All, I agree it would be logical to include it, but it is not desirable.

The All vars that are defined are for easy of use to users who want to select a default set of encodings to support without too much effort, more or less. It is often recommended, such as by WhatWG and W3C, to not support UTF-32. Including UTF-32 in unicode.All would make it too easy to pull it in. I would pick pragmatic over logical here. The consequence of not including it in unicode.All is that people will often "forget" to pull in UTF-32, which I consider a good thing. It forces people to be explicit to support UTF-32. So it is not a matter of maintenance, but setting up the right defaults.

You're right about UTF-16 being problematic in an analogous way. But it is such a common format that it is more likely for bugs to be ironed out soon.

As for the disclaimer I would mention that the use of UTF-32 is extremely rare, a very inefficient encoding, and that its support is often discouraged and then quote/mention the recommendations of WhatWG/W3C, for example.

Another issue with putting in UTF-32 is that (as for now) the code will be unconditionally pulled in. I believe this is going to be fixed, but I don't really want to count on this.

Finally, I consider UTF-32 to be more or less on the same level as BOCU-1 an SCSU. It would be too much to put all of these in the unicode package, give the wrong impression of their importance, and will make it harder to point potential caveats with their use.

@PhilipBorgesen
Copy link
Contributor Author

Thanks for elaborating on the issues related to supporting UTF-32 and why it is desirable to isolate it in its own package. The rationale is now clear to me and I agree it is the right solution.

I will refactor my current work to match your recommendations.

@mpvl
Copy link
Contributor

mpvl commented Jun 2, 2016

thanks! feel free to send code reviews my way.

@PhilipBorgesen
Copy link
Contributor Author

PhilipBorgesen commented Jun 3, 2016

I have now refactored my work, written tests and added the desired disclaimer.
My proposed changes are available for inspection here: golang/text@master...PhilipBorgesen:master

Let me know what you think — unless you'd rather have me invite to review straight through Gerrit?

@minux
Copy link
Member

minux commented Jun 4, 2016 via email

@adg
Copy link
Contributor

adg commented Jun 6, 2016

@minux that link compares @PhilipBorgesen's fork with golang/text.

@mpvl
Copy link
Contributor

mpvl commented Jun 6, 2016

@PhilipBorgesen: I prefer to review through Gerrit.

As a first point of review, though, I would put the utf32-specific tests in the utf32 package. I know this was often done differently in encoding, but turned out to be inconvenient, IMO.

@PhilipBorgesen
Copy link
Contributor Author

I have moved the utf32-specific tests (i.e. TestUTF32) to the new utf32 package and submitted my changes for review. See https://go-review.googlesource.com/23806.

@gopherbot
Copy link

CL https://golang.org/cl/23806 mentions this issue.

@golang golang locked and limited conversation to collaborators Jun 8, 2017
@rsc rsc unassigned mpvl Jun 23, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

6 participants