x/text/encoding/unicode: Add UTF-32 support #15876

PhilipBorgesen · 2016-05-29T12:48:18Z

I would like the x/text/encoding/unicode package to be complete with support for the UTF-32 encoding.

I am filing this issue in accordance with the contributing document, and have already begun writing an implementation of the encoding here: https://github.com/PhilipBorgesen/golang-text/blob/master/encoding/unicode/unicode.go.

My current approach to the API is:

Port the UTF-16 API and make a UTF32 function, reusing the BOMPolicy and Endianness types. No RFC seems to exist for UTF-32, so make the UTF-32 implementation behave similar to the UTF-16 implementation for the sake of consistency.
Add the UTF-32, UTF-32BE and UTF-32LE encodings to the 'All' slice.

Currently I haven't updated the BOMOverride function as its comments say that it shouldn't support UTF-32 by default.

What do you think?

mpvl · 2016-05-31T09:24:13Z

The use UTF-32 has some issues, including security-related. It is generally advised not to support UTF-32 as an interchange format. I'm not aware of UTF-32 being used for attacks (like BOCU-1), but it would seem to me it is a tricky one.

Is there any particular use case you need this for?

PhilipBorgesen · 2016-05-31T17:25:21Z

I am writing an application that accept and process user uploaded text files. I know use of the UTF-32 encoding is rare but would like my processing to be robust against the odd UTF-32 encoded file which could come by.
I could easily implement the encoding in a private module but thought I wasn't the only one missing this functionality, for which reason I am offering to add it.

I was not aware that UTF-32 had security-related issues. Can you point me in a direction where I can find more information?

mpvl · 2016-06-01T14:12:54Z

Security issues are minor, but it is a combination of easy to get things wrong and then nobody caring to fix bugs as pretty much nobody is using UTF-32 and nobody cares.

One has to be particularly careful with encodings that are not compatible with ASCII in general.

Either way. If you have a real need for it, I'm not adverse to adding or supporting it, albeit a caveat:
The default should still be to not support UTF-32. So the BOM detection, using unicode.All, etc. etc. should not include UTF-32. It is therefore better to put the functionality in package unicode/utf32 with a big disclaimer. If one really must one could add a BOM detection algorithm variant that supports UTF-32 in this package.

PhilipBorgesen · 2016-06-02T06:16:56Z

I can see pros and cons of adding UTF-32 support to a new utf32 package versus the existent unicode package.

Not changing the existing BOM detection algorithm makes sense to me, since it seems to be intended HTML5 decoding as per the W3C recommendation. Adding UTF-32 support to it would not be possible in a backwards compatible manner.

It would, however, seem logical to me that unicode.All included all supported Unicode encodings. On the other hand I am not sure what the use case for unicode.All is and thus don't know the implications of adding UTF-32 to it for consistency. If you know, would you mind elaborating a little on it, just so I can follow?

The points you make about ASCII compatibility and that it is easy to get things wrong apply to UTF-16 as well. Putting UTF-32 in its own package for me thus solely appears to be to ensure that lack of maintenance won't affect the more widely used Unicode encodings.

Alas, whether it should be in one package or the other depends on the benefits of having the UTF-32 encodings in unicode.All.

Independent of where UTF-32 is added, what disclaimer do you have in mind? It is not yet entirely clear to me.

Thanks! :-)

mpvl · 2016-06-02T09:33:39Z

regarding unicode.All, I agree it would be logical to include it, but it is not desirable.

The All vars that are defined are for easy of use to users who want to select a default set of encodings to support without too much effort, more or less. It is often recommended, such as by WhatWG and W3C, to not support UTF-32. Including UTF-32 in unicode.All would make it too easy to pull it in. I would pick pragmatic over logical here. The consequence of not including it in unicode.All is that people will often "forget" to pull in UTF-32, which I consider a good thing. It forces people to be explicit to support UTF-32. So it is not a matter of maintenance, but setting up the right defaults.

You're right about UTF-16 being problematic in an analogous way. But it is such a common format that it is more likely for bugs to be ironed out soon.

As for the disclaimer I would mention that the use of UTF-32 is extremely rare, a very inefficient encoding, and that its support is often discouraged and then quote/mention the recommendations of WhatWG/W3C, for example.

Another issue with putting in UTF-32 is that (as for now) the code will be unconditionally pulled in. I believe this is going to be fixed, but I don't really want to count on this.

Finally, I consider UTF-32 to be more or less on the same level as BOCU-1 an SCSU. It would be too much to put all of these in the unicode package, give the wrong impression of their importance, and will make it harder to point potential caveats with their use.

PhilipBorgesen · 2016-06-02T13:53:20Z

Thanks for elaborating on the issues related to supporting UTF-32 and why it is desirable to isolate it in its own package. The rationale is now clear to me and I agree it is the right solution.

I will refactor my current work to match your recommendations.

mpvl · 2016-06-02T19:42:54Z

thanks! feel free to send code reviews my way.

PhilipBorgesen · 2016-06-03T22:13:09Z

I have now refactored my work, written tests and added the desired disclaimer.
My proposed changes are available for inspection here: golang/text@master...PhilipBorgesen:master

Let me know what you think — unless you'd rather have me invite to review straight through Gerrit?

minux · 2016-06-04T20:08:28Z

@PhilipBorgesen, I'm wondering how did you push to golang/text without going through Gerrit?

adg · 2016-06-06T00:27:30Z

@minux that link compares @PhilipBorgesen's fork with golang/text.

mpvl · 2016-06-06T14:17:23Z

@PhilipBorgesen: I prefer to review through Gerrit.

As a first point of review, though, I would put the utf32-specific tests in the utf32 package. I know this was often done differently in encoding, but turned out to be inconvenient, IMO.

PhilipBorgesen · 2016-06-06T21:15:56Z

I have moved the utf32-specific tests (i.e. TestUTF32) to the new utf32 package and submitted my changes for review. See https://go-review.googlesource.com/23806.

gopherbot · 2016-06-06T22:00:39Z

CL https://golang.org/cl/23806 mentions this issue.

ianlancetaylor added this to the Unreleased milestone May 30, 2016

ianlancetaylor assigned mpvl May 30, 2016

gopherbot closed this as completed in golang/text@cd00d65 Jun 8, 2016

golang locked and limited conversation to collaborators Jun 8, 2017

gopherbot added the FrozenDueToAge label Jun 8, 2017

rsc unassigned mpvl Jun 23, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

x/text/encoding/unicode: Add UTF-32 support #15876

x/text/encoding/unicode: Add UTF-32 support #15876

PhilipBorgesen commented May 29, 2016 •

edited

mpvl commented May 31, 2016

PhilipBorgesen commented May 31, 2016

mpvl commented Jun 1, 2016

PhilipBorgesen commented Jun 2, 2016

mpvl commented Jun 2, 2016

PhilipBorgesen commented Jun 2, 2016

mpvl commented Jun 2, 2016

PhilipBorgesen commented Jun 3, 2016 •

edited

minux commented Jun 4, 2016 via email

adg commented Jun 6, 2016

mpvl commented Jun 6, 2016

PhilipBorgesen commented Jun 6, 2016

gopherbot commented Jun 6, 2016

x/text/encoding/unicode: Add UTF-32 support #15876

x/text/encoding/unicode: Add UTF-32 support #15876

Comments

PhilipBorgesen commented May 29, 2016 • edited

mpvl commented May 31, 2016

PhilipBorgesen commented May 31, 2016

mpvl commented Jun 1, 2016

PhilipBorgesen commented Jun 2, 2016

mpvl commented Jun 2, 2016

PhilipBorgesen commented Jun 2, 2016

mpvl commented Jun 2, 2016

PhilipBorgesen commented Jun 3, 2016 • edited

minux commented Jun 4, 2016 via email

adg commented Jun 6, 2016

mpvl commented Jun 6, 2016

PhilipBorgesen commented Jun 6, 2016

gopherbot commented Jun 6, 2016

PhilipBorgesen commented May 29, 2016 •

edited

PhilipBorgesen commented Jun 3, 2016 •

edited