x/net/idna: we are almost out of room for new mapped runes #49371
Labels
NeedsDecision
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
What did you do?
Not a bug… yet.
The x/net/idna package (aka x/text/internal/export/idna) stores information about each character in a 16-bit info structure. Of the 16 bits, 13 bits can be used for an index into the mappings table, which defines replacements of deviation and mapped characters.
The 13 bits necessarily mean that the index can only be in the range [0, 8191]. However, in the latest version of Unicode supported by Go (13.0.0), the mappings table already has 8188 bytes of data. Each new mapped rune adds at least two bytes to the table (one for a header, one for the actual mapping), so at most two more runes can be added in future versions of Unicode before the encoding scheme stops working!
What did you expect to see?
The encoding scheme should be future-proof for the foreseeable future.
What did you see instead?
The encoding scheme will stop working imminently, perhaps as soon as Unicode 14.0.0 (#48621)!
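The headroom arithmetic behind this claim can be checked with a short sketch. The numbers (13 index bits, 8188 bytes, two bytes per minimal entry) come from the description above; the helper names maxIndex and entriesLeft are hypothetical and are not part of x/net/idna.

```go
package main

import "fmt"

const (
	indexBits    = 13   // bits available for a mappings index (from the issue)
	mappingsLen  = 8188 // size of the mappings table for Unicode 13.0.0 (from the issue)
	minEntrySize = 2    // one header byte plus at least one mapping byte
)

// maxIndex returns the largest index representable in the given number of bits.
func maxIndex(bits uint) int {
	return 1<<bits - 1
}

// entriesLeft reports how many more entries of entrySize bytes can still
// start at an index that fits in the given number of bits.
// Hypothetical helper, for illustration only.
func entriesLeft(tableLen, entrySize int, bits uint) int {
	limit := maxIndex(bits)
	if tableLen > limit {
		return 0
	}
	return (limit-tableLen)/entrySize + 1
}

func main() {
	fmt.Println("largest index:", maxIndex(indexBits))
	fmt.Println("minimal entries left:", entriesLeft(mappingsLen, minEntrySize, indexBits))
}
```

With the figures above, new minimal entries could start at offsets 8188 and 8190; a third would have to start at 8192, which no longer fits in 13 bits.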
I propose two ways to solve this problem, each of which would suffice by itself.
We could adjust the encoding to give more room to the mapping table index. In particular, we can merge the xorBit (bit 2) into the indices, since there are not as many XOR table indices (xorData has 4862 bytes for Unicode 13). Here's one possible scheme:
This scheme is implemented in https://golang.org/cl/361496.
We could align each mapping table index to be a multiple of two, and stop explicitly storing the least significant bit. This allows the current layout of the info type to be kept. After this change, the size of mappings is 9012, which means that the allowed stored indices are in the range [0, 4505] – well within the range of 13 bits.
In terms of binary size impacts: this change causes the total size of tables to grow from 43370 bytes to 44314 bytes (944 bytes) – a modest increase on par with two Unicode version upgrades (Unicode 11.0.0 → 13.0.0 caused an increase of 904 bytes).
This scheme is implemented in https://golang.org/cl/361497.
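The alignment idea can be illustrated with a minimal sketch. It is not the code from the CL: appendAligned and offset are hypothetical helpers, and the entry format (a one-byte length header followed by the mapping bytes) is an assumption made only for this example.

```go
package main

import "fmt"

// appendAligned appends entry to table, first padding the table so that the
// entry starts at an even offset. It returns the grown table and the value
// that would be stored in the info word: the offset with its low bit dropped.
// Hypothetical helper, not the x/net/idna implementation.
func appendAligned(table, entry []byte) ([]byte, uint16) {
	if len(table)%2 != 0 {
		table = append(table, 0) // padding byte, never read
	}
	stored := uint16(len(table) / 2)
	return append(table, entry...), stored
}

// offset recovers the byte offset from a stored index.
func offset(stored uint16) int {
	return int(stored) << 1
}

func main() {
	var table []byte
	var stored []uint16
	// Assumed entry format: a length header byte followed by the mapping bytes.
	for _, m := range []string{"a", "ss", "i"} {
		entry := append([]byte{byte(len(m))}, m...)
		var s uint16
		table, s = appendAligned(table, entry)
		stored = append(stored, s)
	}
	for _, s := range stored {
		off := offset(s)
		n := int(table[off])
		fmt.Printf("stored=%d offset=%d mapping=%q\n", s, off, table[off+1:off+1+n])
	}
}
```

The trade-off is the one described above: padding bytes make the table larger, but a 13-bit stored index can then address twice as many bytes.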
Finally, I have benchmarked both approaches on Go 1.17.1:
Note, though, that BenchmarkProfile is not an entirely fair benchmark, as it hammers one specific ASCII-only string. The real-world performance impact of both approaches is likely to be more pronounced, especially when characters in the mapping table are exercised, but probably not overly so.