x/text: incorrectly decodes all GBK/GB18030 double-byte PUA characters to U+FFFD #62063
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
Playground.
What did you do?
https://go.dev/play/p/v_4hT9WSD7_y
What did you expect to see?
What did you see instead?
According to GB18030-2022 [1], §7.2 "双字节部分的码位分配" (code point allocation of the double-byte part) and §7.3 "四字节部分的码位分配" (code point allocation of the four-byte part), there are four Private Use Area (PUA) ranges, the first three of which are the same as in the GBK encoding:
[\xAA-\xAF][\xA1-\xFE] (564 code points)
[\xF8-\xFE][\xA1-\xFE] (658 code points)
[\xA1-\xA7][\x40-\x7E\x80-\xA0] (672 code points)
[\xFD-\xFE][\x30-\x39][\x81-\xFE][\x30-\x39] (25200 code points)
The first three ranges [2] have explicit mappings to the Unicode PUA range, specified in Appendix A (pp. 83–90), which is normative.
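As a rough cross-check of the counts above, and to illustrate what a decoder could be expected to return for the three double-byte ranges, here is a minimal Go sketch using only the standard library. The names `byteRange`, `area`, and `puaRune` are hypothetical helpers, not part of x/text. The base code points U+E000, U+E234, and U+E4C6, and the row-major linear layout, are an assumption taken from the commonly documented GBK user-defined-area layout; they should be verified against Appendix A before being relied on.

```go
package main

import "fmt"

// byteRange is an inclusive range of byte values (hypothetical helper type).
type byteRange struct{ lo, hi byte }

// area describes one PUA range: for each byte position,
// the byte ranges allowed at that position.
type area struct {
	pattern   string
	positions [][]byteRange
}

// count multiplies the number of allowed values at each byte position.
func (a area) count() int {
	n := 1
	for _, pos := range a.positions {
		values := 0
		for _, r := range pos {
			values += int(r.hi) - int(r.lo) + 1
		}
		n *= values
	}
	return n
}

// areas lists the four ranges quoted in the issue text.
var areas = []area{
	{`[\xAA-\xAF][\xA1-\xFE]`, [][]byteRange{{{0xAA, 0xAF}}, {{0xA1, 0xFE}}}},
	{`[\xF8-\xFE][\xA1-\xFE]`, [][]byteRange{{{0xF8, 0xFE}}, {{0xA1, 0xFE}}}},
	{`[\xA1-\xA7][\x40-\x7E\x80-\xA0]`, [][]byteRange{{{0xA1, 0xA7}}, {{0x40, 0x7E}, {0x80, 0xA0}}}},
	{`[\xFD-\xFE][\x30-\x39][\x81-\xFE][\x30-\x39]`, [][]byteRange{
		{{0xFD, 0xFE}}, {{0x30, 0x39}}, {{0x81, 0xFE}}, {{0x30, 0x39}}}},
}

// puaRune maps a double-byte code in one of the first three ranges to a
// Unicode PUA rune. ASSUMPTION: the bases 0xE000, 0xE234, 0xE4C6 and the
// linear row-major order follow the commonly documented GBK layout; check
// them against Appendix A of the standard.
func puaRune(lead, trail byte) (rune, bool) {
	switch {
	case lead >= 0xAA && lead <= 0xAF && trail >= 0xA1 && trail <= 0xFE:
		return 0xE000 + rune(lead-0xAA)*94 + rune(trail-0xA1), true
	case lead >= 0xF8 && lead <= 0xFE && trail >= 0xA1 && trail <= 0xFE:
		return 0xE234 + rune(lead-0xF8)*94 + rune(trail-0xA1), true
	case lead >= 0xA1 && lead <= 0xA7 && trail >= 0x40 && trail <= 0xA0 && trail != 0x7F:
		idx := rune(trail - 0x40)
		if trail > 0x7F {
			idx-- // 0x7F is not a valid trail byte, so skip its slot
		}
		return 0xE4C6 + rune(lead-0xA1)*96 + idx, true
	}
	return 0, false // not in a double-byte PUA range
}

func main() {
	for _, a := range areas {
		fmt.Printf("%-46s %5d code points\n", a.pattern, a.count())
	}
	// First code of each double-byte range.
	for _, c := range [][2]byte{{0xAA, 0xA1}, {0xF8, 0xA1}, {0xA1, 0x40}} {
		if r, ok := puaRune(c[0], c[1]); ok {
			fmt.Printf("%02X%02X -> %U\n", c[0], c[1], r)
		}
	}
}
```

Under the assumed bases, the three double-byte ranges fill U+E000 through U+E765 contiguously (564 + 658 + 672 = 1894 code points), so none of these inputs would decode to U+FFFD.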
Instead, the current implementation of x/text decodes every character in these double-byte PUA ranges to U+FFFD.
I'd also like to note that Python 3.11 produces the correct mapping with its GB18030 codec.
Footnotes
1. The Simplified Chinese version of the standard is freely available at https://archive.org/details/GB18030-2022
2. GB18030 does not specify a mapping for the four-byte PUA range FD308130–FE39FE39. According to https://icu-project.org/docs/papers/unicode-gb18030-faq.html, "Normally, they need to be treated as unassigned codes."