x/text: incorrectly decodes all GBK/GB18030 double-byte PUA characters to U+FFFD #62063

kennytm · 2023-08-16T11:00:20Z

What version of Go are you using (`go version`)?

$ go version
go version go1.21.0 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (`go env`)?

Playground.

What did you do?

https://go.dev/play/p/v_4hT9WSD7_y

package main

import (
	"fmt"

	"golang.org/x/text/encoding/simplifiedchinese"
)

func main() {
	// test decoding of GB18030 PUA characters to UTF-8
	s1, err := simplifiedchinese.GB18030.NewDecoder().Bytes([]byte("\xaa\xa1###\xaf\xfe###\xf8\xa1###\xfe\xfe###\xa1\x40###\xa7\xa0"))
	fmt.Printf("s1/ %x / %v\n", s1, err)
	// test decoding of GBK PUA characters to UTF-8
	s2, err := simplifiedchinese.GBK.NewDecoder().Bytes([]byte("\xaa\xa1###\xaf\xfe###\xf8\xa1###\xfe\xfe###\xa1\x40###\xa7\xa0"))
	fmt.Printf("s2/ %x / %v\n", s2, err)
	// test encoding of UTF-8 PUA characters to GB18030
	s3, err := simplifiedchinese.GB18030.NewEncoder().Bytes([]byte("\ue000###\ue233###\ue234###\ue4c5###\ue4c6###\ue765"))
	fmt.Printf("s3/ %x / %v\n", s3, err)
	// test encoding of UTF-8 PUA characters to GBK
	s4, err := simplifiedchinese.GBK.NewEncoder().Bytes([]byte("\ue000###\ue233###\ue234###\ue4c5###\ue4c6###\ue765"))
	fmt.Printf("s4/ %x / %v\n", s4, err)
}

What did you expect to see?

s1/ ee8080232323ee88b3232323ee88b4232323ee9385232323ee9386232323ee9da5 / <nil>
s2/ ee8080232323ee88b3232323ee88b4232323ee9385232323ee9386232323ee9da5 / <nil>
s3/ aaa1232323affe232323f8a1232323fefe232323a140232323a7a0 / <nil>
s4/ aaa1232323affe232323f8a1232323fefe232323a140232323a7a0 / <nil>

What did you see instead?

s1/ efbfbd232323efbfbd232323efbfbd232323efbfbd232323efbfbd232323efbfbd / <nil>
s2/ efbfbd232323efbfbd232323efbfbd232323efbfbd232323efbfbd232323efbfbd / <nil>
s3/ 833898372323238338d1302323238338d13123232383399438232323833994392323238339d830 / <nil>
s4/  / encoding: rune not supported by encoding.

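According to GB18030-2022¹ §7.2 "双字节部分的码位分配" (code point allocation of the double-byte part) and §7.3 "四字节部分的码位分配" (code point allocation of the four-byte part), there are four Private Use Area (PUA) ranges, the first three of which are the same as in GBK: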

[\xAA-\xAF][\xA1-\xFE] (564 code points)
[\xF8-\xFE][\xA1-\xFE] (658 code points)
[\xA1-\xA7][\x40-\x7E\x80-\xA0] (672 code points)
[\xFD-\xFE][\x30-\x39][\x81-\xFE][\x30-\x39] (25200 code points)
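
The code point counts above follow from multiplying the number of byte values allowed at each position. As a quick sanity check, here is a minimal sketch; the `byteRange` type and `countCodePoints` helper are hypothetical names made up for this illustration and are not part of x/text:

```go
package main

import "fmt"

// byteRange is an inclusive range of byte values, e.g. 0xA1-0xFE.
// (Hypothetical helper type for this illustration.)
type byteRange struct{ lo, hi byte }

// size returns how many byte values the range contains.
func (r byteRange) size() int { return int(r.hi) - int(r.lo) + 1 }

// countCodePoints multiplies, position by position, the number of byte
// values allowed there. Each position is a union of one or more ranges.
func countCodePoints(positions ...[]byteRange) int {
	total := 1
	for _, alternatives := range positions {
		n := 0
		for _, r := range alternatives {
			n += r.size()
		}
		total *= n
	}
	return total
}

func main() {
	// [\xAA-\xAF][\xA1-\xFE]
	fmt.Println(countCodePoints(
		[]byteRange{{0xAA, 0xAF}}, []byteRange{{0xA1, 0xFE}})) // 564
	// [\xF8-\xFE][\xA1-\xFE]
	fmt.Println(countCodePoints(
		[]byteRange{{0xF8, 0xFE}}, []byteRange{{0xA1, 0xFE}})) // 658
	// [\xA1-\xA7][\x40-\x7E\x80-\xA0]
	fmt.Println(countCodePoints(
		[]byteRange{{0xA1, 0xA7}}, []byteRange{{0x40, 0x7E}, {0x80, 0xA0}})) // 672
	// [\xFD-\xFE][\x30-\x39][\x81-\xFE][\x30-\x39]
	fmt.Println(countCodePoints(
		[]byteRange{{0xFD, 0xFE}}, []byteRange{{0x30, 0x39}},
		[]byteRange{{0x81, 0xFE}}, []byteRange{{0x30, 0x39}})) // 25200
}
```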

The first three ranges² are explicitly mapped to the Unicode PUA in Appendix A (pp. 83–90), which is normative:

AAA1 maps to U+E000, allocating sequentially until AFFE mapping to U+E233
F8A1 maps to U+E234, allocating sequentially until FEFE mapping to U+E4C5
A140 maps to U+E4C6, allocating sequentially until A7A0 mapping to U+E765
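
To make the sequential allocation concrete, here is a rough sketch of the decoding direction. It assumes "sequentially" means row-major order (all trail bytes of one lead byte before moving to the next lead byte), which is consistent with the endpoints and counts listed above. `decodeDoubleBytePUA` is a hypothetical name for illustration only, not an x/text function:

```go
package main

import "fmt"

// decodeDoubleBytePUA maps a double-byte GBK/GB18030 user-defined code to
// its Unicode PUA code point, assuming row-major sequential allocation.
// ok is false when (b1, b2) lies outside all three double-byte PUA ranges.
func decodeDoubleBytePUA(b1, b2 byte) (r rune, ok bool) {
	switch {
	case b1 >= 0xAA && b1 <= 0xAF && b2 >= 0xA1 && b2 <= 0xFE:
		// 94 trail bytes per lead byte, starting at U+E000.
		return 0xE000 + rune(b1-0xAA)*94 + rune(b2-0xA1), true
	case b1 >= 0xF8 && b1 <= 0xFE && b2 >= 0xA1 && b2 <= 0xFE:
		// 94 trail bytes per lead byte, starting at U+E234.
		return 0xE234 + rune(b1-0xF8)*94 + rune(b2-0xA1), true
	case b1 >= 0xA1 && b1 <= 0xA7 && b2 >= 0x40 && b2 <= 0xA0 && b2 != 0x7F:
		// 96 trail bytes per lead byte: 0x40-0x7E, then 0x80-0xA0.
		idx := rune(b2 - 0x40)
		if b2 > 0x7F {
			idx-- // 0x7F is not a valid trail byte, so close the gap
		}
		return 0xE4C6 + rune(b1-0xA1)*96 + idx, true
	}
	return 0, false
}

func main() {
	// The six boundary codes used in the reproduction program.
	codes := [][2]byte{
		{0xAA, 0xA1}, {0xAF, 0xFE},
		{0xF8, 0xA1}, {0xFE, 0xFE},
		{0xA1, 0x40}, {0xA7, 0xA0},
	}
	for _, c := range codes {
		r, ok := decodeDoubleBytePUA(c[0], c[1])
		fmt.Printf("%02X%02X -> U+%04X (ok=%v)\n", c[0], c[1], r, ok)
	}
}
```

With that assumption, the six boundary codes come out as U+E000, U+E233, U+E234, U+E4C5, U+E4C6 and U+E765, matching the expected "s1" and "s2" output.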

Instead, the current implementation of x/text:

wrongly decodes all 3 of these double-byte PUA ranges to U+FFFD (the "s1" and "s2" tests above)
wrongly encodes U+E000 to U+E765 in GB18030 to the quad-byte range for U+F014 to U+F779 (83389837–8339d830), which does not round-trip (the "s3" test above), and rejects them in GBK with "rune not supported by encoding" (the "s4" test above).
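
For comparison, a round-tripping encoder would need to send U+E000 to U+E765 back to the double-byte forms. The following is only a sketch of that inverse mapping under the same row-major assumption about "sequential" allocation; `encodeDoubleBytePUA` is a hypothetical name and does not reflect how x/text is structured internally:

```go
package main

import "fmt"

// encodeDoubleBytePUA maps a Unicode PUA code point in U+E000..U+E765 to
// the double-byte GBK/GB18030 code implied by the ranges in this report.
// ok is false for runes outside that span.
func encodeDoubleBytePUA(r rune) (b [2]byte, ok bool) {
	switch {
	case r >= 0xE000 && r <= 0xE233:
		i := r - 0xE000
		return [2]byte{0xAA + byte(i/94), 0xA1 + byte(i%94)}, true
	case r >= 0xE234 && r <= 0xE4C5:
		i := r - 0xE234
		return [2]byte{0xF8 + byte(i/94), 0xA1 + byte(i%94)}, true
	case r >= 0xE4C6 && r <= 0xE765:
		i := r - 0xE4C6
		trail := 0x40 + byte(i%96)
		if trail >= 0x7F {
			trail++ // skip 0x7F, which is not a valid trail byte
		}
		return [2]byte{0xA1 + byte(i/96), trail}, true
	}
	return b, false
}

func main() {
	// The six boundary code points used in the reproduction program.
	for _, r := range []rune{0xE000, 0xE233, 0xE234, 0xE4C5, 0xE4C6, 0xE765} {
		b, ok := encodeDoubleBytePUA(r)
		fmt.Printf("U+%04X -> %02X%02X (ok=%v)\n", r, b[0], b[1], ok)
	}
}
```

Under that assumption the six code points encode to AAA1, AFFE, F8A1, FEFE, A140 and A7A0, which is the byte sequence in the expected "s3" and "s4" output.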

I'd also like to note that Python 3.11 produces the correct mapping with its GB18030 codec:

>>> b"\xaa\xa1###\xaf\xfe###\xf8\xa1###\xfe\xfe###\xa1\x40###\xa7\xa0".decode('gb18030')
'\ue000###\ue233###\ue234###\ue4c5###\ue4c6###\ue765'
>>> _.encode('gb18030')
b'\xaa\xa1###\xaf\xfe###\xf8\xa1###\xfe\xfe###\xa1@###\xa7\xa0'

¹ The Simplified Chinese version of the standard is freely available on https://archive.org/details/GB18030-2022
² GB18030 did not specify a mapping for the quad-byte PUA range FD308130–FE39FE39. According to https://icu-project.org/docs/papers/unicode-gb18030-faq.html, “Normally, they need to be treated as unassigned codes.”


kennytm · 2023-08-16T11:15:00Z

cc #61165, #41990

dmitshur · 2023-08-16T17:36:10Z

CC @mpvl.

gopherbot added this to the Unreleased milestone Aug 16, 2023

dmitshur added the NeedsInvestigation label Aug 16, 2023
