x/text: incorrectly decodes all GBK/GB18030 double-byte PUA characters to U+FFFD #62063

Open
kennytm opened this issue Aug 16, 2023 · 2 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.

kennytm commented Aug 16, 2023

What version of Go are you using (go version)?

$ go version
go version go1.21.0 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

Playground.

What did you do?

https://go.dev/play/p/v_4hT9WSD7_y

package main

import (
	"fmt"

	"golang.org/x/text/encoding/simplifiedchinese"
)

func main() {
	// test decoding of GB18030 PUA characters to UTF-8
	s1, err := simplifiedchinese.GB18030.NewDecoder().Bytes([]byte("\xaa\xa1###\xaf\xfe###\xf8\xa1###\xfe\xfe###\xa1\x40###\xa7\xa0"))
	fmt.Printf("s1/ %x / %v\n", s1, err)
	// test decoding of GBK PUA characters to UTF-8
	s2, err := simplifiedchinese.GBK.NewDecoder().Bytes([]byte("\xaa\xa1###\xaf\xfe###\xf8\xa1###\xfe\xfe###\xa1\x40###\xa7\xa0"))
	fmt.Printf("s2/ %x / %v\n", s2, err)
	// test encoding of UTF-8 PUA characters to GB18030
	s3, err := simplifiedchinese.GB18030.NewEncoder().Bytes([]byte("\ue000###\ue233###\ue234###\ue4c5###\ue4c6###\ue765"))
	fmt.Printf("s3/ %x / %v\n", s3, err)
	// test encoding of UTF-8 PUA characters to GBK
	s4, err := simplifiedchinese.GBK.NewEncoder().Bytes([]byte("\ue000###\ue233###\ue234###\ue4c5###\ue4c6###\ue765"))
	fmt.Printf("s4/ %x / %v\n", s4, err)
}

What did you expect to see?

s1/ ee8080232323ee88b3232323ee88b4232323ee9385232323ee9386232323ee9da5 / <nil>
s2/ ee8080232323ee88b3232323ee88b4232323ee9385232323ee9386232323ee9da5 / <nil>
s3/ aaa1232323affe232323f8a1232323fefe232323a140232323a7a0 / <nil>
s4/ aaa1232323affe232323f8a1232323fefe232323a140232323a7a0 / <nil>

What did you see instead?

s1/ efbfbd232323efbfbd232323efbfbd232323efbfbd232323efbfbd232323efbfbd / <nil>
s2/ efbfbd232323efbfbd232323efbfbd232323efbfbd232323efbfbd232323efbfbd / <nil>
s3/ 833898372323238338d1302323238338d13123232383399438232323833994392323238339d830 / <nil>
s4/  / encoding: rune not supported by encoding.
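For anyone reading the hex dumps by hand, here is a small sketch that turns a hex string of UTF-8 bytes back into code points. It uses only the standard library; `runesOf` is a hypothetical helper name introduced for this illustration, not something from x/text.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// runesOf is a hypothetical helper: it decodes a hex dump of UTF-8 bytes
// (as printed by the %x verb in the reproducer) into Unicode code points.
func runesOf(hexDump string) ([]rune, error) {
	b, err := hex.DecodeString(hexDump)
	if err != nil {
		return nil, err
	}
	// Converting the string to []rune decodes it as UTF-8.
	return []rune(string(b)), nil
}

func main() {
	// First group of the expected output, then first group of the actual output.
	for _, h := range []string{"ee8080232323ee88b3", "efbfbd232323efbfbd"} {
		rs, err := runesOf(h)
		if err != nil {
			fmt.Println("bad hex:", err)
			continue
		}
		fmt.Printf("%s -> %U\n", h, rs)
	}
}
```

The expected dump starts with U+E000 and U+E233 separated by `###` (U+0023), while the actual dump has U+FFFD in those positions.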

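According to GB18030-2022[1] §7.2 "双字节部分的码位分配" (code point allocation of the double-byte part) and §7.3 "四字节部分的码位分配" (code point allocation of the four-byte part), there are 4 Private Use Area ranges (the first three are the same as in the GBK encoding):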

  1. [\xAA-\xAF][\xA1-\xFE] (564 code points)
  2. [\xF8-\xFE][\xA1-\xFE] (658 code points)
  3. [\xA1-\xA7][\x40-\x7E\x80-\xA0] (672 code points)
  4. [\xFD-\xFE][\x30-\x39][\x81-\xFE][\x30-\x39] (25200 code points)
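The code point counts above follow directly from the byte ranges. A quick sketch of the arithmetic, using hypothetical helper names (`span`, `puaRangeSizes`) that exist only for this illustration:

```go
package main

import "fmt"

// span returns how many byte values lie in the inclusive range [lo, hi].
func span(lo, hi int) int { return hi - lo + 1 }

// puaRangeSizes returns the sizes of the four ranges listed above, in order.
func puaRangeSizes() [4]int {
	trail := span(0xA1, 0xFE)                       // 94 trail bytes
	lowTrail := span(0x40, 0x7E) + span(0x80, 0xA0) // 63 + 33 = 96 trail bytes (0x7F is skipped)
	digit := span(0x30, 0x39)                       // 10 digit bytes
	return [4]int{
		span(0xAA, 0xAF) * trail,                            // range 1
		span(0xF8, 0xFE) * trail,                            // range 2
		span(0xA1, 0xA7) * lowTrail,                         // range 3
		span(0xFD, 0xFE) * digit * span(0x81, 0xFE) * digit, // range 4
	}
}

func main() {
	fmt.Println(puaRangeSizes()) // [564 658 672 25200]
}
```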

There are explicit mappings of the first 3 ranges[2] to the Unicode PUA range, specified in Appendix A (pp. 83–90), which is normative.

  1. AAA1 maps to U+E000, allocating sequentially until AFFE mapping to U+E233
  2. F8A1 maps to U+E234, allocating sequentially until FEFE mapping to U+E4C5
  3. A140 maps to U+E4C6, allocating sequentially until A7A0 mapping to U+E765
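To make the three mappings concrete, here is a minimal sketch of what a decoder for these ranges could compute. It is not the x/text implementation, and `decodePUA` is a hypothetical name. It assumes "allocating sequentially" means lead-byte-major order with the trail byte varying fastest, and that 0x7F is skipped in the third range. That assumption is consistent with the endpoints listed above, but the table in Appendix A is the authority.

```go
package main

import "fmt"

// decodePUA is a hypothetical helper (not part of x/text). It maps a
// double-byte GBK/GB18030 sequence in one of the three user-defined ranges
// to its Unicode PUA code point, assuming sequential allocation with the
// trail byte varying fastest.
func decodePUA(b0, b1 byte) (rune, bool) {
	switch {
	case b0 >= 0xAA && b0 <= 0xAF && b1 >= 0xA1 && b1 <= 0xFE:
		// Range 1: 94 trail bytes per lead byte, starting at U+E000.
		return 0xE000 + rune(b0-0xAA)*94 + rune(b1-0xA1), true
	case b0 >= 0xF8 && b0 <= 0xFE && b1 >= 0xA1 && b1 <= 0xFE:
		// Range 2: 94 trail bytes per lead byte, starting at U+E234.
		return 0xE234 + rune(b0-0xF8)*94 + rune(b1-0xA1), true
	case b0 >= 0xA1 && b0 <= 0xA7 && b1 >= 0x40 && b1 <= 0xA0 && b1 != 0x7F:
		// Range 3: 96 trail bytes per lead byte, starting at U+E4C6.
		i := rune(b1 - 0x40)
		if b1 > 0x7F {
			i-- // the trail byte 0x7F is not used, so close the gap
		}
		return 0xE4C6 + rune(b0-0xA1)*96 + i, true
	}
	return 0, false
}

func main() {
	// The six byte pairs used in the reproducer above.
	pairs := [][2]byte{
		{0xAA, 0xA1}, {0xAF, 0xFE},
		{0xF8, 0xA1}, {0xFE, 0xFE},
		{0xA1, 0x40}, {0xA7, 0xA0},
	}
	for _, p := range pairs {
		r, ok := decodePUA(p[0], p[1])
		fmt.Printf("%02X%02X -> U+%04X (in PUA range: %v)\n", p[0], p[1], r, ok)
	}
}
```

Under that assumption, the six byte pairs from the reproducer come out as U+E000, U+E233, U+E234, U+E4C5, U+E4C6 and U+E765, matching the endpoints in the list.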

Instead, the current implementation of x/text:

  • wrongly decodes all these 3 ranges of double-byte PUA characters to U+FFFD (the "s1" and "s2" tests above)
  • wrongly encodes U+E000 to U+E765 to the quad-byte range for U+F014 to U+F779 (83389837–8339d830) which does not round-trip (the "s3" and "s4" tests above).

I'd also like to note that Python 3.11 produces the correct mapping with its GB18030 codec:

>>> b"\xaa\xa1###\xaf\xfe###\xf8\xa1###\xfe\xfe###\xa1\x40###\xa7\xa0".decode('gb18030')
'\ue000###\ue233###\ue234###\ue4c5###\ue4c6###\ue765'
>>> _.encode('gb18030')
b'\xaa\xa1###\xaf\xfe###\xf8\xa1###\xfe\xfe###\xa1@###\xa7\xa0'

Footnotes

  1. The simplified Chinese version of the standard is freely available on https://archive.org/details/GB18030-2022

  2. GB18030 did not specify a mapping for the quad-byte PUA range FD308130–FE39FE39. According to https://icu-project.org/docs/papers/unicode-gb18030-faq.html, “Normally, they need to be treated as unassigned codes.”

gopherbot added this to the Unreleased milestone Aug 16, 2023

kennytm commented Aug 16, 2023

cc #61165, #41990

dmitshur (Contributor) commented

CC @mpvl.

dmitshur added the NeedsInvestigation label Aug 16, 2023