x/text/encoding: Go's reuse of character sets causes incorrect decoding of invalid input #29535

danielbeaudreau · 2019-01-03T19:13:52Z

What version of Go are you using (`go version`)?

$ go version
go version go1.11 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (`go env`)?

amd64, linux

What did you do?

Call charset.Lookup("us-ascii") to get an encoding
https://godoc.org/golang.org/x/net/html/charset#Lookup

e, _ := charset.Lookup("us-ascii")

Call e.NewDecoder().String(string([]byte{0x80})) to decode the single byte 0x80 and get the result

res, _ := e.NewDecoder().String(string([]byte{0x80}))

Print the result

What did you expect to see?

I expected to see the replacement character � (U+FFFD), since the byte 0x80 is outside the range of US-ASCII (its most significant bit is 1).

What did you see instead?

Instead I see the character € from the Windows-1252 encoding. This happens because Go reuses Windows-1252 for US-ASCII. Similar issues arise for out-of-range bytes in other character sets; for example, tis-620 maps to windows-874. To parse the text correctly, I now have to read through the decoded runes and test whether any of them are out of range. If I wanted to use windows-874 for just the tis-620 characters, I would have to do a similar manual exclusion of out-of-range characters. I do not know of a way to define my own character set so that these problems can be avoided.
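
For illustration, the behavior expected above can be sketched with the standard library alone. This is a minimal sketch, not part of x/text or x/net: `decodeStrictASCII` is a hypothetical helper name, and it assumes the input is a raw byte slice in which every byte above 0x7F should be reported as U+FFFD.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// decodeStrictASCII is a hypothetical helper (an assumption for this sketch,
// not an API of x/text or x/net). Bytes 0x00-0x7F are copied unchanged;
// every other byte becomes the replacement character U+FFFD.
func decodeStrictASCII(b []byte) string {
	var sb strings.Builder
	sb.Grow(len(b))
	for _, c := range b {
		if c < utf8.RuneSelf { // utf8.RuneSelf is 0x80
			sb.WriteByte(c)
		} else {
			sb.WriteRune(utf8.RuneError) // U+FFFD
		}
	}
	return sb.String()
}

func main() {
	// %+q escapes non-ASCII runes, so this prints "ok\ufffd".
	fmt.Printf("%+q\n", decodeStrictASCII([]byte{'o', 'k', 0x80}))
}
```

This sidesteps the charset lookup entirely, so it only helps when the caller already knows the input is meant to be US-ASCII.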


Wessie · 2019-01-03T19:27:24Z

This seems to follow https://www.w3.org/TR/encoding/#names-and-labels, where us-ascii is an alias for windows-1252.

danielbeaudreau · 2019-01-03T19:37:34Z

Interesting, but US-ASCII and Windows-1252 are different character sets (Windows-1252 is a superset).
See https://en.wikipedia.org/wiki/ASCII#Character_set
and https://en.wikipedia.org/wiki/Windows-1252#Character_set

Other languages, such as Java, let users differentiate between these character sets. For decoding legacy text, a superset character set like Windows-1252 is not ideal: if the input is invalid, characters that the subset character set cannot express can end up in the result. Developers then have to implement workarounds to keep invalid text out of the result.

katiehockman · 2019-01-04T18:21:19Z

/cc @rsc

katiehockman added the NeedsInvestigation label Jan 4, 2019

katiehockman added this to the Unplanned milestone Jan 4, 2019

katiehockman changed the title ~~Go's reuse of character sets causes incorrect decoding of invalid input~~ src/encoding: Go's reuse of character sets causes incorrect decoding of invalid input Jan 4, 2019

katiehockman changed the title ~~src/encoding: Go's reuse of character sets causes incorrect decoding of invalid input~~ encoding: Go's reuse of character sets causes incorrect decoding of invalid input Jan 4, 2019

agnivade changed the title ~~encoding: Go's reuse of character sets causes incorrect decoding of invalid input~~ x/text/encoding: Go's reuse of character sets causes incorrect decoding of invalid input Jan 5, 2019
