x/text/encoding: Go's reuse of character sets causes incorrect decoding of invalid input #29535
Labels
NeedsInvestigation
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
amd64, linux
What did you do?
https://godoc.org/golang.org/x/net/html/charset#Lookup
e, _ := charset.Lookup("us-ascii")
res, _ := e.NewDecoder().String(string([]byte{0x80}))
What did you expect to see?
I expect to see a � character (U+FFFD), since the byte 0x80 is outside the range of US-ASCII (its most significant bit is 1).
What did you see instead?
Instead I see the character € from the Windows-1252 encoding. This happens because Go reuses its Windows-1252 decoder for US-ASCII. Similar problems arise for out-of-range bytes in other character sets; for example, tis-620 maps to windows874. To parse the text correctly, I now have to walk the decoded runes and check whether any of them fall outside the expected range. If I wanted to use windows874 for only the tis-620 characters, I would need a similar manual exclusion of out-of-range characters. I do not know of a way to define my own character set that would avoid these problems.
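For the US-ASCII case, the manual exclusion described above could look roughly like the following. This is only a sketch using the standard library: decodeStrictASCII is a hypothetical helper name, not something provided by x/text or x/net, and it bypasses charset.Lookup entirely rather than fixing it.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// decodeStrictASCII is a hypothetical helper (not part of x/text or x/net).
// It copies 7-bit bytes through unchanged and replaces every byte that has
// its most significant bit set with U+FFFD, the Unicode replacement character.
func decodeStrictASCII(b []byte) string {
	var sb strings.Builder
	sb.Grow(len(b))
	for _, c := range b {
		if c < utf8.RuneSelf { // utf8.RuneSelf is 0x80, the first non-ASCII byte value
			sb.WriteByte(c)
		} else {
			sb.WriteRune(utf8.RuneError) // utf8.RuneError is U+FFFD
		}
	}
	return sb.String()
}

func main() {
	// 0x80 is the byte from the report above; here it becomes U+FFFD, not €.
	fmt.Printf("%+q\n", decodeStrictASCII([]byte{'o', 'k', 0x80}))
}
```

A character set such as tis-620 would need the same kind of per-byte check with its own valid ranges before the remaining bytes are handed to the windows874 decoder, which is exactly the manual work this issue is asking to avoid.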