Navigation Menu

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/text/encoding: Go's reuse of character sets causes incorrect decoding of invalid input #29535

Open
danielbeaudreau opened this issue Jan 3, 2019 · 3 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@danielbeaudreau
Copy link

danielbeaudreau commented Jan 3, 2019

What version of Go are you using (go version)?

$ go version
go version go1.11 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

amd64, linux

What did you do?

  1. Call charset.Lookup("us-ascii") to get an encoding
    https://godoc.org/golang.org/x/net/html/charset#Lookup

e, _ := charset.Lookup("us-ascii")

  1. Call encoding.NewDecoder().String(string([]byte{0x80})) and get the result

res, _ := e.NewDecoder().String(string([]byte{0x80}))

  1. Print the result

What did you expect to see?

I expect to see a � character, since the value is outside of the range of US-ASCII (most significant bit is 1)

What did you see instead?

I will see the character € from the Windows 1252 encoding instead. This is caused because go is re-using Windows 1252 for US-ASCII. Similar issues arise for out of bounds characters in other charactersets, for example tis-620 maps to windows874. Now if I want to correctly parse the text I need to read through the decoded runes and test if any of them are out of bounds. If I want to use windows874 for just tis-620 characters, I would have to do a similar manual exclusion of out of bounds characters. I do not know of a way to create my own characterset so that these problems can be avoided.

@Wessie
Copy link

Wessie commented Jan 3, 2019

this seems to follow https://www.w3.org/TR/encoding/#names-and-labels where us-ascii is an alias for windows 1252

@danielbeaudreau
Copy link
Author

danielbeaudreau commented Jan 3, 2019

Interesting, but US-ASCII and Windows1252 are different charactersets (windows 1252 is a superset).
See https://en.wikipedia.org/wiki/ASCII#Character_set
and https://en.wikipedia.org/wiki/Windows-1252#Character_set

Other languages like Java allow users to differentiate between these character sets. For decoding legacy text, it is not ideal to use a superset characterset like windows1252, because invalid characters which are not expressible in the subset character set can be inserted into the result if the user's text is invalid. This requires developers to implement workarounds to ensure invalid text is not contained in the result.

@katiehockman katiehockman added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jan 4, 2019
@katiehockman katiehockman added this to the Unplanned milestone Jan 4, 2019
@katiehockman
Copy link
Contributor

/cc @rsc

@katiehockman katiehockman changed the title Go's reuse of character sets causes incorrect decoding of invalid input src/encoding: Go's reuse of character sets causes incorrect decoding of invalid input Jan 4, 2019
@katiehockman katiehockman changed the title src/encoding: Go's reuse of character sets causes incorrect decoding of invalid input encoding: Go's reuse of character sets causes incorrect decoding of invalid input Jan 4, 2019
@agnivade agnivade changed the title encoding: Go's reuse of character sets causes incorrect decoding of invalid input x/text/encoding: Go's reuse of character sets causes incorrect decoding of invalid input Jan 5, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

3 participants