-
Notifications
You must be signed in to change notification settings - Fork 18k
strconv: non-UTF8 bytes are being unescaped in strconv.Unquote #51094
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
This is how strings and UTF-8 work in Go. Every byte of illegal UTF-8 is converted to U+FFF8. it's actually not all that clearly documented. Although we went to great pains to do this consistently between all the library routines and the range statement, and therefore by inclusion strconv.Unquote, the best documentation for this behavior is here: https://pkg.go.dev/unicode/utf8#DecodeRune. It also appears in https://go.dev/ref/spec#For_statements in the section about range clauses. It should probably be made more visible. But as far as this behavior is concerned, it is working as intended. I'll leave the issue open to see if anyone has ideas about the best way to make this easier to discover. |
@robpike Thanks for the reply! I came across this issue when trying to json unmarshal a part with non-printable bytes inside. Unquote is needed to unescape those unicode sequences like |
@robpike I noticed the case above might pointed to default:
rr, size := utf8.DecodeRune(s[r:])
if rr == utf8.RuneError {
b[w] = s[r]
r += 1
w += 1
} else {
r += size
w += utf8.EncodeRune(b[w:], rr)
} |
The the error condition would need to be
and that is certainly possible but I am not in a position to say whether it's the right thing to do. It would likely break some tests, but maybe not too many. Non-compliance with the RFC is arguably a reason to make such a change but this behavior would be unique within the Go environment and therefore would require discussion. |
When it comes to encoding/json, I'm afraid the current behavior is well documented:
Changing the default behavior would surely break programs out there in subtle ways. The general understanding with the Go standard library's compatibility guarantee is that we may be able to change undocumented behavior, as long as very few Go programs out there break because of the change. But changing documented behavior, especially if it could break programs, is almost always a breaking change we cannot do. Note that the json encoder does allow disabling the replacement of invalid UTF-8 (see below), so one could imagine a proposal to also opt out of this feature in the decoder.
|
@mvdan Thanks for pointing out the docs. And I do understand chaning the behaviour is a concerning move. Is it worth |
That is what I'm saying could be proposed. I would start a new issue, though, because it's entirely separate from strconv. |
@fffonion. There are at least three possible behaviors to do for
The current behavior of I'm not sure we should be adding options to |
@dsnet Indeed, reading through the UTF8 spec, it says if a codepoint is larger than 0x7F (non-printable), it should be encoded in at least two bytes. So a single byte |
What version of Go are you using (
go version
)?Does this issue reproduce with the latest release?
yes
What operating system and processor architecture are you using (
go env
)?go env
OutputWhat did you do?
It seems neither does UnquoteChar (https://github.com/golang/go/blob/go1.17.6/src/strconv/quote.go#L267) nor does its caller Unquote (https://github.com/golang/go/blob/go1.17.6/src/strconv/quote.go#L267) handles the error from
utf8.DecodeRuneInString(s)
, so any single byte non-UTF8 sequence are being escaped toutf8.RuneError
and stored in output.In Python for example, such cases are handled properly:
What did you expect to see?
[147 1]
What did you see instead?
[239 191 189 1]
The text was updated successfully, but these errors were encountered: