
strconv: non-UTF8 bytes are being unescaped in strconv.Unquote #51094

Open
fffonion opened this issue Feb 9, 2022 · 9 comments
Labels
Documentation NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@fffonion

fffonion commented Feb 9, 2022

What version of Go are you using (go version)?

$ go version
go version go1.17.2 darwin/amd64

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
```
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/fffonion/Library/Caches/go-build"
GOENV="/Users/fffonion/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/fffonion/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/fffonion/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/Cellar/go/1.17.2/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.17.2/libexec/pkg/tool/darwin_amd64"
GOVCS=""
GOVERSION="go1.17.2"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/fffonion/bbb/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug
```

What did you do?

```go
package main

import (
	"fmt"
	"strconv"
)

func main() {
	// the quoted string `"<raw 0x93>\u0001"`: a stray non-UTF8 byte
	// followed by a \u escape sequence
	in := []byte{34, 147, 92, 117, 48, 48, 48, 49, 34}

	str0, _ := strconv.Unquote(string(in))
	fmt.Println([]byte(str0))
}
```

It seems that neither UnquoteChar (https://github.com/golang/go/blob/go1.17.6/src/strconv/quote.go#L267) nor its caller Unquote (https://github.com/golang/go/blob/go1.17.6/src/strconv/quote.go#L267) handles the error from utf8.DecodeRuneInString(s), so every invalid single-byte sequence is converted to utf8.RuneError and stored in the output.

In Python for example, such cases are handled properly:

```python
a = bytes([147, 92, 117, 48, 48, 48, 49])
a.decode('raw_unicode_escape')
# '\x93\x01'
```

What did you expect to see?

[147 1]

What did you see instead?

[239 191 189 1]

@robpike
Contributor

robpike commented Feb 9, 2022

This is how strings and UTF-8 work in Go. Every byte of illegal UTF-8 is converted to U+FFFD.

It's actually not all that clearly documented. Although we went to great pains to do this consistently across all the library routines and the range statement, and therefore by inclusion strconv.Unquote, the best documentation for this behavior is here: https://pkg.go.dev/unicode/utf8#DecodeRune. It also appears in https://go.dev/ref/spec#For_statements in the section about range clauses.

It should probably be made more visible.

But as far as this behavior is concerned, it is working as intended. I'll leave the issue open to see if anyone has ideas about the best way to make this easier to discover.
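The range behavior mentioned above can be seen directly. A minimal check (not from the thread) over a string containing a stray 0x93 byte:

```go
package main

import "fmt"

func main() {
	// Ranging over a string replaces each byte of invalid UTF-8
	// with U+FFFD, the Unicode replacement character.
	for i, r := range "\x93ok" {
		fmt.Printf("%d: %U\n", i, r)
	}
}
```

This prints U+FFFD for the invalid byte at index 0, then the valid runes 'o' and 'k'.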

@fffonion
Author

fffonion commented Feb 9, 2022

@robpike Thanks for the reply! I came across this issue when trying to json-unmarshal a payload with non-printable bytes inside. Unquote is needed to unescape Unicode sequences like "\u0022", but single-byte characters (like 0x99) are converted to utf8.RuneError. I have verified that if those single bytes are escaped to something like "\x99" before passing to json.Unmarshal, then strconv.Unquote is happy. So we are going to write our own modified version of strconv.Unquote that simply ignores such cases.

@fffonion
Author

fffonion commented Feb 9, 2022

@robpike I noticed the case above might point to the json package instead (https://github.com/golang/go/blob/go1.17.6/src/encoding/json/decode.go#L1306), where a similar approach is taken as in strconv.Unquote. However, per the JSON RFC, leaving characters like 0x93 unescaped, as shown in the example, is allowed.
Does it make sense to make the json package specifically allow this use case? Technically it could just be:

```go
		default:
			rr, size := utf8.DecodeRune(s[r:])
			if rr == utf8.RuneError {
				// preserve the invalid byte as is
				b[w] = s[r]
				r++
				w++
			} else {
				r += size
				w += utf8.EncodeRune(b[w:], rr)
			}
```

@robpike
Contributor

robpike commented Feb 9, 2022

Then the error condition would need to be

size == 1 && rr == utf8.RuneError

and that is certainly possible but I am not in a position to say whether it's the right thing to do. It would likely break some tests, but maybe not too many. Non-compliance with the RFC is arguably a reason to make such a change but this behavior would be unique within the Go environment and therefore would require discussion.

@fffonion fffonion changed the title strconv: non-UTF8 bytes are being unescaped in strconv.Unquote strconv/json: non-UTF8 bytes are being unescaped in strconv.Unquote/json.unquote Feb 9, 2022
@mvdan
Member

mvdan commented Feb 9, 2022

When it comes to encoding/json, I'm afraid the current behavior is well documented:

When unmarshaling quoted strings, invalid UTF-8 or invalid UTF-16 surrogate pairs are not treated as an error. Instead, they are replaced by the Unicode replacement character U+FFFD.

Changing the default behavior would surely break programs out there in subtle ways. The general understanding with the Go standard library's compatibility guarantee is that we may be able to change undocumented behavior, as long as very few Go programs out there break because of the change. But changing documented behavior, especially if it could break programs, is almost always a breaking change we cannot do.

Note that the json encoder does allow disabling the replacement of invalid UTF-8 (see below), so one could imagine a proposal to also opt out of this feature in the decoder.

String values encode as JSON strings coerced to valid UTF-8, replacing invalid bytes with the Unicode replacement rune. So that the JSON will be safe to embed inside HTML <script> tags, the string is encoded using HTMLEscape, which replaces "<", ">", "&", U+2028, and U+2029 with "\u003c", "\u003e", "\u0026", "\u2028", and "\u2029". This replacement can be disabled when using an Encoder, by calling SetEscapeHTML(false).

@fffonion
Author

fffonion commented Feb 9, 2022

@mvdan Thanks for pointing out the docs. I do understand that changing the behaviour is a concerning move. Would it be worth, however, having a similar switch that disables this behaviour, as SetEscapeHTML does? I have tested with json libraries like json-iterator/go and bytedance/sonic, and they support such input.

@mvdan
Member

mvdan commented Feb 9, 2022

Is it worth to however have a similar switch that disables this behaviour, as SetEscapeHTML does?

That is what I'm saying could be proposed. I would start a new issue, though, because it's entirely separate from strconv.

@seankhliao seankhliao changed the title strconv/json: non-UTF8 bytes are being unescaped in strconv.Unquote/json.unquote strconv: non-UTF8 bytes are being unescaped in strconv.Unquote Feb 9, 2022
@seankhliao seankhliao added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Feb 9, 2022
@dsnet
Member

dsnet commented Feb 9, 2022

@fffonion There are at least three possible behaviors for json:

  1. Silently convert invalid UTF-8 to the Unicode replacement character,
  2. Silently preserve invalid UTF-8 as is, and
  3. Reject invalid UTF-8.

The current behavior of json is behavior 1. Both behaviors 1 and 2 are non-compliant with RFC 8259, which states in section 8.1 that the text "MUST be encoded using UTF-8". If we add an option, it would be to opt into behavior 3, but that doesn't seem to be what you want (based on the original issue).

I'm not sure we should be adding options to json that allow it to get further from standard behavior. Rather, we should be doing the opposite (i.e., moving closer to compliance).

@fffonion
Author

fffonion commented Feb 9, 2022

@dsnet Indeed, reading through the UTF-8 spec, any code point larger than 0x7F must be encoded in at least two bytes, so a single byte \x99 is not a valid UTF-8 string. I guess in our case the JSON encoder (lua-cjson) might be the problematic part; I tested other implementations like PHP, Python, and Go, and they seem to encode those bytes correctly.

@seankhliao seankhliao added this to the Unplanned milestone Aug 20, 2022