
strconv: non-UTF8 bytes are being unescaped in strconv.Unquote #51094

Open
fffonion opened this issue Feb 9, 2022 · 9 comments
Labels
Documentation NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@fffonion

fffonion commented Feb 9, 2022

What version of Go are you using (go version)?

$ go version
go version go1.17.2 darwin/amd64

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
```
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/fffonion/Library/Caches/go-build"
GOENV="/Users/fffonion/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/fffonion/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/fffonion/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/Cellar/go/1.17.2/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.17.2/libexec/pkg/tool/darwin_amd64"
GOVCS=""
GOVERSION="go1.17.2"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/fffonion/bbb/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug
```

What did you do?

```go
package main

import (
	"fmt"
	"strconv"
)

func main() {
	// the quoted string `"<raw 0x93>\u0001"`: a stray non-UTF8 byte
	// followed by a \u escape sequence
	in := []byte{34, 147, 92, 117, 48, 48, 48, 49, 34}

	str0, _ := strconv.Unquote(string(in))
	fmt.Println([]byte(str0))
}
```

It seems that neither UnquoteChar (https://github.com/golang/go/blob/go1.17.6/src/strconv/quote.go#L267) nor its caller Unquote (https://github.com/golang/go/blob/go1.17.6/src/strconv/quote.go#L267) handles the error from utf8.DecodeRuneInString(s), so every invalid single-byte sequence is converted to utf8.RuneError and stored in the output.

In Python for example, such cases are handled properly:

```python
a = bytes([147, 92, 117, 48, 48, 48, 49])
a.decode('raw_unicode_escape')
# '\x93\x01'
```

What did you expect to see?

[147 1]

What did you see instead?

[239 191 189 1]

@robpike
Contributor

robpike commented Feb 9, 2022

This is how strings and UTF-8 work in Go. Every byte of illegal UTF-8 is converted to U+FFFD.

It's actually not all that clearly documented. Although we went to great pains to do this consistently across all the library routines and the range statement, and therefore by inclusion strconv.Unquote, the best documentation for this behavior is here: https://pkg.go.dev/unicode/utf8#DecodeRune. It also appears in https://go.dev/ref/spec#For_statements in the section about range clauses.

It should probably be made more visible.

But as far as this behavior is concerned, it is working as intended. I'll leave the issue open to see if anyone has ideas about the best way to make this easier to discover.
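The range behavior mentioned above can be seen directly. A minimal check (not from the thread) over a string containing a stray 0x93 byte:

```go
package main

import "fmt"

func main() {
	// Ranging over a string replaces each byte of invalid UTF-8
	// with U+FFFD, the Unicode replacement character.
	for i, r := range "\x93ok" {
		fmt.Printf("%d: %U\n", i, r)
	}
}
```

This prints U+FFFD for the invalid byte at index 0, then the valid runes 'o' and 'k'.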

@fffonion
Author

fffonion commented Feb 9, 2022

@robpike Thanks for the reply! I came across this issue when trying to json-unmarshal a payload with non-printable bytes inside. Unquote is needed to unescape Unicode sequences like "\u0022", but single-byte characters (like 0x99) are converted to utf8.RuneError. I have verified that if those single bytes are escaped to something like "\x99" before passing to json.Unmarshal, then strconv.Unquote is happy. So we are going to write our own modified version of strconv.Unquote that simply ignores such cases.

@fffonion
Author

fffonion commented Feb 9, 2022

@robpike I noticed the case above might point to the json package instead (https://github.com/golang/go/blob/go1.17.6/src/encoding/json/decode.go#L1306), where a similar approach is taken as in strconv.Unquote. However, per the JSON RFC, leaving characters like 0x93 unescaped, as shown in the example, is allowed.
Does it make sense to make the json package specifically allow this use case? Technically it could just be:

```go
		default:
			rr, size := utf8.DecodeRune(s[r:])
			if rr == utf8.RuneError {
				// preserve the invalid byte as is
				b[w] = s[r]
				r++
				w++
			} else {
				r += size
				w += utf8.EncodeRune(b[w:], rr)
			}
```

@robpike
Contributor

robpike commented Feb 9, 2022

Then the error condition would need to be

size == 1 && rr == utf8.RuneError

and that is certainly possible but I am not in a position to say whether it's the right thing to do. It would likely break some tests, but maybe not too many. Non-compliance with the RFC is arguably a reason to make such a change but this behavior would be unique within the Go environment and therefore would require discussion.

@fffonion fffonion changed the title strconv: non-UTF8 bytes are being unescaped in strconv.Unquote strconv/json: non-UTF8 bytes are being unescaped in strconv.Unquote/json.unquote Feb 9, 2022
@mvdan
Member

mvdan commented Feb 9, 2022

When it comes to encoding/json, I'm afraid the current behavior is well documented:

When unmarshaling quoted strings, invalid UTF-8 or invalid UTF-16 surrogate pairs are not treated as an error. Instead, they are replaced by the Unicode replacement character U+FFFD.

Changing the default behavior would surely break programs out there in subtle ways. The general understanding with the Go standard library's compatibility guarantee is that we may be able to change undocumented behavior, as long as very few Go programs out there break because of the change. But changing documented behavior, especially if it could break programs, is almost always a breaking change we cannot do.

Note that the json encoder does allow disabling the replacement of invalid UTF-8 (see below), so one could imagine a proposal to also opt out of this feature in the decoder.

String values encode as JSON strings coerced to valid UTF-8, replacing invalid bytes with the Unicode replacement rune. So that the JSON will be safe to embed inside HTML <script> tags, the string is encoded using HTMLEscape, which replaces "<", ">", "&", U+2028, and U+2029 with "\u003c", "\u003e", "\u0026", "\u2028", and "\u2029". This replacement can be disabled when using an Encoder, by calling SetEscapeHTML(false).

@fffonion
Author

fffonion commented Feb 9, 2022

@mvdan Thanks for pointing out the docs. I do understand that changing the behaviour is a concerning move. Would it be worth, however, having a similar switch that disables this behaviour, as SetEscapeHTML does? I have tested with json libraries like json-iterator/go and bytedance/sonic, and they support such input.

@mvdan
Member

mvdan commented Feb 9, 2022

Is it worth to however have a similar switch that disables this behaviour, as SetEscapeHTML does?

That is what I'm saying could be proposed. I would start a new issue, though, because it's entirely separate from strconv.

@seankhliao seankhliao changed the title strconv/json: non-UTF8 bytes are being unescaped in strconv.Unquote/json.unquote strconv: non-UTF8 bytes are being unescaped in strconv.Unquote Feb 9, 2022
@seankhliao seankhliao added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Feb 9, 2022
@dsnet
Member

dsnet commented Feb 9, 2022

@fffonion There are at least three possible behaviors for json:

  1. Silently convert invalid UTF-8 to the Unicode replacement character,
  2. Silently preserve invalid UTF-8 as is, and
  3. Reject invalid UTF-8.

The current behavior of json is behavior 1. Both behaviors 1 and 2 are non-compliant with RFC 8259, which states in section 8.1 that the text "MUST be encoded using UTF-8". If we add an option, it would be to opt into behavior 3, but that doesn't seem to be what you want (based on the original issue).

I'm not sure we should be adding options to json that allow it to get further from standard behavior. Rather, we should be doing the opposite (i.e., moving closer to compliance).

@fffonion
Author

fffonion commented Feb 9, 2022

@dsnet Indeed, reading through the UTF-8 spec, any code point larger than 0x7F must be encoded in at least two bytes, so a single byte \x99 is not a valid UTF-8 string. I guess in our case the JSON encoder (lua-cjson) might be the problematic part; I tested other implementations like PHP, Python, and Go, and they seem to encode those bytes correctly.

@seankhliao seankhliao added this to the Unplanned milestone Aug 20, 2022