Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

encoding/json: document Valid allows invalid utf8 #58517

Open
LukeShu opened this issue Feb 14, 2023 · 4 comments
Open

encoding/json: document Valid allows invalid utf8 #58517

LukeShu opened this issue Feb 14, 2023 · 4 comments
Labels
Documentation NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made.
Milestone

Comments

@LukeShu
Copy link

LukeShu commented Feb 14, 2023

What version of Go are you using (go version)?

$ go version
go version go1.20 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/lukeshu/.cache/go-build"
GOENV="/home/lukeshu/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/lukeshu/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/lukeshu/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/lib/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.20"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/run/user/1000/tmpdir/go-build2160073961=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I passed a binary garbage wrapped in " quotes to json.Valid.

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	jsonStr := "\"0\x85\xcd\xc0\xf3\xcb\xc1\xb3\xf2\xf5\xa4\xc1\xd40\xba\xe9\""
	fmt.Println(json.Valid([]byte(jsonStr)))
}

https://go.dev/play/p/rrtmrEL3Ipd

What did you expect to see?

As JSON is specified to be "a sequence of Unicode code points" (ECMA-404) or "encoded in UTF-8, UTF-16, or UTF-32" (RFC 7159) I would expect that a JSON document containing bytes that cannot be interpreted as Unicode codepoints to not be considered valid.

What did you see instead?

It considered the document to be valid, even though it contains bytes that cannot be interpreted as Unicode codepoints.

@artyom
Copy link
Member

artyom commented Feb 14, 2023

@LukeShu note that json.Unmarshal documentation mentions this:

When unmarshaling quoted strings, invalid UTF-8 or invalid UTF-16 surrogate pairs are not treated as an error. Instead, they are replaced by the Unicode replacement character U+FFFD.

As json.Unmarshal and json.Valid share underlying code, your example looks like an expected behavior.

Perhaps there's a room for improving the documentation?

@seankhliao seankhliao changed the title encoding/json: Considers binary garbage to be valid encoding/json: document Valid allows invalid utf8 Feb 14, 2023
@LukeShu
Copy link
Author

LukeShu commented Feb 14, 2023

I understand that Unmarshal and Valid share underlying code, but Unmarshal being permissive of invalid input doesn't mean that Valid should identify it as valid (at least without its own note in the documentation).

And I discovered this because of inconsistency between the various functions, when I expected that even if they have quirks they'd at least be consistent because of sharing the underlying implementation. I'd discovered that json.Unmarshal(any)json.MarshalIndent() would convert the binary garbage to U+FFFD, while json.Indent() and json.Compact() would pass it through unchanged. I figured that differing behavior for invalid JSON was reasonable, but then was surprised to find that json.Valid() identified it as valid. (And that differing behavior, while surprising, I couldn't really say was a bug; while json.Valid() identifying it as valid is a clear bug to me.)

@ianlancetaylor
Copy link
Contributor

CC @dsnet @mvdan

@prattmic prattmic added this to the Backlog milestone Feb 14, 2023
@prattmic prattmic added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Feb 14, 2023
@dsnet
Copy link
Member

dsnet commented Feb 21, 2023

It's unfortunate that the json package allows invalid UTF-8 since RFC 8259, section 8.1 clearly states that JSON must be UTF-8.

Given that we already verbally promise handling of invalid UTF-8 in Unmarshal, we should also document how it operates in Valid.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Documentation NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made.
Projects
None yet
Development

No branches or pull requests

6 participants