Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fmt: %q quotes out-of-range codepoints to strings containing invalid runes #51526

Closed
bcmills opened this issue Mar 7, 2022 · 6 comments
Closed
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.

Comments

@bcmills
Copy link
Contributor

bcmills commented Mar 7, 2022

What version of Go are you using (go version)?

$ go version
devel go1.19-d1820f748f Mon Mar 7 14:07:18 2022 +0000

Does this issue reproduce with the latest release?

Yes.

What did you do?

Called fmt.Sprintf("%q", x) with various int32 values that are not valid as UTF-8 runes (https://go.dev/play/p/wbvbvuKIWCC?v=gotip).

	for _, r := range []rune{
		-1,
		surrogateMin,
		unicode.ReplacementChar,
		unicode.MaxRune + 1,
	} {
		s := fmt.Sprintf("%q", r)
		fmt.Printf("%s (%x)\n", s, s)
	}

What did you expect to see?

The documentation for the fmt package says that %q formats the value as “a single-quoted character literal safely escaped with Go syntax.”

Per https://go.dev/ref/spec#Rune_literals, “The escapes \u and \U represent Unicode code points so within them some values are illegal, in particular those above 0x10FFFF and surrogate halves.

So I expected fmt.Sprintf to emit either an invalid \u or \U literal (since these values are too large for \x or \0 escapes), or an explicit format-error string of some sort (along the lines of https://pkg.go.dev/fmt#hdr-Format_errors).

At the very least, I expected to see some explicit description of the invalid-range behavior in the fmt package docs!

What did you see instead?

Invalid codepoints uniformly (and lossily) quoted to the Unicode replacement character:

devel go1.19-d1820f748f Mon Mar 7 14:07:18 2022 +0000
'�' (27efbfbd27)
'�' (27efbfbd27)
'�' (27efbfbd27)
'�' (27efbfbd27)

(CC @robpike)

@gopherbot
Copy link

Change https://go.dev/cl/390424 mentions this issue: internal/fuzz: fix encoding for out-of-range ints and runes

@rsc
Copy link
Contributor

rsc commented Mar 7, 2022

I think this is happening in strconv.QuoteRune not fmt. It's also what the Unicode replacement character is for. That rune is the explicit error indicator.

@bcmills
Copy link
Contributor Author

bcmills commented Mar 7, 2022

I think this is happening in strconv.QuoteRune not fmt.

Seems plausible, but fmt could make the behavior clearer (because it already has a format for recovered errors, whereas strconv does not).

It's also what the Unicode replacement character is for. That rune is the explicit error indicator.

Maybe? Again, at the very least I would expect to see an explicit description in the documentation: the word “replacement” doesn't even appear in the fmt docs today, and “replaced” only appears in the context of reflect.Value and the * flag.

@robpike
Copy link
Contributor

robpike commented Mar 7, 2022

I was about to write what @rsc said. I don't think fmt should behave differently from strconv in this regard. But perhaps documentation could be improved.

This is not what I would call a "recovered error", or even a formatting error (such as use of an invalid verb or missing argument). This is just a question of how to represent invalid UTF-8, and there is lots of consistent precedent for that.

@gopherbot
Copy link

Change https://go.dev/cl/390436 mentions this issue: fmt: document use of Unicode replacement character in %q

@seankhliao seankhliao added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Mar 8, 2022
@rsc rsc added the NeedsFix The path to resolution is known, but the work has not been done. label Mar 8, 2022
@rsc rsc self-assigned this Mar 8, 2022
@rsc rsc removed the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Mar 8, 2022
gopherbot pushed a commit that referenced this issue Mar 8, 2022
Also switch float64 NaN encoding to use hexadecimal, and accept
hexadecimal encoding for all other integer types too. (That gives us
the flexibility to change the encodings in either direction in the
future without breaking earlier Go versions.)

Out-of-range runes encoded using "%q" were previously replaced with
the Unicode replacement charecter, losing their values.

Out-of-range ints and uints on 32-bit platforms were previously
rejected. Now they are wrapped instead: an “interesting” case with a
large int or uint found on a 64-bit platform likely remains
interesting on a 32-bit platform, even if the specific values differ.

To verify the above changes, I have made TestMarshalUnmarshal accept
(and check for) arbitrary differences between input and output, and
added tests cases that include values in valid but non-canonical
encodings.

I have also added round-trip fuzz tests in the opposite direction for
most of the types affected by this change, verifying that a marshaled
value unmarshals to the same bitwise value.

Updates #51258
Updates #51526
Fixes #51528

Change-Id: I7727a9d0582d81be0d954529545678a4374e88ed
Reviewed-on: https://go-review.googlesource.com/c/go/+/390424
Trust: Bryan Mills <bcmills@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
Reviewed-by: Roland Shoemaker <roland@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
@gopherbot
Copy link

Change https://go.dev/cl/390816 mentions this issue: [release-branch.go1.18] internal/fuzz: fix encoding for out-of-range ints and runes

gopherbot pushed a commit that referenced this issue Mar 9, 2022
…ints and runes

Also switch float64 NaN encoding to use hexadecimal, and accept
hexadecimal encoding for all other integer types too. (That gives us
the flexibility to change the encodings in either direction in the
future without breaking earlier Go versions.)

Out-of-range runes encoded using "%q" were previously replaced with
the Unicode replacement charecter, losing their values.

Out-of-range ints and uints on 32-bit platforms were previously
rejected. Now they are wrapped instead: an “interesting” case with a
large int or uint found on a 64-bit platform likely remains
interesting on a 32-bit platform, even if the specific values differ.

To verify the above changes, I have made TestMarshalUnmarshal accept
(and check for) arbitrary differences between input and output, and
added tests cases that include values in valid but non-canonical
encodings.

I have also added round-trip fuzz tests in the opposite direction for
most of the types affected by this change, verifying that a marshaled
value unmarshals to the same bitwise value.

Updates #51258
Updates #51526
Fixes #51528

Change-Id: I7727a9d0582d81be0d954529545678a4374e88ed
Reviewed-on: https://go-review.googlesource.com/c/go/+/390424
Trust: Bryan Mills <bcmills@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
Reviewed-by: Roland Shoemaker <roland@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
(cherry picked from commit 7419bb3)
Reviewed-on: https://go-review.googlesource.com/c/go/+/390816
Trust: Dmitri Shuralyov <dmitshur@golang.org>
Run-TryBot: Dmitri Shuralyov <dmitshur@golang.org>
Reviewed-by: Bryan Mills <bcmills@google.com>
@rsc rsc removed their assignment Jun 22, 2022
@golang golang locked and limited conversation to collaborators Jun 22, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

5 participants