regexp: document and implement invalid UTF-8 treated as U+FFFD #48749

ComaVN · 2021-10-03T12:09:42Z

What version of Go are you using (`go version`)?

$ go version
go version go1.16.8 linux/amd64

Does this issue reproduce with the latest release?

It reproduces with the Go Playground, which I assume is the latest version.

What operating system and processor architecture are you using (`go env`)?

go env Output

$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/roel/.cache/go-build"
GOENV="/home/roel/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/roel/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/roel/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/snap/go/8408"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/snap/go/8408/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.16.8"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/roel/dev/json-api-golang/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build3447792839=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I tried to validate a user-provided string, which might contain non-utf8 data, using a regex.

See https://play.golang.org/p/j-jsteknY0M for a concise example.

What did you expect to see?

The regex package docs says:

All characters are UTF-8-encoded code points.

So, I expected matching on non-utf8 strings to either:

always give an error,
always return false

What did you see instead?

For some regexes, it returns false, for some it returns true. It never returns an error.

Particularly, regexes referencing or containing the Unicode REPLACEMENT CHARACTER (\ufffd, �) inside a bracket expression return true (but only if there are other characters in the same bracket). See the playgound example.

I understand that the immediate solution for me is to just check for invalid utf8 first, before regexing. However, the actual behaviour was so unexpected to me, even if it's technically undefined when reading the docs, that it might be a good idea to at least document this.

The text was updated successfully, but these errors were encountered:

robpike · 2021-10-03T14:07:16Z

It may not be clear and it may not be right, but this behavior is how strings work in Go. Invalid UTF-8 gets turned, one byte at a time, into U+FFFD. Here is text from the spec about ranges over strings, which is as good a description as any of how Go handles invalid UTF-8. It's part of the language itself to do it this way:

If the iteration encounters an invalid UTF-8 sequence, the second value will be 0xFFFD, the Unicode replacement character, and the next iteration will advance a single byte in the string.

From the point of view of the matching algorithm, the regexp compiler has already overwritten all the invalid UTF-8 when it built the engine using Go's rules to interpret the string.

There is no way to fix this compatibly other than to provide a flag or other mechanism to avoid this interpretation. Given that the code is all runes inside, though, even that may be infeasible.

Working as intended, and unfortunate.

You are right that your best bet is likely to validate the string ahead of time, or else elide the invalid UTF-8 altogether.

ComaVN · 2021-10-03T14:35:34Z

Invalid UTF-8 gets turned, one byte at a time, into U+FFFD

This does not explain why a simple regex explicitly looking for U+FFFD does NOT match on invalid utf8. Maybe there's some optimization going on for regexes such as \ufffd or \x{fffd} that turns them into simple string.Contains or similar?

Anyway, thanks for the quick response. I just wanted to save someone some time hairpulling like I did today :)

mknyszek · 2021-10-04T17:04:44Z

Based on @ComaVN and @robpike's conversation, I'm going to close this issue.

robpike · 2021-10-05T00:03:35Z

I think there is a real bug here for some cases. Reopening.

rsc · 2021-10-05T14:39:39Z

The literals (lines 13-18) are buggy in https://play.golang.org/p/j-jsteknY0M and should be fixed.
The literal search needs to not kick in for U+FFFD.

rsc · 2021-10-06T18:16:47Z

This is a duplicate of #38006, which I've merged into this issue because this issue had more commentary. That issue was marked as just needing a documentation update.

I started to look into fixing this, but it's fairly complex to get all the cases in all the matching engines.

For the record, the coherent behavior options are:

Invalid UTF-8 does not match any character classes, nor a U+FFFD literal (nor \x{fffd}).
Each byte of invalid UTF-8 is treated identically to a U+FFFD in the input, as a utf8.DecodeRune loop might.

RE2 uses Rule 1. Because it works byte at a time it can also provide \C to match any single byte of input, which matches invalid UTF-8 as well. This provides the nice property that a match for a regexp without \C is guaranteed to be valid UTF-8.

Unfortunately, today Go has an incoherent mix of these two, although mostly Rule 2. This is a deviation from RE2, and it gives up the nice property, but we probably can't correct that at this point. In particular .* already matches entire inputs today, valid UTF-8 or not, and I doubt we can break that. The right solution for Go is probably to adopt Rule 2 officially, fixing the few places that deviate from Rule 2.

gopherbot · 2021-10-07T14:01:56Z

Change https://golang.org/cl/354569 mentions this issue: regexp: document and implement that invalid UTF-8 bytes are the same as U+FFFD

mknyszek changed the title ~~Unexpected behaviour of regex containing unicode replacement character on non-utf8 strings~~ regexp: unexpected behaviour of regex containing unicode replacement character on non-utf8 strings Oct 4, 2021

mknyszek added the NeedsInvestigation label Oct 4, 2021

mknyszek added this to the Backlog milestone Oct 4, 2021

mknyszek closed this as completed Oct 4, 2021

robpike reopened this Oct 5, 2021

robpike assigned robpike and rsc and unassigned robpike Oct 5, 2021

rsc mentioned this issue Oct 6, 2021

regexp: behavior on invalid UTF-8 input is undocumented and inconsistent #38006

Closed

rsc changed the title ~~regexp: unexpected behaviour of regex containing unicode replacement character on non-utf8 strings~~ regexp: document that invalid UTF-8 is treated as U+FFFD (and fix) Oct 6, 2021

rsc changed the title ~~regexp: document that invalid UTF-8 is treated as U+FFFD (and fix)~~ regexp: document and implement invalid UTF-8 treated as U+FFFD Oct 6, 2021

gopherbot added the Documentation label Oct 6, 2021

gopherbot closed this as completed in 702e337 Oct 11, 2021

rsc removed their assignment Jun 23, 2022

golang locked and limited conversation to collaborators Jun 23, 2023

gopherbot added the FrozenDueToAge label Jun 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

regexp: document and implement invalid UTF-8 treated as U+FFFD #48749

regexp: document and implement invalid UTF-8 treated as U+FFFD #48749

ComaVN commented Oct 3, 2021 •

edited

Loading

robpike commented Oct 3, 2021

ComaVN commented Oct 3, 2021

mknyszek commented Oct 4, 2021

robpike commented Oct 5, 2021

rsc commented Oct 5, 2021

rsc commented Oct 6, 2021

gopherbot commented Oct 7, 2021

regexp: document and implement invalid UTF-8 treated as U+FFFD #48749

regexp: document and implement invalid UTF-8 treated as U+FFFD #48749

Comments

ComaVN commented Oct 3, 2021 • edited Loading

What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (go env)?

What did you do?

What did you expect to see?

What did you see instead?

robpike commented Oct 3, 2021

ComaVN commented Oct 3, 2021

mknyszek commented Oct 4, 2021

robpike commented Oct 5, 2021

rsc commented Oct 5, 2021

rsc commented Oct 6, 2021

gopherbot commented Oct 7, 2021

ComaVN commented Oct 3, 2021 •

edited

Loading

What version of Go are you using (`go version`)?

What operating system and processor architecture are you using (`go env`)?