Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regexp: document and implement invalid UTF-8 treated as U+FFFD #48749

Closed
ComaVN opened this issue Oct 3, 2021 · 7 comments
Closed

regexp: document and implement invalid UTF-8 treated as U+FFFD #48749

ComaVN opened this issue Oct 3, 2021 · 7 comments
Labels
Documentation FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@ComaVN
Copy link

ComaVN commented Oct 3, 2021

What version of Go are you using (go version)?

$ go version
go version go1.16.8 linux/amd64

Does this issue reproduce with the latest release?

It reproduces with the Go Playground, which I assume is the latest version.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/roel/.cache/go-build"
GOENV="/home/roel/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/roel/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/roel/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/snap/go/8408"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/snap/go/8408/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.16.8"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/roel/dev/json-api-golang/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build3447792839=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I tried to validate a user-provided string, which might contain non-utf8 data, using a regex.

See https://play.golang.org/p/j-jsteknY0M for a concise example.

What did you expect to see?

The regex package docs says:

All characters are UTF-8-encoded code points.

So, I expected matching on non-utf8 strings to either:

  • always give an error,
  • always return false

What did you see instead?

For some regexes, it returns false, for some it returns true. It never returns an error.

Particularly, regexes referencing or containing the Unicode REPLACEMENT CHARACTER (\ufffd, �) inside a bracket expression return true (but only if there are other characters in the same bracket). See the playgound example.

I understand that the immediate solution for me is to just check for invalid utf8 first, before regexing. However, the actual behaviour was so unexpected to me, even if it's technically undefined when reading the docs, that it might be a good idea to at least document this.

@robpike
Copy link
Contributor

robpike commented Oct 3, 2021

It may not be clear and it may not be right, but this behavior is how strings work in Go. Invalid UTF-8 gets turned, one byte at a time, into U+FFFD. Here is text from the spec about ranges over strings, which is as good a description as any of how Go handles invalid UTF-8. It's part of the language itself to do it this way:

If the iteration encounters an invalid UTF-8 sequence, the second value will be 0xFFFD, the Unicode replacement character, and the next iteration will advance a single byte in the string.

From the point of view of the matching algorithm, the regexp compiler has already overwritten all the invalid UTF-8 when it built the engine using Go's rules to interpret the string.

There is no way to fix this compatibly other than to provide a flag or other mechanism to avoid this interpretation. Given that the code is all runes inside, though, even that may be infeasible.

Working as intended, and unfortunate.

You are right that your best bet is likely to validate the string ahead of time, or else elide the invalid UTF-8 altogether.

@ComaVN
Copy link
Author

ComaVN commented Oct 3, 2021

Invalid UTF-8 gets turned, one byte at a time, into U+FFFD

This does not explain why a simple regex explicitly looking for U+FFFD does NOT match on invalid utf8. Maybe there's some optimization going on for regexes such as \ufffd or \x{fffd} that turns them into simple string.Contains or similar?

Anyway, thanks for the quick response. I just wanted to save someone some time hairpulling like I did today :)

@mknyszek mknyszek changed the title Unexpected behaviour of regex containing unicode replacement character on non-utf8 strings regexp: unexpected behaviour of regex containing unicode replacement character on non-utf8 strings Oct 4, 2021
@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Oct 4, 2021
@mknyszek mknyszek added this to the Backlog milestone Oct 4, 2021
@mknyszek
Copy link
Contributor

mknyszek commented Oct 4, 2021

Based on @ComaVN and @robpike's conversation, I'm going to close this issue.

@mknyszek mknyszek closed this as completed Oct 4, 2021
@robpike
Copy link
Contributor

robpike commented Oct 5, 2021

I think there is a real bug here for some cases. Reopening.

@robpike robpike reopened this Oct 5, 2021
@robpike robpike assigned robpike and rsc and unassigned robpike Oct 5, 2021
@rsc
Copy link
Contributor

rsc commented Oct 5, 2021

The literals (lines 13-18) are buggy in https://play.golang.org/p/j-jsteknY0M and should be fixed.
The literal search needs to not kick in for U+FFFD.

@rsc
Copy link
Contributor

rsc commented Oct 6, 2021

This is a duplicate of #38006, which I've merged into this issue because this issue had more commentary. That issue was marked as just needing a documentation update.

I started to look into fixing this, but it's fairly complex to get all the cases in all the matching engines.

For the record, the coherent behavior options are:

  1. Invalid UTF-8 does not match any character classes, nor a U+FFFD literal (nor \x{fffd}).
  2. Each byte of invalid UTF-8 is treated identically to a U+FFFD in the input, as a utf8.DecodeRune loop might.

RE2 uses Rule 1. Because it works byte at a time it can also provide \C to match any single byte of input, which matches invalid UTF-8 as well. This provides the nice property that a match for a regexp without \C is guaranteed to be valid UTF-8.

Unfortunately, today Go has an incoherent mix of these two, although mostly Rule 2. This is a deviation from RE2, and it gives up the nice property, but we probably can't correct that at this point. In particular .* already matches entire inputs today, valid UTF-8 or not, and I doubt we can break that. The right solution for Go is probably to adopt Rule 2 officially, fixing the few places that deviate from Rule 2.

@rsc rsc changed the title regexp: unexpected behaviour of regex containing unicode replacement character on non-utf8 strings regexp: document that invalid UTF-8 is treated as U+FFFD (and fix) Oct 6, 2021
@rsc rsc changed the title regexp: document that invalid UTF-8 is treated as U+FFFD (and fix) regexp: document and implement invalid UTF-8 treated as U+FFFD Oct 6, 2021
@gopherbot
Copy link

Change https://golang.org/cl/354569 mentions this issue: regexp: document and implement that invalid UTF-8 bytes are the same as U+FFFD

@rsc rsc removed their assignment Jun 23, 2022
@golang golang locked and limited conversation to collaborators Jun 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Documentation FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

5 participants