Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regexp: Expand rules are confusing #40329

Open
a-h opened this issue Jul 21, 2020 · 6 comments
Open

regexp: Expand rules are confusing #40329

a-h opened this issue Jul 21, 2020 · 6 comments
Labels
Documentation help wanted NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@a-h
Copy link
Contributor

a-h commented Jul 21, 2020

What version of Go are you using (go version)?

$ go version
go version go1.14.4 darwin/amd64

Also tested on Go docker containers: 1.2, 1.10

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/adrian/Library/Caches/go-build"
GOENV="/Users/adrian/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/adrian/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/nix/store/6f0x9mycyrcc3gsfm36bci0b4ls1q8zc-go-1.14.4/share/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/nix/store/6f0x9mycyrcc3gsfm36bci0b4ls1q8zc-go-1.14.4/share/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/zz/yrqp3xx513sf6drry24cw8r40000gp/T/go-build152779906=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

https://play.golang.org/p/WYmtk1Gi-T_4
https://play.golang.org/p/XAdw3-cgx0z

package main

import (
	"fmt"
	"regexp"
)

func main() {
	re := regexp.MustCompile(`a(.)c`)
	fmt.Println(re.ReplaceAllString("abc", "c$1a"))
	// expected "cba", got "c"
}

What did you expect to see?

Expected that the first capture submatch (.) resulted in the character b, so the result should have been cba, according to the docs at https://golang.org/pkg/regexp/#Regexp.ReplaceAllString

ReplaceAllString returns a copy of src, replacing matches of the Regexp with the replacement string repl. Inside repl, $ signs are interpreted as in Expand, so for instance $1 represents the text of the first submatch.

What did you see instead?

I just got the character c, I expected to get cba, because the replacement had been interpreted differently.

When I dug into it to raise this report, I found this in the documentation on the Expand function:

In the $name form, name is taken to be as long as possible: $1x is equivalent to ${1x}, not ${1}x, and, $10 is equivalent to ${10}, not ${1}0.

I didn't realise I was using the name form, I thought I was using the number form. This made it clearer that the regexp group was being interpreted as ${1a}, not $1, followed by a.

That means, I would need to use "c${1}a" as my replacement, which is quite unusual because I haven't seen another programming language that works this way, e.g. Node.js will return cba.

"abc".replace(/a(.)c/, "c$1a");

sed also works as I expected (although it uses \1 as the reference instead of $1).

echo "abc" | sed -E 's/a(.)c/c\1a/g'

It was surprising behaviour to me, and I spent a while debugging the program I was working on before I realised there was something else going on.

Ideally, the Go implementation would change to match other programming languages and text editors, i.e. after a dollar, if the first character is [0-9], then it's a numeric reference up until the next non-numeric character, e.g.

$1 = $1
$1a = $1, followed by a
${1a} = ${1a}
$10 = $10
$1_ = $1, followed by _
a$1c = a, $1, c
etc.

However, if that's not possible, then a documentation change might help. $1 is brittle, it only works if your reference to the capture group is the whole replacement ($1), or is followed by a space character ($1 a), at the end of the replacement (a$1) etc. So the documentation should use the long form by default.

@toothrot toothrot added Documentation NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Jul 21, 2020
@toothrot toothrot added this to the Backlog milestone Jul 21, 2020
@toothrot
Copy link
Contributor

/cc @rsc @matloob

I labeled this as Documentation, as I am not sure how we could change this without breaking real-world programs. I have also been caught by this before.

@ianlancetaylor ianlancetaylor added help wanted NeedsFix The path to resolution is known, but the work has not been done. labels Jul 21, 2020
@ianlancetaylor
Copy link
Contributor

Thanks. This isn't something we can change today, so it has to be a documentation fix.

@gopherbot gopherbot removed the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jul 21, 2020
@gopherbot
Copy link

Change https://golang.org/cl/244625 mentions this issue: doc: improve re.ReplaceAllString documentation related to Expand part

@a-h
Copy link
Contributor Author

a-h commented Jul 24, 2020

I think the simplest change is to nudge people to use the more explicit syntax.

A lot of the time, people don't look at the examples in the docs, they just look at the text that pops up in their development environment, so how about this?

ReplaceAllString returns a copy of src, replacing matches of the Regexp with the replacement string repl. Inside repl, $ signs are interpreted as in Expand, for instance, ${1} represents the text of the first submatch, and ${xyz} represents the text of the submatch named xyz.

The problem with skipping the braces when referring to submatches is hinted at in the example code, but it's a bit too subtle, maybe adding a comment there would bring it to the attention. I don't think any more than that is needed.

// $1W is not a named submatch, so the replacement is with an empty slice.
fmt.Printf("%s\n", re.ReplaceAll([]byte("-ab-axxb-"), []byte("$1W")))

@eduardbme
Copy link
Contributor

Rebased PR to the current master, can we review one more time and land it ?

@gopherbot
Copy link

Change https://go.dev/cl/446836 mentions this issue: regexp: improve ReplaceAllString documentation related to Expand part

eduardbme added a commit to eduardbme/go that referenced this issue Jul 26, 2023
eduardbme added a commit to eduardbme/go that referenced this issue Jul 28, 2023
eduardbme added a commit to eduardbme/go that referenced this issue Aug 1, 2023
gopherbot pushed a commit that referenced this issue Aug 7, 2023
…Expand part

For #40329

Change-Id: Ie0cb337545ce39cd169129227c45f7d2eaebc898
GitHub-Last-Rev: c017d4c
GitHub-Pull-Request: #56507
Reviewed-on: https://go-review.googlesource.com/c/go/+/446836
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Ian Lance Taylor <iant@google.com>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
Auto-Submit: Ian Lance Taylor <iant@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Documentation help wanted NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

5 participants