Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regexp: LiteralPrefix behaving incorrectly #30425

Open
shishkander opened this issue Feb 27, 2019 · 14 comments
Open

regexp: LiteralPrefix behaving incorrectly #30425

shishkander opened this issue Feb 27, 2019 · 14 comments
Labels
Documentation NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@shishkander
Copy link

What version of Go are you using (go version)?

$ go version
go version go1.12 linux/amd64

Does this issue reproduce with the latest release?

Yeah, go1.12 has just been released.

What did you do?

package main
import (
	"fmt"
	"regexp"
)
func main() {
	l, _ := regexp.MustCompile("^prefix\\d+.$").LiteralPrefix()
	fmt.Printf("literalPrefix: %q", l)
}

See also on Play or even more

What did you expect to see?

literalPrefix: "prefix"

What did you see instead?

literalPrefix: ""
@agnivade
Copy link
Contributor

/cc @rsc @matloob

@cuonglm
Copy link
Member

cuonglm commented Feb 27, 2019

Maybe just adding document, because prefix is applied for unanchored matches only

prefix string // required prefix in unanchored matches

@ianlancetaylor
Copy link
Contributor

Yes, this is a documentation issue. The code is behaving as intended.

@ianlancetaylor ianlancetaylor added Documentation NeedsFix The path to resolution is known, but the work has not been done. labels Feb 27, 2019
@ianlancetaylor ianlancetaylor added this to the Go1.13 milestone Feb 27, 2019
@antong
Copy link
Contributor

antong commented Feb 27, 2019

Do note that if the expression is a bit more anchored, then LiteralPrefix() applies again:

func main() {
	fmt.Println(regexp.MustCompile("^foo").LiteralPrefix())
	fmt.Println(regexp.MustCompile("^foo$").LiteralPrefix())
	fmt.Println(regexp.MustCompile("^foo.bar$").LiteralPrefix())
	// For go version 1.6 to 1.12
	// Output:
	//  false
	// foo true
	// foo false
}

The behaviour has changed through versions. The relevant versions I found:

go version go1.2.2 linux/amd64
 false
 false
 false

go version go1.3.3 linux/amd64
 false
foo false
foo false

go version go1.6.4 linux/amd64
 false
foo true
foo false

@cuonglm
Copy link
Member

cuonglm commented Feb 27, 2019

@antong go 1.6 to 1.12 seems to be correct.

Only ^foo$ can be turned to literal match, ^foo only matches at start, . in ^foo.bar$ can be anything.

@antong
Copy link
Contributor

antong commented Feb 28, 2019

@Gnouc ,Well, ^foo$ also only matches at start and is also "anchored". I'm sure there is some reasoning behind this, but I don't really understand it based on the doc and the info in this thread.

I found only one place in the standard library where LiteralPrefix() is used, in suffixarray.FindAllIndex(). To me it looks like suffixarray assumes that ^foo$ has no literal prefix (as it was in go1.2).

https://play.golang.org/p/j27u5DdE1oS :

sufixarray.FindAllIndex(re) in "banana":
"n."   found at: [[2 4] [4 6]]
"^n."  found at: []
"^n.$" found at: [[4 6]], should be []?
"a"    found at: [[1 2] [3 4] [5 6]]
"^a"   found at: []
"^a$"  found at: [[1 2] [3 4] [5 6]], should be []?

@cuonglm
Copy link
Member

cuonglm commented Feb 28, 2019

@antong: I mean ^foo$ is the same as foo., ^foo isn't.

@antong
Copy link
Contributor

antong commented Mar 1, 2019

I must be missing something obvious, but why would "^foo$" deserve to return a prefix but not "^foo" ? Any match to either of these needs to have the prefix "foo" (discounting multiline flag etc.).

regexp/syntax also has a variant Prog.Prefix() that (in go1.12 and all previous go versions) behaves like LiteralPrefix() did in go1.2 (https://play.golang.org/p/pUzVZIhot8s):

"foo"   : prefix="foo"   complete=true
"foo$"  : prefix="foo"   complete=false
"^foo"  : prefix=""      complete=false
"^foo$" : prefix=""      complete=false

Now I must admit I've never used LiteralPrefix() myself, but to me it seems the way it worked in go1.2 and the way it has always worked in regexp/syntax would be more useful to a user. Take for instance the only user in the standard lib, suffixarray. It uses the prefix to locate potential starting points for a regexp match in the FindAllIndex() function. That makes it useful that regexps anchored to the beginning of the text ("^foo") do not return a prefix. But if a regexp that is anchored in both ends ("^foo$") again does return a prefix, that makes it not useful for that purpose. I'd even say that suffixarray has a bug in this regard (https://play.golang.org/p/j27u5DdE1oS).

So I would like to understand why "^foo$" returns a literal prefix even though it is anchored. I believe I do understand how it happens: There is an optimization that takes a separate path for regexps that can be compiled for "one-pass" execution, and the prefix is calculated differently for these. But to me it sounds like an internal implementation detail and not an external feature of a regular expression that should have relevance on the output.

For reference, here are LiteralPrefix() results for different Go versions (https://play.golang.org/p/KmN93nSPy6A):

go1.2
"foo"   : prefix="foo"   complete=true
"foo$"  : prefix="foo"   complete=false
"^foo"  : prefix=""      complete=false
"^foo$" : prefix=""      complete=false

go1.3 - 1.5
"foo"   : prefix="foo"   complete=true
"foo$"  : prefix="foo"   complete=false
"^foo"  : prefix=""      complete=false
"^foo$" : prefix="foo"   complete=false

go1.6 - 1.12
"foo"   : prefix="foo"   complete=true
"foo$"  : prefix="foo"   complete=false
"^foo"  : prefix=""      complete=false
"^foo$" : prefix="foo"   complete=true

@griesemer
Copy link
Contributor

cc: @rsc for insights on the behavior change of LiteralPrefix.

@Yexo
Copy link

Yexo commented Apr 11, 2019

Some more testcases here: https://play.golang.org/p/e0IJhTBidQe
It's really surprising that all of "a.*b$", "a.b$" and "^a.b$" return "a" as prefix, but "^a.*b$" does not.

@andybons andybons modified the milestones: Go1.13, Go1.14 Jul 8, 2019
@eliben
Copy link
Member

eliben commented Aug 6, 2019

I spent some time looking at this, and I believe there are two separate issues we should be discussing here:

  1. The behavior of LiteralPrefix in general. Does it make sense that it returns an empty string for ^foo? I’m not sure, since “foo” is definitely a prefix of every match, so this behavior is surprising
  2. The change in behavior that occurred in 1.3 w.r.t. ^foo$

This issue is about (2), so I’ll focus on this here. Once we resolve the issue we should discuss (1) as well, and I can open a separate issue about that (it may be too late to change behavior, but at least the docs could be changed).

The change in 1.3 seems to be a regression that wasn’t caught for a long time. If we don’t see a prefix for ^foo, we shouldn’t see one for ^foo$ either. The fact that we do is a bug, IMHO, and it also manifests in #30511

In 1.3, the one-pass algorithm was added for regexps, and a new path appeared for the prefix computation (which LiteralPrefix ultimately returns). In today’s code, among these options:

foo
^foo
foo$
^foo$

The first three go through Prog.Prefix - they don’t trigger one-pass. Prog.Prefix rejects regexps that don’t begin with a InstRune, so it rejects everything starting with ^ (this could be the root cause of (1)).

The last one triggers one-pass (because compileOnePass expects a syntax.InstEmptyWidth at the start, as well as in every instruction leading to InstMatch, which only ^foo$ fulfills). To get its prefix, onePassPrefix is called. onePassPrefix is OK with ^foo$ - it doesn’t expect the program to start only with InstRune.

There is a deviation in the implementations of Prog.Prefix and onePassPrefix, and this should be fixed. We should change onePassPrefix to behave in the same way as Prog.Prefix w.r.t. patterns that begin with ^. I’ll be happy to provide a fix, if this sounds reasonable (as well as many more tests for prefixes which are currently very scarcely tested).

@eliben
Copy link
Member

eliben commented Aug 6, 2019

I just noticed that in #21463 (comment), @rsc says:

"It looks like the literal prefix matcher doesn't kick in for anchored patterns, and we should fix that. (See the difference in std between ^xxy and xx in your benchmark.)"

So it sounds like an answer for (1).

@rsc
Copy link
Contributor

rsc commented Sep 26, 2019

This is tricky, because part of the idea behind the complete bool in LiteralPrefix was that if you have

prefix, complete := re.LiteralPrefix()

then if complete is true then you should be able to substitute strings.Index(prefix) for a regular expression search entirely. But if anchors are involved then that's not safe. In the behaviors that @antong helpfully worked out, Go 1.2 is doing the right thing. Go 1.3 through Go 1.5 are suspicious but at least not wrong. Go 1.6 and later are wrong or at least very misleading when they return complete=true for any anchored match.

I think the right thing to do here, in terms of the safest result, is to go back to the Go 1.2 behavior and return empty string, false if re begins with any empty-width assertions (\b, ^, and so on). There is no safe way to use LiteralPrefix if it returns an actual prefix there. Any optimization involving search for that prefix out-of-band could make the later search do the wrong with with the assertions. This is different from what I said in 2017. Maybe I didn't think through it completely then, or maybe I haven't thought through it completely today. :-)

@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
@gopherbot
Copy link

Change https://golang.org/cl/377294 mentions this issue: regexp: allow prefix string anchored at beginning

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Documentation NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests