Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regexp: case-insensitive MatchString performance #13288

Closed
egonelbre opened this issue Nov 17, 2015 · 15 comments
Closed

regexp: case-insensitive MatchString performance #13288

egonelbre opened this issue Nov 17, 2015 · 15 comments

Comments

@egonelbre
Copy link
Contributor

I was investigating why https://github.com/dimroc/etl-language-comparison/blob/517507b033dbb938dd1e83401914cbd4dd9a79bc/golang/search.go#L160 ends up significantly slower with regular expression.

strings.Contains(strings.ToLower(message), "knicks") // 7.8s
// versus
rx := regexp.MustCompile("(?i)knicks") // won't be done multiple times
rx.MatchString(message) // 21.7s

Profiling the etl example lead me to:

152.84s of 158.97s total (96.14%)
Dropped 167 nodes (cum <= 0.79s)
      flat  flat%   sum%        cum   cum%
    61.12s 38.45% 38.45%     61.25s 38.53%  unicode.SimpleFold
    22.62s 14.23% 52.68%    121.64s 76.52%  regexp.(*machine).tryBacktrack
    18.76s 11.80% 64.48%     18.76s 11.80%  regexp.(*bitState).push
    10.41s  6.55% 71.03%     71.66s 45.08%  regexp/syntax.(*Inst).MatchRunePos
     8.99s  5.66% 76.68%      9.41s  5.92%  regexp.(*inputString).step
     8.94s  5.62% 82.30%    135.55s 85.27%  regexp.(*machine).backtrack
     5.76s  3.62% 85.93%      5.76s  3.62%  runtime.osyield
     3.48s  2.19% 88.12%     75.14s 47.27%  regexp/syntax.(*Inst).MatchRune

Which lead me to: https://github.com/golang/go/blob/master/src/regexp/syntax/prog.go#L213

I assume that there is a good reason for using unicode.SimpleFold, i.e. some special unicode symbols. When I changed it:

for r1 := unicode.SimpleFold(r0); r1 != r0; r1 = unicode.SimpleFold(r1) {
// into
if r1 := unicode.ToLower(r0); r1 == r || r1 == unicode.ToLower(r) {

It didn't break any tests... also I created easy0i = "(?i)ABCDEFGHIJKLMNOPQRSTUVWXYZ$" in https://github.com/golang/go/blob/master/src/regexp/exec_test.go#L674 to measure the performance change:

// old
BenchmarkMatchEasy0i_32-8                 500000              2860 ns/op          11.19 MB/s
BenchmarkMatchEasy0i_1K-8                  20000             84204 ns/op          12.16 MB/s
BenchmarkMatchEasy0i_32K-8                   500           3346191 ns/op           9.79 MB/s
BenchmarkMatchEasy0i_1M-8                     10         107206130 ns/op           9.78 MB/s
BenchmarkMatchEasy0i_32M-8                     1        3424195900 ns/op           9.80 MB/s
// new
BenchmarkMatchEasy0i_32-8                1000000              1831 ns/op          17.48 MB/s
BenchmarkMatchEasy0i_1K-8                  30000             56369 ns/op          18.17 MB/s
BenchmarkMatchEasy0i_32K-8                  1000           2276130 ns/op          14.40 MB/s
BenchmarkMatchEasy0i_1M-8                     20          73004175 ns/op          14.36 MB/s
BenchmarkMatchEasy0i_32M-8                     1        2334133500 ns/op          14.38 MB/s

So, is there a reason why unicode.SimpleFold is used instead of something else, such as unicode.ToLower? Either way, there might be some significant performance gains here for case-insensitive matches.

@egonelbre
Copy link
Contributor Author

unicode.SimpleFold does a binary search over the caseOrbit table, every letter.

Introducing a LUT for runes < 0x80, helped the Easy0i case.

var foldLUT [0x80]uint16

func SimpleFold(r rune) rune {
    if r < len(foldLUT) {
        return rune(foldLUT[r])
    }
    // ...

Results:

BenchmarkMatchEasy0i_32-8        1000000              1796 ns/op          17.82 MB/s
BenchmarkMatchEasy0i_1K-8          30000             50636 ns/op          20.22 MB/s
BenchmarkMatchEasy0i_32K-8          1000           2326133 ns/op          14.09 MB/s
BenchmarkMatchEasy0i_1M-8             20          74604270 ns/op          14.06 MB/s
BenchmarkMatchEasy0i_32M-8             1        2373135700 ns/op          14.14 MB/s

@ianlancetaylor ianlancetaylor added this to the Unplanned milestone Nov 17, 2015
@bradfitz
Copy link
Contributor

Nice. Use https://godoc.org/golang.org/x/tools/cmd/benchcmp to show the before & after numbers.

Send in a change? https://golang.org/doc/contribute.html

It's probably too late for Go 1.6, though, unless you send it soon and it addresses an existing bug or regression.

@egonelbre
Copy link
Contributor Author

Sure, I can make CL for the unicode.SimpleFold optimization.

I'm not sure whether regexp part can or should be adjusted, to not use SimpleFold?

@bradfitz
Copy link
Contributor

I don't think you can change SimpleFold to ToLower in regexp. I suspect that would break case-insensitive regexps over characters like "kKK" (little k, big k, Kelvin symbol)? But no, ToLower of kelvin symbol is lowercase 'k'.... http://play.golang.org/p/pqhR4ksOk_ But maybe there are other characters which SimpleFold to each other but don't map to the same ToLower character? You could brute force all the characters (or characters in the defined ranges) and see?

@egonelbre
Copy link
Contributor Author

@gopherbot
Copy link

CL https://golang.org/cl/16943 mentions this issue.

@egonelbre
Copy link
Contributor Author

Did the brute-force, it seems that there are multiple such cases: https://play.golang.org/p/D1_JuZYP0v. So yes, unicode.ToLower wouldn't be suitable. Then again, it would be possible to use regexp/syntax.minFoldRune (after making it single lookup).

@bradfitz bradfitz modified the milestones: Go1.7, Unplanned Feb 1, 2016
mk0x9 pushed a commit to mk0x9/go that referenced this issue Apr 26, 2016
This change significantly speeds up case-insensitive regexp matching.

benchmark                      old ns/op      new ns/op      delta
BenchmarkMatchEasy0i_32-8      2690           1473           -45.24%
BenchmarkMatchEasy0i_1K-8      80404          42269          -47.43%
BenchmarkMatchEasy0i_32K-8     3272187        2076118        -36.55%
BenchmarkMatchEasy0i_1M-8      104805990      66503805       -36.55%
BenchmarkMatchEasy0i_32M-8     3360192200     2126121600     -36.73%

benchmark                      old MB/s     new MB/s     speedup
BenchmarkMatchEasy0i_32-8      11.90        21.72        1.83x
BenchmarkMatchEasy0i_1K-8      12.74        24.23        1.90x
BenchmarkMatchEasy0i_32K-8     10.01        15.78        1.58x
BenchmarkMatchEasy0i_1M-8      10.00        15.77        1.58x
BenchmarkMatchEasy0i_32M-8     9.99         15.78        1.58x

Issue golang#13288

Change-Id: I94af7bb29e75d60b4f6ee760124867ab271b9642
Reviewed-on: https://go-review.googlesource.com/16943
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@bradfitz bradfitz modified the milestones: Unplanned, Go1.7 Apr 26, 2016
@bradfitz
Copy link
Contributor

Probably enough for Go 1.7. Leaving this open, though, if you were planning on doing more later.

@aprice
Copy link

aprice commented Jul 14, 2016

Go 1.7 is looming and I don't see this mentioned anywhere in the release notes and no update on this ticket - is this being rolled into the 1.7 release? Regex is an unfortunate pain point for Go performance compared to other languages.

@cespare
Copy link
Contributor

cespare commented Jul 14, 2016

@aprice @egonelbre's improvements (https://go-review.googlesource.com/#/c/16943/) were merged and will be part of Go 1.7. (Localized performance improvements are typically not called out in the release notes.)

@ianlancetaylor
Copy link
Contributor

ianlancetaylor commented Jul 14, 2016

A change was submitted to speed up unicode.SimpleFold performance for ASCII, which in turns speeds up case-insensitive regexp performance for ASCII. That change is not significant enough to mention in the release notes. This issue remains open for improved case-folding performance for non-ASCII, if anybody wants to work on it.

@kilyo
Copy link

kilyo commented Jul 14, 2016

I have moved my question to the forum, as requested.
https://forum.golangbridge.org/t/unicode-simplefold-implementataion-in-go-1-7/2999

@bradfitz
Copy link
Contributor

Let's move general code/style questions to a mailing list and not clutter this bug. I'm happy to answer on a mailing list.

@egonelbre
Copy link
Contributor Author

When making the easy fix, the better solution would have been to avoid calling SimpleFold and use minFoldRune with a LUT to make less lookups. Also, maybe, convert the case-insensitive part of regexp to minFold when compiling the regexp. However, I don't think I would be able to create the time to make such change.

@rsc
Copy link
Contributor

rsc commented Oct 2, 2018

You can't simply use strings.ToLower on "knicks" because that will fail to match the "k" against the Unicode Kelvin symbol U+212A (K). The extra overhead here compared to strings.ToLower is exactly to add that symbol.

@rsc rsc closed this as completed Oct 2, 2018
@golang golang locked and limited conversation to collaborators Oct 2, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

8 participants