regexp: individual regexp scales poorly under highly concurrent use #8232
Russ, I know we previously decided not to use sync.Pool here, but it would give us per-CPU retrieval without this lock. Worth reconsidering? Or would keeping idle machines in a buffered channel be faster, at least once Dmitry does whatever channel work he's planning in Go 1.4?

Labels changed: added release-go1.4, repo-main.

Status changed to Accepted.
CL https://golang.org/cl/110020045 mentions this issue.
So the regexp package is only for non-trivial regexps, even if a regexp is the most comfortable way to describe the solution? I changed the regexp to be less trivial:

```go
func BenchmarkMachineCache(b *testing.B) {
	re := MustCompile("foo (ba+r)? baz")
	line := "this is a long line that contains foo bar baz"
	for i := 0; i < b.N; i++ {
		re.MatchString(line)
	}
}

func BenchmarkMachineCacheParallel(b *testing.B) {
	re := MustCompile("foo (ba+r)? baz")
	line := "this is a long line that contains foo bar baz"
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			re.MatchString(line)
		}
	})
}
```

The sync.Pool doesn't seem to hurt the single-threaded case, even on this non-trivial regexp:

```
$ benchcmp before after
benchmark                          old ns/op    new ns/op    delta
BenchmarkMachineCache              1608         1560         -2.99%
BenchmarkMachineCache-2            1595         1546         -3.07%
BenchmarkMachineCache-4            1571         1543         -1.78%
BenchmarkMachineCache-8            1594         1549         -2.82%
BenchmarkMachineCacheParallel      1593         1554         -2.45%
BenchmarkMachineCacheParallel-2    929          928          -0.11%
BenchmarkMachineCacheParallel-4    649          447          -31.12%
BenchmarkMachineCacheParallel-8    586          381          -34.98%
```

I'm also not sure I buy the extra memory argument from Dmitry, since the cost is marginal compared to the weight of a compiled program, and compiled programs are generally few and global. If you don't like sync.Pool here, what do you propose? Writing the code by hand and avoiding the regexp package seems like a cop-out.
I've tested on a machine with 32 hardware threads:

```
benchmark                           old ns/op    new ns/op    delta
BenchmarkMachineCacheParallel       1117         1296         +16.03%
BenchmarkMachineCacheParallel-2     1013         602          -40.57%
BenchmarkMachineCacheParallel-4     574          362          -36.93%
BenchmarkMachineCacheParallel-8     1292         193          -85.06%
BenchmarkMachineCacheParallel-16    2751         118          -95.71%
BenchmarkMachineCacheParallel-32    2843         69.4         -97.56%
```
A slice holding a single cached *machine costs something like 5 words (Lock + slice + 1-element array). A pool holding a single cached *machine on a 4-CPU system costs something like 16*4 = 64 words. A factor of ten more memory for a simple free list is too much. This is the kind of decision that makes microbenchmarks fast and overall programs slow (because so much more memory is used).

The sync.Pool docs are very clear: it is for globals, singletons. It is not for things that already have clearly defined lifetimes and can use the same lifetime for their caches. Put another way, sync.Pool is for answering the question "how do I decide when to drop this cached thing that otherwise would never die". The suggestion here is to use it to answer the question "how can I make this thing fast". That's a mistake, even if the implementation can satisfy both today. If it starts getting used for these two different things, then we will not be able to change it, because optimizing for one case will hurt the other.

The regexp free list is a Lock+slice. If that's so slow that it dominates real regexp execution, then something is very wrong. What is it? Why are locks that slow? Are locks broken? Is the scheduler broken? Is a locked free list just fundamentally incompatible with multiprocessors?

If locks are broken, let's fix the locks. If the scheduler is broken, let's fix the scheduler. If a locked free list is fundamentally incompatible with multiprocessors - and you'll need serious evidence to back this up - then maybe package sync needs to provide a faster free list. None of these answers involve using a sync.Pool in a context it is not intended for.
Dmitry, I found I had to use -benchtime=4s before the numbers stopped being noisy (notably for -cpu=1). That was on real hardware, with 8 cores.

Russ, I understand that sync.Pool is heavier than a lock + slice. I asked what the alternative was. I'm confused by your:

> then maybe package sync needs to provide a faster free list.

That sounds awfully like a sync.Pool. You want a lighter free list than sync.Pool in sync too? That would be confusing, having two very similar-but-different mechanisms. Presumably the lighter version wouldn't have a per-CPU cache, and thus could avoid the [128]byte false-sharing-avoidance pad. But I doubt it'd do too well without that. I guess you mean just using atomics to do a lock-free queue of unsafe.Pointer or interface{}. It seems like that wouldn't be enough better to warrant the confusion. I'll watch this bug with interest.
I've implemented a lock-free, stack-based free list for machines: https://golang.org/cl/107180045

Here are results for the free list:

```
benchmark                                  old ns/op    new ns/op    delta
BenchmarkMachineCacheParallel              1007         1115         +10.72%
BenchmarkMachineCacheParallel-2            793          678          -14.50%
BenchmarkMachineCacheParallel-4            521          454          -12.86%
BenchmarkMachineCacheParallel-8            594          402          -32.32%
BenchmarkMachineCacheParallel-16           693          467          -32.61%
BenchmarkMachineCacheParallel-32           2600         1521         -41.50%
BenchmarkMachineCacheComplexParallel       4303         4503         +4.65%
BenchmarkMachineCacheComplexParallel-2     3133         2414         -22.95%
BenchmarkMachineCacheComplexParallel-4     1556         1359         -12.66%
BenchmarkMachineCacheComplexParallel-8     1077         783          -27.30%
BenchmarkMachineCacheComplexParallel-16    898          715          -20.38%
BenchmarkMachineCacheComplexParallel-32    1975         1381         -30.08%
```

Here are results for sync.Pool:

```
benchmark                                  old ns/op    new ns/op    delta
BenchmarkMachineCacheParallel              1007         1007         +0.00%
BenchmarkMachineCacheParallel-2            793          576          -27.36%
BenchmarkMachineCacheParallel-4            521          347          -33.40%
BenchmarkMachineCacheParallel-8            594          180          -69.70%
BenchmarkMachineCacheParallel-16           693          135          -80.52%
BenchmarkMachineCacheParallel-32           2600         72.4         -97.22%
BenchmarkMachineCacheComplexParallel       4303         4374         +1.65%
BenchmarkMachineCacheComplexParallel-2     3133         2255         -28.02%
BenchmarkMachineCacheComplexParallel-4     1556         1212         -22.11%
BenchmarkMachineCacheComplexParallel-8     1077         656          -39.09%
BenchmarkMachineCacheComplexParallel-16    898          573          -36.19%
BenchmarkMachineCacheComplexParallel-32    1975         291          -85.27%
```

BenchmarkMachineCacheParallel is what Brad used:

```go
re := MustCompile("foo (ba+r)? baz")
line := "this is a long line that contains foo bar baz"
```

BenchmarkMachineCacheComplexParallel is a somewhat more complex expression:

```go
re := MustCompile("f?o+o (bzz)(b+a+r)? b+a+z")
line := "this is a long line that contains foo bar baz"
```
Tested the lock-free free list on 2 heavier regexps (this is current code vs. the free list):

```
BenchmarkMachineCacheComplexParallel       7429     7237    -2.58%
BenchmarkMachineCacheComplexParallel-2     4259     4040    -5.14%
BenchmarkMachineCacheComplexParallel-4     2318     2227    -3.93%
BenchmarkMachineCacheComplexParallel-8     1313     1201    -8.53%
BenchmarkMachineCacheComplexParallel-16    1241     1004    -19.10%
BenchmarkMachineCacheComplexParallel-32    2000     1277    -36.15%
```

```
BenchmarkMachineCacheComplexParallel       14909    14327   -3.90%
BenchmarkMachineCacheComplexParallel-2     7872     7860    -0.15%
BenchmarkMachineCacheComplexParallel-4     4190     4321    +3.13%
BenchmarkMachineCacheComplexParallel-8     2333     2279    -2.31%
BenchmarkMachineCacheComplexParallel-16    1950     1917    -1.69%
BenchmarkMachineCacheComplexParallel-32    2125     1069    -49.69%
```
sync.Pool is the wrong primitive here. The lock-free free list improves the situation a bit, but there is still a 20x slowdown on a 32-thread machine. The conclusion so far: use a local per-goroutine regexp in such cases; it minimally complicates code, but provides linear scaling.

Status changed to Unfortunate.
What is the tiny amount of work that database/sql does under lock? Are the operations on the databases really so fast that the locking to get at the connections is significant?

There are two things we could do for the regexp API. One is to introduce a new type, with a better name than RegexpForSingleGoroutine, that only one goroutine can use at a time. It would have exactly one cached machine and no lock when used. You could get at a RegexpForSingleGoroutine by calling a new method on Regexp. This would end up being similar to the "just call regexp.Compile in each goroutine" solution, but it avoids the Compile, which although usually cheap may not always be. RegexpForSingleGoroutine would need all the methods that Regexp has, which is unfortunate.

The other thing we could do is what I just described but reuse Regexp as the type and keep the lock. The speedup comes from not contending for the lock, not from not having the lock. So if you had func (*Regexp) Copy() *Regexp that each goroutine could call if it wanted its own local contention-free copy, that would probably be good enough. And it avoids adding a new type with many new methods.
Too late for Go 1.5. |
@rsc why was the fix for using regexp in highly concurrent applications dropped from the 1.5 release? :(
@maxbaldin If this is a bottleneck in your code, you can work around it today by using a different |
I gave the Copy approach a shot in https://golang.org/cl/16110. I also compared benchmarks against @rsc's RegexpForSingleGoroutine; that gave a 7% improvement over the no-contention Copy version using the regex @bradfitz and @dvyukov were using above. (See CL for details.) |
CL https://golang.org/cl/16110 mentions this issue. |
This helps users who wish to use separate Regexps in each goroutine to avoid lock contention. Previously they had to parse the expression multiple times to achieve this.

I used variants of the included benchmark to evaluate this change. I used the arguments -benchtime 20s -cpu 1,2,4,8,16 on a machine with 16 hardware cores.

Comparing a single shared Regexp vs. copied Regexps, we can see that lock contention causes huge slowdowns at higher levels of parallelism. The copied version shows the expected linear speedup.

```
name              old time/op    new time/op    delta
MatchParallel     366ns ± 0%     370ns ± 0%     +1.09%   (p=0.000 n=10+8)
MatchParallel-2   324ns ±28%     184ns ± 1%     -43.37%  (p=0.000 n=10+10)
MatchParallel-4   352ns ± 5%     93ns ± 1%      -73.70%  (p=0.000 n=9+10)
MatchParallel-8   480ns ± 3%     46ns ± 0%      -90.33%  (p=0.000 n=9+8)
MatchParallel-16  510ns ± 8%     24ns ± 6%      -95.36%  (p=0.000 n=10+8)
```

I also compared a modified version of Regexp that has no mutex and a single machine (the "RegexpForSingleGoroutine" rsc mentioned in #8232 (comment)). In this next test, I compared using N copied Regexps vs. N separate RegexpForSingleGoroutines. This shows that, even for this relatively simple regex, avoiding the lock entirely would only buy about 10-12% further improvement.

```
name              old time/op    new time/op    delta
MatchParallel     370ns ± 0%     322ns ± 0%     -12.97%  (p=0.000 n=8+8)
MatchParallel-2   184ns ± 1%     162ns ± 1%     -11.60%  (p=0.000 n=10+10)
MatchParallel-4   92.7ns ± 1%    81.1ns ± 2%    -12.43%  (p=0.000 n=10+10)
MatchParallel-8   46.4ns ± 0%    41.8ns ±10%    -9.78%   (p=0.000 n=8+10)
MatchParallel-16  23.7ns ± 6%    20.6ns ± 1%    -13.14%  (p=0.000 n=8+8)
```

Updates #8232.

Change-Id: I15201a080c363d1b44104eafed46d8df5e311902
Reviewed-on: https://go-review.googlesource.com/16110
Reviewed-by: Russ Cox <rsc@golang.org>
Fixed (at least worked around) in CL 16110. |