regexp: individual regexp scales poorly under highly concurrent use #8232
Russ, I know we previously decided not to use sync.Pool here, but it would give us per-CPU retrieval without this lock. Worth reconsidering? Or would keeping idle machines in a buffered channel be faster, at least once Dmitry does whatever channel work he's planning in Go 1.4?

Labels changed: added release-go1.4, repo-main.

Status changed to Accepted.
CL https://golang.org/cl/110020045 mentions this issue.
So the regexp package is only for non-trivial regexps, even if a regexp is the most comfortable way to describe the solution? I changed the regexp to be less trivial:

```go
func BenchmarkMachineCache(b *testing.B) {
	re := MustCompile("foo (ba+r)? baz")
	line := "this is a long line that contains foo bar baz"
	for i := 0; i < b.N; i++ {
		re.MatchString(line)
	}
}

func BenchmarkMachineCacheParallel(b *testing.B) {
	re := MustCompile("foo (ba+r)? baz")
	line := "this is a long line that contains foo bar baz"
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			re.MatchString(line)
		}
	})
}
```

The sync.Pool doesn't seem to hurt the single-threaded case, even on this non-trivial regexp:

```
$ benchcmp before after
benchmark                          old ns/op    new ns/op    delta
BenchmarkMachineCache              1608         1560         -2.99%
BenchmarkMachineCache-2            1595         1546         -3.07%
BenchmarkMachineCache-4            1571         1543         -1.78%
BenchmarkMachineCache-8            1594         1549         -2.82%
BenchmarkMachineCacheParallel      1593         1554         -2.45%
BenchmarkMachineCacheParallel-2    929          928          -0.11%
BenchmarkMachineCacheParallel-4    649          447          -31.12%
BenchmarkMachineCacheParallel-8    586          381          -34.98%
```

I'm also not sure I buy the extra memory argument from Dmitry, since the cost is marginal compared to the weight of a compiled program, and compiled programs are generally few and global. If you don't like sync.Pool here, what do you propose? Writing the code by hand and avoiding the regexp package seems like a cop-out.
I've tested on a machine with 32 hardware threads:

```
benchmark                           old ns/op    new ns/op    delta
BenchmarkMachineCacheParallel       1117         1296         +16.03%
BenchmarkMachineCacheParallel-2     1013         602          -40.57%
BenchmarkMachineCacheParallel-4     574          362          -36.93%
BenchmarkMachineCacheParallel-8     1292         193          -85.06%
BenchmarkMachineCacheParallel-16    2751         118          -95.71%
BenchmarkMachineCacheParallel-32    2843         69.4         -97.56%
```
A slice holding a single cached *machine costs something like 5 words (Lock + slice + 1-element array). A pool holding a single cached *machine on a 4-CPU system costs something like 16*4 = 64 words. A factor of ten more memory for a simple free list is too much. This is the kind of decision that makes microbenchmarks fast and overall programs slow (because so much more memory is used).

The sync.Pool docs are very clear: it is for globals, singletons. It is not for things that already have clearly defined lifetimes and can use the same lifetime for their caches. Put another way, sync.Pool is for answering the question "how do I decide when to drop this cached thing that otherwise would never die". The suggestion here is to use it to answer the question "how can I make this thing fast". That's a mistake, even if the implementation can satisfy both today. If it starts getting used for these two different things, then we will not be able to change it, because optimizing for one case will hurt the other.

The regexp free list is a Lock+slice. If that's so slow that it dominates real regexp execution, then something is very wrong. What is it? Why are locks that slow? Are locks broken? Is the scheduler broken? Is a locked free list just fundamentally incompatible with multiprocessors?

If locks are broken, let's fix the locks. If the scheduler is broken, let's fix the scheduler. If a locked free list is fundamentally incompatible with multiprocessors - and you'll need serious evidence to back this up - then maybe package sync needs to provide a faster free list. None of these answers involve using a sync.Pool in a context it is not intended for.
Dmitry, I found I had to use -benchtime=4s before the numbers stopped being noisy (notably for -cpu=1). That was on real hardware, with 8 cores.

Russ, I understand that sync.Pool is heavier than a lock + slice. I asked what the alternative was. I'm confused by your:

> then maybe package sync needs to provide a faster free list.

That sounds awfully like a sync.Pool. You want a lighter free list than sync.Pool in sync too? That would be confusing, having two very similar-but-different mechanisms. Presumably the lighter version wouldn't have a per-CPU cache, and thus could avoid the [128]byte false-sharing-avoidance pad. But I doubt it'd do too well without that. I guess you mean just using atomics to do a lock-free queue of unsafe.Pointer or interface{}. It seems like that wouldn't be enough better to warrant the confusion. I'll watch this bug with interest.
I've implemented a lock-free, stack-based free list for machines: https://golang.org/cl/107180045

Here are results for the free list:

```
benchmark                                  old ns/op    new ns/op    delta
BenchmarkMachineCacheParallel              1007         1115         +10.72%
BenchmarkMachineCacheParallel-2            793          678          -14.50%
BenchmarkMachineCacheParallel-4            521          454          -12.86%
BenchmarkMachineCacheParallel-8            594          402          -32.32%
BenchmarkMachineCacheParallel-16           693          467          -32.61%
BenchmarkMachineCacheParallel-32           2600         1521         -41.50%
BenchmarkMachineCacheComplexParallel       4303         4503         +4.65%
BenchmarkMachineCacheComplexParallel-2     3133         2414         -22.95%
BenchmarkMachineCacheComplexParallel-4     1556         1359         -12.66%
BenchmarkMachineCacheComplexParallel-8     1077         783          -27.30%
BenchmarkMachineCacheComplexParallel-16    898          715          -20.38%
BenchmarkMachineCacheComplexParallel-32    1975         1381         -30.08%
```

Here are results for sync.Pool:

```
benchmark                                  old ns/op    new ns/op    delta
BenchmarkMachineCacheParallel              1007         1007         +0.00%
BenchmarkMachineCacheParallel-2            793          576          -27.36%
BenchmarkMachineCacheParallel-4            521          347          -33.40%
BenchmarkMachineCacheParallel-8            594          180          -69.70%
BenchmarkMachineCacheParallel-16           693          135          -80.52%
BenchmarkMachineCacheParallel-32           2600         72.4         -97.22%
BenchmarkMachineCacheComplexParallel       4303         4374         +1.65%
BenchmarkMachineCacheComplexParallel-2     3133         2255         -28.02%
BenchmarkMachineCacheComplexParallel-4     1556         1212         -22.11%
BenchmarkMachineCacheComplexParallel-8     1077         656          -39.09%
BenchmarkMachineCacheComplexParallel-16    898          573          -36.19%
BenchmarkMachineCacheComplexParallel-32    1975         291          -85.27%
```

BenchmarkMachineCacheParallel is what Brad used:

```go
re := MustCompile("foo (ba+r)? baz")
line := "this is a long line that contains foo bar baz"
```

BenchmarkMachineCacheComplexParallel is a somewhat more complex expression:

```go
re := MustCompile("f?o+o (bzz)(b+a+r)? b+a+z")
line := "this is a long line that contains foo bar baz"
```
Tested the lock-free free list on 2 heavier regexps (this is current code vs. the free list):

```
BenchmarkMachineCacheComplexParallel       7429     7237    -2.58%
BenchmarkMachineCacheComplexParallel-2     4259     4040    -5.14%
BenchmarkMachineCacheComplexParallel-4     2318     2227    -3.93%
BenchmarkMachineCacheComplexParallel-8     1313     1201    -8.53%
BenchmarkMachineCacheComplexParallel-16    1241     1004    -19.10%
BenchmarkMachineCacheComplexParallel-32    2000     1277    -36.15%
```

```
BenchmarkMachineCacheComplexParallel       14909    14327   -3.90%
BenchmarkMachineCacheComplexParallel-2     7872     7860    -0.15%
BenchmarkMachineCacheComplexParallel-4     4190     4321    +3.13%
BenchmarkMachineCacheComplexParallel-8     2333     2279    -2.31%
BenchmarkMachineCacheComplexParallel-16    1950     1917    -1.69%
BenchmarkMachineCacheComplexParallel-32    2125     1069    -49.69%
```
sync.Pool is the wrong primitive here. The lock-free free list improves the situation a bit, but there is still a 20x slowdown on a 32-thread machine. The conclusion so far: use a local per-goroutine regexp in such cases; it minimally complicates code, but provides linear scaling.

Status changed to Unfortunate.
What is the tiny amount of work that database/sql does under lock? Are the operations on the databases really so fast that the locking to get at the connections is significant?

There are two things we could do for the regexp API. One is to introduce a new type, with a better name than RegexpForSingleGoroutine, that only one goroutine can use at a time. It would have exactly one cached machine and no lock when used. You could get at a RegexpForSingleGoroutine by calling a new method on Regexp. This would end up being similar to the "just call regexp.Compile in each goroutine" solution, but it avoids the Compile, which although usually cheap may not always be. RegexpForSingleGoroutine would need all the methods that Regexp has, which is unfortunate.

The other thing we could do is what I just described but reuse Regexp as the type and keep the lock. The speedup comes from not contending for the lock, not from not having the lock. So if you had func (*Regexp) Copy() *Regexp that each goroutine could call if it wanted its own local contention-free copy, that would probably be good enough. And it avoids adding a new type with many new methods.
Too late for Go 1.5. |
@rsc why was the fix for using regexp in highly concurrent applications dropped from the 1.5 release? :(
@maxbaldin If this is a bottleneck in your code, you can work around it today by using a different |
I gave the Copy approach a shot in https://golang.org/cl/16110. I also compared benchmarks against @rsc's RegexpForSingleGoroutine; that gave a 7% improvement over the no-contention Copy version using the regex @bradfitz and @dvyukov were using above. (See CL for details.) |
CL https://golang.org/cl/16110 mentions this issue. |
This helps users who wish to use separate Regexps in each goroutine to avoid lock contention. Previously they had to parse the expression multiple times to achieve this.

I used variants of the included benchmark to evaluate this change. I used the arguments -benchtime 20s -cpu 1,2,4,8,16 on a machine with 16 hardware cores.

Comparing a single shared Regexp vs. copied Regexps, we can see that lock contention causes huge slowdowns at higher levels of parallelism. The copied version shows the expected linear speedup.

```
name              old time/op    new time/op    delta
MatchParallel     366ns ± 0%     370ns ± 0%     +1.09%   (p=0.000 n=10+8)
MatchParallel-2   324ns ±28%     184ns ± 1%     -43.37%  (p=0.000 n=10+10)
MatchParallel-4   352ns ± 5%     93ns ± 1%      -73.70%  (p=0.000 n=9+10)
MatchParallel-8   480ns ± 3%     46ns ± 0%      -90.33%  (p=0.000 n=9+8)
MatchParallel-16  510ns ± 8%     24ns ± 6%      -95.36%  (p=0.000 n=10+8)
```

I also compared a modified version of Regexp that has no mutex and a single machine (the "RegexpForSingleGoroutine" rsc mentioned in #8232 (comment)). In this next test, I compared using N copied Regexps vs. N separate RegexpForSingleGoroutines. This shows that, even for this relatively simple regex, avoiding the lock entirely would only buy about 10-12% further improvement.

```
name              old time/op    new time/op    delta
MatchParallel     370ns ± 0%     322ns ± 0%     -12.97%  (p=0.000 n=8+8)
MatchParallel-2   184ns ± 1%     162ns ± 1%     -11.60%  (p=0.000 n=10+10)
MatchParallel-4   92.7ns ± 1%    81.1ns ± 2%    -12.43%  (p=0.000 n=10+10)
MatchParallel-8   46.4ns ± 0%    41.8ns ±10%    -9.78%   (p=0.000 n=8+10)
MatchParallel-16  23.7ns ± 6%    20.6ns ± 1%    -13.14%  (p=0.000 n=8+8)
```

Updates #8232.

Change-Id: I15201a080c363d1b44104eafed46d8df5e311902
Reviewed-on: https://go-review.googlesource.com/16110
Reviewed-by: Russ Cox <rsc@golang.org>
Fixed (at least worked around) in CL 16110. |