sync: bad placement of multiple contested locks can cause drastic slowdown #17953

Closed
ianlancetaylor opened this issue Nov 17, 2016 · 5 comments

@ianlancetaylor (Contributor)

This program runs in a couple of seconds in the normal case. When invoked with the argument -offset=251, it takes over four minutes. The problem is the hash table in runtime/sema.go. When two contested mutexes happen to wind up using the same hash table entry, the linear search in semrelease is a significant performance drag. While this is not a likely occurrence, 1 in 251 is not so terribly unlikely either, and the performance effect is extreme. This actually occurred in an internal Google program.

CC @dvyukov

package main

import (
    "flag"
    "sync"
)

var offset = flag.Int("offset", 1, "Offset for the second lock")

func main() {
	flag.Parse()
	locks := make([]sync.RWMutex, *offset+1)
	ch := make(chan struct{}) // closed to release all workers at once

	var wg sync.WaitGroup

	// 10000 goroutines repeatedly take locks[0] and locks[*offset] together.
	for i := 0; i < 10000; i++ {
		wg.Add(1)
		go func() {
			<-ch
			for i := 0; i < 10; i++ {
				locks[0].Lock()
				locks[*offset].Lock()
				locks[*offset].Unlock()
				locks[0].Unlock()
			}
			wg.Done()
		}()
	}

	// 100 goroutines keep locks[*offset] contested for the whole run.
	for i := 0; i < 100; i++ {
		go func() {
			for {
				locks[*offset].Lock()
				locks[*offset].Unlock()
			}
		}()
	}

	close(ch)
	wg.Wait()
}
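
For the record, here is why -offset=251 in particular is the unlucky value. This is a minimal sketch assuming the bucket computation runtime/sema.go used at the time, roughly (addr >> 3) % 251 over the 251-entry table; the bucket helper below is illustrative, not the runtime's code.

package main

import (
	"fmt"
	"sync"
	"unsafe"
)

// bucket mimics the semaphore table's slot choice (assumed here):
// the semaphore address shifted right by 3, modulo the 251-entry table.
func bucket(p unsafe.Pointer) uintptr {
	return (uintptr(p) >> 3) % 251
}

func main() {
	locks := make([]sync.RWMutex, 252)
	// sync.RWMutex is 24 bytes, so locks[0] and locks[251] (and any field
	// at the same offset inside them, including the semaphores the runtime
	// actually hashes) are 251*24 = 6024 bytes apart. 6024 is a multiple
	// of 8*251, so both map to the same bucket.
	fmt.Println(bucket(unsafe.Pointer(&locks[0])) == bucket(unsafe.Pointer(&locks[251]))) // true
}
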
@ianlancetaylor ianlancetaylor added this to the Go1.9 milestone Nov 17, 2016
@dvyukov (Member) commented Nov 17, 2016

This is not specific to RWMutex, right? Mutex should be affected the same way; if RWMutex is used only for write locks, it behaves exactly like Mutex.

It's solvable if we increase the size of Mutex by at least one word to store a *semaRoot directly. Maybe we could even pack all the state into two words without indirection.
Solving it without a Mutex size increase would require playing dirty games like packing pointers into integers (which should work provided the pointee is non-movable and is kept alive by other pointers, as is true for, e.g., a G).
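
To make the size tradeoff concrete, here is a rough sketch using a hypothetical mutexWithRoot type, not the real sync.Mutex: today's Mutex is two 32-bit words (state and sema), and adding one pointer-sized word for a per-mutex wait-queue root would double it to 16 bytes on 64-bit platforms.

package main

import (
	"fmt"
	"unsafe"
)

// mutexWithRoot sketches the layout suggested above: the sync.Mutex fields
// plus one pointer-sized word that could point at a per-mutex wait-queue
// root instead of hashing into the shared semaphore table. Hypothetical.
type mutexWithRoot struct {
	state int32          // fast-path lock state, as in sync.Mutex
	sema  uint32         // semaphore the runtime parks goroutines on
	root  unsafe.Pointer // hypothetical per-mutex *semaRoot
}

func main() {
	// 8 bytes today vs. 16 bytes with the extra word on 64-bit platforms.
	fmt.Println(unsafe.Sizeof(mutexWithRoot{}))
}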

@ianlancetaylor (Contributor, Author)

As far as I can see, it would be OK to grow Mutex by a word. It seems unlikely that any program's memory usage is dominated by the number of Mutex values it creates.

@rsc (Contributor) commented Feb 10, 2017

I'd really prefer not to store internal runtime state (pointers) in a user-corruptible data structure. We've seen malloc implementations have to move their metadata into separate pages to make sure that user corruption doesn't cause what look like internal errors in malloc, and I think that lesson applies here.

I have a different, simpler idea. Instead of having each bucket of the 251-entry hash table hold a list of all goroutines blocked on addresses hashing to that bucket, have it hold a list of only the goroutines blocked on unique addresses hashing to that bucket. Instead of inserting an address into the list a second time, a secondary list can hang off each of these unique representatives. Because the wakeups are FIFO, the length of those secondary lists doesn't matter (the ops on them are O(1)), and moving the large number of goroutines blocked on the same lock off the main list eliminates that list as a cause of slowdown when another lock hits the same hash bucket.
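
One way to picture this scheme is the minimal sketch below, using made-up names (waiter, entry, bucket) rather than the runtime's sudog and semaRoot: each bucket keeps one entry per unique blocked-on address, each entry carries its own FIFO of waiters, and the scan in the release path touches unique addresses only.

package semasketch

// waiter stands in for one blocked goroutine.
type waiter struct {
	addr *uint32 // semaphore address the goroutine is blocked on
	next *waiter // next waiter blocked on the same address (FIFO)
}

// entry is the unique representative for one address in a bucket.
type entry struct {
	addr       *uint32
	head, tail *waiter // FIFO: enqueue at tail, dequeue at head
	nextUnique *entry  // next unique address hashing to this bucket
}

// bucket holds one entry per unique address, so a popular lock adds a
// single entry no matter how many goroutines block on it.
type bucket struct {
	uniques *entry
}

// enqueue adds w to the FIFO for w.addr, creating an entry if needed.
func (b *bucket) enqueue(w *waiter) {
	for e := b.uniques; e != nil; e = e.nextUnique {
		if e.addr == w.addr {
			e.tail.next = w // O(1) append to the existing FIFO
			e.tail = w
			return
		}
	}
	b.uniques = &entry{addr: w.addr, head: w, tail: w, nextUnique: b.uniques}
}

// dequeue removes and returns the oldest waiter for addr, if any.
// The linear scan is over unique addresses, not over all waiters.
func (b *bucket) dequeue(addr *uint32) *waiter {
	var prev *entry
	for e := b.uniques; e != nil; prev, e = e, e.nextUnique {
		if e.addr != addr {
			continue
		}
		w := e.head
		if e.head = w.next; e.head == nil {
			// Last waiter for this address: drop the entry itself.
			if prev == nil {
				b.uniques = e.nextUnique
			} else {
				prev.nextUnique = e.nextUnique
			}
		}
		w.next = nil
		return w
	}
	return nil
}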

I've implemented this, and it eliminates the problem in @ianlancetaylor's test program. Ian, if you can send me more details about the Google problem off-list then I'll see if it would fix that too.

@gopherbot

CL https://golang.org/cl/36792 mentions this issue.

@gopherbot

CL https://golang.org/cl/37103 mentions this issue.

gopherbot pushed a commit that referenced this issue Feb 16, 2017
CL 36792 fixed #17953, a linear scan caused by n goroutines piling into
two different locks that hashed to the same bucket in the semaphore table.
In that CL, n goroutines contending for 2 unfortunately chosen locks
went from O(n²) to O(n).

This CL fixes a different linear scan, when n goroutines are contending for
n/2 different locks that all hash to the same bucket in the semaphore table.
In this CL, n goroutines contending for n/2 unfortunately chosen locks
go from O(n²) to O(n log n). This case is much less likely, but any linear
scan eventually hurts, so we might as well fix it while the problem is fresh
in our minds.

The new test in this CL checks for both linear scans.

The effect of this CL on the sync benchmarks is negligible
(but it fixes the new test).

name                      old time/op    new time/op    delta
Cond1-48                     576ns ±10%     575ns ±13%     ~     (p=0.679 n=71+71)
Cond2-48                    1.59µs ± 8%    1.61µs ± 9%     ~     (p=0.107 n=73+69)
Cond4-48                    4.56µs ± 7%    4.55µs ± 7%     ~     (p=0.670 n=74+72)
Cond8-48                    9.87µs ± 9%    9.90µs ± 7%     ~     (p=0.507 n=69+73)
Cond16-48                   20.4µs ± 7%    20.4µs ±10%     ~     (p=0.588 n=69+71)
Cond32-48                   45.4µs ±10%    45.4µs ±14%     ~     (p=0.944 n=73+73)
UncontendedSemaphore-48     19.7ns ±12%    19.7ns ± 8%     ~     (p=0.589 n=65+63)
ContendedSemaphore-48       55.4ns ±26%    54.9ns ±32%     ~     (p=0.441 n=75+75)
MutexUncontended-48         0.63ns ± 0%    0.63ns ± 0%     ~     (all equal)
Mutex-48                     210ns ± 6%     213ns ±10%   +1.30%  (p=0.035 n=70+74)
MutexSlack-48                210ns ± 7%     211ns ± 9%     ~     (p=0.184 n=71+72)
MutexWork-48                 299ns ± 5%     300ns ± 5%     ~     (p=0.678 n=73+75)
MutexWorkSlack-48            302ns ± 6%     300ns ± 5%     ~     (p=0.149 n=74+72)
MutexNoSpin-48               135ns ± 6%     135ns ±10%     ~     (p=0.788 n=67+75)
MutexSpin-48                 693ns ± 5%     689ns ± 6%     ~     (p=0.092 n=65+74)
Once-48                     0.22ns ±25%    0.22ns ±24%     ~     (p=0.882 n=74+73)
Pool-48                     5.88ns ±36%    5.79ns ±24%     ~     (p=0.655 n=69+69)
PoolOverflow-48             4.79µs ±18%    4.87µs ±20%     ~     (p=0.233 n=75+75)
SemaUncontended-48          0.80ns ± 1%    0.82ns ± 8%   +2.46%  (p=0.000 n=60+74)
SemaSyntNonblock-48          103ns ± 4%     102ns ± 5%   -1.11%  (p=0.003 n=75+75)
SemaSyntBlock-48             104ns ± 4%     104ns ± 5%     ~     (p=0.231 n=71+75)
SemaWorkNonblock-48          128ns ± 4%     129ns ± 6%   +1.51%  (p=0.000 n=63+75)
SemaWorkBlock-48             129ns ± 8%     130ns ± 7%     ~     (p=0.072 n=75+74)
RWMutexUncontended-48       2.35ns ± 1%    2.35ns ± 0%     ~     (p=0.144 n=70+55)
RWMutexWrite100-48           139ns ±18%     141ns ±21%     ~     (p=0.071 n=75+73)
RWMutexWrite10-48            145ns ± 9%     145ns ± 8%     ~     (p=0.553 n=75+75)
RWMutexWorkWrite100-48       297ns ±13%     297ns ±15%     ~     (p=0.519 n=75+74)
RWMutexWorkWrite10-48        588ns ± 7%     585ns ± 5%     ~     (p=0.173 n=73+70)
WaitGroupUncontended-48     0.87ns ± 0%    0.87ns ± 0%     ~     (all equal)
WaitGroupAddDone-48         63.2ns ± 4%    62.7ns ± 4%   -0.82%  (p=0.027 n=72+75)
WaitGroupAddDoneWork-48      109ns ± 5%     109ns ± 4%     ~     (p=0.233 n=75+75)
WaitGroupWait-48            0.17ns ± 0%    0.16ns ±16%   -8.55%  (p=0.000 n=56+75)
WaitGroupWaitWork-48        1.78ns ± 1%    2.08ns ± 5%  +16.92%  (p=0.000 n=74+70)
WaitGroupActuallyWait-48    52.0ns ± 3%    50.6ns ± 5%   -2.70%  (p=0.000 n=71+69)

https://perf.golang.org/search?q=upload:20170215.1

Change-Id: Ia29a8bd006c089e401ec4297c3038cca656bcd0a
Reviewed-on: https://go-review.googlesource.com/37103
Run-TryBot: Russ Cox <rsc@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@golang locked and limited conversation to collaborators Feb 15, 2018