
sync: shrink types in sync package #37142

Open · balasanjay opened this issue Feb 8, 2020 · 26 comments
Labels: NeedsDecision (Feedback is required from experts, contributors, and/or the community before a change can be made.)

@balasanjay (Contributor) commented Feb 8, 2020

Proposal: shrink types in sync package

Current API

The types in the sync package make use of semaphore operations provided by the runtime package. Specifically, they use the following two APIs:

semacquire(s *uint32): waits until *s > 0 and atomically decrements it.

semrelease(s *uint32): atomically increments *s and wakes a goroutine blocked in semacquire, if any.

Each semaphore logically manages a queue of goroutines. The sync types store their atomically-modified state separate from their semaphores; for this reason, semaphores have to support out-of-order operations (e.g. when unlocking a sync.Mutex, you could have a semrelease show up to the semaphore 'before' a corresponding semacquire). The separation into atomic state and semaphores also results in the sync types having a larger memory footprint.
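
For reference, these are the current (Go 1.14-era) field layouts behind the sizes discussed below; the comments paraphrase the ones in the standard library:

package sync

type Mutex struct {
	state int32  // atomically-updated state bits
	sema  uint32 // semaphore on which waiters queue
}

type Once struct {
	done uint32 // set once the function has run
	m    Mutex  // guards the slow path
}

type RWMutex struct {
	w           Mutex  // held if there are pending writers
	writerSem   uint32 // writers wait here for completing readers
	readerSem   uint32 // readers wait here for completing writers
	readerCount int32  // number of pending readers
	readerWait  int32  // number of departing readers
}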

Proposed new API

Let's consider a slightly different API for queueing and dequeueing goroutines.

semqueue(s *uint32, mask uint32, cmp uint32, bit uint8) bool: checks (*s & mask) == cmp; if false, it immediately returns false (indicating 'barging'). Otherwise, it puts the calling goroutine to sleep; if this goroutine is the first to sleep for the given s and bit, it also performs *s = *s | (1 << bit).

semdequeue(s *uint32, bit uint8, dequeueAll bool) int: wakes up one goroutine sleeping on a corresponding semqueue call with a matching s and bit (or all of them, if dequeueAll is true). If no goroutines remain sleeping, it unsets the corresponding bit in *s: *s = *s & ^(1 << bit). Returns the number of woken goroutines.

Notes about this API:
The names listed above are placeholders (other suggestions welcome).

All of the operations for the given function are atomic (by making use of a lock keyed on s).

Also, the real signatures will likely need to be more complicated to support mutex profiling, and FIFO vs LIFO queueing.
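
To make the proposed API concrete, here is a rough sketch of a 4-byte sync.Once built on it. Everything here is an assumption for illustration: the bit layout, the wrapper names, and the panic stubs standing in for the runtime. A real implementation would also need to handle panics in f, mutex profiling, etc.

package oncesketch

import "sync/atomic"

// Assumed thin wrappers over the proposed runtime API; these do not exist today.
func runtime_Semqueue(s *uint32, mask, cmp uint32, bit uint8) bool { panic("sketch only") }
func runtime_Semdequeue(s *uint32, bit uint8, dequeueAll bool) int { panic("sketch only") }

const (
	onceDone    uint32 = 1 << 0 // f has completed
	onceRunning uint32 = 1 << 1 // some goroutine is running f
	onceWaitBit uint8  = 2      // queue bit maintained by semqueue/semdequeue
)

// Once fits in one word: two state bits plus one queue bit.
type Once struct {
	state uint32
}

func (o *Once) Do(f func()) {
	if atomic.LoadUint32(&o.state)&onceDone != 0 {
		return // fast path: already ran
	}
	o.doSlow(f)
}

func (o *Once) doSlow(f func()) {
	// Try to become the goroutine that runs f.
	if atomic.CompareAndSwapUint32(&o.state, 0, onceRunning) {
		f()
		// Publish completion without clobbering the queue bit...
		for {
			old := atomic.LoadUint32(&o.state)
			if atomic.CompareAndSwapUint32(&o.state, old, (old|onceDone)&^onceRunning) {
				break
			}
		}
		// ...then wake everyone parked on the wait bit.
		runtime_Semdequeue(&o.state, onceWaitBit, true)
		return
	}
	// Someone else is running f: sleep until the done bit appears.
	// semqueue re-checks the condition under the semaphore lock, so a
	// wakeup cannot be missed; it returns false ("barging") once done is set.
	for atomic.LoadUint32(&o.state)&onceDone == 0 {
		runtime_Semqueue(&o.state, onceDone, 0, onceWaitBit)
	}
}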

Benefits

Given this kind of API, I believe we could shrink the types in the sync package, because the new API only ever modifies a single bit of an atomic word, leaving all the other bits available for atomic state. We can manage multiple logical queues using the same 4-byte atomic word, which handles types like sync.RWMutex that need one queue for readers and one for writers. Here are the possible savings from reimplementing various types in the sync package:

Type          Current size (bytes)  New size (bytes)
sync.Once     12                    4
sync.Mutex    8                     4
sync.RWMutex  24                    4

These are the ones I'm fairly confident about. I also think we can shrink sync.WaitGroup and sync.Cond (the former from 12 bytes to 4 bytes, and I think we can shave 28 bytes off the latter), but I'm less sure of these two, as I'm unfamiliar with their implementations.

Also, while this isn't the goal, I think this might also improve the performance of (*sync.RWMutex).Unlock. It currently calls semrelease repeatedly in a loop to wake readers, acquiring and releasing a semaphore lock on each iteration. The API above offers batch dequeueing, which would let us avoid a number of atomic operations.
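
For context, here is the loop in question, lightly trimmed from src/sync/rwmutex.go at the time of writing (race-detector hooks omitted):

func (rw *RWMutex) Unlock() {
	// Announce to readers there is no active writer.
	r := atomic.AddInt32(&rw.readerCount, rwmutexMaxReaders)
	if r >= rwmutexMaxReaders {
		throw("sync: Unlock of unlocked RWMutex")
	}
	// Unblock blocked readers, if any: one semrelease per blocked reader,
	// each of which acquires and releases the runtime semaphore lock.
	for i := 0; i < int(r); i++ {
		runtime_Semrelease(&rw.readerSem, false, 0)
	}
	rw.w.Unlock()
}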

Backwards Compatibility

This is only modifying internal details of the Go runtime, so from that perspective it should be backwards compatible. There are two other considerations here:

If any users are carefully sizing their types to avoid false sharing or to align with cache-line boundaries, changing the size of types in sync could be problematic for them.

There are users who use go:linkname to invoke semacquire and semrelease directly; see gVisor for an example. So even if we convert all uses in the standard library, we will likely want to keep these functions around for a while. The pattern is sketched below.
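
Roughly, such a pull-style linkname declaration looks like this (a sketch; the wrapper package and function names are assumptions, but sync.runtime_Semacquire and sync.runtime_Semrelease are the real symbols as of Go 1.14):

package semahack

import (
	_ "unsafe" // required for go:linkname
)

// Pull in the runtime's semaphore operations via their sync-package names.

//go:linkname semacquire sync.runtime_Semacquire
func semacquire(addr *uint32)

//go:linkname semrelease sync.runtime_Semrelease
func semrelease(addr *uint32, handoff bool, skipframes int)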

Edits

  • Reordered types in order of implementation simplicity
  • Increased size of sync.RWMutex from 4 to 8 bytes (see comments for explanation)
  • Decreased size of sync.RWMutex back down to 4 bytes (see comments for explanation)
@gopherbot gopherbot added this to the Proposal milestone Feb 8, 2020
@balasanjay (Contributor Author) commented:

Tagging owners of sync package:
@rsc @ianlancetaylor @dvyukov @aclements

@networkimprov commented Feb 9, 2020

It would help if any new implementation would also accommodate a future RWMutex.TryRLock(), which is not trivial to implement (unlike TryLock).

Discussion at https://groups.google.com/d/topic/golang-nuts/h_Dm-XtrlOA/discussion

@balasanjay (Contributor Author) commented:

@networkimprov That seems entirely orthogonal to this proposal (it could be implemented with the existing internal semaphore APIs or with the proposed new ones). This proposal also doesn't involve any public API changes. It would therefore seem best to file a separate proposal for that.

@networkimprov commented:

I wasn't suggesting you add a new API in this proposal, but flagging a missing feature of sync.RWMutex which any reimplementation should allow for.

@balasanjay (Contributor Author) commented:

Got it, thanks. I believe the implementation I have in mind could accommodate a TryRLock method.

@rsc (Contributor) commented Feb 12, 2020

@networkimprov, I don't believe we've decided that TryRLock is a good idea to add at all, so it seems not necessary to condition this proposal on support for it.

This is not a proposal in the sense that it does not create new API.
It definitely needs a discussion and decision from the owners of the package
(me, @dvyukov, @aclements is a good set).

Removing from the proposal process and changing to NeedsDecision.

@rsc rsc added NeedsDecision and removed Proposal labels Feb 12, 2020
@gopherbot gopherbot added Proposal and removed NeedsDecision labels Feb 12, 2020
@rsc rsc modified the milestones: Proposal, Unplanned Feb 12, 2020
@rsc rsc changed the title from "Proposal: shrink types in sync package" to "sync: shrink types in sync package" Feb 12, 2020
@rsc rsc added NeedsDecision and removed Proposal labels Feb 12, 2020
@dvyukov (Member) commented Feb 13, 2020

Are there any significant downsides? It seems this mostly improves things and doesn't change public interfaces; if so, it looks like a no-brainer to me as far as the decision goes. Somebody just needs to do it :)
I understand it may slightly increase code complexity, or maybe not (comparable complexity to the current code). It looks reasonable to me; moreover, the complexity is contained within the semaphore/mutex code.
Re false sharing, I don't think we should weigh this heavily. Nobody ever promised exact sizes, there are ways to resolve it, and hopefully careful users have build asserts for their assumptions about implementation details.

@balasanjay (Contributor Author) commented:

I'd be happy to implement this, assuming that owners are ok with the plan.

(Since I'm not a fan of chainsaw juggling, I'd want to first write the new implementations in Promela or PlusCal to make sure they're sound and that the APIs proposed above are usable.)

I don't know of any significant downsides. I don't have concrete implementations yet, so I don't have an answer for you regarding code complexity. Based on the implementation in my head, I think it'd be comparable.


One question: what is the maximum number of concurrent RLocks we need to support in RWMutex?

I realized that the original implementation I had in mind for RWMutex used 7 bits for control state, leaving only 25 bits for the reader count; this would cap the number of concurrent RLocks at 33 million, which feels like it might be too low. So for now, I've increased the RWMutex size to 8 bytes.

I think it's possible to compress the control state down to 5 bits (2 queue bits, 1 "writerStarving" bit, and 2 bits that record whether the mutex is unlocked/read-locked/write-locked), which raises the max to 134 million concurrent RLocks. If that's enough, then we should be able to use 4 bytes again.
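
For illustration, one way those 5 control bits could be packed; the names and positions below are hypothetical, not a settled layout:

const (
	rwReadersQueued  uint32 = 1 << 0 // queue bit: readers sleeping on this word
	rwWritersQueued  uint32 = 1 << 1 // queue bit: writers sleeping on this word
	rwWriterStarving uint32 = 1 << 2 // a writer has been waiting too long
	rwModeMask       uint32 = 3 << 3 // unlocked / read-locked / write-locked

	rwReaderShift = 5                         // reader count lives in the remaining 27 bits
	rwMaxReaders  = 1<<(32-rwReaderShift) - 1 // 134,217,727 concurrent RLocks
)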

@dvyukov (Member) commented Feb 13, 2020

I don't think we have an exact number.
I hope nobody blocks more than 32M goroutines on a single mutex :) However, this number is close to the maximum number of goroutines users can run today. And assuming all worker goroutines rlock some mutex and there is one episodic writer, the scenario does not look completely impossible.
Perhaps it's possible to provide some kind of safe but slow fallback? E.g. make all excess readers spin and sleep until the number of readers drops.

@bcmills (Contributor) commented Feb 13, 2020

what is the maximum number of concurrent RLocks we need to support in RWMutex?

Perhaps it's possible to provide some kind of safe but slow fallback?

sync.RWMutex already scales very poorly anyway (#17973), so it is unlikely that many folks are using a large number of concurrent readers, and a safe-but-slow fallback seems fine for now.

(I notice that @balasanjay is the author of CL 215361, so this is probably not new information, but it is perhaps useful to other readers of this thread.)

@balasanjay (Contributor Author) commented:

I am now pretty sure that 5 bits of control state are enough. I think 134 million concurrent RLocks should probably be enough for everybody.

If anyone disagrees, one possible solution is to have readers queue even when the mutex is in read-locked mode; RUnlocks that see readers queued with no writers queued would then wake a single reader. That overflow behavior seems difficult to model without a state-space explosion, though, so I'd rather avoid it. If we do need to avoid overflow, then representing an RWMutex in two words is probably the simplest solution.

@dvyukov One other downside: the current RLock implementation's fast path is an AddUint32. I think the new implementation would need to increment the reader count with a CAS loop, i.e. only increment the reader count if the mutex stays in read-locked mode, as in the sketch below.
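
A minimal sketch of that CAS-loop fast path, under assumed names (the layout, rwSlowMask, rwReaderShift, and rlockSlow are all hypothetical stand-ins):

package rwsketch

import "sync/atomic"

const (
	rwWritersQueued  uint32 = 1 << 1
	rwWriterStarving uint32 = 1 << 2
	rwWriteLocked    uint32 = 1 << 3
	// Any of these bits forces the slow path.
	rwSlowMask = rwWritersQueued | rwWriterStarving | rwWriteLocked

	rwReaderShift = 5 // readers counted above the control bits
)

type RWMutex struct {
	state uint32
}

func (rw *RWMutex) rlockSlow() { /* queue via semqueue; omitted */ }

func (rw *RWMutex) RLock() {
	for {
		old := atomic.LoadUint32(&rw.state)
		if old&rwSlowMask != 0 {
			rw.rlockSlow() // a writer is active, starving, or queued
			return
		}
		// Add one reader only if the word (and thus the control state)
		// is unchanged since we looked.
		if atomic.CompareAndSwapUint32(&rw.state, old, old+1<<rwReaderShift) {
			return
		}
	}
}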

@CAFxX (Contributor) commented Feb 13, 2020

Unless this solves actual scalability problems (the benefits section does not claim much in this regard; the only benefits listed are using less memory and making the Unlock slow path faster), I would recommend ensuring we don't regress the performance of the fast paths.

E.g. making the Unlock slow path a bit faster at the expense of the fast path does not seem like a worthwhile direction.

@balasanjay (Contributor Author) commented:

Fair enough; I'll do some work to quantify the difference between an AddUint32-based fast path and a CAS-loop-based fast path (by making the current implementation use a CAS loop).

The fast path is only actually fast when there is no contention, and in that case I'd expect a Load+CAS to be roughly as fast as an atomic add. FWIW, I did check that both Folly's SharedMutex and Abseil's Mutex use a Load+CAS for their fast paths.

To be clear, the benefit I'm after is simply reducing the space used by the sync types; when writing cache-line-optimized data structures, you don't have much room to spare and are usually forced into terrible bit-packing hacks. Using 24 bytes for an RWMutex makes this much harder than necessary.

@balasanjay (Contributor Author) commented:

Hmm, I think it might be impossible to make an inlineable fast path using a Load+CAS. This function is marked as not inlineable:

func (rw *RWMutex) rlock() {
	// Fast path: CAS in one more reader (in this sketch, readers are
	// counted above the low control bits, so one reader is +16).
	rc := atomic.LoadInt32(&rw.readerCount)
	if atomic.CompareAndSwapInt32(&rw.readerCount, rc, rc+16) {
		return
	}

	rw.rlockSlow()
}

This has a cost of 82, and I can't seem to get it below 80. And any real implementation would need slightly more in here (e.g. a test of rc & constantMask == constantValue before the CAS). That's rather unfortunate. I suppose I could add a hack to the compiler to increase the inlining limit for certain functions in the sync package, but that feels somewhat gross.

@OneOfOne (Contributor) commented:

@balasanjay #17566 and #22310 are slightly related to that last comment.

@josharian (Contributor) commented:

@balasanjay I think you should hack the compiler temporarily to inline this, in order to make progress. If it turns out the Load+CAS isn't performant enough, no harm done. And if it is a viable path forward, we can focus our attention on getting it past the execrable inliner.

E.g. I just noticed that we're charging a budget of 2 for intrinsic calls, one for the intrinsic and one for the CALLFUNC. But given that they're assumed to be comparable in cost to an arithmetic operation, we should just assign them a cost of 1. I suspect this is an oversight/accident. (Insofar as any of this can be called principled.) Fixing this gets your function inlined, barely. :)

diff --git a/src/cmd/compile/internal/gc/inl.go b/src/cmd/compile/internal/gc/inl.go
index 48c7de327d..4379be18ba 100644
--- a/src/cmd/compile/internal/gc/inl.go
+++ b/src/cmd/compile/internal/gc/inl.go
@@ -321,7 +321,7 @@ func (v *hairyVisitor) visit(n *Node) bool {
                }
 
                if isIntrinsicCall(n) {
-                       v.budget--
+                       // Treat like any other node.
                        break
                }

I'll plan to mail this as a CL during Go 1.15 and see whether it flies.

In the meantime, please forge ahead using whatever means necessary. :)

@balasanjay (Contributor Author) commented:

Alrighty, I benchmarked AddInt32 vs a Load+CAS (both a branch-before variant, and a branch-after variant), and it appears I was wrong to be optimistic:

name              time/op
Add               5.91ns ± 4%
AddCASBranchy     10.4ns ± 8%
AddCASBranchLess  11.0ns ± 3%
Code:

package sync_test

import (
	"sync/atomic"
	"testing"
)

func BenchmarkAdd(b *testing.B) {
	foo := int32(0)
	for i := 0; i < b.N; i++ {
		nv := atomic.AddInt32(&foo, 2)
		if nv&1 == 0 {
			continue
		}
		panic("slow path")
	}
	_ = foo
}

func BenchmarkAddCASBranchy(b *testing.B) {
	foo := int32(0)
	for i := 0; i < b.N; i++ {
		ov := atomic.LoadInt32(&foo)
		if ov&1 == 0 && atomic.CompareAndSwapInt32(&foo, ov, ov+2) {
			continue
		}
		panic("slow path")
	}
	_ = foo
}

func BenchmarkAddCASBranchLess(b *testing.B) {
	foo := int32(0)
	for i := 0; i < b.N; i++ {
		ov := atomic.LoadInt32(&foo) &^ 1
		if atomic.CompareAndSwapInt32(&foo, ov, ov+2) {
			continue
		}
		panic("slow path")
	}
	_ = foo
}

That's unfortunate. I think I know how to implement RLock with an AddUint32-based fast path and a "fuzzy" count of readers, but it's not quite as straightforward as what I had in mind.

I'll try to spend some time this weekend writing a model of my idea to validate it.

@balasanjay (Contributor Author) commented:

I wrote a model of sync.Mutex using the proposed new API, and simulated all interleavings for 4 concurrent goroutines. I successfully verified safety (specifically that no two goroutines were ever in the critical section at the same time).

I then tried to verify starvation-freedom for the model, but even with 2 goroutines, the model checker quickly found a trivial counter-example. I think this counter-example is an issue even with the current Mutex implementation. Essentially, the sequence of events is:

  • W1 locks mutex
  • W2 tries to lock mutex, finds it locked and queues
  • W1 unlocks mutex, waking W2 (but W2 isn't running yet)
  • W1 locks mutex (barging ahead of W2)
  • W2 tries to lock mutex, finds it locked and decides to enable "starvation mode" for the mutex
  • W1 unlocks mutex (using fast-path unlock CAS)
  • W2's starvation mode CAS fails (because the mutex was unlocked)
  • W1 locks mutex (using fast-path lock CAS)
  • W2 tries to lock mutex, finds it locked and decides to enable "starvation mode" for the mutex

We're in a loop, so the strong starvation freedom property fails to verify for goroutine W2 (the property I was trying to verify was "if a goroutine starts trying to lock the mutex, then that goroutine will eventually acquire the mutex").

That said, this execution trace seems fairly unrealistic. I don't think we've seen starvation of this sort with the current implementation in practice (which, AFAICT, admits an exactly analogous starvation event).

I tried a weaker starvation freedom property ("if a goroutine starts trying to lock the mutex, then some goroutine will eventually acquire the mutex") and verified that for 3 goroutines. That gives me confidence that there isn't some sequence of events which leaves the mutex in a corrupt state that can never be locked again. I might leave 4 goroutines running overnight (liveness properties are much slower to check).

I'll work on adding read-locks to the model next.

Here's the model so far:
rwmutex.tla
rendered to pdf

@CAFxX (Contributor) commented Feb 16, 2020

W2 tries to lock mutex, finds it locked and decides to enable "starvation mode" for the mutex
W1 unlocks mutex (using fast-path unlock CAS)

Isn't this impossible in the current implementation? If starving, we will fall back to the slow path, since new will not be 0.

go/src/sync/mutex.go, lines 186 to 191 at commit 6917529:

new := atomic.AddInt32(&m.state, -mutexLocked)
if new != 0 {
	// Outlined slow path to allow inlining the fast path.
	// To hide unlockSlow during tracing we skip one extra frame when tracing GoUnblock.
	m.unlockSlow(new)
}

@balasanjay (Contributor Author) commented:

If the mutex were successfully put into starvation mode, then yep, it would prevent the fast path unlock.

However, what I'm saying is that W2 decides to enable starvation mode but has not actually done so yet.
Then, W1 unlocks the mutex (using the fast-path).
Then W2 actually tries to CAS in the starvation mode bit; this CAS to enable starvation mode will fail (because W1 unlocked the mutex).

In this trace, the mutex never successfully gets put into starvation mode.

@balasanjay (Contributor Author) commented:

Alrighty, I have a model of an RWMutex implementation.

I think it ticks all the boxes: it has the same fast paths, handles writer starvation, alternates between writers and batches of readers, and fits in a single uint32 (assuming we're ok with a max of ~134 million readers). As a nice-to-have, it stores 3 states in 2 bits, leaving 1 spare state for use by a future solution to #17973. I believe we could implement Mutex in terms of this RWMutex; if no one ever calls any of the reader APIs, I think it'll behave basically identically to today's Mutex.

One issue I ran into is that the semdequeue API I proposed above appears to be a bit too simple. In particular, when a writer is handing off the lock to a waiting batch of readers, it wants to atomically switch the state of the mutex from "write-locked" to "read-locked", set the reader count to the number of dequeued goroutines, and clear the flag indicating that there are readers queued. The API I originally proposed only does the last of these.

I think it's possible to work around this by letting the reader count go "negative", but that means reasoning about out-of-order events again.

I think it's simpler to split the dequeue API in two to accommodate this use-case cleanly. The first is semdequeueSingle, pretty much the same as what I described before: it dequeues a single goroutine and, if it's the last one in the queue, unsets the waiting bit (we don't need anything fancier for this).

The other is semdequeueAll, which wakes all goroutines and, crucially, before unsetting the queue bit invokes a passed-in function pointer to compute a new value for the atomic word, given the old value and the number of dequeuing goroutines. It takes the output of that function, clears the queue bit from it, and then CASes that final value into place (all of this in a loop, until the CAS succeeds). This hook lets us atomically wake the goroutines and hand the mutex off to them (it's atomic because we hold the runtime-internal mutex the whole time, so no new readers can queue up while we're doing this). I admit it's a tiny bit sketchy to call arbitrary callbacks while holding a runtime lock; if folks prefer, we can hard-code the behavior we want directly in the runtime (we'd likely pass a flag to enable the "RWMutex" dequeue behavior, because I suspect that when we get around to migrating WaitGroup or Cond, they'll need slightly different semantics, since they probably don't need to store a count in the same way). The two signatures might look like the sketch below.
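
A sketch of the split API, with hypothetical names and panic stubs standing in for the runtime implementation:

// semdequeueSingle wakes one goroutine queued on (s, bit); if it was the
// last sleeper, it also clears the queue bit in *s.
func semdequeueSingle(s *uint32, bit uint8) int { panic("sketch only") }

// semdequeueAll wakes every goroutine queued on (s, bit). While still
// holding the runtime-internal semaphore lock, it calls update with the
// old word and the number of woken goroutines, clears the queue bit in
// the result, and CASes it into place, retrying until the CAS succeeds.
func semdequeueAll(s *uint32, bit uint8, update func(old uint32, woken int) uint32) int {
	panic("sketch only")
}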


I've verified safety properties for the model with 3 readers and 1 writer, with 1 reader and 3 writers, and with 2 readers and 2 writers (which found a number of errors in the model, I should note). Anything bigger looked like it would take too long on my laptop.

I want to run a simulation with 3 readers and 2 writers (after 3 hours of simulation, it had only gotten ~60 steps deep, and I think it'll finally end up being something like 100 steps). I'll try that on my workstation sometime this week, or failing that I guess I'll need to figure out how to make the distributed simulation version work.

I should probably try to optimize the model a bit. I wrote it like pseudocode, so each function is composed of many atomic steps. I bet if I wrote it more like math, I could reduce the number of atomic steps in each function, which might dramatically reduce the state space that needs to be explored.

Updated model files:
rwmutex.tla
rwmutex.pdf

@dvyukov (Member) commented Feb 17, 2020

I would say it's OK for semdequeue to know about rwmutex details and just do the necessary update. It's nice to have it generic, but we don't have to; it's a completely internal API. We could make it accept an "enum mode" parameter to support rwmutex/cond/waitgroup/etc.
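
Concretely, that enum-mode variant might look something like this (names hypothetical):

// semDequeueMode selects hard-coded, per-type wakeup bookkeeping in the
// runtime, instead of invoking a user-supplied callback.
type semDequeueMode uint8

const (
	semDequeueDefault        semDequeueMode = iota // just clear the queue bit
	semDequeueRWMutexHandoff                       // also switch to read-locked and store the woken-reader count
)

func semdequeueAll(s *uint32, bit uint8, mode semDequeueMode) int { panic("sketch only") }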

@balasanjay (Contributor Author) commented:

Cool, 3 readers and 2 writers verified successfully. The number of states also grew dramatically:

1972646627 states generated, 435337469 distinct states found, 0 states left on queue.

Any other objections to me working on this when the tree's open?

@balasanjay (Contributor Author) commented:

@rsc @aclements Gentle ping.

(If this sounds good, I'm hoping to work on it next weekend and have CLs out for Monday)

@cristaloleg commented:
@balasanjay Kind ping: what is the status? (Just curiosity, no pressure.)

@balasanjay (Contributor Author) commented:

Thanks for the ping, I completely lost track of this work.

As I recall, I wrote about 20% of the code (some of the runtime-internal logic) before COVID became serious and I dropped this to focus on more pressing things.

I need to ramp back up on this and dig out that repo. I'll see if I can do that in the next month, so I can catch the beginning of the next release cycle.
