sync: add Mutex.TryLock #45435

TyeMcQueen · 2021-04-07T23:41:20Z

What version of Go are you using (`go version`)?

$ go version
go version go1.16.3 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (`go env`)?

go env Output

$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/tyem/Library/Caches/go-build"
GOENV="/Users/tyem/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/tyem/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/tyem/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GOVCS=""
GOVERSION="go1.16.3"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/4w/l3sttpnj00xctgf981g1t2cr0000gq/T/go-build590339154=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

Try to fix several bugs with locking contention in net/http/h2_bundle.go.

What did you expect to see?

A non-blocking way to test if a mutex is locked.

What did you see instead?

A years-old proposal that was eventually closed due to nobody laying out a convincing enough case to provide such a feature, #6123. I was unable to find a way to re-open an old, locked issue so I am raising it fresh now that I believe I can make a convincing case for the need for this feature.

I think fixing several years-old serious lock-up problems with h2 provides a good justification for this functionality. I believe I have a quite small and simple fix for the following open issues:

x/net/http2: client can hang forever if headers' size exceeds connection's buffer size and server hangs past request time
#23559
x/net/http2: pool deadlock
#32388
net/http: a "bad" https connection which uses Header.Set could potentially block all https requests more than 15 mins in production environments
#33006
x/net/http2: Blocked Write on single connection causes all future calls to block indefinitely
#33425

To lay out the justification for readers not familiar with any of these bugs...

You have a pool of network connections.
You have a sync.Mutex for preventing races when accessing this pool struct.
Each network connection also has a sync.Mutex for preventing races when accessing the connection struct.
Several operations on the pool need to query the state of one or more connections.
In particular, consider a function for making a new connection that needs to see if any existing connections can be reused.
This function must:
Lock the pool mutex.
Look up existing connections to the same endpoint.
Ask each connection if it is in a state where it can be used for a new connection.

That last question requires looking inside of the connection struct which would require locking the connection's mutex.
Despite a great deal of effort over years, the complexity of a network connection is such that it is still possible for the connection mutex to be locked when an operation that is usually fast is done and then that operation takes a long time, leaving the connection mutex locked for a non-trivial duration. It is desirable to prevent such situations but there have been so many cases of this and several of them could not be eliminated after years. So the risk of such is always going to be present.

While having such a problem impact a single connection object is often not much of a problem, having it make it impossible for any new network connections to be made is a very serious problem and we should prevent that. The problem is so serious that h2_bundle has been disabled in important situations just to avoid this.

If I am holding a lock on the pool's mutex and need to ask the question, "Is this particular connection available to take this request?", then, I believe, the connection's mutex being locked means that I should be able to immediately answer with "No, this connection is not available to take a new request, because it is busy doing something that requires its mutex to be locked. We hope this situation will progress very quickly to a state where the mutex is no longer locked and at that point we could ask the question again, but, no, right now, the connection is not ready and available."

But sync.Mutex does not currently allow me to code that, and I think it is completely reasonable to allow such. This can be easily done via the simple addition of:

// LockIfNoWait either locks m immediately if it is not in use, returning
// true, or, when a lock on m is already held, immediately returns false.
func (m *Mutex) LockIfNoWait() bool {
    if atomic.CompareAndSwapInt32(&m.state, 0, mutexLocked) {
        if race.Enabled {
            race.Acquire(unsafe.Pointer(m))
        }
        return true
    }
    return false
}

Such an addition then allows fixes to h2_bundle.go that are also very simple.

Now, even if we do a great deal of complicated work to eliminate all of the places where a connection mutex is held during a call that has a rare chance of taking a bit of time and even convince ourselves that we won't ever let such bugs ever creep in again, I would still prefer that "Is this connection available?" be answered immediately with "no" if the question is asked while the connection's mutex is held. Having the "open a new connection" function have to potentially wait for any number of connection mutexes to be unlocked just does not sound like a good strategy for making a pool that provides high performance when there are tons of connections being managed.

One of the places this has happened is holding the connection mutex when the connection is Close()d. It is completely reasonable for the connection struct to remain locked during such an operation as nothing useful can be done with a connection that is in the process of closing... except, of course, if you need to ask the connection a question and you are in a context where waiting for the answer would tie up all progress for anybody wanting to deal with network connections.

And the cases that cause this to happen are often sudden, ungraceful loss of network connectivity to a service. In such a case, it doesn't matter much if the mutex for connections to such a server are locked or not because you won't be able to make useful connections to the service anyway (better to try a new connection where that gets done in a way that doesn't hold the new connection's mutex locked). But having one service failing slowly block all other network connections is terrible.

Another approach would be to have two layers to a connection object and 2 mutexes. And simple state information would be handled by one mutex and complex information would be handled by another. You'd have to be very careful to always lock the slow mutex before locking the fast one. And be very careful about which things are protected by which mutex. What are the trade-offs of such an approach? Cons: A great increase in complexity, lots of ways to easily make mistakes about what is fast vs slow, pool operations may have to wait for any number of fast connection mutexes to be unlocked, and fixing the above long-lived serious bugs will take other people spending a long time working on them (vs. me already having a simple fix). Pros: You don't have to add a basic feature of mutexes that is provided by most other implementations and has been requested by several people and that I've seen lots of reasonable uses for. (I would have piled on to the request for this myself a few times before this but the issue was already there and locked so I just did a more disruptive design change to work around this basic lack.)

So I'm hoping that case is convincing so I can make more progress on fixing those bugs.

Thanks!

The text was updated successfully, but these errors were encountered:

ianlancetaylor · 2021-04-08T00:21:34Z

See also #27544.

Personally I think TryLock is a better name than LockIfNoWait.

TyeMcQueen · 2021-04-08T03:02:20Z

TryLock appears to be a pretty common name for this. There were objections to that name in #6123 so I endeavored to pick a name more in keeping with Go traditions. But I'm not picky about the name.

I suspect we don't want net/http depending on a 3rd-party module, a solution offered in #27544. Though I suppose an alternate mutex implementation (using a channel?) could be bundled into h2_bundle.go (and not exported). That would be better than not fixing h2 anyway.

bcmills · 2021-04-13T13:41:44Z

You have a pool of network connections.
You have a sync.Mutex for preventing races when accessing this pool struct.

As in #27544, why not replace the sync.Mutex with a 1-buffered channel?

You could even store the pool itself in the channel's buffer, so that the pool isn't even in scope when it isn't locked.
To lock the pool, receive it from the channel.
To unlock the pool, send it back to the channel.
To TryLock, receive in a select with a default case.

(I examined this pattern in much more detail starting around slide 52 of my talk on Rethinking Classical Concurrency Patterns. See in particular the queue implementation around slide 61.)

neild · 2021-04-13T16:26:59Z

As in #27544, why not replace the sync.Mutex with a 1-buffered channel?

By the same argument, why not replace all sync.Mutexes with channels?

bcmills · 2021-04-13T16:40:00Z

By the same argument, why not replace all sync.Mutexes with channels?

Honestly? I think that's a great idea. 😁

(But we can't remove sync.Mutex because of Go 1 compatibility, and because it would be a gratuitous breaking change. If we we starting Go over from scratch, I would certainly argue for its omission.)

TyeMcQueen · 2021-04-15T18:57:26Z

To avoid a lot of thrash in h2 code, I'd just bundle a replacement implementation of a Mutex using a 1-buffer channel and replace the sync.Mutex on each member with such and use TryLock()...

Why not include such an implementation in sync as a differently named Mutex? Having a Mutex implemented using a channel sounded like something desirable. Or just replace the guts of Mutex with a channel-based implementation?

bcmills · 2021-04-15T19:10:55Z

@TyeMcQueen, I would love to see a generic channel-based Mutex implementation. You could even imagine one using typed callbacks, based on the sync.Once signature:

package sync

type TypedMutex[T any] …

func (mu *TypedMutex[T]) Do(func(T) T)
func (mu *TypedMutex[T]) Try(context.Context, func(T) (T, error)) error

Adds a way to lock a Mutex optimistically, without blocking. ``` var mu Mutex if mu.TryLock() { // lock attempt succeeded mu.Unlock() } else { // attempt failed - do something else } ``` Refs: golang#45435 Refs: golang#6123

rsc · 2021-05-05T18:24:45Z

Locks are for protecting invariants. If the lock is held by someone else, there is nothing you can say about the invariant.

TryLock encourages imprecise thinking about locks; it encourages making assumptions about the invariants that may or may not be true. That ends up being its own source of races.

There are definitely locking issues in http2. Adding TryLock would let us paper over them to some extent, but even that would not be a real fix. It would be more like the better your 4-wheel-drive the farther out you get stuck.

I don't believe http2 makes a compelling case for TryLock.

rsc · 2021-05-05T18:26:38Z

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

TyeMcQueen · 2021-05-06T04:22:12Z

When the question is "Is it ready to be used for another connection?" when it is locked, the precise answer is actually "No, it is not ready".

And I believe that is the better approach even if we eliminated all cases of locks for individual connections being held around blocking operations. In theory, we could achieve that. In practice, it hasn't happened for years and I don't see it happening for years more, and once it is achieved, I expect that success to be short-lived (due to the complexity of an h2 connection).

Building a system where a minor mistake leads to serious lock-up is not good design. In some ways it may be purer; it is not better.

And at least many of the cases of "locking problems" for h2 connections are not really problems when dealing with a single connection (see late in my original, long proposal for more on this) and so I'm not convinced that eliminating them is worthwhile, or at least worth the effort required.

In my experience, preventing the use of techniques because you believe that coders with less skill, care, experience, or judgement may misuse them often leads to even worse approaches being chosen to work around this lack. So I prefer to document the better approaches and the reasons why a feature should not be used in some way, rather than just prohibit it. It leads to more learning, not just less frustration.

Shifting gears...

Having spent some time thinking about the several ways we could get a TryLock() available to h2, I think several of the approaches are reasonable. I think it would be cool to create a channel-based mutex implementation that prevents you from even accessing the object before you obtain the lock and allows you to use a context to even wait "for a while" when trying to obtain a lock. And, if that proves valuable enough, it would be cool to include that in the standard go library (perhaps so it can be used by other things in the standard go library). But I also think such would need time to prove that it deserves such inclusion. And I probably wouldn't replace sync.Mutex with that implementation. The simpler (and probably more efficient version) is probably worth having for when you don't need more.

But I now really do believe that even with alternatives out there in other libraries, this very specific, very simple addition to sync.Mutex is really something that should be added. (In the beginning, I was less convinced of that, more just wanting to fix serious problems in h2.)

It is quite easy to start using sync.Mutex because it is a perfect fit for your situation and then come to a place where "I want to do X, but if I can't do it right now, I don't want to do anything, and waiting would be a terrible choice". And it is a sign of respect for the users of our language to give them the option to use TryLock() to get their work done while they file a "technical debt" ticket to rework how locking is done because the design has shifted to the point that a simple Mutex is not the best fit any longer. Will this lead to that technical debt never getting fixed? Surely it sometimes will. But the better way to encourage such design improvements is documentation, not prohibition.

I've personally run into this situation twice already that I can remember. Forcing the more significant redesign just to get past this hurdle was not helpful to me. And it didn't even always lead to a better design, just one that didn't need TryLock().

There really are situations where Lock, Unlock, and TryLock is just exactly what you need. Those situations can also be satisfied with channels and semaphores and other more complex and powerful things. If you need to ask "Are you still busy?", that's a perfectly reasonable question to be able to ask. And if the thing you need to ask that of is just one thing and you also need a mutex around it and that mutex already perfectly captures "Yes, I'm busy", let people do that.

Just TryLock() is not enough to warrant a whole separate abstraction. This is reinforced by how easy it is to implement for sync.Mutex.

So I like having Lock, Unlock, and TryLock as one set of features over having Lock+Unlock in one object and Lock+Unlock+TryLock in another. And I like having such a simple thing, not just an all-singing, all-dancing thing that can also be used as a mutex.

rsc · 2021-05-12T17:15:20Z

It is easy to build a TryLock-able Mutex out of a Mutex and an atomic word. It is probably worth doing that in http2 to see if it solves your problem.

Adding TryLock just for http2 does not seem worthwhile. We would need a more compelling case.

DmitriyMV · 2021-05-13T10:33:40Z

@rsc I've seen variations of this across several commercial projects I've been involved in. And while I agree that TryLock is not an example of "good code" maybe it can be added just for the sake of preventing people from using such hacks. Although

It is easy to build a TryLock-able Mutex out of a Mutex and an atomic word

Maybe it would be a good idea to implement reference implementation inside x/sync?

rsc · 2021-05-13T15:45:59Z

Thinking more about this, there is one important benefit to building TryLock into Mutex compared to a wrapper:
failed TryLock calls wouldn't create spurious happens-before edges to confuse the race detector.

I'm still not convinced we should do this, but I put together a CL to see what it would look like at https://golang.org/cl/319769.

/cc @dvyukov for thoughts

gopherbot · 2021-05-13T15:45:59Z

Change https://golang.org/cl/319769 mentions this issue: sync: add Mutex.TryLock, RWMutex.TryLock, RWMutex.TryRLock

prattmic · 2021-05-13T15:59:29Z

FWIW, the gVisor project uses a modified sync package with TryLock and came up with an implementation almost 100% identical to @rsc's: https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sync/rwmutex_unsafe.go;l=109?ss=gvisor%2Fgvisor

neild · 2021-05-13T16:06:53Z

I think that TryLock is almost always an indication of sloppy thinking about invariants, but I'm not convinced that it's always a mistake. And I'd rather see a single TryLock implementation that can be reasoned about rather than many custom ones.

A channel-based implementation as @bcmills suggests is possible, but performs poorly in comparison. There's a reason we have sync.Mutex rather than just using channel for locking.

dvyukov · 2021-05-14T05:59:26Z

I don't have much to add. Yes, TryLock may be needed in some cases; yes, it's needed very rarely in Go; yes, part of the current uses is papering over sub-optimal design; yes, an implementation in std lib can cooperate with race detector, with starvation prevention logic in Mutex, with contention profiler; yes, adding TryLock to std lib will increase number of unjustified uses.

I've searched our internal code base for "TryLock() bool" and found:

gvisor
https://pkg.go.dev/v.io/x/lib/nsync#Mu.TryLock
https://github.com/trivago/tgo/blob/master/tsync/mutex.go#L53
and one internal implementation build on a channel
So it's not much, frankly I expected to find more. But I did not look how/where they are used.

networkimprov · 2021-05-15T20:57:32Z

I need TryRLock() for the mnm client app. I can explain the use case if asked.

It's not trivial to implement with a wrapper.

rsc · 2021-05-19T17:21:31Z

OK, I retract my objections. Everyone agrees it's unfortunate but sometimes necessary.

Does anyone else object to adding it?

rsc · 2021-05-26T18:14:19Z

Based on the discussion above, this proposal seems like a likely accept.
— rsc for the proposal review group

rsc · 2021-06-02T18:00:27Z

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— rsc for the proposal review group

kaey · 2021-06-03T04:07:18Z

Can it be exposed in x/sync instead?

cristaloleg · 2021-06-03T05:33:46Z

@kaey what is the point for that?

tandr · 2021-06-05T00:56:42Z

Re: https://golang.org/cl/319769

Will this make to 1.17, or it is too late, sorry?

ianlancetaylor · 2021-06-05T04:34:03Z

We are in the feature freeze for 1.17. This will be for 1.18.

alecrajeev · 2021-09-14T03:34:37Z

This is a good idea. I definitely can see my team using this for our internal tooling.

DmitriyMV · 2021-09-14T10:57:29Z

Kindly ping @rsc and @ianlancetaylor about https://golang.org/cl/319769

tbal999 · 2021-09-26T08:38:47Z

I like this idea.

seebs · 2021-10-13T21:20:45Z

Another potential advantage of TryLock is: If you TryLock something, and get false, you can report the circumstances and reasoning that led you to believe that a lock should have succeeded, even if there's nothing else you can do.

neild · 2021-10-13T22:19:41Z

FYI, I believe the HTTP/2 client bugs listed in the original issue (#23559, #32388, #33006, #33425) have all been fixed now, without using TryLock. This isn't an argument against adding TryLock, just a data point that it wasn't needed here.

as · 2021-10-14T20:13:17Z

I predict that the conditions under which you find this feature useful are not the conditions under which other users will use it. The primitive being non-trivial to implement correctly should not be justification for adding it to the standard library and encouraging others to use it.

The only utility of this feature is when you have a long-lived operation and a small set of workers infrequently racing to acquire the capability to do that operation. Not only is this trivial to implement with a buffered channel it is also extremely uncommon in a system that uses concurrency effectively.

gopherbot · 2021-11-13T22:38:16Z

Change https://golang.org/cl/363672 mentions this issue: sync: in TryLock try to acquire mutex even if state is not 0

For #45435 Change-Id: I728accd9a53c1826243f52aa04dc2a0a1dfdaadf Reviewed-on: https://go-review.googlesource.com/c/go/+/363672 Trust: Ian Lance Taylor <iant@golang.org> Run-TryBot: Ian Lance Taylor <iant@golang.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com>

gopherbot added this to the Proposal milestone Apr 7, 2021

gopherbot added the Proposal label Apr 7, 2021

ianlancetaylor added this to Incoming in Proposals (old) Apr 8, 2021

ianlancetaylor changed the title ~~proposal: func (m *sync.Mutex) LockIfNoWait() bool~~ proposal: sync: add Mutex.LockIfNoWait method Apr 8, 2021

rsc changed the title ~~proposal: sync: add Mutex.LockIfNoWait method~~ proposal: sync: add Mutex.TryLock May 5, 2021

rsc moved this from Incoming to Active in Proposals (old) May 5, 2021

rsc mentioned this issue May 5, 2021

proposal: review meeting minutes #33502

Open

rsc moved this from Active to Likely Accept in Proposals (old) May 26, 2021

rsc added the Proposal-FinalCommentPeriod label May 26, 2021

rsc moved this from Likely Accept to Accepted in Proposals (old) Jun 2, 2021

rsc changed the title ~~proposal: sync: add Mutex.TryLock~~ sync: add Mutex.TryLock Jun 2, 2021

rsc added the Proposal-Accepted label Jun 2, 2021

rsc modified the milestones: Proposal, Backlog Jun 2, 2021

bradfitz modified the milestones: Backlog, Go1.18 Sep 3, 2021

gopherbot closed this as completed in 645d078 Oct 29, 2021

dtelyukh mentioned this issue Nov 1, 2021

net/http: bundled x/net/http2.responseWriter hangs forever #42534

Open

wonjerry mentioned this issue Nov 23, 2021

Add trylock method to memory locker yorkie-team/yorkie#234

Closed

ianlancetaylor mentioned this issue Feb 24, 2022

api: audit for Go 1.18 #51333

Closed

s-urbaniak mentioned this issue May 12, 2022

Bug 2086519: Introduce Pod Security Admission Label Synchronization controller openshift/cluster-policy-controller#75

Merged

mrnugget mentioned this issue Jun 15, 2022

DB Backend: report explicit error when transactions are used concurrently sourcegraph/sourcegraph#37172

Merged

golang locked and limited conversation to collaborators Nov 13, 2022

gopherbot added the FrozenDueToAge label Nov 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sync: add Mutex.TryLock #45435

sync: add Mutex.TryLock #45435

TyeMcQueen commented Apr 7, 2021

ianlancetaylor commented Apr 8, 2021

TyeMcQueen commented Apr 8, 2021

bcmills commented Apr 13, 2021 •

edited

neild commented Apr 13, 2021

bcmills commented Apr 13, 2021

TyeMcQueen commented Apr 15, 2021

bcmills commented Apr 15, 2021

rsc commented May 5, 2021

rsc commented May 5, 2021

TyeMcQueen commented May 6, 2021

rsc commented May 12, 2021

DmitriyMV commented May 13, 2021

rsc commented May 13, 2021

gopherbot commented May 13, 2021

prattmic commented May 13, 2021

neild commented May 13, 2021

dvyukov commented May 14, 2021

networkimprov commented May 15, 2021

rsc commented May 19, 2021

rsc commented May 26, 2021

rsc commented Jun 2, 2021

kaey commented Jun 3, 2021

cristaloleg commented Jun 3, 2021

tandr commented Jun 5, 2021

ianlancetaylor commented Jun 5, 2021

alecrajeev commented Sep 14, 2021

DmitriyMV commented Sep 14, 2021

tbal999 commented Sep 26, 2021

seebs commented Oct 13, 2021

neild commented Oct 13, 2021

as commented Oct 14, 2021

gopherbot commented Nov 13, 2021

sync: add Mutex.TryLock #45435

sync: add Mutex.TryLock #45435

Comments

TyeMcQueen commented Apr 7, 2021

What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (go env)?

What did you do?

What did you expect to see?

What did you see instead?

ianlancetaylor commented Apr 8, 2021

TyeMcQueen commented Apr 8, 2021

bcmills commented Apr 13, 2021 • edited

neild commented Apr 13, 2021

bcmills commented Apr 13, 2021

TyeMcQueen commented Apr 15, 2021

bcmills commented Apr 15, 2021

rsc commented May 5, 2021

rsc commented May 5, 2021

TyeMcQueen commented May 6, 2021

rsc commented May 12, 2021

DmitriyMV commented May 13, 2021

rsc commented May 13, 2021

gopherbot commented May 13, 2021

prattmic commented May 13, 2021

neild commented May 13, 2021

dvyukov commented May 14, 2021

networkimprov commented May 15, 2021

rsc commented May 19, 2021

rsc commented May 26, 2021

rsc commented Jun 2, 2021

kaey commented Jun 3, 2021

cristaloleg commented Jun 3, 2021

tandr commented Jun 5, 2021

ianlancetaylor commented Jun 5, 2021

alecrajeev commented Sep 14, 2021

DmitriyMV commented Sep 14, 2021

tbal999 commented Sep 26, 2021

seebs commented Oct 13, 2021

neild commented Oct 13, 2021

as commented Oct 14, 2021

gopherbot commented Nov 13, 2021

What version of Go are you using (`go version`)?

What operating system and processor architecture are you using (`go env`)?

bcmills commented Apr 13, 2021 •

edited