
runtime: lock order violation between gscan and profInsert after CL 544195 #64706

Closed
prattmic opened this issue Dec 13, 2023 · 12 comments
Labels: compiler/runtime, NeedsInvestigation
Milestone: Go1.23

@prattmic
Member

#!watchflakes
post <- builder ~ `.*-staticlockranking` && log ~ `lock ordering problem` && log ~ `profInsert`

https://build.golang.org/log/9ef94ae07bd07605e17fcd241ef0d38d5064b41f
https://build.golang.org/log/3f80d959a6cc683334f8521b8a3b002dda1bd6b3
https://build.golang.org/log/a5bfdd06fa69808fe6378b412610e65b06ddd8ff

Taking profInsert after gscan is indeed an explicit violation based on our defined lock order: https://cs.opensource.google/go/go/+/master:src/runtime/mklockrank.go;l=125-138;drc=23c0d30244d5e38d928e68661bf39458eb85d2b2.

That said, I am currently confused about the entire STACKGROW category (long day, perhaps). As far as I can tell, the gscan "lock" is only ever held from the system stack, so I don't understand how any of these other locks could "grow the stack", or what exactly we are protecting against.
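
To make the violation concrete, here is a minimal, self-contained sketch of what a static lock-rank check enforces. This is stand-in code, not the runtime's checkRanks; the rank values 35 (gscan) and 31 (profInsert) are taken from the failure output below, and the allowed-ordering table here is hypothetical.

```go
package main

import "fmt"

type lockRank int

const (
	rankProfInsert lockRank = 31 // values as printed in the failure logs below
	rankGscan      lockRank = 35
)

// allowedAfter lists, for each rank, the ranks that may be acquired while it
// is held. In the real runtime this partial order is generated from
// mklockrank.go; here it is deliberately empty for these two ranks.
var allowedAfter = map[lockRank][]lockRank{
	rankGscan:      {},
	rankProfInsert: {},
}

type m struct {
	held []lockRank
}

// acquire checks every rank already held before recording the new one,
// which is conceptually what the rank checker does on each lock acquisition.
func (mp *m) acquire(r lockRank) {
	for _, h := range mp.held {
		ok := false
		for _, a := range allowedAfter[h] {
			if a == r {
				ok = true
				break
			}
		}
		if !ok {
			panic(fmt.Sprintf("lock ordering problem: %d held, acquiring %d", h, r))
		}
	}
	mp.held = append(mp.held, r)
}

func main() {
	var mp m
	mp.acquire(rankGscan)
	mp.acquire(rankProfInsert) // panics, mirroring the fatal error in the logs below
}
```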

cc @rhysh @mknyszek

@prattmic added the NeedsInvestigation label Dec 13, 2023
@prattmic added this to the Go1.22 milestone Dec 13, 2023
@gopherbot added the compiler/runtime label Dec 13, 2023
@gopherbot

Found new dashboard test flakes for:

#!watchflakes
post <- builder ~ `.*-staticlockranking` && log ~ `lock ordering problem` && log ~ `profInsert`
2023-12-01 19:20 linux-amd64-staticlockranking go@5a2161ce runtime (log)
79554  ======
0 : gscan 35 0x0
1 : profInsert 31 0xc51f70
fatal error: lock ordering problem

runtime stack:
runtime.throw({0x8be74f?, 0x0?})
	/workdir/go/src/runtime/panic.go:1023 +0x5c fp=0x7f435fffe6c8 sp=0x7f435fffe698 pc=0x44383c
runtime.checkRanks(0xc000006ea0, 0x0?, 0xbc1c80?)
	/workdir/go/src/runtime/lockrank_on.go:162 +0x236 fp=0x7f435fffe728 sp=0x7f435fffe6c8 pc=0x4129f6
...
runtime_test.TestGCTestMoveStackOnNextCall(0xc000d40000)
	/workdir/go/src/runtime/gc_test.go:211 +0x1c fp=0xc0007a0f70 sp=0xc0007a0f20 pc=0x74997c
testing.tRunner(0xc000d40000, 0x8e3380)
	/workdir/go/src/testing/testing.go:1689 +0xfb fp=0xc0007a0fc0 sp=0xc0007a0f70 pc=0x521e1b
testing.(*T).Run.gowrap1()
	/workdir/go/src/testing/testing.go:1742 +0x25 fp=0xc0007a0fe0 sp=0xc0007a0fc0 pc=0x522e45
runtime.goexit({})
	/workdir/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0007a0fe8 sp=0xc0007a0fe0 pc=0x486f61
created by testing.(*T).Run in goroutine 1
	/workdir/go/src/testing/testing.go:1742 +0x390
2023-12-06 17:59 linux-amd64-staticlockranking go@3f2bf706 runtime (log)
81122  ======
0 : gscan 35 0x0
1 : profInsert 31 0xc56110
fatal error: lock ordering problem

runtime stack:
runtime.throw({0x8c0a0e?, 0x0?})
	/workdir/go/src/runtime/panic.go:1023 +0x5c fp=0x7f5fbfffe6c8 sp=0x7f5fbfffe698 pc=0x443f3c
runtime.checkRanks(0xc00022e680, 0x0?, 0xbc5c80?)
	/workdir/go/src/runtime/lockrank_on.go:162 +0x236 fp=0x7f5fbfffe728 sp=0x7f5fbfffe6c8 pc=0x413236
...
runtime_test.TestNetpollWaiters(0xc000c9a1a0)
	/workdir/go/src/runtime/crash_test.go:893 +0x1f fp=0xc000a8c770 sp=0xc000a8c6f0 pc=0x73f23f
testing.tRunner(0xc000c9a1a0, 0x8e5d80)
	/workdir/go/src/testing/testing.go:1689 +0xfb fp=0xc000a8c7c0 sp=0xc000a8c770 pc=0x52289b
testing.(*T).Run.gowrap1()
	/workdir/go/src/testing/testing.go:1742 +0x25 fp=0xc000a8c7e0 sp=0xc000a8c7c0 pc=0x5238c5
runtime.goexit({})
	/workdir/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc000a8c7e8 sp=0xc000a8c7e0 pc=0x4879e1
created by testing.(*T).Run in goroutine 1
	/workdir/go/src/testing/testing.go:1742 +0x390
2023-12-13 00:22 linux-amd64-staticlockranking go@400e24a8 runtime (log)
81226  ======
0 : gscan 35 0x0
1 : profInsert 31 0xc561d0
fatal error: lock ordering problem

runtime stack:
runtime.throw({0x8c0cb9?, 0x0?})
	/workdir/go/src/runtime/panic.go:1023 +0x5c fp=0x7f875effc6c8 sp=0x7f875effc698 pc=0x443f1c
runtime.checkRanks(0xc0001a2fc0, 0x0?, 0xbc5c80?)
	/workdir/go/src/runtime/lockrank_on.go:162 +0x236 fp=0x7f875effc728 sp=0x7f875effc6c8 pc=0x413236
...
runtime_test.TestPanicTraceback(0xc0000dbba0)
	/workdir/go/src/runtime/crash_test.go:496 +0x2a fp=0xc0007d7f70 sp=0xc0007d7e38 pc=0x73c62a
testing.tRunner(0xc0000dbba0, 0x8e6288)
	/workdir/go/src/testing/testing.go:1689 +0xfb fp=0xc0007d7fc0 sp=0xc0007d7f70 pc=0x52295b
testing.(*T).Run.gowrap1()
	/workdir/go/src/testing/testing.go:1742 +0x25 fp=0xc0007d7fe0 sp=0xc0007d7fc0 pc=0x523985
runtime.goexit({})
	/workdir/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0007d7fe8 sp=0xc0007d7fe0 pc=0x487aa1
created by testing.(*T).Run in goroutine 1
	/workdir/go/src/testing/testing.go:1742 +0x390

watchflakes

@bcmills
Contributor

bcmills commented Dec 14, 2023

Should this be a release blocker? Lock inversion in the runtime seems like a pretty severe bug. 😅

(Or is this an existing issue that was only recently exposed by testing?)

@prattmic
Member Author

It should be a release blocker until we understand it better. It is a new issue, not an existing one that was merely uncovered by testing.

@rhysh
Contributor

rhysh commented Dec 14, 2023

I'd thought this was an issue with instrumentation / reporting, but it's not. Although the (new in Go 1.22) code for adding runtime-internal lock contention events to the mprof hash table is guarded by a check so that it runs only when the last lock is released (prof.store(), checking gp.m.locks == 1 because its caller has yet to decrement the count down to zero; see gp.m.mLockProfile.recordUnlock(l)), the gscan lockrank appears not to be an actual "lock/unlock" mutex. The lock rank is taken directly (acquireLockRank(lockRankGscan)) without, it seems, resulting in any change to gp.m.locks.

I'm not up to speed on gscan / casgstatus / castogscanstatus.

Does that look like an accurate description, @prattmic ? If so, then it seems the lock rank builder is trying to warn us of a real problem.
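
A hedged, self-contained sketch of the guard described above, using stand-in types rather than the runtime's real ones (the `pending` field and the exact call structure are illustrative, not necessarily the actual source):

```go
package main

import "fmt"

// Stand-in for an M. In the runtime, acquirem/releasem maintain the lock
// count, and the contention profiler consults it before touching mprof.
type m struct {
	locks   int  // number of locks this M currently holds
	pending bool // a contention event waiting to be recorded (illustrative)
}

// store stands in for prof.store(): inserting the pending record requires
// taking profInsertLock.
func (mp *m) store() {
	fmt.Println("taking profInsertLock to insert the contention record")
}

// recordUnlock mirrors the guard described above: it runs on unlock before
// the lock count is decremented, so locks == 1 means this is the outermost
// lock being released and it is considered safe to flush the record.
func (mp *m) recordUnlock() {
	if mp.locks == 1 && mp.pending {
		mp.store()
		mp.pending = false
	}
}

func main() {
	mp := &m{locks: 1, pending: true}
	mp.recordUnlock() // flushes: this is the outermost unlock
	mp.locks--        // the caller decrements afterwards
}
```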

@mknyszek
Contributor

Come to think of it, I think I may have been the one who added the Gscan lock rank. It was because we'd encountered a few Gscan-related deadlocks that were hard to debug, so we wanted to avoid that in the future. Gscan acts as a kind of spinlock on ownership of the G, but indeed, it does not increment gp.m.locks. Perhaps it should.
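
A rough, self-contained sketch of that "spinlock on ownership of the G" behaviour (stand-in types; the real runtime uses castogscanstatus/casgstatus on gp.atomicstatus, and the bit value here is illustrative). The point is that nothing in this path touches an m.locks counter:

```go
package main

import (
	"runtime"
	"sync/atomic"
)

const gscanBit uint32 = 0x1000 // illustrative stand-in for the runtime's scan bit

type g struct {
	atomicstatus atomic.Uint32
}

// acquireScan spins until it sets the scan bit, taking ownership of the G
// much like a spinlock. Note that no lock count is incremented here.
func (gp *g) acquireScan() {
	for {
		old := gp.atomicstatus.Load()
		if old&gscanBit == 0 && gp.atomicstatus.CompareAndSwap(old, old|gscanBit) {
			return
		}
		runtime.Gosched()
	}
}

// releaseScan clears the scan bit, releasing ownership of the G.
func (gp *g) releaseScan() {
	for {
		old := gp.atomicstatus.Load()
		if gp.atomicstatus.CompareAndSwap(old, old&^gscanBit) {
			return
		}
	}
}

func main() {
	var gp g
	gp.acquireScan()
	gp.releaseScan()
}
```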

@prattmic
Member Author

I haven't yet been able to come up with a way that this could cause an actual deadlock, but I agree that having gscan increment gp.m.locks seems like it would fix this problem if there is one.

@mknyszek
Contributor

mknyszek commented Dec 18, 2023

Oof. I think we narrowly miss a real deadlock issue here and it turns out to just be a limitation of our lock modeling.

TL;DR: This lock modeling issue is only present when GODEBUG=runtimecontentionstacks=1 is set. TestRuntimeLockMetricsAndProfile is the only place where it's enabled, and it's not enabled by default, so https://go.dev/cl/550775 resolves this issue on the builders for now.

Here's an almost problematic ordering:

(1) G1 is running on T1 and calls into blockevent, say for a mutex. It eventually calls stkbucket, acquires profInsertLock, and realizes G1 is out of stack space. G1 goes to acquire its own Gscan bit.
(2) Meanwhile, the GC is enabled and T2 acquires G1's scan bit in order to scan its stack. It then acquires and releases some lock; the unlock path tries to acquire profInsertLock (see the sketch below).
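
Sketched with plain sync.Mutex stand-ins (not runtime internals; the sleeps only force the interleaving), the two steps would be a classic ABBA deadlock if they could both occur: T1 holds profInsertLock and waits for G1's Gscan bit, while T2 holds the Gscan bit and waits for profInsertLock.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var profInsertLock, gscanBit sync.Mutex // stand-ins, not the runtime's locks

	// T1: G1 inside stkbucket, holding profInsertLock, then needing to grow
	// its stack, which requires its own Gscan bit.
	go func() {
		profInsertLock.Lock()
		time.Sleep(10 * time.Millisecond) // let T2 take the Gscan bit first
		gscanBit.Lock()                   // blocks forever in this ordering
		gscanBit.Unlock()
		profInsertLock.Unlock()
	}()

	// T2: the GC scanning G1's stack, holding the Gscan bit, then releasing
	// some other lock, whose unlock path wants profInsertLock.
	done := make(chan struct{})
	go func() {
		gscanBit.Lock()
		time.Sleep(10 * time.Millisecond)
		profInsertLock.Lock() // blocks forever: ABBA inversion
		profInsertLock.Unlock()
		gscanBit.Unlock()
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("no deadlock")
	case <-time.After(time.Second):
		fmt.Println("deadlocked, which is the hazard the lock ranking flags")
	}
}
```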

(2) can't actually happen, but the reasons are very subtle. There are only three cases where the GC tries to acquire a goroutine's Gscan bit.

One is after asynchronously preempting, but asynchronous preemption will back out early if any locks are held by the target goroutine, so that can't happen.

Another case is where the Gscan bit is acquired and it's on a running goroutine in anticipation of a preemption, here: https://cs.opensource.google/go/go/+/master:src/runtime/preempt.go;l=199?q=preempt.go&ss=go%2Fgo. Luckily nothing happens that could cause a lock to be acquired between the acquire and release of the Gscan bit.

Finally, the Gscan bit can be acquired for a blocked goroutine, but if that were the case for G1 above, then (1) couldn't have been happening, since the goroutine needs to be running to acquire its own scan bit.

One other possibly problematic ordering is one where two GC threads compete for the same Gscan bit. One acquires and releases a lock while that bit is held, causing profInsertLock to be acquired. The other somehow starts trying to scan a goroutine while profInsertLock is held. The latter can't happen because nothing happens while profInsertLock is held that could cause a goroutine to get scanned (that is, there are no memory allocations, therefore no GC assists).

Modeling this properly in the lock ranking is going to be a pain. And I'm not a fan of incrementing gp.m.locks when acquiring the Gscan bit right now just because it could conceivably cause some other "holding locks" failure. It's risky to do at this point in the release.

OK, that's all the bad-ish news. Our saving grace for the release is that this can only happen with GODEBUG=runtimecontentionstacks=1, so it's not a true release blocker. Note that all three failures happen during TestRuntimeLockMetricsAndProfile which enables the GODEBUG explicitly. This test was disabled by https://go.dev/cl/550775, so we're fine for now.

Moving out of release-blocker and into Go 1.23.

@mknyszek mknyszek modified the milestones: Go1.22, Go1.23 Dec 18, 2023
@mknyszek
Contributor

Like #64253, I think this can still be worked on as a test-only fix for Go 1.22. The lock ranking mode is entirely for our own testing, and this issue only pops up with TestRuntimeLockMetricsAndProfile currently.

But there's no reason this has to block the Go 1.22 release at all, which is why I moved it to Go 1.23.

@bcmills
Contributor

bcmills commented Jan 31, 2024

@rhysh
Contributor

rhysh commented Mar 6, 2024

#66004 looks similar. @mknyszek, is this an acceptable point in the release cycle to have Gscan users modify gp.m.locks?

@prattmic
Member Author

prattmic commented Mar 7, 2024

Yes, I think so

rhysh added a commit to rhysh/go that referenced this issue Mar 12, 2024
Profiling of runtime-internal locks checks gp.m.locks to see if it's
safe to add a new record to the profile, but direct use of
acquireLockRank can change the list of the M's active lock ranks without
updating gp.m.locks to match. The runtime's internal rwmutex
implementation makes a point of calling acquirem/releasem when
manipulating the lock rank list, but the other user of acquireLockRank
(the GC's Gscan bit) relied on the GC's invariants to avoid deadlocks.

Codify the rwmutex approach by having acquireLockRank (now called
acquireLockRankAndM) include a call to acquirem. Do the same for release.

For golang#64706
For golang#66004
@gopherbot

Change https://go.dev/cl/571056 mentions this issue: runtime: always acquire M when acquiring locks by rank

rhysh added a commit to rhysh/go that referenced this issue Mar 18, 2024
Profiling of runtime-internal locks checks gp.m.locks to see if it's
safe to add a new record to the profile, but direct use of
acquireLockRank can change the list of the M's active lock ranks without
updating gp.m.locks to match. The runtime's internal rwmutex
implementation makes a point of calling acquirem/releasem when
manipulating the lock rank list, but the other user of acquireLockRank
(the GC's Gscan bit) relied on the GC's invariants to avoid deadlocks.

Codify the rwmutex approach by adding a variant of acquireLockRank,
acquireLockRankAndM, that includes a call to acquirem. Do the same for
release.

Leave runtime/time.go's use of the old variants intact for the moment.

For golang#64706
For golang#66004

Change-Id: Id18e4d8de1036de743d2937fad002c6feebe2faf
rhysh added a commit to rhysh/go that referenced this issue Apr 16, 2024
Profiling of runtime-internal locks checks gp.m.locks to see if it's
safe to add a new record to the profile, but direct use of
acquireLockRank can change the list of the M's active lock ranks without
updating gp.m.locks to match. The runtime's internal rwmutex
implementation makes a point of calling acquirem/releasem when
manipulating the lock rank list, but the other user of acquireLockRank
(the GC's Gscan bit) relied on the GC's invariants to avoid deadlocks.

Codify the rwmutex approach by renaming acquireLockRank to
acquireLockRankAndM and having it include a call to acquirem. Do the same
for release.

For golang#64706
For golang#66004

Change-Id: Id18e4d8de1036de743d2937fad002c6feebe2faf
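
A hedged, self-contained sketch of the shape of that change (stand-in types and helpers, not the actual runtime patch from CL 571056): the point is simply that rank bookkeeping and the M's lock count move together, so the contention profiler's gp.m.locks check also covers ranks taken directly, such as the Gscan bit.

```go
package main

import "fmt"

type lockRank int

// Stand-in M: locks is what acquirem/releasem maintain; lockRanks is what
// acquireLockRank/releaseLockRank maintain.
type m struct {
	locks     int
	lockRanks []lockRank
}

func (mp *m) acquirem() { mp.locks++ }
func (mp *m) releasem() { mp.locks-- }

func (mp *m) acquireLockRank(r lockRank) { mp.lockRanks = append(mp.lockRanks, r) }
func (mp *m) releaseLockRank(r lockRank) { mp.lockRanks = mp.lockRanks[:len(mp.lockRanks)-1] }

// acquireLockRankAndM pairs the rank bookkeeping with the M lock count, as
// the commit message describes.
func (mp *m) acquireLockRankAndM(r lockRank) {
	mp.acquirem()
	mp.acquireLockRank(r)
}

func (mp *m) releaseLockRankAndM(r lockRank) {
	mp.releaseLockRank(r)
	mp.releasem()
}

func main() {
	const rankGscan lockRank = 35
	var mp m
	mp.acquireLockRankAndM(rankGscan)
	fmt.Println("locks held while the Gscan rank is taken:", mp.locks) // 1, visible to the profiler's guard
	mp.releaseLockRankAndM(rankGscan)
}
```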