
net,runtime: apparent deadlock in (*net.conn).Close and runtime.netpollblock on arm64 platforms #48925

Closed
bcmills opened this issue Oct 12, 2021 · 21 comments
Labels
arch-arm64 FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@bcmills
Contributor

bcmills commented Oct 12, 2021

greplogs --dashboard -md -l -e '(?m).*\[semacquire, \d+ minutes\]:\n(.+\n\t.+\n)*net\.\(\*conn\)\.Close'

2021-10-08T16:26:20-99c1b24/windows-arm64-10
2021-10-05T15:58:05-f1f626d/linux-arm64-aws
2021-09-30T18:10:18-e81a3d9-205640e/linux-arm64-packet
2021-09-29T18:36:18-e48cf0d/android-arm64-corellium
2021-09-28T18:58:50-7d9f5e0-9c43872/windows-arm64-10
2021-09-28T17:19:08-1c6e50a/android-arm64-corellium
2021-09-27T22:22:04-52b23a5/android-arm64-corellium
2021-09-27T19:42:34-4e4d966-315cec2/linux-arm64-packet
2021-09-24T18:21:14-f6b5ffb/android-arm64-corellium
2021-09-22T20:53:48-41bb744/linux-arm64-packet
2021-09-18T06:04:41-c084706-c894b44/windows-arm64-10
2021-09-17T18:33:15-70493b3/ios-arm64-corellium
2021-09-16T23:13:58-e09dcc2/android-arm64-corellium
2021-09-09T21:28:56-c163c31-c981874/linux-arm64-aws
2021-09-07T23:27:08-3fff213/windows-arm64-10
2021-09-03T01:38:54-9f69a44/ios-arm64-corellium
2021-09-02T19:36:22-a8aa6cf/windows-arm64-10
2021-09-02T19:16:19-9633195/ios-arm64-corellium
2021-09-02T19:16:19-9633195/linux-arm64-packet
2021-09-02T16:59:57-2a463a2/android-arm64-corellium
2021-09-02T14:02:42-b8420ba/ios-arm64-corellium
2021-09-01T22:28:21-0bfd6fc/android-arm64-corellium
2021-08-28T04:46:13-0108177/windows-arm64-10
2021-08-24T15:55:38-60bc85c-d70c69d/windows-arm64-10
2021-08-23T12:26:30-c7e354d/linux-arm64-packet
2021-08-21T00:55:22-e17439e/windows-arm64-10
2021-08-17T15:00:04-3001b0a/windows-arm64-10
2021-08-05T20:22:31-fd45e26/android-arm64-corellium
2021-07-31T23:59:40-c6fcb2d-b8ca6e5/windows-arm64-10
2021-07-26T22:15:24-bfbb288/ios-arm64-corellium
2021-07-20T19:54:36-d568e6e/ios-arm64-corellium
2021-07-16T02:43:48-aa4e0f5/linux-arm64-aws
2021-06-30T01:29:49-c45e800/windows-arm64-10
2021-05-04T14:38:36-5e4f9b0/linux-arm64-aws
2021-05-03T16:23:09-7b768d4/linux-arm64-aws
2021-04-29T17:08:05-e03cca6/linux-arm64-aws
2021-04-29T11:37:05-c4c68fb/linux-arm64-aws
2021-04-22T17:47:59-537cde0/linux-arm64-aws
2021-04-16T16:36:57-2f0e5bf/linux-arm64-packet
2021-04-14T03:15:00-2d4ba26/linux-arm64-packet
2021-04-13T21:13:09-e512bc2/ios-arm64-corellium
2021-04-09T18:49:05-c3faff7/linux-arm64-aws
2021-04-09T18:49:05-c3faff7/linux-arm64-packet
2021-04-09T14:33:44-a690a5d/android-arm64-corellium
2021-04-08T18:47:17-96a6745/linux-arm64-packet
2021-03-29T21:50:16-4e1bf8e/linux-arm64-packet
2021-03-15T23:47:56-661f3f1/linux-arm64-aws
2021-03-12T21:52:25-5ea612d-f39c4de/linux-arm64-aws
2021-03-05T22:14:48-a829114/linux-arm64-aws
2021-01-25T22:54:28-e6b6d10/android-arm64-corellium
2020-12-16T16:40:57-5abda26/android-arm64-corellium
2020-12-10T23:11:25-1fe891a/ios-arm64-corellium
2020-10-15T18:07:26-3c9488e/linux-arm64-aws
2020-09-10T14:44:25-8098dbb/linux-arm64-aws

@bcmills bcmills added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. arch-arm64 labels Oct 12, 2021
@bcmills
Contributor Author

bcmills commented Oct 12, 2021

Given that this appears to affect all arm64 platforms, it smells like a runtime or compiler bug to me.

Notably, all of these stack traces seem to feature goroutines blocked in runtime.netpollblock and at least one of runtime.semacquire1 or internal/poll.runtime_Semacquire.

@jeremyfaller, could you suggest someone to route it to?

@bcmills bcmills changed the title net: apparent deadlock in (*conn).Close on arm64 platforms net,runtime: apparent deadlock in (*net.conn).Close and runtime.netpollblock on arm64 platforms Oct 12, 2021
@jeremyfaller
Contributor

Hmm. I feel like @cherrymui, @prattmic, or @dr2chase might be able to offer insights.

@dr2chase
Contributor

If I recall, arm64 needs a few more memory fences than amd64.
(And it will be really hard to figure out which ones aren't there, if this is the case.)
Do we know the particular flavors of arm involved?

@bcmills
Contributor Author

bcmills commented Oct 13, 2021

The darwin-arm64-11_0-toothrot, freebsd-arm64-dmgk, openbsd-arm64-jsing, and netbsd-arm64-bsiegert builders are notably absent from the observed failures.

That leaves:

@prattmic
Member

Some notes:

  • internal/poll.(*FD).Close acquires FD.csema to wait for other users of the FD (system calls, primarily) to finish.
  • Each user of the FD holds a reference while using it. The final decref releases FD.csema, which unblocks Close.
  • Close also (before acquiring csema) "evicts" the FD, which should wake any blocked waiters so they can decref.

In https://build.golang.org/log/93f6f9bde154a60a2dcad7c9a3e7b66f4bf2f41a, we have 2 goroutines using the same stuck FD:

goroutine 9859 [semacquire, 9 minutes]:
internal/poll.runtime_Semacquire(0xffff54148070)
	/workdir/go/src/runtime/sema.go:61 +0x2c
internal/poll.(*FD).Close(0x400048b080)
	/workdir/go/src/internal/poll/fd_unix.go:116 +0x68
net.(*netFD).Close(0x400048b080)
	/workdir/go/src/net/fd_posix.go:38 +0x44
net.(*conn).Close(0x40000100a0)
	/workdir/go/src/net/net.go:207 +0x34
golang.org/x/net/http2.(*ClientConn).closeIfIdle(0x40004de600)
	/workdir/gopath/src/golang.org/x/net/http2/transport.go:888 +0x224
golang.org/x/net/http2.(*clientConnPool).closeIdleConnections(0x4001406000)
	/workdir/gopath/src/golang.org/x/net/http2/client_conn_pool.go:275 +0x148
golang.org/x/net/http2.(*Transport).CloseIdleConnections(0x10)
	/workdir/gopath/src/golang.org/x/net/http2/transport.go:544 +0x7c
golang.org/x/net/http2.(*clientTester).cleanup(0x40004b2140)
	/workdir/gopath/src/golang.org/x/net/http2/transport_test.go:760 +0x2c
sync.(*Once).doSlow(0x4001406000, 0x400070c200)
	/workdir/go/src/sync/once.go:68 +0x10c
sync.(*Once).Do(...)
	/workdir/go/src/sync/once.go:59
golang.org/x/net/http2.(*clientTester).run(0x40004b2140)
	/workdir/gopath/src/golang.org/x/net/http2/transport_test.go:786 +0x1bc
golang.org/x/net/http2.testTransportReqBodyAfterResponse(0x4000602758, 0x193)
	/workdir/gopath/src/golang.org/x/net/http2/transport_test.go:931 +0x15c
golang.org/x/net/http2.TestTransportReqBodyAfterResponse_403(0x40010c8000)
	/workdir/gopath/src/golang.org/x/net/http2/transport_test.go:824 +0x24
testing.tRunner(0x4000308000, 0x3f86b8)
	/workdir/go/src/testing/testing.go:1389 +0x110
created by testing.(*T).Run
	/workdir/go/src/testing/testing.go:1436 +0x328
goroutine 9861 [IO wait, 9 minutes]:
internal/poll.runtime_pollWait(0xffff54148070, 0x72)
	/workdir/go/src/runtime/netpoll.go:229 +0x9c
internal/poll.(*pollDesc).wait(0x400048b080, 0x4000343000, 0x0)
	/workdir/go/src/internal/poll/fd_poll_runtime.go:84 +0x2c
internal/poll.(*pollDesc).waitRead(...)
	/workdir/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0x400048b080, {0x4000343000, 0x1000, 0x1000})
	/workdir/go/src/internal/poll/fd_unix.go:167 +0x1ec
net.(*netFD).Read(0x400048b080, {0x4000343000, 0x4000065c28, 0x0})
	/workdir/go/src/net/fd_posix.go:56 +0x2c
net.(*conn).Read(0x40000100a0, {0x4000343000, 0x400000801, 0x1})
	/workdir/go/src/net/net.go:183 +0x38
bufio.(*Reader).Read(0x40003ae0c0, {0x40004ca120, 0x9, 0x40007ff640})
	/workdir/go/src/bufio/bufio.go:232 +0x1ec
io.ReadAtLeast({0x45b640, 0x40003ae0c0}, {0x40004ca120, 0x9, 0x9}, 0x9)
	/workdir/go/src/io/io.go:331 +0xac
io.ReadFull(...)
	/workdir/go/src/io/io.go:350
golang.org/x/net/http2.readFrameHeader({0x40004ca120, 0x9, 0x40007ff650}, {0x45b640, 0x40003ae0c0})
	/workdir/gopath/src/golang.org/x/net/http2/frame.go:237 +0x5c
golang.org/x/net/http2.(*Framer).ReadFrame(0x40004ca0e0)
	/workdir/gopath/src/golang.org/x/net/http2/frame.go:498 +0x84
golang.org/x/net/http2.(*clientConnReadLoop).run(0x4000065f90)
	/workdir/gopath/src/golang.org/x/net/http2/transport.go:1973 +0x134
golang.org/x/net/http2.(*ClientConn).readLoop(0x40004de600)
	/workdir/gopath/src/golang.org/x/net/http2/transport.go:1864 +0x68
created by golang.org/x/net/http2.(*Transport).newClientConn
	/workdir/gopath/src/golang.org/x/net/http2/transport.go:739 +0x8fc

The fact that the second goroutine is still blocked in Read seems to indicate that the eviction failed to unblock that goroutine, or that the second goroutine was able to acquire the FD's mu after close.

@prattmic
Member

I reproduced this on linux-arm64-aws quite quickly, so I'll dig in some more:

$ GOROOT=$(pwd) gomote run $(gomote-instance) ./go/bin/go test -count=1000 crypto/tls
panic: test timed out after 10m0s

goroutine 117595 [running]:
testing.(*M).startAlarm.func1()
        /workdir/go/src/testing/testing.go:1961 +0x8c
created by time.goFunc
        /workdir/go/src/time/sleep.go:176 +0x3c

goroutine 1 [chan receive, 6 minutes]:
testing.(*T).Run(0x40003e4340, {0x2cce47, 0x75bdd8b719}, 0x2eef18)
        /workdir/go/src/testing/testing.go:1425 +0x33c
testing.runTests.func1(0x0)
        /workdir/go/src/testing/testing.go:1771 +0x74
testing.tRunner(0x40003e4340, 0x4000133b68)
        /workdir/go/src/testing/testing.go:1377 +0x110
testing.runTests(0x4000106b40, {0x4e2da0, 0x95, 0x95}, {0x46c, 0x4000014280, 0x4e7040})
        /workdir/go/src/testing/testing.go:1769 +0x3e0
testing.(*M).Run(0x4000106b40)
        /workdir/go/src/testing/testing.go:1657 +0x510
crypto/tls.runMain(0x400005c180)
        /workdir/go/src/crypto/tls/handshake_test.go:381 +0x5b4
crypto/tls.TestMain(0x4000045ec8)
        /workdir/go/src/crypto/tls/handshake_test.go:333 +0x80
main.main()
        _testmain.go:353 +0x1ec

goroutine 6 [IO wait, 6 minutes]:
internal/poll.runtime_pollWait(0xffff65e8e818, 0x72)
        /workdir/go/src/runtime/netpoll.go:229 +0x9c
internal/poll.(*pollDesc).wait(0x400013a080, 0x87ac4, 0x0)
        /workdir/go/src/internal/poll/fd_poll_runtime.go:84 +0x2c
internal/poll.(*pollDesc).waitRead(...)
        /workdir/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0x400013a080)
        /workdir/go/src/internal/poll/fd_unix.go:508 +0x1d8
net.(*netFD).accept(0x400013a080)
        /workdir/go/src/net/fd_unix.go:173 +0x28
net.(*TCPListener).accept(0x400000e198)
        /workdir/go/src/net/tcpsock_posix.go:140 +0x2c
net.(*TCPListener).Accept(0x400000e198)
        /workdir/go/src/net/tcpsock.go:262 +0x30
crypto/tls.localServer({0x340ac8, 0x400000e198})
        /workdir/go/src/crypto/tls/handshake_test.go:247 +0x5c
created by crypto/tls.runMain
        /workdir/go/src/crypto/tls/handshake_test.go:353 +0x1e0

goroutine 117621 [IO wait, 6 minutes]:
internal/poll.runtime_pollWait(0xffff65e8db68, 0x72)
        /workdir/go/src/runtime/netpoll.go:229 +0x9c
internal/poll.(*pollDesc).wait(0x400013a500, 0x40004e3440, 0x0)
        /workdir/go/src/internal/poll/fd_poll_runtime.go:84 +0x2c
internal/poll.(*pollDesc).waitRead(...)
        /workdir/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0x400013a500, {0x40004e3440, 0x205, 0x205})
        /workdir/go/src/internal/poll/fd_unix.go:167 +0x1ec
net.(*netFD).Read(0x400013a500, {0x40004e3440, 0x0, 0x1c58933d99f3})
        /workdir/go/src/net/fd_posix.go:56 +0x2c
net.(*conn).Read(0x4000558058, {0x40004e3440, 0x40003f2698, 0x160ca8})
        /workdir/go/src/net/net.go:183 +0x38
crypto/tls.(*atLeastReader).Read(0x40004e05a0, {0x40004e3440, 0x0, 0x1ac18})
        /workdir/go/src/crypto/tls/conn.go:777 +0x44
bytes.(*Buffer).ReadFrom(0x4000077af8, {0x33f328, 0x40004e05a0})
        /workdir/go/src/bytes/buffer.go:204 +0x94
crypto/tls.(*Conn).readFromUntil(0x4000077880, {0x33f588, 0x4000558058}, 0x160ca8)
        /workdir/go/src/crypto/tls/conn.go:799 +0xd8
crypto/tls.(*Conn).readRecordOrCCS(0x4000077880, 0x0)
        /workdir/go/src/crypto/tls/conn.go:606 +0xdc
crypto/tls.(*Conn).readRecord(...)
        /workdir/go/src/crypto/tls/conn.go:574
crypto/tls.(*Conn).readHandshake(0x4000077880)
        /workdir/go/src/crypto/tls/conn.go:1009 +0x4c
crypto/tls.(*Conn).clientHandshake(0x4000077880, {0x340dd0, 0x40005bc480})
        /workdir/go/src/crypto/tls/handshake_client.go:179 +0x1c0
crypto/tls.(*Conn).handshakeContext(0x4000077880, {0x340e08, 0x400001a040})
        /workdir/go/src/crypto/tls/conn.go:1445 +0x3c8
crypto/tls.(*Conn).HandshakeContext(...)
        /workdir/go/src/crypto/tls/conn.go:1395
crypto/tls.(*Conn).Handshake(...)
        /workdir/go/src/crypto/tls/conn.go:1379
crypto/tls.TestHostnameInSNI.func1({0x2cacbe, 0xb})
        /workdir/go/src/crypto/tls/handshake_client_test.go:1439 +0x154
created by crypto/tls.TestHostnameInSNI
        /workdir/go/src/crypto/tls/handshake_client_test.go:1438 +0x184

goroutine 117585 [semacquire, 6 minutes]:
internal/poll.runtime_Semacquire(0xffff65e8db68)
        /workdir/go/src/runtime/sema.go:61 +0x2c
internal/poll.(*FD).Close(0x400013a500)
        /workdir/go/src/internal/poll/fd_unix.go:116 +0x68
net.(*netFD).Close(0x400013a500)
        /workdir/go/src/net/fd_posix.go:38 +0x44
net.(*conn).Close(0x4000558058)
        /workdir/go/src/net/net.go:207 +0x34
crypto/tls.TestHostnameInSNI(0x40003e4820)
        /workdir/go/src/crypto/tls/handshake_client_test.go:1453 +0x278
testing.tRunner(0x40003e4820, 0x2eef18)
        /workdir/go/src/testing/testing.go:1377 +0x110
created by testing.(*T).Run
        /workdir/go/src/testing/testing.go:1424 +0x328
FAIL    crypto/tls      600.010s
FAIL
Error running run: exit status 1

@prattmic
Member

prattmic commented Oct 13, 2021

More logging:

[0.719056680 P 0] fd 0x4000362200 Read: enter
[0.719057213 P 0] fd 0x4000362200 Read: lock done
[0.719057533 P 0] fd 0x4000362200 Read: prepare done
[0.719057828 P 0] fd 0x4000362200 Read: syscall.Read enter
[0.719058977 P 0] fd 0x4000362200 Read: syscall.Read error
[0.719059264 P 0] fd 0x4000362200 Read: waitRead enter
[0.719059420 P 1] fd 0x4000362200 Close: enter
[0.719059519 P 0] pd 0xffff65c5a678 pollWait: enter
[0.719059781 P 0] pd 0xffff65c5a678 pollWait: netpollcheckerr done
[0.719059798 P 1] fd 0x4000362200 Close: increfAndClose success
[0.719060077 P 0] pd 0xffff65c5a678 pollWait: netpollblock enter
[0.719060150 P 1] pd 0xffff65c5a678 pollUnblock: enter
[0.719060511 P 1] pd 0xffff65c5a678 pollUnblock: netpollunblock done
[0.719060881 P 1] fd 0x4000362200 Close: evict done
[0.719061168 P 1] fd 0x4000362200 Close: decref done
[0.719061455 P 1] fd 0x4000362200 Close: block enter

My read of this is that runtime.poll_runtime_pollUnblock is failing to properly wake the reader, which is in poll_runtime_pollWait, but may or may not have yet fully blocked in netpollblock.

The synchronization between netpollblock and netpollunblock looks a bit suspicious to me, particularly on a weakly ordered system like ARM, but I will need to give it more thought tomorrow.

@prattmic
Member

One theory:

Inside netpollblock and netpollunblock, gpp is read using a non-atomic read. In the normal case, writes use atomics, creating a barrier.

However, when consuming a ready event, netpollblock resets gpp using a non-atomic store, thus skipping a barrier.

In netpollunblock, we can return early if gpp == pdReady, not actually performing an unblock.

Thus I posit that an earlier netpollblock sees pdReady and resets to 0. Later, on another core we read gpp and see the stale pdReady and incorrectly return early without triggering a new unblock.

I haven't validated this theory, but my log does show that this pollDesc has earlier calls to netpollblock, so it is possible. If this is the case, I believe the fix is simply to use an atomic store to reset gpp to 0.

@ianlancetaylor
Contributor

ianlancetaylor commented Oct 14, 2021 via email

@harshavardhana
Contributor


Does this only affect arm64?

@ianlancetaylor
Contributor

Does this only affect arm64?

It's a fair question why this would affect arm64 and not arm32.

@prattmic
Member

Using all atomic loads and stores does make this issue disappear, so I think that's it. Though I also can't explain why we haven't seen this on other architectures, especially arm32.

@bcmills
Contributor Author

bcmills commented Oct 14, 2021

Are aligned 32-bit stores always atomic (FSVO “atomic”) on arm32? Maybe the failure mode happens from a torn 64-bit uintptr store.

@LaPingvino

FWIW, I think bash all.bash on my arm64 Chromebook fails on the same tests. I could probably test a possible fix.

@gopherbot

Change https://golang.org/cl/355952 mentions this issue: runtime: consistently access pollDesc r/w Gs with atomics

@LaPingvino

LaPingvino commented Oct 15, 2021

Trying just this code change on top of a tip checkout doesn't fix the tests failing on my Chromebook arm64. Of course, it might be an unrelated issue (or issues)... I haven't yet taken the time to fully isolate whether this is the bug that makes the tests fail, but the kind of failure makes it seem likely to me: mostly test timeouts, which look like possible deadlocks, and the place where it failed now (FAIL runtime 541.176s, with the stack trace touching poll) suspiciously looks like code touched by that change.

I also have a weak suspicion that the failing tests differ between runs. I will try to keep a closer eye on it from now on; if necessary I can paste some stack traces.

@prattmic
Member

@LaPingvino give the new version a try, though it sounds like you may be having a different issue. Logs would definitely help.

@gopherbot please backport to 1.17 and 1.16. This is a pre-existing issue with no workaround, and causing flakes on trybots.

@gopherbot

Backport issue(s) opened: #49009 (for 1.16), #49010 (for 1.17).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

@LaPingvino

@prattmic it does seem to have fixed the failing tests on my Debian arm64 in Crostini (Chromebook). runtime took a long time to test (431.701s) but didn't fail this time. Another test fails at the moment, but I don't think that one is related; now it's a cmd/go test that fails.

@gopherbot

Change https://golang.org/cl/356369 mentions this issue: [release-branch.go1.17] runtime: consistently access pollDesc r/w Gs with atomics

@gopherbot

Change https://golang.org/cl/356370 mentions this issue: [release-branch.go1.16] runtime: consistently access pollDesc r/w Gs with atomics

@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Oct 28, 2021
@dmitshur dmitshur added this to the Go1.18 milestone Oct 28, 2021
gopherbot pushed a commit that referenced this issue Oct 28, 2021
…with atomics

Both netpollblock and netpollunblock read gpp using a non-atomic load.
When consuming a ready event, netpollblock clears gpp using a non-atomic
store, thus skipping a barrier.

Thus on systems with weak memory ordering, a sequence like this is
possible:

             T1                                T2

1. netpollblock: read gpp -> pdReady
2. netpollblock: store gpp -> 0

                                 3. netpollunblock: read gpp -> pdReady
                                 4. netpollunblock: return

i.e., without a happens-before edge between (2) and (3), netpollunblock
may read the stale value of gpp.

Switch these accesses to use atomic loads and stores in order to create
these edges.

For ease of future maintenance, I've simply changed rg and wg to always
be accessed atomically, though I don't believe pollOpen or pollClose
require atomics today.

For #48925
Fixes #49009

Change-Id: I903ea667eea320277610b4f969129935731520c3
Reviewed-on: https://go-review.googlesource.com/c/go/+/355952
Trust: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: David Chase <drchase@google.com>
(cherry picked from commit 1b072b3)
Reviewed-on: https://go-review.googlesource.com/c/go/+/356370
gopherbot pushed a commit that referenced this issue Oct 28, 2021
…with atomics

Both netpollblock and netpollunblock read gpp using a non-atomic load.
When consuming a ready event, netpollblock clears gpp using a non-atomic
store, thus skipping a barrier.

Thus on systems with weak memory ordering, a sequence like this is
possible:

             T1                                T2

1. netpollblock: read gpp -> pdReady
2. netpollblock: store gpp -> 0

                                 3. netpollunblock: read gpp -> pdReady
                                 4. netpollunblock: return

i.e., without a happens-before edge between (2) and (3), netpollunblock
may read the stale value of gpp.

Switch these accesses to use atomic loads and stores in order to create
these edges.

For ease of future maintenance, I've simply changed rg and wg to always
be accessed atomically, though I don't believe pollOpen or pollClose
require atomics today.

For #48925
Fixes #49010

Change-Id: I903ea667eea320277610b4f969129935731520c3
Reviewed-on: https://go-review.googlesource.com/c/go/+/355952
Trust: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: David Chase <drchase@google.com>
(cherry picked from commit 1b072b3)
Reviewed-on: https://go-review.googlesource.com/c/go/+/356369
@golang golang locked and limited conversation to collaborators Oct 28, 2022