Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

internal/poll: deadlock in Read on arm64 when an FD is closed [1.17 backport] #50611

Closed
gopherbot opened this issue Jan 14, 2022 · 8 comments
Closed
Labels
CherryPickApproved Used during the release process for point releases FrozenDueToAge
Milestone

Comments

@gopherbot
Copy link

@bcmills requested issue #45211 to be considered for backport to the next 1.17 minor release.

@gopherbot, please backport to 1.17 and 1.16: this is a race condition that can result in deadlocks on ARM64 in any process that relies on Close unblocking a read. It is tricky to diagnose (and hard to reproduce reliably), and there is no apparent workaround for users on ARM64 machines.

@gopherbot gopherbot added the CherryPickCandidate Used during the release process for point releases label Jan 14, 2022
@gopherbot gopherbot added this to the Go1.17.7 milestone Jan 14, 2022
@rsc
Copy link
Contributor

rsc commented Jan 19, 2022

I'd feel better about a backport if we had more data showing that the crashes really have stopped.

@bcmills
Copy link
Contributor

bcmills commented Jan 19, 2022

Unfortunately the crashes only reproduced on a very small number of builders, and even then not in a way we could reproduce easily outside of the builder environment, which can be arbitrarily different from even a gomote due to system load, test sharding, and the process environment (see #32430). To make matters worse, because of the reduced rate of CLs for the code freeze we aren't getting very many organic test runs on the builders we do have.

Perhaps we should revisit a backport after the tree reopens for 1.19 development, so that we can have a larger body of test runs to compare?

@toothrot toothrot added the CherryPickApproved Used during the release process for point releases label Feb 9, 2022
@toothrot
Copy link
Contributor

toothrot commented Feb 9, 2022

Approved. This is a serious issue with no workaround.

@gopherbot gopherbot removed the CherryPickCandidate Used during the release process for point releases label Feb 9, 2022
@cherrymui cherrymui modified the milestones: Go1.17.7, Go1.17.8 Feb 9, 2022
@dmitshur
Copy link
Contributor

https://go.dev/wiki/MinorRelease states:

When the child issue is labeled CherryPickApproved, the original author of the change fixing that issue should immediately create and mail a cherry-pick change against the release branch, which will be merged as soon as it is ready, closing the child issue.

Both this issue and #50610 are now labeled CherryPickApproved.

@rsc You're the original author of CL 378234 that fixed #45211, so please feel free to mail cherry-pick CLs when you're ready to proceed here. Thanks.

@cagedmantis cagedmantis modified the milestones: Go1.17.8, Go1.17.9 Mar 3, 2022
@heschi
Copy link
Contributor

heschi commented Mar 14, 2022

@bcmills @rsc @ianlancetaylor

This backport is at risk of missing the next minor release.

@gopherbot
Copy link
Author

Change https://go.dev/cl/392714 mentions this issue: [release-branch.go1.17] runtime: fix net poll races

@rhysh
Copy link
Contributor

rhysh commented Mar 14, 2022

Unfortunately the crashes only reproduced on a very small number of builders, and even then not in a way we could reproduce easily outside of the builder environment

I'm not sure whether disabling arm64HasATOMICS = cpu.ARM64.HasATOMICS in src/runtime/proc.go is "allowed" (breaks invariants), but doing that has let me reproduce a related issue (#50469) on my darwin/arm64 M1 Macbook Air about once per minute. It reproduces up until the fix for #45211 , at 17b2fb1.

So if falling back to older atomic instructions is a valid way to run, then it looks to me like that commit really does stop the crashes (at least by a factor of 50, from ~3% failure rate to 0 out of 2000).

Update: I ran the reproducer 45k times, with no netpoll-related deadlocks. (The single failure I saw was reported as a normal application-layer TCP timeout.)

@gopherbot
Copy link
Author

Closed by merging 4e69fdd to release-branch.go1.17.

gopherbot pushed a commit that referenced this issue Mar 28, 2022
The netpoll code was written long ago, when the
only multiprocessors that Go ran on were x86.
It assumed that an atomic store would trigger a
full memory barrier and then used that barrier
to order otherwise racy access to a handful of fields,
including pollDesc.closing.

On ARM64, this code has finally failed, because
the atomic store is on a value completely unrelated
to any of the racily-accessed fields, and the ARMv8
hardware, unlike x86, is clever enough not to do a
full memory barrier for a simple atomic store.
We are seeing a constant background rate of trybot
failures where the net/http tests deadlock - a netpollblock
has clearly happened after the pollDesc has begun to close.

The code that does the racy reads is netpollcheckerr,
which needs to be able to run without acquiring a lock.
This CL fixes the race, without introducing unnecessary
inefficiency or deadlock, by arranging for every updater
of the relevant fields to publish a summary as a single
atomic uint32, and then having netpollcheckerr use a
single atomic load to fetch the relevant bits and then
proceed as before.

For #45211
Fixes #50611

Change-Id: Ib6788c8da4d00b7bda84d55ca3fdffb5a64c1a0a
Reviewed-on: https://go-review.googlesource.com/c/go/+/378234
Trust: Russ Cox <rsc@golang.org>
Run-TryBot: Russ Cox <rsc@golang.org>
Trust: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
(cherry picked from commit 17b2fb1)
Reviewed-on: https://go-review.googlesource.com/c/go/+/392714
Trust: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Emmanuel Odeke <emmanuel@orijtech.com>
@golang golang locked and limited conversation to collaborators Mar 28, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
CherryPickApproved Used during the release process for point releases FrozenDueToAge
Projects
None yet
Development

No branches or pull requests

9 participants