runtime: newosproc doesn't handle clone returning EAGAIN #49438
Labels: compiler/runtime, FrozenDueToAge, NeedsInvestigation
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
I haven't tried, but inspection of the code says it will.
What operating system and processor architecture are you using (go env)?
go env output (with apologies for pruning)
What did you do?
I don't have a reproduction case for this one - it's very sensitive to something I haven't pinned down yet - but I have uncovered the nature of the bug via inspection.
Running Go programs on sufficiently loaded systems sometimes crashes with "runtime: failed to create new OS thread (have 2 already; errno=11)". The really interesting thing here is errno=11, which is EAGAIN. The Linux manpages for fork/clone attribute EAGAIN to system limits; I have verified that no such limit is being hit in my scenario. At this point I said to myself (more than once): fork/clone aren't restartable syscalls, surely they can't actually return EAGAIN. Then I started doubting myself and went looking.
It turns out that Linux can and does return EAGAIN in some circumstances which are entirely undocumented in the manpages. The key code path ends up here:
https://elixir.bootlin.com/linux/v5.15.1/source/kernel/fork.c#L1523
And starts out over here:
https://elixir.bootlin.com/linux/v5.15.1/source/fs/exec.c#L1581
Which eventually led me back to this thread:
https://lore.kernel.org/lkml/20090329005343.GA12139@redhat.com/
It appears Linux has been willing to return EAGAIN to fork/clone for over a decade now, which means this code needs to handle that case somehow:
go/src/runtime/os_linux.go, lines 167 to 173 in a97c527
It is super unfortunate that the rlimit scenario also returns EAGAIN, because the two cases can't be told apart from the errno alone. I don't see any solution other than retrying a few times before panicking. Maybe there's something I haven't fully understood here; I'll admit I haven't pieced together exactly what's happening. The only thing I'm fully confident of is: there is some way in which Go processes can crash with an EAGAIN returned from clone() which isn't caused by rlimits.