runtime: newosproc doesn't handle clone returning EAGAIN #49438

Closed
asuffield opened this issue Nov 8, 2021 · 3 comments
Labels
compiler/runtime, FrozenDueToAge, NeedsInvestigation

@asuffield

What version of Go are you using (go version)?

$ go version
go version go1.16.7 linux/amd64

Does this issue reproduce with the latest release?

I haven't tried, but inspection of the code says it will.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOHOSTARCH="amd64"
GOHOSTOS="linux"

(With apologies for pruning)

What did you do?

I don't have a reproduction case for this one - it's very sensitive to something I haven't pinned down yet - but I have uncovered the nature of the bug via inspection.

Running Go programs on sufficiently loaded systems sometimes crashes with "runtime: failed to create new OS thread (have 2 already; errno=11)". The really interesting thing here is errno=11, which is EAGAIN. The Linux manpages for fork/clone attribute EAGAIN to system limits; I have verified that no such limit is being hit in my scenario. At this point I said to myself (more than once): fork/clone aren't restartable syscalls, surely they can't actually return EAGAIN. Then I started doubting myself and went looking.

It turns out that Linux can and does return EAGAIN in some circumstances which are entirely undocumented in the manpages. The key code path ends up here:

https://elixir.bootlin.com/linux/v5.15.1/source/kernel/fork.c#L1523

And starts out over here:

https://elixir.bootlin.com/linux/v5.15.1/source/fs/exec.c#L1581

Which eventually led me back to this thread:

https://lore.kernel.org/lkml/20090329005343.GA12139@redhat.com/

It appears Linux has been willing to return EAGAIN to fork/clone for over a decade now, which means this code needs to handle that case somehow:

go/src/runtime/os_linux.go

Lines 167 to 173 in a97c527

	if ret < 0 {
		print("runtime: failed to create new OS thread (have ", mcount(), " already; errno=", -ret, ")\n")
		if ret == -_EAGAIN {
			println("runtime: may need to increase max user processes (ulimit -u)")
		}
		throw("newosproc")
	}

It is super unfortunate that the rlimit scenario also returns EAGAIN, but I don't see any solution other than retrying a few times before panicking. Maybe there's something I haven't fully understood here; I'll admit I haven't pieced together exactly what's happening. The only thing I'm fully confident of is this: there is some way in which Go processes can crash with an EAGAIN returned from clone() which isn't caused by rlimits.

@ianlancetaylor ianlancetaylor changed the title from "newosproc doesn't handle clone() returning EAGAIN" to "runtime: newosproc doesn't handle clone returning EAGAIN" Nov 8, 2021
@ianlancetaylor
Contributor

Note: for the cgo case we use a loop with an increasing delay to handle pthread_create returning EAGAIN; see #18146 and https://go.googlesource.com/go/+/refs/heads/master/src/runtime/cgo/gcc_libinit.c#91. We could certainly do the same thing for the non-cgo case, which is what you are describing. It would be nice to have a test case.

@ianlancetaylor ianlancetaylor added the NeedsInvestigation label Nov 8, 2021
@ianlancetaylor ianlancetaylor added this to the Backlog milestone Nov 8, 2021
@gopherbot gopherbot added the compiler/runtime label Jul 7, 2022
@myitcv
Member

myitcv commented Nov 1, 2022

@asuffield hello! Long time no see :)

Noting that I run into this occasionally (~once a month), in exactly the load scenario that @asuffield reports (generally at machine startup).

@gopherbot

Change https://go.dev/cl/447175 mentions this issue: runtime: retry thread creation on EAGAIN

@golang golang locked and limited conversation to collaborators Nov 10, 2023