
runtime: failed to create new OS thread #19163

Closed
cherrymui opened this issue Feb 17, 2017 · 11 comments
Labels: FrozenDueToAge, OS-Windows, Testing
Milestone: Go1.9

Comments

@cherrymui
Member

There are test failures on the builder dashboard at various places on various machines with

runtime: failed to create new OS thread (have 9 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc

runtime stack:
runtime.throw(0x818701e, 0x9)
	/tmp/workdir/go/src/runtime/panic.go:596 +0x7c
runtime.newosproc(0x18538c80, 0x186fe000)
	/tmp/workdir/go/src/runtime/os_linux.go:163 +0x15f
runtime.newm(0x0, 0x18518000)
	/tmp/workdir/go/src/runtime/proc.go:1614 +0xf9
runtime.startm(0x18518000, 0x0)
	/tmp/workdir/go/src/runtime/proc.go:1684 +0x141
runtime.handoffp(0x18518000)
	/tmp/workdir/go/src/runtime/proc.go:1711 +0x49
runtime.retake(0xe65e8051, 0x1a5e8c, 0x0)
	/tmp/workdir/go/src/runtime/proc.go:3860 +0x10e
runtime.sysmon()
	/tmp/workdir/go/src/runtime/proc.go:3787 +0x272
runtime.mstart1()
	/tmp/workdir/go/src/runtime/proc.go:1166 +0xec
runtime.mstart()
	/tmp/workdir/go/src/runtime/proc.go:1136 +0x4d

The earliest seems to be https://build.golang.org/log/1a62fd950384d62c1782922f55fbaf194691d126, on my commit 98061fa. But I don't know how that CL could be related. Should I revert it?
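
For context, errno 11 here is EAGAIN: clone() refused to create another thread, which is why the runtime suggests raising the max user processes limit (ulimit -u). A minimal, hypothetical sketch (not code from this issue) of how a Go program can hit this: every goroutine that locks its OS thread and then blocks keeps that thread to itself, so the scheduler has to create a fresh thread for any remaining work.

package main

import (
	"runtime"
	"time"
)

func main() {
	for i := 0; i < 1000; i++ {
		go func() {
			runtime.LockOSThread() // dedicate an OS thread to this goroutine
			select {}              // block forever while still holding it
		}()
	}
	time.Sleep(time.Minute)
}

Run this under a low limit such as ulimit -u 50 and it dies with the same "runtime: failed to create new OS thread (... errno=11)" fatal error as above.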

@ianlancetaylor
Contributor

This kind of error usually indicates that the system is overloaded, conceivably by some other test running in parallel. However, in this case it can't be running in parallel with the test that I know can cause these kinds of problems, which is misc/cgo/test/issue18146.go. I don't know what is going on here.

@ianlancetaylor
Contributor

An example from the tools repo: https://build.golang.org/log/81e6ad3b5a351ecfc96c3f663d6b649794b493b4

@bradfitz
Contributor

I see no evidence that this is the fault of the builders.

The Kubernetes configuration hasn't changed (same node count and size, same pod limits). No pod leaks I can see. No new master or node versions. No logged errors.

Unless one particularly bad CL was running on a trybot, consuming threads like crazy, and Kubernetes' isolation between containers isn't good enough to contain it. But I'm not sure we keep enough logs (or enough association between build failure logs and GKE logs) to prove that.

/cc @rsc (who mentioned this to me on chat)

@bradfitz
Contributor

I see that Kubernetes doesn't seem to support setting rlimits (kubernetes/kubernetes#3595). So maybe we did just have one bad build somewhere impacting other builds.

Looks like I can modify the builders to set their own limits early in the build to prevent bad builds from impacting other pods.
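
A minimal sketch of what that could look like on a Linux builder, assuming golang.org/x/sys/unix; the function name and the limit value are illustrative, not the actual buildlet code. Lowering the soft limit needs no privileges, and child processes inherit it, which is why doing it early in the build helps.

package main

import (
	"log"

	"golang.org/x/sys/unix"
)

// capProcessLimit lowers the soft RLIMIT_NPROC so a runaway build cannot
// spawn an unbounded number of threads/processes on the shared node.
func capProcessLimit(max uint64) error {
	var lim unix.Rlimit
	if err := unix.Getrlimit(unix.RLIMIT_NPROC, &lim); err != nil {
		return err
	}
	if lim.Cur > max {
		lim.Cur = max
	}
	return unix.Setrlimit(unix.RLIMIT_NPROC, &lim)
}

func main() {
	if err := capProcessLimit(4096); err != nil {
		log.Fatalf("capping RLIMIT_NPROC: %v", err)
	}
	// ... run the build as usual ...
}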

@bradfitz
Contributor

I kicked off a Go 1.8 trybot run and it also failed on the GKE builders, so I think our GKE nodes are just screwed up somehow.

I don't see any leaked pods, though, and I haven't changed anything about the builders that should affect the GKE builders recently.

I tried to ssh into the GKE nodes via the GCP web UI and the ssh failed to connect.

I tried to kubectl proxy to see their web UI (using the GCP web UI instructions) and got auth errors. I tried again after updating my gcloud components, but got the same results.

So, I have zero visibility into what is happening on the 5 nodes of the GKE cluster, other than listing their pods and such and seeing that they look fine.

Maybe some system pod or other daemon went crazy and leaked a bunch of threads?

In any case, I have to reboot them anyway, so I'm just updating from GKE 1.4.6 to GKE 1.5.2 (using the GCP web UI option), since bug reports against the Kubernetes/GKE teams will probably be better received if I'm using the latest version.

We'll see if this does anything.

@bradfitz
Contributor

GKE master is updated to 1.5.2.

The 5 nodes are half done updating from 1.4.6 to 1.5.2 (2 done, 1 rebooting, 2 still on 1.4.6).

@bradfitz
Contributor

The master and all five of the n1-standard-32 nodes are now on 1.5.2.

Wait and see now, I guess.

@bradfitz
Contributor

Another instance of this bug, but on Windows:

https://storage.googleapis.com/go-build-log/c984be4c/windows-amd64-gce_664bd878.log

Note that Windows machines are new VMs (with no prior state) per build, so they should not be overloaded or stale or have stray processes running.

@bradfitz added the OS-Windows and Testing labels on Mar 10, 2017
@bradfitz added this to the Go1.9 milestone on Mar 10, 2017
@bradfitz
Contributor

bradfitz commented Apr 1, 2017

@alexbrainman
Member

More:
https://storage.googleapis.com/go-build-log/a9fae47f/windows-386-gce_e4a6dc6c.log

I wonder if this is actually #18253 instead?
In fact #18253 should be fixed now (by CL 34616) ...

Alex

@aclements
Member

There haven't been any linux/* failures since Brad kicked the builders, and the Windows failures are almost certainly #18253 based on the errno, so I believe this is fixed.

@golang golang locked and limited conversation to collaborators Jun 7, 2018