runtime: failed to create new OS thread #19163
This kind of error usually indicates that the system is overloaded, conceivably by some other test running in parallel. However, in this case it can't be running in parallel with the one test that I know can cause these kinds of problems, misc/cgo/test/issue18146.go. I don't know what is going on here.
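(For context, a minimal sketch of the failure mode — hypothetical, not the actual issue18146.go test. Every goroutine that pins itself to an OS thread and then blocks forces the runtime to create another thread for the remaining runnable goroutines, so on a machine whose thread limit, e.g. RLIMIT_NPROC, is below the runtime's own default 10000-thread cap, thread creation eventually fails with exactly this error.)

```go
package main

// Hypothetical reproducer, not the real misc/cgo/test/issue18146.go.
// Each goroutine locks itself to a dedicated OS thread and blocks, so
// the runtime must keep creating new threads to run everything else.
// Once the OS refuses (EAGAIN from clone/pthread_create), the process
// dies with "runtime: failed to create new OS thread".
import "runtime"

func main() {
	for i := 0; i < 100000; i++ {
		go func() {
			runtime.LockOSThread() // dedicate an OS thread to this goroutine
			select {}              // block forever, keeping the thread pinned
		}()
	}
	select {} // block main as well
}
```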
An example from the tools repo: https://build.golang.org/log/81e6ad3b5a351ecfc96c3f663d6b649794b493b4
I see no evidence that this is the fault of the builders. The Kubernetes configuration hasn't changed (same node count and size, same pod limits). No pod leaks that I can see. No new master or node versions. No logged errors. Unless one particularly bad CL was running on a trybot, consuming threads like crazy, and Kubernetes' isolation between containers isn't good enough to contain it. But I'm not sure we keep enough logs (or enough association between build failure logs and GKE logs) to prove that. /cc @rsc (who mentioned this to me on chat)
I see that Kubernetes doesn't seem to support setting rlimits (kubernetes/kubernetes#3595). So maybe we did just have one bad build somewhere impacting other builds. Looks like I can modify the builders to set their own limits early in the build to prevent bad builds from impacting other pods.
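A sketch of what that might look like — assuming golang.org/x/sys/unix, with 4096 as an arbitrary illustrative cap rather than the builders' real setting. On Linux, RLIMIT_NPROC counts threads as well as processes and is inherited across fork/exec, so a runaway build would hit its own limit instead of starving the whole node:

```go
package main

import (
	"log"
	"os"
	"os/exec"

	"golang.org/x/sys/unix"
)

func main() {
	// Cap how many processes/threads this user may create before the
	// build runs. 4096 is a made-up value for illustration.
	lim := unix.Rlimit{Cur: 4096, Max: 4096}
	if err := unix.Setrlimit(unix.RLIMIT_NPROC, &lim); err != nil {
		log.Fatalf("setrlimit(RLIMIT_NPROC): %v", err)
	}

	// The limit is inherited by children, so run the build as a child.
	cmd := exec.Command("go", "test", "std") // placeholder for the real build command
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
}
```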
I kicked off a Go 1.8 trybot run and it also failed on the GKE builders, so I think our GKE nodes are just screwed up somehow. I don't see any leaked pods, though, and I haven't recently changed anything about the builders that should affect the GKE builders. I tried to ssh into the GKE nodes via the GCP web UI and the ssh failed to connect. I tried to kubectl proxy to see their web UI (using the GCP web UI instructions) and I got auth errors. I tried again after updating my gcloud components, but got the same results. So I have zero visibility into what is happening on the 5 nodes of the GKE cluster, other than listing their pods and such and seeing that they look fine. Maybe some system pod or other daemon went crazy and leaked a bunch of threads? In any case, I have to reboot them anyway, so I'm just updating from GKE 1.4.6 to GKE 1.5.2 (using the GCP web UI option), since bug reports against the Kubernetes/GKE teams will probably be better received if I'm using the latest version. We'll see if this does anything.
The GKE master is updated to 1.5.2. The 5 nodes are half done updating from 1.4.6 to 1.5.2 (2 done, 1 rebooting, 2 still on 1.4.6).
The master and all five of the n1-standard-32 nodes are now on 1.5.2. Wait and see now, I guess.
Another instance of this bug, but on Windows: https://storage.googleapis.com/go-build-log/c984be4c/windows-amd64-gce_664bd878.log Note that the Windows machines are new VMs (with no prior state) per build, so they should not be overloaded or stale or have stray processes running.
I wonder if this is actually #18253 instead? Alex
There haven't been any linux/* failures since Brad kicked the builders, and the Windows failures are almost certainly #18253 based on the errno, so I believe this is fixed.
There are test failures on the builder dashboard at various places on various machines with "runtime: failed to create new OS thread".
The earliest seems to be https://build.golang.org/log/1a62fd950384d62c1782922f55fbaf194691d126, on my commit 98061fa. But I don't know how that CL could be related. Should I revert it?