runtime: failed to create new OS thread #19163
This kind of error usually indicates that the system is overloaded, conceivably by some other test running in parallel. However, in this case it can't be running in parallel with the one test that I know can cause these kinds of problems, misc/cgo/test/issue18146.go. I don't know what is going on here.
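(For context, a minimal sketch of the failure mode — hypothetical, not the actual issue18146.go test. Every goroutine that pins itself to an OS thread and then blocks forces the runtime to create another thread for the remaining runnable goroutines, so on a machine whose thread limit, e.g. RLIMIT_NPROC, is below the runtime's own default 10000-thread cap, thread creation eventually fails with exactly this error.)

```go
package main

// Hypothetical reproducer, not the real misc/cgo/test/issue18146.go.
// Each goroutine locks itself to a dedicated OS thread and blocks, so
// the runtime must keep creating new threads to run everything else.
// Once the OS refuses (EAGAIN from clone/pthread_create), the process
// dies with "runtime: failed to create new OS thread".
import "runtime"

func main() {
	for i := 0; i < 100000; i++ {
		go func() {
			runtime.LockOSThread() // dedicate an OS thread to this goroutine
			select {}              // block forever, keeping the thread pinned
		}()
	}
	select {} // block main as well
}
```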
An example from the tools repo: https://build.golang.org/log/81e6ad3b5a351ecfc96c3f663d6b649794b493b4
I see no evidence that this is the fault of the builders. The Kubernetes configuration hasn't changed (same node count and size, same pod limits). No pod leaks that I can see. No new master or node versions. No logged errors. Unless one particularly bad CL was running on a trybot, consuming threads like crazy, and Kubernetes' isolation between containers isn't good enough to contain it. But I'm not sure we keep enough logs (or enough association between build failure logs and GKE logs) to prove that. /cc @rsc (who mentioned this to me on chat)
I see that Kubernetes doesn't seem to support setting rlimits (kubernetes/kubernetes#3595). So maybe we did just have one bad build somewhere impacting other builds. Looks like I can modify the builders to set their own limits early in the build to prevent bad builds from impacting other pods.
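A sketch of what that might look like — assuming golang.org/x/sys/unix, with 4096 as an arbitrary illustrative cap rather than the builders' real setting. On Linux, RLIMIT_NPROC counts threads as well as processes and is inherited across fork/exec, so a runaway build would hit its own limit instead of starving the whole node:

```go
package main

import (
	"log"
	"os"
	"os/exec"

	"golang.org/x/sys/unix"
)

func main() {
	// Cap how many processes/threads this user may create before the
	// build runs. 4096 is a made-up value for illustration.
	lim := unix.Rlimit{Cur: 4096, Max: 4096}
	if err := unix.Setrlimit(unix.RLIMIT_NPROC, &lim); err != nil {
		log.Fatalf("setrlimit(RLIMIT_NPROC): %v", err)
	}

	// The limit is inherited by children, so run the build as a child.
	cmd := exec.Command("go", "test", "std") // placeholder for the real build command
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
}
```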
I kicked off a Go 1.8 trybot run and it also failed on the GKE builders, so I think our GKE nodes are just screwed up somehow. I don't see any leaked pods, though, and I haven't recently changed anything about the builders that should affect the GKE builders. I tried to ssh into the GKE nodes via the GCP web UI and the ssh failed to connect. I tried to kubectl proxy to see their web UI (using the GCP web UI instructions) and I got auth errors. I tried again after updating my gcloud components, but got the same results. So I have zero visibility into what is happening on the 5 nodes of the GKE cluster, other than listing their pods and such and seeing that they look fine. Maybe some system pod or other daemon went crazy and leaked a bunch of threads? In any case, I have to reboot them anyway, so I'm just updating from GKE 1.4.6 to GKE 1.5.2 (using the GCP web UI option), since bug reports against the Kubernetes/GKE teams will probably be better received if I'm using the latest version. We'll see if this does anything.
The GKE master is updated to 1.5.2. The 5 nodes are half done updating from 1.4.6 to 1.5.2 (2 done, 1 rebooting, 2 still on 1.4.6).
The master and all five of the n1-standard-32 nodes are now on 1.5.2. Wait and see now, I guess.
Another instance of this bug, but on Windows: https://storage.googleapis.com/go-build-log/c984be4c/windows-amd64-gce_664bd878.log Note that the Windows machines are new VMs (with no prior state) per build, so they should not be overloaded or stale or have stray processes running.
I wonder if this is actually #18253 instead? Alex
There haven't been any linux/* failures since Brad kicked the builders, and the Windows failures are almost certainly #18253 based on the errno, so I believe this is fixed.
There are test failures on the builder dashboard at various places on various machines with "runtime: failed to create new OS thread".
The earliest seems to be https://build.golang.org/log/1a62fd950384d62c1782922f55fbaf194691d126, on my commit 98061fa. But I don't know how that CL could be related. Should I revert it?