
runtime: random panics when running tests on RHEL 6.6 #13968

Closed
gshimansky opened this issue Jan 15, 2016 · 11 comments

@gshimansky

First I found this bug on an RHEL (Red Hat Enterprise Linux Server) 6.6 system, in a qemu virtual machine with 24 virtual processors.
log-virtual.txt

Linux kernel version is 2.6.32-504.el6.x86_64. Everything used to work at revision 732e2cd, which I had previously checked out, so I bisected the problem and found the problematic commits: d513ee7 and f034ee8. I even created a patch that reverts these two commits, and it fixed the problem for me on the virtual machine.
fix-virtual.patch.txt

But later I found a RHEL 6.6 system that runs on physical hardware with 48 processors and no virtualization layer, and I tried running tests on it. Tests always crash on this machine. Even revision 732e2cd, which used to work for me on the virtual machine, produces the same result.
log-physical.txt

A similar system running Ubuntu 15.04 (kernel 3.19.0-15-generic) doesn't have any such problems.

The Go bootstrap compiler used in both cases is the recently released version 1.5.3.

@gshimansky
Author

Reran the tests with GOTRACEBACK=2:
log-virtual.txt
log-physical.txt

@davecheney
Contributor

11 is ECHILD, which clone(2) is not documented to return. I don't think the changes you highlighted are directly responsible; they just changed the pattern of access that pushed this machine over some limit.

Is AppArmor or SELinux in play? Are there any odd entries in /etc/security (not 100% sure of the name)? What is the output of ulimit -a on the unaffected and affected machines?

I don't think qemu is related, unless the machine inside the qemu host is starved for memory.

@davecheney
Contributor

Can you remove NFS from the equation ?

@gshimansky
Author

Yes, I don't think that running tests in parallel causes problems in newosproc and pthread_create. I just described how it changed the behavior on the virtual system.

I reran the tests on the virtual machine in /tmp so that there is no NFS access.
log-virtual.txt

The physical system doesn't use NFS at all.

@gshimansky
Author

The ulimit -a output is the same on both systems:

core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 515268
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

SELinux seems to be present because there are files in /etc/security and the command selinuxenabled returns 0. But I am quite sure that all SELinux settings are at their default values from the distribution installation.

@gshimansky
Author

I disabled SELinux in /etc/selinux/config (SELINUX=disabled, and selinuxenabled returns 1 now), but it didn't change anything.
log-virtual-noselinux.txt

@gshimansky
Author

I found a system with RHEL 7.1 (kernel version 3.10.0-229.el7.x86_64) and tried running tests on it too. No problems encountered. So I am starting to think that there may be a bug specific to the RHEL 6.6 kernel.

@ianlancetaylor ianlancetaylor changed the title Random panics when running tests on RHEL 6.6 runtime: random panics when running tests on RHEL 6.6 Jan 15, 2016
@ianlancetaylor
Contributor

It turns out that error 11 is EAGAIN, not ECHILD.

I have long suspected that there is a potential bug in the GNU/Linux support, but I have never been able to write a test case for it. The Linux kernel source code shows that if one thread calls clone while a different thread is calling exec, the call to clone can return EAGAIN (look for uses of in_exec in the kernel source code). That suggests that newosproc in runtime/os1_linux.go should check for that case, and loop calling clone again. But since I've never been able to write a test case showing the problem, I've never made the change.

This kind of problem, if it is indeed the problem, could certainly be kernel specific.

You could try applying this patch to runtime/os1_linux.go to see if it fixes the problem.

diff --git a/src/runtime/os1_linux.go b/src/runtime/os1_linux.go
index b38cfc1..2d967a2 100644
--- a/src/runtime/os1_linux.go
+++ b/src/runtime/os1_linux.go
@@ -141,7 +141,13 @@ func newosproc(mp *m, stk unsafe.Pointer) {
        // with signals disabled.  It will enable them in minit.
        var oset sigset
        rtsigprocmask(_SIG_SETMASK, &sigset_all, &oset, int32(unsafe.Sizeof(oset)))
-       ret := clone(cloneFlags, stk, unsafe.Pointer(mp), unsafe.Pointer(mp.g0), unsafe.Pointer(funcPC(mstart)))
+       var ret int32
+       for i := 0; i < 10; i++ {
+               ret = clone(cloneFlags, stk, unsafe.Pointer(mp), unsafe.Pointer(mp.g0), unsafe.Pointer(funcPC(mstart)))
+               if ret != _EAGAIN {
+                       break
+               }
+       }
        rtsigprocmask(_SIG_SETMASK, &oset, nil, int32(unsafe.Sizeof(oset)))

        if ret < 0 {

@ianlancetaylor ianlancetaylor added this to the Go1.6Maybe milestone Jan 15, 2016
@gshimansky
Author

Thank you for the patch, but it didn't help so far. I modified it a bit because the comparison should be done with -_EAGAIN (clone returns the negated errno), but the panics remain. My patch looks like this now:

diff --git a/src/runtime/os1_linux.go b/src/runtime/os1_linux.go
index b38cfc1..d6b9408 100644
--- a/src/runtime/os1_linux.go
+++ b/src/runtime/os1_linux.go
@@ -141,11 +141,21 @@ func newosproc(mp *m, stk unsafe.Pointer) {
        // with signals disabled.  It will enable them in minit.
        var oset sigset
        rtsigprocmask(_SIG_SETMASK, &sigset_all, &oset, int32(unsafe.Sizeof(oset)))
-       ret := clone(cloneFlags, stk, unsafe.Pointer(mp), unsafe.Pointer(mp.g0), unsafe.Pointer(funcPC(mstart)))
+    var ret int32
+    var i int32
+    for i = 0; i < 10; i++ {
+        ret = clone(cloneFlags, stk, unsafe.Pointer(mp), unsafe.Pointer(mp.g0), unsafe.Pointer(funcPC(mstart)))
+        if ret < 0 {
+            print("Got error number ", -ret, "\n")
+        }
+        if ret != -_EAGAIN {
+            break
+        }
+    }
        rtsigprocmask(_SIG_SETMASK, &oset, nil, int32(unsafe.Sizeof(oset)))

        if ret < 0 {
-               print("runtime: failed to create new OS thread (have ", mcount(), " already; errno=", -ret, ")\n")
+               print("runtime: failed to create new OS thread (have ", mcount(), " already; errno=", -ret, "), tries ", i, "\n")
                throw("newosproc")
        }
 }

and it produces errors like this:

Got error number 11
Got error number 11
Got error number 11
Got error number 11
Got error number 11
Got error number 11
Got error number 11
Got error number 11
Got error number 11
Got error number 11
runtime: failed to create new OS thread (have 2 already; errno=11), tries 10
fatal error: newosproc

There are also some cgo tests that fail in pthread_create: runtime/cgo: pthread_create failed: Resource temporarily unavailable.

@ianlancetaylor
Contributor

Thanks for trying it. In that case you should check how many processes are running in total on the machine, and how many are permitted: this is ulimit -u.

I suppose you could also try adding a call to usleep(1) in the loop.

@gshimansky
Author

It is really a surprise to me, but increasing ulimit -u from 1024 to 2048 helped. All tests passed on both the virtual and physical systems. Ubuntu doesn't have the limit set, and RHEL 7.1 has 4096, which is why tests passed on those systems.

I think this bug can be closed.
