
runtime: random panics when running tests on RHEL 6.6 #13968

Closed
gshimansky opened this issue Jan 15, 2016 · 11 comments

@gshimansky

First I found this bug on an RHEL (Red Hat Enterprise Linux Server) 6.6 system, in a qemu virtual machine with 24 virtual processors.
log-virtual.txt

Linux kernel version is 2.6.32-504.el6.x86_64. Everything used to work at revision 732e2cd, which I had previously checked out, so I bisected the problem and found the problematic commits: d513ee7 and f034ee8. I even created a patch that reverts these two commits, and it fixed the problem for me on the virtual machine.
fix-virtual.patch.txt

But later I found a RHEL 6.6 system that runs on physical hardware with 48 processors and no virtualization layer, and I tried running tests on it. Tests always crash on this machine. Even revision 732e2cd, which used to work for me on the virtual machine, produces the same result.
log-physical.txt

A similar system running Ubuntu 15.04 (kernel 3.19.0-15-generic) doesn't have any such problems.

The Go bootstrap compiler used in both cases is the recently released version 1.5.3.

@gshimansky
Author

Reran the tests with GOTRACEBACK=2:
log-virtual.txt
log-physical.txt

@davecheney
Contributor

11 is ECHILD, which clone(2) is not documented to return. I don't think the changes you highlighted are directly responsible; they just changed the pattern of access that pushed this machine over some limit.

Is AppArmor or SELinux in play? Are there any odd entries in /etc/security (not 100% sure of the name)? What is the output of ulimit -a on the unaffected and affected machines?

I don't think qemu is related, unless the machine inside the qemu host is starved for memory.

@davecheney
Contributor

Can you remove NFS from the equation ?

@gshimansky
Author

Yes, I don't think that running tests in parallel causes problems in newosproc and pthread_create. I just described how it changed the behavior on the virtual system.

I reran the tests on the virtual machine in /tmp so that there is no NFS access.
log-virtual.txt

The physical system doesn't use NFS at all.

@gshimansky
Author

The ulimit -a output is the same on both systems:

core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 515268
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

SELinux seems to be present because there are files in /etc/security and the command selinuxenabled returns 0. But I am quite sure that all SELinux settings are at their default values from the distribution installation.

@gshimansky
Author

I disabled SELinux in /etc/selinux/config (SELINUX=disabled, and selinuxenabled returns 1 now), but it didn't change anything.
log-virtual-noselinux.txt

@gshimansky
Author

I found a system with RHEL 7.1 (kernel version 3.10.0-229.el7.x86_64) and tried running tests on it too. No problems encountered. So I am starting to think that there may be a bug specific to the RHEL 6.6 kernel.

@ianlancetaylor ianlancetaylor changed the title Random panics when running tests on RHEL 6.6 runtime: random panics when running tests on RHEL 6.6 Jan 15, 2016
@ianlancetaylor
Contributor

It turns out that error 11 is EAGAIN, not ECHILD.

I have long suspected that there is a potential bug in the GNU/Linux support, but I have never been able to write a test case for it. The Linux kernel source code shows that if one thread calls clone while a different thread is calling exec, the call to clone can return EAGAIN (look for uses of in_exec in the kernel source code). That suggests that newosproc in runtime/os1_linux.go should check for that case, and loop calling clone again. But since I've never been able to write a test case showing the problem, I've never made the change.

This kind of problem, if it is indeed the problem, could certainly be kernel specific.

You could try applying this patch to runtime/os1_linux.go to see if it fixes the problem.

diff --git a/src/runtime/os1_linux.go b/src/runtime/os1_linux.go
index b38cfc1..2d967a2 100644
--- a/src/runtime/os1_linux.go
+++ b/src/runtime/os1_linux.go
@@ -141,7 +141,13 @@ func newosproc(mp *m, stk unsafe.Pointer) {
        // with signals disabled.  It will enable them in minit.
        var oset sigset
        rtsigprocmask(_SIG_SETMASK, &sigset_all, &oset, int32(unsafe.Sizeof(oset)))
-       ret := clone(cloneFlags, stk, unsafe.Pointer(mp), unsafe.Pointer(mp.g0), unsafe.Pointer(funcPC(mstart)))
+       var ret int32
+       for i := 0; i < 10; i++ {
+               ret = clone(cloneFlags, stk, unsafe.Pointer(mp), unsafe.Pointer(mp.g0), unsafe.Pointer(funcPC(mstart)))
+               if ret != _EAGAIN {
+                       break
+               }
+       }
        rtsigprocmask(_SIG_SETMASK, &oset, nil, int32(unsafe.Sizeof(oset)))

        if ret < 0 {

@ianlancetaylor ianlancetaylor added this to the Go1.6Maybe milestone Jan 15, 2016
@gshimansky
Author

Thank you for the patch, but it didn't help so far. I modified it a bit because the comparison should be done with -_EAGAIN (clone returns the negated errno), but the panics remain. My patch looks like this now:

diff --git a/src/runtime/os1_linux.go b/src/runtime/os1_linux.go
index b38cfc1..d6b9408 100644
--- a/src/runtime/os1_linux.go
+++ b/src/runtime/os1_linux.go
@@ -141,11 +141,21 @@ func newosproc(mp *m, stk unsafe.Pointer) {
        // with signals disabled.  It will enable them in minit.
        var oset sigset
        rtsigprocmask(_SIG_SETMASK, &sigset_all, &oset, int32(unsafe.Sizeof(oset)))
-       ret := clone(cloneFlags, stk, unsafe.Pointer(mp), unsafe.Pointer(mp.g0), unsafe.Pointer(funcPC(mstart)))
+    var ret int32
+    var i int32
+    for i = 0; i < 10; i++ {
+        ret = clone(cloneFlags, stk, unsafe.Pointer(mp), unsafe.Pointer(mp.g0), unsafe.Pointer(funcPC(mstart)))
+        if ret < 0 {
+            print("Got error number ", -ret, "\n")
+        }
+        if ret != -_EAGAIN {
+            break
+        }
+    }
        rtsigprocmask(_SIG_SETMASK, &oset, nil, int32(unsafe.Sizeof(oset)))

        if ret < 0 {
-               print("runtime: failed to create new OS thread (have ", mcount(), " already; errno=", -ret, ")\n")
+               print("runtime: failed to create new OS thread (have ", mcount(), " already; errno=", -ret, "), tries ", i, "\n")
                throw("newosproc")
        }
 }

and it produces errors like this:

Got error number 11
Got error number 11
Got error number 11
Got error number 11
Got error number 11
Got error number 11
Got error number 11
Got error number 11
Got error number 11
Got error number 11
runtime: failed to create new OS thread (have 2 already; errno=11), tries 10
fatal error: newosproc

There are also some cgo tests that fail in pthread_create: runtime/cgo: pthread_create failed: Resource temporarily unavailable.

@ianlancetaylor
Contributor

Thanks for trying it. In that case you should check how many processes are running in total on the machine, and how many are permitted: this is ulimit -u.

I suppose you could also try adding a call to usleep(1) in the loop.

@gshimansky
Author

It is really a surprise to me, but increasing ulimit -u from 1024 to 2048 helped. All tests passed on both the virtual and physical systems. Ubuntu doesn't have the limit set, and RHEL 7.1 has 4096, which is why tests passed on those systems.

I think this bug can be closed.
