
runtime/race: some kind of fork/exec deadlock? #4840

Closed

alberts opened this issue Feb 18, 2013 · 17 comments

Comments

alberts (Contributor) commented Feb 18, 2013

I've been stalking a strange bug on a "stress test" builder that runs under Jenkins.

I have the Jenkins builder configured to kill a job if it runs for more than 60 minutes.
The current set of tests usually finish in about 20 to 30 minutes.

I had 4 builds between 4 February and 14 February that got killed after 60 minutes. At
first I thought that the machine was under high load or that the Jenkins timer was
misbehaving, but I added some timing logs to the build script and everything seemed fine.

I disabled the Jenkins build time limit. I caught one this morning that had been stuck
for a few hours.

There were two processes running that were part of the suspect build:

20397 ? Sl 0:48 go test std -short -timeout=120s
22515 ? S 0:00 go test std -short -timeout=120s

I could tell from their /proc/<PID>/environ file that they were both part of the
Jenkins job.

/proc/20397 had:

cwd -> /build/go.tip/go/src
exe -> /build/go.tip/go/bin/go

/build/go.tip/go is $GOROOT.

/proc/22515 had:

cwd -> /dev/shm/go-build017000760/sync/_test
exe -> /build/go.tip/go/bin/go

/dev/shm is TMPDIR.

This looks like it could be a process stuck between fork and exec?

The last thing the job had printed out before I killed it was:

ok log/syslog 2.187s

In previous cases I had:

+ go test -v -cpu 1,2,4 std
...
ok mime/multipart 0.642s

Build timed out (after 60 minutes). Marking the build as aborted.

+ go test -v -cpu 1,2,4 std
...
ok crypto/tls 1.182s
Build timed out (after 60 minutes). Marking the build as aborted.

and a very nice one:

+ go version
go version devel +d8e8634eec4e Thu Feb 14 15:04:22 2013 +1100 linux/amd64
++ seq 1 5
+ for i in '`seq 1 5`'
+ export GOMAXPROCS=3
+ GOMAXPROCS=3
+ ./run.bash --no-rebuild
# Testing packages.
ok cmd/api 0.055s
? cmd/cgo [no test files]
ok cmd/fix 2.585s
ok cmd/go 0.058s
? cmd/godoc [no test files]
ok cmd/gofmt 0.103s
? cmd/vet [no test files]
? cmd/yacc [no test files]
ok archive/tar 0.041s
Build timed out (after 60 minutes). Marking the build as aborted.

so this doesn't seem to be related to any of the recent issues with log/syslog.

I tried to strace pid 22515 on the way out. I sent a SIGQUIT first, which didn't kill it.
Then I sent a SIGKILL.

22515 futex(0x7fafb4000020, FUTEX_WAIT_PRIVATE, 2, NULL) = ? ERESTARTSYS (To be
restarted if SA_RESTART is set)
22515 --- SIGQUIT {si_signo=SIGQUIT, si_code=SI_USER, si_pid=13066, si_uid=0} ---
22515 futex(0xb68930, FUTEX_WAKE, 1) = 0
22515 rt_sigreturn() = 202
22515 futex(0x7fafb4000020, FUTEX_WAIT_PRIVATE, 2, NULL) = ? ERESTARTSYS (To be
restarted if SA_RESTART is set)
22515 --- SIGQUIT {si_signo=SIGQUIT, si_code=SI_USER, si_pid=13066, si_uid=0} ---
22515 rt_sigreturn() = 202
22515 futex(0x7fafb4000020, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
22515 +++ killed by SIGKILL +++

I also straced pid 20397, but that trace is less enlightening.

Finally, I noticed that 22515 had quite a lot of file descriptors hanging around.
Doesn't look quite right at first glance:

# ls -la /proc/22515/fd
total 0
dr-x------ 2 jenkins jenkins  0 Feb 18 06:04 .
dr-xr-xr-x 8 jenkins jenkins  0 Feb 15 21:47 ..
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 0 -> /dev/null
l-wx------ 1 jenkins jenkins 64 Feb 18 06:07 1 -> pipe:[566509041]
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 10 -> /dev/null
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 11 -> /dev/null
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 13 -> pipe:[566509040]
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 14 -> /dev/null
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 15 -> pipe:[566530248]
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 16 -> pipe:[566498220]
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 17 -> pipe:[566506106]
l-wx------ 1 jenkins jenkins 64 Feb 18 06:07 18 -> pipe:[566498220]
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 19 -> /dev/null
l-wx------ 1 jenkins jenkins 64 Feb 18 06:07 2 -> pipe:[566433078]
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 20 -> /dev/null
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 22 -> pipe:[566527240]
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 23 -> pipe:[566513584]
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 24 -> /dev/null
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 25 -> pipe:[566508526]
l-wx------ 1 jenkins jenkins 64 Feb 18 06:07 26 -> pipe:[566508526]
l-wx------ 1 jenkins jenkins 64 Feb 18 06:07 27 -> pipe:[566536197]
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 28 -> /dev/null
l-wx------ 1 jenkins jenkins 64 Feb 18 06:07 29 -> pipe:[566530248]
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 3 -> pipe:[566527242]
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 30 -> pipe:[566506059]
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 31 -> pipe:[566521230]
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 32 -> /dev/null
l-wx------ 1 jenkins jenkins 64 Feb 18 06:07 33 -> pipe:[566506106]
l-wx------ 1 jenkins jenkins 64 Feb 18 06:07 34 -> pipe:[566527240]
l-wx------ 1 jenkins jenkins 64 Feb 18 06:07 35 -> pipe:[566521230]
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 36 -> pipe:[566521227]
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 37 -> pipe:[566509032]
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 38 -> pipe:[566514464]
l-wx------ 1 jenkins jenkins 64 Feb 18 06:07 39 -> pipe:[566521227]
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 4 -> /dev/null
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 40 -> /dev/null
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 41 -> /dev/null
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 42 -> /dev/null
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 43 -> pipe:[566531204]
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 44 -> pipe:[566508527]
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 45 -> pipe:[566516473]
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 46 -> pipe:[566529271]
l-wx------ 1 jenkins jenkins 64 Feb 18 06:07 47 -> pipe:[566529271]
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 48 -> pipe:[566529259]
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 49 -> pipe:[566509043]
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 5 -> pipe:[566521231]
l-wx------ 1 jenkins jenkins 64 Feb 18 06:07 50 -> pipe:[566509043]
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 51 -> pipe:[566511380]
l-wx------ 1 jenkins jenkins 64 Feb 18 06:07 52 -> pipe:[566508527]
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 53 -> /dev/null
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 54 -> /dev/null
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 55 -> pipe:[566520373]
l-wx------ 1 jenkins jenkins 64 Feb 18 06:07 57 -> pipe:[566520373]
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 59 -> pipe:[566525096]
l-wx------ 1 jenkins jenkins 64 Feb 18 06:07 6 -> pipe:[566521231]
l-wx------ 1 jenkins jenkins 64 Feb 18 06:07 60 -> pipe:[566525096]
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 61 -> /dev/null
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 63 -> pipe:[566509038]
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 65 -> /dev/null
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 66 -> pipe:[566509041]
l-wx------ 1 jenkins jenkins 64 Feb 18 06:07 67 -> pipe:[566509041]
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 68 -> pipe:[566521228]
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 69 -> pipe:[566521213]
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 7 -> pipe:[566536197]
l-wx------ 1 jenkins jenkins 64 Feb 18 06:07 70 -> pipe:[566521228]
l-wx------ 1 jenkins jenkins 64 Feb 18 06:07 73 -> pipe:[566514464]
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 75 -> pipe:[566529270]
l-wx------ 1 jenkins jenkins 64 Feb 18 06:07 76 -> pipe:[566529270]
lr-x------ 1 jenkins jenkins 64 Feb 18 06:07 9 -> /dev/null

I've seen this issue with:

go version devel +b67523215571 Mon Feb 04 15:30:41 2013 +1100 linux/amd64 
go version devel +a44171137af8 Sat Feb 09 17:36:31 2013 -0500 linux/amd64 
go version devel +d8e8634eec4e Thu Feb 14 15:04:22 2013 +1100 linux/amd64
go version devel +537555f45dcc Fri Feb 15 13:37:43 2013 -0800 linux/amd64 

Kernel is 3.4.7-1.fc16.x86_64.

rsc (Contributor) commented Feb 19, 2013

Comment 1:

If you can make it happen again, please try doing a SIGABRT and capture the
goroutine dump.

alberts (Contributor, Author) commented Feb 21, 2013

Comment 2:

Here's a complete trace.
I ABRTed the child process first.
The parent seemed stuck doing a bunch of sched_yield system calls, so I ABRTed that next.
It seems to have gotten stuck in a slightly different place than before, but it looks like
the same issue.
You'll see that my build steps are:
./make.bash
go install -v -race std
so I think I actually end up with a race-enabled go command, which might be a bit
different from what normal people work with daily.
That explains the tsan stuff in the stack traces.
So this might actually be a race detector issue.

Attachments:

  1. consoleText.txt (142786 bytes)

dvyukov (Member) commented Feb 21, 2013

Comment 3:

There are 2 go processes, and both have [running] goroutines.
Can you check whether they're consuming CPU, or completely blocked?
Also gdb 'thread apply all bt' may help. Perhaps it's deadlocked inside the tsan runtime
(and holding the Go fork lock).

alberts (Contributor, Author) commented Feb 21, 2013

Comment 4:

I'm pretty sure they were completely blocked. They had been blocked for hours and only
consumed a few seconds of CPU time according to ps.
Will try to get a gdb backtrace next time this happens.

alberts (Contributor, Author) commented Feb 21, 2013

Comment 5:

(gdb) thread apply all bt
Thread 1 (process 2399):
#0  0x00007f7e1cd296ac in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00007f7e1ccafa0f in _L_lock_10695 () from /lib64/libc.so.6
#2  0x00007f7e1ccad587 in malloc () from /lib64/libc.so.6
#3  0x000000000047eaf4 in __sanitizer::InternalAlloc(unsigned long) ()
#4  0x00007f7de40b0a68 in ?? ()
#5  0x0000000000481e42 in __tsan::StackTrace::ObtainCurrent(__tsan::ThreadState*,
unsigned long) ()
#6  0x00007f7de40b0a08 in ?? ()
#7  0x00007f7de40b3240 in ?? ()
#8  0x0000000000000005 in ?? ()
#9  0x0000000000481ee1 in __tsan::TraceSwitch(__tsan::ThreadState*) ()
#10 0x00007f7dd57f9e00 in ?? ()
#11 0x00007f7de40b0a08 in ?? ()
#12 0x00007f7de40b0a08 in ?? ()
#13 0x0094000000a00000 in ?? ()
#14 0x0094000000a00001 in ?? ()
#15 0x00000000004887a7 in __tsan_read ()
#16 0x00000000005a8b2c in syscall.forkAndExecInChild (fp=6171101, argv0=0xc2008a9800 "",
argv= []*uint8, envv= []*uint8, chroot=0x7f7dd57fa700 "", dir=0x0, attr=0x0, sys=0x0,
pipe=0, pid=140179631511840, err=0)
    at /build/go.tip/go/src/pkg/syscall/exec_linux.go:52
#17 0x00000000005e29dd in threadentry ()
#18 0x000000c2008a9800 in ?? ()
#19 0x000000c200a2c300 in ?? ()
#20 0x0000000000470690 in ?? () at /build/go.tip/go/src/pkg/runtime/proc.c:772
#21 0x0000000000000000 in ?? ()

alberts (Contributor, Author) commented Feb 21, 2013

Comment 6:

this gdb backtrace was with
go version devel +43cc2ef77ca6 Thu Feb 21 21:59:46 2013 +0400 linux/amd64
if that helps.
I'll leave this process in place if you want me to look at anything else.

dvyukov (Member) commented Feb 21, 2013

Comment 7:

Thanks! It helps!
It's indeed a fork/exec deadlock caused by tsan.
The problem is that between fork and exec, one can do only a very limited set of things,
and tsan does prohibited things, i.e. it calls malloc()... and apparently the child was
forked with the malloc mutex already taken.
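
A minimal sketch of that constraint (illustrative only; inChildAfterFork is an invented
helper standing in for the child half of syscall.forkAndExecInChild, not the real code):

package main

import "syscall"

// Between fork and exec, the child may only perform raw system calls: it
// inherits every lock (libc's malloc, the Go allocator, tsan's internal
// allocator) in whatever state other threads left it at the moment of fork,
// so anything that might take a lock can hang forever.
func inChildAfterFork(dirfd int, argv0 []byte) syscall.Errno {
	// Fine: a raw syscall, no allocation, no scheduler involvement.
	if _, _, err := syscall.RawSyscall(syscall.SYS_FCHDIR, uintptr(dirfd), 0, 0); err != 0 {
		return err
	}

	// Not fine when built with -race: this plain memory read is compiled
	// into a __tsan_read call, which can reach __sanitizer::InternalAlloc
	// and then malloc, whose mutex another thread may have held at fork
	// time. That is the state visible in the gdb backtrace above.
	_ = argv0[0]

	return 0
}

func main() {
	// Called only so the sketch compiles and runs; the deadlock can occur
	// only in the forked child of a binary built with -race.
	_ = inChildAfterFork(0, []byte{'x'})
}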

Labels changed: added racedetector.

Owner changed to @dvyukov.

dvyukov (Member) commented Feb 21, 2013

Comment 8:

> I'll leave this process in place if you want me to look at anything else.
I think I will be able to reproduce it; you may shut it down.

rsc (Contributor) commented Feb 21, 2013

Comment 9:

Labels changed: added priority-later, removed priority-triage.

Status changed to Accepted.

dvyukov (Member) commented Feb 22, 2013

Comment 10:

Yes, I was able to reproduce it by running "GOMAXPROCS=82 go test -run=noneasd std"
repeatedly, but it required about 50 runs.
The simplest solution would be to not instrument that piece of code between fork and
exec, but I am not sure how to do it; currently we don't have such a general facility.
Perhaps hack go/gc to ignore a specific file in the syscall package.

rsc (Contributor) commented Mar 12, 2013

Comment 11:

[The time for maybe has passed.]

Labels changed: removed go1.1maybe.

dvyukov (Member) commented Jun 28, 2013

Comment 12:

It became worse with preemption. Now forkAndExecInChild() calls runtime.Gosched() after
fork()...
goroutine 5 [runnable]:
runtime/race.Write(0x1f63308, 0xc2100e43d0, 0x5ad9be)
    runtime/race/_obj/race.cgo1.go:55
runtime.racewrite(0xc2100e43d0, 0x0, 0x0, 0x0, 0x0, ...)
    src/pkg/runtime/race.c:75 +0x67
syscall.forkAndExecInChild(0xc2100cc940, 0xc2100cca80, 0x7, 0x7, 0xc210135300, ...)
    src/pkg/syscall/exec_linux.go:58 +0x22e
syscall.forkExec(0xc2100cc9c0, 0x3e, 0xc2100f12a0, 0x6, 0x6, ...)
    src/pkg/syscall/exec_unix.go:191 +0x596
syscall.StartProcess(0xc2100cc9c0, 0x3e, 0xc2100f12a0, 0x6, 0x6, ...)
    src/pkg/syscall/exec_unix.go:238 +0x84
os.startProcess(0xc2100cc9c0, 0x3e, 0xc2100f12a0, 0x6, 0x6, ...)
    src/pkg/os/exec_posix.go:45 +0x547
os.StartProcess(0xc2100cc9c0, 0x3e, 0xc2100f12a0, 0x6, 0x6, ...)
    src/pkg/os/doc.go:24 +0x6c
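
For a concrete picture, a minimal sketch (illustrative; recordPid and childPid are
invented names, and the rewrite shown in the comment is only approximate) of what
-race instrumentation adds around an ordinary store:

package main

import "fmt"

var childPid int

// Compiled with -race, the store below becomes roughly:
//
//	runtime.racewrite(address of childPid)  // enters the race runtime and,
//	                                        // with preemption, is also a
//	                                        // point where the goroutine
//	                                        // can be rescheduled
//	childPid = pid
//
// so a function that merely assigns to a variable between fork and exec
// implicitly calls into the runtime, as the goroutine dump above shows.
func recordPid(pid int) {
	childPid = pid
}

func main() {
	recordPid(1234)
	fmt.Println(childPid)
}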

dvyukov (Member) commented Jun 28, 2013

Comment 13:

I want to change 6g so that it does not instrument files containing the
syscall.forkAndExecInChild() function.
Any objections?
In C/C++ land we mark functions that call fork with
__attribute__((no-thread-safety-analysis)) or use separate instrumentation blacklists,
but I think it's too early for such a general solution in Go.

ianlancetaylor (Contributor)

Comment 14:

To clarify, you want to change the behaviour on specific files within a package?  That
is, when compiling the syscall package, you want to compile some files with -race and
some without?  Further, do you really care about the file level?  Would it suffice to
compile some functions with -race and some without?
Note that we already have a precedent for comments that are specific to functions, such
as //go:noescape.  It seems that it would not be out of the question to have a
//go:norace comment.
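
A minimal sketch of what such a function-level comment might look like (illustrative
only; childAfterFork is an invented name, and whether the compiler honors the directive
outside the standard library may depend on the Go version):

package main

// The proposed //go:norace directive, written immediately before a function
// declaration, asks the compiler to skip race-detector instrumentation for
// that one function, so its loads and stores emit no __tsan_read or
// __tsan_write calls even when the package is built with -race.

//go:norace
func childAfterFork(buf []byte) byte {
	// With instrumentation suppressed, this plain read no longer calls into
	// the race runtime, so a post-fork child cannot trip over tsan's
	// internal allocator here.
	return buf[0]
}

func main() {
	_ = childAfterFork([]byte{'x'})
}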

dvyukov (Member) commented Jun 28, 2013

Comment 15:

> To clarify, you want to change the behaviour on specific files within a package?
Yes.
I don't have a strong opinion on files vs. functions yet. Function level is probably
better, because next time we may need to leave only a single function in a large file
uninstrumented.
I will look at how //go:noescape is implemented. If most of the machinery for such
attributes is already there, and I just need to add one more attribute, then it looks
like a good choice.

dvyukov (Member) commented Jun 29, 2013

Comment 16:

Mailed https://golang.org/cl/10783043 with go:norace

dvyukov (Member) commented Jul 16, 2013

Comment 17:

This issue was closed by revision 63e0ddc.

Status changed to Fixed.

golang locked and limited conversation to collaborators Jun 24, 2016