Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

syscall: ForkLock held during Accept #4737

Closed
gopherbot opened this issue Jan 31, 2013 · 20 comments
Closed

syscall: ForkLock held during Accept #4737

gopherbot opened this issue Jan 31, 2013 · 20 comments
Milestone

Comments

@gopherbot
Copy link

by patrick.allen.higgins:

Sorry that this is so wordy and without a simple test case. I haven't been able to
narrow it down to a simpler one yet, and may not have the time to do so. The tl;dr
version is that GDB clearly shows syscall.ForkLock as being locked when I don't believe
it should be.

I have a program which consists of a master which spawns slaves.

The master opens three net.Listener sockets and then spawns the first slave, passing it
the listener FDs.

On receipt of SIGHUP, the master spawns a new slave, passing it the listener FDs.

When a slave finishes initialization and starts serving HTTP requests on the 3
listeners, it sends a SIGUSR1 to the master.

On receipt of SIGUSR1, the master sends a SIGTERM to all but the most recently spawned
slave.

On receipt of SIGTERM, a slave closes all three of its listeners and then exits.

The master has a goroutine for each slave waiting for them to exit.

By running a "kill -HUP $master_pid" several times in rapid succession from
the shell (by hand, not a script), eventually I notice that the slaves do not exit.

Using lsof shows that one of the listeners is being closed, even though I close it.

I tried to insert code to run lsof at key points in the slave lifecycle and find that it
won't run at a certain point because it is waiting to acquire the syscall.ForkLock:

(gdb) gor 1 bt
#0  0x000000000041bf3c in runtime.gosched () at
/home/phiggins/go-tip/src/pkg/runtime/proc.c:967
#1  0x000000000041bfa6 in runtime.park (unlockf=void, lock=void, reason=void) at
/home/phiggins/go-tip/src/pkg/runtime/proc.c:979
#2  0x0000000000423ca6 in semacquireimpl (addr=void, profile=void) at
/home/phiggins/go-tip/src/pkg/runtime/sema.goc:1479
#3  0x0000000000423eae in sync.runtime_Semacquire (addr=void) at
/home/phiggins/go-tip/src/pkg/runtime/sema.goc:1533
#4  0x00000000004c2101 in sync.(*RWMutex).Lock (rw=0x916a98) at
/home/phiggins/go-tip/src/pkg/sync/rwmutex.go:80
#5  0x00000000004c9f51 in syscall.forkExec (argv0="/usr/sbin/lsof", argv=
[]string = {...}, attr=0xc2008370a0, pid=128849834000, err=0)
    at /home/phiggins/go-tip/src/pkg/syscall/exec_unix.go:183
#6  0x00000000004ca547 in syscall.StartProcess (argv0="/usr/sbin/lsof", argv=
[]string = {...}, attr=0xc2008370a0, pid=2, handle=4, err=0)
    at /home/phiggins/go-tip/src/pkg/syscall/exec_unix.go:244
#7  0x00000000004bcd27 in os.startProcess (name="/usr/sbin/lsof", argv=
[]string = {...}, attr=0xc200837050, p=0xc200091d80, err=0)
    at /home/phiggins/go-tip/src/pkg/os/exec_posix.go:45
#8  0x00000000004bbf9c in os.StartProcess (name="/usr/sbin/lsof", argv=
[]string = {...}, attr=0xc200837050, noname=void)
    at /home/phiggins/go-tip/src/pkg/os/doc.go:24
#9  0x00000000004c3f72 in os/exec.(*Cmd).Start (c=void, noname=void) at
/home/phiggins/go-tip/src/pkg/os/exec/exec.go:262
#10 0x00000000004c39a5 in os/exec.(*Cmd).Run (c=0xc200838000, noname=void) at
/home/phiggins/go-tip/src/pkg/os/exec/exec.go:229
#11 0x00000000004039de in main.lsof (msg="pre-close") at
/home/phiggins/src/RecommendationEngine/go/src/recserv/lsof.go:15
#12 0x00000000004073f3 in main.slave (config=0xc2001060e0) at
/home/phiggins/src/RecommendationEngine/go/src/recserv/slave.go:79
#13 0x0000000000406ab2 in main.main () at
/home/phiggins/src/RecommendationEngine/go/src/recserv/recserv.go:66

You can see that ForkLock is actually locked:

(gdb) p 'syscall.ForkLock'
$3 = {w = {state = 1, sema = 0}, writerSem = 0, readerSem = 0, readerCount =
-1073741823, readerWait = 1}

The full gdb session is attached.

Attachments:

  1. recserv-gdb-session.txt (13883 bytes)
@gopherbot
Copy link
Author

Comment 1 by patrick.allen.higgins:

I was able to write a test program that doesn't depend on anything but standard packages
which triggers the bug. Unfortunately it's largish.
Running "test.sh" trips the bug for me.
Removing the first config.LoadTLS() from slave fixes the issue for me.

Attachments:

  1. fork-lock.tar.gz (1316061 bytes)

@gopherbot
Copy link
Author

Comment 2 by patrick.allen.higgins:

I was able to write a test program that doesn't depend on anything but standard packages
which triggers the bug. Unfortunately it's largish.
Running "test.sh" trips the bug for me.
Removing the first config.LoadTLS() from slave fixes the issue for me.

Attachments:

  1. fork-lock.tar.gz (5989 bytes)

@gopherbot
Copy link
Author

Comment 3 by patrick.allen.higgins:

I have verified the bug exists on these combinations:
Compiler: 6g and 8g
OS: Fedora 18, 64-bit
Go Versions:
  * 1.0.3 on linux-386 (have not tried tip yet)
  * 1.0.3 and devel +f584af3b870d on linux-amd64.

@gopherbot
Copy link
Author

Comment 4 by patrick.allen.higgins:

I have simplified the test case further and added comments to slave.go about the various
changes that can be made to it which prevent the bug from being exposed.
If the sleep time after each "kill -HUP" in test.sh is increased to just 0.5, then I
don't trip the bug.
If the syscall.ForkLock manipulations are removed from slave.go entirely, then it
doesn't exit when the bug conditions are tripped because the socket is never actually
closed, even though Close() returns without error.

Attachments:

  1. fork-lock.tar.gz (5398 bytes)
  2. fork-lock-gdb.txt (6359 bytes)

@rsc
Copy link
Contributor

rsc commented Feb 3, 2013

Comment 5:

Fascinating, thanks. I haven't had a chance to try this on Linux yet but I
certainly plan to.

@rsc
Copy link
Contributor

rsc commented Feb 4, 2013

Comment 6:

Labels changed: added priority-later, removed priority-triage.

Status changed to Accepted.

@rsc
Copy link
Contributor

rsc commented Feb 5, 2013

Comment 7:

I have gotten your test.sh script to tell me it failed, but I don't believe this is
actually a bug. The master listens on a TCP port and passes that fd to many children.
That is, they all have references to the exact same TCP listener.
The slave exit closes its copy of the fd and then waits for the blocked accept to
finish. However, there is no reason to expect it to finish. The listener is still
active, and will be until all the copies of the fd are closed. The incoming signal may
force the accept to restart, but in all likelihood it will have successfully restarted
before the main slave thread gets around to closing the fd.
It is easy to believe that the different behaviors you are seeing depend primarily on
the order in which threads are scheduled to run. The accept will exit if it gets
restarted and sees the closed fd, but it need not be restarted; whether it does depends
on this kind of scheduling decision. Removing config.LoadTLS presumably changes the
timing or the cpu nice priorities or something like that.

@rsc
Copy link
Contributor

rsc commented Feb 5, 2013

Comment 8:

I bet you are using a Linux kernel before version 2.6.28, without Accept4. Then this
code:
func accept(fd int) (int, syscall.Sockaddr, error) {
    nfd, sa, err := syscall.Accept4(fd, syscall.SOCK_NONBLOCK|syscall.SOCK_CLOEXEC)
    // The accept4 system call was introduced in Linux 2.6.28.  If
    // we get an ENOSYS error, fall back to using accept.
    if err == nil || err != syscall.ENOSYS {
        return nfd, sa, err
    }
    // See ../syscall/exec_unix.go for description of ForkLock.
    // It is okay to hold the lock across syscall.Accept
    // because we have put fd.sysfd into non-blocking mode.
    syscall.ForkLock.RLock()
    nfd, sa, err = syscall.Accept(fd)
    if err == nil {
        syscall.CloseOnExec(nfd)
    }
    syscall.ForkLock.RUnlock()
    if err != nil {
        return -1, nil, err
    }
    if err = syscall.SetNonblock(nfd, true); err != nil {
        syscall.Close(nfd)
        return -1, nil, err
    }
    return nfd, sa, nil
}
will fall back to the ordinary Accept system call. But since you have invoked the File
method on the listener, it is now blocking, not non-blocking. So the comment's rationale
no longer holds. This accept blocks while holding the RLock, making the ForkLock
impossible to lock properly. I guess we have to accept the race and remove the RLock.
Newer kernels won't have it because they'll use Accept4.

@ianlancetaylor
Copy link
Contributor

Comment 9:

Presumably the same argument will apply to net/sys_cloexec.go, though, where it will
happen for all calls.  It seems that we are in trouble if anybody calls File on a socket
and then calls AcceptTCP or AcceptUnix on it.
Or maybe the code in netFD.dup is wrong.  The comment implies that it is possible to set
blocking mode only for the new fd, but it is not.  Setting blocking mode for the new fd
also sets it for the old one.  This makes me wonder why netFD.dup calls dup at all.

@rsc
Copy link
Contributor

rsc commented Feb 5, 2013

Comment 10:

I changed the comment in dup. We have to return a new fd because that's the
documented behavior. I wanted something where the closes would be
independent. It is too bad that the non-blocking mode is not per-fd but we
knew that at some point in the past, because the File method docs have
weasel words at the end:
    File sets the underlying os.File to blocking mode and returns a copy. It
    is the caller's responsibility to close f when finished. Closing c does
    not affect f, and closing f does not affect c.
    The returned os.File's file descriptor is different from the
    connection's. Attempting to change properties of the original using this
    duplicate may or may not have the desired effect.
Russ

@gopherbot
Copy link
Author

Comment 11 by patrick.allen.higgins:

I am using kernel 3.7.2:
$ uname -a
Linux xeno.staples.com 3.7.2-204.fc18.x86_64 #1 SMP Wed Jan 16 16:22:52 UTC 2013 x86_64
x86_64 x86_64 GNU/Linux

@gopherbot
Copy link
Author

Comment 12 by patrick.allen.higgins:

I am using kernel 3.7.2:
$ uname -a
Linux xeno.redacted 3.7.2-204.fc18.x86_64 #1 SMP Wed Jan 16 16:22:52 UTC 2013 x86_64
x86_64 x86_64 GNU/Linux

@rsc
Copy link
Contributor

rsc commented Feb 5, 2013

Comment 13:

Does it have accept4?
Russ

@gopherbot
Copy link
Author

Comment 14 by patrick.allen.higgins:

Yes:
(gdb) b 'net.accept'
Breakpoint 1 at 0x4aa260: file /home/phiggins/go-tip/src/pkg/net/sock_cloexec.go, line
44.
(gdb) c
Continuing.
[Switching to Thread 0x7f4da0c7d700 (LWP 4788)]
Breakpoint 1, net.accept (fd=3, noname=void)
    at /home/phiggins/go-tip/src/pkg/net/sock_cloexec.go:44
44  func accept(fd int) (int, syscall.Sockaddr, error) {
(gdb) where
#0  net.accept (fd=3, noname=void)
    at /home/phiggins/go-tip/src/pkg/net/sock_cloexec.go:44
#1  0x000000000049acf3 in net.(*netFD).accept (fd=0xc2000d0d00, toAddr=
    {void (syscall.Sockaddr, net.Addr *)} 0x7f4db084bdc0, netfd=0x0, err=0)
    at /home/phiggins/go-tip/src/pkg/net/fd_unix.go:618
#2  0x00000000004acd22 in net.(*TCPListener).AcceptTCP (l=0xc2000c95c0, c=
    0x43, err=0) at /home/phiggins/go-tip/src/pkg/net/tcpsock_posix.go:232
#3  0x00000000004ace39 in net.(*TCPListener).Accept (l=0xc2000c95c0, c=0, err=
    0) at /home/phiggins/go-tip/src/pkg/net/tcpsock_posix.go:242
#4  0x00000000004eebb6 in net/http.(*Server).Serve (srv=void, l=..., noname=
    void) at /home/phiggins/go-tip/src/pkg/net/http/server.go:1260
#5  0x0000000000407a5f in main.serveRest (wg=0xc2001b0d40, l=..., mux=
    0xc200351c60, config=0xc2000d00d0)
    at /home/phiggins/src/RecommendationEngine/go/src/recserv/slave.go:155
#6  0x000000000041acf0 in schedunlock ()
    at /home/phiggins/go-tip/src/pkg/runtime/proc.c:280
#7  0x000000c2001b0d40 in ?? ()
#8  0x000000c200102040 in ?? ()
#9  0x000000c2000c95c0 in ?? ()
#10 0x000000c200351c60 in ?? ()
#11 0x000000c2000d00d0 in ?? ()
#12 0x0000000000000000 in ?? ()
(gdb) s
45      nfd, sa, err := syscall.Accept4(fd, syscall.SOCK_NONBLOCK|syscall.SOCK_CLOEXEC)
(gdb) s
syscall.Accept4 (fd=3, flags=526336, nfd=139971650698528, sa=0, err=4818963)
    at /home/phiggins/go-tip/src/pkg/syscall/syscall_linux.go:430
430 func Accept4(fd int, flags int) (nfd int, sa Sockaddr, err error) {
(gdb) su
Undefined command: "su".  Try "help".
(gdb) help s
Step program until it reaches a different source line.
Argument N means do this N times (or till program stops for another reason).
(gdb) fin
Run till exit from #0  syscall.Accept4 (fd=3, flags=526336, nfd=
    139971650698528, sa=0, err=4818963)
    at /home/phiggins/go-tip/src/pkg/syscall/syscall_linux.go:430
0x00000000004aa28e in net.accept (fd=3, noname=void)
    at /home/phiggins/go-tip/src/pkg/net/sock_cloexec.go:45
45      nfd, sa, err := syscall.Accept4(fd, syscall.SOCK_NONBLOCK|syscall.SOCK_CLOEXEC)
(gdb) s
48      if err == nil || err != syscall.ENOSYS {
(gdb) p err
$1 = 0
(gdb) s
49          return nfd, sa, err
(gdb) s
net.(*netFD).accept (fd=0xc2000d0d00, toAddr=
    {void (syscall.Sockaddr, net.Addr *)} 0x7f4db084bdc0, netfd=0x0, err=0)
    at /home/phiggins/go-tip/src/pkg/net/fd_unix.go:619
619         if err != nil {

@gopherbot
Copy link
Author

Comment 15 by patrick.allen.higgins:

Yes:
(gdb) b 'net.accept'
Breakpoint 1 at 0x4aa260: file /home/phiggins/go-tip/src/pkg/net/sock_cloexec.go, line
44.
(gdb) c
Continuing.
[Switching to Thread 0x7f4da0c7d700 (LWP 4788)]
Breakpoint 1, net.accept (fd=3, noname=void)
    at /home/phiggins/go-tip/src/pkg/net/sock_cloexec.go:44
44  func accept(fd int) (int, syscall.Sockaddr, error) {
(gdb) s
45      nfd, sa, err := syscall.Accept4(fd, syscall.SOCK_NONBLOCK|syscall.SOCK_CLOEXEC)
(gdb) s
syscall.Accept4 (fd=3, flags=526336, nfd=139971650698528, sa=0, err=4818963)
    at /home/phiggins/go-tip/src/pkg/syscall/syscall_linux.go:430
430 func Accept4(fd int, flags int) (nfd int, sa Sockaddr, err error) {
(gdb) fin
Run till exit from #0  syscall.Accept4 (fd=3, flags=526336, nfd=
    139971650698528, sa=0, err=4818963)
    at /home/phiggins/go-tip/src/pkg/syscall/syscall_linux.go:430
0x00000000004aa28e in net.accept (fd=3, noname=void)
    at /home/phiggins/go-tip/src/pkg/net/sock_cloexec.go:45
45      nfd, sa, err := syscall.Accept4(fd, syscall.SOCK_NONBLOCK|syscall.SOCK_CLOEXEC)
(gdb) s
48      if err == nil || err != syscall.ENOSYS {
(gdb) p err
$1 = 0
(gdb) s
49          return nfd, sa, err
(gdb) s
net.(*netFD).accept (fd=0xc2000d0d00, toAddr=
    {void (syscall.Sockaddr, net.Addr *)} 0x7f4db084bdc0, netfd=0x0, err=0)
    at /home/phiggins/go-tip/src/pkg/net/fd_unix.go:619
619         if err != nil {

@gopherbot
Copy link
Author

Comment 16 by patrick.allen.higgins:

What's always jumped out as odd to me is that the ForkLock readerCount is -1073741823,
which is -(1<<30)+1.

@rsc
Copy link
Contributor

rsc commented Feb 5, 2013

Comment 17:

Sorry, the kernel version is a dead end. My diagnosis is correct: you're running code
older than when accept4 was even introduced as an option (at least you were in the
original report). The gdb session in the first comment very clearly shows that Accept
(not Accept4) is blocked; that's definitely the problem.
The readerCount is fine. The -1<<30 means that there is a writer waiting.

@gopherbot
Copy link
Author

Comment 18 by patrick.allen.higgins:

That explains why the ForkLock issue suddenly disappeared for me a few days ago. I
thought I was chasing a ghost! I updated my go-tip so I could work on fixes for the
other issues I opened and got Accept4 as a side effect.
I also found I can fix the accept-socket-not-closing problem by having the master call
File on the listeners once at start-up time rather than when spawning each slave.

@rsc
Copy link
Contributor

rsc commented Feb 5, 2013

Comment 19:

That explains the rest of your race. newFileFD turns non-blocking mode back
on, and closing the fd does get you out of the epoll loop, just not out of
a blocked accept. So for the test.sh hang to happen the old slave accept
has to happen between the master setting the fd blocking and the new slave
setting the fd non-blocking. By moving the File call to the beginning, the
fd gets set non-blocking again by the first slave and thereafter is
non-blocking, so the race is gone.
Russ

@rsc
Copy link
Contributor

rsc commented Feb 8, 2013

Comment 20:

This issue was closed by revision 18441e8.

Status changed to Fixed.

@rsc rsc added this to the Go1.1 milestone Apr 14, 2015
@rsc rsc removed the go1.1 label Apr 14, 2015
@golang golang locked and limited conversation to collaborators Jun 24, 2016
This issue was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants