-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
syscall: ForkLock held during Accept #4737
Labels
Milestone
Comments
Comment 1 by patrick.allen.higgins: I was able to write a test program that doesn't depend on anything but standard packages which triggers the bug. Unfortunately it's largish. Running "test.sh" trips the bug for me. Removing the first config.LoadTLS() from slave fixes the issue for me. Attachments:
|
Comment 2 by patrick.allen.higgins: I was able to write a test program that doesn't depend on anything but standard packages which triggers the bug. Unfortunately it's largish. Running "test.sh" trips the bug for me. Removing the first config.LoadTLS() from slave fixes the issue for me. Attachments:
|
Comment 4 by patrick.allen.higgins: I have simplified the test case further and added comments to slave.go about the various changes that can be made to it which prevent the bug from being exposed. If the sleep time after each "kill -HUP" in test.sh is increased to just 0.5, then I don't trip the bug. If the syscall.ForkLock manipulations are removed from slave.go entirely, then it doesn't exit when the bug conditions are tripped because the socket is never actually closed, even though Close() returns without error. Attachments:
|
I have gotten your test.sh script to tell me it failed, but I don't believe this is actually a bug. The master listens on a TCP port and passes that fd to many children. That is, they all have references to the exact same TCP listener. The slave exit closes its copy of the fd and then waits for the blocked accept to finish. However, there is no reason to expect it to finish. The listener is still active, and will be until all the copies of the fd are closed. The incoming signal may force the accept to restart, but in all likelihood it will have successfully restarted before the main slave thread gets around to closing the fd. It is easy to believe that the different behaviors you are seeing depend primarily on the order in which threads are scheduled to run. The accept will exit if it gets restarted and sees the closed fd, but it need not be restarted; whether it does depends on this kind of scheduling decision. Removing config.LoadTLS presumably changes the timing or the cpu nice priorities or something like that. |
I bet you are using a Linux kernel before version 2.6.28, without Accept4. Then this code: func accept(fd int) (int, syscall.Sockaddr, error) { nfd, sa, err := syscall.Accept4(fd, syscall.SOCK_NONBLOCK|syscall.SOCK_CLOEXEC) // The accept4 system call was introduced in Linux 2.6.28. If // we get an ENOSYS error, fall back to using accept. if err == nil || err != syscall.ENOSYS { return nfd, sa, err } // See ../syscall/exec_unix.go for description of ForkLock. // It is okay to hold the lock across syscall.Accept // because we have put fd.sysfd into non-blocking mode. syscall.ForkLock.RLock() nfd, sa, err = syscall.Accept(fd) if err == nil { syscall.CloseOnExec(nfd) } syscall.ForkLock.RUnlock() if err != nil { return -1, nil, err } if err = syscall.SetNonblock(nfd, true); err != nil { syscall.Close(nfd) return -1, nil, err } return nfd, sa, nil } will fall back to the ordinary Accept system call. But since you have invoked the File method on the listener, it is now blocking, not non-blocking. So the comment's rationale no longer holds. This accept blocks while holding the RLock, making the ForkLock impossible to lock properly. I guess we have to accept the race and remove the RLock. Newer kernels won't have it because they'll use Accept4. |
Presumably the same argument will apply to net/sys_cloexec.go, though, where it will happen for all calls. It seems that we are in trouble if anybody calls File on a socket and then calls AcceptTCP or AcceptUnix on it. Or maybe the code in netFD.dup is wrong. The comment implies that it is possible to set blocking mode only for the new fd, but it is not. Setting blocking mode for the new fd also sets it for the old one. This makes me wonder why netFD.dup calls dup at all. |
I changed the comment in dup. We have to return a new fd because that's the documented behavior. I wanted something where the closes would be independent. It is too bad that the non-blocking mode is not per-fd but we knew that at some point in the past, because the File method docs have weasel words at the end: File sets the underlying os.File to blocking mode and returns a copy. It is the caller's responsibility to close f when finished. Closing c does not affect f, and closing f does not affect c. The returned os.File's file descriptor is different from the connection's. Attempting to change properties of the original using this duplicate may or may not have the desired effect. Russ |
Sorry, the kernel version is a dead end. My diagnosis is correct: you're running code older than when accept4 was even introduced as an option (at least you were in the original report). The gdb session in the first comment very clearly shows that Accept (not Accept4) is blocked; that's definitely the problem. The readerCount is fine. The -1<<30 means that there is a writer waiting. |
That explains the rest of your race. newFileFD turns non-blocking mode back on, and closing the fd does get you out of the epoll loop, just not out of a blocked accept. So for the test.sh hang to happen the old slave accept has to happen between the master setting the fd blocking and the new slave setting the fd non-blocking. By moving the File call to the beginning, the fd gets set non-blocking again by the first slave and thereafter is non-blocking, so the race is gone. Russ |
This issue was closed by revision 18441e8. Status changed to Fixed. |
This issue was closed.
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
by patrick.allen.higgins:
Attachments:
The text was updated successfully, but these errors were encountered: