-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime: program re-enables SIGHUP #4491
Comments
It is possible that the signal initialization in Go programs re-enables SIGHUP, so that running it under nohup does not protect it from your logging out. That would be the first thing I would try. If that's the case then a simple Go program that just does for { time.Sleep(1*time.Minute) } would demonstrate the same mysterious disappearances. |
Comment 2 by alexandre.normand@opower.com: I'm running that test right now (I just added a log statement too, just so I could tell when it stopped). I'll post later with the results. For reference, here's the actual code: package main import ( "log" "time" ) func main() { for { time.Sleep(1 * time.Minute) log.Print("Waiting for 1 minute again") } } |
Comment 3 by alexandre.normand@opower.com: The first test mentioned above is still running fine with nohup and I added one more that just adds a go routine to the mix: package main import ( "log" "time" ) func main() { callback := make(chan int) go sleepAndPrint(callback) <-callback } func sleepAndPrint(communication chan int) { for i := 0; i < 10000; i++ { log.Printf("waiting for 1 minute with count = %d\n", i) time.Sleep(1 * time.Minute) } communication <- 1 } Both are still running after logging out and back again. I'm going to leave them running both to see if anything can come out of this. |
Comment 4 by alexandre.normand@opower.com: The two versions of the test mentioned above are still running fine with nohup. I wonder if this could be that something in the packages my original app is using is doing a Exit() call without logging anything first. For reference, here are my imports: import ( "bytes" "encoding/json" "fmt" "io/ioutil" "log" "os" "net/http" "strings" "time" ) |
Comment 7 by alexandre.normand@opower.com: There's no indication in dmesg that the program got killed by the kernel. I did see that the box I'm running in on right now has only 375 MB of free physical ram but I was initially running it on another box that has more than 1.7 GB of free physical RAM and I had made the switch to try and see if there was something wrong with the first one. Also, wouldn't the memory usage be the same regardless of whether I'm logged in/tailing the output log than if I'm logged out? The current status is that this is still occurring as per my initial description. If I kill my remote ssh session to the box running it, it eventually disappears but if I'm logged in and tailing the log, it stays there up to the normal program termination. |
Comment 9 by alexandre.normand@opower.com: If that was a case of 10 writes getting IPIPE causing the exit, wouldn't the simple test cases from the earlier comments have shown the same problem? I'd like to provide some example that reproduces the issue. Since I can't use the program I'm currently working on, I might put together an example that just does http get/posts using a similar pattern. It might take a while before I can have something to reproduce it just because of the nature of the problem (if I'm logged in and waiting for it to vanish, it's just going to remain alive). |
Comment 12 by alexandre.normand@opower.com: I stripped much of what was specific to my organisation from my program and replaced it with calls to the flickr REST api (it's the first thing that came to my mind). I'm currently running this one the same way I was running my original program and I'm doing the strace to get that info if that turns out to effectively reproduce the issue. This test is still not using all features that I was using in the original but that's probably a good first attempt. For reference (and so I don't lose it), I started the test with: nohup ./pollrunner > run-`date +%F-%H-%M`.log 2>&1 & and the strace with the following: nohup strace -v -f -s1000 -tt -otrace.log -p 6834 & |
Comment 13 by alexandre.normand@opower.com: I have the strace output for the current run of the original program which did exit about 15 minutes after I logged out of the box on which it was running (my first attempt at reproducing it using a simple version that can be tested by anyone hasn't been successful). It's the first time I use strace myself so I'm not sure what to expect/look at. That said, one of the last lines printed is: 1551 11:15:11.836543 --- SIGHUP (Hangup) @ 0 (0) --- Does that mean it got a hang-up signal and exited even if it was executed with nohup? Attachments:
|
Comment 16 by alexandre.normand@opower.com: The reason that the log doesn't include startup is that I attached it after it had already been running for several hours. From the moment I attached strace to the process until that process exited with the SIGHUP, there is no other occurence of SIGHUP. However, I'm running it again for the complete execution using strace [command]. So far, it's still running but there is one occurrence of SIGHUP so far: 31934 13:20:09.433071 rt_sigaction(SIGHUP, {0x41c505, ~[], SA_RESTORER|SA_STACK|SA_RESTART|SA_SIGINFO, 0x41c563}, NULL, 8) = 0 |
Comment 18 by alexandre.normand@opower.com: Hey Russ, Here are the occurrences of sigprocmask from my still-in-progress test (done using grep sigprocmask -C 5 <log>). Thanks Attachments:
|
Thanks, that was enough for me to puzzle through what's going on. The signal setup during clone says: // Disable signals during clone, so that the new thread starts // with signals disabled. It will enable them in minit. runtime·rtsigprocmask(SIG_SETMASK, &sigset_all, &oset, sizeof oset); ret = runtime·clone(flags, stk, m, g, fn); runtime·rtsigprocmask(SIG_SETMASK, &oset, nil, sizeof oset); but the minit says // Initialize signal handling. m->gsignal = runtime·malg(32*1024); // OS X wants >=8K, Linux >=2K runtime·signalstack((byte*)m->gsignal->stackguard - StackGuard, 32*1024); runtime·rtsigprocmask(SIG_SETMASK, &sigset_none, nil, sizeof sigset_none); The sigset_none should be whatever signals the parent had; clone should record them in m-> somewhere and minit should use them. Any use of SIG_SETMASK with sigset_none is wrong - that variable should probably be deleted. Labels changed: added go1.1, size-m, suggested, removed go1.1maybe. Status changed to Accepted. |
Comment 20 by alexandre.normand@opower.com: That's great! I wasn't that hopeful going into it (I didn't have as much info as I would have liked to provide) but I'm glad I submitted the issue. Thanks! |
Comment 21 by alexandre.normand@opower.com: I've been taking a quick look at runtime/thread_linux.c but this is totally outside of my domain of expertise. I wish I woud have been able to go ahead and make changes and test them quickly but there would be a significant overhead for me. If anyone with the capacity has a few minutes to spend on this, I could apply the patch and test it. I'll be spending the next day or so still testing this program of mine that exposed this bug. |
alexandre, I am not linux wiz myself, but what rsc said https://golang.org/issue/4491?c=19 is: 1) "sigset_none should be whatever signals the parent had; clone should record them in m-> somewhere" "whatever signals the parent had" gets stored into oset variable, which is local in runtime·newosproc (it gets thrown away once function exits). He suggests it should be moved into mp and kept there for the child to use later on. So, you should add field "Sigset oset" to struct M (in runtime.h). Then you should delete local oset variable in runtime·newosproc and use mp->oset instead. 2) "The sigset_none should be whatever signals the parent had ..." So, now that you are in the child (in runtime·minit) replace sigset_none with m->oset, which will have preserved signal mask. I am probably talking rubbish here, but rsc is away for couple of days. And why shouldn't you have some fun yourself :-) What is the worst that could happened? Alex |
Comment 23 by alexandre.normand@opower.com: Hey Alex, Thanks for the hand holding. My C is more rusty than I thought it was but, with your help, I think I have it done. I'll post an update later later. |
Comment 24 by alexandre.normand@opower.com: My last comment made it sound like I had a compiled version pending test but it turns out I had forgotten to set the GOOS and GOARCH when compiling so I wasn't compiling the linux version (I'm running OSX). Since then, I've been trying to get the changes to compile but whatever I try to add to the M struct produces some output and then fails (not exactly a compilation error): ... pkg/text/template/parse pkg/text/template pkg/go/doc pkg/go/build cmd/go pkg/runtime (linux/amd64) sizeofSigset = 8; aggr Sigset { 'a' 0 mask; }; ... I tried to figure this out and it looks like adding something to the struct needs also some changing elsewhere. There a hint in runtime.h but I couldn't actually figure this out on my own: // The offsets of these fields are known to (hard-coded in) libmach. Turns out it's a lot bigger bag of fun than I expected it to be. :) |
Comment 26 by alexandre.normand@opower.com: Thanks..but why is it that this happens only when I add a new field? |
Comment 28 by alexandre.normand@opower.com: Thanks for that. Uploaded here: https://golang.org/cl/7087043. I must confess, I'm ashamed of how much I forgot my C...but I'm hoping this will serve as a good refresher. |
This is quick an dirty, but should implement what rsc suggested. # hg id 0f40abacc6ee+ tip # hg diff diff -r 0f40abacc6ee src/pkg/runtime/runtime.h --- a/src/pkg/runtime/runtime.h Fri Jan 11 12:29:14 2013 +0800 +++ b/src/pkg/runtime/runtime.h Fri Jan 11 17:19:10 2013 +1100 @@ -295,6 +295,7 @@ void* thread; // thread handle #endif SEH* seh; + uint32 linux_sig_mask[2]; uintptr end[]; }; diff -r 0f40abacc6ee src/pkg/runtime/thread_linux.c --- a/src/pkg/runtime/thread_linux.c Fri Jan 11 12:29:14 2013 +0800 +++ b/src/pkg/runtime/thread_linux.c Fri Jan 11 17:19:10 2013 +1100 @@ -149,6 +149,8 @@ // Disable signals during clone, so that the new thread starts // with signals disabled. It will enable them in minit. runtime·rtsigprocmask(SIG_SETMASK, &sigset_all, &oset, sizeof oset); + mp->linux_sig_mask[0] = oset.mask[0]; // client will restore our signals + mp->linux_sig_mask[1] = oset.mask[1]; ret = runtime·clone(flags, stk, mp, gp, fn); runtime·rtsigprocmask(SIG_SETMASK, &oset, nil, sizeof oset); @@ -174,10 +176,15 @@ void runtime·minit(void) { + Sigset oset; + // Initialize signal handling. m->gsignal = runtime·malg(32*1024); // OS X wants >=8K, Linux >=2K runtime·signalstack((byte*)m->gsignal->stackguard - StackGuard, 32*1024); - runtime·rtsigprocmask(SIG_SETMASK, &sigset_none, nil, sizeof sigset_none); + // restore parents signals + oset.mask[0] = m->linux_sig_mask[0]; + oset.mask[1] = m->linux_sig_mask[1]; + runtime·rtsigprocmask(SIG_SETMASK, &oset, nil, sizeof oset); } void |
Comment 30 by alexandre.normand@opower.com: Thanks guys. As I was about to run a new test this morning, I just went with Alex's patch. Using this, I patched and built go locally and then built my executable using that patched go dist. I'm getting the same results now, unfortunately. It's my first time building with a go dist from source but I think I've done that correctly (I've set GOROOT to be the directory with the source and I've added $GOROOT/bin to the path). Go version does seem to confirm I'm good: go version devel +ec1e87068221 Fri Jan 11 08:30:25 2013 -0800 darwin/amd64 I'm attaching the last part of the strace log from that last execution in case someone wants to take a look. Attachments:
|
Comment 32 by alexandre.normand@opower.com: That's okay. Thanks for your help! I might actually be able to do some more testing to validate a potential fix when he gets back but that might not be something I can do as quickly that I was able to do now. |
Looks like the mask wasn't the important part. https://golang.org/cl/7308102 should fix it for real. Status changed to Started. |
This issue was closed by revision f3407f4. Status changed to Fixed. |
This fix doesn't completely work either. If a program asks for SIGHUP explicitly it should be enabled. I will fix this. Labels changed: removed suggested. Owner changed to @rsc. Status changed to Started. |
This issue was closed by revision cb4428e. Status changed to Fixed. |
by alexandre.normand@opower.com:
The text was updated successfully, but these errors were encountered: