Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: program re-enables SIGHUP #4491

Closed
gopherbot opened this issue Dec 5, 2012 · 36 comments
Closed

runtime: program re-enables SIGHUP #4491

gopherbot opened this issue Dec 5, 2012 · 36 comments
Milestone

Comments

@gopherbot
Copy link
Contributor

by alexandre.normand@opower.com:

I'm hoping someone can shed some light on this one but I, unfortunately, can't provide
that much details nor an example that could be tried outside of our organisation. 

Here's the story:
I have a small go program that uses encoding/json and net/http to submit jobs to one of
our apps. After submission, the program polls that same app (using http/json) until it
sees the jobs going to completion (or failure). This has been implemented using channels
and go routines (one go routine sees that a job gets submitted and waits for its
completion). It gets fancier with the notion of checkpoints that have been implemented
by simply waiting for some jobs to complete before starting other jobs (again, via new
go routines). At any given time, there are never more than 10 go routines running in
parallel (which sleep for 1 minute between each http call). 

The program runs fine when executed on my MacBook Pro but since it can run for 10+ hours
before completion, I've been trying to run that on one of the many Linux servers using
{{{nohup}}}. It looks like when starting the process with nohup and logging out of the
system, the program never runs to completion. The stdout/stderr is redirected to a file
and it looks like the process vanishes after a few hours at most (sometimes after less
than 1 hour). I can't really tell what the exit code is either since not having an open
session when that happens, I'm not seeing it. 

To add to the mystery, when keeping a session open on that host running the program and
tailing the log file, it seems to _then_ be able to run to completion. I can't rule out
something killing that process but it I'm very confident that it's not the case since
I'm running other scripts on those machines which have been running for days without
interruption.  

What steps will reproduce the problem?
1. Use a go program that's a mix of http/json/log statements/sleeps/go routines.
2. Start it using {{{nohup program > output.log 2>&1 &}}}
3. Wait for it...and it's gone before it completes. 

What is the expected output?
The program should either go to completion or print a panic.

What do you see instead?
Nothing, the process just ends abruptly with no trace. This looks like I'm running
through some os.Exit() call somewhere in the go packages I'm using but it's not clear
what that would be and why. 

Which compiler are you using (5g, 6g, 8g, gccgo)?
6g

Which operating system are you using?
Linux  2.6.32-220.7.1.el6.x86_64

Which version are you using?  (run 'go version')
go1.0.3

Please provide any additional information below.
I can provide my code (one main go source and an additional package) but I would have to
remove urls and obfuscate some of it which would render it pretty much usable except for
visual inspection.
@rsc
Copy link
Contributor

rsc commented Dec 6, 2012

Comment 1:

It is possible that the signal initialization in Go programs
re-enables SIGHUP, so that running it under nohup does not protect it
from your logging out. That would be the first thing I would try. If
that's the case then a simple Go program that just does
for {
    time.Sleep(1*time.Minute)
}
would demonstrate the same mysterious disappearances.

@gopherbot
Copy link
Contributor Author

Comment 2 by alexandre.normand@opower.com:

I'm running that test right now (I just added a log statement too, just so I could tell
when it stopped). I'll post later with the results. 
For reference, here's the actual code:
package main
import (
        "log"
        "time"
)
func main() {
        for {
                time.Sleep(1 * time.Minute)
                log.Print("Waiting for 1 minute again")
        }
}

@gopherbot
Copy link
Contributor Author

Comment 3 by alexandre.normand@opower.com:

The first test mentioned above is still running fine with nohup and I added one more
that just adds a go routine to the mix:
package main
import (
        "log"
        "time"
)
func main() {
        callback := make(chan int)
        go sleepAndPrint(callback)
        <-callback
}
func sleepAndPrint(communication chan int) {
        for i := 0; i < 10000; i++ {
                log.Printf("waiting for 1 minute with count = %d\n", i)
                time.Sleep(1 * time.Minute)
        }
        communication <- 1
}
Both are still running after logging out and back again. I'm going to leave them running
both to see if anything can come out of this.

@gopherbot
Copy link
Contributor Author

Comment 4 by alexandre.normand@opower.com:

The two versions of the test mentioned above are still running fine with nohup. I wonder
if this could be that something in the packages my original app is using is doing a
Exit() call without logging anything first.
For reference, here are my imports:
import (
        "bytes"
        "encoding/json"
        "fmt"
        "io/ioutil"
        "log"
        "os"
        "net/http"
        "strings"
        "time"
)

@rsc
Copy link
Contributor

rsc commented Dec 7, 2012

Comment 5:

I just checked via grep: nothing in the standard library calls os.Exit
without printing a message. The only calls outside tests are in flag
and log.

@rsc
Copy link
Contributor

rsc commented Dec 9, 2012

Comment 6:

Is it possible that the program is using too much memory and being killed by the Linux
kernel? Look in the output of dmesg.

Labels changed: added priority-later, go1.1maybe, removed priority-triage, go1.1.

Status changed to WaitingForReply.

@gopherbot
Copy link
Contributor Author

Comment 7 by alexandre.normand@opower.com:

There's no indication in dmesg that the program got killed by the kernel. I did see that
the box I'm running in on right now has only 375 MB of free physical ram but I was
initially running it on another box that has more than 1.7 GB of free physical RAM and I
had made the switch to try and see if there was something wrong with the first one. 
Also, wouldn't the memory usage be the same regardless of whether I'm logged in/tailing
the output log than if I'm logged out?
The current status is that this is still occurring as per my initial description. If I
kill my remote ssh session to the box running it, it eventually disappears but if I'm
logged in and tailing the log, it stays there up to the normal program termination.

@alberts
Copy link
Contributor

alberts commented Dec 10, 2012

Comment 8:

10 writes to stdout/stderr getting EPIPE and then the program exits?

@gopherbot
Copy link
Contributor Author

Comment 9 by alexandre.normand@opower.com:

If that was a case of 10 writes getting IPIPE causing the exit, wouldn't the simple test
cases from the earlier comments have shown the same problem? 
I'd like to provide some example that reproduces the issue. Since I can't use the
program I'm currently working on, I might put together an example that just does http
get/posts using a similar pattern. It might take a while before I can have something to
reproduce it just because of the nature of the problem (if I'm logged in and waiting for
it to vanish, it's just going to remain alive).

@alberts
Copy link
Contributor

alberts commented Dec 10, 2012

Comment 10:

ssh and it dying when you log out smelled a bit like epipecheck.
You could perhaps see if you can strace the program until it exits?
strace -v -s1000 -tt -ostrace.log -p <pid>

@alberts
Copy link
Contributor

alberts commented Dec 10, 2012

Comment 11:

also -f

@gopherbot
Copy link
Contributor Author

Comment 12 by alexandre.normand@opower.com:

I stripped much of what was specific to my organisation from my program and replaced it
with calls to the flickr REST api (it's the first thing that came to my mind). I'm
currently running this one the same way I was running my original program and I'm doing
the strace to get that info if that turns out to effectively reproduce the issue. This
test is still not using all features that I was using in the original but that's
probably a good first attempt.
For reference (and so I don't lose it), I started the test with:
nohup ./pollrunner > run-`date +%F-%H-%M`.log 2>&1 &
and the strace with the following:
nohup strace -v -f -s1000 -tt -otrace.log -p 6834 &

@gopherbot
Copy link
Contributor Author

Comment 13 by alexandre.normand@opower.com:

I have the strace output for the current run of the original program which did exit
about 15 minutes after I logged out of the box on which it was running (my first attempt
at reproducing it using a simple version that can be tested by anyone hasn't been
successful).
It's the first time I use strace myself so I'm not sure what to expect/look at. That
said, one of the last lines printed is: 
1551  11:15:11.836543 --- SIGHUP (Hangup) @ 0 (0) ---
Does that mean it got a hang-up signal and exited even if it was executed with nohup?

Attachments:

  1. trace_obfuscated.log (9466838 bytes)

@rsc
Copy link
Contributor

rsc commented Dec 10, 2012

Comment 14:

Yes, that's what it means. That's progress at least.

@rsc
Copy link
Contributor

rsc commented Dec 10, 2012

Comment 15:

It looks like the log does not include process startup. Can you grep
for SIGHUP in the full log and see if any sigaction calls turn up?

@gopherbot
Copy link
Contributor Author

Comment 16 by alexandre.normand@opower.com:

The reason that the log doesn't include startup is that I attached it after it had
already been running for several hours. From the moment I attached strace to the process
until that process exited with the SIGHUP, there is no other occurence of SIGHUP.
However, I'm running it again for the complete execution using strace [command]. So far,
it's still running but there is one occurrence of SIGHUP so far:
31934 13:20:09.433071 rt_sigaction(SIGHUP, {0x41c505, ~[],
SA_RESTORER|SA_STACK|SA_RESTART|SA_SIGINFO, 0x41c563}, NULL, 8) = 0

@rsc
Copy link
Contributor

rsc commented Dec 10, 2012

Comment 17:

Thanks. I'd also be interested in any sigprocmask.
Russ

@gopherbot
Copy link
Contributor Author

Comment 18 by alexandre.normand@opower.com:

Hey Russ,
Here are the occurrences of sigprocmask from my still-in-progress test (done using grep
sigprocmask -C 5 <log>). 
Thanks

Attachments:

  1. sigprocmask_extract.log (23877 bytes)

@rsc
Copy link
Contributor

rsc commented Dec 10, 2012

Comment 19:

Thanks, that was enough for me to puzzle through what's going on. The signal setup
during clone says:
        // Disable signals during clone, so that the new thread starts
        // with signals disabled.  It will enable them in minit.
        runtime·rtsigprocmask(SIG_SETMASK, &sigset_all, &oset, sizeof oset);
        ret = runtime·clone(flags, stk, m, g, fn);
        runtime·rtsigprocmask(SIG_SETMASK, &oset, nil, sizeof oset);
but the minit says
        // Initialize signal handling.
        m->gsignal = runtime·malg(32*1024);     // OS X wants >=8K, Linux >=2K
        runtime·signalstack((byte*)m->gsignal->stackguard - StackGuard, 32*1024);
        runtime·rtsigprocmask(SIG_SETMASK, &sigset_none, nil, sizeof sigset_none);
The sigset_none should be whatever signals the parent had; clone should record them in
m-> somewhere and minit should use them.  Any use of SIG_SETMASK with sigset_none is
wrong - that variable should probably be deleted.

Labels changed: added go1.1, size-m, suggested, removed go1.1maybe.

Status changed to Accepted.

@gopherbot
Copy link
Contributor Author

Comment 20 by alexandre.normand@opower.com:

That's great! I wasn't that hopeful going into it (I didn't have as much info as I would
have liked to provide) but I'm glad I submitted the issue. 
Thanks!

@gopherbot
Copy link
Contributor Author

Comment 21 by alexandre.normand@opower.com:

I've been taking a quick look at runtime/thread_linux.c but this is totally outside of
my domain of expertise. I wish I woud have been able to go ahead and make changes and
test them quickly but there would be a significant overhead for me. 
If anyone with the capacity has a few minutes to spend on this, I could apply the patch
and test it. I'll be spending the next day or so still testing this program of mine that
exposed this bug.

@alexbrainman
Copy link
Member

Comment 22:

alexandre,
I am not linux wiz myself, but what rsc said
https://golang.org/issue/4491?c=19 is:
1) "sigset_none should be whatever signals the parent had; clone should record them in
m-> somewhere"
"whatever signals the parent had" gets stored into oset variable, which is local in
runtime·newosproc (it gets thrown away once function exits). He suggests it should be
moved into mp and kept there for the child to use later on. So, you should add field
"Sigset oset" to struct M (in runtime.h). Then you should delete local oset variable in
runtime·newosproc and use mp->oset instead.
2) "The sigset_none should be whatever signals the parent had ..."
So, now that you are in the child (in runtime·minit) replace sigset_none with m->oset,
which will have preserved signal mask.
I am probably talking rubbish here, but rsc is away for couple of days. And why
shouldn't you have some fun yourself :-) What is the worst that could happened?
Alex

@gopherbot
Copy link
Contributor Author

Comment 23 by alexandre.normand@opower.com:

Hey Alex,
Thanks for the hand holding. My C is more rusty than I thought it was but, with your
help, I think I have it done. I'll post an update later later.

@gopherbot
Copy link
Contributor Author

Comment 24 by alexandre.normand@opower.com:

My last comment made it sound like I had a compiled version pending test but it turns
out I had forgotten to set the GOOS and GOARCH when compiling so I wasn't compiling the
linux version (I'm running OSX). Since then, I've been trying to get the changes to
compile but whatever I try to add to the M struct produces some output and then fails
(not exactly a compilation error):
...
pkg/text/template/parse
pkg/text/template
pkg/go/doc
pkg/go/build
cmd/go
pkg/runtime (linux/amd64)
sizeofSigset = 8;
aggr Sigset
{
    'a' 0 mask;
};
...
 I tried to figure this out and it looks like adding something to the struct needs also some changing elsewhere. There a hint in runtime.h but I couldn't actually figure this out on my own:
// The offsets of these fields are known to (hard-coded in) libmach.
Turns out it's a lot bigger bag of fun than I expected it to be. :)

@minux
Copy link
Member

minux commented Jan 11, 2013

Comment 25:

Re #24, sorry for the inconvenience, as compilation error is overwhelmed
in some unrelated (acid program) dumps.
you have to manually filter the compilation error out from the output.
PS: please try to add new fields to the end (or near the end) of struct M.

@gopherbot
Copy link
Contributor Author

Comment 26 by alexandre.normand@opower.com:

Thanks..but why is it that this happens only when I add a new field?

@minux
Copy link
Member

minux commented Jan 11, 2013

Comment 27:

I don't know. If you can upload your code to rietveld (hg change but don't hg mail) as
a CL, I'm happy to take a look.

@gopherbot
Copy link
Contributor Author

Comment 28 by alexandre.normand@opower.com:

Thanks for that. Uploaded here: https://golang.org/cl/7087043. I must confess,
I'm ashamed of how much I forgot my C...but I'm hoping this will serve as a good
refresher.

@alexbrainman
Copy link
Member

Comment 29:

This is quick an dirty, but should implement what rsc suggested.
# hg id
0f40abacc6ee+ tip  
# hg diff
diff -r 0f40abacc6ee src/pkg/runtime/runtime.h
--- a/src/pkg/runtime/runtime.h Fri Jan 11 12:29:14 2013 +0800
+++ b/src/pkg/runtime/runtime.h Fri Jan 11 17:19:10 2013 +1100
@@ -295,6 +295,7 @@
    void*   thread;     // thread handle
 #endif
    SEH*    seh;
+   uint32  linux_sig_mask[2];
    uintptr end[];
 };
 
diff -r 0f40abacc6ee src/pkg/runtime/thread_linux.c
--- a/src/pkg/runtime/thread_linux.c    Fri Jan 11 12:29:14 2013 +0800
+++ b/src/pkg/runtime/thread_linux.c    Fri Jan 11 17:19:10 2013 +1100
@@ -149,6 +149,8 @@
    // Disable signals during clone, so that the new thread starts
    // with signals disabled.  It will enable them in minit.
    runtime·rtsigprocmask(SIG_SETMASK, &sigset_all, &oset, sizeof oset);
+   mp->linux_sig_mask[0] = oset.mask[0];   // client will restore our signals
+   mp->linux_sig_mask[1] = oset.mask[1];
    ret = runtime·clone(flags, stk, mp, gp, fn);
    runtime·rtsigprocmask(SIG_SETMASK, &oset, nil, sizeof oset);
 
@@ -174,10 +176,15 @@
 void
 runtime·minit(void)
 {
+   Sigset oset;
+
    // Initialize signal handling.
    m->gsignal = runtime·malg(32*1024);    // OS X wants >=8K, Linux >=2K
    runtime·signalstack((byte*)m->gsignal->stackguard - StackGuard, 32*1024);
-   runtime·rtsigprocmask(SIG_SETMASK, &sigset_none, nil, sizeof sigset_none);
+   // restore parents signals
+   oset.mask[0] = m->linux_sig_mask[0];
+   oset.mask[1] = m->linux_sig_mask[1];
+   runtime·rtsigprocmask(SIG_SETMASK, &oset, nil, sizeof oset);
 }
 
 void

@gopherbot
Copy link
Contributor Author

Comment 30 by alexandre.normand@opower.com:

Thanks guys. As I was about to run a new test this morning, I just went with Alex's
patch. Using this, I patched and built go locally and then built my executable using
that patched go dist. I'm getting the same results now, unfortunately. It's my first
time building with a go dist from source but I think I've done that correctly (I've set
GOROOT to be the directory with the source and I've added $GOROOT/bin to the path). Go
version does seem to confirm I'm good: 
go version devel +ec1e87068221 Fri Jan 11 08:30:25 2013 -0800 darwin/amd64
I'm attaching the last part of the strace log from that last execution in case someone
wants to take a look.

Attachments:

  1. strace_extract.log (1328643 bytes)

@alexbrainman
Copy link
Member

Comment 31:

Bad luck. Like I said, I do not know much about Linux. You would have to wait for rsc to
come back. Perhaps someone else will help. Sorry.
Alex

@gopherbot
Copy link
Contributor Author

Comment 32 by alexandre.normand@opower.com:

That's okay. Thanks for your help! I might actually be able to do some more testing to
validate a potential fix when he gets back but that might not be something I can do as
quickly that I was able to do now.

@rsc
Copy link
Contributor

rsc commented Feb 15, 2013

Comment 33:

Looks like the mask wasn't the important part. 
https://golang.org/cl/7308102 should fix it for real.

Status changed to Started.

@rsc
Copy link
Contributor

rsc commented Feb 15, 2013

Comment 34:

This issue was closed by revision f3407f4.

Status changed to Fixed.

@rsc
Copy link
Contributor

rsc commented Mar 6, 2013

Comment 35:

This fix doesn't completely work either. If a program asks for SIGHUP explicitly it
should be enabled. I will fix this.

Labels changed: removed suggested.

Owner changed to @rsc.

Status changed to Started.

@rsc
Copy link
Contributor

rsc commented Mar 15, 2013

Comment 36:

This issue was closed by revision cb4428e.

Status changed to Fixed.

@rsc rsc added this to the Go1.1 milestone Apr 14, 2015
@rsc rsc removed the go1.1 label Apr 14, 2015
@golang golang locked and limited conversation to collaborators Jun 24, 2016
@rsc rsc removed their assignment Jun 22, 2022
This issue was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants