Navigation Menu

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os/exec: cmd.CombinedOutput seems stuck on an io.Copy under s390x #33704

Closed
q2683252 opened this issue Aug 18, 2019 · 7 comments
Closed

os/exec: cmd.CombinedOutput seems stuck on an io.Copy under s390x #33704

q2683252 opened this issue Aug 18, 2019 · 7 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.

Comments

@q2683252
Copy link

q2683252 commented Aug 18, 2019

What version of Go are you using (go version)?

$ go version
go version go1.12.7 linux/s390x

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output

$ go env

P008-PG-S102A:~ # go env
GOARCH="s390x"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="s390x"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/root/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_s390x"
GCCGO="gccgo"
CC="s390x-linux-gnu-gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -march=z196 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build024277193=/tmp/go-build -gno-record-gcc-switches"

What did you do?

What did you expect to see?

I use go to protect a service which restarts the process when it exites.
here is the code.

func GuardAthenaService(name string) {

    for true {

        cur_dir, err := oss_utils.GetCurrentDirectory()
        if err != nil {
            oss_logger.Errorf("Fail to get current runing dir, err_desc=%s", err.Error())
            continue
        }

        bin := filepath.Join(cur_dir, name)
        conf := filepath.Join(cur_dir, fmt.Sprintf("../config/%s.yml", name))
        cmd := exec.Command(bin, "-c", conf)
        oss_logger.Infof("Athena start %v", cmd)
        cmdOutput, err := cmd.CombinedOutput()
        if err != nil {
            oss_logger.Errorf("%s service out put %s", name, cmdOutput)
            oss_logger.Errorf("%s service failed:%s,service restart", name, err.Error())
        }
        time.Sleep(5 * time.Second)
    }
}

func main() {

    ...........
    go GuardAthenaService("athena-keeper")
    go GuardAthenaService("athena-sentinel")

    .............
}

What did you see instead?

I used ps to find that process had already exited, but it did not restart.I looks like it got struck.
It did not happened under x86 so I think it might have something to do under s390x platform.
I used go tool routine, traces and found this.

runtime.gopark
runtime.goparkunlock
runtime.chanrecv
runtime.chanrecv1
os/exec.(*Cmd).Wait
os/exec.(*Cmd).Run
os/exec.(*Cmd).CombinedOutput
main.GuardAthenaService

It looks like that it's waiting for chan in io.Copy to finish .
unfortunately when I wanted to use delve to go deeper but found that it does not support s390x and gdb with go support can't watch variables the goroutine which is not running and it has it's own problem under s390x.

@odeke-em
Copy link
Member

Hello @q2683252 thank you for filing this issue and welcome to the Go project!

So perhaps since there isn't Delve support on s390x, it would be great to know the Process ID (PID) of each process and then while it is seemingly stuck waiting to copy, getting a stacktrace of what is happening e.g.
Perhaps you can modify your cmd.CombinedOutput to:
a) invoke cmd.Start() with your own buffer
b) Print out the Process ID (PID) of the process
c) Invoke cmd.Wait()
d) While any of the processes is stuck, send it a SIGQUIT e.g. kill -s QUIT <PID>
and examine that trace to see what your program is doing -- if your program while produce that core dump

The output of each of those can then show you what's going on so

--- orig.go	2019-08-18 23:30:51.000000000 -0600
+++ main.go	2019-08-18 23:30:14.000000000 -0600
@@ -13,8 +13,21 @@
 		bin := filepath.Join(cur_dir, name)
 		conf := filepath.Join(cur_dir, fmt.Sprintf("../config/%s.yml", name))
 		cmd := exec.Command(bin, "-c", conf)
+		buf := new(bytes.Buffer)
+		cmd.Stdout = buf
+		cmd.Stderr = buf
 		oss_logger.Infof("Athena start %v", cmd)
-		cmdOutput, err := cmd.CombinedOutput()
+		if err := cmd.Start(); err != nil {
+			oss_logger.Errorf("%s service start error: %v", name, err)
+			return
+		}
+
+		// This is the PID of the subcommand that we shall
+		// send that "SIQUIT" if it stalls for a long time.
+		oss_logger.Info("%s service PID: %d\n", cmd.Process.Pid)
+
+		err = cmd.Wait()
+		cmdOutput := buf.String()
 		if err != nil {
 			oss_logger.Errorf("%s service out put %s", name, cmdOutput)
 			oss_logger.Errorf("%s service failed:%s,service restart", name, err.Error())

so once you have the PID printed out

kill -s QUIT XXXX

In the meantime I shall ping an s390x expert @mundaym about this bug.

@odeke-em odeke-em changed the title execute.CombinedOutput got struck under s390x os/exec: cmd.CombinedOutput seems stuck on an io.Copy under s390x Aug 19, 2019
@odeke-em odeke-em added WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Aug 19, 2019
@q2683252
Copy link
Author

Hello @q2683252 thank you for filing this issue and welcome to the Go project!

So perhaps since there isn't Delve support on s390x, it would be great to know the Process ID (PID) of each process and then while it is seemingly stuck waiting to copy, getting a stacktrace of what is happening e.g.
Perhaps you can modify your cmd.CombinedOutput to:
a) invoke cmd.Start() with your own buffer
b) Print out the Process ID (PID) of the process
c) Invoke cmd.Wait()
d) While any of the processes is stuck, send it a SIGQUIT e.g. kill -s QUIT <PID>
and examine that trace to see what your program is doing -- if your program while produce that core dump

The output of each of those can then show you what's going on so

--- orig.go	2019-08-18 23:30:51.000000000 -0600
+++ main.go	2019-08-18 23:30:14.000000000 -0600
@@ -13,8 +13,21 @@
 		bin := filepath.Join(cur_dir, name)
 		conf := filepath.Join(cur_dir, fmt.Sprintf("../config/%s.yml", name))
 		cmd := exec.Command(bin, "-c", conf)
+		buf := new(bytes.Buffer)
+		cmd.Stdout = buf
+		cmd.Stderr = buf
 		oss_logger.Infof("Athena start %v", cmd)
-		cmdOutput, err := cmd.CombinedOutput()
+		if err := cmd.Start(); err != nil {
+			oss_logger.Errorf("%s service start error: %v", name, err)
+			return
+		}
+
+		// This is the PID of the subcommand that we shall
+		// send that "SIQUIT" if it stalls for a long time.
+		oss_logger.Info("%s service PID: %d\n", cmd.Process.Pid)
+
+		err = cmd.Wait()
+		cmdOutput := buf.String()
 		if err != nil {
 			oss_logger.Errorf("%s service out put %s", name, cmdOutput)
 			oss_logger.Errorf("%s service failed:%s,service restart", name, err.Error())

so once you have the PID printed out

kill -s QUIT XXXX

In the meantime I shall ping an s390x expert @mundaym about this bug.
Thank you for your kindly reply.
It's not bound to happen everytime it got exited.
But what should I do if I have the pid?

@odeke-em
Copy link
Member

Thank you for your kindly reply.
It's not bound to happen everytime it got exited.
But what should I do if I have the pid?

Once you have the PID, if the process is stuck, then send it a signal to cause it to produce a core dump and that will show you what's stopping progress on the forked process. I showed that in my initial comment part d) #33704 (comment) but also here again

kill -s QUIT XXXX

where XXXX is the printed out PID of the process that's stuck.

@q2683252
Copy link
Author

q2683252 commented Aug 20, 2019

Thank you for your kindly reply.
It's not bound to happen everytime it got exited.
But what should I do if I have the pid?

Once you have the PID, if the process is stuck, then send it a signal to cause it to produce a core dump and that will show you what's stopping progress on the forked process. I showed that in my initial comment part d) #33704 (comment) but also here again

kill -s QUIT XXXX

where XXXX is the printed out PID of the process that's stuck.

Maybe I didn't express the problem clearly.
I have two programs one called A,the other called B,I used exec.CombinedOutput to execute B in program A but when B got exited(which we can't see it in the os using ps command),A didn't act as expeced. as a result,A got struck in here.

runtime.gopark
runtime.goparkunlock
runtime.chanrecv
runtime.chanrecv1
os/exec.(*Cmd).Wait
os/exec.(*Cmd).Run
os/exec.(*Cmd).CombinedOutput
main.GuardAthenaService

It's A got struck not B,so I don't see the point in kill B which does not exist anymore?

@mundaym
Copy link
Member

mundaym commented Aug 23, 2019

Hi @q2683252 I'm currently on vacation so I can't currently experiment myself unfortunately. Do you how your process (e.g. athena-keeper) is crashing? Does the Go program work if you try and launch something simple like /bin/ls instead?

Can you try Go 1.13rc1? In Go 1.13 we use vfork on s390x (which is what amd64 does) so that might make a difference.

@q2683252
Copy link
Author

q2683252 commented Aug 25, 2019

Athena-keeper exited by itself because of someother issues ,it's a serivice which means it will run for a very long time, it wasn't killed by the os or users.I haven't tried using
a simpler program such as ls but when I manually maked athena-keeper exit immediately once it started ,it did not reappear with many times of tr .I think it might relate to the callee process uptime.

I will try go1.13 ,and maybe it takes a long time to reappear.

@ianlancetaylor
Copy link
Contributor

This is #23019. Closing as dup.

@golang golang locked and limited conversation to collaborators Nov 24, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Projects
None yet
Development

No branches or pull requests

5 participants