Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: running cmd.exec within goroutine sometimes leaves process with 100% CPU #53863

Closed
ostrowr opened this issue Jul 14, 2022 · 18 comments
Closed
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Darwin
Milestone

Comments

@ostrowr
Copy link

ostrowr commented Jul 14, 2022

What version of Go are you using (go version)?

Reproducible on at least 1.18.1 and 1.18.4 on Apple Silicon (M1)

go version go1.18.1 darwin/arm64
go version go1.18.4 darwin/arm64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="arm64"
GOBIN=""
GOCACHE="/Users/robbie/Library/Caches/go-build"
GOENV="/Users/robbie/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/robbie/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/robbie/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/opt/homebrew/Cellar/go/1.18.4/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/opt/homebrew/Cellar/go/1.18.4/libexec/pkg/tool/darwin_arm64"
GOVCS=""
GOVERSION="go1.18.4"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/my/x46mws81707gx7sxgsjcwc2w0000gn/T/go-build881887791=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

See the repro repo here for a few scripts that might help you reproduce the issue.

  1. Build the below program – go build -o repro main.go
  2. Run repro a bunch of times at once – e.g. for i in $(seq 500); do "./repro" &; done
  3. run ps aux | grep repro
  4. Notice that some small percentage of the time, there are leftover processes either called repro or (repro) that continue to run and hog 100% CPU.

Screen Shot 2022-07-13 at 10 35 38 PM

This is the smallest reproduction I can find but haven't been able to diagnose what this process is actually doing. kill -ABRT is ignored; the only way I can get rid of these hanging processes is with a kill -9.

No idea if this has to do with running a goroutine in an init function and whether this is replicable without running exec.Command or whether this is an exec.Command bug independent of init semantics.

package main

import (
	"math/rand"
	"os/exec"
	"time"
)

func main() {
	go func() {
		exec.Command("echo").Run()
	}()
	time.Sleep(time.Duration(rand.Intn(20)) * time.Millisecond)
}

Old reproduction
package main

import (
	"math/rand"
	"os/exec"
	"time"
)

func init() {
	go func() {
		exec.Command("echo").Run() // doesn't have to be `echo`; reproduces with other commands
	}()
}

func main() {
	time.Sleep(time.Duration(rand.Intn(20)) * time.Millisecond)
}

What did you expect to see?

The process to exit and not take up much CPU

What did you see instead?

A small percentage of the time, the process remains, either with its own name or its name in parentheses in the process table, hanging and taking up 100% CPU.

@bcmills bcmills changed the title Running goroutine within init sometimes leaves process with 100% CPU runtime: running goroutine within init sometimes leaves process with 100% CPU Jul 14, 2022
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 14, 2022
@mknyszek
Copy link
Contributor

No idea if this has to do with running a goroutine in an init function and whether this is replicable without running exec.Command or whether this is an exec.Command bug independent of init semantics.

Just to be totally clear, does your reproducer still work if you get rid of any of these things? (Not init, not in a new goroutine at init, not exec.Command?)

Can you get a core dump of the hanging processes? If they're actually hung, it should be possible to attach gdb (or not on macOS? not sure) or trigger a core dump somehow (I'm not sure how to do this on macOS). I think it's interesting that kill -ABRT doesn't work; that suggests its locked up in a way that even the signal handler isn't triggering.

This reminds me of some subtleties around signal masking around exec syscalls on macOS. I think there's actually a bug on the macOS side that we try to mitigate? #41702 That might be related.

@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jul 14, 2022
@mknyszek mknyszek added this to the Backlog milestone Jul 14, 2022
@ostrowr
Copy link
Author

ostrowr commented Jul 14, 2022

Just to be totally clear, does your reproducer still work if you get rid of any of these things? (Not init, not in a new goroutine at init, not exec.Command?)

This is the smallest repro I could find (i.e. I haven't managed to reproduce by deleting any of these things, but it's super nondeterministic so I am not 100% confident that this is actually the minimal repro.)

Haven't been able to get a core dump on lldb/gdb yet; @allanbreyes has been investigating as well.

@allanbreyes
Copy link

Attaching to one of the hanging processes with lldb can't progress any further than this:

(lldb) process attach --pid 37728
Process 37728 stopped
* thread #1, stop reason = EXC_BAD_ACCESS (code=1, address=0x18cb12178)
    frame #0: 0x000000018cb12178
error: memory read failed for 0x18cb12000
Target 0: (No executable module.) stopped.
Architecture set to: arm64e-apple-macosx-.
(lldb) image lookup -va 0x000000018cb12178
error: the target has no associated executable images
(lldb) image list
error: the target has no associated executable images

I'm not sure what module the pc register is pointing to because I can't manage to get the memory map. We've also tried triggering a coredump by sending a SIGSEGV, but it doesn't cause the process to terminate... perhaps this is another data point on the signal handler not triggering?

We'll keep researching, but any advice on how to debug further would be welcomed! Thank you.

@prattmic
Copy link
Member

All of this sounds a lot like an issue in the kernel where zombie processes are getting stuck attempting to clean up / exit.

However, it is not at all clear what is special about this Go program that would trigger this. My best guess would be some kind of race between the parent process exiting and the fork+exec of the child.

Does this reproduce if you move the goroutine creation into main?

@ostrowr
Copy link
Author

ostrowr commented Jul 14, 2022

I couldn't get it to repro last night but just tried another 10 times and it does!

package main

import (
	"math/rand"
	"os/exec"
	"time"
)

func main() {
	go func() {
		exec.Command("echo").Run()
	}()
	time.Sleep(time.Duration(rand.Intn(20)) * time.Millisecond)
}

is the new minimal reproduction, so this doesn't have to do with init after all. I'll edit the description above.

@ostrowr ostrowr changed the title runtime: running goroutine within init sometimes leaves process with 100% CPU runtime: running cmd.exec within goroutine sometimes leaves process with 100% CPU Jul 14, 2022
@ostrowr
Copy link
Author

ostrowr commented Jul 14, 2022

Interestingly, I haven't been able to reproduce this when running the program many times in a row serially. This doesn't mean it doesn't reproduce that way, but I haven't seen an example yet. This might indicate that there is some sort of race condition in the kernel that doesn't clean up these zombie processes properly when it tries to do a lot at once.

I.e. running this usually can sometimes reproduce:

for i in $(seq 500); do "./repro" &; done

But

while true; do "./repro"; done

never seems to (so far).

@prattmic
Copy link
Member

Ah, great. There is little difference between the last init function and the start of main, so it is good to know that it isn't related.

A few more things to try:

  1. Eliminate the race between exec and exit to see if the problem goes away. I'm thinking:
package main

import (
	"math/rand"
	"os/exec"
	"time"
)

func main() {
	c := make(chan struct{}, 1)
	go func() {
		exec.Command("echo").Run()
		c <- struct{}{}
	}()
	time.Sleep(time.Duration(rand.Intn(20)) * time.Millisecond)
	<-c
}
  1. Try recreating this as a C program with a background thread doing fork+exec and the main thread doing random sleep + exit.

@ostrowr
Copy link
Author

ostrowr commented Jul 14, 2022

We'll try both of those things when we get a chance. In the meantime, I've narrowed this further:

package main

import (
	"math/rand"
	"syscall"
	"time"
)

func main() {
	go func() {
		syscall.ForkExec("echo", []string{}, &syscall.ProcAttr{})
	}()
	time.Sleep(time.Duration(rand.Intn(20)) * time.Millisecond)
}

replicates. I'm feeling like it's unlikely that this is a go bug, but as you worked around a macos bug in #41702 it might be worth trying to work around this one as well once we get to the root of it.

@ostrowr
Copy link
Author

ostrowr commented Jul 14, 2022

Ah, great. There is little difference between the last init function and the start of main, so it is good to know that it isn't related.

A few more things to try:

  1. Eliminate the race between exec and exit to see if the problem goes away.

This does indeed solve the problem, as expected.

  1. Try recreating this as a C program with a background thread doing fork+exec and the main thread doing random sleep + exit.

So far, I haven't been able to replicate this in C, but will keep poking when I have time.

@ostrowr
Copy link
Author

ostrowr commented Jul 14, 2022

Confirmed reproduction in C:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>


void *forking_thread(void* arg)
{
    int pid = fork();
    if (pid == 0) execl("/bin/echo", NULL);
    return NULL;
}

int main()
{
    pthread_t thread_id;
    pthread_create(&thread_id, NULL, forking_thread, NULL);
    usleep(20 * 1000);
    exit(0);
}

So, this is clearly not a Go bug. Feel free to close this, but if you have ideas w.r.t. a workaround as above, I'm happy to try to help.

Any idea where people have had success reporting macos kernel bugs in the past? Doesn't seem like the other thread got much traction.

@prattmic
Copy link
Member

We do have a lock taken around exec and thread creation to protect against a Linux kernel bug (#19546). As a workaround for this, we could perhaps take this lock prior to exit.

You could give this a try by calling execLock.rlock() prior to exit here (note that this is only exit when returning from main. os.Exit is a separate path).

@cherrymui
Copy link
Member

What version of macOS are you running on?.

I haven't been able to reproduce it locally. For the C reproducer, how many iterations does it usually take to have a hanging process?

Thanks.

@ostrowr
Copy link
Author

ostrowr commented Jul 19, 2022

What version of macOS are you running on?.

12.4, with an M1 Max chip.

  Model Name:	MacBook Pro
  Model Identifier:	MacBookPro18,2
  Chip:	Apple M1 Max
  Total Number of Cores:	10 (8 performance and 2 efficiency)
  Memory:	32 GB
  System Firmware Version:	7459.121.3
  OS Loader Version:	7459.121.3

I haven't been able to reproduce it locally. For the C reproducer, how many iterations does it usually take to have a hanging process?

Anecdotally it depends on CPU load, but I can consistently get a go repro in under a minute by running the scripts from here

while true; do ./test_race_go; done

C repros are less predictable, but that's probably because I didn't include the random jitter; I'll see if adding the randomness helps.

while true; do ./test_race_c; done

takes at most a couple of minutes to leave a process at 100%.

@mknyszek mknyszek modified the milestones: Backlog, Go1.20 Jul 20, 2022
@cherrymui
Copy link
Member

Thanks @ostrowr ! I can also reproduce on an Intel Mac with macOS 12.4, both the Go and the C version. I'll forward it to Apple.

@ostrowr
Copy link
Author

ostrowr commented Jul 21, 2022

Great, thanks! For posterity, I submitted this to Apple on the 14th; feedback ID FB10691471 if you want to reference it.

Haven't taken a look at a workaround in go since it'll likely take me a while to get a dev environment set up but I'll try to make some time this weekend

@cherrymui
Copy link
Member

I couldn't reproduce now with macOS 12.6.4 and 13.1. Maybe it is fixed now?

@cherrymui cherrymui modified the milestones: Go1.20, Unplanned Jan 13, 2023
@ostrowr
Copy link
Author

ostrowr commented Jan 21, 2023

I haven't been able to reproduce on 13.1 either, but only tried for about 10 runs of ./test_race_go.

Seems like possibly fixed!

Never did get a response on my Apple feedback.

@cherrymui
Copy link
Member

Thanks. Let's call it fixed. If it is not we can reopen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Darwin
Projects
None yet
Development

No branches or pull requests

7 participants