Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os/exec: cmd.Output/StartProcess hangs on macOS 12.3 #52086

Closed
dprotaso opened this issue Mar 31, 2022 · 12 comments
Closed

os/exec: cmd.Output/StartProcess hangs on macOS 12.3 #52086

dprotaso opened this issue Mar 31, 2022 · 12 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Darwin WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Milestone

Comments

@dprotaso
Copy link

dprotaso commented Mar 31, 2022

What version of Go are you using (go version)?

$ go version
go1.17.8 darwin/amd64
and 
go version go1.18 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

OS: macOS Monterey Version 12.3 (21E230)
Arch: amd64 (Intel i9)
Bash: GNU bash, version 5.1.16(1)-release (x86_64-apple-darwin20.6.0)
KO_DOCKER_REPO set to a gcr repo - but I've reproduced this using kind.local as well (see kind setup)
Terminal: iTerm2 Build 3.4.15

go env Output
$ go env
go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/dprotasowski/Library/Caches/go-build"
GOENV="/Users/dprotasowski/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/dprotasowski/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/dprotasowski/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/Cellar/go/1.18/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.18/libexec/pkg/tool/darwin_amd64"
GOVCS=""
GOVERSION="go1.18"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/dprotasowski/work/serving4/go.mod"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/xs/xfp1z4cn643c46w69lv8bv8w0000gp/T/go-build2573651288=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

$ go install github.com/google/ko@v0.11.2
$ git clone https://github.com/knative/serving
$ cd serving
# default shell is zsh run bash 5.1
$ bash
# Run the following a bunch of times until it hangs ~20 tries at times
$ ko resolve -R -f config/core/ > /dev/null

What did you expect to see?

The ko program uses the os/exec.Cmd package to kick off go builds and perform registry credential lookups.
The program runs to completion

What did you see instead?

The program hangs calling cmd.Output(). It looks like it a fork/exec fails as ps -l it shows the forked process but it looks like exec wasn't invoked as the process name is the same.

  UID   PID  PPID        F CPU PRI NI       SZ    RSS WCHAN     S             ADDR TTY           TIME CMD
  502 19733 19731     4006   0  31  0 34584196   3808 -      S                   0 ttys001    0:00.06 -zsh
  502 21014 19733     4006   0  31  0 34551188   9672 -      S+                  0 ttys001    0:09.78 watch -n1 ps | grep ko
  502 20937 20935     4006   0  31  0 34752132   4132 -      S                   0 ttys002    0:00.15 -zsh
  502 63498 63496     4006   0  31  0 34599668   4316 -      S                   0 ttys003    0:00.07 -zsh
  502 63608 63498     4006   0  31  0 36285588  44612 -      S+                  0 ttys003    0:01.95 nvim .
  502 21101 21099     4006   0  31  0 34731652   3884 -      S                   0 ttys005    0:00.07 -zsh
  502 23766 21101     4006   0  31  0 34293012   2452 -      S                   0 ttys005    0:00.05 bash
  502 53232 23766     4006   0  31  0 35033184 110996 -      S+                  0 ttys005    0:00.76 ko resolve -R -f config/core/
  502 53344 53232        6   0  31  0 34901180    372 -      R+                  0 ttys005    1:35.16 ko resolve -R -f config/core/

After some time the process state was

  UID   PID  PPID        F CPU PRI NI       SZ    RSS WCHAN     S             ADDR TTY           TIME CMD
  502 55784 55782     4006   0  31  0 34739844   3876 -      S                   0 ttys000    0:00.06 -zsh
  502 21014 19733     4006   0  31  0 34551188   9672 -      S+                  0 ttys001    0:10.63 watch -n1 ps | grep ko
  502 20937 20935     4006   0  31  0 34752132   4132 -      S                   0 ttys002    0:00.16 -zsh
  502 63498 63496     4006   0  31  0 34599668   4316 -      S                   0 ttys003    0:00.07 -zsh
  502 63608 63498     4006   0  31  0 36285588  44612 -      S+                  0 ttys003    0:01.95 nvim .
  502 21101 21099     4006   0  31  0 34731652   3884 -      S                   0 ttys005    0:00.07 -zsh
  502 23766 21101     4006   0  31  0 34293012   2452 -      S                   0 ttys005    0:00.05 bash
  502 53232 23766     4006   0  31  0 35033188 111004 -      S+                  0 ttys005    0:00.79 ko resolve -R -f config/core/
  502 53344 56663      806   0  31  0 34901184    744 -      SX+                 0 ttys005    7:45.97 ko resolve -R -f config/core/

Parent Process Stack Dumps

Go routine stack traces for the parent and child process are here: https://gist.github.com/dprotaso/1e1c4c2abea19dbfbdbb61b221c0fdcc

Notes on reproducibility

I wasn't able to reproduce this in zsh only bash

Using Apple's Terminal.app I couldn't reproduce the issue (~10 runs). Comparing the difference between the terminal apps I noticed iTerm had some extra env vars.

I figured out an env var having an affect. __CF_USER_TEXT_ENCODING. Setting this did cause ko in Apple's Terminal.app to start hanging

export __CF_USER_TEXT_ENCODING=0x1F6:0x0:0x52

After unset'ing that env var in iTerm I didn't encounter a hang (20 tries)

Unsure how that envvar has an affect on fork/exec

@dprotaso dprotaso changed the title os/exec: StartProcess hangs os/exec: cmd.Output/StartProcess hangs Mar 31, 2022
@bcmills bcmills added OS-Darwin NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Mar 31, 2022
@bcmills bcmills added this to the Backlog milestone Mar 31, 2022
@mknyszek
Copy link
Contributor

mknyszek commented Apr 8, 2022

It's possible #52226 is related. It's also a hang on darwin/amd64. At least one user has reported that the hang is in exec, as you point out.

@dprotaso
Copy link
Author

dprotaso commented Apr 8, 2022

Yeah I'm just surprised how I'm able to reproduce this bug

@bcmills bcmills changed the title os/exec: cmd.Output/StartProcess hangs os/exec: cmd.Output/StartProcess hangs on macOS May 2, 2022
@bcmills bcmills changed the title os/exec: cmd.Output/StartProcess hangs on macOS os/exec: cmd.Output/StartProcess hangs on macOS 12.3 May 2, 2022
@lpusok
Copy link

lpusok commented Jul 22, 2022

@dprotaso Try building with CGO_ENABLED=0. Debugging a different application where this helped with a similar issue. Dissabling CGO will use a different codepath for DNS resolution, using the Go implementation that behaves differently. The issue does not happen on Catalina, it was introduced to some degree on Big Sur and got worse on Monterey. Not really sure why HTTP queries affect starting a subprocess, which is stuck in _semwait_signal().

@ianferguson
Copy link

ianferguson commented Jul 25, 2022

I'm encountering a similar issue as reported here, and was able to reproduce on Golang v1.18.4.

I don't have a small enough reproduction case to share here yet, but I did grab similar (but not identical) golang stack traces from the parent process and OS X process samples from the child process.

Similar to @dprotaso, the hang only reproduces when the __CF_USER_TEXT_ENCODING env var is set (to some variation $hex:$hex:$hex) and does not reproduce if __CF_USER_TEXT_ENCODING is not set or is set to something like 0x0:0x0 (with only 2 colon delimited hex values rather than 3)

I was not able to test CGO_ENABLED=0 as @lpusok suggested, as the tool being compiled in my case requires CGO for core functionality it has.

Golang parent process golang stack trace

goroutine 36 [running]:
runtime/pprof.writeGoroutineStacks({0x101f44288, 0x14000010018})
        /opt/homebrew/Cellar/go/1.18.1/libexec/src/runtime/pprof/pprof.go:694 +0x70
runtime/pprof.writeGoroutine({0x101f44288?, 0x14000010018?}, 0x14000700000?)
        /opt/homebrew/Cellar/go/1.18.1/libexec/src/runtime/pprof/pprof.go:683 +0x30
runtime/pprof.(*Profile).WriteTo(0x1017cc76c?, {0x101f44288?, 0x14000010018?}, 0x0?)
        /opt/homebrew/Cellar/go/1.18.1/libexec/src/runtime/pprof/pprof.go:332 +0x158
<additional stack frames from internal packages removed>

goroutine 1 [syscall]:
syscall.syscall(0x1400081f648?, 0x100a6ef88?, 0x1400081f720?, 0x1400048d000?)
        /opt/homebrew/Cellar/go/1.18.1/libexec/src/runtime/sys_darwin.go:22 +0x38
syscall.readlen(0x102b31170?, 0x140006d4ae0?, 0x4?)
        /opt/homebrew/Cellar/go/1.18.1/libexec/src/syscall/syscall_darwin.go:234 +0x40
syscall.forkExec({0x140000420c6?, 0x14000484300?}, {0x14000462600?, 0x3?, 0x3?}, 0x14000484300?)
        /opt/homebrew/Cellar/go/1.18.1/libexec/src/syscall/exec_unix.go:221 +0x380
syscall.StartProcess(...)
        /opt/homebrew/Cellar/go/1.18.1/libexec/src/syscall/exec_unix.go:255
os.startProcess({0x140000420c6, 0x16}, {0x14000462600, 0x3, 0x3}, 0x1400081fa50)
        /opt/homebrew/Cellar/go/1.18.1/libexec/src/os/exec_posix.go:54 +0x250
os.StartProcess({0x140000420c6, 0x16}, {0x14000462600, 0x3, 0x3}, 0x30?)
        /opt/homebrew/Cellar/go/1.18.1/libexec/src/os/exec.go:109 +0x54
os/exec.(*Cmd).Start(0x14000482160)
        /opt/homebrew/Cellar/go/1.18.1/libexec/src/os/exec/exec.go:422 +0x41c
os/exec.(*Cmd).Run(0x14000462570?)
        /opt/homebrew/Cellar/go/1.18.1/libexec/src/os/exec/exec.go:338 +0x24
github.com/hashicorp/vault/command/token.(*ExternalTokenHelper).Get(0x14000319ec0?)
<additional stack frames from internal packages removed>

goroutine 7 [select]:
go.opencensus.io/stats/view.(*worker).start(0x140001a6580)
        /Users/ian.ferguson/go/pkg/mod/go.opencensus.io@v0.23.0/stats/view/worker.go:276 +0x88
created by go.opencensus.io/stats/view.init.0
        /Users/ian.ferguson/go/pkg/mod/go.opencensus.io@v0.23.0/stats/view/worker.go:34 +0xa8

goroutine 35 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x0?)
        /Users/ian.ferguson/go/pkg/mod/k8s.io/klog/v2@v2.9.0/klog.go:1169 +0x60
created by k8s.io/klog/v2.init.0
        /Users/ian.ferguson/go/pkg/mod/k8s.io/klog/v2@v2.9.0/klog.go:420 +0x154

OS X process samples

actual print binary name replaced with $cli in this

Sampling process 47212 for 3 seconds with 1 millisecond of run time between samples
Sampling completed, processing symbols...
Analysis of sampling  (pid 47212) every 1 millisecond
Process:         $cli [47212]
Path:            /Users/USER/*/$cli
Load Address:    0x104840000
Identifier:      $cli
Version:         0
Code Type:       ARM64
Platform:        macOS
Parent Process:  $cli [47179]

Date/Time:       2022-07-25 10:22:18.822 -0400
Launch Time:     2022-07-25 10:09:22.169 -0400
OS Version:      macOS 12.4 (21F79)
Report Version:  7
Analysis Tool:   /usr/bin/sample

Physical footprint:         753K
Physical footprint (peak):  753K
----

Call graph:
    2330 Thread_5053139   DispatchQueue_1: com.apple.main-thread  (serial)
      2330 runtime.asmcgocall.abi0  (in $cli) + 124  [0x1048a99ac]
        2330 runtime.syscall.abi0  (in $cli) + 44  [0x1048aaddc]
          2330 fork  (in libsystem_c.dylib) + 112  [0x1b24735b8]
            2330 libSystem_atfork_child  (in libSystem.B.dylib) + 60  [0x1bd0eaa80]
              2330 _notify_fork_child  (in libsystem_notify.dylib) + 380  [0x1b5123a80]
                2330 _os_alloc_once  (in libsystem_platform.dylib) + 56  [0x1b25a0a6c]
                  2330 _os_once_gate_wait  (in libsystem_platform.dylib) + 364  [0x1b25a2908]
                    2330 _os_once_gate_corruption_abort  (in libsystem_platform.dylib) + 36  [0x1b25a617c]

Total number in stack (recursive counted multiple, when >=5):

Sort by top of stack, same collapsed (when >= 5):
        _os_once_gate_corruption_abort  (in libsystem_platform.dylib)        2330

@ianferguson
Copy link

ianferguson commented Jul 25, 2022

Reproducing the hang

I was able to make a ~32 line snippet that reproduces this issue about once every 1200 invocations using golang v1.18.4 on my Apple Arm based M1 device, running OS X Monterrey 12.4:

I was seemingly able to reproduce the issue more frequently before removing several channels and time based selects, but
the linked code is simpler and still triggers the issue fast enough when compiled and then executed inside a for loop on a shell using something like the following fish shell snippet:

go build .; 
for i in (seq 1 100000); 
  echo "Iteration $i"; 
  ./repro-52086; 
end;

Process State After Reproduction

When the for loop stops printing, if you look at the process statuses with something like ps aux -o ppid | grep repro | grep -v grep, you should see a pair of processes running. Both will have the same process name, and one of them will be the child of the other.

Child Process

The child process will be using 95%+ CPU and if you find the process in OS X Activity Monitor, and take a sample, the trace will look something like:

Call graph:
    2280 Thread_8683228   DispatchQueue_1: com.apple.main-thread  (serial)
      2280 runtime.asmcgocall.abi0  (in repro) + 124  [0x102ffbd3c]
        2280 runtime.syscall  (in repro) + 44  [0x102ffd36c]
          2280 fork  (in libsystem_c.dylib) + 112  [0x1b24735b8]
            2280 libSystem_atfork_child  (in libSystem.B.dylib) + 60  [0x1bd0eaa80]
              2280 _notify_fork_child  (in libsystem_notify.dylib) + 380  [0x1b5123a80]
                2280 _os_alloc_once  (in libsystem_platform.dylib) + 56  [0x1b25a0a6c]
                  2280 _os_once_gate_wait  (in libsystem_platform.dylib) + 364  [0x1b25a2908]
                    2280 _os_once_gate_corruption_abort  (in libsystem_platform.dylib) + 36  [0x1b25a617c]

Total number in stack (recursive counted multiple, when >=5):

Parent Process

The parent process will be using 0% cpu and have a sampled trace in the OS X activity monitor of something like

Call graph:
    2530 Thread_9739879   DispatchQueue_1: com.apple.main-thread  (serial)
    + 2530 runtime.asmcgocall.abi0  (in repro) + 124  [0x102e5fd3c]
    +   2530 runtime.syscall  (in repro) + 44  [0x102e6136c]
    +     2530 read  (in libsystem_kernel.dylib) + 8  [0x1b254f634]
    2530 Thread_9739880
    + 2530 runtime.asmcgocall.abi0  (in repro) + 200  [0x102e5fd88]
    +   2530 runtime.pthread_cond_timedwait_relative_np_trampoline  (in repro) + 28  [0x102e6124c]
    +     2530 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1280  [0x1b258c868]
    +       2530 __psynch_cvwait  (in libsystem_kernel.dylib) + 8  [0x1b2552270]
    2530 Thread_9739881
    + 2530 runtime.asmcgocall.abi0  (in repro) + 200  [0x102e5fd88]
    +   2530 runtime.pthread_cond_wait_trampoline  (in repro) + 24  [0x102e61218]
    +     2530 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1236  [0x1b258c83c]
    +       2530 __psynch_cvwait  (in libsystem_kernel.dylib) + 8  [0x1b2552270]
    2530 Thread_9739882
    + 2530 runtime.kevent_trampoline  (in repro) + 40  [0x102e60f28]
    +   2530 kevent  (in libsystem_kernel.dylib) + 8  [0x1b2554c20]
    2530 Thread_9739883
    + 2530 runtime.asmcgocall.abi0  (in repro) + 200  [0x102e5fd88]
    +   2530 runtime.pthread_cond_wait_trampoline  (in repro) + 24  [0x102e61218]
    +     2530 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1236  [0x1b258c83c]
    +       2530 __psynch_cvwait  (in libsystem_kernel.dylib) + 8  [0x1b2552270]
    2530 Thread_9739884
    + 2530 runtime.asmcgocall.abi0  (in repro) + 200  [0x102e5fd88]
    +   2530 runtime.pthread_cond_wait_trampoline  (in repro) + 24  [0x102e61218]
    +     2530 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1236  [0x1b258c83c]
    +       2530 __psynch_cvwait  (in libsystem_kernel.dylib) + 8  [0x1b2552270]
    2530 Thread_9739885
      2530 runtime.asmcgocall.abi0  (in repro) + 201  [0x102e5fd89]
        2530 runtime.pthread_cond_wait_trampoline  (in repro) + 24  [0x102e61218]
          2530 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1236  [0x1b258c83c]
            2530 __psynch_cvwait  (in libsystem_kernel.dylib) + 8  [0x1b2552270]

Total number in stack (recursive counted multiple, when >=5):
        5       __psynch_cvwait  (in libsystem_kernel.dylib) + 0  [0x1b2552268]

Sort by top of stack, same collapsed (when >= 5):
        __psynch_cvwait  (in libsystem_kernel.dylib)        12650
        kevent  (in libsystem_kernel.dylib)        2530
        read  (in libsystem_kernel.dylib)        2530

Important setup caveats

The environment variable __CF_USER_TEXT_ENCODING must be set to a value in the pattern $hex2:$hex2:$hex3, where $hex1 is your OX uid encoded as a hex value and $hex2 and $hex3 match the pair of values in the ~/.CFUserTextEncoding file. for example: 0x1F7:0x0:0x0.

If you are using iTerm 2 or Alacritty as your terminal __CF_USER_TEXT_ENCODING is likely already set, if you are using Apple Terminal, it likely is not set by default and you will need to set it yourself in order to reproduce the process hang

I've confirmed that this reproduction case works on both Apple Terminal (once __CF_USER_TEXT_ENCODING was set) and Alacritty

@mknyszek
Copy link
Contributor

mknyszek commented Sep 6, 2022

I think this failure looks quite similar.

2022-09-03T06:26:16-9fc09d2/darwin-arm64-12

@dprotaso
Copy link
Author

Still encountering this with macOS 12.6.1 and go1.19.3 darwin/amd64

@rsc
Copy link
Contributor

rsc commented Nov 17, 2022

Very weird. If you are comfortable building Go from source, please try commit 1f4394a, which I just checked in. If not, wait for Go 1.20rc1.

@dprotaso
Copy link
Author

I think the change works - I'll run it overnight to confirm. Before it failed maybe 1 in 20 runs. I just ran it ~100 times and no issues.

@rsc
Copy link
Contributor

rsc commented Nov 18, 2022

Sounds like this is a duplicate of #56784 then. Please close when you are satisfied it's gone.

@bcmills bcmills added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Nov 18, 2022
@dprotaso
Copy link
Author

Ran it 1000 times and didn't see it reproduce. Thanks for the fix!

@dprotaso
Copy link
Author

Does it make sense to cherry pick this fix back?

@golang golang locked and limited conversation to collaborators Nov 18, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Darwin WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Projects
None yet
Development

No branches or pull requests

7 participants