syscall: Go can leak a forked process if main thread exits before spawn finished #33565

jacobvosmaer · 2019-08-09T13:06:55Z

Please answer these questions before submitting your issue. Thanks!

What did you do?

If I spawn a process in a goroutine, and exit the main thread before the spawn is finished, I am sometimes left with a fork of my original process that does not go away. The fork does not respond to SIGTERM, and sometimes consumes 100% CPU.

https://play.golang.org/p/1HPscvoAiwj

What did you expect to see?

If the main thread exits Go should not leave behind a forked process.

What did you see instead?

A forked process, sometimes using 100% CPU

Does this issue reproduce with the latest release (go1.12.7)?

Yes.

System details

go version go1.12.6 darwin/amd64
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/jacobvosmaer/Library/Caches/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/jacobvosmaer/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
GOROOT/bin/go version: go version go1.12.6 darwin/amd64
GOROOT/bin/go tool compile -V: compile version go1.12.6
uname -v: Darwin Kernel Version 18.7.0: Thu Jun 20 18:42:21 PDT 2019; root:xnu-4903.270.47~4/RELEASE_X86_64
ProductName:	Mac OS X
ProductVersion:	10.14.6
BuildVersion:	18G87
lldb --version: lldb-1100.0.25.3
Apple Swift version 5.1 (swiftlang-1100.0.38.29 clang-1100.0.20.14)

The text was updated successfully, but these errors were encountered:

jacobvosmaer · 2019-08-09T13:33:21Z

My guess is that when doing a fork+exec, after the fork, the child wants to "communicate" with the parent; perhaps by reading or writing to a pipe. If the parent is gone the child should exit and do nothing. But it appears that with the "right" timing, the child can get stuck in a loop instead.

jacobvosmaer · 2019-08-19T17:24:30Z

There are some lldb stack traces that might help here: https://gitlab.com/gitlab-org/gitaly/issues/1850#note_205304016

agnivade · 2019-08-20T05:51:47Z

Can't reproduce on linux.

If the parent is gone the child should exit and do nothing.

I don't believe that is necessarily correct. If a parent exits, the child becomes a zombie and attaches to PID 1. I understand this issue is about cleaning up the child before the spawn has happened completely.

@randall77 for macOS.

jacobvosmaer · 2019-08-28T16:15:54Z

I also cannot reproduce on linux.

I understand this issue is about cleaning up the child before the spawn has happened completely.

Yes. In my toy example, I'm trying to spawn /bin/sh which will exit harmlessly on its own accord, even if its parent is gone. The problem is that when this bug strikes, we never get to the exec which creates the /bin/sh process.

I've learned some more about the real life bug that prompted this issue. It seems that there, even if the main goroutine does not exit, we get a stuck forked process. I was able to attach to some of these with lldb, which is what I linked to above. https://gitlab.com/gitlab-org/gitaly/issues/1850#note_205304016

I'm not sure if this is the exact same state as my reproducing example, but it might be related. To save some clicking I'm including stack traces below.

mystery fork A, 0% CPU

* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff7eaee86a libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007fff7ebad56e libsystem_pthread.dylib`_pthread_cond_wait + 722
    frame #2: 0x00007fff7d1e6467 libobjc.A.dylib`monitor_tt<false>::wait() + 19
    frame #3: 0x00007fff7d1eb462 libobjc.A.dylib`WAITING_FOR_ANOTHER_THREAD_TO_FINISH_CALLING_+initialize + 132
    frame #4: 0x00007fff7d1eb79b libobjc.A.dylib`initializeNonMetaClass + 79
    frame #5: 0x00007fff7d1eb79b libobjc.A.dylib`initializeNonMetaClass + 79
    frame #6: 0x00007fff7d1ec62f libobjc.A.dylib`initializeAndMaybeRelock(objc_class*, objc_object*, mutex_tt<false>&, bool) + 187
    frame #7: 0x00007fff7d1db690 libobjc.A.dylib`lookUpImpOrForward + 228
    frame #8: 0x00007fff7d1db114 libobjc.A.dylib`_objc_msgSend_uncached + 68
    frame #9: 0x00007fff7ebfafa1 libxpc.dylib`xpc_atfork_child + 125
    frame #10: 0x00007fff7b99eb18 libSystem.B.dylib`libSystem_atfork_child + 54
    frame #11: 0x00007fff7ea0fd97 libsystem_c.dylib`fork + 40
    frame #12: 0x00000000010603ff gitaly`runtime.syscall + 31
    frame #13: 0x000000000105dde0 gitaly`runtime.asmcgocall + 112
    frame #14: 0x0000000001033930 gitaly`runtime.startTheWorldWithSema + 624
    frame #15: 0x000000000104d40e gitaly`syscall.rawSyscall + 78
    frame #16: 0x00000000010a21a2 gitaly`syscall.forkAndExecInChild + 226
    frame #17: 0x00000000010a31a0 gitaly`syscall.forkExec + 864
    frame #18: 0x00000000010c4a12 gitaly`os.startProcess + 610
    frame #19: 0x00000000010c449c gitaly`os.StartProcess + 124
    frame #20: 0x0000000001179f9c gitaly`os/exec.(*Cmd).Start + 1180
    frame #21: 0x00000000016805a3 gitaly`gitlab.com/gitlab-org/gitaly/internal/supervisor.(*Process).start + 435

mystery fork B, 100% CPU

* thread #1: tid = 0x43f1b, 0x00007fff7eba3317 libsystem_platform.dylib`_os_once_gate_corruption_abort + 23, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
(lldb) bt
* thread #1, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
  * frame #0: 0x00007fff7eba3317 libsystem_platform.dylib`_os_once_gate_corruption_abort + 23
    frame #1: 0x00007fff7eba114c libsystem_platform.dylib`_os_once_gate_wait + 196
    frame #2: 0x00007fff7eb9ea6f libsystem_platform.dylib`_os_alloc_once + 42
    frame #3: 0x00007fff7eb99d2a libsystem_notify.dylib`_notify_fork_child + 211
    frame #4: 0x00007fff7b99eb13 libSystem.B.dylib`libSystem_atfork_child + 49
    frame #5: 0x00007fff7ea0fd97 libsystem_c.dylib`fork + 40
    frame #6: 0x00000000010603ff gitaly`runtime.syscall + 31
    frame #7: 0x000000000105dde0 gitaly`runtime.asmcgocall + 112
    frame #8: 0x0000000001033930 gitaly`runtime.startTheWorldWithSema + 624
    frame #9: 0x000000000104d40e gitaly`syscall.rawSyscall + 78
    frame #10: 0x00000000010a21a2 gitaly`syscall.forkAndExecInChild + 226
    frame #11: 0x00000000010a31a0 gitaly`syscall.forkExec + 864
    frame #12: 0x00000000010c4a12 gitaly`os.startProcess + 610
    frame #13: 0x00000000010c449c gitaly`os.StartProcess + 124
    frame #14: 0x0000000001179f9c gitaly`os/exec.(*Cmd).Start + 1180
    frame #15: 0x0000000001680563 gitaly`gitlab.com/gitlab-org/gitaly/internal/supervisor.(*Process).start + 435

When I lldb the crashing example I don't get nice symbols. Is it possible I messed up a go build command?

agnivade · 2019-08-28T16:24:17Z

I would also give 1.13rc1 a shot, just to be sure that it is not something already fixed. Keith can probably comment more on what exactly is happening.

ianlancetaylor · 2019-08-28T17:19:30Z

The stack traces make it appear that the problem is in the Darwin libc. The forked child is trying to get some response from the parent, but the parent has exited. If that is true, it should be possible to write a similar test case in C. Anybody want to try that?

jacobvosmaer · 2019-08-28T18:07:07Z

it should be possible to write a similar test case in C. Anybody want to try that?

It's the sort of challenge I enjoy so I'll take a look. But a person familiar with Go internals would be much faster at it than me. I won't be offended if such a person swoops in and beats me to it.

jacobvosmaer · 2019-08-28T18:57:29Z

OK that was not hard. Just translated my Go example to C.

https://gist.github.com/jacobvosmaer/69fae756e88d2f5f4ef5091ceeba1d88

As before, leave it running for a while in a restart loop, and you get a weird forked process left behind.

jacobvosmaer · 2019-08-28T19:18:48Z

The C reproducer highlights how odd this is even more: all it tries to do is fork followed by exit, yet the leaking forked processes are not exiting.

ianlancetaylor · 2019-08-28T22:09:38Z

Thanks!

Does anybody want to try to report this to Apple?

To fix this in Go I think it would suffice to have exit in runtime/sys_darwin.go acquire a write lock on syscall.ForkLock. Unfortunately that will be a bit of a hassle since the runtime package doesn't depend on the syscall package, and since syscall.ForkLock is exported it can't move.

jacobvosmaer · 2019-08-29T07:39:44Z

It really does feel like a macOS bug. With my C reproducer, if I leave it running it leaks way more processes than the Go one. Most of the leaked processes are at 0% CPU instead of 100% CPU.

* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff6d67ef06 libsystem_kernel.dylib`__psynch_mutexwait + 10
    frame #1: 0x00007fff6d73bd52 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 96
    frame #2: 0x00007fff6d7394cd libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 222
    frame #3: 0x00007fff6d5ed2b4 libsystem_c.dylib`__cxa_finalize_ranges + 36
    frame #4: 0x00007fff6d5ed6b3 libsystem_c.dylib`exit + 55
    frame #5: 0x0000000106b09e21 fork-leak`do_fork + 33
    frame #6: 0x00007fff6d73b2eb libsystem_pthread.dylib`_pthread_body + 126
    frame #7: 0x00007fff6d73e249 libsystem_pthread.dylib`_pthread_start + 66
    frame #8: 0x00007fff6d73a40d libsystem_pthread.dylib`thread_start + 13

What is particularly worrying here is that after I remove the leaked processes with kill -9, some part of the system (kernel? userspace kernel helper?) keeps spinning the CPU. It shows in the activity monitor as constant "system" load. I can only get rid of that by rebooting.

I don't feel confident reporting this to Apple for a number of reasons:

I don't have enough experience with C and Unix programming to stand behind my C reproducer
I don't know to what extent the C reproducer matches the behavior of the Go runtime: even if the C reproducer is proof of a macOS bug, do we know if it's the same bug I see in Go?

If they respond to my C reproducer by saying "your code is wrong" or "you should use posix_spawn" then I have nothing to say to that.

jqmichael · 2020-05-25T17:16:16Z

I don't feel confident reporting this to Apple for a number of reasons:

I don't have enough experience with C and Unix programming to stand behind my C reproducer

I don't know to what extent the C reproducer matches the behavior of the Go runtime: even if the C reproducer is proof of a macOS bug, do we know if it's the same bug I see in Go?

If they respond to my C reproducer by saying "your code is wrong" or "you should use posix_spawn" then I have nothing to say to that.

That's certainly true. But we are still making progress even if Apple proved we were doing wrong, isn't it?

Kila2 · 2021-03-19T10:33:36Z

I tried various methods. Can we reslove?

Process 38893 stopped
* thread #1, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
    frame #0: 0x00007fff731566b9 libsystem_platform.dylib`_os_once_gate_corruption_abort + 23
libsystem_platform.dylib`_os_once_gate_corruption_abort:
->  0x7fff731566b9 <+23>: ud2    

libsystem_platform.dylib`_os_lock_recursive_abort:
    0x7fff731566bb <+0>:  movl   %edi, %eax
    0x7fff731566bd <+2>:  leaq   0x7d1(%rip), %rcx         ; "BUG IN CLIENT OF LIBPLATFORM: Trying to recursively lock an os_lock"
    0x7fff731566c4 <+9>:  movq   %rcx, 0x2651c3ad(%rip)    ; gCRAnnotations + 8
Target 0: (bit) stopped.

* thread #1, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
  * frame #0: 0x00007fff731566b9 libsystem_platform.dylib`_os_once_gate_corruption_abort + 23
    frame #1: 0x00007fff73152feb libsystem_platform.dylib`_os_once_gate_wait + 196
    frame #2: 0x00007fff73150f89 libsystem_platform.dylib`_os_alloc_once + 42
    frame #3: 0x00007fff73038c75 libsystem_coreservices.dylib`_libcoreservices_fork_child + 78
    frame #4: 0x00007fff6ff40aa9 libSystem.B.dylib`libSystem_atfork_child + 44
    frame #5: 0x00007fff72fba8ad libsystem_c.dylib`fork + 40
    frame #6: 0x000000000106f73f bit`runtime.syscall + 31
    frame #7: 0x000000000106d5b0 bit`runtime.asmcgocall + 112
    frame #8: 0x000000000103c8c0 bit`runtime.startTheWorldWithSema + 576
    frame #9: 0x000000000106a86c bit`syscall.rawSyscall + 76
    frame #10: 0x00000000010afca6 bit`syscall.forkAndExecInChild + 230
    frame #11: 0x00000000010b0eb8 bit`syscall.forkExec + 1080
    frame #12: 0x00000000010da23b bit`os.startProcess + 667
    frame #13: 0x00000000010d9c3c bit`os.StartProcess + 124
    frame #14: 0x0000000001342505 bit`os/exec.(*Cmd).Start + 1317
    frame #15: 0x0000000001341f6b bit`os/exec.(*Cmd).Run + 43
    frame #16: 0x0000000001343171 bit`os/exec.(*Cmd).CombinedOutput + 145

cr: https://code.amazon.com/reviews/CR-51130379

…0 to version 3.0.1295.0 Darrell Kienzle (1): Add syscall.Sync on Darwin to workaround golang/go#33565 Faho Shubladze (2): Added validation checks to InstanceID and Region fields of CustomIdentity Updated release notes Frank Miao (1): Added error handling to a session utility function Ivan Aguilar (1): Separate logs files on windows Samiullah Mohammed (1): Add cross-account domain join Thor Bjorgvinsson (4): Added configurable custom identity and configurable identity consumption order Pass appconfig to functions instead of reading config Adding synchronization to RunCommand service stop Moving darwin proxy handling to proxyconfig package Vishnu Karthik Ravindran (3): correct default hibernation seelog config cleanup agent updater old versions Fix gofmt issue Yuting Fan (1): Remove delay in non-interactive session type

cherrymui · 2021-09-07T17:16:25Z

For the C reproducer in #33565 (comment) , I can reproduce on macOS. I can also reproduce it on Linux, with even a much higher rate of leaking children. It seems the child stuck at exit call:

#0  __lll_lock_wait_private (futex=futex@entry=0x7f0507afaae8 <__exit_funcs_lock>) at ./lowlevellock.c:35
#1  0x00007f05079785f8 in __run_exit_handlers (status=0, listp=0x7f0507af8718 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:56
#2  0x00007f050797867a in __GI_exit (status=<optimized out>) at exit.c:139
#3  0x0000561c884091d5 in do_fork ()
#4  0x00007f0507b07ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f0507a37def in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

If we exec instead of exit in child, it doesn't seem to stuck. So it may be a different issue. The original bug seems that it stuck at the fork call instead of exit.

Fix high cpu usage caused by golang/go#33565

cherrymui · 2022-11-07T17:02:28Z

This may be similar to #53863 ? (which is a macOS system bug and may be fixed in a future macOS version.)

rsc · 2022-11-16T17:27:20Z

I am getting these unkillable processes in the os/exec test during all.bash pretty reliably. Here is a stack trace from one:

* thread #1, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
  * frame #0: 0x00007ff801638f85 libsystem_platform.dylib`_os_once_gate_corruption_abort + 23
    frame #1: 0x00007ff8016347c1 libsystem_platform.dylib`_os_once_gate_wait + 212
    frame #2: 0x00007ff8016327e9 libsystem_platform.dylib`_os_alloc_once + 42
    frame #3: 0x00007ff804089144 libsystem_notify.dylib`_notify_fork_child + 349
    frame #4: 0x00007ff80c41ac89 libSystem.B.dylib`libSystem_atfork_child + 58
    frame #5: 0x00007ff80151382d libsystem_c.dylib`fork + 84
    frame #6: 0x000000000106d55f exec.test`runtime.syscall.abi0 + 31
    frame #7: 0x000000000106b3e4 exec.test`runtime.asmcgocall.abi0 + 100
    frame #8: 0x000000000106824b exec.test`syscall.rawSyscall + 139
    frame #9: 0x00000000010771b0 exec.test`syscall.forkAndExecInChild + 240
    frame #10: 0x0000000001077f5f exec.test`syscall.forkExec + 863
    frame #11: 0x000000000109b992 exec.test`os.startProcess + 818
    frame #12: 0x000000000109b59a exec.test`os.StartProcess + 90
    frame #13: 0x00000000010ffb0e exec.test`os/exec.(*Cmd).Start + 1518
    frame #14: 0x00000000012d5045 exec.test`os/exec_test.TestEchoFileRace + 165
    frame #15: 0x00000000010ecaab exec.test`testing.tRunner + 267
    frame #16: 0x00000000010edaca exec.test`testing.(*T).Run.func1 + 42
    frame #17: 0x000000000106b5c1 exec.test`runtime.goexit.abi0 + 1

Is there anything we can do?

Kila2 · 2022-11-17T02:17:10Z

I am getting these unkillable processes in the os/exec test during all.bash pretty reliably. Here is a stack trace from one:

* thread #1, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
  * frame #0: 0x00007ff801638f85 libsystem_platform.dylib`_os_once_gate_corruption_abort + 23
    frame #1: 0x00007ff8016347c1 libsystem_platform.dylib`_os_once_gate_wait + 212
    frame #2: 0x00007ff8016327e9 libsystem_platform.dylib`_os_alloc_once + 42
    frame #3: 0x00007ff804089144 libsystem_notify.dylib`_notify_fork_child + 349
    frame #4: 0x00007ff80c41ac89 libSystem.B.dylib`libSystem_atfork_child + 58
    frame #5: 0x00007ff80151382d libsystem_c.dylib`fork + 84
    frame #6: 0x000000000106d55f exec.test`runtime.syscall.abi0 + 31
    frame #7: 0x000000000106b3e4 exec.test`runtime.asmcgocall.abi0 + 100
    frame #8: 0x000000000106824b exec.test`syscall.rawSyscall + 139
    frame #9: 0x00000000010771b0 exec.test`syscall.forkAndExecInChild + 240
    frame #10: 0x0000000001077f5f exec.test`syscall.forkExec + 863
    frame #11: 0x000000000109b992 exec.test`os.startProcess + 818
    frame #12: 0x000000000109b59a exec.test`os.StartProcess + 90
    frame #13: 0x00000000010ffb0e exec.test`os/exec.(*Cmd).Start + 1518
    frame #14: 0x00000000012d5045 exec.test`os/exec_test.TestEchoFileRace + 165
    frame #15: 0x00000000010ecaab exec.test`testing.tRunner + 267
    frame #16: 0x00000000010edaca exec.test`testing.(*T).Run.func1 + 42
    frame #17: 0x000000000106b5c1 exec.test`runtime.goexit.abi0 + 1

Is there anything we can do?

try refactor code async logic to sync or before syscall wait async code finish. I guess maybe when after syscall forked. they will inherit parent process and parent process is in async state. the child process memory is dirty

rsc · 2022-11-17T19:39:07Z

I have a fix for the Go stacks that were reported in this issue. I am not as sure about the C reproducer - it may be that it tickles a different fork+exec bug in Apple libc.

gopherbot · 2022-11-17T20:18:27Z

Change https://go.dev/cl/451735 mentions this issue: runtime: work around Apple libc bugs to make exec stop hanging

gopherbot · 2022-12-22T15:10:50Z

Change https://go.dev/cl/459175 mentions this issue: runtime: revert Apple libc atfork workaround

gopherbot · 2022-12-22T15:11:13Z

Change https://go.dev/cl/459176 mentions this issue: runtime: call __fork instead of fork on darwin

Revert CL 451735 (1f4394a), which fixed #33565 and #56784 but also introduced #57263. I have a different fix to apply instead. Since the first fix was never backported, it will be easiest to backport the new fix if the new fix is done in a separate CL from the revert. Change-Id: I6c8ea3a46e542ee4702675bbc058e29ccd2723e0 Reviewed-on: https://go-review.googlesource.com/c/go/+/459175 Reviewed-by: Cherry Mui <cherryyz@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Run-TryBot: Russ Cox <rsc@golang.org>

Issues #33565 and #56784 were caused by hangs in the child process after fork, while it ran atfork handlers that ran into slow paths that didn't work in the child. CL 451735 worked around those two issues by calling a couple functions at startup to try to warm up those child paths. That mostly worked, but it broke programs using cgo with certain macOS frameworks (#57263). CL 459175 reverted CL 451735. This CL introduces a different fix: bypass the atfork child handlers entirely. For a general fork call where the child and parent are both meant to keep executing the original program, atfork handlers can be necessary to fix any state that would otherwise be tied to the parent process. But Go only uses fork as preparation for exec, and it takes care to limit what it attempts to do in the child between the fork and exec. In particular it doesn't use any of the things that the macOS atfork handlers are trying to fix up (malloc, xpc, others). So we can use the low-level fork system call (__fork) instead of the atfork-wrapped one. The full list of functions that can be called in a child after fork in exec_libc2.go is: - ptrace - setsid - setpgid - getpid - ioctl - chroot - setgroups - setgid - setuid - chdir - dup2 - fcntl - close - execve - write - exit I disassembled all of these while attached to a hung exec.test binary and confirmed that nearly all of them are making direct kernel calls, not using anything that the atfork handler needs to fix up. The exceptions are ioctl, fcntl, and exit. The ioctl and fcntl implementations do some extra work around the kernel call but don't call any other functions, so they should still be OK. (If not, we could use __ioctl and __fcntl instead, but without a good reason, we should keep using the standard entry points.) The exit implementation calls atexit handlers. That is almost certainly inappropriate in a failed fork child, so this CL changes that call to __exit on darwin. To avoid making unnecessary changes at this point in the release cycle, this CL leaves OpenBSD calling plain exit, even though that is probably a bug in the OpenBSD port (filed #57446). Fixes #33565. Fixes #56784. Fixes #57263. Change-Id: I26812c26a72bdd7fcf72ec41899ba11cf6b9c4ab Reviewed-on: https://go-review.googlesource.com/c/go/+/459176 Reviewed-by: David Chase <drchase@google.com> Reviewed-by: Cherry Mui <cherryyz@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Run-TryBot: Russ Cox <rsc@golang.org>

gopherbot · 2022-12-22T18:14:08Z

Change https://go.dev/cl/459178 mentions this issue: runtime: call __fork instead of fork on darwin

gopherbot · 2022-12-22T18:19:22Z

Change https://go.dev/cl/459179 mentions this issue: [release-branch.go1.18] runtime: call __fork instead of fork on darwin

Issues #33565 and #56784 were caused by hangs in the child process after fork, while it ran atfork handlers that ran into slow paths that didn't work in the child. CL 451735 worked around those two issues by calling a couple functions at startup to try to warm up those child paths. That mostly worked, but it broke programs using cgo with certain macOS frameworks (#57263). CL 459175 reverted CL 451735. This CL introduces a different fix: bypass the atfork child handlers entirely. For a general fork call where the child and parent are both meant to keep executing the original program, atfork handlers can be necessary to fix any state that would otherwise be tied to the parent process. But Go only uses fork as preparation for exec, and it takes care to limit what it attempts to do in the child between the fork and exec. In particular it doesn't use any of the things that the macOS atfork handlers are trying to fix up (malloc, xpc, others). So we can use the low-level fork system call (__fork) instead of the atfork-wrapped one. The full list of functions that can be called in a child after fork in exec_libc2.go is: - ptrace - setsid - setpgid - getpid - ioctl - chroot - setgroups - setgid - setuid - chdir - dup2 - fcntl - close - execve - write - exit I disassembled all of these while attached to a hung exec.test binary and confirmed that nearly all of them are making direct kernel calls, not using anything that the atfork handler needs to fix up. The exceptions are ioctl, fcntl, and exit. The ioctl and fcntl implementations do some extra work around the kernel call but don't call any other functions, so they should still be OK. (If not, we could use __ioctl and __fcntl instead, but without a good reason, we should keep using the standard entry points.) The exit implementation calls atexit handlers. That is almost certainly inappropriate in a failed fork child, so this CL changes that call to __exit on darwin. To avoid making unnecessary changes at this point in the release cycle, this CL leaves OpenBSD calling plain exit, even though that is probably a bug in the OpenBSD port (filed #57446). Fixes #33565. Fixes #56784. Fixes #57263. Fixes #56837. Change-Id: I26812c26a72bdd7fcf72ec41899ba11cf6b9c4ab Reviewed-on: https://go-review.googlesource.com/c/go/+/459176 Reviewed-by: David Chase <drchase@google.com> Reviewed-by: Cherry Mui <cherryyz@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Run-TryBot: Russ Cox <rsc@golang.org> Reviewed-on: https://go-review.googlesource.com/c/go/+/459178

Issues #33565 and #56784 were caused by hangs in the child process after fork, while it ran atfork handlers that ran into slow paths that didn't work in the child. CL 451735 worked around those two issues by calling a couple functions at startup to try to warm up those child paths. That mostly worked, but it broke programs using cgo with certain macOS frameworks (#57263). CL 459175 reverted CL 451735. This CL introduces a different fix: bypass the atfork child handlers entirely. For a general fork call where the child and parent are both meant to keep executing the original program, atfork handlers can be necessary to fix any state that would otherwise be tied to the parent process. But Go only uses fork as preparation for exec, and it takes care to limit what it attempts to do in the child between the fork and exec. In particular it doesn't use any of the things that the macOS atfork handlers are trying to fix up (malloc, xpc, others). So we can use the low-level fork system call (__fork) instead of the atfork-wrapped one. The full list of functions that can be called in a child after fork in exec_libc2.go is: - ptrace - setsid - setpgid - getpid - ioctl - chroot - setgroups - setgid - setuid - chdir - dup2 - fcntl - close - execve - write - exit I disassembled all of these while attached to a hung exec.test binary and confirmed that nearly all of them are making direct kernel calls, not using anything that the atfork handler needs to fix up. The exceptions are ioctl, fcntl, and exit. The ioctl and fcntl implementations do some extra work around the kernel call but don't call any other functions, so they should still be OK. (If not, we could use __ioctl and __fcntl instead, but without a good reason, we should keep using the standard entry points.) The exit implementation calls atexit handlers. That is almost certainly inappropriate in a failed fork child, so this CL changes that call to __exit on darwin. To avoid making unnecessary changes at this point in the release cycle, this CL leaves OpenBSD calling plain exit, even though that is probably a bug in the OpenBSD port (filed #57446). Fixes #33565. Fixes #56784. Fixes #57263. Fixes #56836. Change-Id: I26812c26a72bdd7fcf72ec41899ba11cf6b9c4ab Reviewed-on: https://go-review.googlesource.com/c/go/+/459176 Reviewed-by: David Chase <drchase@google.com> Reviewed-by: Cherry Mui <cherryyz@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Run-TryBot: Russ Cox <rsc@golang.org> Reviewed-on: https://go-review.googlesource.com/c/go/+/459179

gopherbot · 2023-01-04T14:36:47Z

Change https://go.dev/cl/460476 mentions this issue: runtime: Apple libc atfork workaround take 3

CL 451735 worked around bugs in Apple's atfork handlers by calling notify_is_valid_token and xpc_atfork_child at startup, so that init code that wouldn't be safe in the child process would be warmed up in the parent process instead, but xpc_atfork_child broke use of the xpc library in Go programs, and xpc is internally used by various macOS frameworks (#57263). CL 459175 reverted that change, and then CL 459176 tried a new approach: use __fork, which doesn't call any of the atfork handlers at all. That worked, but an Apple engineer reviewing the change in private email suggests that since __fork is not public API, it should be avoided. The same engineer (with access to the source code for the xpc library) suggests that the breakage in #57263 is caused by xpc_atfork_child marking the library as unusable, expecting an imminent call to exec, and that calling xpc_date_create_from_current instead would do the necessary initialization without marking xpc as unusable. CL 460475 reverted that change, to prepare for this one. This CL goes back to the original “call functions to warm things up” approach, replacing xpc_atfork_child with xpc_date_create_from_current. The CL also updates cmd/link to use OS and SDK version 10.13.0 for x86 macOS binaries, up from 10.9.0, also suggested by the Apple engineer. Combined with the two warmup calls, this makes the fork hangs go away. The minimum macOS version has been 10.13 High Sierra since Go 1.17, so there should be no problem with writing that in the binaries too. Fixes #33565. Fixes #56784. Fixes #57263. Fixes #57577. Change-Id: I20769d9daa1fe9ea930f8009481335f8a14dc21b Reviewed-on: https://go-review.googlesource.com/c/go/+/460476 Auto-Submit: Russ Cox <rsc@golang.org> Run-TryBot: Russ Cox <rsc@golang.org> TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Bryan Mills <bcmills@google.com> Reviewed-by: Cherry Mui <cherryyz@google.com>

jacobvosmaer changed the title ~~os/exec: Go can leak a fork process if main thread exits before spawn finished~~ os/exec: Go can leak a forked process if main thread exits before spawn finished Aug 9, 2019

ianlancetaylor changed the title ~~os/exec: Go can leak a forked process if main thread exits before spawn finished~~ syscall: Go can leak a forked process if main thread exits before spawn finished Aug 9, 2019

ianlancetaylor added the NeedsInvestigation label Aug 9, 2019

ianlancetaylor added this to the Go1.14 milestone Aug 9, 2019

ianlancetaylor added NeedsFix help wanted labels Aug 28, 2019

gopherbot removed the NeedsInvestigation label Aug 28, 2019

ianlancetaylor modified the milestones: Go1.14, Unplanned Sep 11, 2019

Thor-Bjorgvinsson pushed a commit to aws/amazon-ssm-agent that referenced this issue Jun 15, 2021

Add syscall.Sync on Darwin to workaround golang/go#33565

962d4c9

cr: https://code.amazon.com/reviews/CR-51130379

cshuaimin added a commit to leancloud/lean-cli that referenced this issue Mar 11, 2022

Fix high cpu usage caused by golang/go#33565

6c04a72

rsc mentioned this issue Nov 17, 2022

os/exec: tests hang on macOS due to Apple libc fork bugs #56784

Closed

gopherbot closed this as completed in 1f4394a Nov 17, 2022

dmitshur modified the milestones: Unplanned, Go1.20 Nov 17, 2022

golang locked and limited conversation to collaborators Jan 4, 2024

gopherbot added the FrozenDueToAge label Jan 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

syscall: Go can leak a forked process if main thread exits before spawn finished #33565

syscall: Go can leak a forked process if main thread exits before spawn finished #33565

jacobvosmaer commented Aug 9, 2019

jacobvosmaer commented Aug 9, 2019

jacobvosmaer commented Aug 19, 2019

agnivade commented Aug 20, 2019

jacobvosmaer commented Aug 28, 2019

agnivade commented Aug 28, 2019

ianlancetaylor commented Aug 28, 2019

jacobvosmaer commented Aug 28, 2019

jacobvosmaer commented Aug 28, 2019

jacobvosmaer commented Aug 28, 2019

ianlancetaylor commented Aug 28, 2019

jacobvosmaer commented Aug 29, 2019

jqmichael commented May 25, 2020

Kila2 commented Mar 19, 2021

cherrymui commented Sep 7, 2021

cherrymui commented Nov 7, 2022

rsc commented Nov 16, 2022

Kila2 commented Nov 17, 2022

rsc commented Nov 17, 2022

gopherbot commented Nov 17, 2022

gopherbot commented Dec 22, 2022

gopherbot commented Dec 22, 2022

gopherbot commented Dec 22, 2022

gopherbot commented Dec 22, 2022

gopherbot commented Jan 4, 2023

syscall: Go can leak a forked process if main thread exits before spawn finished #33565

syscall: Go can leak a forked process if main thread exits before spawn finished #33565

Comments

jacobvosmaer commented Aug 9, 2019

What did you do?

What did you expect to see?

What did you see instead?

Does this issue reproduce with the latest release (go1.12.7)?

System details

jacobvosmaer commented Aug 9, 2019

jacobvosmaer commented Aug 19, 2019

agnivade commented Aug 20, 2019

jacobvosmaer commented Aug 28, 2019

agnivade commented Aug 28, 2019

ianlancetaylor commented Aug 28, 2019

jacobvosmaer commented Aug 28, 2019

jacobvosmaer commented Aug 28, 2019

jacobvosmaer commented Aug 28, 2019

ianlancetaylor commented Aug 28, 2019

jacobvosmaer commented Aug 29, 2019

jqmichael commented May 25, 2020

Kila2 commented Mar 19, 2021

cherrymui commented Sep 7, 2021

cherrymui commented Nov 7, 2022

rsc commented Nov 16, 2022

Kila2 commented Nov 17, 2022

rsc commented Nov 17, 2022

gopherbot commented Nov 17, 2022

gopherbot commented Dec 22, 2022

gopherbot commented Dec 22, 2022

gopherbot commented Dec 22, 2022

gopherbot commented Dec 22, 2022

gopherbot commented Jan 4, 2023