Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

syscall: Go can leak a forked process if main thread exits before spawn finished #33565

Closed
jacobvosmaer opened this issue Aug 9, 2019 · 24 comments
Labels
FrozenDueToAge help wanted NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@jacobvosmaer
Copy link

Please answer these questions before submitting your issue. Thanks!

What did you do?

If I spawn a process in a goroutine, and exit the main thread before the spawn is finished, I am sometimes left with a fork of my original process that does not go away. The fork does not respond to SIGTERM, and sometimes consumes 100% CPU.

https://play.golang.org/p/1HPscvoAiwj

What did you expect to see?

If the main thread exits Go should not leave behind a forked process.

What did you see instead?

A forked process, sometimes using 100% CPU

Does this issue reproduce with the latest release (go1.12.7)?

Yes.

System details

go version go1.12.6 darwin/amd64
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/jacobvosmaer/Library/Caches/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/jacobvosmaer/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
GOROOT/bin/go version: go version go1.12.6 darwin/amd64
GOROOT/bin/go tool compile -V: compile version go1.12.6
uname -v: Darwin Kernel Version 18.7.0: Thu Jun 20 18:42:21 PDT 2019; root:xnu-4903.270.47~4/RELEASE_X86_64
ProductName:	Mac OS X
ProductVersion:	10.14.6
BuildVersion:	18G87
lldb --version: lldb-1100.0.25.3
Apple Swift version 5.1 (swiftlang-1100.0.38.29 clang-1100.0.20.14)
@jacobvosmaer jacobvosmaer changed the title os/exec: Go can leak a fork process if main thread exits before spawn finished os/exec: Go can leak a forked process if main thread exits before spawn finished Aug 9, 2019
@jacobvosmaer
Copy link
Author

My guess is that when doing a fork+exec, after the fork, the child wants to "communicate" with the parent; perhaps by reading or writing to a pipe. If the parent is gone the child should exit and do nothing. But it appears that with the "right" timing, the child can get stuck in a loop instead.

@ianlancetaylor ianlancetaylor changed the title os/exec: Go can leak a forked process if main thread exits before spawn finished syscall: Go can leak a forked process if main thread exits before spawn finished Aug 9, 2019
@ianlancetaylor ianlancetaylor added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Aug 9, 2019
@ianlancetaylor ianlancetaylor added this to the Go1.14 milestone Aug 9, 2019
@jacobvosmaer
Copy link
Author

There are some lldb stack traces that might help here: https://gitlab.com/gitlab-org/gitaly/issues/1850#note_205304016

@agnivade
Copy link
Contributor

Can't reproduce on linux.

If the parent is gone the child should exit and do nothing.

I don't believe that is necessarily correct. If a parent exits, the child becomes a zombie and attaches to PID 1. I understand this issue is about cleaning up the child before the spawn has happened completely.

@randall77 for macOS.

@jacobvosmaer
Copy link
Author

I also cannot reproduce on linux.

I understand this issue is about cleaning up the child before the spawn has happened completely.

Yes. In my toy example, I'm trying to spawn /bin/sh which will exit harmlessly on its own accord, even if its parent is gone. The problem is that when this bug strikes, we never get to the exec which creates the /bin/sh process.

I've learned some more about the real life bug that prompted this issue. It seems that there, even if the main goroutine does not exit, we get a stuck forked process. I was able to attach to some of these with lldb, which is what I linked to above. https://gitlab.com/gitlab-org/gitaly/issues/1850#note_205304016

I'm not sure if this is the exact same state as my reproducing example, but it might be related. To save some clicking I'm including stack traces below.

mystery fork A, 0% CPU
* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff7eaee86a libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007fff7ebad56e libsystem_pthread.dylib`_pthread_cond_wait + 722
    frame #2: 0x00007fff7d1e6467 libobjc.A.dylib`monitor_tt<false>::wait() + 19
    frame #3: 0x00007fff7d1eb462 libobjc.A.dylib`WAITING_FOR_ANOTHER_THREAD_TO_FINISH_CALLING_+initialize + 132
    frame #4: 0x00007fff7d1eb79b libobjc.A.dylib`initializeNonMetaClass + 79
    frame #5: 0x00007fff7d1eb79b libobjc.A.dylib`initializeNonMetaClass + 79
    frame #6: 0x00007fff7d1ec62f libobjc.A.dylib`initializeAndMaybeRelock(objc_class*, objc_object*, mutex_tt<false>&, bool) + 187
    frame #7: 0x00007fff7d1db690 libobjc.A.dylib`lookUpImpOrForward + 228
    frame #8: 0x00007fff7d1db114 libobjc.A.dylib`_objc_msgSend_uncached + 68
    frame #9: 0x00007fff7ebfafa1 libxpc.dylib`xpc_atfork_child + 125
    frame #10: 0x00007fff7b99eb18 libSystem.B.dylib`libSystem_atfork_child + 54
    frame #11: 0x00007fff7ea0fd97 libsystem_c.dylib`fork + 40
    frame #12: 0x00000000010603ff gitaly`runtime.syscall + 31
    frame #13: 0x000000000105dde0 gitaly`runtime.asmcgocall + 112
    frame #14: 0x0000000001033930 gitaly`runtime.startTheWorldWithSema + 624
    frame #15: 0x000000000104d40e gitaly`syscall.rawSyscall + 78
    frame #16: 0x00000000010a21a2 gitaly`syscall.forkAndExecInChild + 226
    frame #17: 0x00000000010a31a0 gitaly`syscall.forkExec + 864
    frame #18: 0x00000000010c4a12 gitaly`os.startProcess + 610
    frame #19: 0x00000000010c449c gitaly`os.StartProcess + 124
    frame #20: 0x0000000001179f9c gitaly`os/exec.(*Cmd).Start + 1180
    frame #21: 0x00000000016805a3 gitaly`gitlab.com/gitlab-org/gitaly/internal/supervisor.(*Process).start + 435
mystery fork B, 100% CPU
* thread #1: tid = 0x43f1b, 0x00007fff7eba3317 libsystem_platform.dylib`_os_once_gate_corruption_abort + 23, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
(lldb) bt
* thread #1, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
  * frame #0: 0x00007fff7eba3317 libsystem_platform.dylib`_os_once_gate_corruption_abort + 23
    frame #1: 0x00007fff7eba114c libsystem_platform.dylib`_os_once_gate_wait + 196
    frame #2: 0x00007fff7eb9ea6f libsystem_platform.dylib`_os_alloc_once + 42
    frame #3: 0x00007fff7eb99d2a libsystem_notify.dylib`_notify_fork_child + 211
    frame #4: 0x00007fff7b99eb13 libSystem.B.dylib`libSystem_atfork_child + 49
    frame #5: 0x00007fff7ea0fd97 libsystem_c.dylib`fork + 40
    frame #6: 0x00000000010603ff gitaly`runtime.syscall + 31
    frame #7: 0x000000000105dde0 gitaly`runtime.asmcgocall + 112
    frame #8: 0x0000000001033930 gitaly`runtime.startTheWorldWithSema + 624
    frame #9: 0x000000000104d40e gitaly`syscall.rawSyscall + 78
    frame #10: 0x00000000010a21a2 gitaly`syscall.forkAndExecInChild + 226
    frame #11: 0x00000000010a31a0 gitaly`syscall.forkExec + 864
    frame #12: 0x00000000010c4a12 gitaly`os.startProcess + 610
    frame #13: 0x00000000010c449c gitaly`os.StartProcess + 124
    frame #14: 0x0000000001179f9c gitaly`os/exec.(*Cmd).Start + 1180
    frame #15: 0x0000000001680563 gitaly`gitlab.com/gitlab-org/gitaly/internal/supervisor.(*Process).start + 435

When I lldb the crashing example I don't get nice symbols. Is it possible I messed up a go build command?

@agnivade
Copy link
Contributor

I would also give 1.13rc1 a shot, just to be sure that it is not something already fixed. Keith can probably comment more on what exactly is happening.

@ianlancetaylor
Copy link
Contributor

The stack traces make it appear that the problem is in the Darwin libc. The forked child is trying to get some response from the parent, but the parent has exited. If that is true, it should be possible to write a similar test case in C. Anybody want to try that?

@jacobvosmaer
Copy link
Author

it should be possible to write a similar test case in C. Anybody want to try that?

It's the sort of challenge I enjoy so I'll take a look. But a person familiar with Go internals would be much faster at it than me. I won't be offended if such a person swoops in and beats me to it.

@jacobvosmaer
Copy link
Author

OK that was not hard. Just translated my Go example to C.

https://gist.github.com/jacobvosmaer/69fae756e88d2f5f4ef5091ceeba1d88

As before, leave it running for a while in a restart loop, and you get a weird forked process left behind.

@jacobvosmaer
Copy link
Author

The C reproducer highlights how odd this is even more: all it tries to do is fork followed by exit, yet the leaking forked processes are not exiting.

@ianlancetaylor
Copy link
Contributor

Thanks!

Does anybody want to try to report this to Apple?

To fix this in Go I think it would suffice to have exit in runtime/sys_darwin.go acquire a write lock on syscall.ForkLock. Unfortunately that will be a bit of a hassle since the runtime package doesn't depend on the syscall package, and since syscall.ForkLock is exported it can't move.

@ianlancetaylor ianlancetaylor added NeedsFix The path to resolution is known, but the work has not been done. help wanted labels Aug 28, 2019
@gopherbot gopherbot removed the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Aug 28, 2019
@jacobvosmaer
Copy link
Author

It really does feel like a macOS bug. With my C reproducer, if I leave it running it leaks way more processes than the Go one. Most of the leaked processes are at 0% CPU instead of 100% CPU.

* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff6d67ef06 libsystem_kernel.dylib`__psynch_mutexwait + 10
    frame #1: 0x00007fff6d73bd52 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 96
    frame #2: 0x00007fff6d7394cd libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 222
    frame #3: 0x00007fff6d5ed2b4 libsystem_c.dylib`__cxa_finalize_ranges + 36
    frame #4: 0x00007fff6d5ed6b3 libsystem_c.dylib`exit + 55
    frame #5: 0x0000000106b09e21 fork-leak`do_fork + 33
    frame #6: 0x00007fff6d73b2eb libsystem_pthread.dylib`_pthread_body + 126
    frame #7: 0x00007fff6d73e249 libsystem_pthread.dylib`_pthread_start + 66
    frame #8: 0x00007fff6d73a40d libsystem_pthread.dylib`thread_start + 13

What is particularly worrying here is that after I remove the leaked processes with kill -9, some part of the system (kernel? userspace kernel helper?) keeps spinning the CPU. It shows in the activity monitor as constant "system" load. I can only get rid of that by rebooting.

Screenshot 2019-08-29 at 09 31 49

I don't feel confident reporting this to Apple for a number of reasons:

  1. I don't have enough experience with C and Unix programming to stand behind my C reproducer
  2. I don't know to what extent the C reproducer matches the behavior of the Go runtime: even if the C reproducer is proof of a macOS bug, do we know if it's the same bug I see in Go?

If they respond to my C reproducer by saying "your code is wrong" or "you should use posix_spawn" then I have nothing to say to that.

@ianlancetaylor ianlancetaylor modified the milestones: Go1.14, Unplanned Sep 11, 2019
@jqmichael
Copy link

I don't feel confident reporting this to Apple for a number of reasons:

  1. I don't have enough experience with C and Unix programming to stand behind my C reproducer
  2. I don't know to what extent the C reproducer matches the behavior of the Go runtime: even if the C reproducer is proof of a macOS bug, do we know if it's the same bug I see in Go?

If they respond to my C reproducer by saying "your code is wrong" or "you should use posix_spawn" then I have nothing to say to that.

That's certainly true. But we are still making progress even if Apple proved we were doing wrong, isn't it?

@Kila2
Copy link

Kila2 commented Mar 19, 2021

I tried various methods. Can we reslove?

Process 38893 stopped
* thread #1, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
    frame #0: 0x00007fff731566b9 libsystem_platform.dylib`_os_once_gate_corruption_abort + 23
libsystem_platform.dylib`_os_once_gate_corruption_abort:
->  0x7fff731566b9 <+23>: ud2    

libsystem_platform.dylib`_os_lock_recursive_abort:
    0x7fff731566bb <+0>:  movl   %edi, %eax
    0x7fff731566bd <+2>:  leaq   0x7d1(%rip), %rcx         ; "BUG IN CLIENT OF LIBPLATFORM: Trying to recursively lock an os_lock"
    0x7fff731566c4 <+9>:  movq   %rcx, 0x2651c3ad(%rip)    ; gCRAnnotations + 8
Target 0: (bit) stopped.
* thread #1, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
  * frame #0: 0x00007fff731566b9 libsystem_platform.dylib`_os_once_gate_corruption_abort + 23
    frame #1: 0x00007fff73152feb libsystem_platform.dylib`_os_once_gate_wait + 196
    frame #2: 0x00007fff73150f89 libsystem_platform.dylib`_os_alloc_once + 42
    frame #3: 0x00007fff73038c75 libsystem_coreservices.dylib`_libcoreservices_fork_child + 78
    frame #4: 0x00007fff6ff40aa9 libSystem.B.dylib`libSystem_atfork_child + 44
    frame #5: 0x00007fff72fba8ad libsystem_c.dylib`fork + 40
    frame #6: 0x000000000106f73f bit`runtime.syscall + 31
    frame #7: 0x000000000106d5b0 bit`runtime.asmcgocall + 112
    frame #8: 0x000000000103c8c0 bit`runtime.startTheWorldWithSema + 576
    frame #9: 0x000000000106a86c bit`syscall.rawSyscall + 76
    frame #10: 0x00000000010afca6 bit`syscall.forkAndExecInChild + 230
    frame #11: 0x00000000010b0eb8 bit`syscall.forkExec + 1080
    frame #12: 0x00000000010da23b bit`os.startProcess + 667
    frame #13: 0x00000000010d9c3c bit`os.StartProcess + 124
    frame #14: 0x0000000001342505 bit`os/exec.(*Cmd).Start + 1317
    frame #15: 0x0000000001341f6b bit`os/exec.(*Cmd).Run + 43
    frame #16: 0x0000000001343171 bit`os/exec.(*Cmd).CombinedOutput + 145

clrpackages pushed a commit to clearlinux-pkgs/amazon-ssm-agent that referenced this issue Jun 17, 2021
…0 to version 3.0.1295.0

Darrell Kienzle (1):
      Add syscall.Sync on Darwin to workaround golang/go#33565

Faho Shubladze (2):
      Added validation checks to InstanceID and Region fields of CustomIdentity
      Updated release notes

Frank Miao (1):
      Added error handling to a session utility function

Ivan Aguilar (1):
      Separate logs files on windows

Samiullah Mohammed (1):
      Add cross-account domain join

Thor Bjorgvinsson (4):
      Added configurable custom identity and configurable identity consumption order
      Pass appconfig to functions instead of reading config
      Adding synchronization to RunCommand service stop
      Moving darwin proxy handling to proxyconfig package

Vishnu Karthik Ravindran (3):
      correct default hibernation seelog config
      cleanup agent updater old versions
      Fix gofmt issue

Yuting Fan (1):
      Remove delay in non-interactive session type
@cherrymui
Copy link
Member

For the C reproducer in #33565 (comment) , I can reproduce on macOS. I can also reproduce it on Linux, with even a much higher rate of leaking children. It seems the child stuck at exit call:

#0  __lll_lock_wait_private (futex=futex@entry=0x7f0507afaae8 <__exit_funcs_lock>) at ./lowlevellock.c:35
#1  0x00007f05079785f8 in __run_exit_handlers (status=0, listp=0x7f0507af8718 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:56
#2  0x00007f050797867a in __GI_exit (status=<optimized out>) at exit.c:139
#3  0x0000561c884091d5 in do_fork ()
#4  0x00007f0507b07ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f0507a37def in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

If we exec instead of exit in child, it doesn't seem to stuck. So it may be a different issue. The original bug seems that it stuck at the fork call instead of exit.

cshuaimin added a commit to leancloud/lean-cli that referenced this issue Mar 11, 2022
cshuaimin added a commit to leancloud/lean-cli that referenced this issue Mar 14, 2022
@cherrymui
Copy link
Member

This may be similar to #53863 ? (which is a macOS system bug and may be fixed in a future macOS version.)

@rsc
Copy link
Contributor

rsc commented Nov 16, 2022

I am getting these unkillable processes in the os/exec test during all.bash pretty reliably. Here is a stack trace from one:

* thread #1, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
  * frame #0: 0x00007ff801638f85 libsystem_platform.dylib`_os_once_gate_corruption_abort + 23
    frame #1: 0x00007ff8016347c1 libsystem_platform.dylib`_os_once_gate_wait + 212
    frame #2: 0x00007ff8016327e9 libsystem_platform.dylib`_os_alloc_once + 42
    frame #3: 0x00007ff804089144 libsystem_notify.dylib`_notify_fork_child + 349
    frame #4: 0x00007ff80c41ac89 libSystem.B.dylib`libSystem_atfork_child + 58
    frame #5: 0x00007ff80151382d libsystem_c.dylib`fork + 84
    frame #6: 0x000000000106d55f exec.test`runtime.syscall.abi0 + 31
    frame #7: 0x000000000106b3e4 exec.test`runtime.asmcgocall.abi0 + 100
    frame #8: 0x000000000106824b exec.test`syscall.rawSyscall + 139
    frame #9: 0x00000000010771b0 exec.test`syscall.forkAndExecInChild + 240
    frame #10: 0x0000000001077f5f exec.test`syscall.forkExec + 863
    frame #11: 0x000000000109b992 exec.test`os.startProcess + 818
    frame #12: 0x000000000109b59a exec.test`os.StartProcess + 90
    frame #13: 0x00000000010ffb0e exec.test`os/exec.(*Cmd).Start + 1518
    frame #14: 0x00000000012d5045 exec.test`os/exec_test.TestEchoFileRace + 165
    frame #15: 0x00000000010ecaab exec.test`testing.tRunner + 267
    frame #16: 0x00000000010edaca exec.test`testing.(*T).Run.func1 + 42
    frame #17: 0x000000000106b5c1 exec.test`runtime.goexit.abi0 + 1

Is there anything we can do?

@Kila2
Copy link

Kila2 commented Nov 17, 2022

I am getting these unkillable processes in the os/exec test during all.bash pretty reliably. Here is a stack trace from one:

* thread #1, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
  * frame #0: 0x00007ff801638f85 libsystem_platform.dylib`_os_once_gate_corruption_abort + 23
    frame #1: 0x00007ff8016347c1 libsystem_platform.dylib`_os_once_gate_wait + 212
    frame #2: 0x00007ff8016327e9 libsystem_platform.dylib`_os_alloc_once + 42
    frame #3: 0x00007ff804089144 libsystem_notify.dylib`_notify_fork_child + 349
    frame #4: 0x00007ff80c41ac89 libSystem.B.dylib`libSystem_atfork_child + 58
    frame #5: 0x00007ff80151382d libsystem_c.dylib`fork + 84
    frame #6: 0x000000000106d55f exec.test`runtime.syscall.abi0 + 31
    frame #7: 0x000000000106b3e4 exec.test`runtime.asmcgocall.abi0 + 100
    frame #8: 0x000000000106824b exec.test`syscall.rawSyscall + 139
    frame #9: 0x00000000010771b0 exec.test`syscall.forkAndExecInChild + 240
    frame #10: 0x0000000001077f5f exec.test`syscall.forkExec + 863
    frame #11: 0x000000000109b992 exec.test`os.startProcess + 818
    frame #12: 0x000000000109b59a exec.test`os.StartProcess + 90
    frame #13: 0x00000000010ffb0e exec.test`os/exec.(*Cmd).Start + 1518
    frame #14: 0x00000000012d5045 exec.test`os/exec_test.TestEchoFileRace + 165
    frame #15: 0x00000000010ecaab exec.test`testing.tRunner + 267
    frame #16: 0x00000000010edaca exec.test`testing.(*T).Run.func1 + 42
    frame #17: 0x000000000106b5c1 exec.test`runtime.goexit.abi0 + 1

Is there anything we can do?

try refactor code async logic to sync or before syscall wait async code finish. I guess maybe when after syscall forked. they will inherit parent process and parent process is in async state. the child process memory is dirty

@rsc
Copy link
Contributor

rsc commented Nov 17, 2022

I have a fix for the Go stacks that were reported in this issue. I am not as sure about the C reproducer - it may be that it tickles a different fork+exec bug in Apple libc.

@gopherbot
Copy link

Change https://go.dev/cl/451735 mentions this issue: runtime: work around Apple libc bugs to make exec stop hanging

@dmitshur dmitshur modified the milestones: Unplanned, Go1.20 Nov 17, 2022
@gopherbot
Copy link

Change https://go.dev/cl/459175 mentions this issue: runtime: revert Apple libc atfork workaround

@gopherbot
Copy link

Change https://go.dev/cl/459176 mentions this issue: runtime: call __fork instead of fork on darwin

gopherbot pushed a commit that referenced this issue Dec 22, 2022
Revert CL 451735 (1f4394a), which fixed #33565 and #56784
but also introduced #57263.

I have a different fix to apply instead. Since the first fix was
never backported, it will be easiest to backport the new fix
if the new fix is done in a separate CL from the revert.

Change-Id: I6c8ea3a46e542ee4702675bbc058e29ccd2723e0
Reviewed-on: https://go-review.googlesource.com/c/go/+/459175
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Russ Cox <rsc@golang.org>
gopherbot pushed a commit that referenced this issue Dec 22, 2022
Issues #33565 and #56784 were caused by hangs in the child process
after fork, while it ran atfork handlers that ran into slow paths that
didn't work in the child.

CL 451735 worked around those two issues by calling a couple functions
at startup to try to warm up those child paths. That mostly worked,
but it broke programs using cgo with certain macOS frameworks (#57263).

CL 459175 reverted CL 451735.

This CL introduces a different fix: bypass the atfork child handlers
entirely. For a general fork call where the child and parent are both
meant to keep executing the original program, atfork handlers can be
necessary to fix any state that would otherwise be tied to the parent
process. But Go only uses fork as preparation for exec, and it takes
care to limit what it attempts to do in the child between the fork and
exec. In particular it doesn't use any of the things that the macOS
atfork handlers are trying to fix up (malloc, xpc, others). So we can
use the low-level fork system call (__fork) instead of the
atfork-wrapped one.

The full list of functions that can be called in a child after fork in
exec_libc2.go is:

 - ptrace
 - setsid
 - setpgid
 - getpid
 - ioctl
 - chroot
 - setgroups
 - setgid
 - setuid
 - chdir
 - dup2
 - fcntl
 - close
 - execve
 - write
 - exit

I disassembled all of these while attached to a hung exec.test binary
and confirmed that nearly all of them are making direct kernel calls,
not using anything that the atfork handler needs to fix up.
The exceptions are ioctl, fcntl, and exit.

The ioctl and fcntl implementations do some extra work around the
kernel call but don't call any other functions, so they should still
be OK. (If not, we could use __ioctl and __fcntl instead, but without
a good reason, we should keep using the standard entry points.)

The exit implementation calls atexit handlers. That is almost
certainly inappropriate in a failed fork child, so this CL changes
that call to __exit on darwin. To avoid making unnecessary changes at
this point in the release cycle, this CL leaves OpenBSD calling plain
exit, even though that is probably a bug in the OpenBSD port
(filed #57446).

Fixes #33565.
Fixes #56784.
Fixes #57263.

Change-Id: I26812c26a72bdd7fcf72ec41899ba11cf6b9c4ab
Reviewed-on: https://go-review.googlesource.com/c/go/+/459176
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Russ Cox <rsc@golang.org>
@gopherbot
Copy link

Change https://go.dev/cl/459178 mentions this issue: runtime: call __fork instead of fork on darwin

@gopherbot
Copy link

Change https://go.dev/cl/459179 mentions this issue: [release-branch.go1.18] runtime: call __fork instead of fork on darwin

gopherbot pushed a commit that referenced this issue Dec 22, 2022
Issues #33565 and #56784 were caused by hangs in the child process
after fork, while it ran atfork handlers that ran into slow paths that
didn't work in the child.

CL 451735 worked around those two issues by calling a couple functions
at startup to try to warm up those child paths. That mostly worked,
but it broke programs using cgo with certain macOS frameworks (#57263).

CL 459175 reverted CL 451735.

This CL introduces a different fix: bypass the atfork child handlers
entirely. For a general fork call where the child and parent are both
meant to keep executing the original program, atfork handlers can be
necessary to fix any state that would otherwise be tied to the parent
process. But Go only uses fork as preparation for exec, and it takes
care to limit what it attempts to do in the child between the fork and
exec. In particular it doesn't use any of the things that the macOS
atfork handlers are trying to fix up (malloc, xpc, others). So we can
use the low-level fork system call (__fork) instead of the
atfork-wrapped one.

The full list of functions that can be called in a child after fork in
exec_libc2.go is:

 - ptrace
 - setsid
 - setpgid
 - getpid
 - ioctl
 - chroot
 - setgroups
 - setgid
 - setuid
 - chdir
 - dup2
 - fcntl
 - close
 - execve
 - write
 - exit

I disassembled all of these while attached to a hung exec.test binary
and confirmed that nearly all of them are making direct kernel calls,
not using anything that the atfork handler needs to fix up.
The exceptions are ioctl, fcntl, and exit.

The ioctl and fcntl implementations do some extra work around the
kernel call but don't call any other functions, so they should still
be OK. (If not, we could use __ioctl and __fcntl instead, but without
a good reason, we should keep using the standard entry points.)

The exit implementation calls atexit handlers. That is almost
certainly inappropriate in a failed fork child, so this CL changes
that call to __exit on darwin. To avoid making unnecessary changes at
this point in the release cycle, this CL leaves OpenBSD calling plain
exit, even though that is probably a bug in the OpenBSD port
(filed #57446).

Fixes #33565.
Fixes #56784.
Fixes #57263.

Fixes #56837.

Change-Id: I26812c26a72bdd7fcf72ec41899ba11cf6b9c4ab
Reviewed-on: https://go-review.googlesource.com/c/go/+/459176
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Russ Cox <rsc@golang.org>
Reviewed-on: https://go-review.googlesource.com/c/go/+/459178
gopherbot pushed a commit that referenced this issue Dec 22, 2022
Issues #33565 and #56784 were caused by hangs in the child process
after fork, while it ran atfork handlers that ran into slow paths that
didn't work in the child.

CL 451735 worked around those two issues by calling a couple functions
at startup to try to warm up those child paths. That mostly worked,
but it broke programs using cgo with certain macOS frameworks (#57263).

CL 459175 reverted CL 451735.

This CL introduces a different fix: bypass the atfork child handlers
entirely. For a general fork call where the child and parent are both
meant to keep executing the original program, atfork handlers can be
necessary to fix any state that would otherwise be tied to the parent
process. But Go only uses fork as preparation for exec, and it takes
care to limit what it attempts to do in the child between the fork and
exec. In particular it doesn't use any of the things that the macOS
atfork handlers are trying to fix up (malloc, xpc, others). So we can
use the low-level fork system call (__fork) instead of the
atfork-wrapped one.

The full list of functions that can be called in a child after fork in
exec_libc2.go is:

 - ptrace
 - setsid
 - setpgid
 - getpid
 - ioctl
 - chroot
 - setgroups
 - setgid
 - setuid
 - chdir
 - dup2
 - fcntl
 - close
 - execve
 - write
 - exit

I disassembled all of these while attached to a hung exec.test binary
and confirmed that nearly all of them are making direct kernel calls,
not using anything that the atfork handler needs to fix up.
The exceptions are ioctl, fcntl, and exit.

The ioctl and fcntl implementations do some extra work around the
kernel call but don't call any other functions, so they should still
be OK. (If not, we could use __ioctl and __fcntl instead, but without
a good reason, we should keep using the standard entry points.)

The exit implementation calls atexit handlers. That is almost
certainly inappropriate in a failed fork child, so this CL changes
that call to __exit on darwin. To avoid making unnecessary changes at
this point in the release cycle, this CL leaves OpenBSD calling plain
exit, even though that is probably a bug in the OpenBSD port
(filed #57446).

Fixes #33565.
Fixes #56784.
Fixes #57263.
Fixes #56836.

Change-Id: I26812c26a72bdd7fcf72ec41899ba11cf6b9c4ab
Reviewed-on: https://go-review.googlesource.com/c/go/+/459176
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Russ Cox <rsc@golang.org>
Reviewed-on: https://go-review.googlesource.com/c/go/+/459179
@gopherbot
Copy link

Change https://go.dev/cl/460476 mentions this issue: runtime: Apple libc atfork workaround take 3

gopherbot pushed a commit that referenced this issue Jan 10, 2023
CL 451735 worked around bugs in Apple's atfork handlers by calling
notify_is_valid_token and xpc_atfork_child at startup, so that init
code that wouldn't be safe in the child process would be warmed up in
the parent process instead, but xpc_atfork_child broke use of the xpc
library in Go programs, and xpc is internally used by various macOS
frameworks (#57263).

CL 459175 reverted that change, and then CL 459176 tried a new
approach: use __fork, which doesn't call any of the atfork handlers at all.
That worked, but an Apple engineer reviewing the change in private
email suggests that since __fork is not public API, it should be avoided.
The same engineer (with access to the source code for the xpc library)
suggests that the breakage in #57263 is caused by xpc_atfork_child
marking the library as unusable, expecting an imminent call to exec,
and that calling xpc_date_create_from_current instead would do the
necessary initialization without marking xpc as unusable.

CL 460475 reverted that change, to prepare for this one.

This CL goes back to the original “call functions to warm things up”
approach, replacing xpc_atfork_child with xpc_date_create_from_current.

The CL also updates cmd/link to use OS and SDK version 10.13.0 for
x86 macOS binaries, up from 10.9.0, also suggested by the Apple engineer.
Combined with the two warmup calls, this makes the fork hangs go away.
The minimum macOS version has been 10.13 High Sierra since Go 1.17,
so there should be no problem with writing that in the binaries too.

Fixes #33565.
Fixes #56784.
Fixes #57263.
Fixes #57577.

Change-Id: I20769d9daa1fe9ea930f8009481335f8a14dc21b
Reviewed-on: https://go-review.googlesource.com/c/go/+/460476
Auto-Submit: Russ Cox <rsc@golang.org>
Run-TryBot: Russ Cox <rsc@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Bryan Mills <bcmills@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
@golang golang locked and limited conversation to collaborators Jan 4, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge help wanted NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

9 participants