runtime: morestack on g0 when C calls Go with deep stack #59294

cherrymui · 2023-03-28T16:47:44Z

What version of Go are you using (`go version`)?

tip (d49b11b)

Does this issue reproduce with the latest release?

No

What operating system and processor architecture are you using (`go env`)?

linux/amd64

What did you do?

Build a Go c-archive, call the Go function from C.

x.go

package main

import "C"

func main() {}

//export GoF
func GoF() {}

c.c

#include "x.h"

void callGoFWithDeepStack(int n) {
	if (n > 0)
		callGoFWithDeepStack(n - 1);
	GoF();
}

int main() {
	GoF();                        // call GoF without using much stack
	callGoFWithDeepStack(100000); // call GoF with a deep stack
}

$ go build -buildmode=c-archive x.go 
$ cc -O0 c.c x.a # don't optimize out my recursion
$ ./a.out

What did you expect to see?

Run without error.

What did you see instead?

fatal: morestack on g0

This is a regression from CL https://golang.org/cl/392854. The first time C calls into Go, we create an M and a g0, and compute a stack bound from the current SP. We don't know how big the C stack is, so we simply assume 32K: https://cs.opensource.google/go/go/+/master:src/runtime/proc.go;l=1937-1946
Previously, when the Go function returned to C, we dropped the M. The next time C called into Go, we put a new stack bound on the g0 based on the current SP.
After the CL, we don't drop the M, so the next time C calls into Go we reuse the same g0 without recomputing the stack bounds. If the C code uses quite a bit of stack before calling into Go, the SP may be well below the 32K stack bound we assumed, so the runtime thinks the g0 stack has overflowed.

We probably need to either get more accurate stack bounds the first time C calls into Go, as we do in x_cgo_init https://cs.opensource.google/go/go/+/master:src/runtime/cgo/gcc_linux_amd64.c;l=46 , or recompute the g0 stack bounds from the current SP each time C calls into Go.
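
The failure mode described above can be modeled with plain address arithmetic. The following is a minimal sketch, not the runtime's code: `stack_bounds`, `guess_bounds`, and the two `overflow_when_*` helpers are hypothetical names, and the only figure taken from the report is the 32K assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a g0 stack record; not the runtime's struct. */
typedef struct {
	uintptr_t lo; /* lowest address believed to be usable */
	uintptr_t hi; /* highest address believed to be usable */
} stack_bounds;

enum { ASSUMED_STACK = 32 * 1024 }; /* the 32K guess from the report */

/* Guess the bounds from the SP observed on a C-to-Go call. */
stack_bounds guess_bounds(uintptr_t sp) {
	stack_bounds b = { sp - ASSUMED_STACK, sp };
	return b;
}

/* The stack looks overflowed if the SP is below the recorded low bound. */
bool looks_like_overflow(stack_bounds b, uintptr_t sp) {
	return sp < b.lo;
}

/* Old behavior (modeled): bounds are recomputed on every call. */
bool overflow_when_recomputed(uintptr_t first_sp, uintptr_t later_sp) {
	(void)first_sp; /* the first call's SP no longer matters */
	return looks_like_overflow(guess_bounds(later_sp), later_sp);
}

/* Behavior after CL 392854 (modeled): the first call's bounds are reused. */
bool overflow_when_reused(uintptr_t first_sp, uintptr_t later_sp) {
	return looks_like_overflow(guess_bounds(first_sp), later_sp);
}
```

In this model, a later call whose SP is more than 32K below the first call's SP makes `overflow_when_reused` return true, while `overflow_when_recomputed` returns false for the same pair of SPs. That mirrors the reproducer: the shallow `GoF()` call fixes the bounds, and the deep recursive call then lands below them.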


gopherbot · 2023-03-28T19:33:01Z

Change https://go.dev/cl/479915 mentions this issue: runtime: get a better g0 stack bound in needm

Currently, when C calls into Go the first time, we grab an M using needm, which sets m.g0's stack bounds using the SP. We don't know how big the stack is, so we simply assume 32K.

Previously, when the Go function returns to C, we drop the M, and the next time C calls into Go, we put a new stack bound on the g0 based on the current SP. After CL 392854, we don't drop the M, and the next time C calls into Go, we reuse the same g0, without recomputing the stack bounds. If the C code uses quite a bit of stack space before calling into Go, the SP may be well below the 32K stack bound we assumed, so the runtime thinks the g0 stack overflows.

This CL makes needm get a more accurate stack bound from pthread. (In some platforms this may still be a guess as we don't know exactly where we are in the C stack), but it is probably better than simply assuming 32K.

For #59294.

Change-Id: Ie52a8f931e0648d8753e4c1dbe45468b8748b527
Reviewed-on: https://go-review.googlesource.com/c/go/+/479915
Run-TryBot: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
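
As an illustration of asking pthread for the stack range instead of guessing, here is a minimal sketch that assumes Linux with glibc, where `pthread_getattr_np` is available as a GNU extension. `current_stack_bounds` is a hypothetical helper written for this example; it is not the runtime's `x_cgo_getstackbound` and makes no claim about how that function is written.

```c
#define _GNU_SOURCE /* pthread_getattr_np is a GNU extension */
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: report the calling thread's stack as [*lo, *hi).
 * Returns 0 on success, or the pthread error code on failure.
 */
int current_stack_bounds(uintptr_t *lo, uintptr_t *hi) {
	pthread_attr_t attr;
	void *addr = NULL;
	size_t size = 0;
	int err;

	/* Fill attr with the attributes of the running thread. */
	err = pthread_getattr_np(pthread_self(), &attr);
	if (err != 0)
		return err;

	/* addr is the lowest address of the stack; size is its length. */
	err = pthread_attr_getstack(&attr, &addr, &size);

	/* The attribute object should be destroyed once it is no longer needed. */
	pthread_attr_destroy(&attr);
	if (err != 0)
		return err;

	*lo = (uintptr_t)addr;
	*hi = (uintptr_t)addr + size;
	return 0;
}
```

A caller can take the address of a local variable and check that it falls between `lo` and `hi`. Unlike the SP-relative guess, this range does not depend on how deep the C code was when it first called in, although on some platforms the reported values may still be approximate.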

gopherbot · 2023-03-31T18:42:27Z

Change https://go.dev/cl/481056 mentions this issue: Revert "runtime/cgo: store M for C-created thread in pthread key"

gopherbot · 2023-03-31T18:42:28Z

Change https://go.dev/cl/481057 mentions this issue: runtime/cgo: fix memory leak in x_cgo_getstackbound

gopherbot · 2023-03-31T19:17:25Z

Change https://go.dev/cl/481060 mentions this issue: Revert "runtime/cgo: store M for C-created thread in pthread key"

This reverts CL 392854. Reason for revert: caused #59294, which was derived from google internal tests. The attempted fix of #59294 caused more breakage. Change-Id: I5a061561ac2740856b7ecc09725ac28bd30f8bba Reviewed-on: https://go-review.googlesource.com/c/go/+/481060 Reviewed-by: Heschi Kreinick <heschi@google.com> Run-TryBot: Cherry Mui <cherryyz@google.com> TryBot-Result: Gopher Robot <gobot@golang.org>

gopherbot · 2023-03-31T19:53:57Z

Change https://go.dev/cl/481061 mentions this issue: runtime/cgo: store M for C-created thread in pthread key

gopherbot · 2023-04-03T19:23:52Z

Change https://go.dev/cl/481795 mentions this issue: runtime/cgo: guard pthread_getattr_np on Illumos

While Solaris supports pthread_getattr_np, Illumos doesn't... Instead, Illumos supports pthread_attr_get_np. Updates #59294. Change-Id: I2c66dad79b8bf3d510352875bf21d04415f23eeb Reviewed-on: https://go-review.googlesource.com/c/go/+/481795 TryBot-Bypass: Cherry Mui <cherryyz@google.com> Run-TryBot: Cherry Mui <cherryyz@google.com> Reviewed-by: Michael Knyszek <mknyszek@google.com>

gopherbot · 2023-04-17T18:33:15Z

Change https://go.dev/cl/485275 mentions this issue: Revert "runtime/cgo: store M for C-created thread in pthread key"

This reverts CL 481061. Reason for revert: When built with C TSAN, x_cgo_getstackbound triggers race detection on `g->stacklo` because the synchronization is in Go, which isn't instrumented. For #51676. For #59294. For #59678. Change-Id: I38afcda9fcffd6537582a39a5214bc23dc147d47 Reviewed-on: https://go-review.googlesource.com/c/go/+/485275 TryBot-Result: Gopher Robot <gobot@golang.org> Auto-Submit: Michael Pratt <mpratt@google.com> Run-TryBot: Michael Pratt <mpratt@google.com> Reviewed-by: Than McIntosh <thanm@google.com>

gopherbot · 2023-04-17T21:45:37Z

Change https://go.dev/cl/485500 mentions this issue: runtime/cgo: store M for C-created thread in pthread key

This reapplies CL 481061, with the followup fixes in CL 482975, CL 485315, and CL 485316 incorporated.

CL 481061, by doujiang24 <doujiang24@gmail.com>, speeds up C to Go calls by binding the M to the C thread. See below for its description.
CL 482975 is a followup fix to a C declaration in testprogcgo.
CL 485315 is a followup fix for x_cgo_getstackbound on Illumos.
CL 485316 is a followup cleanup for ppc64 assembly.

[Original CL 481061 description]

This reapplies CL 392854, with the followup fixes in CL 479255, CL 479915, and CL 481057 incorporated.

CL 392854, by doujiang24 <doujiang24@gmail.com>, speeds up C to Go calls by binding the M to the C thread. See below for its description.
CL 479255 is a followup fix for a small bug in ARM assembly code.
CL 479915 is another followup fix to address C to Go calls after the C code uses some stack, but that CL is also buggy.
CL 481057, by Michael Knyszek, is a followup fix for a memory leak bug of CL 479915.

[Original CL 392854 description]

In a C thread, it's necessary to acquire an extra M by using needm while invoking a Go function from C. But, needm and dropm are heavy costs due to the signal-related syscalls. So, we change to not dropm while returning back to C, which means binding the extra M to the C thread until it exits, to avoid needm and dropm on each C to Go call. Instead, we only dropm while the C thread exits, so the extra M won't leak.

When invoking a Go function from C: Allocate a pthread variable using pthread_key_create, only once per shared object, and register a thread-exit-time destructor. And store the g0 of the current m into the thread-specified value of the pthread key, only once per C thread, so that the destructor will put the extra M back onto the extra M list while the C thread exits.

When returning back to C: Skip dropm in cgocallback, when the pthread variable has been created, so that the extra M will be reused the next time a Go function is invoked from C.

This is purely a performance optimization. The old version, in which needm & dropm happen on each cgo call, is still correct too, and we have to keep the old version on systems with cgo but without pthreads, like Windows.

This optimization is significant, and the specific value depends on the OS and CPU, but in general, it can be considered as 10x faster, for a simple Go function call from a C thread.

For the newly added BenchmarkCGoInCThread, some benchmark results:
1. it's 28x faster, from 3395 ns/op to 121 ns/op, in darwin OS & Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
2. it's 6.5x faster, from 1495 ns/op to 230 ns/op, in Linux OS & Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz

[CL 479915 description]

Currently, when C calls into Go the first time, we grab an M using needm, which sets m.g0's stack bounds using the SP. We don't know how big the stack is, so we simply assume 32K.

Previously, when the Go function returns to C, we drop the M, and the next time C calls into Go, we put a new stack bound on the g0 based on the current SP. After CL 392854, we don't drop the M, and the next time C calls into Go, we reuse the same g0, without recomputing the stack bounds. If the C code uses quite a bit of stack space before calling into Go, the SP may be well below the 32K stack bound we assumed, so the runtime thinks the g0 stack overflows.

This CL makes needm get a more accurate stack bound from pthread. (In some platforms this may still be a guess as we don't know exactly where we are in the C stack), but it is probably better than simply assuming 32K.

[CL 485500 description]

CL 479915 passed the G to _cgo_getstackbound for direct updates to gp.stack.lo. A G can be reused on a new thread after the previous thread exited. This could trigger the C TSAN race detector because it couldn't see the synchronization in Go (lockextra) preventing the same G from being used on multiple threads at the same time.

We work around this by passing the address of a stack variable to _cgo_getstackbound rather than the G. The stack is generally unique per thread, so TSAN won't see the same address from multiple threads. Even if stacks are reused across threads by pthread, C TSAN should see the synchronization in the stack allocator.

A regression test is added to misc/cgo/testsanitizer.

Fixes #51676.
Fixes #59294.
Fixes #59678.

Change-Id: Ic62be31a06ee83568215e875a891df37084e08ca
Reviewed-on: https://go-review.googlesource.com/c/go/+/485500
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
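
The pthread-key scheme in the original CL 392854 description (create the key once, bind a value once per C thread, release it in a thread-exit destructor) can be sketched in isolation as follows. This is a hedged illustration, not the runtime's implementation: `fake_m`, `enter_go`, `call_three_times`, and the counters are hypothetical stand-ins for the M, the C-to-Go entry path, and the needm/dropm bookkeeping.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the extra M bound to a C thread. */
typedef struct { int id; } fake_m;

static pthread_key_t bound_m_key;
static pthread_once_t key_once = PTHREAD_ONCE_INIT;
static pthread_mutex_t count_mu = PTHREAD_MUTEX_INITIALIZER;
static int acquired; /* needm-like acquisitions */
static int released; /* dropm-like releases */

/* Destructor: runs at thread exit for threads holding a non-NULL value. */
static void release_m(void *p) {
	pthread_mutex_lock(&count_mu);
	released++;
	pthread_mutex_unlock(&count_mu);
	free(p);
}

static void make_key(void) {
	pthread_key_create(&bound_m_key, release_m);
}

/* Every entry calls this; only the first call on a thread acquires. */
fake_m *enter_go(void) {
	pthread_once(&key_once, make_key);
	fake_m *m = pthread_getspecific(bound_m_key);
	if (m == NULL) {
		m = calloc(1, sizeof *m);
		if (m == NULL)
			return NULL;
		pthread_mutex_lock(&count_mu);
		m->id = ++acquired;
		pthread_mutex_unlock(&count_mu);
		pthread_setspecific(bound_m_key, m);
	}
	return m;
}

/* Thread body: three entries; returns non-NULL if all saw the same binding. */
void *call_three_times(void *unused) {
	(void)unused;
	fake_m *first = enter_go();
	enter_go();
	fake_m *last = enter_go();
	return (void *)(intptr_t)(first != NULL && first == last);
}

int acquired_count(void) {
	pthread_mutex_lock(&count_mu);
	int n = acquired;
	pthread_mutex_unlock(&count_mu);
	return n;
}

int released_count(void) {
	pthread_mutex_lock(&count_mu);
	int n = released;
	pthread_mutex_unlock(&count_mu);
	return n;
}
```

Running `call_three_times` on a thread created with `pthread_create` and then joining it should leave both counters at one: the binding is made on the first entry, reused on the next two, and released by the destructor when the thread exits. The cost the CL avoids is the repeated acquire and release around every call.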

gopherbot · 2023-05-17T16:18:10Z

Change https://go.dev/cl/495855 mentions this issue: runtime/cgo: store M for C-created thread in pthread key

This reapplies CL 485500, with a fix drafted in CL 492987 incorporated.

CL 485500 is reverted due to #60004 and #60007. #60004 is fixed in CL 492743. #60007 is fixed in CL 492987 (incorporated in this CL).

[Original CL 485500 description]

This reapplies CL 481061, with the followup fixes in CL 482975, CL 485315, and CL 485316 incorporated.

CL 481061, by doujiang24 <doujiang24@gmail.com>, speeds up C to Go calls by binding the M to the C thread. See below for its description.
CL 482975 is a followup fix to a C declaration in testprogcgo.
CL 485315 is a followup fix for x_cgo_getstackbound on Illumos.
CL 485316 is a followup cleanup for ppc64 assembly.

CL 479915 passed the G to _cgo_getstackbound for direct updates to gp.stack.lo. A G can be reused on a new thread after the previous thread exited. This could trigger the C TSAN race detector because it couldn't see the synchronization in Go (lockextra) preventing the same G from being used on multiple threads at the same time.

We work around this by passing the address of a stack variable to _cgo_getstackbound rather than the G. The stack is generally unique per thread, so TSAN won't see the same address from multiple threads. Even if stacks are reused across threads by pthread, C TSAN should see the synchronization in the stack allocator.

A regression test is added to misc/cgo/testsanitizer.

[Original CL 481061 description]

This reapplies CL 392854, with the followup fixes in CL 479255, CL 479915, and CL 481057 incorporated.

CL 392854, by doujiang24 <doujiang24@gmail.com>, speeds up C to Go calls by binding the M to the C thread. See below for its description.
CL 479255 is a followup fix for a small bug in ARM assembly code.
CL 479915 is another followup fix to address C to Go calls after the C code uses some stack, but that CL is also buggy.
CL 481057, by Michael Knyszek, is a followup fix for a memory leak bug of CL 479915.

[Original CL 392854 description]

In a C thread, it's necessary to acquire an extra M by using needm while invoking a Go function from C. But, needm and dropm are heavy costs due to the signal-related syscalls. So, we change to not dropm while returning back to C, which means binding the extra M to the C thread until it exits, to avoid needm and dropm on each C to Go call. Instead, we only dropm while the C thread exits, so the extra M won't leak.

When invoking a Go function from C: Allocate a pthread variable using pthread_key_create, only once per shared object, and register a thread-exit-time destructor. And store the g0 of the current m into the thread-specified value of the pthread key, only once per C thread, so that the destructor will put the extra M back onto the extra M list while the C thread exits.

When returning back to C: Skip dropm in cgocallback, when the pthread variable has been created, so that the extra M will be reused the next time a Go function is invoked from C.

This is purely a performance optimization. The old version, in which needm & dropm happen on each cgo call, is still correct too, and we have to keep the old version on systems with cgo but without pthreads, like Windows.

This optimization is significant, and the specific value depends on the OS and CPU, but in general, it can be considered as 10x faster, for a simple Go function call from a C thread.

For the newly added BenchmarkCGoInCThread, some benchmark results:
1. it's 28x faster, from 3395 ns/op to 121 ns/op, in darwin OS & Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
2. it's 6.5x faster, from 1495 ns/op to 230 ns/op, in Linux OS & Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz

[CL 479915 description]

Currently, when C calls into Go the first time, we grab an M using needm, which sets m.g0's stack bounds using the SP. We don't know how big the stack is, so we simply assume 32K.

Previously, when the Go function returns to C, we drop the M, and the next time C calls into Go, we put a new stack bound on the g0 based on the current SP. After CL 392854, we don't drop the M, and the next time C calls into Go, we reuse the same g0, without recomputing the stack bounds. If the C code uses quite a bit of stack space before calling into Go, the SP may be well below the 32K stack bound we assumed, so the runtime thinks the g0 stack overflows.

This CL makes needm get a more accurate stack bound from pthread. (In some platforms this may still be a guess as we don't know exactly where we are in the C stack), but it is probably better than simply assuming 32K.

[CL 492987 description]

On the first call into Go from a C thread, currently we set the g0 stack's high bound imprecisely based on the SP. With CL 485500, we keep the M and don't recompute the stack bounds when it calls into Go again. If the first call is made when the C thread uses some deep stack, but a subsequent call is made with a shallower stack, the SP may be above g0.stack.hi.

This is usually okay as we don't usually check stack.hi. One place where we do check for stack.hi is in the signal handler, in adjustSignalStack. In particular, C TSAN delivers signals on the g0 stack (instead of the usual signal stack). If the SP is above g0.stack.hi, we don't see that it is on the g0 stack, and throw.

This CL makes it get an accurate stack upper bound with the pthread API (on the platforms where it is available).

Also add some debug print for the "handler not on signal stack" throw.

Fixes #51676.
Fixes #59294.
Fixes #59678.
Fixes #60007.

Change-Id: Ie51c8e81ade34ec81d69fd7bce1fe0039a470776
Reviewed-on: https://go-review.googlesource.com/c/go/+/495855
Run-TryBot: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
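
The upper-bound problem from the CL 492987 description is the mirror image of the original overflow, and can be modeled the same way. The sketch below is an assumption-laden model, not runtime code: `g0_stack`, `guessed_from_sp`, and `on_g0_stack` are hypothetical names, and `HI_SLACK` is an illustrative value standing in for "a high bound set imprecisely based on the SP".

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the g0 stack range. */
typedef struct {
	uintptr_t lo;
	uintptr_t hi;
} g0_stack;

enum {
	ASSUMED_STACK = 32 * 1024, /* guessed room below the SP */
	HI_SLACK = 1024            /* illustrative guessed room above the SP */
};

/* Bounds guessed from the SP seen on the first call into Go. */
g0_stack guessed_from_sp(uintptr_t sp) {
	g0_stack s = { sp - ASSUMED_STACK, sp + HI_SLACK };
	return s;
}

/* The kind of range check a signal handler might apply to the current SP. */
bool on_g0_stack(g0_stack s, uintptr_t sp) {
	return sp >= s.lo && sp < s.hi;
}
```

If the first call arrives with a deep stack, `guessed_from_sp` records a `hi` close to that deep SP. A later call from a shallower frame then has an SP above `hi`, so `on_g0_stack` returns false even though the thread is on the same stack. With bounds that cover the real range of the thread's stack, both SPs pass the check.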

gopherbot added the compiler/runtime label Mar 28, 2023

mknyszek assigned cherrymui Mar 29, 2023

mknyszek added the NeedsFix label Mar 29, 2023

mknyszek added this to the Go1.21 milestone Mar 29, 2023

felixge mentioned this issue Apr 2, 2023

runtime: TestCgoTraceParser failures #59233

Closed

gopherbot closed this as completed in ccad8a9 Apr 3, 2023

prattmic mentioned this issue May 5, 2023

runtime: "signal 23 received but handler not on signal stack" after CL 485500 when signal delivered on shallow g0 stack by TSAN #60007

Closed
