
runtime: severe performance drop for cgo calls in go1.22.5 #68587

Closed
zeroshade opened this issue Jul 25, 2024 · 24 comments
Labels: compiler/runtime (Issues related to the Go compiler and/or runtime) · Critical (A critical problem that affects the availability or correctness of production systems built using Go) · NeedsFix (The path to resolution is known, but the work has not been done) · Performance · release-blocker

Comments

@zeroshade

zeroshade commented Jul 25, 2024

Go version

go version go1.22.5 linux/amd64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/zeroshade/.cache/go-build'
GOENV='/home/zeroshade/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/zeroshade/go/pkg/mod'
GONOPROXY='github.com/voltrondata/*'
GONOSUMDB='github.com/voltrondata/*'
GOOS='linux'
GOPATH='/home/zeroshade/go'
GOPRIVATE='github.com/voltrondata/*'
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/lib/go-1.22'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/lib/go-1.22/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.22.5'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/home/zeroshade/Projects/GitHub/arrow-adbc/go/adbc/go.mod'
GOWORK='/home/zeroshade/Projects/GitHub/go.work'
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build279063330=/tmp/go-build -gno-record-gcc-switches'

What did you do?

Building a shared library with Go that is loaded and called from C. I was able to reproduce it with the following trivial example:

package main

import "C"

//export trivialFunctionCall
func trivialFunctionCall(v uintptr) C.int {
    return C.int(v)
}

func main() {}

Build with go build -o libtrivial.so -buildmode=c-shared main.go

#include "libtrivial.h"
#include <stdio.h>

int main() {
    int a = 0;
    for (int i = 0; i < 500; ++i) {
        a = trivialFunctionCall((size_t)i);
    }
    printf("Done %d\n", a);
}

Build with gcc -o trivial trivial.c -ltrivial -L. -Wl,-rpath,.

Then we run and record the performance: perf record --call-graph lbr -- ./trivial and look at the report with perf report -g.

What did you see happen?

The overwhelming majority of the time in the cgocall is spent in __isoc99_sscanf, as seen here:

+   97.82%     0.00%  trivial  trivial               [.] _start
+   97.82%     0.00%  trivial  libc.so.6             [.] __libc_start_main
+   97.82%     0.00%  trivial  libc.so.6             [.] 0x0000729379829d8e
+   97.82%     0.00%  trivial  trivial               [.] main
-   97.82%     0.00%  trivial  libtrivial.so         [.] trivialFunctionCall
     trivialFunctionCall
     crosscall2
     runtime.cgocallback.abi0
     runtime.cgocallbackg.abi0
   - runtime.cgocallbackg
      - 96.56% runtime.callbackUpdateSystemStack
           runtime.asmcgocall.abi0
         - x_cgo_getstackbound
            - 94.01% pthread_getattr_np
               + 76.00% __isoc99_sscanf
               + 9.14% __getdelim
                 2.96% getline
               + 2.06% fopen
               + 1.97% fclose
               + 1.88% cfree
              2.56% pthread_getattr_np@plt
      + 1.25% runtime.exitsyscall
+   97.82%     0.00%  trivial  libtrivial.so         [.] crosscall2
+   97.82%     0.00%  trivial  libtrivial.so         [.] runtime.cgocallback.abi0
+   97.82%     0.00%  trivial  libtrivial.so         [.] runtime.cgocallbackg.abi0
+   97.82%     0.00%  trivial  libtrivial.so         [.] runtime.cgocallbackg
+   96.56%     2.56%  trivial  libc.so.6             [.] pthread_getattr_np
+   96.56%     0.00%  trivial  libtrivial.so         [.] runtime.callbackUpdateSystemStack
+   96.56%     0.00%  trivial  libtrivial.so         [.] runtime.asmcgocall.abi0
+   96.56%     0.00%  trivial  libtrivial.so         [.] x_cgo_getstackbound
+   76.00%     0.00%  trivial  libc.so.6             [.] __isoc99_sscanf
+   42.52%     0.00%  trivial  libc.so.6             [.] 0x00007293798629e0
+   16.45%     0.00%  trivial  libc.so.6             [.] 0x0000729379865265
+   16.45%     0.00%  trivial  libc.so.6             [.] __strtoull_internal
+   10.59%     3.14%  trivial  libc.so.6             [.] __getdelim

If we instead build using GOTOOLCHAIN=go1.22.4 go build -o libtrivial.so -buildmode=c-shared main.go and re-run, the cost of callbackUpdateSystemStack and pthread_getattr_np doesn't even register in the performance trace:

+   92.66%    92.66%  trivial  libtrivial.so  [.] runtime.execute
+   92.66%     0.00%  trivial  libtrivial.so  [.] runtime.mallocgc
+   92.66%     0.00%  trivial  libtrivial.so  [.] runtime.gcTrigger.test
+   92.66%     0.00%  trivial  libtrivial.so  [.] runtime.morestack.abi0
+   92.66%     0.00%  trivial  libtrivial.so  [.] runtime.newstack
+   92.66%     0.00%  trivial  libtrivial.so  [.] runtime.goschedImpl
+   92.66%     0.00%  trivial  libtrivial.so  [.] runtime.schedule
+    7.34%     0.00%  trivial  libtrivial.so  [.] runtime.sigtramp.abi0
+    7.34%     0.00%  trivial  libtrivial.so  [.] runtime.sigtrampgo
+    6.78%     6.78%  trivial  libtrivial.so  [.] runtime.sighandler
+    0.56%     0.56%  trivial  libtrivial.so  [.] runtime.sigfwdgo
     0.00%     0.00%  trivial  libtrivial.so  [.] runtime.cgoSigtramp.abi0

The problem was introduced by 3560cf0afb, which forces every cgo callback to update the system stack bounds. Ultimately the problem traces back to the glibc implementation of pthread_getattr_np, which is called by

pthread_getattr_np(pthread_self(), &attr); // GNU extension

In glibc, pthread_getattr_np ends up directly parsing /proc/self/maps with sscanf, hence the high cost; see https://github.com/lattera/glibc/blob/master/nptl/pthread_getattr_np.c#L81-L124.
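For reference, here is a minimal standalone C program (a sketch of the same pthread call sequence x_cgo_getstackbound performs on Linux, not actual runtime code) that makes the cost measurable in isolation; running it on the main thread under perf should reproduce the sscanf-dominated profile above:

#define _GNU_SOURCE
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    pthread_attr_t attr;
    void *stack_addr = NULL;
    size_t stack_size = 0;

    for (int i = 0; i < 500; ++i) {
        // On glibc, for the initial (main) thread this re-parses
        // /proc/self/maps with sscanf on every call.
        if (pthread_getattr_np(pthread_self(), &attr) != 0) {
            fprintf(stderr, "pthread_getattr_np failed\n");
            return 1;
        }
        pthread_attr_getstack(&attr, &stack_addr, &stack_size);
        pthread_attr_destroy(&attr);
    }
    printf("stack lo=%p size=%zu\n", stack_addr, stack_size);
    return 0;
}

Compile with gcc -pthread and profile with perf record as above.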

It looks like musl's implementation of pthread_getattr_np isn't as expensive as glibc's version though.

For the full, in-context explanation see apache/arrow-adbc#2021. In that context we're seeing a test run go from ~2s to over 5.5s due to this issue, and we're concerned it may end up affecting consumers of the library.

What did you expect to see?

Shifting from go1.22.4 to go1.22.5 shouldn't come with such a significant drop in cgo's FFI performance that, when calling a Go shared-library function from C, the cost is dominated by calls to sscanf unless the underlying Go code is expensive enough to hide it. While the example used to reproduce the problem is deliberately trivial, the actual library's cgo calls are non-trivial, involving multiple network connection initializations, data processing, and more, yet several of them still end up completely dominated by the sscanf call in pthread_getattr_np.

Either we need to find more situations where we can avoid fully updating the stack bounds via x_cgo_getstackbound, or there needs to be a less expensive way to get those bounds under glibc than pthread_getattr_np, given the high cost of parsing /proc/self/maps.

@zeroshade
Author

It also looks like the Rust devs are running into this problem too: https://internals.rust-lang.org/t/who-is-doing-read-proc-self-maps-1024-at-startup/17348

@dmitshur dmitshur added Performance NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Jul 25, 2024
@dmitshur
Contributor

dmitshur commented Jul 25, 2024

CC @prattmic (FYI, the report notes this was introduced with CL 585935), @golang/runtime.

@cherrymui
Member

Thanks for the report. From the Rust post it sounds like the read only happens when pthread_getattr_np is called for the main thread. Could you try calling Go from a different thread, a pthread-created one?

@zeroshade
Author

@cherrymui I only included the Rust post because it was referring to the same scenario of calling pthread_getattr_np. In hindsight it probably wasn't great to include, as the problem here is frequency (which is not the issue in the Rust post).

In the context of the original issue, the shared library is loaded from Python, which then makes calls into it. As this is a client library that we provide, we don't have control over whether consumers call from the main thread or other threads. I can try calling from another thread instead and see if that mitigates the issue a bit, but that wouldn't be a viable solution IMO, as we can't expect all calls to the shared library to be funneled through a separate thread instead of being able to use the main thread. Right?

@cherrymui
Member

Yes, trying a call from a different thread is just to validate the assumption. Calling on the main thread should not be slow either, so if that's the case we'll look into a targeted solution. Thanks.

@zeroshade
Author

zeroshade commented Jul 25, 2024

Looks like you're right!

I updated the code from the reproducer to the following:

#include "libtrivial.h"
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

void* f(void* arg) {
    int a = 0;
    for (int i = 0; i < 500; ++i) {
        a = trivialFunctionCall((size_t)i);
    }

    printf("Done %d\n", a);
    pthread_exit(NULL);
}

int main() {
    pthread_t thr;
    int rc;
    if ((rc = pthread_create(&thr, NULL, &f, NULL)) != 0) {
        fprintf(stderr, "error: pthread_create: rc: %d\n", rc);
        return EXIT_FAILURE;
    }

    pthread_join(thr, NULL);
}

After running this version under perf record -g and checking the report, sscanf no longer shows up in the report at all.

So this just needs to be handled for the main thread: it appears not to cause slowdowns on other threads, only when calling from the main thread.

lidavidm added a commit to lidavidm/arrow-adbc-split-feedstock that referenced this issue Jul 26, 2024
@seankhliao seankhliao changed the title runtime/cgocall: severe performance drop for cgo calls in go1.22.5 runtime: severe performance drop for cgo calls in go1.22.5 Jul 26, 2024
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 26, 2024
@mknyszek
Contributor

In triage, we're thinking we're probably stuck with having to work around this by caching the stack bounds one way or another.

@mknyszek mknyszek added this to the Backlog milestone Jul 31, 2024
@zeroshade
Author

@cherrymui @mknyszek any updates here?

@cherrymui
Member

cherrymui commented Aug 22, 2024

Sorry, I haven't had a chance to implement this. My plan is that if we can get accurate stack bounds, which is the common case with glibc, we just get them once per thread and stick with them. If we cannot get accurate bounds (on systems where pthread_getattr_np is not available), we'll use an estimate every time, but computing the estimate is not costly. I'll probably combine the work for this and #68285.
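A rough standalone sketch of that plan (invented names; the real change lands in the Go runtime, not in user C code) might look like this:

#define _GNU_SOURCE
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uintptr_t lo, hi; // cached stack bounds for this thread
    int accurate;     // nonzero if obtained via pthread_getattr_np
} bounds_t;

static __thread bounds_t cached;

static void update_bounds(uintptr_t sp) {
    // If we already have accurate bounds and SP is still inside them,
    // it must be the same stack: skip the expensive query entirely.
    if (cached.accurate && sp >= cached.lo && sp < cached.hi)
        return;

    pthread_attr_t attr;
    void *addr;
    size_t size;
    if (pthread_getattr_np(pthread_self(), &attr) == 0) {
        if (pthread_attr_getstack(&attr, &addr, &size) == 0) {
            cached.lo = (uintptr_t)addr;
            cached.hi = (uintptr_t)addr + size;
            cached.accurate = 1;
        }
        pthread_attr_destroy(&attr);
        if (cached.accurate)
            return;
    }
    // Fallback: a cheap estimate derived from the current SP
    // (the window sizes here are illustrative only).
    cached.lo = sp - 32 * 1024;
    cached.hi = sp + 4 * 1024;
    cached.accurate = 0;
}

int main(void) {
    int anchor;
    update_bounds((uintptr_t)&anchor); // first call: expensive query
    update_bounds((uintptr_t)&anchor); // second call: served from the cache
    printf("lo=%#lx hi=%#lx accurate=%d\n",
           (unsigned long)cached.lo, (unsigned long)cached.hi, cached.accurate);
    return 0;
}

The key property is that pthread_getattr_np runs at most once per thread as long as the stack doesn't move.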

@zeroshade
Author

Thanks for the update! That's about what I was expecting the fix would end up being.

@Splizard

Splizard commented Oct 5, 2024

A workaround for now is to revert 3560cf0 if using Go 1.23. Note that this issue causes quite severe performance degradation for graphical programs calling c-shared Go functions every frame.

@tomcnolan

tomcnolan commented Oct 7, 2024

The Teradata SQL Driver for Python is also impacted by this performance regression in Go 1.22.5. The driver is a thin Python wrapper around a shared library implemented in Go.
After switching to Go 1.22.5, we have had multiple customers report significant performance problems.
One particular use case we obtained from a customer showed that fetching 10,000 rows of data went from a sub-second operation to taking more than 16 seconds.

@jakebailey
Contributor

jakebailey commented Oct 8, 2024

I've also experienced this slowdown; in a minimal piece of code where I've implemented an identity function and exported it to Node.js via N-API, Go 1.23 can only do about 5,000 ops per second (roughly 204,000 ns per op). Reverting 3560cf0 brings that up to 2.5 million ops per second (~400 ns per op), a 500x difference.

@WangLeonard
Contributor

WangLeonard commented Oct 14, 2024

@cherrymui @mknyszek
sorry to bother you, but are there any updates here? Thank you!

Since this change was introduced in a minor version, there are the performance degradations described in this issue, as well as some potential functional issues:
#68760
#68285

We are currently unable to update to the latest upstream go1.22 release to pick up bug fixes (such as 6fab4b9) and CVE fixes.

@ianlancetaylor
Member

Marking as a release blocker since it seems to have a real impact on some programs.

@cherrymui
Member

cherrymui commented Oct 18, 2024

Sorry, I was OOO and then occupied after I got back. My plan is to get a new version of CL 600296 that covers both this and #68285. I'll work on that soon.

@aclements
Member

@gopherbot, please open a backport to 1.23.

@gopherbot
Contributor

Backport issue(s) opened: #69988 (for 1.23).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

@aclements
Member

Just a thought: 3560cf0 was to deal with an OS thread's stack potentially moving between calls, which is a fairly obscure condition. If we discover the stack has moved, could we just set that g0's stack bounds to [0, ∞) and stop worrying about it? That will lose bounds checks on the g0 stack, which is unfortunate, but again this is uncommon. Is there anything more critical this would mess up?

@dmitshur dmitshur modified the milestones: Backlog, Go1.24 Oct 22, 2024
@gopherbot
Contributor

Change https://go.dev/cl/600296 mentions this issue: runtime: update and restore g0 stack bounds at cgocallback

@cherrymui
Member

cherrymui commented Oct 24, 2024

I updated CL 600296 to cover this, implementing basically #68587 (comment). If the stack moved, we use new bounds of [SP, SP+32KB), which I think is not that different from [0, ∞).

I tried it on a Linux glibc machine and it gets most of the performance back:

name                     old time/op  new time/op  delta
CgoCallbackMainThread-8  96.4µs ± 3%   0.1µs ± 2%  -99.92%  (p=0.000 n=10+9)

Feel free to try the CL, and let me know if there is any issue. Thanks!

@jakebailey
Contributor

jakebailey commented Oct 24, 2024

I'm happy to say that CL 600296 does fix the performance issue I was observing.

(It appears to be slightly slower than a straight revert of CL 584597, but there is of course new code there to fix the original bug, and it's really close, so I can't complain.)

@github-project-automation github-project-automation bot moved this from Todo to Done in Go Compiler / Runtime Oct 30, 2024
@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Oct 30, 2024
@dr2chase dr2chase added the Critical A critical problem that affects the availability or correctness of production systems built using Go label Nov 6, 2024
@gopherbot
Contributor

Change https://go.dev/cl/635775 mentions this issue: [release-branch.go1.23] runtime: update and restore g0 stack bounds at cgocallback

gopherbot pushed a commit that referenced this issue Dec 13, 2024
[release-branch.go1.23] runtime: update and restore g0 stack bounds at cgocallback

Currently, at a cgo callback where there is already a Go frame on
the stack (i.e. C->Go->C->Go), we require that at the inner Go
callback the SP is within the g0's stack bounds set by a previous
callback. This is to prevent the C code from switching stacks while
having a Go frame on the stack, which we don't really support. But
this could also happen when we cannot get accurate stack bounds,
e.g. when pthread_getattr_np is not available. Since the stack
bounds are just estimates based on the current SP, if there are
multiple C->Go callbacks with various stack depth, it is possible
that the SP of a later callback falls out of a previous call's
estimate. This leads to runtime throw in a seemingly reasonable
program.

This CL changes it to save the old g0 stack bounds at cgocallback,
update the bounds, and restore the old bounds at return. So each
callback will get its own stack bounds based on the current SP,
and when it returns, the outer callback has its old stack
bounds restored.

Also, at a cgo callback when there is no Go frame on the stack,
we currently always get new stack bounds. We do this because if
we can only get estimated bounds based on the SP, and the stack
depth varies a lot between two C->Go calls, the previous
estimates may be off and we fall out or nearly fall out of the
previous bounds. But this causes a performance problem: the
pthread API to get accurate stack bounds (pthread_getattr_np) is
very slow when called on the main thread. Getting the stack bounds
every time significantly slows down repeated C->Go calls on the
main thread.

This CL fixes it by "caching" the stack bounds if they are
accurate. I.e. at the second time Go calls into C, if the previous
stack bounds are accurate, and the current SP is in bounds, we can
be sure it is the same stack and we don't need to update the bounds.
This avoids the repeated calls to pthread_getattr_np. If we cannot
get the accurate bounds, we continue to update the stack bounds
based on the SP, and that operation is very cheap.

On a Linux/AMD64 machine with glibc:

name                     old time/op  new time/op  delta
CgoCallbackMainThread-8  96.4µs ± 3%   0.1µs ± 2%  -99.92%  (p=0.000 n=10+9)

Updates #68285.
Updates #68587.
Fixes #69988.

Change-Id: I3422badd5ad8ff63e1a733152d05fb7a44d5d435
Reviewed-on: https://go-review.googlesource.com/c/go/+/600296
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
(cherry picked from commit 76a8409)
Reviewed-on: https://go-review.googlesource.com/c/go/+/635775
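
As a rough illustration of the save/update/restore discipline described in the commit message above (invented names; the real implementation lives in the runtime's cgocallback path, in Go and assembly), a standalone sketch:

#include <stdint.h>
#include <stdio.h>

typedef struct { uintptr_t lo, hi; } bounds_t;

static __thread bounds_t g0_bounds; // stands in for the g0 stack bounds

// Hypothetical estimate: assume a fixed window of stack around SP.
static bounds_t estimate_bounds(uintptr_t sp) {
    bounds_t b = { sp - 32 * 1024, sp + 4 * 1024 };
    return b;
}

// Sketch of a C->Go callback entry/exit: save the outer bounds,
// install bounds for the current stack, restore on return. Nested
// callbacks each get bounds based on their own SP, so a deeper SP
// can never fall outside a stale outer estimate.
static void cgocallback_sketch(uintptr_t sp, int depth) {
    bounds_t saved = g0_bounds;
    g0_bounds = estimate_bounds(sp);

    printf("depth %d: bounds [%#lx, %#lx)\n", depth,
           (unsigned long)g0_bounds.lo, (unsigned long)g0_bounds.hi);
    if (depth < 2) {
        int deeper; // simulate a deeper C->Go callback
        cgocallback_sketch((uintptr_t)&deeper, depth + 1);
    }

    g0_bounds = saved;
}

int main(void) {
    int anchor;
    cgocallback_sketch((uintptr_t)&anchor, 0);
    return 0;
}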
yangjuncode pushed a commit to yangjuncode/go that referenced this issue Dec 27, 2024
[release-branch.go1.23] runtime: update and restore g0 stack bounds at cgocallback
mpminardi pushed a commit to tailscale/go that referenced this issue Jan 28, 2025
[release-branch.go1.23] runtime: update and restore g0 stack bounds at cgocallback