cmd/cgo: avoid calls to cgoCheckPointer when debug.cgocheck=0 #28454

Open
egonelbre opened this issue Oct 29, 2018 · 15 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
@egonelbre
Contributor

egonelbre commented Oct 29, 2018

With GODEBUG=cgocheck=0, Go still makes calls to cgoCheckPointer, which only bails out once it is inside the function, at this check:

if debug.cgocheck == 0 {

Every such call adds a few ns, but funcs with many arguments can end up accumulating a lot of them.

https://golang.org/cl/142884 changes cgo generated code to:

defer func() func() {
    _cgo0 := x
    _cgo1 := y
    return func() {
        _cgoCheckPointer(_cgo0)
        _cgoCheckPointer(_cgo1)
        C.f(_cgo0, _cgo1)
    }
}()()

I propose that, instead of checking debug.cgocheck == 0 inside cgoCheckPointer, the generated code checks it before calling cgoCheckPointer, so cgo would generate:

defer func() func() {
    _cgo0 := x
    _cgo1 := y
    return func() {
        if debug.cgocheck != 0 {
            _cgoCheckPointer(_cgo0)
            _cgoCheckPointer(_cgo1)
        }
        C.f(_cgo0, _cgo1)
    }
}()()
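
A minimal pure-Go sketch of the difference between the two shapes. It uses neither cgo nor the real runtime: cgocheck, checkPointer, callInner, callOuter, and countCalls are hypothetical stand-ins, and the counter only shows how many check calls each shape makes for a given setting.

```go
package main

import "fmt"

// cgocheck stands in for the runtime's debug.cgocheck setting
// (assumption: the real value comes from GODEBUG at startup).
var cgocheck int

// checkCalls counts how often the stand-in check function is entered.
var checkCalls int

// checkPointer mimics a check that tests the setting only after being called.
func checkPointer(p interface{}) {
	checkCalls++
	if cgocheck == 0 {
		return // early bail-out, but the call itself has already happened
	}
	_ = p // a real check would inspect the pointer here
}

// callInner mirrors the current shape: one check call per argument.
func callInner(x, y *int) {
	checkPointer(x)
	checkPointer(y)
}

// callOuter mirrors the proposed shape: a single branch guards all checks.
func callOuter(x, y *int) {
	if cgocheck != 0 {
		checkPointer(x)
		checkPointer(y)
	}
}

// countCalls reports how many check calls fn makes under the given setting.
func countCalls(setting int, fn func(x, y *int)) int {
	cgocheck, checkCalls = setting, 0
	a, b := 1, 2
	fn(&a, &b)
	return checkCalls
}

func main() {
	fmt.Println("inner, cgocheck=0:", countCalls(0, callInner))
	fmt.Println("outer, cgocheck=0:", countCalls(0, callOuter))
	fmt.Println("outer, cgocheck=1:", countCalls(1, callOuter))
}
```

With the setting at 0, the inner shape still enters the check function once per argument, while the outer shape skips the calls entirely.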
@ianlancetaylor
Contributor

I see the advantage but I'm not excited about encouraging people to use GODEBUG=cgocheck=0.

@ianlancetaylor ianlancetaylor added this to the Unplanned milestone Oct 30, 2018
@ianlancetaylor ianlancetaylor added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Oct 30, 2018
@dominikh
Member

We could file a separate issue for making cgocheck=1 faster, but as it stands, it has a pretty substantial cost. I measured ~50ns per checked argument in the trivial case (struct { **int }), with ~68ns being an unchecked cgo call. In the context of APIs like Vulkan, where every function call contains a pointer to multiple pointers, and where performance is paramount, cgocheck=0 is hugely beneficial.

In these environments, it makes much more sense to run with cgocheck=2 during development, but to use cgocheck=0 in production.

@egonelbre
Contributor Author

egonelbre commented Oct 30, 2018

Just to clarify, I'm not excited about it either and would rather see a check that has close to zero cost.

Other things I thought might be possible are:

  1. aggressive cgoCheckCall optimizer (something that elides all the type walking),
  2. single call to cgoCheckCall (still would have the overhead),
  3. inlinable cgoCheckCall.

However, all of these seem to need significantly more effort than this change, and they are also complementary to the external check.

@egonelbre
Contributor Author

I did a bunch of experiments on https://github.com/egonelbre/exp/blob/master/bench/call/cgo.go#L50.

Experiments:

  1. Baseline is CL142884.
  2. Disabling cgoCheckPointer code generation completely (best-case scenario).
  3. Using if before calling cgoCheckPointer (proposal).
  4. Using cgoCheckPointer1 without variadic args (additional idea).
  5. Combining outer if and cgoCheckPointer1.

PS: note, my machine is somewhat noisy, so take the results with a grain of salt.

Results

Raw results: https://gist.github.com/egonelbre/fe11fc2a3bf1617e1e14dfc562bb3e2a

Baseline vs disabling code generation:

# cgocheck=1
name      old time/op  new time/op  delta
CArgs1-8  89.2ns ± 3%  66.5ns ± 4%  -25.45%  (p=0.000 n=19+20)
CArgs2-8   113ns ± 9%    70ns ± 7%  -38.18%  (p=0.000 n=20+20)
CArgs3-8   138ns ± 9%    73ns ± 3%  -47.01%  (p=0.000 n=18+20)
CArgs4-8   142ns ± 1%    70ns ± 6%  -50.92%  (p=0.000 n=17+20)
CArgs8-8   224ns ± 2%    78ns ± 1%  -65.32%  (p=0.000 n=18+16)

# cgocheck=0
name      old time/op  new time/op  delta
CArgs1-8  71.5ns ± 7%  66.8ns ± 4%   -6.58%  (p=0.000 n=19+20)
CArgs2-8  76.9ns ± 1%  68.1ns ± 1%  -11.35%  (p=0.000 n=19+17)
CArgs3-8  76.7ns ± 2%  74.7ns ± 3%   -2.62%  (p=0.000 n=20+17)
CArgs4-8  81.3ns ± 6%  67.9ns ± 2%  -16.56%  (p=0.000 n=20+20)
CArgs8-8  95.7ns ± 2%  77.6ns ± 2%  -18.99%  (p=0.000 n=20+20)

Baseline vs outer if:

# cgocheck=1
name      old time/op  new time/op  delta
CArgs1-8  89.2ns ± 3%  88.8ns ± 1%    ~     (p=0.395 n=19+15)
CArgs2-8   113ns ± 9%   105ns ± 3%  -7.46%  (p=0.000 n=20+18)
CArgs3-8   138ns ± 9%   130ns ± 4%  -6.35%  (p=0.000 n=18+20)
CArgs4-8   142ns ± 1%   151ns ± 2%  +6.38%  (p=0.000 n=17+20)
CArgs8-8   224ns ± 2%   238ns ± 1%  +6.33%  (p=0.000 n=18+20)

# cgocheck=0
name      old time/op  new time/op  delta
CArgs1-8  71.5ns ± 7%  69.9ns ± 2%   -2.28%  (p=0.000 n=19+18)
CArgs2-8  76.9ns ± 1%  70.8ns ± 3%   -7.85%  (p=0.000 n=19+17)
CArgs3-8  76.7ns ± 2%  72.0ns ± 2%   -6.15%  (p=0.000 n=20+16)
CArgs4-8  81.3ns ± 6%  74.2ns ± 2%   -8.80%  (p=0.000 n=20+20)
CArgs8-8  95.7ns ± 2%  80.2ns ± 3%  -16.26%  (p=0.000 n=20+18)

Baseline vs cgoCheckPointer1(interface{}):

# cgocheck=1
name      old time/op  new time/op  delta
CArgs1-8  89.2ns ± 3%  88.9ns ± 2%     ~     (p=0.641 n=19+20)
CArgs2-8   113ns ± 9%    99ns ± 2%  -12.38%  (p=0.000 n=20+18)
CArgs3-8   138ns ± 9%   120ns ± 5%  -13.65%  (p=0.000 n=18+20)
CArgs4-8   142ns ± 1%   140ns ± 2%   -1.46%  (p=0.000 n=17+19)
CArgs8-8   224ns ± 2%   219ns ± 2%   -1.98%  (p=0.000 n=18+20)

# cgocheck=0
name      old time/op  new time/op  delta
CArgs1-8  71.5ns ± 7%  74.7ns ± 3%  +4.44%  (p=0.000 n=19+19)
CArgs2-8  76.9ns ± 1%  74.5ns ± 3%  -3.07%  (p=0.000 n=19+20)
CArgs3-8  76.7ns ± 2%  75.7ns ± 2%  -1.34%  (p=0.000 n=20+20)
CArgs4-8  81.3ns ± 6%  80.7ns ± 1%    ~     (p=0.649 n=20+18)
CArgs8-8  95.7ns ± 2%  92.1ns ± 1%  -3.85%  (p=0.000 n=20+16)

Baseline vs combined:

# cgocheck=1
name      old time/op  new time/op  delta
CArgs1-8  89.2ns ± 3%  85.2ns ± 5%   -4.50%  (p=0.000 n=19+20)
CArgs2-8   113ns ± 9%   102ns ± 4%  -10.01%  (p=0.000 n=20+20)
CArgs3-8   138ns ± 9%   126ns ± 8%   -8.66%  (p=0.000 n=18+20)
CArgs4-8   142ns ± 1%   135ns ± 1%   -4.40%  (p=0.000 n=17+19)
CArgs8-8   224ns ± 2%   209ns ± 5%   -6.81%  (p=0.000 n=18+20)

# cgocheck=0
name      old time/op  new time/op  delta
CArgs1-8  71.5ns ± 7%  72.4ns ± 3%   +1.30%  (p=0.019 n=19+20)
CArgs2-8  76.9ns ± 1%  71.5ns ± 8%   -7.00%  (p=0.000 n=19+20)
CArgs3-8  76.7ns ± 2%  69.9ns ± 1%   -8.97%  (p=0.000 n=20+19)
CArgs4-8  81.3ns ± 6%  74.2ns ± 0%   -8.78%  (p=0.000 n=20+14)
CArgs8-8  95.7ns ± 2%  79.8ns ± 1%  -16.64%  (p=0.000 n=20+19)
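
One rough way to read these tables is as a cost per additional argument, taken as the slope between the CArgs1 and CArgs8 rows. The sketch below does that arithmetic for a few columns. The helper name perArg and the assumption that the cost grows linearly with the argument count are mine, not part of the measurements, and the noise caveat above applies.

```go
package main

import "fmt"

// perArg estimates the cost in ns of each additional argument from the
// CArgs1 and CArgs8 timings, assuming the cost grows linearly (an assumption).
func perArg(args1, args8 float64) float64 {
	return (args8 - args1) / 7 // CArgs8 has seven more arguments than CArgs1
}

func main() {
	// Timings in ns/op, copied from the tables above.
	fmt.Printf("baseline, cgocheck=1:      %.1f ns/arg\n", perArg(89.2, 224))
	fmt.Printf("baseline, cgocheck=0:      %.1f ns/arg\n", perArg(71.5, 95.7))
	fmt.Printf("outer if, cgocheck=0:      %.1f ns/arg\n", perArg(69.9, 80.2))
	fmt.Printf("no check code, cgocheck=1: %.1f ns/arg\n", perArg(66.5, 78))
}
```

Read this way, the baseline costs roughly 19 ns per extra argument with cgocheck=1 and under 4 ns with cgocheck=0, while the outer if and the no-check build both land below 2 ns.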

@gopherbot

Change https://golang.org/cl/198081 mentions this issue: cmd/cgo: optimize cgoCheckPointer call

@gopherbot

Change https://golang.org/cl/226342 mentions this issue: [WIP] cmd/cgo,runtime: inline cgocheck==0 check

@egonelbre
Contributor Author

@ianlancetaylor I finally thought it would be nice to get this closed one way or another :).

I made a proof-of-concept change in https://go-review.googlesource.com/c/go/+/226342. I guess the main question is whether it should be done at all.

The performance improvements are significant:

GODEBUG=cgocheck=0

name                             old time/op  new time/op  delta
CgoCall/add-int-32               48.2ns ± 7%  45.3ns ± 0%   -5.96%  (p=0.016 n=5+4)
CgoCall/one-pointer-32           48.8ns ± 1%  48.4ns ± 2%     ~     (p=0.127 n=5+5)
CgoCall/eight-pointers-32        69.3ns ± 1%  50.0ns ± 0%  -27.87%  (p=0.008 n=5+5)
CgoCall/eight-pointers-nil-32    67.8ns ± 1%  50.6ns ± 1%  -25.41%  (p=0.008 n=5+5)
CgoCall/eight-pointers-array-32  1.44µs ± 1%  0.05µs ± 2%  -96.52%  (p=0.008 n=5+5)
CgoCall/eight-pointers-slice-32   351ns ± 2%    49ns ± 1%  -86.00%  (p=0.008 n=5+5)

Should I try to finish the CL properly and add missing parts from gcc-go or is it better to drop this issue altogether?

@ianlancetaylor
Contributor

Sorry, but I'm still not at all persuaded that we should try to make GODEBUG=cgocheck=0 faster. We should absolutely try to make GODEBUG=cgocheck=1 faster, since that is the default. But cgocheck=0 was only intended to support old code that was written before the pointer checking rules were written down. (The argument of using cgocheck=0 only during production is always tempting, but it's similar to the argument that you should use a life jacket while practicing in a swimming pool but not while swimming in the ocean.)

That said I think the CL would be simpler if you introduce a new function

func cgoMuchCheckPointers() bool {
    return debug.cgocheck != 0
}

and call that. The compiler should inline that into the calling code. (If it doesn't, let's find out why not.)
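
A minimal sketch of how such a helper could guard the checks, written as ordinary Go rather than as runtime or cgo-generated code. The names debug, mustCheckPointers, checkPointer, and call are hypothetical stand-ins, and the addition stands in for the C call.

```go
package main

import "fmt"

// debug stands in for the runtime's internal debug settings (assumption).
var debug struct{ cgocheck int32 }

// mustCheckPointers is a hypothetical small helper in the spirit of the
// suggestion above; it is a leaf function that only reads one variable.
func mustCheckPointers() bool {
	return debug.cgocheck != 0
}

// checked counts pointers that went through the stand-in check.
var checked int

// checkPointer stands in for the real pointer check.
func checkPointer(p interface{}) { checked++ }

// call mirrors what generated code could look like with the helper as guard.
func call(x, y *int) int {
	if mustCheckPointers() {
		checkPointer(x)
		checkPointer(y)
	}
	return *x + *y // stands in for the C call
}

func main() {
	a, b := 1, 2
	debug.cgocheck = 0
	fmt.Println(call(&a, &b), checked) // checks skipped
	debug.cgocheck = 1
	fmt.Println(call(&a, &b), checked) // both pointers checked
}
```

Building with go build -gcflags=-m prints the compiler's inlining decisions, which is one way to confirm whether a helper like this gets inlined into its caller.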

@egonelbre
Contributor Author

egonelbre commented Mar 30, 2020

Fair enough, I won't bother with making cgocheck=0 faster. I understand the reasoning.

But, while doing this, I also realized why the array/slice cases are still so slow.

Here's a reproducer, where a single call takes 15ms:

package main

/*
typedef struct Example {
  int value;
  int *other;
} Example;

int getValue(Example *example) {
	return example->value;
}
*/
import "C"
import (
	"fmt"
	"time"
)

func main() {
	const N = 100

	var data [1 << 20]C.Example
	start := time.Now()
	for i := 0; i < N; i++ {
		_ = C.getValue(&data[0])
	}
	finish := time.Now()
	fmt.Println(finish.Sub(start) / N)
}

Cgo ends up generating this code:

	var data [1 << 20] /*line :22:20*/ _Ctype_struct_Example /*line :22:29*/
	start := time.Now()
	for i := 0; i < N; i++ {
		_ = func() _Ctype_int {
			_cgoIndex0 := & /*line :25:19*/ data
			_cgo0 := /*line :25:18*/ &(*_cgoIndex0)[0]
			_cgoCheckPointer(_cgo0, *_cgoIndex0) // <--- this makes a copy of the array
			return _Cfunc_getValue(_cgo0)
		}()
	}
	finish := time.Now()

I'll look into fixing this and then use it as a resolution for this issue.
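
The copy can be shown in plain Go, without cgo: converting an array value to interface{} stores a copy of the whole array, while converting a pointer to it stores only the pointer. In this sketch the example and table types and both helper functions are hypothetical stand-ins for the reproducer's types, with a smaller array.

```go
package main

import "fmt"

// example is a Go stand-in for the C struct in the reproducer (assumption).
type example struct {
	value int
	other *int
}

// table is a smaller array than in the reproducer, to keep the sketch quick.
type table [1 << 10]example

// boxedIsCopy reports whether boxing the array value produced an independent copy.
func boxedIsCopy() bool {
	var data table
	p := &data
	boxed := interface{}(*p) // converting an array value to interface{} copies it
	data[0].value = 42       // mutate the original after boxing
	return boxed.(table)[0].value == 0
}

// boxedPointerAliases reports whether boxing a pointer still refers to the original.
func boxedPointerAliases() bool {
	var data table
	boxed := interface{}(&data) // only the pointer is stored
	data[0].value = 42
	return (*boxed.(*table))[0].value == 42
}

func main() {
	fmt.Println("array value boxed as a copy:", boxedIsCopy())
	fmt.Println("pointer boxed as an alias:  ", boxedPointerAliases())
}
```

The boxed array does not see the later write, so the conversion must have copied every element. For the reproducer's 1<<20 element array, that copy happens on every call.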

@gopherbot

Change https://golang.org/cl/226517 mentions this issue: src/cmd/cgo,src/runtime: avoid array clone during cgo call [WIP]

@OneOfOne
Contributor

OneOfOne commented May 8, 2021

Have there been any updates on this?

@egonelbre
Contributor Author

No updates so far, I got stuck trying to make it work with gccgo and haven't taken the time to fix it.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 13, 2022
@Zyl9393

Zyl9393 commented Feb 19, 2023

I use cgocheck=0 because cgocheck creates false positives, and the performance penalty is now pressing me hard to change languages, which I would love to avoid. CGO-calls also always call GetLastError() on Windows, arbitrarily doubling the cost of every call, even when you did not call a WINAPI function. I'm not sure how to best address these issues, but there are people here (mostly graphics programmers lol) who would really benefit from any improvements made.

@ianlancetaylor
Contributor

@Zyl9393 As far as I can tell your "create false positives" link is actually describing true positives: cases where cgocheck is correctly warning. In Go an invalid pointer must never have pointer type. Doing so will break the garbage collector.

@Zyl9393

Zyl9393 commented Feb 21, 2023

@ianlancetaylor You are technically 100% correct. The problem is that for the purpose of getting work done today (and, for that matter, 6 years ago), this 100% gets in the way and needs to be disabled. I am of the opinion that half of the problems people are experiencing with Go can be fixed by improving the quality of the projects using it; we desperately need more of that to be happening, alas it does not appear to be the kind of work a community of FOSS hobbyists can accomplish. Just a thought. /digress
