Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: significant number of bounds checks in growslice #60071

Open
egonelbre opened this issue May 9, 2023 · 7 comments
Open

runtime: significant number of bounds checks in growslice #60071

egonelbre opened this issue May 9, 2023 · 7 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@egonelbre
Copy link
Contributor

While inspecting the runtime.growslice code I noticed that it contains a lot of bounds checks. They all come from runtime.roundupsize.

What version of Go are you using (go version)?

$ go version
go version devel go1.21-8d5065ce6e Tue May 9 11:45:59 2023 +0300 windows/amd64

However also happens with go1.20.4.

Does this issue reproduce with the latest release?

Yes and with tip.

What did you do?

Inspecting runtime.growslice and seeing:

$ go tool objdump -s runtime.growslice ./somebinary
...
  0x445f0f              488d052a320500          LEAQ type:*+45376(SB), AX
  0x445f16              488d1dfb3c0800          LEAQ runtime.buildVersion.str+720(SB), BX
  0x445f1d              0f1f00                  NOPL 0(AX)
  0x445f20              e87bd0feff              CALL runtime.gopanic(SB)
                        return uintptr(class_to_size[size_to_class128[divRoundUp(size-smallSizeMax, largeSizeDiv)]])
  0x445f25              b944000000              MOVL $0x44, CX
  0x445f2a              e891860100              CALL runtime.panicIndex(SB)
  0x445f2f              b9f9000000              MOVL $0xf9, CX
  0x445f34              e8a7860100              CALL runtime.panicIndexU(SB)
                        return uintptr(class_to_size[size_to_class8[divRoundUp(size, smallSizeDiv)]])
  0x445f39              b944000000              MOVL $0x44, CX
  0x445f3e              6690                    NOPW
  0x445f40              e87b860100              CALL runtime.panicIndex(SB)
  0x445f45              b981000000              MOVL $0x81, CX
  0x445f4a              e891860100              CALL runtime.panicIndexU(SB)
                        return uintptr(class_to_size[size_to_class128[divRoundUp(size-smallSizeMax, largeSizeDiv)]])
...

What did you expect to see?

I'm not sure how much they affect performance, but triggering that many in roundupsize does seem a bit concerning.


I tried to do the trivial patch with zero-filling the tables to exactly 256 elements https://go-review.googlesource.com/c/go/+/493796, however, I'm not convinced that this is the best way to address this. Either way, I'm willing to take a stab at fixing it.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label May 9, 2023
@gopherbot
Copy link

Change https://go.dev/cl/493796 mentions this issue: runtime: avoid bounds checks in roundup

@cherrymui cherrymui added Performance NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels May 9, 2023
@cherrymui cherrymui added this to the Unplanned milestone May 9, 2023
@cherrymui
Copy link
Member

Do you have some benchmark results for how much the CL helps? Thanks.

@egonelbre
Copy link
Contributor Author

So, benchmarking all the ^Append things in runtime:

goos: windows
goarch: amd64
pkg: runtime
cpu: AMD Ryzen Threadripper 2950X 16-Core Processor
                                 │ append_baseline.txt~ │         append_roundup.txt~         │
                                 │        sec/op        │   sec/op     vs base                │
Append-32                                   12.29n ± 2%   15.56n ± 1%  +26.65% (p=0.000 n=10)
AppendGrowByte-32                           1.869m ± 3%   1.902m ± 4%        ~ (p=0.247 n=10)
AppendGrowString-32                         41.46m ± 5%   38.72m ± 3%   -6.61% (p=0.002 n=10)
AppendSlice/1Bytes-32                       2.197n ± 1%   2.214n ± 1%   +0.80% (p=0.004 n=10)
AppendSlice/4Bytes-32                       1.950n ± 0%   1.966n ± 1%   +0.79% (p=0.001 n=10)
AppendSlice/7Bytes-32                       2.271n ± 0%   2.282n ± 1%   +0.53% (p=0.008 n=10)
AppendSlice/8Bytes-32                       1.991n ± 1%   2.012n ± 1%   +1.05% (p=0.006 n=10)
AppendSlice/15Bytes-32                      2.285n ± 0%   2.293n ± 1%        ~ (p=0.101 n=10)
AppendSlice/16Bytes-32                      1.989n ± 1%   2.002n ± 0%   +0.65% (p=0.000 n=10)
AppendSlice/32Bytes-32                      2.191n ± 0%   2.200n ± 1%   +0.41% (p=0.033 n=10)
AppendSliceLarge/1024Bytes-32               339.0n ± 3%   338.3n ± 1%        ~ (p=0.684 n=10)
AppendSliceLarge/4096Bytes-32               1.275µ ± 3%   1.284µ ± 1%        ~ (p=0.591 n=10)
AppendSliceLarge/16384Bytes-32              4.261µ ± 5%   4.348µ ± 1%   +2.03% (p=0.035 n=10)
AppendSliceLarge/65536Bytes-32              13.44µ ± 3%   13.04µ ± 6%        ~ (p=0.089 n=10)
AppendSliceLarge/262144Bytes-32             50.48µ ± 3%   48.35µ ± 2%   -4.21% (p=0.000 n=10)
AppendSliceLarge/1048576Bytes-32            161.5µ ± 6%   154.5µ ± 3%   -4.30% (p=0.029 n=10)
AppendStr/1Bytes-32                         2.331n ± 0%   2.334n ± 0%        ~ (p=0.423 n=10)
AppendStr/4Bytes-32                         1.998n ± 0%   2.001n ± 1%        ~ (p=0.224 n=10)
AppendStr/8Bytes-32                         2.034n ± 1%   2.045n ± 1%        ~ (p=0.107 n=10)
AppendStr/16Bytes-32                        2.248n ± 1%   2.251n ± 0%        ~ (p=0.323 n=10)
AppendStr/32Bytes-32                        2.353n ± 1%   2.356n ± 0%        ~ (p=0.956 n=10)
AppendSpecialCase-32                        10.22n ± 0%   10.21n ± 0%        ~ (p=0.194 n=10)
AppendInPlace/NoGrow/Byte-32                400.1n ± 1%   403.9n ± 1%   +0.96% (p=0.007 n=10)
AppendInPlace/NoGrow/1Ptr-32                1.279µ ± 2%   1.275µ ± 2%        ~ (p=0.868 n=10)
AppendInPlace/NoGrow/2Ptr-32                2.160µ ± 2%   2.173µ ± 1%        ~ (p=0.288 n=10)
AppendInPlace/NoGrow/3Ptr-32                2.649µ ± 3%   2.666µ ± 3%        ~ (p=0.700 n=10)
AppendInPlace/NoGrow/4Ptr-32                3.985µ ± 3%   4.073µ ± 2%        ~ (p=0.143 n=10)
AppendInPlace/Grow/Byte-32                  228.8n ± 2%   221.9n ± 2%   -2.99% (p=0.001 n=10)
AppendInPlace/Grow/1Ptr-32                  238.3n ± 2%   223.7n ± 1%   -6.13% (p=0.000 n=10)
AppendInPlace/Grow/2Ptr-32                  329.3n ± 2%   323.7n ± 2%   -1.72% (p=0.009 n=10)
AppendInPlace/Grow/3Ptr-32                  397.8n ± 3%   414.2n ± 1%   +4.14% (p=0.000 n=10)
AppendInPlace/Grow/4Ptr-32                  465.8n ± 4%   472.3n ± 3%        ~ (p=0.239 n=10)
geomean                                     176.1n        176.9n        +0.42%

I haven't yet dug into the reasons for things getting slower.

It might also make sense to try and see how it will interact with #49480 and https://go-review.googlesource.com/c/go/+/493836.

@egonelbre
Copy link
Contributor Author

egonelbre commented May 9, 2023

Hmm, maybe it's due to needing to convert the division result into a uint8 and hence ending up emitting a MOVZX:

  msize.go:16		0x4444b1		400fb6ff		MOVZX DI, DI					
  msize.go:16		0x4444b5		4c8d0504c80d00		LEAQ runtime.size_to_class8(SB), R8		
  msize.go:16		0x4444bc		420fb63c07		MOVZX 0(DI)(R8*1), DI				
  msize.go:16		0x4444c1		4c8d0538cc0d00		LEAQ runtime.class_to_size(SB), R8		
  msize.go:16		0x4444c8		410fb73c78		MOVZX 0(R8)(DI*2), DI				

Also, maybe it's possible to use a single lookup instead of a two lookup.

@egonelbre
Copy link
Contributor Author

egonelbre commented May 9, 2023

I guess one option would be to teach BCE about division and calculations, such as for https://go.godbolt.org/z/ccf4Gbs8E. But, I have no clue how complicated that would be.

package runtime

const (
	_PageSize       = 4*1024
	_MaxSmallSize   = 32768
	smallSizeDiv    = 8
	smallSizeMax    = 1024
	largeSizeDiv    = 128
)

func roundupsize(size uintptr) uintptr {
	if size < _MaxSmallSize {
		if size <= smallSizeMax-8 {
			// here we know upper bounds:
			//     size = 1024-8 = 1016
			//     (n + a - 1) / a) = (1016 + 8 - 1) / 8 = 127 // guaranteed to not overflow
			//     smallSizeMax/smallSizeDiv + 1 =  129
			return uintptr(size_to_class_size[divRoundUp(size, smallSizeDiv)])
		} else {
			// similarly here
			return uintptr(size_to_class128_size[divRoundUp(size-smallSizeMax, largeSizeDiv)])
		}
	}
	if size+_PageSize < size {
		return size
	}
	return alignUp(size, _PageSize)
}

// alignUp rounds n up to a multiple of a. a must be a power of 2.
func alignUp(n, a uintptr) uintptr {
	return (n + a - 1) &^ (a - 1)
}

// divRoundUp returns ceil(n / a).
func divRoundUp(n, a uintptr) uintptr {
	// a is generally a power of two. This will get inlined and
	// the compiler will optimize the division.
	return (n + a - 1) / a
}

var size_to_class_size = [smallSizeMax/smallSizeDiv + 1]uint16{}
var size_to_class128_size = [(_MaxSmallSize-smallSizeMax)/largeSizeDiv + 1]uint16{}

@egonelbre
Copy link
Contributor Author

egonelbre commented May 17, 2023

I haven't yet dug into the reasons for things getting slower.

My best guess is that the performance is very sensitive to code alignment on Threadripper 2950X amd64 Windows... for some reason. I'm seeing Append benchmark vary between 8ns, 12ns, 15ns depending on the commit, which don't seem directly impact the code.

Anyways, I submitted the basic change.

@egonelbre
Copy link
Contributor Author

From my experiments and benchmarks, it seems that the benefit is at most ~2ns... and almost imperceptible due to code alignment affecting benchmark results about the same.

So, it seems there's no performance win here. However, supporting division in BCE might be a good idea nevertheless.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Projects
None yet
Development

No branches or pull requests

3 participants