runtime: tune appendCrossover for 386 and arm #4963


Closed
davecheney opened this issue Mar 2, 2013 · 16 comments
@davecheney
Contributor

https://code.google.com/p/go/source/detail?r=31e379ac82794d282601aca59f61261a40f511af

introduced a new per-arch constant for choosing between inline and memmove memory copying. The results on
amd64 are excellent, but the constant is probably too aggressive for 386 and arm.
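The constant in question gates the choice between an inline copy loop and a bulk memmove. A minimal sketch of that crossover logic (the names, the threshold value, and the omission of growth are all illustrative, not the actual runtime source):

```go
package main

import "fmt"

// appendCrossover is the per-arch tuning constant described above; the
// value here is illustrative, not the one in the runtime source.
const appendCrossover = 16

// appendBytes appends src to dst, assuming dst already has enough
// spare capacity (growth is omitted to keep the sketch short).
func appendBytes(dst, src []byte) []byte {
	n := len(dst)
	dst = dst[:n+len(src)]
	if len(src) < appendCrossover {
		// Short appends: an inline byte loop avoids the overhead
		// of calling memmove.
		for i, b := range src {
			dst[n+i] = b
		}
	} else {
		// Longer appends: one bulk copy, which compiles down to a
		// memmove call.
		copy(dst[n:], src)
	}
	return dst
}

func main() {
	fmt.Println(string(appendBytes(make([]byte, 0, 32), []byte("hi")))) // prints "hi"
}
```

Tuning the constant is a trade-off: the inline loop wins for tiny copies, but past some per-architecture length the memmove call is faster, and that break-even point is what the benchmarks below probe.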
@remyoudompheng
Contributor

Comment 1:

Benchmarks on ARM (ODROID-X): results stayed the same or got worse in most cases:
benchmark                     old ns/op    new ns/op    delta
BenchmarkAppend                     394          394   +0.00%
BenchmarkAppend1Byte                 48           46   -5.14%
BenchmarkAppend4Bytes                55           54   -2.87%
BenchmarkAppend8Bytes                55           64  +16.16%
BenchmarkAppend16Bytes               58           85  +46.67%
BenchmarkAppend32Bytes               57           60   +5.57%
BenchmarkAppendStr1Byte              39           34  -13.89%
BenchmarkAppendStr4Bytes             44           42   -4.76%
BenchmarkAppendStr8Bytes             45           52  +14.91%
BenchmarkAppendStr16Bytes            48           74  +51.64%
BenchmarkAppendStr32Bytes            47           47   +1.27%
BenchmarkAppendSpecialCase          276          282   +2.17%
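For reference, the BenchmarkAppendNBytes family measures repeatedly appending a fixed-size slice into a buffer with spare capacity, so only the copy is timed. A simplified stand-in for those benchmarks (the real ones live in the runtime's tests; this version drives them with testing.Benchmark):

```go
package main

import (
	"fmt"
	"testing"
)

// benchmarkAppend measures appending a length-byte source slice into a
// buffer with spare capacity, so no growth or allocation is timed.
func benchmarkAppend(length int) testing.BenchmarkResult {
	src := make([]byte, length)
	return testing.Benchmark(func(b *testing.B) {
		buf := make([]byte, 0, 64)
		for i := 0; i < b.N; i++ {
			buf = buf[:0]
			buf = append(buf, src...)
		}
	})
}

func main() {
	// A few sizes around the crossover; the thread later adds
	// non-power-of-two lengths for exactly this reason.
	for _, n := range []int{1, 8, 15, 16} {
		fmt.Printf("Append%dBytes\t%s\n", n, benchmarkAppend(n))
	}
}
```

Results such as the tables in this thread come from running two builds of such benchmarks and diffing the ns/op columns.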

@davecheney
Contributor Author

Comment 2:

This was my suspicion. I would support a CL that set appendCrossover
to 0 for arm.

@remyoudompheng
Contributor

Comment 3:

The results on 386 are not very interesting. Only AppendStr1Byte improved
significantly:
benchmark                     old ns/op    new ns/op    delta
BenchmarkAppend                     225          226   +0.44%
BenchmarkAppend1Byte                 69           58  -15.42%
BenchmarkAppend4Bytes                69           71   +3.75%
BenchmarkAppend8Bytes                69           69   +0.29%
BenchmarkAppend16Bytes               71           86  +21.12%
BenchmarkAppend32Bytes               69           70   +2.17%
BenchmarkAppendStr1Byte              63           37  -40.50%
BenchmarkAppendStr4Bytes             63           47  -25.75%
BenchmarkAppendStr8Bytes             63           52  -17.74%
BenchmarkAppendStr16Bytes            64           72  +13.73%
BenchmarkAppendStr32Bytes            63           65   +2.98%
BenchmarkAppendSpecialCase          119          120   +0.84%

@rsc
Contributor

rsc commented Mar 12, 2013

Comment 4:

[The time for maybe has passed.]

Labels changed: removed go1.1maybe.

@davecheney
Contributor Author

Comment 5:

Remy, I submitted a CL late in the 1.1 cycle that addresses this issue with respect to arm. Do you
plan to tune the 386 crossover value, or could this issue be closed?

Owner changed to ---.

@remyoudompheng
Contributor

Comment 6:

I think the new memmove and inlining of appendslice make things quite different.

@davecheney
Contributor Author

Comment 7:

Good point. Has that landed yet? If so, I'll put myself down to redo the crossovers for
all three architectures.

Labels changed: added go1.2, removed arch-arm.

Owner changed to @davecheney.

@rsc
Contributor

rsc commented Jul 30, 2013

Comment 8:

Labels changed: added feature.

@davecheney
Contributor Author

Comment 9:

Here is an interesting result on linux/386, comparing a crossover of 16 (old) to 8 (new):
220887(~/go/src/pkg/runtime) % ~/go/misc/benchcmp {old,new}.txt
benchmark                     old ns/op    new ns/op    delta
BenchmarkAppend                     219          218   -0.46%
BenchmarkAppend1Byte                 75           75   +0.00%
BenchmarkAppend4Bytes                91           92   +0.11%
BenchmarkAppend8Bytes               114          114   +0.00%
BenchmarkAppend16Bytes              168           78  -53.51%
BenchmarkAppend32Bytes               88           88   +0.00%
BenchmarkAppendStr1Byte              57           57   +0.35%
BenchmarkAppendStr4Bytes             72           73   +1.10%
BenchmarkAppendStr8Bytes             92          113  +22.16%
BenchmarkAppendStr16Bytes           141           63  -54.82%
BenchmarkAppendStr32Bytes            75           76   +1.32%
BenchmarkAppendSpecialCase          270          272   +0.74%

@remyoudompheng
Contributor

Comment 10:

The new memmove is very fast, but the benchmark lacks tests for length-15 and length-31 slices.
I'd rather make the compiler inline append so that memmove is called directly.
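The inlining Rémy describes means the compiler lowers an append directly to a capacity check plus one bulk copy; semantically it behaves like this sketch (the growth policy is simplified and the helper name is hypothetical):

```go
package main

import "fmt"

// appendInlined shows the shape of an inlined append: check capacity,
// grow if needed, extend the length, then do one memmove-style copy.
func appendInlined(s, src []byte) []byte {
	n := len(s)
	if n+len(src) > cap(s) {
		s = growSlice(s, n+len(src)) // hypothetical growth helper
	}
	s = s[:n+len(src)]
	copy(s[n:], src) // lowered directly to memmove, no per-byte loop
	return s
}

// growSlice reallocates with double the needed capacity; the real
// runtime growth policy is more involved.
func growSlice(s []byte, need int) []byte {
	t := make([]byte, len(s), 2*need)
	copy(t, s)
	return t
}

func main() {
	fmt.Println(string(appendInlined(nil, []byte("go")))) // prints "go"
}
```

With that lowering there is no runtime crossover decision at all: every append of any length goes through the same copy path, which is why a fast, uniform memmove makes the tuning constant less relevant.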

@davecheney
Contributor Author

Comment 11:

The results at the moment are pointing me towards removing the appendCrossover, on
386 at least. I'll add some more benchmarks for non-power-of-2 lengths.

@davecheney
Contributor Author

Comment 12:

tip vs. append crossover logic removed, linux/386:
benchmark                     old ns/op    new ns/op    delta
BenchmarkAppend                     218          218   +0.00%
BenchmarkAppend1Byte                 75           72   -3.44%
BenchmarkAppend4Bytes                91           73  -19.80%
BenchmarkAppend7Bytes               108           74  -31.30%
BenchmarkAppend8Bytes               115           74  -35.48%
BenchmarkAppend15Bytes              162           77  -52.22%
BenchmarkAppend16Bytes              168           77  -53.99%
BenchmarkAppend32Bytes               88           86   -2.16%
BenchmarkAppendStr1Byte              57           57   -0.35%
BenchmarkAppendStr4Bytes             72           72   -0.14%
BenchmarkAppendStr8Bytes             92           92   -0.22%
BenchmarkAppendStr16Bytes           141          141   +0.00%
BenchmarkAppendStr32Bytes            75           75   -0.26%
BenchmarkAppendSpecialCase          270          270   +0.00%

@remyoudompheng
Contributor

Comment 13:

Ah indeed, that memmove is written so that it runs in equal time for power-of-two and
non-power-of-two lengths.

@davecheney
Contributor Author

Comment 14:

I've been playing with this a bit more and have results showing that removing the
append crossover is a net positive for 386 and amd64.
linux/amd64
benchmark                     old ns/op    new ns/op    delta
BenchmarkAppend                     102          104   +1.96%
BenchmarkAppend1Byte                 10           11   +0.92%
BenchmarkAppend4Bytes                15           11  -28.10%
BenchmarkAppend7Bytes                17           12  -32.58%
BenchmarkAppend8Bytes                18           12  -36.17%
BenchmarkAppend15Bytes               24           11  -55.02%
BenchmarkAppend16Bytes               25           11  -56.03%
BenchmarkAppend32Bytes               11           12   +4.31%
BenchmarkAppendStr1Byte               8            9  +13.99%
BenchmarkAppendStr4Bytes             11            9  -17.52%
BenchmarkAppendStr8Bytes             14            9  -35.70%
BenchmarkAppendStr16Bytes            21            9  -55.19%
BenchmarkAppendStr32Bytes            10           10   -5.66%
BenchmarkAppendSpecialCase           49           52   +7.96%
linux/386
benchmark                     old ns/op    new ns/op    delta
BenchmarkAppend                     218          218   +0.00%
BenchmarkAppend1Byte                 75           73   -3.31%
BenchmarkAppend4Bytes                91           73  -19.91%
BenchmarkAppend7Bytes               108           74  -31.20%
BenchmarkAppend8Bytes               115           74  -35.39%
BenchmarkAppend15Bytes              162           77  -52.22%
BenchmarkAppend16Bytes              168           77  -53.93%
BenchmarkAppend32Bytes               88           86   -1.93%
BenchmarkAppendStr1Byte              57           59   +3.14%
BenchmarkAppendStr4Bytes             72           59  -17.40%
BenchmarkAppendStr8Bytes             92           60  -34.77%
BenchmarkAppendStr16Bytes           141           63  -54.89%
BenchmarkAppendStr32Bytes            75           74   -1.72%
BenchmarkAppendSpecialCase          270          288   +6.67%
There is a small regression for AppendStr1Byte and the special-case benchmark.
However, the same is not true for linux/arm, which regresses because it lacks the powerful
string-move operations that Intel has. It is possible that the append crossover should
be reduced from 8 to 4.
alarm(~/go/src/pkg/runtime) % ~/go/misc/benchcmp {old,new}.txt
benchmark                     old ns/op    new ns/op    delta
BenchmarkAppend                     811          801   -1.23%
BenchmarkAppend1Byte                100          103   +3.00%
BenchmarkAppend4Bytes               117          145  +23.93%
BenchmarkAppend8Bytes               139          119  -14.39%
BenchmarkAppend16Bytes              130          124   -4.62%
BenchmarkAppend32Bytes              125          145  +16.00%
BenchmarkAppendStr1Byte              72           83  +16.39%
BenchmarkAppendStr4Bytes             90          125  +38.58%
BenchmarkAppendStr8Bytes            112           98  -11.88%
BenchmarkAppendStr16Bytes           105          103   -1.90%
BenchmarkAppendStr32Bytes           101          124  +22.77%
BenchmarkAppendSpecialCase          624          622   -0.32%
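The per-arch tuning suggested here (drop the crossover on Intel, keep a small one on ARM) could be expressed along these lines; the function name and values are assumptions for illustration, not the actual runtime source:

```go
package main

import (
	"fmt"
	"runtime"
)

// appendCrossoverFor returns a hypothetical inline-copy threshold per
// architecture: 0 means "always memmove", matching the 386/amd64
// results above, while ARM keeps a small inline threshold.
func appendCrossoverFor(arch string) int {
	switch arch {
	case "arm":
		return 4 // reduced from 8, as floated above
	default:
		return 0 // 386/amd64: memmove is always at least as fast
	}
}

func main() {
	fmt.Println(runtime.GOARCH, appendCrossoverFor(runtime.GOARCH))
}
```

Keeping the threshold per-arch lets Intel targets take the uniform memmove path unconditionally while ARM still avoids call overhead on very short copies.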

@davecheney
Contributor Author

Comment 15:

https://golang.org/cl/12440044/

Status changed to Started.

@davecheney
Contributor Author

Comment 16:

This issue was closed by revision 8ce8adb.

Status changed to Fixed.

@rsc rsc added this to the Go1.2 milestone Apr 14, 2015
@rsc rsc removed the go1.2 label Apr 14, 2015
@golang golang locked and limited conversation to collaborators Jun 24, 2016
This issue was closed.