runtime: tune appendCrossover for 386 and arm #4963


Closed
davecheney opened this issue Mar 2, 2013 · 16 comments
@davecheney
Contributor

https://code.google.com/p/go/source/detail?r=31e379ac82794d282601aca59f61261a40f511af

introduced a new per-arch constant for choosing between inline and memmove memory copying. The results on
amd64 are excellent, but the constant is probably too aggressive for 386 and arm.
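The constant in question gates the choice between an inline copy loop and a bulk memmove. A minimal sketch of that crossover logic (the names, the threshold value, and the omission of growth are all illustrative, not the actual runtime source):

```go
package main

import "fmt"

// appendCrossover is the per-arch tuning constant described above; the
// value here is illustrative, not the one in the runtime source.
const appendCrossover = 16

// appendBytes appends src to dst, assuming dst already has enough
// spare capacity (growth is omitted to keep the sketch short).
func appendBytes(dst, src []byte) []byte {
	n := len(dst)
	dst = dst[:n+len(src)]
	if len(src) < appendCrossover {
		// Short appends: an inline byte loop avoids the overhead
		// of calling memmove.
		for i, b := range src {
			dst[n+i] = b
		}
	} else {
		// Longer appends: one bulk copy, which compiles down to a
		// memmove call.
		copy(dst[n:], src)
	}
	return dst
}

func main() {
	fmt.Println(string(appendBytes(make([]byte, 0, 32), []byte("hi")))) // prints "hi"
}
```

Tuning the constant is a trade-off: the inline loop wins for tiny copies, but past some per-architecture length the memmove call is faster, and that break-even point is what the benchmarks below probe.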
@remyoudompheng
Contributor

Comment 1:

Benchmarks on ARM (ODROID-X): results stayed the same or got worse in most cases:
benchmark                     old ns/op    new ns/op    delta
BenchmarkAppend                     394          394   +0.00%
BenchmarkAppend1Byte                 48           46   -5.14%
BenchmarkAppend4Bytes                55           54   -2.87%
BenchmarkAppend8Bytes                55           64  +16.16%
BenchmarkAppend16Bytes               58           85  +46.67%
BenchmarkAppend32Bytes               57           60   +5.57%
BenchmarkAppendStr1Byte              39           34  -13.89%
BenchmarkAppendStr4Bytes             44           42   -4.76%
BenchmarkAppendStr8Bytes             45           52  +14.91%
BenchmarkAppendStr16Bytes            48           74  +51.64%
BenchmarkAppendStr32Bytes            47           47   +1.27%
BenchmarkAppendSpecialCase          276          282   +2.17%
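For reference, the BenchmarkAppendNBytes family measures repeatedly appending a fixed-size slice into a buffer with spare capacity, so only the copy is timed. A simplified stand-in for those benchmarks (the real ones live in the runtime's tests; this version drives them with testing.Benchmark):

```go
package main

import (
	"fmt"
	"testing"
)

// benchmarkAppend measures appending a length-byte source slice into a
// buffer with spare capacity, so no growth or allocation is timed.
func benchmarkAppend(length int) testing.BenchmarkResult {
	src := make([]byte, length)
	return testing.Benchmark(func(b *testing.B) {
		buf := make([]byte, 0, 64)
		for i := 0; i < b.N; i++ {
			buf = buf[:0]
			buf = append(buf, src...)
		}
	})
}

func main() {
	// A few sizes around the crossover; the thread later adds
	// non-power-of-two lengths for exactly this reason.
	for _, n := range []int{1, 8, 15, 16} {
		fmt.Printf("Append%dBytes\t%s\n", n, benchmarkAppend(n))
	}
}
```

Results such as the tables in this thread come from running two builds of such benchmarks and diffing the ns/op columns.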

@davecheney
Contributor Author

Comment 2:

This was my suspicion. I would support a CL that set appendCrossover
to 0 for arm.

@remyoudompheng
Contributor

Comment 3:

The results on 386 are not very interesting. Only AppendStr1Byte improved
significantly:
benchmark                     old ns/op    new ns/op    delta
BenchmarkAppend                     225          226   +0.44%
BenchmarkAppend1Byte                 69           58  -15.42%
BenchmarkAppend4Bytes                69           71   +3.75%
BenchmarkAppend8Bytes                69           69   +0.29%
BenchmarkAppend16Bytes               71           86  +21.12%
BenchmarkAppend32Bytes               69           70   +2.17%
BenchmarkAppendStr1Byte              63           37  -40.50%
BenchmarkAppendStr4Bytes             63           47  -25.75%
BenchmarkAppendStr8Bytes             63           52  -17.74%
BenchmarkAppendStr16Bytes            64           72  +13.73%
BenchmarkAppendStr32Bytes            63           65   +2.98%
BenchmarkAppendSpecialCase          119          120   +0.84%

@rsc
Contributor

rsc commented Mar 12, 2013

Comment 4:

[The time for maybe has passed.]

Labels changed: removed go1.1maybe.

@davecheney
Contributor Author

Comment 5:

Remy, I submitted a CL late in the 1.1 cycle that addresses this issue with respect to arm. Do you
plan to tune the 386 crossover value, or could this issue be closed?

Owner changed to ---.

@remyoudompheng
Contributor

Comment 6:

I think the new memmove and inlining of appendslice make things quite different.

@davecheney
Contributor Author

Comment 7:

Good point. Has that landed yet? If so, I'll put myself down to redo the crossovers for
all three architectures.

Labels changed: added go1.2, removed arch-arm.

Owner changed to @davecheney.

@rsc
Contributor

rsc commented Jul 30, 2013

Comment 8:

Labels changed: added feature.

@davecheney
Contributor Author

Comment 9:

Here is an interesting result on linux/386, comparing a crossover of 16 (old) to 8 (new):
220887(~/go/src/pkg/runtime) % ~/go/misc/benchcmp {old,new}.txt
benchmark                     old ns/op    new ns/op    delta
BenchmarkAppend                     219          218   -0.46%
BenchmarkAppend1Byte                 75           75   +0.00%
BenchmarkAppend4Bytes                91           92   +0.11%
BenchmarkAppend8Bytes               114          114   +0.00%
BenchmarkAppend16Bytes              168           78  -53.51%
BenchmarkAppend32Bytes               88           88   +0.00%
BenchmarkAppendStr1Byte              57           57   +0.35%
BenchmarkAppendStr4Bytes             72           73   +1.10%
BenchmarkAppendStr8Bytes             92          113  +22.16%
BenchmarkAppendStr16Bytes           141           63  -54.82%
BenchmarkAppendStr32Bytes            75           76   +1.32%
BenchmarkAppendSpecialCase          270          272   +0.74%

@remyoudompheng
Contributor

Comment 10:

The new memmove is very fast, but the benchmark lacks tests for length-15 and length-31 slices.
I'd rather make the compiler inline append so that memmove is called directly.
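The inlining Rémy describes means the compiler lowers an append directly to a capacity check plus one bulk copy; semantically it behaves like this sketch (the growth policy is simplified and the helper name is hypothetical):

```go
package main

import "fmt"

// appendInlined shows the shape of an inlined append: check capacity,
// grow if needed, extend the length, then do one memmove-style copy.
func appendInlined(s, src []byte) []byte {
	n := len(s)
	if n+len(src) > cap(s) {
		s = growSlice(s, n+len(src)) // hypothetical growth helper
	}
	s = s[:n+len(src)]
	copy(s[n:], src) // lowered directly to memmove, no per-byte loop
	return s
}

// growSlice reallocates with double the needed capacity; the real
// runtime growth policy is more involved.
func growSlice(s []byte, need int) []byte {
	t := make([]byte, len(s), 2*need)
	copy(t, s)
	return t
}

func main() {
	fmt.Println(string(appendInlined(nil, []byte("go")))) // prints "go"
}
```

With that lowering there is no runtime crossover decision at all: every append of any length goes through the same copy path, which is why a fast, uniform memmove makes the tuning constant less relevant.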

@davecheney
Contributor Author

Comment 11:

The results at the moment are pointing me towards removing the appendCrossover, on
386 at least. I'll add some more benchmarks for non-power-of-2 lengths.

@davecheney
Contributor Author

Comment 12:

tip vs. append crossover logic removed, linux/386:
benchmark                     old ns/op    new ns/op    delta
BenchmarkAppend                     218          218   +0.00%
BenchmarkAppend1Byte                 75           72   -3.44%
BenchmarkAppend4Bytes                91           73  -19.80%
BenchmarkAppend7Bytes               108           74  -31.30%
BenchmarkAppend8Bytes               115           74  -35.48%
BenchmarkAppend15Bytes              162           77  -52.22%
BenchmarkAppend16Bytes              168           77  -53.99%
BenchmarkAppend32Bytes               88           86   -2.16%
BenchmarkAppendStr1Byte              57           57   -0.35%
BenchmarkAppendStr4Bytes             72           72   -0.14%
BenchmarkAppendStr8Bytes             92           92   -0.22%
BenchmarkAppendStr16Bytes           141          141   +0.00%
BenchmarkAppendStr32Bytes            75           75   -0.26%
BenchmarkAppendSpecialCase          270          270   +0.00%

@remyoudompheng
Contributor

Comment 13:

Ah indeed, that memmove is written so that it runs in equal time for power-of-two and
non-power-of-two lengths.

@davecheney
Contributor Author

Comment 14:

I've been playing with this a bit more and have results showing that removing the
append crossover is a net positive for 386 and amd64.
linux/amd64
benchmark                     old ns/op    new ns/op    delta
BenchmarkAppend                     102          104   +1.96%
BenchmarkAppend1Byte                 10           11   +0.92%
BenchmarkAppend4Bytes                15           11  -28.10%
BenchmarkAppend7Bytes                17           12  -32.58%
BenchmarkAppend8Bytes                18           12  -36.17%
BenchmarkAppend15Bytes               24           11  -55.02%
BenchmarkAppend16Bytes               25           11  -56.03%
BenchmarkAppend32Bytes               11           12   +4.31%
BenchmarkAppendStr1Byte               8            9  +13.99%
BenchmarkAppendStr4Bytes             11            9  -17.52%
BenchmarkAppendStr8Bytes             14            9  -35.70%
BenchmarkAppendStr16Bytes            21            9  -55.19%
BenchmarkAppendStr32Bytes            10           10   -5.66%
BenchmarkAppendSpecialCase           49           52   +7.96%
linux/386
benchmark                     old ns/op    new ns/op    delta
BenchmarkAppend                     218          218   +0.00%
BenchmarkAppend1Byte                 75           73   -3.31%
BenchmarkAppend4Bytes                91           73  -19.91%
BenchmarkAppend7Bytes               108           74  -31.20%
BenchmarkAppend8Bytes               115           74  -35.39%
BenchmarkAppend15Bytes              162           77  -52.22%
BenchmarkAppend16Bytes              168           77  -53.93%
BenchmarkAppend32Bytes               88           86   -1.93%
BenchmarkAppendStr1Byte              57           59   +3.14%
BenchmarkAppendStr4Bytes             72           59  -17.40%
BenchmarkAppendStr8Bytes             92           60  -34.77%
BenchmarkAppendStr16Bytes           141           63  -54.89%
BenchmarkAppendStr32Bytes            75           74   -1.72%
BenchmarkAppendSpecialCase          270          288   +6.67%
There is a small regression for AppendStr1Byte and the special-case benchmark.
However, the same is not true for linux/arm, which regresses because it lacks the powerful
string-move operations that Intel has. It is possible that the append crossover should
be reduced from 8 to 4.
alarm(~/go/src/pkg/runtime) % ~/go/misc/benchcmp {old,new}.txt
benchmark                     old ns/op    new ns/op    delta
BenchmarkAppend                     811          801   -1.23%
BenchmarkAppend1Byte                100          103   +3.00%
BenchmarkAppend4Bytes               117          145  +23.93%
BenchmarkAppend8Bytes               139          119  -14.39%
BenchmarkAppend16Bytes              130          124   -4.62%
BenchmarkAppend32Bytes              125          145  +16.00%
BenchmarkAppendStr1Byte              72           83  +16.39%
BenchmarkAppendStr4Bytes             90          125  +38.58%
BenchmarkAppendStr8Bytes            112           98  -11.88%
BenchmarkAppendStr16Bytes           105          103   -1.90%
BenchmarkAppendStr32Bytes           101          124  +22.77%
BenchmarkAppendSpecialCase          624          622   -0.32%
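The per-arch tuning suggested here (drop the crossover on Intel, keep a small one on ARM) could be expressed along these lines; the function name and values are assumptions for illustration, not the actual runtime source:

```go
package main

import (
	"fmt"
	"runtime"
)

// appendCrossoverFor returns a hypothetical inline-copy threshold per
// architecture: 0 means "always memmove", matching the 386/amd64
// results above, while ARM keeps a small inline threshold.
func appendCrossoverFor(arch string) int {
	switch arch {
	case "arm":
		return 4 // reduced from 8, as floated above
	default:
		return 0 // 386/amd64: memmove is always at least as fast
	}
}

func main() {
	fmt.Println(runtime.GOARCH, appendCrossoverFor(runtime.GOARCH))
}
```

Keeping the threshold per-arch lets Intel targets take the uniform memmove path unconditionally while ARM still avoids call overhead on very short copies.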

@davecheney
Contributor Author

Comment 15:

https://golang.org/cl/12440044/

Status changed to Started.

@davecheney
Contributor Author

Comment 16:

This issue was closed by revision 8ce8adb.

Status changed to Fixed.

@rsc rsc added this to the Go1.2 milestone Apr 14, 2015
@rsc rsc removed the go1.2 label Apr 14, 2015
@golang golang locked and limited conversation to collaborators Jun 24, 2016
This issue was closed.