cmd/compile: registerization of large functions worse than in Go 1.2 #8214

dchest · 2014-06-15T21:21:35Z

http://github.com/dchest/blake2b (and http://github.com/dchest/blake2s) are a lot slower
when compiled with Go 1.3RC2 than with 1.2.

Benchmark for blake2b on Linux amd64 (Atom CPU) comparing 1.2 and 1.3:

benchmark            old ns/op     new ns/op     delta        
BenchmarkWrite1K     15711         67947         +332.48%     
BenchmarkWrite8K     122868        528719        +330.31%     
BenchmarkHash64      3923          15419         +293.04%     
BenchmarkHash128     3551          14642         +312.33%     
BenchmarkHash1K      17192         73758         +329.03%     

benchmark            old MB/s     new MB/s     speedup     
BenchmarkWrite1K     65.17        15.07        0.23x       
BenchmarkWrite8K     66.67        15.49        0.23x       
BenchmarkHash64      16.31        4.15         0.25x       
BenchmarkHash128     36.04        8.74         0.24x       
BenchmarkHash1K      59.56        13.88        0.23x       


The hot code is in block.go, which keeps the state in local variables v0 through v15
so that the compiler can place them in registers:

https://github.com/dchest/blake2b/blob/master/block.go#L41

I compared the listings generated by go build -gcflags "-S". The main difference is
that the 1.3 output introduces temporary variables (autotmp) and extra loads and stores
to the stack:

original code:

        v0 += m[9]
        v0 += v5
        v15 ^= v0
        v15 = v15<<(64-16) | v15>>16
        v10 += v15
        v5 ^= v10
        v5 = v5<<(64-63) | v5>>63

go1.2:

1995 (../block.go:146) ADDQ    BX,R11
1996 (../block.go:147) MOVQ    v5+-200(SP),BX
1997 (../block.go:147) ADDQ    BX,R11
1998 (../block.go:148) XORQ    R11,AX
1999 (../block.go:149) ROLQ    $48,AX
2000 (../block.go:150) ADDQ    AX,R13
2001 (../block.go:151) XORQ    R13,v5+-200(SP)
2002 (../block.go:152) MOVQ    v5+-200(SP),BX
2003 (../block.go:152) ROLQ    $1,BX
2004 (../block.go:152) MOVQ    BX,v5+-200(SP)

go1.3:

0x07bd 01981 (../block.go:146)  MOVQ    "".v0+120(SP),BX
0x07c2 01986 (../block.go:146)  MOVQ    BX,"".autotmp_0264+176(SP)
0x07ca 01994 (../block.go:146)  MOVQ    "".m+264(SP),BX
0x07d2 02002 (../block.go:146)  MOVQ    "".autotmp_0264+176(SP),BP
0x07da 02010 (../block.go:146)  ADDQ    BP,BX
0x07dd 02013 (../block.go:147)  MOVQ    "".v5+32(SP),BP
0x07e2 02018 (../block.go:147)  ADDQ    BP,BX
0x07e5 02021 (../block.go:147)  MOVQ    BX,"".v0+120(SP)
0x07ea 02026 (../block.go:148)  MOVQ    "".v0+120(SP),BP
0x07ef 02031 (../block.go:148)  XORQ    BP,AX
0x07f2 02034 (../block.go:150)  MOVQ    "".v10+104(SP),BX
0x07f7 02039 (../block.go:149)  ROLQ    $48,AX
0x07fb 02043 (../block.go:150)  ADDQ    AX,BX
0x07fe 02046 (../block.go:150)  MOVQ    BX,"".v10+104(SP)
0x0803 02051 (../block.go:151)  MOVQ    "".v5+32(SP),BX
0x0808 02056 (../block.go:151)  MOVQ    "".v10+104(SP),BP
0x080d 02061 (../block.go:151)  XORQ    BP,BX
0x0810 02064 (../block.go:152)  ROLQ    $1,BX
0x0813 02067 (../block.go:152)  MOVQ    BX,"".v5+32(SP)

If I combine the two additions like this:

-               v2 += m[9]
-               v2 += v6
+               v2 += m[9] + v6


The benchmark improves from being ~300% slower to roughly 110-160% slower than Go 1.2:

benchmark            old ns/op     new ns/op     delta        
BenchmarkWrite1K     15711         33286         +111.86%     
BenchmarkWrite8K     122868        254970        +107.52%     
BenchmarkHash64      3923          9868          +151.54%     
BenchmarkHash128     3551          9268          +161.00%     
BenchmarkHash1K      17192         38254         +122.51%     

benchmark            old MB/s     new MB/s     speedup     
BenchmarkWrite1K     65.17        30.76        0.47x       
BenchmarkWrite8K     66.67        32.13        0.48x       
BenchmarkHash64      16.31        6.49         0.40x       
BenchmarkHash128     36.04        13.81        0.38x       
BenchmarkHash1K      59.56        26.77        0.45x       
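
Combining the additions only changes how the statements are grouped, not the result, because uint64 addition wraps modulo 2^64 and is associative. A minimal sketch that checks this for the snippet shown earlier follows. The function names are hypothetical, and a single message word m stands in for m[9].

```go
package main

import "fmt"

// mixSplit follows the original snippet: two separate += statements.
func mixSplit(v0, v5, v10, v15, m uint64) [4]uint64 {
	v0 += m
	v0 += v5
	v15 ^= v0
	v15 = v15<<(64-16) | v15>>16 // rotate right by 16
	v10 += v15
	v5 ^= v10
	v5 = v5<<(64-63) | v5>>63 // rotate right by 63
	return [4]uint64{v0, v5, v10, v15}
}

// mixCombined folds the two additions into one statement.
func mixCombined(v0, v5, v10, v15, m uint64) [4]uint64 {
	v0 += m + v5
	v15 ^= v0
	v15 = v15<<(64-16) | v15>>16
	v10 += v15
	v5 ^= v10
	v5 = v5<<(64-63) | v5>>63
	return [4]uint64{v0, v5, v10, v15}
}

func main() {
	const big = ^uint64(0) // large inputs exercise wraparound
	fmt.Println(mixSplit(big, big-1, 3, 4, 5) == mixCombined(big, big-1, 3, 4, 5)) // true
}
```

Functions this small are unlikely to reproduce the slowdown, since the problem reported here shows up in very large functions. Comparing the output of go build -gcflags "-S" on the real block.go remains the way to see the difference.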



What does 'go version' print?

go version devel +6146799f32ed Tue Jun 10 20:20:49 2014 -0400 linux/amd64

rsc · 2014-06-15T22:20:26Z

Comment 1:

This is caused primarily by the introduction of temporaries for every +=, ^=, and similar
compound assignment. The larger number of variables hits a limit in the compiler that stops
registerization earlier, which is why combining the additions helped. It may also help to
write x = x + y instead of x += y (I'm not 100% sure about that).
This is certainly unfortunate for 1.3, but it only affects very large functions. Even
with more registerization (by increasing NGRN in 6g/opt.h) it's not as fast as 1.2. As
far as I can tell, the compiler was ending up with better register choices in 1.2 almost
by blind luck.
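
For anyone who wants to try the x = x + y suggestion, here is a minimal sketch of what the rewrite looks like on two statements from the report. The function names are hypothetical. The two forms compute the same values; whether the explicit form actually avoids the extra temporaries in 1.3 is untested here, as the comment itself notes.

```go
package main

import "fmt"

// mixCompound uses compound assignment operators, as in the original code.
func mixCompound(v0, v5, v15 uint64) (uint64, uint64) {
	v0 += v5
	v15 ^= v0
	return v0, v15
}

// mixExplicit spells out each update as x = x op y.
func mixExplicit(v0, v5, v15 uint64) (uint64, uint64) {
	v0 = v0 + v5
	v15 = v15 ^ v0
	return v0, v15
}

func main() {
	a0, a15 := mixCompound(1, 2, 4)
	b0, b15 := mixExplicit(1, 2, 4)
	fmt.Println(a0 == b0 && a15 == b15) // true
}
```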

Labels changed: added release-go1.4.

Status changed to Accepted.

rsc · 2014-09-15T15:41:57Z

Comment 2:

Labels changed: added release-go1.5, removed release-go1.4.

griesemer · 2014-10-01T21:00:24Z

Comment 3:

Labels changed: added repo-main.

dchest · 2016-08-16T08:09:17Z

I believe this is fixed in Go 1.7:

$ benchcmp GO163.txt GO17.txt
benchmark              old ns/op     new ns/op     delta       
BenchmarkWrite1K-4     4817          1923          -60.08%     
BenchmarkWrite8K-4     37955         15150         -60.08%     
BenchmarkHash64-4      805           386           -52.05%     
BenchmarkHash128-4     751           340           -54.73%     
BenchmarkHash1K-4      4903          1996          -59.29%     

benchmark              old MB/s     new MB/s     speedup     
BenchmarkWrite1K-4     212.55       532.26       2.50x       
BenchmarkWrite8K-4     215.83       540.70       2.51x       
BenchmarkHash64-4      79.41        165.50       2.08x       
BenchmarkHash128-4     170.39       376.43       2.21x       
BenchmarkHash1K-4      208.81       512.86       2.46x

(Go 1.6.3 vs Go 1.7 on MacBook Pro 2.6 GHz Intel Core i5)

dchest added accepted labels Oct 1, 2014

bradfitz modified the milestone: Go1.5 Dec 16, 2014

bradfitz removed the release-go1.5 label Dec 16, 2014

rsc removed accepted labels Apr 14, 2015

rsc modified the milestones: Unplanned, Go1.5 May 19, 2015

rsc changed the title ~~cmd/gc: registerization of large functions worse than in Go 1.2~~ cmd/compile: registerization of large functions worse than in Go 1.2 Jun 8, 2015

dchest closed this as completed Aug 16, 2016

golang locked and limited conversation to collaborators Aug 16, 2017

gopherbot added the FrozenDueToAge label Aug 16, 2017
