cmd/compile: registerization of large functions worse than in Go 1.2 #8214

Closed
dchest opened this issue Jun 15, 2014 · 4 comments

@dchest
Contributor

dchest commented Jun 15, 2014

http://github.com/dchest/blake2b (and http://github.com/dchest/blake2s) are a lot slower
when compiled with Go 1.3RC2 than with 1.2.

Benchmark for blake2b on Linux amd64 (Atom CPU) comparing 1.2 and 1.3:

benchmark            old ns/op     new ns/op     delta        
BenchmarkWrite1K     15711         67947         +332.48%     
BenchmarkWrite8K     122868        528719        +330.31%     
BenchmarkHash64      3923          15419         +293.04%     
BenchmarkHash128     3551          14642         +312.33%     
BenchmarkHash1K      17192         73758         +329.03%     

benchmark            old MB/s     new MB/s     speedup     
BenchmarkWrite1K     65.17        15.07        0.23x       
BenchmarkWrite8K     66.67        15.49        0.23x       
BenchmarkHash64      16.31        4.15         0.25x       
BenchmarkHash128     36.04        8.74         0.24x       
BenchmarkHash1K      59.56        13.88        0.23x       


The meat is in block.go, which tries to keep the state in registers by using the
variables v0 through v15:

https://github.com/dchest/blake2b/blob/master/block.go#L41

I looked at the listing generated by go build -gcflags "-S", and the main
difference I see is in the temporary variables:

original code:

        v0 += m[9]
        v0 += v5
        v15 ^= v0
        v15 = v15<<(64-16) | v15>>16
        v10 += v15
        v5 ^= v10
        v5 = v5<<(64-63) | v5>>63

go1.2:

1995 (../block.go:146) ADDQ    BX,R11
1996 (../block.go:147) MOVQ    v5+-200(SP),BX
1997 (../block.go:147) ADDQ    BX,R11
1998 (../block.go:148) XORQ    R11,AX
1999 (../block.go:149) ROLQ    $48,AX
2000 (../block.go:150) ADDQ    AX,R13
2001 (../block.go:151) XORQ    R13,v5+-200(SP)
2002 (../block.go:152) MOVQ    v5+-200(SP),BX
2003 (../block.go:152) ROLQ    $1,BX
2004 (../block.go:152) MOVQ    BX,v5+-200(SP)

go1.3:

0x07bd 01981 (../block.go:146)  MOVQ    "".v0+120(SP),BX
0x07c2 01986 (../block.go:146)  MOVQ    BX,"".autotmp_0264+176(SP)
0x07ca 01994 (../block.go:146)  MOVQ    "".m+264(SP),BX
0x07d2 02002 (../block.go:146)  MOVQ    "".autotmp_0264+176(SP),BP
0x07da 02010 (../block.go:146)  ADDQ    BP,BX
0x07dd 02013 (../block.go:147)  MOVQ    "".v5+32(SP),BP
0x07e2 02018 (../block.go:147)  ADDQ    BP,BX
0x07e5 02021 (../block.go:147)  MOVQ    BX,"".v0+120(SP)
0x07ea 02026 (../block.go:148)  MOVQ    "".v0+120(SP),BP
0x07ef 02031 (../block.go:148)  XORQ    BP,AX
0x07f2 02034 (../block.go:150)  MOVQ    "".v10+104(SP),BX
0x07f7 02039 (../block.go:149)  ROLQ    $48,AX
0x07fb 02043 (../block.go:150)  ADDQ    AX,BX
0x07fe 02046 (../block.go:150)  MOVQ    BX,"".v10+104(SP)
0x0803 02051 (../block.go:151)  MOVQ    "".v5+32(SP),BX
0x0808 02056 (../block.go:151)  MOVQ    "".v10+104(SP),BP
0x080d 02061 (../block.go:151)  XORQ    BP,BX
0x0810 02064 (../block.go:152)  ROLQ    $1,BX
0x0813 02067 (../block.go:152)  MOVQ    BX,"".v5+32(SP)

If I combine addition like this:

-               v2 += m[9]
-               v2 += v6
+               v2 += m[9] + v6


The benchmark improves from about 300% slower to roughly 110-160% slower than Go 1.2:

benchmark            old ns/op     new ns/op     delta        
BenchmarkWrite1K     15711         33286         +111.86%     
BenchmarkWrite8K     122868        254970        +107.52%     
BenchmarkHash64      3923          9868          +151.54%     
BenchmarkHash128     3551          9268          +161.00%     
BenchmarkHash1K      17192         38254         +122.51%     

benchmark            old MB/s     new MB/s     speedup     
BenchmarkWrite1K     65.17        30.76        0.47x       
BenchmarkWrite8K     66.67        32.13        0.48x       
BenchmarkHash64      16.31        6.49         0.40x       
BenchmarkHash128     36.04        13.81        0.38x       
BenchmarkHash1K      59.56        26.77        0.45x       



What does 'go version' print?

go version devel +6146799f32ed Tue Jun 10 20:20:49 2014 -0400 linux/amd64
@rsc
Contributor

rsc commented Jun 15, 2014

Comment 1:

This is caused primarily by the introduction of temporaries in any +=, ^=, and so on.
The larger number of variables hits a limit in the compiler that stops registerization
earlier. That's why combining the additions helped. It may also help to write x = x + y
instead of x += y (I'm not 100% sure about that).
This is certainly unfortunate for 1.3, but it only affects very large functions. Even
with more registerization (by increasing NGRN in 6g/opt.h) it's not as fast as 1.2. As
far as I can tell, the compiler was ending up with better register choices in 1.2 almost
by blind luck.
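
The three source forms discussed in this thread can be put side by side. This is a minimal sketch and not code from blake2b: the function names and the [4]uint64 packaging are assumptions made for illustration. It only shows that the forms compute the same values; functions this small will not reach the registerization limit, so they do not reproduce the slowdown.

```go
package main

import "fmt"

// mixSplit uses one compound assignment per step, as block.go does.
func mixSplit(v [4]uint64, m uint64) [4]uint64 {
	a, b, c, d := v[0], v[1], v[2], v[3]
	a += m
	a += b
	d ^= a
	d = d<<(64-16) | d>>16
	c += d
	b ^= c
	b = b<<(64-63) | b>>63
	return [4]uint64{a, b, c, d}
}

// mixCombined folds the two additions into a single statement.
func mixCombined(v [4]uint64, m uint64) [4]uint64 {
	a, b, c, d := v[0], v[1], v[2], v[3]
	a += m + b
	d ^= a
	d = d<<(64-16) | d>>16
	c += d
	b ^= c
	b = b<<(64-63) | b>>63
	return [4]uint64{a, b, c, d}
}

// mixPlain writes x = x + y instead of x += y throughout.
func mixPlain(v [4]uint64, m uint64) [4]uint64 {
	a, b, c, d := v[0], v[1], v[2], v[3]
	a = a + m + b
	d = d ^ a
	d = d<<(64-16) | d>>16
	c = c + d
	b = b ^ c
	b = b<<(64-63) | b>>63
	return [4]uint64{a, b, c, d}
}

func main() {
	// The last element is the maximum uint64, so the additions wrap.
	in := [4]uint64{1, 2, 3, ^uint64(0)}
	want := mixSplit(in, 9)
	fmt.Println(mixCombined(in, 9) == want, mixPlain(in, 9) == want)
}
```

Because unsigned addition wraps modulo 2^64, regrouping the additions cannot change the result, so any of the forms can be tried as a workaround and compared with go build -gcflags "-S".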

Labels changed: added release-go1.4.

Status changed to Accepted.

@rsc
Contributor

rsc commented Sep 15, 2014

Comment 2:

Labels changed: added release-go1.5, removed release-go1.4.

@griesemer
Contributor

Comment 3:

Labels changed: added repo-main.

@bradfitz bradfitz modified the milestone: Go1.5 Dec 16, 2014
@rsc rsc removed accepted labels Apr 14, 2015
@rsc rsc modified the milestones: Unplanned, Go1.5 May 19, 2015
@rsc rsc changed the title from "cmd/gc: registerization of large functions worse than in Go 1.2" to "cmd/compile: registerization of large functions worse than in Go 1.2" Jun 8, 2015
@dchest
Contributor Author

dchest commented Aug 16, 2016

I believe this is fixed in Go 1.7:

$ benchcmp GO163.txt GO17.txt
benchmark              old ns/op     new ns/op     delta       
BenchmarkWrite1K-4     4817          1923          -60.08%     
BenchmarkWrite8K-4     37955         15150         -60.08%     
BenchmarkHash64-4      805           386           -52.05%     
BenchmarkHash128-4     751           340           -54.73%     
BenchmarkHash1K-4      4903          1996          -59.29%     

benchmark              old MB/s     new MB/s     speedup     
BenchmarkWrite1K-4     212.55       532.26       2.50x       
BenchmarkWrite8K-4     215.83       540.70       2.51x       
BenchmarkHash64-4      79.41        165.50       2.08x       
BenchmarkHash128-4     170.39       376.43       2.21x       
BenchmarkHash1K-4      208.81       512.86       2.46x      

(Go 1.6.3 vs Go 1.7 on MacBook Pro 2.6 GHz Intel Core i5)

@dchest dchest closed this as completed Aug 16, 2016
@golang golang locked and limited conversation to collaborators Aug 16, 2017