
cmd/compile: optimizations to generated code on amd64 #1766

Closed

gopherbot opened this issue May 2, 2011 · 15 comments

@gopherbot

by umbricola:

What steps will reproduce the problem?
1. tar xf umbricola-go-defect-report-addloop1.tar
2. ./umbricola-go-defect-report-addloop1/addloop1_run.sh

What is the expected output?
Timing statistics similar to those below but with the modular and monolithic Go versions
having the same speed.

What do you see instead?
C version:
-1243309312
        0.50 real         0.50 user         0.00 sys
Modular Go version:
-1243309312
        0.86 real         0.85 user         0.00 sys
Monolithic Go version:
-1243309312
        3.02 real         3.01 user         0.00 sys


Which compiler are you using (5g, 6g, 8g, gccgo)?
6g/6l

Which operating system are you using?
Mac OS X 10.5.8

Which revision are you using?  (hg identify)
9266c53a8fc0 tip

Please provide any additional information below.
The shell script compiles and measures the running time of three programs that each
compute and print the sum of the integers from
0 to 10^9 - 1 reduced modulo 2^32 (i.e., they accumulate the sum into an int and it
wraps around a lot).  The measurements I quoted above were run on my Mac laptop with a 2.0
GHz Core 2 (65 nm process) CPU.
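
For reference, a minimal Go sketch of the computation all three programs time (the accumulator type and bound are written out here for illustration; this is not the attached source):

    package main

    import "fmt"

    func main() {
        // Accumulate into a 32-bit value so the sum wraps modulo 2^32,
        // matching the description above.
        var t int32
        for i := int32(0); i < 1000000000; i++ {
            t += i
        }
        fmt.Println(t)
    }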

The first program is a 386 binary compiled from C99 source addloop1_c.c with gcc -O3. 
It runs the 10^9 loop iterations in 0.50 s, i.e., one iteration per clock, which is the
best result I would realistically expect from a compiler.  The generated code is quite
natural: an add, an inc, and a cmp/jne pair eligible for macrofusion, which fit nicely
on execution ports 0, 1, and 5.

The second program is an x86-64 binary compiled with 6g/6l from Go sources
addloop1_sum.go and addloop1_modular.go.  The former contains the arithmetic loop as an
exported function that the main function in the latter file calls out to.  This version
deserves to be a little slower than the first program just because it's an x86-64 binary
running on a chip that can't do macrofusion in 64-bit mode.  Theoretically I'd expect
3/4 iterations per cycle, thus about 0.50 s * 4/3 = 0.67 s running time.  The observed
0.86 s suggests that Go is generating worse code than gcc here, but the speed is still
reasonable.  Unfortunately, I don't know a way to disassemble the code in a 6l binary,
so I don't know what the loop looks like.
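
For reference, the two-file split described here has roughly the following shape (a sketch only: the exported function name, package layout, and import path are assumptions, not the attached source):

    // addloop1_sum.go (assumed shape)
    package sum

    // Sum returns the 32-bit-wrapping sum of the integers 0 .. n-1.
    func Sum(n int32) int32 {
        var t int32
        for i := int32(0); i < n; i++ {
            t += i
        }
        return t
    }

    // addloop1_modular.go (assumed shape)
    package main

    import (
        "fmt"
        "sum" // placeholder import path
    )

    func main() {
        fmt.Println(sum.Sum(1000000000))
    }

Because the loop sits in a function whose parameter and locals never have their addresses taken, nothing forces them out of registers, which is consistent with the register-only loop shown in the disassembly below.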

The third program is another x86-64 binary compiled with 6g/6l.  This time there is only
one Go source file addloop1_monolithic.go, in which the main() function contains the
arithmetic loop.  Because we didn't change the loop, just inlined it in the main
function, I'd expect to see the same 0.86 s timing again.  Instead I got 3.02 s --- much
worse!  I can't guess how that happened, but I'm sure it's not the behavior you want.

Attachments:

  1. umbricola-go-defect-report-addloop1.tar (10240 bytes)
@rsc
Contributor

rsc commented May 2, 2011

Comment 1:

You can see the assembly by using a standard tool like gdb or objdump
but also by running 6l with the -a flag.
On my Linux system, I get
addloop1_c (0.24s)
.L10:
    addl    %ecx, %edx
    addl    $1, %ecx
    cmpl    %ecx, %esi
    jg  .L10
addloop1_modular (0.95s)
421734  01c1                 | (6)  ADDL    AX,CX
421736  ffc0                 | (5)  INCL    ,AX
421738  39d0                 | (5)  CMPL    AX,DX
42173a  7cf8                 | (5)  JLT ,421734
It looks like the only difference is the incl vs addl $1. 
Should that really cause such a huge time difference?
addloop1_monolithic (2.83s)
400d36  0144244c             | (12) ADDL    AX,main.t+76(SP)
400d3a  ffc0                 | (11) INCL    ,AX
400d3c  418b18               | (11) MOVL    (R8),BX
400d3f  39c3                 | (11) CMPL    BX,AX
400d41  7ff3                 | (11) JGT ,400d36
Here the difference is the use of (R8) as the limit.
Because the call to fmt.Sscan took the address of limit,
the code is being overly conservative about making sure
it sees changes to limit.  I will take a note to fix that:
it only really needs to reload from memory after an
assignment through a pointer or a function call.
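
For illustration, here is a sketch of that pattern (the way limit is read is an assumption, not the attached source) together with a source-level workaround: copying limit into a local whose address is never taken lets even the conservative rule keep the bound in a register.

    package main

    import "fmt"

    func main() {
        var limit int32
        fmt.Sscan("1000000000", &limit) // &limit makes limit addressable,
                                        // which is what triggers the
                                        // conservative reload in the loop

        n := limit // copy whose address is never taken, so the loop below
                   // can keep the bound in a register
        var t int32
        for i := int32(0); i < n; i++ {
            t += i
        }
        fmt.Println(t)
    }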

Owner changed to @rsc.

Status changed to Accepted.

@edsrzf

edsrzf commented May 2, 2011

Comment 2:

According to Agner Fog's assembly optimization manual[1], add is faster than inc: inc
costs an extra micro-op on some architectures and always causes a false dependency on
the carry flag from a previous instruction -- in this case the instruction directly
preceding it. I'm also surprised it would make this much of a difference, though.
[1] http://www.agner.org/optimize/optimizing_assembly.pdf, section 16.2

@gopherbot
Author

Comment 3 by olof.lindholm:

Could that difference between addloop1_c and addloop1_modular just be the difference in
branch target alignment? *just a thought*

@gopherbot
Author

Comment 4 by umbricola:

Thanks for the 6l -a tip --- my gdb seems not to be able to disassemble gc binaries.
Russ, what are your CPU specs?  I am surprised that you got a billion loop iterations to
run in 0.24 seconds.  I think that would require your CPU to be running at a bit over
4.0 GHz at minimum.
Also, the code you cite from addloop1_c and addloop1_modular has essentially the same
first three instructions, but then addloop1_c branches with jg, while addloop1_modular says JLT.
Those don't appear to mean the same thing.
Note that addloop1_monolithic differs from addloop1_modular not only in reloading limit
from memory, but also in keeping the accumulator t in memory.  The accumulation into t
is a loop-carried dependence chain; routing that through memory is bound to be expensive
in a short loop like this.  I bet the performance is getting hurt from t being in memory
more than from reloading bx through r8, which is not involved in a loop-carried
dependence chain.
It is interesting that t was put in memory during the loop: it is local to the function
and nothing has happened to it yet but := and += statements.  The only thing I can think
of is that the later call to fmt.Printf() might be taking the address of t under the
hood (I don't know what the compiler does to make an interface{} out of an int for that
call).
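
One way to test that hunch (a sketch, not from the attached code) is to print a copy of t, so that any address-taking done for the interface conversion applies to the copy rather than to the loop accumulator:

    package main

    import "fmt"

    func main() {
        var t int32
        for i := int32(0); i < 1000000000; i++ {
            t += i
        }
        out := t         // copy used only for printing: if boxing the value
        fmt.Println(out) // into an interface{} takes an address, it is out's
                         // address, not the accumulator's
    }
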
Regarding inc versus add, Agner Fog's excellent manuals do warn that using inc foo
instead of add $1, foo can cause stalls or extra dependencies in some cases, but I don't
have a good understanding of when it hurts.  I have attached another tarball that runs
four different assembly implementations of the addition loop.  On my chip it doesn't
seem to matter in this case whether I use add $1, foo or inc foo.  But it does matter
whether I use jg or jne as the loop branch instruction: jne is better here.  I think
that is because my chip (running in 32-bit mode for this executable) can macrofuse a cmp
with a jne but not with a jg.  I wouldn't have expected that to hurt this much, though:
% ./addloop1_asm-run.sh
inc + jg:
-1243309312
        0.86 real         0.86 user         0.00 sys
add + jg:
-1243309312
        0.86 real         0.85 user         0.00 sys
inc + jne:
-1243309312
        0.50 real         0.50 user         0.00 sys
add + jne:
-1243309312
        0.50 real         0.50 user         0.00 sys
In the numbers I posted originally, my gcc figured out that provided limit is positive,
it could use jne instead of jg or the like in the arithmetic loop, and that's why the
binary it compiled ran in 0.50 seconds instead of 0.86 seconds.
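
Spelled out at the source level (gcc performs this on the generated x86, not in C; the sketch below only shows why the two exit tests agree once limit is known positive):

    package main

    import "fmt"

    func main() {
        const limit = 1000 // assumed positive

        // Exit test as written: a signed "less than" comparison (jg/jl family).
        t1 := 0
        for i := 0; i < limit; i++ {
            t1 += i
        }

        // Equivalent form when limit > 0: i starts below limit and increases
        // by exactly 1, so it reaches limit exactly and an inequality test
        // is enough (cmp/jne, which macrofuses here while cmp/jg does not).
        t2, i := 0, 0
        for {
            t2 += i
            i++
            if i == limit {
                break
            }
        }

        fmt.Println(t1, t2) // both print 499500
    }
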
Regarding branch target alignment, that can matter, but I wouldn't expect Russ's listed
addloop1_modular code to get hurt because the whole loop fits into one naturally aligned
16-byte code block despite the branch target's misalignment.

Attachments:

  1. umbricola-go-defect-report-addloop1_asm.tar (10240 bytes)

@lvdlvd

lvdlvd commented Nov 7, 2011

Comment 5:

Labels changed: added compilerbug, performance.

@rsc
Contributor

rsc commented Dec 9, 2011

Comment 6:

Labels changed: added priority-later.

@rsc
Contributor

rsc commented Sep 12, 2012

Comment 8:

Labels changed: added go1.1maybe.

@robpike
Contributor

robpike commented Mar 7, 2013

Comment 9:

Labels changed: removed go1.1maybe.

@rsc
Contributor

rsc commented Mar 12, 2013

Comment 10:

[The time for maybe has passed.]

@rsc
Contributor

rsc commented Jul 30, 2013

Comment 11:

Labels changed: added go1.3maybe.

@robpike
Contributor

robpike commented Aug 20, 2013

Comment 12:

Labels changed: removed go1.3maybe.

@rsc
Contributor

rsc commented Nov 27, 2013

Comment 13:

Labels changed: added go1.3maybe.

@rsc
Contributor

rsc commented Dec 4, 2013

Comment 14:

Labels changed: added release-none, removed go1.3maybe.

@rsc
Contributor

rsc commented Dec 4, 2013

Comment 15:

Labels changed: added repo-main.

rsc added this to the Unplanned milestone Apr 10, 2015
rsc changed the title from "cmd/6g: optimizations to generated code" to "cmd/compile: optimizations to generated code on amd64" Jun 8, 2015
@ALTree
Member

ALTree commented Feb 2, 2017

I think this can be closed; the "modular" vs "monolithic" performance issue (which was the main problem here) is a thing of the past; I tested the monolithic version and it's as fast as I'd expect.

We're still slower than gcc -O3, and that's mainly because gcc -O3 unrolls the loop very aggressively, while gc doesn't; but we know that and the function benchmarked here (for i := 1; i < 1e9; i++ { sum += i }) is not particularly notable, so it's not worth leaving the issue open just for that.
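
For anyone who wants to re-check on a current toolchain, a standard benchmark along these lines (file and identifier names are illustrative, not from this issue) exercises the same loop:

    // addloop_test.go (illustrative)
    package addloop

    import "testing"

    var sink int32

    func BenchmarkAddLoop(b *testing.B) {
        for n := 0; n < b.N; n++ {
            var sum int32
            for i := int32(1); i < 1e9; i++ {
                sum += i
            }
            sink = sum // keep the result live so the loop is not optimized away
        }
    }

Run it with go test -bench AddLoop.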

ALTree closed this as completed Feb 2, 2017
golang locked and limited conversation to collaborators Feb 2, 2018
rsc removed their assignment Jun 22, 2022