
cmd/compile: optimizations to generated code on amd64 #1766

Closed

gopherbot opened this issue May 2, 2011 · 15 comments

@gopherbot

by umbricola:

What steps will reproduce the problem?
1. tar xf umbricola-go-defect-report-addloop1.tar
2. ./umbricola-go-defect-report-addloop1/addloop1_run.sh

What is the expected output?
Timing statistics similar to those below but with the modular and monolithic Go versions
having the same speed.

What do you see instead?
C version:
-1243309312
        0.50 real         0.50 user         0.00 sys
Modular Go version:
-1243309312
        0.86 real         0.85 user         0.00 sys
Monolithic Go version:
-1243309312
        3.02 real         3.01 user         0.00 sys


Which compiler are you using (5g, 6g, 8g, gccgo)?
6g/6l

Which operating system are you using?
Mac OS X 10.5.8

Which revision are you using?  (hg identify)
9266c53a8fc0 tip

Please provide any additional information below.
The shell script compiles and measures the running time of three programs that each
compute and print the sum of the integers from
0 to 10^9 - 1 reduced modulo 2^32 (i.e., they accumulate the sum into an int and it
wraps around a lot).  The measurements I quoted above were run on my Mac laptop with a 2.0
GHz Core 2 (65 nm process) CPU.
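
For reference, a minimal Go sketch of the computation all three programs time (the accumulator type and bound are written out here for illustration; this is not the attached source):

    package main

    import "fmt"

    func main() {
        // Accumulate into a 32-bit value so the sum wraps modulo 2^32,
        // matching the description above.
        var t int32
        for i := int32(0); i < 1000000000; i++ {
            t += i
        }
        fmt.Println(t)
    }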

The first program is a 386 binary compiled from C99 source addloop1_c.c with gcc -O3. 
It runs the 10^9 loop iterations in 0.50 s, i.e., one iteration per clock, which is the
best result I would realistically expect from a compiler.  The generated code is quite
natural: an add, an inc, and a cmp/jne pair eligible for macrofusion, which fit nicely
on execution ports 0, 1, and 5.

The second program is an x86-64 binary compiled with 6g/6l from Go sources
addloop1_sum.go and addloop1_modular.go.  The former contains the arithmetic loop as an
exported function that the main function in the latter file calls out to.  This version
deserves to be a little slower than the first program just because it's an x86-64 binary
running on a chip that can't do macrofusion in 64-bit mode.  Theoretically I'd expect
3/4 iterations per cycle, thus about 0.50 s * 4/3 = 0.67 s running time.  The observed
0.86 s suggests that Go is generating worse code than gcc here, but the speed is still
reasonable.  Unfortunately, I don't know a way to disassemble the code in a 6l binary,
so I don't know what the loop looks like.
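
For reference, the two-file split described here has roughly the following shape (a sketch only: the exported function name, package layout, and import path are assumptions, not the attached source):

    // addloop1_sum.go (assumed shape)
    package sum

    // Sum returns the 32-bit-wrapping sum of the integers 0 .. n-1.
    func Sum(n int32) int32 {
        var t int32
        for i := int32(0); i < n; i++ {
            t += i
        }
        return t
    }

    // addloop1_modular.go (assumed shape)
    package main

    import (
        "fmt"
        "sum" // placeholder import path
    )

    func main() {
        fmt.Println(sum.Sum(1000000000))
    }

Because the loop sits in a function whose parameter and locals never have their addresses taken, nothing forces them out of registers, which is consistent with the register-only loop shown in the disassembly below.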

The third program is another x86-64 binary compiled with 6g/6l.  This time there is only
one Go source file addloop1_monolithic.go, in which the main() function contains the
arithmetic loop.  Because we didn't change the loop, just inlined it in the main
function, I'd expect to see the same 0.86 s timing again.  Instead I got 3.02 s --- much
worse!  I can't guess how that happened, but I'm sure it's not the behavior you want.

Attachments:

  1. umbricola-go-defect-report-addloop1.tar (10240 bytes)
@rsc
Contributor

rsc commented May 2, 2011

Comment 1:

You can see the assembly by using a standard tool like gdb or objdump
but also by running 6l with the -a flag.
On my Linux system, I get
addloop1_c (0.24s)
.L10:
    addl    %ecx, %edx
    addl    $1, %ecx
    cmpl    %ecx, %esi
    jg  .L10
addloop1_modular (0.95s)
421734  01c1                 | (6)  ADDL    AX,CX
421736  ffc0                 | (5)  INCL    ,AX
421738  39d0                 | (5)  CMPL    AX,DX
42173a  7cf8                 | (5)  JLT ,421734
It looks like the only difference is the incl vs addl $1. 
Should that really cause such a huge time difference?
addloop1_monolithic (2.83s)
400d36  0144244c             | (12) ADDL    AX,main.t+76(SP)
400d3a  ffc0                 | (11) INCL    ,AX
400d3c  418b18               | (11) MOVL    (R8),BX
400d3f  39c3                 | (11) CMPL    BX,AX
400d41  7ff3                 | (11) JGT ,400d36
Here the difference is the use of (R8) as the limit.
Because the call to fmt.Sscan took the address of limit,
the code is being overly conservative about making sure
it sees changes to limit.  I will take a note to fix that:
it only really needs to reload from memory after an
assignment through a pointer or a function call.
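
For illustration, here is a sketch of that pattern (the way limit is read is an assumption, not the attached source) together with a source-level workaround: copying limit into a local whose address is never taken lets even the conservative rule keep the bound in a register.

    package main

    import "fmt"

    func main() {
        var limit int32
        fmt.Sscan("1000000000", &limit) // &limit makes limit addressable,
                                        // which is what triggers the
                                        // conservative reload in the loop

        n := limit // copy whose address is never taken, so the loop below
                   // can keep the bound in a register
        var t int32
        for i := int32(0); i < n; i++ {
            t += i
        }
        fmt.Println(t)
    }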

Owner changed to @rsc.

Status changed to Accepted.

@edsrzf

edsrzf commented May 2, 2011

Comment 2:

According to Agner Fog's assembly optimization manual[1], add is faster than inc: inc
costs an extra micro-op on some architectures and always causes a false dependency on
the carry flag from a previous instruction -- in this case the instruction directly
preceding it. I'm also surprised it would make this much of a difference, though.
[1] http://www.agner.org/optimize/optimizing_assembly.pdf, section 16.2

@gopherbot
Author

Comment 3 by olof.lindholm:

Could that difference between addloop1_c and addloop1_modular just be the difference in
branch target alignment? *just a thought*

@gopherbot
Author

Comment 4 by umbricola:

Thanks for the 6l -a tip --- my gdb seems not to be able to disassemble gc binaries.
Russ, what are your CPU specs?  I am surprised that you got a billion loop iterations to
run in 0.24 seconds.  I think that would require your CPU to be running at a bit over
4.0 GHz at minimum.
Also, the code you cite from addloop1_c and addloop1_modular has essentially the same
first three instructions, but then addloop1_c branches with jg, while addloop1_modular says JLT.
Those don't appear to mean the same thing.
Note that addloop1_monolithic differs from addloop1_modular not only in reloading limit
from memory, but also in keeping the accumulator t in memory.  The accumulation into t
is a loop-carried dependence chain; routing that through memory is bound to be expensive
in a short loop like this.  I bet the performance is getting hurt from t being in memory
more than from reloading bx through r8, which is not involved in a loop-carried
dependence chain.
It is interesting that t was put in memory during the loop: it is local to the function
and nothing has happened to it yet but := and += statements.  The only thing I can think
of is that the later call to fmt.Printf() might be taking the address of t under the
hood (I don't know what the compiler does to make an interface{} out of an int for that
call).
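
One way to test that hunch (a sketch, not from the attached code) is to print a copy of t, so that any address-taking done for the interface conversion applies to the copy rather than to the loop accumulator:

    package main

    import "fmt"

    func main() {
        var t int32
        for i := int32(0); i < 1000000000; i++ {
            t += i
        }
        out := t         // copy used only for printing: if boxing the value
        fmt.Println(out) // into an interface{} takes an address, it is out's
                         // address, not the accumulator's
    }
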
Regarding inc versus add, Agner Fog's excellent manuals do warn that using inc foo
instead of add $1, foo can cause stalls or extra dependencies in some cases, but I don't
have a good understanding of when it hurts.  I have attached another tarball that runs
four different assembly implementations of the addition loop.  On my chip it doesn't
seem to matter in this case whether I use add $1, foo or inc foo.  But it does matter
whether I use jg or jne as the loop branch instruction: jne is better here.  I think
that is because my chip (running in 32-bit mode for this executable) can macrofuse a cmp
with a jne but not with a jg.  I wouldn't have expected that to hurt this much, though:
% ./addloop1_asm-run.sh
inc + jg:
-1243309312
        0.86 real         0.86 user         0.00 sys
add + jg:
-1243309312
        0.86 real         0.85 user         0.00 sys
inc + jne:
-1243309312
        0.50 real         0.50 user         0.00 sys
add + jne:
-1243309312
        0.50 real         0.50 user         0.00 sys
In the numbers I posted originally, my gcc figured out that provided limit is positive,
it could use jne instead of jg or the like in the arithmetic loop, and that's why the
binary it compiled ran in 0.50 seconds instead of 0.86 seconds.
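
Spelled out at the source level (gcc performs this on the generated x86, not in C; the sketch below only shows why the two exit tests agree once limit is known positive):

    package main

    import "fmt"

    func main() {
        const limit = 1000 // assumed positive

        // Exit test as written: a signed "less than" comparison (jg/jl family).
        t1 := 0
        for i := 0; i < limit; i++ {
            t1 += i
        }

        // Equivalent form when limit > 0: i starts below limit and increases
        // by exactly 1, so it reaches limit exactly and an inequality test
        // is enough (cmp/jne, which macrofuses here while cmp/jg does not).
        t2, i := 0, 0
        for {
            t2 += i
            i++
            if i == limit {
                break
            }
        }

        fmt.Println(t1, t2) // both print 499500
    }
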
Regarding branch target alignment, that can matter, but I wouldn't expect Russ's listed
addloop1_modular code to get hurt because the whole loop fits into one naturally aligned
16-byte code block despite the branch target's misalignment.

Attachments:

  1. umbricola-go-defect-report-addloop1_asm.tar (10240 bytes)

@lvdlvd

lvdlvd commented Nov 7, 2011

Comment 5:

Labels changed: added compilerbug, performance.

@rsc
Contributor

rsc commented Dec 9, 2011

Comment 6:

Labels changed: added priority-later.

@rsc
Contributor

rsc commented Sep 12, 2012

Comment 8:

Labels changed: added go1.1maybe.

@robpike
Contributor

robpike commented Mar 7, 2013

Comment 9:

Labels changed: removed go1.1maybe.

@rsc
Contributor

rsc commented Mar 12, 2013

Comment 10:

[The time for maybe has passed.]

@rsc
Contributor

rsc commented Jul 30, 2013

Comment 11:

Labels changed: added go1.3maybe.

@robpike
Contributor

robpike commented Aug 20, 2013

Comment 12:

Labels changed: removed go1.3maybe.

@rsc
Contributor

rsc commented Nov 27, 2013

Comment 13:

Labels changed: added go1.3maybe.

@rsc
Contributor

rsc commented Dec 4, 2013

Comment 14:

Labels changed: added release-none, removed go1.3maybe.

@rsc
Contributor

rsc commented Dec 4, 2013

Comment 15:

Labels changed: added repo-main.

rsc added this to the Unplanned milestone Apr 10, 2015
rsc changed the title from "cmd/6g: optimizations to generated code" to "cmd/compile: optimizations to generated code on amd64" Jun 8, 2015
@ALTree
Member

ALTree commented Feb 2, 2017

I think this can be closed; the "modular" vs "monolithic" performance issue (which was the main problem here) is a thing of the past; I tested the monolithic version and it's as fast as I'd expect.

We're still slower than gcc -O3, and that's mainly because gcc -O3 unrolls the loop very aggressively, while gc doesn't; but we know that and the function benchmarked here (for i := 1; i < 1e9; i++ { sum += i }) is not particularly notable, so it's not worth leaving the issue open just for that.
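
For anyone who wants to re-check on a current toolchain, a standard benchmark along these lines (file and identifier names are illustrative, not from this issue) exercises the same loop:

    // addloop_test.go (illustrative)
    package addloop

    import "testing"

    var sink int32

    func BenchmarkAddLoop(b *testing.B) {
        for n := 0; n < b.N; n++ {
            var sum int32
            for i := int32(1); i < 1e9; i++ {
                sum += i
            }
            sink = sum // keep the result live so the loop is not optimized away
        }
    }

Run it with go test -bench AddLoop.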

ALTree closed this as completed Feb 2, 2017
golang locked and limited conversation to collaborators Feb 2, 2018
rsc removed their assignment Jun 22, 2022