
cmd/compile: Fannkuch11 on AMD64 slow down 6% after removing assembler backend instruction reordering #18977

Open
cherrymui opened this issue Feb 7, 2017 · 42 comments
Labels: compiler/runtime (Issues related to the Go compiler and/or runtime), Performance

@cherrymui
Member

CL https://go-review.googlesource.com/c/36205/ deletes the assembler backend's instruction reordering. This slows Fannkuch-11 on AMD64 down by 6%. Figure out why and improve the compiler.

@cherrymui cherrymui self-assigned this Feb 7, 2017
@cherrymui cherrymui added this to the Go1.9 milestone Feb 7, 2017
@cherrymui cherrymui changed the title cmd/compile: Fannkuch-11 on AMD64 slow down 6% after removing assembler backend instruction reordering cmd/compile: Fannkuch11 on AMD64 slow down 6% after removing assembler backend instruction reordering Feb 7, 2017
@gopherbot

CL https://golang.org/cl/38431 mentions this issue.

gopherbot pushed a commit that referenced this issue Apr 24, 2017
Old loops look like this:
   loop:
     CMPQ ...
     JGE exit
     ...
     JMP loop
   exit:

New loops look like this:
    JMP entry
  loop:
    ...
  entry:
    CMPQ ...
    JLT loop

This removes one instruction (the unconditional jump) from
the inner loop.
Kinda surprisingly, it matters.

This is a bit different than the peeling that the old obj
library did in that we don't duplicate the loop exit test.
We just jump to the test.  I'm not sure if it is better or
worse to do that (peeling gets rid of the JMP but means more
code duplication), but this CL is certainly a much simpler
compiler change, so I'll try this way first.

The obj library used to do peeling before
CL https://go-review.googlesource.com/c/36205 turned it off.

Fixes #15837 (remove obj instruction reordering)
The reordering is already removed, this CL implements the only
part of that reordering that we'd like to keep.

Fixes #14758 (append loop)
name    old time/op    new time/op    delta
Foo-12     817ns ± 4%     538ns ± 0%  -34.08%   (p=0.000 n=10+9)
Bar-12     850ns ±11%     570ns ±13%  -32.88%  (p=0.000 n=10+10)

Update #19595 (BLAS slowdown)
name                       old time/op  new time/op  delta
DgemvMedMedNoTransIncN-12  13.2µs ± 9%  10.2µs ± 1%  -22.26%  (p=0.000 n=9+9)

Fixes #19633 (append loop)
name    old time/op    new time/op    delta
Foo-12     810ns ± 1%     540ns ± 0%  -33.30%   (p=0.000 n=8+9)

Update #18977 (Fannkuch11 regression)
name         old time/op    new time/op    delta
Fannkuch11-8                2.80s ± 0%     3.01s ± 0%  +7.47%   (p=0.000 n=9+10)
This one makes no sense.  There's strictly 1 less instruction in the
inner loop (17 instead of 18).  They are exactly the same instructions
except for the JMP that has been elided.

go1 benchmarks generally don't look very impressive.  But the gains for the
specific issues above make this CL still probably worth it.
name                      old time/op    new time/op    delta
BinaryTree17-8              2.32s ± 0%     2.34s ± 0%  +1.14%    (p=0.000 n=9+7)
Fannkuch11-8                2.80s ± 0%     3.01s ± 0%  +7.47%   (p=0.000 n=9+10)
FmtFprintfEmpty-8          44.1ns ± 1%    46.1ns ± 1%  +4.53%  (p=0.000 n=10+10)
FmtFprintfString-8         67.8ns ± 0%    74.4ns ± 1%  +9.80%   (p=0.000 n=10+9)
FmtFprintfInt-8            74.9ns ± 0%    78.4ns ± 0%  +4.67%   (p=0.000 n=8+10)
FmtFprintfIntInt-8          117ns ± 1%     123ns ± 1%  +4.69%   (p=0.000 n=9+10)
FmtFprintfPrefixedInt-8     160ns ± 1%     146ns ± 0%  -8.22%   (p=0.000 n=8+10)
FmtFprintfFloat-8           214ns ± 0%     206ns ± 0%  -3.91%    (p=0.000 n=8+8)
FmtManyArgs-8               468ns ± 0%     497ns ± 1%  +6.09%   (p=0.000 n=8+10)
GobDecode-8                6.16ms ± 0%    6.21ms ± 1%  +0.76%   (p=0.000 n=9+10)
GobEncode-8                4.90ms ± 0%    4.92ms ± 1%  +0.37%   (p=0.028 n=9+10)
Gzip-8                      209ms ± 0%     212ms ± 0%  +1.33%  (p=0.000 n=10+10)
Gunzip-8                   36.6ms ± 0%    38.0ms ± 1%  +4.03%    (p=0.000 n=9+9)
HTTPClientServer-8         84.2µs ± 0%    86.0µs ± 1%  +2.14%    (p=0.000 n=9+9)
JSONEncode-8               13.6ms ± 3%    13.8ms ± 1%  +1.55%   (p=0.003 n=9+10)
JSONDecode-8               53.2ms ± 5%    52.9ms ± 0%    ~     (p=0.280 n=10+10)
Mandelbrot200-8            3.78ms ± 0%    3.78ms ± 1%    ~      (p=0.661 n=10+9)
GoParse-8                  2.89ms ± 0%    2.94ms ± 2%  +1.50%  (p=0.000 n=10+10)
RegexpMatchEasy0_32-8      68.5ns ± 2%    68.9ns ± 1%    ~     (p=0.136 n=10+10)
RegexpMatchEasy0_1K-8       220ns ± 1%     225ns ± 1%  +2.41%  (p=0.000 n=10+10)
RegexpMatchEasy1_32-8      64.7ns ± 0%    64.5ns ± 0%  -0.28%  (p=0.042 n=10+10)
RegexpMatchEasy1_1K-8       348ns ± 1%     355ns ± 0%  +1.90%  (p=0.000 n=10+10)
RegexpMatchMedium_32-8      102ns ± 1%     105ns ± 1%  +2.95%  (p=0.000 n=10+10)
RegexpMatchMedium_1K-8     33.1µs ± 3%    32.5µs ± 0%  -1.75%  (p=0.000 n=10+10)
RegexpMatchHard_32-8       1.71µs ± 1%    1.70µs ± 1%  -0.84%   (p=0.002 n=10+9)
RegexpMatchHard_1K-8       51.1µs ± 0%    50.8µs ± 1%  -0.48%  (p=0.004 n=10+10)
Revcomp-8                   411ms ± 1%     402ms ± 0%  -2.22%   (p=0.000 n=10+9)
Template-8                 61.8ms ± 1%    59.7ms ± 0%  -3.44%    (p=0.000 n=9+9)
TimeParse-8                 306ns ± 0%     318ns ± 0%  +3.83%  (p=0.000 n=10+10)
TimeFormat-8                320ns ± 0%     318ns ± 1%  -0.53%   (p=0.012 n=7+10)

Change-Id: Ifaf29abbe5874e437048e411ba8f7cfbc9e1c94b
Reviewed-on: https://go-review.googlesource.com/38431
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
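
For context, the kind of Go loop this rotation affects can be as simple as the sketch below (hypothetical code, not taken from the benchmark): the loop condition maps to the CMPQ/JGE pair in the layouts above, and the back-edge maps to the JMP that rotation removes.

    // sum is an ordinary counted loop. Before rotation, every iteration
    // runs the exit test at the top plus an unconditional JMP back to it;
    // after rotation, the test sits at the bottom and the JMP is gone.
    func sum(xs []int) int {
        s := 0
        for i := 0; i < len(xs); i++ {
            s += xs[i]
        }
        return s
    }
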
@josharian
Contributor

Tip (6897030) looks like a net win vs 1.8.1. Code size is smaller:

  • 1.8.1: "".fannkuch t=1 size=844 args=0x10 locals=0x80
  • tip: "".fannkuch STEXT size=829 args=0x10 locals=0x80

Execution time is faster, at least on my machine:

name          old time/op  new time/op  delta
Fannkuch11-8   2.74s ± 2%   2.68s ± 2%  -2.32%  (p=0.000 n=20+19)

So whatever regression occurred has at least been compensated for, if not outright fixed, by CL 38431.

I do see one notable regression, though, which is in the number of panicindex calls. I've filed #20332 for that.

I'm going to close this as fixed. @cherrymui or @randall77, please reopen if you disagree.

@navytux
Contributor

navytux commented May 11, 2017

@josharian, sad to bring bad news, but here is how it looks on my machine:

$ benchstat go18.txt gotip.txt
name          old time/op  new time/op  delta
Fannkuch11-4   2.68s ± 0%   2.94s ± 0%  +9.88%  (p=0.000 n=20+19)

In other words, it is a ~10% regression.

$ go18 version
go version go1.8.1 linux/amd64
$ go version
go version devel +ee69c21747 Thu May 11 18:28:40 2017 +0000 linux/amd64

Linux deco 4.9.0-2-amd64 #1 SMP Debian 4.9.18-1 (2017-03-30) x86_64 GNU/Linux
Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz

@josharian
Contributor

@navytux will you re-measure including CL 43291?

@navytux
Contributor

navytux commented May 11, 2017

let me try

@navytux
Contributor

navytux commented May 11, 2017

Strangely, CL 43291 makes things only worse:

$ benchstat gotip.txt gotip+cl43291.txt 
name          old time/op  new time/op  delta
Fannkuch11-4   2.94s ± 0%   3.01s ± 0%  +2.29%  (p=0.000 n=19+19)

and I've double-checked that it is reproducible by recompiling (make.bash) and benchmarking tip, then tip+CL, then tip again, then tip+CL again.

@josharian
Contributor

OK, re-opening.

One other thing I see in tip vs 1.8.1 is that in 1.8.1, all the panicindex calls are at the end of the function, so jumps to them are always unlikely (forward jumps). In tip, they're scattered throughout. I'm not sure whether that's responsible, but I'll see if I can fix it and check whether it helps.

@josharian josharian reopened this May 11, 2017
@navytux
Contributor

navytux commented May 11, 2017

@josharian thanks for feedback and reopening.

@gopherbot

CL https://golang.org/cl/43293 mentions this issue.

@josharian
Contributor

@navytux will you see what effect CL 43293 has on your machine? Thanks.

@navytux
Contributor

navytux commented May 12, 2017

@josharian sure. Today's tip is go version devel +482da51803 Fri May 12 01:10:11 2017 +0000 linux/amd64, and it already includes CL 43291. We are using it as the base:

$ benchstat gotip.txt gotip+cl43293.txt 
name          old time/op  new time/op  delta
Fannkuch11-4   3.01s ± 0%   2.84s ± 1%  -5.66%  (p=0.000 n=17+19)

In other words, it is a partial improvement, and we are now back to a 6% regression over go18:

$ benchstat go18.txt gotip+cl43293.txt 
name          old time/op  new time/op  delta
Fannkuch11-4   2.68s ± 0%   2.84s ± 1%  +6.17%  (p=0.000 n=20+19)

@navytux
Contributor

navytux commented May 12, 2017

For reference, here is how Fannkuch11 looks under perf stat -d:

 Performance counter stats for './go1.test-18 -test.bench Fannkuch11' (10 runs):

       4008,809815      task-clock (msec)         #    1,001 CPUs utilized            ( +-  0,06% )
               834      context-switches          #    0,208 K/sec                    ( +-  1,04% )
                14      cpu-migrations            #    0,003 K/sec                    ( +- 10,89% )
            79 865      page-faults               #    0,020 M/sec                    ( +-  0,01% )
    13 398 886 442      cycles                    #    3,342 GHz                      ( +-  0,06% )  (49,99%)
    29 180 837 581      instructions              #    2,18  insn per cycle           ( +-  0,06% )  (62,57%)
     5 974 722 394      branches                  # 1490,398 M/sec                    ( +-  0,06% )  (62,62%)
       162 930 510      branch-misses             #    2,73% of all branches          ( +-  0,09% )  (62,73%)
     6 602 849 508      L1-dcache-loads           # 1647,085 M/sec                    ( +-  0,11% )  (62,84%)
        73 480 588      L1-dcache-load-misses     #    1,11% of all L1-dcache hits    ( +-  0,24% )  (62,61%)
         3 604 844      LLC-loads                 #    0,899 M/sec                    ( +-  3,96% )  (49,89%)
           316 385      LLC-load-misses           #    8,78% of all LL-cache hits     ( +-  5,10% )  (49,85%)

       4,003714952 seconds time elapsed                                          ( +-  0,07% )
 Performance counter stats for './go1.test-tip -test.bench Fannkuch11' (10 runs):

       4259,019276      task-clock (msec)         #    1,001 CPUs utilized            ( +-  0,09% )
               822      context-switches          #    0,193 K/sec                    ( +-  1,31% )
                15      cpu-migrations            #    0,004 K/sec                    ( +-  7,89% )
            79 928      page-faults               #    0,019 M/sec                    ( +-  0,01% )
    14 228 793 806      cycles                    #    3,341 GHz                      ( +-  0,08% )  (49,92%)
    29 011 760 342      instructions              #    2,04  insn per cycle           ( +-  0,08% )  (62,54%)
     6 812 768 092      branches                  # 1599,610 M/sec                    ( +-  0,05% )  (62,56%)
       169 436 247      branch-misses             #    2,49% of all branches          ( +-  0,07% )  (62,68%)
     6 617 477 028      L1-dcache-loads           # 1553,756 M/sec                    ( +-  0,07% )  (62,84%)
        72 643 321      L1-dcache-load-misses     #    1,10% of all L1-dcache hits    ( +-  0,21% )  (62,77%)
         3 792 842      LLC-loads                 #    0,891 M/sec                    ( +-  8,00% )  (49,95%)
           324 234      LLC-load-misses           #    8,55% of all LL-cache hits     ( +-  6,96% )  (49,88%)

       4,255140742 seconds time elapsed                                          ( +-  0,09% )
 Performance counter stats for './go1.test-tip+cl43293 -test.bench Fannkuch11' (10 runs):

       4076,691573      task-clock (msec)         #    1,001 CPUs utilized            ( +-  0,18% )
               809      context-switches          #    0,198 K/sec                    ( +-  1,07% )
                14      cpu-migrations            #    0,003 K/sec                    ( +-  7,07% )
            79 926      page-faults               #    0,020 M/sec                    ( +-  0,01% )
    13 622 118 634      cycles                    #    3,341 GHz                      ( +-  0,10% )  (49,99%)
    28 659 892 607      instructions              #    2,10  insn per cycle           ( +-  0,11% )  (62,64%)
     6 412 523 320      branches                  # 1572,972 M/sec                    ( +-  0,07% )  (62,67%)
       175 939 383      branch-misses             #    2,74% of all branches          ( +-  0,09% )  (62,74%)
     6 634 137 889      L1-dcache-loads           # 1627,334 M/sec                    ( +-  0,10% )  (62,81%)
        72 791 290      L1-dcache-load-misses     #    1,10% of all L1-dcache hits    ( +-  0,52% )  (62,68%)
         3 613 857      LLC-loads                 #    0,886 M/sec                    ( +-  3,46% )  (49,88%)
           315 113      LLC-load-misses           #    8,72% of all LL-cache hits     ( +-  5,94% )  (49,86%)

       4,072909978 seconds time elapsed                                          ( +-  0,19% )

highlights:

  1. The number of instructions executed is lower for tip (99.4%) and tip+cl43293 (98.2%) compared to go18.
  2. However, instructions per CPU cycle are lower at tip (2.04 insn/cycle) and tip+cl43293 (2.10 insn/cycle) compared to go18 (2.18 insn/cycle). That suggests more stalls, due to instruction mis-scheduling or something similar, which results in more cycles being spent even though there are fewer instructions to execute (a quick arithmetic check follows the list).
  3. The total number of branches is higher at tip (114%) and tip+cl43293 (107%) compared to go18.
  4. L1d loads are only slightly higher for tip (0.2%) and tip+cl43293 (0.4%) compared to go18. The L1d miss ratio is also almost the same across the 3 versions.
  5. LLC (last-level cache - L3 in my case - the last stop before RAM) loads are higher for tip (105.2%) and almost back to the go18 level at tip+cl43293 (100.2%). The run-to-run variation in LLC-loads is also larger for tip than for go18 and tip+cl43293. The LLC miss ratio is likewise approximately back to the go18 level at tip+cl43293.
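
As a quick arithmetic check of point 2, cycles ≈ instructions / (insn per cycle) reproduces the measured cycle counts (using the rounded IPC values, so only approximately):

    go18:         29 180 837 581 / 2.18 ≈ 13.4e9 cycles   (measured 13 398 886 442)
    tip:          29 011 760 342 / 2.04 ≈ 14.2e9 cycles   (measured 14 228 793 806)
    tip+cl43293:  28 659 892 607 / 2.10 ≈ 13.6e9 cycles   (measured 13 622 118 634)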

To me all this suggests that, once CL 43293 or something similar is applied, the remaining problems are that:

  • there are more branches compared to go18, and
  • instructions are scheduled less well by the compiler.

Hope this helps a bit,
Kirill

@josharian
Contributor

Thanks, Kirill, that's great. My guess was that CL 43293 was making the critical sections of code more compact, thus keeping more in cache; nice to have that confirmed. I'm working on making CL 43293 submittable now, and then will look again.

@navytux
Contributor

navytux commented May 12, 2017

Josh, thanks for the feedback and patches. Offhand, I think that to make this regression go away we need to fix the "Kinda surprisingly" in 39ce590 (CL 38431) - in other words, to understand the root cause and apply loop rotation only when the relevant factors say it is a win. The current state, it seems, is that we apply loop rotation unconditionally, and it sometimes helps and sometimes hurts.

@randall77, @TocarIP please see #18977 (comment) for a bit of info about what is going on with Fannkuch11.

@josharian
Contributor

I'm going to dig a bit more, although I'll be mostly AFK for a while. In the meantime, Kirill, maybe you could confirm that disabling looprotate on your machine fixes the problem by adding "return" at the very beginning of func loopRotate in cmd/compile/internal/ssa/looprotate.go, to disable the pass.
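
For concreteness, the edit Josh describes would look roughly like this (a sketch; the actual body of the pass in cmd/compile/internal/ssa/looprotate.go is elided, and the exact signature is from memory):

    // cmd/compile/internal/ssa/looprotate.go
    func loopRotate(f *Func) {
        return // temporarily turn the pass into a no-op so its effect can be measured
        // ... original pass body (unchanged) follows ...
    }

Rebuilding the toolchain with make.bash afterwards, as in the earlier measurements, then gives a compiler with the pass disabled.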

@navytux
Contributor

navytux commented May 12, 2017

Josh, thanks for the feedback. Sure; staying at the same tip = 482da51, here is what we have:

  1. Going to CL 43293.2 (the second patch in that CL) brings more speedup:
$ benchstat gotip+cl43293.txt gotip+cl43293.2.txt 
name          old time/op  new time/op  delta
Fannkuch11-4   2.84s ± 1%   2.78s ± 0%  -2.09%  (p=0.000 n=19+16)
 Performance counter stats for './go1.test-tip+cl43293.2 -test.bench Fannkuch11' (10 runs):

       4029,642462      task-clock (msec)         #    1,001 CPUs utilized            ( +-  0,11% )
               814      context-switches          #    0,202 K/sec                    ( +-  1,00% )
                11      cpu-migrations            #    0,003 K/sec                    ( +-  8,11% )
             79921      page-faults               #    0,020 M/sec                    ( +-  0,01% )
       13482684046      cycles                    #    3,346 GHz                      ( +-  0,10% )  (50,05%)
       28731337641      instructions              #    2,13  insn per cycle           ( +-  0,17% )  (62,66%)
        6429703132      branches                  # 1595,601 M/sec                    ( +-  0,14% )  (62,72%)
         168958851      branch-misses             #    2,63% of all branches          ( +-  0,18% )  (62,74%)
        6634192402      L1-dcache-loads           # 1646,348 M/sec                    ( +-  0,14% )  (62,80%)
          73264431      L1-dcache-load-misses     #    1,10% of all L1-dcache hits    ( +-  0,66% )  (62,57%)
           4233228      LLC-loads                 #    1,051 M/sec                    ( +-  3,12% )  (49,84%)
            376635      LLC-load-misses           #    8,90% of all LL-cache hits     ( +-  9,29% )  (49,89%)

       4,026348487 seconds time elapsed                                          ( +-  0,11% )

Some comments: there are slightly more (0.2%) instructions, but insn/cycle grows from 2.10 to 2.13, and as a result total CPU cycles are ~1% lower (which correlates with the total run time reported by perf, and probably includes preparatory program actions that -test.bench itself excludes from accounting). The number of branches is also 0.2% higher; however, if I read the counters correctly, not all new instructions are branches, as δ(insn) = 71445034 but δ(branches) = 17179812 (4x less).

  2. Disabling loopRotate on top of CL 43293.2 makes it slower:
$ benchstat gotip+cl43293.2.txt gotip+cl43293.2-nolooprot.txt 
name          old time/op  new time/op  delta
Fannkuch11-4   2.78s ± 0%   2.87s ± 0%  +2.93%  (p=0.000 n=16+15)
 Performance counter stats for './go1.test-tip+cl43293.2-nolooprot -test.bench Fannkuch11' (10 runs):

       4103,320043      task-clock (msec)         #    1,001 CPUs utilized            ( +-  0,12% )
               822      context-switches          #    0,200 K/sec                    ( +-  0,82% )
                12      cpu-migrations            #    0,003 K/sec                    ( +-  9,02% )
             79925      page-faults               #    0,019 M/sec                    ( +-  0,01% )
       13711752541      cycles                    #    3,342 GHz                      ( +-  0,06% )  (49,99%)
       29498486031      instructions              #    2,15  insn per cycle           ( +-  0,07% )  (62,58%)
        7212426730      branches                  # 1757,705 M/sec                    ( +-  0,08% )  (62,64%)
         174290760      branch-misses             #    2,42% of all branches          ( +-  0,04% )  (62,73%)
        6609550227      L1-dcache-loads           # 1610,781 M/sec                    ( +-  0,07% )  (62,79%)
          73869349      L1-dcache-load-misses     #    1,12% of all L1-dcache hits    ( +-  0,28% )  (62,67%)
           3816460      LLC-loads                 #    0,930 M/sec                    ( +-  2,51% )  (49,91%)
            429992      LLC-load-misses           #   11,27% of all LL-cache hits     ( +-  8,17% )  (49,89%)

       4,098764012 seconds time elapsed                                          ( +-  0,12% )

insn/cycle grows from 2.13 to 2.15 (good), but there are 2.7% more total instructions, so overall cycles grow by 1.7% (which correlates with the total run time reported by perf). There are 12% more branches, and it looks like they are the only visible factor contributing to the instruction growth: δ(branches) / δ(insn) ≈ 1.

  3. But disabling loopRotate on plain tip makes it faster:
$ benchstat gotip.txt gotip-nolooprot.txt 
name          old time/op  new time/op  delta
Fannkuch11-4   3.02s ± 1%   2.98s ± 0%  -1.29%  (p=0.000 n=19+16)
 Performance counter stats for './go1.test-tip-nolooprot -test.bench Fannkuch11' (10 runs):

       4215,739102      task-clock (msec)         #    1,001 CPUs utilized            ( +-  0,08% )
               827      context-switches          #    0,196 K/sec                    ( +-  0,81% )
                13      cpu-migrations            #    0,003 K/sec                    ( +- 10,58% )
             79932      page-faults               #    0,019 M/sec                    ( +-  0,01% )
       14108860946      cycles                    #    3,347 GHz                      ( +-  0,08% )  (49,91%)
       29508075225      instructions              #    2,09  insn per cycle           ( +-  0,14% )  (62,49%)
        7220934286      branches                  # 1712,851 M/sec                    ( +-  0,09% )  (62,58%)
         174301863      branch-misses             #    2,41% of all branches          ( +-  0,06% )  (62,77%)
        6621367485      L1-dcache-loads           # 1570,630 M/sec                    ( +-  0,07% )  (62,86%)
          73684219      L1-dcache-load-misses     #    1,11% of all L1-dcache hits    ( +-  0,45% )  (62,71%)
           3886187      LLC-loads                 #    0,922 M/sec                    ( +-  3,16% )  (49,91%)
            369954      LLC-load-misses           #    9,52% of all LL-cache hits     ( +-  6,86% )  (49,82%)

       4,211567986 seconds time elapsed                                          ( +-  0,08% )

Here the number of instructions increases by 1.7%, mostly due to a 6% increase in N(branches) (δ(branches) / δ(insn) ≈ 0.82); however, the insn/cycle ratio also grows from 2.04 to 2.09, which overcompensates for all that, and the resulting N(cycles) decreases by ~1%.

My observations:

  • In every case above, disabling loop rotation raises insn/cycle, but the go18 level of 2.18 is never reached; the closest was tip+cl43293.2-nolooprot (2.15 vs 2.18 for go18).
  • The task here seems to be to minimize the target function N(cycles) = N(instructions) / (insn/cycle) by varying both factors (see the check after this list). Code transformations that minimize N(instructions) (loop rotation, etc.) explicitly change only one part of that formula, but I'm not sure insn/cycle is modelled or taken into account anywhere.
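
The same identity, N(cycles) ≈ N(instructions) / (insn/cycle), checks out against the runs above (rounded IPC, so approximate):

    tip+cl43293.2:            28 731 337 641 / 2.13 ≈ 13.5e9 cycles   (measured 13 482 684 046)
    tip+cl43293.2-nolooprot:  29 498 486 031 / 2.15 ≈ 13.7e9 cycles   (measured 13 711 752 541)
    tip-nolooprot:            29 508 075 225 / 2.09 ≈ 14.1e9 cycles   (measured 14 108 860 946)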

(I've re-run the previous benchmarks from this morning and re-verified all timings several times.)

@navytux
Contributor

navytux commented May 12, 2017

I'm not an x86 guy (probably @TocarIP could help here), but my thinking is: if there are several execution units inside the CPU with somewhat different properties, their overall load could depend on the order of instructions in the instruction stream. With this in mind I googled for "x86 add nop speedup" and found, e.g.:

https://research.google.com/pubs/pub37077.html
https://stackoverflow.com/questions/17896714/why-would-introducing-useless-mov-instructions-speed-up-a-tight-loop-in-x86-64-a/17906589
...

Also, if we simplify and assume data loads/stores always hit the cache, what affects insn/cycle is

N(branch-misses) / N(instructions)

(note: not divided by N(branches)). The Stack Overflow link and the MAO article above contain some information on which factors can affect branch prediction and other loop-related behavior.
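
Plugging the perf counters from the earlier comment into that ratio gives a rough check with this thread's own data:

    go18:         162 930 510 / 29 180 837 581 ≈ 0.56% branch-misses per instruction
    tip:          169 436 247 / 29 011 760 342 ≈ 0.58%
    tip+cl43293:  175 939 383 / 28 659 892 607 ≈ 0.61%

Interestingly, tip+cl43293 has the highest miss rate per instruction yet the better insn/cycle of the two tip builds, so branch misses alone don't explain the whole IPC gap.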

Coincidence or not, the discussions there are also usually about 5%-7% speedups or slowdowns.

@TocarIP
Contributor

TocarIP commented May 12, 2017

I've played a bit with https://github.com/andikleen/pmu-tools
It suggests that this bench is bound by branch misprediction in various forms:
toplev.py -l=3 -d ./go1.test -test.run=NONE -test.bench=Fan:

FE             Frontend_Bound.Frontend_Latency.Branch_Resteers:         17.59 % Clocks [ 16.66%]
        This metric represents cycles fraction the CPU was stalled
        due to Branch Resteers...
        Sampling events:  br_misp_retired.all_branches
BAD            Bad_Speculation:                                         22.39 % Slots [ 17.00%]
BAD            Bad_Speculation.Branch_Mispredicts:                      22.21 % Slots [ 17.00%] BN
        This metric represents slots fraction the CPU has wasted due
        to Branch Misprediction...
        Sampling events:  br_misp_retired.all_branches
               MUX:                                                     16.62 %

Running ocperf.py record -e br_misp_retired.all_branches_pebs
shows that the guilty branches are caused by index checks:

884511   75ef95:       4d 39 e7                cmp    %r12,%r15
884512   75ef98:       73 33                   jae    75efcd <_/localdisk/itocar/golang/test/bench/go1.fannkuch+0x1ed>

and code layout:

75efcb:       eb 0e                   jmp    75efdb <_/localdisk/itocar/golang/test/bench/go1.fannkuch+0x1fb>
884526   75efcd:       e8 9e c8 cc ff          callq  42b870 <runtime.panicindex>

Note that in this case the branch is unconditional!
With CL 43293, at least, all problems are with two conditional branches.
Playing with MaxLoopPad in cmd/internal/obj/x86/asm6 produces no improvement, so the problem is not branch target alignment.
I'd like to experiment with branch source alignment, but this isn't as simple as changing one constant, and I will be away for about a week.
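
For illustration, the CMP/JAE pair above is the shape the compiler emits for a slice index it cannot prove to be in range; a hypothetical sketch of such code (loosely modeled on the benchmark's flipping loops, not the actual source):

    // The indices i and j come from data (k), so without stronger
    // bounds-check elimination each p[i] and p[j] access keeps its
    // CMP idx,len + JAE runtime.panicindex check.
    func flip(p []int, k int) {
        for i, j := 0, k; i < j; i, j = i+1, j-1 {
            p[i], p[j] = p[j], p[i]
        }
    }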

@josharian
Contributor

I haven't had a chance to dig into this excellent data, but see also #20355 for something that seems definitely wrong with looprotate. It seems that looprotate is doing both good and bad work for this benchmark.

@josharian
Contributor

josharian commented May 13, 2017

Also #20356.

@navytux
Contributor

navytux commented May 19, 2017

For reference: Fannkuch11 can be sped up by 33% by improving BCE:

#20393 (comment)

@josharian
Contributor

It can be sped up 33% with -B. That might or might not be the same thing.
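
For anyone reproducing that comparison from test/bench/go1, something along these lines should work (a sketch; -gcflags=-B asks the compiler to drop bounds checks):

    $ go test -bench=Fannkuch11 -count=20 > default.txt
    $ go test -gcflags=-B -bench=Fannkuch11 -count=20 > nobounds.txt
    $ benchstat default.txt nobounds.txt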

@navytux
Contributor

navytux commented May 20, 2017

Correct. However, all runtime bounds checks in fannkuch can be statically proven to be unnecessary:

#20393 (comment)

That means the compiler can theoretically reach -B level with just clever BCE.
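
As a contrast to the index checks shown earlier, here is a hedged sketch of the kind of access a stronger prover could already discharge (hypothetical code, not the benchmark itself):

    // From the loop condition 0 <= i < len(p)/2 it follows that both i and
    // j = len(p)-1-i lie in [0, len(p)), so in principle neither access
    // below needs a runtime bounds check.
    func reverse(p []int) {
        for i := 0; i < len(p)/2; i++ {
            j := len(p) - 1 - i
            p[i], p[j] = p[j], p[i]
        }
    }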

@dvyukov
Member

dvyukov commented May 22, 2017

@josharian:

> Dmitry's performance dashboard was noisy and spammy and ultimately disabled.

It wasn't spammy and it wasn't particularly noisy. #8717 led to episodic noise, but otherwise it reliably detected changes in multiple metrics within fractions of a percent. The dashboard also wasn't disabled; it was broken by changes to the app and then never fixed.

@dvyukov
Member

dvyukov commented May 22, 2017

@navytux @josharian I thought light had already been shed on this in #18977 (comment) and #18977 (comment).

The analysis looks great! Thanks!

@navytux
Contributor

navytux commented May 22, 2017

@dvyukov thanks for feedback.

@randall77
Contributor

We're punting more loop rotation work to 1.10.

@randall77
Contributor

Punting to 1.13; too late for anything major in 1.12.

@randall77 randall77 modified the milestones: Go1.12, Go1.13 Dec 12, 2018
@andybons andybons modified the milestones: Go1.13, Go1.14 Jul 8, 2019
@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 13, 2022