cmd/compile: multiple performance regressions due to CL 270940, 448036, or 454036 #58166

Open · prattmic opened this issue Jan 31, 2023 · 14 comments
Labels: compiler/runtime, NeedsInvestigation

@prattmic (Member)

The performance dashboard shows several regressions with the same culprit. Look for entries that say "change at fc81405".

Digging further, there are actually three commits between the previous good run and the bad run:

https://go.googlesource.com/go/+log/b419db6c15519a29ff3d7d2e56d8f115204f8c5d..fc814056aae191f61f46bef5be6e29ee3dc09b89

My gut says that the scheduling pass CL is most likely to cause regressions (cc @randall77).

The full set of regressions I see:

cc @golang/compiler

@prattmic added the NeedsInvestigation label Jan 31, 2023
@gopherbot added the compiler/runtime label Jan 31, 2023
@prattmic (Member, Author)

For reference, here are the raw results from the runs immediately before and after the change.

Before: https://perf.golang.org/search?q=upload%3A20230120.26+%7C+toolchain%3Abaseline+vs+toolchain%3Aexperiment
After: https://perf.golang.org/search?q=upload%3A20230120.29+%7C+toolchain%3Abaseline+vs+toolchain%3Aexperiment

The geomean change of time/op went from -1.24% before to -1.26% after, so overall things seem slightly better, though I'm not sure this difference is statistically significant.

@randall77 (Contributor)

I can't reproduce this, at least for BenchmarkMulWorkspaceDense1000Hundredth. All of 1.19.4, 1.20rc3, and tip are within just a percent or so of each other; in fact, tip is the fastest.

Is this linux/amd64?
Are there more detailed instructions to reproduce?

@randall77 (Contributor)

Ah, I think I need -tags noasm as part of the build. That gives me some performance delta.
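For anyone else reproducing, that presumably means something like (package path assumed):

    go test -tags=noasm -bench=MulWorkspaceDense1000Hundredth gonum.org/v1/gonum/mat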

@randall77 (Contributor)

Old loop:

    13.25s     13.25s     508284: MULSD X1, X3
    10.38s     10.41s     508288: ADDSD 0(R11)(R13*8), X3
    28.20s     28.21s     50828e: MOVSD_XMM X3, 0(R11)(R13*8)
    16.44s     16.44s     508294: INCQ R13
     7.65s      7.65s     508297: CMPQ R13, BX
         .       10ms     50829a: JLE 0x5082a6
     5.27s      5.29s     50829c: MOVSD_XMM 0(DI)(R13*8), X3
    12.29s     12.29s     5082a2: JA 0x508284

New loop:

     7.41s      7.42s     502fe5: LEAQ 0x1(R13), R8
     9.70s      9.70s     502fe9: MULSD X1, X3
    16.60s     16.61s     502fed: ADDSD 0(R11)(R13*8), X3
    27.75s     27.75s     502ff3: MOVSD_XMM X3, 0(R11)(R13*8)
    12.59s     12.59s     502ff9: MOVQ R8, R13
     6.34s      6.34s     502ffc: MOVQ 0x88(SP), R8
     8.22s      8.22s     503004: CMPQ R13, BX
         .          .     503007: JLE 0x503013
     9.95s      9.96s     503009: MOVSD_XMM 0(DI)(R13*8), X3
    17.17s     17.17s     50300f: JA 0x502fe5

So #57976 is definitely in play, but there's also the MOVQ 0x88(SP), R8 that I don't understand. I will continue looking tomorrow.
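For context, both listings have the shape of an axpy-style kernel. A minimal Go sketch of that shape (an assumption; the actual gonum source may differ):

    // Sketch of the loop shape implied by the disassembly above.
    func scaleAddTo(alpha float64, x, y []float64) {
        for i, v := range x {
            y[i] += alpha * v // the MULSD/ADDSD/MOVSD_XMM triple
        }
    }

In the new schedule, the increment i+1 (the LEAQ at the top) is computed before the remaining uses of the old index, so both index values are live across the loads and stores, which is the pattern #57976 describes.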

@gopherbot

Change https://go.dev/cl/463751 mentions this issue: cmd/compile: schedule values with no in-block uses later

@prattmic (Member, Author) commented Feb 1, 2023

That test runs as part of bent: https://cs.opensource.google/go/x/benchmarks/+/master:cmd/bent/configs/benchmarks-50.toml;l=81. I actually don't see anywhere that it passes -tags noasm, but perhaps I missed something? (cc @dr2chase)

@randall77 (Contributor)

I just submitted a fix, at least for BenchmarkMulWorkspaceDense1000Hundredth. I haven't checked the other regressions but I think they are probably related. We'll see if they recover on the perf dashboard.

@randall77 (Contributor)

Looks like these are fixed:

MulWorkspaceDense1000Hundredth
Dgeev/Circulant100

These are not:

Hash8K
FoglemanPathTraceRenderGopherIter1
MTFT
FoglemanFauxGLRenderRotateBoat

The last two are noisy enough that it's hard to be sure, though.

Encoding4KBVerySparse looks even worse after the fix :(

@randall77 reopened this Feb 2, 2023
@prattmic (Member, Author) commented Feb 2, 2023

For reference, the run on your fix CL (commit 6224db9) was the very last run using 1.19 as baseline. The very next run (commit ab0f045) used 1.20 as baseline.

So the best way to eyeball impact is to look only at results on the 6224db9 run.

@prattmic (Member, Author) commented Feb 2, 2023

To be clear, I still (mostly) agree with your assessment @randall77.

The only point I disagree on is Encoding4KBVerySparse. At 6224db9 it looks unchanged (maybe slightly worse). The reason it jumps to +7% on the next commit is that, prior to the regression, this benchmark was -10% vs 1.19. 1.20 locked in those gains, so with a baseline of 1.20 the benchmark now reads as regressed vs 1.20.
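Concretely, with illustrative numbers chosen only to match the percentages above:

    1.19 time:               t
    pre-regression tip:      0.90 t   (-10% vs 1.19)
    1.20 release:            0.90 t   (the gains were locked in)
    post-regression tip:     ~0.96 t
      vs the 1.19 baseline:  about -4%
      vs the 1.20 baseline:  0.96 / 0.90 ≈ 1.07, i.e. about +7%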

@dr2chase (Contributor) commented Feb 2, 2023

Encoding4KBVerySparse is a very micro benchmark, so it might just be noisy for silly reasons.

Random, slightly related thought: Emery Berger and a student had a project (Stabilizer) that randomized things like load order so as to produce a more normal distribution of benchmark results, allowing you to compare actual means. Since we already compile and link more than once when benchmarking builds, we could add a random seed to the linker for symbol (package?) ordering, build multiple binaries, and get some of that sweet, sweet randomness.
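A rough sketch of what such a harness could look like; the -randlayout linker flag here is an assumption for illustration (no such flag existed at the time):

    // Hypothetical harness: build several test binaries with randomized
    // symbol layout and benchmark each, so layout effects average out.
    package main

    import (
        "fmt"
        "os"
        "os/exec"
    )

    func main() {
        for seed := 1; seed <= 10; seed++ {
            bin := fmt.Sprintf("bench.%d", seed)
            build := exec.Command("go", "test", "-c",
                "-ldflags", fmt.Sprintf("-randlayout=%d", seed), // assumed flag
                "-o", bin, ".")
            build.Stderr = os.Stderr
            if err := build.Run(); err != nil {
                panic(err)
            }
            run := exec.Command("./"+bin, "-test.bench=.", "-test.count=5")
            run.Stdout = os.Stdout
            if err := run.Run(); err != nil {
                panic(err)
            }
        }
    }

The per-seed results could then be pooled with benchstat to compare means across layouts.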

@randall77 (Contributor)

Looking a bit more at Hash8K. It is slower at tip than 1.20, by about 6%.

It is all due to (*blake2s/Digest).compress.
It's a pretty big function: it compiles to 2008 instructions in 1.20 and 2014 instructions at tip, all in one giant manually unrolled basic block.
It wants to keep far more values in registers than amd64 has available, so keeping live values to a minimum and generating proper spill/restore code is important.

It's not yet clear what about this is affected by the different schedule. I'll see if any tweaks could make it a bit better.
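For reference, compress is built from many unrolled copies of the BLAKE2s G mixing step, roughly like this (a sketch from the BLAKE2 spec, not the actual x/crypto source):

    package blake2sketch

    import "math/bits"

    // g is the BLAKE2s quarter-round (rotation constants 16/12/8/7 per
    // the spec). compress unrolls many of these back to back, so a
    // large number of 32-bit values are live at any point.
    func g(a, b, c, d, x, y uint32) (uint32, uint32, uint32, uint32) {
        a += b + x
        d = bits.RotateLeft32(d^a, -16)
        c += d
        b = bits.RotateLeft32(b^c, -12)
        a += b + y
        d = bits.RotateLeft32(d^a, -8)
        c += d
        b = bits.RotateLeft32(b^c, -7)
        return a, b, c, d
    }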

@randall77 (Contributor)

Several theories discarded so far.
Current hypothesis: the different schedule more often causes the register allocator to pick an output register that differs from both input registers. There are lots of adds in this code, and ADD is a two-operand (destructive) instruction, so nonmatching registers force a MOV+OP or a LEAL instead of an in-place update.
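A small Go illustration of the hypothesis (function names invented):

    // On amd64, integer ADD is destructive: dst = dst + src. Whether the
    // allocator can reuse an input register for the result decides
    // between one instruction and two (or a LEAQ).

    func inPlace(a, b int64) int64 {
        a += b // result reuses a's register: a single ADDQ
        return a
    }

    func threeReg(a, b int64) (int64, int64, int64) {
        c := a + b // a and b stay live: typically MOVQ+ADDQ or LEAQ (a)(b*1)
        return a, b, c
    }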

johanbrandhorst pushed a commit to Pryz/go that referenced this issue Feb 12, 2023
When scheduling a block, deprioritize values whose results aren't used
until subsequent blocks.

For golang#58166, this has the effect of pushing the induction variable increment
to the end of the block, past all the other uses of the pre-incremented value.

Do this only with optimizations on. Debuggers have a preference for values
in source code order, which this CL can degrade.

Fixes golang#58166
Fixes golang#57976

Change-Id: I40d5885c661b142443c6d4702294c8abe8026c4f
Reviewed-on: https://go-review.googlesource.com/c/go/+/463751
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Keith Randall <khr@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
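The shape of the heuristic, as a hedged sketch (names invented; the real change is presumably in cmd/compile/internal/ssa/schedule.go):

    package sched

    // value stands in for an SSA value; these types do not match the
    // compiler's actual ones.
    type value struct {
        id         int
        inBlockUse bool // is the result consumed within this block?
    }

    // before reports whether a should be scheduled ahead of b when they
    // are otherwise tied: values with in-block uses come first, so e.g.
    // a loop-induction increment consumed only by the next iteration
    // sinks to the end of the block.
    func before(a, b value) bool {
        if a.inBlockUse != b.inBlockUse {
            return a.inBlockUse
        }
        return a.id < b.id // stable fallback order
    }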
@mknyszek added this to the Backlog milestone Feb 15, 2023
@mknyszek (Contributor)

#58534 should reduce the targets left in this issue to MTFT and Hash8K.
