cmd/compile: multiple performance regressions due to CL 270940, 448036, or 454036 #58166

Open · prattmic opened this issue Jan 31, 2023 · 14 comments
Labels: compiler/runtime, NeedsInvestigation

@prattmic (Member)

The performance dashboard shows several regressions with the same culprit. Look for entries that say "change at fc81405".

Digging further, there are actually three commits between the previous good run and the bad run:

https://go.googlesource.com/go/+log/b419db6c15519a29ff3d7d2e56d8f115204f8c5d..fc814056aae191f61f46bef5be6e29ee3dc09b89

My gut says that the scheduling pass CL is most likely to cause regressions (cc @randall77).

The full set of regressions I see:

cc @golang/compiler

@prattmic added the NeedsInvestigation label Jan 31, 2023
@gopherbot added the compiler/runtime label Jan 31, 2023
@prattmic (Member, Author)

For reference, here are the raw results from the runs immediately before and after the change.

Before: https://perf.golang.org/search?q=upload%3A20230120.26+%7C+toolchain%3Abaseline+vs+toolchain%3Aexperiment
After: https://perf.golang.org/search?q=upload%3A20230120.29+%7C+toolchain%3Abaseline+vs+toolchain%3Aexperiment

The geomean change of time/op went from -1.24% before to -1.26% after, so overall things seem slightly better, though I'm not sure this difference is statistically significant.

@randall77 (Contributor)

I can't reproduce this, at least for BenchmarkMulWorkspaceDense1000Hundredth. All of 1.19.4, 1.20rc3, and tip are within just a percent or so of each other; in fact, tip is the fastest.

Is this linux/amd64?
Are there more detailed instructions to reproduce?

@randall77 (Contributor)

Ah, I think I need -tags noasm as part of the build. That gives me some performance delta.
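For anyone else reproducing, that presumably means something like (package path assumed):

    go test -tags=noasm -bench=MulWorkspaceDense1000Hundredth gonum.org/v1/gonum/mat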

@randall77 (Contributor)

Old loop:

    13.25s     13.25s     508284: MULSD X1, X3
    10.38s     10.41s     508288: ADDSD 0(R11)(R13*8), X3
    28.20s     28.21s     50828e: MOVSD_XMM X3, 0(R11)(R13*8)
    16.44s     16.44s     508294: INCQ R13
     7.65s      7.65s     508297: CMPQ R13, BX
         .       10ms     50829a: JLE 0x5082a6
     5.27s      5.29s     50829c: MOVSD_XMM 0(DI)(R13*8), X3
    12.29s     12.29s     5082a2: JA 0x508284

New loop:

     7.41s      7.42s     502fe5: LEAQ 0x1(R13), R8
     9.70s      9.70s     502fe9: MULSD X1, X3
    16.60s     16.61s     502fed: ADDSD 0(R11)(R13*8), X3
    27.75s     27.75s     502ff3: MOVSD_XMM X3, 0(R11)(R13*8)
    12.59s     12.59s     502ff9: MOVQ R8, R13
     6.34s      6.34s     502ffc: MOVQ 0x88(SP), R8
     8.22s      8.22s     503004: CMPQ R13, BX
         .          .     503007: JLE 0x503013
     9.95s      9.96s     503009: MOVSD_XMM 0(DI)(R13*8), X3
    17.17s     17.17s     50300f: JA 0x502fe5

So #57976 is definitely in play, but there's also the MOVQ 0x88(SP), R8 that I don't understand. I will continue looking tomorrow.
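For context, both listings have the shape of an axpy-style kernel. A minimal Go sketch of that shape (an assumption; the actual gonum source may differ):

    // Sketch of the loop shape implied by the disassembly above.
    func scaleAddTo(alpha float64, x, y []float64) {
        for i, v := range x {
            y[i] += alpha * v // the MULSD/ADDSD/MOVSD_XMM triple
        }
    }

In the new schedule, the increment i+1 (the LEAQ at the top) is computed before the remaining uses of the old index, so both index values are live across the loads and stores, which is the pattern #57976 describes.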

@gopherbot

Change https://go.dev/cl/463751 mentions this issue: cmd/compile: schedule values with no in-block uses later

@prattmic (Member, Author) commented Feb 1, 2023

That test runs as part of bent: https://cs.opensource.google/go/x/benchmarks/+/master:cmd/bent/configs/benchmarks-50.toml;l=81. I actually don't see anywhere that it passes -tags noasm, but perhaps I missed something? (cc @dr2chase)

@randall77 (Contributor)

I just submitted a fix, at least for BenchmarkMulWorkspaceDense1000Hundredth. I haven't checked the other regressions but I think they are probably related. We'll see if they recover on the perf dashboard.

@randall77 (Contributor)

Looks like these are fixed:

MulWorkspaceDense1000Hundredth
Dgeev/Circulant100

These are not:

Hash8K
FoglemanPathTraceRenderGopherIter1
MTFT
FoglemanFauxGLRenderRotateBoat

The last two are noisy enough that it's hard to be sure, though.

Encoding4KBVerySparse looks even worse after the fix :(

@randall77 reopened this Feb 2, 2023
@prattmic (Member, Author) commented Feb 2, 2023

For reference, the run on your fix CL (commit 6224db9) was the very last run using 1.19 as baseline. The very next run (commit ab0f045) used 1.20 as baseline.

So the best way to eyeball impact is to look only at results on the 6224db9 run.

@prattmic (Member, Author) commented Feb 2, 2023

To be clear, I still (mostly) agree with your assessment @randall77.

The only point I disagree on is Encoding4KBVerySparse. At 6224db9 it looks unchanged (maybe slightly worse). The reason it jumps to +7% on the next commit is that, prior to the regression, this benchmark was -10% vs 1.19. 1.20 locked in those gains, so with a baseline of 1.20 the benchmark now reads as regressed vs 1.20.
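Concretely, with illustrative numbers chosen only to match the percentages above:

    1.19 time:               t
    pre-regression tip:      0.90 t   (-10% vs 1.19)
    1.20 release:            0.90 t   (the gains were locked in)
    post-regression tip:     ~0.96 t
      vs the 1.19 baseline:  about -4%
      vs the 1.20 baseline:  0.96 / 0.90 ≈ 1.07, i.e. about +7%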

@dr2chase (Contributor) commented Feb 2, 2023

Encoding4KBVerySparse is a very micro benchmark, so it might just be noisy for silly reasons.

Random, slightly related thought: Emery Berger and a student had a project (Stabilizer) that randomized things like load order so as to produce a more normal distribution of benchmark results, allowing you to compare actual means. Since we already compile and link more than once when benchmarking builds, we could add a random seed to the linker for symbol (package?) ordering, build multiple binaries, and get some of that sweet, sweet randomness.
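A rough sketch of what such a harness could look like; the -randlayout linker flag here is an assumption for illustration (no such flag existed at the time):

    // Hypothetical harness: build several test binaries with randomized
    // symbol layout and benchmark each, so layout effects average out.
    package main

    import (
        "fmt"
        "os"
        "os/exec"
    )

    func main() {
        for seed := 1; seed <= 10; seed++ {
            bin := fmt.Sprintf("bench.%d", seed)
            build := exec.Command("go", "test", "-c",
                "-ldflags", fmt.Sprintf("-randlayout=%d", seed), // assumed flag
                "-o", bin, ".")
            build.Stderr = os.Stderr
            if err := build.Run(); err != nil {
                panic(err)
            }
            run := exec.Command("./"+bin, "-test.bench=.", "-test.count=5")
            run.Stdout = os.Stdout
            if err := run.Run(); err != nil {
                panic(err)
            }
        }
    }

The per-seed results could then be pooled with benchstat to compare means across layouts.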

@randall77 (Contributor)

Looking a bit more at Hash8K. It is slower at tip than 1.20, by about 6%.

It is all due to (*blake2s/Digest).compress.
It's a pretty big function: it compiles to 2008 instructions in 1.20 and 2014 instructions at tip, all in one giant manually unrolled basic block.
It wants to keep far more values in registers than amd64 has available, so keeping live values to a minimum and generating proper spill/restore code is important.

It's not yet clear what about this is affected by the different schedule. I'll see if any tweaks could make it a bit better.
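For reference, compress is built from many unrolled copies of the BLAKE2s G mixing step, roughly like this (a sketch from the BLAKE2 spec, not the actual x/crypto source):

    package blake2sketch

    import "math/bits"

    // g is the BLAKE2s quarter-round (rotation constants 16/12/8/7 per
    // the spec). compress unrolls many of these back to back, so a
    // large number of 32-bit values are live at any point.
    func g(a, b, c, d, x, y uint32) (uint32, uint32, uint32, uint32) {
        a += b + x
        d = bits.RotateLeft32(d^a, -16)
        c += d
        b = bits.RotateLeft32(b^c, -12)
        a += b + y
        d = bits.RotateLeft32(d^a, -8)
        c += d
        b = bits.RotateLeft32(b^c, -7)
        return a, b, c, d
    }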

@randall77 (Contributor)

Several theories discarded so far.
Current hypothesis: the different schedule more often causes the register allocator to pick an output register that differs from both input registers. There are lots of adds in this code, and ADD is a two-operand (destructive) instruction, so nonmatching registers force a MOV+OP or a LEAL instead of an in-place update.
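A small Go illustration of the hypothesis (function names invented):

    // On amd64, integer ADD is destructive: dst = dst + src. Whether the
    // allocator can reuse an input register for the result decides
    // between one instruction and two (or a LEAQ).

    func inPlace(a, b int64) int64 {
        a += b // result reuses a's register: a single ADDQ
        return a
    }

    func threeReg(a, b int64) (int64, int64, int64) {
        c := a + b // a and b stay live: typically MOVQ+ADDQ or LEAQ (a)(b*1)
        return a, b, c
    }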

johanbrandhorst pushed a commit to Pryz/go that referenced this issue Feb 12, 2023
When scheduling a block, deprioritize values whose results aren't used
until subsequent blocks.

For golang#58166, this has the effect of pushing the induction variable increment
to the end of the block, past all the other uses of the pre-incremented value.

Do this only with optimizations on. Debuggers have a preference for values
in source code order, which this CL can degrade.

Fixes golang#58166
Fixes golang#57976

Change-Id: I40d5885c661b142443c6d4702294c8abe8026c4f
Reviewed-on: https://go-review.googlesource.com/c/go/+/463751
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Keith Randall <khr@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
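The shape of the heuristic, as a hedged sketch (names invented; the real change is presumably in cmd/compile/internal/ssa/schedule.go):

    package sched

    // value stands in for an SSA value; these types do not match the
    // compiler's actual ones.
    type value struct {
        id         int
        inBlockUse bool // is the result consumed within this block?
    }

    // before reports whether a should be scheduled ahead of b when they
    // are otherwise tied: values with in-block uses come first, so e.g.
    // a loop-induction increment consumed only by the next iteration
    // sinks to the end of the block.
    func before(a, b value) bool {
        if a.inBlockUse != b.inBlockUse {
            return a.inBlockUse
        }
        return a.id < b.id // stable fallback order
    }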
@mknyszek added this to the Backlog milestone Feb 15, 2023
@mknyszek (Contributor)

#58534 should reduce the targets left in this issue to MTFT and Hash8K.
