Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: auto-generated code pathologically slow to compile #16407

Closed
rogpeppe opened this issue Jul 18, 2016 · 24 comments
Closed

cmd/compile: auto-generated code pathologically slow to compile #16407

rogpeppe opened this issue Jul 18, 2016 · 24 comments
Milestone

Comments

@rogpeppe
Copy link
Contributor

Please answer these questions before submitting your issue. Thanks!

  1. What version of Go are you using (go version)?

go version go1.7rc1 linux/amd64

  1. What operating system and processor architecture are you using (go env)?

amd64, linux

  1. What did you do?

go get github.com/zhenjl/xparse/etld

  1. What did you expect to see?

A response in reasonable time.

  1. What did you see instead?

No response - the code takes a very long time to compile.

Under Go 1.4, the package takes ~2s to install. Under Go 1.6, it took ~4s.

Using Go1.7rc1 I killed the compiler after 9 minutes because my machine was
overheating and becoming unusable.

It's a large state machine, but this doesn't seem like reasonable compiler behaviour.
FWIW this package is used as part of the sequencer tool (see http://zhen.org/blog/sequence-high-performance-sequential-semantic-log--parser/)

@ianlancetaylor ianlancetaylor added this to the Go1.7Maybe milestone Jul 18, 2016
@ianlancetaylor
Copy link
Contributor

CC @randall77 @josharian @dr2chase

@bradfitz
Copy link
Contributor

Quick code link for those curious:
https://raw.githubusercontent.com/zhenjl/xparse/master/etld/fsm.go

@josharian
Copy link
Contributor

Haven't dug much, but I see at least two issues here.

(1) The sparse phi locator is slow on this. Determined by sending SIGQUIT a few times after letting the compilation run for a while. Running with GO_SSA_PHI_LOC_CUTOFF=-1 gets past the SSA construction phase quickly, and lets you reach problem 2.

(2) Memory blowup in regalloc during liveness computation. There are lots of values live across lots of blocks, leading to O(n^2) behavior. That's what causes the machine instability.

There might be some cheap additional heuristics we can use for selecting whether to use the sparse phi locator. That'd probably be safe for 1.7.

The quadratic memory usage in liveness is a general, important, known problem, and way out of scope for 1.7. (David, might an SCC-based representation of liveness help here?)

@dgryski
Copy link
Contributor

dgryski commented Jul 18, 2016

This will also affect programs using https://github.com/opennota/re2dfa or www.complang.org/ragel/ to build custom DFAs for matching regexps.

@josharian
Copy link
Contributor

@rogpeppe I assume you are perfectly capable of finding workarounds as needed in the meantime (including using 1.6 or turning off ssa). But I will say that I would consider converting some of the innermost switch statements to lookup tables. This should compile faster, run faster, and probably produce a smaller binary, particularly if you shift all array indices so that the smallest interesting value is 0. This might require a non-trivial transform, though, to indicate e.g. (lines 495-594) a new value for s and a new value for m and whether to check i and pb before modifying m.

@rogpeppe
Copy link
Contributor Author

@josharian It's not my code - I have nothing to do with it or the sequencer command that imports it. I just encountered the issue when trying out that command.

@rogpeppe
Copy link
Contributor Author

@josharian But to add to your thoughts, I wonder if just using goto instead of a
huge switch statement might help matters.
That's the main reason why goto is there, after all, AIUI.

@randall77
Copy link
Contributor

Regalloc is dying because there are ~30000 integer constants, and they are all live during all blocks in the program. We put them all in the first block, and they never get moved down by the tighten pass because their only use is in a phi.
I'll modify tighten to fix this case. That will fix the regalloc part of this problem.
David, any idea why the sparse phi locator is failing?

@gopherbot
Copy link

CL https://golang.org/cl/25046 mentions this issue.

@mewmew
Copy link
Contributor

mewmew commented Jul 19, 2016

I'd reckon this is a duplicate of or related to #14934.

@dr2chase
Copy link
Contributor

It's kinda pathological for the sparse phi locator, too. One of the "this isn't usually large" assumptions isn't true. I'm working on that.

@randall77
Copy link
Contributor

Looks like the phi locator takes ~20 minutes to complete on this example.
Not good, but at least it completes.

thanm added a commit to thanm/xparse that referenced this issue Jul 20, 2016
Changes genfsm.go to divide the state space into chunks
of size 256, then call a helper for function for each
chunk (as opposed to having a single giant function that
does everything). This reduces the compile time for go1.7
down to something reasonable (15-30 seconds depending on
machine speed). See related issues

zhenjl#1
golang/go#16407
@thanm
Copy link
Contributor

thanm commented Jul 20, 2016

I created a pull request [as "learn to program in go" exercise] for the fsm generator that divides the state space into chunks, then puts the chunks in separate functions. Each of the new functions is a couple thousand lines long. Bring the total compilation time down to 10-15 seconds.

@dr2chase
Copy link
Contributor

This is not showing up here, but this CL contains two fixes (one that just missed the first 1.7 deadline, the other created earlier this week) that cut the phi location time from(on my laptop) 12+ minutes down to 2, and also cuts the memory footprint of that phase down to about 500MB.

https://go-review.googlesource.com/c/23136/

@josharian
Copy link
Contributor

With the sparse phi locator off, though, the SSA construction completes in a few seconds, not 2 minutes. It seems like a better fix here might be updated heuristics about when to use the sparse phi locator, if such heuristics are available.

CL 23136 should probably also go in, but maybe for 1.8 instead? I'm not sure.

@dr2chase
Copy link
Contributor

If I knew of such heuristics, I would use them. I haven't yet figured out what makes this particular flow-graph so special. It's big.

gopherbot pushed a commit that referenced this issue Jul 21, 2016
entry:
   x = MOVQconst [7]
   ...
b1:
   goto b2
b2:
   v = Phi(x, y, z)

Transform that program to:

entry:
   ...
b1:
   x = MOVQconst [7]
   goto b2
b2:
   v = Phi(x, y, z)

This CL moves constant-generating instructions used by a phi to the
appropriate immediate predecessor of the phi's block.

We used to put all constants in the entry block.  Unfortunately, in
large functions we have lots of constants at the start of the
function, all of which are used by lots of phis throughout the
function.  This leads to the constants being live through most of the
function (especially if there is an outer loop).  That's an O(n^2)
problem.

Note that most of the non-phi uses of constants have already been
folded into instructions (ADDQconst, MOVQstoreconst, etc.).

This CL may be generally useful for other instances of compiler
slowness, I'll have to check.  It may cause some programs to run
slower, but probably not by much, as rematerializeable values like
these constants are allocated late (not at their originally scheduled
location) anyway.

This CL is definitely a minimal change that can be considered for 1.7.
We probably want to do a better job in the tighten pass generally, not
just for phi args.  Leaving that for 1.8.

Update #16407

Change-Id: If112a8883b4ef172b2f37dea13e44bda9346c342
Reviewed-on: https://go-review.googlesource.com/25046
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
gopherbot pushed a commit that referenced this issue Jul 21, 2016
This is:

(1) a simple trick that cuts the number of phi-nodes
(temporarily) inserted into the ssa representation by a factor
of 10, and can cut the user time to compile tricky inputs like
gogo/protobuf tests from 13 user minutes to 9.5, and memory
allocation from 3.4GB to 2.4GB.

(2) a fix to sparse lookup, that does not rely on
an assumption proven false by at least one pathological
input "etldlen".

These two changes fix unrelated compiler performance bugs,
both necessary to obtain good performance compiling etldlen.
Without them it takes 20 minutes or longer, with them it
completes in 2 minutes, without a gigantic memory footprint.

Updates #16407

Change-Id: Iaa8aaa8c706858b3d49de1c4865a7fd79e6f4ff7
Reviewed-on: https://go-review.googlesource.com/23136
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: David Chase <drchase@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@josharian
Copy link
Contributor

Looks like 2m is as fast as this is going to get for 1.7. Moving to 1.8.

@mschoch
Copy link

mschoch commented Sep 15, 2016

Initially reported under #17127 the blevesearch/segment package appears to also illustrate the problem, perhaps even more severely.

I've just tested with tip:

$ go version
go version devel +22d3bf1 Thu Sep 15 19:24:04 2016 +0000 darwin/amd64

And the following compilation runs for over 20 minutes (still going):

$ go get -tags 'prod' github.com/blevesearch/segment

@randall77
Copy link
Contributor

@mschoch : your example compiles for me in ~8 minutes. Most of the time (~7 min) is lowered CSE.
Other slow phases:
phi building: 37 sec
generic CSE: 14 sec
regalloc: 4 sec
I thought regalloc would be worse.

@mschoch
Copy link

mschoch commented Oct 3, 2016

@randall77 are there new commits on tip related to this? I let it go for 45 minutes and it never finished. I also with GO_SSA_PHI_LOC_CUTOFF=-1, but it too never finished.

@randall77
Copy link
Contributor

@mschoch : I've been experimenting on CL 30163 which changes the way phis are inserted. It takes 8 minutes there. I wasn't expecting big differences with tip as both tip and that CL take about the same amount of time for phi building on your example. However, it looks like tip is going to take much longer (still running). I don't understand why, as the phi building takes about the same amount of time. For some reason, the differences in phi building make later phases take longer.
Looks like generic CSE is at least the first such phase.

@randall77
Copy link
Contributor

@rogpeppe 's example seems to be fixed with https://go-review.googlesource.com/c/30163/ . Compile time down to ~30sec (most of that still in phi building).

@randall77
Copy link
Contributor

I'm going to reopen #17127. That slow compile appears to be mostly CSE time, whereas this issue is mostly phi building time.

@gopherbot
Copy link

CL https://golang.org/cl/30163 mentions this issue.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests