cmd/compile: builds not reproducible on single-cpu vs multiple-cpu machines #38068
Comments
Tentatively milestoning as a Go 1.15 release blocker: @jayconrod has been working to iron out the remaining reproducibility bugs, and it would be really unfortunate to miss that target by this one issue. (CC @matloob)
CC @thanm because DWARF.
I'll take a look.
I can reproduce the problem pretty easily. The issue is not with .debug_loc/.debug_range but with .debug_info; I am working on identifying a root cause.
This is a pretty interesting bug to work on. Normally when debugging such problems "-S" and "-m" are your friends, but those are not an option when the concurrent back end is turned on. One of the functions where I can see the problem in the cmd/present build is the method encoding/asn1.BitString.At. This function winds up with different DWARF depending on whether the parallel back end is on. What's interesting about this method is that it has a value receiver, and the compiler decides at some point that it's going to generate a pointer receiver wrapper (in genwrapper), e.g. encoding/asn1.(*BitString).At. When the parallel back end is not on, the sequence of events is:
When the parallel back end runs, the sequence of events is:
At first glance this doesn't look so bad, but it turns out that there is actually IR sharing going on between the two routines: their fn.Dcl lists point to the same nodes for params, variables, etc. This means that the actions of the inliner are going to wind up making perturbations that are visible to the DWARF gen code, which is bad. I'm still thinking about the best way to fix this; I need to dig into the wrapper generation code to understand exactly why/how there is sharing and what the story is there.
Oops, got the wrong github user. Trying again: @jeremyfaller |
Ping @thanm. Any progress on this?
No progress to report, I have been bogged down with other work. I have some ideas on how to fix this, but I am still trying to come up with a good solution.
One large scale architectural approach is to try to end the lifetime of gc.Nodes at ssa construction time. That would involve building replacement built-for-purpose data structures from the nodes which then get used throughout ssa, dwarf, and everything else downstream. This would also be one big helpful step towards eliminating nodes entirely.
I am finally getting back to looking at this bug again. More detail below.

Background: when the initial version of DWARF inline support rolled out in Go 1.11, inlining of calls in wrappers was disabled (the comment at the time said that there was a bad interaction with indexed export data, with a TODO to turn it back on). In addition, wrapper generation happened very late, long after the main part of the compile was done. In the 2018 time frame (this was for Go 1.13), wrapper inlining was enabled again in this commit:

fa913a36a2 cmd/compile/internal/gc: inline autogenerated (*T).M wrappers

A little later in the Go 1.13 development cycle we added stack objects:

cbafcc55e8 cmd/compile,runtime: implement stack objects

This added a section of code to the "compile" function that included a call to dtypesym. The relevant code looks like this:
What's interesting here is that dtypesym can trigger a call to genwrapper, which in turn calls "compile". This means that wrapper generation happens much earlier, and is intermixed with compilation of "regular" functions. So let's say that the order of functions in xtop is:

BitString.At

In the non-concurrent scenario, the order of operations is:
The important thing about the sequence above is that at the point where we generate DWARF for BitString.At, it has not been inlined. Hence, according to the DWARF rules, we emit a single subprogram DIE and move on. Now the concurrent scenario. The order of operations here is:
The key thing to note here is that when a parallel worker picks up BitString.At and compiles it, at that point in time BitString.At has been inlined, meaning that according to DWARF we need to emit an abstract subprogram DIE and a concrete DIE that refers back to it. What's triggering the actual code difference is the code in the compiler's assembleInlines() function, which treats inlined and non-inlined functions differently with respect to the order it assigns to the variable/param children of the function. In the non-inlined case, it can just use the order in the function's Dcl list. In the inlined case, it has to do a lot of work to assign variables to the correct inlined instance, and as a result the child ordering comes out differently.

In terms of a fix, I can think of a few options.

The simplest would be to just turn off inlining of wrappers. That is low-effort, but also undesirable, given that we would lose whatever performance gain we got when it was turned on in 1.13.

A second option would be to change assembleInlines() in dwinl.go to look not at whether a function was actually inlined, but at whether it was inlinable in the first place (e.g. marked as an inline candidate). This would require capturing the pre-inline Dcl list at an earlier stage. There is also some risk that the ordering of child variable DIEs in subprogram DIEs would be perturbed (this is probably pretty minor). In addition, this doesn't simplify things much; things are already complicated enough in assembleInlines().

A third option would be to delay wrapper compilation. The idea here would be that when genwrapper is invoked, it captures its inputs and stores them to a list; then, when the main parallel phase is complete, we drain the list, generating the wrappers and compiling them. The potential drawback is that we'd lose some parallelism in the compiler (would need to benchmark here).
Also a fourth option: change the non-concurrent path so that we generate all wrappers before compiling any inlinable functions. That might actually be a good way to go. |
I think this is what we should do for 1.15. It is simple and low risk. We have significant other compiler performance gains this cycle to offset any losses here. We should benchmark, of course, but IIRC this mainly matters for a handful of packages containing a great many types. We can then return to a proper but higher-risk fix for 1.16. Your fourth option sounds pretty good to me, but I haven't looked at the code in any detail in a long time.
@josharian thanks for that. I'll work up CLs for options 1 and 4 and do some experimentation. I also need to figure out how to write a good test for this bug. Although I was able to reproduce it pretty easily on my laptop, it's very timing sensitive (on my workstation, which has more cores, the problem doesn't trigger for me using the submitter's recipe).
Change https://golang.org/cl/234177 mentions this issue: |
Change https://golang.org/cl/234178 mentions this issue: |
Both those CLs look pretty reasonable to me. Having seen them, I’m weakly inclined towards the deeper fix.
I'd like to propose a 5th option. Not sure if this would actually fix the issue, but... Before compilation, can we iterate through xtop, find all the wrappers that need to be generated, and generate them? Including making any inline decisions for them. |
@randall77 yes, I think doing that would also fix the problem. I can send a CL for that as well. I wasn't aware that we frown on adding functions during compilation... if that's one of our goals then I think it would definitely be better to go with your option.
I think we just try to keep as many complicated bits out of the parallel phase as possible. We've historically had trouble with symbols, errors, etc. showing up in the parallel phase and having to jump through hoops to serialize them deterministically. The less we do in the parallel phase, the better, at least from a code maintenance perspective (i.e. as long as it is a tiny fraction of compile time).
OK. Just to be clear, the damage in this specific bug is being done before the parallel phase starts; the bad behavior in question can also be triggered by "go tool compile -d compilelater" with the concurrent back end turned off.
@randall77 I made an attempt at your suggestion, but it seems more complicated than I thought. In particular, with this code in "compile()":
Note the "n.Name.Addrtaken()" in the "if" guarding the dtypesym call. This property is set by the type checker, but also in the "order" phase. If I hoist the code above out into a loop prior to the compilation loop (as you suggested), it's going to see a smaller set of addr-taken variables, because "order" hasn't run yet. With this in mind, I think it might be better to stick with option 4.
I'm not sure what that code has to do with my suggestion. I'm not suggesting that that code be moved. I'm not suggesting we compile the wrappers early, just generate them. |
This is what I tried to do. The problem is that without running "order" you don't have an accurate picture of "address taken". Here is a CL with my attempt (the new functionality is guarded by an environment variable): https://go-review.googlesource.com/c/go/+/234115 I think this is what you were suggesting, but maybe I am missing something? In particular see the "PROBLEM" comment in "funcgenwrappers".
Change https://golang.org/cl/234417 mentions this issue: |
Discussed in the compiler/runtime meeting today. The plan is to move forward with https://go-review.googlesource.com/c/go/+/234178 as opposed to the other candidate fixes.
New test case for issue 38068, which deals with build reproducibility: do a pair of compilations, the first with the concurrent back end turned on, and the second with -c=1, then check to make sure we get the same output (using a test case that triggers late inlining into wrapper methods). Updates #38068. Change-Id: I4afaf78898706a66985f09d18f6f6f29876c9017 Reviewed-on: https://go-review.googlesource.com/c/go/+/234417 Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com> Run-TryBot: Keith Randall <khr@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org>
What version of Go are you using (go version)?
Also tested with all versions from go1.13 and up; same problem.
Does this issue reproduce with the latest release?
Latest release, yes. Haven't tried tip.
What operating system and processor architecture are you using (go env)?
I ran into this when building the same code on many different machines: freebsd12 freebsd/386, debian10-stable linux/386, ubuntu18lts linux/amd64, debian-unstable linux/amd64, macos-latest darwin/amd64.
What did you do?
I built the same code on all the machines, with options that should result in the exact same binary, but got different results on some of the machines. I tracked it down to the machines with a single CPU creating a different binary. Building with GO19CONCURRENTCOMPILATION=0 on the multi-CPU machines resulted in binaries identical to the ones produced by the single-CPU machines.
Here is an example to run on a multi-CPU machine (I used linux/amd64, but it should not make a difference):
What did you expect to see?
The exact same binary.
What did you see instead?
Different binaries.
Details
I found this problem by building on a single-CPU VM, where NumCPU is 1. That probably prevents concurrency during compiles. GO19CONCURRENTCOMPILATION=0 disables concurrent compiles. The compiler and linker use runtime.NumCPU() in a few places, and perhaps GO19CONCURRENTCOMPILATION=0 isn't enough and just hides the symptoms.
FYI, so far I always got the same binary on machines with multiple CPUs, whether 2 or 8. But perhaps that's just because a high level of parallelism isn't reached.
The problem does not manifest for very simple programs (e.g. goimports in the same git checkout). Perhaps because there isn't enough to parallelize.
When building with -ldflags=-w, omitting the DWARF symbol table, the two build commands produce the same output again. I looked into that because of the diff below. I've seen earlier "reproducible build commands" that included -ldflags="-s -w". I don't expect those to be required to get reproducible builds.