Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/link: R_PPC64_REL24 relocation truncation to fit errors when building Kubernetes on ppc64le #15823

Closed
laboger opened this issue May 24, 2016 · 35 comments
Milestone

Comments

@laboger
Copy link
Contributor

laboger commented May 24, 2016

Please answer these questions before submitting your issue. Thanks!

  1. What version of Go are you using (go version)?
    go 1.6.1
  2. What operating system and processor architecture are you using (go env)?
    Ubuntu 16.04
  3. What did you do?
    If possible, provide a recipe for reproducing the error.
    A complete runnable program is good.
    A link on play.golang.org is best.
    Trying to build kubernetes upstream with installed go 1.6.1 on Ubuntu 16.04.

git clone https://github.com/kubernetes/kubernetes.git
cd kubernetes

modify the hack/lib/golang.sh file to remove the 2 comments that are preventing ppc64le from being used.

make

  1. What did you expect to see?
    A clean build
  2. What did you see instead?
    A failed build with linker errors due to the extremely large size of the intermediate go.o file that is being generated when building the hyperkube binary. The text section of go.o is too large for the GNU linker on ppc64le to handle, specifically with respect to branch offsets within the code.

From the output with -v option set for ldflags:
host link: "powerpc64le-linux-gnu-gcc" "-m64" "-gdwarf-2" "-o" "/tmp/go-build669045650/k8s.io/kubernetes/cmd/hyperkube/_obj/exe/a.out" "-rdynamic" "/tmp/go-link-267127109/go.o" "/tmp/go-link-267127109/000000.o" "/tmp/go-link-267127109/000001.o" "/tmp/go-link-267127109/000002.o" "/tmp/go-link-267127109/000003.o" "/tmp/go-link-267127109/000004.o" "/tmp/go-link-267127109/000005.o" "-g" "-O2" "-g" "-O2" "-g" "-O2" "-g" "-O2" "-g" "-O2" "-lpthread" "-g" "-O2"
/usr/lib/go-1.6/pkg/tool/linux_ppc64le/link: running powerpc64le-linux-gnu-gcc failed: exit status 1
/tmp/go-link-267127109/go.o: In function k8s.io/kubernetes/vendor/github.com/fsouza/go-dockerclient.(*eventMonitoringState).monitorEvents': /home/boger/kubtest/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/fsouza/go-dockerclient/event.go:214:(.text+0x200798c): relocation truncated to fit: R_PPC64_REL24 (stub) againstruntime.ifaceeq'
/tmp/go-link-267127109/go.o: In function k8s.io/kubernetes/vendor/github.com/fsouza/go-dockerclient.headersWithAuth': /home/boger/kubtest/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/fsouza/go-dockerclient/image.go:550:(.text+0x200dc10): relocation truncated to fit: R_PPC64_REL24 (stub) againstruntime.makemap'
/tmp/go-link-267127109/go.o: In function k8s.io/kubernetes/vendor/github.com/fsouza/go-dockerclient.tlsDialWithDialer': /home/boger/kubtest/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/fsouza/go-dockerclient/tls.go:50:(.text+0x20115dc): relocation truncated to fit: R_PPC64_REL24 (stub) againstruntime.makechan'
/home/boger/kubtest/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/fsouza/go-dockerclient/tls.go:85:(.text+0x20119ec): relocation truncated to fit: R_PPC64_REL24 (stub) against runtime.chanrecv1' /tmp/go-link-267127109/go.o: In functionk8s.io/kubernetes/vendor/github.com/fsouza/go-dockerclient.(*Client).ListVolumes':
/home/boger/kubtest/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/fsouza/go-dockerclient/volume.go:46:(.text+0x2011c78): relocation truncated to fit: R_PPC64_REL24 (stub) against `runtime.makemap'
..... more

Interestingly, if I try to add the -tmpdir option to -ldflags so I can look at the intermediate .o files, I get this:
host link: "powerpc64le-linux-gnu-gcc" "-m64" "-gdwarf-2" "-o" "/tmp/go-build496546066/k8s.io/kubernetes/cmd/hyperkube/_obj/exe/a.out" "-rdynamic" "/home/boger/gotmp/go.o" "/home/boger/gotmp/000000.o" "/home/boger/gotmp/000001.o" "/home/boger/gotmp/000002.o" "/home/boger/gotmp/000003.o" "/home/boger/gotmp/000004.o" "/home/boger/gotmp/000005.o" "-g" "-O2" "-g" "-O2" "-g" "-O2" "-g" "-O2" "-g" "-O2" "-lpthread" "-g" "-O2"
/home/boger/golang/go1.6/go/pkg/tool/linux_ppc64le/link: running powerpc64le-linux-gnu-gcc failed: exit status 1
/usr/bin/ld: /home/boger/gotmp/go.o: invalid string offset 1869033317 >= 7130106 for section `"���'
/home/boger/gotmp/go.o: error adding symbols: No error
collect2: error: ld returned 1 exit status

I was hoping that the -buildshared option would help this problem but I ran into other problems when I did that and opened another issue when using that option.

This is a serious problem for us because it prevents upstream Kubernetes from building on ppc64le.

Our Power linker expert says that the go.o file that is generated has a single text section that is too large. Usually, when the branch offset could be too long, stubs are inserted to handle it but in this case the text section is too long for the stubs to work.

From the linker output when it fails:
.text 0x0000000010001950 0x107b7c0 /tmp/go-link-220343705/go.o
0x000000001006a938 _cgo_reginit
0x000000001006d4c0 _cgo_topofstack
0x000000001006d898 main
0x0000000010a6b490 _cgo_panic
0x0000000010a6b4c8 crosscall2

From the last commit where it linked successfully:
.text 0x0000000010001740 0xe68d58 /tmp/go-link-262454621/go.o
0x000000001006a8a8 _cgo_reginit
0x000000001006d430 _cgo_topofstack
0x000000001006d808 main
0x00000000102d2c40 _cgo_panic
0x00000000102d2c78 crosscall2

So for now we are the first platform to hit this problem, but as applications keep adding more and more packages eventually limits could be hit on other platforms too. It is not very efficient for the linker to have to handle such large sections either.

FYI... I did try go 1.6.2 and go built from master with similar results. Right now the size of the text section in the go.o file is living on the edge so depending on what is done by a commit, the link will sometimes work.

I suspect resolving this is not easy, so any packaging suggestions or other ideas on how to reduce the text size would help too. The last commit to successfully build was 997d55d.

@minux
Copy link
Member

minux commented May 24, 2016 via email

@ianlancetaylor ianlancetaylor added this to the Go1.8 milestone May 24, 2016
@laboger
Copy link
Contributor Author

laboger commented May 25, 2016

Minux, I think it is better to have a max text section size and then split it into multiple .o files if it gets to that limit. Perhaps the limit could be based on GOARCH. I think there are other advantages to avoiding excessively large .o files and text sections. I can look into that.

@laboger
Copy link
Contributor Author

laboger commented Jun 1, 2016

@minux I've been looking at this to try and understand how the go.o file gets generated. I'm not sure how best to decide when to generate different .o files. Should each new package just create a new goX.o file? Naming them go1.o, go2.o, etc.? Or are there reasons to divide them up differently.

A comment about the internal linker. Personally, I don't think a golang only app will get that big. All the real world Go applications that I know of link against external libraries or packages of some kind so will require external linking.

Also a question about what gets generated. I've been looking at what gets put into the go.o file today. I see a lot of functions in the objdump of go.o with names like <type..eq .something> or <type..init something>. What are those for?

@laboger
Copy link
Contributor Author

laboger commented Jun 2, 2016

I believe the right solution is to create multiple text sections within the go.o file, instead of multiple .o files. There seems to be some support in place in golang to do that. I was told by my Power linker expert that the GNU linker should be able to handle this. I will try it out.

@laboger
Copy link
Contributor Author

laboger commented Jun 8, 2016

I see this was marked for the Go 1.8 milestone, but is a serious bug that can happen in Go 1.7 and does happen in Go 1.6. I have a fix for it, does the fact that it is marked as Go 1.8 mean that I can't submit it?

@ianlancetaylor
Copy link
Contributor

You can still submit it. I marked it 1.8 because it's not a regression and because it sounds like the fix is complicated. If you have a safe fix, we can still get it in for 1.7.

@gopherbot
Copy link

CL https://golang.org/cl/23963 mentions this issue.

@laboger
Copy link
Contributor Author

laboger commented Jun 23, 2016

I have found 2 problems with the golang processing for elf sections that are related to my change. It's taken me a while to debug it because I'm not familiar with it yet.

When multiple text sections are created then the syms for the functions found in the sections other than the first will have their Value set to the offset relative to the start of the section they are in. So if there is a function in the second text section, its Value is set to the offset from the start of that section and since the shndx is set correctly, the linker relocates it according to its containing section correctly.

However I have found at least 2 other places where golang's elf processing assumes that the Value in a sym is relative to the start of the first text section and that causes errors. In the first case, there is a structure being created into rodata somewhere (not documented, not named the same as where it is used at least not that I could find) for the ifn and tfn fields of the method structure in runtime/type.go, for some of the go.itab functions and probably others. These values were filled in incorrectly for functions in text sections other than the first text section. If someone could point me to where the method's fields are being filled into the rodata section then I can try to find a way to make that work.

The other problem occurs when generating the findfunctab because the code assumes that all the function syms have Value fields that are ascending, and again if there are multiple text sections the functions in the second text section will have offsets relative to the start of the second text section. As a result the functab is not correct and look up fails. I think I have some ideas on how to make that one work.

@ianlancetaylor
Copy link
Contributor

@crawshaw

@crawshaw
Copy link
Member

The data structure you mention has some documentation in reflect/type.go. It is generated in cmd/compile/internal/gc/reflect.go.

When a concrete type has methods, the reflect.rtype is followed by a reflect.uncommonType which has an offset to an array of reflect.method structures, which contain the text offsets you mention.

The compiler creates the text offsets by generating an R_METHODOFF relocation, which cmd/link turns into a R_ADDROFF relocation and then turns into a section offset. For the reflect.method structure, this R_METHODOFF is generated in dextratypeData. (In particular, see dmethodptrOffLSym.)

@laboger
Copy link
Contributor Author

laboger commented Jun 24, 2016

Based on my debugging it looks like the ifn and tfn addresses in this table for methods in the second text section are being generated as if the section offsets were relative to the first text section instead of the second.

@laboger
Copy link
Contributor Author

laboger commented Jun 29, 2016

There are problems with two elf tables in output sections related to this change. The method offset tables are generated with absolute offsets determined at compile time, assuming that all text will appear consecutively. But if there are multiple text sections, the second one will not appear right after the first, there will be tables inserted by the linker to handle the long call offset instructions. I believe the method table could be set up with offsets that are relocated so that the values are correct (although not sure what the relocation would be.)

The second problem is with the pclntab. It contains "entries" for the functions in the program. The entries in the table are relocated addresses but the buckets that are generated to look up those entries were created based on absolute function addresses. As in the first problem, it assumed that all the text is consecutive so that means the buckets used for looking up functions in text sections after the first are not correct. I think the problem is somewhat easier to solve because worst case it could just do some searching if the lookup PC didn't provide the right function address the first time.

@laboger
Copy link
Contributor Author

laboger commented Jul 8, 2016

@ianlancetaylor What is the backport policy on a change like this. Must it go upstream before it can be backported to the Go 1.6 branch? The reason I'm asking is that the latest additions to my fix (the risky ones?) are needed because of the new method offset tables (int32) in Go 1.7 that depend on absolute offsets of methods rather than relocated addresses like they were in Go 1.6. My previous patch mostly worked for Go 1.6, but needs the minor addition of more searching in functab.

@ianlancetaylor
Copy link
Contributor

We do normally require that a CL be accepted on tip before it be backported. And, we don't currently plan to make a new 1.6 release anyhow. And, it would be odd and unfortunate if programs worked with 1.6 but broke with 1.7.

@laboger
Copy link
Contributor Author

laboger commented Jul 8, 2016

Do you have any thoughts on why this problem wouldn't also occur on arm64? It is my understanding that they have a similar call instruction with a 24 bit field (shifted by 2 to get 26 bit range) for the offset of the method being called.

@minux
Copy link
Member

minux commented Jul 20, 2016 via email

@laboger
Copy link
Contributor Author

laboger commented Jul 20, 2016

If there is a cleaner, more general solution, I'm all for it.

With this suggestion, how would the linker decide when to create a long or short call? As it is generating code/text and writing it out for the final text section, it still won't know the offset for calls to future functions in this text section, which also could be at offsets that are too large. Or will you always generate long calls if the total text size is over a certain threshold and the call target offset is unknown?

@minux
Copy link
Member

minux commented Jul 20, 2016 via email

@laboger
Copy link
Contributor Author

laboger commented Jul 20, 2016

If you put the list of near jumps after a text section that exceeds the limit, then how can the internal call sites branch there? The offset would be too big.

If we try to generate per function .text sections, we'd hit the same problem with the method offset table as happens with my patch. Because the linker would fix the long calls and add jump tables between sections, causing the offsets in the method offset table to be wrong.

@minux
Copy link
Member

minux commented Jul 20, 2016 via email

@minux
Copy link
Member

minux commented Jul 20, 2016 via email

@laboger
Copy link
Contributor Author

laboger commented Jul 22, 2016

Based on the comments from Brad and Ian in #16356 I didn't think any patch for this problem in Go 1.7 was going to be accepted. Did I misunderstand that? So I don't understand why you are trying to find a solution for it there.

As far as the long term fix, I think that changing the generated code is as risky or more than my original fix. The section split can only happen on ppc64le now so any of your concerns about what might go wrong if the sections are split can't happen on other platforms. We have built and tested programs where the sections are split and I've inspected the symbols and they have correct addresses even if they are in the additional text sections. I'm not sure what your concerns are about the pcdata. I think it is best to generate correct elf sections in a way so that the GNU linker can solve this problem since it already knows how instead of reinventing another solution for it.

I have some ideas on how to improve the way my patch is using the method offset table, but I want to do some experiments before posting anything.

As far as the same problem in a pure Go binary, the solution/suggested workaround could be to require that they use external linking in that situation.

@jianzhangbjz
Copy link

@laboger I encounter similar linker relocation problems on ppc64le machine based on go version go1.6 linux/ppc64le, can you provide me your fixed Go version or other suggestions?

/tmp/go-build007825104/nvml/_obj/bindings.cgo2.o: In function `_cgo_fdee712f6ce0_Cfunc_nvmlDeviceGetBAR1MemoryInfo':
bindings.cgo2.c:(.text+0x4c): relocation truncated to fit: R_PPC64_REL24 against undefined symbol `nvmlDeviceGetBAR1MemoryInfo'

@laboger
Copy link
Contributor Author

laboger commented Aug 15, 2016

@jianzhangbjz I don't think this is the same error as what's discussed in this issue. Your error message says there is an undefined symbol. This problem identified in this issue happens when the symbol exists but the offset to that symbol exceeds the limit. First fix the problem with the missing symbol and see if it still happens.

@laboger
Copy link
Contributor Author

laboger commented Aug 24, 2016

What has to happen to move forward with this issue and patch?

I have tried some experimentation with modifying the call sequences. I still think there is risk in making those changes too.

I don't have enough detail from @minux previous proposals to understand fully what he is suggesting and if he is the one who is planning to implement them. If his plan is to change the code sequences at assembly time then that means all imported packages would need changing too so those calls are fix and thus all imported packages need rebuilding.

If there are concerns with my patch related to the way sections are generated I can clean it up and/or make it more specific to ppc64le only if I know what areas are of concern.

Please let me know how to proceed because this is a big issue since it prevents several Kubernetes binaries from building and we need to get a fix into 1.8.

@gopherbot
Copy link

CL https://golang.org/cl/27790 mentions this issue.

@laboger
Copy link
Contributor Author

laboger commented Sep 14, 2016

@jianzhangbjz The problem you reported earlier was found to be related to the level of binutils. The error occurs on binutils 2.24 but not on binutils 2.25. If you can build with binutils 2.25 the error should go away.

@laboger
Copy link
Contributor Author

laboger commented Sep 15, 2016

@crawshaw suggested I add a testcase for this scenario. It have a prototype that generates the test, which took a long time, and then it took over 45 minutes to compile and link it. It needs to be cleaned up a bit before adding to the CL but wanted to know if this is too long to use for a testcase. This would test that the linker tables were created and inserted correctly but wouldn't stress the number of sections since only a few are generated.
@ianlancetaylor suggested an option to specify the max text section size and I've been working on implementing this. In the real world case using the actual max text section size for ppc64le, a limit of about 128 text sections should be able to handle a big enough program that anyone would care to create. If a small value is allowed on the argument there could be many 1000s of text sections. I'm planning to set up a reasonable minimum on the argument and maximum on the number of text sections since I don't think there is a good reason to accept extreme values for either. The shdrs and phdrs are now static arrays but would be changed to slices of variable size. This would be good to test out a larger number of sections even though no linker tables might be inserted. If there are any comments on this approach let me know.

@ianlancetaylor
Copy link
Contributor

I'm sure @bradfitz would prefer the 45 minute test case. That will give him a nice challenge for #17104.

Just kidding, 45 minutes is clearly too long. I also don't think you should do heavy modifications to the linker for your CL just for testing. So probably we should move ahead without a test case. @crawshaw, opinions?

@bradfitz
Copy link
Contributor

and then it took over 45 minutes to compile and link it. It needs to be cleaned up a bit before adding to the CL but wanted to know if this is too long to use for a testcase.

Definitely too long to be on by default.

But you can t.Skip it by default and require a --ginormous_tests flag to enable it. We've done things like that before.

@crawshaw
Copy link
Member

45 minutes is never going to be run, so it's not much use. But I'm surprised it takes so long. Does the kubernetes binary that started all this take that long? If not, can it be simplified into a test case?

@laboger
Copy link
Contributor Author

laboger commented Sep 16, 2016

I'll try some more experiments to see if I can get a testcase with a big text section that builds faster. I just redid the full Kubernetes build (which builds 25 binaries, with 3 that exceed the text size limit) and it took 48 minutes for all of them.

Either there's something odd in my testcase or it's because my test is a single huge .go source file. With the Kub binaries they build from multiple .go files and include large packages (.a files) that were compiled in an earlier separate step.

I like Brad's idea of having the test out there but skipping it by default.

@laboger
Copy link
Contributor Author

laboger commented Sep 16, 2016

Ooops there is probably some parallelism in the Kubernetes build so I will need to measure the build times individually to really know how they compare.

@crawshaw
Copy link
Member

You want a big text section right?

$ echo 'TEXT ·bigfn(SB),$0' > bigfn.s
$ for i in {1..1000000}; do echo " MOVQ $42, DX" >> bigfn.s; done
$ echo '  RET' >> bigfn.s
$ cat bigfn.go 
package main

func bigfn()

func main() {
        bigfn()
}
$ time go build

real    0m2.377s
user    0m2.309s
sys     0m0.313s
$ ls -l junk 
-rwxr-xr-x@ 1 crawshaw  eng  11961472 Sep 16 12:48 junk

Slowest part is 1e6 echos. If you do it with a single open and a write buffer in a go program the test should be decently fast. (You may still want to guard it with testing.Short() though.)

@laboger
Copy link
Contributor Author

laboger commented Sep 20, 2016

I am working on the latest suggestion for a test. It takes only a few minutes to build when done this way. It does reproduce the problem and identified a bug that only happens with really big functions. So I am in the process of fixing that and retesting it all and will update the CL once it is ready.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

7 participants