cmd/link: R_PPC64_REL24 relocation truncation to fit errors when building Kubernetes on ppc64le #15823

laboger · 2016-05-24T19:22:00Z

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?
go 1.6.1
What operating system and processor architecture are you using (go env)?
Ubuntu 16.04
What did you do?
If possible, provide a recipe for reproducing the error.
A complete runnable program is good.
A link on play.golang.org is best.
Trying to build kubernetes upstream with installed go 1.6.1 on Ubuntu 16.04.

git clone https://github.com/kubernetes/kubernetes.git
cd kubernetes

modify the hack/lib/golang.sh file to remove the 2 comments that are preventing ppc64le from being used.

make

What did you expect to see?
A clean build
What did you see instead?
A failed build with linker errors due to the extremely large size of the intermediate go.o file that is being generated when building the hyperkube binary. The text section of go.o is too large for the GNU linker on ppc64le to handle, specifically with respect to branch offsets within the code.

From the output with -v option set for ldflags:
host link: "powerpc64le-linux-gnu-gcc" "-m64" "-gdwarf-2" "-o" "/tmp/go-build669045650/k8s.io/kubernetes/cmd/hyperkube/_obj/exe/a.out" "-rdynamic" "/tmp/go-link-267127109/go.o" "/tmp/go-link-267127109/000000.o" "/tmp/go-link-267127109/000001.o" "/tmp/go-link-267127109/000002.o" "/tmp/go-link-267127109/000003.o" "/tmp/go-link-267127109/000004.o" "/tmp/go-link-267127109/000005.o" "-g" "-O2" "-g" "-O2" "-g" "-O2" "-g" "-O2" "-g" "-O2" "-lpthread" "-g" "-O2"
/usr/lib/go-1.6/pkg/tool/linux_ppc64le/link: running powerpc64le-linux-gnu-gcc failed: exit status 1
/tmp/go-link-267127109/go.o: In function k8s.io/kubernetes/vendor/github.com/fsouza/go-dockerclient.(*eventMonitoringState).monitorEvents': /home/boger/kubtest/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/fsouza/go-dockerclient/event.go:214:(.text+0x200798c): relocation truncated to fit: R_PPC64_REL24 (stub) againstruntime.ifaceeq'
/tmp/go-link-267127109/go.o: In function k8s.io/kubernetes/vendor/github.com/fsouza/go-dockerclient.headersWithAuth': /home/boger/kubtest/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/fsouza/go-dockerclient/image.go:550:(.text+0x200dc10): relocation truncated to fit: R_PPC64_REL24 (stub) againstruntime.makemap'
/tmp/go-link-267127109/go.o: In function k8s.io/kubernetes/vendor/github.com/fsouza/go-dockerclient.tlsDialWithDialer': /home/boger/kubtest/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/fsouza/go-dockerclient/tls.go:50:(.text+0x20115dc): relocation truncated to fit: R_PPC64_REL24 (stub) againstruntime.makechan'
/home/boger/kubtest/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/fsouza/go-dockerclient/tls.go:85:(.text+0x20119ec): relocation truncated to fit: R_PPC64_REL24 (stub) against runtime.chanrecv1' /tmp/go-link-267127109/go.o: In functionk8s.io/kubernetes/vendor/github.com/fsouza/go-dockerclient.(*Client).ListVolumes':
/home/boger/kubtest/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/fsouza/go-dockerclient/volume.go:46:(.text+0x2011c78): relocation truncated to fit: R_PPC64_REL24 (stub) against `runtime.makemap'
..... more

Interestingly, if I try to add the -tmpdir option to -ldflags so I can look at the intermediate .o files, I get this:
host link: "powerpc64le-linux-gnu-gcc" "-m64" "-gdwarf-2" "-o" "/tmp/go-build496546066/k8s.io/kubernetes/cmd/hyperkube/_obj/exe/a.out" "-rdynamic" "/home/boger/gotmp/go.o" "/home/boger/gotmp/000000.o" "/home/boger/gotmp/000001.o" "/home/boger/gotmp/000002.o" "/home/boger/gotmp/000003.o" "/home/boger/gotmp/000004.o" "/home/boger/gotmp/000005.o" "-g" "-O2" "-g" "-O2" "-g" "-O2" "-g" "-O2" "-g" "-O2" "-lpthread" "-g" "-O2"
/home/boger/golang/go1.6/go/pkg/tool/linux_ppc64le/link: running powerpc64le-linux-gnu-gcc failed: exit status 1
/usr/bin/ld: /home/boger/gotmp/go.o: invalid string offset 1869033317 >= 7130106 for section `"ï¿½ï¿½ï¿½'
/home/boger/gotmp/go.o: error adding symbols: No error
collect2: error: ld returned 1 exit status

I was hoping that the -buildshared option would help this problem but I ran into other problems when I did that and opened another issue when using that option.

This is a serious problem for us because it prevents upstream Kubernetes from building on ppc64le.

Our Power linker expert says that the go.o file that is generated has a single text section that is too large. Usually, when the branch offset could be too long, stubs are inserted to handle it but in this case the text section is too long for the stubs to work.

From the linker output when it fails:
.text 0x0000000010001950 0x107b7c0 /tmp/go-link-220343705/go.o
0x000000001006a938 _cgo_reginit
0x000000001006d4c0 _cgo_topofstack
0x000000001006d898 main
0x0000000010a6b490 _cgo_panic
0x0000000010a6b4c8 crosscall2

From the last commit where it linked successfully:
.text 0x0000000010001740 0xe68d58 /tmp/go-link-262454621/go.o
0x000000001006a8a8 _cgo_reginit
0x000000001006d430 _cgo_topofstack
0x000000001006d808 main
0x00000000102d2c40 _cgo_panic
0x00000000102d2c78 crosscall2

So for now we are the first platform to hit this problem, but as applications keep adding more and more packages eventually limits could be hit on other platforms too. It is not very efficient for the linker to have to handle such large sections either.

FYI... I did try go 1.6.2 and go built from master with similar results. Right now the size of the text section in the go.o file is living on the edge so depending on what is done by a commit, the link will sometimes work.

I suspect resolving this is not easy, so any packaging suggestions or other ideas on how to reduce the text size would help too. The last commit to successfully build was 997d55d.

The text was updated successfully, but these errors were encountered:

minux · 2016-05-24T22:50:22Z

I think the cleanest solution is to make cmd/link split the text section into suitably sized smaller ones and let external linker do the job (but for internal linking, our linker also need to generate stubs). Another possibility is that we always generate the longer form of branch/call instructions at compile time, and rewrite them into short branch/call and nops during linking when we know the distance is not too far away. I agree similar issue will eventually hit other RISC architectures because as far as I know, none of them support huge branches.

laboger · 2016-05-25T13:28:00Z

Minux, I think it is better to have a max text section size and then split it into multiple .o files if it gets to that limit. Perhaps the limit could be based on GOARCH. I think there are other advantages to avoiding excessively large .o files and text sections. I can look into that.

laboger · 2016-06-01T21:11:00Z

@minux I've been looking at this to try and understand how the go.o file gets generated. I'm not sure how best to decide when to generate different .o files. Should each new package just create a new goX.o file? Naming them go1.o, go2.o, etc.? Or are there reasons to divide them up differently.

A comment about the internal linker. Personally, I don't think a golang only app will get that big. All the real world Go applications that I know of link against external libraries or packages of some kind so will require external linking.

Also a question about what gets generated. I've been looking at what gets put into the go.o file today. I see a lot of functions in the objdump of go.o with names like <type..eq .something> or <type..init something>. What are those for?

laboger · 2016-06-02T17:32:19Z

I believe the right solution is to create multiple text sections within the go.o file, instead of multiple .o files. There seems to be some support in place in golang to do that. I was told by my Power linker expert that the GNU linker should be able to handle this. I will try it out.

laboger · 2016-06-08T14:54:01Z

I see this was marked for the Go 1.8 milestone, but is a serious bug that can happen in Go 1.7 and does happen in Go 1.6. I have a fix for it, does the fact that it is marked as Go 1.8 mean that I can't submit it?

ianlancetaylor · 2016-06-08T17:46:45Z

You can still submit it. I marked it 1.8 because it's not a regression and because it sounds like the fix is complicated. If you have a safe fix, we can still get it in for 1.7.

gopherbot · 2016-06-09T21:00:18Z

CL https://golang.org/cl/23963 mentions this issue.

laboger · 2016-06-23T19:12:34Z

I have found 2 problems with the golang processing for elf sections that are related to my change. It's taken me a while to debug it because I'm not familiar with it yet.

When multiple text sections are created then the syms for the functions found in the sections other than the first will have their Value set to the offset relative to the start of the section they are in. So if there is a function in the second text section, its Value is set to the offset from the start of that section and since the shndx is set correctly, the linker relocates it according to its containing section correctly.

However I have found at least 2 other places where golang's elf processing assumes that the Value in a sym is relative to the start of the first text section and that causes errors. In the first case, there is a structure being created into rodata somewhere (not documented, not named the same as where it is used at least not that I could find) for the ifn and tfn fields of the method structure in runtime/type.go, for some of the go.itab functions and probably others. These values were filled in incorrectly for functions in text sections other than the first text section. If someone could point me to where the method's fields are being filled into the rodata section then I can try to find a way to make that work.

The other problem occurs when generating the findfunctab because the code assumes that all the function syms have Value fields that are ascending, and again if there are multiple text sections the functions in the second text section will have offsets relative to the start of the second text section. As a result the functab is not correct and look up fails. I think I have some ideas on how to make that one work.

ianlancetaylor · 2016-06-23T22:09:00Z

@crawshaw

crawshaw · 2016-06-24T13:49:25Z

The data structure you mention has some documentation in reflect/type.go. It is generated in cmd/compile/internal/gc/reflect.go.

When a concrete type has methods, the reflect.rtype is followed by a reflect.uncommonType which has an offset to an array of reflect.method structures, which contain the text offsets you mention.

The compiler creates the text offsets by generating an R_METHODOFF relocation, which cmd/link turns into a R_ADDROFF relocation and then turns into a section offset. For the reflect.method structure, this R_METHODOFF is generated in dextratypeData. (In particular, see dmethodptrOffLSym.)

laboger · 2016-06-24T15:13:04Z

Based on my debugging it looks like the ifn and tfn addresses in this table for methods in the second text section are being generated as if the section offsets were relative to the first text section instead of the second.

laboger · 2016-06-29T20:42:35Z

There are problems with two elf tables in output sections related to this change. The method offset tables are generated with absolute offsets determined at compile time, assuming that all text will appear consecutively. But if there are multiple text sections, the second one will not appear right after the first, there will be tables inserted by the linker to handle the long call offset instructions. I believe the method table could be set up with offsets that are relocated so that the values are correct (although not sure what the relocation would be.)

The second problem is with the pclntab. It contains "entries" for the functions in the program. The entries in the table are relocated addresses but the buckets that are generated to look up those entries were created based on absolute function addresses. As in the first problem, it assumed that all the text is consecutive so that means the buckets used for looking up functions in text sections after the first are not correct. I think the problem is somewhat easier to solve because worst case it could just do some searching if the lookup PC didn't provide the right function address the first time.

laboger · 2016-07-08T18:23:54Z

@ianlancetaylor What is the backport policy on a change like this. Must it go upstream before it can be backported to the Go 1.6 branch? The reason I'm asking is that the latest additions to my fix (the risky ones?) are needed because of the new method offset tables (int32) in Go 1.7 that depend on absolute offsets of methods rather than relocated addresses like they were in Go 1.6. My previous patch mostly worked for Go 1.6, but needs the minor addition of more searching in functab.

ianlancetaylor · 2016-07-08T18:39:53Z

We do normally require that a CL be accepted on tip before it be backported. And, we don't currently plan to make a new 1.6 release anyhow. And, it would be odd and unfortunate if programs worked with 1.6 but broke with 1.7.

laboger · 2016-07-08T19:28:16Z

Do you have any thoughts on why this problem wouldn't also occur on arm64? It is my understanding that they have a similar call instruction with a 24 bit field (shifted by 2 to get 26 bit range) for the offset of the method being called.

minux · 2016-07-20T07:10:04Z

I have a another proposal for Go 1.8 and beyond, and it's not limited to ppc64 port. It should help with all the RISC ports where jump/call instructions can't cover the entire 2GB/4GB space. In compiler, we don't assemble the entire function into a single blob of instructions, and instead, we assemble each basic block separately, and we let the linker assemble the exit instruction of each basic block based on the actual distance. The other benefit of this is that, we can make the linker reorder the basic blocks and possibly do basic block cache alignment. In the new SSA backend, this is probably fairly easy to do. If not, I think we can make cmd/internal/obj do this by tracing the instructions. I think we already have the code for some RISC architectures (but the code is probably disabled and lost during the c2go translation?)

laboger · 2016-07-20T18:50:24Z

If there is a cleaner, more general solution, I'm all for it.

With this suggestion, how would the linker decide when to create a long or short call? As it is generating code/text and writing it out for the final text section, it still won't know the offset for calls to future functions in this text section, which also could be at offsets that are too large. Or will you always generate long calls if the total text size is over a certain threshold and the call target offset is unknown?

minux · 2016-07-20T20:01:51Z

There are multiple ways to handle external targets. For example, we can put a list of near jumps to external code at the end of the text segment and make all internal call sites jump there. Or, we can generate per function .text sections. I think the first solution is also a simpler solution for ppc64 Go 1.7?

laboger · 2016-07-20T21:46:15Z

If you put the list of near jumps after a text section that exceeds the limit, then how can the internal call sites branch there? The offset would be too big.

If we try to generate per function .text sections, we'd hit the same problem with the method offset table as happens with my patch. Because the linker would fix the long calls and add jump tables between sections, causing the offsets in the method offset table to be wrong.

minux · 2016-07-20T23:26:03Z

Liblink used to be able to choose appropriate instruction to jump further. That capability must be lost in the transition to generate object code for each function in the compiler. Not able to support large pure Go binary seems a general problem to me. However, if each function is not too big, we should be able to insert trampolines after each function to resolve all internal jumps.

minux · 2016-07-20T23:28:03Z

Right. The function offset in method table fundamentally assume all Go code is in the same section. I think the Go symbol table and module data (pcdata) also make such assumption. Splitting into multiple .text sections must address those assumptions in runtime first. That's why I'm suggesting not to split the .text section.

laboger · 2016-07-22T14:38:41Z

Based on the comments from Brad and Ian in #16356 I didn't think any patch for this problem in Go 1.7 was going to be accepted. Did I misunderstand that? So I don't understand why you are trying to find a solution for it there.

As far as the long term fix, I think that changing the generated code is as risky or more than my original fix. The section split can only happen on ppc64le now so any of your concerns about what might go wrong if the sections are split can't happen on other platforms. We have built and tested programs where the sections are split and I've inspected the symbols and they have correct addresses even if they are in the additional text sections. I'm not sure what your concerns are about the pcdata. I think it is best to generate correct elf sections in a way so that the GNU linker can solve this problem since it already knows how instead of reinventing another solution for it.

I have some ideas on how to improve the way my patch is using the method offset table, but I want to do some experiments before posting anything.

As far as the same problem in a pure Go binary, the solution/suggested workaround could be to require that they use external linking in that situation.

jianzhangbjz · 2016-08-12T07:37:57Z

@laboger I encounter similar linker relocation problems on ppc64le machine based on go version go1.6 linux/ppc64le, can you provide me your fixed Go version or other suggestions?

/tmp/go-build007825104/nvml/_obj/bindings.cgo2.o: In function `_cgo_fdee712f6ce0_Cfunc_nvmlDeviceGetBAR1MemoryInfo':
bindings.cgo2.c:(.text+0x4c): relocation truncated to fit: R_PPC64_REL24 against undefined symbol `nvmlDeviceGetBAR1MemoryInfo'

laboger · 2016-08-15T13:02:41Z

@jianzhangbjz I don't think this is the same error as what's discussed in this issue. Your error message says there is an undefined symbol. This problem identified in this issue happens when the symbol exists but the offset to that symbol exceeds the limit. First fix the problem with the missing symbol and see if it still happens.

laboger · 2016-08-24T15:29:23Z

What has to happen to move forward with this issue and patch?

I have tried some experimentation with modifying the call sequences. I still think there is risk in making those changes too.

I don't have enough detail from @minux previous proposals to understand fully what he is suggesting and if he is the one who is planning to implement them. If his plan is to change the code sequences at assembly time then that means all imported packages would need changing too so those calls are fix and thus all imported packages need rebuilding.

If there are concerns with my patch related to the way sections are generated I can clean it up and/or make it more specific to ppc64le only if I know what areas are of concern.

Please let me know how to proceed because this is a big issue since it prevents several Kubernetes binaries from building and we need to get a fix into 1.8.

gopherbot · 2016-08-25T17:00:25Z

CL https://golang.org/cl/27790 mentions this issue.

laboger · 2016-09-14T21:26:00Z

@jianzhangbjz The problem you reported earlier was found to be related to the level of binutils. The error occurs on binutils 2.24 but not on binutils 2.25. If you can build with binutils 2.25 the error should go away.

laboger · 2016-09-15T19:11:47Z

@crawshaw suggested I add a testcase for this scenario. It have a prototype that generates the test, which took a long time, and then it took over 45 minutes to compile and link it. It needs to be cleaned up a bit before adding to the CL but wanted to know if this is too long to use for a testcase. This would test that the linker tables were created and inserted correctly but wouldn't stress the number of sections since only a few are generated.
@ianlancetaylor suggested an option to specify the max text section size and I've been working on implementing this. In the real world case using the actual max text section size for ppc64le, a limit of about 128 text sections should be able to handle a big enough program that anyone would care to create. If a small value is allowed on the argument there could be many 1000s of text sections. I'm planning to set up a reasonable minimum on the argument and maximum on the number of text sections since I don't think there is a good reason to accept extreme values for either. The shdrs and phdrs are now static arrays but would be changed to slices of variable size. This would be good to test out a larger number of sections even though no linker tables might be inserted. If there are any comments on this approach let me know.

ianlancetaylor · 2016-09-15T22:45:21Z

I'm sure @bradfitz would prefer the 45 minute test case. That will give him a nice challenge for #17104.

Just kidding, 45 minutes is clearly too long. I also don't think you should do heavy modifications to the linker for your CL just for testing. So probably we should move ahead without a test case. @crawshaw, opinions?

bradfitz · 2016-09-15T23:12:58Z

and then it took over 45 minutes to compile and link it. It needs to be cleaned up a bit before adding to the CL but wanted to know if this is too long to use for a testcase.

Definitely too long to be on by default.

But you can t.Skip it by default and require a --ginormous_tests flag to enable it. We've done things like that before.

crawshaw · 2016-09-16T00:42:01Z

45 minutes is never going to be run, so it's not much use. But I'm surprised it takes so long. Does the kubernetes binary that started all this take that long? If not, can it be simplified into a test case?

laboger · 2016-09-16T16:20:01Z

I'll try some more experiments to see if I can get a testcase with a big text section that builds faster. I just redid the full Kubernetes build (which builds 25 binaries, with 3 that exceed the text size limit) and it took 48 minutes for all of them.

Either there's something odd in my testcase or it's because my test is a single huge .go source file. With the Kub binaries they build from multiple .go files and include large packages (.a files) that were compiled in an earlier separate step.

I like Brad's idea of having the test out there but skipping it by default.

laboger · 2016-09-16T16:36:39Z

Ooops there is probably some parallelism in the Kubernetes build so I will need to measure the build times individually to really know how they compare.

crawshaw · 2016-09-16T16:51:12Z

You want a big text section right?

$ echo 'TEXT ·bigfn(SB),$0' > bigfn.s
$ for i in {1..1000000}; do echo " MOVQ $42, DX" >> bigfn.s; done
$ echo '  RET' >> bigfn.s
$ cat bigfn.go 
package main

func bigfn()

func main() {
        bigfn()
}
$ time go build

real    0m2.377s
user    0m2.309s
sys     0m0.313s
$ ls -l junk 
-rwxr-xr-x@ 1 crawshaw  eng  11961472 Sep 16 12:48 junk

Slowest part is 1e6 echos. If you do it with a single open and a write buffer in a go program the test should be decently fast. (You may still want to guard it with testing.Short() though.)

laboger · 2016-09-20T18:19:34Z

I am working on the latest suggestion for a test. It takes only a few minutes to build when done this way. It does reproduce the problem and identified a bug that only happens with really big functions. So I am in the process of fixing that and retesting it all and will update the CL once it is ready.

ianlancetaylor added this to the Go1.8 milestone May 24, 2016

laboger mentioned this issue Jul 13, 2016

cmd/link: if there is a go1.6.3 backport the ppc64le text section size fix #15823 #16356

Closed

laboger mentioned this issue Aug 10, 2016

cmd/link: relocation is too big for binaries built on ppc64le #16665

Closed

laboger mentioned this issue Sep 9, 2016

cmd/compile, cmd/link: can't build large arm binaries with external linking, Kubernetes now too big #17028

Closed

gopherbot closed this as completed in b4efd09 Sep 21, 2016

laboger mentioned this issue Nov 4, 2016

cmd/link: regression in big text fix for external linking on ppc64le #17795

Closed

laboger mentioned this issue Nov 14, 2016

cmd/link: handle too far branches in ppc64le with trampolines in external linking in Go 1.9 #17917

Closed

golang locked and limited conversation to collaborators Sep 21, 2017

gopherbot added the FrozenDueToAge label Sep 21, 2017

cmd/link: R_PPC64_REL24 relocation truncation to fit errors when building Kubernetes on ppc64le #15823

cmd/link: R_PPC64_REL24 relocation truncation to fit errors when building Kubernetes on ppc64le #15823

Comments

laboger commented May 24, 2016 • edited Loading

minux commented May 24, 2016 via email

laboger commented May 25, 2016

laboger commented Jun 1, 2016

laboger commented Jun 2, 2016

laboger commented Jun 8, 2016

ianlancetaylor commented Jun 8, 2016

gopherbot commented Jun 9, 2016

laboger commented Jun 23, 2016

ianlancetaylor commented Jun 23, 2016

crawshaw commented Jun 24, 2016

laboger commented Jun 24, 2016

laboger commented Jun 29, 2016

laboger commented Jul 8, 2016 • edited Loading

ianlancetaylor commented Jul 8, 2016

laboger commented Jul 8, 2016

minux commented Jul 20, 2016 via email

laboger commented Jul 20, 2016

minux commented Jul 20, 2016 via email

laboger commented Jul 20, 2016

minux commented Jul 20, 2016 via email

minux commented Jul 20, 2016 via email

laboger commented Jul 22, 2016

jianzhangbjz commented Aug 12, 2016

laboger commented Aug 15, 2016

laboger commented Aug 24, 2016

gopherbot commented Aug 25, 2016

laboger commented Sep 14, 2016

laboger commented Sep 15, 2016

ianlancetaylor commented Sep 15, 2016

bradfitz commented Sep 15, 2016

crawshaw commented Sep 16, 2016

laboger commented Sep 16, 2016

laboger commented Sep 16, 2016

crawshaw commented Sep 16, 2016

laboger commented Sep 20, 2016

laboger commented May 24, 2016 •

edited

Loading

laboger commented Jul 8, 2016 •

edited

Loading