syscall: build test failure on linux-ppc64 #42178
Comments
I'll debug this (don't have permission to assign the bug to myself, so I'm putting this note here). My plan is as follows:
|
Change https://golang.org/cl/264719 mentions this issue: |
Curiously, trying to turn this off, we also find: --- FAIL: TestSetuidEtc (0.01s) So, I'm disabling that test too. |
For some reason, currently unknown, this test case fails exclusively on the linux-ppc64 platform. Until such time as it can be made to work, we'll disable this test case on that platform.

The same issue causes TestSetuidEtc to fail too, so disable that on this platform.

Updates #42178

Change-Id: Idd3f6c2ee9f2fba2eb8ce4de69de7f316858bb15
Reviewed-on: https://go-review.googlesource.com/c/go/+/264719
Trust: Emmanuel Odeke <emm.odeke@gmail.com>
Run-TryBot: Emmanuel Odeke <emm.odeke@gmail.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
So, 1 is complete. 2 requires some access to be set up. May take a few days to resolve. |
Tentatively marking as a release-blocker since this is a platform issue in a new API. |
I built a toolchain on a ppc64 power8 with 3.10.0-1062.12.1.el7.ppc64 and enabled these 2 tests and they passed there. I'm suspicious of your ppc64 builder, especially since a few weeks ago, there was a glitch at OSU and the ppc64 & ppc64le systems had to be restarted but the ppc64 didn't start. After working with Lance at OSU we found that one of the build files on this builder was out of date and I asked him to post that information in the issue #41742. With that information it was resolved. Just wondering if anything else could be out of sync. A few other things: the Docker being used on this builder is specially built. Also we (IBM) have not run ppc64 big endian on Ubuntu for a while so I don't have a system to try out that distro on BE. |
This is really useful info. So, ppc64le = little-endian (passing), and ppc64 = big-endian (failing)? Are they otherwise the same architecture? Could the issue be something generated by the compiler? I've been confused about how this is failing on only one architecture. |
That is correct: ppc64 is original 64-bit PowerPC and ppc64le is the newer little-endian 64-bit PowerPC. The ppc64le processors are newer and have additional instructions but they are basically the same architecture. That said, it's worth noting that C code on the different processors uses significantly different calling conventions. |
linux/ppc64 does not support cgo either. I don't think it's the compiler since these tests worked for me on ppc64 machines. Information on the builders is under golang.org/x/build/. The ppc64 builder is go-be-xenial-3, running kernel 4.4.0-130-powerpc64-smp, so the ppc64 builder has an older kernel. All the ppc64 machines I tried here have newer kernels. |
My mistake, I didn't realize the builder tests as root. I can reproduce the failure with SetuidEtc if I run as root, but not the TestAllThreads failure. And you probably realize, the difference between ppc64 and ppc64le on the TestAllThreads is because ppc64le has cgo and the test is disabled for cgo. |
The test should work with or without cgo. On ppc64le, can you see if this works? CGO_ENABLED=0 go test syscall |
On the cgo front, does this system use glibc or some other libc variant? |
Yes that works.
glibc |
What version of glibc? |
The GNU/Linux ppc64le buildlet is running glibc 2.28. (The GNU/Linux ppc64 buildlet is running glibc 2.23.) |
I found that the SetuidEtc test fails on ppc64 because, for the Setgroups(nil) test, there is no "Groups:" line in the /proc/pid/status file, whereas on ppc64le there is a line that says "Groups: " with nothing following, which is what the testcase expects. |
Thanks to Ian's glibc information, I found that the SetuidEtc test also fails on a ppc64le that has glibc 2.23. So this failure is due to the difference in glibc.
|
To summarize, with the newer glibc and a recent kernel ppc64le passes both tests with cgo enabled and without. I am confused about what we believe is true of the ppc64 case. Is this accurate? 1. Both cases failed on the build server for ppc64 only |
For test SetuidEtc, this appears to be related to the kernel for both ppc64 and ppc64le. The output in the /proc/pid/status file is different in older kernels and the test expects output from newer kernels. I can reproduce the failure on older kernels but not newer kernels. I have not been able to reproduce the AllThreadSyscall failure on any system I've tried so far. |
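Whatever the root cause turns out to be, one way to make the check tolerant of both file layouts is to treat a missing "Groups:" line the same as an empty one. This is only a minimal sketch, not the parsing code from the syscall tests; `groupsFromStatus` is a hypothetical helper name and the sample status text is made up for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// groupsFromStatus (hypothetical helper) scans the text of a
// /proc/<pid>/status file and returns the supplementary group IDs as
// strings. found reports whether a "Groups:" line was present at all.
func groupsFromStatus(status string) (groups []string, found bool) {
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "Groups:") {
			continue
		}
		// Fields returns an empty slice for "Groups:" followed by whitespace only.
		return strings.Fields(strings.TrimPrefix(line, "Groups:")), true
	}
	// No "Groups:" line: report an empty list so callers can treat it
	// like "Groups:" with nothing following.
	return nil, false
}

func main() {
	// Made-up samples: one with an empty Groups line, one without the line.
	withLine := "Name:\ttest\nGroups:\t\nUid:\t0\t0\t0\t0\n"
	withoutLine := "Name:\ttest\nUid:\t0\t0\t0\t0\n"

	for _, s := range []string{withLine, withoutLine} {
		groups, found := groupsFromStatus(s)
		fmt.Printf("line present=%v, group count=%d\n", found, len(groups))
	}
}
```

With this shape, a Setgroups(nil) check can simply assert that the group count is zero, regardless of whether the line is present.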
OK, so upgrading the kernel on the build server is out of my hands; presumably that can be done? We have some confidence that this will address at least one of the two issues. For the AllThreads case, the only things I can think of about the build server are that:
|
I think the test for SetuidEtc should be fixed, because shouldn't it still work on older kernels? I just tried it on my laptop and got the same error:
I'm trying to find out about the seccomp question. |
Can't be an old kernel thing because I tried the AllThreads test on ppc64 systems with old kernels and they all passed (not in a container). I built a container on a ppc64le and ran the syscall tests and they passed there, so it couldn't be a seccomp issue unless there is some special seccomp setting being used on ppc64 but not ppc64le. We (IBM) don't support Docker for ppc64 so I'm not able to try running it in a container. |
The tests should in principle run on any Linux kernel 2.6.23 or higher. It's fine to skip the test if the Linux kernel version is too old. There is an example of a test that checks the kernel version in syscall/exec_linux_test.go: |
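As a rough sketch of that kind of check (this is not the code from syscall/exec_linux_test.go; `parseKernelRelease` and `atLeast` are hypothetical names), one could parse the release string reported by `uname -r` and compare it against the 2.6.23 minimum. The first two release strings below are the ones quoted earlier in this thread; "2.6.18" is a made-up example of a too-old kernel.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// leadingInt parses the run of decimal digits at the start of s.
func leadingInt(s string) (int, bool) {
	end := 0
	for end < len(s) && s[end] >= '0' && s[end] <= '9' {
		end++
	}
	if end == 0 {
		return 0, false
	}
	n, err := strconv.Atoi(s[:end])
	return n, err == nil
}

// parseKernelRelease (hypothetical helper) extracts up to three numeric
// components from a release string such as "4.4.0-130-powerpc64-smp".
// Anything after the third number is ignored.
func parseKernelRelease(release string) ([3]int, bool) {
	var v [3]int
	parts := strings.SplitN(release, ".", 3)
	if len(parts) < 2 {
		return [3]int{}, false
	}
	for i, p := range parts {
		n, ok := leadingInt(p)
		if !ok {
			return [3]int{}, false
		}
		v[i] = n
	}
	return v, true
}

// atLeast reports whether release is at or above the wanted version.
// Unparseable release strings are treated as too old.
func atLeast(release string, want [3]int) bool {
	got, ok := parseKernelRelease(release)
	if !ok {
		return false
	}
	for i := range got {
		if got[i] != want[i] {
			return got[i] > want[i]
		}
	}
	return true
}

func main() {
	minimum := [3]int{2, 6, 23}
	for _, r := range []string{"4.4.0-130-powerpc64-smp", "3.10.0-1062.12.1.el7.ppc64", "2.6.18"} {
		fmt.Printf("%s: new enough = %v\n", r, atLeast(r, minimum))
	}
}
```

In a test, a false result from `atLeast` would lead to a `t.Skip` call rather than a failure.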
I don't set up the Go builders for ppc64/ppc64le and I don't have access to run on them. But I do have access to several similar ppc64 machines to test and run on. I was able to find a ppc64 machine running Debian where I can reproduce the failure in AllThreads without running in a container. This is running a newer kernel and while trying to debug this I found that the failure is intermittent. If I try to debug with gdb it doesn't fail. By adding some creative panic messages it reports that the r2 value is wrong when returning from the syscall and that causes the failure. I also found the same failure on a Fedora 28 ppc64 machine if I set -test.count=2000. With lower test count values it fails but less often, with count=1 I couldn't get it to fail. The kernels where this happens are relatively new: |
Cool. This is good info. It suggests there is something subtle going on. (I've done 10k runs on x86 and Arm machines without failure.) Are you confident that this same level of runs on ppc64le doesn't similarly fail? I have an access key now for running tests on the ppc64, so I'll try to reproduce the failure(s) myself and work on them. |
Mystery. Removing the workaround, both of these tests pass for me on linux-ppc64-buildlet. I'll try some 10000 runs. |
I did get it to fail on ppc64le. Using Ubuntu/Debian seems to be the key. Couldn't make it fail: In all cases I've been using -test.cpu=2 since that's what the builders have. And I usually have to use a count >=200 to make it fail. |
[Pilot error on the reproduction front - getting up to speed on gomote, can reproduce both failures.] |
So, I guess this should have been obvious, but because the ppc64 build is sans cgo, both of these tests end up using the AllThreadsSyscall and thus both fail due to this. I have a fix for the Setgroups() specific extra failure with proc file parsing. It is not significantly better than not running the test for now. So I'll hold off on a commit. I'll see if I can get the AllThreadsSyscall thing characterized before deciding whether I should do two commits or one combined commit to resolve this bug. |
Change https://golang.org/cl/266202 mentions this issue: |
[It is awesome to have great build and test infrastructure!] Lynn, it looks like you are right. The r2 return value on ppc64 looks suspicious. I added some instrumentation to the doSyscall() functions, just before the panic() and the odd r2 comparison stood out:
Stuff like this:
This caused me to read up (for the first time) on what that value is supposed to be. It turns out, it is architecturally specific. "
Which looks like it is defined on all of the architectures I had tested on, but apparently not on ppc64x. Live and learn. To address this and the Linux |
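To illustrate the idea of not trusting r2 where the architecture leaves it undefined, here is a minimal sketch of a per-thread result comparison. It is an assumption about the general shape of such a check, not the code in the CL; the `result` type, `sameResult`, and the `honorsR2` flag are all hypothetical names.

```go
package main

import "fmt"

// result holds what a raw syscall hands back on one thread.
type result struct {
	r1, r2 uintptr
	errno  uintptr
}

// sameResult (hypothetical) reports whether two threads saw the same
// outcome. honorsR2 is an assumed per-architecture flag: when it is
// false, r2 is treated as undefined and left out of the comparison.
func sameResult(a, b result, honorsR2 bool) bool {
	if a.r1 != b.r1 || a.errno != b.errno {
		return false
	}
	return !honorsR2 || a.r2 == b.r2
}

func main() {
	// Same r1 and errno, but different leftover values in r2.
	first := result{r1: 0, r2: 0x1234, errno: 0}
	second := result{r1: 0, r2: 0xbeef, errno: 0}

	fmt.Println("comparing r2:", sameResult(first, second, true))
	fmt.Println("ignoring r2: ", sameResult(first, second, false))
}
```

With the flag off, two threads that agree on r1 and errno are considered consistent even if r2 holds unrelated values, which matches the intermittent mismatch described above.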
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
No. Newly added test is failing
https://build.golang.org/log/dc73e1644c3b432ec162a373589d7e37db108ba4
What did you do?
This revealed itself in build testing after https://go-review.googlesource.com/c/go/+/210639 was merged.
What did you expect to see?
The test passes.
What did you see instead?
The syscall test failed.