syscall: build test failure on linux-ppc64 #42178

Closed
AndrewGMorgan opened this issue Oct 23, 2020 · 31 comments

Comments

@AndrewGMorgan
Contributor

AndrewGMorgan commented Oct 23, 2020

What version of Go are you using (go version)?

HEAD

Does this issue reproduce with the latest release?

No. A newly added test is failing.

https://build.golang.org/log/dc73e1644c3b432ec162a373589d7e37db108ba4

linux-ppc64-buildlet at f24ff3856a629a6b5fefe28a1676638d5f103342

:: Running /workdir/go/src/make.bash with args ["/workdir/go/src/make.bash"] and env ["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" "HOSTNAME=ppc64_04" "GO_BUILDER_ENV=host-linux-ppc64-osu" "DEBIAN_FRONTEND=noninteractive" "GOROOT_BOOTSTRAP=/workdir/go1.4" "GO_BUILD_KEY_DELETE_AFTER_READ=true" "GO_BUILD_KEY_PATH=/buildkey/gobuildkey" "HOME=/root" "USER=root" "GO_STAGE0_NET_DELAY=800ms" "GO_STAGE0_DL_DELAY=1.1s" "WORKDIR=/workdir" "GO_BUILDER_NAME=linux-ppc64-buildlet" "GO_BUILDER_FLAKY_NET=1" "GOROOT_BOOTSTRAP=/usr/local/go-bootstrap" "GOBIN=" "TMPDIR=/workdir/tmp" "GOCACHE=/workdir/gocache" "GOROOT_BOOTSTRAP=/usr/local/go-bootstrap"] in dir /workdir/go/src
[...]
ok  	strings	2.593s
ok  	sync	1.376s
ok  	sync/atomic	0.522s
panic: AllThreadsSyscall results differ between threads; runtime corrupted
fatal error: panic on system stack
panic: AllThreadsSyscall results differ between threads; runtime corrupted
fatal error: panic on system stack
panic: AllThreadsSyscall results differ between threads; runtime corrupted
fatal error: panic on system stack
panic: AllThreadsSyscall results differ between threads; runtime corrupted
fatal error: panic on system stack
panic: AllThreadsSyscall results differ between threads; runtime corrupted
fatal error: panic on system stack

runtime stack:
syscall.(*allThreadsCaller).doSyscall(0xc00012c780, 0x0, 0x3)
	/workdir/go/src/syscall/syscall_linux.go:1007 +0xbc

goroutine 1 [chan receive]:
testing.(*T).Run(0xc0000fe780, 0x199871, 0xd, 0x1a5028, 0x100000000000424)
	/workdir/go/src/testing/testing.go:1219 +0x280
testing.runTests.func1(0xc000001200)
	/workdir/go/src/testing/testing.go:1491 +0x78
testing.tRunner(0xc000001200, 0xc000052ce8)
	/workdir/go/src/testing/testing.go:1173 +0xd8
testing.runTests(0xc00000c048, 0x29be20, 0x2e, 0x2e, 0xbfdcf078c7bb2c07, 0x29e8e35e3f, 0x29f820, 0x10000c00001c360)
	/workdir/go/src/testing/testing.go:1489 +0x2b4
testing.(*M).Run(0xc00000a080, 0x0)
	/workdir/go/src/testing/testing.go:1397 +0x1a0
syscall_test.TestMain(0xc00000a080)
	/workdir/go/src/syscall/syscall_linux_test.go:153 +0xc8
main.main()
	_testmain.go:137 +0x134

goroutine 30 [chan receive]:
testing.(*T).Parallel(0xc0000b5980)
	/workdir/go/src/testing/testing.go:1039 +0xec
syscall_test.TestInvalidExec(0xc0000b5980)
	/workdir/go/src/syscall/exec_unix_test.go:222 +0x2c
testing.tRunner(0xc0000b5980, 0x1a4f18)
	/workdir/go/src/testing/testing.go:1173 +0xd8
created by testing.(*T).Run
	/workdir/go/src/testing/testing.go:1218 +0x264

goroutine 47 [running]:
	goroutine running on other thread; stack unavailable
created by testing.(*T).Run
	/workdir/go/src/testing/testing.go:1218 +0x264

goroutine 38 [sleep]:
time.Sleep(0x12a05f200)
	/workdir/go/src/runtime/time.go:188 +0xc4
syscall_test.TestLinuxDeathSignal.func1(0x1ce7e0, 0xc0000630e0)
	/workdir/go/src/syscall/syscall_linux_test.go:214 +0x30
created by syscall_test.TestLinuxDeathSignal
	/workdir/go/src/syscall/syscall_linux_test.go:213 +0x7ac

runtime stack:
syscall.(*allThreadsCaller).doSyscall(0xc00012c780, 0x0, 0x3)
	/workdir/go/src/syscall/syscall_linux.go:1007 +0xbc

runtime stack:
syscall.(*allThreadsCaller).doSyscall(0xc00012c780, 0x5215c, 0x49814)
	/workdir/go/src/syscall/syscall_linux.go:1007 +0xbc

runtime stack:
syscall.(*allThreadsCaller).doSyscall(0xc00012c780, 0x120b7deda, 0x6cb44a2249beb)
	/workdir/go/src/syscall/syscall_linux.go:1007 +0xbc

runtime stack:
syscall.(*allThreadsCaller).doSyscall(0xc00012c780, 0x1322b0, 0x4af94)
	/workdir/go/src/syscall/syscall_linux.go:1007 +0xbc
FAIL	syscall	0.269s
ok  	testing	1.768s
ok  	testing/fstest	0.033s
FAIL
go tool dist: Failed: exit status 1

What did you do?

This revealed itself in build testing after https://go-review.googlesource.com/c/go/+/210639 was merged.

What did you expect to see?

The test passes.

What did you see instead?

The syscall test failed.

@AndrewGMorgan
Contributor Author

I'll debug this (don't have permission to assign the bug to myself, so I'm putting this note here). My plan is as follows:

  1. create a change that disables this test case on linux-ppc64 and get it submitted
  2. figure out how to reproduce the test build failure
  3. determine what might be wrong and fix it

@gopherbot

Change https://golang.org/cl/264719 mentions this issue: syscall: factor out TestAllThreadsSyscall to exclude linux-ppc64

@AndrewGMorgan
Contributor Author

Curiously, trying to turn this off, we also find:

--- FAIL: TestSetuidEtc (0.01s)
syscall_linux_test.go:627: [7] "Setgroups(nil)" comparison: /proc/22027/status Groups: (bad)
FAIL
FAIL syscall 0.115s

So, I'm disabling that test too.

gopherbot pushed a commit that referenced this issue Oct 24, 2020
For some reason, currently unknown, this test case fails exclusively
on the linux-ppc64 platform. Until such time as it can be made to
work, we'll disable this test case on that platform.

The same issue causes TestSetuidEtc to fail too, so disable that
on this platform.

Updates #42178

Change-Id: Idd3f6c2ee9f2fba2eb8ce4de69de7f316858bb15
Reviewed-on: https://go-review.googlesource.com/c/go/+/264719
Trust: Emmanuel Odeke <emm.odeke@gmail.com>
Run-TryBot: Emmanuel Odeke <emm.odeke@gmail.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@AndrewGMorgan
Contributor Author

So, step 1 is complete. Step 2 requires some access to be set up, which may take a few days to resolve.

@aclements
Member

Tentatively marking as a release-blocker since this is a platform issue in a new API.

@laboger
Contributor

laboger commented Oct 26, 2020

I built a toolchain on a ppc64 power8 with 3.10.0-1062.12.1.el7.ppc64 and enabled these 2 tests and they passed there.

I'm suspicious of your ppc64 builder, especially since a few weeks ago, there was a glitch at OSU and the ppc64 & ppc64le systems had to be restarted but the ppc64 didn't start. After working with Lance at OSU we found that one of the build files on this builder was out of date and I asked him to post that information in the issue #41742. With that information it was resolved. Just wondering if anything else could be out of sync.

A few other things: the Docker being used on this builder is specially built. Also we (IBM) have not run ppc64 big endian on Ubuntu for a while so I don't have a system to try out that distro on BE.

@AndrewGMorgan
Contributor Author

This is really useful info. So, ppc64le = little-endian (passing), and ppc64 = big-endian (failing) ? Are they otherwise the same architecture?

Could the issue be something generated by the compiler? I've been confused about how this is failing on only one architecture.

@ianlancetaylor
Contributor

That is correct: ppc64 is the original 64-bit PowerPC and ppc64le is the newer little-endian 64-bit PowerPC. The ppc64le processors are newer and have additional instructions but they are basically the same architecture.

That said, it's worth noting that C code on the different processors uses significantly different calling conventions.

@laboger
Contributor

laboger commented Oct 26, 2020

linux/ppc64 does not support cgo either.

I don't think it's the compiler since these tests worked for me on ppc64 machines.

Information on the builders is under golang.org/x/build/.
The linux-ppc64 and linux-ppc64le directories under golang.org/x/build/env/ contain the configuration information.

ppc64 builder is go-be-xenial-3 4.4.0-130-powerpc64-smp
ppc64le builder is go-le-bionic-1 4.15.0-65-generic #74-Ubuntu SMP

So the ppc64 builder has an older kernel. All the ppc64 machines I tried here have newer kernels.

@laboger
Contributor

laboger commented Oct 26, 2020

My mistake, I didn't realize the builder tests as root. I can reproduce the failure with SetuidEtc if I run as root, but not the TestAllThreads failure.

And you probably realize, the difference between ppc64 and ppc64le on the TestAllThreads is because ppc64le has cgo and the test is disabled for cgo.

@AndrewGMorgan
Contributor Author

The test should work with or without cgo. On ppc64le, can you see if this works?

CGO_ENABLED=0 go test syscall

@AndrewGMorgan
Contributor Author

AndrewGMorgan commented Oct 26, 2020

On the cgo front, does this system use glibc or some other libc variant?

@laboger
Contributor

laboger commented Oct 26, 2020

The test should work with or without cgo. On ppc64le, can you see if this works?
CGO_ENABLED=0 go test syscall

Yes that works.

On the cgo front, does this system use glibc or some other libc variant?

glibc

@AndrewGMorgan
Contributor Author

What version of glibc?

@ianlancetaylor
Contributor

The GNU/Linux ppc64le buildlet is running glibc 2.28. (The GNU/Linux ppc64 buildlet is running glibc 2.23.)

@laboger
Contributor

laboger commented Oct 27, 2020

I found that the SetuidEtc test fails on ppc64 because, for the Setgroups(nil) test, it doesn't have a line in the /proc/pid/status file, whereas ppc64le has a line that says "Groups: " with nothing following, which is what the testcase expects.

@laboger
Contributor

laboger commented Oct 27, 2020

Thanks to Ian's glibc information, I found that the SetuidEtc test also fails on a ppc64le that has glibc 2.23. So this failure is due to the difference in glibc.

./syscall.test -test.run=SetuidEtc -test.v
=== RUN   TestSetuidEtc
    syscall_linux_test.go:630: [7] "Setgroups(nil)" comparison: /proc/32397/status Groups:	 (bad)
--- FAIL: TestSetuidEtc (0.01s)
FAIL
Linux willow3 4.4.0-128-generic #154-Ubuntu SMP Fri May 25 14:13:59 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux

@AndrewGMorgan
Contributor Author

To summarize, with the newer glibc and a recent kernel ppc64le passes both tests with cgo enabled and without.

I am confused about what we believe is true of the ppc64 case. Is this accurate?

  1. both cases failed on the build server, for ppc64 only
  2. they never failed when I was developing the change (does the ppc64 test normally run then? None of my notes or comments in the cl refer to it.)
  3. with a newer kernel than the build server's, we have seen both tests pass on ppc64
  4. since ppc64 runs without cgo, the glibc version should not matter

@laboger
Contributor

laboger commented Oct 27, 2020

For test SetuidEtc, this appears to be related to the kernel for both ppc64 and ppc64le. The output in the /proc/pid/status file is different in older kernels and the test expects output from newer kernels. I can reproduce the failure on older kernels but not newer kernels.
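One way a test could tolerate these differences is to compare the parsed group list rather than the raw text of the line. The following is only a sketch under assumptions: `groupsFromStatus` is a hypothetical helper, not code from the syscall tests, and the variants it handles (no "Groups:" line, or a "Groups:" line followed only by whitespace) are inferred from the outputs quoted in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// groupsFromStatus extracts the supplementary group IDs from the text of a
// /proc/<pid>/status file. It reports found=false when there is no "Groups:"
// line at all, and returns an empty slice when the line carries no IDs,
// whatever whitespace follows the colon.
// Hypothetical helper for illustration only.
func groupsFromStatus(status string) (ids []string, found bool) {
	for _, line := range strings.Split(status, "\n") {
		if strings.HasPrefix(line, "Groups:") {
			// Fields drops leading, trailing and repeated whitespace.
			return strings.Fields(strings.TrimPrefix(line, "Groups:")), true
		}
	}
	return nil, false
}

func main() {
	samples := []string{
		"Name:\tsyscall.test\nGroups:\t\nPid:\t1\n",        // tab only
		"Name:\tsyscall.test\nGroups: \nPid:\t1\n",         // space only
		"Name:\tsyscall.test\nGroups:\t4 24 27 \nPid:\t1\n", // some IDs
		"Name:\tsyscall.test\nPid:\t1\n",                   // line missing
	}
	for _, s := range samples {
		ids, found := groupsFromStatus(s)
		fmt.Printf("found=%v ids=%q\n", found, ids)
	}
}
```

With this shape, a Setgroups(nil) check would accept either a missing line or an empty list, instead of depending on the exact bytes a given kernel writes.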

I have not been able to reproduce the AllThreadsSyscall failure on any system I've tried so far.

@AndrewGMorgan
Contributor Author

OK. Upgrading the kernel on the build server is out of my hands, but presumably that can be done? We have some confidence that this will address at least one of the two issues.

For the AllThreads case, the only things I can think of about the build server are:

  • it might be an old kernel thing too
  • or is it possibly using seccomp to filter prctl() syscalls in some way that is interfering with the test?

@laboger
Contributor

laboger commented Oct 27, 2020

I think the test for SetuidEtc should be fixed, because shouldn't it still work on older kernels? I just tried it on my laptop and got the same error:

[root@oc5561066826 syscall]# ./syscall.test -test.run=SetuidEtc
--- FAIL: TestSetuidEtc (0.01s)
    syscall_linux_test.go:630: [7] "Setgroups(nil)" comparison: /proc/8750/status Groups:	 (bad)
FAIL
[root@oc5561066826 syscall]# uname -a
Linux oc5561066826.ibm.com 3.10.0-1127.13.1.el7.x86_64 #1 SMP Fri Jun 12 14:34:17 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

I'm trying to find out about the seccomp question.

@laboger
Contributor

laboger commented Oct 27, 2020

For the AllThreads case, the only things I can think of about the build server is that:

it might be an old kernel thing too
or is it possibly using seccomp to filter prctl() syscalls in some way that is interfering with the test?

Can't be an old kernel thing because I tried the AllThreads test on ppc64 systems with old kernels and they all passed (not in a container).

I built a container on a ppc64le and ran the syscall tests and they passed there, so it couldn't be a seccomp issue unless there is some special seccomp setting being used on ppc64 but not ppc64le.

We (IBM) don't support Docker for ppc64 so I'm not able to try running it in a container.

@ianlancetaylor
Contributor

The tests should in principle run on any Linux kernel 2.6.23 or higher. It's fine to skip the test if the Linux kernel version is too old. There is an example of a test that checks the kernel version in syscall/exec_linux_test.go: TestAmbientCaps.
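As a rough illustration of that kind of guard (this is a sketch, not the code in TestAmbientCaps), a test could parse the kernel release string and skip when it is below a minimum. `kernelAtLeast` is a hypothetical helper; obtaining the release string (for example from uname) is left out so the example stays portable, and the sample strings are the ones quoted in this thread.

```go
package main

import "fmt"

// kernelAtLeast reports whether a kernel release string such as
// "4.4.0-130-powerpc64-smp" is at least major.minor.patch.
// Anything after the leading numeric fields is ignored.
// Hypothetical helper for illustration only.
func kernelAtLeast(release string, major, minor, patch int) bool {
	var kMaj, kMin, kPat int
	// Sscanf stops at the first character that does not fit the format,
	// so distro suffixes like "-130-powerpc64-smp" are harmless.
	n, _ := fmt.Sscanf(release, "%d.%d.%d", &kMaj, &kMin, &kPat)
	if n < 2 {
		return false // could not parse at least major.minor
	}
	if kMaj != major {
		return kMaj > major
	}
	if kMin != minor {
		return kMin > minor
	}
	return kPat >= patch
}

func main() {
	for _, rel := range []string{
		"4.4.0-130-powerpc64-smp",
		"3.10.0-1127.13.1.el7.x86_64",
		"5.4.0-2-powerpc64",
	} {
		fmt.Printf("%s >= 2.6.23: %v\n", rel, kernelAtLeast(rel, 2, 6, 23))
	}
}
```

In a real test the false branch would call t.Skip with a message naming the required kernel version.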

@laboger
Contributor

laboger commented Oct 28, 2020

I don't set up the Go builders for ppc64/ppc64le and I don't have access to run on them. But I do have access to several similar ppc64 machines to test and run on. I was able to find a ppc64 machine running Debian where I can reproduce the failure in AllThreads without running in a container. This is running a newer kernel and while trying to debug this I found that the failure is intermittent. If I try to debug with gdb it doesn't fail. By adding some creative panic messages it reports that the r2 value is wrong when returning from the syscall and that causes the failure. I also found the same failure on a Fedora 28 ppc64 machine if I set -test.count=2000. With lower test count values it fails but less often, with count=1 I couldn't get it to fail.

The kernels where this happens are relatively new:
Linux willow11 5.0.16-100.fc28.ppc64 #1 SMP Tue May 14 17:55:15 UTC 2019 ppc64 ppc64 ppc64 GNU/Linux
Linux yanny3 5.4.0-2-powerpc64 #1 SMP Debian 5.4.8-1 (2020-01-05) ppc64 GNU/Linux

@AndrewGMorgan
Contributor Author

Cool. This is good info. It suggests there is something subtle going on. (I've done 10k runs on x86 and Arm machines without failure.)

Are you confident that this same level of runs on ppc64le doesn't similarly fail?

I have an access key now for running tests on the ppc64, so I'll try to reproduce the failure(s) myself and work on them.

@AndrewGMorgan
Contributor Author

Mystery. Removing the workaround, both of these tests pass for me on linux-ppc64-buildlet.

I'll try some 10000 runs.

@laboger
Contributor

laboger commented Oct 28, 2020

I did get it to fail on ppc64le. Using Ubuntu/Debian seems to be the key.
Fails:
5.4.0-21-generic #25-Ubuntu SMP Sat Mar 28 13:10:37 UTC 2020 ppc64le power9
4.15.0-91-generic #92-Ubuntu SMP Fri Feb 28 11:08:26 UTC 2020 ppc64le power9
4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:09 UTC 2020 ppc64le power8

Couldn't make it fail:
4.18.0-193.19.1.el8_2.ppc64le #1 SMP Wed Aug 26 15:13:15 EDT 2020 ppc64le power9
4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:09 UTC 2020 ppc64le power9 (weird? same as above but this is power9)

In all cases I've been using -test.cpu=2 since that's what the builders have. And I usually have to use a count >=200 to make it fail.

@AndrewGMorgan
Contributor Author

[Pilot error on the reproduction front - getting up to speed on gomote, can reproduce both failures.]

@AndrewGMorgan
Contributor Author

So, I guess this should have been obvious, but because the ppc64 build is sans cgo, both of these tests end up using the AllThreadsSyscall and thus both fail due to this.

I have a fix for the Setgroups() specific extra failure with proc file parsing. It is not significantly better than not running the test for now. So I'll hold off on a commit.

I'll see if I can get the AllThreadsSyscall thing characterized before deciding whether I should do two commits or one combined commit to resolve this bug.

@gopherbot

Change https://golang.org/cl/266202 mentions this issue: syscall: address Linux AllThreadsSyscall() bug on ppc64

@AndrewGMorgan
Contributor Author

[It is awesome to have great build and test infrastructure!]

Lynn, it looks like you are right. The r2 return value on ppc64 looks suspicious. I added some instrumentation to the doSyscall() functions, just before the panic() and the odd r2 comparison stood out:

print("trap:", pc.trap, ", a123=[", pc.a1, ",", pc.a2, ",", pc.a3, "]\n")
print("results: got {r1=", r1, ",r2=", r2, ",err=", err, "}, want {r1=", pc.r1, ",r2=", pc.r2, ",r3=", pc.err, "}\n")

Stuff like this:

trap:171, a123=[8,1,0]
results: got {r1=0,r2=1,err=0}, want {r1=0,r2=578712,r3=0}
panic: AllThreadsSyscall6 results differ between threads; runtime corrupted
fatal error: panic on system stack

This caused me to read up (for the first time) on what that value is supposed to be. It turns out that it is architecture-specific. "man syscall" on my workstation lists this (we're interested in Ret/val2):

       Arch/ABI    Instruction           System  Ret  Ret  Error    Notes
                                         call #  val  val2
       ───────────────────────────────────────────────────────────────────
       alpha       callsys               v0      v0   a4   a3       1, 6
       arc         trap0                 r8      r0   -    -
       arm/OABI    swi NR                -       r0   -    -        2
       arm/EABI    swi 0x0               r7      r0   r1   -
       arm64       svc #0                w8      x0   x1   -
       blackfin    excpt 0x0             P0      R0   -    -
       i386        int $0x80             eax     eax  edx  -
       ia64        break 0x100000        r15     r8   r9   r10      1, 6
       m68k        trap #0               d0      d0   -    -
       microblaze  brki r14,8            r12     r3   -    -
       mips        syscall               v0      v0   v1   a3       1, 6
       nios2       trap                  r2      r2   -    r7
       parisc      ble 0x100(%sr2, %r0)  r20     r28  -    -
  =>   powerpc     sc                    r0      r3   -    r0       1
  =>   powerpc64   sc                    r0      r3   -    cr0.SO   1
       riscv       ecall                 a7      a0   a1   -
       s390        svc 0                 r1      r2   r3   -        3
       s390x       svc 0                 r1      r2   r3   -        3
       superh      trap #0x17            r3      r0   r1   -        4, 6
       sparc/32    t 0x10                g1      o0   o1   psr/csr  1, 6
       sparc/64    t 0x6d                g1      o0   o1   psr/csr  1, 6
       tile        swint1                R10     R00  -    R01      1
       x86-64      syscall               rax     rax  rdx  -        5
       x32         syscall               rax     rax  rdx  -        5
       xtensa      syscall               a2      a2   -    -

So Ret val2 is defined on all of the architectures I had tested on, but apparently not on ppc64x (the "-" in the two marked rows). Live and learn.
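The table suggests that a cross-thread comparison should only look at r2 where the ABI defines it. Here is a minimal sketch of that idea; `honorsR2`, `sameResult` and the `result` type are hypothetical names for illustration and are not taken from the patch. Only the two marked powerpc64 cases are encoded; other architectures with a "-" in that column are left out for brevity.

```go
package main

import "fmt"

// result holds the three values a raw syscall hands back.
type result struct {
	r1, r2 uintptr
	errno  uintptr
}

// honorsR2 reports whether the syscall ABI for goarch defines a second
// return value. Hypothetical helper: it encodes only the ppc64/ppc64le
// rows marked in the "man syscall" table.
func honorsR2(goarch string) bool {
	switch goarch {
	case "ppc64", "ppc64le":
		return false
	}
	return true
}

// sameResult compares two per-thread syscall results, ignoring r2 on
// architectures where its contents are undefined.
func sameResult(goarch string, got, want result) bool {
	if got.r1 != want.r1 || got.errno != want.errno {
		return false
	}
	return !honorsR2(goarch) || got.r2 == want.r2
}

func main() {
	// Values from the instrumented failure shown earlier in this comment.
	got := result{r1: 0, r2: 1, errno: 0}
	want := result{r1: 0, r2: 578712, errno: 0}

	fmt.Println("ppc64:", sameResult("ppc64", got, want))
	fmt.Println("amd64:", sameResult("amd64", got, want))
}
```

With the values from the failing run, the ppc64 comparison no longer trips on the stray r2, while an architecture that does define r2 would still flag the mismatch.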

To address this and the Linux /proc/<PID>/status file format and longevity issues, I've prepared a patch:

https://go-review.googlesource.com/c/go/+/266202
