Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime, x/net/internal/socket: thread sanitizer failing on ppc64le #35547

Closed
randall77 opened this issue Nov 12, 2019 · 19 comments
Closed

runtime, x/net/internal/socket: thread sanitizer failing on ppc64le #35547

randall77 opened this issue Nov 12, 2019 · 19 comments
Labels
arch-ppc64x FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@randall77
Copy link
Contributor

Thread sanitizer is failing on linux-ppc64le-buildlet and linux-ppc64le-power9osu

The test is from the x/net repository, but I'm not sure that matters much.

From https://build.golang.org/log/7ba95758a5a8b138b85ba4a9f87a543c3177e884 :

--- FAIL: TestRace (16.94s)
    --- FAIL: TestRace/test_0 (16.04s)
        socket_test.go:363: race not detected for test 0: err:<nil> out:FATAL: ThreadSanitizer CHECK failed: ./gotsan.cpp:14108 "((personality(old_personality | ADDR_NO_RANDOMIZE))) != ((-1))" (0xffffffffffffffff, 0xffffffffffffffff)
    --- FAIL: TestRace/test_1 (0.91s)
        socket_test.go:363: race not detected for test 1: err:<nil> out:FATAL: ThreadSanitizer CHECK failed: ./gotsan.cpp:14108 "((personality(old_personality | ADDR_NO_RANDOMIZE))) != ((-1))" (0xffffffffffffffff, 0xffffffffffffffff)
FAIL
FAIL	golang.org/x/net/internal/socket	16.953s

Some internal check in tsan is failing, causing this test to fail.
We claim to support the race detector in this config, so this test should pass.

@laboger

@laboger
Copy link
Contributor

laboger commented Nov 13, 2019

Once again, I am not able to reproduce this failure. Maybe I'm not building and running this like is done with the buildlet.

I checked out the latest golang.org/x/net and rebuilt it with upstream Go. The tests all passed on my debian buster power9 machine.

I see the test says it is testing in module mode. Not exactly sure what that means. Is there documentation on how to run all the golang.org/x repos? I did copy the old go.mod file it was using and set up GOMOD to point to it and it still passed.

@bradfitz bradfitz added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker labels Nov 13, 2019
@bradfitz bradfitz added this to the Go1.14 milestone Nov 13, 2019
@randall77
Copy link
Contributor Author

I hate unreproducible bugs :(

Try checking out Go and x/net from the two hashes listed in that error log.
Then maybe try using a gomote instead of your machine.

I don't think this failure would have anything to do with modules.

Maybe just try writing a helloworld race detector example and see if it works?

@laboger
Copy link
Contributor

laboger commented Nov 13, 2019

I ran all the race tests I know about in golang. They all passed on my power8 and power9.

in runtime/race:
go test -race


Passed 348 of 348 tests (100.00%, 0+, 0-)
0 expected failures (0 has not fail)
--- PASS: TestRace (6.42s)
=== RUN   TestIssue8102
--- PASS: TestIssue8102 (0.00s)
=== RUN   TestIssue9137
--- PASS: TestIssue9137 (0.00s)
=== RUN   TestNonGoMemory
--- PASS: TestNonGoMemory (0.00s)
=== RUN   TestRandomScheduling
--- PASS: TestRandomScheduling (0.01s)
PASS
ok  	runtime/race	15.214s

In golang.org/x/net at the commit from the failure:

:~/gocode/src/golang.org/x/net/internal/socket$ go test -test.v
=== RUN   TestSocket
=== RUN   TestSocket/Option
--- PASS: TestSocket (0.01s)
    --- PASS: TestSocket/Option (0.01s)
=== RUN   TestControlMessage
--- PASS: TestControlMessage (0.00s)
=== RUN   TestUDP
=== RUN   TestUDP/Message
=== RUN   TestUDP/Messages
--- PASS: TestUDP (0.00s)
    --- PASS: TestUDP/Message (0.00s)
    --- PASS: TestUDP/Messages (0.00s)
=== RUN   TestRace
=== RUN   TestRace/test_0
=== RUN   TestRace/test_1
--- PASS: TestRace (1.98s)
    --- PASS: TestRace/test_0 (0.98s)
    --- PASS: TestRace/test_1 (0.99s)
PASS
ok  	golang.org/x/net/internal/socket	1.997s
boger@yanny6:~/gocode/src/golang.org/x/net/internal/socket$ uname -a
Linux yanny6 4.19.0-5-powerpc64le #1 SMP Debian 4.19.37-5+deb10u1 (2019-07-19) ppc64le GNU/Linux
go tool dist test race

##### Testing race detector
ok  	runtime/race	7.897s
ok  	flag	0.025s
ok  	net	0.104s
ok  	os	0.115s
ok  	os/exec	0.083s
ok  	encoding/gob	0.036s
ok  	flag	0.028s
ok  	os/exec	0.077s

ALL TESTS PASSED (some were excluded)

I do not have gomote access.
Isn't the error message about personality related to some kind of system or kernel setting?

@laboger
Copy link
Contributor

laboger commented Nov 13, 2019

If I look at the code from LLVM from the error output I see this:

#elif SANITIZER_PPC64V2
  // Disable ASLR for Linux PPC64LE.
  int old_personality = personality(0xffffffff);
  if (old_personality != -1 && (old_personality & ADDR_NO_RANDOMIZE) == 0) {
    VReport(1, "WARNING: Program is being run with address space layout "
               "randomization (ASLR) enabled which prevents the thread and "
               "memory sanitizers from working on powerpc64le.\n"
               "ASLR will be disabled and the program re-executed.\n");
    CHECK_NE(personality(old_personality | ADDR_NO_RANDOMIZE), -1);
    ReExec();
  }

So I believe that error message is saying the test is supposed to be able to set the personality to ADDR_NO_RANDOMIZE but that failed.

@ceseo
Copy link
Contributor

ceseo commented Nov 13, 2019

Can someone disable ASLR in the ppc64le builders (i.e. echo 0 | sudo tee /proc/sys/kernel/randomize_va_space) and see if it works? I can't reproduce on VMs or bare metal systems and I don't have a Docker instance available at the moment.

@randall77
Copy link
Contributor Author

Interesting. So that check is failing, so it fails to re-exec itself (which presumably would work?). I don't understand what that check is trying to achieve - is it just that -1 is a special case meaning "failure"? Anyone have the docs to the personality function?

Instead of fixing the reexec in the race detector, we could set ADDR_NO_RANDOMIZE on the binary that go build -race (and go run -race) generates.

@laboger
Copy link
Contributor

laboger commented Nov 14, 2019

Look up the documentation for the Linux personality syscall (you can google it). In this case the value of ADDR_NO_RANDOMIZE is not a setting on the binary but a setting for the personality of the process. If the syscall returns -1 (EINVAL) that means the kernel could not change the personality.

To me it looks like even though the syscall fails it continues to try to ReExec regardless of the syscall result but the comments say the race detector doesn't work in ASLR mode (due to unexpected address ranges) and in this test it is supposed to detect a race condition but does not so fails.

I don't know why it would fail to change the personality in this case -- it must work in some cases because this code is executed whenever using race on ppc64le and distros have had ASLR enabled for a while. Maybe if we had the errno that would help.

@laboger
Copy link
Contributor

laboger commented Nov 18, 2019

OK I think I found out the issue. When a Docker container is set up, there is a default seccomp security profile with a list of syscalls that are not allowed. The personality syscall is on that list. There is a way to define the container so it allows this syscall.

@laboger
Copy link
Contributor

laboger commented Dec 4, 2019

Can I get gomote access so I can try to understand this problem a bit more. It is curious as to why some race tests work fine (I think they must all have to do the personality syscall) but this one does not. Also I am not able to reproduce it.

@dmitshur
Copy link
Contributor

dmitshur commented Jan 7, 2020

Sorry for the delay in getting back to you @laboger. I wanted to update you that we're not able to give out gomote access at this time, unfortunately. We're still working on fixing that, but it will take more time.

I realize it's hard to make progress on this issue because it doesn't reproduce easily outside the ppc64le builder, and I'll try to look into what else we can do in the meantime to help make investigating this easier. I just wanted to share this update for now.

@laboger
Copy link
Contributor

laboger commented Jan 7, 2020

Since my last post I was able to reproduce this in a container. I have confirmed the theory that was mentioned in my post on Nov. 18.

When running LLVM's thread sanitizer code (used by Go's race detector) on a system with ASLR enabled, the address mappings needed for the thread sanitizer are not always usable on ppc64le so it calls the personality syscall to disable ASLR. Outside of a container this works fine, but within a container that is running with the default seccomp profile, it fails as in this testcase. This is because the default seccomp profile restricts the use of the personality syscall.

One solution would be to have an alternate security profile which allows the personality syscall and then use that profile for the container where the testing is done.

If you don't care about the seccomp profile for the container then the container can be run using --security-opt seccomp=unconfined. The test works if that option is used.

This does not fail on amd64 because they don't have the same problem with address ranges apparently. It is needed on arm64 but I don't think they have the race detector working for Go yet.

@aclements
Copy link
Member

Thanks for figuring that out @laboger !

It sounds to me like there's nothing Go can do to fix or work around this (is that true?). If so, maybe we can add something to the documentation and resolve this bug?

@bradfitz
Copy link
Contributor

bradfitz commented Jan 9, 2020

@aclements, it sounds to me like we need to add "--security-opt", "seccomp=unconfined", to x/build/cmd/rundockerbuildlet.

/cc @toothrot @bcmills

@laboger
Copy link
Contributor

laboger commented Jan 10, 2020

@aclements, it sounds to me like we need to add "--security-opt", "seccomp=unconfined", to x/build/cmd/rundockerbuildlet.

Yes, I think that would be the easiest for now. There could be something documented to state that the use of the race detector on ppc64le within a container requires this option when starting the container.

@dmitshur dmitshur changed the title runtime: thread sanitizer failing on ppc64le runtime, x/net/internal/socket: thread sanitizer failing on ppc64le Jan 15, 2020
@gopherbot
Copy link

Change https://golang.org/cl/214919 mentions this issue: cmd/rundockerbuildlet: set --security-opt=seccomp=unconfined in containers

gopherbot pushed a commit to golang/build that referenced this issue Jan 15, 2020
…iners

This fixes race tests; the thread sanitizer needs to check its
personality, which seccomp defaults prevent apparently.

Updates golang/go#35547 (needs to be deployed first, then bug can be closed)

Change-Id: I8b87618f63ef2b7a75b72290098c09bf04298d86
Reviewed-on: https://go-review.googlesource.com/c/build/+/214919
Reviewed-by: Alexander Rakoczy <alex@golang.org>
Run-TryBot: Alexander Rakoczy <alex@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@toothrot
Copy link
Contributor

I've redeployed rundockerbuildlet for these builders. We ought to see a difference the next test run for x/net.

@gopherbot
Copy link

Change https://golang.org/cl/214979 mentions this issue: test: trybot test

@laboger
Copy link
Contributor

laboger commented Feb 24, 2020

The thread sanitizer failure was fixed on ppc64le with CL 214919.

@toothrot toothrot modified the milestones: Go1.14, Go1.15 Feb 25, 2020
@ianlancetaylor
Copy link
Contributor

The discussion above suggests that this issue is fixed, so closing. Please comment if you disagree.

cyb70289 added a commit to cyb70289/tricks that referenced this issue Mar 17, 2021
@golang golang locked and limited conversation to collaborators Jun 15, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-ppc64x FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

9 participants