Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: "program exceeds 50-thread limit" in bootstrap build for Go 1.9 RC2 on arm64 on Centriq 2400 #21559

Closed
vielmetti opened this issue Aug 22, 2017 · 31 comments
Labels
CherryPickApproved Used during the release process for point releases FrozenDueToAge NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made.
Milestone

Comments

@vielmetti
Copy link

vielmetti commented Aug 22, 2017

What version of Go are you using (go version)?

go version go1.9rc2 linux/arm64
bootstrap compiler is go version go1.6.2 linux/arm64

Machine is

$ uname -a
Linux lab-centriq 4.7.0-2-generic #5~pdaw1.1+bandera.6-Ubuntu SMP Wed Feb 8 23:10:54 UTC 2017 aarch64 aarch64 aarch64 GNU/Linux

Does this issue reproduce with the latest release?

This issue does not reproduce on the Cavium ThunderX hardware, which passes all tests without flaws.

What operating system and processor architecture are you using (go env)?

./go env
GOARCH="arm64"
GOBIN=""
GOEXE=""
GOHOSTARCH="arm64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/emv/go"
GORACE=""
GOROOT="/home/emv/src/go.googlesource.com/go"
GOTOOLDIR="/home/emv/src/go.googlesource.com/go/pkg/tool/linux_arm64"
GCCGO="gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build997434454=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"

What did you do?

git checkout go1.9rc2
cd src
./all.bash

If possible, provide a recipe for reproducing the error.
A complete runnable program is good.
A link on play.golang.org is best.

What did you expect to see?

All tests pass

What did you see instead?

runtime: program exceeds 50-thread limit
fatal error: thread exhaustion
FAIL    os      0.509s 
Failed: exit status 1

Complete output at https://gist.github.com/017dbb7fa20d184d66a8d9ea4bf61225

@vielmetti
Copy link
Author

From reading the test code, I think the error is showing up in

// Test that reading from a pipe doesn't use up a thread.
func TestPipeThreads(t *testing.T) {

@vielmetti vielmetti changed the title "runtime: program exceeds 50-thread limit" in bootstrap build for Go 1.9 RC2 "runtime: program exceeds 50-thread limit" in bootstrap build for Go 1.9 RC2 on arm64 Aug 22, 2017
@vielmetti vielmetti changed the title "runtime: program exceeds 50-thread limit" in bootstrap build for Go 1.9 RC2 on arm64 "runtime: program exceeds 50-thread limit" in bootstrap build for Go 1.9 RC2 on arm64 on Centriq 2400 Aug 22, 2017
@vielmetti
Copy link
Author

Note that all tests pass on the Cavium ThunderX, but this one test fails to pass on the Centriq system. Exploring further to see if I can get different results.

@ianlancetaylor
Copy link
Contributor

Is this easily repeatable? If so, please run go test -c os to get an os.test binary, then run ./os.test --testing.short to verify that the problem still occurs, and attach the output of strace -f ./os.test --testing.short. Thanks.

@ianlancetaylor ianlancetaylor added this to the Go1.10 milestone Aug 24, 2017
@ianlancetaylor ianlancetaylor added Testing An issue that has been verified to require only test changes, not just a test failure. release-blocker labels Aug 24, 2017
@vielmetti
Copy link
Author

Odd. It's reproducible from the ./all.bash shell script, but when I make and run the the os.test binary, it passes.

@vielmetti
Copy link
Author

vielmetti commented Aug 24, 2017

When I download the binaries for go 1.9 rc2 they pass the os.test test, but when I use those binaries to bootstrap the go 1.9 rc2 sources, the all.bash script gets this failure - only on the Centriq system.

@ianlancetaylor
Copy link
Contributor

Is there any difference in the output of ulimit -u on the succeeding and failing systems?

@vielmetti
Copy link
Author

ulimit -u on the ThunderX: 515386

ulimit -u on Centriq: 385905

Both should be plenty high for any reason.

@ianlancetaylor ianlancetaylor changed the title "runtime: program exceeds 50-thread limit" in bootstrap build for Go 1.9 RC2 on arm64 on Centriq 2400 runtime: "program exceeds 50-thread limit" in bootstrap build for Go 1.9 RC2 on arm64 on Centriq 2400 Aug 25, 2017
@ianlancetaylor
Copy link
Contributor

I haven't been able to think of any reason why this test would behave differently when invoked from all.bash. I can't recall seeing any other reports of this problem. The test is fairly simple, and is basically testing that the pipes are handled by the internal poller. If we could capture it standalone, then running the test under strace -f would likely show the problem. But running all.bash under strace -f will produce too much irrelevant information.

@vielmetti
Copy link
Author

I'm also puzzled. The only think I can possibly think of is that this machine is running a unique kernel, and that we're planning to update the kernel to something closer to mainline. I'll put this on hold for now, and rerun it the next time there's a new kernel.

@ianlancetaylor ianlancetaylor added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Aug 25, 2017
@valyala
Copy link
Contributor

valyala commented Aug 27, 2017

This may be related to #21621 and #21550 during compilation of a package with many files. As I uderstand, currently each file may be concurrently opened and read in a separate OS thread if the corresponding file I/O syscalls (cgo calls) are slow.

@ianlancetaylor
Copy link
Contributor

@valyala In this case the calls should not be slow, because when using a pipe all I/O should be done using non-blocking calls.

@vielmetti
Copy link
Author

OK, adding to the puzzle:

git checkout go1.9
cd src
./all.bash

just worked perfectly on the same machine, no errors; building master also gave me no errors. But building go1.9rc2 did again error out on os just the same way as it did originally.

Curious!

@orivej
Copy link

orivej commented Sep 13, 2017

I can easily reproduce this on a 56-threaded Intel Xeon E5-2660 v4 Linux. (Out of 100 consecutive runs of os.test --test.short built from go master, 9 panicked with this error.) Unfortunately, the problem vanishes under strace.

@orivej
Copy link

orivej commented Sep 13, 2017

Here is a test panic from that machine, if it may be of any use: os.txt.

@orivej
Copy link

orivej commented Sep 13, 2017

In fact this reproduces only when CPU is under some load. Stressing CPU makes it reproduce practically every time, whereas without any load the test may always succeed. I used the following script to bisect go history, c05b06a is the first bad commit.

#!/bin/bash
set -e
cd src
./make.bash
cd os
../../bin/go test -a -c os
stress-ng -t 60 -c 60 &
trap 'killall stress-ng || true' EXIT
for i in `seq 100`; do
    ./os.test --test.short
done

@ianlancetaylor
Copy link
Contributor

Thanks for the bisect. That is the commit that introduced the failing test in the first place.

@ianlancetaylor ianlancetaylor removed the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Sep 13, 2017
@ianlancetaylor
Copy link
Contributor

Can you generate a failure with GOTRACEBACK=system set in the environment and show us the stack dump? Thanks.

@orivej
Copy link

orivej commented Sep 13, 2017

Here is one: os.txt (go is at the current master, 9593b74).

@gopherbot
Copy link

Change https://golang.org/cl/63650 mentions this issue: os: avoid crashing with a thundering herd in TestPipeThreads

@ianlancetaylor
Copy link
Contributor

Sorry for missing that you attached that before.

The stack trace suggests that the problem is not with the test itself, it is with the cleanup when the test is done. It looks like we can get a thundering herd as we stop all the goroutines.

Can you see if https://golang.org/cl/63650 fixes the problem for you? Thanks.

@orivej
Copy link

orivej commented Sep 13, 2017

When I repeat the test under load (and it fails every time), the number of goroutines in the dump typically is either 8-10, or 107-108. Above was an example with 10 goroutines, and here is one with 108:
os.txt

I'll test your patch now.

@orivej
Copy link

orivej commented Sep 13, 2017

The patch fixes the issue. Thanks!

@ianlancetaylor ianlancetaylor modified the milestones: Go1.10, Go1.9.1 Sep 13, 2017
@ianlancetaylor
Copy link
Contributor

Thanks for testing it.

Marking issue as 1.9.1 to consider bringing the test fix into 1.9.1 to avoid spurious build failure reports.

@ianlancetaylor
Copy link
Contributor

Reopening to consider a backport to 1.9.1.

@rsc rsc modified the milestones: Go1.9.1, Go1.9.2 Oct 4, 2017
@rsc
Copy link
Contributor

rsc commented Oct 13, 2017

I'm confused. The test says "Test that reading from a pipe doesn't use up a thread." and it sounds like it was using up a thread on the breaking system. The rewrite of the test may just be making the test not as precise about finding problems, essentially papering over a real problem.

Can you explain why the old test was wrong? I just don't see it. The old channel+waitgroup looks functionally equivalent to the new channel+channel, and I don't see why putting Close into the goroutine (instead of a later cleanup pass) would change anything, unless Close is happening to cause extra serialization that defeats the test.

If you don't mind just disabling the test for Go 1.9.2 because you think it's a bad test for some reason, that's fine. But it seems like a worthwhile test and maybe pointing out a real problem.

@rsc rsc added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Oct 13, 2017
@ianlancetaylor
Copy link
Contributor

The test is intended to test that reading from a pipe when there is no data to read doesn't use up a thread. When data arrives, a thread is required to actually do the read. With the original test, all the goroutines would queue up to read, and all is well. Then we would write to all of them at once, and we would blow past the thread limit.

But, yeah, my fix is not well written. I think it may work because of synchronization on the cdone channel. cdone ought to be an unbuffered channel.

For 1.9 we should probably just disable the test.

@rsc
Copy link
Contributor

rsc commented Oct 13, 2017

OK. Disabling test OK for Go 1.9.2.
TODO: Send a CL on the release branch, but do not submit it. (I want to try automating the actual release branch commits.)

@rsc rsc removed the Testing An issue that has been verified to require only test changes, not just a test failure. label Oct 13, 2017
@ianlancetaylor
Copy link
Contributor

Sent https://golang.org/cl/70771 for 1.9 branch.

@gopherbot
Copy link

Change https://golang.org/cl/70771 mentions this issue: [release-branch.go1.9] os: skip TestPipeThreads as flaky for 1.9

@rsc
Copy link
Contributor

rsc commented Oct 14, 2017

CL 70771 OK for Go 1.9.2.

@rsc rsc added CherryPickApproved Used during the release process for point releases and removed release-blocker labels Oct 14, 2017
gopherbot pushed a commit that referenced this issue Oct 25, 2017
Updates #21559

Change-Id: I90fa8b4ef97c4251440270491ac4c833d76ee872
Reviewed-on: https://go-review.googlesource.com/70771
Run-TryBot: Russ Cox <rsc@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@rsc
Copy link
Contributor

rsc commented Oct 26, 2017

go1.9.2 has been packaged and includes:

The release is posted at golang.org/dl.

— golang.org/x/build/cmd/releasebot, Oct 26 21:09:08 UTC

@rsc rsc closed this as completed Oct 26, 2017
finlay pushed a commit to dragonfly-science/nixpkgs that referenced this issue Apr 29, 2018
@golang golang locked and limited conversation to collaborators Oct 26, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
CherryPickApproved Used during the release process for point releases FrozenDueToAge NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made.
Projects
None yet
Development

No branches or pull requests

7 participants
@vielmetti @orivej @rsc @valyala @ianlancetaylor @gopherbot and others