
testing: provide a way to work around "race: limit on 8128 simultaneously alive goroutines is exceeded" error #47056

Closed
rogpeppe opened this issue Jul 5, 2021 · 17 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Comments

@rogpeppe
Contributor

rogpeppe commented Jul 5, 2021

$ go version
go version devel go1.17-dc00dc6c6b Thu Jun 10 12:41:37 2021 +0000 linux/amd64

Issue #38184 seems to be a reasonably hard limit in the race detector library, but it's not always easy to work around it.
This issue most commonly arises in higher-level tests that run slowly and hence use testing.T.Parallel to speed themselves up. But this means that if a large instance is used to run the tests (for example in continuous integration), GOMAXPROCS is greater, so more tests run in parallel and more goroutines are simultaneously alive, making this error more likely.

There are a few possibilities for working around this at present, although none of them feels great:

  • avoid calling testing.T.Parallel when the race detector is enabled, but this disables parallelism completely, which feels like overkill.
  • run some packages with a different value for the -test.parallel flag, but this means that the top level CI script needs to be changed (you'll no longer be able to run go test -race ./...) which has maintainability issues.
  • update the value of -test.parallel within a given package by using flag.Lookup and setting the flag directly, but this is a nasty hack.

Ideally, we'd be able to configure the maximum number of active goroutines allowed, but failing that, another possibility might be to make the default value of testing.parallel come from runtime.GOMAXPROCS after TestMain has run rather than before, allowing a test package to adjust the amount of test parallelism itself. Another idea (courtesy of @mvdan) is that the testing package could avoid starting new parallel tests when the number of goroutines grows large.

@mvdan
Member

mvdan commented Jul 5, 2021

Another idea (courtesy of @mvdan) is that the testing package could avoid starting new parallel tests when the number of goroutines grows large.

To expand a little: with the race detector enforcing a limit of 8k (active?) goroutines, presumably the testing package could refuse to start new parallel tests or sub-tests if we are currently over 50% of that limit of (active?) goroutines.

That would still leave the "limit exceeded" error as a possibility, but it would presumably make them far less likely, and should only make hugely parallel tests slightly slower. And, best of all, it should allow go test -race ./... to usually do the right thing without extra knobs.

@mknyszek
Contributor

mknyszek commented Jul 7, 2021

@mvdan That kind of heuristic could be effective in the general case, but could end up more confusing in cases where tests themselves spawn many goroutines. I'm not really sure what to do here besides trying to make the race detector more robust (I think we haven't kept up with TSAN...? Don't quote me on that.).

@mknyszek mknyszek added this to the Backlog milestone Jul 7, 2021
@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jul 7, 2021
@randall77
Contributor

I believe this is a limitation of the TSAN library itself: it has a maximum thread limit. We do pull TSAN somewhat periodically; the last pull was 2020-10-20.
The max thread limit is somewhat more painful for Go because goroutines are lightweight, so people make far more of them than they would C++ threads.

All that being said, there's no way to fix this in the Go project directly. Someone would have to redesign TSAN somehow to handle more threads. It's probably a big lift, but @dvyukov might know more about why that limit exists and how we might increase it.

@dvyukov
Member

dvyukov commented Jul 9, 2021

I just happen to have a redesigned tsan runtime that supports an unbounded number of goroutines (it is also about twice as fast and consumes less memory). However, there is still a long road to upstreaming it.
If you feel adventurous enough, you may try:
https://go-review.googlesource.com/c/go/+/333529
or build yourself from:
dvyukov/llvm-project#4
(only linux/amd64 is supported for now)

@mvdan
Member

mvdan commented Jul 9, 2021

Getting rid of the limit entirely would be a far better solution, of course :) My "limit parallelism" suggestion was more of a temporary workaround until that happens.

@mvdan
Member

mvdan commented Sep 1, 2021

However, there is still a long road to upstreaming it.

I just used your experimental CL to help debug a very tricky memory corruption bug at work, where the software spawns tens of thousands of goroutines :)

Really, really looking forward to this new tsan runtime being upstreamed, so Go can ship with it. Is there any way I can help make that happen?

@dvyukov
Member

dvyukov commented Sep 17, 2021

Just got back from a long vacation.
Thanks, that's good to hear (and also that it did not crash :)).
Re helping... hard to say, depends on what types of work you are ready to do. Almost all work is in C++/llvm and related to rebasing, splitting, productionizing and reviewing this large change:
https://github.com/dvyukov/llvm-project/pull/3/files

@mvdan
Member

mvdan commented Sep 17, 2021

That CL's HEAD did end up failing after a few days, for what it's worth :)

ThreadSanitizer: CHECK failed: sanitizer_common.h:493 "((i)) < ((size_))" (0x100a9fc, 0x79e5ac) (tid=3701513)

@dvyukov
Member

dvyukov commented Sep 17, 2021

Nothing immediately comes to mind. I would need a reproducer to debug.

@mvdan
Member

mvdan commented Sep 17, 2021

The code is open source, but it's a large daemon that ran for days, so there's no reproducer. Not worth pursuing for now, I think; I just wanted to bring it up in case it was helpful on its own.

@seebs
Contributor

seebs commented Oct 18, 2021

I seem to be hitting something which isn't this, but might be vaguely related: sometimes go test -race will die in very strange ways, with tracebacks that always show a corrupted stack going back to an errgroup.Go or something similar, which I guess is not ultimately all that surprising. One specific test was hitting it every time, so I took the errgroup usage out of that test; that test then passed, but a later test failed the same way.

Nothing even close to a viable reproducer yet, unfortunately. I originally saw this with 1.16.9; with 1.17.2 I get similar failures, but the tracebacks can include 3000 lines of fatal error: runtime.unlock: lock count.

@gopherbot

Change https://golang.org/cl/333529 mentions this issue: runtime/race: update runtime (v3)

@dvyukov
Member

dvyukov commented Nov 22, 2021

I've uploaded a new version at https://go-review.googlesource.com/c/go/+/333529. A number of bugs were fixed. Testing is welcome.

@ernado
Contributor

ernado commented Dec 19, 2021

Actually, this error can be triggered even without using testing.T.Parallel: t.Run already spawns a goroutine:

go/src/testing/testing.go

Lines 1482 to 1493 in c5fee93

	// Instead of reducing the running count of this test before calling the
	// tRunner and increasing it afterwards, we rely on tRunner keeping the
	// count correct. This ensures that a sequence of sequential tests runs
	// without being preempted, even when their parent is a parallel test. This
	// may especially reduce surprises if *parallel == 1.
	go tRunner(t, f)
	if !<-t.signal {
		// At this point, it is likely that FailNow was called on one of the
		// parent tests by one of the subtests. Continue aborting up the chain.
		runtime.Goexit()
	}
	return !t.failed

Looks like I've written too many subtests for -race:

$ git clone https://github.com/go-faster/ch.git && cd ch && git checkout bde7621d
$ go test -race ./proto
race: limit on 8128 simultaneously alive goroutines is exceeded, dying
FAIL	github.com/go-faster/ch/proto	0.427s
FAIL

UPD: I'm not sure about the cause; still investigating. Probably init()s?

@ernado
Contributor

ernado commented Dec 19, 2021

$ go test -trace trace.out .
Goroutines:
runtime.gcBgMarkWorker N=32
testing.(*T).Run·dwrap·21 N=10
github.com/klauspost/compress/zstd.newBlockDec·dwrap·1 N=19735
runtime.bgsweep N=1
runtime.main N=1
runtime/trace.Start.func1 N=1
runtime.bgscavenge N=1
N=11

Oh, now I see that N=19735.

ernado added a commit to ClickHouse/ch-go that referenced this issue Dec 19, 2021
The zstd.NewReader function spawns multiple goroutines.

Ref: golang/go#47056
@dvyukov
Member

dvyukov commented Mar 28, 2022

The linux/amd64 version of the race runtime has been submitted.
Besides removing the goroutine limit, it should also be roughly 2x faster and consume less memory.
It would be great to hear any success stories. It should also be much more suitable for running end-to-end system tests and server canaries.


8 participants