
testing: provide a way to work around "race: limit on 8128 simultaneously alive goroutines is exceeded" error #47056

Closed
rogpeppe opened this issue Jul 5, 2021 · 17 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Comments

@rogpeppe
Contributor

rogpeppe commented Jul 5, 2021

$ go version
go version devel go1.17-dc00dc6c6b Thu Jun 10 12:41:37 2021 +0000 linux/amd64

Issue #38184 seems to be a reasonably hard limit in the race detector library, but it's not always easy to work around it.
This issue most commonly arises in higher-level tests that run slowly and hence use testing.T.Parallel to speed themselves up. But this means that if a large instance is used to run the tests (for example in continuous integration), GOMAXPROCS is greater, so more tests run in parallel and more goroutines are simultaneously alive, making this error more likely.

There are a few possibilities for working around this at present, although none of them feels great:

  • avoid calling testing.T.Parallel when the race detector is enabled, but this disables parallelism completely, which feels like overkill.
  • run some packages with a different value for the -test.parallel flag, but this means that the top level CI script needs to be changed (you'll no longer be able to run go test -race ./...) which has maintainability issues.
  • update the value of -test.parallel within a given package by using flag.Lookup and setting the flag directly, but this is a nasty hack.

Ideally, we'd be able to configure the maximum number of active goroutines allowed, but failing that, another possibility might be to make the default value of testing.parallel come from runtime.GOMAXPROCS after TestMain has run rather than before, allowing a test package to adjust the amount of test parallelism itself. Another idea (courtesy of @mvdan) is that the testing package could avoid starting new parallel tests when the number of goroutines grows large.

@mvdan
Member

mvdan commented Jul 5, 2021

Another idea (courtesy of @mvdan) is that the testing package could avoid starting new parallel tests when the number of goroutines grows large.

To expand a little: with the race detector enforcing a limit of 8k (active?) goroutines, presumably the testing package could refuse to start new parallel tests or sub-tests if we are currently over 50% of that limit of (active?) goroutines.

That would still leave the "limit exceeded" error as a possibility, but it would presumably make them far less likely, and should only make hugely parallel tests slightly slower. And, best of all, it should allow go test -race ./... to usually do the right thing without extra knobs.

@mknyszek
Contributor

mknyszek commented Jul 7, 2021

@mvdan That kind of heuristic could be effective in the general case, but could end up more confusing in cases where tests themselves spawn many goroutines. I'm not really sure what to do here besides trying to make the race detector more robust (I think we haven't kept up with TSAN...? Don't quote me on that.).

@mknyszek mknyszek added this to the Backlog milestone Jul 7, 2021
@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jul 7, 2021
@randall77
Contributor

I believe this is a limitation of the TSAN library itself: it has a maximum thread limit. We do pull TSAN somewhat periodically; the last pull was 2020-10-20.
The max thread limit is somewhat more painful for Go because goroutines are lightweight, so people make far more of them than they would C++ threads.

All that being said, there's no way to fix this in the Go project directly. Someone would have to redesign TSAN somehow to handle more threads. It's probably a big lift, but @dvyukov might know more about why that limit exists and how we might increase it.

@dvyukov
Member

dvyukov commented Jul 9, 2021

I just happen to have a redesigned tsan runtime that supports an unbounded number of goroutines (it is also about twice as fast and consumes less memory). However, there is still a long road to upstreaming it.
If you feel adventurous enough, you may try:
https://go-review.googlesource.com/c/go/+/333529
or build yourself from:
dvyukov/llvm-project#4
(only linux/amd64 is supported for now)

@mvdan
Member

mvdan commented Jul 9, 2021

Getting rid of the limit entirely would be a far better solution, of course :) My "limit parallelism" suggestion was more of a temporary workaround until that happens.

@mvdan
Member

mvdan commented Sep 1, 2021

However, there is still a long road to upstreaming it.

I just used your experimental CL to help debug a very tricky memory corruption bug at work, where the software spawns tens of thousands of goroutines :)

Really, really looking forward to this new tsan runtime being upstreamed, so Go can ship with it. Is there any way I can help make that happen?

@dvyukov
Member

dvyukov commented Sep 17, 2021

Just got back from a long vacation.
Thanks, that's good to hear (and also that it did not crash :)).
Re helping... hard to say, depends on what types of work you are ready to do. Almost all work is in C++/llvm and related to rebasing, splitting, productionizing and reviewing this large change:
https://github.com/dvyukov/llvm-project/pull/3/files

@mvdan
Member

mvdan commented Sep 17, 2021

That CL's HEAD did end up failing after a few days, for what it's worth :)

ThreadSanitizer: CHECK failed: sanitizer_common.h:493 "((i)) < ((size_))" (0x100a9fc, 0x79e5ac) (tid=3701513)

@dvyukov
Member

dvyukov commented Sep 17, 2021

Nothing immediately comes to mind. I would need a reproducer to debug.

@mvdan
Member

mvdan commented Sep 17, 2021

The code is open source, but it's a large daemon that ran for days, so there's no reproducer. Not worth pursuing for now, I think; I just wanted to bring it up in case it was helpful on its own.

@seebs
Contributor

seebs commented Oct 18, 2021

I seem to be hitting something which isn't this, but might be vaguely related: sometimes go test -race will die in very strange ways, with tracebacks that always show a corrupted stack going back to an errgroup.Go or something similar, which I guess is not ultimately all that surprising. One specific test was hitting it every time, so I took the errgroup usage out of that test; that test then passed, but a later test failed the same way.

Nothing even close to a viable reproducer yet, unfortunately. I originally saw this with 1.16.9; with 1.17.2 I get similar failures, but the tracebacks can include 3000 lines of fatal error: runtime.unlock: lock count.

@gopherbot

Change https://golang.org/cl/333529 mentions this issue: runtime/race: update runtime (v3)

@dvyukov
Member

dvyukov commented Nov 22, 2021

I've uploaded a new version at https://go-review.googlesource.com/c/go/+/333529. A number of bugs were fixed. Testing is welcome.

@ernado
Contributor

ernado commented Dec 19, 2021

Actually, this error can be triggered even without using testing.T.Parallel: t.Run already spawns a goroutine:

go/src/testing/testing.go

Lines 1482 to 1493 in c5fee93

	// Instead of reducing the running count of this test before calling the
	// tRunner and increasing it afterwards, we rely on tRunner keeping the
	// count correct. This ensures that a sequence of sequential tests runs
	// without being preempted, even when their parent is a parallel test. This
	// may especially reduce surprises if *parallel == 1.
	go tRunner(t, f)
	if !<-t.signal {
		// At this point, it is likely that FailNow was called on one of the
		// parent tests by one of the subtests. Continue aborting up the chain.
		runtime.Goexit()
	}
	return !t.failed

Looks like I've written too many subtests for -race:

$ git clone https://github.com/go-faster/ch.git && cd ch && git checkout bde7621d
$ go test -race ./proto
race: limit on 8128 simultaneously alive goroutines is exceeded, dying
FAIL	github.com/go-faster/ch/proto	0.427s
FAIL

UPD: I'm not sure about the cause; still investigating. Probably init()s?

@ernado
Contributor

ernado commented Dec 19, 2021

$ go test -trace trace.out .
Goroutines:
runtime.gcBgMarkWorker N=32
testing.(*T).Run·dwrap·21 N=10
github.com/klauspost/compress/zstd.newBlockDec·dwrap·1 N=19735
runtime.bgsweep N=1
runtime.main N=1
runtime/trace.Start.func1 N=1
runtime.bgscavenge N=1
N=11

Oh, now I see that N=19735.

ernado added a commit to ClickHouse/ch-go that referenced this issue Dec 19, 2021
The zstd.NewReader function spawns multiple goroutines.

Ref: golang/go#47056
@dvyukov
Member

dvyukov commented Mar 28, 2022

The linux/amd64 version of the race runtime has been submitted.
Besides removing the goroutine limit, it should also be roughly 2x faster and consume less memory.
It would be great to hear any success stories. It should also be much more suitable for running end-to-end system tests and server canaries.


8 participants