runtime: TestStackGrowth "finalizer did not run" flakes #19381

bradfitz · 2017-03-03T15:38:41Z

@broady noticed TestStackGrowth "finalizer did not run" on freebsd-386 in #18598 (comment)

I just cleared all the Solaris builds from the dashboard and rebuilt them with 4x the CPU and they're now happy, except for:

https://build.golang.org/log/84bc7b4f864592e44e7964d8892ef2d9752f2754

--- FAIL: TestStackGrowth (49.50s)
	stack_test.go:114: finalizer did not run
FAIL
FAIL	runtime	102.054s

The text was updated successfully, but these errors were encountered:

aclements · 2017-06-07T16:16:22Z

Seems to have migrated to freebsd/arm. It's happening quite often:

$ greplogs -dashboard -md -E "finalizer did not run" -l | tail -n 10
2017-06-05T19:55:31-6a42568/freebsd-arm-paulzhol
2017-06-05T19:56:34-51711d1/freebsd-arm-paulzhol
2017-06-05T19:59:33-555d1e3/freebsd-arm-paulzhol
2017-06-05T19:59:33-555d1e3/linux-arm-arm5spacemonkey
2017-06-05T19:59:49-b1af539/freebsd-arm-paulzhol
2017-06-05T22:42:48-2d86f49/freebsd-arm-paulzhol
2017-06-05T22:42:48-2d86f49/linux-mips64
2017-06-06T15:48:54-3a27f28/freebsd-arm-paulzhol
2017-06-06T19:50:00-3c1914f/freebsd-arm-paulzhol
2017-06-06T20:20:41-a838191/freebsd-arm-paulzhol

It happens with some regularity on linux-arm-arm5spacemonkey, which has a gomote.

I haven't been able to reproduce the failure yet, though I have noticed that TestStackGrowth takes ~2.9 seconds to run on linux-arm-arm5spacemonkey, compared to 0.1 seconds on my laptop.

aclements · 2017-06-07T16:29:12Z

TestStackGrowth takes ~2.9 seconds to run on linux-arm-arm5spacemonkey, compared to 0.1 seconds on my laptop.

This appears to be because TestStackGrowth GC's ~770 times, but on linux-arm each GC scans an 8 MB heap, while on my linux/amd64 laptop the heap is under 1 MB.

aclements · 2017-06-07T17:34:37Z

This appears to be because TestStackGrowth GC's ~770 times, but on linux-arm each GC scans an 8 MB heap, while on my linux/amd64 laptop the heap is under 1 MB.

I believe this is a red herring. The 8MB comes from numerators in vlop_arm_test.go, but it's all non-scannable anyway. Adding // +build ignore to this file doesn't speed up TestStackGrowth.

aclements · 2017-06-08T16:00:46Z

There's some useful discussion of this problem at #19805 (which I'm closing as a dup). @eliasnaur summarized it as:

Increasing the timeout in stack_test.go:114 to 30 seconds or checking out 44ed88a (the parent of CL 37520) both make the problem go away.

CL 37520 made runtime.GC concurrent. It's odd that this would matter, since the test is effectively single-threaded (spread across a few synchronized goroutines). It's possible this is just slightly slower than STW GC and the ~770 GCs done between triggering the finalizer and passing the test take too long on slow hardware. Or it's possible this somehow perturbed finalizers, so the single forced GC cycle that's supposed to start the finalizer doesn't.

gopherbot · 2017-06-08T17:50:08Z

CL https://golang.org/cl/45138 mentions this issue.

This adds diagnostics so we can tell if the finalizer has started, in addition to whether or not it has finished. Updates #19381. Change-Id: Icb7b1b0380c9ad1128b17074828945511a6cca5d Reviewed-on: https://go-review.googlesource.com/45138 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Ian Lance Taylor <iant@golang.org>

aclements · 2017-06-09T14:10:32Z

Looking at the few builds on freebsd/arm since 45138 went in, all but one of them only say "finalizer did not run", which means the finalizer did start and it's just taking a long time to run.

aclements · 2017-06-09T14:18:58Z

@eliasnaur, since you can reproduce this, could I get you to do a test for me on android/arm? Could you check out 44ed88a and run the runtime test with -v to get the run time of TestStackGrowth? I'd like to know how long TestStackGrowth took before CL 37520 to get a sense for whether it was just below the threshold before or if there was a more serious regression.

eliasnaur · 2017-06-10T08:34:50Z

Of course. I ran GOOS=android GOARCH=arm ../bin/go test -short -v runtime and got (among others):

--- PASS: TestStackGrowth (3.43s)

If there's anything else you need, please ping me again.

aclements · 2017-07-10T18:33:26Z

@eliasnaur, can you reproduce this if you just comment out the t.Parallel() at the top of TestStackGrowth?

aclements · 2017-07-10T20:10:33Z

For my own record, this is what I've been using to try (unsuccessfully) to reproduce this:

VM=$(gomote create android-arm-wiko-fever)
gomote push $VM
gomote run $VM /bin/sh -c '
PATH=$PATH:$PWD/go/bin && 
cd go/src &&
./make.bash &&
cd runtime &&
go test -c &&
adb shell mkdir -p /data/local/tmp/goroot &&
adb push runtime.test /data/local/tmp/goroot/runtime.test'
gomote run $VM /bin/sh -c 'adb shell /data/local/tmp/goroot/runtime.test -test.short -test.cpu=1,2,4 -test.v' |& tee /tmp/x

I did notice once that on -cpu 4, TestStackGrowth took ~25 seconds (but did not timeout). The builder has been really uncooperative, though, so I've only managed to run that once.

eliasnaur · 2017-07-11T10:54:26Z

Initially I couldn't reproduce the problem on tip nor on the version from my #19805 (comment) comment. Then I remembered that the Android builder had a problem around the same time where stuck Go test processes would slowly clog up the device resources, slowing it down and triggering timing sensitive tests such as TestStackGrowth.

So, in summary: I can not reproduce the problem on a freshly rebooted device but I can reproduce the problem if I run CPU-heavy processes on it in parallel with the Go tests.

To answer your question, commenting out t.Parallel does help, even on a loaded device: 6 consecutive successful runs without t.Parallel and with t.Parallel the tests failed on the first run thereafter.

I since modified the android builder to reboot before every build, so I'm not sure there is more to do on this. If not, I'm sorry to have led you on a wild goose chase.

aclements · 2017-07-11T13:25:24Z

To answer your question, commenting out t.Parallel does help, even on a loaded device: 6 consecutive successful runs without t.Parallel and with t.Parallel the tests failed on the first run thereafter.

Thanks!

I think this all makes sense. The test started failing when I made runtime.GC() concurrent because it's a parallel test: before I made that change all of the expensive things in that test were STW so it wasn't really parallel, but after that change it was fighting with all of the other tests. Since concurrent GC claims 25% of the CPU, it probably started taking ~4 times longer to run.

I since modified the android builder to reboot before every build, so I'm not sure there is more to do on this. If not, I'm sorry to have led you on a wild goose chase.

Not at all. It's still failing pretty regularly on the dashboard and it sounds like we now have a solution.

gopherbot · 2017-07-11T13:31:28Z

CL https://golang.org/cl/48110 mentions this issue.

TestStackGrowth is currently a parallel test. However, it depends on a 20 second timeout, which is already dubious in a parallel test, and became really problematic on slow builders when runtime.GC switched to triggering concurrent GC instead of STW GC. Before that change, the test spent much of its time in STW GC, so it wasn't *really* parallel. After that change, it was competing with all of the other parallel tests and GC likely started taking ~4 times longer. On most builders the whole test runs in well under a second, but on the slow builders that was enough to push it over the 20 second timeout. Fix this by making the test serial. Updates #19381 (probably fixes it, but we'll have to wait and see). Change-Id: I21af7cf543ab07f1ec1c930bfcb355b0df75672d Reviewed-on: https://go-review.googlesource.com/48110 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Elias Naur <elias.naur@gmail.com> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>

aclements · 2017-07-17T18:59:29Z

This dropped from happening to a third of the commits to not happening at all with 286599f.

Thanks for the help @eliasnaur!

First observed d3f5e36 07 Mar 03:26 2017 (1482 commits ago)
Last observed  58ae050 11 Jul 14:01 2017 (53 commits ago)
0.00% chance failure is still happening
36% failure probability (515 of 1430 commits)

bradfitz · 2017-12-01T19:11:54Z

@aclements, this is back.

Here's a NetBSD one: https://build.golang.org/log/7b248c375b767afdaa5d716c075734fed740af6d

Did any recent change to runtime.GC() affect this test?

        // in finalizer                                                                                                                                                                   
        wg.Add(1)
        go func() {
                defer wg.Done()
                done := make(chan bool)
                var started uint32
                go func() {
                        s := new(string)
                        SetFinalizer(s, func(ss *string) {
                                atomic.StoreUint32(&started, 1)
                                growStack()
                                done <- true
                        })
                        s = nil
                        done <- true
                }()
                <-done
                GC()
                select {
                case <-done:
                case <-time.After(20 * time.Second):
                        if atomic.LoadUint32(&started) == 0 {
                                t.Log("finalizer did not start")
                        }
                        t.Error("finalizer did not run")
                        return
                }
        }()
        wg.Wait()

If GC blocks until finalizers are all done, should the done channel be buffered?

aclements · 2017-12-01T20:09:49Z

Finalizers are still run asynchronously. runtime.GC doesn't block for them.

I will note that the failure indicates that the finalizer did start, it's just not done yet. Maybe we should add some debugging that prints how long the finalizer has been running for and how much progress it's made so we can tell whether it's just taking a long time or if it started really late.

Here are the recent failures:

2017-09-22T20:07:52-f2ab41b/linux-386-clang
2017-09-22T20:08:03-eb32dbc/linux-386-clang
2017-09-22T20:08:16-0c01487/linux-386-clang
2017-09-27T13:48:21-90fdc45/freebsd-arm-paulzhol
2017-10-06T06:49:40-07e36af/linux-mips64
2017-10-10T01:02:27-bb0bfd0/linux-mips
2017-11-21T19:13:52-2951f90/freebsd-arm-paulzhol
2017-11-23T16:45:24-b72678f/solaris-amd64-oraclerel
2017-12-01T01:16:45-7e394a2/netbsd-amd64-8branch
2017-12-01T07:09:54-f22cf71/netbsd-amd64-8branch

Are these builders just slow/overloaded?

gopherbot · 2017-12-01T20:18:25Z

Change https://golang.org/cl/81556 mentions this issue: runtime: even more TestStackGrowth timeout debugging

bradfitz · 2017-12-01T20:44:31Z

@aclements, the arm and mips builders might be slow, but shouldn't be overloaded. No clue about solaris. But netbsd-amd64 is a fresh isolated VM per build, and linux-386-clang is a fresh kubernetes pod per build. It shouldn't be oversubscribed or anything.

This adds logging for the expected duration of a growStack, plus progress information on the growStack that timed out. Updates #19381. Change-Id: Ic358f8350f499ff22dd213b658aece7d1aa62675 Reviewed-on: https://go-review.googlesource.com/81556 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>

ianlancetaylor · 2018-07-08T23:56:38Z

All the recent occurrences seem to be on emulators:

2018-06-19T16:35:58-f125052/android-386-emulator
2018-06-19T17:52:23-707ca18/android-amd64-emulator
2018-06-19T20:59:10-c6e455b/android-amd64-emulator
2018-06-19T21:36:56-500293d/android-amd64-emulator
2018-06-20T16:04:33-578e066/android-386-emulator
2018-06-20T16:55:30-3ca5b7c/android-386-emulator
2018-06-20T17:17:37-dd0e7a9/android-amd64-emulator
2018-06-21T22:13:34-24fb2e0/android-amd64-emulator
2018-06-22T00:28:25-4991bc6/android-386-emulator
2018-06-22T08:05:11-bedfa4e/android-amd64-emulator

ianlancetaylor · 2018-07-09T00:00:31Z

Spot checking a few I see things like

stack_test.go:135: finalizer started 21.053487s ago and finished 210 iterations
    stack_test.go:137: first growStack took 25.316309s
    stack_test.go:138: finalizer did not run

--- FAIL: TestStackGrowth (58.21s)
    stack_test.go:135: finalizer started 22.339328s ago and finished 242 iterations
    stack_test.go:137: first growStack took 14.320874s
    stack_test.go:138: finalizer did not run

The "first growStack" numbers suggest that 20 seconds is simply not enough given ordinary variation, at least not on an emulator.

But I don't know why the test no longer fails on ordinary systems. It seems to have stopped failing after CL 81556, but I don't know why.

gopherbot · 2018-07-09T02:18:34Z

Change https://golang.org/cl/122587 mentions this issue: runtime: scale timeout in TestStackGrowth

Updates #19381 Change-Id: I62b8b0cd7170941af77281eb3aada3802623ec27 Reviewed-on: https://go-review.googlesource.com/122587 Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>

n2vi · 2021-06-18T17:38:12Z

go1.17beta1 on OpenBSD6.9 amd64 (4GB NUQkit)

--- FAIL: TestStackGrowth (76.54s)
stack_test.go:151: finalizer started 20.079434621s ago and finished 186 iterations
stack_test.go:153: first growStack took 38.964539746s
stack_test.go:154: finalizer did not run

This failed for me on two out of two tries. It worked fine on a larger memory OpenBSD machine.
If you're still tweaking things and want me to run tests, happy to do so.
Otherwise, I'm content to ignore and just use the new go; I haven't noticed any other problems.

Some tests timeout on armhf because it's generally slower. Set GO_TEST_TIMEOUT_SCALE=2 to account for that. See: golang/go#19381

bcmills · 2022-01-12T14:45:33Z

Curiously, this seems not to have failed on the Go post-submit builders in literally years, but @thanm observed it on the new freebsd-amd64-12_3 builders (#49967 (comment)).

greplogs --dashboard -md -l -e 'FAIL: TestStackGrowth' --since=2019-01-01

[no results]

gopherbot · 2022-01-12T15:52:29Z

Change https://golang.org/cl/378034 mentions this issue: runtime: eliminate arbitrary timeout in TestStackGrowth

Instead, allow the test to run up until nearly the test's deadline, whatever that may be, and then crash with a panic (instead of calling t.Errorf) to get a useful goroutine dump. With the arbitrary timeout removed, we can now also run this test in short mode, reducing its impact on test latency. Fixes golang#19381 Change-Id: Ie1fae321a2973fcb9b69a012103363f16214f529 Reviewed-on: https://go-review.googlesource.com/c/go/+/378034 Trust: Bryan Mills <bcmills@google.com> Run-TryBot: Bryan Mills <bcmills@google.com> Reviewed-by: Austin Clements <austin@google.com> TryBot-Result: Gopher Robot <gobot@golang.org>

bradfitz added NeedsInvestigation Testing labels Mar 3, 2017

bradfitz added this to the Go1.9 milestone Mar 3, 2017

bradfitz assigned aclements Mar 3, 2017

bradfitz mentioned this issue Mar 3, 2017

runtime: freebsd/386 flaky TestCgoSignalDeadlock #18598

Closed

aclements mentioned this issue Jun 8, 2017

runtime: finalizer did not run #19805

Closed

aclements closed this as completed Jul 17, 2017

bradfitz reopened this Dec 1, 2017

bradfitz modified the milestones: Go1.9, Go1.10 Dec 1, 2017

ianlancetaylor modified the milestones: Go1.10, Go1.11 Jan 3, 2018

ianlancetaylor modified the milestones: Go1.11, Unplanned Jul 9, 2018

gopherbot pushed a commit that referenced this issue Jul 9, 2018

runtime: scale timeout in TestStackGrowth

b56e247

Updates #19381 Change-Id: I62b8b0cd7170941af77281eb3aada3802623ec27 Reviewed-on: https://go-review.googlesource.com/122587 Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>

algitbot pushed a commit to alpinelinux/aports that referenced this issue Oct 25, 2021

community/go: increase timeout scale for armhf

fbc772c

Some tests timeout on armhf because it's generally slower. Set GO_TEST_TIMEOUT_SCALE=2 to account for that. See: golang/go#19381

bcmills mentioned this issue Jan 12, 2022

cmd/compile: failures with "symbol type.[…] listed multiple times" on freebsd-amd64-12_2 since 2021-11-24 #49967

Closed

bcmills self-assigned this Jan 19, 2022

gopherbot closed this as completed in bec2cc3 Jan 19, 2022

rsc unassigned aclements and bcmills Jun 23, 2022

golang locked and limited conversation to collaborators Jun 23, 2023

gopherbot added the FrozenDueToAge label Jun 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

runtime: TestStackGrowth "finalizer did not run" flakes #19381

runtime: TestStackGrowth "finalizer did not run" flakes #19381

bradfitz commented Mar 3, 2017

aclements commented Jun 7, 2017

aclements commented Jun 7, 2017

aclements commented Jun 7, 2017

aclements commented Jun 8, 2017

gopherbot commented Jun 8, 2017

aclements commented Jun 9, 2017

aclements commented Jun 9, 2017

eliasnaur commented Jun 10, 2017 •

edited

Loading

aclements commented Jul 10, 2017

aclements commented Jul 10, 2017

eliasnaur commented Jul 11, 2017

aclements commented Jul 11, 2017

gopherbot commented Jul 11, 2017

aclements commented Jul 17, 2017

bradfitz commented Dec 1, 2017

aclements commented Dec 1, 2017

gopherbot commented Dec 1, 2017

bradfitz commented Dec 1, 2017

ianlancetaylor commented Jul 8, 2018

ianlancetaylor commented Jul 9, 2018

gopherbot commented Jul 9, 2018

n2vi commented Jun 18, 2021

bcmills commented Jan 12, 2022

gopherbot commented Jan 12, 2022

runtime: TestStackGrowth "finalizer did not run" flakes #19381

runtime: TestStackGrowth "finalizer did not run" flakes #19381

Comments

bradfitz commented Mar 3, 2017

aclements commented Jun 7, 2017

aclements commented Jun 7, 2017

aclements commented Jun 7, 2017

aclements commented Jun 8, 2017

gopherbot commented Jun 8, 2017

aclements commented Jun 9, 2017

aclements commented Jun 9, 2017

eliasnaur commented Jun 10, 2017 • edited Loading

aclements commented Jul 10, 2017

aclements commented Jul 10, 2017

eliasnaur commented Jul 11, 2017

aclements commented Jul 11, 2017

gopherbot commented Jul 11, 2017

aclements commented Jul 17, 2017

bradfitz commented Dec 1, 2017

aclements commented Dec 1, 2017

gopherbot commented Dec 1, 2017

bradfitz commented Dec 1, 2017

ianlancetaylor commented Jul 8, 2018

ianlancetaylor commented Jul 9, 2018

gopherbot commented Jul 9, 2018

n2vi commented Jun 18, 2021

bcmills commented Jan 12, 2022

gopherbot commented Jan 12, 2022

eliasnaur commented Jun 10, 2017 •

edited

Loading