Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: TestStackGrowth "finalizer did not run" flakes #19381

Closed
bradfitz opened this issue Mar 3, 2017 · 24 comments
Closed

runtime: TestStackGrowth "finalizer did not run" flakes #19381

bradfitz opened this issue Mar 3, 2017 · 24 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Testing An issue that has been verified to require only test changes, not just a test failure.
Milestone

Comments

@bradfitz
Copy link
Contributor

bradfitz commented Mar 3, 2017

@broady noticed TestStackGrowth "finalizer did not run" on freebsd-386 in #18598 (comment)

I just cleared all the Solaris builds from the dashboard and rebuilt them with 4x the CPU and they're now happy, except for:

https://build.golang.org/log/84bc7b4f864592e44e7964d8892ef2d9752f2754

--- FAIL: TestStackGrowth (49.50s)
	stack_test.go:114: finalizer did not run
FAIL
FAIL	runtime	102.054s
@bradfitz bradfitz added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Testing An issue that has been verified to require only test changes, not just a test failure. labels Mar 3, 2017
@bradfitz bradfitz added this to the Go1.9 milestone Mar 3, 2017
@aclements
Copy link
Member

Seems to have migrated to freebsd/arm. It's happening quite often:

$ greplogs -dashboard -md -E "finalizer did not run" -l | tail -n 10
2017-06-05T19:55:31-6a42568/freebsd-arm-paulzhol
2017-06-05T19:56:34-51711d1/freebsd-arm-paulzhol
2017-06-05T19:59:33-555d1e3/freebsd-arm-paulzhol
2017-06-05T19:59:33-555d1e3/linux-arm-arm5spacemonkey
2017-06-05T19:59:49-b1af539/freebsd-arm-paulzhol
2017-06-05T22:42:48-2d86f49/freebsd-arm-paulzhol
2017-06-05T22:42:48-2d86f49/linux-mips64
2017-06-06T15:48:54-3a27f28/freebsd-arm-paulzhol
2017-06-06T19:50:00-3c1914f/freebsd-arm-paulzhol
2017-06-06T20:20:41-a838191/freebsd-arm-paulzhol

It happens with some regularity on linux-arm-arm5spacemonkey, which has a gomote.

I haven't been able to reproduce the failure yet, though I have noticed that TestStackGrowth takes ~2.9 seconds to run on linux-arm-arm5spacemonkey, compared to 0.1 seconds on my laptop.

@aclements
Copy link
Member

TestStackGrowth takes ~2.9 seconds to run on linux-arm-arm5spacemonkey, compared to 0.1 seconds on my laptop.

This appears to be because TestStackGrowth GC's ~770 times, but on linux-arm each GC scans an 8 MB heap, while on my linux/amd64 laptop the heap is under 1 MB.

@aclements
Copy link
Member

This appears to be because TestStackGrowth GC's ~770 times, but on linux-arm each GC scans an 8 MB heap, while on my linux/amd64 laptop the heap is under 1 MB.

I believe this is a red herring. The 8MB comes from numerators in vlop_arm_test.go, but it's all non-scannable anyway. Adding // +build ignore to this file doesn't speed up TestStackGrowth.

@aclements
Copy link
Member

There's some useful discussion of this problem at #19805 (which I'm closing as a dup). @eliasnaur summarized it as:

Increasing the timeout in stack_test.go:114 to 30 seconds or checking out 44ed88a (the parent of CL 37520) both make the problem go away.

CL 37520 made runtime.GC concurrent. It's odd that this would matter, since the test is effectively single-threaded (spread across a few synchronized goroutines). It's possible this is just slightly slower than STW GC and the ~770 GCs done between triggering the finalizer and passing the test take too long on slow hardware. Or it's possible this somehow perturbed finalizers, so the single forced GC cycle that's supposed to start the finalizer doesn't.

@gopherbot
Copy link

CL https://golang.org/cl/45138 mentions this issue.

gopherbot pushed a commit that referenced this issue Jun 8, 2017
This adds diagnostics so we can tell if the finalizer has started, in
addition to whether or not it has finished.

Updates #19381.

Change-Id: Icb7b1b0380c9ad1128b17074828945511a6cca5d
Reviewed-on: https://go-review.googlesource.com/45138
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@aclements
Copy link
Member

Looking at the few builds on freebsd/arm since 45138 went in, all but one of them only say "finalizer did not run", which means the finalizer did start and it's just taking a long time to run.

@aclements
Copy link
Member

@eliasnaur, since you can reproduce this, could I get you to do a test for me on android/arm? Could you check out 44ed88a and run the runtime test with -v to get the run time of TestStackGrowth? I'd like to know how long TestStackGrowth took before CL 37520 to get a sense for whether it was just below the threshold before or if there was a more serious regression.

@eliasnaur
Copy link
Contributor

eliasnaur commented Jun 10, 2017

Of course. I ran GOOS=android GOARCH=arm ../bin/go test -short -v runtime and got (among others):

--- PASS: TestStackGrowth (3.43s)

If there's anything else you need, please ping me again.

@aclements
Copy link
Member

@eliasnaur, can you reproduce this if you just comment out the t.Parallel() at the top of TestStackGrowth?

@aclements
Copy link
Member

For my own record, this is what I've been using to try (unsuccessfully) to reproduce this:

VM=$(gomote create android-arm-wiko-fever)
gomote push $VM
gomote run $VM /bin/sh -c '
PATH=$PATH:$PWD/go/bin && 
cd go/src &&
./make.bash &&
cd runtime &&
go test -c &&
adb shell mkdir -p /data/local/tmp/goroot &&
adb push runtime.test /data/local/tmp/goroot/runtime.test'
gomote run $VM /bin/sh -c 'adb shell /data/local/tmp/goroot/runtime.test -test.short -test.cpu=1,2,4 -test.v' |& tee /tmp/x

I did notice once that on -cpu 4, TestStackGrowth took ~25 seconds (but did not timeout). The builder has been really uncooperative, though, so I've only managed to run that once.

@eliasnaur
Copy link
Contributor

Initially I couldn't reproduce the problem on tip nor on the version from my #19805 (comment) comment. Then I remembered that the Android builder had a problem around the same time where stuck Go test processes would slowly clog up the device resources, slowing it down and triggering timing sensitive tests such as TestStackGrowth.

So, in summary: I can not reproduce the problem on a freshly rebooted device but I can reproduce the problem if I run CPU-heavy processes on it in parallel with the Go tests.

To answer your question, commenting out t.Parallel does help, even on a loaded device: 6 consecutive successful runs without t.Parallel and with t.Parallel the tests failed on the first run thereafter.

I since modified the android builder to reboot before every build, so I'm not sure there is more to do on this. If not, I'm sorry to have led you on a wild goose chase.

@aclements
Copy link
Member

To answer your question, commenting out t.Parallel does help, even on a loaded device: 6 consecutive successful runs without t.Parallel and with t.Parallel the tests failed on the first run thereafter.

Thanks!

I think this all makes sense. The test started failing when I made runtime.GC() concurrent because it's a parallel test: before I made that change all of the expensive things in that test were STW so it wasn't really parallel, but after that change it was fighting with all of the other tests. Since concurrent GC claims 25% of the CPU, it probably started taking ~4 times longer to run.

I since modified the android builder to reboot before every build, so I'm not sure there is more to do on this. If not, I'm sorry to have led you on a wild goose chase.

Not at all. It's still failing pretty regularly on the dashboard and it sounds like we now have a solution.

@gopherbot
Copy link

CL https://golang.org/cl/48110 mentions this issue.

gopherbot pushed a commit that referenced this issue Jul 11, 2017
TestStackGrowth is currently a parallel test. However, it depends on a
20 second timeout, which is already dubious in a parallel test, and
became really problematic on slow builders when runtime.GC switched to
triggering concurrent GC instead of STW GC. Before that change, the
test spent much of its time in STW GC, so it wasn't *really* parallel.
After that change, it was competing with all of the other parallel
tests and GC likely started taking ~4 times longer. On most builders
the whole test runs in well under a second, but on the slow builders
that was enough to push it over the 20 second timeout.

Fix this by making the test serial.

Updates #19381 (probably fixes it, but we'll have to wait and see).

Change-Id: I21af7cf543ab07f1ec1c930bfcb355b0df75672d
Reviewed-on: https://go-review.googlesource.com/48110
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Elias Naur <elias.naur@gmail.com>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@aclements
Copy link
Member

This dropped from happening to a third of the commits to not happening at all with 286599f.

Thanks for the help @eliasnaur!

First observed d3f5e36 07 Mar 03:26 2017 (1482 commits ago)
Last observed  58ae050 11 Jul 14:01 2017 (53 commits ago)
0.00% chance failure is still happening
36% failure probability (515 of 1430 commits)

@bradfitz bradfitz reopened this Dec 1, 2017
@bradfitz bradfitz modified the milestones: Go1.9, Go1.10 Dec 1, 2017
@bradfitz
Copy link
Contributor Author

bradfitz commented Dec 1, 2017

@aclements, this is back.

Here's a NetBSD one: https://build.golang.org/log/7b248c375b767afdaa5d716c075734fed740af6d

Did any recent change to runtime.GC() affect this test?

        // in finalizer                                                                                                                                                                   
        wg.Add(1)
        go func() {
                defer wg.Done()
                done := make(chan bool)
                var started uint32
                go func() {
                        s := new(string)
                        SetFinalizer(s, func(ss *string) {
                                atomic.StoreUint32(&started, 1)
                                growStack()
                                done <- true
                        })
                        s = nil
                        done <- true
                }()
                <-done
                GC()
                select {
                case <-done:
                case <-time.After(20 * time.Second):
                        if atomic.LoadUint32(&started) == 0 {
                                t.Log("finalizer did not start")
                        }
                        t.Error("finalizer did not run")
                        return
                }
        }()
        wg.Wait()

If GC blocks until finalizers are all done, should the done channel be buffered?

@aclements
Copy link
Member

Finalizers are still run asynchronously. runtime.GC doesn't block for them.

I will note that the failure indicates that the finalizer did start, it's just not done yet. Maybe we should add some debugging that prints how long the finalizer has been running for and how much progress it's made so we can tell whether it's just taking a long time or if it started really late.

Here are the recent failures:

2017-09-22T20:07:52-f2ab41b/linux-386-clang
2017-09-22T20:08:03-eb32dbc/linux-386-clang
2017-09-22T20:08:16-0c01487/linux-386-clang
2017-09-27T13:48:21-90fdc45/freebsd-arm-paulzhol
2017-10-06T06:49:40-07e36af/linux-mips64
2017-10-10T01:02:27-bb0bfd0/linux-mips
2017-11-21T19:13:52-2951f90/freebsd-arm-paulzhol
2017-11-23T16:45:24-b72678f/solaris-amd64-oraclerel
2017-12-01T01:16:45-7e394a2/netbsd-amd64-8branch
2017-12-01T07:09:54-f22cf71/netbsd-amd64-8branch

Are these builders just slow/overloaded?

@gopherbot
Copy link

Change https://golang.org/cl/81556 mentions this issue: runtime: even more TestStackGrowth timeout debugging

@bradfitz
Copy link
Contributor Author

bradfitz commented Dec 1, 2017

@aclements, the arm and mips builders might be slow, but shouldn't be overloaded. No clue about solaris. But netbsd-amd64 is a fresh isolated VM per build, and linux-386-clang is a fresh kubernetes pod per build. It shouldn't be oversubscribed or anything.

gopherbot pushed a commit that referenced this issue Dec 1, 2017
This adds logging for the expected duration of a growStack, plus
progress information on the growStack that timed out.

Updates #19381.

Change-Id: Ic358f8350f499ff22dd213b658aece7d1aa62675
Reviewed-on: https://go-review.googlesource.com/81556
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@ianlancetaylor ianlancetaylor modified the milestones: Go1.10, Go1.11 Jan 3, 2018
@ianlancetaylor
Copy link
Contributor

Spot checking a few I see things like

stack_test.go:135: finalizer started 21.053487s ago and finished 210 iterations
    stack_test.go:137: first growStack took 25.316309s
    stack_test.go:138: finalizer did not run
--- FAIL: TestStackGrowth (58.21s)
    stack_test.go:135: finalizer started 22.339328s ago and finished 242 iterations
    stack_test.go:137: first growStack took 14.320874s
    stack_test.go:138: finalizer did not run

The "first growStack" numbers suggest that 20 seconds is simply not enough given ordinary variation, at least not on an emulator.

But I don't know why the test no longer fails on ordinary systems. It seems to have stopped failing after CL 81556, but I don't know why.

@ianlancetaylor ianlancetaylor modified the milestones: Go1.11, Unplanned Jul 9, 2018
@gopherbot
Copy link

Change https://golang.org/cl/122587 mentions this issue: runtime: scale timeout in TestStackGrowth

gopherbot pushed a commit that referenced this issue Jul 9, 2018
Updates #19381

Change-Id: I62b8b0cd7170941af77281eb3aada3802623ec27
Reviewed-on: https://go-review.googlesource.com/122587
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@n2vi
Copy link

n2vi commented Jun 18, 2021

go1.17beta1 on OpenBSD6.9 amd64 (4GB NUQkit)

--- FAIL: TestStackGrowth (76.54s)
stack_test.go:151: finalizer started 20.079434621s ago and finished 186 iterations
stack_test.go:153: first growStack took 38.964539746s
stack_test.go:154: finalizer did not run

This failed for me on two out of two tries. It worked fine on a larger memory OpenBSD machine.
If you're still tweaking things and want me to run tests, happy to do so.
Otherwise, I'm content to ignore and just use the new go; I haven't noticed any other problems.

algitbot pushed a commit to alpinelinux/aports that referenced this issue Oct 25, 2021
Some tests timeout on armhf because it's generally slower. Set
GO_TEST_TIMEOUT_SCALE=2 to account for that.

See: golang/go#19381
@bcmills
Copy link
Contributor

bcmills commented Jan 12, 2022

Curiously, this seems not to have failed on the Go post-submit builders in literally years, but @thanm observed it on the new freebsd-amd64-12_3 builders (#49967 (comment)).

greplogs --dashboard -md -l -e 'FAIL: TestStackGrowth' --since=2019-01-01

[no results]

@gopherbot
Copy link

Change https://golang.org/cl/378034 mentions this issue: runtime: eliminate arbitrary timeout in TestStackGrowth

@bcmills bcmills self-assigned this Jan 19, 2022
jproberts pushed a commit to jproberts/go that referenced this issue Jun 21, 2022
Instead, allow the test to run up until nearly the test's deadline,
whatever that may be, and then crash with a panic (instead of calling
t.Errorf) to get a useful goroutine dump.

With the arbitrary timeout removed, we can now also run this test in
short mode, reducing its impact on test latency.

Fixes golang#19381

Change-Id: Ie1fae321a2973fcb9b69a012103363f16214f529
Reviewed-on: https://go-review.googlesource.com/c/go/+/378034
Trust: Bryan Mills <bcmills@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
Reviewed-by: Austin Clements <austin@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@golang golang locked and limited conversation to collaborators Jun 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Testing An issue that has been verified to require only test changes, not just a test failure.
Projects
None yet
Development

No branches or pull requests

7 participants