-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime: TestStackGrowth "finalizer did not run" flakes #19381
Comments
Seems to have migrated to freebsd/arm. It's happening quite often:
It happens with some regularity on linux-arm-arm5spacemonkey, which has a gomote. I haven't been able to reproduce the failure yet, though I have noticed that TestStackGrowth takes ~2.9 seconds to run on linux-arm-arm5spacemonkey, compared to 0.1 seconds on my laptop. |
This appears to be because TestStackGrowth GC's ~770 times, but on linux-arm each GC scans an 8 MB heap, while on my linux/amd64 laptop the heap is under 1 MB. |
I believe this is a red herring. The 8MB comes from |
There's some useful discussion of this problem at #19805 (which I'm closing as a dup). @eliasnaur summarized it as:
CL 37520 made |
CL https://golang.org/cl/45138 mentions this issue. |
This adds diagnostics so we can tell if the finalizer has started, in addition to whether or not it has finished. Updates #19381. Change-Id: Icb7b1b0380c9ad1128b17074828945511a6cca5d Reviewed-on: https://go-review.googlesource.com/45138 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Ian Lance Taylor <iant@golang.org>
Looking at the few builds on freebsd/arm since 45138 went in, all but one of them only say "finalizer did not run", which means the finalizer did start and it's just taking a long time to run. |
@eliasnaur, since you can reproduce this, could I get you to do a test for me on android/arm? Could you check out 44ed88a and run the runtime test with -v to get the run time of TestStackGrowth? I'd like to know how long TestStackGrowth took before CL 37520 to get a sense for whether it was just below the threshold before or if there was a more serious regression. |
Of course. I ran
If there's anything else you need, please ping me again. |
@eliasnaur, can you reproduce this if you just comment out the |
For my own record, this is what I've been using to try (unsuccessfully) to reproduce this: VM=$(gomote create android-arm-wiko-fever)
gomote push $VM
gomote run $VM /bin/sh -c '
PATH=$PATH:$PWD/go/bin &&
cd go/src &&
./make.bash &&
cd runtime &&
go test -c &&
adb shell mkdir -p /data/local/tmp/goroot &&
adb push runtime.test /data/local/tmp/goroot/runtime.test'
gomote run $VM /bin/sh -c 'adb shell /data/local/tmp/goroot/runtime.test -test.short -test.cpu=1,2,4 -test.v' |& tee /tmp/x I did notice once that on -cpu 4, TestStackGrowth took ~25 seconds (but did not timeout). The builder has been really uncooperative, though, so I've only managed to run that once. |
Initially I couldn't reproduce the problem on tip nor on the version from my #19805 (comment) comment. Then I remembered that the Android builder had a problem around the same time where stuck Go test processes would slowly clog up the device resources, slowing it down and triggering timing sensitive tests such as TestStackGrowth. So, in summary: I can not reproduce the problem on a freshly rebooted device but I can reproduce the problem if I run CPU-heavy processes on it in parallel with the Go tests. To answer your question, commenting out t.Parallel does help, even on a loaded device: 6 consecutive successful runs without t.Parallel and with t.Parallel the tests failed on the first run thereafter. I since modified the android builder to reboot before every build, so I'm not sure there is more to do on this. If not, I'm sorry to have led you on a wild goose chase. |
Thanks! I think this all makes sense. The test started failing when I made
Not at all. It's still failing pretty regularly on the dashboard and it sounds like we now have a solution. |
CL https://golang.org/cl/48110 mentions this issue. |
TestStackGrowth is currently a parallel test. However, it depends on a 20 second timeout, which is already dubious in a parallel test, and became really problematic on slow builders when runtime.GC switched to triggering concurrent GC instead of STW GC. Before that change, the test spent much of its time in STW GC, so it wasn't *really* parallel. After that change, it was competing with all of the other parallel tests and GC likely started taking ~4 times longer. On most builders the whole test runs in well under a second, but on the slow builders that was enough to push it over the 20 second timeout. Fix this by making the test serial. Updates #19381 (probably fixes it, but we'll have to wait and see). Change-Id: I21af7cf543ab07f1ec1c930bfcb355b0df75672d Reviewed-on: https://go-review.googlesource.com/48110 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Elias Naur <elias.naur@gmail.com> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
This dropped from happening to a third of the commits to not happening at all with 286599f. Thanks for the help @eliasnaur!
|
@aclements, this is back. Here's a NetBSD one: https://build.golang.org/log/7b248c375b767afdaa5d716c075734fed740af6d Did any recent change to // in finalizer
wg.Add(1)
go func() {
defer wg.Done()
done := make(chan bool)
var started uint32
go func() {
s := new(string)
SetFinalizer(s, func(ss *string) {
atomic.StoreUint32(&started, 1)
growStack()
done <- true
})
s = nil
done <- true
}()
<-done
GC()
select {
case <-done:
case <-time.After(20 * time.Second):
if atomic.LoadUint32(&started) == 0 {
t.Log("finalizer did not start")
}
t.Error("finalizer did not run")
return
}
}()
wg.Wait() If |
Finalizers are still run asynchronously. I will note that the failure indicates that the finalizer did start, it's just not done yet. Maybe we should add some debugging that prints how long the finalizer has been running for and how much progress it's made so we can tell whether it's just taking a long time or if it started really late. Here are the recent failures: 2017-09-22T20:07:52-f2ab41b/linux-386-clang Are these builders just slow/overloaded? |
Change https://golang.org/cl/81556 mentions this issue: |
@aclements, the arm and mips builders might be slow, but shouldn't be overloaded. No clue about solaris. But netbsd-amd64 is a fresh isolated VM per build, and linux-386-clang is a fresh kubernetes pod per build. It shouldn't be oversubscribed or anything. |
This adds logging for the expected duration of a growStack, plus progress information on the growStack that timed out. Updates #19381. Change-Id: Ic358f8350f499ff22dd213b658aece7d1aa62675 Reviewed-on: https://go-review.googlesource.com/81556 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Spot checking a few I see things like
The "first growStack" numbers suggest that 20 seconds is simply not enough given ordinary variation, at least not on an emulator. But I don't know why the test no longer fails on ordinary systems. It seems to have stopped failing after CL 81556, but I don't know why. |
Change https://golang.org/cl/122587 mentions this issue: |
Updates #19381 Change-Id: I62b8b0cd7170941af77281eb3aada3802623ec27 Reviewed-on: https://go-review.googlesource.com/122587 Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
go1.17beta1 on OpenBSD6.9 amd64 (4GB NUQkit) --- FAIL: TestStackGrowth (76.54s) This failed for me on two out of two tries. It worked fine on a larger memory OpenBSD machine. |
Some tests timeout on armhf because it's generally slower. Set GO_TEST_TIMEOUT_SCALE=2 to account for that. See: golang/go#19381
Curiously, this seems not to have failed on the Go post-submit builders in literally years, but @thanm observed it on the new
[no results] |
Change https://golang.org/cl/378034 mentions this issue: |
Instead, allow the test to run up until nearly the test's deadline, whatever that may be, and then crash with a panic (instead of calling t.Errorf) to get a useful goroutine dump. With the arbitrary timeout removed, we can now also run this test in short mode, reducing its impact on test latency. Fixes golang#19381 Change-Id: Ie1fae321a2973fcb9b69a012103363f16214f529 Reviewed-on: https://go-review.googlesource.com/c/go/+/378034 Trust: Bryan Mills <bcmills@google.com> Run-TryBot: Bryan Mills <bcmills@google.com> Reviewed-by: Austin Clements <austin@google.com> TryBot-Result: Gopher Robot <gobot@golang.org>
@broady noticed TestStackGrowth "finalizer did not run" on freebsd-386 in #18598 (comment)
I just cleared all the Solaris builds from the dashboard and rebuilt them with 4x the CPU and they're now happy, except for:
https://build.golang.org/log/84bc7b4f864592e44e7964d8892ef2d9752f2754
The text was updated successfully, but these errors were encountered: