runtime: failures in TestVectoredHandlerDontCrashOnLibrary on windows-amd64-longtest #49959

bcmills · 2021-12-03T17:06:39Z

--- FAIL: TestVectoredHandlerDontCrashOnLibrary (6.69s)
    signal_windows_test.go:58: expected output "exceptionCount: 1\ncontinueCount: 1\n", got ""
FAIL
FAIL	runtime	215.975s

greplogs --dashboard -md -l -e 'FAIL: TestVectoredHandlerDontCrashOnLibrary'

2021-11-24T20:56:40-9e7600d/windows-amd64-longtest
2021-11-12T23:26:33-3a4b950/windows-amd64-longtest
2021-11-09T20:10:50-5430203/windows-amd64-longtest

This appears to be a regression during the Go 1.18 cycle; it may be related in some way to #49681.

(CC @bufflig @alexbrainman)

The text was updated successfully, but these errors were encountered:

bufflig · 2021-12-03T23:22:57Z

Yea, looks related. And as with that one, probably a pain to reproduce, but I'll try.

cherrymui · 2021-12-15T23:22:30Z

Hasn't occurred since 2021-11-24. I also tried to reproduce on gomote but couldn't reproduce.

cherrymui · 2021-12-15T23:26:41Z

The interesting part is that the binary runs successfully with no error, just didn't print anything. This may be different from #49681.

Maybe we need fflush or something to flush the output? I would think we don't need that as it exits normally. But I'm not sure I know enough about C buffered I/O on Windows.

aclements · 2022-01-04T18:48:51Z

This hasn't happened again. Unfortunately, it was rare enough that that doesn't tell us much:

$ greplogs --dashboard -l -e 'FAIL: TestVectoredHandlerDontCrashOnLibrary' -since 2021-01-01 | findflakes -paths
First observed 5430203 09 Nov 20:10 2021 (2604 commits ago)
Last observed  9e7600d 24 Nov 20:56 2021 (658 commits ago)
51% chance failure is still happening
0.10% failure probability (3 of 1947 commits)

aclements · 2022-01-19T19:11:36Z

Still no more occurrences. I'm going to try stress testing this.

git checkout 9e7600d
gopool create -setup 'gomote push $VM && gomote run $VM go/src/make.bat' windows-amd64-longtest
stress2 -p 8 -fail FAIL gopool run gomote run \$VM go/bin/go test runtime -run TestVectoredHandlerDontCrashOnLibrary -v -count 1

It may be necessary to run the full runtime test, but I'll try just the failing test first.

aclements · 2022-01-20T16:38:18Z

I was able to reproduce this at commit 9e7600d, the last commit we saw a dashboard failure at. But it takes a long time. Over the course of about 20 hours with 8 gomotes, I got 4 failures out of 17,040 runs (plus 2,813 buildlet flakes :( ).

aclements · 2022-01-20T17:23:03Z

I've added a bunch of trace prints to the test program and will see if that sheds any light on this.

aclements · 2022-01-20T18:06:30Z

It seems I can only create 3 windows-amd64-longtest builders at a time, though I'm not sure where that limit is coming from. I'm switching to windows-amd64-2016, which is the same image, just on a smaller VM. Hopefully I can still reproduce there.

(Edited to clarify: on the builders, the test only runs with longtest, so we wouldn't expect to see it on windows-amd64-2016, but since I'm running the test manually, that doesn't matter.)

aclements · 2022-01-21T18:43:58Z

This will probably surprise no one, but it seems adding tracing made this no longer reproducible. I've done 21,665 runs with no failures.

Now I'm not sure what to do.

bcmills · 2022-01-21T19:03:17Z

Now I'm not sure what to do.

Given @cherrymui's insight in #49959 (comment), and combining that with C best practices (https://wiki.sei.cmu.edu/confluence/display/c/FIO23-C.+Do+not+exit+with+unflushed+data+in+stdout+or+stderr), I would say maybe we should add an explicit fflush or fclose to the test program and close this issue as “possibly fixed”.

If we see any more of these failures (on our own builders or reported by users) after that, we can always either reopen this issue or open a followup issue to investigate further.

aclements · 2022-01-22T15:12:34Z

I'll try the fflush.

I tried reproducing it a different way that didn't cause nearly as many problems with the builders. Over the course of 20 hours I got 3 failures out of 231,000 iterations of the test.

cat >i49959 <<EOF
#!/bin/bash

set -ex

VM=$(gomote create windows-amd64-2016)
function cleanup() {
    gomote destroy $VM
}
trap cleanup EXIT
gomote push $VM
gomote run $VM go/src/make.bat
gomote run $VM go/bin/go test -c -o ../src/runtime/runtime.test.exe runtime
for i in {1..100}; do
    gomote run $VM go/src/runtime/runtime.test.exe -test.run TestVectoredHandlerDontCrashOnLibrary -test.v
done
EOF
stress2 -p 20 -timeout 20m -fail FAIL ./i49959

aclements · 2022-01-24T14:34:27Z

Looks like Cherry was right. 0 failures after 250,000 iterations with the fflush.

gopherbot · 2022-01-24T14:36:25Z

Change https://golang.org/cl/380494 mentions this issue: runtime: call fflush before exiting in C test

Very, very rarely TestVectoredHandlerDontCrashOnLibrary fails because the C subprocess exits with a 0 status code and no output. This appears to happen because C does not actually guarantee that stdout will be flushed on exit and somehow, very rarely, it is not flushed. Add explicit fflushes to fix this. This reduces the failure rate of TestVectoredHandlerDontCrashOnLibrary from 0.0013% to 0% in 250,000 iterations. Fixes golang#49959. Change-Id: I892cf49a165ac91134c5da37588a2ab11e1f3f8b Reviewed-on: https://go-review.googlesource.com/c/go/+/380494 Trust: Austin Clements <austin@google.com> Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Bryan Mills <bcmills@google.com>

bcmills added OS-Windows NeedsInvestigation release-blocker okay-after-beta1 labels Dec 3, 2021

bcmills added this to the Go1.18 milestone Dec 3, 2021

cherrymui removed the okay-after-beta1 label Dec 14, 2021

toothrot modified the milestones: Go1.18, Go1.19 Jan 19, 2022

gopherbot closed this as completed in 19d819d Jan 24, 2022

thanm mentioned this issue Jan 25, 2022

Go compiler and runtime meeting notes #43930

Open

golang locked and limited conversation to collaborators Jan 24, 2023

gopherbot added the FrozenDueToAge label Jan 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

runtime: failures in TestVectoredHandlerDontCrashOnLibrary on windows-amd64-longtest #49959

runtime: failures in TestVectoredHandlerDontCrashOnLibrary on windows-amd64-longtest #49959

bcmills commented Dec 3, 2021

bufflig commented Dec 3, 2021

cherrymui commented Dec 15, 2021

cherrymui commented Dec 15, 2021

aclements commented Jan 4, 2022

aclements commented Jan 19, 2022

aclements commented Jan 20, 2022

aclements commented Jan 20, 2022

aclements commented Jan 20, 2022 •

edited

Loading

aclements commented Jan 21, 2022

bcmills commented Jan 21, 2022 •

edited

Loading

aclements commented Jan 22, 2022

aclements commented Jan 24, 2022

gopherbot commented Jan 24, 2022

runtime: failures in TestVectoredHandlerDontCrashOnLibrary on windows-amd64-longtest #49959

runtime: failures in TestVectoredHandlerDontCrashOnLibrary on windows-amd64-longtest #49959

Comments

bcmills commented Dec 3, 2021

bufflig commented Dec 3, 2021

cherrymui commented Dec 15, 2021

cherrymui commented Dec 15, 2021

aclements commented Jan 4, 2022

aclements commented Jan 19, 2022

aclements commented Jan 20, 2022

aclements commented Jan 20, 2022

aclements commented Jan 20, 2022 • edited Loading

aclements commented Jan 21, 2022

bcmills commented Jan 21, 2022 • edited Loading

aclements commented Jan 22, 2022

aclements commented Jan 24, 2022

gopherbot commented Jan 24, 2022

aclements commented Jan 20, 2022 •

edited

Loading

bcmills commented Jan 21, 2022 •

edited

Loading