Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: failures in TestVectoredHandlerDontCrashOnLibrary on windows-amd64-longtest #49959

Closed
bcmills opened this issue Dec 3, 2021 · 13 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Windows release-blocker
Milestone

Comments

@bcmills
Copy link
Contributor

bcmills commented Dec 3, 2021

--- FAIL: TestVectoredHandlerDontCrashOnLibrary (6.69s)
    signal_windows_test.go:58: expected output "exceptionCount: 1\ncontinueCount: 1\n", got ""
FAIL
FAIL	runtime	215.975s

greplogs --dashboard -md -l -e 'FAIL: TestVectoredHandlerDontCrashOnLibrary'

2021-11-24T20:56:40-9e7600d/windows-amd64-longtest
2021-11-12T23:26:33-3a4b950/windows-amd64-longtest
2021-11-09T20:10:50-5430203/windows-amd64-longtest

This appears to be a regression during the Go 1.18 cycle; it may be related in some way to #49681.

(CC @bufflig @alexbrainman)

@bcmills bcmills added OS-Windows NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 labels Dec 3, 2021
@bcmills bcmills added this to the Go1.18 milestone Dec 3, 2021
@bufflig
Copy link
Contributor

bufflig commented Dec 3, 2021

Yea, looks related. And as with that one, probably a pain to reproduce, but I'll try.

@cherrymui cherrymui removed the okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 label Dec 14, 2021
@cherrymui
Copy link
Member

Hasn't occurred since 2021-11-24. I also tried to reproduce on gomote but couldn't reproduce.

@cherrymui
Copy link
Member

The interesting part is that the binary runs successfully with no error, just didn't print anything. This may be different from #49681.

Maybe we need fflush or something to flush the output? I would think we don't need that as it exits normally. But I'm not sure I know enough about C buffered I/O on Windows.

@aclements
Copy link
Member

This hasn't happened again. Unfortunately, it was rare enough that that doesn't tell us much:

$ greplogs --dashboard -l -e 'FAIL: TestVectoredHandlerDontCrashOnLibrary' -since 2021-01-01 | findflakes -paths
First observed 5430203 09 Nov 20:10 2021 (2604 commits ago)
Last observed  9e7600d 24 Nov 20:56 2021 (658 commits ago)
51% chance failure is still happening
0.10% failure probability (3 of 1947 commits)

@toothrot toothrot modified the milestones: Go1.18, Go1.19 Jan 19, 2022
@aclements
Copy link
Member

Still no more occurrences. I'm going to try stress testing this.

git checkout 9e7600d
gopool create -setup 'gomote push $VM && gomote run $VM go/src/make.bat' windows-amd64-longtest
stress2 -p 8 -fail FAIL gopool run gomote run \$VM go/bin/go test runtime -run TestVectoredHandlerDontCrashOnLibrary -v -count 1

It may be necessary to run the full runtime test, but I'll try just the failing test first.

@aclements
Copy link
Member

I was able to reproduce this at commit 9e7600d, the last commit we saw a dashboard failure at. But it takes a long time. Over the course of about 20 hours with 8 gomotes, I got 4 failures out of 17,040 runs (plus 2,813 buildlet flakes :( ).

@aclements
Copy link
Member

I've added a bunch of trace prints to the test program and will see if that sheds any light on this.

@aclements
Copy link
Member

aclements commented Jan 20, 2022

It seems I can only create 3 windows-amd64-longtest builders at a time, though I'm not sure where that limit is coming from. I'm switching to windows-amd64-2016, which is the same image, just on a smaller VM. Hopefully I can still reproduce there.

(Edited to clarify: on the builders, the test only runs with longtest, so we wouldn't expect to see it on windows-amd64-2016, but since I'm running the test manually, that doesn't matter.)

@aclements
Copy link
Member

This will probably surprise no one, but it seems adding tracing made this no longer reproducible. I've done 21,665 runs with no failures.

Now I'm not sure what to do.

@bcmills
Copy link
Contributor Author

bcmills commented Jan 21, 2022

Now I'm not sure what to do.

Given @cherrymui's insight in #49959 (comment), and combining that with C best practices (https://wiki.sei.cmu.edu/confluence/display/c/FIO23-C.+Do+not+exit+with+unflushed+data+in+stdout+or+stderr), I would say maybe we should add an explicit fflush or fclose to the test program and close this issue as “possibly fixed”.

If we see any more of these failures (on our own builders or reported by users) after that, we can always either reopen this issue or open a followup issue to investigate further.

@aclements
Copy link
Member

I'll try the fflush.

I tried reproducing it a different way that didn't cause nearly as many problems with the builders. Over the course of 20 hours I got 3 failures out of 231,000 iterations of the test.

cat >i49959 <<EOF
#!/bin/bash

set -ex

VM=$(gomote create windows-amd64-2016)
function cleanup() {
    gomote destroy $VM
}
trap cleanup EXIT
gomote push $VM
gomote run $VM go/src/make.bat
gomote run $VM go/bin/go test -c -o ../src/runtime/runtime.test.exe runtime
for i in {1..100}; do
    gomote run $VM go/src/runtime/runtime.test.exe -test.run TestVectoredHandlerDontCrashOnLibrary -test.v
done
EOF
stress2 -p 20 -timeout 20m -fail FAIL ./i49959

@aclements
Copy link
Member

Looks like Cherry was right. 0 failures after 250,000 iterations with the fflush.

@gopherbot
Copy link

Change https://golang.org/cl/380494 mentions this issue: runtime: call fflush before exiting in C test

jproberts pushed a commit to jproberts/go that referenced this issue Jun 21, 2022
Very, very rarely TestVectoredHandlerDontCrashOnLibrary fails because
the C subprocess exits with a 0 status code and no output. This
appears to happen because C does not actually guarantee that stdout
will be flushed on exit and somehow, very rarely, it is not flushed.

Add explicit fflushes to fix this. This reduces the failure rate of
TestVectoredHandlerDontCrashOnLibrary from 0.0013% to 0% in 250,000
iterations.

Fixes golang#49959.

Change-Id: I892cf49a165ac91134c5da37588a2ab11e1f3f8b
Reviewed-on: https://go-review.googlesource.com/c/go/+/380494
Trust: Austin Clements <austin@google.com>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Bryan Mills <bcmills@google.com>
@golang golang locked and limited conversation to collaborators Jan 24, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Windows release-blocker
Projects
None yet
Development

No branches or pull requests

6 participants