Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os/signal: TestStop flaky on DragonFly #25092

Closed
timdarbydotnet opened this issue Apr 25, 2018 · 24 comments
Closed

os/signal: TestStop flaky on DragonFly #25092

timdarbydotnet opened this issue Apr 25, 2018 · 24 comments
Labels
FrozenDueToAge WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Milestone

Comments

@timdarbydotnet
Copy link

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

1.10

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

DragonflyBSD amd64

The builder sporadically fails with this message:
ok net/url 0.021s
ok os 2.430s
ok os/exec 1.919s
--- FAIL: TestStop (1.98s)
signal_test.go:32: timeout waiting for window size changes
FAIL
FAIL os/signal 10.420s
ok os/user 0.009s
ok path 0.008s
2018/04/25 07:54:09 Failed: exit status 1

If this is indicative of an OS bug, I need to know what this test is doing so I can report it to the OS team.

@ianlancetaylor ianlancetaylor changed the title builder: dragonfly os/signal: TestStop flaky on DragonFly Apr 25, 2018
@ianlancetaylor
Copy link
Contributor

findflakes says

First observed 20c98dc 21 Apr 20:20 2018 (54 commits ago)
Last observed  c5f0104 25 Apr 21:34 2018 (0 commits ago)
100% chance failure is still happening
19% failure probability (11 of 55 commits)
Likely culprits:
   19% 20c98dc cmd/link: skip TestRuntimeTypeAttr to fix build
   15% 0a129db misc/cgo/testcshared: use file descriptor 30 for TestUnexportedSymbols
   12% dfb1b69 os/signal: add func Ignored(sig Signal) bool
   10% 37dd7cd runtime: use sys.PtrSize in growslice
    8% 566e3e0 cmd/compile: avoid runtime call during switch string(byteslice)
    7% f6ca6ed net: document caveats for (*syscall.RawConn).Write on Windows
    5% d487488 cmd/internal/obj/x86: disallow PC/FP/SB scaled index
    4% dd71e3f cmd/compile: refactor how declarations are imported
    4% 514018c cmd/internal/obj/x86: ifelse->switch, named const for cap=6
    3% 21fa148 cmd/internal/obj/x86: add named consts for Prog.Back flags
No known past failures

@ianlancetaylor
Copy link
Contributor

Presumably the new failures are somehow due to https://golang.org/cl/108376, which added two new tests that run before TestStop. The new test TestIgnored does fiddle with SIGWINCH. It resets the handling of the signal, but perhaps something goes wrong with that somewhere. It is definitely odd that we're only seeing a problem on DragonFly, but it's possible that there is some bug in the DragonFly support in the runtime package.

@ianlancetaylor ianlancetaylor added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Apr 25, 2018
@ianlancetaylor ianlancetaylor added this to the Go1.11 milestone Apr 25, 2018
@ianlancetaylor
Copy link
Contributor

You can probably make this fail 1 out of 10 times or so by running

go test -c os/signal
./signal.test -test.run='TestStop|TestDetectNohup'

If that works for you--that is, if the test fails--could you attach the DragonFly equivalent of strace -f output for a failed run, so we can see what system calls the test is making? Thanks.

@bcmills
Copy link
Contributor

bcmills commented Apr 26, 2018

Possibly related to the race in #20748?

@timdarbydotnet
Copy link
Author

@ianlancetaylor I tried to run the test you suggested, but where is "signal.test"? Does something generate that?

@bcmills
Copy link
Contributor

bcmills commented May 25, 2018

where is "signal.test"? Does something generate that?

signal.test is the output generated by go test -c os/signal. I believe it is written to your current working directory.
(See https://golang.org/cmd/go/#hdr-Test_packages for more detail.)

@timdarbydotnet
Copy link
Author

@bcmills I forgot the -c option, thanks. Attached is the ktrace from a TestStop failure.
ktrace.zip

@timdarbydotnet
Copy link
Author

Interesting, I can't get this failure to occur if I run this test on a single CPU VM, but it readily happens on a dual CPU VM.

@timdarbydotnet
Copy link
Author

Adding human-readable format of ktrace.
ktrace_dump.txt

@timdarbydotnet
Copy link
Author

The builder is running dragonfly release 5.2.0, but I've confirmed that the flakiness in this test also happens on the most recent development branch. I can't recall for sure, but the appearance of this failure might very well coincide with switching the builder from a single CPU to a dual CPU VM a month or so ago.

@ianlancetaylor
Copy link
Contributor

Thanks for the ktrace output. Unfortunately I don't see anything unusual in it. The SIGWINCH signal is being delivered twice, as expected. The first one is being ignored, as expected. For some reason the second one is not being sent on the channel.

@timdarbydotnet
Copy link
Author

Thanks, if you can give me a description of what this test does, I can run it by the dev team.

@ianlancetaylor
Copy link
Contributor

I'm not sure that description will help, as most of the activity is in the Go signal handler.

That said, the failing test is TestStop in os/signal/signal_test.go. It raises SIGWINCH signal which will be caught and discarded. Then it calls signal.Notify to ask for notifications of SIGWINCH signals. Then it raises SIGWINCH again. The signal is caught again and, this time, the Go runtime should set a bit in sig.mask (in runtime/sigqueue.go) and wake up the goroutine sleeping in signal_recv (same file). The latter goroutine should observe the signal in sig.mask and send a notification on the channel. The test goroutine should receive that notification and carry on. The ktrace shows that the second SIGWINCH is delivered and caught. Something after that is going wrong, but I don't know what. As you can there are several steps involved.

@timdarbydotnet
Copy link
Author

I played around some with signal_test.go and discovered that if I increase the timeout in waitSig to 1900 * time.Millisecond, then I can no longer get it to fail by repeatedly running signal.test.

@timdarbydotnet
Copy link
Author

I know what's happening. See this commit:
http://gitweb.dragonflybsd.org/dragonfly.git/commit/afd7f1247cb20d5b03f8ccddb470de6a55afc530

Note that the default cap on the timeout for umtx_sleep() is 2 seconds and notice the time in the FAIL: TestStop message - 1.99s. If I increase sysctl kern.umtx_timeout_max to 3 seconds, then the FAIL message time changes to 2.99s. So, the question is what's causing umtx_sleep to hit its max timeout.

@ianlancetaylor
Copy link
Contributor

Thanks for the tip. I can recreate the problem. As far as I can tell, there is sometimes an unreasonably long period of time between a call to umtx_wakeup and the time that the thread that has called umtx_sleep starts running. So I agree that this seems clearly related to the kernel change.

I see that the kernel changes discusses fork. I can see the problem consistently when I run both TestDetectNohup and TestStop, but I do not see the problem when I run only TestStop. One interesting thing about TestDetectNohup is that it uses the os/exec package to run another program, which of course uses fork.

The sequence is

  • signal_recv calls umtx_sleep
  • TestDetectNoHup calls fork, twice
  • TestStop triggers a signal
  • the signal handler calls umtx_wakeup
  • occasionally the umtx_sleep called by signal_recv fails to wake up immediately

So if fork is somehow disassociating the umtx_wakeup from the previous umtx_sleep, that could cause the problem we are seeing.

Both the umtx_wakeup and umtx_sleep calls are using the address &sig.note, where sig is a package-scope variable defined in runtime/sigqueue.go.

@timdarbydotnet
Copy link
Author

If either fork is executed in TestDetectNohup, then the test is flaky. When I remove both, it passes every time.

@ianlancetaylor
Copy link
Contributor

It sounds like you've demonstrated that this is a bug in the DragonFly kernel, so I guess we should just skip the test on DragonFly.

@timdarbydotnet
Copy link
Author

I'm awaiting clarification from the dev team

@timdarbydotnet
Copy link
Author

Dragonfly has found the issue and will fix the timing window in the kernel. I'll be testing a patch soon.

@ianlancetaylor
Copy link
Contributor

Great, thanks!

@ianlancetaylor ianlancetaylor added WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker labels Jun 5, 2018
@timdarbydotnet
Copy link
Author

@ianlancetaylor
Copy link
Contributor

@tdfbsd Thanks!

@golang golang locked and limited conversation to collaborators Jun 7, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Projects
None yet
Development

No branches or pull requests

4 participants