
os/signal: TestNohup flaky #33174

Open
bcmills opened this issue Jul 18, 2019 · 28 comments
Labels
compiler/runtime: Issues related to the Go compiler and/or runtime.
NeedsInvestigation: Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Testing: An issue that has been verified to require only test changes, not just a test failure.
Milestone
Backlog
Comments

@bcmills
Contributor

bcmills commented Jul 18, 2019

Seen on the solaris-amd64-oraclerel builder in https://build.golang.org/log/d5185d101eddc0f5a05103f11b74caf24421bf46:

--- FAIL: TestNohup (1.25s)
    signal_test.go:329: ran test with -send_uncaught_sighup=1 and it succeeded: expected failure.
        Output:
        PASS
FAIL
FAIL	os/signal	3.575s

See previously #8682, #5526.

CC @ianlancetaylor

@bcmills bcmills added the Testing and NeedsInvestigation labels Jul 18, 2019
@bcmills bcmills added this to the Unplanned milestone Jul 18, 2019
@bcmills
Contributor Author

bcmills commented Sep 11, 2019

Same symptom on linux-ppc64le-power9osu: https://build.golang.org/log/ae981484d0f765e824066be33e4fb24537b07db5

@bcmills bcmills changed the title os/signal: TestNohup flake on solaris-amd64-oraclerel os/signal: TestNohup flaky Sep 11, 2019
@ianlancetaylor
Contributor

Note that the case in #33174 (comment) is not the same symptom, but rather the reverse:

--- FAIL: TestNohup (1.62s)
    signal_test.go:354: ran test with -send_uncaught_sighup=1 under nohup and it failed: expected success.
        Error: signal: user defined signal 1
        Output:

@ianlancetaylor
Contributor

I think all these cases could be explained if there is an unusually long time, more than 100 milliseconds, between calling syscall.Kill and the time the program actually receives the signal.
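
For reference, the wait in question has roughly this shape (a minimal sketch, not the actual signal_test.go code): the process registers a channel with signal.Notify, sends itself the signal with syscall.Kill, and gives up if nothing arrives within the settle time, so any delivery delay longer than that window shows up as a test failure.

package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Register for SIGHUP before sending it to ourselves (Unix-only sketch).
	c := make(chan os.Signal, 1)
	signal.Notify(c, syscall.SIGHUP)
	defer signal.Stop(c)

	if err := syscall.Kill(syscall.Getpid(), syscall.SIGHUP); err != nil {
		fmt.Fprintln(os.Stderr, "kill:", err)
		os.Exit(1)
	}

	// If the kernel takes longer than the settle time to deliver the
	// signal, this select times out and the test-style check fails.
	select {
	case <-c:
		fmt.Println("signal delivered")
	case <-time.After(100 * time.Millisecond):
		fmt.Println("timeout waiting for signal")
		os.Exit(1)
	}
}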

@gopherbot

Change https://golang.org/cl/194879 mentions this issue: os/signal: split up sleeps waiting for signal

@bcmills
Contributor Author

bcmills commented Feb 20, 2020

Still flaky, unfortunately:
2020-02-20T19:12:11-75ea964/solaris-amd64-oraclerel

@bcmills bcmills reopened this Feb 20, 2020
@bcmills
Contributor Author

bcmills commented Mar 27, 2020

@bcmills bcmills modified the milestones: Unplanned, Backlog Mar 27, 2020
@bcmills
Contributor Author

bcmills commented Mar 27, 2020

Both of those failures are with -send_uncaught_sighup=2:

--- FAIL: TestNohup (1.77s)
    signal_test.go:377: ran test with -send_uncaught_sighup=2 and it succeeded: expected failure.
        Output:
        PASS
FAIL

@gopherbot

Change https://golang.org/cl/226138 mentions this issue: os/signal: rework test timeouts and concurrency

@bcmills bcmills self-assigned this Mar 27, 2020
@bcmills bcmills modified the milestones: Backlog, Go1.15 Mar 27, 2020
gopherbot pushed a commit that referenced this issue Mar 30, 2020
Use a uniform function (named “quiesce”) to wait for possible signals
in a way that gives the kernel many opportunities to deliver them.

Simplify channel usage and concurrency in stress tests.

Use (*testing.T).Deadline instead of parsing the deadline in TestMain.

In TestStop, sleep forever in a loop if we expect the test to die from
a signal. That should reduce the flakiness of TestNohup, since
TestStop will no longer spuriously pass when run as a subprocess of
TestNohup.

Since independent signals should not interfere, run the different
signals in TestStop in parallel when testing in short mode.

Since TestNohup runs TestStop as a subprocess, and TestStop needs to
wait many times for signals to quiesce, run its test subprocesses
concurrently and in short mode — reducing the latency of that test by
more than a factor of 2.

The above two changes reduce the running time of TestNohup on my
workstation to ~345ms, making it possible to run much larger counts of
the test in the same amount of wall time. If the test remains flaky
after this CL, we can spend all or part of that latency improvement on
a longer settle time.

Updates #33174

Change-Id: I09206f213d8c1888b50bf974f965221a5d482419
Reviewed-on: https://go-review.googlesource.com/c/go/+/226138
Run-TryBot: Bryan C. Mills <bcmills@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
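
Paraphrasing that change, a quiesce-style helper could look roughly like the sketch below (names and structure are illustrative assumptions, not the actual test code): rather than one long sleep, it waits out the settle time in many short steps and returns early if a signal arrives on the channel.

package signalsketch

import (
	"os"
	"time"
)

// quiesce waits up to settleTime for a signal to arrive on c, sleeping in
// short increments so the kernel has many opportunities to deliver it.
// It returns the received signal, or nil if none arrived in time.
func quiesce(c <-chan os.Signal, settleTime time.Duration) os.Signal {
	deadline := time.Now().Add(settleTime)
	for time.Now().Before(deadline) {
		select {
		case s := <-c:
			return s
		case <-time.After(settleTime / 10):
			// No signal yet; keep checking until the deadline.
		}
	}
	return nil
}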
@gopherbot

Change https://golang.org/cl/226458 mentions this issue: os/signal: in TestStop, skip the final "unexpected signal" check for SIGUSR1 on Android

@gopherbot

Change https://golang.org/cl/226461 mentions this issue: os/signal: make TestStop resilient to initially-blocked signals

gopherbot pushed a commit that referenced this issue Mar 30, 2020
…SIGUSR1 on Android

In CL 226138, I updated TestStop to have more uniform behavior for its signals.
However, that test seems to always fail for SIGUSR1 on the Android ARM builders.

I'm not sure what's special about Android for this particular case,
but let's skip the test to unbreak the builders while I investigate.

For #38165
Updates #33174

Change-Id: I35a70346cd9757a92acd505a020bf95e6871405c
Reviewed-on: https://go-review.googlesource.com/c/go/+/226458
Run-TryBot: Bryan C. Mills <bcmills@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
gopherbot pushed a commit that referenced this issue Mar 31, 2020
For reasons unknown, SIGUSR1 appears to be blocked at process start
for tests on the android-arm-corellium and android-arm64-corellium
builders. (This has been observed before, too: see CL 203957.)
Make the test resilient to blocked signals by always calling Notify
and waiting for potential signal delivery after sending any signal
that is not known to be unblocked.

Also remove the initial SIGWINCH signal from testCancel. The behavior
of an unhandled SIGWINCH is already tested in TestStop, so we don't
need to re-test that same case: waiting for an unhandled signal takes
a comparatively long time (because we necessarily don't know when it
has been delivered), so this redundancy makes the overall test binary
needlessly slow, especially since it is called from both TestReset and
TestIgnore.

Since each signal is always unblocked while we have a notification
channel registered for it, we don't need to modify any other tests:
TestStop and testCancel are the only functions that send signals
without a registered channel.

Fixes #38165
Updates #33174
Updates #15661

Change-Id: I215880894e954b62166024085050d34323431b63
Reviewed-on: https://go-review.googlesource.com/c/go/+/226461
Run-TryBot: Bryan C. Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@gopherbot

Change https://golang.org/cl/227649 mentions this issue: os/signal: increase settle time in tests

gopherbot pushed a commit that referenced this issue Apr 8, 2020
I noticed a timeout in TestIgnore in
https://build.golang.org/log/52d83a72f3a5ea9a16eb5d670c729694144f9624,
which suggests that the settle time is currently set too low.

I've also added a check for the same GO_TEST_TIMEOUT_SCALE used in
TestTerminalSignal, so that if this builder remains too slow we can
increase the builder's scale factor rather than the test's baseline
running time.

Updates #33174

Change-Id: I18b10eaa3bb5ae2f604300aedaaf6f79ee7ad567
Reviewed-on: https://go-review.googlesource.com/c/go/+/227649
Run-TryBot: Bryan C. Mills <bcmills@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
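
For context, GO_TEST_TIMEOUT_SCALE is a multiplier that slow builders set in their environment; applying it to a settle time could look roughly like this sketch (variable names are illustrative, not the actual test code):

package signalsketch

import (
	"os"
	"strconv"
	"time"
)

// settleTime is the baseline wait used when checking for signal delivery.
var settleTime = 100 * time.Millisecond

func init() {
	// If the builder declares itself slow via GO_TEST_TIMEOUT_SCALE,
	// stretch the settle time by the same factor instead of raising
	// the baseline for everyone.
	if s := os.Getenv("GO_TEST_TIMEOUT_SCALE"); s != "" {
		if scale, err := strconv.Atoi(s); err == nil && scale > 1 {
			settleTime *= time.Duration(scale)
		}
	}
}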
@bcmills
Contributor Author

bcmills commented Apr 14, 2020

Rats, still flaky even with a longer settle time:
https://build.golang.org/log/dd62939f6d3b512fe3e6147074a9c6db1144113f

I'm tempted to dramatically overinflate the settle time on this builder just to see if it's actually dropping the signal.


@gopherbot

Change https://golang.org/cl/228266 mentions this issue: os/signal: special-case test settle time on the solaris-amd64-oraclerel builder

@bcmills
Contributor Author

bcmills commented Apr 15, 2020

@rorth, are there any known signal-dropping bugs on Oracle Solaris that may account for this flakiness?

@rorth

rorth commented Apr 15, 2020 via email

gopherbot pushed a commit that referenced this issue Apr 15, 2020
…el builder

This is an attempt to distinguish between a dropped signal and
general builder slowness.

The previous attempt (increasing the settle time to 250ms) still
resulted in a timeout:
https://build.golang.org/log/dd62939f6d3b512fe3e6147074a9c6db1144113f

For #33174

Change-Id: I79027e91ba651f9f889985975f38c7b01d82f634
Reviewed-on: https://go-review.googlesource.com/c/go/+/228266
Run-TryBot: Bryan C. Mills <bcmills@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@bcmills
Contributor Author

bcmills commented May 5, 2020

No timeouts on solaris-amd64-oraclerel since CL 228266, so I think that one may just be a slow builder:
2020-04-13T20:22:26-6f3a951/solaris-amd64-oraclerel
2020-04-08T18:04:59-94d22d1/solaris-amd64-oraclerel
2019-09-09T23:04:30-a3a1bdf/solaris-amd64-oraclerel

Unfortunately, it seems that 100ms may be too short on the other builders:
2020-05-04T22:52:07-b5f7ff4/linux-ppc64-buildlet
2020-04-22T21:08:58-0ee4b13/darwin-amd64-10_14

@gopherbot

Change https://golang.org/cl/232378 mentions this issue: os/signal: retune the settleTime again

@ianlancetaylor ianlancetaylor modified the milestones: Go1.15, Go1.16 Jun 15, 2020
@ianlancetaylor
Contributor

Changing milestone to 1.16.

@odeke-em
Member

odeke-em commented Feb 9, 2021

Moving the milestone to Go1.17, but it's worth noting that we haven't seen this flake in a long time.

@odeke-em odeke-em modified the milestones: Go1.16, Go1.17 Feb 9, 2021
@ianlancetaylor
Contributor

The errors seem to consistently be of the form

signal_test.go:357: timeout after 100ms waiting for hangup

(for differing signals).

This is where the process sends itself the signal, and then waits for it to arrive.

@ianlancetaylor
Contributor

Every failure is with -send_uncaught_sighup set, either to 1 or 2. Every failure is TestNohup/nohup-N where N is 1 or 2. That means the test is running

nohup signal.test -test.v -test.run=TestStop -send_uncaught_sighup=N

That run is failing while waiting for a signal, although which signal it is waiting for varies.

@ianlancetaylor
Contributor

The failure always occurs on a separate run of signal.test. Since we are running under nohup, SIGHUP is blocked, so the SIGHUP subtest always starts by calling Notify and then immediately calling syscall.Kill. The signal can of course be delivered to a different thread. We are assuming that the signal handler will successfully call sigsend to send the signal back to the channel.

This is a real stretch, but as far as I can tell there are no atomic operations between setting sig.inuse the first time that signal_enable is called and testing sig.inuse in sigsend. It's at least theoretically possible that the thread that receives the signal will not see that signals have been enabled, and will not queue the signal for delivery.

So while I doubt this is the problem, I'll try removing the test of sig.inuse in sigsend.

@gopherbot

Change https://golang.org/cl/312131 mentions this issue: runtime: don't test sig.inuse in sigsend

gopherbot pushed a commit that referenced this issue Apr 21, 2021
Signals can be delivered on a different thread. There is no necessary
happens-before relationship between setting sig.inuse in signal_enable
and checking it in sigsend. It is theoretically possible, if unlikely,
that sig.inuse is set by thread 1, thread 2 receives a signal, does not
see that sig.inuse is set, and discards the signal. This could happen
if the signal is received immediately after the first call to signal_enable.

For #33174

Change-Id: Idb0f1c77847b7d4d418bd139e801c0c4460531d2
Reviewed-on: https://go-review.googlesource.com/c/go/+/312131
Trust: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Bryan C. Mills <bcmills@google.com>
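
As a simplified analogy for the race this commit addresses (this is not the runtime's actual sig.inuse code): a flag written on one thread and read from a signal context on another needs an explicit happens-before edge, otherwise the reader can miss the write and drop the signal. An atomic flag provides that edge; CL 312131 instead stops gating on the flag altogether.

package main

import "sync/atomic"

// enabled stands in for sig.inuse in this analogy. With a plain bool there
// would be no happens-before edge between the writer and a reader on
// another thread, so the reader could miss the store and wrongly drop a
// signal; atomic.Bool makes the store visible.
var enabled atomic.Bool

// enable is the analogue of signal_enable.
func enable() { enabled.Store(true) }

// deliver is the analogue of sigsend running on a different thread: it
// queues the signal only if it observes that handling is enabled.
func deliver(queue func()) {
	if enabled.Load() {
		queue()
	}
}

func main() {
	enable()
	deliver(func() {})
}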
@bcmills bcmills modified the milestones: Go1.17, Backlog Apr 28, 2021
@gopherbot

Change https://go.dev/cl/403199 mentions this issue: os/signal: scale back the solaris-amd64-oraclerel settle time

gopherbot pushed a commit that referenced this issue Apr 30, 2022
The settleTime is arbitrary. Ideally we should refactor the test to
avoid it (using subprocesses instead of sleeps to isolate tests from
each others' delayed signals), but as a shorter-term workaround let's
try scaling it back to match linux/ppc64 (the other builder that
empirically requires a longer settleTime).

For #51054.
Updates #33174.

Change-Id: I574fffaadd74c52c13d63974e87f20b6d3cf3c4b
Reviewed-on: https://go-review.googlesource.com/c/go/+/403199
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
@gopherbot gopherbot added the compiler/runtime label Jul 7, 2022
@bcmills bcmills removed their assignment Mar 13, 2024