runtime: test timeouts / deadlocks on NetBSD after CL 232298 #42515

bcmills · 2020-11-11T16:16:45Z

(Issue forked from #42237; see #42237 (comment).)

bcmills · 2020-11-11T16:17:43Z

2020-10-14T08:05:58-fc3a6f4/netbsd-amd64-9_0

has similar symptoms from before CL 232298, but it's the only such occurrence, and the failure rate is markedly higher since then.

bcmills · 2020-11-11T16:29:30Z

(CC @prattmic @aclements @ChrisHines @bsiegert @coypoop)

bsiegert · 2020-11-11T16:59:50Z

/cc @tklauser

This issue together with #42422 means that the NetBSD builders are much more flaky than we would like :(

prattmic · 2020-11-30T20:41:07Z

Still ongoing, as expected:

prattmic · 2020-12-04T21:40:08Z

This seems to be reproducible on gomote (~20% of the time) with:

$ gomote push ${MOTE?}
$ gomote run ${MOTE?} ./go/src/make.bash
$ gomote run ${MOTE?} ./go/bin/go test -count=1 -short -timeout=2m -run=TestNoShrinkStackWhileParking -trace ./trace.out -o ./runtime.test runtime

prattmic · 2020-12-04T22:46:19Z

This issue is reminding me a bit of hard-to-reproduce issues we've been tracking with worldsema for a while. That is, the test is getting work done, and then just ... stops:

Two minutes later forcegchelper runs and lo and behold, the test starts making progress again (G30):

Note necessarily related, but suspicious.

cc @aclements @mknyszek

prattmic · 2020-12-07T23:07:12Z

Update for the day: I've narrowed down the issue to findrunnable finding a timer to wait on, deciding to netpollBreak to kick the poller... and then nothing. Either netpollBreak isn't waking the poller or there isn't actually a poller.

Hopefully, I'll be able to narrow this further tomorrow.

prattmic · 2020-12-07T23:48:14Z

There is a poller, but it is not woken by netpollBreak. Here's the flow:

# netpoll with 180s delta (test timeout)
[0.753745874 P -1] M 4 P ? now 16465974762371 netpoll poll delay 179247420027

# Successfully write to netpoll break FD
[0.753788314 P -1] M 5 P ? netpollBreak: write =  1

# A different (non-blocking) netpoll receives the netpollBreak readiness, but doesn't clear netpollWakeSig
# because this is a non-blocking call: https://cs.opensource.google/go/go/+/master:src/runtime/netpoll_kqueue.go;l=149-156
[0.753803005 P 2] M 7 P ? now 16465974819894 netpoll poll delay 0
[0.753806724 P 2] M 7 P ? now 16465974823551 netpoll poll delay 0 duration 3657 n 1
[0.753808903 P 2] M 7 P ? now 16465974825645 netpoll received netpollBreak delay 0

# Later netpollBreak calls don't write to FD because netpollWakeSig = 1
[0.754030446 P 0] M 0 P ? netpollBreak wakeSig 1

# 180s later netpoll finally completes, does not receive the netpollBreak readiness (n = 1)
[180.253550882 P -1] M 4 P ? now 16645474571844 netpoll poll delay 179247420027 duration 179499809473 n 0

The 180s netpoll should have woken from netpollBreak but did not. I see two primary possibilities:

The NetBSD kernel is treating this event as edge-triggered with exactly-once delivery. It delivered a notification to the non-blocking netpoll (which ignored it), so it didn't deliver to the blocking call as well.
Delivery of this event is simply flaky in the kernel in general.

I suppose I'll take a look at the kernel source tomorrow...

prattmic · 2020-12-07T23:50:56Z

Or (3) the event is edge-triggered and the kqueue call was slightly too late. Note that the netpollBreak write is very close in time to the netpoll call (which I'm logging before the call).

prattmic · 2020-12-08T17:28:46Z

@bsiegert are you aware of any recent issues with the NetBSD kernel that may cause the behavior described in #42515 (comment)?

To add further details, all following calls to kevent after the netpollBreak are getting the break event returned, so possibility (1) doesn't seem right. (Note all these calls are non-blocking). It is just the blocking kevent call that occurred right around the same time as the pipe write that isn't getting an event returned.

The source all seems OK to me:

kqueue lock is taken.
If the event is already in the queue, then we should get it (though there is a lot of logic there).
If the event isn't in the queue, wait on the condition variable.

This seems fine, but makes me wonder if there is some race where the condition variable isn't firing via a racing broadcast. The kqueue lock seems to prevent this, but who knows.

prattmic · 2020-12-10T16:48:46Z

Looking more closely at the write and kevent, I can get time uncertainty bounds [1] like:

kevent: [2197268650624, 2376959698401]
write:  [2197268677873, 2197268710147]

The write bound start is after the kqueue bound start, so the race is close enough that we can't really tell which came first, which is a bit disappointing. Though recall that based on the kevent API contract it shouldn't matter which came first regardless. The kevent should return the event.

[1] That is, start time is some time before the syscall was made. end time is some time after the syscall returned. But we can't be sure exactly when the call occurred inbetween.

prattmic · 2020-12-10T16:56:28Z

I've attempted to reproduce this issue on my own NetBSD VM, based on the gomote build script https://cs.opensource.google/go/x/build/+/master:env/netbsd-386/.

Unfortunately that script builds NetBSD 9.0_2019Q4, which seems to no longer exist and thus doesn't build. The closest I could try was 9.0_2020Q2 and 9.1. I've been unable to reproduce the problem on either of these versions.

It is possible that something has changed since 9.0_2019Q2 to fix the issue, or perhaps my environment is simply different enough somehow to avoid triggering this issue.

If it is the former, then perhaps the fix is just to update the builders (cc @golang/release), though I don't understand NetBSD's release policy well enough to know if 9.0_2019Q2 is a supported release that we want to keep working (cc @bsiegert).

gopherbot · 2020-12-10T20:55:19Z

Change https://golang.org/cl/277033 mentions this issue: WIP: debug #42515

prattmic · 2020-12-10T21:39:49Z

Ah, after many hours of waiting I did manage to reproduce this on i386 9.0_2020Q2. Interestingly, the behavior is similar but not identical:

# Older kevent gets a netpollBreak.
[1.416594591 P -1] M 6 P ? now 17564795668567 netpoll exit poll delay 178584895297 real start 17564795248432 duration 420135 n 1
[1.416597333 P -1] M 6 P ? now 17564795672846 netpoll received netpollBreak delay 178584895297

# New write to netpollBreak (there is no poller at this precise moment (racing with above)).
[1.416607232 P 0] M 4 P ? netpollBreak: write =  1 start 17564795679978 end 17564795681250 duration 1272

# Non-blocking kevent receives break.
[1.435770113 P 1] M 0 P ? now 17564814845535 netpoll enter poll delay 0
[1.435775386 P 1] M 0 P ? now 17564814849491 netpoll exit poll delay 0 real start 17564814846159 duration 3332 n 1
[1.435777146 P 1] M 0 P ? now 17564814852322 netpoll received netpollBreak delay 0

# Blocking kevent (on the same M!) _doesn't_ receive netpoll break.
[1.436044372 P -1] M 0 P ? now 17564815119711 netpoll enter poll delay 178565017725
...
[180.401118534 P -1] M 0 P ? now 17743780191203 netpoll exit poll delay 178565017725 real start 17564815120557 duration 178965070646 n 0

This can't be a race in setup in kevent since the same thread called kevent twice and received the event the first time but not the second.

gopherbot · 2020-12-11T19:26:10Z

Change https://golang.org/cl/277332 mentions this issue: netbsd: limit maximum netbsd kevent delay

prattmic · 2020-12-11T19:27:48Z

I've pushed a workaround discussed with @aclements that could mitigate this issue, but it's pretty nasty, so I'd still rather find the actual cause.

prattmic · 2020-12-11T20:33:33Z

I've finally managed to get an i386 VM running a custom NetBSD kernel that definitely fails at c98ec4120ecf0b9a29bf31c1b00d7896536b7d76 (original 9.0 commit) and seems to pass at f15eb9e6ad34c3315d354274fd26356e3ae79d84 (HEAD), so I'll try to do a bisect.

prattmic · 2020-12-16T23:15:52Z

I found the bug!

First thing first: the bisect was a red herring. I didn't give HEAD enough time to reproduce the issue. It turns out this is still failing on trunk NetBSD.

The issue is a race in kevent(2) after all. The bad flow is:

Lock kq on entry.
Remove head entry: https://github.com/NetBSD/src/blob/b9fe321c6a0b02f7519a6649de7ceaf4e71af02c/sys/kern/kern_event.c#L1478
Unlock kq, perform fs call, lock kq: https://github.com/NetBSD/src/blob/b9fe321c6a0b02f7519a6649de7ceaf4e71af02c/sys/kern/kern_event.c#L1488-L1495
Add entry back to queue: https://github.com/NetBSD/src/blob/b9fe321c6a0b02f7519a6649de7ceaf4e71af02c/sys/kern/kern_event.c#L1553

Step 3 is the problem here: once kq is unlocked another kevent call can come along and miss this entry altogether. If that is a blocking call and there are no entries left, then the call will even block and only wake when a completely new event occurs or timeout (i.e., step 4 doesn't notify the condvar [1]).

To verify this theory, I've added tracepoints to kqueue_scan, which can be seen in prattmic/netbsd-src@2cd2ba9.

The interesting part of the trace is below. Note that the first two columns are PID and TID, respectively. The format at the end of the line is trace_name: <num bytes>, <bytes displayed big endian>. Commentary inline.

# Write to netpollBreak FD.
  4132   3656 runtime.test 1608155897.373396927 GIO   fd 5 wrote 1 bytes
       "\0"

# Irrelevant.
  4132   3656 runtime.test 1608155897.373427571 PSIG  SIGURG caught handler=0x80b9890 mask=(): code=SI_LWP sent by pid=4132, uid=0)

# Non-blocking kevent by 3675, which receives the netpollBreak event (retval = 1).
  4132   3675 runtime.test 1608155897.373500103 MISC  kq_scan_timeout: 4, ffffffff
  4132   3675 runtime.test 1608155897.373500735 MISC  kq_scan_lock: 1, 01
  4132   3675 runtime.test 1608155897.373501022 MISC  kq_scan_count: 4, 01000000
  4132   3675 runtime.test 1608155897.373501298 MISC  kq_scan_unlock: 1, 02
  4132   3675 runtime.test 1608155897.373501595 MISC  kq_scan_lock: 1, 02
  4132   3675 runtime.test 1608155897.373501895 MISC  kq_scan_unlock: 1, 03
  4132   3675 runtime.test 1608155897.373502336 MISC  kq_scan_lock: 1, 03
  4132   3675 runtime.test 1608155897.373502697 MISC  kq_scan_unlock: 1, 07
  4132   3675 runtime.test 1608155897.373503109 MISC  kq_scan_retval: 4, 01000000

# Racing non-blocking and blocking calls by 3656 and 3675, respectively.
  4132   3656 runtime.test 1608155897.392784451 MISC  kq_scan_timeout: 4, ffffffff
  4132   3675 runtime.test 1608155897.392785991 MISC  kq_scan_timeout: 4, 4e460000

# 3656 locks and sees event.
  4132   3656 runtime.test 1608155897.392786080 MISC  kq_scan_lock: 1, 01
  4132   3656 runtime.test 1608155897.392786876 MISC  kq_scan_count: 4, 01000000

# Lock cycle; won't cause this bug, but could make a blocking call return 0 without blocking.
  4132   3656 runtime.test 1608155897.392787560 MISC  kq_scan_unlock: 1, 02
  4132   3656 runtime.test 1608155897.392788778 MISC  kq_scan_lock: 1, 02

# Between lock 2 and unlock 3, the event is removed from the queue.
  4132   3656 runtime.test 1608155897.392789390 MISC  kq_scan_unlock: 1, 03

# 3675 manages to grab lock. Since the event is transiently removed, it sees no events
# and blocks on the condvar (which implicitly unlocks).
  4132   3675 runtime.test 1608155897.392790016 MISC  kq_scan_lock: 1, 01
  4132   3675 runtime.test 1608155897.392790407 MISC  kq_scan_count: 4, 00000000

# 3656 reacquires the lock, adds the event back to the queue, then unlocks and returns.
  4132   3656 runtime.test 1608155897.392791932 MISC  kq_scan_lock: 1, 03
  4132   3656 runtime.test 1608155897.392792717 MISC  kq_scan_unlock: 1, 07
  4132   3656 runtime.test 1608155897.392793325 MISC  kq_scan_retval: 4, 01000000

...

# condvar wait times out 180s later.
  4132   3675 runtime.test 1608156077.372309293 MISC  kq_scan_unlock: 1, 01
  4132   3675 runtime.test 1608156077.372311582 MISC  kq_scan_retval: 4, 00000000

I'll file a NetBSD bug once I figure out how. This bug does not seem to be new; as far as I can tell it dates back to the addition of kqueue locking (NetBSD/src@c743ad7). It is exposed by removal of sysmon as a backstop for overrun timers. Thus, I think http://golang.org/cl/277332 is still the best workaround [2].

It is unclear to me how difficult this is to fix, as it is not clear to me which, if any, constraints require unlocking the kq.

For completeness, I took a look at two other BSDs to see if they are affected:

OpenBSD: Has a Big Kernel Lock at syscall level, so no kevent locking: https://github.com/openbsd/src/blob/e20c779da119f998aba452410e439df1275cd7e8/sys/sys/syscall_mi.h#L92-L104
FreeBSD: Seems to use marker notes to notice that the queue is "in flux" and wait for the mutator to finish: https://github.com/freebsd/freebsd/blob/master/sys/kern/kern_event.c#L1859-L1869

[1] N.B. putting a notification after step 4 would fix the problem for Go, but it would still be generally incorrect because calls that don't block could still return with missing events in violation of the API.
[2] The only other workaround I see is to add our own locking around kevent calls, but that seems much worse for scalability.

prattmic · 2020-12-17T16:43:49Z

Minimal C reproducer: https://gist.github.com/prattmic/8b5bc6c87437bd4496d5f546fb3226fc

Pretty simple, just two threads concurrently polling the same kqueue. One of them usually misses an event within a few hundred iterations.

Don't run this under ktrace with my kernel changes or the machine will panic! (OOM I guess?)

prattmic · 2020-12-17T17:33:07Z

Sent update to https://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=50094, turns out this issue was reported in 2015!

Our strategy here should be to get http://golang.org/cl/277332 in and hopefully we can remove that one day when NetBSD is fixed.

bsiegert · 2020-12-17T19:19:31Z

Awesome, thanks for the thorough investigation! I asked on the tech-kern@ mailing list to see if there can be any traction on this issue.

The netbsd kernel has a bug [1] that occassionally prevents netpoll from waking with netpollBreak, which could result in missing timers for an unbounded amount of time, as netpoll can't restart with a shorter delay when an earlier timer is added. Prior to CL 232298, sysmon could detect these overrun timers and manually start an M to run them. With this fallback gone, the bug actually prevents timer execution indefinitely. As a workaround, we add back sysmon detection only for netbsd. [1] https://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=50094 Updates #42515 Change-Id: I8391f5b9dabef03dd1d94c50b3b4b3bd4f889e66 Reviewed-on: https://go-review.googlesource.com/c/go/+/277332 Run-TryBot: Michael Pratt <mpratt@google.com> TryBot-Result: Go Bot <gobot@golang.org> Reviewed-by: Michael Knyszek <mknyszek@google.com> Reviewed-by: Austin Clements <austin@google.com> Trust: Michael Pratt <mpratt@google.com>

prattmic · 2020-12-21T21:44:46Z

With http://golang.org/cl/277332, the immediate problem is solved. Further cleanup is blocked by upstream fix to https://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=50094, so I'll close this for for now.

prattmic · 2021-01-21T19:13:09Z

Upstream change NetBSD/src@7fb7f43 should address this issue. I haven't had a chance to test it, but according to Jaromir it passes the C repro I created (which was more reliable than Go anyways).

jdolecek-zz · 2021-01-24T10:50:17Z

The fix in NetBSD will be pulled up to release branch once the change is stabilizied. Eventually the sysmon conditional can be removed for good and refer the go users to use the latest NetBSD release. I'll create a ticket for go once that it possible.

Removing the conditional would be highly preferable, as it's not at all good to be the only platform doing the timer kick from sysmon - not just due the extra latency only experienced on older NetBSD, but there might be some interference due to locking even with fixed kernel due to the timers being checked from sysmon.

krytarowski · 2021-02-04T17:00:37Z

@prattmic great work on it!

sys/kern/kern_event.c r1.110-1.115 (via patch) fix a race in kqueue_scan() - when multiple threads check the same kqueue, it could happen other thread seen empty kqueue while kevent was being checked for re-firing and re-queued make sure to keep retrying if there are outstanding kevents even if no kevent is found on first pass through the queue, and only kq_count when actually completely done with the kevent PR kern/50094 by Christof Meerwal Also fixes timer latency in Go, as reported in golang/go#42515 by Michael Pratt

gopherbot · 2021-06-03T15:01:23Z

Change https://golang.org/cl/324472 mentions this issue: runtime: skip sysmon workaround on NetBSD >= 9.2

Detect the NetBSD version in osinit and only enable the workaround for the kernel bug identified in #42515 for NetBSD versions older than 9.2. For #42515 For #46495 Change-Id: I808846c7f8e47e5f7cc0a2f869246f4bd90d8e22 Reviewed-on: https://go-review.googlesource.com/c/go/+/324472 Trust: Tobias Klauser <tobias.klauser@gmail.com> Trust: Benny Siegert <bsiegert@gmail.com> Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com> Reviewed-by: Michael Pratt <mpratt@google.com> Reviewed-by: Ian Lance Taylor <iant@golang.org> TryBot-Result: Go Bot <gobot@golang.org>

sys/kern/kern_event.c r1.110-1.115 (via patch) fix a race in kqueue_scan() - when multiple threads check the same kqueue, it could happen other thread seen empty kqueue while kevent was being checked for re-firing and re-queued make sure to keep retrying if there are outstanding kevents even if no kevent is found on first pass through the queue, and only kq_count when actually completely done with the kevent PR kern/50094 by Christof Meerwal Also fixes timer latency in Go, as reported in golang/go#42515 by Michael Pratt

bcmills added OS-NetBSD NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker labels Nov 11, 2020

bcmills added this to the Go1.16 milestone Nov 11, 2020

prattmic self-assigned this Nov 30, 2020

dmitshur added the okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 label Dec 10, 2020

prattmic mentioned this issue Dec 11, 2020

time: test timeout in TestAfterFunc on linux-s390x-ibm builder #43067

Open

toothrot removed the okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 label Dec 17, 2020

prattmic closed this as completed Dec 21, 2020

bcmills mentioned this issue Feb 17, 2021

runtime: time.Sleep takes more time than expected on Windows (1ms -> 10ms) #44343

Closed

prattmic mentioned this issue Jun 1, 2021

runtime: drop NetBSD kernel bug sysmon workaround fixed in NetBSD 9.2 #46495

Open

golang locked and limited conversation to collaborators Jun 3, 2022

gopherbot added the FrozenDueToAge label Jun 3, 2022

rsc unassigned prattmic Jun 23, 2022

prattmic self-assigned this Jun 24, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

runtime: test timeouts / deadlocks on NetBSD after CL 232298 #42515

runtime: test timeouts / deadlocks on NetBSD after CL 232298 #42515

bcmills commented Nov 11, 2020

bcmills commented Nov 11, 2020

bcmills commented Nov 11, 2020

bsiegert commented Nov 11, 2020

prattmic commented Nov 30, 2020

prattmic commented Dec 4, 2020 •

edited

prattmic commented Dec 4, 2020

prattmic commented Dec 7, 2020 •

edited

prattmic commented Dec 7, 2020 •

edited

prattmic commented Dec 7, 2020

prattmic commented Dec 8, 2020

prattmic commented Dec 10, 2020 •

edited

prattmic commented Dec 10, 2020

gopherbot commented Dec 10, 2020

prattmic commented Dec 10, 2020

gopherbot commented Dec 11, 2020

prattmic commented Dec 11, 2020

prattmic commented Dec 11, 2020

prattmic commented Dec 16, 2020 •

edited

prattmic commented Dec 17, 2020 •

edited

prattmic commented Dec 17, 2020

bsiegert commented Dec 17, 2020 via email

prattmic commented Dec 21, 2020

prattmic commented Jan 21, 2021

jdolecek-zz commented Jan 24, 2021

krytarowski commented Feb 4, 2021

gopherbot commented Jun 3, 2021

runtime: test timeouts / deadlocks on NetBSD after CL 232298 #42515

runtime: test timeouts / deadlocks on NetBSD after CL 232298 #42515

Comments

bcmills commented Nov 11, 2020

bcmills commented Nov 11, 2020

bcmills commented Nov 11, 2020

bsiegert commented Nov 11, 2020

prattmic commented Nov 30, 2020

prattmic commented Dec 4, 2020 • edited

prattmic commented Dec 4, 2020

prattmic commented Dec 7, 2020 • edited

prattmic commented Dec 7, 2020 • edited

prattmic commented Dec 7, 2020

prattmic commented Dec 8, 2020

prattmic commented Dec 10, 2020 • edited

prattmic commented Dec 10, 2020

gopherbot commented Dec 10, 2020

prattmic commented Dec 10, 2020

gopherbot commented Dec 11, 2020

prattmic commented Dec 11, 2020

prattmic commented Dec 11, 2020

prattmic commented Dec 16, 2020 • edited

prattmic commented Dec 17, 2020 • edited

prattmic commented Dec 17, 2020

bsiegert commented Dec 17, 2020 via email

prattmic commented Dec 21, 2020

prattmic commented Jan 21, 2021

jdolecek-zz commented Jan 24, 2021

krytarowski commented Feb 4, 2021

gopherbot commented Jun 3, 2021

prattmic commented Dec 4, 2020 •

edited

prattmic commented Dec 7, 2020 •

edited

prattmic commented Dec 7, 2020 •

edited

prattmic commented Dec 10, 2020 •

edited

prattmic commented Dec 16, 2020 •

edited

prattmic commented Dec 17, 2020 •

edited