-
Notifications
You must be signed in to change notification settings - Fork 18k
time: fix test flakiness on OpenBSD #9903
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Joel Sing tells me that the old openbsd-386-rootbsd builder was MP, but openbsd-amd64-rootbsd builders was SP. Looking through old build.golang.org logs, I can't find any evidence of time.TimeAfterQueuing failing on openbsd-amd64-rootbsd (which aligns with my memories of thinking it was an openbsd/386-specific bug). However, now openbsd-amd64-gce56 is MP too, and it's also exhibiting the bug. That seems to support my theory that it's a memory barrier issue. Following up with OpenBSD to see about fixing kern_tc.c. |
We should have a Github issue label for kernel-bugs-in-various-operating-systems-that-go-has-revealed |
Still jet lagged, so I dug into this some more early this morning. On my X201s running OpenBSD -current, I was able to reproduce this issue. I think the kern_tc.c race is still technically a bug, but it doesn't seem to be the root cause of this crash. My way of easily reproducing currently is:
(Notably, so far I've only repro'd with GOMAXPROCS>1, but it seems like the builders are using GOMAXPROCS=1.) A couple things to explain:
That leads to these consequences:
(Note: Subsequent retries, the global "slots" array is already sorted, so this cause goes away. To trigger the failure below with "attempts = 1", move the sort.Ints() call earlier.)
(Interestingly, in a simplified test case, I'm able to reproduce this non-deterministic goroutine scheduling with GOMAXPROCS>1, but not with GOMAXPROCS=1. I need to investigate more to find out why this is.) Takeaway: I don't think "kernel-bug" is an entirely accurate here, but because of OpenBSD's kernel behavior, it might make sense to only set "delta = 20 * Millisecond" for GOOS != "openbsd". |
CL https://golang.org/cl/9510 mentions this issue. |
Sigh, still flaking: http://build.golang.org/log/6ae361bdba9c95e35737fe173e291cd45959f1b3 (This was the build from 10f6d30 on build.golang.org for openbsd/386.) |
CL https://golang.org/cl/9795 mentions this issue. |
The time tests seem flaky on OpenBSD for some reason. E.g., recent-ish time.TestAfterQueuing failures on openbsd/amd64:
http://build.golang.org/log/7cf643d530ccbe39c97376ffdb2af64db66bcb5d
http://build.golang.org/log/10a9c8a1ffc2a088f48a12fdc9379f91a8bbc8b8
http://build.golang.org/log/702430fd8bf22805d19672d0795f6bc0d1395e6b
http://build.golang.org/log/cb50ea140b06232bdecb4c926134c9113f94e56a
Here's a time.TestReset failure on openbsd/amd64:
http://build.golang.org/log/26bb60d01823523470ffe47c7ed5fa9cdb280513
Flakiness has been seen on openbsd/386 in the past too; e.g.
http://build.golang.org/log/bd19f8590db8c7f317b3f968ac1af5126f5449cf
I've previously conjectured that this could be related to random scheduling delays interacting with repeated absolute<->relative timeout conversions in package runtime, but after measuring this seems unlikely (see https://groups.google.com/d/msg/golang-dev/AbSUgOucZyk/6ea0yP3ba1QJ).
It's possible this is actually an OpenBSD kernel issue. One conjecture I have at the moment is that the tc_windup() function (http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/kern/kern_tc.c) is not multiprocessor safe, because it doesn't use proper memory barriers. E.g., the final "timehands = th;" assignment might become visible to another thread before the modifications to the struct timehands's fields become visible.
Aside: I pointed this out privately to PHK in 2010, to which he responded "You are probably right, explicit memory barriers are probably called for." but FreeBSD still does not use memory barriers in tc_windup() either as far as I can tell: https://github.com/freebsd/freebsd/blob/master/sys/kern/kern_tc.c
NetBSD appears to have added explicit memory barriers back in 2007: http://cvsweb.netbsd.org/bsdweb.cgi/src/sys/kern/kern_tc.c#rev1.20
The text was updated successfully, but these errors were encountered: