runtime: "go test -short -cpu=1,2,4" command takes too long #13645
I think @aclements is on this. |
Possible dup of #13642. |
Also, maybe "go test -short -cpu=1,2,4" DOES take too long. Maybe we should create a CL (with some failure in it) that runs "go test -short -cpu=1,2,4 -v" instead of "go test -short -cpu=1,2,4", to see if some runtime test takes too long. Alex |
Sometimes it does take a long time. Just using the failure logs we already have, here are the last 50 (successful) times that are over 120s. The worst amd64 time is only 145s, though, which suggests the 10 minute timeout is a freeze.
|
That is fine, @aclements. I will let you deal with this. Alex |
I've seen this recently too and reported #13632. And I think I saw it on a builder today -- so I think there is a problem here. |
Some recent samples from try bots:
|
Another crash, but with more stack trace: Patch Set 1: TryBot-Result-1 1 of 15 TryBots failed: Alex |
I was hoping this would be fixed by 8d5ff2e, which fixed a bug in the StackOverflow test, but it's still happening. |
I don't see the original crash anymore. But there is a new failure happening, http://build.golang.org/log/9097c870bd4716e62c87a18648e0bd9da7061abc
that looks similar. Alex |
I could be wrong, but my current hunch is that the runtime test is a red herring. I think t.Parallel() may be broken. There are only three places where we use it in the standard library: runtime, net, and net/http. Looking through recent failures, these are the packages we see timing out, and they time out with the tests marked t.Parallel() waiting on the channel that tells them to go. |
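As an illustration of the mechanism referred to above (not code from this issue): a test that calls t.Parallel() signals the test framework and then blocks until the package's sequential tests have finished; if that release never arrives, the goroutine stays parked, which matches the symptom in the hung tracebacks. A minimal sketch with hypothetical test names:

package example_test

import (
	"testing"
	"time"
)

// TestParallelA parks inside t.Parallel() until the package's sequential
// tests finish, then runs concurrently with the other parallel tests.
func TestParallelA(t *testing.T) {
	t.Parallel()
	time.Sleep(10 * time.Millisecond)
}

// TestParallelB is released at the same time as TestParallelA.
func TestParallelB(t *testing.T) {
	t.Parallel()
	time.Sleep(10 * time.Millisecond)
}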
There's also a recent timeout not in the runtime test. It might be unrelated: |
The leftover Futexsleep goroutines are unrelated and should be fixed by https://go-review.googlesource.com/18617. |
I cannot reproduce the failures we see on Windows builders on my computers. I run selected runtime tests in a loop. If you have suggestions on how to break things here, I might be able to help more. Maybe a tiny sleep here or there? Alex |
I wish I did. I spent all day trying to reproduce this on a linux-amd64-nocgo gomote and somehow haven't succeeded. It shows up frequently enough on the dashboard for this builder that I'm rather perplexed. |
I ran a
|
I've now run over 200 cycles of create VM/go test std/destroy VM and still haven't reproduced this. |
2016-01-12T02:00:48-352e287/linux-amd64-sid: Lots of runnable goroutines in the prologue of runtime_test.growing.

I'm not sure what's happening on Windows, but on the other platforms every single timeout has a goroutine in state "running" that's in a context where it could be allocating, or it clearly is trying to allocate and is blocked in an assist. |
CL https://golang.org/cl/18648 mentions this issue. |
Do not forget that all Windows failures we see are around Windows callbacks. There is a lot of unsafe (and asm) code there. I wouldn't be surprised if some recent runtime changes affected that. Alex |
While the default behavior of eliding runtime frames from tracebacks usually makes sense, this is not the case when you're trying to test the runtime itself. Fix this by forcing the traceback level to at least "system" in the runtime tests. This will specifically help with debugging issue #13645, which has proven remarkably resistant to reproduction outside of the build dashboard itself. Change-Id: I2a8356ba6c3c5badba8bb3330fc527357ec0d296 Reviewed-on: https://go-review.googlesource.com/18648 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Alex Brainman <alex.brainman@gmail.com>
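For reference, a sketch (not the CL itself) of how an ordinary test binary could raise the traceback level in the same spirit, assuming runtime/debug.SetTraceback (added around this time); the package and function names here are illustrative only:

package example_test

import (
	"os"
	"runtime/debug"
	"testing"
)

func TestMain(m *testing.M) {
	// "system" includes runtime frames that the default traceback level
	// elides, matching the intent of the CL above.
	debug.SetTraceback("system")
	os.Exit(m.Run())
}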
I looked at Windows a little more closely and those failures are also in a context that might be allocating (because of the defer statement in cgocall). Hopefully we'll have much more information once a few more failures show up on the dashboard with 1556c31 applied. |
I managed to reproduce this by repeatedly running the trybots on https://go-review.googlesource.com/#/c/18646/ and https://go-review.googlesource.com/#/c/18680/. These show runtime frames. Unfortunately, for reasons I don't understand, this didn't traceback at "crash" level, so some tracebacks are still hidden. Some interesting cases:

https://storage.googleapis.com/go-build-log/390e0725/windows-amd64-gce_4122b0f2.log: Shows a cgocall mallocing and assisting and getting stuck in the assist. Interestingly, the goroutine is shown as "runnable", not "GC assist wait", which indicates we woke the goroutine but it never actually ran again. My best guess is that there's an issue with us putting that goroutine on the global run queue and never pulling it off the global run queue. However, the background sweeper is also runnable and in Gosched, which indicates a few things: GC is definitely over, at least some scheduling happened after GC started the world again, and there's almost certainly a G on a local run queue (edit: this is not correct; Gosched puts it on the global run queue). However, it also looks like the system has been wedged for at least five minutes, which is more than enough time for the background sweeper to finish, suggesting that the scheduler lost track of both the background sweeper and the assist. The scheduler isn't completely wedged, though, because it ran the timeout goroutine.

https://storage.googleapis.com/go-build-log/763ec68b/windows-amd64-gce_7689e79d.log: A runtime test timeout, but probably unrelated to the GC problem since GC appears to be completely idle and there are no assists running.

https://storage.googleapis.com/go-build-log/7647f2db/windows-amd64-gce_e5e4abda.log: Same.

https://storage.googleapis.com/go-build-log/7d903bfa/linux-amd64_d3fa644c.log: User goroutine is stuck in the GC assist wait park. There's a running background mark worker, which, unfortunately, didn't show a traceback. The stuck mark worker is presumably why the system isn't making progress.

I've started the same cycle on https://go-review.googlesource.com/18681 but with the traceback level set more aggressively to crash. I'll probably add some scheduler debugging to it. |
CL https://golang.org/cl/18682 mentions this issue. |
GC assists check gcBlackenEnabled under the assist queue lock to avoid going to sleep after gcWakeAllAssists has already woken all assists. However, currently we clear gcBlackenEnabled shortly *after* waking all assists, which opens a window where this exact race can happen. Fix this by clearing gcBlackenEnabled before waking blocked assists. However, it's unlikely this actually matters because the world is stopped between waking assists and clearing gcBlackenEnabled and there aren't any obvious allocations during this window, so I don't think an assist could actually slip in to this race window. Updates #13645. Change-Id: I7571f059530481dc781d8fd96a1a40aadebecb0d Reviewed-on: https://go-review.googlesource.com/18682 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Rick Hudson <rlh@golang.org>
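The race the CL describes is the classic lost-wakeup pattern: a sleeper must re-check the flag under the same lock the waker holds when flipping it, and the waker must flip the flag before issuing the wakeup. A simplified sketch using sync.Cond with made-up names (not the runtime's own primitives or code):

package main

import "sync"

type assistQueue struct {
	mu      sync.Mutex
	cond    *sync.Cond
	blacken bool // stands in for gcBlackenEnabled
}

func newQueue() *assistQueue {
	q := &assistQueue{blacken: true}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// park blocks only while marking is still enabled; the check and the wait
// happen under q.mu, so a concurrent wakeAll cannot slip in between them.
func (q *assistQueue) park() {
	q.mu.Lock()
	for q.blacken {
		q.cond.Wait()
	}
	q.mu.Unlock()
}

// wakeAll clears the flag before broadcasting, mirroring the fix:
// clear gcBlackenEnabled, then wake the blocked assists.
func (q *assistQueue) wakeAll() {
	q.mu.Lock()
	q.blacken = false
	q.cond.Broadcast()
	q.mu.Unlock()
}

func main() {
	q := newQueue()
	done := make(chan struct{})
	go func() { q.park(); close(done) }()
	q.wakeAll()
	<-done
}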
Got a reproduction with GOTRACEBACK=crash (but unfortunately not the scheduler debug dump I added because it was killed by the infrastructure, not from within): https://storage.googleapis.com/go-build-log/f7c7c056/darwin-amd64-10_10_795d2e74.log

Goroutine 53 is parked in a GC assist wait and has been for at least 3 minutes. Goroutines 67 and 74 are actively running mark workers. 74 is on M 8 trying to stop the world but is stuck in the loop waiting for other Ms to voluntarily stop. 67 is presumably the one preventing it. Unfortunately, 67 doesn't obviously appear in any of the other thread tracebacks, which is confusing because all of the other threads are clearly at points where they would have switched their Gs out of _Grunning. |
Another failure on the dashboard, now with "system" level tracebacks for runtime failures: 2016-01-16T02:40:00-b05f18e/linux-ppc64-minux

None of the goroutines are in assists and the background mark workers are idle. Other than the alarm goroutine, only two goroutines are runnable: the background sweeper is in Gosched and the main test thread is in the prologue of fmt.Sprintf. These are both weird. Based on the other goroutines, there almost certainly hasn't been GC activity for 19 minutes, so the background sweeper has had oodles of time to finish. fmt.Sprintf could have stopped either because it needed to grow its stack or because of a preemption. I don't think it needed to grow its stack because the calling frame was already past the first doubling boundary (the stack size in the caller is 0xc820037f80 - 0xc820037910 = 1648 bytes and the first boundary should be 2048 - 720 + 128 = 1456) and it's nowhere near the next boundary (4096 - 720 + 128 = 3504). That means it got preempted, which calls gopreempt_m, which puts it into _Grunnable (consistent with the traceback) and puts it on the global queue. Since both goroutines are on the global run queue, this again points at some issue with scheduling things from the global queue. |
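The stack-boundary arithmetic above can be double-checked mechanically. The snippet below only restates the numbers from that comment (the frame addresses and the 2048/4096/720/128 constants as given there); it is not runtime code:

package main

import "fmt"

func main() {
	callerFrame := 0xc820037f80 - 0xc820037910 // 1648 bytes in use by the caller's frame
	firstBoundary := 2048 - 720 + 128          // 1456: first stack-doubling check point
	nextBoundary := 4096 - 720 + 128           // 3504: next check point
	fmt.Println(callerFrame, firstBoundary, nextBoundary)
}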
In nearly every case, one of the goroutines that is runnable or in "GC assist wait" is also "locked to thread". This happens in Linux on the main testing thread because of the init function in runtime_linux_test.go. It happens on Windows in the test that we see freezing (I'm not sure why; I assume because of the callback). It also happened in the Darwin failure because the wedged test, TestTraceStressStartStop, calls LockOSThread. |
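For readers unfamiliar with the "locked to thread" annotation: it comes from runtime.LockOSThread, which wires a goroutine to its OS thread so that no other goroutine runs there. A minimal, self-contained illustration (not taken from the tests mentioned above):

package main

import (
	"fmt"
	"runtime"
)

func main() {
	done := make(chan struct{})
	go func() {
		// While locked, this goroutine is the only one that can run on
		// this OS thread; tracebacks show it as "locked to thread".
		runtime.LockOSThread()
		defer runtime.UnlockOSThread()
		fmt.Println("running locked to an OS thread")
		close(done)
	}()
	<-done
}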
Finally a timeout with a snapshot of the scheduler state: https://storage.googleapis.com/go-build-log/8ecdfe49/windows-amd64-gce_e20c47a0.log. GC is not running and has not run for over five minutes. Both P run queues are empty (one is idle, the other is running the alarm goroutine). The two runnable goroutines (background sweeper and a cgocall) are on the global run queue and must have been stuck there for several minutes. The stuck cgocall is "locked to thread". |
The same. For example, |
So, @aclements, how can that happen? :-) A few thoughts, perhaps none of them useful.

(1) It looks to me like there are two different p's both with gcMarkWorkerMode == gcMarkWorkerFractionalMode. That seems wrong: I thought there would be only one worker dedicated to being a fraction at any given time. It looks like the "we lost a race" here would explain having two fractional workers:
I don't see any actual harm caused by having two fractional workers in the code. Eventually one will die and not be replaced (until the next race). But still it seems scary and could be avoided with a Cas if we understood how it explained real damage.

(2) gcphase is 0. Why are there any fractional workers in that case? The GC that kicked off the fractional workers must have finished to get to gcphase 0. How did it finish with the fractional workers still thinking they should run?

(3) I see assignments that set p.gcMarkWorkerMode but nothing that clears it. Also gcMarkWorkerDedicatedMode == 0, which is a little odd for distinguishing "not a worker" from "a worker". But maybe code only looks at that field if it knows it is a worker, in which case the two p's with fractional mode could both be stale, with neither actually being a worker.

(4) All cgocalls lock the goroutine to the thread for the duration of the call, in case of cgo callbacks. When a goroutine is locked to a thread, that thread bails out of func schedule very early. It does not even reach the code thinking about being a GC worker. It's possible that the thread ran a GC worker earlier, before being locked, but that worker is preempted if we're doing the cgo call. And then the worker is attached to the p, which will be released elsewhere while the cgo call is blocked. And the GC finished.

(5) The stack trace doesn't show us where the cgo call goroutine is blocked. It will show the entersyscall information until gp.syscallsp is cleared after the goroutine is cleared to run. But I don't see any way it can block during entersyscall, so it must be blocked in exitsyscall.

(6) System stacks would be nice. Since this seems to happen mainly on Windows, I wonder if we can adapt os1_windows.go's profileloop1/profilem into something that will give us the same thread stacks we now can get on Unix. |
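For context on point (1): the race being referred to involves a counter that is decremented only if positive, with the loser of the race falling back to fractional mode. The sketch below shows the general decrement-if-positive CAS pattern only; the names and types are illustrative, not the runtime's own code:

package main

import (
	"fmt"
	"sync/atomic"
)

// decIfPositive atomically decrements *v only if it is currently positive
// and reports whether this caller won. When two callers race on a value of
// 1, exactly one wins; in the scenario above, the loser would become a
// fractional worker instead.
func decIfPositive(v *int64) bool {
	for {
		old := atomic.LoadInt64(v)
		if old <= 0 {
			return false
		}
		if atomic.CompareAndSwapInt64(v, old, old-1) {
			return true
		}
	}
}

func main() {
	n := int64(1)
	fmt.Println(decIfPositive(&n), decIfPositive(&n)) // true false
}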
For 1, 2, and 3, you're right that it wouldn't be intentional for multiple fractional workers to be running at the same time, but we never clear gcMarkWorkerMode, so these are just stale values: gcMarkWorkerMode is a one-time communication from the scheduler at wakeup of what type of worker a worker should be. We didn't even have to lose the decIfPositive race. One fractional worker could have parked before we started the other one.

4 is interesting. I strongly suspect locked goroutines are part of this. I hadn't thought about how that skips specifically the worker scheduling code. OTOH, in https://storage.googleapis.com/go-build-log/8ecdfe49/windows-amd64-gce_e20c47a0.log the workers are parked in their usual spot and GC isn't active, so it shouldn't have scheduled any workers anyway. |
The locking of goroutines concerns threads, not p's, so the fact that it skips the worker scheduling code shouldn't really matter. The locked thread (m) will park itself and some other m will get to run. Not that this helps. |
I finally managed to reproduce this outside of the build system! Last night I got 3 failures out of 187 runs on the windows-386-gce builder using the following script
run as
diff --git a/src/run.bat b/src/run.bat
index 01a66bc..2f4ab00 100644
--- a/src/run.bat
+++ b/src/run.bat
@@ -37,7 +37,7 @@ call env.bat
del env.bat
echo.
-go tool dist test
+go tool dist test runtime:cpu124
if errorlevel 1 goto fail
echo.
Next steps are to narrow this down a bit and to see if the same approach works on a Linux builder where it's easier to debug things. |
Reproduced in 2 out of 230 iterations across 13 windows-386-gce VMs with a loop around go tool dist test:
|
Reproduced once out of 135 linux-amd64-noopt VMs with the equivalent Linux change to #13645 (comment) |
Reproduced 2 out of 147 iterations across 6 linux-amd64-noopt VMs with the equivalent of #13645 (comment) |
I was able to collect some core dumps from this. I took the core dumps without allowing the freezetheworld and various other things that happen in the panic handler and, interestingly, the story they tell is rather different from the timeout tracebacks. Here's a representative example:
All of the Gs are waiting and all of the Ms are blocked. Notably, the main G is blocked in an assist and no workers are running and everything else is waiting, so nothing schedules a new worker and no worker satisfies the assist. GC is in mark 2. work.full is empty, but only 2 of 4 mark root jobs have completed, so there is still GC work to be done. In this situation, since there are only two Ps, it's possible that no fractional worker would be scheduled, but I would expect idle workers to keep things going. The lack of idle workers indicates that we somehow failed to satisfy the idle worker start condition on all of the Ps simultaneously just before they all blocked. My best guess is that the workers were detached, but I haven't been able to come up with a sequence that would cause this, and as of the core file all Ps have workers.

A few things worth trying to see if they amplify the chance of this happening:

- Add delay between detaching and gcMarkDone to encourage the workers to simultaneously detach.
- Disable fractional and dedicated workers to force everything through idle workers.
- For debugging, it may help for the Ps to record some of the state of the world before they go idle, like why they didn't start an idle worker. |
The following patch exacerbates the problem with locked Gs by locking all Gs. In theory this should work, but it causes the runtime test to regularly freeze in much the same way as seen in these test failures. (It also causes TestMainGoroutineID to fail for uninteresting reasons.)
diff --git a/src/runtime/proc.go b/src/runtime/proc.go
index 2bc3c92..cfb1172 100644
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -2647,6 +2652,8 @@ func newproc(siz int32, fn *funcval) {
})
}
+var lockOSThreadFunc uintptr
+
// Create a new g running fn with narg bytes of arguments starting
// at argp and returning nret bytes of results. callerpc is the
// address of the go statement that created this. The new g is put
@@ -2703,6 +2710,10 @@ func newproc1(fn *funcval, argp *uint8, narg int32, nret int32, callerpc uintptr
newg.sched.pc = funcPC(goexit) + sys.PCQuantum // +PCQuantum so that previous instruction is in same function
newg.sched.g = guintptr(unsafe.Pointer(newg))
gostartcallfn(&newg.sched, fn)
+ if lockOSThreadFunc == 0 {
+ lockOSThreadFunc = funcPC(LockOSThread)
+ }
+ gostartcallfn(&newg.sched, (*funcval)(unsafe.Pointer(&lockOSThreadFunc)))
newg.gopc = callerpc
newg.startpc = fn.fn
if isSystemGoroutine(newg) { |
I've been stressing https://go-review.googlesource.com/19107 for over two hours across 10 VMs and so far none of the 1,242 runs has failed. |
CL https://golang.org/cl/19107 mentions this issue. |
Build fails sometimes: http://build.golang.org/log/0f5d58c2d0ba874ec4e387ad5a6cc9c6972b6aa0
Alex