Does this issue reproduce with the latest release?
Not sure. This is the first time we've encountered it.
What operating system and processor architecture are you using (go env)?
Linux (amd64)
What did you do?
CockroachDB has various time-based loops. A customer reported a problem on a cluster, which we eventually traced to some of these timer loops being wedged for excessively long periods of time. The cleanest example is:
```go
func (s *Store) raftTickLoop(ctx context.Context) {
	ticker := time.NewTicker(s.cfg.RaftTickInterval)
	defer ticker.Stop()

	var rangeIDs []roachpb.RangeID

	for {
		select {
		case <-ticker.C:
			...
		case <-s.stopper.ShouldStop():
			return
		}
	}
}
```
`s.cfg.RaftTickInterval` is `200 * time.Millisecond` (there is no way to change this without recompiling). The omitted `...` code isn't doing anything with the ticker.
We have examples of badness with `time.Ticker` loops and `time.Timer` loops. This problem has occurred on multiple nodes within the same cluster, though we're unaware of it ever occurring on another cluster. Also somewhat interesting is that we have evidence that not all timer loops on a node blocked at the same time. For example, from the same node there is this stack:
The loop in `replicaScanner.scanLoop` is supposed to iterate over all replicas on a node every 10 minutes. That is what we normally see, yet here we see that goroutine blocked on a select for 2.5 days.
Have there ever been other similar reports (we found nothing looking through the open and closed issues)? Is it conceivable that we have a bug in `cockroach` that corrupted internal runtime data structures? We're scratching our heads over here about what could be going on.
Well, it appears that the investigation above jogged the memory of the customer experiencing this problem: they had previously traced a timer-related issue (not Go-related) to kernel timer corruption and sleeps getting stuck. Apparently https://lore.kernel.org/lkml/tip-1f71addd34f4c442bec7d7c749acc1beb58126f2@git.kernel.org/ fixed the problem.
Fun times. Not a CockroachDB bug. Not a Go runtime bug.
What version of Go are you using (go version)?
go version go1.10 linux/amd64