Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

time: Timer reset broken under heavy use since go1.16 timer optimizations added #47329

Closed
andrewvc opened this issue Jul 21, 2021 · 7 comments
Closed
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@andrewvc
Copy link

What version of Go are you using (go version)?

Tracked the specific issue to commit b4b0144 via git bisect

$ go version
go version devel go1.17-d568e6e075 Tue Jul 20 19:54:36 2021 +0000 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
andrewvc@LAPTOP-80O11FM2 ~/p/b/heartbeat (fix-timer-failure)> go env
warning: GOPATH set to GOROOT (/home/andrewvc/projects/go) has no effect
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/andrewvc/.cache/go-build"
GOENV="/home/andrewvc/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/andrewvc/projects/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/andrewvc/projects/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/andrewvc/projects/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/andrewvc/projects/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/andrewvc/projects/beats/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build674387485=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Longstanding unit tests (~1.5 years old) started sporadically failing against go 1.16.x

Our program (Heartbeat) rapidly stops and resets a single timer based on user submitted jobs, effectively testing the golang timer. Further investigation revealed that timer.Reset was no longer resetting the timer consistently. Every 10-40k iterations or so it would have no effect, resulting in a non-triggering timer, and in our case, a deadlocked program.

We tracked the specific issue to a change introduced in golang commit b4b0144 via git bisect

The failure can be reproduced by running from the special branch below, which contains an enhanced test suite for Heartbeat using a watchdog timer to catch the failed timer.

# Examples use a zip download to prevent a full repo clone
curl -L https://github.com/andrewvc/beats/archive/refs/heads/broken-timer.zip -o broken-timer.zip
unzip -q broken-timer.zip
cd beats-broken-timer/heartbeat
go test -timeout 30s -run '^TestStress$' github.com/elastic/beats/v7/heartbeat/scheduler/timerqueue

We are now avoiding Reset in favor of NewTimer in a workaround PR elastic/beats#27006 . You can validate this by deleting the Reset call here and replacing it with the NewTimer call here

Digging into the golang source code I discovered that I could fix the issue by commenting out the optimization on this line inside adjusttimers . It seems that the accounting of that variable may have an issue somewhere. The code is quite tricky, heavily concurrent, etc, and could use the eye of someone familiar with it.

What did you expect to see?

I expected the timer to fire consistently when reset.

What did you see instead?

Nothing, after enhancing the test suite for heartbeat to dump traces it was apparent that the program was in an idle state, with no timer scheduled, and no other code blocked.

@ianlancetaylor ianlancetaylor changed the title Timer reset broken under heavy use since go1.16 timer optimizations added time: Timer reset broken under heavy use since go1.16 timer optimizations added Jul 21, 2021
@ianlancetaylor ianlancetaylor added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jul 21, 2021
@ianlancetaylor ianlancetaylor added this to the Go1.18 milestone Jul 21, 2021
@ianlancetaylor
Copy link
Contributor

Thanks for the good test case and analysis.

@ianlancetaylor ianlancetaylor modified the milestones: Go1.18, Go1.17 Jul 22, 2021
@ianlancetaylor
Copy link
Contributor

@gopherbot Please open backport to 1.16.

This bug also exists in 1.16. It can cause programs that use Timer.Reset to fail to run a timer when it is ready.

@gopherbot
Copy link

Backport issue(s) opened: #47332 (for 1.16).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

@gopherbot
Copy link

Change https://golang.org/cl/336432 mentions this issue: runtime: don't clear timerModifiedEarliest if adjustTimers is 0

andrewvc added a commit to elastic/beats that referenced this issue Jul 22, 2021
Fixes #26205

Fixes deadlocks in heartbeat scheduler.

No changelog needed because this bug only exists in go 1.16+ and v7.14.0 is the first affected beats version

The underlying cause is a likely bug in golang, reported here by myself here: golang/go#47329
mergify bot pushed a commit to elastic/beats that referenced this issue Jul 22, 2021
Fixes #26205

Fixes deadlocks in heartbeat scheduler.

No changelog needed because this bug only exists in go 1.16+ and v7.14.0 is the first affected beats version

The underlying cause is a likely bug in golang, reported here by myself here: golang/go#47329

(cherry picked from commit 2f62e8f)
mergify bot pushed a commit to elastic/beats that referenced this issue Jul 22, 2021
Fixes #26205

Fixes deadlocks in heartbeat scheduler.

No changelog needed because this bug only exists in go 1.16+ and v7.14.0 is the first affected beats version

The underlying cause is a likely bug in golang, reported here by myself here: golang/go#47329

(cherry picked from commit 2f62e8f)
andrewvc added a commit to elastic/beats that referenced this issue Jul 22, 2021
Fixes #26205

Fixes deadlocks in heartbeat scheduler.

No changelog needed because this bug only exists in go 1.16+ and v7.14.0 is the first affected beats version

The underlying cause is a likely bug in golang, reported here by myself here: golang/go#47329

(cherry picked from commit 2f62e8f)

Co-authored-by: Andrew Cholakian <andrew@andrewvc.com>
@gopherbot
Copy link

Change https://golang.org/cl/336689 mentions this issue: [release-branch.go1.16] runtime: don't clear timerModifiedEarliest if adjustTimers is 0

gopherbot pushed a commit that referenced this issue Jul 22, 2021
… adjustTimers is 0

This avoids a race when a new timerModifiedEarlier timer is created by
a different goroutine.

For #47329
Fixes #47332

Change-Id: I6f6c87b4a9b5491b201c725c10bc98e23e0ed9d1
Reviewed-on: https://go-review.googlesource.com/c/go/+/336432
Trust: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
(cherry picked from commit 798ec73)
Reviewed-on: https://go-review.googlesource.com/c/go/+/336689
andrewvc added a commit to elastic/beats that referenced this issue Jul 22, 2021
Fixes #26205

Fixes deadlocks in heartbeat scheduler.

No changelog needed because this bug only exists in go 1.16+ and v7.14.0 is the first affected beats version

The underlying cause is a likely bug in golang, reported here by myself here: golang/go#47329

(cherry picked from commit 2f62e8f)

Co-authored-by: Andrew Cholakian <andrew@andrewvc.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
@gopherbot
Copy link

Change https://golang.org/cl/337309 mentions this issue: runtime: remove adjustTimers counter

gopherbot pushed a commit that referenced this issue Jul 26, 2021
In CL 336432 we changed adjusttimers so that it no longer cleared
timerModifiedEarliest if there were no timersModifiedEarlier timers.
This caused some Google internal tests to time out, presumably due
to the increased contention on timersLock.  We can avoid that by
simply not skipping the loop in adjusttimers, which lets us safely
clear timerModifiedEarliest.  And if we don't skip the loop, then there
isn't much reason to keep the count of timerModifiedEarlier timers at all.
So remove it.

The effect will be that for programs that create some timerModifiedEarlier
timers and then remove them all, the program will do an occasional
additional loop over all the timers.  And, programs that have some
timerModifiedEarlier timers will always loop over all the timers,
without the quicker exit when they have all been seen.  But the loops
should not occur all that often, due to timerModifiedEarliest.

For #47329

Change-Id: I7b244c1244d97b169a3c7fbc8f8a8b115731ddee
Reviewed-on: https://go-review.googlesource.com/c/go/+/337309
Trust: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
@gopherbot
Copy link

Change https://golang.org/cl/338649 mentions this issue: [release-branch.go1.16] runtime: remove adjustTimers counter

gopherbot pushed a commit that referenced this issue Jul 30, 2021
In CL 336432 we changed adjusttimers so that it no longer cleared
timerModifiedEarliest if there were no timersModifiedEarlier timers.
This caused some Google internal tests to time out, presumably due
to the increased contention on timersLock.  We can avoid that by
simply not skipping the loop in adjusttimers, which lets us safely
clear timerModifiedEarliest.  And if we don't skip the loop, then there
isn't much reason to keep the count of timerModifiedEarlier timers at all.
So remove it.

The effect will be that for programs that create some timerModifiedEarlier
timers and then remove them all, the program will do an occasional
additional loop over all the timers.  And, programs that have some
timerModifiedEarlier timers will always loop over all the timers,
without the quicker exit when they have all been seen.  But the loops
should not occur all that often, due to timerModifiedEarliest.

For #47329
For #47332

Change-Id: I7b244c1244d97b169a3c7fbc8f8a8b115731ddee
Reviewed-on: https://go-review.googlesource.com/c/go/+/337309
Trust: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
(cherry picked from commit bfbb288)
Reviewed-on: https://go-review.googlesource.com/c/go/+/338649
andrewvc added a commit to andrewvc/beats that referenced this issue Jan 6, 2022
Undoes https://github.com/elastic/beats/pull/27006/files (while preserving the new test), and also cleaning up the syntax using `time.Until`.

Since the The golang bug I reported in golang/go#47329 has been fixed since somewhere in go 1.16.x (it's hard to track the exact version). This should be backported to 7.16.x since that already uses go 1.17.x and is safe.
andrewvc added a commit to elastic/beats that referenced this issue Jan 7, 2022
Undoes https://github.com/elastic/beats/pull/27006/files (while preserving the new test), and also cleaning up the syntax using `time.Until`.

Since the The golang bug I reported in golang/go#47329 has been fixed since somewhere in go 1.16.x (it's hard to track the exact version). This should be backported to 7.16.x since that already uses go 1.17.x and is safe.
mergify bot pushed a commit to elastic/beats that referenced this issue Jan 7, 2022
Undoes https://github.com/elastic/beats/pull/27006/files (while preserving the new test), and also cleaning up the syntax using `time.Until`.

Since the The golang bug I reported in golang/go#47329 has been fixed since somewhere in go 1.16.x (it's hard to track the exact version). This should be backported to 7.16.x since that already uses go 1.17.x and is safe.

(cherry picked from commit 7b095da)
mergify bot pushed a commit to elastic/beats that referenced this issue Jan 7, 2022
Undoes https://github.com/elastic/beats/pull/27006/files (while preserving the new test), and also cleaning up the syntax using `time.Until`.

Since the The golang bug I reported in golang/go#47329 has been fixed since somewhere in go 1.16.x (it's hard to track the exact version). This should be backported to 7.16.x since that already uses go 1.17.x and is safe.

(cherry picked from commit 7b095da)
andrewvc added a commit to elastic/beats that referenced this issue Jan 14, 2022
…#29733)

Undoes https://github.com/elastic/beats/pull/27006/files (while preserving the new test), and also cleaning up the syntax using `time.Until`.

Since the The golang bug I reported in golang/go#47329 has been fixed since somewhere in go 1.16.x (it's hard to track the exact version). This should be backported to 7.16.x since that already uses go 1.17.x and is safe.

(cherry picked from commit 7b095da)

Co-authored-by: Andrew Cholakian <andrew@andrewvc.com>
@golang golang locked and limited conversation to collaborators Jul 30, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

3 participants