Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

time, runtime: scheduled timer may never fire if GOMAXPROCS is reduced #45716

Closed
prashantv opened this issue Apr 23, 2021 · 6 comments
Closed

Comments

@prashantv
Copy link
Contributor

prashantv commented Apr 23, 2021

This issue is related to #44868 (and seems to be resolved with the same fix) but I wanted to file a separate issue as the impact is different, and may help others running into the same issue.

What version of Go are you using (go version)?

$ go version
go version go1.16.2 linux/amd64

Does this issue reproduce with the latest release?

Not yet,, go1.16.3 seems fine (only tested with simple repro, not on production). This does reproduce with 1.15.11.

This is likely fixed by https://go-review.googlesource.com/c/go/+/300610/.

What operating system and processor architecture are you using (go env)?

linux, amd64

go env Output
$ go env
GO111MODULE="on"
GOARCH="amd64"
[..]
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
[...]
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/user/.gimme/versions/go1.16.2.linux.amd64"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/user/.gimme/versions/go1.16.2.linux.amd64/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.16.2"
[..]

What did you do?

Created a background goroutine to do some periodic work every 500ms. The background goroutine created a ticker, and used a for loop that waited on the ticker.

Concurrently, the main goroutine reduced GOMAXPROCS (based on some system config).

Repro:
https://gist.github.com/prashantv/fdd710e270244efa23ba7a3fcb45b088

This needs to be run in a loop as it happens quite infrequently, it should eventually freeze with a log like:

$ while [ true ] ; do ./stuck-timer-repro || break; done
[...]
attempt 4
========= Likely bug =======
timer 53 timed out after 10s of waiting 0

The repro happens in a few seconds on go1.16.2.

The process is then blocked waiting on the ticker's channel, and allows the process to be debugged.

What did you expect to see?

Ticker channel to receive ticks periodically, ensuring periodic work run roughly twice a second.

What did you see instead?

Ticker did not tick at all, and periodic work did not happen.

More Context

In an internal production service, we noticed some periodic work was not happening as expected. After ruling out other possibilities, the only explanation left was that the ticker was not ticking. Using /debug/pprof/goroutines, I was able to find "stuck" select calls the appropriate stacktrace; the ticker was supposed to fire every few hundred ms, but the select was stuck for many minutes (including thousands of minutes).

Using delve with a core, I discovered that the timer was added to a P that was dead. The same can be reproduced with the repro, and an example delve sessions showing the state is below:

delve debugging

wait for the stuck-docker-repro to print the "likely bug" as above before running delve

dlv attach $(pidof stuck-timer-repro)
(dlv) grs
Goroutine 1 - User: ./stuck-timer-repro.go:52 main.run (0x4a74e5)
Goroutine 2 - User: /home/user/.gimme/versions/go1.16.2.linux.amd64/src/runtime/proc.go:337 runtime.gopark (0x435985)
Goroutine 3 - User: /home/user/.gimme/versions/go1.16.2.linux.amd64/src/runtime/proc.go:337 runtime.gopark (0x435985)
Goroutine 4 - User: /home/user/.gimme/versions/go1.16.2.linux.amd64/src/runtime/proc.go:337 runtime.gopark (0x435985)
Goroutine 5 - User: /home/user/.gimme/versions/go1.16.2.linux.amd64/src/runtime/proc.go:337 runtime.gopark (0x435985)
[5 goroutines]

switch to the main goroutine so we can find the timer object

(dlv) gr 1
Switched from 0 to 1 (thread 108415)

(dlv) bt -full
0 0x0000000000435985 in runtime.gopark
at /home/user/.gimme/versions/go1.16.2.linux.amd64/src/runtime/proc.go:337
unlockf = runtime.chanparkcommit
[...]

3 0x00000000004a74e5 in main.run
at ./stuck-timer-repro.go:52
gomaxprocs = 10
goroutines = 100
wg = sync.WaitGroup {noCopy: sync.noCopy {}, state1: [3]uint32 [...]}
tickers = []*time.Ticker len: 100, cap: 100, [...]
i = 0
ti = 81
t = (*time.Ticker)(0xc0004820a0)
[...]

get information about the timer

(dlv) frame 3
(truncated)
=> 52: <-t.C

(dlv) print t
*time.Ticker {
C: <-chan time.Time {
qcount: 0,
dataqsiz: 1,
buf: *[1]time.Time [
(*time.Time)(0xc000480030),
],
elemsize: 24,
closed: 0,
elemtype: *runtime._type {size: 24, ptrdata: 24, hash: 3225689344, tflag: tflagUncommon|tflagExtraStar|tflagNamed|tflagRegularMemory (15), align: 8, fieldAlign: 8, kind: 25, equal: runtime.memequal_varlen, gcdata: *4, str: 9430, ptrToThis: 140736},
sendx: 0,
recvx: 0,
recvq: waitq<time.Time> {
first: *(*sudog<time.Time>)(0xc00013e000),
last: *(*sudog<time.Time>)(0xc00013e000),},
sendq: waitq<time.Time> {
first: *sudog<time.Time> nil,
last: *sudog<time.Time> nil,},
lock: runtime.mutex {
lockRankStruct: runtime.lockRankStruct {},
key: 0,},},
r: time.runtimeTimer {
pp: 824633896960,
when: 373510132342503,
period: 1000000,
f: time.sendTime,
arg: interface {}(chan time.Time) *(*interface {})(0xc0004820c8),
seq: 0,
nextwhen: 0,
status: 1,},}

print status of the p that the timer was added to, which shows the p is dead

(dlv) print (*runtime.p)(824633896960).status
4
(dlv) print (*runtime.p)(824633896960).timers
[]*runtime.timer len: 1, cap: 1, [
*{
pp: 824633896960,
when: 373510132342503,
period: 1000000,
f: time.sendTime,
arg: interface {}(chan time.Time) *(*interface {})(0xc0004820c8),
seq: 0,
nextwhen: 0,
status: 1,},

@ianlancetaylor
Copy link
Contributor

If I'm reading this issue correctly, it is already fixed. Should we just close it?

@prashantv
Copy link
Contributor Author

prashantv commented Apr 23, 2021

I think it's fixed, but it would be great if someone who knows the timer code more closely can verify that https://go-review.googlesource.com/c/go/+/300610/ fixes this. I haven't been able to reproduce this with 1.16.3, but don't know the code well enough to confirm that it's definitely fixed.

I'm not sure if this is possible or is worth doing, but I would have liked to see something like this mentioned in the 1.16.3 release notes, since it's not just that a timer doesn't fire for 2 minutes, but a timer may never fire.

It also reproduces on 1.15.11, can the fix be backported to 1.15.x?

@ianlancetaylor
Copy link
Contributor

Yes, it seems plausible that CL 300610 will fix this problem.

I think it's quite rare that people reduce GOMAXPROCS while a program is running; I don't think this is the kind of thing that we would normally put into the release notes. Certainly we didn't think of this case.

@ianlancetaylor
Copy link
Contributor

@gopherbot Please open backport to 1.15 branch.

@gopherbot
Copy link

Backport issue(s) opened: #45731 (for 1.15).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

@ianlancetaylor
Copy link
Contributor

Closing this issue as it is fixed on tip and 1.16 branch.

@golang golang locked and limited conversation to collaborators Apr 23, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants