Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

time: Timer reset broken under heavy use since go1.16 timer optimizations added [1.16 backport] #47332

Closed
gopherbot opened this issue Jul 22, 2021 · 16 comments
Labels
CherryPickApproved Used during the release process for point releases FrozenDueToAge
Milestone

Comments

@gopherbot
Copy link

@ianlancetaylor requested issue #47329 to be considered for backport to the next 1.16 minor release.

@gopherbot Please open backport to 1.16.

This bug also exists in 1.16. It can cause programs that use Timer.Reset to fail to run a timer when it is ready.

@gopherbot gopherbot added the CherryPickCandidate Used during the release process for point releases label Jul 22, 2021
@gopherbot gopherbot added this to the Go1.16.7 milestone Jul 22, 2021
@toothrot toothrot added the CherryPickApproved Used during the release process for point releases label Jul 22, 2021
@gopherbot gopherbot removed the CherryPickCandidate Used during the release process for point releases label Jul 22, 2021
@toothrot
Copy link
Contributor

Approved. This is a serious issue with no workaround.

@gopherbot
Copy link
Author

Change https://golang.org/cl/336689 mentions this issue: [release-branch.go1.16] runtime: don't clear timerModifiedEarliest if adjustTimers is 0

@gopherbot
Copy link
Author

Closed by merging ed8cbbc to release-branch.go1.16.

gopherbot pushed a commit that referenced this issue Jul 22, 2021
… adjustTimers is 0

This avoids a race when a new timerModifiedEarlier timer is created by
a different goroutine.

For #47329
Fixes #47332

Change-Id: I6f6c87b4a9b5491b201c725c10bc98e23e0ed9d1
Reviewed-on: https://go-review.googlesource.com/c/go/+/336432
Trust: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
(cherry picked from commit 798ec73)
Reviewed-on: https://go-review.googlesource.com/c/go/+/336689
@dmitshur
Copy link
Contributor

dmitshur commented Jul 27, 2021

@ianlancetaylor Should the follow up change CL 337309 also be cherry-picked to 1.16?

(I'll reopen this so we don't forget to make that decision before release time.)

@dmitshur dmitshur added NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. release-blocker labels Jul 27, 2021
@dmitshur dmitshur reopened this Jul 27, 2021
@ianlancetaylor
Copy link
Contributor

I'm uncertain whether CL 337309 should be cherry picked to the 1.16 branch. The bug fix (CLs 336432, 336689) fixed a real bug that could cause timers to fail to fire. CL 337309 is a performance fix. I'm mildly concerned that CL 337309 will introduce some other unforeseen performance problem. On the 1.16 release branch we needed the bug fix, but I don't know how hard we want to change performance issues on that branch.

@cagedmantis
Copy link
Contributor

Closed per Ian's comment.

@tmm1
Copy link
Contributor

tmm1 commented Jul 30, 2021

FYI I'm seeing increased CPU usage on various x64 and arm64 linux nodes after rolling out ed8cbbc. I will rollback to 1.16.6 tomorrow to confirm and capture some cpu profiles.

EDIT: I confirmed reverting that single commit brought the cpu usage of my otherwise idle golang program from 100% back to 0%.

Here's a 60s cpu profile with that commit included:

(pprof) top --cum
Showing nodes accounting for 49.47s, 82.42% of 60.02s total
Dropped 25 nodes (cum <= 0.30s)
Showing top 10 nodes out of 26
      flat  flat%   sum%        cum   cum%
     4.17s  6.95%  6.95%     60.01s   100%  runtime.findrunnable
         0     0%  6.95%     58.82s 98.00%  runtime.schedule
         0     0%  6.95%     58.76s 97.90%  runtime.mcall
         0     0%  6.95%     58.63s 97.68%  runtime.park_m
     3.91s  6.51% 13.46%     42.10s 70.14%  runtime.netpoll
    38.19s 63.63% 77.09%     38.19s 63.63%  runtime.epollwait
     1.90s  3.17% 80.26%      4.70s  7.83%  runtime.checkTimers
     1.25s  2.08% 82.34%      1.54s  2.57%  runtime.nanotime (inline)
     0.01s 0.017% 82.36%      1.42s  2.37%  runtime.unlock (inline)
     0.04s 0.067% 82.42%      1.41s  2.35%  runtime.unlockWithRank (inline)

and back to normal after I reverted it:

(pprof) top --cum
Showing nodes accounting for 30ms, 75.00% of 40ms total
Showing top 10 nodes out of 17
      flat  flat%   sum%        cum   cum%
         0     0%     0%       20ms 50.00%  runtime.gcBgMarkWorker
         0     0%     0%       20ms 50.00%  runtime.gcBgMarkWorker.func2
      10ms 25.00% 25.00%       20ms 50.00%  runtime.gcDrain
         0     0% 25.00%       20ms 50.00%  runtime.mcall
         0     0% 25.00%       20ms 50.00%  runtime.park_m
         0     0% 25.00%       20ms 50.00%  runtime.schedule
         0     0% 25.00%       20ms 50.00%  runtime.systemstack
      10ms 25.00% 50.00%       10ms 25.00%  runtime.epollwait
         0     0% 50.00%       10ms 25.00%  runtime.findrunnable
      10ms 25.00% 75.00%       10ms 25.00%  runtime.futex

@tmm1
Copy link
Contributor

tmm1 commented Jul 30, 2021

The CPU usage issue was also observed on macOS and Windows nodes, and was also fixed there after rolling back this commit.

@ianlancetaylor
Copy link
Contributor

What can you tell us about the way in which your program uses timers and deadlines?

Does the change in https://golang.org/cl/337309 fix the problem?

Thanks.

@tmm1
Copy link
Contributor

tmm1 commented Jul 30, 2021

I will try a build with that CL now. I'm testing on an arm64 rpi4 which consistently shows 100% cpu usage with the bad commit.

My program has a number of goroutines which run infinite loops as so:

for {
	select {
	case <-time.After(3 * time.Hour):
	case <-r.wakeup:
	case <-r.quit:
		r.quit <- true
		return
	}

	...
}

@tmm1
Copy link
Contributor

tmm1 commented Jul 30, 2021

CPU usage is normal with https://golang.org/cl/337309 applied.

@ianlancetaylor
Copy link
Contributor

Thanks for testing. I will backport 337309 to the 1.16 branch.

@gopherbot
Copy link
Author

Change https://golang.org/cl/338649 mentions this issue: [release-branch.go1.16] runtime: remove adjustTimers counter

@ianlancetaylor
Copy link
Contributor

Backport complete.

gopherbot pushed a commit that referenced this issue Jul 30, 2021
In CL 336432 we changed adjusttimers so that it no longer cleared
timerModifiedEarliest if there were no timersModifiedEarlier timers.
This caused some Google internal tests to time out, presumably due
to the increased contention on timersLock.  We can avoid that by
simply not skipping the loop in adjusttimers, which lets us safely
clear timerModifiedEarliest.  And if we don't skip the loop, then there
isn't much reason to keep the count of timerModifiedEarlier timers at all.
So remove it.

The effect will be that for programs that create some timerModifiedEarlier
timers and then remove them all, the program will do an occasional
additional loop over all the timers.  And, programs that have some
timerModifiedEarlier timers will always loop over all the timers,
without the quicker exit when they have all been seen.  But the loops
should not occur all that often, due to timerModifiedEarliest.

For #47329
For #47332

Change-Id: I7b244c1244d97b169a3c7fbc8f8a8b115731ddee
Reviewed-on: https://go-review.googlesource.com/c/go/+/337309
Trust: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
(cherry picked from commit bfbb288)
Reviewed-on: https://go-review.googlesource.com/c/go/+/338649
@dmitshur dmitshur removed NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. release-blocker labels Jul 30, 2021
@lukestoward
Copy link

I've been trying to diagnose an issue relating to short sharp CPU spikes which sound similar to @tmm1's comment above. I'm running a http server that configures a read (5 secs), write (10 secs) and idle (15 secs) timeout. I found that when using Go 1.16.6 the CPU usage is stable, however when using 1.16.7 I see several 100% CPU spikes lasting <2 mins over a 24 hour period. I have been unable to correlate the spikes with any other metric. It appears to be random. HTTP traffic ranges from several request per minute to 10,000+.

If I remove the IdleTimeout value on the http server, falling back to the default, as per the documentation, I do not see any CPU spikes. For a little while now I've been trying to reproduce the issue with a minimal example unsuccessfully. I'm convinced the changes to src/runtime/time.go in 1.16.7 introduced a bug, but I'm having difficulty proving it. Any suggestions on what I might be able to try/do to help provide more information?

@ianlancetaylor
Copy link
Contributor

@lukestoward This issue is closed and as far as we can tell, it is fixed. The fix was in the Go 1.16.7 release. Although the fix may well have introduced the problem you describe, since the patch has been released we should treat this as a new problem. I encourage you to open a new issue, ideally with a case that we can use to reproduce the problem.

See also #47762.

@golang golang locked and limited conversation to collaborators Sep 6, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
CherryPickApproved Used during the release process for point releases FrozenDueToAge
Projects
None yet
Development

No branches or pull requests

7 participants