
runtime: high CPU usage with high timer adjustments #41699

Closed
dobegor opened this issue Sep 29, 2020 · 27 comments

Labels
FrozenDueToAge, NeedsInvestigation (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.)

Comments

dobegor commented Sep 29, 2020

What version of Go are you using (go version)?

$ go version
go version go1.15.2 linux/amd64

Does this issue reproduce with the latest release?

Yes. This performance regression reproduces only on Go 1.14+. Go 1.13.15 is unaffected.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOHOSTOS="linux"
GOARCH="amd64"

What did you do?

package main

import (
	"time"
)

func main() {
	for i := 0; i < 50000; i++ {
		go func() {
			t := time.NewTimer(5 * time.Second)

			for {
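				// Each iteration first pulls the timer in to an earlier expiry
				// (the 10s set on the previous iteration becomes 5s), then pushes
				// it later again, repeatedly exercising the runtime's timer
				// adjustment path.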
				t.Reset(5 * time.Second)
				t.Reset(10 * time.Second)
				time.Sleep(20 * time.Millisecond)
			}
		}()
	}

	<-make(chan struct{})
}

Running on an 8-core Google Cloud machine.

What did you expect to see?

The same or lower CPU usage on Go 1.15.2 as on Go 1.13.15.

What did you see instead?

~70% CPU usage on Go 1.13.15 and 100% CPU usage on Go 1.15.2.

Actual story of this issue

We have a "worker" application that connects to email accounts via IMAP and listens them. There's one goroutine per connection, total ~160k connections per worker node (4-core machine on GCP).

Sometimes a connection's deadline is moved (via conn.SetReadDeadline) to an earlier point than was previously set; this is a requirement of the application logic. Moving deadlines earlier causes runtime.adjusttimers to do a lot of work.
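
For illustration, the pattern looks roughly like the sketch below (the connection setup, names, and durations here are hypothetical, not the actual worker code); the second SetReadDeadline call is the one that pulls the deadline earlier:

package main

import (
	"net"
	"time"
)

func main() {
	// net.Pipe gives an in-memory connection to demonstrate the API with.
	conn, _ := net.Pipe()
	defer conn.Close()

	// A long idle deadline is set up front.
	_ = conn.SetReadDeadline(time.Now().Add(30 * time.Minute))

	// Later, application logic decides it needs to time out sooner and
	// pulls the deadline in to an earlier point; in Go 1.14/1.15 this
	// "move the timer earlier" operation is what runtime.adjusttimers
	// ends up paying for in the profiles described below.
	_ = conn.SetReadDeadline(time.Now().Add(10 * time.Second))
}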

For example, here is the pprof profile with production workload: https://gist.github.com/dobegor/80134e0d4d997cf03fa8f429da72d753

Upgrading from Go 1.13 to 1.14/1.15.2 raised CPU usage from ~80% to 100% and caused a large degradation of our business metrics. We were unable to move off Go 1.13, because running the same code took significantly more resources, until we implemented workarounds that avoid moving deadlines to an earlier point.

/cc @ianlancetaylor @ChrisHines -- thanks for reaching out to help find the root cause of this issue.

@ianlancetaylor (Contributor)

Thanks for the test case. I wrote https://golang.org/cl/258303, which helps with that test case. It would be interesting to know whether it helps with your real code. Thanks.

@gopherbot

Change https://golang.org/cl/258303 mentions this issue: runtime: don't always adjust timers

dobegor (Author) commented Oct 1, 2020

@ianlancetaylor Unfortunately that did not help much. CPU usage is still at 95-100%, but there's no runtime.adjusttimers in pprof top output anymore.
https://gist.github.com/dobegor/e64c5d70a59d6b24b4d70d9394e1de55

It's certainly lower than without the CL, but not nearly as good as on Go 1.13.15.

I'd add that this issue is no longer a problem for us, since we've re-implemented our idling logic for IMAP so that it no longer involves moving connection deadlines to an earlier point. These pprof results are from an older version of our worker application deployed specifically to test the CL. There may still be valid cases for moving deadlines earlier, though, so I'd rather not close this.
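
(One way to avoid moving deadlines earlier is to only ever extend them and enforce any shorter wait at the application level. A minimal sketch of that general idea follows; the package and helper names are hypothetical, and this is not the actual worker code.)

package worker // hypothetical package name

import (
	"net"
	"time"
)

// extendReadDeadline only ever pushes a connection's read deadline later.
// If the requested deadline would land earlier than the current one, the
// existing deadline is kept and the caller enforces the shorter wait
// itself, avoiding the "move the timer earlier" case that is expensive
// on Go 1.14/1.15.
func extendReadDeadline(conn net.Conn, current time.Time, d time.Duration) (time.Time, error) {
	next := time.Now().Add(d)
	if next.Before(current) {
		return current, nil // keep the existing, later deadline
	}
	return next, conn.SetReadDeadline(next)
}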

ALTree added the NeedsInvestigation label on Oct 1, 2020

@ianlancetaylor (Contributor)

@dobegor Thanks for testing it. Unfortunately that profile doesn't tell me anything useful. If you are interested in pursuing this, it would be very helpful to have a sample program that still shows the CPU regression. I do not see the CPU regression in the sample program that you already provided. That leads me to believe that there must be some relevant difference in your real application, but I don't know what it might be.

rs (Contributor) commented Oct 2, 2020

We are experiencing the exact same issue. Upgrading from 1.13 to 1.15 makes our software (an HTTP reverse proxy) take 40% more CPU, with runtime.adjusttimers being the top function.

With your patch, CPU consumption does not change, but the top function becomes runtime.timeSleepUntil instead.

@ianlancetaylor (Contributor)

@rs Thanks for the info. I've updated https://golang.org/cl/258303. Would it be possible for you to give it another try? Thanks.

rs (Contributor) commented Oct 3, 2020

With this patch, the delta is down to a 10% increase instead of 40%. Nothing really stands out in the top anymore. The next "higher than before" functions are runtime.scanobject and runtime.lock2 (see screenshot):

[screenshot: pprof top output]

If I compare flamegraphs, runtime.checkTimers seems to be the one adding 3% on the selected pair (top is control, bottom is canary):

[flamegraph screenshot]

Here is a zoom on runtime.checkTimers:

[flamegraph screenshot]

@ianlancetaylor (Contributor)

Thanks for the detailed examination. I've adjusted checkTimers to take advantage of the new field. It might be worth trying https://golang.org/cl/258303 again, at your convenience.

rs (Contributor) commented Oct 4, 2020

We are down to an average 4% CPU increase, though still up to 10-20% on some canary pairs.

@ianlancetaylor (Contributor)

@rs Thanks. In the absence of a test case I think I'm going to leave it there. We know that the new timer code is significantly better for a good range of real code; it's OK if it's a little worse for some code.

@dobegor I would still very much like to hear whether https://golang.org/cl/258303 helps at all with your real code. Thanks.

rs (Contributor) commented Oct 6, 2020

@ianlancetaylor For us, 4% is still pretty high and costly given our scale. The code we are running is not particularly special; it's basically a simple reverse proxy built on top of net/http and x/net/http2. I'm surprised this issue doesn't seem to hit more people.

What do you suggest to make our code less affected by this issue? We do not update timers, and we don't use many of them apart from a context deadline to expire client requests, which is almost never reached.
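
(For reference, the timer usage described here is essentially the standard per-request deadline pattern; a minimal sketch follows, with an illustrative backend address and timeout rather than rs's actual proxy code.)

package main

import (
	"context"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"time"
)

func main() {
	backend, err := url.Parse("http://127.0.0.1:8081") // illustrative backend
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(backend)

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// One context deadline per client request. It is almost never
		// reached, but it still creates (and later stops) a runtime timer.
		ctx, cancel := context.WithTimeout(r.Context(), 30*time.Second)
		defer cancel()
		proxy.ServeHTTP(w, r.WithContext(ctx))
	})

	log.Fatal(http.ListenAndServe(":8080", handler))
}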

@ianlancetaylor (Contributor)

@rs I agree that 4% is higher than I would like, but there is only so much I can do without a test case. I'm essentially guessing. So what I suggest is: if at all possible, give me a test case that I can run to see the results that you are seeing. Thanks.

dobegor (Author) commented Oct 6, 2020

@ianlancetaylor

whether https://golang.org/cl/258303 helps at all with your real code

Will do that this week and get back to you with the results.

rs (Contributor) commented Oct 6, 2020

@rs I agree that 4% is higher than I would like, but there is only so much I can do without a test case. I'm essentially guessing. So what I suggest is: if at all possible, give me a test case that I can run to see the results that you are seeing. Thanks.

Ok I'll try to come up with a test case.

rs (Contributor) commented Oct 19, 2020

@ianlancetaylor: I can't find a way to isolate the issue. In all the synthetic tests I've done trying to simulate what we do in production, the more timers I add, the better Go 1.15 behaves compared to 1.13 (as you would expect). We don't make unreasonable use of timers anyway; the only timer we have in the critical path is a context deadline for each client request.

Is there any info I could provide that would help you understand what's going on?
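
(A synthetic reproduction along the lines described above, with many goroutines each taking out a rarely-reached per-request context deadline, might look like the sketch below. This is only an illustration of the approach, not the actual test that was run.)

package main

import (
	"context"
	"time"
)

// doWork stands in for handling one request; the deadline is never hit.
func doWork(ctx context.Context) {
	select {
	case <-ctx.Done():
	case <-time.After(time.Millisecond):
	}
}

func main() {
	for i := 0; i < 10000; i++ {
		go func() {
			for {
				// Per-request deadline that is almost never reached,
				// mirroring a reverse proxy's per-request context timeout.
				ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
				doWork(ctx)
				cancel()
			}
		}()
	}
	select {} // block forever
}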

@ianlancetaylor (Contributor)

@rs I suppose updated profiles with the most recent version of the CL might help. Basically: where is the time going? Also, what can you tell me about how your application uses timers?

For that matter, are we sure that timers are the reason for the slowdown from Go 1.13 to Go 1.15? There were, of course, other changes.

@dobegor Have you had any chance to try the later version of https://golang.org/cl/258303? Thanks.

rs (Contributor) commented Oct 20, 2020

@ianlancetaylor here it is

Go 1.15 right:
[profile screenshot]

Go 1.15 down:
[profile screenshot]

Edit: fixed the control (wasn't the right code version)

@ianlancetaylor (Contributor)

Thanks. I don't see anything obviously timer-related there. The only possible timer-related change I see is a possible slowdown in findrunnable, which will likely be addressed by https://golang.org/cl/232298.

rs (Contributor) commented Oct 20, 2020

I just edited the comment; the control wasn't right. Sorry about that.

@ianlancetaylor (Contributor)

I suppose I still don't see anything obviously timer-related.

dobegor (Author) commented Oct 27, 2020

@ianlancetaylor I'm sorry for delaying testing the CL on our use case. I'll post the test results ASAP as I've just got back from sick leave; this COVID thing is no joke.

@ianlancetaylor (Contributor)

Sorry to hear that, feel better soon.

No rush on testing, I went ahead and submitted the change anyhow. It should be an improvement even if it doesn't fix the whole problem.

agirbal commented Jan 28, 2021

Hey @ianlancetaylor, I'm following up on @rs's report, as we are still experiencing this with Go 1.15.7. My initial digging showed a large amount of CPU used by timeSleepUntil (called from sysmon) with Go 1.15, which led me to try delaying sysmon (as reported in #18236).

The graph below shows canary results:

  • baseline is always vanilla Go 1.13
  • beginning of graph until 11:00 has vanilla Go 1.15 as canary. It results in a fairly massive ~+15% increase; our busiest sites see over a 100% increase.
  • 11:00 to 16:00 uses Go 1.15 with sysmon patched to 10ms. We see ~+10% increase, so getting back about 5%.
  • 16:00 onward uses Go 1.13 with sysmon patched to 10ms. At this point we see a gain of 3%.

[graph]

A better view of some of our busiest sites shows the trend, with a large CPU increase from Go 1.15, helped somewhat by the sysmon patch:

[graphs]

From the profiler the main difference we now see is runtime.adjusttimers taking a lot of CPU:
[profile screenshot]

Compared to Go 1.13:
[profile screenshot]

Thank you for any pointers! We're stuck on Go 1.13 :(
PS: we run on FreeBSD.

prattmic (Member) commented Jan 28, 2021

@agirbal could you test with Go 1.16 as well (rc1 should be released today!)? 1.16 contains http://golang.org/cl/258303, which is specifically intended to reduce time in adjusttimers.

Also, could you clarify which case the profile containing adjusttimers is from? Is it 1.15 with the sysmon frequency reduced, or baseline 1.15?

agirbal commented Jan 28, 2021

@prattmic OK, I just tried 1.16rc1 and the issue seems gone! That is amazing news; we can get back on top of our upgrades, thank you.

In graph below:

  • baseline is Go 1.13 vanilla
  • until 11:00 canary is Go 1.13 with sysmon reduction, showing about 4% savings.
  • 11:00 to 14:00 canary is Go 1.16 with sysmon reduction, showing about 4% savings.
  • from 14:00 canary is vanilla Go 1.16. It seems to have a slight regression of less than 1%.

[graph]

I think for our purposes the <1% increase is acceptable, and we won't have much time to dig further. The sysmon part is interesting though; maybe we can discuss it at #18236.

dobegor (Author) commented Jan 29, 2021

I can confirm that using 1.16rc1 with the latest version of CL 258303 did not show any measurable performance hit compared to the Go 1.13.15 baseline on our production workloads.

@prattmic (Member)

Great! I think we are done here, and agreed that the sysmon discussion can continue on #18236.

golang locked and limited conversation to collaborators on Jan 29, 2022