runtime: Kubernetes performance regression with 1.14 #36752
We are seeing a performance regression in Go 1.14 in a Kubernetes performance test. This is not necessarily a release blocker, but it needs investigation.
I'm adding this issue based on comments on golang-dev (https://groups.google.com/forum/#!topic/golang-dev/05KhuIiyrXs).
I don't really know whether the details are Google internal or not, but I'm going to replicate a few internal e-mails here analyzing the profiles (the profiles themselves are not public information).
@mknyszek writes: Since the profiles are taken at regular intervals and the test runs for about the same length of time, it turns out that there are exactly 21 profiles for each run. My methodology thus far has been to try and compare the application in the same phase by comparing the first profile of each, then the second profile of each, etc. I wrote up some preliminary notes from the profile diffs. I've bolded what I consider to be the most interesting in terms of where the regression could be coming from.
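For readers who want to reproduce this kind of comparison, the sketch below shows one way to capture CPU profiles at a fixed cadence from inside a Go program, so that two runs yield aligned sequences of profiles whose N-th entries can be diffed (for example with pprof's -diff_base option). The file names, interval, and count are assumptions for illustration, not the Kubernetes team's actual tooling.

```go
// A minimal sketch of periodic CPU profiling, assuming the program under test
// can be instrumented directly. File names, interval, and count are placeholders.
package main

import (
	"fmt"
	"log"
	"os"
	"runtime/pprof"
	"time"
)

func main() {
	const (
		interval = 30 * time.Second // length of each profiling window (placeholder)
		count    = 21               // matches the 21 profiles per run mentioned above
	)
	for i := 0; i < count; i++ {
		f, err := os.Create(fmt.Sprintf("cpu-%02d.prof", i))
		if err != nil {
			log.Fatal(err)
		}
		if err := pprof.StartCPUProfile(f); err != nil {
			log.Fatal(err)
		}
		time.Sleep(interval) // the real workload would be running in other goroutines
		pprof.StopCPUProfile()
		f.Close()
	}
}
```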
So what I'm getting from this is that in Go 1.14 we're spending more time contending on goroutine stealing. Here's a (sort of half-baked) theory: because the timer system and netpoller are now integrated, we experience more wake-ups when there isn't much work to do, and contend a lot in work stealing (specifically in runNext stealing, meaning most of the other Ps don't have a lot of work). All of this hurts responsiveness in promptly running goroutines that are, e.g., coming out of I/O, ultimately hurting latency. WDYT? @ianlancetaylor @aclements
@ianlancetaylor writes:
Your theory does sound somewhat plausible to me. I don't know that we experience more wake-ups as such, but for each timer-related wakeup we do do more work. In 1.13 we would just handle timers on a timer wakeup. On tip, on a timer wakeup, we look for a goroutine to run. But I don't yet see why that would affect responsiveness to actual network I/O. As far as I can see that should be more or less the same.
@mknyszek writes (replying point by point):
Got it. Thanks!
Ah, good catch. It's always hard reading the profiles when things change this way.
I'll note that the increase is relatively small compared to some of the others.
Yeah, for whatever reason; there's an example of this in one of the diffs.
Given my responses above, that's the only fishy thing left to me.
Yeah, I don't know; that part was a bit of a stretch. My thought was that somehow goroutines ended up bouncing around (stolen and re-stolen) between Ps after they wake up.
@aclements writes: Is it possible to get periodic 30-second (or even shorter) execution traces, similar to the periodic CPU profiles?
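For reference, here is a minimal sketch of what such periodic trace collection could look like in-process, using the standard runtime/trace package. The 30-second window comes from the comment above; the file names and cadence are assumptions, not what the Kubernetes tests used.

```go
// A minimal sketch of capturing repeated 30-second execution traces,
// analogous to the periodic CPU profiles described earlier.
package main

import (
	"fmt"
	"os"
	"runtime/trace"
	"time"
)

// captureTrace records a single execution trace of the given length to path.
func captureTrace(path string, window time.Duration) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	if err := trace.Start(f); err != nil {
		return err
	}
	defer trace.Stop()
	time.Sleep(window) // the workload keeps running in other goroutines
	return nil
}

func main() {
	// One 30-second trace every 5 minutes; each trace-NN.out can be viewed
	// with `go tool trace`. The cadence here is an assumption.
	for i := 0; ; i++ {
		if err := captureTrace(fmt.Sprintf("trace-%02d.out", i), 30*time.Second); err != nil {
			fmt.Fprintln(os.Stderr, err)
		}
		time.Sleep(5 * time.Minute)
	}
}
```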
@ianlancetaylor writes: Interesting. Looks like even a non-atomic read of mp.locks is on the expensive side, presumably due to memory cache effects on the contended memory location. I have no evidence, but here is a thought. When a call to the netpoller reports that multiple goroutines are ready, the calling thread will pick the first one and start running it. The remainder of the goroutines will be put onto the global queue, which is only checked occasionally. This means that if the program in steady state has both timers expiring and network I/O, the expiring timers will consistently be preferred. It seems conceivable that this will tend to slightly increase network latency. But this thought only holds up for programs that do have a steady stream of timers expiring. I don't know if that applies to this test case.
@mknyszek writes: Looking at the memory profiles for both Go 1.13 and 1.14, and particularly at the alloc_space profile at the very end of the test run, about 41 GiB worth of timer allocations show up. All that to say that I do think your theory may apply to this test case.
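The thread doesn't show where those timers come from, but to make "a steady stream of timers expiring alongside network I/O" concrete, here is a hypothetical workload sketch (not taken from Kubernetes): every network request carries its own timeout, so every request allocates a runtime timer that later fires or is stopped, which is exactly the mix of timer work and netpoll work discussed above. The endpoint URL, timeout, and request rate are invented.

```go
// A hypothetical sketch of a workload that mixes network I/O with a steady
// stream of runtime timers: each request gets its own deadline, so each
// request allocates a timer that either expires or is stopped.
package main

import (
	"context"
	"fmt"
	"net/http"
	"os"
	"time"
)

func fetchWithTimeout(ctx context.Context, client *http.Client, url string) error {
	// Each call creates a runtime timer for the deadline; at many requests per
	// second this yields a constant stream of timers alongside the network
	// readiness events from the requests themselves.
	ctx, cancel := context.WithTimeout(ctx, 250*time.Millisecond)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return nil
}

func main() {
	// Placeholder endpoint; in the real test this would be API-server traffic.
	const url = "http://localhost:8080/healthz"
	client := &http.Client{}
	for {
		if err := fetchWithTimeout(context.Background(), client, url); err != nil {
			fmt.Fprintln(os.Stderr, err)
		}
		time.Sleep(10 * time.Millisecond)
	}
}
```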
@aclements writes: It would be pretty easy to try putting the network-readied goroutines on the local runqueue. This sounds like a good idea in general, since the global run queue is really pretty low priority. (We could alternatively make the timer code add it to the global queue, which would be a little closer to 1.13 behavior, but is somewhat trickier and seems strictly worse.)
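Neither the actual runtime code nor CL 216198 is reproduced here; the toy model below (all types and names invented for illustration) just shows the difference being proposed: goroutines readied by a netpoll result go onto the current P's local run queue, which is serviced promptly, instead of onto the global run queue, which is only polled occasionally.

```go
// An invented toy model for illustration only; this is not the Go runtime's
// scheduler code and not CL 216198.
package main

import "fmt"

type goroutine struct{ id int }

// p models a processor with its local run queue, which the scheduler drains
// before it resorts to stealing or to the global queue.
type p struct {
	local []goroutine
}

// sched models the global run queue, which is only polled occasionally.
type sched struct {
	global []goroutine
}

// injectOld mimics the Go 1.14 behavior under discussion: run the first readied
// goroutine and push the rest onto the global queue, behind any expiring timers
// the P prefers to handle first.
func injectOld(s *sched, ready []goroutine) (runNow *goroutine) {
	if len(ready) == 0 {
		return nil
	}
	s.global = append(s.global, ready[1:]...)
	return &ready[0]
}

// injectLocal mimics the proposed change: keep the readied goroutines on the
// local run queue of the P that found them, so they run with low latency.
func injectLocal(pp *p, ready []goroutine) (runNow *goroutine) {
	if len(ready) == 0 {
		return nil
	}
	pp.local = append(pp.local, ready[1:]...)
	return &ready[0]
}

func main() {
	ready := []goroutine{{1}, {2}, {3}}

	var s sched
	fmt.Println("old:", *injectOld(&s, ready), "global queue:", s.global)

	var pp p
	fmt.Println("new:", *injectLocal(&pp, ready), "local queue:", pp.local)
}
```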
@ianlancetaylor writes: I tried writing that up and came up with https://golang.org/cl/216198.
@aclements writes (to the person running these tests): I think that CL [216198] is our best hypothesis right now. I just took a look over the code and correctness-wise it looks good to me. Another hunch, though, is that it may be worth trying https://golang.org/cl/215157. I was thinking of that as a fairly minor fix, so we were going to wait, but now it's occurring to me that it may matter more in Go 1.14. I know these runs are costly and annoying on your end. Maybe we do a run with both CLs first. If that moves the needle, we just go ahead and land 215157, and do another run with 215157 but without 216198 to see if 216198 was important. What do you think?
(That brings this issue up to date with internal communication.)
Change https://golang.org/cl/216198 mentions this issue:
@mknyszek writes (to the person running these tests): Austin mentioned trying https://golang.org/cl/215157, but instead of that CL I just reverted the CL that introduced the bug it was fixing. We think now that it would be better to try the tip of Go's master branch with https://go-review.googlesource.com/c/go/+/216198/. If that goes well, then we'd like you to also try just Go tip to see if Ian's change is the fix.
Update: Kubernetes testing with tip (including https://golang.org/cl/216358) plus https://golang.org/cl/216198 showed performance similar to, perhaps slightly better than, 1.13. The Kubernetes team is now doing further testing with just tip, to see whether 216358 fixed the problem, or whether 216198 is still needed.
Update: Kubernetes testing with just tip also showed similar performance. So it looks like the real problem was https://golang.org/cl/182657, already reverted by https://golang.org/cl/216358.
@ianlancetaylor I think we can probably close this now? WDYT?
Sure, I didn't want to close it without talking to you, but closing it now. Thanks.