
runtime: Kubernetes performance regression with 1.14 #36752

Closed
ianlancetaylor opened this issue Jan 24, 2020 · 16 comments
Labels
FrozenDueToAge
NeedsInvestigation (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.)
Milestone
Go1.14

Comments

@ianlancetaylor
Contributor

We are seeing a performance regression with 1.14 with a Kubernetes performance test. This is not necessarily a release blocker, but it needs investigation.

I'm adding this issue based on comments on golang-dev (https://groups.google.com/forum/#!topic/golang-dev/05KhuIiyrXs).

I don't really know whether the details are Google internal or not, but I'm going to replicate a few internal e-mails here analyzing the profiles (but the profiles themselves are not public information).

ianlancetaylor added the NeedsInvestigation label Jan 24, 2020
ianlancetaylor added this to the Go1.14 milestone Jan 24, 2020
@ianlancetaylor
Contributor Author

@mknyszek writes:

Since the profiles are taken at regular intervals and the test runs for about the same length of time, it turns out that there are exactly 21 profiles for each run. My methodology thus far has been to try and compare the application in the same phase by comparing the first profile of each, then the second profile of each, etc.

I wrote up some preliminary notes from the profile diffs. I've bolded what I consider to be the most interesting in terms of where the regression could be coming from.

  • Less time is spent on math/big operations overall.
  • Consistently and significantly more time spent hitting epollwait (netpoller).
    • Since the timer system and the netpoller are now integrated, perhaps we're seeing lots of spurious wake-ups?
  • Consistently and significantly more time spent hitting runqgrab (goroutine stealing).
    • Digging into this, the time spent is being attributed to two things:
      • The LoadAcq on the runq's tail.
      • The usleep we do if we're trying to steal from the runnext slot and the P we're stealing from is running.
  • checkTimers consistently shows up as a non-trivial additional source of time.
    • Most of the time spent is not unexpected: checkTimers -> runtimer -> runOneTimer -> dodeltimer
    • Same for shouldStealTimers.
  • Sometimes more time spent in runtime.tgkill.
    • I'm mostly interested in this because I have no idea what it means.
  • Consistently more time spent in runtime.madvise (though usually < 0.3s increase for a 30 second profile).
  • Time spent in marking is inconsistent (sometimes better, sometimes worse).
  • Consistently less time spent in the page allocator.
  • Consistently more time spent in runtime._VDSO
    • I don't really understand the call stack there; it's just runtime._System -> runtime._VDSO. Is it related to nanotime at all?
  • Consistently less time spent in deferreturn.

So what I'm getting from this is that in Go 1.14 we're spending more time contending on goroutine stealing.

Here's a (sort of half-baked) theory: because the timer system and the netpoller are now integrated, we experience more wake-ups when there isn't much work to do, and we contend a lot in work-stealing (specifically in runnext stealing, which suggests most of the other Ps don't have a lot of work). All of this hurts our ability to promptly run goroutines that are, e.g., coming out of I/O, ultimately hurting latency.

WDYT? @ianlancetaylor @aclements

@ianlancetaylor
Contributor Author

@ianlancetaylor writes:

runtime.tgkill is used by the async preemption code to send a signal to another M.

checkTimers should to some extent be replacing Go 1.13 timerproc.

VDSO is calls to nanotime or walltime. Probably nanotime. I put some effort in the new timer code into not adding additional nanotime calls. Maybe I messed that up somehow.

When you say "same for shouldStealTimers" are you saying that that function is expensive? It's a fairly simple function that doesn't even do any atomic operations. I'm not sure why that is showing up at all.

Your theory does sound somewhat plausible to me. I don't know that we experience more wake-ups as such. But for each timer-related wakeup, we do do more work. In 1.13 we would just handle timers on a timer wakeup. On tip, on a timer wakeup we look for a goroutine to run.

But I don't yet see why that would affect responsiveness to actual network I/O. As far as I can see that should be more or less the same.

@ianlancetaylor
Contributor Author

@mknyszek writes:

> runtime.tgkill is used by the async preemption code to send a signal to another M.

Got it. Thanks!

> checkTimers should to some extent be replacing Go 1.13 timerproc.

Ah, good catch. It's always hard reading the profiles when things change this way. It looks like checkTimers takes a lot less cumulative time than runtime.timerproc. Even if we take cumulative time in checkTimers + shouldStealTimers, it's still less overall than time spent in runtime.timerproc before.

> VDSO is calls to nanotime or walltime. Probably nanotime. I put some effort in the new timer code into not adding additional nanotime calls. Maybe I messed that up somehow.

I'll note that the increase is relatively small compared to some of the others.

> When you say "same for shouldStealTimers" are you saying that that function is expensive? It's a fairly simple function that doesn't even do any atomic operations. I'm not sure why that is showing up at all.

Yeah, for whatever reason. Here's an example from one diff:

     1.63s  1.16%  2.18%      1.63s  1.16%  runtime.shouldStealTimers (inline)

And here's the pprof list output for runtime.shouldStealTimers:

ROUTINE ======================== runtime.shouldStealTimers in /usr/local/go/src/runtime/proc.go
     1.63s      1.63s (flat, cum)  1.16% of Total
         .          .   2673:// shouldStealTimers reports whether we should try stealing the timers from p2.
         .          .   2674:// We don't steal timers from a running P that is not marked for preemption,
         .          .   2675:// on the assumption that it will run its own timers. This reduces
         .          .   2676:// contention on the timers lock.
         .          .   2677:func shouldStealTimers(p2 *p) bool {
     280ms      280ms   2678:   if p2.status != _Prunning {
         .          .   2679:           return true
         .          .   2680:   }
      20ms       20ms   2681:   mp := p2.m.ptr()
     750ms      750ms   2682:   if mp == nil || mp.locks > 0 {
         .          .   2683:           return false
         .          .   2684:   }
         .          .   2685:   gp := mp.curg
     580ms      580ms   2686:   if gp == nil || gp.atomicstatus != _Grunning || !gp.preempt {
         .          .   2687:           return false
         .          .   2688:   }
         .          .   2689:   return true
         .          .   2690:}
         .          .   2691:

> Your theory does sound somewhat plausible to me. I don't know that we experience more wake-ups as such. But for each timer-related wakeup, we do do more work. In 1.13 we would just handle timers on a timer wakeup. On tip, on a timer wakeup we look for a goroutine to run.

Given my responses above, the only things that still look fishy to me are epollwait and runqgrab, which perhaps still supports the "spurious wake-ups" theory.

Given tgkill is showing up sometimes, maybe it would be worth trying to run without asynchronous preemption again, now that we fixed the timer-related latency regression? Perhaps one was hiding the other, but if async preemption alone was causing that much of a tail latency regression then we should have seen some improvement when we turned it off. But who knows, the runtime is known to have performance non-linearities. I don't think it's the first thing we should try, since it's unlikely this is the cause at this point.
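
For what it's worth, asynchronous preemption can be disabled at run time via GODEBUG, so that experiment wouldn't need a rebuild (the binary name below is just a placeholder):

    GODEBUG=asyncpreemptoff=1 ./test-binary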

> But I don't yet see why that would affect responsiveness to actual network I/O. As far as I can see that should be more or less the same.

Yeah I don't know, that part was a bit of a stretch. My thought was that somehow goroutines ended up bouncing around (stolen and restolen) between Ps after they wake up.

@ianlancetaylor
Contributor Author

@aclements writes:

Is it possible to get periodic 30 second (or shorter, even) execution traces, similar to the periodic CPU profiles?

@ianlancetaylor
Contributor Author

@ianlancetaylor writes:

Interesting. Looks like even a non-atomic read of mp.locks is on the expensive side, presumably due to memory cache effects on the contended memory location. shouldStealTimers is a lot less important than (I think) it used to be thanks to https://golang.org/cl/214185. Maybe we should just check p2.status and nothing else.
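
For illustration, a minimal sketch of that simplification, reusing the runtime-internal types from the listing above (this is not a committed change):

    // Steal timers only from Ps that aren't running; skip the mp and gp
    // loads that show up hot in the profile.
    func shouldStealTimers(p2 *p) bool {
        return p2.status != _Prunning
    }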

I have no evidence, but here is a thought. When a call to the netpoller reports that multiple goroutines are ready, the calling thread will pick the first one and start running it. The remainder of the goroutines will be put onto the global queue. This is the code after the only call to netpoll with a non-zero argument, near the end of findrunnable.

By comparison, when the timer code finds that a goroutine is ready, from time.Sleep or time.After, that goroutine will be woken up and put on the local queue. The scheduler will always pull goroutines from the local queue if there are any, and only turn to the global queue if the local queue is empty (actually not quite always; the schedule function will occasionally check the global queue even if there are entries on the local queue, based on schedtick).

This means that if the program in steady state has both timers expiring and network I/O, the expiring timers will consistently be preferred. It seems conceivable that this will tend to slightly increase network latency.
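
To make that ordering concrete, here's a small self-contained sketch (a toy model, not the runtime's actual code; the every-61st-tick global-queue check mirrors what schedule does):

    package main

    import "fmt"

    // g stands in for a runnable goroutine.
    type g struct{ name string }

    // sched is a toy model of one P's scheduling decision: timer-readied
    // goroutines land on the local queue, netpoll-readied goroutines went
    // onto the global queue.
    type sched struct {
        schedtick uint64
        localq    []*g // per-P local run queue
        globalq   []*g // global run queue
    }

    // next mirrors the ordering described above: the global queue is only
    // polled occasionally or when the local queue is empty, so local-queue
    // (timer) work is consistently preferred over global-queue (netpoll) work.
    func (s *sched) next() *g {
        s.schedtick++
        if s.schedtick%61 == 0 && len(s.globalq) > 0 {
            return s.pop(&s.globalq)
        }
        if len(s.localq) > 0 {
            return s.pop(&s.localq)
        }
        if len(s.globalq) > 0 {
            return s.pop(&s.globalq)
        }
        return nil
    }

    func (s *sched) pop(q *[]*g) *g {
        gp := (*q)[0]
        *q = (*q)[1:]
        return gp
    }

    func main() {
        s := &sched{
            localq:  []*g{{"timer-1"}, {"timer-2"}},
            globalq: []*g{{"netpoll-1"}},
        }
        for gp := s.next(); gp != nil; gp = s.next() {
            fmt.Println("run", gp.name) // timers drain before the netpoll goroutine runs
        }
    }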

In Go 1.13, when a timer expired, the timerproc goroutine would wake up, call exitsyscall to acquire a P, and only then start putting goroutines on the local run queue. If there were no P available, the timerproc goroutine would go onto the global queue. So in Go 1.13 timers and network I/O competed more fairly for resources.

But this thought only holds up for programs that do have a steady stream of timers expiring. I don't know if that applies to this test case.

@ianlancetaylor
Contributor Author

@mknyszek writes:

Looking at the memory profiles for both Go 1.13 and 1.14, and particularly at the alloc_space profile at the very end of the test run, about 41 GiB worth of timers from time.NewTimer are generated over the whole 12 hours. Assuming the time.Timer struct is about 80 bytes in size, that's about 560 million timers created over the lifetime of the test, or about 13,000 timers per second. At any given time there are about 50 MiB of timers alive with respect to GC, which is about 700k timers. It's unclear how many are deleted timers, but most of them likely aren't, given that we didn't see a big RSS increase with Kubernetes like we did elsewhere (and we didn't see a corresponding drop with your patch).
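
As a quick sanity check of that arithmetic (the 41 GiB, 50 MiB, and 12-hour figures come from the profiles; the 80-byte timer size is the assumption above):

    package main

    import "fmt"

    func main() {
        const (
            totalTimerBytes = 41 << 30     // ~41 GiB allocated via time.NewTimer over the run
            liveTimerBytes  = 50 << 20     // ~50 MiB of timers live at any given time
            timerSize       = 80           // assumed approximate size of one timer allocation
            runSeconds      = 12 * 60 * 60 // 12-hour test run
        )
        created := totalTimerBytes / timerSize
        fmt.Printf("timers created over the run: ~%dM\n", created/1000000)               // ~550M
        fmt.Printf("timers created per second:   ~%d\n", created/runSeconds)             // ~13,000
        fmt.Printf("timers live at any one time: ~%dk\n", liveTimerBytes/timerSize/1000) // ~655k
    }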

All that to say that I do think that your theory may apply to this test case.

@ianlancetaylor
Contributor Author

@aclements writes:

It would be pretty easy to try putting the network-readied goroutines on the local runqueue. This sounds like a good idea in general, since the global run queue is really pretty low priority. (We could alternatively make the timer code add it to the global queue, which would be a little closer to 1.13 behavior, but is somewhat trickier and seems strictly worse.)

@ianlancetaylor
Contributor Author

@ianlancetaylor writes:

I tried writing that up and came up with https://golang.org/cl/216198.

@ianlancetaylor
Contributor Author

@aclements writes (to the person running these tests):

I think that CL [216198] is our best hypothesis right now. I just took a look over the code and correctness-wise it looks good to me.

Another hunch, though, is that it may be worth trying https://golang.org/cl/215157. I was thinking of that as a fairly minor fix, so we were going to wait, but now it's occurring to me that in Go 1.14, ReadMemStats can have a large, global effect on latency (in Go 1.13, it could have a large effect on the latency of a goroutine calling it, but very little global effect). That CL fixes that issue. I'd be fine with just landing that change in tip regardless, since it's actually quite small.

I know these runs are costly and annoying on your end. Maybe we do a run with both CLs first. If that moves the needle, we just go ahead and land 215157, and do another run with 215157 but without 216198 to see if 216198 was important. What do you think?

@ianlancetaylor
Contributor Author

(That brings this issue up to date with internal communication.)

@gopherbot

Change https://golang.org/cl/216198 mentions this issue: runtime: add goroutines returned by poller to local run queue

@aclements
Member

@mknyszek writes (to the person running these tests):

Austin mentioned trying https://golang.org/cl/215157, but instead of that CL I just reverted the CL that introduced the bug it was fixing. We think now that it would be better to try the tip of Go's master branch with https://go-review.googlesource.com/c/go/+/216198/. If that goes well, then we'd like you to also try just Go tip to see if Ian's change is the fix.

@ianlancetaylor
Contributor Author

Update: Kubernetes testing with tip (including https://golang.org/cl/216358) plus https://golang.org/cl/216198 showed performance similar to, perhaps slightly better than, 1.13. The Kubernetes team is now doing further testing with just tip, to see whether 216358 fixed the problem, or whether 216198 is still needed.

@ianlancetaylor
Contributor Author

Update: Kubernetes testing with just tip also showed similar performance. So it looks like the real problem was https://golang.org/cl/182657, already reverted by https://golang.org/cl/216358.

@mknyszek
Contributor

@ianlancetaylor I think we can probably close this now? WDYT?

@ianlancetaylor
Contributor Author

Sure, I didn't want to close it without talking to you, but closing it now. Thanks.
