
runtime: frequent blocking syscalls cause high sysmon overhead #18236

Open
philhofer opened this issue Dec 7, 2016 · 20 comments
Labels
compiler/runtime, NeedsInvestigation, Performance

Comments

@philhofer
Contributor

The hard-coded 20us sysmon tick in the runtime appears to be far too aggressive for minority architectures.

Our storage controllers run a single-core 800MHz ARM CPU, which runs single-threaded workloads between 10 and 20 times slower than my desktop Intel machine.

Most of our product-level benchmarks improved substantially (5-10%) when I manually adjusted the minimum tick to 2ms on our arm devices. (Our use-case benefits from particularly long syscall ticks, since we tend to make read and write syscalls with large buffers, and those can take a long time even if they never actually block in I/O. Also, as you might imagine, the relative cost of thread creation and destruction on this system is quite high.)

Perhaps the runtime should calibrate the sysmon tick based on the performance of a handful of trivial syscalls (getpid(), etc). Similarly, the tick could be adjusted upwards if an m wakes up within a millisecond (or so) of being kicked off its p. Perhaps someone with a better intuition for the runtime can suggest something heuristically less messy.
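(A minimal user-space sketch of the calibration idea, for illustration only; the real change would live inside the runtime, and the batch size and 20x multiplier below are arbitrary placeholders:)

package main

import (
	"fmt"
	"syscall"
	"time"
)

// calibrateTick times a batch of trivial syscalls and scales the
// measured per-call cost into a sysmon-style tick. Both the batch
// size and the multiplier are illustrative, not tuned values.
func calibrateTick() time.Duration {
	const n = 1000
	start := time.Now()
	for i := 0; i < n; i++ {
		syscall.Getpid()
	}
	perCall := time.Since(start) / n
	tick := 20 * perCall
	if tick < 20*time.Microsecond {
		tick = 20 * time.Microsecond // never go below the current hard-coded floor
	}
	return tick
}

func main() {
	fmt.Println("calibrated tick:", calibrateTick())
}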

Thanks,
Phil

@bradfitz
Contributor

bradfitz commented Dec 7, 2016

/cc @minux @cherrymui @davecheney

@davecheney
Contributor

davecheney commented Dec 7, 2016 via email

@philhofer
Contributor Author

It's a Marvell Armada 370, which is ARMv7.

@davecheney
Contributor

davecheney commented Dec 7, 2016 via email

@philhofer
Contributor Author

Yes, please drop ARMv5. I was secretly elated when I saw that was finally happening. There are too many wildly divergent implementations of ARM in the wild.

@minux
Member

minux commented Dec 7, 2016 via email

@aclements
Member

In general, sysmon should back off to a 10ms delay unless it's frequently finding goroutines that are blocked in syscalls for more than 10ms or there are frequent periods where all threads are idle (e.g., waiting on the network or blocked in syscalls). Is that the case in your application?

Perhaps the runtime should calibrate the sysmon tick based on the performance of a handful of trivial syscalls (getpid(), etc). Similarly, the tick could be adjusted upwards if an m wakes up within a millisecond (or so) of being kicked off its p.

Perhaps I don't understand your application, but I don't see how these would help. There are two delays involved here: one is the minimum sysmon delay, which is 20us, and the other is the retake delay, which is 10ms. I think both of your suggestions aim at adjusting the retake delay, but it sounds like that's not the problem here.

I think the best way to handle the minimum delay is to eliminate it by making sysmon event-driven. The minimum delay is there because sysmon is sampling what is ultimately a stream of instantaneous events. I believe it could instead compute when the next interesting event will occur and just sleep for that long. This is slightly complicated by how we figure out how long a goroutine has been executing or in a syscall, but in this model the minimum delay controls the variance we're willing to accept on that. In practice, that variance can be up to 10ms right now, which would suggest a minimum tick of 10ms if sysmon were event-driven.
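(As a rough illustration of that event-driven shape, not actual runtime code: a monitor that computes the next deadline of interest and sleeps until then, instead of polling on a fixed tick. A real version would also need a way to be woken early when a new, sooner event is registered:)

package main

import (
	"fmt"
	"time"
)

// event pairs a deadline with the action to take when it expires.
type event struct {
	deadline time.Time
	fire     func()
}

// monitor sleeps until the earliest pending deadline rather than
// waking on a fixed 20us tick; next is assumed to return the
// soonest pending event, or false when there is nothing to watch.
func monitor(next func() (event, bool)) {
	for {
		ev, ok := next()
		if !ok {
			return
		}
		if d := time.Until(ev.deadline); d > 0 {
			time.Sleep(d)
		}
		ev.fire()
	}
}

func main() {
	deadlines := []time.Duration{5 * time.Millisecond, 20 * time.Millisecond}
	i := 0
	monitor(func() (event, bool) {
		if i == len(deadlines) {
			return event{}, false
		}
		d := deadlines[i]
		i++
		return event{
			deadline: time.Now().Add(d),
			fire:     func() { fmt.Println("retake check after", d) },
		}, true
	})
}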

@aclements aclements added this to the Go1.9Early milestone Dec 7, 2016
@aclements aclements added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Dec 7, 2016
@philhofer
Contributor Author

@aclements Thanks for the detailed reply.

We do, in fact, have all threads blocked in syscalls quite frequently. There's only one P, and the only meaningful thing this hardware does is move data from the network to the disk and vice versa. (We spend a good chunk of time in startm and stopm.) Many of these syscalls don't truly "block," because we do a relatively good job of issuing the right prefetch hints to hit the kernel block cache, but inevitably blocking requests start to pile up.

Sorta related: #6817, #7903 (we already have a semaphore around blocking syscalls)

Probably also worth pointing out that this hardware definitely can't hit a 1ms timer deadline reliably under Linux, and struggles to meet a 10ms deadline with less than 2ms of overrun.

@philhofer
Contributor Author

In case it helps, here's the hack that bought us a perf boost:

diff --git a/src/runtime/proc.go b/src/runtime/proc.go
index cad1b1c..5fc3dcd 100644
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -3753,7 +3753,11 @@ func sysmon() {
        delay := uint32(0)
        for {
                if idle == 0 { // start with 20us sleep...
-                       delay = 20
+                       if GOARCH == "arm" {
+                               delay = 2 * 1000
+                       } else {
+                               delay = 20
+                       }
                } else if idle > 50 { // start doubling the sleep after 1ms...
                        delay *= 2
                }

@aclements aclements changed the title runtime: sysmon tick is hardcoded runtime: frequent blocking syscalls cause high sysmon overhead Jan 8, 2017
@bradfitz bradfitz modified the milestones: Go1.9Maybe, Go1.9Early May 3, 2017
@aclements aclements modified the milestones: Go1.10, Go1.9Maybe Jul 18, 2017
@rsc rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017
@bradfitz
Contributor

Sorry, bumping this again. (But I assume Igneous has been running their own builds for the past two years anyway)

@bradfitz bradfitz modified the milestones: Go1.11, Go1.12 Jun 19, 2018
@jeffdh

jeffdh commented Jun 20, 2018

Thanks for the bump @bradfitz. Rather than the patch listed above, we've opted to put a semaphore in front of disk I/O that keeps things in optimal syscall/thread territory, which avoids this issue.
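(A minimal sketch of that pattern, assuming a buffered-channel semaphore; the limit of 4 and the file are placeholders, not Igneous's actual values:)

package main

import (
	"fmt"
	"os"
)

// diskSem bounds how many goroutines may sit in a blocking disk
// syscall at once. The limit of 4 is a placeholder; tune it for the
// device and core count.
var diskSem = make(chan struct{}, 4)

// boundedReadAt acquires the semaphore before issuing the (possibly
// blocking) pread, so at most cap(diskSem) threads block in the kernel.
func boundedReadAt(f *os.File, buf []byte, off int64) (int, error) {
	diskSem <- struct{}{}
	defer func() { <-diskSem }()
	return f.ReadAt(buf, off)
}

func main() {
	f, err := os.Open("/dev/zero") // stand-in for a real data file
	if err != nil {
		panic(err)
	}
	defer f.Close()
	buf := make([]byte, 4096)
	n, err := boundedReadAt(f, buf, 0)
	fmt.Println(n, err)
}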

@aclements
Member

Is this still an issue or should we go ahead and close this?

@aclements aclements removed this from the Go1.13 milestone Jun 25, 2019
@aclements aclements added this to the Unplanned milestone Jun 25, 2019
@philhofer
Contributor Author

philhofer commented Jun 25, 2019 via email

@agirbal

agirbal commented Jan 28, 2021

While troubleshooting a CPU increase in our HTTP server since Go 1.14 (#41699), I noticed that runtime.sysmon / runtime.timeSleepUntil was showing up prominently in our profiling (6-12% of CPU). I have tried a patch that sets the sysmon delay to 10ms on both Go 1.15 and Go 1.13.
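(The patch itself isn't quoted in the thread; a hypothetical one-line version, following the shape of the earlier diff above, might look like:)

--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ func sysmon() {
                if idle == 0 { // start with 20us sleep...
-                       delay = 20
+                       delay = 10 * 1000 // 10ms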

The graph below shows canary results:

  • baseline is always vanilla Go 1.13
  • beginning of graph until 11:00 has vanilla Go 1.15 as canary. It results in a fairly massive ~+15% increase.
  • 11:00 to 16:00 uses Go 1.15 with sysmon patched to 10ms. We see a ~+10% increase, recovering about 5%.
  • 16:00 onward uses Go 1.13 with sysmon patched to 10ms. At this point we see a gain of 3%.

[graph: canary CPU usage relative to the Go 1.13 baseline across the three configurations]

This seems to indicate that sysmon can significantly increase the overhead, which grows as the server becomes busier.
Here is what it looks like in our profile graph:
[profile graph: runtime.sysmon / runtime.timeSleepUntil prominent in the CPU profile]

PS: we run on FreeBSD.

@ianlancetaylor
Contributor

@agirbal If your program uses many timers and often resets those timers to fire earlier, then the code in the future 1.16 release will be much more efficient than the code in 1.15. Is it possible for you to test your program with the 1.16rc1 release that just came out? Thanks.

@agirbal

agirbal commented Jan 29, 2021

@ianlancetaylor ok, I just tried 1.16rc1 and the issue seems gone! That is amazing news; we can get back on top of our upgrades, thank you. I have commented further in #41699.

To come back to the sysmon issue, in the graph below:

  • baseline is Go 1.13 vanilla
  • until 11:00 the canary is Go 1.13 with the sysmon reduction, showing about 4% savings.
  • 11:00 to 14:00 the canary is Go 1.16 with the sysmon reduction, showing about 4% savings.
  • from 14:00 the canary is vanilla Go 1.16. It shows a slight regression of less than 1%.

[graph: canary CPU usage relative to the Go 1.13 baseline across the three configurations]

The 4-5% savings is fairly significant for us. Is it acceptable to run with a constant sysmon delay of 10ms, or are there obvious downsides? Would you consider reducing the sysmon overhead upstream? Thanks!

@prattmic
Member

I think this is worth considering. Simply changing the delay to 10ms without any other changes may be too aggressive, but perhaps we can do some work to make sysmon less active.

One of the major reasons not to extend the delay previously was that sysmon helped with timer expirations. That is gone now in 1.16.

The biggest remaining reason is retaking Ps from blocking syscalls (which occurs after 20us). If we increase the delay to 10ms, then retake will almost never occur.

Could you share profiles of what your application looks like now on 1.16, with and without the sysmon reduction?

@ianlancetaylor
Contributor

It occurs to me that we could handle blocking syscalls by having entersyscall start a timer. If the timer expires, it hands off the P. exitsyscall cancels the timer. The timeout would be fairly short, so the main cost would be storing the timer somewhere. I don't know if we want to make the G struct that much bigger. Although it's already pretty big.

We would still need sysmon to preempt long running G's, though, so I don't know if it's worth it.
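(A user-space analogue of that idea, for illustration only; the real change would live in entersyscall/exitsyscall, with the hand-off being the runtime's P release:)

package main

import (
	"fmt"
	"time"
)

// withSyscallWatchdog arms a timer before a potentially blocking call
// and cancels it on return, mirroring entersyscall/exitsyscall. The
// handoff runs only if the call outlives the timeout.
func withSyscallWatchdog(timeout time.Duration, handoff func(), call func()) {
	t := time.AfterFunc(timeout, handoff)
	defer t.Stop() // exitsyscall analogue: cancel if we returned in time
	call()
}

func main() {
	withSyscallWatchdog(
		20*time.Microsecond,
		func() { fmt.Println("timer fired: hand off the P") },
		func() { time.Sleep(time.Millisecond) }, // simulated blocking syscall
	)
	time.Sleep(10 * time.Millisecond) // give the watchdog goroutine time to print
}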

@prattmic
Member

prattmic commented Jan 29, 2021

It occurs to me that we could handle blocking syscalls by having entersyscall start a timer. If the timer expires, it hands off the P. exitsyscall cancels the timer. The timeout would be fairly short, so the main cost would be storing the timer somewhere.

It seems to me like the main cost would be arming the timer. Since a 20us timer is almost certainly the next timer to fire, I think that basically every entersyscall would have to wakeNetPoller, which would get very expensive fast. Perhaps we could get away with not waking netpoll and just letting the wakeup slip to the next M that notices it? Or more generally, perhaps exitsyscall0 could itself attempt a retake by finding a P with an expired timer.

Edit for the last part: I guess it would need to be any pidleget trying to retake, not just exitsyscall0.

@ianlancetaylor
Contributor

True, waking the poller doesn't make sense. Having pidleget pick up P's waiting in syscall makes sense. But it seems that we would still need sysmon.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 7, 2022