Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: time.Sleep takes more time than expected on Windows (1ms -> 10ms) #44343

Closed
egonelbre opened this issue Feb 17, 2021 · 131 comments
Closed
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done. OS-Linux OS-Windows
Milestone

Comments

@egonelbre
Copy link
Contributor

This seems to be a regression with Go 1.16 time.Sleep.

What version of Go are you using (go version)?

$ go version
go version go1.16 windows/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
set GO111MODULE=
set GOARCH=amd64
set GOBIN=
set GOCACHE=Z:\gocache
set GOENV=C:\Users\egone\AppData\Roaming\go\env
set GOEXE=.exe
set GOFLAGS=
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOINSECURE=
set GOMODCACHE=F:\Go\pkg\mod
set GONOPROXY=
set GONOSUMDB=
set GOOS=windows
set GOPATH=F:\Go
set GOPRIVATE=
set GOPROXY=https://proxy.golang.org,direct
set GOROOT=c:\go
set GOSUMDB=sum.golang.org
set GOTMPDIR=
set GOTOOLDIR=c:\go\pkg\tool\windows_amd64
set GOVCS=
set GOVERSION=go1.16
set GCCGO=gccgo
set AR=ar
set CC=gcc
set CXX=g++
set CGO_ENABLED=1
set GOMOD=f:\temp\sleep\go.mod
set CGO_CFLAGS=-g -O2
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-g -O2
set CGO_FFLAGS=-g -O2
set CGO_LDFLAGS=-g -O2
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0 -fdebug-prefix-map=Z:\Temp\go-build3014714148=/tmp/go-build -gno-record-gcc-switches

What did you do?

package main

import (
	"fmt"
	"time"

	"github.com/loov/hrtime"
)

func main() {
	b := hrtime.NewBenchmark(100)
	for b.Next() {
		time.Sleep(50 * time.Microsecond)
	}
	fmt.Println(b.Histogram(10))
}

hrtime is a package that uses RDTSC for time measurements.

Output on Windows 10:

> go1.16 run .
  avg 15.2ms;  min 55.3µs;  p50 15.5ms;  max 16.3ms;
  p90 16.1ms;  p99 16.3ms;  p999 16.3ms;  p9999 16.3ms;
     55.3µs [  2] █
        2ms [  0]
        4ms [  0]
        6ms [  0]
        8ms [  0]
       10ms [  1] ▌
       12ms [  0]
       14ms [ 75] ████████████████████████████████████████
       16ms [ 22] ███████████▌
       18ms [  0]

> go1.15.8 run .
  avg 1.03ms;  min 63.9µs;  p50 1ms;  max 2.3ms;
  p90 1.29ms;  p99 2.3ms;  p999 2.3ms;  p9999 2.3ms;
     63.9µs [  1] ▌
      500µs [ 47] ███████████████████████████████████████
        1ms [ 48] ████████████████████████████████████████
      1.5ms [  1] ▌
        2ms [  3] ██
      2.5ms [  0]
        3ms [  0]
      3.5ms [  0]
        4ms [  0]
      4.5ms [  0]

Output on Linux (Debian 10):

$ go1.16 run test.go
  avg 1.06ms;  min 1.06ms;  p50 1.06ms;  max 1.08ms;
  p90 1.07ms;  p99 1.08ms;  p999 1.08ms;  p9999 1.08ms;
     1.06ms [  4] █▌
     1.07ms [ 84] ████████████████████████████████████████
     1.07ms [  7] ███
     1.08ms [  3] █
     1.08ms [  1]
     1.09ms [  1]
     1.09ms [  0]
      1.1ms [  0]
      1.1ms [  0]
     1.11ms [  0]

$ go1.15.8 run test.go
  avg 86.7µs;  min 57.3µs;  p50 83.6µs;  max 132µs;
  p90 98.3µs;  p99 132µs;  p999 132µs;  p9999 132µs;
     57.3µs [  2] █
       60µs [  1] ▌
       70µs [ 13] ████████
       80µs [ 64] ████████████████████████████████████████
       90µs [ 11] ██████▌
      100µs [  2] █
      110µs [  1] ▌
      120µs [  3] █▌
      130µs [  3] █▌
      140µs [  0]

The time granularity shouldn't be that bad even for Windows. So, there might be something going wrong somewhere.

@seankhliao seankhliao added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Feb 17, 2021
@egonelbre
Copy link
Contributor Author

It looks like this happened in 8fdc79e. https://go-review.googlesource.com/c/go/+/232298

CC: @ChrisHines

@seankhliao seankhliao changed the title time.Sleep takes more time than expected runtime: time.Sleep takes more time than expected Feb 17, 2021
@bcmills
Copy link
Contributor

bcmills commented Feb 17, 2021

That same CL shook out a number of kernel and runtime bugs in various configurations. (See previously #43067, #42515, #42237; cc @prattmic.)

@ianlancetaylor
Copy link
Contributor

CC @mknyszek @aclements

@ianlancetaylor ianlancetaylor added this to the Go1.17 milestone Feb 17, 2021
@prattmic
Copy link
Member

prattmic commented Feb 17, 2021

This is reproducible with a trivial benchmark in time package:

func BenchmarkSimpleSleep(b *testing.B) {
       for i := 0; i < b.N; i++ {
               Sleep(50*Microsecond)
       }
}

amd64/linux, before/after http://golang.org/cl/232298:

name            old time/op  new time/op   delta
SimpleSleep-12  86.9µs ± 0%  609.8µs ± 5%  +601.73%  (p=0.000 n=10+9)

@prattmic
Copy link
Member

For reference, across different sleep times:

name                  old time/op  new time/op   delta
SimpleSleep/1ns-12     460ns ± 3%    479ns ± 1%    +4.03%  (p=0.000 n=10+9)
SimpleSleep/100ns-12   466ns ± 3%    476ns ± 2%    +2.35%  (p=0.001 n=10+9)
SimpleSleep/500ns-12  6.47µs ±11%   6.70µs ± 5%      ~     (p=0.105 n=10+10)
SimpleSleep/1µs-12    10.3µs ±10%   12.2µs ±13%   +18.23%  (p=0.000 n=10+10)
SimpleSleep/10µs-12   81.9µs ± 1%  502.5µs ± 4%  +513.45%  (p=0.000 n=10+10)
SimpleSleep/50µs-12   87.0µs ± 0%  622.9µs ±18%  +615.69%  (p=0.000 n=8+10)
SimpleSleep/100µs-12   179µs ± 0%   1133µs ± 1%  +533.52%  (p=0.000 n=8+10)
SimpleSleep/500µs-12   592µs ± 0%   1137µs ± 1%   +91.97%  (p=0.000 n=10+10)
SimpleSleep/1ms-12    1.12ms ± 2%   1.14ms ± 1%    +1.36%  (p=0.000 n=9+10)
SimpleSleep/10ms-12   10.2ms ± 0%   10.3ms ± 0%    +0.79%  (p=0.000 n=9+9)

@prattmic
Copy link
Member

prattmic commented Feb 17, 2021

Looking at the 100µs, the immediate problem is the delay resolution in netpoll.

Prior to http://golang.org/cl/232298, 95% of timer expirations in the 100µs case are detected by sysmon, which calls startm to wake an M to handle the timer (though this is not a particularly efficient approach).

After http://golang.org/cl/232298, this path is gone and the wakeup must come from netpoll (assuming all Ms are parked/blocked). netpoll on Linux only has 1ms resolution, so it must sleep at least that long before detecting the timer.

I'm not sure why I'm seeing ~500µs on the 10µs and 50µs benchmarks, but I may have bimodal distribution where ~50% of cases a spinning M is still awake long enough to detect the timer before entering netpoll.

I'm also not sure why @egonelbre is seeing ~14ms on Windows, as that also appears to have 1ms resolution on netpoll.

@prattmic
Copy link
Member

I think the ideal fix to this would be to increase the resolution of netpoll. Even for longer timers, this limited resolution will cause slight skew to timer delivery (though of course there are no real-time guarantees).

As it happens, Linux v5.11 includes epoll_pwait2 which switches the timeout argument to a timespec for nanosecond resolution. Unfortunately, Linux v5.11 was released ... 3 days ago, so availability is not widespread to say the least.

In the past, I've also prototyped changing the netpoll timeout to being controlled by a timerfd (with the intention of being able to adjust the timer earlier without a full netpollBreak). That could be an option as well.

Both of these are Linux-specific solutions, I'd have to research other platforms more to get a sense of the options there.

We also may just want to bring the sysmon wakeup back, perhaps with slight overrun allowed to avoid excessive M wakeups.

@ianlancetaylor
Copy link
Contributor

I guess that wakeNetPoller doesn't help, because there is no poller sleeping at the point of calling time.Sleep.

Perhaps when netpoll sees a delay that is shorter than the poller resolution it should just do a non-blocking poll. That will effectively turn findrunnable into a busy wait when the next timer is very soon.

@ChrisHines
Copy link
Contributor

While working on CL232298 I definitely observed anecdotal evidence that the netpoller has more latency than other ways of sleeping. From #38860 (comment):

My anecdotal observation here is that it appears the Linux netpoller implementation has more latency when waking after a timeout than the Go 1.13 timerproc implementation. Most of these benchmark numbers replicate on my Mac laptop, but the darwin netpoller seems to suffer less of that particular latency by comparison, and is also worse in other ways. So it may not be possible to close the gap with Go 1.13 purely in the scheduler code. Relying on the netpoller for timers changes the behavior as well, but these new numbers are at least in the same order of magnitude as Go 1.13.

I didn't try to address that in CL232298 primarily because it was already risky enough that I didn't want to make bigger changes. But an idea for something to try occurred to me back then. Maybe we could improve the latency of non-network timers by having one M block on a notesleep call instead of the netpoller. That would require findrunnable to compute two wakeup times, one for net timers to pass to the netpoller and one for all other timers to use with notesleep depending on which role it takes on (if any) when it cannot find any other work.

I haven't fully gauged how messy that would get.

Questions and concerns:

  • Coordinating two sleeping M's probably has complicating edge cases to figure out.
  • I haven't tested the latency of notesleep wakeups to know if it would actually help.
  • Would it require duplicating all the timer fields on each P, one for net timers and one for the others?

One other oddity that I noticed when testing CL232298: The linux netpoller sometimes wakes up from the timeout value early by a several microseconds. When that happens findrunnable usually does not find any expired timers since they haven't actually expired yet. A new--very short--pollUntil value gets computed and the M reenters the netpoller. The subsequent wakeup is then typically rather late, maybe up to 1ms, but I am going from memory here. I might be able to dig up some trace logs showing this behavior if I still have them and people are interested.

@prattmic
Copy link
Member

I guess that wakeNetPoller doesn't help, because there is no poller sleeping at the point of calling time.Sleep.

wakeNetPoller shouldn't matter either way, because even if we wake the netpoller, it will just sleep again with a new timeout of 1ms, which is too long. (Unless wakeNetPoller happens to take so long that the timer has expired by the time the woken M gets to checkTimers).

Perhaps when netpoll sees a delay that is shorter than the poller resolution it should just do a non-blocking poll. That will effectively turn findrunnable into a busy wait when the next timer is very soon.

Maybe we could improve the latency of non-network timers by having one M block on a notesleep call instead of the netpoller.

As somewhat of a combination of these, one potential option would be to make netpoll with a short timeout do non-blocking netpoll, short notetsleep, non-blocking netpoll. Though this has the disadvantage of slightly increasing latency of network events from netpoll.

One other oddity that I noticed when testing CL232298: The linux netpoller sometimes wakes up from the timeout value early by a several microseconds. When that happens findrunnable usually does not find any expired timers since they haven't actually expired yet. A new--very short--pollUntil value gets computed and the M reenters the netpoller. The subsequent wakeup is then typically rather late, maybe up to 1ms, but I am going from memory here. I might be able to dig up some trace logs showing this behavior if I still have them and people are interested.

Hm, this sounds like another bug, or perhaps a spurious netpollBreak from another M.

@prattmic
Copy link
Member

It seems that on Windows, notetsleep has 1ms precision in addition to netpoll, so the explanation in #44343 (comment) doesn't explain the increase in latency on Windows.

@ChrisHines
Copy link
Contributor

ChrisHines commented Feb 18, 2021

My first thought on the Windows behavior is that somehow osRelax is being mismanaged and allowing the timer resolution to decrease to its resting mode. That thought is driven by the spike on the above histograms at 15ms. I haven't thought through how that might happen now.

@ChrisHines
Copy link
Contributor

Hm, this sounds like another bug, or perhaps a spurious netpollBreak from another M.

That could be, but I was logging at least some of the calls to netpollBreak as well and don't recall seeing seeing that happen. I saved my logging code in case it can help. https://github.com/ChrisHines/go/tree/dlog-backup

@vellotis
Copy link

For reference, output on my Windows 10:

> go1.16 run .
  avg 1.06ms;  min 475µs;  p50 1.01ms;  max 1.99ms;
  p90 1.13ms;  p99 1.99ms;  p999 1.99ms;  p9999 1.99ms;
      475µs [  1] ▌
      600µs [  1] ▌
      800µs [ 36] ██████████████████████████
        1ms [ 55] ████████████████████████████████████████
      1.2ms [  0]
      1.4ms [  0]
      1.6ms [  0]
      1.8ms [  7] █████
        2ms [  0]
      2.2ms [  0]

A totally different results from the #44343 (comment).

go env Output
$ go env
set GO111MODULE=on
set GOARCH=amd64
set GOBIN=
set GOCACHE=C:\Users\user\AppData\Local\go-build
set GOENV=C:\Users\user\AppData\Roaming\go\env
set GOEXE=.exe
set GOFLAGS=
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOINSECURE=
set GOMODCACHE=C:\Projects\Go\pkg\mod
set GONOPROXY=
set GONOSUMDB=
set GOOS=windows
set GOPATH=C:\Projects\Go
set GOPRIVATE=
set GOPROXY=
set GOROOT=C:\Tools\Go\go1.16
set GOSUMDB=sum.golang.org
set GOTMPDIR=
set GOTOOLDIR=C:\Tools\Go\go1.16\pkg\tool\windows_amd64
set GOVCS=
set GOVERSION=go1.16
set GCCGO=gccgo
set AR=ar
set CC=gcc
set CXX=g++
set CGO_ENABLED=1
set GOMOD=NUL
set CGO_CFLAGS=-g -O2
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-g -O2
set CGO_FFLAGS=-g -O2
set CGO_LDFLAGS=-g -O2
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-m64 -mthreads -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=C:\Users\user\AppData\Local\Temp\go-build2862485594=/tmp/go-build -gno-record-gcc-switches

@egonelbre
Copy link
Contributor Author

@vellotis this could be because there's something running in the background changing Windows timer resolution. This could be some other Go service/binary built using older Go version. Of course there can plenty of other programs that may change it.

You can use https://github.com/tebjan/TimerTool to see what the current value is. There's some more detail in https://randomascii.wordpress.com/2013/07/08/windows-timer-resolution-megawatts-wasted.

@zobkiw
Copy link

zobkiw commented Mar 19, 2021

For what it's worth, go1.16.2 darwin/amd64 is also exhibiting this. In a program I'm running, the line time.Sleep(1 * time.Hour) takes roughly an hour and 3 minutes each time.

@prattmic
Copy link
Member

@zobkiw It sounds like they is more likely to be related to #44868. I'm also curious if the failure is consistent and it really is always 1hr 3min, not 1hr 2min (or rather, ranging from 2-4min since alignment with 2min forcegc will vary)?

@zobkiw
Copy link

zobkiw commented Mar 19, 2021

@prattmic Strangely, I just checked the output of the script now and (although I just restarted it a few hours before) it was spot on at one hour twice in a row. However, yesterday it was "generally" 3 minutes. I don't have an exact time since we were only seeing the minute logged. It was always 3 minutes (rounded) from the previous run. 1:00pm, 2:03pm, 3:06pm, 4:09pm, etc.

Some things to note about this loop I'm running, it is calling a shell script using exec.Command before sleeping for an hour, then it does it again. The script takes about 10 seconds to execute and completes its job. It is also running in a screen session but so was the one this morning that was doing fine so I don't think that was it. The machine it was running on, a mac mini, was basically idle other that this lightweight job.

If you have some code you would like me to run I would be happy to - otherwise I will keep an eye on it here as well and report anything else I see. Hopefully some of this is helpful in narrowing down the cause if you haven't already.

UPDATE 3-20-2021 10am ET: I just ran my program for about the last 24 hours and sure enough, it would run fine for the first few iterations and then start to sleep longer based on the logging. Mostly it would hover between 3-5 minutes late but once it was 9 minutes! I've (this morning) written another program using the same logic but with simulated activity (since the actual tasks did not seem to be part of the problem) and much more detailed logging. I have it running now in 3 environments, two using screen and one not. It is running on an M1 Big Sur 11.2.3 and an Intel mac mini running the same. One thing that struck me this morning was wondering if the machine being asleep at all caused the delay. Since it runs OK for awhile (while I was using the machine presumably) and then had the delays over night. This morning, the latest reading (while I was using the machine again to install the new test code) was back to 1 hour spot on - no delay. Once I get some more data at the end of the day or tomorrow morning I will report back and share the code if there remains a problem.

@tandr
Copy link

tandr commented Mar 19, 2021

3 minutes per hour is rather worrisome, especially if some cron-like functionality required in a long-running service...

@networkimprov
Copy link

In case it's related; on Windows, timers tick while the system is in standby/suspend, but they pause on other platforms.

This is supposed to have been addressed already #36141

@zobkiw
Copy link

zobkiw commented Mar 21, 2021

@prattmic et al, I took the liberty of writing a testbed to experiment with this issue. See all the details of test runs on Linux, Intel, and M1, along with source code here:

https://github.com/zobkiw/sleeptest

Comments welcome - especially if you find a bug in the code. Also, see the results directory for the output of the tests. Regardless, these results seem strange to me. I look forward to other input on this issue.

@networkimprov
Copy link

cc @zx2c4 :-)

@egonelbre
Copy link
Contributor Author

@zobkiw could it be something due to cpu frequency scaling? Can you try locking the cpu to a specific frequency (something low) and see whether the problem disappears?

@johnrs
Copy link

johnrs commented Jul 31, 2023

I've submitted the CL 514275 with your suggestion.

Thanks. However...

We could expose timeBeginPeriod and timeEndPeriod in x/sys/windows. Please submit a separate proposal asking for it.

Will do.

I don't think that the Go runtime should change the default time resolution on Windows. Microsoft considers it to be a good default considering all the tradeoffs, and I don't have enough information to take a better decision.

Perhaps this additional information will help.

  1. The runtime used timeBeginPeriod from go1.0 to go1.15. The latency / power tradeoff was an issue for a while around go1.3 - go1.5. The "do it manually" fix wasn't appealing and didn't work properly at the time. Then @aclements came up with the osRelax() auto-switching solution which seemed to satisfied everyone, and it was put into go1.5. All was well.

  2. With the recent change to Windows 10+, which makes a faster timer rate non-global (not affecting other Windows processes), the power aspect is largely mitigated.

  3. The 15.6ms clock breaks an old program of mine. Others have reported this, too. Just today I read that it caused a problem with pprof. So I view this as a bug - one which would seem to violate the Go Compatibility Guarantee goal.

  4. I'm not suggesting that the changes causing this be reverted. Simply setting the clock to 2ms would probably solve 90% or more of the latency issues, and would save half of the power (besides the global savings). Short of restoring the auto-switching, this seems like a workable compromise. For the few who need either extreme (1ms or 15.6ms), then manually setting the clock is available, now that it's safe for user calls to timeBeginPeriod. It's an extra "knob", so should be avoided as much as possible, I believe.

  5. This compromise requires only 1 line of code be added to the runtime and modifying the text in CL 514275. It ends the 2.5 yr wait for a solution to the problem.

gopherbot pushed a commit that referenced this issue Aug 1, 2023
Updates #44343
Updates #53824

Change-Id: Ia7234fac4b1b88b3c331328aaa98dc85205e09ba
Reviewed-on: https://go-review.googlesource.com/c/go/+/514275
Reviewed-by: David Chase <drchase@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Run-TryBot: Quim Muntal <quimmuntal@gmail.com>
@johnrs
Copy link

johnrs commented Aug 1, 2023

I've submitted the CL 514275 with your suggestion

... a lower resolution may be requested using [timeBeginPeriod].

1ms would be a higher resolution than 15.6ms.

@gopherbot
Copy link

Change https://go.dev/cl/515495 mentions this issue: time: amend time resolution docs

@gopherbot
Copy link

Change https://go.dev/cl/515915 mentions this issue: windows: add TimeBeginPeriod and TimeEndPeriod syscalls

gopherbot pushed a commit to golang/sys that referenced this issue Aug 7, 2023
Updates golang/go#44343
Fixes golang/go#61723

Change-Id: I920f2fad6b3a2a9ba67ff2fafea4319644a0bc8c
Reviewed-on: https://go-review.googlesource.com/c/sys/+/515915
Reviewed-by: qiulaidongfeng <2645477756@qq.com>
Run-TryBot: Quim Muntal <quimmuntal@gmail.com>
Reviewed-by: Alex Brainman <alex.brainman@gmail.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
gopherbot pushed a commit that referenced this issue Aug 7, 2023
Updates #44343

Change-Id: Id1497f0236fe7b4937e64877fa943329e280d192
Reviewed-on: https://go-review.googlesource.com/c/go/+/515495
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Quim Muntal <quimmuntal@gmail.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@prattmic prattmic changed the title runtime: time.Sleep takes more time than expected on Windows runtime: time.Sleep takes more time than expected on Windows (1ms -> 10ms) and Linux+others (80us -> 1ms) Sep 22, 2023
@checkm4ted
Copy link

checkm4ted commented Oct 29, 2023

Is there any workaround I can use for my program to sleep with a high resolution while this gets fixed?
EDIT:
syscall.NewLazyDLL("winmm.dll").NewProc("timeBeginPeriod").Call(1) seems to fix it ;)
EDIT 2: Just realised windows.TimeBeginPeriod(1) is already out

@johnrs
Copy link

johnrs commented Oct 29, 2023

/*
init() switches the Windows system clock from 15.6ms to 1ms.
@ In the future (v1.22): golang.org/x/sys/windows.TimeBeginPeriod
*/
func init() {
	winmmDLL := syscall.NewLazyDLL("winmm.dll")
	procTimeBeginPeriod := winmmDLL.NewProc("timeBeginPeriod")
	procTimeBeginPeriod.Call(uintptr(1))
}

theinternetftw added a commit to theinternetftw/glimmer that referenced this issue Jan 10, 2024
With time.Sleep now being particularly low resolution on windows in
go (see golang/go#44343), waiting for the
rest of a frame after it's complete isn't viable.

I'm now basing my wait in my own programs on the OS asking for more
sound data / the sound buffer growing too full, which doesn't rely
on sleep resolution (thanks to using a channel that sleeps/wakes
based (ultimately) on an OS audio system callback)
@qmuntal
Copy link
Contributor

qmuntal commented Jan 24, 2024

NtCreateWaitCompletionPacket and friends required for #44343 (comment) have been officially documented 😄 @thepudds.

This means that I'll try to land CL 488675 in go1.23, which will improve timer resolution to ~1ms.

@thepudds
Copy link
Contributor

That’s great @qmuntal! This should be a very nice step forward for Windows users.

@thepudds
Copy link
Contributor

thepudds commented Feb 1, 2024

This is a very long issue, with 150 hidden comments currently. For a period of time, the title here was:

  • runtime: time.Sleep takes more time than expected on Windows

In Sept 2023, the title was changed to add and Linux+others (80us -> 1ms).

In the interests of making things more manageable, I suggest we re-focus this issue back on Windows so that @qmuntal's https://go.dev/cl/488675 could hopefully close this issue. (This issue also has a decent amount of external visibility, including it was cited as a reason some Windows users couldn't upgrade to modern Go versions, so it would be nice for clarity if @qmuntal's change could close this issue).

If that sounds reasonable, I'd be happy to open a new issue for Linux if there isn't one already.

Alternatively, if @prattmic thinks the WIP https://go.dev/cl/363417 (which switches to epoll_pwait2 on Linux) or something similar has a chance of landing for Go 1.23 to improve things for Linux as well, then maybe this issue could be both for Windows and Linux. (Even so, I'm not sure there is an improvement pending in the near term for "other" OSes).


Separately, I wanted to copy here for posterity an explanation @prattmic gave earlier today in Gopher Slack in response to someone observing that on Linux they were "seeing a weird mix of certain sub-ms sleeps being serviced timely, and some rounding to 1ms":

Timers are tracked with nanosecond resolution. If the runtime enters the scheduler, it will service any expired timers with full resolution. But if the runtime goes fully idle, it will wait for the next timer expiration with the epoll_wait timeout, which has 1ms resolution. We've been meaning to switch to epoll_pwait2, which has ns resolution.

@prattmic had some related comments about epoll_pwait2 earlier in #44343 (comment) and #44343 (comment), when epoll_pwait2 was only freshly available, as well as mentioning there that bringing back some form of sysmon wakeup or using timerfd might help things.

One final reference for posterity is that I had tried in #44343 (comment) to summarize the state of this issue with some links to other key comments as of March 2023.


(In short, after re-skimming this issue just now, I'm hoping to avoid re-skimming this issue ever again 😅).

@thepudds
Copy link
Contributor

thepudds commented Feb 2, 2024

Seeing a few thumbs up and hearing no objection so far, I am going to change this issue title to re-focus back on Windows. If someone has a different opinion, feel free of course to edit the title again.

It looks like #53824 is already tracking using use epoll_pwait2 for Linux. (It also looks like an external contributor sent CL 449815 to use epoll_pwait2 based on @prattmic's earlier epoll_pwait2 CL, but they eventually got stuck on some test failures that needed investigation -- if anyone is interested, some basic triage of those failures could be a way to help move things forward).

@thepudds thepudds changed the title runtime: time.Sleep takes more time than expected on Windows (1ms -> 10ms) and Linux+others (80us -> 1ms) runtime: time.Sleep takes more time than expected on Windows (1ms -> 10ms) Feb 2, 2024
@qmuntal
Copy link
Contributor

qmuntal commented Feb 19, 2024

FYI: CL 488675 has been merged.

The new time resolution on Windows with high resolution timers is now ~0.5ms, measured using the code from the issue description:

  avg 513µs;  min 17.4µs;  p50 531µs;  max 722µs;
  p90 558µs;  p99 722µs;  p999 722µs;  p9999 722µs;
     17.4µs [  2] ▌
      100µs [  2] ▌
      200µs [  2] ▌
      300µs [  1]
      400µs [  3] █
      500µs [ 84] ████████████████████████████████████████
      600µs [  5] ██
      700µs [  1]
      800µs [  0]
      900µs [  0]

@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Feb 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done. OS-Linux OS-Windows
Projects
Development

No branches or pull requests