Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: Timer buckets may get "stuck" for long periods of time after Windows 8/10 systems wake from sleep #31528

Closed
jmontgomery-jc opened this issue Apr 17, 2019 · 27 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Windows

Comments

@jmontgomery-jc
Copy link

jmontgomery-jc commented Apr 17, 2019

What version of Go are you using (go version)?

PS C:\> go version
go version go1.12.4 windows/amd64

Does this issue reproduce with the latest release?

Yes, and also with go 1.11

What operating system and processor architecture are you using (go env)?

go env Output
PS C:\> go env
set GOARCH=amd64
set GOBIN=
set GOCACHE=C:\Users\jordan\AppData\Local\go-build
set GOEXE=.exe
set GOFLAGS=
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOOS=windows
set GOPATH=c:\gopath
set GOPROXY=
set GORACE=
set GOROOT=C:\projects\go1.12.4.windows-amd64\go
set GOTMPDIR=
set GOTOOLDIR=C:\projects\go1.12.4.windows-amd64\go\pkg\tool\windows_amd64
set GCCGO=gccgo
set CC=gcc
set CXX=g++
set CGO_ENABLED=1
set GOMOD=
set CGO_CFLAGS=-g -O2
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-g -O2
set CGO_FFLAGS=-g -O2
set CGO_LDFLAGS=-g -O2
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-m64 -mthreads -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=C:\Users\jordan\AppData\Local\Temp\go-build618130038=/tmp/go-build -gno-record-gcc-switches

What did you do?

I ran this program on a Windows 10 amd64 VM: https://play.golang.org/p/Jyqki-iIF54. I put the Windows VM to sleep(right click Start button->"Shut down or sign out"->"Sleep") right after seeing the first “ShortInterval 0 fired at…” print at 15:19:09. I then woke the machine up about 8 minutes later around 15:27:28 and the program continued. I later killed the program with ctrl+c around 15:34:30. Full program output copied below:

PS C:\> .\timertest.exe
doStuffLongInterval 2 starting
doStuffLongInterval 0 starting
doStuffShortInterval 0 starting
doStuffLongInterval 1 starting
ShortInterval 0 fired at 2019-04-16 15:19:09.159805 -0700 PDT m=+30.006215401 after 30 seconds
LongInterval 0 fired at 2019-04-16 15:27:28.4460785 -0700 PDT m=+529.330931901 after 5 minutes
ShortInterval 0 fired at 2019-04-16 15:27:28.4468785 -0700 PDT m=+529.331733901 after 30 seconds
LongInterval 2 fired at 2019-04-16 15:31:28.2059465 -0700 PDT m=+769.327050901 after 5 minutes
ShortInterval 0 fired at 2019-04-16 15:31:28.2059465 -0700 PDT m=+769.327050901 after 30 seconds
LongInterval 1 fired at 2019-04-16 15:31:28.2059465 -0700 PDT m=+769.327050901 after 5 minutes
ShortInterval 0 fired at 2019-04-16 15:31:58.205199 -0700 PDT m=+799.330143301 after 30 seconds
ShortInterval 0 fired at 2019-04-16 15:32:28.2045636 -0700 PDT m=+829.332021501 after 30 seconds
LongInterval 0 fired at 2019-04-16 15:32:28.2053959 -0700 PDT m=+829.332853801 after 5 minutes
ShortInterval 0 fired at 2019-04-16 15:32:58.2042094 -0700 PDT m=+859.333011301 after 30 seconds
ShortInterval 0 fired at 2019-04-16 15:33:28.2022148 -0700 PDT m=+889.334911201 after 30 seconds
ShortInterval 0 fired at 2019-04-16 15:33:58.1962279 -0700 PDT m=+919.335067001 after 30 seconds
ShortInterval 0 fired at 2019-04-16 15:34:28.1926601 -0700 PDT m=+949.335907901 after 30 seconds

What did you expect to see?

I would expect that while the system is awake the ShortInterval function would regularly print approximately every 30 seconds. I would also expect the LogInterval prints to be consistently placed in time rather than “LongInterval 0 …” seemingly being staggered and firing at different times than “LongInterval 1 …” and “LongInterval 2 …” which started at approximately the same time.

What did you see instead?

After waking up the system the “ShortInterval 0 fired” message was printed once and then was not printed again for 4 minutes, meaning its 30 second timer fired about 3.5 minutes too late.

I believe this is due to the design of the timerBucket mechanism within the Go runtime. Each timerBucket is sorted such that the timer with the lowest firing time(the “when” field on the runtime.timer struct) is always at the front and the goroutines that handle firing of timers sleep until they are:

  • Awakened from sleep at the time to fire the timer

  • Awakened by a new timer being placed at the front of the timerBucket which causes the sleep time to be recalculated.

On Windows, the current time is retrieved using nanotime() which grabs the time from the KUSER_SHARED_DATA structure and which continues “ticking” even while the system is asleep. I’ve tested and observed this behavior both in a VMWare VM running Windows 10 and on a laptop running Windows 10. Relevant code here: https://golang.org/src/runtime/sys_windows_amd64.s?h=nanotime#L468

The timed wakeup, on the other hand, is implemented in semasleep() using WaitForSingleObject which the documentation states does not count time spent in a “low-power state”(i.e. asleep) on Windows 8 and 10. See here: https://golang.org/src/runtime/os_windows.go?h=semasleep#L600 and here: https://docs.microsoft.com/en-us/windows/desktop/api/synchapi/nf-synchapi-waitforsingleobject

What seems to happen here is that when a Windows 8 or 10 machine goes to sleep, the actual time returned by nanotime() can potentially surpass the “when” time of the timer at the front of the queue while the machine is asleep. Newly added timers will thus have a later “when” time and so will get queued behind the element at the front of the timerBucket and since WaitForSingleObject does not count this sleep time the wakeup will not come until the machine has actually been awake for the total amount of time originally specified for the timer, potentially making the new timers added to the timerBucket late by however long the original timer was for. All of this is invisible to a normal go program so from the user or programmer’s perspective, some timers simply stop firing.

In my testing I’ve noticed that this happens more often when machines have a low number of cores, due to the lower number of timerBuckets so most of my testing was done on a 2-core VM. That said, it does reproduce on real hardware if the number of “long interval” goroutines is increased.

@bradfitz
Copy link
Contributor

@bradfitz bradfitz added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Windows labels Apr 17, 2019
@ianlancetaylor
Copy link
Contributor

The analysis sounds entirely plausible to me. Unix systems avoid this problem by using the same clock for time and for sleeping. Does Windows have an option to do that?

@jmontgomery-jc
Copy link
Author

I don't know of any APIs on Windows that provide similar functionality. It appears that outside of the KUSER_SHARED_DATA method of getting timestamps there's also QPC(which the runtime uses for binaries running under Wine) but even that "keeps ticking" while the system is asleep: https://docs.microsoft.com/en-us/windows/desktop/sysinfo/acquiring-high-resolution-time-stamps

@jmontgomery-jc
Copy link
Author

jmontgomery-jc commented Apr 25, 2019

It turns out I was wrong and there is a Windows API that provides a monotonic clock that does not "keep ticking" while the system is asleep:
https://docs.microsoft.com/en-us/windows/desktop/api/realtimeapiset/nf-realtimeapiset-queryunbiasedinterrupttime

The only downside of using this API is that it is unavailable on anything lower than Windows 7/Server 2008 R2. We may also see early wakeups on Win7/2k8R2 due to the differences highlighted above in WaitForSingleObject timeouts on these versions. Thoughts?

@alexbrainman
Copy link
Member

@jmontgomery-jc thank you for creating this issue.

I agree with your suggestion to use QueryUnbiasedInterruptTime in nanotime.

I tried to see what QueryUnbiasedInterruptTime returns, and QueryUnbiasedInterruptTime returns counter that stops when I put my Windows 10 computer to sleep.

I tried making QueryUnbiasedInterruptTime call as efficient as possible - something like this

// func unbiased() int64
TEXT runtime·unbiased(SB),NOSPLIT|NOFRAME,$0-8
        LEAQ    ret+0(FP), CX
        MOVQ    SP, AX
        ANDQ    $~15, SP        // alignment as per Windows requirement
        SUBQ    $(48), SP       // room for SP and 4 args as per Windows require
                                // plus one extra word to keep stack 16 bytes al
        MOVQ    AX, 32(SP)
        MOVQ    runtime·_QueryUnbiasedInterruptTime(SB), AX
        //MOVQ  runtime·_QueryUnbiasedInterruptTimePrecise(SB), AX
        CALL    AX
        MOVQ    32(SP), SP
        RET

(you need to multiply what this function returns by 100 to get nanoseconds), and it is about twice as slow as our current nanotime implementation on my Windows 10 computer.

BenchmarkNanotime-4                                     522565533                2.30 ns/op
BenchmarkNanotimeUnbiasedInterruptTime-4                290020608                4.13 ns/op

Or maybe we could even use QueryUnbiasedInterruptTimePrecise instead (this is only available on recent Windows versions - Windows 10 or Windows Server 2016 ). QueryUnbiasedInterruptTimePrecise is about 10 times slower that our current implementation

BenchmarkNanotime-4                                     472691991                2.33 ns/op
BenchmarkNanotimeUnbiasedInterruptTimePrecise-4         54546445                22.4 ns/op

But maybe it is good enough? Maybe we could call QueryUnbiasedInterruptTime in nanotime, and call QueryUnbiasedInterruptTimePrecise in time.now ? Also QueryUnbiasedInterruptTimePrecise documentation does not mentions timeBeginPeriod, so maybe we could get rid of timeBeginPeriod call too?

What do you think @dvyukov, @aclements and @ianlancetaylor ?

Alex

@zx2c4
Copy link
Contributor

zx2c4 commented Aug 25, 2019

Using QueryUnbiasedInterruptTime would be a major step backwards. Already I'm trying to move Go/Linux to use CLOCK_BOOTTIME, and upstream Linux is discussing making CLOCK_BOOTTIME and CLOCK_MONOTONIC the same thing.

It's impossible to implement WireGuard securely if timers don't take into account sleep time. Currently on non-Windows platforms I'm required to patch the Go runtime. I won't be happy about the prospect of having to add additional patches on Windows.

@zx2c4
Copy link
Contributor

zx2c4 commented Aug 25, 2019

The timed wakeup, on the other hand, is implemented in semasleep() using WaitForSingleObject which the documentation states does not count time spent in a “low-power state”(i.e. asleep) on Windows 8 and 10. See here: https://golang.org/src/runtime/os_windows.go?h=semasleep#L600 and here: https://docs.microsoft.com/en-us/windows/desktop/api/synchapi/nf-synchapi-waitforsingleobject

Indeed it looks like WaitForSingleObject no longer takes into account sleep time, which is a bummer.

Could we just re-arm/re-sleep on wakeup? PowerRegisterSuspendResumeNotification to get changes on Win8+ (the affected platforms), and then we use WaitForMultipleObjects in semasleep with an additional re-arming global event, which causes semasleep to calculate how much longer it should sleep (using the normal nanosleep) and then do it again.

@gopherbot
Copy link

Change https://golang.org/cl/191843 mentions this issue: [WIP/RFC] runtime: use futexes on Windows

@gopherbot
Copy link

Change https://golang.org/cl/191897 mentions this issue: runtime: use SetWaitableTimer for better semasleep

@gopherbot
Copy link

Change https://golang.org/cl/191957 mentions this issue: runtime: monitor for suspend/resume to kick timeouts

@zx2c4
Copy link
Contributor

zx2c4 commented Aug 27, 2019

@jmontgomery-jc Would you try out the patch in https://golang.org/cl/191957 and let me know if it fixes your issue? In my brief tests, it works for me.

@jmontgomery-jc
Copy link
Author

@zx2c4 The patch seems to fix the issue on my machine. I tested my program to make sure I could still repro the bug and saw the timers get stuck with the latest go 1.12 branch. I then tested with yours 5 times and was unable to get the timers stuck after waking the machine up from sleep.

@zx2c4
Copy link
Contributor

zx2c4 commented Aug 27, 2019

Excellent, thanks for the confirmation. @alexbrainman : Let's think about merging https://golang.org/cl/191957 then.

tomocy pushed a commit to tomocy/go that referenced this issue Sep 1, 2019
Starting in Windows 8, the wait functions don't take into account
suspend time, even though the monotonic counters do. This results in
timer buckets stalling on resume. Therefore, this commit makes it so
that on resume, we return from the wait functions and recalculate the
amount of time left to wait.

Fixes: golang#31528

Change-Id: I0db02cc72188cb620954e87a0180e0a3c83f4a56
Reviewed-on: https://go-review.googlesource.com/c/go/+/191957
Run-TryBot: Jason A. Donenfeld <Jason@zx2c4.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Alex Brainman <alex.brainman@gmail.com>
t4n6a1ka pushed a commit to t4n6a1ka/go that referenced this issue Sep 5, 2019
Starting in Windows 8, the wait functions don't take into account
suspend time, even though the monotonic counters do. This results in
timer buckets stalling on resume. Therefore, this commit makes it so
that on resume, we return from the wait functions and recalculate the
amount of time left to wait.

Fixes: golang#31528

Change-Id: I0db02cc72188cb620954e87a0180e0a3c83f4a56
Reviewed-on: https://go-review.googlesource.com/c/go/+/191957
Run-TryBot: Jason A. Donenfeld <Jason@zx2c4.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Alex Brainman <alex.brainman@gmail.com>
@zx2c4
Copy link
Contributor

zx2c4 commented Sep 5, 2019

@gopherbot Please backport this to 1.13

@gopherbot
Copy link

Backport issue(s) opened: #34130 (for 1.13).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

@gopherbot
Copy link

Change https://golang.org/cl/193607 mentions this issue: [release-branch.go1.13] runtime: monitor for suspend/resume to kick timeouts

@networkimprov
Copy link

Should backport to 1.12, where issue was first found...

@zx2c4
Copy link
Contributor

zx2c4 commented Sep 21, 2019

Perhaps you can first chime in on the 1.13 issue to motivate that.

@gopherbot
Copy link

Change https://golang.org/cl/198417 mentions this issue: runtime: iterate ms via allm linked list to avoid race

gopherbot pushed a commit that referenced this issue Oct 3, 2019
It's pointless to reach all ms via allgs, and doing so introduces a
race, since the m member of a g can change underneath it. Instead
iterate directly through the allm linked list.

Updates: #31528
Updates: #34130

Change-Id: I34b88402b44339b0a5b4cd76eafd0ce6e43e2be1
Reviewed-on: https://go-review.googlesource.com/c/go/+/198417
Run-TryBot: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Alex Brainman <alex.brainman@gmail.com>
Reviewed-by: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@gopherbot
Copy link

Change https://golang.org/cl/198539 mentions this issue: runtime: iterate ms via allm linked list to avoid race

@gopherbot
Copy link

Change https://golang.org/cl/198540 mentions this issue: [release-branch.go1.13] runtime: monitor for suspend/resume to kick timeouts

gopherbot pushed a commit that referenced this issue Oct 4, 2019
…imeouts

Starting in Windows 8, the wait functions don't take into account
suspend time, even though the monotonic counters do. This results in
timer buckets stalling on resume. Therefore, this commit makes it so
that on resume, we return from the wait functions and recalculate the
amount of time left to wait.

This is a cherry pick of CL 191957 and its cleanup, CL 198417.

Updates #31528
Fixes #34130

Change-Id: I0db02cc72188cb620954e87a0180e0a3c83f4a56
Reviewed-on: https://go-review.googlesource.com/c/go/+/193607
Run-TryBot: Jason A. Donenfeld <Jason@zx2c4.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
@networkimprov
Copy link

networkimprov commented Nov 9, 2019

Microsoft has asked us to revert this change, and adopt the QueryUnbiasedInterruptTime approach suggested above. See #35447 (comment)

The change made Go timers on Windows function differently than on Unix. See also #24595 #35012

Filed #35482

@ijc
Copy link

ijc commented Dec 3, 2019

Should backport to 1.12, where issue was first found...

I'm not sure what the current status here is, but did this fall through the cracks? I see the 1.13 backport but not 1.12.

I'm seeing this issue not through sleeps in my own code but due to the time.Sleep in os.process.wait on Windows (see #25965).

I'm investigating bumping to 1.13 but that may not be possible for us in the short term.

@networkimprov
Copy link

See #35482

@gopherbot
Copy link

Change https://golang.org/cl/210437 mentions this issue: runtime: revert "monitor for suspend/resume to kick timeouts"

@networkimprov
Copy link

@aclements, this should be reopened if it won't be fixed in 1.14.

@gopherbot
Copy link

Change https://golang.org/cl/213197 mentions this issue: [release-branch.go1.12] runtime: monitor for suspend/resume to kick timeouts

gopherbot pushed a commit that referenced this issue Jan 3, 2020
…imeouts

Starting in Windows 8, the wait functions don't take into account
suspend time, even though the monotonic counters do. This results in
timer buckets stalling on resume. Therefore, this commit makes it so
that on resume, we return from the wait functions and recalculate the
amount of time left to wait.

This is a cherry pick of CL 191957 and its cleanup, CL 198417.

Updates #31528
Fixes #36376

Change-Id: I0db02cc72188cb620954e87a0180e0a3c83f4a56
Reviewed-on: https://go-review.googlesource.com/c/go/+/193607
Run-TryBot: Jason A. Donenfeld <Jason@zx2c4.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-on: https://go-review.googlesource.com/c/go/+/213197
@golang golang locked and limited conversation to collaborators Jan 3, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Windows
Projects
None yet
Development

No branches or pull requests

8 participants