Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: Timer buckets may get "stuck" for long periods of time after Windows 8/10 systems wake from sleep [1.13 backport] #34130

Closed
gopherbot opened this issue Sep 5, 2019 · 26 comments
Labels
CherryPickApproved Used during the release process for point releases FrozenDueToAge
Milestone

Comments

@gopherbot
Copy link

@zx2c4 requested issue #31528 to be considered for backport to the next 1.13 minor release.

@gopherbot Please backport this to 1.13

@gopherbot
Copy link
Author

Change https://golang.org/cl/193607 mentions this issue: [release-branch.go1.13] runtime: monitor for suspend/resume to kick timeouts

@ianlancetaylor
Copy link
Contributor

This seems like a lot of complex changes to the Windows support. As far as I can tell, it doesn't fix a regression. I'm not convinced this should go onto the release branch without a lot more testing on different versions of Windows.

@zx2c4
Copy link
Contributor

zx2c4 commented Sep 5, 2019

On Windows 7 it's a no'op.

On Windows 10 it works. On Windows 8.1 it works.

Without this, Go does not work on Windows laptops. That's not very good.

@ianlancetaylor
Copy link
Contributor

Are you saying that Go has never worked on Windows laptops?

@zx2c4
Copy link
Contributor

zx2c4 commented Sep 6, 2019

The code was written in the Windows 7 era and it worked then. But then Microsoft changed things and Go stopped working.

So it's definitely a regression of sorts, the key oddity being that it's a Microsoft-induced regression rather than a Go-induced regression. I'm not sure how this factors into the backporting logic, but I would hope it does in the affirmative.

Otherwise I have to tell consumers of my library, "if you intend to use this on a computer with S3, copy GOROOT into a local folder, patch it with this odd patch, set GOROOT to that local folder, and then compile," which isn't super friendly.

@zx2c4
Copy link
Contributor

zx2c4 commented Sep 6, 2019

CC @alexbrainman

@jmontgomery-jc
Copy link

I agree with @zx2c4 that this should ideally be backported to 1.13. It's a very subtle bug in that you're not going to usually see crashes or errors returned or anything overt like that but it causes programs to behave badly in ways that are very hard to debug and reproduce. I would have never been able to get enough information to file the original issue had I not been lucky enough to encounter it on a VM that I could snapshot and debug over the course of a few days. Because timers are so central to how Go works the only workaround is using a patched build of Go to build your programs.

@alexbrainman
Copy link
Member

Are you saying that Go has never worked on Windows laptops?

What @zx2c4 said.

Go uses WaitForSingleObject Windows API to wait for periods of time. The API works properly on some Windows versions, but does not work on others. From https://docs.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-waitforsingleobject (I made important bits in bold)

Windows XP, Windows Server 2003, Windows Vista, Windows 7, Windows Server 2008 and Windows Server 2008 R2: The dwMilliseconds value does include time spent in low-power states. For example, the timeout does keep counting down while the computer is asleep.

Windows 8, Windows Server 2012, Windows 8.1, Windows Server 2012 R2, Windows 10 and Windows Server 2016: The dwMilliseconds value does not include time spent in low-power states. For example, the timeout does not keep counting down while the computer is asleep.

So when Go programs sleeps on new versions of Windows, it does not account for time laptop is in low-powered state.

We did not know about that until #31528. So it was always broken this way for Windows 10 users. And always worked properly (still works) for Windows 7 users.

But I also agree that the change is not trivial. And the change is just 1 week old. And it only lives on current master. I doubt many users tested it yet.

Alex

@zx2c4
Copy link
Contributor

zx2c4 commented Sep 7, 2019

But I also agree that the change is not trivial. And the change is just 1 week old. And it only lives on current master. I doubt many users tested it yet.

I'm shipping the patch to a somewhat substantial userbase now, which has a pretty good track record of turning up bugs and emailing me about them. If you'd like, we can just leave this GH issue open for a few weeks, and I'll pipe up if folks encounter problems.

@alexbrainman
Copy link
Member

I'm shipping the patch to a somewhat substantial userbase now, which has a pretty good track record of turning up bugs and emailing me about them. If you'd like, we can just leave this GH issue open for a few weeks, and I'll pipe up if folks encounter problems.

Sounds good to me. Thank you.

Alex

@connected-nkozul
Copy link

Chiming in with the same comment from go-review #191957:

Confirmed on two Windows 10 machines (1903 and 1809). Placing the system in an S3 sleep state does indeed interfere with time.Sleep(). The latest provided patchset fixes the issue.

I'm part of a project which is very much reliant on working around this issue manually, which does not make for a sane pipeline. Would really love to see this in a minor release of 1.13.

@networkimprov
Copy link

networkimprov commented Sep 21, 2019

Could we also backport to 1.12, where the issue was first reported? I believe the patch has already been tested thereon.

EDIT: the above patch doesn't apply cleanly to 1.12.5 due to an unrelated change.

@zx2c4
Copy link
Contributor

zx2c4 commented Sep 23, 2019

I'm shipping the patch to a somewhat substantial userbase now, which has a pretty good track record of turning up bugs and emailing me about them. If you'd like, we can just leave this GH issue open for a few weeks, and I'll pipe up if folks encounter problems.

Sounds good to me. Thank you.

Alex

15 days later... No reported problems, and sleep resumption issues seem to have gone away. Seems like we should put this into 1.13.1.

@alexbrainman
Copy link
Member

15 days later... No reported problems, and sleep resumption issues seem to have gone away. Seems like we should put this into 1.13.1.

I am fine with porting this onto 1.13.1, if others agree.

Thank you.

Alex

@zx2c4
Copy link
Contributor

zx2c4 commented Sep 24, 2019

15 days later... No reported problems, and sleep resumption issues seem to have gone away. Seems like we should put this into 1.13.1.

I am fine with porting this onto 1.13.1, if others agree.

Thank you.

Alex

@bradfitz @ianlancetaylor AFAICT, 1.13.1 is due out tomorrow, so we should get moving with this ASAP if it's going to happen.

@networkimprov
Copy link

If it's urgent for 1.13.x why is it not urgent for 1.12.x, which is more widely deployed at this point?

@bradfitz
Copy link
Contributor

AFAICT, 1.13.1 is due out tomorrow, so we should get moving with this ASAP if it's going to happen.

1.13.1 is a security-only release. We don't mix security and non-security point releases.

@bcmills bcmills modified the milestones: Go1.13.1, Go1.13.2 Sep 25, 2019
@toothrot
Copy link
Contributor

toothrot commented Oct 1, 2019

/cc @aclements - We want your feedback on whether this is safe to backport to both 1.13.x or 1.12.x, or if this needs to wait for 1.14.

@aclements
Copy link
Member

Several of us discussed this today and agreed that it should be backported. It's a good sign that it's been stable on master for a while now. I just reviewed the backport CL and just have one concern (which would also apply to the version of this on master).

@gopherbot
Copy link
Author

Change https://golang.org/cl/198417 mentions this issue: runtime: iterate ms via allm linked list to avoid race

@andybons andybons added CherryPickApproved Used during the release process for point releases and removed CherryPickCandidate Used during the release process for point releases labels Oct 2, 2019
@andybons
Copy link
Member

andybons commented Oct 2, 2019

Blocked on golang.org/cl/198417

gopherbot pushed a commit that referenced this issue Oct 3, 2019
It's pointless to reach all ms via allgs, and doing so introduces a
race, since the m member of a g can change underneath it. Instead
iterate directly through the allm linked list.

Updates: #31528
Updates: #34130

Change-Id: I34b88402b44339b0a5b4cd76eafd0ce6e43e2be1
Reviewed-on: https://go-review.googlesource.com/c/go/+/198417
Run-TryBot: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Alex Brainman <alex.brainman@gmail.com>
Reviewed-by: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@gopherbot
Copy link
Author

Change https://golang.org/cl/198539 mentions this issue: runtime: iterate ms via allm linked list to avoid race

@networkimprov
Copy link

networkimprov commented Oct 3, 2019

If I apply these two patches to my /usr/lib/go/src/ will they get linked into windows builds?

@andybons this also needs backport to 1.12, but gopherbot won't open an issue.

@zx2c4
Copy link
Contributor

zx2c4 commented Oct 3, 2019

If I apply these two patches to my /usr/lib/go/src/ will they get linked into windows builds?

Yes.

Alternatively, here's some makefile deviousness from wireguard-windows:

OLD_GOROOT := $(GOROOT)
export GOROOT := $(CURDIR)/.deps/goroot

.deps/prepared: export GOROOT := $(OLD_GOROOT)
.deps/prepared: $(wildcard golang-*.patch)
	rm -rf .deps && mkdir -p .deps
	if ! rsync --exclude=pkg/obj/go-build/trim.txt -aq $$(go env GOROOT)/ .deps/goroot; then chmod -R +w .deps/goroot; exit 1; fi
	chmod -R +w .deps/goroot
	cat $^ | patch -f -N -r- -p1 -d .deps/goroot
	touch $@

amd64/wireguard.exe: .deps/prepared $(SOURCE_FILES)
	go build $(GOFLAGS) -o $@

@gopherbot
Copy link
Author

Change https://golang.org/cl/198540 mentions this issue: [release-branch.go1.13] runtime: monitor for suspend/resume to kick timeouts

gopherbot pushed a commit that referenced this issue Oct 4, 2019
…imeouts

Starting in Windows 8, the wait functions don't take into account
suspend time, even though the monotonic counters do. This results in
timer buckets stalling on resume. Therefore, this commit makes it so
that on resume, we return from the wait functions and recalculate the
amount of time left to wait.

This is a cherry pick of CL 191957 and its cleanup, CL 198417.

Updates #31528
Fixes #34130

Change-Id: I0db02cc72188cb620954e87a0180e0a3c83f4a56
Reviewed-on: https://go-review.googlesource.com/c/go/+/193607
Run-TryBot: Jason A. Donenfeld <Jason@zx2c4.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
@gopherbot
Copy link
Author

Closed by merging 84b070f to release-branch.go1.13.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
CherryPickApproved Used during the release process for point releases FrozenDueToAge
Projects
None yet
Development

No branches or pull requests