Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: time goes backwards on windows-arm64 (frequent TestGcLastTime failures) #48072

Closed
bcmills opened this issue Aug 30, 2021 · 33 comments
Closed
Labels
arch-arm64 FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Windows release-blocker
Milestone

Comments

@bcmills bcmills added OS-Windows arch-arm64 NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Aug 30, 2021
@bcmills
Copy link
Contributor Author

bcmills commented Aug 30, 2021

These failures are new as of Go 1.17 because windows-arm64 support itself is new as of Go 1.17, so marking as release-blocker for the next release: we should not allow builder flakiness to increase from one release to the next.

(If we can determine that this test failure does not demonstrate a significant problem on the new platform, it should be skipped on that platform and the issue can be moved to the backlog.)

CC @golang/release

@aclements
Copy link
Member

I looked through some of the logs and they all show time (as measured by time.Now().UnixNano()) moving backwards.

What's interesting is they all appear to roll over in the last 8 digits. The more significant digits are the same between the two reads. To me this suggests a shear when reading the time.

1630030642099881100 => 1630030642000853600
1629664923099568500 => 1629664923000568000
1629390290099822500 => 1629390290000990900
1629181133099089200 => 1629181133000075200
1628840582099496300 => 1628840582001301100

But I can't figure out how we would get shear at 8 decimal digits specifically. Time itself is split between seconds and nanoseconds, but nanoseconds are the bottom 9 digits. time.now in runtime/time_windows_arm64.s reads the low 32 bits and high 32 bits of the time separately with no memory barriers (this is the same on all Windows architectures), but those are in units of 100ns, so if that sheared, I would expect it to be at 100<<32 = 429496729600 ns.

@rsc wrote the windows/arm64 time.now. Also cc @zx2c4 .

@aclements
Copy link
Member

As an experiment, I did the same comparison in hex. There's no obvious roll-over pattern in hex, so this definitely seems to be related to a decimal representation of time.

0x169f083b**935d848c** => 0x169f083b**8d767a60**
0x169dbb9cf**c439974** => 0x169dbb9cf**65cf8c0**
0x169cc1d**601dedfa4** => 0x169cc1d**5fbfad2b4**
0x169c039bd**923bd30** => 0x169c039bd**33ce7c0**
0x169acde1**23c7ed6c** => 0x169acde1**1ded966c**

@aclements
Copy link
Member

Also cc @bufflig and @alexbrainman for thoughts.

I'm not worried about the test itself, but this seems to indicate a more fundamental issue with reading time on windows/arm64.

@aclements
Copy link
Member

Actually, it's possible this was caused by #48476, which was fixed three days ago.

@aclements
Copy link
Member

Nevermind, #48476 only applied to arm32.

@alexbrainman
Copy link
Member

It could be a bug in OS. Not many people use windows-arm64 OS.

Alex

@heschi
Copy link
Contributor

heschi commented Sep 29, 2021

cc @mknyszek

@aclements
Copy link
Member

Ping @bufflig

@dmitshur
Copy link
Contributor

dmitshur commented Nov 3, 2021

CC @jeremyfaller.

@dmitshur
Copy link
Contributor

dmitshur commented Nov 3, 2021

@jstarks Would you be able to also take a look at this issue? Thanks.

@aclements aclements changed the title runtime: frequent TestGcLastTime failures on windows-arm64 runtime: time goes backwards on windows-arm64 (frequent TestGcLastTime failures) Nov 3, 2021
@jstarks
Copy link

jstarks commented Nov 3, 2021

In runtime·nanotime1, the high-low-high algorithm is used to read the current 64-bit interrupt time as three 32-bit values. This code was probably copied from the AMD64 version. However, ARM64 has a looser memory model than x86/AMD64, so you need some extra memory barriers to make the algorithm work. These are missing (as @aclements points out).

The data points above are consistent with a reordering of the loads, such that the low 32 bits is not correctly matched to the high 32 bits. This can cause the low 32 bits to move backward between measurements, or more generally be wrong when the high bits are changing. The one confusing entry was where it appeared that high value was moving backward too, but that's because aclements's values were multiplied by 100 after they were read.

The fix is not to add memory barriers but to do a single 64-bit read of the time in one instruction. That's what we do in Windows (and I recommend changing the AMD64 code to match, even though there's no race condition due to the stricter memory model).

@bufflig
Copy link
Contributor

bufflig commented Nov 3, 2021

I missed the ping due to messed up mail notification config. I'll pick it up.

Thanks for the analysis!

@bufflig bufflig self-assigned this Nov 3, 2021
@gopherbot
Copy link

Change https://golang.org/cl/361057 mentions this issue: runtime: on windows, read nanotime with one instruction on 64-bit

@zx2c4
Copy link
Contributor

zx2c4 commented Nov 3, 2021

@bufflig I submitted a CL already: https://go-review.googlesource.com/c/go/+/361057

@bufflig
Copy link
Contributor

bufflig commented Nov 3, 2021

Ah, awesome. Thanks!

@zx2c4
Copy link
Contributor

zx2c4 commented Nov 3, 2021

That's not really thoroughly tested, so if you want to run it through the paces on various hardware, feel free.

@bufflig
Copy link
Contributor

bufflig commented Nov 3, 2021

Ack, I will run it on anything I can get my hands on :)

@aclements
Copy link
Member

Dividing by 100 (decimal) and then comparing in hex does indeed reveal a much clearer pattern:

0x39e90ad5f4229b => 0x39e90ad5e50658
0x39e5b7547150e5 => 0x39e5b7546235b0
0x39e337e66b3051 => 0x39e337e65c1bb5
0x39e150eb21a32c => 0x39e150eb128770
0x39de3802eaf593 => 0x39de3802dbf9d3

What I'm still missing is that roll-over seems to be happening at 24 bits. If the two 32-bit reads were inconsistent, shouldn't we see the roll-over at 32 bits?

(Using a single 64-bit load is a clear improvement and may well fix this bug, I'm just not quite seeing how all the dots connect yet. I blame any slowness on my part on baby-induced sleep deprivation. :)

@zx2c4
Copy link
Contributor

zx2c4 commented Nov 3, 2021

If the two 32-bit reads were inconsistent, shouldn't we see the roll-over at 32 bits?

Is this just a matter of the actual timing of the race and the precision of the clock? That is, maybe those top 8 bits just didn't change between reads, because that takes longer?

gopherbot pushed a commit that referenced this issue Nov 3, 2021
On 64-bit, this is more efficient, and on ARM64, this prevents the time
from moving backwards due to the weaker memory model. On ARM32 due to
the weaker memory model, we issue a memory barrier.

Updates #48072.

Change-Id: If4695716c3039d8af14e14808af217f5c99fc93a
Reviewed-on: https://go-review.googlesource.com/c/go/+/361057
Trust: Jason A. Donenfeld <Jason@zx2c4.com>
Run-TryBot: Jason A. Donenfeld <Jason@zx2c4.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
@bufflig
Copy link
Contributor

bufflig commented Nov 4, 2021

I believe that this did not actually fix the problem. On arm64 time still moves backwards. Looking at the assembler, it appears to me that it's rather a typo - it calculates (x / 1000000000, x % 100000000) instead of (x / 1000000000, x % 1000000000) (one zero to few in the x % expression). That I think fits perfectly with the symptoms as well.

Or am I reading it completely wrong?

@jstarks
Copy link

jstarks commented Nov 4, 2021

Haha! You're right. The comment has the right constant but the code has the wrong one.

MOVD $100000000, R2

The memory ordering fix is still necessary for correctness but I guess it's not actually hitting in practice.

@zx2c4
Copy link
Contributor

zx2c4 commented Nov 4, 2021

Your reading of the assembly looks correct. And actually, this matches what's in the comment too:

        // Code stolen from compiler output for:
        //
        //      var x uint64
        //      func f() (sec uint64, nsec uint32) { return x / 1000000000, uint32(x % 100000000) }
        //

One less zero in that second expression. But actually, that code is all over the place:

  • sys_plan9_amd64.s
  • time_windows_amd64.s
  • time_windows_arm64.s

Seems like a good catch?

@zx2c4
Copy link
Contributor

zx2c4 commented Nov 4, 2021

Haha! You're right. The comment has the right constant but the code has the wrong one.

Wait, they both look wrong to me...

@jstarks
Copy link

jstarks commented Nov 4, 2021

The comment has 9 zeroes, the constant has 8.

@zx2c4
Copy link
Contributor

zx2c4 commented Nov 4, 2021

For the modulo?

image

@zx2c4
Copy link
Contributor

zx2c4 commented Nov 4, 2021

Looks like on the remaining two platforms, the constant is fine and doesn't match the comment. But on ARM64 the constant and comment are both wrong.

Sending a patch.

@ianlancetaylor
Copy link
Contributor

Gah, thanks for spotting that. Perhaps the assembler should support the 1e9 syntax we can use in Go.

@gopherbot
Copy link

Change https://golang.org/cl/361474 mentions this issue: runtime: use correct constant when computing nsec remainder

@zx2c4
Copy link
Contributor

zx2c4 commented Nov 4, 2021

@gopherbot please backport this to 1.17

@gopherbot
Copy link

Backport issue(s) opened: #49369 (for 1.17).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

@gopherbot
Copy link

Change https://golang.org/cl/361476 mentions this issue: [release-branch.go1.17] runtime: use correct constant when computing nsec remainder

@gopherbot
Copy link

Change https://golang.org/cl/361475 mentions this issue: [release-branch.go1.17] runtime: on windows, read nanotime with one instruction or issue barrier

gopherbot pushed a commit that referenced this issue Nov 24, 2021
…nstruction or issue barrier

On 64-bit, this is more efficient, and on ARM64, this prevents the time
from moving backwards due to the weaker memory model. On ARM32 due to
the weaker memory model, we issue a memory barrier.

Updates #48072.
Updates #49369.

Change-Id: If4695716c3039d8af14e14808af217f5c99fc93a
Reviewed-on: https://go-review.googlesource.com/c/go/+/361057
Trust: Jason A. Donenfeld <Jason@zx2c4.com>
Run-TryBot: Jason A. Donenfeld <Jason@zx2c4.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-on: https://go-review.googlesource.com/c/go/+/361475
Reviewed-by: Patrik Nyblom <pnyb@google.com>
gopherbot pushed a commit that referenced this issue Nov 24, 2021
…nsec remainder

A code comment on amd64 for windows and plan9 contained a snippet for
splitting apart the sec and nsec components of a unix timestamp, with
produced assembly below, which was then cleaned up by hand. When arm64
was ported, that code snippet in the comment went through the compiler
to produce some code that was then pasted and cleaned up. Unfortunately,
the comment had a typo in it, containing 8 zeros instead of 9.

This resulted in the constant used in the assembly being wrong, spotted
by @bufflig's eagle eyes. So, this commit fixes the comment on all three
platforms, and the assembly on windows/arm64.

Updates #48072.
Fixes #49369.

Change-Id: I786fe89147328b0d25544f52c927ddfdb9f6f1cf
Reviewed-on: https://go-review.googlesource.com/c/go/+/361474
Trust: Jason A. Donenfeld <Jason@zx2c4.com>
Run-TryBot: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Patrik Nyblom <pnyb@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-on: https://go-review.googlesource.com/c/go/+/361476
@rsc rsc unassigned bufflig Jun 23, 2022
@golang golang locked and limited conversation to collaborators Jun 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-arm64 FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Windows release-blocker
Projects
None yet
Development

No branches or pull requests

10 participants