Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: timediv should use |= instead of += since we are adding bitsets to an originally cleared value #27529

Closed
odeke-em opened this issue Sep 6, 2018 · 13 comments
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. Performance
Milestone

Comments

@odeke-em
Copy link
Member

odeke-em commented Sep 6, 2018

Right now just before bed, I have been studying runtime/os_darwin.go in relation to https://go-review.googlesource.com/c/go/+/133655 to try to write a regression/end-to-end test after that CL is merged.

I noticed that in the code for timediv

go/src/runtime/runtime1.go

Lines 414 to 421 in 22afb35

func timediv(v int64, div int32, rem *int32) int32 {
res := int32(0)
for bit := 30; bit >= 0; bit-- {
if v >= int64(div)<<uint(bit) {
v = v - (int64(div) << uint(bit))
res += 1 << uint(bit)
}
}

we try to reduce the original timeunit by at most ((1<<31)-1) = 0x7FFFFFFF if (v >= conversionUnit<<bits)
where:

  • bits is in the reverse domain of [0, 30]
  • conversionUnit is "div" aka the number of units converting from the original to the target unit e..g for ns to ms conversion, div = 1e6, ns to us, div = 1e3

In particular notice that we are doing a += for values that are essentially bit sets for a value that started at 0

res += 1 << uint(bit)

i.e. we are only doing power of 2 increments/toggling bits.

We could replace that by |= and that code should be faster by a couple of nanoseconds and to justify the change:

  • That code is used is converting from nanotime() ns so at most ((1<<31)-1) * 1e9 ns (since max epoch second = (1<<31)-1) which by any measure will mean that every single invocation of timediv
    with a system nanotime that began later than September 8th 2001 ~7PM, will run through that loop at least once -- from ns to at best s, with smaller units it is even more guaranteed to run since (1e9<<30) (div=1e9, ns to s)

I believe that we could shave off a couple of nanoseconds and take advantage of registers.
My simple & dirty benchmark at https://play.golang.org/p/tWy9l5oH0tG when run shows an 18% decrement in time

name  old time/op  new time/op  delta
-8    37.7ns ± 2%  30.7ns ± 4%  -18.37%  (p=0.008 n=5+5)

timediv is in a very hot path throughout Go and the ecosystem; when we perform any semaphore operations e.g. while locking and unlocking a mutex on many OSes, of which Go's bread and butter for dealing with concurrent accesses are mutex operations and those are also used all around the standard library.

I am filing this issue because the comment for that function says that is is a very special function

go/src/runtime/runtime1.go

Lines 410 to 411 in 22afb35

// This is a very special function, do not use it if you are not sure what you are doing.
// int64 division is lowered into _divv() call on 386, which does not fit into nosplit functions.

and I also wonder if perhaps there was an intention of using +=?

Please feel free to correct me or provide some ideas, my apologies for any mistakes #~4AMChecks. Thank you!

@ianlancetaylor
Copy link
Contributor

Looks like |= is faster on amd64 because it replaces mov; shl; add with bts. On 386 we don't use bts, so both versions are about the same speed. I can't think of any reason why |= would be slower than += on any plausible hardware, so the change seems reasonable to me. Thanks.

@bcmills bcmills added Performance NeedsFix The path to resolution is known, but the work has not been done. labels Sep 6, 2018
@bcmills bcmills added this to the Unplanned milestone Sep 6, 2018
@rasky
Copy link
Member

rasky commented Sep 6, 2018

If timediv is just a 64/32=32 division, it should probably be converted to use math/bits.Div32 once CL123157 goes in. That would speed it up a lot more than just using |= instead of +=, on both 32-bit and 64-bit platforms.

@odeke-em
Copy link
Member Author

If timediv is just a 64/32=32 division, it should probably be converted to use math/bits.Div32 once CL123157 goes in. That would speed it up a lot more than just using |= instead of +=, on both 32-bit and 64-bit platforms.

Thanks for the chime in @rasky! Interesting and cool that bits.Div* is being added. Perhaps I am misunderstanding how the new API will be used but I think it'll involve a bit of retrofitting given a few restrictions in the doc comments. I just skimmed real quick over it and thought to plugin and play one of the sanity checks in the runtime

if timediv(12345*1000000000+54321, 1000000000, &e) != 12345 || e != 54321 {

but the results are unexpected

package main_test

import (
	"math/bits"
	"testing"
)

func TestDiv(t *testing.T) {
	v := int64(12345*1000000000+54321)
	div := int32(1000000000)

	hi, lo := uint(v>>32), uint(v&0xFFFFFFFF)
	quo, rem := bits.Div(hi, lo, uint(div))
	if g, w := quo, uint(12345); g != w {
		t.Errorf("Quotient:\n\tgot  %d (0x%X)\n\twant %d (0x%X)", g, g, w, w)
	}
	if g, w := rem, uint(54321); g != w {
		t.Errorf("Remainder :\n\tgot  %d (0x%X)\n\twant %d (0x%X)", g, g, w, w)
	}
}

with results

$ go test -run=TestDiv
--- FAIL: TestDiv (0.00s)
    div_test.go:15: Quotient:
        	got  53015942467842 (0x3037BC6B1102)
        	want 12345 (0x3039)
    div_test.go:18: Remainder :
        	got  515390001 (0x1EB83A31)
        	want 54321 (0xD431)
FAIL
exit status 1
FAIL	_/Users/emmanuelodeke/Desktop/openSrc/bugs/golang/runtime-timediv	0.005s

With my point being that: perhaps let's file a separate bug as a reminder to retrofit "math/bits".Div into the runtime once https://golang.org/cl/123157 is landed as you suggested. For now I think we can easily address using |= :)

@gopherbot
Copy link

Change https://golang.org/cl/134231 mentions this issue: runtime: convert initial timediv quotient increments to bitsets

@cherrymui
Copy link
Member

We should not use timediv on 64-bit machines. We should just use the division operator, as os_linux.go does.

@rasky
Copy link
Member

rasky commented Sep 10, 2018

@odeke-em which arch are you testing that on? Try using Div32 instead of Div to make it behave the same on both 32-bit and 64-bit archs (your code would fail on 64 bit because of the right shift sign extending the 64 bit input).

@cherrymui using bits.Div is a portable way of doing a 64 bit division without two different code paths

@odeke-em
Copy link
Member Author

@rasky I was on AMD64, Darwin. Thank you for the tip of switching to Div32, that runs correctly and is magnitudes faster than all the others in deed

$ go test -v -run=^$ -bench=As
goos: darwin
goarch: amd64
BenchmarkAsPlus-8   	50000000	        27.8 ns/op
BenchmarkAsOr-8     	50000000	        26.0 ns/op
BenchmarkAsDiv-8    	2000000000	         0.39 ns/op
PASS
ok  	_/Users/emmanuelodeke/Desktop/openSrc/bugs/golang/runtime-timediv	3.568s

@cherrymui
Copy link
Member

I'm not sure we want the runtime package to import math/bits, although it is probably a possibility.

And, with the current implementation in CL 123157, it probably doesn't work. math/bits.Div32 is implemented simply with 64-bit division. On 32-bit systems, this lowers to runtime.uint64div. As the comment in timediv says, this may not fit into nosplit stack limit (maybe it does now, I don't know, at least it didn't when timediv was introduced). And not all 32-bit machines have 64-bit-divided-by-32-bit intrinsic. Of course we could probably have a different implementation of Div32 that uses very small stack space.

@randall77
Copy link
Contributor

@odeke-em : That last benchmark result looks like the entire operation was optimized away. It's unlikely that the chip is doing a divide in less than a nanosecond.
(Or maybe it's just really pipelined?)

@cherrymui
Copy link
Member

If 64-bit division now fits into nosplit stack limit on all machines, I suggest changing back to using the division operator.

@odeke-em
Copy link
Member Author

@odeke-em : That last benchmark result looks like the entire operation was optimized away.

@randall77 aha, yeah it looked too good to be true, but alas seems to actually be running alright

package main

import (
        "math/bits"
        "testing"
)

func BenchmarkAsDiv(b *testing.B) {
        v := int64(12345*1000000000 + 54321)
        div := int32(1000000000)

        var quo, rem uint32

        for i := 0; i < b.N; i++ {
                hi, lo := uint32(v>>32), uint32(v&0xFFFFFFFF)
                quo, rem = 0, 0
                quo, rem = bits.Div32(hi, lo, uint32(div))
        }

        if g, w := quo, uint32(12345); g != w {
                b.Fatalf("Got %d\nWant %d", g, w)
        }
        if g, w := rem, uint32(54321); g != w {
                b.Fatalf("Got %d\nWant %d", g, w)
        }

}

still giving

$ go test -run=$^ -bench=As
goos: darwin
goarch: amd64
BenchmarkAsPlus-8   	50000000	        24.5 ns/op
BenchmarkAsOr-8     	100000000	        22.8 ns/op
BenchmarkAsDiv-8    	2000000000	         0.33 ns/op
PASS
ok  	_/Users/emmanuelodeke/Desktop/openSrc/bugs/golang/runtime-timediv	4.252s
$ uname -a
Darwin Emmanuels-MacBook-Pro-2.local 17.7.0 Darwin Kernel Version 17.7.0: Thu Jun 21 22:53:14 PDT 2018; root:xnu-4570.71.2~1/RELEASE_X86_64 x86_64

@rasky
Copy link
Member

rasky commented Sep 10, 2018

I'm not sure we want the runtime package to import math/bits, although it is probably a possibility.

We'll have to figure this out, as there are many other points where runtime could benefit from math/bits operations; there are for instances some multiplication overflow checks that could be simplified and made faster using bits.Mul.

And, with the current implementation in CL 123157, it probably doesn't work.

Right, I was implying that an intrinsification CL would eventually land (like CL129415, but for 32-bit archs).

And not all 32-bit machines have 64-bit-divided-by-32-bit intrinsic.

Oh that's right, I guess ARM doesn't.

@odeke-em
Copy link
Member Author

@rasky would you mind perhaps filing a reminder issue to examine what we can do around the standard library with math/bits.Div* when the CL lands? I ask because that conversation dominated this issue(which is now closed), but requires more thoughts and investigation, but also the conversations are very valuable and would be nice to keep it going on a fresh issue. Thank you!

@golang golang locked and limited conversation to collaborators Sep 12, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. Performance
Projects
None yet
Development

No branches or pull requests

7 participants