Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: BLAS DdotSmall and Idamax regressions #14917

Closed
btracey opened this issue Mar 22, 2016 · 10 comments
Closed

cmd/compile: BLAS DdotSmall and Idamax regressions #14917

btracey opened this issue Mar 22, 2016 · 10 comments
Milestone

Comments

@btracey
Copy link
Contributor

btracey commented Mar 22, 2016

Idamax and Ddot small are showing regressions between 1.6 and gotip.

brendan:~/Documents/mygo/src/github.com/gonum/blas/native$ go version
go version devel +7177cb9 Tue Mar 22 17:30:30 2016 +0000 darwin/amd64

Reproducing:

go get github.com/gonum/blas
go get github.com/gonum/floats
cd $GOPATH/src/github.com/gonum/blas/native
go test -bench DdotSmall -tags=noasm
go test -bench Ida -tags=noasm

While the Ddot regressions look small, there had been improvements as of the resolution of #14511 , so it's a significant regression since then. Ddot shows improvement between 1.6 and tip for the benchmarks with larger vector sizes (slices of size 100+ instead of 10).

DdotSmallBothUnitary-8    17.9ns ± 2%  18.9ns ± 1%   +6.04%  (p=0.000 n=10+10)
DdotSmallIncUni-8         22.1ns ± 1%  23.1ns ± 1%   +4.87%   (p=0.000 n=10+9)
DdotSmallUniInc-8         21.6ns ± 1%  21.6ns ± 1%     ~      (p=0.455 n=10+9)
DdotSmallBothInc-8        21.6ns ± 1%  22.1ns ± 1%   +2.73%    (p=0.000 n=8+8)
IdamaxSmallUnitaryInc-8   31.6ns ±10%  36.5ns ± 4%  +15.46%   (p=0.000 n=10+8)
IdamaxSmallPosInc-8       27.5ns ±19%  44.0ns ±11%  +59.75%  (p=0.000 n=10+10)
IdamaxMediumUnitaryInc-8  1.59µs ± 4%  1.92µs ± 5%  +20.81%  (p=0.000 n=10+10)
IdamaxMediumPosInc-8      1.83µs ± 1%  2.17µs ± 2%  +18.58%   (p=0.000 n=9+10)
IdamaxLargeUnitaryInc-8    145µs ± 2%   192µs ± 2%  +32.64%  (p=0.000 n=10+10)
IdamaxLargePosInc-8        195µs ± 2%   218µs ± 2%  +12.06%  (p=0.000 n=10+10)
IdamaxHugeUnitaryInc-8    14.8ms ± 1%  19.8ms ± 1%  +33.49%  (p=0.000 n=10+10)
IdamaxHugePosInc-8        26.9ms ± 3%  27.9ms ± 1%   +3.62%   (p=0.000 n=10+9)

/cc @randall77 @tzneal @dr2chase @josharian @brtzsnr

@ianlancetaylor ianlancetaylor added this to the Go1.7 milestone Mar 22, 2016
@OneOfOne
Copy link
Contributor

I wonder if this is related to #14920, try to run your benchmark against d37d3bd (the last good commit for me).

@btracey
Copy link
Contributor Author

btracey commented Mar 22, 2016

#14920 is the same issue for Ddot. That commit fixes the problem.

The Idamax cases are a different issue, there is still significant regression even at d37d3bd

@randall77
Copy link
Contributor

I'm pretty sure this has nothing to do with #14920. At least, the fix for that issue certainly doesn't apply here.
It looks like for ddot, anyway, the additional overhead (compared to c03ed49, about 20 days ago and just after the #14511 fix) looks like an unnecessary nil check and the associated load:

before:

    0x0014 00020 (ddot.go:5)    LEAQ    8(DX), BP
    0x0018 00024 (ddot.go:5)    INCQ    BX
    0x001b 00027 (ddot.go:5)    MOVQ    (DX), DX
    0x001e 00030 (ddot.go:6)    ADDQ    DX, CX
    0x0021 00033 (ddot.go:5)    MOVQ    BP, DX
    0x0024 00036 (ddot.go:5)    CMPQ    BX, AX
    0x0027 00039 (ddot.go:5)    JLT $0, 20

after:

    0x0014 00020 (ddot.go:5)    MOVQ    "".a+8(FP), BP
    0x0019 00025 (ddot.go:5)    TESTB   AL, (BP)
    0x001c 00028 (ddot.go:5)    MOVQ    (AX), SI
    0x001f 00031 (ddot.go:5)    ADDQ    $8, AX
    0x0023 00035 (ddot.go:5)    INCQ    BX
    0x0026 00038 (ddot.go:6)    ADDQ    DX, SI
    0x0029 00041 (ddot.go:6)    MOVQ    SI, DX
    0x002c 00044 (ddot.go:5)    CMPQ    BX, CX
    0x002f 00047 (ddot.go:5)    JLT $0, 20

The nil check in the "before" case can be removed because we have a nil check of &a[i] followed by a load from that pointer. The nil check is effectively merged into the load so you don't see it in the generated assembly.

The nil check in the "after" case isn't being removed because it is checking the original slice ptr &a[0] each time. The load is still from &a[i] so the merging doesn't happen and the nil check must be explicit.

In either case, it is kind of dumb to do nil checks on the pointer loaded from a slice. If the bounds check passes, the pointer is guaranteed to be non-nil. I'll see if I can whip up a fix. It might also be instructive to figure out what changed in the last 20 days about nil checks, I'm not sure what caused it.

I don't see any difference in the inner loop between d37d3bd and tip, so I'm not sure why @btracey you saw performance improve at d37d3bd.

@randall77
Copy link
Contributor

Looks like the culprit is https://go-review.googlesource.com/20307
@brtzsnr
It rewrites the nil check to be on &a[0] instead of &a[i]. It doesn't seem to be helping any in this case.
(Lifting the nil check out of the loop altogether is an admirable goal, and this rewrite is one step on the way. But without lifting it out of the loop, the rewrite is just hurting us.)

@brtzsnr
Copy link
Contributor

brtzsnr commented Mar 23, 2016

I'll send a CL to remove that rewrite and I'll figure it out how to remove the NilChecks completely. Now that we have use counts should be easier.

@gopherbot
Copy link

CL https://golang.org/cl/21040 mentions this issue.

@btracey
Copy link
Contributor Author

btracey commented Mar 28, 2016

That change may have fixed Ddot, but it does not fix the Idamax problem. Comparing go1.6 with go version devel +7e88826 Mon Mar 28 14:10:21 2016 +0000 darwin/amd64

IdamaxSmallUnitaryInc-8   32.4ns ± 7%  39.1ns ± 6%  +20.80%  (p=0.008 n=5+5)
IdamaxSmallPosInc-8       28.7ns ±11%  41.1ns ±11%  +43.24%  (p=0.008 n=5+5)
IdamaxMediumUnitaryInc-8  1.58µs ± 2%  2.03µs ± 2%  +27.83%  (p=0.008 n=5+5)
IdamaxMediumPosInc-8      1.86µs ± 2%  2.38µs ±11%  +27.93%  (p=0.008 n=5+5)
IdamaxLargeUnitaryInc-8    150µs ± 2%   195µs ± 1%  +30.23%  (p=0.008 n=5+5)
IdamaxLargePosInc-8        202µs ± 1%   241µs ± 2%  +19.10%  (p=0.008 n=5+5)
IdamaxHugeUnitaryInc-8    15.1ms ± 1%  21.0ms ± 2%  +39.67%  (p=0.008 n=5+5)
IdamaxHugePosInc-8        27.9ms ± 3%  30.0ms ± 2%   +7.38%  (p=0.008 n=5+5)

Should I open a different issue?

@randall77
Copy link
Contributor

Sorry about closing this prematurely. But yes, please open a different issue. There are two different underlying problems so it would be better to have them in different issues.

@btracey
Copy link
Contributor Author

btracey commented Mar 28, 2016

No problem. Do you have a preferred title?

@randall77
Copy link
Contributor

cmd/compile: BLAS Idamax regression

@golang golang locked and limited conversation to collaborators Mar 28, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

6 participants