cmd/compile: suboptimal code generated for a[b]++; a[b]++ #14564

Closed
brtzsnr opened this issue Feb 29, 2016 · 10 comments

Comments

@brtzsnr
Contributor

brtzsnr commented Feb 29, 2016

Please answer these questions before submitting your issue. Thanks!

  1. What version of Go are you using (go version)?
    dev.ssa branch
    go version devel +8107b00 Sun Feb 28 22:29:23 2016 +0000 linux/amd64
  2. What operating system and processor architecture are you using (go env)?
    amd64
  3. What did you do?
    http://play.golang.org/p/kEmOTtpIGU
func g_ssa(a []int, b int) {
        a[b]++
        a[b]++
}
  4. What did you expect to see?

The code is optimized as if it were a[b] += 2.
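
That is, a single load, a single add of 2, and a single store, as if the function had been written:

func g_ssa(a []int, b int) {
        a[b] += 2
}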

  5. What did you see instead?

The code generated was:
Read from mem
Inc
Write to mem
Read from mem
Inc
Write to mem

The write to and read back from memory in the middle are redundant.

% GOSSAFUNC=g_ssa go build a.go

g_ssa <nil>
  b1:
    v1 = InitMem <mem>
    v27 = Arg <*uint8> {a} [0] : a[*uint8]
    v7 = Arg <int> {b} [0] : b[int]
    v26 = Arg <int> {a} [8] : a+8[int]
    v19 = LoadReg <int> v7 : CX
    v10 = LoadReg <int> v26 : AX
    v43 = CMPQ <flags> v19 v10
    ULT v43 -> b7 b3 (likely)
  b7: <- b1
    v13 = LoadReg <*uint8> v27 : DX
    v15 = MOVQloadidx8 <int> [0] v13 v19 v1 : AX
    v17 = ADDQconst <int> [1] v15 : AX
    v25 = MOVQstoreidx8 <mem> [0] v13 v19 v17 v1
    v35 = MOVQloadidx8 <int> [0] v13 v19 v25 : AX
    v37 = ADDQconst <int> [1] v35 : AX
    v45 = MOVQstoreidx8 <mem> [0] v13 v19 v37 v25
    Ret v45
  b3: <- b1
    v11 = CALLstatic <mem> {runtime.panicindex} [0] v1
    Exit v11
name b[int]: [v7]
@randall77
Contributor

@tzneal
It would be nice if this were captured by Todd's new rule:
(Load p1 (Store [w] p2 x _)) && isSamePtr(p1,p2) && t1.Compare(x.Type)==CMPeq && w == t1.Size() -> x
Unfortunately, when it runs at generic rewrite time, it doesn't realize that the pointers are equal, because they are built from indexed address math that hasn't been cse'd yet.
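
For reference, a rough sketch of the kind of structural check isSamePtr performs (the actual helper in cmd/compile/internal/ssa/rewrite.go may cover a different set of cases). Before cse, the two indexed addresses are distinct Values, so a check like this does not see them as equal:

// isSamePtr reports whether p1 and p2 are provably the same pointer.
// Sketch only.
func isSamePtr(p1, p2 *Value) bool {
        if p1 == p2 {
                return true // literally the same SSA value
        }
        if p1.Op != p2.Op {
                return false
        }
        switch p1.Op {
        case OpOffPtr:
                // Equal constant offsets from provably-equal base pointers.
                return p1.AuxInt == p2.AuxInt && isSamePtr(p1.Args[0], p2.Args[0])
        case OpAddr:
                // Addresses of the same symbol off the same base.
                return p1.Aux == p2.Aux && p1.Args[0].Op == p2.Args[0].Op
        }
        return false
}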

@tzneal
Member

tzneal commented Mar 1, 2016

I'll have a look at it. Maybe just a more complex isSamePtr that uses cmpVal to compare pointers?

@randall77
Contributor

That would work. I get the feeling that it is just a patch over a deeper phase-ordering problem, though. Another cse would do it, but that is expensive.

@tzneal
Member

tzneal commented Mar 2, 2016

Adding another cse/zcse pair costs 3-6% in compile time (consistency checks disabled). Adding a second opt pass after cse is negligible in cost and resolves the issue:

        {name: "generic cse", fn: cse},
+       {name: "opt second pass", fn: opt},
        {name: "phiopt", fn: phi opt},

name       old time/op     new time/op     delta
Template       293ms ± 2%      295ms ± 2%    ~             (p=0.421 n=5+5)
GoTypes        1.09s ± 1%      1.10s ± 2%    ~             (p=0.841 n=5+5)
Compiler       4.79s ± 1%      4.83s ± 1%    ~             (p=0.151 n=5+5)
MakeBash       35.8s ± 2%      36.3s ± 1%    ~             (p=0.222 n=5+5)

# file difference
/home/todd/Projects/gn/bin/go 11942904
/home/todd/Projects/go/bin/go 11942600 [-304 bytes]

# section differences
read-only data = -25 bytes (-0.001476%)
global text (code) = -240 bytes (-0.006249%)
Total difference -265 bytes (-0.004533%)
g_ssa <nil>
  b1:
    v1 = InitMem <mem>
    v27 = Arg <*uint8> {a} [0] : a[*uint8]
    v7 = Arg <int> {b} [0] : b[int]
    v26 = Arg <int> {a} [8] : a+8[int]
    v16 = LoadReg <int> v7 : CX
    v46 = LoadReg <int> v26 : AX
    v45 = CMPQ <flags> v16 v46
    ULT v45 -> b7 b3 (likely)
  b7: <- b1
    v50 = LoadReg <*uint8> v27 : DX
    v15 = MOVQloadidx8 <int> [0] v50 v16 v1 : AX
    v36 = ADDQconst <int> [2] v15 : AX
    v44 = MOVQstoreidx8 <mem> [0] v50 v16 v36 v1
    Ret v44
  b3: <- b1
    v11 = CALLstatic <mem> {runtime.panicindex} [0] v1
    Exit v11

There's a long-standing TODO about splitting the required and optimizing rules. What are your thoughts on doing that split and adding the second optimization rule pass after cse?
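
(To make the question concrete, one hypothetical shape for such a split; the pass and function names below are illustrative, not existing code:)

var passes = [...]pass{
        // ...
        {name: "lower", fn: lowerRequired},       // must eliminate every generic op
        {name: "lowered opt", fn: lowerOptional}, // optional machine-specific cleanups
        // ...
}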

@brtzsnr
Contributor Author

brtzsnr commented Mar 2, 2016

I was also thinking of adding a second, optional opt pass. I saw a measurable decrease in the code size of the pkg/tool/linux_amd64 binaries.

I don't see why we should split opt. It's generally fast enough.

@tzneal
Member

tzneal commented Mar 2, 2016

I think the intent of the split is primarily to support the -N option, so that the generated code is much closer to the input Go code and easier to debug with gdb.

@randall77
Contributor

Yes, the split I proposed was intended to separate the required from the optional rewrites. I was more interested in doing this for the lowering pass, but we could do it for the machine-independent one as well.

One benefit of separating the passes is that the required rewrites, for lowering at least, don't need to be run in a loop. You can do just one pass over the CFG.
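
A sketch of the difference (the loop shape is illustrative; the real driver is the rewrite loop in cmd/compile/internal/ssa/rewrite.go):

// Today: apply rules until a fixed point is reached, since one
// rewrite can expose another.
for {
        change := false
        for _, b := range f.Blocks {
                for _, v := range b.Values {
                        if rewriteValue(v) {
                                change = true
                        }
                }
        }
        if !change {
                break
        }
}

// If no required lowering rule produced a value that itself still
// needed lowering, a single walk over the CFG would suffice:
for _, b := range f.Blocks {
        for _, v := range b.Values {
                rewriteValue(v)
        }
}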

@josharian josharian changed the title from dev.ssa: Suboptimal code generated for a[b]++; a[b]++ to cmd/compile: suboptimal code generated for a[b]++; a[b]++ on Mar 2, 2016
@tzneal
Member

tzneal commented Mar 3, 2016

I don't think a single pass of the required lower rules would work because of rules like these that don't fully lower:

(Move [size] dst src mem) && size > 16 && size%16 != 0 && size%16 <= 8 ->
    (Move [size-size%16] (ADDQconst <dst.Type> dst [size%16]) (ADDQconst <src.Type> src [size%16])
        (MOVQstore dst (MOVQload src mem) mem))
(Move [size] dst src mem) && size > 16 && size%16 != 0 && size%16 > 8 ->
    (Move [size-size%16] (ADDQconst <dst.Type> dst [size%16]) (ADDQconst <src.Type> src [size%16])
        (MOVOstore dst (MOVOload src mem) mem))
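
For example, tracing the first rule by hand with size = 24: size > 16 and size%16 = 8, so the rule rewrites to

(Move [16] (ADDQconst <dst.Type> dst [8]) (ADDQconst <src.Type> src [8])
    (MOVQstore dst (MOVQload src mem) mem))

and the residual Move [16] still has to be matched again before lowering is complete, so a single application of the required rules is not enough.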

I spent some time splitting the rules, but given that opt is so fast anyway, and given the extra complexity it adds to writing rules (i.e. an "optional" rule can no longer rewrite to a form that is an input to a "required" rule), I'm not sure it's worth it.

I'll throw up a CL with just the additional pass for now.

@gopherbot

CL https://golang.org/cl/20172 mentions this issue.

gopherbot pushed a commit that referenced this issue Mar 9, 2016
* Move lowering into a separate pass.
* SliceLen/SliceCap are now available to various intermediate passes,
  which is useful for bounds checking.
* Add a second opt pass to handle the new opportunities.

Decreases the code size of binaries in pkg/tool/linux_amd64
by ~45K.

Updates #14564 #14606

Change-Id: I5b2bd6202181c50623a3585fbf15c0d6db6d4685
Reviewed-on: https://go-review.googlesource.com/20172
Run-TryBot: Alexandru Moșoi <alexandru@mosoi.ro>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
@brtzsnr
Contributor Author

brtzsnr commented Mar 9, 2016

Fixed. The ssa code looks like the following:

g_ssa <nil>
  b1:
    v1 = InitMem <mem>
    v7 = Arg <uint> {b} [0] : b[uint]
    v26 = Arg <int> {a} [8] : a+8[int]
    v27 = Arg <*uint8> {a} [0] : a[*uint8]
    v16 = LoadReg <uint> v7 : CX
    v46 = LoadReg <int> v26 : AX
    v45 = CMPQ <flags> v16 v46
    ULT v45 -> b7 b3 (likely)
  b7: <- b1
    v50 = LoadReg <*uint8> v27 : DX
    v15 = MOVQloadidx8 <uint> [0] v50 v16 v1 : AX
    v36 = ADDQconst <uint> [2] v15 : AX
    v44 = MOVQstoreidx8 <mem> [0] v50 v16 v36 v1
    Ret v44
  b3: <- b1
    v11 = CALLstatic <mem> {runtime.panicindex} [0] v1
    Exit v11

which is optimal.

@brtzsnr brtzsnr closed this as completed Mar 9, 2016
@golang golang locked and limited conversation to collaborators Mar 13, 2017