cmd/compile: suboptimal code generated for a[b]++; a[b]++ #14564

Closed
brtzsnr opened this issue Feb 29, 2016 · 10 comments

Comments

@brtzsnr
Contributor

brtzsnr commented Feb 29, 2016

Please answer these questions before submitting your issue. Thanks!

  1. What version of Go are you using (go version)?
    dev.ssa branch
    go version devel +8107b00 Sun Feb 28 22:29:23 2016 +0000 linux/amd64
  2. What operating system and processor architecture are you using (go env)?
    amd64
  3. What did you do?
    http://play.golang.org/p/kEmOTtpIGU
func g_ssa(a []int, b int) {
        a[b]++
        a[b]++
}
  4. What did you expect to see?

The code is optimized as if it were a[b] += 2.
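
That is, a single load, a single add of 2, and a single store, as if the function had been written:

func g_ssa(a []int, b int) {
        a[b] += 2
}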

  5. What did you see instead?

The code generated was:
Read from mem
Inc
Write to mem
Read from mem
Inc
Write to mem

The write to and read back from memory in the middle are redundant.

% GOSSAFUNC=g_ssa go build a.go

g_ssa <nil>
  b1:
    v1 = InitMem <mem>
    v27 = Arg <*uint8> {a} [0] : a[*uint8]
    v7 = Arg <int> {b} [0] : b[int]
    v26 = Arg <int> {a} [8] : a+8[int]
    v19 = LoadReg <int> v7 : CX
    v10 = LoadReg <int> v26 : AX
    v43 = CMPQ <flags> v19 v10
    ULT v43 -> b7 b3 (likely)
  b7: <- b1
    v13 = LoadReg <*uint8> v27 : DX
    v15 = MOVQloadidx8 <int> [0] v13 v19 v1 : AX
    v17 = ADDQconst <int> [1] v15 : AX
    v25 = MOVQstoreidx8 <mem> [0] v13 v19 v17 v1
    v35 = MOVQloadidx8 <int> [0] v13 v19 v25 : AX
    v37 = ADDQconst <int> [1] v35 : AX
    v45 = MOVQstoreidx8 <mem> [0] v13 v19 v37 v25
    Ret v45
  b3: <- b1
    v11 = CALLstatic <mem> {runtime.panicindex} [0] v1
    Exit v11
name b[int]: [v7]
@randall77
Contributor

@tzneal
It would be nice if this were captured by Todd's new rule:
(Load p1 (Store [w] p2 x _)) && isSamePtr(p1,p2) && t1.Compare(x.Type)==CMPeq && w == t1.Size() -> x
Unfortunately, when it runs at generic rewrite time, it doesn't realize that the pointers are equal, because they are built from indexed address math that hasn't been cse'd yet.
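
For reference, a rough sketch of the kind of structural check isSamePtr performs (the actual helper in cmd/compile/internal/ssa/rewrite.go may cover a different set of cases). Before cse, the two indexed addresses are distinct Values, so a check like this does not see them as equal:

// isSamePtr reports whether p1 and p2 are provably the same pointer.
// Sketch only.
func isSamePtr(p1, p2 *Value) bool {
        if p1 == p2 {
                return true // literally the same SSA value
        }
        if p1.Op != p2.Op {
                return false
        }
        switch p1.Op {
        case OpOffPtr:
                // Equal constant offsets from provably-equal base pointers.
                return p1.AuxInt == p2.AuxInt && isSamePtr(p1.Args[0], p2.Args[0])
        case OpAddr:
                // Addresses of the same symbol off the same base.
                return p1.Aux == p2.Aux && p1.Args[0].Op == p2.Args[0].Op
        }
        return false
}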

@tzneal
Member

tzneal commented Mar 1, 2016

I'll have a look at it. Maybe just a more complex isSamePtr that uses cmpVal to compare pointers?

@randall77
Contributor

That would work. I get the feeling that it is just a patch over a deeper phase-ordering problem, though. Another cse would do it, but that is expensive.

@tzneal
Member

tzneal commented Mar 2, 2016

Adding another cse/zcse pair costs 3-6% in compile time (consistency checks disabled). Adding a second opt pass after cse is negligible in cost and resolves the issue:

        {name: "generic cse", fn: cse},
+       {name: "opt second pass", fn: opt},
        {name: "phiopt", fn: phi opt},

name       old time/op     new time/op     delta
Template       293ms ± 2%      295ms ± 2%    ~             (p=0.421 n=5+5)
GoTypes        1.09s ± 1%      1.10s ± 2%    ~             (p=0.841 n=5+5)
Compiler       4.79s ± 1%      4.83s ± 1%    ~             (p=0.151 n=5+5)
MakeBash       35.8s ± 2%      36.3s ± 1%    ~             (p=0.222 n=5+5)

# file difference
/home/todd/Projects/gn/bin/go 11942904
/home/todd/Projects/go/bin/go 11942600 [-304 bytes]

# section differences
read-only data = -25 bytes (-0.001476%)
global text (code) = -240 bytes (-0.006249%)
Total difference -265 bytes (-0.004533%)
g_ssa <nil>
  b1:
    v1 = InitMem <mem>
    v27 = Arg <*uint8> {a} [0] : a[*uint8]
    v7 = Arg <int> {b} [0] : b[int]
    v26 = Arg <int> {a} [8] : a+8[int]
    v16 = LoadReg <int> v7 : CX
    v46 = LoadReg <int> v26 : AX
    v45 = CMPQ <flags> v16 v46
    ULT v45 -> b7 b3 (likely)
  b7: <- b1
    v50 = LoadReg <*uint8> v27 : DX
    v15 = MOVQloadidx8 <int> [0] v50 v16 v1 : AX
    v36 = ADDQconst <int> [2] v15 : AX
    v44 = MOVQstoreidx8 <mem> [0] v50 v16 v36 v1
    Ret v44
  b3: <- b1
    v11 = CALLstatic <mem> {runtime.panicindex} [0] v1
    Exit v11

There's a long-standing TODO about splitting the required and optimizing rules. What are your thoughts on doing that split and adding the second optimization rule pass after cse?
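
(To make the question concrete, one hypothetical shape for such a split; the pass and function names below are illustrative, not existing code:)

var passes = [...]pass{
        // ...
        {name: "lower", fn: lowerRequired},       // must eliminate every generic op
        {name: "lowered opt", fn: lowerOptional}, // optional machine-specific cleanups
        // ...
}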

@brtzsnr
Contributor Author

brtzsnr commented Mar 2, 2016

I was also thinking of adding a second, optional opt pass. I saw a measurable decrease in the code size of the pkg/tool/linux_amd64 binaries.

I don't see why we should split opt. It's generally fast enough.

@tzneal
Member

tzneal commented Mar 2, 2016

I think the intent of the split is primarily to support the -N option, so that the generated code is much closer to the input Go code and easier to debug with gdb.

@randall77
Contributor

Yes, the split I proposed was intended to separate the required from the optional rewrites. I was more interested in doing this for the lowering pass, but we could do it for the machine-independent one as well.

One benefit of separating the passes is that the required rewrites, for lowering at least, don't need to be run in a loop. You can do just one pass over the CFG.
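
A sketch of the difference (the loop shape is illustrative; the real driver is the rewrite loop in cmd/compile/internal/ssa/rewrite.go):

// Today: apply rules until a fixed point is reached, since one
// rewrite can expose another.
for {
        change := false
        for _, b := range f.Blocks {
                for _, v := range b.Values {
                        if rewriteValue(v) {
                                change = true
                        }
                }
        }
        if !change {
                break
        }
}

// If no required lowering rule produced a value that itself still
// needed lowering, a single walk over the CFG would suffice:
for _, b := range f.Blocks {
        for _, v := range b.Values {
                rewriteValue(v)
        }
}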

@josharian josharian changed the title from dev.ssa: Suboptimal code generated for a[b]++; a[b]++ to cmd/compile: suboptimal code generated for a[b]++; a[b]++ on Mar 2, 2016
@tzneal
Member

tzneal commented Mar 3, 2016

I don't think a single pass of the required lower rules would work because of rules like these that don't fully lower:

(Move [size] dst src mem) && size > 16 && size%16 != 0 && size%16 <= 8 ->
    (Move [size-size%16] (ADDQconst <dst.Type> dst [size%16]) (ADDQconst <src.Type> src [size%16])
        (MOVQstore dst (MOVQload src mem) mem))
(Move [size] dst src mem) && size > 16 && size%16 != 0 && size%16 > 8 ->
    (Move [size-size%16] (ADDQconst <dst.Type> dst [size%16]) (ADDQconst <src.Type> src [size%16])
        (MOVOstore dst (MOVOload src mem) mem))
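
For example, tracing the first rule by hand with size = 24: size > 16 and size%16 = 8, so the rule rewrites to

(Move [16] (ADDQconst <dst.Type> dst [8]) (ADDQconst <src.Type> src [8])
    (MOVQstore dst (MOVQload src mem) mem))

and the residual Move [16] still has to be matched again before lowering is complete, so a single application of the required rules is not enough.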

I spent some time splitting the rules, but given that opt is so fast anyway, and given the extra complexity it adds to writing rules (i.e. an "optional" rule can no longer rewrite to a form that is an input to a "required" rule), I'm not sure it's worth it.

I'll throw up a CL with just the additional pass for now.

@gopherbot

CL https://golang.org/cl/20172 mentions this issue.

gopherbot pushed a commit that referenced this issue Mar 9, 2016
* Move lowering into a separate pass.
* SliceLen/SliceCap are now available to various intermediate passes,
  which is useful for bounds checking.
* Add a second opt pass to handle the new opportunities.

Decreases the code size of binaries in pkg/tool/linux_amd64
by ~45K.

Updates #14564 #14606

Change-Id: I5b2bd6202181c50623a3585fbf15c0d6db6d4685
Reviewed-on: https://go-review.googlesource.com/20172
Run-TryBot: Alexandru Moșoi <alexandru@mosoi.ro>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
@brtzsnr
Contributor Author

brtzsnr commented Mar 9, 2016

Fixed. The ssa code looks like the following:

g_ssa <nil>
  b1:
    v1 = InitMem <mem>
    v7 = Arg <uint> {b} [0] : b[uint]
    v26 = Arg <int> {a} [8] : a+8[int]
    v27 = Arg <*uint8> {a} [0] : a[*uint8]
    v16 = LoadReg <uint> v7 : CX
    v46 = LoadReg <int> v26 : AX
    v45 = CMPQ <flags> v16 v46
    ULT v45 -> b7 b3 (likely)
  b7: <- b1
    v50 = LoadReg <*uint8> v27 : DX
    v15 = MOVQloadidx8 <uint> [0] v50 v16 v1 : AX
    v36 = ADDQconst <uint> [2] v15 : AX
    v44 = MOVQstoreidx8 <mem> [0] v50 v16 v36 v1
    Ret v44
  b3: <- b1
    v11 = CALLstatic <mem> {runtime.panicindex} [0] v1
    Exit v11

which is optimal.

@brtzsnr brtzsnr closed this as completed Mar 9, 2016
@golang golang locked and limited conversation to collaborators Mar 13, 2017