Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: eliminate redundant zeroing after lower pass #47107

Open
tdakkota opened this issue Jul 9, 2021 · 9 comments
Open

cmd/compile: eliminate redundant zeroing after lower pass #47107

tdakkota opened this issue Jul 9, 2021 · 9 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@tdakkota
Copy link

tdakkota commented Jul 9, 2021

What version of Go are you using (go version)?

$ go version
go version go1.16.4 windows/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
set GO111MODULE=on
set GOARCH=amd64
set GOBIN=
set GOCACHE=C:\Users\tdakkota\AppData\Local\go-build
set GOENV=C:\Users\tdakkota\AppData\Roaming\go\env
set GOEXE=.exe
set GOFLAGS=
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOINSECURE=
set GOMODCACHE=C:\Users\tdakkota\go\pkg\mod
set GONOPROXY=
set GONOSUMDB=
set GOOS=windows
set GOPATH=C:\Users\tdakkota\go
set GOPRIVATE=
set GOPROXY=https://proxy.golang.org,direct
set GOROOT=C:\Go
set GOSUMDB=sum.golang.org
set GOTMPDIR=
set GOTOOLDIR=C:\Go\pkg\tool\windows_amd64
set GOVCS=
set GOVERSION=go1.16.4
set GCCGO=gccgo
set AR=ar
set CC=gcc
set CXX=g++
set CGO_ENABLED=1
set GOMOD=C:\Users\tdakkota\GolandProjects\gh\abc\go.mod
set CGO_CFLAGS=-g -O2
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-g -O2
set CGO_FFLAGS=-g -O2
set CGO_LDFLAGS=-g -O2
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0

What did you do?

https://go.godbolt.org/z/64qsjGTcd

package foo

func foo(v uint64) (b [8]byte) {
	b[0] = byte(v)
	b[1] = byte(v >> 8)
	b[2] = byte(v >> 16)
	b[3] = byte(v >> 24)
	b[4] = byte(v >> 32)
	b[5] = byte(v >> 40)
	b[6] = byte(v >> 48)
	b[7] = byte(v >> 56)
	return b
}

What did you expect to see?

Compiler should generate

        MOVQ    "".v+8(SP), AX
        MOVQ    AX, "".b+16(SP)

What did you see instead?

Compiler generates

        MOVQ    $0, "".b+16(SP) // redundant zeroing
        MOVQ    "".v+8(SP), AX
        MOVQ    AX, "".b+16(SP)

According to SSA dump and rules, I see that one-byte shift-and-moves are folded to one MOVQ during lower pass, so zeroing cannot be elimated during opt passes.

(MOVLstore [i] {s} p (SHRQconst [32] w) x:(MOVLstore [i-4] {s} p w mem))
&& x.Uses == 1
&& clobber(x)
=> (MOVQstore [i-4] {s} p w mem)
(MOVLstore [i] {s} p (SHRQconst [j] w) x:(MOVLstore [i-4] {s} p w0:(SHRQconst [j-32] w) mem))
&& x.Uses == 1
&& clobber(x)
=> (MOVQstore [i-4] {s} p w0 mem)
(MOVLstore [i] {s} p1 (SHRQconst [32] w) x:(MOVLstore [i] {s} p0 w mem))
&& x.Uses == 1
&& sequentialAddresses(p0, p1, 4)
&& clobber(x)
=> (MOVQstore [i] {s} p0 w mem)
(MOVLstore [i] {s} p1 (SHRQconst [j] w) x:(MOVLstore [i] {s} p0 w0:(SHRQconst [j-32] w) mem))
&& x.Uses == 1
&& sequentialAddresses(p0, p1, 4)
&& clobber(x)
=> (MOVQstore [i] {s} p0 w0 mem)

@tdakkota tdakkota changed the title cmd/compile: elimate redundant zeroing after lower pass cmd/compile: eliminate redundant zeroing after lower pass Jul 9, 2021
@icholy
Copy link

icholy commented Jul 9, 2021

@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jul 9, 2021
@mknyszek mknyszek added this to the Backlog milestone Jul 9, 2021
@mknyszek
Copy link
Contributor

mknyszek commented Jul 9, 2021

CC @randall77 @martisch @mdempsky

@randall77
Copy link
Contributor

The zeroing could still be detected as unnecessary in the current dead store pass before the writes are folded into one MOVQ.
I think that pass is just confused by the LocalAddr reading the memory state between the byte writes. Maybe that's all that needs fixing.

@mengzhuo
Copy link
Contributor

I don't think dse can elide this.
dse using shadowed set looking for shadow size bigger than previous store.
b[i] = byte(a >> x) only try to looking for one byte while zero [8]byte has 8 bytes in shadow.

@mengzhuo
Copy link
Contributor

I tried this rule that elided the redundant zero store rule
@randall77 is this OK to you and can I submit a CL for this issue?

(MOVQstore [0] {sym2} destptr2 val a:(MOVQstoreconst [makeValAndOff(0,0)] {sym1} destptr1 mem))
        && sym2 == sym1
        && destptr2 == destptr1
        => (MOVQstore [0] {sym2} destptr2 val mem)

@randall77
Copy link
Contributor

I don't think dse can elide this.
dse using shadowed set looking for shadow size bigger than previous store.

Yes, that's the current problem. It's not combining the ranges of the individual byte writes, so when it gets to the 8-byte Zero it is bigger than the shadow size. We need a more robust computation of what is shadowed. We need to combine byte writes into larger regions.

v10 (4) = LocalAddr <*[8]byte> {b} v2 v8
v13 (4) = OffPtr <*byte> [0] v10
v76 (4) = Store <mem> {byte} v13 v9 v8
v18 (5) = LocalAddr <*[8]byte> {b} v2 v76
v20 (5) = OffPtr <*byte> [1] v18
v75 (5) = Store <mem> {byte} v20 v17 v76

This should shadow 2 bytes at v10/v18. Currently dse treats those LocalAddrs as reads, which prevents combination. Then we'll need to implement combination as well.

@randall77 is this OK to you and can I submit a CL for this issue?

I'd really rather fix dse. Your rule works only for 8-byte constant stores, and only for amd64. We'd need a lot more rules to fix it for all cases.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 13, 2022
@dominikh
Copy link
Member

For me, on master, the example from the original comment compiles ideally:

func foo(v uint64) (b [8]byte) {
	b[0] = byte(v)
	b[1] = byte(v >> 8)
	b[2] = byte(v >> 16)
	b[3] = byte(v >> 24)
	b[4] = byte(v >> 32)
	b[5] = byte(v >> 40)
	b[6] = byte(v >> 48)
	b[7] = byte(v >> 56)
	return b
}

command-line-arguments.foo STEXT nosplit size=15 args=0x10 locals=0x0 funcid=0x0 align=0x0
        0x0000 00000 (/home/dominikh/prj/src/example.com/foo.go:3)      TEXT    command-line-arguments.foo(SB), NOSPLIT|ABIInternal, $0-16
        0x0000 00000 (/home/dominikh/prj/src/example.com/foo.go:3)      FUNCDATA        $0, gclocals·g2BeySu+wFnoycgXfElmcg==(SB)
        0x0000 00000 (/home/dominikh/prj/src/example.com/foo.go:3)      FUNCDATA        $1, gclocals·g2BeySu+wFnoycgXfElmcg==(SB)
        0x0000 00000 (/home/dominikh/prj/src/example.com/foo.go:3)      FUNCDATA        $5, command-line-arguments.foo.arginfo1(SB)
        0x0000 00000 (/home/dominikh/prj/src/example.com/foo.go:3)      FUNCDATA        $6, command-line-arguments.foo.argliveinfo(SB)
        0x0000 00000 (/home/dominikh/prj/src/example.com/foo.go:3)      PCDATA  $3, $1
        0x0000 00000 (/home/dominikh/prj/src/example.com/foo.go:3)      MOVQ    $0, command-line-arguments.b+8(SP)
        0x0009 00009 (/home/dominikh/prj/src/example.com/foo.go:11)     MOVQ    AX, command-line-arguments.b+8(SP)
        0x000e 00014 (/home/dominikh/prj/src/example.com/foo.go:12)     RET

However, the following still has unnecessary zeroing:

type S struct {
	A int
	B byte
	C [4]uint64
}

func foo() S {
	return S{1, 2, [4]uint64{3, 4, 5, 6}}
}

command-line-arguments.foo STEXT nosplit size=69 args=0x30 locals=0x0 funcid=0x0 align=0x0
        0x0000 00000 (/home/dominikh/prj/src/example.com/foo.go:9)      TEXT    command-line-arguments.foo(SB), NOSPLIT|ABIInternal, $0-48
        0x0000 00000 (/home/dominikh/prj/src/example.com/foo.go:9)      FUNCDATA        $0, gclocals·g2BeySu+wFnoycgXfElmcg==(SB)
        0x0000 00000 (/home/dominikh/prj/src/example.com/foo.go:9)      FUNCDATA        $1, gclocals·g2BeySu+wFnoycgXfElmcg==(SB)
        0x0000 00000 (/home/dominikh/prj/src/example.com/foo.go:9)      MOVUPS  X15, command-line-arguments.~r0+8(SP)
        0x0006 00006 (/home/dominikh/prj/src/example.com/foo.go:9)      MOVUPS  X15, command-line-arguments.~r0+24(SP)
        0x000c 00012 (/home/dominikh/prj/src/example.com/foo.go:9)      MOVUPS  X15, command-line-arguments.~r0+40(SP)
        0x0012 00018 (/home/dominikh/prj/src/example.com/foo.go:10)     MOVQ    $1, command-line-arguments.~r0+8(SP)
        0x001b 00027 (/home/dominikh/prj/src/example.com/foo.go:10)     MOVB    $2, command-line-arguments.~r0+16(SP)
        0x0020 00032 (/home/dominikh/prj/src/example.com/foo.go:10)     MOVQ    $3, command-line-arguments.~r0+24(SP)
        0x0029 00041 (/home/dominikh/prj/src/example.com/foo.go:10)     MOVQ    $4, command-line-arguments.~r0+32(SP)
        0x0032 00050 (/home/dominikh/prj/src/example.com/foo.go:10)     MOVQ    $5, command-line-arguments.~r0+40(SP)
        0x003b 00059 (/home/dominikh/prj/src/example.com/foo.go:10)     MOVQ    $6, command-line-arguments.~r0+48(SP)
        0x0044 00068 (/home/dominikh/prj/src/example.com/foo.go:10)     RET

Interestingly, there's no zeroing when storing through a pointer:

func foo(s *S) {
	*s = S{1, 2, [4]uint64{3, 4, 5, 6}}
}

command-line-arguments.foo STEXT nosplit size=48 args=0x8 locals=0x0 funcid=0x0 align=0x0
        0x0000 00000 (/home/dominikh/prj/src/example.com/foo.go:9)      TEXT    command-line-arguments.foo(SB), NOSPLIT|ABIInternal, $0-8
        0x0000 00000 (/home/dominikh/prj/src/example.com/foo.go:9)      FUNCDATA        $0, gclocals·wgcWObbY2HYnK2SU/U22lA==(SB)
        0x0000 00000 (/home/dominikh/prj/src/example.com/foo.go:9)      FUNCDATA        $1, gclocals·J5F+7Qw7O7ve2QcWC7DpeQ==(SB)
        0x0000 00000 (/home/dominikh/prj/src/example.com/foo.go:9)      FUNCDATA        $5, command-line-arguments.foo.arginfo1(SB)
        0x0000 00000 (/home/dominikh/prj/src/example.com/foo.go:9)      FUNCDATA        $6, command-line-arguments.foo.argliveinfo(SB)
        0x0000 00000 (/home/dominikh/prj/src/example.com/foo.go:9)      PCDATA  $3, $1
        0x0000 00000 (/home/dominikh/prj/src/example.com/foo.go:10)     MOVQ    $1, (AX)
        0x0007 00007 (/home/dominikh/prj/src/example.com/foo.go:10)     MOVQ    $2, 8(AX)
        0x000f 00015 (/home/dominikh/prj/src/example.com/foo.go:10)     MOVQ    $3, 16(AX)
        0x0017 00023 (/home/dominikh/prj/src/example.com/foo.go:10)     MOVQ    $4, 24(AX)
        0x001f 00031 (/home/dominikh/prj/src/example.com/foo.go:10)     MOVQ    $5, 32(AX)
        0x0027 00039 (/home/dominikh/prj/src/example.com/foo.go:10)     MOVQ    $6, 40(AX)
        0x002f 00047 (/home/dominikh/prj/src/example.com/foo.go:11)     RET

but I haven't checked to see if the root cause for that is the same.

@tdakkota
Copy link
Author

@dominikh There is still redundant zeroing. It seems b [8]byte is placed in the stack.

TEXT    command-line-arguments.foo(SB), NOSPLIT|ABIInternal, $0-16
...
MOVQ    $0, command-line-arguments.b+8(SP) // <- zero the result value
MOVQ    AX, command-line-arguments.b+8(SP) // then overwrite it using the first argument
RET

@dominikh
Copy link
Member

You're right. I must've misread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Projects
Status: Triage Backlog
Development

No branches or pull requests

7 participants