Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: compiler built with PGO crashes occasionally on ppc64{le} #60368

Closed
prattmic opened this issue May 23, 2023 · 38 comments
Closed

cmd/compile: compiler built with PGO crashes occasionally on ppc64{le} #60368

prattmic opened this issue May 23, 2023 · 38 comments
Assignees
Labels
arch-ppc64x compiler/runtime Issues related to the Go compiler and/or runtime. release-blocker
Milestone

Comments

@prattmic
Copy link
Member

prattmic commented May 23, 2023

#!watchflakes
default <- goarch ~ `ppc64` && date != "" && date > "2023-05-22" && date < "2023-06-01"

https://go.dev/cl/495596 added a default.pgo profile for cmd/compile, enabling a PGO build of the compiler (as long as -pgo=none is not set explicitly).

This caused a variety of crashes in the compiler on ppc64{le} builders:

2023-05-18T16:55:07-88f89d8/linux-ppc64le-power9osu
2023-05-18T13:41:27-33a601b/aix-ppc64
2023-05-18T12:52:14-7b0835d/aix-ppc64
2023-05-18T10:23:17-75add1c/aix-ppc64
2023-05-18T09:16:07-774f602/linux-ppc64-sid-power10
2023-05-18T09:16:07-774f602/linux-ppc64le-buildlet
2023-05-18T09:15:25-27906bb/aix-ppc64
2023-05-18T09:15:25-27906bb/linux-ppc64-sid-power10
2023-05-18T01:40:37-6ed8474/linux-ppc64-sid-buildlet
2023-05-18T01:40:37-6ed8474/linux-ppc64-sid-power10
2023-05-18T00:35:53-956d31e/linux-ppc64le-buildlet
2023-05-17T22:11:31-0b86a04/linux-ppc64le-buildlet
2023-05-17T21:53:11-c426c87/linux-ppc64-sid-buildlet
2023-05-17T21:53:11-c426c87/linux-ppc64le-power10osu
2023-05-17T21:44:30-2693ade/linux-ppc64-sid-power10

That CL also caused #60263. Since it caused several issues, the CL was reverted in https://go.dev/cl/496185. #60263 has since been fixed.

Given these failures are all on ppc64{le}, @dr2chase and I suspect that they are due to a bad ppc64-specific optimization (SSA rule, e.g.) that is tickled by the additional inlining caused by PGO.

I have had some success reproducing these crashes. Running all.bash in a loop on three linux-ppc64-sid-power10 builders concurrently with GOGC=5 usually gets me a failure in <30 minutes. Not stellar, but I think workable.

We should then be able to bisect down to a bad function with GOSSAHASH applied in inlineCostOK to enable/disable PGO-based inlining. (Also set -d=pgodevirtualize=0 to disable PGO-based devirtualization, which was submitted in https://go.dev/cl/492436, after the bad CL above was reverted).

Given that there is a path forward to debugging ppc64, and we'd like more soak time on the primary ports, I intend to resubmit https://go.dev/cl/495596, which will make ppc64 flaky until this issue is resolved. (GOARCH=ppc64{le} could also temporarily change the default of -pgo from auto to none if necessary).

cc @golang/ppc64 @dr2chase @aclements @cherrymui

@prattmic prattmic added arch-ppc64x compiler/runtime Issues related to the Go compiler and/or runtime. labels May 23, 2023
@prattmic prattmic added this to the Go1.21 milestone May 23, 2023
@gopherbot
Copy link

Change https://go.dev/cl/497455 mentions this issue: cmd/compile: build compiler with PGO

@prattmic
Copy link
Member Author

prattmic commented May 23, 2023

My current repro approach:

$ export GOROOT=/my/dev/tree/
$ export GOMOTE_GROUP=ppc64_pgo
$ gomote create -count=3 -setup linux-ppc64-sid-power10
$ gomote run -until=goroutine -e GOGC=5 ./go/bin/go build -a cmd/dist

i.e., rebuild the cmd/dist over and over until we get a crash ("goroutine" should appear in stack trace of any compiler crash).

gopherbot pushed a commit that referenced this issue May 23, 2023
Reapples CL 495596, which was reverted at CL 496185. The x/tools
failure, #60263, has been resolved. The ppc64 failures, #60368, have
_not_ been resolved, but are believed to be specific to that port. This
CL will make ppc64 flaky while the issue is investigated, but give more
soak time on primary ports.

Build the compiler with PGO. As go build -pgo=auto is enabled by
default, we just need to store a profile in the compiler's
directory.

The profile is collected from building all std and cmd packages on
Linux/AMD64 machine, using profile.sh.

This improves the compiler speed. On Linux/AMD64,
name        old time/op       new time/op       delta
Template          138ms ± 5%        136ms ± 4%  -1.44%  (p=0.005 n=36+39)
Unicode           147ms ± 4%        140ms ± 4%  -4.99%  (p=0.000 n=40+39)
GoTypes           780ms ± 3%        778ms ± 4%    ~     (p=0.172 n=39+39)
Compiler          105ms ± 5%         99ms ± 7%  -5.64%  (p=0.000 n=40+40)
SSA               5.83s ± 6%        5.80s ± 6%    ~     (p=0.556 n=40+40)
Flate            89.0ms ± 5%       87.0ms ± 6%  -2.18%  (p=0.000 n=40+40)
GoParser          172ms ± 4%        167ms ± 4%  -2.72%  (p=0.000 n=39+40)
Reflect           333ms ± 4%        333ms ± 3%    ~     (p=0.426 n=40+39)
Tar               128ms ± 4%        126ms ± 4%  -1.82%  (p=0.000 n=39+39)
XML               173ms ± 4%        170ms ± 4%  -1.39%  (p=0.000 n=39+40)
[Geo mean]        253ms             248ms       -2.13%

The profile is pretty transferable. Using the same profile, we
see a bigger win on Darwin/ARM64,
name        old time/op       new time/op       delta
Template         71.0ms ± 2%       68.3ms ± 2%  -3.90%  (p=0.000 n=20+20)
Unicode          71.8ms ± 2%       66.8ms ± 2%  -6.90%  (p=0.000 n=20+20)
GoTypes           444ms ± 1%        428ms ± 1%  -3.53%  (p=0.000 n=19+20)
Compiler         48.9ms ± 3%       45.6ms ± 3%  -6.81%  (p=0.000 n=20+20)
SSA               3.25s ± 2%        3.09s ± 1%  -5.03%  (p=0.000 n=19+20)
Flate            44.0ms ± 2%       42.3ms ± 2%  -3.72%  (p=0.000 n=19+20)
GoParser         76.7ms ± 1%       73.5ms ± 1%  -4.15%  (p=0.000 n=18+19)
Reflect           172ms ± 1%        165ms ± 1%  -4.13%  (p=0.000 n=20+19)
Tar              63.1ms ± 1%       60.4ms ± 2%  -4.24%  (p=0.000 n=19+20)
XML              83.2ms ± 2%       79.2ms ± 2%  -4.79%  (p=0.000 n=20+20)
[Geo mean]        127ms             121ms       -4.73%

For #60368.

Change-Id: I2cec0fc85e21c38d57ba6f0e5e90cde5d443ebd2
Reviewed-on: https://go-review.googlesource.com/c/go/+/497455
Reviewed-by: Austin Clements <austin@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Bryan Mills <bcmills@google.com>
@gopherbot
Copy link

Found new dashboard test flakes for:

#!watchflakes
default <- goarch ~ `ppc64` && date != "" && date > "2023-05-22"
2023-05-23 18:15 linux-ppc64le-buildlet go@b0f15b4a cmd/cgo/internal/testshared.TestTrivialExecutable (log)
--- FAIL: TestTrivialExecutable (3.68s)
    shared_test.go:491: executing /workdir/go/bin/go install -linkshared ./trivial failed exit status 1:
        # runtime
        panic: runtime error: index out of range [7156992] with length 7898 [recovered]
        	panic: runtime error: index out of range [7156992] with length 7898

        goroutine 1 [running]:
        panic({0x928e40?, 0xc0026367e0?})
        	runtime/panic.go:1017 +0x448 fp=0xc0022ec640 sp=0xc0022ec580 pc=0x58a78
        runtime.goPanicIndex(0x6d3500, 0x1eda)
...
        cmd/compile/internal/gc.compileFunctions()
        	cmd/compile/internal/gc/compile.go:194 +0x220 fp=0xc0022ed968 sp=0xc0022ed8e8 pc=0x8595b0
        cmd/compile/internal/gc.Main(0x999188)
        	cmd/compile/internal/gc/main.go:344 +0x1764 fp=0xc0022edea8 sp=0xc0022ed968 pc=0x85b904
        main.main()
        	cmd/compile/main.go:57 +0x134 fp=0xc0022edf18 sp=0xc0022edea8 pc=0x8807b4
        runtime.main()
        	runtime/proc.go:267 +0x348 fp=0xc0022edfc0 sp=0xc0022edf18 pc=0x5ba88
        runtime.goexit()
        	runtime/asm_ppc64x.s:912 +0x4 fp=0xc0022edfc0 sp=0xc0022edfc0 pc=0x94a94

watchflakes

@laboger
Copy link
Contributor

laboger commented May 23, 2023

We have done some experimenting with this. I found that if we build the default.pgo file natively instead of using the one built on amd64 we are not able to get the problem to occur. When using the default pgo file built on amd64 more inlining happens within cmd/compile/internal/ssagen.(*state).stmt and according to the stacktraces that seems like where many of the problems occur.

I was going to attempt to force more aggressive inlining and see if that forces the problem.

@gopherbot
Copy link

Found new dashboard test flakes for:

#!watchflakes
default <- goarch ~ `ppc64` && date != "" && date > "2023-05-22"
2023-05-24 01:03 linux-ppc64le-power10osu go@bdc5533f cmd/compile/internal/ssa [build] (log)
panic: runtime error: index out of range [7177776] with length 32378

goroutine 346 [running]:
cmd/internal/src.(*PosTable).Pos(...)
	cmd/internal/src/xpos.go:153
cmd/internal/obj.(*Link).InnermostPos(...)
	cmd/internal/obj/inl.go:124
cmd/internal/obj.(*Link).OutermostPos(0x10?, {0x80000?, 0xc0?})
	cmd/internal/obj/inl.go:105 +0x134
cmd/compile/internal/base.FmtPos({0x11?, 0x0?})
...
	cmd/compile/internal/ssagen/ssa.go:548 +0x235c
cmd/compile/internal/ssagen.Compile(0xc01047f340, 0x0?)
	cmd/compile/internal/ssagen/pgen.go:187 +0x48
cmd/compile/internal/gc.compileFunctions.func5.1(0x0?)
	cmd/compile/internal/gc/compile.go:184 +0x4c
cmd/compile/internal/gc.compileFunctions.func3.1()
	cmd/compile/internal/gc/compile.go:166 +0x48
created by cmd/compile/internal/gc.compileFunctions.func3 in goroutine 90
	cmd/compile/internal/gc/compile.go:165 +0x20c
go tool dist: FAILED: /workdir/go/pkg/tool/linux_ppc64le/go_bootstrap install -a cmd/asm cmd/cgo cmd/compile cmd/link: exit status 1

watchflakes

@gopherbot
Copy link

Found new dashboard test flakes for:

#!watchflakes
default <- goarch ~ `ppc64` && date != "" && date > "2023-05-22"
2023-05-24 03:36 linux-ppc64le-buildlet go@f0d575c2 cmd/cgo/internal/testshared.TestTrivialExecutable (log)
--- FAIL: TestTrivialExecutable (4.81s)
    shared_test.go:491: executing /workdir/go/bin/go install -linkshared ./trivial failed exit status 1:
        # runtime
        panic: runtime error: index out of range [50463491] with length 7908 [recovered]
        	panic: runtime error: index out of range [50463491] with length 7908

        goroutine 1 [running]:
        panic({0x929740?, 0xc004ff4ee8?})
        	runtime/panic.go:1017 +0x448 fp=0xc007406640 sp=0xc007406580 pc=0x58d88
        runtime.goPanicIndex(0x3020303, 0x1ee4)
...
        cmd/compile/internal/gc.compileFunctions()
        	cmd/compile/internal/gc/compile.go:194 +0x220 fp=0xc007407968 sp=0xc0074078e8 pc=0x85ae80
        cmd/compile/internal/gc.Main(0x999ec8)
        	cmd/compile/internal/gc/main.go:344 +0x1760 fp=0xc007407ea8 sp=0xc007407968 pc=0x85d1d0
        main.main()
        	cmd/compile/main.go:57 +0x134 fp=0xc007407f18 sp=0xc007407ea8 pc=0x882ae4
        runtime.main()
        	runtime/proc.go:267 +0x348 fp=0xc007407fc0 sp=0xc007407f18 pc=0x5bd98
        runtime.goexit()
        	runtime/asm_ppc64x.s:912 +0x4 fp=0xc007407fc0 sp=0xc007407fc0 pc=0x94ee4

watchflakes

@gopherbot
Copy link

Found new dashboard test flakes for:

#!watchflakes
default <- goarch ~ `ppc64` && date != "" && date > "2023-05-22"
2023-05-24 21:17 linux-ppc64le-power9osu go@78143d96 cmd/link.TestMachOBuildVersion (log)
--- FAIL: TestMachOBuildVersion (4.98s)
    link_test.go:372: [/workdir/go/bin/go build -ldflags=-linkmode=internal -o /workdir/tmp/TestMachOBuildVersion949832594/001/main /workdir/tmp/TestMachOBuildVersion949832594/001/main.go]: exit status 1:
        # runtime
        panic: runtime error: invalid memory address or nil pointer dereference
        [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x0]

        goroutine 2970 [running]:
        panic({0x8d5240?, 0xf65c70?})
        	runtime/panic.go:1017 +0x440 fp=0xc002338db0 sp=0xc002338cf0 pc=0x58af0
        runtime.panicmem(...)
...
        cmd/compile/internal/ssagen.Compile(0xc001b7fe40, 0x0?)
        	cmd/compile/internal/ssagen/pgen.go:187 +0x48 fp=0xc002339f30 sp=0xc002339d60 pc=0x64e068
        cmd/compile/internal/gc.compileFunctions.func5.1(0x0?)
        	cmd/compile/internal/gc/compile.go:184 +0x4c fp=0xc002339f80 sp=0xc002339f30 pc=0x8537fc
        cmd/compile/internal/gc.compileFunctions.func3.1()
        	cmd/compile/internal/gc/compile.go:166 +0x48 fp=0xc002339fc0 sp=0xc002339f80 pc=0x853bc8
        runtime.goexit()
        	runtime/asm_ppc64x.s:912 +0x4 fp=0xc002339fc0 sp=0xc002339fc0 pc=0x94bb4
        created by cmd/compile/internal/gc.compileFunctions.func3 in goroutine 133
        	cmd/compile/internal/gc/compile.go:165 +0x20c

watchflakes

@gopherbot
Copy link

Found new dashboard test flakes for:

#!watchflakes
default <- goarch ~ `ppc64` && date != "" && date > "2023-05-22"
2023-05-24 17:54 linux-ppc64-sid-power10 benchmarks@c4c5b3d1 go@1ddab592 google.golang.org/protobuf/internal/impl [build] (log)
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x50 pc=0x672354]

goroutine 591 [running]:
cmd/compile/internal/ssagen.(*state).stmt(0xc004fd1600, {0x0, 0x0?})
	cmd/compile/internal/ssagen/ssa.go:1423 +0x54
cmd/compile/internal/ssagen.(*state).stmtList(...)
	cmd/compile/internal/ssagen/ssa.go:1417
cmd/compile/internal/ssagen.(*state).stmt(0xc004fd1600, {0xadf768, 0xc001cceee0?})
	cmd/compile/internal/ssagen/ssa.go:1432 +0x268
cmd/compile/internal/ssagen.(*state).stmtList(...)
	cmd/compile/internal/ssagen/ssa.go:1417
cmd/compile/internal/ssagen.buildssa(0xc001cda840, 0x1)
	cmd/compile/internal/ssagen/ssa.go:548 +0x2374
cmd/compile/internal/ssagen.Compile(0xc001cda840, 0x1?)
	cmd/compile/internal/ssagen/pgen.go:187 +0x48
cmd/compile/internal/gc.compileFunctions.func5.1(0xc00000e000?)
	cmd/compile/internal/gc/compile.go:184 +0x4c
cmd/compile/internal/gc.compileFunctions.func3.1()
	cmd/compile/internal/gc/compile.go:166 +0x48
created by cmd/compile/internal/gc.compileFunctions.func3 in goroutine 30
	cmd/compile/internal/gc/compile.go:165 +0x20c

watchflakes

@gopherbot
Copy link

Found new dashboard test flakes for:

#!watchflakes
default <- goarch ~ `ppc64` && date != "" && date > "2023-05-22"
2023-05-24 22:39 linux-ppc64le-power9osu go@79a8997a cmd/internal/obj/x86.TestVexEvexPCrelative (log)
--- FAIL: TestVexEvexPCrelative (3.40s)
    pcrelative_test.go:72: error exit status 1 output # runtime
        panic: runtime error: invalid memory address or nil pointer dereference
        [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x65d65c]

        goroutine 940 [running]:
        panic({0x8d5240?, 0xf65c70?})
        	runtime/panic.go:1017 +0x440 fp=0xc001a24db0 sp=0xc001a24cf0 pc=0x58b90
        runtime.panicmem(...)
        	runtime/panic.go:261
...
        cmd/compile/internal/ssagen.Compile(0xc0021ee160, 0x0?)
        	cmd/compile/internal/ssagen/pgen.go:187 +0x48 fp=0xc001a25f30 sp=0xc001a25d60 pc=0x64ee98
        cmd/compile/internal/gc.compileFunctions.func5.1(0x0?)
        	cmd/compile/internal/gc/compile.go:184 +0x4c fp=0xc001a25f80 sp=0xc001a25f30 pc=0x8546dc
        cmd/compile/internal/gc.compileFunctions.func3.1()
        	cmd/compile/internal/gc/compile.go:166 +0x48 fp=0xc001a25fc0 sp=0xc001a25f80 pc=0x854aa8
        runtime.goexit()
        	runtime/asm_ppc64x.s:912 +0x4 fp=0xc001a25fc0 sp=0xc001a25fc0 pc=0x94c54
        created by cmd/compile/internal/gc.compileFunctions.func3 in goroutine 157
        	cmd/compile/internal/gc/compile.go:165 +0x20c

watchflakes

@gopherbot
Copy link

Found new dashboard test flakes for:

#!watchflakes
default <- goarch ~ `ppc64` && date != "" && date > "2023-05-22"
2023-05-24 23:56 linux-ppc64-sid-buildlet go@6824765b internal/unsafeheader [build] (log)
,,,
2023-05-24 23:56 linux-ppc64-sid-buildlet go@6824765b internal/coverage/rtcov [build] (log)
,,,
2023-05-24 23:56 linux-ppc64-sid-buildlet go@6824765b internal/goexperiment [build] (log)
,,,

watchflakes

@gopherbot
Copy link

Found new dashboard test flakes for:

#!watchflakes
default <- goarch ~ `ppc64` && date != "" && date > "2023-05-22"
2023-05-24 23:57 linux-ppc64-sid-power10 go@2a4f4fc5 (log)
cmd/compile/internal/ssa: /workdir/go/pkg/tool/linux_ppc64/compile: signal: killed
cmd/compile/internal/ssa: /workdir/go/pkg/tool/linux_ppc64/compile: signal: killed
go tool dist: FAILED: /workdir/go/pkg/tool/linux_ppc64/go_bootstrap install cmd: exit status 1
2023-05-25 00:04 linux-ppc64-sid-power10 go@04c62893 (log)
cmd/compile/internal/ssa: /workdir/go/pkg/tool/linux_ppc64/compile: signal: killed
go tool dist: FAILED: /workdir/go/pkg/tool/linux_ppc64/go_bootstrap install cmd: exit status 1
2023-05-25 00:13 linux-ppc64-sid-power10 go@869da4a2 cmd/compile/internal/ssa.TestDebugLines_53456 (log)
--- FAIL: TestDebugLines_53456 (1.62s)
    debug_lines_test.go:162: error running cmd (cd /workdir/tmp/debug_lines_test137894918; HOSTNAME=power_04 GO_BUILD_KEY_PATH=/buildkey/gobuildkey GO_BUILD_KEY_DELETE_AFTER_READ=true GO_BUILDER_ENV=host-linux-ppc64-sid-power10 DEBIAN_FRONTEND=noninteractive GO_STAGE0_NET_DELAY=100ms GO_STAGE0_DL_DELAY=300ms WORKDIR=/workdir GO_BUILDER_NAME=linux-ppc64-sid-power10 GO_BUILDER_FLAKY_NET=1 GOPPC64=power10 GOROOT_BOOTSTRAP=/usr/local/go-bootstrap GO_TEST_TIMEOUT_SCALE=2 GOROOT=/workdir/go GOPATH=/workdir/gopath GOPROXY=off TMPDIR=/workdir/tmp GOCACHE=/workdir/gocache GOPLSCACHE=/workdir/goplscache GOARCH=ppc64 GOENV=off GOHOSTARCH=ppc64 GOHOSTOS=linux GOMODCACHE=/workdir/gopath/pkg/mod GOOS=linux GOSUMDB=sum.golang.org GOTOOLCHAIN=auto GOTOOLDIR=/workdir/go/pkg/tool/linux_ppc64 'GOVERSION=devel 869da4a2a48555d71f0298caba6385c4d10590ba' GCCGO=gccgo AR=ar CC=gcc CXX=g++ CGO_ENABLED=0 TERM=dumb GO386=sse2 GOAMD64=v1 GOARM=7 GOMIPS=hardfloat GOMIPS64=hardfloat GOROOT_FINAL=/workdir/go GOBIN=/workdir/go/bin LANG=C LANGUAGE=en_US.UTF8 GOWORK=off GOTRACEBACK=system PWD=/workdir/go/src/cmd/compile/internal/ssa GOSSADIR=/workdir/tmp/debug_lines_test137894918 /workdir/go/bin/go build -o foo.o '-gcflags=-d=ssa/genssa/dump=(*T).Inc -N -l' /workdir/go/src/cmd/compile/internal/ssa/testdata/b53456.go ): exit status 1
        stdout:
        stderr:
        # runtime
        panic: runtime error: invalid memory address or nil pointer dereference
        [signal SIGSEGV: segmentation violation code=0x1 addr=0x54 pc=0x699e74]

        goroutine 945 [running]:
        panic({0x8f51e0?, 0xf85c70?})
...
        cmd/compile/internal/ssagen.Compile(0xc001adac60, 0x0?)
        	cmd/compile/internal/ssagen/pgen.go:187 +0x48 fp=0xc000093f30 sp=0xc000093d60 pc=0x6643c8
        cmd/compile/internal/gc.compileFunctions.func5.1(0x0?)
        	cmd/compile/internal/gc/compile.go:184 +0x4c fp=0xc000093f80 sp=0xc000093f30 pc=0x86a84c
        cmd/compile/internal/gc.compileFunctions.func3.1()
        	cmd/compile/internal/gc/compile.go:166 +0x48 fp=0xc000093fc0 sp=0xc000093f80 pc=0x86ac38
        runtime.goexit()
        	runtime/asm_ppc64x.s:912 +0x4 fp=0xc000093fc0 sp=0xc000093fc0 pc=0x956e4
        created by cmd/compile/internal/gc.compileFunctions.func3 in goroutine 151
        	cmd/compile/internal/gc/compile.go:165 +0x20c

watchflakes

@dr2chase
Copy link
Contributor

@laboger
PROGRESS!

I ran a binary search on the flake, and isolated it to a single PGO inline.

you can compare two versions of ssa.html with (full recipe):

First checkout experiment CL, build a known-good (no PGO-inlining) compiler, and save it:

cd whatever/go/src
git fetch https://go.googlesource.com/go refs/changes/57/497557/1 && git checkout -b change-497557 FETCH_HEAD
GOCOMPILEDEBUG=gossahash=10101010101010110101001011 ./make.bash
go install golang.org/x/tools/cmd/toolstash@latest
toolstash save

Then you can build a bad compiler, or a good compiler, with output to a nearby directory (../bad or ../good):

GOSSADIR=`pwd`/../bad  GOSSAFUNC='cmd/compile/internal/ssa.(*expandState).rewriteSelect' GOCOMPILEDEBUG=gossahash=v1100101001011 toolstash go install cmd/compile

GOSSADIR=`pwd`/../good  GOSSAFUNC='cmd/compile/internal/ssa.(*expandState).rewriteSelect' GOCOMPILEDEBUG=gossahash=10101010101010110101001011 toolstash go install cmd/compile

Note that gossahash=10101010101010110101001011 matches nothing, but is quieter than n which logs a lot about what it isn't doing. When you build the bad compiler with the command line above, it should print:

# cmd/compile/internal/ssa
cmd/compile/internal/ssa/expand_calls.go:550:22 [bisect-match 0xa6214988bb18594b]
gossahash triggered cmd/compile/internal/ssa/expand_calls.go:550:22 000110000101100101001011
dumped SSA to ../bad/cmd/compile/internal/ssa.(*expandState).rewriteSelect.html

I have not had a chance to dig into this yet, it seems likely that you might be able to make faster progress than me, unless something obvious leaps out at me.

@gopherbot
Copy link

Found new dashboard test flakes for:

#!watchflakes
default <- goarch ~ `ppc64` && date != "" && date > "2023-05-22"
2023-05-26 17:03 linux-ppc64le-buildlet go@1ff89009 runtime.TestCgoPprofPIE (log)
--- FAIL: TestCgoPprofPIE (2.85s)
    crash_test.go:141: running /workdir/go/bin/go build -o /workdir/tmp/go-build994731933/testprogcgo_-buildmode=pie.exe -buildmode=pie
    crash_cgo_test.go:303: building testprogcgo [-buildmode=pie]: exit status 1
        # runtime
        panic: runtime error: invalid memory address or nil pointer dereference
        [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x1e8b38]

        goroutine 1392 [running]:
        cmd/compile/internal/types.CalcSize(0x0?)
        	cmd/compile/internal/types/size.go:265 +0x48
...
        cmd/compile/internal/ssagen.buildssa(0xc00558c9a0, 0x0)
        	cmd/compile/internal/ssagen/ssa.go:548 +0x24d0
        cmd/compile/internal/ssagen.Compile(0xc00558c9a0, 0x80d9c0?)
        	cmd/compile/internal/ssagen/pgen.go:187 +0x48
        cmd/compile/internal/gc.compileFunctions.func5.1(0x0?)
        	cmd/compile/internal/gc/compile.go:184 +0x4c
        cmd/compile/internal/gc.compileFunctions.func3.1()
        	cmd/compile/internal/gc/compile.go:166 +0x48
        created by cmd/compile/internal/gc.compileFunctions.func3 in goroutine 165
        	cmd/compile/internal/gc/compile.go:165 +0x20c

watchflakes

@dr2chase
Copy link
Contributor

dr2chase commented May 26, 2023

@mknyszek suggested turning off async preemption, and so far we are almost an hour in with no failure.

@dr2chase
Copy link
Contributor

It's at 100 minutes, still running, I will leave it running.

@gopherbot
Copy link

Found new dashboard test flakes for:

#!watchflakes
default <- goarch ~ `ppc64` && date != "" && date > "2023-05-22"
2023-05-26 22:45 linux-ppc64le-buildlet go@260a9b0a cmd/compile/internal/ssa [build] (log)
panic: runtime error: index out of range [65679360] with length 32426

goroutine 1525 [running]:
cmd/internal/src.(*PosTable).Pos(...)
	cmd/internal/src/xpos.go:153
cmd/internal/obj.(*Link).InnermostPos(...)
	cmd/internal/obj/inl.go:124
cmd/internal/obj.(*Link).OutermostPos(0x10?, {0x80000?, 0xc0?})
	cmd/internal/obj/inl.go:105 +0x140
cmd/compile/internal/base.FmtPos({0x11?, 0x0?})
...
	cmd/compile/internal/ssagen/ssa.go:548 +0x24d0
cmd/compile/internal/ssagen.Compile(0xc010b59080, 0x0?)
	cmd/compile/internal/ssagen/pgen.go:187 +0x48
cmd/compile/internal/gc.compileFunctions.func5.1(0x0?)
	cmd/compile/internal/gc/compile.go:184 +0x4c
cmd/compile/internal/gc.compileFunctions.func3.1()
	cmd/compile/internal/gc/compile.go:166 +0x48
created by cmd/compile/internal/gc.compileFunctions.func3 in goroutine 113
	cmd/compile/internal/gc/compile.go:165 +0x20c
go tool dist: FAILED: /workdir/go/pkg/tool/linux_ppc64le/go_bootstrap install cmd: exit status 1

watchflakes

@dr2chase
Copy link
Contributor

dr2chase commented May 27, 2023

6 hours, no failures. Looks like there's an instruction sequence in that code that should not be preempted.
PS/edit now doing a low run with async preempt enabled w/o touching the compiler, just to be sure. I need to see at least one failure, preferably in fewer than 6 hours.

I can get failures with async preemption turned off, darn it. The failure rate seems to depend on whether the machine is busy, maybe? It was low Friday night, I thought that seemed fishy, so I added a second gomote on the same box, and now it fails more often, e.g. with a plain compiler (no special gossahash setting)

3: ./once.sh
+ go clean -cache
+ GOGC=1
+ GODEBUG=asyncpreemptoff=1
+ go build -a cmd std
# cmd/compile/internal/ssa
runtime: marked free object in span 0x7e3a50623680, elemsize=32 freeindex=0 (bad use of unsafe.Pointer? try -d=checkptr)

I am not happy.

Because our mechanical servants don't get the weekend off, I started a pair of searches, one with async preemption disabled, the other not, partly for the exercise of looking for any other failure points, so far none have appeared (i.e., the binary search, which is randomized, is repeating itself).

@laboger
Copy link
Contributor

laboger commented May 30, 2023

I can get it to fail pretty consistently a few times while running index0.go this way:

cd cmd/internal/testdir
GOMAXPROCS=4 go test -test.run=Test/index0.go -test.count=100
...

FAIL: Test (0.04s)
    --- FAIL: Test/index0.go (0.71s)
        testdir_test.go:132: exit status 1
            # command-line-arguments
            panic: runtime error: index out of range [8484704] with length 4028
            
            goroutine 472 [running]:
            cmd/internal/src.(*PosTable).Pos(...)
            	/home/boger/golang/base/go/src/cmd/internal/src/xpos.go:153
            cmd/internal/obj.(*Link).InnermostPos(...)
            	/home/boger/golang/base/go/src/cmd/internal/obj/inl.go:124
            cmd/internal/obj.(*Link).OutermostPos(0x10?, {0x56800?, 0xc0?})
            	/home/boger/golang/base/go/src/cmd/internal/obj/inl.go:105 +0x140
            cmd/compile/internal/base.FmtPos({0x11?, 0x0?})
            	/home/boger/golang/base/go/src/cmd/compile/internal/base/print.go:66 +0x50
            cmd/compile/internal/base.FatalfAt({0x9ad2e0?, 0xc0?}, {0xc0009ad2b8, 0x17}, {0xc0010c5260, 0x2, 0x2})
            	/home/boger/golang/base/go/src/cmd/compile/internal/base/print.go:218 +0xa8
            cmd/compile/internal/base.Fatalf(...)
            	/home/boger/golang/base/go/src/cmd/compile/internal/base/print.go:199
            cmd/compile/internal/ssagen.(*ssafn).Fatalf(0x665ad4?, {0x6?, 0x7f01?}, {0x96cbde, 0x11}, {0xc00116f5b0, 0x1, 0xc000817701?})
            	/home/boger/golang/base/go/src/cmd/compile/internal/ssagen/ssa.go:7986 +0x1a0
            cmd/compile/internal/ssagen.(*state).Fatalf(0xc000c1eb00?, {0x96cbde?, 0x104cb48?}, {0xc00116f5b0?, 0x42604?, 0x0?})
            	/home/boger/golang/base/go/src/cmd/compile/internal/ssagen/ssa.go:932 +0xb8
            cmd/compile/internal/ssagen.(*state).stmt(0xc000c1eb00, {0xacf3a8, 0xc000be1c98?})
            	/home/boger/golang/base/go/src/cmd/compile/internal/ssagen/ssa.go:2002 +0x2998
            cmd/compile/internal/ssagen.(*state).stmtList(...)
            	/home/boger/golang/base/go/src/cmd/compile/internal/ssagen/ssa.go:1418
            cmd/compile/internal/ssagen.buildssa(0xc002ee8840, 0x1)
            	/home/boger/golang/base/go/src/cmd/compile/internal/ssagen/ssa.go:548 +0x24d0
            cmd/compile/internal/ssagen.Compile(0xc002ee8840, 0x0?)
            	/home/boger/golang/base/go/src/cmd/compile/internal/ssagen/pgen.go:187 +0x48
            cmd/compile/internal/gc.compileFunctions.func5.1(0x0?)
            	/home/boger/golang/base/go/src/cmd/compile/internal/gc/compile.go:184 +0x4c
            cmd/compile/internal/gc.compileFunctions.func3.1()
            	/home/boger/golang/base/go/src/cmd/compile/internal/gc/compile.go:166 +0x48
            created by cmd/compile/internal/gc.compileFunctions.func3 in goroutine 10
            	/home/boger/golang/base/go/src/cmd/compile/internal/gc/compile.go:165 +0x20c
            
FAIL
exit status 1

It only fails if I set GOMAXPROCS=4, which is the number of processors available on the OSU builders.
I also suspected preemption because of the intermittent nature of the failure and the fact that some of the registers have to be bad some of the time, but this also fails if I turn off async preemption.
My other suspicion is inlining, and in particular in this case there is some recursion in the stmt, stmtList functions (since they appear in stacks in so many of the failures) along with being inlined. I don't know how to just force more aggressive inlining, I was thinking it was the -l option but I don't see many additional inlined functions when I set this. Maybe there is another way, or PGO just does more inlining than normally and that hasn't been tested very well on Power. I did try setting some //go:noinline directives before some suspicious functions inlined by stmt or stmtList but that caused the failures to appear in different places.
Because of the random nature of the failures, my next theory is data corruption and in a way where it doesn't matter very often, probably on the stack. That's as far as I got.

@prattmic
Copy link
Member Author

I don't know how to just force more aggressive inlining, I was thinking it was the -l option but I don't see many additional inlined functions when I set this. Maybe there is another way, or PGO just does more inlining than normally

The code you are looking for is https://cs.opensource.google/go/go/+/master:src/cmd/compile/internal/inline/inl.go;l=987.

tl;dr, of this function is that a function is noramlly inlinable if its cost is < 80 (inlineMaxBudget). Increase inlineMaxBudget to get generally more aggressive inlining (there is no flag). With PGO, hot calls can have a cost up to 2000 (inlineHotMaxBudget).

@gopherbot
Copy link

Change https://go.dev/cl/499217 mentions this issue: DO NOT SUBMIT: cmd/compile: hardcode position check for bad inline

@dr2chase
Copy link
Contributor

For better reproducibility in the future, use -trimpath with gossahash:

GOCOMPILEDEBUG=gossahash=v100010110000011001010011 GOARCH=ppc64le GOOS=linux go build -trimpath -o ./compile.ppc64le cmd/compile
# cmd/compile/internal/ssa
cmd/compile/internal/ssa/expand_calls.go:550:22 [bisect-match 0xe94438dd7b8b0653]
gossahash triggered cmd/compile/internal/ssa/expand_calls.go:550:22 100010110000011001010011

I'll try to find some way to make this the default for gossahash; this was explicitly NOT chosen for the other flags.

@prattmic
Copy link
Member Author

We've realized that the gossahash is using the full absolute path in the hash input, so it probably only works on @dr2chase's machine.

I wrote https://go.dev/cl/499217 to hard-code the bad position check. Enable with -gcflags=all=-d=ppc64bad=1 to get a bad binary.

@prattmic
Copy link
Member Author

I've attached the ssa.html output from the good and bad builds (renamed to .txt because Github doesn't allow .html...):

compile.good.txt
compile.bad.txt

Though they are admittedly difficult to eyeball because this function is so huge.

@gopherbot
Copy link

Found new dashboard test flakes for:

#!watchflakes
default <- goarch ~ `ppc64` && date != "" && date > "2023-05-22"
2023-05-31 15:34 linux-ppc64le-power10osu benchmarks@018588a2 go@4406e8ef google.golang.org/protobuf/internal/impl [build] (log)
panic: runtime error: index out of range [4097] with length 3669

goroutine 610 [running]:
cmd/internal/src.(*PosTable).Pos(...)
	cmd/internal/src/xpos.go:153
cmd/internal/obj.(*Link).InnermostPos(...)
	cmd/internal/obj/inl.go:124
cmd/internal/obj.(*Link).OutermostPos(0x10?, {0x46400?, 0xc0?})
	cmd/internal/obj/inl.go:105 +0x134
cmd/compile/internal/base.FmtPos({0x11?, 0x0?})
...
cmd/compile/internal/ssagen.buildssa(0xc001c63ce0, 0x0)
	cmd/compile/internal/ssagen/ssa.go:548 +0x235c
cmd/compile/internal/ssagen.Compile(0xc001c63ce0, 0x811c50?)
	cmd/compile/internal/ssagen/pgen.go:187 +0x48
cmd/compile/internal/gc.compileFunctions.func5.1(0x0?)
	cmd/compile/internal/gc/compile.go:184 +0x4c
cmd/compile/internal/gc.compileFunctions.func3.1()
	cmd/compile/internal/gc/compile.go:166 +0x48
created by cmd/compile/internal/gc.compileFunctions.func3 in goroutine 42
	cmd/compile/internal/gc/compile.go:165 +0x20c

watchflakes

@gopherbot
Copy link

Found new dashboard test flakes for:

#!watchflakes
default <- goarch ~ `ppc64` && date != "" && date > "2023-05-22"
2023-05-31 15:15 aix-ppc64 go@d59926ec cmd/compile/internal/types2 [build] (log)
panic: interface conversion: ir.Node is *ir.BinaryExpr, not *ir.IndexExpr

goroutine 443 [running]:
cmd/compile/internal/ssagen.(*state).exprCheckPtr(0xa00010004926300, {0x100b387e8, 0xa0001000014eaf0?}, 0x1)
	cmd/compile/internal/ssagen/ssa.go:3132 +0x6108
cmd/compile/internal/ssagen.(*state).expr(...)
	cmd/compile/internal/ssagen/ssa.go:2639
cmd/compile/internal/ssagen.(*state).stmt(0xa00010004926300, {0x100b38728, 0xa0001000369c960?})
	cmd/compile/internal/ssagen/ssa.go:1671 +0x65cc
cmd/compile/internal/ssagen.(*state).stmtList(...)
...
	cmd/compile/internal/ssagen/ssa.go:548 +0x27ec
cmd/compile/internal/ssagen.Compile(0xa000100037db600, 0x3?)
	cmd/compile/internal/ssagen/pgen.go:187 +0x48
cmd/compile/internal/gc.compileFunctions.func5.1(0x0?)
	cmd/compile/internal/gc/compile.go:184 +0x4c
cmd/compile/internal/gc.compileFunctions.func3.1()
	cmd/compile/internal/gc/compile.go:166 +0x54
created by cmd/compile/internal/gc.compileFunctions.func3 in goroutine 8
	cmd/compile/internal/gc/compile.go:165 +0x20c
go tool dist: FAILED: /ramdisk8GB/workdir-host-aix-ppc64-osuosl/go/pkg/tool/aix_ppc64/go_bootstrap install cmd: exit status 1

watchflakes

@aclements
Copy link
Member

Neither 4406e8e nor d59926e include the fix commit, so those flakes are okay.

@cherrymui
Copy link
Member

If the commit time is after the issue closure time, watchflakes should reopen the issue, otherwise it just posts. So if the issue is not reopened we can assume the failures are old.

@gopherbot
Copy link

Found new dashboard test flakes for:

#!watchflakes
default <- goarch ~ `ppc64` && date != "" && date > "2023-05-22"
2023-07-10 19:19 linux-ppc64-sid-power10 go@07ede7a5 os/exec.TestConcurrentExec (log)
--- FAIL: TestConcurrentExec (0.00s)
    exec_test.go:1757: context canceled

watchflakes

@cherrymui
Copy link
Member

The new failure is unrelated.

@cherrymui
Copy link
Member

cherrymui commented Jul 11, 2023

Removed from the "Test Flakes" project so watchflakes will not look at this. It's been long enough so I think we're confident that the original issue is fixed.

@bcmills
Copy link
Contributor

bcmills commented Jul 11, 2023

Removed from the "Test Flakes" project so watchflakes will not look at this. It's been long enough so I think we're confident that the original issue is fixed.

That resulted in watchflakes filing a bunch of duplicates with the old logs. 😅
(It needs the old issue to remain in place in order to find where it has already reported those logs.)

I have re-added this issue to the “Test Flakes” project, but added an upper bound on the date to prevent watchflakes from reopening it for unrelated failures.

@cherrymui
Copy link
Member

Bummer. Sorry about that. Thanks for the fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arch-ppc64x compiler/runtime Issues related to the Go compiler and/or runtime. release-blocker
Projects
Status: No status
Development

No branches or pull requests

8 participants