Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: 32-bit random data corruption #43570

Closed
erikwilson opened this issue Jan 7, 2021 · 10 comments
Closed

cmd/compile: 32-bit random data corruption #43570

erikwilson opened this issue Jan 7, 2021 · 10 comments
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@erikwilson
Copy link

erikwilson commented Jan 7, 2021

What version of Go are you using (go version)?

$ go version
go version go1.15.6 linux/arm

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="arm"
GOBIN=""
GOCACHE="/home/pi-star/.cache/go-build"
GOENV="/home/pi-star/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="arm"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/pi-star/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/pi-star/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_arm"
GCCGO="gccgo"
GOARM="6"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/pi-star/go/src/github.com/golang/32bit-bug/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -marm -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build227798987=/tmp/go-build -gno-record-gcc-switches"

What did you do?

While running a 32-bit golang executable random data corruption may occur. Here is an example Kubernetes issue which is able to reproduce the problem with 32-bit arm and 386:
kubernetes/kubernetes#97685

The unit test is compiled and attached as a standalone example here:
32bit-bug.tar.gz

Test code, commands, armv6 binaries, and relevant objdump data is available in above tarball.

If the test is compiled without optimizations the test will pass (on arm at least).

(edit: remove link to aarch64 issue)

What did you expect to see?

$ ./passing-arm32-test 
PASS

What did you see instead?

$ ./failing-arm32-test 
--- FAIL: TestSampler (1.06s)
    sample_and_watermark_test.go:48: t0=2021-01-07 11:24:28.746138334 -0700 MST m=+0.016036811
panic: that was a bad year!: 0011-12-20 14:41:53.017034732 -0728 LMT [recovered]
	panic: that was a bad year!: 0011-12-20 14:41:53.017034732 -0728 LMT

goroutine 19 [running]:
testing.tRunner.func1.1(0x374e38, 0x122d068)
	/usr/local/go/src/testing/testing.go:1072 +0x264
testing.tRunner.func1(0x1082380)
	/usr/local/go/src/testing/testing.go:1075 +0x364
panic(0x374e38, 0x122d068)
	/usr/local/go/src/runtime/panic.go:969 +0x158
github.com/golang/32bit-bug.(*sampleAndWaterMarkObserverGenerator).quantize(0x10a6150, 0x103edec, 0x103ed78, 0x14a13f43, 0x0, 0x654a80, 0x80c40200, 0x434e01)
	/home/pi-star/go/src/github.com/golang/32bit-bug/sample_and_watermark.go:99 +0x164
github.com/golang/32bit-bug.(*sampleAndWaterMarkHistograms).innerSet.func1(0x10b8080, 0x103edec, 0x103edd0, 0x103ee00, 0x103edc4, 0x103ee48)
	/home/pi-star/go/src/github.com/golang/32bit-bug/sample_and_watermark.go:175 +0xb4
github.com/golang/32bit-bug.(*sampleAndWaterMarkHistograms).innerSet(0x10b8080, 0x103ee48)
	/home/pi-star/go/src/github.com/golang/32bit-bug/sample_and_watermark.go:204 +0x70
github.com/golang/32bit-bug.(*sampleAndWaterMarkHistograms).Set(0x10b8080, 0x0, 0x3ff00000)
	/home/pi-star/go/src/github.com/golang/32bit-bug/sample_and_watermark.go:155 +0x54
github.com/golang/32bit-bug.TestSampler(0x1082380)
	/home/pi-star/go/src/github.com/golang/32bit-bug/sample_and_watermark_test.go:57 +0x454
testing.tRunner(0x1082380, 0x3ea0c4)
	/usr/local/go/src/testing/testing.go:1123 +0xbc
created by testing.(*T).Run
	/usr/local/go/src/testing/testing.go:1168 +0x220
@erikwilson
Copy link
Author

Thanks @rkojedzinszky @neolit123 @MikeSpreitzer @brandond @liggitt for help reporting & investigating

@bcmills
Copy link
Contributor

bcmills commented Jan 7, 2021

Is this still reproducible with a compiler built after CL 274026?

If not, it may be a duplicate of #42876.

@bcmills
Copy link
Contributor

bcmills commented Jan 7, 2021

Hrm, but that wouldn't explain a repro on 386.

@randall77
Copy link
Contributor

I can reproduce on 386.
It does not reproduce with GOGC=off.

Looks like something is getting clobbered. If I print out the good and bad time structs in hex, I get

85432914 bff5fbf2 0013edd9 00000000 0868a6c0    // good
0a44bde4 0a44bd70 0f446329 00000000 0868a6c0   // bad

The first 2 words there are valid pointers to the heap. Not sure about the 3rd one.

I think this only happens on 32-bit because on 32-bit the time.Time structure is not SSA-able (and thus is always in memory and copied using duffcopy) but on 64-bit it is SSA-able (and thus can live in registers).

I belive the bad code is in (*sampleAndWaterMarkHistograms).innerSet.func1, although I'm not sure yet. My guess is that the return value from Now is somehow getting clobbered before we can copy it out of the outargs region.

Still looking.

@randall77 randall77 self-assigned this Jan 7, 2021
@brandond
Copy link

brandond commented Jan 7, 2021

FWIW, moving the time.Time struct into func1 seems to prevent it from getting clobbered. The in-tree fix using this approach is at brandond/kubernetes@1a4e511

@randall77
Copy link
Contributor

I think I see the problem. We have (irrelevant code elided):

v52 (176) = InterCall <mem> [24] v49 v51  // Call to Now (sample_and_watermark.go, 3rd line of (*sampleAndWaterMarkHistograms).innerSet.func1)
v53 (?) = OffPtr <*time.Time> [4] v2 // address of return value of Now (in outargs section - very volatile!)
v54 (176) = Move <mem> {time.Time} [20] v15 v53 v52 // save it to destination (v15, the heap storage for the `when` variable)
v60 (177) = Move <mem> {time.Time} [20] v59 v15 v58 // copy from `when` to a temporary (v59, address of a stack slot)
v64 (177) = Move <mem> {time.Time} [20] v53 v63 v62 // copy from temporary (v63, == v59) to the outargs section
v65 (177) = StaticCall <mem> {"".(*sampleAndWaterMarkObserverGenerator).quantize} [32] v64 // call `quantize`

That's all fine. But then we run the opt pass, which has this rule:

(Move {t} [n] dst1 src1 move:(Move {t} [n] dst2 _ mem))
	&& move.Uses == 1
	&& isSamePtr(dst1, dst2) && disjoint(src1, n, dst2, n)
	&& clobber(move)
	=> (Move {t} [n] dst1 src1 mem)

Which basically says, if you're moving from a to b, and then b to c, might as well move directly from a to c.
But, that rule isn't valid if a is a volatile location. We need to get data out of a ASAP.
(Volatile here means can get clobbered by a call.)

After the opt pass, we have the same code, except v60 copies from v53 instead of v15. Which is bad, because that source is the volatile memory area. It's only a problem if there is an intervening call. In this case there is one, but only if the write barrier is enabled. So the corruption doesn't happen until the write barrier is turned on. (The write barrier replaces v54 with a bunch of code that may call runtime.typedmemmove.)

I think the right fix is to add an extra condition to the move optimization rule to disable it if the source is a volatile region.

@randall77
Copy link
Contributor

I have a small repro that also works on 64 bit.
It's been broken since 1.12.
@gopherbot please open backport issues.

@gopherbot
Copy link

Backport issue(s) opened: #43574 (for 1.14), #43575 (for 1.15).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

@gopherbot
Copy link

Change https://golang.org/cl/282492 mentions this issue: cmd/compile: don't short-circuit copies whose source is volatile

@erikwilson
Copy link
Author

Thanks @randall77 for the fast analysis & fix!

MikeSpreitzer added a commit to MikeSpreitzer/kubernetes that referenced this issue Jan 8, 2021
MikeSpreitzer added a commit to MikeSpreitzer/kubernetes that referenced this issue Jan 8, 2021
MikeSpreitzer added a commit to MikeSpreitzer/kubernetes that referenced this issue Jan 8, 2021
MikeSpreitzer added a commit to MikeSpreitzer/kubernetes that referenced this issue Jan 8, 2021
k8s-publishing-bot pushed a commit to kubernetes/apiserver that referenced this issue Jan 9, 2021
to tiptoe around golang/go#43570 for #97685

Kubernetes-commit: 611184aa59d0cd40466bc3bc4b40a3712a038171
k8s-publishing-bot pushed a commit to kubernetes/apiserver that referenced this issue Jan 9, 2021
to tiptoe around golang/go#43570 for #97685

Kubernetes-commit: d8a1dfb21f1f78595d1aaf9985eb58ee50edc940
adamzhoul pushed a commit to adamzhoul/kubernetes that referenced this issue Jan 11, 2021
tkashem added a commit to tkashem/kubernetes that referenced this issue Jan 12, 2021
…innerSet

move all variables in sampleAndWaterMarkHistograms::innerSet
to tiptoe around golang/go#43570 for kubernetes#97685
tkashem added a commit to tkashem/kubernetes that referenced this issue Jan 13, 2021
…innerSet

move all variables in sampleAndWaterMarkHistograms::innerSet
to tiptoe around golang/go#43570 for kubernetes#97685
@dmitshur dmitshur changed the title 32-bit random data corruption cmd/compile: 32-bit random data corruption Jan 21, 2021
@dmitshur dmitshur added this to the Go1.16 milestone Jan 21, 2021
@dmitshur dmitshur added the NeedsFix The path to resolution is known, but the work has not been done. label Jan 21, 2021
tkashem pushed a commit to tkashem/kubernetes that referenced this issue Jan 26, 2021
k8s-publishing-bot pushed a commit to kubernetes/apiserver that referenced this issue Jan 26, 2021
to tiptoe around golang/go#43570 for #97685

Kubernetes-commit: f393cb405bd04b9bcc014b7852e8c76ee90eb418
@golang golang locked and limited conversation to collaborators Jan 21, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

7 participants
@brandond @dmitshur @erikwilson @bcmills @randall77 @gopherbot and others