Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: miscompilation of partially-overlapping array assignments #54467

Closed
mdempsky opened this issue Aug 15, 2022 · 22 comments
Closed

cmd/compile: miscompilation of partially-overlapping array assignments #54467

mdempsky opened this issue Aug 15, 2022 · 22 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@mdempsky
Copy link
Member

https://go.dev/play/p/QL9pJw6LY9e

When p and q are pointers to arrays, the behavior of *p = *q should be as if *q is fully evaluated before any assignments through *p take effect.

The above program however demonstrates (at least on GOARCH=amd64) that when they're pointers to partially-overlapping arrays, the evaluation of *q and assignments through *p are interleaved.

/cc @golang/compiler

@mdempsky mdempsky added the NeedsFix The path to resolution is known, but the work has not been done. label Aug 15, 2022
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Aug 15, 2022
@randall77
Copy link
Contributor

Yep, I think OpMove assumes nonoverlapping source and destination (except for those generated from a memmove). Casting slices to array pointers violates that.
I think we need a way to mark OpMove as known disjoint, and don't set that for *p=*q cases. Other cases, like moving data from stack to heap, are known disjoint.

Speaking of which, rewrite.go:disjoint seems kinda sketchy now that I look at it. For instance, it depends on two OpAddrs being known disjoint, but that's only really true if CSE has been run. Otherwise we might have 2 OpAddrs pointing to the same variable. Maybe that can never happen currently because we only make one OpAddr per variable, but that invariant isn't written down, or checked, anywhere.

@cuiweixie
Copy link
Contributor

arch == amd64 seems ok!

@go101
Copy link

go101 commented Aug 17, 2022

Not ok on my amd64.

@mdempsky
Copy link
Member Author

@cuiweixie Why do you say amd64 seems ok?

The program I linked to above should run and exit without printing anything; but on the Go playground, which uses amd64, it prints that y[20] and y[21] have been assigned incorrect values.

@cuiweixie
Copy link
Contributor

package main

import "fmt"

func main() {
	var x [16]byte
	for i := range x {
		x[i] = byte(i)
	}
	y := x

	copy(x[1:11], x[0:10])
	*(*[10]byte)(y[1:11]) = *(*[10]byte)(y[0:10])
	for i := range x {
		//println("i", i)
		//if x[i] != y[i] {
			fmt.Printf("x[%v] = %v; y[%v] = %v\n", i, x[i], i, y[i])
		//}
	}
}

this code is ok in amd64, but not arm64!

@mknyszek mknyszek added this to the Go1.20 milestone Aug 17, 2022
@gopherbot
Copy link

Change https://go.dev/cl/425076 mentions this issue: cmd/compile: handle partially overlapping assignments

@randall77
Copy link
Contributor

@gopherbot please open backport issues for 1.18 and 1.17.

@gopherbot
Copy link

Backport issue(s) opened: #54603 (for 1.18).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

@randall77
Copy link
Contributor

@gopherbot please open a backport issue for 1.19.

(My off-by-one error: 1.17 is out of maintenance, so we don't need that backport.)

@W8CYE
Copy link

W8CYE commented Sep 21, 2022

The fix for the above issue on ARM64 appears to be treating the following code as a partially overlapping assignment and therefore calling runtime.memmove. The net result is an 80% performance penalty on benchmarks. Is this correct behavior, or has this simple use case been incorrectly tagged for this safer/slower approach?

type Array [2]int

func moveSliceOfArrayData(s []Array) []Array {
	for i := 1; i < len(s); i++ {
		s[i-1], s[i] = s[i], s[i-1]
	}
	return s
}

@go101
Copy link

go101 commented Sep 22, 2022

Is it good to unify the internal implementation of this and copy?

@gopherbot
Copy link

Change https://go.dev/cl/433056 mentions this issue: cmd/compile: use stricter rule for possible partial overlap

gopherbot pushed a commit that referenced this issue Sep 27, 2022
Partial overlaps can only happen for strict sub-pieces of larger arrays.
That's a much stronger condition than the current optimization rules.

Update #54467

Change-Id: I11e539b71099e50175f37ee78fddf69283f83ee5
Reviewed-on: https://go-review.googlesource.com/c/go/+/433056
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
@randall77
Copy link
Contributor

@W8CYE The performance regression you were seeing has now been fixed on tip.
I don' think we'll be backporting that fix, as we don't normally backport performance fixes. So you'll have to wait for 1.20.
You might be able to work around the issue by wrapping your type with a struct {} or something like that.

@W8CYE
Copy link

W8CYE commented Sep 27, 2022

@randall77 My ARM64 benchmarks have not returned to the 1.18.5 or 1.19.0 level. The tests in issue54467.go appear to be focused on AMD64 and not ARM64. Let me know if you would like me to open a proper issue. I have included my benchmarks below for reference.

go version devel go1.20-871a3a4 Tue Sep 27 21:00:51 2022 +0000 darwin/arm64
BenchmarkSliceOfArray 	  245748	      4264 ns/op
go version go1.19.1 darwin/arm64
BenchmarkSliceOfArray 	  229864	      4641 ns/op
go version go1.19 darwin/arm64
BenchmarkSliceOfArray 	  463353	      2592 ns/op

go version go1.18.6 darwin/arm64
BenchmarkSliceOfArray 	  243319	      4422 ns/op
go version go1.18.5 darwin/arm64
BenchmarkSliceOfArray 	  504110	      2378 ns/op
go version go1.18.4 darwin/arm64
BenchmarkSliceOfArray 	  503744	      2384 ns/op
go version go1.18.3 darwin/arm64
BenchmarkSliceOfArray 	  504073	      2379 ns/op
go version go1.18.2 darwin/arm64
BenchmarkSliceOfArray 	  422592	      2382 ns/op
go version go1.18.1 darwin/arm64
BenchmarkSliceOfArray 	  428832	      2382 ns/op
go version go1.18 darwin/arm64
BenchmarkSliceOfArray 	  421044	      2385 ns/op

@randall77
Copy link
Contributor

It fixes the code in #54467 (comment) for me.
If you could post a full reproducing file and instructions to run it, that might help.

The tests in issue54467.go appear to be focused on AMD64 and not ARM64.

True. The bug is arch-agnostic, just with slightly different constants. It only really needs a test on one arch.

Let me know if you would like me to open a proper issue.

That would be fine, or just post it here.

@W8CYE
Copy link

W8CYE commented Sep 28, 2022

@randall77

True. The bug is arch-agnostic, just with slightly different constants. It only really needs a test on one arch.

I have run the benchmarks on AMD64 and ARM64. What I see is the AMD64 benchmarks are consistent from 1.18 -> TIP, but ARM64 shows significant spikes for 1.18.6, 1.19.1, and TIP.

go<version> version
GOMAXPROCS=1 go<vesion> test -bench=BenchmarkSliceOfArray | grep BenchmarkSliceOfArray
darwin/amd64
go version devel go1.20-53773a5 Wed Sep 28 11:50:58 2022 +0000 darwin/amd64
BenchmarkSliceOfArray 	  423855	      2818 ns/op
go version go1.19.1 darwin/amd64
BenchmarkSliceOfArray 	  426967	      2803 ns/op
go version go1.19 darwin/amd64
BenchmarkSliceOfArray 	  431457	      2784 ns/op

go version go1.18.6 darwin/amd64
BenchmarkSliceOfArray 	  432540	      2782 ns/op
go version go1.18.5 darwin/amd64
BenchmarkSliceOfArray 	  431468	      2756 ns/op
go version go1.18.4 darwin/amd64
BenchmarkSliceOfArray 	  424591	      2790 ns/op
go version go1.18.3 darwin/amd64
BenchmarkSliceOfArray 	  434716	      2721 ns/op
go version go1.18.2 darwin/amd64
BenchmarkSliceOfArray 	  432724	      2779 ns/op
go version go1.18.1 darwin/amd64
BenchmarkSliceOfArray 	  376879	      2782 ns/op
go version go1.18 darwin/amd64
BenchmarkSliceOfArray 	  432793	      2744 ns/op
darwin/arm64
go version devel go1.20-53773a5 Wed Sep 28 11:50:58 2022 +0000 darwin/arm64
BenchmarkSliceOfArray     343330              3516 ns/op
go version go1.19.1 darwin/arm64
BenchmarkSliceOfArray     271564              3717 ns/op
go version go1.19 darwin/arm64
BenchmarkSliceOfArray     690951              1703 ns/op

go version go1.18.6 darwin/arm64
BenchmarkSliceOfArray     280936              3702 ns/op
go version go1.18.5 darwin/arm64
BenchmarkSliceOfArray     702750              1687 ns/op
go version go1.18.4 darwin/arm64
BenchmarkSliceOfArray     703149              1684 ns/op
go version go1.18.3 darwin/arm64
BenchmarkSliceOfArray     707517              1692 ns/op
go version go1.18.2 darwin/arm64
BenchmarkSliceOfArray     710638              1685 ns/op
go version go1.18.1 darwin/arm64
BenchmarkSliceOfArray     704236              1687 ns/op
go version go1.18 darwin/arm64
BenchmarkSliceOfArray     706743              1685 ns/op

datamove.go

package datamove

type Array [2]int

func moveSliceOfArrayData(s []Array) []Array {
	for i := 1; i < len(s); i++ {
		s[i-1], s[i] = s[i], s[i-1]
	}
	return s
}

datamove_test.go

package datamove

import (
	"testing"
)

var resultArray []Array

func BenchmarkSliceOfArray(b *testing.B) {
	var r []Array
	s := make([]Array, 1000)
	for n := 0; n < b.N; n++ {
		r = moveSliceOfArrayData(s)
	}
	resultArray = r
}

@randall77
Copy link
Contributor

Hmm, yes I am seeing the performance changes you are. The memmove problem has been fixed, but there are other differences in the assembly between 1.19 and tip. If anything, the tip assembly looks better to me. But who am I to argue with the chip as to how fast assembly is?

I'm going to open a new issue, this is getting complicated.

@go101
Copy link

go101 commented Jan 12, 2023

Is there a reason why such assignments and the built-in copy don't share the same implementation?

package main

import "testing"

const N = 10000

var x = make([]byte, N)

func Benchmark_copy(b *testing.B) {
	for i := 0; i < b.N; i++ {
		copy(x, x)
	}
}

func Benchmark_assign(b *testing.B) {
	for i := 0; i < b.N; i++ {
		*(*[N]byte)(x) = *(*[N]byte)(x)
	}
}
Benchmark_copy-4     	1000000000	         0.3880 ns/op	       0 B/op	       0 allocs/op
Benchmark_assign-4   	 9201558	       133.0 ns/op	       0 B/op	       0 allocs/op

@randall77
Copy link
Contributor

Not sure. Benchmark_copy looks like the body has been completely compiled away (<1ns per iteration). I'd be more interested if that performance difference persists when actually doing some copy work.

@go101
Copy link

go101 commented Jan 12, 2023

It is not compiled away for all such cases:

package main

import "testing"

const N = 10000

var x = make([]byte, N+N)
var y = x[:N]

//go:noinline
func Copy(x, y []byte) {
	copy(x, y)
}

//go:noinline
func Copy2(x, y []byte) {
	*(*[N]byte)(x) = *(*[N]byte)(y)
}

func Benchmark_copy(b *testing.B) {
	for i := 0; i < b.N; i++ {
		Copy(x, y)
	}
}

func Benchmark_assign(b *testing.B) {
	for i := 0; i < b.N; i++ {
		Copy2(x, y)
	}
}

I think the built-in copy function checks whether or not the addresses of the first elements of the two slices are identical.
If it is true, then does nothing.

@randall77
Copy link
Contributor

That could be the difference.
Is the performance similar if you do var y = x[N:]?

@go101
Copy link

go101 commented Jan 12, 2023

Yes, then it is similar.

@golang golang locked and limited conversation to collaborators Jan 12, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

7 participants