Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: Array/slice clearing idiom inconsistently recognized #56954

Closed
greatroar opened this issue Nov 28, 2022 · 2 comments
Closed

cmd/compile: Array/slice clearing idiom inconsistently recognized #56954

greatroar opened this issue Nov 28, 2022 · 2 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge

Comments

@greatroar
Copy link

greatroar commented Nov 28, 2022

What version of Go are you using (go version)?

$ gotip version
go version devel go1.20-b1678e508b Wed Nov 16 04:04:52 2022 +0000 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What did you do?

package p

const N = 1 << 22

type T struct{ a, b int32 }

func arrayclear1(p *[N]T) { *p = [N]T{} }

func arrayclear2(p *[N]T) {
	for i := range p {
		p[i] = T{}
	}
}

func arrayclear3(p *[N]T) {
	for i := range p[:] {
		p[i] = T{}
	}
}

What did you expect to see?

Same code generated for all three.

What did you see instead?

arrayclear1 becomes REP STOSQ on amd64, REP STOSL on 386, a loop doing STP from ZR on arm64 and a simple loop on arm.

arrayclear2 becomes a call to runtime.memclrNoHeapPointers.

arrayclear3, which represents a library function that I found to be a hot spot, becomes a loop that sets one element at a time. On 386 and arm, that loop even includes a bounds check.

I suspected this might have to do with conversion to slice, but slice clearing is similarly inconsistent, producing different code depending on whether [:] (or [:n]) is present or not.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Nov 28, 2022
@randall77
Copy link
Contributor

arrayclear2 calls into the runtime because the clearing detection doesn't have a special case for constant bounds like those that occur when iterating over pointer-to-arrays.
arrayclear3 doesn't detect that this is a clearing situation because it doesn't understand the relationship between p[:] (a slice) and p[i] (an index into a pointer-to-array). In fact, this is also why the bounds checks remain.
It should work to do

func arrayclear3(p *[N]T) {
	s := p[:]
	for i := range s {
		s[i] = T{}
	}
}

It's not clear to me that any of these are particularly important. Certainly it is easy to work around if you notice it.

It's not clear to me that 1 is faster than 2. I think it is on some more recent machines, but that's not why we chose it - sometimes we do memclrs in situations where calls are not allowed, so things like REP STOSQ or explicit loops are our only choice. The runtime call uses SSE/AVX/... which can be faster, especially for large memory regions.

I suspected this might have to do with conversion to slice, but slice clearing is similarly inconsistent, producing different code depending on whether [:] (or [:n]) is present or not.

This can be subtle. For instance,

func arrayclear4(s []T, n int) {
	for i := range s[:n] {
		s[i] = T{}
	}
}

We can't rewrite to a clear of some sort, because if len(s)<n<=cap(s), then this code should bounds-check panic when i reaches len(s).

@greatroar
Copy link
Author

It's not clear to me that 1 is faster than 2.

My benchmarks suggest that they're about equally fast, on an Intel with the ERMS feature.

I'll patch that code, feel free to close if you deem this too unimportant.

greatroar added a commit to greatroar/compress that referenced this issue Nov 29, 2022
Benchmark results on amd64 below. These do not take into account klauspost#701.
They're on Go 1.19; Go 1.20 produces slightly better asm for the old
code, but still produces terrible asm on 32-bit platforms.

See also golang/go#56954.

name                                 old speed      new speed       delta
Encoder_EncodeAllXML-8                283MB/s ± 1%    284MB/s ± 0%     ~     (p=0.026 n=30+20)
Encoder_EncodeAllSimple/fastest-8     111MB/s ± 0%    111MB/s ± 1%     ~     (p=0.011 n=28+20)
Encoder_EncodeAllSimple/default-8    78.4MB/s ± 1%   78.3MB/s ± 1%     ~     (p=0.572 n=30+19)
Encoder_EncodeAllSimple/better-8     65.9MB/s ± 1%   66.2MB/s ± 1%   +0.53%  (p=0.009 n=30+20)
Encoder_EncodeAllSimple/best-8       11.1MB/s ± 1%   11.6MB/s ± 3%   +4.42%  (p=0.000 n=27+28)
Encoder_EncodeAllSimple4K/fastest-8   911MB/s ± 1%    914MB/s ± 1%   +0.31%  (p=0.004 n=29+20)
Encoder_EncodeAllSimple4K/default-8  73.1MB/s ± 1%   73.6MB/s ± 1%   +0.67%  (p=0.000 n=29+20)
Encoder_EncodeAllSimple4K/better-8   60.5MB/s ± 1%   62.7MB/s ± 1%   +3.64%  (p=0.000 n=29+17)
Encoder_EncodeAllSimple4K/best-8     8.62MB/s ± 3%  10.11MB/s ± 1%  +17.24%  (p=0.000 n=30+27)
Encoder_EncodeAllHTML-8               133MB/s ± 1%    133MB/s ± 1%     ~     (p=0.101 n=30+19)
Encoder_EncodeAllTwain-8             84.8MB/s ± 1%   86.2MB/s ± 3%   +1.63%  (p=0.000 n=24+20)
Encoder_EncodeAllPi-8                62.6MB/s ± 1%   62.7MB/s ± 0%     ~     (p=0.102 n=30+20)
Random4KEncodeAllFastest-8           2.50GB/s ± 1%   2.50GB/s ± 1%     ~     (p=0.449 n=29+20)
Random10MBEncodeAllFastest-8         2.39GB/s ± 2%   2.52GB/s ± 6%   +5.23%  (p=0.000 n=27+20)

name                                 old alloc/op   new alloc/op    delta
Encoder_EncodeAllXML-8                  0.00B           0.00B          ~     (all equal)
Encoder_EncodeAllSimple/fastest-8       2.73B ±27%      3.00B ± 0%     ~     (p=0.018 n=30+18)
Encoder_EncodeAllSimple/default-8       4.00B ± 0%      4.00B ± 0%     ~     (all equal)
Encoder_EncodeAllSimple/better-8        5.00B ± 0%      5.00B ± 0%     ~     (all equal)
Encoder_EncodeAllSimple/best-8          19.5B ± 3%      19.0B ± 0%   -2.40%  (p=0.000 n=30+24)
Encoder_EncodeAllSimple4K/fastest-8     0.00B           0.00B          ~     (all equal)
Encoder_EncodeAllSimple4K/default-8     0.00B           0.00B          ~     (all equal)
Encoder_EncodeAllSimple4K/better-8      0.00B           0.00B          ~     (all equal)
Encoder_EncodeAllSimple4K/best-8        2.00B ± 0%      1.43B ±40%  -28.33%  (p=0.000 n=30+30)
Encoder_EncodeAllHTML-8                 2.37B ±27%      2.25B ±33%     ~     (p=0.398 n=30+20)
Encoder_EncodeAllTwain-8                0.00B           0.00B          ~     (all equal)
Encoder_EncodeAllPi-8                   12.4B ± 5%      12.2B ± 6%     ~     (p=0.283 n=30+20)
Random4KEncodeAllFastest-8              0.00B           0.00B          ~     (all equal)
Random10MBEncodeAllFastest-8           31.9kB ± 2%     30.5kB ± 9%   -4.27%  (p=0.002 n=28+20)
greatroar added a commit to greatroar/compress that referenced this issue Nov 29, 2022
Benchmark results on amd64 below. These do not take into account klauspost#701.
They're for Go 1.19; Go 1.20 produces slightly better asm for the old
code, but still produces terrible asm on 32-bit platforms.

See also golang/go#56954.

name                                 old speed      new speed       delta
Encoder_EncodeAllXML-8                283MB/s ± 1%    284MB/s ± 0%     ~     (p=0.026 n=30+20)
Encoder_EncodeAllSimple/fastest-8     111MB/s ± 0%    111MB/s ± 1%     ~     (p=0.011 n=28+20)
Encoder_EncodeAllSimple/default-8    78.4MB/s ± 1%   78.3MB/s ± 1%     ~     (p=0.572 n=30+19)
Encoder_EncodeAllSimple/better-8     65.9MB/s ± 1%   66.2MB/s ± 1%   +0.53%  (p=0.009 n=30+20)
Encoder_EncodeAllSimple/best-8       11.1MB/s ± 1%   11.6MB/s ± 3%   +4.42%  (p=0.000 n=27+28)
Encoder_EncodeAllSimple4K/fastest-8   911MB/s ± 1%    914MB/s ± 1%   +0.31%  (p=0.004 n=29+20)
Encoder_EncodeAllSimple4K/default-8  73.1MB/s ± 1%   73.6MB/s ± 1%   +0.67%  (p=0.000 n=29+20)
Encoder_EncodeAllSimple4K/better-8   60.5MB/s ± 1%   62.7MB/s ± 1%   +3.64%  (p=0.000 n=29+17)
Encoder_EncodeAllSimple4K/best-8     8.62MB/s ± 3%  10.11MB/s ± 1%  +17.24%  (p=0.000 n=30+27)
Encoder_EncodeAllHTML-8               133MB/s ± 1%    133MB/s ± 1%     ~     (p=0.101 n=30+19)
Encoder_EncodeAllTwain-8             84.8MB/s ± 1%   86.2MB/s ± 3%   +1.63%  (p=0.000 n=24+20)
Encoder_EncodeAllPi-8                62.6MB/s ± 1%   62.7MB/s ± 0%     ~     (p=0.102 n=30+20)
Random4KEncodeAllFastest-8           2.50GB/s ± 1%   2.50GB/s ± 1%     ~     (p=0.449 n=29+20)
Random10MBEncodeAllFastest-8         2.39GB/s ± 2%   2.52GB/s ± 6%   +5.23%  (p=0.000 n=27+20)

name                                 old alloc/op   new alloc/op    delta
Encoder_EncodeAllXML-8                  0.00B           0.00B          ~     (all equal)
Encoder_EncodeAllSimple/fastest-8       2.73B ±27%      3.00B ± 0%     ~     (p=0.018 n=30+18)
Encoder_EncodeAllSimple/default-8       4.00B ± 0%      4.00B ± 0%     ~     (all equal)
Encoder_EncodeAllSimple/better-8        5.00B ± 0%      5.00B ± 0%     ~     (all equal)
Encoder_EncodeAllSimple/best-8          19.5B ± 3%      19.0B ± 0%   -2.40%  (p=0.000 n=30+24)
Encoder_EncodeAllSimple4K/fastest-8     0.00B           0.00B          ~     (all equal)
Encoder_EncodeAllSimple4K/default-8     0.00B           0.00B          ~     (all equal)
Encoder_EncodeAllSimple4K/better-8      0.00B           0.00B          ~     (all equal)
Encoder_EncodeAllSimple4K/best-8        2.00B ± 0%      1.43B ±40%  -28.33%  (p=0.000 n=30+30)
Encoder_EncodeAllHTML-8                 2.37B ±27%      2.25B ±33%     ~     (p=0.398 n=30+20)
Encoder_EncodeAllTwain-8                0.00B           0.00B          ~     (all equal)
Encoder_EncodeAllPi-8                   12.4B ± 5%      12.2B ± 6%     ~     (p=0.283 n=30+20)
Random4KEncodeAllFastest-8              0.00B           0.00B          ~     (all equal)
Random10MBEncodeAllFastest-8           31.9kB ± 2%     30.5kB ± 9%   -4.27%  (p=0.002 n=28+20)
greatroar added a commit to greatroar/compress that referenced this issue Nov 29, 2022
Benchmark results on amd64 below. These do not take into account klauspost#701.
They're for Go 1.19; Go 1.20 produces slightly better asm for the old
code, but still produces pretty bad asm on 32-bit platforms.

See also golang/go#56954.

name                                 old speed      new speed       delta
Encoder_EncodeAllXML-8                283MB/s ± 1%    284MB/s ± 0%     ~     (p=0.026 n=30+20)
Encoder_EncodeAllSimple/fastest-8     111MB/s ± 0%    111MB/s ± 1%     ~     (p=0.011 n=28+20)
Encoder_EncodeAllSimple/default-8    78.4MB/s ± 1%   78.3MB/s ± 1%     ~     (p=0.572 n=30+19)
Encoder_EncodeAllSimple/better-8     65.9MB/s ± 1%   66.2MB/s ± 1%   +0.53%  (p=0.009 n=30+20)
Encoder_EncodeAllSimple/best-8       11.1MB/s ± 1%   11.6MB/s ± 3%   +4.42%  (p=0.000 n=27+28)
Encoder_EncodeAllSimple4K/fastest-8   911MB/s ± 1%    914MB/s ± 1%   +0.31%  (p=0.004 n=29+20)
Encoder_EncodeAllSimple4K/default-8  73.1MB/s ± 1%   73.6MB/s ± 1%   +0.67%  (p=0.000 n=29+20)
Encoder_EncodeAllSimple4K/better-8   60.5MB/s ± 1%   62.7MB/s ± 1%   +3.64%  (p=0.000 n=29+17)
Encoder_EncodeAllSimple4K/best-8     8.62MB/s ± 3%  10.11MB/s ± 1%  +17.24%  (p=0.000 n=30+27)
Encoder_EncodeAllHTML-8               133MB/s ± 1%    133MB/s ± 1%     ~     (p=0.101 n=30+19)
Encoder_EncodeAllTwain-8             84.8MB/s ± 1%   86.2MB/s ± 3%   +1.63%  (p=0.000 n=24+20)
Encoder_EncodeAllPi-8                62.6MB/s ± 1%   62.7MB/s ± 0%     ~     (p=0.102 n=30+20)
Random4KEncodeAllFastest-8           2.50GB/s ± 1%   2.50GB/s ± 1%     ~     (p=0.449 n=29+20)
Random10MBEncodeAllFastest-8         2.39GB/s ± 2%   2.52GB/s ± 6%   +5.23%  (p=0.000 n=27+20)

name                                 old alloc/op   new alloc/op    delta
Encoder_EncodeAllXML-8                  0.00B           0.00B          ~     (all equal)
Encoder_EncodeAllSimple/fastest-8       2.73B ±27%      3.00B ± 0%     ~     (p=0.018 n=30+18)
Encoder_EncodeAllSimple/default-8       4.00B ± 0%      4.00B ± 0%     ~     (all equal)
Encoder_EncodeAllSimple/better-8        5.00B ± 0%      5.00B ± 0%     ~     (all equal)
Encoder_EncodeAllSimple/best-8          19.5B ± 3%      19.0B ± 0%   -2.40%  (p=0.000 n=30+24)
Encoder_EncodeAllSimple4K/fastest-8     0.00B           0.00B          ~     (all equal)
Encoder_EncodeAllSimple4K/default-8     0.00B           0.00B          ~     (all equal)
Encoder_EncodeAllSimple4K/better-8      0.00B           0.00B          ~     (all equal)
Encoder_EncodeAllSimple4K/best-8        2.00B ± 0%      1.43B ±40%  -28.33%  (p=0.000 n=30+30)
Encoder_EncodeAllHTML-8                 2.37B ±27%      2.25B ±33%     ~     (p=0.398 n=30+20)
Encoder_EncodeAllTwain-8                0.00B           0.00B          ~     (all equal)
Encoder_EncodeAllPi-8                   12.4B ± 5%      12.2B ± 6%     ~     (p=0.283 n=30+20)
Random4KEncodeAllFastest-8              0.00B           0.00B          ~     (all equal)
Random10MBEncodeAllFastest-8           31.9kB ± 2%     30.5kB ± 9%   -4.27%  (p=0.002 n=28+20)
klauspost added a commit to klauspost/compress that referenced this issue Nov 30, 2022
Benchmark results on amd64 below. These do not take into account #701.
They're for Go 1.19; Go 1.20 produces slightly better asm for the old
code, but still produces pretty bad asm on 32-bit platforms.

See also golang/go#56954.

name                                 old speed      new speed       delta
Encoder_EncodeAllXML-8                283MB/s ± 1%    284MB/s ± 0%     ~     (p=0.026 n=30+20)
Encoder_EncodeAllSimple/fastest-8     111MB/s ± 0%    111MB/s ± 1%     ~     (p=0.011 n=28+20)
Encoder_EncodeAllSimple/default-8    78.4MB/s ± 1%   78.3MB/s ± 1%     ~     (p=0.572 n=30+19)
Encoder_EncodeAllSimple/better-8     65.9MB/s ± 1%   66.2MB/s ± 1%   +0.53%  (p=0.009 n=30+20)
Encoder_EncodeAllSimple/best-8       11.1MB/s ± 1%   11.6MB/s ± 3%   +4.42%  (p=0.000 n=27+28)
Encoder_EncodeAllSimple4K/fastest-8   911MB/s ± 1%    914MB/s ± 1%   +0.31%  (p=0.004 n=29+20)
Encoder_EncodeAllSimple4K/default-8  73.1MB/s ± 1%   73.6MB/s ± 1%   +0.67%  (p=0.000 n=29+20)
Encoder_EncodeAllSimple4K/better-8   60.5MB/s ± 1%   62.7MB/s ± 1%   +3.64%  (p=0.000 n=29+17)
Encoder_EncodeAllSimple4K/best-8     8.62MB/s ± 3%  10.11MB/s ± 1%  +17.24%  (p=0.000 n=30+27)
Encoder_EncodeAllHTML-8               133MB/s ± 1%    133MB/s ± 1%     ~     (p=0.101 n=30+19)
Encoder_EncodeAllTwain-8             84.8MB/s ± 1%   86.2MB/s ± 3%   +1.63%  (p=0.000 n=24+20)
Encoder_EncodeAllPi-8                62.6MB/s ± 1%   62.7MB/s ± 0%     ~     (p=0.102 n=30+20)
Random4KEncodeAllFastest-8           2.50GB/s ± 1%   2.50GB/s ± 1%     ~     (p=0.449 n=29+20)
Random10MBEncodeAllFastest-8         2.39GB/s ± 2%   2.52GB/s ± 6%   +5.23%  (p=0.000 n=27+20)

name                                 old alloc/op   new alloc/op    delta
Encoder_EncodeAllXML-8                  0.00B           0.00B          ~     (all equal)
Encoder_EncodeAllSimple/fastest-8       2.73B ±27%      3.00B ± 0%     ~     (p=0.018 n=30+18)
Encoder_EncodeAllSimple/default-8       4.00B ± 0%      4.00B ± 0%     ~     (all equal)
Encoder_EncodeAllSimple/better-8        5.00B ± 0%      5.00B ± 0%     ~     (all equal)
Encoder_EncodeAllSimple/best-8          19.5B ± 3%      19.0B ± 0%   -2.40%  (p=0.000 n=30+24)
Encoder_EncodeAllSimple4K/fastest-8     0.00B           0.00B          ~     (all equal)
Encoder_EncodeAllSimple4K/default-8     0.00B           0.00B          ~     (all equal)
Encoder_EncodeAllSimple4K/better-8      0.00B           0.00B          ~     (all equal)
Encoder_EncodeAllSimple4K/best-8        2.00B ± 0%      1.43B ±40%  -28.33%  (p=0.000 n=30+30)
Encoder_EncodeAllHTML-8                 2.37B ±27%      2.25B ±33%     ~     (p=0.398 n=30+20)
Encoder_EncodeAllTwain-8                0.00B           0.00B          ~     (all equal)
Encoder_EncodeAllPi-8                   12.4B ± 5%      12.2B ± 6%     ~     (p=0.283 n=30+20)
Random4KEncodeAllFastest-8              0.00B           0.00B          ~     (all equal)
Random10MBEncodeAllFastest-8           31.9kB ± 2%     30.5kB ± 9%   -4.27%  (p=0.002 n=28+20)

Co-authored-by: Klaus Post <klauspost@gmail.com>
@golang golang locked and limited conversation to collaborators Nov 28, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge
Projects
None yet
Development

No branches or pull requests

3 participants