cmd/compile: OR into memory is cheaper than MOV/BTSL/MOV on x86 #61694

Closed
aktau opened this issue Aug 1, 2023 · 22 comments

@aktau
Contributor

aktau commented Aug 1, 2023

What version of Go are you using (go version)?

$ go version
go1.21-dev +fe5af1532a

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

GOARCH=amd64

What did you do?

We have a code generator that generates a struct with setters. To track whether the setter has been called for a given field, we set the corresponding bit in a bitmap. The code looks like this:

func setBit(part *uint32, num uint32) {
	*part |= 1 << (num % 32)
}

type x struct {
	bitmap [4]uint32 // A bitmap recording whether "Set" was called on a given field.
	u      int32     // Imagine this is field number 8.
	v      int32     // Imagine this is field number 38.
}

func (m *x) SetV(val int32) {
	m.v = val
	setBit(&(m.bitmap[1]), 37)
}

func (m *x) SetU(val int32) {
	m.u = val
	setBit(&(m.bitmap[0]), 7)
}

What did you expect to see?

I expected similar instructions (with different operands) being generated for both setters.

What did you see instead?

SetU is ~30% slower than SetV, as measured in local benchmarks (on a Zen 4 machine). The relevant difference is (godbolt):

TEXT    main.(*x).SetV(SB), NOSPLIT|NOFRAME|ABIInternal, $0-16
        MOVL    BX, 20(AX)
        NOP
        ORL     $32, 4(AX)
        RET

TEXT    main.(*x).SetU(SB), NOSPLIT|NOFRAME|ABIInternal, $0-16
        MOVL    BX, 16(AX)
        MOVL    (AX), CX
        BTSL    $7, CX
        NOP
        MOVL    CX, (AX)
        RET

It seems like OR into memory does better than MOV/BTS/MOV.
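For context, a benchmark of the two setters might look like this (a hypothetical sketch, since the benchmark code actually used was not posted; it assumes the type x and its setters above live in the same package):

package x

import "testing"

var m x // package-level, so the compiler can't discard the stores

func BenchmarkSetU(b *testing.B) {
	for i := 0; i < b.N; i++ {
		m.SetU(int32(i))
	}
}

func BenchmarkSetV(b *testing.B) {
	for i := 0; i < b.N; i++ {
		m.SetV(int32(i))
	}
}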

According to https://www.uops.info/table.html, for Skylake-X and Zen 4 it seems the OR family is pound-for-pound (slightly) better than the BTS family:

                            ------------ Skylake-X ------------                       ------ Zen 4 ------
Instruction      Lat        TP            Uops    Ports                       Lat       TP     Uops
BTS (M32, I8)    [≤3;≤10]   1.00 / 1.00   3 / 4   1*p06+1*p23+1*p237+1*p4     [5;12]    2.00   4
OR  (M32, I32)   [≤3;≤10]   1.00 / 1.00   2 / 4   1*p0156+1*p23+1*p237+1*p4   [≤1;≤8]   0.56   2
BTS (R32, I8)    1          0.50 / 0.50   1 / 1   1*p06                       [1;2]     1.00   2
OR  (R32, I8)    1          0.25 / 0.25   1 / 1   1*p0156                     1         0.25   1

I didn't look up what those MOV instructions cost, but it's difficult to predict costs from individual operations on the complex processors of today. Things I didn't test (because the Go compiler doesn't generate/inline them):

  • MOV/OR/MOV
  • BTS memory,immediate

Some of the speedup may be due to the shorter instruction sequence, too.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Aug 1, 2023
@randall77
Contributor

We use BTS because the encoding is smaller.

// Convert ORconst into BTS, if the code gets smaller, with boundary being

We could convert back to OR when a memory-modify op is available. Or we could generate the memory-target versions of BTS. The one caveat is that we want to do that only for the constant BTS versions: the register-source BTS instructions are really weird and slow when applied to memory.
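For illustration, the two alternatives for the MOVL/BTSL/MOVL sequence in the report would be (a sketch):

	// memory-modify OR (the constant is 1<<7 = 128):
	ORL     $128, (AX)

	// memory-target BTS (constant form only):
	BTSL    $7, (AX)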

@merykitty

merykitty commented Aug 1, 2023

I believe saving 2 bytes is not worth the reduced throughput. The bt family has a reciprocal throughput of 0.5 on Intel pre-Alder Lake, compared with 0.25 for basic arithmetic instructions such as or. Worse, on Alder Lake bt can only execute on one port, which results in a reciprocal throughput of 1; and on Zen 4 only bt achieves a reciprocal throughput of 0.5, while the figure for bts/btr/btc is 1, because they decode to 2 uops.

@randall77
Contributor

Could you post some full benchmark code? My simple benchmark isn't showing any difference between the two.

package bts

import "testing"

func bts_mem(x *uint32) {
	*x |= 1 << 16
}
func bts_reg(x uint32) uint32 {
	return x | 1<<16
}
func BenchmarkBTSmem(b *testing.B) {
	var x uint32
	for i := 0; i < b.N; i++ {
		bts_mem(&x)
	}
	sink = x
}
func BenchmarkBTSreg(b *testing.B) {
	var x uint32
	for i := 0; i < b.N; i++ {
		x = bts_reg(x)
	}
	sink = x
}

var sink uint32

Enabling or disabling those rules I quoted (which have the effect of switching from BTS to OR) makes no difference.

cpu: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz

@merykitty

@randall77 Your benchmark does not work because it has a loop-carried dependency, i.e. the operation in each iteration depends on the result of the previous one, so the out-of-order engine cannot help: all operations execute sequentially. As a result, your benchmark only measures the latency of the two instructions, which is 1 for both, hence the similar results. To measure the throughput of the instructions, you need to interleave multiple independent operations. I made a quick-bench and the results show that the bts version is twice as slow as the or version, which is expected given the difference in throughput between the two instructions.

Thanks.

@randall77
Contributor

Here's an improved benchmark:

package bts

import "testing"

const N = 5

func BenchmarkBTSmem(b *testing.B) {
	x := [4]uint32{1, 2, 3, 4}
	for i := 0; i < b.N; i++ {
		x[0] |= 1 << N
		x[1] |= 1 << N
		x[2] |= 1 << N
		x[3] |= 1 << N
	}
	sink = x[0] + x[1] + x[2] + x[3]
}
func BenchmarkBTSreg(b *testing.B) {
	var x, y, z, w uint32
	x, y, z, w = 1, 2, 3, 4
	for i := 0; i < b.N; i++ {
		x |= 1 << N
		y |= 1 << N
		z |= 1 << N
		w |= 1 << N
	}
	sink = x + y + z + w
}

var sink uint32

Run with N==5 to get OR behavior and N==15 to get BTS behavior.

I'm still not seeing much difference. A bit better with the all-in-register version, but load/BTS/store is just as fast as OR to memory.

name       old time/op  new time/op  delta
BTSmem-16  1.48ns ± 0%  1.48ns ±10%     ~     (p=0.150 n=9+10)
BTSreg-16  0.67ns ± 3%  0.54ns ±12%  -20.07%  (p=0.000 n=9+8)

cpu: Intel(R) Xeon(R) W-3223 CPU @ 3.50GHz

@merykitty

For memory operands, the bottleneck is the store-forwarding latency, which is a whopping ~5 cycles; this overwhelms any other computation, since 4 ors can execute in 1 cycle and 4 btss in 2-4 cycles. Interleaving more arithmetic instructions into the loop can help expose the difference in throughput, but the benchmark gets more arbitrary at that point.

My point is that transforming an or into a bts is a negative optimisation with negligible benefits, so it should not be done. FYI, neither gcc nor clang performs this transformation, even at -Os.

@merykitty

Also, the OP ran their benchmark on a Zen 4 CPU, which has an exceptionally bad bts implementation; that explains the difference you did not observe on your Xeon.

@aktau
Contributor Author

aktau commented Aug 2, 2023

We use BTS because the encoding is smaller.

But MOV/BTS/MOV is not smaller than OR with a memory operand.

Also, the OP executed their benchmark on a Zen 4 CPU, which has exceptionally bad bts implementation, which explains the difference that you did not observe on a Xeon one.

I was wrong, my machine is actually Zen 2. Here are my results:

cpu: AMD Ryzen Threadripper PRO 3995WX 64-Cores
           │     BTS      │                  OR                  │
           │    sec/op    │    sec/op     vs base                │
BTSmem-128    2.232n ± 0%    1.728n ± 0%  -22.60% (p=0.000 n=10)
BTSreg-128   0.7203n ± 0%   0.4981n ± 0%  -30.86% (p=0.000 n=10)
geomean       1.268n        0.9276n       -26.85%

@randall77
Contributor

But MOV/BTS/MOV is not smaller than OR with a memory operand.

100% agree. I think the question here boils down to: should we fix this by adding BTSconst-to-memory instructions, or should we fix this by getting rid of BTSconst altogether?

@aktau
Contributor Author

aktau commented Aug 2, 2023

UPDATE: The size comparison below is not accurate: BTS encodes the bit index (log2 of the constant) as a one-byte immediate, which makes it smaller for large constants.
From a performance point of view, I'd be inclined to say we should prefer OR: looking forward, it is the more common instruction and thus the more likely target for optimization by chip builders.

New (correct) size note:

From a size point of view, I played around with https://defuse.ca/online-x86-assembler.htm#disassembly a little bit:

83 48 04 02             or     DWORD PTR [rax+0x4],0x2
0f ba 68 04 01          bts    DWORD PTR [rax+0x4],0x1
...
83 48 04 40             or     DWORD PTR [rax+0x4],0x40
0f ba 68 04 06          bts    DWORD PTR [rax+0x4],0x6
81 48 04 80 00 00 00    or     DWORD PTR [rax+0x4],0x80
0f ba 68 04 07          bts    DWORD PTR [rax+0x4],0x7
...
81 48 04 00 00 00 80    or     DWORD PTR [rax+0x4],0x80000000
0f ba 68 04 1f          bts    DWORD PTR [rax+0x4],0x1f

Indeed OR is smaller by one byte for bits 0..6, but bigger by two bytes for bits 7..31 (I assume the gap grows further for bits 32..63, but my Lua repl didn't make this easy to check).

Old (wrong) note:

From a size point of view, I played around with https://defuse.ca/online-x86-assembler.htm#disassembly a little bit:

bts    ecx,0x7
bts    rcx,0x7
bts    DWORD PTR [rax+0x4],0x7
or     ecx,0x20
or     rcx,0x20
or     DWORD PTR [rax+0x4],0x20

Yields:

0:  0f ba e9 07             bts    ecx,0x7 // Generated by Go
4:  48 0f ba e9 07          bts    rcx,0x7
9:  0f ba 68 04 07          bts    DWORD PTR [rax+0x4],0x7
e:  83 c9 20                or     ecx,0x20
11: 48 83 c9 20             or     rcx,0x20
15: 83 48 04 20             or     DWORD PTR [rax+0x4],0x20 // Generated by Go

Which makes it look like or has a smaller size too (both with and without a REX prefix), winning on both fronts. I must be missing something here.

Are we sure that

// Convert ORconst into BTS, if the code gets smaller, with boundary being
// (ORL $40,AX is 3 bytes, ORL $80,AX is 6 bytes).
((ORQ|XORQ)const [c] x) && isUint64PowerOfTwo(int64(c)) && uint64(c) >= 128
=> (BT(S|C)Qconst [int8(log32(c))] x)
((ORL|XORL)const [c] x) && isUint32PowerOfTwo(int64(c)) && uint64(c) >= 128
=> (BT(S|C)Lconst [int8(log32(c))] x)
((ORQ|XORQ) (MOVQconst [c]) x) && isUint64PowerOfTwo(c) && uint64(c) >= 128
=> (BT(S|C)Qconst [int8(log64(c))] x)
((ORL|XORL) (MOVLconst [c]) x) && isUint32PowerOfTwo(int64(c)) && uint64(c) >= 128
=> (BT(S|C)Lconst [int8(log32(c))] x)
are triggering? The constants in question are always smaller than 32, so I don't see how they would pass uint64(c) >= 128.

@randall77
Contributor

or DWORD PTR [rax+0x4],0x20

Use a larger constant. 0x20 fits in a (sign-extended) one-byte immediate, whereas 0x80 (which is 1<<7) does not.

The larger OR encoding is needed for constants of 128 and up, i.e. bit index 7 and larger. (Note the rules compare the OR constant c = 1<<bit against 128, not the bit index itself.)

@aktau
Contributor Author

aktau commented Aug 2, 2023

Sorry about that, I corrected it above. Indeed it's a one-byte advantage for OR for bits 0..6, and a two-byte disadvantage for bits 7..31.

It's hard for me to tell how significant that could be. In a binary I have lying around here:

$ stat --printf=%s hugebin
1954399656
$ go tool objdump hugebin | grep -vP '^\s*:0' | grep '\bOR..*\$' | wc -l
86483
$ go tool objdump hugebin | grep -vP '^\s*:0' | grep '\bBTS. \$' | wc -l
61242

So we'd add (61242*2)/1954399656 ≈ 0.0063% to the binary. That seems more or less alright, but it's also a sample size of one. Also, I'd note that removing the rewrite rules would perhaps eliminate the MOV/BTS/MOV transformation, making this a net gain.

I've also observed some C++-generated instructions while scanning this binary (it's a mixed Go/C++ binary): LLVM does generate the larger ORs.

@randall77
Contributor

gcc -O3 doesn't use BTS for any constant-index operations.
For register operands, even for things like x | (1<<50), gcc uses a movabs+or instead of a bts, even though the former is a lot bigger. (For -Os it uses OR up to the 32nd bit and BTS after that.)

For memory operands, gcc wisely always uses OR of 1-byte values: for bit indexes larger than 7, it just adds idx/8 to the address so the constant is always encoded in one byte. This worries me a bit because of how it might affect store forwarding, but I'm inclined to agree with gcc.
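Concretely, the byte-OR pattern looks something like this (a sketch; exact output varies by compiler and version):

void f(unsigned long long *p) {
        *p |= 1ULL << 37;
}
f:
        orb     $32, 4(%rdi)    # byte at offset 37/8 = 4, mask 1 << (37%8) = 0x20
        ret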

@merykitty

@randall77 I believe you can't do or with a byte memory operand in Go, because it violates atomicity. For example:

initial:
int x = 0 (0)

thread 1:
int a = x; (1)
int b = a | 1;
x = b; (2)

thread 2:
int c = x; (3)
int d = c + 0x102;
x = d; (4)

Now there are 4 cases:

  • (1) reads (0) and (3) reads (0): then b = 1 and d = 0x102
  • (1) reads (4) and (3) reads (0): then d = 0x102 and b = 0x103
  • (1) reads (0) and (3) reads (2): then b = 1 and d = 0x103
  • (1) reads (4) and (3) reads (2): then d = c + 0x102 = b + 0x102 = (a | 1) + 0x102 = (d | 1) + 0x102. The equation d = (d | 1) + 0x102 has no solution (this case also violates causality, but the memory model does not define that)

As a result, the final value of x can only be one of 1, 0x102, 0x103.

However, if you implement x |= 1 using a byte-width instruction, it can happen that both (1) and (3) read (0) and (4) is committed to memory before (2), which leads to a final value of x of 0x101.
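A concrete interleaving that produces 0x101, with x |= 1 lowered to a non-atomic byte-wide read-modify-write (a sketch):

thread 1: load  byte  [x]  -> 0x00     ; (1) reads (0)
thread 2: load  dword [x]  -> 0x00     ; (3) reads (0)
thread 2: store dword [x]  <- 0x102    ; (4)
thread 1: store byte  [x]  <- 0x01     ; (2) overwrites only the low byte
final:    x == 0x101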

@mknyszek mknyszek added this to the Go1.22 milestone Aug 2, 2023
@mknyszek mknyszek added Performance NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Aug 2, 2023
@randall77
Contributor

@merykitty I don't think the issue you raised is a showstopper, at least not directly. That code has a race in it, and we're allowed to do anything we want when there is a race.
It would be different if those reads and writes were atomic. So it is true that, if/when we do #61395, we should be careful not to apply this optimization to atomic ORs.

@merykitty

@randall77 The Go memory model mandates that:

a read r of a memory location x that is not larger than a machine word must observe some write w such that r does not happen before w and there is no write w' such that w happens before w' and w' happens before r. That is, each read must observe a value written by a preceding or concurrent write.

So it is not right that we can do anything we want in the presence of a data race.

@randall77
Contributor

I see, so you are claiming that we get a word-tearing violation if we use sub-word OR-with-memory instructions.
Hmm, have to think about that one.

@randall77
Contributor

randall77 commented Aug 3, 2023

Ok, I think I agree with @merykitty that we can't use narrower ORs: that leads to word tearing, which we promise not to do.
So we're back to deciding between code size and maybe some performance.

Proposal: use ORconst for anything that can be done with a single instruction, with both register and memory targets.
That leaves BTSconst just for setting bits 31 and higher in 64-bit destinations (for which we'd otherwise have to issue a separate movabs instruction).

That tends to choose performance over code size, although I think the performance gains are not terribly large, and neither is the code size penalty.
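Illustratively, the proposal would pick instructions like this (a sketch, in Go assembler syntax, with the pointer in AX):

	// *p |= 1 << 5    ->  ORQ  $32, (AX)          (fits in imm8)
	// *p |= 1 << 20   ->  ORQ  $1048576, (AX)     (fits in imm32)
	// *p |= 1 << 40   ->  BTSQ $40, (AX)          (constant no longer fits in imm32)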

@randall77
Contributor

Separately, we should then allow BTSQ to operate directly on memory.

@gopherbot

Change https://go.dev/cl/515755 mentions this issue: cmd/compile: don't use BTS when OR works, add direct memory BTS operations

@ianlancetaylor
Contributor

GCC does use BTS for code like

void f(unsigned long long *p) {
  *p |= 0x800000000ULL;
}
f:
	btsq	$35, (%rdi)
	ret

@randall77
Contributor

Ah yes, when I said gcc what I really meant was "the binary that gets executed when you type gcc on Darwin", which of course isn't gcc, it's llvm.
Looks like gcc does basically what I proposed: use bts only for the large bit indexes.
