cmd/compile: OR into memory is cheaper than MOV/BTSL/MOV on x86 #61694

Closed
aktau opened this issue Aug 1, 2023 · 22 comments

@aktau
Contributor

aktau commented Aug 1, 2023

What version of Go are you using (go version)?

$ go version
go1.21-dev +fe5af1532a

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

GOARCH=amd64

What did you do?

We have a code generator that generates a struct with setters. To track whether the setter has been called for a given field, we set the corresponding bit in a bitmap. The code looks like this:

func setBit(part *uint32, num uint32) {
	*part |= 1 << (num % 32)
}

type x struct {
	bitmap [4]uint32 // A bitmap recording whether "Set" was called on a given field.
	u      int32     // Imagine this is field number 8.
	v      int32     // Imagine this is field number 38.
}

func (m *x) SetV(val int32) {
	m.v = val
	setBit(&(m.bitmap[1]), 37)
}

func (m *x) SetU(val int32) {
	m.u = val
	setBit(&(m.bitmap[0]), 7)
}

What did you expect to see?

I expected similar instructions (with different operands) being generated for both setters.

What did you see instead?

SetU is ~30% slower than SetV, as measured in local benchmarks (on a Zen 4 machine). The relevant difference is (godbolt):

TEXT    main.(*x).SetV(SB), NOSPLIT|NOFRAME|ABIInternal, $0-16
        MOVL    BX, 20(AX)
        NOP
        ORL     $32, 4(AX)
        RET

TEXT    main.(*x).SetU(SB), NOSPLIT|NOFRAME|ABIInternal, $0-16
        MOVL    BX, 16(AX)
        MOVL    (AX), CX
        BTSL    $7, CX
        NOP
        MOVL    CX, (AX)
        RET

It seems like OR into memory does better than MOV/BTS/MOV.
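For context, a benchmark of the two setters might look like this (a hypothetical sketch, since the benchmark code actually used was not posted; it assumes the type x and its setters above live in the same package):

package x

import "testing"

var m x // package-level, so the compiler can't discard the stores

func BenchmarkSetU(b *testing.B) {
	for i := 0; i < b.N; i++ {
		m.SetU(int32(i))
	}
}

func BenchmarkSetV(b *testing.B) {
	for i := 0; i < b.N; i++ {
		m.SetV(int32(i))
	}
}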

According to https://www.uops.info/table.html, for Skylake-X and Zen 4 it seems the OR family is pound-for-pound (slightly) better than the BTS family:

                            ------------ Skylake-X ------------                       ------ Zen 4 ------
Instruction      Lat        TP            Uops    Ports                       Lat       TP     Uops
BTS (M32, I8)    [≤3;≤10]   1.00 / 1.00   3 / 4   1*p06+1*p23+1*p237+1*p4     [5;12]    2.00   4
OR  (M32, I32)   [≤3;≤10]   1.00 / 1.00   2 / 4   1*p0156+1*p23+1*p237+1*p4   [≤1;≤8]   0.56   2
BTS (R32, I8)    1          0.50 / 0.50   1 / 1   1*p06                       [1;2]     1.00   2
OR  (R32, I8)    1          0.25 / 0.25   1 / 1   1*p0156                     1         0.25   1

I didn't look up what those MOV instructions cost, but it's difficult to predict costs from individual operations on the complex processors of today. Things I didn't test (because the Go compiler doesn't generate/inline them):

  • MOV/OR/MOV
  • BTS memory,immediate

Some of the speedup may be due to the shorter instruction sequence, too.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Aug 1, 2023
@randall77
Contributor

We use BTS because the encoding is smaller.

// Convert ORconst into BTS, if the code gets smaller, with boundary being

We could convert back to OR when a memory-modify op is available. Or we could generate the memory-target versions of BTS. The one caveat is that we want to do that only for the constant BTS versions: the register-source BTS instructions are really weird and slow when applied to memory.
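For illustration, the two alternatives for the MOVL/BTSL/MOVL sequence in the report would be (a sketch):

	// memory-modify OR (the constant is 1<<7 = 128):
	ORL     $128, (AX)

	// memory-target BTS (constant form only):
	BTSL    $7, (AX)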

@merykitty

merykitty commented Aug 1, 2023

I believe saving 2 bytes is not worth the reduced throughput. The bt family has a reciprocal throughput of 0.5 on Intel pre-Alder Lake, compared with 0.25 for basic arithmetic instructions such as or. Worse, on Alder Lake bt can only execute on one port, which results in a reciprocal throughput of 1; and on Zen 4 only bt achieves a reciprocal throughput of 0.5, while the figure for bts/btr/btc is 1, because they decode to 2 uops.

@randall77
Contributor

Could you post some full benchmark code? My simple benchmark isn't showing any difference between the two.

package bts

import "testing"

func bts_mem(x *uint32) {
	*x |= 1 << 16
}
func bts_reg(x uint32) uint32 {
	return x | 1<<16
}
func BenchmarkBTSmem(b *testing.B) {
	var x uint32
	for i := 0; i < b.N; i++ {
		bts_mem(&x)
	}
	sink = x
}
func BenchmarkBTSreg(b *testing.B) {
	var x uint32
	for i := 0; i < b.N; i++ {
		x = bts_reg(x)
	}
	sink = x
}

var sink uint32

Enabling or disabling those rules I quoted (which have the effect of switching from BTS to OR) makes no difference.

cpu: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz

@merykitty

@randall77 Your benchmark does not work because it has a loop-carried dependency, i.e. the operation in each iteration depends on the result of the previous one, so the out-of-order engine cannot help: all operations execute sequentially. As a result, your benchmark only measures the latency of the two instructions, which is 1 for both, hence the similar results. To measure the throughput of the instructions, you need to interleave multiple independent operations. I made a quick-bench and the results show that the bts version is twice as slow as the or version, which is expected given the difference in throughput between the two instructions.

Thanks.

@randall77
Contributor

Here's an improved benchmark:

package bts

import "testing"

const N = 5

func BenchmarkBTSmem(b *testing.B) {
	x := [4]uint32{1, 2, 3, 4}
	for i := 0; i < b.N; i++ {
		x[0] |= 1 << N
		x[1] |= 1 << N
		x[2] |= 1 << N
		x[3] |= 1 << N
	}
	sink = x[0] + x[1] + x[2] + x[3]
}
func BenchmarkBTSreg(b *testing.B) {
	var x, y, z, w uint32
	x, y, z, w = 1, 2, 3, 4
	for i := 0; i < b.N; i++ {
		x |= 1 << N
		y |= 1 << N
		z |= 1 << N
		w |= 1 << N
	}
	sink = x + y + z + w
}

var sink uint32

Run with N==5 to get OR behavior and N==15 to get BTS behavior.

I'm still not seeing much difference. A bit better with the all-in-register version, but load/BTS/store is just as fast as OR to memory.

name       old time/op  new time/op  delta
BTSmem-16  1.48ns ± 0%  1.48ns ±10%     ~     (p=0.150 n=9+10)
BTSreg-16  0.67ns ± 3%  0.54ns ±12%  -20.07%  (p=0.000 n=9+8)

cpu: Intel(R) Xeon(R) W-3223 CPU @ 3.50GHz

@merykitty

For memory operands, the bottleneck is the store-forwarding latency, which is a whopping ~5 cycles; this overwhelms any other computation, since 4 ors can execute in 1 cycle and 4 btss in 2-4 cycles. Interleaving more arithmetic instructions into the loop can help expose the difference in throughput, but the benchmark gets more arbitrary at that point.

My point is that transforming an or into a bts is a negative optimisation with negligible benefits, so it should not be done. FYI, neither gcc nor clang performs this transformation, even at -Os.

@merykitty

Also, the OP ran their benchmark on a Zen 4 CPU, which has an exceptionally bad bts implementation; that explains the difference you did not observe on your Xeon.

@aktau
Contributor Author

aktau commented Aug 2, 2023

We use BTS because the encoding is smaller.

But MOV/BTS/MOV is not smaller than OR with a memory operand.

Also, the OP executed their benchmark on a Zen 4 CPU, which has exceptionally bad bts implementation, which explains the difference that you did not observe on a Xeon one.

I was wrong, my machine is actually Zen 2. Here are my results:

cpu: AMD Ryzen Threadripper PRO 3995WX 64-Cores
           │     BTS      │                  OR                  │
           │    sec/op    │    sec/op     vs base                │
BTSmem-128    2.232n ± 0%    1.728n ± 0%  -22.60% (p=0.000 n=10)
BTSreg-128   0.7203n ± 0%   0.4981n ± 0%  -30.86% (p=0.000 n=10)
geomean       1.268n        0.9276n       -26.85%

@randall77
Contributor

But MOV/BTS/MOV is not smaller than OR with a memory operand.

100% agree. I think the question here boils down to: should we fix this by adding BTSconst-to-memory instructions, or should we fix this by getting rid of BTSconst altogether?

@aktau
Contributor Author

aktau commented Aug 2, 2023

UPDATE: The size comparison below is not accurate: BTS encodes the bit index (log2 of the constant) as a one-byte immediate, which makes it smaller for large constants.
From a performance point of view, I'd be inclined to say we should prefer OR: looking forward, it is the more common instruction and thus the more likely target for optimization by chip builders.

New (correct) size note:

From a size point of view, I played around with https://defuse.ca/online-x86-assembler.htm#disassembly a little bit:

83 48 04 02             or     DWORD PTR [rax+0x4],0x2
0f ba 68 04 01          bts    DWORD PTR [rax+0x4],0x1
...
83 48 04 40             or     DWORD PTR [rax+0x4],0x40
0f ba 68 04 06          bts    DWORD PTR [rax+0x4],0x6
81 48 04 80 00 00 00    or     DWORD PTR [rax+0x4],0x80
0f ba 68 04 07          bts    DWORD PTR [rax+0x4],0x7
...
81 48 04 00 00 00 80    or     DWORD PTR [rax+0x4],0x80000000
0f ba 68 04 1f          bts    DWORD PTR [rax+0x4],0x1f

Indeed OR is smaller by one byte for bits 0..6, but bigger by two bytes for bits 7..31 (I assume the gap grows further for bits 32..63, but my Lua repl didn't make this easy to check).

Old (wrong) note:

From a size point of view, I played around with https://defuse.ca/online-x86-assembler.htm#disassembly a little bit:

bts    ecx,0x7
bts    rcx,0x7
bts    DWORD PTR [rax+0x4],0x7
or     ecx,0x20
or     rcx,0x20
or     DWORD PTR [rax+0x4],0x20

Yields:

0:  0f ba e9 07             bts    ecx,0x7 // Generated by Go
4:  48 0f ba e9 07          bts    rcx,0x7
9:  0f ba 68 04 07          bts    DWORD PTR [rax+0x4],0x7
e:  83 c9 20                or     ecx,0x20
11: 48 83 c9 20             or     rcx,0x20
15: 83 48 04 20             or     DWORD PTR [rax+0x4],0x20 // Generated by Go

Which makes it look like or has a smaller size too (both with and without a REX prefix), winning on both fronts. I must be missing something here.

Are we sure that

// Convert ORconst into BTS, if the code gets smaller, with boundary being
// (ORL $40,AX is 3 bytes, ORL $80,AX is 6 bytes).
((ORQ|XORQ)const [c] x) && isUint64PowerOfTwo(int64(c)) && uint64(c) >= 128
=> (BT(S|C)Qconst [int8(log32(c))] x)
((ORL|XORL)const [c] x) && isUint32PowerOfTwo(int64(c)) && uint64(c) >= 128
=> (BT(S|C)Lconst [int8(log32(c))] x)
((ORQ|XORQ) (MOVQconst [c]) x) && isUint64PowerOfTwo(c) && uint64(c) >= 128
=> (BT(S|C)Qconst [int8(log64(c))] x)
((ORL|XORL) (MOVLconst [c]) x) && isUint32PowerOfTwo(int64(c)) && uint64(c) >= 128
=> (BT(S|C)Lconst [int8(log32(c))] x)
are triggering? The constants in question are always smaller than 32, so I don't see how they would pass uint64(c) >= 128.

@randall77
Contributor

or DWORD PTR [rax+0x4],0x20

Use a larger constant. 0x20 fits in a (sign-extended) one-byte immediate, whereas 0x80 (which is 1<<7) does not.

The larger OR encoding is needed for constants of 128 and up, i.e. bit index 7 and larger. (Note the rules compare the OR constant c = 1<<bit against 128, not the bit index itself.)

@aktau
Contributor Author

aktau commented Aug 2, 2023

Sorry about that, I corrected it above. Indeed it's a one-byte advantage for OR for bits 0..6, and a two-byte disadvantage for bits 7..31.

It's hard for me to tell how significant that could be. In a binary I have lying around here:

$ stat --printf=%s hugebin
1954399656
$ go tool objdump hugebin | grep -vP '^\s*:0' | grep '\bOR..*\$' | wc -l
86483
$ go tool objdump hugebin | grep -vP '^\s*:0' | grep '\bBTS. \$' | wc -l
61242

So we'd add (61242*2)/1954399656 ≈ 0.0063% to the binary. That seems more or less alright, but it's also a sample size of one. Also, I'd note that removing the rewrite rules would perhaps eliminate the MOV/BTS/MOV transformation, making this a net gain.

I've also observed some C++-generated instructions while scanning this binary (it's a mixed Go/C++ binary): LLVM does generate the larger ORs.

@randall77
Contributor

gcc -O3 doesn't use BTS for any constant-index operations.
For register operands, even for things like x | (1<<50), gcc uses a movabs+or instead of a bts, even though the former is a lot bigger. (For -Os it uses OR up to the 32nd bit and BTS after that.)

For memory operands, gcc wisely always uses OR of 1-byte values: for bit indexes larger than 7, it just adds idx/8 to the address so the constant is always encoded in one byte. This worries me a bit because of how it might affect store forwarding, but I'm inclined to agree with gcc.
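Concretely, the byte-OR pattern looks something like this (a sketch; exact output varies by compiler and version):

void f(unsigned long long *p) {
        *p |= 1ULL << 37;
}
f:
        orb     $32, 4(%rdi)    # byte at offset 37/8 = 4, mask 1 << (37%8) = 0x20
        ret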

@merykitty

@randall77 I believe you can't do or with a byte memory operand in Go, because it violates atomicity. For example:

initial:
int x = 0 (0)

thread 1:
int a = x; (1)
int b = a | 1;
x = b; (2)

thread 2:
int c = x; (3)
int d = c + 0x102;
x = d; (4)

Now there are 4 cases:

  • (1) reads (0) and (3) reads (0): then b = 1 and d = 0x102
  • (1) reads (4) and (3) reads (0): then d = 0x102 and b = 0x103
  • (1) reads (0) and (3) reads (2): then b = 1 and d = 0x103
  • (1) reads (4) and (3) reads (2): then d = c + 0x102 = b + 0x102 = (a | 1) + 0x102 = (d | 1) + 0x102. The equation d = (d | 1) + 0x102 has no solution (this case also violates causality, but the memory model does not define that)

As a result, the final value of x can only be one of 1, 0x102, 0x103.

However, if you implement x |= 1 using a byte-width instruction, it can happen that both (1) and (3) read (0) and (4) is committed to memory before (2), which leads to a final value of x of 0x101.
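A concrete interleaving that produces 0x101, with x |= 1 lowered to a non-atomic byte-wide read-modify-write (a sketch):

thread 1: load  byte  [x]  -> 0x00     ; (1) reads (0)
thread 2: load  dword [x]  -> 0x00     ; (3) reads (0)
thread 2: store dword [x]  <- 0x102    ; (4)
thread 1: store byte  [x]  <- 0x01     ; (2) overwrites only the low byte
final:    x == 0x101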

@mknyszek mknyszek added this to the Go1.22 milestone Aug 2, 2023
@mknyszek mknyszek added Performance NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Aug 2, 2023
@randall77
Contributor

@merykitty I don't think the issue you raised is a showstopper, at least not directly. That code has a race in it, and we're allowed to do anything we want when there is a race.
It would be different if those reads and writes were atomic. So it is true that, if/when we do #61395, we should be careful not to apply this optimization to atomic ORs.

@merykitty

@randall77 The Go memory model mandates that:

a read r of a memory location x that is not larger than a machine word must observe some write w such that r does not happen before w and there is no write w' such that w happens before w' and w' happens before r. That is, each read must observe a value written by a preceding or concurrent write.

So it is not right that we can do anything we want in the presence of a data race.

@randall77
Contributor

I see, so you are claiming that we get a word-tearing violation if we use sub-word OR-with-memory instructions.
Hmm, have to think about that one.

@randall77
Contributor

randall77 commented Aug 3, 2023

Ok, I think I agree with @merykitty that we can't use narrower ORs: that leads to word tearing, which we promise not to do.
So we're back to deciding between code size and maybe some performance.

Proposal: use ORconst for anything that can be done with a single instruction, with both register and memory targets.
That leaves BTSconst just for setting bits 31 and higher in 64-bit destinations (for which we'd otherwise have to issue a separate movabs instruction).

That tends to choose performance over code size, although I think the performance gains are not terribly large, and neither is the code size penalty.
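Illustratively, the proposal would pick instructions like this (a sketch, in Go assembler syntax, with the pointer in AX):

	// *p |= 1 << 5    ->  ORQ  $32, (AX)          (fits in imm8)
	// *p |= 1 << 20   ->  ORQ  $1048576, (AX)     (fits in imm32)
	// *p |= 1 << 40   ->  BTSQ $40, (AX)          (constant no longer fits in imm32)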

@randall77
Contributor

Separately, we should then allow BTSQ to operate directly on memory.

@gopherbot

Change https://go.dev/cl/515755 mentions this issue: cmd/compile: don't use BTS when OR works, add direct memory BTS operations

@ianlancetaylor
Contributor

GCC does use BTS for code like

void f(unsigned long long *p) {
  *p |= 0x800000000ULL;
}
f:
	btsq	$35, (%rdi)
	ret

@randall77
Contributor

Ah yes, when I said gcc what I really meant was "the binary that gets executed when you type gcc on Darwin", which of course isn't gcc, it's llvm.
Looks like gcc does basically what I proposed: use bts only for the large bit indexes.
