
runtime: mallocgc does not seem to need to call publicationBarrier when allocating noscan objects #63640

Open
wbyao opened this issue Oct 20, 2023 · 17 comments
Labels: compiler/runtime (Issues related to the Go compiler and/or runtime), NeedsInvestigation (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one)
Milestone: Backlog

Comments

@wbyao

wbyao commented Oct 20, 2023

I'm doing some performance analysis of go1.20 on the arm64 platform, and I found that the DMB instruction in mallocgc consumes a lot of time.

I found some CLs:

  • CL11083: In go1.5, publicationBarrier was added when allocating scan objects.
  • CL23043: In go1.7, publicationBarrier was added when allocating noscan objects because of heapBitsSetTypeNoScan.
  • CL41253: However, publicationBarrier is still called when allocating noscan objects even after heapBitsSetTypeNoScan was deleted.

I tried to find the reason why publicationBarrier was called when allocating noscan objects, but I couldn't find it.

From the comments in the source code, the publicationBarrier in mallocgc ensures that the stores above it, which initialize x to type-safe memory and set the heap bits, occur before the caller can make x observable to the GC. The publicationBarrier in allocSpan makes sure the newly allocated span is observed by the GC before pointers into the span are published.
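For reference, here is a simplified sketch of the ordering those comments describe (illustrative pseudocode only; the helper names other than publicationBarrier are stand-ins, not the real runtime functions):

	x := allocSlot(span)  // pick a free slot in a span (stand-in name)
	zeroMemory(x, size)   // initialize x to type-safe (zeroed) memory
	setHeapBits(x, typ)   // record the heap bits so the GC can scan x (stand-in name)
	publicationBarrier()  // order the stores above before x escapes
	return x              // the caller may now make x observable to the GC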

My question is: the GC doesn't scan noscan objects, so why is publicationBarrier required when allocating noscan objects in mallocgc?

Thank you in advance for your help.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Oct 20, 2023
@mauri870
Member

I think you might be right; given the comments and the CL description, the publication barrier for noscan objects was only there to protect the heap bitmap.

@ianlancetaylor
Contributor

CC @golang/runtime

@cagedmantis cagedmantis added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Oct 20, 2023
@cagedmantis cagedmantis added this to the Backlog milestone Oct 20, 2023
@mknyszek
Contributor

I also think you might be right, but such changes are risky. We could try removing it for noscan objects early next development cycle and watch CI closely to see if anything breaks.

@mknyszek mknyszek added the early-in-cycle A change that should be done early in the 3 month dev cycle. label Oct 20, 2023
@RLH
Contributor

RLH commented Oct 22, 2023

The memory model no longer allows out of thin air values, even in racy programs. This means there needs to be a publication barrier somewhere between when the object is initialized to 0 and when it is returned (published) by the GC. When the memory model was weaker we were not concerned with racy programs seeing out of thin air values except when it broke the GC. I suppose the zeroing and therefore the publication barrier could be done eagerly, at least for noscan. There are other reasons to do eager zeroing and any numbers showing that eager zeroing is a performance win would be interesting.

@mknyszek
Contributor

@RLH I think I follow but I'd like to confirm. Is this the kind of situation you're describing?

(1) Thread 1 on CPU 1 allocates an object at address X without a publication barrier.
(2) Thread 1 on CPU 1 places address X in a globally accessible location Y.
(3) Thread 2 on CPU 2 accesses Y using a load without a barrier. Y is not in the cache. It goes to LLC or main memory and it finds X there.
(4) Thread 2 on CPU 2 accesses X without synchronization.

In step (4) we expect thread 2 to see zeroed memory, but if CPU 2 happens to get a hit in the cache at that address then it might not appear to be zero.

There is a race in this scenario, but one could imagine a program that's OK with the race. And it would be surprising to be able to observe the memory at X as not zeroed.
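A minimal Go sketch of that scenario (hypothetical and intentionally racy, just to make the four steps concrete; not code from the runtime or any real program):

	package main

	import "unsafe"

	// y is the globally accessible location Y from step (2).
	var y unsafe.Pointer

	// publisher is thread 1: it allocates X and publishes its address with a
	// plain store, relying entirely on the barrier inside the allocator
	// (steps 1 and 2).
	func publisher() {
		x := new([64]byte)    // noscan allocation; the runtime zeroes it
		y = unsafe.Pointer(x) // plain, non-atomic publication of X's address
	}

	// consumer is thread 2: it reads Y and dereferences the result without any
	// synchronization (steps 3 and 4). Without a publication barrier in
	// mallocgc, a weakly ordered CPU could in principle let this observe
	// stale, non-zero bytes at X.
	func consumer() {
		if p := (*[64]byte)(y); p != nil {
			_ = p[0] // expected to read 0, but only the barrier guarantees it
		}
	}

	func main() {
		go publisher()
		go consumer()
	}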

@RLH
Contributor

RLH commented Oct 22, 2023

GoTheLanguage (GoLan) and GoTheImplementation (GoImp) shouldn't be conflated. But yes, if Thread 1 (aka goroutine 1) is allowed to reorder the initialization stores of X and the store of the address of X, then racy goroutines may (will) see out of thin air values from previously collected objects that were allocated at that address. The compiler, the runtime, and the HW must all conspire so that GoImp provides GoLan's memory model.

See 1 for more discussion.

@mknyszek
Contributor

mknyszek commented Oct 22, 2023

GoTheLanguage (GoLan) and GoTheImplementation (GoImp) shouldn't be conflated.

Sure, just trying to wrap my head around a concrete example. Thanks for confirming! I agree that a simple change to remove the publication barrier for noscan objects would be a violation of the memory model.

I reread https://go.dev/ref/mem and I believe it's clear that this sort of thing is explicitly disallowed by the following line that discusses rules around racy programs: "Additionally, observation of acausal and “out of thin air” writes is disallowed." Apologies for the too-quick conclusion. I had only considered non-racy programs earlier.

I suppose the zeroing and therefore the publication barrier could be done eagerly, at least for noscan. There are other reasons to do eager zeroing and any numbers showing that eager zeroing is a performance win would be interesting.

We generally haven't been thinking about eager zeroing lately, just because it's hard to decide when the right time to do it is.

We could theoretically drop the barrier in cases where the memory is already zeroed (for instance, by the page allocator, or by the OS) or when the caller explicitly asks for non-zeroed memory, but I wonder if those cases are popular enough to actually make a difference.

Also, special-casing the publication barrier at all is going to introduce more fragility into the mallocgc path, which I'm not a fan of. It's not inconceivable that it's even more load-bearing.

What do other allocators do on arm?

@mknyszek
Contributor

Oh, another question I have is whether atomics on @wbyao's arm platform are implemented in the kernel. That might slow things down a lot. Disclaimer: I haven't looked at the implementation of publicationBarrier on arm in a while.

@wbyao
Author

wbyao commented Oct 23, 2023

Oh, another question I have is whether atomics on @wbyao's arm platform are implemented in the kernel. That might slow things down a lot. Disclaimer: I haven't looked at the implementation of publicationBarrier on arm in a while.

Thanks for your reply. There is some information I did not describe correctly: my environment is arm64, not arm.
I understand now that if a noscan object is accessed by another thread before the memory is zeroed, it may lead to unexpected results.
There's another question: the changes in CL270943 seem to make noscan large objects violate the memory model as well. How do noscan large objects avoid out of thin air values?

@mknyszek
Contributor

Thanks for your reply. There is some information I did not describe correctly: my environment is arm64, not arm.

Thanks for the clarification! My point about kernel-implemented atomics is moot then.

There's another question: the changes in CL270943 seem to make noscan large objects violate the memory model as well. How do noscan large objects avoid out of thin air values?

That's a good point, I think you're right. There's no barrier in memclrNoHeapPointersChunked either.

I believe the right thing to do here is to add a second publication barrier. (I'm trying to convince myself that we can just move it, but I keep going back and forth. This case is kind of special because the GC will observe that the span for this large object is noscan, and the allocation of that span has its own publication barrier for the GC.) It should probably be OK performance-wise because the cost of zeroing here should make the second barrier rare enough that it doesn't impact performance too much.
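Roughly, the shape of that fix in the delayed-zeroing path might look like the following (a simplified sketch of the idea, not the actual mallocgc code):

	if delayedZeroing {
		// Large noscan objects are zeroed here, after the span has been
		// allocated, rather than by the span allocator itself.
		memclrNoHeapPointersChunked(size, x)
		// A second publication barrier would order those zeroing stores
		// before the caller can publish x, mirroring the barrier on the
		// ordinary allocation path.
		publicationBarrier()
	}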

@wbyao
Author

wbyao commented Oct 25, 2023

I believe the right thing to do here is to add a second publication barrier.

I think it's a nice fix.
In addition to memclrNoHeapPointersChunked, is it correct to protect the bitmap with a publication barrier only when gcphase != _GCoff, like this? (This adds more branches, but there may be better implementations.)

	if (needzero && span.needzero != 0 && !delayedZeroing) || (gcphase != _GCoff && !noscan) {
		publicationBarrier()
	}

@RLH
Contributor

RLH commented Oct 25, 2023

The 2011 Java Memory Model implementation cookbook is out of date but is still worth the read, and there are links to more recent papers at the top.

The original publication-barrier discussion is in a section of this 23-year-old paper, in the context of a store-release / load-acquire machine.

@RLH
Contributor

RLH commented Oct 28, 2023

Not to kibitz too much but recent versions of ARM, including ARM64, have store release and load acquire instructions and a store release is sufficient for a publication barrier. There is no need for a load acquire by the consumer since the load will be dependent on seeing the pointer so will be program ordered. This should be faster than the full barrier now being used.

As an aside I (mis)spent a chunk of my youth at Intel working on a new load acquire / store release architecture and there was a lot of pain moving from TSO. It was not helped by the fact that the early micro-architecture was quietly TSO and later micro-architectures were weaker and some binaries broke. So if Go is going to switch to the weaker model then the sooner the better so any pain will be lessened. Note that I am only talking about pain in racy programs.

@mknyszek mknyszek removed the early-in-cycle A change that should be done early in the 3 month dev cycle. label Nov 1, 2023
@wbyao
Author

wbyao commented Nov 3, 2023

Not to kibitz too much but recent versions of ARM, including ARM64, have store release and load acquire instructions and a store release is sufficient for a publication barrier. There is no need for a load acquire by the consumer since the load will be dependent on seeing the pointer so will be program ordered. This should be faster than the full barrier now being used.

@RLH, I also saw a discussion about store-release in #9984, but it didn't explain why a store/store barrier at the end of mallocgc can't solve that problem. Is @aclements's analysis in #9984 correct? I'm a little confused.

And, if relevant, can someone explain whether it is possible not to protect the heap bitmap with a memory barrier when the GC is not running.

@aclements
Member

This should be faster than the full barrier now being used.

To be clear, we're not currently using a full barrier. publicationBarrier() compiles to DMB ST on ARM64, which is only a store-store barrier. On x86, publicationBarrier() is just a compiler barrier, and we rely on TSO. We might be able to switch to a store-release on ARM64 (STLR), though I'm not sure exactly what write we would do as a store-release.

Can someone explain whether it is possible not to protect heap bitmap with memory barrier during non-GC period.

I'm not 100% positive, but I think you're right that if the GC isn't active, writes to the heap bitmap shouldn't require any synchronization. The mark phase starts by stopping the world, which will be enough to synchronize all writes that occur prior to STW. Even if we were to switch to a ragged barrier, that would still be sufficient to synchronize these writes.

That said, we're also in the process of switching away from the heap bitmap, which may change the synchronization requirements here, though probably not in any fundamental way.

@zhouguangyuan0718
Contributor

zhouguangyuan0718 commented Nov 11, 2023

I tried a litmus test for this scenario.

AArch64 go_mallocgc_GC
{
x=1; (*the unzeroed object*)
y=0; (*a pointer in heap to hold x*)
0:X1=x; 0:X2=y;
1:X2=y;
}
P0                            | P1                                  ;
MOV X0, #0                    | LDR X1,[X2] (*found x when scan y*) ;
STR X0, [X1] (*memclr x*)     | LDR X3,[X1] (*scan x*)              ;
STLR X1, [X2] (*store x to y*)|                                     ;

exists (1:X3=0)

This litmus test simulates the mallocgc thread and the GC thread.

The result is that X3 can only be zero in thread 2.

Test go_mallocgc_GC Allowed
States 1
1:X3=0;
Ok
Witnesses
Positive: 1 Negative: 0
Condition exists (1:X3=0)
Observation go_mallocgc_GC Always 1 0
Time go_mallocgc_GC 0.03
Hash=79646409d6ffd55445cdcee94b876e9b


So it seems it's OK to use store-release for aarch64. The only problem is how to find the "store of the new object to the heap". Maybe SSA rules can match some cases, but for the cases where SSA rules can't work, we should still add a DMB conservatively. See the sketch below.
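To make that concrete, here is a hypothetical example (illustrative names only) of the kind of store such a rule would need to recognize:

	package demo

	// node is a noscan type: it contains no pointers.
	type node struct{ buf [32]byte }

	// head is a heap location that other goroutines may read.
	var head *node

	func publish() {
		n := new(node) // fresh allocation, served by mallocgc
		// The assignment below is the "store of the new object to the heap".
		// It is this store that would need release ordering (e.g. STLR on
		// arm64) if mallocgc's own DMB were removed.
		head = n
	}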

@zhouguangyuan0718
Contributor

zhouguangyuan0718 commented Nov 12, 2023

goos: linux
goarch: arm64
pkg: runtime
                    │ go1.21.4    │  go1.21.4 without barrier       │
                    │   sec/op    │   sec/op     vs base            │
Malloc8-4             23.69n ± 1%   20.37n ± 1%  -14.00% (n=100)
Malloc16-4            40.47n ± 1%   32.57n ± 1%  -19.51% (n=100)
MallocTypeInfo8-4     38.44n ± 1%   33.52n ± 1%  -12.79% (n=100)
MallocTypeInfo16-4    48.76n ± 0%   43.25n ± 1%  -11.30% (n=100)
MallocLargeStruct-4   435.0n ± 1%   455.9n ± 1%   +4.83% (p=0.000 n=100)
geomean               60.06n        53.51n       -10.91%

I removed the publicationBarrier in runtime.mallocgc directly and tried to measure its effect. The effect seems significant enough in runtime.mallocgc.
