
runtime: mallocgc does not seem to need to call publicationBarrier when allocating noscan objects #63640

Open
wbyao opened this issue Oct 20, 2023 · 17 comments
Labels: compiler/runtime (Issues related to the Go compiler and/or runtime), NeedsInvestigation (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one)
Milestone: Backlog

Comments

@wbyao

wbyao commented Oct 20, 2023

I'm doing some performance analysis of go1.20 on the arm64 platform, and I found that the DMB instruction in mallocgc consumes a lot of time.

I found some CLs:

  • CL11083: In go1.5, publicationBarrier was added when allocating scan objects.
  • CL23043: In go1.7, publicationBarrier was added when allocating noscan objects because of heapBitsSetTypeNoScan.
  • CL41253: However, publicationBarrier is still called when allocating noscan objects even after heapBitsSetTypeNoScan was deleted.

I tried to find the reason why publicationBarrier was called when allocating noscan objects, but I couldn't find it.

From the comments in the source code, the publicationBarrier in mallocgc ensures that the stores above it, which initialize x to type-safe memory and set the heap bits, occur before the caller can make x observable to the GC. The publicationBarrier in allocSpan makes sure the newly allocated span is observed by the GC before pointers into the span are published.
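For reference, here is a simplified sketch of the ordering those comments describe (illustrative pseudocode only; the helper names other than publicationBarrier are stand-ins, not the real runtime functions):

	x := allocSlot(span)  // pick a free slot in a span (stand-in name)
	zeroMemory(x, size)   // initialize x to type-safe (zeroed) memory
	setHeapBits(x, typ)   // record the heap bits so the GC can scan x (stand-in name)
	publicationBarrier()  // order the stores above before x escapes
	return x              // the caller may now make x observable to the GC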

My question is: the GC doesn't scan noscan objects, so why is publicationBarrier required when allocating noscan objects in mallocgc?

Thank you in advance for your help.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Oct 20, 2023
@mauri870
Member

I think you might be right; given the comments and the CL description, the publication barrier for noscan objects was only there to protect the heap bitmap.

@ianlancetaylor
Contributor

CC @golang/runtime

@cagedmantis cagedmantis added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Oct 20, 2023
@cagedmantis cagedmantis added this to the Backlog milestone Oct 20, 2023
@mknyszek
Contributor

I also think you might be right, but such changes are risky. We could try removing it for noscan objects early next development cycle and watch CI closely to see if anything breaks.

@mknyszek mknyszek added the early-in-cycle A change that should be done early in the 3 month dev cycle. label Oct 20, 2023
@RLH
Contributor

RLH commented Oct 22, 2023

The memory model no longer allows out of thin air values, even in racy programs. This means there needs to be a publication barrier somewhere between when the object is initialized to 0 and when it is returned (published) by the GC. When the memory model was weaker we were not concerned with racy programs seeing out of thin air values except when it broke the GC. I suppose the zeroing and therefore the publication barrier could be done eagerly, at least for noscan. There are other reasons to do eager zeroing and any numbers showing that eager zeroing is a performance win would be interesting.

@mknyszek
Contributor

@RLH I think I follow but I'd like to confirm. Is this the kind of situation you're describing?

(1) Thread 1 on CPU 1 allocates an object at address X without a publication barrier.
(2) Thread 1 on CPU 1 places address X in a globally accessible location Y.
(3) Thread 2 on CPU 2 accesses Y using a load without a barrier. Y is not in the cache. It goes to LLC or main memory and it finds X there.
(4) Thread 2 on CPU 2 accesses X without synchronization.

In step (4) we expect thread 2 to see zeroed memory, but if CPU 2 happens to get a hit in the cache at that address then it might not appear to be zero.

There is a race in this scenario, but one could imagine a program that's OK with the race. And it would be surprising to be able to observe the memory at X as not zeroed.
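A minimal Go sketch of that scenario (hypothetical and intentionally racy, just to make the four steps concrete; not code from the runtime or any real program):

	package main

	import "unsafe"

	// y is the globally accessible location Y from step (2).
	var y unsafe.Pointer

	// publisher is thread 1: it allocates X and publishes its address with a
	// plain store, relying entirely on the barrier inside the allocator
	// (steps 1 and 2).
	func publisher() {
		x := new([64]byte)    // noscan allocation; the runtime zeroes it
		y = unsafe.Pointer(x) // plain, non-atomic publication of X's address
	}

	// consumer is thread 2: it reads Y and dereferences the result without any
	// synchronization (steps 3 and 4). Without a publication barrier in
	// mallocgc, a weakly ordered CPU could in principle let this observe
	// stale, non-zero bytes at X.
	func consumer() {
		if p := (*[64]byte)(y); p != nil {
			_ = p[0] // expected to read 0, but only the barrier guarantees it
		}
	}

	func main() {
		go publisher()
		go consumer()
	}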

@RLH
Contributor

RLH commented Oct 22, 2023

GoTheLanguage (GoLan) and GoTheImplementation (GoImp) shouldn't be conflated. But yes, if Thread 1 (aka goroutine 1) is allowed to reorder the initialization stores of X and the store of the address of X, then racy goroutines may (will) see out of thin air values from previously collected objects that were allocated at that address. The compiler, the runtime, and the HW must all conspire so that GoImp provides GoLan's memory model.

See 1 for more discussion.

@mknyszek
Contributor

mknyszek commented Oct 22, 2023

GoTheLanguage (GoLan) and GoTheImplementation (GoImp) shouldn't be conflated.

Sure, just trying to wrap my head around a concrete example. Thanks for confirming! I agree that a simple change to remove the publication barrier for noscan objects would be a violation of the memory model.

I reread https://go.dev/ref/mem and I believe it's clear that this sort of thing is explicitly disallowed by the following line that discusses rules around racy programs: "Additionally, observation of acausal and “out of thin air” writes is disallowed." Apologies for the too-quick conclusion. I had only considered non-racy programs earlier.

I suppose the zeroing and therefore the publication barrier could be done eagerly, at least for noscan. There are other reasons to do eager zeroing and any numbers showing that eager zeroing is a performance win would be interesting.

We generally haven't been thinking about eager zeroing lately, just because it's hard to decide when the right time to do it is.

We could theoretically drop the barrier in cases where the memory is already zeroed (for instance, by the page allocator, or by the OS) or when the caller explicitly asks for non-zeroed memory, but I wonder if those cases are popular enough to actually make a difference.

Also, special-casing the publication barrier at all is going to introduce more fragility into the mallocgc path, which I'm not a fan of. It's not inconceivable that it's even more load-bearing.

What do other allocators do on arm?

@mknyszek
Contributor

Oh, another question I have is whether atomics on @wbyao's arm platform are implemented in the kernel. That might slow things down a lot. Disclaimer: I haven't looked at the implementation of publicationBarrier on arm in a while.

@wbyao
Author

wbyao commented Oct 23, 2023

Oh, another question I have is whether atomics on @wbyao's arm platform are implemented in the kernel. That might slow things down a lot. Disclaimer: I haven't looked at the implementation of publicationBarrier on arm in a while.

Thanks for your reply. There is some information I did not describe correctly: my environment is arm64, not arm.
I understand now that if a noscan object is accessed by another thread before the memory is zeroed, it may lead to unexpected results.
There's another question: the changes in CL270943 seem to make noscan large objects violate the memory model as well. How do noscan large objects avoid out of thin air values?

@mknyszek
Contributor

Thanks for your reply. There is some information I did not describe correctly: my environment is arm64, not arm.

Thanks for the clarification! My point about kernel-implemented atomics is moot then.

There's another question: the changes in CL270943 seem to make noscan large objects violate the memory model as well. How do noscan large objects avoid out of thin air values?

That's a good point, I think you're right. There's no barrier in memclrNoHeapPointersChunked either.

I believe the right thing to do here is to add a second publication barrier. (I'm trying to convince myself that we can just move it, but I keep going back and forth. This case is kind of special because the GC will observe that the span for this large object is noscan, and the allocation of that span has its own publication barrier for the GC.) It should probably be OK performance-wise because the cost of zeroing here should make the second barrier rare enough that it doesn't impact performance too much.
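Roughly, the shape of that fix in the delayed-zeroing path might look like the following (a simplified sketch of the idea, not the actual mallocgc code):

	if delayedZeroing {
		// Large noscan objects are zeroed here, after the span has been
		// allocated, rather than by the span allocator itself.
		memclrNoHeapPointersChunked(size, x)
		// A second publication barrier would order those zeroing stores
		// before the caller can publish x, mirroring the barrier on the
		// ordinary allocation path.
		publicationBarrier()
	}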

@wbyao
Author

wbyao commented Oct 25, 2023

I believe the right thing to do here is to add a second publication barrier.

I think it's a nice fix.
In addition to memclrNoHeapPointersChunked, is it correct to protect the bitmap with a publication barrier only when gcphase != _GCoff, like this? (This adds more branches, but there may be better implementations.)

	if (needzero && span.needzero != 0 && !delayedZeroing) || (gcphase != _GCoff && !noscan) {
		publicationBarrier()
	}

@RLH
Contributor

RLH commented Oct 25, 2023

The 2011 Java Memory Model implementation cookbook is out of date but is still worth the read, and there are links to more recent papers at the top.

The original publication-barrier discussion is in a section of this 23-year-old paper, in the context of a store-release / load-acquire machine.

@RLH
Contributor

RLH commented Oct 28, 2023

Not to kibitz too much but recent versions of ARM, including ARM64, have store release and load acquire instructions and a store release is sufficient for a publication barrier. There is no need for a load acquire by the consumer since the load will be dependent on seeing the pointer so will be program ordered. This should be faster than the full barrier now being used.

As an aside I (mis)spent a chunk of my youth at Intel working on a new load acquire / store release architecture and there was a lot of pain moving from TSO. It was not helped by the fact that the early micro-architecture was quietly TSO and later micro-architectures were weaker and some binaries broke. So if Go is going to switch to the weaker model then the sooner the better so any pain will be lessened. Note that I am only talking about pain in racy programs.

@mknyszek mknyszek removed the early-in-cycle A change that should be done early in the 3 month dev cycle. label Nov 1, 2023
@wbyao
Author

wbyao commented Nov 3, 2023

Not to kibitz too much but recent versions of ARM, including ARM64, have store release and load acquire instructions and a store release is sufficient for a publication barrier. There is no need for a load acquire by the consumer since the load will be dependent on seeing the pointer so will be program ordered. This should be faster than the full barrier now being used.

@RLH, I also saw a discussion about store-release in #9984, but it didn't explain why a store/store barrier at the end of mallocgc can't solve that problem. Is @aclements's analysis in #9984 correct? I'm a little confused.

And, if relevant, can someone explain whether it is possible not to protect the heap bitmap with a memory barrier when the GC is not running.

@aclements
Member

This should be faster than the full barrier now being used.

To be clear, we're not currently using a full barrier. publicationBarrier() compiles to DMB ST on ARM64, which is only a store-store barrier. On x86, publicationBarrier() is just a compiler barrier, and we rely on TSO. We might be able to switch to a store-release on ARM64 (STLR), though I'm not sure exactly what write we would do as a store-release.

Can someone explain whether it is possible not to protect heap bitmap with memory barrier during non-GC period.

I'm not 100% positive, but I think you're right that if the GC isn't active, writes to the heap bitmap shouldn't require any synchronization. The mark phase starts by stopping the world, which will be enough to synchronize all writes that occur prior to STW. Even if we were to switch to a ragged barrier, that would still be sufficient to synchronize these writes.

That said, we're also in the process of switching away from the heap bitmap, which may change the synchronization requirements here, though probably not in any fundamental way.

@zhouguangyuan0718
Contributor

zhouguangyuan0718 commented Nov 11, 2023

I tried a litmus test for this scenario.

AArch64 go_mallocgc_GC
{
x=1; (*the unzeroed object*)
y=0; (*a pointer in heap to hold x*)
0:X1=x; 0:X2=y;
1:X2=y;
}
P0                            | P1                                  ;
MOV X0, #0                    | LDR X1,[X2] (*found x when scan y*) ;
STR X0, [X1] (*memclr x*)     | LDR X3,[X1] (*scan x*)              ;
STLR X1, [X2] (*store x to y*)|                                     ;

exists (1:X3=0)

This litmus test simulates the mallocgc thread and the GC thread.

The result is that X3 can only be zero in thread 2.

Test go_mallocgc_GC Allowed
States 1
1:X3=0;
Ok
Witnesses
Positive: 1 Negative: 0
Condition exists (1:X3=0)
Observation go_mallocgc_GC Always 1 0
Time go_mallocgc_GC 0.03
Hash=79646409d6ffd55445cdcee94b876e9b


So it seems it's OK to use store-release for aarch64. The only problem is how to find the "store of the new object to the heap". Maybe SSA rules can match some cases, but for the cases where SSA rules can't work, we should still add a DMB conservatively. See the sketch below.
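To make that concrete, here is a hypothetical example (illustrative names only) of the kind of store such a rule would need to recognize:

	package demo

	// node is a noscan type: it contains no pointers.
	type node struct{ buf [32]byte }

	// head is a heap location that other goroutines may read.
	var head *node

	func publish() {
		n := new(node) // fresh allocation, served by mallocgc
		// The assignment below is the "store of the new object to the heap".
		// It is this store that would need release ordering (e.g. STLR on
		// arm64) if mallocgc's own DMB were removed.
		head = n
	}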

@zhouguangyuan0718
Contributor

zhouguangyuan0718 commented Nov 12, 2023

goos: linux
goarch: arm64
pkg: runtime
                    │ go1.21.4    │  go1.21.4 without barrier       │
                    │   sec/op    │   sec/op     vs base            │
Malloc8-4             23.69n ± 1%   20.37n ± 1%  -14.00% (n=100)
Malloc16-4            40.47n ± 1%   32.57n ± 1%  -19.51% (n=100)
MallocTypeInfo8-4     38.44n ± 1%   33.52n ± 1%  -12.79% (n=100)
MallocTypeInfo16-4    48.76n ± 0%   43.25n ± 1%  -11.30% (n=100)
MallocLargeStruct-4   435.0n ± 1%   455.9n ± 1%   +4.83% (p=0.000 n=100)
geomean               60.06n        53.51n       -10.91%

I removed the publicationBarrier in runtime.mallocgc directly and tried to measure its effect. The effect seems significant enough in runtime.mallocgc.
