
cmd/compile: slice update with self subslice - write barrier probably not needed (?) #24314

Open · navytux opened this issue Mar 8, 2018 · 8 comments
Labels: compiler/runtime (Issues related to the Go compiler and/or runtime.), Performance
navytux commented Mar 8, 2018

Commit b854339 (CL 98757) changes encoding/binary to use an explicit offset instead of the buf = buf[1:] idiom. The rationale: buf = buf[1:] updates the slice's data pointer, which is done with a write barrier, while indexing with an explicit offset (buf[offset]) never writes the slice header and so issues no GC write barrier. The commit cites a significant speedup for encoding/binary under GOGC=1 and notes that "While running make.bash, over 5% of all pointer writes come from encoding/binary doing struct reads."

It was then noted there that instead of explicitly using an offset to avoid the buf = buf[1:] update idiom for performance reasons, it would be better to teach the compiler that a slice update whose new data pointer points into the same underlying object as the original slice's, e.g. buf = buf[1:], does not need a write barrier at all.

The conversation there gives the rationale for the current compiler behaviour, but also raises questions and arguments for why the optimization can be done. It was suggested to move the conversation to the issue tracker, so here it goes.
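
For concreteness, here is a minimal sketch of the two styles (an assumed example with hypothetical names - reader, sumSlicing, sumOffset - not the actual encoding/binary code):

package example

type reader struct {
        buf []byte
}

// Slicing idiom: each iteration stores a new data pointer into r.buf, a
// heap-resident slice header, so the compiler emits a GC write barrier.
func (r *reader) sumSlicing() (sum int) {
        for len(r.buf) > 0 {
                sum += int(r.buf[0])
                r.buf = r.buf[1:] // pointer write -> write barrier
        }
        return sum
}

// Offset idiom (what CL 98757 switched to): the slice header is only read
// and indexed, never written, so no write barrier is issued.
func (r *reader) sumOffset() (sum int) {
        for offset := 0; offset < len(r.buf); offset++ {
                sum += int(r.buf[offset])
        }
        return sum
}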


@josharian thinking out loud: maybe it would be better to instead teach the compiler that slice updates whose destination data pointer points into the same underlying object as the original slice's, e.g. as in

buf = buf[1:]

do not need a write barrier at all?

The rationale (to my limited understanding of write barriers) is that it is the whole allocated object that is shaded by the write barrier:

//     writePointer(slot, ptr):
//         shade(*slot)
//         if current stack is grey:
//             shade(ptr)
//         *slot = ptr
...
// 1. shade(*slot) prevents a mutator from hiding an object by moving
// the sole pointer to it from the heap to its stack. If it attempts
// to unlink an object from the heap, this will shade it.
//
// 2. shade(ptr) prevents a mutator from hiding an object by moving
// the sole pointer to it from its stack into a black object in the
// heap. If it attempts to install the pointer into a black object,
// this will shade it.

(https://github.com/golang/go/blob/d7eb4901/src/runtime/mbarrier.go#L21)

func shade(b uintptr) {
        if obj, span, objIndex := findObject(b, 0, 0); obj != 0 {
        ...

(https://github.com/golang/go/blob/d7eb4901/src/runtime/mgcmark.go#L1211)

func findObject(p, refBase, refOff uintptr) (base uintptr, s *mspan, objIndex uintptr) {
        s = spanOf(p)
        ...

(https://github.com/golang/go/blob/d7eb4901/src/runtime/mbitmap.go#L354)

and so for buf = buf[1:] the slice, both before and after the update, points into the same allocated object -> thus the mutator certainly cannot hide that object from the GC, and there is no need to shade anything on such updates.
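
A tiny runnable illustration of that allocation identity (my own example, not runtime code): the data pointer both before and after the reslice falls inside the same allocated object, so findObject reports the same base for both:

package main

import (
        "fmt"
        "unsafe"
)

func main() {
        buf := make([]byte, 8)
        p0 := unsafe.Pointer(&buf[0]) // start of the allocation
        buf = buf[1:]
        p1 := unsafe.Pointer(&buf[0]) // 1 byte inside the same allocation
        fmt.Println(uintptr(p1) - uintptr(p0)) // prints 1
}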

Doing this in the compiler would fix the performance issue that patch solves, and would automatically remove write barriers in other, probably many, places, helping not only speed but also code size (#6853).

I apologize if there is something trivial preventing this that I missed.

Kirill

/cc @aclements


@aclements says:
The danger here is subtle. We haven't done this because the update to buf could be racy and if we did this that race could break GC. Specifically, if one goroutine executes buf = buf[1:] without a write barrier while another executes buf = x with a write barrier, it's possible for the final value of buf to continue pointing to the original buf allocation, but for the garbage collector to think it points to the x allocation and free the original buf allocation.

Of course, racy Go programs are undefined, but right now the garbage collector is immune to these sorts of non-type-safety-breaking "benign" races (this is obviously a somewhat fuzzy argument, since races let you break type safety and that lets you break memory safety, and saying anything formal about "benign" races is notoriously impossible, so what I'm saying is probably not true anyway, but we try :)


@dgryski says:
@aclements It would be interesting to see the effect such an update would have on code size and performance to know how much we're leaving on the table with this safety check on.


me says:
@aclements thanks for explaining. Am I right that, with write barriers enabled for both racy buf updates, the mechanism that prevents the GC from deallocating the wrong object is that in the current GC the written ptr is always shaded too (https://github.com/golang/go/blob/go1.10-1-g678dede7bc/src/runtime/mbarrier.go#L162 (go1.10), https://github.com/golang/go/blob/0add9a4dcf/src/runtime/mwbbuf.go#L226 (tip))?

By the way, the thread that performs buf = x with a write barrier will always do shade(*slot). If buf = buf[1:] races with buf = x, then even if the former does not use a write barrier, the shade(*slot) of the buf = x thread shades the original buf object in any case: it sees either buf or buf+1 as *slot, but both lead to the same allocated object. If so, then since shade actually greys an object, and grey objects never become white - only eventually black - the original buf underlying object stays alive, not deallocated, and so it is safe to do buf = buf[1:] without a write barrier.

But this goes contrary to what you say, so there is probably some mistake on my side. Could you please explain where the above logic fails?

I'm a GC newbie, so please forgive me if I miss something well-known.


@josharian says:
@navytux would you mind opening a new issue and migrating this (interesting) discussion there? We tend to try to keep all conversation on CLs and issues. Thanks!

navytux referenced this issue Mar 8, 2018 in commit b854339 (CL 98757):
While running make.bash, over 5% of all pointer writes
come from encoding/binary doing struct reads.

This change replaces slicing during such reads with an offset.
This avoids updating the slice pointer with every
struct field read or write.

This has no impact when the write barrier is off.
Running the benchmarks with GOGC=1, however,
shows significant improvement:

name          old time/op    new time/op    delta
ReadStruct-8    13.2µs ± 6%    10.1µs ± 5%  -23.24%  (p=0.000 n=10+10)

name          old speed      new speed      delta
ReadStruct-8  5.69MB/s ± 6%  7.40MB/s ± 5%  +30.18%  (p=0.000 n=10+10)

Change-Id: I22904263196bfeddc38abe8989428e263aee5253
Reviewed-on: https://go-review.googlesource.com/98757
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>

navytux commented Mar 25, 2018

Quoting my earlier question: with buf = buf[1:] racing against buf = x, the shade(*slot) performed by the buf = x thread shades the original buf object either way, so doing buf = buf[1:] without a write barrier looks safe.

@aclements, would you mind providing your feedback here?

If there is a chance write barriers could be avoided on slice updates, it would be a pity to miss 1.11: even though Go is compiled, performance is sometimes too painful, especially where GC bits are intermixed.

Thanks beforehand,
Kirill

@andybons andybons added this to the Unplanned milestone Mar 26, 2018
aclements commented

@navytux, good call. You're right that we can do this optimization now. It wasn't safe the last time we considered it, but the barrier has changed since then.

Consider two goroutines:

goroutine A does buf = buf[1:]:

  1. r1 = buf.ptr
  2. buf.ptr = r1+1

goroutine B does buf = s:

  1. r2 = s.ptr
  2. r1 = buf.ptr
  3. shade(r1)
  4. shade(r2)
  5. buf.ptr = r2

(Ignoring len and cap, which a race could obviously shear, but that's an orthogonal problem.)

Since the reads commute with each other and everything commutes with shade, the only orderings that matter are A1/B5, A2/B2, and A2/B5. This leads to four schedules worth considering:

  • A1-A2, B1-B5. This is simply a sequential schedule.
  • A1, B1, B2, A2, B3-B5. The original buf.ptr is no longer reachable. It was shaded, but this is harmless.
  • A1, B1-B5, A2. The original buf.ptr is still reachable, but this is okay because B3 shaded it.
  • B1-B5, A1-A2. Again, this is simply a sequential schedule.

Prior to the hybrid barrier, we didn't have step B3, which is what makes the third schedule above safe.
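
For illustration, here is a minimal runnable reproduction of the access pattern in those schedules (an assumed example, intentionally racy - go run -race flags it - written only to make the A/B steps concrete):

package main

import "sync"

func main() {
        buf := make([]byte, 16)
        s := make([]byte, 16)
        var wg sync.WaitGroup
        wg.Add(2)
        go func() { // goroutine A: the self-subslice update
                defer wg.Done()
                buf = buf[1:]
        }()
        go func() { // goroutine B: replaces buf with another allocation
                defer wg.Done()
                buf = s
        }()
        wg.Wait()
        _ = buf
}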

gopherbot commented
Change https://golang.org/cl/105258 mentions this issue: cmd/compile: eliminate write barrier for b = b[n:]

aclements commented

@navytux, thanks for making us reconsider this optimization!

@RLH, can you double-check my logic? You're better at finding bugs in these things than I am. :)


navytux commented Apr 9, 2018

@aclements thanks a lot for the feedback and for coming up with the patch. I'm glad my observation could be useful.

@josharian thanks also for triggering this whole topic with b854339.


navytux commented May 7, 2018

Status update: CL 105258 started to implement the idea, but along the way it also decided to generalize the WB-relaxing rule so that anything could happen between the corresponding load and store in the same goroutine, including other writes to the pointer in question. That generalization, however, brought correctness issues, because the race problem can now be e.g. a 3-body problem of 2 goroutines and the GC, for example:

( buf.ptr initially points to A )

G1                                     G2              GC
r1 = buf.ptr
buf.ptr = new(alloc B)
                                                       begin, enable WB
                                                       scan G2.stack
                                       r2 = buf.ptr
buf.ptr = r1 (no WB; last use of r1)
                                                       scan G1.stack (marks A)
                                                       end
                                       use r2

where G2's r2 ends up pointing to B even though B was never marked.

( https://go-review.googlesource.com/c/go/+/105258/3/src/cmd/compile/internal/ssa/writebarrier.go#49 )


I do not have an exact proof, but if we allow omitting the write barrier only when:

the write makes the pointer point into the same allocation as last seen by the current goroutine for that pointer,

then there will be no situation where intermediate buf.ptr instability leaks to another goroutine without eventually being marked, like the G1 -> G2 leak of B in the example above (a sketch of which writes would and would not qualify follows after this list):

  • if the update makes the pointer point to a different allocation than the one last seen by the current G - whether last seen by reading that pointer or by writing to it - we just generate the WB as we do currently. This way there is no chance for another goroutine to observe either the current write or a later write, even one pointing back to the original allocation, with the GC failing to mark the corresponding object as live.

  • if the update makes the pointer point to the same allocation as last seen by the current G - we know that without the WB we are safe against write races from another mutator, because the WB triggered by that mutator shades both the inserted and the deleted objects (see e.g. #24314 (comment) for details).

    additionally for this case, if the WB happens to be disabled at runtime at the time of the write: enabling or disabling the WB happens (a) only with the world stopped, and (b) even with planned safepoints everywhere, still atomically with respect to the generated

    if writeBarrier.needed {
        // perform write-barrier notification to GC
    }

    code parts. So all mutators see the WB as either enabled or disabled consistently, and if the WB is not yet enabled, the GC is not yet running; thus, even though there is a race on the write, the GC will start scanning stacks and heap only after the race is over, and this way it cannot fail to mark surviving objects as live.
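
Here is the sketch promised above of how the proposed rule would classify writes (assumed semantics for illustration only, with hypothetical names holder/examples; not actual compiler behaviour):

package example

type holder struct {
        buf   []byte
        other []byte
}

func (h *holder) examples(n int) {
        h.buf = h.buf[n:] // same allocation as last seen by this G: WB could be omitted

        h.buf = h.other          // may point to a different allocation: keep the WB
        h.buf = append(h.buf, 0) // may reallocate to a new object: keep the WB
}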

@aclements I understand we are in code freeze now, but CL 105258 was mailed before the freeze, so it would be a pity not to get it finished for Go 1.11. As I already said, even though Go is compiled, performance is sometimes too painful, especially where GC bits are intermixed, and the slice self-update idiom is very common. It would be a pity to miss relaxing write barriers for this common case ...

I apologize if I am missing something and this optimization still cannot be applied even with my corrected rule.

Thanks beforehand,
Kirill

/cc @cherrymui


navytux commented Sep 28, 2018

Silence... "words are very unnecessary - they can only do harm..."


navytux commented Jul 20, 2020

This is still the case as of go version devel +0951939fd9 Fri Jul 17 19:09:40 2020 +0000 linux/amd64 - the s.v = s.v[1:] update below still goes through the write barrier:

package issue24314

type Sumit struct {
        v []int
}

func (s *Sumit) SumAndFlush() int {
        r := 0
        for len(s.v) > 0 { 
                r += s.v[0]
                s.v = s.v[1:]        // <-- NOTE HERE
        }
        return r
}
The generated amd64 assembly still contains the write-barrier check:

#include "funcdata.h"

TEXT ·(*Sumit).SumAndFlush(SB), ABIInternal, $8-16 // bufself.go:7
        NO_LOCAL_POINTERS
pc0:
        // MOVQ    (TLS), CX (stack growth prologue)
        // CMPQ    SP, 16(CX)
        // PCDATA  $0, $-2
        // JLS     128
        // PCDATA  $0, $-1
        // SUBQ    $8, SP
        // MOVQ    BP, (SP) (BP save)
        // LEAQ    (SP), BP (BP init)
        // FUNCDATA $0, gclocals·a36216b97439c93dafebe03e7f0808b5(SB) (args)
        // FUNCDATA $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB) (no locals)
        MOVQ       s+16(SP), AX  // bufself.go:9
        XORL       CX, CX
pc34:
        MOVQ       8(AX), DX
        MOVQ       (AX), BX
        MOVQ       16(AX), SI
        TESTQ      DX, DX
        JLE        pc109
        MOVQ       (BX), R8      // bufself.go:10
        DECQ       DX            // bufself.go:11
        MOVQ       DX, 8(AX)
        LEAQ       -1(SI), DX
        MOVQ       DX, 16(AX)
        ADDQ       R8, CX        // bufself.go:10
        NEGQ       DX            // bufself.go:11
        SARQ       $63, DX
        ANDQ       $8, DX
        ADDQ       BX, DX
        PCDATA     $0, $-2
        CMPL       runtime.writeBarrier(SB), $0               // <-- NOTE HERE
        JNE        pc99
        MOVQ       DX, (AX)
        JMP        pc34
pc99:
        MOVQ       AX, DI
        CALL       runtime.gcWriteBarrierDX(SB)               // <-- NOTE HERE
        JMP        pc34
pc109:
        PCDATA     $0, $-1       // bufself.go:13
        MOVQ       CX, _r0+24(SP)
        // MOVQ    (SP), BP (BP restore)
        // ADDQ    $8, SP (SP restore)
        RET
        NOP
        PCDATA     $1, $-1       // bufself.go:7
        PCDATA     $0, $-2
        NOP
        CALL       runtime.morestack_noctxt(SB)
        PCDATA     $0, $-1
        JMP        pc0

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 13, 2022