
cmd/compile: move instrumentation into an ssa pass? #19054

Closed
josharian opened this issue Feb 13, 2017 · 22 comments
Labels
early-in-cycle A change that should be done early in the 3 month dev cycle. FrozenDueToAge RaceDetector
Milestone

Comments

@josharian
Contributor

This is a naive question intended to open discussion.

race/msan instrumentation currently happens between walk and SSA generation. As a result, as we slowly migrate walk rewrites to SSA generation and/or SSA itself, racewalk will slowly fall out of sync and/or be a source of added work. And the simpler ontology of SSA might prove helpful in increasing instrumentation coverage.

Does it make sense to consider moving racewalk to a generic SSA pass? If so, opinions/considerations about implementation details?

I am not volunteering at the moment to do this work, although I did bump into this question while satisfying my curiosity about what it would be like to handle ORANGE directly during SSA generation. (Answer: It's a fair amount of work.)

cc @dvyukov @ianlancetaylor @randall77

@josharian josharian added this to the Go1.9Maybe milestone Feb 13, 2017
@dvyukov
Member

dvyukov commented Feb 13, 2017

Yes, it makes sense.

It should make the pass considerably simpler and hopefully eliminate the corner cases that we still hit episodically.

It may also help us implement some of the goodness we have in LLVM. IIRC, PointerMayBeCaptured alone eliminated ~1/3 of the instrumentation. There is also deduplication, at least within the same basic block:

void ThreadSanitizer::chooseInstructionsToInstrument(
    SmallVectorImpl<Instruction *> &Local, SmallVectorImpl<Instruction *> &All,
    const DataLayout &DL) {
  SmallSet<Value*, 8> WriteTargets;
  // Iterate from the end.
  for (Instruction *I : reverse(Local)) {
    if (StoreInst *Store = dyn_cast<StoreInst>(I)) {
      Value *Addr = Store->getPointerOperand();
      if (!shouldInstrumentReadWriteFromAddress(Addr))
        continue;
      WriteTargets.insert(Addr);
    } else {
      LoadInst *Load = cast<LoadInst>(I);
      Value *Addr = Load->getPointerOperand();
      if (!shouldInstrumentReadWriteFromAddress(Addr))
        continue;
      if (WriteTargets.count(Addr)) {
        // We will write to this temp, so no reason to analyze the read.
        NumOmittedReadsBeforeWrite++;
        continue;
      }
      if (addrPointsToConstantData(Addr)) {
        // Addr points to some constant data -- it can not race with any writes.
        continue;
      }
    }
    Value *Addr = isa<StoreInst>(*I)
        ? cast<StoreInst>(I)->getPointerOperand()
        : cast<LoadInst>(I)->getPointerOperand();
    if (isa<AllocaInst>(GetUnderlyingObject(Addr, DL)) &&
        !PointerMayBeCaptured(Addr, true, true)) {
      // The variable is addressable but not captured, so it cannot be
      // referenced from a different thread and participate in a data race
      // (see llvm/Analysis/CaptureTracking.h for details).
      NumOmittedNonCaptured++;
      continue;
    }
    All.push_back(I);
  }
  Local.clear();
}

@mdempsky
Member

Incidentally, I was looking at this some last week.

The big issue I had was whether to implement instrumentation as an SSA pass or during buildssa. Normally loads don't modify the memory variable, but when instrumented with a sanitizer function call they do. If we were to handle instrumentation as a pure SSA pass, it might need to rewrite memory variable references and insert phi nodes.

I did have a semi-working CL that emitted instrumentation calls during buildssa (basically just emitted a call before each OpLoad/OpStore/OpMove/OpZero), but it produced only about 80-90% as many calls. I haven't yet investigated why.

@dvyukov
Member

dvyukov commented Feb 13, 2017

Normally loads don't modify the memory variable, but when instrumented with a sanitizer function call they do.

Implementing it like a normal pass sounds better long term. What's the problem with loads?

but it produced only about 80-90% as many calls

That may be fine. The first thing is to just run all the race tests and go test -race std. If that works, we just need to look at the cases where the SSA pass does not insert instrumentation.

@josharian
Contributor Author

If reasonable, it'd be nice to do it as an SSA pass rather than during buildssa, for encapsulation and understandability reasons--buildssa is already unwieldy, and it will only grow from here.

Normally loads don't modify the memory variable, but when instrumented with a sanitizer function call they do.

It might take some work to make the scheduler happy with this, but could we simply leave sanitizer function calls as non-memory-modifying?

@mdempsky
Member

mdempsky commented Feb 13, 2017

What's the problem with loads?

Basically the signatures of SSA Load and Store operations are:

Load :: (*T, Memory) -> T
Store :: (*T, T, Memory) -> Memory

SSA also requires that Memory objects be linearly threaded through the computation. (Similar to monads in Haskell.)

So Stores are necessarily linearized, and Loads are freely rearranged between two Stores. Also, Calls behave a lot like Stores:

Call :: (Function, Memory) -> Memory

Instrumenting a Store is easy, because it amounts to just rewriting:

mem2 = Store(ptr, val, mem1)

into:

memtmp = Call(racewrite, ptr)
mem2 = Store(ptr, val, memtmp)

Since Store already produced a new Memory value, introducing a new Memory temporary for the instrumentation call is a purely local rewrite.

Instrumenting Load is trickier though, because if we have something like:

x := 0
if y != nil {
    x = *y
}

that is currently something like:

x := Const(0)
if NEQ(y, Const(nil)) {
    x = Load(y, mem0)
}

But as an SSA pass, we would have to rewrite it to:

x := Const(0)
if NEQ(y, Const(nil)) {
    mem1 = Call(raceread, y, mem0)
    x = Load(y, mem1)
}
mem2 = Phi(mem0, mem1)
// subsequent references to mem0 have to also be updated to mem2

@randall77
Contributor

I support moving race instrumentation to an SSA pass.

I was thinking about introducing a non-memory call for side-effect free calls. This would be generically useful for a bunch of runtime support calls, like complex division. We could use a similar call for instrumenting loads.
I've been waiting to pursue this idea until calling convention changes go in. Part of the reason calls take a memory argument is to make sure argument marshaling happens before the call does. That gets a lot easier with a register-based convention.
Of course, these are all special runtime calls, so we could special-case their calling conventions on an as-needed basis.

@mdempsky
Member

mdempsky commented Feb 13, 2017

Naively (i.e., I still largely treat package ssa as a black box), being able to express side-effect free calls SGTM.

Until we have that though, is there any opposition to moving racewalk into buildssa? My WIP CL was relatively non-intrusive after refactoring out the OpLoad/OpStore construction patterns, and I think could still be migrated to a proper SSA pass later.

@josharian
Contributor Author

Of course, these are all special runtime calls, so we could special-case their calling conventions on an as-needed basis.

I can't decide whether this is exciting or horrifying. It might be helpful for race calls, so that they can check args/results without disrupting the original call.

Until we have that though, is there any opposition to moving racewalk into buildssa?

Fine by me.

@randall77
Contributor

I guess I'm not really for moving the code twice if we can avoid it.
Do the race calls return anything? If not, we can just use a special call that takes a memory arg but returns void. That could be scheduled anywhere the input memory state is live. That's an easy change to the ssa backend.

@mdempsky
Member

Do the race calls return anything?

No. They just take an address and sometimes a size as input. No outputs.

@dvyukov
Member

dvyukov commented Feb 14, 2017

Yes, no outputs, and the calls don't touch anything that could possibly concern the compiler. The only case where the calls call back into the Go world is the symbolization callback during race reporting, but that runs non-instrumented runtime code.
The calls can be moved as long as they don't cross any synchronization operations, but that is probably not very important.

@josharian josharian modified the milestones: Go1.10, Go1.9Maybe Jul 6, 2017
@josharian josharian added the early-in-cycle A change that should be done early in the 3 month dev cycle. label Jul 6, 2017
@griesemer griesemer modified the milestones: Go1.10, Go1.11 Nov 29, 2017
@griesemer
Contributor

Too late. Moving to 1.11.

@gopherbot

Change https://golang.org/cl/102817 mentions this issue: cmd/compile: insert instrumentation during SSA building

@gopherbot

Change https://golang.org/cl/102816 mentions this issue: cmd/compile: refactor memory op constructors in SSA builder

@gopherbot

Change https://golang.org/cl/102815 mentions this issue: cmd/compile: disable instrumentation for no-race packages earlier

@mdempsky
Member

@aclements I tried to dig up my old racewalk-during-buildssa experimental CLs, but couldn't find them. I recreated the general patch series though, and it even seems to work. :)

I've randomly surveyed and manually inspected the resulting instrumented code, and it looks right.

I'm a little curious why the number of race calls for cmd/go has gone down modestly (~5.6%), but the binary size has gone up somewhat (~2.4%). The change to reorder1 to spill all function arguments to temporary variables before assigning them to the parameter slots explains a little of the size growth, but much less than I expected: without that change (i.e., generating broken code) the binary size still goes up ~1.9%.

@dvyukov
Member

dvyukov commented Mar 28, 2018

There seems to be some nice code size reduction! Thanks!

Do you know what was stripped in those ~5.6%? Can you please look at a few functions and confirm that we don't remove anything useful? Are they duplicates?

I think the main motivation for moving this to SSA (besides ending the infinite stream of bugs) was smarter analysis of redundant and unnecessary instrumentation. One of the low-hanging fruits is removing dups within a basic block:
https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/Instrumentation/ThreadSanitizer.cpp#L357
Namely, we collect all reads/writes within a basic block with no intervening calls/atomic ops, then dedup the reads and writes and remove reads where there is a write to the same address.

@dvyukov
Member

dvyukov commented Mar 28, 2018

Re binary size: I would first check that it's indeed due to the .text section (e.g. readelf -a shows the sections). Then use nm -S to find a few functions with the largest size increase, and just look at them.

@mdempsky
Member

mdempsky commented Mar 28, 2018

The binary size contribution seems to come from three sources:

  1. Most significantly, the CL prevents SSA-ing structs with 2-4 fields. (Currently, the SSA construction phase handles loading a single field from an SSA-able struct as a load of the entire struct, and then extracting just the relevant field. Presumably this is meant to help with CSE, but it introduces false dependencies during instrumentation. I tried disabling just this optimization, but it caused other failures, so I went with the heavier solution of simply disallowing SSA of >1-field structs while instrumenting.)
  2. The reorder1 changes I described above.
  3. Additional spilling needed because of the instrumentation calls being inserted immediately before loads. E.g., previously, encoding/binary.bigEndian.Uint64 could still be optimized into an 8-byte load on amd64, but now it's compiled as 8 1-byte loads, most of which need to be spilled to live through the subsequent raceread calls.

1 and 2 could potentially be improved. 3 seems like it would be more involved.

I looked through the removed instrumentation calls. They seemed to be mostly uses of stack variables that isartificial couldn't identify. For example, when creating non-escaping function closures on the stack, we can now omit the racewrite calls while creating them.

The only place I noticed new instrumentation calls is when spilling parameters to the heap. For example:

func f(x int) *int { return &x }

is compiled as:

func f(x int) *int {
    px := new(int)
    *px = x
    return px
}

Instrumenting during SSA construction means we now see *px = x and insert a racewrite call, whereas before we didn't. (This could probably be optimized away too.)

@dvyukov
Member

dvyukov commented Mar 28, 2018

Re (3), won't hoisting all race instrumentation to the beginning of a basic block (or to a preceding function call, whichever comes first) help? Doing analysis and instrumentation on the call-less parts of basic blocks would also help with #19054 (comment). Not sure if it can also help with (1).

@dvyukov
Member

dvyukov commented Mar 28, 2018

The only place I noticed new instrumentation calls is when spilling parameters to the heap.

Why wasn't it instrumented before?
Theoretically we only need instrumentation once a new object is published. Complete analysis can be hard, but a simpler version could be easier and still profitable: remove instrumentation of accesses to the new'ed object only within the same BB as the new, and only before any calls, atomic ops, or assignments of the new'ed pointer to any other location (e.g. to handle typical struct initialization).

@mdempsky
Member

Re (3), won't hoisting all race instrumentation to the beginning of a basic block (or a preceding function call whichever comes first) help?

Probably, but that sounds more involved than I have resources available at the moment. It's not trivial to reschedule function calls currently, or to insert them at arbitrary places.

My more immediate concern is the instrumentation pass is blocking us from moving optimizations from order/walk into SSA construction, where we expect they can be both more powerful and easier to maintain. I think that's worth a modest code size increase for instrumented builds, as long as performance is still acceptable.

Why wasn't it instrumented before?

Because paramstoheap adds the assignment to fn.Func.Enter, which instrument skips over ("nothing interesting for race detector in fn->enter").

Theoretically we only need instrumentation once a new object is published.

Ack, I thought about that, but we don't currently have that level of fine-grained information. Currently, we only track whether or not an object is ever published.

a simpler version that only removes instrumentation after new to the new'ed object in the same BB and only before any calls, atomicops and assignments of the new'ed pointer to any other locations can be easier and still profitable (e.g. handle typical struct initialization).

Yes, I think that would be good too.

gopherbot pushed a commit that referenced this issue Mar 29, 2018
Pulling these operations out separately so it's easier to add
instrumentation calls.

Passes toolstash-check.

Updates #19054.

Change-Id: If6a537124a87bac2eceff1d66d9df5ebb3bf07be
Reviewed-on: https://go-review.googlesource.com/102816
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
gopherbot pushed a commit that referenced this issue Apr 2, 2018
Rather than checking for each function whether the package supports
instrumentation, check once up front.

Relatedly, tweak the logic for preventing inlining calls to runtime
functions from instrumented packages. Previously, we simply disallowed
inlining runtime functions altogether when instrumenting. With this
CL, it's only disallowed from packages that are actually being
instrumented. That is, now intra-runtime calls can be inlined.

Updates #19054.

Change-Id: I88c97b48bf70193a8a3ee18d952dcb26b0369d55
Reviewed-on: https://go-review.googlesource.com/102815
Run-TryBot: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
@golang golang locked and limited conversation to collaborators Apr 9, 2019