
runtime: updates needed when goroutines switch threads on ppc64 #16928

Closed
laboger opened this issue Aug 30, 2016 · 26 comments
Labels: FrozenDueToAge, NeedsFix (The path to resolution is known, but the work has not been done.)

@laboger
Contributor

laboger commented Aug 30, 2016

What version of Go are you using (go version)?

go version devel +adb1e67 Tue Aug 30 08:08:37 2016 +0000 linux/ppc64le

What operating system and processor architecture are you using (go env)?

Ubuntu 16.04

We haven't hit a specific problem in this area yet, but there are concerns that some additional instructions should be included in the runtime.gosave routine when a goroutine's information is saved before it is swapped off a thread. We should be doing a sync and an instruction to clear any lwarx reservation that might be pending.

@bradfitz changed the title from "runtime: updates needed when goroutines switch threads" to "runtime: updates needed when goroutines switch threads on ppc64" on Aug 30, 2016
@ianlancetaylor
Contributor

How could there be an lwarx reservation pending? Goroutines aren't preempted arbitrarily. They are requested to stop and they stop only at agreed transition points.

@laboger
Contributor Author

laboger commented Aug 31, 2016

I think the situation of concern was if a goroutine called C code that was preempted.

If that is not a situation we need to be concerned about, then I can just add the sync, because I was told it should be there.

@ianlancetaylor
Contributor

C code (or Go code) may be preempted by the kernel, but it won't be preempted by the goroutine scheduler. That is, there is no scenario in which preempting C code will lead to a call to runtime.gosave.

I'm fine with adding a sync instruction to runtime.gosave if we can see some path that requires it. But I don't want to add it just because there might conceivably be a problem. Right now I don't see how there could possibly be a problem.

@laboger
Contributor Author

laboger commented Aug 31, 2016

Back to the pending reservation question -- couldn't someone write some functions in asm code where one function did a lwarx and another did a stwcx, and those asm functions were both called from within a goroutine in such a way that preemption would be possible between the two? Bad programming style, but is it illegal in Go?

@randall77
Contributor

I would say that doing a lwarx in one function and a stwcx in another is illegal in Go assembly. When writing assembly, there be dragons, and this is (another) one.


@laboger
Contributor Author

laboger commented Aug 31, 2016

It certainly seems like a bad thing to write code with lwarx and stwcx that way. Are you saying not to worry about that case and that if someone reported it they would be told it's illegal? I don't think it would have to be Go asm; I think someone could write a C program with inline assembly to do the same.

Let me back up to what prompted this issue to begin with. We are looking at ways to improve some of the atomic loads and stores on ppc64le so they don't have to do as many syncs -- maybe remove some or replace with lwsync in some cases. Changing some of those would depend on having a sync in gosave when a goroutine is swapped off a thread. So my thought was to put the change into gosave first separately and then follow with the other changes to some of the atomics. If it's preferred that all changes related to syncs and lwsyncs are done in one change we can do that instead.

@randall77
Contributor

Yes, we can say "don't do that". lwarx and stwcx don't pair across a preemption point. Same as depending on register state to be preserved across a preemption point - it may work in the common case, but not if a preemption occurs.

Your plan sounds fine. All changes don't have to be in one CL.
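As a point of reference for the planned atomics change, here is a minimal Go-level sketch (illustrative only, not code from this issue) of the ordering guarantee the ppc64 atomic load/store implementations must keep providing whether they use HWSYNC or LWSYNC internally: a plain write published by an atomic store must be visible once a subsequent atomic load observes that store.

package main

import (
	"fmt"
	"sync/atomic"
)

var data int     // plain, non-atomic data
var ready uint32 // published with an atomic store/load

// producer writes data and then publishes it. On ppc64 the barrier inside
// the atomic store (HWSYNC today, possibly LWSYNC after the proposed change)
// is what keeps the data write ordered before the flag write.
func producer() {
	data = 42
	atomic.StoreUint32(&ready, 1)
}

// consumer spins on the flag; once it observes 1, the data write must be
// visible. Any weakening of the ppc64 barriers must preserve this.
func consumer() {
	for atomic.LoadUint32(&ready) == 0 {
	}
	fmt.Println(data) // always 42
}

func main() {
	go producer()
	consumer()
}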

@bcbrock

bcbrock commented Aug 31, 2016

I would have said that adding a STCX. seemed a small price to pay to make the goroutine switch bulletproof, but I understand @randall77's comment; it's your ABI and runtime, so you can make the rules.

Regarding the need for the SYNC, we got the explanation from one of our cache architects (Derek Williams), which I will paraphrase here:

It is possible for a store done by a goroutine to linger indefinitely in the L2 store queue. If this goroutine is preempted and then immediately restarted on another processor, then without a guarantee of an intervening SYNC it could possibly enter a state where it would not be able to see its most recent stores, because they would not have been committed to the coherence order yet. His final words were "There's no room to debate this one".

@ianlancetaylor
Contributor

I believe there is room for debate once you understand that goroutines are not preempted at arbitrary points, but only at well understood and agreed upon points.

@laboger
Contributor Author

laboger commented Aug 31, 2016

I was also asked about the case where Go calls C and C starts a transaction. Is there any way the transaction could be left in an unfinished state and the goroutine then preempted by the Go scheduler?

@ianlancetaylor
Contributor

The Go scheduler will never preempt a goroutine that is running C code.

In the sense that we are discussing here, the Go scheduler never preempts anything at all. It just, essentially, sets a flag on the goroutine saying "please stop." The goroutine checks that flag from time to time (currently, at each call to a function written in Go or written with the Go assembler, if that function is not marked nosplit).
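To make the preemption-point behavior concrete, a minimal sketch (illustrative only, and specific to the pre-Go-1.14 toolchains discussed here, before asynchronous preemption existed):

package main

import "fmt"

// busyLoop makes no function calls in its body, so in the toolchains being
// discussed it contains no preemption check: once running, it cannot be
// asked to stop until it returns.
func busyLoop(n int) int {
	s := 0
	for i := 0; i < n; i++ {
		s += i
	}
	return s
}

//go:noinline
func doWork(i int) int {
	buf := make([]int, 16) // give the callee a real frame and stack-check prologue
	buf[0] = i
	return buf[0]
}

// cooperativeLoop calls doWork on every iteration. Assuming doWork is not
// inlined and gets a normal prologue, each call is a point where the
// scheduler's "please stop" flag is checked and the goroutine can be parked.
func cooperativeLoop(n int) int {
	s := 0
	for i := 0; i < n; i++ {
		s += doWork(i)
	}
	return s
}

func main() {
	fmt.Println(busyLoop(1000000), cooperativeLoop(1000000))
}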

@randall77
Contributor

In addition to what Ian said, the Go scheduler uses synchronization operations to move a goroutine from one P (one OS thread) to another. So the Go scheduler guarantees that after a goroutine stops, it will make sure all of its state gets to the new processor before the goroutine is started again.


@strikerdw

Hello, I'm the Derek Williams that Bishop referred to. I'm a hardware guy and I know how to spell Go, but that's most of what I know about Go. But I am the Power ISA memory model expert, so what I say here and in other emails will be slanted towards how the hardware works at the machine instruction level. There's doubtless some translation to be had from that world-view into Go.

Above, ianlancetaylor says:

I believe there is room for debate once you understand that goroutines are not preempted at arbitrary points, but only at well understood and agreed upon points.

Possibly yes. Can you give me some idea what these agreed upon and well understood points guarantee? Some of the usual hygiene that has to happen at context switches might be avoidable based on these conditions, but before I can have an opinion I need to know what guarantees those points give you.

And randall77 says:

In addition to what Ian said, the Go scheduler uses synchronization operations to move a goroutine from one P (one OS thread) to another. So the Go scheduler guarantees that after a goroutine stops, it will make sure all of its state gets to the new processor before the goroutine is started again.

Interesting. Can you explain what synchronization operations those are?

They might fix things, but I can't tell because I have no idea what they are.

I apologize for my ignorance here. Once I have some idea what these things are and if they fix this, I'll be in a better position to comment.

Thanks,

Derek

@randall77
Contributor

Those points are at most function call prologues (a few specially chosen runtime functions opt out of allowing preemption). In Go, all registers are caller-save, so no registers are live at these points, except the stack pointer and any arch-specific things like the thread pointer and zero register.

As for the synchronization operations: there's a lock around the global run queue (src/runtime/lock_futex.go). Processors also put and take work from per-P queues using atomic load + compare and swap (src/runtime/proc.go:runqget, for example).
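For concreteness, a much-simplified sketch of the per-P run queue "get" path randall77 points at (src/runtime/proc.go:runqget); the names and structure here are illustrative stand-ins, not the actual runtime code:

package sketch

import "sync/atomic"

type g struct{} // stand-in for the runtime's goroutine descriptor

// p is a stand-in for the runtime's per-P scheduling state.
type p struct {
	runqhead uint32
	runqtail uint32
	runq     [256]*g
}

// runqget pops a goroutine from pp's local run queue. The acquire-flavored
// load of runqhead and the compare-and-swap that advances it are the kind of
// atomic operations under discussion: their barriers are what make the
// dequeued goroutine's memory visible to the thread that picks it up.
func runqget(pp *p) *g {
	for {
		h := atomic.LoadUint32(&pp.runqhead) // load-acquire in the real code
		t := atomic.LoadUint32(&pp.runqtail)
		if t == h {
			return nil // local queue is empty
		}
		gp := pp.runq[h%uint32(len(pp.runq))]
		if atomic.CompareAndSwapUint32(&pp.runqhead, h, h+1) { // cas-release in the real code
			return gp
		}
		// lost a race with a concurrent stealer; retry
	}
}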

@ianlancetaylor
Contributor

@strikerdw Hi, thanks for the comments. I alluded to the cases earlier. The goroutine scheduler works by, essentially, setting a flag asking a goroutine to stop. The goroutine checks the flag periodically, and, if it is set, stops, and enters the scheduler. A goroutine also enters the scheduler whenever it blocks on some operation, like a channel operation. Goroutines are never preempted.

The periodic flag checks currently occur at function entry for ordinary functions. Those are the only moments when a goroutine enters the scheduler because of a preemption request. All that needs to be saved at such a point--all that is saved when a thread switches from one goroutine to another--is the PC, the stack pointer, and, on some architectures, the link register. Everything else is in memory somewhere.
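For reference, the state Ian describes maps roughly onto the runtime's gobuf structure (this is a simplified, hand-written sketch, not the actual declaration in runtime/runtime2.go):

package sketch

import "unsafe"

// gobuf is approximately what the runtime records when a goroutine is parked
// at a preemption point: the program counter, stack pointer, link register,
// and a little bookkeeping. Everything else the goroutine was using is
// already in memory.
type gobuf struct {
	sp   uintptr        // stack pointer
	pc   uintptr        // program counter to resume at
	g    uintptr        // the goroutine this buffer belongs to (a guintptr in the real code)
	ctxt unsafe.Pointer // closure context, if any
	ret  uintptr        // return value register
	lr   uintptr        // link register, on LR architectures such as ppc64
	bp   uintptr        // frame pointer, on architectures that keep one
}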

@strikerdw

I was afraid of that. I follow much of that, but not all of it.

So let me just try to explain from a machine code level what can happen. We'll have to map this up to the Go reality later.

Suppose I have this machine code:

st r1, X // r1=1
...context switch.... // took a context switch here (in your world that's some function call that allows it)
ld r2, X // read X back.

In Power, the st r1, X doesn't propagate to memory immediately (and in fact propagates to other processors and memory at varying points in time). So, the memory write for the st r1, X can hang out in a queue on the processor you started executing on and not be visible to anyone else for a very long time unless a barrier pushes it around. If this bit of code doesn't have a barrier (an HWSYNC in this case) in the path, then when the S/W thread here wakes up on the other processor, that other processor may not have yet seen the st r1, X and could easily see a stale cached version of X. So the ld r2, X can see "0" (assume the initial value of X before the write is 0).

That's bad. Coherence (or uniprocessor data dependencies -- depending on who you're talking to) just went out the window. Don't cross the streams kinda bad......

The way we cope with this in POWER is for context switches to have an HWSYNC. That also doesn't technically push the store out by itself (I'm fibbing a bit here, but it's close enough for this discussion), but what it does do is order the st X,1 ahead of the stores to whatever job queue control data structure that the H/W thread we're destined for will read in order to realize it can dispatch the S/W thread. So by the time the S/W thread dispatches on the new H/W thread, the HWSYNC has ensured that the ST X,1 is visible to the new H/W thread.

Normal programmers never see this. It just happens in the OS and no one is the wiser. Once you start taking on thread migration at a user level, however, you start feeling and having to manage these effects.

So when Ian said earlier something to the effect that "everything else is in memory somewhere", it's probably a bit more accurate to say that everything is at least on its way to memory and will get there eventually or when a barrier pushes it through.
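Derek's single-threaded example, rendered at the Go level (an illustrative sketch only; runtime.Gosched is just a convenient way to reach a scheduling point, and the goroutine may or may not actually move to another OS thread):

package main

import (
	"fmt"
	"runtime"
)

var x int // Derek's location X

func main() {
	x = 1             // "st r1, X": may still be sitting in the old CPU's store queue
	runtime.Gosched() // a scheduling point; the goroutine may resume on a different OS thread
	fmt.Println(x)    // "ld r2, X": must print 1, which is guaranteed only if the
	                  // scheduler's own synchronization orders the store before the handoff
}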

It's really hard for me to imagine how you avoid this. Sadly it'll take me a long time to learn enough Go to read your scheduler code and look at the barriers in that and determine if that synchronization will fix this (unless there's an obvious HWSYNC that's executed on the departing thread :-)).

As an aside, I don't work for ARM, and I don't play an ARM engineer on television, but their memory model has a great many technical similarities to ours. As a matter of architecture, I would expect them to have similar issues. As a matter of most of the current implementations I'm aware of, they may get away with skipping the barrier..... for now. You should check with them at some point.

Let's agree about SYNC before we start into STCX/reservations. The case there is open for more debate.

Derek

@randall77
Contributor

I think you're overthinking this. Almost any parallel computation does:

thread 1:
*x = 5
unlock y

thread 2:
lock y
... = *x

Any implementation has to implement lock and unlock so the read in thread 2 is guaranteed to see 5.

That's all we need for the Go scheduler to work properly. The unlock y/lock y pair are inside the scheduler (what gets called when a goroutine takes an "interrupt" and is descheduled, then rescheduled). There's nothing special about goroutine state - it is just data guarded by locks.

There are a few optimizations in the scheduler where we don't actually do a lock/unlock but a cheaper compare-and-swap or other atomic operation. You're also going to have to implement those atomic ops so that their memory semantics are correct. The implementations are in runtime/internal/atomic.
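randall77's thread 1 / thread 2 pseudocode above, written out as a runnable Go sketch (illustrative only):

package main

import (
	"fmt"
	"sync"
)

var (
	x int
	y sync.Mutex
)

func main() {
	done := make(chan struct{})
	y.Lock() // hold y so the reader cannot acquire it before the write below
	go func() {
		y.Lock()       // "thread 2": lock y
		fmt.Println(x) // guaranteed to print 5: the unlock/lock pair carries the happens-before
		y.Unlock()
		close(done)
	}()
	x = 5      // "thread 1": *x = 5
	y.Unlock() // "thread 1": unlock y
	<-done
}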


@ianlancetaylor
Contributor

For the record, the SYNC instructions required by what Keith is talking about are in https://golang.org/src/runtime/internal/atomic/asm_ppc64x.s and https://golang.org/src/runtime/internal/atomic/atomic_ppc64x.s .

On the one hand, we could certainly be doing something wrong, and it would be great if you could analyze our code and verify that it works. On the other hand, we aren't new at this stuff; we do pretty much know what we're doing.

@bcbrock

bcbrock commented Sep 1, 2016

@strikerdw I agree now that the current implementation is OK for normal cacheable memory. Locking the scheduler lock to insert the preempted goroutine into the global queue guarantees that any HW thread that is able to see and pick up the goroutine from the global queue will necessarily be able to see all stores made by the goroutine before it was placed in the queue.

So I think the question now has to do with I/O and non-cacheable memory. As @laboger mentioned above, we want to replace almost all HWSYNC instructions in our atomics with LWSYNC to improve performance. The only remaining HWSYNC would be in the seldom-used atomic.Store() routines. LWSYNC does not order I/O or noncacheable memory, and presumably C-language routines called from Go could be using these types of memory.

@ianlancetaylor
Contributor

I guess the concern is that C code might use I/O instructions and non-cacheable memory but fail to call HWSYNC? I have no idea how much of a problem that would be. We could add HWSYNC instructions to the code that moves between C and Go easily enough. Calling from Go to C will go through the function asmcgocall in runtime/asm_ppc64x.s. Calling from C to Go will go through the function crosscall2 in runtime/cgo/asm_ppc64x.s (note that those are two files with the same name in two different directories).

@strikerdw

Apologies for the delay. My day job required my attention today....

OK, first off, I'm going to set Bishop's answer to the side and talk about this from a different point of view. His answer is technically correct insofar as it goes, but that answer obscures the issues here more than it helps understanding.

Also, to be clear, I don't intend to imply for a moment that you folks don't know what you're doing. I'm quite sure you do. However, the specific mechanics and the hardware requirements Power places on software for thread migration are not a subject we document as well as we should, so it's difficult for anyone, no matter how skilled, to find.

What Randall says above about unlock-ing and lock-ing and threads and reading/writing variables is all true, but I don't think that example captures all the subtlety going on in process migration.

So, again consider this program (note it's single threaded):

ST X,1
some_go_function_call_so_we_can_context_switch()
LD X

I believe we all agree the LD X better get 1 (assuming that the function call didn't manipulate X). If we don't agree, please say so now. So, the critical issue here is what magic happens in the thread migration code.

Now if the thread migration code, in the process of locking the job list, putting the S/W thread info on the job list, and unlocking the job list winds up incidentally doing an HWSYNC on the physical hardware thread we are departing, this issue will work out (killing the reservation is a separate issue).

I suspect that’s what’s happening now. I looked at your assembler code and it’s got “SYNC” all over the place in it (which I assume means what I call an HWSYNC in our standard mnemonics). So it’s hard for me to imagine you don’t have an HWSYNC in there somewhere though I haven’t confirmed this.

So it may be working, but I don't think the intent was to use an HWSYNC in this path because it would indirectly manage the hardware thread migration issues. I think your intent in that locking/barrier work was mainly to manage getting the S/W thread onto the job list safely.

If what you were out to do is just the job list manipulation, a better implementation in Power would use LWSYNC and possibly ISYNC instead of any HWSYNCs here. And modifying these primitives on Power to not use so much HWSYNC is something we're looking at, because there's a chance they're currently over-burdened with synchronization (this is a different and deep subject, and it depends on what the Go language's memory model semantics are).

So if we get to a point where the primitives are a bit more efficient, HWSYNC might not “just happen” in the path. You'd still need at least an LWSYNC, but those don't interact with cache-inhibited ops. So with only LWSYNC left the job list manipulation would work but programs with I/O over the migration would fail. Consider this program:

store to I/O data register
go_function_call_so_we_can_context_switch()
store to control register telling adapter to read data registers

The LWSYNC in the thread migration code wouldn’t push the store to the I/O data register out on the old hardware thread and the store to the control register could happen from the migrated to H/W thread first. The device would read a corrupted data register value.

Having HWSYNC in many or most of the atomic primitives instead of LWSYNC might fix this thread migration issue and not require the one HWSYNC explicitly in the path, but you're paying HWSYNC penalties for every other use of the primitive instead of LWSYNC penalties.

I think the bottom line is this: on a thread migration if you place one explicit HWSYNC in the path, it just works. Period. No having to think, no debates, no long github posts. No complex sets of rules. Done. And it might wind up being faster than having the HWSYNCs in most/all of the primitives.

If you don't have that explicit HWSYNC, one has to really scrub the thread migration code path to be sure it either 1) implicitly winds up doing an HWSYNC for some reason (and it possibly didn't need an HWSYNC there instead of an LWSYNC) or 2) figure out whatever barriers you do have in the migration path and then what subtle things (like I/O) you have to tell the programmer can't span over the function call that might context switch.

The second option can be really complex to figure out and leaves the programmers with a complex set of rules. If I understood correctly, one of the guiding principles of Go is to keep user programmers from having to think about this stuff?

So I can't say there is no other way to achieve the thread migration than by an explicit HWSYNC in the path, but I think the other options over-penalize the barriers in the primitives or require a complex set of "can't do's" for the programmer and can be hard to figure out.

Derek

@ianlancetaylor
Contributor

Go provides no way to write to an I/O register except by executing assembly code. I'm perfectly willing to say that any assembly code that writes to an I/O register is responsible for executing HWSYNC instructions as needed. I appreciate your concerns as to documenting what a programmer can assume about a context switch, but since approximately zero Go programmers will try to manipulate I/O registers it doesn't seem like a major concern. Go isn't really that kind of language. I think it's vastly more likely that anybody who wants to do that will call into a C library, so I can see a good argument for calling HWSYNC on the Go <-> C call paths.

The SYNC instruction that appears in Go assembly code does appear to assemble to an HWSYNC instruction (0x7c0004ac). If we can replace this with a more efficient instruction, that would be great.

We do currently use LWSYNC in one case: to ensure that the garbage collector metadata for a new object is visible to the concurrent garbage collector before the object is actually used in any way. See publicationBarrier in runtime/malloc.go.

@strikerdw

Ok. That is a legitimate choice.

To not be remiss, then, I should point out that unless there's an incidental HWSYNC (say, in the to-C or from-C paths), the programmer putting in HWSYNCs may not be enough:

ST to data register
go_function_call_so_can_preempt()
HWSYNC
ST to ctrl register

will also fail. The HWSYNC the programmer inserted executes on the thread we migrated to, and that does not push out the cache-inhibited (CI) store on the thread we migrated from. In the absence of proper run-time support, the programmer inserting even correct barriers may not be enough.

So, at this point I'll assume the issues around barriers in the thread migration path are either avoided or handled by whatever barriers are there or in the C path. There are a few other hygiene cleanup steps that are generally necessary when migrating threads in Power. I will list them below and leave it to you folks to decide if they matter in the context of Go's thread scheduler. All of these operations are executed on the thread being departed. (I think this list is complete.)

  1. Doing a stcx to a scratch location to clear the reservation for larx/stcx instructions. This prevents a STCX on the thread being dispatched on the H/W thread from errantly succeeding based on the prior S/W thread's larx.

  2. Doing a tabort instruction (P8 and later). This resets the transactional tracking mechanism and prevents a thread being dispatched from errantly committing a partial transaction.

  3. Doing a cp_abort instruction (P9 and later). This resets the copy-paste instruction machinery so that an interrupted copy-paste pair doesn't errantly execute on the new thread.

If there are any questions, I'm happy to answer, but otherwise I'll consider this subject closed.

If programmers do manage to do things that cause issues with the context switching, those bugs are extremely difficult to isolate and debug. They'll happen quite randomly and usually only in cases of heavy load (yes I've had the misfortune of doing this).

This is probably not the entirely correct place to ask, but who (if anyone) is responsible for deciding what the Go programming language memory model is? Are there any written documents? Like the C++ or C or (shudder) Java memory models? To figure out if the atomic routines in Go can use more efficient Power barriers, I need to better understand what the language's memory model requirements are. Any help is appreciated.

Thanks,

Derek

@ianlancetaylor
Contributor

The Go memory model is at https://golang.org/ref/mem . You will likely find it unsatisfactory. See also #5045, #7948, and #9442.

In any case, the Go memory model is for user code. It does not describe the requirements on the Go runtime itself. That is, the Go memory model describes the sync/atomic package, or, more precisely, it should describe that package but doesn't. The Go memory model does not describe the runtime/internal/atomic package, which is what the runtime uses.

@quentinmit added this to the Go1.8Maybe milestone on Sep 6, 2016
@laboger
Contributor Author

laboger commented Sep 7, 2016

I was out a few days last week and Bishop is now out through the end of next week. Lots of good discussion here to help understand the situation. Here is my suggestion on how to proceed.

Bishop has proposed changes to some of the atomic functions to use LWSYNC instead of HWSYNC to improve performance. Most of these are in the runtime/internal atomics. Once he gets back he can submit his change. Bishop has studied the Go scheduler more than I have, but I see Keith's point about the runqput functions doing locks when swapping goroutines and how that should meet the requirement of doing a sync and clearing the reservation.

@quentinmit added the NeedsFix label (the path to resolution is known, but the work has not been done) on Oct 11, 2016
@rsc
Contributor

rsc commented Oct 21, 2016

As far as goroutine switching is concerned, the key point is what Keith said: the Go runtime really looks like any other multithreaded user program that Linux might run. The only way a goroutine stops execution on one thread and starts anew on another is when the goroutine itself is handed off to the other thread using either a locked data structure (the global run queue) or an atomic compare and swap. Assuming those lower level things are implemented correctly, there's nothing extra needed to make sure the memory is carried along.

It is also important that goroutines are cheap: goroutine switches happen at very high frequencies. It's important not to weigh them down to the level of thread context switches. I'm unconcerned about assembly authors leaving the processor in an unfortunate state (stcx, tabort, cp_abort, etc). That's their problem, as discussed above.

I know that Derek, Bishop, and Lynn are working on sanity checking the actual atomics, and I spoke to them about that earlier this week and expect to continue to do so. Assuming those are OK and stay OK, goroutines should come along for the ride.

Since Derek wrote:

If there are any questions, I'm happy to answer, but otherwise I'll consider this subject closed.

I will close this.

@rsc closed this as completed on Oct 21, 2016
@golang locked and limited conversation to collaborators on Oct 21, 2017