runtime: updates needed when goroutines switch threads on ppc64 #16928
How could there be an lwarx reservation pending? Goroutines aren't preempted arbitrarily. They are requested to stop, and they stop only at agreed transition points.
I think the situation of concern was a goroutine calling C code that gets preempted. If that is not a situation we need to be concerned about, then I can just add the sync, because I was told it should be there.
C code (or Go code) may be preempted by the kernel, but it won't be preempted by the goroutine scheduler. That is, there is no scenario in which preempting C code will lead to a call to …. I'm fine with adding a sync instruction to ….
Back to the pending reservation question -- couldn't someone write some functions in asm code where one function did a lwarx and another did a stwcx, and those asm functions were both called from within a goroutine in such a way that preemption would be possible between the two? Bad programming style, but is it illegal in Go?
I would say that doing a lwarx in one function and stwcx in another is ….
It certainly seems like a bad thing to write code with lwarx and stwcx that way. Are you saying not to worry about that case, and that if someone reported it they would be told it's illegal? I don't think it would have to be Go asm; I think someone could write a C program with inline assembly to do the same.

Let me back up to what prompted this issue to begin with. We are looking at ways to improve some of the atomic loads and stores on ppc64le so they don't have to do as many syncs -- maybe remove some, or replace them with lwsync in some cases. Changing some of those would depend on having a sync in gosave when a goroutine is swapped off a thread. So my thought was to put the change into gosave first, separately, and then follow with the other changes to the atomics. If it's preferred that all changes related to syncs and lwsyncs are done in one change, we can do that instead.
I would have said that adding a STCX. seemed a small price to pay to make the goroutine switch bulletproof, but I understand @randall77's comment; it's your ABI and runtime, so you can make the rules.

Regarding the need for the SYNC, we got the explanation from one of our cache architects (Derek Williams), which I will paraphrase here: it is possible for a store done by a goroutine to linger indefinitely in the L2 store queue. If this goroutine is preempted and then immediately restarted on another processor, then without a guarantee of an intervening SYNC it could possibly enter a state where it would not be able to see its most recent stores, because they would not have been committed to the coherence order yet. His final words were "There's no room to debate this one".
I believe there is room for debate once you understand that goroutines are not preempted at arbitrary points, but only at well understood and agreed upon points.
I was also asked about the case where Go calls C and the C code starts a transaction. Is there any way the transaction could be left in an unfinished state and the goroutine then preempted by the Go scheduler?
The Go scheduler will never preempt a goroutine that is running C code. In the sense that we are discussing here, the Go scheduler never preempts anything at all. It just, essentially, sets a flag on the goroutine saying "please stop." The goroutine checks that flag from time to time (currently, at each call to a function written in Go or written with the Go assembler, if that function is not marked nosplit).
In addition to what Ian said, the Go scheduler uses synchronization ….
Hello, I'm the Derek Williams Bishop referred to. I'm a hardware guy; I know how to spell Go, but that's most of what I know about Go. I am, however, the Power ISA memory model expert, so what I say here and in other emails will be slanted towards how the hardware works at the machine instruction level. There's doubtless some translation to be had from that world-view into Go.

Above, ianlancetaylor says: "I believe there is room for debate once you understand that goroutines are not preempted at arbitrary points, but only at well understood and agreed upon points." Possibly, yes. Can you give me some idea what these agreed upon and well understood points guarantee? Some of the usual hygiene that has to happen at context switches might be avoidable based on these conditions, but before I can have an opinion I need to know what guarantees those points give you.

And randall77 says: "In addition to what Ian said, the Go scheduler uses synchronization …" Interesting. Can you explain what synchronization operations those are? They might fix things, but I can't tell because I have no idea what they are. I apologize for my ignorance here. Once I have some idea what these things are and whether they fix this, I'll be in a better position to comment.

Thanks, Derek
@strikerdw Hi, thanks for the comments. I alluded to the cases earlier. The goroutine scheduler works by, essentially, setting a flag asking a goroutine to stop. The goroutine checks the flag periodically and, if it is set, stops and enters the scheduler. A goroutine also enters the scheduler whenever it blocks on some operation, like a channel operation. Goroutines are never preempted. The periodic flag checks currently occur at function entry for ordinary functions. Those are the only moments when a goroutine enters the scheduler because of a preemption request.

All that needs to be saved at such a point -- all that is saved when a thread switches from one goroutine to another -- is the PC, the stack pointer, and, on some architectures, the link register. Everything else is in memory somewhere.
I was afraid of that. I follow much of that, but not all of it. So let me just try to explain from a machine code level what can happen. We'll have to map this up to the Go reality later. Suppose I have this machine code (a single S/W thread that migrates between processors partway through):

```
st r1, X    // r1 = 1
...         // S/W thread migrates to another processor here
ld r2, X
```

In Power, the st r1, X doesn't propagate to memory immediately (and in fact propagates to other processors and memory at varying points in time). So the memory write for the st r1, X can hang out in a queue on the processor you started executing on and not be visible to anyone else for a very long time, unless a barrier pushes it around. If the path doesn't have a barrier (an HWSYNC in this case), then when the S/W thread wakes up on the other processor, that other processor may not yet have seen the st r1, X and could easily see a stale cached version of X. So the ld r2, X can see "0" (assume the initial value of X before the write is 0). That's bad. Coherence (or uniprocessor data dependencies -- depending on who you're talking to) just went out the window. Don't cross the streams kinda bad......

The way we cope with this in POWER is for context switches to have an HWSYNC. That also doesn't technically push the store out by itself (I'm fibbing a bit here, but it's close enough for this discussion), but what it does do is order the st r1, X ahead of the stores to whatever job queue control data structure that the H/W thread we're destined for will read in order to realize it can dispatch the S/W thread. So by the time the S/W thread dispatches on the new H/W thread, the HWSYNC has ensured that the st r1, X is visible to the new H/W thread. Normal programmers never see this. It just happens in the OS and no one is the wiser. Once you start taking on thread migration at a user level, however, you start feeling and having to manage these effects.
So when Ian said earlier something to the effect that "everything else is in memory somewhere", it's probably a bit more accurate to say that everything is at least on its way to memory and will get there eventually, or when a barrier pushes it through. It's really hard for me to imagine how you avoid this. Sadly, it'll take me a long time to learn enough Go to read your scheduler code, look at the barriers in it, and determine if that synchronization will fix this (unless there's an obvious HWSYNC that's executed on the departing thread :-)).

As an aside, I don't work for ARM, and I don't play an ARM engineer on television, but their memory model has a great many technical similarities to ours. As a matter of architecture, I would expect them to have similar issues. As a matter of most of the current implementations I'm aware of, they may get away with skipping the barrier..... for now. You should check with them at some point.

Let's agree about SYNC before we start into STCX/reservations. The case there is open for more debate.

Derek
I think you're overthinking this. Almost any parallel computation does:

```
thread 1:            thread 2:
  write data
  unlock l
                       lock l
                       read data
```

Any implementation has to implement lock and unlock so the read in thread 2 sees the write in thread 1. That's all we need for the Go scheduler to work properly. The unlock …. There's a few optimizations in the scheduler where we don't actually do a ….
For the record, the ….

On the one hand, we could certainly be doing something wrong, and it would be great if you could analyze our code and verify that it works. On the other hand, we aren't new at this stuff; we do pretty much know what we're doing.
@strikerdw I agree now that the current implementation is OK for normal cacheable memory. Locking the scheduler lock to insert the preempted goroutine into the global queue guarantees that any HW thread that is able to see and pick up the goroutine from the global queue will necessarily be able to see all stores made by the goroutine before it was placed in the queue.

So I think the question now has to do with I/O and non-cacheable memory. As @laboger mentioned above, we want to replace almost all HWSYNC instructions in our atomics with LWSYNC to improve performance. The only remaining HWSYNC would be in the seldom-used atomic.Store() routines. LWSYNC does not order I/O or noncacheable memory, and presumably C-language routines called from Go could be using these types of memory.
I guess the concern is that C code might use I/O instructions and non-cacheable memory but fail to call ….
Apologies for the delay. My day job required my attention today....

OK, first off, I'm going to set Bishop's answer to the side and talk about this from a different point of view. His answer is technically correct as far as it goes, but it obscures the issues here more than it helps understanding. Also, to be clear, I don't intend to imply for a moment that you folks don't know what you're doing. I'm quite sure you do. However, the specific mechanics and hardware requirements Power places on software for thread migration are not a subject we document as well as we should, so they're difficult for anyone, no matter how skilled, to find.

What Randall says above about unlocking and locking and threads and reading/writing variables is all true, but I don't think that example captures all the subtlety going on in process migration. So, again consider this program (note it's single threaded):

```
ST X,1
<function call that may migrate the S/W thread>
LD X
```

I believe we all agree the LD X better get 1 (assuming that the function call didn't manipulate X). If we don't agree, please say so now.

So, the critical issue here is what magic happens in the thread migration code. Now if the thread migration code, in the process of locking the job list, putting the S/W thread info on the job list, and unlocking the job list, winds up incidentally doing an HWSYNC on the physical hardware thread we are departing, this issue will work out (killing the reservation is a separate issue). I suspect that's what's happening now. I looked at your assembler code and it's got "SYNC" all over the place (which I assume means what I call an HWSYNC in our standard mnemonics). So it's hard for me to imagine you don't have an HWSYNC in there somewhere, though I haven't confirmed this. So it may be working, but I don't think the intent was to use an HWSYNC in this path so that it would indirectly manage the hardware thread migration issues.
I think your intent in that locking/barrier work was mainly to manage getting the S/W thread onto the job list safely. If what you were out to do is just the job list manipulation, a better implementation in Power would use LWSYNC and possibly ISYNC instead of any HWSYNCs here. And modifying these primitives on Power to not use so much HWSYNC is something we're looking at, because there's a chance they're currently over-burdened with synchronization (this is a different and deep subject, and it depends on what the Go language's memory model semantics are). So if we get to a point where the primitives are a bit more efficient, HWSYNC might not "just happen" in the path. You'd still need at least an LWSYNC, but those don't interact with cache-inhibited ops. So with only LWSYNC left, the job list manipulation would work, but programs with I/O over the migration would fail. Consider this program:

```
store to I/O data register
<S/W thread migrates to another H/W thread>
store to I/O control register
```

The LWSYNC in the thread migration code wouldn't push the store to the I/O data register out on the old hardware thread, and the store to the control register could happen from the migrated-to H/W thread first. The device would read a corrupted data register value.

Having HWSYNC in many or most of the atomic primitives instead of LWSYNC might fix this thread migration issue and not require the one HWSYNC explicitly in the path, but then you're paying HWSYNC penalties for every other use of the primitive instead of LWSYNC penalties. I think the bottom line is this: on a thread migration, if you place one explicit HWSYNC in the path, it just works. Period. No having to think, no debates, no long github posts. No complex sets of rules. Done. And it might wind up being faster than having the HWSYNCs in most/all of the primitives.
If you don't have that explicit HWSYNC, one has to really scrub the thread migration code path to either 1) be sure it implicitly winds up doing an HWSYNC for some reason (where it possibly only needed an LWSYNC), or 2) figure out whatever barriers you do have in the migration path, and then what subtle things (like I/O) you have to tell the programmer can't span a function call that might context switch. The second option can be really complex to figure out and leaves the programmers with a complex set of rules. If I understood correctly, one of the guiding principles of Go is to keep user programmers from having to think about this stuff?

So I can't say there is no other way to achieve the thread migration than by an explicit HWSYNC in the path, but I think the other options either over-penalize the barriers in the primitives or require a complex set of "can't do's" for the programmer that can be hard to figure out.

Derek
Go provides no way to write to an I/O register except by executing assembly code. I'm perfectly willing to say that any assembly code that writes to an I/O register is responsible for executing …. The …. We do currently use ….
Ok. That is a legitimate choice. To not be remiss, then, I should point out that unless there's an incidental HWSYNC (say, in the to-C or from-C paths), the programmer putting in HWSYNCs may not be enough:

```
ST to I/O data register
<S/W thread migrates to another H/W thread>
HWSYNC
ST to I/O control register
```

will also fail. The HWSYNC the programmer inserted executes on the thread we migrated to, and that does not push out the CI store on the thread we migrated from. In the absence of proper run-time support, the programmer inserting even correct barriers may not be enough.

So, at this point I'll assume the issues around barriers in the thread migration path are either avoided or handled by whatever barriers are there or in the C path. There are a few other hygiene clean-up things that are generally necessary when migrating threads in Power. I will list them below and leave it to you folks to decide if they matter in the context of Go's thread scheduler. All of these operations are executed on the thread being departed. (I think this list is complete.)
If there are any questions, I'm happy to answer, but otherwise I'll consider this subject closed. If programmers do manage to do things that cause issues with the context switching, those bugs are extremely difficult to isolate and debug. They'll happen quite randomly and usually only under heavy load (yes, I've had the misfortune of doing this).

This is probably not entirely the correct place to ask, but who (if anyone) is responsible for deciding what the Go programming language memory model is? Are there any written documents, like the C++ or C or (shudder) Java memory models? To figure out if the atomic routines in Go can use more efficient Power barriers, I need to better understand what the language's memory model requirements are. Any help is appreciated.

Thanks, Derek
The Go memory model is at https://golang.org/ref/mem. You will likely find it unsatisfactory. See also #5045, #7948, and #9442.

In any case, the Go memory model is for user code. It does not describe the requirements on the Go runtime itself. That is, the Go memory model describes the sync/atomic package -- or, more precisely, it should describe that package but doesn't. The Go memory model does not describe the runtime/internal/atomic package, which is what the runtime uses.
I was out a few days last week, and Bishop is now out through the end of next week. Lots of good discussion here to help understand the situation. Here is my suggestion on how to proceed: Bishop has proposed changes to some of the atomic functions to use LWSYNC instead of HWSYNC to improve performance. Most of these are in runtime/internal/atomic. Once he gets back he can submit his change. Bishop has studied the Go scheduler more than I have, but I see Keith's point about the runqput functions doing locks when swapping goroutines and how that should meet the requirement of doing a sync and clearing the reservation.
As far as goroutine switching is concerned, the key point is what Keith said: the Go runtime really looks like any other multithreaded user program that Linux might run. The only way a goroutine stops execution on one thread and starts anew on another is when the goroutine itself is handed off to the other thread, using either a locked data structure (the global run queue) or an atomic compare-and-swap. Assuming those lower-level things are implemented correctly, there's nothing extra needed to make sure the memory is carried along.

It is also important that goroutines are cheap: goroutine switches happen at very high frequencies. It's important not to weigh them down to the level of thread context switches.

I'm unconcerned about assembly authors leaving the processor in an unfortunate state (stcx, tabort, cp_abort, etc). That's their problem, as discussed above.

I know that Derek, Bishop, and Lynn are working on sanity-checking the actual atomics; I spoke to them about that earlier this week and expect to continue to do so. Assuming those are OK and stay OK, goroutines should come along for the ride. Since Derek wrote above that he'd otherwise consider this subject closed,
I will close this.
Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (`go version`)?

go version devel +adb1e67 Tue Aug 30 08:08:37 2016 +0000 linux/ppc64le

What operating system and processor architecture are you using (`go env`)?

Ubuntu 16.04
We haven't hit a specific problem in this area yet, but there are concerns that some additional instructions should be included in the runtime.gosave routine when a goroutine's information is saved before it is swapped off a thread. We should be doing a sync and an instruction to clear any lwarx reservation that might be pending.