runtime: hot vars and cache lines #14980
Comments
Having a few hot reads scattered about cache lines instead of all in one cache line shouldn't matter much. It will only take a tiny bit more work and cache space to cache them. Keeping hot writes away from each other (and from other hot reads) will matter much more. Is writeBarrier really written that much? Just twice per GC cycle, as far as I can tell.
I don't recall seeing serious contention on any globals when I ran https://godoc.org/github.com/aclements/go-perf/cmd/memlat, but that was quite a while ago and I wasn't necessarily looking. It would be easy enough to run that again. Particularly if it's run on a multi-node system, any globals with poor cacheability or false sharing should stick out as expensive remote DRAM events. It would also be easy enough to crank up the PEBS recording rate and just write a simple tool to look for hot globals. Sort of like https://godoc.org/github.com/aclements/go-perf/cmd/memanim, but obviously looking for different things in the memory trace. With memanim, I found the hardware could easily record every single load over 50 cycles. If we do find any, the cheap solution is to add padding variables around them. We already do this in a few places (grep for CacheLineSize), but I think those are all based on assumptions about hot cache lines and aren't backed up by measurements.
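For readers unfamiliar with the padding trick mentioned above, here is a minimal sketch in ordinary Go rather than runtime code. The `cacheLineSize` constant of 64 bytes, the `paddedCounter` type, and the `counters` array are all assumptions made for illustration; the runtime uses its own constant and its own variables.

```go
package main

import (
	"fmt"
	"unsafe"
)

// cacheLineSize is an assumption for this sketch: 64 bytes is common on
// amd64, but the real value depends on the target CPU.
const cacheLineSize = 64

// paddedCounter is a hypothetical hot, frequently written variable.
// The blank array rounds the struct up to one assumed cache line, so
// adjacent array elements never share a line with each other.
type paddedCounter struct {
	n uint64
	_ [cacheLineSize - unsafe.Sizeof(uint64(0))]byte
}

// counters stands in for a set of per-thread hot variables.
var counters [4]paddedCounter

// paddedSize reports the size of one padded element in bytes.
func paddedSize() uintptr { return unsafe.Sizeof(paddedCounter{}) }

// stride reports the distance in bytes between neighboring elements.
func stride() uintptr {
	a := uintptr(unsafe.Pointer(&counters[0]))
	b := uintptr(unsafe.Pointer(&counters[1]))
	return b - a
}

func main() {
	fmt.Println("element size:", paddedSize())
	fmt.Println("stride:", stride())
}
```

Padding only sets the spacing between elements. It does not by itself align the first element to a cache-line boundary, which is one more reason to confirm any layout change with measurements.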
Thanks, Keith and Austin. I don't have a Linux machine lying around now, but I should soon(ish), and I will play with this then.
Frequent write sharing can be very expensive and prevent scaling on higher core counts. We need to get rid of each and every case.
Naive question. The runtime has a bunch of top-level vars, some of which are fairly hot, e.g. the writeBarrier struct (checked before every write barrier call), the debug struct (checked during every malloc for e.g. allocfreetrace), and the trace struct (to know whether tracing is enabled). Some are written a lot (writeBarrier), whereas others are read-mostly (debug, trace).
They are organized for readability and thus end up potentially scattered around the final binary. However, I wonder whether it would be better to ensure that all the hottest read-mostly variables are in a single cache line and ensure that the hottest read-write variables don't trigger false sharing.
Many of these aren't easy to move around and experiment with, because of compiler integration. So first: Any instincts about whether this is likely to matter in practice?
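To make the layout idea concrete, here is a rough sketch in ordinary Go, not runtime code. The `hotFlags` fields are hypothetical stand-ins for the read-mostly state mentioned above, `cacheLineSize` is an assumed 64 bytes, and `isolatedFlags` is an invented wrapper that keeps other variables off the same lines.

```go
package main

import (
	"fmt"
	"unsafe"
)

// cacheLineSize is an assumption for this sketch (64 bytes on many CPUs).
const cacheLineSize = 64

// hotFlags groups hypothetical read-mostly flags so they sit next to
// each other. The field names are illustrative, not the runtime's.
type hotFlags struct {
	writeBarrierEnabled bool
	traceEnabled        bool
	allocFreeTrace      int32
}

// isolatedFlags surrounds the group with a full line of padding on each
// side, so no frequently written variable can share a line with it.
type isolatedFlags struct {
	_     [cacheLineSize]byte
	flags hotFlags
	_     [cacheLineSize]byte
}

var hot isolatedFlags

// flagsSize reports the size of the grouped flags in bytes.
func flagsSize() uintptr { return unsafe.Sizeof(hot.flags) }

// linesTouched reports how many assumed cache lines the flags span at
// their actual address. Go does not promise cache-line alignment for
// globals, so a small struct can still straddle two lines.
func linesTouched() uintptr {
	start := uintptr(unsafe.Pointer(&hot.flags))
	end := start + unsafe.Sizeof(hot.flags) - 1
	return end/cacheLineSize - start/cacheLineSize + 1
}

func main() {
	fmt.Println("flags size:", flagsSize())
	fmt.Println("cache lines touched:", linesTouched())
}
```

The padding on both sides guards against false sharing whatever the alignment turns out to be. Whether packing the reads together helps at all is exactly the open question here.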
cc @dvyukov @aclements