runtime: consider backing malloc structures with large pages #14264
Comments
I tried this experiment for the heap itself, and it didn't do much. The OS was transparently using big pages to back the heap (at least on Linux).
I tried running a 2GB heap hog for 15 minutes and, surprisingly, didn't get any huge pages anywhere (heap or metadata):
spans start at 0xc000000000 and the bitmap ends/arena begins at 0xc820000000, as usual. So, we don't have much in the way of spans, but we do have 53 MB of bitmap. I confirmed that CONFIG_TRANSPARENT_HUGEPAGE=y and CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y. Or maybe AnonHugePages doesn't mean what I think it means? Based on fs/proc/task_mmu.c it seems to be what you would think. It's also possible THP just isn't working on my laptop. I tried cranking up khugepaged to full throttle by setting /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs to 0 and nothing happened. The scan count even stayed at "53".
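For anyone wanting to reproduce this measurement, here is a minimal diagnostic sketch (not part of the runtime) that sums the per-mapping AnonHugePages fields from /proc/self/smaps, assuming the usual Linux smaps format:

```go
// Sketch: report how much of this process's anonymous memory is backed
// by transparent huge pages, by summing the AnonHugePages fields in
// /proc/self/smaps (lines look like "AnonHugePages:      2048 kB").
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	f, err := os.Open("/proc/self/smaps")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var totalKB int64
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 3 && fields[0] == "AnonHugePages:" {
			if kb, err := strconv.ParseInt(fields[1], 10, 64); err == nil {
				totalKB += kb
			}
		}
	}
	fmt.Printf("AnonHugePages: %d kB\n", totalKB)
}
```

Pointing the same logic at /proc/&lt;pid&gt;/smaps of a long-running heap hog shows whether THP ever promoted any of the heap or the metadata mappings.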
What does cat /sys/kernel/mm/transparent_hugepage/enabled say on your system?
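(Both that file and its sibling defrag list the supported modes, typically "always madvise never", with the currently active mode shown in square brackets; in "madvise" mode THP is only applied to ranges an application has explicitly flagged with madvise(MADV_HUGEPAGE).)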
Likewise, /sys/kernel/mm/transparent_hugepage/defrag says
Curiously, /sys/kernel/mm/transparent_hugepage/khugepaged/full_scans has gone up from 53 to 60 in the day and a half since I did that experiment, so something is happening. I'll dig in a bit more, but I'm thinking that it's probably worth our while to madvise these regions even if THP will theoretically eventually back them with large pages.
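For illustration, a minimal sketch of that kind of hint, assuming Linux and using golang.org/x/sys/unix rather than the runtime's internal syscall wrappers:

```go
// Sketch: map an anonymous region (standing in for, say, the heap bitmap
// reservation) and hint that the kernel should back it with transparent
// huge pages. The runtime would do this through its own mmap/madvise
// plumbing; the region size here is arbitrary.
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	const size = 64 << 20 // 64MB, for illustration
	mem, err := unix.Mmap(-1, 0, size,
		unix.PROT_READ|unix.PROT_WRITE,
		unix.MAP_PRIVATE|unix.MAP_ANON)
	if err != nil {
		panic(err)
	}
	// MADV_HUGEPAGE asks THP to prefer 2MB pages for this range, even
	// when /sys/kernel/mm/transparent_hugepage/enabled is set to
	// "madvise" rather than "always".
	if err := unix.Madvise(mem, unix.MADV_HUGEPAGE); err != nil {
		fmt.Println("madvise(MADV_HUGEPAGE) failed:", err)
	}
	// ... touch mem and check AnonHugePages again ...
}
```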
/cc @aclements Are we going to try to tackle this for 1.8 or should we push it to 1.8Maybe or 1.9?
This is pretty low priority, so Go1.8Maybe is fine. However, I did some quick experiments tonight and learned some interesting things:
This suggests we should be mapping in larger chunks. Possibly we should do this in proportion to how much is already mapped, though _HeapAllocChunk is already 1MB. It's also possible that the reduced TLB pressure from huge pages was the actual cause of the performance boost observed with a larger _HeapAllocChunk in #16866.
A little more digging. The reason we generally don't see many huge pages backing the Go heap is that we grow it by 1MB at a time (unless the growing allocation needs more) and fault the new space almost immediately. The fault handler sees that the VMA doesn't span the full huge page frame, so it has no choice but to back it with a normal page. When we later grow the heap mapping by another 1MB, since there are already normal pages in the huge page frame from the earlier fault, the page fault handler keeps backing the area with normal pages. I'm not sure why the background scrubber didn't clean this up after some time, but it's better if we don't force it to copy all of that memory around anyway.

I spent a while digging into why only half of the heap was backed by huge pages, only to discover that I could no longer reproduce that behavior once I had read the THP code. It must have noticed I was poking around.

We should probably just grow the heap in 2MB (and 2MB-aligned) chunks. We could do the same for the heap bitmap and the spans. This means very small Go binaries will use between 6MB and 8MB. That may be okay, especially since, as @quentinmit pointed out, environments with very little RAM are likely to have THP disabled (in fact, the kernel disables THP by default if there's less than 512MB of physical memory).
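A rough sketch of the rounding that implies; the names here are made up and the real change would live in the runtime's heap-growth path, but the idea is just to request whole, 2MB-aligned huge page frames from the OS:

```go
// Sketch of 2MB-granularity heap growth; hypothetical names, not the
// actual runtime code.
package sketch

const hugePageSize = 2 << 20 // 2MB huge pages on x86-64

// roundUp rounds n up to a multiple of align, which must be a power of two.
func roundUp(n, align uintptr) uintptr {
	return (n + align - 1) &^ (align - 1)
}

// growSize converts a requested heap growth into the size we would
// actually map: at least the request, and always a whole number of huge
// pages. If the mapping's base address is also kept 2MB-aligned, every
// chunk we map covers complete huge page frames, so the first fault in
// each frame can be served with a 2MB page instead of falling back to
// 4KB pages.
func growSize(need uintptr) uintptr {
	return roundUp(need, hugePageSize)
}
```

With this, a 1MB grow request maps 2MB and a 3MB request maps 4MB, so the kernel never sees a VMA that covers only part of a huge page frame.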
We could always grow <2MB for the first few grows and 2MB after that.
If we wait to switch to huge pages (and assuming the THP scrubber doesn't fix things), the first 4 MB of the heap will already be enough to blow out the 1024-entry L2 TLB on a Haswell. What I find compelling about backing things with huge pages from the start is that it lets us fit the entire address space of a 2 GB heap in the TLB.

I certainly think we should back the heap arena with huge pages from the start, since we already assume it's going to grow to at least 4MB. I'm less convinced about the bitmap and spans regions, since those don't grow to a whole huge page until the heap itself grows to 64MB and 2GB, respectively.

I wish there were a way we could ask the system to defrag a specific page [1]. Then we could switch already-mapped pages to huge pages once we see the heap growing. But I poked around and didn't find one.

[1] Strictly speaking, we can do it ourselves by
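For concreteness, the arithmetic behind those two figures: 4 MB of heap on 4 KB pages is 4 MB / 4 KB = 1024 translations, which exactly fills a 1024-entry L2 TLB, while a 2 GB heap on 2 MB pages is 2 GB / 2 MB = 1024 translations, so the whole heap fits in a TLB of the same size once it is backed by huge pages.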
I compared huge pages to tip using the x/benchmarks garbage benchmark on both Skylake and Sandy Bridge. This is with the arena, the bitmap, and the spans array backed by huge pages. I haven't tried a more dynamic approach yet.

Also, I dug into hardware counters after running these and found that fixalloc (which is allocating mspans) and the mark bitmap allocator are still using 4K pages, which means this reduced TLB misses by only 4X and most remaining TLB misses are in the garbage collector. I'll try to fix that and rerun the benchmarks. Nevertheless, this shows some pretty good speedup, especially on Skylake, which has good huge page TLB support. Sandy Bridge's huge page TLB support is surprisingly bad, but it gets some speedup, too.

Skylake (64 4KB + 32 2MB entry L1 TLB, 1536-entry unified L2 TLB):
Sandy Bridge (64 4KB + 32 2MB entry L1 TLB, 512 4KB entry L2 TLB, no 2MB L2):
For the record, here's how I checked the hardware counters:

Totals:

Profile:
I have a further tweaked version that makes fixalloc and the mark bitmap allocator also use huge pages. For tip:
With 2M pages:
I started seeing this behavior again. It turns out that I just didn't have very many free physical huge pages, so THP was falling back to regular pages. One reason for this is actually the OS buffer cache, which can fragment memory to quite an extent even if applications aren't consuming much of it.
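(One way to check for that kind of fragmentation, assuming Linux, is /proc/buddyinfo: it lists the number of free blocks of each order per zone, and if the order-9 column, i.e. blocks of 512 contiguous 4 KB pages making up one 2 MB frame, is near zero, THP cannot hand out a huge page without compacting first.)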
Backing persistentalloc with large pages (which covers fixalloc) and changing the mark bits allocator to use persistentalloc (which makes sense in general) turns out to be a bad idea:

Skylake:
Sandy Bridge:
I'm not quite sure why. However, this does increase persistentalloc'ed memory by 50%, so it could be having a bad effect on key runtime-internal structures.
I tried one more experiment where I made
@aclements, is this a bump to Go 1.9?
@aclements, shall we close this?
Results were a mixed bag and this increased complexity, so closing. May be worth revisiting on future hardware revisions.
One of the limiting factors for GC performance is TLB misses. A simple change that may improve this is to back the runtime's internal malloc structures, such as the heap bitmap and the span array, with large pages.
/cc @RLH