
runtime: consider backing malloc structures with large pages #14264

Closed
aclements opened this issue Feb 8, 2016 · 18 comments
Labels
early-in-cycle A change that should be done early in the 3 month dev cycle. FrozenDueToAge NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made.
Comments

@aclements
Member

One of the limiting factors for GC performance is TLB misses. A simple change that may improve this is to back the runtime's internal structures, like the heap bitmap and span array, with large pages.

/cc @RLH

@aclements aclements added this to the Go1.7Early milestone Feb 8, 2016
@randall77
Contributor

I tried this experiment for the heap itself, and it didn't do much. The OS was transparently using big pages to back the heap (at least on Linux).
Is that not happening for the heap bitmap & span array? Maybe something as simple as aligning the starting point would do it.

@aclements
Member Author

I tried running a 2GB heap hog for 15 minutes and, surprisingly, didn't get any huge pages anywhere (heap or metadata):

c000000000-c0001b0000 rw-p 00000000 00:00 0 
Size:               1728 kB
Rss:                1728 kB
Pss:                1728 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:      1728 kB
Referenced:         1728 kB
Anonymous:          1728 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd wr mr mw me ac sd 
c81ca18000-c85ec00000 rw-p 00000000 00:00 0                              [stack:12747]
Size:            1083296 kB
Rss:             1083100 kB
Pss:             1083100 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:   1083100 kB
Referenced:      1083100 kB
Anonymous:       1083100 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd wr mr mw me ac sd 
c85ec00000-c85ee00000 rw-p 00000000 00:00 0 
Size:               2048 kB
Rss:                2048 kB
Pss:                2048 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:      2048 kB
Referenced:         2048 kB
Anonymous:          2048 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd wr mr mw me ac sd nh 
c85ee00000-c88bd00000 rw-p 00000000 00:00 0 
Size:             736256 kB
Rss:              736256 kB
Pss:              736256 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:    736256 kB
Referenced:       736256 kB
Anonymous:        736256 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd wr mr mw me ac sd 

spans start at 0xc000000000 and the bitmap ends/arena begins at 0xc820000000, as usual. So, we don't have much in the way of spans, but we do have 53 MB of bitmap.

I confirmed that CONFIG_TRANSPARENT_HUGEPAGE=y and CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y.

Or maybe AnonHugePages doesn't mean what I think it means? Based on fs/proc/task_mmu.c it seems to be what you would think.

It's also possible THP just isn't working on my laptop. I tried cranking up khugepaged to full throttle by setting /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs to 0 and nothing happened. The scan count even stayed at "53".
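The AnonHugePages totals above can be summed mechanically rather than read by eye. A minimal sketch in Go (the parsing follows the smaps format shown above; pointing it at /proc/&lt;pid&gt;/smaps on Linux works the same way):

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sumAnonHugePages totals the AnonHugePages fields (in kB) from
// smaps-formatted text such as /proc/<pid>/smaps.
func sumAnonHugePages(smaps string) int64 {
	var total int64
	sc := bufio.NewScanner(strings.NewReader(smaps))
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "AnonHugePages:") {
			continue
		}
		// Line looks like: "AnonHugePages:      2048 kB"
		fields := strings.Fields(line)
		if len(fields) >= 2 {
			if kb, err := strconv.ParseInt(fields[1], 10, 64); err == nil {
				total += kb
			}
		}
	}
	return total
}

func main() {
	sample := "AnonHugePages:      2048 kB\nSwap:            0 kB\nAnonHugePages:         0 kB\n"
	fmt.Println(sumAnonHugePages(sample)) // prints 2048
}
```

A total of 0 across all heap mappings, as in the dump above, is what indicates THP isn't kicking in.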

@mwhudson
Contributor

What does cat /sys/kernel/mm/transparent_hugepage/enabled say on your system?

@aclements
Member Author

What does cat /sys/kernel/mm/transparent_hugepage/enabled say on your system?

[always] madvise never

Likewise, /sys/kernel/mm/transparent_hugepage/defrag says

[always] madvise never

Curiously, /sys/kernel/mm/transparent_hugepage/khugepaged/full_scans has gone up from 53 to 60 in the day and a half since I did that experiment, so something is happening. I'll dig in a bit more, but I'm thinking that it's probably worth our while to madvise these regions even if THP will theoretically eventually back them with large pages.

@bradfitz bradfitz modified the milestones: Go1.8Early, Go1.7Early May 5, 2016
@quentinmit
Contributor

/cc @aclements Are we going to try to tackle this for 1.8 or should we push it to 1.8Maybe or 1.9?

@quentinmit quentinmit added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Sep 28, 2016
@aclements
Member Author

This is pretty low priority, so Go1.8Maybe is fine.

However, I did some quick experiments tonight and learned some interesting things:

  1. mmap with MAP_HUGETLB is not an option: it 1) can only draw on the "persistent" huge page pool, which is usually unconfigured and empty, and 2) can only be used by processes in the GID configured in /proc/sys/vm/hugetlb_shm_group, which is only root by default.
  2. Given this, you might expect madvise(MADV_HUGEPAGE) to be useful, but it doesn't change anything. If a range hasn't been explicitly advised one way or the other, it takes the system default, which is generally to enable THP anyway.
  3. What does matter is how much we map at a time. I increased _HeapAllocChunk and bitmapChunk to 2 MB and the AnonHugePages count in smaps shot up to 1GB out of a 2GB heap (I'm not sure why it didn't go higher; maybe I ran out of huge pages). If we map in smaller chunks, I never see AnonHugePages go above 0, even if I let the process sit for several THP scrubbing cycles.

This suggests we should be mapping in larger chunks. Possibly we should do this in proportion to how much is already mapped, though _HeapAllocChunk is already 1MB.

It's also possible that the reduced TLB pressure from huge pages was the actual cause of the performance boost observed with a larger _HeapAllocChunk in #16866.
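The chunk-size change in item 3 amounts to rounding mapping requests up to a multiple of the huge page size. A minimal sketch (alignUp mirrors the kind of power-of-two rounding helper the runtime uses; the 2MB constant assumes x86-64 huge pages):

```go
package main

import "fmt"

const hugePageSize = 2 << 20 // 2MB x86-64 huge page

// alignUp rounds n up to a multiple of align, which must be a power of 2.
func alignUp(n, align uintptr) uintptr {
	return (n + align - 1) &^ (align - 1)
}

func main() {
	// Growing a region 1MB at a time leaves every 2MB frame only
	// partially mapped, so the fault handler falls back to 4K pages.
	// Rounding each request up to 2MB lets the kernel back whole
	// frames with huge pages.
	for _, req := range []uintptr{1 << 20, 3 << 20, 53 << 20} {
		fmt.Printf("request %5d KB -> map %5d KB\n",
			req>>10, alignUp(req, hugePageSize)>>10)
	}
}
```

The same rounding would apply to _HeapAllocChunk and bitmapChunk as described above.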

@aclements aclements modified the milestones: Go1.8Maybe, Go1.8Early Sep 29, 2016
@aclements
Member Author

A little more digging.

The reason we generally don't see many huge pages backing the Go heap is that we grow it by 1MB at a time (unless the growing allocation needs more) and fault the new space almost immediately. The fault handler sees that the VMA doesn't span the full huge page frame, so it has no choice but to back it with a normal page. When we later grow the heap mapping by another 1MB, since there are already normal pages in the huge page frame from the earlier fault, the page fault handler keeps backing the area with normal pages. I'm not sure why the background scrubber didn't clean this up after some time, but it's better if we don't force it to copy all of that memory around anyway.

I spent a while digging into why only half of the heap was backed by huge pages only to discover that I could no longer reproduce that behavior once I had read the THP code. It must have noticed I was poking around.

We should probably just grow the heap in 2MB (and 2MB aligned) chunks. We could do the same for the heap bitmap and the spans. This means very small Go binaries will use between 6MB and 8MB. That may be okay, especially since, as @quentinmit pointed out, environments with very little RAM are likely to have THP disabled (in fact, the kernel disables THP by default if there's less than 512MB of physical memory).

@randall77 randall77 reopened this Sep 29, 2016
@randall77
Contributor

We could always grow <2MB for the first few grows and 2MB after that.

@aclements
Member Author

We could always grow <2MB for the first few grows and 2MB after that.

If we wait to switch to huge pages (and assuming the THP scrubber doesn't fix things), the first 4 MB of the heap will already be enough to blow out the 1024 entry L2 TLB on a Haswell. What I find compelling about backing things with huge pages from the start is that it lets us fit the entire address space of a 2 GB heap in the TLB.
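The TLB-reach arithmetic behind this is straightforward (a back-of-the-envelope check, using the 1024-entry L2 TLB and the 4KB/2MB page sizes mentioned above):

```go
package main

import "fmt"

func main() {
	const (
		kb = 1 << 10
		mb = 1 << 20
		gb = 1 << 30
	)
	// With 4KB pages, a 1024-entry TLB covers only 4MB of address
	// space, so a 4MB heap already blows it out.
	fmt.Println("4KB-page TLB reach:", 1024*4*kb/mb, "MB")
	// With 2MB pages, the same 1024 entries cover a 2GB heap exactly.
	fmt.Println("2MB pages needed for 2GB heap:", 2*gb/(2*mb))
}
```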

I certainly think we should back the heap arena with huge pages from the start, since we already assume it's going to grow to at least 4MB. I'm less convinced about the bitmap and spans regions, since those don't grow to a whole huge page until the heap itself grows to 64MB and 2GB, respectively.

I wish there were a way to ask the system to defrag a specific page [1]. Then we could switch already-mapped pages over to huge pages once we see the heap growing. But I poked around and didn't find one.

[1] Strictly speaking, we can do it ourselves by mremap-ing the existing pages to some temporary location, re-mmap-ing the region we want to be huge pages, and copying the data over, but that doesn't seem worth the complexity.

@aclements
Member Author

I compared huge pages to tip using the x/benchmarks garbage benchmark on both Skylake and Sandy Bridge. This is with the arena, the bitmap, and the spans array backed by huge pages. I haven't tried a more dynamic approach yet. Also, I dug into hardware counters after running these and found that fixalloc (which is allocating mspans) and the mark bitmap allocator are still using 4K pages, which means this reduced TLB misses by only 4X and most remaining TLB misses are in the garbage collector. I'll try to fix that and rerun the benchmarks.

Nevertheless, this shows some pretty good speedup, especially on Skylake, which has good huge page TLB support. Sandy Bridge's huge page TLB support is surprisingly bad, but it gets some speedup, too.

Skylake (64 4KB + 32 2MB entry L1 TLB, 1536 entry unified L2 TLB):

name           old time/op  new time/op  delta
XGarbage1GB-4  7.13ms ± 2%  7.00ms ± 2%  -1.73%  (p=0.000 n=20+19)
XGarbage64M-4  6.79ms ± 3%  6.71ms ± 3%  -1.26%  (p=0.000 n=16+16)

Sandy Bridge (64 4KB + 32 2MB entry L1 TLB, 512 4KB entry L2 TLB, no 2MB L2):

name            old time/op  new time/op  delta
XGarbage1GB-12  2.49ms ± 0%  2.48ms ± 1%  -0.67%  (p=0.000 n=17+17)
XGarbage64M-12  2.27ms ± 0%  2.26ms ± 1%  -0.61%  (p=0.000 n=17+19)

@aclements
Member Author

For the record, here's how I checked the hardware counters:

Totals: ocperf.py stat -e frontend_retired.stlb_miss,dtlb_load_misses.walk_completed,dtlb_load_misses.walk_completed_2m_4m,dtlb_load_misses.walk_completed_4k ./bench -bench garbage -benchnum 1 -benchmem 1024

Profile: perf record -e cpu/event=0xc6,umask=0x1,frontend=0x15,name=frontend_retired_stlb_miss,period=1007/,cpu/event=0x8,umask=0xe,name=dtlb_load_misses_walk_completed,period=1003/,cpu/event=0x8,umask=0x4,name=dtlb_load_misses_walk_completed_2m_4m,period=20003/,cpu/event=0x8,umask=0x2,name=dtlb_load_misses_walk_completed_4k,period=20003/ ./bench.e7ffc08 -bench garbage -benchnum 1 -benchmem 1024 (which are the events produced by ocperf.py, but with period / 100)

@aclements
Member Author

I have a further tweaked version that makes fixalloc and the mark bitmap allocator also use huge pages. For bench -bench garbage -benchnum 1 -benchmem 1024, this reduces the number of TLB misses by a factor of 17. The remaining 4k misses are mostly in the data segment mapped from the binary.

tip:

        21,335,534      dtlb_load_misses_walk_completed
            71,860      dtlb_load_misses_walk_completed_2m_4m
        21,263,733      dtlb_load_misses_walk_completed_4k

With 2M pages:

         1,259,586      dtlb_load_misses_walk_completed
           570,739      dtlb_load_misses_walk_completed_2m_4m
           688,849      dtlb_load_misses_walk_completed_4k
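The factor of 17 follows directly from the walk_completed totals above:

```go
package main

import "fmt"

func main() {
	tip := 21_335_534.0 // dtlb_load_misses_walk_completed at tip
	huge := 1_259_586.0 // same counter with 2M pages
	fmt.Printf("reduction: %.1fx\n", tip/huge) // prints "reduction: 16.9x"
}
```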

@aclements
Member Author

I spent a while digging into why only half of the heap was backed by huge pages only to discover that I could no longer reproduce that behavior once I had read the THP code.

I started seeing this behavior again. It turns out that I just didn't have very many free physical huge pages, so THP was falling back to regular pages. One reason for this is actually the OS buffer cache, which can fragment memory to quite an extent even if applications aren't consuming much of it.

@aclements
Member Author

Backing persistentalloc with large pages (which covers fixalloc) and changing the mark bits allocator to use persistentalloc (which makes sense in general) turns out to be a bad idea:

Skylake:

name           old time/op  new time/op  delta                               
XGarbage1GB-4  7.13ms ± 2%  7.15ms ± 2%   ~     (p=0.355 n=20+20)
XGarbage64M-4  6.79ms ± 3%  6.77ms ± 1%   ~     (p=0.444 n=16+17)

Sandy Bridge:

name            old time/op  new time/op  delta
XGarbage1GB-12  2.49ms ± 0%  2.55ms ± 2%  +2.33%  (p=0.000 n=17+20)
XGarbage64M-12  2.27ms ± 0%  2.33ms ± 3%  +2.49%  (p=0.000 n=17+19)

I'm not quite sure why. However, this does increase persistentalloc'ed memory by 50%, so it could be having a bad effect on key runtime-internal structures.

@aclements
Member Author

I tried one more experiment where I made persistentalloc strictly linear so there was no internal fragmentation and at most 2MB of unused persistent memory at a time (CL 30113). The results were essentially the same as the old persistentalloc: no speedup on Skylake and some slowdown on Sandy Bridge.

@rsc
Contributor

rsc commented Oct 21, 2016

@aclements, is this a bump to Go 1.9?

@aclements aclements modified the milestones: Go1.9, Go1.8Maybe Oct 21, 2016
@aclements aclements modified the milestones: Go1.10Early, Go1.9 Jun 7, 2017
@bradfitz bradfitz added early-in-cycle A change that should be done early in the 3 month dev cycle. and removed early-in-cycle A change that should be done early in the 3 month dev cycle. labels Jun 14, 2017
@bradfitz bradfitz modified the milestones: Go1.10Early, Go1.10 Jun 14, 2017
@rsc rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017
@bradfitz bradfitz modified the milestones: Go1.11, Go1.12 May 18, 2018
@ianlancetaylor ianlancetaylor added this to the Go1.12 milestone Jun 1, 2018
@rsc
Contributor

rsc commented Jun 11, 2018

@aclements, shall we close this?

@aclements
Member Author

aclements commented Jun 12, 2018

Results were a mixed bag and this increased complexity, so closing. May be worth revisiting on future hardware revisions.

@golang golang locked and limited conversation to collaborators Jun 12, 2019