
runtime: consider backing malloc structures with large pages #14264

Closed
aclements opened this issue Feb 8, 2016 · 18 comments
Labels
early-in-cycle A change that should be done early in the 3 month dev cycle. FrozenDueToAge NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made.
Comments

@aclements
Member

One of the limiting factors for GC performance is TLB misses. A simple change that may improve this is to back the runtime's internal structures, like the heap bitmap and span array, with large pages.

/cc @RLH

@aclements aclements added this to the Go1.7Early milestone Feb 8, 2016
@randall77
Contributor

I tried this experiment for the heap itself, and it didn't do much. The OS was transparently using big pages to back the heap (at least on Linux).
Is that not happening for the heap bitmap & span array? Maybe something as simple as aligning the starting point would do it.

@aclements
Member Author

I tried running a 2GB heap hog for 15 minutes and, surprisingly, didn't get any huge pages anywhere (heap or metadata):

c000000000-c0001b0000 rw-p 00000000 00:00 0 
Size:               1728 kB
Rss:                1728 kB
Pss:                1728 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:      1728 kB
Referenced:         1728 kB
Anonymous:          1728 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd wr mr mw me ac sd 
c81ca18000-c85ec00000 rw-p 00000000 00:00 0                              [stack:12747]
Size:            1083296 kB
Rss:             1083100 kB
Pss:             1083100 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:   1083100 kB
Referenced:      1083100 kB
Anonymous:       1083100 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd wr mr mw me ac sd 
c85ec00000-c85ee00000 rw-p 00000000 00:00 0 
Size:               2048 kB
Rss:                2048 kB
Pss:                2048 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:      2048 kB
Referenced:         2048 kB
Anonymous:          2048 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd wr mr mw me ac sd nh 
c85ee00000-c88bd00000 rw-p 00000000 00:00 0 
Size:             736256 kB
Rss:              736256 kB
Pss:              736256 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:    736256 kB
Referenced:       736256 kB
Anonymous:        736256 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd wr mr mw me ac sd 

spans start at 0xc000000000 and the bitmap ends/arena begins at 0xc820000000, as usual. So, we don't have much in the way of spans, but we do have 53 MB of bitmap.

I confirmed that CONFIG_TRANSPARENT_HUGEPAGE=y and CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y.

Or maybe AnonHugePages doesn't mean what I think it means? Based on fs/proc/task_mmu.c it seems to be what you would think.

It's also possible THP just isn't working on my laptop. I tried cranking up khugepaged to full throttle by setting /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs to 0 and nothing happened. The scan count even stayed at "53".
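The AnonHugePages totals above can be summed mechanically rather than read by eye. A minimal sketch in Go (the parsing follows the smaps format shown above; pointing it at /proc/&lt;pid&gt;/smaps on Linux works the same way):

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sumAnonHugePages totals the AnonHugePages fields (in kB) from
// smaps-formatted text such as /proc/<pid>/smaps.
func sumAnonHugePages(smaps string) int64 {
	var total int64
	sc := bufio.NewScanner(strings.NewReader(smaps))
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "AnonHugePages:") {
			continue
		}
		// Line looks like: "AnonHugePages:      2048 kB"
		fields := strings.Fields(line)
		if len(fields) >= 2 {
			if kb, err := strconv.ParseInt(fields[1], 10, 64); err == nil {
				total += kb
			}
		}
	}
	return total
}

func main() {
	sample := "AnonHugePages:      2048 kB\nSwap:            0 kB\nAnonHugePages:         0 kB\n"
	fmt.Println(sumAnonHugePages(sample)) // prints 2048
}
```

A total of 0 across all heap mappings, as in the dump above, is what indicates THP isn't kicking in.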

@mwhudson
Contributor

What does cat /sys/kernel/mm/transparent_hugepage/enabled say on your system?

@aclements
Member Author

What does cat /sys/kernel/mm/transparent_hugepage/enabled say on your system?

[always] madvise never

Likewise, /sys/kernel/mm/transparent_hugepage/defrag says

[always] madvise never

Curiously, /sys/kernel/mm/transparent_hugepage/khugepaged/full_scans has gone up from 53 to 60 in the day and a half since I did that experiment, so something is happening. I'll dig in a bit more, but I'm thinking that it's probably worth our while to madvise these regions even if THP will theoretically eventually back them with large pages.

@bradfitz bradfitz modified the milestones: Go1.8Early, Go1.7Early May 5, 2016
@quentinmit
Contributor

/cc @aclements Are we going to try to tackle this for 1.8 or should we push it to 1.8Maybe or 1.9?

@quentinmit quentinmit added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Sep 28, 2016
@aclements
Member Author

This is pretty low priority, so Go1.8Maybe is fine.

However, I did some quick experiments tonight and learned some interesting things:

  1. mmap with MAP_HUGETLB is not an option: it 1) can only draw on the "persistent" huge page pool, which is usually unconfigured and empty, and 2) can only be used by processes in the GID configured in /proc/sys/vm/hugetlb_shm_group, which is only root by default.
  2. Given this, you might expect madvise(MADV_HUGEPAGE) to be useful, but it doesn't change anything. If a range hasn't been explicitly advised one way or the other, it takes the system default, which is generally to enable THP anyway.
  3. What does matter is how much we map at a time. I increased _HeapAllocChunk and bitmapChunk to 2 MB and the AnonHugePages count in smaps shot up to 1GB out of a 2GB heap (I'm not sure why it didn't go higher; maybe I ran out of huge pages). If we map in smaller chunks, I never see AnonHugePages go above 0, even if I let the process sit for several THP scrubbing cycles.

This suggests we should be mapping in larger chunks. Possibly we should do this in proportion to how much is already mapped, though _HeapAllocChunk is already 1MB.

It's also possible that the reduced TLB pressure from huge pages was the actual cause of the performance boost observed with a larger _HeapAllocChunk in #16866.
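The chunk-size change in item 3 amounts to rounding mapping requests up to a multiple of the huge page size. A minimal sketch (alignUp mirrors the kind of power-of-two rounding helper the runtime uses; the 2MB constant assumes x86-64 huge pages):

```go
package main

import "fmt"

const hugePageSize = 2 << 20 // 2MB x86-64 huge page

// alignUp rounds n up to a multiple of align, which must be a power of 2.
func alignUp(n, align uintptr) uintptr {
	return (n + align - 1) &^ (align - 1)
}

func main() {
	// Growing a region 1MB at a time leaves every 2MB frame only
	// partially mapped, so the fault handler falls back to 4K pages.
	// Rounding each request up to 2MB lets the kernel back whole
	// frames with huge pages.
	for _, req := range []uintptr{1 << 20, 3 << 20, 53 << 20} {
		fmt.Printf("request %5d KB -> map %5d KB\n",
			req>>10, alignUp(req, hugePageSize)>>10)
	}
}
```

The same rounding would apply to _HeapAllocChunk and bitmapChunk as described above.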

@aclements aclements modified the milestones: Go1.8Maybe, Go1.8Early Sep 29, 2016
@aclements
Member Author

A little more digging.

The reason we generally don't see many huge pages backing the Go heap is that we grow it by 1MB at a time (unless the growing allocation needs more) and fault the new space almost immediately. The fault handler sees that the VMA doesn't span the full huge page frame, so it has no choice but to back it with a normal page. When we later grow the heap mapping by another 1MB, since there are already normal pages in the huge page frame from the earlier fault, the page fault handler keeps backing the area with normal pages. I'm not sure why the background scrubber didn't clean this up after some time, but it's better if we don't force it to copy all of that memory around anyway.

I spent a while digging into why only half of the heap was backed by huge pages only to discover that I could no longer reproduce that behavior once I had read the THP code. It must have noticed I was poking around.

We should probably just grow the heap in 2MB (and 2MB aligned) chunks. We could do the same for the heap bitmap and the spans. This means very small Go binaries will use between 6MB and 8MB. That may be okay, especially since, as @quentinmit pointed out, environments with very little RAM are likely to have THP disabled (in fact, the kernel disables THP by default if there's less than 512MB of physical memory).

@randall77 randall77 reopened this Sep 29, 2016
@randall77
Contributor

We could always grow <2MB for the first few grows and 2MB after that.

@aclements
Member Author

We could always grow <2MB for the first few grows and 2MB after that.

If we wait to switch to huge pages (and assuming the THP scrubber doesn't fix things), the first 4 MB of the heap will already be enough to blow out the 1024 entry L2 TLB on a Haswell. What I find compelling about backing things with huge pages from the start is that it lets us fit the entire address space of a 2 GB heap in the TLB.
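The TLB-reach arithmetic behind this is straightforward (a back-of-the-envelope check, using the 1024-entry L2 TLB and the 4KB/2MB page sizes mentioned above):

```go
package main

import "fmt"

func main() {
	const (
		kb = 1 << 10
		mb = 1 << 20
		gb = 1 << 30
	)
	// With 4KB pages, a 1024-entry TLB covers only 4MB of address
	// space, so a 4MB heap already blows it out.
	fmt.Println("4KB-page TLB reach:", 1024*4*kb/mb, "MB")
	// With 2MB pages, the same 1024 entries cover a 2GB heap exactly.
	fmt.Println("2MB pages needed for 2GB heap:", 2*gb/(2*mb))
}
```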

I certainly think we should back the heap arena with huge pages from the start, since we already assume it's going to grow to at least 4MB. I'm less convinced about the bitmap and spans regions, since those don't grow to a whole huge page until the heap itself grows to 64MB and 2GB, respectively.

I wish there were a way to ask the system to defrag a specific page [1]. Then we could switch already-mapped pages over to huge pages once we see the heap growing. But I poked around and didn't find one.

[1] Strictly speaking, we can do it ourselves by mremap-ing the existing pages to some temporary location, re-mmap-ing the region we want to be huge pages, and copying the data over, but that doesn't seem worth the complexity.

@aclements
Member Author

I compared huge pages to tip using the x/benchmarks garbage benchmark on both Skylake and Sandy Bridge. This is with the arena, the bitmap, and the spans array backed by huge pages. I haven't tried a more dynamic approach yet. Also, I dug into hardware counters after running these and found that fixalloc (which is allocating mspans) and the mark bitmap allocator are still using 4K pages, which means this reduced TLB misses by only 4X and most remaining TLB misses are in the garbage collector. I'll try to fix that and rerun the benchmarks.

Nevertheless, this shows some pretty good speedup, especially on Skylake, which has good huge page TLB support. Sandy Bridge's huge page TLB support is surprisingly bad, but it gets some speedup, too.

Skylake (64 4KB + 32 2MB entry L1 TLB, 1536 entry unified L2 TLB):

name           old time/op  new time/op  delta
XGarbage1GB-4  7.13ms ± 2%  7.00ms ± 2%  -1.73%  (p=0.000 n=20+19)
XGarbage64M-4  6.79ms ± 3%  6.71ms ± 3%  -1.26%  (p=0.000 n=16+16)

Sandy Bridge (64 4KB + 32 2MB entry L1 TLB, 512 4KB entry L2 TLB, no 2MB L2):

name            old time/op  new time/op  delta
XGarbage1GB-12  2.49ms ± 0%  2.48ms ± 1%  -0.67%  (p=0.000 n=17+17)
XGarbage64M-12  2.27ms ± 0%  2.26ms ± 1%  -0.61%  (p=0.000 n=17+19)

@aclements
Member Author

For the record, here's how I checked the hardware counters:

Totals: ocperf.py stat -e frontend_retired.stlb_miss,dtlb_load_misses.walk_completed,dtlb_load_misses.walk_completed_2m_4m,dtlb_load_misses.walk_completed_4k ./bench -bench garbage -benchnum 1 -benchmem 1024

Profile: perf record -e cpu/event=0xc6,umask=0x1,frontend=0x15,name=frontend_retired_stlb_miss,period=1007/,cpu/event=0x8,umask=0xe,name=dtlb_load_misses_walk_completed,period=1003/,cpu/event=0x8,umask=0x4,name=dtlb_load_misses_walk_completed_2m_4m,period=20003/,cpu/event=0x8,umask=0x2,name=dtlb_load_misses_walk_completed_4k,period=20003/ ./bench.e7ffc08 -bench garbage -benchnum 1 -benchmem 1024 (which are the events produced by ocperf.py, but with period / 100)

@aclements
Member Author

I have a further tweaked version that makes fixalloc and the mark bitmap allocator also use huge pages. For bench -bench garbage -benchnum 1 -benchmem 1024, this reduces the number of TLB misses by a factor of 17. The remaining 4k misses are mostly in the data segment mapped from the binary.

tip:

        21,335,534      dtlb_load_misses_walk_completed
            71,860      dtlb_load_misses_walk_completed_2m_4m
        21,263,733      dtlb_load_misses_walk_completed_4k

With 2M pages:

         1,259,586      dtlb_load_misses_walk_completed
           570,739      dtlb_load_misses_walk_completed_2m_4m
           688,849      dtlb_load_misses_walk_completed_4k
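The factor of 17 follows directly from the walk_completed totals above:

```go
package main

import "fmt"

func main() {
	tip := 21_335_534.0 // dtlb_load_misses_walk_completed at tip
	huge := 1_259_586.0 // same counter with 2M pages
	fmt.Printf("reduction: %.1fx\n", tip/huge) // prints "reduction: 16.9x"
}
```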

@aclements
Member Author

I spent a while digging into why only half of the heap was backed by huge pages only to discover that I could no longer reproduce that behavior once I had read the THP code.

I started seeing this behavior again. It turns out that I just didn't have very many free physical huge pages, so THP was falling back to regular pages. One reason for this is actually the OS buffer cache, which can fragment memory to quite an extent even if applications aren't consuming much of it.

@aclements
Member Author

Backing persistentalloc with large pages (which covers fixalloc) and changing the mark bits allocator to use persistentalloc (which makes sense in general) turns out to be a bad idea:

Skylake:

name           old time/op  new time/op  delta                               
XGarbage1GB-4  7.13ms ± 2%  7.15ms ± 2%   ~     (p=0.355 n=20+20)
XGarbage64M-4  6.79ms ± 3%  6.77ms ± 1%   ~     (p=0.444 n=16+17)

Sandy Bridge:

name            old time/op  new time/op  delta
XGarbage1GB-12  2.49ms ± 0%  2.55ms ± 2%  +2.33%  (p=0.000 n=17+20)
XGarbage64M-12  2.27ms ± 0%  2.33ms ± 3%  +2.49%  (p=0.000 n=17+19)

I'm not quite sure why. However, this does increase persistentalloc'ed memory by 50%, so it could be having a bad effect on key runtime-internal structures.

@aclements
Member Author

I tried one more experiment where I made persistentalloc strictly linear so there was no internal fragmentation and at most 2MB of unused persistent memory at a time (CL 30113). The results were essentially the same as the old persistentalloc: no speedup on Skylake and some slowdown on Sandy Bridge.

@rsc
Contributor

rsc commented Oct 21, 2016

@aclements, is this a bump to Go 1.9?

@aclements aclements modified the milestones: Go1.9, Go1.8Maybe Oct 21, 2016
@aclements aclements modified the milestones: Go1.10Early, Go1.9 Jun 7, 2017
@bradfitz bradfitz added early-in-cycle A change that should be done early in the 3 month dev cycle. and removed early-in-cycle A change that should be done early in the 3 month dev cycle. labels Jun 14, 2017
@bradfitz bradfitz modified the milestones: Go1.10Early, Go1.10 Jun 14, 2017
@rsc rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017
@bradfitz bradfitz modified the milestones: Go1.11, Go1.12 May 18, 2018
@ianlancetaylor ianlancetaylor added this to the Go1.12 milestone Jun 1, 2018
@rsc
Contributor

rsc commented Jun 11, 2018

@aclements, shall we close this?

@aclements
Member Author

aclements commented Jun 12, 2018

Results were a mixed bag and this increased complexity, so closing. May be worth revisiting on future hardware revisions.

@golang golang locked and limited conversation to collaborators Jun 12, 2019