Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: concurrent object placement on heap causes Linux kernel contention #33092

Closed
midom opened this issue Jul 12, 2019 · 13 comments
Closed
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@midom
Copy link

midom commented Jul 12, 2019

This applies to newer Go versions on any modern Linux kernel, but in this case:

go version go1.11.5 linux/amd64
Linux 4.16.18, SMP
48-thread 24-core 2-socket Haswell machine

If I allocate lots of small objects in parallel from many concurrent routines, they are all placed on the heap linearly, therefore each of the Go threads waiting for Linux pagefaults - which end up becoming more and more expensive because of spinlocking in that path.

In other cases of highly concurrent programming developers rely on allocators like jemalloc to do scalable placement of objects - that option does not exist with Go heap.

In some cases developers chose to have a background page that touches/cleans the pages before giving them to concurrent routines.

Having some spread in object placement would be a useful feature in the runtime.

A very basic program I have written to illustrate this in Go spends 50s in kernel while only 3s in userland, mostly from inflated costs in faulting:

2.75user 50.06system 0:01.19elapsed 4405%CPU (0avgtext+0avgdata 522224maxresident)k
0inputs+0outputs (0major+221566minor)pagefaults 0swaps

perf report shows that most of the cost from running my program is in page fault spinlocks:

-   91.23%     0.00%  tt       [kernel.vmlinux]                                [k] page_fault                                                                                                                                                                                                                                ▒
   - 91.22% page_fault                                                                                                                                                                                                                                                                                                       ▒
      - 91.11% __do_page_fault                                                                                                                                                                                                                                                                                               ▒
         - 90.96% handle_mm_fault                                                                                                                                                                                                                                                                                            ▒
            - 90.94% __handle_mm_fault                                                                                                                                                                                                                                                                                       ▒
               + 62.74% queued_spin_lock_slowpath                                                                                                                                                                                                                                                                            ▒
               - 27.99% do_wp_page                                                                                                                                                                                                                                                                                           ▒
                  - 27.98% wp_page_copy                                                                                                                                                                                                                                                                                      ▒
                     + 24.77% queued_spin_lock_slowpath                                                                                                                                                                                                                                                                      ▒
                     + 2.91% ptep_clear_flush

Source code for it:

package main
import (
	"strings"
	"sync"
)
func main() {
	var wg sync.WaitGroup
	defer wg.Wait()

	rp := strings.Repeat("a", 100)

	for n := 0; n < 100; n++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			a := make(map[int]map[string]string)
			for z := 0; z < 10000; z++ {
				a[z] = map[string]string{"x": rp + "a",}
			}
		}()
	}
}

Setting GOMAXPROCS=1 shows how little amount of work kernel has to do without contention:

# time ./tt

real	0m1.190s
user	0m3.552s
sys	0m48.070s
# time GOMAXPROCS=1  ./tt

real	0m0.801s
user	0m0.790s
sys	0m0.012s
@ianlancetaylor ianlancetaylor changed the title concurrent object placement on heap causes Linux kernel contention runtime: concurrent object placement on heap causes Linux kernel contention Jul 13, 2019
@ianlancetaylor
Copy link
Contributor

CC @aclements @mknyszek

@ianlancetaylor ianlancetaylor added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jul 13, 2019
@ianlancetaylor ianlancetaylor added this to the Go1.14 milestone Jul 13, 2019
@mknyszek
Copy link
Contributor

Maybe I don't understand what you mean by "spread in object placement", but for a long time Go has placed objects in separate pages depending on which P ("processor") is allocating, and it's able to do this without any locking. This is similar to TCMalloc's design. Allocating pages requires a lock, but small object allocation already scales fairly well.

Furthermore, I'm not actually able to reproduce the exact behavior you're seeing on your machine with your example program. I get:

thing $ GOMAXPROCS=1 /usr/bin/time ./thing
0.67user 0.00system 0:00.67elapsed 100%CPU (0avgtext+0avgdata 39944maxresident)k
0inputs+0outputs (0major+649minor)pagefaults 0swaps
thing $ /usr/bin/time ./thing
3.17user 8.90system 0:00.34elapsed 3532%CPU (0avgtext+0avgdata 476600maxresident)k
0inputs+0outputs (0major+18703minor)pagefaults 0swaps

It's true there are many more page faults, and significantly more time is spent in the kernel, but overall elapsed time still improves.

GODEBUG=gctrace=1 on the other hand indicates that these page faults are coming from the heap growing and touching many new pages.

thing $ GODEBUG=gctrace=1 ./thing                                                                      
gc 1 @0.006s 8%: 0.020+2.9+0.66 ms clock, 0.96+1.0/9.1/0+32 ms cpu, 5->7->5 MB, 6 MB goal, 48 P
gc 2 @0.013s 14%: 0.22+4.3+1.0 ms clock, 10+4.8/25/0+51 ms cpu, 8->13->11 MB, 10 MB goal, 48 P         
gc 3 @0.026s 13%: 0.38+11+1.1 ms clock, 18+1.6/56/0+54 ms cpu, 17->30->26 MB, 22 MB goal, 48 P 
gc 4 @0.045s 16%: 0.86+13+1.7 ms clock, 41+2.9/86/0+82 ms cpu, 33->49->46 MB, 53 MB goal, 48 P 
gc 5 @0.078s 14%: 0.72+16+0.33 ms clock, 34+9.7/144/0+15 ms cpu, 63->90->86 MB, 93 MB goal, 48 P
gc 6 @0.131s 12%: 0.43+43+0.50 ms clock, 20+5.2/326/0+24 ms cpu, 124->193->187 MB, 173 MB goal, 48 P
gc 7 @0.238s 12%: 1.3+59+1.5 ms clock, 63+15/525/0+73 ms cpu, 258->381->361 MB, 374 MB goal, 48 P
thing $ GODEBUG=gctrace=1 GOMAXPROCS=1 ./thing                                                 
gc 1 @0.007s 30%: 0.005+9.8+0.005 ms clock, 0.005+4.1/1.3/0+0.005 ms cpu, 4->5->5 MB, 5 MB goal, 1 P   
gc 2 @0.018s 27%: 0.001+5.3+0.001 ms clock, 0.001+1.2/0/0+0.001 ms cpu, 6->9->4 MB, 10 MB goal, 1 P
gc 3 @0.025s 29%: 0.001+8.4+0.001 ms clock, 0.001+3.2/0.088/0+0.001 ms cpu, 5->9->8 MB, 8 MB goal, 1 P
gc 4 @0.034s 29%: 0.001+10+0.002 ms clock, 0.001+1.5/1.9/0+0.002 ms cpu, 9->15->9 MB, 17 MB goal, 1 P
gc 5 @0.046s 29%: 0.001+3.6+0.001 ms clock, 0.001+1.0/0/0+0.001 ms cpu, 11->13->3 MB, 19 MB goal, 1 P
gc 6 @0.051s 28%: 0.001+9.8+0.001 ms clock, 0.001+0.57/2.3/0+0.001 ms cpu, 4->10->9 MB, 7 MB goal, 1 P
gc 7 @0.062s 28%: 0.001+3.8+0.001 ms clock, 0.001+1.1/0/0+0.001 ms cpu, 10->13->3 MB, 19 MB goal, 1 P
gc 8 @0.067s 27%: 0.001+9.9+0.001 ms clock, 0.001+0.57/2.4/0+0.001 ms cpu, 4->10->9 MB, 7 MB goal, 1 P 
gc 9 @0.078s 27%: 0.001+3.8+0.001 ms clock, 0.001+1.1/0/0+0.001 ms cpu, 10->13->3 MB, 19 MB goal, 1 P
gc 10 @0.084s 27%: 0.001+10+0.002 ms clock, 0.001+0.58/2.5/0+0.002 ms cpu, 4->10->9 MB, 7 MB goal, 1 P
gc 10 @0.084s 27%: 0.001+10+0.002 ms clock, 0.001+0.58/2.5/0+0.002 ms cpu, 4->10->9 MB, 7 MB goal, 1 P
gc 11 @0.095s 27%: 0.001+3.8+0.001 ms clock, 0.001+1.0/0/0+0.001 ms cpu, 10->13->3 MB, 19 MB goal, 1 P
gc 12 @0.100s 27%: 0.001+10+0.003 ms clock, 0.001+0.57/2.7/0+0.003 ms cpu, 4->10->9 MB, 7 MB goal, 1 P
gc 13 @0.112s 27%: 0.002+3.9+0.001 ms clock, 0.002+1.1/0/0+0.001 ms cpu, 10->13->3 MB, 19 MB goal, 1 P
gc 14 @0.117s 27%: 0.001+10+0.003 ms clock, 0.001+0.58/2.8/0+0.003 ms cpu, 4->10->9 MB, 7 MB goal, 1 P
gc 15 @0.129s 27%: 0.003+3.8+0.001 ms clock, 0.003+1.0/0/0+0.001 ms cpu, 10->13->3 MB, 19 MB goal, 1 P
gc 16 @0.134s 26%: 0.001+9.8+0.001 ms clock, 0.001+0.62/2.1/0+0.001 ms cpu, 4->10->9 MB, 7 MB goal, 1 P
gc 17 @0.145s 26%: 0.001+3.7+0.001 ms clock, 0.001+1.0/0/0+0.001 ms cpu, 10->13->3 MB, 19 MB goal, 1 P
gc 18 @0.150s 26%: 0+9.9+0.002 ms clock, 0+0.58/2.3/0+0.002 ms cpu, 4->10->9 MB, 7 MB goal, 1 P
gc 19 @0.161s 26%: 0.001+3.8+0.001 ms clock, 0.001+1.0/0/0+0.001 ms cpu, 10->13->3 MB, 19 MB goal, 1 P
gc 20 @0.166s 26%: 0.001+10+0.002 ms clock, 0.001+0.59/2.5/0+0.002 ms cpu, 4->10->9 MB, 7 MB goal, 1 P
gc 21 @0.177s 26%: 0.001+3.5+0.001 ms clock, 0.001+0.98/0/0+0.001 ms cpu, 10->13->3 MB, 19 MB goal, 1 P
gc 22 @0.182s 26%: 0+8.5+0.001 ms clock, 0+0.57/1.8/0+0.001 ms cpu, 4->10->9 MB, 6 MB goal, 1 P
gc 23 @0.192s 26%: 0.001+3.0+0.001 ms clock, 0.001+0.81/0/0+0.001 ms cpu, 10->13->3 MB, 19 MB goal, 1 P
gc 24 @0.196s 26%: 0+7.8+0.001 ms clock, 0+0.53/1.6/0+0.001 ms cpu, 4->10->9 MB, 6 MB goal, 1 P                                                                                                            
gc 25 @0.205s 26%: 0.001+2.9+0.001 ms clock, 0.001+0.77/0/0+0.001 ms cpu, 10->12->3 MB, 19 MB goal, 1 P
gc 26 @0.209s 26%: 0.001+8.4+0.003 ms clock, 0.001+0.49/2.2/0+0.003 ms cpu, 4->10->9 MB, 6 MB goal, 1 P
gc 27 @0.218s 26%: 0.002+2.7+0.001 ms clock, 0.002+0.75/0/0+0.001 ms cpu, 10->12->3 MB, 19 MB goal, 1 P
gc 28 @0.222s 25%: 0+7.9+0.001 ms clock, 0+0.52/1.5/0+0.001 ms cpu, 4->10->9 MB, 6 MB goal, 1 P
gc 29 @0.231s 25%: 0.001+2.6+0.001 ms clock, 0.001+0.73/0/0+0.001 ms cpu, 10->12->3 MB, 19 MB goal, 1 P
gc 30 @0.234s 25%: 0.001+8.3+0.002 ms clock, 0.001+0.57/1.8/0+0.002 ms cpu, 4->10->9 MB, 6 MB goal, 1 P
gc 31 @0.244s 25%: 0.001+2.7+0.001 ms clock, 0.001+0.72/0/0+0.001 ms cpu, 10->12->3 MB, 19 MB goal, 1 P
gc 32 @0.248s 26%: 0.001+5.8+0.003 ms clock, 0.001+2.3/0.56/0+0.003 ms cpu, 4->7->6 MB, 6 MB goal, 1 P
gc 33 @0.255s 26%: 0.001+3.9+0 ms clock, 0.001+0.88/0.79/0+0 ms cpu, 7->9->5 MB, 13 MB goal, 1 P
gc 34 @0.260s 26%: 0+2.4+0 ms clock, 0+0.70/0/0+0 ms cpu, 6->7->2 MB, 10 MB goal, 1 P  
gc 35 @0.264s 26%: 0+8.3+0.001 ms clock, 0+0.73/1.4/0+0.001 ms cpu, 4->10->9 MB, 5 MB goal, 1 P
gc 36 @0.273s 25%: 0+2.7+0 ms clock, 0+0.72/0/0+0 ms cpu, 10->12->3 MB, 19 MB goal, 1 P
gc 37 @0.277s 25%: 0+8.2+0.001 ms clock, 0+0.54/1.7/0+0.001 ms cpu, 4->10->9 MB, 6 MB goal, 1 P
gc 38 @0.286s 25%: 0.001+2.8+0.001 ms clock, 0.001+0.75/0/0+0.001 ms cpu, 10->12->3 MB, 19 MB goal, 1 P
gc 39 @0.290s 25%: 0+8.1+0.001 ms clock, 0+0.58/1.5/0+0.001 ms cpu, 4->10->9 MB, 6 MB goal, 1 P
gc 40 @0.299s 25%: 0.001+2.6+0.001 ms clock, 0.001+0.71/0/0+0.001 ms cpu, 10->12->3 MB, 19 MB goal, 1 P
gc 41 @0.303s 25%: 0+8.1+0.002 ms clock, 0+0.55/1.6/0+0.002 ms cpu, 4->10->9 MB, 6 MB goal, 1 P
gc 42 @0.312s 25%: 0.001+2.7+0 ms clock, 0.001+0.71/0/0+0 ms cpu, 10->12->3 MB, 19 MB goal, 1 P
gc 43 @0.316s 25%: 0+7.7+0 ms clock, 0+0.58/1.3/0+0 ms cpu, 4->10->9 MB, 6 MB goal, 1 P
gc 44 @0.325s 25%: 0+2.6+0 ms clock, 0+0.71/0/0+0 ms cpu, 10->12->3 MB, 19 MB goal, 1 P
gc 45 @0.328s 25%: 0+7.5+0 ms clock, 0+0.55/1.4/0+0 ms cpu, 4->10->9 MB, 6 MB goal, 1 P
gc 46 @0.337s 25%: 0+2.5+0 ms clock, 0+0.67/0/0+0 ms cpu, 10->12->3 MB, 19 MB goal, 1 P        
gc 47 @0.341s 25%: 0.001+7.4+0.001 ms clock, 0.001+0.53/1.5/0+0.001 ms cpu, 4->10->9 MB, 6 MB goal, 1 P
gc 48 @0.349s 25%: 0+2.5+0 ms clock, 0+0.71/0/0+0 ms cpu, 10->12->3 MB, 19 MB goal, 1 P        
gc 49 @0.352s 25%: 0+7.4+0 ms clock, 0+0.53/1.3/0+0 ms cpu, 4->10->9 MB, 6 MB goal, 1 P        
gc 50 @0.361s 25%: 0+2.4+0 ms clock, 0+0.66/0/0+0 ms cpu, 10->12->3 MB, 19 MB goal, 1 P        
gc 51 @0.364s 25%: 0+8.1+0.001 ms clock, 0+0.58/1.7/0+0.001 ms cpu, 4->10->9 MB, 6 MB goal, 1 P
gc 52 @0.373s 25%: 0.012+2.7+0 ms clock, 0.012+0.74/0/0+0 ms cpu, 10->12->3 MB, 19 MB goal, 1 P
gc 53 @0.377s 24%: 0+8.0+0.013 ms clock, 0+0.53/1.5/0+0.013 ms cpu, 4->10->9 MB, 6 MB goal, 1 P
gc 54 @0.386s 24%: 0.002+2.8+0.001 ms clock, 0.002+0.73/0/0+0.001 ms cpu, 10->12->3 MB, 19 MB goal, 1 P
gc 55 @0.390s 24%: 0+8.1+0.001 ms clock, 0+0.52/1.6/0+0.001 ms cpu, 4->10->9 MB, 6 MB goal, 1 P
gc 56 @0.399s 24%: 0+2.6+0 ms clock, 0+0.69/0/0+0 ms cpu, 10->12->3 MB, 19 MB goal, 1 P
gc 57 @0.403s 24%: 0+8.0+0.002 ms clock, 0+0.55/1.4/0+0.002 ms cpu, 4->10->9 MB, 6 MB goal, 1 P
gc 58 @0.412s 24%: 0+2.6+0 ms clock, 0+0.69/0/0+0 ms cpu, 10->12->3 MB, 19 MB goal, 1 P
gc 59 @0.416s 24%: 0+7.9+0.001 ms clock, 0+0.53/1.5/0+0.001 ms cpu, 4->10->9 MB, 6 MB goal, 1 P
gc 60 @0.424s 24%: 0+2.6+0 ms clock, 0+0.67/0/0+0 ms cpu, 10->12->3 MB, 19 MB goal, 1 P
gc 61 @0.428s 24%: 0+7.5+0 ms clock, 0+0.51/1.5/0+0 ms cpu, 4->10->9 MB, 6 MB goal, 1 P
gc 62 @0.437s 24%: 0+2.5+0 ms clock, 0+0.68/0/0+0 ms cpu, 10->12->3 MB, 19 MB goal, 1 P
gc 63 @0.440s 24%: 0+7.5+0.001 ms clock, 0+0.58/1.3/0+0.001 ms cpu, 4->10->9 MB, 6 MB goal, 1 P
gc 64 @0.449s 24%: 0+2.6+0 ms clock, 0+0.72/0/0+0 ms cpu, 10->12->3 MB, 19 MB goal, 1 P
gc 65 @0.452s 24%: 0+7.5+0 ms clock, 0+0.50/1.5/0+0 ms cpu, 4->10->9 MB, 6 MB goal, 1 P
gc 66 @0.461s 24%: 0+2.6+0 ms clock, 0+0.67/0/0+0 ms cpu, 10->12->3 MB, 19 MB goal, 1 P
gc 67 @0.464s 24%: 0+7.7+0.001 ms clock, 0+0.53/1.5/0+0.001 ms cpu, 4->10->9 MB, 6 MB goal, 1 P
gc 68 @0.473s 24%: 0+2.7+0.001 ms clock, 0+0.74/0/0+0.001 ms cpu, 10->12->3 MB, 19 MB goal, 1 P
gc 69 @0.477s 24%: 0+7.9+0.001 ms clock, 0+0.57/1.4/0+0.001 ms cpu, 4->10->9 MB, 6 MB goal, 1 P
gc 70 @0.486s 24%: 0.001+2.7+0.001 ms clock, 0.001+0.73/0/0+0.001 ms cpu, 10->12->3 MB, 19 MB goal, 1 P                                                                                                    
gc 71 @0.490s 24%: 0+7.7+0.001 ms clock, 0+0.54/1.4/0+0.001 ms cpu, 4->10->9 MB, 6 MB goal, 1 P
gc 72 @0.498s 24%: 0+2.6+0 ms clock, 0+0.67/0/0+0 ms cpu, 10->12->3 MB, 19 MB goal, 1 P
gc 73 @0.502s 24%: 0+7.5+0 ms clock, 0+0.46/1.5/0+0 ms cpu, 4->10->9 MB, 6 MB goal, 1 P
gc 74 @0.510s 24%: 0+2.5+0 ms clock, 0+0.65/0/0+0 ms cpu, 10->12->3 MB, 19 MB goal, 1 P
gc 75 @0.514s 24%: 0+7.7+0 ms clock, 0+0.52/1.5/0+0 ms cpu, 4->10->9 MB, 6 MB goal, 1 P
gc 76 @0.523s 24%: 0.001+2.4+0 ms clock, 0.001+0.64/0/0+0 ms cpu, 10->12->3 MB, 19 MB goal, 1 P
gc 77 @0.526s 24%: 0+7.4+0 ms clock, 0+0.54/1.3/0+0 ms cpu, 4->10->9 MB, 6 MB goal, 1 P
gc 78 @0.535s 24%: 0+2.4+0 ms clock, 0+0.64/0/0+0 ms cpu, 10->12->3 MB, 19 MB goal, 1 P
gc 79 @0.538s 24%: 0+7.4+0.001 ms clock, 0+0.53/1.4/0+0.001 ms cpu, 4->10->9 MB, 6 MB goal, 1 P
gc 80 @0.547s 24%: 0+2.3+0 ms clock, 0+0.63/0/0+0 ms cpu, 10->12->3 MB, 19 MB goal, 1 P
gc 81 @0.550s 24%: 0+7.2+0 ms clock, 0+0.49/1.4/0+0 ms cpu, 4->10->9 MB, 6 MB goal, 1 P
gc 82 @0.558s 24%: 0+2.2+0 ms clock, 0+0.60/0/0+0 ms cpu, 10->12->2 MB, 19 MB goal, 1 P
gc 83 @0.562s 24%: 0+7.3+0.001 ms clock, 0+0.56/1.2/0+0.001 ms cpu, 4->10->9 MB, 5 MB goal, 1 P
gc 84 @0.570s 24%: 0+2.3+0 ms clock, 0+0.64/0/0+0 ms cpu, 10->12->2 MB, 19 MB goal, 1 P
gc 85 @0.573s 23%: 0+7.4+0.001 ms clock, 0+0.62/1.2/0+0.001 ms cpu, 4->10->9 MB, 5 MB goal, 1 P
gc 86 @0.582s 23%: 0+2.4+0 ms clock, 0+0.64/0/0+0 ms cpu, 10->12->2 MB, 19 MB goal, 1 P
gc 87 @0.585s 23%: 0+7.7+0.001 ms clock, 0+0.69/1.2/0+0.001 ms cpu, 4->10->9 MB, 5 MB goal, 1 P
gc 88 @0.594s 23%: 0+2.2+0 ms clock, 0+0.59/0/0+0 ms cpu, 10->12->2 MB, 19 MB goal, 1 P
gc 89 @0.597s 23%: 0+7.3+0 ms clock, 0+0.64/1.2/0+0 ms cpu, 4->10->9 MB, 5 MB goal, 1 P
gc 90 @0.606s 23%: 0+2.3+0 ms clock, 0+0.61/0/0+0 ms cpu, 10->12->2 MB, 19 MB goal, 1 P
gc 91 @0.609s 23%: 0+7.3+0 ms clock, 0+0.64/1.2/0+0 ms cpu, 4->10->9 MB, 5 MB goal, 1 P
gc 92 @0.617s 23%: 0.001+2.1+0 ms clock, 0.001+0.59/0/0+0 ms cpu, 10->12->2 MB, 19 MB goal, 1 P
gc 93 @0.621s 23%: 0+7.5+0.002 ms clock, 0+0.76/1.1/0+0.002 ms cpu, 4->10->9 MB, 5 MB goal, 1 P
gc 94 @0.629s 23%: 0.001+2.2+0 ms clock, 0.001+0.57/0/0+0 ms cpu, 10->12->2 MB, 19 MB goal, 1 P
gc 95 @0.633s 23%: 0+7.5+0.001 ms clock, 0+0.71/1.2/0+0.001 ms cpu, 4->10->9 MB, 5 MB goal, 1 P
gc 96 @0.641s 23%: 0.001+2.1+0 ms clock, 0.001+0.53/0/0+0 ms cpu, 10->12->2 MB, 19 MB goal, 1 P
gc 97 @0.645s 23%: 0+7.4+0.002 ms clock, 0+0.68/1.2/0+0.002 ms cpu, 4->10->9 MB, 5 MB goal, 1 P
gc 98 @0.653s 23%: 0.001+2.1+0 ms clock, 0.001+0.57/0/0+0 ms cpu, 10->12->2 MB, 19 MB goal, 1 P
gc 99 @0.657s 23%: 0+7.4+0.001 ms clock, 0+0.73/1.1/0+0.001 ms cpu, 4->10->9 MB, 5 MB goal, 1 P
gc 100 @0.665s 23%: 0+2.1+0 ms clock, 0+0.53/0/0+0 ms cpu, 10->12->2 MB, 19 MB goal, 1 P
gc 101 @0.669s 23%: 0+7.3+0.001 ms clock, 0+1.3/0.77/0+0.001 ms cpu, 4->9->8 MB, 5 MB goal, 1 P

When we don't set GOMAXPROCS=1, the runtime assumes the heap just keeps growing. There are few GCs, but the heap grows to almost 400 MiB which is the cause of all those page faults (AFAICT).

When we set GOMAXPROCS=1, our heap stays small, and we end up re-using a lot of memory and don't have to fault in new pages quite so often.

For comparison, these are the parameters of the machine I was using:

go version go1.12.5 linux/amd64
Linux 4.19.37-amd64, SMP
48-thread 24-core 2-socket Haswell machine

There may be a legitimate bug here (I need to think more about the sample program), but the bug is likely not allocator scalability.

@midom
Copy link
Author

midom commented Jul 15, 2019

Weird, to have some common ground I set up a google cloud instance (n1-standard-64), downloaded latest stable golang:

go version go1.12.7 linux/amd64
Linux my-vm 4.9.0-9-amd64 #1 SMP Debian 4.9.168-1+deb9u3 (2019-06-16) x86_64 GNU/Linux
(apologies for somewhat ancient kernel, looking for better image)

Even named the binary 'thing':

domas_mituzas@my-vm:~/thing$ time ./thing

real 0m6.469s
user 0m3.764s
sys 6m32.516s
domas_mituzas@my-vm:~/thing$ time GOMAXPROCS=1 ./thing

real 0m1.038s
user 0m0.916s
sys 0m0.124s

And indeed, allocator scalability itself is fantastic, which is why it slams into kernel scalability. I did not look much at how Go places objects - if really objects are placed into separate blocks, then it is hitting another page table contention that may be harder to work around (multiple mappings could do, though).

I have a very basic C testcase that shows how this breaks inside kernel:

# /usr/bin/time ./x
5.59user 951.16system 0:20.20elapsed 4735%CPU (0avgtext+0avgdata 1050196maxresident)k
0inputs+0outputs (0major+24724446minor)pagefaults 0swaps
#define _GNU_SOURCE

#include <unistd.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/sysinfo.h>
#include <stdio.h>

char *x;

unsigned long long arena_size = 1L * 1024 * 1024 * 1024;
unsigned long long start;

void * worker(void *p) {
	int n;
	while (n<arena_size) {
		x[n]++;
		n += 4095;
	}
}

int main( int ac, char ** av ) {
	int n, nthr = get_nprocs();
	pthread_t *threads = malloc(sizeof(pthread_t) * nthr);

	x = malloc(arena_size);
	if (x==(caddr_t)-1) {
		exit(1);
	}

	for (n=0; n<nthr; n++)
		pthread_create(&threads[n], NULL, worker, NULL);

	while (n--)
		pthread_join(threads[n], NULL);

	return 0;
}

Let me check the properties of random arena access vs linear.

@midom
Copy link
Author

midom commented Jul 15, 2019

update with latest stable kernel (5.2.1) and go1.12.7:

gcoff, ncpu1: 0.37user 0.34system 0:00.72elapsed 100%CPU (0avgtext+0avgdata 535076maxresident)k
gcoff: 1.29user 287.80system 0:04.83elapsed 5976%CPU (0avgtext+0avgdata 540384maxresident)k

gcon, ncpu1: 0.62user 0.01system 0:00.64elapsed 99%CPU (0avgtext+0avgdata 21436maxresident)k
gcoff: 3.93user 251.22system 0:04.23elapsed 6022%CPU (0avgtext+0avgdata 516776maxresident)k

@midom
Copy link
Author

midom commented Jul 15, 2019

random access to same gigabyte does not show same contention signature:

-   95.93%    38.21%  mmapr    mmapr                                                     [.] worker                                                                                                                                                                                                                          ▒
   - 57.71% worker                                                                                                                                                                                                                                                                                                           ▒
      - 56.39% page_fault                                                                                                                                                                                                                                                                                                    ◆
         - 56.35% __do_page_fault                                                                                                                                                                                                                                                                                            ▒
            - 44.17% handle_mm_fault                                                                                                                                                                                                                                                                                         ▒
               - 43.97% __handle_mm_fault                                                                                                                                                                                                                                                                                    ▒
                  - 38.50% do_wp_page                                                                                                                                                                                                                                                                                        ▒
                     - 38.47% wp_page_copy                                                                                                                                                                                                                                                                                   ▒
                        - 35.38% ptep_clear_flush                                                                                                                                                                                                                                                                            ▒
                           - 35.13% flush_tlb_mm_range                                                                                                                                                                                                                                                                       ▒
                              - 34.35% on_each_cpu_cond_mask                                                                                                                                                                                                                                                                 ▒
                                 - 31.15% on_each_cpu_mask                                                                                                                                                                                                                                                                   ▒
                                    - 30.84% smp_call_function_many                                                                                                                                                                                                                                                          ▒
                                       - 9.27% llist_add_batch                                                                                                                                                                                                                                                               ▒
                                          + 1.97% call_function_interrupt                                                                                                                                                                                                                                                    ▒
                                       - 8.88% native_sched_clock                                                                                                                                                                                                                                                            ▒
                                            2.56% cyc2ns_read_begin.part.2                                                                                                                                                                                                                                                   ▒
                                          + 1.20% call_function_interrupt                                                                                                                                                                                                                                                    ▒
                                       + 2.16% call_function_interrupt                                                                                                                                                                                                                                                       ▒
                                       + 1.73% native_send_call_func_ipi                                                                                                                                                                                                                                                     ▒
                                       + 0.88% cpumask_next                                                                                                                                                                                                                                                                  ▒
                                 + 1.76% cpumask_next                                                                                                                                                                                                                                                                        ▒
                                   1.06% tlb_is_not_lazy                                                                                                                                                                                                                                                                     ▒
                        + 1.54% alloc_pages_vma                                                                                                                                                                                                                                                                              ▒
                    1.50% queued_spin_lock_slowpath                                                                                                                                                                                                                                                                          ▒
                    1.18% _raw_spin_lock                                                                                                                                                                                                                                                                                     ▒
            - 6.60% down_read_trylock                                                                                                                                                                                                                                                                                        ▒
               - 1.12% call_function_interrupt                                                                                                                                                                                                                                                                               ▒
                  + 0.91% smp_call_function_interrupt                                                                                                                                                                                                                                                                        ▒
              2.61% up_read                                                                                                                                                                                                                                                                                                  ▒
            + 1.99% down_read                                                                                                                                                                                                                                                                                                ▒
   + 38.21% start_thread

meanwhile, single thread allocs seem to be much much more efficient:

-   88.96%    20.99%  mmapr    mmapr               [.] worker                                                                                                                                                                                                                                                                ▒
   - 67.97% worker                                                                                                                                                                                                                                                                                                           ▒
      - 56.31% page_fault                                                                                                                                                                                                                                                                                                    ▒
         - 55.74% __do_page_fault                                                                                                                                                                                                                                                                                            ◆
            - 52.29% handle_mm_fault                                                                                                                                                                                                                                                                                         ▒
               - 50.21% __handle_mm_fault                                                                                                                                                                                                                                                                                    ▒
                  - 46.11% do_wp_page                                                                                                                                                                                                                                                                                        ▒
                     - 45.49% wp_page_copy                                                                                                                                                                                                                                                                                   ▒
                        - 23.04% alloc_pages_vma                                                                                                                                                                                                                                                                             ▒
                           - 21.95% __alloc_pages_nodemask                                                                                                                                                                                                                                                                   ▒
                              - get_page_from_freelist                                                                                                                                                                                                                                                                       ▒
                                   16.18% clear_page_erms                                                                                                                                                                                                                                                                    ▒
                        - 8.75% __lru_cache_add                                                                                                                                                                                                                                                                              ▒
                           + pagevec_lru_move_fn                                                                                                                                                                                                                                                                             ▒
                        + 4.73% ptep_clear_flush                                                                                                                                                                                                                                                                             ▒
                        + 3.78% mem_cgroup_try_charge_delay                                                                                                                                                                                                                                                                  ▒
                          0.52% mem_cgroup_commit_charge

(I'm having all the threads randomly walk the same memory, instead of linearly, with some adjustments to the .c above).

So, there's definitely ~25x pagefaulting cost increase in random allocs (very little of that in actual spinlocking), but the linear edge case shows ~1700x pagefaulting cost increase, mostly in spinlocks, which I trigger with my Go testcase too (which indicates that the page gets faulted from multiple threads at once).

By changing my C test to linearly allocate different maps in each thread I can see that the scaling behavior is very similar to random-order allocation - increased costs are from flush_tlb_mm_range, which would be hard to avoid, and down_read_trylock.

I'm really waving my hands here, but from what I see - the profiles of go testcase are indicative of pagefaults hitting same page and a spinlock, rather than TLB flushing.

@midom
Copy link
Author

midom commented Jul 15, 2019

In some of my other tests I ended up hitting runtime.(*mcentral).cacheSpan contention (simply because I was allocating lots of large objects).

The main way how I hit this pagefault issue is with fast mapassign functions, and this code in concurrent fashion will hit high % in kernel spinlocks:

type Maps struct {
	l *Maps
	m map[int]int
}

var prev *Maps

for z := 0; z < 100000; z++ {
	prev = &Maps{prev, map[int]int{1:1}}
}

@teou
Copy link

teou commented Aug 8, 2019

i can reproduce this on my machine: centos 7.3.1611, and the results are weird :

  1. build the same test go program above(the: ./tt) with golang 1.11.1 , result:
    [thing]$ time GOMAXPROCS=1 ./thing
    real 0m5.026s
    user 0m1.103s
    sys 0m3.928s
    [thing]$ time ./thing
    real 0m2.318s
    user 0m1.667s
    sys 0m5.307s

  2. then i changed GOROOT to go1.9.1 dir on the same machine ,then build , performance is much better:
    [thing]$ time ./thing
    real 0m0.886s
    user 0m2.219s
    sys 0m1.114s
    [thing]$ time GOMAXPROCS=1 ./thing
    real 0m1.928s
    user 0m1.544s
    sys 0m0.382s

  3. switch back to GOROOT 1.11.1 again, then build, performance is like go 1.9:
    [thing]$ time ./thing
    real 0m0.700s
    user 0m1.904s
    sys 0m0.716s

  4. rm -f ~/.cache/* build with go1.11.1 again, SLOW:
    [thing]$ time ./thing
    real 0m1.021s
    user 0m1.549s
    sys 0m2.272s

@teou
Copy link

teou commented Aug 15, 2019

@bradfitz runtime: concurrent object placement on heap causes Linux kernel contention

@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
@tcolgate
Copy link
Contributor

Does #35112 help here?

@midom
Copy link
Author

midom commented Feb 8, 2020

retesting with 1.10, 1.13 and master as of today (2020 feb):

go version go1.10.4 linux/amd64
real 0m26.668s
user 0m2.680s
sys 15m28.629s

go version go1.13.7 linux/amd64
real 0m13.369s
user 0m2.359s
sys 7m52.419s

go version devel +ca8bf63809 Sat Feb 8 16:15:48 2020 +0000 linux/amd64
real 0m0.282s
user 0m1.908s
sys 0m5.713s

singlethreaded perf between master and 1.13 does not change much

this can be closed

@midom midom closed this as completed Feb 8, 2020
@networkimprov
Copy link

@ianlancetaylor should this repro be added to the runtime test sequence?

@randall77
Copy link
Contributor

@networkimprov We could add this as a benchmark. Feel free to submit a CL.

I don't think we want to make it a test. We abhor flaky tests, and I think this one would be hard to make reliable enough to include by default.

@networkimprov
Copy link

Re a CL, cc @mknyszek @tcolgate as they're already tagged "Contributor".

@golang golang locked and limited conversation to collaborators Feb 8, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

9 participants