runtime: performance issue with the new page allocator on aix/ppc64 #35451

Closed · Helflym opened this issue Nov 8, 2019 · 30 comments
Labels: FrozenDueToAge, NeedsInvestigation, OS-AIX, Performance

@Helflym (Contributor) commented Nov 8, 2019

Since the new page allocator (#35112) was enabled by default, the runtime has been extremely slow. Some recent failures on the aix/ppc64 builder also seem related, starting from CL 190622: there is a timeout during the runtime tests (cf. https://build.golang.org/log/7e68765a5fe5e9887ef06fd90de1c9ae6682e73d and https://build.golang.org/log/d09de5259a97061169ce2648541666ba1101fc1c).

Before CL 201765:

$ time ./make.bash 
Building Go cmd/dist using /opt/freeware/lib/golang. (go1.13.4 aix/ppc64)
Building Go toolchain1 using /opt/freeware/lib/golang.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
Building Go toolchain3 using go_bootstrap and Go toolchain2.
Building packages and commands for aix/ppc64.
---
Installed Go for aix/ppc64 in /opt/freeware/src/packages/BUILD/goroot
Installed commands in /opt/freeware/src/packages/BUILD/goroot/bin

real    1m52.246s
user    2m53.752s
sys     0m16.242s

After CL 201765:

$ time ./make.bash 
Building Go cmd/dist using /opt/freeware/lib/golang. (go1.13.4 aix/ppc64)
Building Go toolchain1 using /opt/freeware/lib/golang.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
Building Go toolchain3 using go_bootstrap and Go toolchain2.
Building packages and commands for aix/ppc64.
---
Installed Go for aix/ppc64 in /opt/freeware/src/packages/BUILD/goroot
Installed commands in /opt/freeware/src/packages/BUILD/goroot/bin

real    20m44.516s
user    4m56.656s
sys     23m5.166s

cc @mknyszek

@agnivade added the NeedsInvestigation and OS-AIX labels Nov 8, 2019
@Helflym (Author) commented Nov 8, 2019

The problem might be larger than just the new page allocator. Its tests seem to pass correctly, but a freeze of about 30s occurs after each one, even with the old page allocator. The runtime seems to loop inside runtime.sysmon. I'll continue to investigate.

go test -v -run=TestPageAllocScav
=== RUN   TestPageAllocScavenge
=== RUN   TestPageAllocScavenge/ScavMultiple
    TestPageAllocScavenge/ScavMultiple: mgcscavenge_test.go:368: start
    TestPageAllocScavenge/ScavMultiple: mgcscavenge_test.go:381: end

^\SIGQUIT: quit
PC=0x90000000056c270 m=2 sigcode=8

goroutine 0 [idle]:
runtime.usleep(0x271096ea48b0)
        /opt/freeware/src/packages/BUILD/goroot/src/runtime/os2_aix.go:523 +0x54 fp=0x110279500 sp=0x1102794b8 pc=0x100035694
runtime.sysmon()
        /opt/freeware/src/packages/BUILD/goroot/src/runtime/proc.go:4453 +0xa4 fp=0x110279580 sp=0x110279500 pc=0x100047fc4
runtime.mstart1()
        /opt/freeware/src/packages/BUILD/goroot/src/runtime/proc.go:1125 +0xf4 fp=0x1102795b8 sp=0x110279580 pc=0x10003eca4
runtime.mstart()
        /opt/freeware/src/packages/BUILD/goroot/src/runtime/proc.go:1090 +0x5c fp=0x1102795e8 sp=0x1102795b8 pc=0x10003eb8c

goroutine 1 [chan receive]:
runtime.gopark(0x11007e800, 0xa00040000062298, 0xe170001000c6938, 0x2)
        /opt/freeware/src/packages/BUILD/goroot/src/runtime/proc.go:304 +0x118 fp=0xa00040000083a00 sp=0xa000400000839d0 pc=0x10003c1e8
runtime.chanrecv(0xa00040000062240, 0xa00040000083b50, 0x100040000000180, 0x1000e40bc)
        /opt/freeware/src/packages/BUILD/goroot/src/runtime/chan.go:563 +0x334 fp=0xa00040000083aa0 sp=0xa00040000083a00 pc=0x100005f14
runtime.chanrecv1(0xa00040000062240, 0xa00040000083b50)
        /opt/freeware/src/packages/BUILD/goroot/src/runtime/chan.go:433 +0x24 fp=0xa00040000083ae0 sp=0xa00040000083aa0 pc=0x100005b84
testing.(*T).Run(0xa000400000ac120, 0x1001f2f45, 0x15, 0x11007ff80, 0x10000000000036c)
        /opt/freeware/src/packages/BUILD/goroot/src/testing/testing.go:1006 +0x338 fp=0xa00040000083ba0 sp=0xa00040000083ae0 pc=0x1000e40d8
testing.runTests.func1(0xa000400000ac000)
        /opt/freeware/src/packages/BUILD/goroot/src/testing/testing.go:1247 +0x78 fp=0xa00040000083c00 sp=0xa00040000083ba0 pc=0x1000e7f28
testing.tRunner(0xa000400000ac000, 0xa00040000083d28)
        /opt/freeware/src/packages/BUILD/goroot/src/testing/testing.go:954 +0xe4 fp=0xa00040000083c60 sp=0xa00040000083c00 pc=0x1000e3d34
testing.runTests(0xa0004000000e0a0, 0x1101f2fc0, 0x13e, 0x13e, 0x0)
        /opt/freeware/src/packages/BUILD/goroot/src/testing/testing.go:1245 +0x2d0 fp=0xa00040000083d48 sp=0xa00040000083c60 pc=0x1000e5590
testing.(*M).Run(0xa0004000000c080, 0x0)
        /opt/freeware/src/packages/BUILD/goroot/src/testing/testing.go:1162 +0x188 fp=0xa00040000083e58 sp=0xa00040000083d48 pc=0x1000e4618
runtime_test.TestMain(0xa0004000000c080)
        /opt/freeware/src/packages/BUILD/goroot/src/runtime/crash_test.go:28 +0x2c fp=0xa00040000083eb8 sp=0xa00040000083e58 pc=0x100172b2c
main.main()
        _testmain.go:1152 +0x130 fp=0xa00040000083f50 sp=0xa00040000083eb8 pc=0x1001e93e0
runtime.main()
        /opt/freeware/src/packages/BUILD/goroot/src/runtime/proc.go:203 +0x28c fp=0xa00040000083fc0 sp=0xa00040000083f50 pc=0x10003bd3c
runtime.goexit()
        /opt/freeware/src/packages/BUILD/goroot/src/runtime/asm_ppc64x.s:884 +0x4 fp=0xa00040000083fc0 sp=0xa00040000083fc0 pc=0x100072774

goroutine 2 [force gc (idle)]:
runtime.gopark(0x11007ead8, 0x1102044d0, 0x1114000000000000, 0x1)
        /opt/freeware/src/packages/BUILD/goroot/src/runtime/proc.go:304 +0x118 fp=0xa00040000038780 sp=0xa00040000038750 pc=0x10003c1e8
runtime.goparkunlock(...)
        /opt/freeware/src/packages/BUILD/goroot/src/runtime/proc.go:310
runtime.forcegchelper()
        /opt/freeware/src/packages/BUILD/goroot/src/runtime/proc.go:253 +0xd0 fp=0xa000400000387c0 sp=0xa00040000038780 pc=0x10003c050
runtime.goexit()
        /opt/freeware/src/packages/BUILD/goroot/src/runtime/asm_ppc64x.s:884 +0x4 fp=0xa000400000387c0 sp=0xa000400000387c0 pc=0x100072774
created by runtime.init.5
        /opt/freeware/src/packages/BUILD/goroot/src/runtime/proc.go:242 +0x34

goroutine 3 [GC sweep wait]:
runtime.gopark(0x11007ead8, 0x110204920, 0xc14000000000000, 0x1)
        /opt/freeware/src/packages/BUILD/goroot/src/runtime/proc.go:304 +0x118 fp=0xa0004000004bf78 sp=0xa0004000004bf48 pc=0x10003c1e8
runtime.goparkunlock(...)
        /opt/freeware/src/packages/BUILD/goroot/src/runtime/proc.go:310
runtime.bgsweep(0xa00040000056000)
        /opt/freeware/src/packages/BUILD/goroot/src/runtime/mgcsweep.go:70 +0xac fp=0xa0004000004bfb8 sp=0xa0004000004bf78 pc=0x100026e2c
runtime.goexit()
        /opt/freeware/src/packages/BUILD/goroot/src/runtime/asm_ppc64x.s:884 +0x4 fp=0xa0004000004bfb8 sp=0xa0004000004bfb8 pc=0x100072774
created by runtime.gcenable
        /opt/freeware/src/packages/BUILD/goroot/src/runtime/mgc.go:214 +0x58

goroutine 4 [GC scavenge wait]:
runtime.gopark(0x11007ead8, 0x1102048e0, 0xd14000000000000, 0x1)
        /opt/freeware/src/packages/BUILD/goroot/src/runtime/proc.go:304 +0x118 fp=0xa0004000004af10 sp=0xa0004000004aee0 pc=0x10003c1e8
runtime.goparkunlock(...)
        /opt/freeware/src/packages/BUILD/goroot/src/runtime/proc.go:310
runtime.bgscavenge(0xa00040000056000)
        /opt/freeware/src/packages/BUILD/goroot/src/runtime/mgcscavenge.go:219 +0xf0 fp=0xa0004000004afb8 sp=0xa0004000004af10 pc=0x1000257f0
runtime.goexit()
        /opt/freeware/src/packages/BUILD/goroot/src/runtime/asm_ppc64x.s:884 +0x4 fp=0xa0004000004afb8 sp=0xa0004000004afb8 pc=0x100072774
created by runtime.gcenable
        /opt/freeware/src/packages/BUILD/goroot/src/runtime/mgc.go:215 +0x78

goroutine 5 [finalizer wait]:
runtime.gopark(0x11007ead8, 0x11022ef48, 0x1014040000056000, 0x1)
        /opt/freeware/src/packages/BUILD/goroot/src/runtime/proc.go:304 +0x118 fp=0xa00040000038f20 sp=0xa00040000038ef0 pc=0x10003c1e8
runtime.goparkunlock(...)
        /opt/freeware/src/packages/BUILD/goroot/src/runtime/proc.go:310
runtime.runfinq()
        /opt/freeware/src/packages/BUILD/goroot/src/runtime/mfinal.go:175 +0xc8 fp=0xa00040000038fc0 sp=0xa00040000038f20 pc=0x10001aab8
runtime.goexit()
        /opt/freeware/src/packages/BUILD/goroot/src/runtime/asm_ppc64x.s:884 +0x4 fp=0xa00040000038fc0 sp=0xa00040000038fc0 pc=0x100072774
created by runtime.createfing
        /opt/freeware/src/packages/BUILD/goroot/src/runtime/mfinal.go:156 +0x90

goroutine 6 [chan receive]:
runtime.gopark(0x11007e800, 0xa00040000062358, 0xe170001000c6938, 0x2)
        /opt/freeware/src/packages/BUILD/goroot/src/runtime/proc.go:304 +0x118 fp=0xa00040000085900 sp=0xa000400000858d0 pc=0x10003c1e8
runtime.chanrecv(0xa00040000062300, 0xa00040000085a50, 0x100040000001680, 0x1000e40bc)
        /opt/freeware/src/packages/BUILD/goroot/src/runtime/chan.go:563 +0x334 fp=0xa000400000859a0 sp=0xa00040000085900 pc=0x100005f14
runtime.chanrecv1(0xa00040000062300, 0xa00040000085a50)
        /opt/freeware/src/packages/BUILD/goroot/src/runtime/chan.go:433 +0x24 fp=0xa000400000859e0 sp=0xa000400000859a0 pc=0x100005b84
testing.(*T).Run(0xa000400000ad320, 0x1001ee2ae, 0xc, 0xa0004000004d4a0, 0xa00040000049e88)
        /opt/freeware/src/packages/BUILD/goroot/src/testing/testing.go:1006 +0x338 fp=0xa00040000085aa0 sp=0xa000400000859e0 pc=0x1000e40d8
runtime_test.TestPageAllocScavenge(0xa000400000ac120)
        /opt/freeware/src/packages/BUILD/goroot/src/runtime/mgcscavenge_test.go:383 +0x13cc fp=0xa00040000085f50 sp=0xa00040000085aa0 pc=0x1001a0ffc
testing.tRunner(0xa000400000ac120, 0x11007ff80)
        /opt/freeware/src/packages/BUILD/goroot/src/testing/testing.go:954 +0xe4 fp=0xa00040000085fb0 sp=0xa00040000085f50 pc=0x1000e3d34
runtime.goexit()
        /opt/freeware/src/packages/BUILD/goroot/src/runtime/asm_ppc64x.s:884 +0x4 fp=0xa00040000085fb0 sp=0xa00040000085fb0 pc=0x100072774
created by testing.(*T).Run
        /opt/freeware/src/packages/BUILD/goroot/src/testing/testing.go:1005 +0x31c

goroutine 7 [running]:
        goroutine running on other thread; stack unavailable
created by testing.(*T).Run
        /opt/freeware/src/packages/BUILD/goroot/src/testing/testing.go:1005 +0x31c

r0   0xffffffffffffffff r1   0x110279100
r2   0xffffffffffffffff r3   0xffffffffffffffff
r4   0xffffffffffffffff r5   0xffffffffffffffff
r6   0xffffffffffffffff r7   0xffffffffffffffff
r8   0xffffffffffffffff r9   0xffffffffffffffff
r10  0xffffffffffffffff r11  0xffffffffffffffff
r12  0xffffffffffffffff r13  0x110281800
r14  0x0        r15  0x0
r16  0x0        r17  0x0
r18  0x0        r19  0x0
r20  0x0        r21  0x0
r22  0x0        r23  0x0
r24  0x0        r25  0x0
r26  0x1102792a0        r27  0x0
r28  0xfffffffffffffe0  r29  0x9001000a0094210
r30  0x9001000a0004b40  r31  0x0
pc   0x90000000056c270  ctr  0xffffffff00000000
link 0xffffffffffffffff xer  0xffffffff
ccr  0x0        trap 0x0
exit status 2
FAIL    runtime 35.807s

@Helflym (Author) commented Nov 8, 2019

The problem comes from the maxChunks constant in (s *pageAlloc).init. On linux/ppc64le, maxChunks is 0x4000000; on aix/ppc64 it's 0x4000000000, which results in sysReserve mapping 0x200000000000 bytes. Reducing maxChunks fixes the freezes; I've tested this manually. So, is it possible to reduce it? Or should we find a way not to sysReserve all the chunks at once?
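
For scale, here is a rough reconstruction of where that 0x200000000000-byte reservation comes from. The 4 MiB of address space per chunk and the 128 bytes of bitmap per chunk entry are assumptions inferred from the numbers in this thread (0x200000000000 / 0x4000000000 = 128), not values read out of the runtime:

```go
package main

import "fmt"

const (
	chunkBytes = 4 << 20 // address space covered per chunk (assumed)
	entryBytes = 128     // bitmap bytes per chunk entry (assumed)
)

func main() {
	// 48-bit heap (linux/ppc64le) vs 60-bit heap (aix/ppc64).
	for _, bits := range []uint{48, 60} {
		chunks := (uint64(1) << bits) / chunkBytes
		fmt.Printf("heapAddrBits=%d: maxChunks=%#x, reservation=%#x bytes\n",
			bits, chunks, chunks*entryBytes)
	}
}
```

This prints maxChunks=0x4000000 with an 8 GiB chunk-bitmap reservation for 48 bits, and maxChunks=0x4000000000 with a 32 TiB reservation for 60 bits, matching the figures above.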

@ianlancetaylor (Contributor) commented

CC @mknyszek @aclements

@bcmills (Contributor) commented Nov 8, 2019

My understanding is that aix is not a “first class port”, so this is not technically a release-blocker.

(Nonetheless, I hope we can figure out how to resolve it before the release.)

@mknyszek (Contributor) commented Nov 8, 2019

@Helflym thanks for looking into it. I was aware of this issue but couldn't dig deeper because I was unable to get a gomote over the last few days.

The mapping of 32 TiB is completely intentional: in the code, AIX claims to have a 60-bit address space. This mapping is also PROT_NONE, so it's not like it actually NEEDS that memory; it's just a reservation that's mapped in as needed. Perhaps the mmap results are particularly non-contiguous? Also, I noticed that sysMap on AIX unmaps memory and then remaps it, which may also be a problem here. Would implementing/using mprotect on AIX work better?

Also, when you say reducing maxChunks fixes the freezes, do you mean that it also brings back the original performance?

@Helflym (Author) commented Nov 8, 2019

The allocation itself is actually working pretty well; there is no freeze at that point. The freeze occurs once a PageAlloc test has ended (maybe during a GC phase or something like that).

> Also, I noticed that sysMap on AIX unmaps memory and then remaps it, which may also be a problem here. Would implementing/using mprotect on AIX work better?

This munmap is needed because AIX's mmap doesn't work on an already-mapped area, which is the case when doing sysReserve. So yes, that's expected. No idea about mprotect; I'll have to check and see what's going on.
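
For readers unfamiliar with the reserve/commit dance being discussed, here is a minimal, self-contained sketch using generic Unix constants (the exact flags on AIX may differ, and this is standalone demo code, not the runtime's implementation):

```go
package main

import (
	"fmt"
	"syscall"
)

func main() {
	// Reserve address space without committing it, as sysReserve does:
	// PROT_NONE means none of it is accessible (or backed) yet.
	mem, err := syscall.Mmap(-1, 0, 1<<20, syscall.PROT_NONE,
		syscall.MAP_ANON|syscall.MAP_PRIVATE)
	if err != nil {
		panic(err)
	}

	// Committing a page: since AIX's mmap refuses an already-mapped
	// range, sysMap there must either munmap+mmap the range again, or,
	// as suggested above, simply flip the protections in place:
	if err := syscall.Mprotect(mem[:4096],
		syscall.PROT_READ|syscall.PROT_WRITE); err != nil {
		panic(err)
	}
	mem[0] = 1 // the committed page is now usable
	fmt.Println("committed one page out of the reservation:", mem[0])
}
```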

> Also, when you say reducing maxChunks fixes the freezes, do you mean that it also brings back the original performance?

Yes, I set the Linux value and it does work (I haven't run all.bash, though).

@mknyszek (Contributor) commented Nov 8, 2019

@Helflym OK, it would be nice not to have to support the full 60-bit address space if this is the cause of the performance slowdown. The comment on heapAddrBits explains that mmap on AIX returns addresses up to 0x0afffffffffffff, but if you can lower maxChunks artificially, is that not really true? maxChunks is computed via heapAddrBits.

I also don't fully understand why that would be the cause of the slowdown, unless manipulating that mapping is particularly expensive (which could be because of the munmap; @rsc tells me that on some systems mprotect is better at not messing with VMAs and such in the kernel). This is the assumption I'm making, since you say the "allocation" is working fine, by which I assume you mean the original sysReserve call at start-up.

I can write a couple of patches and if you have some time could you try them for me? Or if I had access to a machine (the gomote now is going to be really hard to get, due to slowness) I could dig deeper into debugging this myself.

With regard to the freeze: each PageAlloc test creates two of these mappings and then frees both at the end of the test, so if manipulating the mapping is this expensive, that test will definitely cause an apparent hang.

@Helflym (Author) commented Nov 12, 2019

> @Helflym OK, it would be nice not to have to support the full 60-bit address space if this is the cause of the performance slowdown. The comment on heapAddrBits explains that mmap on AIX returns addresses up to 0x0afffffffffffff, but if you can lower maxChunks artificially, is that not really true? maxChunks is computed via heapAddrBits.

0x0afffffffffffff is the official limit. However, most processes don't need that much memory. As far as I remember, the highest address I have seen in a Go program was near 0x0a00040000000000. I think that's why I was able to lower maxChunks artificially. Note that at tip it doesn't seem possible anymore...

Anyway, I've run some tests with different values of maxChunks to see how much performance is impacted. I ran them before the commit switching to the new page allocator (i.e. 52d5e76).

| maxChunks value | time of `go test -v -run=TestPageAlloc` |
| --- | --- |
| 0x0004000000 (Linux) | 43s |
| 0x0040000000 | 44s |
| 0x0400000000 | 63s |
| 0x4000000000 (default AIX) | >600s |

As you can see, performance falls off sharply above 0x0400000000.

I think it should be possible to use this value on AIX. If I understand correctly (please tell me if I'm completely wrong), you need a chunk for every possible address returned by mmap, i.e. from 0x0a00000000000000 to 0x0afffffffffffff on AIX. To simplify, a chunk is allocated for every address starting from 0x0. Since that doesn't work on AIX, and mmap addresses always have this 0x0a prefix, don't you think the first chunk could cover 0x0a00000000000000 instead of 0? The maxChunks value would then be 0x0400000000. Am I correct? If so, this could be a first workaround before finding a better solution.
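
Checking that arithmetic, with the same assumed 4 MiB chunk granularity as earlier (a sketch, not runtime code):

```go
// Assuming each chunk covers 4 MiB (1<<22 bytes) of address space:
const chunkBytes = 1 << 22

// Covering every address from 0 up to 2^60 (current AIX setting):
const maxChunksAIX = (1 << 60) / chunkBytes // 0x4000000000

// Anchoring the first chunk at 0x0a00000000000000 leaves only the
// 2^56-byte range 0x0a00000000000000..0x0affffffffffffff to cover:
const maxChunksOffset = (1 << 56) / chunkBytes // 0x400000000
```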

> I can write a couple of patches and if you have some time could you try them for me? Or if I had access to a machine (the gomote now is going to be really hard to get, due to slowness) I could dig deeper into debugging this myself.

I've created a special user for this kind of purpose on our build machine; just send me your SSH key by mail and I'll grant you access.

Meanwhile, I'll continue to search for a solution.

@mknyszek (Contributor) commented

If we know all returned addresses are that high, then on AIX we can set arenaBaseOffset in the runtime to (^0x0a00000000000000+1)&uintptrMask (this is effectively -0x0a00000000000000; it's the only way to express that as a constant in Go) and set heapAddrBits to 48 (deleting the AIX-specific stuff there).
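
Spelled out (a sketch; uintptrMask is assumed here to be the all-ones 64-bit constant):

```go
const uintptrMask = 1<<64 - 1 // assumed: all ones on a 64-bit platform

// Two's-complement negation written as an untyped Go constant:
const arenaBaseOffset = (^0x0a00000000000000 + 1) & uintptrMask

// arenaBaseOffset == 0xf600000000000000, i.e. -0x0a00000000000000 mod 2^64,
// so addr+arenaBaseOffset wraps 0x0a00000000000000 around to 0, bringing
// AIX heap addresses back into a 48-bit range.
```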

This is a fine enough solution for me, by the way. It would be nice to understand why AIX doesn't like this large mapping, but that's not a priority for the release. ppc64 supposedly only has a 48-bit address space (as far as the runtime is concerned on other platforms), so this perhaps just aligns us better with AIX's behavior, which seems to simply add in 0x0a00000000000000.

But ah, I remember now that ppc64 has that inverted/hashed page table design, which might be the source of the problems with these large mappings. I'm going to think more about that.

@Helflym (Author) commented Nov 12, 2019

I do agree with you. I've always wanted to use this arenaBaseOffset in order to avoid too many differences between aix/ppc64 and linux/ppc64, but I wasn't aware of the (^0x0a00000000000000+1)&uintptrMask trick. Anyway, this is working fine. I'll create the patch tomorrow and check that nothing else has been broken.

Note that I've traced a little to find which unmapping is slowing the tests, and it's indeed the release of the mapping created by this sysReserve. However, I don't understand why this munmap takes so long, as a similar mmap/munmap in C is far quicker.

@aclements (Member) commented

> However, I don't understand why this munmap takes so long, as a similar mmap/munmap in C is far quicker.

Does it matter if any pages have been touched in the mapping?

@mknyszek (Contributor) commented

@Helflym When you say "similar mmap/munmap in C," can you share the code? My guess was that it was the act of breaking apart that big mapping with munmap that was problematic, and I'm curious to see under what conditions it works just fine.

Also: thank you for looking into this!

@Helflym (Author) commented Nov 13, 2019

> Does it matter if any pages have been touched in the mapping?

Yes, it seems that's what slows munmap down.

I've been using this code:

```c
#include <sys/mman.h>
#include <stdio.h>
#include <sys/time.h>

#define ALLOC_SIZE 0x200000000000ull   /* same size as the Go reservation */
#define PAGE_ALLOC_NB 0x40000
#define GAP_BETWEEN_PAGES 0x100000ull

static long elapsed_us(struct timeval start, struct timeval stop)
{
        return (stop.tv_sec - start.tv_sec) * 1000000L
                + (stop.tv_usec - start.tv_usec);
}

int main(void)
{
        char *addr;
        struct timeval start, stop;

        /* Reserve a huge PROT_NONE area, like the runtime's sysReserve. */
        gettimeofday(&start, NULL);
        addr = mmap(NULL, ALLOC_SIZE, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
        gettimeofday(&stop, NULL);
        printf("memory mapped at %p (%ld µs)\n", (void *)addr, elapsed_us(start, stop));

        /* Commit and touch pages scattered across the reservation. */
        for (int i = 0; i < PAGE_ALLOC_NB; i++) {
                int *n_addr = (int *)(addr + i * GAP_BETWEEN_PAGES);
                mprotect(n_addr, 0x1000, PROT_WRITE | PROT_READ);
                *n_addr = i;
        }
        fflush(stdout);

        gettimeofday(&start, NULL);
        munmap(addr, ALLOC_SIZE);
        gettimeofday(&stop, NULL);
        printf("memory unmapped (%ld µs)\n", elapsed_us(start, stop));

        return 0;
}
```

munmap performance degrades whenever ALLOC_SIZE, PAGE_ALLOC_NB, or GAP_BETWEEN_PAGES is increased. With the current settings it takes ~3.6s. That's a lot, but nothing compared to the munmap in Go, which takes more than 30-40s. Note that I have no idea which PAGE_ALLOC_NB and GAP_BETWEEN_PAGES values would be closest to the Go runtime.

Anyway, ALLOC_SIZE is the same as the Go one, and mmap takes more than 1s, which is already too long: it means every Go program (even a hello world) takes more than 1s to start.

Note that using mprotect instead of munmap + mmap seems to work fine. I've tested it in the Go runtime too; I'll check the performance.

@gopherbot commented

Change https://golang.org/cl/206841 mentions this issue: runtime: add arenaBaseOffset on aix/ppc64

@mknyszek (Contributor) commented

It would be better to use mprotect and still assume a 60-bit address space just in case AIX changes its mmap policy (since this isn't documented anywhere). If it does change its policy, then existing Go binaries will break on new versions of AIX.

However, the fact that mmap takes 1 second to run makes this plan dead-on-arrival. Perhaps the arenaBaseOffset is the right way to go in this case, and to just deal with changes to AIX's mmap in the future?

@Helflym (Author) commented Nov 14, 2019

> It would be better to use mprotect and still assume a 60-bit address space just in case AIX changes its mmap policy (since this isn't documented anywhere).

What policy do you mean by this? The fact that mmap addresses start at 0x0a00000000000000 or above? I don't think that will change in the near future, and even if it does, there will be a way to keep using this segment. AIX has a strict compatibility policy: everything compiled on a previous version must run as is on all following ones. Therefore, there are many ways to keep older behaviors when running a newly compiled process.

> However, the fact that mmap takes 1 second to run makes this plan dead-on-arrival. Perhaps the arenaBaseOffset is the right way to go in this case, and to just deal with changes to AIX's mmap in the future?

mmap (and munmap afterwards) takes so long because the reserved memory area is really huge. Isn't it possible to allocate s.chunks incrementally? Or to have several levels of chunks (as is done in the arena with arenaL1 and arenaL2)?
At the moment only AIX is facing issues, but other OSes might have the same problem in the future, especially since amd64 already provides a way to mmap 57-bit addresses (according to malloc.go).
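
For illustration, here is a sketch of what such a two-level chunk index could look like. The names and bit split are hypothetical (this is roughly the shape the later sparse-array change took, not the code as it existed at the time):

```go
// Two-level index over per-chunk data, in the vein of mheap's
// arenaL1/arenaL2: the L1 array is small and always present, while each
// L2 block is only allocated once a chunk in its range is actually used.
const (
	l1Bits = 16 // hypothetical split of a 38-bit chunk index
	l2Bits = 22
)

type chunkData [128]byte // stand-in for the real per-chunk bitmaps

var chunks [1 << l1Bits]*[1 << l2Bits]chunkData

func chunkOf(ci uint64) *chunkData {
	l2 := chunks[ci>>l2Bits]
	if l2 == nil {
		return nil // this part of the address space was never used
	}
	return &l2[ci&(1<<l2Bits-1)]
}
```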

@gopherbot commented

Change https://golang.org/cl/207237 mentions this issue: runtime: use mprotect in sysMap for aix/ppc64

@mknyszek (Contributor) commented

> > It would be better to use mprotect and still assume a 60-bit address space just in case AIX changes its mmap policy (since this isn't documented anywhere).
>
> What policy do you mean by this? The fact that mmap addresses start at 0x0a00000000000000 or above? I don't think that will change in the near future, and even if it does, there will be a way to keep using this segment. AIX has a strict compatibility policy: everything compiled on a previous version must run as is on all following ones. Therefore, there are many ways to keep older behaviors when running a newly compiled process.

Yeah, that's what I meant. If it's that strict then perhaps it's OK. @aclements?

> > However, the fact that mmap takes 1 second to run makes this plan dead-on-arrival. Perhaps the arenaBaseOffset is the right way to go in this case, and to just deal with changes to AIX's mmap in the future?
>
> mmap (and munmap afterwards) takes so long because the reserved memory area is really huge. Isn't it possible to allocate s.chunks incrementally? Or to have several levels of chunks (as is done in the arena with arenaL1 and arenaL2)?
> At the moment only AIX is facing issues, but other OSes might have the same problem in the future, especially since amd64 already provides a way to mmap 57-bit addresses (according to malloc.go).

While that's true, very large mappings are not nearly as costly on other systems (though to be honest I only have hard numbers for Linux right now, and for 32 TiB PROT_NONE mappings it's <10 µs). A big difference between s.chunks and the arenas array is that s.chunks is mapped PROT_NONE, which theoretically means the OS shouldn't have to do anything expensive. The only other issues we've run into so far are artificial limits imposed by some systems (#35568).

With that said, we're exploring an incremental mapping approach. We could also add a layer of indirection, but (IMO) that complicates the code more than the incremental mapping approach @aclements suggested, and the latter is probably safer considering we're in the freeze. Anything else that limits how much address space we map is likely to be more complex than what we have now, which is why I proposed the current approach in the first place.

@aclements (Member) commented

> What policy do you mean by this? The fact that mmap addresses start at 0x0a00000000000000 or above? I don't think that will change in the near future, and even if it does, there will be a way to keep using this segment. AIX has a strict compatibility policy: everything compiled on a previous version must run as is on all following ones.

That's good to know. If there's a strict compatibility policy on this, I'm much more comfortable with the runtime assuming mappings will start at 0x0a00000000000000.

> Isn't it possible to allocate s.chunks incrementally? Or to have several levels of chunks (as is done in the arena with arenaL1 and arenaL2)?

There are certainly ways to engineer around this, but the current approach keeps the runtime much simpler and generally performs better. Adding a level of indirection would also be unfortunate, considering the MMU already implements all of the indirection we need directly in hardware. The arena index was originally just one level and depended entirely on MMU indirection (it's still one level on most platforms); switching to two levels significantly increased its complexity and cost about 2% in performance, IIRC.

@Helflym (Author) commented Nov 15, 2019

OK, thanks for your answers. If the incremental mapping is implemented one day, we can still try removing this arenaBaseOffset and see how AIX performs. But at the moment I'd rather have a working AIX builder, and I don't think we have any other option, right?

@mknyszek (Contributor) commented

@Helflym I'm working on a couple of patches so that this shouldn't be a problem for the foreseeable future (involving incremental mapping). I hope to have them up for review today.

@mknyszek (Contributor) commented

@Helflym Ahhh... I spoke too soon. There are some problems. I think landing the arenaBaseOffset change is the way to go for now to unblock the builder.

@ianlancetaylor (Contributor) commented

Let's please unblock the builder somehow.

@mknyszek (Contributor) commented

@ianlancetaylor I +2'd both changes.

@Helflym if you can confirm that the patches work on/fix AIX, we can land them any time and unblock the builder (modulo some minor comments, submit when ready).

@gopherbot commented

Change https://golang.org/cl/207497 mentions this issue: runtime: convert page allocator bitmap to sparse array

gopherbot pushed a commit that referenced this issue Nov 16, 2019
On AIX, addresses returned by mmap are between 0x0a00000000000000
and 0x0afffffffffffff. The previous solution to handle these large
addresses was to increase the arena size up to 60-bit addresses,
cf. CL 138736.

However, with the new page allocator, the 60-bit heap addresses were
causing huge memory allocations, especially by (s *pageAlloc).init. The
mmap and munmap syscalls dealing with these allocations were reducing
the performance of every Go program.

In order to avoid these allocations, arenaBaseOffset is set to
0x0a00000000000000 and heap addresses are on 48 bits, as on other
operating systems.

Updates: #35451

Change-Id: Ice916b8578f76703428ec12a82024147a7592bc0
Reviewed-on: https://go-review.googlesource.com/c/go/+/206841
Run-TryBot: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>

@Helflym (Author) commented Nov 18, 2019

@mknyszek I've tried your patch with the 60-bit addresses and it's far quicker, but sadly not enough.

With your patch:

$ time ./runtime.test -test.v -test.run=TestPageAlloc
...
real    0m58.025s
user    0m15.277s
sys     0m19.586s

With the arenaBaseOffset patch:

$ time ./runtime.test -test.v -test.run=TestPageAlloc
...
real    0m0.106s
user    0m0.024s
sys     0m0.035s

I didn't have time to investigate much further, though. I'll try to do it by the end of the day or tomorrow and will keep you updated.

@mknyszek (Contributor) commented

@Helflym Yeah that's what I figured. I think for this release we're going to stick with the 48-bit address space assumption as the fix.

I tried an incremental mapping approach wherein we keep making mappings twice the size and copy the data between them each time (we would perform this doubling at most 64 times, and the copying wouldn't actually take very long: a very conservative upper bound would be 500µs for a 1 TiB heap, and it would only happen once).
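
A sketch of the doubling idea for concreteness, with make/copy standing in for the real mmap/copy/munmap sequence (illustrative only, not the abandoned patch itself):

```go
// grow returns a mapping large enough for need entries, doubling the
// size each time it runs out: at most 64 doublings are possible in a
// 64-bit address space, and each copy is a one-time cost.
func grow(cur []uint64, need int) []uint64 {
	size := len(cur)
	if size == 0 {
		size = 1
	}
	for size < need {
		size *= 2
	}
	next := make([]uint64, size) // stands in for a fresh reservation
	copy(next, cur)              // migrate the old summaries
	return next
}
```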

This seemed like a great idea in principle, but there are some sharp edges here. The following is more for me to document why we ended up not doing this, so I apologize for the low level of specifics in the rest of this comment:

Firstly, picking the contiguous section of the summaries that the mapping represents is tricky. Ideally you want some kind of "centering" in case the heap keeps growing down in the address space. Next, we want to maintain the illusion to the page allocator that this is actually one enormous mapping, but it's possible to get an address from mmap that's too low to store as the base pointer, and we'd have to rely on the compiler's slice calculations overflowing in the right way. It would work, but could lead to unexpected bugs in the future. The alternative would be to maintain an explicit offset and apply it everywhere, but this significantly complicates the code.

We also considered doing a sparse-array approach for the summaries, but that complicates the code a lot as well, since each summary level is a different size (generics over constants would fix this, but that's a slippery slope).

@Helflym (Author) commented Nov 19, 2019

> @Helflym Yeah that's what I figured. I think for this release we're going to stick with the 48-bit address space assumption as the fix.

I think that's better anyway. Even if it's not exactly how AIX works, it keeps AIX closer to the other platforms and avoids unnecessary build failures every time something dealing with memory is changed.

If one day there are other big changes that might allow 60-bit addresses again, just ping me and I'll try them asap. But until then, let's keep AIX with an arena similar to Linux; it will be safer.

@mknyszek (Contributor) commented

@Helflym Would you consider this bug fixed for now? We can open a new issue if we see a need for 60-bit addresses again.

@Helflym (Author) commented Nov 25, 2019

@mknyszek yes I think we can close this issue for now.

Closed by 5042317

@Helflym closed this as completed Nov 25, 2019
gopherbot pushed a commit that referenced this issue Dec 3, 2019
Currently the page allocator bitmap is implemented as a single giant
memory mapping which is reserved at init time and committed as needed.
This causes problems on systems that don't handle large uncommitted
mappings well, or institute low virtual address space defaults as a
memory limiting mechanism.

This change modifies the implementation of the page allocator bitmap
away from a directly-mapped set of bytes to a sparse array in the same
vein as mheap.arenas. This will hurt performance a little but the biggest
as mheap.arenas. This will hurt performance a little but the biggest
gains are from the lockless allocation possible with the page allocator,
so the impact of this extra layer of indirection should be minimal.

In fact, this is exactly what we see:
    https://perf.golang.org/search?q=upload:20191125.5

This reduces the amount of mapped (PROT_NONE) memory needed on systems
with 48-bit address spaces to ~600 MiB down from almost 9 GiB. The bulk
of this remaining memory is used by the summaries.

Go processes with 32-bit address spaces now always commit to 128 KiB of
memory for the bitmap. Previously it would only commit the pages in the
bitmap which represented the range of addresses (lowest address to
highest address, even if there are unused regions in that range) used by
the heap.

Updates #35568.
Updates #35451.

Change-Id: I0ff10380156568642b80c366001eefd0a4e6c762
Reviewed-on: https://go-review.googlesource.com/c/go/+/207497
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
gopherbot pushed a commit that referenced this issue Dec 9, 2019
AIX doesn't allow mmap on an already-mmapped address. The previous way
to deal with this behavior was to munmap before calling mmap again.
However, the mprotect syscall can change protections on a memory
range, so memory mapped by sysReserve can be remapped using it. Note
that sysMap is always called with a non-nil pointer, so mprotect is
always possible.

Updates: #35451

Change-Id: I1fd1e1363d9ed9eb5a8aa7c8242549bd6dad8cd0
Reviewed-on: https://go-review.googlesource.com/c/go/+/207237
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
@golang locked and limited conversation to collaborators Nov 24, 2020