Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: TestArenaCollision has many "out of memory" failures on linux-ppc64le-power9osu #35514

Closed
bcmills opened this issue Nov 11, 2019 · 8 comments
Labels
arch-ppc64x Builders x/build issues (builders, bots, dashboards) FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@bcmills bcmills added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Nov 11, 2019
@bcmills bcmills added this to the Go1.14 milestone Nov 11, 2019
@bcmills bcmills added arch-ppc64x Builders x/build issues (builders, bots, dashboards) labels Nov 11, 2019
@mknyszek mknyszek self-assigned this Nov 12, 2019
@mknyszek
Copy link
Contributor

I'm having a very difficult time reproducing this. 10000 consecutive runs and nothing.

The test's failure should be independent of the full runtime test since it spins up a new process for it, so I'm led to believe that this is actually just the machine saying it's out of memory (and not the test doing something weird/wrong in the face of new changes).

With that being said, my next best guess is that the machine has swap enabled and it's trying to reserve swap space for all the PROT_NONE mapped memory (maybe something overcommit related; not unheard of, though a first for Go 1.14 on Linux if true).

@laboger
Copy link
Contributor

laboger commented Nov 12, 2019

I noticed this the other day and I am not able to reproduce the problem either on any of our systems.

@mengzhuo
Copy link
Contributor

Also affected linux/arm64

=== RUN   TestArenaCollision
    TestArenaCollision: malloc_test.go:233: === RUN   TestArenaCollision
        fatal error: runtime: out of memory
        
        runtime stack:
        runtime.throw(0x267eff, 0x16)
        	/root/godev/src/runtime/panic.go:1106 +0x54 fp=0xfffd577fd610 sp=0xfffd577fd5e0 pc=0x439ac
        runtime.sysMap(0xfffd75ca9000, 0x1fb7a7000, 0x460c98)
        	/root/godev/src/runtime/mem_linux.go:169 +0xbc fp=0xfffd577fd650 sp=0xfffd577fd610 pc=0x28e34
        runtime.extendMappedRegion(0xfffd714a8000, 0x800000, 0x4800800, 0x800000, 0x1fffa7880, 0x460c98)
        	/root/godev/src/runtime/mpagealloc.go:291 +0xd0 fp=0xfffd577fd690 sp=0xfffd577fd650 pc=0x38778
        runtime.(*pageAlloc).grow(0x4468e8, 0xfffd3c000000, 0x400000)
        	/root/godev/src/runtime/mpagealloc.go:351 +0xec fp=0xfffd577fd6f0 sp=0xfffd577fd690 pc=0x38934
        runtime.(*mheap).grow(0x4468e0, 0x80, 0x0)
        	/root/godev/src/runtime/mheap.go:1321 +0x8c fp=0xfffd577fd760 sp=0xfffd577fd6f0 pc=0x36b34
        runtime.(*mheap).allocSpan(0x4468e0, 0x80, 0x100, 0x460c48, 0x0)
        	/root/godev/src/runtime/mheap.go:1120 +0x5f8 fp=0xfffd577fd7e0 sp=0xfffd577fd760 pc=0x36950
        runtime.(*mheap).alloc.func1()
        	/root/godev/src/runtime/mheap.go:867 +0x58 fp=0xfffd577fd840 sp=0xfffd577fd7e0 pc=0x72aa0
        runtime.(*mheap).alloc(0x4468e0, 0x80, 0x4000680101, 0xfffd56d9e648)
        	/root/godev/src/runtime/mheap.go:861 +0x64 fp=0xfffd577fd890 sp=0xfffd577fd840 pc=0x35f8c
        runtime.largeAlloc(0x100000, 0x70101, 0x24003f00000)
        	/root/godev/src/runtime/malloc.go:1152 +0x84 fp=0xfffd577fd8e0 sp=0xfffd577fd890 pc=0x1dc7c
        runtime.mallocgc.func1()
        	/root/godev/src/runtime/malloc.go:1047 +0x44 fp=0xfffd577fd910 sp=0xfffd577fd8e0 pc=0x719bc
        runtime.systemstack(0xfffd577fd950)
        	/root/godev/src/runtime/asm_arm64.s:237 +0xa0 fp=0xfffd577fd920 sp=0xfffd577fd910 pc=0x75968
        runtime.mstart()
        	/root/godev/src/runtime/proc.go:1077 fp=0xfffd577fd920 sp=0xfffd577fd920 pc=0x48788
        
        goroutine 19 [running]:
        runtime.systemstack_switch()
        	/root/godev/src/runtime/asm_arm64.s:182 +0x8 fp=0x400004dcd0 sp=0x400004dcc0 pc=0x758b0
        runtime.mallocgc(0x100000, 0x21f7a0, 0x4000592001, 0x24003f00000)
        	/root/godev/src/runtime/malloc.go:1046 +0x74c fp=0x400004dd90 sp=0x400004dcd0 pc=0x1dad4
        runtime.newobject(0x21f7a0, 0x24003f00000)
        	/root/godev/src/runtime/malloc.go:1165 +0x38 fp=0x400004ddc0 sp=0x400004dd90 pc=0x1dda0
        runtime_test.TestArenaCollision(0x40000e8120)
        	/root/godev/src/runtime/malloc_test.go:252 +0x110 fp=0x400004df70 sp=0x400004ddc0 pc=0x185718
        testing.tRunner(0x40000e8120, 0x276188)
        	/root/godev/src/testing/testing.go:954 +0xdc fp=0x400004dfc0 sp=0x400004df70 pc=0xe5444
        runtime.goexit()
        	/root/godev/src/runtime/asm_arm64.s:1148 +0x4 fp=0x400004dfc0 sp=0x400004dfc0 pc=0x7823c
        created by testing.(*T).Run
        	/root/godev/src/testing/testing.go:1005 +0x2c4
        
        goroutine 1 [chan receive, locked to thread]:
        runtime.gopark(0x275230, 0x400008e298, 0x40000e170e, 0x2)
        	/root/godev/src/runtime/proc.go:304 +0xd0 fp=0x40000e1ac0 sp=0x40000e1aa0 pc=0x46168
        runtime.chanrecv(0x400008e240, 0x40000e1bd8, 0x4000000101, 0xe578c)
        	/root/godev/src/runtime/chan.go:563 +0x2c4 fp=0x40000e1b50 sp=0x40000e1ac0 pc=0x174ac
        runtime.chanrecv1(0x400008e240, 0x40000e1bd8)
        	/root/godev/src/runtime/chan.go:433 +0x28 fp=0x40000e1b80 sp=0x40000e1b50 pc=0x17190
        testing.(*T).Run(0x40000e8120, 0x265364, 0x12, 0x276188, 0x5dccbc01)
        	/root/godev/src/testing/testing.go:1006 +0x2e0 fp=0x40000e1c30 sp=0x40000e1b80 pc=0xe57a8
        testing.runTests.func1(0x40000e8000)
        	/root/godev/src/testing/testing.go:1247 +0x74 fp=0x40000e1c80 sp=0x40000e1c30 pc=0xe8f9c
        testing.tRunner(0x40000e8000, 0x40000e1d88)
        	/root/godev/src/testing/testing.go:954 +0xdc fp=0x40000e1cd0 sp=0x40000e1c80 pc=0xe5444
        testing.runTests(0x40000a6080, 0x43dc80, 0x147, 0x147, 0x0)
        	/root/godev/src/testing/testing.go:1245 +0x26c fp=0x40000e1db0 sp=0x40000e1cd0 pc=0xe6a94
        testing.(*M).Run(0x40000d4000, 0x0)
        	/root/godev/src/testing/testing.go:1162 +0x14c fp=0x40000e1eb0 sp=0x40000e1db0 pc=0xe5cc4
        runtime_test.TestMain(0x40000d4000)
        	/root/godev/src/runtime/crash_test.go:28 +0x28 fp=0x40000e1f00 sp=0x40000e1eb0 pc=0x16c400
        main.main()
        	_testmain.go:1170 +0x15c fp=0x40000e1f70 sp=0x40000e1f00 pc=0x1dc694
        runtime.main()
        	/root/godev/src/runtime/proc.go:203 +0x234 fp=0x40000e1fd0 sp=0x40000e1f70 pc=0x45d6c
        runtime.goexit()
        	/root/godev/src/runtime/asm_arm64.s:1148 +0x4 fp=0x40000e1fd0 sp=0x40000e1fd0 pc=0x7823c
        
        goroutine 2 [force gc (idle)]:
        runtime.gopark(0x275520, 0x4408f0, 0x1411, 0x1)
        	/root/godev/src/runtime/proc.go:304 +0xd0 fp=0x400003afa0 sp=0x400003af80 pc=0x46168
        runtime.goparkunlock(...)
        	/root/godev/src/runtime/proc.go:310
        runtime.forcegchelper()
        	/root/godev/src/runtime/proc.go:253 +0xc0 fp=0x400003afd0 sp=0x400003afa0 pc=0x46018
        runtime.goexit()
        	/root/godev/src/runtime/asm_arm64.s:1148 +0x4 fp=0x400003afd0 sp=0x400003afd0 pc=0x7823c
        created by runtime.init.5
        	/root/godev/src/runtime/proc.go:242 +0x30
        
        goroutine 3 [GC sweep wait]:
        runtime.gopark(0x275520, 0x440d20, 0x140c, 0x1)
        	/root/godev/src/runtime/proc.go:304 +0xd0 fp=0x400003b7a0 sp=0x400003b780 pc=0x46168
        runtime.goparkunlock(...)
        	/root/godev/src/runtime/proc.go:310
        runtime.bgsweep(0x4000062000)
        	/root/godev/src/runtime/mgcsweep.go:89 +0x180 fp=0x400003b7d0 sp=0x400003b7a0 pc=0x332e8
        runtime.goexit()
        	/root/godev/src/runtime/asm_arm64.s:1148 +0x4 fp=0x400003b7d0 sp=0x400003b7d0 pc=0x7823c
        created by runtime.gcenable
        	/root/godev/src/runtime/mgc.go:214 +0x54
        
        goroutine 4 [GC scavenge wait]:
        runtime.gopark(0x275520, 0x440ce0, 0x140d, 0x1)
        	/root/godev/src/runtime/proc.go:304 +0xd0 fp=0x400003bf40 sp=0x400003bf20 pc=0x46168
        runtime.goparkunlock(...)
        	/root/godev/src/runtime/proc.go:310
        runtime.bgscavenge(0x4000062000)
        	/root/godev/src/runtime/mgcscavenge.go:274 +0x364 fp=0x400003bfd0 sp=0x400003bf40 pc=0x31f1c
        runtime.goexit()
        	/root/godev/src/runtime/asm_arm64.s:1148 +0x4 fp=0x400003bfd0 sp=0x400003bfd0 pc=0x7823c
        created by runtime.gcenable
        	/root/godev/src/runtime/mgc.go:215 +0x74
        
        goroutine 18 [finalizer wait]:
        runtime.gopark(0x275520, 0x45d128, 0x1410, 0x1)
        	/root/godev/src/runtime/proc.go:304 +0xd0 fp=0x4000036730 sp=0x4000036710 pc=0x46168
        runtime.goparkunlock(...)
        	/root/godev/src/runtime/proc.go:310
        runtime.runfinq()
        	/root/godev/src/runtime/mfinal.go:175 +0xbc fp=0x40000367d0 sp=0x4000036730 pc=0x29304
        runtime.goexit()
        	/root/godev/src/runtime/asm_arm64.s:1148 +0x4 fp=0x40000367d0 sp=0x40000367d0 pc=0x7823c
        created by runtime.createfing
        	/root/godev/src/runtime/mfinal.go:156 +0x78
        
        goroutine 5 [GC worker (idle)]:
        runtime.gopark(0x275338, 0x40000ae3d0, 0x1418, 0x0)
        	/root/godev/src/runtime/proc.go:304 +0xd0 fp=0x400003a750 sp=0x400003a730 pc=0x46168
        runtime.gcBgMarkWorker(0x4000022000)
        	/root/godev/src/runtime/mgc.go:1874 +0xe8 fp=0x400003a7d0 sp=0x400003a750 pc=0x2cc50
        runtime.goexit()
        	/root/godev/src/runtime/asm_arm64.s:1148 +0x4 fp=0x400003a7d0 sp=0x400003a7d0 pc=0x7823c
        created by runtime.gcBgMarkStartWorkers
        	/root/godev/src/runtime/mgc.go:1822 +0x80
        
        goroutine 20 [runnable]:
        runtime.gcMarkDone()
        	/root/godev/src/runtime/mgc.go:1424 +0x304 fp=0x4000037750 sp=0x4000037750 pc=0x2bf6c
        runtime.gcBgMarkWorker(0x4000024800)
        	/root/godev/src/runtime/mgc.go:2001 +0x2e4 fp=0x40000377d0 sp=0x4000037750 pc=0x2ce4c
        runtime.goexit()
        	/root/godev/src/runtime/asm_arm64.s:1148 +0x4 fp=0x40000377d0 sp=0x40000377d0 pc=0x7823c
        created by runtime.gcBgMarkStartWorkers
        	/root/godev/src/runtime/mgc.go:1822 +0x80
        
        goroutine 21 [GC worker (idle)]:
        runtime.gopark(0x275338, 0x40000ae3f0, 0x1418, 0x0)
        	/root/godev/src/runtime/proc.go:304 +0xd0 fp=0x4000037f50 sp=0x4000037f30 pc=0x46168
        runtime.gcBgMarkWorker(0x4000027000)
        	/root/godev/src/runtime/mgc.go:1874 +0xe8 fp=0x4000037fd0 sp=0x4000037f50 pc=0x2cc50
        runtime.goexit()
        	/root/godev/src/runtime/asm_arm64.s:1148 +0x4 fp=0x4000037fd0 sp=0x4000037fd0 pc=0x7823c
        created by runtime.gcBgMarkStartWorkers
        	/root/godev/src/runtime/mgc.go:1822 +0x80
        
        goroutine 34 [GC worker (idle)]:
        runtime.gopark(0x275338, 0x40000ae400, 0x1418, 0x0)
        	/root/godev/src/runtime/proc.go:304 +0xd0 fp=0x4000506750 sp=0x4000506730 pc=0x46168
        runtime.gcBgMarkWorker(0x4000029800)
        	/root/godev/src/runtime/mgc.go:1874 +0xe8 fp=0x40005067d0 sp=0x4000506750 pc=0x2cc50
        runtime.goexit()
        	/root/godev/src/runtime/asm_arm64.s:1148 +0x4 fp=0x40005067d0 sp=0x40005067d0 pc=0x7823c
        created by runtime.gcBgMarkStartWorkers
        	/root/godev/src/runtime/mgc.go:1822 +0x80
        
        goroutine 6 [GC worker (idle)]:
        runtime.gopark(0x275338, 0x40000ae410, 0x1418, 0x0)
        	/root/godev/src/runtime/proc.go:304 +0xd0 fp=0x400003c750 sp=0x400003c730 pc=0x46168
        runtime.gcBgMarkWorker(0x400002c000)
        	/root/godev/src/runtime/mgc.go:1874 +0xe8 fp=0x400003c7d0 sp=0x400003c750 pc=0x2cc50
        runtime.goexit()
        	/root/godev/src/runtime/asm_arm64.s:1148 +0x4 fp=0x400003c7d0 sp=0x400003c7d0 pc=0x7823c
        created by runtime.gcBgMarkStartWorkers
        	/root/godev/src/runtime/mgc.go:1822 +0x80
        
        goroutine 22 [GC worker (idle)]:
        runtime.gopark(0x275338, 0x4000600000, 0x1418, 0x0)
        	/root/godev/src/runtime/proc.go:304 +0xd0 fp=0x4000038750 sp=0x4000038730 pc=0x46168
        runtime.gcBgMarkWorker(0x400002e800)
        	/root/godev/src/runtime/mgc.go:1874 +0xe8 fp=0x40000387d0 sp=0x4000038750 pc=0x2cc50
        runtime.goexit()
        	/root/godev/src/runtime/asm_arm64.s:1148 +0x4 fp=0x40000387d0 sp=0x40000387d0 pc=0x7823c
        created by runtime.gcBgMarkStartWorkers
        	/root/godev/src/runtime/mgc.go:1822 +0x80
        
        goroutine 35 [GC worker (idle)]:
        runtime.gopark(0x275338, 0x4000600010, 0x1418, 0x0)
        	/root/godev/src/runtime/proc.go:304 +0xd0 fp=0x4000506f50 sp=0x4000506f30 pc=0x46168
        runtime.gcBgMarkWorker(0x4000031000)
        	/root/godev/src/runtime/mgc.go:1874 +0xe8 fp=0x4000506fd0 sp=0x4000506f50 pc=0x2cc50
        runtime.goexit()
        	/root/godev/src/runtime/asm_arm64.s:1148 +0x4 fp=0x4000506fd0 sp=0x4000506fd0 pc=0x7823c
        created by runtime.gcBgMarkStartWorkers
        	/root/godev/src/runtime/mgc.go:1822 +0x80
        
        goroutine 50 [GC worker (idle)]:
        runtime.gopark(0x275338, 0x4000600020, 0x1418, 0x0)
        	/root/godev/src/runtime/proc.go:304 +0xd0 fp=0x4000502750 sp=0x4000502730 pc=0x46168
        runtime.gcBgMarkWorker(0x4000033800)
        	/root/godev/src/runtime/mgc.go:1874 +0xe8 fp=0x40005027d0 sp=0x4000502750 pc=0x2cc50
        runtime.goexit()
        	/root/godev/src/runtime/asm_arm64.s:1148 +0x4 fp=0x40005027d0 sp=0x40005027d0 pc=0x7823c
        created by runtime.gcBgMarkStartWorkers
        	/root/godev/src/runtime/mgc.go:1822 +0x80
        
        (exit status exit status 2)
--- FAIL: TestArenaCollision (0.03s)
FAIL
exit status 1
FAIL	runtime	0.045s

@mknyszek
Copy link
Contributor

I suspect the problem is that TestArenaCollision makes the heap very discontiguous, and we make a contiguity assumption in the page allocator (so too much memory ends up mapped and the system can't handle it, if other processes are running; in essence, it's an overcommit issue). I'm working on a couple of patches which should hopefully fix this and a couple of other issues.

@gopherbot
Copy link

Change https://golang.org/cl/207758 mentions this issue: runtime: use inUse ranges to map in summary memory only as needed

@gopherbot
Copy link

Change https://golang.org/cl/207757 mentions this issue: runtime: track ranges of address space which are owned by the heap

@ianlancetaylor
Copy link
Contributor

This was happening fairly frequently but then it stopped. The most recent occurrence is

2019-11-19T15:30:58-580337e/linux-ppc64le-power9osu

I'm calling this fixed.

@mknyszek
Copy link
Contributor

mknyszek commented Dec 10, 2019

As a post-mortem, I suspect https://go-review.googlesource.com/c/go/+/207497 actually helped a lot here, since each L2 entry is only 1 MiB in size now, and the page bitmap had the biggest footprint.

https://go-review.googlesource.com/c/go/+/207758 reduces the footprint of the test by an order of magnitude, which should make concerns about a discontiguous address space go away for good with the new page allocator.

gopherbot pushed a commit that referenced this issue Dec 11, 2019
This change adds a new inUse field to the allocator which tracks ranges
of addresses that are owned by the heap. It is updated on each heap
growth.

These ranges are tracked in an array which is kept sorted. In practice
this array shouldn't exceed its initial allocation except in rare cases
and thus should be small (ideally exactly 1 element in size).

In a hypothetical worst-case scenario wherein we have a 1 TiB heap and 4
MiB arenas (note that the address ranges will never be at a smaller
granularity than an arena, since arenas are always allocated
contiguously), inUse would use at most 4 MiB of memory if the heap
mappings were completely discontiguous (highly unlikely) with an
additional 2 MiB leaked from previous allocations. Furthermore, the
copies that are done to keep the inUse array sorted will copy at most 4
MiB of memory in such a scenario, which, assuming a conservative copying
rate of 5 GiB/s, amounts to about 800µs.

However, note that in practice:
1) Most 64-bit platforms have 64 MiB arenas.
2) The copies should incur little-to-no page faults, meaning a copy rate
   closer to 25-50 GiB/s is expected.
3) Go heaps are almost always mostly contiguous.

Updates #35514.

Change-Id: I3ad07f1c2b5b9340acf59ecc3b9ae09e884814fe
Reviewed-on: https://go-review.googlesource.com/c/go/+/207757
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Reviewed-by: Austin Clements <austin@google.com>
gopherbot pushed a commit that referenced this issue Dec 11, 2019
Prior to this change, if the heap was very discontiguous (such as in
TestArenaCollision) it's possible we could map a large amount of memory
as R/W and commit it. We would use only the start and end to track what
should be mapped, and we would extend that mapping as needed to
accomodate a potentially fragmented address space.

After this change, we only map exactly the part of the summary arrays
that we need by using the inUse ranges from the previous change. This
reduces the GCSys footprint of TestArenaCollision from 300 MiB to 18
MiB.

Because summaries are no longer mapped contiguously, this means the
scavenger can no longer iterate directly. This change also updates the
scavenger to borrow ranges out of inUse and iterate over only the
parts of the heap which are actually currently in use. This is both an
optimization and necessary for correctness.

Fixes #35514.

Change-Id: I96bf0c73ed0d2d89a00202ece7b9d089a53bac90
Reviewed-on: https://go-review.googlesource.com/c/go/+/207758
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
@golang golang locked and limited conversation to collaborators Dec 9, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-ppc64x Builders x/build issues (builders, bots, dashboards) FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

6 participants