runtime: TestGoexitCrash failure on linux-ppc64le-buildlet #34575

Closed · bcmills opened this issue Sep 27, 2019 · 12 comments
@bcmills (Contributor) commented Sep 27, 2019

Observed on the linux-ppc64le-buildlet builder (https://build.golang.org/log/b524842fe441d0c1d47adad4cde878daed7bfc76):

--- FAIL: TestGoexitCrash (60.00s)
    crash_test.go:95: testprog GoexitExit exit status: exit status 2
    crash_test.go:291: output:
        t1
        t2
        t3
        t4
        SIGQUIT: quit
        PC=0x68570 m=0 sigcode=0
        
        goroutine 6 [syscall]:
        runtime.notetsleepg(0x1d8d40, 0xc34fff89f, 0x1415)
        	/tmp/workdir-host-linux-ppc64le-osu/go/src/runtime/lock_futex.go:227 +0x38 fp=0xc000036730 sp=0xc0000366f0 pc=0x198e8
        runtime.timerproc(0x1d8d20)
        	/tmp/workdir-host-linux-ppc64le-osu/go/src/runtime/time.go:308 +0x2f4 fp=0xc0000367b8 sp=0xc000036730 pc=0x59774
        runtime.goexit()
        	/tmp/workdir-host-linux-ppc64le-osu/go/src/runtime/asm_ppc64x.s:884 +0x4 fp=0xc0000367b8 sp=0xc0000367b8 pc=0x67a44
        created by runtime.(*timersBucket).addtimerLocked
        	/tmp/workdir-host-linux-ppc64le-osu/go/src/runtime/time.go:166 +0x13c
        
        r0   0xdd	r1   0xc000036648
        r2   0x2f2ec	r3   0x4
        r4   0x80	r5   0x0
        r6   0xc000036698	r7   0x0
        r8   0x0	r9   0x0
        r10  0x0	r11  0x0
        r12  0x0	r13  0x1ddac8
        r14  0x196d4	r15  0xc000036678
        r16  0x1	r17  0xffffffffb150046b
        r18  0xb150046b	r19  0x127474
        r20  0xc00008e020	r21  0x1d6a40
        r22  0x0	r23  0x39
        r24  0x10	r25  0x8
        r26  0x0	r27  0x0
        r28  0x0	r29  0x0
        r30  0xc00005e900	r31  0x130000
        pc   0x68570	ctr  0x0
        link 0x39094	xer  0x0
        ccr  0x54400002	trap 0xc00
        
        
        want output containing: no goroutines (main called runtime.Goexit) - deadlock!
FAIL
FAIL	runtime	109.964s

CC @danscales @mknyszek; see previously #31966

bcmills added the NeedsInvestigation label Sep 27, 2019
bcmills added this to the Go1.14 milestone Sep 27, 2019
rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
@bcmills (Contributor) commented Oct 21, 2019

@danscales @mknyszek @aclements: could someone from the runtime team take a look at this failure and at least assess whether it's something we should fix in 1.14?

bcmills modified the milestones: Backlog, Go1.14 Oct 21, 2019
@danscales (Contributor) commented

I will take a look and see if I can reproduce it or make a guess as to what might be happening.

@danscales (Contributor) commented Oct 24, 2019

I can reproduce this roughly once every few runs of the following command on the ppc64-linux-buildlet (each run executes the test 500 times):

gomote run user-danscales-linux-ppc64-buildlet-0 go/bin/go test runtime -test.count=500 -test.run TestGoexitCrash

When it fails, the test hangs: the normal deadlock detection (no goroutines and no main goroutine, because main called Goexit) somehow doesn't fire, and a SIGQUIT is forced after 60 seconds.

The same test command run locally on amd64 never fails.

The test program (which is supposed to trigger the deadlock throw once all other goroutines exit and main calls Goexit) is:

func GoexitExit() {
	println("t1")
	go func() {
		time.Sleep(time.Millisecond)
	}()
	i := 0
	println("t2")
	runtime.SetFinalizer(&i, func(p *int) {})
	println("t3")
	runtime.GC()
	println("t4")
	runtime.Goexit()
}

I can still reproduce the problem if I comment out the i := 0 line and the SetFinalizer line. However, I can't reproduce if I comment out the runtime.GC line.

I'll keep investigating and check whether it is present in 1.13. However, it seems unlikely that this has to be fixed for 1.14, since it is so rare and only happens when every goroutine exits and main calls Goexit (which is most likely a programming mistake).

@danscales (Contributor) commented

OK, it definitely also happens in go1.13. I haven't been able to reproduce in go1.12 yet, so it might be a slight regression in the go1.13 timeframe.

bcmills modified the milestones: Go1.14, Backlog Oct 24, 2019
@aclements (Member) commented Oct 25, 2019 via email

@danscales (Contributor) commented

It turns out that the problem is present in the latest go1.13 release but not in the very first 1.13 releases from August, so I ran a git bisect. The change it came up with (not saying this is definitive at all) was:

runtime: redefine scavenge goal in terms of heap_inuse [mknyszek]

So, good guess that it might be GC/scavenger-related. Also, as I mentioned, it isn't reproducible at any commit if runtime.GC() is removed. Not sure why this would show up only on ppc64.

Will update further when I get a chance to try out GOGC=20.

@mknyszek (Contributor) commented

@danscales thanks for the bisection, that helps a lot.

I printed out some diagnostic information and ran the program until it hung.

I found that in the cases where it hung, the scavenger had turned on but consistently found no work to do, and thus fell into its exponential back-off case. Since the program is no longer making progress at that point, the scavenger will never get memory to scavenge, so it sits there for all eternity, and checkdead never fires because it thinks the scavenger is still alive via a live timer.
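
For reference, here is a minimal standalone sketch (not the testprog itself) of the checkdead behavior being described: a pending timer is enough to keep checkdead from throwing, and a timer that keeps re-arming itself postpones the throw indefinitely.

package main

import (
	"runtime"
	"time"
)

func main() {
	// The pending 1 ms timer postpones the deadlock throw; a timer that
	// keeps re-arming itself (like the backed-off scavenger's) would
	// postpone it forever.
	go func() {
		time.Sleep(time.Millisecond)
	}()
	// Once the sleeping goroutine exits, no goroutines remain and main has
	// called Goexit, so the runtime throws:
	//   fatal error: no goroutines (main called runtime.Goexit) - deadlock!
	runtime.Goexit()
}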

There are indeed cases where the scavenger turned on but there was no hang, because it found memory to scavenge and achieved its goal, thereby turning off and allowing checkdead to throw. Most of the time it just never turns on at all, which is why this is hard to reproduce.

The reason removing runtime.GC() avoids the hang is that the scavenger is always off for the first GC cycle; the explicit GC is what gives the scavenger a chance to turn on at all.

So now the question is: why does it turn on if there's no work to do? In this case, heap_sys - heap_released (the scavenger's retained-heap-memory metric) is greater than the goal it sets for itself, which, as of the CL @danscales mentioned, is based on heap_inuse at the end of the last GC multiplied by the expected heap growth (next_gc / last_next_gc).
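
A rough sketch of that pacing decision, using hypothetical names rather than the actual runtime fields:

// retained = heap_sys - heap_released is what the scavenger treats as still
// held from the OS. The goal scales heap_inuse at the end of the last GC by
// the expected heap growth (next_gc / last_next_gc); the background scavenger
// keeps running while retained exceeds the goal.
func scavengerShouldRun(heapSys, heapReleased, lastHeapInuse, nextGC, lastNextGC uint64) bool {
	retained := heapSys - heapReleased
	goal := uint64(float64(lastHeapInuse) * float64(nextGC) / float64(lastNextGC))
	return retained > goal
}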

My first guess was that, since last_heap_inuse is slightly stale, it could be skewed from the actual heap_inuse, making it look as though the whole heap were either in use or scavenged. But that guess was wrong: when I print heap_inuse at the last scavenger pacing change before the hang, it is identical in value to last_heap_inuse.

There's a 5-page (40960-byte) discrepancy between heap_sys - heap_released and heap_inuse. In theory, heap_sys is all mapped memory, heap_released is all freed and scavenged memory, and heap_inuse is all in-use memory. Therefore those 40960 bytes should be free and unscavenged memory, but the scavenger can't find them. If it could, it would scavenge some portion of them and the application would exit quickly (as in the non-hanging scavenger-on cases, where 40960 bytes is the exact discrepancy).
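
(For reference: 40960 bytes is 5 × 8192 bytes, assuming the runtime's usual 8 KiB internal page size, and it is also less than one 64 KiB physical page on this machine, which turns out to matter below.)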

This indicates either a bug in the scavenging/treap code, or that those 5 pages are not being accounted for correctly. I'll dig further.

@mknyszek (Contributor) commented

OK, it appears to be neither of the things I thought it would be.

So it turns out there's a chance the computed rate ends up as +Inf, which means that "retained-want" (in the scavenger's pacing calculation) becomes a nonsense number.

This is the crux of the problem. The reason we see this on ppc64 and not on other platforms is the larger system page size of 64 KiB: with small heaps it's more likely that there is less than a single page of work, so we end up in a situation where the scavenger should scavenge something (according to one set of sane pacing parameters) but, because of the nonsense number, always calculates that it is ahead of schedule, and it gets stuck in a loop.

The fix is easy: never let the nonsense number happen, either by always rounding up to one physical page worth of work, or by turning off the scavenger when there isn't at least one physical page worth of work. Either way we avoid the divide by zero that causes the +Inf, and the nonsense number later on.
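
A minimal sketch of that failure mode and the guard, with hypothetical names and a simplified per-page rate rather than the runtime's actual pacing code:

package main

import "fmt"

// pacePerPage models the problem: if the outstanding scavenge work rounds
// down to zero physical pages, the per-page pacing ratio divides by zero,
// becomes +Inf, and the scavenger then always believes it is ahead of
// schedule.
func pacePerPage(workBytes, physPageSize uint64, timeLeftNS float64) (nsPerPage float64, on bool) {
	pages := workBytes / physPageSize
	if pages == 0 {
		// The guard described above: with less than one physical page of
		// work, turn the scavenger off (or round up to one page) so the
		// division below can never produce +Inf.
		return 0, false
	}
	return timeLeftNS / float64(pages), true
}

func main() {
	// 40960 bytes of work is 10 physical pages with the 4 KiB pages typical
	// on linux/amd64, but rounds down to 0 pages on linux/ppc64le, whose
	// physical page size is 64 KiB.
	fmt.Println(pacePerPage(40960, 4096, 1e9))  // 1e+08 true
	fmt.Println(pacePerPage(40960, 65536, 1e9)) // 0 false
}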

This is not a performance problem (or anything else) in real running applications, because the nonsense could dissipate in the following GC cycle, or the scavenger will harmlessly back off and do nothing (when there's really so little work to do that it doesn't matter). It is a problem for deadlock detection, though, which is useful for teaching, so I'll put up a fix.

Should we backport this?

@mknyszek (Contributor) commented

Uploaded a fix for this. With the patch, I can run the test 100,000 consecutive times on linux/ppc64 without issue.

@gopherbot commented

Change https://golang.org/cl/203517 mentions this issue: runtime: turn off scavenger when there's <1 physical page of work

@mknyszek (Contributor) commented

Moved to the Go 1.14 milestone just because the fix is already up for review, is small, and close to landing.

mknyszek self-assigned this Oct 29, 2019
mknyszek removed the NeedsInvestigation label Oct 29, 2019
golang locked and limited conversation to collaborators Oct 31, 2020