Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: go runs into a deadlock where a panicing application never acquires the debuglock due to non-atomic unlock #54786

Open
a-milkyway opened this issue Aug 31, 2022 · 7 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@a-milkyway
Copy link

a-milkyway commented Aug 31, 2022

What version of Go are you using (go version)?

$ go version
1.17.6

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
Linux 4.14.35-2047.514.5.el7uek.x86_64 #2 SMP Mon Jun 13 21:57:28 PDT 2022 x86_64 x86_64 x86_64 GNU/Linux

What did you do?

An application (gobgpd) encountered a panic and in the process of dumping stack traces go runtime tried to acquire a printLock which never became available preventing the application from a restart and causing application redundancy to fail.

What did you expect to see?

A panic should've completed in about a minute or so and application should've restarted.

What did you see instead?

The panic took six hours.

The runtime printlock function has an additional mp.locks increment/decrement operation to make the operation atomic but no such provision exists during printunlock. If the scheduler pre-empts the printunlock after decrementing printlock but before actually unlocking, this could cause this issue.

Here is the snippet:

func printlock() {
    mp := getg().m
    mp.locks++ // do not reschedule between printlock++ and lock(&debuglock).
    mp.printlock++
    if mp.printlock == 1 {
        lock(&debuglock)
    }
    mp.locks-- // now we know debuglock is held and holding up mp.locks for us.
}

func printunlock() {
    mp := getg().m
    mp.printlock--
    if mp.printlock == 0 {
        unlock(&debuglock)
    }
}
@seankhliao
Copy link
Member

how do you know it was the panic that took 6 hours?

@bcmills bcmills added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Aug 31, 2022
@a-milkyway
Copy link
Author

a-milkyway commented Aug 31, 2022

I did a pstack on the gobgpd from time to time during those six hours where no application processing was observed but go runtime was in garbage collection, here is what i saw. (i can probably copy other threads data as well, but for conciseness, providing this particular thread). This thread never made any progress before signal 11.

Thread 26 (Thread 0x7f3ea1ffb700 (LWP 57758)):
#0 runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:520
#1 0x0000000000432f16 in runtime.futexsleep (addr=0xfffffffffffffe00, val=2, ns=4636867) at /usr/local/go/src/runtime/os_linux.go:44
#2 0x000000000040cf99 in runtime.lock2 (l=0xfffffffffffffe00) at /usr/local/go/src/runtime/lock_futex.go:107
---Type to continue, or q to quit---
#3 0x0000000000438805 in runtime.lockWithRank (l=, rank=) at /usr/local/go/src/runtime/lockrank_off.go:23
#4 runtime.lock (l=) at /usr/local/go/src/runtime/lock_futex.go:48
#5 runtime.printlock () at /usr/local/go/src/runtime/print.go:71
#6 0x0000000000437808 in runtime.dopanic_m (gp=0xc0011e71e0, pc=4419537, sp=139906982587536) at /usr/local/go/src/runtime/panic.go:1389
#7 0x0000000000437288 in runtime.fatalthrow.func1 () at /usr/local/go/src/runtime/panic.go:1253
#8 0x0000000000437210 in runtime.fatalthrow () at /usr/local/go/src/runtime/panic.go:1250
#9 0x0000000000436fd1 in runtime.throw (s=...) at /usr/local/go/src/runtime/panic.go:1198
#10 0x00000000004157f0 in runtime.badPointer (s=0x7f3ee83fb978, p=824701829088, refBase=824640469408, refOff=40) at /usr/local/go/src/runtime/mbitmap.go:367
#11 0x0000000000415986 in runtime.findObject (p=, refBase=, refOff=) at /usr/local/go/src/runtime/mbitmap.go:409
#12 0x0000000000420310 in runtime.scanobject (b=, gcw=0xc000076e98) at /usr/local/go/src/runtime/mgcmark.go:1280
#13 0x000000000041fb9a in runtime.gcDrain (gcw=0xc000076e98, flags=7) at /usr/local/go/src/runtime/mgcmark.go:1047
#14 0x000000000041cd6e in runtime.gcBgMarkWorker.func2 () at /usr/local/go/src/runtime/mgc.go:1288
#15 0x00000000004680e9 in runtime.systemstack () at /usr/local/go/src/runtime/asm_amd64.s:383
#16 0x0000000000000000 in ?? ()

Eventually it crashed with a signal 11, but this above thread was still in the same state never acquiring debuglock.

@a-milkyway a-milkyway changed the title affected/package: runtime affected/package: go runtime runs into a deadlock where a panicing application never acquires the debuglock due to non-atomic unlock Aug 31, 2022
@hopehook hopehook changed the title affected/package: go runtime runs into a deadlock where a panicing application never acquires the debuglock due to non-atomic unlock runtime: go runs into a deadlock where a panicing application never acquires the debuglock due to non-atomic unlock Sep 1, 2022
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Sep 1, 2022
@hopehook hopehook added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. and removed WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. labels Sep 1, 2022
@hopehook
Copy link
Member

hopehook commented Sep 1, 2022

@a-milkyway Thanks for the report, can you provide the complete stack information. Also, did you use cgo?
Github can fold content, don't worry about too much content. Thanks again.

@gopherbot
Copy link

Change https://go.dev/cl/427414 mentions this issue: runtime: fix printlock/printunlock deadlock

@seankhliao seankhliao added this to the Go1.20 milestone Sep 3, 2022
@mknyszek mknyszek modified the milestones: Go1.20, Backlog Sep 7, 2022
@mknyszek mknyszek self-assigned this Sep 7, 2022
@a-milkyway
Copy link
Author

@hopehook apologize for the delayed response. we use cgo. i lost the complete stack trace from my local machine, will try to see if it is stored somewhere and will provide it as soon as i get it.

@hopehook
Copy link
Member

hopehook commented Sep 7, 2022

@a-milkyway Welcome back and your local complete stack trace can be very important. As you said the program has been suspended for so long, the situation should be more complicated.

I suspect cgo is the most insecure factor, it may not be handled well.

@a-milkyway
Copy link
Author

Hi @hopehook I have a complete crashdump from a recent incident. I am attaching it here. Thank you so much!
gobgpd.gz
coredump.gz
core_backtrace.gz

@mknyszek mknyszek removed their assignment Jun 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
Development

Successfully merging a pull request may close this issue.

6 participants