Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: incorrect frame information in traceback traversal may hang the process. #50772

Closed
abeaumont opened this issue Jan 24, 2022 · 17 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@abeaumont
Copy link

abeaumont commented Jan 24, 2022

What version of Go are you using (go version)?

go version go1.17.0 linux/arm64

Does this issue reproduce with the latest release?

It's been also been reproduced with go version go1.17.6 linux/arm64.

What operating system and processor architecture are you using (go env)?

Linux ARM64 (it could affect other architectures where sys.MinFrameSize > 0).

What did you do?

Run a go program with profiling enabled on Linux/ARM64.
This may eventually (it may take from minutes to days) cause the process to hang.
I have a binary and a core dump that showcases the problem (I can share those but they're rather big -the core dump is 4.2GB-) but we have been unable to reproduce it since with this binary.
We have also been able to reproduce in production with private code on linux/arm64 with version go1.17.6 (more on this later).
We have tried hard and failed to generate a small reproducible program.

What did you expect to see?

The program should generate the stack traces and keep running normally.

What did you see instead?

The program hangs. This seems to be caused by a combination of an infinite loop in runtime.gentraceback function while there's a stopTheWorld happening. As a result, the single running goroutine keeps running forever.

Here's the backtrace when the problem occurs:

gef➤  bt
#0  0x000000000046c52c in runtime.pcvalue (f=..., off=0x20830, targetpc=0x48108d, cache=0x401880e6f8, strict=0x1, ~r5=<optimized out>, ~r6=<optimized out>) at runtime/symtab.go:815
#1  0x000000000046ca60 in runtime.funcspdelta (f=..., targetpc=0x48108d, cache=0x401880e6f8, ~r3=<optimized out>) at runtime/symtab.go:957
#2  0x0000000000473c5c in runtime.gentraceback (pc0=0x456078, sp0=0xffc41457b580, lr0=0x0, gp=0x4000602b60, skip=0x0, pcbuf=0x401880e958, max=0x40, callback={void (runtime.stkframe *, void *, bool *)} 0x401880e930, v=0x0, flags=0x6, ~r10=<optimized out>) at runtime/traceback.go:207
#3  0x000000000045941c in runtime.sigprof (pc=0xffffab3f84b0, sp=0xffc41457b560, lr=0x482f7c, gp=0x4000602b60, mp=0x4000580c00) at runtime/proc.go:4755
#4  0x00000000004648a0 in runtime.sighandler (sig=0x1b, info=0x401880eda0, ctxt=0x401880ee20, gp=0x4000602b60) at runtime/signal_unix.go:542
#5  0x0000000000464058 in runtime.sigtrampgo (sig=0x1b, info=0x401880eda0, ctx=0x401880ee20) at runtime/signal_unix.go:460
#6  0x0000000000483114 in runtime.sigtramp () at runtime/sys_linux_arm64.s:458
#7  <signal handler called>
#8  0x0000ffffab3f84b0 in __kernel_clock_gettime ()
#9  0x0000000000482f7c in runtime.nanotime1 () at runtime/sys_linux_arm64.s:350
#10 0x000000000048108d in runtime.call2097152 () at runtime/asm_arm64.s:431
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Inside sigprof gentraceback is called twice, the first time with the PC and SP arguments in sigprof which fails, the second time with the vSDO PC and SP.
This seems to be the problem. If we take a look into the stack at the time:

gef➤  x/40gx 0xffc41457b560-0x8
0xffc41457b558: 0x0000ffc41457b5a8      0x000000000048108d
0xffc41457b568: 0x000000003178c8b9      0x000000000045a8d8
0xffc41457b578: 0x0000ffc41457b5a8      0x0000000000456078
0xffc41457b588: 0x0000000000000000      0x0000000000000000
0xffc41457b598: 0xffffffffffffffff      0x0000000000000000
0xffc41457b5a8: 0x0000ffc41457b5f8      0x0000000000454178
0xffc41457b5b8: 0x0000ffc41457b5f8      0x000000000045423c
0xffc41457b5c8: 0x0000ffc41457b628      0x000000000041f424
0xffc41457b5d8: 0x0010c763826714a8      0x000000000045425c
0xffc41457b5e8: 0x0010c763825a0ab9      0x0000004000001040
0xffc41457b5f8: 0x0000ffc41457b6e8      0x0000000000455dec
0xffc41457b608: 0x0000004000044000      0x0000000000000000
0xffc41457b618: 0x0000004000001040      0x000000000048108d
0xffc41457b628: 0x000000003178c90c      0x0000000000453570
0xffc41457b638: 0x0000ffc41457b668      0x00000000004509d8
0xffc41457b648: 0x0000000000000000      0x000000000048108d
0xffc41457b658: 0x000000003178cb6b      0x0010c763826714a8
0xffc41457b668: 0x0000ffc41457b698      0x0000000000456078
0xffc41457b678: 0x0000000000000000      0x0000000000000000
0xffc41457b688: 0x0000000000000001      0x0000004000580c00

We can see how the FP in both SP (0xffc41457b560) and vDSO SP (0xffc41457b580) point to the same next frame (0x0000ffc41457b5a8).
Following the FP would result in the correct stack trace. From SP:
PC=0x48108d (runtime.call2097152), SP=0xffc41457b560
PC=0x454178 (runtime.findrunnable), SP=0xffc41457b5b0
PC=0x455dec (runtime.schedule), SP=0xffc41457b600
PC=0x456540 (runtime.goschedImpl), SP=0xffc41457b6f0
PC=0x4566dc (runtime.gosched_m), SP=0xffc41457b750
PC=0x47fbb4 (runtime.mcall), SP=0xffc41457b790

From vDSO SP:
PC=0x456078 (runtime.checkTimers), SP=0xffc41457b580
PC=0x454178 (runtime.findrunnable), SP=0xffc41457b5b0
PC=0x455dec (runtime.schedule), SP=0xffc41457b600
PC=0x456540 (runtime.goschedImpl), SP=0xffc41457b6f0
PC=0x4566dc (runtime.gosched_m), SP=0xffc41457b750
PC=0x47fbb4 (runtime.mcall), SP=0xffc41457b790

Instead, funcspdelta is used which relays on pcsp to retrieve the stack frame size.
For PC=0x456078 this delta is 0x50, which is used to jump to PC=0x41f424 (runtime.notewakeup). I believe this is incorrect and the root cause of the bug since this results in an incorrect stack frame.

Looking at gentraceback's pcbuf contents we can see the following information:

gef➤  x/8gx 0x401880e958
0x401880e958:   0x0000000000456079      0x0000000000456071
0x401880e968:   0x0000000000456078      0x0000000000456071
0x401880e978:   0x000000000041f424      0x0000000000000000
0x401880e988:   0x0000000000000000      0x0000000000000000

I'm unsure why runtime.checkTimers appears 4 times before runtime.notewakeup, but it seems to confirm the stack trace corruption.
After runtime.notewakeup the next frame has PC=0x48108d (runtime.call2097152).
At this point, funcspdelta returns 0, so in the traceback traversal SP = FP from this point forward (in x86, AMD64 and WASM this FP is increased just afterward, and thus the FP = SP invariant doesn't hold there and there's no infinite loop in these architectures).

Additionally, when the frame being traversed is the frame of a wrapper function (funcID == funcID_wrapper), as it's in the case we are analyzing, the loop index n doesn't get incremented. Combining these two (SP = FP, n not incremented) results in an infinite loop.

This bug has also happened in a production instance with go version 1.17.6 (also on linux/ARM64). We modified the runtime to:

  • print some debugging information when we enter the infinite loop.
  • break the loop to prevent the process from hanging.

With these changes we have been able to confirm the pattern described above (but failed to create a small reproducible case):

#0: pc=0x533e9, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#1: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#2: pc=0x533e8, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#3: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#4: pc=0x1b054, name=runtime.notewakeup, entry=0x1aff0, nameoff=0x11da, args=8, deferredreturn=0x0, pcsp=0x2e18, pcfile=0x2e23, pcln=0x2e26, npcdata=0x2, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#5: pc=0x228e00, name=crypto/x509.(*SignatureAlgorithm).String, entry=0x228de0, nameoff=0x28e68, args=24, deferredreturn=0x0, pcsp=0x21753, pcfile=0x22d, pcln=0x230, npcdata=0x2, cuOffset=0x9cd, funcID=0x16, flag=0x0, nfuncdata=0x6

#0: pc=0x533e9, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#1: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#2: pc=0x533e8, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#3: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#4: pc=0x1b054, name=runtime.notewakeup, entry=0x1aff0, nameoff=0x11da, args=8, deferredreturn=0x0, pcsp=0x2e18, pcfile=0x2e23, pcln=0x2e26, npcdata=0x2, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#5: pc=0x2291c5, name=crypto/x509.(*ConstraintViolationError).Error, entry=0x2291b0, nameoff=0x28f14, args=24, deferredreturn=0x0, pcsp=0x26e8c, pcfile=0x25e3, pcln=0xbb8, npcdata=0x2, cuOffset=0x9cd, funcID=0x16, flag=0x0, nfuncdata=0x6

#0: pc=0x533e9, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#1: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#2: pc=0x533e8, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#3: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#4: pc=0x1b054, name=runtime.notewakeup, entry=0x1aff0, nameoff=0x11da, args=8, deferredreturn=0x0, pcsp=0x2e18, pcfile=0x2e23, pcln=0x2e26, npcdata=0x2, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#5: pc=0x228eea, name=crypto/x509.(*PublicKeyAlgorithm).String, entry=0x228e70, nameoff=0x28e91, args=24, deferredreturn=0x0, pcsp=0x85325, pcfile=0x85332, pcln=0x8533f, npcdata=0x3, cuOffset=0x9cd, funcID=0x16, flag=0x0, nfuncdata=0x6

#0: pc=0x533e9, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#1: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#2: pc=0x533e8, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#3: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#4: pc=0x1b054, name=runtime.notewakeup, entry=0x1aff0, nameoff=0x11da, args=8, deferredreturn=0x0, pcsp=0x2e18, pcfile=0x2e23, pcln=0x2e26, npcdata=0x2, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#5: pc=0x22957f, name=crypto/x509.(*SystemRootsError).Unwrap, entry=0x229510, nameoff=0x28fda, args=24, deferredreturn=0x0, pcsp=0x21906, pcfile=0x25e3, pcln=0xbb8, npcdata=0x2, cuOffset=0x9cd, funcID=0x16, flag=0x0, nfuncdata=0x6

#0: pc=0x533e9, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#1: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#2: pc=0x533e8, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#3: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#4: pc=0x1b054, name=runtime.notewakeup, entry=0x1aff0, nameoff=0x11da, args=8, deferredreturn=0x0, pcsp=0x2e18, pcfile=0x2e23, pcln=0x2e26, npcdata=0x2, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#5: pc=0x2292d9, name=crypto/x509.(*HostnameError).Error, entry=0x2292c0, nameoff=0x28f65, args=24, deferredreturn=0x0, pcsp=0x2bcff, pcfile=0x208, pcln=0x20b, npcdata=0x2, cuOffset=0x9cd, funcID=0x16, flag=0x0, nfuncdata=0x6

#0: pc=0x533e9, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#1: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#2: pc=0x533e8, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#3: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#4: pc=0x1b054, name=runtime.notewakeup, entry=0x1aff0, nameoff=0x11da, args=8, deferredreturn=0x0, pcsp=0x2e18, pcfile=0x2e23, pcln=0x2e26, npcdata=0x2, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#5: pc=0x2291bc, name=crypto/x509.(*ConstraintViolationError).Error, entry=0x2291b0, nameoff=0x28f14, args=24, deferredreturn=0x0, pcsp=0x26e8c, pcfile=0x25e3, pcln=0xbb8, npcdata=0x2, cuOffset=0x9cd, funcID=0x16, flag=0x0, nfuncdata=0x6

#0: pc=0x533e9, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#1: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#2: pc=0x533e8, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#3: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#4: pc=0x1b054, name=runtime.notewakeup, entry=0x1aff0, nameoff=0x11da, args=8, deferredreturn=0x0, pcsp=0x2e18, pcfile=0x2e23, pcln=0x2e26, npcdata=0x2, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#5: pc=0x2296e7, name=crypto/x509.(*UnknownAuthorityError).Error, entry=0x2296b0, nameoff=0x2905c, args=24, deferredreturn=0x0, pcsp=0x217d9, pcfile=0x208, pcln=0x20b, npcdata=0x2, cuOffset=0x9cd, funcID=0x16, flag=0x0, nfuncdata=0x6

#0: pc=0x533e9, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#1: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#2: pc=0x533e8, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#3: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#4: pc=0x1b054, name=runtime.notewakeup, entry=0x1aff0, nameoff=0x11da, args=8, deferredreturn=0x0, pcsp=0x2e18, pcfile=0x2e23, pcln=0x2e26, npcdata=0x2, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#5: pc=0x228df9, name=crypto/x509.(*SignatureAlgorithm).String, entry=0x228de0, nameoff=0x28e68, args=24, deferredreturn=0x0, pcsp=0x21753, pcfile=0x22d, pcln=0x230, npcdata=0x2, cuOffset=0x9cd, funcID=0x16, flag=0x0, nfuncdata=0x6
@cherrymui
Copy link
Member

Thanks for the report. Could you explain what function's frame size that you think is wrong? checkTimers? Or call2097152? The latter should have frame size 2097152+16, not 0. If it the runtime got that SP delta being 0, something is wrong. Also, it seems to me that it should not unwind to call2097152, as none of checkTimers, notewakeup, and the VDSO code are called from call2097152.

@cherrymui
Copy link
Member

I'm not sure I understand this part: You said from the "correct" stack trace

PC=0x456078 (runtime.checkTimers), SP=0xffc41457b560
PC=0x454178 (runtime.findrunnable), SP=0xffc41457b5b0

which indicates the SP delta should be 0xffc41457b5b0-0xffc41457b560=0x50.

But also

For PC=0x456078 this delta is 0x50, which is used to jump to PC=0x41f424 (runtime.notewakeup). I believe this is incorrect

The SP delta seems to match above and therefore should be the correct value. Why does it get a different PC instead, which appears to be the memory content of 0xffc41457b5d0 = SP+0x70 ?

@abeaumont
Copy link
Author

Thanks for the quick response. Sorry, I messed up the first line of the SP and vDSO SP stack traces, I have fixed them now :/

Thanks for the report. Could you explain what function's frame size that you think is wrong? checkTimers? Or call2097152? The latter should have frame size 2097152+16, not 0. If it the runtime got that SP delta being 0, something is wrong. Also, it seems to me that it should not unwind to call2097152, as none of checkTimers, notewakeup, and the VDSO code are called from call2097152.

I think checkTimers has an incorrect frame size. spdelta gives a frame size of 0x50, but it should be 0xffc41457b5b0-0xffc41457b580 = 0x30.

@abeaumont
Copy link
Author

Or call2097152? The latter should have frame size 2097152+16, not 0. If it the runtime got that SP delta being 0, something is wrong.

call2097152 has a 0 delta because the PC has not reached the code where the stack is modified:
Here the contents at pcsp:

gef➤  x/8bx 0x00000000016dfdc0 + 0x0000000000020830
0x17005f0:      0x02    0x0b    0xa0    0x80    0x80    0x02    0x2c    0x9f

Here is the disassembly (note that PC=0x48108d actually means PC=0x48108c, there's an off by one offset), some instructions before the stack frame is created:

gef➤  disassemble 0x481070 
Dump of assembler code for function runtime.call2097152:
0x0000000000481070 <+0>:     ldr     x1, [x28, #16]
0x0000000000481074 <+4>:     mov     x27, #0xff90                    // #65424
0x0000000000481078 <+8>:     movk    x27, #0x1f, lsl #16
0x000000000048107c <+12>:    subs    x2, sp, x27
0x0000000000481080 <+16>:    b.cc    0x481150 <runtime.call2097152+224>  // b.lo, b.ul, b.last
0x0000000000481084 <+20>:    cmp     x2, x1
0x0000000000481088 <+24>:    b.ls    0x481150 <runtime.call2097152+224>  // b.plast
0x000000000048108c <+28>:    sub     x27, sp, #0x10
0x0000000000481090 <+32>:    sub     x27, x27, #0x200, lsl #12
0x0000000000481094 <+36>:    str     x30, [x27]
0x0000000000481098 <+40>:    mov     sp, x27
0x000000000048109c <+44>:    stur    x29, [sp, #-8]
0x00000000004810a0 <+48>:    sub     x29, sp, #0x8
0x00000000004810a4 <+52>:    ldr     x1, [x28, #32]
[...]

@cherrymui
Copy link
Member

Sorry, I still don't quite understand it. Maybe there are still some numbers swapped? You said

We can see how the FP in both SP (0xffc41457b580) and vDSO SP (0xffc41457b560)

and also

Following the FP would result in the correct stack trace. From SP:
PC=0x48108d (runtime.call2097152), SP=0xffc41457b560
...

From vDSO SP:
PC=0x456078 (runtime.checkTimers), SP=0xffc41457b580

which seems SP=0xffc41457b560 whereas VDSO SP=0xffc41457b580, which is the other way around compared to the first line.

That said, I may have a guess what the problem is. As you mentioned, we do a traceback from SP first, if this fails, then do a traceback from VDSO SP. But at certain point in the VDSO calling code (nanotime1), the first traceback may not fail but run into infinite loop? Maybe we should use VDSO directly without trying the first case in some situations, e.g. the PC is within nanotime1.

@cherrymui
Copy link
Member

Is it hanging in the first call to gentraceback (with PC, SP from the signal handler), or the second (with VDSO PC and SP)?

Also, could you disassemble the first a few instructions of checkTimers? I think the SP delta 0x50 matches the machine code, but perhaps VDSO SP is wrong.

@cherrymui
Copy link
Member

I think the VDSO SP is wrong, and it is fixed in https://go-review.googlesource.com/c/go/+/337590 . But that CL is not applied to Go 1.17. Maybe we should. Thanks.

@abeaumont
Copy link
Author

which seems SP=0xffc41457b560 whereas VDSO SP=0xffc41457b580, which is the other way around compared to the first line.

You are absolutely right, sorry 🤦 It's SP (0xffc41457b560) and vDSO SP (0xffc41457b580), fixed 👍

@abeaumont
Copy link
Author

Is it hanging in the first call to gentraceback (with PC, SP from the signal handler), or the second (with VDSO PC and SP)?

I'ts hanging in the second (VDSO PC and SP).

Also, could you disassemble the first a few instructions of checkTimers? I think the SP delta 0x50 matches the machine code, but perhaps VDSO SP is wrong.

Right, the SP delta 0x50 matches the machine code, but somehow a delta of 0x30 would be needed in the case of VDSO for some reason.

Here's the disassembly:

gef➤  disassemble 0x456010
Dump of assembler code for function runtime.checkTimers:
0x0000000000456010 <+0>:     ldr     x1, [x28, #16]
0x0000000000456014 <+4>:     mov     x2, sp
0x0000000000456018 <+8>:     cmp     x2, x1
0x000000000045601c <+12>:    b.ls    0x45621c <runtime.checkTimers+524>  // b.plast
0x0000000000456020 <+16>:    str     x30, [sp, #-80]!
0x0000000000456024 <+20>:    stur    x29, [sp, #-8]
0x0000000000456028 <+24>:    sub     x29, sp, #0x8
0x000000000045602c <+28>:    ldr     x0, [sp, #88]
0x0000000000456030 <+32>:    ldrsb   x27, [x0]
0x0000000000456034 <+36>:    mov     x27, #0x1668                    // #5736
0x0000000000456038 <+40>:    add     x1, x0, x27
0x000000000045603c <+44>:    ldar    x1, [x1]
0x0000000000456040 <+48>:    mov     x27, #0x1670                    // #5744
0x0000000000456044 <+52>:    add     x2, x0, x27
0x0000000000456048 <+56>:    ldar    x2, [x2]
0x000000000045604c <+60>:    cbz     x1, 0x456060 <runtime.checkTimers+80>
0x0000000000456050 <+64>:    cbz     x2, 0x45605c <runtime.checkTimers+76>
0x0000000000456054 <+68>:    cmp     x2, x1

@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jan 24, 2022
@mknyszek mknyszek added this to the Backlog milestone Jan 24, 2022
@cherrymui
Copy link
Member

cherrymui commented Jan 24, 2022

but somehow a delta of 0x30 would be needed in the case of VDSO for some reason.

That is because VDSO SP is wrong. See my comment at #50772 (comment)

@cherrymui
Copy link
Member

I think CL https://go-review.googlesource.com/c/go/+/337590 should fix this. As this is fixed at tip, I think we can close this and open backport issues for previous releases.

In the meantime, you can try Go 1.18beta1 ( https://go.dev/dl/#go1.18beta1 ) or manually apply that patch to your toolchain, and let me know if this it still fails. Thanks.

@cherrymui
Copy link
Member

@gopherbot please backport this to previous releases. This may cause programs to hang. Thanks.

@gopherbot
Copy link

Backport issue(s) opened: #50780 (for 1.16), #50781 (for 1.17).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

@gopherbot
Copy link

Change https://golang.org/cl/380715 mentions this issue: [release-branch.go1.17] runtime: set vdsoSP to caller's SP consistently

@gopherbot
Copy link

Change https://golang.org/cl/380716 mentions this issue: [release-branch.go1.16] runtime: set vdsoSP to caller's SP consistently

@lizthegrey
Copy link

Confirming that backporting this individual change into our go1.17.6 runtime fixes the process hangs we were observing.

@cherrymui
Copy link
Member

Thanks for confirming! @lizthegrey

gopherbot pushed a commit that referenced this issue Feb 7, 2022
m.vdsoSP should be set to the SP of the caller of nanotime1,
instead of the SP of nanotime1 itself, which matches m.vdsoPC.
Otherwise the unmatched vdsoPC and vdsoSP would make the stack
trace look like recursive.

We already do it correctly on AMD64, 386, and RISCV64. This CL
fixes the rest.

Also incorporate CL 352509, skipping a flaky test.

Updates #47324, #50772.
Fixes #50780.

Change-Id: I98b6fcfbe9fc6bdd28b8fe2a1299b7c505371dd4
Reviewed-on: https://go-review.googlesource.com/c/go/+/337590
Trust: Cherry Mui <cherryyz@google.com>
Trust: Josh Bleecher Snyder <josharian@gmail.com>
Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
(cherry picked from commit 217507e)
Reviewed-on: https://go-review.googlesource.com/c/go/+/380716
Run-TryBot: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
gopherbot pushed a commit that referenced this issue Feb 7, 2022
m.vdsoSP should be set to the SP of the caller of nanotime1,
instead of the SP of nanotime1 itself, which matches m.vdsoPC.
Otherwise the unmatched vdsoPC and vdsoSP would make the stack
trace look like recursive.

We already do it correctly on AMD64, 386, and RISCV64. This CL
fixes the rest.

Also incorporate CL 352509, skipping a flaky test.

Updates #47324, #50772.
Fixes #50781.

Change-Id: I98b6fcfbe9fc6bdd28b8fe2a1299b7c505371dd4
Reviewed-on: https://go-review.googlesource.com/c/go/+/337590
Trust: Cherry Mui <cherryyz@google.com>
Trust: Josh Bleecher Snyder <josharian@gmail.com>
Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
(cherry picked from commit 217507e)
Reviewed-on: https://go-review.googlesource.com/c/go/+/380715
Run-TryBot: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
@golang golang locked and limited conversation to collaborators Jan 25, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

5 participants