Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: infinite loop in lockextra on linux/arm #34391

Closed
nyuichi opened this issue Sep 19, 2019 · 20 comments
Closed

runtime: infinite loop in lockextra on linux/arm #34391

nyuichi opened this issue Sep 19, 2019 · 20 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@nyuichi
Copy link
Contributor

nyuichi commented Sep 19, 2019

What version of Go are you using (go version)?

$ go/bin/go version
go version devel +dac30212c2 Tue Sep 10 07:31:41 2019 +0100 linux/arm

Does this issue reproduce with the latest release?

I only reproduced this with HEAD due to another issue #34030.

What operating system and processor architecture are you using (go env)?

go env Output
$ go/bin/go env
GO111MODULE=""
GOARCH="arm"
GOBIN=""
GOCACHE="/home/pi/.cache/go-build"
GOENV="/home/pi/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="arm"
GOHOSTOS="linux"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH=""
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/pi/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/pi/go/pkg/tool/linux_arm"
GCCGO="gccgo"
GOARM="7"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -marm -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build423836175=/tmp/go-build -gno-record-gcc-switches"

What did you do?

My environment is Raspberry pi 3 running raspbian buster.

$ cat bug.go 
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	signal_chan := make(chan os.Signal, 1)
	signal.Notify(signal_chan, syscall.SIGUSR1)
	go func() {
		for {
			<-signal_chan
		}
	}()

	p, _ := os.FindProcess(os.Getpid())

	go func() {
		for {
			p.Signal(syscall.SIGUSR1)
		}
	}()

        t0 := time.Now()
        t1 := t0
        for t1.Sub(t0) < 60 * time.Second {
                t1 = time.Now()
		fmt.Println("{}", t1)
        }
	fmt.Println("done")
}
$ go/bin/go build bug.go
$ gdb bug
Reading symbols from bug...done.
Loading Go Runtime support.
(gdb) handle SIGUSR1 nostop noprint
Signal        Stop	Print	Pass to program	Description
SIGUSR1       No	No	Yes		User defined signal 1
(gdb) r
Starting program: /home/pi/bug 
[New LWP 3815]
[New LWP 3816]
[New LWP 3817]
[New LWP 3818]
[New LWP 3819]
[New LWP 3820]
[New LWP 3821]
{} 2019-09-19 08:21:27.266727698 +0100 BST m=+0.002104470
{} 2019-09-19 08:21:27.268343107 +0100 BST m=+0.003719983
{} 2019-09-19 08:21:27.268407117 +0100 BST m=+0.003783941
... (omit) ...
{} 2019-09-19 08:22:04.869803408 +0100 BST m=+37.605180023
{} 2019-09-19 08:22:04.869853199 +0100 BST m=+37.605229814
{} 2019-09-19 08:22:04.869910438 +0100 BST m=+37.605287106
### (at this point the log suddenly stops coming) ###
^C
### (interrupt with Ctrl-C after waiting a while) ###
Thread 1 "bug" received signal SIGINT, Interrupt.
runtime.futex () at /home/pi/go/src/runtime/sys_linux_arm.s:334
334		SWI	$0
(gdb) info threads
  Id   Target Id         Frame 
* 1    LWP 3812 "bug"    runtime.futex () at /home/pi/go/src/runtime/sys_linux_arm.s:334
  2    LWP 3815 "bug"    runtime.futex () at /home/pi/go/src/runtime/sys_linux_arm.s:334
  3    LWP 3816 "bug"    runtime.usleep () at /home/pi/go/src/runtime/sys_linux_arm.s:486
  4    LWP 3817 "bug"    runtime.futex () at /home/pi/go/src/runtime/sys_linux_arm.s:334
  5    LWP 3818 "bug"    runtime.futex () at /home/pi/go/src/runtime/sys_linux_arm.s:335
  6    LWP 3819 "bug"    runtime.futex () at /home/pi/go/src/runtime/sys_linux_arm.s:334
  7    LWP 3820 "bug"    runtime.futex () at /home/pi/go/src/runtime/sys_linux_arm.s:334
  8    LWP 3821 "bug"    runtime.futex () at /home/pi/go/src/runtime/sys_linux_arm.s:334
  9    LWP 3822 "bug"    runtime.futex () at /home/pi/go/src/runtime/sys_linux_arm.s:334
  10   LWP 3824 "bug"    runtime.futex () at /home/pi/go/src/runtime/sys_linux_arm.s:334
(gdb) bt
#0  runtime.futex () at /home/pi/go/src/runtime/sys_linux_arm.s:334
#1  0x00037b7c in runtime.futexsleep (addr=0x1602fc <runtime.m0+204>, val=0, ns=-1) at /home/pi/go/src/runtime/os_linux.go:44
#2  0x000191d8 in runtime.notesleep (n=0x1602fc <runtime.m0+204>) at /home/pi/go/src/runtime/lock_futex.go:151
#3  0x00040518 in runtime.stoplockedm () at /home/pi/go/src/runtime/proc.go:2105
#4  0x00041bc4 in runtime.schedule () at /home/pi/go/src/runtime/proc.go:2506
#5  0x00041d34 in runtime.park_m (gp=0x48a1c0) at /home/pi/go/src/runtime/proc.go:2647
#6  0x000633e4 in runtime.mcall () at /home/pi/go/src/runtime/asm_arm.s:289
#7  0x00491574 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) thread 3
[Switching to thread 3 (LWP 3816)]
#0  runtime.usleep () at /home/pi/go/src/runtime/sys_linux_arm.s:486
486		RET
(gdb) bt
#0  runtime.usleep () at /home/pi/go/src/runtime/sys_linux_arm.s:486
#1  0x0003f990 in runtime.lockextra (nilokay=false, ~r1=<optimized out>) at /home/pi/go/src/runtime/proc.go:1794
#2  0x0003f3d4 in runtime.needm (x=<optimized out>) at /home/pi/go/src/runtime/proc.go:1595
#3  0x0004f240 in runtime.badsignal (sig=10, c=0x447c58) at /home/pi/go/src/runtime/signal_unix.go:610
#4  0x0004e9ac in runtime.sigtrampgo (sig=<optimized out>, info=0x447c88, ctx=0x447d08) at /home/pi/go/src/runtime/signal_unix.go:315
#5  0x0006604c in runtime.sigtramp () at /home/pi/go/src/runtime/sys_linux_arm.s:449
#6  0x76ffd3bc in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) 

What did you expect to see?

done is printed.

What did you see instead?

The execution got stuck running into an infinite loop (maybe a livelock).
It appears the issue is caused in the same situation of #34030, namely a signal occurs while VDSO is running.
In the patch I attached in #34030 it is tested that incoming SIGPROF signals no longer cause segv, but I hadn't fully tested other signals.

@laboger
Copy link
Contributor

laboger commented Sep 19, 2019

We have been seeing some strange failures on the linux/ppc64 builder. The problems are not consistently reproducible but I was able to find that the test TestStress in os/signal hangs consistently on my ppc64 system. It does not always hang when running all.bash but does consistently hang if I run that test by itself. I am still trying to see if I can make the same thing happen on ppc64le.

This starts to happen with commit 904f046 which was the fix for the issue mentioned above #34030. I'm guessing it is the same problem identified in this issue but if not I can open a new issue. There are many failures on the ppc64 build dashboard since this commit mostly appearing as 20m timeouts. In one case I was able to make it happen just like the builder log and found that it was hung during a compile. If you look through the ppc64 failures the hang is not always on the same test but usually a 20m timeout.

@toothrot toothrot added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Sep 19, 2019
@toothrot toothrot added this to the Go1.14 milestone Sep 19, 2019
@toothrot
Copy link
Contributor

toothrot commented Sep 19, 2019

/cc @aclements @rsc @randall77 who own runtime.

I don't see why this should livelock, so this could be a bug.

@laboger
Copy link
Contributor

laboger commented Sep 20, 2019

The commit mentioned above is incorrect for ppc64/ppc64le. R30 is used for G in Go and will never be clobbered by the vdso code since R30 is considered nonvolatile in the non-Go world. Removing ppc64, ppc64le from the case in sigFetchG resolves the issues I was seeing.

@gopherbot
Copy link

Change https://golang.org/cl/196658 mentions this issue: runtime: fix regression on ppc64x from commit 904f046

@nyuichi
Copy link
Contributor Author

nyuichi commented Sep 20, 2019

/cc @ianlancetaylor who worked with me to make the commit mentioned above.

@laboger
Copy link
Contributor

laboger commented Sep 20, 2019

When TestStress in os/signal hangs on ppc64, I can attach with gdb and see this stack:

#0  0x000000000006dde0 in runtime.usleep () at /home/boger/golang/test/go/src/runtime/sys_linux_ppc64x.s:117
#1  0x0000000000042cb0 in runtime.lockextra (~r1=<optimized out>, nilokay=false) at /home/boger/golang/test/go/src/runtime/proc.go:1794
#2  0x0000000000042680 in runtime.needm (x=<optimized out>) at /home/boger/golang/test/go/src/runtime/proc.go:1595
#3  0x0000000000054550 in runtime.badsignal (sig=10, c=0xc000008ee0) at /home/boger/golang/test/go/src/runtime/signal_unix.go:611
#4  0x0000000000053bcc in runtime.sigtrampgo (sig=<optimized out>, info=0xc000009d78, ctx=0xc000009000)
    at /home/boger/golang/test/go/src/runtime/signal_unix.go:316
#5  0x000000000006e11c in sigtramp () at /home/boger/golang/test/go/src/runtime/sys_linux_ppc64x.s:317
#6  <signal handler called>
#7  __kernel_clock_gettime () at arch/powerpc/kernel/vdso64/gettimeofday.S:95
#8  0x000000000006dfc8 in runtime.nanotime1 () at /home/boger/golang/test/go/src/runtime/sys_linux_ppc64x.s:233
#9  0x0000000001d304bb in ?? ()

It didn't go down this path before because I don't think 'g' could be nil. So it looks like this is uncovering another problem because when badsignal is called, it doesn't return, which causes the hang.

@cherrymui
Copy link
Member

Thanks @nyuichi and @laboger for the backtraces. I think the problem is, when sigFetchG returns nil, it calls badsignal which, calls needm, which wants an "extram", but for a non-cgo program we never created one, so it never succeed. Maybe we have to create an extra M on these platforms for non-cgo programs?

cc @ianlancetaylor

@ianlancetaylor
Copy link
Contributor

I think the current sigtrampgo is incomplete. We must not call badsignal for a signal that we are trying to catch, even if that signal arrives during a VDSO. The spin in lockextra is a distraction; if we fix that, we will still crash when we should not.

Perhaps the code in sigtrampgo should call sigsend itself if appropriate. Perhaps for other cases we should atomically record a bitmask of the signals, and check that bitmask as we return from the VDSO.

@laboger
Copy link
Contributor

laboger commented Sep 23, 2019

I just hit a compile time hang when doing a golang build on one of our power8 ppc64le machines. Here is what the trace looks like:

(gdb) bt
#0  0x000000000006b5d0 in runtime.usleep () at /home/boger/golang/fresh/go/src/runtime/sys_linux_ppc64x.s:117
#1  0x00000000000432b0 in runtime.lockextra (~r1=<optimized out>, nilokay=false) at /home/boger/golang/fresh/go/src/runtime/proc.go:1794
#2  0x0000000000042c80 in runtime.needm (x=<optimized out>) at /home/boger/golang/fresh/go/src/runtime/proc.go:1595
#3  0x0000000000054500 in runtime.badsignal (sig=17, c=0xc000008ee0) at /home/boger/golang/fresh/go/src/runtime/signal_unix.go:610
#4  0x0000000000053cbc in runtime.sigtrampgo (sig=<optimized out>, info=0xc000009d78, ctx=0xc000009000)
    at /home/boger/golang/fresh/go/src/runtime/signal_unix.go:315
#5  0x000000000006b8ec in runtime.sigtramp () at /home/boger/golang/fresh/go/src/runtime/sys_linux_ppc64x.s:317
#6  <signal handler called>
#7  __do_get_tspec () at /build/linux-B4VGWU/linux-4.15.0/arch/powerpc/kernel/vdso64/gettimeofday.S:208
#8  0x00007b3791c2057c in __kernel_clock_gettime () at /build/linux-B4VGWU/linux-4.15.0/arch/powerpc/kernel/vdso64/gettimeofday.S:75
PC not saved

This must be the reason for all the 20 minute hangs we have been seeing since the original sigFetchG change went in. I think sig 17 is SIGCHLD which seems like a signal that should be ignored.

gopherbot pushed a commit that referenced this issue Sep 23, 2019
This fixes a regression introduced with CL 192937. That change
was intended to fix a problem in arm and arm64 but also added
code to change the behavior in ppc64 and ppc64le even though the
error never occurred there. The change to function sigFetchG
assumes that the register holding 'g' could be clobbered by
vdso code when in fact 'g' is in R30 and that is nonvolatile
in the 64-bit PowerPC ELF ABI so would not be clobbered in vdso code.

So if this happens somehow the path it takes is incorrect,
falling through to a call to badsignal which doesn't seem right.

This regression caused intermittent hangs on the builder dashboard
for ppc64, and can be reproduced consistently when running os/signal
TestStress on some ppc64 systems.

I mentioned this problem is issue #34391 because I thought it was
related to another problem described there.

Change-Id: I2ee3606de302bafe509d300077ce3b44b88571a1
Reviewed-on: https://go-review.googlesource.com/c/go/+/196658
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@nyuichi
Copy link
Contributor Author

nyuichi commented Sep 27, 2019

I started looking into this problem.
If my understanding is correct, when a signal other than SIGPROF is delivered to a Go thread running VDSO, sigtrampgo recognizes the current context as a foreign (= C) one and tries to delegate the handle of the signal to that C context, although VDSO never does such a job.

So @ianlancetaylor, do you mean we should have a thread-local bitmask for keeping signals received during a VDSO? I was not able to figure out how to create a thread local storage in a go program not linked to libc. It seems tls_g is only initialized when cgo is enabled and pthread is required to implement it.

@ianlancetaylor
Copy link
Contributor

We don't need a thread-local bitmask. Signals are delivered to the process as a whole. Go doesn't support delivering signals to a specific thread. An atomically accessed process-wide bitmask would suffice.

@nyuichi
Copy link
Contributor Author

nyuichi commented Sep 30, 2019

@ianlancetaylor

Go doesn't support delivering signals to a specific thread.

Makes sense.

An atomically accessed process-wide bitmask would suffice.

Can I assume also that Go doesn't support real time signals? For them POSIX defines order and number of delivery are guaranteed.

nyuichi added a commit to nyuichi/go that referenced this issue Sep 30, 2019
nyuichi added a commit to nyuichi/go that referenced this issue Sep 30, 2019
@ianlancetaylor
Copy link
Contributor

Real time signals are an interesting case, but I think that we don't need to worry about them. In a pure Go program they can only be seen by code that calls os/signal.Notify; I'm suggesting that we call sigsend directly from the signal handler, so that case will continue to work the same way. In Go code linked into a C program, the Go runtime will not install a signal handler for the real time signals, so we don't have to worry. That leaves the case of a Go program that links in C code that uses real time signals, but relies on the Go signal handler forwarding the signals to the C signal handler; that case will be handled by sigfwdgo. So I can't think of a case where we need to keep proper track of real time signals.

I think we really only need the suggested atomic bitmask for signals that we aren't going to forward and that we aren't going to ignore. In other words, signals for which we are going to crash. So I don't think we need to be super careful about how we handle them.

@nyuichi
Copy link
Contributor Author

nyuichi commented Oct 11, 2019

@ianlancetaylor

Sorry for the late reply. I don't quite understand some points:

  1. Did you mean that pure Go programs using VDSO on arm are classified into the first case? If so, is it legitimate to call sigsend (and perhaps also raisebadsignal) directly when m is not available?
  2. Is the distinction between your second case and third case determined by the return value of sigfwdgo? In that case should a signal incoming to VDSO code running inside Go programs with cgo support be forwarded to the C signal handler?
  3. I understand that the suggested atomic bitmask is only used for the second case; is this correct?

nyuichi added a commit to nyuichi/go that referenced this issue Oct 11, 2019
This commit fixes issue golang#34391, which is due to an incorrect patch
merged in golang#34030.

sigtrampgo is modified to record incoming signals in a globally shared
atomic bitmask during when the G register is clobbered. When the
execution exits from vdso it checks if there is a pending signal it
re-raises them to its own process.
@ianlancetaylor
Copy link
Contributor

Sorry, I'm not sure how to map what you call the first, second, and third cases into what I wrote.

  1. We can do part of sigsend without an m. I'm not sure whether we can call notewakeup without an m, so we may need some other mechanism to make that call.
  2. sigfwdgo is called before we even check for whether we are in a VDSO, so whether or not we are running in a VDSO shouldn't affect whether we forward a signal.
  3. I'm not sure.

@laboger
Copy link
Contributor

laboger commented Oct 14, 2019

@randall77 I see you reopened #22047. We have seen similar hangs recently and I suspect it could be related to the issue with signals described here. I was unable to add any comments to issue #22047 because it was locked. I have been able to reproduce some hangs on my systems by using GOMAXPROCS=2 like is done with builder testing, and in those cases I see the stack that I provided above in the signal handling code. (Maybe there should be a different issue for the ppc64{le} hangs? I don't like to put comments in multiple issues but in this case the headline is misleading.)

@ianlancetaylor
Copy link
Contributor

I unlocked #22047.

nyuichi added a commit to nyuichi/go that referenced this issue Oct 18, 2019
This commit fixes issue golang#34391, which is due to an incorrect patch
merged in golang#34030.

sigtrampgo is modified to record incoming signals in a globally shared
atomic bitmask when the G register is clobbered. When the execution
exits from vdso it checks if there is a pending signal and in that
case it re-raises them to its own process.
@gopherbot
Copy link

Change https://golang.org/cl/201899 mentions this issue: runtime: WIP: fix infinite loop in lockextra on linux/arm

nyuichi added a commit to nyuichi/go that referenced this issue Oct 18, 2019
This commit fixes issue golang#34391, which is due to an incorrect patch
merged in CL 192937.

sigtrampgo is modified to record incoming signals in a globally shared
atomic bitmask when the G register is clobbered. When the execution
exits from vdso it checks if there are pending signals and in that
case it re-raises them to its own process.
@gopherbot
Copy link

Change https://golang.org/cl/202759 mentions this issue: runtime: save/fetch g register during VDSO on ARM and ARM64

@gopherbot
Copy link

Change https://golang.org/cl/206959 mentions this issue: runtime: crash if a signal is received with bad G and no extra M

gopherbot pushed a commit that referenced this issue Nov 15, 2019
When we receive a signal, if G is nil we call badsignal, which
calls needm. When cgo is not used, there is no extra M, so needm
will just hang. In this situation, even GOTRACEBACK=crash cannot
get a stack trace, as we're in the signal handler and cannot
receive another signal (SIGQUIT).

Instead, just crash.

For #35554.
Updates #34391.

Change-Id: I061ac43fc0ac480435c050083096d126b149d21f
Reviewed-on: https://go-review.googlesource.com/c/go/+/206959
Run-TryBot: Cherry Zhang <cherryyz@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@golang golang locked and limited conversation to collaborators Nov 13, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants