
plugin: program on linux/s390x sometimes hangs after calling "plugin.Open" #40473

Closed
liurui-1 opened this issue Jul 29, 2020 · 21 comments
Labels
arch-s390x (Issues solely affecting the s390x architecture) · FrozenDueToAge · NeedsInvestigation (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one)

Comments

@liurui-1

What version of Go are you using (go version)?

$ go version
go version go1.14.6 linux/s390x

Note: this problem only occurs on linux/s390x. All Go versions have the same issue.

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="s390x"
GOBIN="/usr/local/go/bin/"
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="s390x"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/ua/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_s390x"
GCCGO="gccgo"
AR="ar"
CC="s390x-linux-gnu-gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -march=z196 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build262588498=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Here is a sample program (ttt.go):

package main

import (
	"fmt"
	"plugin"
)

func openPlugin(name string) {
	fmt.Printf("openPlugin Start: %s\n", name)
	_, err := plugin.Open(name)
	fmt.Printf("openPlugin End: %s\n", name)
	if err != nil {
		fmt.Printf("openPlugin Error: $s - %v\n", name, err)
	}
}

func main() {
	openPlugin("cpu.so")
	openPlugin("disk.so")
	openPlugin("diskio.so")
	openPlugin("mem.so")
	openPlugin("net.so")
	openPlugin("processes.so")
	openPlugin("procstat.so")
	openPlugin("system.so")
	openPlugin("jaeger.so")
	openPlugin("zipkin.so")
}

What did you expect to see?

$ ./ttt
openPlugin Start: cpu.so
openPlugin End: cpu.so
openPlugin Start: disk.so
openPlugin End: disk.so
openPlugin Start: diskio.so
openPlugin End: diskio.so
openPlugin Start: mem.so
openPlugin End: mem.so
openPlugin Start: net.so
openPlugin End: net.so
openPlugin Start: processes.so
openPlugin End: processes.so
openPlugin Start: procstat.so
openPlugin End: procstat.so
openPlugin Start: system.so
openPlugin End: system.so
openPlugin Start: jaeger.so
openPlugin End: jaeger.so
openPlugin Start: zipkin.so
openPlugin End: zipkin.so

What did you see instead?

If you run it multiple times, you can see the problem occur at random:

$ ./ttt
openPlugin Start: cpu.so
openPlugin End: cpu.so
openPlugin Start: disk.so
openPlugin End: disk.so
openPlugin Start: diskio.so
openPlugin End: diskio.so
openPlugin Start: mem.so
openPlugin End: mem.so
openPlugin Start: net.so
openPlugin End: net.so
openPlugin Start: processes.so
openPlugin End: processes.so
openPlugin Start: procstat.so
openPlugin End: procstat.so
openPlugin Start: system.so
openPlugin End: system.so
openPlugin Start: jaeger.so
openPlugin End: jaeger.so
openPlugin Start: zipkin.so
openPlugin End: zipkin.so

$ ./ttt
openPlugin Start: cpu.so
openPlugin End: cpu.so
openPlugin Start: disk.so
openPlugin End: disk.so
openPlugin Start: diskio.so
openPlugin End: diskio.so
openPlugin Start: mem.so
openPlugin End: mem.so
openPlugin Start: net.so
openPlugin End: net.so
openPlugin Start: processes.so
openPlugin End: processes.so
openPlugin Start: procstat.so
openPlugin End: procstat.so
openPlugin Start: system.so
openPlugin End: system.so
openPlugin Start: jaeger.so
openPlugin End: jaeger.so
openPlugin Start: zipkin.so
openPlugin End: zipkin.so

$ ./ttt
openPlugin Start: cpu.so
openPlugin End: cpu.so
openPlugin Start: disk.so
openPlugin End: disk.so
openPlugin Start: diskio.so
openPlugin End: diskio.so
openPlugin Start: mem.so
openPlugin End: mem.so
openPlugin Start: net.so
openPlugin End: net.so
openPlugin Start: processes.so
openPlugin End: processes.so
openPlugin Start: procstat.so
openPlugin End: procstat.so
openPlugin Start: system.so
openPlugin End: system.so
openPlugin Start: jaeger.so
openPlugin End: jaeger.so
openPlugin Start: zipkin.so

==> hangs without printing the End line for zipkin.so

@liurui-1 liurui-1 changed the title Golang for s390 hang after call plugin.Open for uncertain times Golang for linux/s390 hang after call plugin.Open for uncertain times Jul 29, 2020
@liurui-1 liurui-1 changed the title Golang for linux/s390 hang after call plugin.Open for uncertain times Golang for linux/s390x hang after call plugin.Open for uncertain times Jul 29, 2020
@liurui-1 liurui-1 changed the title Golang for linux/s390x hang after call plugin.Open for uncertain times Golang app on linux/s390x hangs after calling "plugin.Open" for uncertain times Jul 29, 2020
@mundaym mundaym added the arch-s390x Issues solely affecting the s390x architecture. label Jul 29, 2020
@mundaym mundaym changed the title Golang app on linux/s390x hangs after calling "plugin.Open" for uncertain times plugin: app on linux/s390x sometimes hangs after calling "plugin.Open" Jul 29, 2020
@mundaym mundaym added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jul 29, 2020
@mundaym
Member

mundaym commented Jul 29, 2020

Are you able to generate a core dump for the program? It might identify where the program got stuck.

@liurui-1
Author

liurui-1 commented Jul 29, 2020

Thanks @mundaym
Go on linux/s390x seems unable to reliably generate a stack trace or heap dump. I have never successfully created such files on linux/s390x.

Following is my code:

package main

import (
	"fmt"
	"os"
	"os/signal"
	"plugin"
	"runtime/pprof"
	"syscall"
)

func openPlugin(name string) {
	fmt.Printf("openPlugin Start: %s\n", name)
	_, err := plugin.Open(name)
	fmt.Printf("openPlugin End: %s\n", name)
	if err != nil {
		fmt.Printf("openPlugin Error: $s - %v\n", name, err)
	}
}

func dump(uaDumpDir string) {
	fd1, err := os.Create(uaDumpDir + "goroutine.pprof")
	if err != nil {
		fmt.Printf("Failed to create goroutine dump: %v\n", err)
	} else {
		defer fd1.Close()

		profile := pprof.Lookup("goroutine")
		if profile != nil {
			profile.WriteTo(fd1, 2)
		} else {
			fmt.Println("Cannot find profile of goroutine")
		}
	}

	fd2, err := os.Create(uaDumpDir + "dumpHeap.pprof")
	if err != nil {
		fmt.Printf("Failed to create heap dump: %v\n", err)
	} else {
		defer fd2.Close()

		profile := pprof.Lookup("heap")
		if profile != nil {
			profile.WriteTo(fd2, 1)
		} else {
			fmt.Printf("Cannot find profile of heap\n")
		}
	}
}

func handleSignal(dumpChan chan os.Signal, uaDumpDir string) {
	for item := range dumpChan {
		fmt.Printf("Received sigal: %v\n", item)
		dump(uaDumpDir)
	}
}

func main() {
	uaDumpDir := "/var/log/"

	defer func() {
		if err := recover(); err != nil {
			fmt.Printf("Internal error: %v\n", err)
			dump(uaDumpDir)
		}
	}()

	dumpChan := make(chan os.Signal, 1)
	signal.Notify(dumpChan, syscall.SIGUSR1)
	defer close(dumpChan)
	go handleSignal(dumpChan, uaDumpDir)

	openPlugin("cpu.so")
	openPlugin("disk.so")
	openPlugin("diskio.so")
	openPlugin("mem.so")
	openPlugin("net.so")
	openPlugin("processes.so")
	openPlugin("procstat.so")
	openPlugin("system.so")
	openPlugin("jaeger.so")
	openPlugin("zipkin.so")
}

When the app hangs, I can send it a signal:

kill -SIGUSR1 <pid of ttt>

However, it only generates an empty file.

root@hales1:/home/ua/go/src/main1# ./ttt
openPlugin Start: cpu.so
openPlugin End: cpu.so
openPlugin Start: disk.so
openPlugin End: disk.so
openPlugin Start: diskio.so
openPlugin End: diskio.so
openPlugin Start: mem.so
openPlugin End: mem.so
openPlugin Start: net.so
openPlugin End: net.so
openPlugin Start: processes.so
openPlugin End: processes.so
openPlugin Start: procstat.so
openPlugin End: procstat.so
openPlugin Start: system.so
openPlugin End: system.so
openPlugin Start: jaeger.so
openPlugin End: jaeger.so
openPlugin Start: zipkin.so
Received signal: user defined signal 1
ls -ltr /var/log
...
-rw-r--r--  1 root   root                 0 Jul 29 06:39 goroutine.pprof
...

On any other platform, it generates the dump and stack trace successfully.
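
As an aside (not part of the original report): one way to get stacks out of a hung Go process without relying on in-process pprof code is to let the runtime produce the dump itself. The sketch below is only a suggestion and assumes the runtime's signal handler can still run; with GOTRACEBACK=crash (or an early call to debug.SetTraceback("crash")) a SIGQUIT or SIGABRT makes the runtime print every goroutine stack, including runtime frames, and then abort, so the kernel can also write a core dump if "ulimit -c" allows it.

package main

import (
	"runtime/debug"
)

func main() {
	// Equivalent to running with GOTRACEBACK=crash: on a fatal signal the
	// runtime prints all goroutine stacks (runtime frames included) and
	// then aborts, producing a core dump if core dumps are enabled.
	debug.SetTraceback("crash")

	// ... openPlugin calls as in the sample program above ...

	// When the process hangs, run `kill -QUIT <pid>` (or `kill -ABRT <pid>`)
	// from another shell to trigger the runtime's own stack dump.
}

The SIGABRT tracebacks later in this thread look like the output of exactly this kind of dump.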

@liurui-1 liurui-1 changed the title plugin: app on linux/s390x sometimes hangs after calling "plugin.Open" plugin: Golang app on linux/s390x sometimes hangs after calling "plugin.Open" Jul 31, 2020
@liurui-1
Author

liurui-1 commented Aug 3, 2020

I just simplified the test app (ttt.go) to be:

package main

import (
        "fmt"
        "plugin"
)

func openPlugin(name string) {
        fmt.Printf("openPlugin Start: %s\n", name)
        _, err := plugin.Open(name)
        fmt.Printf("openPlugin End: %s\n", name)
        if err != nil {
                fmt.Printf("openPlugin Error: $s - %v\n", name, err)
        }
}

func main() {
        openPlugin("cpu.so")
        openPlugin("disk.so")
        openPlugin("diskio.so")
        openPlugin("mem.so")
        openPlugin("net.so")
        openPlugin("processes.so")
        openPlugin("procstat.so")
        openPlugin("system.so")
        openPlugin("jaeger.so")
        openPlugin("zipkin.so")
}
1. The core dump when the app hangs is:
SIGABRT: abort
PC=0x10ce518 m=0 sigcode=0

goroutine 0 [idle]:
runtime.futex(0x123b948, 0x8000000000, 0x0, 0x0, 0x108d332, 0x3ffb4a213a8, 0x10a48f2, 0x125dac0, 0x10a4964, 0x123b948, ...)
        /usr/local/go/src/runtime/sys_linux_s390x.s:336 +0x28
runtime.futexsleep(0x123b948, 0x10a5ccc, 0xffffffffffffffff)
        /usr/local/go/src/runtime/os_linux.go:45 +0x4c
runtime.notesleep(0x123b948)
        /usr/local/go/src/runtime/lock_futex.go:151 +0xa8
runtime.stopm()
        /usr/local/go/src/runtime/proc.go:1834 +0xd4
runtime.findrunnable(0xc000026a80, 0x0)
        /usr/local/go/src/runtime/proc.go:2366 +0xa7a
runtime.schedule()
        /usr/local/go/src/runtime/proc.go:2526 +0x380
runtime.park_m(0xc0004e2300)
        /usr/local/go/src/runtime/proc.go:2696 +0xa6
runtime.mcall(0x0)
        /usr/local/go/src/runtime/asm_s390x.s:237 +0x60

goroutine 1 [running]:
        goroutine running on other thread; stack unavailable

goroutine 5 [sync.Cond.Wait]:
runtime.goparkunlock(...)
        /usr/local/go/src/runtime/proc.go:310
sync.runtime_notifyListWait(0xc0000227d0, 0x63ae03bc)
        /usr/local/go/src/runtime/sema.go:513 +0x120
sync.(*Cond).Wait(0xc0000227c0)
        /usr/local/go/src/sync/cond.go:56 +0xd6
github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog.(*asyncLoopLogger).processItem(0xc00007d3b0, 0x0)
        /home/ua/go/src/github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog/behavior_asynclooplogger.go:50 +0x9e
github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog.(*asyncLoopLogger).processQueue(0xc00007d3b0)
        /home/ua/go/src/github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog/behavior_asynclooplogger.go:63 +0x4e
created by github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog.NewAsyncLoopLogger
        /home/ua/go/src/github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog/behavior_asynclooplogger.go:40 +0x8a

goroutine 6 [sync.Cond.Wait]:
runtime.goparkunlock(...)
        /usr/local/go/src/runtime/proc.go:310
sync.runtime_notifyListWait(0xc000022950, 0x63ae03bc)
        /usr/local/go/src/runtime/sema.go:513 +0x120
sync.(*Cond).Wait(0xc000022940)
        /usr/local/go/src/sync/cond.go:56 +0xd6
github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog.(*asyncLoopLogger).processItem(0xc00007d4d0, 0x0)
        /home/ua/go/src/github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog/behavior_asynclooplogger.go:50 +0x9e
github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog.(*asyncLoopLogger).processQueue(0xc00007d4d0)
        /home/ua/go/src/github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog/behavior_asynclooplogger.go:63 +0x4e
created by github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog.NewAsyncLoopLogger
        /home/ua/go/src/github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog/behavior_asynclooplogger.go:40 +0x8a

r0   0x0        r1   0xee
r2   0x123b948  r3   0x80
r4   0x0        r5   0x0
r6   0x0        r7   0x0
r8   0x0        r9   0x16
r10  0x0        r11  0x8
r12  0x8        r13  0x123b1e0
r14  0x109936c  r15  0x3ffc26fefc0
pc   0x10ce518  link 0x109936c

2. If I remove the fmt calls after "plugin.Open", it may still hang occasionally. Following is the core dump:
SIGABRT: abort
PC=0x10ce468 m=0 sigcode=0

goroutine 0 [idle]:
runtime.futex(0x123b948, 0x8000000000, 0x0, 0x0, 0x70000, 0x1de8006f1f26a9, 0x10a4842, 0x807246, 0x10a48b4, 0x123b948, ...)
        /usr/local/go/src/runtime/sys_linux_s390x.s:336 +0x28
runtime.futexsleep(0x123b948, 0x10a5c1c, 0xffffffffffffffff)
        /usr/local/go/src/runtime/os_linux.go:45 +0x4c
runtime.notesleep(0x123b948)
        /usr/local/go/src/runtime/lock_futex.go:151 +0xa8
runtime.stopm()
        /usr/local/go/src/runtime/proc.go:1834 +0xd4
runtime.findrunnable(0xc00002ea80, 0x0)
        /usr/local/go/src/runtime/proc.go:2366 +0xa7a
runtime.schedule()
        /usr/local/go/src/runtime/proc.go:2526 +0x380
runtime.park_m(0xc000001080)
        /usr/local/go/src/runtime/proc.go:2696 +0xa6
runtime.mcall(0x0)
        /usr/local/go/src/runtime/asm_s390x.s:237 +0x60

goroutine 1 [running]:
        goroutine running on other thread; stack unavailable

goroutine 18 [sync.Cond.Wait]:
runtime.goparkunlock(...)
        /usr/local/go/src/runtime/proc.go:310
sync.runtime_notifyListWait(0xc0000229d0, 0x63ae03bc)
        /usr/local/go/src/runtime/sema.go:513 +0x120
sync.(*Cond).Wait(0xc0000229c0)
        /usr/local/go/src/sync/cond.go:56 +0xd6
github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog.(*asyncLoopLogger).processItem(0xc0000b1680, 0x0)
        /home/ua/go/src/github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog/behavior_asynclooplogger.go:50 +0x9e
github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog.(*asyncLoopLogger).processQueue(0xc0000b1680)
        /home/ua/go/src/github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog/behavior_asynclooplogger.go:63 +0x4e
created by github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog.NewAsyncLoopLogger
        /home/ua/go/src/github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog/behavior_asynclooplogger.go:40 +0x8a

goroutine 19 [sync.Cond.Wait]:
runtime.goparkunlock(...)
        /usr/local/go/src/runtime/proc.go:310
sync.runtime_notifyListWait(0xc000022b50, 0x63ae03bc)
        /usr/local/go/src/runtime/sema.go:513 +0x120
sync.(*Cond).Wait(0xc000022b40)
        /usr/local/go/src/sync/cond.go:56 +0xd6
github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog.(*asyncLoopLogger).processItem(0xc0000b17a0, 0x0)
        /home/ua/go/src/github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog/behavior_asynclooplogger.go:50 +0x9e
github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog.(*asyncLoopLogger).processQueue(0xc0000b17a0)
        /home/ua/go/src/github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog/behavior_asynclooplogger.go:63 +0x4e
created by github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog.NewAsyncLoopLogger
        /home/ua/go/src/github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog/behavior_asynclooplogger.go:40 +0x8a

r0   0x0        r1   0xee
r2   0x123b948  r3   0x80
r4   0x0        r5   0x0
r6   0x0        r7   0x0
r8   0x0        r9   0x18
r10  0x0        r11  0x8
r12  0x8        r13  0x123b1e0
r14  0x10992bc  r15  0x3ffeacfe420
pc   0x10ce468  link 0x10992bc
3. If I move all the plugin.Open calls to a separate goroutine, it may still hang occasionally.

The app code is:

package main

import (
        "fmt"
        "plugin"
        "sync"
)

func openPlugin(name string) {
        fmt.Printf("openPlugin Start: %s\n", name)
        _, err := plugin.Open(name)
        //fmt.Printf("openPlugin End: %s\n", name)
        if err != nil {
                //fmt.Printf("openPlugin Error: $s - %v\n", name, err)
        }
}

func main() {
        var wg sync.WaitGroup

        wg.Add(1)
        go func() {
                openPlugin("cpu.so")
                openPlugin("disk.so")
                openPlugin("diskio.so")
                openPlugin("mem.so")
                openPlugin("net.so")
                openPlugin("processes.so")
                openPlugin("procstat.so")
                openPlugin("system.so")
                openPlugin("jaeger.so")
                openPlugin("zipkin.so")
                wg.Done()
        }()
        wg.Wait()
}

The core dump is:

SIGABRT: abort
PC=0x10ce7a8 m=0 sigcode=0

goroutine 0 [idle]:
runtime.futex(0x123c948, 0x8000000000, 0x0, 0x0, 0x123c1e0, 0xc000029500, 0x10a4b22, 0x3ff46fd0ccc, 0x10a4b94, 0x123c948, ...)
        /usr/local/go/src/runtime/sys_linux_s390x.s:336 +0x28
runtime.futexsleep(0x123c948, 0x10a5efc, 0xffffffffffffffff)
        /usr/local/go/src/runtime/os_linux.go:45 +0x4c
runtime.notesleep(0x123c948)
        /usr/local/go/src/runtime/lock_futex.go:151 +0xa8
runtime.stopm()
        /usr/local/go/src/runtime/proc.go:1834 +0xd4
runtime.findrunnable(0xc000029500, 0x0)
        /usr/local/go/src/runtime/proc.go:2366 +0xa7a
runtime.schedule()
        /usr/local/go/src/runtime/proc.go:2526 +0x380
runtime.park_m(0xc000001c80)
        /usr/local/go/src/runtime/proc.go:2696 +0xa6
runtime.mcall(0x0)
        /usr/local/go/src/runtime/asm_s390x.s:237 +0x60

goroutine 1 [semacquire]:
sync.runtime_Semacquire(0xc00001c098)
        /usr/local/go/src/runtime/sema.go:56 +0x3a
sync.(*WaitGroup).Wait(0xc00001c090)
        /usr/local/go/src/sync/waitgroup.go:130 +0x88
main.main()
        /home/ua/go/src/main1/ttt.go:35 +0x78

goroutine 6 [running]:
        goroutine running on other thread; stack unavailable
created by main.main
        /home/ua/go/src/main1/ttt.go:22 +0x66

goroutine 35 [sync.Cond.Wait]:
runtime.goparkunlock(...)
        /usr/local/go/src/runtime/proc.go:310
sync.runtime_notifyListWait(0xc000022990, 0x4e0603bc)
        /usr/local/go/src/runtime/sema.go:513 +0x120
sync.(*Cond).Wait(0xc000022980)
        /usr/local/go/src/sync/cond.go:56 +0xd6
github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog.(*asyncLoopLogger).processItem(0xc0001a5560, 0x0)
        /home/ua/go/src/github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog/behavior_asynclooplogger.go:50 +0x9e
github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog.(*asyncLoopLogger).processQueue(0xc0001a5560)
        /home/ua/go/src/github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog/behavior_asynclooplogger.go:63 +0x4e
created by github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog.NewAsyncLoopLogger
        /home/ua/go/src/github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog/behavior_asynclooplogger.go:40 +0x8a

goroutine 36 [sync.Cond.Wait]:
runtime.goparkunlock(...)
        /usr/local/go/src/runtime/proc.go:310
sync.runtime_notifyListWait(0xc000022b10, 0x4e0603bc)
        /usr/local/go/src/runtime/sema.go:513 +0x120
sync.(*Cond).Wait(0xc000022b00)
        /usr/local/go/src/sync/cond.go:56 +0xd6
github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog.(*asyncLoopLogger).processItem(0xc0001a5680, 0x0)
        /home/ua/go/src/github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog/behavior_asynclooplogger.go:50 +0x9e
github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog.(*asyncLoopLogger).processQueue(0xc0001a5680)
        /home/ua/go/src/github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog/behavior_asynclooplogger.go:63 +0x4e
created by github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog.NewAsyncLoopLogger
        /home/ua/go/src/github.ibm.com/Unified-Agent/ua-plugin/vendor/github.com/cihub/seelog/behavior_asynclooplogger.go:40 +0x8a

r0   0x0        r1   0xee
r2   0x123c948  r3   0x80
r4   0x0        r5   0x0
r6   0x0        r7   0x0
r8   0x0        r9   0x17
r10  0x0        r11  0x8
r12  0x8        r13  0x123c1e0
r14  0x109959c  r15  0x3ffec67eb90
pc   0x10ce7a8  link 0x109959c

@mundaym
Member

mundaym commented Aug 3, 2020

This appears to be caused by an infinite loop in the part of the traceback code that handles inlined functions. I can 'fix' it using the following patch:

diff --git a/src/runtime/traceback.go b/src/runtime/traceback.go
index 944c8473d2..aac91e36ed 100644
--- a/src/runtime/traceback.go
+++ b/src/runtime/traceback.go
@@ -375,7 +375,12 @@ func gentraceback(pc0, sp0, lr0 uintptr, gp *g, skip int, pcbuf *uintptr, max in
                                        }
                                        lastFuncID = inltree[ix].funcID
                                        // Back up to an instruction in the "caller".
+                                       oldtracepc := tracepc
                                        tracepc = frame.fn.entry + uintptr(inltree[ix].parentPc)
+                                       if tracepc == oldtracepc {
+                                               println("error: infinite loop in inline traceback")
+                                               break
+                                       }
                                        pc = tracepc + 1
                                }
                        }

The reason the bug is intermittent is that the tracebacks are generated by the memory profiler, which is non-deterministic, and not all calls to gentraceback end up in this infinite loop (presumably they are tracing from different PCs).

@randall77 Any ideas why this might be happening? I'm not sure if this is an issue with plugins or a strange edge case in the inlined function data.
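
Since the tracebacks in question come from the allocation profiler that the runtime runs by default, one way to make the failure less intermittent (an untested suggestion, not something tried in this thread) would be to record a stack for every allocation so that gentraceback runs far more often:

package main

import "runtime"

func init() {
	// runtime.MemProfileRate defaults to sampling roughly one allocation per
	// 512 KiB allocated. Setting it to 1 records a call stack for every
	// allocation, exercising the inline-unwinding code much more heavily.
	runtime.MemProfileRate = 1
}

func main() {
	// ... plugin.Open calls as in the reproducer above ...
}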

@mundaym
Member

mundaym commented Aug 3, 2020

All Go versions have the same issue.

Note that this does not reproduce with master at tip. So Go 1.15 might not have this bug.

@liurui-1
Author

liurui-1 commented Aug 3, 2020

@mundaym
I tested versions from 1.13.5 to 1.14.6. I did not test 1.15 because this is for a production system, and on https://golang.org/dl/ 1.15 is listed as an unstable version.

@liurui-1
Author

liurui-1 commented Aug 3, 2020

@mundaym
I can try 1.15 if you are not using the build server right now. We need to consider whether 1.15 is stable enough for building a production system if it does not have this bug.

@randall77
Contributor

I don't think plugins should matter here. Unless the plugins were built by a different version of Go? I thought we had a check for that, though.

Sounds like a corner case. Could you find out what function it is? print funcname(f), ix, inltree[ix].parentPc, and anything else you can think of when you detect the loop?

Any way I could reproduce this myself? I think I would need the code for all the plugins.

@liurui-1
Author

liurui-1 commented Aug 4, 2020

@randall77
I have told @mundaym how to get the source to build zipkin.so, which is enough to reproduce this hang. Please use that source for debugging.

@mundaym
Member

mundaym commented Aug 4, 2020

I've managed to reproduce this using only open source components, if you want to try it @randall77. You will need to use Go 1.14.6, I think, on an s390x system:

tee plugin.go << PLUGIN
package main

import (
        _ "github.com/go-openapi/spec"
)

func main() {}
PLUGIN
tee main.go << MAIN
package main

import (
        "fmt"
        "os"
        "plugin"
)

func main() {
        if _, err := plugin.Open("plugin.so"); err != nil {
                fmt.Fprintf(os.Stderr, "error while loading plugin: %v\n", err)
                os.Exit(1)
        }
}
MAIN
go get -u github.com/go-openapi/spec
go build -buildmode=plugin -o plugin.so plugin.go
go build main.go
./main

The main binary will sometimes hang when run. It will be stuck in gentraceback.

@mundaym
Member

mundaym commented Aug 4, 2020

Here is some debug output:

===================
func: github.com/go-openapi/spec.init.4
ix: 0
frame.fn.entry: 0x3ff6f39d250
tracepc: 0x3ff6f39d26e
parentPc: 0x1e
===================

The assembly for spec.init.4 looks like this:

    # /home/mndy/go/plugin-bug/gopath/src/github.com/go-openapi/spec/spec.go
    00000 (36) TEXT "".init.4(SB), ABIInternal
    00001 (36) PCDATA $0, $-2
    00002 (36) PCDATA $1, $-2
    00003 (36) FUNCDATA $0, gclocals·2002e13acf59079a1a5782c918894579(SB)
    00004 (36) FUNCDATA $1, gclocals·2002e13acf59079a1a5782c918894579(SB)
    00005 (36) FUNCDATA $2, gclocals·2164b7b5825a121481ed7a83141c5b82(SB)
v5
    00006 (37) PCDATA $0, $-1
v5
    00007 (37) PCDATA $1, $-1
v5
    00008 (?) NOP
v7
    00009 (+37) PCDATA $0, $0
v7
    00010 (+37) PCDATA $1, $0
v7
    00011 (+43) CALL "".JSONSchemaDraft04(SB)
v9
    00012 (43) PCDATA $0, $1
v9
    00013 (43) MOVD 8(R15), R3
v53
    00014 (43) PCDATA $0, $2
v53
    00015 (43) MOVD 24(R15), R2
v19
    00016 (43) PCDATA $0, $3
v19
    00017 (43) MOVD 16(R15), R0
b1
    00018 (+44) CGIJ $8, R0, $0, 28
b3
    00019 (+45) PCDATA $0, $-1
b3
    00020 (+45) PCDATA $1, $-1
b3
    00021 (+45) CGIJ $8, R0, $0, 79
v74
    00022 (45) PCDATA $0, $4
v74
    00023 (45) PCDATA $1, $0
v74
    00024 (45) MOVD R0, R3
v21
    00025 (45) PCDATA $0, $5
v21
    00026 (45) MOVD 8(R3), R0
b5
    00027 (45) JMP 79
v26
    00028 (+37) PCDATA $0, $-2
v26
    00029 (+37) PCDATA $1, $-2
v26
    00030 (+37) MOVWZ runtime.writeBarrier(SB), R0
b6
    00031 (37) CIJ $6, R0, $0, 68
v15
    00032 (+37) MOVD R3, "".jsonSchema(SB)
v36
    00033 (+38) PCDATA $0, $-1
v36
    00034 (+38) PCDATA $1, $-1
v36
    00035 (?) NOP
...

v5 is the original InlMark, which had pos +37. What is interesting is that the parentPc is 30, which corresponds to the final address of v9, which is at pos 43. The InlMark is supposed to only move to an instruction with the same pos. I wonder therefore if parentPc is supposed to be pointing at v26, since that is at pos +37 and has a pre-assembly pc of 30. I'm not sure how that would happen though, since the code should be fully assembled and prog pc values set before the parentPc is generated.
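
As a sanity check on those numbers (arithmetic only, no new data): frame.fn.entry + parentPc works out to exactly the tracepc shown above, which is the "no progress" condition that the loop-detection patch earlier in the thread trips on, so the unwinder keeps landing on the same instruction.

package main

import "fmt"

func main() {
	const (
		entry    = 0x3ff6f39d250 // frame.fn.entry from the debug output
		parentPc = 0x1e          // inltree[ix].parentPc (decimal 30)
		tracepc  = 0x3ff6f39d26e // tracepc from the debug output
	)

	// "Back up to an instruction in the caller" computes entry + parentPc,
	// which here is the tracepc we started from, so the traceback never advances.
	fmt.Println(entry+parentPc == tracepc) // prints true
}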

@randall77
Contributor

I can't create a gomote to test on:

$ gomote create linux-s390x-ibm
# host type "host-linux-s390x" is not elastic; 0 of 0 machines connected, 0 busy

I can't cross compile to check the assembly, because it has cgo in it.

I'm afraid I can't help at the moment.

The InlMark is supposed to only move to an instruction with the same pos.

No, the inlMark should point to a different position. The position of the inlMark itself is in an inlined callee - the target of the inlMark should be in a caller.

Can you show the final assembly (via gnu objdump or something) and the entries that traceback hits in the inline table (via println in traceback.go)?

@mundaym
Member

mundaym commented Aug 7, 2020

Thanks for the pointers, I think I have found the issue. The InlMark optimization here executes before the assembler has run. The optimization therefore assumes that the target Prog will not be moved or removed from the function by the assembler. In this case however the assembler does remove it. This leaves the InlMarks field pointing at an orphaned Prog that was never actually assembled and therefore does not have a valid PC value. This invalid data then breaks the traceback.

The assembler removes the original Prog as part of PIC code generation. The s390x backend rewrites external references into two instruction GOT lookups. The original Prog is replaced with a NOP that is subsequently removed by the assembler from the linked list of Progs that make up the function.

The simplest fix here is, I think, to stop the s390x assembler removing NOPs from the Prog linked list. This does appear to work. Some of the other backends also appear to remove NOPs from the linked list; however, I don't know whether they are affected by this bug or not.
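
To illustrate that failure mode with a toy example (the Prog type below is a stand-in invented for this sketch, not the real cmd/internal/obj structures): the inline-mark data recorded by the compiler holds a pointer to a specific Prog, so that Prog has to stay in the list long enough for the assembler to give it a real PC; unlinking it leaves the inline-mark data pointing at an instruction that is never assembled, whereas rewriting it to a NOP in place keeps the pointer valid.

package main

import "fmt"

// Toy stand-in for the assembler's instruction list (illustrative only).
type Prog struct {
	As   string // opcode mnemonic
	Pc   int64  // assigned by the assembler; -1 means "never assembled"
	Link *Prog  // next instruction in the function
}

// assemble assigns PC values only to Progs still reachable via Link.
func assemble(fn *Prog) {
	pc := int64(0)
	for p := fn; p != nil; p = p.Link {
		p.Pc = pc
		pc += 2
	}
}

func main() {
	// head -> target -> tail, where the compiler recorded "target" as the
	// instruction that an inline mark points at.
	tail := &Prog{As: "CALL foo", Pc: -1}
	target := &Prog{As: "MOVD ext(SB), R1", Pc: -1, Link: tail}
	head := &Prog{As: "MOVD R2, R3", Pc: -1, Link: target}
	inlineMarkTarget := target

	// Buggy behavior: the backend expands the external reference into two
	// GOT lookup instructions, turns target into a NOP, and then drops NOPs
	// from the list entirely, orphaning the Prog the inline mark points at.
	got2 := &Prog{As: "MOVD 0(R1), R1", Pc: -1, Link: tail}
	got1 := &Prog{As: "LOAD GOT(ext), R1", Pc: -1, Link: got2}
	head.Link = got1 // target is no longer in the list

	assemble(head)
	fmt.Println(inlineMarkTarget.Pc) // -1: the inline table ends up with a bogus PC

	// The fix described above instead keeps target in the list as a NOP
	// (target.As = "NOP"), so it is assembled and keeps a valid PC.
}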

Longer term I'm not sure what the contract between the compiler and the assembler should be with respect to Prog values and the relationship between them. The assembler does do some minor optimizations and, in general, all the backends modify the Prog list significantly as they process it. Changing that is a fair amount of work but could be done.

One thing to consider for Go 1.16 is whether we should add a new pseudo instruction for inlined function marks. This could be treated the same way as the other pseudo ops we have like PCDATA and would therefore be more resilient to modifications made by the assembler. Like PCDATA it would act as a marker for the following, real, instruction at its PC value.

PS: gomote wasn't working because of weather issues at the location that hosts the build machine. They should all be working again now.

@randall77
Contributor

Thanks for the excellent diagnosis.

The assembler removes the original Prog as part of PIC code generation. The s390x backend rewrites external references into two instruction GOT lookups.

Can the original Prog be repurposed as one of the two GOT lookup instructions? That way it would remain a valid target.
Another technique (x86 uses this) is to rewrite the original to a nop, but leave it in the Prog list.

Longer term I'm not sure what the contract between the compiler and the assembler should be with respect to Prog values and the relationship between them. The assembler does do some minor optimizations and, in general, all the backends modify the Prog list significantly as they process it. Changing that is a fair amount of work but could be done.

We've been moving away from the assembler doing any optimizations (e.g. we removed instruction reordering a while ago), now that the compiler is pretty good. If we can get rid of optimizations (and do them in the compiler instead), that would be good.
But yes, the assembler will continue to need to do rewrites. We should definitely try to make it clear what the Prog contract should be, and stick to it.

One thing to consider for Go 1.16 is whether we should add a new pseudo instruction for inlined function marks. This could be treated the same way as the other pseudo ops we have like PCDATA and would therefore be more resilient to modifications made by the assembler. Like PCDATA it would act as a marker for the following, real, instruction at its PC value.

The inlmark nops should be real nops, that generate actual code. We should never remove them. So I think they would need to be their own op, not a marker for a following op. (Because what if there are 2 of them in a row?)

@randall77
Contributor

What, if anything, do you think we need to fix for 1.15?

@mundaym
Member

mundaym commented Aug 7, 2020

Can the original Prog be repurposed as one of the two GOT lookup instructions? That way it would remain a valid target.
Another technique (x86 uses this) is to rewrite the original to a nop, but leave it in the Prog list.

Yeah, either would work. The second way is what I've implemented since it is a simpler change in my opinion.

The inlmark nops should be real nops, that generate actual code. We should never remove them. So I think they would need to be their own op, not a marker for a following op. (Because what if there are 2 of them in a row?)

I think it would be relatively easy to ensure and enforce that no two inline markers point to the same PC value. The advantage of it being a marker op rather than a code generating op would be that we could continue to use pre-existing instructions as inline marker targets and not just additional real nops. We'd still need the compiler to ensure a valid instruction followed the marker op.

What, if anything, do you think we need to fix for 1.15?

I'd like to do the small fix to the s390x backend that is suitable for backporting to 1.15.x and 1.14.x release branches to fix this specific bug. This isn't a recent regression so I don't think we should rush a fix into 1.15 if it is going to be released very soon, though if it does make it then great. Then for 1.16 maybe we try something more comprehensive.

@gopherbot

Change https://golang.org/cl/247697 mentions this issue: cmd/internal/obj: fix inline marker issue on s390x

@mundaym mundaym removed the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Aug 11, 2020
@mundaym mundaym modified the milestones: Go1.15.1, Go1.16 Aug 11, 2020
@mundaym
Member

mundaym commented Aug 11, 2020

@gopherbot please open backport issues.

This is a bug that causes programs to intermittently crash with no workaround.

@gopherbot

gopherbot commented Aug 11, 2020

Backport issue(s) opened: #40693 (for 1.15), #40694 (for 1.14).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

@andybons andybons added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Aug 11, 2020
@mundaym
Member

mundaym commented Aug 20, 2020

Re-opening because the fix (CL 247697) was temporarily reverted in CL 248718 due to test failures on arm (issue #40796). That issue has now been resolved, so we should be OK to resubmit the fix for this issue on s390x.

@mundaym mundaym reopened this Aug 20, 2020
@gopherbot

Change https://golang.org/cl/249448 mentions this issue: cmd/internal/obj: fix inline marker issue on s390x

@dmitshur dmitshur changed the title plugin: Golang app on linux/s390x sometimes hangs after calling "plugin.Open" plugin: program on linux/s390x sometimes hangs after calling "plugin.Open" Sep 3, 2020
@golang golang locked and limited conversation to collaborators Sep 3, 2021