runtime: morestack_noctxt missing SPWRITE, causes "traceback stuck" assert #54332

Closed
lizthegrey opened this issue Aug 8, 2022 · 68 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.

@lizthegrey

lizthegrey commented Aug 8, 2022

What version of Go are you using (go version)?

$ go version
go version go1.18.5 linux/arm64

Does this issue reproduce with the latest release?

yes, repros on go1.19.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="arm64"
GOBIN=""
GOCACHE="/home/lizf/.cache/go-build"
GOENV="/home/lizf/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/lizf/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/lizf/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_arm64"
GOVCS=""
GOVERSION="go1.18.5"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build3053713255=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Run continuous profiling via scraping /debug/pprof endpoints including mutex and blocking profiles, while performing a workload consuming from Kafka using Sarama and Klauspost Zstd libraries.

  runtime.SetBlockProfileRate(100000)
  runtime.SetMutexProfileFraction(100)

What did you expect to see?

No crashes.

What did you see instead?

The following crashes implicating unexpected return pc for runtime.sigtramp and traceback stuck (assert added in https://go-review.googlesource.com/c/go/+/400575/ by @cherrymui / #52116 which was backported to go1.18.5):

runtime: traceback stuck. pc=0x12eb5 sp=0x4019365840
stack: frame={sp:0x4019365840, fp:0x4019365840} stack=[0x4019364000,0x4019366000)
0x0000004019365740:  0x0000000000ba9494 <github.com/Shopify/sarama.releaseCrc32Field+0x0000000000000034>  0x00000040193657b8 
0x0000004019365750:  0x0000000000bcd8e0 <github.com/Shopify/sarama.(*realDecoder).pop+0x0000000000000060>  0x00000040193657c8 
0x0000004019365760:  0x0000000000000000  0x0000000000000000 
0x0000004019365770:  0x0000000000000700  0x0000000001068260 
0x0000004019365780:  0x00000000010ab360  0x0000000000005701 
0x0000004019365790:  0x0000000000000000  0x0000000000000000 
0x00000040193657a0:  0x0000000000000000  0x0000004000601c00 
0x00000040193657b0:  0x000000401c3c7100  0x00000040193657f8 
0x00000040193657c0:  0x0000000000bd070c <github.com/Shopify/sarama.(*RecordBatch).decode+0x000000000000037c>  0x000000400ff4e0f0 
0x00000040193657d0:  0x0000000000bd064c <github.com/Shopify/sarama.(*RecordBatch).decode+0x00000000000002bc>  0x0000004026c7805f 
0x00000040193657e0:  0x0000000000107de8 <strings.HasSuffix+0x0000000000000058>  0x0000000000000000 
0x00000040193657f0:  0x0000000000000000  0x00000040193658c8 
0x0000004019365800:  0x0000000000bd0eac <github.com/Shopify/sarama.(*Records).decode+0x000000000000009c>  0x0000004017c948c0 
0x0000004019365810:  0x0000004026d6cf92  0x0000000000012eb5 <internal/bytealg.IndexByteString+0x0000000000000005> 
0x0000004019365820:  0x0000000000012eb5 <internal/bytealg.IndexByteString+0x0000000000000005>  0x0000004019365898 
0x0000004019365830:  0x00012ee60001ce60  0x0000000000012eb5 <internal/bytealg.IndexByteString+0x0000000000000005> 
0x0000004019365840: >0x0000000000012eb5 <internal/bytealg.IndexByteString+0x0000000000000005>  0x00000000000000d7 
0x0000004019365850:  0x0000000000000000  0x0000004019365800 
0x0000004019365860:  0x0000000000bd0474 <github.com/Shopify/sarama.(*RecordBatch).decode+0x00000000000000e4>  0x00000040193658b8 
0x0000004019365870:  0x0000000000000000  0x0000000000000000 
0x0000004019365880:  0x0000000000000000  0x0000000000000000 
0x0000004019365890:  0x0000000000000000  0x0000004026d6cf92 
0x00000040193658a0:  0x000000400ff4e0f0  0x0000000000000000 
0x00000040193658b0:  0x0000000000000000  0x0000000000bd09c0 <github.com/Shopify/sarama.(*RecordBatch).decode.func1+0x0000000000000000> 
0x00000040193658c0:  0x000000400ff4e0f0  0x0000004019365918 
0x00000040193658d0:  0x0000000000bb6f1c <github.com/Shopify/sarama.(*FetchResponseBlock).decode+0x000000000000033c>  0x0000004012612c80 
0x00000040193658e0:  0x000000000154bfa8  0x0000004017c948c0 
0x00000040193658f0:  0x0000000000bb6f0c <github.com/Shopify/sarama.(*FetchResponseBlock).decode+0x000000000000032c>  0x0000000000000018 
0x0000004019365900:  0x00000000010d87c0  0x0000004000048301 
0x0000004019365910:  0x0000004017c94a40  0x00000040193659c8 
0x0000004019365920:  0x0000000000bb78e8 <github.com/Shopify/sarama.(*FetchResponse).decode+0x00000000000002a8>  0x0000004009b9fc38 
0x0000004019365930:  0x000000000154bfa8  0x0000004017c948c0 
fatal error: traceback stuck

goroutine 0 [idle]:
runtime: unexpected return pc for runtime.sigtramp called from 0xffff985dc5c0
stack: frame={sp:0x400eacacd0, fp:0x400eacada0} stack=[0x400eac4000,0x400eacc000)
0x000000400eacabd0:  0x000000400eacac28  0x02c595e603246a9e 
0x000000400eacabe0:  0x0000000002c595e5  0x000000400eaba000 
0x000000400eacabf0:  0x0000000000000000  0x000000400eac9fe0 
0x000000400eacac00:  0x0000004013a80000  0x000000400eacada0 
0x000000400eacac10:  0x000000400eacae20  0x000000400eacac98 
0x000000400eacac20:  0x000000000007d58c <runtime.sigtrampgo+0x000000000000001c>  0x000000400000001b 
0x000000400eacac30:  0x0000004000601c00  0x000000400eacac58 
0x000000400eacac40:  0x0000004000603040  0x02bcec7502c540fb 
0x000000400eacac50:  0x00246aa103246aa0  0x0000000000000000 
0x000000400eacac60:  0x0000000000000000  0x0000000000000000 
0x000000400eacac70:  0x0000000000000000  0x0000000000000000 
0x000000400eacac80:  0x0000004000603040  0x000000400eacada0 
0x000000400eacac90:  0x000000400eacae20  0x000000400eacacc8 
0x000000400eacaca0:  0x000000000007cea4 <runtime.sigtramp+0x0000000000000084>  0x000000000000001b 
0x000000400eacacb0:  0x000000400eacada0  0x000000400eacae20 
0x000000400eacacc0:  0x02c5956e03246a9e  0x000000400eacbff0 
0x000000400eacacd0: <0x0000ffff985dc5c0  0x000000400000001b 
0x000000400eacace0:  0x000000400eacada0  0x000000400eacae20 
0x000000400eacacf0:  0x000000000005abba <runtime.selparkcommit+0x000000000000000a>  0x1265726f74537070 
0x000000400eacad00:  0x021210eed9a50809  0x623824220118000a 
0x000000400eacad10:  0x302d333733663636  0x0000000000000015 
0x000000400eacad20:  0x0000000000000071  0x000000400b753e98 
0x000000400eacad30:  0x000000000153a0c0  0x0000004000603040 
0x000000400eacad40:  0x000000400eacacc8  0x3e66376972bea4d0 
0x000000400eacad50:  0xbebbbd41c5d26bf1  0x3f11566aaf25de2c 
0x000000400eacad60:  0xbf66c16c16bebd93  0x3fc5555555555555 
0x000000400eacad70:  0x3fcde3175e2d734a  0x3ffc439d143a5197 
0x000000400eacad80:  0xbfd1a21a8e7d2cb5  0x0000000000000000 
0x000000400eacad90:  0x0404040202020303  0x0000000000000000 
0x000000400eacada0: >0x000000000000001b  0x00000000fffffffe 
0x000000400eacadb0:  0x0000000000000466  0x0000000000000000 
0x000000400eacadc0:  0x0000000000000000  0x0000000000000000 
0x000000400eacadd0:  0x0000000000000000  0x0000000000000000 
0x000000400eacade0:  0x0000000000000000  0x0000000000000000 
0x000000400eacadf0:  0x0000000000000000  0x0000000000000000 
0x000000400eacae00:  0x0000000000000000  0x0000000000000000 
0x000000400eacae10:  0x0000000000000000  0x0000000000000000 
0x000000400eacae20:  0x0000000000000000  0x0000000000000000 
0x000000400eacae30:  0x000000400eac4000  0x0000004000000000 
0x000000400eacae40:  0x0000000000008000  0x0000000000000000 
0x000000400eacae50:  0x0000000000000000  0x0000000000000000 
0x000000400eacae60:  0x0000000000000000  0x0000000000000000 
0x000000400eacae70:  0x00002a1c49103b0c  0x02bcec6f02b9baa6 
0x000000400eacae80:  0x02b9baaf03246a9f  0x02c5956e02bcec75 
0x000000400eacae90:  0x0000000000000000  0x0000004013a805d0 
runtime.throw({0x1262952?, 0x1fae6c0?})
	/usr/local/go/src/runtime/panic.go:992 +0x50
runtime.gentraceback(0x203ad80?, 0x400eac9a90?, 0x400eaca968?, 0x4000603040, 0x0, 0x400eaca968, 0x40, 0x0, 0x0?, 0x6)
	/usr/local/go/src/runtime/traceback.go:488 +0xd98
runtime.sigprof(0x79650, 0x400eacac08?, 0x4013a80010?, 0x4000603040, 0x4000601c00)
	/usr/local/go/src/runtime/proc.go:4507 +0xf0
runtime.sighandler(0x1b?, 0x4000601c00?, 0x400eacac58?, 0x4000603040?)
	/usr/local/go/src/runtime/signal_unix.go:613 +0x514
runtime.sigtrampgo(0x1b, 0x400eacada0, 0x400eacae20)
	/usr/local/go/src/runtime/signal_unix.go:477 +0x164
runtime.sigtrampgo(0x1b, 0x400eacada0, 0x400eacae20)
	<autogenerated>:1 +0x1c
runtime: unexpected return pc for runtime.sigtramp called from 0xffff985dc5c0
stack: frame={sp:0x400eacacd0, fp:0x400eacada0} stack=[0x400eac4000,0x400eacc000)
0x000000400eacabd0:  0x000000400eacac28  0x02c595e603246a9e 
0x000000400eacabe0:  0x0000000002c595e5  0x000000400eaba000 
0x000000400eacabf0:  0x0000000000000000  0x000000400eac9fe0 
0x000000400eacac00:  0x0000004013a80000  0x000000400eacada0 
0x000000400eacac10:  0x000000400eacae20  0x000000400eacac98 
0x000000400eacac20:  0x000000000007d58c <runtime.sigtrampgo+0x000000000000001c>  0x000000400000001b 
0x000000400eacac30:  0x0000004000601c00  0x000000400eacac58 
0x000000400eacac40:  0x0000004000603040  0x02bcec7502c540fb 
0x000000400eacac50:  0x00246aa103246aa0  0x0000000000000000 
0x000000400eacac60:  0x0000000000000000  0x0000000000000000 
0x000000400eacac70:  0x0000000000000000  0x0000000000000000 
0x000000400eacac80:  0x0000004000603040  0x000000400eacada0 
0x000000400eacac90:  0x000000400eacae20  0x000000400eacacc8 
0x000000400eacaca0:  0x000000000007cea4 <runtime.sigtramp+0x0000000000000084>  0x000000000000001b 
0x000000400eacacb0:  0x000000400eacada0  0x000000400eacae20 
0x000000400eacacc0:  0x02c5956e03246a9e  0x000000400eacbff0 
0x000000400eacacd0: <0x0000ffff985dc5c0  0x000000400000001b 
0x000000400eacace0:  0x000000400eacada0  0x000000400eacae20 
0x000000400eacacf0:  0x000000000005abba <runtime.selparkcommit+0x000000000000000a>  0x1265726f74537070 
0x000000400eacad00:  0x021210eed9a50809  0x623824220118000a 
0x000000400eacad10:  0x302d333733663636  0x0000000000000015 
0x000000400eacad20:  0x0000000000000071  0x000000400b753e98 
0x000000400eacad30:  0x000000000153a0c0  0x0000004000603040 
0x000000400eacad40:  0x000000400eacacc8  0x3e66376972bea4d0 
0x000000400eacad50:  0xbebbbd41c5d26bf1  0x3f11566aaf25de2c 
0x000000400eacad60:  0xbf66c16c16bebd93  0x3fc5555555555555 
0x000000400eacad70:  0x3fcde3175e2d734a  0x3ffc439d143a5197 
0x000000400eacad80:  0xbfd1a21a8e7d2cb5  0x0000000000000000 
0x000000400eacad90:  0x0404040202020303  0x0000000000000000 
0x000000400eacada0: >0x000000000000001b  0x00000000fffffffe 
0x000000400eacadb0:  0x0000000000000466  0x0000000000000000 
0x000000400eacadc0:  0x0000000000000000  0x0000000000000000 
0x000000400eacadd0:  0x0000000000000000  0x0000000000000000 
0x000000400eacade0:  0x0000000000000000  0x0000000000000000 
0x000000400eacadf0:  0x0000000000000000  0x0000000000000000 
0x000000400eacae00:  0x0000000000000000  0x0000000000000000 
0x000000400eacae10:  0x0000000000000000  0x0000000000000000 
0x000000400eacae20:  0x0000000000000000  0x0000000000000000 
0x000000400eacae30:  0x000000400eac4000  0x0000004000000000 
0x000000400eacae40:  0x0000000000008000  0x0000000000000000 
0x000000400eacae50:  0x0000000000000000  0x0000000000000000 
0x000000400eacae60:  0x0000000000000000  0x0000000000000000 
0x000000400eacae70:  0x00002a1c49103b0c  0x02bcec6f02b9baa6 
0x000000400eacae80:  0x02b9baaf03246a9f  0x02c5956e02bcec75 
0x000000400eacae90:  0x0000000000000000  0x0000004013a805d0 
runtime.sigtramp()
	/usr/local/go/src/runtime/sys_linux_arm64.s:489 +0x84

goroutine 89 [running]:
runtime: traceback stuck. pc=0x3ff93 sp=0x40048952c0
stack: frame={sp:0x40048952c0, fp:0x40048952c0} stack=[0x4004894000,0x4004896000)
0x00000040048951c0:  0x0000000000000040  0x00000040048951e8 
0x00000040048951d0:  0x00000000005019a4 <github.com/klauspost/compress/zstd.(*sequenceDec).init+0x0000000000000054>  0x0000004004895310 
0x00000040048951e0:  0x00000040008c6000  0x0000004004895228 
0x00000040048951f0:  0x0000000000501b9c <github.com/klauspost/compress/zstd.(*sequenceDecs).initialize+0x000000000000016c>  0x0000004004895338 
0x0000004004895200:  0x00000000004e2b84 <github.com/klauspost/compress/zstd.(*blockDec).decodeCompressed+0x0000000000000164>  0x0000004000530058 
0x0000004004895210:  0x00000040002e22a0  0x00000000006c0001 <text/template/parse.(*endNode).Copy+0x0000000000000001> 
0x0000004004895220:  0x0000000f00800005  0x000001fd00280903 
0x0000004004895230:  0x00000000004e2ebc <github.com/klauspost/compress/zstd.(*blockDec).prepareSequences+0x000000000000025c>  0x000000000003ff93 <runtime.goroutineProfileWithLabels.func4.1+0x0000000000000003> 
0x0000004004895240:  0x00000000000ee1a4 <os.UserHomeDir+0x00000000000000c4>  0x0000000000000811 
0x0000004004895250:  0x0000000000000015  0x0000000000020000 <runtime.advanceEvacuationMark+0x0000000000000000> 
0x0000004004895260:  0x00000000006c0001 <text/template/parse.(*endNode).Copy+0x0000000000000001>  0x0000000000000000 
0x0000004004895270:  0x0000000000000006  0x0000000f00800005 
0x0000004004895280:  0x000001fd00280903  0x0000000000000000 
0x0000004004895290:  0x000000400720ea00  0x0000004004895338 
0x00000040048952a0:  0x00000000004e2a78 <github.com/klauspost/compress/zstd.(*blockDec).decodeCompressed+0x0000000000000058>  0x0000004000530058 
0x00000040048952b0:  0x00000040002e22a0  0x00000000000ee1a4 <os.UserHomeDir+0x00000000000000c4> 
0x00000040048952c0: >0x000000000003ff93 <runtime.goroutineProfileWithLabels.func4.1+0x0000000000000003>  0x0000000000020000 <runtime.advanceEvacuationMark+0x0000000000000000> 
0x00000040048952d0:  0x0000004009f42000  0x00000040002e22a0 
0x00000040048952e0:  0x0000000000001a81  0x0000004009f42000 
0x00000040048952f0:  0x0000004005b17500  0x000000400720ea00 
0x0000004004895300:  0x0000004004a28a00  0x00000040002e22a0 
0x0000004004895310:  0x0000004003bcf000  0x000000400885150c 
0x0000004004895320:  0x000000000000467b  0x000000000002ba01 <runtime.gcStart+0x0000000000000701> 
0x0000004004895330:  0x000000000000002e  0x0000004004895398 
0x0000004004895340:  0x00000000004e1974 <github.com/klauspost/compress/zstd.(*blockDec).decodeBuf+0x0000000000000284>  0x0000004000530058 
0x0000004004895350:  0x0000000000000000  0x0000000000000000 
0x0000004004895360:  0x0000000000000000  0x0000004000530050 
0x0000004004895370:  0x0000000000000000  0x0000000000000000 
0x0000004004895380:  0x0000000000000000  0x0000000000020000 <runtime.advanceEvacuationMark+0x0000000000000000> 
0x0000004004895390:  0x0000000000000000  0x0000004004895458 
0x00000040048953a0:  0x00000000004fd2cc <github.com/klauspost/compress/zstd.(*frameDec).runDecoder+0x000000000000016c>  0x000000400058c000 
0x00000040048953b0:  0x0000004000530050  0x0000000000000000 
fatal error: traceback stuck
goroutine 0 [idle]:
runtime: unexpected return pc for runtime.sigtramp called from 0xffffabfd25c0
stack: frame={sp:0x400000acd0, fp:0x400000ada0} stack=[0x4000004000,0x400000c000)
0x000000400000abd0:  0x000000400000ac28  0x0000000000000000 
0x000000400000abe0:  0x0000000000000000  0x0000004000002000 
0x000000400000abf0:  0x0000000000000000  0x0000000000000000 
0x000000400000ac00:  0x0000000000000000  0x000000400000ada0 
0x000000400000ac10:  0x000000400000ae20  0x000000400000ac98 
0x000000400000ac20:  0x000000000007d58c <runtime.sigtrampgo+0x000000000000001c>  0x000000400000001b 
0x000000400000ac30:  0x0000000002046680  0x000000400000ac58 
0x000000400000ac40:  0x0000004000521d40  0x0000000000000000 
0x000000400000ac50:  0x0000000000000000  0x0000000000000000 
0x000000400000ac60:  0x0000000000000000  0x0000000000000000 
0x000000400000ac70:  0x0000000000000000  0x0000000000000000 
0x000000400000ac80:  0x0000004000521d40  0x000000400000ada0 
0x000000400000ac90:  0x000000400000ae20  0x000000400000acc8 
0x000000400000aca0:  0x000000000007cea4 <runtime.sigtramp+0x0000000000000084>  0x000000000000001b 
0x000000400000acb0:  0x000000400000ada0  0x000000400000ae20 
0x000000400000acc0:  0x0000000000000000  0x000000400000bff0 
0x000000400000acd0: <0x0000ffffabfd25c0  0x000000000000001b 
0x000000400000ace0:  0x000000400000ada0  0x000000400000ae20 
0x000000400000acf0:  0x0000000000000009  0x0000000000000015 
0x000000400000ad00:  0x0000000000000000  0x00000040008c9fab 
0x000000400000ad10:  0x000000000001ff93 <runtime.evacuate+0x00000000000004b3>  0x0000000000000009 
0x000000400000ad20:  0x000000000000032d  0x00000040048953d0 
0x000000400000ad30:  0x0000000002078100  0x0000004000521d40 
0x000000400000ad40:  0x000000400000acc8  0x3e66376972bea4d0 
0x000000400000ad50:  0xbebbbd41c5d26bf1  0x3f11566aaf25de2c 
0x000000400000ad60:  0xbf66c16c16bebd93  0x3fc5555555555555 
0x000000400000ad70:  0x3fb8fcd2e18d7b62  0x3ffe7032d1e7284a 
0x000000400000ad80:  0xbfbab6204490a1e0  0x0000000000000000 
0x000000400000ad90:  0x0000000000000000  0x0000000000000000 
0x000000400000ada0: >0x000000000000001b  0x00000000fffffffe 
0x000000400000adb0:  0x0000000000004d16  0x0000000000000000 
0x000000400000adc0:  0x0000000000000000  0x0000000000000000 
0x000000400000add0:  0x0000000000000000  0x0000000000000000 
0x000000400000ade0:  0x0000000000000000  0x0000000000000000 
0x000000400000adf0:  0x0000000000000000  0x0000000000000000 
0x000000400000ae00:  0x0000000000000000  0x0000000000000000 
0x000000400000ae10:  0x0000000000000000  0x0000000000000000 
0x000000400000ae20:  0x0000000000000000  0x0000000000000000 
0x000000400000ae30:  0x0000004000004000  0x0000000000000000 
0x000000400000ae40:  0x0000000000008000  0x0000000000000000 
0x000000400000ae50:  0x0000000000000000  0x0000000000000000 
0x000000400000ae60:  0x0000000000000000  0x0000000000000000 
0x000000400000ae70:  0x0000000000000000  0x0000000000000000 
0x000000400000ae80:  0x0000000000000000  0x0000000000000000 
0x000000400000ae90:  0x0000000000000000  0x0000000000000000 
runtime.throw({0x1275238?, 0x1fbe760?})
    /usr/local/go/src/runtime/panic.go:992 +0x50
runtime.gentraceback(0x204afc0?, 0x0?, 0x400000a968?, 0x4000521d40, 0x0, 0x400000a968, 0x40, 0x0, 0x0?, 0x6)
    /usr/local/go/src/runtime/traceback.go:488 +0xd98
runtime.sigprof(0x79650, 0x400000ac08?, 0x0?, 0x4000521d40, 0x2046680)
    /usr/local/go/src/runtime/proc.go:4507 +0xf0
runtime.sighandler(0x1b?, 0x2046680?, 0x400000ac58?, 0x4000521d40?)
    /usr/local/go/src/runtime/signal_unix.go:613 +0x514
runtime.sigtrampgo(0x1b, 0x400000ada0, 0x400000ae20)
    /usr/local/go/src/runtime/signal_unix.go:477 +0x164
runtime.sigtrampgo(0x1b, 0x400000ada0, 0x400000ae20)
    <autogenerated>:1 +0x1c
runtime: unexpected return pc for runtime.sigtramp called from 0xffffabfd25c0
stack: frame={sp:0x400000acd0, fp:0x400000ada0} stack=[0x4000004000,0x400000c000)
0x000000400000abd0:  0x000000400000ac28  0x0000000000000000 
0x000000400000abe0:  0x0000000000000000  0x0000004000002000 
0x000000400000abf0:  0x0000000000000000  0x0000000000000000 
0x000000400000ac00:  0x0000000000000000  0x000000400000ada0 
0x000000400000ac10:  0x000000400000ae20  0x000000400000ac98 
0x000000400000ac20:  0x000000000007d58c <runtime.sigtrampgo+0x000000000000001c>  0x000000400000001b 
0x000000400000ac30:  0x0000000002046680  0x000000400000ac58 
0x000000400000ac40:  0x0000004000521d40  0x0000000000000000 
0x000000400000ac50:  0x0000000000000000  0x0000000000000000 
0x000000400000ac60:  0x0000000000000000  0x0000000000000000 
0x000000400000ac70:  0x0000000000000000  0x0000000000000000 
0x000000400000ac80:  0x0000004000521d40  0x000000400000ada0 
0x000000400000ac90:  0x000000400000ae20  0x000000400000acc8 
0x000000400000aca0:  0x000000000007cea4 <runtime.sigtramp+0x0000000000000084>  0x000000000000001b 
0x000000400000acb0:  0x000000400000ada0  0x000000400000ae20 
0x000000400000acc0:  0x0000000000000000  0x000000400000bff0 
0x000000400000acd0: <0x0000ffffabfd25c0  0x000000000000001b 
0x000000400000ace0:  0x000000400000ada0  0x000000400000ae20 
0x000000400000acf0:  0x0000000000000009  0x0000000000000015 
0x000000400000ad00:  0x0000000000000000  0x00000040008c9fab 
0x000000400000ad10:  0x000000000001ff93 <runtime.evacuate+0x00000000000004b3>  0x0000000000000009 
0x000000400000ad20:  0x000000000000032d  0x00000040048953d0 
0x000000400000ad30:  0x0000000002078100  0x0000004000521d40 
0x000000400000ad40:  0x000000400000acc8  0x3e66376972bea4d0 
0x000000400000ad50:  0xbebbbd41c5d26bf1  0x3f11566aaf25de2c 
0x000000400000ad60:  0xbf66c16c16bebd93  0x3fc5555555555555 
0x000000400000ad70:  0x3fb8fcd2e18d7b62  0x3ffe7032d1e7284a 
0x000000400000ad80:  0xbfbab6204490a1e0  0x0000000000000000 
0x000000400000ad90:  0x0000000000000000  0x0000000000000000 
0x000000400000ada0: >0x000000000000001b  0x00000000fffffffe 
0x000000400000adb0:  0x0000000000004d16  0x0000000000000000 
0x000000400000adc0:  0x0000000000000000  0x0000000000000000 
0x000000400000add0:  0x0000000000000000  0x0000000000000000 
0x000000400000ade0:  0x0000000000000000  0x0000000000000000 
0x000000400000adf0:  0x0000000000000000  0x0000000000000000 
0x000000400000ae00:  0x0000000000000000  0x0000000000000000 
0x000000400000ae10:  0x0000000000000000  0x0000000000000000 
0x000000400000ae20:  0x0000000000000000  0x0000000000000000 
0x000000400000ae30:  0x0000004000004000  0x0000000000000000 
0x000000400000ae40:  0x0000000000008000  0x0000000000000000 
0x000000400000ae50:  0x0000000000000000  0x0000000000000000 
0x000000400000ae60:  0x0000000000000000  0x0000000000000000 
0x000000400000ae70:  0x0000000000000000  0x0000000000000000 
0x000000400000ae80:  0x0000000000000000  0x0000000000000000 
0x000000400000ae90:  0x0000000000000000  0x0000000000000000 
runtime.sigtramp()
    /usr/local/go/src/runtime/sys_linux_arm64.s:489 +0x84
goroutine 197 [running]:
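
As a reading aid for the dumps above: each "traceback stuck" report shows `frame={sp:X, fp:X}` with the same value for both, and the stack word at `sp` (marked `>`) equals the reported `pc`. A toy model of that no-progress condition, assuming it amounts to "the caller frame would have the same PC and SP as the current frame", could be sketched as below. The `frame` type and the `caller` and `isStuck` helpers are illustrative names, not the runtime's code, and the second frame's values are made up for contrast.

```go
package main

import "fmt"

// frame is a simplified stand-in for an unwinder's per-frame state.
type frame struct {
	pc, sp uint64 // current program counter and stack pointer
	lr, fp uint64 // return address and the caller's stack pointer
}

// caller returns the frame the toy unwinder would visit next.
func caller(f frame) frame { return frame{pc: f.lr, sp: f.fp} }

// isStuck reports whether stepping to the caller makes no progress,
// i.e. the next frame has the same PC and SP as the current one.
func isStuck(f frame) bool {
	next := caller(f)
	return next.pc == f.pc && next.sp == f.sp
}

func main() {
	// Values from the first dump: pc=0x12eb5, sp=fp=0x4019365840,
	// and the word stored at sp is again 0x12eb5.
	bad := frame{pc: 0x12eb5, sp: 0x4019365840, lr: 0x12eb5, fp: 0x4019365840}
	fmt.Println(isStuck(bad)) // true

	// Hypothetical healthy frame: the caller has a different PC and a higher SP.
	ok := frame{pc: 0x1000, sp: 0x4000000100, lr: 0x2000, fp: 0x4000000140}
	fmt.Println(isStuck(ok)) // false
}
```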
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Aug 8, 2022
@prattmic prattmic added this to the Go1.18.6 milestone Aug 8, 2022
@prattmic
Member

prattmic commented Aug 8, 2022

Given that you commented on #52116, I assume that this application was also suffering from that bug before this (seemingly incomplete) fix?

@cherrymui
Member

This stack looks weird. 0x12eb5 is not even an aligned PC (ARM64 instructions are always 4 bytes). Why would it appear on the stack? 0x3ff93 from another stack trace is similarly weird.

Maybe they are not PCs? Given that a nearby stack word has content 0x00012ee60001ce60, maybe 0x12eb5, 0x12ee6, 0x1ce60 are not PCs but some (32-bit?) integers on stack?

On both stack traces the SIGPROF lands on PC 0x79650. What is that PC? Is it a Go function? Thanks.

@lizthegrey
Author

lizthegrey commented Aug 8, 2022

> Given that you commented on #52116, I assume that this application was also suffering from that bug before this (seemingly incomplete) fix?

It was. But we backported the fix and it was working fine. The new change for us is turning on not just CPU/mem profiling, but block profiling too.

@lizthegrey
Author

lizthegrey commented Aug 8, 2022

> This stack looks weird. 0x12eb5 is not even an aligned PC (ARM64 instructions are always 4 bytes). Why would it appear on the stack? 0x3ff93 from another stack trace is similarly weird.
>
> Maybe they are not PCs? Given that a nearby stack word has content 0x00012ee60001ce60, maybe 0x12eb5, 0x12ee6, 0x1ce60 are not PCs but some (32-bit?) integers on stack?
>
> On both stack traces the SIGPROF lands on PC 0x79650. What is that PC? Is it a Go function? Thanks.

$ go tool addr2line retriever
0x79650
runtime.morestack_noctxt
/usr/local/go/src/runtime/asm_arm64.s:321

@lizthegrey
Author

> Given that you commented on #52116, I assume that this application was also suffering from that bug before this (seemingly incomplete) fix?

It was. But we backported the fix and it was working fine. The new change for us is turning on not just CPU/mem profiling, but block profiling too.

Addendum: I just trawled errors, and indeed, we do have significant numbers of "traceback stuck" dating back multiple months, just in applications that were able to restart after crashing without coming to my attention.

@lizthegrey
Author

Here are some additional crashes from other applications:

$ go tool addr2line barbet
0x79410
0x79414
runtime.morestack_noctxt
/usr/local/go/src/runtime/asm_arm64.s:321
runtime.morestack_noctxt
/usr/local/go/src/runtime/asm_arm64.s:322
$ go tool addr2line beagle
0x79410
runtime.morestack_noctxt
/usr/local/go/src/runtime/asm_arm64.s:321
$ go tool addr2line shepherd
0x790a0
0x791d0
0x791d4
runtime.systemstack
/usr/local/go/src/runtime/asm_arm64.s:204
runtime.morestack_noctxt
/usr/local/go/src/runtime/asm_arm64.s:321
runtime.morestack_noctxt
/usr/local/go/src/runtime/asm_arm64.s:322

@lizthegrey
Author

lizthegrey commented Aug 9, 2022

https://github.com/golang/go/blob/go1.18.5/src/runtime/asm_arm64.s#L321 -- culprit lines here.

similarity to #52829, does this need to not have NOFRAME ie https://go-review.googlesource.com/c/go/+/405482/?

@lizthegrey
Author

Also repros with go1.19.0

runtime: traceback stuck. pc=0x58e9e0 sp=0x4035cfada0
stack: frame={sp:0x4035cfada0, fp:0x4035cfada0} stack=[0x4035cf8000,0x4035cfc000)
0x0000004035cfaca0:  0x0000004047098bd0  0x000000010001aae8 
0x0000004035cfacb0:  0xc90000000001bfb0  0x0000ffff94c3f878 
0x0000004035cfacc0:  0xc9c4a878b22454ee  0x0000000000000006 
0x0000004035cfacd0:  0x000000000058db7c <github.com/honeycombio/libhoney-go.(*fieldHolder).AddField+0x000000000000004c>  0x000000403cd55520 
0x0000004035cface0:  0x000000000058dbf0 <github.com/honeycombio/libhoney-go.(*fieldHolder).AddField+0x00000000000000c0>  0x0000004035cfad38 
0x0000004035cfacf0:  0x00000040470acb20  0x0000004035cfad58 
0x0000004035cfad00:  0x000000000058e980 <github.com/honeycombio/libhoney-go.(*Event).AddField+0x0000000000000110>  0x0000000000fcae00 
0x0000004035cfad10:  0x0000004047098bd0  0x00000040469ff6c8 
0x0000004035cfad20:  0x0000000000000014  0x0100000000000014 
0x0000004035cfad30:  0x00000040470b2248  0x0000000000ebb3a0 
0x0000004035cfad40:  0x000000000058dc50 <github.com/honeycombio/libhoney-go.(*fieldHolder).AddField.func1+0x0000000000000000>  0x00000040470b2248 
0x0000004035cfad50:  0x0000004035cfad40  0x0000004035cfadb8 
0x0000004035cfad60:  0x00000000005a5898 <github.com/honeycombio/beeline-go/trace.(*Span).AddField+0x0000000000000158>  0x00000040470b2240 
0x0000004035cfad70:  0x00000040469ff6c8  0x0000000000000014 
0x0000004035cfad80:  0x0000000000ebb3a0  0x0000000001ed2ac8 
0x0000004035cfad90:  0x01000000005a5840  0x0000000000f9a880 
0x0000004035cfada0: >0x000000000058e9e0 <github.com/honeycombio/libhoney-go.(*Event).AddField.func1+0x0000000000000000>  0x00000040470b226c 
0x0000004035cfadb0:  0x0000004035cfada0  0x0000004035cfae28 
0x0000004035cfadc0:  0x00000000005a7a4c <github.com/honeycombio/beeline-go.AddField+0x00000000000000cc>  0x0000ffff94c3f878 
0x0000004035cfadd0:  0x00000040469ff6c8  0x0000000001ed2ac8 
0x0000004035cfade0:  0x0000000000ebb3a0  0x0000000001ed2ac8 
0x0000004035cfadf0:  0x0000004035cfae08  0x0000004035cfae28 
0x0000004035cfae00:  0x01000000005a7a00  0x0000000000f9a880 
0x0000004035cfae10:  0x00000000005a5900 <github.com/honeycombio/beeline-go/trace.(*Span).AddField.func1+0x0000000000000000>  0x0000004043dc5688 
0x0000004035cfae20:  0x0000004035cfae10  0x0000004035cfae88 
0x0000004035cfae30:  0x0000000000d79884 <github.com/honeycombio/hound/cmd/shepherd/app.(*App).getSchema+0x00000000000000f4>  0x0000004043dc5600 
0x0000004035cfae40:  0x00000040469ff6c8  0x0000000000000014 
0x0000004035cfae50:  0x0000000000ebb3a0  0x0000000001ed2ac8 
0x0000004035cfae60:  0x000000400015aae0  0x0000000000000014 
0x0000004035cfae70:  0x0000000000ebb3a0  0x0000004043dc5600 
0x0000004035cfae80:  0x00000040469ff6c8  0x0000004035cfaf48 
0x0000004035cfae90:  0x0000000000d755bc <github.com/honeycombio/hound/cmd/shepherd/app.(*App).batchProcess.func1+0x000000000000043c>  0x0000000001491d40 
fatal error: traceback stuck

goroutine 0 [idle]:
runtime.throw({0x11b3553?, 0x1ebf180?})
	/usr/local/go/src/runtime/panic.go:1047 +0x40 fp=0x40009ac610 sp=0x40009ac5e0 pc=0x490f0
runtime.gentraceback(0x1f3c7e0?, 0x0?, 0x40009ac9a8?, 0x403cd55520, 0x0, 0x40009ac9a8, 0x40, 0x0, 0x0?, 0x6)
	/usr/local/go/src/runtime/traceback.go:488 +0xde4 fp=0x40009ac940 sp=0x40009ac610 pc=0x703f4
runtime.sigprof(0x7abe0, 0x40009acc48?, 0x0?, 0x403cd55520, 0x40009a4000)
	/usr/local/go/src/runtime/proc.go:4567 +0xe8 fp=0x40009acbb0 sp=0x40009ac940 pc=0x55618
runtime.sighandler(0x1b?, 0x40009a4000?, 0x40009acc98?, 0x40005e4000?)
	/usr/local/go/src/runtime/signal_unix.go:629 +0x574 fp=0x40009acc60 sp=0x40009acbb0 pc=0x60334
runtime.sigtrampgo(0x1b, 0x40009acda0, 0x40009ace20)
	/usr/local/go/src/runtime/signal_unix.go:479 +0x164 fp=0x40009acce0 sp=0x40009acc60 pc=0x5f9c4
runtime.sigtramp()
	/usr/local/go/src/runtime/sys_linux_arm64.s:471 +0x58 fp=0x40009acda0 sp=0x40009acce0 pc=0x7e228

goroutine 1503875 [running]:
runtime: traceback stuck. pc=0x58e9e0 sp=0x400ffaeda0
stack: frame={sp:0x400ffaeda0, fp:0x400ffaeda0} stack=[0x400ffac000,0x400ffb0000)
0x000000400ffaeca0:  0x0000004046ce2240  0x000000010001f0b8 
0x000000400ffaecb0:  0x7300000001243730  0x0000ffff9d953948 
0x000000400ffaecc0:  0xc8f36a7a4ff431e7  0x0000000000000007 
0x000000400ffaecd0:  0x000000000058db7c <github.com/honeycombio/libhoney-go.(*fieldHolder).AddField+0x000000000000004c>  0x000000400cfa5520 
0x000000400ffaece0:  0x000000000058dc00 <github.com/honeycombio/libhoney-go.(*fieldHolder).AddField+0x00000000000000d0>  0x000000400ffaed38 
0x000000400ffaecf0:  0x0000000000065708 <runtime.rawstringtmp+0x0000000000000048>  0x000000400ffaed58 
0x000000400ffaed00:  0x000000000058e980 <github.com/honeycombio/libhoney-go.(*Event).AddField+0x0000000000000110>  0x0000000000fcae00 
0x000000400ffaed10:  0x0000004046ce2240  0x0000004046110d50 
0x000000400ffaed20:  0x0000000000000016  0x0100000000000016 
0x000000400ffaed30:  0x00000040453eaf68  0x0000000000eccea0 
0x000000400ffaed40:  0x000000000058dc50 <github.com/honeycombio/libhoney-go.(*fieldHolder).AddField.func1+0x0000000000000000>  0x00000040453eaf68 
0x000000400ffaed50:  0x000000400ffaed40  0x000000400ffaedb8 
0x000000400ffaed60:  0x00000000005a5898 <github.com/honeycombio/beeline-go/trace.(*Span).AddField+0x0000000000000158>  0x00000040453eaf60 
0x000000400ffaed70:  0x0000004046110d50  0x0000000000000016 
0x000000400ffaed80:  0x0000000000eccea0  0x000000000147d320 
0x000000400ffaed90:  0x01000000005a5840  0x0000000000f9a880 
0x000000400ffaeda0: >0x000000000058e9e0 <github.com/honeycombio/libhoney-go.(*Event).AddField.func1+0x0000000000000000>  0x00000040453eaf8c 
0x000000400ffaedb0:  0x000000400ffaeda0  0x000000400ffaee28 
0x000000400ffaedc0:  0x00000000005a7a4c <github.com/honeycombio/beeline-go.AddField+0x00000000000000cc>  0x0000ffff9d953948 
0x000000400ffaedd0:  0x0000004046110d50  0x000000000147d320 
0x000000400ffaede0:  0x0000000000eccea0  0x000000000147d320 
0x000000400ffaedf0:  0x000000400ffaee08  0x000000400ffaee28 
0x000000400ffaee00:  0x01000000005a7a00  0x0000000000f9a880 
0x000000400ffaee10:  0x00000000005a5900 <github.com/honeycombio/beeline-go/trace.(*Span).AddField.func1+0x0000000000000000>  0x00000040453c2b88 
0x000000400ffaee20:  0x000000400ffaee10  0x000000400ffaee88 
0x000000400ffaee30:  0x0000000000d79a6c <github.com/honeycombio/hound/cmd/shepherd/app.(*App).getSchema+0x00000000000002dc>  0x00000040453c2b00 
0x000000400ffaee40:  0x0000004046110d50  0x0000000000000016 
0x000000400ffaee50:  0x0000000000eccea0  0x000000000147d320 
0x000000400ffaee60:  0x00000040006af0b0  0x0000000000000016 
0x000000400ffaee70:  0x0000000000eccea0  0x00000040453c2b00 
0x000000400ffaee80:  0x0000004046110d50  0x000000400ffaef48 
0x000000400ffaee90:  0x0000000000d755bc <github.com/honeycombio/hound/cmd/shepherd/app.(*App).batchProcess.func1+0x000000000000043c>  0x0000000001491d40 
fatal error: traceback stuck

goroutine 0 [idle]:
runtime.throw({0x11b3553?, 0x1ebf180?})
	/usr/local/go/src/runtime/panic.go:1047 +0x40 fp=0x400078a610 sp=0x400078a5e0 pc=0x490f0
runtime.gentraceback(0x1f3c7e0?, 0x0?, 0x400078a9a8?, 0x400cfa5520, 0x0, 0x400078a9a8, 0x40, 0x0, 0x0?, 0x6)
	/usr/local/go/src/runtime/traceback.go:488 +0xde4 fp=0x400078a940 sp=0x400078a610 pc=0x703f4
runtime.sigprof(0x7abe4, 0x400078ac48?, 0x0?, 0x400cfa5520, 0x4000780000)
	/usr/local/go/src/runtime/proc.go:4567 +0xe8 fp=0x400078abb0 sp=0x400078a940 pc=0x55618
runtime.sighandler(0x1b?, 0x4000780000?, 0x400078ac98?, 0x40007821a0?)
	/usr/local/go/src/runtime/signal_unix.go:629 +0x574 fp=0x400078ac60 sp=0x400078abb0 pc=0x60334
runtime.sigtrampgo(0x1b, 0x400078ada0, 0x400078ae20)
	/usr/local/go/src/runtime/signal_unix.go:479 +0x164 fp=0x400078ace0 sp=0x400078ac60 pc=0x5f9c4
runtime.sigtramp()
	/usr/local/go/src/runtime/sys_linux_arm64.s:471 +0x58 fp=0x400078ada0 sp=0x400078ace0 pc=0x7e228

goroutine 16981043 [running]:
go tool addr2line shepherd
0x7abe4
0x7abe0
runtime.morestack_noctxt
/usr/local/go/src/runtime/asm_arm64.s:324
runtime.morestack_noctxt
/usr/local/go/src/runtime/asm_arm64.s:323

https://github.com/golang/go/blob/go1.19/src/runtime/asm_arm64.s#L323-L324

@dmitshur dmitshur modified the milestones: Go1.18.6, Go1.20 Aug 9, 2022
@cherrymui
Member

Thanks for the details @lizthegrey !

similarity to #52829, does this need to not have NOFRAME ie https://go-review.googlesource.com/c/go/+/405482/?

I don't think this is necessary. The ARM32 function contains a call in some configurations, whereas the ARM64 function should not contain a call.

So the signal lands at the entry of morestack_noctxt or systemstack. At this point the stack switch hasn't occurred so it should be able to traceback normally. I don't immediately see what is wrong. I'll keep looking.

If you need a workaround, you can try commenting out the "traceback stuck" throw. The stacks in profile samples may be inaccurate (which was probably the case before that CL), but it probably won't crash.

@lizthegrey
Author

Need any additional debugging info? Any way that I can help?

@prattmic
Member

We have special cases for morestack and systemstack here, but I don't believe they are relevant because (a) funcID_morestack only applies to morestack, not morestack_noctxt, and (b) we should only hit these if we have already switched stacks, which is not the case here.

@lizthegrey is this readily reproducible (ideally OSS, but I'm also curious if you could manually reproduce)? Or only occurs in prod?

@lizthegrey
Author

lizthegrey commented Aug 16, 2022

It reproduces multiple times per day in all of our environments. Sadly I'm not sure I can reduce to a minimal OSS test case.

One thing I'd be happy to do is run a branch build that patches the Go runtime to hang rather than panic when this occurs; I could then ssh in and attach dlv to the binary.

@gopherbot

Change https://go.dev/cl/424196 mentions this issue: DO NOT SUBMIT: runtime: add dlog to gentraceback

@gopherbot

Change https://go.dev/cl/424195 mentions this issue: DO NOT SUBMIT: runtime: dump entire stack

@prattmic
Member

If you are willing to patch the runtime, there are a few changes that would be helpful.

  • https://go.dev/cl/424195: This just expands the stack dump to include the entire stack, which will help see more context. Since this only runs when crashing, it won't have any impact on normal execution.

  • https://go.dev/cl/424196: This adds debuglog logs to trace the behavior of gentraceback, which will help us see how it gets into the bad state. debuglog is a low-overhead in-memory ring buffer log in the runtime, which you must enable with go build -tags debuglog [1]. The log is dumped to stderr on crash. Without the build tag, the code is a complete no-op. With the build tag, it logs every execution of gentraceback, so there may be a small performance degradation, but I don't think it will be noticeable in real-world applications.

I think the latter is a bit simpler than attaching with a debugger, since it can track iteration state (previous frames) that will be gone once we decide to crash. Plus you don't need to watch for hung tasks and manually inspect them, but that is an option as well if the patch is concerning.

[1] If adding tags to your build is difficult for some reason, you can also swap the build tag lines at the top of debuglog_off.go and debuglog_on.go.
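
To make the "in-memory ring buffer log" idea concrete, here is a minimal sketch. It is not the runtime's debuglog implementation: the `ringLog` type and its methods are hypothetical names invented for illustration, and the real debuglog is considerably more involved. The sketch only shows why such a log keeps the most recent entries and can be dumped at crash time.

```go
package main

import "fmt"

// ringLog is a hypothetical fixed-size in-memory log. It is NOT the runtime's
// debuglog; it only illustrates the ring-buffer idea.
type ringLog struct {
	entries []string
	next    int  // index where the next entry will be written
	full    bool // true once the buffer has wrapped at least once
}

// newRingLog allocates a log that retains the last size messages (size > 0).
func newRingLog(size int) *ringLog {
	return &ringLog{entries: make([]string, size)}
}

// add records a message, overwriting the oldest entry when the buffer is full.
func (r *ringLog) add(msg string) {
	r.entries[r.next] = msg
	r.next = (r.next + 1) % len(r.entries)
	if r.next == 0 {
		r.full = true
	}
}

// dump returns the retained messages, oldest first.
func (r *ringLog) dump() []string {
	if !r.full {
		return append([]string(nil), r.entries[:r.next]...)
	}
	out := append([]string(nil), r.entries[r.next:]...)
	return append(out, r.entries[:r.next]...)
}

func main() {
	log := newRingLog(3)
	for i := 1; i <= 5; i++ {
		log.add(fmt.Sprintf("gentraceback call %d", i))
	}
	// Only the three most recent messages survive.
	for _, line := range log.dump() {
		fmt.Println(line)
	}
}
```

Because old entries are overwritten in place, the cost per log call stays small and bounded, which is the property that makes this kind of log attractive for tracing a hot path.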

@lizthegrey
Author

There's some skew since 1.18.5:

/usr/local/go/src/runtime/traceback.go:47:9: cannot use gp.goid (variable of type int64) as type uint64 in argument to d.u64

I'll fix the patch vs our base go version and try this overnight.
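
For anyone hitting the same build error, the kind of adjustment involved is an explicit conversion at the call site. The sketch below uses hypothetical stand-in types (`dlog`, `g`) rather than the real runtime ones, and assumes only what the compiler message states: the field is an int64 and the method wants a uint64.

```go
package main

import "fmt"

// dlog is a hypothetical stand-in for a logger whose u64 method takes a
// uint64, mirroring the d.u64 call named in the compiler error.
type dlog struct{ fields []uint64 }

func (d *dlog) u64(v uint64) *dlog {
	d.fields = append(d.fields, v)
	return d
}

// g is a hypothetical stand-in for a goroutine record whose goid is an int64,
// as the error message reports for the 1.18.5 tree.
type g struct{ goid int64 }

func logGoid(d *dlog, gp *g) {
	// d.u64(gp.goid) would fail to compile: Go never converts int64 to
	// uint64 implicitly, so the conversion has to be spelled out.
	d.u64(uint64(gp.goid))
}

func main() {
	d := &dlog{}
	logGoid(d, &g{goid: 16981043})
	fmt.Println(d.fields)
}
```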

@lizthegrey
Author

lizthegrey commented Aug 17, 2022

Fresh crash for you with the additional telemetry.

Details
2022/08/17 15:22:56 maxprocs: Updating GOMAXPROCS=4: determined from CPU quota
time="2022-08-17T15:22:56Z" level=info msg="missing POD_NAME and POD_NAMESPACE in env, ok if not retriever"
time="2022-08-17T15:22:56Z" level=info msg="Debug service listening on :6060"
[LaunchDarkly] 2022/08/17 15:22:56 INFO: Starting LaunchDarkly client 5.0.2
[LaunchDarkly] 2022/08/17 15:22:56 INFO: Starting LaunchDarkly streaming connection
[LaunchDarkly] 2022/08/17 15:22:56 INFO: Waiting up to 30000 milliseconds for LaunchDarkly client to start...
[LaunchDarkly] 2022/08/17 15:22:56 INFO: Connecting to LaunchDarkly stream
[LaunchDarkly] 2022/08/17 15:22:56 INFO: LaunchDarkly streaming is active
[LaunchDarkly] 2022/08/17 15:22:56 INFO: Initialized LaunchDarkly client
runtime: traceback stuck. pc=0x12894 sp=0x40009d7840
stack: frame={sp:0x40009d7840, fp:0x40009d7840} stack=[0x40009d6000,0x40009d8000)
0x00000040009d7740:  0x000000000001cadc <runtime.mallocgc+0x000000000000068c>  0x00000040009d77b8 
0x00000040009d7750:  0x0000000000474210 <github.com/Shopify/sarama.(*realDecoder).pop+0x0000000000000060>  0x00000040009d77c8 
0x00000040009d7760:  0x0000000000000000  0x0000000000000000 
0x00000040009d7770:  0x0000000000000700  0x0000000000000004 
0x00000040009d7780:  0x00000040033b40c0  0x0000000000007701 
0x00000040009d7790:  0x0000000000000000  0x0000000000000000 
0x00000040009d77a0:  0x0000000000000000  0x0000004000600000 
0x00000040009d77b0:  0x0000004002b77c00  0x00000040009d77f8 
0x00000040009d77c0:  0x000000000047703c <github.com/Shopify/sarama.(*RecordBatch).decode+0x000000000000037c>  0x0000004008ec5980 
0x00000040009d77d0:  0x0000000000476f7c <github.com/Shopify/sarama.(*RecordBatch).decode+0x00000000000002bc>  0x00000040099ce05f 
0x00000040009d77e0:  0x000000000010a2ee <type..eq.[...]interface {}+0x000000000000003e>  0x0000000000000000 
0x00000040009d77f0:  0x0000000000000000  0x00000040009d78c8 
0x00000040009d7800:  0x00000000004777dc <github.com/Shopify/sarama.(*Records).decode+0x000000000000009c>  0x00000040033b40c0 
0x00000040009d7810:  0x0000004009ac5ab9  0x0000000000012894 <cmpbody+0x00000000000000e4> 
0x00000040009d7820:  0x0000000000012894 <cmpbody+0x00000000000000e4>  0x00000040009d7898 
0x00000040009d7830:  0x000128c50001ce80  0x0000000000012894 <cmpbody+0x00000000000000e4> 
0x00000040009d7840: >0x0000000000012894 <cmpbody+0x00000000000000e4>  0x00000000000000dc 
0x00000040009d7850:  0x0000000000000000  0x00000040009d7800 
0x00000040009d7860:  0x0000000000476da4 <github.com/Shopify/sarama.(*RecordBatch).decode+0x00000000000000e4>  0x00000040009d78b8 
0x00000040009d7870:  0x0000000000000000  0x0000000000000000 
0x00000040009d7880:  0x0000000000000000  0x0000000000000000 
0x00000040009d7890:  0x0000000000000000  0x0000004009ac5ab9 
0x00000040009d78a0:  0x0000004008ec5980  0x0000000000000000 
0x00000040009d78b0:  0x0000000000000000  0x00000000004772f0 <github.com/Shopify/sarama.(*RecordBatch).decode.func1+0x0000000000000000> 
0x00000040009d78c0:  0x0000004008ec5980  0x00000040009d7918 
0x00000040009d78d0:  0x000000000045a32c <github.com/Shopify/sarama.(*FetchResponseBlock).decode+0x000000000000033c>  0x0000004002cc3540 
0x00000040009d78e0:  0x0000000001c75e50  0x00000040033b40c0 
0x00000040009d78f0:  0x000000000045a31c <github.com/Shopify/sarama.(*FetchResponseBlock).decode+0x000000000000032c>  0x0000000000000018 
0x00000040009d7900:  0x00000000016a8080  0x00000040005cf601 
0x00000040009d7910:  0x00000040027f6300  0x00000040009d79c8 
0x00000040009d7920:  0x000000000045acf8 <github.com/Shopify/sarama.(*FetchResponse).decode+0x00000000000002a8>  0x000000400710dad0 
0x00000040009d7930:  0x0000000001c75e50  0x00000040033b40c0 
fatal error: traceback stuck

goroutine 0 [idle]:
runtime: unexpected return pc for runtime.sigtramp called from 0xffffaac795c0
stack: frame={sp:0x40006cacd0, fp:0x40006cada0} stack=[0x40006c4000,0x40006cc000)
0x00000040006cabd0:  0x00000040006cac28  0x0000000000000000 
0x00000040006cabe0:  0x0000000000000000  0x00000040006c2000 
0x00000040006cabf0:  0x0000000000000000  0x0000000000000000 
0x00000040006cac00:  0x0000000000000000  0x00000040006cada0 
0x00000040006cac10:  0x00000040006cae20  0x00000040006cac98 
0x00000040006cac20:  0x000000000007d35c <runtime.sigtrampgo+0x000000000000001c>  0x000000400000001b 
0x00000040006cac30:  0x00000040006c0000  0x00000040006cac58 
0x00000040006cac40:  0x0000004007a6bd40  0x0000000000000000 
0x00000040006cac50:  0x0000000000000000  0x0000000000000000 
0x00000040006cac60:  0x0000000000000000  0x0000000000000000 
0x00000040006cac70:  0x0000000000000000  0x0000000000000000 
0x00000040006cac80:  0x0000004007a6bd40  0x00000040006cada0 
0x00000040006cac90:  0x00000040006cae20  0x00000040006cacc8 
0x00000040006caca0:  0x000000000007cc74 <runtime.sigtramp+0x0000000000000084>  0x000000000000001b 
0x00000040006cacb0:  0x00000040006cada0  0x00000040006cae20 
0x00000040006cacc0:  0x0000000000000000  0x00000040006cbff0 
0x00000040006cacd0: <0x0000ffffaac795c0  0x000000000000001b 
0x00000040006cace0:  0x00000040006cada0  0x00000040006cae20 
0x00000040006cacf0:  0x000000000004a4fd <runtime.main+0x00000000000002ed>  0x070a091214d6eaae 
0x00000040006cad00:  0x226e776f6e6b6e75  0x3464623239303924 
0x00000040006cad10:  0x342d356131632d61  0x000000000000000d 
0x00000040006cad20:  0x00000000000001dd  0x0000000000000000 
0x00000040006cad30:  0x0000000001c5b300  0x0000004007a6bd40 
0x00000040006cad40:  0x00000040006cacc8  0x3e66376972bea4d0 
0x00000040006cad50:  0xbebbbd41c5d26bf1  0x3f11566aaf25de2c 
0x00000040006cad60:  0xbf66c16c16bebd93  0x3fc5555555555555 
0x00000040006cad70:  0x3fc7642f0ac54c41  0x3ffd137a1ea75678 
0x00000040006cad80:  0xbfca94327912c722  0x0000000000000000 
0x00000040006cad90:  0x0000000000000000  0x0000000000000000 
0x00000040006cada0: >0x000000000000001b  0x00000000fffffffe 
0x00000040006cadb0:  0x0000000000000136  0x0000000000000000 
0x00000040006cadc0:  0x0000000000000000  0x0000000000000000 
0x00000040006cadd0:  0x0000000000000000  0x0000000000000000 
0x00000040006cade0:  0x0000000000000000  0x0000000000000000 
0x00000040006cadf0:  0x0000000000000000  0x0000000000000000 
0x00000040006cae00:  0x0000000000000000  0x0000000000000000 
0x00000040006cae10:  0x0000000000000000  0x0000000000000000 
0x00000040006cae20:  0x0000000000000000  0x0000000000000000 
0x00000040006cae30:  0x00000040006c4000  0x0000000000000000 
0x00000040006cae40:  0x0000000000008000  0x0000000000000000 
0x00000040006cae50:  0x0000000000000000  0x0000000000000000 
0x00000040006cae60:  0x0000000000000000  0x0000000000000000 
0x00000040006cae70:  0x0000000000000000  0x0000000000000000 
0x00000040006cae80:  0x0000000000000000  0x0000000000000000 
0x00000040006cae90:  0x0000000000000000  0x0000000000000000 
runtime.throw({0x1894577?, 0x2a981a0?})
	/usr/local/go/src/runtime/panic.go:992 +0x50
runtime.gentraceback(0x2b3d3a0?, 0x0?, 0x40006ca968?, 0x4007a6bd40, 0x0, 0x40006ca968, 0x40, 0x0, 0x0?, 0x6)
	/usr/local/go/src/runtime/traceback.go:488 +0xd98
runtime.sigprof(0x79410, 0x40006cac08?, 0x0?, 0x4007a6bd40, 0x40006c0000)
	/usr/local/go/src/runtime/proc.go:4507 +0xf0
runtime.sighandler(0x1b?, 0x40006c0000?, 0x40006cac58?, 0x4007a6bd40?)
	/usr/local/go/src/runtime/signal_unix.go:613 +0x514
runtime.sigtrampgo(0x1b, 0x40006cada0, 0x40006cae20)
	/usr/local/go/src/runtime/signal_unix.go:477 +0x164
runtime.sigtrampgo(0x1b, 0x40006cada0, 0x40006cae20)
	<autogenerated>:1 +0x1c
runtime: unexpected return pc for runtime.sigtramp called from 0xffffaac795c0
stack: frame={sp:0x40006cacd0, fp:0x40006cada0} stack=[0x40006c4000,0x40006cc000)
0x00000040006cabd0:  0x00000040006cac28  0x0000000000000000 
0x00000040006cabe0:  0x0000000000000000  0x00000040006c2000 
0x00000040006cabf0:  0x0000000000000000  0x0000000000000000 
0x00000040006cac00:  0x0000000000000000  0x00000040006cada0 
0x00000040006cac10:  0x00000040006cae20  0x00000040006cac98 
0x00000040006cac20:  0x000000000007d35c <runtime.sigtrampgo+0x000000000000001c>  0x000000400000001b 
0x00000040006cac30:  0x00000040006c0000  0x00000040006cac58 
0x00000040006cac40:  0x0000004007a6bd40  0x0000000000000000 
0x00000040006cac50:  0x0000000000000000  0x0000000000000000 
0x00000040006cac60:  0x0000000000000000  0x0000000000000000 
0x00000040006cac70:  0x0000000000000000  0x0000000000000000 
0x00000040006cac80:  0x0000004007a6bd40  0x00000040006cada0 
0x00000040006cac90:  0x00000040006cae20  0x00000040006cacc8 
0x00000040006caca0:  0x000000000007cc74 <runtime.sigtramp+0x0000000000000084>  0x000000000000001b 
0x00000040006cacb0:  0x00000040006cada0  0x00000040006cae20 
0x00000040006cacc0:  0x0000000000000000  0x00000040006cbff0 
0x00000040006cacd0: <0x0000ffffaac795c0  0x000000000000001b 
0x00000040006cace0:  0x00000040006cada0  0x00000040006cae20 
0x00000040006cacf0:  0x000000000004a4fd <runtime.main+0x00000000000002ed>  0x070a091214d6eaae 
0x00000040006cad00:  0x226e776f6e6b6e75  0x3464623239303924 
0x00000040006cad10:  0x342d356131632d61  0x000000000000000d 
0x00000040006cad20:  0x00000000000001dd  0x0000000000000000 
0x00000040006cad30:  0x0000000001c5b300  0x0000004007a6bd40 
0x00000040006cad40:  0x00000040006cacc8  0x3e66376972bea4d0 
0x00000040006cad50:  0xbebbbd41c5d26bf1  0x3f11566aaf25de2c 
0x00000040006cad60:  0xbf66c16c16bebd93  0x3fc5555555555555 
0x00000040006cad70:  0x3fc7642f0ac54c41  0x3ffd137a1ea75678 
0x00000040006cad80:  0xbfca94327912c722  0x0000000000000000 
0x00000040006cad90:  0x0000000000000000  0x0000000000000000 
0x00000040006cada0: >0x000000000000001b  0x00000000fffffffe 
0x00000040006cadb0:  0x0000000000000136  0x0000000000000000 
0x00000040006cadc0:  0x0000000000000000  0x0000000000000000 
0x00000040006cadd0:  0x0000000000000000  0x0000000000000000 
0x00000040006cade0:  0x0000000000000000  0x0000000000000000 
0x00000040006cadf0:  0x0000000000000000  0x0000000000000000 
0x00000040006cae00:  0x0000000000000000  0x0000000000000000 
0x00000040006cae10:  0x0000000000000000  0x0000000000000000 
0x00000040006cae20:  0x0000000000000000  0x0000000000000000 
0x00000040006cae30:  0x00000040006c4000  0x0000000000000000 
0x00000040006cae40:  0x0000000000008000  0x0000000000000000 
0x00000040006cae50:  0x0000000000000000  0x0000000000000000 
0x00000040006cae60:  0x0000000000000000  0x0000000000000000 
0x00000040006cae70:  0x0000000000000000  0x0000000000000000 
0x00000040006cae80:  0x0000000000000000  0x0000000000000000 
0x00000040006cae90:  0x0000000000000000  0x0000000000000000 
runtime.sigtramp()
	/usr/local/go/src/runtime/sys_linux_arm64.s:489 +0x84

goroutine 2446 [running]:

@prattmic
Member

There's some skew since 1.18.5:

Oops, apologies, I made this CL based on tip rather than 1.18. Did you get this fixed? If not I can provide another on 1.18.

Fresh crash for you with the additional telemetry.

Did this get built with -tags debuglog? The log is missing, though perhaps just not included in your paste. The log will be output after all of the crash stack traces just before exit.

@lizthegrey
Author

Did this get built with -tags debuglog? The log is missing, though perhaps just not included in your paste. The log will be output after all of the crash stack traces just before exit.

I thought it did, but apparently it's missing them. Does go install accept tags, or only go build?

                    go install -tags debuglog -ldflags "${LD_FLAGS} \
                      -X github.com/honeycombio/hound/env.buildID=$BUILD_ID \
                      -X github.com/honeycombio/hound/env.commitHash=$CIRCLE_SHA1" \
                      ./...

@prattmic
Member

prattmic commented Aug 17, 2022

Yes, go install accepts tags.

I also see that the stack dump doesn't include the entire stack either (https://go.dev/cl/424195). I'd guess that your build isn't picking up the patches to the runtime at all (e.g., wrong GOROOT), except that it clearly did when you got a build error...

You could double-check the symbols in the binary. If debuglog is enabled, it should have various dlogger symbols like runtime.(*dlogger).end; if not, those will be missing.
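
One way to run that check programmatically is sketched below, assuming an unstripped ELF binary (as on linux/arm64). The helper names `hasDlogger` and `symbolNames` and the `checkdlog` usage string are hypothetical; only the standard library's `debug/elf` calls are real. A stripped binary has no symbol table, in which case `Symbols` returns an error and this check cannot tell you anything.

```go
package main

import (
	"debug/elf"
	"fmt"
	"os"
	"strings"
)

// hasDlogger reports whether any symbol name mentions the runtime's dlogger
// type, e.g. runtime.(*dlogger).end.
func hasDlogger(names []string) bool {
	for _, n := range names {
		if strings.Contains(n, "runtime.(*dlogger)") {
			return true
		}
	}
	return false
}

// symbolNames reads the ELF symbol table of the binary at path.
func symbolNames(path string) ([]string, error) {
	f, err := elf.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	syms, err := f.Symbols()
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(syms))
	for _, s := range syms {
		names = append(names, s.Name)
	}
	return names, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: checkdlog <binary>")
		return
	}
	names, err := symbolNames(os.Args[1])
	if err != nil {
		fmt.Println("reading symbols:", err)
		return
	}
	fmt.Println("debuglog symbols present:", hasDlogger(names))
}
```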

@lizthegrey
Author

oh, whoops, good catch.

    Roles pinned:
        beagle: 603443
        barbet: 603443

that binary is pinned to an older build. the build instructions were correct, prod just didn't have the correct one. egg on my face, sorry, I'll re-try and let you know when we have a crash with the correct build.

@prattmic
Member

Thanks. By the way, do you happen to be using musl libc (/Alpine Linux)? (Is this even a cgo binary?)

I'm investigating #54306 in parallel, presumably an issue with musl. It is probably unrelated, but it would help to know if you are using musl.

@lizthegrey
Author

lizthegrey commented Aug 17, 2022

Thanks. By the way, do you happen to be using musl libc (/Alpine Linux)? (Is this even a cgo binary?)

I'm investigating #54306 in parallel, presumably an issue with musl. It is probably unrelated, but it would help to know if you are using musl.

No, we run on Ubuntu 22.04 containers with standard libc 2.35-0ubuntu3.1. No use of cgo

@lizthegrey
Author

@prattmic
Member

Thanks, I'll take a look! Does this one have debuglog output at the end? (Lots of lines mostly containing gentrace: ...)

@lizthegrey
Author

I think I made the hourly cutoff for our build train, will report if I see any crashes this afternoon or over the weekend; if no crashes by Monday, then we can assume that was the fix.

@lizthegrey
Author

Unfortunately I do see a crash in the shepherd binary (same as from tracebacks 1-3), same PC -- 0x7cdc0. didn't manage to get to the pod before Kubernetes GCed it though, and for some reason log shipping is broken in that environment.

It'll likely happen again over weekend though, if it happened within 3 hours of deploy of the putative fix

@cherrymui
Member

@lizthegrey have you seen more failures and do you have any logs to share? Could you double-check that the binary includes the patch? Thanks!

@lizthegrey
Author

@lizthegrey have you seen more failures and do you have any logs to share? Could you double-check that the binary includes the patch? Thanks!

No failures over the weekend. Yes, I'm 100% sure the crash from Friday includes the patch. I can manually scrape the logs from another system, but that will take some time, so I'm hoping it happens somewhere else where I can grab the logs automatically instead.

@lizthegrey
Author

https://share.getcloudapp.com/p9uQpJww -- no crashes today either. the crash on Friday must have been a fluke, but at least we have the telemetry to detect it if it does happen over the next few weeks. I'll still work on reconstructing the logs from the Friday crash immediately post-deploy though.

@lizthegrey
Author

lizthegrey commented Aug 23, 2022

share.getcloudapp.com/p9uQpJww -- no crashes today either. the crash on Friday must have been a fluke, but at least we have the telemetry to detect it if it does happen over the next few weeks. I'll still work on reconstructing the logs from the Friday crash immediately post-deploy though.

Found the problem -- the pod was a lingering holdover of the previous build, despite running chronologically long after the new release should have gone. It did not have the patch. You are good to propose the MOVD RSP, RSP patch for 1.20 release and for 1.19 and for 1.18 backport trains.

thank you thank you thank you @cherrymui and @prattmic for your patience.

@cherrymui
Member

Thanks @lizthegrey ! This is great. I'll send a CL soon.
And thank you very much for providing the debug logs and helping us find the issue.

@lizthegrey lizthegrey changed the title from runtime: SIGTRAMP while profiling arm64 hits new "traceback stuck" assert to runtime: morestack_noctxt missing SPWRITE, causes "traceback stuck" assert Aug 23, 2022
@gopherbot

Change https://go.dev/cl/425396 mentions this issue: runtime: mark morestack_noctxt SPWRITE on LR architectures

@cherrymui
Member

@gopherbot please backport this to previous releases. This may cause a runtime crash. Thanks.

@gopherbot

Backport issue(s) opened: #54674 (for 1.18), #54675 (for 1.19).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

@dmitshur dmitshur added the NeedsFix The path to resolution is known, but the work has not been done. label Aug 25, 2022
@gopherbot

Change https://go.dev/cl/425615 mentions this issue: [release-branch.go1.19] runtime: mark morestack_noctxt SPWRITE on LR architectures

@gopherbot

Change https://go.dev/cl/425616 mentions this issue: [release-branch.go1.18] runtime: mark morestack_noctxt SPWRITE on LR architectures

gopherbot pushed a commit that referenced this issue Aug 31, 2022
…architectures

On LR architectures, morestack (and morestack_noctxt) are called
with a special calling convention, where the caller doesn't save
LR on stack but passes it as a register, which morestack will save
to g.sched.lr. The stack unwinder currently doesn't understand it,
and would fail to unwind from it. morestack already writes SP (as
it switches stack), but morestack_noctxt (which tailcalls
morestack) doesn't. If a profiling signal lands right in
morestack_noctxt, the unwinder will try to unwind the stack and
go off, and possibly crash.

Marking morestack_noctxt SPWRITE stops the unwinding.

Ideally we could teach the unwinder about the special calling
convention, or change the calling convention to be less special
(so the unwinder doesn't need to fetch a register from the signal
context). This is a stop-gap solution, to stop the unwinder from
crashing.

Updates #54332.
Fixes #54674.

Change-Id: I75295f2e27ddcf05f1ea0b541aedcb9000ae7576
Reviewed-on: https://go-review.googlesource.com/c/go/+/425396
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Cherry Mui <cherryyz@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
(cherry picked from commit e4be2ac)
Reviewed-on: https://go-review.googlesource.com/c/go/+/425616
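
The effect described in the commit message above can be illustrated with a toy model. This is not the runtime's gentraceback: the `frame` type, the `unwind` function, and the `noProgress` step are hypothetical, and the real unwinder reads actual stack memory and function metadata. The sketch only assumes what the thread states: an unwinder that makes no progress reports "traceback stuck", and a function marked SPWRITE makes the unwinder stop rather than guess. The pc is taken from the report; the sp is illustrative.

```go
package main

import (
	"errors"
	"fmt"
)

// frame is a toy stack frame: a function name plus the pc and sp at which the
// unwinder found it. spWrite marks a function as "SP is not trustworthy here".
type frame struct {
	fn      string
	pc, sp  uint64
	spWrite bool
}

var errStuck = errors.New("traceback stuck")

// unwind walks from start toward the outermost frame. step stands in for
// reading the caller's pc and sp; it returns false when there is no caller.
func unwind(start frame, step func(frame) (frame, bool)) ([]string, error) {
	var trace []string
	for cur := start; ; {
		trace = append(trace, cur.fn)
		if cur.spWrite {
			// Give up cleanly instead of guessing at the caller.
			return trace, nil
		}
		next, ok := step(cur)
		if !ok {
			return trace, nil
		}
		if next.pc == cur.pc && next.sp == cur.sp {
			// No progress was made, so the walk would loop forever.
			return trace, errStuck
		}
		cur = next
	}
}

// noProgress models a function whose return address is not on the stack:
// the toy unwinder "finds" the same pc and sp again.
func noProgress(f frame) (frame, bool) { return f, true }

func main() {
	f := frame{fn: "runtime.morestack_noctxt", pc: 0x7abe4, sp: 0x400ffaeda0}
	_, err := unwind(f, noProgress)
	fmt.Println("unmarked:", err)

	f.spWrite = true
	trace, err := unwind(f, noProgress)
	fmt.Println("marked SPWRITE:", trace, err)
}
```

In this model the marked frame produces a truncated but valid trace, which matches the stop-gap nature of the fix: the profile sample loses its callers, but the process does not crash.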
gopherbot pushed a commit that referenced this issue Aug 31, 2022
…architectures

On LR architectures, morestack (and morestack_noctxt) are called
with a special calling convention, where the caller doesn't save
LR on stack but passes it as a register, which morestack will save
to g.sched.lr. The stack unwinder currently doesn't understand it,
and would fail to unwind from it. morestack already writes SP (as
it switches stack), but morestack_noctxt (which tailcalls
morestack) doesn't. If a profiling signal lands right in
morestack_noctxt, the unwinder will try to unwind the stack and
go off, and possibly crash.

Marking morestack_noctxt SPWRITE stops the unwinding.

Ideally we could teach the unwinder about the special calling
convention, or change the calling convention to be less special
(so the unwinder doesn't need to fetch a register from the signal
context). This is a stop-gap solution, to stop the unwinder from
crashing.

Updates #54332.
Fixes #54675.

Change-Id: I75295f2e27ddcf05f1ea0b541aedcb9000ae7576
Reviewed-on: https://go-review.googlesource.com/c/go/+/425396
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Cherry Mui <cherryyz@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
(cherry picked from commit e4be2ac)
Reviewed-on: https://go-review.googlesource.com/c/go/+/425615
rajbarik pushed a commit to rajbarik/go that referenced this issue Sep 1, 2022
On LR architectures, morestack (and morestack_noctxt) are called
with a special calling convention, where the caller doesn't save
LR on stack but passes it as a register, which morestack will save
to g.sched.lr. The stack unwinder currently doesn't understand it,
and would fail to unwind from it. morestack already writes SP (as
it switches stack), but morestack_noctxt (which tailcalls
morestack) doesn't. If a profiling signal lands right in
morestack_noctxt, the unwinder will try to unwind the stack and
go off, and possibly crash.

Marking morestack_noctxt SPWRITE stops the unwinding.

Ideally we could teach the unwinder about the special calling
convention, or change the calling convention to be less special
(so the unwinder doesn't need to fetch a register from the signal
context). This is a stop-gap solution, to stop the unwinder from
crashing.

Fixes golang#54332.

Change-Id: I75295f2e27ddcf05f1ea0b541aedcb9000ae7576
Reviewed-on: https://go-review.googlesource.com/c/go/+/425396
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Cherry Mui <cherryyz@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
bradfitz pushed a commit to tailscale/go that referenced this issue Sep 8, 2022
…architectures

On LR architectures, morestack (and morestack_noctxt) are called
with a special calling convention, where the caller doesn't save
LR on stack but passes it as a register, which morestack will save
to g.sched.lr. The stack unwinder currently doesn't understand it,
and would fail to unwind from it. morestack already writes SP (as
it switches stack), but morestack_noctxt (which tailcalls
morestack) doesn't. If a profiling signal lands right in
morestack_noctxt, the unwinder will try to unwind the stack and
go off, and possibly crash.

Marking morestack_noctxt SPWRITE stops the unwinding.

Ideally we could teach the unwinder about the special calling
convention, or change the calling convention to be less special
(so the unwinder doesn't need to fetch a register from the signal
context). This is a stop-gap solution, to stop the unwinder from
crashing.

Updates golang#54332.
Fixes golang#54675.

Change-Id: I75295f2e27ddcf05f1ea0b541aedcb9000ae7576
Reviewed-on: https://go-review.googlesource.com/c/go/+/425396
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Cherry Mui <cherryyz@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
(cherry picked from commit e4be2ac)
Reviewed-on: https://go-review.googlesource.com/c/go/+/425615
@gopherbot

Change https://go.dev/cl/446895 mentions this issue: runtime: mark morestack_noctxt SPWRITE for linux/loong64

gopherbot pushed a commit that referenced this issue Mar 15, 2023
ref. CL 425396

Updates #54332.

Change-Id: I1a235b0cca4dbf79cf61cf5f40b594fc2d940857
Reviewed-on: https://go-review.googlesource.com/c/go/+/446895
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: WANG Xuerui <git@xen0n.name>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Reviewed-by: xiaodong liu <teaofmoli@gmail.com>
xen0n pushed a commit to xen0n/go that referenced this issue May 1, 2023
ref. CL 425396

Updates golang#54332.

Change-Id: I1a235b0cca4dbf79cf61cf5f40b594fc2d940857
Reviewed-on: https://go-review.googlesource.com/c/go/+/446895
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: WANG Xuerui <git@xen0n.name>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Reviewed-by: xiaodong liu <teaofmoli@gmail.com>
(cherry picked from commit b5c8ae9)
@golang golang locked and limited conversation to collaborators Nov 1, 2023