Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: debug.Stack() and runtime.Callers() PCs differ on panic trigger site #34123

Closed
lggomez opened this issue Sep 5, 2019 · 14 comments
Closed
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker
Milestone

Comments

@lggomez
Copy link

lggomez commented Sep 5, 2019

What version of Go are you using (go version)?

go1.12.4 linux/amd64

Does this issue reproduce with the latest release?

Can´t tell for the time being, the only repro we have is on a web app and we still don´t have docker images with go 1.13/1.12.7 on our end

What operating system and processor architecture are you using (go env)?

Linux e5d326139ac1 4.9.184-linuxkit #1 SMP Tue Jul 2 22:58:16 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build601602077=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Quoting my original issue on newrelic/go-agent#100:

I'm working on a custom gin recover middleware that prints the offending stacktrace whenever a panic goes uncatched. In this scenario, we noticed an application's code path in which the line number on the printed stacktrace is wrong, returning line 59 instead of line 61 (it took me a long time to realize that something was wrong with the stacktrace itself and not the reported line):

func (ls LocationsMutableService) getChildren(ctx context.Context, tree string, loc string) (LocationChildren, error) {
	url := ""
	var children LocationChildren

//line 59 - panic not possible here
	response, err := ls.Client.GET(url, &children)

//line 61 - actual place of panic (response == nil)
	if response.StatusCode == http.StatusNotFound { 
		return nil, errors.New("")
	}

	if err != nil {
		return nil, err
	}

	return children, err
}

We fixed this by writing a implementation based on debug.Stack() but, surprisingly, after deploying the fix to verify it I'm observing that the internal stacktrace used by txn.NoticeError has the same inaccuracy that our original middleware implementation had:

2019/08/30 14:33:09 /go/src/github.com/.../src/api/vendor/github.com/.../gingonic/mlhandlers/nr_stack.go 19 mlhandlers.GetStackTrace
/go/src/github.com/.../src/api/vendor/github.com/.../gingonic/mlhandlers/recovery.go 58 mlhandlers.RecoveryWithWriter.func1.1
/usr/local/go/src/runtime/panic.go 522 runtime.gopanic
/usr/local/go/src/runtime/panic.go 82 runtime.panicmem
/usr/local/go/src/runtime/signal_unix.go 390 runtime.sigpanic
/go/src/github.com/.../src/api/gateways/locations_mutable_service.go 59 gateways.LocationsMutableService.getChildren
/go/src/github.com/.../src/api/gateways/locations_mutable_service.go 51 gateways.LocationsMutableService.GetChildren.func1

After some research the main culprit seems to be a wrong PC being used when retrieving the stack frames, but I'm still not certain about the concrete cause. I changed the stacktrace generator code that manually generated the frames via newrelic/go-agent#101 but I keep noticing the same issue, so I guess it boils down to a difference between how runtime.Stack and runtime.Callers manipulate the systemstack

The scenario I've been showing is on a specific branch of a web app we use at work (thus being unable to share it here), and unfortunately I've been trying to create a local repro case to no avail

@ianlancetaylor ianlancetaylor added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Sep 5, 2019
@ianlancetaylor ianlancetaylor added this to the Go1.14 milestone Sep 5, 2019
@ianlancetaylor
Copy link
Contributor

Marking as release-blocker since this sounds serious. But it's going to be hard for us to investigate without a reproduction case.

@lggomez
Copy link
Author

lggomez commented Sep 6, 2019

@ianlancetaylor I edited the issue with the container go env and uname values

In the meanwhile I can add some more info I found while trying to diagnose this that may (or may not) be helpul:

  • systemstack is handled differently between Stacktrace and runtime.Callers. I can´t tell if this can affect the panic site pc though
  • Doesn´t look like an inlining or compiler issue at a glance? this occurs in a single function block (altough lower levels of the stack itself at that call site are indeed funcs passed as params to a handler)
  • It likely is an off by one error on the panic site pc; in this case it is being decremented when it shouldn`t. Or, it should come incremented as it is required to decrement program counters if traversed manually (as referred at what the sigpanic is used for ? go-stack/stack#11 (comment))

Here are the different stacktraces built on the same recovery middleware for comparison (some package paths ommited for policy):

custom implementation (debug.Stack)

runtime error: invalid memory address or nil pointer dereference
	/go/src/github.com/.../src/api/vendor/github.com/.../gingonic/mlhandlers/recovery.go:103 +0x37
	/go/src/github.com/.../src/api/vendor/github.com/.../gingonic/mlhandlers/recovery.go:93 +0x3c
	/go/src/github.com/.../src/api/vendor/github.com/.../gingonic/mlhandlers/recovery.go:48 +0x64
	/go/src/github.com/.../src/api/gateways/locations_mutable_service.go:61 +0x1a8 -------------------- PANIC SITE
	/go/src/github.com/.../src/api/gateways/locations_mutable_service.go:51 +0xe5
	/go/src/github.com/.../src/api/vendor/github.com/.../gorelic/gorelic.go:24 +0xd8
	/go/src/github.com/.../src/api/gateways/locations_mutable_service.go:107 +0x98
	/go/src/github.com/.../src/api/gateways/locations_mutable_service.go:50 +0x171
	/go/src/github.com/.../src/api/gateways/service_centers_service.go:21 +0xd5
	/go/src/github.com/.../src/api/controllers/service_center_controller.go:35 +0x62
	/go/src/github.com/.../src/api/vendor/github.com/gin-gonic/gin/context.go:108 +0x43
	/go/src/github.com/.../src/api/server/middlewares.go:41 +0xa0
	/go/src/github.com/.../src/api/vendor/github.com/gin-gonic/gin/context.go:108 +0x43
	/go/src/github.com/.../src/api/server/middlewares.go:26 +0x72
	/go/src/github.com/.../src/api/vendor/github.com/gin-gonic/gin/context.go:108 +0x43
	/go/src/github.com/.../src/api/vendor/github.com/.../gingonic/mlhandlers/recovery.go:87 +0x5a
	/go/src/github.com/.../src/api/vendor/github.com/gin-gonic/gin/context.go:108 +0x43
	/go/src/github.com/.../src/api/server/middlewares.go:87 +0x376
	/go/src/github.com/.../src/api/vendor/github.com/gin-gonic/gin/context.go:108 +0x43
	/go/src/github.com/.../src/api/vendor/github.com/.../gingonic/mlhandlers/mlmetrics.go:19 +0x62
	/go/src/github.com/.../src/api/vendor/github.com/gin-gonic/gin/context.go:108 +0x43
	/go/src/github.com/.../src/api/vendor/github.com/gin-gonic/gin/gin.go:361 +0x597
	/go/src/github.com/.../src/api/vendor/github.com/gin-gonic/gin/gin.go:326 +0x134



2019/09/05 21:07:24 GIN (runtime.Caller & runtime.FuncForPC)

2019/09/05 21:07:24 /usr/local/go/src/runtime/panic.go:82 (0x444620)
/usr/local/go/src/runtime/signal_unix.go:390 (0x44444f)
/go/src/github.com/.../src/api/gateways/locations_mutable_service.go:59 (0xad66d7) -------------------- PANIC SITE
/go/src/github.com/.../src/api/gateways/locations_mutable_service.go:51 (0xadf614)
/go/src/github.com/.../src/api/vendor/github.com/.../gorelic/gorelic.go:24 (0xad47d7)
/go/src/github.com/.../src/api/gateways/locations_mutable_service.go:107 (0xad7277)
/go/src/github.com/.../src/api/gateways/locations_mutable_service.go:50 (0xad64d0)
/go/src/github.com/.../src/api/gateways/service_centers_service.go:21 (0xade744)
/go/src/github.com/.../src/api/controllers/service_center_controller.go:35 (0xaeb1c1)
/go/src/github.com/.../src/api/vendor/github.com/gin-gonic/gin/context.go:108 (0x993972)
/go/src/github.com/.../src/api/server/middlewares.go:41 (0xb0c61f)
/go/src/github.com/.../src/api/vendor/github.com/gin-gonic/gin/context.go:108 (0x993972)
/go/src/github.com/.../src/api/server/middlewares.go:26 (0xb0c461)
/go/src/github.com/.../src/api/vendor/github.com/gin-gonic/gin/context.go:108 (0x993972)
/go/src/github.com/.../src/api/vendor/github.com/.../gingonic/mlhandlers/recovery.go:87 (0xb091a9)
/go/src/github.com/.../src/api/vendor/github.com/gin-gonic/gin/context.go:108 (0x993972)
/go/src/github.com/.../src/api/server/middlewares.go:87 (0xb0f595)
/go/src/github.com/.../src/api/vendor/github.com/gin-gonic/gin/context.go:108 (0x993972)
/go/src/github.com/.../src/api/vendor/github.com/.../gingonic/mlhandlers/mlmetrics.go:19 (0xb081d1)
/go/src/github.com/.../src/api/vendor/github.com/gin-gonic/gin/context.go:108 (0x993972)
/go/src/github.com/.../src/api/vendor/github.com/gin-gonic/gin/gin.go:361 (0x99c1a6)
/go/src/github.com/.../src/api/vendor/github.com/gin-gonic/gin/gin.go:326 (0x99b9e3)
/usr/local/go/src/net/http/server.go:2774 (0x6c6c47)
/usr/local/go/src/net/http/server.go:1878 (0x6c2830)
/usr/local/go/src/runtime/asm_amd64.s:1337 (0x45dce0)



2019/09/05 21:07:24 newrelic own fix (runtime.Callers & runtime.CallersFrames)

2019/09/05 21:07:24 /go/src/github.com/.../src/api/vendor/github.com/.../gingonic/mlhandlers/nr_stack.go 19 mlhandlers.GetStackTrace
/go/src/github.com/.../src/api/vendor/github.com/.../gingonic/mlhandlers/recovery.go 58 mlhandlers.RecoveryWithWriter.func1.1
/usr/local/go/src/runtime/panic.go 522 runtime.gopanic
/usr/local/go/src/runtime/panic.go 82
/usr/local/go/src/runtime/signal_unix.go 390 runtime.sigpanic
/go/src/github.com/.../src/api/gateways/locations_mutable_service.go 59 gateways.LocationsMutableService.getChildren -------------------- PANIC SITE
/go/src/github.com/.../src/api/gateways/locations_mutable_service.go 51 gateways.LocationsMutableService.GetChildren.func1
/go/src/github.com/.../src/api/vendor/github.com/.../gorelic/gorelic.go 24 gorelic.WrapDatastoreSegment
/go/src/github.com/.../src/api/gateways/locations_mutable_service.go 107 gateways.LocationsMutableService.newRelicSegment
/go/src/github.com/.../src/api/gateways/locations_mutable_service.go 50 gateways.LocationsMutableService.GetChildren
/go/src/github.com/.../src/api/gateways/service_centers_service.go 21 gateways.(*ServiceCentersService).DoPanic
/go/src/github.com/.../src/api/controllers/service_center_controller.go 35 controllers.(*ServiceCenterController).DoPanic
/go/src/github.com/.../src/api/vendor/github.com/gin-gonic/gin/context.go 108 gin.(*Context).Next
/go/src/github.com/.../src/api/server/middlewares.go 41 server.accessLog
/go/src/github.com/.../src/api/vendor/github.com/gin-gonic/gin/context.go 108 gin.(*Context).Next
/go/src/github.com/.../src/api/server/middlewares.go 26 server.setUUID
/go/src/github.com/.../src/api/vendor/github.com/gin-gonic/gin/context.go 108 gin.(*Context).Next
/go/src/github.com/.../src/api/vendor/github.com/.../gingonic/mlhandlers/recovery.go 87 mlhandlers.RecoveryWithWriter.func1
/go/src/github.com/.../src/api/vendor/github.com/gin-gonic/gin/context.go 108 gin.(*Context).Next
/go/src/github.com/.../src/api/server/middlewares.go 87 server.newRelic.func1
/go/src/github.com/.../src/api/vendor/github.com/gin-gonic/gin/context.go 108 gin.(*Context).Next
/go/src/github.com/.../src/api/vendor/github.com/.../gingonic/mlhandlers/mlmetrics.go 19 mlhandlers.Datadog.func1
/go/src/github.com/.../src/api/vendor/github.com/gin-gonic/gin/context.go 108 gin.(*Context).Next
/go/src/github.com/.../src/api/vendor/github.com/gin-gonic/gin/gin.go 361 gin.(*Engine).handleHTTPRequest
/go/src/github.com/.../src/api/vendor/github.com/gin-gonic/gin/gin.go 326 gin.(*Engine).ServeHTTP
/usr/local/go/src/net/http/server.go 2774 http.serverHandler.ServeHTTP
/usr/local/go/src/net/http/server.go 1878 http.(*conn).serve
/usr/local/go/src/runtime/asm_amd64.s 1337 runtime.goexit

@randall77
Copy link
Contributor

A couple things you could do to help us debug:

  • Try with 1.13 and/or 1.11. At what version did your code originally work?
  • Provide the output of -S on the LocationsMutableService.getChildren function. Build with go build -gcflags=-S, save the output, find the code for that function, and paste it here.
  • Disassemble your binary (objdump -d), find the same function above, and paste the disassembly of it.
  • Show us the actual values in the runtime.Callers return slice. Make sure you use the exact binary used in the previous 2 points (i.e. if you hack in some print statements for this, use the binary with the print statements in it when disassembling).

@lggomezml
Copy link

@randall77 I'm with some blockers on our side regarding being able to test 1.13 at work but I'll be back with these things on 1.12

@lggomezml
Copy link

@randall77 here's the things you requested, these were on 1.12.9, with changes on extern, mprof and traceback files to print the program counters: dumps.zip

@ianlancetaylor while I was investigating this and looking into the runtime I found some more details that might be of help on diagnosing this and even coming up with a portable repro

  • This issue happens from go 1.12 and afterwards, including 1.13. 1.11 seems unaffected (can't say anything about prior versions)
  • It can be observed that PCs are being decremented on traceback.callers when it shouldn't, but only in some specific cases like the repro I have at work (and, for the time being, only on the panic site as we don't any any proof yet that suggests otherwise)
  • I see that some heavy work was done on tracebacks for 1.12 which involves how PCs are manipulated, this particular PR stood out since the description and changes seem directly related to this issue: 232c979#diff-5a2de8a1053d4e11fbc71407a5361e93

@randall77
Copy link
Contributor

Ok, I think I can piece together what's happening here. The problem is that when an instruction faults, we record in the results of runtime.Callers the PC of the instruction that faulted. Then when we process the trace in runtime.CallersFrames and friends, we subtract 1. That gives us the immediately preceeding instruction. Instead, we must pre-increment by 1 for faulting instructions (as opposed to function calls, where we shouldn't).

@gopherbot
Copy link

Change https://golang.org/cl/196962 mentions this issue: runtime: fix line number for faulting instructions

@lggomez
Copy link
Author

lggomez commented Sep 26, 2019

Thanks for the quick response, in a couple days you managed to fix this after all the time I spent trying to pin it down, amazing!

I didn't have time to set up my docker image to build the revision locally, but I hope I can do so soon. Is the milestone confirmed to be 1.14 only or there's a chance of a backport to 1.13/1.12?

@odeke-em
Copy link
Member

Interestingly on Go1.13.1 I can't reproduce @randall77's repro in the CL
please see https://play.golang.org/p/8A6HGud329v
which properly executes
Screen Shot 2019-09-30 at 12 17 04 PM

but I can reproduce it on tip/master

$ go run main.go 
devel +740d2c8c22 Thu Sep 26 18:47:41 2019 +0000
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: expected line 24, got line 23
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x109769e]

goroutine 1 [running]:
main.main.func1()
	/Users/emmanuelodeke/Desktop/openSrc/bugs/golang/34123/main.go:36 +0x238
panic(0x10b00e0, 0x116cae0)
	/Users/emmanuelodeke/go/src/go.googlesource.com/go/src/runtime/panic.go:679 +0x1b2
main.f()
	/Users/emmanuelodeke/Desktop/openSrc/bugs/golang/34123/main.go:24 +0xe
main.main()
	/Users/emmanuelodeke/Desktop/openSrc/bugs/golang/34123/main.go:43 +0x74
exit status 2

and on Go1.12.10

$ go run main.go 
go1.12.10
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: expected line 24, got line 23
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x109480e]

goroutine 1 [running]:
main.main.func1()
	/Users/emmanuelodeke/Desktop/openSrc/bugs/golang/34123/main.go:36 +0x29e
panic(0x10ada00, 0x116e480)
	/Users/emmanuelodeke/go/src/go.googlesource.com/go/src/runtime/panic.go:522 +0x1b5
main.f()
	/Users/emmanuelodeke/Desktop/openSrc/bugs/golang/34123/main.go:24 +0xe
main.main()
	/Users/emmanuelodeke/Desktop/openSrc/bugs/golang/34123/main.go:43 +0x67
exit status 2

So that's going to be an interesting backport but also perhaps there is something extra at play.

@odeke-em
Copy link
Member

odeke-em commented Oct 1, 2019

@lggomezml
Copy link

@randall77 I´m still able to repro this on my local container using gotip:
go version devel +bf49905 Wed Nov 13 15:52:21 2019 +0000 linux/amd64

I can go back to get the a new set of dumps like you requested before if needed

@odeke-em
Copy link
Member

odeke-em commented Nov 13, 2019 via email

@randall77 randall77 reopened this Nov 13, 2019
@randall77
Copy link
Contributor

Yes, a new set of dumps would be helpful.

@lggomez
Copy link
Author

lggomez commented Nov 13, 2019

@odeke-em @randall77 nevermind, the fix works on all frontiers, thank you for the hard work

Took me a while to figure it out but the issue is on gotip: it uses the toolchain compiled from nightly but not its goroot so I had to run GOROOT=~/sdk/gotip gotip run main.go to effectively use the tip runtime. Closing

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker
Projects
None yet
Development

No branches or pull requests

6 participants