Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: bad inlining tree emitted for function literal #46234

Closed
thanm opened this issue May 18, 2021 · 20 comments
Closed

cmd/compile: bad inlining tree emitted for function literal #46234

thanm opened this issue May 18, 2021 · 20 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge release-blocker
Milestone

Comments

@thanm
Copy link
Contributor

thanm commented May 18, 2021

What version of Go are you using (go version)?

$ go version
go version devel go1.17-8b0901fd32 Tue May 18 07:50:25 2021 +0000 linux/amd64

Does this issue reproduce with the latest release?

Only on 1.17 / tip -- this problem is not present in 1.16.

What operating system and processor architecture are you using (go env)?

linux/amd64

What did you do?

Run this program: https://play.golang.org/p/VZWvwkvYUGx

What did you expect to see?

In 1.16 you get a crash on a nil pointer, e.g.

$ go run repro.go
begin main
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x4a2245]

goroutine 1 [running]:
main.WEA.func1(0x0)
	/tmp/repro.go:17 +0x5
main.(*CM).NewA(0xc0000eef40, 0x0, 0x0, 0x4c761f, 0x3, 0x0, 0xc0000eeed0, 0x2, 0x2, 0xb, ...)
	/tmp/repro.go:30 +0x42
main.(*R).CA(0xc0000eef48, 0x0, 0x0, 0x4c761f, 0x3, 0x0, 0x0, 0xb, 0x0)
	/tmp/repro.go:37 +0xd1
main.main()
	/tmp/repro.go:52 +0xc5
exit status 2
thanm@cetina:/tmp$ 

What did you see instead?

Program gets an infinite loop in traceback:

begin main
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x487f00]

goroutine 1 [running]:
main.WEA(...)
        /tmp/repro.go:17
main.WEA(...)
        /tmp/repro.go:17
main.WEA(...)
        /tmp/repro.go:17
main.WEA(...)
        /tmp/repro.go:17
<repeats>

The problem sees to be in the runtime's traceback handling of inlining -- at the point of the fault we are at

Dump of assembler code for function main.(*R).CA.func1:
=> 0x0000000000487f00 <+0>:	movb   $0x1,(%rax)
   0x0000000000487f03 <+3>:	retq   

The inlining tree for this function (from go build -gcflags=-d=pctab=pctoinline) looks like

-- inlining tree for "".(*R).CA.func1:
0 | -1 | "".WEA (/tmp/repro.go:37:39) pc=0

The problem here is that the traceback code is trying to "back up" an instruction when landing on 0x0000000000487f00 to recover the parent PC (runtime code here), but this winds up hitting the same instruction, hence we never get out of the loop.

I'm tentatively marking this as a compiler bug, can be retitled if it turns out there are other components involved.

@thanm thanm added this to the Go1.17 milestone May 18, 2021
@thanm thanm self-assigned this May 18, 2021
@cherrymui
Copy link
Member

I think the problem here is that the function literal main.(*R).CA.func1 itself is not actually an inlined function (its containing function main.WEA is). So it should not have an inlining tree at all.

@thanm
Copy link
Contributor Author

thanm commented May 18, 2021

Agree-- we don't want an inl tree for main.(*R).CA.func1. Now working on trying to understand how it is we get the inltree.

@gopherbot
Copy link

Change https://golang.org/cl/320913 mentions this issue: cmd/compile: don't emit inltree for closure within body of inlined func

@mdempsky
Copy link
Member

When we clone a function literal because of inlining, do we have a way for users/debuggers to distinguish stepping through a cloned function literal vs the original? It seems like we should somehow maintain that information, along with position details about where it was cloned into.

(Of course, for release purposes, I think it's more critical to have traceback working reliably than to have perfect debugging info.)

@thanm
Copy link
Contributor Author

thanm commented May 18, 2021

When we clone a function literal because of inlining, do we have a way for users/debuggers to distinguish stepping through a cloned function literal vs the original? It seems like we should somehow maintain that information, along with position details about where it was cloned into.

This is an interesting question. I guess my question would be "does the user care"? If the line table info and variable locations are correct (in the sense that they show the expected line numbers, and the debugger can print the right values), then I wonder whether it matters that you're in a clone or not.

@mdempsky
Copy link
Member

That's fair. I think most users shouldn't need to worry about this. But I'm wondering about power users (particularly us) trying to diagnose suspected miscompilation that affects inlined function literals that were optimized differently than the original.

What about the position information for newfn and newfn.Nname? Could we update their positions at least to maintain inlining stack info? Can we use inlined positions for PEXTERN/PFUNC declarations, or are they limited to PAUTO?

I think if we could at least do that, it would address my theoretical power user concern.

@gopherbot
Copy link

Change https://golang.org/cl/366494 mentions this issue: cmd/compile/internal/inline: revise closure inl position fix

gopherbot pushed a commit that referenced this issue Nov 24, 2021
This patch revises the fix for issue 46234, fixing a bug that was
accidentally introduced by CL 320913. When inlining a chunk of code
with a closure expression, we want to avoid updating the source
positions in the function being closed over, but we do want to update
the position for the ClosureExpr itself (since it is part of the
function we are inlining). CL 320913 unintentionally did away with the
closure expr source position update; here we restore it again.

Updates #46234.
Fixes #49171.

Change-Id: Iaa51bc498e374b9e5a46fa0acd7db520edbbbfca
Reviewed-on: https://go-review.googlesource.com/c/go/+/366494
Trust: Than McIntosh <thanm@google.com>
Trust: Dan Scales <danscales@google.com>
Reviewed-by: Dan Scales <danscales@google.com>
@rsc rsc unassigned thanm Jun 23, 2022
@mdempsky
Copy link
Member

At tip with GOEXPERIMENT=unified, the test case passes, but I notice in the -d=pctab=pctoinline output it reports:

funcpctab main.(*R).CA.func1 [valfunc=pctoinline]
     0     -1 00000 (/tmp/bug.go:37)    TEXT    main.(*R).CA.func1(SB), NOSPLIT|ABIInternal, $0-8
     0        00000 (/tmp/bug.go:37)    TEXT    main.(*R).CA.func1(SB), NOSPLIT|ABIInternal, $0-8
     0        00000 (/tmp/bug.go:37)    FUNCDATA        $0, gclocals·wgcWObbY2HYnK2SU/U22lA==(SB)
     0        00000 (/tmp/bug.go:37)    FUNCDATA        $1, gclocals·J5F+7Qw7O7ve2QcWC7DpeQ==(SB)
     0        00000 (/tmp/bug.go:37)    FUNCDATA        $5, main.(*R).CA.func1.arginfo1(SB)
     0        00000 (/tmp/bug.go:37)    FUNCDATA        $6, main.(*R).CA.func1.argliveinfo(SB)
     0        00000 (/tmp/bug.go:37)    PCDATA  $3, $1
     0     -1 00000 (/tmp/bug.go:17)    MOVB    $1, (AX)
     3        00003 (/tmp/bug.go:17)    RET
     4 done
wrote 3 bytes to 0xc00060f3c8
 00 04 00
-- inlining tree for main.(*R).CA.func1:
0 | -1 | main.WEA (/tmp/bug.go:37:39) pc=0
--

Is this an issue, or could it be one? The original bug report mentions this "inlining tree" output being wrong, and it's not present with GOEXPERIMENT=nounified. Though I'm guessing maybe it's harmless here since it's not actually referenced by the instructions.

For #54593, it's been mentioned that pprof sampling seems to correlate with the kubernetes apiserver hanging and getting restarted. I'm wondering if it's hitting a similar issue with infinite stack traces somewhere.

@thanm
Copy link
Contributor Author

thanm commented Aug 23, 2022

Agree that if it's not being referenced by the instructions, it seems hard to imagine that it would be triggering the original bug. But it does seem a bit odd.

@cherrymui
Copy link
Member

Yeah, the presence of the inline tree looks suspicious. What do you mean by "it's not being referenced by the instructions"? Thanks.

@thanm
Copy link
Contributor Author

thanm commented Aug 23, 2022

It shows up in the dump. For example in this code:

func (r *R) CA(ctx context.Context, cN string, nn *nAO) (*int, error) {
	cA, err := r.cM.NewA(ctx, cN, nn, WEA(), WEA())
	if err == nil {

the calls to WEA() get inlined. In the dump you can see this in the second column:

    39        00057 (<unknown line number>)	NOP
    39        00057 (/usr/local/google/home/thanm/gobugs/46234/repro.go:37)	XORL	AX, AX
    3b      0 00059 (/usr/local/google/home/thanm/gobugs/46234/repro.go:37)	JMP	96
    3d        00061 (/usr/local/google/home/thanm/gobugs/46234/repro.go:37)	MOVQ	AX, main..autotmp_36+8(SP)
    42        00066 (/usr/local/google/home/thanm/gobugs/46234/repro.go:37)	MOVQ	main..autotmp_24+24(SP)(AX*8), DX

Note the "0" for the instruction at 00059.

At least that's how I'm reading it; hopefully I am not off base.

@mdempsky
Copy link
Member

mdempsky commented Aug 23, 2022

What do you mean by "it's not being referenced by the instructions"?

In the instruction stream, the second column is always "-1" or blank. (Contrast @thanm's example of a different function where the entry is 0 at one point.)

However, I have been able to construct alternative cases where the instructions for an inlined function literal do reference where the enclosing function was inlined to (i.e., the original bug here). I haven't successfully gotten any of them to crash with pprof yet though.

@cherrymui
Copy link
Member

Okay, thanks. You're right that that one is not referenced. It is odd that the unreferenced inline tree appears. Maybe something along the line of the CL above would be necessary for unified IR.

@mdempsky
Copy link
Member

I have this program: https://go.dev/play/p/OikFLBmKguY?v=gotip

Which produces this -d=pctab=pctoinline output:

funcpctab main.stress.func1 [valfunc=pctoinline]
     0     -1 00000 (/tmp/repro.go:12)  TEXT    main.stress.func1(SB), NOSPLIT|NEEDCTXT|ABIInternal, $0-0
     0        00000 (/tmp/repro.go:12)  TEXT    main.stress.func1(SB), NOSPLIT|NEEDCTXT|ABIInternal, $0-0
     0        00000 (/tmp/repro.go:12)  FUNCDATA        $0, gclocals·g2BeySu+wFnoycgXfElmcg==(SB)
     0        00000 (/tmp/repro.go:12)  FUNCDATA        $1, gclocals·g2BeySu+wFnoycgXfElmcg==(SB)
     0      0 00000 (/tmp/repro.go:12)  MOVQ    16(DX), AX
     4        00004 (/tmp/repro.go:12)  MOVQ    24(DX), CX
     8        00008 (/tmp/repro.go:12)  MOVQ    32(DX), BX
     c        00012 (/tmp/repro.go:12)  MOVQ    8(DX), DX
    10     -1 00016 (/tmp/repro.go:22)  MOVQ    DX, main.x0(SB)
    17        00023 (/tmp/repro.go:23)  MOVQ    AX, main.x1(SB)
    1e        00030 (/tmp/repro.go:24)  MOVQ    CX, main.x2(SB)
    25        00037 (/tmp/repro.go:25)  MOVQ    BX, main.x3(SB)
    2c        00044 (/tmp/repro.go:26)  RET
    2d done
wrote 5 bytes to 0xc000817078
 02 10 01 1d 00
-- inlining tree for main.stress.func1:
0 | -1 | main.foo (/tmp/repro.go:12:9) pc=0
--

And that output looks like it should be "wrong". Specifically, the inlining tree entry is actually used for loading the closure variable values. And the pc=0 value seems like it would point back into those same instructions within the function, leading to an infinite loop during stack walking.

But pprof doesn't seem to be getting stuck there.

@mdempsky
Copy link
Member

Okay, manually instrumenting the runtime at

, I can see that loop does go on infinitely because of the inlining cycle.

@mdempsky mdempsky reopened this Aug 23, 2022
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Aug 23, 2022
@mdempsky mdempsky self-assigned this Aug 23, 2022
@mdempsky
Copy link
Member

Actually, I'm going to file a new issue instead of reopening this one. I don't want to cause any weird issues for milestones.

@cherrymui
Copy link
Member

Yeah, that indeed looks like a cycle. Profiling collects at most 32 frames, so I guess the program doesn't fall into infinite loop, but the profile would looks like a bunch of recursive calls.

@mdempsky
Copy link
Member

@cherrymui I was able to confirm that the loop does go infinite. We only check the n < max condition at the top-level, whereas the inner loop for handling PCDATA_InlTreeIndex goes until it finds a negative value (which never happens with a cycle).

@cherrymui
Copy link
Member

Oops... Good point. Thanks.

@gopherbot
Copy link

Change https://go.dev/cl/425395 mentions this issue: cmd/compile/internal/noder: fix inlined function literal positions

gopherbot pushed a commit that referenced this issue Aug 25, 2022
When inlining function calls, we rewrite the position information on
all of the nodes to keep track of the inlining context. This is
necessary so that at runtime, we can synthesize additional stack
frames so that the inlining is transparent to the user.

However, for function literals, we *don't* want to apply this
rewriting to the underlying function. Because within the function
literal (when it's not itself inlined), the inlining context (if any)
will have already be available at the caller PC instead.

Unified IR was already getting this right in the case of user-written
statements within the function literal, which is what the unit test
for #46234 tested. However, it was still using inline-adjusted
positions for the function declaration and its parameters, which
occasionally end up getting used for generated code (e.g., loading
captured values from the closure record).

I've manually verified that this fixes the hang in
https://go.dev/play/p/avQ0qgRzOgt, and spot-checked the
-d=pctab=pctoinline output for kube-apiserver and kubelet and they
seem better.

However, I'm still working on a more robust test for this (hence
"Updates" not "Fixes") and internal assertions to verify that we're
emitting correct inline trees. In particular, there are still other
cases (even in the non-unified frontend) where we're producing
corrupt (but at least acyclic) inline trees.

Updates #54625.

Change-Id: Iacfd2e1eb06ae8dc299c0679f377461d3d46c15a
Reviewed-on: https://go-review.googlesource.com/c/go/+/425395
Run-TryBot: Matthew Dempsky <mdempsky@google.com>
Auto-Submit: Matthew Dempsky <mdempsky@google.com>
Reviewed-by: Cuong Manh Le <cuong.manhle.vn@gmail.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
rajbarik pushed a commit to rajbarik/go that referenced this issue Sep 1, 2022
When inlining function calls, we rewrite the position information on
all of the nodes to keep track of the inlining context. This is
necessary so that at runtime, we can synthesize additional stack
frames so that the inlining is transparent to the user.

However, for function literals, we *don't* want to apply this
rewriting to the underlying function. Because within the function
literal (when it's not itself inlined), the inlining context (if any)
will have already be available at the caller PC instead.

Unified IR was already getting this right in the case of user-written
statements within the function literal, which is what the unit test
for golang#46234 tested. However, it was still using inline-adjusted
positions for the function declaration and its parameters, which
occasionally end up getting used for generated code (e.g., loading
captured values from the closure record).

I've manually verified that this fixes the hang in
https://go.dev/play/p/avQ0qgRzOgt, and spot-checked the
-d=pctab=pctoinline output for kube-apiserver and kubelet and they
seem better.

However, I'm still working on a more robust test for this (hence
"Updates" not "Fixes") and internal assertions to verify that we're
emitting correct inline trees. In particular, there are still other
cases (even in the non-unified frontend) where we're producing
corrupt (but at least acyclic) inline trees.

Updates golang#54625.

Change-Id: Iacfd2e1eb06ae8dc299c0679f377461d3d46c15a
Reviewed-on: https://go-review.googlesource.com/c/go/+/425395
Run-TryBot: Matthew Dempsky <mdempsky@google.com>
Auto-Submit: Matthew Dempsky <mdempsky@google.com>
Reviewed-by: Cuong Manh Le <cuong.manhle.vn@gmail.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
@golang golang locked and limited conversation to collaborators Aug 24, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge release-blocker
Projects
None yet
Development

No branches or pull requests

4 participants