cmd/compile: improve fn arg recovery in tracebacks via call site records #65021

thanm · 2024-01-08T18:33:04Z

With the current Go compiler + runtime, when a program crashes/panics and we generate a traceback, the stack walking code in addition to printing out the names of functions present in the trace also makes an attempt to print function argument information.

Arg printing is done opportunistically, and is which is to say that the runtime will print the contents of the stack locations associated with parameters for a function, which are sometimes accurate but more often than not, inaccurate.

This issue to track the idea of enhancing the pclntab info for Go binaries to try to do a little better at recovering correct arg values. First a background section talking about the current implementation characteristics, then a second section talking about ideas on how we could do better.

Background

Here is a toy program that when run, causes a crash due to an out-of-bounds array access:

package main

var G1, G2 int
var C [10]int

func main() {
	println(foo(G1+11, G2+11, "blah", 5.0))
}

func foo(x, y int, s string, f float32) int {
	var big [1010]int
	q := x + y
	il := x + 1
	jl := y + 1
	for i := 0; i < il; i++ {
		for j := 0; j < jl; j++ {
			q += bar(i, j)
		}
		big[i]++
	}
	return q + big[0]
}

func bar(q, r int) int {
	v := q + 1 + C[0]
	defer func() { G1 += G2 }()
	for i := 0; i < 10; i++ {
		C[i] &= 0x3
	}
	C[v] += 3
	return v
}

If you run this program optimization disabled, here's the crash you get:

$ go run -trimpath -gcflags="-l -N" ./example.go
panic: runtime error: index out of range [10] with length 10

goroutine 1 [running]:
main.bar(0x9, 0x0)
	./example.go:31 +0x12f
main.foo(0xb, 0xb, {0x468238, 0x4}, 0x40a00000)
	./example.go:18 +0xf8
main.main()
	./example.go:8 +0x3d
exit status 2

Note that for the "main.bar" frame, the runtime is able to correctly print the values of the incoming parameters "q" and "r", and similarly the args for "main.foo" show up correctly as well.

Now here's what happens when you run a default build (optimization enabled, without "-l -N"):

$  go run -trimpath ./example.go
panic: runtime error: index out of range [10] with length 10

goroutine 1 [running]:
main.bar(0x0?, 0xc0000a0688?)
	./example.go:30 +0x9d
main.foo(0x60?, 0x0?, {0xc00010e058?, 0x4ca370?}, 0x45dcc0?)
	./example.go:17 +0x93
main.main()
	./example.go:7 +0x3d
exit status 2

The args for "main.bar" are shown incorrectly as "0x0?, 0xc0000a0688?"; the question mark symbols show that the compiler is aware of the fact that params are no longer live at the point where the crash takes place, but is making a "best effort" by reading the spill spill locations for the params (stack slot that the param is written to if we have to call morestack). In this case the contents of the spill slot are pretty much garbage. Similarly for the "main.foo" parameter.

Suppose we rewrite "var big [10]int" at line 12 to "var big [1010]int" and try again:

$ go run -trimpath ./example.go
panic: runtime error: index out of range [10] with length 10

goroutine 1 [running]:
main.bar(0x0?, 0x0?)
	./example.go:30 +0x9d
main.foo(0xb?, 0xb?, {0x468238?, 0x4?}, 0x40a00000?)
	./example.go:17 +0x94
main.main()
	./example.go:7 +0x3d
exit status 2

We still have incorrect values for the "main.bar" args, but the "main.foo" args are looking pretty decent -- this is due to the fact that the large stack frame for the routine triggered a call to "morestack", so the spill locations were actually written to, so the values we read back are correct.

Overall however the story is not great on reporting arg values accurately, and since we moved to the register ABI, the fraction of inaccurate values printed has definitely gone up.

One way to do better

In the DWARF world, compiler developers also wrestle with a similar problem, e.g. you want to print the values of incoming params, but the params are no longer live at the point where something bad happens.

A while back DWARF was enhanced with some extra mechanisms to help here, specifically the DW_TAG_callsite and DW_TAG_callsite_value tags. How these work are described in the DWARF standard, also in https://dwarfstd.org/issues/100909.2.html, but here rough idea is that the compiler generates supplemental location expressions at each callsite that describe how to compute the value of each argument passed at the call. When the unwinder is trying to recover function args for a given frame and runs into "no longer live" params, it can go back up the stack and use the callsite DIEs to recover them. Example:

    func foo(x, y int) {
       var x [10]int
       ...
       bar(&x[0])
       ...
    }
    func bar(p *int) int {
      a := *p // last use of "p"
      ...
      if ... {
        panic("bad")
      }
    }

Within "foo", the compiler emits a DWARF DW_TAG_call_site DIE for the the call to "bar", and then examines each expression feeding into the call. For the first param, since the frame offset of "x" is know, it write out a DW_AT_call_site_value attribute indicating that the value of the arg can be computed as FP plus a constant.

When the unwinder is emitting stack trace output for "bar", since parameter "p" is no longer live at the panic, it instead finds the subprogram DIE for "foo", then locates the correct DW_TAG_call_site for the call to "bar", and then uses the location expression in the first DW_AT_call_site_value attribute to arrive at an accurate value for the argument.

This is not a bulletproof solution, of course, since the arg value may be something that we can't describe with a single location expression, and we're again vulnerable to values being clobbered, but in many cases it can help recover the value.

We could in theory implement the same sort of thing in the Go compiler with the pclntab-- emit records for each call site (if that site has one or more recoverable arg values) and then some sort of simple/restricted location expression that tells how to compute the value of the arg.

The text was updated successfully, but these errors were encountered:

aktau · 2024-01-09T06:59:54Z

Would it be too expensive for the traceback to use DW_TAG_call_site so it could be used to both enhance the debugging and traceback experience?

thanm · 2024-01-09T14:04:18Z

Would it be too expensive for the traceback to use DW_TAG_call_site so it could be used to both enhance the debugging and traceback experience?

The short answer to that is yes, it would be too expensive.

The Go pclntab is located in a loadable section in the Go binary (similar to text or rodata), which means that the runtime can read it right out of program memory. DWARF is stored in non-loadable sections, so in order for the Go runtime to have access, it would have to open/read the file corresponding to the program binary. In many cases that is a non-starter.

aktau · 2024-01-09T14:33:15Z

That makes sense. I guess enhancing DWARF should be a separate issue (tracebacks and debugging can't share data). Thanks.

thanm added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Jan 8, 2024

thanm added this to the Backlog milestone Jan 8, 2024

gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jan 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmd/compile: improve fn arg recovery in tracebacks via call site records #65021

cmd/compile: improve fn arg recovery in tracebacks via call site records #65021

thanm commented Jan 8, 2024

aktau commented Jan 9, 2024

thanm commented Jan 9, 2024

aktau commented Jan 9, 2024

cmd/compile: improve fn arg recovery in tracebacks via call site records #65021

cmd/compile: improve fn arg recovery in tracebacks via call site records #65021

Comments

thanm commented Jan 8, 2024

aktau commented Jan 9, 2024

thanm commented Jan 9, 2024

aktau commented Jan 9, 2024