
net/http/pprof: Negative allocation counts produced by /debug/pprof/allocs?seconds=1 #49171

Closed
felixge opened this issue Oct 27, 2021 · 9 comments
Labels: FrozenDueToAge, NeedsInvestigation (someone must examine and confirm this is a valid issue and not a duplicate of an existing one)

Comments

@felixge
Contributor

felixge commented Oct 27, 2021

What version of Go are you using (go version)?

$ go version
go version go1.17.2 darwin/amd64

Does this issue reproduce with the latest release?

Yes, but it does not reproduce with previous releases, e.g. go1.16.9.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/felix.geisendoerfer/Library/Caches/go-build"
GOENV="/Users/felix.geisendoerfer/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS="-count=1"
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/felix.geisendoerfer/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/felix.geisendoerfer/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/Cellar/go/1.17.2/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.17.2/libexec/pkg/tool/darwin_amd64"
GOVCS=""
GOVERSION="go1.17.2"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/felix.geisendoerfer/go/src/github.com/felixge/dump/pprof-symbol-bug/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/mg/xwzs985x35789zk2l14nk8k80000gn/T/go-build895031316=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

Continuous allocation profiling of a production application using delta profiles similar to running /debug/pprof/allocs?seconds=60 in a loop.

Below is the smallest reproduction of the issue that I've managed to create so far. It executes an APM instrumentation workload and continuously checks the delta allocation profile for negative values.

main.go
package main

import (
	"context"
	"fmt"
	"io/ioutil"
	"log"
	"os"
	"time"

	"strings"

	"net/http"
	"net/http/httptest"
	"net/http/pprof"

	"github.com/google/pprof/profile"
	"google.golang.org/grpc/metadata"
	"gopkg.in/DataDog/dd-trace-go.v1/ddtrace"
	"gopkg.in/DataDog/dd-trace-go.v1/ddtrace/ext"
	"gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer"
)

func main() {
	log.Printf("starting")
	go memProfile()
	go workload()
	time.Sleep(10 * time.Second)
	log.Printf("unable to reproduce issue, shutting down")
}

func workload() {
	prevCtx := context.Background()
	for {
		_, prevCtx = startSpanFromContext(prevCtx, "foo", "bar", "op", tracer.AnalyticsRate(100))
	}
}

func memProfile() {
	h := pprof.Handler("allocs")

	for {
		recorder := httptest.NewRecorder()
		request, err := http.NewRequest("GET", "/?seconds=1", nil)
		if err != nil {
			panic(err)
		}
		h.ServeHTTP(recorder, request)

		profData := recorder.Body.Bytes()
		prof, err := profile.ParseData(profData)
		if err != nil {
			panic(err)
		}

		for si, s := range prof.Sample {
			for i, v := range s.Value {
				if v < 0 && prof.SampleType[i].Type == "alloc_objects" {
					st := prof.SampleType[i]
					log.Printf("negative delta value for: %s/%s: %d\n", st.Type, st.Unit, v)
					printDetails(prof, si, s)

					log.Printf("writing profile to alloc.pprof")
					ioutil.WriteFile("alloc.pprof", profData, 0666)
					os.Exit(1)
				}
			}
		}
	}
}

func printDetails(prof *profile.Profile, negI int, negSample *profile.Sample) {
	printSample(negI, negSample)

	negativeKey := sampleKey(negSample)
	for si, otherSample := range prof.Sample {
		if otherSample == negSample {
			continue
		}

		otherKey := sampleKey(otherSample)
		if otherKey == negativeKey {
			fmt.Printf("\nfound another sample with the same addrs:\n\n")
			printSample(si, otherSample)
		}
	}
}

func printSample(idx int, s *profile.Sample) {
	fmt.Printf("sample %d values=%v labels=%#v numLabels=%#v\n", idx, s.Value, s.Label, s.NumLabel)
	for _, loc := range s.Location {
		for i, line := range loc.Line {
			fmt.Printf("loc %d addr %d line %d %s() %s:%d\n", loc.ID, loc.Address, i, line.Function.Name, line.Function.Filename, line.Line)
		}
	}
}

func sampleKey(s *profile.Sample) string {
	var addrs []string
	for _, loc := range s.Location {
		addrs = append(addrs, fmt.Sprintf("%d", loc.Address))
	}
	return strings.Join(addrs, ",")
}

func startSpanFromContext(
	ctx context.Context, method, operation, service string, opts ...tracer.StartSpanOption,
) (ddtrace.Span, context.Context) {
	opts = append(opts,
		tracer.ServiceName(service),
		tracer.ResourceName(method),
		tracer.Tag("method", method),
		tracer.SpanType(ext.AppTypeRPC),
	)
	metadata.FromIncomingContext(ctx) // nil is ok
	return tracer.StartSpanFromContext(ctx, operation, opts...)
}

type MDCarrier metadata.MD

var _ tracer.TextMapWriter = (*MDCarrier)(nil)
var _ tracer.TextMapReader = (*MDCarrier)(nil)

// Get will return the first entry in the metadata at the given key.
func (mdc MDCarrier) Get(key string) string {
	if m := mdc[key]; len(m) > 0 {
		return m[0]
	}
	return ""
}

// Set will add the given value to the values found at key. Key will be lowercased to match
// the metadata implementation.
func (mdc MDCarrier) Set(key, val string) {
	k := strings.ToLower(key) // as per google.golang.org/grpc/metadata/metadata.go
	mdc[k] = append(mdc[k], val)
}

// ForeachKey will iterate over all key/value pairs in the metadata.
func (mdc MDCarrier) ForeachKey(handler func(key, val string) error) error {
	for k, vs := range mdc {
		for _, v := range vs {
			if err := handler(k, v); err != nil {
				return err
			}
		}
	}
	return nil
}

Sorry about not being able to make this smaller, but so far I haven't been able to trigger the problem with simpler workloads.

What did you expect to see?

alloc_objects values should never be negative, so the example program should exit after 10 seconds and print unable to reproduce issue, shutting down.

What did you see instead?

The example program detects negative alloc_objects values and prints debugging output like that shown below.

2021/10/27 01:39:02 starting
2021/10/27 01:39:08 negative delta value for: alloc_objects/count: -10923
sample 0 values=[-10923 -524312 0 0] labels=map[string][]string(nil) numLabels=map[string][]int64{"bytes":[]int64{48}}
loc 1 addr 19853706 line 0 main.startSpanFromContext() /Users/felix.geisendoerfer/go/pkg/mod/gopkg.in/!data!dog/dd-trace-go.v1@v1.33.0/ddtrace/tracer/option.go:483
loc 2 addr 19850287 line 0 main.workload() /Users/felix.geisendoerfer/go/src/github.com/felixge/dump/pprof-symbol-bug/main.go:34

found another sample with the same addrs:

sample 7 values=[10923 524312 0 0] labels=map[string][]string(nil) numLabels=map[string][]int64{"bytes":[]int64{48}}
loc 10 addr 19853706 line 0 main.startSpanFromContext() /Users/felix.geisendoerfer/go/src/github.com/felixge/dump/pprof-symbol-bug/main.go:483
loc 2 addr 19850287 line 0 main.workload() /Users/felix.geisendoerfer/go/src/github.com/felixge/dump/pprof-symbol-bug/main.go:34
exit status 1

Additionally, go tool pprof -http=:6060 alloc.pprof throws JavaScript errors when opening an alloc flamegraph for the profile.

[screenshot omitted: the pprof web UI showing a JavaScript error when rendering the alloc flamegraph]

A few things to note:

  • The produced profile contains two samples with the same addrs (19853706, 19850287), but the filename/line symbolization for 19853706 is different between the samples (option.go:483 vs main.go:483).
  • Neither filename/line symbolization for 19853706 appears to be correct.
  • The negative samples seem to be a result of the internal profile.Merge() call, which can't match some stack traces with the same addrs from p0 and p1 because they got symbolized differently (i.e. the origin of this problem is unrelated to delta profiles, but delta profiles are where it has a very bad impact, so I'm using them as the main example). A sketch of this suspected mechanism follows the log output below.
  • Different runs of the test program produce different results; sometimes it passes. I suspect this is due to the random sampling of the memory profiler.
  • Setting runtime.MemProfileRate = 1 at the top of main() seems to make the test program always pass.
  • Building with -gcflags='-l' doesn't fix the issue. pprof also reports samples that claim inlining is still happening, see the example below (note how loc 17 has two lines). However, -gcflags=-m doesn't report any inlining taking place 🤔
2021/10/27 01:50:55 starting
2021/10/27 01:50:59 negative delta value for: alloc_objects/count: -21846
sample 3 values=[-21846 -1048624 -21846 -1048624] labels=map[string][]string(nil) numLabels=map[string][]int64{"bytes":[]int64{48}}
loc 16 addr 17354981 line 0 context.WithValue() /usr/local/Cellar/go/1.17.2/libexec/src/context/context.go:533
loc 17 addr 19848324 line 0 gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.ContextWithSpan() /Users/felix.geisendoerfer/go/pkg/mod/gopkg.in/!data!dog/dd-trace-go.v1@v1.33.0/ddtrace/tracer/context.go:21
loc 17 addr 19848324 line 1 gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.StartSpanFromContext() /Users/felix.geisendoerfer/go/pkg/mod/gopkg.in/!data!dog/dd-trace-go.v1@v1.33.0/ddtrace/tracer/context.go:50
loc 18 addr 19855951 line 0 main.startSpanFromContext() /Users/felix.geisendoerfer/go/src/github.com/felixge/dump/pprof-symbol-bug/main.go:111
loc 19 addr 19853062 line 0 main.workload() /Users/felix.geisendoerfer/go/src/github.com/felixge/dump/pprof-symbol-bug/main.go:34

found another sample with the same addrs:

sample 11 values=[21846 1048624 21846 1048624] labels=map[string][]string(nil) numLabels=map[string][]int64{"bytes":[]int64{48}}
loc 16 addr 17354981 line 0 context.WithValue() /usr/local/Cellar/go/1.17.2/libexec/src/context/context.go:533
loc 37 addr 19848324 line 0 gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.ContextWithSpan() /Users/felix.geisendoerfer/go/pkg/mod/gopkg.in/!data!dog/dd-trace-go.v1@v1.33.0/ddtrace/tracer/context.go:21
loc 37 addr 19848324 line 1 gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.StartSpanFromContext() /Users/felix.geisendoerfer/go/pkg/mod/gopkg.in/!data!dog/dd-trace-go.v1@v1.33.0/ddtrace/tracer/option.go:50
loc 18 addr 19855951 line 0 main.startSpanFromContext() /Users/felix.geisendoerfer/go/src/github.com/felixge/dump/pprof-symbol-bug/main.go:111
loc 19 addr 19853062 line 0 main.workload() /Users/felix.geisendoerfer/go/src/github.com/felixge/dump/pprof-symbol-bug/main.go:34
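
To illustrate the suspected Merge() behavior, here is a minimal, self-contained sketch using github.com/google/pprof/profile. All function names, addresses and values below are made up for illustration. It builds two profiles containing the same sample at the same address, but with the address symbolized against two different filenames, scales the first profile by -1 the way the seconds-based handler does, and merges them. Because Merge keys locations on function name, filename and line as well as on address, the two samples are not matched up and one of them survives as a negative value.

package main

import (
	"fmt"

	"github.com/google/pprof/profile"
)

// buildProfile returns a tiny allocs-style profile with a single sample at a
// fixed address, symbolized against the given filename. Everything here is
// fabricated for illustration.
func buildProfile(filename string) *profile.Profile {
	fn := &profile.Function{ID: 1, Name: "main.startSpanFromContext", SystemName: "main.startSpanFromContext", Filename: filename}
	loc := &profile.Location{ID: 1, Address: 19853706, Line: []profile.Line{{Function: fn, Line: 483}}}
	return &profile.Profile{
		SampleType: []*profile.ValueType{
			{Type: "alloc_objects", Unit: "count"},
			{Type: "alloc_space", Unit: "bytes"},
		},
		PeriodType: &profile.ValueType{Type: "space", Unit: "bytes"},
		Period:     524288,
		Function:   []*profile.Function{fn},
		Location:   []*profile.Location{loc},
		Sample: []*profile.Sample{
			{Location: []*profile.Location{loc}, Value: []int64{10923, 524312}},
		},
	}
}

func main() {
	p0 := buildProfile("option.go") // "before" profile: address symbolized as option.go:483
	p1 := buildProfile("main.go")   // "after" profile: same address symbolized as main.go:483

	p0.Scale(-1) // delta = p1 - p0, as the seconds-based handler does
	delta, err := profile.Merge([]*profile.Profile{p0, p1})
	if err != nil {
		panic(err)
	}
	// Instead of cancelling to an empty profile, the delta keeps both a
	// negative and a positive sample for what is really the same stack.
	for _, s := range delta.Sample {
		fmt.Println(s.Value)
	}
}

With identical symbolization the two samples cancel out and Merge drops the resulting zero sample, which is why the problem only shows up when the file/line for a PC flips between the two collections.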

Theories

My best guess is that this is somehow related to go1.17 being able to inline functions containing closures.

Most samples with negative alloc values that I've seen involve functions that call functions containing closures. The issue also doesn't seem to be reproducible in go1.16.

However, the issue can still be reproduced on go1.17 when disabling inlining with -gcflags='-l', so I'm not sure ...
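
For reference, here is a minimal, hedged example of the go1.17 behavior this theory refers to; the function below is made up and not taken from the repro. go1.17 can inline a function whose body contains a func literal, which earlier releases declined to do.

package main

import "fmt"

// adder contains a closure. go1.16 refuses to inline such functions, while
// go1.17 can inline them; this is the behavior the theory above points at.
func adder(x int) func(int) int {
	return func(y int) int { return x + y }
}

func main() {
	add41 := adder(41) // a call site the go1.17 inliner may expand
	fmt.Println(add41(1))
}

Building this with go build -gcflags=-m on go1.17 typically reports that adder can be inlined and that the call in main is inlined, whereas go1.16 reports that it cannot inline adder because it contains a closure (exact wording varies by release).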

@felixge
Contributor Author

felixge commented Oct 27, 2021

@prattmic @hyangah this might be of interest to you. I'm still trying to nail down the exact root cause of this, but so far I've failed. Maybe something in runtime/pprof/proto.go is causing this? FWIW runtime.CallersFrames() seems to be innocent; at least I wasn't able to make it produce non-deterministic symbolization when throwing some of the problematic pcs at it.
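
For anyone who wants to repeat that check, below is a rough sketch of the experiment described above. It uses PCs from runtime.Callers on the current goroutine as a stand-in for the problematic pcs from the broken profile (those are specific to the original binary), and symbolizes each PC repeatedly via runtime.CallersFrames, reporting if the file/line ever changes.

package main

import (
	"fmt"
	"runtime"
)

// symbolize resolves a single PC to "function file:line" via CallersFrames.
func symbolize(pc uintptr) string {
	frames := runtime.CallersFrames([]uintptr{pc})
	f, _ := frames.Next()
	return fmt.Sprintf("%s %s:%d", f.Function, f.File, f.Line)
}

func main() {
	// Stand-in PCs; with the original binary one would plug in the
	// problematic addresses from the broken profile instead.
	pcs := make([]uintptr, 32)
	pcs = pcs[:runtime.Callers(0, pcs)]

	first := map[uintptr]string{}
	for i := 0; i < 10000; i++ {
		for _, pc := range pcs {
			got := symbolize(pc)
			if want, ok := first[pc]; !ok {
				first[pc] = got
			} else if got != want {
				fmt.Printf("non-deterministic symbolization for %#x: %q vs %q\n", pc, want, got)
				return
			}
		}
	}
	fmt.Println("symbolization stayed stable")
}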

@seankhliao added the NeedsInvestigation label Oct 27, 2021
@egonelbre
Contributor

Bisected the issue to 6d2ef2e. CC: @thanm

@thanm
Contributor

thanm commented Oct 28, 2021

Thanks for filing this issue. I will take a look.

You wrote:

The produced profile contains two samples with the same addrs
(19853706, 19850287), but the filename/line symbolization for 19853706
is different between the samples (option.go:483 vs main.go:483).

Just to confirm, you are saying that the allocation site (PC value) is the same for the two samples, but the symbolization is different? This seems very unexpected.

The negative samples seem to be a result of the internal
profile.Merge() call that can't match some stack traces with the same
addrs from p0 and p1 because they got symbolized differently (i.e. the
origin of this problem is unrelated to delta profiles, but delta
profiles are where this is having a very bad impact, so I'm using it
as the main example).

What are the "p0" and "p1" to which you refer?

Building with -gcflags='-l' doesn't fix the issue.

Also surprising given the bisection, since that commit really only makes a difference if the inliner is enabled... but I suppose that will make it all the more fun to track down :-)

@thanm self-assigned this Oct 28, 2021
@felixge
Contributor Author

felixge commented Oct 28, 2021

Just to confirm, you are saying that the allocation site (PC value) is the same for the two samples, but the symbolization is different? This seems very unexpected.

Yes. I was also very surprised to see this.

What are the "p0" and "p1" to which you refer?

The before/after profiles for delta profiling. They are referenced on the line calling profile.Merge() that I linked.
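
For context, the delta computation is roughly the sketch below, using the external github.com/google/pprof/profile package and the public runtime/pprof API instead of the runtime-internal code the real handler uses, so it is only an approximation, but p0 and p1 play the same roles.

package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
	"time"

	"github.com/google/pprof/profile"
)

// collectAllocs dumps the current cumulative allocs profile and parses it.
func collectAllocs() (*profile.Profile, error) {
	var buf bytes.Buffer
	if err := pprof.Lookup("allocs").WriteTo(&buf, 0); err != nil {
		return nil, err
	}
	return profile.ParseData(buf.Bytes())
}

// deltaAllocs approximates /debug/pprof/allocs?seconds=N: p0 is the profile at
// the start of the window, p1 the profile at the end, and the reported delta
// is Merge(-1*p0, p1).
func deltaAllocs(d time.Duration) (*profile.Profile, error) {
	p0, err := collectAllocs()
	if err != nil {
		return nil, err
	}
	time.Sleep(d)
	p1, err := collectAllocs()
	if err != nil {
		return nil, err
	}
	p0.Scale(-1)
	return profile.Merge([]*profile.Profile{p0, p1})
}

func main() {
	delta, err := deltaAllocs(time.Second)
	if err != nil {
		panic(err)
	}
	fmt.Println("samples in 1s delta:", len(delta.Sample))
}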

Also surprising given the bisection, since that commit really only makes a difference if the inliner is enabled... but I suppose that will make it all the more fun to track down :-)

I agree, the whole bug is very strange 😅. But please retest anything stated here if it seems to make no sense. I haven't done the bisection myself, and there is also a chance that some of the information I provided is not correct. But hopefully the provided sample program will offer a good way to reproduce and analyze this on your end.

@felixge
Contributor Author

felixge commented Nov 21, 2021

Hey @thanm, did you end up investigating this? If yes, did you find any more clues for the root cause?

@thanm
Contributor

thanm commented Nov 23, 2021

I am getting a fix ready for this. I spent a little while on a wild goose chase, but I think I've made some progress.

@thanm
Contributor

thanm commented Nov 23, 2021

This took a while to puzzle out, but basically what's happening here is that we're losing an inline mark as a result of the fix for issue #46234 (CL 320913). Prior to that CL we had this line of code, which was removed as part of that CL. The (*inlsubst).closure() method is dealing with two constructs: the OCLOSURE expression (which needs to have its position updated) and the closure function (which needs to be left alone). The fix is to restore the line cited above (in a slightly different form, since the closure code has been refactored a bit since my CL).

@gopherbot

Change https://golang.org/cl/366494 mentions this issue: cmd/compile/internal/inline: revise closure inl position fix

@felixge
Contributor Author

felixge commented Nov 24, 2021

@thanm thank you so much for debugging and fixing this! Seems to work great based on my testing with the code from above.

@rsc unassigned thanm Jun 23, 2022
@golang locked and limited conversation to collaborators Jun 23, 2023