runtime/pprof: block profiler: fix sampling bias #44192

Closed
felixge opened this issue Feb 9, 2021 · 16 comments
Labels: FrozenDueToAge, NeedsFix (the path to resolution is known, but the work has not been done)
Milestone: Go1.17

Comments

felixge (Contributor) commented Feb 9, 2021

What version of Go are you using (go version)?

$ go version
go version go1.15.7 darwin/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/felix.geisendoerfer/Library/Caches/go-build"
GOENV="/Users/felix.geisendoerfer/Library/Application Support/go/env"
GOEXE=""
GOFLAGS="-count=1"
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/felix.geisendoerfer/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/felix.geisendoerfer/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/Cellar/go/1.15.7_1/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.15.7_1/libexec/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/felix.geisendoerfer/go/src/github.com/felixge/go-profiler-notes/examples/block-bias/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/mg/xwzs985x35789zk2l14nk8k80000gn/T/go-build460068460=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I created a program with two goroutines. One is blocked on a slow time.NewTicker() loop; the other on a ticker that is 10x faster. I then observed the program with the block profiler, using the duration of the slow ticker as the rate.

Code: https://gist.github.com/felixge/6f9701dd26707dbe7d351805860a4608
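
The linked gist has the actual code; below is a rough sketch of the setup described, where the ticker durations, the 10-second run time and the output file name are my assumptions:

package main

import (
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

// tickLoop blocks on a ticker channel forever, so the goroutine spends
// essentially all of its time blocked.
func tickLoop(d time.Duration) {
	t := time.NewTicker(d)
	defer t.Stop()
	for range t.C {
	}
}

func main() {
	// Use the slow ticker's duration (in nanoseconds) as the block profile rate.
	runtime.SetBlockProfileRate(int(100 * time.Millisecond))

	go tickLoop(100 * time.Millisecond) // slow ticker
	go tickLoop(10 * time.Millisecond)  // 10x faster ticker

	time.Sleep(10 * time.Second)

	f, err := os.Create("block.pprof")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	pprof.Lookup("block").WriteTo(f, 0)
}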

What did you expect to see?

Both goroutines are blocked 100% of their time, so I'd expect both of them to show up equally in the block profile, as shown below:

[Screenshot: expected block profile, with both goroutines accounting for roughly equal blocking time]

What did you see instead?

The block profiler's sampling appears to be biased, preferring infrequent long block events over frequent short ones, even when they contribute the same amount of overall blocking. This can be seen in the block profile I actually got:

[Screenshot: actual block profile, dominated by the goroutine blocking on the slow ticker]

I've done a lot of additional research on this which can be found here: https://github.com/felixge/go-profiler-notes/blob/main/block.md#sampling-bias

Proposal

I propose fixing the profiler in a way that removes the sampling bias without changing the API or increasing overhead. Concretely, this might be done with a patch like this:

felixge/go@master...debias-blockprofile-rate

I've described this in more detail along with a simulation here: https://github.com/felixge/go-profiler-notes/blob/main/block.md#simulation--proposal-for-improvement

The downside to this approach is that it obscures the "average event duration", which can currently be determined by dividing the delay/nanoseconds value by the contentions/count value in the profile. I have a few ideas around this that I'd be happy to discuss.

Additionally, I propose improving the documentation for the block profiler to give people a better idea of what the rate value means, and perhaps also making some recommendations about overhead.

I'm happy to work on all of the above and submit the patches for it.

@seankhliao seankhliao changed the title block profiler: fix sampling bias runtime/pprof: block profiler: fix sampling bias Feb 10, 2021
@seankhliao seankhliao added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Feb 10, 2021
felixge (Contributor, Author) commented Feb 24, 2021

@seankhliao would it make sense to add the label "proposal" for this? I'm afraid it might not be reviewed otherwise.

seankhliao (Member) commented:
cc @hyangah

hyangah (Contributor) commented Feb 25, 2021

Thank you for the detailed explanation and writeup.

cc @dvyukov (blockprofile author), @pjweinb (mutexprofile author), @rsc (heap profile author) - all three profiles share similar sampling logic.

Reducing the sampling bias sounds great. I have very limited experience with the block profile, so I want to learn more about the goal here. In your motivating example scenario, why is it important to present the two kinds of events equally? If one type of event usually blocks 10x longer on average, isn't it more interesting and worth more attention?

felixge (Contributor, Author) commented Feb 25, 2021

@hyangah thx.

all three profiles share similar sampling logic.

It's similar, but I think all three of them use slightly different approaches:

  • Heap profiler: uses an internal counter to accumulate bytes allocated. Any event causing the counter to exceed MemProfileRate gets sampled and resets the counter. Very simplified explanation; see runtime/malloc.go for full details.
  • Block profiler: samples events with a probability of min(1, duration/rate); see the sketch after this list. In code review @dvyukov mentioned that this was intended to be similar to MemProfileRate.
  • Mutex profiler: the simplest sampling implementation. It doesn't take the event duration into account at all and just samples events with a probability of 1/rate.
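
For illustration, here is a minimal sketch of the block profiler's sampling decision as described above. This is my own simplification, not the actual runtime code: blockSampled is a made-up name and math/rand stands in for the runtime's internal random source.

package main

import (
	"fmt"
	"math/rand"
)

// blockSampled reports whether a block event of the given duration (in
// cycles) would be recorded at the given profiling rate. Events at least
// as long as rate are always kept; shorter events are kept with
// probability cycles/rate.
func blockSampled(cycles, rate int64) bool {
	if rate <= 0 {
		return false
	}
	return cycles >= rate || rand.Int63n(rate) < cycles
}

func main() {
	const rate = 1000
	kept := 0
	for i := 0; i < 1000000; i++ {
		if blockSampled(100, rate) { // short events: kept ~10% of the time
			kept++
		}
	}
	fmt.Printf("kept %d of 1000000 short events\n", kept)
}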

Reducing the sampling bias sounds great. I have very limited experience with the block profile, so I want to learn more about the goal here. In your motivating example scenario, why is it important to present the two kinds of events equally? If one type of event usually blocks 10x longer on average, isn't it more interesting and worth more attention?

Great point. I think it depends : ). I could imagine a scenario where you have two producers sending work into two separate channels, sometimes getting blocked while doing so. I'd expect the block profiler to be able to tell me which of those two producers is spending more time blocked than the other. Given the current bias, it could give me the wrong answer to that question.

Or, from another point of view, imagine if the heap profiler were biased towards big allocations (I hope it isn't; I still need to analyze it some more). It could make you chase a few big allocation sources even though they may contribute less overall garbage than thousands of small allocations that the profiler is telling you to ignore by under-representing them.

Anyway, I'd be okay if the bias is here to stay. But if it is, I think it's worth documenting, which I'd be happy to send patches for.

dvyukov (Member) commented Feb 25, 2021

Interesting. I can see how this change makes sense:
felixge/go@master...debias-blockprofile-rate#diff-4af3428aab976e43f655b4e1a35c92a608a92294609190aa42378d2d265d501cR414

Have you looked at what's done in tcmalloc?
https://github.com/google/tcmalloc/blob/master/tcmalloc/sampler.cc#L168-L177
The logic there was given a lot of thought and it passed several rounds of de-biasing already :)
At least for heap sampling it would be good to mimic exactly the same logic I think. Maybe for other samplers as well.

felixge (Contributor, Author) commented Feb 25, 2021

Interesting. I can see how this change makes sense:

Cool : ). Do you think the fact that it obscures the mean of the samples (total_delay / count) is a problem? It's kind of a nice thing to be able to compute, but OTOH I'm not sure if anybody ever looks at it.

Have you looked at what's done in tcmalloc?

Not yet, thanks for the link!

At least for heap sampling it would be good to mimic exactly the same logic I think. Maybe for other samplers as well.

If the current heap sampling could be improved, that'd be awesome. But I'll need some time to fully wrap my head around the current Go implementation as well as tcmalloc's sampler before I could contribute anything useful to that discussion. If you're aware of any good design documents or literature discussing heap sampling techniques, please let me know.

dvyukov (Member) commented Feb 25, 2021

Do you think the fact that it obscures the mean of the samples (total_delay / count) is a problem?

Lies in profiles are definitely bad. If I see any numbers that look like a lie, I may throw away the whole profile because I don't know what I can trust there.
Hmm... could we make blockRecord.count a float and record the real number of cycles and rate/cycles samples? It should fix the total but preserve the average.

If you're aware of any good design documents or literature discussing heap sampling techniques, please let me know.

I am not aware of anything besides comments in sampler.h:
https://github.com/google/tcmalloc/blob/master/tcmalloc/sampler.h#L33

felixge (Contributor, Author) commented Feb 25, 2021

Lies in profiles are definitely bad. If I see any numbers that look like a lie, I may throw away the whole profile because I don't know what I can trust there.

👍 I love this perspective / mindset.

Hmm... could we make blockRecord.count a float and record the real number of cycles and rate/cycles samples? It should fix the total but preserve the average.

This sounds interesting, but I don't think I understand it. Can you explain it differently?

dvyukov (Member) commented Feb 25, 2021

Hmm... could we make blockRecord.count a float and record the real number of cycles and rate/cycles samples? It should fix the total but preserve the average.

This sounds interesting, but I don't think I understand it. Can you explain it differently?

In saveblockevent we do:

func saveblockevent(cycles int64, skip int, which bucketType) {
	...
	b.bp().count++
	b.bp().cycles += cycles
	...
}

Your change increases cycles, but count is still incremented by 1. This makes the average incorrect.
My idea was to do something like:

	b.bp().count += rate/cycles
	b.bp().cycles += rate

This increases cycles, but also proportionally increases count.
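
With rate = 100, for instance, a sampled event of 25 cycles would add rate/cycles = 4 to count and rate = 100 to cycles, so its contribution to the mean is 100/4 = 25 cycles, matching the real event duration.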

felixge (Contributor, Author) commented Feb 25, 2021

Edit: I updated this comment. You can still see the original below, but I've hidden it because it was confusing.

Old Comment (Please ignore)
	b.bp().count += rate/cycles
	b.bp().cycles += rate

I spent some time thinking about and playing around with this. I'm assuming your snippet above would be used instead of the patch I proposed? If that's the case, I think it will fix the sample mean issue, but flip the bias in total duration so that short frequent events are favored over long infrequent ones.

Example 1:

rate = 100
20 events with 500 cycles each
--->
count = 4 (100/500*20)
cycles = 2000 (100*20)
mean = 500 (2000/4)

Example 2:

rate = 100
10 events with 1000 cycles each
--->
count = 1 (100/1000*10)
cycles = 1000 (100*10)
mean = 1000 (1000/1)

As you can see, the total cycles in example 1 end up being twice as high as in example 2, even though both sets of events contribute the same amount of blocking.

Unless I'm missing something, I think the only way to fix the bias without skewing the sample mean is to either change the logic used for deciding which events get sampled, or to add the mean as its own value type in the profile. But maybe there is an elegant solution I'm overlooking.


FWIW, I also played with a little simulation of your suggestion. Below you can see the impact of the rate on an application that produces three distinct kinds of block events a, b, c with population means of 1000ns, 2000ns and 3000ns respectively.

[Plot: simulated profile results for a, b and c at various rates with the suggested change applied]

And this is how things look right now before applying any patches:

[Plot: the same simulation with the current, unpatched block profiler]

  b.bp().count += rate/cycles
  b.bp().cycles += rate

@dvyukov it took me a bit to figure out whether you were suggesting this change in addition to my patch or instead of it. Neither combination seemed to work. However, I think I finally figured out that your idea can be applied by a new patch that simply does this inside saveblockevent. At least in my testing it seems to fix the bias and produce a sample mean that converges on the population mean.

if cycles < rate {
	// Short events are sampled with probability cycles/rate, so scale
	// each sample up to represent rate/cycles events totaling rate
	// cycles (this assumes count becomes a float, as suggested above).
	b.bp().count += rate/cycles
	b.bp().cycles += rate
} else {
	// Events at least as long as the rate are always sampled and
	// recorded as-is.
	b.bp().count += 1
	b.bp().cycles += cycles
}
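
Applied to the two examples above with rate = 100: in example 1, all 20 events of 500 cycles take the else branch, giving count = 20, cycles = 10000, mean = 500; in example 2, the 10 events of 1000 cycles give count = 10, cycles = 10000, mean = 1000. Both totals match the real 10000 cycles of blocking and both means match the real event durations. For events shorter than rate, which are only sampled with probability cycles/rate, the first branch scales each sample up so that the expected total and the per-sample mean still come out right.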

I'll think this over again tomorrow when it's not late, but I'd be happy to send a patch + tests for this if you agree. Let me know.

dvyukov (Member) commented Feb 26, 2021

Yes, I meant in combination with your patch. Something similar to the snippet you posted. I just did not elaborate enough, sorry.

felixge (Contributor, Author) commented Feb 26, 2021

Yes, I meant in combination with your patch. Something similar to the snippet you posted. I just did not elaborate enough, sorry.

No worries, in retrospect I think it was clear enough, but my brain didn't want to cooperate 🙈. I'll try to prepare a patch for this now to continue the discussion.

felixge added a commit to felixge/go that referenced this issue Feb 26, 2021
Block profiles were biased towards infrequent long events over frequent
short events. This fix corrects the bias by aggregating shorter events
as longer but less frequent in the profiles. As a result their
cumulative duration will be accurately represented in the profile
without skewing their sample mean (duration/count).

Credit to @dvyukov for suggesting to adjust the count in the
saveblockevent function.

Fixes golang#44192.
hyangah (Contributor) commented Feb 26, 2021

I see that the heap profiler's sampling is more similar to what the tcmalloc sampler does: it keeps an internal counter to track the next sampling point instead of tossing a coin for every event. Is it an option to change the block profile sampler to be more like what the current heap profiler and tcmalloc sampler are doing?
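
For context, here is a minimal sketch of that counter-based approach. The names (counterSampler, record) are made up and the details differ from both the Go heap profiler and tcmalloc; it only illustrates the idea of counting down to a randomized next-sample threshold instead of flipping a coin per event.

package main

import (
	"fmt"
	"math/rand"
)

// counterSampler samples on average once per rate units of "weight"
// (bytes allocated, cycles blocked, ...).
type counterSampler struct {
	rate      float64
	remaining float64
}

func newCounterSampler(rate float64) *counterSampler {
	s := &counterSampler{rate: rate}
	s.reset()
	return s
}

// reset draws the next sampling threshold from an exponential
// distribution with mean rate, as in Poisson sampling, so that the
// expected sampled weight stays unbiased.
func (s *counterSampler) reset() {
	s.remaining = rand.ExpFloat64() * s.rate
}

// record consumes weight units and reports whether this event crossed
// the sampling threshold and should be recorded.
func (s *counterSampler) record(weight float64) bool {
	s.remaining -= weight
	if s.remaining > 0 {
		return false
	}
	s.reset()
	return true
}

func main() {
	s := newCounterSampler(1000)
	sampled := 0
	for i := 0; i < 1000000; i++ {
		if s.record(10) { // 1e6 events of weight 10 => roughly 10000 samples
			sampled++
		}
	}
	fmt.Println("samples:", sampled)
}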

felixge (Contributor, Author) commented Feb 26, 2021

@hyangah I think the main concern there would be doing this in a way that doesn't fundamentally impact the overhead of block profiling or change the semantics of the blockprofilerate in a way that will lead to less accurate profiles. But if there is consensus that this would be a better direction, I'd be happy to give it a shot.

Meanwhile I got a patch ready based on @dvyukov's suggestions here: felixge@a3dc518

But I'll hold off on submitting it for now. I also still have to sort out some CLA stuff on my end.

felixge added a commit to felixge/go that referenced this issue Mar 9, 2021
Block profiles were biased towards infrequent long events over frequent
short events. This fix corrects the bias by aggregating shorter events
as longer but less frequent in the profiles. As a result their
cumulative duration will be accurately represented in the profile
without skewing their sample mean (duration/count).

Credit to @dvyukov for suggesting to adjust the count in the
saveblockevent function.

Fixes golang#44192.

Change-Id: I71a99d7f6ebdb2d484d44890a2517863cceb4004
gopherbot commented:
Change https://golang.org/cl/299991 mentions this issue: runtime/pprof: fix block profile bias

felixge (Contributor, Author) commented Mar 9, 2021

@hyangah @dvyukov I just submitted my current patch. I think it's a decent improvement for now.

However, I'd also be happy to look into aligning the sampling with heap/tcmalloc in a follow-up patch if that's something you're interested in.

@dmitshur dmitshur added this to the Go1.17 milestone Jun 4, 2021
@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Jun 4, 2021
@golang golang locked and limited conversation to collaborators Jun 4, 2022