
net/http/pprof: add seconds= to mutex/block profile handler #23401

Closed
lmb opened this issue Jan 10, 2018 · 30 comments
Labels: FrozenDueToAge, help wanted, NeedsFix (The path to resolution is known, but the work has not been done.), Proposal, Proposal-Accepted

Comments

@lmb
Contributor

lmb commented Jan 10, 2018

Using the mutex and block profiles from net/http/pprof currently requires modifying the application to call runtime.SetBlockProfileRate and runtime.SetMutexProfileFraction. Since it's not clear what the performance impact of enabling these profiles is, few people will actually do it.

I'd like to add the following parameters to make these two profiles easier to use:

/debug/pprof/block?seconds=1&rate=1
/debug/pprof/mutex?seconds=1&frac=1

If the seconds parameter isn't specified, the endpoints behave as they currently do. If it is specified, the block profile rate or mutex profile fraction is set (default 1) and the handler sleeps for the specified duration. At the end of the duration the profile is written and the original profile rate / fraction is restored.

This proposal requires changing SetBlockProfileRate to return the previous rate, similar to SetMutexProfileFraction.
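
To make the intended behaviour concrete, here is a rough sketch of the block handler. The handler name, listen address, and wiring are illustrative only; note that the sketch turns profiling off afterwards instead of restoring the previous rate, because the current SetBlockProfileRate API doesn't return it (which is exactly the change proposed above).

```go
// Rough, illustrative sketch of the proposed /debug/pprof/block behaviour:
// temporarily enable the block profile, sleep for ?seconds=, write the
// profile, then disable profiling again.
package main

import (
	"log"
	"net/http"
	"runtime"
	"runtime/pprof"
	"strconv"
	"time"
)

func blockHandler(w http.ResponseWriter, r *http.Request) {
	sec, _ := strconv.Atoi(r.FormValue("seconds"))
	if sec <= 0 {
		// No seconds param: behave exactly as the endpoint does today.
		pprof.Lookup("block").WriteTo(w, 0)
		return
	}
	rate, err := strconv.Atoi(r.FormValue("rate"))
	if err != nil || rate <= 0 {
		rate = 1 // proposed default
	}
	runtime.SetBlockProfileRate(rate)
	time.Sleep(time.Duration(sec) * time.Second)
	pprof.Lookup("block").WriteTo(w, 0)
	// Today's API doesn't return the previous rate, so this sketch simply
	// turns block profiling back off instead of restoring the old value.
	runtime.SetBlockProfileRate(0)
}

func main() {
	http.HandleFunc("/debug/pprof/block", blockHandler)
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```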

@gopherbot gopherbot added this to the Proposal milestone Jan 10, 2018
@rsc
Contributor

rsc commented Jan 22, 2018

The profiles are reporting things that happened earlier in the program. The rate change should really happen once, at program startup. Otherwise if you hit the URL twice with different rates the second will report the first one's samples but scaled badly. Or do we now subtract inside the handler?

The defaults are zero, which is admittedly not very useful. Maybe we should fix that instead.

/cc @hyangah @pjweinbgo

@hyangah
Contributor

hyangah commented Jan 22, 2018

Yes, the current mutex/block profiling implementation doesn't work well with rate changes.

Inside Google, we implemented custom handlers that take a 'seconds' parameter and return the delta, and they are served from separate endpoints -- the additional parameter changes the semantics of the data so much that we thought they were better served from different endpoints.

BTW, I don't see code for scaling samples using the sampling rate, and that is a bug
(https://github.com/golang/go/blob/master/src/runtime/pprof/pprof.go#L357). Is it too late for 1.10?

@ianlancetaylor
Contributor

It's not absolutely too late, since this is a new feature, but it's awfully late.

@lmb
Contributor Author

lmb commented Jan 24, 2018

I might be mistaken, but it looks like at least the mutex profile does scale samples (but ignores rate changes)?

It seems there are two options:

  1. Enable block and mutex profiling by default. But what is the correct number?
  2. Make current code work with rate changes and implement new (?) endpoints.

The former is a simple code change but needs more investigation to pick the right number; the latter is the reverse. I'd like to have mutex and block profiling work out of the box, so I'm happy to work on both.

@hyangah
Contributor

hyangah commented Jan 24, 2018

  1. support for difference reporting (seconds parameter): I think this is reasonable, but first consider whether this can be implemented from the client side, i.e. capturing the profiles twice and then asking pprof to report the difference (-base flag), as sketched below. Would that be too inconvenient?

  2. support for changing the sampling rate on the fly: this is difficult; the rates are not meant to be changed.

  3. default to non-zero values: can anyone add measurements and benchmarks?
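
For item 1, a rough sketch of what the client side could look like; the host, file names, and sleep interval here are arbitrary examples, and the actual diff is then done with `go tool pprof -base before.pb.gz after.pb.gz`:

```go
// Sketch: capture the mutex profile twice from the existing endpoint so
// pprof can report the difference with its -base flag afterwards.
package main

import (
	"io"
	"net/http"
	"os"
	"time"
)

func fetch(url, path string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	const url = "http://localhost:6060/debug/pprof/mutex"
	if err := fetch(url, "before.pb.gz"); err != nil {
		panic(err)
	}
	time.Sleep(30 * time.Second) // the window of interest
	if err := fetch(url, "after.pb.gz"); err != nil {
		panic(err)
	}
	// Then: go tool pprof -base before.pb.gz after.pb.gz
}
```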

(fyi, the bug I mentioned in my previous comment is fixed now (for mutex profile) and is not relevant to this proposal.)

@rsc
Contributor

rsc commented Jan 29, 2018

Maybe if they are off, then seconds=XXX rate=XXX can turn them on and send a diff back?
We should probably add the diff code for Go 1.11 anyway.

/cc @hyangah.

@hyangah
Contributor

hyangah commented Jan 29, 2018

No objection. Just that once the handler is hit with the rate=XXX param, we may want to make it clear that the output from the default /debug/pprof/mutex endpoint is no longer interpretable.

@lmb
Contributor Author

lmb commented Jan 30, 2018

I probably misunderstand how the whole profile machinery works under the hood. My initial proposal wasn't to enable differential profiling, but to make the mutex and block profiles usable out of the box. My idea was to just toggle them on/off on an as-needed basis. That was naive, for all the reasons you have pointed out.

As an application developer, I have no idea what the cost of enabling these profiles is. Should I sample one in a thousand events? One in a hundred? The documentation doesn't help here. That's what I would like to solve.

Re finding a default value, I tried following @hyangah's suggestion to run a bunch of benchmarks (from https://github.com/golang/go/wiki/Benchmarks) and see how they change, but didn't get far due to noisy results. Is there a better corpus I could use?
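
For anyone else reading along, part of the difficulty is that the two knobs don't even use the same units; the values below are examples only, not recommendations:

```go
// The two runtime knobs discussed here have different meanings:
// SetMutexProfileFraction(n) reports on average 1/n mutex contention
// events, while SetBlockProfileRate(n) aims to sample about one blocking
// event per n nanoseconds spent blocked. Values are examples only.
package main

import "runtime"

func main() {
	// Mutex profile: sample roughly 1 in 100 contention events (~1%).
	runtime.SetMutexProfileFraction(100)

	// Block profile: sample about one blocking event per microsecond
	// of time spent blocked.
	runtime.SetBlockProfileRate(1000)
}
```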

@rsc
Contributor

rsc commented Feb 5, 2018

@hyangah what do you think we should do here to make these profiles easy to use and helpful to developers?

@hyangah
Contributor

hyangah commented Feb 8, 2018

How about resetting the profile counters if the sampling period is changed?

Since we migrated all these profiles to the protobuf format, the output can include information like the collection time and the duration. For legacy-format output, we can emit the duration as a comment.

The /debug/pprof/{mutex,block} endpoints keep reporting the counters accumulated since the current sampling rate was set.
The new rate=XXX param changes the sampling rate.

After talking with @pjweinbgo I became less optimistic about choosing non-zero default values. Their impact varies a lot depending on the application. For example, benchmarks that don't involve many mutexes will not be affected.

@lmb
Contributor Author

lmb commented Feb 15, 2018

Sounds good to me. Would you add the seconds parameter as well?

P.S. As I said, I'd like to take a stab at this, if you don't mind, @hyangah? I'd probably split this into two CLs:

  • Reset profile data on rate change
  • Changes to net/http/pprof to add new parameters

@gopherbot

Change https://golang.org/cl/94335 mentions this issue: runtime: reset block, mutex profile counters after rate change

@hyangah
Contributor

hyangah commented Feb 15, 2018

There is concern about exposing the rate-change functionality in production, especially since the change can affect server performance and the service owner may have already tuned the parameters for the service. I agree that's a dangerous feature.

@aalexand proposed an alternative that adds the 'seconds' parameter.

  • If 'seconds' param is given, report diff.
  • If the rate is 0 and the 'seconds' param is given, set the rate to a predefined value (maybe 100 for the mutex profile), report the diff after the given number of seconds, and reset the rate back to 0.
  • If the rate is 0 and 'seconds' param is not given, return zero output (the current behavior)

@lmb, does this still meet your use case?

@lmb
Contributor Author

lmb commented Feb 16, 2018

There is concern about exposing the rate change functionality in production. Especially when the change can affect the server performance [...]

If I understand correctly, the worry is that an operator would set a value that slows down a production service and causes degradation? In my mind, accessing the pprof endpoints is a privileged operation, so I expect the operator to know what they are doing and take responsibility. The new proposal feels like we are trying to blunt the tool because we don't trust the developer.

If the rate is 0 and 'seconds' param is given, set the rate for a predefined value (maybe 100 for mutex profile)

I think this is a catch-22: again we do not know what the right value for different services is, which is why it makes sense to add that argument in the first place.

@aalexand
Contributor

In my mind accessing the pprof endpoints is a privileged operation.
we do not know what the right value for different services is. Which is why it makes sense to add that argument in the first place.

The Go team should decide, but no existing debug/pprof handler allows changing the profiling configuration parameters, so introducing that capability needs serious justification IMO. If you want a different mutex profiling rate for different services, you could enable it in main() with the rate that works best. Giving a way to experiment with different rate values on a production service sounds like a bad idea to me. You can also implement your own HTTP handler with such a parameter and call SetMutexProfileFraction before delegating to the default handler.

Curious: do you have examples of when SetMutexProfileFraction(1) didn't work for you and why? I'd naively expect that if a given contention is not visible in the 1% sample, it's not worth looking at, but I might be missing something.

@hyangah
Contributor

hyangah commented Feb 20, 2018

@lmb It is common to have multiple levels of privileged operations in production systems: one for the full privilege to modify the production system, and another for a subset of privileged operations such as read-only access for monitoring/debugging or temporary mutation ops for probing or profiling. Maybe you want to grant the second group of privileges to a common monitoring service, or to a semi-trusted group of people who help debug production issues, but you still won't want to give out the full privileges of your server.

The current /debug/pprof endpoints are operating with the second privilege group, so exposing the rate change feature through the endpoints doesn't seem like a good idea.

FYI, most Go servers in Google have been running with a 1% mutex profile rate (i.e. SetMutexProfileFraction(100)) for a while. It's low enough to avoid regressions but high enough to capture interesting events. That seems like a good candidate for the default rate when profiling is temporarily enabled by the seconds param.
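
Enabling that at startup is a one-liner; a minimal sketch (the listen address is just an example):

```go
// Minimal sketch: enable mutex profiling at startup with the ~1%
// fraction mentioned above (1 in 100 contention events sampled) and
// expose the standard pprof endpoints.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
	"runtime"
)

func main() {
	runtime.SetMutexProfileFraction(100)
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```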

@lmb
Contributor Author

lmb commented Feb 27, 2018

Go team should decide, but: no existing debug/pprof handler allows changing the profiling configuration parameters, so introducing that capability needs serious justification IMO.

Is your worry changing the parameters, or crippling the service? Tracing a sufficiently busy service via pprof/trace will already degrade it today. How is that different?

Curious: do you have examples of when SetMutexProfileFraction(1) didn't work for you and why?

No, because I haven't gotten to that stage. I need to enable mutex profiling to see its effect, but don't have a way of evaluating it without making a release.

The current /debug/pprof endpoints are operating with the second privilege group [...]

Similar argument: I don't think pprof endpoints were / are written with that in mind, but maybe I'm wrong.

FYI most Go servers in Google have been running with 1% mutex profile rate (i.e. SetMutexProfileFraction(100)) for a while.

That's useful to know, thanks! I'll just bite the bullet, enable it at that level, and see what the impact is.

@ianlancetaylor
Contributor

@lmb I'm unclear on the current status of this proposal. Are there further changes that you think would be useful in 1.11? Thanks.

@rsc
Contributor

rsc commented Jul 23, 2018

It looks like we have support for ?seconds= in /debug/pprof/profile and /debug/pprof/trace but not mutexprofile etc. Is the proposed support for seconds= in mutex (do two requests and diff them) planned to be added to the pprof tool itself or the HTTP endpoint? Either way, seconds= sounds fine. I'd rather avoid rate= if possible for the reasons given above.

Obviously we missed doing anything for Go 1.11.

@aalexand
Contributor

@rsc I think it would make sense to have it added to the HTTP endpoint for symmetry with /debug/pprof/profile and /debug/pprof/trace.

@rsc
Contributor

rsc commented Aug 6, 2018

@hyangah Do we already have code somewhere for profile diffs for mutex profiles and other cumulative profiles? If so, can we publish that in Go 1.12?

@rsc rsc changed the title proposal: net/http/pprof: Add seconds, rate and frac arguments to mutex/block profile net/http/pprof: add seconds= to mutex/block profile handler Aug 6, 2018
@rsc rsc modified the milestones: Proposal, Go1.12 Aug 6, 2018
@rsc rsc added the help wanted label Aug 6, 2018
@rsc
Contributor

rsc commented Aug 6, 2018

Note that we should still make the default behavior (seconds= empty string or unspecified) return the cumulative for-all-time counts.

@rsc rsc added the NeedsFix The path to resolution is known, but the work has not been done. label Aug 6, 2018
@hyangah
Contributor

hyangah commented Aug 6, 2018

@rsc yes, we do have code that is used internally.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Sep 14, 2018
This PR enables mutex profiling by default. See the following
for a detailed introduction to the profile type:
https://rakyll.org/mutexprofile/.

The mutex profile sampling rate is set at `.1%`. This is based
off of golang/go#23401 (comment),
which says that Google runs all of their Go servers in production with
the sampling rate of `1%`. I figured that it doesn't hurt to be an order
of magnitude more conservative than them, but I'm not against following
them directly.

I ran a few benchmarks on my laptop. With a mutex profile sample rate
of `100%` I observed a drop in throughput of `kv95 --concurrency=64`
of just about `1%`. This indicates that a sampling rate of `.1%` will
result in a negligible performance change.

The change also adds the new profile type to the debug page and the
pprof ui visualizer (it was actually already there, it just had no
samples by default).

Release note: None
craig bot pushed a commit to cockroachdb/cockroach that referenced this issue Sep 17, 2018
30233: pprof: enable Mutex profiling by default r=nvanbenschoten a=nvanbenschoten

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Sep 17, 2018
@hyangah
Contributor

hyangah commented Oct 17, 2018

The version of the implementation we use relies on github.com/google/pprof/profile for the diff computation. Currently, github.com/google/pprof/profile is vendored under src/cmd/vendor, and an old, partial copy of it is also in src/runtime/pprof/internal. Neither is accessible from src/net/http/pprof. Can we move the vendored upstream pprof somewhere accessible from all three places? (e.g. src/internal or src/vendor?)
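
For reference, the core of the delta computation is small once that package is importable; a sketch of the idea, not the exact internal code:

```go
// Sketch of the delta idea using github.com/google/pprof/profile:
// scale the earlier snapshot by -1 and merge it with the later one,
// which leaves only the samples accumulated in between.
package deltaprof

import "github.com/google/pprof/profile"

// delta returns p1 - p0 for cumulative profiles such as mutex or block.
// Note that Scale mutates p0 in place.
func delta(p0, p1 *profile.Profile) (*profile.Profile, error) {
	p0.Scale(-1)
	return profile.Merge([]*profile.Profile{p0, p1})
}
```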

@gopherbot

Change https://golang.org/cl/147598 mentions this issue: net/http/pprof: mutex, block profiles support seconds param

@yoheiueda

@hyangah do you have any updates? I am really looking forward to this change.

@andybons andybons modified the milestones: Go1.13, Go1.14 Jul 8, 2019
@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
@odeke-em
Member

Gentle ping @hyangah, I added some code review suggestions to your CL a couple of months ago, please take a look. Thank you!

@gopherbot

Change https://golang.org/cl/229537 mentions this issue: net/http/pprof: allow "seconds" parameters to most profiles

gopherbot pushed a commit that referenced this issue Apr 22, 2020
golang.org/cl/147598 added the support for delta computation for mutex
and block profiles. In fact, this delta computation makes sense for
other types of profiles.

For example, /debug/pprof/allocs?seconds=x will provide how much allocation
was made during the specified period. /debug/pprof/goroutine?seconds=x will
provide the changes in the list of goroutines. This also makes sense for
custom profiles.

Update #23401
Update google/pprof#526

Change-Id: I45e9073eb001ea5b3f3d16e5a57f635193610656
Reviewed-on: https://go-review.googlesource.com/c/go/+/229537
Run-TryBot: Hyang-Ah Hana Kim <hyangah@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Emmanuel Odeke <emm.odeke@gmail.com>
@josharian
Contributor

It may be too late—sorry—but I wonder whether instead of seconds=5 it’d be better to do time=5s (or any other duration).

@hyangah
Contributor

hyangah commented Apr 22, 2020

@josharian yes, I like your suggestion but 'seconds' preexisted for cpu profile and execution trace endpoints.

@golang golang locked and limited conversation to collaborators Apr 22, 2021