net/http/pprof: add seconds= to mutex/block profile handler #23401
Comments
The profiles are reporting things that happened earlier in the program. The rate change should really happen once, at program startup. Otherwise, if you hit the URL twice with different rates, the second will report the first one's samples but scaled badly. Or do we now subtract inside the handler? The defaults are zero, which is admittedly not very useful. Maybe we should fix that instead. /cc @hyangah @pjweinbgo
Yes, the current mutex/block profiling implementation doesn't work well with rate changes. Inside Google, we implemented custom handlers that take a 'seconds' parameter and return the delta, and they are served from separate endpoints -- the additional parameter changes the semantics of the data so much that we thought they were better served from different endpoints. BTW, I don't see code for scaling samples using the sampling rate, and that is a bug.
It's not absolutely too late, since this is a new feature, but it's awfully late.
I might be mistaken, but it looks like at least the mutex profile does scale samples (but ignores rate changes)? It seems there are two options:
The former is a simple change but takes more work, the latter vice versa. I'd like to have mutex and block profiling work out of the box, so I'm happy to work on both.
(fyi, the bug I mentioned in my previous comment is fixed now (for mutex profile) and is not relevant to this proposal.)
Maybe if they are off, then seconds=XXX&rate=XXX can turn them on and send a diff back? /cc @hyangah.
No objection. Just, once the handler is hit with a rate=XXX param, we may want to make it clear that the output from the default /debug/pprof/mutex endpoint is no longer interpretable.
I probably misunderstand how the whole profile machinery works under the hood. My initial proposal wasn't to enable differential profiling, but to make the mutex and block profiles usable out of the box. My idea was to just toggle them on/off on an as-needed basis. That was naive, for all the reasons you have pointed out.

As an application developer, I have no idea what the cost of enabling these profiles is. Should I sample one in a thousand events? One in a hundred? The documentation isn't helping in that case. That's what I would like to solve.

Re finding a default value: I tried following @hyangah's suggestion to run a bunch of benchmarks (from https://github.com/golang/go/wiki/Benchmarks) and see how they change, but didn't get far due to noisy results. Is there a better corpus I could use?
@hyangah what do you think we should do here to make these profiles easy to use and helpful to developers? |
How about resetting the profile counters if the sampling period is changed? Since we migrated all these profiles to the protobuf format, the output can include information like the collection time and the duration. For legacy-format output, we can emit the duration as a comment. The /debug/pprof/{mutex,block} endpoints keep reporting the counters accumulated since the current sampling rate was set.

After talking with @pjweinbgo I became less optimistic about choosing non-zero default values. Their impact varies a lot depending on the application. For example, benchmarks that don't involve many mutexes will not be affected.
Sounds good to me. Would you add the

P.S. As I said, I'd like to take a stab at this, if you don't mind @hyangah? I'd probably split this in two CLs:
Change https://golang.org/cl/94335 mentions this issue: |
There is concern about exposing the rate-change functionality in production, especially when the change can affect server performance and when the service owner has already tuned the parameters best for the service. I agree that's a dangerous feature. @aalexand proposed an alternative that adds a 'seconds' parameter.

@lmb, does this still meet your use case?
If I understand correctly, the worry is that an operator would set a value that slows down a production service and causes degradation? In my mind, accessing the pprof endpoints is a privileged operation, so I expect the operator to know what they are doing and take responsibility. The new proposal feels like we are trying to blunt the tool because we don't trust the developer.
I think this is a catch-22, since again we do not know what the right value for different services is, which is why it makes sense to add that argument in the first place.
The Go team should decide, but: no existing debug/pprof handler allows changing the profiling configuration parameters, so introducing that capability needs serious justification IMO. If you want a different mutex profiling rate for different services, you could enable it in main() with the rate that works best. Giving a way to experiment with different rate values on a production service sounds like a bad idea to me. You can also implement your own HTTP handler with such a parameter and call SetMutexProfileFraction prior to delegating to the default handler.

Curious: do you have examples of when SetMutexProfileFraction(1) didn't work for you, and why? I'd naively expect that if a given contention is not visible in the 1% sample, it's not worth looking at, but I might be missing something.
@lmb It is common to have multiple levels of privileged operations in production systems: one for full privilege to modify the production system, another for a subset of privileged operations such as read-only access for monitoring/debugging or temporary mutation ops for probing or profiling. Maybe you want to grant the second level of privilege to the common monitoring service, or to a semi-trusted group of people who help debug production issues, but you still won't want to give out the full privilege of your server. The current /debug/pprof endpoints operate with the second privilege level, so exposing the rate-change feature through the endpoints doesn't seem like a good idea.

FYI, most Go servers in Google have been running with a 1% mutex profile rate (i.e. SetMutexProfileFraction(100)) for a while. Low enough to avoid regression but high enough to capture interesting events. That seems to be a good candidate for the default rate when profiling is temporarily enabled by the seconds param.
Is your worry changing the parameters, or crippling the service? Tracing a sufficiently busy service via pprof/trace will degrade it right now. How is that different?
No, because I haven't gotten to that stage. I need to enable mutex profiling to see its effect, but don't have a way of evaluating it without making a release.
Similar argument: I don't think pprof endpoints were / are written with that in mind, but maybe I'm wrong.
That's useful to know, thanks! I'll just bite the bullet, enable it at that level and see what the impact is.
@lmb I'm unclear on the current status of this proposal. Are there further changes that you think would be useful in 1.11? Thanks. |
It looks like we have support for ?seconds= in /debug/pprof/profile and /debug/pprof/trace but not mutexprofile etc. Is the proposed support for seconds= in mutex (do two requests and diff them) planned to be added to the pprof tool itself or the HTTP endpoint? Either way, seconds= sounds fine. I'd rather avoid rate= if possible for the reasons given above. Obviously we missed doing anything for Go 1.11. |
@rsc I think it would make sense to have it added to the HTTP endpoint for symmetry with /debug/pprof/profile and /debug/pprof/trace. |
@hyangah Do we already have code somewhere for profiles diffs for mutex profiles and other cumulative profiles? If so can we publish that into Go 1.12? |
Note that we should still make the default behavior (seconds= empty string or unspecified) return the cumulative for-all-time counts. |
@rsc yes, we do have code used internally.
30233: pprof: enable Mutex profiling by default r=nvanbenschoten a=nvanbenschoten

This PR enables mutex profiling by default. See the following for a detailed introduction to the profile type: https://rakyll.org/mutexprofile/. The mutex profile sampling rate is set at `.1%`. This is based off of golang/go#23401 (comment), which says that Google runs all of their Go servers in production with a sampling rate of `1%`. I figured that it doesn't hurt to be an order of magnitude more conservative than them, but I'm not against following them directly. I ran a few benchmarks on my laptop. With a mutex profile sample rate of `100%` I observed a drop in throughput of `kv95 --concurrency=64` of just about `1%`. This indicates that a sampling rate of `.1%` will result in a negligible performance change. The change also adds the new profile type to the debug page and the pprof UI visualizer (it was actually already there, it just had no samples by default). ![screenshot_2018-09-14 unknown delay](https://user-images.githubusercontent.com/5438456/45562017-23d94400-b817-11e8-908e-2a11faff3128.png) Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
The version of implementation we use uses github.com/google/pprof/profile for diff computation. Currently, the |
Change https://golang.org/cl/147598 mentions this issue: |
@hyangah do you have any updates? I am really looking forward to this change. |
Gentle ping @hyangah, I added some code review suggestions to your CL a couple of months ago, please take a look. Thank you! |
Change https://golang.org/cl/229537 mentions this issue: |
golang.org/cl/147598 added the support for delta computation for mutex and block profiles. In fact, this delta computation makes sense for other types of profiles. For example, /debug/pprof/allocs?seconds=x will provide how much allocation was made during the specified period. /debug/pprof/goroutine?seconds=x will provide the changes in the list of goroutines. This also makes sense for custom profiles. Update #23401 Update google/pprof#526 Change-Id: I45e9073eb001ea5b3f3d16e5a57f635193610656 Reviewed-on: https://go-review.googlesource.com/c/go/+/229537 Run-TryBot: Hyang-Ah Hana Kim <hyangah@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Ian Lance Taylor <iant@golang.org> Reviewed-by: Emmanuel Odeke <emm.odeke@gmail.com>
It may be too late—sorry—but I wonder whether instead of seconds=5 it’d be better to do time=5s (or any other duration). |
@josharian yes, I like your suggestion but 'seconds' preexisted for cpu profile and execution trace endpoints. |
Using the mutex and block profiles from net/http/pprof currently requires modifying the application to call runtime.SetBlockProfileRate and runtime.SetMutexProfileFraction. Since it's not clear what the performance impact of enabling these profiles is, few people will actually do it.

I'd like to add the following parameters, to make these two profiles easier to use:

If the seconds parameter isn't specified, the endpoints behave as they currently do. If it is specified, the block profile rate or mutex profile fraction is set (default 1) and the handler sleeps for the specified duration. At the end of the duration the profile is written and the original profile rate / fraction is restored.

This proposal requires changing SetBlockProfileRate to return the previous rate, similar to SetMutexProfileFraction.