
runtime: API for unstable metrics #37112

Closed
mknyszek opened this issue Feb 7, 2020 · 22 comments

Comments

@mknyszek
Contributor

mknyszek commented Feb 7, 2020

Proposal: API for unstable runtime metrics

Background & Motivation

Today runtime metrics are exposed in two ways.

The first way is via the struct-based sampling APIs runtime.ReadMemStats and runtime/debug.ReadGCStats. These functions accept a pointer to a struct and then populate the struct with data from the runtime.

The problems with this type of API are:

  • Removing/renaming old metrics from the structs is impossible.
    • For example, MemStats.BySize is hard-coded to 61 size classes when there are currently 83. We cannot ever change BySize.
  • Adding implementation-specific metrics to the structs is discouraged, because it pollutes the API when inevitably they'll be deprecated.
  • runtime.ReadMemStats has a global effect on the application because it forces a STW. This has a direct effect on latency. Being able to tease apart which metrics actually require a STW gives users more control over performance.

The good things about this type of API are:

  • Protected by the Go 1 compatibility promise.
  • Easy for applications to ingest, use for their own purposes, or push to a metrics collection service or log.

The second way is via GODEBUG flags which emit strings containing metrics to standard error (e.g. gctrace, gcpacertrace, scavtrace).

The problems with this type of API are:

  • Difficult for an application to ingest because it must be parsed.
  • Format of the output is not protected by the Go 1 backwards compatibility promise.

The good things about this type of API are:

  • We can freely change it and add implementation-specific metrics.
  • We never have to live with bad decisions.

I would like to propose a new API which takes the best of both approaches.

Requirements

  • The API should be easily extendable with new metrics.
  • The API should be easily retractable, to deprecate old metrics.
    • Removing a metric should not break any Go applications as per the Go 1 compatibility promise.
  • The API should be discoverable, to obtain a list of currently relevant metrics.
  • The API should be rich, allowing a variety of metrics (e.g. distributions).
  • The API implementation should minimize CPU/memory usage, such that it does not appreciably
    affect any of the metrics being measured.
  • The API should include useful existing metrics already exposed by the runtime.

Goals

Given the requirements, I suggest we prioritize the following concerns, in order, when designing the API.

  1. Extensibility.
    • Metrics are “unstable,” so adding or removing metrics should always be a compatible change.
    • Since metrics will tend to be implementation-specific, this feature is critical.
  2. Discoverability.
    • Because these metrics are “unstable,” there must be a way for the application, and for the human writing the application, to discover the set of usable metrics and be able to do something useful with that information (e.g. log the metric).
    • The API should enable collecting a subset of metrics programmatically. For example, one might want to “collect all memory-related metrics” or “collect all metrics which are efficient to collect”.
  3. Performance.
    • Must have a minimized effect on the metrics it returns in the steady-state.
    • Should scale up to 100s of metrics, an amount that a human might consider “a lot.”
      • Note that picking the right types to expose can limit the number of metrics we need to expose. For example, a distribution type would significantly reduce the number of metrics.
  4. Ergonomics.
    • The API should be as easy to use as it can be, given the above.

Design

See full design document at https://golang.org/design/37112-unstable-runtime-metrics.

Highlights:

  • Expose a new sampling-based API in a new package, the runtime/metrics package.
  • Use string keys for each metric which include the unit of the metric in an easily-parseable format.
  • Expose a discovery API which provides metadata about each metric at runtime, such as whether it requires a STW and whether it's cumulative (counter as opposed to a gauge).
  • Add a Histogram interface to the package which represents a distribution.
  • Support for event-based metrics is discussed and left open, but considered outside the scope of this proposal.

Backwards Compatibility

Note that although the set of metrics the runtime exposes will not be stable across Go versions, the API to discover and access those metrics will be.

This proposal strictly increases the API surface of the Go standard library without changing any existing functionality, and is therefore Go 1 compatible.

@gopherbot gopherbot added this to the Proposal milestone Feb 7, 2020
@gopherbot

Change https://golang.org/cl/218677 mentions this issue: design/37112-unstable-runtime-metrics.md: add proposal

@mknyszek mknyszek added this to Incoming in Proposals (old) Feb 27, 2020
@rsc rsc changed the title from "proposal: API for unstable runtime metrics" to "proposal: runtime: API for unstable metrics" Mar 4, 2020
@rsc rsc moved this from Incoming to Active in Proposals (old) Mar 4, 2020
@rsc
Contributor

rsc commented Mar 11, 2020

@mknyszek, what's the status of this? Is the doc ready for review and discussion (if so please check it in)?

@mknyszek
Contributor Author

The doc is ready for review and discussion, I just don't have +2 or a review to land it yet. I'll try to ping more people to look at the CL.

gopherbot pushed a commit to golang/proposal that referenced this issue Mar 18, 2020
This change adds a proposal and design document for addition of a
runtime metrics package which is designed to support adding and removing
runtime metrics without breaking the Go 1 compatibility promise.

Updates golang/go#37112.

Change-Id: I550740970ad74c5c71d712735e4984808bdbf463
Reviewed-on: https://go-review.googlesource.com/c/proposal/+/218677
Reviewed-by: Austin Clements <austin@google.com>
@bcmills
Contributor

bcmills commented Mar 19, 2020

A couple of high-level comments:

  1. The Histogram interface seems too abstract. The user will need to know the space of possible concrete types anyway, and won't be able to use a given Histogram without knowing its concrete element type. Instead, I would be inclined to provide concrete Float64Histogram and TimeHistogram types — and then you don't need to worry about whether you can add methods to the interface in the future.

@bcmills
Contributor

bcmills commented Mar 19, 2020

  1. Should the Metric type include type information about the metric values? That would allow the user to skip reading metrics that they won't be able to interpret anyway.

Actually, perhaps the Metric should include an allocation function? That would make the allocation strategy a bit less implicit, and might reduce the risk of aliasing bugs in user code.

@mknyszek
Contributor Author

A couple of high-level comments:

  1. The Histogram interface seems too abstract. The user will need to know the space of possible concrete types anyway, and won't be able to use a given Histogram without knowing its concrete element type. Instead, I would be inclined to provide concrete Float64Histogram and TimeHistogram types — and then you don't need to worry about whether you can add methods to the interface in the future.

Given that we probably need to copy data out of the runtime anyway, I think you're right that it makes sense to just have concrete types.

  1. Should the Metric type include type information about the metric values? That would allow the user to skip reading metrics that they won't be able to interpret anyway.

Actually, perhaps the Metric should include an allocation function? That would make the allocation strategy a bit less implicit, and might reduce the risk of aliasing bugs in user code.

I considered including type information in the Metric in order to let users filter out metrics they don't understand, and I'm certainly not opposed. An allocation function is also a neat idea; it's one less potential footgun, though it does mean more setup work for the user.

@bcmills
Contributor

bcmills commented Mar 19, 2020

It may also be worthwhile to have Read report an explicit error if a requested metric does not exist.

(I could imagine that someone might try to call Read with a hard-coded list of metrics, and then the user code may produce an obscure panic when trying to read the unpopulated value.)

@rsc
Contributor

rsc commented Apr 1, 2020

I know there are minor details still being worked out, like whether Read returns an error in some cases. Other than those, it seems like people are generally in favor of this proposal.

Am I reading this wrong? Does anyone object to accepting this proposal?

@rsc
Contributor

rsc commented Apr 8, 2020

OK, then based on the discussion above this seems like a likely accept.

@rsc rsc moved this from Active to Likely Accept in Proposals (old) Apr 8, 2020
@rsc
Contributor

rsc commented Apr 15, 2020

No change in consensus, so accepted.

@rsc rsc moved this from Likely Accept to Accepted in Proposals (old) Apr 15, 2020
@rsc rsc modified the milestones: Proposal, Backlog Apr 15, 2020
@rsc rsc changed the title from "proposal: runtime: API for unstable metrics" to "runtime: API for unstable metrics" Apr 15, 2020
@mknyszek mknyszek self-assigned this Apr 24, 2020
@gopherbot

Change https://golang.org/cl/247041 mentions this issue: runtime,runtime/metrics: add memory metrics

@gopherbot

Change https://golang.org/cl/247040 mentions this issue: runtime/metrics: add package interface

@gopherbot

Change https://golang.org/cl/247045 mentions this issue: runtime,runtime/metrics: add object size distribution metrics

@gopherbot

Change https://golang.org/cl/247049 mentions this issue: runtime,runtime/metrics: export goroutine count as a metric

@gopherbot

Change https://golang.org/cl/247047 mentions this issue: runtime,runtime/metrics: add metric for distribution of GC pauses

@gopherbot

Change https://golang.org/cl/247046 mentions this issue: runtime: add timeHistogram type

@gopherbot

Change https://golang.org/cl/247043 mentions this issue: runtime,runtime/metrics: add heap object count metric

@gopherbot

Change https://golang.org/cl/247044 mentions this issue: runtime,runtime/metrics: add heap goal and GC cycle metrics

@mknyszek
Contributor Author

It may also be worthwhile to have Read report an explicit error if a requested metric does not exist.

(I could imagine that someone might try to call Read with a hard-coded list of metrics, and then the user code may produce an obscure panic when trying to read the unpopulated value.)

Oops, I totally forgot about this. Luckily, the case you're worried about came up in the intervening months, and I addressed it by making the resulting Value have Kind KindBad if a metric doesn't exist. It's not an explicit error value returned by Read, but it does signal the issue. (It seems right to set this anyway, even if we also decide to return an explicit error value.) WDYT? @bcmills

gopherbot pushed a commit that referenced this issue Oct 26, 2020
This change creates the runtime/metrics package and adds the initial
interface as laid out in the design document.

For #37112.

Change-Id: I202dcee08ab008dd63bf96f7a4162f5b5f813637
Reviewed-on: https://go-review.googlesource.com/c/go/+/247040
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Trust: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
gopherbot pushed a commit that referenced this issue Oct 26, 2020
This change adds support for a variety of runtime memory metrics and
contains the base implementation of Read for the runtime/metrics
package, which lives in the runtime.

It also adds testing infrastructure for the metrics package, and a bunch
of format and documentation tests.

For #37112.

Change-Id: I16a2c4781eeeb2de0abcb045c15105f1210e2d8a
Reviewed-on: https://go-review.googlesource.com/c/go/+/247041
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Trust: Michael Knyszek <mknyszek@google.com>
gopherbot pushed a commit that referenced this issue Oct 26, 2020
For #37112.

Change-Id: Idd3dd5c84215ddd1ab05c2e76e848aa0a4d40fb0
Reviewed-on: https://go-review.googlesource.com/c/go/+/247043
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Trust: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
gopherbot pushed a commit that referenced this issue Oct 26, 2020
This change adds three new metrics: the heap goal, GC cycle count, and
forced GC count. These metrics are identical to their MemStats
counterparts.

For #37112.

Change-Id: I5a5e8dd550c0d646e5dcdbdf38274895e27cdd88
Reviewed-on: https://go-review.googlesource.com/c/go/+/247044
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Trust: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
gopherbot pushed a commit that referenced this issue Oct 26, 2020
This change adds metrics for the distribution of objects allocated and
freed by size, mirroring MemStats' BySize field.

For #37112.

Change-Id: Ibaf1812da93598b37265ec97abc6669c1a5efcbf
Reviewed-on: https://go-review.googlesource.com/c/go/+/247045
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Trust: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
gopherbot pushed a commit that referenced this issue Oct 26, 2020
This change adds a concurrent HDR time histogram to the runtime with
tests. It also adds a function to generate boundaries for use by the
metrics package.

For #37112.

Change-Id: Ifbef8ddce8e3a965a0dcd58ccd4915c282ae2098
Reviewed-on: https://go-review.googlesource.com/c/go/+/247046
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Trust: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
gopherbot pushed a commit that referenced this issue Oct 26, 2020
For #37112.

Change-Id: Ibb0425c9c582ae3da3b2662d5bbe830d7df9079c
Reviewed-on: https://go-review.googlesource.com/c/go/+/247047
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Trust: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
gopherbot pushed a commit that referenced this issue Oct 26, 2020
For #37112.

Change-Id: I994dfe848605b95ef6aec24f53869e929247e987
Reviewed-on: https://go-review.googlesource.com/c/go/+/247049
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Trust: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
@bcmills
Contributor

bcmills commented Nov 3, 2020

KindBad seems reasonable to me.

@narqo
Contributor

narqo commented Dec 18, 2020

Would it be possible to update the proposal and the documentation describing the relationship between expvar and runtime/metrics?

expvar already exposes runtime.MemStats in a predefined format. Some projects use expvar as a vendor-neutral layer for exposing their own metrics, and there is third-party tooling that pulls the data from expvar into different metrics collectors.

It now feels confusing that the standard library has two ways of exposing its internal stats.

@mknyszek
Contributor Author

mknyszek commented Jan 6, 2021

@narqo Yeah, I'm happy to update both the proposal and the documentation. I'll make sure it happens before the release.

The short answer to the relationship between the two is that runtime/metrics is low-level, and intended to be a replacement for ReadMemStats. ReadMemStats has problems in terms of exposing statistics that are useful for both debugging and evolving the runtime. Future evolution of runtime/metrics could also involve "push" metrics ("time series data") instead of just "pull" metrics ("sampled data"), which doesn't quite fit with expvar's model, so I think having a separate, underlying package still makes sense.

Another point here is that despite its use of string keys, runtime/metrics is actually a much lower-level API that puts a lot of focus on efficiency at the price of usability. AFAICT, that isn't really the case with expvar (despite some performance improvements I see were made a few years ago).

I filed another issue to try to figure out how runtime/metrics should be exported through expvar (#43555) because I am honestly not 100% sure. I think I'm personally slightly leaning toward re-exporting runtime/metrics up through expvar as one big blob, but I have some concerns with just doing that.

Thank you for your input here, and please feel free to start/continue the discussion on #43555.
