runtime/metrics: /memory/classes/heap/unused:bytes spikes #67019

Open
felixge opened this issue Apr 24, 2024 · 2 comments

@felixge
Contributor

felixge commented Apr 24, 2024

Go version

go1.22.1

Output of go env in your module/workspace:

N/A - many different prod environments

What did you do?

We're trying to build nice dashboards to expose runtime/metrics to all Datadog users. For this reason we started rolling out a package that collects these metrics across our fleet.

What did you see happen?

While graphing the data, we noticed occasional spikes in the memory metrics. Upon closer inspection, we discovered that all of these spikes were caused by the /memory/classes/heap/unused:bytes metric.

[Screenshot: dashboard graph of Go heap memory metrics showing occasional large spikes attributed to /memory/classes/heap/unused:bytes]

These spikes are pretty rare (e.g. 18 spikes in the last 24 hours across a very large fleet), but frequent enough to cause problems when building nice dashboards. The issue occurs across architectures (arm64, amd64), instance types, and hyperscalers without any clear pattern.

We suspect the large values are the result of an underflow in the runtime/metrics code:

out.scalar = uint64(in.heapStats.inHeap) - in.heapStats.inObjects

To investigate further we started logging the values (we internally use float64 for storage). Below are a few values we logged and their distance from math.MaxUint64 (assuming it's indeed an underflow we're seeing here).

value                 math.MaxUint64 - value
18446744073337960688  371652608
18446744073325609896  383950848
18446744073455472696  254150656
18446744073690885408  18751488
18446744073702264952  7352320
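
To illustrate the hypothesis, here is a minimal, hypothetical sketch (not the runtime's code; the numbers are made up) of how values like the ones above would arise if inObjects momentarily exceeded inHeap: the unsigned subtraction wraps around to just below math.MaxUint64 instead of going negative.

package main

import (
    "fmt"
    "math"
)

func main() {
    // Hypothetical numbers, assuming object accounting briefly runs
    // ~300 MiB ahead of the in-heap accounting.
    inHeap := uint64(2 << 30)     // e.g. 2 GiB of in-use heap spans
    inObjects := inHeap + 300<<20 // counted object bytes, ~300 MiB more
    unused := inHeap - inObjects  // uint64 subtraction wraps around
    // Prints a value just below math.MaxUint64 and a distance of ~300 MiB.
    fmt.Println(unused, uint64(math.MaxUint64)-unused)
}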

We also logged the values of all KindUint64 runtime metrics that were collected as part of the same metrics.Read() call. I've dumped the results into this sheet (apologies for the formatting).
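
For reference, a minimal sketch of the kind of collection loop described above (an approximation, not our actual collector code): sample every supported metric in a single metrics.Read call, then flag KindUint64 values suspiciously close to math.MaxUint64.

package main

import (
    "fmt"
    "math"
    "runtime/metrics"
)

func main() {
    // One sample slice covering all supported metrics, so a single
    // metrics.Read call yields a consistent snapshot.
    descs := metrics.All()
    samples := make([]metrics.Sample, len(descs))
    for i, d := range descs {
        samples[i].Name = d.Name
    }
    metrics.Read(samples)

    for _, s := range samples {
        if s.Value.Kind() != metrics.KindUint64 {
            continue
        }
        v := s.Value.Uint64()
        // Values within 4 GiB of math.MaxUint64 look like the wrapped-around
        // result of an unsigned underflow rather than real byte counts.
        if uint64(math.MaxUint64)-v < 4<<30 {
            fmt.Printf("suspicious: %s = %d (MaxUint64 - value = %d)\n",
                s.Name, v, uint64(math.MaxUint64)-v)
        }
    }
}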

What did you expect to see?

No spikes.

cc @mknyszek

@gopherbot added the compiler/runtime (Issues related to the Go compiler and/or runtime.) label Apr 24, 2024
@mknyszek
Contributor

Hm... those differences are so large (300+ MiB??) that it doesn't seem like a situation where we're transiently racy in some corner case (as is the usual case for accounting bugs). It's a bit difficult to explain this.

This is a shot in the dark, but is it possible these services are using GOEXPERIMENT=arenas? That's maybe the one area where I could see both (1) skew this large and (2) accounting bugs still lurking, because it's less widely used.

@joedian added the NeedsInvestigation (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.) label Apr 24, 2024
@felixge
Contributor Author

felixge commented Apr 24, 2024

I've checked and think I can rule out arena usage. There are no search hits for arena imports in our mono-repo, and we're seeing this issue across most of our services.

Any other ideas for things we could do to help debug this? Would you have a chance to check whether the issue also shows up in Google's fleet? It's possible we're doing something weird, but I'm not sure what 🤔. This is the only memory metric that we see behaving oddly like this.

@mknyszek self-assigned this Apr 24, 2024