Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/telemetry/internal/counter: allow StackCounter to skip custom call depth #61234

Open
hyangah opened this issue Jul 7, 2023 · 12 comments
Open
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. telemetry x/telemetry issues
Milestone

Comments

@hyangah
Copy link
Contributor

hyangah commented Jul 7, 2023

Currently StackCounter.Inc captures the Inc caller's stack using runtime.Callers(2, ...).
One use case we want to explore is to have a StackCounter to track calls to gopls/internal/bug.report.

If we call StackCounter.Inc from this report function (wrapper of StackCounter.Inc), the first couple of entries are always the functions in this gopls/internal/bug package, which isn't very interesting while taking up a good fraction of the counter name which already has pretty small max limit (#61181).

Can we allow the api to configure custom skip depth like runtime.Callers or log.Logger.Output?

Or alternatively, can we have StackCounter.Inc take pcs ?

cc @golang/tools-team @rsc

@gopherbot gopherbot added the telemetry x/telemetry issues label Jul 7, 2023
@gopherbot gopherbot added this to the Unreleased milestone Jul 7, 2023
@bcmills bcmills added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jul 10, 2023
@hyangah
Copy link
Contributor Author

hyangah commented Jul 12, 2023

We will see if the increased max len and compression mechanism to be added for #61181 alleviate the issue.

@gopherbot
Copy link

Change https://go.dev/cl/509435 mentions this issue: internal/counter: compress stack counter names

gopherbot pushed a commit to golang/telemetry that referenced this issue Jul 15, 2023
Stack counter names are multiline, where each line is
<import-path>.<function-name>:<line-number>. The import paths
tend to be long and repeated, making the names stored in count files
significantly longer than necessary. This change replaces
a repeated import path in the sequence of lines with a single
character " (for 'Ditto'). This is a detail of the internals
of the count files. Parse() reconstructs the uncompressed names.

For: golang/go#61234

Change-Id: I1f1a4160cf7b16a1e2937b56d8c929736bb6ea5c
Reviewed-on: https://go-review.googlesource.com/c/telemetry/+/509435
Run-TryBot: Peter Weinberger <pjw@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Jamal Carvalho <jamal@golang.org>
@findleyr
Copy link
Contributor

findleyr commented Nov 6, 2023

With a longer limit on stack counter names, this is probably not urgent, and may not even be worth the complexity.

For example, there are many redundant frames for bug.Errorf here, but on the other hand it doesn't affect the viability of the data, and is somewhat useful to know that these are all bug reports.

@findleyr
Copy link
Contributor

Per discussion today, if we want to have any automation (such as @adonovan's new x/tools/gopls/internal/telemetry/cmd/stacks) that detects e.g. the most significant frame in the call stack, it could be useful to make that frame the top of the call stack.

That's a newly compelling reason to do this.

@adonovan
Copy link
Member

The new compression of stack counter names replaces the "path" part of a symbol with a ditto mark (") if it is the same as the previous line. I dislike it for several reasons.

  1. The ditto mark (an unbalanced quote) is surprising and confusing. While it's common in penmanship, it is not a software convention. The first time I saw it I started debugging my program to find out why it was emitting garbage.
  2. The compression makes the stack harder to use. Grep doesn't work. My tests in CL 558256 don't work. It requires a stateful decoder to read it.
  3. The split function is incorrect for generic symbols, which may contain periods, making the result even more confusing.

I propose we revert it.

@pjweinb
Copy link

pjweinb commented Jan 29, 2024 via email

@findleyr
Copy link
Contributor

@adonovan it looks like that compression was added for a reason. We shouldn't revert without addressing that reason in a different way. Increasing the max name length seems reasonable, as does @hyangah's proposal in this issue to allow skips (which we want to do anyway: #61234 (comment)).

More generally, it feels like the real problem here is the loose coupling between x/telemetry and the new functionality added in https://go.dev/cl/558256. The counter file format is not intended as an externally facing API, and so it is not surprising that it doesn't make for a good API. Since we're going eventually move the watchdog functionality into x/telemetry, that concern is only temporary. So I don't find (1) or (2) particularly compelling reasons to revert.

But the bugginess of (3) is a real problem. Since fixing that feels like it will involve significant complexity, reverting and instead increasing the max name length sounds like a reasonable path forward. Keep it simple.

@hyangah
Copy link
Contributor Author

hyangah commented Jan 30, 2024

Why is (3) a problem if the raw counter name is not meant to be interpreted directly and running decoder can recover it to the original stack trace?

@adonovan
Copy link
Member

adonovan commented Jan 30, 2024

We shouldn't revert without addressing that reason in a different way.

Of course; increasing the limit seems like a good solution. I can't imagine the extra size of these messages has a significant dollar cost compared the their immense value for debugging.

More generally, it feels like the real problem here is the loose coupling between x/telemetry and the new functionality added in https://go.dev/cl/558256. The counter file format is not intended as an externally facing API, and so it is not surprising that it doesn't make for a good API. Since we're going eventually move the watchdog functionality into x/telemetry, that concern is only temporary. So I don't find (1) or (2) particularly compelling reasons to revert.

I agree that we should move watchdog into x/telemetry at some point, and that would justify writing a more elaborate decoder for the watchdog test, but I don't think that changes the thrust of my argument.

The counter names are not an API, but they are an interface, and they are exposed via json files such as https://storage.googleapis.com/prod-telemetry-merged/2023-12-05.json to tools that operate on them such as stacks. The stacks command parses the stacks to extract the likely culprit symbol, and (as you know) I would like it to additionally mark up each frame to the correct file@version:line in CodeSearch.

Perhaps you're saying the intent was that counter names are merely unstructured strings, but that seems like a lost opportunity to give stacks the special treatment that they--the primary data structure for post-mortem analysis--surely deserve. We should either specify the grammar of these strings, or provide a package that parses them. I think the existing uncompressed format is already pretty good.

@findleyr
Copy link
Contributor

findleyr commented Jan 30, 2024

@adonovan sorry, I thought the uploads already parsed the counter file, and therefore included the expanded name. Is that not the case?

@adonovan
Copy link
Member

adonovan commented Jan 30, 2024

@adonovan sorry, I thought the uploads already parsed the counter file, and therefore included the expanded name. Is that not the case?

Ah, I didn't realize that was the case (if it is indeed the case). So it sounds like the compressed form is supposed to be an internal (but stable and precisely specified) detail and you already have routines for compressing and decompressing it that my test could take advantage of, especially when it is moved into x/telemetry.

@findleyr
Copy link
Contributor

@hyangah

Why is (3) a problem if the raw counter name is not meant to be interpreted directly and running decoder can recover it to the original stack trace?

You're right, it's not a big problem, since the compression/expansion is deterministic. Just a bit sloppy.

@adonovan yes, that is the point I was trying to make: when in x/telemetry you wouldn't only interact with parsed counter files. I confirmed that this is the case, and the uploaded payloads are expanded.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. telemetry x/telemetry issues
Projects
None yet
Development

No branches or pull requests

6 participants