Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/telemetry: audit upload process for 1.23 #65970

Open
2 tasks
findleyr opened this issue Feb 27, 2024 · 8 comments
Open
2 tasks

x/telemetry: audit upload process for 1.23 #65970

findleyr opened this issue Feb 27, 2024 · 8 comments
Assignees
Labels
telemetry x/telemetry issues
Milestone

Comments

@findleyr
Copy link
Contributor

findleyr commented Feb 27, 2024

We were discussing the potential raciness of telemetry uploading in the context of a thundering herd of Go commands. I looked into this in more detail. Here's a paraphrased summary of the upload process:

  1. Collect all counter file names.
  2. Group by expiry. Ignore expiries that are not in the past (meaning at least a day in the past)
  3. For each group of counter files with the same expiry, compute both "local" and "uploadable" reports, using a random X value that is specific to this operations (i.e. not a hash of report content). This X determines both upload sampling and uploaded report names. Skip counter files that can't be read (e.g. because they've been deleted by one of the following steps...).
  4. Before writing <date>.json (the upload report) and local.<date>.json, check if they already exist. If either of them already exists, assume something else got there first, delete all the counter files, and exit.
  5. Then try to write the report files.
  6. Finally, delete all the counter files.
  7. POST the uploadable report (<date>.json) to telemetry.go.dev/upload/<date>/<X>.json

I'm not concerned about races involving user intervention, such as e.g. deleting the telemetry directory in the middle of an upload. However, it does look like there's a race between steps (4) and (7) that could theoretically lead to duplicate uploads: two processes overwrite each other's report with different X values, and each thinks they should upload. But it looks very unlikely given how much has to happen in the unsynchronized period, and we can probably avoid the race by replacing (4) and (5) with an exclusive write, and not deleting files if that write fails.

Nevertheless, I think we should consider this more.

Bugs discovered:

  • a file with no counters is considered a failed file
  • a single report failure causes all other reports not to be uploaded
@gopherbot gopherbot added the telemetry x/telemetry issues label Feb 27, 2024
@gopherbot gopherbot added this to the Unreleased milestone Feb 27, 2024
@bcmills
Copy link
Contributor

bcmills commented Feb 27, 2024

I would also be concerned about races between steps (3) and (6), particularly on Windows (where a file normally can't be deleted while another process has an open handle to it).

@findleyr
Copy link
Contributor Author

@bcmills indeed, if deleting the counter files fails, the error is logged but no other action is taken. However, given that step 6 happens after the report is written, I don't think we have to be worried about this leading to duplicate uploads--that would only be a problem if some other process deleted the local reports, but left the counter files in place. Otherwise, the counter files will "eventually" be deleted because of the logic in step (4): if a report already exists, we still try to delete the counter files.

Are you concerned about having stray counter files around? The race that you describe could lead to A situation where you have e.g.

local/
 go-1.22-linux-amd64-2024-02-26.v1.count
 local.2024-02-26.json

That .v1.count file shouldn't be there, but is "harmless" because it will never be uploaded, because the presence of local.2024-02-26.json will prevent the upload. Furthermore, the counter file would eventually be deleted by step (4). Is that a problem? Aside from cruft, the only downside I foresee is unnecessary work, because reports are needlessly produced during the upload process.

@bcmills
Copy link
Contributor

bcmills commented Feb 27, 2024

Yeah — if the error in deleting the file doesn't cause any other ill effects, that's probably an ok failure mode.

@pjweinb pjweinb self-assigned this Feb 28, 2024
@findleyr findleyr changed the title x/telemetry: guard against upload races x/telemetry: audit upload process for 1.23 May 9, 2024
@findleyr findleyr assigned findleyr and unassigned pjweinb May 9, 2024
@findleyr
Copy link
Contributor Author

findleyr commented May 9, 2024

Repurposing this issue to track generally auditing the upload process. Assigning to myself as I'm working on it.

@findleyr findleyr modified the milestones: Unreleased, Go1.23 May 10, 2024
@gopherbot
Copy link

Change https://go.dev/cl/584400 mentions this issue: internal/upload: don't skip all uploads on the first report failure

@gopherbot
Copy link

Change https://go.dev/cl/584305 mentions this issue: internal/upload: add an end-to-end test for mode handling

gopherbot pushed a commit to golang/telemetry that referenced this issue May 10, 2024
A single failure of createReport should not prevent the upload of other
reports. Fix this by logging and proceeding when there is a failure.

To test this failure, and to generally make it easier to exercise upload
bugs, make the following testing improvements:

- Add regtest.RunProgAsOf, which is like RunProg but sets CounterTime
  (newly exposed) to a time in the past.
- Add regtest.NewIncProgram for the common use case of a program that
  just increments counters and exits.
- Export CreateTestUploadConfig and CreateTestUploadServer.
- Have CreateTestUploadServer implement its own cleanup.
- Add a testWriter to echo upload logs to t.Log.

Together, these helpers make it relatively easy to write an ad-hoc
upload test using only the public counter and upload APIs.

For golang/go#65970

Change-Id: I9f54ad22a1f69cc6162ebe5628ad3287b89bbee1
Reviewed-on: https://go-review.googlesource.com/c/telemetry/+/584400
Auto-Submit: Robert Findley <rfindley@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Hyang-Ah Hana Kim <hyangah@gmail.com>
gopherbot pushed a commit to golang/telemetry that referenced this issue May 10, 2024
For golang/go#65970

Change-Id: I2e84e0c9ffaa2f270b8d42164ea95181ec088e8c
Reviewed-on: https://go-review.googlesource.com/c/telemetry/+/584305
Reviewed-by: Hyang-Ah Hana Kim <hyangah@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
@gopherbot
Copy link

Change https://go.dev/cl/584402 mentions this issue: internal/counter: remove fall-back logic in Parse for missing TimeBegin

@gopherbot
Copy link

Change https://go.dev/cl/584795 mentions this issue: internal/upload: migrate TestUploadBasic and TestUploadFailure

gopherbot pushed a commit to golang/telemetry that referenced this issue May 10, 2024
Remove fallback logic in counter.Parse to handle a missing "TimeBegin"
metadata field. All counters should have this field populated (it is
written by file.init). To ensure we catch this form of corruption,
return an error from counterDateSpan when the TimeBegin or TimeEnd
fields are missing (this was a pre-existing TODO).

The logic dates to the original instance of the counter package, so it
is likely just obsolete.

For golang/go#65970

Change-Id: I6a75a42f3092c3471fe95cda1958b57b954e9140
Reviewed-on: https://go-review.googlesource.com/c/telemetry/+/584402
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Hyang-Ah Hana Kim <hyangah@gmail.com>
Auto-Submit: Robert Findley <rfindley@google.com>
gopherbot pushed a commit to golang/telemetry that referenced this issue May 10, 2024
Move TestUploadBasic and TestUpload failure to the new upload_test
package, and rewrite them to use only the public upload API. This also
cleans up the file layout a bit, as TestDates is now the only test in
dates_test.go.

Additionally, check the state of the telemetry dir before and after
upload, and generalize TestUploadFailure to a retry test that checks
various status code retry behavior.

For golang/go#65970

Change-Id: Iab9d33a5eb852b43e7876bbf8b12d91b40c85909
Reviewed-on: https://go-review.googlesource.com/c/telemetry/+/584795
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Hyang-Ah Hana Kim <hyangah@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
telemetry x/telemetry issues
Projects
None yet
Development

No branches or pull requests

4 participants