
x/tools/gopls: deadlock due to telemetry worker #33692

Closed
muirdm opened this issue Aug 16, 2019 · 9 comments
Labels: FrozenDueToAge, gopls (Issues related to the Go language server, gopls.)
Milestone: Unreleased

Comments


muirdm commented Aug 16, 2019

My gopls on master (caa95bb) keeps locking up. The debug interface doesn't even respond. I attached via dlv and I saw goroutines were blocked here:

/Users/muir/projects/tools/internal/telemetry/export/worker.go:38 golang.org/x/tools/internal/telemetry/export.Do

Furthermore, one of the goroutines was the telemetry worker goroutine itself. It was invoking a task that published to the workQueue channel (which must have been full, causing a deadlock).

(dlv) bt
0  0x0000000001030baf in runtime.gopark
   at /usr/local/go/src/runtime/proc.go:302
1  0x0000000001006c6b in runtime.goparkunlock
   at /usr/local/go/src/runtime/proc.go:307
2  0x0000000001006c6b in runtime.chansend
   at /usr/local/go/src/runtime/chan.go:236
3  0x0000000001006a25 in runtime.chansend1
   at /usr/local/go/src/runtime/chan.go:127
4  0x0000000001314b62 in golang.org/x/tools/internal/telemetry/export.Do
   at /Users/muir/projects/tools/internal/telemetry/export/worker.go:38
5  0x000000000131421f in golang.org/x/tools/internal/telemetry/export.Metric
   at /Users/muir/projects/tools/internal/telemetry/export/export.go:91
6  0x000000000139dc03 in golang.org/x/tools/internal/telemetry/metric.(*Int64Data).modify.func1
   at /Users/muir/projects/tools/internal/telemetry/metric/metric.go:231
7  0x0000000001315527 in golang.org/x/tools/internal/telemetry/export.init.0.func1
   at /Users/muir/projects/tools/internal/telemetry/export/worker.go:21
8  0x000000000105d7e1 in runtime.goexit
   at /usr/local/go/src/runtime/asm_amd64.s:1337

/cc @ianthehat
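For illustration, here is a minimal, self-contained sketch of the pattern the backtrace shows; it is not the actual gopls telemetry code. A single worker goroutine drains a work queue, and a task executed by that worker enqueues onto the same queue. In gopls the queue is buffered, so the hang needs the buffer to be full; the sketch uses an unbuffered channel so the hang is deterministic.

```go
package main

func main() {
	// Unbuffered so the hang is deterministic; in gopls the queue is
	// buffered, so the same hang requires the buffer to be full.
	workQueue := make(chan func())

	// The worker: the only goroutine that ever drains workQueue.
	go func() {
		for task := range workQueue {
			task()
		}
	}()

	// A task that enqueues more work from inside the worker itself.
	// The inner send can never complete: the only receiver is the worker
	// goroutine, and it is busy executing this very task, so it parks in
	// runtime.chansend, just like frame 2 of the backtrace above.
	workQueue <- func() {
		workQueue <- func() {}
	}

	// Park main; with every goroutine asleep the runtime prints
	// "all goroutines are asleep - deadlock!" along with the stuck stacks.
	select {}
}
```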

@gopherbot gopherbot added this to the Unreleased milestone Aug 16, 2019
@gopherbot gopherbot added the gopls Issues related to the Go language server, gopls. label Aug 16, 2019
@gopherbot

Thank you for filing a gopls issue! Please take a look at the Troubleshooting section of the gopls Wiki page, and make sure that you have provided all of the relevant information here.

@odeke-em

Thank you for reporting this issue @muirdm! I shall send a fix for this immediately.

@gopherbot

Change https://golang.org/cl/190637 mentions this issue: internal/telemetry/export: add load-shedding for workQueue


odeke-em commented Aug 17, 2019 via email


muirdm commented Aug 17, 2019

Thanks for looking at this. It looks like your change would avoid the deadlock, but it would introduce substantial latency under heavy contention. There is also guaranteed loss of telemetry data whenever it hits the current deadlock case.

It seems to me that the telemetry worker needs to be rearchitected in a more fundamental way. I'm sure @ianthehat already has great ideas, but I'm going to say my idea anyway: we could accumulate all telemetry messages for a single request in memory, and then kick off a goroutine to send them after the request completes. This way the telemetry messages for a given request are still sent in order, but they are always handled asynchronously with respect to LSP requests. If there is some requirement that the telemetry messages be processed "live" then this obviously wouldn't work.
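A minimal sketch of the per-request batching idea, using a hypothetical Event type and export hook rather than the real gopls telemetry API: events are accumulated in memory for the duration of one request and flushed on a separate goroutine once the request completes.

```go
package main

import (
	"fmt"
	"sync"
)

// Event stands in for a telemetry message (log line, metric, span edge, ...).
type Event struct{ Msg string }

// requestTelemetry accumulates the events of a single LSP request in memory.
type requestTelemetry struct {
	mu     sync.Mutex
	events []Event
}

// Record appends an event; it never blocks on an exporter or a channel.
func (t *requestTelemetry) Record(e Event) {
	t.mu.Lock()
	t.events = append(t.events, e)
	t.mu.Unlock()
}

// Flush hands the whole batch to the exporter on its own goroutine once the
// request is done, so the request path never waits on telemetry. Ordering
// within a request is preserved; global ordering across requests is not.
func (t *requestTelemetry) Flush(export func([]Event), done *sync.WaitGroup) {
	t.mu.Lock()
	batch := t.events
	t.events = nil
	t.mu.Unlock()

	done.Add(1)
	go func() {
		defer done.Done()
		export(batch)
	}()
}

func main() {
	var done sync.WaitGroup
	rt := &requestTelemetry{}
	rt.Record(Event{Msg: "textDocument/didOpen: start"})
	rt.Record(Event{Msg: "textDocument/didOpen: finish"})
	rt.Flush(func(batch []Event) {
		for _, e := range batch {
			fmt.Println("export:", e.Msg)
		}
	}, &done)
	done.Wait()
}
```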

@odeke-em

> Thanks for looking at this. It looks like your change would avoid the deadlock, but it would introduce substantial latency under heavy contention. There is also guaranteed loss of telemetry data whenever it hits the current deadlock case.

Right, it implements load shedding on a best-effort basis and drops overflowing events, like a circular buffer of sorts, but it ensures there are no deadlocks.

> and then kick off a goroutine to send them after the request completes. This way the telemetry messages for a given request are still sent in order, but they are always handled asynchronously with respect to LSP requests. If there is some requirement that the telemetry messages be processed "live" then this obviously wouldn't work.

Events are processed online, but the suggested batching would mean that the global ordering of events is lost. Also, when metrics are collected, we'd then have to include timestamps too.

Anyways, good thoughts and concerns @muirdm! We can discuss more as we work on this issue.
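For reference, a minimal sketch of what channel-based load shedding of the kind described above can look like; this is not the actual CL 190637, just the general non-blocking-send pattern. The send drops work when the buffer is full, which removes the blocking send at the cost of losing events under pressure.

```go
package main

import "fmt"

func main() {
	workQueue := make(chan func(), 2) // small buffer to show the drop path
	dropped := 0

	// enqueue never blocks: if the buffer is full, it sheds the task.
	enqueue := func(task func()) {
		select {
		case workQueue <- task: // room in the buffer: hand it to the worker
		default: // buffer full: drop instead of blocking
			dropped++
		}
	}

	// No worker is draining yet, so only the first two tasks fit.
	for i := 0; i < 5; i++ {
		i := i
		enqueue(func() { fmt.Println("ran task", i) })
	}

	close(workQueue)
	for task := range workQueue {
		task()
	}
	fmt.Println("dropped:", dropped) // dropped: 3
}
```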

@gopherbot

Change https://golang.org/cl/190737 mentions this issue: internal/telemetry: change concurrency model

@ianthehat

I think the load shedding may cause other significant issues: incorrect counts for rare events would be very bad, and memory leaks caused by unmatched start/finish span pairs, for instance, would be much worse.
The concurrency model was always temporary; it was the quickest thing to implement that would be adequate while I got the API right.
I have put up a CL that might take us a bit further and fixes the deadlock, but the final design needs more thought.
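A minimal sketch of one reading of "change concurrency model": serialize exporters with a mutex instead of handing events to a worker goroutine through a bounded channel, so there is no queue to fill up and block on. The Event type, Exporter hook, and function names here are illustrative, not the actual CL 190737.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a stand-in for a telemetry event; Exporter for the export hook.
type Event struct{ Msg string }
type Exporter func(Event)

var (
	exportMu sync.Mutex
	exporter Exporter
)

// SetExporter installs the exporter under the same lock used by Export.
func SetExporter(e Exporter) {
	exportMu.Lock()
	defer exportMu.Unlock()
	exporter = e
}

// Export runs the exporter synchronously under a mutex rather than sending
// to a worker goroutine, so no caller can block on a full queue. The mutex
// is not re-entrant: this sketch assumes an exporter never calls Export.
func Export(e Event) {
	exportMu.Lock()
	defer exportMu.Unlock()
	if exporter != nil {
		exporter(e)
	}
}

func main() {
	SetExporter(func(e Event) { fmt.Println("export:", e.Msg) })

	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			Export(Event{Msg: fmt.Sprintf("event %d", i)})
		}(i)
	}
	wg.Wait()
}
```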

@odeke-em

Thank you for the discourse @muirdm and @ianthehat! Sure, for now we can use the mutex as Ian has posted up.

@golang golang locked and limited conversation to collaborators Aug 19, 2020