
x/tools/gopls: deadlock due to telemetry worker #33692

Closed
muirdm opened this issue Aug 16, 2019 · 9 comments
Labels: FrozenDueToAge, gopls (Issues related to the Go language server, gopls.)
Milestone: Unreleased

Comments


muirdm commented Aug 16, 2019

My gopls on master (caa95bb) keeps locking up. The debug interface doesn't even respond. I attached via dlv and I saw goroutines were blocked here:

/Users/muir/projects/tools/internal/telemetry/export/worker.go:38 golang.org/x/tools/internal/telemetry/export.Do

Furthermore, one of the goroutines was the telemetry worker goroutine itself. It was invoking a task that published to the workQueue channel (which must have been full, causing a deadlock).

(dlv) bt
0  0x0000000001030baf in runtime.gopark
   at /usr/local/go/src/runtime/proc.go:302
1  0x0000000001006c6b in runtime.goparkunlock
   at /usr/local/go/src/runtime/proc.go:307
2  0x0000000001006c6b in runtime.chansend
   at /usr/local/go/src/runtime/chan.go:236
3  0x0000000001006a25 in runtime.chansend1
   at /usr/local/go/src/runtime/chan.go:127
4  0x0000000001314b62 in golang.org/x/tools/internal/telemetry/export.Do
   at /Users/muir/projects/tools/internal/telemetry/export/worker.go:38
5  0x000000000131421f in golang.org/x/tools/internal/telemetry/export.Metric
   at /Users/muir/projects/tools/internal/telemetry/export/export.go:91
6  0x000000000139dc03 in golang.org/x/tools/internal/telemetry/metric.(*Int64Data).modify.func1
   at /Users/muir/projects/tools/internal/telemetry/metric/metric.go:231
7  0x0000000001315527 in golang.org/x/tools/internal/telemetry/export.init.0.func1
   at /Users/muir/projects/tools/internal/telemetry/export/worker.go:21
8  0x000000000105d7e1 in runtime.goexit
   at /usr/local/go/src/runtime/asm_amd64.s:1337

/cc @ianthehat
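For illustration, here is a minimal, self-contained sketch of the pattern the backtrace shows; it is not the actual gopls telemetry code. A single worker goroutine drains a work queue, and a task executed by that worker enqueues onto the same queue. In gopls the queue is buffered, so the hang needs the buffer to be full; the sketch uses an unbuffered channel so the hang is deterministic.

```go
package main

func main() {
	// Unbuffered so the hang is deterministic; in gopls the queue is
	// buffered, so the same hang requires the buffer to be full.
	workQueue := make(chan func())

	// The worker: the only goroutine that ever drains workQueue.
	go func() {
		for task := range workQueue {
			task()
		}
	}()

	// A task that enqueues more work from inside the worker itself.
	// The inner send can never complete: the only receiver is the worker
	// goroutine, and it is busy executing this very task, so it parks in
	// runtime.chansend, just like frame 2 of the backtrace above.
	workQueue <- func() {
		workQueue <- func() {}
	}

	// Park main; with every goroutine asleep the runtime prints
	// "all goroutines are asleep - deadlock!" along with the stuck stacks.
	select {}
}
```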

@gopherbot gopherbot added this to the Unreleased milestone Aug 16, 2019
@gopherbot gopherbot added the gopls Issues related to the Go language server, gopls. label Aug 16, 2019
@gopherbot

Thank you for filing a gopls issue! Please take a look at the Troubleshooting section of the gopls Wiki page, and make sure that you have provided all of the relevant information here.

@odeke-em

Thank you for reporting this issue @muirdm! I shall send a fix for this immediately.

@gopherbot

Change https://golang.org/cl/190637 mentions this issue: internal/telemetry/export: add load-shedding for workQueue


odeke-em commented Aug 17, 2019 via email


muirdm commented Aug 17, 2019

Thanks for looking at this. It looks like your change would avoid the deadlock, but it would introduce substantial latency under heavy contention. There is also guaranteed loss of telemetry data whenever it hits the current deadlock case.

It seems to me that the telemetry worker needs to be rearchitected in a more fundamental way. I'm sure @ianthehat already has great ideas, but I'm going to say my idea anyway: we could accumulate all telemetry messages for a single request in memory, and then kick off a goroutine to send them after the request completes. This way the telemetry messages for a given request are still sent in order, but they are always handled asynchronously with respect to LSP requests. If there is some requirement that the telemetry messages be processed "live" then this obviously wouldn't work.
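A minimal sketch of the per-request batching idea, using a hypothetical Event type and export hook rather than the real gopls telemetry API: events are accumulated in memory for the duration of one request and flushed on a separate goroutine once the request completes.

```go
package main

import (
	"fmt"
	"sync"
)

// Event stands in for a telemetry message (log line, metric, span edge, ...).
type Event struct{ Msg string }

// requestTelemetry accumulates the events of a single LSP request in memory.
type requestTelemetry struct {
	mu     sync.Mutex
	events []Event
}

// Record appends an event; it never blocks on an exporter or a channel.
func (t *requestTelemetry) Record(e Event) {
	t.mu.Lock()
	t.events = append(t.events, e)
	t.mu.Unlock()
}

// Flush hands the whole batch to the exporter on its own goroutine once the
// request is done, so the request path never waits on telemetry. Ordering
// within a request is preserved; global ordering across requests is not.
func (t *requestTelemetry) Flush(export func([]Event), done *sync.WaitGroup) {
	t.mu.Lock()
	batch := t.events
	t.events = nil
	t.mu.Unlock()

	done.Add(1)
	go func() {
		defer done.Done()
		export(batch)
	}()
}

func main() {
	var done sync.WaitGroup
	rt := &requestTelemetry{}
	rt.Record(Event{Msg: "textDocument/didOpen: start"})
	rt.Record(Event{Msg: "textDocument/didOpen: finish"})
	rt.Flush(func(batch []Event) {
		for _, e := range batch {
			fmt.Println("export:", e.Msg)
		}
	}, &done)
	done.Wait()
}
```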

@odeke-em

> Thanks for looking at this. It looks like your change would avoid the deadlock, but it would introduce substantial latency under heavy contention. There is also guaranteed loss of telemetry data whenever it hits the current deadlock case.

Right, it implements load shedding on a best-effort basis and drops overflowing events, like a circular buffer of sorts, but it ensures there are no deadlocks.

> and then kick off a goroutine to send them after the request completes. This way the telemetry messages for a given request are still sent in order, but they are always handled asynchronously with respect to LSP requests. If there is some requirement that the telemetry messages be processed "live" then this obviously wouldn't work.

Events are processed online, but the suggested batching would mean that the global ordering of events is lost. Also, when metrics are collected, we'd then have to include timestamps too.

Anyways, good thoughts and concerns @muirdm! We can discuss more as we work on this issue.
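For reference, a minimal sketch of what channel-based load shedding of the kind described above can look like; this is not the actual CL 190637, just the general non-blocking-send pattern. The send drops work when the buffer is full, which removes the blocking send at the cost of losing events under pressure.

```go
package main

import "fmt"

func main() {
	workQueue := make(chan func(), 2) // small buffer to show the drop path
	dropped := 0

	// enqueue never blocks: if the buffer is full, it sheds the task.
	enqueue := func(task func()) {
		select {
		case workQueue <- task: // room in the buffer: hand it to the worker
		default: // buffer full: drop instead of blocking
			dropped++
		}
	}

	// No worker is draining yet, so only the first two tasks fit.
	for i := 0; i < 5; i++ {
		i := i
		enqueue(func() { fmt.Println("ran task", i) })
	}

	close(workQueue)
	for task := range workQueue {
		task()
	}
	fmt.Println("dropped:", dropped) // dropped: 3
}
```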

@gopherbot

Change https://golang.org/cl/190737 mentions this issue: internal/telemetry: change concurrency model

@ianthehat

I think the load shedding may cause other significant issues: incorrect counts for rare events would be very bad, and memory leaks caused by unmatched start/finish span pairs, for instance, would be much worse.
The concurrency model was always temporary; it was the quickest thing to implement that would be adequate while I got the API right.
I have put up a CL that might take us a bit further and fixes the deadlock, but the final design needs more thought.
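A minimal sketch of one reading of "change concurrency model": serialize exporters with a mutex instead of handing events to a worker goroutine through a bounded channel, so there is no queue to fill up and block on. The Event type, Exporter hook, and function names here are illustrative, not the actual CL 190737.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a stand-in for a telemetry event; Exporter for the export hook.
type Event struct{ Msg string }
type Exporter func(Event)

var (
	exportMu sync.Mutex
	exporter Exporter
)

// SetExporter installs the exporter under the same lock used by Export.
func SetExporter(e Exporter) {
	exportMu.Lock()
	defer exportMu.Unlock()
	exporter = e
}

// Export runs the exporter synchronously under a mutex rather than sending
// to a worker goroutine, so no caller can block on a full queue. The mutex
// is not re-entrant: this sketch assumes an exporter never calls Export.
func Export(e Event) {
	exportMu.Lock()
	defer exportMu.Unlock()
	if exporter != nil {
		exporter(e)
	}
}

func main() {
	SetExporter(func(e Event) { fmt.Println("export:", e.Msg) })

	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			Export(Event{Msg: fmt.Sprintf("event %d", i)})
		}(i)
	}
	wg.Wait()
}
```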

@odeke-em

Thank you for the discourse @muirdm and @ianthehat! Sure, for now we can use the mutex as Ian has posted up.

@golang golang locked and limited conversation to collaborators Aug 19, 2020