proposal: runtime: garbage collect goroutines blocked forever #19702

faiface · 2017-03-24T19:15:43Z

Hi everyone!

As of today, Go's runtime does not garbage collect blocked goroutines. The most cited reason being, that goroutines blocked forever are usually a bug and that collecting them would hide this bug. I would like to show a few examples, where garbage collected goroutines would really be appreciated and lead to a much safer and less buggy code.

What does it mean?

package main

import "fmt"

func numbers() <-chan int {
	ch := make(chan int)
	go func() {
		for i := 0; ; i++ {
			ch <- i
		}
	}()
	return ch
}

func main() {
	for i := 0; i < 1000; i++ {
		for num := range numbers() {
			if num >= 1000 {
				break
			}
			fmt.Println(num)
		}
	}
}

This code has a memory leak in current Go. The function numbers returns a channel that generates an infinite sequence of natural numbers. We get this sequence 1000 times in the main function, print the first 1000 numbers of each sequence and quit. The numbers function spawns a goroutine that feeds the channel with numbers. However, once we print the first 1000 numbers produced by the goroutine, the goroutine stays in the memory, blocked forever. If the first for-loop iterated forever, we would run out of memory very quickly.

How to collect goroutines?

I'd suggest the following algorithm:

All non-blocked goroutines are marked active.
All channels not reachable by active goroutines are marked inactive.
All gouroutines blocked on an inactive channel are marked dead and garbage collected.

Edit: Based on the discussion, I add one detail to the implementation. A goroutine marked dead would be collected in a way that's identical to calling runtime.Goexit inside that goroutine (https://golang.org/pkg/runtime/#Goexit).

Edit 2: Based on the further discussion, runtime.Goexit behavior is debatable and maybe not right.

What are the benefits?

Once such goroutine garbage collection is enabled, we can solve large variety of new problems.

Infinite generator functions. Generator is a function that returns a channel that sends a stream of values. Heavily used in functional languages, they are currently hardly possible in Go (we'd have to send them a stopping signal).
Finite generator functions. We can do them in Go, however, we have to drain the channel if we want to avoid a memory leak.
Manager goroutines. A nice way to construct a concurrent-safe object in Go is to, instead of guarding it by a mutex, create a manager goroutine, that ranges over a channel that sends commands to this object. Manager goroutine then executes this commands as they come in. This is hard to do in Go, because if the object goes out of scope, manager goroutine stays in the memory forever.

All of the described problems are solvable. Manual memory management is also solvable. And this is just like that. Inconvenient, error-prone and preventing us from doing some advanced stuff, that would be really useful.

I'd like to point out, that the whole talk "Advanced Go Concurrency Patterns" would be pointless, if Go collected blocked goroutines.

What do you guys think?

The text was updated successfully, but these errors were encountered:

bradfitz · 2017-03-24T19:22:50Z

This makes the garbage collector more expensive.

Also, how do you "mark dead" a goroutine? A goroutine can't be killed by another goroutine. Do you run deferred statements? Does the channel operation panic?

This would require a lot of design & implementation work for marginal benefit.

faiface · 2017-03-24T19:30:37Z

I'm not sure about the details, but it seems logical that the deferred statements would run and then the goroutine would be silently killed. It would not be killed by another goroutine, it would be killed by the garbage collector.

I personally don't think the benefits are marginal. It would make a lot of the obvious code work correctly.

If you watch Rob Pike's "Go Concurrency Patterns" talk, you see that his code examples would not be correct, if the goroutines would not be garbage collected like this. Also, in one point in the talk, being asked by some guy "What happens to that goroutine", he responds something like, "Don't worry, it'll be collected".

bradfitz · 2017-03-24T19:37:20Z

I'm not sure about the details,

The proposal process is really about the details, though.

but it seems logical that the deferred statements would run and then the goroutine would be silently killed.

What about locks it's currently holding? Would deferred statements be run?

What about the cases where it actually is a bug to have the leak and the goroutine shouldn't be silently swept under the rug?

Maybe an alternate proposal would be that the sender goroutine should have to explicitly select on the receiver going away:

func numbers() <-chan int {
	ch := make(chan int)
	go func() {
		for i := 0; ; i++ {
			select {
                        case ch <- i:
                        unreachable:  // language change, though.
                                // choose: panic, return silently, etc
                        }
		}
	}()
	return ch
}

But really, now that we have context in the standard library, the answer should be to start the generator goroutine with a context.

faiface · 2017-03-24T19:48:55Z

I think that deferred calls should be run. If one of the deferred calls is a mutex unlock, than it'll be run. If the blocked goroutine holds a mutex, but no deferred call unlocks that mutex, then it's a bug and although the goroutine will be collected, the mutex won't be unlocked which would definitely show the bug, just as good as a goroutine memory leak.

On the Context: of course, it's possible to use here, however, you have to manually cancel it. It's just like manual memory management.

bradfitz · 2017-03-24T19:59:34Z

Okay, so your proposal is that a channel send or receive operation that can be dynamically proven to never complete turns into a call to runtime.Goexit (https://golang.org/pkg/runtime/#Goexit)?

faiface · 2017-03-24T20:01:14Z

Exactly, yes.

randall77 · 2017-03-24T20:15:28Z

I don't agree with calling Goexit. If a goroutine is blocked forever, we can just garbage collect it without changing any semantics (and maybe sample it or something, for the bug case). It would never have run again, so there isn't an observable difference except for memory footprint. Why would it call Goexit, run defers, etc.? That seems like an unexpected behavior while waiting on a channel. And it feels like finalizers - we can only provide nebulous guarantees about whether we can detect it and how quickly we do so.

Holding a lock while doing a channel operation should be heavily frowned upon :(

faiface · 2017-03-24T21:19:19Z

@randall77 You've got a valid point. This would need to be discussed.

fraenkel · 2017-03-24T21:57:14Z

How would goroutines waiting on Tickers or Timers work?
What about channels in channels?

faiface · 2017-03-24T22:12:10Z

@fraenkel Tickers and Timers theoretically always have a running goroutine behind them, counting the time, which eventually sends a signal. So, they're always active. Blocking on them doesn't garbage collect anything.

Channels in channels don't make any difference, if I'm not missing anything. If a channel is not reachable by any active goroutine, then no values will ever be sent or received on it.

nullchinchilla · 2017-03-24T23:09:10Z

If this is implemented (and I think it indeed would be useful), deferred functions should not run. Essentially this should not change the observable behavior of any program except for memory usage.

I do feel like this feature would be very helpful, as it allows goroutines essentially to be used as continuations (from languages like Scheme) that contain some state that could be resumed or thrown away implicitly. A goroutine in the background managing some variables local to that goroutine could replace structs in cases where you really do not want to expose any internal structure, or you are doing some complex synchronization.

nullchinchilla · 2017-03-24T23:17:19Z

In fact, it might be a good idea to panic if the garbage collector wants to collect an unreachable goroutine but finds that it has pending defers (or more conservatively, is holding a lock), since that's always a bug.

randall77 · 2017-03-24T23:44:05Z

I think this could work, but it would require folding goroutines into the marking phase of the GC.
Instead of treating all goroutines as roots, start with only active goroutines. (Active here means not blocked on a channel. Goroutines in syscalls are considered active for this.) Any time we mark a channel, mark all the goroutines waiting on that channel as active, and schedule their stacks for scanning. At the end of mark, any goroutines not marked active can be collected. As a bonus, we never need to scan their stack (and we can't, because then we would find a reference to the channel they are waiting on).

There are lots of tricky details:

Channels have direction - if an active goroutine has a send-only reference to a channel, it could never wake up a goroutine receive-waiting on that channel. The GC has no way to tell different channel references apart (send/receive/bidirectional is only a compile-time thing).
The GC needs to know that channels (or possibly something the channel references, like the goroutine struct) are special objects. It needs that information to know that when marking the channel, it needs to do extra work it doesn't need to do for normal objects. We'd need to allocate channels on segregated pages or something. or maybe marks similar to how we do finalizers?
Goroutines could be selecting on several channels at once. Any one of those could cause it to be active.

On a more meta level, this proposal would encourage using forever-blocking goroutines during the normal course of business. Right now they are considered a bug. The proposal suggests this is just a defense against memory leaks. But this is a slippery slope people are surely going to drive a mac truck down. Lazy-evaluating Haskell interpreter, anyone? Go isn't really a good fit for that, as a goroutine is a really heavy future implementation. But people will try anyway, and get frustrated with the performance.

fraenkel · 2017-03-24T23:51:33Z

If one happens to write incorrect code which creates deadlocks or goroutines which are "dead" for some reason, one cannot easily determine what went wrong since the evidence is now magically removed.

faiface · 2017-03-25T00:16:59Z

@fraenkel It seems to me that deadlocks could easily be handled as a special case.

nullchinchilla · 2017-03-25T00:38:03Z

@randall77 I already use goroutines heavily as an annoying future implementation that requires manual memory management. The whole point of goroutines being extremely lightweight and scalable is that doing this is not "really heavy". If goroutines are just used where you would use threads in Java, etc, then it has lost some of its utility.

nullchinchilla · 2017-03-25T00:39:57Z

For example, I often use goroutines to express a sequence of numbers, such as an allocator for object IDs, to avoid intertwining the logic for incrementing counters, etc with the main loop where the business logic happens.

Far more frequently, I use goroutines to manage an object that requires complex synchronization. Essentially, in the function that creates the object ("NewWhateverStruct(...)" etc), a goroutine will be spun up in the background that communicates with the methods through channels and does all the actual work. This can include objects that do not manage external resources; a large in-memory thread-safe database for example. Currently, users of such an object must call a "Close()" method or something to kill the goroutines running in the background, which is annoying and easy to mess up, especially when the object may be referenced many times throughout many goroutines.

bradfitz · 2017-03-25T01:26:09Z

@randall77, you could imagine doing exactly what you propose in the GC, but instead of doing anything else to kill the goroutine, just ~sync.Once a warning about a leaked goroutine to stderr. The runtime doesn't complain to stderr often, but this seems justified enough.

docmerlin · 2017-03-25T01:55:25Z

This would also allow some of the more common erlang patterns to be useful in go.

egonelbre · 2017-03-25T13:22:36Z

It's possible to detect when a goroutine is blocked for more than some time by examining the stack (proof-of-concept https://github.com/egonelbre/antifreeze/blob/master/monitor.go). As such, one of the approaches could be that you can specify a time limit for how long a goroutine can be blocked.

With regards to the example, this is faster, shorter and doesn't require an additional goroutine:

type Numbers struct{ next int64 }
func (n *Numbers) Next() int { return int(atomic.AddInt64(&n.next, 1)) }

func main() {
	for i := 0; i < 10; i++ {
		var numbers Numbers
		for {
			num := numbers.Next()
			if num >= 1000 {
				break
			}
			fmt.Println(num)
		}
	}
}

nullchinchilla · 2017-03-25T15:04:19Z

@egonelbre Using a timeout would be horrible, as it will either be too long or too short for some purposes. Especially accidentally timing out a goroutine which is still reachable would cause weird unpredictable bugs.

Integrating goroutine-collecting with the GC would make sure that all the collecting is completely "unobservable" without dumping stack traces. And as @docmerlin mentions, it does significantly increase the expressivity of Go compared to languages like Erlang with somewhat similar concepts of communicating lightweight processes.

Merovius · 2017-03-25T17:36:32Z

I was missing this for a long time too and still don't quite understand, why memory is considered a resource that a programmer should obviously not need to care about, while goroutines are (cf. "What about the cases where it actually is a bug to have the leak and the goroutine shouldn't be silently swept under the rug?" - I could just as easily ask "what about cases where a leaked pointer is actually a bug and the GC is sweeping that under the rug?").

That being said, I believe a) this proposal so far to be too hand-wavy. I think there are a lot of questions to be answered about the details of what goroutines can and can't be collected. And as this feature is only useful, once it's codified in the spec (otherwise a program can't rely on the collection, thus any actual use of it would still be incorrect), these questions would need good, complete but also concise answers.
And b) that indeed, context seems like a good enough solution to at least all cases I care about. In 99% of code, I need to plumb through a context anyhow and rely on it being cancelled upstack.

So, it would be cool, if this could happen, but the edge-cases should definitely be thought hard about.

faiface · 2017-03-25T17:57:52Z

@Merovius Fully agree. This proposal is not meant to be something complete, I was rather going for to see whether other people were missing this too.

I think it would be useful if some more people from the Go team stated their opinion on this (such as @griesemer, @robpike, @adg, etc. not mentioning @bradfitz, who already contributed).

If the proposal turns out to be reasonable and acceptable, then we might start thinking about the details in depth.

bradfitz · 2017-03-25T18:02:43Z

The @golang/proposal-review group meets on Mondays, but generally doesn't comment on proposals until they've been open with community input for a week.

faiface · 2017-03-25T18:04:24Z

@bradfitz Aaa, thanks, I didn't know that. Does that mean that this proposal will be discussed on the Monday in two weeks?

bradfitz · 2017-03-25T18:05:01Z

It depends on backlog.

nullchinchilla · 2017-03-25T18:58:30Z

@Merovius context is similar to manually managing memory using free in my mind, especially in code that does not use the context package pervasively, or even for which a workflow involving context makes no sense (say, an in-memory data structure package)

nullchinchilla · 2017-03-25T19:01:52Z

@Merovius It seems like there's a very simple way of defining "unreachable":

A goroutine is unreachable if it is waiting on a resource that is unreachable (in the GC sense) from any other goroutine stack.

bradfitz · 2017-03-28T16:50:05Z

@rsc,

Effort spent compressing leaked memory is probably wasted: better to make it easier for programmers to find and fix leaks instead

Thoughts on my comment to Keith above?
#19702 (comment)

rsc · 2017-03-29T15:15:32Z

@bradfitz, I don't think this is worthwhile. The detection rate is so low that it won't be worth the complexity. Leaked goroutines don't even matter until there are a lot of them; people who care can watch runtime.NumGoroutines() against a limit they choose, and that will be much better at detecting. Or just take a look at /debug/pprof once in a while.

nullchinchilla · 2017-03-29T17:17:38Z

@rsc I don't think this should be about "killing" goroutines. Such a change will never change the observable behavior of any program except for memory usage. It is not about killing "infinitely blocked" goroutines, or deadlock detection, but rather collecting "unreachable" goroutines. A goroutine that's infinitely blocked but has pointer on its stack to a waitgroup that somebody else is waiting on is obviously not unreachable.

Nothing will change from the perspective of programmers using existing Go idioms, except perhaps debugging, which can easily be solved by adding a flag to GODEBUG's GC optiosn that prints something when goroutines are collected. The only things goroutine collecting would add are new idioms / design patterns. And some other languages do indeed collect threads in exactly this unobservable way to enable more design freedom.

Of course, it's fine if the larger Go community feels that idioms such as using background goroutines to synchronize an object don't belong in Go, but my point was that this proposal isn't about magically fixing memory leaks in existing programs, but rather to enable programs that haven't been written yet to be written.

rsc · 2017-03-29T17:22:28Z

@bunsim I understand your motivation was for new patterns. But I don't believe you can split the hairs well enough here to collect only the goroutines that aren't stuck due to bugs. And you shouldn't have to set a GODEBUG flag (that breaks the idioms you want to enable!) after the fact to get useful information about your deadlocked program.

faiface · 2017-03-29T17:33:20Z

@rsc When you or anybody else have dealt with goroutines stuck forever, how often was it a bug that wouldn't be solved by garbage collecting that goroutine? If it's often the case, could you give any example?

nullchinchilla · 2017-03-29T20:44:59Z

The GODEBUG flag wouldn't break the idioms, it would just print "# of goroutines collected this cycle" as it already does for bytes etc. And you already can totally disable the GC using debugging options anyway.

Nobody is suggesting some weird heuristic to collect only goroutines that aren't stuck due to bugs. It's perfectly okay to collect ones that are stuck due to bugs, it only loses us some debugging info we can recover by simply disabling the GC. And a large amount of existing "buggy code" will simply be the correct way of doing things in the future.

The whole argument sounds suspiciously like "we shouldn't use a GC because it hides bugs from Valgrind".

ianlancetaylor · 2017-03-29T20:59:10Z

I want to stress @rsc 's point that if this is to change the way that people write Go code, then it is essential that it be clear when goroutines will be collected, and that goroutines will be reliably collected in that state. That seems to me to be difficult. It's a lot harder to tell when a goroutine is blocked than it is to tell when memory is not referenced. The canonical case here is a goroutine blocked reading from or writing to a channel, but the goroutine itself is holding a reference to the channel; for example, it may be an argument to the function that blocked, and therefore be on the goroutine's stack. We need to reliably detect not that the channel is unreferenced, but that the only references to the channel are from memory that is only visible to the blocked goroutine. That seems hard.

nullchinchilla · 2017-03-29T22:24:09Z

@ianlancetaylor That doesn't seem especially harder than GC in general, though? It seems about as difficult as implementing a weak-map (delete from the map if the only references are from memory rooted at the map), though admittedly it isn't something Go has.

ianlancetaylor · 2017-03-29T23:39:20Z

Weak pointers are easier: you have a pointer type that the GC explicitly ignores (except that it gets updated when an object gets moved, for systems with a moving GC, and it gets cleared when an object is freed.)

What we need for this is something different: given a goroutine G1 blocked on a channel, we need to run a GC from all roots except G1, and see whether we marked the channel. That is straightforward but too inefficient to actually use. I don't know an efficient way to implement this. Perhaps there is one.

nullchinchilla · 2017-03-30T13:27:19Z

@ianlancetaylor What if we run the GC only from the roots of running/runnable goroutines, and then collect all goroutines blocked on channels we didn't mark? (Of course, with some caveats with timers, etc) This should collect everything in one go, though it's might be hard to integrate it into Go's concurrent GC.

rsc · 2017-03-30T14:46:35Z

There are many times when a goroutine will be blocked on a channel in a data structure that is accessible to other goroutines if they follow the right chain of pointers in memory, but none of them will. That case and many other related ones fundamentally fool any GC-based algorithm, so that in many cases goroutines will not be collected, and the conditions will be very unclear, depending potentially on optimizations in the compiler that move data between stack and heap, and maybe other optimizations as well. The result will be that it's fairly unpredictable when a blocked goroutine would be collected vs not.

As Ian said, "it is essential that it be clear when goroutines will be collected, and that goroutines will be reliably collected in that state."

nullchinchilla · 2017-03-31T23:01:13Z

@rsc There are many times when a data structure is inside a datastructure that's accessible from a goroutine if it follows the right chain of pointers in memory, but it never will. So the conditions at which memory will be collected is not clear and GC is "fooled", so garbage collection is fundamentally undecidable. I'm sure you could even construct a scenario where freeing an object at the "right" moment means solving the halting problem. Better use free()!

I've programmed extensively in Racket, a language where unreachable blocked threads are collected, and it is definitely not "fairly unpredictable" whether a blocked thread will be collected. There are several common design patterns that involve blocked threads being collected, and in each case it's very clear why it's being collected, and in other cases you still close them down manually.

In Go's case, we wouldn't be removing any code to shut goroutines down at all in most existing code; we would simply have new idioms in which whether goroutines are collected or not is very obvious. This proposal isn't intended to deprecate goroutine termination devices like tombs, etc, but rather to allow usage of goroutines to implement not exactly thread-like constructs such as coroutines and generators.

nullchinchilla · 2017-03-31T23:06:55Z

Just as an example, this proposal would allow users not to call Stop() on time.Ticker. Calling stop on objects that don't manage external resources should be the GC's job.

Essentially, this proposal can make "is this object implemented by a background goroutine" a well-encapsulated implementation detail. You might be able to implement time.Ticker without needing Stop() by some magic fiddling with the runtime and scheduler, but whether you do that, or simply use a goroutine that sleeps periodically, shouldn't be exposed in the form of "does it leak memory unless you stop it".

ianlancetaylor · 2017-04-01T00:40:45Z

The issue of calling Stop on a time.Ticker is actually a different problem. Implementing this suggestion would not solve it. The problem there is that a time.Ticker has an entry in the runtime timer table, and there is nothing that removes that entry even if the time.Ticker is garbage collected.

nullchinchilla · 2017-04-01T00:46:24Z

@ianlancetaylor Ah, I always thought the reason was that time.Ticker was backed by a sleeping goroutine in a loop. But if this proposal is implemented such an implementation would indeed avoid the need of calling Stop.

gopherbot added this to the Proposal milestone Mar 24, 2017

gopherbot added the Proposal label Mar 24, 2017

golang locked and limited conversation to collaborators Apr 1, 2018

gopherbot added the FrozenDueToAge label Apr 1, 2018

zigo101 mentioned this issue Dec 19, 2019

Question: Is a chain of goroutines with empty select deletable? #36214

Closed

bcmills mentioned this issue Feb 18, 2020

proposal: Go 2: time: deprecate time.Tick(d Duration) #37144

Closed

pamburus mentioned this issue Jun 9, 2020

defense in depth against goroutine leak gophercloud/gophercloud#1979

Closed

bcmills mentioned this issue Sep 16, 2020

proposal: Go 2: goroutine collector or closer to prevent goroutine leaks #41414

Closed

bcmills removed the FrozenDueToAge label Sep 16, 2020

golang unlocked this conversation Sep 16, 2020

golang locked as resolved and limited conversation to collaborators Sep 16, 2020

bcmills added the FrozenDueToAge label Sep 16, 2020

deanveloper mentioned this issue Aug 13, 2021

proposal: Go 2: function values as iterators #43557

Closed

deanveloper mentioned this issue Sep 23, 2021

proposal: use channels as iterators #48567

Closed

lujiajing1126 mentioned this issue Aug 10, 2022

Parameterize memory size and some key improvements apache/skywalking-banyandb#153

Merged

golang unlocked this conversation Nov 1, 2022

golang locked as resolved and limited conversation to collaborators Nov 1, 2022

proposal: runtime: garbage collect goroutines blocked forever #19702

proposal: runtime: garbage collect goroutines blocked forever #19702

Comments

faiface commented Mar 24, 2017 • edited Loading

What does it mean?

How to collect goroutines?

What are the benefits?

bradfitz commented Mar 24, 2017

faiface commented Mar 24, 2017

bradfitz commented Mar 24, 2017

faiface commented Mar 24, 2017 • edited Loading

bradfitz commented Mar 24, 2017

faiface commented Mar 24, 2017

randall77 commented Mar 24, 2017

faiface commented Mar 24, 2017

fraenkel commented Mar 24, 2017

faiface commented Mar 24, 2017

nullchinchilla commented Mar 24, 2017 • edited Loading

nullchinchilla commented Mar 24, 2017 • edited Loading

randall77 commented Mar 24, 2017

fraenkel commented Mar 24, 2017

faiface commented Mar 25, 2017

nullchinchilla commented Mar 25, 2017 • edited Loading

nullchinchilla commented Mar 25, 2017 • edited Loading

bradfitz commented Mar 25, 2017

docmerlin commented Mar 25, 2017

egonelbre commented Mar 25, 2017

nullchinchilla commented Mar 25, 2017

Merovius commented Mar 25, 2017

faiface commented Mar 25, 2017

bradfitz commented Mar 25, 2017

faiface commented Mar 25, 2017

bradfitz commented Mar 25, 2017

nullchinchilla commented Mar 25, 2017

nullchinchilla commented Mar 25, 2017

bradfitz commented Mar 28, 2017

rsc commented Mar 29, 2017

nullchinchilla commented Mar 29, 2017 • edited Loading

rsc commented Mar 29, 2017

faiface commented Mar 29, 2017

nullchinchilla commented Mar 29, 2017

ianlancetaylor commented Mar 29, 2017

nullchinchilla commented Mar 29, 2017 • edited Loading

ianlancetaylor commented Mar 29, 2017

nullchinchilla commented Mar 30, 2017 • edited Loading

rsc commented Mar 30, 2017

nullchinchilla commented Mar 31, 2017 • edited Loading

nullchinchilla commented Mar 31, 2017 • edited Loading

ianlancetaylor commented Apr 1, 2017

nullchinchilla commented Apr 1, 2017 • edited Loading

faiface commented Mar 24, 2017 •

edited

Loading

faiface commented Mar 24, 2017 •

edited

Loading

nullchinchilla commented Mar 24, 2017 •

edited

Loading

nullchinchilla commented Mar 24, 2017 •

edited

Loading

nullchinchilla commented Mar 25, 2017 •

edited

Loading

nullchinchilla commented Mar 25, 2017 •

edited

Loading

nullchinchilla commented Mar 29, 2017 •

edited

Loading

nullchinchilla commented Mar 29, 2017 •

edited

Loading

nullchinchilla commented Mar 30, 2017 •

edited

Loading

nullchinchilla commented Mar 31, 2017 •

edited

Loading

nullchinchilla commented Mar 31, 2017 •

edited

Loading

nullchinchilla commented Apr 1, 2017 •

edited

Loading