
proposal: x/sync: once functions that can be retried on error #22098

Closed
skabbes opened this issue Oct 1, 2017 · 17 comments


skabbes commented Oct 1, 2017

sync.Once provides a great tool for initializing singletons, or performing initialization lazily. However, it is completely unusable for initializing resources whose initialization can fail (and where that failure is temporary).

I am proposing a type and/or method to be added to the sync package that can handle initialization functions that can fail, by simply retrying them later - the next time they are needed.

// TryDo calls `f` successively (serialized) until `f` first returns without error.
// If `f` panics, the initialization is treated as if an error occurred,
// and `f` will be run again on the next invocation of `TryDo`.
func (x *XX) TryDo(f func() error) error {
    // implementation details
    return nil
}
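For concreteness, here is a minimal sketch of one possible implementation (my own illustration, not part of the proposal); it omits the panic handling and the atomic fast path mentioned in the Workaround section below, and assumes only the sync package:

type XX struct {
    mu   sync.Mutex
    done bool
}

func (x *XX) TryDo(f func() error) error {
    x.mu.Lock()
    defer x.mu.Unlock()
    if x.done {
        return nil // a previous call already succeeded
    }
    if err := f(); err != nil {
        return err // leave done unset so the next caller retries
    }
    x.done = true
    return nil
}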

Simple Use cases

  1. An http.Handler which lazily reads a template over the network once
var once sync.XX
var tmpl []byte
handler := func(w http.ResponseWriter, r *http.Request) {
    err := once.TryDo(func() error {
        var err error
        tmpl, err = readTemplateFromFS()
        return err
    })

    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    // do something with tmpl
}
  2. Dynamically generating a lookup table on demand the first time it is needed
  3. A cache library which only allocates "room" on the storage server (storage namespace / table / etc) the first time it is used.

Workaround

  1. Instead of returning an error, the once function infinitely retries with a backoff policy. This lets an erroring function fit into sync.Once, but if retries can be exhausted it is not a suitable workaround.
  2. Use sync.Once.Do, but on failure swap the sync.Once for a fresh one so the next caller retries (see the sketch below).
  3. Write your own TryDo, requiring a lock, an atomic variable (to keep overhead low), and a clever deferred function to deal with panics. A hand-rolled version will likely not handle panics or include the atomic fast path.
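For illustration, a rough sketch of workaround 2 (hypothetical code, not from the proposal): a mutex guards the Once, and a failed attempt swaps in a fresh one so a later caller can retry.

type swapOnce struct {
    mu   sync.Mutex
    once *sync.Once
}

func (s *swapOnce) TryDo(f func() error) error {
    s.mu.Lock()
    defer s.mu.Unlock()
    if s.once == nil {
        s.once = new(sync.Once)
    }
    var err error
    s.once.Do(func() { err = f() })
    if err != nil {
        s.once = new(sync.Once) // swap in a fresh Once so the next caller retries
    }
    return err
}

Note that the mutex already serializes everything, which is exactly why this feels like a workaround rather than a clean use of sync.Once.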

Advanced use case
You are writing a library API which requires lazy initialization of some expensive reusable resource. The API looks like this.

// Search will search for matches of `q` inside book. If `ctx` is canceled, all
// search operations with Search will return with `ErrCanceled`
func (b *Book) Search(ctx context.Context, query string) ([]SearchResult, error)

However, to perform a search the first time, we would like to lazily build a search index of that book, an expensive operation which can fail. You would like this index build operation to back off and retry up until the limits of the context passed in.

func (b *Book) Search(ctx context.Context, query string) ([]SearchResult, error) {
    err := b.once.TryDo(func() error {
        // building the search index respects the deadline of ctx; an error is returned if the deadline is exceeded
        return retryTilDeadlineOrSuccess(ctx, func() error {
            return b.buildSearchIndex(ctx)
        })
    })
    if err != nil {
        return nil, err
    }
    // ... snip ...
}

As a final comment, TryDo is more generic than Do: you can trivially build Do with TryDo, but not vice versa (a sketch follows). I would like to solicit feedback on this API addition, and if it is agreed on I can write a design doc or submit the proposed code to implement it.
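For example, under the proposed API, Do could be expressed in terms of TryDo roughly like this (illustrative sketch):

func (x *XX) Do(f func()) {
    x.TryDo(func() error {
        f() // never fails, so it runs exactly once
        return nil
    })
}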

ianlancetaylor changed the title from "sync: x/sync: proposal: once functions that can be retried on error" to "proposal: x/sync: once functions that can be retried on error" Oct 1, 2017
gopherbot added this to the Proposal milestone Oct 1, 2017

rsc commented Oct 2, 2017

If retry has to happen, why can't it happen inside the function that once.Do calls?
Or maybe you are looking for golang.org/x/sync/singleflight?

/cc @bcmills @bradfitz @dvyukov


skabbes commented Oct 2, 2017

It could, if infinite retry were desirable. However, if you want to use once.Do inside a function that needs to respect a context, you need a way to jump out of the once.Do. I believe this is exactly why once.DoChan exists, so you can select against ctx.Done().

But if the caller's context has been cancelled, there is no reason to really keep trying the initialization. Not to mention, since this initialization is implied to be expensive, you don't want to be doing it again until you need it next.

Essentially, I want this behavior to be easy to build with once.Do:

  1. Try to initialize a resource with once.Do up to the limits of a context
  2. Once that context has expired, give up the initialization (and fail the caller)
  3. The next invocation of Do should retry the initialization
  4. Repeat until success

re: singleflight - I suppose if there were a singleflight cache or something, so that subsequent (non-concurrent) calls didn't redo the initialization, that would work. It would feel odd, though, to use a map with a single key. I think at that point I'd just write this TryDo as library code.
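For illustration, a single-key singleflight approximation might look roughly like this (my own sketch, assuming golang.org/x/sync/singleflight and sync/atomic); it ends up re-implementing TryDo anyway:

type onceFlight struct {
    group singleflight.Group
    done  uint32 // set to 1 only after a successful run
}

func (o *onceFlight) TryDo(f func() error) error {
    if atomic.LoadUint32(&o.done) == 1 {
        return nil
    }
    // concurrent callers share one in-flight attempt; non-concurrent callers
    // after a failure simply run f again
    _, err, _ := o.group.Do("init", func() (interface{}, error) {
        if err := f(); err != nil {
            return nil, err
        }
        atomic.StoreUint32(&o.done, 1)
        return nil, nil
    })
    return err
}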


skabbes commented Oct 2, 2017

Oh, there is no once.DoChan; I was mistaken, that is in singleflight. I suppose that just strengthens my point.


robpike commented Oct 3, 2017

The semantics of sync.Once are very easy to understand, to use, and to trust. Adding retries is a great complication for all three. While not dismissing the problem you have, I believe it should be solved elsewhere, most likely in an external package, and not in the core library.


skabbes commented Oct 3, 2017

Fair enough, thank you, that is the sort of decision I was looking for.

@bradfitz is this possibly something x/sync would be interested in? It feels pretty generally useful to me, and I'd hate to release it under my name to get 1 star and 0 downloads.


dvyukov commented Oct 3, 2017

If initialization is expensive and/or puts load on external systems and is currently failing, you probably don't want to retry it at the maximum possible frequency. This means that the primitive also needs lastInitTime and lastError, and needs to be configurable with a backoff policy. The search index initialization probably needs to be done in a separate goroutine and started ahead of time, because if it does not fit into a request's time slot it will never finish.
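As an illustration of that point (my own sketch, not a concrete proposal), such a primitive might carry the last attempt time and last error, and refuse to re-run the function until a backoff interval has elapsed:

type backoffOnce struct {
    mu      sync.Mutex
    done    bool
    lastTry time.Time
    lastErr error
    backoff time.Duration // minimum interval between attempts
}

func (o *backoffOnce) TryDo(f func() error) error {
    o.mu.Lock()
    defer o.mu.Unlock()
    if o.done {
        return nil
    }
    if !o.lastTry.IsZero() && time.Since(o.lastTry) < o.backoff {
        return o.lastErr // too soon to retry; report the previous failure
    }
    o.lastTry = time.Now()
    if err := f(); err != nil {
        o.lastErr = err
        return err
    }
    o.done = true
    return nil
}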


skabbes commented Oct 3, 2017

I agree. I was trying to provide a more specific example to take something as abstract as "once.Do needs to be able to respect context deadlines" and bring it back into reality a bit. I want to make clear that I am not asking about a particular problem that I have; rather, I simply noticed this limitation of sync.Once.

HTTP context deadlines are the most familiar form of cancelation to gophers, but you also have queue lease expiration, user-initiated cancelation, best-effort time-sensitive calculations (do this before time Y, otherwise I don't care), and so on. I was just trying to motivate the more general problem.


dvyukov commented Oct 3, 2017

I mean that if we consider this problem in sufficient generality, it does not boil down to TryDo(f func() error).


skabbes commented Oct 3, 2017

Yes, I believe it does. You would simply compose the once with whatever other behavior you desired, including exponential backoff and retry. The key feature of TryDo is that it can, in effect, reset itself so that its f can run again later.

That f can do anything: it can wait on an already launched goroutine (optimistic search index building), it can do exponential backoff and retry, or it can fail, and wait to attempt to reinitialize again later.

I don't imagine TryDo to be a complete solution, only to provide the "do this thing once, and if it fails, do it again later" behavior to compose with other behaviors (like backoff / retry / etc).

I might be missing something, do you have an example it might not cover?


dvyukov commented Oct 3, 2017

@skabbes fair enough


bcmills commented Oct 5, 2017

if the caller's context has been cancelled, there is no reason to really keep trying the initialization.

That's not obvious to me anyway. If the thing in question really is one-time initialization, then restarting it for each new call could be very expensive compared to just doing it once and getting it over with.

For example, consider what happens if you are connecting to a backend server, establishing the connection (for whatever reason) takes 300ms, actually using the connection takes 10ms, and each incoming call that wants to use the connection has an aggressive 30ms deadline.

With the behavior you're proposing, every operation fails in perpetuity.

With naive use of sync.Once and context.Background() (or a suitable alternative as described in #19643), all operations fail for the first 300ms or so, but then they all succeed.

since this initialization is implied to be expensive, you don't want to be doing it again until you need it next.

Don't confuse latency with throughput. Often, the action of a sync.Once has high latency but consumes very little of the program's throughput.
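For reference, the "naive sync.Once with context.Background()" pattern described above might look roughly like this (illustrative sketch; backendConn and dial are made-up names):

var (
    connOnce sync.Once
    conn     *backendConn // hypothetical connection type
    connErr  error
)

func getConn(ctx context.Context) (*backendConn, error) {
    ready := make(chan struct{})
    go func() {
        connOnce.Do(func() {
            // context.Background() detaches the one-time dial from any
            // individual caller's deadline
            conn, connErr = dial(context.Background())
        })
        close(ready)
    }()
    select {
    case <-ctx.Done():
        // this caller gives up, but the dial (if it is running) keeps going
        return nil, ctx.Err()
    case <-ready:
        return conn, connErr
    }
}

Early callers fail fast against their own deadlines, but once the 300ms dial completes, every subsequent call succeeds.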


bcmills commented Oct 5, 2017

  1. Instead of returning an error the once function would infinitely retry with a backoff policy. […] If retries are exhausted, this is not a suitable workaround.

What does it mean to “exhaust” retries, or when would such exhaustion be desirable? With appropriate backoff parameters, you generally do not need to impose any artificial limit on the number of attempts: you can just keep retrying until the system comes back up.


bcmills commented Oct 5, 2017

To use your earlier examples, I would expect to see something like:

var (
	once sync.Once
	ready = make(chan struct{})
)

var tmpl []byte
handler := func(w http.ResponseWriter, r *http.Request) {
	once.Do(func() {
		go func() {
			defer close(ready)
			var err error
			for attempt := int64(0); ; attempt++ {
				tmpl, err = readTemplateFromFS()
				if err == nil {
					return
				}
				backoff(attempt)
			}
		}()
	})

	select {
	case <-r.Context().Done():
		return
	case <-ready:
	}
	// do something with tmpl
}

// finishIndexing begins indexing the Book where the last finishIndexing call stopped.
func (b *Book) finishIndexing(ctx context.Context) error {
	if b.indexed() {
		return nil
	}

	select {
	case <-ctx.Done():
		return ctx.Err()
	case b.indexSem <- token{}:
	}
	defer func() { <-b.indexSem }()

	cursor := b.loadProgress()
	for !b.indexed() {
		…
		select {
		case <-ctx.Done():
			b.storeProgress(cursor)
			return ctx.Err()
		default:
		}
	}
	return nil
}

func (b *Book) Search(ctx context.Context, query string) ([]SearchResult, error) {
	if err := b.finishIndexing(ctx); err != nil {
		return nil, err
	}
	…
}


skabbes commented Oct 5, 2017

That's not obvious to me anyway. If the thing in question really is one-time initialization, then restarting it for each new call could be very expensive compared to just doing it once and getting it over with.

Yes, it could be. As you have already pointed out, sync.Once already covers those cases very well. Also, in your 300ms init example, yes, that would certainly be a case where you would not use this. I think we can both agree on this: if infinite retry is desirable, then sync.Once's current behavior is sufficient.

So... when is it not desirable?

Infinite retry?
Suppose you run an http server from a monolithic codebase with several features running. One new feature requires generating and storing a public/private key pair before any part of that feature can be used.

// MakeKeyPair generates a new pub/priv keypair and stores it in the database
func MakeKeyPair(userID string) (*KeyPair, error) {
	pair := expensiveCPUOp(2048)
	err := storeKeyPairTransactionally(userID, pair)
	return pair, err
}

You would like to 'once' this per user (assume all requests for the same user are sharded to the same box); however, your keypair database is down due to a configuration error. So do you:

  1. Infinitely retry MakeKeyPair entirely?
  2. Infinitely retry only the storeKeyPair part?

Option 1 will continue to rob CPU resources from the other features. Option 2 will rob memory resources from other features. Since it is using additional resources, this infinite retry could lead to cascading failures in other feature areas.
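(For illustration only, a sketch of what the per-user 'once' might look like with the proposed TryDo; a failed attempt leaves no goroutine running, and the next request for that user simply tries again. XX is the proposed type from above; userKeys and ensureKeyPair are made-up names.)

type userKeys struct {
    mu    sync.Mutex
    onces map[string]*XX // one TryDo-style once per user ID
}

func (u *userKeys) ensureKeyPair(userID string) error {
    u.mu.Lock()
    if u.onces == nil {
        u.onces = make(map[string]*XX)
    }
    once, ok := u.onces[userID]
    if !ok {
        once = new(XX)
        u.onces[userID] = once
    }
    u.mu.Unlock()

    return once.TryDo(func() error {
        _, err := MakeKeyPair(userID) // the generated pair would be stored elsewhere
        return err
    })
}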

[...] you can just keep retrying until the system comes back up.

Yes, you certainly can - but at the very least, that once.Do will be consuming a goroutine, some memory and possibly spinning CPU cycles - which depending on the context might be OK or might not be. The current design does not allow for graceful failure or backpressure to upstream systems.

Given that gophers want to 'once-initialize' things that can error, is our answer "infinitely retry"? I think not, because we don't give them the tools to do exponential backoff and retry with jitter (pretty much required for a correct implementation). I would rather the answer be something simple like "use TryDo" or "don't do that at all, it is a bad pattern".


bcmills commented Oct 5, 2017

2 will rob memory resources from other features. Since it is using additional resources, this infinite retry could lead to cascading failures in other feature areas.

In the situation you're describing, you're already consuming O(N) memory to store the N Once objects. If you are concerned about cascading failures, you need to limit N regardless, and the TryDo you're describing seems orthogonal to that. If you have an external limiting mechanism anyway, you can presumably use that same mechanism to cancel the further retries.


bcmills commented Oct 5, 2017

Given that gophers want to 'once-initialize' things that can error, is our answer "infinitely retry"? I think not, because we don't give them the tools to do exponential backoff and retry with jitter (pretty much required for a correct implementation). I would rather the answer be something simple like "use TryDo" or "don't do that at all, it is a bad pattern".

I see a lot of exponential backoff and retry libraries available.

In contrast, the “retry on the first call after failure” policy that you propose for TryDo does not perform exponential backoff: it seems much more prone to persistent overload and cascading failure.

“Use exponential backoff” seems like the simple answer.


skabbes commented Oct 5, 2017

I see a lot of exponential backoff and retry libraries available.

Yes, and many of them seem to return error... not really my main point, but to a new gopher, or someone doing this for the first time, it doesn't look like you can even use them because of the interface mismatch.

In the situation you're describing, you're already consuming O(N) memory to store the N Once objects.

No, I disagree - the Do implementation requires O(N) to be stored (since the goroutines will keep a reference to the *User object indefinitely). But the TryDo implementation "releases its grip", so to speak, on those objects; they are free to be garbage collected, allowing for a natural ebb and flow of these objects.

Let me rephrase my statement:

Given that gophers want to 'once-initialize' things that can error, correct usage of Do requires exponential backoff and retry. In contrast, when using TryDo, doing the natural/obvious thing (returning the error without any sort of backoff and retry) is correct, and, depending on the resource being initialized and the cost of doing so, it may or may not be subject to cascading failures.

I suppose we disagree on this, but if I wrote some library code that in some scenarios caused a reference cycle, or caused CPU or memory "ghosting" (due to infinite retry) even after the original request was served, my team would not accept it into our code base.

Many of your arguments seem to assume there is no such thing as a best-effort request, that every single software system has an infinite supply of load, or that in software development it is always wise to solve those problems first.

Regardless, I will close this and continue to use TryDo in a library, since compared to Do it:

  1. significantly simplifies the logic of initializing an err-able resource
  2. allows for best-effort semantics (while also providing thread-safe lazy initialization)
  3. is easy to reason about as not leaking goroutine, memory, or CPU resources.

skabbes closed this as completed Oct 5, 2017
golang locked and limited conversation to collaborators Oct 5, 2018