Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/sync/semaphore: acquire should fail if context is canceled or timed out #63615

Closed
szuecs opened this issue Oct 18, 2023 · 12 comments
Closed

x/sync/semaphore: acquire should fail if context is canceled or timed out #63615

szuecs opened this issue Oct 18, 2023 · 12 comments
Labels
FixPending Issues that have a fix which has not yet been reviewed or submitted. NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@szuecs
Copy link

szuecs commented Oct 18, 2023

What version of Go are you using (go version)?

Issue is also reproducibe in https://go.dev/play/p/CNcJihWeMMU

$ go version
go version go1.21.3 linux/amd64

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE='on'
GOARCH='amd64'
GOBIN='/home/sszuecs/go/bin'
GOCACHE='/home/sszuecs/.cache/go-build'
GOENV='/home/sszuecs/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/sszuecs/go/pkg/mod'
GONOPROXY='github.bus.zalan.do'
GONOSUMDB='github.bus.zalan.do'
GOOS='linux'
GOPATH='/home/sszuecs/go'
GOPRIVATE='github.bus.zalan.do'
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/share/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/share/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.21.3'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/home/sszuecs/tmp/go/semaphore/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build3306102333=/tmp/go-build -gno-r
ecord-gcc-switches'

What did you do?

Reproducible error:
https://go.dev/play/p/CNcJihWeMMU

What did you expect to see?

no error printed

What did you see instead?

2009/11/10 23:00:00 acquire expected to fail on done context
@gopherbot gopherbot added this to the Unreleased milestone Oct 18, 2023
@ianlancetaylor
Copy link
Contributor

CC @jba

@bcmills
Copy link
Contributor

bcmills commented Oct 18, 2023

In general a Go API that accepts a Context can either succeed or fail if the context is already canceled, as long as it doesn't block in the process.

That said, I agree that it would be reasonable for Acquire to provide the stronger invariant that if the Context is already done before the method is called, the error is always non-nil.

@szuecs
Copy link
Author

szuecs commented Oct 18, 2023

@bcmills I found it with @AlexanderYastrebov when I tried to use it with context.WithTimeout.

For the records, you can use https://go.dev/play/p/wgsWdy8mGGo to handle the timeout case.
For me it was a bit unexpected to test in the good case the timeout (/cancel) case.

@zephyrtronium
Copy link
Contributor

The documentation on Acquire states that this is allowed behavior. Does changing this need to go through proposal review?

@gopherbot
Copy link

Change https://go.dev/cl/536275 mentions this issue: semaphore: cancel acquisition with a done context

szuecs added a commit to zalando/skipper that referenced this issue Oct 19, 2023
Both filters did not check for canceled context from request before semaphore.Acquire, see golang/go#63615

Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
szuecs added a commit to zalando/skipper that referenced this issue Oct 19, 2023
Both filters did not check for canceled context from request before semaphore.Acquire, see golang/go#63615

Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
@zephyrtronium
Copy link
Contributor

To be clear, is the desired behavior (in terms of the language of the Go memory model) "if ctx closing happens before the call to Acquire returns, Acquire never acquires the semaphore," or is it sufficient to say, "if ctx closing happens before the call to Acquire begins, Acquire never acquires the semaphore?" The latter is easier and probably cheaper to implement.

@bcmills
Copy link
Contributor

bcmills commented Oct 19, 2023

@zephyrtronium, I think the only reasonable guarantee we can make is if ctx is already canceled at the start of the call. (If it is canceled partway through, then there isn't a happens-before relationship between acquiring the semaphore and canceling the context — canceling a context.WithCancel is not guaranteed to be synchronous.)

@zephyrtronium
Copy link
Contributor

Correct me if I'm wrong, but I think it is possible to construct a happens-before relation with (Acquire starts) <= (context canceled) <= (Acquire completes):

  1. Goroutines A and B share an empty semaphore of size 2, and A has the ability to cancel the context B will use to acquire the semaphore.
  2. A acquires weight 1, then signals to (or creates) B.
  3. B calls Acquire(ctx, 2).
  4. A continually calls TryAcquire(1) followed by Release(1) when it succeeds, so that A synchronizes with B's Acquire.
  5. A cancels B's context and then calls Release(1).
sem := semaphore.NewWeighted(2)
ctx, cancel := context.WithCancel(context.Background())
sem.TryAcquire(1)
go sem.Acquire(ctx, 2)
for sem.TryAcquire(1) {
    sem.Release(1)
}
cancel()
sem.Release(1)

With the current implementation or if we choose the closed-before-begins option (implemented with a minor adjustment to CL 536275's current patch set), the semaphore may have either 0 or 2 tokens held at the end of this sequence. I believe that having Acquire immediately release its acquisition if it finds both context canceled and a ready signal is sufficient to implement the closed-before-returns option and leave the semaphore always at 0 tokens held.

To me, the question is more about how much work will happen after the semaphore is acquired with a canceled context. Or more precisely, does Acquire really need to be the first point where the context being done causes the caller to quit its work? Maybe the situation that caused this issue for @szuecs in the first place would help answer that question.

szuecs added a commit to zalando/skipper that referenced this issue Oct 20, 2023
* feature: filter fifoWithBody that works similar to fifo(), but release deferred until body streaming to client was finished
* fix: fifo() and fifoWithBody() with canceled requests

Both filters did not check for canceled context from request before semaphore.Acquire, see golang/go#63615

Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
@cagedmantis cagedmantis added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Oct 20, 2023
@cagedmantis cagedmantis changed the title x/sync/semaphore: Acquire should fail if context is canceled or timed out x/sync/semaphore: acquire should fail if context is canceled or timed out Oct 20, 2023
@bcmills
Copy link
Contributor

bcmills commented Oct 23, 2023

@zephyrtronium, it is possible to construct a happens-before relationship, but I think you'll find that actually having such a relationship is extremely rare in practice. (Typically the goroutine that cancels the Context doesn't have any semaphone tokens at all, let alone enough to block a specific goroutine from acquiring the semaphore.)

@zephyrtronium
Copy link
Contributor

Sorry, I think we're talking past each other. I agree it is unusual to explicitly construct a happens-before relation between the context cancelling and Acquire returning in general. When there isn't one, my premise of "if ctx closing happens before" doesn't apply. The example I gave is contrived, but more practically, I can imagine a Weighted-backed task pool which allows a worker to cancel remaining tasks, along the lines of an errgroup. That could reasonably cause cancellation to happen before some blocked Acquire returns even if that isn't particularly the intent.

My question is whether it still resolves this issue to leave the result unspecified when the happens-before-returns relation does occur. Is it sufficient to have some point where Acquire will start to always fail, or should we make that point respect the synchronizes-before relation that Acquire itself enforces?

@szuecs
Copy link
Author

szuecs commented Oct 30, 2023

@zephyrtronium I am not sure, if you ave an open question that I should answer.
For me it makes sense to just test the context already canceled case, if the other case is too complicated to do. Then please make sure that the documentation clearly says about the timeout case, that you need to check the context if you use context.WithTimeout or context.WithDeadline, because it's too easy to write unexpected broken code.

More specific where we used the code:
Basically we have code in an http proxy that can run a filter pipeline. fifo() is a filter that limits concurrency and can timeout in semaphore.Weighted.Acquire. The context comes from an http.Request. You have no control for how long the request took time to you and users can specify arbitrary filter pipelines, that then take more or less time.
We use it in https://github.com/zalando/skipper/blob/cw2023freeze/scheduler/scheduler.go#L185 and we can drop https://github.com/zalando/skipper/blob/cw2023freeze/scheduler/scheduler.go#L158-L170 in case the context would be checked at start within Acquire(). The broken code you can see in https://github.com/zalando/skipper/blob/master/scheduler/scheduler.go#L171.
Of course there are some CPU cycles which would require also to check it later, but giving the amount of time that Acquire() cost, that's a very very unlikely case compared to the context is already canceled case.

szuecs added a commit to zalando/skipper that referenced this issue Nov 7, 2023
* feature: filter fifoWithBody that works similar to fifo(), but release deferred until body streaming to client was finished
* fix: fifo() and fifoWithBody() with canceled requests

Both filters did not check for canceled context from request before semaphore.Acquire, see golang/go#63615

Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
RomanZavodskikh pushed a commit to zalando/skipper that referenced this issue Nov 27, 2023
* feature: filter fifoWithBody that works similar to fifo(), but release deferred until body streaming to client was finished
* fix: fifo() and fifoWithBody() with canceled requests

Both filters did not check for canceled context from request before semaphore.Acquire, see golang/go#63615

Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
AlexanderYastrebov pushed a commit to zalando/skipper that referenced this issue Nov 27, 2023
* feature: filter fifoWithBody that works similar to fifo(), but release deferred until body streaming to client was finished
* fix: fifo() and fifoWithBody() with canceled requests

Both filters did not check for canceled context from request before semaphore.Acquire, see golang/go#63615

Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
AlexanderYastrebov pushed a commit to zalando/skipper that referenced this issue Nov 27, 2023
* feature: filter fifoWithBody that works similar to fifo(), but release deferred until body streaming to client was finished
* fix: fifo() and fifoWithBody() with canceled requests

Both filters did not check for canceled context from request before semaphore.Acquire, see golang/go#63615

Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>
@bcmills
Copy link
Contributor

bcmills commented Feb 23, 2024

The documentation on Acquire states that this is allowed behavior. Does changing this need to go through proposal review?

To address this question explicitly: I don't think this needs a proposal review because the new behavior still conforms with the existing documentation.

(We're updating the documentation to no longer note a behavior that is no longer possible, but it would also be technically correct to leave that note in place and just let Hyrum's Law lock in the stronger invariant over time. 😅)

@bcmills bcmills added NeedsFix The path to resolution is known, but the work has not been done. FixPending Issues that have a fix which has not yet been reviewed or submitted. labels Feb 23, 2024
@gopherbot gopherbot removed the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Feb 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
FixPending Issues that have a fix which has not yet been reviewed or submitted. NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

6 participants