runtime: defer is slow #14939

Closed
minux opened this issue Mar 24, 2016 · 42 comments
Labels: early-in-cycle (A change that should be done early in the 3 month dev cycle), FrozenDueToAge, Performance

@minux (Member) commented Mar 24, 2016

package p

import "testing"

//go:noinline
func defers() (r int) {
        defer func() {
                r = 42
        }()
        return 0
}

func BenchmarkDefer(b *testing.B) {
        for i := 0; i < b.N; i++ {
                defers()
        }
}

On my system, BenchmarkDefer takes 77.7 ns/op. This issue arises from the investigation of #9704: if I remove the "defer endcgo(mp)" and place the call at the end of func cgocall, the benchmark in #9704 improves from 144 ns/op to 63.7 ns/op. (Note: we can't actually eliminate the defer in func cgocall, as that would break defer/recover in the Go->C->Go scenario.)
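
To make the trade-off concrete, here is a minimal sketch with stand-in functions (illustrative code, not the real runtime internals):

package p

// callWithDefer models what cgocall must do: the deferred call still
// runs if body panics (e.g. a Go callback panicking under C), which
// keeps defer/recover working across Go->C->Go.
func callWithDefer(endcgo func(), body func()) {
        defer endcgo()
        body()
}

// callDirect models the faster variant measured above: no defer
// overhead on the normal path, but endcgo is skipped if body panics.
func callDirect(endcgo func(), body func()) {
        body()
        endcgo()
}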

@minux added this to the Go1.7 milestone Mar 24, 2016
@minux (Member, Author) commented Mar 24, 2016

One way to improve defer for such simple cases is to have the compiler allocate the _defer record on the stack. But since a function can defer an unbounded number of functions, I'm not sure whether maintaining two different _defer allocation mechanisms is acceptable.

@cespare (Contributor) commented Mar 24, 2016

Some prior discussion at #6980.

@martisch (Contributor) commented

I also had a CL recently (CL 20512) where a single defer would have been the preferred solution, but it was not usable because of a ~65ns performance hit.

So improving simple cases like a single defer in some branch (one that might call no other functions, or only a select few) would have helped there.

name          old time/op  new time/op  delta
SprintfInt-2   137ns ± 7%   202ns ± 6%  +47.72%  (p=0.000 n=20+20)

@aclements (Member) commented

Given the current cost of defers, I think it's acceptable to have two defer allocation mechanisms if that addresses the problem. And I actually don't think the runtime side of this is very complicated.

This is on the list for 1.8. I'm planning to either do it myself or get someone else to do it. :)

If it turns out we need to simplify things, we could limit this to defers with at most one (possibly implicit) argument, which would handle the cgo case as well as the mutex unlock case. Another possible simplification would be to only stack-allocate defers in functions where all defers can be stack allocated, which would probably simplify creating an efficient prologue.

@aclements self-assigned this Jul 1, 2016
@ianlancetaylor (Contributor) commented

Separate from stack allocation, we should also consider special-casing defers with no arguments, as that is a fairly common case (about half of the defer calls in the standard library). Because the no-argument case doesn't have to worry about the arguments on the stack, it can use a simpler version of deferproc, one that doesn't need to call systemstack.
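
For illustration, the two flavors look like this (hypothetical functions):

package p

func cleanup()      {}
func process(x int) {}

func f(x int) {
        // No arguments: about half the defers in the standard library
        // look like this, and could use a simpler deferproc that skips
        // copying argument bytes (and the systemstack call).
        defer cleanup()

        // With an argument: x must be evaluated here and saved until
        // the deferred call runs, so the general path is needed.
        defer process(x)
}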

redbaron added a commit to redbaron/prometheus that referenced this issue Sep 19, 2016
@bradfitz (Contributor) commented

@bmizerany sent https://golang.org/cl/29379 to replace a defer with explicit mutex unlocks in x/time/rate. Consider reverting that optimization if this more general optimization goes in.

@aclements (Member) commented

/cc @dr2chase, who I've been talking with about compiler changes to support this.

@aclements (Member) commented

@dr2chase and I have been discussing an alternate approach to this that's somewhat less general than stack-allocated defers but should have essentially zero overhead versus a function call when it applies.

The idea is to take a page out of the C++ exception handling book, open code the defer execution, and use a PC value table to figure out where the defer code is when handling a panic.

Specifically, if the set of defers can be statically determined at every PC in a function, then the compiler would turn those defers into closures built on the stack and generate code at every exit point to directly call the appropriate defer closures for that exit point. In the common case, then, there would be no deferreturn logic at all and the defer execution would look essentially like hand-expanding the defer (like what CL 29379 did).
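
As a concrete picture, here is a hand-expansion sketch (illustrative code, not actual compiler output) of what the generated code would amount to:

package p

import "sync"

var mu sync.Mutex

// What the programmer writes:
func getDeferred(m map[string]int, k string) int {
        mu.Lock()
        defer mu.Unlock()
        return m[k]
}

// Roughly what the compiler would emit under this proposal: the defer
// becomes a direct call at the exit point, so the normal path never
// touches deferproc/deferreturn; the PC value table is consulted only
// during a panic.
func getOpenCoded(m map[string]int, k string) int {
        mu.Lock()
        v := m[k]
        mu.Unlock() // the defer, executed inline at function exit
        return v
}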

To keep this working with panics, the compiler would generate a PC value table for every function where this optimization applies that logically records, for every PC, where in the stack frame to find the set of defer closures to run for that PC. The actual encoding of this table could be quite compact, since large runs of PCs will have the same set of defer closures, and we could encode the tree-like structure of the set of defers to run directly in the table, so each entry would contain at most one defer closure offset plus the PC to use to look up the next defer closure offset.

When panic walks the stack, it would keep an eye on both this PC value table and the defer stack. A given frame could have either defers on the stack or a defer offset PC value table, but not both. If a frame has a defer offset PC value table, panic would use the table to find the defer closures and call them.

beorn7 pushed a commit to prometheus/prometheus that referenced this issue Sep 22, 2016
This is related to golang/go#14939. It's probably the only occurrence where it matters.
@minux (Member, Author) commented Sep 22, 2016 via email

@ianlancetaylor (Contributor) commented

I don't quite see how your proposal would correctly handle the case of a panic that occurs while executing a deferred function. At that point some of the deferred functions have been executed and some have not, but what is the PC value you will use to determine which remain to be executed?

@ianlancetaylor (Contributor) commented

A similar approach that might work would be to record a linked list of stack allocated deferred closures in the stack frame. You could even choose to always start the list from the same place in the stack frame, just below the return address/frame pointer, so you would only need a single bit in the traceback information. Then each time you defer a function, you update the linked list to point to the new stack allocated closure. At the end of the function, you remove each closure from the list as you execute it.

@dr2chase (Contributor) commented Sep 22, 2016

@minux - no. If there's any point in the generated code (which is not the same as the input program) that two different defer sets can reach, then the PC-range technique won't work. Unless, say, we record which defers need running by storing into a bitmask of defers-to-(not-)run; this would also handle Ian's problem. Store a zero byte on entry, and store a different value on the branches to the common area that don't enable all the defers. The PC range has to mention the location of the which-defers-not-to-run byte, of course.
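
A sketch of that bitmask scheme (illustrative code, not compiler output):

package p

// Each defer site that is reached sets a bit in a stack-resident byte;
// the exit code (and the panic machinery, via the PC-range table)
// consults the byte to decide which deferred calls to make, in reverse
// order.
func f(a, b bool, close1, close2, body func()) {
        var deferBits uint8 // zeroed on entry, always on the stack

        if a {
                deferBits |= 1 << 0 // reached: defer close1()
        }
        if b {
                deferBits |= 1 << 1 // reached: defer close2()
        }
        body()

        // Common exit reachable from several defer sets: check the bits.
        if deferBits&(1<<1) != 0 {
                close2()
        }
        if deferBits&(1<<0) != 0 {
                close1()
        }
}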

@aclements (Member) commented

> I don't quite see how your proposal would correctly handle the case of a panic that occurs while executing a deferred function. At that point some of the deferred functions have been executed and some have not, but what is the PC value you will use to determine which remain to be executed?

That's a good question. If you're simply in the function epilogue executing local defers and one of them panics, then the PC in the epilogue tells you what's left. However, if you're running defers because of a panic and one of them panics, then you're right that things get more complicated. One possibility would be that when a panic happens, before running any defers, you walk the stack to find any frames that have PC value-based defers and weave those into the linked list of defers at the right points. Then panic just works from the linked list. If a second panic happens while walking this linked list, we know that we're already handling a panic, so it wouldn't rebuild the list, it would just keep going from the current list.

@aclements (Member) commented

> A similar approach that might work would be to record a linked list of stack allocated deferred closures in the stack frame.

Possibly. I was hoping to avoid the overhead of even dealing with a linked list in the regular return case. If I understand your proposal, it seems like the cost at function exit wouldn't be substantially different from the cost of deferreturn today, which alone is about 36% of the time in @minux's benchmark.

@ianlancetaylor (Contributor) commented Sep 22, 2016

I am imagining that given

defer f()
...
defer g()

the defer statements would be compiled as

deferredFunctions = &deferredFunction{next: deferredFunctions, run: f}
...
deferredFunctions = &deferredFunction{next: deferredFunctions, run: g}

and the function exit sequence would be compiled as

deferredFunctions = deferredFunctions.next
g()
deferredFunctions = deferredFunctions.next
f()

So I don't think the performance would be like that of deferreturn. The compiler would of course have to ensure that deferredFunctions was stored on the stack and that it was always accurate before any function call. And this approach would take extra execution space as there would in effect be two copies of each deferred function--I suspect that is true of any efficient implementation.
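
Put together as runnable code, the scheme looks roughly like this (illustrative types; a real implementation would be compiler-generated):

package p

// The head pointer would live at a fixed frame offset, just below the
// return address/frame pointer, so traceback needs only one bit of
// information per function.
type deferredFunction struct {
        next *deferredFunction
        run  func()
}

func example(f, g func()) {
        var deferredFunctions *deferredFunction

        // defer f()
        df := deferredFunction{next: deferredFunctions, run: f}
        deferredFunctions = &df

        // ... function body ...

        // defer g()
        dg := deferredFunction{next: deferredFunctions, run: g}
        deferredFunctions = &dg

        // Exit sequence: unlink each record before running it, so the
        // list stays accurate if a deferred call panics.
        deferredFunctions = deferredFunctions.next
        g()
        deferredFunctions = deferredFunctions.next
        f()
}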

@aclements (Member) commented

I see. I thought you were saying the function return would use the linked list (rather than just unwinding it for the benefit of a panic). I imagine that would perform well.

@gopherbot commented

Change https://golang.org/cl/190098 mentions this issue: cmd/compile, cmd/link, runtime: make defers low-cost through inline code and extra funcdata

@aclements modified the milestones: Unplanned, Go1.14 Oct 16, 2019
gopherbot pushed a commit that referenced this issue Oct 16, 2019
cmd/compile, cmd/link, runtime: make defers low-cost through inline code and extra funcdata

Generate inline code at defer time to save the args of defer calls to unique
(autotmp) stack slots, and generate inline code at exit time to check which defer
calls were made and make the associated function/method/interface calls. We
remember that a particular defer statement was reached by storing in the deferBits
variable (always stored on the stack). At exit time, we check the bits of the
deferBits variable to determine which defer function calls to make (in reverse
order). These low-cost defers are only used for functions where no defers
appear in loops. In addition, we don't do these low-cost defers if there are too
many defer statements or too many exits in a function (to limit code increase).

When a function uses open-coded defers, we produce extra
FUNCDATA_OpenCodedDeferInfo information that specifies the number of defers, and
for each defer, the stack slots where the closure and associated args have been
stored. The funcdata also includes the location of the deferBits variable.
Therefore, for panics, we can use this funcdata to determine exactly which defers
are active, and call the appropriate functions/methods/closures with the correct
arguments for each active defer.

In order to unwind the stack correctly after a recover(), we need to add an extra
code segment to functions with open-coded defers that simply calls deferreturn()
and returns. This segment is not reachable by the normal function, but is returned
to by the runtime during recovery. We set the liveness information of this
deferreturn() to be the same as the liveness at the first function call during the
last defer exit code (so all return values and all stack slots needed by the defer
calls will be live).

I needed to increase the stackguard constant from 880 to 896, because of a small
amount of new code in deferreturn().

The -N flag disables open-coded defers. '-d defer' prints out the kind of defer
being used at each defer statement (heap-allocated, stack-allocated, or
open-coded).

Cost of defer statement  [ go test -run NONE -bench BenchmarkDefer$ runtime ]
  With normal (stack-allocated) defers only:         35.4  ns/op
  With open-coded defers:                             5.6  ns/op
  Cost of function call alone (remove defer keyword): 4.4  ns/op

Text size increase (including funcdata) for go cmd without/with open-coded defers:  0.09%

The average size increase (including funcdata) for only the functions that use
open-coded defers is 1.1%.

The cost of a panic followed by a recover got noticeably slower, since panic
processing now requires a scan of the stack for open-coded defer frames. This scan
is required, even if no frames are using open-coded defers:

Cost of panic and recover [ go test -run NONE -bench BenchmarkPanicRecover runtime ]
  Without open-coded defers:        62.0 ns/op
  With open-coded defers:           255  ns/op

A CGO Go-to-C-to-Go benchmark got noticeably faster because of open-coded defers:

CGO Go-to-C-to-Go benchmark [cd misc/cgo/test; go test -run NONE -bench BenchmarkCGoCallback ]
  Without open-coded defers:        443 ns/op
  With open-coded defers:           347 ns/op

Updates #14939 (defer performance)
Updates #34481 (design doc)

Change-Id: I51a389860b9676cfa1b84722f5fb84d3c4ee9e28
Reviewed-on: https://go-review.googlesource.com/c/go/+/190098
Reviewed-by: Austin Clements <austin@google.com>
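For reference, the '-d defer' flag mentioned in the commit message can be passed through go build; the file names and exact output format below are illustrative:

$ go build -gcflags='-d=defer' ./...
./conn.go:123:2: open-coded defer
./write.go:57:3: stack-allocated defer
./loop.go:42:3: heap-allocated defer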
@danscales (Contributor) commented

> If we had more data on this, perhaps we could focus on those cases.
>
> I'm working on a high-throughput HTTPS server which shows the cost of defer in crypto/tls and internal/poll fairly clearly in its CPU profiles. The following five defer sites account for several percent of the application's CPU usage, split fairly evenly between each.
>
> These particular defers are in place unconditionally, so a minimal PC-range-based compiler change as @dr2chase and @aclements discussed in September 2016 might be enough.

@rhysh Go 1.13 already improved defer performance by allocating defer records on the stack. And we have just checked into the main tree (for release in Go 1.14) a bigger change, https://golang.org/cl/190098, which makes defer calls directly (inline) at normal exits. This should reduce the overhead even more significantly in many cases.

So, if you are still seeing defer overheads in your server, it would be great to see whether they have gone down with Go 1.13 (if you haven't already upgraded) or with the changes in the main tree (or with the beta release of Go 1.14 in early November).

@gopherbot commented

Change https://golang.org/cl/202340 mentions this issue: cmd/compile, cmd/link, runtime: make defers low-cost through inline code and extra funcdata

gopherbot pushed a commit that referenced this issue Oct 24, 2019
cmd/compile, cmd/link, runtime: make defers low-cost through inline code and extra funcdata

(Re-land of CL 190098; the commit message is identical to the one quoted above.)

Updates #14939 (defer performance)
Updates #34481 (design doc)

Change-Id: I63b1a60d1ebf28126f55ee9fd7ecffe9cb23d1ff
Reviewed-on: https://go-review.googlesource.com/c/go/+/202340
Reviewed-by: Austin Clements <austin@google.com>
@rhysh (Contributor) commented Oct 25, 2019

@danscales , the results for https://golang.org/cl/202340 look great: it eliminates about 80% of defer's direct CPU cost in the application I described. A small change to crypto/tls would fix the remaining case.

For go1.12.12, go1.13.3, and be64a19, I counted profile samples that have crypto/tls.(*Conn).Write on the stack (with -focus=tls...Conn..Write) and profile samples that additionally have defer-related functions on the stack (saving the prior results and adding on -focus='^runtime\.(deferproc|deferreturn|jmpdefer)$'). With go1.12.12, about 2.5% of time in crypto/tls.(*Conn).Write is spent on defer. With go1.13.3, it's about 1.5%. With the development branch at be64a19, it's down to 0.5%.

Zooming out to the application's total CPU spend on defer-related functions, more than 90% of the samples are caused by a single use of defer "in" a loop in crypto/tls.(*Conn).Write. If that call were outside of the loop—or if the compiler could prove that it's effectively outside the loop already—then CL 202340 would all but eliminate the CPU cost of defer for the application (down to roughly 0.01%).

Thank you!

@gopherbot commented

Change https://golang.org/cl/203481 mentions this issue: crypto/tls: move a defer out of a loop

@rasky (Member) commented Oct 26, 2019

How’s that defer “in a loop”? It does look like it from a source-code point of view, but in SSA it is in a block which is no more “part of the loop” than the first block after the loop in source-code order. It sounds like the loop detection made by the defer optimization is slightly off.

@josharian (Contributor) commented

@rasky the loop detection occurs during escape analysis, not SSA.

gopherbot pushed a commit that referenced this issue Oct 26, 2019
crypto/tls: move a defer out of a loop

Rhys Hiltner noted in #14939 that this defer was syntactically inside
a loop, but was only ever executed once. Now that defer in a loop is
significantly slower, pull this one out.

name                                    old time/op   new time/op   delta
Throughput/MaxPacket/1MB/TLSv12-8        3.94ms ± 8%   3.93ms ±13%    ~     (p=0.967 n=15+15)
Throughput/MaxPacket/1MB/TLSv13-8        4.33ms ± 3%   4.51ms ± 7%  +4.00%  (p=0.000 n=14+14)
Throughput/MaxPacket/2MB/TLSv12-8        6.80ms ± 6%   7.01ms ± 4%  +3.15%  (p=0.000 n=14+14)
Throughput/MaxPacket/2MB/TLSv13-8        6.96ms ± 5%   6.80ms ± 5%  -2.43%  (p=0.006 n=15+14)
Throughput/MaxPacket/4MB/TLSv12-8        12.0ms ± 3%   11.7ms ± 2%  -2.88%  (p=0.000 n=15+13)
Throughput/MaxPacket/4MB/TLSv13-8        12.1ms ± 3%   11.7ms ± 2%  -3.54%  (p=0.000 n=13+13)
Throughput/MaxPacket/8MB/TLSv12-8        22.2ms ± 3%   21.6ms ± 3%  -2.97%  (p=0.000 n=15+15)
Throughput/MaxPacket/8MB/TLSv13-8        22.5ms ± 5%   22.0ms ± 3%  -2.34%  (p=0.004 n=15+15)
Throughput/MaxPacket/16MB/TLSv12-8       42.4ms ± 3%   41.3ms ± 3%  -2.49%  (p=0.001 n=15+15)
Throughput/MaxPacket/16MB/TLSv13-8       43.4ms ± 5%   42.3ms ± 3%  -2.33%  (p=0.006 n=15+14)
Throughput/MaxPacket/32MB/TLSv12-8       83.1ms ± 4%   80.6ms ± 3%  -2.98%  (p=0.000 n=15+15)
Throughput/MaxPacket/32MB/TLSv13-8       85.2ms ± 8%   82.6ms ± 4%  -3.02%  (p=0.005 n=15+15)
Throughput/MaxPacket/64MB/TLSv12-8        167ms ± 7%    158ms ± 2%  -5.21%  (p=0.000 n=15+15)
Throughput/MaxPacket/64MB/TLSv13-8        170ms ± 4%    162ms ± 3%  -4.83%  (p=0.000 n=15+15)
Throughput/DynamicPacket/1MB/TLSv12-8    4.13ms ± 7%   4.00ms ± 8%    ~     (p=0.061 n=15+15)
Throughput/DynamicPacket/1MB/TLSv13-8    4.72ms ± 6%   4.64ms ± 7%    ~     (p=0.377 n=14+15)
Throughput/DynamicPacket/2MB/TLSv12-8    7.29ms ± 7%   7.09ms ± 7%    ~     (p=0.070 n=15+14)
Throughput/DynamicPacket/2MB/TLSv13-8    7.18ms ± 5%   6.59ms ± 4%  -8.34%  (p=0.000 n=15+15)
Throughput/DynamicPacket/4MB/TLSv12-8    12.3ms ± 3%   11.9ms ± 4%  -3.31%  (p=0.000 n=15+14)
Throughput/DynamicPacket/4MB/TLSv13-8    12.2ms ± 4%   12.0ms ± 4%  -1.91%  (p=0.019 n=15+15)
Throughput/DynamicPacket/8MB/TLSv12-8    22.4ms ± 3%   21.9ms ± 3%  -2.18%  (p=0.000 n=15+15)
Throughput/DynamicPacket/8MB/TLSv13-8    22.7ms ± 3%   22.2ms ± 3%  -2.35%  (p=0.000 n=15+15)
Throughput/DynamicPacket/16MB/TLSv12-8   42.3ms ± 3%   42.1ms ± 3%    ~     (p=0.505 n=14+15)
Throughput/DynamicPacket/16MB/TLSv13-8   42.7ms ± 3%   43.3ms ± 7%    ~     (p=0.123 n=15+14)
Throughput/DynamicPacket/32MB/TLSv12-8   82.8ms ± 3%   81.9ms ± 3%    ~     (p=0.112 n=14+15)
Throughput/DynamicPacket/32MB/TLSv13-8   84.6ms ± 6%   83.9ms ± 4%    ~     (p=0.624 n=15+15)
Throughput/DynamicPacket/64MB/TLSv12-8    166ms ± 4%    163ms ± 6%    ~     (p=0.081 n=15+15)
Throughput/DynamicPacket/64MB/TLSv13-8    165ms ± 3%    168ms ± 3%  +1.56%  (p=0.029 n=15+15)

Change-Id: I22409b05afe761b8ed1912b15c67fc03f88d3d1f
Reviewed-on: https://go-review.googlesource.com/c/go/+/203481
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
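A simplified sketch of the hoisting pattern that commit applied (illustrative code, not the actual crypto/tls change):

package p

// Before: the defer sits syntactically inside the loop, so it is
// treated as a defer in a loop and falls back to the slow path, even
// though it can only ever run once.
func writeBefore(chunks [][]byte, send func([]byte) error, done func()) error {
        for i, c := range chunks {
                if i == 0 {
                        defer done()
                }
                if err := send(c); err != nil {
                        return err
                }
        }
        return nil
}

// After: hoisted above the loop with the same run-at-most-once
// semantics, so the function qualifies for open-coded defers.
func writeAfter(chunks [][]byte, send func([]byte) error, done func()) error {
        if len(chunks) > 0 {
                defer done()
        }
        for _, c := range chunks {
                if err := send(c); err != nil {
                        return err
                }
        }
        return nil
}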
@rasky (Member) commented Oct 28, 2019

Can we close this issue? It doesn't seem there's much left. Defers in a loop are always going to be somewhat slower.

@danscales (Contributor) commented

> @danscales , the results for https://golang.org/cl/202340 look great: it eliminates about 80% of defer's direct CPU cost in the application I described. A small change to crypto/tls would fix the remaining case.
>
> […]
>
> Thank you!

@rhysh Glad to hear the results you measured! And, as a number of folks have already pointed out, there are further optimizations we can do to tighten up the inline defer code, eliminate some of the checks, etc.

@navytux (Contributor) commented Oct 30, 2019

@danscales, thanks a lot for fixing defer slowness!

@ianlancetaylor (Contributor) commented

I think this is done, so I am going to close it. We can always make things faster. If there are specific ideas for making defers faster, let's do them in separate issues.

Thanks to all, especially @danscales.

flisky pushed a commit to flisky/bigcache that referenced this issue May 7, 2020
There is a known issue in Go with deferred operations being much slower than explicit operations, both directly and indirectly impacting locking. Generally, on OS X and Windows, the change gives a 10-15% performance gain.

Relates: golang/go#14939
@golang locked and limited conversation to collaborators Dec 4, 2020