
runtime: optimization to reduce P churn #32113

Open
amscanne opened this issue May 17, 2019 · 16 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@amscanne
Contributor

amscanne commented May 17, 2019

Background

The following is a fairly frequent pattern that appears in our code and others:

goroutine1:

```go
ch1 <- data    // (1)
result = <-ch2 // (2)
```

goroutine2:

```go
data = <-ch1   // (3)
// do work...
ch2 <- result  // (4)
```

The scheduler exhibits two different behaviors, depending on whether goroutine2 is busy and there are available Ps.

  • If goroutine2 is busy or there are no idle Ps, then the behavior is fine. The item will be enqueued in the channel, goroutine2 is marked as runnable if needed, and eventually goroutine1 will yield.
  • If goroutine2 is not busy and there are idle Ps, then the behavior is sub-optimal. The operation in (1) will mark goroutine2 as runnable and wake up some idle P via a relatively expensive system call [1]. Ultimately the wake will likely result in an IPI to wake an idle core, if there are any. The next P will be scheduled and a race between (2) and (3) ensues.

In the second case, if the woken P successfully steals the now-runnable goroutine2, i.e. (3) happens first, then goroutine2 starts executing on the new P. If the steal fails, i.e. (4) happens first and goroutine2 is run locally, then a large number of cycles have been wasted for nothing. Either way, the whole dance happens again with the result. In both cases, we spend a large number of cycles and interprocessor coordination costs on what should be a simple goroutine context switch.

There are further problems caused by this: it introduces unnecessary work stealing and bounces goroutines between system threads and cores, leading to locality inefficiencies.

Ideal schedule

With an oracle, the ideal schedule after (1) would be:

  • If goroutine2 is running or there are no idle Ps, enqueue only (current behavior).
  • If goroutine1 will not block or has other goroutines in its runqueue, wake idle Ps (current behavior).
  • If goroutine1 will block immediately, and there are no other goroutines in the P's local runqueue, do not wake up any other Ps. goroutine2 will be executed by the current P immediately after goroutine1 blocks.

In essence, we want to yield the goroutine1's time to goroutine2 in this case, or at least avoid all the wasted signaling overhead. To put it another way: if goroutine1's P will block, then it fills the role of the "idle P" far more efficiently.

Proposal

It may be possible to specifically optimize for this case in the compiler, just as certain loop patterns are optimized.

In the case where a blocking channel send is immediately followed by a blocking channel receive, I propose an optimization that tries to avoid these scheduler round trips.

Here's a rough sketch of the idea:

  • runqput returns a bool that indicates whether the newly placed G is the only item on the queue. (Alternatively we could just check the runq length below.)
  • goready takes an additional parameter "deferwake" which skips the wake operation if true. By default this will be false everywhere, which implements current behavior.
  • chansend accepts a similar "deferwake" parameter. This is plumbed through to send, and will be AND'ed with the result of runqput. The deferwake parameter will be passed as true if the compiler detects a blocking receive immediately following the blocking send statement (or possibly in the same block, see below).
  • chanrecv also accepts a "deferwake" parameter, which will be set to true only when preceded by a call to chansend with deferwake also set to true. If this is true AND the current goroutine will not yield as a result of the recv AND the current runqueue length > 0 AND there are idle Ps, then at this point we can call wakeup.

Rejected alternatives

I thought about this problem a few years ago when it caused issues. In the past, I considered the possibility of a different channel operator. Something like:

ch1 <~ data

This operator would write to the channel and immediately yield to the other goroutine, if it was not already running (otherwise would fall back to the existing channel behavior). Using this operator in the above situation would make it much more efficient in general.

However, this is a language change, and confusing to users. When do you use which operator? It would be good to have the effect of this optimization out of the box.

Extensions

  • This optimization may apply to other kinds of wakes. I consider only channels today.
  • The optimization could be extended to cases where a blocking channel receive appears following the blocking send in the same block, not necessarily as the subsequent statement.

[1] https://github.com/golang/go/blob/master/src/runtime/proc.go#L665

@gopherbot gopherbot added this to the Proposal milestone May 17, 2019
@randall77
Contributor

What about just enforcing a minimum delay between when a G is created and when it can be stolen? That gives the local P time to finish the spawning G (finish = either done or block) and pick up the new G itself.

The delay would be on the order of the overhead to move a G between processors (sys calls, cache warmup, etc.)

The tricky part is to not even wake the remote P when the goroutine is queued. We want a timer somehow that can be cancelled if the G is started locally.

@bradfitz bradfitz changed the title proposal: optimization to reduce P churn proposal: runtime: optimization to reduce P churn May 17, 2019

@amscanne
Contributor Author

Yes, most of the waste is generated by the wakeup call itself. Ensuring that the other P does not steal the G is probably a minor improvement, but you're still going to waste a ton of cycles (maybe even doing these wake ups twice -- on (1) and (4)).

I think using a timer gets much trickier. This is the reason I have limited the proposal to compiler-identified sequences of "chansend(block=true); chanrecv(block=true)" calls. It's possible that the system thread could be pre-empted between those calls, but if the system is busy (though Ps in this process may still be idle) it's probably even more valuable to not waste useless cycles.

@amscanne
Contributor Author

(Totally open to a timer, but I'm concerned about replacing a P wakeup with a kick to sysmon in order to enforce the timer, which solves the locality issue but still burns cycles.)

@dvyukov
Member

dvyukov commented May 18, 2019

Also see #8903 which was about a similar problem. I don't remember all the details exactly now, but as far as I remember my proposal was somewhat more generic, while yours wins in simplicity and is most likely safer from potential negative effects in corner cases.

@rsc
Contributor

rsc commented May 28, 2019

This has come up repeatedly. Obviously it is easy to recognize and fuse

```go
ch1 <- data    // (1)
result = <-ch2 // (2)
```

It's harder to recognize this pattern in more complex code that would benefit from the optimization, though. We've fiddled with heuristics in the runtime to try to wait a little bit before stealing a G from a P, and so on. Probably more tuning is needed.

It's unclear this needs to be a proposal, unless you are proposing a language change, and it sounds like you've backed away from that.

The way forward with a suggestion like this is to try implementing it and see how much of an improvement (and how general of an improvement) it yields.

@rsc rsc modified the milestones: Proposal, Unplanned May 28, 2019
@rsc rsc added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. and removed Proposal labels May 28, 2019
@rsc
Contributor

rsc commented May 28, 2019

/cc @randall77 @aclements

@golang golang deleted a comment from rsc May 28, 2019
@randall77
Contributor

Related:

#27345 (start working on a new goroutine immediately, on the parent's stack)
#18237 (lots of time in findrunnable)

I also remember an issue related to ready goroutines ping-ponging around Ps, but I can't find it at the moment.

@amscanne
Contributor Author

I backed away from a language change proposal based on the assumption that it would likely not be accepted. My personal preference would be to have an operation like <~ that immediately switches to the other goroutine if currently waiting. (And behaves like a normal channel operation if busy.) But I realize that the existence of this operator might be confusing.

I think it's unclear how much of an impact this would have in general. This is probably just a tiny optimization that doesn't matter in the general case, but can help in a few very specific ones. For us, it might let us structure some goroutine interactions much more efficiently.

I hacked something together, and it seems like there's a decent effect on microbenchmarks at least (unless I screwed something up).

Code:

```go
func BenchmarkPingPong(b *testing.B) {
	var wg sync.WaitGroup
	defer wg.Wait()

	ch1 := make(chan struct{}, 1)
	ch2 := make(chan struct{}, 1)
	wg.Add(2)
	go func() {
		defer wg.Done()
		for i := 0; i < b.N; i++ {
			ch1 <- struct{}{}
			<-ch2
		}
	}()
	go func() {
		defer wg.Done()
		<-ch1
		for i := 0; i < b.N-1; i++ {
			ch2 <- struct{}{}
			<-ch1
		}
		ch2 <- struct{}{}
	}()
}
```

Before:

```
/usr/bin/time /usr/bin/go test -bench=.* -benchtime=5s
goos: linux
goarch: amd64
BenchmarkPingPong-4   	20000000	       563 ns/op
PASS
ok  	_/home/amscanne/gotest/spin	11.805s
12.68user 1.00system 0:11.98elapsed 114%CPU (0avgtext+0avgdata 46036maxresident)k
0inputs+3816outputs (0major+19758minor)pagefaults 0swaps
```

After:

```
/usr/bin/time go test -bench=.* -benchtime=5s
goos: linux
goarch: amd64
BenchmarkPingPong-4   	20000000	       330 ns/op
PASS
ok  	_/home/amscanne/gotest/spin	6.949s
7.11user 0.05system 0:07.11elapsed 100%CPU (0avgtext+0avgdata 46460maxresident)k
0inputs+3824outputs (0major+19084minor)pagefaults 0swaps
```

The system time is telling at 20x, and the extra 14% in CPU usage is indicative of an additional P waking up with nothing to do. (Or maybe it occasionally successfully steals the goroutine, which is also bad.)

Assuming this small optimization is readily acceptable -- what's the best way to group those operations and transform the channel calls? The runtime bits are straightforward, but any up-front guidance on the compiler side is appreciated. Otherwise, I'm just planning to call a specialized scan in walkstmt list, but maybe there's a better way.

@rsc
Contributor

rsc commented Jun 4, 2019

Given that there is no language change here anymore, going to move this to being a regular issue.

@rsc rsc changed the title proposal: runtime: optimization to reduce P churn runtime: optimization to reduce P churn Jun 4, 2019
@rsc rsc removed the Proposal label Jun 4, 2019
@prattmic
Member

prattmic commented Aug 31, 2020

I've started looking into this. I've got a very naive implementation (probably very similar to Adin's) to use with his microbenchmark.

Combined with perf stat, we can see higher-level system effects of the change.

Fixed time (-benchtime=1s):

name                   old time/op  new time/op  delta
ChanPingPong-12         467ns ± 2%   247ns ± 2%  -47.07%  (p=0.000 n=9+10)

name                   old iters    new iters    delta
ChanPingPong-iters-12   2.51M ± 5%   4.44M ±14%  +76.87%  (p=0.000 n=10+10)

name                   old msec     new msec     delta
Perf-task-clock         2.01k ± 2%   1.38k ± 5%  -31.43%  (p=0.000 n=10+8)

name                   old val      new val      delta
Perf-context-switches   44.1k ± 2%    0.7k ± 8%  -98.52%  (p=0.000 n=10+8)
Perf-cpu-migrations       182 ±19%      10 ±27%  -94.39%  (p=0.000 n=10+9)
Perf-page-faults          518 ± 8%     511 ± 8%     ~     (p=0.536 n=10+9)
Perf-cycles             4.58G ± 2%   5.54G ± 6%  +21.05%  (p=0.000 n=10+8)
Perf-instructions       8.21G ± 3%  11.22G ± 6%  +36.63%  (p=0.000 n=10+8)
Perf-branches           1.69G ± 3%   2.30G ± 6%  +35.76%  (p=0.000 n=10+8)

Fixed iterations (-benchtime=10000000x):

name                   old time/op  new time/op  delta
ChanPingPong-12         473ns ± 2%   241ns ± 3%  -49.00%  (p=0.000 n=10+10)

name                   old msec     new msec     delta
Perf-task-clock         5.68k ± 3%   2.43k ± 3%  -57.15%  (p=0.000 n=10+10)

name                   old val      new val      delta
Perf-context-switches    125k ± 3%      1k ± 9%  -99.54%  (p=0.000 n=10+9)
Perf-cpu-migrations       517 ±13%      11 ±51%  -97.95%  (p=0.000 n=10+10)
Perf-page-faults          469 ± 9%     473 ±11%     ~     (p=0.928 n=10+10)
Perf-cycles             13.1G ± 2%   10.4G ± 2%  -20.56%  (p=0.000 n=10+10)
Perf-instructions       23.2G ± 0%   21.0G ± 0%   -9.52%  (p=0.000 n=10+8)
Perf-branches           4.79G ± 0%   4.31G ± 0%  -10.11%  (p=0.000 n=10+8)

I've included both since the different fixed dimensions change the interpretation. E.g., the first case has higher cycles after because it is simply able to do a lot more work. And it still does nearly double the iterations in 30% less CPU time (== far less time stalled)!

This certainly looks worthwhile from the micro-benchmark perspective. The questions remaining to me are whether we can efficiently and reliably detect these scenarios, and whether they affect many programs.

@prattmic
Member

prattmic commented Sep 1, 2020

For future reference, here's @amscanne's prototype: amscanne@eee812b

This is a bit more advanced than mine, as I haven't made any compiler changes yet.

@gopherbot

Change https://golang.org/cl/254817 mentions this issue: WIP: merge chansend1 + chanrecv1 into unified chansendrecv1

@GuhuangLS

GuhuangLS commented Jun 10, 2021

> For future reference, here's @amscanne's prototype: amscanne@eee812b
>
> This is a bit more advanced than mine, as I haven't made any compiler changes yet.

@prattmic Michael, I have one question: does amscanne/go@eee812b need to modify APIs, and does the user's program need to be aware of it? How does the compiler make the decision?

@prattmic
Member

Neither @amscanne's nor my prototype changes any language syntax or APIs. Instead, the compiler detects a channel send followed immediately by a channel receive and, rather than calling the typical runtime.chansend and runtime.chanrecv functions, emits calls to alternative implementations (a single merged runtime.chansendrecv in my case).

Both prototypes are rudimentary and would probably hurt performance for many programs due to poor decisions and would need more refinement.

copybara-service bot pushed a commit to google/gvisor that referenced this issue Nov 1, 2021
Some synchronization patterns require the ability to simultaneously wake and
sleep a goroutine. For the sleep package, this is the case when a waker must be
asserted when a subsequent fetch is imminent.

Currently, this operation results in significant P churn in the runtime, which
ping-pongs execution between multiple system threads and cores and consumes a
significant amount of host CPU (and because of the context switches, this can
be significantly worse with mitigations for side channel vulnerabilities).

The solution is to introduce a dedicated mechanism for a synchronous switch
which does not wake another runtime P (see golang/go#32113). This can be used
by the `AssertAndFetch` API in the sleep package.

The benchmark results for this package are very similar to raw channel
operations for all cases, with the exception of operations that do not wait.
The primary advantage is more precise control over scheduling. This will be
used in a subsequent change.

```
BenchmarkGoAssertNonWaiting
BenchmarkGoAssertNonWaiting-8                   261364384                4.976 ns/op
BenchmarkGoSingleSelect
BenchmarkGoSingleSelect-8                       20946358                57.77 ns/op
BenchmarkGoMultiSelect
BenchmarkGoMultiSelect-8                         6071697               197.0 ns/op
BenchmarkGoWaitOnSingleSelect
BenchmarkGoWaitOnSingleSelect-8                  4978051               235.4 ns/op
BenchmarkGoWaitOnMultiSelect
BenchmarkGoWaitOnMultiSelect-8                   2309224               520.2 ns/op

BenchmarkSleeperAssertNonWaiting
BenchmarkSleeperAssertNonWaiting-8              447325033                2.657 ns/op
BenchmarkSleeperSingleSelect
BenchmarkSleeperSingleSelect-8                  21488844                55.19 ns/op
BenchmarkSleeperMultiSelect
BenchmarkSleeperMultiSelect-8                   21851674                54.89 ns/op
BenchmarkSleeperWaitOnSingleSelect
BenchmarkSleeperWaitOnSingleSelect-8             2860327               416.4 ns/op
BenchmarkSleeperWaitOnSingleSelectSync
BenchmarkSleeperWaitOnSingleSelectSync-8         2741733               427.1 ns/op
BenchmarkSleeperWaitOnMultiSelect
BenchmarkSleeperWaitOnMultiSelect-8              2867484               418.1 ns/op
BenchmarkSleeperWaitOnMultiSelectSync
BenchmarkSleeperWaitOnMultiSelectSync-8          2789158               427.9 ns/op
```

PiperOrigin-RevId: 406873844
zhuangel pushed a commit to zhuangel/go that referenced this issue Nov 16, 2021
Background:

The scheduler very easily gets into a "P churn" problem, as stated by Adin
in golang#32113. This problem is more serious in gVisor, as the futex()
syscall, used to wake and idle Ms, is a much heavier operation from GR0
into HR0.

Adin proposed adding context semantics to the scheduler to decide whether
we need to wake a new M. Let's call it a local strategy.

Here we propose another way to solve this problem. Let's call it a global
strategy. When we need to decide whether to start a new M, in addition to
the condition of an extra P, we calculate the # of runnable Gs and the #
of running Ps. When # of runnable Gs <= # of running Ps * factor, do not
start another M, as those already-running Ms will steal Gs from this P.
We tried using a factor of 1.5, but then switched to 1.

The mechanism applies when we ready a G; we also add this to handoffp().
For handoffp(), the previous strategy wakes up an M when one of the two
conditions below is satisfied:
  - the local runq is not empty;
  - the global runq is not empty.

We constrain the 2nd condition by comparing the # of running Gs and the
# of running Ps. (Note that the running P here has a different meaning
than the one used above for wakep().)

Two concerns were raised about this method:
  - Does it add too much contention when we do the G/P counting? As side
    info, we usually set GOMAXPROCS to 4 or 8. And as we see from our
    results, CPU util is much lower, and we don't see too much contention
    in the flame graph.
  - Does it bring worse latency? Yes, it does incur a small regression in
    latency, but the CPU util seems a big enough advantage.

Signed-off-by: Shi Liu <liushi.ls@antgroup.com>
Signed-off-by: Jielong Zhou <jielong.zjl@antgroup.com>
Signed-off-by: Yong He <chenglang.hy@antgroup.com>
Signed-off-by: Jianfeng Tan <henry.tjf@antgroup.com>
copybara-service bot pushed a commit to google/gvisor that referenced this issue Dec 9, 2021
Some synchronization patterns require the ability to simultaneously wake and
sleep a goroutine. For the sleep package, this is the case when a waker must be
asserted when a subsequent fetch is imminent.

Currently, this operation results in significant P churn in the runtime, which
ping-pongs execution between multiple system threads and cores and consumes a
significant amount of host CPU (and because of the context switches, this can
be significant worse with mitigations for side channel vulnerabilities).

The solution is to introduce a dedicated mechanism for a synchronous switch
which does not wake another runtime P (see golang/go#32113). This can be used
by the `AssertAndFetch` API in the sleep package.

The benchmark results for this package are very similiar to raw channel
operations for all cases, with the exception of operations that do not wait.
The primary advantage is more precise control over scheduling. This will be
used in a subsequent change.

```
BenchmarkGoAssertNonWaiting
BenchmarkGoAssertNonWaiting-8                   261364384                4.976 ns/op
BenchmarkGoSingleSelect
BenchmarkGoSingleSelect-8                       20946358                57.77 ns/op
BenchmarkGoMultiSelect
BenchmarkGoMultiSelect-8                         6071697               197.0 ns/op
BenchmarkGoWaitOnSingleSelect
BenchmarkGoWaitOnSingleSelect-8                  4978051               235.4 ns/op
BenchmarkGoWaitOnMultiSelect
BenchmarkGoWaitOnMultiSelect-8                   2309224               520.2 ns/op

BenchmarkSleeperAssertNonWaiting
BenchmarkSleeperAssertNonWaiting-8              447325033                2.657 ns/op
BenchmarkSleeperSingleSelect
BenchmarkSleeperSingleSelect-8                  21488844                55.19 ns/op
BenchmarkSleeperMultiSelect
BenchmarkSleeperMultiSelect-8                   21851674                54.89 ns/op
BenchmarkSleeperWaitOnSingleSelect
BenchmarkSleeperWaitOnSingleSelect-8             2860327               416.4 ns/op
BenchmarkSleeperWaitOnSingleSelectSync
BenchmarkSleeperWaitOnSingleSelectSync-8         2741733               427.1 ns/op
BenchmarkSleeperWaitOnMultiSelect
BenchmarkSleeperWaitOnMultiSelect-8              2867484               418.1 ns/op
BenchmarkSleeperWaitOnMultiSelectSync
BenchmarkSleeperWaitOnMultiSelectSync-8          2789158               427.9 ns/op
```

PiperOrigin-RevId: 406873844
copybara-service bot pushed a commit to google/gvisor that referenced this issue Dec 9, 2021
Some synchronization patterns require the ability to simultaneously wake and
sleep a goroutine. For the sleep package, this is the case when a waker must be
asserted when a subsequent fetch is imminent.

Currently, this operation results in significant P churn in the runtime, which
ping-pongs execution between multiple system threads and cores and consumes a
significant amount of host CPU (and because of the context switches, this can
be significant worse with mitigations for side channel vulnerabilities).

The solution is to introduce a dedicated mechanism for a synchronous switch
which does not wake another runtime P (see golang/go#32113). This can be used
by the `AssertAndFetch` API in the sleep package.

The benchmark results for this package are very similiar to raw channel
operations for all cases, with the exception of operations that do not wait.
The primary advantage is more precise control over scheduling. This will be
used in a subsequent change.

```
BenchmarkGoAssertNonWaiting
BenchmarkGoAssertNonWaiting-8                   261364384                4.976 ns/op
BenchmarkGoSingleSelect
BenchmarkGoSingleSelect-8                       20946358                57.77 ns/op
BenchmarkGoMultiSelect
BenchmarkGoMultiSelect-8                         6071697               197.0 ns/op
BenchmarkGoWaitOnSingleSelect
BenchmarkGoWaitOnSingleSelect-8                  4978051               235.4 ns/op
BenchmarkGoWaitOnMultiSelect
BenchmarkGoWaitOnMultiSelect-8                   2309224               520.2 ns/op

BenchmarkSleeperAssertNonWaiting
BenchmarkSleeperAssertNonWaiting-8              447325033                2.657 ns/op
BenchmarkSleeperSingleSelect
BenchmarkSleeperSingleSelect-8                  21488844                55.19 ns/op
BenchmarkSleeperMultiSelect
BenchmarkSleeperMultiSelect-8                   21851674                54.89 ns/op
BenchmarkSleeperWaitOnSingleSelect
BenchmarkSleeperWaitOnSingleSelect-8             2860327               416.4 ns/op
BenchmarkSleeperWaitOnSingleSelectSync
BenchmarkSleeperWaitOnSingleSelectSync-8         2741733               427.1 ns/op
BenchmarkSleeperWaitOnMultiSelect
BenchmarkSleeperWaitOnMultiSelect-8              2867484               418.1 ns/op
BenchmarkSleeperWaitOnMultiSelectSync
BenchmarkSleeperWaitOnMultiSelectSync-8          2789158               427.9 ns/op
```

PiperOrigin-RevId: 406873844
copybara-service bot pushed a commit to google/gvisor that referenced this issue Dec 10, 2021
Some synchronization patterns require the ability to simultaneously wake and
sleep a goroutine. For the sleep package, this is the case when a waker must be
asserted when a subsequent fetch is imminent.

Currently, this operation results in significant P churn in the runtime, which
ping-pongs execution between multiple system threads and cores and consumes a
significant amount of host CPU (and because of the context switches, this can
be significant worse with mitigations for side channel vulnerabilities).

The solution is to introduce a dedicated mechanism for a synchronous switch
which does not wake another runtime P (see golang/go#32113). This can be used
by the `AssertAndFetch` API in the sleep package.

The benchmark results for this package are very similiar to raw channel
operations for all cases, with the exception of operations that do not wait.
The primary advantage is more precise control over scheduling. This will be
used in a subsequent change.

```
BenchmarkGoAssertNonWaiting
BenchmarkGoAssertNonWaiting-8                   261364384                4.976 ns/op
BenchmarkGoSingleSelect
BenchmarkGoSingleSelect-8                       20946358                57.77 ns/op
BenchmarkGoMultiSelect
BenchmarkGoMultiSelect-8                         6071697               197.0 ns/op
BenchmarkGoWaitOnSingleSelect
BenchmarkGoWaitOnSingleSelect-8                  4978051               235.4 ns/op
BenchmarkGoWaitOnMultiSelect
BenchmarkGoWaitOnMultiSelect-8                   2309224               520.2 ns/op

BenchmarkSleeperAssertNonWaiting
BenchmarkSleeperAssertNonWaiting-8              447325033                2.657 ns/op
BenchmarkSleeperSingleSelect
BenchmarkSleeperSingleSelect-8                  21488844                55.19 ns/op
BenchmarkSleeperMultiSelect
BenchmarkSleeperMultiSelect-8                   21851674                54.89 ns/op
BenchmarkSleeperWaitOnSingleSelect
BenchmarkSleeperWaitOnSingleSelect-8             2860327               416.4 ns/op
BenchmarkSleeperWaitOnSingleSelectSync
BenchmarkSleeperWaitOnSingleSelectSync-8         2741733               427.1 ns/op
BenchmarkSleeperWaitOnMultiSelect
BenchmarkSleeperWaitOnMultiSelect-8              2867484               418.1 ns/op
BenchmarkSleeperWaitOnMultiSelectSync
BenchmarkSleeperWaitOnMultiSelectSync-8          2789158               427.9 ns/op
```

PiperOrigin-RevId: 406873844
copybara-service bot pushed a commit to google/gvisor that referenced this issue Dec 10, 2021
Some synchronization patterns require the ability to simultaneously wake and
sleep a goroutine. For the sleep package, this is the case when a waker must be
asserted when a subsequent fetch is imminent.

Currently, this operation results in significant P churn in the runtime, which
ping-pongs execution between multiple system threads and cores and consumes a
significant amount of host CPU (and because of the context switches, this can
be significant worse with mitigations for side channel vulnerabilities).

The solution is to introduce a dedicated mechanism for a synchronous switch
which does not wake another runtime P (see golang/go#32113). This can be used
by the `AssertAndFetch` API in the sleep package.

The benchmark results for this package are very similiar to raw channel
operations for all cases, with the exception of operations that do not wait.
The primary advantage is more precise control over scheduling. This will be
used in a subsequent change.

```
BenchmarkGoAssertNonWaiting
BenchmarkGoAssertNonWaiting-8                   261364384                4.976 ns/op
BenchmarkGoSingleSelect
BenchmarkGoSingleSelect-8                       20946358                57.77 ns/op
BenchmarkGoMultiSelect
BenchmarkGoMultiSelect-8                         6071697               197.0 ns/op
BenchmarkGoWaitOnSingleSelect
BenchmarkGoWaitOnSingleSelect-8                  4978051               235.4 ns/op
BenchmarkGoWaitOnMultiSelect
BenchmarkGoWaitOnMultiSelect-8                   2309224               520.2 ns/op

BenchmarkSleeperAssertNonWaiting
BenchmarkSleeperAssertNonWaiting-8              447325033                2.657 ns/op
BenchmarkSleeperSingleSelect
BenchmarkSleeperSingleSelect-8                  21488844                55.19 ns/op
BenchmarkSleeperMultiSelect
BenchmarkSleeperMultiSelect-8                   21851674                54.89 ns/op
BenchmarkSleeperWaitOnSingleSelect
BenchmarkSleeperWaitOnSingleSelect-8             2860327               416.4 ns/op
BenchmarkSleeperWaitOnSingleSelectSync
BenchmarkSleeperWaitOnSingleSelectSync-8         2741733               427.1 ns/op
BenchmarkSleeperWaitOnMultiSelect
BenchmarkSleeperWaitOnMultiSelect-8              2867484               418.1 ns/op
BenchmarkSleeperWaitOnMultiSelectSync
BenchmarkSleeperWaitOnMultiSelectSync-8          2789158               427.9 ns/op
```

PiperOrigin-RevId: 406873844
copybara-service bot pushed a commit to google/gvisor that referenced this issue Dec 10, 2021
Some synchronization patterns require the ability to simultaneously wake and
sleep a goroutine. For the sleep package, this is the case when a waker must be
asserted when a subsequent fetch is imminent.

Currently, this operation results in significant P churn in the runtime, which
ping-pongs execution between multiple system threads and cores and consumes a
significant amount of host CPU (and because of the context switches, this can
be significant worse with mitigations for side channel vulnerabilities).

The solution is to introduce a dedicated mechanism for a synchronous switch
which does not wake another runtime P (see golang/go#32113). This can be used
by the `AssertAndFetch` API in the sleep package.

The benchmark results for this package are very similiar to raw channel
operations for all cases, with the exception of operations that do not wait.
The primary advantage is more precise control over scheduling. This will be
used in a subsequent change.

```
BenchmarkGoAssertNonWaiting
BenchmarkGoAssertNonWaiting-8                   261364384                4.976 ns/op
BenchmarkGoSingleSelect
BenchmarkGoSingleSelect-8                       20946358                57.77 ns/op
BenchmarkGoMultiSelect
BenchmarkGoMultiSelect-8                         6071697               197.0 ns/op
BenchmarkGoWaitOnSingleSelect
BenchmarkGoWaitOnSingleSelect-8                  4978051               235.4 ns/op
BenchmarkGoWaitOnMultiSelect
BenchmarkGoWaitOnMultiSelect-8                   2309224               520.2 ns/op

BenchmarkSleeperAssertNonWaiting
BenchmarkSleeperAssertNonWaiting-8              447325033                2.657 ns/op
BenchmarkSleeperSingleSelect
BenchmarkSleeperSingleSelect-8                  21488844                55.19 ns/op
BenchmarkSleeperMultiSelect
BenchmarkSleeperMultiSelect-8                   21851674                54.89 ns/op
BenchmarkSleeperWaitOnSingleSelect
BenchmarkSleeperWaitOnSingleSelect-8             2860327               416.4 ns/op
BenchmarkSleeperWaitOnSingleSelectSync
BenchmarkSleeperWaitOnSingleSelectSync-8         2741733               427.1 ns/op
BenchmarkSleeperWaitOnMultiSelect
BenchmarkSleeperWaitOnMultiSelect-8              2867484               418.1 ns/op
BenchmarkSleeperWaitOnMultiSelectSync
BenchmarkSleeperWaitOnMultiSelectSync-8          2789158               427.9 ns/op
```

PiperOrigin-RevId: 406873844
copybara-service bot pushed a commit to google/gvisor that referenced this issue Dec 10, 2021
Some synchronization patterns require the ability to simultaneously wake and
sleep a goroutine. For the sleep package, this is the case when a waker must be
asserted when a subsequent fetch is imminent.

Currently, this operation results in significant P churn in the runtime, which
ping-pongs execution between multiple system threads and cores and consumes a
significant amount of host CPU (and because of the context switches, this can
be significantly worse with mitigations for side-channel vulnerabilities).

The solution is to introduce a dedicated mechanism for a synchronous switch
which does not wake another runtime P (see golang/go#32113). This can be used
by the `AssertAndFetch` API in the sleep package.

The benchmark results for this package are very similar to raw channel
operations for all cases, with the exception of operations that do not wait.
The primary advantage is more precise control over scheduling. This will be
used in a subsequent change.

```
BenchmarkGoAssertNonWaiting
BenchmarkGoAssertNonWaiting-8                   261364384                4.976 ns/op
BenchmarkGoSingleSelect
BenchmarkGoSingleSelect-8                       20946358                57.77 ns/op
BenchmarkGoMultiSelect
BenchmarkGoMultiSelect-8                         6071697               197.0 ns/op
BenchmarkGoWaitOnSingleSelect
BenchmarkGoWaitOnSingleSelect-8                  4978051               235.4 ns/op
BenchmarkGoWaitOnMultiSelect
BenchmarkGoWaitOnMultiSelect-8                   2309224               520.2 ns/op

BenchmarkSleeperAssertNonWaiting
BenchmarkSleeperAssertNonWaiting-8              447325033                2.657 ns/op
BenchmarkSleeperSingleSelect
BenchmarkSleeperSingleSelect-8                  21488844                55.19 ns/op
BenchmarkSleeperMultiSelect
BenchmarkSleeperMultiSelect-8                   21851674                54.89 ns/op
BenchmarkSleeperWaitOnSingleSelect
BenchmarkSleeperWaitOnSingleSelect-8             2860327               416.4 ns/op
BenchmarkSleeperWaitOnSingleSelectSync
BenchmarkSleeperWaitOnSingleSelectSync-8         2741733               427.1 ns/op
BenchmarkSleeperWaitOnMultiSelect
BenchmarkSleeperWaitOnMultiSelect-8              2867484               418.1 ns/op
BenchmarkSleeperWaitOnMultiSelectSync
BenchmarkSleeperWaitOnMultiSelectSync-8          2789158               427.9 ns/op
```

PiperOrigin-RevId: 415581417
nixprime added a commit to nixprime/go that referenced this issue Mar 3, 2023
The most recently goready()'d G on each P is given a special position in
the P's runqueue, p.runnext. Other Ps steal p.runnext only as a last
resort, and usleep(3) before doing so: findRunnable() => stealWork() =>
runqsteal() => runqgrab(). As documented in runqgrab(), this is to
reduce thrashing of Gs between Ps in cases where one goroutine wakes another
and then "almost immediately" blocks.

On Linux, usleep() is implemented by invoking the nanosleep system call.
Syscall timeouts in the Linux kernel are subject to timer slack, as
documented in the prctl(2) man page, section
"PR_SET_TIMERSLACK". Experimentally, short timeouts can be expected to
expire 50 microseconds late regardless of other system activity. Thus, on
Linux, usleep(3) typically sleeps for at least 53 microseconds, more
than 17x longer than intended.

A P must be in the spinning state in order to attempt work-stealing.
While at least one P is spinning, wakep() will refuse to wake a new
spinning P. One P sleeping in runqgrab() thus prevents further threads
from being woken in response to e.g. goroutine wakeups *globally*
(throughout the process). Futex wake-to-wakeup latency is approximately
20 microseconds, so sleeping for 53 microseconds can significantly
increase goroutine wakeup latency by delaying thread wakeup.

Fix this by timestamping Gs when they are runqput() into p.runnext, and
causing runqgrab() to indicate to findRunnable() that it should loop if
p.runnext is not yet stealable.

Alternative fixes considered:

- osyield() on Linux as we do on a few other platforms. On Linux,
  osyield() is implemented by the sched_yield system call, which IIUC
  causes the calling thread to yield its timeslice to any thread on its
  runqueue that it would not preempt on wakeup, potentially introducing
  even larger latencies on busy systems. See also
  https://www.realworldtech.com/forum/?threadid=189711&curpostid=189752
  for a case against sched_yield on semantic grounds.

- Replace the usleep() with a spin loop in-place. This tends to waste
  the spinning P's time, since it can't check other runqueues, and the
  number of calls to runqgrab(), and therefore of sleeps, is linear in
  the number of Ps. Empirically, it introduces regressions not observed
  in this change.

Unfortunately, this is a load-bearing bug. In programs with goroutines
that frequently wake up goroutines and then immediately block, this bug
significantly reduces overhead from useless thread wakeups in wakep().
In golang.org/x/benchmarks, this manifests most clearly as regressions
in benchmark dustin_broadcast. To avoid this regression, we need to
intentionally throttle wakep() => acquirem().

Thus, this change also introduces a "need-wakep()" prediction mechanism,
which causes goready() and newproc() to call wakep() only if the calling
goroutine is predicted not to immediately block. To handle
mispredictions, sysmon is changed to wakep() if it detects
underutilization. The current prediction algorithm is simple, but
appears to be effective; it can be improved in the future as warranted.

Results from golang.org/x/benchmarks:
(Baseline is go1.20.1; experiment is go1.20.1 plus this change)

shortname: ajstarks_deck_generate
goos: linux
goarch: amd64
pkg: github.com/ajstarks/deck/generate
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
           │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
           │                    sec/op                    │         sec/op           vs base               │
Arc-12                                        3.857µ ± 5%               3.753µ ± 5%       ~ (p=0.424 n=10)
Polygon-12                                    7.074µ ± 6%               6.969µ ± 4%       ~ (p=0.190 n=10)
geomean                                       5.224µ                    5.114µ       -2.10%

shortname: aws_jsonutil
pkg: github.com/aws/aws-sdk-go/private/protocol/json/jsonutil
              │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
              │                    sec/op                    │         sec/op           vs base               │
BuildJSON-12                                     5.602µ ± 3%               5.600µ ± 2%       ~ (p=0.896 n=10)
StdlibJSON-12                                    3.843µ ± 2%               3.828µ ± 2%       ~ (p=0.224 n=10)
geomean                                          4.640µ                    4.630µ       -0.22%

shortname: benhoyt_goawk_1_18
pkg: github.com/benhoyt/goawk/interp
                       │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                       │                    sec/op                    │         sec/op           vs base               │
RecursiveFunc-12                                          17.79µ ± 3%               17.65µ ± 3%       ~ (p=0.436 n=10)
RegexMatch-12                                             815.8n ± 4%               823.3n ± 1%       ~ (p=0.353 n=10)
RepeatExecProgram-12                                      21.30µ ± 6%               21.69µ ± 3%       ~ (p=0.052 n=10)
RepeatNew-12                                              79.21n ± 4%               79.73n ± 3%       ~ (p=0.529 n=10)
RepeatIOExecProgram-12                                    41.83µ ± 1%               42.07µ ± 2%       ~ (p=0.796 n=10)
RepeatIONew-12                                            1.195µ ± 3%               1.196µ ± 2%       ~ (p=1.000 n=10)
geomean                                                   3.271µ                    3.288µ       +0.54%

shortname: bindata
pkg: github.com/kevinburke/go-bindata
           │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
           │                    sec/op                    │            sec/op             vs base          │
Bindata-12                                    316.2m ± 5%                    309.7m ± 4%  ~ (p=0.436 n=10)

           │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
           │                     B/s                      │             B/s               vs base          │
Bindata-12                                   20.71Mi ± 5%                   21.14Mi ± 4%  ~ (p=0.436 n=10)

           │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
           │                     B/op                     │             B/op              vs base          │
Bindata-12                                   183.0Mi ± 0%                   183.0Mi ± 0%  ~ (p=0.353 n=10)

           │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
           │                  allocs/op                   │          allocs/op            vs base          │
Bindata-12                                    5.790k ± 0%                    5.789k ± 0%  ~ (p=0.358 n=10)

shortname: bloom_bloom
pkg: github.com/bits-and-blooms/bloom/v3
                      │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                      │                    sec/op                    │         sec/op           vs base               │
SeparateTestAndAdd-12                                    414.6n ± 4%               413.9n ± 2%       ~ (p=0.895 n=10)
CombinedTestAndAdd-12                                    425.8n ± 9%               419.8n ± 8%       ~ (p=0.353 n=10)
geomean                                                  420.2n                    416.9n       -0.78%

shortname: capnproto2
pkg: zombiezen.com/go/capnproto2
                               │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                               │                    sec/op                    │         sec/op           vs base               │
TextMovementBetweenSegments-12                                    320.5µ ± 5%              318.4µ ± 10%       ~ (p=0.579 n=10)
Growth_MultiSegment-12                                            13.63m ± 1%              13.87m ±  2%  +1.71% (p=0.029 n=10)
geomean                                                           2.090m                   2.101m        +0.52%

                       │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                       │                     B/s                      │           B/s            vs base               │
Growth_MultiSegment-12                                   73.35Mi ± 1%              72.12Mi ± 2%  -1.68% (p=0.027 n=10)

                       │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                       │                     B/op                     │             B/op              vs base          │
Growth_MultiSegment-12                                   1.572Mi ± 0%                   1.572Mi ± 0%  ~ (p=0.320 n=10)

                       │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                       │                  allocs/op                   │         allocs/op           vs base            │
Growth_MultiSegment-12                                     21.00 ± 0%                   21.00 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

shortname: cespare_mph
pkg: github.com/cespare/mph
         │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
         │                    sec/op                    │            sec/op             vs base          │
Build-12                                    32.72m ± 2%                    32.49m ± 1%  ~ (p=0.280 n=10)

shortname: commonmark_markdown
pkg: gitlab.com/golang-commonmark/markdown
                          │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                          │                    sec/op                    │         sec/op           vs base               │
RenderSpecNoHTML-12                                          10.09m ± 2%               10.18m ± 3%       ~ (p=0.796 n=10)
RenderSpec-12                                                10.19m ± 1%               10.11m ± 3%       ~ (p=0.684 n=10)
RenderSpecBlackFriday2-12                                    6.793m ± 5%               6.946m ± 2%       ~ (p=0.063 n=10)
geomean                                                      8.872m                    8.944m       +0.81%

shortname: dustin_broadcast
pkg: github.com/dustin/go-broadcast
                      │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                      │                    sec/op                    │         sec/op          vs base                │
DirectSend-12                                            570.5n ± 7%              355.2n ± 2%  -37.74% (p=0.000 n=10)
ParallelDirectSend-12                                    549.0n ± 5%              360.9n ± 3%  -34.25% (p=0.000 n=10)
ParallelBrodcast-12                                      788.7n ± 2%              486.0n ± 4%  -38.37% (p=0.000 n=10)
MuxBrodcast-12                                           788.6n ± 4%              471.5n ± 6%  -40.21% (p=0.000 n=10)
geomean                                                  664.4n                   414.0n       -37.68%

shortname: dustin_humanize
pkg: github.com/dustin/go-humanize
                 │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                 │                    sec/op                    │            sec/op             vs base          │
ParseBigBytes-12                                    1.964µ ± 5%                    1.941µ ± 3%  ~ (p=0.289 n=10)

shortname: ericlagergren_decimal
pkg: github.com/ericlagergren/decimal/benchmarks
                                       │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                                       │                    sec/op                    │         sec/op           vs base               │
Pi/foo=ericlagergren_(Go)/prec=100-12                                     147.5µ ± 2%               147.5µ ± 1%       ~ (p=0.912 n=10)
Pi/foo=ericlagergren_(GDA)/prec=100-12                                    329.6µ ± 1%               332.1µ ± 2%       ~ (p=0.063 n=10)
Pi/foo=shopspring/prec=100-12                                             680.5µ ± 4%               688.6µ ± 2%       ~ (p=0.481 n=10)
Pi/foo=apmckinlay/prec=100-12                                             2.541µ ± 4%               2.525µ ± 3%       ~ (p=0.218 n=10)
Pi/foo=go-inf/prec=100-12                                                 169.5µ ± 3%               170.7µ ± 3%       ~ (p=0.218 n=10)
Pi/foo=float64/prec=100-12                                                4.136µ ± 3%               4.162µ ± 6%       ~ (p=0.436 n=10)
geomean                                                                   62.38µ                    62.66µ       +0.45%

shortname: ethereum_bitutil
pkg: github.com/ethereum/go-ethereum/common/bitutil
                         │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                         │                    sec/op                    │         sec/op          vs base                │
FastTest2KB-12                                              130.4n ± 1%              131.5n ± 1%        ~ (p=0.093 n=10)
BaseTest2KB-12                                              624.8n ± 2%              983.0n ± 2%  +57.32% (p=0.000 n=10)
Encoding4KBVerySparse-12                                    21.48µ ± 3%              22.20µ ± 3%   +3.37% (p=0.005 n=10)
geomean                                                     1.205µ                   1.421µ       +17.94%

                         │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                         │                     B/op                     │            B/op             vs base            │
Encoding4KBVerySparse-12                                   9.750Ki ± 0%                 9.750Ki ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

                         │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                         │                  allocs/op                   │         allocs/op           vs base            │
Encoding4KBVerySparse-12                                     15.00 ± 0%                   15.00 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

shortname: ethereum_core
pkg: github.com/ethereum/go-ethereum/core
                             │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                             │                    sec/op                    │         sec/op           vs base               │
PendingDemotion10000-12                                         96.72n ± 4%               98.55n ± 2%       ~ (p=0.055 n=10)
FuturePromotion10000-12                                         2.128n ± 3%               2.093n ± 3%       ~ (p=0.896 n=10)
PoolBatchInsert10000-12                                         642.6m ± 2%               642.1m ± 5%       ~ (p=0.796 n=10)
PoolBatchLocalInsert10000-12                                    805.2m ± 2%               826.6m ± 4%       ~ (p=0.105 n=10)
geomean                                                         101.6µ                    102.3µ       +0.69%

shortname: ethereum_corevm
pkg: github.com/ethereum/go-ethereum/core/vm
            │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
            │                    sec/op                    │         sec/op           vs base               │
OpDiv128-12                                    137.4n ± 3%               139.5n ± 1%  +1.56% (p=0.024 n=10)

shortname: ethereum_ecies
pkg: github.com/ethereum/go-ethereum/crypto/ecies
                    │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                    │                    sec/op                    │         sec/op           vs base               │
GenerateKeyP256-12                                     15.67µ ± 6%               15.66µ ± 3%       ~ (p=0.971 n=10)
GenSharedKeyP256-12                                    51.09µ ± 6%               52.09µ ± 4%       ~ (p=0.631 n=10)
GenSharedKeyS256-12                                    47.24µ ± 2%               46.67µ ± 3%       ~ (p=0.247 n=10)
geomean                                                33.57µ                    33.64µ       +0.21%

shortname: ethereum_ethash
pkg: github.com/ethereum/go-ethereum/consensus/ethash
                  │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                  │                    sec/op                    │            sec/op             vs base          │
HashimotoLight-12                                    1.116m ± 5%                    1.112m ± 2%  ~ (p=0.684 n=10)

shortname: ethereum_trie
pkg: github.com/ethereum/go-ethereum/trie
                                │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                                │                    sec/op                    │         sec/op           vs base               │
HashFixedSize/10K-12                                               9.236m ± 1%               9.106m ± 1%  -1.40% (p=0.019 n=10)
CommitAfterHashFixedSize/10K-12                                    19.60m ± 1%               19.51m ± 1%       ~ (p=0.796 n=10)
geomean                                                            13.45m                    13.33m       -0.93%

                                │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                                │                     B/op                     │          B/op            vs base               │
HashFixedSize/10K-12                                              6.036Mi ± 0%              6.037Mi ± 0%       ~ (p=0.247 n=10)
CommitAfterHashFixedSize/10K-12                                   8.626Mi ± 0%              8.626Mi ± 0%       ~ (p=0.280 n=10)
geomean                                                           7.216Mi                   7.216Mi       +0.01%

                                │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                                │                  allocs/op                   │        allocs/op         vs base               │
HashFixedSize/10K-12                                               77.17k ± 0%               77.17k ± 0%       ~ (p=0.050 n=10)
CommitAfterHashFixedSize/10K-12                                    79.99k ± 0%               79.99k ± 0%       ~ (p=0.391 n=10)
geomean                                                            78.56k                    78.57k       +0.00%

shortname: gonum_blas_native
pkg: gonum.org/v1/gonum/blas/gonum
                         │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                         │                    sec/op                    │         sec/op           vs base               │
Dnrm2MediumPosInc-12                                        1.953µ ± 2%               1.940µ ± 5%       ~ (p=0.989 n=10)
DasumMediumUnitaryInc-12                                    932.5n ± 1%               931.2n ± 1%       ~ (p=0.753 n=10)
geomean                                                     1.349µ                    1.344µ       -0.40%

shortname: gonum_community
pkg: gonum.org/v1/gonum/graph/community
                            │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                            │                    sec/op                    │            sec/op             vs base          │
LouvainDirectedMultiplex-12                                    26.40m ± 1%                    26.64m ± 1%  ~ (p=0.165 n=10)

shortname: gonum_lapack_native
pkg: gonum.org/v1/gonum/lapack/gonum
                      │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                      │                    sec/op                    │         sec/op           vs base               │
Dgeev/Circulant10-12                                     41.97µ ± 6%               42.90µ ± 4%       ~ (p=0.143 n=10)
Dgeev/Circulant100-12                                    12.13m ± 4%               12.30m ± 3%       ~ (p=0.796 n=10)
geomean                                                  713.4µ                    726.4µ       +1.81%

shortname: gonum_mat
pkg: gonum.org/v1/gonum/mat
                                  │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                                  │                    sec/op                    │         sec/op           vs base               │
MulWorkspaceDense1000Hundredth-12                                   89.78m ±  0%              81.48m ±  1%  -9.24% (p=0.000 n=10)
ScaleVec10000Inc20-12                                               7.204µ ± 36%              8.450µ ± 35%       ~ (p=0.853 n=10)
geomean                                                             804.2µ                    829.7µ        +3.18%

shortname: gonum_topo
pkg: gonum.org/v1/gonum/graph/topo
                          │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                          │                    sec/op                    │         sec/op           vs base               │
TarjanSCCGnp_10_tenth-12                                     7.251µ ± 1%               7.187µ ± 1%  -0.88% (p=0.025 n=10)
TarjanSCCGnp_1000_half-12                                    74.48m ± 2%               74.37m ± 4%       ~ (p=0.796 n=10)
geomean                                                      734.8µ                    731.1µ       -0.51%

shortname: gonum_traverse
pkg: gonum.org/v1/gonum/graph/traverse
                                     │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                                     │                    sec/op                    │         sec/op           vs base               │
WalkAllBreadthFirstGnp_10_tenth-12                                      3.517µ ± 1%               3.534µ ± 1%       ~ (p=0.343 n=10)
WalkAllBreadthFirstGnp_1000_tenth-12                                    11.12m ± 6%               11.19m ± 2%       ~ (p=0.631 n=10)
geomean                                                                 197.8µ                    198.9µ       +0.54%

shortname: gtank_blake2s
pkg: github.com/gtank/blake2s
          │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
          │                    sec/op                    │            sec/op             vs base          │
Hash8K-12                                    18.96µ ± 4%                    18.82µ ± 5%  ~ (p=0.579 n=10)

          │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
          │                     B/s                      │             B/s               vs base          │
Hash8K-12                                   412.2Mi ± 4%                   415.2Mi ± 5%  ~ (p=0.579 n=10)

shortname: hugo_hugolib
pkg: github.com/gohugoio/hugo/hugolib
                            │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                            │                    sec/op                    │         sec/op          vs base                │
MergeByLanguage-12                                             529.9n ± 1%              531.5n ± 2%        ~ (p=0.305 n=10)
ResourceChainPostProcess-12                                    62.76m ± 3%              56.23m ± 2%  -10.39% (p=0.000 n=10)
ReplaceShortcodeTokens-12                                      2.727µ ± 3%              2.701µ ± 7%        ~ (p=0.592 n=10)
geomean                                                        44.92µ                   43.22µ        -3.80%

shortname: k8s_cache
pkg: k8s.io/client-go/tools/cache
                           │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                           │                    sec/op                    │         sec/op           vs base               │
Listener-12                                                   1.312µ ± 1%               1.199µ ± 1%  -8.62% (p=0.000 n=10)
ReflectorResyncChanMany-12                                    785.7n ± 4%               796.3n ± 3%       ~ (p=0.089 n=10)
geomean                                                       1.015µ                    976.9n       -3.76%

            │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
            │                     B/op                     │            B/op             vs base            │
Listener-12                                     16.00 ± 0%                   16.00 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

            │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
            │                  allocs/op                   │         allocs/op           vs base            │
Listener-12                                     1.000 ± 0%                   1.000 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

shortname: k8s_workqueue
pkg: k8s.io/client-go/util/workqueue
                                                         │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                                                         │                    sec/op                    │         sec/op          vs base                │
ParallelizeUntil/pieces:1000,workers:10,chunkSize:1-12                                      244.6µ ± 1%              245.9µ ± 0%   +0.55% (p=0.023 n=10)
ParallelizeUntil/pieces:1000,workers:10,chunkSize:10-12                                     75.09µ ± 1%              63.54µ ± 1%  -15.37% (p=0.000 n=10)
ParallelizeUntil/pieces:1000,workers:10,chunkSize:100-12                                    49.47µ ± 2%              42.45µ ± 2%  -14.19% (p=0.000 n=10)
ParallelizeUntil/pieces:999,workers:10,chunkSize:13-12                                      68.51µ ± 1%              55.07µ ± 1%  -19.63% (p=0.000 n=10)
geomean                                                                                     88.82µ                   77.74µ       -12.47%

shortname: kanzi
pkg: github.com/flanglet/kanzi-go/benchmark
        │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
        │                    sec/op                    │         sec/op           vs base               │
BWTS-12                                   0.4479n ± 6%              0.4385n ± 7%       ~ (p=0.529 n=10)
FPAQ-12                                    17.03m ± 3%               17.42m ± 3%       ~ (p=0.123 n=10)
LZ-12                                      1.897m ± 2%               1.887m ± 4%       ~ (p=1.000 n=10)
MTFT-12                                    771.2µ ± 4%               785.8µ ± 3%       ~ (p=0.247 n=10)
geomean                                    57.79µ                    58.01µ       +0.38%

shortname: minio
pkg: github.com/minio/minio/cmd
                                │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                                │                    sec/op                    │         sec/op          vs base                │
DecodehealingTracker-12                                            852.8n ± 5%              866.8n ± 5%        ~ (p=0.190 n=10)
AppendMsgReplicateDecision-12                                     0.5383n ± 4%             0.7598n ± 3%  +41.13% (p=0.000 n=10)
AppendMsgResyncTargetsInfo-12                                      4.785n ± 2%              4.639n ± 3%   -3.06% (p=0.003 n=10)
DataUpdateTracker-12                                               3.122µ ± 2%              1.880µ ± 3%  -39.77% (p=0.000 n=10)
MarshalMsgdataUsageCacheInfo-12                                    110.9n ± 2%              109.4n ± 3%        ~ (p=0.101 n=10)
geomean                                                            59.74n                   57.50n        -3.75%

                              │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                              │                     B/s                      │          B/s            vs base                │
DecodehealingTracker-12                                         347.8Mi ± 5%             342.2Mi ± 6%        ~ (p=0.190 n=10)
AppendMsgReplicateDecision-12                                   1.730Gi ± 3%             1.226Gi ± 3%  -29.14% (p=0.000 n=10)
AppendMsgResyncTargetsInfo-12                                   1.946Gi ± 2%             2.008Gi ± 3%   +3.15% (p=0.003 n=10)
DataUpdateTracker-12                                            312.5Ki ± 3%             517.6Ki ± 2%  +65.62% (p=0.000 n=10)
geomean                                                         139.1Mi                  145.4Mi        +4.47%

                                │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                                │                     B/op                     │         B/op           vs base                 │
DecodehealingTracker-12                                           0.000 ± 0%                0.000 ± 0%       ~ (p=1.000 n=10) ¹
AppendMsgReplicateDecision-12                                     0.000 ± 0%                0.000 ± 0%       ~ (p=1.000 n=10) ¹
AppendMsgResyncTargetsInfo-12                                     0.000 ± 0%                0.000 ± 0%       ~ (p=1.000 n=10) ¹
DataUpdateTracker-12                                              340.0 ± 0%                339.0 ± 1%       ~ (p=0.737 n=10)
MarshalMsgdataUsageCacheInfo-12                                   96.00 ± 0%                96.00 ± 0%       ~ (p=1.000 n=10) ¹
geomean                                                                      ²                          -0.06%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                                │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                                │                  allocs/op                   │       allocs/op        vs base                 │
DecodehealingTracker-12                                           0.000 ± 0%                0.000 ± 0%       ~ (p=1.000 n=10) ¹
AppendMsgReplicateDecision-12                                     0.000 ± 0%                0.000 ± 0%       ~ (p=1.000 n=10) ¹
AppendMsgResyncTargetsInfo-12                                     0.000 ± 0%                0.000 ± 0%       ~ (p=1.000 n=10) ¹
DataUpdateTracker-12                                              9.000 ± 0%                9.000 ± 0%       ~ (p=1.000 n=10) ¹
MarshalMsgdataUsageCacheInfo-12                                   1.000 ± 0%                1.000 ± 0%       ~ (p=1.000 n=10) ¹
geomean                                                                      ²                          +0.00%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

shortname: semver
pkg: github.com/Masterminds/semver
                            │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                            │                    sec/op                    │            sec/op             vs base          │
ValidateVersionTildeFail-12                                    854.7n ± 2%                    842.7n ± 2%  ~ (p=0.123 n=10)

shortname: shopify_sarama
pkg: github.com/Shopify/sarama
                          │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                          │                    sec/op                    │         sec/op           vs base               │
Broker_Open-12                                               212.2µ ± 1%               205.9µ ± 2%  -2.95% (p=0.000 n=10)
Broker_No_Metrics_Open-12                                    132.9µ ± 1%               121.3µ ± 2%  -8.68% (p=0.000 n=10)
geomean                                                      167.9µ                    158.1µ       -5.86%

shortname: spexs2
pkg: github.com/egonelbre/spexs2/_benchmark
              │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
              │                    sec/op                    │         sec/op           vs base               │
Run/10k/1-12                                      23.29 ± 1%                23.11 ± 2%       ~ (p=0.315 n=10)
Run/10k/16-12                                     5.648 ± 2%                5.462 ± 4%  -3.30% (p=0.004 n=10)
geomean                                           11.47                     11.23       -2.06%

shortname: sweet-biogo-igor
goos:
goarch:
pkg:
cpu:
          │ ./sweet/results/biogo-igor/baseline.results │ ./sweet/results/biogo-igor/experiment.results │
          │                   sec/op                    │           sec/op             vs base          │
BiogoIgor                                    13.53 ± 1%                    13.62 ± 1%  ~ (p=0.165 n=10)

          │ ./sweet/results/biogo-igor/baseline.results │ ./sweet/results/biogo-igor/experiment.results │
          │              average-RSS-bytes              │      average-RSS-bytes       vs base          │
BiogoIgor                                  62.19Mi ± 3%                  62.86Mi ± 1%  ~ (p=0.247 n=10)

          │ ./sweet/results/biogo-igor/baseline.results │ ./sweet/results/biogo-igor/experiment.results │
          │               peak-RSS-bytes                │       peak-RSS-bytes         vs base          │
BiogoIgor                                  89.57Mi ± 4%                  89.03Mi ± 3%  ~ (p=0.516 n=10)

          │ ./sweet/results/biogo-igor/baseline.results │ ./sweet/results/biogo-igor/experiment.results │
          │                peak-VM-bytes                │        peak-VM-bytes         vs base          │
BiogoIgor                                  766.4Mi ± 0%                  766.4Mi ± 0%  ~ (p=0.954 n=10)

shortname: sweet-biogo-krishna
             │ ./sweet/results/biogo-krishna/baseline.results │ ./sweet/results/biogo-krishna/experiment.results │
             │                     sec/op                     │          sec/op            vs base               │
BiogoKrishna                                       12.70 ± 2%                  12.09 ± 3%  -4.86% (p=0.000 n=10)

             │ ./sweet/results/biogo-krishna/baseline.results │ ./sweet/results/biogo-krishna/experiment.results │
             │               average-RSS-bytes                │       average-RSS-bytes         vs base          │
BiogoKrishna                                     4.085Gi ± 0%                     4.083Gi ± 0%  ~ (p=0.105 n=10)

             │ ./sweet/results/biogo-krishna/baseline.results │ ./sweet/results/biogo-krishna/experiment.results │
             │                 peak-RSS-bytes                 │         peak-RSS-bytes          vs base          │
BiogoKrishna                                     4.174Gi ± 0%                     4.173Gi ± 0%  ~ (p=0.853 n=10)

             │ ./sweet/results/biogo-krishna/baseline.results │ ./sweet/results/biogo-krishna/experiment.results │
             │                 peak-VM-bytes                  │         peak-VM-bytes           vs base          │
BiogoKrishna                                     4.877Gi ± 0%                     4.877Gi ± 0%  ~ (p=0.591 n=10)

shortname: sweet-bleve-index
                   │ ./sweet/results/bleve-index/baseline.results │ ./sweet/results/bleve-index/experiment.results │
                   │                    sec/op                    │            sec/op             vs base          │
BleveIndexBatch100                                     4.675 ± 1%                     4.669 ± 1%  ~ (p=0.739 n=10)

                   │ ./sweet/results/bleve-index/baseline.results │ ./sweet/results/bleve-index/experiment.results │
                   │              average-RSS-bytes               │      average-RSS-bytes        vs base          │
BleveIndexBatch100                                   185.5Mi ± 1%                   185.9Mi ± 1%  ~ (p=0.796 n=10)

                   │ ./sweet/results/bleve-index/baseline.results │ ./sweet/results/bleve-index/experiment.results │
                   │                peak-RSS-bytes                │        peak-RSS-bytes         vs base          │
BleveIndexBatch100                                   267.5Mi ± 6%                   265.0Mi ± 2%  ~ (p=0.739 n=10)

                   │ ./sweet/results/bleve-index/baseline.results │ ./sweet/results/bleve-index/experiment.results │
                   │                peak-VM-bytes                 │        peak-VM-bytes          vs base          │
BleveIndexBatch100                                   1.945Gi ± 4%                   1.945Gi ± 0%  ~ (p=0.725 n=10)

shortname: sweet-go-build
                    │ ./sweet/results/go-build/baseline.results │ ./sweet/results/go-build/experiment.results │
                    │                  sec/op                   │        sec/op         vs base               │
GoBuildKubelet                                       51.32 ± 0%             51.38 ± 3%       ~ (p=0.105 n=10)
GoBuildKubeletLink                                   7.669 ± 1%             7.663 ± 2%       ~ (p=0.579 n=10)
GoBuildIstioctl                                      46.02 ± 0%             46.07 ± 0%       ~ (p=0.739 n=10)
GoBuildIstioctlLink                                  8.174 ± 1%             8.143 ± 2%       ~ (p=0.436 n=10)
GoBuildFrontend                                      16.17 ± 1%             16.10 ± 1%       ~ (p=0.143 n=10)
GoBuildFrontendLink                                  1.399 ± 3%             1.377 ± 3%       ~ (p=0.218 n=10)
geomean                                              12.23                  12.18       -0.39%

shortname: sweet-gopher-lua
                     │ ./sweet/results/gopher-lua/baseline.results │ ./sweet/results/gopher-lua/experiment.results │
                     │                   sec/op                    │           sec/op             vs base          │
GopherLuaKNucleotide                                    22.71 ± 1%                    22.86 ± 1%  ~ (p=0.218 n=10)

                     │ ./sweet/results/gopher-lua/baseline.results │ ./sweet/results/gopher-lua/experiment.results │
                     │              average-RSS-bytes              │      average-RSS-bytes       vs base          │
GopherLuaKNucleotide                                  36.64Mi ± 2%                  36.40Mi ± 1%  ~ (p=0.631 n=10)

                     │ ./sweet/results/gopher-lua/baseline.results │ ./sweet/results/gopher-lua/experiment.results │
                     │               peak-RSS-bytes                │       peak-RSS-bytes         vs base          │
GopherLuaKNucleotide                                  43.28Mi ± 5%                  41.55Mi ± 7%  ~ (p=0.089 n=10)

                     │ ./sweet/results/gopher-lua/baseline.results │ ./sweet/results/gopher-lua/experiment.results │
                     │                peak-VM-bytes                │     peak-VM-bytes       vs base               │
GopherLuaKNucleotide                                  699.6Mi ± 0%             699.9Mi ± 0%  +0.04% (p=0.006 n=10)

shortname: sweet-markdown
                    │ ./sweet/results/markdown/baseline.results │ ./sweet/results/markdown/experiment.results │
                    │                  sec/op                   │          sec/op            vs base          │
MarkdownRenderXHTML                                 260.6m ± 4%                 256.4m ± 4%  ~ (p=0.796 n=10)

                    │ ./sweet/results/markdown/baseline.results │ ./sweet/results/markdown/experiment.results │
                    │             average-RSS-bytes             │     average-RSS-bytes      vs base          │
MarkdownRenderXHTML                                20.47Mi ± 1%                20.71Mi ± 2%  ~ (p=0.393 n=10)

                    │ ./sweet/results/markdown/baseline.results │ ./sweet/results/markdown/experiment.results │
                    │              peak-RSS-bytes               │      peak-RSS-bytes        vs base          │
MarkdownRenderXHTML                               20.88Mi ± 11%                21.73Mi ± 6%  ~ (p=0.470 n=10)

                    │ ./sweet/results/markdown/baseline.results │ ./sweet/results/markdown/experiment.results │
                    │               peak-VM-bytes               │       peak-VM-bytes        vs base          │
MarkdownRenderXHTML                                699.2Mi ± 0%                699.3Mi ± 0%  ~ (p=0.464 n=10)

shortname: sweet-tile38
                                   │ ./sweet/results/tile38/baseline.results │ ./sweet/results/tile38/experiment.results │
                                   │                 sec/op                  │       sec/op        vs base               │
Tile38WithinCircle100kmRequest                                   529.1µ ± 1%          530.3µ ± 1%       ~ (p=0.143 n=10)
Tile38IntersectsCircle100kmRequest                               629.6µ ± 1%          630.8µ ± 1%       ~ (p=0.971 n=10)
Tile38KNearestLimit100Request                                    446.4µ ± 1%          453.7µ ± 1%  +1.62% (p=0.000 n=10)
geomean                                                          529.8µ               533.4µ       +0.67%

                                   │ ./sweet/results/tile38/baseline.results │ ./sweet/results/tile38/experiment.results │
                                   │            average-RSS-bytes            │ average-RSS-bytes   vs base               │
Tile38WithinCircle100kmRequest                                  5.054Gi ± 1%         5.057Gi ± 1%       ~ (p=0.796 n=10)
Tile38IntersectsCircle100kmRequest                              5.381Gi ± 0%         5.431Gi ± 1%  +0.94% (p=0.019 n=10)
Tile38KNearestLimit100Request                                   6.801Gi ± 0%         6.802Gi ± 0%       ~ (p=0.684 n=10)
geomean                                                         5.697Gi              5.717Gi       +0.34%

                                   │ ./sweet/results/tile38/baseline.results │ ./sweet/results/tile38/experiment.results │
                                   │             peak-RSS-bytes              │   peak-RSS-bytes    vs base               │
Tile38WithinCircle100kmRequest                                  5.380Gi ± 1%         5.381Gi ± 1%       ~ (p=0.912 n=10)
Tile38IntersectsCircle100kmRequest                              5.669Gi ± 1%         5.756Gi ± 1%  +1.53% (p=0.019 n=10)
Tile38KNearestLimit100Request                                   7.013Gi ± 0%         7.011Gi ± 0%       ~ (p=0.796 n=10)
geomean                                                         5.980Gi              6.010Gi       +0.50%

                                   │ ./sweet/results/tile38/baseline.results │ ./sweet/results/tile38/experiment.results │
                                   │              peak-VM-bytes              │   peak-VM-bytes     vs base               │
Tile38WithinCircle100kmRequest                                  6.047Gi ± 1%         6.047Gi ± 1%       ~ (p=0.725 n=10)
Tile38IntersectsCircle100kmRequest                              6.305Gi ± 1%         6.402Gi ± 2%  +1.53% (p=0.035 n=10)
Tile38KNearestLimit100Request                                   7.685Gi ± 0%         7.685Gi ± 0%       ~ (p=0.955 n=10)
geomean                                                         6.642Gi              6.676Gi       +0.51%

                                   │ ./sweet/results/tile38/baseline.results │ ./sweet/results/tile38/experiment.results │
                                   │             p50-latency-sec             │  p50-latency-sec    vs base               │
Tile38WithinCircle100kmRequest                                   88.81µ ± 1%          89.36µ ± 1%  +0.61% (p=0.043 n=10)
Tile38IntersectsCircle100kmRequest                               151.5µ ± 1%          152.0µ ± 1%       ~ (p=0.089 n=10)
Tile38KNearestLimit100Request                                    259.0µ ± 0%          259.1µ ± 0%       ~ (p=0.853 n=10)
geomean                                                          151.6µ               152.1µ       +0.33%

                                   │ ./sweet/results/tile38/baseline.results │ ./sweet/results/tile38/experiment.results │
                                   │             p90-latency-sec             │  p90-latency-sec    vs base               │
Tile38WithinCircle100kmRequest                                   712.5µ ± 0%          713.9µ ± 1%       ~ (p=0.190 n=10)
Tile38IntersectsCircle100kmRequest                               960.6µ ± 1%          958.2µ ± 1%       ~ (p=0.739 n=10)
Tile38KNearestLimit100Request                                    1.007m ± 1%          1.032m ± 1%  +2.50% (p=0.000 n=10)
geomean                                                          883.4µ               890.5µ       +0.80%

                                   │ ./sweet/results/tile38/baseline.results │ ./sweet/results/tile38/experiment.results │
                                   │             p99-latency-sec             │  p99-latency-sec    vs base               │
Tile38WithinCircle100kmRequest                                   7.061m ± 1%          7.085m ± 1%       ~ (p=0.481 n=10)
Tile38IntersectsCircle100kmRequest                               7.228m ± 1%          7.187m ± 1%       ~ (p=0.143 n=10)
Tile38KNearestLimit100Request                                    2.085m ± 0%          2.131m ± 1%  +2.22% (p=0.000 n=10)
geomean                                                          4.738m               4.770m       +0.66%

                                   │ ./sweet/results/tile38/baseline.results │ ./sweet/results/tile38/experiment.results │
                                   │                  ops/s                  │       ops/s         vs base               │
Tile38WithinCircle100kmRequest                                   17.01k ± 1%          16.97k ± 1%       ~ (p=0.143 n=10)
Tile38IntersectsCircle100kmRequest                               14.29k ± 1%          14.27k ± 1%       ~ (p=0.988 n=10)
Tile38KNearestLimit100Request                                    20.16k ± 1%          19.84k ± 1%  -1.59% (p=0.000 n=10)
geomean                                                          16.99k               16.87k       -0.67%

shortname: uber_tally
goos: linux
goarch: amd64
pkg: github.com/uber-go/tally
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
                                │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                                │                    sec/op                    │         sec/op           vs base               │
ScopeTaggedNoCachedSubscopes-12                                    2.867µ ± 4%               2.921µ ± 4%       ~ (p=0.579 n=10)
HistogramAllocation-12                                             1.519µ ± 3%               1.507µ ± 7%       ~ (p=0.631 n=10)
geomean                                                            2.087µ                    2.098µ       +0.53%

                       │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                       │                     B/op                     │             B/op              vs base          │
HistogramAllocation-12                                   1.124Ki ± 1%                   1.125Ki ± 4%  ~ (p=0.271 n=10)

                       │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                       │                  allocs/op                   │         allocs/op           vs base            │
HistogramAllocation-12                                     20.00 ± 0%                   20.00 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

shortname: uber_zap
pkg: go.uber.org/zap/zapcore
                                              │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                                              │                    sec/op                    │         sec/op          vs base                │
BufferedWriteSyncer/write_file_with_buffer-12                                   296.1n ± 12%             205.9n ± 10%  -30.46% (p=0.000 n=10)
MultiWriteSyncer/2_discarder-12                                                 7.528n ±  4%             7.014n ±  2%   -6.83% (p=0.000 n=10)
MultiWriteSyncer/4_discarder-12                                                 9.065n ±  1%             8.908n ±  1%   -1.73% (p=0.002 n=10)
MultiWriteSyncer/4_discarder_with_buffer-12                                     225.2n ±  2%             147.6n ±  2%  -34.48% (p=0.000 n=10)
WriteSyncer/write_file_with_no_buffer-12                                        4.785µ ±  1%             4.933µ ±  3%   +3.08% (p=0.001 n=10)
ZapConsole-12                                                                   702.5n ±  1%             649.1n ±  1%   -7.62% (p=0.000 n=10)
JSONLogMarshalerFunc-12                                                         1.219µ ±  2%             1.226µ ±  3%        ~ (p=0.781 n=10)
ZapJSON-12                                                                      555.4n ±  1%             480.9n ±  3%  -13.40% (p=0.000 n=10)
StandardJSON-12                                                                 814.1n ±  1%             809.0n ±  0%        ~ (p=0.101 n=10)
Sampler_Check/7_keys-12                                                         10.55n ±  2%             10.61n ±  1%        ~ (p=0.594 n=10)
Sampler_Check/50_keys-12                                                        11.01n ±  0%             10.98n ±  1%        ~ (p=0.286 n=10)
Sampler_Check/100_keys-12                                                       10.71n ±  0%             10.71n ±  0%        ~ (p=0.563 n=10)
Sampler_CheckWithHook/7_keys-12                                                 20.20n ±  2%             20.42n ±  2%        ~ (p=0.446 n=10)
Sampler_CheckWithHook/50_keys-12                                                20.72n ±  2%             21.02n ±  1%        ~ (p=0.078 n=10)
Sampler_CheckWithHook/100_keys-12                                               20.15n ±  2%             20.68n ±  3%   +2.63% (p=0.037 n=10)
TeeCheck-12                                                                     140.8n ±  2%             140.5n ±  2%        ~ (p=0.754 n=10)
geomean                                                                         87.80n                   82.39n         -6.15%

The only large regression (in ethereum_bitutil's BaseTest2KB) appears to
be spurious: profiling confirms that the test involves no goroutines (and
no B.RunParallel()).

Updates golang/go#18237
Related to golang/go#32113
nixprime added a commit to nixprime/go that referenced this issue Mar 3, 2023
The most recently goready()'d G on each P is given a special position in
the P's runqueue, p.runnext. Other Ps steal p.runnext only as a last
resort, and usleep(3) before doing so: findRunnable() => stealWork() =>
runqsteal() => runqgrab(). As documented in runqgrab(), this is to
reduce thrashing of Gs between Ps in cases where one goroutine wakes another
and then "almost immediately" blocks.

On Linux, usleep() is implemented by invoking the nanosleep system call.
Syscall timeouts in the Linux kernel are subject to timer slack, as
documented by the man page for syscall prctl, section
"PR_SET_TIMERSLACK". Experimentally, short timeouts can expect to expire
50 microseconds late regardless of other system activity. Thus, on
Linux, usleep(3) typically sleeps for at least 53 microseconds, more
than 17x longer than intended.

A P must be in the spinning state in order to attempt work-stealing.
While at least one P is spinning, wakep() will refuse to wake a new
spinning P. One P sleeping in runqgrab() thus prevents further threads
from being woken in response to e.g. goroutine wakeups *globally*
(throughout the process). Futex wake-to-wakeup latency is approximately
20 microseconds, so sleeping for 53 microseconds can significantly
increase goroutine wakeup latency by delaying thread wakeup.

Fix this by timestamping Gs when they are runqput() into p.runnext, and
causing runqgrab() to indicate to findRunnable() that it should loop if
p.runnext is not yet stealable.

Alternative fixes considered:

- osyield() on Linux as we do on a few other platforms. On Linux,
  osyield() is implemented by the sched_yield system call, which IIUC
  causes the calling thread to yield its timeslice to any thread on its
  runqueue that it would not preempt on wakeup, potentially introducing
  even larger latencies on busy systems. See also
  https://www.realworldtech.com/forum/?threadid=189711&curpostid=189752
  for a case against sched_yield on semantic grounds.

- Replace the usleep() with an in-place spin loop. This tends to waste
  the spinning P's time, since it cannot check other runqueues while
  spinning, and the number of calls to runqgrab(), and therefore of
  sleeps, is linear in the number of Ps. Empirically, it introduces
  regressions not observed in this change.

- Change thread timer slack using prctl(PR_SET_TIMERSLACK). In practice,
  user programs will have been tuned based on the default timer slack
  value, so tampering with this may introduce regressions into existing
  programs.

Unfortunately, this is a load-bearing bug. In programs with goroutines
that frequently wake up goroutines and then immediately block, this bug
significantly reduces overhead from useless thread wakeups in wakep().
In golang.org/x/benchmarks, this manifests most clearly as regressions
in benchmark dustin_broadcast. To avoid this regression, we need to
intentionally throttle wakep() => acquirem().

Thus, this change also introduces a "need-wakep()" prediction mechanism,
which causes goready() and newproc() to call wakep() only if the calling
goroutine is predicted not to immediately block. To handle
mispredictions, sysmon is changed to wakep() if it detects
underutilization. The current prediction algorithm is simple, but
appears to be effective; it can be improved in the future as warranted.
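
As a rough illustration of the shape such a predictor could take (this is
my sketch under assumed semantics, not the algorithm from the change): a
per-goroutine saturating counter that is bumped whenever the goroutine
blocks shortly after readying another G, and that suppresses wakep() once
it saturates.

```go
package main

import "fmt"

// predictor is a hypothetical per-goroutine saturating 2-bit counter.
// A value >= 2 predicts that the goroutine will block almost
// immediately after readying another G, making a wakep() wasteful.
type predictor struct{ counter uint8 } // 0..3

// recordBlockedSoon updates history after a goready(): did the caller
// block before reaching its next scheduling point?
func (p *predictor) recordBlockedSoon(blocked bool) {
	if blocked {
		if p.counter < 3 {
			p.counter++
		}
	} else if p.counter > 0 {
		p.counter--
	}
}

// shouldWakep reports whether goready()/newproc() should call wakep().
func (p *predictor) shouldWakep() bool { return p.counter < 2 }

func main() {
	var pred predictor
	fmt.Println(pred.shouldWakep()) // no history: wake a P
	pred.recordBlockedSoon(true)
	pred.recordBlockedSoon(true)
	fmt.Println(pred.shouldWakep()) // repeated immediate blocks: skip wakep
}
```

A misprediction here only delays a thread wakeup; as described above,
sysmon acts as the backstop by calling wakep() when it detects
underutilization.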

Results from golang.org/x/benchmarks:
(Baseline is go1.20.1; experiment is go1.20.1 plus this change)

shortname: ajstarks_deck_generate
goos: linux
goarch: amd64
pkg: github.com/ajstarks/deck/generate
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
           │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
           │                    sec/op                    │         sec/op           vs base               │
Arc-12                                        3.857µ ± 5%               3.753µ ± 5%       ~ (p=0.424 n=10)
Polygon-12                                    7.074µ ± 6%               6.969µ ± 4%       ~ (p=0.190 n=10)
geomean                                       5.224µ                    5.114µ       -2.10%

shortname: aws_jsonutil
pkg: github.com/aws/aws-sdk-go/private/protocol/json/jsonutil
              │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
              │                    sec/op                    │         sec/op           vs base               │
BuildJSON-12                                     5.602µ ± 3%               5.600µ ± 2%       ~ (p=0.896 n=10)
StdlibJSON-12                                    3.843µ ± 2%               3.828µ ± 2%       ~ (p=0.224 n=10)
geomean                                          4.640µ                    4.630µ       -0.22%

shortname: benhoyt_goawk_1_18
pkg: github.com/benhoyt/goawk/interp
                       │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                       │                    sec/op                    │         sec/op           vs base               │
RecursiveFunc-12                                          17.79µ ± 3%               17.65µ ± 3%       ~ (p=0.436 n=10)
RegexMatch-12                                             815.8n ± 4%               823.3n ± 1%       ~ (p=0.353 n=10)
RepeatExecProgram-12                                      21.30µ ± 6%               21.69µ ± 3%       ~ (p=0.052 n=10)
RepeatNew-12                                              79.21n ± 4%               79.73n ± 3%       ~ (p=0.529 n=10)
RepeatIOExecProgram-12                                    41.83µ ± 1%               42.07µ ± 2%       ~ (p=0.796 n=10)
RepeatIONew-12                                            1.195µ ± 3%               1.196µ ± 2%       ~ (p=1.000 n=10)
geomean                                                   3.271µ                    3.288µ       +0.54%

shortname: bindata
pkg: github.com/kevinburke/go-bindata
           │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
           │                    sec/op                    │            sec/op             vs base          │
Bindata-12                                    316.2m ± 5%                    309.7m ± 4%  ~ (p=0.436 n=10)

           │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
           │                     B/s                      │             B/s               vs base          │
Bindata-12                                   20.71Mi ± 5%                   21.14Mi ± 4%  ~ (p=0.436 n=10)

           │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
           │                     B/op                     │             B/op              vs base          │
Bindata-12                                   183.0Mi ± 0%                   183.0Mi ± 0%  ~ (p=0.353 n=10)

           │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
           │                  allocs/op                   │          allocs/op            vs base          │
Bindata-12                                    5.790k ± 0%                    5.789k ± 0%  ~ (p=0.358 n=10)

shortname: bloom_bloom
pkg: github.com/bits-and-blooms/bloom/v3
                      │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                      │                    sec/op                    │         sec/op           vs base               │
SeparateTestAndAdd-12                                    414.6n ± 4%               413.9n ± 2%       ~ (p=0.895 n=10)
CombinedTestAndAdd-12                                    425.8n ± 9%               419.8n ± 8%       ~ (p=0.353 n=10)
geomean                                                  420.2n                    416.9n       -0.78%

shortname: capnproto2
pkg: zombiezen.com/go/capnproto2
                               │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                               │                    sec/op                    │         sec/op           vs base               │
TextMovementBetweenSegments-12                                    320.5µ ± 5%              318.4µ ± 10%       ~ (p=0.579 n=10)
Growth_MultiSegment-12                                            13.63m ± 1%              13.87m ±  2%  +1.71% (p=0.029 n=10)
geomean                                                           2.090m                   2.101m        +0.52%

                       │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                       │                     B/s                      │           B/s            vs base               │
Growth_MultiSegment-12                                   73.35Mi ± 1%              72.12Mi ± 2%  -1.68% (p=0.027 n=10)

                       │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                       │                     B/op                     │             B/op              vs base          │
Growth_MultiSegment-12                                   1.572Mi ± 0%                   1.572Mi ± 0%  ~ (p=0.320 n=10)

                       │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                       │                  allocs/op                   │         allocs/op           vs base            │
Growth_MultiSegment-12                                     21.00 ± 0%                   21.00 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

shortname: cespare_mph
pkg: github.com/cespare/mph
         │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
         │                    sec/op                    │            sec/op             vs base          │
Build-12                                    32.72m ± 2%                    32.49m ± 1%  ~ (p=0.280 n=10)

shortname: commonmark_markdown
pkg: gitlab.com/golang-commonmark/markdown
                          │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                          │                    sec/op                    │         sec/op           vs base               │
RenderSpecNoHTML-12                                          10.09m ± 2%               10.18m ± 3%       ~ (p=0.796 n=10)
RenderSpec-12                                                10.19m ± 1%               10.11m ± 3%       ~ (p=0.684 n=10)
RenderSpecBlackFriday2-12                                    6.793m ± 5%               6.946m ± 2%       ~ (p=0.063 n=10)
geomean                                                      8.872m                    8.944m       +0.81%

shortname: dustin_broadcast
pkg: github.com/dustin/go-broadcast
                      │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                      │                    sec/op                    │         sec/op          vs base                │
DirectSend-12                                            570.5n ± 7%              355.2n ± 2%  -37.74% (p=0.000 n=10)
ParallelDirectSend-12                                    549.0n ± 5%              360.9n ± 3%  -34.25% (p=0.000 n=10)
ParallelBrodcast-12                                      788.7n ± 2%              486.0n ± 4%  -38.37% (p=0.000 n=10)
MuxBrodcast-12                                           788.6n ± 4%              471.5n ± 6%  -40.21% (p=0.000 n=10)
geomean                                                  664.4n                   414.0n       -37.68%

shortname: dustin_humanize
pkg: github.com/dustin/go-humanize
                 │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                 │                    sec/op                    │            sec/op             vs base          │
ParseBigBytes-12                                    1.964µ ± 5%                    1.941µ ± 3%  ~ (p=0.289 n=10)

shortname: ericlagergren_decimal
pkg: github.com/ericlagergren/decimal/benchmarks
                                       │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                                       │                    sec/op                    │         sec/op           vs base               │
Pi/foo=ericlagergren_(Go)/prec=100-12                                     147.5µ ± 2%               147.5µ ± 1%       ~ (p=0.912 n=10)
Pi/foo=ericlagergren_(GDA)/prec=100-12                                    329.6µ ± 1%               332.1µ ± 2%       ~ (p=0.063 n=10)
Pi/foo=shopspring/prec=100-12                                             680.5µ ± 4%               688.6µ ± 2%       ~ (p=0.481 n=10)
Pi/foo=apmckinlay/prec=100-12                                             2.541µ ± 4%               2.525µ ± 3%       ~ (p=0.218 n=10)
Pi/foo=go-inf/prec=100-12                                                 169.5µ ± 3%               170.7µ ± 3%       ~ (p=0.218 n=10)
Pi/foo=float64/prec=100-12                                                4.136µ ± 3%               4.162µ ± 6%       ~ (p=0.436 n=10)
geomean                                                                   62.38µ                    62.66µ       +0.45%

shortname: ethereum_bitutil
pkg: github.com/ethereum/go-ethereum/common/bitutil
                         │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                         │                    sec/op                    │         sec/op          vs base                │
FastTest2KB-12                                              130.4n ± 1%              131.5n ± 1%        ~ (p=0.093 n=10)
BaseTest2KB-12                                              624.8n ± 2%              983.0n ± 2%  +57.32% (p=0.000 n=10)
Encoding4KBVerySparse-12                                    21.48µ ± 3%              22.20µ ± 3%   +3.37% (p=0.005 n=10)
geomean                                                     1.205µ                   1.421µ       +17.94%

                         │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                         │                     B/op                     │            B/op             vs base            │
Encoding4KBVerySparse-12                                   9.750Ki ± 0%                 9.750Ki ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

                         │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                         │                  allocs/op                   │         allocs/op           vs base            │
Encoding4KBVerySparse-12                                     15.00 ± 0%                   15.00 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

shortname: ethereum_core
pkg: github.com/ethereum/go-ethereum/core
                             │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                             │                    sec/op                    │         sec/op           vs base               │
PendingDemotion10000-12                                         96.72n ± 4%               98.55n ± 2%       ~ (p=0.055 n=10)
FuturePromotion10000-12                                         2.128n ± 3%               2.093n ± 3%       ~ (p=0.896 n=10)
PoolBatchInsert10000-12                                         642.6m ± 2%               642.1m ± 5%       ~ (p=0.796 n=10)
PoolBatchLocalInsert10000-12                                    805.2m ± 2%               826.6m ± 4%       ~ (p=0.105 n=10)
geomean                                                         101.6µ                    102.3µ       +0.69%

shortname: ethereum_corevm
pkg: github.com/ethereum/go-ethereum/core/vm
            │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
            │                    sec/op                    │         sec/op           vs base               │
OpDiv128-12                                    137.4n ± 3%               139.5n ± 1%  +1.56% (p=0.024 n=10)

shortname: ethereum_ecies
pkg: github.com/ethereum/go-ethereum/crypto/ecies
                    │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                    │                    sec/op                    │         sec/op           vs base               │
GenerateKeyP256-12                                     15.67µ ± 6%               15.66µ ± 3%       ~ (p=0.971 n=10)
GenSharedKeyP256-12                                    51.09µ ± 6%               52.09µ ± 4%       ~ (p=0.631 n=10)
GenSharedKeyS256-12                                    47.24µ ± 2%               46.67µ ± 3%       ~ (p=0.247 n=10)
geomean                                                33.57µ                    33.64µ       +0.21%

shortname: ethereum_ethash
pkg: github.com/ethereum/go-ethereum/consensus/ethash
                  │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                  │                    sec/op                    │            sec/op             vs base          │
HashimotoLight-12                                    1.116m ± 5%                    1.112m ± 2%  ~ (p=0.684 n=10)

shortname: ethereum_trie
pkg: github.com/ethereum/go-ethereum/trie
                                │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                                │                    sec/op                    │         sec/op           vs base               │
HashFixedSize/10K-12                                               9.236m ± 1%               9.106m ± 1%  -1.40% (p=0.019 n=10)
CommitAfterHashFixedSize/10K-12                                    19.60m ± 1%               19.51m ± 1%       ~ (p=0.796 n=10)
geomean                                                            13.45m                    13.33m       -0.93%

                                │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                                │                     B/op                     │          B/op            vs base               │
HashFixedSize/10K-12                                              6.036Mi ± 0%              6.037Mi ± 0%       ~ (p=0.247 n=10)
CommitAfterHashFixedSize/10K-12                                   8.626Mi ± 0%              8.626Mi ± 0%       ~ (p=0.280 n=10)
geomean                                                           7.216Mi                   7.216Mi       +0.01%

                                │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                                │                  allocs/op                   │        allocs/op         vs base               │
HashFixedSize/10K-12                                               77.17k ± 0%               77.17k ± 0%       ~ (p=0.050 n=10)
CommitAfterHashFixedSize/10K-12                                    79.99k ± 0%               79.99k ± 0%       ~ (p=0.391 n=10)
geomean                                                            78.56k                    78.57k       +0.00%

shortname: gonum_blas_native
pkg: gonum.org/v1/gonum/blas/gonum
                         │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                         │                    sec/op                    │         sec/op           vs base               │
Dnrm2MediumPosInc-12                                        1.953µ ± 2%               1.940µ ± 5%       ~ (p=0.989 n=10)
DasumMediumUnitaryInc-12                                    932.5n ± 1%               931.2n ± 1%       ~ (p=0.753 n=10)
geomean                                                     1.349µ                    1.344µ       -0.40%

shortname: gonum_community
pkg: gonum.org/v1/gonum/graph/community
                            │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                            │                    sec/op                    │            sec/op             vs base          │
LouvainDirectedMultiplex-12                                    26.40m ± 1%                    26.64m ± 1%  ~ (p=0.165 n=10)

shortname: gonum_lapack_native
pkg: gonum.org/v1/gonum/lapack/gonum
                      │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                      │                    sec/op                    │         sec/op           vs base               │
Dgeev/Circulant10-12                                     41.97µ ± 6%               42.90µ ± 4%       ~ (p=0.143 n=10)
Dgeev/Circulant100-12                                    12.13m ± 4%               12.30m ± 3%       ~ (p=0.796 n=10)
geomean                                                  713.4µ                    726.4µ       +1.81%

shortname: gonum_mat
pkg: gonum.org/v1/gonum/mat
                                  │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                                  │                    sec/op                    │         sec/op           vs base               │
MulWorkspaceDense1000Hundredth-12                                   89.78m ±  0%              81.48m ±  1%  -9.24% (p=0.000 n=10)
ScaleVec10000Inc20-12                                               7.204µ ± 36%              8.450µ ± 35%       ~ (p=0.853 n=10)
geomean                                                             804.2µ                    829.7µ        +3.18%

shortname: gonum_topo
pkg: gonum.org/v1/gonum/graph/topo
                          │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                          │                    sec/op                    │         sec/op           vs base               │
TarjanSCCGnp_10_tenth-12                                     7.251µ ± 1%               7.187µ ± 1%  -0.88% (p=0.025 n=10)
TarjanSCCGnp_1000_half-12                                    74.48m ± 2%               74.37m ± 4%       ~ (p=0.796 n=10)
geomean                                                      734.8µ                    731.1µ       -0.51%

shortname: gonum_traverse
pkg: gonum.org/v1/gonum/graph/traverse
                                     │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                                     │                    sec/op                    │         sec/op           vs base               │
WalkAllBreadthFirstGnp_10_tenth-12                                      3.517µ ± 1%               3.534µ ± 1%       ~ (p=0.343 n=10)
WalkAllBreadthFirstGnp_1000_tenth-12                                    11.12m ± 6%               11.19m ± 2%       ~ (p=0.631 n=10)
geomean                                                                 197.8µ                    198.9µ       +0.54%

shortname: gtank_blake2s
pkg: github.com/gtank/blake2s
          │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
          │                    sec/op                    │            sec/op             vs base          │
Hash8K-12                                    18.96µ ± 4%                    18.82µ ± 5%  ~ (p=0.579 n=10)

          │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
          │                     B/s                      │             B/s               vs base          │
Hash8K-12                                   412.2Mi ± 4%                   415.2Mi ± 5%  ~ (p=0.579 n=10)

shortname: hugo_hugolib
pkg: github.com/gohugoio/hugo/hugolib
                            │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                            │                    sec/op                    │         sec/op          vs base                │
MergeByLanguage-12                                             529.9n ± 1%              531.5n ± 2%        ~ (p=0.305 n=10)
ResourceChainPostProcess-12                                    62.76m ± 3%              56.23m ± 2%  -10.39% (p=0.000 n=10)
ReplaceShortcodeTokens-12                                      2.727µ ± 3%              2.701µ ± 7%        ~ (p=0.592 n=10)
geomean                                                        44.92µ                   43.22µ        -3.80%

shortname: k8s_cache
pkg: k8s.io/client-go/tools/cache
                           │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                           │                    sec/op                    │         sec/op           vs base               │
Listener-12                                                   1.312µ ± 1%               1.199µ ± 1%  -8.62% (p=0.000 n=10)
ReflectorResyncChanMany-12                                    785.7n ± 4%               796.3n ± 3%       ~ (p=0.089 n=10)
geomean                                                       1.015µ                    976.9n       -3.76%

            │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
            │                     B/op                     │            B/op             vs base            │
Listener-12                                     16.00 ± 0%                   16.00 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

            │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
            │                  allocs/op                   │         allocs/op           vs base            │
Listener-12                                     1.000 ± 0%                   1.000 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

shortname: k8s_workqueue
pkg: k8s.io/client-go/util/workqueue
                                                         │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                                                         │                    sec/op                    │         sec/op          vs base                │
ParallelizeUntil/pieces:1000,workers:10,chunkSize:1-12                                      244.6µ ± 1%              245.9µ ± 0%   +0.55% (p=0.023 n=10)
ParallelizeUntil/pieces:1000,workers:10,chunkSize:10-12                                     75.09µ ± 1%              63.54µ ± 1%  -15.37% (p=0.000 n=10)
ParallelizeUntil/pieces:1000,workers:10,chunkSize:100-12                                    49.47µ ± 2%              42.45µ ± 2%  -14.19% (p=0.000 n=10)
ParallelizeUntil/pieces:999,workers:10,chunkSize:13-12                                      68.51µ ± 1%              55.07µ ± 1%  -19.63% (p=0.000 n=10)
geomean                                                                                     88.82µ                   77.74µ       -12.47%

shortname: kanzi
pkg: github.com/flanglet/kanzi-go/benchmark
        │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
        │                    sec/op                    │         sec/op           vs base               │
BWTS-12                                   0.4479n ± 6%              0.4385n ± 7%       ~ (p=0.529 n=10)
FPAQ-12                                    17.03m ± 3%               17.42m ± 3%       ~ (p=0.123 n=10)
LZ-12                                      1.897m ± 2%               1.887m ± 4%       ~ (p=1.000 n=10)
MTFT-12                                    771.2µ ± 4%               785.8µ ± 3%       ~ (p=0.247 n=10)
geomean                                    57.79µ                    58.01µ       +0.38%

shortname: minio
pkg: github.com/minio/minio/cmd
                                │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                                │                    sec/op                    │         sec/op          vs base                │
DecodehealingTracker-12                                            852.8n ± 5%              866.8n ± 5%        ~ (p=0.190 n=10)
AppendMsgReplicateDecision-12                                     0.5383n ± 4%             0.7598n ± 3%  +41.13% (p=0.000 n=10)
AppendMsgResyncTargetsInfo-12                                      4.785n ± 2%              4.639n ± 3%   -3.06% (p=0.003 n=10)
DataUpdateTracker-12                                               3.122µ ± 2%              1.880µ ± 3%  -39.77% (p=0.000 n=10)
MarshalMsgdataUsageCacheInfo-12                                    110.9n ± 2%              109.4n ± 3%        ~ (p=0.101 n=10)
geomean                                                            59.74n                   57.50n        -3.75%

                              │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                              │                     B/s                      │          B/s            vs base                │
DecodehealingTracker-12                                         347.8Mi ± 5%             342.2Mi ± 6%        ~ (p=0.190 n=10)
AppendMsgReplicateDecision-12                                   1.730Gi ± 3%             1.226Gi ± 3%  -29.14% (p=0.000 n=10)
AppendMsgResyncTargetsInfo-12                                   1.946Gi ± 2%             2.008Gi ± 3%   +3.15% (p=0.003 n=10)
DataUpdateTracker-12                                            312.5Ki ± 3%             517.6Ki ± 2%  +65.62% (p=0.000 n=10)
geomean                                                         139.1Mi                  145.4Mi        +4.47%

                                │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                                │                     B/op                     │         B/op           vs base                 │
DecodehealingTracker-12                                           0.000 ± 0%                0.000 ± 0%       ~ (p=1.000 n=10) ¹
AppendMsgReplicateDecision-12                                     0.000 ± 0%                0.000 ± 0%       ~ (p=1.000 n=10) ¹
AppendMsgResyncTargetsInfo-12                                     0.000 ± 0%                0.000 ± 0%       ~ (p=1.000 n=10) ¹
DataUpdateTracker-12                                              340.0 ± 0%                339.0 ± 1%       ~ (p=0.737 n=10)
MarshalMsgdataUsageCacheInfo-12                                   96.00 ± 0%                96.00 ± 0%       ~ (p=1.000 n=10) ¹
geomean                                                                      ²                          -0.06%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                                │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                                │                  allocs/op                   │       allocs/op        vs base                 │
DecodehealingTracker-12                                           0.000 ± 0%                0.000 ± 0%       ~ (p=1.000 n=10) ¹
AppendMsgReplicateDecision-12                                     0.000 ± 0%                0.000 ± 0%       ~ (p=1.000 n=10) ¹
AppendMsgResyncTargetsInfo-12                                     0.000 ± 0%                0.000 ± 0%       ~ (p=1.000 n=10) ¹
DataUpdateTracker-12                                              9.000 ± 0%                9.000 ± 0%       ~ (p=1.000 n=10) ¹
MarshalMsgdataUsageCacheInfo-12                                   1.000 ± 0%                1.000 ± 0%       ~ (p=1.000 n=10) ¹
geomean                                                                      ²                          +0.00%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

shortname: semver
pkg: github.com/Masterminds/semver
                            │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                            │                    sec/op                    │            sec/op             vs base          │
ValidateVersionTildeFail-12                                    854.7n ± 2%                    842.7n ± 2%  ~ (p=0.123 n=10)

shortname: shopify_sarama
pkg: github.com/Shopify/sarama
                          │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                          │                    sec/op                    │         sec/op           vs base               │
Broker_Open-12                                               212.2µ ± 1%               205.9µ ± 2%  -2.95% (p=0.000 n=10)
Broker_No_Metrics_Open-12                                    132.9µ ± 1%               121.3µ ± 2%  -8.68% (p=0.000 n=10)
geomean                                                      167.9µ                    158.1µ       -5.86%

shortname: spexs2
pkg: github.com/egonelbre/spexs2/_benchmark
              │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
              │                    sec/op                    │         sec/op           vs base               │
Run/10k/1-12                                      23.29 ± 1%                23.11 ± 2%       ~ (p=0.315 n=10)
Run/10k/16-12                                     5.648 ± 2%                5.462 ± 4%  -3.30% (p=0.004 n=10)
geomean                                           11.47                     11.23       -2.06%

shortname: sweet-biogo-igor
          │ ./sweet/results/biogo-igor/baseline.results │ ./sweet/results/biogo-igor/experiment.results │
          │                   sec/op                    │           sec/op             vs base          │
BiogoIgor                                    13.53 ± 1%                    13.62 ± 1%  ~ (p=0.165 n=10)

          │ ./sweet/results/biogo-igor/baseline.results │ ./sweet/results/biogo-igor/experiment.results │
          │              average-RSS-bytes              │      average-RSS-bytes       vs base          │
BiogoIgor                                  62.19Mi ± 3%                  62.86Mi ± 1%  ~ (p=0.247 n=10)

          │ ./sweet/results/biogo-igor/baseline.results │ ./sweet/results/biogo-igor/experiment.results │
          │               peak-RSS-bytes                │       peak-RSS-bytes         vs base          │
BiogoIgor                                  89.57Mi ± 4%                  89.03Mi ± 3%  ~ (p=0.516 n=10)

          │ ./sweet/results/biogo-igor/baseline.results │ ./sweet/results/biogo-igor/experiment.results │
          │                peak-VM-bytes                │        peak-VM-bytes         vs base          │
BiogoIgor                                  766.4Mi ± 0%                  766.4Mi ± 0%  ~ (p=0.954 n=10)

shortname: sweet-biogo-krishna
             │ ./sweet/results/biogo-krishna/baseline.results │ ./sweet/results/biogo-krishna/experiment.results │
             │                     sec/op                     │          sec/op            vs base               │
BiogoKrishna                                       12.70 ± 2%                  12.09 ± 3%  -4.86% (p=0.000 n=10)

             │ ./sweet/results/biogo-krishna/baseline.results │ ./sweet/results/biogo-krishna/experiment.results │
             │               average-RSS-bytes                │       average-RSS-bytes         vs base          │
BiogoKrishna                                     4.085Gi ± 0%                     4.083Gi ± 0%  ~ (p=0.105 n=10)

             │ ./sweet/results/biogo-krishna/baseline.results │ ./sweet/results/biogo-krishna/experiment.results │
             │                 peak-RSS-bytes                 │         peak-RSS-bytes          vs base          │
BiogoKrishna                                     4.174Gi ± 0%                     4.173Gi ± 0%  ~ (p=0.853 n=10)

             │ ./sweet/results/biogo-krishna/baseline.results │ ./sweet/results/biogo-krishna/experiment.results │
             │                 peak-VM-bytes                  │         peak-VM-bytes           vs base          │
BiogoKrishna                                     4.877Gi ± 0%                     4.877Gi ± 0%  ~ (p=0.591 n=10)

shortname: sweet-bleve-index
                   │ ./sweet/results/bleve-index/baseline.results │ ./sweet/results/bleve-index/experiment.results │
                   │                    sec/op                    │            sec/op             vs base          │
BleveIndexBatch100                                     4.675 ± 1%                     4.669 ± 1%  ~ (p=0.739 n=10)

                   │ ./sweet/results/bleve-index/baseline.results │ ./sweet/results/bleve-index/experiment.results │
                   │              average-RSS-bytes               │      average-RSS-bytes        vs base          │
BleveIndexBatch100                                   185.5Mi ± 1%                   185.9Mi ± 1%  ~ (p=0.796 n=10)

                   │ ./sweet/results/bleve-index/baseline.results │ ./sweet/results/bleve-index/experiment.results │
                   │                peak-RSS-bytes                │        peak-RSS-bytes         vs base          │
BleveIndexBatch100                                   267.5Mi ± 6%                   265.0Mi ± 2%  ~ (p=0.739 n=10)

                   │ ./sweet/results/bleve-index/baseline.results │ ./sweet/results/bleve-index/experiment.results │
                   │                peak-VM-bytes                 │        peak-VM-bytes          vs base          │
BleveIndexBatch100                                   1.945Gi ± 4%                   1.945Gi ± 0%  ~ (p=0.725 n=10)

shortname: sweet-go-build
                    │ ./sweet/results/go-build/baseline.results │ ./sweet/results/go-build/experiment.results │
                    │                  sec/op                   │        sec/op         vs base               │
GoBuildKubelet                                       51.32 ± 0%             51.38 ± 3%       ~ (p=0.105 n=10)
GoBuildKubeletLink                                   7.669 ± 1%             7.663 ± 2%       ~ (p=0.579 n=10)
GoBuildIstioctl                                      46.02 ± 0%             46.07 ± 0%       ~ (p=0.739 n=10)
GoBuildIstioctlLink                                  8.174 ± 1%             8.143 ± 2%       ~ (p=0.436 n=10)
GoBuildFrontend                                      16.17 ± 1%             16.10 ± 1%       ~ (p=0.143 n=10)
GoBuildFrontendLink                                  1.399 ± 3%             1.377 ± 3%       ~ (p=0.218 n=10)
geomean                                              12.23                  12.18       -0.39%

shortname: sweet-gopher-lua
                     │ ./sweet/results/gopher-lua/baseline.results │ ./sweet/results/gopher-lua/experiment.results │
                     │                   sec/op                    │           sec/op             vs base          │
GopherLuaKNucleotide                                    22.71 ± 1%                    22.86 ± 1%  ~ (p=0.218 n=10)

                     │ ./sweet/results/gopher-lua/baseline.results │ ./sweet/results/gopher-lua/experiment.results │
                     │              average-RSS-bytes              │      average-RSS-bytes       vs base          │
GopherLuaKNucleotide                                  36.64Mi ± 2%                  36.40Mi ± 1%  ~ (p=0.631 n=10)

                     │ ./sweet/results/gopher-lua/baseline.results │ ./sweet/results/gopher-lua/experiment.results │
                     │               peak-RSS-bytes                │       peak-RSS-bytes         vs base          │
GopherLuaKNucleotide                                  43.28Mi ± 5%                  41.55Mi ± 7%  ~ (p=0.089 n=10)

                     │ ./sweet/results/gopher-lua/baseline.results │ ./sweet/results/gopher-lua/experiment.results │
                     │                peak-VM-bytes                │     peak-VM-bytes       vs base               │
GopherLuaKNucleotide                                  699.6Mi ± 0%             699.9Mi ± 0%  +0.04% (p=0.006 n=10)

shortname: sweet-markdown
                    │ ./sweet/results/markdown/baseline.results │ ./sweet/results/markdown/experiment.results │
                    │                  sec/op                   │          sec/op            vs base          │
MarkdownRenderXHTML                                 260.6m ± 4%                 256.4m ± 4%  ~ (p=0.796 n=10)

                    │ ./sweet/results/markdown/baseline.results │ ./sweet/results/markdown/experiment.results │
                    │             average-RSS-bytes             │     average-RSS-bytes      vs base          │
MarkdownRenderXHTML                                20.47Mi ± 1%                20.71Mi ± 2%  ~ (p=0.393 n=10)

                    │ ./sweet/results/markdown/baseline.results │ ./sweet/results/markdown/experiment.results │
                    │              peak-RSS-bytes               │      peak-RSS-bytes        vs base          │
MarkdownRenderXHTML                               20.88Mi ± 11%                21.73Mi ± 6%  ~ (p=0.470 n=10)

                    │ ./sweet/results/markdown/baseline.results │ ./sweet/results/markdown/experiment.results │
                    │               peak-VM-bytes               │       peak-VM-bytes        vs base          │
MarkdownRenderXHTML                                699.2Mi ± 0%                699.3Mi ± 0%  ~ (p=0.464 n=10)

shortname: sweet-tile38
                                   │ ./sweet/results/tile38/baseline.results │ ./sweet/results/tile38/experiment.results │
                                   │                 sec/op                  │       sec/op        vs base               │
Tile38WithinCircle100kmRequest                                   529.1µ ± 1%          530.3µ ± 1%       ~ (p=0.143 n=10)
Tile38IntersectsCircle100kmRequest                               629.6µ ± 1%          630.8µ ± 1%       ~ (p=0.971 n=10)
Tile38KNearestLimit100Request                                    446.4µ ± 1%          453.7µ ± 1%  +1.62% (p=0.000 n=10)
geomean                                                          529.8µ               533.4µ       +0.67%

                                   │ ./sweet/results/tile38/baseline.results │ ./sweet/results/tile38/experiment.results │
                                   │            average-RSS-bytes            │ average-RSS-bytes   vs base               │
Tile38WithinCircle100kmRequest                                  5.054Gi ± 1%         5.057Gi ± 1%       ~ (p=0.796 n=10)
Tile38IntersectsCircle100kmRequest                              5.381Gi ± 0%         5.431Gi ± 1%  +0.94% (p=0.019 n=10)
Tile38KNearestLimit100Request                                   6.801Gi ± 0%         6.802Gi ± 0%       ~ (p=0.684 n=10)
geomean                                                         5.697Gi              5.717Gi       +0.34%

                                   │ ./sweet/results/tile38/baseline.results │ ./sweet/results/tile38/experiment.results │
                                   │             peak-RSS-bytes              │   peak-RSS-bytes    vs base               │
Tile38WithinCircle100kmRequest                                  5.380Gi ± 1%         5.381Gi ± 1%       ~ (p=0.912 n=10)
Tile38IntersectsCircle100kmRequest                              5.669Gi ± 1%         5.756Gi ± 1%  +1.53% (p=0.019 n=10)
Tile38KNearestLimit100Request                                   7.013Gi ± 0%         7.011Gi ± 0%       ~ (p=0.796 n=10)
geomean                                                         5.980Gi              6.010Gi       +0.50%

                                   │ ./sweet/results/tile38/baseline.results │ ./sweet/results/tile38/experiment.results │
                                   │              peak-VM-bytes              │   peak-VM-bytes     vs base               │
Tile38WithinCircle100kmRequest                                  6.047Gi ± 1%         6.047Gi ± 1%       ~ (p=0.725 n=10)
Tile38IntersectsCircle100kmRequest                              6.305Gi ± 1%         6.402Gi ± 2%  +1.53% (p=0.035 n=10)
Tile38KNearestLimit100Request                                   7.685Gi ± 0%         7.685Gi ± 0%       ~ (p=0.955 n=10)
geomean                                                         6.642Gi              6.676Gi       +0.51%

                                   │ ./sweet/results/tile38/baseline.results │ ./sweet/results/tile38/experiment.results │
                                   │             p50-latency-sec             │  p50-latency-sec    vs base               │
Tile38WithinCircle100kmRequest                                   88.81µ ± 1%          89.36µ ± 1%  +0.61% (p=0.043 n=10)
Tile38IntersectsCircle100kmRequest                               151.5µ ± 1%          152.0µ ± 1%       ~ (p=0.089 n=10)
Tile38KNearestLimit100Request                                    259.0µ ± 0%          259.1µ ± 0%       ~ (p=0.853 n=10)
geomean                                                          151.6µ               152.1µ       +0.33%

                                   │ ./sweet/results/tile38/baseline.results │ ./sweet/results/tile38/experiment.results │
                                   │             p90-latency-sec             │  p90-latency-sec    vs base               │
Tile38WithinCircle100kmRequest                                   712.5µ ± 0%          713.9µ ± 1%       ~ (p=0.190 n=10)
Tile38IntersectsCircle100kmRequest                               960.6µ ± 1%          958.2µ ± 1%       ~ (p=0.739 n=10)
Tile38KNearestLimit100Request                                    1.007m ± 1%          1.032m ± 1%  +2.50% (p=0.000 n=10)
geomean                                                          883.4µ               890.5µ       +0.80%

                                   │ ./sweet/results/tile38/baseline.results │ ./sweet/results/tile38/experiment.results │
                                   │             p99-latency-sec             │  p99-latency-sec    vs base               │
Tile38WithinCircle100kmRequest                                   7.061m ± 1%          7.085m ± 1%       ~ (p=0.481 n=10)
Tile38IntersectsCircle100kmRequest                               7.228m ± 1%          7.187m ± 1%       ~ (p=0.143 n=10)
Tile38KNearestLimit100Request                                    2.085m ± 0%          2.131m ± 1%  +2.22% (p=0.000 n=10)
geomean                                                          4.738m               4.770m       +0.66%

                                   │ ./sweet/results/tile38/baseline.results │ ./sweet/results/tile38/experiment.results │
                                   │                  ops/s                  │       ops/s         vs base               │
Tile38WithinCircle100kmRequest                                   17.01k ± 1%          16.97k ± 1%       ~ (p=0.143 n=10)
Tile38IntersectsCircle100kmRequest                               14.29k ± 1%          14.27k ± 1%       ~ (p=0.988 n=10)
Tile38KNearestLimit100Request                                    20.16k ± 1%          19.84k ± 1%  -1.59% (p=0.000 n=10)
geomean                                                          16.99k               16.87k       -0.67%

shortname: uber_tally
goos: linux
goarch: amd64
pkg: github.com/uber-go/tally
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
                                │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                                │                    sec/op                    │         sec/op           vs base               │
ScopeTaggedNoCachedSubscopes-12                                    2.867µ ± 4%               2.921µ ± 4%       ~ (p=0.579 n=10)
HistogramAllocation-12                                             1.519µ ± 3%               1.507µ ± 7%       ~ (p=0.631 n=10)
geomean                                                            2.087µ                    2.098µ       +0.53%

                       │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                       │                     B/op                     │             B/op              vs base          │
HistogramAllocation-12                                   1.124Ki ± 1%                   1.125Ki ± 4%  ~ (p=0.271 n=10)

                       │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                       │                  allocs/op                   │         allocs/op           vs base            │
HistogramAllocation-12                                     20.00 ± 0%                   20.00 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

shortname: uber_zap
pkg: go.uber.org/zap/zapcore
                                              │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │
                                              │                    sec/op                    │         sec/op          vs base                │
BufferedWriteSyncer/write_file_with_buffer-12                                   296.1n ± 12%             205.9n ± 10%  -30.46% (p=0.000 n=10)
MultiWriteSyncer/2_discarder-12                                                 7.528n ±  4%             7.014n ±  2%   -6.83% (p=0.000 n=10)
MultiWriteSyncer/4_discarder-12                                                 9.065n ±  1%             8.908n ±  1%   -1.73% (p=0.002 n=10)
MultiWriteSyncer/4_discarder_with_buffer-12                                     225.2n ±  2%             147.6n ±  2%  -34.48% (p=0.000 n=10)
WriteSyncer/write_file_with_no_buffer-12                                        4.785µ ±  1%             4.933µ ±  3%   +3.08% (p=0.001 n=10)
ZapConsole-12                                                                   702.5n ±  1%             649.1n ±  1%   -7.62% (p=0.000 n=10)
JSONLogMarshalerFunc-12                                                         1.219µ ±  2%             1.226µ ±  3%        ~ (p=0.781 n=10)
ZapJSON-12                                                                      555.4n ±  1%             480.9n ±  3%  -13.40% (p=0.000 n=10)
StandardJSON-12                                                                 814.1n ±  1%             809.0n ±  0%        ~ (p=0.101 n=10)
Sampler_Check/7_keys-12                                                         10.55n ±  2%             10.61n ±  1%        ~ (p=0.594 n=10)
Sampler_Check/50_keys-12                                                        11.01n ±  0%             10.98n ±  1%        ~ (p=0.286 n=10)
Sampler_Check/100_keys-12                                                       10.71n ±  0%             10.71n ±  0%        ~ (p=0.563 n=10)
Sampler_CheckWithHook/7_keys-12                                                 20.20n ±  2%             20.42n ±  2%        ~ (p=0.446 n=10)
Sampler_CheckWithHook/50_keys-12                                                20.72n ±  2%             21.02n ±  1%        ~ (p=0.078 n=10)
Sampler_CheckWithHook/100_keys-12                                               20.15n ±  2%             20.68n ±  3%   +2.63% (p=0.037 n=10)
TeeCheck-12                                                                     140.8n ±  2%             140.5n ±  2%        ~ (p=0.754 n=10)
geomean                                                                         87.80n                   82.39n         -6.15%

The only large regression (in ethereum_bitutil's BaseTest2KB) appears to
be spurious: the benchmark does not involve any goroutines (or
B.RunParallel()), so the scheduler change should not affect it, which
profiling confirms.

Updates golang/go#18237
Related to golang/go#32113
@gopherbot

Change https://go.dev/cl/473656 mentions this issue: runtime: don't usleep() in runqgrab()
