
runtime: lock-free channels #8899

Open
dvyukov opened this issue Oct 7, 2014 · 52 comments

@dvyukov
Member

dvyukov commented Oct 7, 2014

Here is design doc:
https://docs.google.com/document/d/1yIAYmbvL3JxOKOjuCyon7JhW4cSv1wy5hC0ApeGMV9s/pub

Here is the implementation:
https://golang.org/cl/12544043/

I've benchmarked the change on a real app where chans were abandoned in favor of other
sync primitives for performance reasons:
https://groups.google.com/d/msg/golang-dev/0IElw_BbTrk/cGHMdNoHGQEJ
The change provides a 23% end-to-end speedup, making chans significantly faster than
the other synchronization primitives.
@olekukonko

What is the update on this?

@odeke-em
Member

Unfortunately the CL seems to have gone stale, @dvyukov, due to changes in the runtime's code.

@OneOfOne
Contributor

I have implemented an extremely simple lock-free channel here; if there's interest, I'm willing to implement it in the runtime and submit a CL (and re-license it under the Go license, of course).

➜ go test -bench=. -benchmem -cpu 1,4,8,32 -benchtime 3s

#   ch := NewSize(runtime.NumCPU())
BenchmarkLFChan         30000000               168 ns/op              40 B/op          4 allocs/op
BenchmarkLFChan-4       30000000               175 ns/op              45 B/op          4 allocs/op
BenchmarkLFChan-8       20000000               205 ns/op              45 B/op          4 allocs/op
BenchmarkLFChan-32      20000000               201 ns/op              45 B/op          4 allocs/op

# ch := make(chan interface{}, runtime.NumCPU())
BenchmarkChan           50000000               115 ns/op               8 B/op          1 allocs/op
BenchmarkChan-4         20000000               261 ns/op               8 B/op          1 allocs/op
BenchmarkChan-8         20000000               331 ns/op               8 B/op          1 allocs/op
BenchmarkChan-32        10000000               532 ns/op               8 B/op          1 allocs/op

PASS
ok      github.com/OneOfOne/lfchan      51.663s
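For context, the general shape of a CAS-based bounded channel like the one being discussed can be sketched as a sequence-number MPMC queue (in the style of dvyukov's well-known bounded queue algorithm). This is an illustrative sketch, not the lfchan package's actual code; all names are invented here:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Each slot carries a sequence number so producers and consumers can
// claim slots with a single CAS on a ticket counter.
type slot struct {
	seq uint64
	val interface{}
}

type queue struct {
	buf  []slot
	mask uint64
	enq  uint64 // next send ticket
	deq  uint64 // next recv ticket
}

func newQueue(size int) *queue { // size must be a power of two
	q := &queue{buf: make([]slot, size), mask: uint64(size - 1)}
	for i := range q.buf {
		q.buf[i].seq = uint64(i)
	}
	return q
}

// Send claims a slot with one CAS on the enqueue ticket, then publishes
// the value by bumping the slot's sequence number. Returns false if full.
func (q *queue) Send(v interface{}) bool {
	for {
		pos := atomic.LoadUint64(&q.enq)
		s := &q.buf[pos&q.mask]
		seq := atomic.LoadUint64(&s.seq)
		switch {
		case seq == pos: // slot is free for this ticket
			if atomic.CompareAndSwapUint64(&q.enq, pos, pos+1) {
				s.val = v
				atomic.StoreUint64(&s.seq, pos+1) // publish to consumers
				return true
			}
		case seq < pos: // consumer hasn't recycled the slot yet
			return false
		}
		// otherwise another producer won the ticket; retry
	}
}

// Recv is symmetric: claim a ticket, take the value, and recycle the
// slot for the producer that will wrap around to it. False if empty.
func (q *queue) Recv() (interface{}, bool) {
	for {
		pos := atomic.LoadUint64(&q.deq)
		s := &q.buf[pos&q.mask]
		seq := atomic.LoadUint64(&s.seq)
		switch {
		case seq == pos+1: // slot holds a published value
			if atomic.CompareAndSwapUint64(&q.deq, pos, pos+1) {
				v := s.val
				s.val = nil
				atomic.StoreUint64(&s.seq, pos+q.mask+1) // recycle
				return v, true
			}
		case seq < pos+1: // no value published yet
			return nil, false
		}
	}
}

func main() {
	q := newQueue(4)
	q.Send("a")
	q.Send("b")
	x, _ := q.Recv()
	y, _ := q.Recv()
	fmt.Println(x, y) // FIFO order: a b
}
```

Note that this design is FIFO per ticket order and never scans the buffer, which addresses two of the concerns raised later in the thread, but it still spins rather than blocking, which is exactly the part that needs runtime support.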

@OneOfOne
Contributor

@dvyukov @ianlancetaylor @rsc

@dvyukov
Member Author

dvyukov commented Mar 24, 2016

@OneOfOne The tricky part will be to support blocking on an empty/full chan and select statements.
Also, your chan does not seem to be FIFO, which is a requirement.
Also, you scan the whole array to find an item; this can be very expensive for large chans.

@OneOfOne
Contributor

@dvyukov I'll work on fixing 2 and 3 and ping you back; if I get your approval, I'll try to figure out 1.

@ianlancetaylor
Contributor

Your benchmarks look good, but any change to channels has to work efficiently with select statements.

@OneOfOne
Contributor

@dvyukov I updated the package and fixed all your notes.

@ianlancetaylor I added Select support to the package.

➜ go test -bench=. -benchmem -cpu 1,4,8,32 -benchtime 3s
# ch := NewSize(100)
BenchmarkLFChan         20000000               292 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-4       20000000               202 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-8       30000000               161 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-32      20000000               215 ns/op               8 B/op          1 allocs/op
# ch := make(chan interface{}, 100)
BenchmarkChan           10000000               371 ns/op               8 B/op          1 allocs/op
BenchmarkChan-4         10000000               378 ns/op               8 B/op          1 allocs/op
BenchmarkChan-8         10000000               506 ns/op               8 B/op          1 allocs/op
BenchmarkChan-32        10000000               513 ns/op               8 B/op          1 allocs/op
  • the reason 32 is slower than 8 is that I only have 8 cores; theoretically, it should get faster with the number of (real) cores you throw at it.

@jonhoo

jonhoo commented Mar 26, 2016

I'll benchmark up to 80 cores later today and post results.

@jonhoo

jonhoo commented Mar 26, 2016

On an 80-core Intel host:

$ go test -bench=. -benchmem -cpu 1,2,4,8,16,32,48,64,72,80 -benchtime 10s
PASS
BenchmarkLFChan         30000000               506 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-2       20000000              1107 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-4       10000000              1611 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-8       10000000              1710 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-16      10000000              2165 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-32      10000000              2192 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-48      10000000              2288 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-64      10000000              2354 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-72       5000000              2454 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-80       5000000              2554 ns/op               8 B/op          1 allocs/op
BenchmarkChan           20000000               768 ns/op               8 B/op          1 allocs/op
BenchmarkChan-2         20000000              1188 ns/op               8 B/op          1 allocs/op
BenchmarkChan-4         20000000              3215 ns/op               8 B/op          1 allocs/op
BenchmarkChan-8          5000000              3657 ns/op               8 B/op          1 allocs/op
BenchmarkChan-16        10000000              2734 ns/op               8 B/op          1 allocs/op
BenchmarkChan-32         5000000              2387 ns/op               8 B/op          1 allocs/op
BenchmarkChan-48         5000000              2448 ns/op               8 B/op          1 allocs/op
BenchmarkChan-64         5000000              2452 ns/op               8 B/op          1 allocs/op
BenchmarkChan-72         5000000              2552 ns/op               8 B/op          1 allocs/op
BenchmarkChan-80         5000000              2659 ns/op               8 B/op          1 allocs/op
ok      github.com/OneOfOne/lfchan      436.003s

On a 48-core AMD host:

$ go test -bench=. -benchmem -cpu 1,2,4,8,16,32,48 -benchtime 10s
PASS
BenchmarkLFChan         20000000               734 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-2       10000000              1271 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-4       20000000              1140 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-8       20000000              1097 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-16      10000000              1257 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-32      10000000              1564 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-48      10000000              2172 ns/op               8 B/op          1 allocs/op
BenchmarkChan           20000000               937 ns/op               8 B/op          1 allocs/op
BenchmarkChan-2         20000000              1147 ns/op               8 B/op          1 allocs/op
BenchmarkChan-4         10000000              1721 ns/op               8 B/op          1 allocs/op
BenchmarkChan-8         10000000              2372 ns/op               8 B/op          1 allocs/op
BenchmarkChan-16        10000000              2349 ns/op               8 B/op          1 allocs/op
BenchmarkChan-32         5000000              2295 ns/op               8 B/op          1 allocs/op
BenchmarkChan-48         5000000              2753 ns/op               8 B/op          1 allocs/op
ok      github.com/OneOfOne/lfchan      276.712s

Also, a plot: (image not preserved in this copy)

@OneOfOne
Contributor

@jonhoo thanks for running the benchmarks. It is rather odd how different the numbers are; I wonder if it has something to do with having multiple CPU sockets.

@jonhoo

jonhoo commented Mar 26, 2016

@OneOfOne These machines are also somewhat older, so each core is slower than what you'd find in your laptop nowadays.

@RLH
Contributor

RLH commented Mar 26, 2016

Any idea why adding HW threads doesn't result in increased parallelism?
Could you describe the HW? How many HW threads per core?


@jonhoo

jonhoo commented Mar 26, 2016

@RLH: see the linked pages for hardware descriptions. I'm not pinning threads, but I never run with -cpu higher than the number of real cores.

@dvyukov
Member Author

dvyukov commented Mar 27, 2016

@OneOfOne There are several problems with this implementation:

  1. It is not FIFO.
  2. Recv/Send can fail spuriously.
  3. Len can return -1.
  4. It does not support blocking and unblocking of goroutines (you must not just spin; you must tell the scheduler precisely when a goroutine can be descheduled and marked as blocked, and when it must be made runnable again).
  5. The same applies to select: you must not just poll in a loop; you need to tell the scheduler when to block and unblock the goroutine.

@OneOfOne
Contributor

@dvyukov

  1. Fixed here.
  2. I can't trigger that (it might have been fixed along with 1).
  3. I'll see if I can fix it; I added a test, but I actually have to use -count to trigger it.
  4. That's not possible in user code AFAIK; I was going to do that once I submit a CL and can use runtime funcs.
  5. Same as 4.

@OneOfOne
Contributor

@dvyukov 3. fixed with OneOfOne/lfchan@bdddd90

@minux
Member

minux commented Mar 27, 2016 via email

@OneOfOne
Contributor

@minux I'm not sure what a SPIN model is.

@kostya-sh
Contributor

I believe the benchmarks in https://github.com/OneOfOne/lfchan/blob/master/chan_test.go are not benchmarking what you want. Operations like wg.Add, wg.Done, and creating many goroutines probably limit scalability and skew the results.

I think a better approach would be to use 1 reader and 1 writer goroutine and let RunParallel handle the concurrency level for you. Also avoid using atomic operations and wait groups.

Also I want to point out that the following result:

BenchmarkLFChan         50000000               325 ns/op
BenchmarkLFChan-4       50000000               270 ns/op

shows that the benchmarked operation became 270×4/325 ≈ 3.3 times slower (with perfect scaling, ns/op at 4 CPUs should be a quarter of the single-CPU figure, but it barely dropped at all).
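One way to read the suggested structure is the following sketch: a single background goroutine drains the channel while RunParallel (driven by the -cpu flag) controls the number of concurrent senders, so no WaitGroups or atomics sit on the hot path. The function names and buffer size are illustrative, not the lfchan test suite's actual code:

```go
package main

import (
	"fmt"
	"testing"
)

// benchmarkSend measures channel send throughput: one goroutine drains,
// RunParallel fans senders out across GOMAXPROCS, and only the channel
// itself is contended.
func benchmarkSend(b *testing.B) {
	ch := make(chan int, 100)
	done := make(chan struct{})
	go func() { // the lone reader
		for range ch {
		}
		close(done)
	}()
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			ch <- 1
		}
	})
	close(ch)
	<-done
}

func main() {
	r := testing.Benchmark(benchmarkSend)
	fmt.Println(r) // prints iterations and ns/op; numbers vary by machine
}
```

In a real test file this would be an exported `Benchmark*` function run via `go test -bench`; testing.Benchmark is used here only to make the sketch self-contained.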

@davecheney
Contributor

For a lock free implementation, this code sure does spend a long time in runtime.lock.

@OneOfOne
Contributor

@jonhoo I fixed that; would you be so kind as to rerun the benchmarks? I simplified them per @kostya-sh's suggestion and removed the GOMAXPROCS call.

@dvyukov pointed me in the right direction with his runtime.lock remark.

➜ go test -bench=. -benchmem  -run NONE -cpu 1,4,8 -benchtime 10s
BenchmarkLFChan         100000000              142 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-4       100000000              174 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-8       100000000              132 ns/op               8 B/op          1 allocs/op
BenchmarkChan           100000000              102 ns/op               8 B/op          1 allocs/op
BenchmarkChan-4         50000000               257 ns/op               8 B/op          1 allocs/op
BenchmarkChan-8         50000000               337 ns/op               8 B/op          1 allocs/op
PASS
ok      github.com/OneOfOne/lfchan      86.061s

@bradfitz
Contributor

This bug was originally about lock-free channels in the runtime.

All this recent discussion is about an external package. Can we move the discussion elsewhere on GitHub (e.g. https://github.com/OneOfOne/lfchan/issues) so I can remain subscribed to this bug for its original purpose?

@jonhoo

jonhoo commented Mar 28, 2016

Created an issue over at OneOfOne/lfchan#3. Will continue posting benchmark results there.

@dvyukov
Member Author

dvyukov commented Apr 3, 2016

@OneOfOne The queue is still not FIFO after 2f9d3a217eadbc557e758649fa153dd59ff14c11, and Len can still return -1. You can block and unblock goroutines without runtime support: sync.Mutex+Cond can block/unblock goroutines. You can stub the runtime scheduler by implementing 2 functions on top of Mutex+Cond: park(g *G) and unpark(g *G).
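A user-space stand-in for those scheduler hooks might look like the sketch below. It builds park/unpark on sync.Mutex+sync.Cond as suggested; the real runtime would use gopark/goready, and the per-goroutine `parker` type here is an invented name, not an API from the thread:

```go
package main

import (
	"fmt"
	"sync"
)

// parker stubs the runtime's park/unpark pair in user code: one parker
// per goroutine lets a channel prototype block that goroutine until a
// sender or receiver wakes it.
type parker struct {
	mu    sync.Mutex
	cond  *sync.Cond
	ready bool
}

func newParker() *parker {
	p := &parker{}
	p.cond = sync.NewCond(&p.mu)
	return p
}

// park blocks the calling goroutine until unpark is called. The ready
// flag makes it safe even if unpark races ahead of park.
func (p *parker) park() {
	p.mu.Lock()
	for !p.ready {
		p.cond.Wait()
	}
	p.ready = false
	p.mu.Unlock()
}

// unpark makes a parked (or about-to-park) goroutine runnable again.
func (p *parker) unpark() {
	p.mu.Lock()
	p.ready = true
	p.cond.Signal()
	p.mu.Unlock()
}

func main() {
	p := newParker()
	done := make(chan int)
	go func() {
		p.park() // blocks until main unparks us
		done <- 42
	}()
	p.unpark()
	fmt.Println(<-done) // prints 42
}
```

This is only a prototyping aid: Mutex+Cond themselves take locks, so a channel built on them is not lock-free, but it lets the blocking logic be validated before porting to real runtime primitives.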

@gorakhargosh

@jonhoo how did you generate the plot?

@jonhoo

jonhoo commented Apr 24, 2016

@gyf19

gyf19 commented Dec 27, 2016

@dvyukov ping?

@dvyukov
Member Author

dvyukov commented Dec 27, 2016

After the recent fairness changes in channels, my algorithm does not apply per se. My algorithm does not have the fairness properties that the current code provides.

@olekukonko

@dvyukov any plans to revisit this in the future?

@dvyukov
Member Author

dvyukov commented Jan 6, 2017

No particular plans.

@ghost

ghost commented Apr 6, 2017

@dvyukov

What are the fairness properties this implementation is missing?
Other than that, is there anything else wrong with it?

@dvyukov
Member Author

dvyukov commented Apr 7, 2017

@stjepang I've lost all context already. What implementation do you mean? Why do you think there are fairness problems?

@ghost

ghost commented Apr 7, 2017

@dvyukov In order to improve the performance of Go's channels, you designed almost-lock-free channels and wrote this implementation. Now it's apparently abandoned, as you commented:

After the recent fairness changes in channels, my algorithm does not apply per se. My algorithm does not have the fairness properties that the current code provides.

I was wondering what the fairness changes are that make your algorithm inapplicable. Why exactly didn't your algorithm make it in as the official Go channel implementation? What problems with it still need to be solved?

@dvyukov
Member Author

dvyukov commented Apr 7, 2017

I was wondering what the fairness changes are that make your algorithm inapplicable.

I mean this change:
https://go-review.googlesource.com/c/16740
Though it does not mention fairness, nor does it add tests, so we can probably break the additional fairness guarantees it provides.

Why exactly didn't your algorithm make it in as the official Go channel implementation? What problems with it still need to be solved?

It was constantly deprioritized in favor of other runtime changes, so it constantly got outdated and I wasn't able to keep updating it. As far as I remember, it all worked. Now the major problem is rebasing it onto the current runtime code. We also need to decide what to do with respect to the fairness change 16740.

@olekukonko

@dvyukov, I might be wrong, but I've noticed you have not been as active as before. Is this a result of your busy schedule, did you lose the zeal after multiple rejections of some of your proposals, or are you working on a secret project we can't know about yet? :)

@dvyukov
Member Author

dvyukov commented Apr 7, 2017

@olekukonko working on a non-secret project you can know about: https://github.com/google/syzkaller

@olekukonko

@dvyukov, the project looks interesting; I will experiment with it. Thanks for the prompt response. Regards.

@empijei
Contributor

empijei commented Mar 19, 2019

Are there any plans to revisit this?

@pdeva

pdeva commented Jan 14, 2021

2021 now. hopefully someday we see this implemented 😄

@alphadose

Candidate -> https://github.com/alphadose/ZenQ
Linked Issue -> #52652
