
runtime: lock-free channels #8899

Open
dvyukov opened this issue Oct 7, 2014 · 52 comments

@dvyukov
Member

dvyukov commented Oct 7, 2014

Here is design doc:
https://docs.google.com/document/d/1yIAYmbvL3JxOKOjuCyon7JhW4cSv1wy5hC0ApeGMV9s/pub

Here is the implementation:
https://golang.org/cl/12544043/

I've benchmarked the change on a real app where chans were abandoned in favor of other
sync primitives for performance reasons:
https://groups.google.com/d/msg/golang-dev/0IElw_BbTrk/cGHMdNoHGQEJ
The change provides a 23% end-to-end speedup, making chans significantly faster than
the other synchronization primitives.
@olekukonko

What is the update on this?

@odeke-em
Member

Unfortunately the CL seems to have gone stale, @dvyukov, due to changes in the runtime's code.

@OneOfOne
Contributor

I have implemented an extremely simple lock-free channel here; if there's interest, I'm willing to implement it in the runtime and submit a CL (and re-license it under the Go license, of course).

➜ go test -bench=. -benchmem -cpu 1,4,8,32 -benchtime 3s

#   ch := NewSize(runtime.NumCPU())
BenchmarkLFChan         30000000               168 ns/op              40 B/op          4 allocs/op
BenchmarkLFChan-4       30000000               175 ns/op              45 B/op          4 allocs/op
BenchmarkLFChan-8       20000000               205 ns/op              45 B/op          4 allocs/op
BenchmarkLFChan-32      20000000               201 ns/op              45 B/op          4 allocs/op

# ch := make(chan interface{}, runtime.NumCPU())
BenchmarkChan           50000000               115 ns/op               8 B/op          1 allocs/op
BenchmarkChan-4         20000000               261 ns/op               8 B/op          1 allocs/op
BenchmarkChan-8         20000000               331 ns/op               8 B/op          1 allocs/op
BenchmarkChan-32        10000000               532 ns/op               8 B/op          1 allocs/op

PASS
ok      github.com/OneOfOne/lfchan      51.663s
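For context, the general shape of a CAS-based bounded channel like the one being discussed can be sketched as a sequence-number MPMC queue (in the style of dvyukov's well-known bounded queue algorithm). This is an illustrative sketch, not the lfchan package's actual code; all names are invented here:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Each slot carries a sequence number so producers and consumers can
// claim slots with a single CAS on a ticket counter.
type slot struct {
	seq uint64
	val interface{}
}

type queue struct {
	buf  []slot
	mask uint64
	enq  uint64 // next send ticket
	deq  uint64 // next recv ticket
}

func newQueue(size int) *queue { // size must be a power of two
	q := &queue{buf: make([]slot, size), mask: uint64(size - 1)}
	for i := range q.buf {
		q.buf[i].seq = uint64(i)
	}
	return q
}

// Send claims a slot with one CAS on the enqueue ticket, then publishes
// the value by bumping the slot's sequence number. Returns false if full.
func (q *queue) Send(v interface{}) bool {
	for {
		pos := atomic.LoadUint64(&q.enq)
		s := &q.buf[pos&q.mask]
		seq := atomic.LoadUint64(&s.seq)
		switch {
		case seq == pos: // slot is free for this ticket
			if atomic.CompareAndSwapUint64(&q.enq, pos, pos+1) {
				s.val = v
				atomic.StoreUint64(&s.seq, pos+1) // publish to consumers
				return true
			}
		case seq < pos: // consumer hasn't recycled the slot yet
			return false
		}
		// otherwise another producer won the ticket; retry
	}
}

// Recv is symmetric: claim a ticket, take the value, and recycle the
// slot for the producer that will wrap around to it. False if empty.
func (q *queue) Recv() (interface{}, bool) {
	for {
		pos := atomic.LoadUint64(&q.deq)
		s := &q.buf[pos&q.mask]
		seq := atomic.LoadUint64(&s.seq)
		switch {
		case seq == pos+1: // slot holds a published value
			if atomic.CompareAndSwapUint64(&q.deq, pos, pos+1) {
				v := s.val
				s.val = nil
				atomic.StoreUint64(&s.seq, pos+q.mask+1) // recycle
				return v, true
			}
		case seq < pos+1: // no value published yet
			return nil, false
		}
	}
}

func main() {
	q := newQueue(4)
	q.Send("a")
	q.Send("b")
	x, _ := q.Recv()
	y, _ := q.Recv()
	fmt.Println(x, y) // FIFO order: a b
}
```

Note that this design is FIFO per ticket order and never scans the buffer, which addresses two of the concerns raised later in the thread, but it still spins rather than blocking, which is exactly the part that needs runtime support.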

@OneOfOne
Contributor

@dvyukov @ianlancetaylor @rsc

@dvyukov
Member Author

dvyukov commented Mar 24, 2016

@OneOfOne The tricky part will be to support blocking on an empty/full chan and select statements.
Also, your chan does not seem to be FIFO, which is a requirement.
Also, you scan the whole array to find an item; this can be very expensive for large chans.

@OneOfOne
Contributor

@dvyukov I'll work on fixing 2 and 3 and ping you back; if I get your approval, I'll try to figure out 1.

@ianlancetaylor
Contributor

Your benchmarks look good, but any change to channels has to work efficiently with select statements.

@OneOfOne
Contributor

@dvyukov I updated the package and fixed all your notes.

@ianlancetaylor I added Select support to the package.

➜ go test -bench=. -benchmem -cpu 1,4,8,32 -benchtime 3s
# ch := NewSize(100)
BenchmarkLFChan         20000000               292 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-4       20000000               202 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-8       30000000               161 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-32      20000000               215 ns/op               8 B/op          1 allocs/op
# ch := make(chan interface{}, 100)
BenchmarkChan           10000000               371 ns/op               8 B/op          1 allocs/op
BenchmarkChan-4         10000000               378 ns/op               8 B/op          1 allocs/op
BenchmarkChan-8         10000000               506 ns/op               8 B/op          1 allocs/op
BenchmarkChan-32        10000000               513 ns/op               8 B/op          1 allocs/op
  • the reason 32 is slower than 8 is that I only have 8 cores; theoretically, it should get faster with the number of (real) cores you throw at it.

@jonhoo

jonhoo commented Mar 26, 2016

I'll benchmark up to 80 cores later today and post results.

@jonhoo

jonhoo commented Mar 26, 2016

On an 80-core Intel host:

$ go test -bench=. -benchmem -cpu 1,2,4,8,16,32,48,64,72,80 -benchtime 10s
PASS
BenchmarkLFChan         30000000               506 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-2       20000000              1107 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-4       10000000              1611 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-8       10000000              1710 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-16      10000000              2165 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-32      10000000              2192 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-48      10000000              2288 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-64      10000000              2354 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-72       5000000              2454 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-80       5000000              2554 ns/op               8 B/op          1 allocs/op
BenchmarkChan           20000000               768 ns/op               8 B/op          1 allocs/op
BenchmarkChan-2         20000000              1188 ns/op               8 B/op          1 allocs/op
BenchmarkChan-4         20000000              3215 ns/op               8 B/op          1 allocs/op
BenchmarkChan-8          5000000              3657 ns/op               8 B/op          1 allocs/op
BenchmarkChan-16        10000000              2734 ns/op               8 B/op          1 allocs/op
BenchmarkChan-32         5000000              2387 ns/op               8 B/op          1 allocs/op
BenchmarkChan-48         5000000              2448 ns/op               8 B/op          1 allocs/op
BenchmarkChan-64         5000000              2452 ns/op               8 B/op          1 allocs/op
BenchmarkChan-72         5000000              2552 ns/op               8 B/op          1 allocs/op
BenchmarkChan-80         5000000              2659 ns/op               8 B/op          1 allocs/op
ok      github.com/OneOfOne/lfchan      436.003s

On a 48-core AMD host:

$ go test -bench=. -benchmem -cpu 1,2,4,8,16,32,48 -benchtime 10s
PASS
BenchmarkLFChan         20000000               734 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-2       10000000              1271 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-4       20000000              1140 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-8       20000000              1097 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-16      10000000              1257 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-32      10000000              1564 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-48      10000000              2172 ns/op               8 B/op          1 allocs/op
BenchmarkChan           20000000               937 ns/op               8 B/op          1 allocs/op
BenchmarkChan-2         20000000              1147 ns/op               8 B/op          1 allocs/op
BenchmarkChan-4         10000000              1721 ns/op               8 B/op          1 allocs/op
BenchmarkChan-8         10000000              2372 ns/op               8 B/op          1 allocs/op
BenchmarkChan-16        10000000              2349 ns/op               8 B/op          1 allocs/op
BenchmarkChan-32         5000000              2295 ns/op               8 B/op          1 allocs/op
BenchmarkChan-48         5000000              2753 ns/op               8 B/op          1 allocs/op
ok      github.com/OneOfOne/lfchan      276.712s

Also, a plot: (image not preserved in this copy)

@OneOfOne
Contributor

@jonhoo thanks for running the benchmarks. It is rather odd how different the numbers are; I wonder if it has something to do with having multiple CPU sockets.

@jonhoo

jonhoo commented Mar 26, 2016

@OneOfOne These machines are also somewhat older, so each core is slower than what you'd find in your laptop nowadays.

@RLH
Contributor

RLH commented Mar 26, 2016

Any idea why adding HW threads doesn't result in increased parallelism?
Could you describe the HW? How many HW threads per core?


@jonhoo

jonhoo commented Mar 26, 2016

@RLH: see the linked pages for hardware descriptions. I'm not pinning threads, but I never run with -cpu higher than the number of real cores.

@dvyukov
Member Author

dvyukov commented Mar 27, 2016

@OneOfOne There are several problems with this implementation:

  1. It is not FIFO.
  2. Recv/Send can fail spuriously.
  3. Len can return -1.
  4. It does not support blocking and unblocking of goroutines (you must not just spin; you must tell the scheduler precisely when a goroutine can be descheduled and marked as blocked, and when it must be made runnable again).
  5. The same applies to select: you must not just poll in a loop; you need to tell the scheduler when to block and unblock the goroutine.

@OneOfOne
Contributor

@dvyukov

  1. Fixed here.
  2. I can't trigger that (it might have been fixed along with 1).
  3. I'll see if I can fix it; I added a test, but I actually have to use -count to trigger it.
  4. That's not possible in user code AFAIK; I was going to do that once I submit a CL and can use runtime funcs.
  5. Same as 4.

@OneOfOne
Contributor

@dvyukov 3. fixed with OneOfOne/lfchan@bdddd90

@minux
Member

minux commented Mar 27, 2016 via email

@OneOfOne
Contributor

@minux I'm not sure what a SPIN model is.

@kostya-sh
Contributor

I believe the benchmarks in https://github.com/OneOfOne/lfchan/blob/master/chan_test.go are not benchmarking what you want. Operations like wg.Add, wg.Done, and creating many goroutines probably limit scalability and skew the results.

I think a better approach would be to use 1 reader and 1 writer goroutine and let RunParallel handle the concurrency level for you. Also avoid using atomic operations and wait groups.

Also I want to point out that the following result:

BenchmarkLFChan         50000000               325 ns/op
BenchmarkLFChan-4       50000000               270 ns/op

shows that the benchmarked operation became 270×4/325 ≈ 3.3 times slower (with perfect scaling, ns/op at 4 CPUs should be a quarter of the single-CPU figure, but it barely dropped at all).
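One way to read the suggested structure is the following sketch: a single background goroutine drains the channel while RunParallel (driven by the -cpu flag) controls the number of concurrent senders, so no WaitGroups or atomics sit on the hot path. The function names and buffer size are illustrative, not the lfchan test suite's actual code:

```go
package main

import (
	"fmt"
	"testing"
)

// benchmarkSend measures channel send throughput: one goroutine drains,
// RunParallel fans senders out across GOMAXPROCS, and only the channel
// itself is contended.
func benchmarkSend(b *testing.B) {
	ch := make(chan int, 100)
	done := make(chan struct{})
	go func() { // the lone reader
		for range ch {
		}
		close(done)
	}()
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			ch <- 1
		}
	})
	close(ch)
	<-done
}

func main() {
	r := testing.Benchmark(benchmarkSend)
	fmt.Println(r) // prints iterations and ns/op; numbers vary by machine
}
```

In a real test file this would be an exported `Benchmark*` function run via `go test -bench`; testing.Benchmark is used here only to make the sketch self-contained.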

@davecheney
Contributor

For a lock free implementation, this code sure does spend a long time in runtime.lock.

@OneOfOne
Contributor

@jonhoo I fixed that; would you be so kind as to rerun the benchmarks? I simplified them per @kostya-sh's suggestion and removed the GOMAXPROCS call.

@dvyukov pointed me in the right direction with his runtime.lock remark.

➜ go test -bench=. -benchmem  -run NONE -cpu 1,4,8 -benchtime 10s
BenchmarkLFChan         100000000              142 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-4       100000000              174 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-8       100000000              132 ns/op               8 B/op          1 allocs/op
BenchmarkChan           100000000              102 ns/op               8 B/op          1 allocs/op
BenchmarkChan-4         50000000               257 ns/op               8 B/op          1 allocs/op
BenchmarkChan-8         50000000               337 ns/op               8 B/op          1 allocs/op
PASS
ok      github.com/OneOfOne/lfchan      86.061s

@bradfitz
Contributor

This bug was originally about lock-free channels in the runtime.

All this recent discussion is about an external package. Can we move the discussion elsewhere on GitHub (e.g. https://github.com/OneOfOne/lfchan/issues) so I can remain subscribed to this bug for its original purpose?

@jonhoo

jonhoo commented Mar 28, 2016

Created an issue over at OneOfOne/lfchan#3. Will continue posting benchmark results there.

@dvyukov
Member Author

dvyukov commented Apr 3, 2016

@OneOfOne The queue is still not FIFO after 2f9d3a217eadbc557e758649fa153dd59ff14c11, and Len can still return -1. You can block and unblock goroutines without runtime support: sync.Mutex+Cond can block/unblock goroutines. You can stub the runtime scheduler by implementing 2 functions on top of Mutex+Cond: park(g *G) and unpark(g *G).
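A user-space stand-in for those scheduler hooks might look like the sketch below. It builds park/unpark on sync.Mutex+sync.Cond as suggested; the real runtime would use gopark/goready, and the per-goroutine `parker` type here is an invented name, not an API from the thread:

```go
package main

import (
	"fmt"
	"sync"
)

// parker stubs the runtime's park/unpark pair in user code: one parker
// per goroutine lets a channel prototype block that goroutine until a
// sender or receiver wakes it.
type parker struct {
	mu    sync.Mutex
	cond  *sync.Cond
	ready bool
}

func newParker() *parker {
	p := &parker{}
	p.cond = sync.NewCond(&p.mu)
	return p
}

// park blocks the calling goroutine until unpark is called. The ready
// flag makes it safe even if unpark races ahead of park.
func (p *parker) park() {
	p.mu.Lock()
	for !p.ready {
		p.cond.Wait()
	}
	p.ready = false
	p.mu.Unlock()
}

// unpark makes a parked (or about-to-park) goroutine runnable again.
func (p *parker) unpark() {
	p.mu.Lock()
	p.ready = true
	p.cond.Signal()
	p.mu.Unlock()
}

func main() {
	p := newParker()
	done := make(chan int)
	go func() {
		p.park() // blocks until main unparks us
		done <- 42
	}()
	p.unpark()
	fmt.Println(<-done) // prints 42
}
```

This is only a prototyping aid: Mutex+Cond themselves take locks, so a channel built on them is not lock-free, but it lets the blocking logic be validated before porting to real runtime primitives.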

@gorakhargosh

@jonhoo how did you generate the plot?

@jonhoo

jonhoo commented Apr 24, 2016

@gyf19

gyf19 commented Dec 27, 2016

@dvyukov ping?

@dvyukov
Member Author

dvyukov commented Dec 27, 2016

After the recent fairness changes in channels, my algorithm does not apply per se. My algorithm does not have the fairness properties that the current code provides.

@olekukonko

@dvyukov any plans to revisit this in the future?

@dvyukov
Member Author

dvyukov commented Jan 6, 2017

No particular plans.

@ghost

ghost commented Apr 6, 2017

@dvyukov

What are the fairness properties this implementation is missing?
Other than that, is there anything else wrong with it?

@dvyukov
Member Author

dvyukov commented Apr 7, 2017

@stjepang I've lost all context already. What implementation do you mean? Why do you think there are fairness problems?

@ghost

ghost commented Apr 7, 2017

@dvyukov In order to improve the performance of Go's channels, you designed almost-lock-free channels and wrote this implementation. Now it's apparently abandoned, as you commented:

After the recent fairness changes in channels, my algorithm does not apply per se. My algorithm does not have the fairness properties that the current code provides.

I was wondering what the fairness changes are that make your algorithm inapplicable. Why exactly didn't your algorithm make it in as the official Go channel implementation? What problems with it still need to be solved?

@dvyukov
Member Author

dvyukov commented Apr 7, 2017

I was wondering what the fairness changes are that make your algorithm inapplicable.

I mean this change:
https://go-review.googlesource.com/c/16740
Though it does not mention fairness, nor does it add tests, so we can probably break the additional fairness guarantees it provides.

Why exactly didn't your algorithm make it in as the official Go channel implementation? What problems with it still need to be solved?

It was constantly deprioritized in favor of other runtime changes, so it constantly got outdated and I wasn't able to keep updating it. As far as I remember, it all worked. Now the major problem is rebasing it onto the current runtime code. We also need to decide what to do with respect to the fairness change 16740.

@olekukonko

@dvyukov, I might be wrong, but I've noticed you have not been as active as before. Is this a result of your busy schedule, did you lose the zeal after multiple rejections of some of your proposals, or are you working on a secret project we can't know about yet? :)

@dvyukov
Member Author

dvyukov commented Apr 7, 2017

@olekukonko working on a non-secret project you can know about: https://github.com/google/syzkaller

@olekukonko

@dvyukov, the project looks interesting; I will experiment with it. Thanks for the prompt response. Regards.

@empijei
Contributor

empijei commented Mar 19, 2019

Are there any plans to revisit this?

@pdeva

pdeva commented Jan 14, 2021

2021 now. hopefully someday we see this implemented 😄

@alphadose

Candidate -> https://github.com/alphadose/ZenQ
Linked Issue -> #52652
