|
|
Description
net: band-aid for windows network poller
Fixes performance of the current windows network poller
with the new scheduler.
Gives runtime a hint when GetQueuedCompletionStatus() will block.
Fixes issue 5068.
benchmark                    old ns/op    new ns/op    delta
BenchmarkTCP4Persistent        4004000        33906   -99.15%
BenchmarkTCP4Persistent-2        21790        17513   -19.63%
BenchmarkTCP4Persistent-4        44760        34270   -23.44%
BenchmarkTCP4Persistent-6        45280        43000    -5.04%
Patch Set 1
Patch Set 2: diff -r 43eb97ed849a https://dvyukov%40google.com@code.google.com/p/go/
Patch Set 3: diff -r 43eb97ed849a https://dvyukov%40google.com@code.google.com/p/go/
Patch Set 4: diff -r 43eb97ed849a https://dvyukov%40google.com@code.google.com/p/go/
Total comments: 1
Patch Set 5: diff -r f78c5036b7f6 https://dvyukov%40google.com@code.google.com/p/go/
Patch Set 6: diff -r f78c5036b7f6 https://dvyukov%40google.com@code.google.com/p/go/
Patch Set 7: diff -r f78c5036b7f6 https://dvyukov%40google.com@code.google.com/p/go/
Messages
Total messages: 60
Hello golang-dev@googlegroups.com, I'd like you to review this change to https://dvyukov%40google.com@code.google.com/p/go/
Do you have a plan for a *general* solution? I suspect many other syscalls will be affected too.

Alex
On 2013/03/18 07:59:27, brainman wrote:
> Do you have a plan for a *general* solution? I suspect many other syscalls
> will be affected too.

What other syscalls are affected?
I don't know of any. But what do I do if one of my syscalls is suddenly significantly slower than before? How do I give a hint to the scheduler?

Alex
On 2013/03/18 08:27:31, brainman wrote:
> I don't know of any. But what do I do if one of my syscalls is suddenly
> significantly slower than before? How do I give a hint to the scheduler?

Hmm... perhaps we can make all syscalls blocking on windows?
I re-ran the net benchmarks to see the regression on windows. I compared two "clean" versions, before and after the new scheduler:

# ~/go/root/misc/benchcmp id16000.txt id16303.txt
benchmark                          old ns/op    new ns/op    delta
BenchmarkTCPPersistent                 87501       265628  +203.57%
BenchmarkTCPPersistent-2               42188        52813   +25.18%
BenchmarkTCPPersistent-4               39375        65625   +66.67%
BenchmarkTCPPersistentTimeout          94532       253128  +167.77%
BenchmarkTCPPersistentTimeout-2        43438        59375   +36.69%
BenchmarkTCPPersistentTimeout-4        42813        64688   +51.09%

I also compared your proposed change:

# ~/go/root/misc/benchcmp id16000.txt id16303cl7612045.txt
benchmark                          old ns/op    new ns/op    delta
BenchmarkTCPPersistent                 87501        57500   -34.29%
BenchmarkTCPPersistent-2               42188        58125   +37.78%
BenchmarkTCPPersistent-4               39375        70938   +80.16%
BenchmarkTCPPersistentTimeout          94532        62813   -33.55%
BenchmarkTCPPersistentTimeout-2        43438        63750   +46.76%
BenchmarkTCPPersistentTimeout-4        42813        71094   +66.06%

which is better, but still a regression on some tests.

Then I got rid of net.runtime_blockingSyscallHint altogether, as per your suggestion:

# hg diff
diff -r 1130cc72c645 src/pkg/runtime/cgocall.c
--- a/src/pkg/runtime/cgocall.c	Mon Mar 18 15:33:04 2013 -0700
+++ b/src/pkg/runtime/cgocall.c	Tue Mar 19 11:34:51 2013 +1100
@@ -145,7 +145,10 @@
 	 * so it is safe to call while "in a system call", outside
 	 * the $GOMAXPROCS accounting.
 	 */
-	runtime·entersyscall();
+	if(Windows) {
+		runtime·entersyscallblock();
+	} else
+		runtime·entersyscall();
 	runtime·asmcgocall(fn, arg);
 	runtime·exitsyscall();

# ~/go/root/misc/benchcmp id16000.txt id16303alex.txt
benchmark                          old ns/op    new ns/op    delta
BenchmarkTCPPersistent                 87501        92969    +6.25%
BenchmarkTCPPersistent-2               42188        62813   +48.89%
BenchmarkTCPPersistent-4               39375        72657   +84.53%
BenchmarkTCPPersistentTimeout          94532       103126    +9.09%
BenchmarkTCPPersistentTimeout-2        43438        78907   +81.65%
BenchmarkTCPPersistentTimeout-4        42813        77344   +80.66%

and it is still bad. I am not sure what to do.

Alex
On 2013/03/19 00:39:32, brainman wrote:
> [benchmark comparison snipped - see the previous message]

How about "if(windows && runtime.NumCPU()==1)"?
I'm sorry, I misunderstood the comparison. So does it mean the new scheduler is a regression for windows? Is it feasible to switch back to the old scheduler for windows?
On 2013/03/19 02:26:25, Ewan Chou wrote:
> I'm sorry, I misunderstood the comparison.

Yeah, your change does not help on my pc, because I have more than 1 cpu, so your change does nothing.

> So does it mean the new scheduler is a regression for windows?

Yes, on net benchmarks at least.

> Is it feasible to switch back to the old scheduler for windows?

I don't know.

Alex
On 2013/03/19 00:39:32, brainman wrote:
> [benchmark comparison snipped - see the previous message]

Thanks for the numbers. I don't have an explanation for why new sched + this patch is worse than old sched with GOMAXPROCS>1... Does profiling work on Windows? Can you try to collect profiles and CPU utilization?

I am also not sure why this patch is faster than tip on my machine and slower than tip on your machine with GOMAXPROCS>1. Are you sure this patch slows down the net benchmarks as compared to tip?

A proper fix (rewrite) for the windows network poller is a significant change. Most likely I won't have time for it until summer.
Profiling does work on windows, but I don't know what you want. If you could provide instructions, I will collect the info. I will also rerun all my benchmarks to verify. I tested everything on 386, if it matters.

I would also try to rewrite the netpoller myself, but I don't know where to start. If you would describe your plan, perhaps I will have a go.

Thank you.

Alex
On 2013/03/21 11:38:20, brainman wrote:
> I would also try to rewrite the netpoller myself, but I don't know where to
> start. If you would describe your plan, perhaps I will have a go.

Here are some rough ideas.

First, Read/Write generate garbage, which is unfortunate. Since there is at most 1 pending Read and 1 pending Write operation, it's possible to embed the required structures directly into netFD:

type netFD struct {
	...
	ro syscall.Overlapped
	rb syscall.WSABuf
	rt Timer
	wo syscall.Overlapped
	wb syscall.WSABuf
	wt Timer
}

The deadline timer should be set only once (when the deadline is set), instead of resetting it on each operation. An expired timer calls CancelIoEx(). Read/Write issue the operation (WSARecv/Send) and block on a lightweight semaphore (see runtime/netpoll.goc). runtime.netpoll() calls GetQueuedCompletionStatusEx() and unblocks a batch of ready goroutines. netFD.Close() calls CancelIoEx for outstanding operations.

So, why I expect to see a performance improvement:

1. Do not allocate readOp/writeOp and a timer on each operation.
2. Do not arm a timer on each operation (there is a global mutex).
3. Do not do the 3-chan select (this also allocates memory and locks mutexes; in particular the cancelc chan mutex is shared between read and write operations).
4. Dequeue batches of ready operations with GetQueuedCompletionStatusEx().
5. Better integration with the scheduler via runtime.netpoll().

It's also possible to look into the RIO windows API, however it's available only in new versions. I remember something about pinning WSA buffers into physical memory, so that the kernel does not need to pin/unpin them on every operation. But I can't find how to do it. Perhaps just VirtualLock() will do, but this requires investigation.

1 and 4 look quite simple and most likely can be done separately. 2 and 3 are harder to implement.
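To make points 1-3 above concrete, here is a toy Go sketch. It is not the eventual runtime code (which lives in C/.goc files): per-operation state is embedded in netFD so Read/Write allocate nothing per call, and a one-slot channel stands in for the runtime's lightweight semaphore. All names here are illustrative assumptions.

package netpollsketch

import "syscall"

// ioOp is the per-direction state embedded in netFD. There is at most one
// pending read and one pending write, which is what makes embedding safe.
type ioOp struct {
	o     syscall.Overlapped // reused for every operation in this direction
	buf   syscall.WSABuf     // reused WSA buffer descriptor
	sema  chan struct{}      // runtime.netpoll would unpark us; modeled as a channel
	qty   uint32             // bytes transferred, filled in by the poller
	errno syscall.Errno      // completion status, filled in by the poller
}

type netFD struct {
	sysfd syscall.Handle
	rop   ioOp
	wop   ioOp
}

// execIO issues the overlapped operation via submit (e.g. a WSARecv call)
// and parks until the poller signals completion. With IOCP a completion
// packet is queued even when the call completes synchronously, so waiting
// on the semaphore is correct in both cases.
func (fd *netFD) execIO(op *ioOp, submit func() error) (int, error) {
	if err := submit(); err != nil && err != syscall.ERROR_IO_PENDING {
		return 0, err
	}
	<-op.sema // parked; the poller sends here after dequeuing the completion
	if op.errno != 0 {
		return int(op.qty), op.errno
	}
	return int(op.qty), nil
}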
My reply is in http://play.golang.org/p/9bJ323Me7T. Codereview rejects my message because it is too long. Alex
On 2013/03/22 02:06:06, brainman wrote:
> My reply is in http://play.golang.org/p/9bJ323Me7T. Codereview rejects my
> message because it is too long.

When I first reported issue 5068, it was on amd64 windows 6g. But your results show that amd64 is not suffering from this issue. Maybe that was a patched result?
On 2013/03/22 02:25:38, Ewan Chou wrote:
> When I first reported issue 5068, it was on amd64 windows 6g. But your results
> show that amd64 is not suffering from this issue. Maybe that was a patched
> result?

What is your OS? How many cpus do you have? Can you do an investigation similar to mine on your computer and post the results here? Thank you.

Alex
On 2013/03/22 02:29:00, brainman wrote:
> What is your OS? How many cpus do you have? Can you do an investigation
> similar to mine on your computer and post the results here? Thank you.

I'm sorry, I don't know a graceful way to apply the patch and revert yet, so here is only the tip benchmark. It's windows 7 6g, and my CPU has 4 cores.

PASS
BenchmarkTCP4OneShot                     100   1890108 ns/op    6218 B/op   67 allocs/op
BenchmarkTCP4OneShot-2                  1000    193011 ns/op    6059 B/op   67 allocs/op
BenchmarkTCP4OneShot-3                  5000    181610 ns/op    6056 B/op   67 allocs/op
BenchmarkTCP4OneShotTimeout               50   2140122 ns/op    6730 B/op   75 allocs/op
BenchmarkTCP4OneShotTimeout-2            500    210012 ns/op    6729 B/op   75 allocs/op
BenchmarkTCP4OneShotTimeout-3           5000    206611 ns/op    6727 B/op   75 allocs/op
BenchmarkTCP4Persistent                  500    372021 ns/op    1659 B/op   15 allocs/op
BenchmarkTCP4Persistent-2              10000     19701 ns/op    1541 B/op   14 allocs/op
BenchmarkTCP4Persistent-3              20000     46102 ns/op    1539 B/op   14 allocs/op
BenchmarkTCP4PersistentTimeout           500    378021 ns/op    2364 B/op   23 allocs/op
BenchmarkTCP4PersistentTimeout-2        2000     58503 ns/op    2308 B/op   22 allocs/op
BenchmarkTCP4PersistentTimeout-3        2000     51002 ns/op    2325 B/op   23 allocs/op
BenchmarkTCP6OneShot                      50   2100120 ns/op    6169 B/op   67 allocs/op
BenchmarkTCP6OneShot-2                   500    232013 ns/op    6237 B/op   67 allocs/op
BenchmarkTCP6OneShot-3                  1000    214012 ns/op    6214 B/op   67 allocs/op
BenchmarkTCP6OneShotTimeout              100   1890108 ns/op    6862 B/op   75 allocs/op
BenchmarkTCP6OneShotTimeout-2           1000    219012 ns/op    6915 B/op   75 allocs/op
BenchmarkTCP6OneShotTimeout-3           1000    137007 ns/op    6925 B/op   75 allocs/op
BenchmarkTCP6Persistent                  500    318018 ns/op    1664 B/op   15 allocs/op
BenchmarkTCP6Persistent-2              10000     28501 ns/op    1542 B/op   14 allocs/op
BenchmarkTCP6Persistent-3               5000     47802 ns/op    1571 B/op   14 allocs/op
BenchmarkTCP6PersistentTimeout           500    402023 ns/op    2370 B/op   23 allocs/op
BenchmarkTCP6PersistentTimeout-2        5000     37002 ns/op    2273 B/op   22 allocs/op
BenchmarkTCP6PersistentTimeout-3        2000     52503 ns/op    2326 B/op   23 allocs/op
ok      net     7.974s
What version of go is this benchmark for? Use "hg id" to discover. Please supply benchmarks for 2 versions of go (just like I did): 16007 and 16362. You can bring your workspace files to the correct state by issuing "hg up -r 16007" or "hg up -r 16362". You will have to rebuild everything (make.bat) before running the benchmark.

Also, you didn't tell us what version of windows you have.

Alex
On 2013/03/22 03:25:25, brainman wrote:
> What version of go is this benchmark for? Use "hg id" to discover. Please
> supply benchmarks for 2 versions of go (just like I did): 16007 and 16362.
> [...]
> Also, you didn't tell us what version of windows you have.

The go version is f12b24ea373f+ tip, and the windows version is Windows 7 SP1 amd64. I don't expect my benchmark results would be much different from yours, except the one above.
On 2013/03/22 03:37:10, Ewan Chou wrote:
> The go version is f12b24ea373f+ tip, and the windows version is Windows 7 SP1
> amd64, ...

Well, you have version 16292 now (it is close enough to the 16362 that I tested).

# hg log -r f12b24ea373f
changeset:   16292:f12b24ea373f
user:        Mikio Hara <mikioh.mikioh@gmail.com>
date:        Sun Mar 17 19:50:01 2013 +0900
summary:     net: revert Zone in IPNet temporally

Can you re-run it for version 16007 now, so we can compare your two benchmarks? Thank you.

Alex
On 2013/03/22 03:48:14, brainman wrote:
> [...]

I can only supply a valid TCPOneShot benchmark, because sometimes I got the "An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full." error.

And are there any BAT or EXE versions of benchcmp?

16007:
BenchmarkTCPOneShot          2000    119006 ns/op
BenchmarkTCPOneShot-2        5000    108606 ns/op
BenchmarkTCPOneShot-3        5000     57803 ns/op

16292:
BenchmarkTCP4OneShot          100   1830105 ns/op
BenchmarkTCP4OneShot-2       1000    166009 ns/op
BenchmarkTCP4OneShot-3       5000     86404 ns/op
On 2013/03/22 04:52:50, Ewan Chou wrote:
> I can only supply a valid TCPOneShot benchmark, because sometimes I got the "An
> operation on a socket could not be performed because the system lacked
> sufficient buffer space or because a queue was full." error.

I had similar problems. I think the benchmark creates too many connections in 1s (the default benchmark time). So I used a shorter benchmark:

go test net -run=NONE -bench=TCP -benchmem -benchtime=100ms -cpu=1,2,3

Try a shorter benchtime until you see no errors.

> And are there any BAT or EXE versions of benchcmp?

No, but you should be able to rewrite the script into Go or something.

> 16007:
> BenchmarkTCPOneShot          2000    119006 ns/op
> [...]

This is your result:

benchmark                 old ns/op    new ns/op    delta
BenchmarkTCPOneShot          119006      1830105  +1437.83%
BenchmarkTCPOneShot-2        108606       166009    +52.85%
BenchmarkTCPOneShot-3         57803        86404    +49.48%

which shows a big slowdown. Are you sure you are using the versions as reported? Also, please make sure your benchmark runs end with no errors.

Thank you.

Alex
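For reference, the benchcmp shell script is easy enough to approximate in Go. A minimal sketch, assuming plain "go test -bench" output lines of the form "BenchmarkName  iterations  12345 ns/op" in the two input files; this is an illustration, not a drop-in replacement for misc/benchcmp:

package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strconv"
	"strings"
)

// readBench extracts "name -> ns/op" pairs from a benchmark output file.
func readBench(path string) map[string]float64 {
	f, err := os.Open(path)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	m := make(map[string]float64)
	s := bufio.NewScanner(f)
	for s.Scan() {
		fld := strings.Fields(s.Text())
		// keep only result lines: "BenchmarkName  iterations  12345 ns/op ..."
		if len(fld) >= 4 && strings.HasPrefix(fld[0], "Benchmark") && fld[3] == "ns/op" {
			if ns, err := strconv.ParseFloat(fld[2], 64); err == nil {
				m[fld[0]] = ns
			}
		}
	}
	return m
}

func main() {
	before, after := readBench(os.Args[1]), readBench(os.Args[2])
	fmt.Printf("%-34s %12s %12s %10s\n", "benchmark", "old ns/op", "new ns/op", "delta")
	// note: map iteration order is random; sort the keys for stable output
	for name, o := range before {
		if n, ok := after[name]; ok {
			fmt.Printf("%-34s %12.0f %12.0f %+9.2f%%\n", name, o, n, (n-o)/o*100)
		}
	}
}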
On 2013/03/22 06:04:12, brainman wrote:
> This is your result:
> [...]
> which shows a big slowdown. Are you sure you are using the versions as
> reported? Also, please make sure your benchmark runs end with no errors.

Yes, I'm sure.
If you see errors in the tcp benchmarks, try applying the following change:

--- a/src/pkg/net/tcp_test.go	Thu Mar 21 12:54:19 2013 +0400
+++ b/src/pkg/net/tcp_test.go	Fri Mar 22 10:08:15 2013 +0400
@@ -58,7 +58,7 @@
 func benchmarkTCP(b *testing.B, persistent, timeout bool, laddr string) {
 	const msgLen = 512
 	conns := b.N
-	numConcurrent := runtime.GOMAXPROCS(-1) * 16
+	numConcurrent := runtime.GOMAXPROCS(-1) * 4
 	msgs := 1
 	if persistent {
 		conns = numConcurrent

I had to apply it locally, because Linux also has some internal limits on tcp connections per second. When you hit some limit, the kernel throttles tcp activity unbelievably. Potentially this can solve the windows buffers bug.

On Fri, Mar 22, 2013 at 10:04 AM, <alex.brainman@gmail.com> wrote:
> [...]
>> 3. Do not do the 3-chan select (this also allocates memory and locks mutexes;
>> in particular the cancelc chan mutex is shared between read and write
>> operations).
> I do not understand the details here. Should I get rid of cancelc altogether,
> and just close resultc (given that it should move into netFD) instead? Maybe
> something else ...

Point 1 is that a 3-chan select is just a slow operation (relatively): it allocates memory, frees memory, sorts chans, locks and unlocks all the mutexes, etc. Point 2 is that AFAIK the HTTP/1.1 impl reads and writes from the tcp conn concurrently; this means that 2 goroutines execute a select involving the closec, and so they fight over the same closec internal mutex.
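For readers without fd_windows.go at hand, the wait being discussed has roughly this shape. This is an illustrative sketch, not the actual net package code; the channel names follow the discussion above:

package netpollsketch

import (
	"errors"
	"time"
)

var (
	errTimeout = errors.New("i/o timeout")
	errClosing = errors.New("use of closed network connection")
)

type ioResult struct {
	qty int
	err error
}

// waitIO models the per-operation wait in the old windows netFD: a
// three-way select over the IO result, the deadline timer, and the cancel
// channel. The select locks the internal mutex of every channel involved,
// and cancelc is shared between the reading and writing goroutines, which
// is the contention point described above.
func waitIO(resultc chan ioResult, deadline *time.Timer, cancelc chan struct{}) (int, error) {
	select {
	case r := <-resultc: // IO completed
		return r.qty, r.err
	case <-deadline.C: // deadline expired; the real code also cancels the IO
		return 0, errTimeout
	case <-cancelc: // fd is being closed
		return 0, errClosing
	}
}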
>> 5. Better integration with the scheduler via runtime.netpoll().
> What are the details, and why is it helpful?

The scheduler knows when to poll. The integrated version won't suffer from the blocking syscall problem, because the polling is done on an otherwise idle thread, so it does not need to wake up another thread to continue executing goroutines. Currently, if you issue a write, you need to block the current goroutine, switch to the poller, unblock the first goroutine, block the poller, and switch back to goroutine 1. With an integrated poller this happens much faster. An integrated poller also evenly distributes newly runnable goroutines across threads.
> On 386 (unlike amd64) all IO submission and cancellation happens on a single
> dedicated thread. But I cannot see why this path gets so much slower.

Aha! So most likely it's just due to the "single dedicated thread" requirement. Please patch the net package to do "single dedicated thread" on amd64 and benchmark it.
On Fri, Mar 22, 2013 at 10:21 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
> Aha! So most likely it's just due to the "single dedicated thread" requirement.
> Please patch the net package to do "single dedicated thread" on amd64 and
> benchmark it.

Btw, is it actually relevant? Do people run high performance servers on old 386 boxes? It's good that we support them, but perhaps we do not need to care about performance much.
On Friday, 22 March 2013 17:21:26 UTC+11, Dmitry Vyukov wrote:
> Please patch the net package to do "single dedicated thread" on amd64 and
> benchmark it.

It gets slower, but marginally:

# ~/go/root/misc/benchcmp netstats/amd64/id16362.txt a
benchmark                           old ns/op    new ns/op    delta
BenchmarkTCP6OneShot                   285138       345681   +21.23%
BenchmarkTCP6OneShot-2                 318339       314433    -1.23%
BenchmarkTCP6OneShot-3                 293926       369117   +25.58%
BenchmarkTCP6OneShotTimeout            288067       378882   +31.53%
BenchmarkTCP6OneShotTimeout-2          291973       349587   +19.73%
BenchmarkTCP6OneShotTimeout-3          300762       380835   +26.62%
BenchmarkTCP6Persistent                 47262        66890   +41.53%
BenchmarkTCP6Persistent-2               50289        59371   +18.06%
BenchmarkTCP6Persistent-3               49020        74214   +51.40%
BenchmarkTCP6PersistentTimeout          53707        69819   +30.00%
BenchmarkTCP6PersistentTimeout-2        55660        69722   +25.26%
BenchmarkTCP6PersistentTimeout-3        55269        82026   +48.41%

benchmark                           old allocs   new allocs   delta
BenchmarkTCP6OneShot                        67           73    8.96%
BenchmarkTCP6OneShot-2                      67           73    8.96%
BenchmarkTCP6OneShot-3                      67           73    8.96%
BenchmarkTCP6OneShotTimeout                 75           81    8.00%
BenchmarkTCP6OneShotTimeout-2               75           81    8.00%
BenchmarkTCP6OneShotTimeout-3               75           81    8.00%
BenchmarkTCP6Persistent                     14           18   28.57%
BenchmarkTCP6Persistent-2                   14           18   28.57%
BenchmarkTCP6Persistent-3                   14           19   35.71%
BenchmarkTCP6PersistentTimeout              22           26   18.18%
BenchmarkTCP6PersistentTimeout-2            22           26   18.18%
BenchmarkTCP6PersistentTimeout-3            22           27   22.73%

benchmark                           old bytes    new bytes    delta
BenchmarkTCP6OneShot                     6222         7387   18.72%
BenchmarkTCP6OneShot-2                   6219         7395   18.91%
BenchmarkTCP6OneShot-3                   6217         7370   18.55%
BenchmarkTCP6OneShotTimeout              6927         8071   16.52%
BenchmarkTCP6OneShotTimeout-2            6921         8077   16.70%
BenchmarkTCP6OneShotTimeout-3            6972         8087   15.99%
BenchmarkTCP6Persistent                  1543         2337   51.46%
BenchmarkTCP6Persistent-2                1591         2326   46.20%
BenchmarkTCP6Persistent-3                1571         2381   51.56%
BenchmarkTCP6PersistentTimeout           2286         3058   33.77%
BenchmarkTCP6PersistentTimeout-2         2306         3043   31.96%
BenchmarkTCP6PersistentTimeout-3         2288         3089   35.01%
On Friday, 22 March 2013 17:22:38 UTC+11, Dmitry Vyukov wrote:
> Btw, is it actually relevant? Do people run high performance servers on
> old 386 boxes? It's good that we support them, but perhaps we do not need
> to care about performance much.

It is a large regression. At the very least, I would like to understand why it is happening.

Alex
On Friday, 22 March 2013 17:10:13 UTC+11, Dmitry Vyukov wrote:
> If you see errors in the tcp benchmarks, try applying the following change:
> [...]

Tried that; it is still too aggressive. I can still see a few failed connections. The only way to avoid them is to set a smaller -benchtime. Perhaps this benchmark is misleading. Perhaps windows will throttle activity like that down for one reason or the other. I don't know :-(

Alex
It is what against what?

For amd64/16362 you've posted very different numbers before:

16007 - 16362 - still very slow, especially for -cpu=1

# ~/go/root/misc/benchcmp id16007.txt id16362.txt
benchmark                          old ns/op    new ns/op    delta
BenchmarkTCPOneShot                   281255      2031289  +622.22%
BenchmarkTCPOneShot-2                 218754       281255   +28.57%
BenchmarkTCPOneShot-3                 218754       187503   -14.29%
BenchmarkTCPOneShotTimeout            281255      2187542  +677.78%
BenchmarkTCPOneShotTimeout-2          218754       281255   +28.57%
BenchmarkTCPOneShotTimeout-3          218754       218754    +0.00%
BenchmarkTCPPersistent                 87501       343756  +292.86%
BenchmarkTCPPersistent-2               46875        62501   +33.34%
BenchmarkTCPPersistent-3               43750        70313   +60.72%
BenchmarkTCPPersistentTimeout         101564       312506  +207.69%
BenchmarkTCPPersistentTimeout-2        50000        78126   +56.25%
BenchmarkTCPPersistentTimeout-3        50000        62501   +25.00%

On Fri, Mar 22, 2013 at 10:36 AM, brainman <alex.brainman@gmail.com> wrote:
> It gets slower, but marginally:
> [benchmark tables snipped]
Now I must admit that I am lost in all these numbers for 386/amd64, old tip/tip/patched tip, dedicated thread/no dedicated thread, GOMAXPROCS=1/>1... What is still slow, if we apply this patch?

On Fri, Mar 22, 2013 at 10:54 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
> It is what against what?
> [...]
On Friday, 22 March 2013 17:54:43 UTC+11, Dmitry Vyukov wrote:
> It is what against what?
>
> For amd64/16362 you've posted very different numbers before:
>
> 16007 - 16362 - still very slow, especially for -cpu=1
> [...]

This is a diff between two clean versions (16007 is before the new scheduler and 16362 is the "current tip") on 386. This is the slowdown that I am concerned about. It is part of my earlier post http://play.golang.org/p/9bJ323Me7T.

Alex
On Friday, 22 March 2013 18:00:43 UTC+11, Dmitry Vyukov wrote:
> Now I must admit that I am lost in all these numbers for 386/amd64, old
> tip/tip/patched tip, dedicated thread/no dedicated thread, GOMAXPROCS=1/>1...
> What is still slow, if we apply this patch?

Do you mean your current CL? http://play.golang.org/p/9bJ323Me7T has some results:

...
# ~/go/root/misc/benchcmp id16007.txt id16362patched.txt
benchmark                          old ns/op    new ns/op    delta
BenchmarkTCPOneShot                   281255       281255    +0.00%
BenchmarkTCPOneShot-2                 218754       218754    +0.00%
BenchmarkTCPOneShot-3                 218754       203128    -7.14%
BenchmarkTCPOneShotTimeout            281255       281255    +0.00%
BenchmarkTCPOneShotTimeout-2          218754       218754    +0.00%
BenchmarkTCPOneShotTimeout-3          218754       218754    +0.00%
BenchmarkTCPPersistent                 87501        62501   -28.57%
BenchmarkTCPPersistent-2               46875        56251   +20.00%
BenchmarkTCPPersistent-3               43750        62501   +42.86%
BenchmarkTCPPersistentTimeout         101564        70313   -30.77%
BenchmarkTCPPersistentTimeout-2        50000        62501   +25.00%
BenchmarkTCPPersistentTimeout-3        50000        78126   +56.25%
...

I do not understand why your fix helps. Perhaps there is a better way to achieve that. That is my concern.

Alex
On Fri, Mar 22, 2013 at 10:36 AM, brainman <alex.brainman@gmail.com> wrote:
> It gets slower, but marginally:
>
> # ~/go/root/misc/benchcmp netstats/amd64/id16362.txt a
> benchmark                           old ns/op    new ns/op    delta
> BenchmarkTCP6OneShot                   285138       345681   +21.23%
> BenchmarkTCP6OneShot-2                 318339       314433    -1.23%
> BenchmarkTCP6OneShot-3                 293926       369117   +25.58%

So with this patch and "single dedicated thread", 386 is significantly *faster* than amd64, right? In your separate post you've posted the following numbers for 386:

# ~/go/root/misc/benchcmp id16007.txt id16362patched.txt
benchmark                old ns/op    new ns/op    delta
BenchmarkTCPOneShot         281255       281255    +0.00%
BenchmarkTCPOneShot-2       218754       218754    +0.00%
BenchmarkTCPOneShot-3       218754       203128    -7.14%
On Fri, Mar 22, 2013 at 1:23 PM, brainman <alex.brainman@gmail.com> wrote:
> I do not understand why your fix helps. Perhaps there is a better way to
> achieve that. That is my concern.

It helps because the new scheduler favors short/non-blocking syscalls (e.g. read/write), but penalizes blocking syscalls (e.g. GetQueuedCompletionStatus()). And it seems that the penalty on Windows is larger than on Linux. This change uses the old behaviour for GetQueuedCompletionStatus(), i.e. it does not penalize blocking syscalls too much. Actually it's even better, because it first tries a non-blocking GetQueuedCompletionStatus and only then a blocking one.
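Concretely, the two-phase wait described above looks roughly like this. A sketch assuming the syscall package's Windows wrappers; blockhint is a hypothetical stand-in for this CL's hint mechanism (runtime_blockingSyscallHint / g->blockingsyscall), whose real form lives in the runtime:

package netpollsketch

import "syscall"

// wait first polls the completion port with a zero timeout; only when
// nothing is ready does it hint the scheduler and issue the call that may
// block for an unbounded time.
func wait(iocp syscall.Handle, blockhint func()) (qty uint32, o *syscall.Overlapped, err error) {
	var key uint32
	// Non-blocking probe: fails with WAIT_TIMEOUT if no packet is queued.
	err = syscall.GetQueuedCompletionStatus(iocp, &qty, &key, &o, 0)
	if err == syscall.Errno(syscall.WAIT_TIMEOUT) {
		blockhint() // tell the scheduler the next syscall is expected to block
		err = syscall.GetQueuedCompletionStatus(iocp, &qty, &key, &o, syscall.INFINITE)
	}
	return qty, o, err
}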
On Friday, 22 March 2013 20:24:39 UTC+11, Dmitry Vyukov wrote:
> So with this patch and "single dedicated thread", 386 is significantly
> *faster* than amd64, right?

I wouldn't compare the 386 and amd64 results between themselves. These are two very different computers. But, yes, your patch helps on 386 (386 does "single dedicated thread" unconditionally - it cannot do anything else). Especially for the -cpu=1 case.

Alex
On 2013/03/22 09:29:09, dvyukov wrote:
> It helps because the new scheduler favors short/non-blocking syscalls
> (e.g. read/write), but penalizes blocking syscalls (e.g.
> GetQueuedCompletionStatus()). And it seems that the penalty on Windows is
> larger than on Linux.

How do you decide whether a particular syscall is short/non-blocking? None of the windows syscalls are marked as such in any way.

> This change uses the old behaviour for GetQueuedCompletionStatus(),
> i.e. it does not penalize blocking syscalls too much. Actually it's even
> better, because it first tries a non-blocking GetQueuedCompletionStatus and
> only then a blocking one.

Fair enough - it helps here. What happens in other places where some of our syscalls "block"? Do we get a similar regression there?

Also, we have now replaced every GetQueuedCompletionStatus call with two. It is quite a busy place here - every io completes here. Is there a price to pay?

Alex
On Fri, Mar 22, 2013 at 2:00 PM, <alex.brainman@gmail.com> wrote:
> How do you decide whether a particular syscall is short/non-blocking? None of
> the windows syscalls are marked as such in any way.

The scheduler initially assumes that every syscall is non-blocking, so it does nothing when a thread enters a syscall. Then a background thread sleeps for a while and checks the statuses of all threads. If it finds a thread that has been sitting in a syscall for more than X us, it assumes that the syscall is blocking and wakes up another worker thread to replace the blocked one.

> Fair enough - it helps here. What happens in other places where some of our
> syscalls "block"? Do we get a similar regression there?

It is possible.

> Also, we have now replaced every GetQueuedCompletionStatus call with two. It
> is quite a busy place here - every io completes here. Is there a price to pay?

I think it depends on the workload. If the non-blocking GetQueuedCompletionStatus usually discovers a new completion notification, then it's a win. If under a particular workload we execute the non-blocking GetQueuedCompletionStatus just to discover that there are no notifications, then it's a loss. You can benchmark a version with only the blocking GetQueuedCompletionStatus (but the net benchmarks are pretty synthetic, so it will be difficult to decide).
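As a toy illustration of that monitoring idea (the real logic is the runtime's background monitor, written in C at the time of this review; every name below is a simplified stand-in, not the runtime's API):

package netpollsketch

import (
	"sync/atomic"
	"time"
)

// syscallStart is 0 while the worker runs Go code, otherwise the UnixNano
// time at which it entered a syscall.
var syscallStart int64

func enterSyscall() { atomic.StoreInt64(&syscallStart, time.Now().UnixNano()) }
func exitSyscall()  { atomic.StoreInt64(&syscallStart, 0) }

// monitor periodically checks the worker; once a syscall has lasted longer
// than threshold, it assumes the call is blocking and invokes wake to start
// a replacement worker thread.
func monitor(threshold time.Duration, wake func()) {
	for {
		time.Sleep(threshold)
		if t := atomic.LoadInt64(&syscallStart); t != 0 &&
			time.Now().UnixNano()-t > int64(threshold) {
			wake()
		}
	}
}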
On 2013/03/22 10:12:53, dvyukov wrote:
> ..., so it will be difficult to decide.

I agree. :-) It raises more questions than it answers.

What about a different solution - we leave the GetQueuedCompletionStatus call as before, but call runtime_blockingSyscallHint right before it? GetQueuedCompletionStatus is about waiting, so let it block unconditionally.

If that works, perhaps we can provide new syscall.Syscall-like functions to give access to that functionality outside of runtime/syscall.

Do you think it is worth investigating? I am just talking ... here :-)

Alex
On Fri, Mar 22, 2013 at 2:44 PM, <alex.brainman@gmail.com> wrote:
> What about a different solution - we leave the GetQueuedCompletionStatus
> call as before, but call runtime_blockingSyscallHint right before it?
> GetQueuedCompletionStatus is about waiting, so let it block unconditionally.

Yes, that's what I meant. It should work.

> If that works, perhaps we can provide new syscall.Syscall-like functions to
> give access to that functionality outside of runtime/syscall.
>
> Do you think it is worth investigating? I am just talking ... here :-)

I had such a thing on Linux at one point (locally). So it's doable. But the general sentiment about adding additional knobs here is... not completely positive. And that's understandable, because by adding knobs we just transfer our problems onto users.

Actually, windows has a super cool feature called UMS (since Vista, AFAIR). It allows one to actually detect thread blocking (w/o guessing). But that would require considerable windows-specific parts in the scheduler, and UMS is not the most trivial API to program. This is not for Go 1.1 in any way, shape or form. But it would allow solving all the syscall problems in the long term.
On 2013/03/22 10:53:36, dvyukov wrote:
> Yes, that's what I meant. It should work.

I will try it next week.

> I had such a thing on Linux at one point (locally). So it's doable. But the
> general sentiment about adding additional knobs here is... not completely
> positive. And that's understandable, because by adding knobs we just transfer
> our problems onto users.

There is a large regression here. Perhaps knobs are unavoidable.

It LGTM, if you would like to submit now. But I would wait for Russ. What do I know :-)

Thank you for your patience.

Alex
Russ,

Can you take a look at this CL?

A brief summary of the discussion: Windows performance has degraded with the new scheduler because of the blocking GetQueuedCompletionStatus() call, especially for GOMAXPROCS=1. This CL brings the performance back to a level comparable with the old sched (somewhere worse, somewhere better, but no pathological degradation). A proper fix will be to integrate the Windows network poller with the runtime, but this is a lot of work.

Alex said LGTM.
LGTM

https://codereview.appspot.com/7612045/diff/4004/src/pkg/runtime/cgocall.c
File src/pkg/runtime/cgocall.c (right):

https://codereview.appspot.com/7612045/diff/4004/src/pkg/runtime/cgocall.c#ne...
src/pkg/runtime/cgocall.c:157: if(Windows && g->blockingsyscall) {
Is it terribly expensive to drop the Windows &&? If not, please drop it.
Ewan, does this CL help in your case?

Alex
On 2013/03/22 19:42:26, brainman wrote:
> Ewan, does this CL help in your case?

Yes, I applied this patch, and it fixed the issue.
On 2013/03/22 14:26:29, rsc wrote:
> LGTM
>
> src/pkg/runtime/cgocall.c:157: if(Windows && g->blockingsyscall) {
> Is it terribly expensive to drop the Windows &&?
> If not, please drop it.

Done
*** Submitted as https://code.google.com/p/go/source/detail?r=2be8c885acc8 ***

net: band-aid for windows network poller

Fixes performance of the current windows network poller
with the new scheduler.
Gives runtime a hint when GetQueuedCompletionStatus() will block.
Fixes issue 5068.

benchmark                    old ns/op    new ns/op    delta
BenchmarkTCP4Persistent        4004000        33906   -99.15%
BenchmarkTCP4Persistent-2        21790        17513   -19.63%
BenchmarkTCP4Persistent-4        44760        34270   -23.44%
BenchmarkTCP4Persistent-6        45280        43000    -5.04%

R=golang-dev, alex.brainman, coocood, rsc
CC=golang-dev
https://codereview.appspot.com/7612045
On 2013/03/21 18:17:53, dvyukov wrote:
> ...
> 5. Better integration with the scheduler via runtime.netpoll().

I tried to do that, and I need your advice. I am not sure what to do when net.pollDesc's WaitRead or WaitWrite returns errClosing or errTimeout. The problem is that, unlike the unix implementation, windows io is in flight at the time (the os is in the process of reading/writing the Go buffers), so I cannot just bail out of the current function. What I do in our current implementation is call windows CancelIo or CancelIoEx and then go back into waiting forever for my io to finish (it will complete successfully or with ERROR_OPERATION_ABORTED, depending on whether the io finished before my CancelIo call). I don't believe your current runtime netpoll design allows for this situation. What do you suggest I should do?

Thank you.

Alex
On Thu, Apr 11, 2013 at 7:04 PM, <alex.brainman@gmail.com> wrote:
> I am not sure what to do when net.pollDesc's WaitRead or WaitWrite returns
> errClosing or errTimeout. [...] What do you suggest I should do?

I think you need to do roughly the same as now. If WaitRead/Write returns errTimeout, it just means that a timer fired inside the runtime. Call CancelIo() and wait for operation completion. If the operation returns success, return success. If it is cancelled, return errTimeout. And similarly for errClosing.
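In code, that rule reads roughly as follows, reusing the hypothetical ioOp/netFD model from the sketch earlier in this thread; CancelIo is the real syscall wrapper, everything else is illustrative:

package netpollsketch

import "syscall"

// reapAfterWait applies the cancel-and-reap rule: waitErr is the
// errTimeout/errClosing returned by the poller's wait. The kernel may still
// own the buffer, so cancel the IO and wait for the completion notification
// before returning to the caller.
func (fd *netFD) reapAfterWait(op *ioOp, waitErr error) (int, error) {
	if waitErr == nil {
		return int(op.qty), nil // completed normally
	}
	syscall.CancelIo(fd.sysfd) // CancelIoEx where available
	<-op.sema                  // reap the (possibly aborted) completion
	if op.errno == 0 {
		return int(op.qty), nil // the IO finished before the cancel landed
	}
	return 0, waitErr // report timeout/close, not ERROR_OPERATION_ABORTED
}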
On 2013/04/12 02:28:50, dvyukov wrote:
> If WaitRead/Write returns errTimeout, it just means that a timer fired inside
> the runtime. Call CancelIo() and wait for operation completion. If the
> operation returns success, return success. If it is cancelled, return
> errTimeout. And similarly for errClosing.

That sounds good, but I am worried about the netpoll internal state. For example, when I call Evict from netFD.Close, netpoll marks this connection as "closing". But I need to call ReadWait again to wait for io completion after CancelIo. How do I handle that? Also, should netFD.Close() return immediately after calling Evict? The "wakeup / cancelio / wait-for-io-to-complete" should be quick, but it is up to the OS to decide.

Alex
On 2013/04/12 02:51:49, brainman wrote:
> That sounds good, but I am worried about the netpoll internal state.

I think for Windows you need a different impl of netpoll.goc.

> For example, when I call Evict from netFD.Close, netpoll marks this connection
> as "closing". But I need to call ReadWait again to wait for io completion
> after CancelIo. How do I handle that?

I would call CancelIoEx() during Close() or in the deadline timer func, so that Read()/Write() just wait once for completion; if Read()/Write() gets ERROR_OPERATION_ABORTED, then it needs to look at some variables to understand whether it's errClosing or errTimeout.

> Also, should netFD.Close() return immediately after calling Evict? The
> "wakeup / cancelio / wait-for-io-to-complete" should be quick, but it is up
> to the OS to decide.

I think that's fine if Close() returns immediately. The last outstanding Read/Write will close the descriptor when completed.

Here is a very dirty sketch of what I would try. This includes both the net and runtime parts. Part of the complexity is added by the fact that if an operation completes synchronously, I wait for the notification during the *next* operation (when the notification has most likely already happened). Probably CancelIoEx() can be moved out of the mutex. The state is quite complex, but it's always mutated under the mutex, so it's easy to ensure that it transfers from one legal state to another. This code contains bugs, but I hope you can get the idea. Then you just need to carefully handle all the state transitions.

type netFD struct {
	...
	ro, wo OVERLAPPED                 // to not allocate each time
	rpending, wpending bool           // ro/wo is pending (another thread can cancel ro/wo)
	rprevpending, wprevpending bool   // the previous op completed synchronously
	rcompleted, wcompleted bool       // overlapped operation has completed
	rg, wg *G                         // waiting goroutine
	rdeadline, wdeadline int64
	rerr, werr error                  // read/write result (valid when rcompleted/wcompleted == true)
	rn, wn int                        // read/write result (valid when rcompleted/wcompleted == true)
	closing bool
	Lock
}

func (fd *netFD) Read(buf []byte) (int, error) {
	// the same as now
	if err := fd.incref(false); err != nil {
		return 0, err
	}
	defer fd.decref()
	fd.rio.Lock()
	defer fd.rio.Unlock()
	// runtime part starts here
	// check that the socket has not yet timed out or been closed
	runtime.lock(fd)
	if(fd.closing) {
		runtime.unlock(fd)
		return errClosing
	}
	if(fd.rdeadline < 0) {
		runtime.unlock(fd)
		return errTimeout
	}
	if(fd.rprevpending && !fd.rcompleted) {
		fd.rg = g // wait for the *previous* read to complete
		runtime.park(runtime.unlock, fd, "IO wait")
		runtime.lock(fd)
	}
	fd.rcompleted = false
	fd.rprevpending = false
	runtime.unlock(fd)
	n, err := WSARecv(fd.sysfd, buf, ..., &fd.ro)
	if(err != nil && err != ERROR_IO_PENDING)
		return 0, err
	if(err == nil) {
		// operation completed synchronously
		runtime.lock(fd)
		if(fd.closing && !fd.rcompleted) {
			fd.rg = g // have to wait for the completion notification
			runtime.park(runtime.unlock, fd, "IO wait")
		}
		// let the next reader wait for the notification
		fd.rprevpending = true
		runtime.unlock(fd)
		return n, err
	}
	// err == ERROR_IO_PENDING
	runtime.lock(fd)
	if(fd.closing || fd.rdeadline < 0) {
		fd.rerr = fd.closing ? errClosing : errTimeout
		CancelIoEx(fd.sysfd, &fd.ro)
	}
	if(!fd.rcompleted) {
		fd.rpending = true
		fd.rg = g
		runtime.park(runtime.unlock, fd, "IO wait")
	} else
		runtime.unlock(fd)
	return fd.rn, fd.rerr
}

G *runtime.netpoll(...) {
	G *glist = nil;
	syscall.GetQueuedCompletionStatus(...)
	foreach returned fd, n, err, isread {
		runtime.lock(fd)
		if(isread) {
			fd.rn = n
			if(err != ERROR_OPERATION_ABORTED)
				fd.rerr = err // whoever cancels the io sets fd.rerr
			fd.rcompleted = true
			fd.rpending = false
			if(fd.rg != nil) {
				fd.rg->link = glist
				glist = fd.rg
			}
		} else {
			// the same for write
		}
		runtime.unlock(fd)
	}
	return glist
}

// the timer is set in setDeadline()
void rdeadline(...) { // deadline timer handler
	runtime.lock(fd)
	fd.rdeadline = -1
	if(fd.rpending) {
		fd.rerr = fd.closing ? errClosing : errTimeout
		CancelIoEx(fd.sysfd, &fd.ro)
	}
	runtime.unlock(fd)
}

void evict(...) {
	runtime.lock(fd)
	fd.closing = true
	fd.rerr = errClosing
	// cancel timers
	if(fd.rpending)
		CancelIoEx(fd.sysfd, &fd.ro)
	if(fd.wpending)
		CancelIoEx(fd.sysfd, &fd.wo)
	if(fd.rprevpending) {
		// have to wait here
		if(!fd.rcompleted) {
			fd.rg = g
			runtime.park(runtime.unlock, fd, "IO wait")
		}
		fd.rcompleted = false
		fd.rprevpending = false
	}
	if(fd.wprevpending) {
		// the same for writes
	}
	runtime.unlock(fd)
}
On 2013/04/15 05:18:39, dvyukov wrote:
> ...
> I think for Windows you need a different impl of netpoll.goc.
> ...

This looks very complicated to me :-( Also, CancelIoEx is present on some systems, but not on others. If it is not found, we use CancelIo instead. But CancelIo cancels all io issued by the calling thread, so we use a special goroutine (locked to an os thread) to do all IO submissions and cancellations. I am not sure how easy it is to implement all this in the runtime. Also, all runtime syscall addresses are resolved by the windows executable loader, so we cannot use api functions that we know might not be present on some systems.

I just tried to do all the cancellation logic inside of fd_windows.go instead. I have added a new function similar to runtime_pollWait, but one that does not call checkerr on entry or exit. Then I call CancelIoEx (when runtime_pollWait returns with errClosing or errTimeout) and go wait on that function. I discard any results and return the previously reported errClosing or errTimeout.

All tests pass now. But I don't see any benchmark improvements :-(. I haven't tested my changes on amd64 yet, so perhaps I will see some there. Still, I would expect some improvement, because I am no longer fiddling with timers. It is hard to benchmark these, because you cannot run them for very long (the os runs out of sockets).

I will keep at it, and report back. Thank you for your help.

Alex
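As an aside to the CancelIoEx-availability problem above: the probe itself is easy to do from Go code in the net package (the preloaded-syscall constraint only binds the runtime). A minimal sketch, with invented names, of how such a probe could look:

package main

import (
	"fmt"
	"syscall"
)

// canCancelIO would report whether the running Windows exports
// CancelIoEx (Vista and later). If it doesn't, the fallback is
// CancelIo plus a dedicated IO goroutine locked to its os thread,
// since CancelIo cancels all io issued by the calling thread.
var canCancelIO bool

func probeCancelIoEx() {
	k32, err := syscall.LoadLibrary("kernel32.dll")
	if err != nil {
		return // assume no CancelIoEx
	}
	defer syscall.FreeLibrary(k32)
	if proc, err := syscall.GetProcAddress(k32, "CancelIoEx"); err == nil && proc != 0 {
		canCancelIO = true
	}
}

func main() {
	probeCancelIoEx()
	fmt.Println("CancelIoEx available:", canCancelIO)
}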
Here are my benchmark results on windows/amd64:

# ~/go/root/misc/benchcmp old.txt new.txt
benchmark                          old ns/op    new ns/op    delta
BenchmarkTCP4OneShot                  314433       292950   -6.83%
BenchmarkTCP4OneShot-2                343728       347634   +1.14%
BenchmarkTCP4OneShot-3                339822       308574   -9.20%
BenchmarkTCP4OneShotTimeout           343728       338845   -1.42%
BenchmarkTCP4OneShotTimeout-2         345681       333963   -3.39%
BenchmarkTCP4OneShotTimeout-3         353493       341775   -3.31%
BenchmarkTCP4Persistent                41013        33591  -18.10%
BenchmarkTCP4Persistent-2              38669        35544   -8.08%
BenchmarkTCP4Persistent-3              40231        37692   -6.31%
BenchmarkTCP4PersistentTimeout         44528        33396  -25.00%
BenchmarkTCP4PersistentTimeout-2       43942        35349  -19.56%
BenchmarkTCP4PersistentTimeout-3       48239        36325  -24.70%

# cat old.txt
PASS
BenchmarkTCP4OneShot                 500       314433 ns/op
BenchmarkTCP4OneShot-2               500       343728 ns/op
BenchmarkTCP4OneShot-3               500       339822 ns/op
BenchmarkTCP4OneShotTimeout          500       343728 ns/op
BenchmarkTCP4OneShotTimeout-2        500       345681 ns/op
BenchmarkTCP4OneShotTimeout-3        500       353493 ns/op
BenchmarkTCP4Persistent             5000        41013 ns/op
BenchmarkTCP4Persistent-2           5000        38669 ns/op
BenchmarkTCP4Persistent-3           5000        40231 ns/op
BenchmarkTCP4PersistentTimeout      5000        44528 ns/op
BenchmarkTCP4PersistentTimeout-2    5000        43942 ns/op
BenchmarkTCP4PersistentTimeout-3    5000        48239 ns/op

# cat new.txt
PASS
BenchmarkTCP4OneShot                 500       292950 ns/op
BenchmarkTCP4OneShot-2              1000       347634 ns/op
BenchmarkTCP4OneShot-3               500       308574 ns/op
BenchmarkTCP4OneShotTimeout         1000       338845 ns/op
BenchmarkTCP4OneShotTimeout-2        500       333963 ns/op
BenchmarkTCP4OneShotTimeout-3        500       341775 ns/op
BenchmarkTCP4Persistent             5000        33591 ns/op
BenchmarkTCP4Persistent-2           5000        35544 ns/op
BenchmarkTCP4Persistent-3           5000        37692 ns/op
BenchmarkTCP4PersistentTimeout      5000        33396 ns/op
BenchmarkTCP4PersistentTimeout-2    5000        35349 ns/op
BenchmarkTCP4PersistentTimeout-3    5000        36325 ns/op
#

These are with just the windows runtime netpoll implemented. Speedups are visible here, especially where timers are involved. I still haven't removed any memory allocations during net IO. I reckon I should proceed with this; the code changes so far aren't substantial. I will clean up my code and post it for review.

Alex
On Mon, Apr 15, 2013 at 12:36 AM, <alex.brainman@gmail.com> wrote:
> On 2013/04/15 05:18:39, dvyukov wrote:
>> ...
>> I think for Windows you need a different impl of netpoll.goc.
>> ...
>
> This looks very complicated to me :-( Also, CancelIoEx is present on
> some systems, but not on others. If it is not found, we use CancelIo
> instead. But CancelIo cancels all io issued by the calling thread, so
> we use a special goroutine (locked to an os thread) to do all IO
> submissions and cancellations. I am not sure how easy it is to
> implement all this in the runtime. Also, all runtime syscall addresses
> are resolved by the windows executable loader, so we cannot use api
> functions that we know might not be present on some systems.

Do we care much about the performance of systems that do not have CancelIoEx? It has been present for about 6 years. I would leave the non-CancelIoEx code as is, and implement the integrated poller only for systems supporting CancelIoEx/GetQueuedCompletionStatusEx. It's possible to call CancelIoEx only from Go code in the net package, and to call back from the runtime into net to invoke it. But it's not that easy to do the same for GetQueuedCompletionStatusEx, because netpoll() cannot execute Go code (cannot allocate memory, switch stacks and so on).

> I just tried to do all the cancellation logic inside of fd_windows.go
> instead. I have added a new function similar to runtime_pollWait, but
> one that does not call checkerr on entry or exit. Then I call
> CancelIoEx (when runtime_pollWait returns with errClosing or
> errTimeout) and go wait on that function. I discard any results and
> return the previously reported errClosing or errTimeout.
>
> All tests pass now. But I don't see any benchmark improvements :-(. I
> haven't tested my changes on amd64 yet, so perhaps I will see some
> there. Still, I would expect some improvement, because I am no longer
> fiddling with timers. It is hard to benchmark these, because you cannot
> run them for very long (the os runs out of sockets).
>
> I will keep at it, and report back. Thank you for your help.
>
> Alex
>
> https://codereview.appspot.com/7612045/
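For reference, GetQueuedCompletionStatusEx is likewise absent from the syscall package, but Go code outside the runtime could reach it dynamically, the same way as CancelIoEx. A hedged sketch follows - the wrapper name is invented, and the overlappedEntry layout is transcribed from the Win32 OVERLAPPED_ENTRY documentation:

package main

import (
	"fmt"
	"syscall"
	"unsafe"
)

// overlappedEntry mirrors Win32's OVERLAPPED_ENTRY, which lets
// GetQueuedCompletionStatusEx dequeue many completions in one call.
type overlappedEntry struct {
	CompletionKey uintptr
	Overlapped    *syscall.Overlapped
	Internal      uintptr
	NumberOfBytes uint32
}

var (
	modkernel32 = syscall.NewLazyDLL("kernel32.dll")
	procGQCSEx  = modkernel32.NewProc("GetQueuedCompletionStatusEx")
)

// getQueuedCompletionStatusEx dequeues up to len(entries) completed
// operations from the completion port in a single system call.
func getQueuedCompletionStatusEx(iocp syscall.Handle, entries []overlappedEntry, timeoutMs uint32) (int, error) {
	var n uint32
	r, _, e := procGQCSEx.Call(
		uintptr(iocp),
		uintptr(unsafe.Pointer(&entries[0])),
		uintptr(len(entries)),
		uintptr(unsafe.Pointer(&n)),
		uintptr(timeoutMs),
		0) // fAlertable = FALSE
	if r == 0 {
		return 0, e
	}
	return int(n), nil
}

func main() {
	if err := procGQCSEx.Find(); err != nil {
		fmt.Println("GetQueuedCompletionStatusEx not available:", err)
		return
	}
	fmt.Println("GetQueuedCompletionStatusEx available")
}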
On 2013/04/17 03:59:49, dvyukov wrote:
> ...

How about https://codereview.appspot.com/8670044/ ?

I will split it into simpler CLs if it looks OK to you.

Alex
On Thu, Apr 18, 2013 at 12:40 AM, <alex.brainman@gmail.com> wrote:
> On 2013/04/17 03:59:49, dvyukov wrote:
>> ...
>
> How about https://codereview.appspot.com/8670044/ ?
>
> I will split it into simpler CLs if it looks OK to you.

I will look at it later, probably in a week or two. This is not for Go1.1, right? So we have plenty of time.
On 2013/04/20 03:36:22, dvyukov wrote:
> I will look at it later, probably in a week or two. This is not for Go1.1,
> right? So we have plenty of time.

Sure. No hurry. This is for after Go1.1. Thank you.

Alex
On 2013/04/20 03:36:22, dvyukov wrote:
> ...
> I will look at it later, probably in a week or two. This is not for Go1.1,
> right? So we have plenty of time.

Just a reminder: Mikio wants to move it along http://goo.gl/OuoNT. I am not asking for a thorough review - I just want to make sure there are no show stoppers. As I said earlier, I am planning to split this change into smaller CLs, so it is easier to review.

Thank you.

Alex