x/net/websocket: slowdown on OS X due to rand.Reader #7000

Closed · gopherbot opened this issue Dec 21, 2013 · 21 comments

@gopherbot

by kell.sommers:

Benchmarking a web socket server I started hitting a single-threaded wall. It seems
messages/second performance on OS X doesn't increase when GOMAXPROCS is set beyond 1.
Some others on the IRC channel took my short reproducible test for a run on other Macs
and on Linux and FreeBSD machines and verified there is something going on with Go on
OS X.

I did additional tests (shown further down) to figure out whether the OS X problem is on
the client side or the server side. It seems it is on the server side. Those tests go
over the network, not over localhost. I know this isn't a localhost issue because I have
a C-based HTTP server that can do 500,000 HTTP requests/second on this machine over
localhost with the much bulkier HTTP protocol.

Go 1.2 on all machines tested.

The test code can be found here:
https://gist.github.com/kellabyte/8032308
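
(Editorial note: the gist is not reproduced here. As a rough, hedged reconstruction, a
throughput client along these lines would exercise the same send path: N connections
each sending M messages over x/net/websocket. The server address, path, payload, and
argument handling below are assumptions, not the gist's actual code.)

package main

import (
	"fmt"
	"os"
	"strconv"
	"sync"
	"time"

	"golang.org/x/net/websocket"
)

func main() {
	// Usage: client <connections> <messages-per-connection>
	conns, _ := strconv.Atoi(os.Args[1])
	msgs, _ := strconv.Atoi(os.Args[2])

	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < conns; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Server address and path are placeholders.
			ws, err := websocket.Dial("ws://127.0.0.1:8080/echo", "", "http://127.0.0.1/")
			if err != nil {
				panic(err)
			}
			defer ws.Close()
			for j := 0; j < msgs; j++ {
				// Each client Send generates a fresh mask key, which is what
				// hits rand.Reader (see the discussion below).
				if err := websocket.Message.Send(ws, "ping"); err != nil {
					panic(err)
				}
			}
		}()
	}
	wg.Wait()

	total := conns * msgs
	elapsed := time.Since(start)
	fmt.Printf("Sent %d in %v for a rate of %d/second\n",
		total, elapsed, int(float64(total)/elapsed.Seconds()))
}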

----------------------------------------------
OSX i7-3720QM CPU @ 2.60GHz go1.2 darwin/amd64
----------------------------------------------
Performance doesn't increase with GOMAXPROCS; it gets worse on OS X!

GOMAXPROCS=1
./websocket_client 4 100000
Sent 400000 in 2.389414969s for a rate of 167404/second

GOMAXPROCS=2
./websocket_client 4 100000
Sent 400000 in 6.219798526s for a rate of 64310/second <-- Lost 100K msgs/second!

GOMAXPROCS=3
./websocket_client 4 100000
Sent 400000 in 5.487955623s for a rate of 72886/second

GOMAXPROCS=4
./websocket_client 4 100000
Sent 400000 in 7.428882557s for a rate of 53843/second

---------------
2013 MBP retina
---------------
Confirmed again on someone else's OS X machine.

$ sysctl -n machdep.cpu.brand_string
Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz
$ go version
go version go1.2 darwin/amd64
$ ./client 1 100000
Sent 100000 in 896.185619ms for a rate of 111584/second
$ ./client 4 100000
Sent 400000 in 3.661056594s for a rate of 109258/second
$ GOMAXPROCS=4 ./client 1 100000
Sent 100000 in 923.826561ms for a rate of 108245/second
$ GOMAXPROCS=4 ./client 4 100000
Sent 400000 in 3.254549872s for a rate of 122904/second

----------------------------------------
Ubuntu 13.10, cpu is Intel Core i5-2500K
----------------------------------------

This one seems to scale with GOMAXPROCS!

$ go version
go version go1.2 linux/amd64
$ ./client 1 100000
Sent 100000 in 428.895171ms for a rate of 233157/second
$ ./client 4 100000
Sent 400000 in 1.79000027s for a rate of 223463/second
$ GOMAXPROCS=4 ./client 1 100000
Sent 100000 in 454.46171ms for a rate of 220040/second
$ GOMAXPROCS=4 ./client 4 100000
Sent 400000 in 700.25884ms for a rate of 571217/second

--------------------------------------------------
2.2GHz dual core Ivy Bridge FreeBSD 10.0rc1 Go 1.2
--------------------------------------------------

GOMAXPROCS=1:
./client 4 100000
Sent 400000 in 2.723452743s for a rate of 146872/second

GOMAXPROCS=2:
./client 4 100000
Sent 400000 in 3.107706227s for a rate of 128712/second

GOMAXPROCS=3:
./client 4 100000
Sent 400000 in 1.903976998s for a rate of 210086/second

GOMAXPROCS=4:
./client 4 100000
Sent 400000 in 1.736387681s for a rate of 230363/second

-------------------------------------
Ubuntu server and OSX client over GbE LAN
-------------------------------------
Server: Ubuntu AMD Phenom(tm) 9150e Quad-Core Processor
Client: OSX i7-3720QM CPU @ 2.60GHz
GOMAXPROCS=4 on both client and server

./websocket_client 4 400000
Sent 1600000 in 3.354864357s for a rate of 476919/second

-------------------------------------
OSX server and Ubuntu client over GbE LAN
-------------------------------------
The OSX machine never goes above 115% CPU so the other 3 cores aren't used much.

Server: OSX i7-3720QM CPU @ 2.60GHz    
Client: Ubuntu AMD Phenom(tm) 9150e Quad-Core Processor    
GOMAXPROCS=4 on both client and server

./websocket_client 4 400000
Sent 1600000 in 7.687288138s for a rate of 208135/second
@mikioh
Contributor

mikioh commented Dec 22, 2013

Comment 1:

Note that "OSX" means OS X Mavericks and "elses OSX" means OS X Mountain Lion.

Labels changed: added os-macosx.

@davecheney
Contributor

Comment 2:

I wonder if this is related to the 'Power Nap' nonsense that Apple implemented in 10.9.
@Kelly, are you able to disable Power Nap (you can see the effect in Activity
Monitor.app)?
Does updating to 10.9.1 fix, change, or worsen the problem?

@gopherbot
Author

Comment 3 by kell.sommers:

I disabled the Power Nap feature and didn't see any change. As far as I know, Power Nap
allows applications like the mail client to periodically check for new messages; it's
not clear to me how this would affect performance when the machine isn't sleeping.
Also, sorry for being vague on some of the details. The Mac I am testing on is already
running 10.9.1.

@gopherbot
Author

Comment 4 by michael.w.keesey:

FWIW, some of us have seen this under Mountain Lion, which predates Power Nap.

@dvyukov
Member

dvyukov commented Dec 22, 2013

Comment 6:

I think the attached file explains at least part of the problem.
I don't know why it wants to read from /dev/random on every message send.

Attachments:

  1. websocket.block.svg (20017 bytes)

@dvyukov
Member

dvyukov commented Dec 22, 2013

Comment 7:

Adam, do you know if it's necessary to read /dev/random that frequently in this case?
Why?

@agl
Contributor

agl commented Dec 22, 2013

Comment 8:

WebSocket needs to mask each message because of buggy intermediate proxies that would
otherwise have been vulnerable to cache poisoning. The masking is very dumb as I recall:
just a repeated, 32-bit XOR mask. But it does need to be random.
There are also other cases where we need very rapid entropy: AES-CBC in TLS, for one,
where each record needs a random IV.
Ideally, this would be a thread-local structure (or CPU-local, where the kernel supports
it).
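
(Editorial note: for readers unfamiliar with the masking agl describes, here is a
minimal, illustrative sketch, not the actual x/net/websocket implementation: a random
4-byte key is drawn per frame and XORed, repeating, over the payload.)

package main

import (
	"crypto/rand"
	"fmt"
)

// maskPayload XORs payload in place with a repeating 4-byte key.
// Because XOR is its own inverse, applying it twice restores the original bytes.
func maskPayload(key [4]byte, payload []byte) {
	for i := range payload {
		payload[i] ^= key[i%4]
	}
}

func main() {
	var key [4]byte
	// Each frame needs a fresh random key; this read is what hammers rand.Reader.
	if _, err := rand.Read(key[:]); err != nil {
		panic(err)
	}
	msg := []byte("hello")
	maskPayload(key, msg) // mask
	maskPayload(key, msg) // unmask
	fmt.Println(string(msg))
}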

@dvyukov
Member

dvyukov commented Dec 23, 2013

Comment 9:

Kelly, how important is it to you? Are you actually using mac for production?

@dvyukov
Member

dvyukov commented Dec 23, 2013

Comment 10:

Thanks, Adam!
There is the same issue on Linux as well. With GOMAXPROCS=16 and 9 seconds of execution
time, there are 550 seconds of blocking in crypto/rand.
I see that websocket reads just 4 bytes of randomness.
From a security point of view, is it a good idea to cache a little bit of /dev/random
output? I.e. read, say, 64 bytes, and then satisfy 4-byte requests from that buffer?
Also I am not sure why we protect reads from /dev/urandom with a mutex?...

Labels changed: added os-linux, performance.
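
(Editorial note: a hedged sketch of the buffering dvyukov suggests. The bufferedRand
type and next4 method are purely illustrative names, not anything in crypto/rand or
x/net/websocket: refill a small buffer occasionally and hand out 4-byte mask keys from
it, so only one call in sixteen touches rand.Reader.)

package main

import (
	"crypto/rand"
	"fmt"
	"sync"
)

// bufferedRand serves small random requests from a 64-byte buffer that is
// refilled from crypto/rand only when it runs out.
type bufferedRand struct {
	mu  sync.Mutex
	buf [64]byte
	off int // bytes already handed out
}

func newBufferedRand() *bufferedRand {
	return &bufferedRand{off: 64} // start "empty" so the first call refills
}

// next4 returns 4 random bytes, touching rand.Reader once per 16 calls.
func (b *bufferedRand) next4() ([4]byte, error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.off+4 > len(b.buf) {
		if _, err := rand.Read(b.buf[:]); err != nil {
			return [4]byte{}, err
		}
		b.off = 0
	}
	var key [4]byte
	copy(key[:], b.buf[b.off:b.off+4])
	b.off += 4
	return key, nil
}

func main() {
	br := newBufferedRand()
	key, err := br.next4()
	if err != nil {
		panic(err)
	}
	fmt.Printf("mask key: %x\n", key)
}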

@gopherbot
Author

Comment 11 by kell.sommers:

Dmitry,
I'm not using a Mac in production. 
As a developer trying to write high-performance services, but with only a few months of
Go experience, this threw me for a loop on my development machine. I'm not that familiar
with Go yet, and I thought maybe I had screwed something up and needed to learn
goroutines and channels properly, because it didn't feel like I was correctly using the
cores on the system.

@gopherbot
Author

Comment 12 by kell.sommers:

How did you create that diagram? That's amazing :)

@dvyukov
Member

dvyukov commented Dec 23, 2013

Comment 13:

Thanks, I see. Makes sense.
If you are running tests, then you just do 'go test -blockprofile=filename' and then "go
tool pprof --web binary.test filename".
It's also exposed via net/http/pprof, but you need to add
"runtime.SetBlockProfileRate(1)" at the beginning of the program.
If you are just running a standalone binary, like in this case, then:
func main() {
    runtime.SetBlockProfileRate(1) // record every blocking event
...
    // before exiting, write out the accumulated block profile
    f, _ := os.Create("/tmp/blockprof")
    pprof.Lookup("block").WriteTo(f, 2)
}
and then "go tool pprof --web binary /tmp/blockprof".

@robpike
Contributor

robpike commented Dec 26, 2013

Comment 14:

Kelly: To create the file, see
$ go help testflag
in particular, the flag -test.blockprofile.

@davecheney
Contributor

Comment 15:

@kelly, it looks like the underlying problem was traced to websockets using a lot of
random entropy and blocking on /dev/random.
What should I set the status of this ticket to? Should it be closed? Is there another
ticket for the underlying issue with the websocket library which this issue is blocked
on?

Status changed to WaitingForReply.

@rsc
Contributor

rsc commented Mar 3, 2014

Comment 16:

I don't understand: websocket is using crypto/rand, which should be reading from
/dev/urandom, not /dev/random. Why is /dev/urandom blocking? Or why is it using
/dev/random instead?

@rsc
Contributor

rsc commented Mar 3, 2014

Comment 17:

Labels changed: added repo-net.

@agl
Contributor

agl commented Mar 3, 2014

Comment 18:

rsc: I think it's hitting lock contention for the buffer in crypto/rand and also hitting
overhead from reading urandom.

@gopherbot
Author

Comment 19 by kell.sommers:

Sorry for the delay in responding! I'm a bit confused about how we should proceed with
this. Haven't the results shown that performance differs drastically across platforms?
I might be missing something :)

@gopherbot
Author

Comment 20 by ledangster:

Looks like the single Reader instance of crypto/rand is causing the issue, due to the
lock being held while reading /dev/urandom (per agl's comment).
Two quick ways this could be solved:
(a) crypto/rand is updated to support multiple concurrent opens of /dev/urandom,
or
(b) websocket accepts a user-provided random source, in which case that random source
can pool file opens of /dev/urandom (see the sketch below).
Either way, I don't think you'll be able to sort it out without support from one of
these two core components.
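
(Editorial note: a hedged sketch of option (b): a user-supplied io.Reader that keeps a
small pool of open /dev/urandom handles so concurrent senders don't serialize on a
single descriptor. The pooledRand type is purely illustrative; x/net/websocket did not
actually accept a caller-provided random source.)

package main

import (
	"fmt"
	"io"
	"os"
)

// pooledRand is an io.Reader backed by several open /dev/urandom handles,
// borrowed and returned through a buffered channel.
type pooledRand struct {
	files chan *os.File
}

func newPooledRand(n int) (*pooledRand, error) {
	p := &pooledRand{files: make(chan *os.File, n)}
	for i := 0; i < n; i++ {
		f, err := os.Open("/dev/urandom")
		if err != nil {
			return nil, err
		}
		p.files <- f
	}
	return p, nil
}

func (p *pooledRand) Read(b []byte) (int, error) {
	f := <-p.files                  // borrow a handle
	defer func() { p.files <- f }() // return it when done
	return io.ReadFull(f, b)
}

func main() {
	r, err := newPooledRand(4)
	if err != nil {
		panic(err)
	}
	key := make([]byte, 4)
	if _, err := r.Read(key); err != nil {
		panic(err)
	}
	fmt.Printf("mask key: %x\n", key)
}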

@gopherbot
Author

Comment 21 by Jens.Alfke:

AFAIK, urandom is a Linux-ism, not POSIX. The man page for urandom on OS X says:
     /dev/urandom is a compatibility nod to Linux. On Linux, /dev/urandom will
     produce lower quality output if the entropy pool drains, while
     /dev/random will prefer to block and wait for additional entropy to be
     collected.  With Yarrow, this choice and distinction is not necessary,
     and the two devices behave identically. You may use either.
So it would seem that Yarrow isn't as fast as whatever PRNG Linux's urandom switches
over to when it runs out of entropy. Makes sense.
I suggest changing the websocket code to seed itself once from the cryptographic RNG
when it opens the socket and use that seed to run a PRNG. (Which is in fact what the
regular srandomdev() and random() library calls do on OS X.)
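
(Editorial note: a minimal sketch of that suggestion, assuming one PRNG per connection
seeded from crypto/rand. Illustrative only; math/rand is not cryptographically strong,
which is exactly the trade-off being proposed.)

package main

import (
	crand "crypto/rand"
	"encoding/binary"
	"fmt"
	mrand "math/rand"
)

func main() {
	// One read from the cryptographic source per connection...
	var seed [8]byte
	if _, err := crand.Read(seed[:]); err != nil {
		panic(err)
	}
	prng := mrand.New(mrand.NewSource(int64(binary.LittleEndian.Uint64(seed[:]))))

	// ...then cheap per-message mask keys from the seeded PRNG.
	var key [4]byte
	binary.LittleEndian.PutUint32(key[:], prng.Uint32())
	fmt.Printf("mask key: %x\n", key)
}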

@mikioh mikioh changed the title go.net/websocket: slowdown on OS X due to rand.Reader x/net/websocket: slowdown on OS X due to rand.Reader Dec 23, 2014
@mikioh mikioh added repo-net and removed repo-net labels Dec 23, 2014
@mikioh mikioh changed the title x/net/websocket: slowdown on OS X due to rand.Reader websocket: slowdown on OS X due to rand.Reader Jan 4, 2015
@rsc rsc removed os-linux labels Apr 10, 2015
@rsc rsc added this to the Unplanned milestone Apr 10, 2015
@rsc rsc changed the title websocket: slowdown on OS X due to rand.Reader x/net/websocket: slowdown on OS X due to rand.Reader Apr 14, 2015
@rsc rsc modified the milestones: Unreleased, Unplanned Apr 14, 2015
@odeke-em
Member

odeke-em commented Jun 6, 2020

Unfortunately we aren't maintaining x/net/websocket anymore; we suggest alternatives at https://godoc.org/golang.org/x/net/websocket.

@odeke-em odeke-em closed this as completed Jun 6, 2020
@golang golang locked and limited conversation to collaborators Jun 6, 2021