
Puzzling performance of a multithreaded program #22995

Closed
jhrabe opened this issue Dec 5, 2017 · 12 comments

@jhrabe commented Dec 5, 2017


What version of Go are you using (go version)?

go1.9.2 linux/amd64

Does this issue reproduce with the latest release?

This is the latest release available on Gentoo linux

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOROOT="/usr/lib64/go"
GOTOOLDIR="/usr/lib64/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="x86_64-pc-linux-gnu-gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build485060439=/tmp/go-build -gno-record-gcc-switches"
CXX="x86_64-pc-linux-gnu-g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"

What did you do?

Below is a greatly simplified program that reproduces the problem on at least 2 somewhat different machines, both with 8 CPUs (I assume this means 4 physical cores). The program is of the "embarrassingly parallel" kind (one loop split into several parallel loops).

I see 2 problems:

  1. Increasing the number of threads (the threads constant, ranging from 1 up to 8) reduces
    performance almost exactly linearly: the more threads, the slower it gets, and quite significantly.

  2. The really funny bit: launching several instances (e.g., 2-4) of this program simultaneously in separate terminals increases the performance of each instance! I'd love to understand why.

The problem goes away when math/rand.Float64() is replaced by thread-local and individually seeded RNGs. Profiling shows lots of time spent in some sync.(*map).

If this issue cannot be easily fixed, I would suggest at least warning users in the documentation for the math/rand package that the top-level functions, even though thread-safe, may be unsuitable for multithreading.


// Monte Carlo, originally a light scattering problem but no longer provides valid results of any kind
package main

import (
    "fmt"
    "math"
    "math/rand"
)

const (
    nShells      = 101
    albedo       = 0.91
    shellsPerMFP = 9.945454 * 1e-4
    photons      = int(1e6)
    threads      = 4
)

func main() {
    var heat = make([]float64, nShells)
    hCh := make(chan []float64) // heat channel (for returning heat arrays)
    fmt.Printf("\nStarting ...\n\n")

    rand.Seed(1)

    // launch a goroutine for each subset of photon bundles
    pt := photons / threads
    for i := 1; i < threads; i++ {
        go photonCalc(pt, hCh)
    }
    go photonCalc(photons-pt*(threads-1), hCh)

    // collect the results from all goroutines, sum them up
    for i := 1; i <= threads; i++ {
        h := <-hCh
        for j := range heat {
            heat[j] += h[j]
        }
        fmt.Printf("A thread finishing\n")
    }

    fmt.Printf("All done.\n\n")
}

func photonCalc(nPhotons int, ch chan<- []float64) {
    var h = make([]float64, nShells) // local heat array
    var x, y, z float64
    var u, v, w, t float64
    var weight float64
    var xi, yi, r2 float64

    fmt.Printf("Starting a thread for %d photon bundles\n", nPhotons)

    for i := 1; i <= nPhotons; i++ {
        x, y, z = 0.0, 0.0, 0.0
        weight = 1.0

        for {
            for {
                xi, yi = 2*rand.Float64()-1, 2*rand.Float64()-1
                if r2 = xi*xi + yi*yi; r2 <= 1 {
                    break
                }
            }
            u = xi * 2.0 * math.Sqrt(1-r2)
            v = yi * 2.0 * math.Sqrt(1-r2)
            w = 2*r2 - 1

            t = -math.Log(rand.Float64() + math.SmallestNonzeroFloat64)
            x += t * u
            y += t * v
            z += t * w

            shell := int(math.Sqrt(x*x+y*y+z*z) * shellsPerMFP)
            if shell > nShells-1 {
                shell = nShells - 1
            }
            h[shell] += (1 - albedo) * weight
            weight *= albedo

            if weight < 0.001 {
                if rand.Float64() > 0.1 {
                    break
                }
                weight /= 0.1
            }
        }
    }
    ch <- h
}


What did you expect to see?

I expected:

  1. Improved performance with more threads.
  2. Diminished performance with more simultaneously running instances.

What did you see instead?

The exact opposite.

@davecheney (Contributor)

Thank you for raising this issue. Unlike many projects on GitHub, the Go project does not use its bug tracker for general discussion or asking questions. We only use our bug tracker for tracking bugs and tracking proposals going through the Proposal Process.

Please see https://golang.org/wiki/Questions for good places to ask questions. They will most likely recommend that you profile your code as a first step towards understanding its runtime properties.

@jhrabe (Author) commented Dec 5, 2017

Thanks for the response. Perhaps my title was chosen badly, but I intended to report a bug in math/rand or the scheduler, not to enter into a general discussion or merely ask a question. Also, I did profile the code and reported the results in my original post.

@bradfitz (Contributor) commented Dec 5, 2017

It's already somewhat documented:

https://golang.org/pkg/math/rand/#pkg-overview

Top-level functions, such as Float64 and Int, use a default shared Source that produces a deterministic sequence of values each time a program is run. Use the Seed function to initialize the default Source if different behavior is required for each run. The default Source is safe for concurrent use by multiple goroutines, but Sources created by NewSource are not.

It doesn't spell it out exactly, but writing high performance concurrent code means limiting sharing, so it kinda says it.

And profiling should also say it.
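
(For reference, a minimal sketch of capturing such a CPU profile with runtime/pprof; the file name cpu.prof and the placement inside main are illustrative, not part of the original program.)

package main

import (
    "log"
    "os"
    "runtime/pprof"
)

func main() {
    // Write a CPU profile for the whole run; inspect it afterwards with
    // `go tool pprof cpu.prof`.
    f, err := os.Create("cpu.prof")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()
    if err := pprof.StartCPUProfile(f); err != nil {
        log.Fatal(err)
    }
    defer pprof.StopCPUProfile()

    // ... run the simulation here ...
}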

@jhrabe (Author) commented Dec 5, 2017

Based on that description, I would expect some reduction of the ideally achievable speedup. But would you expect the following results? I think these would take any user by surprise:

threads    time (real) [s], single instance
1          11.3
2          17.3
4          32.1
8          39.1

instances  time (real) [s], using 2 threads each
1          17.3
2          17.2
3          15.5
4          14.0

@davecheney (Contributor)

Here is the profile for your program

[screenshot of the CPU profile, 2017-12-05]

Your program is spending a large % of its time serialised on the lock around the top-level functions in the math/rand package.

Here is a version of your program which does not serialise on a shared random number generator. https://play.golang.org/p/InIox-An1k
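
(The essence of that change, sketched below under the assumption that each goroutine gets its own *rand.Rand seeded independently; the linked playground version may differ in details such as the seeding scheme and remainder handling.)

package main

import (
    "fmt"
    "math/rand"
    "time"
)

const (
    nShells = 101
    photons = int(1e6)
    threads = 4
)

func main() {
    hCh := make(chan []float64)
    pt := photons / threads // remainder handling omitted for brevity
    for i := 0; i < threads; i++ {
        // Each worker gets its own generator, so no lock is shared.
        // The seed scheme (time plus worker index) is illustrative only.
        rng := rand.New(rand.NewSource(time.Now().UnixNano() + int64(i)))
        go photonCalc(rng, pt, hCh)
    }

    heat := make([]float64, nShells)
    for i := 0; i < threads; i++ {
        h := <-hCh
        for j := range heat {
            heat[j] += h[j]
        }
    }
    fmt.Println("All done.")
}

func photonCalc(rng *rand.Rand, nPhotons int, ch chan<- []float64) {
    h := make([]float64, nShells)
    for i := 0; i < nPhotons; i++ {
        // ... the same inner loop as the original, with every call to
        // rand.Float64() replaced by rng.Float64() ...
        _ = rng.Float64()
    }
    ch <- h
}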

@jhrabe (Author) commented Dec 5, 2017

Yes, that is in agreement with my original report:

("The problem goes away when the math/rand.Float64() is replaced by thread-local and
individually seeded RNGs. Profiling shows lots of time spent in some sync.(*map).")

So I suppose we agree on the cause and a workaround. Because the top-level functions are currently so expensive (several hundred percent of CPU time), I would suggest adding the following sentence to the documentation:

The default Source is safe for concurrent use by multiple goroutines, but Sources created by NewSource are not.
Because the Source synchronization implemented by the top-level functions can incur a significant performance penalty, use thread-local Sources when possible.

@ianlancetaylor (Contributor)

@jhrabe That is too strong. The decision whether to use goroutine-local Sources should normally be based on desired behavior rather than on performance. Perhaps we could say "The default Source uses locks and is safe for...."

@jhrabe (Author) commented Dec 5, 2017

OK, so would you be happy with the following?

The default Source uses locks and is safe for concurrent use by multiple goroutines. If the performance penalty incurred by locking becomes too high, use thread-local Sources created by NewSource. These are not safe for concurrent use.

@bradfitz (Contributor) commented Dec 5, 2017

We never say "thread-local" (or even "threads", generally).

@ianlancetaylor (Contributor)

Personally I don't think it's necessary to spell out the performance penalty in the package comment.

@RLH (Contributor) commented Dec 5, 2017 via email

@jhrabe (Author) commented Dec 5, 2017

Sorry, I am in the process of converting from numerical simulations in C. Very well, no mention of threads and locks then:

The default Source is safe for concurrent use by multiple goroutines. If the performance penalty incurred by synchronization becomes too high, use NewSource to create Sources local to each goroutine. These Sources are not safe for concurrent use.

As for not explicitly mentioning the performance penalty, I suppose that depends on what typical overhead you expect. If the locking added something like 10%, it would not be such a surprise. In this case (see the table I posted previously), the overhead can reach hundreds of percent, making it likely dominant for most Monte Carlo simulations. I would therefore like to see a clear warning. When/if the performance is significantly improved, the warning can be removed.

golang locked and limited conversation to collaborators on Dec 5, 2018