
proposal: testing: add (*PB).NextIndex() (int, bool) method #43030

Closed
chabbimilind opened this issue Dec 6, 2020 · 17 comments

@chabbimilind

What version of Go are you using (go version)?

$ go version
go version go1.15.5 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/xx/Library/Caches/go-build"
GOENV="/Users/xx/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/xx/gocode/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/xx/gocode"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/Cellar/go/1.15.5/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.15.5/libexec/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/xr/hsc7mk1n5glg6vfqb4z5nlkh0000gn/T/go-build663468880=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

testing.PB.Next() only returns a boolean indicating whether there exists another iteration when running parallel benchmarks.
This is insufficient in many situations where one needs to know the current iteration index.
See the Go code in the attached stencil.go.zip file.
Consider the serial elision of the 3-point stencil example, which reads from an input slice in and writes to an output slice out, computing in[i-1] + in[i] + in[i+1] for iteration index i and storing the result in out[i].

func stencilSerial() {
	r := testing.Benchmark(func(b *testing.B) {
		in := make([]int, b.N)
		out := make([]int, b.N)
		for i := 0; i < b.N; i++ {
			in[i] = i
		}

		for i := 0; i < b.N; i++ {
			left := 0
			if i > 0 {
				left = in[i-1]
			}
			right := 0
			if i < b.N-1 {
				right = in[i+1]
			}
			out[i] = left + in[i] + right
		}
	})
	fmt.Printf("\n stencilSerial %v", r)
}

Try to convert it to a parallel benchmark; here is a first, racy version, which is definitely incorrect.
The for i := 0; pb.Next(); i++ loop counter is local to each goroutine, so multiple goroutines will overwrite the same index out[i], and not all indices will be read/updated; hence it is functionally incorrect.
Furthermore, the cache-line ping-ponging will deteriorate the performance of this parallel code (shown at the end of this issue).

func stencilParallelRacy() {
	r := testing.Benchmark(func(b *testing.B) {
		in := make([]int, b.N)
		out := make([]int, b.N)
		for i := 0; i < b.N; i++ {
			in[i] = i
		}
		b.RunParallel(func(pb *testing.PB) {
			for i := 0; pb.Next(); i++ {
				left := 0
				if i > 0 {
					left = in[i-1]
				}
				right := 0
				if i < b.N-1 {
					right = in[i+1]
				}
				// racy update and not all in[i] are updated!
				out[i] = left + in[i] + right
			}
		})
	})
	fmt.Printf("\n stencilParallelRacy %v", r)
}

A somewhat sane parallel version is to create the in and out slices inside each goroutine, as shown below.
However, this benchmark has several problems:

  1. It bloats the memory needs by O(GOMAXPROCS). Moreover, one cannot know the size of the slice to create for each goroutine due to dynamic load balancing.
  2. It is not functionally equivalent to the original because not all indices of the input slice will be read, nor will all indices of the output slice be written.

func stencilParallel() {
	r := testing.Benchmark(func(b *testing.B) {
		b.RunParallel(func(pb *testing.PB) {
			in := make([]int, b.N)
			out := make([]int, b.N)
			for i := 0; i < b.N; i++ {
				in[i] = i
			}

			for i := 0; pb.Next(); i++ {
				left := 0
				if i > 0 {
					left = in[i-1]
				}
				right := 0
				if i < b.N-1 {
					right = in[i+1]
				}
				out[i] = left + in[i] + right
			}
		})
	})
	fmt.Printf("\n stencilParallel %v", r)
}

We need a func (pb *PB) NextIndex() (int, bool) API that returns the iteration index so that each iteration inside the parallel region can take the appropriate action. See the code below, with the proposed NextIndex API in action.

func stencilParallelNextIndex() {
	r := testing.Benchmark(func(b *testing.B) {
		in := make([]int, b.N)
		out := make([]int, b.N)
		for i := 0; i < b.N; i++ {
			in[i] = i
		}
		b.RunParallel(func(pb *testing.PB) {
			for {
				i, ok := pb.NextIndex()
				if !ok {
					break
				}
				left := 0
				if i > 0 {
					left = in[i-1]
				}
				right := 0
				if i < b.N-1 {
					right = in[i+1]
				}
				out[i] = left + in[i] + right
			}
		})
	})
	fmt.Printf("\n stencilParallelNextIndex %v", r)
}

It is worth observing the running time of each of these examples (I have implemented NextIndex locally) when scaling GOMAXPROCS from 1 to 8.

GOMAXPROCS=1 go run stencil.go

 stencilSerial            233812798    6.19 ns/op
 stencilParallelRacy      323430168    5.72 ns/op
 stencilParallel          314360209    3.81 ns/op
 stencilParallelNextIndex 309754171    3.89 ns/op

GOMAXPROCS=2 go run stencil.go

 stencilSerial            234993747    6.34 ns/op
 stencilParallelRacy      437399902    5.94 ns/op
 stencilParallel          432057882    6.89 ns/op
 stencilParallelNextIndex 443609341    2.66 ns/op

GOMAXPROCS=4 go run stencil.go

 stencilSerial            243237640    6.39 ns/op
 stencilParallelRacy      542654800    3.41 ns/op
 stencilParallel          194487830    7.74 ns/op
 stencilParallelNextIndex 572145619    2.10 ns/op

GOMAXPROCS=8 go run stencil.go

 stencilSerial            240228440    6.31 ns/op
 stencilParallelRacy      697957074    4.28 ns/op
 stencilParallel          100000000   15.0  ns/op
 stencilParallelNextIndex 651700406    1.87 ns/op

stencilSerial, as expected, stays put.
stencilParallelRacy has poor scalability due to heavy cache-line conflicts (and, of course, an incorrect result).
stencilParallel has pathetic parallel scaling due to its bloated memory needs and the associated initialization in each goroutine.
stencilParallelNextIndex shows higher (although not perfectly linear) throughput with more GOMAXPROCS and is much more desirable when writing parallel benchmarks.

This is a small change that introduces a new API.

What did you expect to see?

NA

What did you see instead?

NA

@gopherbot gopherbot added this to the Proposal milestone Dec 6, 2020
@ianlancetaylor ianlancetaylor changed the title proposal: testing/benchmark: add func (pb *PB) NextIndex() (int, bool) API proposal: testing: add func (pb *PB) NextIndex() (int, bool) API Dec 6, 2020
@ianlancetaylor ianlancetaylor added this to Incoming in Proposals (old) Dec 6, 2020
@ianlancetaylor ianlancetaylor changed the title proposal: testing: add func (pb *PB) NextIndex() (int, bool) API proposal: testing: add (*PB).NextIndex() (int, bool) method Dec 6, 2020
@ianlancetaylor
Contributor

Can you show sample documentation for the NextIndex method? I don't understand what it returns. Thanks.

Could a particular benchmark solve this problem by using its own local variable with atomic.AddInt32? Could the testing package do that more efficiently in some way?
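
(A minimal sketch of that workaround, assuming the benchmark imports sync/atomic and declares a counter shared by all worker goroutines; the names are illustrative:)

var counter int32 // shared across all worker goroutines

b.RunParallel(func(pb *testing.PB) {
	for pb.Next() {
		// Claim a unique iteration index; indices land in [0, b.N)
		// but in no particular order across goroutines.
		i := int(atomic.AddInt32(&counter, 1)) - 1
		_ = i // index per-iteration data here
	}
})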

@chabbimilind
Author

chabbimilind commented Dec 6, 2020

The documentation for (*PB).NextIndex() (int, bool) can look like the following.

NextIndex reports whether there are more iterations to execute.
If there are more iterations to execute, a unique integer index in the range [0, b.N) is additionally reported.
Due to the concurrent execution of iterations, there is no ordering guarantee for the indices returned by NextIndex().
However, it is guaranteed that every index in [0, b.N) is reported once and only once.
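
(Rendered as a Go doc comment on the proposed method, the above would read:)

// NextIndex reports whether there are more iterations to execute.
// If so, it also reports a unique integer index in the range [0, b.N).
// Because iterations execute concurrently, there is no ordering
// guarantee for the returned indices; however, every index in
// [0, b.N) is reported once and only once.
func (pb *PB) NextIndex() (int, bool)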

There is definitely a performance advantage to adding the NextIndex API in the testing package. It will eliminate the atomic addition on each loop iteration; instead, there will be one atomic add for every PB.grain iterations (https://golang.org/src/testing/benchmark.go?s=20045:20070#L706).

Furthermore, there is a programmatic advantage to keeping it inside the testing package; making each parallel benchmark repeatedly manage its own atomic.AddInt32 is tedious, error-prone, and will make the code look ugly.

@rsc
Contributor

rsc commented Dec 9, 2020

Why is this part of Next? Why not just have a pb.Index that returns the current goroutine's index?

@rsc rsc moved this from Incoming to Active in Proposals (old) Dec 9, 2020
@chabbimilind
Author

@rsc is there a provable superiority (programmatic or performance) of having Index over NextIndex? I see the performance advantage of NextIndex, which both advances the iterator and provides the index. If one wants to access only the index because they call it many times from different parts of the program (although I find that less efficient), Index would be justified.

@antichris

antichris commented Dec 14, 2020

is there a provable superiority (programmatic or performance) of having Index over NextIndex?

b.RunParallel(func(pb *testing.PB) {
	for {
		i, ok := pb.NextIndex()
		if !ok {
			break
		}
		// ...
	}
})

instead of

b.RunParallel(func(pb *testing.PB) {
	for pb.Next() {
		i := pb.Index()
		// ...
	}
})

makes me ask the inverse question: is there a provable superiority (programmatic or performance) of having NextIndex over Index?

I'm unconvinced the loss of readability is justified by having one fewer function call per iteration (a hypothetical gain, considering inlining).

@chabbimilind
Author

chabbimilind commented Dec 15, 2020

@antichris, Index() is an interesting proposal, but I would still like to argue in favor of NextIndex() for the following two reasons.

  1. Usability:
    pb.Index() is not a standalone API, unlike how you use it in your second example. pb.Index() is intricately coupled with what happened at pb.Next(). pb.Index() cannot return an index when pb.Next() has not been called even once by the calling goroutine, nor when pb.Next() has crossed the last iteration. Hence, the API should look more like (*PB) Index() (int, bool). With this signature, its usage will not look very different from NextIndex(); granted, you can ignore the second return value when the call is inside the same loop where pb.Next() was performed, but in full generality, calls to Index have to check it. So, in summary, the readability benefits of Index() are not significant.

  2. Performance:
    Let's not speculate about performance by assuming inlining will happen or that the compiler is omniscient. Look at the code below, which exercises both the NextIndex() and Index() variants. On my iMac with 8 cores (2x SMT), there is a 32% performance loss when using Next+Index compared with NextIndex. Your mileage may vary.

$ ./Index 
benchIndex      267	   3771623 ns/op
benchNextIndex      372	   2846961 ns/op

package main

import (
	"fmt"
	"testing"
)

func benchNextIndex() {
	r := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < 100000; i++ {
			b.RunParallel(func(pb *testing.PB) {
				sum := uint64(0)
				for {
					v, ok := pb.NextIndex()
					if !ok {
						break
					}
					sum += v
				}
			})
		}
	})
	fmt.Printf("benchNextIndex %v\n", r)
}

func benchIndex() {
	r := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < 100000; i++ {
			b.RunParallel(func(pb *testing.PB) {
				sum := uint64(0)
				for pb.Next() {
					v, _ := pb.Index() // Comment this line to gain 32% performance
					sum += v
				}
			})
		}
	})
	fmt.Printf("benchIndex %v\n", r)
}

func main() {
	benchIndex()
	benchNextIndex()
}

Of course, this raises the question of how I implemented these functions. The code snippets below should give you an idea:

// A PB is used by RunParallel for running parallel benchmarks.
type PB struct {
        globalN *uint64 // shared between all worker goroutines iteration counter
        grain   uint64  // acquire that many iterations from globalN at once
        cache   uint64  // local cache of acquired iterations
        bN      uint64  // total number of iterations to execute (b.N)
        where   uint64  // last n we got <== added this field
}

// Next reports whether there are more iterations to execute.
func (pb *PB) Next() bool {
        if pb.cache == 0 {
                n := atomic.AddUint64(pb.globalN, pb.grain)
                if n <= pb.bN {
                        pb.cache = pb.grain
                } else if n < pb.bN+pb.grain {
                        pb.cache = pb.bN + pb.grain - n
                } else {
                        return false
                }
                pb.where = n - pb.grain // <== added this line
        }
        pb.where++ // <== added this line
        pb.cache--
        return true
}

// Index provides the current index.
func (pb *PB) Index() (uint64, bool) {
        if pb.where == 0 {
                return 0, false
        }
        if pb.where > pb.bN {
                return 0, false
        }
        return pb.where - 1, true
}

// NextIndex provides the next index and reports whether there are more iterations to execute.
func (pb *PB) NextIndex() (uint64, bool) {
        if pb.cache == 0 {
                n := atomic.AddUint64(pb.globalN, pb.grain)
                if n <= pb.bN {
                        pb.cache = pb.grain
                } else if n < pb.bN+pb.grain {
                        pb.cache = pb.bN + pb.grain - n
                } else {
                        return 0, false
                }
                pb.where = n - pb.grain
        }
        retVal := pb.where
        pb.where++
        pb.cache--
        return retVal, true
}

@antichris

antichris commented Dec 15, 2020

@chabbimilind, with your code unaltered, I get an average

name       time/op
Index      1.64ms ± 1%
NextIndex  1.61ms ± 1%

over 50 runs, which amounts to at most a 2% performance loss on my machine. There must be something wrong with yours.

If Index is reduced to just return the goroutine's index,

func (pb *PB) Index() uint64 { return pb.where - 1 }

there is no performance loss at all; there's actually even a 0.42% gain, according to benchstat (I had to rename both benchmarks to match so that they could be compared):

name         old time/op  new time/op  delta
(Next)Index  1.61ms ± 1%  1.61ms ± 1%  -0.42%  (p=0.000 n=48+50)

That might be due to elimination of an XCHGL AX, AX at 0x000f in benchNextIndex.func1.1 which is replaced by a NOP in benchIndex.func1.1, but I suspect that's a more or less architecture-specific optimization together with the inlining that happened to both functions (just as I speculated earlier).

This, I believe, is the way pb.Index() is intended (when @rsc suggested it) to be used — always exclusively after a pb.Next() call:

// ...
for pb.Next() {
	i := pb.Index()
	// use i
}
// ...

The removal of those checks (and their overhead) from your version might make it less safe to use improperly: it would return -1 before the first call to Next, and the last index after Next has started returning false, which might not be what one expects to get. Although, I think, one should not expect anything great from calling it outside of the appropriate context (which is only when Next has returned true).

Even you said so yourself:

pb.Index() is not a standalone API, unlike how you use it in your second example. pb.Index() is intricately coupled with what happened at pb.Next().

Looking at the code sample I suppose you were referring to (which is partially reproduced just above here), I can't really understand what you mean by "unlike how I use", though.

Your mileage may vary.

Oh, it did, my friend. Indeed it did.

@chabbimilind
Author

@antichris how many threads (CPU cores) does your machine have?

Are you proposing returning -1 on failure from pb.Index()? If that is the desired way Go standard APIs work, fine!
I have recorded the fact that there can be noticeable performance losses. I understand the focus of Golang is not high performance.

@antichris

how many threads (CPU cores) does your machine have?

4 cores, 8 threads, according to the manufacturer's datasheet.

I think I know where you're headed with this, though. That's why I rewrote your benchmark code as "proper" Go benchmarks, so that they can be executed with all the wonderful features that test flags provide, including (but not limited to) ergonomic control over the run count and GOMAXPROCS.

package main_test

import "testing"

func BenchmarkNextIndex(b *testing.B) {
	for i := 0; i < 100000; i++ {
		b.RunParallel(func(pb *testing.PB) {
			sum := uint64(0)
			for {
				v, ok := pb.NextIndex()
				if !ok {
					break
				}
				sum += v
			}
		})
	}
}

func BenchmarkIndex(b *testing.B) {
	for i := 0; i < 100000; i++ {
		b.RunParallel(func(pb *testing.PB) {
			sum := uint64(0)
			for pb.Next() {
				v, _ := pb.Index()
				// v := pb.Index() // Use this line instead of the above
				                   // when a single value is returned.
				sum += v
			}
		})
	}
}

The invocation I used for running the benchmarks:

go test -bench=. -cpu=1,2,4,8 -count=10 | tee results

I did not want to babysit my machine (it has a tendency toward a sharp performance drop when putting the screen to sleep or waking it up) for more than half an hour waiting for 400 runs to complete, hence the -count=10. Feel free to experiment with larger values.

The benchmark results were renamed to matching ones when comparing/summarizing with benchstat, using BenchmarkNextIndex as the baseline for each comparison.

  • The results for your original implementation of Index:


     name           old time/op  new time/op  delta
     (Next)Index     827µs ± 1%   826µs ± 1%    ~     (p=0.730 n=9+9)
     (Next)Index-2  1.04ms ± 1%  1.05ms ± 2%  +1.16%  (p=0.004 n=10+9)
     (Next)Index-4  1.36ms ± 1%  1.38ms ± 3%  +1.81%  (p=0.010 n=9+10)
     (Next)Index-8  1.61ms ± 1%  1.63ms ± 0%  +1.06%  (p=0.000 n=10+8)
    
     goos: linux
     goarch: amd64
     pkg: testing/index
     BenchmarkNextIndex     	    1365	    822306 ns/op
     BenchmarkNextIndex     	    1238	    827097 ns/op
     BenchmarkNextIndex     	    1250	    826455 ns/op
     BenchmarkNextIndex     	    1256	    825291 ns/op
     BenchmarkNextIndex     	    1248	    829401 ns/op
     BenchmarkNextIndex     	    1274	    831540 ns/op
     BenchmarkNextIndex     	    1264	    824213 ns/op
     BenchmarkNextIndex     	    1262	    828072 ns/op
     BenchmarkNextIndex     	    1260	    827564 ns/op
     BenchmarkNextIndex     	    1262	    848524 ns/op
     BenchmarkNextIndex-2   	    1072	   1040714 ns/op
     BenchmarkNextIndex-2   	    1126	   1038666 ns/op
     BenchmarkNextIndex-2   	    1105	   1044023 ns/op
     BenchmarkNextIndex-2   	    1108	   1024421 ns/op
     BenchmarkNextIndex-2   	    1119	   1033484 ns/op
     BenchmarkNextIndex-2   	    1076	   1044225 ns/op
     BenchmarkNextIndex-2   	    1105	   1040204 ns/op
     BenchmarkNextIndex-2   	    1093	   1029051 ns/op
     BenchmarkNextIndex-2   	    1122	   1041408 ns/op
     BenchmarkNextIndex-2   	    1106	   1040794 ns/op
     BenchmarkNextIndex-4   	     906	   1372349 ns/op
     BenchmarkNextIndex-4   	     903	   1365971 ns/op
     BenchmarkNextIndex-4   	     895	   1349072 ns/op
     BenchmarkNextIndex-4   	     895	   1357245 ns/op
     BenchmarkNextIndex-4   	     901	   1359918 ns/op
     BenchmarkNextIndex-4   	     913	   1362478 ns/op
     BenchmarkNextIndex-4   	     906	   1358369 ns/op
     BenchmarkNextIndex-4   	     896	   1359996 ns/op
     BenchmarkNextIndex-4   	     904	   1403334 ns/op
     BenchmarkNextIndex-4   	     892	   1347904 ns/op
     BenchmarkNextIndex-8   	     697	   1618363 ns/op
     BenchmarkNextIndex-8   	     697	   1610496 ns/op
     BenchmarkNextIndex-8   	     698	   1610698 ns/op
     BenchmarkNextIndex-8   	     704	   1609407 ns/op
     BenchmarkNextIndex-8   	     702	   1595361 ns/op
     BenchmarkNextIndex-8   	     697	   1615093 ns/op
     BenchmarkNextIndex-8   	     696	   1618305 ns/op
     BenchmarkNextIndex-8   	     692	   1607051 ns/op
     BenchmarkNextIndex-8   	     708	   1613780 ns/op
     BenchmarkNextIndex-8   	     690	   1597287 ns/op
     BenchmarkIndex         	    1375	    816488 ns/op
     BenchmarkIndex         	    1239	    825060 ns/op
     BenchmarkIndex         	    1274	    822574 ns/op
     BenchmarkIndex         	    1218	    832834 ns/op
     BenchmarkIndex         	    1275	    823519 ns/op
     BenchmarkIndex         	    1270	    841124 ns/op
     BenchmarkIndex         	    1261	    828743 ns/op
     BenchmarkIndex         	    1270	    828936 ns/op
     BenchmarkIndex         	    1254	    823869 ns/op
     BenchmarkIndex         	    1263	    830012 ns/op
     BenchmarkIndex-2       	    1094	   1029340 ns/op
     BenchmarkIndex-2       	    1090	   1018124 ns/op
     BenchmarkIndex-2       	    1105	   1047049 ns/op
     BenchmarkIndex-2       	    1114	   1049284 ns/op
     BenchmarkIndex-2       	    1129	   1059698 ns/op
     BenchmarkIndex-2       	    1095	   1044032 ns/op
     BenchmarkIndex-2       	    1111	   1042077 ns/op
     BenchmarkIndex-2       	    1093	   1045498 ns/op
     BenchmarkIndex-2       	    1138	   1055870 ns/op
     BenchmarkIndex-2       	    1132	   1074446 ns/op
     BenchmarkIndex-4       	     900	   1408361 ns/op
     BenchmarkIndex-4       	     898	   1378696 ns/op
     BenchmarkIndex-4       	     908	   1397115 ns/op
     BenchmarkIndex-4       	     889	   1373676 ns/op
     BenchmarkIndex-4       	     901	   1404665 ns/op
     BenchmarkIndex-4       	     879	   1344952 ns/op
     BenchmarkIndex-4       	     898	   1402569 ns/op
     BenchmarkIndex-4       	     890	   1358545 ns/op
     BenchmarkIndex-4       	     895	   1372714 ns/op
     BenchmarkIndex-4       	     889	   1397904 ns/op
     BenchmarkIndex-8       	     694	   1631860 ns/op
     BenchmarkIndex-8       	     694	   1627596 ns/op
     BenchmarkIndex-8       	     684	   1622259 ns/op
     BenchmarkIndex-8       	     693	   1627623 ns/op
     BenchmarkIndex-8       	     697	   1624575 ns/op
     BenchmarkIndex-8       	     682	   1627252 ns/op
     BenchmarkIndex-8       	     691	   1608856 ns/op
     BenchmarkIndex-8       	     700	   1626498 ns/op
     BenchmarkIndex-8       	     680	   1637182 ns/op
     BenchmarkIndex-8       	     693	   1625829 ns/op
     PASS
     ok  	testing/index	212.816s
    
  • Running with Index reduced to a single return:


     name           old time/op  new time/op  delta
     (Next)Index     808µs ± 1%   810µs ± 1%    ~     (p=0.278 n=9+10)
     (Next)Index-2  1.02ms ± 1%  1.03ms ± 2%  +1.23%  (p=0.035 n=10+10)
     (Next)Index-4  1.38ms ± 1%  1.38ms ± 1%    ~     (p=0.315 n=10+10)
     (Next)Index-8  1.61ms ± 1%  1.61ms ± 1%    ~     (p=0.853 n=10+10)
    
     goos: linux
     goarch: amd64
     pkg: testing/index
     BenchmarkNextIndex 	        1185	    845181 ns/op
     BenchmarkNextIndex 	        1378	    800164 ns/op
     BenchmarkNextIndex 	        1309	    802616 ns/op
     BenchmarkNextIndex 	        1274	    804395 ns/op
     BenchmarkNextIndex 	        1286	    805231 ns/op
     BenchmarkNextIndex 	        1258	    808409 ns/op
     BenchmarkNextIndex 	        1296	    814793 ns/op
     BenchmarkNextIndex 	        1244	    811059 ns/op
     BenchmarkNextIndex 	        1298	    805301 ns/op
     BenchmarkNextIndex 	        1246	    818426 ns/op
     BenchmarkNextIndex-2   	    1114	   1023248 ns/op
     BenchmarkNextIndex-2   	    1106	   1027178 ns/op
     BenchmarkNextIndex-2   	    1118	   1014118 ns/op
     BenchmarkNextIndex-2   	    1104	   1016580 ns/op
     BenchmarkNextIndex-2   	    1135	   1011715 ns/op
     BenchmarkNextIndex-2   	    1122	   1011229 ns/op
     BenchmarkNextIndex-2   	    1108	   1019155 ns/op
     BenchmarkNextIndex-2   	    1130	   1029966 ns/op
     BenchmarkNextIndex-2   	    1131	   1028267 ns/op
     BenchmarkNextIndex-2   	    1130	   1013857 ns/op
     BenchmarkNextIndex-4   	     912	   1384737 ns/op
     BenchmarkNextIndex-4   	     901	   1374862 ns/op
     BenchmarkNextIndex-4   	     913	   1374009 ns/op
     BenchmarkNextIndex-4   	     897	   1366385 ns/op
     BenchmarkNextIndex-4   	     907	   1386615 ns/op
     BenchmarkNextIndex-4   	     902	   1378177 ns/op
     BenchmarkNextIndex-4   	     913	   1389875 ns/op
     BenchmarkNextIndex-4   	     907	   1399369 ns/op
     BenchmarkNextIndex-4   	     904	   1379526 ns/op
     BenchmarkNextIndex-4   	     908	   1387491 ns/op
     BenchmarkNextIndex-8   	     702	   1613103 ns/op
     BenchmarkNextIndex-8   	     703	   1597941 ns/op
     BenchmarkNextIndex-8   	     696	   1608357 ns/op
     BenchmarkNextIndex-8   	     704	   1605088 ns/op
     BenchmarkNextIndex-8   	     708	   1610483 ns/op
     BenchmarkNextIndex-8   	     705	   1607491 ns/op
     BenchmarkNextIndex-8   	     709	   1593077 ns/op
     BenchmarkNextIndex-8   	     702	   1611638 ns/op
     BenchmarkNextIndex-8   	     699	   1607134 ns/op
     BenchmarkNextIndex-8   	     709	   1601878 ns/op
     BenchmarkIndex     	        1281	    811523 ns/op
     BenchmarkIndex     	        1287	    807519 ns/op
     BenchmarkIndex     	        1292	    808608 ns/op
     BenchmarkIndex     	        1278	    812850 ns/op
     BenchmarkIndex     	        1257	    812999 ns/op
     BenchmarkIndex     	        1292	    811811 ns/op
     BenchmarkIndex     	        1268	    804930 ns/op
     BenchmarkIndex     	        1312	    807247 ns/op
     BenchmarkIndex     	        1262	    808766 ns/op
     BenchmarkIndex     	        1280	    808776 ns/op
     BenchmarkIndex-2       	    1100	   1022728 ns/op
     BenchmarkIndex-2       	    1111	   1016110 ns/op
     BenchmarkIndex-2       	    1144	   1036095 ns/op
     BenchmarkIndex-2       	    1128	   1013398 ns/op
     BenchmarkIndex-2       	    1131	   1021993 ns/op
     BenchmarkIndex-2       	    1142	   1052461 ns/op
     BenchmarkIndex-2       	    1137	   1036276 ns/op
     BenchmarkIndex-2       	    1131	   1050766 ns/op
     BenchmarkIndex-2       	    1142	   1034039 ns/op
     BenchmarkIndex-2       	    1124	   1036667 ns/op
     BenchmarkIndex-4       	     908	   1373611 ns/op
     BenchmarkIndex-4       	     909	   1371849 ns/op
     BenchmarkIndex-4       	     904	   1374661 ns/op
     BenchmarkIndex-4       	     902	   1362660 ns/op
     BenchmarkIndex-4       	     915	   1371153 ns/op
     BenchmarkIndex-4       	     904	   1380963 ns/op
     BenchmarkIndex-4       	     908	   1381588 ns/op
     BenchmarkIndex-4       	     904	   1386198 ns/op
     BenchmarkIndex-4       	     903	   1390065 ns/op
     BenchmarkIndex-4       	     878	   1385323 ns/op
     BenchmarkIndex-8       	     688	   1611506 ns/op
     BenchmarkIndex-8       	     700	   1606052 ns/op
     BenchmarkIndex-8       	     704	   1610623 ns/op
     BenchmarkIndex-8       	     706	   1604036 ns/op
     BenchmarkIndex-8       	     693	   1618573 ns/op
     BenchmarkIndex-8       	     705	   1596077 ns/op
     BenchmarkIndex-8       	     681	   1629448 ns/op
     BenchmarkIndex-8       	     705	   1596526 ns/op
     BenchmarkIndex-8       	     699	   1594360 ns/op
     BenchmarkIndex-8       	     706	   1597677 ns/op
     PASS
     ok  	testing/index	215.121s
    

I have recorded the fact that there can be noticeable performance losses.

Those "noticeable performance losses" seem to be very slight in general, and practically non-existent in the case of the simplified, single value return implementation of Index.

Could you please be so kind as to run some benchmarks yourself (preferably with -count substantially more than 1) to confirm that your original result was not just a one time fluke/malfunction of your machine?

Otherwise, based on my experience with benchmarking this issue, I believe it is only a fair assumption to be made, that you cherry-picked your 32-percent-performance-loss result in bad faith, and, having grown so attached to the design you originally proposed, you are unwilling to concede it for a potentially more sensible and versatile one.

I have a question for @rsc, though: what about the fact that the Next+Index split API requires changes to Next to increment the goroutine index, changes that are very likely to have an impact on the performance of all existing benchmarks that currently rely on it?

Are you proposing returning -1 on failure from pb.Index()? If that is the desired way Go standard APIs work...

No, not really. I believe failure, in this case, is something one would have checked for via a Next call. If that has returned false, there is no point in calling Index. You yourself explicitly stated that a call to Index on its own does not make sense; the API you originally proposed does not even provide such a function to call. And over the past few posts in this thread we have come to the conclusion that there is, after all, potential for an inefficient implementation to have a negative impact on performance.

... fine!

Da-yumn, @chabbimilind! If that ain't what they call "butthurt", I sincerely have no clue what was the impression you were trying to make with that. 😜

Just breathe, friend, relax and focus: having your proposal revised and improved by the gopher community is not the worst thing that could have happened to you today, nor is it a damning statement about your qualities as an individual and a member of society.

@rsc
Contributor

rsc commented Dec 16, 2020

is there a provable superiority (programmatic or performance) of having NextIndex over Index?

There's already a Next, so adding just Index avoids having two different Next operations.

@rsc
Contributor

rsc commented Dec 16, 2020

It is also obviously possible to implement these in such a way that there is no performance difference:

func (pb *PB) Next() bool {
    var ok bool
    ok, pb.index = pb.nextIndex()
    return ok
}

func (pb *PB) Index() int {
    return pb.index
}

So I'm not placing much weight in the benchmarks showing a difference.

@rsc
Contributor

rsc commented Dec 16, 2020

What I don't see here is much indication that this use of the extra index is something that is generally advisable in parallel benchmarks. My understanding of the parallel benchmark functionality was to be able to measure how fast it would be to run b.N iterations in parallel goroutines, keeping all the goroutines busy even when there is obviously some variation in how quickly each executes, so that you can't just hand out b.N/#goroutines to each one.

The more I think about this, the more I think we've gotten derailed by performance and API and have failed to recognize that adding either Index or NextIndex would be a mistake. If the work done for iteration i is not interchangeable with the work done for iteration j, that's not a valid benchmark at all.

My first objection to the stencilParallel example is that it probably matters which indexes a goroutine is writing. If pb.Next is doing some kind of fine-grained iteration where it splits the work across G goroutines and hands out k, k+G, k+2G, k+3G in a single goroutine until we get close to the end, then all the different goroutines are going to have cache collisions writing to nearby entries in "out". On the other hand, if pb.Next works on blocks and hands out mostly contiguous indexes to particular goroutines, then there will be few cache collisions. The reported speed of a benchmarked function should clearly not depend on the exact iteration order returned by package testing. And of course the way to make that happen is to hide the iteration order completely, as we do now.

But my second, more serious objection is that the stencilSerial example is not even a correct single-threaded benchmark. It's really important that b.N only change the number of times the code runs, not what code runs. In particular, it is a mistake to change the size of the data being processed based on b.N, because cache behavior changes non-linearly with data size, which then invalidates "divide total time by b.N to find per-operation time". When the data size is changing too, per-operation time is not actually a meaningful value - it's not constant.
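
(For concreteness, a minimal sketch of the correct pattern, in a _test.go file with "testing" imported; the benchmark name and size constant are illustrative assumptions, and b.N controls only how many times the whole operation runs:)

func BenchmarkStencilFixedSize(b *testing.B) {
	const size = 1 << 16 // assumed fixed data size, independent of b.N
	in := make([]int, size)
	out := make([]int, size)
	for i := range in {
		in[i] = i
	}
	b.ResetTimer()
	// b.N repetitions of the same whole operation on fixed-size data.
	for n := 0; n < b.N; n++ {
		for i := 0; i < size; i++ {
			left, right := 0, 0
			if i > 0 {
				left = in[i-1]
			}
			if i < size-1 {
				right = in[i+1]
			}
			out[i] = left + in[i] + right
		}
	}
}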

I think if we were going to adopt Index as an API, we would need a compelling, correct example. We don't have one.

@rsc
Contributor

rsc commented Dec 16, 2020

/cc @dvyukov

@dvyukov
Member

dvyukov commented Dec 17, 2020

There is something I don't fully understand in the problem statement: for parallel algorithms, don't we want to measure the production parallelization rather than some random scheme provided by the testing package? Any parallel speed-up we get does not seem useful, because the actual production code does not have that parallelization. Perhaps you want some parallel-algorithm support package that can be used in production code and then measured in benchmarks.

RunParallel is not intended for parallel algorithms; it is intended for measuring N logically independent copies of an operation to check how well that scales.

The proposed NextIndex scheme also seems to imply a dependence of the data size on the number of iterations. That's generally the wrong thing to do because of the associated non-linear effects; e.g., a faster version may trigger a 10x larger data size and end up looking slower. A better scheme is to do the whole large operation b.N times on a data set of the target size (or sizes).

But, per se, unique index assignment can be done with pb.Next as follows:

var index uint64
b.RunParallel(func(pb *testing.PB) {
	for pb.Next() {
		const batch = 42
		myIndex := atomic.AddUint64(&index, 1) - 1
		for i := myIndex * batch; i < (myIndex+1)*batch; i++ {
			// ...
		}
	}
})

@rsc
Contributor

rsc commented Jan 6, 2021

Based on the discussion above, this proposal seems like a likely decline.
— rsc for the proposal review group

@rsc rsc moved this from Active to Likely Decline in Proposals (old) Jan 6, 2021
@rsc rsc moved this from Likely Decline to Declined in Proposals (old) Jan 13, 2021
@rsc
Contributor

rsc commented Jan 13, 2021

No change in consensus, so declined.
— rsc for the proposal review group

@rsc rsc closed this as completed Jan 13, 2021
@golang golang locked and limited conversation to collaborators Jan 13, 2022