bytes: IndexByte performs poorly on short lookups #21759

Closed
sirkon opened this issue Sep 5, 2017 · 7 comments

sirkon commented Sep 5, 2017

What version of Go are you using (go version)?

1.9

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/sasha/go"
GORACE=""
GOROOT="/usr/lib/go-1.9"
GOTOOLDIR="/usr/lib/go-1.9/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build547759744=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"

What did you do?

Benchmarking bytes.IndexByte vs. a plain loop-based byte lookup:

package main

import (
	"bytes"
	"strings"
	"testing"
)

const position = 3
const repeat = 10000

var data = []byte(strings.Repeat("1", position) + "\t" + "dsahfksdfhkdsj")

func BenchmarkIndexByte(b *testing.B) {
	for n := 0; n < b.N; n++ {
		for i := 0; i < repeat; i++ {
			if pos := bytes.IndexByte(data, '\t'); pos != position {
				b.Fatalf("Wrong position %d, %d expected", pos, position)
			}
		}
	}
}

func BenchmarkLoop(b *testing.B) {
	for n := 0; n < b.N; n++ {
		for i := 0; i < repeat; i++ {
			pos := -1
			for j, c := range data {
				if c == '\t' {
					pos = j
					break
				}
			}
			if pos != position {
				b.Fatalf("Wrong position %d, %d expected", pos, position)
			}
		}
	}
}

What did you expect to see?

BenchmarkIndexByte should be faster or at least close to BenchmarkLoop for any delimiter position.

What did you see instead?

bytes.IndexByte can be several times slower when the delimiter is close to the start of the data (the position parameter):

  1. position = 0
    BenchmarkIndexByte-4   	   50000	     24888 ns/op
    BenchmarkLoop-4        	  200000	      7421 ns/op

  2. position = 2
    BenchmarkIndexByte-4   	   50000	     27334 ns/op
    BenchmarkLoop-4        	  100000	     13582 ns/op


This matters because we use IndexByte for log parsing via a code generator, and a good share of the pieces of data we look for are short.

It would be great if bytes.IndexByte were faster on short lookups.

dsnet changed the title from “Underwhelming performance of bytes.IndexByte at short lookups” to “bytes: IndexByte performs poorly on short lookups” Sep 5, 2017
dsnet (Member) commented Sep 5, 2017

\cc @martisch

agnivade (Contributor) commented Sep 5, 2017

Interesting.

bytes.IndexByte is written in assembly. It uses AVX2 instructions if the data length is greater than 32 bytes; otherwise it falls back to SSE instructions:

CMPQ BX, $32
JA avx2

Could it be that, when the data length is small, the compiler generates faster assembly for the loop than the hand-written version? It would be interesting to look at the assembly generated for the loop code and compare the two.
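
One way to make that comparison (a sketch; the output file name is arbitrary, and the symbol pattern assumes the benchmark file shown above):

go test -c -o bench.test
go tool objdump -s BenchmarkLoop bench.test

go test -c compiles the test binary without running it, and go tool objdump -s disassembles the symbols matching the given regexp, so the compiler-generated loop can be read side by side with the hand-written assembly.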

martisch (Contributor) commented Sep 5, 2017

“BenchmarkIndexByte should be faster or at least close to BenchmarkLoop for any delimiter position.”

If bytes.IndexByte were replaced with an inlined range loop everywhere, the above performance request would be fulfilled, but it would make searches over large data sizes a lot slower than the current version.

I do not see how a generic implementation of bytes.IndexByte can be written that is (near) optimal for every length and distribution of match positions when those characteristics are not known at compile time.

A solution that performs well across a variety of lengths at runtime is complex, and therefore will not be inlined, to avoid bloating binaries and filling the caches. That means a tradeoff: there is always call overhead, plus the overhead of deciding which code path to use (based e.g. on data length or supported instructions), which makes it suboptimal for the short lengths and early match positions that require little work.

The range loop in the example avoids that, since profiling the application has shown that the workload mostly consists of small data lengths and early matches. The inlined range loop is simple enough that I do not think it warrants adding e.g. an IndexByteShort to the package. Writing a fast IndexByte for long strings, by contrast, is considerably more complex.

It could be that the current version can be tuned a bit for small data sizes. To find out, we need to benchmark how much room for optimization there is between an indexByte implemented as a non-inlined range loop and one that additionally checks for e.g. AVX2 support and data length.

Such benchmarks also need to make sure the range loop is not simply better aligned in one benchmark than in the other. Another kind of skew that needs to be carefully avoided: in a simple loop with the match always at the same position, the branch predictor on modern CPUs can learn the branch pattern, which is unlikely to reflect a common real-world case. (If it were common, it would be even faster to unroll the loop and optimize further for the specific problem of the match usually being at position 3.)
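
For illustration, a benchmark along those lines that cycles through several match positions (a sketch; the position range and test data are arbitrary choices, and it reuses the bytes, strings, and testing imports from the benchmark file above):

func BenchmarkIndexByteVariedPos(b *testing.B) {
	// Pre-generate inputs with the '\t' at varying positions so the
	// branch predictor cannot lock onto a single fixed pattern.
	var inputs [][]byte
	for pos := 0; pos < 8; pos++ {
		inputs = append(inputs, []byte(strings.Repeat("1", pos)+"\t"+"dsahfksdfhkdsj"))
	}
	b.ResetTimer()
	for n := 0; n < b.N; n++ {
		data := inputs[n%len(inputs)]
		if bytes.IndexByte(data, '\t') < 0 {
			b.Fatal("delimiter not found")
		}
	}
}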

@sirkon what platform and CPU do you use for your benchmarks/production?

ghost commented Sep 5, 2017

bytes.IndexByte is not inlined on amd64 while the range loop is.
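
One way to confirm this is to ask the compiler to report its optimization decisions (the grep pattern is just for illustration):

go test -gcflags=-m 2>&1 | grep -i inlin

-gcflags=-m makes the compiler print, among other things, which functions and calls it inlines.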

sirkon (Author) commented Sep 5, 2017

@martisch

vendor_id	: GenuineIntel
cpu family	: 6
model		: 158
model name	: Intel(R) Core(TM) i5-7600 CPU @ 3.50GHz
stepping	: 9
microcode	: 0x3a
cpu MHz		: 799.941
cache size	: 6144 KB
physical id	: 0
siblings	: 4
core id		: 1
cpu cores	: 4
apicid		: 2
initial apicid	: 2
fpu		: yes
fpu_exception	: yes
cpuid level	: 22
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs		:
bogomips	: 7008.18
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:

Some more info:

  1. position = 5
    BenchmarkIndexByte-4   	   50000	     27310 ns/op
    BenchmarkLoop-4        	  100000	     22110 ns/op
  2. position = 6
    BenchmarkIndexByte-4   	   50000	     28276 ns/op
    BenchmarkLoop-4        	   50000	     36982 ns/op

I.e. at position 5 the current implementation is fast enough, and at position 6 it is already faster. I just thought there might be a cheap way to check whether the first 4 bytes contain the needed character: if so, some sort of short loop would find it; otherwise the generic algorithm would be used. The check and the “short” lookup could be inlined. A sketch of the idea is below.
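
For example (a hypothetical indexByteShort, not a proposal for the actual assembly; the 4-byte cutoff is an assumption):

func indexByteShort(data []byte, ch byte) int {
	// Cheap inline scan of a short prefix.
	n := len(data)
	if n > 4 {
		n = 4
	}
	for i := 0; i < n; i++ {
		if data[i] == ch {
			return i
		}
	}
	// Fall back to the generic implementation for the rest.
	if len(data) > 4 {
		if pos := bytes.IndexByte(data[4:], ch); pos >= 0 {
			return pos + 4
		}
	}
	return -1
}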

martisch (Contributor) commented Sep 5, 2017

Inlining a short lookup is not cost-free: binary size would increase, causing more icache pressure, and it would slow down lookups where the match is further away.

Also, in the current state of the compiler (no mid-stack inlining), it complicates the compiler and makes the bytes package function a special case, since the intrinsic optimization would not be applied when the code is copied from the bytes package to somewhere else.

I think that when the match is known to usually be found early, the range-loop pre-check optimization is good and easy to make explicit in the code, once profiling determines it matters at that call site; see the sketch below.
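
For example, at a call site where profiling shows the delimiter is almost always within the first few bytes (parseField and the field layout are invented for illustration):

func parseField(line []byte) ([]byte, []byte) {
	// Plain range loop the compiler can inline; profiling showed the
	// tab is usually near the start, so this avoids the call overhead
	// of the assembly implementation.
	for j, c := range line {
		if c == '\t' {
			return line[:j], line[j+1:]
		}
	}
	return line, nil
}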

If there are improvements left that would make small lookups a bit faster, either in the code of the index function or in its generated instructions, that would be great to find out.

randall77 (Contributor) commented Sep 5, 2017

When I remove the inlining, the difference goes away:

func BenchmarkLoopNoInline(b *testing.B) {
	for n := 0; n < b.N; n++ {
		for i := 0; i < repeat; i++ {
			pos := indexByte(data, '\t')
			if pos != position {
				b.Fatalf("Wrong position %d, %d expected", pos, position)
			}
		}
	}
}

//go:noinline
func indexByte(data []byte, ch byte) int {
	pos := -1
	for j, c := range data {
		if c == ch {
			pos = j
			break
		}
	}
	return pos
}

So I think we're mostly measuring call overhead here. I don't see any real opportunity for making the assembly better (but please, prove me wrong if you see something).

sirkon closed this as completed Sep 12, 2017
golang locked and limited conversation to collaborators Sep 12, 2017