
bytes: IndexAny is slower than a naive implementation in some cases #60550

Closed
kpym opened this issue Jun 1, 2023 · 15 comments

Labels: FeatureRequest, NeedsInvestigation, Performance
Milestone: Backlog

Comments

@kpym

kpym commented Jun 1, 2023

What version of Go are you using (go version)?

$ go version
go version go1.20.3 windows/amd64

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
set GO111MODULE=
set GOARCH=amd64
set GOBIN=
set GOCACHE=C:\Users\Kroum\AppData\Local\go-build
set GOENV=C:\Users\Kroum\AppData\Roaming\go\env
set GOEXE=.exe
set GOEXPERIMENT=
set GOFLAGS=
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOINSECURE=
set GOMODCACHE=D:\MyPrograms\golibs\pkg\mod
set GONOPROXY=
set GONOSUMDB=
set GOOS=windows
set GOPATH=D:\MyPrograms\golibs
set GOPRIVATE=
set GOPROXY=https://proxy.golang.org,direct
set GOROOT=C:\Program Files\Go
set GOSUMDB=sum.golang.org
set GOTMPDIR=
set GOTOOLDIR=C:\Program Files\Go\pkg\tool\windows_amd64
set GOVCS=
set GOVERSION=go1.20.3
set GCCGO=gccgo
set GOAMD64=v1
set AR=ar
set CC=gcc
set CXX=g++
set CGO_ENABLED=1
set GOMOD=D:\MyDocuments\Programming\go\test\bench\go.mod
set GOWORK=
set CGO_CFLAGS=-O2 -g
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-O2 -g
set CGO_FFLAGS=-O2 -g
set CGO_LDFLAGS=-O2 -g
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-m64 -mthreads -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=D:\Temp\System\Kroum\go-build2837998563=/tmp/go-build -gno-record-gcc-switches

What did you do?

Consider the following naive implementation of bytes.IndexAny:

func IndexAny(s []byte, chars string) int {
	min := -1
	for _, c := range chars {
		// uint(-1) is the maximum uint, so the condition means
		// "c was found and its index is smaller than the current min".
		if i := bytes.IndexRune(s, c); uint(i) < uint(min) {
			min = i
		}
	}
	return min
}

I benchmarked it with 10 delimiters on data where the first delimiter appears at position 1000, like this:

package indexany

import (
	"bytes"
	"strings"
	"testing"
)

func IndexAny(s []byte, chars string) int {
	min := -1
	for _, c := range chars {
		if i := bytes.IndexRune(s, c); uint(i) < uint(min) {
			min = i
		}
	}
	return min
}

const (
	t             = "🚆"
	firstchar     = 1000
	numdelimiters = 10
)

var (
	data       []byte
	separators string
)

// create the data and the separators for the benchmark
func init() {
	// data = "xxx...🚆"
	data = bytes.Repeat([]byte("x"), firstchar)
	data = append(data, t...)
	// separators = "yyy...🚆"
	separators = strings.Repeat("y", numdelimiters)
	separators = separators + t
}

func BenchmarkNaiveIndexAny(b *testing.B) {
	for i := 0; i < b.N; i++ {
		IndexAny(data, separators)
	}
}

func BenchmarkBytesIndexAny(b *testing.B) {
	for i := 0; i < b.N; i++ {
		bytes.IndexAny(data, separators)
	}
}

What did you expect to see?

That bytes.IndexAny is faster than this "naive" IndexAny.

What did you see instead?

$ go test -benchmem -bench=. .
goos: windows
goarch: amd64
pkg: kpym.github.com/gobench/indexany
cpu: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
BenchmarkNaiveIndexAny-8         3033126               522.9 ns/op             0 B/op          0 allocs/op
BenchmarkBytesIndexAny-8          223227              7233 ns/op               0 B/op          0 allocs/op
PASS
ok      kpym.github.com/gobench/indexany        4.325s

So this "naive" implementation is more than 10 times faster than the standard bytes.IndexAny.

Am I doing something wrong? Is this relevant?

Remark: if we use a lot of delimiters (1000), the standard function is 30% better (on my computer).

@mknyszek mknyszek changed the title from "affected/package: better bytes.IndexAny" to "bytes: IndexAny is slower than a naive implementation in some cases" on Jun 1, 2023
@mknyszek mknyszek added the Performance, NeedsInvestigation and FeatureRequest labels on Jun 1, 2023
@mknyszek mknyszek added this to the Backlog milestone Jun 1, 2023
@mknyszek
Contributor

mknyszek commented Jun 1, 2023

@ianlancetaylor
Contributor

Any algorithm is going to have cases where it is better or worse. The question is how to get good average performance for typical cases.

How does your implementation fare on the benchmarks that are already in the bytes package?

The current bytes.IndexAny algorithm is somewhat optimized for the case where the characters being searched for (chars) are all ASCII. It may be that we can do better for cases where that is not true.
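
Roughly, the ASCII path builds a small bit set from chars once and then scans the data in a single pass. A sketch of that shape only (illustrative, not the actual bytes internals):

package main

import "fmt"

// indexAnyASCIISketch assumes every byte in chars is ASCII: build a 256-bit
// set once, then look each data byte up in it during a single pass over s.
func indexAnyASCIISketch(s []byte, chars string) int {
	var set [8]uint32 // one bit per byte value 0..255
	for i := 0; i < len(chars); i++ {
		c := chars[i]
		set[c/32] |= 1 << (c % 32)
	}
	for i, b := range s {
		if set[b/32]&(1<<(b%32)) != 0 {
			return i
		}
	}
	return -1
}

func main() {
	fmt.Println(indexAnyASCIISketch([]byte("key: value\r\n"), "\r\n")) // 10
}

With that shape, the cost is essentially one pass over the data regardless of how many (ASCII) characters are in chars.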

@kpym
Author

kpym commented Jun 1, 2023

@ianlancetaylor

Any algorithm is going to have cases where it is better or worse. The question is how to get good average performance for typical cases.

I agree. When I searched for bytes.IndexAny on GitHub and in the standard library, it looks like it is used, in almost all cases:

  • with between 2 and 4 characters (delimiters),
  • with characters that are ASCII (like "\r\n", " \t\r\n" or " ,;\r\n").

How does your implementation fare on the benchmarks that are already in the bytes package?

I didn't find any benchmarks for IndexAny in the standard library. Am I missing something?

But in all my benchmarks with a small number of separators, even when they are all ASCII, the naive version is (much) faster, except when the first occurrence is very close to the beginning (less than 4); in that case the two are very similar. And as the distance to the first occurrence increases, so does the advantage of the naive version.

The current bytes.IndexAny algorithm is somewhat optimized for the case where the characters being searched for (chars) are all ASCII. It may be that we can do better for cases that is not true.

I'm really surprised by what you say, because I came up with this naive version precisely because I wanted to handle only the case of ASCII separators, and I was surprised that in bytes.IndexAny the separator set is a string and not a []byte. An IndexAnyByte (or IndexByteAny) is missing!

If we only want to handle ASCII delimiters, the following []byte version is even faster:

func IndexAnyByte(s, chars []byte) int {
	min := -1
	for _, c := range chars {
		if i := bytes.IndexByte(s, c); uint(i) < uint(min) {
			min = i
		}
	}
	return min
}
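
A quick illustration of the ASCII-delimiter case (this assumes the IndexAnyByte above is in scope and fmt is imported; the sample line is made up):

	line := []byte("field1,field2;field3\r\n")
	fmt.Println(IndexAnyByte(line, []byte(",;\r\n"))) // 6, the position of the first ','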

I really don't understand the choice of the string parameter in bytes.IndexAny (vs. strings.IndexAny).
Why bother with all the Unicode complexity in this case?
Anyway, that's not the point of this issue, and we can't change it now (for backward compatibility reasons).

@ianlancetaylor
Contributor

I didn't find any benchmarks for indexAny in the standard library. Am I missing something?

go test -test.bench=BenchmarkIndexAny bytes

@ianlancetaylor
Contributor

But in all benchmarks with a small number of separators, even if they are all ASCII, the naive version is (much) faster.

This surprises me. The current algorithm only looks through the string once. Do you have an explanation for why it is faster to look through it multiple times? Is this a memory caching effect?

@mknyszek
Contributor

mknyszek commented Jun 1, 2023

I got curious so I'm running the benchmarks in the bytes package with both implementations. I'll post the results back here.

@mknyszek
Contributor

mknyszek commented Jun 1, 2023

Here's what I got:

bytes $ benchstat baseline.txt naive.txt
name                    old time/op  new time/op    delta
IndexAnyASCII/1:1-8     6.04ns ± 1%    8.20ns ± 1%    +35.81%  (p=0.008 n=5+5)
IndexAnyASCII/1:2-8     6.03ns ± 1%   13.19ns ± 0%   +118.85%  (p=0.008 n=5+5)
IndexAnyASCII/1:4-8     6.02ns ± 1%   23.26ns ± 1%   +286.17%  (p=0.008 n=5+5)
IndexAnyASCII/1:8-8     6.02ns ± 0%   43.77ns ± 1%   +627.49%  (p=0.008 n=5+5)
IndexAnyASCII/1:16-8    6.29ns ± 1%   84.67ns ± 1%  +1245.89%  (p=0.008 n=5+5)
IndexAnyASCII/1:32-8    7.56ns ± 2%  177.14ns ± 0%  +2243.25%  (p=0.008 n=5+5)
IndexAnyASCII/1:64-8    6.00ns ± 1%  338.48ns ± 1%  +5544.53%  (p=0.008 n=5+5)
IndexAnyASCII/16:1-8    7.58ns ± 1%    8.47ns ± 1%    +11.83%  (p=0.008 n=5+5)
IndexAnyASCII/16:2-8    19.2ns ± 1%    13.8ns ± 0%    -28.51%  (p=0.008 n=5+5)
IndexAnyASCII/16:4-8    22.6ns ± 1%    24.4ns ± 0%     +8.18%  (p=0.008 n=5+5)
IndexAnyASCII/16:8-8    28.4ns ± 1%    45.7ns ± 1%    +61.00%  (p=0.008 n=5+5)
IndexAnyASCII/16:16-8   41.8ns ± 0%    88.6ns ± 0%   +111.92%  (p=0.008 n=5+5)
IndexAnyASCII/16:32-8   57.6ns ± 0%   175.7ns ± 2%   +204.86%  (p=0.008 n=5+5)
IndexAnyASCII/16:64-8    145ns ± 1%     353ns ± 0%   +143.91%  (p=0.008 n=5+5)
IndexAnyASCII/256:1-8   9.64ns ± 1%   10.44ns ± 1%     +8.39%  (p=0.008 n=5+5)
IndexAnyASCII/256:2-8    182ns ± 1%      18ns ± 1%    -90.27%  (p=0.008 n=5+5)
IndexAnyASCII/256:4-8    188ns ± 1%      32ns ± 1%    -82.74%  (p=0.008 n=5+5)
IndexAnyASCII/256:8-8    198ns ± 0%      62ns ± 2%    -68.74%  (p=0.008 n=5+5)
IndexAnyASCII/256:16-8   207ns ± 4%     130ns ± 1%    -37.37%  (p=0.008 n=5+5)
IndexAnyASCII/256:32-8   216ns ± 1%     249ns ± 1%    +15.24%  (p=0.008 n=5+5)
IndexAnyASCII/256:64-8   306ns ± 2%     483ns ± 2%    +57.72%  (p=0.008 n=5+5)
IndexAnyUTF8/1:1-8      5.67ns ± 1%   12.37ns ± 2%   +118.07%  (p=0.008 n=5+5)
IndexAnyUTF8/1:2-8      5.69ns ± 0%   21.32ns ± 0%   +274.51%  (p=0.008 n=5+5)
IndexAnyUTF8/1:4-8      5.67ns ± 0%   23.92ns ± 1%   +321.79%  (p=0.008 n=5+5)
IndexAnyUTF8/1:8-8      5.66ns ± 1%   44.28ns ± 1%   +682.79%  (p=0.008 n=5+5)
IndexAnyUTF8/1:16-8     5.94ns ± 0%   69.25ns ± 1%  +1065.47%  (p=0.008 n=5+5)
IndexAnyUTF8/1:32-8     7.19ns ± 0%  155.06ns ± 1%  +2055.29%  (p=0.008 n=5+5)
IndexAnyUTF8/1:64-8     5.68ns ± 0%  300.56ns ± 1%  +5195.46%  (p=0.008 n=5+5)
IndexAnyUTF8/16:1-8     57.1ns ± 1%    61.3ns ± 1%     +7.43%  (p=0.008 n=5+5)
IndexAnyUTF8/16:2-8     65.0ns ± 0%   119.9ns ± 1%    +84.53%  (p=0.008 n=5+5)
IndexAnyUTF8/16:4-8     65.1ns ± 1%    76.4ns ± 0%    +17.40%  (p=0.008 n=5+5)
IndexAnyUTF8/16:8-8     65.3ns ± 1%   148.9ns ± 1%   +128.23%  (p=0.008 n=5+5)
IndexAnyUTF8/16:16-8    64.0ns ± 0%    91.8ns ± 0%    +43.45%  (p=0.008 n=5+5)
IndexAnyUTF8/16:32-8    83.8ns ± 0%   277.9ns ± 1%   +231.56%  (p=0.008 n=5+5)
IndexAnyUTF8/16:64-8    66.2ns ± 0%   414.2ns ± 1%   +525.28%  (p=0.008 n=5+5)
IndexAnyUTF8/256:1-8     838ns ± 1%     838ns ± 2%       ~     (p=1.000 n=5+5)
IndexAnyUTF8/256:2-8     973ns ± 1%    1704ns ± 2%    +75.16%  (p=0.008 n=5+5)
IndexAnyUTF8/256:4-8     973ns ± 1%     861ns ± 1%    -11.53%  (p=0.008 n=5+5)
IndexAnyUTF8/256:8-8     971ns ± 0%    1726ns ± 1%    +77.80%  (p=0.008 n=5+5)
IndexAnyUTF8/256:16-8    970ns ± 0%     109ns ± 1%    -88.77%  (p=0.008 n=5+5)
IndexAnyUTF8/256:32-8   1.29µs ± 0%    1.90µs ± 1%    +47.48%  (p=0.008 n=5+5)
IndexAnyUTF8/256:64-8   1.01µs ± 1%    1.29µs ± 1%    +28.61%  (p=0.008 n=5+5)

This is pretty interesting, assuming it's reproducible. Largely the implementation in the bytes package is faster (sometimes dramatically so). But there are a number of cases where the naive version does better. Picking them out:

IndexAnyASCII/16:2-8    19.2ns ± 1%    13.8ns ± 0%    -28.51%  (p=0.008 n=5+5)
IndexAnyASCII/256:2-8    182ns ± 1%      18ns ± 1%    -90.27%  (p=0.008 n=5+5)
IndexAnyASCII/256:4-8    188ns ± 1%      32ns ± 1%    -82.74%  (p=0.008 n=5+5)
IndexAnyASCII/256:8-8    198ns ± 0%      62ns ± 2%    -68.74%  (p=0.008 n=5+5)
IndexAnyASCII/256:16-8   207ns ± 4%     130ns ± 1%    -37.37%  (p=0.008 n=5+5)
IndexAnyUTF8/256:4-8     973ns ± 1%     861ns ± 1%    -11.53%  (p=0.008 n=5+5)
IndexAnyUTF8/256:16-8    970ns ± 0%     109ns ± 1%    -88.77%  (p=0.008 n=5+5)

I don't know what's particularly interesting about these cases.

Benchmark platform:

goos: linux
goarch: amd64
cpu: AMD EPYC 7B12

@kpym
Author

kpym commented Jun 1, 2023

@mknyszek I have the impression that the cases where bytes.IndexAny is faster are the "exceptional" cases for which the naive version is not optimized: when the data is really short (shorter than or comparable to the number of separators).

The cases where the naive version wins, it seems to me, are those where the data is much larger than the number of separators. And that feels closer to actual use.

@kpym
Author

kpym commented Jun 1, 2023

@ianlancetaylor

This surprises me. The current algorithm only looks through the string once. Do you have an explanation for why it is faster to look through it multiple times? Is this a memory caching effect?

In any case we have a double loop, over the separators and over the data. Let n be the number of separators and m the position of the first occurrence of a separator in the data.

Theoretically, without taking optimizations into account, especially if n<m, it's in our interest to loop over the delimiters first, because if the first delimiter found is on average at position n/2, we avoid half of the m*n/2 comparisons, while in the other direction we avoid only n/2 of the n*m comparisons.

I don't think this explains the dominance of the naive version in my benchmarks, since I put the "found" separator in the last position.

A possible explanation for the advantage of the naive version, in situations where the number of separators is smaller than the position of the first occurrence, is that bytes.IndexByte (and bytes.IndexRune) are highly optimized. The "naive" version takes advantage of this by keeping the "short" loop in Go and delegating the "long" loop to those optimized routines.
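
To make that last point concrete, here is a small benchmark sketch (sizes and names are only illustrative) that pits bytes.IndexByte against a plain hand-written byte loop:

package indexany

import (
	"bytes"
	"testing"
)

// manualIndexByte scans one byte at a time, without the optimized
// implementation that backs bytes.IndexByte.
func manualIndexByte(s []byte, c byte) int {
	for i := 0; i < len(s); i++ {
		if s[i] == c {
			return i
		}
	}
	return -1
}

var long = append(bytes.Repeat([]byte("x"), 1<<16), ',')

func BenchmarkManualIndexByte(b *testing.B) {
	for i := 0; i < b.N; i++ {
		manualIndexByte(long, ',')
	}
}

func BenchmarkStdIndexByte(b *testing.B) {
	for i := 0; i < b.N; i++ {
		bytes.IndexByte(long, ',')
	}
}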

Notes:

  • I think in practice we're more often in the case where the number of delimiters is smaller than the data length. The tests in the standard library contain a lot of extreme cases: very short data with no separators.
  • If we sort the separators by their probability of occurrence, this will improve performance even more in the "naive" version where we loop over the separators first.

@kpym
Author

kpym commented Jun 2, 2023

I just realized last night that I was wrong in my previous message and that the benchmarks, both mine and the two in the standard library, are not very good. They all suffer from the same problem: it's not realistic to consider only cases where the separator is absent from the data or sits in the last position.

Suppose the separator at index i < n in the delimiter list occurs at position j < m in the data, and that the other separators are not present in the data at all. If we loop over the separators first, we have to make about i*m comparisons, whereas if we loop over the data first, we can stop after about n*j comparisons.
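
Just to make this rough model concrete, here is a tiny simulation sketch; it only counts comparisons according to the approximation above, nothing more (n, m and the distributions are made up):

package main

import (
	"fmt"
	"math/rand"
)

func main() {
	const (
		n      = 4      // number of separators
		m      = 1000   // length of the data
		trials = 100000 // number of simulated searches
	)
	var sepFirst, dataFirst float64
	for t := 0; t < trials; t++ {
		i := rand.Intn(n) // index of the separator that is actually present
		j := rand.Intn(m) // position of its first occurrence in the data
		// Separators first: the i absent separators each scan all m bytes,
		// then the present one stops at position j.
		sepFirst += float64(i*m + j)
		// Data first: the j bytes before the hit are checked against all n
		// separators, and the hit itself stops at separator i.
		dataFirst += float64(j*n + i)
	}
	fmt.Printf("separators first: ~%.0f comparisons on average\n", sepFirst/trials)
	fmt.Printf("data first:       ~%.0f comparisons on average\n", dataFirst/trials)
}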

The situation is actually much more complicated, as the positions of all the separators come into play. We could try to make theoretical predictions by approximating the behavior of j with a Poisson distribution and that of i with a uniform distribution, for example, but it seems to me much simpler to design realistic benchmarks than to theorize about the question.

I think a realistic benchmark could be

  • a simple CSV example where there are fields of different lengths (including empty ones) and where we scan for separators one after the other,
  • a template where we look for "special" characters.

In all cases, we'll limit ourselves to 2, 3 or 4 separators (which seem to be the use cases on GitHub) and to data which is scanned to find all the separators one after the other. I'll try to design a more realistic benchmark and see how it goes.

@kpym
Author

kpym commented Jun 2, 2023

Using the following benchmark

func BenchmarkIndexAnyCSV(b *testing.B) {
	// A CSV dataset with 7 columns (of different lengths) and 10 rows.
	// Some of the fields (gender) are empty.
	// We will check with this data repeated 1, 10 and 100 times.
	// This data is fake and was generated with https://www.convertcsv.com/generate-test-data.htm.
	var csvdata = []byte(`seq,name,adress,email,birthdate,gender,phone
	1,"May Moody",adress,gofpuwvi@wolamme.gp,08/03/2014,F,(313) 321-8728
	2,"Adele Martinez",adress,nohvahac@du.ai,02/09/1993,,(785) 834-8091
	3,"Zachary Brown",adress,bo@covih.yt,07/17/1901,M,(308) 643-7802
	4,"Lucy Lindsey",adress,sa@oh.fi,04/22/2023,F,(375) 395-3248
	5,"Lucy Oliver",adress,gemjivefa@ha.bd,07/21/1995,M,(422) 714-5921
	6,"Hilda Mullins",adress,kiujma@bok.kh,05/13/1922,,(716) 645-7631
	7,"Mollie Parker",adress,kir@kahamrew.io,03/18/1960,M,(534) 818-5916
	8,"Katie Adkins",adress,konocizah@jibfu.et,02/01/1965,F,(828) 544-2687
	9,"Maurice Carroll",adress,fosmutlo@erpewi.dm,05/01/1958,M,(478) 249-3000
	10,"Lettie Harvey",adress,lircuged@pe.bs,07/04/2008,,(571) 628-2216`)
	// We will check with the first 2, 3 and 4 delimiters of the following list.
	// The 4th is not present in the data.
	var delimiters = ",\n\";"

	for _, k := range []int{1, 10, 100} {
		for _, j := range []int{2, 3, 4} {
			b.Run(fmt.Sprintf("%d:%d", k, j), func(b *testing.B) {
				data := bytes.Repeat(csvdata, k)
				separators := delimiters[:j]
				for i := 0; i < b.N; i++ {
					p, q := 0, 0
					for {
						if q = IndexAny(data[p:], separators); q == -1 {
							break
						}
						p += q + 1
					}
				}
			})
		}
	}
}

I get the following results

$ benchstat standard.txt naive.txt
name                 old time/op  new time/op   delta
IndexAnyCSV/1:2-8    1.42µs ± 2%   1.30µs ± 1%     -8.29%  (p=0.029 n=4+4)
IndexAnyCSV/1:3-8    1.68µs ± 2%   2.34µs ± 1%    +38.98%  (p=0.029 n=4+4)
IndexAnyCSV/1:4-8    1.79µs ± 2%   4.13µs ± 2%   +130.71%  (p=0.029 n=4+4)
IndexAnyCSV/10:2-8   13.9µs ± 4%   14.0µs ± 4%       ~     (p=0.886 n=4+4)
IndexAnyCSV/10:3-8   16.4µs ± 2%   24.1µs ± 1%    +46.48%  (p=0.029 n=4+4)
IndexAnyCSV/10:4-8   17.8µs ± 2%  119.0µs ± 1%   +566.80%  (p=0.029 n=4+4)
IndexAnyCSV/100:2-8   139µs ± 4%    138µs ± 5%       ~     (p=0.686 n=4+4)
IndexAnyCSV/100:3-8   164µs ± 2%    243µs ± 2%    +48.53%  (p=0.029 n=4+4)
IndexAnyCSV/100:4-8   179µs ± 2%  11008µs ± 3%  +6036.16%  (p=0.029 n=4+4)

My conclusion is that the current version is better than the naive version in real-life situations. So there's no need to look any further.

On the other hand, I think we need to add a more realistic benchmark (like this one, for example) to the standard library (for both bytes and strings).

@mknyszek
Contributor

mknyszek commented Jun 2, 2023

FTR, we try to aggregate more realistic benchmarks in golang.org/x/benchmarks. These run continuously in CI (https://perf.golang.org/dashboard). If you write a benchmark that is executable with go get, we can consider referencing it in cmd/bent (https://cs.opensource.google/go/x/benchmarks/+/master:cmd/bent/configs/suites.toml). If you have an even larger, more complex text-processing benchmark, it might be worth adding to sweet.

It sounds like there might not be anything left to do in this issue? Closing for now, please let me know if you disagree!

@mknyszek mknyszek closed this as completed Jun 2, 2023
@ianlancetaylor
Contributor

@kpym Thanks for the investigation.

@mknyszek
Contributor

mknyszek commented Jun 2, 2023

+1, thanks for the investigation!

@kpym
Author

kpym commented Jun 9, 2023

Just for the sake of completeness, here is a better "naive" version, which is still slower than the standard bytes.IndexAny for the CSV benchmark in general, but seems to be faster for two delimiters. This version avoids searching past the current minimum.

func IndexAny(s []byte, chars string) int {
	min, cut := -1, len(s)
	for _, c := range chars {
		// Only search up to the best match found so far, so later
		// separators never scan past the current minimum.
		if i := bytes.IndexRune(s[:cut], c); uint(i) < uint(min) {
			min = i
			cut = i
		}
	}
	return min
}
name                 old time/op  new time/op  delta
IndexAnyCSV/1:2-8    1.46µs ± 9%  1.32µs ± 2%   -9.37%  (p=0.029 n=4+4)
IndexAnyCSV/1:3-8    1.66µs ± 6%  2.17µs ± 2%  +30.53%  (p=0.029 n=4+4)
IndexAnyCSV/1:4-8    1.74µs ± 2%  2.75µs ± 3%  +58.16%  (p=0.029 n=4+4)
IndexAnyCSV/10:2-8   13.6µs ± 1%  13.6µs ± 1%     ~     (p=0.486 n=4+4)
IndexAnyCSV/10:3-8   16.8µs ± 2%  21.8µs ± 1%  +30.08%  (p=0.029 n=4+4)
IndexAnyCSV/10:4-8   18.2µs ± 1%  27.4µs ± 2%  +50.51%  (p=0.029 n=4+4)
IndexAnyCSV/100:2-8   146µs ± 5%   137µs ± 2%   -5.96%  (p=0.029 n=4+4)
IndexAnyCSV/100:3-8   165µs ± 1%   221µs ± 2%  +34.37%  (p=0.029 n=4+4)
IndexAnyCSV/100:4-8   175µs ± 0%   285µs ± 4%  +62.70%  (p=0.029 n=4+4)
