strings: Count is O(n*m) #4600

robpike · 2012-12-30T16:33:09Z

The simple algorithm used in strings.Count and bytes.Count is n^2 when linear algorithms
are known, although they require precomputation and allocation. It may be worth looking
at the parameters in the call to decide when a smarter algorithm is worthwhile.

A simple non-allocating heuristic would be to check not only the first byte before using
== but also the last byte. Another option would be to check the first and last bytes of
the strings in the == operator before doing the O(n) scan, at least when n is large.
Such tricks will not eliminate n^2 behavior but can make it less likely to arise. (Since
in Count the lengths always match, the usual efficient cut-off of checking the lengths
does nothing.)

Noticed in https://plus.google.com/115609618223925128756/posts/5EG3QHA1ugH

rsc · 2012-12-30T16:44:56Z

Comment 1:

Labels changed: added priority-later, go1.1maybe, removed priority-triage, go1.1.

Status changed to Thinking.

remyoudompheng · 2012-12-30T21:02:08Z

Comment 3:

The use of a even stupid hash function could make it probabilistically linear and give a
bit less false positives than first and last bytes, while still avoiding allocation.

remyoudompheng · 2012-12-30T23:46:17Z

Comment 4:

Here is an example using a fnv-like hash. A suggestion would be to do the following
in Contains/Count(s1, s2): if len(s2) > 16, count the number N of occurences of
s2[:4] in s1.
If N*len(s2) > K*len(s1) for some constant K (the hash function cost, here K=3), then
full comparisons are probably more costly than hashing and we switch algorithms.
s1 = strings.Repeat("ACTG", 32<<10), s2 = strings.Repeat("ACTG", 4<<10)
SetBytes(len(s1)+len(s2))
BenchmarkContainsTorture           5     403871636 ns/op       0.37 MB/s
BenchmarkCountTorture          5     403779882 ns/op       0.37 MB/s
BenchmarkContainsHashTorture        5000        454527 ns/op     324.43 MB/s
BenchmarkCountHashTorture       5000        439006 ns/op     335.90 MB/s
s1 = strings.Repeat("ACTG", 32<<10)
s2 = strings.Repeat("ACTG", 2<<10)+"GATC"+strings.Repeat("ACTG", 2<<10)
SetBytes(len(s1)+len(s2))
BenchmarkContainsTorture      100000         28224 ns/op    5224.35 MB/s
BenchmarkCountTorture      10000        227133 ns/op     649.20 MB/s
BenchmarkContainsHashTorture       20000         87624 ns/op    1682.81 MB/s
BenchmarkCountHashTorture       5000        684507 ns/op     215.42 MB/s

Attachments:

string_test.go (2421 bytes)

remyoudompheng · 2012-12-30T23:57:43Z

Comment 5:

It turns out that even on the original example from the G+ post the hash method is
faster:
BenchmarkContainsTorture      500000          4971 ns/op     223.28 MB/s
BenchmarkCountTorture     500000          6585 ns/op     168.55 MB/s
BenchmarkContainsHashTorture      500000          3062 ns/op     362.43 MB/s
BenchmarkCountHashTorture     500000          3793 ns/op     292.62 MB/s
s1 = "Lang lang lang. langaha Langan langarai langate langauge langbanite Langbehn
langbeinite langca Langdon Lange langeel langel Langelo Langeloth Langer Langford
Langham Langhian Langhorne langi langiel Langill Langille langite langka lang-kail
Langland langlauf langlaufer langlaufers langlaufs langle Langley langley langleys
Langlois Langmuir Lango Langobard Langobardic langobardic langoon langooty langosta
langourous langourously langouste langrage langrages langrel langrels Langrenus Langreo
Langres langret langridge langsat Langsdon Langsdorffia langset langsettle Langshan
langshan langshans Langside langspiel langspil Langston Langsville langsyne langsynes
langteraloo Langton Langtry language languaged languageless languages languaging langue
langued Languedoc languedoc Languedocian Languedoc-Roussillon languent langues
languescent languet languets languette languid languidly languidness languidnesses
languish languished languisher languishers languishes languishing languishingly
languishment languor languorment languorous languorously languorousness languors langur
langurs Langworthy"
s2 = "languid"

remyoudompheng · 2013-02-17T12:21:15Z

Comment 6:

This issue was closed by revision 23093f8.

Status changed to Fixed.

robpike added fixed labels Feb 17, 2013

rsc added this to the Go1.1 milestone Apr 14, 2015

rsc removed the go1.1maybe label Apr 14, 2015

golang locked and limited conversation to collaborators Jun 24, 2016

gopherbot added the FrozenDueToAge label Jun 24, 2016

This issue was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

strings: Count is O(n*m) #4600

strings: Count is O(n*m) #4600

robpike commented Dec 30, 2012

rsc commented Dec 30, 2012

remyoudompheng commented Dec 30, 2012

remyoudompheng commented Dec 30, 2012

remyoudompheng commented Dec 30, 2012

remyoudompheng commented Feb 17, 2013

strings: Count is O(n*m) #4600

strings: Count is O(n*m) #4600

Comments

robpike commented Dec 30, 2012

rsc commented Dec 30, 2012

remyoudompheng commented Dec 30, 2012

remyoudompheng commented Dec 30, 2012

remyoudompheng commented Dec 30, 2012

remyoudompheng commented Feb 17, 2013