Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

strings: Count is O(n*m) #4600

Closed
robpike opened this issue Dec 30, 2012 · 5 comments
Closed

strings: Count is O(n*m) #4600

robpike opened this issue Dec 30, 2012 · 5 comments
Milestone

Comments

@robpike
Copy link
Contributor

robpike commented Dec 30, 2012

The simple algorithm used in strings.Count and bytes.Count is n^2 when linear algorithms
are known, although they require precomputation and allocation. It may be worth looking
at the parameters in the call to decide when a smarter algorithm is worthwhile.

A simple non-allocating heuristic would be to check not only the first byte before using
== but also the last byte. Another option would be to check the first and last bytes of
the strings in the == operator before doing the O(n) scan, at least when n is large.
Such tricks will not eliminate n^2 behavior but can make it less likely to arise. (Since
in Count the lengths always match, the usual efficient cut-off of checking the lengths
does nothing.)

Noticed in https://plus.google.com/115609618223925128756/posts/5EG3QHA1ugH
@rsc
Copy link
Contributor

rsc commented Dec 30, 2012

Comment 1:

Labels changed: added priority-later, go1.1maybe, removed priority-triage, go1.1.

Status changed to Thinking.

@remyoudompheng
Copy link
Contributor

Comment 3:

The use of a even stupid hash function could make it probabilistically linear and give a
bit less false positives than first and last bytes, while still avoiding allocation.

@remyoudompheng
Copy link
Contributor

Comment 4:

Here is an example using a fnv-like hash. A suggestion would be to do the following
in Contains/Count(s1, s2): if len(s2) > 16, count the number N of occurences of
s2[:4] in s1.
If N*len(s2) > K*len(s1) for some constant K (the hash function cost, here K=3), then
full comparisons are probably more costly than hashing and we switch algorithms.
s1 = strings.Repeat("ACTG", 32<<10), s2 = strings.Repeat("ACTG", 4<<10)
SetBytes(len(s1)+len(s2))
BenchmarkContainsTorture           5     403871636 ns/op       0.37 MB/s
BenchmarkCountTorture          5     403779882 ns/op       0.37 MB/s
BenchmarkContainsHashTorture        5000        454527 ns/op     324.43 MB/s
BenchmarkCountHashTorture       5000        439006 ns/op     335.90 MB/s
s1 = strings.Repeat("ACTG", 32<<10)
s2 = strings.Repeat("ACTG", 2<<10)+"GATC"+strings.Repeat("ACTG", 2<<10)
SetBytes(len(s1)+len(s2))
BenchmarkContainsTorture      100000         28224 ns/op    5224.35 MB/s
BenchmarkCountTorture      10000        227133 ns/op     649.20 MB/s
BenchmarkContainsHashTorture       20000         87624 ns/op    1682.81 MB/s
BenchmarkCountHashTorture       5000        684507 ns/op     215.42 MB/s

Attachments:

  1. string_test.go (2421 bytes)

@remyoudompheng
Copy link
Contributor

Comment 5:

It turns out that even on the original example from the G+ post the hash method is
faster:
BenchmarkContainsTorture      500000          4971 ns/op     223.28 MB/s
BenchmarkCountTorture     500000          6585 ns/op     168.55 MB/s
BenchmarkContainsHashTorture      500000          3062 ns/op     362.43 MB/s
BenchmarkCountHashTorture     500000          3793 ns/op     292.62 MB/s
s1 = "Lang lang lang. langaha Langan langarai langate langauge langbanite Langbehn
langbeinite langca Langdon Lange langeel langel Langelo Langeloth Langer Langford
Langham Langhian Langhorne langi langiel Langill Langille langite langka lang-kail
Langland langlauf langlaufer langlaufers langlaufs langle Langley langley langleys
Langlois Langmuir Lango Langobard Langobardic langobardic langoon langooty langosta
langourous langourously langouste langrage langrages langrel langrels Langrenus Langreo
Langres langret langridge langsat Langsdon Langsdorffia langset langsettle Langshan
langshan langshans Langside langspiel langspil Langston Langsville langsyne langsynes
langteraloo Langton Langtry language languaged languageless languages languaging langue
langued Languedoc languedoc Languedocian Languedoc-Roussillon languent langues
languescent languet languets languette languid languidly languidness languidnesses
languish languished languisher languishers languishes languishing languishingly
languishment languor languorment languorous languorously languorousness languors langur
langurs Langworthy"
s2 = "languid"

@remyoudompheng
Copy link
Contributor

Comment 6:

This issue was closed by revision 23093f8.

Status changed to Fixed.

@rsc rsc added this to the Go1.1 milestone Apr 14, 2015
@rsc rsc removed the go1.1maybe label Apr 14, 2015
@golang golang locked and limited conversation to collaborators Jun 24, 2016
This issue was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants