proposal: strings: add multiple string search ContainsAnySubstring and IndexAnySubstring #51143

the80srobot · 2022-02-11T08:54:00Z

I'd like to propose a new group of string search functions in the strings package that can efficiently search for multiple needles.

Rationale

Looking at StackOverflow, the common advice given to people trying to search for multiple needles in a string is to match each one in turn. This is inefficient, but it looks like it's the common implementation. We could provide an efficient implementation, probably saving enough CPU cycles worldwide to make it worthwhile.

API sketch

I propose adding the following:

// ContainsAnySubstring reports whether any of the substrings are within s.
ContainsAnySubstring(s string, substrings []string) bool

// IndexAnySubstring returns the index of the first instance of any substring in s, of -1 if no substring is present in s.
IndexAnySubstring(s string, substrings []string) int

Implementation Sketch

The strings library already implements the Rabin-Karp algorithm, which is efficient in the average case. (And no less efficient than the common naive implementation in the worst case, which is anyway exceedingly unlikely.) It looks like both functions I propose could be built with minimal changes to the existing internals.

The text was updated successfully, but these errors were encountered:

mvdan · 2022-02-11T09:45:24Z

Have you seen https://pkg.go.dev/index/suffixarray? I think https://blog.gopheracademy.com/advent-2014/string-matching/ does a decent job at explaining the different options at hand.

It's tricky, though: searching for two substrings in a huge input string is nothing like searching for a thousand substrings in a small input. I imagine your proposed implementation would improve the latter case, but not the former, which would still be served best by e.g. suffixarray.

the80srobot · 2022-02-11T10:03:10Z

I was aware of the article and the Cloudflare Aho-Corasick package, but not that the standard library has a suffix array implementation.

As you point out, they're two different use cases - the suffix array approach works well for a large corpus, so it makes sense that it got added to support godoc.

For the use case of matching a modest input against lots of possible substrings, it seems not so well suited.

I think the current options for the many-needles use case are all not great:

Naive approach, matching strings individually.
Use a regex. My bias is that this probably won't go well with a large number of needles, but it's possible the regexp package handles that well. I should benchmark.
Use the cloudflare package, or something similar to it.

The problem with option 3 is that finding a good implementation of an algorithm is extremely time-consuming. Even if you know what you're looking for, 90% of the open source packages are toy implementations put on Github by someone following a tutorial. This isn't immediately obvious, though, and you have to have some level of knowledge of the algorithm to spot problems, and even then it takes a lot of time to find a good implementation. (I looked today, and I found, for example, a Rabin-Karp implementation using a tiny prime modulus, which would quickly hit the worst case, but if I didn't know the algorithm, it would look perfectly reasonable at a glance.)

The great thing about the standard library is that you can be sure the implementation is of a high quality. It looks like in this case, the strings package could provide those functions at almost no extra cost in maintenance.

the80srobot · 2022-02-11T20:20:27Z

In the meantime, I've written a POC extending Go's bytealg Rabin-Karp implementation to search for multiple needles. I benchmarked it against the other options:

Modified Rabin-Karp
Naive search
Regex
Cloudflare's Aho-Corasick

Some observations:

The POC runs well when needles are of similar length, but not when some are short and some long. I should have realized this before commenting that we could extend it - in hindsight it's pretty obvious, sorry.
Regex performance is 10-100× worse than other algorithms and probably scales superlinearly with the length of the uberpattern.
Naive string search is the fastest option up to about 10 needles.
Cloudflare's library has consistent and predictable performance in all situations I could think up.

Conclusion

After taking a couple of hours to try different things, I no longer think the existing bytealg package can be easily reused to build the functions I propose. That might be all the more reason to do it, because it did take me most of today's afternoon to arrive at the following heuristic:

If the number of needles is < 10, then search for them one by one
Otherwise use the Cloudflare library

It might be nice to save posterity those few hours.

the80srobot · 2022-02-12T07:30:45Z

Closed this for now, because it's not as simple of an addition as I thought, which makes it less attractive.

the80srobot added the Proposal label Feb 11, 2022

gopherbot added this to the Proposal milestone Feb 11, 2022

ianlancetaylor added this to Incoming in Proposals (old) Feb 11, 2022

ianlancetaylor changed the title ~~proposal: strings: Add multiple string search ContainsAnySubstring and IndexAnySubstring~~ proposal: strings: add multiple string search ContainsAnySubstring and IndexAnySubstring Feb 11, 2022

the80srobot closed this as completed Feb 12, 2022

ianlancetaylor removed this from Incoming in Proposals (old) Feb 12, 2022

golang locked and limited conversation to collaborators Feb 12, 2023

gopherbot added the FrozenDueToAge label Feb 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

proposal: strings: add multiple string search ContainsAnySubstring and IndexAnySubstring #51143

proposal: strings: add multiple string search ContainsAnySubstring and IndexAnySubstring #51143

the80srobot commented Feb 11, 2022 •

edited

mvdan commented Feb 11, 2022

the80srobot commented Feb 11, 2022

the80srobot commented Feb 11, 2022 •

edited

the80srobot commented Feb 12, 2022

proposal: strings: add multiple string search ContainsAnySubstring and IndexAnySubstring #51143

proposal: strings: add multiple string search ContainsAnySubstring and IndexAnySubstring #51143

Comments

the80srobot commented Feb 11, 2022 • edited

mvdan commented Feb 11, 2022

the80srobot commented Feb 11, 2022

the80srobot commented Feb 11, 2022 • edited

the80srobot commented Feb 12, 2022

the80srobot commented Feb 11, 2022 •

edited

the80srobot commented Feb 11, 2022 •

edited