Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: strings: add multiple string search ContainsAnySubstring and IndexAnySubstring #51143

Closed
the80srobot opened this issue Feb 11, 2022 · 4 comments

Comments

@the80srobot
Copy link

the80srobot commented Feb 11, 2022

I'd like to propose a new group of string search functions in the strings package that can efficiently search for multiple needles.

Rationale

Looking at StackOverflow, the common advice given to people trying to search for multiple needles in a string is to match each one in turn. This is inefficient, but it looks like it's the common implementation. We could provide an efficient implementation, probably saving enough CPU cycles worldwide to make it worthwhile.

API sketch

I propose adding the following:

// ContainsAnySubstring reports whether any of the substrings are within s.
ContainsAnySubstring(s string, substrings []string) bool

// IndexAnySubstring returns the index of the first instance of any substring in s, of -1 if no substring is present in s.
IndexAnySubstring(s string, substrings []string) int

Implementation Sketch

The strings library already implements the Rabin-Karp algorithm, which is efficient in the average case. (And no less efficient than the common naive implementation in the worst case, which is anyway exceedingly unlikely.) It looks like both functions I propose could be built with minimal changes to the existing internals.

@gopherbot gopherbot added this to the Proposal milestone Feb 11, 2022
@mvdan
Copy link
Member

mvdan commented Feb 11, 2022

Have you seen https://pkg.go.dev/index/suffixarray? I think https://blog.gopheracademy.com/advent-2014/string-matching/ does a decent job at explaining the different options at hand.

It's tricky, though: searching for two substrings in a huge input string is nothing like searching for a thousand substrings in a small input. I imagine your proposed implementation would improve the latter case, but not the former, which would still be served best by e.g. suffixarray.

@the80srobot
Copy link
Author

I was aware of the article and the Cloudflare Aho-Corasick package, but not that the standard library has a suffix array implementation.

As you point out, they're two different use cases - the suffix array approach works well for a large corpus, so it makes sense that it got added to support godoc.

For the use case of matching a modest input against lots of possible substrings, it seems not so well suited.

I think the current options for the many-needles use case are all not great:

  1. Naive approach, matching strings individually.
  2. Use a regex. My bias is that this probably won't go well with a large number of needles, but it's possible the regexp package handles that well. I should benchmark.
  3. Use the cloudflare package, or something similar to it.

The problem with option 3 is that finding a good implementation of an algorithm is extremely time-consuming. Even if you know what you're looking for, 90% of the open source packages are toy implementations put on Github by someone following a tutorial. This isn't immediately obvious, though, and you have to have some level of knowledge of the algorithm to spot problems, and even then it takes a lot of time to find a good implementation. (I looked today, and I found, for example, a Rabin-Karp implementation using a tiny prime modulus, which would quickly hit the worst case, but if I didn't know the algorithm, it would look perfectly reasonable at a glance.)

The great thing about the standard library is that you can be sure the implementation is of a high quality. It looks like in this case, the strings package could provide those functions at almost no extra cost in maintenance.

@ianlancetaylor ianlancetaylor added this to Incoming in Proposals (old) Feb 11, 2022
@ianlancetaylor ianlancetaylor changed the title proposal: strings: Add multiple string search ContainsAnySubstring and IndexAnySubstring proposal: strings: add multiple string search ContainsAnySubstring and IndexAnySubstring Feb 11, 2022
@the80srobot
Copy link
Author

the80srobot commented Feb 11, 2022

In the meantime, I've written a POC extending Go's bytealg Rabin-Karp implementation to search for multiple needles. I benchmarked it against the other options:

  1. Modified Rabin-Karp
  2. Naive search
  3. Regex
  4. Cloudflare's Aho-Corasick

Some observations:

  • The POC runs well when needles are of similar length, but not when some are short and some long. I should have realized this before commenting that we could extend it - in hindsight it's pretty obvious, sorry.
  • Regex performance is 10-100× worse than other algorithms and probably scales superlinearly with the length of the uberpattern.
  • Naive string search is the fastest option up to about 10 needles.
  • Cloudflare's library has consistent and predictable performance in all situations I could think up.

Conclusion

After taking a couple of hours to try different things, I no longer think the existing bytealg package can be easily reused to build the functions I propose. That might be all the more reason to do it, because it did take me most of today's afternoon to arrive at the following heuristic:

  1. If the number of needles is < 10, then search for them one by one
  2. Otherwise use the Cloudflare library

It might be nice to save posterity those few hours.

@the80srobot
Copy link
Author

Closed this for now, because it's not as simple of an addition as I thought, which makes it less attractive.

@ianlancetaylor ianlancetaylor removed this from Incoming in Proposals (old) Feb 12, 2022
@golang golang locked and limited conversation to collaborators Feb 12, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants