New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
proposal: strings: add multiple string search ContainsAnySubstring and IndexAnySubstring #51143
Comments
Have you seen https://pkg.go.dev/index/suffixarray? I think https://blog.gopheracademy.com/advent-2014/string-matching/ does a decent job at explaining the different options at hand. It's tricky, though: searching for two substrings in a huge input string is nothing like searching for a thousand substrings in a small input. I imagine your proposed implementation would improve the latter case, but not the former, which would still be served best by e.g. suffixarray. |
I was aware of the article and the Cloudflare Aho-Corasick package, but not that the standard library has a suffix array implementation. As you point out, they're two different use cases - the suffix array approach works well for a large corpus, so it makes sense that it got added to support godoc. For the use case of matching a modest input against lots of possible substrings, it seems not so well suited. I think the current options for the many-needles use case are all not great:
The problem with option 3 is that finding a good implementation of an algorithm is extremely time-consuming. Even if you know what you're looking for, 90% of the open source packages are toy implementations put on Github by someone following a tutorial. This isn't immediately obvious, though, and you have to have some level of knowledge of the algorithm to spot problems, and even then it takes a lot of time to find a good implementation. (I looked today, and I found, for example, a Rabin-Karp implementation using a tiny prime modulus, which would quickly hit the worst case, but if I didn't know the algorithm, it would look perfectly reasonable at a glance.) The great thing about the standard library is that you can be sure the implementation is of a high quality. It looks like in this case, the strings package could provide those functions at almost no extra cost in maintenance. |
In the meantime, I've written a POC extending Go's bytealg Rabin-Karp implementation to search for multiple needles. I benchmarked it against the other options:
Some observations:
Conclusion After taking a couple of hours to try different things, I no longer think the existing bytealg package can be easily reused to build the functions I propose. That might be all the more reason to do it, because it did take me most of today's afternoon to arrive at the following heuristic:
It might be nice to save posterity those few hours. |
Closed this for now, because it's not as simple of an addition as I thought, which makes it less attractive. |
I'd like to propose a new group of string search functions in the strings package that can efficiently search for multiple needles.
Rationale
Looking at StackOverflow, the common advice given to people trying to search for multiple needles in a string is to match each one in turn. This is inefficient, but it looks like it's the common implementation. We could provide an efficient implementation, probably saving enough CPU cycles worldwide to make it worthwhile.
API sketch
I propose adding the following:
Implementation Sketch
The strings library already implements the Rabin-Karp algorithm, which is efficient in the average case. (And no less efficient than the common naive implementation in the worst case, which is anyway exceedingly unlikely.) It looks like both functions I propose could be built with minimal changes to the existing internals.
The text was updated successfully, but these errors were encountered: