proposal: strings: add func CommonPrefixLen(x, y string) int #53120
Comments
If you don't like that way around, flip the arguments.
Both functions will be useful for Go developers.
I'd expect a function like the one in https://go.dev/play/p/DfUyiNfhvmI
@rhysh Very good question.
In the linked examples, the specific callers want the Len variant in a slim majority of cases. I'd be ok with just CommonPrefix though.
Yes. Though this would make it the first function in the strings package that, when given UTF-8 arguments, does not always return a UTF-8 result.
This proposal has been added to the active column of the proposals project.
Appears that there are 5 possible variations here:
Would the UTF-8 aware versions of these functions be compatible with the examples you presented?
If we add to strings, we should add to bytes, which says 'Package bytes implements functions for the manipulation of byte slices. It is analogous to the facilities of the strings package.' I also expect that strings.CommonPrefixLen(string(x), string(y)) == bytes.CommonPrefixLen(x, y). I should never be surprised by changing code that works on strings to work on []byte by s/strings./bytes./g.

Of course, once you get to combining characters, even UTF-8 boundaries are not always right. So maybe it should be bytes always.

Looking at the examples, it does seem that many of them want the length anyway, so returning the length and avoiding the "which one gets sliced" question might be best. That gets us to "bytes only - no UTF-8 worries - and returning the length".
And the same for []byte. What do people think?
I think it's more helpful to slice the result than return its length, only to slice it again, so maybe it should go in anyway. It just seems more natural as an API. If you want both, though, that's fine. And those examples don't feel representative to me, but then I don't write the kind of code others do. Which will you slice? As I said, do and document the first argument.
The arguments in favor of CommonPrefixLen have been made above. If the examples we looked at are not representative, does anyone want to contribute more examples of existing code that would use CommonPrefix or CommonPrefixLen? Another potential reason for CommonPrefixLen is that the length might fall in the middle of a UTF-8 sequence or a grapheme cluster or any of those things. Making the user do the slice might make them think a bit about whether it's a sensible place to slice.
@rsc Your last paragraph doesn't seem compelling. In writing `x[:CommonPrefixLen(x, y)]`, at what moment do I ask where I'm slicing that I wouldn't ask with `CommonPrefix(x, y)`? I'm not arguing against `CommonPrefixLen`.
@robpike, do you agree with the semantics that CommonPrefixLen("α", "β") = 1, and so CommonPrefix("α", "β") = "\xce"?
There's value in uniformity between the strings and bytes packages.
Currently where the
Are we talking byte slices or strings? For strings, I would expect the answer to be 0. For byte slices, I think the jury is still out.
For byte slices, the correct answer is (I assert) obviously 1, but for strings things are less clear: consistency across the two packages would demand that the answer also be 1, even though that means slicing a rune in half. I can live with that (given a suitable warning in the doc comment), but if there is no clear consensus that this is the right behavior, then that's (unfortunately) an argument against adding the function at all.
@adonovan I see it exactly the opposite, that the value for strings is rune-based, and the value for bytes unclear, with consistency being a consideration in the decision. The strings package shouldn't slice a rune in half. |
Ignoring memory aliasing issues, I expect it to be safe to swap all usages of the strings version for the bytes version.
Change https://go.dev/cl/413716 mentions this issue: |
I wrote a fuzz test (https://go.dev/cl/413716) verifying whether the strings and bytes implementations agree. As such, I would expect the behavior of the two packages to match.
I see two possible solutions:
Hmm, I think I have been laboring under the misapprehension that the bytes package has a different viewpoint of the contents of byte strings than the strings package does about the content of strings (arbitrary binary vs UTF-8 encoded text). But I see that is not the case, and Rob's argument that both versions should respect rune boundaries now makes a lot more sense. All of the specific cases I encountered would work with a rune-boundary respecting implementation. Therefore I propose this new API:
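The API itself is elided above; as a hypothetical sketch of the rune-boundary-respecting behavior being described (the function name, signature, and back-off strategy are my assumptions, not adonovan's actual text), it might look like:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// commonPrefix returns the longest common prefix of x and y that
// ends on a UTF-8 rune boundary, so a rune is never sliced in
// half. Hypothetical sketch; assumes x and y are valid UTF-8.
func commonPrefix(x, y string) string {
	n := 0
	for n < len(x) && n < len(y) && x[n] == y[n] {
		n++
	}
	// Back off to the nearest rune boundary. If n == len(x), the
	// whole of x is the prefix and is already on a boundary.
	for n > 0 && n < len(x) && !utf8.RuneStart(x[n]) {
		n--
	}
	return x[:n]
}

func main() {
	fmt.Printf("%q\n", commonPrefix("αb", "αc")) // "α"
	fmt.Printf("%q\n", commonPrefix("α", "β"))   // "", not "\xce"
}
```

Note that the byte-wise loop still runs first; only the final answer is trimmed to a boundary, which keeps the hot path simple.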
Is it worth stating explicitly that the function compares encodings, not decodings? That is, two different encodings of U+FFFD cannot be part of the common prefix? [Russ raises a good point of objection below.]
Suppose we do the code point semantics. Then CommonPrefix("👨\u200d🔧", "👨\u200d🏭") = "👨\u200d" (Man, ZWJ, Wrench and Man, ZWJ, Factory produce Man, ZWJ). There are going to be weird boundary cases no matter what. It would also be very surprising for strings and bytes to differ here. I don't think there are any functions they have in common today that change semantics based on whether your UTF-8 is in []byte or string.
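A rune-by-rune sketch (hypothetical, for illustration only) reproduces that boundary case, slicing a grapheme cluster even though it never slices a rune:

```go
package main

import "fmt"

// commonPrefixRunes compares rune by rune: a sketch of the
// "code point semantics" option, not a proposed API.
func commonPrefixRunes(x, y string) string {
	rx, ry := []rune(x), []rune(y)
	n := 0
	for n < len(rx) && n < len(ry) && rx[n] == ry[n] {
		n++
	}
	return string(rx[:n])
}

func main() {
	// Man, ZWJ, Wrench vs Man, ZWJ, Factory: the shared runes are
	// Man and ZWJ, splitting the grapheme cluster after the ZWJ.
	fmt.Printf("%q\n", commonPrefixRunes("👨\u200d🔧", "👨\u200d🏭"))
}
```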
There's also an option to add this to the generic slices package; it has well-defined semantics for any slice with comparable elements. That makes the semantics clear: indexing yields bytes, so it is byte-by-byte. The generic code could do a type assertion and dispatch to the faster code for specific types.
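A sketch of that generic option (the signature is my assumption; no such function exists in the slices package):

```go
package main

import "fmt"

// CommonPrefixLen over any slice with comparable elements. The
// semantics are unambiguous: indexing yields elements, so for
// []byte the comparison is byte-by-byte.
func CommonPrefixLen[E comparable](x, y []E) int {
	n := 0
	for n < len(x) && n < len(y) && x[n] == y[n] {
		n++
	}
	return n
}

func main() {
	// For []byte this gives the byte-wise answer: "α" is 0xCE 0xB1
	// and "β" is 0xCE 0xB2, sharing one byte.
	fmt.Println(CommonPrefixLen([]byte("α"), []byte("β"))) // 1
	fmt.Println(CommonPrefixLen([]int{1, 2, 3}, []int{1, 2, 4})) // 2
}
```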
To solve the "but it slices runes / grapheme clusters in half" problem, we could have a function that, given a string/byte slice, returns the prefix that ends at a rune/cluster boundary. Then if you care about runes, you first find the shared prefix and then take the valid prefix of that. (Given that std is currently completely agnostic to grapheme clusters, it'd probably be UTF-8 rune prefix, not valid cluster prefix. It'd be nice to have good grapheme cluster support in std, but that's an entirely different proposal.) utf8.DecodeLastRune[InString] is close to this but requires a little bit of work to hook up correctly. That's a generally useful API, and it composes well. By operating on a single string/byte slice, it doesn't raise any questions about which input it aliases.
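The composable helper being suggested might be sketched as follows (the name is hypothetical; the hookup via utf8.DecodeLastRuneInString is my reading of the comment above):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeBoundaryPrefix trims s back to the last complete UTF-8 rune,
// discarding a trailing partial encoding. DecodeLastRuneInString
// reports (RuneError, 1) exactly when the tail is not a complete
// rune; a genuine U+FFFD at the end decodes with size 3.
func runeBoundaryPrefix(s string) string {
	for len(s) > 0 {
		r, size := utf8.DecodeLastRuneInString(s)
		if r != utf8.RuneError || size > 1 {
			return s // ends with a complete rune
		}
		s = s[:len(s)-1] // drop one byte of the partial rune
	}
	return s
}

func main() {
	// Compose: the byte-wise common prefix of "α" and "β" is
	// "\xce"; trimming to a rune boundary yields "".
	fmt.Printf("%q\n", runeBoundaryPrefix("\xce")) // ""
	fmt.Printf("%q\n", runeBoundaryPrefix("αb"))   // "αb"
}
```

Because it takes a single argument, the question of which input the result aliases never arises.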
One of the major advantages of having it in strings and bytes is the optimized implementation. That said, I guess we can have a switch statement in the generic implementation that dispatches to the specialized code.
It seems like we are getting close to bike-shedding this. To try to break the logjam, I suggest we add:
Does anyone object to this? |
@icholy The comment above says that the
I think it's a serious mistake to have CommonPrefix, at least for strings, slice into a UTF-8 encoding. Very serious. I would never use it if it did, and would recommend that others avoid it as well. Not just out of safety, although that's paramount; I can't think why I'd want it.
In the following example I expect it to print
It can print |
@gazerro Another interesting example, and I agree that's a desirable consistency property. It appears there is no good solution to all these constraints. Here's an alternative approach: We add a bytes.CommonPrefixLen function that works on arbitrary bytes without regard to UTF-8. But we don't add the analogous function to the strings package, and we document the reason why. Users that want the efficient implementation on strings interpreted as raw bytes can call bytes.CommonPrefixLen([]byte(str1), []byte(str2)) and the compiler will ensure that these conversions do not escape. |
I disagree with the notion that
Another point of view is to see how the function would be used. In the following examples, at least one of the two strings can only contain ASCII characters. So the returned slice always contains only ASCII characters, no matter if the comparison is byte-wise or rune-wise.

In the following example, ...

In the following example, ...

For the following, however, ...
There's clearly no consensus here, and we've made it 10+ years without this function. It sounds like we should probably decline the proposal. |
I think this should be re-opened as an addition to the slices package.
Based on the discussion above, this proposal seems like a likely decline. |
No change in consensus, so declined. |
It looks like the UTF-8 vs bytes discussion missed the main point for adding the CommonPrefixLen() function to the standard library. As for the UTF-8 vs bytes semantics, it is better not to implement this function in the strings package.
A common recurring operation on strings is to remove or advance past the common prefix of two strings. Here are some examples from the main Go repository and the golang.org/x/tools repo:
https://github.com/golang/tools/blob/master/go/internal/gcimporter/bexport.go#L264
https://github.com/golang/tools/blob/master/internal/lsp/source/comment.go#L229
https://github.com/golang/go/blob/master/src/cmd/compile/internal/base/timings.go#L122
https://github.com/golang/go/blob/master/src/cmd/compile/internal/ir/dump.go#L129
https://github.com/golang/go/blob/master/src/strings/replace.go#L170
https://github.com/golang/go/blob/master/src/net/addrselect.go#L199
https://github.com/golang/go/blob/master/src/go/printer/printer.go#L485
https://github.com/golang/go/blob/master/src/go/doc/comment/parse.go#L397
This function is easy enough to implement by hand:
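The hand-written version itself is elided above; a minimal byte-by-byte sketch (my reconstruction, not the proposal's exact code) would be:

```go
package main

import "fmt"

// commonPrefixLen returns the length in bytes of the longest
// common prefix of x and y, comparing byte by byte.
func commonPrefixLen(x, y string) int {
	n := 0
	for n < len(x) && n < len(y) && x[n] == y[n] {
		n++
	}
	return n
}

func main() {
	fmt.Println(commonPrefixLen("prefix-one", "prefix-two")) // 7
	fmt.Println(commonPrefixLen("α", "β"))                   // 1: splits the rune
}
```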
but it's possible to make it significantly faster than the naive version. This CL contains a version that runs 6x faster yet is still safe and portable. The code describes ways that it might be made much faster still at the cost of `unsafe` or assembly: https://go-review.googlesource.com/c/go/+/408116
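As a hedged illustration of how the naive loop can be sped up without `unsafe`, here is a word-at-a-time sketch of my own (not the code from the CL) that compares eight bytes per iteration:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// commonPrefixLen compares 8 bytes at a time before falling back
// to a byte loop for the tail. Equality of the uint64 words is
// endianness-independent. A real implementation would avoid the
// per-iteration []byte conversion.
func commonPrefixLen(x, y string) int {
	n := 0
	for n+8 <= len(x) && n+8 <= len(y) {
		a := binary.LittleEndian.Uint64([]byte(x[n : n+8]))
		b := binary.LittleEndian.Uint64([]byte(y[n : n+8]))
		if a != b {
			break
		}
		n += 8
	}
	for n < len(x) && n < len(y) && x[n] == y[n] {
		n++
	}
	return n
}

func main() {
	fmt.Println(commonPrefixLen("0123456789abXdef", "0123456789abYdef")) // 12
}
```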
While benchmarking and optimizing a diff algorithm this week, I noticed that one of the main profiling hotspots was, in essence, a CommonPrefixLen operation:
https://go-review.googlesource.com/c/tools/+/396855/11/internal/lsp/diff/lcs/old.go#219
(It's a little hard to see in this form, but in a private branch I refactored the code so that the dynamic dispatch of `eq(x, y byte)` was moved outside the loop. In essence the primitive operation over which this algorithm is abstracted is CommonPrefixLen.)

I propose that we add `func CommonPrefixLen(x, y string) int` to the standard `strings` package. An analogous function should be added to the `bytes` package too.

We could also add a `CommonPrefix(x, y string) string` function, which actually strips off the prefix rather than merely returning its length, but I don't think this is necessary as it is trivial to implement. Also, whereas `CommonPrefixLen` makes sense for bytes, it's less clear what `bytes.CommonPrefix` should do, since its result must alias one of the arguments. The first one, presumably? Better to avoid the question.