proposal: strings: add func CommonPrefixLen(x, y string) int #53120

adonovan · 2022-05-27T21:53:46Z

A common recurring operation on strings is to remove or advance past the common prefix of two strings. Here are some examples from the main Go repository and the golang.org/x/tools repo:

https://github.com/golang/tools/blob/master/go/internal/gcimporter/bexport.go#L264
https://github.com/golang/tools/blob/master/internal/lsp/source/comment.go#L229
https://github.com/golang/go/blob/master/src/cmd/compile/internal/base/timings.go#L122
https://github.com/golang/go/blob/master/src/cmd/compile/internal/ir/dump.go#L129
https://github.com/golang/go/blob/master/src/strings/replace.go#L170
https://github.com/golang/go/blob/master/src/net/addrselect.go#L199
https://github.com/golang/go/blob/master/src/go/printer/printer.go#L485
https://github.com/golang/go/blob/master/src/go/doc/comment/parse.go#L397

This function is easy enough to implement by hand:

commonLen :=  min(len(a), len(b))
for i = 0; i < commonLen && a[i] == b[i]; i++ {}

but it's possible to make it significantly faster than the naive version. This CL contains a version that runs 6x faster yet is still safe and portable. The code describes ways that it might be made much faster still at the cost of unsafe or assembly:
https://go-review.googlesource.com/c/go/+/408116

While benchmarking and optimizing a diff algorithm this week, I noticed that one of the main profiling hotspots was, in essence, a CommonPrefixLen operation:
https://go-review.googlesource.com/c/tools/+/396855/11/internal/lsp/diff/lcs/old.go#219
(It's a little hard to see in this form, but in a private branch I refactored the code so that the dynamic dispatch of eq(x, y byte) was moved outside the loop. In essence the primitive operation over which this algorithm is abstracted is CommonPrefixLen.)

I propose that we add func CommonPrefixLen(x, y string) int to the standard strings package. An analogous function should be added to the bytes package too.

We could also add a CommonPrefix(x, y string) string function, which actually strips off the prefix rather than merely returning its length, but I don't think this is necessary as it is trivial to implement. Also, whereas CommonPrefixLen makes sense for bytes, it's less clear what bytes.CommonPrefix should do, since its result must alias one of the arguments. The first one presumably? Better to avoid the question.

The text was updated successfully, but these errors were encountered:

robpike · 2022-05-28T09:43:42Z

CommonPrefixLen is harder to use and less intuitive. I hear your argument against CommonPrefix but there is an easy rebuttal, as suggested by yourself and by this proposed doc comment.

CommonPrefix returns the longest prefix shared by the two arguments. The return value is sliced from the first argument.

If you don't like that way around, flip the arguments.

SeanTolstoyevski · 2022-05-28T16:39:55Z

Both functions will be useful for Golang developers.
If this is accepted, I suggest both added to the standard library.

rhysh · 2022-05-29T17:28:47Z

I'd expect a function in the bytes package would operate on bytes. But several functions in the strings package operate specifically on runes. Would CommonPrefix{,Len} split UTF-8 runes?

https://go.dev/play/p/DfUyiNfhvmI

"¼" ¼ 0xc2bc [U+00BC]
"½" ½ 0xc2bd [U+00BD]
"\xc2" � 0xc2 [U+FFFD]

robpike · 2022-05-29T22:33:47Z

@rhysh Very good question.

adonovan · 2022-05-31T14:39:05Z

[rob] CommonPrefixLen is harder to use and less intuitive.

In the linked examples, the specific callers want the Len variant in a slim majority of cases. I'd be ok with just CommonPrefix though.

[rhys] But several functions in the strings package operate specifically on runes. Would CommonPrefix{,Len} split UTF-8 runes?

Yes. Though this would make it the first function in the strings package that, when given UTF-8 arguments, does not always return a UTF-8 result.

rsc · 2022-06-01T18:05:04Z

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

icholy · 2022-06-01T18:51:19Z

Appears that there are 5 possible variations here:

strings.CommonPrefixLen operates on bytes, returns # of bytes
strings.CommonPrefixLen operates on runes, returns # of bytes
strings.CommonPrefixLen operates on runes, returns # of runes (probably useless)
strings.CommonPrefix operates on bytes, returns slice of first arg.
strings.CommonPrefix operates on runes, returns slice of first arg.

Yes. Though this would make it the first function in the strings package that, when given UTF-8 arguments, does not always return a UTF-8 result.

Would the UTF-8 aware versions of these functions be compatible with the examples you presented?

rsc · 2022-06-08T17:22:53Z

If we add to strings, we should add to bytes, which says 'Package bytes implements functions for the manipulation of byte slices. It is analogous to the facilities of the strings package.' I also expect that strings.CommonPrefixLen(string(x), string(y)) == bytes.CommonPrefixLen(x, y). I should never be surprised by changing code that works on strings to work on []byte by s/strings./bytes./g.

Of course, once you get to combining characters, even UTF-8 boundaries are not always right. So maybe it should be bytes always.

Looking at the examples, it does seem that many of them want the length anyway, so returning the length and avoiding the "which one gets sliced" might be best.

That gets us to "bytes only - no UTF-8 worries - and returning the length".

func CommonPrefixLen(x, y string) int {
    n := 0
    for n < len(x) && n < len(y) && x[n] == y[n] {
        n++
    }
    return n
}

And same for []byte. What do people think?

robpike · 2022-06-09T01:34:35Z

I think it's more helpful to slice the result than return its length, only to slice it again, so maybe it should go in anyway. It just seems more natural as an API. If you want both, though, that's fine. And those examples don't feel representative to me, but then I don't write the kind of code others do.

Which will you slice? As I said, do and document the first argument.

rsc · 2022-06-15T17:18:59Z

The arguments in favor of CommonPrefixLen are that
(1) it avoids the "which is sliced" question, and
(2) in the examples we looked at, the code was already using the length more often than the slice itself.

If the examples we looked at are not representative, does anyone want to contribute more examples of existing code that would use CommonPrefix or CommonPrefixLen?

Another potential reason for CommonPrefixLen is that this length might be in the middle of a UTF-8 boundary or a grapheme cluster or any of those things. Making the user do the slice might make them think a bit about whether it's a sensible place to slice.

robpike · 2022-06-15T20:19:17Z

@rsc Your last paragraph doesn't seem compelling. In writing

prefix := s[:strings.CommonPrefixLen(s, t)]

at what moment do I ask where I'm slicing that I wouldn't ask with

prefix := strings.CommonPrefix(s, t)

I'm not arguing against CommonPrefixLen, I'm just arguing that it feels sound to me to offer CommonPrefix as well. And for the third time, the question of which to slice is trivial to answer and understand. It's not an issue at all.

rsc · 2022-06-22T17:30:00Z

@robpike, do you agree with the semantics that CommonPrefixLen("α", "β") = 1, and so CommonPrefix("α", "β") = "\xce" ?
I don't mind returning "\xce", it just seems more error-prone than the 1.

icholy · 2022-06-22T17:38:55Z

If we add to strings, we should add to bytes, which says 'Package bytes implements functions for the manipulation of byte slices. It is analogous to the facilities of the strings package.' I also expect that strings.CommonPrefixLen(string(x), string(y)) == bytes.CommonPrefixLen(x, y). I should never be surprised by changing code that works on strings to work on []byte by s/strings./bytes./g.

There's value in uniformity between strings & bytes, but having CommonPrefix("βα", "ββ") return "β\xce" feels like a footgun.

seankhliao · 2022-06-22T17:58:57Z

Currently where the strings package has to split user provided strings, it does so using runes/Unicode code points: ContainsAny, IndexAny, LastIndexAny, Map, Trim, TrimLeft, TrimRight.
The exceptions are functions that have a Byte suffix.
I think whatever is added to strings should keep operating on runes/codepoints unless otherwise explicitly labeled.

robpike · 2022-06-22T20:41:05Z

@robpike, do you agree with the semantics that CommonPrefixLen("α", "β") = 1, and so CommonPrefix("α", "β") = "\xce" ? I don't mind returning "\xce", it just seems more error-prone than the 1.

Are we talking byte slices or strings? For strings, I would expect the answer to be 0. For byte slices, I think the jury is still out.

adonovan · 2022-06-22T21:49:17Z

For byte slices, the correct answer is (I assert) obviously 1, but for strings things are less clear: consistency across the two packages would demand that the answer also be 1, even though that means slicing a rune in half. I can live with that (given a suitable warning in the doc comment), but if there is no clear consensus that this is the right behavior, then that's (unfortunately) an argument against adding the function at all.

robpike · 2022-06-23T00:56:40Z

@adonovan I see it exactly the opposite, that the value for strings is rune-based, and the value for bytes unclear, with consistency being a consideration in the decision. The strings package shouldn't slice a rune in half.

dsnet · 2022-06-23T01:23:27Z

Ignoring memory aliasing issues, I expect it to be safe to swap all usages of the bytes package with the strings package and vice-versa with no change in semantics. Are there existing cases where this property does not hold?

gopherbot · 2022-06-23T05:08:25Z

Change https://go.dev/cl/413716 mentions this issue: bytes: add fuzz test to ensure compatibility with strings

dsnet · 2022-06-23T05:09:55Z

Are there existing cases where this property does not hold?

I wrote a fuzz test (https://go.dev/cl/413716) verifying whether bytes and strings are identical for all common functions, and it seems (after a few hours of fuzzing) the answer is yes (except for a minor bug #53511).

As such, I would expect the behavior of bytes.CommonPrefix and strings.CommonPrefix to be identical (whether it be byte-based or rune-based).

mibk · 2022-06-23T06:01:05Z

I see two possible solutions:

Adding two different functions: PrefixRunes and PrefixBytes (to both strings and bytes packages)
Moving the function to package unicode/utf8 so it's obvious it operates on runes.

adonovan · 2022-06-23T13:09:06Z

Hmm, I think I have been laboring under the misapprehension that the bytes package has a different viewpoint of the contents of byte strings than the strings package does about the content of strings (arbitrary binary vs UTF-8 encoded text). But I see that is not the case, and Rob's argument that both versions should respect rune boundaries now makes a lot more sense. All of the specific cases I encountered would work with a rune-boundary respecting implementation. Therefore I propose this new API:

package strings

// CommonPrefixLen returns the length of the common prefix of two strings containing UTF-8 text.
// The resulting value is the index of a rune boundary.
func CommonPrefixLen(x, y string) int

package bytes

s/string/[]byte/
s/strings/byte slices/

It is worth stating explicitly that the function compares encodings, not decodings? That is, two different encodings of U+FFFD cannot be part of the common prefix?

[Russ raises a good point of objection below.]

rsc · 2022-06-29T17:47:10Z

Suppose we do the code point semantics. Then

CommonPrefix("🧑‍🔧", "👨‍🏭") = "🧑‍\u200d" (Man, ZWJ, Wrench and Man, ZWJ, Factory produces Man, ZWJ)

There's going to be weird boundary cases no matter what.

It would also be very surprising for strings and bytes to differ here. I don't think there are any functions they have in common today that change semantics based on whether your UTF-8 is in []byte or string.

josharian · 2022-06-29T18:29:44Z

There's also an option to add this to the generic slices package; it has well-defined semantics for any slice with comparable elements. That makes the semantics clear--indexing yields bytes, so it is byte-by-byte. The generic code could do a type assertion and dispatch to the faster code for specific types.

josharian · 2022-06-29T18:37:26Z

To solve the "but it slices runes / grapheme clusters in half" problem, we could have a function that, given a string/byte slice, returns the prefix that end at a rune/cluster boundary. Then if you care about runes, you first find the shared prefix and then take the valid prefix of that. (Given that std is currently completely agnostic to grapheme clusters, it'd probably be UTF8 rune prefix, not valid cluster prefix. It'd be nice to have good grapheme cluster support in std, but that's an entirely different proposal.) utf8.DecodeLastRune[InString] is close to this but requires a little bit of work to hook up correctly.

That's a generally useful API, and it composes well. By operating on a single string/byte slice, it doesn't raise any questions about which input it aliases.

dsnet · 2022-06-29T18:41:01Z

There's also an option to add this to the generic slices package

One of the major advantages of having it in strings and bytes is that we can optimize for the fact that this is a just a byte sequence of some kind so that we can utilize multi-byte comparisons. In @adonovan original post, he reported a 6x improvement over a naive element-to-element comparison.

That said, I guess we can have a switch statement in the generic implementation of slices.CommonPrefix to optimize for strings and []byte.

rsc · 2022-07-13T17:39:28Z

It seems like we are getting close to bike-shedding this. To try to break the logjam, I suggest we add:

package strings

// CommonPrefix returns the longest prefix p of x such that p == y[:len(p)].
// It considers the inputs as sequences of bytes, without any consideration of UTF-8.
func CommonPrefix(x, y string) string

package bytes

// CommonPrefix returns the longest prefix p of x such that p == y[:len(p)].
// It considers the inputs as sequences of bytes, without any consideration of UTF-8.
func CommonPrefix(x, y []byte) []byte

Does anyone object to this?

ianlancetaylor · 2022-07-13T20:14:29Z

@icholy The comment above says that the CommonPrefix functions return a slice of x, which is the first argument. So I think what you say is already the case. If not, could you show some code to explain what you mean? Thanks.

robpike · 2022-07-13T21:27:29Z

I think it's a serious mistake to have CommonPrefix, at least for strings, slice into a UTF-8 encoding. Very serious. I would never use it if it did, and would recommend that others avoid it as well. Not just out of safety, although that's paramount, I can't think why I'd want it.

gazerro · 2022-07-14T12:44:53Z

In the following example I expect it to print true or nothing.

if strings.HasPrefix(x, p) {
     print(strings.CommonPrefix(x, p) == p)
}

It can print false if CommonPrefix considers the strings to be UTF-8 encoded.

adonovan · 2022-07-14T20:19:00Z

@gazerro Another interesting example, and I agree that's a desirable consistency property.

It appears there is no good solution to all these constraints. Here's an alternative approach:

We add a bytes.CommonPrefixLen function that works on arbitrary bytes without regard to UTF-8. But we don't add the analogous function to the strings package, and we document the reason why. Users that want the efficient implementation on strings interpreted as raw bytes can call bytes.CommonPrefixLen([]byte(str1), []byte(str2)) and the compiler will ensure that these conversions do not escape.

dsnet · 2022-07-14T20:44:01Z

I disagree with the notion that string and bytes differ based on their handling of UTF-8. As things stand today, bytes already has some understanding of UTF-8 in some functions, but not in others. Thus, whether UTF-8 is considered is not package-specific, but rather function-specific (and appropriately documented in each case). The distinction between the two packages is just that bytes deals with []byte and strings deals with string.

gazerro · 2022-07-14T21:47:07Z

Another point of view is to see how CommonPrefix would be used in the examples in the standard library.

In the following examples, at least one of the two strings can only contain ASCII characters. So the returned slice always contains only ASCII characters, no matter if CommonPrefix works on arbitrary bytes or UTF-8:

In the following example, CommonPrefix must work on arbitrary bytes in order to be used:

https://github.com/golang/go/blob/master/src/strings/replace.go#L170

In the following example, CommonPrefix is preferable to work on arbitrary bytes, but is not necessary:

https://github.com/golang/tools/blob/master/go/internal/gcimporter/bexport.go#L264

For the following, however, CommonPrefix cannot be used because they need a particular function:

rsc · 2022-07-27T17:30:05Z

There's clearly no consensus here, and we've made it 10+ years without this function. It sounds like we should probably decline the proposal.

icholy · 2022-07-27T18:05:23Z

I think this should be re-opened as an addition to exp/slices.

rsc · 2022-08-03T18:01:48Z

Based on the discussion above, this proposal seems like a likely decline.
— rsc for the proposal review group

rsc · 2022-08-12T03:54:48Z

No change in consensus, so declined.
— rsc for the proposal review group

valyala · 2023-03-18T09:47:30Z

It looks like the UTF-8 vs bytes discussion missed the main point for adding the CommonPrefixLen() function to bytes package - performance for text processing and compression algorithms. It is trivial to write the function in Go, but it isn't trivial to optimize it for performance, especially on multiple GOARCHs. So it would be great to have the optimized function in standard library in the same way as bytes.Equal() is optimized now.

As for the utf-8 vs bytes semantics, it is better not implementing this function in strings package because the performance isn't usually critical for utf-8 processing. So it is better to write a trivial function every time the common prefix for utf-8-encoded strings is needed.

adonovan added the Proposal label May 27, 2022

gopherbot added this to the Proposal milestone May 27, 2022

ianlancetaylor added this to Incoming in Proposals (old) May 28, 2022

rsc moved this from Incoming to Active in Proposals (old) Jun 1, 2022

AlexanderYastrebov mentioned this issue Jun 14, 2022

Issue when using Kana characters with common prefixes in the path zalando/skipper#2016

Closed

golang deleted a comment from ZekeLu Jun 16, 2022

This comment was marked as resolved.

Sign in to view

rsc moved this from Active to Likely Decline in Proposals (old) Aug 3, 2022

rsc added the Proposal-FinalCommentPeriod label Aug 3, 2022

rsc removed the Proposal-FinalCommentPeriod label Aug 12, 2022

rsc closed this as completed Aug 12, 2022

icholy mentioned this issue Dec 3, 2022

proposal: bytes, strings: add Truncate #56885

Closed

seankhliao mentioned this issue Mar 17, 2023

proposal: bytes: add CommonPrefixLen function #59086

Closed

golang locked and limited conversation to collaborators Mar 17, 2024

gopherbot added the FrozenDueToAge label Mar 17, 2024

proposal: strings: add func CommonPrefixLen(x, y string) int #53120

proposal: strings: add func CommonPrefixLen(x, y string) int #53120

Comments

adonovan commented May 27, 2022 • edited

robpike commented May 28, 2022 • edited

SeanTolstoyevski commented May 28, 2022

rhysh commented May 29, 2022

robpike commented May 29, 2022

adonovan commented May 31, 2022

rsc commented Jun 1, 2022

icholy commented Jun 1, 2022

rsc commented Jun 8, 2022

robpike commented Jun 9, 2022

rsc commented Jun 15, 2022

robpike commented Jun 15, 2022 • edited

rsc commented Jun 22, 2022

icholy commented Jun 22, 2022

seankhliao commented Jun 22, 2022

robpike commented Jun 22, 2022

adonovan commented Jun 22, 2022 • edited

robpike commented Jun 23, 2022 • edited

dsnet commented Jun 23, 2022

gopherbot commented Jun 23, 2022

dsnet commented Jun 23, 2022 • edited

mibk commented Jun 23, 2022

adonovan commented Jun 23, 2022 • edited

rsc commented Jun 29, 2022

josharian commented Jun 29, 2022

josharian commented Jun 29, 2022

dsnet commented Jun 29, 2022 • edited

rsc commented Jul 13, 2022

This comment was marked as resolved.

ianlancetaylor commented Jul 13, 2022

robpike commented Jul 13, 2022

gazerro commented Jul 14, 2022 • edited

adonovan commented Jul 14, 2022 • edited

dsnet commented Jul 14, 2022 • edited

gazerro commented Jul 14, 2022

rsc commented Jul 27, 2022

icholy commented Jul 27, 2022

rsc commented Aug 3, 2022

rsc commented Aug 12, 2022

valyala commented Mar 18, 2023

adonovan commented May 27, 2022 •

edited

robpike commented May 28, 2022 •

edited

robpike commented Jun 15, 2022 •

edited

adonovan commented Jun 22, 2022 •

edited

robpike commented Jun 23, 2022 •

edited

dsnet commented Jun 23, 2022 •

edited

adonovan commented Jun 23, 2022 •

edited

dsnet commented Jun 29, 2022 •

edited

gazerro commented Jul 14, 2022 •

edited

adonovan commented Jul 14, 2022 •

edited

dsnet commented Jul 14, 2022 •

edited