Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: strings: add func CommonPrefixLen(x, y string) int #53120

Closed
adonovan opened this issue May 27, 2022 · 39 comments
Closed

proposal: strings: add func CommonPrefixLen(x, y string) int #53120

adonovan opened this issue May 27, 2022 · 39 comments

Comments

@adonovan
Copy link
Member

adonovan commented May 27, 2022

A common recurring operation on strings is to remove or advance past the common prefix of two strings. Here are some examples from the main Go repository and the golang.org/x/tools repo:

https://github.com/golang/tools/blob/master/go/internal/gcimporter/bexport.go#L264
https://github.com/golang/tools/blob/master/internal/lsp/source/comment.go#L229
https://github.com/golang/go/blob/master/src/cmd/compile/internal/base/timings.go#L122
https://github.com/golang/go/blob/master/src/cmd/compile/internal/ir/dump.go#L129
https://github.com/golang/go/blob/master/src/strings/replace.go#L170
https://github.com/golang/go/blob/master/src/net/addrselect.go#L199
https://github.com/golang/go/blob/master/src/go/printer/printer.go#L485
https://github.com/golang/go/blob/master/src/go/doc/comment/parse.go#L397

This function is easy enough to implement by hand:

commonLen :=  min(len(a), len(b))
for i = 0; i < commonLen && a[i] == b[i]; i++ {}

but it's possible to make it significantly faster than the naive version. This CL contains a version that runs 6x faster yet is still safe and portable. The code describes ways that it might be made much faster still at the cost of unsafe or assembly:
https://go-review.googlesource.com/c/go/+/408116

While benchmarking and optimizing a diff algorithm this week, I noticed that one of the main profiling hotspots was, in essence, a CommonPrefixLen operation:
https://go-review.googlesource.com/c/tools/+/396855/11/internal/lsp/diff/lcs/old.go#219
(It's a little hard to see in this form, but in a private branch I refactored the code so that the dynamic dispatch of eq(x, y byte) was moved outside the loop. In essence the primitive operation over which this algorithm is abstracted is CommonPrefixLen.)

I propose that we add func CommonPrefixLen(x, y string) int to the standard strings package. An analogous function should be added to the bytes package too.

We could also add a CommonPrefix(x, y string) string function, which actually strips off the prefix rather than merely returning its length, but I don't think this is necessary as it is trivial to implement. Also, whereas CommonPrefixLen makes sense for bytes, it's less clear what bytes.CommonPrefix should do, since its result must alias one of the arguments. The first one presumably? Better to avoid the question.

@gopherbot gopherbot added this to the Proposal milestone May 27, 2022
@ianlancetaylor ianlancetaylor added this to Incoming in Proposals (old) May 28, 2022
@robpike
Copy link
Contributor

robpike commented May 28, 2022

CommonPrefixLen is harder to use and less intuitive. I hear your argument against CommonPrefix but there is an easy rebuttal, as suggested by yourself and by this proposed doc comment.

CommonPrefix returns the longest prefix shared by the two arguments. The return value is sliced from the first argument.

If you don't like that way around, flip the arguments.

@SeanTolstoyevski
Copy link

Both functions will be useful for Golang developers.
If this is accepted, I suggest both added to the standard library.

@rhysh
Copy link
Contributor

rhysh commented May 29, 2022

I'd expect a function in the bytes package would operate on bytes. But several functions in the strings package operate specifically on runes. Would CommonPrefix{,Len} split UTF-8 runes?

https://go.dev/play/p/DfUyiNfhvmI

"¼" ¼ 0xc2bc [U+00BC]
"½" ½ 0xc2bd [U+00BD]
"\xc2" � 0xc2 [U+FFFD]

@robpike
Copy link
Contributor

robpike commented May 29, 2022

@rhysh Very good question.

@adonovan
Copy link
Member Author

[rob] CommonPrefixLen is harder to use and less intuitive.

In the linked examples, the specific callers want the Len variant in a slim majority of cases. I'd be ok with just CommonPrefix though.

[rhys] But several functions in the strings package operate specifically on runes. Would CommonPrefix{,Len} split UTF-8 runes?

Yes. Though this would make it the first function in the strings package that, when given UTF-8 arguments, does not always return a UTF-8 result.

@rsc
Copy link
Contributor

rsc commented Jun 1, 2022

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc rsc moved this from Incoming to Active in Proposals (old) Jun 1, 2022
@icholy
Copy link

icholy commented Jun 1, 2022

Appears that there are 5 possible variations here:

  1. strings.CommonPrefixLen operates on bytes, returns # of bytes
  2. strings.CommonPrefixLen operates on runes, returns # of bytes
  3. strings.CommonPrefixLen operates on runes, returns # of runes (probably useless)
  4. strings.CommonPrefix operates on bytes, returns slice of first arg.
  5. strings.CommonPrefix operates on runes, returns slice of first arg.

Yes. Though this would make it the first function in the strings package that, when given UTF-8 arguments, does not always return a UTF-8 result.

Would the UTF-8 aware versions of these functions be compatible with the examples you presented?

@rsc
Copy link
Contributor

rsc commented Jun 8, 2022

If we add to strings, we should add to bytes, which says 'Package bytes implements functions for the manipulation of byte slices. It is analogous to the facilities of the strings package.' I also expect that strings.CommonPrefixLen(string(x), string(y)) == bytes.CommonPrefixLen(x, y). I should never be surprised by changing code that works on strings to work on []byte by s/strings./bytes./g.

Of course, once you get to combining characters, even UTF-8 boundaries are not always right. So maybe it should be bytes always.

Looking at the examples, it does seem that many of them want the length anyway, so returning the length and avoiding the "which one gets sliced" might be best.

That gets us to "bytes only - no UTF-8 worries - and returning the length".

func CommonPrefixLen(x, y string) int {
    n := 0
    for n < len(x) && n < len(y) && x[n] == y[n] {
        n++
    }
    return n
}

And same for []byte. What do people think?

@robpike
Copy link
Contributor

robpike commented Jun 9, 2022

I think it's more helpful to slice the result than return its length, only to slice it again, so maybe it should go in anyway. It just seems more natural as an API. If you want both, though, that's fine. And those examples don't feel representative to me, but then I don't write the kind of code others do.

Which will you slice? As I said, do and document the first argument.

@rsc
Copy link
Contributor

rsc commented Jun 15, 2022

The arguments in favor of CommonPrefixLen are that
(1) it avoids the "which is sliced" question, and
(2) in the examples we looked at, the code was already using the length more often than the slice itself.

If the examples we looked at are not representative, does anyone want to contribute more examples of existing code that would use CommonPrefix or CommonPrefixLen?

Another potential reason for CommonPrefixLen is that this length might be in the middle of a UTF-8 boundary or a grapheme cluster or any of those things. Making the user do the slice might make them think a bit about whether it's a sensible place to slice.

@robpike
Copy link
Contributor

robpike commented Jun 15, 2022

@rsc Your last paragraph doesn't seem compelling. In writing

prefix := s[:strings.CommonPrefixLen(s, t)]

at what moment do I ask where I'm slicing that I wouldn't ask with

prefix := strings.CommonPrefix(s, t)

I'm not arguing against CommonPrefixLen, I'm just arguing that it feels sound to me to offer CommonPrefix as well. And for the third time, the question of which to slice is trivial to answer and understand. It's not an issue at all.

@golang golang deleted a comment from ZekeLu Jun 16, 2022
@rsc
Copy link
Contributor

rsc commented Jun 22, 2022

@robpike, do you agree with the semantics that CommonPrefixLen("α", "β") = 1, and so CommonPrefix("α", "β") = "\xce" ?
I don't mind returning "\xce", it just seems more error-prone than the 1.

@icholy
Copy link

icholy commented Jun 22, 2022

If we add to strings, we should add to bytes, which says 'Package bytes implements functions for the manipulation of byte slices. It is analogous to the facilities of the strings package.' I also expect that strings.CommonPrefixLen(string(x), string(y)) == bytes.CommonPrefixLen(x, y). I should never be surprised by changing code that works on strings to work on []byte by s/strings./bytes./g.

There's value in uniformity between strings & bytes, but having CommonPrefix("βα", "ββ") return "β\xce" feels like a footgun.

@seankhliao
Copy link
Member

Currently where the strings package has to split user provided strings, it does so using runes/Unicode code points: ContainsAny, IndexAny, LastIndexAny, Map, Trim, TrimLeft, TrimRight.
The exceptions are functions that have a Byte suffix.
I think whatever is added to strings should keep operating on runes/codepoints unless otherwise explicitly labeled.

@robpike
Copy link
Contributor

robpike commented Jun 22, 2022

@robpike, do you agree with the semantics that CommonPrefixLen("α", "β") = 1, and so CommonPrefix("α", "β") = "\xce" ? I don't mind returning "\xce", it just seems more error-prone than the 1.

Are we talking byte slices or strings? For strings, I would expect the answer to be 0. For byte slices, I think the jury is still out.

@adonovan
Copy link
Member Author

adonovan commented Jun 22, 2022

For byte slices, the correct answer is (I assert) obviously 1, but for strings things are less clear: consistency across the two packages would demand that the answer also be 1, even though that means slicing a rune in half. I can live with that (given a suitable warning in the doc comment), but if there is no clear consensus that this is the right behavior, then that's (unfortunately) an argument against adding the function at all.

@robpike
Copy link
Contributor

robpike commented Jun 23, 2022

@adonovan I see it exactly the opposite, that the value for strings is rune-based, and the value for bytes unclear, with consistency being a consideration in the decision. The strings package shouldn't slice a rune in half.

@dsnet
Copy link
Member

dsnet commented Jun 23, 2022

Ignoring memory aliasing issues, I expect it to be safe to swap all usages of the bytes package with the strings package and vice-versa with no change in semantics. Are there existing cases where this property does not hold?

@gopherbot
Copy link

Change https://go.dev/cl/413716 mentions this issue: bytes: add fuzz test to ensure compatibility with strings

@dsnet
Copy link
Member

dsnet commented Jun 23, 2022

Are there existing cases where this property does not hold?

I wrote a fuzz test (https://go.dev/cl/413716) verifying whether bytes and strings are identical for all common functions, and it seems (after a few hours of fuzzing) the answer is yes (except for a minor bug #53511).

As such, I would expect the behavior of bytes.CommonPrefix and strings.CommonPrefix to be identical (whether it be byte-based or rune-based).

@mibk
Copy link
Contributor

mibk commented Jun 23, 2022

I see two possible solutions:

  • Adding two different functions: PrefixRunes and PrefixBytes (to both strings and bytes packages)
  • Moving the function to package unicode/utf8 so it's obvious it operates on runes.

@adonovan
Copy link
Member Author

adonovan commented Jun 23, 2022

Hmm, I think I have been laboring under the misapprehension that the bytes package has a different viewpoint of the contents of byte strings than the strings package does about the content of strings (arbitrary binary vs UTF-8 encoded text). But I see that is not the case, and Rob's argument that both versions should respect rune boundaries now makes a lot more sense. All of the specific cases I encountered would work with a rune-boundary respecting implementation. Therefore I propose this new API:

package strings

// CommonPrefixLen returns the length of the common prefix of two strings containing UTF-8 text.
// The resulting value is the index of a rune boundary.
func CommonPrefixLen(x, y string) int

package bytes

s/string/[]byte/
s/strings/byte slices/

It is worth stating explicitly that the function compares encodings, not decodings? That is, two different encodings of U+FFFD cannot be part of the common prefix?

[Russ raises a good point of objection below.]

@rsc
Copy link
Contributor

rsc commented Jun 29, 2022

Suppose we do the code point semantics. Then

CommonPrefix("🧑‍🔧", "👨‍🏭") = "🧑‍\u200d" (Man, ZWJ, Wrench and Man, ZWJ, Factory produces Man, ZWJ)

There's going to be weird boundary cases no matter what.

It would also be very surprising for strings and bytes to differ here. I don't think there are any functions they have in common today that change semantics based on whether your UTF-8 is in []byte or string.

@josharian
Copy link
Contributor

There's also an option to add this to the generic slices package; it has well-defined semantics for any slice with comparable elements. That makes the semantics clear--indexing yields bytes, so it is byte-by-byte. The generic code could do a type assertion and dispatch to the faster code for specific types.

@josharian
Copy link
Contributor

To solve the "but it slices runes / grapheme clusters in half" problem, we could have a function that, given a string/byte slice, returns the prefix that end at a rune/cluster boundary. Then if you care about runes, you first find the shared prefix and then take the valid prefix of that. (Given that std is currently completely agnostic to grapheme clusters, it'd probably be UTF8 rune prefix, not valid cluster prefix. It'd be nice to have good grapheme cluster support in std, but that's an entirely different proposal.) utf8.DecodeLastRune[InString] is close to this but requires a little bit of work to hook up correctly.

That's a generally useful API, and it composes well. By operating on a single string/byte slice, it doesn't raise any questions about which input it aliases.

@dsnet
Copy link
Member

dsnet commented Jun 29, 2022

There's also an option to add this to the generic slices package

One of the major advantages of having it in strings and bytes is that we can optimize for the fact that this is a just a byte sequence of some kind so that we can utilize multi-byte comparisons. In @adonovan original post, he reported a 6x improvement over a naive element-to-element comparison.

That said, I guess we can have a switch statement in the generic implementation of slices.CommonPrefix to optimize for strings and []byte.

@rsc
Copy link
Contributor

rsc commented Jul 13, 2022

It seems like we are getting close to bike-shedding this. To try to break the logjam, I suggest we add:

package strings

// CommonPrefix returns the longest prefix p of x such that p == y[:len(p)].
// It considers the inputs as sequences of bytes, without any consideration of UTF-8.
func CommonPrefix(x, y string) string

package bytes

// CommonPrefix returns the longest prefix p of x such that p == y[:len(p)].
// It considers the inputs as sequences of bytes, without any consideration of UTF-8.
func CommonPrefix(x, y []byte) []byte

Does anyone object to this?

@icholy

This comment was marked as resolved.

@ianlancetaylor
Copy link
Contributor

@icholy The comment above says that the CommonPrefix functions return a slice of x, which is the first argument. So I think what you say is already the case. If not, could you show some code to explain what you mean? Thanks.

@robpike
Copy link
Contributor

robpike commented Jul 13, 2022

I think it's a serious mistake to have CommonPrefix, at least for strings, slice into a UTF-8 encoding. Very serious. I would never use it if it did, and would recommend that others avoid it as well. Not just out of safety, although that's paramount, I can't think why I'd want it.

@gazerro
Copy link
Contributor

gazerro commented Jul 14, 2022

In the following example I expect it to print true or nothing.

if strings.HasPrefix(x, p) {
     print(strings.CommonPrefix(x, p) == p)
}

It can print false if CommonPrefix considers the strings to be UTF-8 encoded.

@adonovan
Copy link
Member Author

adonovan commented Jul 14, 2022

@gazerro Another interesting example, and I agree that's a desirable consistency property.

It appears there is no good solution to all these constraints. Here's an alternative approach:

We add a bytes.CommonPrefixLen function that works on arbitrary bytes without regard to UTF-8. But we don't add the analogous function to the strings package, and we document the reason why. Users that want the efficient implementation on strings interpreted as raw bytes can call bytes.CommonPrefixLen([]byte(str1), []byte(str2)) and the compiler will ensure that these conversions do not escape.

@dsnet
Copy link
Member

dsnet commented Jul 14, 2022

I disagree with the notion that string and bytes differ based on their handling of UTF-8. As things stand today, bytes already has some understanding of UTF-8 in some functions, but not in others. Thus, whether UTF-8 is considered is not package-specific, but rather function-specific (and appropriately documented in each case). The distinction between the two packages is just that bytes deals with []byte and strings deals with string.

@gazerro
Copy link
Contributor

gazerro commented Jul 14, 2022

Another point of view is to see how CommonPrefix would be used in the examples in the standard library.

In the following examples, at least one of the two strings can only contain ASCII characters. So the returned slice always contains only ASCII characters, no matter if CommonPrefix works on arbitrary bytes or UTF-8:

In the following example, CommonPrefix must work on arbitrary bytes in order to be used:

In the following example, CommonPrefix is preferable to work on arbitrary bytes, but is not necessary:

For the following, however, CommonPrefix cannot be used because they need a particular function:

@rsc
Copy link
Contributor

rsc commented Jul 27, 2022

There's clearly no consensus here, and we've made it 10+ years without this function. It sounds like we should probably decline the proposal.

@icholy
Copy link

icholy commented Jul 27, 2022

I think this should be re-opened as an addition to exp/slices.

@rsc
Copy link
Contributor

rsc commented Aug 3, 2022

Based on the discussion above, this proposal seems like a likely decline.
— rsc for the proposal review group

@rsc rsc moved this from Active to Likely Decline in Proposals (old) Aug 3, 2022
@rsc
Copy link
Contributor

rsc commented Aug 12, 2022

No change in consensus, so declined.
— rsc for the proposal review group

@valyala
Copy link
Contributor

valyala commented Mar 18, 2023

It looks like the UTF-8 vs bytes discussion missed the main point for adding the CommonPrefixLen() function to bytes package - performance for text processing and compression algorithms. It is trivial to write the function in Go, but it isn't trivial to optimize it for performance, especially on multiple GOARCHs. So it would be great to have the optimized function in standard library in the same way as bytes.Equal() is optimized now.

As for the utf-8 vs bytes semantics, it is better not implementing this function in strings package because the performance isn't usually critical for utf-8 processing. So it is better to write a trivial function every time the common prefix for utf-8-encoded strings is needed.

@golang golang locked and limited conversation to collaborators Mar 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
No open projects
Proposals (old)
Likely Decline
Development

No branches or pull requests