Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: strings: add Cut #40135

Closed
ainar-g opened this issue Jul 9, 2020 · 43 comments
Closed

proposal: strings: add Cut #40135

ainar-g opened this issue Jul 9, 2020 · 43 comments

Comments

@ainar-g
Copy link
Contributor

ainar-g commented Jul 9, 2020

In many projects I've seen code like this:

idx = strings.Index(username, "@")
if idx != -1 {
  name = username[:idx]
} else {
  name = username
}
idx = strings.LastIndex(address, "@")
if idx != -1 {
  host = address[idx+1:]
} else {
  host = address
}

I think this operation—getting a prefix or suffix based on a substring—is common enough to think about adding helpers for them. So I propose to add functions PrefixUntil and SuffixAfter (names up for debate) to package strings. Something like:

// PrefixUntil returns the prefix of the string s until the first
// occurrence of the substring substr, excluding the substring itself.
// It returns s if substr was not found.
func PrefixUntil(s, substr string) (prefix string) {
	var idx = Index(s, substr)
	if idx == -1 {
		return s
	}

	return s[:idx]
}

// SuffixAfter returns the suffix of the string s after the last
// occurrence of the substring substr, excluding the substring itself.
// It returns s if substr was not found.
func SuffixAfter(s, substr string) (suffix string) {
	var idx = LastIndex(s, substr)
	if idx == -1 {
		return s
	}

	return s[idx+len(substr):]
}

So that the code could be rewritten as:

name = strings.PrefixUntil(username, "@")
host = strings.SuffixAfter(address, "@")
@gopherbot gopherbot added this to the Proposal milestone Jul 9, 2020
@ianlancetaylor ianlancetaylor added this to Incoming in Proposals (old) Jul 9, 2020
@ianlancetaylor
Copy link
Contributor

It would help if you could identify some cases in the standard library, and/or in other popular Go packages, where the new functions would be used. That is, give some supporting information for "in many projects I've seen...." Thanks.

@ainar-g
Copy link
Contributor Author

ainar-g commented Jul 9, 2020

@ianlancetaylor A simple grep query shows lots, actually. Here's one from package encoding/xml:

if i := strings.LastIndex(prefix, "/"); i >= 0 {
	prefix = prefix[i+1:]
}

Here's one from path:

if i := strings.LastIndex(path, "/"); i >= 0 {
	path = path[i+1:]
}

net/url:

i := strings.LastIndex(authority, "@")
if i < 0 {
	host, err = parseHost(authority)
} else {
	host, err = parseHost(authority[i+1:])
}

net/http (IndexByte, but the meaning is the same):

if i := strings.IndexByte(resp.Status, ' '); i != -1 {
	statusCode = resp.Status[:i]
}

go/types:

if i := strings.LastIndex(name, "/"); i >= 0 {
	name = name[i+1:]
}

test/fixedbugs/issue32778.go, a copy of the code from the original issue, which in turn, I assume, is a version of the code the reporter of that issue saw:

if i := strings.LastIndexByte(string(n), '.'); i >= 0 {
	return Name(n[i+1:])
}
return Name(n)

Lots of that in test/run.go.

I discovered these using:

$ grep -A 5 -e 'strings.\(Last\)\?Index(' -r

And then sifting the results manually.

@networkimprov
Copy link

This looks like a variation on strings.SplitN(str, "@", 2). So maybe SplitFirst & SplitLast?

@ainar-g
Copy link
Contributor Author

ainar-g commented Jul 9, 2020

@networkimprov Those functions don't really split (turn one into many) strings. And the proposed logic is different from that of the Split.* functions. Perhaps FirstPrefix and LastSuffix.

@jimmyfrasche
Copy link
Member

@ainar-g I believe these would be equivalent

name = strings.PrefixUntil(username, "@")
name, _ = strings.SplitFirst(username, "@")

@martisch
Copy link
Contributor

martisch commented Jul 10, 2020

If I should rather Split off the below into its own proposal I can but it seems the frequency of both patterns (prefix, suffix and partition) together makes for a potential better outlook to be relevant enough for addition.

Adding more general strings.SplitFirst and strings.SplitLast:

Much code I have optimised in parsing functions involves splitting around an identifier and uses Index and LastIndex.
so a key, value, ok = strings.SplitFirst(kv, separator) (and same for Last) would be nice and would cover the use cases of strings.PrefixUntil and SuffixAfter too.

Why ok as a return? To distinguish the case of separator appearing in the source string or not. Knowing value == "" does not provide that.

I can give Googlers references to internal CLs. Here is one for std lib:
https://go-review.googlesource.com/c/go/+/231738/1/src/encoding/asn1/common.go

Why not strings.Split(N)? these allocate a return slice and there is no strings.SplitLastN.

Alternative: strings.Partition like in Python:

Another take on this could be to align this with str.partition from Python:
https://docs.python.org/3/library/stdtypes.html#str.partition

str.partition(sep)
Split the string at the first occurrence of sep, and return a 3-tuple containing the part before the separator, the separator itself, and the part after the separator. If the separator is not found, return a 3-tuple containing the string itself, followed by two empty strings.

So we could discuss strings.Partition(s, sep string) (first, split, tail) and strings.PartitionLast(s, sep string) (head, split, last string) as an alternative with similar semantics as SplitFirst and SplitLast.

@rogpeppe
Copy link
Contributor

If something like this is accepted (and on balance I'm in favour), I'd prefer at least one variant that has a single return value so that it can be used directly in expressions.

@rsc
Copy link
Contributor

rsc commented Nov 4, 2020

There may be an opportunity here, but it's a pretty subtle thing to get right as a general function.
For example, the example above is a bit difficult if the string has no @ sign:

name = strings.PrefixUntil(username, "@")
host = strings.SuffixAfter(address, "@")

parsing "rsc" will end up with name="rsc", host="rsc", which is not what you want.

The Python partition function is more like what you want, in that it distinguishes "separator is present" from "separator is not". But it's a bit awkward to use too since there are no tuples to index like in Python, and allocating a slice defeats the purpose.

There are many helper functions in strings already. Why is this one important, and - even more importantly - can we handle all the needs with a single function and still be convenient?

@rsc rsc moved this from Incoming to Active in Proposals (old) Nov 4, 2020
@martisch
Copy link
Contributor

martisch commented Nov 4, 2020

The reason why I like a function that does a pair split is that I have optimized many parsing functions where you want to split off one element of a delimited list in a string without allocating and at the same time getting the rest of the string without the delimiter and split of element. Having a std library function do this that doesnt allocate a return slice would help readability:
https://go-review.googlesource.com/c/go/+/231738/1/src/encoding/asn1/common.go

returning elem, separator, tail or elem, tail, ok as 3 return values does not need a slice or allocation.

@rsc
Copy link
Contributor

rsc commented Nov 11, 2020

We certainly write code to do this operation all the time. Even just grepping for 'Index.* >=' in the Go tree I find plenty of code like:

if i := strings.Index(line, "#"); i >= 0 {
	line = line[:i]
}

and

if i := strings.Index(raw, "@"); i >= 0 {
	pattern, rawVers = raw[:i], raw[i+1:]
	if strings.Contains(rawVers, "@") || rawVers == "" {
		return nil, fmt.Errorf("invalid module version syntax %q", raw)
	}
}

If we had a function like:

func Cut(s, sep string) (before, after string, ok bool) {
    if i := strings.Index(s, sep); i >= 0 {
        return s[:i], s[i+len(sep):], true
    }
    return s, "", false
}

then snippets like the above would become:

line, _, _ = strings.Cut(line, "#")

if pattern, rawVers, ok := strings.Cut(raw, "@"); ok {
    // no need for: pattern, rawVers = raw[:i], raw[i+1:]
    ...
}

It does seem like it comes up a lot. What do people think? Should we add this (single) Cut function?

Update: Edited to return (before, after, ok) instead of (before, sep, after), as suggested by @nightlyone.

@rsc rsc changed the title proposal: strings: add PrefixUntil and SuffixAfter proposal: strings: add Cut Nov 11, 2020
@nightlyone
Copy link
Contributor

Since we pass in the separator, one could also return an OK bool instead of the separator.

For this I would suggest

Cut(s, sep string) (before, after string, ok bool) 

@rsc
Copy link
Contributor

rsc commented Nov 11, 2020

@nightlyone indeed, that signature is much better! Thanks.

@pascaldekloe
Copy link
Contributor

Since when does Go favour functions to save a line or two? 😞 Who wants to remember all those names, including the argument order? There are quite a few non-trivial things missing in the strings package. If there's room for growth then this cannot be the priority. 🙏

@martisch
Copy link
Contributor

@pascaldekloe can you list the issue numbers of some open string package proposals that should be prioritized?

@pascaldekloe
Copy link
Contributor

I can think of some right now @martisch.

// NRunes returns the first n runes in s. If s has less that n runes the return is s.
// Illegal byte sequences count as one rune per byte
func NRunes(s string, n int) string
// IndexNth returns the index of the n-th occurence of substr in s.
func IndexNth(s, substr string) int
// SplitSeq returns the bytes in between the seps if and only if they occur in s in that very order.
func SplitSeq(s string, seps ...string) []string

Don't take these examples too literally. I'm making these up just now. The last one might even be quite useful like SplitSeq(kv, ": ", "\r\n"). The point is that you really need something to be generic and widely useful. Millions of developers will have to learn this function, even if they personally don't use it.

@pascaldekloe
Copy link
Contributor

Also Split2 may be more consitent than Cut as a name.

@rsc
Copy link
Contributor

rsc commented Nov 12, 2020

@pascaldekloe, I've never written code that would have used any of those three. On the other hand, I can point to tons of code I've written using Index and manual slicing that would be helped by Cut. Just one data point, but those seem like pretty rare needs that could easily be handled outside the standard library.

@pascaldekloe
Copy link
Contributor

pascaldekloe commented Nov 12, 2020

If the usage metric is important then In(a []string, match string) bool perhaps. MinInt(a, b) int? I'll be quit now. 😄

@mvdan
Copy link
Member

mvdan commented Nov 12, 2020

@pascaldekloe I'd encourage you to file separate proposals if you have a strong case for any new API in particular. I don't think it's helpful to continue in this thread. This proposal should be considered on its own merits and drawbacks alone, not competing for priority against a handful of other potential proposals.

@mvdan
Copy link
Member

mvdan commented Nov 12, 2020

If something like this is accepted (and on balance I'm in favour), I'd prefer at least one variant that has a single return value so that it can be used directly in expressions.

I think the latest design still reasonably supports this:

// can't be a single line, but not too bad
line, _, _ = strings.Cut(line, "#")
use(line)

// with the ok check, plus scoping of the variable
if pattern, _, ok = strings.Cut(raw, "@"); ok {
    use(pattern)
}

I agree with Russ's point in #40135 (comment), though I agree in some cases it would be nice to not have to declare an extra variable.

@pascaldekloe
Copy link
Contributor

My point was *not to add any of these functions @mvdan 🙃

@mvdan
Copy link
Member

mvdan commented Nov 12, 2020

Then, if you point is "we shouldn't add this func because we shouldn't add any of those others either", I disagree. You're assuming that they're all equally important. If you argue that it's impossible to gauge how important they are relative to each other, then we would never add any new APIs because we would be stuck in "what about" and "what if" discussions.

@networkimprov
Copy link

This idea

if a, b, ok := strings.Cut(s, sep); ok { ... }

looks like a special case of

if list := strings.SplitN(s, sep, 2); len(list) == 2 { ... }
`

@mvdan
Copy link
Member

mvdan commented Nov 12, 2020

@networkimprov please see the response that @martisch already provided:

Why not strings.Split(N)? these allocate a return slice and there is no strings.SplitLastN.

@networkimprov
Copy link

networkimprov commented Nov 12, 2020

He asked for SplitFirst & SplitLast, but Cut is only SplitFirst.

Is the overhead of returning a slice vs two strings a problem much of the time?

EDIT: also your tone may imply that I hadn't read the thread.

@mvdan
Copy link
Member

mvdan commented Nov 12, 2020

An allocation is generally significantly more expensive than returning two extra values; allocations create GC work too.

EDIT: also your tone may imply that I hadn't read the thread.

I am simply pointing to previous replies to prevent the thread from going in circles. I don't imply anything else by it.

@networkimprov
Copy link

Any possibility of optimizing creation of very short non-escaping slices, e.g. represent it as an array internally?

@rsc
Copy link
Contributor

rsc commented Nov 12, 2020

@networkimprov, Yes, avoiding the allocation keeps people from using strings.Split or SplitN here.
And while we could possibly fix the allocation in SplitN for small constant n, from an API point of view it is still far nicer to write:

if pattern, rawVers, ok := strings.Cut(raw, "@"); ok {
    // no need for: pattern, rawVers = raw[:i], raw[i+1:]
    ...
}

than:

if x := strings.SplitN(raw, "@", 2); len(x) == 2 {
    pattern, rawVers := x[0], x[1]
    ...
}

@rsc
Copy link
Contributor

rsc commented Nov 12, 2020

This thread is starting to escalate, and it feels like the kind of trivial change that leads to classic bikeshed arguments.

Before you post, please think about:
(1) whether the point you are about to make is really important for evaluating this simple API, and
(2) whether someone has already made the point - in that case, just thumbs up the existing comment.

Thanks!

@networkimprov
Copy link

I'd disagree that "SplitN" isn't nice enough. I name a short, constant slice to indicate both items:
pattern_rawver := strings.SplitN(s, sep, 2)

I'd instead suggest adding "SplitLastN" and optimizing slice creation. Both have wider applicability, and the latter improves existing programs.

@hherman1
Copy link

Cut has quite a nice API, and a nice name. But I'm sympathetic to the concern about adding too many new functions to the package. I buy the argument that Cut is a worthwhile addition, but I would like to understand better where we draw the line, and would be curious how others in this thread think about that. After all, adding new APIs does add complexity.

@martisch
Copy link
Contributor

martisch commented Nov 13, 2020

@hherman1 I think Cut can be evaluated on its own without knowing exactly where the general threshold is for inclusion. There are many dimensions and the existing Go ecosystem and how developers write Go code to consider. If there is a general discussion needed of inclusion guidelines then I would suggest to open a new separate github issue about that to keep the discussion here focused on the specific proposal and use cases.

@martisch
Copy link
Contributor

martisch commented Nov 13, 2020

I'd instead suggest adding "SplitLastN" and optimizing slice creation. Both have wider applicability, and the latter improves existing programs.

Performance:

Returning a variable sized value (slice backing array) on the stack requires upstack allocation and variable stack allocation (alloca). Both add considerable runtime complexity and can also affect performance of code not using those features as all Go code now has to deal with the possibility of the callee allocating a variable sized slice on the stack.

While doing performance optimizations in dozens of different projects my experience strings.Split/SplitN has been often the one where the return slice allocation was the easiest to replace unfortunately with reduced readability of the new code. (unless writing Cut outside of std lib into its own library). Having a replacement in the std library would be nice to improve performance and readability and could likely also lead to Go programers using the more performant function from the start.

As an anecdote to understand the potential magnitude of the performance difference of avoiding strings.Split and allocation. I have made at least one replacement CL for strings.Split in a large production code base that saved more resources than any other string function in the string package uses in absolute resources in the production environment. One reason maybe that strings.Split is often used in parsing strings (e.g. key value pairs) and parsing strings happens a lot without writing full custom hand optimized parsers. strings.Split is one of the common string functions that allocates and allocation is more compute and lock heavy vs the more lightweight computations that are usually needed for strings (Index, Trim, HasPrefix, ...).

Readability:

To me also the utility of strings.Cut to strings.SplitN(..., ..., 2) is a bit like strings.Contains to strings.Index. One could have said strings.Contains is easy and can be written outside of std library. But its so common and readable that it warrants inclusion to be there from the start. Now strings.Cut is less common and has a more complex signature to start but the replacement code with strings.SplitN(..., ..., 2) is also more complex.

Side note for function type signature

I had proposed key, value, ok = strings.SplitFirst(kv, separator) in my first comment on the issue as alternative to introducing PrefixUntil and this matches the type signature of the current thumbs up favorited version with function name Cut later proposed by @nightlyone. As far as I remember this would have fitted all use cases I had for the Cut function well which was why it immediately came to my mind as alternative for the original proposal.

@networkimprov
Copy link

networkimprov commented Nov 13, 2020

Returning a variable sized value (slice backing array) on the stack requires upstack allocation and variable stack allocation (alloca).

I'm not v familiar with the runtime, but you'd only create a slice on the stack when it's small, can't escape, has a cap known at compile time, and can't be assigned anything of len > cap.

I'm not opposed to SplitFirst & SplitLast; I'd use them. I don't think they should be added instead of optimization.

@martisch
Copy link
Contributor

martisch commented Nov 13, 2020

I'm not v familiar with the runtime, but you'd only create a slice on the stack when it's small, can't escape, has a cap known at compile time, and can't be assigned anything of len > cap.

If that is the constraints of the optimization that is supposed to optimize the allocation in Split and SplitN it wont apply since the returned slice is variable length (Update: also capacity, inside SplitN: N ist not const) and it escapes those functions. Leaving what happens after inlining aside for a moment IF its inlined which is not always the case.

@networkimprov
Copy link

Creating it on the stack to return it means it can't escape the caller.

cap(s) must be known at compile time (i.e. SplitN with const N), but len(s) can vary.

@mvdan
Copy link
Member

mvdan commented Nov 13, 2020

@networkimprov I think you've made your points clear - you don't think Cut helps readability, and you think some SplitN calls could avoid allocations. For that last point, you should file an issue with the details on how exactly that would work with the compiler. I don't think it's a good idea to continue that discussion here, like Russ mentioned in his last comment.

@networkimprov
Copy link

Sorry to belabor the point. @martisch claimed it wasn't feasible, so I tried to explain why it might be.

Someone more familiar with compiler and runtime should file that issue. And maybe this could be placed on hold until that's decided.

@rsc
Copy link
Contributor

rsc commented Nov 18, 2020

There's far too much discussion about this for right now. Putting on hold. Sorry for such an attractive bikeshed.

@rsc rsc moved this from Active to Hold in Proposals (old) Nov 18, 2020
@networkimprov
Copy link

cc @josharian @mdempsky re filing an issue to consider optimizing small, constant-cap slices; discussion began at #40135 (comment)

@as
Copy link
Contributor

as commented Nov 18, 2020

Is there a reason to even consider the second example as a justification for this API?

Can't this snippet

idx = strings.LastIndex(address, "@")
if idx != -1 {
  host = address[idx+1:]
} else {
  host = address
}

Simply be rewritten as:

host := address[1+strings.LastIndex(address, "@"):]

@pymq
Copy link

pymq commented Jan 30, 2022

Should this issue be closed since #46336 is accepted and implemented?

@odeke-em
Copy link
Member

Should this issue be closed since #46336 is accepted and implemented?

Done! Thank you @pymq for raising it.

@golang golang locked and limited conversation to collaborators Jan 30, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
No open projects
Development

No branches or pull requests