
proposal: strings: SplitAny and CountAny #72847


Open
xformerfhs opened this issue Mar 13, 2025 · 8 comments
Labels
LibraryProposal Issues describing a requested change to the Go standard library or x/ libraries, but not to a tool Proposal
Milestone

Comments

@xformerfhs

xformerfhs commented Mar 13, 2025

Proposal Details

The strings package contains the function Split, which splits a string wherever the separator string occurs. Only one separator string can be specified.

There are use cases where one wants to split on any of a collection of characters. FieldsFunc is often recommended for this. However, FieldsFunc has a bug in that it skips leading and trailing separators. This behaviour cannot be fixed, only documented.

In order to make it possible to split a string on any of several characters there should be functions analogous to IndexAny, namely SplitAny and CountAny.

SplitAny would have the signature func SplitAny(s, chars string) []string, while CountAny would be func CountAny(s, chars string) int.

SplitAny splits the string on any character in chars, while CountAny returns how many times any of the characters in chars appears in the supplied string.

There could be another function SplitAnyN with the signature func SplitAnyN(s, chars string, n int) []string that limits the splitting to a maximum of n strings.

I attach file split_any.zip where SplitAny and CountAny have been implemented as an example.

@gopherbot gopherbot added this to the Proposal milestone Mar 13, 2025
@ianlancetaylor ianlancetaylor moved this to Incoming in Proposals Mar 13, 2025
@ianlancetaylor
Member

You mention FieldFunc in the description but I assume you mean FieldsFunc (with an "s").

@ianlancetaylor
Member

Can you add a comment with an example or two showing the exact behavior difference? Thanks.

@xformerfhs
Author

xformerfhs commented Mar 13, 2025

Yes, you are right. I meant FieldsFunc. Sorry for the typo.

Here are some examples:

   source := ":something,to:split-"

   parts := strings.Split(source, ":")
   // parts is [ "" "something,to" "split-"]

   separators := ":,-.;"
   parts = strings.SplitAny(source, separators)
   // parts is [ "" "something" "to" "split" "" ]

   count := strings.CountAny(source, separators)
   // count is 4 (for the 4 found characters ':', ',', ':' and '-')

   separators = "o,t.;"
   parts = strings.SplitAny(source, separators)
   // parts is [ ":s" "me" "hing" "" "" ":spli" "-" ]

   parts = strings.SplitAnyN(source, separators, 2)
   // parts is [ ":s" "mething,to:split-" ]

I hope this helps to clarify the proposal. I will be glad to provide any more information that is deemed necessary.

@gabyhelp gabyhelp added the LibraryProposal Issues describing a requested change to the Go standard library or x/ libraries, but not to a tool label Mar 13, 2025
@seankhliao seankhliao changed the title proposal: strings: Add functions SplitAny and CountAny proposal: strings: SplitAny and CountAny Mar 14, 2025
@jub0bs

jub0bs commented Mar 20, 2025

@xformerfhs I'm wary of adding more Split* functions that return a slice (as opposed to an iterator) in the standard library. In my experience, such functions tend to be misused (e.g. for splitting untrusted data); see https://nvd.nist.gov/vuln/detail/CVE-2025-22868, for instance.

@xformerfhs
Author

xformerfhs commented Mar 21, 2025

Hi, @jub0bs, thanks for your comment.

I see that you have reported a security vulnerability that was caused by using strings.Split without checking, limiting or cleaning what is going to be split. It was fixed by using strings.Count and only splitting after that returns the correct number of fields.

I agree that using Split and the like is dangerous when the programmer does not check the string to split. Splitting definitely has security implications.

However, a strings.SplitAny function is missing. There ought to be a way of splitting a string on multiple different characters, not only on one separator.

What are the possible alternatives?

| Function | Impact |
| --- | --- |
| SplitAny | Programmers have to be warned that they ought to check whether the string to split has the correct format, count the fields with CountAny, or remove unwanted characters. This should be documented. |
| SplitAnyN | This is much safer. If handled correctly, there is no vulnerability. However, setting n to a negative number will effectively turn SplitAnyN into SplitAny. |
| SplitAnySeq | This is the safest form, but can make the program more cumbersome and less readable. |

I think of my use case: the user specifies two encodings, one for the input file and one for the output file, as a flag such as -encodings win1252:utf8. The separator may be : or ,. When I use SplitAnyN, this would look like this:

   ...
   if len(encodingsFlagValue) < minEncodingLen || len(encodingsFlagValue) > maxEncodingLen {
      return errors.New("invalid length of encodings")
   }

   encodings := strings.SplitAnyN(encodingsFlagValue, ":,", 3)
   if len(encodings) > 2 {
      return errors.New("invalid number of encodings")
   }

   var inputEncoding string
   var outputEncoding string

   inputEncoding = encodings[0]

   if len(encodings) == 1 {
      outputEncoding = inputEncoding
   } else {
      outputEncoding = encodings[1]
   }
   ...

This is simple and straightforward.

Now the same with an iterator:

   ...
   var inputEncoding string
   var outputEncoding string
   
   var haveInputEncoding bool
   for encoding := range strings.SplitAnySeq(encodingsFlagValue, ":,") {
      if !haveInputEncoding {
         inputEncoding = encoding
         haveInputEncoding = true
      } else {
         outputEncoding = encoding
         break
      }
   }
   if len(outputEncoding) == 0 {
      outputEncoding = inputEncoding
   }
  ...

This is much less readable and understandable.

So, I think SplitAnyN is a sensible way to go, with the warning that one must not use an n that is less than 1 and that one must check for an appropriate length.

Even SplitAny would be a way to go, with the clear warning that it may cause a security vulnerability if the source is not checked, and that SplitAnyN and SplitAnySeq are better alternatives.

@as
Contributor

as commented Mar 24, 2025

The alternative is to normalize the separators into one separator and then call the split function.

source := ":something,to:split-"
source = strings.ReplaceAll(source, ":", ",")
source = strings.ReplaceAll(source, "-", ",")
fmt.Printf("%q\n", strings.Split(source, ","))

@xformerfhs
Author

xformerfhs commented Mar 24, 2025

> The alternative is to normalize the separators into one separator and then call the split function.

While this yields the correct result, it has three disadvantages:

  1. It allocates memory for two additional strings. Allocations are slow.
  2. It copies the string two times, resulting in additional CPU overhead.
  3. It is cumbersome, hard to read and does not convey what is meant. Someone reading this would have to figure out why all this replacing takes place. This makes it harder to understand the meaning of the code.

Using strings.SplitAny(source, ":,-") is short, simple and understandable at first glance. No unnecessary memory allocations, no unnecessary copying.

@as
Contributor

as commented Apr 1, 2025

I have only had to write a SplitAny once and chose to use that method because that one case was not performance-critical and surrounded by slower code. In other instances, it was faster to avoid string splitting operations and use IndexAny to parse the tokens.

Your three points are sound, but I am curious how often this function would be used and how many times people have had to create it.

Projects
Status: Incoming
Development

No branches or pull requests

6 participants