proposal: strings: SplitAny and CountAny #72847

xformerfhs · 2025-03-13T16:32:40Z

Proposal Details

The strings package contains the function Split, which splits a string wherever the separator string occurs. Only one separator string can be specified.

There are use cases where one wants to split on any of a collection of characters. Often FieldsFunc is recommended for this. However, FieldsFunc has a bug in that it skips leading and trailing separators. This behaviour cannot be fixed, only documented.

In order to make it possible to split a string on any of several characters there should be functions analogous to IndexAny, namely SplitAny and CountAny.

SplitAny would have the signature func SplitAny(s, chars string) []string, while CountAny would be func CountAny(s, chars string) int.

SplitAny splits the string on any character in chars, while CountAny returns how many times any of the characters in chars appears in the supplied string.

There could be another function SplitAnyN with the signature func SplitAnyN(s, chars string, n int) []string that limits the splitting to a maximum of n strings.

I attach file split_any.zip where SplitAny and CountAny have been implemented as an example.

ianlancetaylor · 2025-03-13T16:44:06Z

You mention FieldFunc in the description but I assume you mean FieldsFunc (with an "s").

ianlancetaylor · 2025-03-13T16:46:47Z

Can you add a comment with an example or two showing the exact behavior difference? Thanks.

xformerfhs · 2025-03-13T17:01:44Z

Yes, you are right. I meant FieldsFunc. Sorry for the typo.

Here are some examples:

   source := ":something,to:split-"

   parts := strings.Split(source, ":")
   // parts is [ "" "something,to" "split-" ]

   separators := ":,-.;"
   parts = strings.SplitAny(source, separators)
   // parts is [ "" "something" "to" "split" "" ]

   count := strings.CountAny(source, separators)
   // count is 4 (for the 4 found characters ':', ',', ':' and '-')

   separators = "o,t.;"
   parts = strings.SplitAny(source, separators)
   // parts is [ ":s" "me" "hing" "" "" ":spli" "-" ]

   parts = strings.SplitAnyN(source, separators, 2)
   // parts is [ ":s" "mething,to:split-" ]

I hope this helps to clarify the proposal. I will be glad to provide any more information that is deemed necessary.

jub0bs · 2025-03-20T21:00:26Z

@xformerfhs I'm wary of adding more Split* functions that return a slice (as opposed to an iterator) in the standard library. In my experience, such functions tend to be misused (e.g. for splitting untrusted data); see https://nvd.nist.gov/vuln/detail/CVE-2025-22868, for instance.

xformerfhs · 2025-03-21T17:14:33Z

Hi, @jub0bs, thanks for your comment.

I see that you have reported a security vulnerability that was caused by using strings.Split without checking, limiting or cleaning what is going to be split. It was fixed by using strings.Count and only splitting after that returns the correct number of fields.
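
The count-before-split pattern described here can be sketched as follows. This is an illustration of the idea, not the code of the actual fix: the function splitToken, the "." separator and the three-part format are assumptions chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// splitToken is a hypothetical helper. It splits only after strings.Count
// has confirmed that the input has exactly three dot-separated parts, so a
// hostile input can never produce a huge slice.
func splitToken(token string) ([]string, error) {
	if strings.Count(token, ".") != 2 {
		return nil, errors.New("token must have exactly three parts")
	}
	return strings.Split(token, "."), nil
}

func main() {
	parts, err := splitToken("header.payload.signature")
	fmt.Printf("%q %v\n", parts, err) // ["header" "payload" "signature"] <nil>

	// One million separators are rejected before any slice is allocated.
	_, err = splitToken(strings.Repeat(".", 1_000_000))
	fmt.Println(err) // token must have exactly three parts
}
```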

I agree that using Split and the like is dangerous when the programmer does not check the string to split. Splitting definitely has security implications.

However, a strings.SplitAny function is missing. There ought to be a way of splitting a string on multiple different characters, not only on one separator.

What are the possible alternatives?

Function	Impact
`SplitAny`	Programmers have to be warned that they ought to check whether the string to split has the correct format, count the fields with `CountAny`, or remove unwanted characters. This should be documented.
`SplitAnyN`	This is much safer. If handled correctly, there is no vulnerability. However, setting `n` to a negative number will effectively turn `SplitAnyN` into `SplitAny`.
`SplitAnySeq`	This is the safest form, but can make the program more cumbersome and less readable.

Consider my use case: the user specifies two encodings, one for the input file and one for the output file, as a flag, e.g. -encodings win1252:utf8. The separator may be : or ,. With SplitAnyN this would look like this:

   ...
   if len(encodingsFlagValue) < minEncodingLen || len(encodingsFlagValue) > maxEncodingLen {
      return errors.New("invalid length of encodings")
   }

   encodings := strings.SplitAnyN(encodingsFlagValue, ":,", 3)
   if len(encodings) > 2 {
      return errors.New("invalid number of encodings")
   }

   var inputEncoding string
   var outputEncoding string

   inputEncoding = encodings[0]

   if len(encodings) == 1 {
      outputEncoding = inputEncoding
   } else {
      outputEncoding = encodings[1]
   }
   ...

This is simple and straightforward.

Now the same with an iterator:

   ...
   var inputEncoding string
   var outputEncoding string
   
   var haveInputEncoding bool
   for encoding := range strings.SplitAnySeq(encodingsFlagValue, ":,") {
      if !haveInputEncoding {
         inputEncoding = encoding
         haveInputEncoding = true
      } else {
         outputEncoding = encoding
         break
      }
   }
   if len(outputEncoding) == 0 {
      outputEncoding = inputEncoding
   }
  ...

This is much less readable and understandable.

So I think SplitAnyN is a sensible way to go, with the warning that one must not use an n that is less than 1 and that one should check for an appropriate length.

Even SplitAny would be a way to go, with the clear warning that it may cause a security vulnerability if the source is not checked and that SplitAnyN and SplitAnySeq are better alternatives.

as · 2025-03-24T20:25:29Z

The alternative is to normalize the separators into one separator and then call the split function.

source := ":something,to:split-"
source = strings.ReplaceAll(source, ":", ",")
source = strings.ReplaceAll(source, "-", ",")
fmt.Printf("%q\n", strings.Split(source, ","))

xformerfhs · 2025-03-24T21:40:20Z

> The alternative is to normalize the separators into one separator and then call the split function.

While this yields the correct result, it has three disadvantages:

1. It allocates memory for two additional strings. Allocations are slow.
2. It copies the string two times, resulting in additional CPU overhead.
3. It is cumbersome, hard to read and does not convey what is meant. Someone reading this would have to figure out why all this replacing takes place. This makes it harder to understand the meaning of the code.

Using strings.SplitAny(source, ":,-") is short, simple and understandable at first glance. No unnecessary memory allocations, no unnecessary copying.

as · 2025-04-01T18:28:47Z

I have only had to write a SplitAny once and chose to use that method because that one case was not performance-critical and surrounded by slower code. In other instances, it was faster to avoid string splitting operations and use IndexAny to parse the tokens.

Your three points are sound, but I am curious how often this function would be used and how many times people have had to create it.
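
As an illustration of parsing with IndexAny instead of splitting, the sketch below cuts a string at the first character from a set, in the style of strings.Cut. The function cutAny is a hypothetical helper, not something from the standard library or from the code referred to above.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// cutAny is a hypothetical helper modelled on strings.Cut. It slices s
// around the first character that occurs in chars and allocates nothing.
func cutAny(s, chars string) (before, after string, found bool) {
	i := strings.IndexAny(s, chars)
	if i < 0 {
		return s, "", false
	}
	// The separator may be a multi-byte rune, so skip its full width.
	_, size := utf8.DecodeRuneInString(s[i:])
	return s[:i], s[i+size:], true
}

func main() {
	input, output, found := cutAny("win1252,utf8", ":,")
	if !found {
		// Only one encoding was given: use it for both sides.
		output = input
	}
	fmt.Println(input, output) // win1252 utf8
}
```

Because the result consists of substrings of the input, no slice is built and the amount of work does not depend on how many separators follow the first one.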

xformerfhs added the Proposal label Mar 13, 2025

gopherbot added this to the Proposal milestone Mar 13, 2025

ianlancetaylor added this to Proposals Mar 13, 2025

ianlancetaylor moved this to Incoming in Proposals Mar 13, 2025

gabyhelp added the LibraryProposal label Mar 13, 2025

seankhliao changed the title ~~proposal: strings: Add functions SplitAny and CountAny~~ proposal: strings: SplitAny and CountAny Mar 14, 2025
