proposal: byteseq: add a generic byte string manipulation package #48643

tdakkota · 2021-09-27T09:39:29Z

This proposal is for use with #43651. I propose to define a new package, byteseq, that will provide simple generic
functions to manipulate UTF-8 encoded strings and byte slices.

Goals of this proposal:

Provide a safe generic API without string <-> []byte conversion overhead.
Reduce code duplication between strings and bytes packages.
Enforce immutability in the API using type constraints. The ~string | ~[]byte constraint denotes that functions should
not mutate their arguments.

API description:

// Byteseq represents a generic UTF-8 byte string.
type Byteseq interface {
     ~string | ~[]byte
}

// Compare returns an integer comparing two strings lexicographically. 
// The result will be 0 if a==b, -1 if a < b, and +1 if a > b.
func Compare[A, B Byteseq](a A, b B) int

// Contains reports whether subslice is within b.
func Contains[B, SubSlice Byteseq](b B, subslice SubSlice) bool

// ContainsAny reports whether any of the UTF-8-encoded code points in chars are within b.
func ContainsAny[B, Chars Byteseq](b B, chars Chars) bool

// Count counts the number of non-overlapping instances of sep in s. 
// If sep is empty, Count returns 1 + the number of UTF-8-encoded code points in s.
func Count[S, Sep Byteseq](s S, sep Sep) int

// Equal reports whether a and b are the same length and contain the same bytes
func Equal[A, B Byteseq](a A, b B) bool

// EqualFold reports whether s and t, interpreted as UTF-8 strings, are equal under Unicode case-folding, 
// which is a more general form of case-insensitivity.
func EqualFold[S, T Byteseq](s S, t T) bool

// Fields splits the string s around each instance of one or more consecutive white space
// characters, as defined by unicode.IsSpace, returning a slice of substrings of s or an
// empty slice if s contains only white space.
func Fields[S Byteseq](s S) []S

// FieldsFunc splits the string s at each run of Unicode code points c satisfying f(c)
// and returns an array of slices of s. If all code points in s satisfy f(c) or the
// string is empty, an empty slice is returned.
// 
// FieldsFunc makes no guarantees about the order in which it calls f(c)
// and assumes that f always returns the same value for a given c.
func FieldsFunc[S Byteseq](s S, f func (rune) bool) []S

// HasPrefix tests whether the string s begins with prefix.
func HasPrefix[S, Prefix Byteseq](s S, prefix Prefix) bool

// HasSuffix tests whether the string s ends with suffix.
func HasSuffix[S, Suffix Byteseq](s S, suffix Suffix) bool

// Index returns the index of the first instance of substr in s, or -1 if substr is not present in s.
func Index[S, Substr Byteseq](s S, substr Substr) int

// IndexAny returns the index of the first instance of any Unicode code point
// from chars in s, or -1 if no Unicode code point from chars is present in s.
func IndexAny[S, Chars Byteseq](s S, chars Chars) int

// IndexByte returns the index of the first instance of c in s, or -1 if c is not present in s.
func IndexByte[S Byteseq](s S, c byte) int

// IndexFunc returns the index into s of the first Unicode
// code point satisfying f(c), or -1 if none do.
func IndexFunc[S Byteseq](s S, f func (rune) bool) int

// IndexRune returns the index of the first instance of the Unicode code point
// r, or -1 if rune is not present in s.
// If r is utf8.RuneError, it returns the first instance of any
// invalid UTF-8 byte sequence.
func IndexRune[S Byteseq](s S, r rune) int

// LastIndex returns the index of the last instance of substr in s, or -1 if substr is not present in s.
func LastIndex[S, Substr Byteseq](s S, substr Substr) int

// LastIndexAny returns the index of the last instance of any Unicode code
// point from chars in s, or -1 if no Unicode code point from chars is
// present in s.
func LastIndexAny[S, Chars Byteseq](s S, chars Chars) int

// LastIndexByte returns the index of the last instance of c in s, or -1 if c is not present in s.
func LastIndexByte[S Byteseq](s S, c byte) int

// LastIndexFunc returns the index into s of the last
// Unicode code point satisfying f(c), or -1 if none do.
func LastIndexFunc[S Byteseq](s S, f func (rune) bool) int

// Split slices s into all substrings separated by sep and returns a slice of
// the substrings between those separators.
// 
// If s does not contain sep and sep is not empty, Split returns a
// slice of length 1 whose only element is s.
// 
// If sep is empty, Split splits after each UTF-8 sequence. If both s
// and sep are empty, Split returns an empty slice.
// 
// It is equivalent to SplitN with a count of -1.
func Split[S, Sep Byteseq](s S, sep Sep) []S

// SplitAfter slices s into all substrings after each instance of sep and
// returns a slice of those substrings.
// 
// If s does not contain sep and sep is not empty, SplitAfter returns
// a slice of length 1 whose only element is s.
// 
// If sep is empty, SplitAfter splits after each UTF-8 sequence. If
// both s and sep are empty, SplitAfter returns an empty slice.
// 
// It is equivalent to SplitAfterN with a count of -1.
func SplitAfter[S, Sep Byteseq](s S, sep Sep) []S

// SplitAfterN slices s into substrings after each instance of sep and
// returns a slice of those substrings.
// 
// The count determines the number of substrings to return:
//   n > 0: at most n substrings; the last substring will be the unsplit remainder.
//   n == 0: the result is nil (zero substrings)
//   n < 0: all substrings
// 
// Edge cases for s and sep (for example, empty strings) are handled
// as described in the documentation for SplitAfter.
func SplitAfterN[S, Sep Byteseq](s S, sep Sep, n int) []S

// SplitN slices s into substrings separated by sep and returns a slice of
// the substrings between those separators.
// 
// The count determines the number of substrings to return:
//   n > 0: at most n substrings; the last substring will be the unsplit remainder.
//   n == 0: the result is nil (zero substrings)
//   n < 0: all substrings
// 
// Edge cases for s and sep (for example, empty strings) are handled
// as described in the documentation for Split.
func SplitN[S, Sep Byteseq](s S, sep Sep, n int) []S

// Trim returns a slice of the string s with all leading and
// trailing Unicode code points contained in cutset removed.
func Trim[S, Cutset Byteseq](s S, cutset Cutset) S

// TrimFunc returns a slice of the string s with all leading
// and trailing Unicode code points c satisfying f(c) removed.
func TrimFunc[S Byteseq](s S, f func (rune) bool) S

// TrimLeft returns a slice of the string s with all leading
// Unicode code points contained in cutset removed.
// 
// To remove a prefix, use TrimPrefix instead.
func TrimLeft[S, Cutset Byteseq](s S, cutset Cutset) S

// TrimLeftFunc returns a slice of the string s with all leading
// Unicode code points c satisfying f(c) removed.
func TrimLeftFunc[S Byteseq](s S, f func (rune) bool) S

// TrimPrefix returns s without the provided leading prefix string.
// If s doesn't start with prefix, s is returned unchanged.
func TrimPrefix[S, Prefix Byteseq](s S, prefix Prefix) S

// TrimRight returns a slice of the string s, with all trailing
// Unicode code points contained in cutset removed.
// 
// To remove a suffix, use TrimSuffix instead.
func TrimRight[S, Cutset Byteseq](s S, cutset Cutset) S

// TrimRightFunc returns a slice of the string s with all trailing
// Unicode code points c satisfying f(c) removed.
func TrimRightFunc[S Byteseq](s S, f func (rune) bool) S

// TrimSpace returns a slice of the string s, with all leading
// and trailing white space removed, as defined by Unicode.
func TrimSpace[S Byteseq](s S) S

// TrimSuffix returns s without the provided trailing suffix string.
// If s doesn't end with suffix, s is returned unchanged.
func TrimSuffix[S, Suffix Byteseq](s S, suffix Suffix) S

Note that the API proposed above does not include functions like strings.Map or strings.Join that build a new string.
The reason is to avoid a dependency on strings.Builder.
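To illustrate how a function from the listing above could return a subslice of its argument without building a new value, here is a minimal sketch of TrimPrefix. It is not the proposal's implementation: it assumes Go 1.18 or later, uses a plain byte-by-byte comparison, and the Path type in main is a hypothetical name added only for the demonstration.

```go
package main

import "fmt"

// Byteseq mirrors the constraint from the proposal.
type Byteseq interface {
	~string | ~[]byte
}

// TrimPrefix returns s without the leading prefix, or s unchanged if
// s does not start with prefix. It only reads its arguments: len,
// indexing, and slicing are all permitted for this constraint.
func TrimPrefix[S, Prefix Byteseq](s S, prefix Prefix) S {
	if len(prefix) > len(s) {
		return s
	}
	for i := 0; i < len(prefix); i++ {
		if s[i] != prefix[i] {
			return s
		}
	}
	// Slicing keeps the result type S, so no conversion is needed.
	return s[len(prefix):]
}

// Path is a hypothetical named string type used to show that the
// ~string constraint accepts defined types as well.
type Path string

func main() {
	fmt.Println(TrimPrefix(Path("/usr/bin"), "/usr"))
	fmt.Println(string(TrimPrefix([]byte("foobar"), "foo")))
}
```

Because the result has the same type as the first argument, a caller passing a []byte gets back a []byte that shares the original backing array, and a caller passing a string gets back a string.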


ianlancetaylor · 2021-09-27T16:20:57Z

Reducing code duplication is useful, but that could be done by adding an internal package.

This proposal by itself would be useful if we didn't already have bytes and strings packages. But we do. What do we gain by adding a third variant?

tdakkota · 2021-09-28T06:31:27Z

The proposed API allows us to use parameters of different types: you can instantiate Index[[]byte, string] and find the index of a string in a byte slice without a string<->[]byte conversion.

That's quite similar to what packages like go4.org/mem do, but without using unsafe.
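A minimal sketch of the mixed-type call described above, assuming Go 1.18 or later. The naive quadratic search here is a stand-in chosen for brevity; it is not how the proposal or the standard library would implement Index.

```go
package main

import "fmt"

// Byteseq mirrors the constraint from the proposal.
type Byteseq interface {
	~string | ~[]byte
}

// Index returns the index of the first instance of substr in s, or -1
// if substr is not present. The two arguments may have different
// types, and neither one is converted or copied.
func Index[S, Substr Byteseq](s S, substr Substr) int {
	n := len(substr)
	for i := 0; i+n <= len(s); i++ {
		match := true
		for j := 0; j < n; j++ {
			if s[i+j] != substr[j] {
				match = false
				break
			}
		}
		if match {
			return i
		}
	}
	return -1
}

func main() {
	buf := []byte("hello, gopher")

	// Explicit instantiation, as written in the comment above.
	fmt.Println(Index[[]byte, string](buf, "gopher"))

	// Type arguments can also be inferred from the call.
	fmt.Println(Index("hello, gopher", []byte("xyz")))
}
```

With the existing packages, the first call would need either bytes.Index(buf, []byte("gopher")) or strings.Index(string(buf), "gopher"), each of which spells out a conversion in the source.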

sfllaw · 2021-09-30T18:47:25Z

I have a silly question: is it possible for the compiler to realize that there’s an unnecessary byte-slice conversion and optimize it away?

ianlancetaylor · 2021-09-30T19:04:45Z

The compiler already does this in some specific cases. I don't know if there is a general optimization for it. See https://golang.org/wiki/CompilerOptimizations.
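One of the specific cases is a map lookup keyed by string(b), where the gc compiler can avoid allocating a temporary string. The sketch below measures this with testing.AllocsPerRun; lookupAllocs is a hypothetical helper name, and the result depends on the toolchain in use, so treat it as an experiment rather than a guarantee.

```go
package main

import (
	"fmt"
	"testing"
)

// lookupAllocs reports the average number of heap allocations made by
// a map lookup that converts a []byte key to string at the call site.
func lookupAllocs() float64 {
	m := map[string]int{"gopher": 1}
	key := []byte("gopher")
	return testing.AllocsPerRun(100, func() {
		_ = m[string(key)]
	})
}

func main() {
	fmt.Println("allocations per lookup:", lookupAllocs())
}
```

Conversions outside the recognized patterns, such as passing string(b) to an arbitrary function that retains it, generally still copy the bytes.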

rsc · 2021-10-06T21:47:43Z

The strings and bytes packages have subtly different semantics around copying that I don't see how to capture in this new package.

Also, the strings and bytes packages already exist and can't be deleted for compatibility reasons. It doesn't seem like a win to make a third way to do things.

rsc · 2021-10-06T22:07:27Z

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

tdakkota · 2021-10-07T10:12:08Z

> Also, the strings and bytes packages already exist and can't be deleted for compatibility reasons. It doesn't seem like a win to make a third way to do things.

That seems reasonable to me.
Maybe we can update the existing packages with generics instead of adding a new one? Something like what is proposed here:

package strings

// Index returns the index of the first instance of substr in s, or -1 if substr is not present in s.
func Index[Substr constraints.Byteseq](s string, substr Substr) int { ... }
func Index(s, substr string) int { return Index[string](s, substr) }
...

ianlancetaylor · 2021-10-08T04:29:32Z

Permitting both bytes.Index and strings.Index to search for either a string or a []byte does seem like an interesting possibility, but let's make that a separate proposal, and not for 1.18. Thanks.

cristaloleg · 2021-10-08T14:44:46Z

Also #5376

rsc · 2021-10-13T18:01:07Z

Based on the discussion above, this proposal seems like a likely decline.
— rsc for the proposal review group

rsc · 2021-10-20T18:04:31Z

No change in consensus, so declined.
— rsc for the proposal review group

gopherbot added this to the Proposal milestone Sep 27, 2021

gopherbot added the Proposal label Sep 27, 2021

ianlancetaylor added the generics label Sep 27, 2021

rsc changed the title ~~proposal: add a generic byte string manipulation package~~ proposal: byteseq: add a generic byte string manipulation package Oct 6, 2021

rsc added the Proposal-FinalCommentPeriod label Oct 13, 2021

rsc removed the Proposal-FinalCommentPeriod label Oct 20, 2021

rsc closed this as completed Oct 20, 2021

rsc moved this to Declined in Proposals Aug 10, 2022

rsc added this to Proposals Aug 10, 2022

golang locked and limited conversation to collaborators Oct 20, 2022

gopherbot added the FrozenDueToAge label Oct 20, 2022

rsc removed this from Proposals Oct 26, 2022
