proposal: byteseq: add a generic byte string manipulation package #48643

Closed
tdakkota opened this issue Sep 27, 2021 · 11 comments
Labels: FrozenDueToAge, generics, Proposal

@tdakkota

tdakkota commented Sep 27, 2021

This proposal is for use with #43651. I propose to define a new package, byteseq, that will provide simple generic
functions to manipulate UTF-8 encoded strings and byte slices.

Goals of this proposal:

  • Provide a safe generic API without string <-> []byte conversion overhead.
  • Reduce code duplication between the strings and bytes packages.
  • Enforce immutability in the API using type constraints. The ~string | ~[]byte constraint signals that functions
    must not mutate their arguments.

API description:

// Byteseq represents a generic UTF-8 byte string.
type Byteseq interface {
     ~string | ~[]byte
}

// Compare returns an integer comparing two strings lexicographically. 
// The result will be 0 if a==b, -1 if a < b, and +1 if a > b.
func Compare[A, B Byteseq](a A, b B) int

// Contains reports whether subslice is within b.
func Contains[B, SubSlice Byteseq](b B, subslice SubSlice) bool

// ContainsAny reports whether any of the UTF-8-encoded code points in chars are within b.
func ContainsAny[B, Chars Byteseq](b B, chars Chars) bool

// Count counts the number of non-overlapping instances of sep in s. 
// If sep is empty, Count returns 1 + the number of UTF-8-encoded code points in s.
func Count[S, Sep Byteseq](s S, sep Sep) int

// Equal reports whether a and b are the same length and contain the same bytes.
func Equal[A, B Byteseq](a A, b B) bool

// EqualFold reports whether s and t, interpreted as UTF-8 strings, are equal under Unicode case-folding, 
// which is a more general form of case-insensitivity.
func EqualFold[S, T Byteseq](s S, t T) bool

// Fields splits the string s around each instance of one or more consecutive white space
// characters, as defined by unicode.IsSpace, returning a slice of substrings of s or an
// empty slice if s contains only white space.
func Fields[S Byteseq](s S) []S

// FieldsFunc splits the string s at each run of Unicode code points c satisfying f(c)
// and returns an array of slices of s. If all code points in s satisfy f(c) or the
// string is empty, an empty slice is returned.
// 
// FieldsFunc makes no guarantees about the order in which it calls f(c)
// and assumes that f always returns the same value for a given c.
func FieldsFunc[S Byteseq](s S, f func(rune) bool) []S

// HasPrefix tests whether the string s begins with prefix.
func HasPrefix[S, Prefix Byteseq](s S, prefix Prefix) bool

// HasSuffix tests whether the string s ends with suffix.
func HasSuffix[S, Suffix Byteseq](s S, suffix Suffix) bool

// Index returns the index of the first instance of substr in s, or -1 if substr is not present in s.
func Index[S, Substr Byteseq](s S, substr Substr) int

// IndexAny returns the index of the first instance of any Unicode code point
// from chars in s, or -1 if no Unicode code point from chars is present in s.
func IndexAny[S, Chars Byteseq](s S, chars Chars) int

// IndexByte returns the index of the first instance of c in s, or -1 if c is not present in s.
func IndexByte[S Byteseq](s S, c byte) int

// IndexFunc returns the index into s of the first Unicode
// code point satisfying f(c), or -1 if none do.
func IndexFunc[S Byteseq](s S, f func(rune) bool) int

// IndexRune returns the index of the first instance of the Unicode code point
// r, or -1 if rune is not present in s.
// If r is utf8.RuneError, it returns the first instance of any
// invalid UTF-8 byte sequence.
func IndexRune[S Byteseq](s S, r rune) int

// LastIndex returns the index of the last instance of substr in s, or -1 if substr is not present in s.
func LastIndex[S, Substr Byteseq](s S, substr Substr) int

// LastIndexAny returns the index of the last instance of any Unicode code
// point from chars in s, or -1 if no Unicode code point from chars is
// present in s.
func LastIndexAny[S, Chars Byteseq](s S, chars Chars) int

// LastIndexByte returns the index of the last instance of c in s, or -1 if c is not present in s.
func LastIndexByte[S Byteseq](s S, c byte) int

// LastIndexFunc returns the index into s of the last
// Unicode code point satisfying f(c), or -1 if none do.
func LastIndexFunc[S Byteseq](s S, f func(rune) bool) int

// Split slices s into all substrings separated by sep and returns a slice of
// the substrings between those separators.
// 
// If s does not contain sep and sep is not empty, Split returns a
// slice of length 1 whose only element is s.
// 
// If sep is empty, Split splits after each UTF-8 sequence. If both s
// and sep are empty, Split returns an empty slice.
// 
// It is equivalent to SplitN with a count of -1.
func Split[S, Sep Byteseq](s S, sep Sep) []S

// SplitAfter slices s into all substrings after each instance of sep and
// returns a slice of those substrings.
// 
// If s does not contain sep and sep is not empty, SplitAfter returns
// a slice of length 1 whose only element is s.
// 
// If sep is empty, SplitAfter splits after each UTF-8 sequence. If
// both s and sep are empty, SplitAfter returns an empty slice.
// 
// It is equivalent to SplitAfterN with a count of -1.
func SplitAfter[S, Sep Byteseq](s S, sep Sep) []S

// SplitAfterN slices s into substrings after each instance of sep and
// returns a slice of those substrings.
// 
// The count determines the number of substrings to return:
//   n > 0: at most n substrings; the last substring will be the unsplit remainder.
//   n == 0: the result is nil (zero substrings)
//   n < 0: all substrings
// 
// Edge cases for s and sep (for example, empty strings) are handled
// as described in the documentation for SplitAfter.
func SplitAfterN[S, Sep Byteseq](s S, sep Sep, n int) []S

// SplitN slices s into substrings separated by sep and returns a slice of
// the substrings between those separators.
// 
// The count determines the number of substrings to return:
//   n > 0: at most n substrings; the last substring will be the unsplit remainder.
//   n == 0: the result is nil (zero substrings)
//   n < 0: all substrings
// 
// Edge cases for s and sep (for example, empty strings) are handled
// as described in the documentation for Split.
func SplitN[S, Sep Byteseq](s S, sep Sep, n int) []S

// Trim returns a slice of the string s with all leading and
// trailing Unicode code points contained in cutset removed.
func Trim[S, Cutset Byteseq](s S, cutset Cutset) S

// TrimFunc returns a slice of the string s with all leading
// and trailing Unicode code points c satisfying f(c) removed.
func TrimFunc[S Byteseq](s S, f func(rune) bool) S

// TrimLeft returns a slice of the string s with all leading
// Unicode code points contained in cutset removed.
// 
// To remove a prefix, use TrimPrefix instead.
func TrimLeft[S, Cutset Byteseq](s S, cutset Cutset) S

// TrimLeftFunc returns a slice of the string s with all leading
// Unicode code points c satisfying f(c) removed.
func TrimLeftFunc[S Byteseq](s S, f func(rune) bool) S

// TrimPrefix returns s without the provided leading prefix string.
// If s doesn't start with prefix, s is returned unchanged.
func TrimPrefix[S, Prefix Byteseq](s S, prefix Prefix) S

// TrimRight returns a slice of the string s, with all trailing
// Unicode code points contained in cutset removed.
// 
// To remove a suffix, use TrimSuffix instead.
func TrimRight[S, Cutset Byteseq](s S, cutset Cutset) S

// TrimRightFunc returns a slice of the string s with all trailing
// Unicode code points c satisfying f(c) removed.
func TrimRightFunc[S Byteseq](s S, f func(rune) bool) S

// TrimSpace returns a slice of the string s, with all leading
// and trailing white space removed, as defined by Unicode.
func TrimSpace[S Byteseq](s S) S

// TrimSuffix returns s without the provided trailing suffix string.
// If s doesn't end with suffix, s is returned unchanged.
func TrimSuffix[S, Suffix Byteseq](s S, suffix Suffix) S

Note that the API proposed above does not include functions like strings.Map or strings.Join that build a new string.
The reason is to avoid a dependency on strings.Builder.
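To make the constraint concrete, here is a minimal sketch of how two of the listed functions could be written against Byteseq using only len and indexing, which Go permits on a type parameter whose type set is ~string | ~[]byte. The bodies are an assumption for illustration (a naive byte-by-byte comparison), not the implementation the proposal has in mind.

```go
package main

import "fmt"

// Byteseq mirrors the constraint from the proposal.
type Byteseq interface {
	~string | ~[]byte
}

// HasPrefix reports whether s begins with prefix.
// Illustrative body: compares byte by byte, no conversion between types.
func HasPrefix[S, Prefix Byteseq](s S, prefix Prefix) bool {
	if len(prefix) > len(s) {
		return false
	}
	for i := 0; i < len(prefix); i++ {
		if s[i] != prefix[i] {
			return false
		}
	}
	// Note: an assignment such as s[0] = 'x' would not compile here,
	// because the type set of S includes string types.
	return true
}

// Equal reports whether a and b are the same length and contain the same bytes.
func Equal[A, B Byteseq](a A, b B) bool {
	return len(a) == len(b) && HasPrefix(a, b)
}

func main() {
	fmt.Println(HasPrefix("gopher", []byte("go")))
	fmt.Println(Equal([]byte("abc"), "abc"))
	fmt.Println(Equal("abc", "abd"))
}
```

The commented-out assignment is the point of the third goal: since string is in the type set, the compiler rejects writes through s, so a Byteseq function cannot mutate a []byte argument by indexing.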

@gopherbot gopherbot added this to the Proposal milestone Sep 27, 2021
@ianlancetaylor ianlancetaylor added the generics Issue is related to generics label Sep 27, 2021
@ianlancetaylor ianlancetaylor added this to Incoming in Proposals (old) Sep 27, 2021
@ianlancetaylor
Contributor

Reducing code duplication is useful, but that could be done by adding an internal package.

This proposal by itself would be useful if we didn't already have bytes and strings packages. But we do. What do we gain by adding a third variant?

@tdakkota
Author

The proposed API allows parameters of different types: you can instantiate Index[[]byte, string] and find the index of a string in a byte slice without a string <-> []byte conversion.

That's quite similar to what packages like go4.org/mem do, but without using unsafe.
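As a rough illustration of the mixed-type call described above, the sketch below searches a byte slice for a string. The Index body is an assumption (a naive quadratic search, far simpler than what the standard library does); only the signature comes from the proposal.

```go
package main

import "fmt"

// Byteseq mirrors the constraint from the proposal.
type Byteseq interface {
	~string | ~[]byte
}

// Index returns the index of the first instance of substr in s,
// or -1 if substr is not present in s.
// Illustrative naive search; S and Substr may be different types.
func Index[S, Substr Byteseq](s S, substr Substr) int {
	n := len(substr)
	for i := 0; i+n <= len(s); i++ {
		match := true
		for j := 0; j < n; j++ {
			if s[i+j] != substr[j] {
				match = false
				break
			}
		}
		if match {
			return i
		}
	}
	return -1
}

func main() {
	buf := []byte("hello, gopher")
	// Explicit instantiation, as written in the comment above.
	fmt.Println(Index[[]byte, string](buf, "gopher"))
	// The same call with the type arguments inferred.
	fmt.Println(Index(buf, "gopher"))
	// The other direction: a []byte needle in a string haystack.
	fmt.Println(Index("hello", []byte("xyz")))
}
```

In everyday use the type arguments would normally be inferred, so the call site looks the same as strings.Index or bytes.Index today.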

@sfllaw
Contributor

sfllaw commented Sep 30, 2021

I have a silly question: is it possible for the compiler to realize that there’s an unnecessary byte-slice conversion and optimize it away?

@ianlancetaylor
Contributor

The compiler already does this in some specific cases. I don't know if there is a general optimization for it. See https://golang.org/wiki/CompilerOptimizations.
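For reference, these are the kinds of patterns that, as far as I understand the wiki page, the gc compiler can handle without allocating for the conversion. The helper names are hypothetical, and whether a given conversion is actually elided depends on the compiler and its version; the language itself makes no such guarantee.

```go
package main

import "fmt"

// lookup converts b to a string only inside the map index expression,
// a form the gc compiler is documented to handle without copying b.
func lookup(m map[string]int, b []byte) (int, bool) {
	v, ok := m[string(b)]
	return v, ok
}

// isGet converts b to a string only for the comparison.
func isGet(b []byte) bool {
	return string(b) == "GET"
}

// countByte converts s to a byte slice only to range over it.
func countByte(s string, c byte) int {
	n := 0
	for _, x := range []byte(s) {
		if x == c {
			n++
		}
	}
	return n
}

func main() {
	m := map[string]int{"a": 1}
	fmt.Println(lookup(m, []byte("a")))
	fmt.Println(isGet([]byte("GET")))
	fmt.Println(countByte("banana", 'a'))
}
```

Outside of such recognized forms, for example when the converted value is stored or passed to another function, the conversion generally copies the data.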

@rsc
Contributor

rsc commented Oct 6, 2021

The strings and bytes packages have subtly different semantics around copying that I don't see how to capture in this new package.

Also, the strings and bytes packages already exist and can't be deleted for compatibility reasons. It doesn't seem like a win to make a third way to do things.
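One difference of this kind, offered here only as a possible illustration and not necessarily the case meant above: bytes functions that return a subslice share memory with their input, so a later write is visible through both, whereas a string result can never be modified. The helper name trimAndMutate is hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
)

// trimAndMutate trims a byte slice, writes through the trimmed result,
// and returns both the original and the trimmed contents as strings.
func trimAndMutate() (orig, trimmed string) {
	b := []byte("  hi  ")
	t := bytes.TrimSpace(b) // t is a subslice of b, not a copy
	t[0] = 'H'              // this write is also visible through b
	return string(b), string(t)
}

func main() {
	orig, trimmed := trimAndMutate()
	fmt.Printf("%q %q\n", orig, trimmed)
}
```

A single generic signature such as func TrimSpace[S Byteseq](s S) S does not by itself say whether the result aliases the argument, so callers passing []byte would still need that documented separately.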

@rsc rsc moved this from Incoming to Active in Proposals (old) Oct 6, 2021
@rsc
Contributor

rsc commented Oct 6, 2021

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc rsc changed the title proposal: add a generic byte string manipulation package proposal: byteseq: add a generic byte string manipulation package Oct 6, 2021
@tdakkota
Author

tdakkota commented Oct 7, 2021

> Also, the strings and bytes packages already exist and can't be deleted for compatibility reasons. It doesn't seem like a win to make a third way to do things.

That seems reasonable to me.
Maybe we can update the existing packages with generics instead of adding a new one? Something like what is proposed here:

package strings

// Index returns the index of the first instance of substr in s, or -1 if substr is not present in s.
func Index[Substr constraints.Byteseq](s string, substr Substr) int { ... }
func Index(s, substr string) int { return Index[string](s, substr) }
...

@ianlancetaylor
Contributor

Permitting both bytes.Index and strings.Index to search for either a string or a []byte does seem like an interesting possibility, but let's make that a separate proposal, and not for 1.18. Thanks.

@cristaloleg

Also #5376

@rsc rsc moved this from Active to Likely Decline in Proposals (old) Oct 13, 2021
@rsc
Contributor

rsc commented Oct 13, 2021

Based on the discussion above, this proposal seems like a likely decline.
— rsc for the proposal review group

@rsc rsc moved this from Likely Decline to Declined in Proposals (old) Oct 20, 2021
@rsc
Contributor

rsc commented Oct 20, 2021

No change in consensus, so declined.
— rsc for the proposal review group

@rsc rsc closed this as completed Oct 20, 2021
@golang golang locked and limited conversation to collaborators Oct 20, 2022