Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: strings: add backward iterator #66206

Open
leb-kuchen opened this issue Mar 8, 2024 · 10 comments
Open

proposal: strings: add backward iterator #66206

leb-kuchen opened this issue Mar 8, 2024 · 10 comments
Labels
Milestone

Comments

@leb-kuchen
Copy link

leb-kuchen commented Mar 8, 2024

Proposal Details

I propose to add the function Backward to the package strings, as iterating over strings in reverse order is not trivial without allocating a new slice.

func Backward(s string) iter.Seq2[int, rune]

For now would it be:

// Returns a backwards iterator over bytes in s, yielding both the starting index and the rune.
func Backward(s string) iter.Seq2[int, rune]{
	return func(yield func(int, rune) bool) {
		for i := len(s); i > 0; {
			r, size := utf8.DecodeLastRuneInString(s[0:i])
			i -= size
			if !yield(i, r) {
				return
			}
		}
	}
}
@gopherbot gopherbot added this to the Proposal milestone Mar 8, 2024
@adonovan
Copy link
Member

adonovan commented Mar 8, 2024

How often does one need to iterate backwards over a string? Iterating backward over ASCII strings is trivial. Iterating backwards over non-ASCII strings is a dubious operation since many runes (flags, joiners, diacritics, etc) only make sense in the forward sequence.

@leb-kuchen
Copy link
Author

leb-kuchen commented Mar 8, 2024

I think of the numerous last match methods. How often can you be sure to have ASCII strings? It would be the equivalent to char_indices().rev() in rust.

@apparentlymart
Copy link

Indeed... I think it any one function in the standard library would to the wrong thing for at least some use-cases and it may not be obvious to the caller that it's doing "the wrong thing" unless they are very familiar with the details of Unicode and various different languages.

There are at least three different things this could mean:

  1. Treat the string as a byte array and return the bytes in reverse order. But I assume this isn't what you mean because this would be relatively easy to implement.
  2. Treat the string as a series of runes encoded in UTF-8 and return them in reverse order. I guess in practice that means walking backwards until you find a byte less than 128 and then trying to parse a UTF-8 sequence at that point. It would presumably need to define what happens if what it encounters isn't valid UTF-8 after all. This would garble any string containing combining diacritics, emojis with gender presentation and skin tone modifiers, or emoji flag sequences.
  3. Treat the string as a series of grapheme clusters as defined in the Unicode text segmentation annex, also encoded as UTF-8, and return those as strings in reverse order. But this one's particularly tricky both because the standard library doesn't currently have the necessary tables to implement the segmentation algorithm, and because the algorithm is defined for stream-based processing in forward order and would be non-trivial to evaluate backwards.

All of these three interpretations are wrong in some way in various specific contexts, while all three are also reasonable things to do in certain situations. Reversing the characters in a string is also a pretty esoteric thing to do regardless of exactly how you choose to define it.

Therefore this seems more like something that would be better served by three third-party libraries, rather than by a single function in stdlib. It might even be something you should just implement inline in your program directly so that you can tailor it to exactly whatever weird requirements you are trying to meet by implementing it.

@leb-kuchen
Copy link
Author

leb-kuchen commented Mar 9, 2024

I am thinking of the counterpart of iterating of a string with range, so 2. Assume you'd want the last 5 code points. I would convert it to []rune and slice it. If do not see why it should be so complicated or inefficient to do. I am not thinking of reversing a string, then you would probably be using a lib, which provides Unicode segmentation.

@leb-kuchen
Copy link
Author

leb-kuchen commented Mar 9, 2024

In the package strings there are LastIndexFunc and LastIndexAny. LastIndexFunc calls lastIndexFunc internally

func lastIndexFunc(s string, f func(rune) bool, truth bool) int {
	for i := len(s); i > 0; {
		r, size := utf8.DecodeLastRuneInString(s[0:i])
		i -= size
		if f(r) == truth {
			return i
		}
	}
	return -1
}

@leb-kuchen
Copy link
Author

@earthboundkid
Copy link
Contributor

I think this isn’t a bad idea but it’s uncommon enough that it can go in a Unicode package (maybe both utf8 and utf16) and not strings itself. That will also make the context of returning runes (not graphemes) more clear.

@apparentlymart
Copy link

It sounds then like the idea is the definition I numbered as 2 in my earlier comment.

If this does end up being a function in stdlib somewhere, then I would suggest trying to name it to suggest "runes in reverse order" rather than "reversed string", so at least there's a hint that this is doing something quite specific that might not be what the potential caller wants.

I like @earthboundkid's compromise of possibly putting it in unicode/utf8, which as noted already contains the DecodeLastRuneInString building block that it would presumably use, and perhaps calling it RunesReversed/RunesReversedInString so that a potential future set of signatures would be something like this:

package utf8

func Runes(p []byte) iter.Seq2[int, rune] { /* ... */ }
func RunesInString(s string) iter.Seq2[int, rune] { /* ... */ }
func RunesReversed(p []byte) iter.Seq2[int, rune] { /* ... */ }
func RunesReversedInString(s string) iter.Seq2[int, rune] { /* ... */ }

I do still feel a little skeptical about the reverse cases being common enough for the standard library, but I'll concede that they are reasonably straightforward -- but still slightly subtle -- wrappers around DecodeLastRune/DecodeLastRuneInString.

A call to utf8.RunesReversedInString in particular seems to represent the relevant details:

  1. We're expecting UTF-8 encoding.
  2. We're dealing in runes, and not in bytes or grapheme clusters.
  3. It's the string variant, because package utf8 naming prioritizes functions working with []byte values.

(utf8.RunesInString would just do the same thing as the built-in range over a string, presumably, so that one is very marginal but I included it to show how the matrix of bytes vs. strings / forward vs. reverse might be completed while following the existing conventions in package utf8, assuming it needed to be. It probably doesn't actually need to be completed.)

@earthboundkid
Copy link
Contributor

The InString variants can be unified with the bytes variants using generics, I think, saving a lot of name duplication.

@apparentlymart
Copy link

That's fair. I was aiming for consistency, but not considering that I was trying to be consistent with a pre-generics API.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Incoming
Development

No branches or pull requests

5 participants