proposal: strings: add backward iterator #66206

leb-kuchen · 2024-03-08T21:18:30Z

Proposal Details

I propose to add the function Backward to the package strings, as iterating over strings in reverse order is not trivial without allocating a new slice.

func Backward(s string) iter.Seq2[int, rune]

For now would it be:

// Returns a backwards iterator over bytes in s, yielding both the starting index and the rune.
func Backward(s string) iter.Seq2[int, rune]{
	return func(yield func(int, rune) bool) {
		for i := len(s); i > 0; {
			r, size := utf8.DecodeLastRuneInString(s[0:i])
			i -= size
			if !yield(i, r) {
				return
			}
		}
	}
}

The text was updated successfully, but these errors were encountered:

adonovan · 2024-03-08T22:07:36Z

How often does one need to iterate backwards over a string? Iterating backward over ASCII strings is trivial. Iterating backwards over non-ASCII strings is a dubious operation since many runes (flags, joiners, diacritics, etc) only make sense in the forward sequence.

leb-kuchen · 2024-03-08T23:34:04Z

I think of the numerous last match methods. How often can you be sure to have ASCII strings? It would be the equivalent to char_indices().rev() in rust.

apparentlymart · 2024-03-09T03:47:10Z

Indeed... I think it any one function in the standard library would to the wrong thing for at least some use-cases and it may not be obvious to the caller that it's doing "the wrong thing" unless they are very familiar with the details of Unicode and various different languages.

There are at least three different things this could mean:

Treat the string as a byte array and return the bytes in reverse order. But I assume this isn't what you mean because this would be relatively easy to implement.
Treat the string as a series of runes encoded in UTF-8 and return them in reverse order. I guess in practice that means walking backwards until you find a byte less than 128 and then trying to parse a UTF-8 sequence at that point. It would presumably need to define what happens if what it encounters isn't valid UTF-8 after all. This would garble any string containing combining diacritics, emojis with gender presentation and skin tone modifiers, or emoji flag sequences.
Treat the string as a series of grapheme clusters as defined in the Unicode text segmentation annex, also encoded as UTF-8, and return those as strings in reverse order. But this one's particularly tricky both because the standard library doesn't currently have the necessary tables to implement the segmentation algorithm, and because the algorithm is defined for stream-based processing in forward order and would be non-trivial to evaluate backwards.

All of these three interpretations are wrong in some way in various specific contexts, while all three are also reasonable things to do in certain situations. Reversing the characters in a string is also a pretty esoteric thing to do regardless of exactly how you choose to define it.

Therefore this seems more like something that would be better served by three third-party libraries, rather than by a single function in stdlib. It might even be something you should just implement inline in your program directly so that you can tailor it to exactly whatever weird requirements you are trying to meet by implementing it.

leb-kuchen · 2024-03-09T07:33:03Z

I am thinking of the counterpart of iterating of a string with range, so 2. Assume you'd want the last 5 code points. I would convert it to []rune and slice it. If do not see why it should be so complicated or inefficient to do. I am not thinking of reversing a string, then you would probably be using a lib, which provides Unicode segmentation.

leb-kuchen · 2024-03-09T10:26:32Z

In the package strings there are LastIndexFunc and LastIndexAny. LastIndexFunc calls lastIndexFunc internally

func lastIndexFunc(s string, f func(rune) bool, truth bool) int {
	for i := len(s); i > 0; {
		r, size := utf8.DecodeLastRuneInString(s[0:i])
		i -= size
		if f(r) == truth {
			return i
		}
	}
	return -1
}

leb-kuchen · 2024-03-09T13:43:27Z

https://go.dev/play/p/5o0X0Cbjtzd

earthboundkid · 2024-03-09T14:03:04Z

I think this isn’t a bad idea but it’s uncommon enough that it can go in a Unicode package (maybe both utf8 and utf16) and not strings itself. That will also make the context of returning runes (not graphemes) more clear.

apparentlymart · 2024-03-09T15:22:36Z

It sounds then like the idea is the definition I numbered as 2 in my earlier comment.

If this does end up being a function in stdlib somewhere, then I would suggest trying to name it to suggest "runes in reverse order" rather than "reversed string", so at least there's a hint that this is doing something quite specific that might not be what the potential caller wants.

I like @earthboundkid's compromise of possibly putting it in unicode/utf8, which as noted already contains the DecodeLastRuneInString building block that it would presumably use, and perhaps calling it RunesReversed/RunesReversedInString so that a potential future set of signatures would be something like this:

package utf8

func Runes(p []byte) iter.Seq2[int, rune] { /* ... */ }
func RunesInString(s string) iter.Seq2[int, rune] { /* ... */ }
func RunesReversed(p []byte) iter.Seq2[int, rune] { /* ... */ }
func RunesReversedInString(s string) iter.Seq2[int, rune] { /* ... */ }

I do still feel a little skeptical about the reverse cases being common enough for the standard library, but I'll concede that they are reasonably straightforward -- but still slightly subtle -- wrappers around DecodeLastRune/DecodeLastRuneInString.

A call to utf8.RunesReversedInString in particular seems to represent the relevant details:

We're expecting UTF-8 encoding.
We're dealing in runes, and not in bytes or grapheme clusters.
It's the string variant, because package utf8 naming prioritizes functions working with []byte values.

(utf8.RunesInString would just do the same thing as the built-in range over a string, presumably, so that one is very marginal but I included it to show how the matrix of bytes vs. strings / forward vs. reverse might be completed while following the existing conventions in package utf8, assuming it needed to be. It probably doesn't actually need to be completed.)

earthboundkid · 2024-03-09T15:46:26Z

The InString variants can be unified with the bytes variants using generics, I think, saving a lot of name duplication.

apparentlymart · 2024-03-09T19:17:10Z

That's fair. I was aiming for consistency, but not considering that I was trying to be consistent with a pre-generics API.

leb-kuchen added the Proposal label Mar 8, 2024

gopherbot added this to the Proposal milestone Mar 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

proposal: strings: add backward iterator #66206

proposal: strings: add backward iterator #66206

leb-kuchen commented Mar 8, 2024 •

edited

adonovan commented Mar 8, 2024

leb-kuchen commented Mar 8, 2024 •

edited

apparentlymart commented Mar 9, 2024

leb-kuchen commented Mar 9, 2024 •

edited

leb-kuchen commented Mar 9, 2024 •

edited

leb-kuchen commented Mar 9, 2024

earthboundkid commented Mar 9, 2024

apparentlymart commented Mar 9, 2024

earthboundkid commented Mar 9, 2024

apparentlymart commented Mar 9, 2024

proposal: strings: add backward iterator #66206

proposal: strings: add backward iterator #66206

Comments

leb-kuchen commented Mar 8, 2024 • edited

Proposal Details

adonovan commented Mar 8, 2024

leb-kuchen commented Mar 8, 2024 • edited

apparentlymart commented Mar 9, 2024

leb-kuchen commented Mar 9, 2024 • edited

leb-kuchen commented Mar 9, 2024 • edited

leb-kuchen commented Mar 9, 2024

earthboundkid commented Mar 9, 2024

apparentlymart commented Mar 9, 2024

earthboundkid commented Mar 9, 2024

apparentlymart commented Mar 9, 2024

leb-kuchen commented Mar 8, 2024 •

edited

leb-kuchen commented Mar 8, 2024 •

edited

leb-kuchen commented Mar 9, 2024 •

edited

leb-kuchen commented Mar 9, 2024 •

edited