proposal: Go 2: for index, rune, runelen = range for strings #28599

martisch · 2018-11-05T15:22:15Z

UPDATED proposal to only apply to strings.

The proposal is to allow a new 3 value range for strings:

for index, rune, runelen = range string {
   ...
}

where index and rune are the same as in for index, rune = range string and runelen is the number of bytes that was needed to encode the rune in UTF8. In case a rune could not be decoded from the string due to an invalid UTF8 byte sequence: runelen is set to 1 and rune to utf8.RuneError.

This is backwards compatible with existing range syntax and Go1 since 3 value ranges do not exist yet.
gives the rune width of the decoded rune which is useful for e.g. strings.Map and other functions where the rune width is used to determine if a real RuneError rune was decoded or how much space is needed to write the rune to a byte slice.
there is no performance overhead vs existing for i, r = range string loop for the 3 value loop since internally the loop already computes the utf8 byte sequence width of the rune to correctly advance the index on the next iteration. This will however allow user supplied code in the loop to use this hidden variable.

For example current strings.Map first range loop:

for i, c := range s {
	r := mapping(c)
	if r == c && c != utf8.RuneError {
	   continue
	}

	var width int
	if c == utf8.RuneError {
		c, width = utf8.DecodeRuneInString(s[i:])
		if width != 1 && r == c {
			continue
		}
	} else {
		width = utf8.RuneLen(c)
	}
        ...

could become:

for i, c, width := range s {
   r := mapping(c)
   if r == c && !(c == utf8.RuneError && width == 1) {
       continue
   }
   ...

The text was updated successfully, but these errors were encountered:

theodesp · 2018-11-05T19:51:47Z

Wouldn't be the better if the size is computed explicitly as its fixed and does not change? Also, I don't think range is the ideal keyword to perform this extension as if you think about it a range should iterate over a starting - ending point only and having a size would be extraneous.

martisch · 2018-11-05T20:08:56Z

size here is the size of the rune that was decoded, its not fixed and changes from rune to rune.
Its the step between two indexes.

randall77 · 2018-11-05T20:09:35Z

I'd rather fix this by doing the optimization in the compiler.
Changing the language to support this optimization seems overkill.

martisch · 2018-11-05T20:25:42Z

Agreed, however unless the functions are very simple the compiler cant apply this optimization as any write to a byteslice or byte through a pointer needs to assume the byteslice iterated over was changed unless the compiler can prove they cant overlap. As far as i understand the compiler for example cant optimize the bytes.Map example as the mapping function could change the bytes of the byteslice being iterated over during the range loop.

The other simplification this proposal provides is allowing to have the already computed rune size (step to the next index) in the loop be usable by code in the loop without calling utf8.DecodeRune(InString).

Updated the proposal to reflect those points better.

deanveloper · 2018-11-06T17:47:51Z

This should definitely not be done for []byte. Bytes do not have any concept of a "rune" in any way, and shouldn't.

edit - To add on to this, this would mean that doing i, e, _ := range byteslice would be entirely different from i, e := range byteslice

martisch · 2018-11-06T18:07:11Z

This should definitely not be done for []byte. Bytes do not have any concept of a "rune" in any way, and shouldn't.

I would not think this is that of an absolute counter point given the existence of utf8.DecodeRune and other utf8 function operating on byte slices and the general symmetry of byteslices and string in various packages like utf8 showing byte slices frequently being interpreted as utf8 encoded runes.

edit - To add on to this, this would mean that doing i, e, _ := range byteslice would be entirely different from i, e := range byteslice

fair point that will indeed be confusing.

Maybe something like i, r = range(rune) byteslice would better signify that range interprets the byteslice as runes and byteslice does not have an inherent rune concept.

deanveloper · 2018-11-06T18:50:42Z

I would not think this is that of an absolute counter point given the existence of utf8.DecodeRune and other utf8 function operating on byte slices and the general symmetry of byteslices and string in various packages like utf8 showing byte slices frequently being interpreted as utf8 encoded runes.

That's because utf8 is an encoding. When one does utf8.DecodeRune(byteslice), they are saying "take this byte slice and look at it like it's encoded in utf8. Whats the first rune?". The byteslice has no knowledge of its encoding, and it rightfully shouldn't. It's just a bunch of numbers.

And of course the utf8 package is going to assume utf8 encoded bytes. I could say the same about the utf16 package but say that we should be using utf16 instead.

The point I'm trying to make is, bytes aren't special. They are just a uint8, and should be treated as such. If we want to introduce a concept of runes for your byte slice, either use a package that specifies how to convert from bytes to runes (like utf8), or convert it to a string.

Maybe something like i, r = range(rune) byteslice would better signify that range interprets the byteslice as runes and byteslice does not have an inherent rune concept.

Or i, r = range string(byteslice) 😉

martisch · 2018-11-06T19:00:26Z

That's because utf8 is an encoding. When one does utf8.DecodeRune(byteslice), they are saying "take this byte slice and look at it like it's encoded in utf8. Whats the first rune?". The byteslice has no knowledge of its encoding, and it rightfully shouldn't. It's just a bunch of numbers.

The same is true for strings. https://golang.org/ref/spec#String_types "A string type represents the set of string values. A string value is a (possibly empty) sequence of bytes." but currently range UTF8 decodes runes from the bytes of a string. The range keyword introduces the way the string or byteslice is decoded even if both have no knowledge of this on their own.

Or i, r = range string(byteslice) 😉

Which is not the same semantic since changes to the byteslice are not taken into account during iteration and therefore often a copy is needed.

deanveloper · 2018-11-06T19:31:56Z

The same is true for strings. https://golang.org/ref/spec#String_types "A string type represents the set of string values. A string value is a (possibly empty) sequence of bytes."

I thought this would be brought up. I wasn't going to do it, but I was thinking that if you did I would mention that both ranging over a string and converting it to []rune assume utf8 encoding, as you have mentioned already.

The range keyword introduces the way the string or byteslice is decoded even if both have no knowledge of this on their own.

There's not a single thing built-in to the language that assumes that a []byte is utf8 encoded IIRC.

Which is not the same semantic since changes to the byteslice are not taken into account during iteration and therefore often a copy is needed.

This is a fair assessment, although I feel like in the case where you want it to be available to be changed, the utf8 package (as verbose as it would be) should be used.

import "utf8"

func main() {
	byteslice := []byte("안녕!")

	for i := 0; i < len(byteslice); {
		r, rlen := utf8.DecodeRune(byteslice[i:])
		
		// code here
		
		i += rlen
	}
}

Anyway, the point I'm trying to make is that there's not a single feature in the language that assumes that a []byte is utf8 encoded, and I don't think we should add one. I'm alright with this proposal for string, but I don't agree that we should add it to []byte as well.

martisch · 2018-11-06T19:46:16Z

updated the proposal to the strings version only.

bradfitz · 2018-12-11T21:46:32Z

Or if we do the bug where rune32 and int32 are separate types (#29012), then we could define len(a rune32) to be the UTF-8 length of the rune.

But in general we're moving away from hard-coding UTF-8 assumptions from the language.

ianlancetaylor · 2018-12-11T21:48:49Z

If performance of this kind of operation is essential, we should teach the compiler to treat the relevant functions from the utf8 package as intrinsics. Ranging over a string is already somewhat non-orthogonal. Let's not make it even more different.

martisch added LanguageChange v2 A language change or incompatible library change labels Nov 5, 2018

gopherbot added this to the Proposal milestone Nov 5, 2018

gopherbot added the Proposal label Nov 5, 2018

martisch changed the title ~~proposal: for index, size, rune = range for strings and bytes~~ proposal: for index, size, rune = range for byte slices and strings Nov 5, 2018

martisch changed the title ~~proposal: for index, size, rune = range for byte slices and strings~~ proposal: for index, rune, size = range for byte slices and strings Nov 5, 2018

martisch changed the title ~~proposal: for index, rune, size = range for byte slices and strings~~ proposal: for index, rune, runelen = range for byte slices and strings Nov 5, 2018

martisch changed the title ~~proposal: for index, rune, runelen = range for byte slices and strings~~ proposal: for index, rune, runelen = range for strings Nov 6, 2018

ianlancetaylor changed the title ~~proposal: for index, rune, runelen = range for strings~~ proposal: Go 2: for index, rune, runelen = range for strings Nov 7, 2018

ianlancetaylor closed this as completed Dec 11, 2018

golang locked and limited conversation to collaborators Dec 11, 2019

gopherbot added the FrozenDueToAge label Dec 11, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

proposal: Go 2: for index, rune, runelen = range for strings #28599

proposal: Go 2: for index, rune, runelen = range for strings #28599

martisch commented Nov 5, 2018 •

edited

theodesp commented Nov 5, 2018 •

edited

martisch commented Nov 5, 2018

randall77 commented Nov 5, 2018

martisch commented Nov 5, 2018 •

edited

deanveloper commented Nov 6, 2018 •

edited

martisch commented Nov 6, 2018 •

edited

deanveloper commented Nov 6, 2018

martisch commented Nov 6, 2018 •

edited

deanveloper commented Nov 6, 2018 •

edited

martisch commented Nov 6, 2018

bradfitz commented Dec 11, 2018

ianlancetaylor commented Dec 11, 2018

proposal: Go 2: for index, rune, runelen = range for strings #28599

proposal: Go 2: for index, rune, runelen = range for strings #28599

Comments

martisch commented Nov 5, 2018 • edited

theodesp commented Nov 5, 2018 • edited

martisch commented Nov 5, 2018

randall77 commented Nov 5, 2018

martisch commented Nov 5, 2018 • edited

deanveloper commented Nov 6, 2018 • edited

martisch commented Nov 6, 2018 • edited

deanveloper commented Nov 6, 2018

martisch commented Nov 6, 2018 • edited

deanveloper commented Nov 6, 2018 • edited

martisch commented Nov 6, 2018

bradfitz commented Dec 11, 2018

ianlancetaylor commented Dec 11, 2018

martisch commented Nov 5, 2018 •

edited

theodesp commented Nov 5, 2018 •

edited

martisch commented Nov 5, 2018 •

edited

deanveloper commented Nov 6, 2018 •

edited

martisch commented Nov 6, 2018 •

edited

martisch commented Nov 6, 2018 •

edited

deanveloper commented Nov 6, 2018 •

edited