I find myself in need of a method that reports how many bytes a UTF-8 encoded rune occupies, given only its first byte, when iterating over the bytes of a string. Following RFC 3629, we can implement something like utf8.RuneStartLen(b byte) int.
Zig and Rust already provide this functionality. Go could offer something similar.
// RuneStartLen reports the number of bytes in the encoded rune that starts
// with b. It returns a value between 1 and 4, or -1 if b is not a valid
// UTF-8 first byte.
func RuneStartLen(b byte) int {
	if b <= 0b0111_1111 { // 0x00-0x7F
		return 1
	} else if b >= 0b1111_1000 { // 0xF8-0xFF: never a first byte
		return -1
	} else if b >= 0b1111_0000 { // 0xF0-0xF7
		return 4
	} else if b >= 0b1110_0000 { // 0xE0-0xEF
		return 3
	} else if b >= 0b1100_0000 { // 0xC0-0xDF
		return 2
	}
	return -1 // 0x80-0xBF: continuation byte
}
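As a rough illustration of the use case described above, the sketch below walks a byte slice one rune at a time using only each first byte. Both runeStartLen and splitRunes are hypothetical local helpers, not part of unicode/utf8; runeStartLen is a stand-in for the proposed function and is assumed to return -1 for 0xF8-0xFF as well as for continuation bytes. The loop does not validate continuation bytes, so it is only meaningful for input already known to be valid UTF-8.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeStartLen is a local stand-in for the proposed utf8.RuneStartLen,
// which is not in the standard library.
func runeStartLen(b byte) int {
	switch {
	case b <= 0x7F: // 0xxxxxxx
		return 1
	case b >= 0xF8: // 11111xxx never starts a sequence
		return -1
	case b >= 0xF0: // 11110xxx
		return 4
	case b >= 0xE0: // 1110xxxx
		return 3
	case b >= 0xC0: // 110xxxxx
		return 2
	}
	return -1 // 10xxxxxx: continuation byte
}

// splitRunes slices s at rune boundaries by looking only at first bytes.
// It reports false on an unexpected first byte or a truncated sequence.
func splitRunes(s []byte) ([][]byte, bool) {
	var out [][]byte
	for i := 0; i < len(s); {
		n := runeStartLen(s[i])
		if n < 0 || i+n > len(s) {
			return out, false
		}
		out = append(out, s[i:i+n])
		i += n
	}
	return out, true
}

func main() {
	parts, ok := splitRunes([]byte("aé€😀"))
	fmt.Println(ok, len(parts))
	for _, p := range parts {
		// Cross-check against the existing decoder.
		r, size := utf8.DecodeRune(p)
		fmt.Printf("%q: first byte %#x, length %d, DecodeRune size %d\n", r, p[0], len(p), size)
	}
}
```

The cross-check with utf8.DecodeRune shows that the same lengths are already obtainable from the existing API, at the cost of decoding the whole rune.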
changed the title from "proposal: utf8: given the first byte, determine how many bytes in the UTF-8 string" to "proposal: utf8: RuneStartLen to get the length of the rune from the first byte" on Aug 4, 2024
Activity
gabyhelp commented on Aug 2, 2024
Related Issues and Documentation
proposal: utf8.RuneIndexToByteIndex() #31879 (closed)
Package utf8 > func RuneCount
Package utf8 > func RuneLen
Package utf8 > func RuneStart
unicode/utf16: add RuneLen #44940 (closed)
proposal: unicode/utf8: rune count in a valid UTF-8 string #57896 (closed)
Package utf8 > func DecodeLastRune
Package utf8 > func DecodeRune
Package utf8 > func RuneCountInString
(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)
adonovan commented on Aug 5, 2024
This is a reasonable function, but it is rarely needed except by clients that are doing something unusually sophisticated, and it's a trivial consequence of the four constants that appear in the compact pictorial summary of UTF-8 found in any document on the subject, especially if you simplify each else if cond1 && cond2 to else if cond2. (Each first condition is trivially true as a consequence of the control flow.)
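To make the point about the four bit patterns concrete, here is a minimal sketch that derives the length in two ways: a chain with a single comparison per branch, and a count of the leading 1 bits in the first byte (0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx). The names chainLen and prefixLen are hypothetical, and the leading-ones variant is an alternative technique rather than anything proposed in this thread. Neither rejects 0xC0, 0xC1, or 0xF5-0xF7, which RFC 3629 says never appear in valid UTF-8.

```go
package main

import (
	"fmt"
	"math/bits"
)

// chainLen uses one comparison per branch; the ordering of the cases
// makes the lower bound of each range implicit.
func chainLen(b byte) int {
	switch {
	case b < 0x80: // 0xxxxxxx
		return 1
	case b >= 0xF8: // 11111xxx
		return -1
	case b >= 0xF0: // 11110xxx
		return 4
	case b >= 0xE0: // 1110xxxx
		return 3
	case b >= 0xC0: // 110xxxxx
		return 2
	}
	return -1 // 10xxxxxx: continuation byte
}

// prefixLen counts the leading 1 bits of b by counting the leading
// zeros of its complement.
func prefixLen(b byte) int {
	switch ones := bits.LeadingZeros8(^b); ones {
	case 0:
		return 1
	case 2, 3, 4:
		return ones
	}
	return -1 // one leading 1 (continuation) or five and more
}

func main() {
	mismatches := 0
	for i := 0; i < 256; i++ {
		if chainLen(byte(i)) != prefixLen(byte(i)) {
			mismatches++
		}
	}
	fmt.Println("mismatches:", mismatches) // prints "mismatches: 0"
}
```

Both helpers agree on all 256 byte values, which supports the observation that the function follows directly from the bit patterns.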