proposal: utf8: RuneStartLen to get the length of the rune from the first byte #68716

@aymanbagabas

Proposal Details

I find myself in need of a function that, given the first byte of a UTF-8 sequence, reports how many bytes the encoded rune occupies. This is useful when iterating over a string byte by byte. Following RFC 3629, we can implement something like utf8.RuneStartLen(b byte) int.

Zig and Rust both implement this functionality. Go could offer the same.

// RuneStartLen reports the number of bytes in the encoded rune that begins
// with b. It returns a value from 1 to 4, or -1 if b cannot be the first
// byte of a UTF-8 sequence (a continuation byte, or 0xF8-0xFF).
func RuneStartLen(b byte) int {
	if b <= 0b0111_1111 { // 0x00-0x7F
		return 1
	} else if b >= 0b1111_1000 { // 0xF8-0xFF: never a first byte
		return -1
	} else if b >= 0b1111_0000 { // 0xF0-0xF7
		return 4
	} else if b >= 0b1110_0000 { // 0xE0-0xEF
		return 3
	} else if b >= 0b1100_0000 { // 0xC0-0xDF
		return 2
	}
	return -1 // 0x80-0xBF: continuation byte
}
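As an illustration of the "iterating over bytes" use case, here is a minimal sketch of how such a function might be used. The names runeStartLen and splitByLeadByte are hypothetical stand-ins (utf8.RuneStartLen does not exist in the standard library), and the sketch looks only at lead bytes, so it does not validate continuation bytes.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeStartLen is a local stand-in for the proposed utf8.RuneStartLen,
// which is not part of the standard library. It returns -1 for bytes
// that cannot begin a sequence (continuation bytes and 0xF8-0xFF).
func runeStartLen(b byte) int {
	switch {
	case b <= 0x7F:
		return 1
	case b < 0xC0 || b >= 0xF8:
		return -1
	case b >= 0xF0:
		return 4
	case b >= 0xE0:
		return 3
	default: // 0xC0-0xDF
		return 2
	}
}

// splitByLeadByte (hypothetical helper) cuts s into chunks using only
// each lead byte. Continuation bytes are not validated; an invalid lead
// byte or a truncated sequence is emitted one byte at a time.
func splitByLeadByte(s []byte) [][]byte {
	var out [][]byte
	for i := 0; i < len(s); {
		n := runeStartLen(s[i])
		if n < 0 || i+n > len(s) {
			n = 1
		}
		out = append(out, s[i:i+n])
		i += n
	}
	return out
}

func main() {
	for _, chunk := range splitByLeadByte([]byte("aé€😀")) {
		// Compare against the size reported by the existing decoder.
		r, size := utf8.DecodeRune(chunk)
		fmt.Printf("%q: %d bytes (DecodeRune size %d)\n", r, len(chunk), size)
	}
}
```

For well-formed input the chunk lengths should agree with utf8.DecodeRune; for malformed input only the full decoder can tell, which is why the sketch falls back to single-byte steps.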

Activity

added this to the Proposal milestone on Aug 2, 2024
gabyhelp commented on Aug 2, 2024

Related Issues and Documentation

  • proposal: utf8.RuneIndexToByteIndex() #31879 (closed)

  • [Package utf8 > func RuneCount](https://go.dev/pkg/unicode/utf8/?m=old#RuneCount)

  • [Package utf8 > func RuneLen](https://go.dev/pkg/unicode/utf8/?m=old#RuneLen)

  • [Package utf8 > func RuneStart](https://go.dev/pkg/unicode/utf8/?m=old#RuneStart)

  • unicode/utf16: add RuneLen #44940 (closed)

  • proposal: unicode/utf8: rune count in a valid UTF-8 string #57896 (closed)

  • [Package utf8 > func DecodeLastRune](https://go.dev/pkg/unicode/utf8/?m=old#DecodeLastRune)

  • [Package utf8 > func DecodeRune](https://go.dev/pkg/unicode/utf8/?m=old#DecodeRune)

  • [Package utf8 > func RuneCountInString](https://go.dev/pkg/unicode/utf8/?m=old#RuneCountInString)

moved this to Incoming in Proposals on Aug 2, 2024
changed the title from "proposal: utf8: given the first byte, determine how many bytes in the UTF-8 string" to "proposal: utf8: RuneStartLen to get the length of the rune from the first byte" on Aug 4, 2024
adonovan commented on Aug 5, 2024

Member

This is a reasonable function, but it is rarely needed except by clients that are doing something unusually sophisticated, and it's a trivial consequence of the four constants that appear in the compact pictorial summary of UTF-8 found in any document on the subject--especially if you simplify each else if cond1 && cond2 to else if cond2. (Each first condition is trivially true as a consequence of the control flow.)
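The "compact pictorial summary" is presumably the bit-pattern table from RFC 3629, in which the first byte has the form 0xxxxxxx, 110xxxxx, 1110xxxx, or 11110xxx. As a rough sketch of how directly the length follows from those patterns, the number of leading one bits in the first byte gives the answer. The helper name leadByteLen is hypothetical, and this is only one possible formulation, not the one proposed in the issue.

```go
package main

import (
	"fmt"
	"math/bits"
)

// leadByteLen (hypothetical helper) derives the sequence length from the
// number of leading one bits in b, following the RFC 3629 bit patterns.
// It returns -1 for bytes that cannot start a sequence.
func leadByteLen(b byte) int {
	// Inverting b turns leading ones into leading zeros.
	switch ones := bits.LeadingZeros8(^b); ones {
	case 0:
		return 1 // 0xxxxxxx
	case 2, 3, 4:
		return ones // 110xxxxx, 1110xxxx, 11110xxx
	default:
		return -1 // 10xxxxxx (continuation byte) or 11111xxx
	}
}

func main() {
	for _, b := range []byte{0x41, 0xC3, 0xE2, 0xF0, 0x80, 0xF8} {
		fmt.Printf("%#02x -> %d\n", b, leadByteLen(b))
	}
}
```

Like the chained comparisons, this inspects only the lead byte. It accepts 0xC0, 0xC1, and 0xF5-0xF7, which never occur in well-formed UTF-8, so full validation still needs a decoder such as utf8.DecodeRune.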
