unicode/utf16: Add example on how to use utf16.DecodeRune #65498

soypat · 2024-02-03T16:34:04Z

Go version

go version go1.21.4 linux/amd64

Output of `go env` in your module/workspace:

N/A

What did you do?

Opened https://pkg.go.dev/unicode/utf16#DecodeRune

What did you see happen?

No examples on the page.

What did you expect to see?

An example on how to use DecodeRune to decode a []uint16 without performing heap allocations, similar to how utf8.DecodeRune works.
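A minimal sketch of what such an example might look like, assuming the goal is to turn a []uint16 into UTF-8 bytes inside a caller-provided buffer. The helper name appendUTF16AsUTF8 is hypothetical and not part of unicode/utf16. The policy of replacing unpaired surrogates with U+FFFD is also an assumption.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// appendUTF16AsUTF8 (hypothetical helper) decodes the UTF-16 code units in
// src and appends their UTF-8 encoding to dst. Unpaired surrogates become
// U+FFFD. The function allocates only if dst runs out of capacity.
func appendUTF16AsUTF8(dst []byte, src []uint16) []byte {
	for i := 0; i < len(src); i++ {
		r := rune(src[i])
		if utf16.IsSurrogate(r) {
			// A surrogate is only meaningful as a high/low pair.
			// DecodeRune returns U+FFFD when the two values are not one.
			r = utf8.RuneError
			if i+1 < len(src) {
				pair := utf16.DecodeRune(rune(src[i]), rune(src[i+1]))
				if pair != utf8.RuneError {
					r = pair
					i++ // the low surrogate was consumed as well
				}
			}
		}
		dst = utf8.AppendRune(dst, r)
	}
	return dst
}

func main() {
	// "a", U+00E9, and U+1F600 (the surrogate pair D83D DE00).
	src := []uint16{0x0061, 0x00e9, 0xd83d, 0xde00}
	var buf [16]byte // caller-owned storage
	out := appendUTF16AsUTF8(buf[:0], src)
	fmt.Printf("%q\n", out)
}
```

Passing buf[:0] lets the caller decide where the output lives. Whether that storage ends up on the stack or the heap still depends on escape analysis at the call site.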


soypat · 2024-02-03T22:12:26Z

From my understanding of the package, this is what I came up with to convert strings between UTF-8 and UTF-16. I had to dig into the utf16 package internals and copy some of that code to write this, specifically for the first function.

Leaving them here since they would be nice additions to show how to use the utf16 package with its more commonly used counterpart, utf8.

import (
	"encoding/binary"
	"errors"
	"unicode/utf16"
	"unicode/utf8"
)

func encodeUTF16to8(dstUTF8, srcUTF16 []byte, order16 binary.ByteOrder) (int, error) {
	// UTF16 values.
	const (
		// 0xd800-0xdc00 encodes the high 10 bits of a pair.
		// 0xdc00-0xe000 encodes the low 10 bits of a pair.
		// the value is those 20 bits plus 0x10000.
		surr1 = 0xd800
		surr2 = 0xdc00
		surr3 = 0xe000

		surrSelf = 0x10000
	)
	n := 0
	var r1, r2 rune
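	for {
		slen := len(srcUTF16)
		if slen == 0 {
			break
		}
		if slen < 2 {
			// A single trailing byte cannot hold a 16-bit code unit;
			// order16.Uint16 would panic on it.
			return n, errors.New("invalid utf16")
		}
		r1 = rune(order16.Uint16(srcUTF16))
		if slen >= 4 {
			r2 = rune(order16.Uint16(srcUTF16[2:]))
		}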
		var ar rune
		switch {
		case r1 < surr1, surr3 <= r1:
			// normal rune
			ar = r1
			srcUTF16 = srcUTF16[2:]
		case surr1 <= r1 && r1 < surr2 && slen >= 4 &&
			surr2 <= r2 && r2 < surr3:
			// valid surrogate sequence
			ar = utf16.DecodeRune(r1, r2)
			srcUTF16 = srcUTF16[4:]
		default:
			// invalid surrogate sequence
			return n, errors.New("invalid utf16")
		}
		// Encode the rune into UTF-8.
		if utf8.RuneLen(ar) > len(dstUTF8[n:]) {
			return n, errors.New("insufficient utf8 buffer")
		}
		n += utf8.EncodeRune(dstUTF8[n:], ar)
	}
	return n, nil
}

func encodeUTF8to16(dst16, src8 []byte, order16 binary.ByteOrder) (int, error) {
	n := 0
	for len(src8) > 0 {
		r1, size := utf8.DecodeRune(src8)
		src8 = src8[size:]
		switch {
		case r1 >= 0x10000:
			// Supplementary-plane rune: needs a surrogate pair.
			// (utf8.DecodeRune never returns a surrogate value, so
			// utf16.IsSurrogate is not the right test here.)
			if len(dst16[n:]) < 4 {
				return n, errors.New("insufficient utf16 buffer")
			}
			s1, s2 := utf16.EncodeRune(r1)
			order16.PutUint16(dst16[n:], uint16(s1))
			order16.PutUint16(dst16[n+2:], uint16(s2))
			n += 4
		default:
			// Basic Multilingual Plane rune: fits in one 16-bit code unit.
			// Invalid UTF-8 decodes to utf8.RuneError (U+FFFD) and is written as such.
			if len(dst16[n:]) < 2 {
				return n, errors.New("insufficient utf16 buffer")
			}
			order16.PutUint16(dst16[n:], uint16(r1))
			n += 2
		}
	}
	return n, nil
}
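For the UTF-8 to UTF-16 direction, a shorter variant seems possible when the destination is a []uint16 rather than a byte buffer with an explicit byte order: utf16.AppendRune (available since Go 1.20) appends into caller-provided storage. A minimal sketch under that assumption; the helper name appendStringAsUTF16 is hypothetical and not part of any package.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// appendStringAsUTF16 (hypothetical helper) appends the UTF-16 encoding of s
// to dst. It allocates only if dst runs out of capacity.
func appendStringAsUTF16(dst []uint16, s string) []uint16 {
	// Ranging over a string decodes UTF-8; invalid bytes yield U+FFFD.
	for _, r := range s {
		dst = utf16.AppendRune(dst, r)
	}
	return dst
}

func main() {
	var buf [32]uint16 // caller-owned storage
	out := appendStringAsUTF16(buf[:0], "héllo, 世界")
	fmt.Printf("%d code units: %04x\n", len(out), out)
}
```

Unlike the byte-oriented functions above, this version leaves serialization to a particular byte order to the caller.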

robpike · 2024-02-04T00:54:23Z

While I appreciate your desire to avoid heap allocation, all the uses of unicode/utf16 do the obvious conversion from []rune returned by this package into a string. It's easy and fast and very little code. If there's a bottleneck there, I'd like to see it in real life.

The unicode/utf8 package does not have cross-conversions like the one you suggest, although to be fair it doesn't really need them as the language supports that encoding directly.

In short, there seems little need for the routines you propose to add to the library.

I do believe that examples would be nice, but they should demonstrate the idiomatic conversion that everyone seems to use and not the complex code you show here.
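For reference, the idiomatic conversion mentioned above presumably looks like the following sketch. The function names utf16ToString and stringToUTF16 are hypothetical wrappers added for illustration; only utf16.Decode and utf16.Encode come from the package.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// utf16ToString converts UTF-16 code units to a Go string.
// utf16.Decode allocates a []rune, and the string conversion allocates again.
func utf16ToString(u []uint16) string {
	return string(utf16.Decode(u))
}

// stringToUTF16 converts a Go string to UTF-16 code units.
func stringToUTF16(s string) []uint16 {
	return utf16.Encode([]rune(s))
}

func main() {
	u := stringToUTF16("héllo, 世界")
	fmt.Println(len(u), utf16ToString(u))
}
```

This is short and clear, at the cost of intermediate allocations in both directions.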

soypat · 2024-02-04T14:11:49Z

Just to clarify my poorly worded comment: I meant adding the routines as examples so that they appear on pkg.go.dev.

As it turns out, there is no bottleneck in, say, "real" Go code. Like you say, Go's garbage collector is state of the art and doing the obvious conversion would most likely work fine. The issue lies in allocating with TinyGo on a microcontroller, where RAM is very limited and memory can easily become fragmented and eventually crash your program.

I understand TinyGo is not Go and that a more elegant fix would be to create a more robust GC in TinyGo, but that is a daunting task.

All this said, I'm not proposing to add these UTF-16/UTF-8 conversion routines to the package, only to include them as usage examples. There is, however, one part of the utf16 internals I'd very much like exposed. I've created a proposal here: #65511

Edit: I've noticed that adding the routine proposed in #65511 would simplify one of the conversion functions greatly:

func encodeUTF16to8(dstUTF8, srcUTF16 []byte, order16 binary.ByteOrder) (int, error) {
	n := 0
	for len(srcUTF16) > 1 {
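		// utf16.DecodeBytes is the routine proposed in #65511.
		r, size := utf16.DecodeBytes(srcUTF16, order16)
		if r == utf8.RuneError {
			return n, errors.New("invalid utf16 sequence")
		}
		srcUTF16 = srcUTF16[size:]
		// utf8.EncodeRune panics if the destination is too short.
		if utf8.RuneLen(r) > len(dstUTF8[n:]) {
			return n, errors.New("insufficient utf8 buffer")
		}
		n += utf8.EncodeRune(dstUTF8[n:], r)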
	}
	return n, nil
}

seankhliao added Documentation NeedsInvestigation labels Feb 3, 2024

seankhliao added this to the Unplanned milestone Jul 13, 2024
