unicode/utf16: Add example on how to use utf16.DecodeRune #65498

Open
soypat opened this issue Feb 3, 2024 · 3 comments
Labels
Documentation, NeedsInvestigation (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.)

Comments


soypat commented Feb 3, 2024

Go version

go version go1.21.4 linux/amd64

Output of go env in your module/workspace:

N/A

What did you do?

Opened https://pkg.go.dev/unicode/utf16#DecodeRune

What did you see happen?

No examples on the page.

What did you expect to see?

An example on how to use DecodeRune to decode a []uint16 without performing heap allocations, similar to how utf8.DecodeRune works.
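
A minimal sketch of what such an example might look like. The helper name appendUTF8 is hypothetical and not part of the standard library; it assumes the caller supplies a destination slice with enough capacity, in which case the append calls should not need to grow the buffer. Unpaired surrogates are replaced with U+FFFD, the same policy utf16.Decode uses.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// appendUTF8 is a hypothetical helper (not part of the standard library).
// It decodes the UTF-16 code units in src and appends their UTF-8 encoding
// to dst. Unpaired surrogates are replaced with U+FFFD.
func appendUTF8(dst []byte, src []uint16) []byte {
	for i := 0; i < len(src); i++ {
		r := rune(src[i])
		if utf16.IsSurrogate(r) {
			// Look at the next code unit, if there is one.
			var next rune
			if i+1 < len(src) {
				next = rune(src[i+1])
			}
			// DecodeRune returns U+FFFD unless (r, next) is a valid pair.
			r = utf16.DecodeRune(r, next)
			if r != utf8.RuneError {
				i++ // the low surrogate was consumed as well
			}
		}
		dst = utf8.AppendRune(dst, r)
	}
	return dst
}

func main() {
	// "hi " followed by U+1F600 encoded as a surrogate pair.
	src := []uint16{'h', 'i', ' ', 0xd83d, 0xde00}
	var buf [16]byte // fixed-size backing storage supplied by the caller
	out := appendUTF8(buf[:0], src)
	fmt.Printf("%q\n", out)
}
```

Whether the backing array stays off the heap depends on the compiler's escape analysis, so this only avoids allocations inside the decoding loop itself.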

seankhliao added the Documentation and NeedsInvestigation labels Feb 3, 2024
Author

soypat commented Feb 3, 2024

From my understanding of the package, this is what I came up with to convert between UTF-8 and UTF-16. I had to dig into the utf16 package internals and copy some of them to write this code, specifically for the first function.

Leaving them here since they would be nice additions to show how to use the utf16 package with its more commonly used counterpart, utf8.

func encodeUTF16to8(dstUTF8, srcUTF16 []byte, order16 binary.ByteOrder) (int, error) {
	// UTF16 values.
	const (
		// 0xd800-0xdc00 encodes the high 10 bits of a pair.
		// 0xdc00-0xe000 encodes the low 10 bits of a pair.
		// the value is those 20 bits plus 0x10000.
		surr1 = 0xd800
		surr2 = 0xdc00
		surr3 = 0xe000

		surrSelf = 0x10000
	)
	n := 0
	var r1, r2 rune
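	for len(srcUTF16) > 0 {
		slen := len(srcUTF16)
		if slen < 2 {
			// A lone trailing byte cannot form a 16-bit code unit.
			return n, errors.New("invalid utf16")
		}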
		r1 = rune(order16.Uint16(srcUTF16))
		if slen >= 4 {
			r2 = rune(order16.Uint16(srcUTF16[2:]))
		}
		var ar rune
		switch {
		case r1 < surr1, surr3 <= r1:
			// normal rune
			ar = r1
			srcUTF16 = srcUTF16[2:]
		case surr1 <= r1 && r1 < surr2 && slen >= 4 &&
			surr2 <= r2 && r2 < surr3:
			// valid surrogate sequence
			ar = utf16.DecodeRune(r1, r2)
			srcUTF16 = srcUTF16[4:]
		default:
			// invalid surrogate sequence
			return n, errors.New("invalid utf16")
		}
		// Encode the rune into UTF-8.
		if utf8.RuneLen(ar) > len(dstUTF8[n:]) {
			return n, errors.New("insufficient utf8 buffer")
		}
		n += utf8.EncodeRune(dstUTF8[n:], ar)
	}
	return n, nil
}

func encodeUTF8to16(dst16, src8 []byte, order16 binary.ByteOrder) (int, error) {
	n := 0
	for len(src8) > 0 {
		r1, size := utf8.DecodeRune(src8)
		if r1 == utf8.RuneError && size == 1 {
			return n, errors.New("invalid utf8")
		}
		src8 = src8[size:]
		switch {
		case r1 >= 0x10000:
			// Supplementary-plane rune: needs a surrogate pair.
			if len(dst16)-n < 4 {
				return n, errors.New("insufficient utf16 buffer")
			}
			hi, lo := utf16.EncodeRune(r1)
			order16.PutUint16(dst16[n:], uint16(hi))
			order16.PutUint16(dst16[n+2:], uint16(lo))
			n += 4
		default:
			// Basic Multilingual Plane rune: a single 16-bit code unit.
			if len(dst16)-n < 2 {
				return n, errors.New("insufficient utf16 buffer")
			}
			order16.PutUint16(dst16[n:], uint16(r1))
			n += 2
		}
	}
	return n, nil
}

Contributor

robpike commented Feb 4, 2024

While I appreciate your desire to avoid heap allocation, all the uses of unicode/utf16 do the obvious conversion from []rune returned by this package into a string. It's easy and fast and very little code. If there's a bottleneck there, I'd like to see it in real life.

The unicode/utf8 package does not have cross-conversions like the one you suggest, although to be fair it doesn't really need them as the language supports that encoding directly.

In short, there seems little need for the routines you propose to add to the library.

I do believe that examples would be nice, but they should demonstrate the idiomatic conversion that everyone seems to use and not the complex code you show here.
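
For reference, the idiomatic conversion mentioned above can be sketched as follows. The wrapper names utf16ToString and stringToUTF16 are hypothetical and exist only for illustration; the calls they wrap, utf16.Decode and utf16.Encode, are the package's existing API.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// utf16ToString is a hypothetical wrapper around the usual conversion:
// decode the code units to runes, then convert the runes to a string.
func utf16ToString(u []uint16) string {
	return string(utf16.Decode(u))
}

// stringToUTF16 is a hypothetical wrapper for the opposite direction.
func stringToUTF16(s string) []uint16 {
	return utf16.Encode([]rune(s))
}

func main() {
	u := stringToUTF16("héllo, 世界 😀")
	fmt.Println(len(u), "UTF-16 code units")
	fmt.Println(utf16ToString(u))
}
```

Both directions go through an intermediate []rune, which is the allocation the original request was trying to avoid.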

@soypat
Copy link
Author

soypat commented Feb 4, 2024

Just to clarify my poorly worded comment: I meant add the routines as an example so that they appear in pkg.go.dev.

So as it turns out, there's not a bottleneck in say "real" Go code. Like you say, Go's garbage collector is state of the art and doing the obvious conversion would most likely work fine. The issue lies in allocating with TinyGo on a microcontroller where RAM is very limited and memory can easily get fragmented and eventually crash your program.
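
One rough way to check how many heap allocations the obvious conversion performs with the standard toolchain is testing.AllocsPerRun. This is only a sketch: the helper name allocsPerConversion is hypothetical, the result depends on the Go version and input size, and it says nothing certain about how TinyGo behaves.

```go
package main

import (
	"fmt"
	"testing"
	"unicode/utf16"
)

// sink keeps the converted string reachable so the conversion
// is not optimized away.
var sink string

// allocsPerConversion is a hypothetical helper that reports the average
// number of heap allocations made by string(utf16.Decode(u)).
func allocsPerConversion(u []uint16) float64 {
	return testing.AllocsPerRun(1000, func() {
		sink = string(utf16.Decode(u))
	})
}

func main() {
	u := utf16.Encode([]rune("hello, 世界"))
	fmt.Println("allocations per conversion:", allocsPerConversion(u))
}
```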

I understand TinyGo is not Go and that a more elegant fix would be to create a more robust GC in TinyGo, but that is a daunting task.

All this said, I'm not proposing to add these UTF-16/UTF-8 conversion routines to the package, only to use them as usage examples. There is, however, one part of the utf16 internals I'd very much like exposed. I've created a proposal here: #65511

Edit: I've noticed that adding the routine proposed in #65511 would simplify one of the conversion functions greatly:

func encodeUTF16to8(dstUTF8, srcUTF16 []byte, order16 binary.ByteOrder) (int, error) {
	n := 0
	for len(srcUTF16) > 1 {
		r, size := utf16.DecodeBytes(srcUTF16, order16)
		if r == utf8.RuneError {
			return n, errors.New("invalid utf16 sequence")
		}
		srcUTF16 = srcUTF16[size:]
		if utf8.RuneLen(r) > len(dstUTF8[n:]) {
			return n, errors.New("insufficient utf8 buffer")
		}
		n += utf8.EncodeRune(dstUTF8[n:], r)
	}
	return n, nil
}
