proposal: unicode/utf8: distinguish between bona fide encoding error and insufficient bytes #45898

TimothyGu · 2021-05-01T06:18:45Z

When implementing a streaming UTF-8 decoder, it is often useful to know in which way a rune decode "failed."

For instance, suppose we want to write a func read(io.Reader) <-chan rune that takes a reader and returns a channel of decoded runes. A naïve implementation could be:

import (
	"io"
	"unicode/utf8"
)

func read(r io.Reader) <-chan rune {
        ch := make(chan rune)
        go func() {
                b := make([]byte, 512)
                for {
                        n, err := r.Read(b)
                        p := b[:n]
                        for len(p) > 0 {
                                r, size := utf8.DecodeRune(p)
                                ch <- r
                                p = p[size:]
                        }
                        if err != nil {
                                break
                        }
                }
                close(ch)
        }()
        return ch
}

However, it's clear that this doesn't always work: a UTF-8 sequence that encodes a single rune could spread into multiple Reads. Unfortunately, utf8.DecodeRune does not give us a way of figuring whether we have already seen a "bona fide" decoding error, or whether we are just temporarily out of bytes – both return (RuneError, 1).

At present, one could use utf8.FullRune to implement this function correctly, as:

 func read(r io.Reader) <-chan rune {
 	ch := make(chan rune)
 	go func() {
-		b := make([]byte, 512)
+		b := make([]byte, 0, 512)
 		for {
-			n, err := r.Read(b)
-			p := b[:n]
-			for len(p) > 0 {
+			n, err := r.Read(b[len(b):cap(b)])
+			p := b[:len(b)+n]
+			for len(p) > 0 && utf8.FullRune(p) {
 				r, size := utf8.DecodeRune(p)
 				ch <- r
 				p = p[size:]
 			}
+			b = b[:len(p)]
+			copy(b, p)
 			if err != nil {
 				break
 			}
 		}
+		for len(b) > 0 {
+			r, size := utf8.DecodeRune(b)
+			ch <- r
+			b = b[size:]
+		}
 		close(ch)
 	}()
 	return ch
 }

The close sequence of FullRune/DecodeRune is unfortunate, however, considering FullRune does many of the same computations as DecodeRune – a first lookup and an acceptRanges lookup. Ideally, we would have an incomplete boolean result coming from DecodeRune (or DecodeRune2), as say:

 func read(r io.Reader) <-chan rune {
 	ch := make(chan rune)
 	go func() {
-		b := make([]byte, 512)
+		b := make([]byte, 0, 512)
 		for {
-			n, err := r.Read(b)
-			p := b[:n]
+			n, err := r.Read(b[len(b):cap(b)])
+			p := b[:len(b)+n]
 			for len(p) > 0 {
-				r, size := utf8.DecodeRune(p)
+				r, incomplete, size := utf8.DecodeRune2(p)
+				if incomplete {
+					b = b[:len(p)]
+					copy(b, p)
+					break
+				}
 				ch <- r
 				p = p[size:]
 			}
 			if err != nil {
 				break
 			}
 		}
+		for len(b) > 0 {
+			r, size := utf8.DecodeRune(b)
+			ch <- r
+			b = b[size:]
+		}
 		close(ch)
 	}()
 	return ch
 }

This proposal is mainly concerned about adding such a functionality. The name could of course be further debated upon. Two options that I think are more realistic than DecodeRune2 are:

Roll this change in with unicode/utf8: add "Rune"-less aliases #11743, where the "Rune"-less variant of the functions has an incomplete result.
Change the signature of utf8.DecodeRune in Go 2.

Also up for debate is the direction of the "incomplete" boolean: should it be "incomplete" or "complete"?

The text was updated successfully, but these errors were encountered:

dsnet · 2021-05-01T07:37:05Z

The exact API and naming can still be figured out, but I also had a use-case for knowing whether some input was a possibly truncated UTF-8 sequence.

At present, I have code that looks like:

switch r, rn := utf8.DecodeRune(b[n:]); {
case r == utf8.RuneError && rn == 1: // invalid UTF-8
	switch {
	case b[n]&0b111_00000 == 0b110_00000 && len(b) < n+2:
		return n, io.ErrUnexpectedEOF
	case b[n]&0b1111_0000 == 0b1110_0000 && len(b) < n+3:
		return n, io.ErrUnexpectedEOF
	case b[n]&0b11111_000 == 0b11110_000 && len(b) < n+4:
		return n, io.ErrUnexpectedEOF
	default:
		return n, &SyntaxError{str: "invalid UTF-8 within string"}
	}
case r < ' ':
	...
}

Ideally, low-level encoding information like the above should be handled by the utf8 package.

dsnet · 2021-05-01T07:43:20Z

Given that the existing DecodeRune API already allows us to detect invalid UTF-8, perhaps we can have an API to just detect whether some input is the truncated form of some possibly valid UTF-8 sequence.

Perhaps:

// IsTruncated reports whether b is possibly the truncated prefix of some valid UTF-8 sequence.
func IsTruncated(b []byte) bool

Use of this API in my example above, would look like:

switch r, rn := utf8.DecodeRune(b[n:]); {
case r == utf8.RuneError && rn == 1: // invalid UTF-8
	if utf8.IsTruncated(b[n:]) {
		return n, io.ErrUnexpectedEOF
	}
	return n, &SyntaxError{str: "invalid UTF-8 within string"}
case r < ' ':
	...
}

TimothyGu · 2021-05-01T08:24:39Z

@dsnet I think for your case FullRune suffices: where you have utf8.IsTruncated, you could replace it with !utf8.FullRune.

I filed this issue for a new interface that combines DecodeRune and FullRune.

dsnet · 2021-05-01T22:27:12Z

Thanks! I guess I never realized that FullRune exists (I apologize for not carefully reading the full issue). I've always searched the documentation for words like "truncated", "partial", or "prefix". It didn't occur to me to search for "full".

I'm not sure I see a justifiable benefit of new API that combines DecodeRune and FullRune. One advantage would be performance, but it both your example and my case, a segmented rune is the exception, not the norm. Generally, when one is operating in a streaming manner, they usually operate on some chunk size (512 in your case; much larger for mine).

In your example where you use FullRune, couldn't you call DecodeRune first and only specially treat the situation where it returns RuneError and size==1 with an additional call to FullRune?

rsc · 2021-05-05T17:34:14Z

It seems like FullRune is the answer here. Perhaps we should rewrite the internals a bit to make it more easily inlined. In particular if len(p) >= utf8.UTFMax then it should return true immediately, and that one length check would be inlined. But there doesn't seem to be any need for new API.

rsc · 2021-05-05T18:26:30Z

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

rsc · 2021-05-12T18:02:17Z

Based on the discussion above, this proposal seems like a likely decline.
— rsc for the proposal review group

rogpeppe · 2021-05-14T12:23:43Z

A slightly different but related problem I've just encountered: I want to make an efficient io.Reader implementation that returns an error when an invalid utf-8 sequence is read. Most of the time it's sufficient to call utf8.Valid on the bytes that have been read, but that doesn't work when there's a final truncated sequence. Finding the length of that sequence is non-trivial. I specifically want to avoid calling DecodeRune for every rune because that's considerably slower than the optimised Valid implementation.

This was the function that I wrote - it wasn't entirely trivial to get right. Is there a nicer way to do this?

// lastPartialCount returns the number of bytes in any final
// truncated utf-8 sequence in p. If the final portion of p cannot be
// the start of a valid utf-8 sequence, it returns 0.
func lastPartialCount(p []byte) int {
	end := len(p)
	if end == 0 {
		return 0
	}
	start := end - 1
	if p[start] < utf8.RuneSelf {
		return 0
	}
	lim := end - utf8.UTFMax
	if lim < 0 {
		lim = 0
	}
	for ; start >= lim; start-- {
		if utf8.RuneStart(p[start]) {
			break
		}
	}
	if start < 0 {
		start = 0
	}
	p = p[start:]
	r, rn := utf8.DecodeRune(p)
	if rn != 1 || r != utf8.RuneError {
		// It's a valid rune.
		return 0
	}
	if utf8.FullRune(p) {
		// It's really invalid, not just truncated.
		return 0
	}
	// Truncated.
	return len(p)
}

rsc · 2021-05-19T18:30:16Z

No change in consensus, so declined.
— rsc for the proposal review group

rsc · 2021-05-20T13:26:53Z

@rogpeppe, I haven't tested this, but I'd have used something like:

func lastPartialCount(p []byte) int {
	// Look for final start of rune.
	for i := 0; i < len(p) && i < utf8.UTFMax; {
		i++
		if utf8.RuneStart(p[len(p)-i]) {
			if utf8.FullRune(p[len(p)-i:]) {
				// Found full rune.
				return 0
			}
			// Found i bytes of partial rune.
			return i
		}			
	}
	// Did not find start of final rune - invalid or empty input.
	return 0
}

rogpeppe · 2021-05-20T17:17:56Z

@rsc iterating over the entire slice would add significant runtime overhead (it wouldn't have been acceptable in my case for example) - this function doesn't need to be O(len(p)).

i4ki · 2021-10-24T11:01:03Z

@rogpeppe It doesn't iterate over the entire slice but just 4 times in the worst case (UTF8Max).

gopherbot added this to the Proposal milestone May 1, 2021

gopherbot added the Proposal label May 1, 2021

rsc mentioned this issue May 5, 2021

proposal: review meeting minutes #33502

Open

rsc added the Proposal-FinalCommentPeriod label May 12, 2021

rsc removed the Proposal-FinalCommentPeriod label May 19, 2021

rsc closed this as completed May 19, 2021

rsc moved this to Declined in Proposals Aug 10, 2022

rsc added this to Proposals Aug 10, 2022

rsc removed this from Proposals Oct 19, 2022

golang locked and limited conversation to collaborators Oct 24, 2022

gopherbot added the FrozenDueToAge label Oct 24, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

proposal: unicode/utf8: distinguish between bona fide encoding error and insufficient bytes #45898

proposal: unicode/utf8: distinguish between bona fide encoding error and insufficient bytes #45898

TimothyGu commented May 1, 2021 •

edited

Loading

dsnet commented May 1, 2021 •

edited

Loading

dsnet commented May 1, 2021

TimothyGu commented May 1, 2021

dsnet commented May 1, 2021

rsc commented May 5, 2021

rsc commented May 5, 2021

rsc commented May 12, 2021

rogpeppe commented May 14, 2021

rsc commented May 19, 2021

rsc commented May 20, 2021

rogpeppe commented May 20, 2021

i4ki commented Oct 24, 2021

proposal: unicode/utf8: distinguish between bona fide encoding error and insufficient bytes #45898

proposal: unicode/utf8: distinguish between bona fide encoding error and insufficient bytes #45898

Comments

TimothyGu commented May 1, 2021 • edited Loading

dsnet commented May 1, 2021 • edited Loading

dsnet commented May 1, 2021

TimothyGu commented May 1, 2021

dsnet commented May 1, 2021

rsc commented May 5, 2021

rsc commented May 5, 2021

rsc commented May 12, 2021

rogpeppe commented May 14, 2021

rsc commented May 19, 2021

rsc commented May 20, 2021

rogpeppe commented May 20, 2021

i4ki commented Oct 24, 2021

TimothyGu commented May 1, 2021 •

edited

Loading

dsnet commented May 1, 2021 •

edited

Loading