Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: unicode/utf8: distinguish between bona fide encoding error and insufficient bytes #45898

Closed
TimothyGu opened this issue May 1, 2021 · 12 comments

Comments

@TimothyGu
Copy link
Contributor

TimothyGu commented May 1, 2021

When implementing a streaming UTF-8 decoder, it is often useful to know in which way a rune decode "failed."

For instance, suppose we want to write a func read(io.Reader) <-chan rune that takes a reader and returns a channel of decoded runes. A naïve implementation could be:

import (
	"io"
	"unicode/utf8"
)

func read(r io.Reader) <-chan rune {
        ch := make(chan rune)
        go func() {
                b := make([]byte, 512)
                for {
                        n, err := r.Read(b)
                        p := b[:n]
                        for len(p) > 0 {
                                r, size := utf8.DecodeRune(p)
                                ch <- r
                                p = p[size:]
                        }
                        if err != nil {
                                break
                        }
                }
                close(ch)
        }()
        return ch
}

However, it's clear that this doesn't always work: a UTF-8 sequence that encodes a single rune could spread into multiple Reads. Unfortunately, utf8.DecodeRune does not give us a way of figuring whether we have already seen a "bona fide" decoding error, or whether we are just temporarily out of bytes – both return (RuneError, 1).

At present, one could use utf8.FullRune to implement this function correctly, as:

 func read(r io.Reader) <-chan rune {
 	ch := make(chan rune)
 	go func() {
-		b := make([]byte, 512)
+		b := make([]byte, 0, 512)
 		for {
-			n, err := r.Read(b)
-			p := b[:n]
-			for len(p) > 0 {
+			n, err := r.Read(b[len(b):cap(b)])
+			p := b[:len(b)+n]
+			for len(p) > 0 && utf8.FullRune(p) {
 				r, size := utf8.DecodeRune(p)
 				ch <- r
 				p = p[size:]
 			}
+			b = b[:len(p)]
+			copy(b, p)
 			if err != nil {
 				break
 			}
 		}
+		for len(b) > 0 {
+			r, size := utf8.DecodeRune(b)
+			ch <- r
+			b = b[size:]
+		}
 		close(ch)
 	}()
 	return ch
 }

The close sequence of FullRune/DecodeRune is unfortunate, however, considering FullRune does many of the same computations as DecodeRune – a first lookup and an acceptRanges lookup. Ideally, we would have an incomplete boolean result coming from DecodeRune (or DecodeRune2), as say:

 func read(r io.Reader) <-chan rune {
 	ch := make(chan rune)
 	go func() {
-		b := make([]byte, 512)
+		b := make([]byte, 0, 512)
 		for {
-			n, err := r.Read(b)
-			p := b[:n]
+			n, err := r.Read(b[len(b):cap(b)])
+			p := b[:len(b)+n]
 			for len(p) > 0 {
-				r, size := utf8.DecodeRune(p)
+				r, incomplete, size := utf8.DecodeRune2(p)
+				if incomplete {
+					b = b[:len(p)]
+					copy(b, p)
+					break
+				}
 				ch <- r
 				p = p[size:]
 			}
 			if err != nil {
 				break
 			}
 		}
+		for len(b) > 0 {
+			r, size := utf8.DecodeRune(b)
+			ch <- r
+			b = b[size:]
+		}
 		close(ch)
 	}()
 	return ch
 }

This proposal is mainly concerned about adding such a functionality. The name could of course be further debated upon. Two options that I think are more realistic than DecodeRune2 are:

Also up for debate is the direction of the "incomplete" boolean: should it be "incomplete" or "complete"?

@gopherbot gopherbot added this to the Proposal milestone May 1, 2021
@dsnet
Copy link
Member

dsnet commented May 1, 2021

The exact API and naming can still be figured out, but I also had a use-case for knowing whether some input was a possibly truncated UTF-8 sequence.

At present, I have code that looks like:

switch r, rn := utf8.DecodeRune(b[n:]); {
case r == utf8.RuneError && rn == 1: // invalid UTF-8
	switch {
	case b[n]&0b111_00000 == 0b110_00000 && len(b) < n+2:
		return n, io.ErrUnexpectedEOF
	case b[n]&0b1111_0000 == 0b1110_0000 && len(b) < n+3:
		return n, io.ErrUnexpectedEOF
	case b[n]&0b11111_000 == 0b11110_000 && len(b) < n+4:
		return n, io.ErrUnexpectedEOF
	default:
		return n, &SyntaxError{str: "invalid UTF-8 within string"}
	}
case r < ' ':
	...
}

Ideally, low-level encoding information like the above should be handled by the utf8 package.

@dsnet
Copy link
Member

dsnet commented May 1, 2021

Given that the existing DecodeRune API already allows us to detect invalid UTF-8, perhaps we can have an API to just detect whether some input is the truncated form of some possibly valid UTF-8 sequence.

Perhaps:

// IsTruncated reports whether b is possibly the truncated prefix of some valid UTF-8 sequence.
func IsTruncated(b []byte) bool

Use of this API in my example above, would look like:

switch r, rn := utf8.DecodeRune(b[n:]); {
case r == utf8.RuneError && rn == 1: // invalid UTF-8
	if utf8.IsTruncated(b[n:]) {
		return n, io.ErrUnexpectedEOF
	}
	return n, &SyntaxError{str: "invalid UTF-8 within string"}
case r < ' ':
	...
}

@TimothyGu
Copy link
Contributor Author

@dsnet I think for your case FullRune suffices: where you have utf8.IsTruncated, you could replace it with !utf8.FullRune.

I filed this issue for a new interface that combines DecodeRune and FullRune.

@dsnet
Copy link
Member

dsnet commented May 1, 2021

Thanks! I guess I never realized that FullRune exists (I apologize for not carefully reading the full issue). I've always searched the documentation for words like "truncated", "partial", or "prefix". It didn't occur to me to search for "full".

I'm not sure I see a justifiable benefit of new API that combines DecodeRune and FullRune. One advantage would be performance, but it both your example and my case, a segmented rune is the exception, not the norm. Generally, when one is operating in a streaming manner, they usually operate on some chunk size (512 in your case; much larger for mine).

In your example where you use FullRune, couldn't you call DecodeRune first and only specially treat the situation where it returns RuneError and size==1 with an additional call to FullRune?

@rsc
Copy link
Contributor

rsc commented May 5, 2021

It seems like FullRune is the answer here. Perhaps we should rewrite the internals a bit to make it more easily inlined. In particular if len(p) >= utf8.UTFMax then it should return true immediately, and that one length check would be inlined. But there doesn't seem to be any need for new API.

@rsc
Copy link
Contributor

rsc commented May 5, 2021

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc
Copy link
Contributor

rsc commented May 12, 2021

Based on the discussion above, this proposal seems like a likely decline.
— rsc for the proposal review group

@rogpeppe
Copy link
Contributor

A slightly different but related problem I've just encountered: I want to make an efficient io.Reader implementation that returns an error when an invalid utf-8 sequence is read. Most of the time it's sufficient to call utf8.Valid on the bytes that have been read, but that doesn't work when there's a final truncated sequence. Finding the length of that sequence is non-trivial. I specifically want to avoid calling DecodeRune for every rune because that's considerably slower than the optimised Valid implementation.

This was the function that I wrote - it wasn't entirely trivial to get right. Is there a nicer way to do this?

// lastPartialCount returns the number of bytes in any final
// truncated utf-8 sequence in p. If the final portion of p cannot be
// the start of a valid utf-8 sequence, it returns 0.
func lastPartialCount(p []byte) int {
	end := len(p)
	if end == 0 {
		return 0
	}
	start := end - 1
	if p[start] < utf8.RuneSelf {
		return 0
	}
	lim := end - utf8.UTFMax
	if lim < 0 {
		lim = 0
	}
	for ; start >= lim; start-- {
		if utf8.RuneStart(p[start]) {
			break
		}
	}
	if start < 0 {
		start = 0
	}
	p = p[start:]
	r, rn := utf8.DecodeRune(p)
	if rn != 1 || r != utf8.RuneError {
		// It's a valid rune.
		return 0
	}
	if utf8.FullRune(p) {
		// It's really invalid, not just truncated.
		return 0
	}
	// Truncated.
	return len(p)
}

@rsc
Copy link
Contributor

rsc commented May 19, 2021

No change in consensus, so declined.
— rsc for the proposal review group

@rsc
Copy link
Contributor

rsc commented May 20, 2021

@rogpeppe, I haven't tested this, but I'd have used something like:

func lastPartialCount(p []byte) int {
	// Look for final start of rune.
	for i := 0; i < len(p) && i < utf8.UTFMax; {
		i++
		if utf8.RuneStart(p[len(p)-i]) {
			if utf8.FullRune(p[len(p)-i:]) {
				// Found full rune.
				return 0
			}
			// Found i bytes of partial rune.
			return i
		}			
	}
	// Did not find start of final rune - invalid or empty input.
	return 0
}

@rogpeppe
Copy link
Contributor

@rsc iterating over the entire slice would add significant runtime overhead (it wouldn't have been acceptable in my case for example) - this function doesn't need to be O(len(p)).

@i4ki
Copy link

i4ki commented Oct 24, 2021

@rogpeppe It doesn't iterate over the entire slice but just 4 times in the worst case (UTF8Max).

@rsc rsc moved this to Declined in Proposals Aug 10, 2022
@rsc rsc added this to Proposals Aug 10, 2022
@rsc rsc removed this from Proposals Oct 19, 2022
@golang golang locked and limited conversation to collaborators Oct 24, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

6 participants