proposal: unicode/utf8: distinguish between bona fide encoding error and insufficient bytes #45898
Comments
The exact API and naming can still be figured out, but I also had a use-case for knowing whether some input was a possibly truncated UTF-8 sequence. At present, I have code that looks like:
switch r, rn := utf8.DecodeRune(b[n:]); {
case r == utf8.RuneError && rn == 1: // invalid UTF-8
switch {
case b[n]&0b111_00000 == 0b110_00000 && len(b) < n+2:
return n, io.ErrUnexpectedEOF
case b[n]&0b1111_0000 == 0b1110_0000 && len(b) < n+3:
return n, io.ErrUnexpectedEOF
case b[n]&0b11111_000 == 0b11110_000 && len(b) < n+4:
return n, io.ErrUnexpectedEOF
default:
return n, &SyntaxError{str: "invalid UTF-8 within string"}
}
case r < ' ':
...
}
Ideally, low-level encoding information like the above should be handled by the |
Given that the existing Perhaps:
// IsTruncated reports whether b is possibly the truncated prefix of some valid UTF-8 sequence.
func IsTruncated(b []byte) bool
Use of this API in my example above would look like:
switch r, rn := utf8.DecodeRune(b[n:]); {
case r == utf8.RuneError && rn == 1: // invalid UTF-8
if utf8.IsTruncated(b[n:]) {
return n, io.ErrUnexpectedEOF
}
return n, &SyntaxError{str: "invalid UTF-8 within string"}
case r < ' ':
...
} |
@dsnet I think for your case FullRune suffices: where you have utf8.IsTruncated, you could replace it with !utf8.FullRune. I filed this issue for a new interface that combines DecodeRune and FullRune. |
Thanks! I guess I never realized that I'm not sure I see a justifiable benefit of new API that combines In your example where you use |
It seems like FullRune is the answer here. Perhaps we should rewrite the internals a bit to make it more easily inlined. In particular if len(p) >= utf8.UTFMax then it should return true immediately, and that one length check would be inlined. But there doesn't seem to be any need for new API. |
This proposal has been added to the active column of the proposals project |
Based on the discussion above, this proposal seems like a likely decline. |
A slightly different but related problem I've just encountered: I want to make an efficient This was the function that I wrote - it wasn't entirely trivial to get right. Is there a nicer way to do this?
|
No change in consensus, so declined. |
@rogpeppe, I haven't tested this, but I'd have used something like:
|
@rsc iterating over the entire slice would add significant runtime overhead (it wouldn't have been acceptable in my case for example) - this function doesn't need to be O(len(p)). |
@rogpeppe It doesn't iterate over the entire slice; it iterates at most 4 times in the worst case (utf8.UTFMax).
When implementing a streaming UTF-8 decoder, it is often useful to know in which way a rune decode "failed."
For instance, suppose we want to write a func read(io.Reader) <-chan rune that takes a reader and returns a channel of decoded runes. A naïve implementation could be:
However, it's clear that this doesn't always work: a UTF-8 sequence that encodes a single rune could be split across multiple Reads. Unfortunately, utf8.DecodeRune does not give us a way of telling whether we have seen a "bona fide" decoding error or are just temporarily out of bytes: both return (RuneError, 1).
At present, one could use utf8.FullRune to implement this function correctly, as:
The back-to-back FullRune/DecodeRune calls are unfortunate, however, considering FullRune does many of the same computations as DecodeRune: a first lookup and an acceptRanges lookup. Ideally, we would have an incomplete boolean result coming from DecodeRune (or DecodeRune2), as say:
This proposal is mainly concerned with adding such functionality. The name could of course be debated further. Two options that I think are more realistic than DecodeRune2 are:
incomplete result.
Also up for debate is the direction of the "incomplete" boolean: should it be "incomplete" or "complete"?