Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unicode/utf8: FullRune returns false for a known invalid byte #11733

Closed
mpvl opened this issue Jul 16, 2015 · 3 comments
Closed

unicode/utf8: FullRune returns false for a known invalid byte #11733

mpvl opened this issue Jul 16, 2015 · 3 comments

Comments

@mpvl
Copy link
Contributor

mpvl commented Jul 16, 2015

Any UTF-8 sequence that starts with 0xc0 or 0c1 is ill-formed and DecodeRune always returns RuneError, 1 for such a byte. However, FullRune of []byte("\xc0") returns false, even though the spec says it will return true for known invalid bytes.

I can look into fixing this as part of a considering a table-based utf-8 decoder. A fix would be trivial if we adopt this algorithm.

@mpvl mpvl self-assigned this Jul 16, 2015
@josharian josharian changed the title FullRune returns false for a known invalid byte. unicode/utf8: FullRune returns false for a known invalid byte Jul 16, 2015
@ianlancetaylor ianlancetaylor added this to the Unplanned milestone Jul 16, 2015
@gopherbot
Copy link

CL https://golang.org/cl/16940 mentions this issue.

mpvl added a commit that referenced this issue Nov 16, 2015
This simplifies covering all cases, reducing the number of branches
and making unrolling for simpler functions manageable.
This significantly improves performance of non-ASCII input.

This change will also allow addressing Issue #11733 in an efficient
manner.

RuneCountTenASCIIChars-8             13.7ns ± 4%  13.5ns ± 2%     ~     (p=0.116 n=7+8)
RuneCountTenJapaneseChars-8           153ns ± 3%    74ns ± 2%  -51.42%  (p=0.000 n=8+8)
RuneCountInStringTenASCIIChars-8     13.5ns ± 2%  12.5ns ± 3%   -7.13%  (p=0.000 n=8+7)
RuneCountInStringTenJapaneseChars-8   145ns ± 2%    68ns ± 2%  -53.21%  (p=0.000 n=8+8)
ValidTenASCIIChars-8                 14.1ns ± 3%  12.5ns ± 5%  -11.38%  (p=0.000 n=8+8)
ValidTenJapaneseChars-8               147ns ± 3%    71ns ± 4%  -51.72%  (p=0.000 n=8+8)
ValidStringTenASCIIChars-8           12.5ns ± 3%  12.3ns ± 3%     ~     (p=0.095 n=8+8)
ValidStringTenJapaneseChars-8         146ns ± 4%    70ns ± 2%  -51.62%  (p=0.000 n=8+7)
DecodeASCIIRune-8                    5.91ns ± 2%  4.83ns ± 3%  -18.28%  (p=0.001 n=7+7)
DecodeJapaneseRune-8                 12.2ns ± 7%   8.5ns ± 3%  -29.79%  (p=0.000 n=8+7)
FullASCIIRune-8                      5.95ns ± 3%  4.27ns ± 1%  -28.23%  (p=0.000 n=8+7)
FullJapaneseRune-8                   12.0ns ± 6%   4.3ns ± 3%  -64.39%  (p=0.000 n=8+8)

Change-Id: Iea1d6b0180cbbee1739659a0a38038126beecaca
Reviewed-on: https://go-review.googlesource.com/16940
Reviewed-by: Russ Cox <rsc@golang.org>
@mpvl
Copy link
Contributor Author

mpvl commented Nov 16, 2015

CL 16940 already fixes this issue. Will add some more tests and another round of verifications before closing this.

@gopherbot
Copy link

CL https://golang.org/cl/17225 mentions this issue.

@mpvl mpvl closed this as completed in 1dae473 Dec 1, 2015
@golang golang locked and limited conversation to collaborators Dec 1, 2016
@rsc rsc unassigned mpvl Jun 23, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants