strings: Map produces invalid utf-8 when passed PAD (U+0080) #25242

Closed
petercgrant opened this issue May 3, 2018 · 7 comments
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.

@petercgrant

What version of Go are you using (go version)?

go version go1.10.2 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

GOOS="darwin"
GOARCH="amd64"

What did you do?

Calling strings.ToLower on a string that contains the PAD (U+0080) character produces invalid UTF-8. The bug appears to be a faulty boundary condition in strings.Map: the comparisons r <= utf8.RuneSelf should probably be rewritten as r < utf8.RuneSelf, because utf8.RuneSelf is 0x80 and U+0080 itself needs a two-byte encoding.
https://play.golang.org/p/YVkdm_KyRPT

What did you expect to see?

0x41c280
true
0x61c280
true

What did you see instead?

0x41c280
true
0x6180
false

@josharian josharian changed the title strings.Map produces invalid utf-8 when passed PAD (U+0080) strings: Map produces invalid utf-8 when passed PAD (U+0080) May 3, 2018
@josharian josharian added this to the Go1.11 milestone May 3, 2018
@bradfitz bradfitz added the NeedsFix The path to resolution is known, but the work has not been done. label May 3, 2018
@martisch martisch self-assigned this May 4, 2018
@gopherbot

Change https://golang.org/cl/111286 mentions this issue: strings: fix encoding of \u0080 in map

@robpike
Contributor

robpike commented May 4, 2018

There may well be a bug here (and I believe you've diagnosed it) but this is the first time I've heard U+0080 called PAD. Where did you get that name?

@as
Contributor

as commented May 4, 2018

The word PAD is used on some websites that enumerate the Latin-1 Supplement block.

http://www.unicode-symbol.com/u/0080.html
https://www.compart.com/en/unicode/U+0080
https://codepoints.net/U+0080

But a document from unicode.org does not mention "PAD" or "PADDING CHARACTER" anywhere. It also does not render the glyph in a way that suggests "PAD" is an identifier for it.

http://www.unicode.org/charts/PDF/U0080.pdf

ISO 7816-4 described 0x80 in a padding scheme, also commonly known in some libraries as OneAndZeroes [SIC] padding, where it is the first byte followed by a number of 0x00 bytes.
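
For reference, the padding scheme described above can be sketched as follows. This is an illustration written from that one-sentence description, not code from any particular library; pad7816 and unpad7816 are hypothetical names.

```go
package main

import (
	"bytes"
	"fmt"
)

// pad7816 appends a single 0x80 byte and then 0x00 bytes until the length
// is a multiple of blockSize. At least one byte is always added.
func pad7816(data []byte, blockSize int) []byte {
	out := append([]byte{}, data...)
	out = append(out, 0x80)
	for len(out)%blockSize != 0 {
		out = append(out, 0x00)
	}
	return out
}

// unpad7816 strips the padding. It reports false if no 0x80 marker is found
// or if any byte after the last marker is non-zero.
func unpad7816(padded []byte) ([]byte, bool) {
	i := bytes.LastIndexByte(padded, 0x80)
	if i < 0 {
		return nil, false
	}
	for _, b := range padded[i+1:] {
		if b != 0x00 {
			return nil, false
		}
	}
	return padded[:i], true
}

func main() {
	p := pad7816([]byte("abc"), 8)
	fmt.Printf("%#x\n", p)
	u, ok := unpad7816(p)
	fmt.Printf("%q %v\n", u, ok)
}
```

Here 0x80 is a raw byte marker and has nothing to do with the code point U+0080 or its UTF-8 encoding.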

@petercgrant
Author

Best I can tell the name PAD originated in RFC 1345. In addition to the links in the previous reply, the name also appears on the Wikipedia page for C0 and C1 control codes and on Wiktionary. Both sources include a morsel of history. The name PAD ironically appeared in a proposal for C1 control pictures, which was apparently accepted (because we have them now?) even though the control picture in the standard has the name XXX instead of PAD. I called it by the only unambiguous, if unofficial, name I found.

If you think the name (or lack thereof) is a problem, let's appreciate that we aren't tasked with deciphering what the character means. The answer to that starts with It depends... and ends with a headache.

@anupcshan

@gopherbot Please consider this for backport to 1.10; it's a regression.

@gopherbot

Backport issue(s) opened: #25479 (for 1.10).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

@andybons
Member

@gopherbot please backport to 1.9 as well

9 participants