
proposal: Go 2: spec: define identifiers to be NFKC-normalized #27896

Open · bcmills opened this issue Sep 27, 2018 · 21 comments
Labels: LanguageChange, NeedsInvestigation, Proposal, v2
Milestone: Proposal

Comments

@bcmills (Contributor) commented Sep 27, 2018

Background

Go 1 allows Unicode letters as part of identifiers.
Normally that's quite useful, but sometimes it can lead to confusion: some letters look like, and are acceptable substitutes for, other similar letters.¹

For example, in this code snippet, the declared identifier is U+00B5 (MICRO SIGN), which is AltGr+m on the Linux altgr-intl layout, while its use is U+03BC (GREEK SMALL LETTER MU), which is on the long-press menu for the π key on the Google Gboard on-screen keyboard for Android and iOS.
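For illustration, a snippet of roughly that shape looks like this (the comments mark which code point is which):

	package main

	import "fmt"

	func main() {
		µs := 1.5 // declared with U+00B5 (MICRO SIGN)

		fmt.Println(μs) // used with U+03BC (GREEK SMALL LETTER MU)
		// The two names render identically but are distinct identifiers,
		// so this fails to compile: μs is undefined and µs is unused.
	}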

Fortunately, the Unicode standard defines a number of normalization forms that are intended to smooth out these differences. As it happens, µ and μ are equivalent under the “compatibility” forms NFKC and NFKD.
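That equivalence can be verified with the golang.org/x/text/unicode/norm package; a minimal sketch:

	package main

	import (
		"fmt"

		"golang.org/x/text/unicode/norm"
	)

	func main() {
		micro, mu := "\u00B5", "\u03BC" // MICRO SIGN, GREEK SMALL LETTER MU
		fmt.Println(norm.NFC.String(micro) == mu)  // false: the canonical form keeps them distinct
		fmt.Println(norm.NFKC.String(micro) == mu) // true: the compatibility form identifies them
	}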

Proposal

I propose that in Go 2, identifiers with the same NFKC normal form (with identifier modifications) should be considered the same identifier, and gofmt should automatically rewrite all identifiers within a program to their normal forms.

Compatibility

This change is not strictly Go 1 compatible: a Go 1 program may, for example, define and use distinct variables named µ and μ in the same scope. I argue that such programs are confusing at best and actively misleading at worst, so the only programs that would break under this proposal are ones that should be fixed regardless.
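For example, a program along these lines is valid Go 1 today, but under this proposal the second declaration would collide with the first:

	package main

	import "fmt"

	func main() {
		µ := 0.001 // U+00B5 (MICRO SIGN)
		μ := 0.002 // U+03BC (GREEK SMALL LETTER MU)
		fmt.Println(µ, μ) // today: two distinct variables, prints 0.001 0.002
	}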


(CC @mpvl @griesemer @robpike)

¹ In the example here, U+03BC is preferred over U+00B5: see https://unicode.org/reports/tr25/.


Revisions to this proposal:

  • Changed to apply the tr31 NFKC modifications for identifiers.
gopherbot added this to the Proposal milestone Sep 27, 2018
bcmills added the LanguageChange and v2 labels Sep 27, 2018
@dmitshur (Contributor) commented
Related issues:

@ianlancetaylor (Contributor) commented
Unicode has specific recommendations for identifiers at http://unicode.org/reports/tr31/ .

@bcmills (Contributor, Author) commented Sep 27, 2018

@ianlancetaylor Thanks for the reference. From that document, it seems like what I want is UAX31-R4 (Equivalent Normalized Identifiers), which implies that we also need the NFKC modifications.

@bcmills (Contributor, Author) commented Sep 28, 2018

One downside of choosing NFKC over NFC is that my cutesy subscript variables like aᵢ := a[i] would normalize to ai := a[i], which doesn't look quite as nice. Still, I think that's a fairly small price for the reduction in ambiguous cases.
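For reference, the plain NFKC mapping (before any TR31 identifier modifications) does fold the subscript; a quick check with golang.org/x/text/unicode/norm:

	package main

	import (
		"fmt"

		"golang.org/x/text/unicode/norm"
	)

	func main() {
		// U+1D62 LATIN SUBSCRIPT SMALL LETTER I compatibility-decomposes to 'i'.
		fmt.Println(norm.NFKC.String("aᵢ")) // prints "ai"
	}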

ianlancetaylor added the NeedsInvestigation label Oct 23, 2018
@bakul commented Nov 29, 2018

I propose that in Go 2, identifiers with the same NFKC normal form (with identifier modifications) should be considered the same identifier, and gofmt should automatically rewrite all identifiers within a program to their normal forms.

If I may propose a slight modification, gofmt should rewrite all identifiers to the form used in a definition/declaration. If something like aᵢ := a[i] is used, it would be to enhance readability. This is analogous to how a case-preserving but case-insensitive filesystem handles filenames.

@bcmills (Contributor, Author) commented Nov 29, 2018

gofmt should rewrite all identifiers to the form used in a definition/declaration

That does sound nicer, but also much more difficult to implement (since it requires semantic information about variable scopes).

But perhaps more importantly, it wouldn't play as nicely with command-line search tools (such as grep): with normalized identifiers, one can filter the query through a normalizer and then perform a search for the literal byte sequence, but with identifiers-as-declared that no longer works.
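For illustration, that normalizer could be a tiny filter like the one below (the command name normid is made up here), used as grep -rF "$(printf 'µ' | normid)" .:

	// normid copies stdin to stdout in NFKC normal form, so that a search
	// query can be normalized before being handed to grep.
	package main

	import (
		"io"
		"log"
		"os"

		"golang.org/x/text/unicode/norm"
	)

	func main() {
		w := norm.NFKC.Writer(os.Stdout) // everything written through w comes out NFKC-normalized
		defer w.Close()                  // Close flushes any buffered, partially normalized text
		if _, err := io.Copy(w, os.Stdin); err != nil {
			log.Fatal(err)
		}
	}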

@bakul commented Nov 29, 2018

The compiler should already have that information -- where an object is defined. It now just has to save a "display" string, which can only be set at the point of definition. This is in addition to a normalized identifier.

As for search tools, it would be somewhat similar to how the -i option is handled. Just as grep -i foo turns into grep '[fF][oO][oO]', maybe an -I option could be used to create an RE such as a[ᵢi]. That is, grep can be taught to handle Unicode TR31.

@bcmills (Contributor, Author) commented Nov 29, 2018

The compiler should already have that information

gofmt is not integrated with the compiler.

grep can be taught to handle Unicode TR31.

Go is used on a wide variety of platforms: the Go team cannot unilaterally change core tools on those platforms. (Even if we could, it would be an enormous amount of work.)

@andersk commented Nov 30, 2018

Another possible variant: make it an error to declare two NFKC-equivalent identifiers in the same scope, and say that an identifier declared in an inner scope shadows any NFKC-equivalent identifiers from outer scopes, but continue to require that all uses of an identifier exactly match the form it was declared with.

Then we get the following desirable properties:

  • µ := 1; { μ := 2; return μ; } returns 2, and µ := 1; { μ := 2; return µ; } is an error, but we prevent any homoglyph of that code from returning 1. (Even without relying on “declared but not used” errors—imagine there were other uses in both scopes.)
  • Working Go 1 code might be rejected by Go 2, but it won’t be silently accepted with a different meaning.
  • Dumb grep works as expected.
  • gofmt could normalize all identifiers without risk of breaking code, or it could leave all identifiers as is; either way, it doesn’t need any scope awareness.
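For the error-in-the-same-scope half of this rule, a rough sketch of a mechanical check (illustrative only; it ignores the shadowing and use-site rules and assumes the golang.org/x/text/unicode/norm package):

	// nfkcdups reports identifiers that are declared in the same scope, are
	// spelled differently, and share an NFKC normal form.
	package main

	import (
		"fmt"
		"go/ast"
		"go/importer"
		"go/parser"
		"go/token"
		"go/types"
		"log"
		"os"

		"golang.org/x/text/unicode/norm"
	)

	func main() {
		fset := token.NewFileSet()
		file, err := parser.ParseFile(fset, os.Args[1], nil, 0)
		if err != nil {
			log.Fatal(err)
		}

		// Resolve declarations; type errors are collected and ignored,
		// since only the Defs map is needed here.
		info := &types.Info{Defs: make(map[*ast.Ident]types.Object)}
		conf := types.Config{Importer: importer.Default(), Error: func(error) {}}
		conf.Check("p", fset, []*ast.File{file}, info)

		// Within each scope, key every defining identifier by its NFKC form.
		seen := make(map[*types.Scope]map[string]string)
		for id, obj := range info.Defs {
			if obj == nil || obj.Parent() == nil {
				continue
			}
			scope, key := obj.Parent(), norm.NFKC.String(id.Name)
			if seen[scope] == nil {
				seen[scope] = make(map[string]string)
			}
			if prev, ok := seen[scope][key]; ok && prev != id.Name {
				fmt.Printf("%s: %q and %q are NFKC-equivalent in the same scope\n",
					fset.Position(id.Pos()), prev, id.Name)
			}
			seen[scope][key] = id.Name
		}
	}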

@mpvl (Contributor) commented Dec 7, 2018

A simple solution would be to disallow any character with a decomposition type "font" and permit all others as is. This would allow the cutesy aᵢ, but avoid the issue pointed out by @andersk.

However, disallowing or normalizing to NFKC is not backwards compatible.

@mpvl (Contributor) commented Dec 7, 2018

@bcmills: in general, normalizing identifiers to NFKC is not a good idea. NFKC normalization has a tendency to break properties of strings: a property that holds for a string before normalization may no longer hold afterwards. Handling such cases properly can be hard.

@bcmills (Contributor, Author) commented Dec 7, 2018

NFKC normalization has a tendency to break properties of strings, meaning that if a string has some property before NFKC normalization, NFKC normalization may break it.

Could you elaborate on how that impacts identifiers specifically? (I'm not proposing that we normalize Go source overall — only identifiers.)

@bcmills (Contributor, Author) commented Dec 7, 2018

@andersk

  • Working Go 1 code might be rejected by Go 2, but it won’t be silently accepted with a different meaning.
    […]
  • gofmt could normalize all identifiers without risk of breaking code, or it could leave all identifiers as is; either way, it doesn’t need any scope awareness.

I don't see how those properties can both hold. gofmt must not change the meaning of the code: if a program containing NFKC-equivalent identifiers is invalid before running gofmt, then it should remain invalid after.

@bcmills (Contributor, Author) commented Dec 7, 2018

Here's a neat example to consider (https://play.golang.org/p/NX3DW0YIGmE), because of an unfortunate interaction with permissive redeclaration (see also #377):

	var µ = 1
	μ, ok := 2, true

If identifiers are not compatibility-normalized, then µ and μ are separate variables, with values 1 and 2 respectively. If identifiers are compatibility-normalized, then µ and μ refer to the same variable (with value 2).

@mpvl (Contributor) commented Dec 7, 2018

@bcmills: re: the impact of NFKC normalization on identifiers: TR31 points this out quite nicely. In order to make NFKC work properly for this case, it recommends implementing the modifications mentioned in http://unicode.org/reports/tr31/#NFKC_Modifications. So luckily the details are hashed out there, but anyone wanting to augment the spec for identifiers needs to consider what this entails.

Simply disallowing runes with a decomposition type "font" is simpler and gets the best of both worlds (avoids the problem you mention and also allows using aᵢ as an identifier).

@mpvl (Contributor) commented Dec 7, 2018

But note that even disallowing runes with a decomposition type "font" only eliminates a small portion of such possible confusion. For instance, it would still allow the use of ο, o and о, as well as various other nastinesses. So I doubt it is worth breaking backwards compatibility for.

More feasible would be writing a vet tool that detects cases of confusability.

@bcmills (Contributor, Author) commented Dec 7, 2018

@mpvl, I don't see how disallowing runes with decomposition type “font” helps at all. µ has decomposition type compat, not font. And there are even worse examples: Ω (U+2126 OHM SIGN) and Ω (U+03A9 GREEK CAPITAL LETTER OMEGA) are similarly visually indistinguishable, and aren't even type compat (their equivalence is canonical).

@bcmills (Contributor, Author) commented Dec 7, 2018

More feasible would be writing a vet tool that detects cases of confusability.

That's #20115, but that only solves part of the problem: it flags confusing code once it is already written, but doesn't allow someone with a “reasonable” international keyboard layout to, for example, reliably type an identifier that they've just read in other documentation, even if they know what language it was written in and are using characters appropriate to that language.

@mpvl (Contributor) commented Dec 7, 2018

Compat could be another candidate, indeed. No need to decompose "circle", "sub", "super", etc., though.

Aside from exactly which decomposition types to include, the point is that disallowing runes deemed illegal for identifiers is better than silently decomposing them. It avoids implementation complexity and is clearer overall.

@bakul commented Dec 7, 2018

With this normalization, will identifiers like "рое" and "poe" be considered equivalent? The first, "рое", is in Cyrillic.

@bcmills (Contributor, Author) commented Dec 7, 2018

@bakul, no. Normalization doesn't convert between Cyrillic and Latin.
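A quick check with golang.org/x/text/unicode/norm bears that out:

	package main

	import (
		"fmt"

		"golang.org/x/text/unicode/norm"
	)

	func main() {
		cyr := "\u0440\u043E\u0435" // CYRILLIC ER, O, IE: "рое"
		fmt.Println(norm.NFKC.String(cyr) == cyr)   // true: already in normal form
		fmt.Println(norm.NFKC.String(cyr) == "poe") // false: no cross-script folding
	}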
