Unable to use some Hindi unicode characters in source code as identifiers #42830

Closed
harshrathod50 opened this issue Nov 25, 2020 · 11 comments

@harshrathod50

What version of Go are you using (go version)?

$ go version
go version go1.15.2 windows/amd64

Does this issue reproduce with the latest release?

I don't know.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
set GO111MODULE=
set GOARCH=amd64
set GOBIN=
set GOCACHE=C:\Users\Harsh Rathod\AppData\Local\go-build
set GOENV=C:\Users\Harsh Rathod\AppData\Roaming\go\env
set GOEXE=.exe
set GOFLAGS=
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOINSECURE=
set GOMODCACHE=C:\Users\Harsh Rathod\go\pkg\mod
set GONOPROXY=
set GONOSUMDB=
set GOOS=windows
set GOPATH=C:\Users\Harsh Rathod\go
set GOPRIVATE=
set GOPROXY=https://proxy.golang.org,direct
set GOROOT=c:\go
set GOSUMDB=sum.golang.org
set GOTMPDIR=
set GOTOOLDIR=c:\go\pkg\tool\windows_amd64
set GCCGO=gccgo
set AR=ar
set CC=gcc
set CXX=g++
set CGO_ENABLED=1
set GOMOD=
set CGO_CFLAGS=-g -O2
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-g -O2
set CGO_FFLAGS=-g -O2
set CGO_LDFLAGS=-g -O2
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0 -fdebug-prefix-map=C:\Users\HARSHR~1\AppData\Local\Temp\go-build432409930=/tmp/go-build -gno-record-gcc-switches

What did you do?

As per the spec, I should be able to use Unicode characters in source code, but I am unable to do so, as seen here.

What did you expect to see?

I expected the program to compile as usual. However, it only compiles when I remove some specific Hindi characters from the identifier, changing पोर्ट_नंबर to परट_नबर; the output is then:

Port: 5432

What did you see instead?

./prog.go:6:5: invalid character U+094B 'ो' in identifier
./prog.go:6:11: invalid character U+094D '्' in identifier
./prog.go:6:21: invalid character U+0902 'ं' in identifier
./prog.go:7:26: invalid character U+094B 'ो' in identifier
./prog.go:7:32: invalid character U+094D '्' in identifier
./prog.go:7:42: invalid character U+0902 'ं' in identifier
@seankhliao
Member

spec:

unicode_letter = /* a Unicode code point classified as "Letter" */

The invalid characters are classed as (forgive the mangling of the characters)

| character | code point | class |
|-----------|------------|-------|
| 'ो' | U+094B | Mark, Spacing Combining [Mc] |
| '्' | U+094D | Mark, Nonspacing [Mn] |
| 'ं' | U+0902 | Mark, Nonspacing [Mn] |
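
These classifications can be checked with Go's standard `unicode` package. The following is a minimal sketch, not the compiler's own code; the helper name `category` is made up for illustration, and it only distinguishes the categories mentioned in this thread.

```go
package main

import (
	"fmt"
	"unicode"
)

// category returns a short label for the Unicode general category of r.
// It is a hypothetical helper and only covers the categories discussed here.
func category(r rune) string {
	switch {
	case unicode.IsLetter(r):
		return "L" // any Letter category (Lu, Ll, Lt, Lm, Lo)
	case unicode.Is(unicode.Mc, r):
		return "Mc" // Mark, Spacing Combining
	case unicode.Is(unicode.Mn, r):
		return "Mn" // Mark, Nonspacing
	case unicode.Is(unicode.Nd, r):
		return "Nd" // Number, Decimal Digit
	default:
		return "other"
	}
}

func main() {
	// Print the code point and category of every rune in the rejected identifier.
	for _, r := range "पोर्ट_नंबर" {
		fmt.Printf("%U %s\n", r, category(r))
	}
}
```

Every rune reported as `Mc` or `Mn` is one the spec's `unicode_letter` production does not cover.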

@ALTree
Member

ALTree commented Nov 25, 2020

Variable names need to have letters in them:

unicode_letter = /* a Unicode code point classified as "Letter" */ .

so as it was pointed out this is working as expected.

@ALTree ALTree closed this as completed Nov 25, 2020
@harshrathod50
Author

Wait, this is too early. I am drafting my reply.

@ALTree
Member

ALTree commented Nov 25, 2020

If it turns out this is a bug we'll re-open. We usually quickly close by default.

@harshrathod50
Author

Okay, let me explain. There is a misconception regarding the Hindi language here. Hindi language characters are divided into two categories: Swar(s) and Matra(s).

Here is the list of some Hindi literals which are Matra(s):

ऀ  ँ  ं  ः  ऺ  ऻ  ़  ा  ि  ी  ु  ू  ृ  ॄ  ॅ  ॆ  े  ै  ॉ  ॊ  ो  ौ  ्  ॎ  ॏ  ॕ  ॖ  ॗ

Matra(s) do not make sense on their own in written form. But they do make sense when preceded by Swar(s). See the example:

 ौ = Not valid Hindi literal 
जौ = A valid Hindi literal

Therefore, identifiers in which Matra(s) follow a Swar should be valid. But the existing Go compiler does not allow that. It treats Matra(s) differently. So,

पोर्ट_नंबर := 5432

The above code is valid! There is a bug in the compiler.

@seankhliao
Member

The compiler is correctly implementing the Go spec which depends on the Unicode spec. Even if it is a valid Hindi literal, it is not composed purely of Unicode Letters.

identifier = letter { letter | unicode_digit } .
letter = unicode_letter | "_" .
unicode_letter = /* a Unicode code point classified as "Letter" */ .
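
To make the grammar concrete, here is a rough sketch of a checker that applies these three productions rune by rune. It is not the compiler's scanner; `checkIdentifier` and `isLetter` are names invented for this example, and keywords are deliberately not handled.

```go
package main

import (
	"fmt"
	"unicode"
)

// isLetter mirrors the spec's "letter" production: a Unicode letter or "_".
func isLetter(r rune) bool {
	return r == '_' || unicode.IsLetter(r)
}

// checkIdentifier applies identifier = letter { letter | unicode_digit }
// and returns an error naming the first rune that does not fit.
// Hypothetical helper: it does not reject Go keywords.
func checkIdentifier(name string) error {
	if name == "" {
		return fmt.Errorf("empty identifier")
	}
	for i, r := range name {
		switch {
		case isLetter(r):
			// Letters and "_" are allowed in any position.
		case i > 0 && unicode.IsDigit(r):
			// Decimal digits (category Nd) are allowed after the first rune.
		default:
			return fmt.Errorf("invalid character %U %q at byte offset %d", r, r, i)
		}
	}
	return nil
}

func main() {
	for _, name := range []string{"पोर्ट_नंबर", "परट_नबर"} {
		fmt.Printf("%s: %v\n", name, checkIdentifier(name))
	}
}
```

The standard library's `go/token.IsIdentifier` performs a similar check and also excludes keywords, so it is probably the better choice outside of a demonstration.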

@ALTree
Member

ALTree commented Nov 25, 2020

It doesn't matter if those characters combine with others to make up letters. From the compiler's point of view, every character in an identifier needs to be a letter. These characters are not letters, and even if theoretically they just "combine" with the previous letter, they're still there.

We are aware of the fact that this prevents people from writing certain words in certain scripts.

See for example: #194 (Allow Unicode combining characters in identifiers).

We also have a FAQ on this:

What's up with Unicode identifiers?

Go's rule [...] is simple to understand and to implement but has restrictions. Combining characters are excluded by design, for instance, and that excludes some languages such as Devanagari.

So, as it was pointed out: the compiler is behaving according to the language specification. Combining characters are intentionally excluded. We are aware that this can cause some issues in certain scripts, but whether to change the rules (and how) is a different matter. The current behaviour is aligned with the language spec.

@harshrathod50
Author

Okay, that rule needs to be changed, seriously! This rule is explicitly breaking other languages. What is the appropriate place to raise this issue?

@ALTree
Member

ALTree commented Nov 25, 2020

Essentially you want to re-open #194 for consideration. This issue tracker is the right place to do it. For big changes, we generally prefer a well laid out proposal (see here: https://github.com/golang/proposal). Not breaking any existing Go code is extremely important and it's likely that any proposal that does it lightly will be rejected.

Note that any proposal about changing which identifiers are allowed should probably at least consider solving the other big issue Go currently has:

Since an exported identifier must begin with an upper-case letter, identifiers created from characters in some languages can, by definition, not be exported. For now the only solution is to use something like X日本語, which is clearly unsatisfactory.

For example, if you can write पोर्ट_नंबर but you can't export it (is it uppercase? I don't know), then it'll be confusing. It's not easy to decide how to handle this. Japanese has the same issue.
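
The export rule can be illustrated with a small sketch, assuming only the spec's rule that an exported identifier starts with an upper-case letter (Unicode category Lu). The helper `isExported` is written for this example; the standard library offers `go/token.IsExported` for the same purpose.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// isExported reports whether name begins with an upper-case letter.
// Hypothetical helper mirroring the export rule; scripts without letter
// case (Devanagari, Japanese) can never satisfy it.
func isExported(name string) bool {
	r, _ := utf8.DecodeRuneInString(name)
	return unicode.IsUpper(r)
}

func main() {
	for _, name := range []string{"Port", "परट_नबर", "日本語", "X日本語"} {
		fmt.Printf("%s exported: %t\n", name, isExported(name))
	}
}
```

Only the names that start with a Latin capital pass the check, which is why the `X` prefix workaround exists.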

@ianlancetaylor
Contributor

I believe the relevant issue here is #20706.

@ALTree
Member

ALTree commented Nov 26, 2020

Ah, there it is. Thanks Ian, I didn't remember we already had an issue for this.

@golang golang locked and limited conversation to collaborators Nov 26, 2021