Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

go/token: unpack() doesn't handle unicode properly #34322

Closed
slrtbtfs opened this issue Sep 16, 2019 · 3 comments
Closed

go/token: unpack() doesn't handle unicode properly #34322

slrtbtfs opened this issue Sep 16, 2019 · 3 comments

Comments

@slrtbtfs
Copy link

What version of Go are you using (go version)?

$ go version
go version go1.12.9 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/slrtbtfs/.cache/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/slrtbtfs/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/lib/golang"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/golang/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build881572432=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I used the following test file:

https://play.golang.org/p/g5qvYsFx9lr

package main

import "fmt"

func main() {
	fmt.Println("ਕਖਗਘਙਚਛਜਝਞਟ") a
}

What did you expect to see?

The compiler should have produced:

./test.go:6:36: syntax error: unexpected a at end of statement

What did you see instead?

Compiling it produces:

./test.go:6:51: syntax error: unexpected a at end of statement

Note the different column where the error is claimed to occur. Also not that the test file does not even have 51 columns.

What is the root cause of this?

The column number is calculated in https://golang.org/src/go/token/position.go, function unpack(), line 303. This code implicitly assumes that every column is exactly one byte, which is not the case with unicode.

Examples:

  • A TAB is one byte, but more than one column.
  • These nonstandard characters used in the test case are one column but more than one byte.
@robpike
Copy link
Contributor

robpike commented Sep 17, 2019

There is really not a clear definition of a column with multi-lingual text, with variable-pitch fonts, or in the presence of control or other invisible characters. The number presented in the error message is a byte offset, not a column, and is poorly documented. The documentation (go/token/Position.Column) should be clearer, and the choice of the word Column for this field is unfortunate but unchangeable.

@smasher164
Copy link
Member

The error message you are seeing is from cmd/compile, not go/parser, which gives a similar message, 6:51: expected ';', found a (and 2 more errors). That being said, go/token does document that the column field in Position is a byte count go/token/position.go#L24, so this is working as intended. Changing this behavior would not be backwards compatible.

Additionally, it's unclear from your description what a correct column count should be. Behavior here is inconsistent across languages and compilers. One could argue that code points are not meaningful enough to identify a column offset, and that extended grapheme clusters would be better. It's unclear which is better, but at the least a byte offset is unambiguous to tools, especially in the case of a UTF-8 encoded source file.

Let me know if this makes sense to you. I will leave you with an error message comparison from some sibling languages: For example, compare the column/caret positions given by Clang 9, GCC 7, Python 2, Python 3, and Swift 4:

cat > colerror.c
#include <stdio.h>

int main() {
    printf("ਕਖਗਘਙਚਛਜਝਞਟ"); a
}
^C
$ clang colerror.c
colerror.c:4:50: error: use of undeclared identifier 'a'
    printf("ਕਖਗਘਙਚਛਜਝਞਟ"); a
                           ^
1 error generated.
$ gcc colerror.c 
colerror.c: In function ‘main’:
colerror.c:4:50: error: ‘a’ undeclared (first use in this function)
     printf("ਕਖਗਘਙਚਛਜਝਞਟ"); a
                                                  ^
colerror.c:4:50: note: each undeclared identifier is reported only once for each function it appears in
colerror.c:5:1: error: expected ‘;’ before ‘}’ token
 }
 ^
$ cat > colerror.py
# coding=utf-8
print("ਕਖਗਘਙਚਛਜਝਞਟ") a
^C
$ python2.7 colerror.py
  File "colerror.py", line 2
    print("ਕਖਗਘਙਚਛਜਝਞਟ") a
                                               ^
SyntaxError: invalid syntax
$ python3 colerror.py
  File "colerror.py", line 2
    print("ਕਖਗਘਙਚਛਜਝਞਟ") a
                         ^
SyntaxError: invalid syntax
$ cat > colerror.swift
print("ਕਖਗਘਙਚਛਜਝਞਟ"); a
^C
$ swiftc colerror.swift
colerror.swift:1:45: error: use of unresolved identifier 'a'
print("ਕਖਗਘਙਚਛਜਝਞਟ"); a

@slrtbtfs
Copy link
Author

After digging deeper into unicode I agree that there is no canonical way to assign a width to a string and the issue is thus unsolvable.

Given that a lot of tool integrations do not get this right, there seem to be a lot of bugs to be reported on their side, though.

@golang golang locked and limited conversation to collaborators Sep 16, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants