
spec: clarify tokenization of literals #28253

Closed
griesemer opened this issue Oct 17, 2018 · 4 comments

@griesemer (Contributor)

Per the spec section on Tokens (https://golang.org/ref/spec#Tokens):

While breaking the input into tokens, the next token is the longest sequence of characters that form a valid token.

For instance, the source

0y

should be tokenized into the integer literal 0 and the identifier y according to this rule. All the compilers agree, as we can deduce from the error messages for this program:

package p

const _ = 0y

Here are the errors reported by cmd/compile, gccgo, and gotype:

$ go tool compile x.go
x.go:3:12: syntax error: unexpected y after top level declaration
$ gotype x.go
x.go:3:12: expected ';', found y
$ gccgo x.go
x.go:3:12: error: expected ';' or newline after top level declaration
3 | const _ = 0y
  |            ^

However, the rule is not strictly followed for the source

0x

cmd/compile and gotype report:

$ go tool compile x.go
x.go:3:11: malformed integer constant: 0x
x.go:3:13: malformed hex constant
$ gotype x.go
x.go:3:11: illegal hexadecimal number

Only gccgo produces an error consistent with the previous example and the spec:

$ gccgo x.go
x.go:3:12: error: expected ';' or newline after top level declaration
3 | const _ = 0x
  |            ^

cmd/compile and gotype both assume that 0x is the beginning of a hexadecimal number and both report an error when that number doesn't materialize; yet the longest sequences of characters that form valid tokens here are (as before): 0 and x.

Finally, for the source

0123456789

all compilers deviate from the spec rule:

$ go tool compile x.go
x.go:3:11: malformed integer constant: 0123456789
x.go:3:21: malformed octal constant
$ gotype x.go
x.go:3:11: illegal octal number
$ gccgo x.go
x.go:3:11: error: invalid octal literal
3 | const _ = 0123456789
  |           ^

All of them assume this to be a single octal constant. Yet, per the spec, the input should be tokenized into the two integer literals 01234567 and 89.

The implementation problem here is that we don't know if a sequence 012345678 is simply the octal literal 01234567 followed by the integer literal 8 or whether this turns out to be a valid floating-point constant 0123456789.0 had we kept reading. To make the right decision, a tokenizer must keep reading until there's a definitive answer. If the "longest-possible" tokenization fails, per the spec, the correct answer would require the tokenizer to go back, which may not be easily possible (say, if the implementation reads the source via an io.Reader). Worse, because there's virtually no size limit for octal literals, if backtracking is not an option, arbitrarily long look-ahead would be needed.
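A toy sketch (decimal digits only, hypothetical helper name) of the buffering a spec-strict tokenizer would need: it must read the entire digit run, and only then can it decide whether a leading-zero sequence containing 8 or 9 is one float or two integer tokens.

```go
package main

import "fmt"

// scanNumber is a simplified illustration, not the real scanner: it shows
// that the token boundary inside "0123456789" depends on characters
// arbitrarily far ahead (a following '.' or 'e' makes it one float).
func scanNumber(src string) string {
	i := 0
	for i < len(src) && src[i] >= '0' && src[i] <= '9' {
		i++
	}
	digits := src[:i]
	// A following '.' or 'e' would make the whole run part of one float.
	if i < len(src) && (src[i] == '.' || src[i] == 'e') {
		return "FLOAT starting " + digits
	}
	if len(digits) > 1 && digits[0] == '0' {
		// Leading-zero integer: valid octal only while digits stay in 0-7.
		for j, d := range digits {
			if d == '8' || d == '9' {
				// Longest-valid-token rule: split at the first bad digit.
				return fmt.Sprintf("INT %s, then INT %s", digits[:j], digits[j:])
			}
		}
	}
	return "INT " + digits
}

func main() {
	fmt.Println(scanNumber("0123456789"))   // split into two integer tokens
	fmt.Println(scanNumber("0123456789.0")) // one float, no split
}
```

The sketch buffers the whole digit run before deciding, which is exactly the unbounded look-ahead (or backtracking) the paragraph above describes.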

A similar problem arises for floating-point numbers: for instance, 1.2e-f should be tokenized as 1.2, e, -, f, but cmd/compile and gotype complain about an invalid floating-point number; only gccgo appears to do the right thing. The problem is less severe here because the required look-ahead is bounded (at most 3 characters).

Octal constants thus pose tokenization requirements that no other token has, if we want to stick strictly to the tokenization rule provided by the spec.

The simplest "fix" is to adjust the spec such that it permits implementations to deviate from the "longest sequence" requirement for numeric literals, and perhaps for octal literals only (the latter is what gccgo appears to be doing).

cc: @ianlancetaylor @mdempsky for commentary, if any.

@rsc (Contributor) commented Jan 17, 2019

For what it's worth, this isn't limited to octal. I find this equally confusing:

$ cat /tmp/x.go
package p
var _ = 1234abcd
$ gofmt /tmp/x.go
/tmp/x.go:2:13: expected ';', found abcd
$ 

If abcd1234 is one token, it's quite surprising for 1234abcd to be two tokens. I agree that 012389 being two tokens is more surprising, but only a tiny bit more.

(If it ever makes a difference - that is, if we have a valid program one way but not another - that would be seriously problematic. Hopefully it does not, in which case this probably doesn't matter much.)

@griesemer (Contributor, Author) commented Jan 30, 2019

To address this issue generally we could modify the primary tokenization rule. Currently we have:

While breaking the input into tokens, the next token is the longest sequence of characters
that form a valid token.

If we change this to something like:

While breaking the input into tokens, the next token is the sequence of characters
that form the longest prefix of a valid token. If the prefix is not a valid token,
an error is reported.

(When tokenizing char and string literals, we may want to look for the closing quote before deciding if the literal's interior is valid; we don't want to stop in the middle because of an invalid escape sequence. If we want to be absolutely precise in the spec, the interior syntax could be separated from the general char or string literal syntax.)

This would eliminate the need for any special cases for octals, hexadecimal floats, incomplete exponents, etc. in the spec. On the implementation side it would eliminate the need for any backtracking; a simple and consistent 1-char look-ahead would be sufficient. This would simplify lexing, and perhaps even speed it up a tiny bit (though this is unlikely to have any impact on compiler speed).

  • 0-octals would be consumed as they are now: We stop as soon as we don't have a decimal digit, ., exponent, or imaginary i. The token will be considered an integer literal; it may even have a defined value (regular base-8 conversion, even for the digits 8 and 9), but an error is reported.

  • Base-prefixed integers without digits will be considered integer literals of value 0 and an error is reported.

  • Hexadecimal floats with missing exponent or exponent digits will be considered hexadecimal floats with exponent 0 and an error is reported.

This behavior is essentially what we are doing now anyway (though gccgo tokenizes 0xg as 0, xg, while Go 1.12 cmd/compile tokenizes it as 0x + hex-error, g).

There is one place where we currently (Go 1.12) use 2-char lookahead, and that is for ...: A ..x or ..0 is tokenized as ., ., x, or ., .0, respectively. With the proposed rule this would be tokenized as .. + ...-error, x, or .. + ...-error, 0, respectively (i.e., a .. is considered a ... and an error is reported). This seems fine as the resulting token sequence is invalid in all cases; i.e., no invalid program becomes suddenly valid.
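The proposed prefix rule for 0-octals can be sketched as follows (a toy illustration with a hypothetical helper, not the real scanner): consume the whole digit run with 1-char look-ahead, return a single INT token, and report invalid octal digits instead of backtracking.

```go
package main

import "fmt"

// scanInt sketches the "longest prefix of a valid token" rule for
// leading-zero integers: all decimal digits are consumed into one token,
// and an out-of-range octal digit produces an error, not a token split.
func scanInt(src string) (lit, errmsg string) {
	i := 0
	bad := byte(0)
	for i < len(src) && src[i] >= '0' && src[i] <= '9' {
		if src[i] >= '8' && bad == 0 {
			bad = src[i] // remember the first invalid octal digit
		}
		i++
	}
	lit = src[:i]
	if len(lit) > 1 && lit[0] == '0' && bad != 0 {
		errmsg = fmt.Sprintf("invalid digit %q in octal literal", bad)
	}
	return lit, errmsg
}

func main() {
	lit, errmsg := scanInt("0123456789")
	fmt.Printf("INT %s, error: %q\n", lit, errmsg) // one token plus an error
	lit, errmsg = scanInt("01234567")
	fmt.Printf("INT %s, error: %q\n", lit, errmsg) // valid octal, no error
}
```

This is essentially the behavior the Go 1.13+ scanners later adopted for number literals: accept liberally, report a specific error.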

However, your (@rsc) example above, 1234abcd, continues to be tokenized as 1234, abcd. I don't see a strong reason to change this.

Alternatively, instead of changing the tokenization rule, it might be better to make the suggested change an implementation restriction. After all, the existing tokenization rule describes correct programs.

@gopherbot
Change https://golang.org/cl/161417 mentions this issue: spec: add implementation restriction to tokenization rules

@griesemer (Contributor, Author)

Based on feedback on https://golang.org/cl/161417 and the latest scanner implementations for Go 2 number literals, I am going to close this issue.

I agree that the spec describes correct programs and that we shouldn't expand it to describe compiler behavior in the presence of incorrect programs.

Second, the scanners for the Go 2 number literals now accept incorrect literals liberally and in return can provide more informative error messages than if they adhered strictly to the basic tokenization rule, or even to the relaxed (suggested) prefix tokenization rule.

Finally, all the std library scanners now behave essentially the same when it comes to Go 2 number scanning, as they all use the same code outline.

@golang locked and limited conversation to collaborators on Feb 11, 2020