
spec: clarify tokenization of literals #28253

Closed
griesemer opened this issue Oct 17, 2018 · 4 comments

@griesemer (Contributor)

Per the spec section on Tokens (https://golang.org/ref/spec#Tokens):

While breaking the input into tokens, the next token is the longest sequence of characters that form a valid token.

For instance, the source

0y

should be tokenized into the integer literal 0 and the identifier y according to this rule. All the compilers agree, as we can deduce from the error messages for this program:

package p

const _ = 0y

Here are the errors reported by cmd/compile, gccgo, and gotype:

$ go tool compile x.go
x.go:3:12: syntax error: unexpected y after top level declaration
$ gotype x.go
x.go:3:12: expected ';', found y
$ gccgo x.go
x.go:3:12: error: expected ';' or newline after top level declaration
3 | const _ = 0y
  |            ^

However, the rule is not strictly followed for the source

0x

cmd/compile and gotype report:

$ go tool compile x.go
x.go:3:11: malformed integer constant: 0x
x.go:3:13: malformed hex constant
$ gotype x.go
x.go:3:11: illegal hexadecimal number

Only gccgo produces an error consistent with the previous example and the spec:

$ gccgo x.go
x.go:3:12: error: expected ';' or newline after top level declaration
3 | const _ = 0x
  |            ^

cmd/compile and gotype both assume that 0x is the beginning of a hexadecimal number and both report an error when that number doesn't materialize; yet the longest sequences of characters that form valid tokens here are (as before): 0 and x.

Finally, for the source

0123456789

all compilers deviate from the spec rule:

$ go tool compile x.go
x.go:3:11: malformed integer constant: 0123456789
x.go:3:21: malformed octal constant
$ gotype x.go
x.go:3:11: illegal octal number
$ gccgo x.go
x.go:3:11: error: invalid octal literal
3 | const _ = 0123456789
  |           ^

All of them assume this to be a single octal constant. Yet, per the spec, the input should be tokenized into the two integer literals 01234567 and 89.

The implementation problem here is that we don't know if a sequence 012345678 is simply the octal literal 01234567 followed by the integer literal 8 or whether this turns out to be a valid floating-point constant 0123456789.0 had we kept reading. To make the right decision, a tokenizer must keep reading until there's a definitive answer. If the "longest-possible" tokenization fails, per the spec, the correct answer would require the tokenizer to go back, which may not be easily possible (say, if the implementation reads the source via an io.Reader). Worse, because there's virtually no size limit for octal literals, if backtracking is not an option, arbitrarily long look-ahead would be needed.
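A toy sketch (decimal digits only, hypothetical helper name) of the buffering a spec-strict tokenizer would need: it must read the entire digit run, and only then can it decide whether a leading-zero sequence containing 8 or 9 is one float or two integer tokens.

```go
package main

import "fmt"

// scanNumber is a simplified illustration, not the real scanner: it shows
// that the token boundary inside "0123456789" depends on characters
// arbitrarily far ahead (a following '.' or 'e' makes it one float).
func scanNumber(src string) string {
	i := 0
	for i < len(src) && src[i] >= '0' && src[i] <= '9' {
		i++
	}
	digits := src[:i]
	// A following '.' or 'e' would make the whole run part of one float.
	if i < len(src) && (src[i] == '.' || src[i] == 'e') {
		return "FLOAT starting " + digits
	}
	if len(digits) > 1 && digits[0] == '0' {
		// Leading-zero integer: valid octal only while digits stay in 0-7.
		for j, d := range digits {
			if d == '8' || d == '9' {
				// Longest-valid-token rule: split at the first bad digit.
				return fmt.Sprintf("INT %s, then INT %s", digits[:j], digits[j:])
			}
		}
	}
	return "INT " + digits
}

func main() {
	fmt.Println(scanNumber("0123456789"))   // split into two integer tokens
	fmt.Println(scanNumber("0123456789.0")) // one float, no split
}
```

The sketch buffers the whole digit run before deciding, which is exactly the unbounded look-ahead (or backtracking) the paragraph above describes.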

A similar problem arises for floating-point numbers: for instance, 1.2e-f should be tokenized as 1.2, e, -, f, but cmd/compile and gotype complain about an invalid floating-point number; only gccgo appears to do the right thing. The problem is less severe here because the required look-ahead is bounded (at most 3 characters).

Octal constants thus pose tokenization requirements that no other token has, if we want to stick strictly to the tokenization rule provided by the spec.

The simplest "fix" is to adjust the spec such that it permits implementations to deviate from the "longest sequence" requirement for numeric literals, and perhaps for octal literals only (the latter is what gccgo appears to be doing).

cc: @ianlancetaylor @mdempsky for commentary, if any.

@rsc (Contributor) commented Jan 17, 2019

For what it's worth, this isn't limited to octal. I find this equally confusing:

$ cat /tmp/x.go
package p
var _ = 1234abcd
$ gofmt /tmp/x.go
/tmp/x.go:2:13: expected ';', found abcd
$ 

If abcd1234 is one token, it's quite surprising for 1234abcd to be two tokens. I agree that 012389 being two tokens is more surprising, but only a tiny bit more.

(If it ever makes a difference - that is, if we have a valid program one way but not another - that would be seriously problematic. Hopefully it does not, in which case this probably doesn't matter much.)

@griesemer (Contributor, Author) commented Jan 30, 2019

To address this issue generally we could modify the primary tokenization rule. Currently we have:

While breaking the input into tokens, the next token is the longest sequence of characters
that form a valid token.

If we change this to something like:

While breaking the input into tokens, the next token is the sequence of characters
that form the longest prefix of a valid token. If the prefix is not a valid token,
an error is reported.

(When tokenizing char and string literals, we may want to look for the closing quote before deciding if the literal's interior is valid; we don't want to stop in the middle because of an invalid escape sequence. If we want to be absolutely precise in the spec, the interior syntax could be separated from the general char or string literal syntax.)

This would eliminate the need for any special cases for octals, hexadecimal floats, incomplete exponents, etc. in the spec. On the implementation side it would eliminate the need for any backtracking; a simple and consistent 1-char look-ahead would be sufficient. This would simplify lexing, and perhaps even speed it up a tiny bit (though this is unlikely to have any impact on compiler speed).

  • 0-octals would be consumed as they are now: We stop as soon as we don't have a decimal digit, ., exponent, or imaginary i. The token will be considered an integer literal; it may even have a defined value (regular base-8 conversion, even for the digits 8 and 9), but an error is reported.

  • Base-prefixed integers without digits will be considered integer literals of value 0 and an error is reported.

  • Hexadecimal floats with missing exponent or exponent digits will be considered hexadecimal floats with exponent 0 and an error is reported.

This behavior is essentially what we are doing now anyway (though gccgo tokenizes 0xg as 0, xg, while Go 1.12 cmd/compile tokenizes it as 0x + hex-error, g).

There is one place where we currently (Go 1.12) use 2-char lookahead, and that is for ...: A ..x or ..0 is tokenized as ., ., x, or ., .0, respectively. With the proposed rule this would be tokenized as .. + ...-error, x, or .. + ...-error, 0, respectively (i.e., a .. is considered a ... and an error is reported). This seems fine as the resulting token sequence is invalid in all cases; i.e., no invalid program becomes suddenly valid.
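The proposed prefix rule for 0-octals can be sketched as follows (a toy illustration with a hypothetical helper, not the real scanner): consume the whole digit run with 1-char look-ahead, return a single INT token, and report invalid octal digits instead of backtracking.

```go
package main

import "fmt"

// scanInt sketches the "longest prefix of a valid token" rule for
// leading-zero integers: all decimal digits are consumed into one token,
// and an out-of-range octal digit produces an error, not a token split.
func scanInt(src string) (lit, errmsg string) {
	i := 0
	bad := byte(0)
	for i < len(src) && src[i] >= '0' && src[i] <= '9' {
		if src[i] >= '8' && bad == 0 {
			bad = src[i] // remember the first invalid octal digit
		}
		i++
	}
	lit = src[:i]
	if len(lit) > 1 && lit[0] == '0' && bad != 0 {
		errmsg = fmt.Sprintf("invalid digit %q in octal literal", bad)
	}
	return lit, errmsg
}

func main() {
	lit, errmsg := scanInt("0123456789")
	fmt.Printf("INT %s, error: %q\n", lit, errmsg) // one token plus an error
	lit, errmsg = scanInt("01234567")
	fmt.Printf("INT %s, error: %q\n", lit, errmsg) // valid octal, no error
}
```

This is essentially the behavior the Go 1.13+ scanners later adopted for number literals: accept liberally, report a specific error.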

However, your (@rsc) example above, 1234abcd, continues to be tokenized as 1234, abcd. I don't see a strong reason to change this.

Alternatively, instead of changing the tokenization rule, it might be better to make the suggested change an implementation restriction. After all, the existing tokenization rule describes correct programs.

@gopherbot
Change https://golang.org/cl/161417 mentions this issue: spec: add implementation restriction to tokenization rules

@griesemer (Contributor, Author)

Based on feedback on https://golang.org/cl/161417 and the latest scanner implementations for Go 2 number literals, I am going to close this issue.

I agree that the spec describes correct programs and that we shouldn't expand it to describe compiler behavior in the presence of incorrect programs.

Second, the scanners for the Go 2 number literals now accept incorrect literals liberally and in return can provide more informative error messages than if they adhered strictly to the basic tokenization rule, or even to the relaxed (suggested) prefix tokenization rule.

Finally, all the std library scanners now behave essentially the same when it comes to Go 2 number scanning, as they all use the same code outline.

@golang locked and limited conversation to collaborators on Feb 11, 2020