go/scanner: newlines acting as semicolons cause comments to appear out of order #54941

adonovan · 2022-09-07T20:20:47Z

fmt.Printf("%q\n", tokenize("package p /*a*/ ; /*b*/"))  // ["package" "p" "/*a*/" ";" "/*b*/"]
fmt.Printf("%q\n", tokenize("package p /*a*/ \n /*b*/")) // ["package" "p" "\n" "/*a*/" "/*b*/"]

Observe that the first comment, a, is delivered before the semicolon token in the first example, but after the semicolon token in the second example, in which the semicolon token was encoded as a newline. Similar behavior is observed with other declarations (e.g. var) and with statements within a function.

I think this is a bug and that the output should be ["package" "p" "/*a*/" ";" "/*b*/"] in the second case.

The text was updated successfully, but these errors were encountered:

mknyszek · 2022-09-07T20:23:15Z

CC @griesemer

adonovan · 2022-09-07T21:38:23Z

On closer reading of the spec, I'm not sure it is a bug. It says:

"a semicolon is automatically inserted into the token stream immediately after a line's final token if that token is
an identifier or [...]"

In the example above the final token on the line is the identifier p. So the semicolon token is in the right place, even though it seems counterintuitive that it jumped ahead of a previous comment---or potentially several comments:

fmt.Printf("%q\n", tokenize("package p /*a*/ /*b*/ \n /*c*/")) // ["package" "p" "\n" "/*a*/" "/*b*/" "/*c*/"]

Even more strangely:

A general comment containing no newlines acts like a space. Any other comment acts like a newline.

So the following example https://go.dev/play/p/rZHcq1-Ko-g also behaves consistent with the spec:

fmt.Printf("%q\n", tokenize("package p /*a*/ /*b*/ /*c \n d*/ /*e*/")) // ["package" "p" "\n" "/*a*/" "/*b*/" "/*c \n d*/" "/*e*/"]

Observe that the scanner skips past several complete comments until it encounters a newline within a comment, then treats that comment as a newline, causing it to emit a semicolon after the previous token, which is the identifier p. It then finally emits the comments including the one with the newline, after the semicolon token.

So, not a bug. Just a challenge to anyone trying to come up with an answer to the question: which node in the syntax tree owns each comment?, since the comments from one declaration may be delivered to the parser only after it has moved on to another declaration. This means that a fully faithful parser would need to retroactively attach comments to a previously finished subtree, or a printer would need to visit in multiple passes in order to look ahead to find what comments are ahead in the visitation order, but behind in the source text. The plot thickens...

griesemer · 2022-09-07T23:25:45Z

Thanks for investigating - this is refreshing my memory. Per the spec:

A general comment containing no newlines acts like a space. Any other comment acts like a newline.

In other words, comments are just white (more precisely, "blackened") space, not tokens, and really invisible to the language proper (never mind their function as directives in some places). The issue arises because go/scanner reports comments as tokens, thus confusing matters (the syntax package reports comments with an out-of-band callback, not as tokens). From the spec's perspective, since comments are not tokens, the automatic semicolon rules cannot take comments into accounts.

So yes, I agree that this is working as expected. With respect to "who owns a comment", position information must be taken into account, not so much semicolons.

Closing as working as intended.

adonovan · 2022-09-08T00:00:31Z

On the train home I thought some more about this and reversed myself. As you point out, the spec says comments are not tokens; it's merely the go/token and go/scanner packages that call them tokens out of convenience. When the scanner runs with ScanComments disabled, it produces exactly the stream of tokens required by the spec.

But when ScanComments is enabled, the scanner produces a stream containing a mixture of tokens and comments. The nature of this mixed stream isn't really governed by the spec. The spec says only that a newline or newline-containing comment causes an artificial semicolon token to be injected into the token stream immediately after the last token of previous declaration. But because it governs only the token stream, there is ambiguity about what exactly it means for the mixed stream. As a practical matter, it would be simpler for various tools (such as the parser) if the artificial semicolon did not appear in the mixed stream ahead of any comment that entirely precedes it in the source text. There remains the question of what order the scanner should report the newline-containing comment versus the artificial semicolon. I don't think there can be any fully satisfactory answer since one token encloses the other, but my preference would be for the newline-containing comment to appear first in the mixed stream since it starts before the newline that is turned into a semicolon token.

In other words, I think the scanner is trying to be too clever by looking ahead, possibly past several comments, to find the newline. Instead it could simply dispatch comments as it encounters them, and inject a semicolon token once it sees a newline. This would be both compliant with the spec, and simpler for client tools such as a parser that wished to record every comment within the syntax tree.

What do you think?

mknyszek · 2022-09-08T15:03:59Z

(Adding back NeedsInvestigation and milestone to remove it from the triaging queue.)

gopherbot · 2022-09-08T21:38:01Z

Change https://go.dev/cl/429635 mentions this issue: go/scanner: emit implicit semicolon tokens in correct order

mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Sep 7, 2022

mknyszek added this to the Backlog milestone Sep 7, 2022

griesemer self-assigned this Sep 7, 2022

griesemer modified the milestones: Backlog, Go1.20 Sep 7, 2022

griesemer removed this from the Go1.20 milestone Sep 7, 2022

griesemer removed their assignment Sep 7, 2022

griesemer closed this as completed Sep 7, 2022

adonovan reopened this Sep 8, 2022

griesemer added Thinking and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Sep 8, 2022

mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Sep 8, 2022

mknyszek added this to the Backlog milestone Sep 8, 2022

adonovan self-assigned this Sep 8, 2022

gopherbot closed this as completed in 92a94a7 Sep 27, 2022

golang locked and limited conversation to collaborators Sep 27, 2023

gopherbot added the FrozenDueToAge label Sep 27, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

go/scanner: newlines acting as semicolons cause comments to appear out of order #54941

go/scanner: newlines acting as semicolons cause comments to appear out of order #54941

adonovan commented Sep 7, 2022 •

edited

mknyszek commented Sep 7, 2022

adonovan commented Sep 7, 2022 •

edited

griesemer commented Sep 7, 2022

adonovan commented Sep 8, 2022 •

edited

mknyszek commented Sep 8, 2022

gopherbot commented Sep 8, 2022

go/scanner: newlines acting as semicolons cause comments to appear out of order #54941

go/scanner: newlines acting as semicolons cause comments to appear out of order #54941

Comments

adonovan commented Sep 7, 2022 • edited

mknyszek commented Sep 7, 2022

adonovan commented Sep 7, 2022 • edited

griesemer commented Sep 7, 2022

adonovan commented Sep 8, 2022 • edited

mknyszek commented Sep 8, 2022

gopherbot commented Sep 8, 2022

adonovan commented Sep 7, 2022 •

edited

adonovan commented Sep 7, 2022 •

edited

adonovan commented Sep 8, 2022 •

edited