Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

go/scanner: newlines acting as semicolons cause comments to appear out of order #54941

Closed
adonovan opened this issue Sep 7, 2022 · 6 comments
Assignees
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Thinking
Milestone

Comments

@adonovan
Copy link
Member

adonovan commented Sep 7, 2022

See https://go.dev/play/p/M096S5HDBAb:

fmt.Printf("%q\n", tokenize("package p /*a*/ ; /*b*/"))  // ["package" "p" "/*a*/" ";" "/*b*/"]
fmt.Printf("%q\n", tokenize("package p /*a*/ \n /*b*/")) // ["package" "p" "\n" "/*a*/" "/*b*/"]

Observe that the first comment, a, is delivered before the semicolon token in the first example, but after the semicolon token in the second example, in which the semicolon token was encoded as a newline. Similar behavior is observed with other declarations (e.g. var) and with statements within a function.

I think this is a bug and that the output should be ["package" "p" "/*a*/" ";" "/*b*/"] in the second case.

@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Sep 7, 2022
@mknyszek mknyszek added this to the Backlog milestone Sep 7, 2022
@mknyszek
Copy link
Contributor

mknyszek commented Sep 7, 2022

CC @griesemer

@griesemer griesemer self-assigned this Sep 7, 2022
@griesemer griesemer modified the milestones: Backlog, Go1.20 Sep 7, 2022
@adonovan
Copy link
Member Author

adonovan commented Sep 7, 2022

On closer reading of the spec, I'm not sure it is a bug. It says:

"a semicolon is automatically inserted into the token stream immediately after a line's final token if that token is
an identifier or [...]"

In the example above the final token on the line is the identifier p. So the semicolon token is in the right place, even though it seems counterintuitive that it jumped ahead of a previous comment---or potentially several comments:

fmt.Printf("%q\n", tokenize("package p /*a*/ /*b*/ \n /*c*/")) // ["package" "p" "\n" "/*a*/" "/*b*/" "/*c*/"]

Even more strangely:

A general comment containing no newlines acts like a space. Any other comment acts like a newline.

So the following example https://go.dev/play/p/rZHcq1-Ko-g also behaves consistent with the spec:

fmt.Printf("%q\n", tokenize("package p /*a*/ /*b*/ /*c \n d*/ /*e*/")) // ["package" "p" "\n" "/*a*/" "/*b*/" "/*c \n d*/" "/*e*/"]

Observe that the scanner skips past several complete comments until it encounters a newline within a comment, then treats that comment as a newline, causing it to emit a semicolon after the previous token, which is the identifier p. It then finally emits the comments including the one with the newline, after the semicolon token.

So, not a bug. Just a challenge to anyone trying to come up with an answer to the question: which node in the syntax tree owns each comment?, since the comments from one declaration may be delivered to the parser only after it has moved on to another declaration. This means that a fully faithful parser would need to retroactively attach comments to a previously finished subtree, or a printer would need to visit in multiple passes in order to look ahead to find what comments are ahead in the visitation order, but behind in the source text. The plot thickens...

@griesemer griesemer removed this from the Go1.20 milestone Sep 7, 2022
@griesemer griesemer removed their assignment Sep 7, 2022
@griesemer
Copy link
Contributor

Thanks for investigating - this is refreshing my memory. Per the spec:

A general comment containing no newlines acts like a space. Any other comment acts like a newline.

In other words, comments are just white (more precisely, "blackened") space, not tokens, and really invisible to the language proper (never mind their function as directives in some places). The issue arises because go/scanner reports comments as tokens, thus confusing matters (the syntax package reports comments with an out-of-band callback, not as tokens). From the spec's perspective, since comments are not tokens, the automatic semicolon rules cannot take comments into accounts.

So yes, I agree that this is working as expected. With respect to "who owns a comment", position information must be taken into account, not so much semicolons.

Closing as working as intended.

@adonovan
Copy link
Member Author

adonovan commented Sep 8, 2022

On the train home I thought some more about this and reversed myself. As you point out, the spec says comments are not tokens; it's merely the go/token and go/scanner packages that call them tokens out of convenience. When the scanner runs with ScanComments disabled, it produces exactly the stream of tokens required by the spec.

But when ScanComments is enabled, the scanner produces a stream containing a mixture of tokens and comments. The nature of this mixed stream isn't really governed by the spec. The spec says only that a newline or newline-containing comment causes an artificial semicolon token to be injected into the token stream immediately after the last token of previous declaration. But because it governs only the token stream, there is ambiguity about what exactly it means for the mixed stream. As a practical matter, it would be simpler for various tools (such as the parser) if the artificial semicolon did not appear in the mixed stream ahead of any comment that entirely precedes it in the source text. There remains the question of what order the scanner should report the newline-containing comment versus the artificial semicolon. I don't think there can be any fully satisfactory answer since one token encloses the other, but my preference would be for the newline-containing comment to appear first in the mixed stream since it starts before the newline that is turned into a semicolon token.

In other words, I think the scanner is trying to be too clever by looking ahead, possibly past several comments, to find the newline. Instead it could simply dispatch comments as it encounters them, and inject a semicolon token once it sees a newline. This would be both compliant with the spec, and simpler for client tools such as a parser that wished to record every comment within the syntax tree.

What do you think?

@adonovan adonovan reopened this Sep 8, 2022
@griesemer griesemer added Thinking and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Sep 8, 2022
@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Sep 8, 2022
@mknyszek mknyszek added this to the Backlog milestone Sep 8, 2022
@mknyszek
Copy link
Contributor

mknyszek commented Sep 8, 2022

(Adding back NeedsInvestigation and milestone to remove it from the triaging queue.)

@gopherbot
Copy link

Change https://go.dev/cl/429635 mentions this issue: go/scanner: emit implicit semicolon tokens in correct order

@adonovan adonovan self-assigned this Sep 8, 2022
@golang golang locked and limited conversation to collaborators Sep 27, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Thinking
Projects
None yet
Development

No branches or pull requests

4 participants