New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
go/scanner: newlines acting as semicolons cause comments to appear out of order #54941
Comments
CC @griesemer |
On closer reading of the spec, I'm not sure it is a bug. It says:
In the example above the final token on the line is the identifier p. So the semicolon token is in the right place, even though it seems counterintuitive that it jumped ahead of a previous comment---or potentially several comments:
Even more strangely:
So the following example https://go.dev/play/p/rZHcq1-Ko-g also behaves consistent with the spec:
Observe that the scanner skips past several complete comments until it encounters a newline within a comment, then treats that comment as a newline, causing it to emit a semicolon after the previous token, which is the identifier p. It then finally emits the comments including the one with the newline, after the semicolon token. So, not a bug. Just a challenge to anyone trying to come up with an answer to the question: which node in the syntax tree owns each comment?, since the comments from one declaration may be delivered to the parser only after it has moved on to another declaration. This means that a fully faithful parser would need to retroactively attach comments to a previously finished subtree, or a printer would need to visit in multiple passes in order to look ahead to find what comments are ahead in the visitation order, but behind in the source text. The plot thickens... |
Thanks for investigating - this is refreshing my memory. Per the spec:
In other words, comments are just white (more precisely, "blackened") space, not tokens, and really invisible to the language proper (never mind their function as directives in some places). The issue arises because go/scanner reports comments as tokens, thus confusing matters (the syntax package reports comments with an out-of-band callback, not as tokens). From the spec's perspective, since comments are not tokens, the automatic semicolon rules cannot take comments into accounts. So yes, I agree that this is working as expected. With respect to "who owns a comment", position information must be taken into account, not so much semicolons. Closing as working as intended. |
On the train home I thought some more about this and reversed myself. As you point out, the spec says comments are not tokens; it's merely the go/token and go/scanner packages that call them tokens out of convenience. When the scanner runs with ScanComments disabled, it produces exactly the stream of tokens required by the spec. But when ScanComments is enabled, the scanner produces a stream containing a mixture of tokens and comments. The nature of this mixed stream isn't really governed by the spec. The spec says only that a newline or newline-containing comment causes an artificial semicolon token to be injected into the token stream immediately after the last token of previous declaration. But because it governs only the token stream, there is ambiguity about what exactly it means for the mixed stream. As a practical matter, it would be simpler for various tools (such as the parser) if the artificial semicolon did not appear in the mixed stream ahead of any comment that entirely precedes it in the source text. There remains the question of what order the scanner should report the newline-containing comment versus the artificial semicolon. I don't think there can be any fully satisfactory answer since one token encloses the other, but my preference would be for the newline-containing comment to appear first in the mixed stream since it starts before the newline that is turned into a semicolon token. In other words, I think the scanner is trying to be too clever by looking ahead, possibly past several comments, to find the newline. Instead it could simply dispatch comments as it encounters them, and inject a semicolon token once it sees a newline. This would be both compliant with the spec, and simpler for client tools such as a parser that wished to record every comment within the syntax tree. What do you think? |
(Adding back NeedsInvestigation and milestone to remove it from the triaging queue.) |
Change https://go.dev/cl/429635 mentions this issue: |
See https://go.dev/play/p/M096S5HDBAb:
Observe that the first comment, a, is delivered before the semicolon token in the first example, but after the semicolon token in the second example, in which the semicolon token was encoded as a newline. Similar behavior is observed with other declarations (e.g. var) and with statements within a function.
I think this is a bug and that the output should be
["package" "p" "/*a*/" ";" "/*b*/"]
in the second case.The text was updated successfully, but these errors were encountered: