Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bufio: allow terminating Scanner early cleanly without a final token or an error #56381

Closed
Tracked by #300
favonia opened this issue Oct 22, 2022 · 11 comments
Closed
Tracked by #300

Comments

@favonia
Copy link
Contributor

favonia commented Oct 22, 2022

Why

Currently, the clean way to end the scanning early is to use the error ErrFinalToken. However, when it is used, the Scanner.Scan method will unconditionally return true even if the token is nil, generating an extra token (via Scanner.Text). (Please note that this is different from the splitting function returning an empty slice. This proposal only covers the case where the second return value is exactly nil; every other case, including the case where the second return value is an empty slice, should remain the same.) In other words, a SplitFunc cannot return

len(data), nil, ErrFinalToken

as a way to cleanly terminate the scanning without adding an extra token. If I understand the logic correctly, a user thus must use a custom error or specifically check whether Scanner.Bytes returns nil to achieve token-free early termination, which is a bit awkward. However, I wonder if it makes sense for Scanner to skip the nil token in this very special case? No matter whether this proposal is accepted or not, I hope this tricky case can be more explicitly mentioned in the documentation of SplitFunc and/or Scanner.Scan.

Proposed code change

In Scanner.Sacn,

 if err == ErrFinalToken {
 	s.token = token
 	s.done = true
-	return true
+	return token != nil
 }

Example exploiting the proposed API

(This is adapted from the existing example ExampleScanner_emptyFinalToken.)

func ExampleScanner_earlyStop() {
	const input = "1,2,STOP,4,"
	scanner := bufio.NewScanner(strings.NewReader(input))
	onComma := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		for i := 0; i < len(data); i++ {
			if data[i] == ',' {
				// if the token is "STOP", ignore the rest
				if string(data[:i]) == "STOP" {
					return i + 1, nil, bufio.ErrFinalToken
				}

				return i + 1, data[:i], nil
			}
		}
		return 0, data, bufio.ErrFinalToken
	}
	scanner.Split(onComma)

	for scanner.Scan() {
		fmt.Printf("Got a token %q\n", scanner.Text())
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "reading input:", err)
	}
}

Output before the change:

Got a token "1"
Got a token "2"
Got a token ""

Output after the change:

Got a token "1"
Got a token "2"

Edits

  • 10/23: Added example code.
  • 10/23: I noticed that it's easy to confuse nil tokens with empty tokens. I updated the text with more clarification. I also fixed the typo in the code.
@gopherbot gopherbot added this to the Proposal milestone Oct 22, 2022
@ZekeLu
Copy link
Contributor

ZekeLu commented Oct 23, 2022

See the issue #11836 and the commit ec12754 where ErrFinalToken was introduced.

Please note that a final empty token is a valid token. Apps could depend on this behavior and expect an empty final token. So this proposal could break such apps.

If a caller of the Scan func does not want an empty final token, it could handle it with a simple check. Most of the time, a caller does not want an empty final token don't want empty tokens at all, and the check is already there.

It would be better to provide one or more concrete use cases to support your proposal.

@favonia
Copy link
Contributor Author

favonia commented Oct 23, 2022

Please note that a final empty token is a valid token. Apps could depend on this behavior and expect an empty final token. So this proposal could break such apps.

@ZekeLu No, what I meant is a nil token returned by the SplitFunc, not an "empty" token. Those nil tokens are always skipped by Scanner.scan except in one corner case---when delivered with ErrFinalToken, and the proposal is only about that one corner case. The proposed code change would not affect the behavior of empty tokens (as "empty slices"). It only affects nil tokens (as nil slices). It is hard for me to imagine that there is any application out there which uses nil (instead of an empty slice) to communicate. The official documentation does not support this usage and it arguably violates some of the implicit invariant. On the other hand, I can imagine that lots of programs (including mine) are using empty slices as empty tokens. Perhaps there's some misunderstanding here?

If a caller of the Scan func does not want an empty final token, it could handle it with a simple check. Most of the time, a caller does not want an empty final token don't want empty tokens at all, and the check is already there.

I believe we are talking across each other. The code is dealing with nil tokens. It will not affect any empty tokens (as empty slices) output by a splitter function. Please see the above.

It would be better to provide one or more concrete use cases to support your proposal.

For example, it is useful to write a parser for a language that supports "quit":

x = 1
y = 2
# quit
from this line everything should be and will be completely ignored by the parser
until the end of the file which can be very large and thus the scanner would better
stop early instead of trying to read more and more data and potentially panic
in the end as the splitter function will keep returning nil

@ZekeLu
Copy link
Contributor

ZekeLu commented Oct 23, 2022

Ah, I get it now. Thank you for the clarification!

According to the doc:

The return values are the number of bytes to advance the input and the next token to return to the user, if any, plus an error, if any.

I think you are correct that Scan should return false when the SplitFunc returns 0, nil, ErrFinalToken. I think the confusing part is the name of the error ErrFinalToken, which indicates that there is one token. ErrStop is better if we have the chance to choose the name from the beginning.

@rsc
Copy link
Contributor

rsc commented Oct 26, 2022

Talked to @robpike (author of Scan) about this, and it seems reasonable.

@rsc
Copy link
Contributor

rsc commented Oct 26, 2022

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc
Copy link
Contributor

rsc commented Nov 2, 2022

As a very quick summary, right now if a SplitFunc returns 0, nil, ErrFinalToken, the result of Scan is a final "" before the end of the sequence. That empty string is almost never actually wanted. So the change is to let nil indicate "no empty string token". Users who want empty strings can still return []byte{}, but of course very few scanners actually want empty strings.

@favonia
Copy link
Contributor Author

favonia commented Nov 2, 2022

I can confirm the quick summary is accurate. Thanks!

@rsc
Copy link
Contributor

rsc commented Nov 9, 2022

Based on the discussion above, this proposal seems like a likely accept.
— rsc for the proposal review group

@rsc
Copy link
Contributor

rsc commented Nov 16, 2022

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— rsc for the proposal review group

@rsc rsc changed the title proposal: bufio: allow terminating Scanner early cleanly without a final token or an error bufio: allow terminating Scanner early cleanly without a final token or an error Nov 16, 2022
@rsc rsc modified the milestones: Proposal, Backlog Nov 16, 2022
@gopherbot
Copy link

Change https://go.dev/cl/451495 mentions this issue: bufio: allow terminating Scanner early cleanly without a final token or an error

@gopherbot
Copy link

Change https://go.dev/cl/498117 mentions this issue: bufio: allow terminating Scanner early cleanly without a final token

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Accepted
Development

No branches or pull requests

5 participants