
proposal: bufio: add Reader.Buffer #50960

Open
dsnet opened this issue Feb 1, 2022 · 12 comments

@dsnet
Member

dsnet commented Feb 1, 2022

The Reader type is initialized with a buffer size. Once set, there is no API to adjust this.

This makes it challenging to parse a file using a combination of Peek and Discard calls. One such pattern looks like:

var r *bufio.Reader = ...
for {
    // Peek at the fixed-size header, then advance past it.
    buf, err := r.Peek(headerSize)
    if err != nil {
        return err
    }
    r.Discard(len(buf))
    hdr := parseHeader(buf)

    // Peek at the variable-size payload described by the header.
    buf, err = r.Peek(hdr.Size)
    if err != nil {
        return err
    }
    r.Discard(len(buf))
    processData(buf)
}

The code above works fine until hdr.Size exceeds the size of Reader's internal buffer, at which point Peek fails with ErrBufferFull.

The current workaround is to check for ErrBufferFull, allocate a separate buffer, and read the expected number of bytes into that buffer. The workaround looks like:

  var r *bufio.Reader = ...
  for {
      ...

      buf, err := r.Peek(hdr.Size)
-     if err != nil {
-         return err
-     }
-     r.Discard(len(buf))
+     switch {
+     case err == nil:
+         r.Discard(len(buf))
+     case err == bufio.ErrBufferFull:
+         buf = make([]byte, hdr.Size)
+         if _, err := io.ReadFull(r, buf); err != nil {
+             return err
+         }
+     default:
+         return err
+     }
      processData(buf)
  }

It is unfortunate that I need to allocate a relatively large buffer in order to read large records and that I'm not able to reuse that buffer for subsequent Reader operations.

I propose adding Reader.Buffer, with semantics similar to Scanner.Buffer:

Buffer sets the buffer to use when reading and the maximum size of buffer that may be allocated during reading. The maximum Peek or ReadSlice size is the larger of max and cap(buf). If max <= cap(buf), Reader will use this buffer only and do no allocation.

Buffer panics if it is called after reading has started.

Using this new API, the code above would look like:

  var r *bufio.Reader = ...
+ r.Buffer(nil, 64<<20) // we never expect records to exceed 64MiB
  for {
      ...
  }

/cc @josharian

@dsnet dsnet added the Proposal label Feb 1, 2022
@gopherbot gopherbot added this to the Proposal milestone Feb 1, 2022
@ianlancetaylor ianlancetaylor added this to Incoming in Proposals (old) Feb 2, 2022
@dsnet
Member Author

dsnet commented Feb 2, 2022

The workaround does not work in situations where temporary errors are expected, because you can't "unread" the buffer once io.ReadFull has consumed it. This is one of the reasons Peek+Discard is a useful pattern: Peek does not mutate the state of the bufio.Reader until you're ready to advance it.
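For illustration, a minimal sketch of that property, where isTemporary is a hypothetical error classifier (e.g., checking net.Error.Timeout): since a failed Peek consumes nothing, the same call can simply be retried.

for {
    buf, err := r.Peek(hdr.Size)
    if err != nil {
        if isTemporary(err) { // hypothetical check, e.g. a net.Error timeout
            continue // Peek consumed nothing, so retrying is safe
        }
        return err
    }
    r.Discard(len(buf))
    processData(buf)
}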

@rsc
Contributor

rsc commented Feb 2, 2022

Reader has a fixed-size buffer set at the beginning of time. I can't quite tell which of a few things you are asking for:

  1. The ability to make bufio.Reader expand its buffer as needed up to a given maximum.
  2. The ability to change the buffer size of bufio.Reader on the fly, by providing a new buffer.
  3. Something else?

I am assuming you don't want to use bufio.NewReaderSize(r, 64<<20) followed by r.Reset to reuse the buffer, because you don't want every Reader to be sitting on 64 MB?
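For reference, that alternative would look something like this (src and inputs are hypothetical placeholders):

// Pre-allocate the maximum once and reuse it across inputs via Reset.
// This works with the existing API, but every such Reader then holds
// a 64 MB buffer for its entire lifetime.
r := bufio.NewReaderSize(src, 64<<20)
for _, next := range inputs {
    r.Reset(next) // switch underlying readers, keeping the same buffer
    ...
}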

@dsnet
Member Author

dsnet commented Feb 2, 2022

  1. The ability to make bufio.Reader expand its buffer as needed up to a given maximum.

Yes.

  2. The ability to change the buffer size of bufio.Reader on the fly, by providing a new buffer.

No. The proposed Buffer method would not allow this after reading has started, similar to how Scanner.Buffer panics if called after scanning has started.

I am assuming you don't want to use bufio.NewReaderSize(r, 64<<20) followed by r.Reset to reuse the buffer, because you don't want every Reader to be sitting on 64 MB?

Exactly. Start with something small (e.g., 4 KiB) and permit growing up to some specified maximum.

@rsc
Contributor

rsc commented Feb 2, 2022

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc rsc moved this from Incoming to Active in Proposals (old) Feb 2, 2022
@rsc
Contributor

rsc commented Feb 9, 2022

Do you think you need the ability to set the initial slice yourself, or are you OK with bufio always allocating it?
I've always found the Scanner API a bit difficult to remember and am not sure about mimicking it here.

If we don't need the ability to set the slice ourselves, maybe we could

  1. Make NewReaderSize(64<<20) start out with a smaller allocation, so that you get the auto-growing as needed?
  2. Add a new SetMaxSize(int) method to enable growing?

Do either of those work for you? Any thoughts on which is better?
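Neither exists today; as a sketch, the two options would read roughly like this (shown as alternatives, with SetMaxSize being the hypothetical method named above):

// Option 1: the size passed to NewReaderSize becomes a ceiling; the
// Reader starts with a small buffer and grows on demand up to it.
r := bufio.NewReaderSize(src, 64<<20)

// Option 2: construct as usual, then raise the ceiling explicitly.
r := bufio.NewReader(src)
r.SetMaxSize(64 << 20)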

@dsnet
Member Author

dsnet commented Feb 9, 2022

I've always found the Scanner API a bit difficult to remember

Our use case didn't care about setting the slice. I've personally never used the "set the slice" feature in bufio.Scanner.

Do either of those work for you? Any thoughts on which is better?

Both options work for us. Option 2 is more flexible, since it allows modifying the buffer size after the fact, but then you'll need to decide what happens when the maximum is set smaller than what's currently buffered.

@rsc
Contributor

rsc commented Feb 16, 2022

Let's figure out whether we can do without API changes, since that's always preferable.
Suppose we make NewReaderSize(N) for any N > 4k start with a 4k buffer and then double up to N
(the last step might not be a complete doubling).
This could possibly trip up new programs on old releases, but there are probably not many of those,
since this is not going to be widely used.

Any reasons not to take this approach?
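As a sketch, the growth policy described above is just repeated doubling with a clamp (illustrative only, not an implementation):

// Starting from 4 KiB, double the buffer until it reaches the size
// requested in NewReaderSize; the final step is clamped, so it may
// be a partial doubling (e.g. N = 48<<20 ends ..., 16M, 32M, 48M).
func grow(cur, max int) int {
    if cur*2 > max {
        return max
    }
    return cur * 2
}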

@dsnet
Member Author

dsnet commented Feb 16, 2022

That approach SGTM, but this behavior is potentially observable through Read, based on the number of bytes it returns. It's not valid for programs to depend on that artifact.

I can make a prototype CL and we can run through global testing and see how many things break and whether it's legitimate.
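To make the observability concrete: the io.Reader contract already allows short reads, but a program written against today's behavior could still notice the difference.

// How many bytes one Read returns depends on the internal buffer:
// how much a single fill can hold, and when bufio bypasses its
// buffer entirely for a large p. A lazily grown buffer changes those
// sizes, so n may differ from today. Correct code treats any n in
// [0, len(p)] as valid.
p := make([]byte, 1<<20)
n, err := r.Read(p)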

@rsc
Contributor

rsc commented Feb 23, 2022

Sounds good, thanks Joe.

@rsc
Contributor

rsc commented Mar 2, 2022

Waiting on a prototype CL from Joe to see if we can get away without an API change here.

@rsc
Contributor

rsc commented Mar 23, 2022

Placing on hold for prototype CL. Please feel free to move out of hold once it is ready.

@rsc rsc moved this from Active to Hold in Proposals (old) Mar 23, 2022
@rsc
Contributor

rsc commented Mar 23, 2022

Placed on hold.
— rsc for the proposal review group
