
encoding/csv: add a way to reuse a []string for Reader.Read #19721

Closed
nussjustin opened this issue Mar 26, 2017 · 11 comments

@nussjustin (Contributor)

(*csv.Reader).Read allocates a new []string on each call. In many cases the slice could be reused instead of allocating a new one for each record.

I propose to add a new method ReadInto to csv.Reader that takes a []string and tries to read the next CSV line into the given slice and only allocates a new slice if necessary.

The method would look like this:

// ReadInto reads one record (a slice of fields) from r into s.
// If the record has more than len(s) fields, a new slice will be allocated
// instead and s will not be modified.
// If the record has an unexpected number of fields,
// ReadInto returns the record along with the error ErrFieldCount.
// Except for that case, ReadInto always returns either a non-nil
// record or a non-nil error, but not both.
// If there is no data left to be read, ReadInto returns nil, io.EOF.
func (r *Reader) ReadInto(s []string) (record []string, err error) {
	// ...
}

Also, Read would be updated to simply call ReadInto(nil) internally.
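For illustration, a caller could then reuse one slice across an entire file like this (a sketch against the proposed ReadInto, which does not exist in the current package):

package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

func main() {
	r := csv.NewReader(strings.NewReader("a,b,c\nd,e,f\n"))

	var record []string
	for {
		var err error
		// Feed the previous record back in; ReadInto reuses its backing
		// array when the next record fits and allocates only when it doesn't.
		record, err = r.ReadInto(record)
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		fmt.Println(record)
	}
}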

I have a patch ready that yields the following numbers on my notebook (where the "old" column is using Read and the "new" column ReadInto):

name                          old time/op    new time/op    delta
Read-8                          2.75µs ± 2%    1.88µs ± 1%  -31.52%  (p=0.000 n=14+15)
ReadWithFieldsPerRecord-8       2.75µs ± 0%    1.89µs ± 1%  -31.43%  (p=0.000 n=13+13)
ReadWithoutFieldsPerRecord-8    2.77µs ± 1%    1.88µs ± 1%  -32.06%  (p=0.000 n=15+15)
ReadLargeFields-8               55.4µs ± 1%    54.2µs ± 0%   -2.07%  (p=0.000 n=15+14)

name                          old alloc/op   new alloc/op   delta
Read-8                            664B ± 0%       24B ± 0%  -96.39%  (p=0.000 n=15+15)
ReadWithFieldsPerRecord-8         664B ± 0%       24B ± 0%  -96.39%  (p=0.000 n=15+15)
ReadWithoutFieldsPerRecord-8      664B ± 0%       24B ± 0%  -96.39%  (p=0.000 n=15+15)
ReadLargeFields-8               3.94kB ± 0%    2.98kB ± 0%  -24.39%  (p=0.000 n=15+15)

name                          old allocs/op  new allocs/op  delta
Read-8                            18.0 ± 0%       8.0 ± 0%  -55.56%  (p=0.000 n=15+15)
ReadWithFieldsPerRecord-8         18.0 ± 0%       8.0 ± 0%  -55.56%  (p=0.000 n=15+15)
ReadWithoutFieldsPerRecord-8      18.0 ± 0%       8.0 ± 0%  -55.56%  (p=0.000 n=15+15)
ReadLargeFields-8                 24.0 ± 0%      12.0 ± 0%  -50.00%  (p=0.000 n=15+15)

Using ReadInto with the code in #16791 (see my comment) I get:

name   old time/op    new time/op    delta
CSV-8     1.12s ± 1%     1.03s ± 2%   -8.44%  (p=0.000 n=15+14)

name   old alloc/op   new alloc/op   delta
CSV-8     144MB ± 0%      48MB ± 0%  -66.66%  (p=0.000 n=13+14)

name   old allocs/op  new allocs/op  delta
CSV-8     2.00M ± 0%     1.00M ± 0%  -50.00%  (p=0.000 n=13+14)

Benchmarks run with: go version devel +ecc6a81617 Sat Mar 25 00:35:35 2017 +0000 linux/amd64

@nussjustin (Contributor, Author)

/cc @bradfitz @ALTree

@bradfitz (Contributor)

Or copy database/sql? https://golang.org/pkg/database/sql/#Row.Scan
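For reference, a Scan-style shape for csv might look roughly like the following hypothetical sketch (scanRecord is illustrative, not an existing API; a real method inside the package would parse directly into the destination rather than going through Read):

package csvscan

import "encoding/csv"

// scanRecord sketches the Scan-style shape: the caller owns the
// destination slice, and each call overwrites it in place, reusing
// its backing array when capacity allows.
func scanRecord(r *csv.Reader, record *[]string) error {
	fields, err := r.Read() // stand-in; a real method would avoid this allocation
	if err != nil {
		return err
	}
	*record = append((*record)[:0], fields...)
	return nil
}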

@bradfitz (Contributor)

/cc @kardianos

@kardianos (Contributor)

The difference between encoding/csv and database/sql is that sql will always return a rectangular result, while csv may return jagged results. Another difference is that sql does type conversions, while csv only handles strings.
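For example, with the existing API a reader configured with FieldsPerRecord = -1 already returns records of varying length:

package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

func main() {
	r := csv.NewReader(strings.NewReader("a,b,c\nd,e\n"))
	r.FieldsPerRecord = -1 // disable the per-record field count check

	records, err := r.ReadAll()
	if err != nil {
		panic(err)
	}
	fmt.Println(records) // [[a b c] [d e]], a jagged result
}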

I think an API like this makes sense. In this API proposal, can s be written to even if a new slice is allocated (that is, is the record written into s field by field, or buffered internally)?

The other option I could see is adding a csv Reader option that reuses the same buffer for each call to Read (ReadAll would be unaffected).

@nussjustin (Contributor, Author) commented Mar 26, 2017

The current implementation of encoding/csv knows exactly how many fields a record has before allocating the slice, so there would be no need to modify s when it is too small.

Internally this would be implemented like this:

if len(s) >= fieldsCount {
	// The next record fits: reslice s and fill it in place.
	record = s[:fieldsCount]
} else {
	// The next record doesn't fit: allocate fresh, leaving s untouched.
	record = make([]string, fieldsCount)
}

The comment for ReadInto currently states:

// If the record has more than len(s) fields, a new slice will be allocated
// instead and s will not be modified.

Although probably nice for callers, this guarantee is a fairly big restriction on future changes.

I also thought about adding a new field to the Reader, but I decided against it based on personal preference, even though the current proposal adds another new method. As for ReadAll, adding a new field would require making sure that ReadAll doesn't use it, either by copying the record after each internal call to Read or by preventing the reuse entirely (which would avoid the extra copy).

@bradfitz (Contributor) commented Mar 26, 2017

Or we could add a bool field to csv.Reader, like Reader.ReuseRecord, that causes Reader.Read to always use the same slice.

Edit: I can't read. @nussjustin addressed this above. But I still think that if we're going to add API for a performance case, it should be less high-profile than a new method.

@kardianos (Contributor)

> I also thought about adding a new field to the Reader, but I decided against it based on personal preference, even though the current proposal adds another new method. As for ReadAll, adding a new field would require making sure that ReadAll doesn't use it, by copying the record after each internal call to Read.

Modifying ReadAll to copy the slice returned from Read if Reader.ReuseRecord was set would be simple enough. However, I'm not arguing for any particular API.
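A minimal sketch of that change inside the package (assuming the proposed ReuseRecord field existed on Reader):

// ReadAll copies each record before retaining it when ReuseRecord is
// set, since the next internal Read would otherwise overwrite it.
func (r *Reader) ReadAll() (records [][]string, err error) {
	for {
		record, err := r.Read()
		if err == io.EOF {
			return records, nil
		}
		if err != nil {
			return nil, err
		}
		if r.ReuseRecord {
			// Detach from the shared backing array before keeping it.
			record = append([]string(nil), record...)
		}
		records = append(records, record)
	}
}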

nussjustin changed the title from "encoding/csv: add method to read into an existing []string" to "encoding/csv: add a way to reuse a []string for Reader.Read" on Mar 26, 2017
@nussjustin (Contributor, Author)

I agree with @bradfitz that a method is likely a bit too much for this simple optimization. I'm currently thinking about something along the lines of:

	// If ReuseRecord is true, calls to Read will reuse a single record slice.
	// When a record doesn't fit into the currently held slice, a new slice will
	// be allocated and used for successive calls to Read instead.
	// If set to false, each call to Read returns a fresh slice.
	// ReadAll is not affected by this setting.
	ReuseRecord bool

I'm not completely sure about the second sentence, but I think it can help avoid some potential confusion. This definition also makes clear that ReadAll is not affected.
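From calling code, the flag would be used roughly like this (a sketch against the proposed field; the returned slice must be copied if it is retained across calls):

package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

func main() {
	r := csv.NewReader(strings.NewReader("a,b,c\nd,e,f\n"))
	r.ReuseRecord = true // proposed field: Read may return the same slice each call

	for {
		record, err := r.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		// record is only valid until the next call to Read; copy to retain.
		fmt.Println(record)
	}
}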

I also tested a version with ReuseRecord locally and found no performance difference between ReadInto and ReuseRecord (as expected).

@kardianos (Contributor)

@nussjustin @bradfitz SGTM

@gopherbot

CL https://golang.org/cl/41730 mentions this issue.

@nussjustin (Contributor, Author)

Sorry for the delay. I finally got some time (and internet!) to mail my change.

bradfitz added this to the Go1.9 milestone on Apr 26, 2017
golang locked and limited conversation to collaborators on Apr 26, 2018