encoding/csv: Reading is slow #16791
What if you change |
Yeah I'm dumb, |
CPU:
Memory:
We probably can't do much about the slice per row, but we could probably add an interned string table cache to *Reader, keeping an LRU of the most recent N rows. Maybe per-column. And maybe not a map, but just a window of the past N (3, 5?) values per column. You would have to take care to keep the cost of that string-interning mechanism cheaper than the savings from reduced GC. I haven't looked at the rest of the CPU profile. Peeking at bytes and doing ReadByte instead of the relatively expensive ReadRune would probably help a lot. |
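A minimal sketch of that per-column window idea (the window size and eviction policy here are arbitrary choices, not a worked-out design):

package main

import "fmt"

// intern returns a string equal to b, reusing a recently seen value when
// possible so that repeated column values share one allocation.
func intern(window []string, b []byte) (string, []string) {
	for _, s := range window {
		if s == string(b) { // this comparison does not allocate
			return s, window
		}
	}
	s := string(b)
	if len(window) == cap(window) && cap(window) > 0 {
		copy(window, window[1:]) // evict the oldest entry
		window = window[:len(window)-1]
	}
	return s, append(window, s)
}

func main() {
	window := make([]string, 0, 4) // keep the last 4 values seen in this column
	a, window := intern(window, []byte("ACTIVE"))
	b, _ := intern(window, []byte("ACTIVE"))
	fmt.Println(a == b) // true: the second call returned the cached string instead of allocating
}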
FYI I built and ran the Go code from this issue with tip (604efe1) and compared it to tip + CL 24723 applied, and I got around a 10% improvement. I also wrote a little benchmark:

import (
	"bytes"
	"encoding/csv"
	"io"
	"io/ioutil"
	"testing"
)

func BenchmarkCSV(b *testing.B) {
	d, err := ioutil.ReadFile("mock_data.csv")
	if err != nil {
		b.Fatal(err)
	}
	br := bytes.NewReader(d)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		br.Reset(d)
		r := csv.NewReader(br)
		for {
			_, err := r.Read()
			if err == io.EOF {
				break
			}
			if err != nil {
				b.Fatal(err)
			}
		}
	}
}

Results for 10 runs each (tip & tip+CL 24723):
CPU profile with cl 24723:
Memory profile with cl 24723:
|
Did a quick test, and it seems that for cases that are mostly ASCII we could get another speedup by replacing bytes.Buffer with plain []byte byte-appends in two places. This avoids some of the overhead of the WriteRune -> (if > utf8.RuneSelf) -> WriteByte -> grow bytes.Buffer chain for writing each byte. Quick test with the mentioned mock_data.csv:
And it would work in addition to the patch from @nuss-justin. Quick prototype: patch 27430. If it is deemed an OK tradeoff to remove the bytes.Buffer dependency, I can prepare a polished patch. |
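A sketch of the general idea (not the actual CL): append bytes directly for ASCII and fall back to rune decoding only for multi-byte input.

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	data := []byte("id,name,café")
	var field []byte
	for i := 0; i < len(data); {
		if c := data[i]; c < utf8.RuneSelf {
			field = append(field, c) // ASCII fast path: a plain byte append
			i++
			continue
		}
		// rare multi-byte path: find the rune's length, copy its bytes
		_, size := utf8.DecodeRune(data[i:])
		field = append(field, data[i:i+size]...)
		i += size
	}
	fmt.Println(string(field))
}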
CL https://golang.org/cl/27430 mentions this issue. |
I think we can use https://golang.org/pkg/bufio/#Reader.ReadSlice: line, err := r.ReadSlice('\n') and use the returned slice directly. |
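A sketch of what a ReadSlice-based loop could look like (splitting on commas here stands in for real CSV parsing, which still has to handle quoting):

package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"strings"
)

func main() {
	br := bufio.NewReader(strings.NewReader("a,b\nc,d\n"))
	for {
		// line points into bufio's internal buffer; it is only valid
		// until the next read, so no per-line allocation happens here.
		line, err := br.ReadSlice('\n')
		if len(line) > 0 {
			fields := bytes.Split(bytes.TrimRight(line, "\r\n"), []byte{','})
			fmt.Printf("%q\n", fields)
		}
		if err == io.EOF {
			break
		}
		if err != nil {
			// real code must also handle bufio.ErrBufferFull for
			// lines longer than the buffer
			panic(err)
		}
	}
}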
Find a bare-bones CSV reader at https://gist.github.com/ulikunitz/676de633513152c41ab832158a07ac57. It doesn't do CRLF conversion, doesn't count lines or columns, and supports no options. Here is the benchstat output against 1.7 with BenchmarkRead from reader_test.go and the BenchmarkCSV proposed by @nuss-justin:
I propose to make it the starting point for making it compatible with the current implementation. |
From what I can see, we read runes just because the delimiter (and/or comment character) can be a rune. We could add a private byte-based path for the common case where the delimiter is a single ASCII byte.
(Handling of r.Comment would be handled similarly, but that's a rather minor issue.)
(I don't know much about runes, but could we maybe just read bytes and stop on the first byte of the delimiter rune, regardless of its length? And if it's a multibyte character, then check the subsequent bytes for a match. I don't know how many false positives this would yield, though.) |
utf8 does just that: https://golang.org/pkg/unicode/utf8/#example_DecodeRune |
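For illustration, a byte-wise search for a possibly multi-byte delimiter (essentially what bytes.IndexRune does):

package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// indexRune scans bytes and only checks the full encoding when the first
// byte matches. UTF-8 is self-synchronizing: a leading byte never occurs
// as a continuation byte, so a match can't start in the middle of another
// rune; the only false positives are other runes that share the same
// leading byte, and the full comparison rejects those cheaply.
func indexRune(data []byte, delim rune) int {
	var enc [utf8.UTFMax]byte
	n := utf8.EncodeRune(enc[:], delim)
	for i := 0; i+n <= len(data); i++ {
		if data[i] == enc[0] && bytes.Equal(data[i:i+n], enc[:n]) {
			return i
		}
	}
	return -1
}

func main() {
	fmt.Println(indexRune([]byte("a§b"), '§')) // 1
}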
Apologies if this is the wrong place for this comment (perhaps this should be its own feature request?). I was wondering if it would be possible to implement an efficient streaming API. I'm thinking of a single buffer reused across all rows, with calls that return a [][]byte into it. |
Here's a naive example: https://play.golang.org/p/zbMdK8rCTH

import (
	"bufio"
	"bytes"
	"io"
)

type CSVReader struct {
	Scanner *bufio.Scanner
}

// StreamingRead returns slices into the Scanner's internal buffer; they
// are only valid until the next call.
func (r CSVReader) StreamingRead() ([][]byte, error) {
	if r.Scanner.Scan() {
		return bytes.Split(r.Scanner.Bytes(), []byte(",")), nil
	}
	if err := r.Scanner.Err(); err != nil {
		return nil, err
	}
	return nil, io.EOF
}
|
It's almost OT, but the equivalent of defer in Java is try-with-resources, i.e. the try with parentheses:

public static void main(String[] args) throws IOException {
    try (BufferedReader reader = Files.newBufferedReader(Paths.get("mock_data.csv"))) {
        String line;
        while ((line = reader.readLine()) != null) {
            String[] data = line.split(",");
            if (data[0].equals("42")) {
                System.out.println(line);
            }
        }
    } // close() is implicitly called here
}

or with the Stream API of Java 8:

public static void main(String[] args) throws IOException {
    try (Stream<String> stream = Files.lines(Paths.get("mock_data.csv"))) {
        stream
            .filter(l -> l.split(",")[0].equals("42"))
            .forEach(System.out::println);
    } // close() is implicitly called here
} |
The Java code isn't a CSV parser. Using Apache Commons CSV:

import java.io.*;
import java.nio.charset.Charset;
import java.util.Iterator;
import org.apache.commons.csv.*;
class CsvTest {
public static void main(String[] args) throws IOException {
File data = new File("mock_data.csv");
Charset encoding = Charset.defaultCharset();
try (CSVParser parser = CSVParser.parse(data, encoding, CSVFormat.DEFAULT)) {
Iterator<CSVRecord> iterator = parser.iterator();
while (iterator.hasNext()) {
CSVRecord line = iterator.next();
if (line.get(0).equals("42")) {
System.out.println(line);
}
}
}
}
}

$ TIMEFORMAT=%R
$ go version
go version go1.6 linux/amd64
$ time ./csv-test > /dev/null
3.049
$ time ./csv-test > /dev/null
3.238
$ python3 --version
Python 3.4.3
$ time python csv-test.py > /dev/null
0.586
$ time python csv-test.py > /dev/null
0.581
$ java -version
openjdk version "1.8.0_91"
OpenJDK Runtime Environment (build 1.8.0_91-8u91-b14-0ubuntu4~14.04-b14)
OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)
$ time java -cp .:commons-csv-1.4.jar CsvTest > /dev/null
1.698
$ time java -cp .:commons-csv-1.4.jar CsvTest > /dev/null
1.552

Compared to Go 1.6 (latest version with a .deb), my test put Python at 5.3x faster and Java at 1.9x faster. |
@pauldraper thanks for posting this. I'm really not a Java programmer, and the Java code I posted in the first message is not actually useful for a fair comparison. |
Discussion on HN: https://news.ycombinator.com/item?id=12419939 |
For another data point, I observe that it is possible to go >5x faster than the stdlib parser if you can throw out quote handling and avoid allocations. Whenever I've needed to read large amounts of CSV-style data, I first make sure separators don't appear in the data; then no quoting is required and you can use a much simpler CSV parser. My favourites are the ASCII Unit Separator and Record Separator characters (0x1F and 0x1E): https://github.com/pwaller/usv/blob/master/usv.go
Of course this isn't a viable option if you want to be able to read standard CSV files, but it does seem to hint that there is a big performance loss somewhere in the standard library CSV decoder. |
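For illustration, the separator scheme boils down to this (a standalone sketch, not the linked package's API):

package main

import (
	"bytes"
	"fmt"
)

const (
	rs = '\x1e' // ASCII Record Separator
	us = '\x1f' // ASCII Unit Separator
)

func main() {
	data := []byte("1\x1ffoo\x1e2\x1fbar")
	for _, rec := range bytes.Split(data, []byte{rs}) {
		// no quoting rules to honour: every field is a plain slice
		fmt.Printf("%q\n", bytes.Split(rec, []byte{us}))
	}
}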
@pauldraper Someone on HN pointed out that your Java benchmark didn't first warm up the JIT. This is probably an important caveat. |
@weberc2, it's important if you have a long-running process. It all depends what you want to measure. If I'm writing a command-line CSV parser, then warming it up would be unrealistic. This is an apple/orange/pear comparison anyway. Compiled vs. interpreted + compiled vs. JIT: completely different performance characteristics. The point is that even in the best light, the csv module could use some love. FYI, for the record, I think you overestimate the importance of warm-up time here. PyPy uses JIT + pure Python csv. For me, it's 4x faster than the Go baseline and 2x faster than Java. (For Java, Apache Commons CSV is just not particularly fast. I chose the most "standard" library. If you want something faster but more complicated, look at Jackson or another library.) |
I took the approach I mentioned earlier (operating on bytes instead of runes) and extended it a bit. What I'm outlining here is an effort to achieve feature parity with the current reader (and thus a drop-in replacement) with improved performance. The main idea is as follows: instead of constructing each field rune by rune (or even byte by byte), why not just check each byte and, when an action is required (e.g. a newline/separator/carriage return/quotation mark is hit), slice out everything buffered so far, starting at the end of the previous field? This way we eliminate one allocation per field (a costly one at that).
This can be done quite easily for unquoted fields, because they are always contiguous in memory, so we just slice them out. There is a bit more work for quoted fields with non-lazy quotes within - e.g. "this is ""a field"" with quotes" would need to be sliced three times to cut out the double quotes - still less work than byte-by-byte appending. In order to use this slicing of loaded data, I opted for a fixed-size buffer that I read into and then slice. There are still some issues, obviously.
I did some simple benchmarking, but it's a bit premature, since the output is not always correct. The speedup is quite nice though, around 2-3x, but let's wait a bit for more sophisticated testing. https://gist.github.com/kokes/3924255b9629b0bef87f8127d473cd56 |
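A toy version of the unquoted fast path being described; quoted fields, as noted above, need extra slices:

package main

import (
	"bytes"
	"fmt"
)

// splitLine slices fields out of an already-loaded line instead of building
// each field byte by byte; the unquoted case needs no copying at all.
func splitLine(line []byte, sep byte) [][]byte {
	var fields [][]byte
	for {
		i := bytes.IndexByte(line, sep)
		if i < 0 {
			return append(fields, line)
		}
		fields = append(fields, line[:i])
		line = line[i+1:]
	}
}

func main() {
	fmt.Printf("%q\n", splitLine([]byte("1,foo,bar"), ','))
}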
I wanted to benchmark Read() (e.g. how many allocations are made per record), so I added a benchmark that reads the test data. Turns out it makes a single string allocation per record instead of one per field.
The only problem I see with this approach is if the user needs to keep around one column, which prevents the rest of a large record from being cleaned up by the GC (is the GC able to trim and reallocate unreachable strings after they've been sliced?). Changing this back to allocate every field separately makes the delta negligible. One bonus is that it's now trivial to implement a zero-allocs read path.
|
@infogulch Yeah, your change is similar to mine (https://golang.org/cl/24723), where I basically did the same thing.
No, the GC doesn't trim the unreachable memory; it will only collect the whole allocation at once. But the case where a user only holds onto a single column, preventing the rest from being GC'ed, is in my experience not that common, and the user can still copy the column into a new string. |
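A minimal illustration of that copy, with made-up field values:

package main

import "fmt"

func main() {
	record := []string{"1", "foo", "bar"}             // with per-record allocation these may share one backing string
	col := string(append([]byte(nil), record[1]...)) // explicit copy detaches col from that backing array
	record = nil                                     // the record (and its backing string) can now be collected
	fmt.Println(col)
}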
The benchmarked code for all languages includes print calls, even though the output is redirected to /dev/null, so we are benchmarking the printing functions as well. It would be interesting to compare the results for benchmarks without the print functions.
|
The print functions are only called once (the condition data[0] == "42" is true for a single row), so they don't affect the comparison. |
@ALTree Absolutely, my bad. I just read the CSV generation code. |
CL https://golang.org/cl/30357 mentions this issue. |
Benchmarks broken off from https://golang.org/cl/24723 and modified to allocate less in the places we're not trying to measure.

Updates #16791

Change-Id: I508e4cfeac60322d56f1d71ff1912f6a6f183a63
Reviewed-on: https://go-review.googlesource.com/30357
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
CL https://golang.org/cl/24723 mentions this issue. |
This commit changes parseRecord to allocate a single string per record, instead of per field, by using indexes into the raw record.

Benchstat (done with f69991c):

name                          old time/op    new time/op    delta
Read-8                        3.17µs ± 0%    2.78µs ± 1%    -12.35%  (p=0.016 n=4+5)
ReadWithFieldsPerRecord-8     3.18µs ± 1%    2.79µs ± 1%    -12.23%  (p=0.008 n=5+5)
ReadWithoutFieldsPerRecord-8  4.59µs ± 0%    2.77µs ± 0%    -39.58%  (p=0.016 n=4+5)
ReadLargeFields-8             57.0µs ± 0%    55.7µs ± 0%     -2.18%  (p=0.008 n=5+5)

name                          old alloc/op   new alloc/op   delta
Read-8                          660B ± 0%      664B ± 0%     +0.61%  (p=0.008 n=5+5)
ReadWithFieldsPerRecord-8       660B ± 0%      664B ± 0%     +0.61%  (p=0.008 n=5+5)
ReadWithoutFieldsPerRecord-8  1.14kB ± 0%    0.66kB ± 0%    -41.75%  (p=0.008 n=5+5)
ReadLargeFields-8             3.86kB ± 0%    3.94kB ± 0%     +1.86%  (p=0.008 n=5+5)

name                          old allocs/op  new allocs/op  delta
Read-8                          30.0 ± 0%      18.0 ± 0%    -40.00%  (p=0.008 n=5+5)
ReadWithFieldsPerRecord-8       30.0 ± 0%      18.0 ± 0%    -40.00%  (p=0.008 n=5+5)
ReadWithoutFieldsPerRecord-8    50.0 ± 0%      18.0 ± 0%    -64.00%  (p=0.008 n=5+5)
ReadLargeFields-8               66.0 ± 0%      24.0 ± 0%    -63.64%  (p=0.008 n=5+5)

For a simple application that I wrote, which reads in a CSV file (via ReadAll) and outputs the number of rows read (15857625 rows), this change reduces the total time on my notebook from ~58 seconds to ~48 seconds.

This reduces time and allocations (bytes) each by ~6% for a real-world CSV file at work (~230000 rows, 13 columns).

Updates #16791

Change-Id: Ia07177c94624e55cdd3064a7d2751fb69322d3e4
Reviewed-on: https://go-review.googlesource.com/24723
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
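The shape of the technique the CL describes, with the field spans hard-coded for illustration (the real parser computes them while scanning):

package main

import "fmt"

type span struct{ start, end int }

func main() {
	raw := []byte("1,foo,bar")
	spans := []span{{0, 1}, {2, 5}, {6, 9}} // field boundaries found during the scan

	s := string(raw) // one string allocation for the whole record
	fields := make([]string, len(spans))
	for i, sp := range spans {
		fields[i] = s[sp.start:sp.end] // substrings share s's backing array: no copy, no extra alloc
	}
	fmt.Println(fields) // [1 foo bar]
}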
I've made a separate implementation of |
CL https://golang.org/cl/36270 mentions this issue. |
Is there any separate issue about no longer interpreting the input as UTF-8? Though performance improvements would be nice (CSV reading and writing is the main slowdown in my command-line program), the functionality improvement is the important part. The problem is that in some situations I encounter, the columns are just byte strings, and they may contain byte sequences that are not valid UTF-8. Go, when reading (and probably also writing) them, mangles it:
Note that the 3rd byte has been replaced.
I didn't understand what was happening at first, until I looked at Go's CSV reading/writing source and saw that it reads the input rune by rune, which turns invalid UTF-8 bytes into the U+FFFD replacement character. |
@aktau this looks like a different problem. Please open another issue. Thanks! |
#19721 added a new ReuseRecord field to the Reader for cases where the returned slice itself is only needed until the next call to Read. This doesn't help in all cases, but it can still improve many uses of Read. |
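A short usage sketch of ReuseRecord (input data made up):

package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

func main() {
	r := csv.NewReader(strings.NewReader("1,foo\n2,bar\n"))
	r.ReuseRecord = true // Read may return the same slice again on the next call
	var kept []string
	for {
		rec, err := r.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		// the slice is reused, but the strings are fresh per record,
		// so keeping an individual field is safe
		kept = append(kept, rec[1])
	}
	fmt.Println(kept) // [foo bar]
}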
Edit: not relevant, my bad. |
1 hour for a 50MB file? No way. Also, why does the profile show 99.8% of the time spent in cgocall? There must be something wrong with your code; please ask for a code review in one of the Go forums or mailing lists. |
Ah, you're right, for some reason the code didn't have proper termination checking. Not sure when that got broken - apologies, ignore! |
Windows uses cgo (well a form of it) to call Windows' syscalls from DLLs
|
Change https://golang.org/cl/72150 mentions this issue: |
Reading of CSV files is, out of the box, quite slow (tl;dr: 3x slower than a simple Java program, 1.5x slower than the obvious Python code). A typical example:
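A minimal reconstruction of such an example, assuming the mock_data.csv file and the record[0] == "42" filter used in the other snippets in this thread:

package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"os"
)

func main() {
	f, err := os.Open("mock_data.csv")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	r := csv.NewReader(f)
	for {
		record, err := r.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		if record[0] == "42" {
			fmt.Println(record)
		}
	}
}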
Python3 equivalent:
Equivalent Java code [EDIT: not actually equivalent, please see pauldraper's comment below for a better test]:
Tested on a 50MB, 1'000'002-line CSV file, generated as:
Results:
Go error reporting is obviously better than what you get with that Java code, and I'm not sure about Python, but people have been complaining about encoding/csv slowness, so it's probably worth investigating whether the csv package can be made faster.