encoding/csv: reader runs out of memory on Windows 32-bit #6352
Comments
@alex, AFAIK the CSV reader does not buffer all the text in memory. At least I think it doesn't, that is what I'm trying to figure out. Ideally if the OP could post a single line from the CSV, then we can construct a reader which returns that line infinitely. That should save some download time. |
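A rough sketch of the "reader which returns that line infinitely" idea, in case it helps anyone reproduce this without the original file. `repeatReader` and `countRecords` are hypothetical names, and the sample record is a placeholder rather than a line from the reporter's data. The input is capped with `io.LimitReader` so the run terminates.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
)

// repeatReader is a hypothetical helper: it serves the same line over and
// over, so Read never reports io.EOF unless the line is empty.
type repeatReader struct {
	line []byte // one record, including its trailing newline
	off  int    // index of the next unread byte within line
}

func (r *repeatReader) Read(p []byte) (int, error) {
	if len(r.line) == 0 {
		return 0, io.EOF
	}
	n := 0
	for n < len(p) {
		c := copy(p[n:], r.line[r.off:])
		n += c
		r.off = (r.off + c) % len(r.line)
	}
	return n, nil
}

// countRecords feeds the repeated line through csv.Reader, stopping after
// limit bytes of input, and reports how many records were parsed.
func countRecords(line string, comma rune, limit int64) (int, error) {
	src := io.LimitReader(&repeatReader{line: []byte(line)}, limit)
	r := csv.NewReader(src)
	r.Comma = comma
	n := 0
	for {
		_, err := r.Read()
		if err == io.EOF {
			return n, nil
		}
		if err != nil {
			return n, err
		}
		n++
	}
}

func main() {
	// Placeholder record; substitute a real line from the problem file.
	n, err := countRecords("a|b|c\n", '|', 6*1000)
	fmt.Println(n, err)
}
```

Watching the process's memory while raising the limit should show whether a single well-formed line, repeated, is enough to make usage climb.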
Yes, that is what I thought as well. However, memory usage starts climbing rapidly after approximately 300,000 lines are read. Please see the attached file for some sample records (in the original file, the crash happens after the 4th line in this file, but there seems to be nothing wrong with the line itself). Cheers Sudhir Attachments:
|
I've also attached the stack trace after the program crash if that helps Sudhir Attachments:
|
Stack trace attached if it helps (reloaded a more complete trace) Sudhir Attachments:
|
Looking at the trace, while reading a field, the buffer has grown to at least 272 MB (the first argument to bytes.makeSlice). I suspect data corruption in the source data.
[fp=0xfd1580] bytes.makeSlice(0x103fffff, 0x0, 0x0, 0x0) H:/go/src/pkg/bytes/buffer.go:191 +0x6c
[fp=0xfd15e0] bytes.(*Buffer).grow(0x2008c01c, 0x1, 0x81ffffe) H:/go/src/pkg/bytes/buffer.go:99 +0x177
[fp=0xfd15f0] bytes.(*Buffer).WriteByte(0x2008c01c, 0x2608287c, 0x0, 0x0) H:/go/src/pkg/bytes/buffer.go:228 +0x37
[fp=0xfd1630] bytes.(*Buffer).WriteRune(0x2008c01c, 0x7c, 0x0, 0x0, 0x0, ...) H:/go/src/pkg/bytes/buffer.go:239 +0x48
[fp=0xfd1690] encoding/csv.(*Reader).parseField(0x2008c000, 0x0, 0x81fffff, 0x0, 0x0, ...)
|
Dave, That is what I suspected (missing newlines leading to parseField reading a whole bunch of characters). I wrote a small program to read through the file (using Scanner) and the file looks normal i.e., max line length is 1118 bytes (total lines = 3,834,974). In any case, at the point it crashed, it has read approximately 1/7 of the file, i.e., about 370 MB. If the buffer is 272MB at that point, it very much looks like a problem in the field buffer memory management. Given the size of the file (it is a database export), I can't easily open it in an editor to take a look at it. Do you have any suggestions as to how I could check for data corruption? |
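For reference, a minimal sketch of the kind of Scanner check described above. It is not the reporter's actual program; `lineStats` and `statsOfString` are made-up names, and the sample input is invented. Note that a check like this only looks at newlines, so it cannot detect a problem that comes from quote characters inside a line.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// lineStats returns the number of lines in r and the length in bytes of the
// longest one, excluding the line terminator.
func lineStats(r io.Reader) (lines, maxLen int, err error) {
	sc := bufio.NewScanner(r)
	// Raise the token limit above bufio.Scanner's 64 KiB default so that an
	// unexpectedly long line is measured instead of aborting the scan.
	sc.Buffer(make([]byte, 0, 64*1024), 64*1024*1024)
	for sc.Scan() {
		lines++
		if n := len(sc.Bytes()); n > maxLen {
			maxLen = n
		}
	}
	return lines, maxLen, sc.Err()
}

// statsOfString is a convenience wrapper for trying lineStats on a literal.
func statsOfString(s string) (int, int, error) {
	return lineStats(strings.NewReader(s))
}

func main() {
	// Invented sample; for a real file, pass the *os.File from os.Open.
	lines, maxLen, err := statsOfString("a|b|c\nlonger|line|here\n")
	fmt.Println(lines, maxLen, err)
}
```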
The crudest approach would be to feed the whole file through csv.Reader and print every line (or the line number) as it is read; when it stops, the line following the last one printed is the malformed one. I think you can make the reader more robust by avoiding TrailingComma and setting FieldsPerRecord to the known value. I'd also recommend reading some of the other issues for this package: https://code.google.com/p/go/issues/list?can=1&q=TrailingComma&colspec=ID+Status+Stars+Priority+Owner+Reporter+Summary&cells=tiles |
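Something along these lines might do for the FieldsPerRecord suggestion. This is a sketch only: `firstBadRecord` and `checkString` are hypothetical helpers, and the '|' delimiter and field count of 3 are assumptions for the sample data, not values taken from the real export.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

// firstBadRecord reads records until the first error and returns how many
// records parsed cleanly before it, together with that error (nil if the
// whole input was read).
func firstBadRecord(src io.Reader, comma rune, fields int) (int, error) {
	r := csv.NewReader(src)
	r.Comma = comma
	// A positive FieldsPerRecord makes Read fail on any record with a
	// different number of fields.
	r.FieldsPerRecord = fields
	good := 0
	for {
		_, err := r.Read()
		if err == io.EOF {
			return good, nil
		}
		if err != nil {
			return good, err
		}
		good++
	}
}

// checkString is a convenience wrapper for trying firstBadRecord on a literal.
func checkString(s string, comma rune, fields int) (int, error) {
	return firstBadRecord(strings.NewReader(s), comma, fields)
}

func main() {
	// Invented sample: the second record has two fields instead of three.
	good, err := checkString("a|b|c\nd|e\nf|g|h\n", '|', 3)
	fmt.Printf("%d good record(s), first error: %v\n", good, err)
}
```

The returned error is a parse error that carries the line number, which should be enough to locate the offending record in a large file.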
Dave, I looked at it a bit more and this seems to be due to the LazyQuotes parsing logic. I.e., the root cause is the same as that in issue #3150 (https://golang.org/issue/3150&can=1&q=TrailingComma&colspec=ID%20Status%20Stars%20Priority%20Owner%20Reporter%20Summary) Essentially, the parser does not treat field values with quotes correctly. It expects either the entire field to be quoted or none of it. In other words, it seems that you cannot encode the following line (delimited by '|') as a single value - |He said - "what?".| If LazyQuotes is false, this is flagged as a 'BareQuote' error. If LazyQuotes is true, this sucks in following lines into a huge string. This was what was causing the buffer to grow. So either |He said - what?.| or |"He said - what?."| would be parsed ok. The quote handling, even if broken, should terminate at most at end of line. This would give the wrong number of fields for the line but at least it won't go nuts doing memory allocations. However, ideally, it should accept quoted substrings - I'll try and fix the code in question and provide a patch ... Cheers Sudhir |
Sorry, my last example was wrong. The case that is not handled is when the first character of the field is a quote, i.e., |"What?" - he said.|. In this case the second quote is taken to be a bare quote and the function will suck in all characters until (many lines later), it finds another quote. Attached is a small patch that fixes this problem (only) when LazyQuotes is set to true. After reading a quote in the middle of a field, if the next character isn't a Comma or Newline, it starts processing it as a normal field (ignoring any more quotes). The quote read in the middle is output to the field as a normal character. It would be desirable to rewrite the buffer at this point to include the initial quote but the attached fix does not do that, i.e., |"What?" he said| will be returned as |What?" he said| HTH Sudhir Attachments:
|
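A small reproduction of the case described above, for anyone who wants to see it without the original export. The input is invented to match the |"What?" - he said.| example, and `parse` is a hypothetical helper. The exact results may differ between Go releases, so treat it as a probe rather than a statement of what the package guarantees.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// parse reads all records from input using '|' as the delimiter, with
// LazyQuotes set as requested.
func parse(input string, lazy bool) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(input))
	r.Comma = '|'
	r.LazyQuotes = lazy
	return r.ReadAll()
}

func main() {
	// The first field starts with a quote, and the second quote is followed
	// by a space rather than a delimiter or a newline.
	input := "\"What?\" - he said.|a\nb|c\nd|e\n"
	for _, lazy := range []bool{false, true} {
		recs, err := parse(input, lazy)
		fmt.Printf("LazyQuotes=%v: %d record(s), err=%v\n", lazy, len(recs), err)
		for _, rec := range recs {
			fmt.Printf("  %d field(s): %q\n", len(rec), rec)
		}
	}
}
```

With LazyQuotes off I would expect a quote error on the first line; with it on, the remaining lines end up inside the first field, which is the growth pattern seen in the stack trace.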
Thank you for the patch. Please follow the contribution process documented here, http://golang.org/doc/contribute.html. |
CL https://golang.org/cl/13659043 references this issue. |
This out-of-memory failure appears to have been caused by reading the entire input as one row, because of a bad quote in the input file. There's not much we can do about that: maybe there really is a 2.6 GB string in that field. This bad input was semantically indistinguishable from a very large valid input, and in general we don't try to harden APIs like this against very large inputs. |
by gaijin.shenoy: