encoding/csv: reader runs out of memory on Windows 32-bit #6352

gopherbot · 2013-09-10T02:55:15Z

by gaijin.shenoy:

When processing a large (2.6 GB) CSV file on Windows 32-bit, using the encoding/csv
package, the program fails after approximately 530,000 lines (each line is approximately
840 characters with 175 fields).

What steps will reproduce the problem?
Run attached program that simply reads an entire file without doing anything else on a
large CSV file (as per above)
http://play.golang.org/p/slFg6cS8o4

What is the expected output?
Should complete without errors

What do you see instead?
fatal error: runtime: cannot map pages in arena address space

Which compiler are you using (5g, 6g, 8g, gccgo)?
8g

Which operating system are you using?
Windows XP

Which version are you using?  (run 'go version')
1.1.2

Please provide any additional information below.

davecheney · 2013-09-10T05:12:05Z

Comment 1:

Where can I obtain a sufficiently large set of sample data to reproduce the issue?

Status changed to WaitingForReply.

alexbrainman · 2013-09-10T05:15:01Z

Comment 2:

Dave, how is that going to help here? A 32-bit Windows process can have up to 2 GB of
virtual address space. So, surely, it will run out of memory very quickly.
Alex

davecheney · 2013-09-10T05:17:39Z

Comment 3:

@alex, AFAIK the CSV reader does not buffer all the text in memory. At least I think it
doesn't, that is what I'm trying to figure out.
Ideally if the OP could post a single line from the CSV, then we can construct a reader
which returns that line infinitely. That should save some download time.
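
The reader described above could be sketched roughly as follows. This is only an illustration: `repeatReader` and `countRecords` are hypothetical names, the record in `main` is a placeholder rather than a line from the real export, and the repeat count is finite so that the program terminates.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

// repeatReader is a hypothetical helper: it yields the same line n times
// and then reports io.EOF, so no large file is needed on disk.
type repeatReader struct {
	line string // one CSV record, including its trailing newline
	n    int    // how many more full copies to emit
	off  int    // read offset within the current copy
}

func (r *repeatReader) Read(p []byte) (int, error) {
	if r.n <= 0 {
		return 0, io.EOF
	}
	c := copy(p, r.line[r.off:])
	r.off += c
	if r.off == len(r.line) {
		// Finished one copy of the line; start the next one.
		r.off = 0
		r.n--
	}
	return c, nil
}

// countRecords parses src with the given delimiter and returns the number
// of records read before EOF or the first error.
func countRecords(src io.Reader, comma rune) (int, error) {
	cr := csv.NewReader(src)
	cr.Comma = comma
	count := 0
	for {
		_, err := cr.Read()
		if err == io.EOF {
			return count, nil
		}
		if err != nil {
			return count, err
		}
		count++
	}
}

func main() {
	// Placeholder record with 175 fields; substitute a real line from the export.
	line := strings.Repeat("field|", 174) + "field\n"
	n, err := countRecords(&repeatReader{line: line, n: 100000}, '|')
	fmt.Println(n, err)
}
```

If memory stays flat while this runs, the reader itself is not accumulating input, which would point back at the data.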

alexbrainman · 2013-09-10T05:20:46Z

Comment 4:

Ahha! :-) Don't know anything about CSV reader.
Alex

gopherbot · 2013-09-10T05:27:07Z

Comment 5 by gaijin.shenoy:

Yes, that is what I thought as well. However, memory usage starts climbing rapidly after
approx. 300,000 lines are read.
Please see the attached file for some sample records (in the original file, the crash
happens after the 4th line in this file, but there seems to be nothing wrong with the
line by itself).
Cheers
Sudhir

Attachments:

junk.txt (6690 bytes)

gopherbot · 2013-09-10T05:38:14Z

Comment 6 by gaijin.shenoy:

I've also attached the stack trace after the program crash if that helps
Sudhir

Attachments:

stacktrace.txt (1741 bytes)

gopherbot · 2013-09-10T08:13:38Z

Comment 7 by gaijin.shenoy:

Stack trace attached if it helps (reloaded a more complete trace)
Sudhir

Attachments:

stacktrace.txt (3106 bytes)

davecheney · 2013-09-10T08:18:36Z

Comment 8:

Looking at the trace, while reading a field, the buffer has grown to at least 272 MB (the
first argument to bytes.makeSlice). I suspect data corruption in the source data.
[fp=0xfd1580] bytes.makeSlice(0x103fffff, 0x0, 0x0, 0x0)
    H:/go/src/pkg/bytes/buffer.go:191 +0x6c
[fp=0xfd15e0] bytes.(*Buffer).grow(0x2008c01c, 0x1, 0x81ffffe)
    H:/go/src/pkg/bytes/buffer.go:99 +0x177
[fp=0xfd15f0] bytes.(*Buffer).WriteByte(0x2008c01c, 0x2608287c, 0x0, 0x0)
    H:/go/src/pkg/bytes/buffer.go:228 +0x37
[fp=0xfd1630] bytes.(*Buffer).WriteRune(0x2008c01c, 0x7c, 0x0, 0x0, 0x0, ...)
    H:/go/src/pkg/bytes/buffer.go:239 +0x48
[fp=0xfd1690] encoding/csv.(*Reader).parseField(0x2008c000, 0x0, 0x81fffff, 0x0, 0x0, ...)
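
For reference, the size quoted above follows from the first argument to bytes.makeSlice in the trace. A small sketch of the conversion (the names `makeSliceArg` and `toMiB` are made up for this example):

```go
package main

import "fmt"

// makeSliceArg is the first argument to bytes.makeSlice in the trace above.
const makeSliceArg = 0x103fffff

// toMiB converts a byte count to mebibytes (2^20 bytes).
func toMiB(n int64) float64 {
	return float64(n) / (1 << 20)
}

func main() {
	// Print the requested allocation in bytes, decimal megabytes and mebibytes.
	fmt.Printf("%d bytes = %.1f MB = %.1f MiB\n",
		makeSliceArg, float64(makeSliceArg)/1e6, toMiB(makeSliceArg))
}
```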

gopherbot · 2013-09-10T09:07:05Z

Comment 9 by gaijin.shenoy:

Dave,
That is what I suspected (missing newlines leading to parseField reading a whole bunch
of characters). I wrote a small program to read through the file (using Scanner) and the
file looks normal i.e., max line length is 1118 bytes (total lines = 3,834,974).
In any case, at the point it crashed, it has read approximately 1/7 of the file, i.e.,
about 370 MB. If the buffer is 272MB at that point, it very much looks like a problem in
the field buffer memory management.
Given the size of the file (it is a database export), I can't easily open it in an
editor to take a look at it. Do you have any suggestions as to how I could check for
data corruption?

davecheney · 2013-09-10T09:12:37Z

Comment 10:

The crudest solution would be to feed the whole file through csv.Reader and print out
every line (or the line number); when it stops, the line that follows the last one is
malformed.
I think you can make the reader more robust by avoiding TrailingComma and setting
FieldsPerRecord to the known value. I'd also recommend reading some of the other issues
for this package,
https://code.google.com/p/go/issues/list?can=1&q=TrailingComma&colspec=ID+Status+Stars+Priority+Owner+Reporter+Summary&cells=tiles

gopherbot · 2013-09-11T01:16:20Z

Comment 11 by gaijin.shenoy:

Dave,
I looked at it a bit more and this seems to be due to the LazyQuotes parsing logic.
I.e., the root cause is the same as that in issue #3150
(https://golang.org/issue/3150).
Essentially, the parser does not treat field values with quotes correctly. It expects
either the entire field to be quoted or none of it. In other words, it seems that you
cannot encode the following line (delimited by '|') as a single value -
|He said - "what?".|
If LazyQuotes is false, this is flagged as a 'BareQuote' error. If LazyQuotes is true,
this sucks in following lines into a huge string. This was what was causing the buffer
to grow.
So either |He said - what?.| or |"He said - what?."| would be parsed ok.
The quote handling, even if broken, should terminate at most at end of line. This would
give the wrong number of fields for the line but at least it won't go nuts doing memory
allocations. However, ideally, it should accept quoted substrings - I'll try and fix the
code in question and provide a patch ...
Cheers
Sudhir

davecheney · 2013-09-11T01:17:47Z

Comment 12:

Labels changed: added priority-later, removed priority-triage.

Status changed to Started.

gopherbot · 2013-09-11T03:03:22Z

Comment 13 by gaijin.shenoy:

Sorry, my last example was wrong. The case that is not handled is when the first
character of the field is a quote, i.e., |"What?" - he said.|. In this case the second
quote is taken to be a bare quote and the function will suck in all characters until,
many lines later, it finds another quote.
Attached is a small patch that fixes this problem (only) when LazyQuotes is set to true.
After reading a quote in the middle of a field, if the next character isn't a Comma or
Newline, it starts processing it as a normal field (ignoring any more quotes). The quote
read in the middle is output to the field as a normal character.
It would be desirable to rewrite the buffer at this point to include the initial quote
but the attached fix does not do that, i.e., |"What?" he said| will be returned as
|What?" he said|
HTH
Sudhir

Attachments:

diff.out (807 bytes)

davecheney · 2013-09-11T03:05:13Z

Comment 14:

Thank you for the patch. Please follow the contribution process documented here,
http://golang.org/doc/contribute.html.

gopherbot · 2013-09-11T05:28:15Z

Comment 15 by gaijin.shenoy:

OK.
Submitted CL 13659043 for code review

rsc · 2013-11-27T18:44:57Z

Comment 16:

Labels changed: added go1.3maybe.

rsc · 2013-12-04T01:31:04Z

Comment 18:

Labels changed: added release-none, removed go1.3maybe.

rsc · 2013-12-04T01:50:32Z

Comment 19:

Labels changed: added repo-main.

gopherbot · 2014-04-09T01:58:51Z

Comment 20:

CL https://golang.org/cl/13659043 references this issue.

rsc · 2016-09-26T19:32:28Z

This out of memory appears to have been caused by reading the entire input as one row, because of a bad quote in the input file. There's not much we can do about that: maybe there really is a 2.6 GB string in that field. This bad input was semantically indistinguishable from a very large valid input, and in general we don't try to harden APIs like this against very very large inputs.

gopherbot added started labels Apr 9, 2014

rsc added this to the Unplanned milestone Apr 10, 2015

rsc removed priority-later labels Apr 10, 2015

rsc closed this as completed Sep 26, 2016

rsc removed the Started label Sep 26, 2016

golang locked and limited conversation to collaborators Sep 26, 2017

gopherbot added the FrozenDueToAge label Sep 26, 2017
