
encoding/csv: reader runs out of memory on Windows 32-bit #6352

Closed
gopherbot opened this issue Sep 10, 2013 · 20 comments

Comments

@gopherbot

by gaijin.shenoy:

When processing a large (2.6 GB) CSV file on Windows 32-bit, using the encoding/csv
package, the program fails after approximately 530,000 lines (each line is approximately
840 characters with 175 fields).

What steps will reproduce the problem?
Run the attached program, which simply reads through an entire large CSV file (as described
above) without doing anything else:
http://play.golang.org/p/slFg6cS8o4
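
For reference, a minimal sketch of what such a program might look like (the file name and the '|' delimiter are assumptions, based on the sample data discussed later in this thread):

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"log"
	"os"
)

func main() {
	f, err := os.Open("export.csv") // hypothetical file name
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	r := csv.NewReader(f)
	r.Comma = '|' // assumed delimiter
	records := 0
	for {
		if _, err := r.Read(); err != nil {
			if err == io.EOF {
				break
			}
			log.Fatalf("record %d: %v", records+1, err)
		}
		records++
	}
	fmt.Println("records read:", records)
}
```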

What is the expected output?
Should complete without errors

What do you see instead?
fatal error: runtime: cannot map pages in arena address space

Which compiler are you using (5g, 6g, 8g, gccgo)?
8g

Which operating system are you using?
Windows XP

Which version are you using?  (run 'go version')
1.1.2

Please provide any additional information below.
@davecheney
Contributor

Comment 1:

Where can I obtain a sufficiently large set of sample data to reproduce the issue?

Status changed to WaitingForReply.

@alexbrainman
Member

Comment 2:

Dave, how is that going to help here? A 32-bit Windows process can have up to 2 GB of virtual
address space, so, surely, it will run out of memory very quickly.
Alex

@davecheney
Contributor

Comment 3:

@alex, AFAIK the CSV reader does not buffer all the text in memory. At least I think it
doesn't; that is what I'm trying to figure out.
Ideally, if the OP could post a single line from the CSV, we could construct a reader
that returns that line infinitely. That should save some download time.
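
For what it's worth, a sketch of that idea, with a hypothetical sample line standing in for the real data (the '|' delimiter is an assumption):

```go
package main

import (
	"encoding/csv"
	"log"
)

// repeatReader returns the same line over and over, so csv.Reader sees an
// effectively unbounded input without anyone downloading a 2.6 GB file.
type repeatReader struct {
	line []byte
	pos  int
}

func (r *repeatReader) Read(p []byte) (int, error) {
	n := copy(p, r.line[r.pos:])
	r.pos = (r.pos + n) % len(r.line)
	return n, nil
}

func main() {
	sample := []byte("a|b|c|d\n") // hypothetical sample record
	cr := csv.NewReader(&repeatReader{line: sample})
	cr.Comma = '|' // assumed delimiter
	for i := 0; i < 1000000; i++ {
		if _, err := cr.Read(); err != nil {
			log.Fatalf("record %d: %v", i+1, err)
		}
	}
	log.Println("read one million records without error")
}
```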

@alexbrainman
Member

Comment 4:

Ahha! :-) Don't know anything about CSV reader.
Alex

@gopherbot
Author

Comment 5 by gaijin.shenoy:

Yes, that is what I thought as well. However, memory usage starts climbing rapidly after
approximately 300,000 lines are read.
Please see the attached file for some sample records (in the original file, the crash happens
after the 4th line of this file, but there seems to be nothing wrong with the line by
itself).
Cheers
Sudhir

Attachments:

  1. junk.txt (6690 bytes)

@gopherbot
Author

Comment 6 by gaijin.shenoy:

I've also attached the stack trace from the program crash, in case it helps.
Sudhir

Attachments:

  1. stacktrace.txt (1741 bytes)

@gopherbot
Author

Comment 7 by gaijin.shenoy:

Stack trace attached if it helps (reloaded a more complete trace)
Sudhir

Attachments:

  1. stacktrace.txt (3106 bytes)

@davecheney
Contributor

Comment 8:

Looking at the trace, while reading a field the buffer has grown to at least 272 MB
(0x103fffff bytes, the first argument to bytes.makeSlice). I suspect data corruption in the source data.
[fp=0xfd1580] bytes.makeSlice(0x103fffff, 0x0, 0x0, 0x0)
    H:/go/src/pkg/bytes/buffer.go:191 +0x6c
[fp=0xfd15e0] bytes.(*Buffer).grow(0x2008c01c, 0x1, 0x81ffffe)
    H:/go/src/pkg/bytes/buffer.go:99 +0x177
[fp=0xfd15f0] bytes.(*Buffer).WriteByte(0x2008c01c, 0x2608287c, 0x0, 0x0)
    H:/go/src/pkg/bytes/buffer.go:228 +0x37
[fp=0xfd1630] bytes.(*Buffer).WriteRune(0x2008c01c, 0x7c, 0x0, 0x0, 0x0, ...)
    H:/go/src/pkg/bytes/buffer.go:239 +0x48
[fp=0xfd1690] encoding/csv.(*Reader).parseField(0x2008c000, 0x0, 0x81fffff, 0x0, 0x0,
...)

@gopherbot
Author

Comment 9 by gaijin.shenoy:

Dave,
That is what I suspected (missing newlines leading parseField to read a whole bunch
of characters). I wrote a small program to read through the file (using Scanner) and the
file looks normal, i.e., the max line length is 1118 bytes (total lines = 3,834,974).
In any case, at the point it crashed it had read approximately 1/7 of the file, i.e.,
about 370 MB. If the buffer is 272 MB at that point, it very much looks like a problem in
the field-buffer memory management.
Given the size of the file (it is a database export), I can't easily open it in an
editor to take a look at it. Do you have any suggestions as to how I could check for
data corruption?
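
The line-length check described above might look roughly like this (a sketch; the file name is a placeholder, and bufio.Scanner's default 64 KB line limit is comfortably above the ~1.1 KB lines reported here):

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
)

func main() {
	f, err := os.Open("export.csv") // hypothetical file name
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	maxLen, lines := 0, 0
	for sc.Scan() {
		lines++
		if l := len(sc.Bytes()); l > maxLen {
			maxLen = l
		}
	}
	if err := sc.Err(); err != nil {
		// e.g. bufio.ErrTooLong would indicate an absurdly long (unterminated) line
		log.Fatalf("scan stopped after line %d: %v", lines, err)
	}
	fmt.Printf("total lines = %d, max line length = %d bytes\n", lines, maxLen)
}
```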

@davecheney
Contributor

Comment 10:

The crudest solution would be to feed the whole file through csv.Reader and print every
line (or just the line number); when it stops, the line after the last one printed is the
malformed one.
I think you can make the reader more robust by avoiding TrailingComma and setting
FieldsPerRecord to the known value. I'd also recommend reading some of the other issues
for this package,
https://code.google.com/p/go/issues/list?can=1&q=TrailingComma&colspec=ID+Status+Stars+Priority+Owner+Reporter+Summary&cells=tiles
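
A sketch of that suggestion, with the '|' delimiter and the 175-field count taken from the original report (the file name is a placeholder); TrailingComma is simply left at its default, and fixing FieldsPerRecord makes a record with the wrong number of fields fail immediately with its record number:

```go
package main

import (
	"encoding/csv"
	"io"
	"log"
	"os"
)

func main() {
	f, err := os.Open("export.csv") // hypothetical file name
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	r := csv.NewReader(f)
	r.Comma = '|'           // assumed delimiter
	r.FieldsPerRecord = 175 // field count from the report; mismatches become errors
	for line := 1; ; line++ {
		if _, err := r.Read(); err != nil {
			if err == io.EOF {
				log.Printf("parsed %d records cleanly", line-1)
				return
			}
			log.Fatalf("record %d is the first malformed one: %v", line, err)
		}
	}
}
```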

@gopherbot
Author

Comment 11 by gaijin.shenoy:

Dave,
I looked at it a bit more, and this seems to be due to the LazyQuotes parsing logic;
i.e., the root cause is the same as in issue #3150
(https://golang.org/issue/3150).
Essentially, the parser does not treat field values with quotes correctly. It expects
either the entire field to be quoted or none of it. In other words, it seems that you
cannot encode the following line (delimited by '|') as a single value -
|He said - "what?".|
If LazyQuotes is false, this is flagged as a 'BareQuote' error. If LazyQuotes is true,
this sucks in following lines into a huge string. This was what was causing the buffer
to grow.
So either |He said - what?.| or |"He said - what?."| would be parsed ok.
The quote handling, even if broken, should terminate at the end of the line at most. This
would give the wrong number of fields for the line, but at least it wouldn't go nuts doing
memory allocations. Ideally, however, it should accept quoted substrings - I'll try to fix
the code in question and provide a patch ...
Cheers
Sudhir
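
A small demonstration of the two modes on a field with an embedded quote, using the '|' delimiter from the sample data (exact error messages and lazy-mode behaviour vary between Go versions):

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

func main() {
	const line = `He said - "what?".|second field` + "\n"

	// Strict mode: the embedded quote in a non-quoted field is rejected.
	strict := csv.NewReader(strings.NewReader(line))
	strict.Comma = '|'
	_, err := strict.Read()
	fmt.Println("LazyQuotes=false:", err) // a bare-quote parse error

	// Lazy mode: the quote is accepted as part of the field's text.
	lazy := csv.NewReader(strings.NewReader(line))
	lazy.Comma = '|'
	lazy.LazyQuotes = true
	rec, err := lazy.Read()
	fmt.Println("LazyQuotes=true:", rec, err)
}
```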

@davecheney
Contributor

Comment 12:

Labels changed: added priority-later, removed priority-triage.

Status changed to Started.

@gopherbot
Author

Comment 13 by gaijin.shenoy:

Sorry, my last example was wrong. The case that is not handled is when the first
character of the field is a quote, e.g., |"What?" - he said.|. In this case the second
quote is taken to be a bare quote, and the function will suck in all characters until,
many lines later, it finds another quote.
Attached is a small patch that fixes this problem (only) when LazyQuotes is set to true.
After reading a quote in the middle of a field, if the next character isn't a comma or
newline, the parser starts processing the rest as a normal field (ignoring any further
quotes). The quote read in the middle is written out to the field as a normal character.
It would be desirable to rewrite the buffer at this point to include the initial quote,
but the attached fix does not do that; i.e., |"What?" he said| will be returned as
|What?" he said|.
HTH
Sudhir

Attachments:

  1. diff.out (807 bytes)
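
For illustration, a sketch of the leading-quote case described above (how much input the lazy parse consumes has varied between Go versions, which is exactly the behaviour at issue here):

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

func main() {
	// A field whose first character is a quote, followed by more lines with
	// no quotes at all - the shape of input described in this comment.
	const input = `"What?" - he said.|field2
next1|next2
next3|next4
`

	strict := csv.NewReader(strings.NewReader(input))
	strict.Comma = '|'
	_, err := strict.Read()
	fmt.Println("LazyQuotes=false:", err) // a quote-related parse error

	lazy := csv.NewReader(strings.NewReader(input))
	lazy.Comma = '|'
	lazy.LazyQuotes = true
	rec, err := lazy.Read()
	if err != nil {
		fmt.Println("LazyQuotes=true:", err)
		return
	}
	// The opening quote starts a quoted field, so the parser keeps consuming,
	// across the following lines, looking for a closing quote; the remaining
	// input can end up inside a single field, which on a 2.6 GB file is what
	// exhausted the 32-bit address space.
	fmt.Printf("LazyQuotes=true: %d field(s), first field is %d bytes long\n",
		len(rec), len(rec[0]))
}
```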

@davecheney
Contributor

Comment 14:

Thank you for the patch. Please follow the contribution process documented here,
http://golang.org/doc/contribute.html.

@gopherbot
Author

Comment 15 by gaijin.shenoy:

OK.
Submitted CL 13659043 for code review

@rsc
Contributor

rsc commented Nov 27, 2013

Comment 16:

Labels changed: added go1.3maybe.

@rsc
Contributor

rsc commented Dec 4, 2013

Comment 18:

Labels changed: added release-none, removed go1.3maybe.

@rsc
Contributor

rsc commented Dec 4, 2013

Comment 19:

Labels changed: added repo-main.

@gopherbot
Author

Comment 20:

CL https://golang.org/cl/13659043 references this issue.

@rsc rsc added this to the Unplanned milestone Apr 10, 2015
@rsc
Contributor

rsc commented Sep 26, 2016

This out-of-memory failure appears to have been caused by reading the entire input as one row, because of a bad quote in the input file. There's not much we can do about that: maybe there really is a 2.6 GB string in that field. This bad input was semantically indistinguishable from a very large valid input, and in general we don't try to harden APIs like this against very large inputs.

@rsc rsc closed this as completed Sep 26, 2016
@rsc rsc removed the Started label Sep 26, 2016
@golang golang locked and limited conversation to collaborators Sep 26, 2017