encoding/csv: "LazyQuotes" not lazy enough, need an "IgnoreQuotes" mode for highly messy csv input #3150

Closed
gopherbot opened this issue Feb 28, 2012 · 13 comments

@gopherbot

by philipp.schumann:

---What steps will reproduce the problem?---

1. Download and extract http://download.geonames.org/export/dump/allCountries.zip -- it
contains a single tab-separated file of about 940MB.

2. Set up a TSV reader in Go:

        tsv := csv.NewReader(txtFile)
        tsv.Comma = '\t'
        tsv.Comment = '#'
        tsv.LazyQuotes = true
        tsv.TrailingComma = true // retain rather than remove empty slots
        tsv.TrimLeadingSpace = false // retain rather than remove empty slots

3. Iterate through the records returned by tsv.Read() (after each read, set
tsv.FieldsPerRecord = 0; a complete sketch of this loop is shown below) until the file's
line 2293755, which begins like this:

3376027 "S" Falls   "S" Falls     4.533......
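
For reference, a minimal sketch of this loop, assuming the extracted file is named
allCountries.txt (the TrailingComma setting from step 2 is kept for fidelity, though later
Go releases deprecate and ignore it):

        // Minimal sketch of steps 2-3 above; allCountries.txt is the assumed
        // name of the file extracted from the zip.
        package main

        import (
                "encoding/csv"
                "fmt"
                "io"
                "log"
                "os"
        )

        func main() {
                txtFile, err := os.Open("allCountries.txt")
                if err != nil {
                        log.Fatal(err)
                }
                defer txtFile.Close()

                tsv := csv.NewReader(txtFile)
                tsv.Comma = '\t'
                tsv.Comment = '#'
                tsv.LazyQuotes = true
                tsv.TrailingComma = true     // retained from the report; deprecated and ignored in later Go releases
                tsv.TrimLeadingSpace = false // retain rather than remove empty slots

                for {
                        record, err := tsv.Read()
                        if err == io.EOF {
                                break
                        }
                        if err != nil {
                                log.Fatal(err)
                        }
                        tsv.FieldsPerRecord = 0 // allow the field count to vary between records
                        fmt.Println(len(record), record[0])
                }
        }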

---What is the expected result?---

With LazyQuotes set, the reader should return a slice of strings containing only this
line's tab-separated fields.

---What do you see instead?---

The reader packs all fields of the current line from field 2 onward (counting 0-based),
plus every subsequent line up to line 3043730 (record 6489131  B&B "a Casa di Griffi"
B&B "a Casa di Griffi"    ...), into a single string value roughly 91MB in size.

6g, weekly.2012-02-22, under openSUSE 12.1 64-bit.

NOTES: I'm guessing this is caused by some kind of quote-character mismatch, so the
behaviour may well stem from non-sanitized input data. However, such is the nature of
99.9% of real-world CSV files out there. If I have to run custom code to scan and
sanitize this 950MB file myself before feeding it to encoding/csv, I might as well parse
it manually in the first place. Ideally, the csv package would offer an "IgnoreQuotes"
mode for use-cases where I *know* there are no multi-line records and where I *know*
newlines and commas (or tabs, in this case) must take strict precedence over any quotes
in the data: those quotes should be taken as they are, because nobody ever bothered to
escape them properly and they are 100% part of the data, not record or field delimiters.
@robpike
Contributor

robpike commented Feb 28, 2012

Comment 1:

Labels changed: added priority-later, removed priority-triage.

Status changed to Accepted.

@rsc
Contributor

rsc commented Mar 12, 2013

Comment 2:

[The time for maybe has passed.]

@rsc
Contributor

rsc commented Nov 27, 2013

Comment 3:

Labels changed: added go1.3maybe.

@rsc
Contributor

rsc commented Dec 4, 2013

Comment 4:

Labels changed: added release-none, removed go1.3maybe.

@rsc
Contributor

rsc commented Dec 4, 2013

Comment 5:

Labels changed: added repo-main.

@gopherbot
Author

Comment 6:

CL https://golang.org/cl/13659043 references this issue.

@adg
Contributor

adg commented Aug 7, 2014

Comment 7:

We need to make a policy decision about how liberal we want the encoding/csv package to
be.

@gopherbot
Author

Comment 8 by Tom.Maiaroto:

In the meantime, I found this to be helpful: https://code.google.com/p/gocsv/  ... It's
tolerant of the quotes, and you also don't need to specify the separator or the number of
fields. I tested it with the Geonames cities data set, which is tab-separated.

@rsc rsc added this to the Unplanned milestone Apr 10, 2015
@nikharris0

I'm in favor of addressing this issue as well; I've run into it a couple of times now and
have had to put in messy workarounds.

One thought is that we could allow setting the quote rune (just like the Comma rune),
where zero would mean no quote rune, i.e. quotes are ignored as delimiters and treated as
value data.
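
A purely hypothetical sketch of how that might look; csv.Reader has no Quote field, so
this illustrates the suggested API rather than working code:

        r := csv.NewReader(f)
        r.Comma = '\t'
        r.Quote = 0 // hypothetical field: zero would mean no quote rune, so quotes stay as ordinary value data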

@gopherbot
Author

CL https://golang.org/cl/23281 mentions this issue.

@gopherbot
Author

CL https://golang.org/cl/23401 mentions this issue.

gopherbot pushed a commit that referenced this issue May 25, 2016
The intent of this comment is to reduce the number of issues opened
against the package to add support for new kinds of CSV formats, such as
issues #3150, #8458, #12372, #12755.

Change-Id: I452c0b748e4ca9ebde3e6cea188bf7774372148e
Reviewed-on: https://go-review.googlesource.com/23401
Reviewed-by: Andrew Gerrand <adg@golang.org>
@marceloboeira

Any updates on this?

@ianlancetaylor
Contributor

Since this issue was filed, we've decided that the standard library's encoding/csv package is going to focus only on RFC 4180, while retaining backward compatibility with what is already there. We aren't going to add new features to handle formats not described in RFC 4180. Instead, we encourage people to make their own copy of the library and modify it for their needs.

I'm sorry this doesn't help you, but the reasons for this decision are 1) there are many many many variants of CSV files, and writing one package that can handle all of them will give us a package so complex that it will be hard to use and impossible to test; 2) the code that reads a CSV file is less than 400 lines of Go code anyhow, and it's easy to tweak for whatever odd formats one encounters.

I'm going to close this old issue, since we aren't going to be implementing it.
