encoding/csv: "LazyQuotes" not lazy enough, need an "IgnoreQuotes" mode for highly messy csv input #3150

Closed
gopherbot opened this issue Feb 28, 2012 · 13 comments

@gopherbot

by philipp.schumann:

---What steps will reproduce the problem?---

1. Download and extract http://download.geonames.org/export/dump/allCountries.zip -- it
contains a single tab-separated file of about 940MB.

2. Set up a TSV reader in Go:

        tsv := csv.NewReader(txtFile)
        tsv.Comma = '\t'
        tsv.Comment = '#'
        tsv.LazyQuotes = true
        tsv.TrailingComma = true // retain rather than remove empty slots
        tsv.TrimLeadingSpace = false // retain rather than remove empty slots

3. Iterate through the records returned by tsv.Read() (after each read, set
tsv.FieldsPerRecord = 0; a complete sketch of this loop is shown below) until the file's
line 2293755, which begins like this:

3376027 "S" Falls   "S" Falls     4.533......
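
For reference, a minimal sketch of this loop, assuming the extracted file is named
allCountries.txt (the TrailingComma setting from step 2 is kept for fidelity, though later
Go releases deprecate and ignore it):

        // Minimal sketch of steps 2-3 above; allCountries.txt is the assumed
        // name of the file extracted from the zip.
        package main

        import (
                "encoding/csv"
                "fmt"
                "io"
                "log"
                "os"
        )

        func main() {
                txtFile, err := os.Open("allCountries.txt")
                if err != nil {
                        log.Fatal(err)
                }
                defer txtFile.Close()

                tsv := csv.NewReader(txtFile)
                tsv.Comma = '\t'
                tsv.Comment = '#'
                tsv.LazyQuotes = true
                tsv.TrailingComma = true     // retained from the report; deprecated and ignored in later Go releases
                tsv.TrimLeadingSpace = false // retain rather than remove empty slots

                for {
                        record, err := tsv.Read()
                        if err == io.EOF {
                                break
                        }
                        if err != nil {
                                log.Fatal(err)
                        }
                        tsv.FieldsPerRecord = 0 // allow the field count to vary between records
                        fmt.Println(len(record), record[0])
                }
        }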

---What is the expected result?---

With LazyQuotes set, the reader should return a slice of strings containing only this
line's tab-separated fields.

---What do you see instead?---

The reader packs all fields of the current line from field 2 onward (counting 0-based),
plus every subsequent line up to line 3043730 (record 6489131  B&B "a Casa di Griffi"
B&B "a Casa di Griffi"    ...), into a single string value roughly 91MB in size.

6g, weekly.2012-02-22, under openSUSE 12.1 64-bit.

NOTES: I'm guessing this is caused by some kind of quote-character mismatch, so the
behaviour may well stem from non-sanitized input data. However, such is the nature of
99.9% of real-world CSV files out there. If I have to run custom code to scan and
sanitize this 950MB file myself before feeding it to encoding/csv, I might as well parse
it manually in the first place. Ideally, the csv package would offer an "IgnoreQuotes"
mode for use-cases where I *know* there are no multi-line records and where I *know*
newlines and commas (or tabs, in this case) must take strict precedence over any quotes
in the data: those quotes should be taken as they are, because nobody ever bothered to
escape them properly and they are 100% part of the data, not record or field delimiters.
@robpike
Contributor

robpike commented Feb 28, 2012

Comment 1:

Labels changed: added priority-later, removed priority-triage.

Status changed to Accepted.

@rsc
Contributor

rsc commented Mar 12, 2013

Comment 2:

[The time for maybe has passed.]

@rsc
Contributor

rsc commented Nov 27, 2013

Comment 3:

Labels changed: added go1.3maybe.

@rsc
Contributor

rsc commented Dec 4, 2013

Comment 4:

Labels changed: added release-none, removed go1.3maybe.

@rsc
Contributor

rsc commented Dec 4, 2013

Comment 5:

Labels changed: added repo-main.

@gopherbot
Author

Comment 6:

CL https://golang.org/cl/13659043 references this issue.

@adg
Contributor

adg commented Aug 7, 2014

Comment 7:

We need to make a policy decision about how liberal we want the encoding/csv package to
be.

@gopherbot
Author

Comment 8 by Tom.Maiaroto:

In the meantime, I found this to be helpful: https://code.google.com/p/gocsv/  ... It's
tolerant of the quotes, and you also don't need to specify the separator or the number of
fields. I tested it with the Geonames cities data set, which is tab-separated.

@rsc rsc added this to the Unplanned milestone Apr 10, 2015
@nikharris0

I'm in favor of addressing this issue as well; I've run into it a couple of times now and
have had to put in messy workarounds.

One thought is that we could allow setting the quote rune (just like the Comma rune),
where zero would mean no quote rune, i.e. quotes are ignored as delimiters and treated as
value data.
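
A purely hypothetical sketch of how that might look; csv.Reader has no Quote field, so
this illustrates the suggested API rather than working code:

        r := csv.NewReader(f)
        r.Comma = '\t'
        r.Quote = 0 // hypothetical field: zero would mean no quote rune, so quotes stay as ordinary value data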

@gopherbot
Author

CL https://golang.org/cl/23281 mentions this issue.

@gopherbot
Author

CL https://golang.org/cl/23401 mentions this issue.

gopherbot pushed a commit that referenced this issue May 25, 2016
The intent of this comment is to reduce the number of issues opened
against the package to add support for new kinds of CSV formats, such as
issues #3150, #8458, #12372, #12755.

Change-Id: I452c0b748e4ca9ebde3e6cea188bf7774372148e
Reviewed-on: https://go-review.googlesource.com/23401
Reviewed-by: Andrew Gerrand <adg@golang.org>
@marceloboeira

Any updates on this?

@ianlancetaylor
Contributor

Since this issue was filed, we've decided that the standard library's encoding/csv package is going to focus only on RFC 4180, while retaining backward compatibility with what is already there. We aren't going to add new features to handle formats not described in RFC 4180. Instead, we encourage people to make their own copy of the library and modify it for their needs.

I'm sorry this doesn't help you, but the reasons for this decision are 1) there are many many many variants of CSV files, and writing one package that can handle all of them will give us a package so complex that it will be hard to use and impossible to test; 2) the code that reads a CSV file is less than 400 lines of Go code anyhow, and it's easy to tweak for whatever odd formats one encounters.

I'm going to close this old issue, since we aren't going to be implementing it.
