Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

encoding/csv: leading " in non-quoted field leads to error #24422

Closed
homanchou opened this issue Mar 16, 2018 · 6 comments
Closed

encoding/csv: leading " in non-quoted field leads to error #24422

homanchou opened this issue Mar 16, 2018 · 6 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.

Comments

@homanchou
Copy link

homanchou commented Mar 16, 2018

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

1.9.2

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

darwin
amd64

What did you do?

If possible, provide a recipe for reproducing the error.
A complete runnable program is good.
A link on play.golang.org is best.

https://play.golang.org/p/nrG_caEeYL7

A quote " that appears at the beginning of the field but not at the end of the field "Medaka Box" Stray Bushiroad Storage Box Collection Vol.50 leads to an erroneous error: record on line 2: wrong number of fields, even though this file has a uniform number of tab '\t' delimited fields for every row. This file parses correctly in Excel, OpenOffice, Numbers etc. The data comes from a legitimate source, generated tab separated file from an Amazon report.

If I turn lazy quotes off, then I get a different error: parse error on line 2, column 11: extraneous or missing " in quoted-field

What did you expect to see?

I expect not to see an error. A partially quoted field in a non-quoted field is a legit use case that other systems/software generate and can consume.

What did you see instead?

record on line 2: wrong number of fields

@gopherbot
Copy link

Change https://golang.org/cl/101075 mentions this issue: fixes #24422 leading double quote in non-quoted \t separated field

@ALTree ALTree changed the title encoding/csv leading " in non-quoted field leads to error encoding/csv: leading " in non-quoted field leads to error Mar 16, 2018
@ALTree
Copy link
Member

ALTree commented Mar 16, 2018

I believe this is a dup of #19019

cc @dsnet

@ALTree ALTree added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Mar 16, 2018
@homanchou
Copy link
Author

Issue #19019 is more about not having enough detail in an error, whereas I'm saying we should not have an error at all. I have updated https://play.golang.org/p/YFyJ7uoGcXW example. There are two fields in every row, therefore the error is erroneous.

@dsnet
Copy link
Member

dsnet commented Mar 16, 2018

This is working as intended. A leading quote in a field indicates that the field is quoted. In that situation, the parser must look for a terminating quote. In your example, there is no terminating quote before the next delimiter.

If you want to the CSV parser to treat fields as unescaped fields separated by some delimiter, then csv is not the right package. Most of the complexity comes from quoted-string parsing, and you are intentionally avoiding that functionality.

You are better off doing this yourself as this example:

in := `item-name	item-description	listing-id	seller-sku	price	quantity	open-date	image-url	item-is-marketplace	product-id-type	zshop-shipping-fee	item-note	item-condition	zshop-category1	zshop-browse-path	zshop-storefront-feature	asin1	asin2	asin3	will-ship-internationally	expedited-shipping	zshop-boldface	product-id	bid-for-featured-placement	add-delete	pending-quantity	fulfillment-channel	merchant-shipping-group
"Medaka Box" Stray Bushiroad Storage Box Collection Vol.50 [ Japan Import ]		0124O7XD4ZR	crdcase-bushi-50-medaka-box	14.88	2	24/01/2014 10:08:14 GMT		y	1		Official Item Packed and Shipped Carefully and Quickly	11				B008DQZ0DA						B008DQZ0DA			0	DEFAULT	Migrated Template\
`
for _, s := range strings.Split(in, "\n") {
	if i := strings.Index(s, "#"); i >= 0 {
		s = s[:i]
	}
	if s == "" {
		continue
	}
	records := strings.Split(s, "\t")
	fmt.Printf("%d %q\n", len(records), records)
}

@dsnet dsnet closed this as completed Mar 16, 2018
@homanchou
Copy link
Author

Under https://tools.ietf.org/html/rfc4180#section-2.

Each field may or may not be enclosed in double quotes (however
some programs, such as Microsoft Excel, do not use double quotes
at all).  If fields are not enclosed with double quotes, then
double quotes may not appear inside the fields.

Under that definition then a field which starts with a double quote but does not end with a double quote is not a valid CSV field. So under that strict definition, then yes, I would agree that [it works as intended] (for comma separated values).

However, the encoding/csv package already breaks with the strictest interpretation of the spec via LazyQuotes option. LazyQuotes = true allows double quotes in the field without the entire field being quoted, directly counter to this sentence:

 double quotes may not appear inside the fields.

Clearly the intention of the LazyQuote option is to support files from vendors that do not produce valid quoted fields. If the package allows the flexibility for use LazyQuotes and the flexibility to switch the delimiter to '\t', I would argue that developers would and should expect encoding/csv to be THE defacto standard package for handing csv files and for handling other delimited files by changing the r.Comma rune. Other CSV parsing libs in other languages, such as ruby, can handle delimiting '\t' and do not run into this issue.

The error message generated here https://play.golang.org/p/YFyJ7uoGcXW just makes no sense, (clearly the number of fields is uniform between lines) is unintuitive and wastes developer time. It should be handled.

The CSV spec is silent on using other delimiters. But the package here not only allows alternate delimiters, it also allows flexibility for breaking out of strictly enclosed quoted fields so we should handle this case correctly.

The necessity of using quoted fields at all is partial to using commas as a delimiter in the first place.
But when switching to a tab delimiter, many problems are avoided: commas to do need to be enclosed with double quotes, and double quotes do not need to be escaped since we do not require quoted fields.

If the package advertises that it is flexible to break the spec as a pragmatic consideration to parsing quotes in non-quoted fields and flexible to switching to a non-comma delimiter then I think this issue lies in it's domain.

@dsnet
Copy link
Member

dsnet commented Mar 16, 2018

The paragraph you cite says:

Each field may or may not be enclosed in double quotes

This says nothing about what is a quoted field or not.

The ABNF grammar is as such:

file = [header CRLF] record *(CRLF record) [CRLF]
header = name *(COMMA name)
record = field *(COMMA field)
name = field
field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA
COMMA = %x2C
CR = %x0D ;as per section 6.1 of RFC 2234 [2]
DQUOTE = %x22 ;as per section 6.1 of RFC 2234 [2]
LF = %x0A ;as per section 6.1 of RFC 2234 [2]
CRLF = CR LF ;as per section 6.1 of RFC 2234 [2]
TEXTDATA = %x20-21 / %x23-2B / %x2D-7E

Note that a field may either be escaped or non-escaped. The grammar for escaped insists that it starts and ends with a double quote.

Let's look at an example. The problematic field in question in your data is: "Medaka Box" Stray Bushiroad Storage Box Collection Vol.50 [ Japan Import ]. This field starts with a quote, so it is by definition a quoted field, which must terminate with a quote, but it doesn't.

In order to be valid it would have to be:
"""Medaka Box"" Stray Bushiroad Storage Box Collection Vol.50 [ Japan Import ]" or
""Medaka Box" Stray Bushiroad Storage Box Collection Vol.50 [ Japan Import ]" if LazyQuotes is enabled.

@golang golang locked and limited conversation to collaborators Mar 16, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

4 participants