Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

encoding/csv: unhelpful error location with badly quoted fields #19019

Closed
josharian opened this issue Feb 9, 2017 · 9 comments
Closed

encoding/csv: unhelpful error location with badly quoted fields #19019

josharian opened this issue Feb 9, 2017 · 9 comments

Comments

@josharian
Copy link
Contributor

I'm working with hundreds of 100,000+ line tsv files. They contain quotes in inappropriate places, so I am using LazyQuotes. However, somewhere in the guts of a few of the files, there are quotes at the beginning of a field, which breaks csv.Reader. That is #6287. When that happens, and LazyQuotes is on, the reported error location is at the very end of the file. See https://play.golang.org/p/y0jOObyBTV for an illustration of this.

However, if I turn LazyQuotes off, I get an error location on some other quotes that do no harm (with LazyQuotes on). I am left with an unreadable file that I cannot even fix manually, since there are some errors somewhere in hundreds of thousands of lines, but encoding/csv does not tell me where.

The error reported when LazyQuotes is on should probably at the beginning of the field in question if EOF is reached.

Also related: #3150.

I have some workarounds available. Just adding this to the list of encoding/csv quote issues. :)

@nussjustin
Copy link
Contributor

nussjustin commented Apr 28, 2017

For ErrFieldCount (which seems to be the problem child here) the column is always reported as 0 to make it clear that the whole record is the problem. I think the bigger problem here is the reported line. In the code from the playground link above, the Reader reads until EOF which is in line 5, but the record starts in line 3 and the other lines shouldn't even be part of the line. This could be solved by tracking the line a record started on and using this for errors instead or adding it as additional field to ParseError.

@nussjustin
Copy link
Contributor

/cc @bradfitz @pborman

@gopherbot
Copy link

Change https://golang.org/cl/52830 mentions this issue: encoding/csv: report line start line in errors

@nussjustin
Copy link
Contributor

I send a change which improves the issue described here by reporting the line where a record started, as proposed in my first comment. With this the Reader still reports column 0 for ErrFieldCount, but it should be much easier to find the problems in the files.

gopherbot pushed a commit that referenced this issue Aug 14, 2017
Errors returned by Reader contain the line where the Reader originally
encountered the error. This can be suboptimal since that line does not
always correspond with the line the current record/field started at.

This can easily happen with LazyQuotes as seen in #19019, but also
happens for example when a quoted fields has no closing quote and
the parser hits EOF before it finds another quote.

When this happens finding the erroneous field can be somewhat
complicated and time consuming, and in most cases it would be better to
report the line where the record started.

This change updates Reader to keep track of the line on which a record
begins and uses it for errors instead of the current line, making it
easier to find errors.

Although a user-visible change, this should have no impact on existing
code, since most users don't explicitly work with the line in the error
and probably already expect the new behaviour.

Updates #19019

Change-Id: Ic9bc70fad2651c69435d614d537e7a9266819b05
Reviewed-on: https://go-review.googlesource.com/52830
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@nussjustin
Copy link
Contributor

5d14ac7 was just merged, so with tip the code from the play ground example now reports

line 3, column 4: extraneous " in field
line 3, column 0: wrong number of fields in line

This seems much better to me, since it now reports the correct line. The column is still 0, but this is an explicit design choice and I think that in this case (ErrFieldCount) there isn't much value to the exact column when we already have the line.

@josharian
Copy link
Contributor Author

@nussjustin seems good enough to me. Thanks!

@gopherbot
Copy link

Change https://golang.org/cl/72050 mentions this issue: encoding/csv: deprecate ParseError.Column

@gopherbot
Copy link

Change https://golang.org/cl/72150 mentions this issue: encoding/csv: simplify and optimize Reader

gopherbot pushed a commit that referenced this issue Oct 20, 2017
The Reader implementation is slow because it operates on a rune-by-rune
basis via bufio.Reader.ReadRune. We speed this up by operating on entire
lines that we read from bufio.Reader.ReadSlice.

In order to ensure that we read the full line, we augment ReadSlice
in our Reader.readLine method to automatically expand the slice if
bufio.ErrBufferFull is every hit.

This change happens to fix #19410 because it no longer relies on
rune-by-rune parsing and only searches for the relevant delimiter rune.

In order to keep column accounting simple and consistent, this change
reverts parts of CL 52830.

This CL is an alternative to CL 36270 and builds on some of the ideas
from that change by Diogo Pinela.

name                                     old time/op    new time/op    delta
Read-8                                   3.12µs ± 1%    2.54µs ± 2%  -18.76%   (p=0.000 n=10+9)
ReadWithFieldsPerRecord-8                3.12µs ± 1%    2.53µs ± 1%  -18.91%    (p=0.000 n=9+9)
ReadWithoutFieldsPerRecord-8             3.13µs ± 0%    2.57µs ± 3%  -18.07%  (p=0.000 n=10+10)
ReadLargeFields-8                        52.3µs ± 1%     5.3µs ± 2%  -89.93%   (p=0.000 n=10+9)
ReadReuseRecord-8                        2.05µs ± 1%    1.40µs ± 1%  -31.48%   (p=0.000 n=10+9)
ReadReuseRecordWithFieldsPerRecord-8     2.05µs ± 1%    1.41µs ± 0%  -31.03%   (p=0.000 n=10+9)
ReadReuseRecordWithoutFieldsPerRecord-8  2.06µs ± 1%    1.40µs ± 1%  -31.70%   (p=0.000 n=9+10)
ReadReuseRecordLargeFields-8             50.9µs ± 0%     4.1µs ± 3%  -92.01%  (p=0.000 n=10+10)

name                                     old alloc/op   new alloc/op
Read-8                                       664B ± 0%      664B ± 0%
ReadWithFieldsPerRecord-8                    664B ± 0%      664B ± 0%
ReadWithoutFieldsPerRecord-8                 664B ± 0%      664B ± 0%
ReadLargeFields-8                          3.94kB ± 0%    3.94kB ± 0%
ReadReuseRecord-8                           24.0B ± 0%     24.0B ± 0%
ReadReuseRecordWithFieldsPerRecord-8        24.0B ± 0%     24.0B ± 0%
ReadReuseRecordWithoutFieldsPerRecord-8     24.0B ± 0%     24.0B ± 0%
ReadReuseRecordLargeFields-8               2.98kB ± 0%    2.98kB ± 0%

name                                     old allocs/op  new allocs/op
Read-8                                       18.0 ± 0%      18.0 ± 0%
ReadWithFieldsPerRecord-8                    18.0 ± 0%      18.0 ± 0%
ReadWithoutFieldsPerRecord-8                 18.0 ± 0%      18.0 ± 0%
ReadLargeFields-8                            24.0 ± 0%      24.0 ± 0%
ReadReuseRecord-8                            8.00 ± 0%      8.00 ± 0%
ReadReuseRecordWithFieldsPerRecord-8         8.00 ± 0%      8.00 ± 0%
ReadReuseRecordWithoutFieldsPerRecord-8      8.00 ± 0%      8.00 ± 0%
ReadReuseRecordLargeFields-8                 12.0 ± 0%      12.0 ± 0%

Updates #22352
Updates #19019
Fixes #16791
Fixes #19410

Change-Id: I31c27cfcc56880e6abac262f36c947179b550bbf
Reviewed-on: https://go-review.googlesource.com/72150
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Joe Tsai <thebrokentoaster@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@gopherbot
Copy link

Change https://golang.org/cl/72310 mentions this issue: encoding/csv: add ParseError.RecordLine

gopherbot pushed a commit that referenced this issue Oct 21, 2017
CL 72150 fixes #22352 by reverting the problematic parts of that CL
where the line number and column number were inconsistent with each other.
This CL adds back functionality to address the issue that CL 72150
was trying to solve in the first place. That is, it reports the starting
line of the record, so that users have a frame of reference to start with
when debugging what went wrong.

In the event of gnarly CSV files with multiline quoted strings, a parse
failure likely occurs somewhere between the start of the record and
the point where the parser finally detected an error.
Since ParserError.{Line,Column} reports where the *error* occurs, we
add a RecordLine field to report where the record starts.

Also take this time to cleanup and modernize TestRead.

Fixes #19019
Fixes #22352

Change-Id: I16cebf0b81922c35f75804c7073e9cddbfd11a04
Reviewed-on: https://go-review.googlesource.com/72310
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@golang golang locked and limited conversation to collaborators Oct 21, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants