Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

encoding/csv: incorrect line:column for error #22352

Closed
dsnet opened this issue Oct 20, 2017 · 11 comments
Closed

encoding/csv: incorrect line:column for error #22352

dsnet opened this issue Oct 20, 2017 · 11 comments
Milestone

Comments

@dsnet
Copy link
Member

dsnet commented Oct 20, 2017

Consider the following snippet:

r := csv.NewReader(strings.NewReader(`
"one","two","three","
x
x
x
x
123456789123456789123456789123456789123456789123456789123456789123456789123456789123456789123456789123456789123456789123456789123456789123456789123456789123456789123456789123456789123456789"x`))
_, err := r.Read()
fmt.Println(err)

On tip, this prints line 2, column 189: extraneous " in field. Notice that the column number is absolutely bogus? There is no column 189 on line 2. This is a result of #19019 where reporting the start of the problematic line is more useful than far away when we fail to find the matching quote.

The solution is to either simply report 0 as the column to signify the start of the line, or to report the column of where the bad string actually started.

\cc @nussjustin

@dsnet dsnet self-assigned this Oct 20, 2017
@dsnet dsnet added the NeedsFix The path to resolution is known, but the work has not been done. label Oct 20, 2017
@dsnet dsnet added this to the Go1.10 milestone Oct 20, 2017
@dsnet
Copy link
Member Author

dsnet commented Oct 20, 2017

I'm planning on just deprecating the ParseError.Column field. It's meaning is sometimes ambiguous and it is a hit to performance to even compute it. At least one other error ErrFieldCount has no use for the column.

@gopherbot
Copy link

Change https://golang.org/cl/72050 mentions this issue: encoding/csv: deprecate ParseError.Column

@ianlancetaylor ianlancetaylor added NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. and removed NeedsFix The path to resolution is known, but the work has not been done. labels Oct 20, 2017
@ianlancetaylor
Copy link
Contributor

I'm kicking this over to NeedsDecision. Anybody want to weigh in on whether deprecating Column is the right move? Looks like the field dates back to the initial implementation of the package by @pborman in https://golang.org/cl/4629085.

@pborman
Copy link
Contributor

pborman commented Oct 20, 2017

I think it would be unfortunate to deprecate this field. Being told a " is wrong in a line with 23 "'s might be difficult to find.

@nussjustin
Copy link
Contributor

nussjustin commented Oct 20, 2017

As commented on the CL:

Maybe it would make sense to replace (deprecate in favor of) ParseError.Column with a new field that contains the current record number (basically the current len(r.fieldIndexes)+1)? Unlike the column, the field number should always be meaningful, even for ErrFieldCount (where it would be the actual len of the record).

@pborman
Copy link
Contributor

pborman commented Oct 20, 2017

Knowing the field number is nice, but it still requires quite a bit of parsing for the human reading the error. For CSV files with few fields and simple values this is not an issue, but for CSV files with a large number of fields or longish fields this probably doesn't help a lot.

@dsnet
Copy link
Member Author

dsnet commented Oct 20, 2017

When debugging there are several pieces of information that is helpful:

  1. The line that the record started on
  2. The line:column that the problem field started on
  3. The line:column where the problem was actually detected

Note that all 3 pieces of information are distinct. In go1.9 and before, it used approach 3. However, with badly quoted strings, this is untenable since what the parser considers the problem is sometimes very far from what a human would consider the problem (which is the start of the record or field). As a result of #19019, the error was changed to be 1, in which case the column number is already senseless.

A better error message would combine information in 1 and 3.

Imagine the following:

  • record on line 2: wrong number of fields
  • parse error on line 5, column 32: extraneous " in field if starting line and error line are equal
  • record on line 2; parse error at line 5, column 32: extraneous " in field

Thoughts on this? While information in 2 is useful, it is also not performance friendly to keep that information around.

@nussjustin
Copy link
Contributor

I agree that ParseError.Column is nice and somewhat I would expect from an error when reading a CSV file, but I also agree with @dsnet that the error he got makes no sense and it's not the only case where Column doesn't make sense (ErrFieldCount comes to my mind, at least it't not really what most people would expect from my experience)

Even if we keep ParseError.Column I'd like to see a field with the current field index for the simpler cases where you don't have that many columns and/or long fields, ParseError.Column isn't that helpful or you are just trying to find the error in a tool like Excel (which many people do) where you can more easily jump to a specific column (although this wouldn't work for all errors).

@nussjustin
Copy link
Contributor

Didn't see @dsnet's comment pop up before submitting mine, so here I go again:

I like that idea. CL 52830 isn't even released yet so maybe we could just change it back to report the error line and add two new fields StartLine and StartColumn. This way existing uses of ParseError Line and Column wouldn't change at all from Go 1.9 to Go 1.10 (unless I'm missing something) and we would still get the information and nice error message.

@gopherbot
Copy link

Change https://golang.org/cl/72150 mentions this issue: encoding/csv: simplify and optimize Reader

gopherbot pushed a commit that referenced this issue Oct 20, 2017
The Reader implementation is slow because it operates on a rune-by-rune
basis via bufio.Reader.ReadRune. We speed this up by operating on entire
lines that we read from bufio.Reader.ReadSlice.

In order to ensure that we read the full line, we augment ReadSlice
in our Reader.readLine method to automatically expand the slice if
bufio.ErrBufferFull is every hit.

This change happens to fix #19410 because it no longer relies on
rune-by-rune parsing and only searches for the relevant delimiter rune.

In order to keep column accounting simple and consistent, this change
reverts parts of CL 52830.

This CL is an alternative to CL 36270 and builds on some of the ideas
from that change by Diogo Pinela.

name                                     old time/op    new time/op    delta
Read-8                                   3.12µs ± 1%    2.54µs ± 2%  -18.76%   (p=0.000 n=10+9)
ReadWithFieldsPerRecord-8                3.12µs ± 1%    2.53µs ± 1%  -18.91%    (p=0.000 n=9+9)
ReadWithoutFieldsPerRecord-8             3.13µs ± 0%    2.57µs ± 3%  -18.07%  (p=0.000 n=10+10)
ReadLargeFields-8                        52.3µs ± 1%     5.3µs ± 2%  -89.93%   (p=0.000 n=10+9)
ReadReuseRecord-8                        2.05µs ± 1%    1.40µs ± 1%  -31.48%   (p=0.000 n=10+9)
ReadReuseRecordWithFieldsPerRecord-8     2.05µs ± 1%    1.41µs ± 0%  -31.03%   (p=0.000 n=10+9)
ReadReuseRecordWithoutFieldsPerRecord-8  2.06µs ± 1%    1.40µs ± 1%  -31.70%   (p=0.000 n=9+10)
ReadReuseRecordLargeFields-8             50.9µs ± 0%     4.1µs ± 3%  -92.01%  (p=0.000 n=10+10)

name                                     old alloc/op   new alloc/op
Read-8                                       664B ± 0%      664B ± 0%
ReadWithFieldsPerRecord-8                    664B ± 0%      664B ± 0%
ReadWithoutFieldsPerRecord-8                 664B ± 0%      664B ± 0%
ReadLargeFields-8                          3.94kB ± 0%    3.94kB ± 0%
ReadReuseRecord-8                           24.0B ± 0%     24.0B ± 0%
ReadReuseRecordWithFieldsPerRecord-8        24.0B ± 0%     24.0B ± 0%
ReadReuseRecordWithoutFieldsPerRecord-8     24.0B ± 0%     24.0B ± 0%
ReadReuseRecordLargeFields-8               2.98kB ± 0%    2.98kB ± 0%

name                                     old allocs/op  new allocs/op
Read-8                                       18.0 ± 0%      18.0 ± 0%
ReadWithFieldsPerRecord-8                    18.0 ± 0%      18.0 ± 0%
ReadWithoutFieldsPerRecord-8                 18.0 ± 0%      18.0 ± 0%
ReadLargeFields-8                            24.0 ± 0%      24.0 ± 0%
ReadReuseRecord-8                            8.00 ± 0%      8.00 ± 0%
ReadReuseRecordWithFieldsPerRecord-8         8.00 ± 0%      8.00 ± 0%
ReadReuseRecordWithoutFieldsPerRecord-8      8.00 ± 0%      8.00 ± 0%
ReadReuseRecordLargeFields-8                 12.0 ± 0%      12.0 ± 0%

Updates #22352
Updates #19019
Fixes #16791
Fixes #19410

Change-Id: I31c27cfcc56880e6abac262f36c947179b550bbf
Reviewed-on: https://go-review.googlesource.com/72150
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Joe Tsai <thebrokentoaster@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@gopherbot
Copy link

Change https://golang.org/cl/72310 mentions this issue: encoding/csv: add ParseError.RecordLine

@dsnet dsnet removed the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Oct 21, 2017
@golang golang locked and limited conversation to collaborators Oct 21, 2018
@rsc rsc unassigned dsnet Jun 23, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants