Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

encoding/xml: add Decoder.InputPos #45628

Closed
pgundlach opened this issue Apr 19, 2021 · 24 comments
Closed

encoding/xml: add Decoder.InputPos #45628

pgundlach opened this issue Apr 19, 2021 · 24 comments
Labels
early-in-cycle A change that should be done early in the 3 month dev cycle. FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. Proposal Proposal-Accepted Proposal-FinalCommentPeriod
Milestone

Comments

@pgundlach
Copy link
Contributor

pgundlach commented Apr 19, 2021

What version of Go are you using (go version)?

$ go version
go version go1.16 darwin/amd64

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/var/folders/md/l2nnr5490tq114003qtxfnk40000gn/T//gocache"
GOENV="/Users/patrick/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/patrick/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/patrick/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/Cellar/go/1.16/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.16/libexec/pkg/tool/darwin_amd64"
GOVCS=""
GOVERSION="go1.16"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/patrick/prog/go/xml/xmltolua/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/md/l2nnr5490tq114003qtxfnk40000gn/T/go-build1589123002=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I use the package encoding/xml for a programming language similar to XSLT. Now I'd like to provide user feedback if something is semantically wrong with the input. For example when the user has forgotten to initialise a variable. In that case the XML syntax is correct.

I want to tell the user "you have a problem in line xyz" but I cannot get the current line from the XML decoder. Thus I propose to add a function to the XML package similar to func (d *Decoder) InputOffset() int64 { return d.offset }:

// InputLine retuns the line of the current decoder position.
// The line gives the location of the end of the most recently returned token.
func (d *Decoder) InputLine() int {
	return d.line
}

where a test case would be something like this:

func TestInputLineNum(t *testing.T) {
	testInput := "<P>Fo\no<P>\n<P>Bar</P>\n"
	linenumbers := []int{1, 2, 2, 3, 3, 3, 3, 4}
	d := NewDecoder(strings.NewReader(testInput))
	var err error
	for c := 0; c < len(linenumbers); c++ {
		_, err = d.Token()
		if err != nil {
			t.Errorf("expected no error on d.Token()")
		}
		if have, want := d.InputLine(), linenumbers[c]; have != want {
			t.Errorf("d.InputLine() = %d, want %d (c = %d)", have, want, c)
		}
	}
}

I'd be happy to provide a pull request, but since this function is so trivial, I just post it here as well.

@seankhliao
Copy link
Member

duplicate of #43589 ?

@pgundlach
Copy link
Contributor Author

@seankhliao my bad. I think I should close this and create a pull request for #43589?

@seankhliao
Copy link
Member

considering this is new API and similar to a recent change to encoding/csv, lets put this through the poposal process

I'll close the other one as this one is more fleshed out.

cc @rsc via OWNERS

@seankhliao seankhliao changed the title encoding/xml: proposal to expose decoder line number proposal: encoding/xml: expose decoder line number Apr 19, 2021
@gopherbot gopherbot added this to the Proposal milestone Apr 19, 2021
@pgundlach
Copy link
Contributor Author

I have added the code to Gerrit at https://go-review.googlesource.com/c/go/+/311270

@gopherbot
Copy link

Change https://golang.org/cl/311270 mentions this issue: encoding/xml: expose decoder line number

@ianlancetaylor ianlancetaylor added this to Incoming in Proposals (old) Apr 19, 2021
@rsc
Copy link
Contributor

rsc commented Apr 21, 2021

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc rsc moved this from Incoming to Active in Proposals (old) Apr 21, 2021
@rsc
Copy link
Contributor

rsc commented Apr 28, 2021

Over on #44221, we started with line number but ended up returning (line, column int) where column is 1-based and counts bytes (not runes). It seems like probably the same thing would be worthwhile here? So should the method be

(*Decoder) InputPos() (line, column int)

?

@pgundlach
Copy link
Contributor Author

That makes more sense than just returning the line. Good point.

@pgundlach
Copy link
Contributor Author

@rsc Not sure if I can come up with a decent implementation for the column value though. I could increase a column counter every time a byte is read and set it to 0 when \n is encountered, but that does not work at (d *Decoder) ungetc() when a \n is unread.

Otherwise one could read backwards when calling the new function InputPos until the beginning of a line is found, but I am uncertain how to deal with reading backwards on an io.ByteReader.

@ianlancetaylor
Copy link
Contributor

ianlancetaylor commented Apr 28, 2021

Perhaps you can record the byte offset of the start of the line (using the Decoder.offset field) and then compute the column number by subtracting that from the current offset.

@pgundlach
Copy link
Contributor Author

I mistakenly assumed that ungetc() can unget more than one byte and is not immediately followed by getc(). So the implementation is rather simple.

I have updated the code on Gerrit.

@rsc rsc changed the title proposal: encoding/xml: expose decoder line number proposal: encoding/xml: add Decoder.InputPos May 5, 2021
@rsc
Copy link
Contributor

rsc commented May 5, 2021

Based on the discussion above, this proposal seems like a likely accept.
— rsc for the proposal review group

@rsc rsc moved this from Active to Likely Accept in Proposals (old) May 5, 2021
@dsnet
Copy link
Member

dsnet commented May 5, 2021

FWIW, I was against adding line-tracking logic to encoding/json on the basis that it would non-trivially slow down the common case. See #43513 (comment).

I don't think #44221 can serve as existing precedence for why this should be added. The CSV format fundamentally requires tracking where the new lines occur since newlines are the delimiter between records. In contrast, neither JSON nor XML have grammars where newlines have semantic significance beyond any other whitespace characters.

@pgundlach
Copy link
Contributor Author

pgundlach commented May 6, 2021

@dsnet The line numbering is already there. So the only addition is the simple counting of the offset which is IMO a very trivial (in the sense of computation complexity / computation time) operation.

While you are surely correct that the line number is not inherently used for grammar purposes, it is very helpful to give feedback to the user application in the case of an error. And this does not necessarily have to be a syntax error.

(edit:)

I also don't think that #44221 is the reason to add the InputPos. It merely suggests a function signature to be used for the reason of symmetry.

@dsnet
Copy link
Member

dsnet commented May 6, 2021

The line numbering is already there.

@pgundlach, you are correct. It seems that line information already leaks into the API through the SyntaxError.Line field. In other words, the xml package has already committed to taking on the computational cost of counting lines.

it is very helpful to give feedback to the user application in the case of an error. And this does not necessarily have to be a syntax error.

Line information is useful, but performance is also a good goal. Regardless, the performance ship has already sailed as you pointed out. So my argument no longer has basis in the context of xml.

@pgundlach
Copy link
Contributor Author

@dsnet Of course it will add some operations to the overall process, but when I read "the performance ship has already sailed" I get the impression that you want to say that line counting will dramatically slow down the processing. My (quick and not representative) benchmarks show me rather similar results with and without line counting and with column counting added in.

goos: darwin
goarch: amd64
pkg: xml
cpu: Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
current state: with line number counting
BenchmarkLineno-4   	    1258	    957099 ns/op	  112704 B/op	    3119 allocs/op
BenchmarkLineno-4   	    1117	    996439 ns/op	  112704 B/op	    3119 allocs/op
BenchmarkLineno-4   	    1162	    979260 ns/op	  112704 B/op	    3119 allocs/op
BenchmarkLineno-4   	    1153	   1029538 ns/op	  112704 B/op	    3119 allocs/op
BenchmarkLineno-4   	    1138	    997193 ns/op	  112704 B/op	    3119 allocs/op

without line number counting
BenchmarkLineno-4   	    1285	    935848 ns/op	  112704 B/op	    3119 allocs/op
BenchmarkLineno-4   	    1178	    988658 ns/op	  112704 B/op	    3119 allocs/op
BenchmarkLineno-4   	    1136	    970551 ns/op	  112704 B/op	    3119 allocs/op
BenchmarkLineno-4   	    1213	    966737 ns/op	  112704 B/op	    3119 allocs/op
BenchmarkLineno-4   	    1254	    952503 ns/op	  112704 B/op	    3119 allocs/op
BenchmarkLineno-4   	    1266	    980071 ns/op	  112704 B/op	    3119 allocs/op
BenchmarkLineno-4   	    1198	    930091 ns/op	  112704 B/op	    3119 allocs/op
BenchmarkLineno-4   	    1252	    922722 ns/op	  112704 B/op	    3119 allocs/op
BenchmarkLineno-4   	    1203	    928346 ns/op	  112704 B/op	    3119 allocs/op

with line number and offset counter
BenchmarkLineno-4   	    1254	    947150 ns/op	  112704 B/op	    3119 allocs/op
BenchmarkLineno-4   	    1198	    945277 ns/op	  112704 B/op	    3119 allocs/op
BenchmarkLineno-4   	    1117	   1011851 ns/op	  112705 B/op	    3119 allocs/op
BenchmarkLineno-4   	    1234	    949347 ns/op	  112704 B/op	    3119 allocs/op
BenchmarkLineno-4   	    1198	   1043049 ns/op	  112704 B/op	    3119 allocs/op
BenchmarkLineno-4   	    1152	   1004838 ns/op	  112704 B/op	    3119 allocs/op

All numbers are within a range where I can't tell if the difference is due to the algorithm or due to some other things happening on my computer right now. It would need better benchmarks to see how much time these things add to XML parsing.

Wich is what I' expect because

	if b == '\n' {
		d.line++
	}

is absolutely negligible in terms of computation time compared to the rest of the code.

@dsnet
Copy link
Member

dsnet commented May 6, 2021

The existing parsers in both the xml and json packages are fairly slow. Thus, adding if b == '\n' { d.line++ } today is a negligible cost. In the case of JSON, when I heavily optimized the parser to be an order of magnitude faster, the line counting logic imposed a non-trivial performance cost (5-15% if I recall).

@pgundlach
Copy link
Contributor Author

@dsnet ok, fair enough, good point.

@rsc
Copy link
Contributor

rsc commented May 12, 2021

I'm skeptical that the counting fundamentally can't be done efficiently even in a faster JSON. But that's not the issue today.

@rsc rsc moved this from Likely Accept to Accepted in Proposals (old) May 12, 2021
@rsc
Copy link
Contributor

rsc commented May 12, 2021

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— rsc for the proposal review group

@rsc rsc changed the title proposal: encoding/xml: add Decoder.InputPos encoding/xml: add Decoder.InputPos May 12, 2021
@rsc rsc modified the milestones: Proposal, Backlog May 12, 2021
@pgundlach
Copy link
Contributor Author

Since this is my first Go proposal, may I humbly ask what would be the next steps? I assume that there should be an implementation of the proposal, so I allow myself to link to the code I have put on the code review: https://go-review.googlesource.com/c/go/+/311270 - this is just an ad hoc implementation and I don't know if it meets the Go standards for inclusion in a standard library.

@rsc
Copy link
Contributor

rsc commented May 13, 2021

@pgundlach, your CL looks fine. We are already in the feature freeze for Go 1.17, so we will wait to review it when Go 1.18 work begins in August. Thanks!

@rsc rsc modified the milestones: Backlog, Go1.18 May 13, 2021
@rsc rsc added the NeedsFix The path to resolution is known, but the work has not been done. label May 13, 2021
pgundlach added a commit to speedata/publisher that referenced this issue May 25, 2021
This includes a copy of Go's XML processing library until the build
system has Go version 1.8.

See golang/go#45628
@odeke-em
Copy link
Member

Thank you @pgundlach for this issue and congrats on having your proposal accepted, it is also awesome to have you as a new Go contributor! Unfortunately we didn't get to look at your CL during the Go1.18 cycle, and we are already too deep into the freeze. However, I have left comments on your CL, and I shall annotate this as for early Go1.19. Apologies again, and we greatly appreciate your contributions and I am excited for your change to land in Go1.19, soon :-)

Thank you also Russ, Sean, Joe, Ian and the proposal review committee for the discussions!

@odeke-em odeke-em added the early-in-cycle A change that should be done early in the 3 month dev cycle. label Nov 26, 2021
@odeke-em odeke-em modified the milestones: Go1.18, Go1.19 Nov 26, 2021
@gopherbot
Copy link

This issue is currently labeled as early-in-cycle for Go 1.19.
That time is now, so a friendly reminder to look at it again.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
early-in-cycle A change that should be done early in the 3 month dev cycle. FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. Proposal Proposal-Accepted Proposal-FinalCommentPeriod
Projects
No open projects
Development

No branches or pull requests

7 participants