Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

encoding/xml: memoize names during decode #38332

Open
qmuntal opened this issue Apr 9, 2020 · 3 comments
Open

encoding/xml: memoize names during decode #38332

qmuntal opened this issue Apr 9, 2020 · 3 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@qmuntal
Copy link
Contributor

qmuntal commented Apr 9, 2020

Counterpart of #32779 for encoding/xml, but tracking it as a performance improvement instead of a proposal.

When an xml is decoded, many of the same element and attribute names appear over and over in the XML and get copied and reconstructed in the result over and over as well. This was previously partially discussed in #21823, where @nussjustin and @a-h proposed some good initial implementation:

// Get name: /first(first|second)*/
// Do not set d.err if the name is missing (unless unexpected EOF is received):
// let the caller provide better context.
func (d *Decoder) name() (s string, ok bool) {
	d.buf.Reset()
	if !d.readName() {
		return "", false
	}

	// Now we check the characters.
	b := d.buf.Bytes()
        if s, ok = d.names[string(b)]; ok {
		return s, ok
	}
	if !isName(b) {
		d.err = d.syntaxError("invalid XML name: " + string(b))
		return "", false
	}
        s = string(b)
        d.names[s] = s
	return s, true
}
@andybons andybons added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Apr 10, 2020
@andybons andybons added this to the Unplanned milestone Apr 10, 2020
@Goodwine
Copy link

I got this benchmark from applying this change as-is

name          old time/op    new time/op    delta
Unmarshal-16     611ms ± 3%     571ms ± 6%     ~     (p=0.056 n=5+5)

name          old alloc/op   new alloc/op   delta
Unmarshal-16     245MB ± 0%     239MB ± 0%   -2.77%  (p=0.016 n=4+5)

name          old allocs/op  new allocs/op  delta
Unmarshal-16     4.63M ± 0%     3.24M ± 0%  -30.00%  (p=0.008 n=5+5)

@Goodwine
Copy link

I don't understand why this is making a huge improvement, is it because returning a string doesn't copy the whole string but instead returns a reference to it?

I still see you do string(b) and yet memory is still lowered so I'm confused how this is an improvement, would it be because this string doesn't always escape to the heap?

Sorry if my questions sound dumb, I was told there was no dumb question :)

@nussjustin
Copy link
Contributor

@Goodwine string(b) allocates which is slow, but if you do it when accessing a map like s, ok = d.names[string(b)] Go is smart enough to avoid the allocation. So what's happening here is that new / unseen names get converted into a string once and put into the d.names map and the next time the same name is encountered, the map lookup will return the previously allocated string which is then used instead of allocating a new one, saving a slow allocation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Projects
None yet
Development

No branches or pull requests

4 participants