encoding/xml: Decoder allocates (and does not release) memory for every control <?...?> encountered #37211
Here's the issue: CharsetReader may reference the old reader, effectively allocating a linked list of references. This actually manifests itself in the tests, where downCaser is a struct. Fixing this requires getting hold of the underlying input stream so that the wrapping io.Reader can be thrown away. Unfortunately io.Reader doesn't support this, so this is a breaking change: CharsetReader must take something more general than io.Reader. Here's an ugly sketch of what needs to happen (the callback parameter is named label here so it doesn't shadow the charset package):

```go
package main

import (
	"bytes"
	"encoding/xml"
	"io"

	"golang.org/x/net/html/charset"
	"golang.org/x/text/transform"
)

func main() {
	var theXML []byte   // some byte slice of XML
	var parsed struct{} // whatever struct we want to parse into

	reader := bytes.NewReader(theXML)
	decoder := xml.NewDecoder(reader)

	// Implement a "flattening" reader.
	currentEncoding := "" // store the current encoding
	decoder.CharsetReader = func(label string, input io.Reader) (io.Reader, error) {
		if br, ok := input.(*bytes.Reader); ok {
			// First time wrapping the raw input: fine, wrap it!
			currentEncoding = label // remember the last encoding
			return charset.NewReaderLabel(label, br)
		}
		if _, ok := input.(*transform.Reader); ok {
			// Already wrapped: is it the same encoding?
			if label == currentEncoding {
				// Same encoding, don't have to do anything.
				return input, nil
			}
			// Other encoding: we can't access the underlying reader,
			// so we're screwed and have to wrap again.
			currentEncoding = label
			return charset.NewReaderLabel(label, input)
		}
		// OK, who knows what this is: wrap it.
		currentEncoding = label
		return charset.NewReaderLabel(label, input)
	}

	err := decoder.Decode(&parsed)
	_ = err
}
```
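To be clear about what the sketch buys us: because CharsetReader only ever sees an io.Reader, the callback has no reliable way to get back to the original input once it has been wrapped, so the best it can do is skip re-wrapping when the encoding hasn't changed. A real fix needs the decoder to either hand the callback something it can unwrap or drop its own reference to the superseded reader.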
Some final input if you are hitting this issue: the leak really only occurs if you allocate an xml.Decoder and keep that decoder around to decode an unbounded stream of bytes. Instead, just use xml.Unmarshal, or throw away your old decoder every now and then.
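A minimal sketch of that workaround, assuming the documents arrive as separate byte slices (the stand-in documents and struct below are made up for illustration): allocating a fresh decoder per document means the chain of charset readers it built up becomes unreachable as soon as the decoder does.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"log"

	"golang.org/x/net/html/charset"
)

func main() {
	// Two stand-in documents; in a real server these would arrive over time.
	docs := [][]byte{
		[]byte(`<?xml version="1.0" encoding="ISO-8859-1"?><doc><value>a</value></doc>`),
		[]byte(`<?xml version="1.0" encoding="ISO-8859-1"?><doc><value>b</value></doc>`),
	}
	for _, doc := range docs {
		var parsed struct {
			Value string `xml:"value"`
		}
		// A fresh decoder per document: once it goes out of scope, the
		// charset readers it accumulated can be garbage collected.
		dec := xml.NewDecoder(bytes.NewReader(doc))
		dec.CharsetReader = charset.NewReaderLabel
		if err := dec.Decode(&parsed); err != nil {
			log.Fatal(err)
		}
	}
}
```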
How acceptable is this? Any server which processes tokens one by one in a streaming fashion will suffer from this leak. This isn't a hypothetical scenario: there are StackOverflow examples using this mechanism (https://stackoverflow.com/questions/6002619/unmarshal-an-iso-8859-1-xml-input-in-go), and the docs for NewReaderLabel suggest using it as the CharsetReader for xml.Decoder.
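To make the affected pattern concrete, here is a rough sketch of such a streaming consumer (the consume function and the element handling are made up; the CharsetReader assignment is the usual NewReaderLabel wiring). Because the decoder lives for the whole stream, every encoding declaration it encounters stacks another charset reader on top of the old ones:

```go
package main

import (
	"encoding/xml"
	"io"
	"log"
	"strings"

	"golang.org/x/net/html/charset"
)

// consume is a hypothetical long-running token loop over a stream of XML.
func consume(stream io.Reader) error {
	dec := xml.NewDecoder(stream)
	dec.CharsetReader = charset.NewReaderLabel // handle non-UTF-8 input
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		// Every <?xml ... encoding="..." ?> declaration the decoder meets
		// makes it wrap the current reader in yet another charset reader,
		// and the superseded readers are never released.
		if se, ok := tok.(xml.StartElement); ok {
			_ = se.Name.Local // handle the element here
		}
	}
}

func main() {
	doc := `<?xml version="1.0" encoding="ISO-8859-1"?><feed><item/></feed>`
	if err := consume(strings.NewReader(doc)); err != nil {
		log.Fatal(err)
	}
}
```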
What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output

What did you do?

Parse an XML file that includes multiple

<?xml version="1.0" encoding="ISO-8859-1" ?>

control tags in the same file. For each encountered tag, a new reader will be created and will not be released (see below for details).
It is possible to trigger this with a degenerate XML file consisting of only repeated tags. As an example, I created a file of ~400 KB containing 9030 lines of:

<?xml version="1.0" encoding="ISO-8859-1" ?>
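For reference, a rough sketch of how such a degenerate input can be generated in memory and pushed through the decoder (the trailing root element is illustrative; the 9030 count matches the ~44 bytes per line × 9030 ≈ 400 KB figure above):

```go
package main

import (
	"encoding/xml"
	"strings"

	"golang.org/x/net/html/charset"
)

func main() {
	// Build the degenerate input: 9030 repeated encoding declarations,
	// roughly 44 bytes each, so about 400 KB in total.
	var sb strings.Builder
	for i := 0; i < 9030; i++ {
		sb.WriteString(`<?xml version="1.0" encoding="ISO-8859-1" ?>` + "\n")
	}
	sb.WriteString("<root/>") // illustrative root so the document ends cleanly

	dec := xml.NewDecoder(strings.NewReader(sb.String()))
	dec.CharsetReader = charset.NewReaderLabel

	// Walk the token stream; heap usage grows with every declaration,
	// because each one stacks a new charset reader that is kept alive.
	for {
		if _, err := dec.Token(); err != nil {
			break
		}
	}
}
```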
What did you expect to see?

Memory usage somewhat constant.

What did you see instead?

Memory usage grows for every <?...?> tag encountered. Profiling the heap with pprof, the problem appears to be that for every <?...?> tag encountered in xml.Decoder.rawToken, a new *charset.Reader is created through charset.NewReaderLabel. The decoder then switches to this new reader with xml.Decoder.switchToReader (and also wraps it in another *bufio.Reader). However, it looks like references to the previous readers (created when we encountered earlier <?...?> tags) are never garbage collected. As a result, parsing a large file with enough <?...?> tags can exhaust the system memory. As an example, running go tool pprof heap on the heap file produced above shows 44 MB of usage for the 9030-line example (which was ~400 KB in size).
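In case it helps with reproducing: a heap file like the one above can be written with runtime/pprof. A small helper along these lines (the writeHeapProfile name and the "heap" path are placeholders matching the go tool pprof heap invocation; it assumes the "log", "os", "runtime" and "runtime/pprof" imports and would be added to the reproduction program):

```go
// writeHeapProfile dumps a heap profile to path (e.g. "heap"), which can
// then be inspected with `go tool pprof heap`.
func writeHeapProfile(path string) {
	f, err := os.Create(path)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	runtime.GC() // run a GC so the profile reflects up-to-date statistics
	if err := pprof.WriteHeapProfile(f); err != nil {
		log.Fatal(err)
	}
}
```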