encoding/xml: Decoder allocates (and does not release) memory for every control <?...?> encountered #37211

DrRibosome · 2020-02-13T14:44:34Z

What version of Go are you using (`go version`)?

$ go version
go version go1.13.7 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (`go env`)?

go env Output

$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/jcrawford/.cache/go-build"
GOENV="/home/jcrawford/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/jcrawford/git/gowork"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/jcrawford/opt/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/jcrawford/opt/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/jcrawford/git/djtest/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build917252283=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Parse an xml file that includes multiple <?xml version="1.0" encoding="ISO-8859-1" ?> control tags in the same file.

For each encountered tag, a new reader will be created and will not be released (see below for details).

It is possible to trigger this with a degenerate xml consisting of only repeated tags. As an example, I created a file of ~400kb in size containing 9030 lines of:

$ head bad.xml 
<?xml version="1.0" encoding="ISO-8859-1" ?>
<?xml version="1.0" encoding="ISO-8859-1" ?>
<?xml version="1.0" encoding="ISO-8859-1" ?>
<?xml version="1.0" encoding="ISO-8859-1" ?>

package main

import (
	"encoding/xml"
	"golang.org/x/net/html/charset"
	"io"
	"os"
	"runtime/pprof"
)

func main() {
	path := "bad.xml"

	f, err := os.Open(path)
	if err != nil {
		panic(err)
	}

	dec := xml.NewDecoder(f)
	dec.CharsetReader = charset.NewReaderLabel

	type Unused struct{}

	for {
		var value Unused
		err = dec.Decode(&value)
		if err == io.EOF {
			break
		}
	}

	heapOut, err := os.Create("heap")
	if err != nil {
		panic(err)
	}
	pprof.WriteHeapProfile(heapOut)
}

What did you expect to see?

Memory usage somewhat constant

What did you see instead?

Memory usage grows for every <?...?> tag encountered.

Profiling the heap with pprof, the problem appears to be that for every <?...?> tag encountered in xml.Decoder.rawToken, a new *charset.Reader is created through charset.NewReaderLabel. The decoder then switches to this new reader with xml.Decoder.switchToReader (and also wraps it in another *bufio.Reader). However, it looks like references to the previous readers (created when we encountered previous <?...?> tags) are never garbage collected. As a result, parsing a large file with enough <?...?> can exhaust the system memory.

As an example, running go tool pprof heap on the heap file produced above shows 44 MB of usage for the 9030 line example (which was ~400kb in size)

(pprof) top
Showing nodes accounting for 44.67MB, 100% of 44.67MB total
Showing top 10 nodes out of 12
      flat  flat%   sum%        cum   cum%
   25.10MB 56.18% 56.18%    25.10MB 56.18%  golang.org/x/text/transform.NewReader
   19.58MB 43.82%   100%    19.58MB 43.82%  bufio.NewReaderSize
         0     0%   100%    19.58MB 43.82%  bufio.NewReader
         0     0%   100%    44.67MB   100%  encoding/xml.(*Decoder).Decode
         0     0%   100%    44.67MB   100%  encoding/xml.(*Decoder).DecodeElement
         0     0%   100%    44.67MB   100%  encoding/xml.(*Decoder).Token
         0     0%   100%    44.67MB   100%  encoding/xml.(*Decoder).rawToken
         0     0%   100%    19.58MB 43.82%  encoding/xml.(*Decoder).switchToReader
         0     0%   100%    44.67MB   100%  encoding/xml.(*Decoder).unmarshal
         0     0%   100%    25.10MB 56.18%  golang.org/x/net/html/charset.NewReaderLabel

The text was updated successfully, but these errors were encountered:

dmitshur · 2020-02-15T01:31:32Z

/cc @rsc per owners.

js-sketch · 2021-11-29T08:56:56Z

				newr, err := d.CharsetReader(enc, d.r.(io.Reader))
 ....
 d.switchToReader(newr) // = d.r = newr

Here's the issue. CharsetReader may reference the old reader, effectively allocating a linked list of references. This issue manifests itself in the tests actually:

	d.CharsetReader = func(charset string, input io.Reader) (io.Reader, error) {
		if charset != "x-testing-uppercase" {
			t.Fatalf("unexpected charset %q", charset)
		}
                // Referencing the old reader!
		return &downCaser{t, input.(io.ByteReader)}, nil
	}

where downCaser is a struct.

To fix this requires acquiring the underlying input stream such that we can throw away the wrapping io.Reader. Unfortunately io.Reader doesn't support this.

So this is a breaking change, as CharsetReader must take something more general than io.Reader.

Here's an ugly sketch of what needs to happen:

func main() {
       theXml := // Some byte array
        parsed := // Whatever the struct we wanna parse into is
	reader := bytes.NewReader(theXml)
	decoder := xml.NewDecoder(reader)
        // Implement a "flattening" reader
        myCurrentEncoding := "" // Store current encoding
	decoder.CharsetReader = func(charset string, input io.Reader) (io.Reader, error) {
             if br, e := input.(bytes.Reader); e == nil {
                 // First time wrapping, fine, wrap it!
                 r, err := charset.NewReaderLabel(charset ,b)
                 myCurrentEncoding = charset // Remember last encoding
                 return r, err
             } else if tr, e := input.(transformer.Reader); e == nil {
                 // Already wrapped, is it the same encoding?
                 if charset == myCurrentEncoding {
                     // Same encoding, don't have to do anything
                     return input
                 } else {
                 // Other encoding, can't access underlying reader so we're screwed and have to wrap again
                     r, err := charset.NewReaderLabel(charset, input)
                     return r, err
                 }
             } else {
                    // OK, who knows
                     r, err := charset.NewReaderLabel(charset, input)
                     return r, err
             }
        }
	err := decoder.Decode(&parsed)
}

js-sketch · 2021-11-29T10:15:22Z

Some final input:

If you have this issue: This leak really only occurs if you allocate a xml.Decoder and keep that decoder to decode an infinite stream of bytes. Instead, just use xml.Unmarshal or throw away your old decoder once and again.

js-sketch · 2021-11-30T14:57:42Z

@rsc

How acceptable is this? Any server which processes tokens one by one in a streaming fashion will suffer from this leak. This isn't a hypothetical scenario.

There are StackOverflow examples using this mechanism: https://stackoverflow.com/questions/6002619/unmarshal-an-iso-8859-1-xml-input-in-go

The docs for NewReaderLabel suggests using this as CharsetReader for xml.Decoder.
Perhaps we can fix the circularity of references here by checking for an already wrapped NewReaderLabel*? This would solve the common case, at least. We can also add in documentation for how to develop CharsetReaders.

This would require adding in a new struct wrapping the transform.Reader, but it wouldn't break the public interface.

See: golang/go#37211

DrRibosome changed the title ~~encoding/xml decoder creates and does not release bufio.Reader for every control <? encountered~~ encoding/xml decoder creates and does not release bufio.Reader for every control <?...?> encountered Feb 13, 2020

DrRibosome changed the title ~~encoding/xml decoder creates and does not release bufio.Reader for every control <?...?> encountered~~ encoding/xml decoder allocates (and does not release) memory for every control <?...?> encountered Feb 13, 2020

DrRibosome changed the title ~~encoding/xml decoder allocates (and does not release) memory for every control <?...?> encountered~~ encoding/xml Decoder allocates (and does not release) memory for every control <?...?> encountered Feb 13, 2020

dmitshur changed the title ~~encoding/xml Decoder allocates (and does not release) memory for every control <?...?> encountered~~ encoding/xml: Decoder allocates (and does not release) memory for every control <?...?> encountered Feb 15, 2020

dmitshur added the NeedsInvestigation label Feb 15, 2020

dmitshur added this to the Backlog milestone Feb 15, 2020

axispx added a commit to axispx/gopub that referenced this issue Nov 17, 2023

Reduce more allocations by not reusing the decoder

a9bf470

See: golang/go#37211

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

encoding/xml: Decoder allocates (and does not release) memory for every control <?...?> encountered #37211

encoding/xml: Decoder allocates (and does not release) memory for every control <?...?> encountered #37211

DrRibosome commented Feb 13, 2020

dmitshur commented Feb 15, 2020

js-sketch commented Nov 29, 2021 •

edited

Loading

js-sketch commented Nov 29, 2021

js-sketch commented Nov 30, 2021

encoding/xml: Decoder allocates (and does not release) memory for every control <?...?> encountered #37211

encoding/xml: Decoder allocates (and does not release) memory for every control <?...?> encountered #37211

Comments

DrRibosome commented Feb 13, 2020

What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (go env)?

What did you do?

What did you expect to see?

What did you see instead?

dmitshur commented Feb 15, 2020

js-sketch commented Nov 29, 2021 • edited Loading

js-sketch commented Nov 29, 2021

js-sketch commented Nov 30, 2021

What version of Go are you using (`go version`)?

What operating system and processor architecture are you using (`go env`)?

js-sketch commented Nov 29, 2021 •

edited

Loading