Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

encoding/xml: Decoder allocates (and does not release) memory for every control <?...?> encountered #37211

Open
DrRibosome opened this issue Feb 13, 2020 · 4 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@DrRibosome
Copy link

What version of Go are you using (go version)?

$ go version
go version go1.13.7 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/jcrawford/.cache/go-build"
GOENV="/home/jcrawford/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/jcrawford/git/gowork"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/jcrawford/opt/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/jcrawford/opt/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/jcrawford/git/djtest/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build917252283=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Parse an xml file that includes multiple <?xml version="1.0" encoding="ISO-8859-1" ?> control tags in the same file.

For each encountered tag, a new reader will be created and will not be released (see below for details).

It is possible to trigger this with a degenerate xml consisting of only repeated tags. As an example, I created a file of ~400kb in size containing 9030 lines of:

$ head bad.xml 
<?xml version="1.0" encoding="ISO-8859-1" ?>
<?xml version="1.0" encoding="ISO-8859-1" ?>
<?xml version="1.0" encoding="ISO-8859-1" ?>
<?xml version="1.0" encoding="ISO-8859-1" ?>
package main

import (
	"encoding/xml"
	"golang.org/x/net/html/charset"
	"io"
	"os"
	"runtime/pprof"
)

func main() {
	path := "bad.xml"

	f, err := os.Open(path)
	if err != nil {
		panic(err)
	}

	dec := xml.NewDecoder(f)
	dec.CharsetReader = charset.NewReaderLabel

	type Unused struct{}

	for {
		var value Unused
		err = dec.Decode(&value)
		if err == io.EOF {
			break
		}
	}

	heapOut, err := os.Create("heap")
	if err != nil {
		panic(err)
	}
	pprof.WriteHeapProfile(heapOut)
}

What did you expect to see?

Memory usage somewhat constant

What did you see instead?

Memory usage grows for every <?...?> tag encountered.

Profiling the heap with pprof, the problem appears to be that for every <?...?> tag encountered in xml.Decoder.rawToken, a new *charset.Reader is created through charset.NewReaderLabel. The decoder then switches to this new reader with xml.Decoder.switchToReader (and also wraps it in another *bufio.Reader). However, it looks like references to the previous readers (created when we encountered previous <?...?> tags) are never garbage collected. As a result, parsing a large file with enough <?...?> can exhaust the system memory.

As an example, running go tool pprof heap on the heap file produced above shows 44 MB of usage for the 9030 line example (which was ~400kb in size)

(pprof) top
Showing nodes accounting for 44.67MB, 100% of 44.67MB total
Showing top 10 nodes out of 12
      flat  flat%   sum%        cum   cum%
   25.10MB 56.18% 56.18%    25.10MB 56.18%  golang.org/x/text/transform.NewReader
   19.58MB 43.82%   100%    19.58MB 43.82%  bufio.NewReaderSize
         0     0%   100%    19.58MB 43.82%  bufio.NewReader
         0     0%   100%    44.67MB   100%  encoding/xml.(*Decoder).Decode
         0     0%   100%    44.67MB   100%  encoding/xml.(*Decoder).DecodeElement
         0     0%   100%    44.67MB   100%  encoding/xml.(*Decoder).Token
         0     0%   100%    44.67MB   100%  encoding/xml.(*Decoder).rawToken
         0     0%   100%    19.58MB 43.82%  encoding/xml.(*Decoder).switchToReader
         0     0%   100%    44.67MB   100%  encoding/xml.(*Decoder).unmarshal
         0     0%   100%    25.10MB 56.18%  golang.org/x/net/html/charset.NewReaderLabel
@DrRibosome DrRibosome changed the title encoding/xml decoder creates and does not release bufio.Reader for every control <? encountered encoding/xml decoder creates and does not release bufio.Reader for every control <?...?> encountered Feb 13, 2020
@DrRibosome DrRibosome changed the title encoding/xml decoder creates and does not release bufio.Reader for every control <?...?> encountered encoding/xml decoder allocates (and does not release) memory for every control <?...?> encountered Feb 13, 2020
@DrRibosome DrRibosome changed the title encoding/xml decoder allocates (and does not release) memory for every control <?...?> encountered encoding/xml Decoder allocates (and does not release) memory for every control <?...?> encountered Feb 13, 2020
@dmitshur dmitshur changed the title encoding/xml Decoder allocates (and does not release) memory for every control <?...?> encountered encoding/xml: Decoder allocates (and does not release) memory for every control <?...?> encountered Feb 15, 2020
@dmitshur
Copy link
Contributor

/cc @rsc per owners.

@dmitshur dmitshur added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Feb 15, 2020
@dmitshur dmitshur added this to the Backlog milestone Feb 15, 2020
@js-sketch
Copy link

js-sketch commented Nov 29, 2021

				newr, err := d.CharsetReader(enc, d.r.(io.Reader))
 ....
 d.switchToReader(newr) // = d.r = newr

Here's the issue. CharsetReader may reference the old reader, effectively allocating a linked list of references. This issue manifests itself in the tests actually:

	d.CharsetReader = func(charset string, input io.Reader) (io.Reader, error) {
		if charset != "x-testing-uppercase" {
			t.Fatalf("unexpected charset %q", charset)
		}
                // Referencing the old reader!
		return &downCaser{t, input.(io.ByteReader)}, nil
	}

where downCaser is a struct.

To fix this requires acquiring the underlying input stream such that we can throw away the wrapping io.Reader. Unfortunately io.Reader doesn't support this.

So this is a breaking change, as CharsetReader must take something more general than io.Reader.

Here's an ugly sketch of what needs to happen:

func main() {
       theXml := // Some byte array
        parsed := // Whatever the struct we wanna parse into is
	reader := bytes.NewReader(theXml)
	decoder := xml.NewDecoder(reader)
        // Implement a "flattening" reader
        myCurrentEncoding := "" // Store current encoding
	decoder.CharsetReader = func(charset string, input io.Reader) (io.Reader, error) {
             if br, e := input.(bytes.Reader); e == nil {
                 // First time wrapping, fine, wrap it!
                 r, err := charset.NewReaderLabel(charset ,b)
                 myCurrentEncoding = charset // Remember last encoding
                 return r, err
             } else if tr, e := input.(transformer.Reader); e == nil {
                 // Already wrapped, is it the same encoding?
                 if charset == myCurrentEncoding {
                     // Same encoding, don't have to do anything
                     return input
                 } else {
                 // Other encoding, can't access underlying reader so we're screwed and have to wrap again
                     r, err := charset.NewReaderLabel(charset, input)
                     return r, err
                 }
             } else {
                    // OK, who knows
                     r, err := charset.NewReaderLabel(charset, input)
                     return r, err
             }
        }
	err := decoder.Decode(&parsed)
}

@js-sketch
Copy link

Some final input:

If you have this issue: This leak really only occurs if you allocate a xml.Decoder and keep that decoder to decode an infinite stream of bytes. Instead, just use xml.Unmarshal or throw away your old decoder once and again.

@js-sketch
Copy link

@rsc

How acceptable is this? Any server which processes tokens one by one in a streaming fashion will suffer from this leak. This isn't a hypothetical scenario.

There are StackOverflow examples using this mechanism: https://stackoverflow.com/questions/6002619/unmarshal-an-iso-8859-1-xml-input-in-go

The docs for NewReaderLabel suggests using this as CharsetReader for xml.Decoder.
Perhaps we can fix the circularity of references here by checking for an already wrapped NewReaderLabel*? This would solve the common case, at least. We can also add in documentation for how to develop CharsetReaders.

  • This would require adding in a new struct wrapping the transform.Reader, but it wouldn't break the public interface.

axispx added a commit to axispx/gopub that referenced this issue Nov 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

3 participants