
x/net/http2: Memory leak regression introduced in window updates change #56315

Closed · LINKIWI opened this issue Oct 19, 2022 · 9 comments
Labels: FrozenDueToAge, NeedsInvestigation

@LINKIWI commented Oct 19, 2022

What version of Go are you using (go version)?

$ go version
go version go1.19.2 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE=""
GOENV=""
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH=""
GOPRIVATE=""
GOPROXY=""
GOROOT="/usr/lib/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.19.2"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/tmp/bug/go.mod"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build3528461531=/tmp/go-build -gno-record-gcc-switches"

What did you do?

After mass-upgrading dependencies in a proprietary HTTP + gRPC reverse proxy, I observed unbounded memory growth when serving gRPC streams.

The full data flow is:

gRPC client stream -> gRPC proxy -> gRPC service stream

The implicated component, which depends on x/net and exhibits the memory leak, is the gRPC proxy.

I bisected the behavior to 2e0b12c274b70e5e836cf4bb0cda99c44f61a4d1 from golang.org/x/net. Prior to this commit, memory usage while serving unidirectional (client -> server) streams remained constant. At this commit and later, memory usage grows consistently as long as the stream is active.

Please see #56315 (comment) for a minimally complete example of the data flow described above that demonstrates the bug.

The client logic is to repeatedly send messages to the server over a unidirectional stream, each about 32 KB in size. The server logic is to read the message and noop (discard the data).

I don't have enough knowledge of the HTTP/2 specification to do any investigation into 2e0b12c274b70e5e836cf4bb0cda99c44f61a4d1 itself, so I don't have an RCA.

What did you expect to see?

I expected to be able to serve the gRPC stream without unbounded memory growth.

What did you see instead?

I observe what appears to be a memory leak regression starting from this commit.

I verified this behavior by watching pmap -x $(pidof ...) over time for the good/bad commits identified by bisect against the same client logic.

After running the test workload:

Here is an example heap usage graph captured with pprof at version v0.0.0-20220920183852-bf014ff85ad5.
Here is an example heap usage graph captured with pprof at version v0.0.0-20221019024206-cb67ada4b0ad.

cc @neild

@dr2chase added the NeedsInvestigation label Oct 19, 2022
@neild (Contributor) commented Oct 19, 2022

https://go.dev/cl/432038 changes the frequency with which the HTTP/2 server sends window size updates to clients. Previously it would send updates to extend the stream window immediately upon receiving data; now it sends updates only when at least half the current window has been consumed. This reduces the frequency of window updates and should improve performance overall.
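
In rough terms, the idea is something like the sketch below. This is only an illustration of the batching behavior, not the actual x/net/http2 implementation; the type and names here are invented.

package main

import "fmt"

// streamWindow is a toy model of a per-stream receive window, illustrating
// the threshold behavior described above.
type streamWindow struct {
	max      int32 // configured receive window, e.g. MaxUploadBufferPerStream
	consumed int32 // bytes handed to the application but not yet credited back
}

// consume records n bytes read by the handler and returns the WINDOW_UPDATE
// increment to send, or 0 while less than half the window has been consumed.
func (w *streamWindow) consume(n int32) int32 {
	w.consumed += n
	if w.consumed < w.max/2 {
		return 0 // below the threshold: don't send an update yet
	}
	inc := w.consumed
	w.consumed = 0
	return inc
}

func main() {
	w := &streamWindow{max: 1 << 16}
	for i := 0; i < 5; i++ {
		if inc := w.consume(16 << 10); inc > 0 {
			fmt.Println("send WINDOW_UPDATE, increment =", inc)
		} else {
			fmt.Println("no update yet")
		}
	}
}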

Your heap profiles show that your heap is pretty much entirely consumed by HTTP/2 stream buffers, used to hold data being received or sent. Buffers are allocated as needed from a set of pools of varying size classes; the largest class is 16384 bytes.
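
As a rough illustration of that pooling scheme (the size classes below other than 16384 are assumptions, and this is a sketch of the idea, not the actual x/net/http2 code):

package main

import (
	"fmt"
	"sync"
)

// Assumed size classes, for illustration only; 16384 matches the largest
// class mentioned above, the smaller ones are guesses.
var sizeClasses = []int{1 << 10, 2 << 10, 4 << 10, 8 << 10, 16 << 10}

var pools = func() []*sync.Pool {
	ps := make([]*sync.Pool, len(sizeClasses))
	for i, size := range sizeClasses {
		size := size
		ps[i] = &sync.Pool{New: func() any { return make([]byte, size) }}
	}
	return ps
}()

// getChunk hands out a pooled buffer from the smallest class that fits n,
// falling back to the largest class for bigger requests.
func getChunk(n int) []byte {
	for i, class := range sizeClasses {
		if n <= class {
			return pools[i].Get().([]byte)
		}
	}
	return pools[len(pools)-1].Get().([]byte)
}

// putChunk returns a buffer to the pool for its size class so it can be reused.
func putChunk(b []byte) {
	for i, class := range sizeClasses {
		if cap(b) == class {
			pools[i].Put(b)
			return
		}
	}
}

func main() {
	b := getChunk(5000) // served from the 8192-byte class
	fmt.Println("got a", cap(b), "byte chunk")
	putChunk(b)
}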

I'm not sure why this change would lead to more buffers being consumed. I don't see any obvious way for a new memory leak to have been introduced here; perhaps I'm missing something, but I suspect any change is a result of the pattern of data arriving having changed. This change might result in clients sending data in fewer, larger chunks; perhaps this results in more bursty memory usage? Or perhaps clients are sending data faster, and running ahead of the ability of the server to forward it?

Do you configure the HTTP/2 server flow control limits at all? (Server.MaxUploadBufferPerConnection and Server.MaxUploadBufferPerStream in http2.Server.)

@LINKIWI (Author) commented Oct 20, 2022

Thanks for taking a look.

I'm not sure why this change would lead to more buffers being consumed. I don't see any obvious way for a new memory leak to have been introduced here; perhaps I'm missing something, but I suspect any change is a result of the pattern of data arriving having changed.

I am also not clear on how this change ultimately results in unbounded memory usage. I set up a closed-environment test workload comparing the behavior at both the good and bad commits of golang.org/x/net (with the same pattern of data sent by the client), and can reliably reproduce unbounded memory growth with the commit in question.

This change might result in clients sending data in fewer, larger chunks; perhaps this results in more bursty memory usage?

This hypothesis makes sense, though in practice I haven't been observing bursty memory usage. This graph illustrates the unbounded heap usage growth over time from ~15 MB to ~80 GB as measured by /memory/classes/heap/objects:bytes from Go's runtime metrics system:

[Screenshot: heap usage graph, captured 2022-10-19 22:58]

(The sudden drop was due to an OOM kill.)
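
For reference, that metric can be sampled directly with the runtime/metrics package; a minimal example (not the monitoring setup used above):

package main

import (
	"fmt"
	"runtime/metrics"
	"time"
)

func main() {
	// Sample the same metric the graph above is based on.
	samples := []metrics.Sample{{Name: "/memory/classes/heap/objects:bytes"}}
	for {
		metrics.Read(samples)
		fmt.Printf("heap objects: %d bytes\n", samples[0].Value.Uint64())
		time.Sleep(10 * time.Second)
	}
}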

Do you configure the HTTP/2 server flow control limits at all?

MaxUploadBufferPerConnection and MaxUploadBufferPerStream are not set explicitly, so the defaults are being used.

@LINKIWI (Author) commented Oct 20, 2022

I have created a minimal, complete, and verifiable example workload that demonstrates the issue.

Source tree:

.
├── client
│   ├── client
│   └── main.go
├── go.mod
├── go.sum
├── proto
│   ├── service_grpc.pb.go
│   ├── service.pb.go
│   └── service.proto
├── proxy
│   ├── main.go
│   └── proxy
└── server
    ├── main.go
    └── server

go.mod:

module bug

go 1.19

require (
	golang.org/x/net v0.1.0
	google.golang.org/grpc v1.50.1
	google.golang.org/protobuf v1.28.1
)

require (
	github.com/golang/protobuf v1.5.2 // indirect
	golang.org/x/sys v0.1.0 // indirect
	golang.org/x/text v0.4.0 // indirect
	google.golang.org/genproto v0.0.0-20200526211855-cb27e3aa2013 // indirect
)

proto/service.proto:

syntax = "proto3";

option go_package = "bug/proto";

message Request {
  bytes data = 1;
}

message Response {}

service Example {
    rpc Method(stream Request) returns (Response);
}

proxy/main.go:

package main

import (
	"crypto/tls"
	"net"
	"net/http"
	"net/http/httputil"

	"golang.org/x/net/http2"
	"golang.org/x/net/http2/h2c"
)

func main() {
	tr := &http2.Transport{
		AllowHTTP: true,
		DialTLS: func(network string, addr string, cfg *tls.Config) (net.Conn, error) {
			return net.Dial("tcp", "localhost:7001")
		},
	}

	rp := &httputil.ReverseProxy{
		Transport:     tr,
		FlushInterval: -1,
		Director: func(req *http.Request) {
			req.URL.Scheme = "http"
		},
	}

	handler := h2c.NewHandler(rp, &http2.Server{})

	srv := &http.Server{
		Addr:    "localhost:7000",
		Handler: handler,
	}

	srv.ListenAndServe()
}

server/main.go:

package main

import (
	"fmt"
	"net"

	"google.golang.org/grpc"

	"bug/proto"
)

type example struct {
	proto.UnimplementedExampleServer
}

func (e *example) Method(stream proto.Example_MethodServer) error {
	for {
		request, err := stream.Recv()
		if err != nil {
			break
		}

		fmt.Println("request", request)
	}

	return stream.SendAndClose(&proto.Response{})
}

func main() {
	srv := grpc.NewServer()
	e := &example{}

	ln, _ := net.Listen("tcp", ":7001")

	proto.RegisterExampleServer(srv, e)

	srv.Serve(ln)
}

client/main.go:

package main

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	"bug/proto"
)

func main() {
	conn, err := grpc.DialContext(context.Background(), "localhost:7000", grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}

	e := proto.NewExampleClient(conn)
	stream, err := e.Method(context.Background())
	if err != nil {
		panic(err)
	}

	for {
		err := stream.Send(&proto.Request{Data: []byte{1, 2, 3, 4, 5}})
		if err != nil {
			panic(err)
		}
	}
}

In words, this is:

  • An HTTP/2-capable reverse proxy listening on localhost:7000 and proxying to localhost:7001.
  • A streaming gRPC service definition.
  • A gRPC server implementation that simply prints all incoming messages over the stream, listening on localhost:7001.
  • A gRPC client that sends the same message over the stream continuously forever, connecting to localhost:7000.

First, generate gRPC code:

protoc --plugin=protoc-gen-go=protoc-gen-go --go_out=paths=source_relative:. --plugin=protoc-gen-go-grpc=protoc-gen-go-grpc --go-grpc_out=paths=source_relative:. service.proto

To reproduce the bug:

  1. Install golang.org/x/net@latest
  2. Build and run both the proxy and server.
  3. Build and run the client, and let it run for 5 minutes.
  4. Observe a steady increase in memory usage in the proxy process.

To verify that this is a regression:

  1. Install golang.org/x/net@v0.0.0-20220920183852-bf014ff85ad5
  2. Build and run both the proxy and server.
  3. Build and run the client, and let it run for 5 minutes.
  4. Observe consistent/non-growing memory usage in the proxy process.

@bcmills added this to the Go1.20 milestone Oct 20, 2022
@ZekeLu (Contributor) commented Oct 21, 2022

I think I have found out what's wrong. There is no memory leak. It's just that the proxy writes to the buffer (data from the client to the proxy) too fast and reads from the buffer (data from the proxy to the server) too slowly. As @neild indicated, you should configure the HTTP/2 server flow control limits.

The proxy writes to the buffer faster thanks to the improvement in golang/net@2e0b12c. To verify this hypothesis:

  1. Add log.Println("chunks:", len(b.chunks)) after this line: https://github.com/golang/net/blob/master/http2/databuffer.go#L91. Run the test; you should find that the number of chunks keeps growing.
  2. Simplify the server to make it faster; you should find that the number of b.chunks grows more slowly or stops growing.
    --- a/server/main.go
    +++ b/server/main.go
    @@ -1,38 +1,35 @@
     package main
     
     import (
    -   "fmt"
        "net"
     
        "google.golang.org/grpc"
     
        "bug/proto"
     )
     
     type example struct {
        proto.UnimplementedExampleServer
     }
     
     func (e *example) Method(stream proto.Example_MethodServer) error {
        for {
    -       request, err := stream.Recv()
    +       _, err := stream.Recv()
            if err != nil {
                break
            }
    -
    -       fmt.Println("request", request)
        }
     
        return stream.SendAndClose(&proto.Response{})
     }
     
     func main() {
        srv := grpc.NewServer()
        e := &example{}
     
        ln, _ := net.Listen("tcp", ":7001")
     
        proto.RegisterExampleServer(srv, e)
     
        srv.Serve(ln)
     }
  3. Configure the flow control limits; 1<<13 works for me:
    --- a/proxy/main.go
    +++ b/proxy/main.go
    @@ -1,37 +1,41 @@
     package main
     
     import (
        "crypto/tls"
        "net"
        "net/http"
        "net/http/httputil"
     
        "golang.org/x/net/http2"
        "golang.org/x/net/http2/h2c"
     )
     
     func main() {
        tr := &http2.Transport{
            AllowHTTP: true,
            DialTLS: func(network string, addr string, cfg *tls.Config) (net.Conn, error) {
                return net.Dial("tcp", "localhost:7001")
            },
        }
     
        rp := &httputil.ReverseProxy{
            Transport:     tr,
            FlushInterval: -1,
            Director: func(req *http.Request) {
                req.URL.Scheme = "http"
            },
        }
     
    -   handler := h2c.NewHandler(rp, &http2.Server{})
    +   handler := h2c.NewHandler(rp, &http2.Server{
    +       // The default value is 1<<20.
    +       MaxUploadBufferPerConnection: 1 << 13,
    +       MaxUploadBufferPerStream:     1 << 13,
    +   })
     
        srv := &http.Server{
            Addr:    "localhost:7000",
            Handler: handler,
        }
     
        srv.ListenAndServe()
     }

@LINKIWI (Author) commented Oct 22, 2022

Thanks for looking into this @ZekeLu. I think you are right. I tested a more restrictive MaxUploadBufferPerConnection and MaxUploadBufferPerStream in the proxy at golang.org/x/net@v0.0.0-20221019024206-cb67ada4b0ad and can confirm that the example test workload is no longer causing unbounded memory growth. I guess this was a latent behavior in the proxy that was only exposed by the optimization introduced in 2e0b12c274b70e5e836cf4bb0cda99c44f61a4d1.

That said, I think there are a couple of remaining open questions:

(1) The example workload described in the issue seems fairly vanilla. Is there an opportunity to perhaps improve the "out-of-the-box" default http2.Server behavior to avoid the original problem? For example, does it make sense to change the default value of MaxUploadBufferPerConnection and MaxUploadBufferPerStream to 1 << 13?

(2) The original incident, shown in the memory usage graph above, grew total heap allocation from ~15 MB to ~80 GB (which exhausted all 80 GB of on-disk swap space on the system in question). It sounds like there should be some way to set/tune the maximum http2.Server buffer usage across all connections and streams to avoid this potential failure mode.

@gopherbot commented:

Change https://go.dev/cl/444816 mentions this issue: http2: calculate a correct window increment size for a stream

@ZekeLu (Contributor) commented Oct 22, 2022

My two cents:

  1. Every application has its own characteristics; I don't think 1 << 13 works for most applications. I think it's better to leave the decision to the developers. Maybe it should be documented to remind developers to provide a proper configuration. I'm not sure.
  2. MaxUploadBufferPerConnection and MaxUploadBufferPerStream affect the buffer indirectly. So this question seems essentially the same as question 1 to me.

I found a new issue while investigating this one. It seems that the window increment size for a stream can be wrong in some cases. For example, with a flow control configuration like this:

handler := h2c.NewHandler(rp, &http2.Server{
	// The default value is 1<<20.
	MaxUploadBufferPerConnection: 1 << 16,
	MaxUploadBufferPerStream:     1 << 20,
})

The proxy will panic with "http: panic serving 127.0.0.1:44136: internal error; sent too many window updates without decrements?". I have sent CL 444816 to fix it.

gopherbot pushed a commit to golang/net that referenced this issue Oct 27, 2022
CL 432038 reduces sending WindowUpdates by introducing a threshold. Once
the remaining bytes are below the threshold, a single WindowUpdate is
sent to reset the amount back to the maximum amount configured.

The window increment size for a stream is calculated from:

    sc.srv.initialStreamRecvWindowSize() - st.inflow.available()

Where (*flow).available is defined as:

    func (f *flow) available() int32 {
    	n := f.n
    	if f.conn != nil && f.conn.n < n {
    		n = f.conn.n
    	}
    	return n
    }

When f.conn.n < f.n, it gets a bigger increment size. It should be
calculated from:

    sc.srv.initialStreamRecvWindowSize() - st.inflow.n

While we're here, remove an unnecessary type conversion too.

Updates golang/go#56315.

Change-Id: I4b26b27e4c5c5cd66e6a32b152d68f304adc65d8
GitHub-Last-Rev: 02fc09c1e7564243630d3555e7880ba352eea0fe
GitHub-Pull-Request: golang/net#155
Reviewed-on: https://go-review.googlesource.com/c/net/+/444816
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Damien Neil <dneil@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Auto-Submit: Ian Lance Taylor <iant@google.com>
Run-TryBot: Ian Lance Taylor <iant@google.com>
Run-TryBot: Damien Neil <dneil@google.com>
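
To make the over-credit concrete, here is a small worked example using the panic-triggering configuration from the comment above (illustrative code, not the x/net implementation):

package main

import "fmt"

func main() {
	// Configuration from the panic report above:
	// MaxUploadBufferPerConnection = 1<<16, MaxUploadBufferPerStream = 1<<20.
	streamWindow := int32(1 << 20) // initialStreamRecvWindowSize()
	streamAvail := int32(1 << 20)  // st.inflow.n: no stream data outstanding
	connAvail := int32(1 << 16)    // f.conn.n: the smaller connection window

	// (*flow).available() returns min(f.n, f.conn.n).
	avail := streamAvail
	if connAvail < avail {
		avail = connAvail
	}

	wrong := streamWindow - avail       // credits bytes the stream never consumed
	right := streamWindow - streamAvail // nothing to return for this stream yet

	fmt.Println("buggy increment:  ", wrong) // 983040
	fmt.Println("correct increment:", right) // 0
}
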
@gopherbot commented:

Change https://go.dev/cl/448055 mentions this issue: Revert "http2: Send WindowUpdates when remaining bytes are below a threshold"

@neild (Contributor) commented Nov 4, 2022

I've discovered another issue with https://go.dev/cl/432038, which effectively disables server pushback on clients. (Sorry, I really should have detected that in review. Mea culpa.)

Sent rollback.
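
As a toy illustration of the missing backpressure (assumed numbers, not the x/net code): if credit is returned as soon as bytes arrive, ignoring data still sitting in server read buffers, a sender that outpaces the handler never slows down and the buffers grow without bound.

package main

import "fmt"

func main() {
	const window = 1 << 16 // credit initially granted to the sender

	credit := window // how much the sender may still transmit
	buffered := 0    // bytes received but not yet read by the handler

	for round := 1; round <= 5; round++ {
		// The sender uses all of its credit.
		received := credit
		credit = 0
		buffered += received

		// Broken accounting: return credit for everything received,
		// ignoring how much is still buffered unread.
		credit += received

		// A slow handler drains only a quarter of what arrived.
		buffered -= received / 4

		fmt.Printf("round %d: buffered=%d bytes\n", round, buffered)
	}
	// buffered grows every round: the sender never experiences backpressure.
}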

WeiminShang added a commit to WeiminShang/net that referenced this issue Nov 16, 2022
Revert "http2: Send WindowUpdates when remaining bytes are below a threshold"

This reverts commit 3a96036.

The calculation for when to return flow control doesn't properly take
data in server read buffers into account, resulting in flow control
credit being returned too quickly without backpressure.

Fixes golang/go#56315
For golang/go#28732

Change-Id: I573afd1a37d8a711da47f05f38f4de04183fb941
Reviewed-on: https://go-review.googlesource.com/c/net/+/448055
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Damien Neil <dneil@google.com>
Reviewed-by: Roland Shoemaker <roland@golang.org>
@golang locked and limited conversation to collaborators Nov 4, 2023