x/net/http2: Memory leak regression introduced in window updates change #56315
https://go.dev/cl/432038 changes the frequency with which the HTTP/2 server sends window size updates to clients. Previously it would send updates to extend the stream window immediately upon receiving data; now it sends updates only when at least half the current window has been consumed. This reduces the frequency of window updates and should improve performance overall.

Your heap profiles show that your heap is pretty much entirely consumed by HTTP/2 stream buffers, used to hold data being received or sent. Buffers are allocated as needed from a set of pools of varying size classes; the largest class is 16384 bytes.

I'm not sure why this change would lead to more buffers being consumed. I don't see any obvious way for a new memory leak to have been introduced here; perhaps I'm missing something, but I suspect any change is a result of the pattern of data arriving having changed. This change might result in clients sending data in fewer, larger chunks; perhaps this results in more bursty memory usage? Or perhaps clients are sending data faster, and running ahead of the ability of the server to forward it?

Do you configure the HTTP/2 server flow control limits at all?
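For reference, the flow-control limits in question are fields on x/net's http2.Server. Below is a minimal, self-contained sketch of an h2c server that sets them explicitly; the handler, listen address, and the specific values are illustrative assumptions for this sketch, not settings taken from the reporter's setup:

package main

import (
    "io"
    "net/http"

    "golang.org/x/net/http2"
    "golang.org/x/net/http2/h2c"
)

func main() {
    // Handler that simply drains whatever the client uploads.
    h := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        io.Copy(io.Discard, r.Body)
    })

    // These fields bound how much request data the server is willing to
    // buffer (i.e. how much flow-control credit it extends) per connection
    // and per stream. The values here are only examples.
    h2s := &http2.Server{
        MaxUploadBufferPerConnection: 1 << 20, // connection-level receive window
        MaxUploadBufferPerStream:     1 << 16, // per-stream receive window
    }

    srv := &http.Server{
        Addr:    "localhost:8080",
        Handler: h2c.NewHandler(h, h2s),
    }
    srv.ListenAndServe()
}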
Thanks for taking a look.
I am also not clear on how this change ultimately results in unbounded memory usage. I set up a closed-environment test workload comparing the behavior at both the good and bad commits of golang.org/x/net.
This hypothesis makes sense, though in practice I haven't been observing bursty memory usage. This graph illustrates the unbounded heap usage growth over time, from ~15 MB to ~80 GB. (The sudden drop was due to an OOM kill.)
I have created a minimally complete and verifiable example workload that demonstrates the issue. The source tree contains three programs, shown below in order: the h2c reverse proxy, the gRPC server, and the gRPC client.
package main

import (
    "crypto/tls"
    "net"
    "net/http"
    "net/http/httputil"

    "golang.org/x/net/http2"
    "golang.org/x/net/http2/h2c"
)

func main() {
    // HTTP/2 transport that speaks h2c (HTTP/2 without TLS) to the backend on :7001.
    tr := &http2.Transport{
        AllowHTTP: true,
        DialTLS: func(network string, addr string, cfg *tls.Config) (net.Conn, error) {
            return net.Dial("tcp", "localhost:7001")
        },
    }
    rp := &httputil.ReverseProxy{
        Transport:     tr,
        FlushInterval: -1, // negative: flush to the client immediately after each write
        Director: func(req *http.Request) {
            req.URL.Scheme = "http"
        },
    }
    // Serve the proxy itself over h2c as well.
    handler := h2c.NewHandler(rp, &http2.Server{})
    srv := &http.Server{
        Addr:    "localhost:7000",
        Handler: handler,
    }
    srv.ListenAndServe()
}
package main

import (
    "fmt"
    "net"

    "google.golang.org/grpc"

    "bug/proto"
)

type example struct {
    proto.UnimplementedExampleServer
}

// Method drains a client-streaming RPC, logging each message and doing no other work.
func (e *example) Method(stream proto.Example_MethodServer) error {
    for {
        request, err := stream.Recv()
        if err != nil {
            break
        }
        fmt.Println("request", request)
    }
    return stream.SendAndClose(&proto.Response{})
}

func main() {
    srv := grpc.NewServer()
    e := &example{}
    ln, err := net.Listen("tcp", ":7001")
    if err != nil {
        panic(err)
    }
    proto.RegisterExampleServer(srv, e)
    srv.Serve(ln)
}
package main

import (
    "context"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"

    "bug/proto"
)

func main() {
    // Connect to the proxy (not directly to the gRPC server).
    conn, err := grpc.DialContext(context.Background(), "localhost:7000", grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        panic(err)
    }
    e := proto.NewExampleClient(conn)
    stream, err := e.Method(context.Background())
    if err != nil {
        panic(err)
    }
    // Send messages on the client stream as fast as possible, forever.
    for {
        err := stream.Send(&proto.Request{Data: []byte{1, 2, 3, 4, 5}})
        if err != nil {
            panic(err)
        }
    }
}

In words, this is:
First, generate the gRPC code:

protoc --plugin=protoc-gen-go=protoc-gen-go --go_out=paths=source_relative:. --plugin=protoc-gen-go-grpc=protoc-gen-go-grpc --go-grpc_out=paths=source_relative:. service.proto

To reproduce the bug:
To verify that this is a regression:
I think I have found out what's wrong. There is no memory leak. It's just that the proxy writes to the buffer (data from the client to the proxy) too fast and reads from the buffer (data from the proxy to the server) too slowly. As indicated by neild@, you should configure the HTTP/2 server flow control limits. The proxy writes to the buffer too fast thanks to the improvement in golang/net@2e0b12c.

To verify this hypothesis:
Thanks for looking into this @ZekeLu. I think you are right. I tested a more restrictive flow-control configuration.

That said, I think there are a couple of remaining open questions:

(1) The example workload described in the issue seems fairly vanilla. Is there an opportunity to perhaps improve the "out-of-the-box" defaults?

(2) The original incident described in the memory usage graph above increased total heap allocation size from 15 MB to 80 GB (which exhausted all 80 GB of on-disk swap space on the system in question). It sounds like there should exist some capability to set/tune the maximum memory these buffers can consume.
Change https://go.dev/cl/444816 mentions this issue.
My two cents:
I found a new issue while investigating this one. It seems that the increment size for a stream could be wrong in some cases. For example, when the flow control configuration is like this:

handler := h2c.NewHandler(rp, &http2.Server{
    // The default value is 1<<20.
    MaxUploadBufferPerConnection: 1 << 16,
    MaxUploadBufferPerStream:     1 << 20,
})

The proxy will panic with:
CL 432038 reduces sending WindowUpdates by introducing a threshold. Once the remaining bytes are below the threshold, a single WindowUpdate is sent to reset the amount back to the maximum amount configured.

The window increment size for a stream is calculated from:

sc.srv.initialStreamRecvWindowSize() - st.inflow.available()

Where (*flow).available is defined as:

func (f *flow) available() int32 {
    n := f.n
    if f.conn != nil && f.conn.n < n {
        n = f.conn.n
    }
    return n
}

When f.conn.n < f.n, it gets a bigger increment size. It should be calculated from:

sc.srv.initialStreamRecvWindowSize() - st.inflow.n

While we're here, remove an unnecessary type conversion too.

Updates golang/go#56315.

Change-Id: I4b26b27e4c5c5cd66e6a32b152d68f304adc65d8
GitHub-Last-Rev: 02fc09c
GitHub-Pull-Request: #155
Reviewed-on: https://go-review.googlesource.com/c/net/+/444816
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Damien Neil <dneil@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Auto-Submit: Ian Lance Taylor <iant@google.com>
Run-TryBot: Ian Lance Taylor <iant@google.com>
Run-TryBot: Damien Neil <dneil@google.com>
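To see why available() overstates what the stream has consumed when the connection window is the smaller of the two, here is a toy sketch built around the simplified flow logic quoted above; the struct is a stripped-down stand-in for the real package type, and the byte counts are made up for illustration:

package main

import "fmt"

// Stripped-down stand-in for the inbound flow-control bookkeeping quoted in
// the commit message above. All numbers below are illustrative only.
type flow struct {
    n    int32 // bytes remaining in this window
    conn *flow // optional connection-level window
}

func (f *flow) available() int32 {
    n := f.n
    if f.conn != nil && f.conn.n < n {
        n = f.conn.n
    }
    return n
}

func main() {
    const initialStreamRecvWindowSize = 1 << 20 // MaxUploadBufferPerStream
    connWindow := &flow{n: 1 << 16}             // MaxUploadBufferPerConnection
    streamWindow := &flow{n: initialStreamRecvWindowSize, conn: connWindow}

    // Suppose the stream has consumed 10 KiB so far; both windows shrink,
    // and the connection window is now the smaller of the two.
    streamWindow.n -= 10 << 10
    connWindow.n -= 10 << 10

    // Buggy increment: capped by the smaller connection window, so it vastly
    // overstates how much the stream actually consumed.
    buggy := initialStreamRecvWindowSize - streamWindow.available()
    // Fixed increment (CL 444816): based on the stream window alone.
    fixed := initialStreamRecvWindowSize - streamWindow.n

    fmt.Println("buggy increment:", buggy) // 993280 bytes
    fmt.Println("fixed increment:", fixed) // 10240 bytes
}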
Change https://go.dev/cl/448055 mentions this issue.
I've discovered another issue with https://go.dev/cl/432038, which effectively disables server pushback on clients. (Sorry, I really should have detected that in review. Mea culpa.) Sent rollback.
Revert "http2: Send WindowUpdates when remaining bytes are below a threshold"

This reverts commit 3a96036.

The calculation for when to return flow control doesn't properly take data in server read buffers into account, resulting in flow control credit being returned too quickly without backpressure.

Fixes golang/go#56315
For golang/go#28732

Change-Id: I573afd1a37d8a711da47f05f38f4de04183fb941
Reviewed-on: https://go-review.googlesource.com/c/net/+/448055
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Damien Neil <dneil@google.com>
Reviewed-by: Roland Shoemaker <roland@golang.org>
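The backpressure point can be illustrated with a toy model (this is not the real http2 code, just a sketch with made-up numbers): if the receiver returns flow-control credit as soon as bytes arrive off the wire, a fast sender paired with a slow consumer keeps pushing data and the unread bytes pile up in buffers; if credit is returned only as the application consumes data, buffering stays bounded by the window size.

package main

import "fmt"

// Toy model of when a receiver hands flow-control credit back to the sender.
// returnOnReceive=true models granting credit as soon as bytes arrive off the
// wire; false models granting it only after the application reads the bytes.
func simulate(window int, returnOnReceive bool) (peakBuffered int) {
    credit := window // bytes the sender is still allowed to send
    buffered := 0    // bytes received but not yet consumed

    for step := 0; step < 1000; step++ {
        // Fast sender: uses as much credit as it can, 1 KiB at a time.
        send := credit
        if send > 1024 {
            send = 1024
        }
        credit -= send
        buffered += send
        if returnOnReceive {
            credit += send // credit returned before the data is read
        }

        // Slow consumer: drains only 64 bytes per step.
        consume := buffered
        if consume > 64 {
            consume = 64
        }
        buffered -= consume
        if !returnOnReceive {
            credit += consume // credit returned only as data is read
        }

        if buffered > peakBuffered {
            peakBuffered = buffered
        }
    }
    return peakBuffered
}

func main() {
    const window = 64 << 10
    fmt.Println("credit on receive, peak buffered bytes:", simulate(window, true))  // grows with time
    fmt.Println("credit on consume, peak buffered bytes:", simulate(window, false)) // bounded by the window
}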
What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output

What did you do?
After mass-upgrading dependencies in a proprietary HTTP + gRPC reverse proxy, I observed unbounded memory growth when serving gRPC streams.
The full data flow is:
The implicated component that depends on x/net and is observing the memory leak is the gRPC proxy.

I bisected the behavior to 2e0b12c274b70e5e836cf4bb0cda99c44f61a4d1 from golang.org/x/net. Prior to this commit, memory usage while serving unidirectional (client -> server) streams remained constant. At this commit and later, memory usage grows consistently as long as the stream is active.

Please see #56315 (comment) for a minimally complete example of the data flow described above that demonstrates the bug.
The client logic is to repeatedly send messages to the server over a unidirectional stream, each about 32 KB in size. The server logic is to read the message and noop (discard the data).
I don't have enough knowledge of the HTTP/2 specification to do any investigation into 2e0b12c274b70e5e836cf4bb0cda99c44f61a4d1 itself, so I don't have an RCA.
What did you expect to see?
I expected to be able to serve the gRPC stream without unbounded memory growth.
What did you see instead?
I observe what appears to be a memory leak regression starting from this commit.
I verified this behavior by watching pmap -x $(pidof ...) over time for the good/bad commits identified by bisect, against the same client logic.

After running the test workload:
Here is an example heap usage graph captured with pprof at version v0.0.0-20220920183852-bf014ff85ad5.

Here is an example heap usage graph captured with pprof at version v0.0.0-20221019024206-cb67ada4b0ad.
.cc @neild