
net/http: http2 clientConnPool may cache a dead connection forever (write tcp use of closed network connection) #39750

Closed
povsister opened this issue Jun 22, 2020 · 6 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@povsister

povsister commented Jun 22, 2020

What version of Go are you using (go version)?

go version go1.13.9 linux/amd64

Does this issue reproduce with the latest release?

Yes. It still happens after kubelet was recompiled with Go 1.14.4.
However, I saw the connection health check commit golang/net@0ba52f6.
It should partly help avoid this problem, but it is not enabled by default.
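For reference, here is a minimal sketch of opting in to that health check (my own illustration, not code from the commit; the timeout values are arbitrary):

```go
package httpclient

import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

// newHealthCheckedClient returns a client whose HTTP/2 transport pings
// connections that have gone silent, instead of trusting them forever.
func newHealthCheckedClient() *http.Client {
	t2 := &http2.Transport{
		// Send a PING frame if no frame is received on the connection
		// for this long; 0 (the default) disables the health check.
		ReadIdleTimeout: 30 * time.Second,
		// Treat the connection as dead if the PING is not answered in time.
		PingTimeout: 15 * time.Second,
	}
	return &http.Client{Transport: t2}
}
```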

What operating system and processor architecture are you using (go env)?

go env Output
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/root/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/root/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/root/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="0"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"

What did you do?

If we close the connection via conn.Close, the readLoop immediately receives a read tcp ... use of closed connection error, and a cleanup is performed: every stream on the connection is notified, the broken connection is removed from the clientConnPool, and so on.
Everything seems fine, right?

But sometimes the readLoop of a clientConn may block on an idle connection.
And the writeHeaders method of clientConn in roundTrip never closes a broken connection; it just reports errors to its caller.
This is where the problem begins.

Imagine this: an incoming request is about to use that idle connection. Right after it is taken from the pool, the connection dies, so writeHeaders fails with write tcp ... use of closed connection and leaves the broken connection untouched in the pool.

https://github.com/golang/net/blob/627f9648deb96c27737b83199d44bb5c1010cbcf/http2/transport.go#L1083-L1100

Since there is no traffic on the connection, readLoop blocks, so no cleanup is ever triggered.
On top of that, incoming requests keep calling roundTrip -> newStream, which increments the stream count on the conn and makes it look "busy", so closeIfIdle will not clean up the dead connection either.
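To make the underlying condition concrete, here is a tiny standalone illustration (my own sketch, not the http2 code path): writing on a net.Conn whose socket has already been closed produces exactly the write tcp ...: use of closed network connection error seen in the kubelet logs below, and nothing notices until a write is attempted.

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Local listener so Dial succeeds; the server side never needs to read.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	conn.Close() // simulate the connection dying right after it left the pool

	_, err = conn.Write([]byte("HEADERS"))
	fmt.Println(err) // write tcp 127.0.0.1:...: use of closed network connection
}
```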

An actual case

kubernetes/kubernetes#87615

We encountered this problem several times on highly loaded VMs. Heavy load can cause some VMs to be temporarily suspended by the master machine, which is why a connection may die right after being taken from the pool.
Unfortunately, this scenario is hard to reproduce.

kubelet keeps using dead connections
kubelet.service - Kubernetes Kubelet
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2020-06-16 08:45:59 CST; 1h 25min ago
     Docs: https://github.com/GoogleCloudPlatform/kubernetes
Main PID: 21344 (kubelet)
    Tasks: 0
   CGroup: /system.slice/kubelet.service
           ‣ 21344 /usr/bin/kubelet --config=/opt/kubernetes/cfg/kubelet-config --root-dir=/opt/kubernetes/data/kubelet --cert-dir=/opt/kubernetes/data/kubelet/pki --network-plugin=cni --cni-bin-dir=/opt/kubernetes/cni --cni-conf-dir=/opt/kubernetes/cni/conf --kubeconfig=/opt/kubernetes/ssl/native.kubeconfig --hostname-override=autovmcos76vm01.company.com --node-labels=master=true,role=loadbalancer --v=1 --logtostderr=true --runtime-cgroups=/systemd/system.slice --pod-infra-container-image=localhost:5000/sandbox/pause:3.1

Jun 16 10:11:39 autovmCOS76VM01.company.com kubelet[21344]: E0616 10:11:39.160008 21344 reflector.go:153] object-"kube-system"/"dns-hosts-configmap": Failed to list *v1.ConfigMap: Get https://autovmcos76vm01.company.com:8443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Ddns-hosts-configmap&limit=500&resourceVersion=0: write tcp 127.0.0.1:44810->127.0.1.1:8443: use of closed network connection

Jun 16 10:11:39 autovmCOS76VM01.company.com kubelet[21344]: E0616 10:11:39.195291 21344 controller.go:135] failed to ensure node lease exists, will retry in 7s, error: Get https://autovmcos76vm01.company.com:8443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/autovmcos76vm01.company.com?timeout=10s: write tcp 127.0.0.1:44810->127.0.1.1:8443: use of closed network connection

Jun 16 10:11:39 autovmCOS76VM01.company.com kubelet[21344]: E0616 10:11:39.358961 21344 reflector.go:153] object-"core"/"cdf-deployer-token-4qr6w": Failed to list *v1.Secret: Get https://autovmcos76vm01.company.com:8443/api/v1/namespaces/core/secrets?fieldSelector=metadata.name%3Dcdf-deployer-token-4qr6w&limit=500&resourceVersion=0: write tcp 127.0.0.1:44810->127.0.1.1:8443: use of closed network connection

Jun 16 10:11:39 autovmCOS76VM01.company.com kubelet[21344]: E0616 10:11:39.558938 21344 reflector.go:153] k8s.io/client-go/informers/factory.go:135: Failed to list *v1beta1.RuntimeClass: Get https://autovmcos76vm01.company.com:8443/apis/node.k8s.io/v1beta1/runtimeclasses?limit=500&resourceVersion=0: write tcp 127.0.0.1:44810->127.0.1.1:8443: use of closed network connection

Jun 16 10:11:39 autovmCOS76VM01.company.com kubelet[21344]: E0616 10:11:39.759003 21344 reflector.go:153] object-"demo2ns"/"registrypullsecret": Failed to list *v1.Secret: Get https://autovmcos76vm01.company.com:8443/api/v1/namespaces/demo2ns/secrets?fieldSelector=metadata.name%3Dregistrypullsecret&limit=500&resourceVersion=0: write tcp 127.0.0.1:44810->127.0.1.1:8443: use of closed network connection

Jun 16 10:11:39 autovmCOS76VM01.company.com kubelet[21344]: W0616 10:11:39.958867 21344 status_manager.go:530] Failed to get status for pod "kube-registry-h67dx_core(4603bbf2-b773-4155-95a4-f73bdf2048eb)": Get https://autovmcos76vm01.company.com:8443/api/v1/namespaces/core/pods/kube-registry-h67dx: write tcp 127.0.0.1:44810->127.0.1.1:8443: use of closed network connection

Jun 16 10:11:40 autovmCOS76VM01.company.com kubelet[21344]: E0616 10:11:40.159135 21344 reflector.go:153] object-"core"/"base-configmap": Failed to list *v1.ConfigMap: Get https://autovmcos76vm01.company.com:8443/api/v1/namespaces/core/configmaps?fieldSelector=metadata.name%3Dbase-configmap&limit=500&resourceVersion=0: write tcp 127.0.0.1:44810->127.0.1.1:8443: use of closed network connection

Jun 16 10:11:40 autovmCOS76VM01.company.com kubelet[21344]: E0616 10:11:40.358980 21344 reflector.go:153] object-"kube-system"/"coredns": Failed to list *v1.ConfigMap: Get https://autovmcos76vm01.company.com:8443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dcoredns&limit=500&resourceVersion=0: write tcp 127.0.0.1:44810->127.0.1.1:8443: use of closed network connection

Jun 16 10:11:40 autovmCOS76VM01.company.com kubelet[21344]: E0616 10:11:40.559314 21344 reflector.go:153] object-"kube-system"/"itom-cdf-keepalived-token-c76dd": Failed to list *v1.Secret: Get https://autovmcos76vm01.company.com:8443/api/v1/namespaces/kube-system/secrets?fieldSelector=metadata.name%3Ditom-cdf-keepalived-token-c76dd&limit=500&resourceVersion=0: write tcp 127.0.0.1:44810->127.0.1.1:8443: use of closed network connection

Jun 16 10:11:40 autovmCOS76VM01.company.com kubelet[21344]: E0616 10:11:40.761201 21344 reflector.go:153] object-"core"/"default-database-configmap": Failed to list *v1.ConfigMap: Get https://autovmcos76vm01.company.com:8443/api/v1/namespaces/core/configmaps?fieldSelector=metadata.name%3Ddefault-database-configmap&limit=500&resourceVersion=0: write tcp 127.0.0.1:44810->127.0.1.1:8443: use of closed network connection

@networkimprov

cc @fraenkel

@agnivade agnivade added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jun 23, 2020
@agnivade agnivade added this to the Backlog milestone Jun 23, 2020
@fraenkel
Contributor

This covers multiple issues already reported, some awaiting fixes, others awaiting responses.
#32388
https://go-review.googlesource.com/c/net/+/181457/
https://go-review.googlesource.com/c/net/+/224797/

#39337

There is no guarantee that the http side and the http2 side are actually in agreement. There are many more cases where the http side believes that http2 connections are still alive and valid. However, when they are used, that is when the detection kicks in.
CloseIdleConnection is one case where the http2 side will attempt cleanup. However, there are many situations that will make an existing connection not quite idle.

The only safe way out of many of these situations currently is setting write timeouts.
It would be helpful to have something that is reproducible or a stack that shows the hung threads.
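For what it's worth, one possible way to apply such write timeouts today is to wrap the conn returned by a custom DialTLS so that every Write carries a deadline. This is only an illustrative sketch; the writeDeadlineConn wrapper and the 10-second value are mine, not part of net/http or x/net/http2:

```go
package httpclient

import (
	"crypto/tls"
	"net"
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

// writeDeadlineConn sets a fresh write deadline before every Write,
// so a write to a dead peer fails instead of hanging indefinitely.
type writeDeadlineConn struct {
	net.Conn
	timeout time.Duration
}

func (c *writeDeadlineConn) Write(p []byte) (int, error) {
	if err := c.SetWriteDeadline(time.Now().Add(c.timeout)); err != nil {
		return 0, err
	}
	return c.Conn.Write(p)
}

func newWriteTimeoutClient() *http.Client {
	t2 := &http2.Transport{
		DialTLS: func(network, addr string, cfg *tls.Config) (net.Conn, error) {
			conn, err := tls.Dial(network, addr, cfg)
			if err != nil {
				return nil, err
			}
			return &writeDeadlineConn{Conn: conn, timeout: 10 * time.Second}, nil
		},
	}
	return &http.Client{Transport: t2}
}
```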

@povsister
Author

povsister commented Jun 24, 2020

@fraenkel I just went through the cases you provided above. They look quite similar to my problem.
But what I am actually talking about is a failure of the cleanup mechanism in an odd situation, which leads to dead connections being cached. (Sorry, I can't reproduce it with unit tests. It really happens in production, and I can only analyze the issue from limited application logs and the code itself.)

For now, the underlying TCP connections mostly rely on readLoop to detect connection problems, plus a few other mechanisms such as CloseIdleConnections.

However, when they are used, that is when the detection kicks in.

I agree with this detection strategy. But in my case, readLoop somehow seems fine and never returns an error (I guess it is blocked on a completely idle connection). Meanwhile, I see a batch of write ... use of closed connection errors, and the same dead connection keeps being used for request after request, which causes the application to loop on network requests (and also keeps the connection from ever looking idle).

The only safe way out of many of these situations currently is setting write timeouts.

Setting a write timeout only fixes the application-hang problem you mentioned above; it has no effect on the dead connection itself.

The code shows that we currently do not perform any detection on writes: when a write fails, the error is returned to the caller and nothing else is done.
So, if we accept that writing is a kind of usage, I think we should introduce some judgement logic that decides whether a write error indicates a connection problem and, if so, immediately notifies the connection pool so that the broken conn is not handed out to the next request. A retry attempted by a higher-level application would then have a chance to succeed.
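As a rough, hypothetical sketch of that judgement logic (the helper name is made up; the real change would live inside x/net/http2's roundTrip/writeHeaders error path):

```go
package h2check

import (
	"errors"
	"net"
	"strings"
)

// isFatalWriteError reports whether err from a frame write indicates that the
// underlying TCP connection is unusable and should be evicted from the pool.
func isFatalWriteError(err error) bool {
	if err == nil {
		return false
	}
	// Go 1.16+: the canonical "use of closed network connection" sentinel.
	if errors.Is(err, net.ErrClosed) {
		return true
	}
	// Fallback for older Go versions, where only the error text identifies it.
	return strings.Contains(err.Error(), "use of closed network connection")
}
```

With something like this, a failed frame write could trigger the pool's MarkDead for the affected ClientConn instead of only bubbling the error up to the caller.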

@povsister povsister changed the title net/http: http2 clientConnPool may cache a dead connection forever if a clientConn died just after getting from pool net/http: http2 clientConnPool may cache a dead connection forever (write tcp use of closed network connection) Jun 24, 2020
roidelapluie added a commit to roidelapluie/common that referenced this issue Aug 3, 2020
HTTP/2 support in golang has many problematic corner cases where dead
connections would be kept.

golang/go#32388
golang/go#39337
golang/go#39750

I suggest we disable HTTP/2 for now and enable it manually on the
blackbox exporter.

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
@povsister
Author

Ping @fraenkel
It has been confirmed that the h2 transport health check handles such issues.

Any updates, or a plan to enable it by default?
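For anyone hitting this with a client built on the standard *http.Transport (as kubelet is), this is roughly how the health check can be opted into today, assuming the x/net/http2 ConfigureTransports helper; the values are illustrative:

```go
package httpclient

import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

// enableH2HealthCheck upgrades t1 for HTTP/2 and turns on the ping-based
// health check on the resulting http2 transport.
func enableH2HealthCheck(t1 *http.Transport) error {
	t2, err := http2.ConfigureTransports(t1)
	if err != nil {
		return err
	}
	t2.ReadIdleTimeout = 30 * time.Second // ping if the conn has been silent this long
	t2.PingTimeout = 15 * time.Second     // mark the conn dead if the ping goes unanswered
	return nil
}
```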

@roidelapluie

Is this a duplicate of #39337 ?

@povsister
Author

@roidelapluie Thanks a lot.
The fix for #39337 also fixes this issue. If write errors are properly handled, this problem will not occur.

I will close this issue.

@golang golang locked and limited conversation to collaborators Nov 1, 2021