net/http: internal error: connCount underflow #61474

Open
zzkcode opened this issue Jul 20, 2023 · 10 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@zzkcode
Contributor

zzkcode commented Jul 20, 2023

What version of Go are you using (go version)?

$ go version
go version go1.18.3 linux/amd64

Does this issue reproduce with the latest release?

Haven't tried it yet. Will try it when we have a reproducer.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/root/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/root/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.18.3"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build1421070104=/tmp/go-build -gno-record-gcc-switches"

What did you do?

We are piloting a limited MaxConnsPerHost on several production instances, each running for about 1~2 days. One of them crashed with: panic: net/http: internal error: connCount underflow.

What did you expect to see?

No crash.

What did you see instead?

panic: net/http: internal error: connCount underflow.

Below are the details from examining the core file. Since some locals and args seem to be optimized out, I jump between frames to print out the details.

bt:

(dlv) bt
 0  0x000000000046e601 in runtime.raise
    at /usr/local/go/src/runtime/sys_linux_amd64.s:168
 1  0x000000000044fc25 in runtime.dieFromSignal
    at /usr/local/go/src/runtime/signal_unix.go:852
 2  0x00000000004505f6 in runtime.sigfwdgo
    at /usr/local/go/src/runtime/signal_unix.go:1066
 3  0x000000000044e947 in runtime.sigtrampgo
    at /usr/local/go/src/runtime/signal_unix.go:430
 4  0x000000000046f44e in runtime.sigtrampgo
    at <autogenerated>:1
 5  0x000000000046e8fd in runtime.sigtramp
    at /usr/local/go/src/runtime/sys_linux_amd64.s:361
 6  0x00007fa1b18df8e0 in ???
    at ?:-1
 7  0x0000000000439c49 in runtime.crash
    at /usr/local/go/src/runtime/signal_unix.go:944
 8  0x0000000000439c49 in runtime.fatalpanic
    at /usr/local/go/src/runtime/panic.go:1092
 9  0x0000000000439417 in runtime.gopanic
    at /usr/local/go/src/runtime/panic.go:941
10  0x0000000000869d52 in net/http.(*Transport).decConnsPerHost
    at /usr/local/go/src/net/http/transport.go:1475
11  0x000000000086472f in net/http.(*Transport).roundTrip
    at /usr/local/go/src/net/http/transport.go:604
12  0x000000000084c339 in net/http.(*Transport).RoundTrip
    at /usr/local/go/src/net/http/roundtrip.go:17
13  0x0000000000808d58 in net/http.send
    at /usr/local/go/src/net/http/client.go:252
14  0x00000000008085fb in net/http.(*Client).send
    at /usr/local/go/src/net/http/client.go:176
15  0x000000000080aa35 in net/http.(*Client).do
    at /usr/local/go/src/net/http/client.go:725
16  0x000000000101a6c9 in net/http.(*Client).Do
    at /usr/local/go/src/net/http/client.go:593
(dlv) frame 11
(dlv) p pconn.cacheKey
net/http.connectMethodKey {
        proxy: "http://proxy-server:3128",
        scheme: "https",
        addr: "real-endpoint-which-caused-crash:443",
        onlyH1: false,}

(dlv) frame 10
(dlv) p t.connsPerHost
map[net/http.connectMethodKey]int [
        // ignore unrelated connectMethods
        {proxy: "http://proxy-server:3128", scheme: "https", addr: "real-endpoint-which-caused-crash:443", onlyH1: false}: 1, 
]

It seems a little weird here: the panic is at net/http/transport.go:1475, but printing t.connsPerHost shows the count for this key is 1.

	t.connsPerHostMu.Lock()
	defer t.connsPerHostMu.Unlock()
	n := t.connsPerHost[key]
	if n == 0 {
		// Shouldn't happen, but if it does, the counting is buggy and could
		// easily lead to a silent deadlock, so report the problem loudly.
		panic("net/http: internal error: connCount underflow")
	}

Note: n and key seem to be optimized out here; printing them returns unreadable could not find loclist entry..., and GDB shows a large number:

(gdb) p n
$2 = 825625303360

I will try to reproduce it outside the production environment, and then with -race and the latest Go version. I have noticed some related issues that are already closed: #34941, #38172. Any comments and suggestions would be appreciated.

@mauri870
Member

Running with -race should shed some light. If you manage to put together a reproducible example, please share it.

@bcmills
Contributor

bcmills commented Jul 20, 2023

go version go1.18.3 linux/amd64

Per the release policy Go 1.18 has been unsupported since 2023-02-01 (when Go 1.20 was released), and there have been quite a few fixes in the net/http package since 1.18 was released.

Does this reproduce with a supported (1.20.6 or 1.19.11) or upcoming (1.21rc3) release?

@bcmills bcmills added the WaitingForInfo label Jul 20, 2023
@bcmills
Contributor

bcmills commented Jul 20, 2023

(CC @neild)

@zzkcode
Contributor Author

zzkcode commented Jul 20, 2023

@bcmills Thanks. We encountered this problem while piloting MaxConnsPerHost in our production environment, where we are still using go1.18.3. I will try to reproduce the problem first and see whether it also reproduces on the latest Go.

What seems weird to me is the information I showed under What did you see instead?, which looks like it should not have crashed at all. Is it possible that while it panics in frame 10 and runs defer t.connsPerHostMu.Unlock(), another goroutine happens to acquire the lock and increment the count to 1?
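
Not the net/http code, just a minimal standalone sketch of the mechanism behind that question: a deferred Unlock does run while a panic unwinds, so the lock is released and another goroutine could in principle re-acquire it and bump the count before the core dump is written.

package main

import (
	"fmt"
	"sync"
)

func main() {
	var mu sync.Mutex

	func() {
		defer func() {
			// Deferred functions run LIFO, so the deferred Unlock below has
			// already executed by the time this recover runs: the mutex is
			// free again even though the panic is still unwinding.
			if r := recover(); r != nil {
				fmt.Println("recovered:", r, "- mutex free again:", mu.TryLock())
			}
		}()
		mu.Lock()
		defer mu.Unlock()
		panic("net/http: internal error: connCount underflow (simulated)")
	}()
}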

@seankhliao seankhliao added and removed the WaitingForInfo label Jul 22, 2023
@bcmills
Contributor

bcmills commented Aug 7, 2023

See previously #38172. @zzkcode, have you tried running a copy of your program built with the -race option? If so, does it report any races?

@bcmills bcmills added and removed the WaitingForInfo label Aug 7, 2023
@zzkcode
Contributor Author

zzkcode commented Aug 9, 2023

Hi all, here is the latest progress:

See previously #38172. @zzkcode, have you tried running a copy of your program built with the -race option? If so, does it report any races?

@bcmills Thanks, we haven't tried -race yet since it would degrade performance, but we will once we have a reproducer.

go version go1.18.3 linux/amd64

Per the release policy Go 1.18 has been unsupported since 2023-02-01 (when Go 1.20 was released), and there have been quite a few fixes in the net/http package since 1.18 was released.

Does this reproduce with a supported (1.20.6 or 1.19.11) or upcoming (1.21rc3) release?

@bcmills We upgraded Go from 1.18.3 to 1.20.6, and the issue still exists.

panic: net/http: internal error: connCount underflow

goroutine 33385383 [running]:
panic({0x1289e40, 0x1852220})
        /usr/local/go/src/runtime/panic.go:987 +0x3bb fp=0xc03fb9f4b0 sp=0xc03fb9f3f0 pc=0x43a67b
net/http.(*Transport).decConnsPerHost(0xc00033c280, {{0xc0799dc080, 0x40}, {0xc078329880, 0x5}, {0xc0991c60c0, 0x21}, 0x0})
        /usr/local/go/src/net/http/transport.go:1482 +0x40f fp=0xc03fb9f550 sp=0xc03fb9f4b0 pc=0x8957cf
net/http.(*Transport).roundTrip(0xc00033c280, 0xc037407d00)
        /usr/local/go/src/net/http/transport.go:613 +0x8cb fp=0xc03fb9f778 sp=0xc03fb9f550 pc=0x89062b
net/http.(*Transport).RoundTrip(0x0?, 0x1855e60?)
        /usr/local/go/src/net/http/roundtrip.go:17 +0x19 fp=0xc03fb9f798 sp=0xc03fb9f778 pc=0x878119
net/http.send(0xc037407d00, {0x1855e60, 0xc00033c280}, {0x8?, 0x14aa2e0?, 0x0?})
        /usr/local/go/src/net/http/client.go:252 +0x5f7 fp=0xc03fb9f990 sp=0xc03fb9f798 pc=0x833c57
net/http.(*Client).send(0xc001a78510, 0xc037407d00, {0xc002a53ba5?, 0x15?, 0x0?})
        /usr/local/go/src/net/http/client.go:176 +0x9b fp=0xc03fb9fa08 sp=0xc03fb9f990 pc=0x8334db
net/http.(*Client).do(0xc001a78510, 0xc037407d00)
        /usr/local/go/src/net/http/client.go:716 +0x8fb fp=0xc03fb9fc10 sp=0xc03fb9fa08 pc=0x8357db
net/http.(*Client).Do(...)
        /usr/local/go/src/net/http/client.go:582
(dlv) bt
 0  0x00000000004729a1 in runtime.raise
    at /usr/local/go/src/runtime/sys_linux_amd64.s:154
 1  0x0000000000451e05 in runtime.dieFromSignal
    at /usr/local/go/src/runtime/signal_unix.go:883
 2  0x0000000000452486 in runtime.sigfwdgo
    at /usr/local/go/src/runtime/signal_unix.go:1096
 3  0x0000000000450aa7 in runtime.sigtrampgo
    at /usr/local/go/src/runtime/signal_unix.go:432
 4  0x0000000000472ca6 in runtime.sigtramp
    at /usr/local/go/src/runtime/sys_linux_amd64.s:354
 5  0x00007f77d95ad8e0 in ???
    at ?:-1
 6  0x000000000043af89 in runtime.crash
    at /usr/local/go/src/runtime/signal_unix.go:975
 7  0x000000000043af89 in runtime.fatalpanic
    at /usr/local/go/src/runtime/panic.go:1172
 8  0x000000000043a67b in runtime.gopanic
    at /usr/local/go/src/runtime/panic.go:987
 9  0x00000000008957cf in net/http.(*Transport).decConnsPerHost
    at /usr/local/go/src/net/http/transport.go:1482
10  0x000000000089062b in net/http.(*Transport).roundTrip
    at /usr/local/go/src/net/http/transport.go:613
11  0x0000000000878119 in net/http.(*Transport).RoundTrip
    at /usr/local/go/src/net/http/roundtrip.go:17
12  0x0000000000833c57 in net/http.send
    at /usr/local/go/src/net/http/client.go:252
13  0x00000000008334db in net/http.(*Client).send
    at /usr/local/go/src/net/http/client.go:176
14  0x00000000008357db in net/http.(*Client).do
    at /usr/local/go/src/net/http/client.go:716
15  0x0000000001054053 in net/http.(*Client).Do
    at /usr/local/go/src/net/http/client.go:582

(dlv) frame 10

(dlv) p err
error(net/http.http2noCachedConnError) {}

(dlv) p pconn.cacheKey
net/http.connectMethodKey {
        proxy: "http://proxy-host:1234",
        scheme: "https",
        addr: "real-endpoint:443",
        onlyH1: false,}

Printing out t.connsPerHost, we don't see this pconn.cacheKey in it.

Please let me know if there are any more details I could provide.

@zzkcode
Contributor Author

zzkcode commented Aug 10, 2023

Hi @bcmills @mauri870, we ran with -race and it reported nothing related to this, only the panic and the crash traceback.

panic: net/http: internal error: connCount underflow

goroutine 41240143 [running]:
panic({0x1bd7b00, 0x2c8b0b0})
	/usr/local/go/src/runtime/panic.go:987 +0x3bb fp=0xc005fb4df8 sp=0xc005fb4d38 pc=0x44b0bb
net/http.(*Transport).decConnsPerHost(0xc00049a3c0, {{0xc04f629a80, 0x40}, {0xc03ab6d280, 0x5}, {0xc03f468240, 0x24}, 0x0})
	/usr/local/go/src/net/http/transport.go:1482 +0x818 fp=0xc005fb4f90 sp=0xc005fb4df8 pc=0xb4af78
net/http.(*Transport).roundTrip(0xc00049a3c0, 0xc03dad5800)
	/usr/local/go/src/net/http/transport.go:613 +0xe65 fp=0xc005fb52d0 sp=0xc005fb4f90 pc=0xb423c5
net/http.(*Transport).RoundTrip(0xc03dad5800?, 0x2c8ed00?)
	/usr/local/go/src/net/http/roundtrip.go:17 +0x37 fp=0xc005fb5318 sp=0xc005fb52d0 pc=0xb1a057
net/http.send(0xc03dad5800, {0x2c8ed00, 0xc00049a3c0}, {0x8?, 0xc001d81400?, 0x0?})
	/usr/local/go/src/net/http/client.go:252 +0x943 fp=0xc005fb55f0 sp=0xc005fb5318 pc=0xaab2a3
net/http.(*Client).send(0xc00122f650, 0xc03dad5800, {0x0?, 0x487077?, 0x0?})
	/usr/local/go/src/net/http/client.go:176 +0x165 fp=0xc005fb56a8 sp=0xc005fb55f0 pc=0xaaa6a5
net/http.(*Client).do(0xc00122f650, 0xc03dad5800)
	/usr/local/go/src/net/http/client.go:716 +0x10df fp=0xc005fb5a10 sp=0xc005fb56a8 pc=0xaae2bf
net/http.(*Client).Do(...)
	/usr/local/go/src/net/http/client.go:582

...

@mauri870
Member

@zzkcode Thanks for taking the time to upgrade the Go version and to run with -race. Let's wait until someone can look into this. Also check for updates on #61794.

@zzkcode
Contributor Author

zzkcode commented Aug 10, 2023

@mauri870 Thanks. In the latest crash, which ran with -race, we found there were thousands of requests to the same host that caused the crash. We hadn't checked this in previous occurrences, and we will try to make a simple reproducer. FYI @bcmills

We send requests through a forward proxy; here is part of the http.Transport configuration:

t := &http.Transport{
    DisableKeepAlives:     false,
    DisableCompression:    false,
    ForceAttemptHTTP2:     true,
    MaxConnsPerHost:       5000,
    MaxIdleConns:          10000,
    MaxIdleConnsPerHost:   2000,
    IdleConnTimeout:       0 * time.Second,
    ExpectContinueTimeout: 1 * time.Second,
}
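
For context, a hypothetical, self-contained wiring of such a transport to a forward proxy; the proxy URL and helper name here are placeholders, not our actual configuration.

package main

import (
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// newClient sketches how a transport like the one above is typically
// pointed at a forward proxy. All values are placeholders.
func newClient() (*http.Client, error) {
	proxyURL, err := url.Parse("http://proxy-host:1234") // placeholder proxy
	if err != nil {
		return nil, err
	}
	return &http.Client{
		Transport: &http.Transport{
			Proxy:                 http.ProxyURL(proxyURL),
			ForceAttemptHTTP2:     true,
			MaxConnsPerHost:       5000,
			MaxIdleConns:          10000,
			MaxIdleConnsPerHost:   2000,
			IdleConnTimeout:       0, // zero: idle connections are never closed for idleness
			ExpectContinueTimeout: 1 * time.Second,
		},
	}, nil
}

func main() {
	client, err := newClient()
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", client.Transport)
}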

@bcmills bcmills added the NeedsInvestigation label and removed the WaitingForInfo label Aug 10, 2023
@bcmills bcmills added this to the Backlog milestone Aug 10, 2023
@elnappo

elnappo commented Apr 17, 2024

We also faced this problem in prod. We couldn't identify a pattern, and the occurrence seems random to us. A colleague tried to debug it but wasn't able to reproduce it. As he has no GitHub account, he asked me to add his findings and ideas here:

Hi all 🙂
We have also hit the connCount underflow panic a few times, so we started to investigate a bit and tried to reproduce it. Sadly, we were not fully able to do so, but we still gained some insights that we want to share with you (maybe you know all of this already).
Most likely the panic is caused by a call to decConnsPerHost for a key that is not present in the connsPerHost map. In that case t.connsPerHost[key] also returns zero and the panic is triggered.
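
A tiny standalone sketch of that map behavior (the key is a placeholder, not the actual net/http key type): reading a missing key yields the zero value, so "key absent" and "count is zero" are indistinguishable without the comma-ok form.

package main

import "fmt"

func main() {
	connsPerHost := map[string]int{}

	n := connsPerHost["example.com:443"] // missing key: zero value, no error
	fmt.Println(n)                       // 0

	m, ok := connsPerHost["example.com:443"] // comma-ok distinguishes the two cases
	fmt.Println(m, ok)                       // 0 false
}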

If we understand the code correctly, this situation only arises if at some point in time there are two calls to decConnsPerHost for a single connection, which results in a miscount. Would it be possible to make sure that decConnsPerHost is called at most once for each connection (wantConn or persistConn), for example as sketched below?
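
One possible shape of such a guard, as a hypothetical standalone sketch (countedConn and the string key are stand-ins, not the actual wantConn/persistConn types): a sync.Once on the connection makes the decrement idempotent even if two code paths reach it.

package main

import (
	"fmt"
	"sync"
)

// countedConn is a stand-in for a tracked connection whose per-host
// count must be decremented exactly once, no matter which code path
// (dial failure, read loop shutdown, ...) finishes with it.
type countedConn struct {
	key     string
	decOnce sync.Once
}

func (c *countedConn) decConnsPerHost(mu *sync.Mutex, counts map[string]int) {
	c.decOnce.Do(func() {
		mu.Lock()
		defer mu.Unlock()
		counts[c.key]--
	})
}

func main() {
	var mu sync.Mutex
	counts := map[string]int{"example.com:443": 1}
	c := &countedConn{key: "example.com:443"}

	c.decConnsPerHost(&mu, counts) // first caller decrements
	c.decConnsPerHost(&mu, counts) // second caller is a no-op instead of underflowing
	fmt.Println(counts)            // map[example.com:443:0]
}
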
Additionally, it might make sense to split up the panic in order to get better insight into what is going on:

n, ok := t.connsPerHost[key]
if !ok {
	panic("net/http: internal error: connCount underflow, key not in map")
}
if n == 0 {
	panic("net/http: internal error: connCount underflow, value is zero")
}

This is how we tried to reproduce the issue:

  1. Do an HTTP request to some running test server
  2. Let dialConn fail -> decConnsPerHost gets called from within dialConnFor
  3. At the same time, close the connection -> decConnsPerHost gets called from the pconn.readLoop goroutine
  4. The second call to decConnsPerHost will panic

To achieve step 2 we manually altered the error that is returned by dialConn; we were sadly not able to trigger this from the outside (and it is probably not even possible, because readLoop() only gets started if there is no error). So this exact situation is admittedly unrealistic, but it still shows the fragility of the code, as nothing prevents a second call to decConnsPerHost.

This is our code for testing:

package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

var zeroDialer net.Dialer

func main() {
	// Start test server
	testServer := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))

	transport := &http.Transport{
		MaxConnsPerHost: 1,
		DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
			conn, err := zeroDialer.DialContext(ctx, network, addr)
			if err != nil {
				return nil, err
			}
			fmt.Println("Will close connection")
			conn.Close()
			return conn, nil
		},
	}

	req, _ := http.NewRequest(http.MethodGet, testServer.URL, nil)

	httpClient := &http.Client{
		Transport: transport,
	}

	httpClient.Do(req)
	time.Sleep(10 * time.Second)
}

We replaced dialConnFor with:

func (t *Transport) dialConnFor(w *wantConn) {
	defer w.afterDial()
	ctx := w.getCtxForDial()
	if ctx == nil {
		t.decConnsPerHost(w.key)
		return
	}

	pc, err := t.dialConn(ctx, w.cm)
	err = fmt.Errorf("some error")
	fmt.Println("Faked the error which was nil before to be not nil")
	time.Sleep(time.Second)
	delivered := w.tryDeliver(pc, err)
	if err == nil && (!delivered || pc.alt != nil) {
		// pconn was not passed to w,
		// or it is HTTP/2 and can be shared.
		// Add to the idle connection pool.
		t.putOrCloseIdleConn(pc)
	}
	if err != nil {
		t.decConnsPerHost(w.key)
	}
}

We replaced decConnsPerHost with the following, for better logging:

func (t *Transport) decConnsPerHost(key connectMethodKey) {
	fmt.Printf("Function decConnsPerHost gets called with key %v\n", key)
	// Debug check only: note this read happens before connsPerHostMu is acquired.
	if _, ok := t.connsPerHost[key]; !ok {
		fmt.Printf("The key %v does not exist in the map, this will result in a panic!\n", key)
	}
	if t.MaxConnsPerHost <= 0 {
		return
	}

	t.connsPerHostMu.Lock()
	defer t.connsPerHostMu.Unlock()
	n := t.connsPerHost[key]
	if n == 0 {
		// Shouldn't happen, but if it does, the counting is buggy and could
		// easily lead to a silent deadlock, so report the problem loudly.
		panic("net/http: internal error: connCount underflow")
	}

	// Can we hand this count to a goroutine still waiting to dial?
	// (Some goroutines on the wait list may have timed out or
	// gotten a connection another way. If they're all gone,
	// we don't want to kick off any spurious dial operations.)
	if q := t.connsPerHostWait[key]; q.len() > 0 {
		done := false
		for q.len() > 0 {
			w := q.popFront()
			if w.waiting() {
				go t.dialConnFor(w)
				done = true
				break
			}
		}
		if q.len() == 0 {
			delete(t.connsPerHostWait, key)
		} else {
			// q is a value (like a slice), so we have to store
			// the updated q back into the map.
			t.connsPerHostWait[key] = q
		}
		if done {
			return
		}
	}

	// Otherwise, decrement the recorded count.
	if n--; n == 0 {
		delete(t.connsPerHost, key)
		fmt.Printf("Deleted key %v in connsPerHost map\n", key)
	} else {
		t.connsPerHost[key] = n
	}
}

If anyone has other ideas on how to reproduce or fix this, we are ready to help!
