net: deadlock connecting to erroneous listener #46392

Closed

lithdew opened this issue May 26, 2021 · 1 comment

Comments

lithdew commented May 26, 2021

What version of Go are you using (go version)?

$ go version
go version go1.16.4 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/lith/.cache/go-build"
GOENV="/home/lith/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/lith/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/lith/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/nix/store/9g87plb4z4xgpg8jm6ch94j32yi6ycbn-go-1.16.4/share/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/nix/store/9g87plb4z4xgpg8jm6ch94j32yi6ycbn-go-1.16.4/share/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.16.4"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build3236558645=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Implemented a pooled TCP client that concurrently writes messages across 4 individual TCP connections, and used it to write 100 messages to netcat listening on port 9000.

Depending on whether there are queued messages that have yet to be written to the destination address, more than one TCP connection may be established and pooled to write the 100 messages.

What did you expect to see?

Netcat is known to accept only one incoming connection. So I expected that, should more than one TCP connection be established while writing the 100 messages to netcat, net.Dialer.DialContext() would return an error for the extra connections.
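
For illustration (this snippet is mine, not part of the report), a minimal sketch of that expectation: with a deadline attached to the dialing context, a dial that cannot complete should eventually fail with an error rather than block indefinitely:

package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	// Assumes something is listening on 127.0.0.1:9000 (e.g. nc -l 9000).
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	var d net.Dialer
	conn, err := d.DialContext(ctx, "tcp", "127.0.0.1:9000")
	if err != nil {
		// The expectation: a dial that cannot complete before the
		// deadline returns an error (e.g. "i/o timeout") here.
		fmt.Println("dial error:", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected:", conn.RemoteAddr())
}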

What did you see instead?

A deadlock on the connect() syscall (syscall.Connect()) itself.

(dlv) grs
  Goroutine 1 - User: /nix/store/9g87plb4z4xgpg8jm6ch94j32yi6ycbn-go-1.16.4/share/go/src/runtime/sema.go:56 sync.runtime_Semacquire (0x472645) [semacquire 450440h8m53.582824256s]
  Goroutine 2 - User: /nix/store/9g87plb4z4xgpg8jm6ch94j32yi6ycbn-go-1.16.4/share/go/src/runtime/proc.go:337 runtime.gopark (0x440755) [force gc (idle) 450440h8m53.582861217s]
  Goroutine 3 - User: /nix/store/9g87plb4z4xgpg8jm6ch94j32yi6ycbn-go-1.16.4/share/go/src/runtime/proc.go:337 runtime.gopark (0x440755) [GC sweep wait]
  Goroutine 4 - User: /nix/store/9g87plb4z4xgpg8jm6ch94j32yi6ycbn-go-1.16.4/share/go/src/runtime/proc.go:337 runtime.gopark (0x440755) [sleep]
  Goroutine 5 - User: /nix/store/9g87plb4z4xgpg8jm6ch94j32yi6ycbn-go-1.16.4/share/go/src/runtime/proc.go:337 runtime.gopark (0x440755) [finalizer wait 450440h8m53.582910923s]
* Goroutine 19 - User: /nix/store/9g87plb4z4xgpg8jm6ch94j32yi6ycbn-go-1.16.4/share/go/src/syscall/asm_linux_amd64.s:26 syscall.Syscall (0x492e9b) (thread 440468) [chan receive]
  Goroutine 3035 - User: /nix/store/9g87plb4z4xgpg8jm6ch94j32yi6ycbn-go-1.16.4/share/go/src/runtime/proc.go:337 runtime.gopark (0x440755) [GC worker (idle)]
  Goroutine 3214 - User: /nix/store/9g87plb4z4xgpg8jm6ch94j32yi6ycbn-go-1.16.4/share/go/src/runtime/proc.go:337 runtime.gopark (0x440755) [GC worker (idle)]
  Goroutine 3238 - User: /nix/store/9g87plb4z4xgpg8jm6ch94j32yi6ycbn-go-1.16.4/share/go/src/runtime/proc.go:337 runtime.gopark (0x440755) [GC worker (idle) 450440h8m53.582973511s]
  Goroutine 3239 - User: /nix/store/9g87plb4z4xgpg8jm6ch94j32yi6ycbn-go-1.16.4/share/go/src/runtime/proc.go:337 runtime.gopark (0x440755) [GC worker (idle)]
  Goroutine 3240 - User: /nix/store/9g87plb4z4xgpg8jm6ch94j32yi6ycbn-go-1.16.4/share/go/src/runtime/proc.go:337 runtime.gopark (0x440755) [GC worker (idle)]
  Goroutine 3241 - User: /nix/store/9g87plb4z4xgpg8jm6ch94j32yi6ycbn-go-1.16.4/share/go/src/runtime/proc.go:337 runtime.gopark (0x440755) [GC worker (idle)]
  Goroutine 3242 - User: /nix/store/9g87plb4z4xgpg8jm6ch94j32yi6ycbn-go-1.16.4/share/go/src/runtime/proc.go:337 runtime.gopark (0x440755) [GC worker (idle)]
  Goroutine 3243 - User: /nix/store/9g87plb4z4xgpg8jm6ch94j32yi6ycbn-go-1.16.4/share/go/src/runtime/proc.go:337 runtime.gopark (0x440755) [GC worker (idle)]
[14 goroutines]
(dlv) bt
 0  0x0000000000492e9b in syscall.Syscall
    at /nix/store/9g87plb4z4xgpg8jm6ch94j32yi6ycbn-go-1.16.4/share/go/src/syscall/asm_linux_amd64.s:26
 1  0x00000000004922f8 in syscall.connect
    at /nix/store/9g87plb4z4xgpg8jm6ch94j32yi6ycbn-go-1.16.4/share/go/src/syscall/zsyscall_linux_amd64.go:1425
 2  0x0000000000490ab3 in syscall.Connect
    at /nix/store/9g87plb4z4xgpg8jm6ch94j32yi6ycbn-go-1.16.4/share/go/src/syscall/syscall_unix.go:262
 3  0x00000000005062e5 in net.(*netFD).connect
    at /nix/store/9g87plb4z4xgpg8jm6ch94j32yi6ycbn-go-1.16.4/share/go/src/net/fd_unix.go:59
 4  0x0000000000521b69 in net.(*netFD).dial
    at /nix/store/9g87plb4z4xgpg8jm6ch94j32yi6ycbn-go-1.16.4/share/go/src/net/sock_posix.go:149
 5  0x0000000000520ec5 in net.socket
    at /nix/store/9g87plb4z4xgpg8jm6ch94j32yi6ycbn-go-1.16.4/share/go/src/net/sock_posix.go:70
 6  0x0000000000513c0d in net.internetSocket
    at /nix/store/9g87plb4z4xgpg8jm6ch94j32yi6ycbn-go-1.16.4/share/go/src/net/ipsock_posix.go:141
 7  0x000000000052436e in net.(*sysDialer).doDialTCP
    at /nix/store/9g87plb4z4xgpg8jm6ch94j32yi6ycbn-go-1.16.4/share/go/src/net/tcpsock_posix.go:65
 8  0x000000000052418c in net.(*sysDialer).dialTCP
    at /nix/store/9g87plb4z4xgpg8jm6ch94j32yi6ycbn-go-1.16.4/share/go/src/net/tcpsock_posix.go:61
 9  0x00000000004fb7f2 in net.(*sysDialer).dialSingle
    at /nix/store/9g87plb4z4xgpg8jm6ch94j32yi6ycbn-go-1.16.4/share/go/src/net/dial.go:580
10  0x00000000004facc5 in net.(*sysDialer).dialSerial
    at /nix/store/9g87plb4z4xgpg8jm6ch94j32yi6ycbn-go-1.16.4/share/go/src/net/dial.go:548
11  0x00000000004f9465 in net.(*Dialer).DialContext
    at /nix/store/9g87plb4z4xgpg8jm6ch94j32yi6ycbn-go-1.16.4/share/go/src/net/dial.go:425
12  0x0000000000534da5 in main.(*Client).spawnOrWaitForAvailableConnection.func1.1.1
    at ./cmd/main.go:111
13  0x0000000000535610 in main.(*Client).spawnOrWaitForAvailableConnection.func1.1
    at ./cmd/main.go:112
14  0x0000000000535c96 in main.(*Client).spawnOrWaitForAvailableConnection.func1
    at ./cmd/main.go:180
15  0x00000000004759c1 in runtime.goexit
    at /nix/store/9g87plb4z4xgpg8jm6ch94j32yi6ycbn-go-1.16.4/share/go/src/runtime/asm_amd64.s:1371

To reproduce, run netcat on port 9000 via nc -l 9000, then run the program below. Debugging via delve unravels the call stack, showing the hang on syscall.Connect.

package main

import (
	"bytes"
	"context"
	"fmt"
	"net"
	"strconv"
	"sync"
	"time"
)

type ClientConn struct {
	conn *net.TCPConn
}

type Client struct {
	address string

	mu sync.Mutex
	wg sync.WaitGroup

	closed  chan struct{}
	waiters []chan error

	pool [4]*ClientConn
	len  uint8

	queue chan []byte
}

func NewClient(address string) Client {
	return Client{
		address: address,
		queue:   make(chan []byte, 1024),
	}
}

func (c *Client) Close() {
	c.close()
	c.wg.Wait()
	close(c.queue)
}

func (c *Client) close() {
	c.mu.Lock()
	defer c.mu.Unlock()

	for _, waiter := range c.waiters {
		waiter <- context.Canceled
		close(waiter)
	}
	c.waiters = c.waiters[:0]

	c.closed = make(chan struct{})
	close(c.closed)

	for _, conn := range c.pool[:c.len] {
		if conn.conn != nil {
			conn.conn.Close()
		}
	}
}

// spawnOrWaitForAvailableConnection returns a waiter channel if we are waiting for a connection
// in our pool (which may or may not have just been spawned) to be ready, or nil otherwise. It
// executes the following pooling strategy:
//
//   - any connected, pool isn't full, write queue isn't empty: spawn a new connection
//   - any connected, pool isn't full, write queue is empty: don't spawn a new connection
//   - any connected, pool is full: don't spawn a new connection
//   - none connected, pool isn't full, write queue isn't empty: spawn a new connection
//   - none connected, pool isn't full, write queue is empty: spawn a new connection
//   - none connected, pool is full: don't spawn a new connection
func (c *Client) spawnOrWaitForAvailableConnection() (chan error, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if c.closed != nil {
		return nil, context.Canceled
	}

	pool := c.pool[0:c.len:cap(c.pool)]

	connected := false
	for _, conn := range pool {
		if conn.conn != nil {
			connected = true
			break
		}
	}

	if len(pool) < cap(c.pool) && (!connected || len(c.queue) > 0) {
		cc := &ClientConn{}
		c.pool[c.len] = cc
		c.len = c.len + 1
		c.wg.Add(1)

		go func() {
			defer c.wg.Done()

			for {
				loop := func() bool {
					var d net.Dialer

					conn, err := func() (net.Conn, error) {
						ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
						defer cancel()

						return d.DialContext(ctx, "tcp", c.address)
					}()
					if err != nil {
						if !c.reportConnectionError(cc, err) {
							return false
						}
						return true
					}

					if !c.reportConnected(cc, conn.(*net.TCPConn)) {
						conn.Close()
						return false
					}

					writerDone := make(chan error)
					writerStop := make(chan struct{})
					readerDone := make(chan error)

					go func() {
						writerDone <- func() error {
							for {
								select {
								case <-writerStop:
									return nil
								case buf, open := <-c.queue:
									if !open {
										return context.Canceled
									}

									_, err := cc.conn.Write(buf)
									if err != nil {
										return err
									}
								}
							}
						}()
					}()

					go func() {
						readerDone <- func() error {
							var buf [1024]byte
							for {
								n, err := cc.conn.Read(buf[0:])
								if err != nil {
									return err
								}
								if n == 0 {
									return nil
								}
								fmt.Printf("Got: '%s'\n", bytes.TrimSpace(buf[0:n]))
							}
						}()
					}()

					select {
					case <-writerDone:
						conn.Close()
						<-readerDone
					case <-readerDone:
						conn.Close()
						close(writerStop)
						<-writerDone
					}

					return !c.reportDisconnected(cc)
				}()

				if !loop {
					break
				}
			}
		}()
	}

	if !connected {
		waiter := make(chan error, 1)
		c.waiters = append(c.waiters, waiter)
		return waiter, nil
	}

	return nil, nil
}

func (c *Client) tryRemoveWaiter(waiter chan error) bool {
	// Attempt to remove our waiter entry from the waiters list. It is possible
	// that our waiter has already been removed from the waiter list.

	// If we did not find our waiter entry, then that means that a connection is available.
	// This is needed because we are racing whether or not a pool connection is available
	// against ctx.Done().

	c.mu.Lock()
	defer c.mu.Unlock()

	waiters := c.waiters[:0]
	found := false

	for _, w := range c.waiters {
		if w == waiter {
			found = true
			continue
		}
		waiters = append(waiters, w)
	}
	c.waiters = waiters

	if !found {
		return false
	}

	close(waiter)
	return true
}

func (c *Client) reportConnected(cc *ClientConn, dialed *net.TCPConn) bool {
	c.mu.Lock()
	defer c.mu.Unlock()

	if c.closed != nil {
		pool := c.pool[:0]
		for _, conn := range c.pool {
			if cc == conn {
				continue
			}
			pool = append(pool, conn)
		}
		c.len = c.len - 1

		return false
	}

	cc.conn = dialed

	for _, waiter := range c.waiters {
		close(waiter)
	}
	c.waiters = c.waiters[:0]

	return true
}

func (c *Client) reportConnectionError(cc *ClientConn, err error) bool {
	c.mu.Lock()
	defer c.mu.Unlock()

	// If cc is not the last connection left, do not report
	// the error to the existing waiters.

	// Remove client connection entry from the pool.

	if c.len > 1 {
		pool := c.pool[:0]
		for _, conn := range c.pool {
			if cc == conn {
				continue
			}
			pool = append(pool, conn)
		}
		c.len = c.len - 1

		return false
	}

	for _, waiter := range c.waiters {
		waiter <- err
		close(waiter)
	}
	c.waiters = c.waiters[:0]

	return true
}

func (c *Client) reportDisconnected(cc *ClientConn) bool {
	c.mu.Lock()
	defer c.mu.Unlock()

	cc.conn = nil

	if c.closed != nil || c.len > 1 {
		pool := c.pool[:0]
		for _, conn := range c.pool {
			if cc == conn {
				continue
			}
			pool = append(pool, conn)
		}
		c.len = c.len - 1

		return true
	}

	return false
}

func (c *Client) EnsureConnectionAvailable(ctx context.Context) error {
	waiter, err := c.spawnOrWaitForAvailableConnection()
	if err != nil {
		return err
	}
	if waiter == nil {
		return nil
	}

	select {
	case <-c.closed:
		if !c.tryRemoveWaiter(waiter) {
			return nil
		}
		return context.Canceled
	case <-ctx.Done():
		if !c.tryRemoveWaiter(waiter) {
			return nil
		}
		return ctx.Err()
	case err := <-waiter:
		return err
	}
}

func (c *Client) Write(ctx context.Context, buf []byte) error {
	if err := c.EnsureConnectionAvailable(ctx); err != nil {
		return err
	}

	select {
	case <-c.closed:
		return context.Canceled
	case <-ctx.Done():
		return ctx.Err()
	case c.queue <- buf:
		return nil
	}
}

func main() {
	client := NewClient("127.0.0.1:9000")
	defer client.Close()

	for i := 0; i < 100; i++ {
		func() {
			ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
			defer cancel()

			if err := client.Write(ctx, append([]byte(strconv.FormatInt(int64(i), 10)), '\n')); err != nil {
				panic(err)
			}
		}()
	}
}
lithdew changed the title from "net: deadlock connecting to erroneous listener" to "net: deadlock connecting to listener that only supports accepting one client" on May 26, 2021
lithdew changed the title back to "net: deadlock connecting to erroneous listener" on May 26, 2021
lithdew closed this as completed on May 26, 2021

lithdew commented May 26, 2021

I'll try to come up with a simpler reproducer for this issue; closing for now.
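
For reference, a hedged sketch of what a simpler reproducer might look like (my own guess, not the author's follow-up): a listener that never calls Accept, so that once the kernel's accept backlog fills, further connect() attempts stall until the dial timeout fires. The point at which dials start to stall depends on kernel settings such as net.core.somaxconn:

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Listen but never Accept: completed handshakes pile up in the
	// kernel's accept backlog until it is full.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	var conns []net.Conn // keep dialed connections alive
	for i := 0; ; i++ {
		start := time.Now()
		conn, err := net.DialTimeout("tcp", ln.Addr().String(), 3*time.Second)
		if err != nil {
			// Once the backlog is full, further dials stall and time out
			// here (behavior depends on kernel settings such as
			// net.core.somaxconn and tcp_abort_on_overflow).
			fmt.Printf("dial %d failed after %v: %v\n", i, time.Since(start), err)
			break
		}
		fmt.Printf("dial %d succeeded in %v\n", i, time.Since(start))
		conns = append(conns, conn)
	}
	for _, conn := range conns {
		conn.Close()
	}
}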

golang locked and limited conversation to collaborators on May 26, 2022