Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: document that on some systems SetLinger causes conn.Close to block #58882

Closed
StephenButtolph opened this issue Mar 5, 2023 · 17 comments
Closed
Labels
Documentation FrozenDueToAge help wanted NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@StephenButtolph
Copy link

What version of Go are you using (go version)?

$ go version
go version go1.20.1 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/stephen/.cache/go-build"
GOENV="/home/stephen/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/stephen/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/stephen/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.20.1"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build223210887=/tmp/go-build -gno-record-gcc-switches"

What did you do?

We added an explicit call to SetLinger(15) here: ava-labs/avalanchego@1a2dca1

What did you expect to see?

go/src/net/tcpsock.go

Lines 161 to 173 in b94dc38

// SetLinger sets the behavior of Close on a connection which still
// has data waiting to be sent or to be acknowledged.
//
// If sec < 0 (the default), the operating system finishes sending the
// data in the background.
//
// If sec == 0, the operating system discards any unsent or
// unacknowledged data.
//
// If sec > 0, the data is sent in the background as with sec < 0. On
// some operating systems after sec seconds have elapsed any remaining
// unsent data may be discarded.
func (c *TCPConn) SetLinger(sec int) error {
claims that providing SetLinger with a positive value will:

// If sec > 0, the data is sent in the background as with sec < 0. On
// some operating systems after sec seconds have elapsed any remaining
// unsent data may be discarded.

We expected for the OS to flush any outstanding data over the TCP stream in the background.

What did you see instead?

It doesn't seem that the data is being sent in the backaground, but that conn.Close() may block until the specified timeout.

I don't think this is actually unexpected for the behavior of SO_LINGER:

SO_LINGER
Lingers on a close() if data is present. This option controls the action taken when unsent messages queue on a socket and close() is performed. If SO_LINGER is set, the system shall block the process during close() until it can transmit the data or until the time expires. If SO_LINGER is not specified, and close() is issued, the system handles the call in a way that allows the process to continue as quickly as possible. This option takes a linger structure, as defined in the <sys/socket.h> header, to specify the state of the option and linger interval.

However, I feel like the comment on SetLinger seems to contradict the man pages.

@panjf2000
Copy link
Member

panjf2000 commented Mar 6, 2023

UPDATE:

I did a little more in-depth research:

  • On BSD and other Unix-like OS's: calling close(2) on a non-blocking socket does not block, it returns immediately and the kernel tries sending the leftover data to the peer.
  • On Linux: calling close(2) on a non-blocking socket may block for the duration (seconds) to linger.

In fact, a socket in non-blocking mode will not block calling close() even with SO_LINGER set, and Go manages to set SOCK_NONBLOCK on each socket internally, therefore net.Conn with SO_LINGER should not block when it calls Close().

From my point of view, the comment on SetLinger might be a bit too concise and could be better worded but I would consider it accurate.

Or, are you actually experiencing the blocking from net.Conn.Close() after calling SetLinger with sec > 0?

@panjf2000 panjf2000 added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Mar 6, 2023
@ianlancetaylor
Copy link
Contributor

What is your goal here?

The default behavior is for any remaining unsent socket data to be sent in the background (the usual TCP timeouts continue to apply, so the data won't hang around indefinitely).

Calling SetLinger with a positive N seconds means that after N seconds the remaining unsent socket data to be discarded. Is that what you want?

Since Go uses non-blocking I/O, as @panjf2000 says, the behavior of SetLinger shouldn't normally affect whether Close blocks or not.

The most common use of SetLinger is to call it with 0, meaning that any unsent data will be immediately discarded.

@StephenButtolph
Copy link
Author

Or, are you actually experiencing the blocking from net.Conn.Close() after calling SetLinger with sec > 0?

It seems that way (at least to me). After running dlv on the process, the Close call seems to be blocked here: https://github.com/golang/go/blob/master/src/internal/poll/fd_unix.go#L116. Which wasn't happening before our usage of SetLinger.

@StephenButtolph
Copy link
Author

The default behavior is for any remaining unsent socket data to be sent in the background (the usual TCP timeouts continue to apply, so the data won't hang around indefinitely).

Calling SetLinger with a positive N seconds means that after N seconds the remaining unsent socket data to be discarded. Is that what you want?

The goal here is mainly around owning the amount of time for the server to attempt to flush the data before dropping it; rather than relying on an amorphous OS default.

@panjf2000
Copy link
Member

panjf2000 commented Mar 6, 2023

Or, are you actually experiencing the blocking from net.Conn.Close() after calling SetLinger with sec > 0?

It seems that way (at least to me). After running dlv on the process, the Close call seems to be blocked here: https://github.com/golang/go/blob/master/src/internal/poll/fd_unix.go#L116. Which wasn't happening before our usage of SetLinger.

If this is the case, then I believe there is nothing to do with SO_LINGER and it might be a misuse, I think #21856 may help, check if there is a similar usage in your code, concretely, is there any other places still holding the reference of that connection? which may stop the c.Close() returning.

@StephenButtolph
Copy link
Author

StephenButtolph commented Mar 6, 2023

If this is the case, then I believe there is nothing to do with SO_LINGER and it might be a misuse, I think #21856 may help, check if there is a similar usage in your code, concretely, is there any other places still holding the reference of that connection? which may stop the c.Close() returning.

Hm, there are other goroutines calling Read and Write which I'm expecting the call to Close to end up cuasing to return an error based on the Conn interface: https://github.com/golang/go/blob/master/src/net/net.go#L128-L130.

The main difference here is that Close is not blocking on the listener, but the actual TCPConn.

@StephenButtolph
Copy link
Author

StephenButtolph commented Mar 6, 2023

In fact, a socket in non-blocking mode will not block calling close() even with SO_LINGER set, and Go manages to set SOCK_NONBLOCK on each socket internally, therefore net.Conn with SO_LINGER should not block when it calls Close().

I think this may actually be the issue. It seems that on linux systems, calling close() with SO_LINGER set with a positive timeout on a SOCK_NONBLOCKed socket does actually block.

Apologies for the sketchy link... (let me know if I can send this info better somehow...) https://www.nybek.com/blog/2015/04/29/so_linger-on-non-blocking-sockets/

Setting SO_LINGER to {on, N} where N > 0 on a non-blocking socket on Linux is a particularly bad idea. Even though the socket is in non-blocking mode, this call [close] will block.

nolingerresults

@panjf2000
Copy link
Member

panjf2000 commented Mar 6, 2023

Hm, there are other goroutines calling Read and Write which I'm expecting the call to Close to end up cuasing to return an error based on the Conn interface: https://github.com/golang/go/blob/master/src/net/net.go#L128-L130.

Sorry for the amphibolous statement about my previous comment, your usage of Close() here is totally justified, I was talking about some more sophisticated cases of misuse.

In fact, a socket in non-blocking mode will not block calling close() even with SO_LINGER set, and Go manages to set SOCK_NONBLOCK on each socket internally, therefore net.Conn with SO_LINGER should not block when it calls Close().

I think this may actually be the issue. It seems that on linux systems, calling close() with SO_LINGER set with a positive timeout on a SOCK_NONBLOCKed socket does actually block.

Apologies for the sketchy link... (let me know if I can send this info better somehow...) https://www.nybek.com/blog/2015/04/29/so_linger-on-non-blocking-sockets/

Setting SO_LINGER to {on, N} where N > 0 on a non-blocking socket on Linux is a particularly bad idea. Even though the socket is in non-blocking mode, this call [close] will block.

nolingerresults

Despite what this link says, according to your comment, It seems that way (at least to me). After running dlv on the process, the Close call seems to be blocked here: https://github.com/golang/go/blob/master/src/internal/poll/fd_unix.go#L116. Which wasn't happening before our usage of SetLinger. before, the code there blocking you is not the system call close(2), somehow the current goroutine that is closing the connection is put into a waiting state cuz runtime believes the underlying file descriptor is not closed and will wake that goroutine, then return net.Close() after the fd is actually closed.

@panjf2000 panjf2000 added this to the Backlog milestone Mar 6, 2023
@ianlancetaylor
Copy link
Contributor

The goal here is mainly around owning the amount of time for the server to attempt to flush the data before dropping it; rather than relying on an amorphous OS default.

My understanding is that the OS default is simply the TCP timeout. And my understanding is that SetLinger isn't going to increase the TCP timeout anyhow.

That said, @panjf2000 makes a good point: can you confirm that when you see a goroutine hanging in close, it is hanging specifically on the line runtime_Semacquire(&fd.csema) ? Because I don't understand how that could be affected by SetLinger.

Is there a test case we can run to recreate the problem?

@StephenButtolph
Copy link
Author

My understanding is that the OS default is simply the TCP timeout. And my understanding is that SetLinger isn't going to increase the TCP timeout anyhow.

Yeah, I wouldn't consider this a pressing issue for us, we're likely to just remove our usage of SetLinger given the apparent platform dependent behavior. At this point I'm mainly more interested in figuring out what is actually happening so that others don't run into the same issues we are.

That said, @panjf2000 makes a good point: can you confirm that when you see a goroutine hanging in close, it is hanging specifically on the line runtime_Semacquire(&fd.csema) ? Because I don't understand how that could be affected by SetLinger.

Yes. Here is a trace showing that (was run using our release binaries which were built with go1.19.6):

  Goroutine 1585 - Start: /home/runner/go/pkg/mod/golang.org/x/sync@v0.1.0/errgroup/errgroup.go:72 golang.org/x/sync/errgroup.(*Group).Go.func1 (0x78a5c0) [semacquire]
	 0  0x00000000004444d6 in runtime.gopark
	     at /opt/hostedtoolcache/go/1.19.6/x64/src/runtime/proc.go:364
	 1  0x000000000045543e in runtime.goparkunlock
	     at /opt/hostedtoolcache/go/1.19.6/x64/src/runtime/proc.go:369
	 2  0x000000000045543e in runtime.semacquire1
	     at /opt/hostedtoolcache/go/1.19.6/x64/src/runtime/sema.go:150
	 3  0x0000000000471e85 in internal/poll.runtime_Semacquire
	     at /opt/hostedtoolcache/go/1.19.6/x64/src/runtime/sema.go:67
	 4  0x00000000004e62ad in internal/poll.(*FD).Close
	     at /opt/hostedtoolcache/go/1.19.6/x64/src/internal/poll/fd_unix.go:116
	 5  0x0000000000549d18 in net.(*netFD).Close
	     at /opt/hostedtoolcache/go/1.19.6/x64/src/net/fd_posix.go:37
	 6  0x000000000055cea5 in net.(*conn).Close
	     at /opt/hostedtoolcache/go/1.19.6/x64/src/net/net.go:207
	 7  0x00000000006a730d in crypto/tls.(*Conn).Close
	     at /opt/hostedtoolcache/go/1.19.6/x64/src/crypto/tls/conn.go:1374
	 8  0x0000000000c10862 in github.com/ava-labs/avalanchego/network/peer.(*peer).StartClose.func1
	     at /home/runner/work/avalanchego-internal/avalanchego-internal/network/peer/peer.go:311
	 9  0x000000000048f082 in sync.(*Once).doSlow
	     at /opt/hostedtoolcache/go/1.19.6/x64/src/sync/once.go:74
	10  0x0000000000c107e7 in sync.(*Once).Do
	     at /opt/hostedtoolcache/go/1.19.6/x64/src/sync/once.go:65
	11  0x0000000000c107e7 in github.com/ava-labs/avalanchego/network/peer.(*peer).StartClose
	     at /home/runner/work/avalanchego-internal/avalanchego-internal/network/peer/peer.go:310
	(truncated)

Is there a test case we can run to recreate the problem?

I'll work on getting a reproducible test case in the next couple days.

@StephenButtolph
Copy link
Author

StephenButtolph commented Mar 6, 2023

After writing a very short program I was able to replicate conn.Close hanging - but this time dlv points to syscall.Close (which seems more in-line with the reported behavior of SO_LINGER...)

goroutines.txt

I got this from running this server:

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	fmt.Println("starting to listen")
	listener, err := net.Listen("tcp", ":7777")
	if err != nil {
		panic(err)
	}
	fmt.Println("listening")

	fmt.Println("waiting to accept")
	conn, err := listener.Accept()
	if err != nil {
		panic(err)
	}
	fmt.Println("accepted connection")

	fmt.Println("setting linger to 15s")
	tcpConn := conn.(*net.TCPConn)
	err = tcpConn.SetLinger(1500)
	if err != nil {
		panic(err)
	}
	fmt.Println("set linger to 15s")

	fmt.Println("writing some data")
	msg := make([]byte, 1<<20)
	n, err := conn.Write(msg)
	if err != nil {
		panic(err)
	}
	fmt.Printf("wrote %d bytes of data\n", n)

	startClose := time.Now()
	fmt.Println("starting to close the connection")
	err = conn.Close()
	if err != nil {
		panic(err)
	}
	fmt.Printf("closing the connection took %s\n", time.Since(startClose))
}

Along with running the client from the previously mentioned SO_LINGER tests: https://github.com/nybek/linger-tools/blob/master/linger-client.c with the arguments -i 127.0.0.1.

The output of the server should look like:

starting to listen
listening
waiting to accept
accepted connection
setting linger to 15s
set linger to 15s
writing some data
wrote 1048576 bytes of data
starting to close the connection
closing the connection took 15.082815133s

This was all run using the go version + go env listed at the beginning of the issue.

I feel like this is already a deviation from the documented behavior on SetLinger... But I'll keep looking into reproducing the hang at https://github.com/golang/go/blob/master/src/internal/poll/fd_unix.go#L116.

@StephenButtolph
Copy link
Author

Here is a pure golang example that includes both the server and the client and has the same results as above:

package main

import (
	"fmt"
	"net"
	"sync"
	"time"
)

func main() {
	fmt.Println("starting to listen")
	listener, err := net.Listen("tcp", ":")
	if err != nil {
		panic(err)
	}
	fmt.Println("listening")
	defer listener.Close()

	addr := listener.Addr()

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		fmt.Println("waiting to accept")
		conn, err := listener.Accept()
		if err != nil {
			panic(err)
		}
		fmt.Println("accepted connection")

		fmt.Println("setting linger to 15s")
		tcpConn := conn.(*net.TCPConn)
		err = tcpConn.SetLinger(15)
		if err != nil {
			panic(err)
		}
		fmt.Println("set linger to 15s")

		fmt.Println("writing some data")
		msg := make([]byte, 1<<20)
		n, err := conn.Write(msg)
		if err != nil {
			panic(err)
		}
		fmt.Printf("wrote %d bytes of data\n", n)

		startClose := time.Now()
		fmt.Println("starting to close the connection")
		err = conn.Close()
		if err != nil {
			panic(err)
		}
		fmt.Printf("closing the connection took %s\n", time.Since(startClose))
		wg.Done()
	}()

	conn, err := net.Dial("tcp", addr.String())
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	wg.Wait()
}

@StephenButtolph
Copy link
Author

Ok, I think I'm able to fully close the loop on this now. This program replicates the blocking on runtime_Semacquire.

Here is the output from dlv: goroutines.txt

package main

import (
	"fmt"
	"net"
	"sync"
	"time"
)

func main() {
	fmt.Println("starting to listen")
	listener, err := net.Listen("tcp", ":")
	if err != nil {
		panic(err)
	}
	fmt.Println("listening")
	defer listener.Close()

	addr := listener.Addr()

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		fmt.Println("waiting to accept")
		conn, err := listener.Accept()
		if err != nil {
			panic(err)
		}
		fmt.Println("accepted connection")

		fmt.Println("setting linger to 15s")
		tcpConn := conn.(*net.TCPConn)
		err = tcpConn.SetLinger(1500)
		if err != nil {
			panic(err)
		}
		fmt.Println("set linger to 15s")

		go func() {
			fmt.Println("starting read")
			_, _ = conn.Read([]byte{0})
			fmt.Println("exited read")
		}()

		fmt.Println("writing some data")
		msg := make([]byte, 1<<20)
		n, err := conn.Write(msg)
		if err != nil {
			panic(err)
		}
		fmt.Printf("wrote %d bytes of data\n", n)

		startClose := time.Now()
		fmt.Println("starting to close the connection")
		err = conn.Close()
		if err != nil {
			panic(err)
		}
		fmt.Printf("closing the connection took %s\n", time.Since(startClose))
		wg.Done()
	}()

	conn, err := net.Dial("tcp", addr.String())
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	wg.Wait()
}

It seems like the call to syscall.Close moved into the Read call, and conn.Close is waiting for Read to return.

@ianlancetaylor
Copy link
Contributor

Thanks. For me that program blocks for 15 seconds in the call to syscall.Close, which is consistent with what you said earlier. I guess that Linux behaves that way, which I was not previously aware of. That makes it seems that there is no bug from Go's perspective. There isn't much that Go can do if the system call blocks.

But I guess we can mention that in the SetLinger docs.

@ianlancetaylor ianlancetaylor changed the title net: SetLinger causes conn.Close to block net: document that on some systems SetLinger causes conn.Close to block Mar 7, 2023
@ianlancetaylor ianlancetaylor added help wanted NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Mar 7, 2023
@panjf2000
Copy link
Member

panjf2000 commented Mar 7, 2023

About the blocking on close(2) with SO_LINGER, see this previous comment:

UPDATE:

I did a little more in-depth research:

  • On BSD and other Unix-like OS's: calling close(2) on a non-blocking socket does not block, it returns immediately and the kernel tries sending the leftover data to the peer.
  • On Linux: calling close(2) on a non-blocking socket may block for the duration (seconds) to linger.

In fact, a socket in non-blocking mode will not block calling close() even with SO_LINGER set, and Go manages to set SOCK_NONBLOCK on each socket internally, therefore net.Conn with SO_LINGER should not block when it calls Close().

From my point of view, the comment on SetLinger might be a bit too concise and could be better worded but I would consider it accurate.

Or, are you actually experiencing the blocking from net.Conn.Close() after calling SetLinger with sec > 0?

Also, I've run the test code above on my macOS and it didn't reproduce this issue, which testifies my updated comment.

@gopherbot
Copy link

Change https://go.dev/cl/473915 mentions this issue: net: indicate the exeception on Linux of Close blocking with SO_LINGER

@panjf2000
Copy link
Member

Thank you for all the effort here.
@StephenButtolph

@golang golang locked and limited conversation to collaborators Mar 6, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Documentation FrozenDueToAge help wanted NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

4 participants