Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: copy from Unix socket to os.Stdout fails with "waiting for unsupported file type" #59041

Closed
ianlancetaylor opened this issue Mar 14, 2023 · 21 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker
Milestone

Comments

@ianlancetaylor
Copy link
Contributor

ianlancetaylor commented Mar 14, 2023

On HEAD this test function fails:

func TestCopyUnixToStdout(t *testing.T) {
	sockPath := filepath.Join(t.TempDir(), "unixsock")
	addr, err := net.ResolveUnixAddr("unix", sockPath)
	if err != nil {
		t.Fatal(err)
	}

	ln, err := net.ListenUnix("unix", addr)
	if err != nil {
		t.Fatal(err)
	}
	defer ln.Close()

	go func() {
		server, err := ln.Accept()
		if err != nil {
			t.Error(err)
			return
		}
		defer func() {
			if err := server.Close(); err != nil {
				t.Error(err)
			}
		}()
		for i := 0; i < 100; i++ {
			if _, err := server.Write(bytes.Repeat([]byte{'a'}, 1024)); err != nil {
				t.Error(err)
				return
			}
		}
	}()

	conn, err := net.Dial(ln.Addr().Network(), ln.Addr().String())
	if err != nil {
		t.Fatal(err)
	}

	if _, err := io.Copy(os.Stdout, conn); err != nil {
		t.Fatal(err)
	}
}

I see

foo_test.go:50: write /dev/stdout: waiting for unsupported file type

Here line 50 is the t.Fatal called when io.Copy fails.

This test passes with Go 1.20.

This is most likely due to https://go.dev/cl/466015. CC @panjf2000

@ianlancetaylor ianlancetaylor added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker labels Mar 14, 2023
@ianlancetaylor ianlancetaylor added this to the Go1.21 milestone Mar 14, 2023
@ianlancetaylor
Copy link
Contributor Author

Actually it looks like this can happen with TCP as well, so it may not be CL 466015.

@gopherbot
Copy link

Change https://go.dev/cl/476335 mentions this issue: os, internal/poll: don't use splice with tty

@nightlyone
Copy link
Contributor

I don't understand why there is a stat call needed asking the inode caches to achieve this instead of the much cheaper and more obviously correct normal tty check that only checks for presence a kernel file operation (equivalent to a static itab compare in Go). See e.g. https://github.com/mattn/go-isatty how to achieve this.

Checking for terminal only would still allow splicing to high speed device drivers that implement it correctly.

@bcmills
Copy link
Contributor

bcmills commented Mar 16, 2023

Honestly, we've accumulated so many special-case checks on that path by now that I wonder if it would be more efficient to just try the splice call and fall back if it fails.

@ianlancetaylor
Copy link
Contributor Author

@bcmills The problem in this particular case is that we actually use two splice calls, one from the source to a pipe, and one from the pipe to the destination. We do already try the first splice call and give up if it fails. But if it works then we have started reading from the source, so if the second splice fails then we are stuck. This change is trying to detect whether the second splice will work.

Though now that I write that it occurs to me that perhaps we could add still more complexity: if the second splice fails, we could just copy the data ourselves. The result would likely be slower overall but it might work.

@nightlyone Thanks for the suggestion.

@ianlancetaylor
Copy link
Contributor Author

I now think that the stat call was a mistake on my part, and it may not be required at all. Taking it back out. Thanks for raising the issue.

@gopherbot
Copy link

Change https://go.dev/cl/477035 mentions this issue: os: don't check for TTY before calling splice

@andybalholm
Copy link
Contributor

@ianlancetaylor

Was removing spliceNonblock a mistake?

We've been running into an issue on a couple of our proxy servers (running github.com/andybalholm/redwood) where certain downloads time out instead of completing normally. We can't reproduce the issue on other machines, and the effects aren't even quite the same on the two machines. But it's quite consistent on those machines.

So we did a git bisect, and the commit associated with this issue is where the problem starts. The problem goes away if we revert the part of the change that relates to spliceNonblock:

diff --git a/src/internal/poll/splice_linux.go b/src/internal/poll/splice_linux.go
index 9505c5dcfc..4690962ae6 100644
--- a/src/internal/poll/splice_linux.go
+++ b/src/internal/poll/splice_linux.go
@@ -13,6 +13,9 @@ import (
 )
 
 const (
+       // spliceNonblock makes calls to splice(2) non-blocking.
+       spliceNonblock = 0x2
+
        // maxSpliceSize is the maximum amount of data Splice asks
        // the kernel to move in a single call to splice(2).
        // We use 1MB as Splice writes data through a pipe, and 1MB is the default maximum pipe buffer size,
@@ -89,7 +92,7 @@ func spliceDrain(pipefd int, sock *FD, max int) (int, error) {
                return 0, err
        }
        for {
-               n, err := splice(pipefd, sock.Sysfd, max, 0)
+               n, err := splice(pipefd, sock.Sysfd, max, spliceNonblock)
                if err == syscall.EINTR {
                        continue
                }
@@ -127,7 +130,7 @@ func splicePump(sock *FD, pipefd int, inPipe int) (int, error) {
        }
        written := 0
        for inPipe > 0 {
-               n, err := splice(sock.Sysfd, pipefd, inPipe, 0)
+               n, err := splice(sock.Sysfd, pipefd, inPipe, spliceNonblock)
                // Here, the condition n == 0 && err == nil should never be
                // observed, since Splice controls the write side of the pipe.
                if n > 0 {

When we make this change to the tip of the master branch, our problem goes away, and all tests still pass.

Interestingly, the removal of spliceNonblock is not mentioned in the commit message. So maybe it was not really intended…?

@ianlancetaylor
Copy link
Contributor Author

I removed spliceNonblock because it didn't seem to be serving any purpose, and because all the tests passed even when it was removed. I still don't see any purpose for it. The descriptors are expected to already be in non-blocking mode anyhow.

Is there any difference in the kernel version on the machines where your code fails?

Can you share a test case that demonstrates the failure? Thanks.

@andybalholm
Copy link
Contributor

Both of the machines with the problem are running Linux kernel 3.10.0 (from CentOS 7). We do have a few different kernel versions in use, but some of the ones that are not affected are also running 3.10.0.

I have no idea how to make a test case. For my git bisect work, I would compile a new binary of Redwood, deploy it to one of the routers, and message a user on the LAN behind the router, asking him to download a URL that was giving him problems. I don't even know whether the problem was on the client side or the server side of the proxy.

@andybalholm
Copy link
Contributor

Looking at the documentation for SPLICE_F_NONBLOCK in the man page for splice(2), it looks like it's somewhat of a separate thing from the nonblocking status of the file descriptors:

Do not block on I/O. This makes the splice pipe operations nonblocking, but splice() may nevertheless block because the file descriptors that are spliced to/from may block (unless they have the O_NONBLOCK flag set).

If the files are blocking, the splice call can block even if SPLICE_F_NONBLOCK is set. It doesn't say what could happen if the files are nonblocking but SPLICE_F_NONBLOCK is set, but maybe there are circumstances where the splice pipe operations themselves can block.

@panjf2000
Copy link
Member

panjf2000 commented Oct 17, 2023

Actually, I believe that the SPLICE_F_NONBLOCK flag was removed by mistake.

As far as I can tell, the original intent of SPLICE_F_NONBLOCK added by Linus Torvalds in this commit was for the flag to affect the buffering, not the endpoints. As you can see in that commit, it took effect after the data transfer, even though it doesn't make the splice itself necessarily nonblocking, but it could, with O_NONBLOCK set in file descriptors.

Furthermore, the Linux kernel added SPLICE_F_NONBLOCK along with O_NONBLOCK implicitly since kernel v5.1, to avoid the inconsistent behaviors with pipe file descriptors with O_NONBLOCK in splice, check out this commit for more details, which means that we don't need to explicitly assign SPLICE_F_NONBLOCK for splice since kernel v5.1 while we still have to do it in versions before v5.1.

I've drafted a CL for this.

@ianlancetaylor

@gopherbot
Copy link

Change https://go.dev/cl/536015 mentions this issue: internal/poll: add SPLICE_F_NONBLOCK flag for splice to avoid inconsistency with O_NONBLOCK

@ianlancetaylor
Copy link
Contributor Author

@panjf2000 Thanks for the analysis.

gopherbot pushed a commit that referenced this issue Oct 19, 2023
…stency with O_NONBLOCK

For #59041

Details: #59041 (comment)

Change-Id: Id3fc1df6d86b7c4cc383d09f9465fa8f4cc2a559
Reviewed-on: https://go-review.googlesource.com/c/go/+/536015
Reviewed-by: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Ian Lance Taylor <iant@google.com>
yunginnanet pushed a commit to yunginnanet/go that referenced this issue Oct 20, 2023
…stency with O_NONBLOCK

For golang#59041

Details: golang#59041 (comment)

Change-Id: Id3fc1df6d86b7c4cc383d09f9465fa8f4cc2a559
Reviewed-on: https://go-review.googlesource.com/c/go/+/536015
Reviewed-by: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Ian Lance Taylor <iant@google.com>
@sandeshrk
Copy link

@ianlancetaylor, @panjf2000 - thanks for the analysis and fix.
We are consistently running into this issue with golang.org/x/net@v0.0.17 (go 1.21.3) with a proxy server. The application communicates with a remote server through a goproxy. The transfer starts of normally and after a few seconds (and approximately 10MB of download), connection completely stalls. This is the stack trace of the goroutines that have stalled.

goroutine 1643 [IO wait]:
internal/poll.runtime_pollWait(0x7f282aa52678, 0x72)
/go/1.21.3/go/src/runtime/netpoll.go:343 +0x85
internal/poll.(*pollDesc).wait(0xc000684400?, 0xc000a93ce8?, 0x0)
/go/1.21.3/go/src/internal/poll/fd_poll_runtime.go:84 +0x27
internal/poll.(*pollDesc).waitRead(...)
/go/1.21.3/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.spliceDrain(0xc0002be680?, 0xc000684400, 0x112?)
/go/1.21.3/go/src/internal/poll/splice_linux.go:100 +0x22f
internal/poll.Splice(0x0?, 0x0?, 0x7fffffffffffffff)
/go/1.21.3/go/src/internal/poll/splice_linux.go:42 +0x173
net.splice(0x0?, {0x1ad3d80?, 0xc0006224e0?})
/go/1.21.3/go/src/net/splice_linux.go:39 +0xdf
net.(*TCPConn).readFrom(0xc00051a0e0, {0x1ad3d80, 0xc0006224e0})
/go/1.21.3/go/src/net/tcpsock_posix.go:48 +0x28
net.(*TCPConn).ReadFrom(0xc00051a0e0, {0x1ad3d80?, 0xc0006224e0?})
/go/1.21.3/go/src/net/tcpsock.go:130 +0x30
io.copyBuffer({0x1ad3e60, 0xc00051a0e0}, {0x1ad3d80, 0xc0006224e0}, {0x0, 0x0, 0x0})
/go/1.21.3/go/src/io/io.go:416 +0x147
io.Copy(...)
/go/1.21.3/go/src/io/io.go:389
github.com/elazarl/goproxy.copyAndClose(0x0?, {0x7f282a9d5258?, 0xc00051a0e0}, {0x7f282a9d5258?, 0xc0006224e0})
go/pkg/mod/github.com/elazarl/goproxy@v0.0.0-20200310082302-296d8939dc5a/https.go:310 +0x8a
created by github.com/elazarl/goproxy.(*ProxyHttpServer).handleHttps in goroutine 364
go/pkg/mod/github.com/elazarl/goproxy@v0.0.0-20200310082302-296d8939dc5a/https.go:116 +0x8cd

goroutine 1644 [syscall]:
syscall.Syscall6(0x7f282aa526a0?, 0xc000439c68?, 0x46b985?, 0xc000439c78?, 0x4f005c?, 0xc000439c90?, 0x4f01c7?)
/go/1.21.3/go/src/syscall/syscall_linux.go:91 +0x30
syscall.Splice(0xc0002be680?, 0xc000439ce8?, 0x0?, 0x1443500?, 0x1443500?, 0xc000439d00?)
/go/1.21.3/go/src/syscall/zsyscall_linux_amd64.go:1356 +0x45
internal/poll.splice(...)
/go/1.21.3/go/src/internal/poll/splice_linux.go:155
internal/poll.spliceDrain(0xc000684400?, 0xc0002be680, 0xb50?)
/go/1.21.3/go/src/internal/poll/splice_linux.go:92 +0x185
internal/poll.Splice(0x40d86f?, 0x1652098?, 0x7fffffffffffffff)
/go/1.21.3/go/src/internal/poll/splice_linux.go:42 +0x173
net.splice(0x0?, {0x1ad3d80?, 0xc00051a0e0?})
/go/1.21.3/go/src/net/splice_linux.go:39 +0xdf
net.(*TCPConn).readFrom(0xc0006224e0, {0x1ad3d80, 0xc00051a0e0})
/go/1.21.3/go/src/net/tcpsock_posix.go:48 +0x28
net.(*TCPConn).ReadFrom(0xc0006224e0, {0x1ad3d80?, 0xc00051a0e0?})
/go/1.21.3/go/src/net/tcpsock.go:130 +0x30
io.copyBuffer({0x1ad3e60, 0xc0006224e0}, {0x1ad3d80, 0xc00051a0e0}, {0x0, 0x0, 0x0})
/go/1.21.3/go/src/io/io.go:416 +0x147
io.Copy(...)
/go/1.21.3/go/src/io/io.go:389
github.com/elazarl/goproxy.copyAndClose(0xc000439fd0?, {0x7f282a9d5258?, 0xc0006224e0}, {0x7f282a9d5258?, 0xc00051a0e0})
go/pkg/mod/github.com/elazarl/goproxy@v0.0.0-20200310082302-296d8939dc5a/https.go:310 +0x8a
created by github.com/elazarl/goproxy.(*ProxyHttpServer).handleHttps in goroutine 364
go/pkg/mod/github.com/elazarl/goproxy@v0.0.0-20200310082302-296d8939dc5a/https.go:117 +0x96f

Would you know when the fix for https://go.dev/cl/536015 will be available in git? I don't see this as an open issue. Would the fix make it in 1.21.4 or will it be available in a new version of golang.org/x/net?

Thanks!

@panjf2000
Copy link
Member

Could you pull the latest code of Go from the master branch by which you can verify if your issue is fixed?
@sandeshrk

@sebastien-rosset
Copy link

sebastien-rosset commented Oct 27, 2023

Could you pull the latest code of Go from the master branch by which you can verify if your issue is fixed? @sandeshrk

We have created a test program to reproduce the issue in a Linux system with with kernel < 5.1.

  1. The program hangs if it has been compiled with go 1.21.3. Looking at the git history of internal/poll/splice_linux.go, the issue also exists in all 1.21.x versions, though we haven't actually tested each patch version.
  2. The program completes successfully if it has been compiled with go master. The fix is https://go-review.googlesource.com/c/go/+/536015.

We need to do some code cleanup before we can share a minimum program to reproduce.

Should this issue be re-opened? Or should a new issue be created? It seems the issue is severe enough that it should be backported to the 1.21 branch.

@sandeshrk
Copy link

The attached go file can be used to recreate/validate the fix.
The go program...

  1. Starts a goProxy in a goroutine which blocks to accept a connection ([127.0.0.1:9090])
  2. Starts a Downloader connects in another goroutune. Downloader connects to the proxy ([127.0.0.1:9090]) and attempts to download a file.

Stalled:
linux#./proxy
Starting GoProxy
Download Started
Downloading... 31 MB complete

Success:
./proxy
Starting GoProxy
Download Started
Downloading... 1.1 GB complete
Download Finished

proxy.go.txt

@panjf2000
Copy link
Member

We should open a new issue for the subsequent discussions and link it to this issue.

Could you open a new issue to which you can move those previous comments and attachments about the suspicious behaviors (including what @andybalholm mentioned earlier)? Thanks! @sebastien-rosset @sandeshrk

@sebastien-rosset
Copy link

I have created #63795

@gopherbot
Copy link

Change https://go.dev/cl/538117 mentions this issue: [release-branch.go1.21] internal/poll: add SPLICE_F_NONBLOCK flag for splice to avoid inconsistency with O_NONBLOCK

gopherbot pushed a commit that referenced this issue Nov 8, 2023
… splice to avoid inconsistency with O_NONBLOCK

Fixes #63801
Updates #59041
Updates #63795

Details: #59041 (comment)

Change-Id: Id3fc1df6d86b7c4cc383d09f9465fa8f4cc2a559
Reviewed-on: https://go-review.googlesource.com/c/go/+/536015
Reviewed-by: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Ian Lance Taylor <iant@google.com>
(cherry picked from commit 40cdf69)
Reviewed-on: https://go-review.googlesource.com/c/go/+/538117
Auto-Submit: Heschi Kreinick <heschi@google.com>
Reviewed-by: Mauri de Souza Meneguzzo <mauri870@gmail.com>
Reviewed-by: Heschi Kreinick <heschi@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker
Projects
None yet
Development

No branches or pull requests

8 participants