
net/http: slow io.Copy(http.ResponseWriter, *os.File) with Content-Length #41513

Open
Ryan-A-B opened this issue Sep 21, 2020 · 20 comments
Labels
help wanted NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@Ryan-A-B

What version of Go are you using (go version)?

$ go version
go version go1.15.2 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build447228345=/tmp/go-build -gno-record-gcc-switches"

What did you do?

package http_test

import (
	"io"
	"io/ioutil"
	"log"
	"net/http"
	"net/http/httptest"
	"os"
	"strconv"
	"testing"
)

func runHandlerBenchmark(b *testing.B, handler http.Handler) {
	server := httptest.NewServer(handler)
	defer server.Close()

	client := server.Client()
	for i := 0; i < b.N; i++ {
		response, err := client.Get(server.URL)
		if err != nil {
			log.Fatal(err.Error())
		}
		_, err = ioutil.ReadAll(response.Body)
		if err != nil {
			log.Fatal(err.Error())
		}
	}
}

type HandlerFile struct {
	callback func(response http.ResponseWriter, file *os.File)
}

func (handler *HandlerFile) ServeHTTP(response http.ResponseWriter, request *http.Request) {
	file, err := os.Open("./main_test.go")
	if err != nil {
		log.Fatal(err.Error())
	}
	defer file.Close()
	handler.callback(response, file)
	_, err = io.Copy(response, file)
	if err != nil {
		log.Fatal(err.Error())
	}
}

func BenchmarkFileWithoutContentLength(b *testing.B) {
	doNothing := func(response http.ResponseWriter, file *os.File) {}
	handler := HandlerFile{
		callback: doNothing,
	}
	runHandlerBenchmark(b, &handler)
}

func BenchmarkFileWithContentLength(b *testing.B) {
	addContentLength := func(response http.ResponseWriter, file *os.File) {
		stat, err := file.Stat()
		if err != nil {
			log.Fatal(err.Error())
		}
		response.Header().Set("Content-Length", strconv.FormatInt(stat.Size(), 10))
	}
	handler := HandlerFile{
		callback: addContentLength,
	}
	runHandlerBenchmark(b, &handler)
}

What did you expect to see?

Both benchmarks to take the same amount of time

What did you see instead?

goos: linux
goarch: amd64
BenchmarkFileWithoutContentLength-8        12226             95274 ns/op
BenchmarkFileWithContentLength-8              20          52161420 ns/op
PASS
@ianlancetaylor ianlancetaylor changed the title Slow io.Copy(http.ResponseWriter, *os.File) with Content-Length net/http: slow io.Copy(http.ResponseWriter, *os.File) with Content-Length Sep 21, 2020
@ianlancetaylor ianlancetaylor added help wanted NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Sep 21, 2020
@ianlancetaylor ianlancetaylor added this to the Backlog milestone Sep 21, 2020
@ianlancetaylor
Contributor

Interesting test case. I'm not sure what is going on here. I can see that BenchmarkFileWithContentLength uses the sendfile system call while BenchmarkFileWithoutContentLength does not, but I don't know why it would lead to such a significant difference in benchmark time.

If I use a much larger file, rather than the main_test.go file, then BenchmarkFileWithContentLength is faster. In that case BenchmarkFileWithoutContentLength is going to be using more system calls.

Perhaps sendfile is not a great choice for short files, but I don't know why the difference would be so dramatic.

@cuonglm
Member

cuonglm commented Sep 21, 2020

The problem is in chunkWriter.writeHeader: when a Content-Length header is detected, chunkWriter.chunking is not set. That causes response.ReadFrom to use net.Conn.ReadFrom instead of wrapping the response in writerOnly.

A fix could be to enable chunking even when a Content-Length header is detected, but I'm not sure about it.

Any ideas, @bradfitz @odeke-em?

@darkLord19
Contributor

On macOS I am getting both benchmarks close to each other.

goos: darwin
goarch: amd64
BenchmarkFileWithoutContentLength-8   	   11030	    106178 ns/op
BenchmarkFileWithContentLength-8      	   10000	    106606 ns/op
PASS

@ianlancetaylor
Contributor

@darkLord19 That's because macOS doesn't have sendfile.

@cuonglm Thanks for spotting where the behavior changes. I don't think it would be right to always enable chunking because as noted above using sendfile is actually faster when using a larger file.

I think we need to understand why sendfile is so much slower for short files.

CC @acln0 for any thoughts.

@acln0
Contributor

acln0 commented Sep 21, 2020

When I tried to reproduce on my machine, I got this:

: forge:http-sendfile-bench $ go test -bench .
goos: linux
goarch: amd64
pkg: acln.ro/http-sendfile-bench
BenchmarkFileWithoutContentLength-4   	   16884	     70872 ns/op
BenchmarkFileWithContentLength-4      	   18878	     62345 ns/op
PASS
ok  	acln.ro/http-sendfile-bench	3.743s

... which looks quite reasonable. I don't know what is going on here.

I have a theory: I notice that the benchmark calls client.Get, but never closes response.Body. It is my understanding that this can cause a file descriptor leak. Perhaps the first round of benchmarks leaves just enough file descriptors hanging around that by the time we get to the second one, the process is reaching a file descriptor limit, and the server enters this portion of its internal Accept loop:

go/src/net/http/server.go

Lines 2962 to 2976 in 617f2c3

		if ne, ok := err.(net.Error); ok && ne.Temporary() {
			if tempDelay == 0 {
				tempDelay = 5 * time.Millisecond
			} else {
				tempDelay *= 2
			}
			if max := 1 * time.Second; tempDelay > max {
				tempDelay = max
			}
			srv.logf("http: Accept error: %v; retrying in %v", err, tempDelay)
			time.Sleep(tempDelay)
			continue
		}
		return err
	}

The server sleeping for extended periods of time may account for the sendfile code path looking really really slow. I cannot imagine sendfile needing ~50ms to send a small file otherwise.

I have not been able to confirm my hypothesis, but I would kindly ask @Ryan-A-B to change the inner loop of the benchmark to this:

	for i := 0; i < b.N; i++ {
		response, err := client.Get(server.URL)
		if err != nil {
			log.Fatal(err.Error())
		}
		_, err = ioutil.ReadAll(response.Body)
		if err != nil {
			log.Fatal(err.Error())
		}
		response.Body.Close()
	}

and try again. Perhaps with this change, the benchmark would have more sensible results.

@Ryan-A-B
Author

@acln0 adding response.Body.Close() made no difference; however, pointing at a smaller file did. It seems files over a certain size experience the slowdown when stating Content-Length. I think that makes sense based on @cuonglm's findings.

@cuonglm
Member

cuonglm commented Sep 22, 2020

@darkLord19 That's because macOS doesn't have sendfile.

@cuonglm Thanks for spotting where the behavior changes. I don't think it would be right to always enable chunking because as noted above using sendfile is actually faster when using a larger file.

I think we need to understand why sendfile is so much slower for short files.

CC @acln0 for any thoughts.

I don't mean we should always enable chunking, but rather that we still enable chunking when a Content-Length header is detected. The current code that enables chunking is:

cw.chunking = true

which happens if hasCL is false. But after that code there is:

if cw.chunking {

which seems useless, since chunking is only set when no Content-Length header exists. So I propose:

diff --git a/src/net/http/server.go b/src/net/http/server.go
index 25fab288f2..909adeb67f 100644
--- a/src/net/http/server.go
+++ b/src/net/http/server.go
@@ -1429,17 +1429,17 @@ func (cw *chunkWriter) writeHeader(p []byte) {
 
        if w.req.Method == "HEAD" || !bodyAllowedForStatus(code) {
                // do nothing
-       } else if code == StatusNoContent {
+       } else if code == StatusNoContent || hasCL {
                delHeader("Transfer-Encoding")
-       } else if hasCL {
-               delHeader("Transfer-Encoding")
-       } else if w.req.ProtoAtLeast(1, 1) {
+       }
+
+       if w.req.ProtoAtLeast(1, 1) {
                // HTTP/1.1 or greater: Transfer-Encoding has been set to identity, and no
                // content-length has been provided. The connection must be closed after the
                // reply is written, and no chunking is to be done. This is the setup
                // recommended in the Server-Sent Events candidate recommendation 11,
                // section 8.
-               if hasTE && te == "identity" {
+               if !hasCL && hasTE && te == "identity" {
                        cw.chunking = false
                        w.closeAfterReply = true
                } else {

Note

The slow performance comes from this line:

if !w.cw.chunking && w.bodyAllowed() {

@ianlancetaylor
Contributor

@Ryan-A-B Just to make sure we are on the same page, I want to clarify that what I see is that larger files are faster in BenchmarkFileWithContentLength. Smaller files are faster in BenchmarkFileWithoutContentLength.

@ianlancetaylor
Contributor

Running strace -ftt, it seems to me that it takes an unreasonably long time for epoll_pwait to return after sendfile is used to send data on the socket. Relevant lines from strace -ftt output:

[pid 1686887] 15:12:57.528572 sendfile(8, 9, NULL, 4194304 <unfinished ...>
[pid 1686887] 15:12:57.528647 <... sendfile resumed> ) = 1080
[pid 1686887] 15:12:57.529057 close(9 <unfinished ...>
[pid 1686887] 15:12:57.529131 <... close resumed> ) = 0
[pid 1686894] 15:12:57.529327 epoll_pwait(5,  <unfinished ...>
[pid 1686894] 15:12:57.736463 <... epoll_pwait resumed> [{EPOLLIN|EPOLLOUT, {u32=3680355512, u64=139847815513272}}], 128, -1, NULL, 0) = 1

Note the huge jump from 57.529327 to 57.736463 before epoll_pwait returns. During that interval strace reports nothing of interest. Why does it take epoll_pwait so long to return?

@ianlancetaylor
Contributor

My Linux kernel version is 5.7.17.

@qingyunha
Contributor

Running strace on the client and server separately shows there is a 200ms delay between sendfile and recvfrom.

07:22:16.141964 connect(3, {sa_family=AF_INET, sin_port=htons(8081), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
07:22:16.142104 poll([{fd=3, events=POLLOUT|POLLWRNORM}], 1, 0) = 1 ([{fd=3, revents=POLLOUT|POLLWRNORM}])
07:22:16.142381 sendto(3, "GET / HTTP/1.1\r\nHost: 127.0.0.1:"..., 78, MSG_NOSIGNAL, NULL, 0) = 78
07:22:16.142458 poll([{fd=3, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)
07:22:16.142511 poll([{fd=3, events=POLLIN}], 1, 199) = 1 ([{fd=3, revents=POLLIN}])
07:22:16.142964 poll([{fd=3, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 1 ([{fd=3, revents=POLLIN|POLLRDNORM}])
07:22:16.143005 recvfrom(3, "HTTP/1.1 200 OK\r\nContent-Length:"..., 102400, 0, NULL, NULL) = 631
07:22:16.143215 poll([{fd=3, events=POLLIN}], 1, 198) = 0 (Timeout)
07:22:16.342150 poll([{fd=3, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)
07:22:16.342346 poll([{fd=3, events=POLLIN}], 1, 0) = 0 (Timeout)
07:22:16.342487 poll([{fd=3, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)
07:22:16.342583 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, revents=POLLIN}])
07:22:16.348691 poll([{fd=3, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 1 ([{fd=3, revents=POLLIN|POLLRDNORM}])
07:22:16.348754 recvfrom(3, "llback func(response http.Respon"..., 959, 0, NULL, NULL) = 959
[pid  7073] 07:22:16.142734 write(4, "HTTP/1.1 200 OK\r\nContent-Length:"..., 631) = 631
[pid  7073] 07:22:16.142825 sendfile(4, 6, NULL, 4194304) = 959
[pid  7073] 07:22:16.142864 sendfile(4, 6, NULL, 4194304) = 0

@mattn
Member

mattn commented Sep 25, 2020

FYI, I got this on my Windows PC.

goos: windows
goarch: amd64
pkg: github.com/mattn/http_bench
BenchmarkFileWithoutContentLength-4          679           1651679 ns/op
BenchmarkFileWithContentLength-4             727           1602470 ns/op
PASS

@Ryan-A-B
Author

@ianlancetaylor up to 512 bytes they're equal; from 513 bytes up, BenchmarkFileWithContentLength becomes the slow version, then crosses back to being the fast version at about 1.2MB.

256 byte file
BenchmarkFileWithoutContentLength-8        20000            104532 ns/op
BenchmarkFileWithContentLength-8           20000             90245 ns/op
512 byte file
BenchmarkFileWithoutContentLength-8        20000             94685 ns/op
BenchmarkFileWithContentLength-8           20000             86804 ns/op
513 byte file
BenchmarkFileWithoutContentLength-8        20000             96399 ns/op
BenchmarkFileWithContentLength-8              20          52140028 ns/op
1024 byte file
BenchmarkFileWithoutContentLength-8        20000             97876 ns/op
BenchmarkFileWithContentLength-8              20          52145628 ns/op
1.2MB file
BenchmarkFileWithoutContentLength-8         1000           1242620 ns/op
BenchmarkFileWithContentLength-8            2000           1389453 ns/op
3.6MB file
BenchmarkFileWithoutContentLength-8          500           2606823 ns/op
BenchmarkFileWithContentLength-8            1000           1653781 ns/op

@qingyunha
Contributor

@ianlancetaylor up to 512 bytes they're equal then the timing from 513 up BenchmarkFileWithContentLength goes from being the slow version to the fast version, crossing at about 1.2MB

Content-Type detection sniffs the first 512 bytes, so responses of 512 bytes or less never invoke sendfile.

@imxyb
Contributor

imxyb commented Oct 21, 2020

my mac os benchmark:

goos: darwin
goarch: amd64
pkg: test/ut
BenchmarkFileWithoutContentLength-12    	   13285	     88998 ns/op
BenchmarkFileWithContentLength-12       	   13695	     89779 ns/op
PASS

qingyunha added a commit to qingyunha/go that referenced this issue Mar 28, 2021
If the count param of sendfile is larger than the size of the file to be sent, TCP somehow incurs a 200ms delay.

Fixes golang#41513
Fixes golang#45256
@gopherbot

Change https://golang.org/cl/305229 mentions this issue: net: use the size be sent as the count param of sendfile

@harshavardhana
Contributor

> (quoting @cuonglm's earlier comment and proposed chunkWriter.writeHeader patch in full)

I tried this - it breaks the HTTP server for range GETs etc. I have seen similar behavior with epoll_wait under a combination of HEAD/GET requests, where random 200ms delays spike up. All I can see from some extensive logging is that

pconn.Read()

regularly gets blocked, which seems to block and delay subsequent calls as well. This increases overall response times, with unexpected latencies.

My OS is CentOS Linux release 8.3.2011

@networkimprov

cc @neild

@spade69

spade69 commented Dec 16, 2022

Hey guys, what is going on here? Has this problem been solved?

@AlexanderYastrebov
Contributor

Rediscovered as #61530
