crypto/tls: TLS connections use small buffer size that results in small syscalls and ignore HTTP client transport buffer sizes #47672

Open
richardartoul opened this issue Aug 12, 2021 · 9 comments

@richardartoul

richardartoul commented Aug 12, 2021

What version of Go are you using (go version)?

$ go version

`go1.16 darwin/amd64`

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE="auto"
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/richard.artoul/Library/Caches/go-build"
GOENV="/Users/richard.artoul/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/richard.artoul/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/richard.artoul/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/Cellar/go/1.15/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.15/libexec/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/t2/02qzh_vs4cn57ctvc7dwcsc80000gn/T/go-build075316343=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I'm using this library: https://github.com/google/go-cloud to read files from S3 in a streaming fashion.

I noticed that a lot of time was being spent in syscalls:
[screenshot: CPU profile showing significant time spent in syscalls]

I was aware of this issue: #22618

So I tuned my HTTP client transport's read/write buffer sizes to 256 KiB instead of 64 KiB, but this had no impact on the time spent in syscalls, which made me suspicious that the reads were not actually being buffered the way I expected.
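
For reference, the transport knobs in question are http.Transport.ReadBufferSize and WriteBufferSize. A minimal sketch of that tuning (my real setup differs in the details):

tr := &http.Transport{
	// These size the bufio.Reader/Writer that net/http wraps around each
	// connection; as described below, they never reach the TLS record
	// layer's internal read buffer.
	ReadBufferSize:  256 << 10, // 256 KiB
	WriteBufferSize: 256 << 10, // 256 KiB
}
client := &http.Client{Transport: tr}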

I wrote a small program to download a file from S3 in a streaming fashion using large 1 MiB reads, like this:

stream, err := store.GetStream(ctx, *bucket, *path)
if err != nil {
	log.Fatalf("error getting: %s, err: %v", *path, err)
}
defer stream.Close()

// Wrap the stream once, outside the loop, so buffered data isn't discarded
// by re-wrapping it on every iteration.
reader := bufio.NewReaderSize(stream, 1<<20)
buf := make([]byte, 1<<20)
for {
	n, err := reader.Read(buf)
	if err == io.EOF {
		break
	}
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("n:", n)
}

I couldn't get dtrace to work properly on OS X, but luckily my application uses a custom dialer that sets write/read deadlines on every socket read, so I was able to instrument the actual socket read sizes like this:

func (d *deadlineConn) Read(p []byte) (int, error) {
	d.Conn.SetReadDeadline(time.Now().Add(d.readDeadline))
	// len(p) is the read size requested by the caller (here, the TLS record
	// layer), which caps how much a single read syscall can return.
	fmt.Println("read size", len(p))
	n, err := d.Conn.Read(p)
	err = maybeWrapErr(err)
	return n, err
}
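
For completeness, a wrapper like this gets wired in through the transport's DialContext. A rough sketch (the struct fields and the 30s value are assumptions for illustration, not copied from my code):

// Assumed shape of the wrapper above; only the Read override matters here.
type deadlineConn struct {
	net.Conn
	readDeadline time.Duration
}

func newInstrumentedTransport() *http.Transport {
	return &http.Transport{
		DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
			c, err := (&net.Dialer{}).DialContext(ctx, network, addr)
			if err != nil {
				return nil, err
			}
			// 30s is a placeholder; the real deadline comes from config.
			return &deadlineConn{Conn: c, readDeadline: 30 * time.Second}, nil
		},
	}
}

Since the transport does its own TLS on top of whatever DialContext returns (when DialTLSContext isn't set), the "read size" values below are exactly what the TLS record layer asks the socket for.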

What did you expect to see?

Large syscall reads (in the range of 256KiB)

What did you see instead?

Extremely small syscall reads:

read size 52398
n: 16384
n: 1024
n: 16384
n: 1024
n: 16384
n: 1024
read size 52398
n: 16384
n: 1024
read size 28361
read size 26929
n: 16384
n: 1024
n: 16384
read size 5449
n: 1024
read size 56415
read size 47823
n: 16384
n: 1024
n: 16384
read size 23479

What did you do after?

I made a small change in tls.go to instantiate the TLS client with a much larger rawInput buffer (crypto/tls fills rawInput via bytes.Buffer.ReadFrom, which reads into the buffer's spare capacity, so rawInput's capacity effectively determines how large each read from the underlying conn can be):

// Client returns a new TLS client side connection
// using conn as the underlying transport.
// The config cannot be nil: users must set either ServerName or
// InsecureSkipVerify in the config.
func Client(conn net.Conn, config *Config) *Conn {
	c := &Conn{
		rawInput: *bytes.NewBuffer(make([]byte, 0, 1<<20)),
		conn:     conn,
		config:   config,
		isClient: true,
	}
	c.handshakeFn = c.clientHandshake
	return c
}

As expected, I began to observe much larger syscall reads:

read size 1034019
read size 1024035
read size 1022603
read size 1003987
read size 993963
read size 991099
read size 985371
read size 982507
read size 981075
read size 965323
read size 963891
read size 955299
read size 945275
read size 935251

I haven't tried deploying my fork to production, and measuring performance on my laptop isn't interesting since I have a terrible connection to S3, but I think it's well understood that a 10x increase in syscalls (especially with such a small read size of 64 KiB) has a dramatic impact on performance.

Proposal

I'm not 100% sure what the best approach is here, but I think we should do something: this issue means that streaming large amounts of data over TLS is much more CPU-intensive than it needs to be, which is a big deal for applications that move large volumes of data over the network, like distributed databases.

The tls package already has a Config struct. It seems like it would be straightforward to add buffer size configuration there, as has already been done for the HTTP transport. In addition, it seems reasonable for the HTTP client transport buffer sizes to be propagated automatically as the TLS buffer sizes when the user doesn't specify an explicit override.
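
To make the shape of that concrete, here is a purely hypothetical sketch; the buffer field does not exist in crypto/tls today and the name is invented purely for illustration:

cfg := &tls.Config{
	ServerName: "example.com",
	// ReadBufferSize: 256 << 10,
	// Hypothetical knob, commented out because it does not exist in crypto/tls
	// today: it would size the record layer's rawInput buffer, analogous to
	// http.Transport.ReadBufferSize.
}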

@richardartoul
Author

richardartoul commented Aug 13, 2021

OK, so to test my theory in production, I wrote a custom TLS dialer that does some crazy unsafe stuff:

func (d *deadlineDialer) DialTLSContext(ctx context.Context, network, addr string) (c net.Conn, err error) {
	defer func() {
		if err != nil {
			err = maybeWrapErr(err)
		}
	}()

	var firstTLSHost string
	if firstTLSHost, _, err = net.SplitHostPort(addr); err != nil {
		return nil, err
	}

	cfg := &tls.Config{}
	cfg.ServerName = firstTLSHost

	trace := httptrace.ContextClientTrace(ctx)

	plainConn, err := d.dialer.DialContext(ctx, network, addr)
	if err != nil {
		return nil, err
	}

	tlsConn := tls.Client(plainConn, cfg)
	increaseTLSBufferSizeUnsafely(tlsConn)

	errc := make(chan error, 2)
	var timer *time.Timer // for canceling TLS handshake
	if d := transport.TLSHandshakeTimeout; d != 0 {
		timer = time.AfterFunc(d, func() {
			errc <- errs.NewRetryableError(errors.New("TLS handshake timeout"))
		})
	}
	go func() {
		if trace != nil && trace.TLSHandshakeStart != nil {
			trace.TLSHandshakeStart()
		}
		err := tlsConn.Handshake()
		if timer != nil {
			timer.Stop()
		}
		errc <- err
	}()
	if err := <-errc; err != nil {
		plainConn.Close()
		if trace != nil && trace.TLSHandshakeDone != nil {
			trace.TLSHandshakeDone(tls.ConnectionState{}, err)
		}
		return nil, err
	}
	cs := tlsConn.ConnectionState()
	if trace != nil && trace.TLSHandshakeDone != nil {
		trace.TLSHandshakeDone(cs, nil)
	}

	return tlsConn, nil
}

// increaseTLSBufferSizeUnsafely uses reflect + unsafe to swap the unexported
// rawInput buffer for a 256 KiB one. It depends on crypto/tls internals and
// can break silently across Go releases.
func increaseTLSBufferSizeUnsafely(tlsConn *tls.Conn) {
	var (
		pointerVal = reflect.ValueOf(tlsConn)
		val        = reflect.Indirect(pointerVal)
		member     = val.FieldByName("rawInput")
		ptrToY     = unsafe.Pointer(member.UnsafeAddr())
		realPtrToY = (*bytes.Buffer)(ptrToY)
	)
	*realPtrToY = *bytes.NewBuffer(make([]byte, 0, 1<<18))
}

After I deployed my service with the new dialer and a 256 KiB buffer size, the time spent in syscalls more or less disappeared from my profiles. The performance difference ended up being more dramatic than I expected: profiles with the old implementation had 23.36s of CPU time, whereas the new version had 19.98s for the same workload.

I would share CPU profile comparisons, but it's a bit difficult since my application is not open source and I don't want to potentially leak sensitive information via the method names.
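
For comparison, a reflection-free sketch of the same buffering idea (just an illustration; it is not what I deployed): wrap the plain conn in a large bufio.Reader before handing it to tls.Client, so the record layer's many small reads are served from userspace while the actual socket reads stay large.

// Sketch only, not the deployed hack above: a net.Conn wrapper whose reads go
// through a large bufio.Reader. Note that it may read ahead of what the TLS
// layer currently needs, and it adds an extra copy.
type bufferedConn struct {
	net.Conn
	r *bufio.Reader
}

func newBufferedConn(c net.Conn, size int) *bufferedConn {
	return &bufferedConn{Conn: c, r: bufio.NewReaderSize(c, size)}
}

func (c *bufferedConn) Read(p []byte) (int, error) { return c.r.Read(p) }

// e.g. tlsConn := tls.Client(newBufferedConn(plainConn, 256<<10), cfg)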

@dmitshur
Contributor

@dmitshur added this to the Backlog milestone Aug 14, 2021
@dmitshur added the NeedsInvestigation label Aug 14, 2021
@seizethedave

seizethedave commented Jun 13, 2022

+1 on this! Our database downloads new partitions from S3 while it is serving reads, and we are considering rewriting our downloader component in a different language because these 4 GB downloads must be squeezed through a syscall pipe that is at most 64 KB wide due to the way Conn.rawInput works.

read syscall return values:

@bytes:
(..., 0)           87559 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[0]                 1629 |                                                    |
[1]                 1287 |                                                    |
[2, 4)               103 |                                                    |
[4, 8)                54 |                                                    |
[8, 16)              464 |                                                    |
[16, 32)             121 |                                                    |
[32, 64)             134 |                                                    |
[64, 128)            213 |                                                    |
[128, 256)           666 |                                                    |
[256, 512)           608 |                                                    |
[512, 1K)            948 |                                                    |
[1K, 2K)            4310 |@@                                                  |
[2K, 4K)            6920 |@@@@                                                |
[4K, 8K)           25991 |@@@@@@@@@@@@@@@                                     |
[8K, 16K)          46532 |@@@@@@@@@@@@@@@@@@@@@@@@@@@                         |
[16K, 32K)         42064 |@@@@@@@@@@@@@@@@@@@@@@@@                            |
[32K, 64K)         16326 |@@@@@@@@@                                           |
[64K, 128K)          160 |                                                    |
[128K, 256K)           3 |                                                    |
[256K, 512K)           1 |                                                    |
[512K, 1M)             2 |                                                    |

@rolandshoemaker
Member

Dupe of #20420.

@FiloSottile
Contributor

> Dupe of #20420.

I think it's related but not a dupe. #20420 is about negotiating smaller individual records over the protocol to reduce (the peer's) memory requirements. This is about reading more than one record at a time from the wire to reduce syscall overhead.

@xdg

xdg commented Jun 14, 2022

@FiloSottile what needs to happen to get this out of the backlog? Is the proposal acceptable, and would a PR along those lines be accepted?

@richardartoul
Author

Was refactoring some code and found a link to this issue above an extremely ugly hack I introduced to work around it :P Would be awesome to see this proposal accepted or an alternative proposed!

@rolandshoemaker
Member

We obviously didn't get to this for 1.20 (sorry), but it is something on my radar to look at in 1.21.

@elijahandrews

elijahandrews commented Jan 17, 2024

Thanks for the workaround @richardartoul, I just used it as well.

For the Go maintainers – any chance we can get this investigated for the next Go release?

@rolandshoemaker self-assigned this Jan 17, 2024