Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net/http: tls handshake panic with custom dialer #58469

Open
aathan opened this issue Feb 11, 2023 · 11 comments
Open

net/http: tls handshake panic with custom dialer #58469

aathan opened this issue Feb 11, 2023 · 11 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.

Comments

@aathan
Copy link

aathan commented Feb 11, 2023

What version of Go are you using (go version)?

go version go1.20 linux/amd64

Does this issue reproduce with the latest release?

yes, 1.20.0

What operating system and processor architecture are you using (go env)?

go env Output
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/aathan/.cache/go-build"
GOENV="/home/aathan/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/aathan/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/aathan/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.20"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build4181227200=/tmp/go-build -gno-record-gcc-switches"

What did you do?

var internalTransport *http.Transport = &http.Transport{
  DialContext: func(ctx context.Context, network, addr string) (ret net.Conn, err error) {
   return transportDialFunc(ctx, network, addr, myStuff)
  },
  TLSHandshakeTimeout: TLS_HANDSHAKE_TIMEOUT,
}

What did you expect to see?

No panics from inside golang stdlib.

What did you see instead?

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x5d073c]
goroutine 4133611 [running]:
crypto/tls.(*Conn).handshakeContext.func2()
  /usr/local/go/src/crypto/tls/conn.go:1441 +0xbc
created by crypto/tls.(*Conn).handshakeContext
  /usr/local/go/src/crypto/tls/conn.go:1437 +0x205
Main process exited, code=exited, status=2/INVALIDARGUMENT
Failed with result 'exit-code'.

Proposed fix:

In crypto/tls/conn.go around line 1441:

go func() {
select {
case <-handshakeCtx.Done():
// Close the connection, discarding the error
_ = c.conn.Close()
interruptRes <- handshakeCtx.Err()
case <-done:
interruptRes <- nil
}
}()

I believe the problem may be that this should read:

conn := c.conn
if conn!= nil {
  conn.Close()
}

I have not deeply understood the net.Conn or tls dialer internals, but, I believe the issue here stems from the possibility that the tls handshake timer may expire before the dialer is actually finished dialing. Whether this implies that there should also be a deeper/upstream fix is beyond me.

I've prepared all of my credentials so that I can submit a related PR to googlesource.com & gerrit etc. once I get some feedback.

I would appreciate any tips on how to work around this issue in the meantime. I'm not sure how to replicate or override a core golang package to patch in this fix??

@rittneje
Copy link

What is transportDialFunc? Can you provide the full code to reproduce the issue?

@aathan
Copy link
Author

aathan commented Feb 11, 2023

Unfortunately no, it is non-trivial for me to provide transportDialFunc, which is a custom dialer that orchestrates multiple tcp connections between various nodes. I theorized that this may be due to the dialer taking longer than the TLS handshake timeout, therefore it may be possible to create a failing test case by doing a simple dial after a time.Sleep() longer than the TLS handshake timeout.

Irrespective of the contents of transportDialFunc, it seems prudent to add the nil check to crypto/tls/conn.go where I indicated, no?

@seankhliao
Copy link
Member

The relevant part of http code is below.
TLS handshaking only starts after dialing has successfully completed.
So if the conn is invalid, it would be a problem with your dialer code returning invalid values.

go/src/net/http/transport.go

Lines 1621 to 1634 in e03ee85

conn, err := t.dial(ctx, "tcp", cm.addr())
if err != nil {
return nil, wrapErr(err)
}
pconn.conn = conn
if cm.scheme() == "https" {
var firstTLSHost string
if firstTLSHost, _, err = net.SplitHostPort(cm.addr()); err != nil {
return nil, wrapErr(err)
}
if err = pconn.addTLS(ctx, firstTLSHost, trace); err != nil {
return nil, wrapErr(err)
}
}

@seankhliao seankhliao changed the title affected/package: crypto net/http: tls handshake panic with custom dialer Feb 11, 2023
@seankhliao seankhliao added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Feb 11, 2023
@rittneje
Copy link

To add to what @seankhliao said, the only way to construct a tls.Conn is via tls.Client or tls.Server, both of which require you to pass the already established TCP connection. Thus the only way for it to panic where it did is if you misuse the API, so adding the nil check is superfluous. (The handshake would panic shortly thereafter upon attempting to write to the TCP connection anyway.)

Hence this seems like a bug in your transportDialFunc implementation.

@aathan
Copy link
Author

aathan commented Feb 11, 2023

TL;DR
I may indeed have a race condition in my code (see 5 below), but there also seems to be a missing safety check in Transport.dial(), irrespective of whether additionally checking c.conn!=nil in tls/conn.go is considered superfluous.

Observations:

(1) The only reference I can find to TLSHandshakeTimeout is within (*persistConn) addTLS() which in turn is only used by (*Transport) dialConn()

(2) (*Transport) dial() does less error checking when using a DialContext. A return value of (nil,nil) allows a nil net.Conn to be passed back with no indicated error. Is the raw pass-through when using a DialContext intentional?

I might argue it would be safer to check for c==nil && err==nil irrespective of the dialer, particularly if a non-nil net.Conn is required by the rest of the code; but maybe you want the raw behavior. I don't know enough about the rest of framework to have a strong opinion.

func (t *Transport) dial(ctx context.Context, network, addr string) (net.Conn, error) {
	if t.DialContext != nil {
		return t.DialContext(ctx, network, addr)
	}
	if t.Dial != nil {
		c, err := t.Dial(network, addr)
		if c == nil && err == nil {
			err = errors.New("net/http: Transport.Dial hook returned (nil, nil)")
		}
		return c, err
	}
	return zeroDialer.DialContext(ctx, network, addr)
}

(3) That dial() implementation is only ever called from (* Transport) dialConn() in one place. In that place, a hypothetical return of (nil,nil) from dial, as just discussed, would result in pconn.conn==nil during the call to pconn.addTLS().

In addTLS() the pconn.conn is used as the conn member in tlsConn. Subsequently, that's the proximal cause of the nil dereference in (*Conn) handshakeContext() via tlsConn.HandshakeContext(), invoked from inside a goroutine

Since I can't find anywhere that explicitly sets tlsConn.conn after the initial construction, the nil (or non-nil bad) value must have come from dial().

(4) Note that this panic would require that the handshakeCtx is Done() possibly due to the incoming parent context being Done().

Note: The comments in (*Conn) handshakeContext() seem a little misleading, in that the goroutine there waits on <-done and <-handleCtx.Done(), and no guarantee exists as to which will be invoked by select if both are active.

Anyway, the parent context to handshakeCtx is the "top level" Request.ctx that started the entire chain. This is my code for that. d.TimeoutSec=10 and TLSHandshakeTimeout=5 seconds.

	ctx, cancel := context.WithTimeout(context.Background(), time.Second*time.Duration(d.TimeoutSec))
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, "GET", urlURL.String(), nil)

Given these values either the handshakeCtx or parent context were canceled during the handshake in less than 5 seconds, or the system was so busy that it scheduled the handshakeCtx's goroutine prior to the time.AfterFunc() goroutine sometime after both expired. Dunno if that's possible with go's scheduler, but I thought it's worth mentioning here since it may evidence that things can happen in a rather weird ordering under load.

(5) Anyway, my dialer function looks like this. I guess there is a race condition in this code, right? Specifically, if the wg.Wait() wakes up transportDialFunc() first, and err is set to nil in the goroutine second, err might be returned as nil irrespective of ret's value.

func transportDialFunc(ctx context.Context, network, addr string, myStuff *MyStuff) (ret net.Conn, err error) {
	now := time.Now()
	if network != "tcp" {
		return nil, fmt.Errorf("unsupported network=%s", network)
	}

	// bunch of setup code...
	foo := something
	foo.streamHandler = func() {
		foo.wg.Done()
	}

	go func() {
		err = SetupStream(&foo)
	}()

	foo.wg.Wait()
	ret = foo.StreamConn

	if ret == nil && err == nil {
		err = fmt.Errorf("ret==nil")
	} else if ret != nil && err != nil {
		err = nil
	}
	if err != nil {
		//log
	} else {
		//log
	}
	return
}

@rittneje
Copy link

Where exactly is wg.Done() getting called? I see you are doing it in streamHandler, which is some sort of callback. Is that invoked at the end of SetupStream? Similarly, where exactly is wg.Add(1) getting called? Does SetupStream run forever, or do you expect it to finish before transportDialFunc returns?

The crux of my question is, why are you calling SetupStream in a goroutine, only to (I assume) wait for it to finish via the wait group immediately thereafter?

@aathan
Copy link
Author

aathan commented Feb 12, 2023

I'll answer you below, but I think the more relevant item for this issue is whether dial() should check the (nil,nil) case for a custom DialContext()? I would say "yes"


The code in 5 above is simplified, hiding implementation details of the app that are not immediately relevant here.

SetupStream() runs in a goroutine for all the usual reasons such as: Needing to run in a context where no locks are held, needing to perform tasks that could execute in parallel with those that run after readiness signalled by wg.Done(), etc. etc.

As you gleaned, that callback in fact does happen within SetupStream ... but really it could happen any time. The point is we don't have any guarantees as to when err= in the goroutine will execute relative to the code after the wg.Wait() ... including immediately after ret==nil && err==nil is checked ... although less likely, this execution order could happen even if wg.Done() executes after SetupStream() completes ... right?

PS: I recognize the logical issue here, of relying both on the synchronous return value of the function, and waiting on the waitgroup. Both are necessary, so I'll be modiying the code accordingly.

@rittneje
Copy link

My intent was to identity the root cause of the issue. The lack of a nil conn check is more of a secondary issue, as the code would still fail, just with a fairly generic error message rather than a panic.

With regards to SetupStream, I do not understand your code. Essentially you are saying it could continue to do work after signalling via wg.Done(). However, if this is the case, that means that there is no way for transportDialFunc to properly use its return value, as it has no idea when it will return. You should refactor your code to block on SetupStream finishing, at which point the goroutine is useless and you should just call it normally.

if err := SetupStream(&foo); err != nil {
    return nil, err
}

if foo.StreamConn == nil {
    return nil, errors.New(...)
}

return foo.StreamConn, nil

@aathan
Copy link
Author

aathan commented Feb 12, 2023

With regards to SetupStream,

I agree with you. I said what you said, differently.

the code would still fail

I think you're making the wrong assessment about the nil check because we are no longer talking about the same nil check.

The nil check I'm talking about now is not the same one I suggested when I opened the issue. Instead it's the one in 2 above. It would prevent this panic, and instead (correctly) behave as if the dialer failed to provide a net.Conn.

Explicitly, I'm talking about doing:

func (t *Transport) dial(ctx context.Context, network, addr string) (net.Conn, error) {
	if t.DialContext != nil {
		c,err := t.DialContext(ctx, network, addr)
		if c == nil && err == nil {
			err = errors.New("net/http: Transport.DialContext hook returned (nil, nil)")
		}
		return c, err
	}
	if t.Dial != nil {
		c, err := t.Dial(network, addr)
		if c == nil && err == nil {
			err = errors.New("net/http: Transport.Dial hook returned (nil, nil)")
		}
		return c, err
	}
	return zeroDialer.DialContext(ctx, network, addr)
}

instead of:

func (t *Transport) dial(ctx context.Context, network, addr string) (net.Conn, error) {
	if t.DialContext != nil {
		return t.DialContext(ctx, network, addr)
	}
	if t.Dial != nil {
		c, err := t.Dial(network, addr)
		if c == nil && err == nil {
			err = errors.New("net/http: Transport.Dial hook returned (nil, nil)")
		}
		return c, err
	}
	return zeroDialer.DialContext(ctx, network, addr)
}

@rittneje
Copy link

The nil check I'm talking about now is not the same one I suggested when I opened the issue. Instead it's the one in 2 above. It would prevent this panic, and instead (correctly) behave as if the dialer failed to provide a net.Conn.

Right, that's what I was referring to. You'd get an error back that says net/http: Transport.DialContext hook returned (nil, nil) instead of a panic. I agree that's better than a panic, and would be consistent with how it handles Transport.Dial today. My point was just that even if that change is made, you'd still have to fix transportDialFunc.

@aathan
Copy link
Author

aathan commented Feb 12, 2023

transportDialFunc is not in fact "totally" broken, because the nil net.Conn is the significant return value -- that said, as we've discussed, it may return (nil,nil) instead of (nil,!nil), but if it returns a nil net.Conn, it's a failure to dial, and the (nil,nil) should (as I think we agree) not cause a panic. I'll send in a PR. Just for completeness, another user seems to concur about this bug/improvement: https://groups.google.com/g/golang-nuts/c/qCn3BDUq74U

@seankhliao seankhliao added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. and removed WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. labels Jul 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

4 participants
@seankhliao @aathan @rittneje and others