Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net/http: a "bad" connection could potentially block all https requests #28824

Closed
carter2000 opened this issue Nov 16, 2018 · 10 comments
Closed
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Milestone

Comments

@carter2000
Copy link

carter2000 commented Nov 16, 2018

In out production environment, we found all the outgoing https requests with different destination address happened to blocked at the same time and last for about 14 miniutes, while the http requests were not.

So I check the net/http source code (https://github.com/golang/go/blob/ac7c0ee26dda18076d5f6c151d8f920b43340ae3/src/net/http/h2_bundle.go), and found the "block logic", as show below.

A https request goes through a few stages, step1 get connection, step2 write headers, ...

step1, get connection
step1-1 call http2clientConnPool.GetClientConn to get a valid connection
step1-2 acquire http2clientConnPool.mu, line 738
step1-3 call http2ClientConn.CanTakeNewRequest, line 740, and http2ClientConn.mu is acquired, line 7175

7174 func (cc *http2ClientConn) CanTakeNewRequest() bool {
7175	cc.mu.Lock()
7176	defer cc.mu.Unlock()
7177	return cc.canTakeNewRequestLocked()
7178 }

 738         p.mu.Lock()                                                                              |8008                         ErrCode:      cc.goAway.ErrCode,                                        
 739         for _, cc := range p.conns[addr] {                                                       |8009                         DebugData:    cc.goAwayDebug,                                           
 740                 if cc.CanTakeNewRequest() {                                                      |8010                 }                                                                               
 741                         p.mu.Unlock()                                                            |8011         } else if err == io.EOF {                                                               
 742                         return cc, nil                                                           |8012                 err = io.ErrUnexpectedEOF                                                       
 743                 }                                                                                |8013         }                                                                                       
 744         }                                                                                        |8014         for _, cs := range cc.streams {                                                         
 745         if !dialOnMiss {                                                                         |8015                 cs.bufPipe.CloseWithError(err) // no-op if already closed                       
 746                 p.mu.Unlock()                                                                    |8016                 select {                                                                        
 747                 return nil, http2ErrNoCachedConn                                                 |8017                 case cs.resc <- http2resAndError{err: err}:                                     
 748         }                                                                                        |8018                 default:                                                                        
 749         call := p.getStartDialLocked(addr)                                                       |8019                 }                                                                               
 750         p.mu.Unlock()   

step2, write headers
step2-1 acquire http2ClientConn.mu, line 7335
step2-2 call http2ClientConn.writeHeaders to write request headers
step2-3 http2ClientConn.writeHeaders call bw.Flush which could potentially block
step2-4 release http2ClientConn.mu, line 7384

7335	cc.mu.Lock()
	......
7382	cc.wmu.Lock()
7383	endStream := !hasBody && !hasTrailers
7384	werr := cc.writeHeaders(cs.ID, endStream, int(cc.maxFrameSize), hdrs)
7385	cc.wmu.Unlock()
7386	http2traceWroteHeaders(cs.trace)
7387	cc.mu.Unlock()

If a request write headers but block in step2-3,
the next request with the same addr will be blocked in step1-3,
and from now on all the following requests (even with different destination address) will be blocked in step1-2 until the blocked step2-3 writeHeaders returned.

I check the sysctl configuration, the total retransmission timeout is just about 14minutes(with TCP_RTO_MAX = 120):

net.ipv4.tcp_retries1 = 3
net.ipv4.tcp_retries2 = 15
@agnivade
Copy link
Contributor

/cc @bradfitz @fraenkel

@agnivade agnivade added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Nov 16, 2018
@agnivade agnivade added this to the Unplanned milestone Nov 16, 2018
@carter2000
Copy link
Author

carter2000 commented Nov 16, 2018

I make a mistake, my local go version is go1.10.1, which the problem really exists, but it's not the latest code (sorry for that), and the latest code seems fixed the problem.

@carter2000
Copy link
Author

The latest code (e8a95ae) maybe have the same problem,
http2ClientConn.CanTakeNewRequest is not called,
but http2ClientConn.idleState is called (line 762), it acquires http2ClientConn.mu too.

  760         p.mu.Lock()
  761         for _, cc := range p.conns[addr] {
  762                 if st := cc.idleState(); st.canTakeNewRequest {
  763                         if p.shouldTraceGetConn(st) {
  764                                 http2traceGetConn(req, addr)
  765                         }
  766                         p.mu.Unlock()
  767                         return cc, nil
  768                 }
  769         }
  770         if !dialOnMiss {
  771                 p.mu.Unlock()
  772                 return nil, http2ErrNoCachedConn
  773         }
  774         http2traceGetConn(req, addr)
  775         call := p.getStartDialLocked(addr)
  776         p.mu.Unlock()

@agnivade
Copy link
Contributor

Can you tell us the production Go version that you are running where the problem is actually occuring ?

@carter2000
Copy link
Author

carter2000 commented Nov 16, 2018

@agnivade

The production Go version is go1.10.1, and I reproduce the problem with go1.11.2.

package main

import (
	"flag"
	"fmt"
	"io"
	"io/ioutil"
	"net/http"
	"strings"
	"sync/atomic"
	"time"
)

var url = flag.String("url", "https://172.28.128.3:12345/test", "url")

func main() {

	flag.Parse()

	var reqA, doneA int64
	var reqB, doneB int64
	for {
		for i := 0; i < 10; i++ {
			go func() {
				atomic.AddInt64(&reqA, 1)
				resp, err := http.Get("https://www.qiniu.com")
				if err == nil {
					io.Copy(ioutil.Discard, resp.Body)
					resp.Body.Close()
				}
				atomic.AddInt64(&doneA, 1)
			}()
		}
		for i := 0; i < 10; i++ {
			go func() {
				atomic.AddInt64(&reqB, 1)
				req, _ := http.NewRequest("GET", *url, nil)
				req.Header.Set("X-Qiniu", strings.Repeat("helloworld", 1024))
				resp, err := http.DefaultClient.Do(req)
				if err == nil {
					io.Copy(ioutil.Discard, resp.Body)
					resp.Body.Close()
				}
				atomic.AddInt64(&doneB, 1)
			}()
		}

		fmt.Printf("reqA:%d doneA:%d, reqB:%d doneB:%d\n", atomic.LoadInt64(&reqA), atomic.LoadInt64(&doneA), atomic.LoadInt64(&reqB), atomic.LoadInt64(&doneB))

		time.Sleep(1e9)
	}
}

Run the code, and shutdown 172.28.128.3 server by pulling power at "2018-11-16 18:17:01",
as the output shown, from "2018-11-16 18:17:02", both doneA and doneB stop update, all requests blocked:

[18:16:08]~/code/go $ go run test.go 
2018-11-16 18:16:54 reqA:2 doneA:0, reqB:1 doneB:0
2018-11-16 18:16:55 reqA:12 doneA:10, reqB:11 doneB:6
2018-11-16 18:16:56 reqA:21 doneA:20, reqB:20 doneB:20
2018-11-16 18:16:57 reqA:33 doneA:30, reqB:31 doneB:30
2018-11-16 18:16:58 reqA:44 doneA:40, reqB:41 doneB:40
2018-11-16 18:16:59 reqA:52 doneA:50, reqB:50 doneB:50
2018-11-16 18:17:00 reqA:62 doneA:60, reqB:62 doneB:60
2018-11-16 18:17:01 reqA:72 doneA:70, reqB:71 doneB:60
2018-11-16 18:17:02 reqA:83 doneA:79, reqB:81 doneB:60
2018-11-16 18:17:03 reqA:92 doneA:79, reqB:91 doneB:60
2018-11-16 18:17:04 reqA:101 doneA:79, reqB:101 doneB:60
2018-11-16 18:17:05 reqA:110 doneA:79, reqB:110 doneB:60
2018-11-16 18:17:06 reqA:122 doneA:79, reqB:122 doneB:60
2018-11-16 18:17:07 reqA:132 doneA:79, reqB:130 doneB:60
2018-11-16 18:17:08 reqA:141 doneA:79, reqB:141 doneB:60
2018-11-16 18:17:09 reqA:151 doneA:79, reqB:151 doneB:60
2018-11-16 18:17:10 reqA:162 doneA:79, reqB:161 doneB:60
2018-11-16 18:17:11 reqA:171 doneA:79, reqB:171 doneB:60
2018-11-16 18:17:12 reqA:187 doneA:79, reqB:181 doneB:60
2018-11-16 18:17:13 reqA:193 doneA:79, reqB:192 doneB:60

@fraenkel
Copy link
Contributor

I am running your test against a local http2 server which I then killed.
The client is continuing without any issues.

@carter2000
Copy link
Author

carter2000 commented Nov 17, 2018

@fraenkel

Killing local http2 server is different with pulling power (the "shutdown" in previous comment is pulling power exactly).

Killing local http2 server,:
The client's socket will receive a FIN packet, and transfer to CLOSE_WAIT state.
The read operation on the socket should return EOF immediately.
The write operation on the socket should failed because it's unexpected in the server side and will cause a RST respond usually.

Pulling power:
No FIN packet, the client's socket know nothing about it.
The read operation on the socket should blocked till something happened, such as tcp keepalive timed out.
The write operation on the socket should blocked till something happened, such as retransmission timed out.

@carter2000
Copy link
Author

hi. @bradfitz @fraenkel , will the problem be further confirmed ?

@ALTree ALTree modified the milestones: Unplanned, Go1.12 Nov 27, 2018
@bradfitz
Copy link
Contributor

There have been two changes at tip that I think should fix this. Could you try Go 1.12beta1 when it comes out later today?

The two changes:

@bradfitz bradfitz added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Dec 13, 2018
@bradfitz bradfitz modified the milestones: Go1.12, Go1.13 Dec 13, 2018
@gopherbot
Copy link

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Projects
None yet
Development

No branches or pull requests

6 participants