net/http: possible readloop goroutine leak #10457

Closed
yichengq opened this issue Apr 14, 2015 · 2 comments

yichengq commented Apr 14, 2015

What we encountered

We encountered this goroutine leak in our own project (https://github.com/coreos/etcd) when using a function that checks for goroutines still running after a test finishes:

    z_last_test.go:97: Test appears to have leaked a Transport:
        net/http.(*persistConn).readLoop(0xc20858c0b0)
            /usr/local/go/src/net/http/transport.go:858 +0x4eb
        created by net/http.(*Transport).dialConn
            /usr/local/go/src/net/http/transport.go:660 +0x10d4

        testing.RunTests(0x6bb550, 0x838980, 0x1a, 0x1a, 0x11800)
            /usr/local/go/src/testing/testing.go:556 +0xdc3
        testing.(*M).Run(0xc20802e140, 0x843160)
            /usr/local/go/src/testing/testing.go:485 +0xe1
        main.main()
            github.com/coreos/etcd/integration/_test/_testmain.go:102 +0x28d
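
A check of this kind can be written by dumping all goroutine stacks after each test, roughly as sketched below (the helper name is illustrative, not etcd's exact code):

package integration

import (
    "runtime"
    "strings"
    "testing"
)

// checkLeakedTransport is an illustrative sketch: it dumps all
// goroutine stacks after a test and fails if a Transport readLoop
// goroutine is still running.
func checkLeakedTransport(t *testing.T) {
    buf := make([]byte, 1<<20)
    n := runtime.Stack(buf, true) // true: capture every goroutine's stack
    stacks := string(buf[:n])
    if strings.Contains(stacks, "net/http.(*persistConn).readLoop") {
        t.Errorf("Test appears to have leaked a Transport:\n%s", stacks)
    }
}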

The error occurs under Go 1.4.2.

The situation: writeLoop has exited, but readLoop is still running and hangs on rc := <-pc.reqch forever.

What we have found

After digging into it, @xiang90 found the following, which we think is the root cause of the leak.

(All code is excerpted from "src/net/http/transport.go" in go1.4.)

Bug case

1. dialConn initializes a persistConn and starts go readLoop().

2. roundTrip is called. It increments pc.numExpectedResponses and sends a requestAndChan on reqch, and the loop in readLoop receives and handles it.

3. roundTrip has not yet received the response and is blocked at the WaitResponse block.

4. readLoop reads the response from pc.br without error. The response has no body, so readLoop pushes it into the channel and then waits on pc.br.Peek(1) again.

5. The connection is closed by the remote peer, so reads from pc.br return io.EOF.

6. pc.br.Peek(1) in readLoop returns io.EOF.

7. The loop in readLoop finds that pc.numExpectedResponses == 0 is false (roundTrip is still blocked and has not decremented the counter yet), so it waits on reqch.

8. roundTrip receives the response, decrements pc.numExpectedResponses, and exits.

9. No further roundTrip comes, and the persistConn is closed. So nothing will ever send a requestAndChan on reqch, and reqch is never closed.

10. The readLoop goroutine waits on reqch forever (see the minimal illustration below).
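
The final step is the classic leak pattern: a goroutine blocked on a receive from a channel that nobody will ever send on or close. A minimal standalone illustration (not the net/http code itself):

package main

import (
    "fmt"
    "runtime"
    "time"
)

func main() {
    reqch := make(chan struct{}, 1) // stands in for pc.reqch

    go func() {
        <-reqch // blocks forever: no sender remains and the channel is never closed
    }()

    time.Sleep(100 * time.Millisecond)
    fmt.Println("goroutines:", runtime.NumGoroutine()) // prints 2; the blocked goroutine leaks
}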

Init

func (t *Transport) dialConn(cm connectMethod) (*persistConn, error) {
    pconn := &persistConn{
        t:          t,
        cacheKey:   cm.key(),
        reqch:      make(chan requestAndChan, 1),
        writech:    make(chan writeRequest, 1),
        closech:    make(chan struct{}),
        writeErrCh: make(chan error, 1),
    }
...
    pconn.br = bufio.NewReader(noteEOFReader{pconn.conn, &pconn.sawEOF})
...
    go pconn.readLoop()

persistConn.readLoop

func (pc *persistConn) readLoop() {
    alive := true

    for alive {
        pb, err := pc.br.Peek(1)

        pc.lk.Lock()
        if pc.numExpectedResponses == 0 {
            if !pc.closed {
                pc.closeLocked()
                if len(pb) > 0 {
                    log.Printf("Unsolicited response received on idle HTTP channel starting with %q; err=%v",
                        string(pb), err)
                }
            }
            pc.lk.Unlock()
            return
        }
        pc.lk.Unlock()

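        // Bug case: no sender remains, so this receive blocks forever.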
        rc := <-pc.reqch

        var resp *Response
        if err == nil {
            resp, err = ReadResponse(pc.br, rc.req)
            if err == nil && resp.StatusCode == 100 {
                // Skip any 100-continue for now.
                // TODO(bradfitz): if rc.req had "Expect: 100-continue",
                // actually block the request body write and signal the
                // writeLoop now to begin sending it. (Issue 2184) For now we
                // eat it, since we're never expecting one.
                resp, err = ReadResponse(pc.br, rc.req)
            }
        }
...
   }
}

persistConn.roundTrip

func (pc *persistConn) roundTrip(req *transportRequest) (resp *Response, err error) {
    pc.t.setReqCanceler(req.Request, pc.cancelRequest)
    pc.lk.Lock()
    pc.numExpectedResponses++
    headerFn := pc.mutateHeaderFunc
    pc.lk.Unlock()
...
    resc := make(chan responseAndError, 1)
    pc.reqch <- requestAndChan{req.Request, resc, requestedGzip}
...
WaitResponse:
...
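    // Reached only once the response (or an error) has arrived; until
    // then readLoop sees numExpectedResponses > 0 and waits on reqch.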
    pc.lk.Lock()
    pc.numExpectedResponses--
    pc.lk.Unlock()
...
   return re.res, re.err
}

persistConn.close

func (pc *persistConn) close() {
    pc.lk.Lock()
    defer pc.lk.Unlock()
    pc.closeLocked()
}

Can we reproduce it?

I can trigger the leak about once in every five rounds of the etcd integration tests when running in race mode.

I spent 2-3 full days doing my best to reproduce it in a simple Go program, but I failed to make it happen: it is really hard to force the code to run in the order described in the bug case.

I hope you can read over the bug case; I can help test whether this is fixed.

bradfitz (Contributor) commented


xiang90 commented Apr 21, 2015

@bradfitz Thanks!
