Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/playground,internal/poll: timeout or long run time when using httptest.NewServer #60356

Open
rogpeppe opened this issue May 23, 2023 · 11 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime.
Milestone

Comments

@rogpeppe
Copy link
Contributor

What version of Go are you using (go version)?

go1.20 as reported by the playground.

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

N/A

What did you do?

https://go.dev/play/p/Pc_vys_LNVG

What did you expect to see?

It should run and terminate very quickly.

What did you see instead?

0s request send
2009/11/10 23:00:30 Do: Post "http://127.0.0.1:16324": dial tcp 127.0.0.1:16324: i/o timeout

The same code (or very similar, modulo playground result caching) sometimes terminates immediately,
and sometimes prints:

0s request send
30s request sent

When instrumented to print the goroutine stack traces on timeout, we see this set of stacks:

goroutine profile: total 5
1 @ 0x42dcf6 0x461105 0x661dd5 0x661bed 0x65e96b 0x66b6ae 0x466bc1
#	0x461104	runtime/pprof.runtime_goroutineProfileWithLabels+0x24	/usr/local/go-faketime/src/runtime/mprof.go:844
#	0x661dd4	runtime/pprof.writeRuntimeProfile+0xb4			/usr/local/go-faketime/src/runtime/pprof/pprof.go:734
#	0x661bec	runtime/pprof.writeGoroutine+0x4c			/usr/local/go-faketime/src/runtime/pprof/pprof.go:694
#	0x65e96a	runtime/pprof.(*Profile).WriteTo+0x14a			/usr/local/go-faketime/src/runtime/pprof/pprof.go:329
#	0x66b6ad	main.main.func1+0x4d					/tmp/sandbox3803422248/prog.go:15

1 @ 0x438776 0x4311d7 0x461529 0x4a9352 0x4aab3d 0x4aab2b 0x557315 0x568685 0x5678dd 0x636245 0x65b26a 0x466bc1
#	0x461528	internal/poll.runtime_pollWait+0x88		/usr/local/go-faketime/src/runtime/netpoll.go:306
#	0x4a9351	internal/poll.(*pollDesc).wait+0x31		/usr/local/go-faketime/src/internal/poll/fd_poll_runtime.go:84
#	0x4aab3c	internal/poll.(*pollDesc).waitRead+0x2bc	/usr/local/go-faketime/src/internal/poll/fd_poll_runtime.go:89
#	0x4aab2a	internal/poll.(*FD).Accept+0x2aa		/usr/local/go-faketime/src/internal/poll/fd_unix.go:614
#	0x557314	net.(*netFD).accept+0x34			/usr/local/go-faketime/src/net/fd_unix.go:172
#	0x568684	net.(*TCPListener).accept+0x24			/usr/local/go-faketime/src/net/tcpsock_posix.go:148
#	0x5678dc	net.(*TCPListener).Accept+0x3c			/usr/local/go-faketime/src/net/tcpsock.go:297
#	0x636244	net/http.(*Server).Serve+0x384			/usr/local/go-faketime/src/net/http/server.go:3059
#	0x65b269	net/http/httptest.(*Server).goServe.func1+0x69	/usr/local/go-faketime/src/net/http/httptest/server.go:310

1 @ 0x438776 0x4311d7 0x461529 0x4a9352 0x556bb9 0x556b9d 0x556bc1 0x5658e5 0x565412 0x55e718 0x5681ef 0x568089 0x54e6f2 0x54e045 0x54d6b3 0x54d0cc 0x643ea7 0x646dac 0x6457f0 0x466bc1
#	0x461528	internal/poll.runtime_pollWait+0x88		/usr/local/go-faketime/src/runtime/netpoll.go:306
#	0x4a9351	internal/poll.(*pollDesc).wait+0x31		/usr/local/go-faketime/src/internal/poll/fd_poll_runtime.go:84
#	0x556bb8	internal/poll.(*pollDesc).waitWrite+0x798	/usr/local/go-faketime/src/internal/poll/fd_poll_runtime.go:93
#	0x556b9c	internal/poll.(*FD).WaitWrite+0x77c		/usr/local/go-faketime/src/internal/poll/fd_unix.go:741
#	0x556bc0	net.(*netFD).connect+0x7a0			/usr/local/go-faketime/src/net/fd_unix.go:141
#	0x5658e4	net.(*netFD).dial+0x3c4				/usr/local/go-faketime/src/net/sock_posix.go:151
#	0x565411	net.socket+0x2d1				/usr/local/go-faketime/src/net/sock_posix.go:70
#	0x55e717	net.internetSocket+0xf7				/usr/local/go-faketime/src/net/ipsock_posix.go:142
#	0x5681ee	net.(*sysDialer).doDialTCP+0xee			/usr/local/go-faketime/src/net/tcpsock_posix.go:74
#	0x568088	net.(*sysDialer).dialTCP+0x68			/usr/local/go-faketime/src/net/tcpsock_posix.go:64
#	0x54e6f1	net.(*sysDialer).dialSingle+0x1f1		/usr/local/go-faketime/src/net/dial.go:580
#	0x54e044	net.(*sysDialer).dialSerial+0x244		/usr/local/go-faketime/src/net/dial.go:548
#	0x54d6b2	net.(*sysDialer).dialParallel+0x412		/usr/local/go-faketime/src/net/dial.go:449
#	0x54d0cb	net.(*Dialer).DialContext+0x70b			/usr/local/go-faketime/src/net/dial.go:440
#	0x643ea6	net/http.(*Transport).dial+0xe6			/usr/local/go-faketime/src/net/http/transport.go:1176
#	0x646dab	net/http.(*Transport).dialConn+0x82b		/usr/local/go-faketime/src/net/http/transport.go:1614
#	0x6457ef	net/http.(*Transport).dialConnFor+0xaf		/usr/local/go-faketime/src/net/http/transport.go:1456

1 @ 0x438776 0x447ebe 0x5570a6 0x466bc1
#	0x5570a5	net.(*netFD).connect.func2+0x85	/usr/local/go-faketime/src/net/fd_unix.go:118

1 @ 0x438776 0x447ebe 0x644b66 0x640c7e 0x62b299 0x5f49b7 0x5f423b 0x5f645b 0x66b5ae 0x66b59d 0x438338 0x466bc1
#	0x644b65	net/http.(*Transport).getConn+0x5c5	/usr/local/go-faketime/src/net/http/transport.go:1382
#	0x640c7d	net/http.(*Transport).roundTrip+0x79d	/usr/local/go-faketime/src/net/http/transport.go:590
#	0x62b298	net/http.(*Transport).RoundTrip+0x18	/usr/local/go-faketime/src/net/http/roundtrip.go:17
#	0x5f49b6	net/http.send+0x5f6			/usr/local/go-faketime/src/net/http/client.go:252
#	0x5f423a	net/http.(*Client).send+0x9a		/usr/local/go-faketime/src/net/http/client.go:176
#	0x5f645a	net/http.(*Client).do+0x8fa		/usr/local/go-faketime/src/net/http/client.go:716
#	0x66b5ad	net/http.(*Client).Do+0xed		/usr/local/go-faketime/src/net/http/client.go:582
#	0x66b59c	main.main+0xdc				/tmp/sandbox3803422248/prog.go:23
#	0x438337	runtime.main+0x1f7			/usr/local/go-faketime/src/runtime/proc.go:250

The program to produce that output is here: https://go.dev/play/p/7t5iObMhESj

Note the sleep just before the request sent is printed. Without that, the goroutine stack trace is aborted half way: it seems that the fact that the goroutine woke up triggered the runtime to run the other goroutines which should have been running anyway.

@gopherbot gopherbot added this to the Unreleased milestone May 23, 2023
@bcmills
Copy link
Contributor

bcmills commented May 23, 2023

This is definitely related to the faketime patch — it reproduces locally using go run -tags faketime all the way back to go1.14, which was the first release with a platform-independent faketime implementation (https://go.dev/cl/192740).

@bcmills bcmills added the compiler/runtime Issues related to the Go compiler and/or runtime. label May 23, 2023
@bcmills bcmills changed the title x/playground: timeout or long run time when using httptest.NewServer x/playground,internal/poll: timeout or long run time when using httptest.NewServer May 23, 2023
@gopherbot
Copy link

Change https://go.dev/cl/497618 mentions this issue: runtime/testdata/unfaketime: add a filtering binary for faketime output

@gopherbot
Copy link

Change https://go.dev/cl/497619 mentions this issue: cmd/go: add a regression test for #60356

@bcmills
Copy link
Contributor

bcmills commented May 23, 2023

I couldn't help but investigate a bit, and got as far as writing a proof-of-concept regression test. 😅

I don't plan to work on this further, but if someone wants to take it from here they're welcome to build on my WIP CLs.

@bcmills
Copy link
Contributor

bcmills commented May 23, 2023

(@ianlancetaylor, this seems to be an interaction between the runtime and poller when built with -tags=faketime.)

@ianlancetaylor
Copy link
Contributor

If I build the test case with -tags faketime,netgo it seems to work fine. So I'm not sure it's the poller as such; I think it's cgo.

I don't understand this line in checkdead:

	if !iscgo && cgoHasExtraM && extraMLength.Load() > 0 {

Why is it testing !iscgo? Why shouldn't we instead write,

	run := mcount() - sched.nmidle - sched.nmidlelocked - sched.nmsys - int32(extraMLength.Load())
	if run > 0 {
		return
	}

That is, account for all of extraMLength rather than just treating it as a value of 1 only if !iscgo. There might be some race conditions, but all the tests pass with that change.

CC @golang/runtime

@prattmic
Copy link
Member

We can't simply write that because then this program fails checkdead and claims to be deadlocked:

// Copyright 2015 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

//go:build !plan9 && !windows
// +build !plan9,!windows

package main

/*
#include <unistd.h>
#include <pthread.h>

void go_callback();

static void *thr(void *arg) {
    sleep(1);
    go_callback();
    return 0;
}

static void foo() {
    pthread_t th;
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 256 << 10);
    pthread_create(&th, &attr, thr, 0);
    //pthread_join(th, 0);
}
*/
import "C"

import (
        "os"
)

//export go_callback
func go_callback() {
        os.Exit(0)
}

func main() {
        C.foo()
        select {}
}

That is, we use extraMLength > 0 as an indication that we might get a callback from C, though admittedly this might not be the best way to represent this.

I haven't actually looked at the faketime issue yet, that's next.

@ianlancetaylor
Copy link
Contributor

Oh, right, I forgot about the possibility of a callback.

I think that means that faketime fundamentally can't work with a program that uses cgo. We never know for sure when to advance the fake clock.

@prattmic
Copy link
Member

Hm, actually I don't see a clean way to resolve this. Package net calls into C, meaning it is now technically possible for C to callback into Go at any time in the future. So it is not ever safe to say that there is truly nothing to do except move faketime forward.

A couple of possibilities:

  1. Don't use faketime with cgo. i.e., the playground should build with cgo disabled (I'm surprised it doesn't already).
  2. faketime could exempt this case. i.e., more time forward if there is nothing to do besides wait for a cgo callback. That is a bit odd, but on the other hand the C code is completely unaware of our fake time, so it's not like we will mess up Go / C timing any more than it already is.

@bcmills
Copy link
Contributor

bcmills commented Jun 30, 2023

faketime could exempt this case. i.e., more time forward if there is nothing to do besides wait for a cgo callback. That is a bit odd, but on the other hand the C code is completely unaware of our fake time, so it's not like we will mess up Go / C timing any more than it already is.

I think we should do that. If there are no Go threads running, and no Go threads blocked in syscalls, and some timer waiting in the timer heap, then we should advance faketime.

The vast majority of Playground programs won't have asynchronous callbacks from C. (At best, they might have synchronous callbacks, but those shouldn't matter for checkdead.) For those rare ones that do, the order of execution between the Go programs and the C callbacks is unspecified anyway.

(That is: faketime would just be simulating a reaaaaaaally long scheduling delay for the C threads. 😅)

@lrewega
Copy link

lrewega commented Jul 6, 2023

The playground link in the issue summary usually times out, but on rare occasion deadlocks.

By comparison the following program appears to deadlock in the playground very reliably for me on all Go versions in the playground (though on rare occasions it seems to succeed): https://go.dev/play/p/vRc-ulRCihw

output for deadlock
fatal error: all goroutines are asleep - deadlock!

goroutine 1 [select]:
net/http.(*Transport).getConn(0xc00012a140, 0xc00001e240, {{}, 0x0, {0xc00001a090, 0x4}, {0xc00001c130, 0xf}, 0x0})
	/usr/local/go-faketime/src/net/http/transport.go:1382 +0x5c6
net/http.(*Transport).roundTrip(0xc00012a140, 0xc00014a000)
	/usr/local/go-faketime/src/net/http/transport.go:590 +0x79e
net/http.(*Transport).RoundTrip(0x10?, 0x73a780?)
	/usr/local/go-faketime/src/net/http/roundtrip.go:17 +0x19
net/http.send(0xc00014a000, {0x73a780, 0xc00012a140}, {0x8?, 0x6ba100?, 0x0?})
	/usr/local/go-faketime/src/net/http/client.go:252 +0x5f7
net/http.(*Client).send(0xc000016cf0, 0xc00014a000, {0x902de0?, 0x40c85e?, 0x0?})
	/usr/local/go-faketime/src/net/http/client.go:176 +0x9b
net/http.(*Client).do(0xc000016cf0, 0xc00014a000)
	/usr/local/go-faketime/src/net/http/client.go:716 +0x8fb
net/http.(*Client).Do(...)
	/usr/local/go-faketime/src/net/http/client.go:582
net/http.(*Client).Get(0x73af60?, {0xc00001a090?, 0x0?})
	/usr/local/go-faketime/src/net/http/client.go:480 +0x6a
main.main()
	/tmp/sandbox1064497742/prog.go:11 +0x37

goroutine 6 [IO wait]:
internal/poll.runtime_pollWait(0x7f7be3daff48, 0x72)
	/usr/local/go-faketime/src/runtime/netpoll.go:306 +0x89
internal/poll.(*pollDesc).wait(0xc000132000?, 0xc000070d58?, 0x0)
	/usr/local/go-faketime/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
	/usr/local/go-faketime/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc000132000)
	/usr/local/go-faketime/src/internal/poll/fd_unix.go:614 +0x2bd
net.(*netFD).accept(0xc000132000)
	/usr/local/go-faketime/src/net/fd_unix.go:172 +0x35
net.(*TCPListener).accept(0xc0000100c0)
	/usr/local/go-faketime/src/net/tcpsock_posix.go:148 +0x25
net.(*TCPListener).Accept(0xc0000100c0)
	/usr/local/go-faketime/src/net/tcpsock.go:297 +0x3d
net/http.(*Server).Serve(0xc000146000, {0x73c8c0, 0xc0000100c0})
	/usr/local/go-faketime/src/net/http/server.go:3059 +0x385
net/http/httptest.(*Server).goServe.func1()
	/usr/local/go-faketime/src/net/http/httptest/server.go:310 +0x6a
created by net/http/httptest.(*Server).goServe
	/usr/local/go-faketime/src/net/http/httptest/server.go:308 +0x6a

goroutine 7 [IO wait]:
internal/poll.runtime_pollWait(0x7f7be3dafe58, 0x77)
	/usr/local/go-faketime/src/runtime/netpoll.go:306 +0x89
internal/poll.(*pollDesc).wait(0xc000132080?, 0x6c48f6?, 0x0)
	/usr/local/go-faketime/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitWrite(...)
	/usr/local/go-faketime/src/internal/poll/fd_poll_runtime.go:93
internal/poll.(*FD).WaitWrite(...)
	/usr/local/go-faketime/src/internal/poll/fd_unix.go:741
net.(*netFD).connect(0xc000132080, {0x73cbc8?, 0xc00001c030}, {0x0?, 0x0?}, {0x73ab00?, 0xc000134060?})
	/usr/local/go-faketime/src/net/fd_unix.go:141 +0x799
net.(*netFD).dial(0xc000132080, {0x73cbc8, 0xc00001c030}, {0x73d4d8?, 0x0?}, {0x73d4d8?, 0xc000016f30}, 0x0?)
	/usr/local/go-faketime/src/net/sock_posix.go:151 +0x3c5
net.socket({0x73cbc8, 0xc00001c030}, {0x6c48f6, 0x3}, 0x2, 0x1, 0x415210?, 0x0?, {0x73d4d8, 0x0}, ...)
	/usr/local/go-faketime/src/net/sock_posix.go:70 +0x2d2
net.internetSocket({0x73cbc8, 0xc00001c030}, {0x6c48f6, 0x3}, {0x73d4d8, 0x0}, {0x73d4d8, 0xc000016f30}, 0xc00001c140?, 0x0, ...)
	/usr/local/go-faketime/src/net/ipsock_posix.go:142 +0xf8
net.(*sysDialer).doDialTCP(0xc0001481b0, {0x73cbc8, 0xc00001c030}, 0x0, 0xc000016f00?)
	/usr/local/go-faketime/src/net/tcpsock_posix.go:74 +0xef
net.(*sysDialer).dialTCP(0x28?, {0x73cbc8?, 0xc00001c030?}, 0xc00008d5c8?, 0x55d085?)
	/usr/local/go-faketime/src/net/tcpsock_posix.go:64 +0x69
net.(*sysDialer).dialSingle(0xc0001481b0, {0x73cbc8, 0xc00001c030}, {0x73bc38?, 0xc000016f30})
	/usr/local/go-faketime/src/net/dial.go:580 +0x1f2
net.(*sysDialer).dialSerial(0xc0001481b0, {0x73cbc8, 0xc00001c030}, {0xc000014db0?, 0x1, 0x415210?})
	/usr/local/go-faketime/src/net/dial.go:548 +0x245
net.(*sysDialer).dialParallel(0xc000014da0?, {0x73cbc8?, 0xc00001c030?}, {0xc000014db0?, 0xc00001c030?, 0x6c4a5d?}, {0x0?, 0x6c48f6?, 0x415210?})
	/usr/local/go-faketime/src/net/dial.go:449 +0x413
net.(*Dialer).DialContext(0x8d3020, {0x73cbc8, 0xc00001c030}, {0x6c48f6, 0x3}, {0xc00001c130, 0xf})
	/usr/local/go-faketime/src/net/dial.go:440 +0x70c
net/http.(*Transport).dial(0x443d71?, {0x73cbc8?, 0xc00001c030?}, {0x6c48f6?, 0x420785?}, {0xc00001c130?, 0x0?})
	/usr/local/go-faketime/src/net/http/transport.go:1185 +0xc5
net/http.(*Transport).dialConn(0xc00012a140, {0x73cbc8, 0xc00001c030}, {{}, 0x0, {0xc00001a090, 0x4}, {0xc00001c130, 0xf}, 0x0})
	/usr/local/go-faketime/src/net/http/transport.go:1614 +0x82c
net/http.(*Transport).dialConnFor(0x0?, 0xc0001182c0)
	/usr/local/go-faketime/src/net/http/transport.go:1456 +0xb0
created by net/http.(*Transport).queueForDial
	/usr/local/go-faketime/src/net/http/transport.go:1425 +0x3ea

Program exited.
expected output
2009/11/10 23:00:00 &{Status:200 OK StatusCode:200 Proto:HTTP/1.1 ProtoMajor:1 ProtoMinor:1 Header:map[Content-Length:[0] Date:[Tue, 10 Nov 2009 23:00:00 GMT]] Body:{} ContentLength:0 TransferEncoding:[] Close:false Uncompressed:false Trailer:map[] Request:0xc00014a000 TLS:<nil>}

Program exited.

I don't believe I'm contributing anything particularly novel here, but until I found this issue I thought this was completely reproducible. It wasn't until I saw the same deadlock with the playground link in the summary that I realized it's just very frequent.

Given that a deadlock seems more severe than an I/O timeout, I figured I'd share another datapoint. (aside: I am pretty sure I have written the same, or nearly the same, playground program before, much more recently than 1.14, and never encountered this, so I wonder if another change has exacerbated it.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime.
Projects
Status: In Progress
Development

No branches or pull requests

6 participants