Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: Dial is slow on localhost #23366

Closed
levrik opened this issue Jan 7, 2018 · 29 comments
Closed

net: Dial is slow on localhost #23366

levrik opened this issue Jan 7, 2018 · 29 comments
Labels
NeedsFix The path to resolution is known, but the work has not been done. OS-Windows
Milestone

Comments

@levrik
Copy link

levrik commented Jan 7, 2018

See #23366 (comment) for more details!

What version of Go are you using (go version)?

go version go1.9.2 windows/amd64

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

set GOARCH=amd64
set GOBIN=
set GOEXE=.exe
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOOS=windows
set GORACE=
set GOROOT=C:\Go
set GOTOOLDIR=C:\Go\pkg\tool\windows_amd64
set GCCGO=gccgo
set CC=gcc
set CXX=g++
set CGO_ENABLED=1
set CGO_CFLAGS=-g -O2
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-g -O2
set CGO_FFLAGS=-g -O2
set CGO_LDFLAGS=-g -O2
set PKG_CONFIG=pkg-config

What did you do?

now := time.Now()
net.Dial("tcp", "localhost:3000")
fmt.Println(time.Since(now))

This code outputs around 1s.
When replacing localhost with 127.0.0.1 it takes around 2ms, sometimes also <1ms.
google.de:80 is around 60ms.

Is this expected to take so much longer with localhost? Took me several hours to find out that this was the root cause of multiple issues in my application.

@davecheney
Copy link
Contributor

davecheney commented Jan 7, 2018 via email

@davecheney
Copy link
Contributor

davecheney commented Jan 7, 2018 via email

@levrik
Copy link
Author

levrik commented Jan 7, 2018

@davecheney As I said google.de:80 is working fast (60ms) so I don't think it's an DNS issue.
I will continue to investigate.

@levrik
Copy link
Author

levrik commented Jan 7, 2018

@davecheney Just tried under macOS, same network. 2ms for localhost:3000 and 220µs for 127.0.0.1:3000. Maybe an Windows issue?

Edit:
Tested in WSL. About 800µs for localhost:3000 and 200µs for 127.0.0.1:3000. Not sure if requests from WSL go through the Windows DNS stack.

@bradfitz
Copy link
Contributor

bradfitz commented Jan 7, 2018

Ephemeral port exhaustion maybe?

@davecheney
Copy link
Contributor

davecheney commented Jan 7, 2018 via email

@odeke-em odeke-em changed the title net: net.Dial slow on localhost net: Dial is slow on localhost Jan 7, 2018
@agnivade
Copy link
Contributor

agnivade commented Jan 8, 2018

@levrik - Perhaps you can write a similar program in a different programming language and test it in the same box ? That will eliminate whether its an issue with your networking stack or with Go.

@davecheney
Copy link
Contributor

davecheney commented Jan 8, 2018 via email

@as
Copy link
Contributor

as commented Jan 8, 2018

(with a stopwatch, unless ping supports a one shot mode on windows)

ping -n 1 localhost

@levrik
Copy link
Author

levrik commented Jan 8, 2018

@davecheney @agnivade Let me try later this evening.

@levrik
Copy link
Author

levrik commented Jan 13, 2018

Sorry for not responding.
Just tried a ping to localhost. Took below 1ms.

@kkinnunen
Copy link

My connections were taking exactly 1s each with

  • dialing to localhost:port
  • listening on tcp, localhost:port

Worked around with using dialing and listening on "tcp", 127.0.0.1:port

net.Listen network (tcp, tcp4) and address of ":port", "localhost:port", "127.0.0.1:port" seem affect the outcome (to my adhoc testing). For example:

  • dialing to localhost:port
  • listening on "tcp", ":0" -> fast
  • listening on "tcp4", ":0" ->slow

go version
go version go1.10 windows/amd64

@andybons andybons added this to the Unplanned milestone Mar 15, 2018
@andybons andybons added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Mar 15, 2018
@ewxrjk
Copy link

ewxrjk commented Mar 19, 2018

I encountered the same issue, and I've done some digging. The first clue was that if I tried to connect to a port that has no listener at all the delay was 2 seconds, not 1 second.

The delay isn't in the name lookup. But the results of the name lookup do matter: here, localhost results in two addresses, ::1 and 127.0.0.1. In this case net.Dial attempts to connect to ::1 first and 127.0.0.1 second.

The delay arises from failed connection attempts. Specifically, the call to fd.pfd.ConnectEx takes almost exactly a second when it fails (for any address, not just local ones). You can see this by modifying the implementation to print before/after timestamps.

In the localhost case, if you're listening on both IPv4 and IPv6, then it succeeds immediately. If you're listening on just IPv4 then you get failed connection attempted and a 1s delay followed by success. If you're listening on neither then you get two failed connection attempts and thus a 2s delay.

The same issue affects PuTTY. tcpdump reveals that what's going on is that Windows apparently does not believe the RST bits it gets back from the target of the connection (even if, apparently, it is itself the target) and keeps trying for up to a second.

I doubt that Go can do anything about this. For application developers the answer is to bind to both IPv4 and IPv6 addresses, so that it doesn't matter what connection order clients use.

A couple of references:

@levrik
Copy link
Author

levrik commented Mar 20, 2018

Ah, that's really interesting.
The issue I have here is that I'm not in control of the target servers. I built a development proxy which also proxies websockets besides normal HTTP. If you use, for example, create-react-app which uses sockjs for the connection to its live reload server it will fall back first to eventsource and after that to polling. that is caused by some internal dynamic timeout that is calculated based on a normal HTTP call (which is fast). I think it is fine to just use 127.0.0.1 in that case, at least for now.

@agnivade
Copy link
Contributor

agnivade commented May 2, 2018

@levrik - So, would it be fair to say that this issue can be closed ?

@agnivade agnivade added WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels May 19, 2018
@gopherbot
Copy link

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

@levrik
Copy link
Author

levrik commented Jun 19, 2018

@agnivade Oh sorry. Completely forgot to anwser. Yes. Maybe it'd be an idea to put information about that into the Go docs?

@josharian josharian reopened this Jun 19, 2018
@agnivade
Copy link
Contributor

A PR will be much appreciated.

/cc @bradfitz

@bradfitz bradfitz added OS-Windows NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. and removed WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. labels Jul 6, 2018
@golang golang deleted a comment from gh67uyyghj Jul 6, 2018
sunjayBhatia added a commit to cloudfoundry/diego-logging-client that referenced this issue Jul 13, 2018
Investigation story: https://www.pivotaltracker.com/story/show/158945682

This Golang issue has the required context: golang/go#23366 (comment)
- the loggregator agent only listens on localhost on ipv4
- net.Dial tries two addresses when it receives a localhost address: ::1 and 127.0.0.1
- according to the issue, the dial to ::1 fails and causes a second delay
- our addition of the 1 second dial timeout and blocking dial caused us to see this on AWS and vSphere but not GCP because GCP stemcells have ipv6 disabled on all interfaces

[#158991284](https://www.pivotaltracker.com/story/show/158991284)
@inliquid
Copy link

Got same here with http package and some profiling tests, f.e. this code seemed broken and program just hanged until localhost was changed to 127.0.0.1

func main() {
	ctrl.Setup()

	go func() {
		if err := http.ListenAndServe("localhost:3000", &poms.GZipServer{}); err != nil {
			log.Fatalln("Error listening:", err)
		}
	}()

	f, err := os.Create("cpu.prof")
	if err != nil {
		log.Fatalln("Can't create profiler dump file:", err)
	}

	pprof.StartCPUProfile(f)

	for i := 0; i < 1000; i++ {
		if _, err := http.Get("http://localhost:3000/api/purchaseOrders/1"); err != nil {
			log.Println("Error sending request:", err)
		}
	}

	pprof.StopCPUProfile()
}

Note, that in this example both listener and client supposed to bind to same host/interface: localhost, and there is no way to get an idea of what is happening. No errors, no feedback on failed [::1] attempts, what so ever. After replacing localhost with 127.0.0.1 everything worked.

I expect to receive some errors back from Get when there are timed out or refused connect attempts on different interfaces, including attempts to connect on IPv6/IPv4.

@as
Copy link
Contributor

as commented Nov 25, 2018

@inliquid you have a transactional race in your program.

There is no guarantee that by the time your listener goroutine resolves the host part of the socket and starts the http.ListenAndServe logic, your client hasn't already tried 1000 times to connect and failed.

I wouldn't be surprised if changing it to the IPv4 address made this process quicker, allowing your server to listen before the clients started clienting.

Another question is how you determine success. Nothing happens if the client relieves no error.

I suggest creating your own listener and attaching it to the http server in two separate steps. This would allow you to implement your own net.Listener and net.Conn for the connections and add whatever debugging output you want.

@inliquid
Copy link

inliquid commented Nov 25, 2018

@as well, in fact I don't have race problem. I could have such problem as you suggested, but I don't. If there were any - I would see an error as a result. I did a lot of experiments before realized what is happening, including running server and client in 2 separate processes and emulating http server to see is http.Get even working. Problem is that http.Get returns nothing in this case.

@ewxrjk
Copy link

ewxrjk commented Nov 25, 2018

@inliquid there is not much Go can do about the connect issue. It is just reflecting the behavior of the underlying platform. The best strategies are either (1) bind to both loopback addresses, or (2) bind to just one of them but also only connect to that one.

In general binding to names rather than addresses is risky: if the name resolves to multiple addresses then it will bind to only one of those addresses. There's already a warning about this under net.Listen.

@inliquid
Copy link

inliquid commented Nov 25, 2018

I agree in general, however, I think it's clear that localhost in Windows is exceptional case and since it has nothing to do with DNS or hosts I think it must be treated in exceptional manner, explicitly, avoiding such situations were program behavior may change unexpectedly. Maybe just make "localhost" == "127.0.0.1", or even deny "localhost" and require explicit address, smth like

  • "localhost_v4" == "127.0.0.1"
  • "localhost_v6" == "[::1]"

@EllieLockhart
Copy link

Is this still an issue of concern, and if so do we have potential courses of action? Is this a situation where at least a note in documentation should be made for the time being?

@ewxrjk
Copy link

ewxrjk commented Sep 5, 2020

@EllieLockhart the root cause is in Microsoft's TCP stack and only they can fix it.

There are some possible application/operator responses and they could be documented in https://golang.org/pkg/net/ (or elsewhere).

  • For clients that want to connect to servers that listen on 127.0.0.1 only, dial 127.0.0.1 instead of localhost. This isn't a general fix but as a tactical response to the problem I suspect it would be widely acceptable.
  • For servers that want to listen to loopback only, listen on both ::1 and 127.0.0.1, if the local platform supports both.

@inliquid

This comment has been minimized.

@qmuntal
Copy link
Contributor

qmuntal commented May 17, 2023

I'll just leave this here https://github.com/openjdk/jdk/blob/be4f25b0c885a6405b484e4035b30005311a2802/src/java.base/windows/native/libnio/ch/Net.c#L240-L243

Thanks for the pointer @artemsnisarenko. I've submitted CL 23366 based on that.

@gopherbot
Copy link

Change https://go.dev/cl/495875 mentions this issue: net: make Dial fail faster on Windows closed loopback devices

@qmuntal qmuntal added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels May 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsFix The path to resolution is known, but the work has not been done. OS-Windows
Projects
None yet
Development

No branches or pull requests