
runtime: scheduler starvation (?) on windows 7/8/10 when a goroutine performs disk and network i/o (http.ServeFile) #19971

Closed
tmm1 opened this issue Apr 14, 2017 · 8 comments
Labels: FrozenDueToAge, help wanted, NeedsInvestigation, OS-Windows
Milestone: Go1.9Maybe

tmm1 (Contributor) commented Apr 14, 2017

What version of Go are you using (go version)?

go version go1.8.1 windows/amd64

What operating system and processor architecture are you using (go env)?

set GOARCH=amd64
set GOBIN=
set GOEXE=.exe
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOOS=windows
set GOPATH=C:\Users\tmm1\go
set GORACE=
set GOROOT=c:\go
set GOTOOLDIR=c:\go\pkg\tool\windows_amd64
set GCCGO=gccgo
set CC=gcc
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0
set CXX=g++
set CGO_ENABLED=1
set PKG_CONFIG=pkg-config
set CGO_CFLAGS=-g -O2
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-g -O2
set CGO_FFLAGS=-g -O2
set CGO_LDFLAGS=-g -O2

What did you do?

I have a complex application that deals with live video streams that come from a hardware TV tuner device on my home network. The application itself runs on mac, linux, windows and freebsd, but this particular issue only occurs on windows.

The device that produces the video stream is commercially available, produces large uncompressed video streams (~2MB/s), and is very sensitive to latency. It has limited on-board memory, and aggressively drops packets which are not received by the client in a timely fashion.

I've managed to whittle down my app into a simple repro using only stdlib, containing an http server with two routes:

  1. /network, which opens an http connection to the hardware tuner and proxies back the response data

  2. /file, which uses http.ServeFile to serve up a large video file from disk.

package main

import (
	"io"
	"log"
	"net/http"
)

func main() {
	http.HandleFunc("/file", func(w http.ResponseWriter, req *http.Request) {
		http.ServeFile(w, req, "video.mpg")
	})
	http.HandleFunc("/network", func(w http.ResponseWriter, req *http.Request) {
		resp, err := http.Get("http://10.0.1.4:5004/auto/v704")
		if err != nil {
			log.Fatal(err)
		}
		if resp.StatusCode != 200 {
			resp.Body.Close()
			log.Fatalf("got http response code %v", resp.StatusCode)
		}
		defer resp.Body.Close()

		w.Header().Set("Transfer-Encoding", "identity")
		w.Header().Set("Content-Type", "video/mpeg")
		io.Copy(w, resp.Body)
	})
	http.HandleFunc("/", func(w http.ResponseWriter, req *http.Request) {
		w.WriteHeader(404)
	})
	http.ListenAndServe(":9987", nil)
}

What did you expect to see?

I expect to be able to use curl against both http endpoints concurrently, to receive data from both the tuner and the file on disk at the same time.

On linux, darwin and freebsd, this code works as expected and there are no problems.

What did you see instead?

On windows, as soon as a request is made to /file, it starves out the other running goroutine. The end effect is that the goroutine serving the /network request does not read from its connection to the hardware tuner fast enough, causing the tuner to drop packets and thus impacting the watchability of the video stream.

I have confirmed this behavior with at least half a dozen other users of my app, who are all using different hardware, different versions of windows (all 64-bit, but mixed between 7, 8 and 10), and have different home network setups.

Further investigation

In an effort to narrow down what's going on, I tried adding several other http handlers and observed the following behaviors:

async file i/o

I tried using the Microsoft/go-winio library to perform async disk i/o when reading the file from disk. This setup still reproduces the behavior, so the scheduler bug triggers with both sync and async i/o apis.

	http.HandleFunc("/file-async", func(w http.ResponseWriter, req *http.Request) {
		pathp, err := syscall.UTF16PtrFromString("video.mpg")
		if err != nil {
			log.Fatalf("path failed: %v", err)
		}
		h, err := syscall.CreateFile(pathp, syscall.GENERIC_READ|syscall.GENERIC_WRITE, 0, nil, syscall.OPEN_EXISTING, syscall.FILE_FLAG_OVERLAPPED, 0)
		if err != nil {
			log.Fatalf("createfile failed: %v", err)
		}
		fp, err := winio.MakeOpenFile(h)
		if err != nil {
			log.Fatalf("makeopenfile failed: %v", err)
		}
		defer fp.Close()
		w.Header().Set("Content-Type", "video/mpeg")
		io.Copy(w, fp)
	})

disk i/o only (no network)

I thought maybe it was the extra network i/o of serving up the file that was impacting the network connection on the other goroutine, so I swapped in ioutil.Discard. I could not reproduce the issue with this setup, so disk i/o alone is not enough to trigger the scheduler bug.

	http.HandleFunc("/file-discard", func(w http.ResponseWriter, req *http.Request) {
		fp, err := os.Open("video.mpg")
		if err != nil {
			log.Fatal(err)
		}
		defer fp.Close()

		w.Header().Set("Content-Type", "video/mpeg")
		io.Copy(ioutil.Discard, fp)
	})

network i/o only (no disk)

Since the disk i/o alone didn't trigger the issue, I thought it must be the network. I added this route to flood zero bytes over the network, but it also did not reproduce the issue. So network i/o alone is also not enough to trigger the scheduler bug.

	http.HandleFunc("/zeros", func(w http.ResponseWriter, req *http.Request) {
		data := make([]byte, 32768)
		for {
			w.Write(data)
		}
	})

rate limited disk i/o

So it appears the scheduler bug only occurs when a goroutine does both disk and network i/o, like with http.ServeFile.

Since at this point it seemed like the /file goroutine was hogging the scheduler and moving bytes too quickly, I tried to insert a rate limiter from fujiwara/shapeio. This worked, and is an effective work-around for the bug. With the rate limit in place, the scheduler does not starve out the other goroutines.

	http.HandleFunc("/file-ratelimit", func(w http.ResponseWriter, req *http.Request) {
		fp, err := os.Open("video.mpg")
		if err != nil {
			log.Fatal(err)
		}
		defer fp.Close()

		reader := shapeio.NewReader(fp)
		reader.SetRateLimit(4 << 20)

		w.Header().Set("Content-Type", "video/mpeg")
		io.Copy(w, reader)
	})
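For reference, a reader-side limiter of this kind boils down to roughly the following (a sketch, not the actual shapeio implementation; slowReader is a made-up name, and the snippet assumes the io and time packages are imported): after each Read it sleeps just long enough to keep the average throughput at or below the configured bytes-per-second limit.

// slowReader is a sketch of a rate-limited reader: bps is the allowed
// average throughput in bytes per second.
type slowReader struct {
	r     io.Reader
	bps   float64
	start time.Time
	total int64
}

func (s *slowReader) Read(p []byte) (int, error) {
	if s.start.IsZero() {
		s.start = time.Now()
	}
	n, err := s.r.Read(p)
	s.total += int64(n)
	// If we are ahead of the allowed rate, sleep off the surplus.
	ahead := time.Duration(float64(s.total)/s.bps*float64(time.Second)) - time.Since(s.start)
	if ahead > 0 {
		time.Sleep(ahead)
	}
	return n, err
}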

tracer

I collected a 10s trace from the reproduction. During the first 5s only /network was active, and during the last 5s the /file goroutine was also running.

	http.HandleFunc("/trace", func(w http.ResponseWriter, req *http.Request) {
		fp, err := os.OpenFile("trace.out", os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0644)
		if err != nil {
			log.Fatal(err)
		}
		trace.Start(fp)
		time.Sleep(10 * time.Second)
		trace.Stop()
	})
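(The resulting trace.out can be opened in the trace viewer with go tool trace trace.out.)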

I'm not terribly familiar with the trace output, but here it is in case someone else can make more sense of it:

Attachment: trace.txt

[screenshot of the trace viewer, 2017-04-13 6:27 PM]

cc @dvyukov @jstarks who might have some ideas

bradfitz added the NeedsInvestigation, OS-Windows, and help wanted labels on Apr 14, 2017
bradfitz changed the title to "runtime: scheduler starvation (?) on windows 7/8/10 when a goroutine performs disk and network i/o (http.ServeFile)" on Apr 14, 2017
bradfitz added this to the Go1.9Maybe milestone on Apr 14, 2017
alexbrainman (Member) commented
The only thing I can think of at this moment is that perhaps your /file path uses syscall.TransmitFile, and the code around syscall.TransmitFile is broken.

Alex

tmm1 (Contributor, Author) commented Apr 14, 2017

The only thing I can think of at this moment is that perhaps your /file path uses syscall.TransmitFile, and the code around syscall.TransmitFile is broken.

I didn't realize go implemented sendfile() optimizations in io.Copy. Cool!

Unfortunately it doesn't look like this is the cause of the issue. If it were, I would have expected the /file-async route to perform better than its sync counterpart.

Just to make sure, I added another route which uses io.SectionReader (since sendFile currently only kicks in for *os.File and *io.LimitedReader), and sure enough there is no change in behavior.

	http.HandleFunc("/file-notransmit", func(w http.ResponseWriter, req *http.Request) {
		fp, err := os.Open("video.mpg")
		if err != nil {
			log.Fatal(err)
		}
		defer fp.Close()

		w.Header().Set("Content-Type", "video/mpeg")
		io.Copy(w, io.NewSectionReader(fp, 0, 1024<<20))
	})
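For context, here's a rough sketch of the dispatch io.Copy performs (an illustration only, not the actual stdlib code; copySketch is a made-up name): the TransmitFile fast path is only reachable via the destination's ReadFrom, and that path only recognizes the source when it can see the underlying *os.File, so wrapping the file in io.NewSectionReader forces the ordinary buffered copy.

func copySketch(dst io.Writer, src io.Reader) (int64, error) {
	// If the source knows how to write itself out, let it.
	if wt, ok := src.(io.WriterTo); ok {
		return wt.WriteTo(dst)
	}
	// net/http's ResponseWriter implements io.ReaderFrom; on Windows its
	// ReadFrom can reach TransmitFile when src is (or wraps) an *os.File.
	if rf, ok := dst.(io.ReaderFrom); ok {
		return rf.ReadFrom(src)
	}
	// Otherwise fall back to a plain buffered copy, which is what a
	// SectionReader-wrapped file gets.
	return io.CopyBuffer(dst, src, make([]byte, 32*1024))
}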

tmm1 (Contributor, Author) commented Apr 14, 2017

I was re-running yesterday's tests this morning and noticed that the /zeros endpoint was now also impacting the video stream from /network: curl against /zeros maxed out at ~11MB/s, and the existing /network connection stopped receiving any data.

Digging in further, I discovered that the mac I was using to make curl requests against the windows box was connected at 100mbps instead of gigabit. I'm now going through all the tests again to see if the behavior is the same with gigabit connectivity.

tmm1 (Contributor, Author) commented Apr 14, 2017

I can no longer reproduce this behavior with my test program after switching to gigabit on the client, so it appears to be specific to a gigabit windows box talking to a client that's only 100mbps. The problem has nothing to do with disk i/o, and can be replicated with the /network and /zeros endpoints in the test program.

The same test from the same 100mbps client (macbook w/ usb/ethernet adapter) against a gigabit mac mini running the test program works just fine.

This could easily be an OS issue, related to the windows network stack or its own thread scheduler. I guess the next step would be to write the same program in another language to compare the behavior.

Unfortunately this is not the bug I was searching for, as my app is affected even at gigabit speeds. I clearly did not capture the problematic scenario in my test program. In my app, the problem seems to occur when I'm reading and writing from the same file concurrently, so I'll start there as I try to write a new repro.

I'll try to come back to this bug after I find the one that I was originally trying to track down.

bradfitz (Contributor) commented
I'll close this one. If you have a self-contained and reproducible bug in the future you can open a new one.

tmm1 (Contributor, Author) commented Apr 15, 2017

I wrote my test in another language, and had the same issue. So this is clearly not a bug in the golang runtime.

The problem appears to be related to how windows sends Ethernet PAUSE frames in a mixed gigabit/fast-ethernet environment, combined with how the TV tuner hardware device reacts when it sees a PAUSE frame on the network.

Sorry for the noise!

tmm1 (Contributor, Author) commented Apr 20, 2017

It turns out this wasn't related to the TV tuner hardware after all; it comes down to the Windows networking stack and ethernet drivers. It's easily reproduced as follows:

  1. start any http server on a windows machine connected via gigabit, serving up large files (a minimal server sketch follows this list)
  2. use curl to download a file to a gigabit client; see speeds of ~105MB/s
  3. use curl to download a file to a 100mbps client; see speeds of ~11MB/s
  4. run both curls at the same time: the 100mbps client stays at ~11MB/s, but the gigabit client falls to 4MB/s
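For step 1, any static file server will do; a minimal Go sketch along these lines is enough (the port and directory here are placeholders):

package main

import (
	"log"
	"net/http"
)

func main() {
	// Serve whatever large files are in the current directory on :9987.
	log.Fatal(http.ListenAndServe(":9987", http.FileServer(http.Dir("."))))
}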

This behavior is documented in various posts across the internet including https://www.smallnetbuilder.com/lanwan/lanwan-features/30212-when-flow-control-is-not-a-good-thing, and cannot be reproduced once "Flow Control" is disabled on the ethernet adapter on the windows machine.

Again, this has nothing to do with the golang runtime, but after weeks of investigating this crazy bug I wanted to document my findings for posterity.

golang locked and limited conversation to collaborators on Apr 20, 2018