runtime: scheduler starvation (?) on windows 7/8/10 when a goroutine performs disk and network i/o (http.ServeFile) #19971
The only thing I can think of at this moment is that perhaps your /file path uses syscall.TransmitFile, and the code around syscall.TransmitFile is broken. Alex
I didn't realize Go implemented sendfile()-style optimizations in net/http. Unfortunately it doesn't look like this is the cause of the issue; if it were, I would have expected the handler below, which avoids TransmitFile, not to reproduce the problem. Just to make sure, I added another route which uses io.Copy with an io.SectionReader instead of http.ServeFile, so the TransmitFile fast path shouldn't kick in:

http.HandleFunc("/file-notransmit", func(w http.ResponseWriter, req *http.Request) {
	fp, err := os.Open("video.mpg")
	if err != nil {
		log.Fatal(err)
	}
	defer fp.Close()
	w.Header().Set("Content-Type", "video/mpeg")
	// Copy through user space; an io.SectionReader should not hit the
	// TransmitFile fast path the way an *os.File handed to ServeFile can.
	io.Copy(w, io.NewSectionReader(fp, 0, 1024<<20))
})
I was re-running yesterday's tests this morning, and noticed that the behavior had changed. Digging in further, I discovered that the mac I was using to make the test requests was only connected at 100mbps (via a usb/ethernet adapter).
I can no longer reproduce this behavior with my test program after switching to gigabit on the client, so it appears to be specific to when a gigabit windows box is talking to a client that's only 100mbps. The problem has nothing to do with disk i/o, and can be replicated without any disk reads at all. The same test from the same 100mbps client (macbook w/ usb/ethernet adapter) against a gigabit mac mini running the test program works just fine. This could easily be an OS issue, related to the windows network stack or its own thread scheduler. I guess the next step would be to write the same program in another language to compare the behavior.

Unfortunately this is not the bug I was searching for, as my app is affected even at gigabit speeds. I clearly did not capture the problematic scenario in my test program. In my app, the problem seems to occur when I'm reading and writing from the same file concurrently, so I'll start there as I try to write a new repro. I'll try to come back to this bug after I find the one that I was originally trying to track down.
I'll close this one. If you have a self-contained and reproducible bug in the future, you can open a new one.
I wrote my test in another language, and had the same issue. So this is clearly not a bug in the golang runtime. The problem appears to be related to how windows sends Ethernet PAUSE frames in a mixed gigabit/fast-ethernet environment, combined with how the TV tuner hardware device reacts when it sees a PAUSE frame on the network. Sorry for the noise!
Turns out this wasn't related to the TV tuner hardware, and has to do with the Windows networking stack and ethernet drivers. It is easily reproduced whenever a gigabit windows machine with "Flow Control" enabled on its ethernet adapter streams data to a client on a 100mbps link.
This behavior is documented in various posts across the internet including https://www.smallnetbuilder.com/lanwan/lanwan-features/30212-when-flow-control-is-not-a-good-thing, and cannot be reproduced once "Flow Control" is disabled on the ethernet adapter on the windows machine. Again, this has nothing to do with the golang runtime, but after weeks of investigating this crazy bug I wanted to document my findings for posterity.
What version of Go are you using (go version)?
go version go1.8.1 windows/amd64

What operating system and processor architecture are you using (go env)?

What did you do?
I have a complex application that deals with live video streams that come from a hardware TV tuner device on my home network. The application itself runs on mac, linux, windows and freebsd, but this particular issue only occurs on windows.
The device that produces the video stream is commercially available, produces large uncompressed video streams (~2MB/s), and is very sensitive to latency. It has limited on-board memory, and aggressively drops packets which are not received by the client in a timely fashion.
I've managed to whittle down my app into a simple repro using only stdlib, containing an http server with two routes (a rough sketch of both handlers follows below):
/network, which opens an http connection to the hardware tuner and proxies back the response data
/file, which uses http.ServeFile to serve up a large video file from disk
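For concreteness, a rough sketch of what such a server might look like; the tuner URL, file name, and listen address below are placeholders rather than details from the original report:

package main

import (
	"io"
	"log"
	"net/http"
)

func main() {
	// /network: proxy the live stream from the hardware tuner back to the client.
	http.HandleFunc("/network", func(w http.ResponseWriter, req *http.Request) {
		resp, err := http.Get("http://tuner.local:5004/stream") // placeholder tuner URL
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		defer resp.Body.Close()
		w.Header().Set("Content-Type", "video/mpeg")
		io.Copy(w, resp.Body) // relay tuner bytes straight through to the client
	})

	// /file: serve a large video file from disk via http.ServeFile.
	http.HandleFunc("/file", func(w http.ResponseWriter, req *http.Request) {
		http.ServeFile(w, req, "video.mpg") // placeholder file name
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}

With a server like this running, curl can be pointed at /network and /file concurrently to observe the interaction described below.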
What did you expect to see?
I expect to be able to use curl against both http endpoints concurrently, and to receive data from both the tuner and the file on disk at the same time.

On linux, darwin and freebsd, this code works as expected and there are no problems.
What did you see instead?
On windows, as soon as a request is made to /file, it starves out the other running goroutine. The end effect is that the goroutine serving the /network request does not read from its connection to the hardware tuner fast enough, causing the tuner to drop packets and thus impacting the watchability of the video stream.

I have confirmed this behavior with at least half a dozen other users of my app, who are all using different hardware, different versions of windows (all 64-bit, but mixed between 7, 8 and 10), and have different home network setups.
Further investigation
In an effort to narrow down what's going on, I tried adding several other http handlers and observed the following behaviors:
async file i/o
I tried using the Microsoft/go-winio library to perform async disk i/o when reading the file from disk. This setup still reproduces the behavior, so the scheduler bug triggers with both sync and async i/o apis.
disk i/o only (no network)
I thought maybe it was the extra network i/o of serving up the file that was impacting the network connection on the other goroutine, so I swapped in an ioutil.Discard (a sketch follows below). I could not reproduce the issue with this setup, so disk i/o alone is not enough to trigger the scheduler bug.
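A rough sketch of what that disk-only handler might look like; the route name and file name are placeholders, not the original code:

http.HandleFunc("/file-discard", func(w http.ResponseWriter, req *http.Request) {
	fp, err := os.Open("video.mpg") // placeholder file name
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	defer fp.Close()
	// Read the entire file from disk but throw the bytes away instead of
	// writing them to the client, isolating disk i/o from network i/o.
	io.Copy(ioutil.Discard, fp)
})

(This assumes "io", "io/ioutil", "net/http" and "os" are imported.)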
network i/o only (no disk)
Since disk i/o alone didn't trigger the issue, I thought it must be the network. I added a route to send flood bytes over the network (see the sketch below), but it also did not reproduce the issue. So network i/o alone is also not enough to trigger the scheduler bug.
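The code for that route did not survive here; a rough sketch of what a network-only flood handler might look like (the route name and buffer size are assumptions):

http.HandleFunc("/flood", func(w http.ResponseWriter, req *http.Request) {
	// Push a fixed buffer to the client as fast as possible, exercising
	// only network i/o with no disk reads at all.
	buf := make([]byte, 64<<10)
	for {
		if _, err := w.Write(buf); err != nil {
			return // client disconnected
		}
	}
})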
rate limited disk i/o
So it appears the scheduler bug only occurs when a goroutine does both disk and network i/o, like with http.ServeFile.

Since at this point it seemed like the /file goroutine was hogging the scheduler and moving bytes too quickly, I tried to insert a rate limiter from fujiwara/shapeio (see the sketch below). This worked, and is an effective work-around for the bug. With the rate limit in place, the scheduler does not starve out the other goroutines.
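A rough sketch of how the rate limiter might be wired in, assuming shapeio's NewReader/SetRateLimit API; the route name, file name, and 4 MB/s limit are placeholders:

http.HandleFunc("/file-limited", func(w http.ResponseWriter, req *http.Request) {
	fp, err := os.Open("video.mpg") // placeholder file name
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	defer fp.Close()
	w.Header().Set("Content-Type", "video/mpeg")
	// Wrap the file in a rate-limited reader so the copy cannot flood the
	// network and starve the goroutine serving /network.
	lr := shapeio.NewReader(fp)
	lr.SetRateLimit(4 << 20) // bytes per second (placeholder value)
	io.Copy(w, lr)
})

(This assumes the github.com/fujiwara/shapeio package is imported alongside the stdlib packages above.)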
tracer
I collected a 10s trace from the reproduction. During the first 5s only /network was active, and then during the last half the /file goroutine was also running.

I'm not terribly familiar with the trace output, but here it is in case someone else can make more sense of it:
trace.txt
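For reference, one way such a trace can be collected is with the stdlib runtime/trace package; the output file name and the 10-second window here are assumptions about how the trace above was gathered:

// Collect a 10-second execution trace while the repro is running,
// then inspect it afterwards with: go tool trace trace.out
f, err := os.Create("trace.out")
if err != nil {
	log.Fatal(err)
}
defer f.Close()
if err := trace.Start(f); err != nil {
	log.Fatal(err)
}
time.Sleep(10 * time.Second)
trace.Stop()

(This assumes "log", "os", "runtime/trace" and "time" are imported.)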
cc @dvyukov @jstarks who might have some ideas