runtime/cgo: scheduler blocked with goroutines #8636

Closed
DAddYE opened this issue Sep 3, 2014 · 14 comments

DAddYE commented Sep 3, 2014

Hi there,

I have a fairly big app. While my code works and is very fast in production, when I run
my benchmark suite the results are 1000x worse. So I'm not sure if I'm doing something
wrong or if this is a known problem.

There is a thread on google groups:
https://groups.google.com/forum/#!topic/golang-nuts/9RqGvk-b3Ow

Tested on:

go version go1.3.1 darwin/amd64
go version go1.3.1 linux/amd64

Here is a simplified version of the code:

func main() {
        runtime.GOMAXPROCS(0) // with n < 1 this does not change the setting; see comments below

        // prepare communication protocol
        urls := make(chan string)
        wg := new(sync.WaitGroup)

        // fire a bunch of workers
        for i := 0; i < env.MaxProcs; i++ { // env.MaxProcs: worker count from the full program (elided)
                wg.Add(1)
                go func(i int) {
                        fmt.Printf("starting worker %d...\n", i)
                        defer wg.Done()
                        for url := range urls {
                                C.resize_image(url) // takes around 30ms
                        }
                }(i)
        }

        // db connection (native go, elided); iter below is its query iterator

        var url string
        for i := 0; iter.Scan(&url) && i < 1e4; i++ { // Scan takes around 8ms
                // with a time.Sleep(10 * time.Millisecond) results are 1000x better!
                urls <- url
        }
        err = iter.Close() // error handling elided

        // wait
        close(urls)
        wg.Wait()
}

When I run pprof (darwin):

3512  76.4%  76.4%     3512  76.4% runtime.mach_semaphore_wait
     968  21.1%  97.5%      968  21.1% runtime.cgocall
     107   2.3%  99.8%      107   2.3% runtime.kevent
       3   0.1%  99.9%        3   0.1% runtime.usleep
       2   0.0%  99.9%        2   0.0% syscall.Syscall
       1   0.0%  99.9%        1   0.0% math/rand.Float64
       1   0.0% 100.0%        1   0.0% runtime.MHeap_LookupMaybe
       1   0.0% 100.0%        1   0.0% runtime.MSpan_Sweep
       1   0.0% 100.0%        1   0.0% runtime.memclr
       0   0.0% 100.0%        6   0.1% GC
       0   0.0% 100.0%     3618  78.7% System
       0   0.0% 100.0%      968  21.1% github.com/daddye/vips.Resize
       0   0.0% 100.0%       28   0.6% github.com/daddye/vips._Cfunc_im_close
       0   0.0% 100.0%      278   6.0% github.com/daddye/vips._Cfunc_vips_affine_interpolator
       0   0.0% 100.0%       18   0.4% github.com/daddye/vips._Cfunc_vips_colourspace_0
       0   0.0% 100.0%       43   0.9% github.com/daddye/vips._Cfunc_vips_copy_0
       0   0.0% 100.0%      258   5.6% github.com/daddye/vips._Cfunc_vips_embed_extend
       0   0.0% 100.0%       35   0.8% github.com/daddye/vips._Cfunc_vips_image_new
       0   0.0% 100.0%       16   0.3% github.com/daddye/vips._Cfunc_vips_jpegload_buffer_seq
       0   0.0% 100.0%       15   0.3% github.com/daddye/vips._Cfunc_vips_jpegload_buffer_shrink

When I run pprof (linux):

Total: 6250 samples
    6188  99.0%  99.0%     6191  99.1% runtime.cgocall
      21   0.3%  99.3%       21   0.3% runtime.usleep
       7   0.1%  99.5%        7   0.1% retake
       7   0.1%  99.6%        7   0.1% runtime.rtsigprocmask
       4   0.1%  99.6%        4   0.1% syscall.Syscall
       3   0.0%  99.7%        3   0.0% runtime.epollwait
       3   0.0%  99.7%        3   0.0% runtime.futex
       2   0.0%  99.8%        2   0.0% ExternalCode
       2   0.0%  99.8%        2   0.0% runtime.memclr
       1   0.0%  99.8%        2   0.0% exitsyscallfast
       1   0.0%  99.8%     6192  99.1% github.com/daddye/vips.Resize
       1   0.0%  99.8%        1   0.0% github.com/gocql/gocql.(*frame).readBytes
       1   0.0%  99.9%        1   0.0% runqgrab
       1   0.0%  99.9%        1   0.0% runtime.atomicload64
       1   0.0%  99.9%        1   0.0% runtime.cas
       1   0.0%  99.9%        1   0.0% runtime.entersyscall
       1   0.0%  99.9%        1   0.0% runtime.fastrand1
       1   0.0%  99.9%        1   0.0% runtime.noteclear
       1   0.0% 100.0%        1   0.0% runtime.purgecachedstats
       1   0.0% 100.0%        1   0.0% runtime.unlock


Things I tried:

1. Use another cgo binding: ImageMagick - Same results - Both libraries are thread-safe.
2. Use runtime.Gosched() (tried to put it in different places) - Same
3. Use runtime.LockOSThread() (tried to put it in different places) - Same
4. Add a time.Sleep(10*time.Millisecond) - the commented line <<< I get back good results!
5. Test on linux (ubuntu lts 12) - Same and pretty much same pprof
6. Remove CGO and use a native call - Very fast. 
7. Avoid channels, just `go func()` - Very fast!
8. Set GOMAXPROCS 1 - Very fast!
9. Remove the db connection just use a slice - Same
10. Use buffered channels - Worse


To explain more about point 4: it seems that if we add a few milliseconds between each
request

        for i := 0; iter.Scan(&url) && i < 1e4; i++ { // Scan takes around 8ms
                time.Sleep(10 * time.Millisecond)
                urls <- url
        }

the scheduler can handle everything better; that's why in production I wasn't able to see the issue. So here is a benchmark with timings:

Without Sleep:

resize speed mean: 327.630885ms, min: 100.150452ms, max: 462.495835ms, 99%: 429.954171ms
Total: 6250 samples
    6188  99.0%  99.0%     6191  99.1% runtime.cgocall
      21   0.3%  99.3%       21   0.3% runtime.usleep
       7   0.1%  99.5%        7   0.1% retake
       7   0.1%  99.6%        7   0.1% runtime.rtsigprocmask
       4   0.1%  99.6%        4   0.1% syscall.Syscall
       3   0.0%  99.7%        3   0.0% runtime.epollwait
       3   0.0%  99.7%        3   0.0% runtime.futex
       2   0.0%  99.8%        2   0.0% ExternalCode
       2   0.0%  99.8%        2   0.0% runtime.memclr
       1   0.0%  99.8%        2   0.0% exitsyscallfast
       1   0.0%  99.8%     6192  99.1% github.com/daddye/vips.Resize
       1   0.0%  99.8%        1   0.0% github.com/gocql/gocql.(*frame).readBytes
       1   0.0%  99.9%        1   0.0% runqgrab
       1   0.0%  99.9%        1   0.0% runtime.atomicload64
       1   0.0%  99.9%        1   0.0% runtime.cas
       1   0.0%  99.9%        1   0.0% runtime.entersyscall
       1   0.0%  99.9%        1   0.0% runtime.fastrand1
       1   0.0%  99.9%        1   0.0% runtime.noteclear
       1   0.0% 100.0%        1   0.0% runtime.purgecachedstats
       1   0.0% 100.0%        1   0.0% runtime.unlock

With a Sleep of 10ms:

resize speed mean: 28.551381ms, min: 4.666236ms, max: 61.589934ms, 99%: 52.274779ms
Total: 1265 samples
    1101  87.0%  87.0%     1105  87.4% runtime.cgocall
      40   3.2%  90.2%       40   3.2% runtime.usleep
      40   3.2%  93.4%       41   3.2% syscall.Syscall
      23   1.8%  95.2%       23   1.8% runtime.futex
       7   0.6%  95.7%        7   0.6% runtime.epollwait
       4   0.3%  96.0%        4   0.3% runtime.MSpan_Sweep
       3   0.2%  96.3%        3   0.2% ExternalCode
       3   0.2%  96.5%        3   0.2% findrunnable
       3   0.2%  96.8%        3   0.2% runtime.markscan
       3   0.2%  97.0%        3   0.2% runtime.rtsigprocmask
       2   0.2%  97.2%        2   0.2% pidleget
       2   0.2%  97.3%        2   0.2% runtime.atomicload
       2   0.2%  97.5%        3   0.2% runtime.entersyscall
       2   0.2%  97.6%        2   0.2% runtime.fastrand1
       2   0.2%  97.8%        2   0.2% runtime.gogo
       2   0.2%  97.9%        2   0.2% runtime.memmove
       2   0.2%  98.1%        2   0.2% sysmon
       1   0.1%  98.2%        1   0.1% MHeap_AllocLocked
       1   0.1%  98.3%        1   0.1% addstackroots
       1   0.1%  98.3%       16   1.3% bufio.(*Reader).Read

Adding a cumulative profile of the sleep version:

Total: 1279 samples
       0   0.0%   0.0%     1190  93.0% runtime.gosched0
       1   0.1%   0.1%     1149  89.8% main.func·003
       0   0.0%   0.1%     1116  87.3% github.com/triggit/feeds/processor.ResizeVips
       1   0.1%   0.2%     1114  87.1% github.com/daddye/vips.Resize
    1111  86.9%  87.0%     1111  86.9% runtime.cgocall
       0   0.0%  87.0%      488  38.2% github.com/daddye/vips._Cfunc_vips_jpegsave_custom
       0   0.0%  87.0%      267  20.9% github.com/daddye/vips._Cfunc_vips_affine_interpolator
       0   0.0%  87.0%      233  18.2% github.com/daddye/vips._Cfunc_vips_embed_extend
       0   0.0%  87.0%       82   6.4% System
      38   3.0%  90.0%       40   3.1% syscall.Syscall
      34   2.7%  92.7%       34   2.7% runtime.usleep
      27   2.1%  94.8%       27   2.1% runtime.futex
       0   0.0%  94.8%       26   2.0% io/ioutil.WriteFile
       0   0.0%  94.8%       25   2.0% github.com/daddye/vips._Cfunc_im_close
       0   0.0%  94.8%       25   2.0% github.com/daddye/vips._Cfunc_vips_image_new
       0   0.0%  94.8%       21   1.6% github.com/daddye/vips._Cfunc_vips_jpegload_buffer_seq
       0   0.0%  94.8%       21   1.6% github.com/gocql/gocql.(*Conn).recv
       0   0.0%  94.8%       21   1.6% github.com/gocql/gocql.(*Conn).serve
       1   0.1%  94.8%       19   1.5% bufio.(*Reader).Read
       0   0.0%  94.8%       19   1.5% github.com/daddye/vips._Cfunc_vips_copy_0

We can see that:

With a sleep:

1   0.1%   0.2%     1114  87.1% github.com/daddye/vips.Resize

Without a sleep:

1   0.0%  99.8%     6192  99.1% github.com/daddye/vips.Resize

So finally:

The version without sleep took (mean, just for the C function call): 327.630885ms
With a sleep: 28.551381ms

Thanks so much
Davide
ianlancetaylor (Contributor)

Comment 1:

When a cgo call is started, the thread blocks waiting for the call to complete.
There is a system monitor thread that looks for threads that have been blocked in a cgo
call for more than 20 us. When it finds such a thread, and there is other work to do,
it spins up a new thread to run new goroutines.
The effect is that you can have very different profiling results depending on how long
your cgo calls run.  Could that be the issue here?
If that's not the issue, please try to provide a complete standalone example that shows
the problem.  Thanks.
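
To make the mechanism concrete, here is a minimal standalone sketch (not from the
report; the worker count, loop count, and 1ms C.usleep are arbitrary stand-ins) that
watches the runtime's threadcreate profile grow while goroutines sit in blocking cgo
calls:

package main

// #include <unistd.h>
import "C"

import (
    "fmt"
    "runtime/pprof"
    "sync"
)

func main() {
    threads := pprof.Lookup("threadcreate") // profile of OS threads created so far

    fmt.Println("threads before:", threads.Count())

    var wg sync.WaitGroup
    for i := 0; i < 8; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for j := 0; j < 100; j++ {
                // Each call blocks its OS thread for ~1ms. Once a thread has
                // been blocked in cgo for more than ~20us, the monitor thread
                // may hand its P to a freshly created thread.
                C.usleep(1000)
            }
        }()
    }
    wg.Wait()

    // Expect a noticeably higher count when the cgo calls are long relative
    // to the 20us retake threshold.
    fmt.Println("threads after:", threads.Count())
}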

Labels changed: added repo-main.

minux (Member) commented Sep 3, 2014

Comment 2:

why set GOMAXPROCS to 0?

DAddYE (Author) commented Sep 3, 2014

Comment 3:

Hi,
here, finally, is what I believe shows the problem:
package main

// #include <unistd.h>
import "C"

import (
    "fmt"
    "net/http"
    _ "net/http/pprof"
    "runtime"
    "sync"
    "time"
)

const size = 10000

func main() {
    runtime.GOMAXPROCS(0) // see comment 4: with n < 1 this only reads the setting
    started := time.Now()

    go func() {
        fmt.Println("pprof enabled on port 6060")
        err := http.ListenAndServe(":6060", nil)
        if err != nil {
            panic(err)
        }
    }()

    // prepare communication protocol
    urls := make(chan string, size)
    wg := new(sync.WaitGroup)

    // fire a bunch of workers
    for i := 0; i < runtime.NumCPU(); i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            for _ = range urls {
                // time.Sleep(1 * time.Millisecond) // pure-Go equivalent
                C.usleep(1000) // 1000 microseconds = 1ms
            }
        }(i)
    }

    for i := 0; i < size; i++ {
        urls <- "test"
    }

    // wait
    close(urls)
    wg.Wait()
    fmt.Println("done in", time.Since(started))
}
Here I believe the final result should be around 1s, right? With the C call it instead
needs 12.860826382s on my mac.
The Go version of the sleep correctly shows: 1.621176016s
If I increase the C sleep to 10ms (10*1000), I get back the same result as the Go
counterpart: 15.092162142s

> why set GOMAXPROCS to 0?

I believed it to be the same as GOMAXPROCS=runtime.NumCPU(); I saw that's a common
pattern in Go.

DAddYE (Author) commented Sep 3, 2014

Comment 4:

Edit: per the runtime.GOMAXPROCS docs, if n < 1 it does not change the current setting.
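
For reference, a minimal sketch (not from the thread) of what the n < 1 case does:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    // n < 1 only queries: it returns the current setting and changes nothing.
    fmt.Println("current GOMAXPROCS:", runtime.GOMAXPROCS(0))

    // To actually use all CPUs (the default was 1 before Go 1.5):
    prev := runtime.GOMAXPROCS(runtime.NumCPU())
    fmt.Println("was:", prev, "now:", runtime.GOMAXPROCS(0))
}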

dvyukov (Member) commented Sep 4, 2014

Comment 5:

With go version
go version devel +9cc3e5afb5d6 Thu Sep 04 11:34:01 2014 +0400 linux/amd64
on linux/amd64 on a 32-core machine:

without sleep:
$ GOMAXPROCS=1 go run /tmp/sched.go
done in 914.7684ms
$ GOMAXPROCS=32 go run /tmp/sched.go
done in 337.801592ms

with sleep:
$ go run /tmp/sched.go
done in 989.857984ms
$ GOMAXPROCS=32 go run /tmp/sched.go
done in 687.407522ms

Everything seems fine to me.
What Go version do you use? What OS? What hardware? What is the value of the GOMAXPROCS
env var?

DAddYE (Author) commented Sep 4, 2014

Comment 6:

Hi, 
I'm on go version go1.3.1 darwin/amd64
and with GOMAXPROCS=1 it takes 12.851289599s (retested again)

DAddYE (Author) commented Sep 4, 2014

Comment 7:

Quick update: I re-ran this on different machines (my coworkers', all of them running
1.3.1 / darwin), and, strangely, on the same machine it sometimes takes ~1.5s and
sometimes ~12s; that's on all machines I tried.

DAddYE (Author) commented Sep 4, 2014

Comment 8:

Another update:
My initial guess was the use of "pprof", so removing it gives me back the 1.5s
(consistently).
However, if I import only "net/http", it consistently takes 12s again.
Here is the code:
package main

// #include <unistd.h>
import "C"

import (
    "fmt"
    _ "net/http"
    "runtime"
    "sync"
    "time"
)

const size = 10000

func main() {
    started := time.Now()

    // prepare communication protocol
    urls := make(chan string, size)
    wg := new(sync.WaitGroup)

    // fire a bunch of workers
    for i := 0; i < runtime.NumCPU(); i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            for _ = range urls {
                // time.Sleep(1 * time.Millisecond) // pure-Go equivalent
                C.usleep(1000) // 1000 microseconds = 1ms
            }
        }(i)
    }

    for i := 0; i < size; i++ {
        urls <- "test"
    }

    // wait
    close(urls)
    wg.Wait()
    fmt.Println("done in", time.Since(started))
}
I don't know if it's related, but probably there is some package I use that triggers
this kind of issue in production (in a bigger way).

DAddYE (Author) commented Sep 4, 2014

Comment 9:

Here is a dump with scheduler tracing enabled:
tmp $ GOTRACEBACK=crash GODEBUG=schedtrace=1000 go run test.go
SCHED 0ms: gomaxprocs=1 idleprocs=0 threads=2 idlethreads=0 runqueue=0 [1]
# command-line-arguments
SCHED 0ms: gomaxprocs=1 idleprocs=0 threads=2 idlethreads=0 runqueue=0 [1]
# command-line-arguments
SCHED 0ms: gomaxprocs=1 idleprocs=0 threads=2 idlethreads=0 runqueue=0 [1]
SCHED 0ms: gomaxprocs=1 idleprocs=0 threads=2 idlethreads=0 runqueue=0 [1]
SCHED 1006ms: gomaxprocs=1 idleprocs=1 threads=6 idlethreads=2 runqueue=0 [0]
SCHED 1009ms: gomaxprocs=1 idleprocs=0 threads=5 idlethreads=0 runqueue=3 [5]
SCHED 2012ms: gomaxprocs=1 idleprocs=1 threads=6 idlethreads=2 runqueue=0 [0]
SCHED 2010ms: gomaxprocs=1 idleprocs=0 threads=5 idlethreads=0 runqueue=5 [3]
SCHED 3022ms: gomaxprocs=1 idleprocs=1 threads=6 idlethreads=2 runqueue=0 [0]
SCHED 3022ms: gomaxprocs=1 idleprocs=0 threads=5 idlethreads=0 runqueue=6 [2]
SCHED 4024ms: gomaxprocs=1 idleprocs=1 threads=6 idlethreads=2 runqueue=0 [0]
SCHED 4030ms: gomaxprocs=1 idleprocs=0 threads=5 idlethreads=0 runqueue=8 [0]
SCHED 5032ms: gomaxprocs=1 idleprocs=1 threads=6 idlethreads=2 runqueue=0 [0]
SCHED 5038ms: gomaxprocs=1 idleprocs=0 threads=6 idlethreads=1 runqueue=0 [7]
SCHED 6039ms: gomaxprocs=1 idleprocs=1 threads=6 idlethreads=2 runqueue=0 [0]
SCHED 6049ms: gomaxprocs=1 idleprocs=0 threads=6 idlethreads=1 runqueue=1 [6]
SCHED 7044ms: gomaxprocs=1 idleprocs=1 threads=6 idlethreads=2 runqueue=0 [0]
SCHED 7059ms: gomaxprocs=1 idleprocs=0 threads=6 idlethreads=1 runqueue=1 [6]
SCHED 8046ms: gomaxprocs=1 idleprocs=1 threads=6 idlethreads=2 runqueue=0 [0]
SCHED 8060ms: gomaxprocs=1 idleprocs=0 threads=6 idlethreads=1 runqueue=2 [5]
SCHED 9058ms: gomaxprocs=1 idleprocs=1 threads=6 idlethreads=2 runqueue=0 [0]
SCHED 9065ms: gomaxprocs=1 idleprocs=0 threads=6 idlethreads=1 runqueue=3 [4]
SCHED 10065ms: gomaxprocs=1 idleprocs=1 threads=6 idlethreads=2 runqueue=0 [0]
SCHED 10077ms: gomaxprocs=1 idleprocs=0 threads=6 idlethreads=1 runqueue=3 [4]
SCHED 11067ms: gomaxprocs=1 idleprocs=1 threads=6 idlethreads=2 runqueue=0 [0]
SCHED 11081ms: gomaxprocs=1 idleprocs=0 threads=6 idlethreads=0 runqueue=4 [3]
SCHED 12074ms: gomaxprocs=1 idleprocs=1 threads=6 idlethreads=2 runqueue=0 [0]
SCHED 12086ms: gomaxprocs=1 idleprocs=0 threads=6 idlethreads=0 runqueue=6 [1]
SCHED 13084ms: gomaxprocs=1 idleprocs=1 threads=6 idlethreads=2 runqueue=0 [0]
done in 12.857744792s

dvyukov (Member) commented Sep 5, 2014

Comment 10:

Try using a buffered channel and set GOMAXPROCS to a higher value.
Also, when you use GODEBUG=schedtrace=1000, don't just use 'go run'. Build and run
separately. Otherwise the trace is a mix of the go command's output and your program's --
they are both Go programs.
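
For example, something along these lines (the file and binary names are assumed):

$ go build -o sched test.go
$ GODEBUG=schedtrace=1000 ./sched

so the SCHED lines come from ./sched alone rather than from the go tool as well.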

DAddYE (Author) commented Sep 6, 2014

Comment 11:

Hi, yes, increasing GOMAXPROCS solves the problem; however, I wonder why it is not
necessary when you don't use C calls.

dvyukov (Member) commented Sep 12, 2014

Comment 12:

Increasing GOMAXPROCS improves performance even when C calls are not used.

bradfitz removed the new label Dec 18, 2014
rsc (Contributor) commented Apr 10, 2015

Assigning to Dmitriy to see if there's anything to do here or else close.

rsc added this to the Go1.5Maybe milestone Apr 10, 2015
dvyukov (Member) commented Apr 13, 2015

I don't see anything actionable here.
Using a buffered channel and setting GOMAXPROCS to a higher value should solve the
performance issue.
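
Applied to the standalone repro above, a minimal sketch of that suggestion (the buffer
size and the GOMAXPROCS multiplier are illustrative, not prescriptive):

package main

// #include <unistd.h>
import "C"

import (
    "fmt"
    "runtime"
    "sync"
    "time"
)

const size = 10000

func main() {
    // Extra procs give the scheduler headroom while threads are parked in cgo.
    runtime.GOMAXPROCS(2 * runtime.NumCPU())

    started := time.Now()
    urls := make(chan string, size) // buffered: the producer never stalls on a busy worker
    wg := new(sync.WaitGroup)

    for i := 0; i < runtime.NumCPU(); i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for _ = range urls {
                C.usleep(1000) // stand-in for the ~30ms image resize
            }
        }()
    }

    for i := 0; i < size; i++ {
        urls <- "test"
    }
    close(urls)
    wg.Wait()
    fmt.Println("done in", time.Since(started))
}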

dvyukov closed this as completed Apr 13, 2015
golang locked and limited conversation to collaborators Jun 25, 2016
rsc unassigned dvyukov Jun 23, 2022