runtime: syscalls are ~ 1.4x slower than in C #19563
Comments
I think you buried the lede. You should update the title to reflect the
fact that syscalls got 60ns slower on tip.
…On Thu, 16 Mar 2017, 08:09 Josh Bleecher Snyder ***@***.***> wrote:
cc @bcmills <https://github.com/bcmills> @ianlancetaylor
<https://github.com/ianlancetaylor>
|
What's odd is that I don't see any actual changes to … That would seem to imply one of a few things:
Can you replicate the same regression from 1.7 to 1.8 to tip using |
Tip is bound to be slower than 1.8 because of the support for the runtime poller. This brings various advantages but does take more time. But I'm frankly surprised that this time is not swamped by the overall kernel time required to implement the syscall and copy the bytes. |
Ah, now I understand why the pprof output looks so strange: the program is not even remotely CPU-bound in the application code.
So the pprof output likely looks strange because it's drawing from a very small sample size: the sampling error swamps out most of the interesting detail. |
First of all, everyone - thanks for the feedback. @bcmills the timing measurements are outside the benchmarking loop, so timer changes probably should not affect anything. That said, below is a benchmark that reads 1 byte via pread from tmpfs, just to be sure:
package main
import (
"crypto/rand"
"fmt"
"os"
"testing"
)
// xpread is pread(2) but aborts on an error or when fewer bytes were read than requested
func xpread(f *os.File, buf []byte, offset int64) {
n, err := f.ReadAt(buf, offset)
if err != nil {
panic(err)
}
if n != len(buf) {
panic(fmt.Errorf("pread(%v) -> %v", len(buf), n))
}
}
// BenchmarkPread1 benchmarks pread syscall when reading 1 byte from a file on tmpfs
func BenchmarkPread1(b *testing.B) {
// setup 4K file on a tmpfs
f, err := os.Create("/dev/shm/x.dat")
if err != nil {
b.Fatal(err)
}
buf4K := make([]byte, 4096)
_, err = rand.Read(buf4K)
if err != nil {
// rand promises n = len(buf) <=> err == nil
b.Fatal(err)
}
_, err = f.Write(buf4K)
if err != nil {
b.Fatal(err)
}
buf1B := make([]byte, 1)
// warm up
xpread(f, buf1B, 0)
b.ResetTimer()
for i := 0; i < b.N; i++ {
xpread(f, buf1B, 0)
}
b.StopTimer()
err = f.Close()
if err != nil {
b.Fatal(f)
}
}

(https://lab.nexedi.com/kirr/misc/blob/dcdba0e3/syspread_test.go)

Below are timings for this benchmark for all Go versions starting from go14, statted over 10 runs each:
(code to run/stat here: https://lab.nexedi.com/kirr/misc/blob/dcdba0e3/Makefile#L20) We can see, once again, that go tip increases the time significantly. About the pprof output - I'm not experienced enough here to check, but the time difference can also be seen by running the programs directly, without internal benchmarks, and measuring whole-program time with bash's time - the difference measured this way is real. Thanks again, |
Your benchmark seems to be comparing direct |
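For reference, isolating the raw syscall from the os.File layer could be benchmarked with something like the following sketch of my own (Linux-only, placed in a _test.go file; it assumes the same /dev/shm/x.dat file created above, and is not one of the benchmarks from this thread):

package main

import (
	"os"
	"syscall"
	"testing"
)

// BenchmarkSysPread1 measures the raw pread(2) cost via syscall.Pread,
// bypassing the os.File / internal/poll layers.
func BenchmarkSysPread1(b *testing.B) {
	f, err := os.Open("/dev/shm/x.dat")
	if err != nil {
		b.Fatal(err)
	}
	defer f.Close()
	fd := int(f.Fd())
	buf := make([]byte, 1)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		n, err := syscall.Pread(fd, buf, 0)
		if err != nil || n != len(buf) {
			b.Fatalf("pread: n=%v err=%v", n, err)
		}
	}
}

Comparing this against the os.File-based BenchmarkPread1 above separates the cost of the syscall wrapper itself from whatever os.File and internal/poll add on top.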
@bcmills good point. It is:
So:
For syscall.Pread itself I get the following profile (maybe something dumb on my side; sorry):
So for the syscall itself it looks like the overhead is mainly in
|
I've moved os.File regression to #19586. |
If the |
That making a system call in Go is ~1.4x slower than making it in C. |
Do you have a benchmark for that? I don't see one above; sorry if I missed it. It's pretty hard to fix. While we might be able to shave some time off, fundamentally the Go function |
@ianlancetaylor the benchmark for syspread in C and in Go is this:
This data was already presented before in a scattered way; the above table summarizes it once again with the actual syscall in focus. I apologize for the confusion. About overhead: let me once again say that I understand that in Go, due to goroutines, there is more work to make a system call compared to C, but it would be nice if the overhead were minimized. In my original post I already gave some ideas; regarding syscalls, imho the low-hanging fruit is to try to intrinsify getcallerpc (#17327). If we also audit things with performance in mind, something else could probably be found in … Thanks, |
Your profile is still not even close to CPU-bound: less than 1% of the samples were executing instructions in the user part of the program.
I will also note that 40ns is approximately the cost of an L3 cache miss on modern AMD64 hardware. |
@bcmills I'm not sure about the profile, because I have not played with the Go profiler enough; however, I've made the following test:
package main
import (
"flag"
"fmt"
"log"
"os"
"runtime/pprof"
"time"
)
// 12345678
const Niter = 100000000
var cpuprofile = flag.String("cpuprofile", "", "write cpu profile `file`")
//go:noinline
func aaa() {
}
func main() {
flag.Parse()
if *cpuprofile != "" {
f, err := os.Create(*cpuprofile)
if err != nil {
log.Fatal("could not create CPU profile: ", err)
}
if err := pprof.StartCPUProfile(f); err != nil {
log.Fatal("could not start CPU profile: ", err)
}
defer pprof.StopCPUProfile()
}
Tstart := time.Now()
for i := 0; i < Niter; i++ {
aaa()
}
Tend := time.Now()
fmt.Printf("%.1f\n", float64(Tend.Sub(Tstart).Nanoseconds()) / Niter)
}
This program is completely CPU-bound, so to me the above … About 40ns - I will also note that it is not a cache miss here: if you look once again at the original plot of pread/copy data, please ignore the Go lines and look at the syspread/c and copy/c lines. They clearly show that the time grows linearly as the number of bytes to read/copy grows. The |
Ah, I think I'm misinterpreting the new pprof format (reported as google/pprof#128). I believe you're correct that the program is actually CPU-bound; I apologize for my confusion. |
Np, and thanks for bringing this up with pprof team. |
Btw, I've looked around a bit and noticed: in … Here is an updated benchmark for the pread syscall for all Go releases:
and it indeed shows a 20ns go13 -> go14 slowdown (and another 15ns go10/go11 -> go12 slowdown whose reason I had not checked). So if atomics for e.g. … I do not know the GC details, so please forgive me in advance if the above is dumb in one way or another. It sounds reasonable to me offhand, however. Thanks beforehand, /cc @RLH, @aclements |
Atomics got a bad reputation from their early implementations, which were extremely expensive. But on a modern x86 an atomic operation on an uncontended cache line in the L1 is only ~2.5x more expensive than a plain write. That said, you're definitely right that the atomic operations are the hottest individual operations in the user-space side of the syscall path (partly because there's not that much else going on :). According to PEBS profiling (
That's a clever idea. I haven't completely convinced myself that this is safe, but I did a simple experiment to see what impact it would have. I disabled GC with …
diff --git a/src/runtime/proc.go b/src/runtime/proc.go
index 159a9bd4bc..1538274c77 100644
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -749,6 +749,17 @@ func casgstatus(gp *g, oldval, newval uint32) {
throw("casgstatus")
}
+ if gcphase == _GCoff {
+ if gp.atomicstatus != oldval {
+ throw("bad old status")
+ }
+ gp.atomicstatus = newval
+ if newval == _Grunning {
+ gp.gcscanvalid = false
+ }
+ return
+ }
+
// See http://golang.org/cl/21503 for justification of the yield delay.
const yieldDelay = 5 * 1000
var nextYield int64

Here are the results on my laptop before and after this patch (
So this saves 10–17ns in the syscall path when GC is inactive. Given the cost of most syscalls, I'm not sure that's worth the added complexity and potential for extremely subtle bugs, but maybe it is. |
@aclements first of all thanks for your feedback, and for the heads-up that atomics are not so expensive any more. Looks like the cache-coherency logic in x86 has become much more advanced over time. Anyway …
On my side with it for pread-related benchmarks I get:
which more or less confirms your findings.
Here is something to consider: during e.g. a web-serving workload, a lot of system calls are made. If I understand correctly, on e.g. any socket, for Read or Write a direct system call is made first, and only if it returns EAGAIN is the fd registered with the netpoller (or the operation succeeds right away if the fd was ready); the read and write syscalls are also used later to send/receive data once the fd becomes ready (see the sketch after this comment). So there are many system calls involved. With this in mind I've tried bench/go1's BenchmarkHTTPClientServer without and with your patch, and on my computer I get the following:
The effect is quite visible, right? And that is with the go13 -> go14 syscall-related change only partly mitigated. So what I would suggest is:
All that because … Thanks again, |
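For context, the pattern described above (issue the read syscall directly, and fall back to the poller only on EAGAIN) has roughly this shape. This is a simplified illustrative sketch, not the actual internal/poll code; waitReadable is a hypothetical stand-in for the netpoller wait:

package sketch

import "syscall"

// readNonblocking shows the shape of a read on a non-blocking fd: try the
// syscall first; only when it returns EAGAIN, park on the poller (represented
// here by the caller-supplied waitReadable) and retry once the fd is ready.
func readNonblocking(fd int, p []byte, waitReadable func() error) (int, error) {
	for {
		n, err := syscall.Read(fd, p)
		if err != syscall.EAGAIN {
			return n, err // success, or a real error - a syscall was made either way
		}
		// Not ready yet: block on the poller until the fd becomes readable,
		// then retry the read syscall.
		if werr := waitReadable(); werr != nil {
			return 0, werr
		}
	}
}

The point is that every Read/Write on a ready fd still goes through at least one real syscall, so per-syscall overhead shows up directly in network-heavy workloads.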
That's pretty intriguing, actually. I've kicked off the go1 and x/benchmarks suites to see what the results there are.
Thanks for reminding me that I never published my new version of perflock! What I was actually using is https://github.com/aclements/perflock, which has nice properties over the old version like ensuring FIFO ordering, taking control of the CPU governor, and being written in Go. :) I'm hoping to do more with it, like using cgroups for isolation, but haven't yet. |
@aclements thanks for the feedback. It is always nice when discussions bring forgotten things to light. Thanks for the correct link :) Thanks also for starting the benchmarks - let's see the results when you have them. |
I note that the network I/O syscalls are known to be non-blocking, and therefore it may be possible to use |
@ianlancetaylor thanks for the heads-up on … That said, maybe what could be done on the Go side for both sockets and files such as pipes/fifos/etc. is:
Such an if should cost very little, and imho the logic could be put into pollDesc. That said, it should also be noted that regular files remain uncovered this way. So maybe it is better to just optimize regular … Thanks again,
(*) from http://man7.org/linux/man-pages/man2/open.2.html
(+) see e.g. http://marc.info/?l=linux-kernel&m=149002565506733&w=2 |
@navytux, here are the benchmark results (sorry, left them running and then forgot about them :): https://perf.golang.org/search?q=upload%3A20170330.4+%7C+commit%3A01ac5b8dcfe5342af3770b0834220b87ea328fad+vs+commit%3A9d96f6d72012c7ef902eecbec43ff7cf2a37c79d So there's actually very little difference, at least in these benchmarks. |
@aclements thanks for your feedback. I was going to blame myself for doing something stupid last time, because today I tried and could not reproduce my previous results for BenchmarkHTTPClientServer, neither on today's tip (5a45a15) nor on a tree state more-or-less corresponding to the time of my comment that claimed -2.15% on the benchmark. However, with all that I could not believe I was to blame, because I know I did not cook the numbers; here they are in raw form for the runs on Mar 20:
So after trying many things and looking here and there I've noticed: the average time per operation is sometimes a bit higher the more iterations the benchmark driver makes. E.g. (from 0.txt):
That means that while the whole time to run 1 benchmark for 30000 iterations was ~1s (the go test default for -benchtime), the time to run 1 benchmark for 50000 iterations was ~1.7s - i.e. almost twice as long. Then I've looked at the CPU profile for BenchmarkHTTPClientServer run with
Notice in particular:
That led me to thinking that much less garbage collection actually happens when the bench time is only 1s. That turned out to be true even if the benchmark is repeated several times, as the go benchmark driver explicitly calls GC in between runs, but that time is not accounted to the actual benchmark in question: https://github.com/golang/go/blob/5a45a157/src/testing/benchmark.go#L141

That means that for benchmarks which do generate garbage, but not enough to systematically trigger a statistically sound number of actual garbage collections within the default benchtime=1s, the default benchtime=1s is not appropriate: it does not show the cost associated with garbage collections (and casgstatus is associated 100% with the GC) even when run with … So with this in mind, the correct procedure to test the effect of GC-related patches must run the benchmark with benchtime=Ns (where N is a timeframe in which the GC triggers a statistically sound number of times). This way I've tested once again today's tip (5a45a15) without and with the patch, with N=50s, on Linux 4.9.0-2-amd64 with an i7-6600U CPU:
and here is what it gives:
I've also made 3 smaller runs with N=30s and got deltas of -0.64% (B1), -3.48% (B2) and -1.42% (B4). It seems that to get the real delta with higher probability it is necessary to run with a bigger N and a higher count, but from the data presented it is already clear that the delta is oscillating around -1.5% or -2%. Raw data for the benchmarks is available here: https://lab.nexedi.com/kirr/misc/tree/d13eb36c/bench/nocasgstatus/5a45a157 Hope this clarifies things. Thanks again, P.S. And last time it seems I was lucky to observe the effect with the default |
@aclements could you please confirm Thanks beforehand, |
@aclements about this: I was doing something else but noticed atomics are not that light, so I've benchmarked them myself. Over 20 runs it gives:
To me it says that 58x is much more than 2.5x, so I decided to share. My processor is an i7-6600U, which is relatively recent: http://cpubenchmark.net/cpu.php?cpu=Intel+Core+i7-6600U+%40+2.60GHz&id=2608 About the noise: I am not sure why it is there, but it does not go away after rechecking with a second |
Full benchmark source:
// Package qatomic shows how much overhead there is when using atomics
package qatomic
import (
"sync/atomic"
"testing"
)
// BenchmarkEmptyBase shows how much overhead the empty benchmarking loop takes by itself;
// the time per iteration of an operation can thus be estimated as:
//
// T(BenchmarkOp) - T(BenchmarkEmptyBase)
func BenchmarkEmptyBase(b *testing.B) {
b.RunParallel(func(pb *testing.PB) {
for pb.Next() {
}
})
}
// ----------------------------------------
// plain store to uncontended memory
// made sure via compile -S the store is actually there inside loop
func BenchmarkStoreUncontended(b *testing.B) {
b.RunParallel(func(pb *testing.PB) {
var mem uint64
func(pmem *uint64) {
for i := uint64(0); pb.Next(); i++ {
*pmem = i
}
}(&mem)
})
}
// plain store to contended memory; races are ignored
// made sure via compile -S the store is actually there inside loop
func BenchmarkStoreContended(b *testing.B) {
var mem uint64
b.RunParallel(func(pb *testing.PB) {
for i := uint64(0); pb.Next(); i++ {
mem = i
}
})
}
// atomic store to uncontended memory
func BenchmarkAtomicStoreUncontended(b *testing.B) {
b.RunParallel(func(pb *testing.PB) {
var mem uint64
for i := uint64(0); pb.Next(); i++ {
atomic.StoreUint64(&mem, i)
}
})
}
// atomic store to contended memory
func BenchmarkAtomicStoreContended(b *testing.B) {
var mem uint64
b.RunParallel(func(pb *testing.PB) {
for i := uint64(0); pb.Next(); i++ {
atomic.StoreUint64(&mem, i)
}
})
}
// ----------------------------------------
// plain inc of uncontended memory
// made sure via compile -S it actually does `load; inc; store` inside loop
func BenchmarkIncUncontended(b *testing.B) {
b.RunParallel(func(pb *testing.PB) {
var mem uint64
func(pmem *uint64) {
for pb.Next() {
*pmem++
}
}(&mem)
})
}
// plain inc of contended memory
// made sure via compile -S it actually does `load; inc; store` inside loop
func BenchmarkIncContended(b *testing.B) {
var mem uint64
b.RunParallel(func(pb *testing.PB) {
for pb.Next() {
mem++
}
})
}
// atomic inc of uncontended memory
func BenchmarkAtomicIncUncontended(b *testing.B) {
b.RunParallel(func(pb *testing.PB) {
var mem uint64
for pb.Next() {
atomic.AddUint64(&mem, 1)
}
})
}
// atomic inc of contended memory
func BenchmarkAtomicIncContended(b *testing.B) {
var mem uint64
b.RunParallel(func(pb *testing.PB) {
for pb.Next() {
atomic.AddUint64(&mem, 1)
}
})
} |
It's an interesting result. In the inner loop,
So it seems to me that
I do not know whether this means anything about real programs, which are unlikely to loop storing values to memory without ever reading them. |
@ianlancetaylor thanks for the feedback. Yes, on amd64 atomic store and atomic swap are both implemented with XCHGQ (from …):
TEXT ·StoreUint64(SB),NOSPLIT,$0-16
MOVQ addr+0(FP), BP
MOVQ val+8(FP), AX
XCHGQ AX, 0(BP)
RET
...
TEXT ·SwapUint64(SB),NOSPLIT,$0-24
MOVQ addr+0(FP), BP
MOVQ new+8(FP), AX
XCHGQ AX, 0(BP)
MOVQ AX, old+16(FP)
RET

with similar code emitted by the compiler via atomic intrinsics. However, the more realistic IncUncontended and AtomicIncUncontended are compiled to this:

TEXT ·BenchmarkIncUncontended.func1.1(SB), NOSPLIT, $0-16 // qatomic_test.go:73
NO_LOCAL_POINTERS
...
pc48:
MOVQ 16(CX), DX
DECQ DX
MOVQ DX, 16(CX)
MOVL $1, DX
pc64:
TESTB DL, DL
JEQ pc88
MOVQ (AX), DX // qatomic_test.go:75
INCQ DX // <-- load; inc; store here
MOVQ DX, (AX)
pc77:
MOVQ 16(CX), DX // qatomic_test.go:74
TESTQ DX, DX
JEQ pc12
JMP pc48
...

TEXT ·BenchmarkAtomicIncUncontended.func1(SB), $24-8 // qatomic_test.go:94
NO_LOCAL_POINTERS
...
pc97:
MOVQ 16(CX), DX
DECQ DX
MOVQ DX, 16(CX)
MOVL $1, DX
pc113:
TESTB DL, DL
JEQ pc138
MOVL $1, DX
LOCK // qatomic_test.go:97
XADDQ DX, (AX) // <-- atomic load; inc; store here
pc127:
MOVQ 16(CX), DX // qatomic_test.go:96
TESTQ DX, DX
JEQ pc61
JMP pc97
...

which makes the comparison fair, and still it is more than an order of magnitude of difference. |
Also, if we are talking about …

---------------------------- src/pkg/runtime/proc.c ----------------------------
@@ -1649,9 +1783,9 @@ runtime·entersyscallblock_m(void)
gp->syscallsp = gp->sched.sp;
gp->syscallpc = gp->sched.pc;
gp->syscallstack = gp->stackbase;
gp->syscallguard = gp->stackguard;
- gp->status = Gsyscall;
+ runtime·casgstatus(gp, Grunning, Gsyscall);
if(gp->syscallsp < gp->syscallguard-StackGuard || gp->syscallstack < gp->syscallsp) {
// runtime·printf("entersyscall inconsistent %p [%p,%p]\n",
// gp->syscallsp, gp->syscallguard-StackGuard, gp->syscallstack);
runtime·throw("entersyscall_m");
@@ -1673,9 +1807,11 @@ runtime·exitsyscall(void)
g->waitsince = 0;
if(exitsyscallfast()) {
// There's a cpu for us, so we can run.
g->m->p->syscalltick++;
- g->status = Grunning;
+ // We need to cas the status and scan before resuming...
+ runtime·casgstatus(g, Gsyscall, Grunning);
+
// Garbage collector isn't running (since we are),
// so okay to clear gcstack and gcsp.
g->syscallstack = (uintptr)nil;
g->syscallsp = (uintptr)nil;
---------------------------- src/pkg/runtime/mgc0.c ----------------------------
@@ -539,9 +540,10 @@ markroot(ParFor *desc, uint32 i)
runtime·throw("markroot: bad index");
gp = runtime·allg[i - RootCount];
// remember when we've first observed the G blocked
// needed only to output in traceback
- if((gp->status == Gwaiting || gp->status == Gsyscall) && gp->waitsince == 0)
+ status = runtime·readgstatus(gp);
+ if((status == Gwaiting || status == Gsyscall) && gp->waitsince == 0)
gp->waitsince = work.tstart;
// Shrink a stack if not much of it is being used.
runtime·shrinkstack(gp);
scanstack(gp);
@@ -736,15 +738,16 @@ scanstack(G *gp)
int32 n;
Stktop *stk;
uintptr sp, guard;
- switch(gp->status){
+ switch(runtime·readgstatus(gp)) {
default:
@@ -1348,9 +1351,9 @@ runtime·gc(int32 force)
if(i > 0)
a.start_time = runtime·nanotime();
// switch to g0, call gc(&a), then switch back
g->param = &a;
- g->status = Gwaiting;
+ runtime·casgstatus(g, Grunning, Gwaiting);
g->waitreason = runtime·gostringnocopy((byte*)"garbage collection");
runtime·mcall(mgc);
}
...

so to me it looks like in the hot path (e.g. when many syscalls are made) the status update was a plain store and became an atomic CAS because of the slow path (GC), which likely needs it only seldom. /cc @aclements, @RLH, @rsc, @dvyukov, @josharian once again, because atomic stores were shown just above to be ~60x slower than a plain store when the destination is in L1 and not contended. See also #19563 (comment) and #19563 (comment) for the -1.5% to -2% speedup of the whole BenchmarkHTTPClientServer when casgstatus is effectively disabled. |
Most of the CASes are unnecessary and should be removed. The spin waits are not necessary either. |
@dvyukov thanks for the feedback. It is a bit of a pity there is no reply from the GC team. |
GC team here. It isn't that we aren't monitoring this issue, it's just that we don't have any new numbers that add to the conversation. The GC is core, so changes undergo a risk-to-benefit analysis. Here is where this analysis now stands. On the benefit side there are micro benchmarks, but there is not a lot of evidence that replacing some atomics with normal stores shows an improvement to a wide range of real world applications or a smaller set of important applications running on today's HW. On the risk side the GC is clever and uses CASes instead of mutexes to implement a lock-free state machine algorithm. This discussion suggests that the GC should be even more clever and implement a non-blocking algorithm that changes the state variable sometimes with normal stores and sometimes with atomic instructions. The bug tail on these kinds of changes risks being long and difficult. |
kirr wrote about BenchmarkHTTPClientServer with nocasgstatus patch: (#19563 (comment))
ok, I assume it did not make a difference (though getting this confirmed would be more explicit)
ok GC team, thanks for the feedback. Your position is clear. Kirill |
Closing: the suggested change's risks (and the effort needed to ensure correctness) outweigh the potential benefits. |
Update 20170317: part of the original 1.7x slowness was found to be due to an os.File regression, not the syscall itself (#19563 (comment), #19586). Raw Go syscalls are ~1.4x slower than in C.
Please answer these questions before submitting your issue. Thanks!
What did you do?
Hello up there. First of all let me please say right from the beginning that I understand that in Go, due to goroutines, there is more work to make a system call compared to C. Still please consider the following two programs which measure time to do a pread in Go and C:
(https://play.golang.org/p/eY1WVf7cZV)
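The two programs themselves are behind the links above; for reference, the Go side is roughly of this shape (a minimal sketch of my own, not the exact playground program; it assumes the /dev/shm/x.dat file created by the dd command below):

package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	f, err := os.Open("/dev/shm/x.dat")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	const niter = 100000
	// For each read size n, time niter preads of n bytes at offset 0
	// and report the average time per call.
	for n := 1; n <= 4096; n *= 2 {
		buf := make([]byte, n)
		t0 := time.Now()
		for i := 0; i < niter; i++ {
			if _, err := f.ReadAt(buf, 0); err != nil {
				panic(err)
			}
		}
		dt := time.Since(t0)
		fmt.Printf("pread(%d)\t%.1f ns/op\n", n, float64(dt.Nanoseconds())/niter)
	}
}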
Below is a plot of their output when run on a 4K file on a tmpfs (
dd if=/dev/urandom of=/dev/shm/x.dat bs=4K count=1
):

From the plot it can be seen that the time to do a pread of n bytes is α + β⋅n. On my computer β ≈ 0.01ns (per byte), and α ≈ 175ns for the C case and ≈ 325ns for the Go case. The Go syscall is thus taking ~150ns more time than the C version.
For programs that make many system calls this can become noticeable.
The profile for the Go version is below:
(svg version here)
From it, the following could be suggested for investigation:
the defer usage in the I/O poller is quite noticeable. Either making defer faster (runtime: defer is slow #14939, cmd/compile,runtime: avoid multiple returns from deferproc #19466, runtime: freedefer performance oddity #18923, ...) or not using defer could be a solution.
runtime·entersyscall and runtime·exitsyscall take a good part of the time. Maybe something could be done about it? (probably related: cmd/compile: getcallerpc ought to be intrinsified #17327)
internal/poll locking: at least on Linux it is OK to issue several preads simultaneously (please correct me if I'm wrong). From the pread man page:
(http://man7.org/linux/man-pages/man2/pread.2.html)
Thus no locking is needed to serialize pread vs pread or vs read (see the sketch after this list). Maybe some locking is still needed for internal bookkeeping, but may I wonder whether it can be made lighter?
maybe something else.
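As an illustration of the "no serialization needed" point above, here is a minimal sketch (my own example, not from the issue) that issues preads concurrently at different offsets via os.File.ReadAt; pread(2) does not touch the shared file offset, so the kernel does not require these calls to be serialized:

package main

import (
	"fmt"
	"os"
	"sync"
)

func main() {
	f, err := os.Open("/dev/shm/x.dat")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(off int64) {
			defer wg.Done()
			buf := make([]byte, 1)
			// Each goroutine preads at its own offset; no user-level lock is
			// fundamentally required for correctness of the reads themselves.
			if _, err := f.ReadAt(buf, off); err != nil {
				panic(err)
			}
			fmt.Printf("offset %d: %x\n", off, buf[0])
		}(int64(i))
	}
	wg.Wait()
}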
Once again I understand that in Go, due to goroutines, there is more work to make a system call compared to C, but it would be nice if the overhead is minimized.
Thanks beforehand,
Kirill
P.S.
All sources for this benchmark are accessible here: https://lab.nexedi.com/kirr/misc/tree/bfa03189
What did you expect to see?
Making a syscall in Go costs only a bit more than in C.
What did you see instead?
Making the pread syscall in Go is ~1.7x slower on average than in C for reads no bigger than one page (4K).
Does this issue reproduce with the latest release (go1.8)?
Yes, but with go18 it was only ~90ns slower:
In other words, pread on go tip is slower than pread on go18 by ~60ns.
System details