runtime: syscalls are ~ 1.4x slower than in C #19563

Closed
navytux opened this issue Mar 15, 2017 · 36 comments
@navytux
Contributor

navytux commented Mar 15, 2017

Update 20170317: part of the original 1.7x slowness was found to be due to an os.File regression, not the syscall itself (#19563 (comment), #19586). Raw Go syscalls are ~ 1.4x slower than in C.


Please answer these questions before submitting your issue. Thanks!

What did you do?

Hello up there. First of all let me please say right from the beginning that I understand that in Go, due to goroutines, there is more work to make a system call compared to C. Still, please consider the following two programs, which measure the time to do a pread in Go and in C:

package main

import (
        "flag"
        "fmt"
        "log"
        "os"
        "time"
)

//             12345678
const Niter = 1000000

// xpread is pread(2) but aborts on an error or when fewer bytes were read
func xpread(f *os.File, buf []byte, offset int64) {
        n, err := f.ReadAt(buf, offset)
        if err != nil {
                panic(err)
        }
        if n != len(buf) {
                panic(fmt.Errorf("pread(%v) -> %v", len(buf), n))
        }
}

const BufSize = 4096
var buf [BufSize]byte

func main() {
        flag.Parse()
        if len(flag.Args()) < 1 {
                log.Fatalf("Usage: %s <file>", os.Args[0])
        }

        f, err := os.Open(flag.Args()[0])
        if err != nil {
                log.Fatal(err)
        }

        for size := 0; size <= BufSize; {
                sbuf := buf[:size]

                // warmup
                xpread(f, sbuf, /*offset=*/0)

                Tstart := time.Now()

                for i := 0; i < Niter; i++ {
                        xpread(f, sbuf, /*offset=*/0)
                }

                Tend := time.Now()

                fmt.Printf("%v\t%.1f\n", size, float64(Tend.Sub(Tstart).Nanoseconds()) / Niter)

                switch { 
                case size < 16:
                        size += 1       // detailed start
                case size < 128:
                        size += 16
                default:
                        size += 128
                }
        }
}

(https://play.golang.org/p/eY1WVf7cZV)

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

//               12345678
#define NITER   1000000

// microtime returns current time as double
double microtime() {
        int err;
        struct timeval tv;

        err = gettimeofday(&tv, NULL);
        if (err == -1) {
                perror("gettimeofday");
                abort();
        }

        return tv.tv_sec + 1E-6 * tv.tv_usec;
}

#define BUFSIZE 4096
static char buf[BUFSIZE];

// xpread is pread(2) but aborts on an error or when fewer bytes were read
void xpread(int fd, void *buf, size_t count, off_t offset) {
        ssize_t n;
        n = pread(fd, buf, count, offset);
        if (n == -1) {
                perror("pread");
                abort();
        }

        if (n != count) {
                fprintf(stderr, "pread(%ld) -> %ld\n", count, n);
                abort();
        }
}

int main(int argc, char *argv[]) {
        int fd, i;
        size_t size;

        if (argc < 2) {
                fprintf(stderr, "Usage: %s <file>\n", argv[0]);
                exit(1);
        }

        fd = open(argv[1], O_RDONLY);
        if (fd == -1) {
                perror("open");
                abort();
        }

        for (size = 0; size <= BUFSIZE;) {
                // warmup
                xpread(fd, &buf, size, /*offset=*/0);

                double Tstart, Tend;
                Tstart = microtime();

                for (i = 0; i < NITER; i++) {
                        xpread(fd, buf, size, /*offset=*/0);
                }

                Tend = microtime();

                printf("%ld\t%.1f\n", size, (Tend - Tstart) / NITER / 1E-9);


                if (size < 16) {
                        size += 1;      // detailed start
                } else if  (size < 128) {
                        size += 16;
                } else {
                        size += 128;
                }
        }
}

Below is a plot of their output when run on a 4K file on tmpfs (dd if=/dev/urandom of=/dev/shm/x.dat bs=4K count=1):

[plot: syspread timings]

From the plot it can be seen that the time to do a pread of n bytes is α + β⋅n. On my computer β ≈ 0.01ns/byte, and α ≈ 175ns for the C case and ≈ 325ns for the Go case. The Go syscall thus takes ~ 150ns more time than the C version.

For programs that make many system calls this can become noticeable.

The profile for Go version is below:

File: t_syspread_go
Type: cpu
Time: Mar 15, 2017 at 10:11pm (MSK)
Duration: 18.42s, Total samples = 183.20ms ( 0.99%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 156.70ms, 85.53% of 183.20ms total
Dropped 8 nodes (cum <= 0.92ms)
Showing top 10 nodes out of 24
      flat  flat%   sum%        cum   cum%
  112.80ms 61.57% 61.57%   141.70ms 77.35%  syscall.Syscall6 /home/kirr/src/tools/go/go/src/syscall/asm_linux_amd64.s
    9.20ms  5.02% 66.59%     9.20ms  5.02%  runtime.casgstatus /home/kirr/src/tools/go/go/src/runtime/proc.go
    6.30ms  3.44% 70.03%    12.90ms  7.04%  runtime.reentersyscall /home/kirr/src/tools/go/go/src/runtime/proc.go
    4.90ms  2.67% 72.71%     4.90ms  2.67%  internal/poll.(*fdMutex).rwlock /home/kirr/src/tools/go/go/src/internal/poll/fd_mutex.go
    4.80ms  2.62% 75.33%     4.80ms  2.62%  internal/poll.(*fdMutex).rwunlock /home/kirr/src/tools/go/go/src/internal/poll/fd_mutex.go
    4.10ms  2.24% 77.57%     5.60ms  3.06%  runtime.exitsyscallfast /home/kirr/src/tools/go/go/src/runtime/proc.go
    3.90ms  2.13% 79.69%     3.90ms  2.13%  runtime.newdefer /home/kirr/src/tools/go/go/src/runtime/panic.go
    3.80ms  2.07% 81.77%    12.80ms  6.99%  runtime.exitsyscall /home/kirr/src/tools/go/go/src/runtime/proc.go
    3.60ms  1.97% 83.73%   179.80ms 98.14%  os.(*File).ReadAt /home/kirr/src/tools/go/go/src/os/file.go
    3.30ms  1.80% 85.53%     6.30ms  3.44%  runtime.deferreturn /home/kirr/src/tools/go/go/src/runtime/panic.go

[profile graph: t_syspread_go]

(svg version here)

From it, the following could be suggested for investigation:

  1. the defer usage in the IO poller is quite noticeable. Either making defer faster (runtime: defer is slow #14939, cmd/compile,runtime: avoid multiple returns from deferproc #19466, runtime: freedefer performance oddity #18923, ...) or not using defer could be a solution.

  2. runtime·entersyscall and runtime·exitsyscall take a good part of the time. Maybe something could be done about it? (probably related: cmd/compile: getcallerpc ought to be intrinsified #17327)

  3. internal/poll locking: at least on Linux it is ok to issue several preads simultaneously (please correct me if I'm wrong). From the pread man page:

    The  pread()  and  pwrite()  system calls are especially useful in multithreaded applications.  They allow multiple
    threads to perform I/O on the same file descriptor without being affected by changes to the file  offset  by  other
    threads.
    

    (http://man7.org/linux/man-pages/man2/pread.2.html)

    Thus no locking is needed to serialize pread against pread or against read. Maybe some locking is still needed for internal bookkeeping, but I wonder whether it can be made lighter? (A sketch follows after this list.)

  4. maybe something else.
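
A minimal sketch illustrating point 3 above: concurrent ReadAt calls on the same *os.File are independent preads and need no user-level serialization for correctness (this assumes the 4K tmpfs file from the dd command above):

package main

import (
        "log"
        "os"
        "sync"
)

func main() {
        f, err := os.Open("/dev/shm/x.dat") // 4K file on tmpfs
        if err != nil {
                log.Fatal(err)
        }
        defer f.Close()

        var wg sync.WaitGroup
        for i := 0; i < 4; i++ {
                wg.Add(1)
                go func(offset int64) {
                        defer wg.Done()
                        buf := make([]byte, 1024)
                        // ReadAt maps to pread(2); concurrent calls on the same fd
                        // carry their own offsets and do not disturb each other,
                        // so no locking is required for correctness.
                        if _, err := f.ReadAt(buf, offset); err != nil {
                                log.Print(err)
                        }
                }(int64(i) * 1024)
        }
        wg.Wait()
}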

Once again I understand that in Go, due to goroutines, there is more work to make a system call compared to C, but it would be nice if the overhead is minimized.

Thanks beforehand,
Kirill

P.S.
All sources for this benchmark are accessible here: https://lab.nexedi.com/kirr/misc/tree/bfa03189

What did you expect to see?

Making a syscall in Go costs only a bit more than in C.

What did you see instead?

Making a pread syscall in Go is ~ 1.7x slower on average than in C for reads no bigger than one page (4K).

Does this issue reproduce with the latest release (go1.8)?

Yes, but with go1.8 it was only ~ 90ns slower:

[plot: syspread timings, go1.8]

In other words, pread in go tip is slower than pread in go1.8 by ~ 60ns.

System details

go version devel +43afcb5c96 Wed Mar 15 04:27:49 2017 +0000 linux/amd64
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/kirr/go"
GORACE=""
GOROOT="/home/kirr/src/tools/go/go"
GOTOOLDIR="/home/kirr/src/tools/go/go/pkg/tool/linux_amd64"
GCCGO="/usr/bin/gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build990167505=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOROOT/bin/go version: go version devel +43afcb5c96 Wed Mar 15 04:27:49 2017 +0000 linux/amd64
GOROOT/bin/go tool compile -V: compile version devel +43afcb5c96 Wed Mar 15 04:27:49 2017 +0000 X:framepointer
uname -sr: Linux 4.9.0-2-amd64
Distributor ID:	Debian
Description:	Debian GNU/Linux 9.0 (stretch)
Release:	9.0
Codename:	stretch
/lib/x86_64-linux-gnu/libc.so.6: GNU C Library (Debian GLIBC 2.24-9) stable release version 2.24, by Roland McGrath et al.
gdb --version: GNU gdb (Debian 7.12-6) 7.12.0.20161007-git
@josharian
Contributor

cc @bcmills @ianlancetaylor

@davecheney
Contributor

davecheney commented Mar 15, 2017 via email

@bcmills
Contributor

bcmills commented Mar 15, 2017

What's odd is that I don't see any actual changes to EnterSyscall6 in syscall/asm_linux_amd64.s since 2015, but the bulk of the time in the profile is in EnterSyscall6 directly.

That would seem to imply one of a few things:

  1. The difference is due to cache effects, some change in the function prologue or epilogue, or inlining.
  2. The actual source of the slowdown isn't being captured in the profile for some reason.
  3. The actual timing hasn't changed, but the measurement has (e.g. due to the monotonic-time changes or changes to debug maps).

Can you replicate the same regression from 1.7 to 1.8 to tip using testing.B instead of time.Now? That would help to confirm or refute (3).

@ianlancetaylor
Contributor

Tip is bound to be slower than 1.8 because of the support for the runtime poller. This brings various advantages but does take more time. But I'm frankly surprised that this time is not swamped by the overall kernel time required to implement the syscall and copy the bytes.

@bcmills
Contributor

bcmills commented Mar 16, 2017

Ah, now I understand why the pprof output looks so strange: the program is not even remotely CPU-bound in the application code.

Duration: 18.42s, Total samples = 183.20ms ( 0.99%)

So the pprof output likely looks strange because it's drawing from a very small sample size: the sampling error swamps out most of the interesting detail.

@navytux
Contributor Author

navytux commented Mar 16, 2017

First of all everyone - thanks for feedback.

@bcmills the timing measurements are outside the benchmarking loop, so timer changes should probably not affect anything. That said, below is a benchmark that reads 1 byte via pread from tmpfs, so we can be sure:

package main

import (
        "crypto/rand"
        "fmt"
        "os"
        "testing"
)


// xpread is pread(2) but aborts on an error or when fewer bytes were read
func xpread(f *os.File, buf []byte, offset int64) {
        n, err := f.ReadAt(buf, offset)
        if err != nil {
                panic(err)
        }
        if n != len(buf) {
                panic(fmt.Errorf("pread(%v) -> %v", len(buf), n))
        }
}

// BenchmarkPread1 benchmarks pread syscall when reading 1 byte from a file on tmpfs
func BenchmarkPread1(b *testing.B) {
        // setup 4K file on a tmpfs
        f, err := os.Create("/dev/shm/x.dat")
        if err != nil {
                b.Fatal(err)
        }

        buf4K := make([]byte, 4096)
        _, err = rand.Read(buf4K)
        if err != nil {
                // rand promises n = len(buf) <=> err == nil
                b.Fatal(err)
        }

        _, err = f.Write(buf4K)
        if err != nil {
                b.Fatal(err)
        }

        buf1B := make([]byte, 1)

        // warm up
        xpread(f, buf1B, 0)

        b.ResetTimer()
        for i := 0; i < b.N; i++ {
                xpread(f, buf1B, 0)
        }
        b.StopTimer()

        err = f.Close()
        if err != nil {
                b.Fatal(err)
        }
}

(https://lab.nexedi.com/kirr/misc/blob/dcdba0e3/syspread_test.go)

and below are timings for this benchmark for all Go versions starting from go14, statted over 10 runs each:

name \ time/op  go14.txt    go15.txt    go16.txt    go17.txt    go18.txt    gotip.txt
Pread1          267ns ± 0%
Pread1-4                    277ns ± 1%  274ns ± 1%  273ns ± 1%  265ns ± 3%  325ns ± 1%

(code to run/stat here: https://lab.nexedi.com/kirr/misc/blob/dcdba0e3/Makefile#L20)

we can see, once again, that go tip increases the time significantly.

About the pprof output - I'm not experienced enough here to check, but the time difference can also be seen by running the programs directly, without internal benchmarks, and measuring whole-program time with bash's time - the difference measured this way is real.

Thanks again,
Kirill

@navytux navytux changed the title runtime: syscalls are ~ 1.7x slower than in C runtime: syscalls are ~ 1.7x slower than in C (partly tip regression for 60ns over go1.8) Mar 16, 2017
@bcmills
Contributor

bcmills commented Mar 16, 2017

Your benchmark seems to be comparing direct pread syscalls to (*os.File).ReadAt calls. Can you benchmark direct syscalls from Go? It would be interesting to see whether the difference is coming from the syscall itself or some other part of ReadAt — for example, there could be a lock-contention issue for *os.File that is unrelated to the syscall proper.

@navytux
Contributor Author

navytux commented Mar 17, 2017

@bcmills good point. It is:

name \ time/op  go14.txt    go15.txt    go16.txt    go17.txt    go18.txt    gotip.txt
ReadAt1         268ns ± 1%
Pread1          259ns ± 1%
ReadAt1-4                   274ns ± 1%  277ns ± 1%  274ns ± 1%  263ns ± 1%  323ns ± 0%
Pread1-4                    262ns ± 1%  266ns ± 1%  262ns ± 2%  252ns ± 1%  251ns ± 1%

(updated benchmark, diff)
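
For reference, a minimal sketch of what the direct syscall.Pread side of this benchmark might look like (an illustrative reconstruction; the exact code used is in the linked repo):

package main

import (
        "os"
        "syscall"
        "testing"
)

// BenchmarkPread1Syscall reads 1 byte via syscall.Pread directly,
// bypassing os.File.ReadAt and the internal/poll layer.
func BenchmarkPread1Syscall(b *testing.B) {
        f, err := os.Open("/dev/shm/x.dat") // 4K file on tmpfs, as before
        if err != nil {
                b.Fatal(err)
        }
        defer f.Close()

        fd := int(f.Fd())
        buf := make([]byte, 1)

        b.ResetTimer()
        for i := 0; i < b.N; i++ {
                n, err := syscall.Pread(fd, buf, 0)
                if err != nil || n != len(buf) {
                        b.Fatalf("pread: n=%v err=%v", n, err)
                }
        }
}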

So:

  1. There is no regression in syscall.Pread itself - go18/gotip even improve compared to go17 by 10ns; however, even with this improvement syscall.Pread is slower than pread in C by 65 - 70ns, i.e. Go is ~ 35-40% slower for the syscall itself.

For syscall.Pread itself I get the following profile (maybe something dumb on my side; sorry):

File: t_syspread_go
Type: cpu
Time: Mar 17, 2017 at 11:21am (MSK)
Duration: 14.81s, Total samples = 146.50ms ( 0.99%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 145.50ms, 99.32% of 146.50ms total
Dropped 4 nodes (cum <= 0.73ms)
Showing top 10 nodes out of 13
      flat  flat%   sum%        cum   cum%
  115.30ms 78.70% 78.70%   143.70ms 98.09%  syscall.Syscall6 /home/kirr/src/tools/go/go/src/syscall/asm_linux_amd64.s
    9.90ms  6.76% 85.46%     9.90ms  6.76%  runtime.casgstatus /home/kirr/src/tools/go/go/src/runtime/proc.go
       7ms  4.78% 90.24%    13.90ms  9.49%  runtime.reentersyscall /home/kirr/src/tools/go/go/src/runtime/proc.go
    4.90ms  3.34% 93.58%     5.90ms  4.03%  runtime.exitsyscallfast /home/kirr/src/tools/go/go/src/runtime/proc.go
    3.50ms  2.39% 95.97%    13.20ms  9.01%  runtime.exitsyscall /home/kirr/src/tools/go/go/src/runtime/proc.go
    1.50ms  1.02% 97.00%   146.40ms 99.93%  main.main /home/kirr/src/wendelin/z/rust/bench/t_syspread_go.go
       1ms  0.68% 97.68%        1ms  0.68%  runtime.exitsyscallfast_reacquired /home/kirr/src/tools/go/go/src/runtime/proc.go
    0.80ms  0.55% 98.23%     0.80ms  0.55%  runtime.getcallerpc /home/kirr/src/tools/go/go/src/runtime/asm_amd64.s
    0.80ms  0.55% 98.77%     0.80ms  0.55%  runtime.save /home/kirr/src/tools/go/go/src/runtime/proc.go
    0.80ms  0.55% 99.32%   144.50ms 98.63%  syscall.Pread /home/kirr/src/tools/go/go/src/syscall/zsyscall_linux_amd64.go

[profile graph: syspread-syscall timings]
(svg)

So for the syscall itself it looks like the overhead is mainly in runtime·entersyscall and runtime·exitsyscall and probably related goroutine handling in the scheduler.

  2. The difference between os.File.ReadAt(len(buf)=1) and syscall.Pread(len(buf)=1) was ~ 10ns during go14 - go18. With go tip it rises to ~ 70ns, so it is indeed a 60ns regression for os.File.ReadAt.

@navytux
Contributor Author

navytux commented Mar 17, 2017

I've moved os.File regression to #19586.

@navytux navytux changed the title runtime: syscalls are ~ 1.7x slower than in C (partly tip regression for 60ns over go1.8) runtime: syscalls are ~ 1.4x slower than in C Mar 17, 2017
@ianlancetaylor
Contributor

If the os.File slowdown is moved to #19586, what is left to do in this issue?

@navytux
Contributor Author

navytux commented Mar 17, 2017

That making a system call in Go is ~ 1.4x slower than making it in C.

@ianlancetaylor
Contributor

Do you have a benchmark for that? I don't see one above; sorry if I missed it.

It's pretty hard to fix. While we might be able to shave some time off, fundamentally the Go function syscall.Syscall must update the scheduler in some way, whereas the C function syscall does not need to do anything similar. I'm actually surprised that the slowdown is only 1.4x.

@navytux
Contributor Author

navytux commented Mar 17, 2017

@ianlancetaylor the benchmark for syspread in C and in Go is this:

action                                time
pread(count=1), C                     175ns
pread(count=4K), C                    175ns + ~40ns
syscall.Pread(len(buf)=1), Go tip     251ns
syscall.Pread(len(buf)=4K), Go tip    251ns + ~40ns

This data was already presented before in a scattered way. The above table summarizes it once again, with the actual syscall in focus. I apologize for the confusion.


About overhead: let me once again say that I understand that in Go, due to goroutines, there is more work to make a system call compared to C, but it would be nice if the overhead is minimized.

In my original post I already gave some ideas; regarding syscalls, imho the low-hanging fruit is to try to intrinsify getcallerpc (#17327). If we also audit things with performance in mind, something else could probably be found in runtime·entersyscall and runtime·exitsyscall land. We can use this issue to track this kind of work.

Thanks,
Kirill

@bcmills
Contributor

bcmills commented Mar 17, 2017

Your profile is still not even close to CPU-bound: less than 1% of the samples were executing instructions in the user part of the program.

Duration: 14.81s, Total samples = 146.50ms ( 0.99%)

I will also note that 40ns is approximately the cost of an L3 cache miss on modern AMD64 hardware.

@navytux
Contributor Author

navytux commented Mar 17, 2017

@bcmills I'm not sure about the profile, because I haven't played with the Go profiler enough; however, I've made the following test:

package main

import (
        "flag"
        "fmt"
        "log"
        "os"
        "runtime/pprof"
        "time"
)

//             12345678
const Niter = 100000000

var cpuprofile = flag.String("cpuprofile", "", "write cpu profile `file`")

//go:noinline
func aaa() {
}

func main() {
        flag.Parse()

        if *cpuprofile != "" {
                f, err := os.Create(*cpuprofile)
                if err != nil {
                        log.Fatal("could not create CPU profile: ", err)
                }
                if err := pprof.StartCPUProfile(f); err != nil {
                        log.Fatal("could not start CPU profile: ", err)
                }
                defer pprof.StopCPUProfile()
        }


        Tstart := time.Now()

        for i := 0; i < Niter; i++ {
                aaa()
        }

        Tend := time.Now()

        fmt.Printf("%.1f\n", float64(Tend.Sub(Tstart).Nanoseconds()) / Niter)
}
kirr@mini:~/src/wendelin/z/rust/bench$ time ./t_loop_go -cpuprofile cpu.prof
14.6

real    0m1,615s
user    0m1,464s
sys     0m0,012s
kirr@mini:~/src/wendelin/z/rust/bench$ go tool pprof ./t_loop_go cpu.prof 
File: t_loop_go
Type: cpu
Time: Mar 17, 2017 at 10:27pm (MSK)
Duration: 1.60s, Total samples = 14.50ms (  0.9%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 14.50ms, 100% of 14.50ms total
      flat  flat%   sum%        cum   cum%
   13.70ms 94.48% 94.48%    14.40ms 99.31%  main.main /home/kirr/src/wendelin/z/rust/bench/t_loop_go.go
    0.70ms  4.83% 99.31%     0.70ms  4.83%  main.aaa /home/kirr/src/wendelin/z/rust/bench/t_loop_go.go
    0.10ms  0.69%   100%     0.10ms  0.69%  runtime._ExternalCode /home/kirr/src/tools/go/go/src/runtime/proc.go
         0     0%   100%     0.10ms  0.69%  runtime._System /home/kirr/src/tools/go/go/src/runtime/proc.go
         0     0%   100%    14.40ms 99.31%  runtime.main /home/kirr/src/tools/go/go/src/runtime/proc.go

This program is completely CPU-bound, so to me the above "Duration: 1.60s, Total samples = 14.50ms ( 0.9%)" either says the profiler shows the fraction of time it sampled (?) or it is some other bug.


About 40ns - I will also note that it is not a cache miss here: if you look once again at the original plot of pread/copy data:

[plot: pread and copy timings]

please ignore the Go lines above and look at the syspread/c and copy/c lines. They clearly show the time growing linearly as the number of bytes to read/copy grows. The syspread/c - copy/c line also shows that the kernel is actually doing a straight memcpy from the page cache to the destination buffer. 40ns is just the time it takes to copy 4K on my computer when both source and destination are hot in cache, because in that case the average time to copy is 0.01ns per byte here.

@bcmills
Contributor

bcmills commented Mar 17, 2017

Ah, I think I'm misinterpreting the new pprof format (reported as google/pprof#128).

I believe you're correct that the program is actually CPU-bound; I apologize for my confusion.

@navytux
Contributor Author

navytux commented Mar 18, 2017

Np, and thanks for bringing this up with pprof team.

@navytux
Contributor Author

navytux commented Mar 20, 2017

Btw, I've looked around a bit and noticed: in runtime·entersyscall / runtime·exitsyscall land a lot of time is spent on atomics, e.g. in casgstatus, and it is known that atomics are quite expensive. Prior to go14, g.status was updated in a non-atomic way: see commit 0a7c7ac (https://golang.org/cl/132960044) - the change was added there to support concurrent GC.

Here is updated benchmark for pread syscall for all Go releases:

name \ time/op  go10.txt    go11.txt    go12.txt    go13.txt    go14.txt    go15.txt    go16.txt    go17.txt    go18.txt    gotip.txt
Pread1-4        226ns ± 0%  229ns ± 0%  240ns ± 1%  242ns ± 2%  261ns ± 1%  263ns ± 1%  267ns ± 0%  262ns ± 1%  255ns ± 2%  252ns ± 1%

(updated benchmark, test)

and it indeed shows a 20ns go13 -> go14 slowdown (and another 15ns go10/go11 -> go12 slowdown whose reason I have not checked).

So if atomics for e.g. the g.status update were introduced to support concurrent GC, maybe it is also possible not to always update g.status atomically, but to do it in a way similar to runtime·writebarrierptr() - there, before updating a pointer, the runtime checks whether a write barrier is needed at all, and if not it becomes a simple store. If I understand correctly, write barriers are needed only during a certain phase of garbage collection - in other words "usually not" most of the time. Probably we could have a similar flag for updates of g.status and friends too.

I do not know the GC details, so please forgive me in advance if the above is dumb in one way or another. It sounds reasonable to me offhand, however.

Thanks beforehand,
Kirill

/cc @RLH, @aclements

@aclements
Member

Btw, I've looked around a bit and noticed: in runtime·entersyscall / runtime·exitsyscall land a lot of time is spent on atomics, e.g. in casgstatus, and it is known that atomics are quite expensive.

Atomics got a bad reputation from their early implementations, which were extremely expensive. But on a modern x86 an atomic operation on an uncontended cache line in the L1 is only ~2.5x more expensive than a plain write.

That said, you're definitely right that the atomic operations are the hottest individual operations in the user-space side of the syscall path (partly because there's not that much else going on :). According to PEBS profiling (perf -e cycles:pp), your Pread1 benchmark spends about 4% of its total time in the cmpxchg on the gstatus, 1.5% in the P status write in reentersyscall, and 1.4% in the CAS of the P status in exitsyscallfast.

maybe what is also possible to do is to update g.status not always in atomic manner

That's a clever idea. I haven't completely convinced myself that this is safe, but I did a simple experiment to see what impact it would have. I disabled GC with GOGC=off and added the following patch:

diff --git a/src/runtime/proc.go b/src/runtime/proc.go
index 159a9bd4bc..1538274c77 100644
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -749,6 +749,17 @@ func casgstatus(gp *g, oldval, newval uint32) {
                throw("casgstatus")
        }
 
+       if gcphase == _GCoff {
+               if gp.atomicstatus != oldval {
+                       throw("bad old status")
+               }
+               gp.atomicstatus = newval
+               if newval == _Grunning {
+                       gp.gcscanvalid = false
+               }
+               return
+       }
+
        // See http://golang.org/cl/21503 for justification of the yield delay.
        const yieldDelay = 5 * 1000
        var nextYield int64

Here are the results on my laptop before and after this patch (go test -c main_test.go && GOGC=off perflock ./main.test -test.bench . -test.count 20):

name       old time/op  new time/op  delta  
ReadAt1-4   514ns ± 2%   504ns ± 1%  -2.02%  (p=0.000 n=20+20)
Pread1-4    431ns ± 2%   414ns ± 0%  -3.86%  (p=0.000 n=20+16)

So this saves 10–17ns in the syscall path when GC is inactive. Given the cost of most syscalls, I'm not sure that's worth the added complexity and potential for extremely subtle bugs, but maybe it is.

@navytux
Contributor Author

navytux commented Mar 20, 2017

@aclements first of all thanks for your feedback, and for the heads-up that atomics have become not so expensive. It looks like the cache-coherency logic in x86 became much more advanced over time. Anyway:

  1. Thanks for confirming my observations with perf
  2. Thanks a lot for commenting on the idea and for try-patch.

On my side with it for pread-related benchmarks I get:

name       old time/op  new time/op  delta
ReadAt1-4   323ns ± 1%   317ns ± 0%  -1.84%  (p=0.000 n=18+19)
Pread1-4    251ns ± 0%   240ns ± 0%  -4.19%  (p=0.000 n=20+20)

which more or less confirms your findings.

Given the cost of most syscalls, I'm not sure that's worth the added complexity and potential for extremely subtle bugs, but maybe it is.

Here comes something to consider: during e.g. a web-serving workload, a lot of system calls are made. If I understand correctly, on e.g. any socket, for Read or Write a direct system call is made first, and only if it returns EAGAIN is the fd registered with the netpoller (or the operation succeeds right away if it was ready). The read and write syscalls are also used later to send/receive data when the fd becomes ready. So there are many system calls involved.
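
A rough sketch of that pattern (an illustration only, not the actual internal/poll code; waitRead is a hypothetical stand-in for parking the goroutine on the netpoller):

package nbread

import "syscall"

// readNonblocking tries the read(2) syscall first and waits on the poller
// only when the kernel reports EAGAIN.
func readNonblocking(fd int, buf []byte) (int, error) {
        for {
                n, err := syscall.Read(fd, buf)
                if err != syscall.EAGAIN {
                        return n, err // success, or a real error
                }
                // fd is not ready: wait until it becomes readable, then retry.
                waitRead(fd)
        }
}

// waitRead is a hypothetical placeholder: the real runtime parks the calling
// goroutine on the netpoller (epoll on Linux) until fd is readable.
func waitRead(fd int) {}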

With this in mind I've tried bench/go1's BenchmarkHTTPClientServer without and with your patch, and on my computer it gives the following:

name                old time/op  new time/op  delta
HTTPClientServer-4  40.4µs ± 1%  39.5µs ± 1%  -2.15%  (p=0.000 n=18+18)

The effect is quite visible, right? And that is with only a partial mitigation of the go13 -> go14 syscall-related change.

So what I would suggest is:

  1. consider going on with atomics avoidance approach.
  2. bisect and see what made syscalls slower during go10/11 -> go12 transition; try to fix that too.
  3. try to remove checks that are not really needed from hot paths.
  4. maybe something else...

All that because runtime·entersyscall / runtime·exitsyscall are on the path of every system call made, and usually programs make many of them - at least if we consider web servers or programs that work with files.

Thanks again,
Kirill

P.S. thanks also for cycles:pp 2 and perflock references.

@aclements
Member

aclements commented Mar 20, 2017

With this in mind I've tried bench/go1's BenchmarkHTTPClientServer without and with your patch, and on my computer it gives the following:

That's pretty intriguing, actually. I've kicked off the go1 and x/benchmarks suites to see what the results there are.

P.S. thanks also for cycles:pp 2 and perflock references.

Thanks for reminding me that I never published my new version of perflock! What I was actually using is https://github.com/aclements/perflock, which has nice properties over the old version like ensuring FIFO ordering, taking control of the CPU governor, and being written in Go. :) I'm hoping to do more with it, like using cgroups for isolation, but haven't yet.

@navytux
Contributor Author

navytux commented Mar 20, 2017

@aclements thanks for the feedback. It is always nice when discussions bring forgotten things to light. Thanks for the correct link :)

Thanks also for starting the benchmarks. Let's see the results when you have them.

@ianlancetaylor
Contributor

I note that the network I/O syscalls are known to be non-blocking, and therefore it may be possible to use RawSyscall rather than Syscall, which cuts out the atomic instructions entirely. However, implementing that safely may be difficult, since these code paths are now also used for file I/O operations, and using RawSyscall for a file I/O operation can fail when using a user-space FUSE file system (https://groups.google.com/d/msg/golang-nuts/11rdExWP6ac/UnP5P5m3VPEJ).

@bradfitz bradfitz added this to the Go1.9Maybe milestone Mar 21, 2017
@navytux
Contributor Author

navytux commented Mar 21, 2017

@ianlancetaylor thanks for the heads-up on RawSyscall vs Syscall. As replied in the thread (my post is still waiting for moderation to appear), it is not FUSE-only, but can happen with any file system whose operations block (e.g. close on NFS is blocking).

That said, maybe what could be done on the Go side for both sockets and files such as pipes/fifos/etc is:

  • track whether the file descriptor mode is O_NONBLOCK or not,

  • when there is a need to make Read or similar, do an if on that flag, and if we know:

    • the mode is non-blocking, and
    • for this fd we can trust the kernel to respect O_NONBLOCK (i.e. it is not a regular file) (*)

    do a RawSyscall. Otherwise do Syscall.

Such an if should cost very little, and imho the logic could be put into pollDesc.
Or it could be the other way around: only pollable file descriptors are registered with the poller and everything else uses a direct Syscall for Read/Write. (A rough sketch of this dispatch follows below.)
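
A rough sketch of that dispatch, assuming linux/amd64 (illustrative only; fdInfo and its fields are hypothetical bookkeeping, not the real internal/poll types):

package rawdispatch

import (
        "syscall"
        "unsafe"
)

// fdInfo is a hypothetical record of the per-fd properties discussed above.
type fdInfo struct {
        sysfd    int
        nonblock bool // O_NONBLOCK is set on the fd
        regular  bool // regular file or block device: kernel ignores O_NONBLOCK
}

// pread uses the cheap RawSyscall entry when the kernel can be trusted not to
// block, and the full Syscall entry (which goes through entersyscall /
// exitsyscall and notifies the scheduler) otherwise.
func pread(fd fdInfo, p []byte, offset int64) (int, error) {
        var p0 unsafe.Pointer
        if len(p) > 0 {
                p0 = unsafe.Pointer(&p[0])
        }

        var n uintptr
        var errno syscall.Errno
        if fd.nonblock && !fd.regular {
                n, _, errno = syscall.RawSyscall6(syscall.SYS_PREAD64,
                        uintptr(fd.sysfd), uintptr(p0), uintptr(len(p)), uintptr(offset), 0, 0)
        } else {
                n, _, errno = syscall.Syscall6(syscall.SYS_PREAD64,
                        uintptr(fd.sysfd), uintptr(p0), uintptr(len(p)), uintptr(offset), 0, 0)
        }
        if errno != 0 {
                return 0, errno
        }
        return int(n), nil
}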

That said, it should also be noted that regular files remain uncovered this way. So maybe it is better to just optimize the regular Syscall for the benefit of all. And even if we go the RawSyscall way, since file IO cannot really be done without syscalls(+), removing the overhead is important. Not to mention that even with sockets there are still things like EpollWait, get/setsockopt, fcntl, etc. which use the Syscall entry to the kernel.

Thanks again,
Kirill

(*) from http://man7.org/linux/man-pages/man2/open.2.html

O_NONBLOCK or O_NDELAY
              ...

              Note that this flag has no effect for regular files and block
              devices; that is, I/O operations will (briefly) block when
              device activity is required, regardless of whether O_NONBLOCK
              is set...

(+) see e.g. http://marc.info/?l=linux-kernel&m=149002565506733&w=2

@aclements
Member

@navytux, here are the benchmark results (sorry, left them running and then forgot about them :): https://perf.golang.org/search?q=upload%3A20170330.4+%7C+commit%3A01ac5b8dcfe5342af3770b0834220b87ea328fad+vs+commit%3A9d96f6d72012c7ef902eecbec43ff7cf2a37c79d

So there's actually very little difference, at least in these benchmarks.

@navytux
Contributor Author

navytux commented Mar 31, 2017

@aclements thanks for your feedback. I was going to blame myself for doing something stupid last time, because today I tried and could not reproduce my previous results for BenchmarkHTTPClientServer either on today's tip (5a45a15) or on a tree state more or less corresponding to the time of my comment that claimed -2.15% on the benchmark.

However, with all that I could not believe I was to blame, because I know I did not cook the numbers; here they are in raw form for the runs on Mar 20:

kirr@deco:~/src/tools/go/go/test/bench/go1/old$ ll --full-time
-rw-r--r-- 1 kirr kirr 1171 2017-03-20 20:24:38.439971290 +0300 0.txt
-rw-r--r-- 1 kirr kirr 1171 2017-03-20 20:25:58.121720781 +0300 1.txt
kirr@deco:~/src/tools/go/go/test/bench/go1/old$ cat 0.txt # without patch
goos: linux
goarch: amd64
BenchmarkHTTPClientServer-4        30000             42846 ns/op
BenchmarkHTTPClientServer-4        30000             40022 ns/op
BenchmarkHTTPClientServer-4        30000             40354 ns/op
BenchmarkHTTPClientServer-4        30000             40427 ns/op
BenchmarkHTTPClientServer-4        30000             40177 ns/op
BenchmarkHTTPClientServer-4        50000             42325 ns/op
BenchmarkHTTPClientServer-4        30000             40564 ns/op
BenchmarkHTTPClientServer-4        30000             40597 ns/op
BenchmarkHTTPClientServer-4        30000             40321 ns/op
BenchmarkHTTPClientServer-4        50000             40198 ns/op
BenchmarkHTTPClientServer-4        30000             40224 ns/op
BenchmarkHTTPClientServer-4        30000             40491 ns/op
BenchmarkHTTPClientServer-4        30000             40140 ns/op
BenchmarkHTTPClientServer-4        30000             40580 ns/op
BenchmarkHTTPClientServer-4        50000             40381 ns/op
BenchmarkHTTPClientServer-4        50000             40575 ns/op
BenchmarkHTTPClientServer-4        50000             40123 ns/op
BenchmarkHTTPClientServer-4        30000             40274 ns/op
BenchmarkHTTPClientServer-4        50000             40764 ns/op
BenchmarkHTTPClientServer-4        30000             40157 ns/op
PASS
kirr@deco:~/src/tools/go/go/test/bench/go1/old$ cat 1.txt # with patch
goos: linux
goarch: amd64
BenchmarkHTTPClientServer-4        30000             42884 ns/op
BenchmarkHTTPClientServer-4        50000             41835 ns/op
BenchmarkHTTPClientServer-4        50000             39534 ns/op
BenchmarkHTTPClientServer-4        50000             39542 ns/op
BenchmarkHTTPClientServer-4        50000             39326 ns/op
BenchmarkHTTPClientServer-4        50000             39487 ns/op
BenchmarkHTTPClientServer-4        50000             39352 ns/op
BenchmarkHTTPClientServer-4        50000             39342 ns/op
BenchmarkHTTPClientServer-4        50000             39323 ns/op
BenchmarkHTTPClientServer-4        50000             39490 ns/op
BenchmarkHTTPClientServer-4        50000             39472 ns/op
BenchmarkHTTPClientServer-4        50000             39496 ns/op
BenchmarkHTTPClientServer-4        50000             39633 ns/op
BenchmarkHTTPClientServer-4        50000             39433 ns/op
BenchmarkHTTPClientServer-4        50000             39490 ns/op
BenchmarkHTTPClientServer-4        50000             39713 ns/op
BenchmarkHTTPClientServer-4        30000             39519 ns/op
BenchmarkHTTPClientServer-4        50000             39577 ns/op
BenchmarkHTTPClientServer-4        50000             39629 ns/op
BenchmarkHTTPClientServer-4        50000             39358 ns/op
PASS

So after trying many things and looking here and there I've noticed that the average time per operation is sometimes a bit higher the more iterations the benchmark driver makes. E.g. (from 0.txt):

BenchmarkHTTPClientServer-4        30000             40177 ns/op
BenchmarkHTTPClientServer-4        50000             42325 ns/op    <-- NOTE
BenchmarkHTTPClientServer-4        30000             40564 ns/op

That means that if the whole time to run 1 benchmark for 30000 iterations was ~ 1s (the go test default for -benchtime), the time to run 1 benchmark for 50000 iterations was ~ 1.7s - i.e. almost twice as long.

Then I've looked at the CPU profile for BenchmarkHTTPClientServer run with -benchtime 1s and with -benchtime 10s, and they are visibly different:

(g.env) kirr@deco:~/src/tools/go/go/test/bench/go1$ go test -bench Client -cpuprofile cpu.out -benchtime 1s
goos: linux
goarch: amd64
BenchmarkHTTPClientServer-4        30000             43363 ns/op
PASS
ok      _/home/kirr/src/tools/go/go/test/bench/go1      3.144s
(g.env) kirr@deco:~/src/tools/go/go/test/bench/go1$ go tool pprof go1.test cpu.out 
File: go1.test
Type: cpu
Time: Mar 31, 2017 at 5:37pm (MSK)
Duration: 1.90s, Total samples = 20.80ms ( 1.09%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top20
Showing nodes accounting for 11.90ms, 57.21% of 20.80ms total
Dropped 103 nodes (cum <= 0.10ms)
Showing top 20 nodes out of 141
      flat  flat%   sum%        cum   cum%
    5.60ms 26.92% 26.92%     5.80ms 27.88%  syscall.Syscall /home/kirr/src/tools/go/go/src/syscall/asm_linux_amd64.s
    0.80ms  3.85% 30.77%     2.30ms 11.06%  runtime.findrunnable /home/kirr/src/tools/go/go/src/runtime/proc.go
    0.70ms  3.37% 34.13%     0.70ms  3.37%  runtime.usleep /home/kirr/src/tools/go/go/src/runtime/sys_linux_amd64.s
    0.50ms  2.40% 36.54%     0.50ms  2.40%  runtime._ExternalCode /home/kirr/src/tools/go/go/src/runtime/proc.go
    0.40ms  1.92% 38.46%     0.40ms  1.92%  runtime.futex /home/kirr/src/tools/go/go/src/runtime/sys_linux_amd64.s
    0.30ms  1.44% 39.90%     0.40ms  1.92%  internal/poll.runtime_pollSetDeadline /home/kirr/src/tools/go/go/src/runtime/netpoll.go
    0.30ms  1.44% 41.35%     0.60ms  2.88%  net/http.readTransfer /home/kirr/src/tools/go/go/src/net/http/transfer.go
    0.30ms  1.44% 42.79%     0.30ms  1.44%  runtime.epollwait /home/kirr/src/tools/go/go/src/runtime/sys_linux_amd64.s
    0.30ms  1.44% 44.23%     0.30ms  1.44%  runtime.heapBitsSetType /home/kirr/src/tools/go/go/src/runtime/mbitmap.go
    0.30ms  1.44% 45.67%     0.30ms  1.44%  runtime.mapassign /home/kirr/src/tools/go/go/src/runtime/hashmap.go
    0.30ms  1.44% 47.12%     0.30ms  1.44%  runtime.memeqbody /home/kirr/src/tools/go/go/src/runtime/asm_amd64.s
    0.30ms  1.44% 48.56%     0.30ms  1.44%  runtime.nextFreeFast /home/kirr/src/tools/go/go/src/runtime/malloc.go
    0.30ms  1.44% 50.00%     0.40ms  1.92%  runtime.selectgo /home/kirr/src/tools/go/go/src/runtime/select.go
    0.30ms  1.44% 51.44%     0.30ms  1.44%  sync.(*Pool).pin /home/kirr/src/tools/go/go/src/sync/pool.go
    0.20ms  0.96% 52.40%     1.60ms  7.69%  bufio.(*Reader).ReadLine /home/kirr/src/tools/go/go/src/bufio/bufio.go
    0.20ms  0.96% 53.37%     0.60ms  2.88%  net/http.(*Transport).tryPutIdleConn /home/kirr/src/tools/go/go/src/net/http/transport.go
    0.20ms  0.96% 54.33%     4.40ms 21.15%  net/http.(*conn).readRequest /home/kirr/src/tools/go/go/src/net/http/server.go
    0.20ms  0.96% 55.29%     2.80ms 13.46%  net/http.(*persistConn).readLoop /home/kirr/src/tools/go/go/src/net/http/transport.go
    0.20ms  0.96% 56.25%     3.80ms 18.27%  net/http.(*persistConn).writeLoop /home/kirr/src/tools/go/go/src/net/http/transport.go
    0.20ms  0.96% 57.21%     0.20ms  0.96%  net/http.(*transferWriter).shouldSendContentLength /home/kirr/src/tools/go/go/src/net/http/transfer.go
(g.env) kirr@deco:~/src/tools/go/go/test/bench/go1$ go test -bench Client -cpuprofile cpu.out -benchtime 10s
goos: linux
goarch: amd64
BenchmarkHTTPClientServer-4       300000             41310 ns/op
PASS
ok      _/home/kirr/src/tools/go/go/test/bench/go1      14.265s
(g.env) kirr@deco:~/src/tools/go/go/test/bench/go1$ go tool pprof go1.test cpu.out 
File: go1.test
Type: cpu
Time: Mar 31, 2017 at 5:40pm (MSK)
Duration: 13.01s, Total samples = 158.60ms ( 1.22%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top20
Showing nodes accounting for 79.70ms, 50.25% of 158.60ms total
Dropped 282 nodes (cum <= 0.79ms)
Showing top 20 nodes out of 169
      flat  flat%   sum%        cum   cum%
   29.70ms 18.73% 18.73%    31.50ms 19.86%  syscall.Syscall /home/kirr/src/tools/go/go/src/syscall/asm_linux_amd64.s
    6.20ms  3.91% 22.64%     6.20ms  3.91%  runtime.futex /home/kirr/src/tools/go/go/src/runtime/sys_linux_amd64.s
    4.10ms  2.59% 25.22%     4.10ms  2.59%  runtime.epollwait /home/kirr/src/tools/go/go/src/runtime/sys_linux_amd64.s
    4.10ms  2.59% 27.81%     4.80ms  3.03%  runtime.lock /home/kirr/src/tools/go/go/src/runtime/lock_futex.go
    3.80ms  2.40% 30.20%        8ms  5.04%  runtime.selectgo /home/kirr/src/tools/go/go/src/runtime/select.go
    3.50ms  2.21% 32.41%    17.90ms 11.29%  runtime.findrunnable /home/kirr/src/tools/go/go/src/runtime/proc.go
    3.20ms  2.02% 34.43%    10.10ms  6.37%  runtime.mallocgc /home/kirr/src/tools/go/go/src/runtime/malloc.go
    2.80ms  1.77% 36.19%     2.80ms  1.77%  runtime._ExternalCode /home/kirr/src/tools/go/go/src/runtime/proc.go
    2.70ms  1.70% 37.89%     2.70ms  1.70%  runtime.casgstatus /home/kirr/src/tools/go/go/src/runtime/proc.go
    2.60ms  1.64% 39.53%     2.60ms  1.64%  runtime.heapBitsSetType /home/kirr/src/tools/go/go/src/runtime/mbitmap.go
    2.30ms  1.45% 40.98%        3ms  1.89%  runtime.deferreturn /home/kirr/src/tools/go/go/src/runtime/panic.go
    2.10ms  1.32% 42.31%     2.10ms  1.32%  runtime.usleep /home/kirr/src/tools/go/go/src/runtime/sys_linux_amd64.s
       2ms  1.26% 43.57%        2ms  1.26%  runtime.unlock /home/kirr/src/tools/go/go/src/runtime/lock_futex.go
    1.90ms  1.20% 44.77%     1.90ms  1.20%  sync.(*Mutex).Unlock /home/kirr/src/tools/go/go/src/sync/mutex.go
    1.80ms  1.13% 45.90%     1.80ms  1.13%  runtime.nextFreeFast /home/kirr/src/tools/go/go/src/runtime/malloc.go
    1.40ms  0.88% 46.78%        2ms  1.26%  runtime.mapassign /home/kirr/src/tools/go/go/src/runtime/hashmap.go
    1.40ms  0.88% 47.67%     3.50ms  2.21%  runtime.mapassign_faststr /home/kirr/src/tools/go/go/src/runtime/hashmap_fast.go
    1.40ms  0.88% 48.55%     1.40ms  0.88%  runtime.memclrNoHeapPointers /home/kirr/src/tools/go/go/src/runtime/memclr_amd64.s
    1.40ms  0.88% 49.43%     1.40ms  0.88%  sync.(*Mutex).Lock /home/kirr/src/tools/go/go/src/sync/mutex.go
    1.30ms  0.82% 50.25%     3.40ms  2.14%  runtime.runqgrab /home/kirr/src/tools/go/go/src/runtime/proc.go

Notice in particular:

  • the Syscall share of the time is quite different (18% vs 26%)
  • mallocgc and friends are visible in the longer run (not in the 1s run)
  • casgstatus is visible in the longer run (not in the 1s run)

That led me to think that much less garbage collection actually happens when the bench time is only 1s. That turned out to be true even if the benchmark is repeated several times, because the Go benchmark driver explicitly calls GC between runs, and that time is not accounted to the actual benchmark in question:

https://github.com/golang/go/blob/5a45a157/src/testing/benchmark.go#L141

That means that for benchmarks which generate garbage, but not enough to systematically trigger a statistically sound number of actual garbage collections within the default benchtime=1s, the default benchtime=1s is not appropriate, as it does not show the cost associated with garbage collections (and casgstatus is associated 100% with the GC), even when run with -test.count=N. And BenchmarkHTTPClientServer happens to be a benchmark of that kind.

So with this in mind, the correct procedure to test the effect of GC-related patches must run the benchmark with benchtime=Ns (where N is a timeframe in which GC triggers a statistically sound number of times). This way I've once again tested today's tip (5a45a15) without and with the patch, with N=50s, on Linux 4.9.0-2-amd64 with an i7-6600U CPU:

.../go/test/bench/go1$ go test -bench Client -benchtime 50s -count 20 -timeout 50m

and here is what it gives:

.../go/test/bench/go1$ benchstat B3/tip+gc.txt.50 B3/tip+patch+gc.txt.50 
name                old time/op  new time/op  delta
HTTPClientServer-4  41.3µs ± 1%  40.7µs ± 1%  -1.47%  (p=0.000 n=20+20)

I've also made 3 smaller runs with N=30s and got delta = -0.64% (B1), -3.48% (B2) and -1.42% (B4) for them. It seems that to get the real delta with higher probability it is necessary to run with a bigger N and a higher count, but from the data present it is already clear the delta is oscillating around -1.5% or -2%.

Raw data for the benchmarks are available here: https://lab.nexedi.com/kirr/misc/tree/d13eb36c/bench/nocasgstatus/5a45a157

Hope this clarifies things.

Thanks again,
Kirill

P.S. And last time it seems I was lucky to observe the effect with default -benchtime=1s ...

@navytux
Contributor Author

navytux commented May 4, 2017

That means that for benchmarks which generate garbage, but not enough to systematically trigger a statistically sound number of actual garbage collections within the default benchtime=1s, the default benchtime=1s is not appropriate, as it does not show the cost associated with garbage collections (and casgstatus is associated 100% with the GC), even when run with -test.count=N. And BenchmarkHTTPClientServer happens to be a benchmark of that kind.

@aclements could you please confirm -benchtime 50s makes a difference on your side, like it did for -1.5% - 2% for me?

Thanks beforehand,
Kirill

@navytux
Contributor Author

navytux commented May 4, 2017

Atomics got a bad reputation from their early implementations, which were extremely expensive. But on a modern x86 an atomic operation on an uncontended cache line in the L1 is only ~2.5x more expensive than a plain write.

@aclements about this: I was doing something else but noted atomics are not that light, so I've benchmarked it myself. Over 20 runs it gives:

name                      time/op
EmptyBase-4               0.45ns ± 0%
StoreUncontended-4        0.50ns ± 0%
StoreContended-4          2.42ns ± 8%
AtomicStoreUncontended-4  3.37ns ±15%
AtomicStoreContended-4    16.3ns ± 0%
IncUncontended-4          0.61ns ± 6%
IncContended-4            3.41ns ± 5%
AtomicIncUncontended-4    3.41ns ±13%
AtomicIncContended-4      20.1ns ± 6%

To me it says T(StoreUncontended) = 0.05ns (= 0.50ns - 0.45ns) and T(AtomicStoreUncontended) ≈ 2.9ns, which means an uncontended atomic store is ~ 58x slower than an uncontended plain store. For the uncontended increment, T(IncUncontended) ≈ 0.16ns and T(AtomicIncUncontended) ≈ 2.96ns, which means an uncontended atomic inc is ~ 18x slower than an uncontended plain inc.

58x is much more than 2.5x so I decided to share. My processor is i7-6600U which is relatively recent:

http://cpubenchmark.net/cpu.php?cpu=Intel+Core+i7-6600U+%40+2.60GHz&id=2608

About noise: I'm not sure why it is there, but it does not go away after rechecking with a second -test.count 20 run (the CPU governor is set to performance and no other benchmarks are running):

name                      time/op
EmptyBase-4               0.45ns ± 0%
StoreUncontended-4        0.51ns ± 0%
StoreContended-4          2.24ns ±14%
AtomicStoreUncontended-4  3.49ns ± 5%
AtomicStoreContended-4    16.4ns ±10%
IncUncontended-4          0.63ns ± 5%
IncContended-4            3.47ns ± 2%
AtomicIncUncontended-4    3.52ns ± 5%
AtomicIncContended-4      19.3ns ± 2%

Full benchmark source:

// Package qatomic shows how much overhead is there when using atomics
package qatomic

import (
	"sync/atomic"
	"testing"
)

// base showing how much overhead the empty benchmarking loop takes itself;
// the time per iteration an operation takes can thus be estimated as:
//
//	T(BenchmarkOp) - T(BenchmarkEmptyBase)
func BenchmarkEmptyBase(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
		}
	})
}

// ----------------------------------------

// plain store to uncontended memory
// made sure via compile -S the store is actually there inside loop
func BenchmarkStoreUncontended(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		var mem uint64
		func(pmem *uint64) {
			for i := uint64(0); pb.Next(); i++ {
				*pmem = i
			}
		}(&mem)
	})
}

// plain store to contended memory; races are ignored
// made sure via compile -S the store is actually there inside loop
func BenchmarkStoreContended(b *testing.B) {
	var mem uint64
	b.RunParallel(func(pb *testing.PB) {
		for i := uint64(0); pb.Next(); i++ {
			mem = i
		}
	})
}

// atomic store to uncontended memory
func BenchmarkAtomicStoreUncontended(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		var mem uint64
		for i := uint64(0); pb.Next(); i++ {
			atomic.StoreUint64(&mem, i)
		}
	})
}

// atomic store to contended memory
func BenchmarkAtomicStoreContended(b *testing.B) {
	var mem uint64
	b.RunParallel(func(pb *testing.PB) {
		for i := uint64(0); pb.Next(); i++ {
			atomic.StoreUint64(&mem, i)
		}
	})
}

// ----------------------------------------

// plain inc of uncontended memory
// made sure via compile -S it actually does `load; inc; store` inside loop
func BenchmarkIncUncontended(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		var mem uint64
		func(pmem *uint64) {
			for pb.Next() {
				*pmem++
			}
		}(&mem)
	})
}

// plain inc of contended memory
// made sure via compile -S it actually does `load; inc; store` inside loop
func BenchmarkIncContended(b *testing.B) {
	var mem uint64
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			mem++
		}
	})
}

// atomic inc of uncontended memory
func BenchmarkAtomicIncUncontended(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		var mem uint64
		for pb.Next() {
			atomic.AddUint64(&mem, 1)
		}
	})
}

// atomic inc of contended memory
func BenchmarkAtomicIncContended(b *testing.B) {
	var mem uint64
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			atomic.AddUint64(&mem, 1)
		}
	})
}

(https://play.golang.org/p/ujUaQUBMzO)

@ianlancetaylor
Contributor

It's an interesting result. In the inner loop, StoreUncontended does this:

mov    %rdx,(%rcx)
inc    %rdx

AtomicStoreUncontended does this:

 xchg   %rdx,(%rax)

So it seems to me that StoreUncontended really is going to be faster, since it doesn't have to wait for the memory load in the loop.

StoreContended and AtomicStoreContended are similarly different.

I do not know whether this means anything about real programs, which are unlikely to loop storing values to memory without ever reading them.

@navytux
Contributor Author

navytux commented May 5, 2017

@ianlancetaylor thanks for the feedback. Yes, on amd64 atomic store and atomic swap are both implemented with XCHGQ (from sync/atomic/asm_amd64.s):

TEXT ·StoreUint64(SB),NOSPLIT,$0-16
        MOVQ    addr+0(FP), BP
        MOVQ    val+8(FP), AX
        XCHGQ   AX, 0(BP)
        RET
...
TEXT ·SwapUint64(SB),NOSPLIT,$0-24
        MOVQ    addr+0(FP), BP
        MOVQ    new+8(FP), AX
        XCHGQ   AX, 0(BP)
        MOVQ    AX, old+16(FP)
        RET

with similar code emitted by the compiler via the atomic intrinsics.

However the more realistic IncUncontended and AtomicIncUncontended are compiled to this:

TEXT ·BenchmarkIncUncontended.func1.1(SB), NOSPLIT, $0-16 // qatomic_test.go:73
        NO_LOCAL_POINTERS
...
pc48:
        MOVQ       16(CX), DX
        DECQ       DX
        MOVQ       DX, 16(CX)
        MOVL       $1, DX
pc64:
        TESTB      DL, DL
        JEQ        pc88
        MOVQ       (AX), DX         // qatomic_test.go:75
        INCQ       DX               // <-- load; inc; store here
        MOVQ       DX, (AX)
pc77:
        MOVQ       16(CX), DX       // qatomic_test.go:74
        TESTQ      DX, DX
        JEQ        pc12
        JMP        pc48
...
TEXT ·BenchmarkAtomicIncUncontended.func1(SB), $24-8 // qatomic_test.go:94
        NO_LOCAL_POINTERS
...
pc97:
        MOVQ       16(CX), DX
        DECQ       DX
        MOVQ       DX, 16(CX)
        MOVL       $1, DX
pc113:
        TESTB      DL, DL
        JEQ        pc138
        MOVL       $1, DX
        LOCK                                // qatomic_test.go:97
        XADDQ      DX, (AX)                 // <-- atomic load; inc; store here
pc127:
        MOVQ       16(CX), DX               // qatomic_test.go:96
        TESTQ      DX, DX
        JEQ        pc61
        JMP        pc97
...

which makes the comparison fair, and still it is more than an order of magnitude of difference.

@navytux
Contributor Author

navytux commented May 5, 2017

Also, if we are talking about casgstatus, which was introduced in 0a7c7ac ("runtime: changes to g->atomicstatus (nee status) to support concurrent GC"), the situation there is more store-only-like - here is an excerpt from the diff:

---------------------------- src/pkg/runtime/proc.c ----------------------------
@@ -1649,9 +1783,9 @@ runtime·entersyscallblock_m(void)
 	gp->syscallsp = gp->sched.sp;
 	gp->syscallpc = gp->sched.pc;
 	gp->syscallstack = gp->stackbase;
 	gp->syscallguard = gp->stackguard;
-	gp->status = Gsyscall;
+	runtime·casgstatus(gp, Grunning, Gsyscall);
 	if(gp->syscallsp < gp->syscallguard-StackGuard || gp->syscallstack < gp->syscallsp) {
 		// runtime·printf("entersyscall inconsistent %p [%p,%p]\n",
 		//	gp->syscallsp, gp->syscallguard-StackGuard, gp->syscallstack);
 		runtime·throw("entersyscall_m");
@@ -1673,9 +1807,11 @@ runtime·exitsyscall(void)
 	g->waitsince = 0;
 	if(exitsyscallfast()) {
 		// There's a cpu for us, so we can run.
 		g->m->p->syscalltick++;
-		g->status = Grunning;
+		// We need to cas the status and scan before resuming...
+		runtime·casgstatus(g, Gsyscall, Grunning);
+
 		// Garbage collector isn't running (since we are),
 		// so okay to clear gcstack and gcsp.
 		g->syscallstack = (uintptr)nil;
 		g->syscallsp = (uintptr)nil;
---------------------------- src/pkg/runtime/mgc0.c ----------------------------
@@ -539,9 +540,10 @@ markroot(ParFor *desc, uint32 i)
 			runtime·throw("markroot: bad index");
 		gp = runtime·allg[i - RootCount];
 		// remember when we've first observed the G blocked
 		// needed only to output in traceback
-		if((gp->status == Gwaiting || gp->status == Gsyscall) && gp->waitsince == 0)
+		status = runtime·readgstatus(gp);
+		if((status == Gwaiting || status == Gsyscall) && gp->waitsince == 0)
 			gp->waitsince = work.tstart;
 		// Shrink a stack if not much of it is being used.
 		runtime·shrinkstack(gp);
 		scanstack(gp);
@@ -736,15 +738,16 @@ scanstack(G *gp)
 	int32 n;
 	Stktop *stk;
 	uintptr sp, guard;
 
-	switch(gp->status){
+	switch(runtime·readgstatus(gp)) {
 	default:
@@ -1348,9 +1351,9 @@ runtime·gc(int32 force)
 		if(i > 0)
 			a.start_time = runtime·nanotime();
 		// switch to g0, call gc(&a), then switch back
 		g->param = &a;
-		g->status = Gwaiting;
+		runtime·casgstatus(g, Grunning, Gwaiting);
 		g->waitreason = runtime·gostringnocopy((byte*)"garbage collection");
 		runtime·mcall(mgc);
 	}

...

So to me it looks like in the hot path (e.g. when many syscalls are made) it was a plain store and became an atomic CAS because the slow path (GC) seldom needs it.

/cc @aclements, @RLH, @rsc, @dvyukov, @josharian once again, because atomic stores were shown just above to be ~60x slower than plain stores when the destination is in L1 and not contended.

See also #19563 (comment) and #19563 (comment) for -1.5% - -2% speedup for whole BenchmarkHTTPClientServer when casgstatus is effectively disabled.

@dvyukov
Member

dvyukov commented May 5, 2017

Most of the CASes are unnecessary and should be removed. The spin waits are not necessary either.
https://codereview.appspot.com/132960044#msg17

@navytux
Contributor Author

navytux commented May 21, 2017

@dvyukov thanks for the feedback. It is a bit of a pity there is no reply from the GC team.

@RLH
Contributor

RLH commented May 22, 2017

GC team here. It isn't that we aren't monitoring this issue, it's just that we don't have any new numbers that add to the conversation.

The GC is core, so changes undergo a risk-to-benefit analysis. Here is where this analysis now stands.

On the benefit side there are micro benchmarks but there is not a lot of evidence that replacing some atomics with normal stores shows an improvement to a wide range of real world applications or a smaller set of important applications running on today's HW.

On the risk side the GC is clever and uses CASes instead of mutexes to implement a lock free state machine algorithm. This discussion suggests that the GC should be even more clever and implement a non-blocking algorithm that changes the state variable sometimes with normal stores and sometimes with atomic instructions. The bug tail on these kinds of changes risks being long and difficult.

@navytux
Contributor Author

navytux commented May 22, 2017

GC team here. It isn't that we aren't monitoring this issue, it's just that we don't have any new numbers that add to the conversation.

kirr wrote about BenchmarkHTTPClientServer with nocasgstatus patch: (#19563 (comment))

@aclements could you please confirm -benchtime 50s makes a difference on your side, like it did for -1.5% - 2% for me?

ok, I assume it did not make a difference (though an explicit confirmation would be better)

On the benefit side there are micro benchmarks but there is not a lot of evidence that replacing some atomics with normal stores shows an improvement to a wide range of real world applications or a smaller set of important applications running on today's HW.

On the risk side the GC is clever and uses CASes instead of mutexes to implement a lock free state machine algorithm. This discussion suggests that the GC should be even more clever and implement a non-blocking algorithm that changes the state variable sometimes with normal stores and sometimes with atomic instructions. The bug tail on these kinds of changes risks being long and difficult

ok GC team, thanks for feedback. Your position is clear.

Kirill

@RLH
Contributor

RLH commented May 29, 2017

Closing: the suggested change's risks (and the effort needed to ensure correctness) outweigh the potential benefits.

@RLH RLH closed this as completed May 29, 2017
@golang golang locked and limited conversation to collaborators May 29, 2018