Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os: Missed checking for EINTR in (*os.File).readdir #40846

Closed
jameshartig opened this issue Aug 17, 2020 · 74 comments
Closed

os: Missed checking for EINTR in (*os.File).readdir #40846

jameshartig opened this issue Aug 17, 2020 · 74 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@jameshartig
Copy link
Contributor

What version of Go are you using (go version)?

go version go1.15 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/toaster/.cache/go-build"
GOENV="/home/toaster/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/opt/gopath/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/opt/gopath"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build047277846=/tmp/go-build -gno-record-gcc-switches"

What did you do?

See: https://go-review.googlesource.com/c/go/+/232862/

If you call ioutil.ReadDir it ends up calling (*os.File).readdir which calls lstat and does not retry on an EINTR error. Other spots were handled in https://go-review.googlesource.com/c/go/+/232862/ and I think this spot was just missed.

I wasn't sure of the best way since lstat is automatically generated. Should this just be handled in readdir?

What did you expect to see?

I expected ioutil.ReadDir not to error.

What did you see instead?

Instead ioutil.ReadDir errored with lstat /opt/admiral-dev/hazadblock/dist/static/lib/ergoHeadOverwrite.dev.min.js.map: interrupted system call.

@bcmills
Copy link
Contributor

bcmills commented Aug 17, 2020

CC @ianlancetaylor

@bcmills bcmills added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Aug 17, 2020
@bcmills bcmills added this to the Go1.16 milestone Aug 17, 2020
@ianlancetaylor
Copy link
Contributor

As far as I know neither the stat nor the lstat system call is interruptible by a signal. But the error you show suggests that lstat can return EINTR. That suggests that stat can also return EINTR. Either way it's not clear to me that this has anything to do with (*os.File).Readdir as such. Perhaps we need to fix Stat and Lstat to loop on EINTR.

Can you tell us more about the exact case where this fails? Can you show us a way to reproduce the problem? Thanks.

@jameshartig
Copy link
Contributor Author

jameshartig commented Aug 18, 2020

@ianlancetaylor thanks for the quick reply!

Can you tell us more about the exact case where this fails?

So the underlying mount is a FUSE filesystem and the stat call actually could make a network call if the information isn't cached locally. From my reading of the fuse documentation [1]:

Interrupting filesystem operations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If a process issuing a FUSE filesystem request is interrupted, the
following will happen:

  1) If the request is not yet sent to userspace AND the signal is
     fatal (SIGKILL or unhandled fatal signal), then the request is
     dequeued and returns immediately.

  2) If the request is not yet sent to userspace AND the signal is not
     fatal, then an 'interrupted' flag is set for the request.  When
     the request has been successfully transferred to userspace and
     this flag is set, an INTERRUPT request is queued.

  3) If the request is already sent to userspace, then an INTERRUPT
     request is queued.

INTERRUPT requests take precedence over other requests, so the
userspace filesystem will receive queued INTERRUPTs before any others.

The userspace filesystem may ignore the INTERRUPT requests entirely,
or may honor them by sending a reply to the _original_ request, with
the error set to EINTR.

It is also possible that there's a race between processing the
original request and its INTERRUPT request.  There are two possibilities:

  1) The INTERRUPT request is processed before the original request is
     processed

  2) The INTERRUPT request is processed after the original request has
     been answered

If the filesystem cannot find the original request, it should wait for
some timeout and/or a number of new requests to arrive, after which it
should reply to the INTERRUPT request with an EAGAIN error.  In case
1) the INTERRUPT request will be requeued.  In case 2) the INTERRUPT
reply will be ignored.

it seems like returning EINTR is following the spec but I'm also not sure if I can somehow modify the FUSE library to ignore the Interrupt. Can you point me in a direction for how to debug the go-side of this? I'm not sure how I can figure out exactly what signal go received. I was assuming it was SIGUSR but it's not clear to me what the best way is to discover what signal was received or how to correlate that with the separate FUSE program. Here are the debug logs from the FUSE-side:

<- Getattr [ID=0x428 Node=0x25 Uid=1002 Gid=1003 Pid=22546] 0x0 fl=0
<- Interrupt [ID=0x42a Node=0x0 Uid=0 Gid=0 Pid=0] ID 0x428
-> [ID=0x42a] Interrupt
-> [ID=0x428] Getattr error=EINTR

There's a lot of other logs around this so I'm under the current assumption that the Interrupt happened almost immediately after the Getattr was issued, which is weird. If it helps at all, the program doing the reads makes a new goroutine for each file and reads them in so there's 10-15 goroutines reading at once. I don't know yet how to understand what signal caused the Interrupt since I don't think that's exposed to the FUSE program.

Can you show us a way to reproduce the problem?

Once we can determine what the signal is that caused this, then I can hopefully come up with some sort of simple reproduction. We can just make a simple FUSE program that just sleeps and then trigger the signal and it should send an Interrupt.

[1] https://www.kernel.org/doc/Documentation/filesystems/fuse.txt

@ianlancetaylor
Copy link
Contributor

Thanks for the info.

The signal was most likely SIGURG, which is used to preempt a goroutine.

You should be able to see it by running the program under strace -f.

@networkimprov
Copy link

networkimprov commented Aug 18, 2020

@fastest963 what FUSE library are you referring to? How is it configured on your system? That's probably all we need to know to reproduce it.

Go 1.14+ generates a lot of interrupts, for goroutine scheduling. Causing an interrupt is easy in a Go app.

@jameshartig
Copy link
Contributor Author

@networkimprov It's using https://github.com/bazil/fuse to mount the FUSE mount. Nothing very special on the configuration front other than fuse.AsyncRead and fuse.DefaultPermissions. Here's a program to replicate the problem: https://gist.github.com/fastest963/756f67d406540ad8ae14d22e651043d0

It should print out:

fuse: <- Getattr [ID=0x2 Node=0x1 Uid=1001 Gid=1002 Pid=12860] 0x0 fl=0
fuse: <- Interrupt [ID=0x3 Node=0x0 Uid=0 Gid=0 Pid=0] ID 0x2
fuse: -> [ID=0x3] Interrupt
interrrupted 0x2
fuse: -> [ID=0x2] Getattr error=EINTR
error: stat ./test: interrupted system call

@ianlancetaylor

You should be able to see it by running the program under strace -f.

Here's the relevant strace output:

openat(AT_FDCWD, "/opt/admiral-dev/hazadblock/dist/static/lib/reporting/js.dev.min.js", O_RDONLY|O_CLOEXEC) = 35
futex(0x1c94b98, FUTEX_WAKE_PRIVATE, 1) = 1
epoll_ctl(3, EPOLL_CTL_ADD, 35, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2416816088, u64=140271753736152}}) = 0
fcntl(35, F_GETFL)      = 0x8000 (flags O_RDONLY|O_LARGEFILE)
fcntl(35, F_SETFL, O_RDONLY|O_NONBLOCK|O_LARGEFILE) = 0
fstat(35, 0xc0001cb218) = -1 EINTR (Interrupted system call)
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=10907, si_uid=1002} ---
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
read(35, "!(function(e){var r={};function "..., 512) = 512
futex(0x1c94b98, FUTEX_WAKE_PRIVATE, 1) = 1
read(35, "ule)return t;var e=Object.create"..., 1024) = 1024
read(35, "sName=i.join(\" \")),e}},179:funct"..., 2048) = 2048
read(35, "typeof Symbol&&n.constructor===S"..., 4096) = 2761
futex(0x1c94b98, FUTEX_WAKE_PRIVATE, 1) = 1
read(35, "", 1335)      = 0
epoll_ctl(3, EPOLL_CTL_DEL, 35, 0xc02032ad2c) = 0
close(35)               = 0

and the associated FUSE logs:

<- Open [ID=0xe79 Node=0xe Uid=1002 Gid=1003 Pid=10935] dir=false fl=OpenReadOnly
-> [ID=0xe79] Open 0x2 fl=0
<- Getattr [ID=0xe7b Node=0xe Uid=1002 Gid=1003 Pid=10935] 0x0 fl=0
<- Interrupt [ID=0xe7d Node=0x0 Uid=0 Gid=0 Pid=0] ID 0xe7b
-> [ID=0xe7d] Interrupt
-> [ID=0xe7b] Getattr error=EINTR

Looks like this happened inside of iotuil.ReadFile because the stat error was ultimately ignored.

@ianlancetaylor
Copy link
Contributor

Thanks. Can you double check that you passed the -f option when you ran strace? The output without -f is unhelpful for Go programs.

@jameshartig
Copy link
Contributor Author

jameshartig commented Aug 18, 2020

@ianlancetaylor I actually used -ff and copied only the relevant section out above from the PID that got the signal because there were 30+ threads and a ton of irrelevant traffic but I realized now I should instead just run strace with my example program. For some reason the FUSE mounting failed with Operation not permitted whenever I ran my test binary with strace -f test so instead I made it sleep and then I attached strace to the pid:
strace.txt

The relevant section is:

[pid 15228] newfstatat(AT_FDCWD, "./test",  <unfinished ...>
[pid 15229] rt_sigprocmask(SIG_SETMASK, ~[],  <unfinished ...>
[pid 15230] <... nanosleep resumed>NULL) = 0
[pid 15229] <... rt_sigprocmask resumed>[], 8) = 0
[pid 15229] clone(strace: Process 15236 attached
 <unfinished ...>
[pid 15230] read(6, "8\0\0\0\3\0\0\0\2\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\351\3\0\0\352\3\0\0"..., 135168) = 56
[pid 15229] <... clone resumed>child_stack=0xc000136000, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM) = 15236
[pid 15236] gettid( <unfinished ...>
[pid 15230] read(6,  <unfinished ...>
[pid 15236] <... gettid resumed>)       = 15236
[pid 15229] rt_sigprocmask(SIG_SETMASK, [],  <unfinished ...>
[pid 15236] arch_prctl(ARCH_SET_FS, 0xc000124090 <unfinished ...>
[pid 15229] <... rt_sigprocmask resumed>NULL, 8) = 0
[pid 15236] <... arch_prctl resumed>)   = 0
[pid 15229] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 15236] sigaltstack(NULL, {ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=0}) = 0
[pid 15236] sigaltstack({ss_sp=0xc000128000, ss_flags=0, ss_size=32768},  <unfinished ...>
[pid 15231] <... epoll_pwait resumed>[], 128, 1, NULL, 0) = 0
[pid 15229] <... nanosleep resumed>NULL) = 0
[pid 15236] <... sigaltstack resumed>NULL) = 0
[pid 15231] futex(0xc000030948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 15236] rt_sigprocmask(SIG_SETMASK, [],  <unfinished ...>
[pid 15229] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 15236] <... rt_sigprocmask resumed>NULL, 8) = 0
[pid 15236] gettid()                    = 15236
[pid 15229] <... nanosleep resumed>NULL) = 0
[pid 15229] futex(0xc000030948, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 15236] rt_sigprocmask(SIG_SETMASK, ~[],  <unfinished ...>
[pid 15231] <... futex resumed>)        = 0
[pid 15229] <... futex resumed>)        = 1
[pid 15236] <... rt_sigprocmask resumed>[], 8) = 0
[pid 15231] write(1, "fuse: <- Getattr [ID=0x2 Node=0x"..., 72 <unfinished ...>
[pid 15229] getpid( <unfinished ...>
[pid 15236] clone( <unfinished ...>
[pid 15231] <... write resumed>)        = 72
[pid 15231] epoll_pwait(7,  <unfinished ...>
[pid 15229] <... getpid resumed>)       = 15228
[pid 15231] <... epoll_pwait resumed>[], 128, 0, NULL, 362006) = 0
[pid 15231] epoll_pwait(7, strace: Process 15237 attached
 <unfinished ...>
[pid 15237] gettid()                    = 15237
[pid 15237] arch_prctl(ARCH_SET_FS, 0xc000124490) = 0
[pid 15236] <... clone resumed>child_stack=0xc000138000, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM) = 15237
[pid 15229] tgkill(15228, 15236, SIGURG <unfinished ...>
[pid 15237] sigaltstack(NULL,  <unfinished ...>
[pid 15236] rt_sigprocmask(SIG_SETMASK, [],  <unfinished ...>
[pid 15237] <... sigaltstack resumed>{ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=0}) = 0
[pid 15229] <... tgkill resumed>)       = 0
[pid 15236] <... rt_sigprocmask resumed>NULL, 8) = 0
[pid 15237] sigaltstack({ss_sp=0xc00013c000, ss_flags=0, ss_size=32768},  <unfinished ...>
[pid 15236] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=15228, si_uid=1001} ---
[pid 15229] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 15237] <... sigaltstack resumed>NULL) = 0
[pid 15237] rt_sigprocmask(SIG_SETMASK, [],  <unfinished ...>
[pid 15236] rt_sigreturn({mask=[]} <unfinished ...>
[pid 15237] <... rt_sigprocmask resumed>NULL, 8) = 0
[pid 15236] <... rt_sigreturn resumed>) = 0
[pid 15237] gettid( <unfinished ...>
[pid 15236] rt_sigprocmask(SIG_SETMASK, ~[HUP INT QUIT ILL TRAP ABRT BUS FPE SEGV TERM STKFLT CHLD PROF SYS RTMIN RT_1],  <unfinished ...>
[pid 15229] <... nanosleep resumed>NULL) = 0
[pid 15237] <... gettid resumed>)       = 15237
[pid 15229] getpid( <unfinished ...>
[pid 15237] futex(0x672ed8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 15236] <... rt_sigprocmask resumed>NULL, 8) = 0
[pid 15229] <... getpid resumed>)       = 15228
[pid 15229] tgkill(15228, 15236, SIGURG <unfinished ...>
[pid 15236] futex(0x672ed8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 15229] <... tgkill resumed>)       = 0
[pid 15229] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 15236] <... futex resumed>)        = 1
[pid 15237] <... futex resumed>)        = 0
[pid 15237] sched_yield( <unfinished ...>
[pid 15236] futex(0xc000124148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 15229] <... nanosleep resumed>NULL) = 0
[pid 15237] <... sched_yield resumed>)  = 0
[pid 15229] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 15237] futex(0x672ec0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 15229] <... nanosleep resumed>NULL) = 0
[pid 15237] rt_sigprocmask(SIG_SETMASK, ~[],  <unfinished ...>
[pid 15229] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 15237] <... rt_sigprocmask resumed>[], 8) = 0
[pid 15237] clone(child_stack=0xc000152000, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM) = 15238
[pid 15229] <... nanosleep resumed>NULL) = 0
strace: Process 15238 attached
[pid 15237] rt_sigprocmask(SIG_SETMASK, [],  <unfinished ...>
[pid 15229] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 15238] gettid( <unfinished ...>
[pid 15237] <... rt_sigprocmask resumed>NULL, 8) = 0
[pid 15238] <... gettid resumed>)       = 15238
[pid 15237] futex(0x672ed8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 15229] <... nanosleep resumed>NULL) = 0
[pid 15238] arch_prctl(ARCH_SET_FS, 0xc000124890 <unfinished ...>
[pid 15229] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 15238] <... arch_prctl resumed>)   = 0
[pid 15238] sigaltstack(NULL, {ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=0}) = 0
[pid 15238] sigaltstack({ss_sp=0xc000144000, ss_flags=0, ss_size=32768},  <unfinished ...>
[pid 15229] <... nanosleep resumed>NULL) = 0
[pid 15238] <... sigaltstack resumed>NULL) = 0
[pid 15229] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 15238] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid 15238] gettid( <unfinished ...>
[pid 15229] <... nanosleep resumed>NULL) = 0
[pid 15238] <... gettid resumed>)       = 15238
[pid 15229] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 15238] futex(0xc000124148, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 15229] <... nanosleep resumed>NULL) = 0
[pid 15238] <... futex resumed>)        = 1
[pid 15236] <... futex resumed>)        = 0
[pid 15229] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 15238] futex(0xc000124948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 15236] futex(0x672ed8, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 15237] <... futex resumed>)        = 0
[pid 15236] rt_sigprocmask(SIG_SETMASK, ~[HUP INT QUIT ILL TRAP ABRT BUS FPE SEGV USR2 TERM STKFLT CHLD PROF SYS RTMIN RT_1],  <unfinished ...>
[pid 15229] <... nanosleep resumed>NULL) = 0
[pid 15237] rt_sigprocmask(SIG_SETMASK, ~[],  <unfinished ...>
[pid 15236] <... rt_sigprocmask resumed>NULL, 8) = 0
[pid 15237] <... rt_sigprocmask resumed>[], 8) = 0
[pid 15229] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 15236] futex(0xc000124948, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 15237] clone( <unfinished ...>
[pid 15236] <... futex resumed>)        = 1
[pid 15238] <... futex resumed>)        = 0
[pid 15236] futex(0xc000124148, FUTEX_WAIT_PRIVATE, 0, NULLstrace: Process 15239 attached
 <unfinished ...>
[pid 15237] <... clone resumed>child_stack=0xc000154000, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM) = 15239
[pid 15229] <... nanosleep resumed>NULL) = 0
[pid 15238] write(9, "\0", 1 <unfinished ...>
[pid 15237] rt_sigprocmask(SIG_SETMASK, [],  <unfinished ...>
[pid 15229] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 15237] <... rt_sigprocmask resumed>NULL, 8) = 0
[pid 15238] <... write resumed>)        = 1
[pid 15237] futex(0x672ed8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 15238] rt_sigprocmask(SIG_SETMASK, ~[],  <unfinished ...>
[pid 15239] gettid( <unfinished ...>
[pid 15231] <... epoll_pwait resumed>[{EPOLLIN, {u32=6761816, u64=6761816}}], 128, 999, NULL, 0) = 1
[pid 15239] <... gettid resumed>)       = 15239
[pid 15229] <... nanosleep resumed>NULL) = 0
[pid 15231] read(8,  <unfinished ...>
[pid 15239] arch_prctl(ARCH_SET_FS, 0xc000124c90 <unfinished ...>
[pid 15229] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 15231] <... read resumed>"\0", 16) = 1
[pid 15239] <... arch_prctl resumed>)   = 0
[pid 15238] <... rt_sigprocmask resumed>[], 8) = 0
[pid 15231] futex(0xc000030948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 15239] sigaltstack(NULL,  <unfinished ...>
[pid 15238] clone( <unfinished ...>
[pid 15239] <... sigaltstack resumed>{ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=0}) = 0
[pid 15229] <... nanosleep resumed>NULL) = 0
[pid 15239] sigaltstack({ss_sp=0xc000154000, ss_flags=0, ss_size=32768},  <unfinished ...>
[pid 15229] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 15239] <... sigaltstack resumed>NULL) = 0
[pid 15238] <... clone resumed>child_stack=0xc00014e000, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM) = 15240
strace: Process 15240 attached
[pid 15239] rt_sigprocmask(SIG_SETMASK, [],  <unfinished ...>
[pid 15238] rt_sigprocmask(SIG_SETMASK, [],  <unfinished ...>
[pid 15240] gettid( <unfinished ...>
[pid 15239] <... rt_sigprocmask resumed>NULL, 8) = 0
[pid 15229] <... nanosleep resumed>NULL) = 0
[pid 15240] <... gettid resumed>)       = 15240
[pid 15238] <... rt_sigprocmask resumed>NULL, 8) = 0
[pid 15240] arch_prctl(ARCH_SET_FS, 0xc000125090 <unfinished ...>
[pid 15239] gettid( <unfinished ...>
[pid 15229] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 15238] futex(0x672fe0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 15240] <... arch_prctl resumed>)   = 0
[pid 15239] <... gettid resumed>)       = 15239
[pid 15240] sigaltstack(NULL, {ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=0}) = 0
[pid 15239] epoll_pwait(7,  <unfinished ...>
[pid 15229] <... nanosleep resumed>NULL) = 0
[pid 15229] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 15240] sigaltstack({ss_sp=0xc00015c000, ss_flags=0, ss_size=32768},  <unfinished ...>
[pid 15239] <... epoll_pwait resumed>[], 128, 0, NULL, 362006) = 0
[pid 15240] <... sigaltstack resumed>NULL) = 0
[pid 15239] futex(0xc000124148, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 15229] <... nanosleep resumed>NULL) = 0
[pid 15240] rt_sigprocmask(SIG_SETMASK, [],  <unfinished ...>
[pid 15239] <... futex resumed>)        = 1
[pid 15236] <... futex resumed>)        = 0
[pid 15229] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 15240] <... rt_sigprocmask resumed>NULL, 8) = 0
[pid 15239] futex(0xc000124d48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 15236] futex(0xc000124d48, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 15240] gettid( <unfinished ...>
[pid 15239] <... futex resumed>)        = -1 EAGAIN (Resource temporarily unavailable)
[pid 15236] <... futex resumed>)        = 0
[pid 15240] <... gettid resumed>)       = 15240
[pid 15239] epoll_pwait(7,  <unfinished ...>
[pid 15236] futex(0xc000124148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 15240] kill(0, SIGUSR2 <unfinished ...>
[pid 15229] <... nanosleep resumed>NULL) = 0
[pid 15239] <... epoll_pwait resumed>[], 128, 0, NULL, 362006) = 0
[pid 15240] <... kill resumed>)         = 0
[pid 15239] --- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_USER, si_pid=15228, si_uid=1001} ---
[pid 15229] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 15239] futex(0x672fe0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 15240] epoll_pwait(7,  <unfinished ...>
[pid 15230] <... read resumed>"0\0\0\0$\0\0\0\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 135168) = 48
[pid 15240] <... epoll_pwait resumed>[], 128, 0, NULL, 362006) = 0
[pid 15239] <... futex resumed>)        = 1
[pid 15238] <... futex resumed>)        = 0
[pid 15230] futex(0xc000030548, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 15239] rt_sigreturn({mask=[]} <unfinished ...>
[pid 15238] futex(0xc000124948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 15229] <... nanosleep resumed>NULL) = 0
[pid 15240] read(6,  <unfinished ...>
[pid 15239] <... rt_sigreturn resumed>) = 0
[pid 15229] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 15239] futex(0xc000124948, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 15238] <... futex resumed>)        = 0
[pid 15239] futex(0x672fe0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 15238] kill(0, SIGUSR2 <unfinished ...>
[pid 15229] <... nanosleep resumed>NULL) = 0
[pid 15238] <... kill resumed>)         = 0
[pid 15229] --- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_USER, si_pid=15228, si_uid=1001} ---
[pid 15238] epoll_pwait(7, [], 128, 0, NULL, 362006) = 0
[pid 15230] <... futex resumed>)        = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
[pid 15229] futex(0x672fe0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 15230] futex(0xc000030548, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 15229] <... futex resumed>)        = 1
[pid 15239] <... futex resumed>)        = 0
[pid 15229] rt_sigreturn({mask=[]})     = 0
[pid 15239] futex(0xc000124d48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 15229] futex(0xc000124d48, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 15239] <... futex resumed>)        = -1 EAGAIN (Resource temporarily unavailable)
[pid 15238] write(1, "fuse: <- Interrupt [ID=0x3 Node="..., 62 <unfinished ...>
[pid 15229] <... futex resumed>)        = 0
[pid 15229] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 15239] futex(0xc000030548, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 15239] futex(0x672fe0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 15238] <... write resumed>)        = 62
[pid 15229] <... nanosleep resumed>NULL) = 0
[pid 15229] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 15230] <... futex resumed>)        = 0
[pid 15230] kill(0, SIGUSR2)            = 0
[pid 15229] <... nanosleep resumed>NULL) = 0
[pid 15230] epoll_pwait(7,  <unfinished ...>
[pid 15229] --- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_USER, si_pid=15228, si_uid=1001} ---
[pid 15238] write(1, "fuse: -> [ID=0x3] Interrupt\n", 28 <unfinished ...>
[pid 15229] futex(0x672fe0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 15230] <... epoll_pwait resumed>[], 128, 0, NULL, 362006) = 0
[pid 15239] <... futex resumed>)        = 0
[pid 15229] <... futex resumed>)        = 1
[pid 15239] futex(0xc000124d48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 15238] <... write resumed>)        = 28
[pid 15230] epoll_pwait(7,  <unfinished ...>
[pid 15229] rt_sigreturn({mask=[]} <unfinished ...>
[pid 15238] write(1, "interrrupted 0x2\n", 17 <unfinished ...>
[pid 15230] <... epoll_pwait resumed>[], 128, 0, NULL, 362006) = 0
[pid 15229] <... rt_sigreturn resumed>) = 0
[pid 15230] futex(0xc000124d48, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 15229] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 15239] <... futex resumed>)        = 0
[pid 15238] <... write resumed>)        = 17
[pid 15230] <... futex resumed>)        = 1
[pid 15239] kill(0, SIGUSR2 <unfinished ...>
[pid 15230] futex(0x672fe0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 15239] <... kill resumed>)         = 0
[pid 15238] write(1, "fuse: -> [ID=0x2] Getattr error="..., 38 <unfinished ...>
[pid 15231] <... futex resumed>)        = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
[pid 15239] --- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_USER, si_pid=15228, si_uid=1001} ---
[pid 15231] futex(0xc000030948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 15238] <... write resumed>)        = 38
[pid 15230] <... futex resumed>)        = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
[pid 15229] <... nanosleep resumed>NULL) = 0
[pid 15239] futex(0x672fe0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 15238] write(6, "\20\0\0\0\374\377\377\377\2\0\0\0\0\0\0\0", 16 <unfinished ...>
[pid 15230] futex(0x672fe0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 15229] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 15239] <... futex resumed>)        = 0
[pid 15238] <... write resumed>)        = 16
[pid 15230] <... futex resumed>)        = -1 EAGAIN (Resource temporarily unavailable)
[pid 15228] <... newfstatat resumed>0xc000122038, 0) = -1 EINTR (Interrupted system call)
[pid 15239] rt_sigreturn({mask=[]} <unfinished ...>
[pid 15238] futex(0xc000030548, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 15230] futex(0xc000030548, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 15239] <... rt_sigreturn resumed>) = 0
[pid 15238] <... futex resumed>)        = 0
[pid 15230] <... futex resumed>)        = -1 EAGAIN (Resource temporarily unavailable)
[pid 15229] <... nanosleep resumed>NULL) = 0
[pid 15228] futex(0x643e28, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 15230] write(1, "error: stat ./test: interrupted "..., 44 <unfinished ...>

@networkimprov
Copy link

cc @tv42

@ianlancetaylor
Copy link
Contributor

@fastest963 Thanks. The strace output shows that the newfstatat syscall is being interrupted, and is returning EINTR. My takeaway is that when using a FUSE file system essentially any system call that touches the file system can return EINTR. Unfortunately I don't see a reasonable way to add a test case for a FUSE file system to the standard library, though I would be happy to hear suggestions (these suggestions can not involve importing a big package like https://github.com/bazil/fuse into the standard library just for testing purposts).

I guess I'll just look for syscall functions in the os package and add EINTR loops. Then we'll see what happens.

@gopherbot
Copy link

Change https://golang.org/cl/249178 mentions this issue: os, internal/poll: loop on EINTR for all file syscalls

@tv42
Copy link

tv42 commented Aug 19, 2020

A non-FUSE alternative for testing would be seccomp or ptrace triggering that EINTR. Or a special slower build where RawSyscall etc check a flag and never even enter kernelspace. Of course both of those need to be somehow targeted to the thing under test, matching by fd or pathname or kernel thread id.

And yes once networking etc are added to the mix, just about anything can return EINTR, because they mostly have no other choice -- allowing user to control-C out of slow networking is a nice thing to have, so not bailing out isn't really an option. Only a precious few syscalls have something else meaningful to report, so EINTR it is.

If Go's scheduler using SIGURG triggers this logic, that could lead to a nasty scenario where the work is started from scratch on every round, and always interrupted by SIGURG. I sure hope the SIGURG doesn't happen too often.

@networkimprov
Copy link

networkimprov commented Aug 19, 2020

In that case, maybe loop-on-EINTR logic should be disabled via an os.OpenFile() flag. That would apply to previously added EINTR loops, too.

We need loops to keep existing code working with async-preemption, and also need to accommodate network filesystems.

EDIT: this also arose with CIFS: #39237, #38836

@bcmills
Copy link
Contributor

bcmills commented Aug 19, 2020

os.OpenFile has never specified its behavior w.r.t. EINTR.

If you need to open a file with the possibility of interruption or cancellation, an explicit context.Context seems like the way to do it. (A third-party package could implement that functionality using the syscall package; adding that API to os seems like a separate proposal, but one that should wait for the abstract-filesystem proposal #5636 to settle.)

@networkimprov
Copy link

I've added a comment on the FS API thread re network filesystems and EINTR loops:
https://www.reddit.com/r/golang/comments/hv976o/qa_iofs_draft_design/g24rusp/

@tv42
Copy link

tv42 commented Aug 19, 2020

I don't think there should be a special flag to os.OpenFile; the default and only behavior should be "do the right thing" (whatever that ends up meaning).

I posit the two requirements for developer and user comfort here are:

  1. Control-C causing SIGINT must allow programs in e.g. io.Copy to do an orderly shutdown (with whatever logic the app wants; not a runtime panic, not just the default signal handler, though of course SIGINT default handling applies until overriden), assuming the read(2)/write(2) system call they were hanging in returns with EINTR. (That is, this feature may need per-filesystem kernel support, which can be assumed to be there.)

  2. Programs running in normal circumstances should not see spurious read/write/etc failures.

The way to differentiate between cases 1 and 2 is the SIGINT, not the EINTR.

@networkimprov
Copy link

Assuming a reliable method to determine the source signal for every occurrence of EINTR, on every OS that returns it :-)

Programs may expect EINTR for more than SIGINT, so you'd retry on EINTR when the cause was SIGURG.

@ianlancetaylor
Copy link
Contributor

We don't need to know what signal caused the EINTR, and we don't need special treatment on a per-file basis. Go is always multi-threaded. A SIGINT will either cause the program to exit (the default behavior) or the signal will be reported on a channel (the behavior if the program uses os/signal.Notify). This is entirely independent of the code that makes the system call that returns EINTR.

We always send SIGURG to a specific thread to preempt that thread. But we only need to preempt threads that are actually running Go code. We never need to preempt threads that are waiting in a system call (because the scheduler already ignores them). So there should be no risk of sending SIGURG so frequently that a system call never completes. Also, we already have code to prevent signaling a thread too frequently; if it is necessary for some reason, we can improve that code to ensure that the thread makes progress.

@tv42
Copy link

tv42 commented Aug 19, 2020

I'm not sure you can just retry on all EINTR and call it a day. That means one can't write a SIGINT handler that does an orderly shutdown, because it can't claw back control of those threads that are in syscalls.

If you leave it at this, all such shutdown is done by essentially abandoning the goroutines that are hanging in syscalls. I don't think this is a good idea, and I don't think it's always doable safely! The hang might have been e.g. io.Copy(os.Stdout, aNetworkFS) and we might output "Control-C received, shutting down..." after which the io.Copy manages to do some work. I'm sure there are scenarios where that ugliness presents as corrupted state and not just confusing messages.

I believe that programs should be able to receive SIGINT, un-hang those syscalls that can be un-hung, and do an orderly shutdown. There are plenty of examples of this in the C world. The typical convention is that a second SIGINT while trying to do an orderly shutdown makes the program crash harder.

I posit that the only way to achieve "orderly shutdown" is by letting the syscalls complete (hopefully with EINTR asap; one may need to nudge them that way!), and then let the calling code deal with the consequences, with the top level waiting on something like errgroup until everything is shut down.

A single-threaded C program would set a global flag ("shutting down") in a signal handler, and then the EINTR loop would actually be if errno==EINTR && !shutting_down, letting the EINTR through after a SIGINT.

A Go program doesn't even get to have signal handlers that are guaranteed to run before the EINTR loop goes back to making the syscall hang. And that's before we even consider multithreading, where the hanging syscall(s) are in separate threads from where the signal is handled!

A multithreaded program will likely end up getting the SIGINT on thread different from the thread that's hanging on a syscall. Hopefully the scheduler will keep something available to run a goroutine to actually handle the signal, these are not considered "blocking syscalls" afaik!

This multi-threaded aspect is making me think there's basically only one path forward to clean shutdowns:

Some sort of universal support for cancellation interruption, where the signal.Notify channel receiver can tell the hanging syscalls to bail out and then main can wait for them to have done so. This could tie into io.Reader taking context #20280 but here we have much more than Read and Write at stake.

To make this happen, the hanging syscalls can be woken up with a signal (assuming kernel support again), and EINTR is a perfectly fine error in such a scenario. A multi-threaded program would have to "broadcast" the incoming signal (handled by just one thread) to all of its threads, to wake up all stuck syscalls.

Here's a strawman proposal: Add function to make all hanging syscalls wake up with EINTR (where kernel support exists), exactly once.

os/signal.InterruptAll() causes every goroutine that is in a syscall to receive an unspecified signal and disables the EINTR retry loop exactly once for each for them.

(It may also send signals to more threads than that, just to make implementation simpler. Goroutines not in syscalls would not observe anything special happening; pretty much exactly like the scheduler SIGURG code path.)

These signals will not be seen specifically by signal.Notify nor prevented by signal.Ignore. (Probably just by reusing SIGURG as the signal.)

Construct a mechanism for InterruptAll to disable EINTR retry loops exactly once, so we don't wreck the continuing existence of the program with spurious EINTRs. Something like: Have a per-goroutine noRetryOnEINTR flag, always clear it on entry to syscall, InterruptAll atomically sets that for target threads, then sends SIGURG to them.

@ianlancetaylor
Copy link
Contributor

Thanks for the comment. However, I don't think what you are suggesting is a good fit for the Go world. Go programs have a managed runtime. One consequence is that they can in effect crash at any time, for example if the runtime is unable to allocate memory or unable to start a new thread. There is no reliable way to enforce the orderly shutdown of a Go program. This is an intentional choice made by the language and runtime.

For cases where an orderly shutdown is essential, Go requires separating the program into separate processes. For example, the parent process can run the real program in a child process and then, if the child process exits in some unpredictable way, restore the state to some acceptable status.

@tv42
Copy link

tv42 commented Aug 20, 2020

I think crash safety and orderly shutdown are two different things. I'm a huge proponent of even crash-only software overall, but there are cases where an orderly shutdown just makes sense. I think we can keep "user asks to end program" quite separate from "out of memory". There's no way to enforce the orderly shutdown of anything computerized, but that doesn't mean all hope is lost for the common case.

Here's a recent example of graceful shutdown functionality added to stdlib: https://golang.org/pkg/net/http/#Server.Shutdown

@networkimprov
Copy link

@tv42 maybe you could paste your previous comments into a new proposal issue? We'd get more eyes on it that way.

@bcmills
Copy link
Contributor

bcmills commented Aug 20, 2020

@tv42, we have an existing mechanism for tearing down interrupted calls. That mechanism is context.Context, for which there is an already-accepted proposal to simplify signal plumbing (#37255).

If we had a way to plumb the Context into filesystem operations, then that could be used for both the SIGINT shutdown use-case and many others, and could be also be supported by in-process APIs that do not compile down to system calls. However, an InterruptAll call would not address these more general Context use-cases. Therefore, we should pursue the general Context plumbing rather than the specific SIGINT/EINTR mechanism.

@ianlancetaylor
Copy link
Contributor

Yes, the advantage of using context.Context is that it puts the decision of how to handle a termination request in the only place it can safely go, which is the code that is going to be terminated. We don't want to force all Go code that makes system calls to have to handle termination requests. And, to be clear, relying on interrupting a system call would indeed force that: otherwise, for example, we will have ten thousand worker goroutines all log a message about a system call failure.

@tv42
Copy link

tv42 commented Aug 20, 2020

@bcmills That would require plumbing Context to an lstat(2) (and many others) that hangs in kernelspace, including sending a signal to get the syscall to return. I haven't seen anyone talk about doing that.

@ianlancetaylor
Copy link
Contributor

@tv42 I don't see a reason to worry about whether the actual system call returns or not. That doesn't matter with regard to an orderly shutdown. It's fine for the program to exit with a bunch of threads stuck in system calls. The kernel will clean them up.

What matters for an orderly shutdown is whether the goroutines complete their logical shutdown. And that can be done, in principle, with a context.Context.

@networkimprov
Copy link

Sorry, I must have missed something... How does a Context terminate a goroutine waiting on a syscall that won't return?

@ianlancetaylor
Copy link
Contributor

@tv42 Fair point, please read all discussions about deadlines above as using contexts (https://golang.org/pkg/context). The issues seem the same.

@networkimprov
Copy link

Ian, another way to look at the problem is that EINTR is part of the API for CIFS & FUSE, because they're designed to be user-interruptible. This is not compatible with Go, so we'd like to fix that.

Syscall interruption by signalling a thread requires runtime support, so wouldn't the API reside in stdlib? See my suggested mechanism in #40846 (comment).

There's no need to replicate the file API with a Context or deadline argument. This is sufficient:

// terminates syscalls trying any path beginning with rootpath
// the original caller will see a PathError with .Err == io.ErrInterrupt
InterruptPendingUse(rootpath string) error

This is an os or runtime package function. I think you'd have to build Context cancellation on the above API.

Shall I post a proposal for this?

@ianlancetaylor
Copy link
Contributor

Ian, another way to look at the problem is that EINTR is part of the API for CIFS & FUSE, because they're designed to be user-interruptible. This is not compatible with Go, so we'd like to fix that.

That is entirely reasonable for C, but it does not make sense for Go, because Go is always multi-threaded. When you send a signal to a Go program, it is unpredictable whether it will interrupt a system call or not, because it is unpredictable which of the various threads will receive the signal. The only way to make that predictable is to direct the signal to a specific thread. But the only code that can do that reliably is code within the program itself. Perhaps there are cases where a Go program should interrupt a specific thread. But even if we believe that, then the error returned to the user's program for that thread should not be EINTR. The important fact is not that the system call was interrupted. The important fact is why it was interrupted. The error returned to the user should be something like os.ErrDeadlineExceeded or context.Canceled. It should never be EINTR.

@ianlancetaylor
Copy link
Contributor

Syscall interruption by signalling a thread requires runtime support, so wouldn't the API reside in stdlib?

I suspect that we could do everything via the golang.org/x/sys/unix package, but I could be missing something.

// terminates syscalls trying any path beginning with rootpath
// the original caller will see a PathError with .Err == io.ErrInterrupt
InterruptPendingUse(rootpath string) error

That is not going to be a good API, because it will require every single file operation to register itself somewhere so that it can be interrupted, and then deregister itself when complete. That is too high a price to pay for a feature that extremely few programs will want to use.

@tv42
Copy link

tv42 commented Aug 25, 2020

Strawman API proposal: end := bikeshed.Cancellable(ctx); defer end() causes syscalls made inside the function interrupt if context is canceled, returning the error from ctx.Err(). Multiple overlapping calls would be legal. Interrupting would be best-effort and not guaranteed.

(Or, alternatively, bikeshed.Cancellable(ctx, func() { ... }), but I like the defer variant more because return values are awkward with func().)

This could be done by piggybacking on SIGURG (using runtime knowledge to target the the right OS thread, whereever the goroutine is being run) and a per-goroutine noRetryOnEINTR flag that temporarily disables the EINTR retry loop.

(This is starting to smell a lot like implicit context passing, #28342 and probably others -- but it would avoid duplicating every API just to add ctx.)

@ianlancetaylor
Copy link
Contributor

Frankly I wouldn't hesitate to duplicate APIs to add a context. Explicit is better than implicit.

I'm not inherently opposed to using a signal to interrupt the thread but it's not like we can just send a signal and carry on. What if the thread isn't in a system call? What if the thread is preempted by the kernel but just about to enter a system call when it resumes? Maybe a signal can be part of the code but it needs considerable machinery around it.

@networkimprov
Copy link

Strawman API proposal

When a network filesystem becomes unavailable, you interrupt all syscalls trying paths within a file tree whose root is on the network. Marking as interruptible all file ops in the current function (and implicitly, those in functions it calls, etc) isn't helpful, as you need to interrupt some ops and not others, depending on their location.

Interrupt errors would often trigger context cancellation. A file tree is typically used by many tasks; it's not a task unto itself.

require every single file operation to register itself somewhere

Thankfully, no. The init() procedure for "os/interrupt" could enable said registration. Alternatively, replicate the file API with variants that perform said registration. The former is more user-friendly, but either works :-)

it's not like we can just send a signal and carry on. What if...

@aclements is probably better able to address these Qs. (Austin, the current subthread starts at #40846 (comment).)

via the golang.org/x/sys/unix package

We also need this on Windows for CIFS. The x/ tree is fine, but not x/sys.

a feature that extremely few programs will want to use.

Any program that relies on the filesystem and is deployed into environments that its developers don't control needs to allow for network filesystems. They're still heavily used in office LANs.

@tv42
Copy link

tv42 commented Aug 25, 2020

What if the thread isn't in a system call?

Since the scheduler sends SIGURGs anyway, SIGURG should be harmless. Either the scheduler knows that it didn't mean to preempt that goroutine, and the signal handler does nothing, or the goroutine goes to sleep for a while. Neither should be visible to the program, and these false positive SIGURGs shouldn't happen often enough to cause much trouble.

What if the thread is preempted by the kernel but just about to enter a system call when it resumes?

Good question. This is sort of comparable to the race inherent in for { select { case <-ctx.Done(): return ctx.Err(); default: }; expensiveOperation() }; that sort of look-before-use is better than nothing, but obviously expensiveOperation understanding context is better. Our SIGURG could hit just before the syscall.

Basically what is being asked here is to reconcile the desire ("cancel the slow operation in this goroutine") with the current state ("this goroutine is not / is inside a syscall"), and sometimes that sort of a reconciliation needs to be repeated until it reaches its desired effect.

One thing that needs careful thought is aiming the cancellation at the correct syscall. Consider syscallA(ctx1); syscallB(ctx2), you don't want ctx1 interrupting syscallB. That can happen because the cancellation is an outside observer to the goroutine's progress, sending signals from a separate goroutine. One way to deal with that is to let syscallB's EINTR retry loop just restart the syscall if we encounter such bad luck. This can be made to happen easily by checking ctx.Err() != nil in the EINTR retry loop; even if we get a false positive SIGURG, we won't get a false positive cancellation for our context.

Strawman implementation idea 1: Waste a goroutine per cancelable syscall

func SyscallCtx(ctx context.Context, ...) ... {
    gp := ...something...
    exiting := make(chan struct{})
    defer close(exiting)
    go func() {
        for {
            select {
                case <-exiting:
                    return
                case <-ctx.Done():
                    sendSIGURG(gp)
                    // could also write this with a ticker to get an immediate return on <-exiting
                    time.Sleep(aLittle)
            }
        }
    }()
    // make the syscall, with EINTR retry loop checking ctx
}

Strawman implementation idea 2: Add a context field and a periodic job

Scheduler "G" goroutine state (type g) stores an extra atomic unsafe.Pointer ctxp, pointing to the relevant context or nil.

New function SyscallCtx sets gp.ctxp, and defers an atomic set to nil.

SyscallCtx has an EINTR retry loop that bails out on cancel: if err := ctx.Err(); err != nil { return err }

Have a periodic job that scans for G's in state _Gsyscall, ctxp := (*context.Value)(atomic.LoadPointer(&gp.ctxp)); if ctxp != nil && (*ctxp).Err() != nil { sendSIGURGToG(gp) }. This is our reconciliation loop.

(Alternatively, instead of g.ctxp and increasing the size of every G, have a single sync.Map(-alike) that stores gp->ctx mapping. Could make the scan cheaper, as it only needs to walk over current cancellable syscalls. Then again, causes central coordination.)

(Might also be able to relocate ctxp to M? As there's no such thing as a syscall without an M, I believe, and there's a lot fewer M's than G's. My knowledge to the scheduler runs out here.)

@ianlancetaylor
Copy link
Contributor

@networkimprov

require every single file operation to register itself somewhere

Thankfully, no. The init() procedure for "os/interrupt" could enable said registration. Alternatively, replicate the file API with variants that perform said registration. The former is more user-friendly, but either works :-)

Sorry, I didn't mean that user code would have to register. I meant that the each file operation in the standard library would have to register itself. As I said earlier, that is too high a price to pay for a feature that extremely few programs will want to use.

Any program that relies on the filesystem and is deployed into environments that its developers don't control needs to allow for network filesystems. They're still heavily used in office LANs.

That is easy to say, but how many actual programs actually do that, in any language?

@ianlancetaylor
Copy link
Contributor

@tv42 For read-only operations I would vote for what I suggested above in #40846 (comment), but add some mechanism to clean up the system call in the background. I don't see an obvious reason to force the real work of the program to wait until the system call is interrupted.

For write operations I agree that we would need to somehow know whether the system call completes or is interrupted.

@tv42
Copy link

tv42 commented Aug 25, 2020

Leaving the syscall running would leave the buffer passed to Read etc poisoned, so that would only apply to things that allocate internally, and have no out-parameters.

I'm also uncomfortable with leaking such threads-in-syscalls. There's no guarantee when or even if they will return. It just seems like a bad design to me. I'd much rather have the user program in charge of them, to have that visibility.

@bcmills
Copy link
Contributor

bcmills commented Aug 25, 2020

Either way, it seems clear to me that this use-case needs a proposal with more depth to it than just “don't check for EINTR in os functions”. @tv42, perhaps file a separate proposal for a high-level API, and we can figure out the implementation once we know what the API should look like?

(But again, I wouldn't expect to make progress on any API changes until we have a clearer picture of where #5636 is going.)

@networkimprov
Copy link

The init() procedure for "os/interrupt" could enable said registration. Alternatively, replicate the file API with variants that perform said registration. The former is more user-friendly, but either works :-)

Sorry, I didn't mean that user code would have to register. I meant that the each file operation in the standard library would have to register itself.

I understood what you meant. It's addressed by either of the approaches I gave (quoted above).

All: a core issue is what the runtime must do to let a blocked syscall thread be signaled on demand. We should find a general-purpose scheme. I see two options...

a) keep a table of syscall thread IDs, adding to it when requested; provide a reference or copy of it when requested
b) asynchronously post a thread ID to an app-supplied variable/table

PS: discussion of #5636 has moved to https://www.reddit.com/r/golang/comments/hv976o/qa_iofs_draft_design/?sort=new

@networkimprov
Copy link

I filed #41054 after digesting the above discussion. It should support both use cases described above:

a) APIs that can be cancelled via a context argument, and
b) interrupting any syscalls trying paths on a specific tree.

@Cyberax
Copy link

Cyberax commented Sep 6, 2020

So it seems to me that the only programs that are going to use a deadline for a stat call are going to be ones that are aware that they are running on a networked file system. And I argue that in the modern era those programs are much more likely to use a different approach, rather than use a networked file system at all.

I'm sorry for a late post here, but I do have an example of this kind of a workload. I'm using NFS-based disk as a write-through cache for large objects in front of DynamoDB. My code first tries to check the contents on the disk with a very tight deadline (2ms) and if the deadline expires it falls back on an expensive call into DynamoDB.

As such, it needs timeouts for all calls used and stat is among them (to check for object's presence).

@networkimprov
Copy link

Alex, curious to hear what OS is running the NFS client, and which client if not the default?

Last I checked Linux doesn't interrupt syscalls to NFS on signal.

@ianlancetaylor
Copy link
Contributor

I'm sorry for a late post here, but I do have an example of this kind of a workload. I'm using NFS-based disk as a write-through cache for large objects in front of DynamoDB. My code first tries to check the contents on the disk with a very tight deadline (2ms) and if the deadline expires it falls back on an expensive call into DynamoDB.

Do you mean: you do a read, and if the read doesn't return within 2ms you contact the database directly?

We already support setting a deadline on a read call, so it's not immediately obvious to me that this requires any feature that we don't already support. Although I don't know whether the kernel supports polling in an NFS file descriptor.

@Cyberax
Copy link

Cyberax commented Sep 6, 2020

Correct, we do all operations (including read() and write()) under tight deadlines and if these deadlines are violated we fall back to the direct database requests. These requests are expensive in monetary terms, as DynamoDB essentially bills per request.

In case of read() and write() it's a bit more involved, we "recharge" the deadline upon receiving each 512Kb of data.

The stat() calls are used to get the object size and the creation time, we don't use readdir() right now.

We use regular IO in a child goroutine and simply abandon it in case the deadline expires. We also added a semaphore to limit the concurrency in case of NFS server hiccups. It all works acceptably well, but it would be nice to actually be able to just be able to use the regular context-based cancellation.

@Cyberax
Copy link

Cyberax commented Sep 6, 2020

Alex, curious to hear what OS is running the NFS client, and which client if not the default?

Linux and NFSv4.

Last I checked Linux doesn't interrupt syscalls to NFS on signal.

Linux supports interrupting NFS calls: https://elixir.bootlin.com/linux/latest/source/net/sunrpc/sched.c#L924 and https://elixir.bootlin.com/linux/latest/source/net/sunrpc/sched.c#L954 - but so far it's used only during unmounting. I'm planning to fix that in the upstream Linux.

@networkimprov
Copy link

@Cyberax I'm curious as to why NFS vs HTTP or somesuch?

@ncw
Copy link
Contributor

ncw commented Oct 26, 2020

@bcmills @ianlancetaylor Is the fix for this https://go-review.googlesource.com/c/go/+/249178/ going to get a backport into go1.15.x ? I couldn't see it on the milestones.

@ianlancetaylor
Copy link
Contributor

@ncw The problem is not new in 1.15, so this does not qualify for a backport by our usual rules.

@networkimprov
Copy link

To clarify, for the great majority of Go sites, the problem is new in 1.14. It can in theory occur before that, but was never reported.

@ncw
Copy link
Contributor

ncw commented Oct 26, 2020

@ncw The problem is not new in 1.15, so this does not qualify for a backport by our usual rules.

Ah, that's a shame. Rclone users seem to come across this quite regularly.

@golang golang locked and limited conversation to collaborators Oct 26, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

9 participants