Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

internal/poll: can get EINTR reading from CIFS #38836

Closed
dsoprea opened this issue May 3, 2020 · 20 comments
Closed

internal/poll: can get EINTR reading from CIFS #38836

dsoprea opened this issue May 3, 2020 · 20 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Milestone

Comments

@dsoprea
Copy link

dsoprea commented May 3, 2020

What version of Go are you using (go version)?

1.14.2

Does this issue reproduce with the latest release?

(This is the latest release)

What operating system and processor architecture are you using (go env)?

linux, amd64

What did you do?

I'm running a Go tool of mine over a bunch of images. I'm getting periodic "interrupted system call" errors:

*os.PathError read /general_images/dense_album_staging/2015/google_photos/takeout-20150928-0047/dump/2012-12-05-/IMAG0507-edited.jpg: interrupted system call
/home/dustin/development/go/src/github.com/dsoprea/go-exif-knife/exif.go:69 (0x7c8a4c)
	GetExif: log.PanicIf(err)

I've restarted several times. On earlier runs, I've gotten large clumps of these in the output (seems racy).

They always happen in the same place:

    65		} else {
    66			var err error
    67	
    68			data, err = ioutil.ReadFile(imageFilepath)
    69			log.PanicIf(err)
    70		}

Is it possible that ioutil.ReadFile() is not correctly managing signals?

(https://golang.org/pkg/os/signal/#hdr-Go_programs_that_use_cgo_or_SWIG)

If the non-Go code installs any signal handlers, it must use the SA_ONSTACK flag with sigaction. Failing to do so is likely to cause the program to crash if the signal is received. Go programs routinely run with a limited stack, and therefore set up an alternate signal stack. Also, the Go standard library expects that any signal handlers will use the SA_RESTART flag. Failing to do so may cause some library calls to return "interrupted system call" errors.

@mdlayher
Copy link
Member

mdlayher commented May 3, 2020

I believe this is expected due to runtime changes in Go 1.14, see: https://golang.org/doc/go1.14#runtime.

@dsoprea
Copy link
Author

dsoprea commented May 3, 2020 via email

@ianlancetaylor
Copy link
Contributor

No, 1.14 triggers more signals, but Go installs signal handlers with SA_RESTART set, so any signals that occur during system calls called by ioutil.ReadFile should be restarted.

Is this a pure Go program?

Can you run the program under strace -f and see exactly what system call is returning EINTR, and show us a dozen lines around that call? Thanks.

@ianlancetaylor ianlancetaylor changed the title ioutil.ReadFile: "interrupted system call" io/ioutil: ReadFile: "interrupted system call" May 4, 2020
@ianlancetaylor ianlancetaylor added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. labels May 4, 2020
@ianlancetaylor ianlancetaylor added this to the Go1.15 milestone May 4, 2020
@dsoprea
Copy link
Author

dsoprea commented May 4, 2020

Yes, it's definitely pure Go (a principal feature of the project).

Okay. I'm running it with strace right now. I don't know how long it will take to encounter the first one. Hopefully the slowdown won't completely stop it from occurring.

@dsoprea
Copy link
Author

dsoprea commented May 4, 2020

Got a couple.

1: Triggered in lines 423-427

   401  [pid  2790] epoll_ctl(4, EPOLL_CTL_ADD, 5, {EPOLLIN, {u32=12892136, u64=12892136}}) = 0
   402  [pid  2791] <... nanosleep resumed> NULL) = 0
   403  [pid  2791] epoll_pwait(4,  <unfinished ...>
   404  [pid  2790] epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=60702552, u64=139629447495512}} <unfinished ...>
   405  [pid  2791] <... epoll_pwait resumed> [], 128, 0, NULL, 139628791573936) = 0
   406  [pid  2790] <... epoll_ctl resumed> )   = -1 EPERM (Operation not permitted)
   407  [pid  2791] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
   408  [pid  2790] epoll_ctl(4, EPOLL_CTL_DEL, 3, 0xc0000d7a8c) = -1 EPERM (Operation not permitted)
   409  [pid  2791] <... nanosleep resumed> NULL) = 0
   410  [pid  2791] nanosleep({tv_sec=0, tv_nsec=40000},  <unfinished ...>
   411  [pid  2790] fstat(3, {st_mode=S_IFREG|0755, st_size=2296983, ...}) = 0
   412  [pid  2791] <... nanosleep resumed> NULL) = 0
   413  [pid  2791] nanosleep({tv_sec=0, tv_nsec=80000},  <unfinished ...>
   414  [pid  2790] mmap(NULL, 1439992, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
   415  [pid  2791] <... nanosleep resumed> NULL) = 0
   416  [pid  2790] <... mmap resumed> )        = 0x7efdd6e9e000
   417  [pid  2791] getpid()                    = 2790
   418  [pid  2791] tgkill(2790, 2790, SIGURG <unfinished ...>
   419  [pid  2790] read(3,  <unfinished ...>
   420  [pid  2791] <... tgkill resumed> )      = 0
   421  [pid  2791] nanosleep({tv_sec=0, tv_nsec=160000}, NULL) = 0
   422  [pid  2791] nanosleep({tv_sec=0, tv_nsec=320000},  <unfinished ...>

   423  [pid  2790] <... read resumed> 0xc000180000, 2297495) = -1 EINTR (Interrupted system call)
   424  [pid  2790] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=2790, si_uid=1000} ---
   425  [pid  2790] rt_sigreturn({mask=[]} <unfinished ...>
   426  [pid  2791] <... nanosleep resumed> NULL) = 0
   427  [pid  2790] <... rt_sigreturn resumed> ) = -1 EINTR (Interrupted system call)

   428  [pid  2791] nanosleep({tv_sec=0, tv_nsec=640000},  <unfinished ...>
   429  [pid  2790] close(3)                    = 0
   430  [pid  2791] <... nanosleep resumed> NULL) = 0
   431  [pid  2791] nanosleep({tv_sec=0, tv_nsec=1280000},  <unfinished ...>
   432  [pid  2790] openat(AT_FDCWD, "/home/dustin/development/go/src/github.com/dsoprea/go-exif-knife/exif.go", O_RDONLY|O_CLOEXEC) = 3
   433  [pid  2790] epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=60702552, u64=139629447495512}}) = -1 EPERM (Operation not permitted)
   434  [pid  2790] epoll_ctl(4, EPOLL_CTL_DEL, 3, 0xc0000d77b4) = -1 EPERM (Operation not permitted)
   435  [pid  2790] read(3, "package exifknife\n\nimport (\n\t\"er"..., 4096) = 4096
   436  [pid  2790] close(3)                    = 0
   437  [pid  2790] openat(AT_FDCWD, "/home/dustin/development/go/src/github.com/dsoprea/go-exif-knife/handler/read/read.go", O_RDONLY|O_CLOEXEC) = 3
   438  [pid  2790] epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=60702552, u64=139629447495512}} <unfinished ...>
   439  [pid  2791] <... nanosleep resumed> NULL) = 0
   440  [pid  2790] <... epoll_ctl resumed> )   = -1 EPERM (Operation not permitted)
   441  [pid  2791] nanosleep({tv_sec=0, tv_nsec=2560000},  <unfinished ...>
   442  [pid  2790] epoll_ctl(4, EPOLL_CTL_DEL, 3, 0xc0000d77b4) = -1 EPERM (Operation not permitted)
   443  [pid  2790] read(3, "package exifkniferead\n\nimport (\n"..., 4096) = 4096
   444  [pid  2790] close(3)                    = 0
   445  [pid  2790] openat(AT_FDCWD, "/home/dustin/development/go/src/github.com/dsoprea/go-exif-knife/command/go-exif-knife/main.go", O_RDONLY|O_CLOEXEC) = 3
   446  [pid  2790] epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=60702552, u64=139629447495512}}) = -1 EPERM (Operation not permitted)
   447  [pid  2790] epoll_ctl(4, EPOLL_CTL_DEL, 3, 0xc0000d77b4) = -1 EPERM (Operation not permitted)
   448  [pid  2790] read(3, "package main\n\nimport (\n\t\"os\"\n\n\t\""..., 4096) = 3957
   449  [pid  2790] close(3)                    = 0
   450  [pid  2790] openat(AT_FDCWD, "/usr/local/go/src/runtime/proc.go", O_RDONLY|O_CLOEXEC) = 3
   451  [pid  2790] epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=60702552, u64=139629447495512}}) = -

2: Triggered in lines 475-481

   452  [pid  3276] pipe2([5, 6], O_NONBLOCK|O_CLOEXEC) = 0
   453  [pid  3277] <... nanosleep resumed> NULL) = 0
   454  [pid  3276] epoll_ctl(4, EPOLL_CTL_ADD, 5, {EPOLLIN, {u32=12892136, u64=12892136}} <unfinished ...>
   455  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
   456  [pid  3276] <... epoll_ctl resumed> )   = 0
   457  [pid  3277] <... nanosleep resumed> NULL) = 0
   458  [pid  3276] epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2452402288, u64=140490832654448}} <unfinished ...>
   459  [pid  3277] epoll_pwait(4,  <unfinished ...>
   460  [pid  3276] <... epoll_ctl resumed> )   = -1 EPERM (Operation not permitted)
   461  [pid  3277] <... epoll_pwait resumed> [], 128, 0, NULL, 140490176716208) = 0
   462  [pid  3276] epoll_ctl(4, EPOLL_CTL_DEL, 3, 0xc0000dfa8c <unfinished ...>
   463  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
   464  [pid  3276] <... epoll_ctl resumed> )   = -1 EPERM (Operation not permitted)
   465  [pid  3277] <... nanosleep resumed> NULL) = 0
   466  [pid  3276] fstat(3,  <unfinished ...>
   467  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
   468  [pid  3276] <... fstat resumed> {st_mode=S_IFREG|0755, st_size=310986, ...}) = 0
   469  [pid  3277] <... nanosleep resumed> NULL) = 0
   470  [pid  3277] getpid()                    = 3276
   471  [pid  3277] tgkill(3276, 3276, SIGURG <unfinished ...>
   472  [pid  3276] read(3,  <unfinished ...>
   473  [pid  3277] <... tgkill resumed> )      = 0
   474  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0

   475  [pid  3276] <... read resumed> 0xc000180000, 311498) = -1 EINTR (Interrupted system call)
   476  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
   477  [pid  3276] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=3276, si_uid=1000} ---
   478  [pid  3277] <... nanosleep resumed> NULL) = 0
   479  [pid  3276] rt_sigreturn({mask=[]} <unfinished ...>
   480  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
   481  [pid  3276] <... rt_sigreturn resumed> ) = -1 EINTR (Interrupted system call)

   482  [pid  3277] <... nanosleep resumed> NULL) = 0
   483  [pid  3276] close(3 <unfinished ...>
   484  [pid  3277] nanosleep({tv_sec=0, tv_nsec=40000}, NULL) = 0
   485  [pid  3277] nanosleep({tv_sec=0, tv_nsec=80000}, NULL) = 0
   486  [pid  3276] <... close resumed> )       = 0
   487  [pid  3277] nanosleep({tv_sec=0, tv_nsec=160000},  <unfinished ...>
   488  [pid  3276] openat(AT_FDCWD, "/home/dustin/development/go/src/github.com/dsoprea/go-exif-knife/exif.go", O_RDONLY|O_CLOEXEC <unfinished ...>
   489  [pid  3277] <... nanosleep resumed> NULL) = 0
   490  [pid  3276] <... openat resumed> )      = 3
   491  [pid  3277] nanosleep({tv_sec=0, tv_nsec=320000},  <unfinished ...>
   492  [pid  3276] epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2452402288, u64=140490832654448}}) = -1 EPERM (Operation not permitted)
   493  [pid  3276] epoll_ctl(4, EPOLL_CTL_DEL, 3, 0xc0000df7b4) = -1 EPERM (Operation not permitted)
   494  [pid  3277] <... nanosleep resumed> NULL) = 0
   495  [pid  3276] read(3,  <unfinished ...>
   496  [pid  3277] nanosleep({tv_sec=0, tv_nsec=640000},  <unfinished ...>
   497  [pid  3276] <... read resumed> "package exifknife\n\nimport (\n\t\"er"..., 4096) = 4096
   498  [pid  3276] close(3)                    = 0
   499  [pid  3276] openat(AT_FDCWD, "/home/dustin/development/go/src/github.com/dsoprea/go-exif-knife/handler/read/read.go", O_RDONLY|O_CLOEXEC) = 3
   500  [pid  3276] epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2452402288, u64=140490832654448}}) = -1 EPERM (Operation not permitted)
   501  [pid  3276] epoll_ctl(4, EPOLL_CTL_DEL, 3, 0xc0000df7b4 <unfinished ...>
   502  [pid  3277] <... nanosleep resumed> NULL) = 0

@ianlancetaylor
Copy link
Contributor

Thanks. I see that the read calls are being interrupted, but I don't see the calls that created those file descriptors. They are likely to be just shortly before the traces you show. What I'm wondering is whether those files are on an unusual file system, such as /sys.

@dsoprea
Copy link
Author

dsoprea commented May 4, 2020 via email

@dsoprea
Copy link
Author

dsoprea commented May 4, 2020

This corresponds to trace (2) above (about twenty lines before):

   432  [pid  3276] openat(AT_FDCWD, "/general_images/dense_album_staging/2015/google_photos/takeout-20150928-0047/dump/2014-05-10 #2/20140510_161510-edited.jpg", O_RDONLY|O_CLOEXEC <unfinished ...>

   433  [pid  3277] <... nanosleep resumed> NULL) = 0
   434  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
   435  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
   436  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
   437  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
   438  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
   439  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
   440  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
   441  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
   442  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
   443  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
   444  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
   445  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
   446  [pid  3276] <... openat resumed> )      = 3
   447  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
   448  [pid  3276] epoll_create1(EPOLL_CLOEXEC <unfinished ...>
   449  [pid  3277] <... nanosleep resumed> NULL) = 0
   450  [pid  3276] <... epoll_create1 resumed> ) = 4
   451  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
   452  [pid  3276] pipe2([5, 6], O_NONBLOCK|O_CLOEXEC) = 0
   453  [pid  3277] <... nanosleep resumed> NULL) = 0
   454  [pid  3276] epoll_ctl(4, EPOLL_CTL_ADD, 5, {EPOLLIN, {u32=12892136, u64=12892136}} <unfinished ...>
   455  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
   456  [pid  3276] <... epoll_ctl resumed> )   = 0
   457  [pid  3277] <... nanosleep resumed> NULL) = 0
   458  [pid  3276] epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2452402288, u64=140490832654448}} <unfinished ...>
   459  [pid  3277] epoll_pwait(4,  <unfinished ...>
   460  [pid  3276] <... epoll_ctl resumed> )   = -1 EPERM (Operation not permitted)
   461  [pid  3277] <... epoll_pwait resumed> [], 128, 0, NULL, 140490176716208) = 0
   462  [pid  3276] epoll_ctl(4, EPOLL_CTL_DEL, 3, 0xc0000dfa8c <unfinished ...>
   463  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
   464  [pid  3276] <... epoll_ctl resumed> )   = -1 EPERM (Operation not permitted)
   465  [pid  3277] <... nanosleep resumed> NULL) = 0
   466  [pid  3276] fstat(3,  <unfinished ...>
   467  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
   468  [pid  3276] <... fstat resumed> {st_mode=S_IFREG|0755, st_size=310986, ...}) = 0
   469  [pid  3277] <... nanosleep resumed> NULL) = 0
   470  [pid  3277] getpid()                    = 3276
   471  [pid  3277] tgkill(3276, 3276, SIGURG <unfinished ...>
   472  [pid  3276] read(3,  <unfinished ...>
   473  [pid  3277] <... tgkill resumed> )      = 0
   474  [pid  3277] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0

   475  [pid  3276] <... read resumed> 0xc000180000, 311498) = -1 EINTR (Interrupted system call)

As a kludge, I added a loop in the scripting that's calling this tool to loop whenever it sees this failure. It looks like, that, when it occurs, it may perfectly reproduce, at least under similar conditions. It was likely exposed by network latency. Your guess/suggestion was very good.

$ time ./parse_images.py      
interrupted system call; retrying (1): [/general_images/dense_album_staging/2007/facebook/20070713 Silver Lake Bike Trip/1.jpg]
interrupted system call; retrying (2): [/general_images/dense_album_staging/2007/facebook/20070713 Silver Lake Bike Trip/1.jpg]
interrupted system call; retrying (3): [/general_images/dense_album_staging/2007/facebook/20070713 Silver Lake Bike Trip/1.jpg]
interrupted system call; retrying (4): [/general_images/dense_album_staging/2007/facebook/20070713 Silver Lake Bike Trip/1.jpg]
interrupted system call; retrying (5): [/general_images/dense_album_staging/2007/facebook/20070713 Silver Lake Bike Trip/1.jpg]
Too many tries.                                               
                                                    
ERROR: [dense_album_staging/2007/facebook/20070713 Silver Lake Bike Trip/1.jpg]
['/tmp/go-exif-knife', 'read', '--filepath', '/general_images/dense_album_staging/2007/facebook/20070713 Silver Lake Bike Trip/1.jpg', '--ignore-no-exif']
                                                      
                                                       
Stack:                                                                                        
                                                                 
*os.PathError read /general_images/dense_album_staging/2007/facebook/20070713 Silver Lake Bike Trip/1.jpg: interrupted system call
/home/dustin/development/go/src/github.com/dsoprea/go-exif-knife/exif.go:69 (0x7c8ddc)
        GetExif: log.PanicIf(err)                       
/home/dustin/development/go/src/github.com/dsoprea/go-exif-knife/handler/read/read.go:39 (0x7cafe2)
        (*ExifRead).Read: mc, err := exifknife.GetExif(imageFilepath)      
/home/dustin/development/go/src/github.com/dsoprea/go-exif-knife/command/go-exif-knife/main.go:79 (0x7cd6d9)
        main: err := er.Read(options.Filepath, options.JustTry, options.SpecificIfd, options.SpecificTags, options.JustValues, options.Json, options.IgnoreNoExif)
/usr/local/go/src/runtime/proc.go:203 (0x436d72)                                          
        main: fn()                                                       
/usr/local/go/src/runtime/asm_amd64.s:1373 (0x463491)
        goexit: BYTE    $0x90   // NOP

@ianlancetaylor ianlancetaylor changed the title io/ioutil: ReadFile: "interrupted system call" internal/poll: can get EINTR reading from CIFS May 5, 2020
@ianlancetaylor
Copy link
Contributor

Thanks for the info. It's the use of CIFS that causes this. I think the only fix is to fix #20400.

@dsoprea
Copy link
Author

dsoprea commented May 5, 2020

That sucks. This is happening in the context of ioutil.ReadFile(), but I should assume it will happen regardless of the method?

@ianlancetaylor
Copy link
Contributor

Yes, it can happen in general.

For 1.14 you can make the problem less likely to occur by setting GODEBUG=asyncpreemptoff=1 in the environment. But it can still happen if your process receives a signal.

@networkimprov
Copy link

What's the source of the handler which didn't set SA_RESTART in this case?

@ianlancetaylor
Copy link
Contributor

The handler did set SA_RESTART. What happened here is that reading from CIFS didn't honor it. At least, that is my assumption, since CIFS is a networked file system.

@networkimprov
Copy link

Meaning a bug in Linux? Maybe a later kernel doesn't do this? Possibly related:
https://patchwork.kernel.org/patch/10887855/
https://spinics.net/lists/linux-cifs/msg16565.html

@gopherbot
Copy link

Change https://golang.org/cl/232862 mentions this issue: internal/poll, os: loop on EINTR

@networkimprov
Copy link

I posted a 1.14 backport request for this: #39026

rfjakob added a commit to rfjakob/gocryptfs that referenced this issue May 23, 2020
tmm1 pushed a commit to fancybits/go that referenced this issue Jul 8, 2020
Historically we've assumed that we can install all signal handlers
with the SA_RESTART flag set, and let the system restart slow functions
if a signal is received. Therefore, we don't have to worry about EINTR.

This is only partially true, and we've added EINTR checks already for
connect, and open/read on Darwin, and sendfile on Solaris.

Other cases have turned up in golang#36644, golang#38033, and golang#38836.

Also, golang#20400 points out that when Go code is included in a C program,
the C program may install its own signal handlers without SA_RESTART.
In that case, Go code will see EINTR no matter what it does.

So, go ahead and check for EINTR. We don't check in the syscall package;
people using syscalls directly may want to check for EINTR themselves.
But we do check for EINTR in the higher level APIs in os and net,
and retry the system call if we see it.

This change looks safe, but of course we may be missing some cases
where we need to check for EINTR. As such cases turn up, we can add
tests to runtime/testdata/testprogcgo/eintr.go, and fix the code.
If there are any such cases, their handling after this change will be
no worse than it is today.

For golang#22838
Fixes golang#20400
Fixes golang#36644
Fixes golang#38033
Fixes golang#38836

Change-Id: I7e46ca8cafed0429c7a2386cc9edc9d9d47a6896
Reviewed-on: https://go-review.googlesource.com/c/go/+/232862
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Bryan C. Mills <bcmills@google.com>
(cherry picked from commit 8c1db77)
@tmm1
Copy link
Contributor

tmm1 commented Jul 8, 2020

I am running into EINTR failures with a CIFS mount as well, on go1.14.

I cherry-picked 8c1db77 onto upstream/release-branch.go1.14 and have been trying that.

Before the cherry-pick, the errors were always in the read or write syscalls:

read /Channels-Video/Streaming/channels-cache-114767979/b7fbdba671ed14a2cb5c0ad7eb058ce45ad63bff1c8923cc9ac30911e6658f13: interrupted system call
write /Channels-Video/Streaming/channels-cache-525223662/99607c430416930d925f9399b8763fb79528ff4147f36f18ea6d53d462b2e7a0: interrupted system call

After the cherry-pick, the situation has improved but I still see occasional errors from open:

open /Channels-Video/Streaming/channels-cache-385377817/162c68b4fce222635832db9996c5eb8841b1d4ae4beb4a08689aeb5c51133ef6: interrupted system call

@networkimprov
Copy link

This fix isn't supported in 1.14, sadly. Can you try the 1.15 beta, and if it fails, file a new issue?

@tmm1
Copy link
Contributor

tmm1 commented Jul 8, 2020

Unfortunately I'm not able to do that since I don't control the machine/network where it occurs.

If there are other related CLs I would appreciate a pointer to them. But from what I saw only the read/write() paths were fixed, so I would guess 1.15 beta has the same problem with open()

@ianlancetaylor
Copy link
Contributor

You probably also want https://golang.org/cl/236997. See #39237.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Projects
None yet
Development

No branches or pull requests

6 participants