Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

io: Copy leaves file zero bytes in Azure AKS with CIFS — Go1.15 regression #42400

Closed
alhails opened this issue Nov 5, 2020 · 28 comments
Closed
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker
Milestone

Comments

@alhails
Copy link

alhails commented Nov 5, 2020

What version of Go are you using (go version)?

$ go version
1.15.2

Does this issue reproduce with the latest release?

Yes (1.15.3)

What operating system and processor architecture are you using?

The issue is reproduced on Azure AKS node (Ubuntu 16.04 kernel 4.15.0-1098-azure)

What did you do?

We have a golang API running in Azure AKS. The initial problem was noticed with file upload via http multipart request. After upgrade to go 1.15, uploaded files were left as zero bytes. Before this, uploads were working fine.

My initial suspicion was that the problem was related to the addition of the copy_file_range syscall in os.File ReadFrom, but also our use of a CIFS file share for our running pod.

I have only been able to reproduce in AKS, not locally.

I've reproduced the issue with a very simple golang code example..

package main

import (
 "fmt"
 "io"
 "os"
)

func main() {
 src, err := os.Open("source.txt")
 if err != nil {
  fmt.Printf("Failed to open file: %s\n", err.Error())
  return
 }

 dst, err := os.Create("dest.txt")
 if err != nil {
  fmt.Printf("Failed to create file: %s\n", err.Error())
  return
 }
 _, err = io.Copy(dst, src)
 if err != nil {
  fmt.Printf("Failed to copy file: %s\n", err.Error())
  return
 }
}

When this is executed on the container file system, the file is copied correctly.
When this is executed on a CIFS volume in the container, the file is created but is left as zero bytes.

I now suspect that the issue is with the kernel or CIFS, so have raised an issue with the team responsible for the azure linux kernel - https://answers.launchpad.net/ubuntu/+source/linux-azure/+question/693837

However I thought I might also raise it here, in case anyone here has any thoughts on this issue.

Thanks

Alastair

@odeke-em odeke-em changed the title io.Copy leaves file zero bytes in Azure AKS with CIFS io: Copy leaves file zero bytes in Azure AKS with CIFS — Go1.15 regression Nov 5, 2020
@odeke-em odeke-em added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Nov 5, 2020
@odeke-em
Copy link
Member

odeke-em commented Nov 5, 2020

Thank you for reporting this @alhails!

Kindly paging @tklauser @ianlancetaylor @alexbrainman.

@randall77
Copy link
Contributor

Try Closeing the destination file after the io.Copy. You might also try Sync.
(I don't think either should be necessary, but worth the experiment.)

@alhails
Copy link
Author

alhails commented Nov 5, 2020

I've reproduced locally by mounting a CIFS share to my NAS drive(!). I get the same behaviour here ie file is left zero bytes, so maybe the azure kernel is out of the frame?

PS I'm using debian buster (kernel 4.19.0-12-amd64) if this is still relevant.

@alhails
Copy link
Author

alhails commented Nov 5, 2020

@randall77 Thanks for the suggestion

I still get the same behaviour (locally on the CIFS share) with...

package main

import (
	"fmt"
	"io"
	"os"
)

func main() {
	src, err := os.Open("source.txt")
	if err != nil {
		fmt.Printf("Failed to open file: %s\n", err.Error())
		return
	}
	defer src.Close()

	dst, err := os.Create("dest.txt")
	if err != nil {
		fmt.Printf("Failed to create file: %s\n", err.Error())
		return
	}
	defer dst.Close()

	_, err = io.Copy(dst, src)
	if err != nil {
		fmt.Printf("Failed to copy file: %s\n", err.Error())
		return
	}

	err = dst.Sync()
	if err != nil {
		fmt.Printf("Failed to sync file: %s\n", err.Error())
		return
	}
}

@tklauser
Copy link
Member

tklauser commented Nov 5, 2020

Maybe related #42334?

@alhails does io.Copy return any error? Could you try logging what its first return value (written int64) is?

@alhails
Copy link
Author

alhails commented Nov 5, 2020

With amended copy...

	written, err := io.Copy(dst, src)
	if err != nil {
		fmt.Printf("Failed to copy file: %s\n", err.Error())
		return
	}
	fmt.Printf("written: %d\n", written)

I get the same output for both local disk (where file is written correctly) and on CIFS (where is left empty)

written: 4

PS my test source.txt just contains "123"

PPS There is no error returned

@ianlancetaylor
Copy link
Contributor

I agree that it sounds like a problem with the copy_file_range system call. If possible, try changing the standard library to initialize copyFileRangeSupported to 0 in src/internal/poll/copy_file_range_linux.go. If that fixes the problem, then it's definitely copy_file_range.

This is problematic, though, since we do want to use copy_file_range but we don't want to break people unexpectedly like this. Please keep us updated on the question you filed. It may be necessary to check whether a file is on CIFS before using copy_file_range, assuming there is some way that we can check that at all.

CC @acln0

@ianlancetaylor ianlancetaylor added this to the Go1.16 milestone Nov 6, 2020
@ianlancetaylor
Copy link
Contributor

Tentatively marking as a release blocker for 1.16, since we don't want this kind of surprising breakage in Go. We can remove the label if we discover that the problem only occurs on some limited kernel range or something.

@alhails
Copy link
Author

alhails commented Nov 6, 2020

I reproduced the failure from source (go1.15.4) and then made the tweak you suggested and the problem went away.

-var copyFileRangeSupported int32 = 1 // accessed atomically
+var copyFileRangeSupported int32 = 0 // accessed atomically

If this confirms it is an issue with copy_file_range, then is this a linux kernel issue? I would like to update my other ticket on https://answers.launchpad.net/ubuntu/+source/linux-azure/+question/693837

@gopherbot
Copy link

Change https://golang.org/cl/267958 mentions this issue: os, internal/poll: don't use copy_file_range on CIFS destinations

@acln0
Copy link
Contributor

acln0 commented Nov 6, 2020

It's unfortunate that it doesn't seem to work for CIFS, where it could have helped significantly.

On my home system, I have a SMB share that I tested the program against. Everything worked correctly. My SMB server is FreeBSD 11.2, and my client is Debian 5.8.0-2-amd64. I don't know that we can draw any conclusion from this. It is also unclear to me how exactly CIFS and SMB differ, so perhaps this test I performed is not informative at all. They do differ at least in magic file system numbers, as far as man 2 statfs tells us (but more on this later).

This looks like a kernel bug, but even if we knew for a fact that this is a kernel bug, and we knew the precise range of affected kernels, trying to detect those specific kernel versions from within poll.CopyFileRange seems exceedingly difficult, since unlike other cases, where the error appeared in the form of a specific new errno value from copy_file_range (#40731, #40893, #42334), we do not seem to have the luxury of such a visible error here. It seems like no error is reported, but the results are silently wrong.

As for detecting file systems, I think we could call statfs and look for CIFS_MAGIC_NUMBER in the f_type field, but that seems rather heavy handed. (Sidenote: In the test I ran on my own system, I got SMB2_MAGIC_NUMBER when I interrogated the destination file path using the statfs system call, so perhaps my test is not relevant at all.)

In any case, I have sent CL 267958 for testing and for discussion. I'm not sure that is the right approach at all, but I think it's worth at least a try and a discussion.

@alhails Can you please try https://go-review.googlesource.com/c/go/+/267958? Can you also try it with the arguments in reverse?

@ianlancetaylor What do you think about calling statfs this way?

@alhails
Copy link
Author

alhails commented Nov 6, 2020

@acln0 Not an experienced contributor, but I think I did this right...

I did a git pull https://go.googlesource.com/go refs/changes/58/267958/1 on master, and merged in your change.

Yes this does seem to fix the issue. I also confirmed that it was still making the copy_file_range call when operating on ext fs (using strace).

@alhails
Copy link
Author

alhails commented Nov 6, 2020

@acln0 Not quite sure what you meant by...

Can you also try it with the arguments in reverse?

@acln0
Copy link
Contributor

acln0 commented Nov 6, 2020

Thank you for testing.

Sorry for not being clear enough with "in reverse". If we are to choose the way of detecting file system types inside poll.CopyFileRange, I think we should test the copy code for all three interesting cases in terms of where the source and destination files are situated:

  • remote -> remote
  • local -> remote
  • remote -> local

The fourth case, local -> local, probably works fine, as far as we know. But as I am typing this, I now realize that since you are on Linux pre-5.3, your initial test case was likely remote -> remote, otherwise copy_file_range would have returned EXDEV and you would have seen a traditional read + write transfer.

The other two cases, where only one of the files is on a remote file system, may also be interesting, because the CL I sent only checks the file system type for the destination file. This may be insufficient in the general case, because the bug may exist when the source file is mounted remotely and the destination file is mounted locally. If there is any way that you could check with your existing CIFS share and with a 5.3+ kernel, that would be very helpful.

On my machine, with the SMB setup I describe above (5.8 kernel), everything works fine, but if I check the file system type using this function:

func fsMagic(path string) (int64, error) {
	var sfs syscall.Statfs_t
	if err := syscall.Statfs(path, &sfs); err != nil {
		return 0, err
	}
	return sfs.Type, nil
}

... I get SMB2_MAGIC_NUMBER rather than CIFS_MAGIC_NUMBER, and I cannot reproduce the bug to begin with.

@alhails
Copy link
Author

alhails commented Nov 6, 2020

Ok I've tested all four scenarios with both my installed go (go1.15.2) and your patch

All of the scenarios seem to work apart from...

  • go1.15.2
    remote -> remote

Upgrading my kernel isn't something I've done before. Is this is something that I can do without rebuilding my machine, or upgrading my distribution from debian buster?

Update

I see it's possible using backports and had a quick look but I've got some dependency issues...

@alhails
Copy link
Author

alhails commented Nov 8, 2020

Rather than mess around with my setup, I set up an alpine vm (kernel 5.4.72-0-lts)

This seems to work fine with the same two sets of four tests as I ran above. So maybe this is an indication that it's a problem with the older kernel?

PS I did notice that stat -f <file> on the cifs share returned UNKNOWN, rather than cifs as it does on my debian machine, but the drive was mounted in the same way ie with mount -t cifs ...

@networkimprov
Copy link

From https://man7.org/linux/man-pages/man2/copy_file_range.2.html

   The copy_file_range() system call first appeared in Linux 4.5, but
   glibc 2.27 provides a user-space emulation when it is not available.

   A major rework of the kernel implementation occurred in 5.3.  Areas
   of the API that weren't clearly defined were clarified and the API
   bounds are much more strictly checked than on earlier kernels.
   Applications should target the behaviour and requirements of 5.3
   kernels.

   First support for cross-filesystem copies was introduced in Linux
   5.3.  Older kernels will return -EXDEV when cross-filesystem copies
   are attempted.

From that, I'd guess this syscall should not be tried on kernels prior to 5.3.

@tklauser
Copy link
Member

tklauser commented Nov 9, 2020

From what @alhails describes in #42400 (comment) the failure occurs only if the file is copied from CIFS to CIFS. Thus, it looks like we would in the worst case need to check the file system for the source and the destination file, i.e. two statfs syscalls per copy_file_range syscalls and possibly extending the list of broken cases on other filesystems.

Given the man page note and @alhails's confirmation that the failure no longer occurs on kernel 5.4, to me it seems the safest solution is to disable use of copy_file_range on kernels older than 5.3 - as unfortunate as this is. I've sent https://golang.org/cl/268338 implementing this.

@ianlancetaylor @acln0 what do you think?

@gopherbot
Copy link

Change https://golang.org/cl/268338 mentions this issue: internal/poll: use copy_file_range only on Linux kernel >= 5.3

@networkimprov
Copy link

Could we do the kernel version check at runtime startup if syscall.CopyFileRange is linked in the binary?

@tklauser
Copy link
Member

Could we do the kernel version check at runtime startup if syscall.CopyFileRange is linked in the binary?

I think this would be a much more invasive change. The current solution checks the kernel version once on the first call to copy_file_range which I think is preferable.

@acln0
Copy link
Contributor

acln0 commented Nov 10, 2020

@tklauser Now that 1c7650a is merged, I wonder if we can ever hit this switch case:

case syscall.ENOSYS:

Perhaps we should remove it?

@acln0
Copy link
Contributor

acln0 commented Nov 10, 2020

There is also a similar check for EXDEV which may no longer apply, given that we have raised the effective requirement for copy_file_range to Linux 5.3 or later.

@tklauser
Copy link
Member

@acln0 good point. From quickly skimming through kernel code (check on 5.10-rc3) it indeed looks like we can no longer hit ENOSYS/EXDEV on Linux 5.3+. I'm still a bit worried that I overlooked something. And with the code freeze in place, I'd probably submit a CL removing these only for the 1.17 cycle.

@acln0
Copy link
Contributor

acln0 commented Nov 12, 2020

I'm still a bit worried that I overlooked something. And with the code freeze in place, I'd probably submit a CL removing these only for the 1.17 cycle.

Agreed. SGTM for 1.17. I don't think the check can possibly do any harm now.

@networkimprov
Copy link

@gopherbot please backport to 1.15. This is a regression which may be undetectable.

@gopherbot
Copy link

Backport issue(s) opened: #42550 (for 1.15).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

MichaelEischer added a commit to MichaelEischer/restic that referenced this issue Dec 18, 2020
Bugs in the error handling while uploading a file to the backend could
cause incomplete files, e.g. golang/go#42400
which could affect the local backend.

Proactively add sanity checks which will treat an upload as failed if
the reported upload size does not match the actual file size.
fd0 pushed a commit to MichaelEischer/restic that referenced this issue Jan 29, 2021
Bugs in the error handling while uploading a file to the backend could
cause incomplete files, e.g. golang/go#42400
which could affect the local backend.

Proactively add sanity checks which will treat an upload as failed if
the reported upload size does not match the actual file size.
@ZimbiX
Copy link

ZimbiX commented May 17, 2021

Possibly related: docker/for-linux#1015 (comment)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker
Projects
None yet
Development

No branches or pull requests

9 participants