Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os/exec: occasional "leaked parent file" on illumos/amd64 #42431

Closed
jclulow opened this issue Nov 6, 2020 · 9 comments
Closed

os/exec: occasional "leaked parent file" on illumos/amd64 #42431

jclulow opened this issue Nov 6, 2020 · 9 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-illumos
Milestone

Comments

@jclulow
Copy link
Contributor

jclulow commented Nov 6, 2020

Occasionally we seem to see a test failure in the illumos buildlet:

--- FAIL: TestExtraFiles (1.58s)
    exec_test.go:696: Run: exit status 1
        --- stdout:
        leaked parent file. fd = 5; want 4
        readlink("/proc/self/fd/4") = "", readlink /proc/self/fd/4: no such file or directory
        * Displaying port information for all processes..
        PID     PROCESS                   IP                        PORT   
        -----------------------------------------------------------------
        981987  ./buildlet.exe            165.225.148.197           36741  
        981987  ./buildlet.exe            107.178.219.46            443    
        981987  ./buildlet.exe            165.225.148.197           64726  
        981987  ./buildlet.exe            107.178.219.46            443    
        981987  ./buildlet.exe            165.225.148.197           33606  
        981987  ./buildlet.exe            107.178.219.46            443    
        993967  /var/tmp/workdir-host-i   127.0.0.1                 64698  
        993967  /var/tmp/workdir-host-i   127.0.0.1                 64698  
        993967  /var/tmp/workdir-host-i   127.0.0.1                 64698  
        993967  /var/tmp/workdir-host-i   127.0.0.1                 33969  
        --- stderr:
        lsof -p 994152 failed: exit status 1
FAIL
FAIL	os/exec	2.476s

The test in os/exec/read3.go is trying to execute lsof to get debugging information, but unfortunately at least in the distribution that the buildlet is running under that command is not terribly useful. The native tool for inspecting open file descriptors on illumos systems is pfiles, and I see some code in there to handle platform differences already so I will put up a CL to add our distinctiveness alongside. Hopefully it will then be easier to debug subsequent instances of this issue.

@gopherbot
Copy link

Change https://golang.org/cl/268019 mentions this issue: os/exec: use "pfiles" for fd debugging on illumos

@dmitshur dmitshur added the NeedsFix The path to resolution is known, but the work has not been done. label Nov 7, 2020
@dmitshur dmitshur added this to the Go1.16 milestone Nov 7, 2020
gopherbot pushed a commit that referenced this issue Nov 7, 2020
On illumos (and Solaris) systems, the native "pfiles" tool provides the
best information about open file descriptors for a process:

    https://illumos.org/man/1/pfiles

Use that instead of "lsof" when debugging file descriptor leaks.

Updates #42431.

Change-Id: If1250c4e6c9e8adbd076495a09fb1ce63abcc68b
Reviewed-on: https://go-review.googlesource.com/c/go/+/268019
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Trust: Dmitri Shuralyov <dmitshur@golang.org>
@jclulow
Copy link
Contributor Author

jclulow commented Nov 10, 2020

Now that we're using pfiles we can at least see what's still open after the failure:

--- FAIL: TestExtraFiles (0.69s)
    exec_test.go:696: Run: exit status 1
        --- stdout:
        leaked parent file. fd = 5; want 4
        readlink("/proc/self/fd/4") = "", readlink /proc/self/fd/4: no such file or directory
        480634:	/var/tmp/workdir-host-illumos-amd64-jclulow/tmp/TestExtraFiles35626464
          Current rlimit: 65536 file descriptors
           0: S_IFCHR mode:0666 dev:574,63 ino:698458753 uid:0 gid:3 rdev:38,2
              O_RDONLY|O_LARGEFILE
              /dev/null
              offset:0
           1: S_IFIFO mode:0000 dev:571,0 ino:282113885 uid:84001 gid:84001 rdev:0,0
              O_RDWR
           2: S_IFIFO mode:0000 dev:571,0 ino:282113886 uid:84001 gid:84001 rdev:0,0
              O_RDWR
           3: S_IFREG mode:0600 dev:90,65612 ino:183330 uid:84001 gid:84001 size:12
              O_RDWR|O_CREAT|O_EXCL|O_LARGEFILE
              /var/tmp/workdir-host-illumos-amd64-jclulow/tmp/276168692
              offset:12
           5: S_IFREG mode:0755 dev:90,65612 ino:184467 uid:84001 gid:84001 size:2273611
              O_RDONLY|O_LARGEFILE FD_CLOEXEC
              /var/tmp/workdir-host-illumos-amd64-jclulow/tmp/TestExtraFiles356264643/001/read3.exe
              offset:0
           6: S_IFPORT mode:0000 dev:585,0 uid:84001 gid:84001 size:0
           7: S_IFIFO mode:0000 dev:571,0 ino:282113888 uid:84001 gid:84001 rdev:0,0
              O_RDWR|O_NONBLOCK FD_CLOEXEC
        --- stderr:
FAIL
FAIL	os/exec	1.579s

There's nothing currently at fd 4, but presumably there was when we tried to open the test file. This doesn't seem to happen on most test runs, which implies a race of some kind. We should look at running the test under tracing to see if there is another thread opening and closing a file descriptor early in the process lifetime.

@bcmills bcmills added OS-illumos NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Nov 10, 2020
@gopherbot gopherbot removed the NeedsFix The path to resolution is known, but the work has not been done. label Nov 10, 2020
@ianlancetaylor
Copy link
Contributor

On GNU/Linux we had a similar problem that turned out to be because the C library would sometimes open /sys/devices/system/cpu/online. See #25628 (comment) .

Is it possible to write something similar to https://golang.org/cl/225799 for Illumos, to help identify what file is being opened?

@jclulow
Copy link
Contributor Author

jclulow commented Nov 14, 2020

Because this seemed like a race, I decided to start by just investigating open(2) and close(2) activity on threads other than the main thread without needing to actually catch the test in the act of failing. First, I did my best to zero in on just the specific test:

$ cd /ws/go/master/src
$ GOROOT_BOOTSTRAP=/opt/go/1.15.2 ./all.bash
...

Once I had a successful build of the latest revision, I prebuilt the test and ran it:

$ export PATH=/ws/go/master/bin:$PATH
$ cd /ws/go/master/src/os/exec
$ go test -c .
$ ./exec.test -test.v -test.run '^TestExtraFiles$'
=== RUN   TestExtraFiles
--- PASS: TestExtraFiles (0.60s)
PASS

Next, looking at the test, it seems like the interesting failure occurs in the read3.exe child process, so I wrote a pair of DTrace scripts. First, off-thread-1.d, a script to trace open() and close() activity in a specific process, and to emit a stack trace with symbols that shows where the call was made:

BEGIN
{
        start = $1; /* use start time from wrapper script */
        system("prun %d", $target);
}

syscall::open:entry,
syscall::close:entry
/pid == $target && tid != 1/
{
        self->t = walltimestamp - start;
}

syscall::open:entry
/self->t/
{
        printf("%10u %6u/%u %s(%s)\n", self->t, pid, tid, probefunc, copyinstr(arg0));
}

syscall::open:return,
syscall::close:return
/self->t/
{
        printf("%10u %6u/%u --> %u\n", self->t, pid, tid, arg1);
        ustack();
        printf("\n");
}

syscall::close:entry
/self->t/
{
        printf("%10u %6u/%u %s(%u)\n", self->t, pid, tid, probefunc, arg0);
}

syscall:::return
/self->t/
{
        self->t = 0;
}

Then, because this script needs to grab the specific child process in order to resolve symbols, I wrote another script, catch.d, that would watch for read3.exe children and run the first script against them:

BEGIN
{
        start = walltimestamp;
}

exec-success
/execname == "read3.exe"/
{
        this->t = walltimestamp - start;

        printf("%10u %6u/%u exec ok of: %s\n", this->t, pid, tid, execname);
        raise(SIGSTOP);
        system("pfiles %d\n", pid);
        system("dtrace -p %d -wqCs off-thread-1.d %u\n", pid, start);
}

I was then able to trace the test process:

# dtrace -qCws catch.d
...
15291794879  72523/1 exec ok of: read3.exe
72523:  /tmp/TestExtraFiles286557209/001/read3.exe
  Current rlimit: 65536 file descriptors
   0: S_IFCHR mode:0666 dev:576,3 ino:671925534 uid:0 gid:3 rdev:38,2
      O_RDONLY|O_LARGEFILE
      /dev/null
      offset:0
   1: S_IFIFO mode:0000 dev:573,0 ino:1713104 uid:1000 gid:15000 rdev:0,0
      O_RDWR
   2: S_IFIFO mode:0000 dev:573,0 ino:1713105 uid:1000 gid:15000 rdev:0,0
      O_RDWR
   3: S_IFREG mode:0600 dev:581,26 ino:306381678 uid:1000 gid:15000 size:12
      O_RDWR|O_CREAT|O_EXCL|O_LARGEFILE
      /tmp/152564002
      offset:0
18021810301  72523/3 open(/proc/self/lwp/5/lwpname)
18021810301  72523/3 --> 4

              libc.so.1`_syscall6+0x1b
              libc.so.1`pthread_setname_np+0x75
              libc.so.1`_thrp_create+0x5fb
              libc.so.1`pthread_create+0xc1
              read3.exe`runtime.asmsysvicall6+0x5a
              read3.exe`runtime.newosproc+0x17d
              read3.exe`runtime.newm1+0xdd
              read3.exe`runtime.newm+0xa6
              read3.exe`runtime.startm+0xe5
              read3.exe`runtime.wakep+0x66
              read3.exe`runtime.resetspinning+0x59
              read3.exe`runtime.schedule+0x2b9
              read3.exe`runtime.mstart1+0x93
              read3.exe`runtime.mstart+0x66
              read3.exe`runtime.tstart_sysvicall+0x42
              libc.so.1`_lwp_start

18021810301  72523/3 close(4)
18021810301  72523/3 --> 0

              libc.so.1`__close+0xa
              libc.so.1`_thrp_create+0x5fb
              libc.so.1`pthread_create+0xc1
              read3.exe`runtime.asmsysvicall6+0x5a
              read3.exe`runtime.newosproc+0x17d
              read3.exe`runtime.newm1+0xdd
              read3.exe`runtime.newm+0xa6
              read3.exe`runtime.startm+0xe5
              read3.exe`runtime.wakep+0x66
              read3.exe`runtime.resetspinning+0x59
              read3.exe`runtime.schedule+0x2b9
              read3.exe`runtime.mstart1+0x93
              read3.exe`runtime.mstart+0x66
              read3.exe`runtime.tstart_sysvicall+0x42
              libc.so.1`_lwp_start

Note that pfiles output at the time of exec(2) shows only the expected files. We then see, on thread 3, an open() and immediate close() of /proc/self/lwp/5/lwpname from within the C library as part of setting the thread name. If this call overlapped with one of the open calls the test makes in the main thread, we would not get fd 4 as it would (briefly) be in use.

I'm not that familiar with the internals of the Go runtime, or why it is creating a new thread here. Merely running the test (not under tracing) in a loop is enough to hit the race with relative frequency:

 $ while :; do ./exec.test -test.v -test.run '^TestExtraFiles$'; done
=== RUN   TestExtraFiles
--- PASS: TestExtraFiles (0.55s)
PASS
=== RUN   TestExtraFiles
--- PASS: TestExtraFiles (0.66s)
PASS
=== RUN   TestExtraFiles
--- PASS: TestExtraFiles (0.56s)
PASS
=== RUN   TestExtraFiles
--- PASS: TestExtraFiles (0.98s)
PASS
=== RUN   TestExtraFiles
--- PASS: TestExtraFiles (0.56s)
PASS
=== RUN   TestExtraFiles
--- PASS: TestExtraFiles (0.60s)
PASS
=== RUN   TestExtraFiles
    exec_test.go:696: Run: exit status 1
        --- stdout:
        leaked parent file. fd = 5; want 4
        readlink("/proc/self/fd/4") = "", readlink /proc/self/fd/4: no such file or directory
        70423:  /tmp/TestExtraFiles445987867/001/read3.exe
          Current rlimit: 65536 file descriptors
           0: S_IFCHR mode:0666 dev:576,3 ino:671925534 uid:0 gid:3 rdev:38,2
              O_RDONLY|O_LARGEFILE
              /dev/null
              offset:0
           1: S_IFIFO mode:0000 dev:573,0 ino:1707977 uid:1000 gid:15000 rdev:0,0
              O_RDWR
           2: S_IFIFO mode:0000 dev:573,0 ino:1707978 uid:1000 gid:15000 rdev:0,0
              O_RDWR
           3: S_IFREG mode:0600 dev:581,26 ino:3349940028 uid:1000 gid:15000 size:12
              O_RDWR|O_CREAT|O_EXCL|O_LARGEFILE
              /tmp/027677868
              offset:12
           5: S_IFREG mode:0755 dev:581,26 ino:3878790049 uid:1000 gid:15000 size:2260397
              O_RDONLY|O_LARGEFILE FD_CLOEXEC
              offset:0
           6: S_IFPORT mode:0000 dev:587,0 uid:1000 gid:15000 size:0
           7: S_IFIFO mode:0000 dev:573,0 ino:1707980 uid:1000 gid:15000 rdev:0,0
              O_RDWR|O_NONBLOCK FD_CLOEXEC
        --- stderr:
--- FAIL: TestExtraFiles (0.72s)
...

From some limited experiments, it seems that setting GOMAXPROCS=1 when running the test loop makes this go away. Running the read3.exe child from within the test with an explicit GOMAXPROCS=1 (or something like that) seems like it would be enough to fix this test. Does that make sense?

@jclulow
Copy link
Contributor Author

jclulow commented Nov 14, 2020

Indeed, this seems to help:

diff --git a/src/os/exec/exec_test.go b/src/os/exec/exec_test.go
index cd3d759ebc..2ad82033cb 100644
--- a/src/os/exec/exec_test.go
+++ b/src/os/exec/exec_test.go
@@ -684,20 +684,23 @@ func TestExtraFiles(t *testing.T) {
                var cancel context.CancelFunc
                ctx, cancel = context.WithDeadline(ctx, deadline)
                defer cancel()
        }

        c = exec.CommandContext(ctx, exe)
        var stdout, stderr bytes.Buffer
        c.Stdout = &stdout
        c.Stderr = &stderr
        c.ExtraFiles = []*os.File{tf}
+       if runtime.GOOS == "illumos" {
+               c.Env = append(os.Environ(), "GOMAXPROCS=1")
+       }
        err = c.Run()
        if err != nil {
                t.Fatalf("Run: %v\n--- stdout:\n%s--- stderr:\n%s", err, stdout.Bytes(), stderr.Bytes())
        }
        if stdout.String() != text {
                t.Errorf("got stdout %q, stderr %q; want %q on stdout", stdout.String(), stderr.String(), text)
        }
 }

@ianlancetaylor
Copy link
Contributor

I assume that the C library always reads from /proc/self/lwp/5/lwpname.

In https://golang.org/cl/228639 I split the code into a separate program, so that it could run without the C library, and avoid any cases where the C library might open a file. But, of course, on illumos we can't avoid the C library. So I think we should just skip the test entirely on illumos.

@jclulow
Copy link
Contributor Author

jclulow commented Nov 14, 2020

Looking into this a little more, it seems like we could probably void making this particular use of proc in the OS because Go is not presently setting thread names. At the moment we seem to be setting the name (to nothing) unconditionally here. (I have filed illumos bug 13312 to cover investigating this).

In the interim, though, the diff I suggested above (to use GOMAXPROCS=1) seems to avoid creating more threads in the child program, and would allow the test to continue to catch any leaked fds that are specific interactions with illumos subsystems (e.g., event ports)?

@ianlancetaylor
Copy link
Contributor

OK, sure, let's try it. Can you send a change? The code needs an appropriate comment that should mention this issue. Thanks.

@gopherbot
Copy link

Change https://golang.org/cl/273966 mentions this issue: os/exec: constrain thread usage in leaked descriptor test on illumos

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-illumos
Projects
None yet
Development

No branches or pull requests

5 participants