Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os/exec: TestExtraFiles is flaky on 32-bit linux builders #25628

Closed
ALTree opened this issue May 29, 2018 · 52 comments
Closed

os/exec: TestExtraFiles is flaky on 32-bit linux builders #25628

ALTree opened this issue May 29, 2018 · 52 comments
Labels
FrozenDueToAge help wanted NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Linux Testing An issue that has been verified to require only test changes, not just a test failure.
Milestone

Comments

@ALTree
Copy link
Member

ALTree commented May 29, 2018

I've seen several of these:

--- FAIL: TestExtraFiles (0.10s)
	exec_test.go:607: Run: exit status 1; stdout "leaked parent file. fd = 66; want 65\n", stderr ""
FAIL
FAIL	os/exec	0.687s

on the linux-386-sid builder in the past weeks. Here's one:

https://build.golang.org/log/b95e0220b4130af6c1bcfed59ec934568717d0ef

@ALTree ALTree added Testing An issue that has been verified to require only test changes, not just a test failure. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels May 29, 2018
@ianlancetaylor ianlancetaylor added this to the Go1.11 milestone May 31, 2018
@ianlancetaylor
Copy link
Contributor

That is odd. The errors show relatively large descriptor numbers. It implies that something in the system is opening a large descriptor before running make.bash.

@bcmills
Copy link
Contributor

bcmills commented Jun 1, 2018

There is another recurring failure mode in os/exec on linux-386-sid that produces broken pipe instead. Any idea whether they're related?

An example failure (https://build.golang.org/log/cef4b584d2ccb30621e2965de0517f6077aa2f92):

testing: can't write /tmp/go-build619681385/b056/testlog.txt: write /tmp/go-build619681385/b056/testlog.txt: broken pipe
FAIL	os/exec	0.697s

@ianlancetaylor
Copy link
Contributor

Yes, they could be related. The os/exec package tries to guess which descriptor is being used for testlog.txt; if it guesses wrong, because of an unexpectedly open descriptor, that could cause the broken pipe error.

@ALTree ALTree changed the title os/exec: TestExtraFiles is flaky on linux-386-sid bulder os/exec: TestExtraFiles is flaky on linux-386-sid builder Jun 2, 2018
@ianlancetaylor
Copy link
Contributor

2018-05-09T00:05:52-c1492b6/linux-386-sid
2018-05-11T14:40:34-e405c9c/linux-386-sid
2018-05-11T23:11:53-e6a9335/linux-386-sid
2018-05-12T06:21:01-f94b5a8/linux-386-sid
2018-05-14T13:45:18-c06f027/linux-386-sid
2018-05-14T14:12:31-7bac2a9/linux-386-sid
2018-05-14T17:59:59-15f2cbf/linux-386-sid
2018-05-15T16:29:04-212c947/linux-386-sid
2018-05-16T12:01:16-4102e6f/linux-386-sid
2018-05-21T13:18:16-f2cde55/linux-386-sid
2018-05-22T14:15:44-b812eec/linux-386-sid
2018-05-22T15:55:01-840f25b/linux-386-sid
2018-05-22T18:47:03-3d15f76/linux-386-sid
2018-05-22T18:49:31-f045ddc/linux-386-sid
2018-05-22T20:27:52-402dd10/linux-386-sid
2018-05-22T20:44:03-3c36b8b/linux-386-sid
2018-05-23T01:45:00-92bdfab/linux-386-sid
2018-05-23T14:13:24-ef880a2/linux-386-sid
2018-05-23T19:01:58-3f89214/linux-386-sid
2018-05-24T17:19:09-8aa57f1/linux-386-sid
2018-05-24T18:03:21-b0b0921/linux-386-sid
2018-05-24T20:26:47-9eb2c43/linux-386-sid
2018-05-24T22:08:00-30b6bc3/linux-386-sid
2018-05-25T08:42:09-e52c1c0/linux-386-sid
2018-05-25T13:09:29-bdaadeb/linux-386-sid
2018-05-27T07:33:08-c1d9d1f/linux-386-sid
2018-05-29T02:39:16-002c764/linux-386-sid
2018-05-29T17:46:19-ef53de8/linux-386-sid
2018-05-29T20:27:03-4790181/linux-386-sid
2018-05-29T21:54:57-210a9e0/linux-386-sid
2018-05-30T09:10:04-7d704a9/linux-386-sid
2018-05-30T13:40:36-e023d5b/linux-386-sid
2018-05-31T13:29:50-caf9686/linux-386-sid
2018-05-31T15:13:02-08e2e88/linux-386-sid
2018-05-31T17:29:10-424c215/linux-386-sid
2018-06-01T17:36:40-6cfeedb/linux-386-sid
2018-06-04T18:33:41-d6c3b0a/linux-386-sid
2018-06-05T00:29:50-75c1aed/linux-386-sid
2018-06-06T17:59:08-11f6cb4/linux-386-sid
2018-06-06T19:20:12-ca8ec69/linux-386-sid
2018-06-07T07:00:46-e5f0c1f/linux-386-sid
2018-06-07T14:31:33-35778aa/linux-386-sid
2018-06-07T15:47:29-0e1517c/linux-386-sid
2018-06-07T16:17:04-7b08e61/linux-386-sid
2018-06-08T22:44:03-77d5c62/linux-386-sid
2018-06-10T05:49:55-a5f8303/linux-386-sid
2018-06-11T12:13:11-30a63ec/linux-386-sid
2018-06-11T17:45:36-9e9ff56/linux-386-sid
2018-06-11T20:58:09-92f8acd/linux-386-sid
2018-06-12T17:05:39-ec98933/linux-386-sid
2018-06-12T20:22:50-44b826b/linux-386-sid
2018-06-12T20:37:02-bd83774/linux-386-sid
2018-06-13T17:55:32-a1b85ee/linux-386-sid
2018-06-13T20:32:08-fb4fb04/linux-386-sid
2018-06-13T21:43:06-1e721cf/linux-386-sid
2018-06-13T21:57:41-fc0e1d2/linux-386-sid
2018-06-13T22:01:28-efddc16/linux-386-sid
2018-06-14T13:21:46-1031380/linux-386-sid
2018-06-15T02:01:04-8998c55/linux-386-sid
2018-06-21T22:55:55-4b34289/linux-386-sid
2018-06-22T14:27:22-00eac89/linux-386-sid

@ianlancetaylor
Copy link
Contributor

Since this doesn't seem to be happening anywhere else, this is most likely a problem specific to the linux-386-sid builder. Something on that system is causing the os/exec tests to run with an unexpectedly open file descriptor.

@bradfitz Is there anything unusual about the linux-386-sid builder?

@ianlancetaylor ianlancetaylor modified the milestones: Go1.11, Go1.12 Jun 27, 2018
@bradfitz
Copy link
Contributor

The 386-sid builder is an amd64 Docker container (like all our linux-386-*) builders with the "debian:sid" base layer (https://github.com/golang/build/blob/master/env/linux-x86-sid/Dockerfile). We occasionally re-build it to reflect the ever-changing Debian sid.

Last updates:

  • April 11, 2017
  • Nov 2, 2017
  • May 8, 2018 (current)

@ALTree
Copy link
Member Author

ALTree commented Jul 7, 2018

This was reported to also happen on a linux_386 Virtualbox machine with an updated Ubuntu Mate 18.04 LTS and boostrap Go version 1.4.3 (Issue #26261).

@mikesmitty
Copy link

Ran into this issue with 1.11rc2 on Fedora 27 i386 as well:

--- FAIL: TestExtraFiles (0.07s)
    exec_test.go:607: Run: exit status 1; stdout "leaked parent file. fd = 66; want 65\n", stderr ""
FAIL
FAIL	os/exec	1.074s

@bobvanderlinden
Copy link

bobvanderlinden commented Sep 15, 2018

This happens on NixOS as well. There it seems to be happening reliably as I can reproduce it on my laptop again and again. The NixOS build server also reproduces the error for every build.

This happens on go 1.11.

For reference: https://hydra.nixos.org/build/80999166/nixlog/12

bobvanderlinden added a commit to bobvanderlinden/nixpkgs that referenced this issue Sep 15, 2018
This seems to only happen on i686 on go 1.11. It also happens on
Debian. For more information:

golang/go#25628
Mic92 pushed a commit to NixOS/nixpkgs that referenced this issue Sep 16, 2018
This seems to only happen on i686 on go 1.11. It also happens on
Debian. For more information:

golang/go#25628
(cherry picked from commit 0bf6b44)
@ianlancetaylor
Copy link
Contributor

@bradfitz The test will try to run lsof when it finds a leaked FD. That isn't happening on linux-386-sid. Could we get the lsof package installed on that system?

@bradfitz
Copy link
Contributor

@ianlancetaylor, I filed #29347 for somebody to do that.

@bcmills
Copy link
Contributor

bcmills commented Jun 14, 2019

Still flaky, but no lsof yet.
https://build.golang.org/log/337a9a3bfffb327f66750c82f0ab47b9f796576b

@bcmills
Copy link
Contributor

bcmills commented Jul 3, 2019

@andybons andybons modified the milestones: Go1.13, Go1.14 Jul 8, 2019
@rsc
Copy link
Contributor

rsc commented Mar 24, 2020

I am a bit confused about this code:

    for fd := uintptr(3); fd <= 100; fd++ {
            if poll.IsPollDescriptor(fd) {
                    continue
            }
            // We have no good portable way to check whether an FD is open.
            // We use NewFile to create a *os.File, which lets us
            // know whether it is open, but then we have to cope with
            // the finalizer on the *os.File.
            f := os.NewFile(fd, "")
            if _, err := f.Stat(); err != nil {
                    // Close the file to clear the finalizer.
                    // We expect the Close to fail.
                    f.Close()
            } else {
                    fmt.Printf("fd %d open at test start\n", fd)
                    haveUnexpectedFDs = true
                    // Use a global variable to avoid running
                    // the finalizer, which would close the FD.
                    unfinalizedFiles = append(unfinalizedFiles, f)
            }
    }

This is in a func init. I don't understand what is starting the network at all. Why is poll.IsPollDescriptor required?

If it is required, then it looks like this code is racing against netpollinit, in which case (1) IsPollDescriptor should be using atomics, if only to prevent the compiler from optimizing away the repeated loads, and (2) it needs to be double-checked after we detect an issue. The epfd might be -1 at the start of the loop, but its creation is racing with our os.NewFile, and epfd wins: epfd gets the fd we want, and we get fd+1. In that case, we should fd++; continue instead of failing.

@ianlancetaylor
Copy link
Contributor

The network is now started whenever you create a timer, which could conceivably happen in an init function somewhere in the test. Likely nothing is starting the network here, but calling IsPollDescriptor is cheap. There shouldn't be a race unless some init function starts a goroutine.

For this issue, I don't think it matters. The point of that code was to report whether any descriptors are open when the test starts. The logs show that they aren't. If there were somehow a race on IsPollDescriptor, it would show that there were descriptors open (the poll descriptors). We aren't seeing that.

@ianlancetaylor
Copy link
Contributor

I sent https://golang.org/cl/225278 for the os.Readlink suggestion.

@gopherbot
Copy link

Change https://golang.org/cl/225278 mentions this issue: os/exec: for TestExtraFiles failure, print readlink of unexpeced fd

gopherbot pushed a commit that referenced this issue Mar 24, 2020
For #25628

Change-Id: If1dce7ba9310e1418e67b9954c989471b775a28e
Reviewed-on: https://go-review.googlesource.com/c/go/+/225278
Run-TryBot: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Bryan C. Mills <bcmills@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@bcmills
Copy link
Contributor

bcmills commented Mar 27, 2020

We have a hit with the new readlink logging, but it doesn't seem particularly illuminating:

2020-03-26T16:12:18-d1ecfcc/linux-386-clang

--- FAIL: TestExtraFiles (1.31s)
    exec_test.go:686: Run: exit status 1
        --- stdout:
        leaked parent file. fd = 66; want 65
        readlink("/proc/self/fd/65") = "", readlink /proc/self/fd/65: no such file or directory

@ianlancetaylor
Copy link
Contributor

That's just weird. The code looks fine to me. Seems like this could only open if some other goroutine briefly opens and closes a descriptor. But I can't think of what would do that. Or why it would be specific to these builders.

I sent CL 225799 to run the test under strace. Perhaps that will show us something.

@gopherbot
Copy link

Change https://golang.org/cl/225799 mentions this issue: os/exec: add temporary debugging code for #25628

gopherbot pushed a commit that referenced this issue Mar 27, 2020
On linux-386 builders run the TestExtraFiles subprocess under strace,
in hopes of finding out where the unexpected descriptor is coming from.

For #25628

Change-Id: I9a62d6a5192a076525a616ccc71de74bbe7ebd58
Reviewed-on: https://go-review.googlesource.com/c/go/+/225799
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Bryan C. Mills <bcmills@google.com>
@bcmills
Copy link
Contributor

bcmills commented Apr 7, 2020

The new debugging code seems to be deadlocking or otherwise timing out instead of generating output.

2020-04-04T01:01:04-fff7509/linux-386-sid
2020-04-03T17:08:32-3103495/linux-386-sid
2020-03-31T04:31:04-d98023e/linux-386-clang

@gopherbot
Copy link

Change https://golang.org/cl/227517 mentions this issue: os/exec: use subprocess deadline in TestExtraFiles

gopherbot pushed a commit that referenced this issue Apr 8, 2020
Try to get some output even if the subprocess hangs.

For #25628

Change-Id: I4cc0a8f2c52b03a322b8fd0a620cba37b06ff10a
Reviewed-on: https://go-review.googlesource.com/c/go/+/227517
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Bryan C. Mills <bcmills@google.com>
@bcmills
Copy link
Contributor

bcmills commented Apr 9, 2020

Looks like we need a longer grace period on the deadline https://build.golang.org/log/8cdf7d284333d1fd9307916177b4ea66f1d898f0):

panic: test timed out after 3m0s
…
goroutine 38 [runnable]:
strings.Index(0x3c43c6b4, 0xb86e8d0, 0x8343ff8, 0x1, 0x48)
	/workdir/go/src/strings/strings.go:1019 +0x422
strings.genSplit(0x3c43c6b4, 0xb86e8d0, 0x8343ff8, 0x1, 0x0, 0x2e9bee, 0x2, 0x12, 0x0)
	/workdir/go/src/strings/strings.go:251 +0xc2
strings.Split(...)
	/workdir/go/src/strings/strings.go:299
testing.(*common).decorate(0xa52ef00, 0x3a800000, 0xd4aaf84, 0x4, 0xa42ed80, 0xf7cb3750)
	/workdir/go/src/testing/testing.go:493 +0x1e7
testing.(*common).logDepth(0xa52ef00, 0x3a800000, 0xd4aaf84, 0x3)
	/workdir/go/src/testing/testing.go:693 +0x1c9
testing.(*common).log(...)
	/workdir/go/src/testing/testing.go:669
testing.(*common).Fatalf(0xa52ef00, 0x8350ec8, 0x24, 0xa53bf54, 0x3, 0x3)
	/workdir/go/src/testing/testing.go:730 +0x65
os/exec_test.TestExtraFiles(0xa52ef00)
	/workdir/go/src/os/exec/exec_test.go:704 +0x93a
testing.tRunner(0xa52ef00, 0x8360064)
	/workdir/go/src/testing/testing.go:1022 +0xb4
created by testing.(*T).Run
	/workdir/go/src/testing/testing.go:1073 +0x2ad

I'll send a CL.

@gopherbot
Copy link

Change https://golang.org/cl/227765 mentions this issue: os/exec: extend grace period in TestExtraFiles to 10% of overall deadline

gopherbot pushed a commit that referenced this issue Apr 10, 2020
…line

Updates #25628

Change-Id: I938a7646521b34779a3a57833e7ce9d508b58faf
Reviewed-on: https://go-review.googlesource.com/c/go/+/227765
Run-TryBot: Bryan C. Mills <bcmills@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@bcmills
Copy link
Contributor

bcmills commented Apr 13, 2020

It worked! We have some strace outputs:
https://build.golang.org/log/76649c297be946b148e1c8b424637ad8d8cf177b
https://build.golang.org/log/5d21f8205e46d20f4ad3afb3bd5ed5a6ebd018df

@ianlancetaylor
Copy link
Contributor

Excellent. In both cases the extraneous file descriptor was due to opening /sys/devices/system/cpu/online. It seems to be associated with calls to the set_robust_list system call, which is called among other places by a newly started C thread. I think the most likely sequence of events here is a libc starting a new thread when a cgo-using program calls pthread_create via _cgo_thread_start. _cgo_thread_start calls the C malloc function. When malloc needs to create a new arena for the first time, it calls __get_nprocs. __get_nprocs opens /sys/devices/system/cpu/online to see how many processors the system has.

But I do not know why this only happens on 386 systems. I may be missing a step.

@ianlancetaylor
Copy link
Contributor

I filed https://sourceware.org/bugzilla/show_bug.cgi?id=25817 for glibc to stop doing this.

@gopherbot
Copy link

Change https://golang.org/cl/228099 mentions this issue: os/exec: create extra threads when starting a subprocess

gopherbot pushed a commit that referenced this issue Apr 14, 2020
TestExtraFiles seems to be flaky on GNU/Linux systems when using cgo
because creating a new thread will call malloc which can create a new
arena which can open a file to see how many processors there are.
Try to avoid the flake by creating several new threads at process
startup time.

For #25628

Change-Id: Ie781acdbba475d993c39782fe172cf7f29a05b24
Reviewed-on: https://go-review.googlesource.com/c/go/+/228099
Reviewed-by: Bryan C. Mills <bcmills@google.com>
@bcmills
Copy link
Contributor

bcmills commented Apr 16, 2020

Hmm, the extra threads seem to have made it flaky on 64-bit builders:

2020-04-16T03:19:50-ab3bd2c/linux-arm64-packet
2020-04-15T20:12:13-aa3413c/linux-mips64le-mengzhuo

That's... unexpected. Maybe we should skip this test on cgo-enabled configurations?

@ianlancetaylor
Copy link
Contributor

I think I must have missed something.

I think you're right that we should run this test without cgo. I think that just skipping if cgo is enabled would effectively mean that we never run the test. But we should be able to make this work.

@gopherbot
Copy link

Change https://golang.org/cl/228639 mentions this issue: os/exec: build TestExtraFiles subprocess without cgo

@erifan
Copy link

erifan commented Apr 20, 2020

Hi, does TestExtraFiles now require 'lsof' to be installed in the test environment? Because I still see this test fail on our linux/amd64 and linux/arm64 machines.

Repo: 4eaf855
Arch: linux/amd64

00:52:34 --- FAIL: TestExtraFiles (0.14s)
00:52:34     exec_test.go:708: Run: exit status 1
00:52:34         --- stdout:
00:52:34         leaked parent file. fd = 67; want 65
00:52:34         readlink("/proc/self/fd/65") = "", readlink /proc/self/fd/65: no such file or directory
00:52:34         --- stderr:
00:52:34         lsof -p 37167 failed: exec: "lsof": executable file not found in $PATH
00:52:34 FAIL
00:52:34 FAIL	os/exec	1.244s

Arch: linux/arm64

20:59:22 --- FAIL: TestExtraFiles (0.19s)
20:59:22     exec_test.go:708: Run: exit status 1
20:59:22         --- stdout:
20:59:22         leaked parent file. fd = 5; want 4
20:59:22         readlink("/proc/self/fd/4") = "/sys/devices/system/cpu/online", <nil>
20:59:22         --- stderr:
20:59:22         lsof -p 19941 failed: exec: "lsof": executable file not found in $PATH
20:59:22 FAIL
20:59:22 FAIL	os/exec	1.095s

@ianlancetaylor
Copy link
Contributor

TestExtraFiles does not require lsof to be installed. It invokes lsof if it fails, in an attempt to find out why it failed. But if it does not fail, it does not invoke lsof.

Fortunately it looks like your failure appeared before https://golang.org/cl/228639 was committed, as I think that CL should fix the test.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge help wanted NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Linux Testing An issue that has been verified to require only test changes, not just a test failure.
Projects
None yet
Development

No branches or pull requests

10 participants