Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

build: misc/cgo/testcshared failure #23784

Closed
Invizory opened this issue Feb 12, 2018 · 25 comments
Closed

build: misc/cgo/testcshared failure #23784

Invizory opened this issue Feb 12, 2018 · 25 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker Testing An issue that has been verified to require only test changes, not just a test failure.
Milestone

Comments

@Invizory
Copy link
Contributor

What version of Go are you using (go version)?

I'm using go version go1.9.4 linux/amd64 to build the sources from the latest master, i.e. go version devel +829b64c1ea linux/amd64.

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/inviz/go"
GORACE=""
GOROOT="/usr/lib/go-1.9"
GOTOOLDIR="/usr/lib/go-1.9/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build762069868=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"

What did you do?

cd go/src
./all.bash

What did you expect to see?

ALL TESTS PASSED

What did you see instead?

Building Go cmd/dist using /usr/lib/go-1.9.
Building Go toolchain1 using /usr/lib/go-1.9.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
Building Go toolchain3 using go_bootstrap and Go toolchain2.
Building packages and commands for linux/amd64.

##### Testing packages.

...

##### ../misc/cgo/testcshared
--- FAIL: TestUnexportedSymbols (2.95s)
        cshared_test.go:349: run: [go build -buildmode=c-shared -installsuffix testcshared -o libgo2.so libgo2]
        cshared_test.go:363: run: [gcc -fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build486143289=/tmp/go-build -gno-record-gcc-switches -I pkg/linux_amd64_shared -o testp2 main2.c -Wl,--no-as-needed libgo2.so]
        cshared_test.go:369: command failed: [./testp2]
                exit status 2
                BUG: failed to read any data from pipe

FAIL

This test is very unstable, though, and may fail or may pass with approximate probability of 1/2.

@ianlancetaylor
Copy link
Contributor

That test runs these commands from the misc/cgo/testcshared directory:

GOPATH=`pwd` go build -buildmode=c-shared -installsuffix testcshared -o libgo2.so libgo2
gcc -o testp2 main2.c -Wl,--no-as-needed libgo2.so
LD_LIBRARY_PATH=. ./testp2

(Running those commands will leave the files libgo2.so and testp2 behind, you'll want to clean them up.)

Can you confirm that running those commands by themselves demonstrates the same failure.

@ianlancetaylor ianlancetaylor added this to the Go1.10 milestone Feb 12, 2018
@ianlancetaylor ianlancetaylor added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Testing An issue that has been verified to require only test changes, not just a test failure. labels Feb 12, 2018
@ianlancetaylor
Copy link
Contributor

It's possible that the test will run the wrong version of go; which go command is first on your PATH?

@Invizory
Copy link
Contributor Author

Invizory commented Feb 12, 2018

@ianlancetaylor,

It's possible that the test will run the wrong version of go; which go command is first on your PATH?

$ which go
/usr/lib/go-1.9/bin/go
$ go version
go version go1.9.4 linux/amd64

This version is taken from ppa:gophers/archive.

Can you confirm that running those commands by themselves demonstrates the same failure.

Yes, it gives:

$ LD_LIBRARY_PATH=. ./testp2
BUG: failed to read any data from pipe

See also strace output.

@ianlancetaylor
Copy link
Contributor

Thanks for the strace output. That was going to be my next question. But what we need is actually the output from strace -f. Can you show that? Thanks.

@Invizory
Copy link
Contributor Author

Invizory commented Feb 12, 2018

@ianlancetaylor, oh, you're right, here it is: strace -f -v -s 100 output.

@ianlancetaylor
Copy link
Contributor

Thanks. That is strange. I see this:

[pid 18664] pipe([3, 4])                = 0
[pid 18664] dup2(3, 100 <unfinished ...>

and then the call to dup2 never completes, despite many calls to nanosleep for 1000000 nanoseconds. The next PID 18664 shows up is at the very end of the trace when all the processes are exiting.

[pid 18664] <... dup2 resumed>)         = ?
[pid 18664] +++ exited with 2 +++

What could possibly cause dup2 to hang like that? Are you using a FUSE file system of any sort?

@ianlancetaylor ianlancetaylor modified the milestones: Go1.10, Go1.11 Feb 13, 2018
@Invizory
Copy link
Contributor Author

Invizory commented Feb 13, 2018

@ianlancetaylor,

No, there are no FUSE file systems mounted inside my home directory (mount).

I've also noticed that increasing the number of iterations in misc/cgo/testcshared/main2.c from 1000 to 10000 helps to PASS the test, i.e. dup2 resumes in the libgo2 thread after more iterations.

@ianlancetaylor
Copy link
Contributor

Do you have an unusually slow system? 1000 iterations of a sleep for 1000000 nanoseconds is a full second. How could it possibly take a full second for dup2 to complete? It's not like the program is doing anything, it's mostly sleeping. Can you show strace -ftt output?

@Invizory
Copy link
Contributor Author

@ianlancetaylor,

Do you have an unusually slow system?

I have 32×2.2GHz CPU cores with small load average: 0.82, 1.25, 1.12.

Can you show strace -ftt output?

I'm confused because now I can't reproduce the bug even with 1000 iterations. However, I reproduced it with 300 iterations, here is the strace -ftt output.

@Invizory
Copy link
Contributor Author

@ianlancetaylor,

Well, I have reproduced it with 1000 iterations again after 340 more runs of ./testp2, see strace -ftt.

@ianlancetaylor
Copy link
Contributor

Thanks. That output does clearly show that dup2 did not complete for more than one second, even though the program spent most of its time doing a loop of nothing but

[pid 12133] 00:48:36.838609 nanosleep({tv_sec=0, tv_nsec=1000000}, NULL) = 0
[pid 12133] 00:48:36.839734 read(100, 0x7fff18a156e0, 20) = -1 EBADF (Bad file descriptor)

By comparison, on my laptop, I see

[pid 238401] 14:01:39.650477 dup2(3, 100 <unfinished ...>
...
[pid 238401] 14:01:39.705113 <... dup2 resumed> ) = 100

so the dup2 call completes in much less than 1/10 second.

I'm grasping at straws, but can you try adding

    sched_yield();

before the call to nanosleep in misc/cgo/testcshared/main2.c?

@Invizory
Copy link
Contributor Author

Invizory commented Feb 13, 2018

@ianlancetaylor,

Since it's very difficult to reproduce the bug now (for unknown reasons), I've collected some statistics instead on how much iterations it takes for different versions of code to finish with PASS (sorted raw data).

Here is the histogram for n = 5000 runs, the X-axis (tail stripped) denotes the number of iterations to finish, the Y-axis denotes the number of runs.

Histogram.

See also five extremal values:

  • without sched_yield: 9, 10, 11, 11, 12, ..., 258, 263, 274, 281, 392.
  • with sched_yield: 11, 11, 12, 12, 12, ..., 256, 259, 272, 334, 340.

So, according to the above mentioned, I don't see any significant difference at all.

@gopherbot
Copy link

Change https://golang.org/cl/93895 mentions this issue: misc/cgo/testcshared: increase sleep in TestUnexportedSymbols

@ianlancetaylor
Copy link
Contributor

Well, I don't know what's going on, but it sounds like an approach like https;//golang.org/cl/93895 will at least make it more likely to pass.

@Invizory
Copy link
Contributor Author

@ianlancetaylor,

Yep, seems like a reasonable approach.

@ianlancetaylor
Copy link
Contributor

The test currently uses file descriptor 100, which requires the process's fdtable to be expanded. Somehow that seems to take a long time. The other threads running (the main thread written in C and the sysmon thread) are somehow blocking that. I don't have enough kernel-fu to understand how. On my system the dup2 call does block for about 0.03 seconds, while the other threads are busy sleeping in loops. I guess that somehow, sometimes, the call blocks for over 2 seconds.

I'll try a different workaround, which is to only dup to file descriptor 30. That is less than the Linux kernel's default fdtable size of 32 or 64, so the dup2 call now completes immediately without blocking at all.

@gopherbot
Copy link

Change https://golang.org/cl/108537 mentions this issue: misc/cgo/testcshared: use file descriptor 30 for TestUnexportedSymbols

@FiloSottile
Copy link
Contributor

@ianlancetaylor this breaks the trybots on the 1.10 release branch as well, and is probably (?) also an issue on 1.9, can you look into backporting the fix?

https://go-review.googlesource.com/c/go/+/111715#message-c7b7c5f6de1d1ad25c8c7db15599e99e29a392a7

@gopherbot please open a backport issue for 1.10 and 1.9.

@gopherbot
Copy link

Backport issue(s) opened: #25277 (for 1.10), #25278 (for 1.9).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

@gopherbot
Copy link

Change https://golang.org/cl/111995 mentions this issue: [release-branch.go1.10] misc/cgo/testcshared: use file descriptor 30 for TestUnexportedSymbols

@gopherbot
Copy link

Change https://golang.org/cl/111996 mentions this issue: [release-branch.go1.9] misc/cgo/testcshared: use file descriptor 30 for TestUnexportedSymbols

gopherbot pushed a commit that referenced this issue May 8, 2018
…for TestUnexportedSymbols

Backport of CL 108537 to 1.10 release branch.

We were using file descriptor 100, which requires the Linux kernel to
grow the fdtable size. That step may sometimes require a long time,
causing the test to fail. Switch to file descriptor 30, which should
not require growing the fdtable.

Updates #23784
Fixes #25277

Change-Id: I9d25986f3b59bdeb04aa52407b24aa94712aedff
Reviewed-on: https://go-review.googlesource.com/111995
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
gopherbot pushed a commit that referenced this issue May 8, 2018
…or TestUnexportedSymbols

Backport of CL 108537 to 1.9 release branch.

We were using file descriptor 100, which requires the Linux kernel to
grow the fdtable size. That step may sometimes require a long time,
causing the test to fail. Switch to file descriptor 30, which should
not require growing the fdtable.

Updates #23784
Fixes #25278

Change-Id: I19ea6ab1724ec1807643d5111c44631e20be76b0
Reviewed-on: https://go-review.googlesource.com/111996
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@danicat
Copy link
Contributor

danicat commented Jul 31, 2018

Hi everyone! I've just downloaded the go code and ran the all.bash script to compile without any changes and got the following error (that resembles the original message):

##### ../misc/cgo/testcshared
--- FAIL: TestUnexportedSymbols (0.64s)
    cshared_test.go:353: run: [go build -buildmode=c-shared -installsuffix testcshared -o libgo2.so libgo2]
    cshared_test.go:367: run: [gcc -fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build783986620=/tmp/go-build -gno-record-gcc-switches -I pkg/linux_amd64_shared -o testp2 main2.c -Wl,--no-as-needed libgo2.so]
    cshared_test.go:373: run: [./testp2]
    cshared_test.go:376: # NAN ChangeLog
**
FAIL
2018/07/31 23:09:06 Failed: exit status 1

I've tried two other times and the message on the last line (after "cshared_test.go:376") changed all times, suggesting some random values were being read from the fd 30. I've recompiled the binaries to use fd 100 like it was originally and I got the tests to pass.

go env:

$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/danielapetruzalek/.cache/go-build"
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/danielapetruzalek/go"
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build074359831=/tmp/go-build -gno-record-gcc-switches"

I'm testing Go on the new Google Pixelbook.

I wonder if instead of choosing an arbitrary number we could have a more safe way to select the fd? Or use another technique to run this test?

@ianlancetaylor
Copy link
Contributor

@danicat This issue is closed. Please open a new issue for that problem. Thanks.

@danicat
Copy link
Contributor

danicat commented Aug 1, 2018

Sure @ianlancetaylor. Done!

@golang golang locked and limited conversation to collaborators Aug 1, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker Testing An issue that has been verified to require only test changes, not just a test failure.
Projects
None yet
Development

No branches or pull requests

6 participants