Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os/signal: fixedbugs/issue21576.go flake on linux-ppc64le-power9osu builder #34836

Closed
bcmills opened this issue Oct 11, 2019 · 10 comments
Closed
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Testing An issue that has been verified to require only test changes, not just a test failure.
Milestone

Comments

@bcmills
Copy link
Contributor

bcmills commented Oct 11, 2019

From the linux-ppc64le-power9osu builder (https://build.golang.org/log/8415a2011de6d02d89ecb3587ca3ac2735d86b3d):

# go run run.go -- fixedbugs/issue21576.go
exit status 1
2019/10/10 03:23:05 Unmatched error message "fatal error: all goroutines are asleep - deadlock!":
in
exit status 1

FAIL	fixedbugs/issue21576.go	8.596s

In #21576 (comment), @odeke-em notes:

Too low of a deadline and it could become a flake during tests, too high of
a deadline and it could stall building the Go tree and running other tests.
5seconds is a heuristic that I thought was a decently good enough time for
go run to have successfully started.

Does that explain this failure?

Would it be possible to code the test in a way that does not depend on timing? (What is the harm in having a higher timeout, assuming that the test has not regressed?)

@bcmills bcmills added Testing An issue that has been verified to require only test changes, not just a test failure. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Oct 11, 2019
@bcmills bcmills added this to the Go1.14 milestone Oct 11, 2019
@shawndx
Copy link
Contributor

shawndx commented Oct 11, 2019

One of my colleagues observed the same failure on arm64 platform, we guess that might be due to too many testcases accessing /tmp at the same time slows down the building phase, I'm also a little bit confused why not setting a larger time-out.

@ianlancetaylor
Copy link
Contributor

Yes, I think a timeout could explain that problem.

The situation would be clearer if the test printed the error value.

I think it would be fine to increase the timeout. Since we expect the test to pass, we can set the timeout to be much larger. Yes, it will take longer to fail, but the length of time it takes for a test to fail is not all that interesting. Let's make it at least one minute.

@odeke-em
Copy link
Member

The situation would be clearer if the test printed the error value.

We print out the output from .CombinedOutput() which I thought would capture the actual error, but sure we can add the error value.

I think it would be fine to increase the timeout. Since we expect the test to pass, we can set the timeout to be much larger. Yes, it will take longer to fail, but the length of time it takes for a test to fail is not all that interesting. Let's make it at least one minute.

Sure, I've sent CL https://go-review.googlesource.com/c/go/+/200519 updating the time to 1 minute.

To sidetrack a little, linux-ppc64le-power9osu has been failing with mysterious errors such as in #34658 where the source code in the test doesn't close the connection but somehow we are getting back client_test.go:1353: error = read tcp 127.0.0.1:42466->127.0.0.1:40975: use of closed network connection; want errRequestCanceled
and then there is also #34837 (comment) with deadlocks which I am seeing on 9 different builder but perhaps it might be related to testing or the runtime.

@gopherbot
Copy link

Change https://golang.org/cl/200519 mentions this issue: test/fixedbugs: bump issue21576.go's timeout to 1min

@odeke-em
Copy link
Member

@shawn-xdji can you please try that patch, keeping it at 5 * time.Second but just printing out the error value, if possible? Thanks.

@dianhong01
Copy link
Contributor

dianhong01 commented Oct 11, 2019

@shawn-xdji can you please try that patch, keeping it at 5 * time.Second but just printing out the error value, if possible? Thanks.

Well, in my arm device, I got a error value when run all.bash, like below. Error value is "signal: killed". Does this error value mean that go routine has been killed before checking deadlock?
**
../test
go run run.go -- fixedbugs/issue21576.go
exit status 1
2019/10/10 08:01:15 XDH--err is signal: killed
2019/10/10 08:01:15 Unmatched error message "fatal error: all goroutines are asleep - deadlock!":
in
exit status 1

FAIL fixedbugs/issue21576.go 12.680s
2019/10/10 08:01:45 Failed: exit status 1
**

This error has a high frequency in arm device which has 200+ cores, especially when device handled a lot of jobs. Well in some arm device with 128 cores, this error almost never happened.

@ianlancetaylor
Copy link
Contributor

@odeke-em Printing the error value helps because in this case there is no output.

@dianhong01 Thanks, an error value of "signal: killed" confirms that the test timed out. (When the timeout expires, the parent will send SIGKILL to the child, so the child will have no output and an error value of "signal: killed".)

@odeke-em
Copy link
Member

Cool thanks! I've added the error value too as well as increased the timeout to 1minute. @shawn-xdji please don't hesitate to ping here more in case of anything.

@shawndx
Copy link
Contributor

shawndx commented Oct 12, 2019

@ianlancetaylor, Hi Ian, if setting the context time-out to, say 5 seconds, and let the child 'go run' sleep for 10 seconds, the child process is killed after 10 seconds, is it due to parent sending SIGKILL at T+10s or child not responding any signal during that period? Thanks.

@odeke-em @dianhong01 is the colleague I mentioned.

Attaching a case revised from the one provided by @odeke-em , removing the blocking channel bypass deadlock.

test.go.txt

@ianlancetaylor
Copy link
Contributor

Hi Ian, if setting the context time-out to, say 5 seconds, and let the child 'go run' sleep for 10 seconds, the child process is killed after 10 seconds, is it due to parent sending SIGKILL at T+10s or child not responding any signal during that period? Thanks.

You can tell by looking at the exit status. An exit status of "signal: killed" means that it was killed by the parent process. I'm not sure what exit status will be used for a timeout, but it's definitely not "signal: killed".

@golang golang locked and limited conversation to collaborators Oct 14, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Testing An issue that has been verified to require only test changes, not just a test failure.
Projects
None yet
Development

No branches or pull requests

6 participants