Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: TestGdbPython flaky on linux #24616

Closed
bradfitz opened this issue Mar 30, 2018 · 30 comments
Closed

runtime: TestGdbPython flaky on linux #24616

bradfitz opened this issue Mar 30, 2018 · 30 comments
Labels
FrozenDueToAge help wanted NeedsFix The path to resolution is known, but the work has not been done. OS-Linux Testing An issue that has been verified to require only test changes, not just a test failure.
Milestone

Comments

@bradfitz
Copy link
Contributor

Just saw this TestGdbPython flake on linux-amd64:

https://storage.googleapis.com/go-build-log/149e81d1/linux-amd64_fd28318f.log

--- FAIL: TestGdbPython (3.49s)
	runtime-gdb_test.go:59: gdb version 7.7
	runtime-gdb_test.go:193: gdb output: Loading Go Runtime support.
		Loaded  Script                                                                 
		Yes     /tmp/workdir/go/src/runtime/runtime-gdb.py                             
		Breakpoint 1 at 0x47c190: file /tmp/workdir/go/src/fmt/print.go, line 263.
		
		Breakpoint 1, fmt.Println (a=..., err=..., n=<optimized out>) at /tmp/workdir/go/src/fmt/print.go:263
		263	func Println(a ...interface{}) (n int, err error) {
		BEGIN info goroutines
		* 1 running  runtime.systemstack_switch
		* 2 running  runtime.forcegchelper
		  3 waiting  runtime.gopark
		  4 runnable runtime.runfinq
		END
		#1  0x00000000004828a0 in main.main () at /tmp/go-build461513624/main.go:14
		14		fmt.Println("hi")
		BEGIN print mapvar
		$1 = map[string]string = {["ghi"] = "jkl", ["abc"] = "def"}
		END
		BEGIN print strvar
		$2 = "abc"
		END
		BEGIN info locals
		mapvar = map[string]string = {["ghi"] = "jkl", ["abc"] = "def"}
		slicevar =  []string = {"def"}
		strvar = "abc"
		END
		#0  fmt.Println (a=..., err=..., n=<optimized out>) at /tmp/workdir/go/src/fmt/print.go:263
		263	func Println(a ...interface{}) (n int, err error) {
		BEGIN goroutine 1 bt
		#0  fmt.Println (a=..., err=..., n=<optimized out>) at /tmp/workdir/go/src/fmt/print.go:263
		#1  0x00000000004828a0 in main.main () at /tmp/go-build461513624/main.go:14
		END
		BEGIN goroutine 2 bt
		No such goroutine:  2
		END
		Breakpoint 2 at 0x4828cd: file /tmp/go-build461513624/main.go, line 18.
		hi
		
		Breakpoint 2, main.main () at /tmp/go-build461513624/main.go:19
		19	}  // END_OF_PROGRAM
		BEGIN goroutine 1 bt at the end
		#0  main.main () at /tmp/go-build461513624/main.go:19
		END
		
	runtime-gdb_test.go:258: goroutine 2 bt failed: No such goroutine:  2
FAIL
FAIL	runtime	28.524s

/cc @aclements

@bradfitz bradfitz added Testing An issue that has been verified to require only test changes, not just a test failure. help wanted NeedsFix The path to resolution is known, but the work has not been done. labels Mar 30, 2018
@bradfitz bradfitz added this to the Go1.11 milestone Mar 30, 2018
@aclements
Copy link
Member

@hyangah, is this similar to the bug you fixed recently about getting the state of goroutines?

@hyangah
Copy link
Contributor

hyangah commented Apr 4, 2018

@aclements do you mean https://go-review.googlesource.com/c/go/+/49691?
I don't know. Maybe.

If when the test passes the gdb output should look like the following

--- PASS: TestGdbPython (0.44s)
	runtime-gdb_test.go:59: gdb version 7.7
	runtime-gdb_test.go:193: gdb output: Loading Go Runtime support.
		Loaded  Script                                                                 
		Yes     /tmp/workdir/go/src/runtime/runtime-gdb.py                             
		Breakpoint 1 at 0x47c1c0: file /tmp/workdir/go/src/fmt/print.go, line 263.
		
		Breakpoint 1, fmt.Println (a=..., err=..., n=) at /tmp/workdir/go/src/fmt/print.go:263
		263	func Println(a ...interface{}) (n int, err error) {
		BEGIN info goroutines
		* 1 running  runtime.systemstack_switch
		  2 waiting  runtime.gopark
		  17 waiting  runtime.gopark
		  33 runnable runtime.runfinq
		END
		#1  0x00000000004828d0 in main.main () at /tmp/go-build668958384/main.go:14
		14		fmt.Println("hi")
		BEGIN print mapvar
		$1 = map[string]string = {["abc"] = "def", ["ghi"] = "jkl"}
		END
		BEGIN print strvar
		$2 = "abc"
		END
		BEGIN info locals
		mapvar = map[string]string = {["abc"] = "def", ["ghi"] = "jkl"}
		slicevar =  []string = {"def"}
		strvar = "abc"
		END
		#0  fmt.Println (a=..., err=..., n=) at /tmp/workdir/go/src/fmt/print.go:263
		263	func Println(a ...interface{}) (n int, err error) {
		BEGIN goroutine 1 bt
		#0  fmt.Println (a=..., err=..., n=) at /tmp/workdir/go/src/fmt/print.go:263
		#1  0x00000000004828d0 in main.main () at /tmp/go-build668958384/main.go:14
		END
		BEGIN goroutine 2 bt
		#0  runtime.gopark (lock=0x528af0 , reason="force gc (idle)", traceEv=20 '\024', traceskip=1, unlockf=) at /tmp/workdir/go/src/runtime/proc.go:292
		#1  0x0000000000428c5e in runtime.goparkunlock (lock=, reason=..., traceEv=, traceskip=) at /tmp/workdir/go/src/runtime/proc.go:297
		#2  0x00000000004289ea in runtime.forcegchelper () at /tmp/workdir/go/src/runtime/proc.go:248
		#3  0x000000000044ee01 in runtime.goexit () at /tmp/workdir/go/src/runtime/asm_amd64.s:1385
		#4  0x0000000000000000 in ?? ()
		END
		Breakpoint 2 at 0x4828fd: file /tmp/go-build668958384/main.go, line 18.
		hi
		
		Breakpoint 2, main.main () at /tmp/go-build668958384/main.go:19
		19	}  // END_OF_PROGRAM
		BEGIN goroutine 1 bt at the end
		#0  main.main () at /tmp/go-build668958384/main.go:19
		END

Note the difference in the output of 'info goroutine' (2 running goroutines vs 1 running goroutine).

Is there any way to reliably reproduce the failure case?
I tried to gomote and test in the linux-amd64 buildlet, but failed to reproduce the failure with -count=1000 (too ~450s). Not from my linux either. I tried larger than 1000 in gomote and the run was SIGQUIT.

@ianlancetaylor ianlancetaylor modified the milestones: Go1.11, Go1.12 Jul 7, 2018
@andybons andybons modified the milestones: Go1.12, Go1.13 Feb 12, 2019
@andybons andybons modified the milestones: Go1.13, Go1.14 Jul 8, 2019
@bcmills
Copy link
Contributor

bcmills commented Aug 27, 2019

Here's a flake with very similar output on linux-ppc64le-buildlet:
https://build.golang.org/log/f5c124d74c9e6a71da5614b8b13db9328ec08910

@bcmills bcmills changed the title runtime: TestGdbPython flake on linux-amd64 runtime: TestGdbPython flaky on linux Aug 27, 2019
@bcmills
Copy link
Contributor

bcmills commented Sep 11, 2019

linux-ppc64le-buildlet again: https://build.golang.org/log/8656a75fa05d10775f555193d888e44adb7476c3

@laboger
Copy link
Contributor

laboger commented Sep 11, 2019

The failure here happens because at the time the gdb script tries to display information about goroutine 2, that goroutine no longer exists.

Is that a bug because the goroutine has ended, or just normal behavior? Someone who knows more about what causes goroutines to come and go would have to answer that.

It seems like the testcase should have a goroutine that won't go away soon and reference that in the gdb script.

@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
@bcmills
Copy link
Contributor

bcmills commented Oct 14, 2019

@bcmills
Copy link
Contributor

bcmills commented Oct 15, 2019

@bcmills
Copy link
Contributor

bcmills commented Nov 7, 2019

@josharian
Copy link
Contributor

Time to mark as flaky and skip?

@bcmills
Copy link
Contributor

bcmills commented Mar 11, 2020

Or, y'know, figure out why the test is failing and fix it...

@josharian
Copy link
Contributor

@bcmills I don't think that snarkiness is warranted.

It is obviously better to fix tests than skip them. It is also better skip tests than to have random flakes, which waste peoples' time, mask real failures, and condition people not to bother investigating the dots of red on build.golang.org.

This has been sitting without a fix for two years; it was originally marked Go 1.11. I think that's well over the threshold for marking as flaky, unless there's a concrete plan for who is going to look into it.

@bcmills
Copy link
Contributor

bcmills commented Mar 11, 2020

Skipping the test ~everywhere is tantamount to deleting it, especially given that we do not currently have a way to detect unnecessary skips (#25951), and especially for tests that do not reliably fail.

Skipping the test only on Linux would assume that the source of flakiness is linux-specific, but I do not know whether we have enough gdb coverage on other platforms to determine that (see #22021).

Skipping the test due to flakiness on only the builders would not fix the potential for flakes when folks run go test runtime locally.

So even adding an appropriate Skip requires domain knowledge that I do not have and do not have the time to acquire, in order to correctly target the Skip to the root cause of the failure. This issue needs attention from someone with domain knowledge specific to the test.

If the test does not provide enough value to warrant the time from someone with enough context to either fix it or add an appropriately-tailored Skip — for example, if we no longer care about the scenario it is testing — then it should probably be deleted outright rather than skipped. But that, too, would need to be a decision by someone with appropriate context on the original purpose of the test.

@laboger
Copy link
Contributor

laboger commented Mar 11, 2020

As I noted earlier in this issue, the gdb script is expecting goroutine 2 will exist throughout this test so it tries to do 'bt 2'. At the point it does that gdb thinks this is an error because it can't find goroutine 2. My guess is that goroutine 2 is gone (for whatever reason) when it tried to do the 'bt 2', which I don't think is a gdb error if that is the case.

But in any case I think if you remove the line that does 'bt 2' (skip it) from the gdb script then the error should go away and you will still be testing gdb to some extent. That is my guess anyway. I looked at this test a while back because it was failing intermittently on ppc64le for a while.

@josharian
Copy link
Contributor

From looking at the source code that it is running, I can't tell what goroutine 2 is even supposed to be. It looks entirely single-threaded to me. Do you know?

const helloSource = `
import "fmt"
import "runtime"
var gslice []string
func main() {
	mapvar := make(map[string]string, 13)
	mapvar["abc"] = "def"
	mapvar["ghi"] = "jkl"
	strvar := "abc"
	ptrvar := &strvar
	slicevar := make([]string, 0, 16)
	slicevar = append(slicevar, mapvar["abc"])
	fmt.Println("hi")
	runtime.KeepAlive(ptrvar)
	_ = ptrvar
	gslice = slicevar
	runtime.KeepAlive(mapvar)
}  // END_OF_PROGRAM
`

@thanm
Copy link
Contributor

thanm commented Mar 13, 2020

@josharian I got the same impression when I looked over the test. It seems that the test is making some sort of attempt to see if a different goroutine can be selected for a backtrace, but relying on runtime-specific goroutines to do that?

@josharian
Copy link
Contributor

Something like this might work, but I can't test particularly readily.

diff --git a/src/runtime/runtime-gdb_test.go b/src/runtime/runtime-gdb_test.go
index 79b4621614..9f2ada1032 100644
--- a/src/runtime/runtime-gdb_test.go
+++ b/src/runtime/runtime-gdb_test.go
@@ -106,6 +106,7 @@ import "fmt"
 import "runtime"
 var gslice []string
 func main() {
+	go func() { select{} }() // ensure a second goroutine is running
 	mapvar := make(map[string]string, 13)
 	mapvar["abc"] = "def"
 	mapvar["ghi"] = "jkl"
@@ -115,7 +116,7 @@ func main() {
 	slicevar = append(slicevar, mapvar["abc"])
 	fmt.Println("hi")
 	runtime.KeepAlive(ptrvar)
-	_ = ptrvar
+	_ = ptrvar // set breakpoint here
 	gslice = slicevar
 	runtime.KeepAlive(mapvar)
 }  // END_OF_PROGRAM
@@ -167,6 +168,16 @@ func testGdbPython(t *testing.T, cgo bool) {
 
 	src := buf.Bytes()
 
+	// Locate breakpoint line
+	var bp int
+	lines := bytes.Split(src, []byte("\n"))
+	for i, line := range lines {
+		if bytes.Contains(line, []byte("breakpoint")) {
+			bp = i
+			break
+		}
+	}
+
 	err = ioutil.WriteFile(filepath.Join(dir, "main.go"), src, 0644)
 	if err != nil {
 		t.Fatalf("failed to create file: %v", err)
@@ -201,7 +212,7 @@ func testGdbPython(t *testing.T, cgo bool) {
 	}
 	args = append(args,
 		"-ex", "set python print-stack full",
-		"-ex", "br main.go:15",
+		"-ex", fmt.Sprintf("br main.go:%d", bp),
 		"-ex", "run",
 		"-ex", "echo BEGIN info goroutines\n",
 		"-ex", "info goroutines",

@thanm
Copy link
Contributor

thanm commented Mar 13, 2020

I tested Josh's change and it seems to work. I am happy to send a CL with this change (or Josh can send one) if folks think that it would be helpful.

FWIW I also spent a while trying to reproduce the original flake on my workstation... lots of testing but no luck... so if I can't produce the problem in the first place, it's hard to know whether this new version is a real improvement.

@josharian
Copy link
Contributor

Thanks, @thanm. I’ll send a CL soonish and claim that it fixes this issue. And if the issue occurs again, we can re-open.

@josharian
Copy link
Contributor

@thanm if you're still up for it, feel free to send a CL. I'm AFK for an unknown time.

@thanm
Copy link
Contributor

thanm commented Mar 25, 2020

OK, I will add it to my list (may not get to it for a while, I have other pots on the stove).

@gopherbot
Copy link

Change https://golang.org/cl/226558 mentions this issue: test: deflaking measures for runtime gdb test

gopherbot pushed a commit that referenced this issue Apr 9, 2020
Tweak the runtime's GDB python test to try to reduce flake failures.

Background: the intent of the testpoint in question is to make sure
that python-supported commands like "info goroutines" or "goroutine 1
backtrace" work properly. The Go code being run under the debugger as
part of the test is single-threaded, but the test is written assuming
that in addition to the primary goroutine there will be other
background goroutines available (owned by the runtime). The flakiness
seems to crop up the most when requesting a backtrace for one of these
background goroutines; the speculation is that if we catch a
runtime-owned goroutine in an odd state, this could interfere with the
test.

The change in this patch is to explicitly start an additional
goroutine from the main thread, so that when the debugger stops the
main thread we can be sure that there is some other non-main goroutine
in a known state.

This change authored by Josh Bleecher Snyder <josharian@gmail.com>.

Updates #24616.

Change-Id: I45682323d5898e5187c0adada7c5d117e92f403b
Reviewed-on: https://go-review.googlesource.com/c/go/+/226558
Run-TryBot: Than McIntosh <thanm@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
Reviewed-by: Bryan C. Mills <bcmills@google.com>
@josharian
Copy link
Contributor

This happened again: https://storage.googleapis.com/go-build-log/72c918bb/linux-amd64-race_026d95f9.log

Still no goroutine 2. Looks like there actually is a goroutine 2, doing bgsweep. Does gdb prevent us from doing bt on a runtime goroutine?

@laboger
Copy link
Contributor

laboger commented Apr 14, 2020

In the log:

BEGIN goroutine all bt
        #0  main.main () at /workdir/tmp/go-build562481541/main.go:18
        No such goroutine:  2
        #0  runtime.gopark (unlockf={void (runtime.g *, void *, bool *)} 0xc00002b7a8, lock=0x56bf20 <runtime.sweep>, reason=12 '\f', traceEv=20 '\024', traceskip=1) at /workdir/go/src/runtime/proc.go:307
        #1  0x000000000041fcee in runtime.goparkunlock (lock=<optimized out>, reason=<optimized out>, traceEv=<optimized out>, traceskip=<optimized out>) at /workdir/go/src/runtime/proc.go:312
        #2  runtime.bgsweep (c=0xc000014070) at /workdir/go/src/runtime/mgcsweep.go:71
        #3  0x000000000045f8e1 in runtime.goexit () at /workdir/go/src/runtime/asm_amd64.s:1374
        #4  0x000000c000014070 in ?? ()
        #5  0x0000000000000000 in ?? ()
        #0  runtime.gopark (unlockf={void (runtime.g *, void *, bool *)} 0xc00002bf78, lock=0x56bee0 <runtime.scavenge>, reason=13 '\r', traceEv=20 '\024', traceskip=1) at /workdir/go/src/runtime/proc.go:307
        #1  0x000000000041e3c2 in runtime.goparkunlock (lock=<optimized out>, reason=<optimized out>, traceEv=<optimized out>, traceskip=<optimized out>) at /workdir/go/src/runtime/proc.go:312
        #2  runtime.bgscavenge (c=0xc000014070) at /workdir/go/src/runtime/mgcscavenge.go:238
        #3  0x000000000045f8e1 in runtime.goexit () at /workdir/go/src/runtime/asm_amd64.s:1374
        #4  0x000000c000014070 in ?? ()
        #5  0x0000000000000000 in ?? ()
        #0  runtime.runfinq () at /workdir/go/src/runtime/mfinal.go:161
        #1  0x000000000045f8e1 in runtime.goexit () at /workdir/go/src/runtime/asm_amd64.s:1374
        #2  0x0000000000000000 in ?? ()
        #0  main.main.func1 () at /workdir/tmp/go-build562481541/main.go:7
        #1  0x000000000045f8e1 in runtime.goexit () at /workdir/go/src/runtime/asm_amd64.s:1374
        #2  0x0000000000000000 in ?? ()
        END

I think each #0 is the top (bottom?) of the stack for a goroutine. Note at the top it still says No goroutine 2. The bgsweep is part of the stack for another goroutine, and based what was shown above, that was goroutine 3.

It still seems that goroutine 2 has exited by the time the second bt is attempted.

@josharian
Copy link
Contributor

What then is it even trying to test? Perhaps we should just delete the bt 2? We cannot reliably identify goroutines just by their number. I guess the alternative is to do some python scripting to parse ‘bt all’, identify the goroutine of interest, and bt it. Or we could parse ‘bt all’ and backtrace all live goroutines.

I’m mostly inclined to delete ‘bt 2’. Opinions?

@thanm
Copy link
Contributor

thanm commented Apr 14, 2020

Agree on the analysis; I am find with deleting the bt 2. Parsing "bt all" or "info goroutines" to find a specific goroutine seems like overkill.

@laboger
Copy link
Contributor

laboger commented Apr 14, 2020

My assumption is that the purpose of the 'bt 2' was just to test the backtrace output. If you leave in the 'bt all' and remove 'bt 2' that should test it? I honestly don't know why goroutines would come and go but if it is gone that's not an error with gdb python but an expectation of the test.

@bcmills
Copy link
Contributor

bcmills commented May 11, 2020

Still flaky after CL 226558, unfortunately:
2020-05-08T00:07:39-f0cea84/linux-386-387

--- FAIL: TestGdbPython (0.62s)
    runtime-gdb_test.go:71: gdb version 7.7
    runtime-gdb_test.go:249: gdb output: Loading Go Runtime support.
        Loaded  Script                                                                 
        Yes     /workdir/go/src/runtime/runtime-gdb.py                                 
        Breakpoint 1 at 0x80cc592: file /workdir/tmp/go-build244785334/main.go, line 16.
        hi
        
        Breakpoint 1, main.main () at /workdir/tmp/go-build244785334/main.go:18
        18		gslice = slicevar
        BEGIN info goroutines
        * 1 running  syscall.Syscall
        * 2 running  runtime.gopark
          3 waiting  runtime.gopark
          4 waiting  runtime.gopark
          5 runnable runtime.runfinq
          6 runnable main.main.func1
        END
        BEGIN print mapvar
        $1 = map[string]string = {["abc"] = "def", ["ghi"] = "jkl"}
        END
        BEGIN print strvar
        $2 = "abc"
        END
        BEGIN info locals
        mapvar = map[string]string = {["abc"] = "def", ["ghi"] = "jkl"}
        strvar = "abc"
        slicevar =  []string
        END
        BEGIN goroutine 1 bt
        #0  main.main () at /workdir/tmp/go-build244785334/main.go:18
        END
        BEGIN goroutine 2 bt
        No such goroutine:  2
        END
        BEGIN goroutine all bt
        #0  main.main () at /workdir/tmp/go-build244785334/main.go:18
        No such goroutine:  2
        #0  runtime.gopark (unlockf={void (runtime.g *, void *, bool *)} 0x84277d4, lock=0x816eb40 <runtime.sweep>, reason=12 '\f', traceEv=20 '\024', traceskip=1) at /workdir/go/src/runtime/proc.go:307
        #1  0x0806584f in runtime.goparkunlock (lock=<optimized out>, reason=<optimized out>, traceEv=<optimized out>, traceskip=<optimized out>) at /workdir/go/src/runtime/proc.go:312
        #2  runtime.bgsweep (c=0x844c000) at /workdir/go/src/runtime/mgcsweep.go:163
        #3  0x0809df41 in runtime.goexit () at /workdir/go/src/runtime/asm_386.s:1333
        #4  0x0844c000 in ?? ()
        #0  runtime.gopark (unlockf={void (runtime.g *, void *, bool *)} 0x8427f9c, lock=0x816eae0 <runtime.scavenge>, reason=13 '\r', traceEv=20 '\024', traceskip=1) at /workdir/go/src/runtime/proc.go:307
        #1  0x08063b2e in runtime.goparkunlock (lock=<optimized out>, reason=<optimized out>, traceEv=<optimized out>, traceskip=<optimized out>) at /workdir/go/src/runtime/proc.go:312
        #2  runtime.bgscavenge (c=0x844c000) at /workdir/go/src/runtime/mgcscavenge.go:260
        #3  0x0809df41 in runtime.goexit () at /workdir/go/src/runtime/asm_386.s:1333
        #4  0x0844c000 in ?? ()
        #0  runtime.runfinq () at /workdir/go/src/runtime/mfinal.go:161
        #1  0x0809df41 in runtime.goexit () at /workdir/go/src/runtime/asm_386.s:1333
        #2  0x00000000 in ?? ()
        #0  0x080cc610 in main.main.func1 ()
        #1  0x0809df41 in runtime.goexit () at /workdir/go/src/runtime/asm_386.s:1333
        #2  0x00000000 in ?? ()
        END
        No breakpoint at main.go:15.
        Breakpoint 2 at 0x80cc5ba: file /workdir/tmp/go-build244785334/main.go, line 19.
        
        Breakpoint 2, main.main () at /workdir/tmp/go-build244785334/main.go:20
        20	}  // END_OF_PROGRAM
        BEGIN goroutine 1 bt at the end
        #0  main.main () at /workdir/tmp/go-build244785334/main.go:20
        END
        
    runtime-gdb_test.go:100: malformed backtrace at line 0: No such goroutine:  2
FAIL
FAIL	runtime	31.765s

@josharian
Copy link
Contributor

Yep. See the last few comments above, which include a plan for moving forward. I’m AFK now but feel free to send a CL. Should be a simple one.

@gopherbot
Copy link

Change https://golang.org/cl/233942 mentions this issue: runtime: remove flaky "goroutine 2 bt" from gdb test

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge help wanted NeedsFix The path to resolution is known, but the work has not been done. OS-Linux Testing An issue that has been verified to require only test changes, not just a test failure.
Projects
None yet
Development

No branches or pull requests