
x/net/bpf: tests are flaky due to timing assumptions #20695

Open · bradfitz opened this issue Jun 16, 2017 · 10 comments

Labels: NeedsFix (The path to resolution is known, but the work has not been done.) · Testing (An issue that has been verified to require only test changes, not just a test failure.)

bradfitz (Contributor) commented Jun 16, 2017

Test flake seen on a trybot run:

--- FAIL: TestVMALUOpXor (0.10s)
	vm_bpf_test.go:120: output byte count does not match:
		- go: 1
		- os: 0
	vm_bpf_test.go:124: Go BPF and OS BPF packet outputs do not match
FAIL
FAIL	golang.org/x/net/bpf	0.767s

Could somebody investigate? Test flakes make development elsewhere harder. (I was working on http2.)

/cc @mdlayher @danderson

gopherbot added this to the Unreleased milestone Jun 16, 2017
bradfitz added the NeedsFix and Testing labels Jun 16, 2017
bradfitz modified the milestones: Go1.9Maybe, Unreleased Jun 16, 2017
gopherbot modified the milestones: Unreleased, Go1.9Maybe Jun 16, 2017
mdlayher (Member) commented Jun 16, 2017

Running the tests in a loop on my machine here. No failures after ~400 repeated runs. I'll keep it running.

I suspect this is the issue: https://github.com/golang/net/blob/d1beb07c2056f22ead4ef9dd84f268cd537eeec8/bpf/vm_bpf_test.go#L184.

The test currently assumes that a read timeout means BPF dropped the packet. If the packet simply took longer than 50 milliseconds to arrive (due to CPU contention on a loaded machine, for example), that would explain the flake.
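
For context, here is a minimal sketch of the pattern in question (the helper name, buffer size, and signature are illustrative, not taken from the test): set a read deadline, then treat a timeout as "the filter dropped the packet".

```go
package sketch

import (
	"net"
	"time"
)

// readOrTimeout is a hypothetical helper showing the flaky pattern: a
// read that times out is taken to mean "the BPF filter dropped the
// packet". On a loaded machine, a timeout can instead mean "delivery
// was slow", which is the suspected source of the flake.
func readOrTimeout(p net.PacketConn, timeout time.Duration) (dropped bool, err error) {
	buf := make([]byte, 512)
	if err := p.SetReadDeadline(time.Now().Add(timeout)); err != nil {
		return false, err
	}
	if _, _, err := p.ReadFrom(buf); err != nil {
		if nerr, ok := err.(net.Error); ok && nerr.Timeout() {
			return true, nil // assumed: filter dropped the packet
		}
		return false, err
	}
	return false, nil
}
```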

There might be a better way to detect when the filter dropped the traffic, but it's been a bit since I looked at BPF.

Any suggestions, @danderson or @mikioh?

danderson (Contributor) commented:

Hmm. There must be a way to get dropped packet stats from the kernel, because tcpdump can tell you how many packets the filter dropped in-kernel. I don't know offhand how that works, but I'll go dig.

Alternatively, much as I find it distasteful, raising the timeout might be acceptable here. Transmitting to localhost should be lossless unless receive buffers are full, and we're not sending/receiving hard enough for that.

mdlayher (Member) commented:

> Alternatively, much as I find it distasteful, raising the timeout might be acceptable here.

I thought about that too, but it seems like a slippery slope and something I try hard to avoid when writing tests in Go.

> There must be a way to get dropped packet stats from the kernel, because tcpdump can tell you how many packets the filter dropped in-kernel.

Seems promising! I looked through the BPF manpage for some magic number I could pass to getsockopt() or the like, but no luck.

Also, still running those tests. ~2500 runs, no failures.

bradfitz (Contributor, Author) commented:

Our builders can be quite loaded at times, so 50ms might indeed be too low.

Generally I suggest making the test try N times with increasing timeouts until the test passes: 5ms, 50ms, 500ms, 5s.
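
A minimal sketch of that suggestion (the helper and its callback are hypothetical, not from the test): run the check with growing deadlines, so a fast pass stays fast while a loaded builder still gets a generous deadline before the run counts as a failure.

```go
package sketch

import "time"

// tryWithBackoff is a hypothetical helper for the suggestion above:
// retry the check with increasing timeouts. A test that passes quickly
// returns after the first attempt; a slow builder gets up to 5s.
func tryWithBackoff(attempt func(timeout time.Duration) bool) bool {
	for _, d := range []time.Duration{
		5 * time.Millisecond,
		50 * time.Millisecond,
		500 * time.Millisecond,
		5 * time.Second,
	} {
		if attempt(d) {
			return true
		}
	}
	return false
}
```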

danderson (Contributor) commented Jun 16, 2017

I tell a lie, sadly. I examined libpcap's source code to see how it gets packet statistics. The overall conclusion is that the numbers are a complete lie, and the exact nature of the lies varies wildly by OS and kernel version.

On Linux, the PACKET_STATISTICS getsockopt option returns a count of packets "received" and "dropped", but the definitions make it useless for our purposes: "received" means "was not dropped by the filter", and "dropped" means "was not enqueued in the socket's receive queue because the buffer was full." Note the absence of a counter for "dropped by the filter." Also note that, under those definitions, a packet can be counted as both received and dropped simultaneously, if we were interested in it but had full buffers.
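
For reference, a sketch of reading those counters on Linux via golang.org/x/sys/unix (fd is assumed to be an AF_PACKET socket; this illustrates the semantics described above, it is not a fix):

```go
package sketch

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// packetStats reads PACKET_STATISTICS from an AF_PACKET socket. As
// described above, Packets counts packets that passed the filter and
// Drops counts packets lost to a full receive buffer; there is no
// "dropped by the filter" counter, so this cannot answer the question
// the test needs answered.
func packetStats(fd int) error {
	st, err := unix.GetsockoptTpacketStats(fd, unix.SOL_PACKET, unix.PACKET_STATISTICS)
	if err != nil {
		return err
	}
	fmt.Printf("passed filter: %d, dropped (buffer full): %d\n", st.Packets, st.Drops)
	return nil
}
```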

Still digging more, but so far I stand by my claim that there's no way for the kernel to tell us what we want to know :(

mdlayher (Member) commented:

> Generally I suggest making the test try N times with increasing timeouts until the test passes: 5ms, 50ms, 500ms, 5s.

This seems like a reasonable first step to me. I'll send a CL that adds a "hint" to the tests indicating whether or not the traffic should be allowed through the filter. If the hint is true, I'll retry on timeout using the increasing-timeouts method that Brad suggests; a rough sketch follows.
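
Something along these lines (checkWithHint and runOnce are hypothetical names; runOnce stands for a single send-and-read attempt):

```go
package sketch

import "time"

// checkWithHint is a hypothetical helper for the proposed CL: each test
// carries a hint saying whether its packet should pass the filter. Only
// in the "should pass" case do we retry on timeout with growing
// deadlines; a timeout in the "should drop" case is the expected result.
func checkWithHint(shouldPass bool, runOnce func(timeout time.Duration) bool) bool {
	if !shouldPass {
		// runOnce returning false means the read timed out, which is
		// exactly what we expect when the filter drops the packet.
		return !runOnce(50 * time.Millisecond)
	}
	for _, d := range []time.Duration{
		5 * time.Millisecond,
		50 * time.Millisecond,
		500 * time.Millisecond,
		5 * time.Second,
	} {
		if runOnce(d) {
			return true
		}
	}
	return false
}
```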

danderson (Contributor) commented:

I'm missing something: why is it better to incrementally increase the timeouts, rather than just set a large timeout to begin with?

bradfitz (Contributor, Author) commented:

> why is it better to incrementally increase the timeouts, rather than just set a large timeout to begin with?

I didn't read the test in question, but I assumed this was a test that didn't pass until the timeout expired.

My motivation is that I want tests to be as fast as possible and still not flaky. If the timeout were 5 seconds and we always had to wait 5 seconds for the test to pass, that would be bad.

But if I've assumed the timeout backwards, ignore me.

danderson (Contributor) commented:

Doh, of course, I see. I was seeing the timeout as just being there to catch the want != got case, where the packet we expected got unexpectedly filtered and we have to time out.

But yeah, the way the test is structured, that timeout also dictates the time some want==got tests will take. That's unfortunate, but your suggestion makes sense now.

bcmills (Contributor) commented Mar 25, 2021

These failures all look like the same pathology. I think the testing approach for this package needs to be reworked so that it does not make arbitrary assumptions about timing.

2021-03-24T20:56:30-d1beb07/linux-ppc64-buildlet
2021-02-26T17:20:49-e18ecbb/linux-ppc64-buildlet
2020-08-13T13:45:08-3edf25e/linux-ppc64le-power9osu
2020-08-13T13:45:08-3edf25e/linux-s390x-ibm
2020-06-25T00:16:55-4c52546/linux-arm
2019-12-06T10:30:17-1ddd1de/linux-ppc64le-power9osu

bcmills changed the title from "x/net/bpf: TestVMALUOpXor test flake" to "x/net/bpf: tests are flaky due to timing assumptions" Mar 25, 2021