x/net/bpf: tests are flaky due to timing assumptions #20695
Running the tests in a loop on my machine here. No failures after ~400 repeated runs; I'll keep it running. I suspect this is the issue: https://github.com/golang/net/blob/d1beb07c2056f22ead4ef9dd84f268cd537eeec8/bpf/vm_bpf_test.go#L184. The test currently assumes that BPF dropped the packet if a timeout occurs. If the packet took longer than 50 milliseconds to arrive (because of CPU contention, say), that would explain the flake. There might be a better way to detect when the filter dropped the traffic, but it's been a while since I looked at BPF. Any suggestions, @danderson or @mikioh?
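For reference, a minimal sketch of the pattern being questioned (names are illustrative, not the actual vm_bpf_test.go code): the reader arms a short deadline and interprets a timeout as the filter having dropped the packet.

```go
package sketch

import (
	"net"
	"time"
)

// readOrDrop sketches the assumption under discussion: arm a 50ms read
// deadline and treat a timeout as "the BPF filter dropped the packet".
// On a loaded machine the packet may simply be late rather than dropped,
// which is the suspected source of the flake.
func readOrDrop(conn net.Conn, buf []byte) (n int, dropped bool, err error) {
	if err := conn.SetReadDeadline(time.Now().Add(50 * time.Millisecond)); err != nil {
		return 0, false, err
	}
	n, err = conn.Read(buf)
	if nerr, ok := err.(net.Error); ok && nerr.Timeout() {
		return 0, true, nil // the questionable assumption: timeout == filtered
	}
	return n, false, err
}
```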
Hmm. There must be a way to get dropped-packet stats from the kernel, because tcpdump can tell you how many packets the filter dropped in-kernel. I don't know offhand how that works, but I'll go dig. Alternatively, much as I find it distasteful, raising the timeout might be acceptable here: transmitting to localhost should be lossless unless receive buffers are full, and we're not sending/receiving hard enough for that.
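For what it's worth, a hedged sketch of the mechanism tcpdump/libpcap consult on Linux: the PACKET_STATISTICS socket option on an AF_PACKET socket, read here via golang.org/x/sys/unix. This assumes fd is an already-open AF_PACKET socket; it is Linux-specific, and (as the follow-up comment below suggests) the counters' exact semantics vary by kernel version.

```go
package sketch

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// dropStats reads the kernel's packet/drop counters for an AF_PACKET
// socket. Linux-specific; fd is assumed to be an open AF_PACKET socket.
// Note that reading these counters typically resets them in the kernel,
// so callers must accumulate values if they want running totals.
func dropStats(fd int) (packets, drops uint32, err error) {
	stats, err := unix.GetsockoptTpacketStats(fd, unix.SOL_PACKET, unix.PACKET_STATISTICS)
	if err != nil {
		return 0, 0, fmt.Errorf("getsockopt PACKET_STATISTICS: %w", err)
	}
	return stats.Packets, stats.Drops, nil
}
```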
I thought about that too, but it seems like a slippery slope, and generally something I try hard to avoid when writing tests in Go.
Seems promising! I looked through the BPF manpage for some magic number I could pass to […]. Also, still running those tests: ~2500 runs, no failures.
Our builders can be quite loaded at times, so 50ms might indeed be too low. Generally I suggest making the test try N times with increasing timeouts until it passes: 5ms, 50ms, 500ms, 5s.
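A sketch of that suggestion, assuming a hypothetical attempt closure that runs one iteration of the test with a given timeout:

```go
package sketch

import "time"

// tryWithBackoff retries attempt with increasing timeouts, so the common
// case stays fast while a heavily loaded builder still passes eventually.
func tryWithBackoff(attempt func(timeout time.Duration) bool) bool {
	for _, d := range []time.Duration{
		5 * time.Millisecond,
		50 * time.Millisecond,
		500 * time.Millisecond,
		5 * time.Second,
	} {
		if attempt(d) {
			return true
		}
	}
	return false
}
```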
I tell a lie, sadly. I examined libpcap's source code to see how it gets those packet statistics. The overall conclusion is that the numbers are a complete lie, and the exact nature of the lies varies wildly by OS and kernel version. On Linux, the […]. Still digging, but so far I stand by my claim that there's no way for the kernel to tell us what we want to know :(
This seems like a reasonable first step to me. I'll send a CL that adds a "hint" to each test indicating whether the traffic should be allowed through the filter. If the hint is true, I'll continue on timeout, using the increasing-timeouts method Brad suggests. A sketch of the shape follows.
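A hypothetical shape for that CL (names invented here, not the actual change): each case carries a hint, and on timeout the test retries with a longer deadline only when the hint says the packet should have passed.

```go
package sketch

import (
	"time"

	"golang.org/x/net/bpf"
)

// timeouts follows Brad's suggested ladder.
var timeouts = []time.Duration{
	5 * time.Millisecond, 50 * time.Millisecond,
	500 * time.Millisecond, 5 * time.Second,
}

// testCase pairs a filter with a hint about the expected outcome.
type testCase struct {
	filter     []bpf.Instruction // program under test
	shouldPass bool              // hint: should the packet get through?
}

// runCase reports whether the observed behavior matches the hint.
// readWithTimeout is a hypothetical helper that reports whether the
// packet arrived within d, and whether the read timed out.
func runCase(tc testCase, readWithTimeout func(d time.Duration) (arrived, timedOut bool)) bool {
	for _, d := range timeouts {
		arrived, timedOut := readWithTimeout(d)
		if arrived {
			return tc.shouldPass // packet got through the filter
		}
		if timedOut && !tc.shouldPass {
			return true // a drop was expected; the timeout confirms it
		}
		// Timed out but the packet was expected: retry with a longer
		// deadline rather than failing outright.
	}
	return !tc.shouldPass // never arrived, even at the longest timeout
}
```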
I'm missing something: why is it better to incrementally increase the timeouts, rather than just set a large timeout to begin with? |
I didn't read the test in question, but I assumed this was a test that didn't pass until the timeout expired. My motivation is that I want tests to be as fast as possible while still not flaky. If the timeout were 5 seconds and we always had to wait 5 seconds for the test to pass, that would be bad. But if I have the timeout backwards, ignore me.
Doh, of course, I see. I was thinking of the timeout as just being there to catch the case where want != got, where we have to time out because the packet we expected got unexpectedly filtered. But yeah, the way the test is structured, that timeout also dictates how long some want == got cases take. That's unfortunate, but your suggestion makes sense now.
These failures all look like the same pathology. I think the testing approach for this package needs to be reworked so that it does not make arbitrary assumptions about timing.
2021-03-24T20:56:30-d1beb07/linux-ppc64-buildlet
Test flake:
Seen on a trybot run:
Could somebody investigate? Test flakes make development elsewhere harder. (I was working on http2)
/cc @mdlayher @danderson