net: TestDialCancel is flaky on ARM and MIPS builders #15191

Closed
mikioh opened this issue Apr 8, 2016 · 12 comments
Labels
FrozenDueToAge · help wanted · NeedsInvestigation (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.) · Testing (An issue that has been verified to require only test changes, not just a test failure.)
Milestone

Comments

@bradfitz bradfitz added this to the Unplanned milestone Apr 8, 2016
@bradfitz
Contributor

bradfitz commented Apr 8, 2016

Odd:

--- FAIL: TestDialCancel (0.01s)
    dial_test.go:870: dial error after 0 ticks (5 before cancel sent): dial tcp 198.18.0.254:1234: getsockopt: network is unreachable
FAIL
FAIL    net 1.745s

Why would it return ENETUNREACH, but only sometimes? Amusingly, getsockopt's man page (http://linux.die.net/man/2/getsockopt) doesn't even mention this error.

/cc @ianlancetaylor @davecheney @minux

@bradfitz bradfitz changed the title net: TestDialCancel is flaky on linux/arm64-buidlet net: TestDialCancel is flaky on linux/arm64-buildlet Apr 8, 2016
@ianlancetaylor
Contributor

If you look at the code in netFD.connect in fd_unix.go, you'll see that (most likely) getsockopt is not returning ENETUNREACH. Instead, getsockopt(SO_ERROR) is succeeding in retrieving the error associated with the socket, and that error is ENETUNREACH. The error is really coming from connect, and it means that there is no route to the IP address.

@bradfitz bradfitz added the Testing An issue that has been verified to require only test changes, not just a test failure. label Apr 12, 2016
@bradfitz
Contributor

I'm just going to disable this test for now. I think that machine (on Linaro) has different routes than we've normally assumed for tests.

For the record,

root@r2-a25-go1:/home/brad.fitzpatrick# ifconfig 
eth0      Link encap:Ethernet  HWaddr 00:16:3e:0c:9c:8a  
          inet addr:10.20.3.110  Bcast:10.20.255.255  Mask:255.255.0.0
          inet6 addr: fe80::216:3eff:fe0c:9c8a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:5151248 errors:0 dropped:0 overruns:0 frame:0
          TX packets:996113 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:5114537899 (5.1 GB)  TX bytes:3534764800 (3.5 GB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:2065172 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2065172 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:6848213327 (6.8 GB)  TX bytes:6848213327 (6.8 GB)

lxcbr0    Link encap:Ethernet  HWaddr a6:d1:77:00:04:d2  
          inet addr:10.0.3.1  Bcast:0.0.0.0  Mask:255.255.255.0
          inet6 addr: fe80::a4d1:77ff:fe00:4d2/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 B)  TX bytes:570 (570.0 B)

root@r2-a25-go1:/home/brad.fitzpatrick# ip route show
default via 10.20.0.1 dev eth0 
10.0.3.0/24 dev lxcbr0  proto kernel  scope link  src 10.0.3.1 
10.20.0.0/16 dev eth0  proto kernel  scope link  src 10.20.3.110 

gopherbot pushed a commit that referenced this issue Apr 12, 2016
These builders (on Linaro) have a different network configuration
which is incompatible with this test. Or so it seems.

Updates #15191

Change-Id: Ibfeacddc98dac1da316e704b5c8491617a13e3bf
Reviewed-on: https://go-review.googlesource.com/21901
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
@paulzhol
Member

paulzhol commented Aug 7, 2017

I've started seeing these as well on freebsd-arm-paulzhol:
https://build.golang.org/log/d89169422b2e1c3f4765d7d9093faa56510e21ec
https://build.golang.org/log/6b83f0f1b1ba97c0efb74fae4bb05280c26b8a22
https://build.golang.org/log/664eb5529a95b2b051d874ebc6d4f0412266984e

I'm not sure why it started appearing now. There have been some changes in the environment (a switch to a buildlet-based builder, an upgrade to FreeBSD 11.1, etc.), but they don't seem to be related.

In my setup I can trace the cause to the router/firewall replying with a TCP RST segment when dialing an address in the 198.18.0.0/15 subnet:

19:13:12.724990 IP 192.168.X.Y.29436 > 198.18.0.254.1234: Flags [S], seq 2301688538, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 157886 ecr 0], length 0
19:13:12.725786 IP 198.18.0.254.1234 > 192.168.X.Y.29436: Flags [R.], seq 0, ack 2301688539, win 0, length 0

This is the behavior recommended for the OpenBSD pf firewall, according to https://www.openbsd.org/faq/pf/example1.html:

block in quick on egress from <martians> to any
block return out quick on egress from any to <martians>

Packets coming in on the egress interface should be dropped if they appear to be from the list of unroutable addresses we defined. Such packets were likely sent due to misconfiguration, or possibly as part of a spoofing attack. Similarly, our clients should not attempt to connect to such addresses. We'll specify the "return" action to prevent annoying timeouts for users. Note that this can cause problems if you're doing double NAT.

Here <martians> is a table containing 198.18.0.0/15 as well as other non-routable address ranges.
According to the pf.conf manual, the block return rule behaves as follows:

This causes a TCP RST to be returned for TCP packets and an ICMP UNREACHABLE for other types of packets.

@bcmills
Contributor

bcmills commented Mar 13, 2019

@bcmills
Contributor

bcmills commented Jun 19, 2019

@bcmills bcmills changed the title net: TestDialCancel is flaky on linux/arm64-buildlet net: TestDialCancel is flaky on linux arm builders Jun 19, 2019
@bcmills bcmills added OS-Linux and removed OS-Linux labels Jun 19, 2019
@bcmills bcmills changed the title net: TestDialCancel is flaky on linux arm builders net: TestDialCancel is flaky on arm builders Jun 19, 2019
@bcmills
Contributor

bcmills commented Sep 3, 2019

@bcmills
Contributor

bcmills commented Sep 25, 2019

@bcmills
Contributor

bcmills commented Oct 31, 2019

@ianlancetaylor
Contributor

The same test failure is happening on the MIPS builders.

2019-10-29T12:23:21-ac346a5/linux-mips64le-rtrk
2019-10-29T18:32:59-ca70ada/linux-mips64-rtrk
2019-10-29T19:58:24-cc47b0d/linux-mips64-rtrk
2019-10-30T03:48:03-9e094ea/linux-mips64le-rtrk
2019-10-30T08:17:29-f4e32ae/linux-arm
2019-10-31T16:02:25-d5caea7/linux-mips64-rtrk
2019-10-31T17:09:48-a9b37ae/linux-mips-rtrk
2019-10-31T17:21:56-48c0cef/linux-mips-rtrk
2019-11-05T05:22:07-3c0fbee/linux-mips64-rtrk
2019-11-05T15:40:02-8550a58/linux-mips64-rtrk
2019-11-05T18:37:06-c3cef0b/linux-mips64-rtrk
2019-11-05T20:21:34-552987f/linux-mipsle-rtrk
2019-11-05T20:47:22-fb37821/linux-mips64-rtrk
2019-11-05T20:47:22-fb37821/linux-mipsle-rtrk
2019-11-05T21:26:19-649f341/linux-mips-rtrk
2019-11-05T21:26:19-649f341/linux-mips64le-rtrk
2019-11-06T09:08:53-6108998/linux-mips-rtrk
2019-11-06T09:08:53-6108998/linux-mips64le-rtrk
2019-11-06T09:09:21-0ea7440/linux-mips-rtrk
2019-11-06T09:09:21-0ea7440/linux-mips64-rtrk
2019-11-06T09:09:59-0c5d545/linux-mips64le-rtrk
2019-11-06T13:55:04-6dc250f/linux-mips64-rtrk
2019-11-06T13:55:04-6dc250f/linux-mips64le-rtrk
2019-11-06T13:55:04-6dc250f/linux-mipsle-rtrk
2019-11-06T14:33:39-f891b7c/linux-mips-rtrk
2019-11-06T14:33:39-f891b7c/linux-mips64le-rtrk
2019-11-06T14:33:39-f891b7c/linux-mipsle-rtrk
2019-11-06T14:34:46-b824048/linux-mips-rtrk
2019-11-06T14:34:46-b824048/linux-mipsle-rtrk
2019-11-06T14:56:38-1bd974e/linux-mips-rtrk
2019-11-06T14:56:38-1bd974e/linux-mips64-rtrk
2019-11-06T14:56:38-1bd974e/linux-mips64le-rtrk
2019-11-06T14:56:38-1bd974e/linux-mipsle-rtrk
2019-11-06T15:08:19-cf3be9b/linux-mipsle-rtrk
2019-11-06T16:17:30-a5936a4/linux-mips-rtrk
2019-11-06T16:17:30-a5936a4/linux-mipsle-rtrk
2019-11-06T17:03:51-a2b1dc8/linux-mips64le-rtrk

@bcmills bcmills changed the title net: TestDialCancel is flaky on arm builders net: TestDialCancel is flaky on ARM and MIPS builders Nov 6, 2019
@bcmills bcmills added help wanted NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Nov 6, 2019
@ianlancetaylor
Contributor

The test assumes that trying to connect to 198.18.0.254 or 2001:2::254 will hang. I guess that is usually, but not always, true. I will send a CL to tweak the test.

@gopherbot

Change https://golang.org/cl/205698 mentions this issue: net: skip TestDialCancel if Dial fails with "connection refused"

6 participants