Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net/http: TestTransportDialTLSContext flakes in local testing #59035

Open
rsc opened this issue Mar 14, 2023 · 7 comments
Open

net/http: TestTransportDialTLSContext flakes in local testing #59035

rsc opened this issue Mar 14, 2023 · 7 comments
Labels
NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@rsc
Copy link
Contributor

rsc commented Mar 14, 2023

% go test -c net/http
% stress ./http.test -test.count=10 -test.run=TestTransportDialTLSContext
254 runs so far, 0 failures
572 runs so far, 0 failures
860 runs so far, 0 failures
1163 runs so far, 0 failures
1460 runs so far, 0 failures
1737 runs so far, 0 failures

/var/folders/pm/qx0jkz117r1cvc05kf31fhh8000n9d/T/go-stress173291610
--- FAIL: TestTransportDialTLSContext (25.91s)
    --- FAIL: TestTransportDialTLSContext/h2 (25.91s)
        transport_test.go:3821: Get "https://127.0.0.1:59866": dial tcp 127.0.0.1:59866: connect: operation timed out
FAIL


ERROR: exit status 1

2026 runs so far, 1 failures
2320 runs so far, 1 failures
2615 runs so far, 1 failures
2915 runs so far, 1 failures
3212 runs so far, 1 failures

/var/folders/pm/qx0jkz117r1cvc05kf31fhh8000n9d/T/go-stress896441329
--- FAIL: TestTransportDialTLSContext (25.92s)
    --- FAIL: TestTransportDialTLSContext/https1 (25.91s)
        transport_test.go:3821: Get "https://127.0.0.1:59866": dial tcp 127.0.0.1:59866: connect: operation timed out
FAIL


ERROR: exit status 1

3512 runs so far, 2 failures
3804 runs so far, 2 failures

I see a similar failure during all.bash from time to time, but stress is an easier way to reproduce it.

/cc @neild

@rsc rsc added the NeedsFix The path to resolution is known, but the work has not been done. label Mar 14, 2023
@rsc rsc added this to the Go1.21 milestone Mar 14, 2023
@bcmills
Copy link
Contributor

bcmills commented Mar 14, 2023

#57475 may be related.

@bcmills

This comment was marked as duplicate.

@ianlancetaylor
Copy link
Contributor

A connection to #57475 doesn't seem likely to me. That test uses test hooks to force a timeout, and those failures were about a forced timeout not happening as fast as expected. This test is not using test hooks and appears to be making real network connections.

@neild
Copy link
Contributor

neild commented Mar 15, 2023

I can replicate this on my laptop (darwin-arm64), but not on linux-amd64. Not sure about root cause, but I suspect something platform specific.

@neild
Copy link
Contributor

neild commented Mar 15, 2023

This made me think of #37795, although the setup is a bit different and I don't think that precise bug should apply here.

I ran the test case from #37795 (comment) to see if the underlying macOS bug causing #37795 still exists. This test consists of a C client and server. The client loops, connecting to the server and immediately closing each connection. The server accepts connections, attempts to read one byte (returning EOF), and then closes the connection. Previously, this test would hang in the read.

The read timeout seems to be fixed, but the client in this test now hangs for me after a few iterations, returning ETIMEDOUT from connect. That seems like the same behavior seen in the flake here.

Not sure yet if this is a kernel bug or something else (port exhaustion? antivirus getting out of hand?).

@neild
Copy link
Contributor

neild commented Mar 15, 2023

I thought this might be port exhaustion, but that doesn't appear the case.

I can reproduce what appears to be the same issue using a pair of C programs (small variant from the ones in #37795).

Client:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

int main() {
        int i = 0;
        while (1) {
                int sock = socket(PF_INET, SOCK_STREAM, 0);
                if (sock < 0) {
                        perror("socket");
                        return 1;
                }

                struct sockaddr_in raddr;
                raddr.sin_family = AF_INET;
                raddr.sin_port = htons(2020);
                if (inet_aton("127.0.0.1", &raddr.sin_addr) <= 0) {
                        perror("inet_aton");
                        return 1;
                }

                if (connect(sock, (struct sockaddr *)&raddr, sizeof(raddr)) < 0) {
                        perror("connect");
                        return 1;
                }
                close(sock);

                printf("%d\n", i);
                i++;

                // Sleep for a bit to avoid ephemeral port exhaustion.
                if (i % 100 == 0) {
                        usleep(1000000);
                }
        }
}

Server:

#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

int main() {
        int sock = socket(PF_INET, SOCK_STREAM, 0);
        if (sock < 0) {
                perror("socket");
                return 1;
        }

        struct sockaddr_in laddr;
        laddr.sin_family = AF_INET;
        laddr.sin_port = htons(2020);
        laddr.sin_addr.s_addr = 0;
        if (bind(sock, (struct sockaddr *)&laddr, sizeof(laddr)) < 0) {
                perror("bind");
                return 1;
        }

        if (listen(sock, 10) < 0) {
                perror("listen");
                return 1;
        }

        while (1) {
                struct sockaddr_in sin;
                socklen_t slen = sizeof(sin);
                int c = accept(sock, (struct sockaddr *)&sin, &slen);
                if (c < 0) {
                        perror("accept");
                        return 1;
                }
                close(c);
        }
}

Run the server. Run the client. The client makes connections to the server and closes them, with a delay every 100 connections to avoid ephemeral port exhaustion. After a while (seems to take ~16,000 connection attempts for me, so 2-3 minutes) the client's connect call fails with ETIMEDOUT.

@rsc
Copy link
Contributor Author

rsc commented Mar 16, 2023

I ran this on my Google-issued laptop and got the failure, but I've been unable to reproduce it on my personal MacBook Pro. It's possible this one is due to some non-standard antivirus or firewall on the Google laptop.

@gopherbot gopherbot modified the milestones: Go1.21, Go1.22 Aug 8, 2023
@gopherbot gopherbot modified the milestones: Go1.22, Go1.23 Feb 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

5 participants