Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: unexpected nils from (*UnixConn).LocalAddr on solaris-amd64-oraclerel builder #34611

Closed
bcmills opened this issue Sep 30, 2019 · 24 comments
Closed
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 OS-Solaris release-blocker Testing An issue that has been verified to require only test changes, not just a test failure.
Milestone

Comments

@bcmills
Copy link
Contributor

bcmills commented Sep 30, 2019

https://build.golang.org/log/d9ed52fecef0637337eaaef8194b90b06ba098aa contains a curious puzzle:

solaris-amd64-oraclerel at 739bf6b929b66ac1715268e269da01c8199f034b
[…]
--- FAIL: TestUnixAndUnixpacketServer (10.01s)
    server_test.go:144: skipping unix @nettest/go/unix test
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x5bf460]

goroutine 3285 [running]:
testing.tRunner.func1(0xc00012ae00)
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/testing/testing.go:874 +0x3a3
panic(0x647400, 0x849ee0)
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/runtime/panic.go:679 +0x1b2
net.TestUnixAndUnixpacketServer(0xc00012ae00)
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/net/server_test.go:189 +0x670
testing.tRunner(0xc00012ae00, 0x696098)
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/testing/testing.go:909 +0xc9
created by testing.(*T).Run
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/testing/testing.go:960 +0x351
FAIL	net	17.374s

The skipping message suggests that the panic occurred on the test after @nettest/go/unix. The test after is {"unixpacket", testUnixAddr()}.

It's hard to tell from the backtrace, but this seems to suggest that either net.Dial returned a nil pointer (of a concrete connection type), or returned a non-nil pointer whose LocalAddr method returned a nil pointer (of a concrete Addr implementation type).

CC @mikioh @bradfitz @ianlancetaylor

@bcmills bcmills added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Sep 30, 2019
@bcmills bcmills added this to the Go1.14 milestone Sep 30, 2019
@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
@bcmills
Copy link
Contributor Author

bcmills commented Nov 18, 2019

I wonder: is there anything distinctive about the network configuration for that machine? (Is it IPv4-only, or IPv6-only, or dual-stack?)

@rorth
Copy link

rorth commented Nov 18, 2019 via email

@rorth
Copy link

rorth commented Jun 17, 2021 via email

@gopherbot
Copy link

Change https://golang.org/cl/366114 mentions this issue: net: diagnose unexpected nils in TestUnixAndUnixpacketServer

@bcmills
Copy link
Contributor Author

bcmills commented Nov 22, 2021

@rorth, I've mailed CL 366114 to add some debugging information here, but I'm unable to easily try it out on the builder because it lacks SSH access. Could you give it a spin and see if it comes up with anything useful?

gopherbot pushed a commit that referenced this issue Nov 22, 2021
For #34611

Change-Id: I31894d58498b2c290ecceccfc004bc817f8969c9
Reviewed-on: https://go-review.googlesource.com/c/go/+/366114
Trust: Bryan C. Mills <bcmills@google.com>
Run-TryBot: Bryan C. Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@rorth
Copy link

rorth commented Nov 22, 2021 via email

@bcmills
Copy link
Contributor Author

bcmills commented Nov 22, 2021

Great, thanks! That at least gives us a bit more information to go on...

@bcmills
Copy link
Contributor Author

bcmills commented Nov 23, 2021

The stack trace line at server_test.go:218 is the one right after Dial, before the transceiver goroutine is started and before the connection is closed:
https://cs.opensource.google/go/go/+/master:src/net/server_test.go;l=218;drc=2d7ae3fbd86d4b5471ac4044ece208b29cd0ef74

The LocalAddr method implementation is here:
https://cs.opensource.google/go/go/+/master:src/net/net.go;l=214-222;drc=a53e3d5f885ca7a0df1cd6cf65faa5b63a802dce
That implies that either c.ok() is false (meaning c.fd is nil, since we know that c itself is not), or c.fd.laddr is itself nil.

This *UnixConn comes from Dial, so I'm guessing it is constructed by (*sysDialer).dialUnix.
The only way that could have a nil fd is if a nil fd is returned by socket:
https://cs.opensource.google/go/go/+/master:src/net/unixsock_posix.go;l=45-49;drc=master

I don't see anything particularly suspicious from there on down regarding a nil fd, which makes me suspect that laddr itself is nil. On that front, I do see a missed error check from a getsockname syscall:
https://cs.opensource.google/go/go/+/master:src/net/sock_posix.go;l=164;drc=master

I suppose that the next step is to check that error and see what it turns up.

@gopherbot
Copy link

Change https://golang.org/cl/366536 mentions this issue: net: check error from syscall.Getsockname in (*netFD).dial

@bcmills
Copy link
Contributor Author

bcmills commented Nov 23, 2021

@rorth: ok, another diagnostic CL for you to try!
https://golang.org/cl/366536

@rorth
Copy link

rorth commented Nov 24, 2021 via email

@gopherbot
Copy link

Change https://golang.org/cl/366774 mentions this issue: log/syslog: create unix sockets in unique directories

@gopherbot
Copy link

Change https://golang.org/cl/369160 mentions this issue: net: clarify that conn.LocalAddr and conn.RemoteAddr might not be known

@bcmills
Copy link
Contributor Author

bcmills commented Dec 3, 2021

@rorth, one more change to try! I've uploaded a fourth patchset to https://golang.org/cl/366536 that I'm hopeful may fix the crash.

gopherbot pushed a commit that referenced this issue Dec 6, 2021
startServer was invoking os.Remove on the temporary file for a unix
socket after creating it. Since the files were created in the global
temp directory, that could cause two tests to arrive at colliding
names.

(Noticed while looking into the failure at
https://storage.googleapis.com/go-build-log/af2c83b1/solaris-amd64-oraclerel_3e01fda8.log,
but I would be surprised if this solves that failure.)

This change uses unique temporary directories, and attempts to keep
name lengths minimal to avoid accidentally running into socket-name
length limitations.

Updates #34611

Change-Id: I21743f245e5c14645e03f09795013e058b984471
Reviewed-on: https://go-review.googlesource.com/c/go/+/366774
Trust: Bryan Mills <bcmills@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
gopherbot pushed a commit that referenced this issue Dec 6, 2021
For #34611

Change-Id: I9a1357f53124c98ad017b58774696d0377dbea27
Reviewed-on: https://go-review.googlesource.com/c/go/+/369160
Trust: Bryan Mills <bcmills@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@bcmills
Copy link
Contributor Author

bcmills commented Dec 7, 2021

@rorth, I think this is fixed at head — but if you wouldn't mind, I would appreciate one more round of stress-testing to verify.

@bcmills bcmills changed the title net: SIGSEGV during net.TestUnixAndUnixpacketServer on solaris-amd64-oraclerel builder net: unexpected nils from (*UnixConn).LocalAddr on solaris-amd64-oraclerel builder Dec 7, 2021
@bcmills bcmills modified the milestones: Backlog, Go1.18 Dec 7, 2021
@bcmills bcmills self-assigned this Dec 7, 2021
@bcmills bcmills reopened this Dec 8, 2021
@bcmills
Copy link
Contributor Author

bcmills commented Dec 8, 2021

I reverted the attempted fix due to #50033. Going to try a different approach.

@gopherbot
Copy link

Change https://golang.org/cl/370694 mentions this issue: net: create unix sockets in unique directories

@gopherbot
Copy link

Change https://golang.org/cl/370695 mentions this issue: net: do not try to remove the LocalAddr of a unix socket

@gopherbot
Copy link

Change https://golang.org/cl/370696 mentions this issue: net: pass a testing.TB to newLocal* helpers

@bcmills bcmills added okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 release-blocker Testing An issue that has been verified to require only test changes, not just a test failure. labels Dec 9, 2021
@bcmills
Copy link
Contributor Author

bcmills commented Dec 9, 2021

I found two separate problems in the tests:

  1. We're calling os.Remove on endpoints that are literally never on the filesystem as far as I can tell.
  2. We're creating “unique” socket paths in a way that doesn't actually ensure uniqueness.

I have mailed fixes for both of those issues; given that this is a source of recurring builder failures, and especially given that we know how to fix the underlying issues, I think this is a release-blocker via #11811.

gopherbot pushed a commit that referenced this issue Dec 13, 2021
Passing in an explicit testing.TB gives two benefits:

1. It allows the helper to fail the test itself, instead of returning
   an error to the caller. A non-nil error invariably fails the
   calling test, and none of these callers bother to add detail to the
   error when logging it anyway so returning the error just added
   noise to the test bodies.

2. It allows the helper to use t.Cleanup to perform any needed cleanup
   tasks, which will be used in CL 370695 to clean up temp directories
   used as namespaces for unix socket paths.

For #34611

Change-Id: I805e701687c12de2caca955649369294229c10b4
Reviewed-on: https://go-review.googlesource.com/c/go/+/370696
Trust: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
gopherbot pushed a commit that referenced this issue Dec 13, 2021
This change applies the same transformation as in CL 366774,
but to the net package.

testUnixAddr was using os.CreateTemp to obtain a unique socket path,
but then calling os.Remove on that path immediately. Since the
existence of the file is what guarantees its uniqueness, that could
occasionally result in testUnixAddr returning the same path for two
calls, causing the tests using those paths to fail — especially if
they are the same test or are run in parallel.

Instead, we now create a unique, short temp directory for each call,
and use a path within that directory for the socket address.

For #34611

Change-Id: I8e13b606abce2479a0305f7aeecf5d54c449a032
Reviewed-on: https://go-review.googlesource.com/c/go/+/370694
Trust: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@rsc rsc unassigned bcmills Jun 23, 2022
@golang golang locked and limited conversation to collaborators Jun 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 OS-Solaris release-blocker Testing An issue that has been verified to require only test changes, not just a test failure.
Projects
None yet
Development

No branches or pull requests

4 participants