Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: Sendfile broken on some versions of Oracle Solaris when reading from tmpfs #20857

Closed
binarycrusader opened this issue Jun 29, 2017 · 2 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Solaris
Milestone

Comments

@binarycrusader
Copy link
Contributor

Recently, when attempting to setup a buildlet for build.golang.org on Oracle Solaris, it was discovered that TestSendfile was failing consistently.

--- FAIL: TestSendfile (0.00s)
	sendfile_test.go:80: received 319488 bytes; expected 387851
	sendfile_test.go:84: retrieved data hash did not match
	sendfile_test.go:88: sent 319488 bytes; expected 387851
FAIL
FAIL	net	7.812s
...

But if I run ./all.bash on the buildlet system manually, that test passes:

...
ok      net     1.480s
ok      net/http        2.670s
ok      net/http/cgi    0.857s
ok      net/http/cookiejar      0.031s
ok      net/http/fcgi   0.016s
ok      net/http/httptest       0.033s
ok      net/http/httptrace      0.013s
ok      net/http/httputil       0.107s
ok      net/http/internal       0.044s
ok      net/internal/socktest   0.019s
ok      net/mail        0.044s
ok      net/rpc 0.092s
ok      net/rpc/jsonrpc 0.018s
ok      net/smtp        0.027s
ok      net/textproto   0.091s
ok      net/url 0.020s
...

If I ran that specific test manually, it passed:

$ go test -v -run TestSendfile
=== RUN   TestSendfile
--- PASS: TestSendfile (0.00s)
PASS
Socket statistical information:
(inet4, stream, default): opened=2 connected=1 listened=1 accepted=1 closed=3 openfailed=0 connectfailed=1 listenfailed=0 acceptfailed=0 closefailed=0

ok      net     0.008s

Thinking it was somehow environment, or the like, I setup an SMF service the same as used for the buildlet (e.g. https://github.com/golang/build/blob/master/env/solaris-amd64/joyent/buildlet.xml), but instead of calling the buildlet executable, I had it run all.bash instead. That also worked.

So I created a git clone of current Go trunk in the buildlet's $HOME, and then did this (due to @bradfitz 's suggestions):

$ gomote run solaris-amd64-oracledev@127.0.0.1:5936 ~/go-trunk/src/make.bash
...
---
Installed Go for solaris/amd64 in /export/home/golang/go-trunk
Installed commands in /export/home/golang/go-trunk/bin

$ gomote run -debug -path ~/go-trunk/bin:/usr/sbin:/usr/bin -dir ~/go-trunk/src solaris-amd64-oracledev@127.0.0.1:5936 ~/go-trunk/bin/go tool dist test -v go_test:net
:: Running /export/home/golang/go-trunk/bin/go with args ["/export/home/golang/go-trunk/bin/go" "tool" "dist" "test" "-v" "go_test:net"] and env ["LANG=en_US.UTF-8" "HZ=" "PAGER=/usr/bin/less -ins" "PWD=/export/home/golang/go/bin" "HOME=/export/home/golang" "https_proxy=http://www-proxy:80" "http_proxy=http://www-proxy:80" "MAIL=/var/mail/golang" "SHELL=/usr/bin/bash" "TERM=xterm-256color" "SHLVL=1" "LOGNAME=golang" "PATH=/export/home/golang/go-trunk/bin:/usr/sbin:/usr/bin" "_=/export/home/golang/go/bin/buildlet" "OLDPWD=/export/home/golang" "WORKDIR=/tmp/wl" "GOROOT_BOOTSTRAP=/opt/local/go-solaris-amd64-bootstrap" "GO_BUILDER_NAME=solaris-amd64-oracledev"] in dir /export/home/golang/go-trunk/src


##### Testing packages.
# go tool dist test -run=^go_test:net$
ok      net     7.498s

ALL TESTS PASSED (some were excluded)

I then realized the key difference -- when I'm building and running manually, I do so using a "standard" ZFS-backed filesystem, but when the buildlet runs, it's using /tmp/workdir-host-solaris-oracle-shawn for its workdir; /tmp on Solaris is backed by tmpfs (i.e. ramdisk). The tmpfs filesystem on Solaris has subtle differences in behaviour that occasionally expose other bugs.

If I copy the src/net/testdata/Mark.Twain-Tom.Sawyer.txt to /tmp and then change the test to explicitly read the file from there, it fails every time.

$ ./net.tmp.test -test.v -test.run TestSendfile
=== RUN   TestSendfile
--- FAIL: TestSendfile (0.00s)
        sendfile_test.go:80: received 270336 bytes; expected 387851
        sendfile_test.go:84: retrieved data hash did not match
        sendfile_test.go:88: sent 270336 bytes; expected 387851
FAIL
Socket statistical information:
(inet4, stream, default): opened=2 connected=1 listened=1 accepted=1 closed=3 openfailed=0 connectfailed=1 listenfailed=0 acceptfailed=1 closefailed=0

...but if I use a ZFS-backed filesystem location, such as /var/tmp, it works every time:

$ ./net.var.tmp.test -test.v -test.run TestSendfile
=== RUN   TestSendfile
--- PASS: TestSendfile (0.00s)
PASS
Socket statistical information:
(inet4, stream, default): opened=2 connected=1 listened=1 accepted=1 closed=3 openfailed=0 connectfailed=1 listenfailed=0 acceptfailed=1 closefailed=0

As a workaround, I've manually set the buildlet's -workdir to /var/tmp/$host_key_name since it defaults to using /tmp/$host_key_name.

Finally, I tested on older releases of Solaris and this problem was not reproduceable, which suggests that this is likely an OS bug. The OS bug is actively being investigated now. I will report back as to whether this is an OS bug or a Go bug and then either close or submit a fix as appropriate.

@bradfitz bradfitz added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Solaris labels Jun 29, 2017
@bradfitz bradfitz added this to the Go1.10 milestone Jun 29, 2017
@gopherbot
Copy link

CL https://golang.org/cl/47351 mentions this issue.

@binarycrusader
Copy link
Contributor Author

binarycrusader commented Jun 30, 2017

So as far as I can tell, this was just a subtle logic error in my original fix for sendfile in #13892. Notably, the code wrongly assumed that if EAGAIN/EINTR was encountered that a partial write occurred. In retrospect, that was a silly assumption. Instead, just like any other platform, we should try again, but unlike other platforms, we do need to check for the possibility of a partial write. This fix should be backwards compatible with any variant of Solaris that Go supports.

@golang golang locked and limited conversation to collaborators Jun 30, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Solaris
Projects
None yet
Development

No branches or pull requests

3 participants