Navigation Menu

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/build: difference between post-submit builder and gomote makes repros less likely #42379

Closed
prattmic opened this issue Nov 4, 2020 · 5 comments
Labels
Builders x/build issues (builders, bots, dashboards) FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@prattmic
Copy link
Member

prattmic commented Nov 4, 2020

What version of Go are you using (go version)?

$ go version
go version devel +594b4a3bfe Wed Nov 4 16:54:48 2020 +0000 linux/amd64

Does this issue reproduce with the latest release?

Bug under investigation not in latest releast

What operating system and processor architecture are you using (go env)?

openbsd-amd64-62 builder

What did you do?

Attempt to reproduce failures from #42237. These failures are occurring on nearly 100% of post-submit openbsd-amd64-62 builds on build.golang.org.

My attempt to repro (run from /src):

$ gomote create openbsd-amd64-62
user-mpratt-openbsd-amd64-62-0
$ GOROOT=$(readlink -f $(pwd)/../) gomote push user-mpratt-openbsd-amd64-62-0
2020/11/04 13:54:40 installing go1.4
2020/11/04 13:54:46 Remote doesn't have "src/cmd/go/testdata/script/test_benchmark_labels.txt"
2020/11/04 13:54:46 Remote doesn't have "src/runtime/debug/stack.go"
2020/11/04 13:54:46 Remote doesn't have "test/dwarf/dwarf.dir/z6.go"
2020/11/04 13:54:46 Remote doesn't have "test/fixedbugs/bug322.go"
2020/11/04 13:54:46 Remote doesn't have "src/cmd/vendor/golang.org/x/sys/unix/syscall_unix_gc.go"
2020/11/04 13:54:46 Remote doesn't have 9344 files (only showed 5).
2020/11/04 13:54:46 Remote lacks a VERSION file; sending a fake one
2020/11/04 13:54:49 Uploading 9345 new/changed files; 23218856 byte .tar.gz
$ gomote run user-mpratt-openbsd-amd64-62-0 ./go/src/all.bash
Building Go cmd/dist using /tmp/workdir/go1.4. (go1.12 openbsd/amd64)
Building Go toolchain1 using /tmp/workdir/go1.4.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
Building Go toolchain3 using go_bootstrap and Go toolchain2.
Building packages and commands for openbsd/amd64.

##### Testing packages.

... 20 minutes later ...
ALL TESTS PASSED                                                    

What did you expect to see?

Test failures similar to the post-submit builders in almost all runs.

What did you see instead?

Failures almost never occur. Running all.bash takes ~20min, so I haven't run it tons of times, but I'd say I've probably gotten 1 or 2 failures out of around 10 runs. I've managed to get a few more repros running go test runtime, but I'm not sure the failure rate is any higher; that test just runs faster.

@gopherbot gopherbot added the Builders x/build issues (builders, bots, dashboards) label Nov 4, 2020
@gopherbot gopherbot added this to the Unreleased milestone Nov 4, 2020
@prattmic
Copy link
Member Author

prattmic commented Nov 4, 2020

One other thing I've noticed which may be related is that it seems like the openbsd-amd64-62 gomote is significantly slower than the openbsd-amd64-62 builder.

e.g., compare this post-submit run to my gomote run:

Test Output
##### Testing packages.                       
ok      archive/tar     0.132s         
ok      archive/zip     0.849s        
ok      bufio   0.342s        
ok      bytes   2.026s        
ok      compress/bzip2  0.112s
ok      compress/flate  4.316s
ok      compress/gzip   0.657s
ok      compress/lzw    0.244s
ok      compress/zlib   1.988s
ok      container/heap  0.023s
ok      container/list  0.017s
ok      container/ring  0.016s
ok      context 0.153s
ok      crypto  0.023s
ok      crypto/aes      0.078s
ok      crypto/cipher   0.038s
ok      crypto/des      0.044s
ok      crypto/dsa      0.031s
ok      crypto/ecdsa    0.346s
ok      crypto/ed25519  0.111s
ok      crypto/elliptic 0.079s
ok      crypto/hmac     0.024s
ok      crypto/internal/subtle  0.020s
ok      crypto/md5      0.020s
ok      crypto/rand     0.127s
ok      crypto/rc4      0.055s
ok      crypto/rsa      0.233s
ok      crypto/sha1     0.015s
ok      crypto/sha256   0.016s
ok      crypto/sha512   0.024s
ok      crypto/subtle   0.022s
ok      crypto/tls      2.721s
ok      crypto/x509     1.747s
ok      database/sql    1.046s
ok      database/sql/driver     0.027s
ok      debug/dwarf     0.045s
ok      debug/elf       0.066s
ok      debug/gosym     0.057s
ok      debug/macho     0.027s
ok      debug/pe        0.037s
ok      debug/plan9obj  0.028s
ok      embed/internal/embedtest        0.033s
ok      encoding/ascii85        0.058s
ok      encoding/asn1   0.033s
ok      encoding/base32 0.130s
ok      encoding/base64 0.029s
ok      encoding/binary 0.038s
ok      encoding/csv    0.029s
ok      encoding/gob    0.067s
ok      encoding/hex    0.084s
ok      encoding/json   0.687s
ok      encoding/pem    0.048s
ok      encoding/xml    0.111s
ok      errors  0.055s
ok      expvar  0.043s
ok      flag    0.124s
ok      fmt     0.171s
ok      go/ast  0.045s
ok      go/build        2.028s
ok      go/constant     0.020s
ok      go/doc  0.233s
ok      go/format       0.039s
ok      go/importer     0.516s
ok      go/internal/gccgoimporter       0.052s
ok      go/internal/gcimporter  10.296s
ok      go/internal/srcimporter 9.709s
ok      go/parser       0.071s
ok      go/printer      0.698s
ok      go/scanner      0.035s
ok      go/token        0.059s
ok      go/types        4.011s
ok      hash    0.019s
ok      hash/adler32    0.040s
ok      hash/crc32      0.021s
ok      hash/crc64      0.022s
ok      hash/fnv        0.121s
ok      hash/maphash    0.254s
ok      html    0.025s
ok      html/template   0.481s
ok      image   0.192s
ok      image/color     0.042s
ok      image/draw      0.193s
ok      image/gif       3.448s
ok      image/jpeg      0.388s
ok      image/png       0.177s
ok      index/suffixarray       0.325s
ok      internal/cpu    0.040s
ok      internal/fmtsort        0.030s
ok      internal/poll   0.132s
ok      internal/profile        0.015s
ok      internal/reflectlite    0.249s
ok      internal/singleflight   0.028s
ok      internal/trace  0.207s
ok      internal/unsafeheader   0.018s
ok      internal/xcoff  0.038s
ok      io      0.359s
ok      io/fs   0.025s
ok      io/ioutil       0.030s
ok      log     0.040s
ok      log/syslog      1.411s
ok      math    0.052s
ok      math/big        2.772s
ok      math/bits       0.034s
ok      math/cmplx      0.020s
ok      math/rand       0.566s
ok      mime    0.055s
ok      mime/multipart  0.411s
ok      mime/quotedprintable    0.057s
ok      net     29.665s
ok      net/http        23.805s
ok      net/http/cgi    1.256s
ok      net/http/cookiejar      0.112s
ok      net/http/fcgi   0.040s
ok      net/http/httptest       0.132s
ok      net/http/httptrace      0.028s
ok      net/http/httputil       0.195s
ok      net/http/internal       0.020s
ok      net/http/pprof  4.234s
ok      net/internal/socktest   0.015s
ok      net/mail        0.029s
ok      net/rpc 0.068s
ok      net/rpc/jsonrpc 0.031s
ok      net/smtp        0.074s
ok      net/textproto   0.044s
ok      net/url 0.046s
ok      os      1.398s
ok      os/exec 1.937s
ok      os/signal       2.588s
ok      os/user 0.026s
ok      path    0.016s
ok      path/filepath   0.133s
ok      plugin  0.019s
ok      reflect 2.786s
ok      regexp  0.226s
ok      regexp/syntax   0.577s

Almost every big test takes 50-100%+ longer than the post-submit version. bytes 1.2s vs 2s, compress/zlip 1.5s vs 2s, reflect 1s vs 2.7s.

This is a very empirical observation, and I haven't seen the post-submit results from a passing run (since the logs are immediately deleted), so perhaps there is nothing there, but this makes me wonder if the gomote is on a smaller instance or something.

Edit: this could also be cause the builder is running fewer tests? -short?

@dmitshur
Copy link
Contributor

dmitshur commented Nov 4, 2020

Thanks for reporting. Also CC @golang/release for awareness.

The openbsd-amd64-62 builder is a normal (not a "longtest" or "race" or otherwise semi-special builder), so running all.bash in a gomote session should be equivalent in spirit of what the builder is meant to test.

One of the most important differences that comes to mind is that when you have a gomote session running all.bash, all execution happens sequentially. The post-submit builder differs in that it's configured to shard test execution across multiple builder instances of that type:

$ goexec 'dashboard.Builders["openbsd-amd64-62"].SplitMakeRun()'
(bool)(true)

(See coordinator.go#L2003-L2007 and BuildConfig.SplitMakeRun.)

Which should explain the timing difference you've observed above.

I've only had a chance to look briefly so far, I'll see if I can look more later on.

@dmitshur dmitshur added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Nov 4, 2020
@dmitshur
Copy link
Contributor

dmitshur commented Nov 4, 2020

I haven't seen the post-submit results from a passing run

I've looked up the logs for some passing post-submit builder runs, for reference:

They all look similar, like so:

[...]
ok  	regexp/syntax	0.270s
ok  	runtime	72.720s
ok  	runtime/debug	0.112s
ok  	runtime/internal/atomic	0.073s
ok  	runtime/internal/math	0.010s
ok  	runtime/internal/sys	0.046s
ok  	runtime/metrics	0.009s
ok  	runtime/pprof	6.015s
ok  	runtime/race	0.386s
ok  	runtime/trace	1.521s
ok  	sort	0.062s
[...]

@dmitshur
Copy link
Contributor

dmitshur commented Nov 5, 2020

I've investigated this and I believe I'm able to reproduce the problem of #42237 on the openbsd-amd64-62 builder via gomote.

Investigation Log

I limited my scope to the openbsd-amd64-62 builder, which fails very often on build.golang.org. All of the failures have this shape:

panic: test timed out after 3m0s
[...]
FAIL	runtime	180.092s
FAIL

In other words, the runtime test is failing because it is taking more than 3 minutes to run, which is the default timeout for most package tests (see here) and the builder doesn't override it via GO_TEST_TIMEOUT_SCALE.

Here's a relevant bit from an event log of a recent failed build (full log here):

2020-11-04T23:36:25Z run_tests_multi 10.240.0.25:80: [go_test:regexp go_test:regexp/syntax]
2020-11-04T23:36:30Z finish_run_tests_multi after 4.58s; 10.240.0.25:80: [go_test:regexp go_test:regexp/syntax]
2020-11-04T23:36:30Z run_test:go_test:runtime 10.240.0.25:80
2020-11-04T23:37:00Z still_waiting_on_test go_test:runtime
2020-11-04T23:37:30Z still_waiting_on_test go_test:runtime
2020-11-04T23:38:00Z still_waiting_on_test go_test:runtime
2020-11-04T23:38:30Z still_waiting_on_test go_test:runtime
2020-11-04T23:39:00Z still_waiting_on_test go_test:runtime
2020-11-04T23:39:30Z still_waiting_on_test go_test:runtime
2020-11-04T23:39:55Z finish_run_test:go_test:runtime after 3m25.3s; 10.240.0.25:80
2020-11-04T23:39:55Z run_tests_multi 10.240.0.25:80: [go_test:runtime/debug go_test:runtime/internal/atomic go_test:runtime/internal/math]
2020-11-04T23:39:55Z finish_make_and_test after 14m36.5s
2020-11-04T23:39:55Z done with test failures
2020-11-04T23:39:55Z finish_run_tests_multi after 74.7ms; err=buildlet: Client closed; 10.240.0.25:80: [go_test:runtime/debug go_test:runtime/internal/atomic go_test:runtime/internal/math]
2020-11-04T23:39:55Z main_buildlet_broken 10.240.0.25:80

Notably, it tells me the runtime package tests were sharded and run individually (compare with regexp and regexp/syntax package tests above it, which were grouped together in a single runTestsOnBuildlet call).

This gave me the idea to try reproducing the failure by doing just go test -short runtime on a gomote session. Fortunately, I was able to reproduce the problem by doing that and didn't need to do anything more.

Reproduction Steps

  1. Build Go at parent commit of CL 232298 (commit b3f7f60) in a openbsd-amd64-62 gomote session.

  2. Run go test -count=1 -short runtime repeatedly. Observe success, all around 1~ minute completion time:

    gotip $ git checkout 8fdc79e18a9704185bd6471b592db1e8004bd993~
    [...]
    HEAD is now at b3f7f60129 cmd/dist: fix build failure of misc/cgo/test on arm64
    gotip $ export MOTE=$(gomote create openbsd-amd64-62)
    gotip $ gomote push $MOTE
    [...]
    2020/11/04 18:29:24 Uploading 411 new/changed files; 2728173 byte .tar.gz
    gotip $ gomote run $MOTE go/src/make.bash 
    Building Go cmd/dist using /tmp/workdir/go1.4. (go1.12 openbsd/amd64)
    [...]
    gotip $ gomote run $MOTE go/bin/go test -count=1 -short runtime
    ok  	runtime	88.095s
    gotip $ gomote run $MOTE go/bin/go test -count=1 -short runtime
    ok  	runtime	46.000s
    gotip $ gomote run $MOTE go/bin/go test -count=1 -short runtime
    ok  	runtime	85.341s
    gotip $ gomote run $MOTE go/bin/go test -count=1 -short runtime
    ok  	runtime	43.466s
    gotip $ gomote run $MOTE go/bin/go test -count=1 -short runtime
    ok  	runtime	52.164s
    gotip $ gomote run $MOTE go/bin/go test -count=1 -short runtime
    ok  	runtime	55.943s
    gotip $ gomote run $MOTE go/bin/go test -count=1 -short runtime
    ok  	runtime	56.088s
    
  3. Build Go at CL 232298 (commit 8fdc79e) in a openbsd-amd64-62 gomote session.

  4. Also run go test -count=1 -short runtime repeatedly. Now, observe success but many instances exceed the 180 second timeout:

    gotip $ git checkout 8fdc79e18a9704185bd6471b592db1e8004bd993
    HEAD is now at 8fdc79e18a runtime: reduce timer latency
    gotip $ gomote push $MOTE
    [...]
    2020/11/04 18:44:32 Uploading 3 new/changed files; 61629 byte .tar.gz
    gotip $ gomote run $MOTE go/src/make.bash
    Building Go cmd/dist using /tmp/workdir/go1.4. (go1.12 openbsd/amd64)
    [...]
    Installed commands in /tmp/workdir/go/bin
    gotip $ gomote run $MOTE go/bin/go test -count=1 -short runtime 
    ok      runtime 295.983s
    gotip $ gomote run $MOTE go/bin/go test -count=1 -short runtime 
    ok      runtime 168.566s
    gotip $ gomote run $MOTE go/bin/go test -count=1 -short runtime
    ok      runtime 290.109s
    gotip $ gomote run $MOTE go/bin/go test -count=1 -short runtime
    ok      runtime 308.163s
    gotip $ gomote run $MOTE go/bin/go test -count=1 -short runtime
    ok      runtime 405.817s
    

Intermittent Slow Tests

I added -v to see what tests were slow. I found a few tests were sometimes slow (but not other times):

[...]
=== RUN   TestShrinkStackDuringBlockedSend
--- PASS: TestShrinkStackDuringBlockedSend (0.23s)
=== RUN   TestNoShrinkStackWhileParking
--- PASS: TestNoShrinkStackWhileParking (6.62s)
=== RUN   TestSelectDuplicateChannel
(a wait time of 2 minutes here) <------------------ !!!
--- PASS: TestSelectDuplicateChannel (120.02s)
=== RUN   TestSelectStackAdjust
--- PASS: TestSelectStackAdjust (0.08s)
=== RUN   TestChanSendSelectBarrier
--- PASS: TestChanSendSelectBarrier (0.05s)
[...]
PASS
ok  	runtime	199.880s
[...]
--- PASS: TestPseudoRandomSend (0.00s)
=== RUN   TestMultiConsumer
--- PASS: TestMultiConsumer (0.28s)
=== RUN   TestShrinkStackDuringBlockedSend
--- PASS: TestShrinkStackDuringBlockedSend (0.23s)
=== RUN   TestNoShrinkStackWhileParking
(a wait time of 2 minutes here) <------------------ !!!
--- PASS: TestNoShrinkStackWhileParking (126.49s)
=== RUN   TestSelectDuplicateChannel
(a wait time of 2 minutes here) <------------------ !!!
--- PASS: TestSelectDuplicateChannel (120.02s)
=== RUN   TestSelectStackAdjust
--- PASS: TestSelectStackAdjust (0.08s)
=== RUN   TestChanSendSelectBarrier
--- PASS: TestChanSendSelectBarrier (0.03s)
[...]
PASS
ok  	runtime	295.983s

Running just those tests individually didn't reproduce the problem:


gotip $ gomote run $MOTE go/bin/go test -count=1 -v -run=TestSelectDuplicateChannel -short runtime
=== RUN   TestSelectDuplicateChannel
--- PASS: TestSelectDuplicateChannel (0.03s)
PASS
ok  	runtime	0.046s
gotip $ gomote run $MOTE go/bin/go test -count=1 -v -run=TestSelectDuplicateChannel -short runtime
=== RUN   TestSelectDuplicateChannel
--- PASS: TestSelectDuplicateChannel (0.03s)
PASS
ok  	runtime	0.041s
gotip $ gomote run $MOTE go/bin/go test -count=1 -v -run=TestSelectDuplicateChannel -short runtime
=== RUN   TestSelectDuplicateChannel
--- PASS: TestSelectDuplicateChannel (0.03s)
PASS
ok  	runtime	0.039s
gotip $ gomote run $MOTE go/bin/go test -count=1 -v -run=TestSelectDuplicateChannel -short runtime
=== RUN   TestSelectDuplicateChannel
--- PASS: TestSelectDuplicateChannel (0.03s)
PASS
ok  	runtime	0.045s
gotip $ gomote run $MOTE go/bin/go test -count=1 -v -run=TestSelectDuplicateChannel -short runtime
=== RUN   TestSelectDuplicateChannel
--- PASS: TestSelectDuplicateChannel (0.03s)
PASS
ok  	runtime	0.045s

So it seems there's some sort of interaction between all the runtime tests when run one after another.

I hope this helps @prattmic be able to also reproduce the problem and use it to investigate #42237 more easily. Please let me know if it still doesn't reproduce for you for some reason.

@dmitshur
Copy link
Contributor

As far as I can tell, the root cause here turned out to be that the problem was reproducible only when running go test -short runtime individually, but not as part of all.bash. I'll close this since I don't see anything more to do here. Please comment if I missed something.

@dmitshur dmitshur self-assigned this Nov 10, 2020
@golang golang locked and limited conversation to collaborators Nov 10, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Builders x/build issues (builders, bots, dashboards) FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

3 participants