x/build/cmd/coordinator: bad test partitioning for plan9-arm #49343

millerresearch · 2021-11-04T12:50:12Z

Currently a lot of short tests are being executed individually in separate invocations of go tool dist test, instead of being grouped together. This matters for overall runtime, because on the raspberry pi builders each invocation of go tool dist test adds an overhead of >30 seconds, even for a test which takes a small fraction of a second.
See for example this excerpt from an execution log on plan9-arm:

  2021-10-25T17:16:23Z still_waiting_on_test go_test:internal/abi
  2021-10-25T17:16:53Z still_waiting_on_test go_test:internal/abi
  2021-10-25T17:17:18Z finish_run_tests_multi after 1m24.6s; pi4g: [go_test:internal/abi go_test:internal/buildcfg go_test:internal/cpu]
  2021-10-25T17:17:18Z run_test:go_test:internal/execabs pi4g
  2021-10-25T17:17:48Z still_waiting_on_test go_test:internal/execabs
  2021-10-25T17:17:53Z finish_run_test:go_test:internal/execabs after 35.3s; pi4g
  2021-10-25T17:17:53Z run_test:go_test:internal/fmtsort pi4g
  2021-10-25T17:18:23Z still_waiting_on_test go_test:internal/fmtsort
  2021-10-25T17:18:37Z finish_run_test:go_test:internal/fmtsort after 43.5s; pi4g
  2021-10-25T17:18:37Z run_test:go_test:internal/fuzz pi4g
  2021-10-25T17:19:07Z still_waiting_on_test go_test:internal/fuzz
  2021-10-25T17:19:13Z finish_run_test:go_test:internal/fuzz after 36.3s; pi4g
  2021-10-25T17:19:13Z run_test:go_test:internal/itoa pi4g
  2021-10-25T17:19:43Z still_waiting_on_test go_test:internal/itoa
  2021-10-25T17:19:48Z finish_run_test:go_test:internal/itoa after 34.6s; pi4g
  2021-10-25T17:19:48Z run_test:go_test:internal/poll pi4g
  2021-10-25T17:20:18Z still_waiting_on_test go_test:internal/poll
  2021-10-25T17:20:35Z finish_run_test:go_test:internal/poll after 47.1s; pi4g
  2021-10-25T17:20:35Z run_test:go_test:internal/profile pi4g
  2021-10-25T17:21:05Z still_waiting_on_test go_test:internal/profile
  2021-10-25T17:21:11Z finish_run_test:go_test:internal/profile after 36.1s; pi4g
  2021-10-25T17:21:11Z run_test:go_test:internal/reflectlite pi4g
  2021-10-25T17:21:41Z still_waiting_on_test go_test:internal/reflectlite
  2021-10-25T17:22:04Z finish_run_test:go_test:internal/reflectlite after 53s; pi4g
  2021-10-25T17:22:04Z run_test:go_test:internal/singleflight pi4g
  2021-10-25T17:22:34Z still_waiting_on_test go_test:internal/singleflight
  2021-10-25T17:22:38Z finish_run_test:go_test:internal/singleflight after 34s; pi4g
  2021-10-25T17:22:38Z run_test:go_test:internal/trace pi4g
  2021-10-25T17:23:08Z still_waiting_on_test go_test:internal/trace
  2021-10-25T17:23:14Z finish_run_test:go_test:internal/trace after 35.5s; pi4g
  2021-10-25T17:23:14Z run_test:go_test:internal/unsafeheader pi4g
  2021-10-25T17:23:44Z still_waiting_on_test go_test:internal/unsafeheader
  2021-10-25T17:23:48Z finish_run_test:go_test:internal/unsafeheader after 34.8s; pi4g
  2021-10-25T17:23:48Z run_test:go_test:internal/xcoff pi4g
  2021-10-25T17:24:18Z still_waiting_on_test go_test:internal/xcoff
  2021-10-25T17:24:24Z finish_run_test:go_test:internal/xcoff after 35.4s; pi4g
  2021-10-25T17:24:24Z run_test:go_test:io pi4g
  2021-10-25T17:24:54Z still_waiting_on_test go_test:io
  2021-10-25T17:25:11Z finish_run_test:go_test:io after 47.4s; pi4g
  2021-10-25T17:25:11Z run_test:go_test:io/fs pi4g
  2021-10-25T17:25:41Z still_waiting_on_test go_test:io/fs
  2021-10-25T17:25:46Z finish_run_test:go_test:io/fs after 34.9s; pi4g
  2021-10-25T17:25:46Z run_test:go_test:io/ioutil pi4g
  2021-10-25T17:26:16Z still_waiting_on_test go_test:io/ioutil
  2021-10-25T17:26:39Z finish_run_test:go_test:io/ioutil after 52.4s; pi4g
  2021-10-25T17:26:39Z run_test:go_test:log pi4g
  2021-10-25T17:27:09Z still_waiting_on_test go_test:log
  2021-10-25T17:27:15Z finish_run_test:go_test:log after 35.9s; pi4g
  2021-10-25T17:27:15Z run_test:go_test:math pi4g
  2021-10-25T17:27:45Z still_waiting_on_test go_test:math
  2021-10-25T17:28:04Z finish_run_test:go_test:math after 49.6s; pi4g
  2021-10-25T17:28:04Z run_test:go_test:math/big pi4g
  2021-10-25T17:28:34Z still_waiting_on_test go_test:math/big
  2021-10-25T17:28:51Z finish_run_test:go_test:math/big after 46.5s; pi4g
  2021-10-25T17:28:51Z run_test:go_test:math/bits pi4g
  2021-10-25T17:29:21Z still_waiting_on_test go_test:math/bits
  2021-10-25T17:29:42Z finish_run_test:go_test:math/bits after 50.9s; pi4g
  2021-10-25T17:29:42Z run_test:go_test:math/cmplx pi4g
  2021-10-25T17:30:12Z still_waiting_on_test go_test:math/cmplx
  2021-10-25T17:30:19Z finish_run_test:go_test:math/cmplx after 36.7s; pi4g
  2021-10-25T17:30:19Z run_test:go_test:math/rand pi4g
  2021-10-25T17:30:49Z still_waiting_on_test go_test:math/rand
  2021-10-25T17:30:58Z finish_run_test:go_test:math/rand after 39.4s; pi4g
  2021-10-25T17:30:58Z run_test:go_test:mime pi4g
  2021-10-25T17:31:28Z still_waiting_on_test go_test:mime
  2021-10-25T17:31:35Z finish_run_test:go_test:mime after 36.9s; pi4g
  2021-10-25T17:31:35Z run_test:go_test:mime/multipart pi4g
  2021-10-25T17:32:05Z still_waiting_on_test go_test:mime/multipart
  2021-10-25T17:32:13Z finish_run_test:go_test:mime/multipart after 37.8s; pi4g
  2021-10-25T17:32:13Z run_test:go_test:mime/quotedprintable pi4g
  2021-10-25T17:32:43Z still_waiting_on_test go_test:mime/quotedprintable
  2021-10-25T17:32:49Z finish_run_test:go_test:mime/quotedprintable after 36.1s; pi4g
  2021-10-25T17:32:49Z run_test:go_test:net pi4g
  2021-10-25T17:33:19Z still_waiting_on_test go_test:net
  2021-10-25T17:33:49Z still_waiting_on_test go_test:net
  2021-10-25T17:34:12Z finish_run_test:go_test:net after 1m23.2s; pi4g
  2021-10-25T17:34:12Z run_test:go_test:net/http pi4g
  2021-10-25T17:34:42Z still_waiting_on_test go_test:net/http
  2021-10-25T17:35:12Z still_waiting_on_test go_test:net/http
  2021-10-25T17:35:42Z still_waiting_on_test go_test:net/http
  2021-10-25T17:36:12Z still_waiting_on_test go_test:net/http
  2021-10-25T17:36:20Z finish_run_test:go_test:net/http after 2m8.1s; pi4g
  2021-10-25T17:36:20Z run_tests_multi pi4g: [go_test:net/http/cgi go_test:net/http/cookiejar go_test:net/http/fcgi]
  2021-10-25T17:36:50Z still_waiting_on_test go_test:net/http/cgi
  2021-10-25T17:37:02Z finish_run_tests_multi after 41s; pi4g: [go_test:net/http/cgi go_test:net/http/cookiejar go_test:net/http/fcgi]

compared to the same subsequence of tests run on another slow platform, freebsd-arm64:

  2021-10-25T19:14:46Z still_waiting_on_test go_test:internal/abi
  2021-10-25T19:14:51Z finish_run_tests_multi after 34.4s; aarch64.localdomain: [go_test:internal/abi go_test:internal/buildcfg go_test:internal/cpu]
  2021-10-25T19:14:51Z run_tests_multi aarch64.localdomain: [go_test:internal/execabs go_test:internal/fmtsort go_test:internal/fuzz]
  2021-10-25T19:14:59Z finish_run_tests_multi after 8.31s; aarch64.localdomain: [go_test:internal/execabs go_test:internal/fmtsort go_test:internal/fuzz]
  2021-10-25T19:14:59Z run_tests_multi aarch64.localdomain: [go_test:internal/itoa go_test:internal/poll go_test:internal/profile]
  2021-10-25T19:15:11Z finish_run_tests_multi after 12.3s; aarch64.localdomain: [go_test:internal/itoa go_test:internal/poll go_test:internal/profile]
  2021-10-25T19:15:11Z run_tests_multi aarch64.localdomain: [go_test:internal/reflectlite go_test:internal/singleflight go_test:internal/trace]
  2021-10-25T19:15:27Z finish_run_tests_multi after 15.4s; aarch64.localdomain: [go_test:internal/reflectlite go_test:internal/singleflight go_test:internal/trace]
  2021-10-25T19:15:27Z run_tests_multi aarch64.localdomain: [go_test:internal/unsafeheader go_test:internal/xcoff go_test:io]
  2021-10-25T19:15:38Z finish_run_tests_multi after 11.7s; aarch64.localdomain: [go_test:internal/unsafeheader go_test:internal/xcoff go_test:io]
  2021-10-25T19:15:38Z run_tests_multi aarch64.localdomain: [go_test:io/fs go_test:io/ioutil go_test:log]
  2021-10-25T19:15:42Z finish_run_tests_multi after 3.83s; aarch64.localdomain: [go_test:io/fs go_test:io/ioutil go_test:log]
  2021-10-25T19:15:42Z run_tests_multi aarch64.localdomain: [go_test:log/syslog go_test:math go_test:math/big]
  2021-10-25T19:15:59Z finish_run_tests_multi after 16.6s; aarch64.localdomain: [go_test:log/syslog go_test:math go_test:math/big]
  2021-10-25T19:15:59Z run_tests_multi aarch64.localdomain: [go_test:math/bits go_test:math/cmplx go_test:math/rand]
  2021-10-25T19:16:12Z finish_run_tests_multi after 13.3s; aarch64.localdomain: [go_test:math/bits go_test:math/cmplx go_test:math/rand]
  2021-10-25T19:16:12Z run_tests_multi aarch64.localdomain: [go_test:mime go_test:mime/multipart go_test:mime/quotedprintable]
  2021-10-25T19:16:17Z finish_run_tests_multi after 4.66s; aarch64.localdomain: [go_test:mime go_test:mime/multipart go_test:mime/quotedprintable]
  2021-10-25T19:16:17Z run_tests_multi aarch64.localdomain: [go_test:net go_test:net/http go_test:net/http/cgi]
  2021-10-25T19:16:47Z still_waiting_on_test go_test:net
  2021-10-25T19:16:56Z finish_run_tests_multi after 39.2s; aarch64.localdomain: [go_test:net go_test:net/http go_test:net/http/cgi]

Naive question: on single builders without helpers, why are the tests partitioned at all? Running the whole set of tests under a single invocation of go tool dist test would reduce the overhead considerably.

The text was updated successfully, but these errors were encountered:

bcmills · 2021-11-04T13:56:13Z

The logic that partitions the tests is here: https://cs.opensource.google/go/x/build/+/master:cmd/coordinator/coordinator.go;l=2655-2692;drc=db77efa219bbf2dff0b44c84d5e7ea957f55a5db

It appears to be using a hard-coded 10s cutoff, and I suspect that it is not accounting for per-test setup overhead — so the freebsd-arm64 builder probably gets better clustering because it doesn't have that overhead. (Bit of a catch-22 there. 😅)

bcmills · 2021-11-04T13:57:29Z

As for why the tests are partitioned at all — I don't know? I would guess that the coordinator is written assuming that per-test overhead is minimal, and if we have sharded builders at all the coordinator implementation is certainly simpler using the same sharding logic for all builders.

millerresearch · 2021-11-04T14:54:09Z

The logic that partitions the tests is here: ...

Yes, I did look at that code before posting the issue. But that doesn't show where getTestStats is getting its historical durations from. I suspect they are not very accurate in the case of plan9-arm. Can I look at them somehow?

My feeling is that the bad partitioning is a recent problem, but I don't have an archive of test log output so I can't check. How often does the database of durations change?

I would guess that the coordinator is written assuming that per-test overhead is minimal

Clearly not the case for plan 9 since go1.10, as I reported in #24300. The redundant (imho) staleness tests at the start of dist test repeatedly thrash the entire source tree with lookup + stat calls, not minimal on operating systems which don't have linux's aggressive directory and inode cacheing, even worse if the builder uses a remote file server.

bcmills · 2021-11-04T15:16:50Z

Can I look at them somehow?

I'm not sure! (That seems like a question for @golang/release...)

gopherbot added the Builders x/build issues (builders, bots, dashboards) label Nov 4, 2021

gopherbot added this to the Unreleased milestone Nov 4, 2021

thanm added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Nov 4, 2021

bcmills mentioned this issue Feb 4, 2022

x/build: OpenBSD 7.0 386 builder is slow #49666

Open

bcmills mentioned this issue Jun 14, 2022

runtime,cmd/compile: exit status 0xc0000374 (STATUS_HEAP_CORRUPTION) on windows-amd64-longtest #52647

Closed

millerresearch mentioned this issue Jan 11, 2023

cmd/dist: huge increase in runtime for 'dist test' on slow builders after CL 432535 #57734

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

x/build/cmd/coordinator: bad test partitioning for plan9-arm #49343

x/build/cmd/coordinator: bad test partitioning for plan9-arm #49343

millerresearch commented Nov 4, 2021

bcmills commented Nov 4, 2021

bcmills commented Nov 4, 2021

millerresearch commented Nov 4, 2021

bcmills commented Nov 4, 2021

x/build/cmd/coordinator: bad test partitioning for plan9-arm #49343

x/build/cmd/coordinator: bad test partitioning for plan9-arm #49343

Comments

millerresearch commented Nov 4, 2021

bcmills commented Nov 4, 2021

bcmills commented Nov 4, 2021

millerresearch commented Nov 4, 2021

bcmills commented Nov 4, 2021