Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/build: trybots not running due to maintner resource temporarily unavailable errors #23705

Closed
bradfitz opened this issue Feb 5, 2018 · 16 comments
Labels
Builders x/build issues (builders, bots, dashboards) FrozenDueToAge
Milestone

Comments

@bradfitz
Copy link
Contributor

bradfitz commented Feb 5, 2018

Trybots aren't running because maintner isn't noticing that there's new stuff on Gerrit, due to:

2018/02/02 20:07:17 Temporary error from gerrit go.googlesource.com/example: running git remote -v in /cache/go.googlesource.com%2Fexample: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:18 gerrit go.googlesource.com/time: init: running git remote -v in /cache/go.googlesource.com%2Ftime: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:18 IS TEMP ERROR? *errors.errorString running git remote -v in /cache/go.googlesource.com%2Ftime: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:18 Temporary error from gerrit go.googlesource.com/time: running git remote -v in /cache/go.googlesource.com%2Ftime: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:18 gerrit go.googlesource.com/build: init: running git remote -v in /cache/go.googlesource.com%2Fbuild: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:18 IS TEMP ERROR? *errors.errorString running git remote -v in /cache/go.googlesource.com%2Fbuild: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:18 Temporary error from gerrit go.googlesource.com/build: running git remote -v in /cache/go.googlesource.com%2Fbuild: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:18 gerrit code.googlesource.com/gocloud: init: running git remote -v in /cache/code.googlesource.com%2Fgocloud: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:18 IS TEMP ERROR? *errors.errorString running git remote -v in /cache/code.googlesource.com%2Fgocloud: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:18 Temporary error from gerrit code.googlesource.com/gocloud: running git remote -v in /cache/code.googlesource.com%2Fgocloud: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:19 gerrit go.googlesource.com/gddo: init: running git remote -v in /cache/go.googlesource.com%2Fgddo: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:19 IS TEMP ERROR? *errors.errorString running git remote -v in /cache/go.googlesource.com%2Fgddo: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:19 Temporary error from gerrit go.googlesource.com/gddo: running git remote -v in /cache/go.googlesource.com%2Fgddo: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:20 gerrit go.googlesource.com/scratch: init: running git remote -v in /cache/go.googlesource.com%2Fscratch: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:20 IS TEMP ERROR? *errors.errorString running git remote -v in /cache/go.googlesource.com%2Fscratch: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:20 Temporary error from gerrit go.googlesource.com/scratch: running git remote -v in /cache/go.googlesource.com%2Fscratch: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:20 gerrit go.googlesource.com/proposal: init: running git remote -v in /cache/go.googlesource.com%2Fproposal: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:20 IS TEMP ERROR? *errors.errorString running git remote -v in /cache/go.googlesource.com%2Fproposal: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:20 Temporary error from gerrit go.googlesource.com/proposal: running git remote -v in /cache/go.googlesource.com%2Fproposal: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:20 gerrit go.googlesource.com/tour: init: running git remote -v in /cache/go.googlesource.com%2Ftour: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:20 IS TEMP ERROR? *errors.errorString running git remote -v in /cache/go.googlesource.com%2Ftour: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:20 Temporary error from gerrit go.googlesource.com/tour: running git remote -v in /cache/go.googlesource.com%2Ftour: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:21 gerrit go.googlesource.com/oauth2: init: running git remote -v in /cache/go.googlesource.com%2Foauth2: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:21 IS TEMP ERROR? *errors.errorString running git remote -v in /cache/go.googlesource.com%2Foauth2: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:21 Temporary error from gerrit go.googlesource.com/oauth2: running git remote -v in /cache/go.googlesource.com%2Foauth2: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:21 gerrit go.googlesource.com/exp: init: running git remote -v in /cache/go.googlesource.com%2Fexp: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:21 IS TEMP ERROR? *errors.errorString running git remote -v in /cache/go.googlesource.com%2Fexp: fork/exec /usr/bin/git: resource temporarily unavailable
2018/02/02 20:07:21 Temporary error from gerrit go.googlesource.com/exp: running git remote -v in /cache/go.googlesource.com%2Fexp: fork/exec /usr/bin/git: resource temporarily unavailable^C

This smells just like #23686 in a different binary.

Both of these were just updated for the LetsEncrypt ACME changes.

So what changed in the meantime?

What resource isn't available?

/cc @andybons @ianlancetaylor

@gopherbot gopherbot added this to the Unreleased milestone Feb 5, 2018
@gopherbot gopherbot added the Builders x/build issues (builders, bots, dashboards) label Feb 5, 2018
@bradfitz
Copy link
Contributor Author

bradfitz commented Feb 5, 2018

Can't even bash in to debug:

$ kubectl exec -ti maintnerd-deployment-2926465623-tjsxh /bin/bash
rpc error: code = 2 desc = "fork/exec /usr/bin/containerd-shim: resource temporarily unavailable"command terminated with exit code 126

Process limit? FD leak?

@bradfitz
Copy link
Contributor Author

bradfitz commented Feb 5, 2018

Killed container and let it get restarted. Then ran bash in it and:

root@maintnerd-deployment-2926465623-7trps:/# ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 60077
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1048576
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

@ianlancetaylor
Copy link
Contributor

This may have happened again, as https://golang.org/cl/92155 has had no trybot initiation after ten minutes.

@ianlancetaylor
Copy link
Contributor

The trybots did start running on CL 92155 after 32 minutes.

@bradfitz
Copy link
Contributor Author

bradfitz commented Feb 8, 2018

Well, @andybons updated our GKE cluster from 1.5.6 to 1.8.whatever the latests GKE option is, and the errors persist, but they've changed. Now I see:

2018/02/08 17:00:25 Pubsub event on topic "gerrit:go.googlesource.com/go" discarded; already syncing?
2018/02/08 17:00:37 error fetching 9 hashes from gerrit project go.googlesource.com/go: 
2018/02/08 17:00:37 gerrit go.googlesource.com/go: sync: signal: killed; stderr=""
2018/02/08 17:00:37 IS TEMP ERROR? *errors.errorString signal: killed; stderr=""
2018/02/08 17:00:37 Temporary error from gerrit go.googlesource.com/go: signal: killed; stderr=""
2018/02/08 17:01:07 gerrit go.googlesource.com/go: git directory exists.
2018/02/08 17:01:10 gerrit go.googlesource.com/go: 9 new refs
2018/02/08 17:01:10 gerrit go.googlesource.com/go: fetching 9 hashes...
2018/02/08 17:01:11 error fetching 9 hashes from gerrit project go.googlesource.com/go: 
2018/02/08 17:01:11 gerrit go.googlesource.com/go: sync: signal: killed; stderr=""
2018/02/08 17:01:11 IS TEMP ERROR? *errors.errorString signal: killed; stderr=""
2018/02/08 17:01:11 Temporary error from gerrit go.googlesource.com/go: signal: killed; stderr=""
2018/02/08 17:01:14 Got pubsubhelper event: {
        "Time": "2018-02-08T17:01:14.022043045Z",
        "Gerrit": {
                "URL": "https://go-review.googlesource.com/92815",
                "Project": "go",
                "CommitHash": "d069c928b4e5108c10621d3190761abf6a5ece9a",
                "ChangeNumber": 92815
        }       
}       
2018/02/08 17:01:14 Pubsub event on topic "gerrit:go.googlesource.com/go" discarded; already syncing?
2018/02/08 17:01:41 gerrit go.googlesource.com/go: git directory exists.
2018/02/08 17:01:42 gerrit go.googlesource.com/go: sync: git fetch origin: signal: killed, 
2018/02/08 17:01:42 IS TEMP ERROR? *errors.errorString git fetch origin: signal: killed, 
2018/02/08 17:01:42 Temporary error from gerrit go.googlesource.com/go: git fetch origin: signal: killed, 

2018/02/08 17:02:12 gerrit go.googlesource.com/go: git directory exists.
2018/02/08 17:02:13 gerrit go.googlesource.com/go: sync: git fetch origin: signal: killed, 
2018/02/08 17:02:13 IS TEMP ERROR? *errors.errorString git fetch origin: signal: killed, 
2018/02/08 17:02:13 Temporary error from gerrit go.googlesource.com/go: git fetch origin: signal: killed, 

Not sure what Temporary error from gerrit go.googlesource.com/go: signal: killed; stderr="" means.

@bradfitz
Copy link
Contributor Author

bradfitz commented Feb 8, 2018

Looks like something's leaking. Bunch of defunct processes:

root@maintnerd-deployment-2926465623-g8q6z:/# ps ax | wc -l
145
root@maintnerd-deployment-2926465623-g8q6z:/# ps ax | head -20
  PID TTY      STAT   TIME COMMAND
    1 ?        Ssl   40:02 /maintnerd --config=go --bucket=golang-maintner-log --verbose --data-dir=/cache --listen=:80 --autocert=maintner.golang.org --autocert-bucket=golang-maintner-autocert
 2611 ?        Z      0:00 [git-remote-http] <defunct>
 3986 ?        Z      0:00 [git-remote-http] <defunct>
 3990 ?        Z      0:00 [git-remote-http] <defunct>
 4012 ?        Z      0:00 [git-remote-http] <defunct>
 4038 ?        Z      0:00 [git-remote-http] <defunct>
 4042 ?        Z      0:00 [git-remote-http] <defunct>
 4093 ?        Z      0:00 [git-remote-http] <defunct>
 4250 ?        Z      0:00 [git-remote-http] <defunct>
 4258 ?        Z      0:00 [git-remote-http] <defunct>
 4265 ?        Z      0:00 [git] <defunct>
 4283 ?        Z      0:00 [git-remote-http] <defunct>
 4363 ?        Z      0:00 [git-remote-http] <defunct>
 4385 ?        Z      0:00 [git-remote-http] <defunct>
 4407 ?        Z      0:00 [git-remote-http] <defunct>
 4420 ?        Z      0:00 [git-remote-http] <defunct>
 4423 ?        Z      0:04 [git] <defunct>
 4464 ?        Z      0:00 [git-remote-http] <defunct>
 4468 ?        Z      0:00 [git-remote-http] <defunct>

@bradfitz
Copy link
Contributor Author

bradfitz commented Feb 8, 2018

I have no idea where these defunct processes are coming from. I audited all the os/exec callers in golang.org/x/build/maintner/... and don't see anything suspicious.

@ianlancetaylor, any debugging tips?

@ianlancetaylor
Copy link
Contributor

Sorry, no good ideas. Just the obvious one of double-checking that the parent process is what you think it is.

@bradfitz
Copy link
Contributor Author

bradfitz commented Feb 9, 2018

Updates from research & chats with others.

This is a GKE issue. Apparently GKE 1.7 automatically reaped zombie processes like an init would, using a "pause container", but that was reverted in GKE 1.8 and hasn't been fixed in GKE 1.8 yet.

Some links:

https://news.ycombinator.com/item?id=14524253
kubernetes/kubernetes#1615
https://github.com/krallin/tini
https://stackoverflow.com/questions/43122080/how-to-use-init-parameter-in-docker-run

@gopherbot
Copy link

Change https://golang.org/cl/93077 mentions this issue: maintner/maintnerd: wrap maintnerd in tini init process to reap zombies

gopherbot pushed a commit to golang/build that referenced this issue Feb 9, 2018
Updates golang/go#23705

Change-Id: If0e7826ab75aae46dc7b79306d36d6cd3f07a041
Reviewed-on: https://go-review.googlesource.com/93077
Reviewed-by: Andrew Bonventre <andybons@golang.org>
@bradfitz
Copy link
Contributor Author

bradfitz commented Feb 9, 2018

Deployed:

root@maintnerd-deployment-7fdd7bdfff-77dcp:/# ps afx
  PID TTY      STAT   TIME COMMAND
   18 ?        Ss     0:00 /bin/bash
   26 ?        R+     0:00  \_ ps afx
    1 ?        Ss     0:00 /tini -- /maintnerd --config=go --bucket=golang-maintner-log --verbose --data-dir=/cache --listen=:80 --autocert=maintner.golang
    7 ?        Sl     0:22 /maintnerd --config=go --bucket=golang-maintner-log --verbose --data-dir=/cache --listen=:80 --autocert=maintner.golang.org --au

@gopherbot
Copy link

Change https://golang.org/cl/93082 mentions this issue: maintner/maintnerd: fix use of tini

gopherbot pushed a commit to golang/build that referenced this issue Feb 9, 2018
Updates golang/go#23705

Change-Id: Ia66e3c16bda3357daf4796f3eafb50b2eb019223
Reviewed-on: https://go-review.googlesource.com/93082
Reviewed-by: Andrew Bonventre <andybons@golang.org>
@bradfitz
Copy link
Contributor Author

gitmirror is next to need love:

root@gitmirror-rc-dlxkn:/# ps ax
  PID TTY      STAT   TIME COMMAND
    1 ?        Ssl    1:57 /gitmirror -http=:8585 -mirror=true -report=true -network=true -cachedir=/cache/gitmirror
 1233 ?        Z      0:00 [sh] <defunct>
 1238 ?        Z      0:00 [sh] <defunct>
 1247 ?        Z      0:00 [sh] <defunct>
 1253 ?        Z      0:00 [sh] <defunct>
 1272 ?        Z      0:00 [sh] <defunct>
 1275 ?        Z      0:00 [sh] <defunct>
 1278 ?        Z      0:00 [sh] <defunct>
 1282 ?        Z      0:00 [sh] <defunct>
 1285 ?        Z      0:00 [sh] <defunct>
 1287 ?        Z      0:00 [sh] <defunct>
 1290 ?        Z      0:00 [sh] <defunct>
 1293 ?        Z      0:00 [sh] <defunct>
 1296 ?        Z      0:00 [sh] <defunct>
 1298 ?        Z      0:00 [sh] <defunct>
 1335 ?        Z      0:00 [sh] <defunct>
 1339 ?        Z      0:00 [sh] <defunct>
 1340 ?        Z      0:00 [sh] <defunct>
 1347 ?        Z      0:00 [sh] <defunct>
 1368 ?        Z      0:00 [sh] <defunct>
 1375 ?        Z      0:00 [sh] <defunct>
 1392 ?        Z      0:00 [sh] <defunct>
 1395 ?        Z      0:00 [sh] <defunct>
 1397 ?        Z      0:00 [sh] <defunct>
 1403 ?        Z      0:00 [sh] <defunct>
 1406 ?        Z      0:00 [sh] <defunct>
 1414 ?        Z      0:00 [sh] <defunct>
 1424 ?        Z      0:00 [sh] <defunct>
 1427 ?        Z      0:00 [sh] <defunct>
 1430 ?        Z      0:00 [sh] <defunct>
 1485 ?        S      0:00 ssh git@github.com git-upload-pack 'golang/example.git'
 1524 ?        Z      0:00 [ssh] <defunct>
 1572 ?        Z      0:00 [ssh] <defunct>
 1575 ?        S      0:00 ssh git@github.com git-upload-pack 'golang/scratch.git'
 1586 ?        Z      0:00 [ssh] <defunct>
 1798 ?        Ss     0:00 /bin/bash
 1804 ?        R+     0:00 ps ax

@gopherbot
Copy link

Change https://golang.org/cl/93755 mentions this issue: cmd/gitmirror: add the tini init wrapper to the Dockerfile

gopherbot pushed a commit to golang/build that referenced this issue Feb 14, 2018
Updates golang/go#23705

Change-Id: I1fa42b2767d1a9780672458fd06d64ed998b9e04
Reviewed-on: https://go-review.googlesource.com/93755
Reviewed-by: Andrew Bonventre <andybons@golang.org>
@gopherbot
Copy link

Change https://golang.org/cl/94075 mentions this issue: cmd/gerritbot: add the tini init wrapper to the Dockerfile

gopherbot pushed a commit to golang/build that referenced this issue Feb 14, 2018
Updates golang/go#23705

Change-Id: I9c483efa491e2d9f705850d81ea94feda9f9d5a4
Reviewed-on: https://go-review.googlesource.com/94075
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@dmitshur
Copy link
Contributor

dmitshur commented Nov 2, 2018

I believe this particular issue is resolved, so closing. Please re-open if not, or if there's something actionable to do here.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Builders x/build issues (builders, bots, dashboards) FrozenDueToAge
Projects
None yet
Development

No branches or pull requests

4 participants