runtime: Kubernetes' kube-proxy stuck in GC forever #18925

Closed
ivan4th opened this issue Feb 3, 2017 · 32 comments

@ivan4th

ivan4th commented Feb 3, 2017

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

go1.7.4 linux/amd64

What operating system and processor architecture are you using (go env)?

GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/go"
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp.k8s/go-build739048657=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"

(from k8s build container)

What did you do?

Running k8s cluster

What did you expect to see?

kube-proxy process running as expected

What did you see instead?

kube-proxy gets stuck in GC code, with no goroutines being scheduled

The problem is quite hard to reproduce: something similar happens on some nodes from time to time, and it may take days to run into it. We currently have a process in the state described here running on one of the nodes. I'm pretty sure this is related to the Go runtime, but in any case I'm stuck trying to find a way to debug it. Any hints on how to track down the cause would be very much appreciated. I don't want to kill the process with SIGQUIT to retrieve goroutine info, so I'm doing this with delve instead.

The process is running in a Docker container (Debian jessie) with GOMAXPROCS=48.

Goroutine info: https://gist.github.com/ivan4th/17654f6fee35a38548502de4b6f68ce4
Thread info: https://gist.github.com/ivan4th/4596664ba1f935c500250f74ade5c162
Log output from another hung kube-proxy killed with SIGQUIT: https://gist.github.com/ivan4th/5da95ebf8986c6834bca35c9b4e7895b

@ivan4th
Author

ivan4th commented Feb 3, 2017

Accidentally posted the issue before I finished editing; I will update shortly with more details, sorry.

@bradfitz bradfitz changed the title Kubernetes' kube-proxy stuck in GC forever runtime: Kubernetes' kube-proxy stuck in GC forever Feb 3, 2017
@bradfitz bradfitz added this to the Go1.9 milestone Feb 3, 2017
@ivan4th
Author

ivan4th commented Feb 3, 2017

What I could get out of GDB (maybe I should try a newer version?):

Program received signal SIGINT, Interrupt.
runtime.usleep () at /usr/local/go/src/runtime/sys_linux_amd64.s:95
95	/usr/local/go/src/runtime/sys_linux_amd64.s: No such file or directory.
(gdb) bt
#0  runtime.usleep () at /usr/local/go/src/runtime/sys_linux_amd64.s:95
#1  0x0000000000424f7d in runtime.getfull (~r0=0xc42002d228) at /usr/local/go/src/runtime/mgcwork.go:411
#2  0x000000000042479f in runtime.(*gcWork).get (w=0x42279d <runtime.greyobject+301>, ~r0=0) at /usr/local/go/src/runtime/mgcwork.go:199
#3  0x0000000000421e3a in runtime.gcDrain (gcw=0xc42002d228, flags=0) at /usr/local/go/src/runtime/mgcmark.go:982
#4  0x000000000041f579 in runtime.gchelper () at /usr/local/go/src/runtime/mgc.go:1819
#5  0x0000000000436a1f in runtime.stopm () at /usr/local/go/src/runtime/proc.go:1597
#6  0x00000000004377f8 in runtime.findrunnable (gp=0x43e30e <runtime.runqget+142>, inheritTime=false) at /usr/local/go/src/runtime/proc.go:2021
#7  0x000000000043836c in runtime.schedule () at /usr/local/go/src/runtime/proc.go:2120
#8  0x00000000004386e3 in runtime.park_m (gp=0xc420461040) at /usr/local/go/src/runtime/proc.go:2183
#9  0x000000000046111b in runtime.mcall () at /usr/local/go/src/runtime/asm_amd64.s:240
#10 0x0000000005cace00 in runtime.work ()
#11 0x00007ffca16997f0 in ?? ()
#12 0x0000000005cacea0 in ?? ()
#13 0x00007ffca16997e0 in ?? ()
#14 0x0000000000435914 in runtime.mstart () at /usr/local/go/src/runtime/proc.go:1096
#15 0x0000000000460fc5 in runtime.rt0_go () at /usr/local/go/src/runtime/asm_amd64.s:156
#16 0x0000000000000009 in ?? ()
#17 0x00007ffca16998f8 in ?? ()
#18 0x0000000000000009 in ?? ()
#19 0x00007ffca16998f8 in ?? ()
#20 0x0000000000000000 in ?? ()
(gdb) quit

@ivan4th
Author

ivan4th commented Feb 3, 2017

Most of kube-proxy source is here: https://github.com/kubernetes/kubernetes/tree/v1.5.2/pkg/proxy

@ivan4th
Author

ivan4th commented Feb 3, 2017

CPU usage from htop (see the 9th column). The first line is for the whole process, and the following lines correspond to its threads:

   1 root       20   0 1612M 90760 49692 S 51.9  0.0 21h29:31 /hyperkube proxy ... 
   27 root       20   0 1612M 90760 49692 R  6.9  0.0  2h12:50 /hyperkube proxy ...
 4959 root       20   0 1612M 90760 49692 R  6.4  0.0  2h12:58 /hyperkube proxy ...
   14 root       20   0 1612M 90760 49692 R  6.0  0.0  2h15:43 /hyperkube proxy ...
34689 root       20   0 1612M 90760 49692 R  4.6  0.0  2h12:53 /hyperkube proxy ...
   18 root       20   0 1612M 90760 49692 R  4.1  0.0  2h13:15 /hyperkube proxy ...
   15 root       20   0 1612M 90760 49692 R  3.7  0.0  2h13:07 /hyperkube proxy ...
   17 root       20   0 1612M 90760 49692 S  0.9  0.0 18:41.65 /hyperkube proxy ...
   19 root       20   0 1612M 90760 49692 S  0.9  0.0 18:36.90 /hyperkube proxy ...
   16 root       20   0 1612M 90760 49692 S  0.9  0.0 18:44.86 /hyperkube proxy ...
 1285 root       20   0 1612M 90760 49692 S  0.9  0.0 18:38.19 /hyperkube proxy ...
 1829 root       20   0 1612M 90760 49692 S  0.9  0.0 18:43.81 /hyperkube proxy ...
   29 root       20   0 1612M 90760 49692 R  0.9  0.0 18:35.07 /hyperkube proxy ...
 4368 root       20   0 1612M 90760 49692 S  0.9  0.0 18:35.75 /hyperkube proxy ...
   20 root       20   0 1612M 90760 49692 S  0.9  0.0 18:41.37 /hyperkube proxy ...
   34 root       20   0 1612M 90760 49692 S  0.9  0.0 18:31.79 /hyperkube proxy ...
   12 root       20   0 1612M 90760 49692 S  0.9  0.0 18:40.26 /hyperkube proxy ...
   10 root       20   0 1612M 90760 49692 S  0.9  0.0 18:40.57 /hyperkube proxy ...
   13 root       20   0 1612M 90760 49692 S  0.9  0.0 18:39.82 /hyperkube proxy ...
   31 root       20   0 1612M 90760 49692 S  0.9  0.0 18:38.71 /hyperkube proxy ...
   28 root       20   0 1612M 90760 49692 S  0.9  0.0 18:38.91 /hyperkube proxy ...
   26 root       20   0 1612M 90760 49692 R  0.9  0.0 18:37.23 /hyperkube proxy ...
 1827 root       20   0 1612M 90760 49692 S  0.9  0.0 18:38.38 /hyperkube proxy ...
  733 root       20   0 1612M 90760 49692 S  0.9  0.0 18:36.00 /hyperkube proxy ...
  199 root       20   0 1612M 90760 49692 R  0.9  0.0 18:37.56 /hyperkube proxy ...
 1283 root       20   0 1612M 90760 49692 R  0.5  0.0 18:36.17 /hyperkube proxy ...
   30 root       20   0 1612M 90760 49692 S  0.5  0.0 18:41.68 /hyperkube proxy ...
47869 root       20   0 1612M 90760 49692 R  0.5  0.0 18:14.54 /hyperkube proxy ...
   25 root       20   0 1612M 90760 49692 S  0.5  0.0 18:37.63 /hyperkube proxy ...
   32 root       20   0 1612M 90760 49692 S  0.5  0.0 18:40.23 /hyperkube proxy ...
   22 root       20   0 1612M 90760 49692 S  0.5  0.0 18:38.95 /hyperkube proxy ...
    8 root       20   0 1612M 90760 49692 S  0.5  0.0 18:40.63 /hyperkube proxy ...

@bradfitz
Contributor

bradfitz commented Feb 3, 2017

Can you run it with environment GOTRACEBACK=system set and then send it SIGQUIT when it gets into that state and capture the output?

It might be more useful to see Go's own output rather than gdb's attempt at it.
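
For reference, a minimal sketch (assuming you can rebuild the binary; setting the environment variable on the deployment is usually simpler) of requesting the same traceback level from inside the program: runtime/debug.SetTraceback("system") is equivalent to GOTRACEBACK=system, so a later SIGQUIT dump includes runtime-internal goroutines and frames.

package main

import (
	"os"
	"os/signal"
	"runtime/debug"
	"syscall"
)

func main() {
	// Same effect as running with GOTRACEBACK=system: SIGQUIT (and fatal
	// runtime errors) will dump runtime-internal goroutines and frames too.
	// SetTraceback never lowers the level below what the GOTRACEBACK
	// environment variable already requests.
	debug.SetTraceback("system")

	// ... rest of the program; `kill -QUIT <pid>` now produces the full dump.
	// Block on SIGTERM here just so the sketch runs on its own.
	c := make(chan os.Signal, 1)
	signal.Notify(c, syscall.SIGTERM)
	<-c
}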

@aclements
Member

Thanks for the very detailed report. It looks like we're in a stack-scanning deadlock.

Narrowing down to just the threads that are in scang (which happen to be exactly the threads that are high in top):

(dlv) thread 14
Switched from 13 to 14
(dlv) bt -full
 0  0x0000000000465237 in runtime.osyield
    at .usr/local/go/src/runtime/sys_linux_amd64.s:479
 1  0x000000000043500b in runtime.scang
    at .usr/local/go/src/runtime/proc.go:865
        gp = (*runtime.g)(0xc420221380)
        gcw = (*runtime.gcWork)(0xc420017c28)
        nextYield = 487396600593231
        i = 101070403879
 2  0x0000000000459e8f in runtime.markroot.func1
    at .usr/local/go/src/runtime/mgcmark.go:236
        selfScan = false
        userG = *runtime.g nil
        gp = (*runtime.g)(0xc420221380)
        gcw = (*runtime.gcWork)(0xc420017c28)

(dlv) thread 15
Switched from 14 to 15
(dlv) bt -full
 0  0x0000000000464e40 in runtime.nanotime
    at .usr/local/go/src/runtime/sys_linux_amd64.s:174
 1  0x0000000000434f66 in runtime.scang
    at .usr/local/go/src/runtime/proc.go:862
        gp = (*runtime.g)(0xc4202211e0)
        gcw = (*runtime.gcWork)(0xc420016728)
        nextYield = 487396600632703
        i = 98607213579
 2  0x0000000000459e8f in runtime.markroot.func1
    at .usr/local/go/src/runtime/mgcmark.go:236
        selfScan = false
        userG = *runtime.g nil
        gp = (*runtime.g)(0xc4202211e0)
        gcw = (*runtime.gcWork)(0xc420016728)

(dlv) thread 18
Switched from 17 to 18
(dlv) bt -full
 0  0x0000000000434885 in runtime.readgstatus
    at .usr/local/go/src/runtime/proc.go:646
        gp = (*runtime.g)(0xc420221ba0)
        ~r1 = 2
 1  0x0000000000434f7a in runtime.scang
    at .usr/local/go/src/runtime/proc.go:808
        gp = (*runtime.g)(0xc420221ba0)
        gcw = (*runtime.gcWork)(0xc42001bc28)
        nextYield = 487396600856415
        i = 98856423408
 2  0x0000000000459e8f in runtime.markroot.func1
    at .usr/local/go/src/runtime/mgcmark.go:236
        selfScan = false
        userG = *runtime.g nil
        gp = (*runtime.g)(0xc420221ba0)
        gcw = (*runtime.gcWork)(0xc42001bc28)

(dlv) thread 27
Switched from 26 to 27
(dlv) bt -full
 0  0x0000000000462d94 in runtime.procyield
    at .usr/local/go/src/runtime/asm_amd64.s:509
 1  0x0000000000434fe4 in runtime.scang
    at .usr/local/go/src/runtime/proc.go:863
        gp = (*runtime.g)(0xc420221520)
        gcw = (*runtime.gcWork)(0xc420022728)
        nextYield = 487396700665810
        i = 98411903700
 2  0x0000000000459e8f in runtime.markroot.func1
    at .usr/local/go/src/runtime/mgcmark.go:236
        selfScan = false
        userG = *runtime.g nil
        gp = (*runtime.g)(0xc420221520)
        gcw = (*runtime.gcWork)(0xc420022728)

(dlv) thread 4959
Switched from 4368 to 4959
(dlv) bt -full
 0  0x0000000000465237 in runtime.osyield
    at .usr/local/go/src/runtime/sys_linux_amd64.s:479
 1  0x000000000043500b in runtime.scang
    at .usr/local/go/src/runtime/proc.go:865
        gp = (*runtime.g)(0xc420221d40)
        gcw = (*runtime.gcWork)(0xc42001a728)
        nextYield = 487396701077361
        i = 98526672812
 2  0x0000000000459e8f in runtime.markroot.func1
    at .usr/local/go/src/runtime/mgcmark.go:236
        selfScan = false
        userG = *runtime.g nil
        gp = (*runtime.g)(0xc420221d40)
        gcw = (*runtime.gcWork)(0xc42001a728)

(dlv) thread 34689
Switched from 4959 to 34689
(dlv) bt -full
 0  0x0000000000434ff5 in runtime.scang
    at .usr/local/go/src/runtime/proc.go:807
        gp = (*runtime.g)(0xc420221a00)
        gcw = (*runtime.gcWork)(0xc420023c28)
        nextYield = 487396700975725
        i = 98723943990
 1  0x0000000000459e8f in runtime.markroot.func1
    at .usr/local/go/src/runtime/mgcmark.go:236
        selfScan = false
        userG = *runtime.g nil
        gp = (*runtime.g)(0xc420221a00)
        gcw = (*runtime.gcWork)(0xc420023c28)

It would be really useful if I could see the contents of *gp for each of these threads. Could you get that from delve or gdb?
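
To make the shape of that spin concrete, here is a toy model (not the runtime's actual code; the status constants are made up for illustration) of the scang handshake: a scanner loops until it can move the target goroutine's status out of "running", while the target is expected to reach a safe point and cooperate. If the target's status never leaves "running", as with the mark workers here, the scanner spins forever, which is exactly the osyield/procyield/nanotime frames above.

package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
	"time"
)

// Toy status values standing in for the runtime's _Grunnable/_Grunning/_Gscan states.
const (
	statusRunnable uint32 = iota
	statusRunning
	statusScan
)

// scangToy mirrors the structure of runtime.scang: if the target is stopped,
// CAS it into a scan state and scan it; if it claims to be running, ask it to
// yield and retry.
func scangToy(status *uint32) {
	for {
		switch atomic.LoadUint32(status) {
		case statusRunnable:
			if atomic.CompareAndSwapUint32(status, statusRunnable, statusScan) {
				fmt.Println("scanned target stack")
				return
			}
		case statusRunning:
			// The real runtime sets preempt flags and backs off with
			// procyield/osyield/usleep; if the target never leaves
			// "running", this loop never terminates.
			runtime.Gosched()
		}
	}
}

func main() {
	// The target is stuck in "running", like the mark workers in this report.
	status := statusRunning
	done := make(chan struct{})
	go func() { scangToy(&status); close(done) }()

	select {
	case <-done:
		fmt.Println("scan completed")
	case <-time.After(time.Second):
		fmt.Println("scanner still spinning after 1s: simulated deadlock")
	}
}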

@ivan4th
Author

ivan4th commented Feb 3, 2017

As for GOTRACEBACK: OK, we will do this and wait for more occurrences. I just thought I could do something with the currently hung process in the meantime.
gp details: I'll paste them in a moment.

Some observations follow. I'm not an expert in the Go runtime, so I may be wrong. Most threads are in runtime.getfull(). Meanwhile:

(dlv) print runtime.work.nproc
32
(dlv) print runtime.work.nwait
26

The 6 remaining threads are in runtime.scang:

(dlv) bt
 0  0x0000000000465237 in runtime.osyield
    at .usr/local/go/src/runtime/sys_linux_amd64.s:479
 1  0x000000000043500b in runtime.scang
    at .usr/local/go/src/runtime/proc.go:865
 2  0x0000000000459e8f in runtime.markroot.func1
    at .usr/local/go/src/runtime/mgcmark.go:236
 3  0x00000000004611e9 in runtime.systemstack
    at .usr/local/go/src/runtime/asm_amd64.s:314
 4  0x0000000000420111 in runtime.markroot
    at .usr/local/go/src/runtime/mgcmark.go:241
 5  0x0000000000421cf9 in runtime.gcDrain
    at .usr/local/go/src/runtime/mgcmark.go:963
 6  0x000000000041f567 in runtime.gchelper
    at .usr/local/go/src/runtime/mgc.go:1819
 7  0x0000000000436a1a in runtime.stopm
    at .usr/local/go/src/runtime/proc.go:1597
 8  0x00000000004377f3 in runtime.findrunnable
    at .usr/local/go/src/runtime/proc.go:2021
 9  0x0000000000438367 in runtime.schedule
    at .usr/local/go/src/runtime/proc.go:2120
10  0x00000000004386de in runtime.park_m
    at .usr/local/go/src/runtime/proc.go:2183

Closer inspection reveals that the condition gp.preemptscan && gp.preempt && gp.stackguard0 == stackPreempt is true in each of these places, but the corresponding goroutines, which must scan their own stacks, can't do so. You can see them in the goroutine list; they look like this:

Goroutine 26 - User: .usr/local/go/src/runtime/mbitmap.go:388 runtime.heapBitsForObject (0x417350)

but delve can't show the stacks for them, complaining about null pointers

Will try to grab gp details now

@aclements
Member

Once you get the gp details, it would be really helpful to know their stack traces. I'm not sure why delve can't get them, but if you grab my gdbinit.py from https://github.com/aclements/my-dotfiles/blob/master/gdbinit.py, you can probably get their stacks by running btg $getg(N), where N is the goroutine ID. While you've got that loaded, it may also be helpful to have the output of print-sched (though it probably won't tell us anything you haven't already reported).

@ivan4th
Author

ivan4th commented Feb 3, 2017

gp details: https://gist.github.com/ivan4th/d031a21b5d1043d04f4b74108b4474ff
Note that I did a few continue + Ctrl-C cycles since I captured the initial data, but all the scang threads still stay in scang and the getfull ones stay in getfull; this doesn't change.

@aclements
Member

Active scans:

thread  gp            goid  gstatus  target g's thread
14      0xc420221380  22    running  22
15      0xc4202211e0  21    running  14
18      0xc420221ba0  27    running  18
27      0xc420221520  23    running  26
4959    0xc420221d40  28    running  32
34689   0xc420221a00  26    running  25

All target Gs have waitreason "GC worker (idle)" and their start PCs are all the same, so these are GC workers.

It's worth noting that we're in mark termination, so these really shouldn't be in status running; STW should have moved them to runnable. We're stuck because the stack scan needs to get these out of running, but we can't preempt anything during STW (because nothing should be running at all!), so we're just spinning around.

So, it actually could be useful to see the output of print-sched.

@ivan4th
Author

ivan4th commented Feb 3, 2017

print-sched output is here https://gist.github.com/ivan4th/3e53dde8c1b6c3b559b6901aff176bec

@ivan4th
Author

ivan4th commented Feb 3, 2017

btg doesn't work though :(

(gdb) btg $getg(21)
goroutine 21 stack:
thread 14 not found; stack may be incorrect (try import _ "rutime/cgo")
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x18:
Error occurred in Python command: Cannot access memory at address 0x18

It may be related to the environment; I got some errors when attaching to the process:

(gdb) attach 1
Attaching to process 1
Reading symbols from /hyperkube...done.
warning: Missing auto-load scripts referenced in section .debug_gdb_scripts
of file /hyperkube
Use `info auto-load python-scripts [REGEXP]' to list them.
warning: .dynamic section for "/lib/x86_64-linux-gnu/libpthread.so.0" is not at the expected address (wrong library or version mismatch?)
Reading symbols from /lib/x86_64-linux-gnu/libpthread.so.0...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libpthread-2.19.so...done.
done.
warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.
Loaded symbols for /lib/x86_64-linux-gnu/libpthread.so.0
Reading symbols from /lib/x86_64-linux-gnu/libdl.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libdl-2.19.so...done.
done.
Loaded symbols for /lib/x86_64-linux-gnu/libdl.so.2
Reading symbols from /lib/x86_64-linux-gnu/libc.so.6...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libc-2.19.so...done.
done.
Loaded symbols for /lib/x86_64-linux-gnu/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/ld-2.19.so...done.
done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib/x86_64-linux-gnu/libnss_compat.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libnss_compat-2.19.so...done.
done.
Loaded symbols for /lib/x86_64-linux-gnu/libnss_compat.so.2
Reading symbols from /lib/x86_64-linux-gnu/libnsl.so.1...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libnsl-2.19.so...done.
done.
Loaded symbols for /lib/x86_64-linux-gnu/libnsl.so.1
Reading symbols from /lib/x86_64-linux-gnu/libnss_nis.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libnss_nis-2.19.so...done.
done.
Loaded symbols for /lib/x86_64-linux-gnu/libnss_nis.so.2
Reading symbols from /lib/x86_64-linux-gnu/libnss_files.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libnss_files-2.19.so...done.
done.
Loaded symbols for /lib/x86_64-linux-gnu/libnss_files.so.2
warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.

@ivan4th
Author

ivan4th commented Feb 3, 2017

(Yes, gdb just can't work with threads in this container... I will try to fix it. Meanwhile, see the print-sched output above.)

@aclements
Member

Based on print-sched, these six goroutines that we're trying to scan are the only ones in "running":

      21 running, m 9, p 2, dedicated mark worker
         started at runtime.gcBgMarkWorker
         created by runtime.gcBgMarkStartWorkers
      22 running, m 17, p 8, dedicated mark worker
         started at runtime.gcBgMarkWorker
         created by runtime.gcBgMarkStartWorkers
      23 running, m 21, p 7, dedicated mark worker
         started at runtime.gcBgMarkWorker
         created by runtime.gcBgMarkStartWorkers
      26 running, m 20, p 3, dedicated mark worker
         started at runtime.gcBgMarkWorker
         created by runtime.gcBgMarkStartWorkers
      27 running, m 13, p 5, dedicated mark worker
         started at runtime.gcBgMarkWorker
         created by runtime.gcBgMarkStartWorkers
      28 running, m 27, p 9, idle mark worker
         started at runtime.gcBgMarkWorker
         created by runtime.gcBgMarkStartWorkers

I wouldn't totally trust the M and P numbers here, since these goroutines clearly aren't actually running, but these probably show the last M that was running each goroutine and whatever P that M was on when it was stopped for STW mark termination.

These Gs correspond to the following Ms (same order):

       9 PID 14, p 2, helpgc
      17 PID 22, p 8, helpgc
      21 PID 26, p 7, helpgc
      20 PID 25, p 3, helpgc
      13 PID 18, p 5, helpgc
      27 PID 32, p 9, helpgc

And the following Ps:

       2 gcstop, mark worker g 22
       8 gcstop, mark worker g 28
       7 gcstop, mark worker g 27
       3 gcstop, mark worker g 23
       5 gcstop, mark worker g 25
       9 gcstop, mark worker g 29

It's interesting that all of these Ps have assigned mark workers that are not the mark worker that claims it was running on that P. It could just be that the M and P numbers for each G are bogus, or this could mean something.

One good thing is that this is almost certainly not an issue in Go 1.8 because Go 1.8 doesn't scan stacks during mark termination.

@aclements
Member

@ivan4th, another thing to do in gdb on the running image:

Assuming you've still got my gdbinit.py loaded, run the python command at the GDB prompt and paste in the following:

def btfrom(g):
    pc, sp = g["sched"]["pc"], g["sched"]["sp"]
    oldsp, oldpc = gdb.parse_and_eval('$sp'), gdb.parse_and_eval('$pc')
    try:
        # TODO: This fails if we're not in the innermost frame.
        gdb.execute('set $sp = %#x' % sp)
        gdb.execute('set $pc = %#x' % pc)
        gdb.execute('backtrace')
    finally:
        gdb.execute('set $sp = %#x' % oldsp)
        gdb.execute('set $pc = %#x' % oldpc)

for goid in [21, 22, 23, 26, 27, 28]:
    print "Goroutine", goid, "sched stack:"
    btfrom(getg(n=goid))
    print

Hit Ctrl-D to run it and get back to the GDB prompt. This should dump out the stacks of these goroutines from where they actually stopped (ignoring their claim that they are in state "running").

@ivan4th
Author

ivan4th commented Feb 3, 2017

Unfortunately I got this:

>Goroutine 21 sched stack:
Traceback (most recent call last):
  File "<string>", line 15, in <module>
  File "<string>", line 6, in btfrom
gdb.MemoryError: Cannot access memory at address 0x18
Error while executing Python code.

@ivan4th
Author

ivan4th commented Feb 3, 2017

Actually the same error happens for each goroutine.

@aclements
Member

Yeah, looking back at your dump of the gp's, their saved SPs are all 0.

That might actually be a clue. I think the only thing that puts 0 there is returning from systemstack, which suggests that we somehow returned from systemstack on these and then didn't update their sched buffers again.

@aclements
Member

It's interesting that all of these Ps have assigned mark workers that are not the mark worker that claims it was running on that P. It could just be that the M and P numbers for each G are bogus, or this could mean something.

On further thought, I don't think this indicates anything. Looking at the Ps' assigned mark workers, they're all exactly in sequence, except for P 20, which would clearly have mark worker G 40, except that mark worker 40 is the one that disassociated and is driving mark termination. These are exactly the 48 mark workers, so nothing funny happened where one disassociated from a P and was replaced by another one.

@ivan4th, another potentially useful thing you can get from GDB or delve is the value of runtime.work, particularly runtime.work.rescan.list.

@ivan4th
Author

ivan4th commented Feb 4, 2017

(dlv) print runtime.work.rescan.list
[]runtime.guintptr len: 51, cap: 1312, [842352693312,842352693728,842352694144,842352694560,842352694976,842352695392,842352695808,842352696224,842352696640,842355048448,842355049280,842355048864,842355050112,842355050528,842355051360,842355049696,842355051776,842355052192,842355052608,842355053440,842355053024,842355054272,842355054688,842355055104,842355055520,842355055936,842355050944,842355089408,842355089824,842355090656,842355090240,842355053856,842355091904,842355091072,842355092736,842355091488,842355093568,842355093984,842355093152,842355092320,842355094816,842355095232,842355095648,842355096480,842355096896,842355138560,842355096064,842355094400,842359741760,842366314528,842353073472]

@ivan4th
Author

ivan4th commented Feb 4, 2017

struct { full uint64; empty uint64; pad0 [64]uint8; markrootNext uint32; markrootJobs uint32; nproc uint32; tstart int64; nwait uint32; ndone uint32; alldone runtime.note; nDataRoots int; nBSSRoots int; nSpanRoots int; nStackRoots int; nRescanRoots int; markrootDone bool; startSema uint32; markDoneSema uint32; bgMarkReady runtime.note; bgMarkDone uint32; mode runtime.gcMode; spans []*runtime.mspan; totaltime int64; bytesMarked uint64; initialHeapLive uint64; assistQueue struct { lock runtime.mutex; head runtime.guintptr; tail runtime.guintptr }; rescan struct { lock runtime.mutex; list []runtime.guintptr }; stwprocs int32; maxprocs int32; tSweepTerm int64; tMark int64; tMarkTerm int64; tEnd int64; pauseNS int64; pauseStart int64; heap0 uint64; heap1 uint64; heap2 uint64; heapGoal uint64 } {
        full: 0,
        empty: 9171967839421147533,
        pad0: [64]uint8 [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
        markrootNext: 60,
        markrootJobs: 54,
        nproc: 32,
        tstart: 308925600233906,
        nwait: 26,
        ndone: 0,
        alldone: runtime.note {key: 0},
        nDataRoots: 0,
        nBSSRoots: 0,
        nSpanRoots: 0,
        nStackRoots: 0,
        nRescanRoots: 51,
        markrootDone: true,
        startSema: 1,
        markDoneSema: 1,
        bgMarkReady: runtime.note {key: 0},
        bgMarkDone: 0,
        mode: 0,
        spans: []*runtime.mspan len: 1977, cap: 8192, [
                *(*runtime.mspan)(0x7f4965e3e000),
                *(*runtime.mspan)(0x7f4965e3e0a0),
                *(*runtime.mspan)(0x7f4965e3e140),
                *(*runtime.mspan)(0x7f4965e3e1e0),
                *(*runtime.mspan)(0x7f4965e3e280),
                *(*runtime.mspan)(0x7f4965e3e320),
                *(*runtime.mspan)(0x7f4965e3e3c0),
                *(*runtime.mspan)(0x7f4965e3e460),
                *(*runtime.mspan)(0x7f4965e3e500),
                *(*runtime.mspan)(0x7f4965e3e5a0),
                *(*runtime.mspan)(0x7f4965e3e640),
                *(*runtime.mspan)(0x7f4965e3e6e0),
                *(*runtime.mspan)(0x7f4965e3e780),
                *(*runtime.mspan)(0x7f4965e3e820),
                *(*runtime.mspan)(0x7f4965e3e8c0),
                *(*runtime.mspan)(0x7f4965e3e960),
                *(*runtime.mspan)(0x7f4965e3ea00),
                *(*runtime.mspan)(0x7f4965e3eaa0),
                *(*runtime.mspan)(0x7f4965e3eb40),
                *(*runtime.mspan)(0x7f4965e3ebe0),
                *(*runtime.mspan)(0x7f4965e3ec80),
                *(*runtime.mspan)(0x7f4965e3ed20),
                *(*runtime.mspan)(0x7f4965e3edc0),
                *(*runtime.mspan)(0x7f4965e3ee60),
                *(*runtime.mspan)(0x7f4965e3ef00),
                *(*runtime.mspan)(0x7f4965e3efa0),
                *(*runtime.mspan)(0x7f4965e3f040),
                *(*runtime.mspan)(0x7f4965e3f0e0),
                *(*runtime.mspan)(0x7f4965e3f180),
                *(*runtime.mspan)(0x7f4965e3f220),
                *(*runtime.mspan)(0x7f4965e3f2c0),
                *(*runtime.mspan)(0x7f4965e3f360),
                *(*runtime.mspan)(0x7f4965e3f400),
                *(*runtime.mspan)(0x7f4965e3f4a0),
                *(*runtime.mspan)(0x7f4965e3f540),
                *(*runtime.mspan)(0x7f4965e3f5e0),
                *(*runtime.mspan)(0x7f4965e3f680),
                *(*runtime.mspan)(0x7f4965e3f720),
                *(*runtime.mspan)(0x7f4965e3f7c0),
                *(*runtime.mspan)(0x7f4965e3f860),
                *(*runtime.mspan)(0x7f4965e3f900),
                *(*runtime.mspan)(0x7f4965e3f9a0),
                *(*runtime.mspan)(0x7f4965e3fa40),
                *(*runtime.mspan)(0x7f4965e3fae0),
                *(*runtime.mspan)(0x7f4965e3fb80),
                *(*runtime.mspan)(0x7f4965e3fc20),
                *(*runtime.mspan)(0x7f4965e3fcc0),
                *(*runtime.mspan)(0x7f4965e3fd60),
                *(*runtime.mspan)(0x7f4965e3fe00),
                *(*runtime.mspan)(0x7f4965e3fea0),
                *(*runtime.mspan)(0x7f4965e3ff40),
                *(*runtime.mspan)(0x7f4965e3ffe0),
                *(*runtime.mspan)(0x7f4965e40080),
                *(*runtime.mspan)(0x7f4965e40120),
                *(*runtime.mspan)(0x7f4965e401c0),
                *(*runtime.mspan)(0x7f4965e40260),
                *(*runtime.mspan)(0x7f4965e40300),
                *(*runtime.mspan)(0x7f4965e403a0),
                *(*runtime.mspan)(0x7f4965e40440),
                *(*runtime.mspan)(0x7f4965e404e0),
                *(*runtime.mspan)(0x7f4965e40580),
                *(*runtime.mspan)(0x7f4965e40620),
                *(*runtime.mspan)(0x7f4965e406c0),
                *(*runtime.mspan)(0x7f4965e40760),
                ...+1913 more
        ],
        totaltime: 3313886148062,
        bytesMarked: 11343072,
        initialHeapLive: 22322304,
        assistQueue: struct { lock runtime.mutex; head runtime.guintptr; tail runtime.guintptr } {
                lock: (*runtime.mutex)(0x5cacdf8),
                head: 0,
                tail: 0,},
        rescan: struct { lock runtime.mutex; list []runtime.guintptr } {
                lock: (*runtime.mutex)(0x5cace10),
                list: []runtime.guintptr len: 51, cap: 1312, [842352693312,842352693728,842352694144,842352694560,842352694976,842352695392,842352695808,842352696224,842352696640,842355048448,842355049280,842355048864,842355050112,842355050528,842355051360,842355049696,842355051776,842355052192,842355052608,842355053440,842355053024,842355054272,842355054688,842355055104,842355055520,842355055936,842355050944,842355089408,842355089824,842355090656,842355090240,842355053856,842355091904,842355091072,842355092736,842355091488,842355093568,842355093984,842355093152,842355092320,842355094816,842355095232,842355095648,842355096480,842355096896,842355138560,842355096064,842355094400,842359741760,842366314528,842353073472],},
        stwprocs: 32,
        maxprocs: 48,
        tSweepTerm: 308925429283441,
        tMark: 308925429313122,
        tMarkTerm: 308925600224180,
        tEnd: 308922100569890,
        pauseNS: 29681,
        pauseStart: 308925600224180,
        heap0: 22322304,
        heap1: 22570272,
        heap2: 11411928,
        heapGoal: 23370880,}

@r7vme

r7vme commented Jun 13, 2017

We had the same issue with kube-proxy at least two times.

@bradfitz
Contributor

@aclements, what's the status here? Is there something you still want to do for Go 1.9? Or do you suspect it's already been fixed?

@aclements
Member

@bradfitz, unfortunately I still have no idea what's going on here. But it may have been related to stack rescanning: Go 1.8 doesn't rescan stacks, and Go 1.9 will remove the rescan mechanism entirely, so it's possible this has been fixed as a side effect.

@r7vme, what version of Go was your crashing kube-proxy built with?

@r7vme

r7vme commented Jun 19, 2017

@aclements not sure, but kube-dns --version says Kubernetes v1.4.0-alpha.0.1055+6de83d58537f57-dirty

@josharian
Contributor

@r7vme @aclements I'm on my phone, but #13507 should provide hints for how to find which Go version a binary was built with. (Maybe we should make a go-gettable tool for this.)

@ALTree
Member

ALTree commented Jun 19, 2017

(gdb) p 'runtime.buildVersion'

@dgryski
Contributor

dgryski commented Jun 19, 2017

Go-gettable tool: https://github.com/loeyt/go-version
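
For what it's worth, later Go releases make this easy to script: the debug/buildinfo package (added in Go 1.18, long after this thread) can read the toolchain version out of a binary built with a reasonably recent Go. A minimal sketch, taking the binary path as an argument:

package main

import (
	"debug/buildinfo"
	"fmt"
	"log"
	"os"
)

func main() {
	if len(os.Args) != 2 {
		log.Fatalf("usage: %s <go-binary>", os.Args[0])
	}
	// ReadFile extracts the build information embedded in a Go binary,
	// including the Go toolchain version it was built with.
	info, err := buildinfo.ReadFile(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(info.GoVersion)
}

For binaries built with toolchains as old as the go1.7.4 one here, this will fail; the gdb print of runtime.buildVersion shown above is still the way to go in that case.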

@davecheney
Contributor

davecheney commented Jun 19, 2017 via email

@r7vme

r7vme commented Jun 20, 2017

@josharian @aclements sorry, I mixed that up earlier. Yes, we use kube-proxy from hyperkube v1.5.2_coreos.0:

[root@0000dd1bb02a188c tmp]# gdb hyperkube
GNU gdb (GDB) Fedora 7.11.1-86.fc24
...
(gdb) p 'runtime.buildVersion'
$1 = 0x344629c "go1.7.4"

@aclements
Member

Based on kubernetes/kubernetes#38228, Kubernetes 1.7 should be based on Go 1.8, which I believe does not have this bug. Kubernetes 1.7 should be out next week (June 28th).

Has anybody stress tested or run into this problem with the Kubernetes 1.7 beta?

@bradfitz
Contributor

bradfitz commented Jul 6, 2017

Closing this, assuming it's fixed in Kubernetes 1.7 (Go 1.8). Please comment if not and we'll re-open.

@bradfitz bradfitz closed this as completed Jul 6, 2017
@golang golang locked and limited conversation to collaborators Jul 6, 2018