runtime: Process terminates with segmentation fault #5337
Comment 3 by hb@hannobraun.de: The program output is redirected to a log file. There is no stack trace in there, although I can clearly see the effects of the crash (normal log messages right up to the time of the crash, a gap while the core dump is written, normal log messages afterwards). Where should the stack trace be written to (stdout, stderr)?

The server the Go program is running on has CentOS release 6.3 (Final) on it. Output from uname -a:

    Linux frontend1 2.6.32-279.el6.x86_64 #1 SMP Fri Jun 22 12:19:21 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

The binary was compiled on Ubuntu 12.04. Output from uname -a:

    Linux hanno-Vostro-3360 3.2.0-40-generic #64-Ubuntu SMP Mon Mar 25 21:22:10 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

Output from ldd:

    linux-vdso.so.1 => (0x00007fff96041000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003ebca00000)
    libc.so.6 => /lib64/libc.so.6 (0x0000003041400000)
    /lib64/ld-linux-x86-64.so.2 (0x0000003040c00000)

I'll see what I can do about setting GOTRACEBACK and report back. Thanks for your help!
Thanks. In retrospect it's not surprising that you are not seeing a stack trace, as gdb is showing that the crash happens in the code that is supposed to generate a stack trace. Still, setting GOTRACEBACK ought to print something, and that may help identify the problem. It should be printed on stderr.

The crash appears to be happening when the program receives a SIGPROF signal, so presumably you are importing runtime/pprof or net/http/pprof. The gdb backtrace is showing the backtrace up to the signal handler. The Go backtrace should show the backtrace of the currently executing Go code, and will hopefully be more informative.

One thing you could do within gdb is print the values of the TLS variables m and g, and also *m and *g. I don't know if you can print those from a core dump, though.
Comment 5 by hb@hannobraun.de: The process is running with GOTRACEBACK=2 as of now. So far we've had no crash. I'll report back when something happens. (It might take a while; as I said, whole days have gone by without crashes.) We are indeed using runtime/pprof. Regarding the TLS variables, I wasn't able to find a way to access those. I'm very inexperienced with gdb though, so someone else might have more luck. If someone can think of a way to get the values of those, please let me know.
Comment 7 by hb@hannobraun.de: We finally got a stack trace! I've attached the log file to this comment. We found out something else that might be interesting: Normally, the process is running as a non-root user. After I last posted here last week, we were running as root by accident and failed to produce any crashes. When my colleague made the connection and changed it to non-root, he was able to get quite a few crashes in a short time. Might be that the crash can only happen when running as non-root.
Comment 9 by hb@hannobraun.de: I looked a little closer at the stack trace and was about to post the same thing. Can this explain the behavior we've seen, including the lack of a stack trace?
Comment 12 by hb@hannobraun.de: I can send it to you, but unfortunately I don't see your complete email address (d...@cheney.net). If you send me an email, I'll reply with the file. Once you see the stack trace, the bug in my own code is pretty plain: whenever the file fails to be created, Close() is later called on nil. Might be that with root it's just less likely for the file creation to fail. This is good news, since I can easily fix this. Still, it would be interesting to know why we never saw a stack trace. If we had, this wouldn't have been such a mystery bug in the first place.
Please send to dave at cheney.net. My theory is that the difference comes down to resource limits between the root and non-root users; root usually has no limits on nprocs, ulimit, nofiles, etc. Looking at the comments from #4, iant zeroed in on SIGPROF as being a culprit; again, running as root changes the game with respect to who can send whom SIGPROF. Is there anything in dmesg / /var/log/syslog, possibly from the OOM killer?
Comment 14 by hb@hannobraun.de: File sent. I'll look into what dmesg and syslog have to say.
Comment 15 by hb@hannobraun.de: There's nothing in dmesg or syslog regarding the crashes. As Dave pointed out to me (and as was made evident by the stack trace), I was, if certain error conditions were met, passing nil into pprof.StartCPUProfile and possibly pprof.WriteHeapProfile, which triggered the crashes. The question remains why we never saw stack traces from those crashes. This lack of information from the crashes turned trivial-to-fix bugs into an issue that kept us on our toes for weeks.
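For reference, the failure mode described in the last two comments (a nil *os.File handed to pprof.StartCPUProfile or pprof.WriteHeapProfile after a failed os.Create, with Close later called on the nil file) is avoided by checking the error before starting the profile. The sketch below is illustrative rather than the reporter's actual code; the file names and the log.Fatalf handling are assumptions.

    package main

    import (
        "log"
        "os"
        "runtime/pprof"
    )

    func main() {
        // Check the error from os.Create: if it fails, f is nil, and passing a
        // nil file to StartCPUProfile (or calling Close on it later) is the
        // crash pattern described in the comments above.
        f, err := os.Create("cpu.prof") // illustrative path
        if err != nil {
            log.Fatalf("could not create CPU profile file: %v", err)
        }
        defer f.Close()

        if err := pprof.StartCPUProfile(f); err != nil {
            log.Fatalf("could not start CPU profile: %v", err)
        }
        defer pprof.StopCPUProfile()

        // ... application work ...

        hf, err := os.Create("heap.prof") // illustrative path
        if err != nil {
            log.Fatalf("could not create heap profile file: %v", err)
        }
        defer hf.Close()
        if err := pprof.WriteHeapProfile(hf); err != nil {
            log.Fatalf("could not write heap profile: %v", err)
        }
    }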
OP - with your permission I would like to remove the Go 1.1 tag from this issue. We can keep the issue open once you have fixed the error handling in your program. There may be something that can be done to make runtime/pprof more robust in the face of nil writers, but I don't feel this is a blocker for Go 1.1.
Comment 17 by hb@hannobraun.de: Sorry for the late reply. Yesterday was a holiday around here and I just returned to the office. Bad news: After I fixed the bug and deployed the changes, the process crashed again with the same symptoms shortly after. I believe that the bug that triggered the stack trace in comment #7 has nothing to do with the originally reported crash and was only accidentally triggered while trying to get a stack trace. Looks like we're back to square one. For now, we'll keep trying to get a stack trace for the crash. I'll report back once I know more. Regarding the Go 1.1 tag (re #16): I fear I can't really judge if this should be a blocker or not. Obviously these crashes are a big deal for us, and the lack of a stack trace really makes it hard to find out what is actually going on.
Comment 19 by hb@hannobraun.de: We're currently trying to reproduce the crash with GOTRACEBACK=2 set. Disabling profiling is one of the things I'd like to try next. I'll report back once we know more about that. As I said in comment #15, there's nothing in syslog or dmesg regarding the crashes.
Comment 20 by hb@hannobraun.de: We have new information. For fear that our production setup could somehow swallow the error messages, my colleague ran the process in a terminal and managed to reproduce a crash. In the terminal, it looks like the process just exits:

- The usual process output just stops. There is no output that indicates a crash.
- The process return code is 0.

However, a core dump is produced. It clearly states that the reason is a segmentation fault. My colleague also told me that he saw "Out of socket memory" errors in dmesg, which we didn't see for the previous crash (the pprof/nil pointer one, which I don't believe has anything to do with the originally reported bug).

It seems really odd to me that the process crashes with a segmentation fault, but the return code is 0. However, I don't know a lot about signal handling or the specifics of how Go handles signals. Can anyone make sense of this?

To summarize what has become a long comment thread:

- We're seeing frequent crashes of our Go process due to a segmentation fault.
- No error output is produced (neither when piping output to log files nor when running the process in a terminal).
- The process exits with return code 0.
- The stack trace I posted in comment #7 seems completely unrelated to this.

Next we're going to deploy a version that has all profiling disabled to see if this changes the situation.
Thanks for this information. Can I ask you to check a few more things?

1. What are you using for process management? Is it possible there is an intermediary wrapper that is swallowing the error?
2. Can you provide that core file somehow?
3. The termination of the process with no notification (although I cannot explain the zero error code) is a strong indication that something is hitting your process with a kill -9. Can you please add some detail about the host you are running on? You've already mentioned it was compiled on an Ubuntu system but deployed on RHEL 6.
4. Does your application use cgo (apart from the net pkg)?
5. Have you tried compiling your application on the same host that you run the application on?
6. Have you tried building your application with cgo disabled?
7. How much memory does your RHEL6 host have available? What other processes is it sharing resources with? What is the usual RSS of your application when running? Does this fluctuate, or does it approach a steady state?

Thank you in advance. Dave
Comment 23 by hb@hannobraun.de: Dave, I'll try to answer your questions as well as I can. I'll also ask my colleague to join us here, since he knows a lot more about our production setup than I do.

1. We're using supervisord for process management. I think we can rule out any swallowing of errors, since the last crash happened after my colleague started the process by hand in a terminal (see #20).
2. Yes, I can. I'm going to send you an email with a link to it in a moment.
3. According to gdb (when analyzing the core dump), the process is terminated with signal 11. See my original post above. I'm going to ask my colleague to answer the rest of your question.
4. No, and neither do our dependencies (see original post). I've confirmed this by running an rgrep on the source. If there's a better way to find out if cgo is used in a binary, please let me know.
5. No, we haven't. Do you think this is worthwhile to try?
6. I don't understand what you mean. Do you mean not using cgo (we don't), or is there some kind of compiler flag to disable cgo (a quick search didn't reveal anything)?
7. The server has 96 GB RAM. Other processes include haproxy and a bunch of Python processes (Tornado web servers, I think). My colleague will provide more info. RES/RSS starts out at 4-5 GB and grows steadily (we do daily restarts). The highest I've seen was 12-14 GB, I think. This is way too much and subject to ongoing optimization.

I hope this helps. Thanks a lot.
Thank you for your reply.

1. Yup, that is strange. Can you please audit your code, and any dependencies, for calls to os.Exit(0)?
Re 4. If you're not explicitly using cgo (it is unlikely you'd not know you were using it), then we can rule this out. Running ldd on the final binary will be the best test.
5. Yes, although it should be less of an issue if you are not using cgo to embed a foreign library.
6. Pass CGO_ENABLED=0 when building Go.
7. This is a real concern. Can you please run your process with GOGCTRACE=1 and post the full output from startup to restart/crash? If your process reaches 16 GB, you will get a panic, because the heap in 1.0.x had a maximum size of 16 GB.
8. By the way, you're not stripping your binaries by chance?
Comment 25 by hb@hannobraun.de: Mail has been sent to Dave. Re #22: This is the output for disass: Dump of assembler code for function runtime.gentraceback: 0x0000000000421420 <+0>: mov %fs:0xfffffffffffffff0,%rcx 0x0000000000421429 <+9>: cmp (%rcx),%rsp 0x000000000042142c <+12>: ja 0x421438 <runtime.gentraceback+24> 0x000000000042142e <+14>: mov $0x48,%eax 0x0000000000421433 <+19>: callq 0x427530 <runtime.morestack01> 0x0000000000421438 <+24>: sub $0x68,%rsp 0x000000000042143c <+28>: mov 0x70(%rsp),%rdx 0x0000000000421441 <+33>: mov 0x78(%rsp),%rcx 0x0000000000421446 <+38>: mov 0x88(%rsp),%rsi 0x000000000042144e <+46>: movq $0x0,0x48(%rsp) 0x0000000000421457 <+55>: movq $0x0,0x38(%rsp) 0x0000000000421460 <+64>: movb $0x0,0x27(%rsp) 0x0000000000421465 <+69>: mov 0x28(%rsi),%rax 0x0000000000421469 <+73>: cmp %rax,%rdx 0x000000000042146c <+76>: jne 0x42147b <runtime.gentraceback+91> 0x000000000042146e <+78>: mov 0x20(%rsi),%rax 0x0000000000421472 <+82>: cmp %rax,%rcx 0x0000000000421475 <+85>: je 0x421c5b <runtime.gentraceback+2107> 0x000000000042147b <+91>: cmp $0x0,%rdx 0x000000000042147f <+95>: jne 0x42148d <runtime.gentraceback+109> 0x0000000000421481 <+97>: mov (%rcx),%rdx 0x0000000000421484 <+100>: add $0x8,%rcx 0x0000000000421488 <+104>: mov %rcx,0x78(%rsp) 0x000000000042148d <+109>: movl $0x0,0x60(%rsp) 0x0000000000421495 <+117>: movl $0x0,0x5c(%rsp) 0x000000000042149d <+125>: mov 0x8(%rsi),%rax 0x00000000004214a1 <+129>: mov %rax,0x30(%rsp) 0x00000000004214a6 <+134>: mov 0x60(%rsp),%eax 0x00000000004214aa <+138>: cmp 0xa0(%rsp),%eax 0x00000000004214b1 <+145>: jge 0x421523 <runtime.gentraceback+259> 0x00000000004214b3 <+147>: cmp $0x4274e0,%rdx 0x00000000004214ba <+154>: jne 0x4216e2 <runtime.gentraceback+706> 0x00000000004214c0 <+160>: mov 0x30(%rsp),%rax 0x00000000004214c5 <+165>: mov 0x18(%rax),%rdx 0x00000000004214c9 <+169>: mov %rdx,0x50(%rsp) 0x00000000004214ce <+174>: mov 0x30(%rsp),%rax 0x00000000004214d3 <+179>: mov 0x10(%rax),%rax 0x00000000004214d7 <+183>: mov %rax,0x78(%rsp) 0x00000000004214dc <+188>: movq $0x0,0x48(%rsp) 0x00000000004214e5 <+197>: movq $0x0,0x38(%rsp) 0x00000000004214ee <+206>: cmpq $0x0,0x98(%rsp) 0x00000000004214f7 <+215>: jne 0x421508 <runtime.gentraceback+232> 0x00000000004214f9 <+217>: cmpq $0x0,0xa8(%rsp) 0x0000000000421502 <+226>: je 0x421681 <runtime.gentraceback+609> 0x0000000000421508 <+232>: mov 0x30(%rsp),%rax 0x000000000042150d <+237>: mov 0x8(%rax),%rax 0x0000000000421511 <+241>: mov %rax,0x30(%rsp) 0x0000000000421516 <+246>: mov 0x60(%rsp),%eax 0x000000000042151a <+250>: cmp 0xa0(%rsp),%eax 0x0000000000421521 <+257>: jl 0x4214b3 <runtime.gentraceback+147> 0x0000000000421523 <+259>: cmpq $0x0,0x98(%rsp) 0x000000000042152c <+268>: jne 0x421539 <runtime.gentraceback+281> 0x000000000042152e <+270>: cmpq $0x0,0xa8(%rsp) 0x0000000000421537 <+279>: je 0x421542 <runtime.gentraceback+290> 0x0000000000421539 <+281>: mov 0x60(%rsp),%eax 0x000000000042153d <+285>: add $0x68,%rsp 0x0000000000421541 <+289>: retq 0x0000000000421542 <+290>: mov 0xf0(%rsi),%rax 0x0000000000421549 <+297>: cmp $0x0,%rax 0x000000000042154d <+301>: je 0x421539 <runtime.gentraceback+281> 0x000000000042154f <+303>: mov %rax,0x50(%rsp) 0x0000000000421554 <+308>: mov %rax,(%rsp) 0x0000000000421558 <+312>: callq 0x421050 <runtime.findfunc> 0x000000000042155d <+317>: cmp $0x0,%rax 0x0000000000421561 <+321>: je 0x421539 <runtime.gentraceback+281> 0x0000000000421563 <+323>: mov %rax,0x28(%rsp) 0x0000000000421568 <+328>: mov %rax,(%rsp) 0x000000000042156c <+332>: mov 
%fs:0xfffffffffffffff8,%rcx 0x0000000000421575 <+341>: mov 0x78(%rcx),%rcx 0x0000000000421579 <+345>: cmp %rcx,0x88(%rsp) 0x0000000000421581 <+353>: jne 0x42167a <runtime.gentraceback+602> 0x0000000000421587 <+359>: mov $0x1,%ecx 0x000000000042158c <+364>: mov %cl,0x8(%rsp) 0x0000000000421590 <+368>: callq 0x421350 <runtime.showframe> 0x0000000000421595 <+373>: cmp $0x0,%al 0x0000000000421597 <+375>: je 0x421539 <runtime.gentraceback+281> 0x0000000000421599 <+377>: mov 0x88(%rsp),%rax 0x00000000004215a1 <+385>: mov 0x80(%rax),%rax 0x00000000004215a8 <+392>: cmp $0x1,%rax 0x00000000004215ac <+396>: je 0x421539 <runtime.gentraceback+281> 0x00000000004215ae <+398>: mov $0x8c8216,%eax 0x00000000004215b3 <+403>: mov %rax,(%rsp) 0x00000000004215b7 <+407>: mov 0x28(%rsp),%rcx 0x00000000004215bc <+412>: mov (%rcx),%rax 0x00000000004215bf <+415>: mov %rax,0x8(%rsp) 0x00000000004215c4 <+420>: mov 0x8(%rcx),%rax 0x00000000004215c8 <+424>: mov %rax,0x10(%rsp) 0x00000000004215cd <+429>: callq 0x418960 <runtime.printf> 0x00000000004215d2 <+434>: mov 0x50(%rsp),%rdx 0x00000000004215d7 <+439>: mov 0x28(%rsp),%rbx 0x00000000004215dc <+444>: cmpl $0x0,0x60(%rsp) 0x00000000004215e1 <+449>: jle 0x4215f0 <runtime.gentraceback+464> 0x00000000004215e3 <+451>: mov 0x48(%rbx),%rax 0x00000000004215e7 <+455>: cmp %rax,%rdx 0x00000000004215ea <+458>: ja 0x421672 <runtime.gentraceback+594> 0x00000000004215f0 <+464>: mov %rbx,(%rsp) 0x00000000004215f4 <+468>: mov %rdx,0x8(%rsp) 0x00000000004215f9 <+473>: callq 0x420d50 <runtime.funcline> 0x00000000004215fe <+478>: mov %eax,0x1c(%rsp) 0x0000000000421602 <+482>: mov $0x8c8225,%eax 0x0000000000421607 <+487>: mov %rax,(%rsp) 0x000000000042160b <+491>: mov 0x28(%rsp),%rcx 0x0000000000421610 <+496>: mov 0x20(%rcx),%rax 0x0000000000421614 <+500>: mov %rax,0x8(%rsp) 0x0000000000421619 <+505>: mov 0x28(%rcx),%rax 0x000000000042161d <+509>: mov %rax,0x10(%rsp) 0x0000000000421622 <+514>: mov 0x1c(%rsp),%eax 0x0000000000421626 <+518>: mov %eax,0x18(%rsp) 0x000000000042162a <+522>: callq 0x418960 <runtime.printf> 0x000000000042162f <+527>: mov 0x50(%rsp),%rdx 0x0000000000421634 <+532>: mov 0x28(%rsp),%rcx 0x0000000000421639 <+537>: mov 0x48(%rcx),%rax 0x000000000042163d <+541>: cmp %rax,%rdx 0x0000000000421640 <+544>: jbe 0x42165f <runtime.gentraceback+575> 0x0000000000421642 <+546>: mov $0x8c822c,%eax 0x0000000000421647 <+551>: mov %rax,(%rsp) 0x000000000042164b <+555>: mov 0x48(%rcx),%rax 0x000000000042164f <+559>: mov %rdx,%rcx 0x0000000000421652 <+562>: sub %rax,%rcx 0x0000000000421655 <+565>: mov %rcx,0x8(%rsp) 0x000000000042165a <+570>: callq 0x418960 <runtime.printf> 0x000000000042165f <+575>: mov $0x8c8231,%eax 0x0000000000421664 <+580>: mov %rax,(%rsp) 0x0000000000421668 <+584>: callq 0x418960 <runtime.printf> 0x000000000042166d <+589>: jmpq 0x421539 <runtime.gentraceback+281> 0x0000000000421672 <+594>: dec %rdx 0x0000000000421675 <+597>: jmpq 0x4215f0 <runtime.gentraceback+464> 0x000000000042167a <+602>: xor %ecx,%ecx 0x000000000042167c <+604>: jmpq 0x42158c <runtime.gentraceback+364> 0x0000000000421681 <+609>: movq $0x0,(%rsp) 0x0000000000421689 <+617>: mov %fs:0xfffffffffffffff8,%rcx 0x0000000000421692 <+626>: mov 0x78(%rcx),%rcx 0x0000000000421696 <+630>: cmp %rcx,%rsi 0x0000000000421699 <+633>: jne 0x4216de <runtime.gentraceback+702> 0x000000000042169b <+635>: mov $0x1,%ecx 0x00000000004216a0 <+640>: mov %cl,0x8(%rsp) 0x00000000004216a4 <+644>: callq 0x421350 <runtime.showframe> 0x00000000004216a9 <+649>: mov 0x88(%rsp),%rsi 0x00000000004216b1 <+657>: mov 
0x50(%rsp),%rdx 0x00000000004216b6 <+662>: cmp $0x0,%al 0x00000000004216b8 <+664>: je 0x421508 <runtime.gentraceback+232> 0x00000000004216be <+670>: mov $0x8c8160,%eax 0x00000000004216c3 <+675>: mov %rax,(%rsp) 0x00000000004216c7 <+679>: callq 0x418960 <runtime.printf> 0x00000000004216cc <+684>: mov 0x88(%rsp),%rsi 0x00000000004216d4 <+692>: mov 0x50(%rsp),%rdx 0x00000000004216d9 <+697>: jmpq 0x421508 <runtime.gentraceback+232> 0x00000000004216de <+702>: xor %ecx,%ecx 0x00000000004216e0 <+704>: jmp 0x4216a0 <runtime.gentraceback+640> 0x00000000004216e2 <+706>: cmp $0x1000,%rdx 0x00000000004216e9 <+713>: jbe 0x421c31 <runtime.gentraceback+2065> 0x00000000004216ef <+719>: mov %rdx,0x50(%rsp) 0x00000000004216f4 <+724>: mov %rdx,(%rsp) 0x00000000004216f8 <+728>: callq 0x421050 <runtime.findfunc> 0x00000000004216fd <+733>: mov 0x78(%rsp),%rcx 0x0000000000421702 <+738>: mov 0x50(%rsp),%rdi 0x0000000000421707 <+743>: mov 0x88(%rsp),%rsi 0x000000000042170f <+751>: mov 0x38(%rsp),%rbp 0x0000000000421714 <+756>: mov %rax,%rbx 0x0000000000421717 <+759>: cmp $0x0,%rax 0x000000000042171b <+763>: je 0x421c31 <runtime.gentraceback+2065> 0x0000000000421721 <+769>: cmp $0x0,%rbp 0x0000000000421725 <+773>: jne 0x421c17 <runtime.gentraceback+2039> 0x000000000042172b <+779>: mov %rcx,%rbp 0x000000000042172e <+782>: mov 0x48(%rbx),%rax 0x0000000000421732 <+786>: cmp %rax,%rdi 0x0000000000421735 <+789>: jbe 0x421743 <runtime.gentraceback+803> 0x0000000000421737 <+791>: mov 0x5c(%rbx),%eax 0x000000000042173a <+794>: cmp $0x8,%eax 0x000000000042173d <+797>: jge 0x421c06 <runtime.gentraceback+2022> 0x0000000000421743 <+803>: cmpq $0x0,0x48(%rsp) 0x0000000000421749 <+809>: jne 0x421754 <runtime.gentraceback+820> => 0x000000000042174b <+811>: mov 0x0(%rbp),%rax 0x000000000042174f <+815>: mov %rax,0x48(%rsp) 0x0000000000421754 <+820>: add $0x8,%rbp 0x0000000000421758 <+824>: mov %rbp,0x38(%rsp) 0x000000000042175d <+829>: cmpl $0x0,0x90(%rsp) 0x0000000000421765 <+837>: jle 0x421986 <runtime.gentraceback+1382> 0x000000000042176b <+843>: decl 0x90(%rsp) 0x0000000000421772 <+850>: mov 0x48(%rbx),%rax 0x0000000000421776 <+854>: cmp $0x416f40,%rax 0x000000000042177c <+860>: jne 0x42197f <runtime.gentraceback+1375> 0x0000000000421782 <+866>: mov $0x1,%eax 0x0000000000421787 <+871>: mov %al,0x27(%rsp) 0x000000000042178b <+875>: mov 0x48(%rbx),%rax 0x000000000042178f <+879>: cmp $0x417710,%rax 0x0000000000421795 <+885>: je 0x421976 <runtime.gentraceback+1366> 0x000000000042179b <+891>: mov 0x48(%rbx),%rax 0x000000000042179f <+895>: cmp $0x41bef0,%rax 0x00000000004217a5 <+901>: je 0x421976 <runtime.gentraceback+1366> 0x00000000004217ab <+907>: mov 0x48(%rbx),%rax 0x00000000004217af <+911>: cmp $0x41fd30,%rax 0x00000000004217b5 <+917>: jne 0x4217bf <runtime.gentraceback+927> 0x00000000004217b7 <+919>: movl $0x1,0x5c(%rsp) 0x00000000004217bf <+927>: cmpq $0x0,0x98(%rsp) 0x00000000004217c8 <+936>: jne 0x4217d9 <runtime.gentraceback+953> 0x00000000004217ca <+938>: cmpq $0x0,0xa8(%rsp) 0x00000000004217d3 <+947>: je 0x4218ad <runtime.gentraceback+1165> 0x00000000004217d9 <+953>: cmpq $0x0,0x98(%rsp) 0x00000000004217e2 <+962>: jne 0x4217ef <runtime.gentraceback+975> 0x00000000004217e4 <+964>: cmpq $0x0,0xa8(%rsp) 0x00000000004217ed <+973>: je 0x42181d <runtime.gentraceback+1021> 0x00000000004217ef <+975>: cmp $0x41b520,%rdi 0x00000000004217f6 <+982>: je 0x421523 <runtime.gentraceback+259> 0x00000000004217fc <+988>: mov 0x48(%rsp),%rdx 0x0000000000421801 <+993>: movq $0x0,0x48(%rsp) 0x000000000042180a <+1002>: mov %rbp,0x78(%rsp) 
0x000000000042180f <+1007>: movq $0x0,0x38(%rsp) 0x0000000000421818 <+1016>: jmpq 0x4214a6 <runtime.gentraceback+134> 0x000000000042181d <+1021>: mov 0x48(%rbx),%rax 0x0000000000421821 <+1025>: cmp $0x4274e0,%rax 0x0000000000421827 <+1031>: jne 0x4217ef <runtime.gentraceback+975> 0x0000000000421829 <+1033>: mov %fs:0xfffffffffffffff8,%rax 0x0000000000421832 <+1042>: mov (%rax),%rax 0x0000000000421835 <+1045>: cmp %rax,%rsi 0x0000000000421838 <+1048>: jne 0x4217ef <runtime.gentraceback+975> 0x000000000042183a <+1050>: mov $0x8c81e6,%eax 0x000000000042183f <+1055>: mov %rax,(%rsp) 0x0000000000421843 <+1059>: mov %fs:0xfffffffffffffff8,%rax 0x000000000042184c <+1068>: mov 0x78(%rax),%rax 0x0000000000421850 <+1072>: mov 0x80(%rax),%rax 0x0000000000421857 <+1079>: mov %rax,0x8(%rsp) 0x000000000042185c <+1084>: callq 0x418960 <runtime.printf> 0x0000000000421861 <+1089>: mov %fs:0xfffffffffffffff8,%rax 0x000000000042186a <+1098>: mov 0x78(%rax),%rsi 0x000000000042186e <+1102>: mov %rsi,0x88(%rsp) 0x0000000000421876 <+1110>: mov 0x8(%rsi),%rax 0x000000000042187a <+1114>: mov %rax,0x30(%rsp) 0x000000000042187f <+1119>: mov 0x30(%rsp),%rax 0x0000000000421884 <+1124>: mov 0x10(%rax),%rax 0x0000000000421888 <+1128>: mov %rax,0x78(%rsp) 0x000000000042188d <+1133>: mov 0x30(%rsp),%rax 0x0000000000421892 <+1138>: mov 0x18(%rax),%rdx 0x0000000000421896 <+1142>: movq $0x0,0x38(%rsp) 0x000000000042189f <+1151>: movq $0x0,0x48(%rsp) 0x00000000004218a8 <+1160>: jmpq 0x4214a6 <runtime.gentraceback+134> 0x00000000004218ad <+1165>: mov 0x48(%rbx),%rax 0x00000000004218b1 <+1169>: cmp $0x427400,%rax 0x00000000004218b7 <+1175>: jne 0x4217d9 <runtime.gentraceback+953> 0x00000000004218bd <+1181>: mov %fs:0xfffffffffffffff8,%rax 0x00000000004218c6 <+1190>: mov (%rax),%rax 0x00000000004218c9 <+1193>: cmp %rax,%rsi 0x00000000004218cc <+1196>: jne 0x4217d9 <runtime.gentraceback+953> 0x00000000004218d2 <+1202>: cmpl $0x0,0x5c(%rsp) 0x00000000004218d7 <+1207>: je 0x4217d9 <runtime.gentraceback+953> 0x00000000004218dd <+1213>: mov $0x8c81b6,%eax 0x00000000004218e2 <+1218>: mov %rax,(%rsp) 0x00000000004218e6 <+1222>: mov %fs:0xfffffffffffffff8,%rax 0x00000000004218ef <+1231>: mov 0x78(%rax),%rax 0x00000000004218f3 <+1235>: mov 0x80(%rax),%rax 0x00000000004218fa <+1242>: mov %rax,0x8(%rsp) 0x00000000004218ff <+1247>: callq 0x418960 <runtime.printf> 0x0000000000421904 <+1252>: mov %fs:0xfffffffffffffff8,%rax 0x000000000042190d <+1261>: mov 0x8(%rax),%rdx 0x0000000000421911 <+1265>: mov %fs:0xfffffffffffffff8,%rax 0x000000000042191a <+1274>: mov 0x18(%rax),%rax 0x000000000042191e <+1278>: add $0xfffffffffffffff8,%rax 0x0000000000421922 <+1282>: mov %rax,0x78(%rsp) 0x0000000000421927 <+1287>: mov %fs:0xfffffffffffffff8,%rax 0x0000000000421930 <+1296>: mov 0x20(%rax),%rax 0x0000000000421934 <+1300>: mov %rax,0x48(%rsp) 0x0000000000421939 <+1305>: mov %fs:0xfffffffffffffff8,%rax 0x0000000000421942 <+1314>: mov 0x18(%rax),%rax 0x0000000000421946 <+1318>: mov %rax,0x38(%rsp) 0x000000000042194b <+1323>: movl $0x0,0x5c(%rsp) 0x0000000000421953 <+1331>: mov %fs:0xfffffffffffffff8,%rax 0x000000000042195c <+1340>: mov 0x78(%rax),%rsi 0x0000000000421960 <+1344>: mov %rsi,0x88(%rsp) 0x0000000000421968 <+1352>: mov 0x8(%rsi),%rax 0x000000000042196c <+1356>: mov %rax,0x30(%rsp) 0x0000000000421971 <+1361>: jmpq 0x4214a6 <runtime.gentraceback+134> 0x0000000000421976 <+1366>: add $0x10,%rbp 0x000000000042197a <+1370>: jmpq 0x4217ab <runtime.gentraceback+907> 0x000000000042197f <+1375>: xor %eax,%eax 0x0000000000421981 <+1377>: jmpq 0x421787 
<runtime.gentraceback+871> 0x0000000000421986 <+1382>: cmpq $0x0,0x98(%rsp) 0x000000000042198f <+1391>: je 0x4219b0 <runtime.gentraceback+1424> 0x0000000000421991 <+1393>: mov 0x60(%rsp),%eax 0x0000000000421995 <+1397>: incl 0x60(%rsp) 0x0000000000421999 <+1401>: movslq %eax,%rax 0x000000000042199c <+1404>: mov 0x98(%rsp),%rdx 0x00000000004219a4 <+1412>: lea (%rdx,%rax,8),%rax 0x00000000004219a8 <+1416>: mov %rdi,(%rax) 0x00000000004219ab <+1419>: jmpq 0x421772 <runtime.gentraceback+850> 0x00000000004219b0 <+1424>: cmpq $0x0,0xa8(%rsp) 0x00000000004219b9 <+1433>: je 0x421a01 <runtime.gentraceback+1505> 0x00000000004219bb <+1435>: mov %rbx,0x28(%rsp) 0x00000000004219c0 <+1440>: mov %rbx,(%rsp) 0x00000000004219c4 <+1444>: mov %rdi,0x8(%rsp) 0x00000000004219c9 <+1449>: mov %rcx,0x10(%rsp) 0x00000000004219ce <+1454>: mov 0xb0(%rsp),%rax 0x00000000004219d6 <+1462>: mov %rax,0x18(%rsp) 0x00000000004219db <+1467>: mov 0xa8(%rsp),%rax 0x00000000004219e3 <+1475>: callq *%rax 0x00000000004219e5 <+1477>: mov 0x50(%rsp),%rdi 0x00000000004219ea <+1482>: mov 0x88(%rsp),%rsi 0x00000000004219f2 <+1490>: mov 0x38(%rsp),%rbp 0x00000000004219f7 <+1495>: mov 0x28(%rsp),%rbx 0x00000000004219fc <+1500>: jmpq 0x421772 <runtime.gentraceback+850> 0x0000000000421a01 <+1505>: mov %rbx,0x28(%rsp) 0x0000000000421a06 <+1510>: mov %rbx,(%rsp) 0x0000000000421a0a <+1514>: mov %fs:0xfffffffffffffff8,%rcx 0x0000000000421a13 <+1523>: mov 0x78(%rcx),%rcx 0x0000000000421a17 <+1527>: cmp %rcx,%rsi 0x0000000000421a1a <+1530>: jne 0x421bff <runtime.gentraceback+2015> 0x0000000000421a20 <+1536>: mov $0x1,%ecx 0x0000000000421a25 <+1541>: mov %cl,0x8(%rsp) 0x0000000000421a29 <+1545>: callq 0x421350 <runtime.showframe> 0x0000000000421a2e <+1550>: mov 0x50(%rsp),%rdi 0x0000000000421a33 <+1555>: mov 0x88(%rsp),%rsi 0x0000000000421a3b <+1563>: mov 0x38(%rsp),%rbp 0x0000000000421a40 <+1568>: mov 0x28(%rsp),%rbx 0x0000000000421a45 <+1573>: cmp $0x0,%al 0x0000000000421a47 <+1575>: je 0x421bbd <runtime.gentraceback+1949> 0x0000000000421a4d <+1581>: mov %rdi,0x40(%rsp) 0x0000000000421a52 <+1586>: cmpl $0x0,0x60(%rsp) 0x0000000000421a57 <+1591>: jle 0x421a66 <runtime.gentraceback+1606> 0x0000000000421a59 <+1593>: mov 0x48(%rbx),%rax 0x0000000000421a5d <+1597>: cmp %rax,%rdi 0x0000000000421a60 <+1600>: ja 0x421bea <runtime.gentraceback+1994> 0x0000000000421a66 <+1606>: mov %fs:0xfffffffffffffff8,%rax 0x0000000000421a6f <+1615>: mov 0x98(%rax),%eax 0x0000000000421a75 <+1621>: cmp $0x0,%eax 0x0000000000421a78 <+1624>: je 0x421a90 <runtime.gentraceback+1648> 0x0000000000421a7a <+1626>: mov %fs:0xfffffffffffffff8,%rax 0x0000000000421a83 <+1635>: mov 0x78(%rax),%rax 0x0000000000421a87 <+1639>: cmp %rax,%rsi 0x0000000000421a8a <+1642>: je 0x421bcd <runtime.gentraceback+1965> 0x0000000000421a90 <+1648>: mov $0x8c8198,%eax 0x0000000000421a95 <+1653>: mov %rax,(%rsp) 0x0000000000421a99 <+1657>: mov (%rbx),%rax 0x0000000000421a9c <+1660>: mov %rax,0x8(%rsp) 0x0000000000421aa1 <+1665>: mov 0x8(%rbx),%rax 0x0000000000421aa5 <+1669>: mov %rax,0x10(%rsp) 0x0000000000421aaa <+1674>: callq 0x418960 <runtime.printf> 0x0000000000421aaf <+1679>: xor %ebx,%ebx 0x0000000000421ab1 <+1681>: mov 0x28(%rsp),%rax 0x0000000000421ab6 <+1686>: mov 0x60(%rax),%eax 0x0000000000421ab9 <+1689>: cltd 0x0000000000421aba <+1690>: and $0x7,%edx 0x0000000000421abd <+1693>: add %edx,%eax 0x0000000000421abf <+1695>: sar $0x3,%eax 0x0000000000421ac2 <+1698>: cmp %eax,%ebx 0x0000000000421ac4 <+1700>: jge 0x421b11 <runtime.gentraceback+1777> 0x0000000000421ac6 <+1702>: cmp $0x0,%ebx 
0x0000000000421ac9 <+1705>: mov %ebx,0x64(%rsp) 0x0000000000421acd <+1709>: je 0x421ae1 <runtime.gentraceback+1729> 0x0000000000421acf <+1711>: mov $0x8c819c,%eax 0x0000000000421ad4 <+1716>: mov %rax,(%rsp) 0x0000000000421ad8 <+1720>: callq 0x418910 <runtime.prints> 0x0000000000421add <+1725>: mov 0x64(%rsp),%ebx 0x0000000000421ae1 <+1729>: movslq %ebx,%rcx 0x0000000000421ae4 <+1732>: mov 0x38(%rsp),%rdx 0x0000000000421ae9 <+1737>: mov (%rdx,%rcx,8),%rax 0x0000000000421aed <+1741>: mov %rax,(%rsp) 0x0000000000421af1 <+1745>: callq 0x4191f0 <runtime.printhex> 0x0000000000421af6 <+1750>: mov 0x64(%rsp),%ebx 0x0000000000421afa <+1754>: cmp $0x4,%ebx 0x0000000000421afd <+1757>: jl 0x421bc6 <runtime.gentraceback+1958> 0x0000000000421b03 <+1763>: mov $0x8c819f,%eax 0x0000000000421b08 <+1768>: mov %rax,(%rsp) 0x0000000000421b0c <+1772>: callq 0x418910 <runtime.prints> 0x0000000000421b11 <+1777>: mov $0x8c81a5,%eax 0x0000000000421b16 <+1782>: mov %rax,(%rsp) 0x0000000000421b1a <+1786>: callq 0x418910 <runtime.prints> 0x0000000000421b1f <+1791>: mov 0x28(%rsp),%rcx 0x0000000000421b24 <+1796>: mov %rcx,(%rsp) 0x0000000000421b28 <+1800>: mov 0x40(%rsp),%rcx 0x0000000000421b2d <+1805>: mov %rcx,0x8(%rsp) 0x0000000000421b32 <+1810>: callq 0x420d50 <runtime.funcline> 0x0000000000421b37 <+1815>: mov %eax,0x1c(%rsp) 0x0000000000421b3b <+1819>: mov $0x8c81a8,%eax 0x0000000000421b40 <+1824>: mov %rax,(%rsp) 0x0000000000421b44 <+1828>: mov 0x28(%rsp),%rcx 0x0000000000421b49 <+1833>: mov 0x20(%rcx),%rax 0x0000000000421b4d <+1837>: mov %rax,0x8(%rsp) 0x0000000000421b52 <+1842>: mov 0x28(%rcx),%rax 0x0000000000421b56 <+1846>: mov %rax,0x10(%rsp) 0x0000000000421b5b <+1851>: mov 0x1c(%rsp),%eax 0x0000000000421b5f <+1855>: mov %eax,0x18(%rsp) 0x0000000000421b63 <+1859>: callq 0x418960 <runtime.printf> 0x0000000000421b68 <+1864>: mov 0x50(%rsp),%rdx 0x0000000000421b6d <+1869>: mov 0x28(%rsp),%rcx 0x0000000000421b72 <+1874>: mov 0x48(%rcx),%rax 0x0000000000421b76 <+1878>: cmp %rax,%rdx 0x0000000000421b79 <+1881>: jbe 0x421b98 <runtime.gentraceback+1912> 0x0000000000421b7b <+1883>: mov $0x8c81af,%eax 0x0000000000421b80 <+1888>: mov %rax,(%rsp) 0x0000000000421b84 <+1892>: mov 0x48(%rcx),%rax 0x0000000000421b88 <+1896>: mov %rdx,%rcx 0x0000000000421b8b <+1899>: sub %rax,%rcx 0x0000000000421b8e <+1902>: mov %rcx,0x8(%rsp) 0x0000000000421b93 <+1907>: callq 0x418960 <runtime.printf> 0x0000000000421b98 <+1912>: mov $0x8c81b4,%eax 0x0000000000421b9d <+1917>: mov %rax,(%rsp) 0x0000000000421ba1 <+1921>: callq 0x418960 <runtime.printf> 0x0000000000421ba6 <+1926>: mov 0x50(%rsp),%rdi 0x0000000000421bab <+1931>: mov 0x88(%rsp),%rsi 0x0000000000421bb3 <+1939>: mov 0x38(%rsp),%rbp 0x0000000000421bb8 <+1944>: mov 0x28(%rsp),%rbx 0x0000000000421bbd <+1949>: incl 0x60(%rsp) 0x0000000000421bc1 <+1953>: jmpq 0x421772 <runtime.gentraceback+850> 0x0000000000421bc6 <+1958>: inc %ebx 0x0000000000421bc8 <+1960>: jmpq 0x421ab1 <runtime.gentraceback+1681> 0x0000000000421bcd <+1965>: mov $0x8c818f,%eax 0x0000000000421bd2 <+1970>: mov %rax,(%rsp) 0x0000000000421bd6 <+1974>: mov %rbp,0x8(%rsp) 0x0000000000421bdb <+1979>: callq 0x418960 <runtime.printf> 0x0000000000421be0 <+1984>: mov 0x28(%rsp),%rbx 0x0000000000421be5 <+1989>: jmpq 0x421a90 <runtime.gentraceback+1648> 0x0000000000421bea <+1994>: cmpb $0x0,0x27(%rsp) 0x0000000000421bef <+1999>: jne 0x421a66 <runtime.gentraceback+1606> 0x0000000000421bf5 <+2005>: decq 0x40(%rsp) 0x0000000000421bfa <+2010>: jmpq 0x421a66 <runtime.gentraceback+1606> 0x0000000000421bff <+2015>: xor %ecx,%ecx 
0x0000000000421c01 <+2017>: jmpq 0x421a25 <runtime.gentraceback+1541> 0x0000000000421c06 <+2022>: mov 0x5c(%rbx),%eax 0x0000000000421c09 <+2025>: add $0xfffffff8,%eax 0x0000000000421c0c <+2028>: movslq %eax,%rax 0x0000000000421c0f <+2031>: add %rax,%rbp 0x0000000000421c12 <+2034>: jmpq 0x421743 <runtime.gentraceback+803> 0x0000000000421c17 <+2039>: cmpq $0x0,0x48(%rsp) 0x0000000000421c1d <+2045>: jne 0x42175d <runtime.gentraceback+829> 0x0000000000421c23 <+2051>: mov 0x0(%rbp),%rax 0x0000000000421c27 <+2055>: mov %rax,0x48(%rsp) 0x0000000000421c2c <+2060>: jmpq 0x42175d <runtime.gentraceback+829> 0x0000000000421c31 <+2065>: cmpq $0x0,0xa8(%rsp) 0x0000000000421c3a <+2074>: je 0x421523 <runtime.gentraceback+259> 0x0000000000421c40 <+2080>: mov $0x8c8184,%eax 0x0000000000421c45 <+2085>: mov %rax,(%rsp) 0x0000000000421c49 <+2089>: callq 0x418030 <runtime.throw> 0x0000000000421c4e <+2094>: mov 0x88(%rsp),%rsi 0x0000000000421c56 <+2102>: jmpq 0x421523 <runtime.gentraceback+259> 0x0000000000421c5b <+2107>: cmp $0x41b520,%rdx 0x0000000000421c62 <+2114>: jne 0x42147b <runtime.gentraceback+91> 0x0000000000421c68 <+2120>: mov 0x60(%rsi),%rax 0x0000000000421c6c <+2124>: cmp $0x0,%rax 0x0000000000421c70 <+2128>: je 0x42147b <runtime.gentraceback+91> 0x0000000000421c76 <+2134>: mov %rcx,0x38(%rsp) 0x0000000000421c7b <+2139>: mov %rdx,0x48(%rsp) 0x0000000000421c80 <+2144>: mov 0x60(%rsi),%rax 0x0000000000421c84 <+2148>: mov (%rax),%rdx 0x0000000000421c87 <+2151>: jmpq 0x42147b <runtime.gentraceback+91> End of assembler dump. And info registers: rax 0xa80 2688 rbx 0xc200018218 833223754264 rcx 0x7fa6fabfd710 140355148175120 rdx 0xc200018148 833223754056 rsi 0xc3a75c0700 840326448896 rdi 0x43e810 4450320 rbp 0x7fa6fabfe190 0x7fa6fabfe190 rsp 0xc2011cfaf0 0xc2011cfaf0 r8 0x0 0 r9 0x0 0 r10 0xc3a75c0700 840326448896 r11 0xa 10 r12 0x9 9 r13 0x9 9 r14 0x10 16 r15 0x72b0b0 7516336 rip 0x42174b 0x42174b <runtime.gentraceback+811> eflags 0x10246 [ PF ZF IF RF ] cs 0x33 51 ss 0x2b 43 ds 0x0 0 es 0x0 0 fs 0x0 0 gs 0x0 0 Thanks. |
Comment 27 by hb@hannobraun.de: Re #24:

> 1. Yup, that is strange. Can you please audit your code, and any dependencies for calls to os.Exit(0).

Nothing found with grep. I'll do a more thorough search and report back if I find something.

> Re 4. If you're not explicitly using cgo (it is unlikely you'd not know you were using it) then we can rule this out. Running ldd on the final binary will be the best test.

I'm 100% sure we're not using cgo in our own code. I've thoroughly searched the dependencies, so I'm sure about that as well. In fact, one of our dependencies (mymysql) was using cgo. We switched to another MySQL driver (https://github.com/go-sql-driver/mysql) and reconfirmed before reporting this issue. Here's the ldd output:

    linux-vdso.so.1 => (0x00007fffb354a000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fdc5eb80000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fdc5e7c1000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fdc5edb7000)

> 5. Yes, although it should be less of an issue if you are not using cgo to embed a foreign library.

Let's first see how the process fares with profiling removed (this is deployed as of now). I've added this to my list of things to try.

> 6. Pass CGO_ENABLED=0 when building Go.

I did this and it made no difference to the build. I think we can rule cgo out.

> 7. is a real concern. Can you please run your process with the flag, GOGCTRACE=1 and post the full output from startup to restart/crash. If your process reaches 16Gb, you will get a panic because the heap in 1.0.x had a maximum size of 16Gb.

As I said, the memory use is subject to ongoing optimization work and I expect to reduce it significantly very soon. The 16Gb limit shouldn't be a concern in this case though, since we're running on a tip version from mid-April (go version devel +0704ea89235b Tue Apr 16 09:08:06 2013 -0700 linux/amd64). Or am I mistaken about this? I'll see what I can do about GOGCTRACE=1 and report back.

> 8. Btw, you're not stripping your binaries by chance ?

No.
Comment 28 by alex@davz.net: Dave, regarding #21 and #23, let me try to answer your questions:

> 1. What are you using for process management? Is it possible there is an intermediary wrapper that is swallowing the error?

We used supervisord when we first saw this, but as Hanno mentioned, I simply ran it inside tmux and triggered the most recent crash, so I saw in my terminal that it just "ended" and exited 0. It's also worth noting that despite exiting 0, we got a core file, so I don't think os.Exit(0) is guilty.

> 7. How much memory does your RHEL6 host have available? What other processes is it sharing resources with? What is the usual RSS of your application when running? Does this fluctuate, or does it approach a steady state?

The server has 96G of RAM; looking at the monitoring graphs from yesterday, it never dropped below 50G in "free" RAM. However, I believe we may have some issues with kernel memory allocations; we have a very large number of connections to this box (the only other process using significant resources is haproxy, and that has a very small RSS and CPU load - just a huge number of open files, TCP connections, etc.). The RSS of the Go process does seem to increase rapidly to around 5G and then slowly increase over time. The longest run time we have seen, I think 5 days when running as root, got the overall RSS to the 10-15G mark (sorry, I do not recall the exact figure). I am not sure if the socket errors we saw in dmesg could help explain the trigger for this condition, or if there is some kernel memory pressure that we did not see any evidence of. The kernel is 2.6.32-279.el6.x86_64.
Comment 30 by hb@hannobraun.de: Re #29, no problem, just let me know where to send it (I only see a shortened version of your email address). The bt output (see report above for more gdb output):

    (gdb) bt
    #0  runtime.gentraceback (pc0=void, sp=void, gp=void, skip=void, pcbuf=void, max=void, fn=void, arg=void) at /usr/lib/go/src/pkg/runtime/traceback_x86.c:88
    #1  0x000000000041c743 in runtime.sigprof (pc=void, sp=void, lr=void, gp=void) at /usr/lib/go/src/pkg/runtime/proc.c:1790
    #2  0x000000000041e931 in runtime.sighandler (sig=void, info=void, ctxt=void, gp=void) at /usr/lib/go/src/pkg/runtime/signal_amd64.c:51
    #3  0x0000000000428144 in runtime.sigtramp () at /usr/lib/go/src/pkg/runtime/sys_linux_amd64.s:207
    #4  0x0000000000428160 in ?? () at /usr/lib/go/src/pkg/runtime/sys_linux_amd64.s:213
    #5  0x0000000000000000 in ?? ()
Summarizing the useful information from the disassembly and the traceback, the crash occurred on line 88 in traceback_x86.c:

    lr = *(uintptr*)fp;

while executing this instruction:

    0x000000000042174b <+811>: mov 0x0(%rbp),%rax

The value of %rbp, which corresponds to the local variable fp, was 0x7fa6fabfe190. Looking a bit farther back in the code, fp was set to sp, here in %rcx, value 0x7fa6fabfd710. We can conclude that f->frame == 0xa88. The value of the local variable pc is 0x43e810.

Question: you can use nm -n to see the symbol values in the executable sorted by number. What is the largest symbol whose value is <= 0x43e810? The traceback encountered a bad stack when trying to trace back through that symbol.
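As an aside, the nm -n lookup described above can also be scripted with the standard debug/elf package. The helper below is not part of the thread; the file name addr2sym.go and its command-line interface are invented for illustration, and it is written against the current standard library (sort.Slice needs Go 1.8 or later), not the Go 1.1-era toolchain discussed here.

    // addr2sym.go: print the largest symbol whose value is <= the given pc,
    // the same answer "nm -n" gives by inspection.
    package main

    import (
        "debug/elf"
        "fmt"
        "log"
        "os"
        "sort"
        "strconv"
    )

    func main() {
        if len(os.Args) != 3 {
            log.Fatalf("usage: %s <binary> <pc>", os.Args[0])
        }
        pc, err := strconv.ParseUint(os.Args[2], 0, 64) // base 0 accepts 0x... input
        if err != nil {
            log.Fatalf("bad pc %q: %v", os.Args[2], err)
        }
        f, err := elf.Open(os.Args[1])
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        syms, err := f.Symbols()
        if err != nil {
            log.Fatal(err)
        }
        // Sort by address and keep the last named symbol at or below pc.
        sort.Slice(syms, func(i, j int) bool { return syms[i].Value < syms[j].Value })
        var best *elf.Symbol
        for i := range syms {
            if syms[i].Value <= pc && syms[i].Name != "" {
                best = &syms[i]
            }
        }
        if best == nil {
            log.Fatalf("no symbol at or below %#x", pc)
        }
        fmt.Printf("%016x %s\n", best.Value, best.Name)
    }

Running it as "go run addr2sym.go ./yourbinary 0x43e810" should print the enclosing function, matching the answers in the following comments.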
Comment 32 by hb@hannobraun.de: I've sent Dave links to two core dumps: the one from the original bug report and the one from the most recent crash.

nm -n output (excerpt) for the older binary:

    000000000043e4d0 T fmt.(*pp).printValue
    000000000043e7f0 T fmt.(*pp).printReflectValue
    00000000004421c0 T fmt.(*pp).doPrintf

If you were looking at this core dump, your answer would be printReflectValue.

nm -n output (excerpt) for the binary corresponding to the most recent crash:

    000000000043dbd0 T fmt.(*pp).printField
    000000000043e6f0 T fmt.(*pp).printValue
    000000000043ea10 T fmt.(*pp).printReflectValue

That would be printValue.
Comment 33 by hb@hannobraun.de: Re #32, I've just realized you were basing your question on the information I posted here, not one of the core dumps I sent to Dave. Everything I posted here comes from the older core dump, which means the largest symbol with a value <= 0x43e810 is:

    000000000043e7f0 T fmt.(*pp).printReflectValue

Sorry for the confusion.
Thanks. That means that the pc value is 0x20 bytes into fmt.(*pp).printReflectValue. Looking at an executable here, that is immediately after the call to runtime.morestack01 in the prologue. printReflectValue in my executable has a frame size of 0xa88, which is consistent with what we see in yours. My working assumption is that the program received a signal for some reason. The runtime then tried to print the traceback for that signal. It crashed while trying to print the traceback. Were you ever able to reproduce the crash with the environment variable GOTRACEBACK set to 2? It would be very interesting to know whether the traceback managed to print anything at all.
Comment 36 by hb@hannobraun.de: Yes, GOTRACEBACK was set to 2 for the latest crash. This is the one I described in comment #20. No traceback was printed.
I think we must be hitting the first signal in the stack split code, and apparently the signal is happening at a time such that the stack backtrace does not work. Is there any chance you can make the crash happen while running the program under gdb? That would tell us where the first signal is happening.
Comment 38 by hb@hannobraun.de: I'll see what I can do about that. My colleague who's managing the infrastructure will have to decide. Oh, and I almost forgot, re #24: there don't seem to be any calls to os.Exit.
We are also getting SIGSEGV on Go 1.1-rc1 for our server (http://github.com/prometheus/prometheus), but *only* when running "go tool pprof" against it. Otherwise, it keeps on running stable. Note that we are using cgo to wrap LevelDB.

Initially, with Go 1.0.3, we got "runtime: signal received on thread not created by Go." when pprof-ing, so we found https://groups.google.com/forum/?fromgroups=#!searchin/golang-nuts/pprof/golang-nuts/6q8RVoXhg4g/1WQfLMKLJBkJ and, after upgrading to a newer version of Go (1.1-rc1) which contains the mentioned SIGPROF fix (https://golang.org/cl/6498057), the message about receiving a signal in a non-Go-spawned thread is gone, but the program still crashes with "Segmentation fault (core dumped)". I'm not sure if our issue is the same as this one, but it does sound similar, and there was mention above that the crashes might be triggered by profiling too.

The output of our server run and crash (with GOGCTRACE=1): https://gist.github.com/juliusv/d4baa7db4a1e8185614d

Attaching gdb to it before it crashed and then looking at the crash state: https://gist.github.com/juliusv/5df83f642809b4d97f1a
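The comment above doesn't show how that server exposes its profiles, so the following is only an assumed setup: the usual pattern is a blank import of net/http/pprof, which registers the /debug/pprof/ handlers, and the SIGPROF-driven CPU profile only runs while "go tool pprof" is collecting, which would match a crash that appears only during profiling. The localhost:6060 address is an arbitrary choice.

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // side effect: registers /debug/pprof/ handlers on http.DefaultServeMux
    )

    func main() {
        // Collect a CPU profile with: go tool pprof <binary> http://localhost:6060/debug/pprof/profile
        log.Fatal(http.ListenAndServe("localhost:6060", nil))
    }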
Oops, badsignal calls morestack regardless of textflag 7:

    // This runs on a foreign stack, without an m or a g. No stack split.
    #pragma textflag 7
    void runtime·badsignal(int32 sig)
      415150: 64 48 8b 0c 25 f0 ff   mov    %fs:0xfffffffffffffff0,%rcx
      415157: ff ff
      415159: 48 3b 21               cmp    (%rcx),%rsp
      41515c: 77 05                  ja     415163
      41515e: e8 1d fe 00 00         callq  424f80
      415163: 48 83 ec 18            sub    $0x18,%rsp
    {
        if (sig == SIGPROF) {

Probably this is due to:

    // Call runtime·findnull dynamically to circumvent static stack size check.
    static int32 (*findnull)(byte*) = runtime·findnull;
    runtime·write(2, runtime·sigtab[sig].name, findnull((byte*)runtime·sigtab[sig].name));

Owner changed to @dvyukov.
Please check whether https://golang.org/cl/9226043 fixes the issue for you. It's Linux-only.
Indeed, https://golang.org/cl/9226043 fixes the profiling segfault for us. Thank you very much!
This issue was closed by revision f322c78. Status changed to Fixed.