Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: clock_gettime on linux/amd64 make "gofmt -l src/pkg" segfault during GC #4402

Closed
mewmew opened this issue Nov 16, 2012 · 21 comments
Closed
Milestone

Comments

@mewmew
Copy link
Contributor

mewmew commented Nov 16, 2012

What steps will reproduce the problem?

[~/go]$ gofmt -l src/pkg
src/pkg/exp/locale/collate/colelem.go
Segmentation fault (core dumped)

[~/go]$ gofmt -l src/pkg
Segmentation fault (core dumped)

[~/go]$ gofmt -l src/pkg
Segmentation fault (core dumped)

[~/go]$ gofmt -l src/pkg
src/pkg/exp/locale/collate/colelem.go
src/pkg/runtime/zruntime_defs_linux_amd64.go


What is the expected output?
No segmentation fault.

What do you see instead?
Segmentation fault.

Which compiler are you using (5g, 6g, 8g, gccgo)?
6g

Which operating system are you using?
linux

Which version are you using?  (run 'go version')
go version devel +13d81e48dc90 Fri Nov 16 13:25:01 2012 -0800

Please provide any additional information below.
I would like to provide the core dump, but it is 276.3 MB. Any suggestions? I would be
glad to extract relevant information if you tell me what is needed.

Below follows some info from gdb (although the backtrace seems to be cut)

[~/go]$ gdb $GOROOT/bin/gofmt core 
(gdb) list
77          // make the trace look like a call to runtime·sigpanic instead.
78          // (Otherwise the trace will end at runtime·sigpanic and we
79          // won't get to see who faulted.)
80          if(r->rip != 0) {
81              sp = (uintptr*)r->rsp;
82              *--sp = r->rip;
83              r->rsp = (uintptr)sp;
84          }
85          r->rip = (uintptr)runtime·sigpanic;
86          return;
(gdb) bt
#0  0x0000000000419a5c in runtime.sighandler (sig=void, info=void, context=void,
gp=void) at /home/u/go/src/pkg/runtime/signal_linux_amd64.c:82
#1  0x0000000000420d54 in runtime.sigtramp () at
/home/u/go/src/pkg/runtime/sys_linux_amd64.s:184
#2  0x0000000000420d70 in ?? () at /home/u/go/src/pkg/runtime/sys_linux_amd64.s:190
#3  0x0000000000000001 in ?? ()
#4  0x0000000000000000 in ?? ()
(gdb)
@mewmew
Copy link
Contributor Author

mewmew commented Nov 16, 2012

Comment 1:

Some additional info from running "strace -f gofmt -l src/pkg":
...
[pid 10562] read(3, "", 637)            = 0
[pid 10562] close(3)                    = 0
[pid 10562] open("src/pkg/net/http/transport.go", O_RDONLY|O_CLOEXEC) = 3
[pid 10562] read(3, "// Copyright 2011 The Go Authors"..., 512) = 512
[pid 10562] read(3, ")\n\n// DefaultTransport is the de"..., 1024) = 1024
[pid 10562] read(3, "ies a function to return a proxy"..., 2048) = 2048
[pid 10562] read(3, "equest        // original reques"..., 4096) = 4096
[pid 10562] read(3, ", pconn)\n\tt.idleLk.Unlock()\n\tret"..., 8192) = 8192
[pid 10562] read(3, "rr := gzip.NewReader(resp.Body)\n"..., 16384) = 7210
[pid 10562] read(3, "", 9174)           = 0
[pid 10562] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x7feaa1558ff8}
---
[pid 10562] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x7feaa1558ff8}
---
[pid 10562] +++ killed by SIGSEGV +++
+++ killed by SIGSEGV +++
Segmentation fault
Info from a second run:
...
[pid 10603] read(3, "", 3124)           = 0
[pid 10603] close(3)                    = 0
[pid 10603] open("src/pkg/crypto/des/des_test.go", O_RDONLY|O_CLOEXEC) = 3
[pid 10603] read(3, "// Copyright 2011 The Go Authors"..., 512) = 512
[pid 10603] read(3, "\n\t\t[]byte{0x00, 0x00, 0x00, 0x00"..., 1024) = 1024
[pid 10603] read(3, "ce, 0x0d, 0xc9, 0x00, 0x65, 0x56"..., 2048) = 2048
[pid 10603] read(3, "0x0d, 0x73},\n\t\t[]byte{0x87, 0x87"..., 4096) = 4096
[pid 10603] read(3, " some custom tests for TripleDES"..., 8192) = 8192
[pid 10603] read(3, " 0xeb, 0x3b, 0x91, 0xd9, 0x07, 0"..., 16384) = 16384
[pid 10603] read(3, "x28, 0xbe, 0xbe, 0xf1, 0xcc}},\n\t"..., 32768) = 20022
[pid 10603] read(3, "", 12746)          = 0
[pid 10603] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x7f7e24b24ff8}
---
[pid 10603] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x7f7e24b24ff8}
---
[pid 10603] +++ killed by SIGSEGV +++
+++ killed by SIGSEGV +++
Segmentation fault
Info from a third run:
[pid 11359] close(3)                    = 0
[pid 11359] open("src/pkg/compress/flate/writer_test.go", O_RDONLY|O_CLOEXEC) = 3
[pid 11359] read(3, "// Copyright 2012 The Go Authors"..., 512) = 512
[pid 11359] read(3, " n)\n\tfor i := 0; i < n; i += len"..., 1024) = 1024
[pid 11359] read(3, "b *testing.B) { benchmarkEncoder"..., 2048) = 1033
[pid 11359] read(3, "", 1015)           = 0
[pid 11359] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x7f0d954b3ff8}
---
[pid 11359] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x7f0d954b3ff8}
---
[pid 11359] +++ killed by SIGSEGV +++
+++ killed by SIGSEGV +++
Segmentation fault

@griesemer
Copy link
Contributor

Comment 2:

Thanks for the report. I have never seen occasional segmentation faults w/ gofmt, which
makes me think perhaps there is something else going on here.
- Can you reproduce this sporadic behavior on any other machine?
- What else are you running on your machine?
- Is there enough memory?
Please provide as much detail as you can since I cannot reproduce this behavior at tip
at the moment. Thanks.

Owner changed to @griesemer.

Status changed to WaitingForReply.

@mewmew
Copy link
Contributor Author

mewmew commented Nov 17, 2012

Comment 3:

Hi.
I also think this issue is strange, since it doesn't behave deterministically. So far
I've been unable to reproduce it on my other laptop. I'll let it run over night and see
what happens.
On this computer however I can reproduce this easily. I tried to remove go and do a
fresh hg clone and ./all.bash, but it still segfaults.
Some info:
[~/go]$ free -m
             total       used       free     shared    buffers     cached
Mem:          3841       2821       1019          0        152       1603
-/+ buffers/cache:       1065       2775
Swap:            0          0          0
[~/go]$ uname -a
Linux x220 3.6.6-1-ARCH #1 SMP PREEMPT Mon Nov 5 11:57:22 CET 2012 x86_64 GNU/Linux
[~/go]$ gofmt -l src/pkg
src/pkg/exp/locale/collate/colelem.go
Segmentation fault
Anything else that could help with the trouble shooting?
The laptop on which this behavior is reproducible is a Lenovo x220. I tried the same
thing on a Lenovo x61s, but it only fails on the x220.

@davecheney
Copy link
Contributor

Comment 4:

Are the x220 and x61s running the same operating system ?
Is there anything informative in dmesg ?

@remyoudompheng
Copy link
Contributor

Comment 5:

Wow. I am surprised I am totally able to reproduce. I use Archlinux too.
$ strace -e write -f gofmt -l src/pkg
[pid  2284] --- SIGCONT {si_signo=SIGCONT, si_code=SI_USER, si_pid=1722, si_uid=1000} ---
[pid  2285] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x7f305ef8aff8}
---
[pid  2285] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x7f305ef8aff8}
---
[pid  2285] +++ killed by SIGSEGV (core dumped) +++
+++ killed by SIGSEGV (core dumped) +++
zsh: segmentation fault  strace -e write -f gofmt -l src/pkg
$ pmap 2285
2285:   gofmt -l src/pkg
0000000000400000   2260K r-x--  /opt/remy/go/bin/gofmt
0000000000635000     68K rw---  /opt/remy/go/bin/gofmt
0000000000646000 262268K rw---    [ anon ]
000000c1fffd0000   3264K rwx--    [ anon ]
00007f305ed0a000   3012K rwx--    [ anon ]
00007fff66974000    132K rw---    [ stack ]
00007fff669ce000      4K r-x--    [ anon ]
ffffffffff600000      4K r-x--    [ anon ]
 total           271012K
I don't know what this 3012k mapping is. Lets disable address space randomization:
Now the fault address is deterministic:
% setarch x86_64 -R strace -e write -f gofmt -l src/pkg
Process 2351 attached
[pid  2351] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x7ffff7f8dff8}
---
[pid  2351] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x7ffff7f8dff8}
---
[pid  2351] +++ killed by SIGSEGV (core dumped) +++
+++ killed by SIGSEGV (core dumped) +++
zsh: segmentation fault  setarch x86_64 -R strace -e write -f gofmt -l src/pkg
$ setarch x86_64 -R strace -e write -f gofmt -l src/pkg
Process 2357 attached
[pid  2357] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x7ffff7f8dff8}
---
[pid  2357] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x7ffff7f8dff8}
---
[pid  2357] +++ killed by SIGSEGV (core dumped) +++
+++ killed by SIGSEGV (core dumped) +++
zsh: segmentation fault  setarch x86_64 -R strace -e write -f gofmt -l src/pkg
and pmap:
0000000000400000   2260K r-x--  /opt/remy/go/bin/gofmt
0000000000635000     68K rw---  /opt/remy/go/bin/gofmt
0000000000646000 262268K rw---    [ anon ]
000000c1fffd0000   3264K rwx--    [ anon ]
00007ffff7d0d000   3012K rwx--    [ anon ]
00007ffff7ffe000      4K r-x--    [ anon ]  <- we're just at the boundary here
00007ffffffde000    132K rw---    [ stack ]
ffffffffff600000      4K r-x--    [ anon ]
 total           271012K
And what actually happens:
$ gdb --args gofmt -l src/pkg
GNU gdb (GDB) 7.5
Copyright (C) 2012 Free Software Foundation, Inc.
[blah...]
(gdb) handle SIGSEGV stop nopass print
Signal        Stop  Print   Pass to program Description
SIGSEGV       Yes   Yes No      Segmentation fault
(gdb) run
Starting program: /home/remy/bin/gofmt -l src/pkg
[New LWP 1958]
Program received signal SIGSEGV, Segmentation fault.
[Switching to LWP 1958]
0x00007ffff7ffe600 in ?? ()
(gdb) bt
#0  0x00007ffff7ffe600 in ?? ()
#1  0x00007ffff7ffe8ac in clock_gettime ()
#2  0x0000000000420c3b in time.now (sec=void, nsec=void) at
/opt/remy/go/src/pkg/runtime/sys_linux_amd64.s:110
#3  0x0000000000420c99 in runtime.nanotime () at
/opt/remy/go/src/pkg/runtime/sys_linux_amd64.s:129
#4  0x0000000000411f3c in sweepspan (idx=void) at /opt/remy/go/src/pkg/runtime/mgc0.c:710
#5  0x0000000000415554 in runtime.parfordo (desc=void) at
/opt/remy/go/src/pkg/runtime/parfor.c:101
#6  0x000000000041256a in runtime.gc (force=void) at
/opt/remy/go/src/pkg/runtime/mgc0.c:1053
#7  0x000000000041cfdb in runtime.mallocgc (size=void, flag=void, dogc=void, zeroed=void)
    at /opt/remy/go/src/pkg/runtime/malloc.goc:114
#8  0x000000000041e0af in runtime.mal (n=void) at
/opt/remy/go/src/pkg/runtime/malloc.goc:689
#9  0x000000000040af3d in hash_subtable_new (h=void, power=void, used=void) at
/opt/remy/go/src/pkg/runtime/hashmap.c:84
#10 0x000000000040b0c3 in hash_init (h=void, datasize=void, hint=void) at
/opt/remy/go/src/pkg/runtime/hashmap.c:128
#11 0x000000000040ca9e in runtime.makemap_c (typ=void, hint=void) at
/opt/remy/go/src/pkg/runtime/hashmap.c:780
#12 0x000000000040cb4f in runtime.makemap (typ=void, hint=void, ret=void) at
/opt/remy/go/src/pkg/runtime/hashmap.c:797
#13 0x000000000043d95c in go/ast.NewScope (outer=0x0, noname=void) at
/opt/remy/go/src/pkg/go/ast/scope.go:27
#14 0x000000000044739c in go/parser.(*parser).openLabelScope (p=0xc2002b4480)
    at /opt/remy/go/src/pkg/go/parser/parser.go:92
#15 0x000000000044d720 in go/parser.(*parser).parseBody (p=0xc2002b4480,
scope=0xc20047b060, noname=void)
    at /opt/remy/go/src/pkg/go/parser/parser.go:1007
#16 0x0000000000455b6d in go/parser.(*parser).parseFuncDecl (p=0xc2002b4480, noname=void)
    at /opt/remy/go/src/pkg/go/parser/parser.go:2230
#17 0x0000000000455e75 in go/parser.(*parser).parseDecl (p=0xc2002b4480, sync=
    {void (struct go/parser.parser *)} 0x7ffff7f8e680, noname=void) at /opt/remy/go/src/pkg/go/parser/parser.go:2274
#18 0x000000000045636d in go/parser.(*parser).parseFile (p=0xc2002b4480, noname=void)
    at /opt/remy/go/src/pkg/go/parser/parser.go:2329
#19 0x0000000000446e24 in go/parser.ParseFile (fset=0xc20006b280, filename=..., src=...,
mode=4, noname=void)
    at /opt/remy/go/src/pkg/go/parser/interface.go:92
#20 0x0000000000402564 in main.parse (fset=0xc20006b280, filename=..., src=...,
stdin=false, noname=void)
    at /opt/remy/go/src/cmd/gofmt/gofmt.go:259
#21 0x0000000000401095 in main.processFile (filename=..., in=..., out=..., stdin=false,
noname=void)
    at /opt/remy/go/src/cmd/gofmt/gofmt.go:101
#22 0x00000000004019a5 in main.visitFile (path=..., f=..., err=..., noname=void)
    at /opt/remy/go/src/cmd/gofmt/gofmt.go:160
#23 0x000000000047dded in path/filepath.walk (path=..., info=..., walkFn=
    {void (struct string, os.FileInfo, error, error *)} 0x7ffff7f8ebd8, noname=void)
    at /opt/remy/go/src/pkg/path/filepath/path.go:344
#24 0x000000000047e0ad in path/filepath.walk (path=..., info=..., walkFn=
    {void (struct string, os.FileInfo, error, error *)} 0x7ffff7f8ec98, noname=void)
    at /opt/remy/go/src/pkg/path/filepath/path.go:362
#25 0x000000000047e0ad in path/filepath.walk (path=..., info=..., walkFn=
    {void (struct string, os.FileInfo, error, error *)} 0x7ffff7f8ed58, noname=void)
    at /opt/remy/go/src/pkg/path/filepath/path.go:362
#26 0x000000000047e0ad in path/filepath.walk (path=..., info=..., walkFn=
    {void (struct string, os.FileInfo, error, error *)} 0x7ffff7f8ee18, noname=void)
    at /opt/remy/go/src/pkg/path/filepath/path.go:362
---Type <return> to continue, or q <return> to quit---apropos 
#27 0x000000000047e24f in path/filepath.Walk (root=..., walkFn=
    {void (struct string, os.FileInfo, error, error *)} 0x7ffff7f8ee50, noname=void)
    at /opt/remy/go/src/pkg/path/filepath/path.go:382
#28 0x0000000000401a59 in main.walkDir (path=...) at
/opt/remy/go/src/cmd/gofmt/gofmt.go:169
#29 0x000000000040200e in main.gofmtMain () at /opt/remy/go/src/cmd/gofmt/gofmt.go:218
#30 0x0000000000401a7c in main.main () at /opt/remy/go/src/cmd/gofmt/gofmt.go:176
The segfault is here:
   ...
   0x00007ffff7ffe867 <+567>: mov    %eax,%eax
   0x00007ffff7ffe869 <+569>: sub    0xffffffffff5ff090,%rax
   0x00007ffff7ffe871 <+577>: mov    0xffffffffff5ff0a0,%edx
   0x00007ffff7ffe878 <+584>: mov    0xffffffffff5ff0a4,%ecx
   0x00007ffff7ffe87f <+591>: and    0xffffffffff5ff098,%rax
   0x00007ffff7ffe887 <+599>: imul   %rdx,%rax
   0x00007ffff7ffe88b <+603>: sar    %cl,%rax
   0x00007ffff7ffe88e <+606>: jmpq   0x7ffff7ffe7d2 <clock_gettime+418>
   0x00007ffff7ffe893 <+611>: nopl   0x0(%rax,%rax,1)
   0x00007ffff7ffe898 <+616>: xor    %edx,%edx
   0x00007ffff7ffe89a <+618>: jmpq   0x7ffff7ffe761 <clock_gettime+305>
   0x00007ffff7ffe89f <+623>: nop
   0x00007ffff7ffe8a0 <+624>: mov    %rsi,-0x20(%rbp)
   0x00007ffff7ffe8a4 <+628>: mov    %edi,-0x18(%rbp)
   0x00007ffff7ffe8a7 <+631>: callq  0x7ffff7ffe600
=> 0x00007ffff7ffe8ac <+636>:  mov    -0x20(%rbp),%rsi
   0x00007ffff7ffe8b0 <+640>: mov    -0x18(%rbp),%edi
   0x00007ffff7ffe8b3 <+643>: jmp    0x7ffff7ffe869 <clock_gettime+569>
   0x00007ffff7ffe8b5 <+645>: nopl   (%rax)
   0x00007ffff7ffe8b8 <+648>: mov    0xffffffffff5fe0f0,%eax
   0x00007ffff7ffe8bf <+655>: mov    %eax,%eax
   0x00007ffff7ffe8c1 <+657>: jmpq   0x7ffff7ffe833 <clock_gettime+515>
   0x00007ffff7ffe8c6 <+662>: pause  
   0x00007ffff7ffe8c8 <+664>: jmpq   0x7ffff7ffe788 <clock_gettime+344>
Which I think corresponds to this in the kernel.
notrace static inline long vgetns(void)
{
  long v;
  cycles_t cycles;
  if (gtod->clock.vclock_mode == VCLOCK_TSC)
    cycles = vread_tsc();
  else if (gtod->clock.vclock_mode == VCLOCK_HPET)
    cycles = vread_hpet();
  else
    return 0;
  v = (cycles - gtod->clock.cycle_last) & gtod->clock.mask;
  return (v * gtod->clock.mult) >> gtod->clock.shift;
}
At least, bisecting tells me the vDSO thing introduces the regression:
changeset 14848:42c8d3aadc40: bad
The first bad revision is:
changeset:   14848:42c8d3aadc40
user:        Shenghou Ma <minux.ma@gmail.com>
date:        Fri Nov 09 14:19:07 2012 +0800
summary:     runtime: use vDSO clock_gettime for time.now & runtime.nanotime on
Linux/amd64
I don't understand why it doesn't work.

@remyoudompheng
Copy link
Contributor

Comment 6:

Ah, the problem is in vread_tsc: the push %rbp cannot happen because the stack pointer
is already at the boundary (0x7ffff7f8e000) hence the fault at 0x7ffff7f8dff8
I wonder it it's a kernel bug, gcc bug or Archlinux bug.
Program received signal SIGSEGV, Segmentation fault.
[Switching to LWP 13668]
0x00007ffff7ffe600 in ?? ()
(gdb) info reg
rax            0x2a7b9454   712741972
rbx            0x1  1
rcx            0x28b    651
rdx            0x1  1
rsi            0x7ffff7f8e040   140737353670720
rdi            0x0  0
rbp            0x7ffff7f8e028   0x7ffff7f8e028
rsp            0x7ffff7f8e000   0x7ffff7f8e000
r8             0xc20054e000 833229217792
r9             0x0  0
r10            0xc1fffab1f8 833223307768
r11            0x28 40
r12            0xb7649ce    192301518
r13            0x1  1
r14            0x4e 78
r15            0xc20008fac0 833224243904
rip            0x7ffff7ffe600   0x7ffff7ffe600
eflags         0x10246  [ PF ZF IF RF ]
cs             0x33 51
ss             0x2b 43
ds             0x0  0
es             0x0  0
fs             0x0  0
gs             0x0  0
(gdb) disas 0x00007ffff7ffe600,0x00007ffff7ffe6400
Dump of assembler code from 0x7ffff7ffe600 to 0x7ffff7ffe6400:
=> 0x00007ffff7ffe600:  push   %rbp
   0x00007ffff7ffe601:  mov    %rsp,%rbp
   0x00007ffff7ffe604:  nopl   (%rax)
   0x00007ffff7ffe607:  lfence 
   0x00007ffff7ffe60a:  rdtsc  
   0x00007ffff7ffe60c:  shl    $0x20,%rdx
   0x00007ffff7ffe610:  mov    %eax,%eax
   0x00007ffff7ffe612:  or     %rax,%rdx
   0x00007ffff7ffe615:  mov    0xffffffffff5ff090,%rax
   0x00007ffff7ffe61d:  cmp    %rdx,%rax
   0x00007ffff7ffe620:  ja     0x7ffff7ffe627
   0x00007ffff7ffe622:  mov    %rdx,%rax
   0x00007ffff7ffe625:  pop    %rbp
   0x00007ffff7ffe626:  retq   
   0x00007ffff7ffe627:  mov    %rax,%rdx
   0x00007ffff7ffe62a:  jmp    0x7ffff7ffe622
   0x00007ffff7ffe62c:  nopl   0x0(%rax)
minux, do you have an idea?

@remyoudompheng
Copy link
Contributor

Comment 7:

Status changed to Accepted.

@shivakumargn
Copy link
Contributor

Comment 8:

Tested this on ubuntu and CentOS.
Consistently reproducible on CentOS. Does not occur on ubuntu.

@remyoudompheng
Copy link
Contributor

Comment 9:

Adding a dummy 'import _ "net"'
in cmd/gofmt to make the binary dynamically linked makes the segfaults disappear on my
system.

@shivakumargn
Copy link
Contributor

Comment 10:

Problem persisted after 'import _ "net"' as well on my system.
Additionally setarch as below seems to solve the problem:
> setarch x86_64 --addr-no-randomize gofmt -l src/pkg

@minux
Copy link
Member

minux commented Nov 17, 2012

Comment 11:

i think the cause is that we are calling a gcc compiled function (vDSO func) on
our segmented stack.
the only solution i can think of is to increase the required frame size for
runtime.nanotime and runtime.now and adjust SP before the indirect call.
however, we need to know how much stack space is necessary for the call.
the surest fix is to call gettime_clock on the OS stack, but that will degrade
performance a lot.
I think this could be regarded as a kernel bug, the function should not make
any assumption on the available stack space just like a normal syscall.
However, because there are kernels in the wild with this problem, maybe we
need to take precaution in all vDSO function call sites.

@remyoudompheng
Copy link
Contributor

Comment 12:

I don't manage to produce a deterministic reproducer. How can we determine it is indeed
the cause, do you have a patch idea?

@minux
Copy link
Member

minux commented Nov 17, 2012

Comment 13:

can we use something like src/pkg/runtime/stack_test.go to comprehensively
explore all available stack sizes for time.Now()?
on Linux we have at least 128 extra bytes below StackGuard, so maybe my analysis
was wrong.

@remyoudompheng
Copy link
Contributor

Comment 14:

I already did something similar this morning using the following template:
func BenchmarkNow2400(b *testing.B) {
    var s [2400]byte
    for i := 0; i < b.N; i++ {
        time.Now()
    }
    _ = s
}
I see:
BenchmarkNow3520     2000000            38.8 ns/op
BenchmarkNow3536     2000000            38.8 ns/op
BenchmarkNow3552     2000000            38.8 ns/op
BenchmarkNow3568     2000000            38.8 ns/op
BenchmarkNow3584      500000           166 ns/op
BenchmarkNow3600      500000           166 ns/op
BenchmarkNow3616      500000           166 ns/op
BenchmarkNow3632      500000           166 ns/op
BenchmarkNow3648      500000           166 ns/op
BenchmarkNow3664      500000           166 ns/op
BenchmarkNow3680      500000           166 ns/op
BenchmarkNow3696      500000           166 ns/op
BenchmarkNow3712      500000           166 ns/op
BenchmarkNow3728      500000           166 ns/op
BenchmarkNow3744      500000           166 ns/op
BenchmarkNow3760      500000           166 ns/op
BenchmarkNow3776      500000           166 ns/op
BenchmarkNow3792      500000           166 ns/op
BenchmarkNow3808      500000           166 ns/op
BenchmarkNow3824      500000           166 ns/op
BenchmarkNow3840     2000000            38.8 ns/op
BenchmarkNow3856     2000000            38.8 ns/op
BenchmarkNow3872     2000000            38.8 ns/op
BenchmarkNow3888     2000000            38.8 ns/op
But no crash.

@minux
Copy link
Member

minux commented Nov 17, 2012

Comment 15:

could you please try this very conservative patch:
diff -r f5b22a4df675 src/pkg/runtime/sys_linux_amd64.s
--- a/src/pkg/runtime/sys_linux_amd64.s Sat Nov 17 11:16:07 2012 -0800
+++ b/src/pkg/runtime/sys_linux_amd64.s Sun Nov 18 05:10:46 2012 +0800
@@ -100,16 +100,19 @@
        SYSCALL
        RET
 
+frameSize = 120 // also change the frame size below
 // func now() (sec int64, nsec int32)
-TEXT time·now(SB), 7, $32
+TEXT time·now(SB), 0, $120 // 32
        MOVQ    runtime·__vdso_clock_gettime_sym(SB), AX
        CMPQ    AX, $0
        JEQ     fallback_gtod
        MOVL    $0, DI // CLOCK_REALTIME
+       ADDQ    $(frameSize-32), SP
        LEAQ    8(SP), SI
        CALL    AX
        MOVQ    8(SP), AX       // sec
        MOVQ    16(SP), DX      // nsec
+       SUBQ    $(frameSize-32), SP
        MOVQ    AX, sec+0(FP)
        MOVL    DX, nsec+8(FP)
        RET
maybe you need to tweak value of frameSize.
sorry, i couldn't reproduce the problem on gentoo with kernel 3.5.2 and
i couldn't update the kernel now.

@minux
Copy link
Member

minux commented Nov 17, 2012

Comment 16:

oh, i think i know why the crash is mostly likely to occur during GC.
at the entry to runtime.nanotime, there could be as little as 128B stack available,
after runtime.nanotime uses 32+8B, and time.now uses 32+8B, there is only 48B
remaining for the clock_gettime call, which might not be enough.
You can test this hypothesis by make a copy of runtime.nanotime, export it
and call it in your benchmark tests. if this is the case, we only need to reduce
the frame size of runtime.nanotime to 16 (or just adapt the code of time.now
to runtime.nanotime).

@minux
Copy link
Member

minux commented Nov 17, 2012

Comment 17:

if my analysis is right, https://golang.org/cl/6842063/ should fix this issue.
please help test it. thanks.

Status changed to WaitingForReply.

@remyoudompheng
Copy link
Contributor

Comment 18:

Ah, it works indeed. Your patch corrects the segfaults and I can reproduce the problem
as you say:
% cat time.s
TEXT ·Nanotime(SB),7,$0
        JMP runtime·nanotime(SB)
% cat time_test.go
package t
import "testing"
func Nanotime() int64
func BenchmarkNow2048(b *testing.B) {
      var s [2048]byte
      for i := 0; i < b.N; i++ {
            Nanotime()
      }
      _ = s
}
[...]
% go test -v -bench . -benchtime 40ms
BenchmarkNow3504     2000000            38.8 ns/op
BenchmarkNow3520     2000000            38.9 ns/op
BenchmarkNow3536     2000000            38.8 ns/op
BenchmarkNow3552     2000000            38.8 ns/op
BenchmarkNow3568     2000000            38.8 ns/op
BenchmarkNow3584     1000000            42.7 ns/op
BenchmarkNow3600     1000000            41.8 ns/op
BenchmarkNow3616     1000000            47.0 ns/op
BenchmarkNow3632     1000000            42.7 ns/op
BenchmarkNow3648     1000000            41.8 ns/op
BenchmarkNow3664     1000000            44.0 ns/op
BenchmarkNow3680     1000000            40.1 ns/op
BenchmarkNow3696     2000000            38.8 ns/op
BenchmarkNow3712     2000000            38.8 ns/op
BenchmarkNow3728    signal 11
So there is a fault where a slowdown due to stack split should have been seen.

@minux
Copy link
Member

minux commented Nov 21, 2012

Comment 19:

https://golang.org/cl/6842063/

Labels changed: added priority-soon, removed priority-triage.

Owner changed to @minux.

Status changed to Started.

@minux
Copy link
Member

minux commented Nov 26, 2012

Comment 20:

This issue was closed by revision 2637777.

Status changed to Fixed.

@mewmew
Copy link
Contributor Author

mewmew commented Nov 27, 2012

Comment 21:

I'm now able to do a 1000 consecutive runs of "gofmt -l src/pkg" without a segfault.
Thank you for taking your time and investigating this issue!

@rsc rsc added this to the Go1.1 milestone Apr 14, 2015
@rsc rsc removed the go1.1 label Apr 14, 2015
@golang golang locked and limited conversation to collaborators Jun 24, 2016
This issue was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

8 participants