Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sync/atomic: generalCAS64 broken #6440

Closed
quarnster opened this issue Sep 21, 2013 · 20 comments
Closed

sync/atomic: generalCAS64 broken #6440

quarnster opened this issue Sep 21, 2013 · 20 comments
Milestone

Comments

@quarnster
Copy link

Before filing a bug, please check whether it has been fixed since the
latest release. Search the issue tracker and check that you're running the
latest version of Go:

Run "go version" and compare against
http://golang.org/doc/devel/release.html  If a newer version of Go exists,
install it and retry what you did to reproduce the problem.

Thanks.

What steps will reproduce the problem?
If possible, include a link to a program on play.golang.org.
1. Save http://play.golang.org/p/lqkYmdbrx2 as hellonet.go
2. GOARM=5 CGO_ENABLED=0 go build hellonet.go (cross compilation or compiling directly
on target makes no difference)
4. ./hellonet
5. Connect to the target's tcpsocket 6060 any way you want

What is the expected output?

Something like an immediate print of "2013/09/21 12:10:11 &{{0x102500f0}}"
and hellonet exiting as soon as a client connects.

What do you see instead?

The client connects to the socket and absolutely nothing happens. I've gotten SIGSEGVs a
couple of times but don't know how long that took as I left it running while I did
something else:

SIGSEGV: segmentation violation
PC=0x20870

scaninterfacedata(0x2, 0x40050f60, 0x1)
    /Users/quarnster/code/3rdparty/go/src/pkg/runtime/mgc0.c:1351 +0x48
scanbitvector(0x40050f60, 0x131070, 0x198901)
    /Users/quarnster/code/3rdparty/go/src/pkg/runtime/mgc0.c:1384 +0x118
addframeroots(0xbe910a60, 0x0)
    /Users/quarnster/code/3rdparty/go/src/pkg/runtime/mgc0.c:1420 +0x1c0
runtime.gentraceback(0x1d034, 0x40050df4, 0x16708, 0x105010a0, 0x0, ...)
    /Users/quarnster/code/3rdparty/go/src/pkg/runtime/traceback_arm.c:132 +0x408
addstackroots(0x105010a0)
    /Users/quarnster/code/3rdparty/go/src/pkg/runtime/mgc0.c:1475 +0xcc
addroots()
    /Users/quarnster/code/3rdparty/go/src/pkg/runtime/mgc0.c:1556 +0x1f8
gc(0x40161fb0)
    /Users/quarnster/code/3rdparty/go/src/pkg/runtime/mgc0.c:2098 +0x19c
mgc(0x10501820)
    /Users/quarnster/code/3rdparty/go/src/pkg/runtime/mgc0.c:2049 +0x30
runtime.mcall(0x1e6e90)
    /Users/quarnster/code/3rdparty/go/src/pkg/runtime/asm_arm.s:165 +0x38

goroutine 3 [garbage collection]:
runtime.gc(0x1)
    /Users/quarnster/code/3rdparty/go/src/pkg/runtime/mgc0.c:2020 +0x1a0 fp=0x40161fb8
forcegchelper(0x4005ff9c)
    /Users/quarnster/code/3rdparty/go/src/pkg/runtime/mheap.c:385 +0x2c fp=0x40161fc0
runtime.goexit()
    /Users/quarnster/code/3rdparty/go/src/pkg/runtime/proc.c:1386 fp=0x40161fc0
created by runtime.MHeap_Scavenger
    /Users/quarnster/code/3rdparty/go/src/pkg/runtime/mheap.c:439

goroutine 1 [runnable]:
sync/atomic.loadUint64(0x1054b080, 0x0, 0x0)
    /Users/quarnster/code/3rdparty/go/src/pkg/sync/atomic/64bit_arm.go:10 +0x84
net.(*fdMutex).RWLock(0x1054b080, 0x1e6901, 0x4000d100)
    /Users/quarnster/code/3rdparty/go/src/pkg/net/fd_mutex.go:121 +0xb8
net.(*netFD).readLock(0x1054b080, 0x0, 0x0)
    /Users/quarnster/code/3rdparty/go/src/pkg/net/fd_unix.go:122 +0x4c
net.(*netFD).accept(0x1054b080, 0x12bdc4, 0x0, 0x0, 0x0)
    /Users/quarnster/code/3rdparty/go/src/pkg/net/fd_unix.go:368 +0x6c
net.(*TCPListener).AcceptTCP(0x10500170, 0x10543008, 0x0, 0x0)
    /Users/quarnster/code/3rdparty/go/src/pkg/net/tcpsock_posix.go:233 +0x70
net.(*TCPListener).Accept(0x10500170, 0x0, 0x0, 0x0, 0x0)
    /Users/quarnster/code/3rdparty/go/src/pkg/net/tcpsock_posix.go:243 +0x4c
main.main()
    /Users/quarnster/code/go/hellonet.go:14 +0x154

trap    0xe
error   0x5
oldmask 0x0
r0      0xbe9109d0
r1      0xe59fb030
r2      0x412
r3      0x40cac
r4      0x40050f60
r5      0x1
r6      0x412
r7      0x28
r8      0x16708
r9      0x1e89c0
r10     0x1e6e78
fp      0x1e6e6c
ip      0x1d034
sp      0xbe9109cc
lr      0x209fc
pc      0x20870
cpsr    0x60000010
fault   0xe59fb030


Which compiler are you using (5g, 6g, 8g, gccgo)?

5g, either on the native linux/armv5tel or cross building.

Which operating system are you using?

Linux/armv5tel 2.6.32.12 synology_88f6281_411j

Which version are you using?  (run 'go version')

go version devel +125bf33c1905 Sat Sep 21 17:53:44 2013 +1000 linux/arm.

The same test works just fine with hg tag go1.1.2.
@davecheney
Copy link
Contributor

Comment 1:

Can you check the values of /proc/cpu/aligment and see if it increments if your program
fails.

Status changed to WaitingForReply.

@quarnster
Copy link
Author

Comment 2:

The numbers are continually increasing without the program running so I don't know how
useful any info received from it would be. It's unclear to me what you mean with "If
your program fails" as it fails always and immediately with tip.
It's supposed to print out the connection and exit as soon as a client connects, but it
doesn't. Nothing is printed on the screen and the connection remains open and thus it
has immediately failed.
I can let it run until it SIGSEVs some point in the future, but again as the numbers in
/proc/cpu/alignment are continually incrementing without the program running I'm not
sure what info you hope to gain from that. Please clarify.

@quarnster
Copy link
Author

Comment 3:

Performing a hg revision binary search to see if I can narrow down the exact point in
time it stopped working.

@quarnster
Copy link
Author

Comment 4:

TLDR:
diff -r 125bf33c1905 src/pkg/sync/atomic/asm_linux_arm.s
--- a/src/pkg/sync/atomic/asm_linux_arm.s   Sat Sep 21 17:53:44 2013 +1000
+++ b/src/pkg/sync/atomic/asm_linux_arm.s   Sat Sep 21 19:13:42 2013 +0200
@@ -129,12 +129,14 @@
    BEQ     2(PC)
    MOVW    R1, (R1)
    MOVW    R0, 4(R13)
-   MOVW    $4(FP), R1 // oldval
+   MOVW    oldlo+4(FP), R1
    MOVW    R1, 8(R13)
+   MOVW    oldhi+8(FP), R1
+   MOVW    R1, 12(R13)
    MOVW    newlo+12(FP), R2
-   MOVW    R2, 12(R13)
+   MOVW    R2, 16(R13)
    MOVW    newhi+16(FP), R3
-   MOVW    R3, 16(R13)
+   MOVW    R3, 20(R13)
    BL      runtime·cas64(SB)
    MOVW    R0, 20(FP)
    RET
Without it "go test sync/atomic" (and probably much else using atomic operations)
freezes on tip, and with the patch "go test sync/atomic" does not freeze until it hits
TestNilDeref's use of CompareAndSwapUint64. Someone who knows the expected behaviour
better would have to take a closer look at fixing that one.
Details:
r17651 is the offending changelist for the original issue reported which isn't too
helpful as that's the revision where fd_mutex.go was first introduced. Rather than just
a freeze we immediately get this though:
runtime: unknown argument frame size for generalCAS64 called from 0x9b400
[sync/atomic.loadUint64]
fatal error: invalid stack
goroutine 1 [garbage collection]:
runtime.gc(0x0)
    /Users/quarnster/code/3rdparty/go/src/pkg/runtime/mgc0.c:2052 +0x1c4 fp=0x40050e74
runtime.mallocgc(0x8, 0x0, 0x0)
    /Users/quarnster/code/3rdparty/go/src/pkg/runtime/malloc.goc:142 +0x22c fp=0x40050eb0
runtime.mal(0x8)
    /Users/quarnster/code/3rdparty/go/src/pkg/runtime/malloc.goc:698 +0x38 fp=0x40050ec0
copyin(0xd17e0, 0x40050f00, 0x40050f0c)
    /Users/quarnster/code/3rdparty/go/src/pkg/runtime/iface.c:152 +0x6c fp=0x40050edc
runtime.convT2E(0xd17e0, 0xf98a8, 0xd, 0xd17e0, 0x4000d188)
    /Users/quarnster/code/3rdparty/go/src/pkg/runtime/iface.c:220 +0x44 fp=0x40050ef8
main.main()
    /Users/quarnster/code/go/hellonet.go:27 +0x204 fp=0x40050f90
runtime.main()
    /Users/quarnster/code/3rdparty/go/src/pkg/runtime/proc.c:200 +0xf4 fp=0x40050fc4
runtime.goexit()
    /Users/quarnster/code/3rdparty/go/src/pkg/runtime/proc.c:1364 fp=0x40050fc4
goroutine 3 [runnable]:
runtime: unknown argument frame size for generalCAS64 called from 0x9b400
[sync/atomic.loadUint64]
sync/atomic.loadUint64(0x105230c0, 0x0, 0x0)
    /Users/quarnster/code/3rdparty/go/src/pkg/sync/atomic/64bit_arm.go:10 +0x6c
net.(*fdMutex).RWLock(0x105230c0, 0x1, 0x3fae8)
    /Users/quarnster/code/3rdparty/go/src/pkg/net/fd_mutex.go:121 +0x94
net.(*netFD).readLock(0x105230c0, 0x21, 0x40)
    /Users/quarnster/code/3rdparty/go/src/pkg/net/fd_unix.go:128 +0x34
net.(*netFD).accept(0x105230c0, 0x114b48, 0x0, 0x0, 0x0)
    /Users/quarnster/code/3rdparty/go/src/pkg/net/fd_unix.go:374 +0x40
net.(*TCPListener).AcceptTCP(0x10500168, 0x10525090, 0xd, 0x0)
    /Users/quarnster/code/3rdparty/go/src/pkg/net/tcpsock_posix.go:240 +0x50
net.(*TCPListener).Accept(0x10500168, 0x1, 0x1, 0xd17e0, 0x10500178)
    /Users/quarnster/code/3rdparty/go/src/pkg/net/tcpsock_posix.go:250 +0x2c
main.func·001()
    /Users/quarnster/code/go/hellonet.go:17 +0xf0
created by main.main
    /Users/quarnster/code/go/hellonet.go:24 +0x198
That issue was then fixed in r17687, and from thereon it just spins forever in
64bit_arm.go:loadUint64 as CompareAndSwapUint64 always returns false.
Creating a small test for CompareAndSwapUint64 I can confirm that it's broken, or more
specifically generalCAS64 is broken, which is only used if __kuser_helper_version is
< 5. (On the arm machine running this test on it appears to be 3). 
Apparently I can't change the title myself, but sync/atomic would be the correct package
and I don't believe it's a regression, or rather the regression in net is just because
net now uses a feature that might not ever have worked on this particular arm setup.

@minux
Copy link
Member

minux commented Sep 21, 2013

Comment 5:

Thank you for your excellent diagnosis!

Labels changed: added priority-soon, go1.2, removed priority-triage.

Owner changed to @minux.

Status changed to Accepted.

@minux
Copy link
Member

minux commented Sep 23, 2013

Comment 6:

Issue #6462 has been merged into this issue.

@dvyukov
Copy link
Member

dvyukov commented Sep 23, 2013

Comment 7:

Issue #6462 has been merged into this issue.

@minux
Copy link
Member

minux commented Sep 23, 2013

Comment 8:

Issue #6462 has been merged into this issue.

@rsc
Copy link
Contributor

rsc commented Sep 24, 2013

Comment 9:

The problem is that I changed the prototype for runtime.cas64 without realizing it was
called from here. Does https://golang.org/cl/13859043 fix the problem?

@quarnster
Copy link
Author

Comment 10:

Still hanging in TestNilDeref's use of CompareAndSwapUint64. The rest of the tests in
sync/atomic pass.

@rsc
Copy link
Contributor

rsc commented Sep 24, 2013

Comment 11:

Oh, I see. I was seeing something in the assembly that wasn't actually
there.
Please try CL 13859043 again. I've added a few more lines.

@quarnster
Copy link
Author

Comment 12:

Indeed, now the full sync/atomic test suite passes :)
Cheers!

@josharian
Copy link
Contributor

Comment 13:

CL 13859043 now fixes sync/atomic tests and network requests for me as well. Thanks!

@josharian
Copy link
Contributor

Comment 14:

Thinking about the future...is there a buildbot for this linux/arm combo?

@davecheney
Copy link
Contributor

Comment 15:

We have many, including 2 armv5 models.

@josharian
Copy link
Contributor

Comment 16:

Hmmm...I guess the obvious question, then, is why they didn't catch this, since it
caused a standard unit test to fail.

@davecheney
Copy link
Contributor

Comment 17:

That sir, is a very good question. My armv5 host runs Debian Sid, kernel 3.10, I don't
know about the other armv5 host, but I got the feeling that it was running a very old
kernel as I remember some teething problems getting it online.

@rsc
Copy link
Contributor

rsc commented Sep 24, 2013

Comment 18:

I think the problem is that we try pretty hard not to need to fall back to
generalCAS64, and we do a good job of it. Instead of having yet another arm
builder, I think it is okay just to test generalCAS64.
I added a test to https://golang.org/cl/13859043 but I am not near
an ARM machine. Can someone try it?
Thanks.

@quarnster
Copy link
Author

Comment 19:

Works for me.

@rsc
Copy link
Contributor

rsc commented Sep 24, 2013

Comment 20:

Fixed by 351b6fe0ae36.

Status changed to Fixed.

@rsc rsc added this to the Go1.2 milestone Apr 14, 2015
@rsc rsc removed the go1.2 label Apr 14, 2015
@golang golang locked and limited conversation to collaborators Jun 25, 2016
This issue was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

7 participants