sync: intermittent crash locking Mutex -- unexpected fault address #8703

Closed
bmatsuo opened this issue Sep 11, 2014 · 17 comments
Comments

@bmatsuo

bmatsuo commented Sep 11, 2014

What does 'go version' print?

go version go1.3.1 linux/amd64

What steps reproduce the problem?

We cannot yet reliably reproduce this problem. It happens in large clusters very
intermittently. The context of the crash is launching goroutines while iterating a slice
in a for-select loop.

func (tab *ProcessTable) loop() {
    for {
        select {
        // ...
        case set := <-tab.persist:
            for _, p := range tab.ps {
                p := p
                go p.DatasetPersisted(set)
            }
        }
        // ...
    }
}

The method called is attempting to lock a mutex.

func (p *Process) DatasetPersisted(outset vaultdata.Dataset) {
    if p == nil {
        return
    }  
    p.chwrmut.Lock()
    defer p.chwrmut.Unlock()
    if !p.closed {
        p.setPersistStatus <- setPersistStatus{outset, nil}
    }  
}

What happened?

The system crashes with the following error. I've only included the top stack as there
are several hundred active goroutines at the time this occurs. There is a
runtime.goexit() call near the bottom that is confusing.

unexpected fault address 0x0
fatal error: fault
[signal 0x7 code=0x80 addr=0x0 pc=0x60f9dd]

goroutine 10270 [running]:
runtime.throw(0xae7a56)
        /usr/local/go/src/pkg/runtime/panic.c:520 +0x69 fp=0x7ff97dcc5ee8 sp=0x7ff97dcc5ed0
runtime.sigpanic()
        /usr/local/go/src/pkg/runtime/os_linux.c:232 +0x25b fp=0x7ff97dcc5f00 sp=0x7ff97dcc5ee8
sync/atomic.CompareAndSwapUint32(0x7561772f746e6d9f, 0x100000000)
        /usr/local/go/src/pkg/sync/atomic/asm_amd64.s:42 +0xd fp=0x7ff97dcc5f08 sp=0x7ff97dcc5f00
sync.(*Mutex).Lock(0x7561772f746e6d9f)
        /usr/local/go/src/pkg/sync/mutex.go:43 +0x42 fp=0x7ff97dcc5f30 sp=0x7ff97dcc5f08
vault/vaultproc.(*Process).DatasetPersisted(0x7561772f746e6d2f, 0x6f2f646f72702e73,
0x393034312f736a62, 0x2d343034302d3131, 0x39382f61382f3435)
        /tmp/wault-build-356012266/git/src/vault/vaultproc/proc.go:233 +0x4a fp=0x7ff97dcc5f80 sp=0x7ff97dcc5f30
runtime.goexit()
        /usr/local/go/src/pkg/runtime/proc.c:1445 fp=0x7ff97dcc5f88 sp=0x7ff97dcc5f80
created by vault/vaultproc.(*ProcessTable).loop
        /tmp/wault-build-356012266/git/src/vault/vaultproc/table.go:255 +0x3c3

What should have happened instead?

No fatal error should occur there.

Please provide any additional information below.

We have seen this occur approximately once every several weeks (this is the first
occurrence on go1.3.1).

We saw this error in go1.3 as well. The codebase is young and ran on go1.2.1 for only a
very short time. So while we didn't see the problem present itself then, given how
intermittent the crash is, I can't say with certainty that it wouldn't have happened.

Also, we have run into an occasional problem where the go process reaches the limit on
open file descriptors. I have no idea whether it could be related; in any case, there's
no indication from our logs that it was happening when the faults occurred.
@davecheney
Contributor

Comment 1:

One issue is that the address of the lock value is not properly aligned:
sync/atomic.CompareAndSwapUint32(0x7561772f746e6d9f, 0x100000000)
        /usr/local/go/src/pkg/sync/atomic/asm_amd64.s:42 +0xd fp=0x7ff97dcc5f08 sp=0x7ff97dcc5f00
The question is how this happened. Can you post the declaration of the Process struct?

Status changed to WaitingForReply.

@davecheney
Contributor

Comment 2:

Can you also include the definition of the ProcessTable structure?

@bmatsuo
Author

bmatsuo commented Sep 11, 2014

Comment 3:

That's interesting. Here are the relevant struct definitions.
type Process struct {
    topic            *stat.App
    start            time.Time
    uuid             string
    cleanup          bool
    debugdir         string
    debug            bool
    persistDeps      *vaultdata.ObjectDeps
    pipeline         []*vaultexec.State
    chwrmut          sync.Mutex
    closed           bool
    cancelled        bool
    cancel           chan chan<- error
    progCacheStatus  chan progCacheStatus
    setCacheStatus   chan setCacheStatus
    setPersistStatus chan setPersistStatus
    outEvents        chan<- vaultdata.DatasetEvent
    progEvents       chan<- vaultdata.ProgramEvent
    status           chan<- vaultdata.JobEvent
    err              error
    pcResult         chan pcResult
    pcwg             *sync.WaitGroup
}
type ProcessTable struct {
    ps          []*Process
    cleanonexit bool
    last        int
    count       int
    maxproc     int
    rootdir     string
    debugdir    string
    setready    chan vaultdata.Dataset
    setfail     chan datasetError
    persist     chan vaultdata.Dataset
    persistfail chan datasetError
    progready   chan vaultdata.Program
    progfail    chan programError
    events      chan vaultdata.DatasetEvent
    progEvents  chan vaultdata.ProgramEvent
    jobEvents   chan vaultdata.JobEvent
    startReq    chan startReq
    cancelReq   chan cancelReq
    cancelAll   chan chan<- []vaultdata.JobCancellation
    procResult  chan procResult
}
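
For reference, here is a minimal sketch, not from the vault codebase, of how the alignment concern raised in Comment 1 can be sanity-checked. The stand-in struct below only approximates the real Process layout with placeholder field types; unsafe.Offsetof and unsafe.Alignof report where the Mutex field lands:

package main

import (
    "fmt"
    "sync"
    "unsafe"
)

// procLayout is a hypothetical stand-in for Process above; the placeholder
// field types only approximate the real layout, since the real fields are
// all pointer-, word-, string-, or slice-sized.
type procLayout struct {
    topic       *int
    start       [3]uintptr // rough stand-in for time.Time
    uuid        string
    cleanup     bool
    debugdir    string
    debug       bool
    persistDeps *int
    pipeline    []*int
    chwrmut     sync.Mutex
    closed      bool
}

func main() {
    var p procLayout
    // A sync.Mutex only needs 4-byte alignment for its 32-bit state word,
    // so any word-aligned offset is fine on amd64.
    fmt.Printf("offset=%d align=%d\n",
        unsafe.Offsetof(p.chwrmut), unsafe.Alignof(p.chwrmut))
}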

@davecheney
Contributor

Comment 4:

Thanks for your reply. Someone just pointed out to me that the address of the *Process
was completely scrambled, along with the contents of the vaultdata.Dataset value
received over the channel.
Can you please post the entire panic message?
Have you run your application under the Go race detector? Is it clear of detected data
races?

@bmatsuo
Author

bmatsuo commented Sep 11, 2014

Comment 5:

Ah. Thanks for pointing that out.
I have previously suspected a race as the cause of this problem. We don't run with race
detection regularly, but we do from time to time; the program was run with the race
detector in response to the initial occurrence of the error. Races were detected and
fixed. But I don't think we ever managed to run the detector when the fatal error
actually exhibited itself.
We will start running processes compiled with race detection again to see if there is
anything new.
In the meantime, the stack dump can be found here:
https://gist.githubusercontent.com/bmatsuo/1a323e1484930eeadb9b/raw/61769333c6d8d8bed0a22b6f2e0f633f637193d1/gistfile1.txt

@bradfitz
Contributor

Comment 6:

My money's on this being a data race. It almost always is with symptoms like this.
Please confirm either way, though.
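
For readers hitting something similar: a minimal, hypothetical sketch, not taken from the vault codebase, of the kind of unsynchronized access the race detector reports. Running it with go run -race prints a DATA RACE report because the write to closed is not protected by the mutex:

package main

import "sync"

// proc is a hypothetical stand-in for the Process type in this issue.
type proc struct {
    chwrmut sync.Mutex
    closed  bool
}

func main() {
    p := &proc{}
    done := make(chan struct{})

    go func() {
        p.closed = true // write without holding p.chwrmut
        close(done)
    }()

    p.chwrmut.Lock()
    _ = p.closed // read under the lock; still races with the unlocked write
    p.chwrmut.Unlock()

    <-done
}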

@bmatsuo
Author

bmatsuo commented Sep 11, 2014

Comment 7:

No races were detected in the first few hours running on a handful of machines.
It seems like this could take quite some time. As I mentioned, we ran with race detection
for a few days last time and fixed all the reported races. And there is the "absence of
evidence" problem, the infrequent occurrence at full speed (no -race), all that.
If you both think it's a race then you could close the ticket for now, and I'm sorry for
wasting valuable time. Thanks for the help, though.
Cheers

@ianlancetaylor
Contributor

Comment 8:

Status changed to Retracted.

@dvyukov
Member

dvyukov commented Sep 11, 2014

Comment 9:

It looks like the stack of the goroutine is corrupted (overwritten with an array of
floats or encrypted data).
First, both the *Process and the vaultdata.Dataset are corrupted in a similar way (it
looks like they are all just sprayed):
vault/vaultproc.(*Process).DatasetPersisted(0x7561772f746e6d2f, 0x6f2f646f72702e73,
0x393034312f736a62, 0x2d343034302d3131, 0x39382f61382f3435)
Then, all the other goroutines have received correct, uncorrupted data:
vault/vaultproc.(*Process).DatasetPersisted(0x0, 0xc20829d700, 0x71, 0xc208059880, 0x80)
vault/vaultproc.(*Process).DatasetPersisted(0xc2080d0460, 0xc20829d700, 0x71,
0xc208059880, 0x80)
So the data received from the chan and the tab.ps array are OK.
There is a chance that this is not a data race (at least not in user code).
Bryan, please run the program with the GOTRACEBACK=2 env var; it will give a more
detailed crash report the next time the program crashes.

@davecheney
Contributor

Comment 10:

Something came up overnight which may be related to your issue.
Can you confirm whether your build process is using the -N flag?
https://golang.org/issue/8719

Status changed to WaitingForReply.

@bmatsuo
Author

bmatsuo commented Sep 14, 2014

Comment 11:

Sorry for the delay. But no, we aren't using that flag.
The build is pretty straightforward in terms of flags and environment. Only -race (and
-o) at the moment. No cgo outside the stdlib.

@bmatsuo
Author

bmatsuo commented Sep 16, 2014

Comment 12:

We have finally reproduced the error using a binary compiled with -race and with
GOTRACEBACK=2 in the environment.
No data races were reported for the duration of execution (approx. 3h15m), though I'm
not sure how definitive that is.
Here is the GOTRACEBACK=2 stack dump:
https://gist.githubusercontent.com/bmatsuo/2e77e51d880ee3c347d8/raw/ddac0be52d0cd0c636e900c4995cb5ff65d776f0/gistfile1.txt

@dvyukov
Member

dvyukov commented Sep 16, 2014

Comment 13:

I have no clue so far.
The crash looks exactly the same -- the frame of the goroutine is sprayed with crypto
data or something.
I've checked that the frame of the crashed goroutine is not used by any other goroutine.
Does your program deal with crypto data or something similar? Where could that data
"0x7561772f746e6d2f, 0x542d656c74747568, 0x6c7474756853706d, 0x342d79706f632d65,
0x3332333239383630" come from?
I've noticed a frame with the strange value 0x3ea12e0be826d695 in it:
    /tmp/wault-build-321551249/git/src/vault/s3shuttle/tmpshuttle.go:145 +0x158 fp=0x7f3253b06228 sp=0x7f3253b06150
vault/s3shuttle.(*TmpShuttle).copy(0xc208037cb0, 0x7f32a29f9408, 0xc2097129e0, 0xf84a1b,
0x3200000, 0xc20897dce0, 0x3ea12e0be826d695, 0x1, 0x0, 0x0)
What is that 0x3ea12e0be826d695?
I am not sure how we can debug this issue. If you privately provide us with a standalone
reproducer, then maybe we will be able to debug it locally.

Status changed to Accepted.

@bmatsuo
Author

bmatsuo commented Sep 16, 2014

Comment 14:

We're not specifically encrypting/decrypting data (including TLS). We do take MD5 sums
of a few streams. We also encode/decode using snappy (with the framing format). I don't
see any snappy frame headers in there, but I'm not familiar enough with the underlying
format to recognize those bytes.
Here's the signature of that "copy" method:
func (*TmpShuttle) copy(io.Reader, int64, int64, chan<- int64) (*TempFile, int64, error)
Sorry I can't just tell you the type of 0x3ea12e0be826d695 :( There's something I don't
understand and I have a hard time reading those traceback values. Sometimes it seems
like there are more 'arguments' than I'd expect or can explain, sometimes fewer >_<
I don't think it's possible to share these things even privately. I can continue to try
and whittle down to something that is more of a toy example that can be shared and
re-run easily.
Sorry, it's not ideal. But I wanted to get the issue out there in case other people see
the same things elsewhere.

@dvyukov
Member

dvyukov commented Sep 16, 2014

Comment 15:

> Sometimes it seems like there are more 'arguments' than I'd expect or can explain, sometimes fewer >_<
The dump contains just words. So an error, which is an interface, occupies 2 words; but
if you have 8 consecutive bools, they all fit in a single word.
So in this case 0x3ea12e0be826d695 is the *TempFile return value, which is probably fine
because it is dead (an unnamed return value).
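
A small, self-contained illustration of that word counting, using hypothetical types; on amd64 an interface value such as error is two 8-byte words, while eight consecutive bools pack into a single word:

package main

import (
    "fmt"
    "unsafe"
)

func main() {
    var err error // an interface value: one type word plus one data word
    var flags struct {
        a, b, c, d, e, f, g, h bool // eight bools share a single word
    }
    fmt.Println(unsafe.Sizeof(err))   // 16 on amd64 (two words)
    fmt.Println(unsafe.Sizeof(flags)) // 8 on amd64 (one word)
}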

@griesemer
Contributor

Comment 16:

Labels changed: added repo-main.

@rsc
Contributor

rsc commented Apr 10, 2015

In the absence of more information, I don't believe we can fix this.

@golang golang locked and limited conversation to collaborators Jun 25, 2016