sync: intermittent crash locking Mutex -- unexpected fault address #8703
One issue is that the address of the lock value is not properly aligned:

    sync/atomic.CompareAndSwapUint32(0x7561772f746e6d9f, 0x100000000)
        /usr/local/go/src/pkg/sync/atomic/asm_amd64.s:42 +0xd fp=0x7ff97dcc5f08 sp=0x7ff97dcc5f00

The question is how that happened. Can you post the declaration of the Process struct?

Status changed to WaitingForReply.
That's interesting. Here are the relevant struct definitions:

    type Process struct {
        topic            *stat.App
        start            time.Time
        uuid             string
        cleanup          bool
        debugdir         string
        debug            bool
        persistDeps      *vaultdata.ObjectDeps
        pipeline         []*vaultexec.State
        chwrmut          sync.Mutex
        closed           bool
        cancelled        bool
        cancel           chan chan<- error
        progCacheStatus  chan progCacheStatus
        setCacheStatus   chan setCacheStatus
        setPersistStatus chan setPersistStatus
        outEvents        chan<- vaultdata.DatasetEvent
        progEvents       chan<- vaultdata.ProgramEvent
        status           chan<- vaultdata.JobEvent
        err              error
        pcResult         chan pcResult
        pcwg             *sync.WaitGroup
    }

    type ProcessTable struct {
        ps          []*Process
        cleanonexit bool
        last        int
        count       int
        maxproc     int
        rootdir     string
        debugdir    string
        setready    chan vaultdata.Dataset
        setfail     chan datasetError
        persist     chan vaultdata.Dataset
        persistfail chan datasetError
        progready   chan vaultdata.Program
        progfail    chan programError
        events      chan vaultdata.DatasetEvent
        progEvents  chan vaultdata.ProgramEvent
        jobEvents   chan vaultdata.JobEvent
        startReq    chan startReq
        cancelReq   chan cancelReq
        cancelAll   chan chan<- []vaultdata.JobCancellation
        procResult  chan procResult
    }
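As an aside, one way to answer the alignment question directly is to print the offset and alignment of the embedded mutex. The sketch below is hypothetical: it borrows a few field names from the Process struct above but elides the vault-specific types so it compiles standalone, so the printed offset will not match the real struct.

```go
// Minimal alignment check (sketch; the real Process has many more fields).
package main

import (
	"fmt"
	"sync"
	"time"
	"unsafe"
)

type Process struct {
	start    time.Time
	uuid     string
	cleanup  bool
	debugdir string
	debug    bool
	chwrmut  sync.Mutex // the lock in question
	closed   bool
}

func main() {
	var p Process
	off := unsafe.Offsetof(p.chwrmut)
	align := unsafe.Alignof(p.chwrmut)
	fmt.Printf("chwrmut: offset=%d alignment=%d aligned=%v\n", off, align, off%align == 0)
}
```

Since the compiler aligns struct fields naturally, a check like this would normally report the mutex as aligned, so an unaligned lock address points more toward the pointer or the surrounding memory being corrupted.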
Thanks for your reply. Someone just pointed out to me that the address of the *Process was completely scrambled, along with the contents of the vaultdata.Dataset value received over the channel. Can you please post the entire panic message? Have you run your application under the Go race detector? Is it clear of detected data races?
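As a brief aside on the race detector: it is enabled by building or testing with the -race flag (go build -race, go test -race). The program below is a deliberately racy toy example, unrelated to the vault code, showing the kind of unsynchronized access it reports.

```go
// Run with: go run -race racy.go
// The detector prints a "WARNING: DATA RACE" report with both stacks involved.
package main

import "fmt"

func main() {
	counter := 0
	done := make(chan struct{})
	go func() {
		counter++ // unsynchronized write
		close(done)
	}()
	fmt.Println(counter) // unsynchronized concurrent read
	<-done
}
```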
Ah. Thanks for pointing that out. I have previously suspected a race as the cause of this problem. We don't run with race detection regularly, but we do from time to time; the program was run with the race detector in response to the initial occurrence of the error. Races were detected and fixed, but I don't think we ever managed to run the detector when the fatal error actually exhibited itself. We will start running processes compiled with race detection again to see if there is anything new. In the meantime, the stack dump can be found here: https://gist.githubusercontent.com/bmatsuo/1a323e1484930eeadb9b/raw/61769333c6d8d8bed0a22b6f2e0f633f637193d1/gistfile1.txt
No races detected in the first few hours running on a handful of machines. It seems like this could take quite some time. As I mentioned, we ran with race detection for a few days last time and fixed all the reported races. And then there's the "absence of evidence" problem, plus the error only shows up infrequently when running at full speed (without -race). If you both think it's a race then you could close the ticket for now, and I'm sorry for wasting valuable time. Thanks for the help, though. Cheers
It looks like the stack of the goroutine is corrupted (overwritten with an array of floats or encrypted data). First, both the *Process and the vaultdata.Dataset are corrupted in a similar way (it looks like they are all just sprayed):

    vault/vaultproc.(*Process).DatasetPersisted(0x7561772f746e6d2f, 0x6f2f646f72702e73, 0x393034312f736a62, 0x2d343034302d3131, 0x39382f61382f3435)

Then, all other goroutines have received correct, uncorrupted data:

    vault/vaultproc.(*Process).DatasetPersisted(0x0, 0xc20829d700, 0x71, 0xc208059880, 0x80)
    vault/vaultproc.(*Process).DatasetPersisted(0xc2080d0460, 0xc20829d700, 0x71, 0xc208059880, 0x80)

So the data received from the chan and the tab.ps array are OK. There is a chance that this is not a data race (at least not in user code). Bryan, please run the program with the GOTRACEBACK=2 environment variable; it will give a more detailed crash report the next time the program crashes.
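As an illustration of what that spray might be, the quoted words are printable when reinterpreted as little-endian ASCII. The helper below is a sketch, not part of the original report; the word values are copied from the traceback above.

```go
// Decode suspicious stack words as little-endian ASCII.
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	words := []uint64{
		0x7561772f746e6d2f, 0x6f2f646f72702e73, 0x393034312f736a62,
		0x2d343034302d3131, 0x39382f61382f3435,
	}
	buf := make([]byte, 8)
	for _, w := range words {
		binary.LittleEndian.PutUint64(buf, w)
		fmt.Printf("%#016x -> %q\n", w, buf)
	}
}
```

These particular words decode to "/mnt/wau", "s.prod/o", "bjs/1409", "11-0404-", "54/a8/89", which reads like a filesystem path and would be consistent with a string or byte buffer being written over the frame rather than floats or ciphertext.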
Something came up overnight which may be related to your issue. Can you confirm whether your build process is using the -N flag? https://golang.org/issue/8719

Status changed to WaitingForReply.
We have finally reproduced the error using a binary compiled with -race and with GOTRACEBACK=2 in the environment. No data races were reported for the duration of execution (approx. 3h15m), though I'm not sure how definitive that is. Here is the GOTRACEBACK=2 stack dump: https://gist.githubusercontent.com/bmatsuo/2e77e51d880ee3c347d8/raw/ddac0be52d0cd0c636e900c4995cb5ff65d776f0/gistfile1.txt
I have no clue so far. The crash looks exactly the same -- the frame of the goroutine is sprayed with crypto data or something. I've checked that the frame of the crashed goroutine is not used by any other goroutines. Does your program deal with crypto data or something similar? Where can that data (0x7561772f746e6d2f, 0x542d656c74747568, 0x6c7474756853706d, 0x342d79706f632d65, 0x3332333239383630) come from? I've also noticed a frame with the strange 0x3ea12e0be826d695 value in it:

    /tmp/wault-build-321551249/git/src/vault/s3shuttle/tmpshuttle.go:145 +0x158 fp=0x7f3253b06228 sp=0x7f3253b06150
    vault/s3shuttle.(*TmpShuttle).copy(0xc208037cb0, 0x7f32a29f9408, 0xc2097129e0, 0xf84a1b, 0x3200000, 0xc20897dce0, 0x3ea12e0be826d695, 0x1, 0x0, 0x0)

What is that 0x3ea12e0be826d695? I am not sure how we can debug this issue. If you privately provide us with a standalone reproducer, then maybe we will be able to debug it locally.

Status changed to Accepted.
We're not specifically encrypting/decrypting data (including TLS). We do take MD5 sums of a few streams. We also encode/decode using snappy (with the framing format). I don't see any snappy frame headers in there, but I'm not familiar enough with the underlying format to recognize those bytes. Here's the signature of that "copy" method:

    func (*TmpShuttle) copy(io.Reader, int64, int64, chan<- int64) (*TempFile, int64, error)

Sorry I can't just tell you the type of 0x3ea12e0be826d695 :( There's something I don't understand: I have a hard time reading those traceback values. Sometimes it seems like there are more "arguments" than I'd expect or can explain, sometimes fewer >_<

I don't think it's possible to share these things even privately. I can continue to try and whittle it down to something that is more of a toy example that can be shared and re-run easily. Sorry, it's not ideal, but I wanted to get the issue out there in case other people see the same things elsewhere.
> Sometimes it seems like there are more "arguments" than I'd expect or can explain, sometimes fewer >_<

The dump contains just words. So an error, which is an interface, occupies 2 words, but if you have 8 consecutive bools, they will all fit in a single word. So in this case 0x3ea12e0be826d695 is the *TempFile return value, which is probably fine, because it is dead (an unnamed return value).
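As a sketch of that word accounting (with stand-in types, since the real TmpShuttle and TempFile definitions were never posted): the receiver plus io.Reader (2 words), two int64s, and the channel account for the first six words printed for the copy frame, and the *TempFile, int64, and error results for the remaining four, which matches the ten values in the traceback.

```go
// Word sizes of the argument and result slots of
//   func (*TmpShuttle) copy(io.Reader, int64, int64, chan<- int64) (*TempFile, int64, error)
// TempFile here is a hypothetical stand-in; only the sizes matter.
package main

import (
	"fmt"
	"io"
	"unsafe"
)

type TempFile struct{}

func main() {
	const word = unsafe.Sizeof(uintptr(0)) // 8 on amd64

	// Argument slots, in traceback order:
	fmt.Println("receiver   words:", 1)                                       // *TmpShuttle pointer
	fmt.Println("io.Reader  words:", unsafe.Sizeof(io.Reader(nil))/word)      // 2 (type + data)
	fmt.Println("int64      words:", unsafe.Sizeof(int64(0))/word)            // 1 each, two of them
	fmt.Println("chan<-     words:", unsafe.Sizeof((chan<- int64)(nil))/word) // 1

	// Result slots, printed after the arguments:
	fmt.Println("*TempFile  words:", unsafe.Sizeof((*TempFile)(nil))/word) // 1 (the dead 0x3ea1... slot)
	fmt.Println("int64      words:", unsafe.Sizeof(int64(0))/word)         // 1
	fmt.Println("error      words:", unsafe.Sizeof(error(nil))/word)       // 2 (type + value)
}
```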
In the absence of more information, I don't believe we can fix this. |