runtime/pprof: heap profile is malformed #6592
This is the line that caused the problem:

12: 0 [48: 768] @ 0x3b420 0x3c8dc 0x1eb2bc 0x1ed3a8 0x1c7f60 0x1c6e64 0xcdf48 0xd1668 0x302e4

Here are some other incorrect lines:

113: -864 [393: 4720] @ 0x3b420 0x3c9a8 0x3ca50 0x35478 0x35734 0x35620 0x123220 0x123f94 0x1281b8 0x12bfd4 0x131b10 0x126aac 0x77ae0 0x77b78 0x78f7c 0x67668 0x302e4
12: -960 [49: 1568] @ 0x3b420 0x3c9a8 0x3ca50 0x35478 0x35328 0x222494 0x22b874 0x21f5ac 0x23393c 0x236c20 0x237b7c 0x237b7c 0x23831c 0x238170 0x1ed2ec 0x1c7f60 0x1c6e64 0xcdf48 0xd1668 0x302e4
60: -1048 [192: 1536] @ 0x3b420 0x3c8dc 0x59dc4 0x670a4 0x675c4 0x302e4
6: -1152 [15: 1920] @ 0x3b420 0x3c8dc 0xce1a4 0xd1668 0x302e4
5: -1184 [13: 1872] @ 0x3b420 0x3fd44 0x402fc 0x164ec0 0x163ce8 0x161490 0x163564 0x1614cc 0x160c60 0x1607d0 0x6bc84 0x70c34 0x302e4
9: -1216 [72: 2304] @ 0x3b420 0x3c9a8 0x3ca50 0x35478 0x35734 0x35620 0x21b5a4 0x21a1ec 0x1da2f4 0x1da7f0 0x1db458 0x1c9404 0x1c6e64 0xcdf48 0xd1668 0x302e4
48: -1280 [193: 1544] @ 0x3b420 0x3c8ac 0x20bd0 0x20dbc 0x77c74 0x78f7c 0x59fc4 0x670a4 0x675c4 0x302e4
77: -1424 [245: 1960] @ 0x3b420 0x3c8dc 0x177c48 0x1769fc 0x2379a0 0x235308 0x2372a8 0x23831c 0x238170 0x1ec474 0x1ed3a8 0x1c7f60 0x1c6e64 0xcdf48 0xd1668 0x302e4
1: -1472 [14: 4032] @ 0x3b420 0x3c9a8 0x3ca50 0x35478 0x35328 0x222494 0x229c88 0x21f838 0x231f84 0x230358 0x1d7378 0x1dffc8 0x1c8e68 0x1c6e64 0xcdf48 0xd1668 0x302e4
0: -1536 [4: 2048] @ 0x3b420 0x3c9a8 0x3ca50 0x35478 0x35328 0x82af4 0x82cac 0x6b2a8 0x707fc 0x302e4
108: -1624 [353: 11296] @ 0x3b420 0x3fd44 0x402fc 0x126adc 0x77ae0 0x77b78 0x78f7c 0x65c00 0x302e4

The accounting is clearly bogus.

Labels changed: added priority-later, go1.2, removed priority-triage.
Status changed to Accepted.
Thanks. The first four fields are "objects in use", "bytes in use", "total allocated objects", "total allocated bytes". Internally, we track total allocated and total freed; the "in use" numbers are allocated - freed. The negative in-use byte counts indicate that the freed byte count is bigger than the allocated byte count. It appears that only the byte counts are incorrect; the object counts seem fine.

All the lines containing 0x3c9a8 are calls to runtime.new, which necessarily allocate fixed-size chunks, so all the allocs and frees for each of those should be the same size, and the ratio of #blocks / #bytes should be constant (1 / size). That makes it easy to find inconsistent lines:

tr -d '[]:' <heap.txt | grep 0x3c8dc | awk '$1*$4 != $2*$3'

Here are the 10 entries with the largest block sizes:

$ tr -d '[]:' <heap.txt | grep 0x3c8dc | awk '$1*$4 != $2*$3 {print $4/$3, $0}' | sort -nr | sed 10q
2048 96 252408 370 757760 @ 0x3b420 0x3c8dc 0x149f54 0x67ebc 0x6bdc8 0x70c34 0x302e4
2048 39 109136 192 393216 @ 0x3b420 0x3c8dc 0x149f54 0x59a84 0x670a4 0x675c4 0x302e4
2048 38 94992 192 393216 @ 0x3b420 0x3c8dc 0x149f54 0x59c00 0x670a4 0x675c4 0x302e4
1152 5 6752 18 20736 @ 0x3b420 0x3c8dc 0x21aa48 0x1bf928 0x1c9500 0x1c6e64 0xcdf48 0xd1668 0x302e4
1152 4 -1920 19 21888 @ 0x3b420 0x3c8dc 0x21aa48 0x1bf928 0x1c995c 0x1c6e64 0xcdf48 0xd1668 0x302e4
224 55 12552 193 43232 @ 0x3b420 0x3c8dc 0x153244 0x14e7ac 0x65fe4 0x671e4 0x675c4 0x302e4
224 52 13544 193 43232 @ 0x3b420 0x3c8dc 0x153244 0x14b5dc 0x14b514 0x662fc 0x675c4 0x302e4
144 71 9136 202 29088 @ 0x3b420 0x3c8dc 0x138068 0x1262a8 0x126454 0x126a74 0x57efc 0x54738 0x58adc 0x12b214 0x12b708 0x131b10 0x126aac 0x77bb8 0x78f7c 0x57858 0x302e4
144 67 9808 196 28224 @ 0x3b420 0x3c8dc 0x138068 0x1262a8 0x126454 0x126a74 0x77ae0 0x77b78 0x78f7c 0x67668 0x302e4
144 63 8352 182 26208 @ 0x3b420 0x3c8dc 0x138068 0x1262a8 0x126454 0x126c28 0x15aee0 0x15a75c 0x15d6d8 0x15d868 0x15f780 0x12b214 0x12b708 0x131b10 0x126aac 0x77bb8 0x78f7c 0x66174 0x671e4 0x675c4 0x302e4

Note especially the three entries with 2048-byte blocks. The in-use byte count is 252408 (504 mod 2048). So it's not that some blocks are being counted as freed twice; the size is wrong for some other reason.

Perhaps the size being passed to MProf_Free is wrong. However, it looks correct, it is correct in many cases, and if it were incorrect, that would mean that sweep would be walking the span using the wrong block size, which would cause more serious problems than a bad profile.

Perhaps there is a bad atomic operation and the int64 storing the freed byte count is getting corrupted. It is strange that the allocated byte count is fine, though.

Perhaps there is memory corruption, but it's strange that it would target this one field in this one struct, repeatedly, but nothing else.

Perhaps there is a code generation error, but again it's strange that it works most of the time. Perhaps if a signal arrives at the wrong time, the register being used by this one computation is smashed, but not the computation of the others. But again, why this one field?

The size passed to MProf_Free being incorrect really does seem to be the most likely option, but again, if that were true, more would be broken. I'm stumped.

I'd very much like to understand this before we cut Go 1.2. It won't hold up the rc2 release candidate, though.

I assume this is arm. What kind of hardware is this? Is it a multiprocessor? What is the GOARM setting you are using? Any other relevant details?

Labels changed: added go1.2maybe, removed go1.2.
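The cross-multiplication check in the awk one-liner can also be written out directly. Below is a small Go sketch (field names are mine) that flags a record whose "in use" and "total allocated" counts imply different block sizes; cross-multiplying avoids dividing by zero object counts.

```go
package main

import (
	"fmt"
	"strings"
)

// inconsistent reports whether a heap-profile record's byte counts
// disagree with its object counts. For a call site that only ever
// allocates one fixed block size, the in-use and total-allocated
// figures must have the same objects-to-bytes ratio, i.e.
// inuseObjs*allocBytes == inuseBytes*allocObjs.
func inconsistent(record string) bool {
	// Strip the '[', ']', ':' punctuation, as tr -d '[]:' does.
	clean := strings.NewReplacer("[", " ", "]", " ", ":", " ").Replace(record)
	var inuseObjs, inuseBytes, allocObjs, allocBytes int64
	if _, err := fmt.Sscan(clean, &inuseObjs, &inuseBytes, &allocObjs, &allocBytes); err != nil {
		return false // not a sample record (header or continuation line)
	}
	return inuseObjs*allocBytes != inuseBytes*allocObjs
}

func main() {
	// The first bad line from the report: 12 objects in use but 0 bytes
	// in use, while 48 total objects account for 768 total bytes.
	fmt.Println(inconsistent("12: 0 [48: 768] @ 0x3b420 0x3c8dc")) // true
	// A consistent record: 4 of 8 fixed 2048-byte blocks still in use.
	fmt.Println(inconsistent("4: 8192 [8: 16384] @ 0x3b420")) // false
}
```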
> I assume this is arm. What kind of hardware is this? Is it a multiprocessor?

$ cat /proc/cpuinfo
Processor       : ARM926EJ-S rev 5 (v5l)
BogoMIPS        : 226.09
Features        : swp half thumb fastmult edsp java
CPU implementer : 0x41
CPU architecture: 5TEJ
CPU variant     : 0x0
CPU part        : 0x926
CPU revision    : 5
Hardware        : Freescale MX28EVK board
Revision        : 0000
Serial          : 0000000000000000

> What is the GOARM setting you are using? Any other relevant details?

GOARCH=arm GOARM=5 GOOS=linux

Cross compiled on OS X with go version devel +351b6fe0ae36 Tue Sep 24 15:54:48 2013 -0400 darwin/amd64. (Yes, that go toolchain includes the fix for https://golang.org/issue/6440.)

$ uname -a
Linux ppbeacon-48:02:2A:39:B6:5E 2.6.35.3-670-g914558e #1 PREEMPT Mon Oct 7 23:19:04 PDT 2013 armv5tejl GNU/Linux

> if that were true, more would be broken.

And this combination of code base, go toolchain, os, and hardware has a fair number of miles on it.

> I'd very much like to understand this before we cut Go 1.2.

Thanks for the write-up above. I'll spend some time digging at this here, and share anything potentially interesting that I find.
Ok, assuming that I correctly understand how to detect malformed pprof heap profiles (following your notes above), I have a smallish test case that fairly reliably reproduces this: http://play.golang.org/p/kklslqpWKT. Forgive the ugly, haphazard code; it's the result of a very frustrating, slow search through a poorly bounded space. Most of this code seems to be necessary; attempts to simplify further and still reproduce have failed to date.

This needs to run for about a minute to reproduce on my hardware. Happily, I've also been able to reproduce this on my Raspberry Pi, using the same binary, although it took a 10m run.

Attached to this comment are a gzipped binary, a malformed sample 1-minute profile run with that binary, and that sample profile run symbolicated, for your convenience. (Note that the symbolication might be dodgy; I whipped it up ad hoc just for this, and it is not well tested.)

Now that I can reproduce reliably, the question is: what next? In the absence of other guidance, I'll start digging into this myself, starting by instrumenting the runtime to look at MProf_Free sizes. If you have any specific experiments you'd like me to run, though, I'm happy to do so. Note that I'll be offline all of tomorrow (Thu) and will wrap up today in an hour or so.

Attachments:
Agreed. And https://golang.org/cl/14695044/ does appear to fix it.
It doesn't make sense to me that the numbers we saw would be caused by issue #6566. That issue would cause entire profiling events to be lost, but it should keep recent_frees and recent_free_bytes in sync. Since buckets are never reused, a bucket allocated for 2048-byte objects should always have a byte count that is a multiple of 2048. Even if 64-bit operations were interrupted, this property should still hold (and the recent_xxx numbers are uint32s anyway; the int64 doesn't come into play until they are retrieved as a heap profile). Something is still wrong, either in the code or in our understanding of the bug.
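The invariant argued here can be checked with a toy simulation: if an entire event is lost (count and bytes together, as issue #6566 would cause), the two fields still move in lockstep, so the byte count stays a multiple of the block size. This is a sketch with made-up names, not the runtime's real mprof code.

```go
package main

import "fmt"

// simulateLostEvents models a profiling bucket for one fixed-size
// class when some free events are dropped wholesale: both the count
// and the byte fields miss the update together, so the fields can
// never drift apart. Only a mismatched byte value could make
// freeBytes stop being a multiple of the size.
func simulateLostEvents(events int, size uint32, dropEvery int) (frees, freeBytes uint32) {
	for i := 0; i < events; i++ {
		if dropEvery > 0 && i%dropEvery == 0 {
			continue // entire event lost: neither field updated
		}
		frees++
		freeBytes += size
	}
	return
}

func main() {
	frees, freeBytes := simulateLostEvents(1000, 2048, 7)
	fmt.Println(freeBytes%2048 == 0)     // still a multiple of the block size
	fmt.Println(freeBytes == frees*2048) // fields stay in sync
}
```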
Dmitriy, can you see how issue #6566 would have caused the profile seen above too? See my notes in comment #4 and comment #9. Thanks.
GOMAXPROCS not set. Stuck in meetings all day; will gather some GOMAXPROCS data when I can. In the meantime, a quick note before I run off. (Please bear with me, I'm new to this code, and am reading it in meetings. :P)

First, it's not clear to me why runtime·new will allocate a fixed size over the lifetime of a program, given only that the stack is the same; it could be called with different Types, no?

That said, last night I instrumented runtime·MProf_Free and runtime·MProf_Malloc to log all calls and all calls that got short-circuited (nomemprof caught my eye as well). Organizing these log lines by pointer / size / event and then sort | uniq -c gives a look at what is going on on a per-memory-address basis. Here's activity for one sample pointer (count, ptr, size, event):

 2 0x10551000 1024 free
 2 0x10551000 1024 malloc
 1 0x10551000 1024 skipmalloc
 3 0x10551000 1152 free
 3 0x10551000 1152 malloc
 3 0x10551000 1152 skipfree
 4 0x10551000 1536 free
 4 0x10551000 1536 malloc
 3 0x10551000 1536 skipfree
 1 0x10551000 1536 skipmalloc
28 0x10551000 2048 free
28 0x10551000 2048 malloc
 3 0x10551000 2048 skipfree
 4 0x10551000 2048 skipmalloc
 7 0x10551000 2560 free
 7 0x10551000 2560 malloc
14 0x10551000 4096 free
14 0x10551000 4096 malloc
 1 0x10551000 4096 skipfree
 3 0x10551000 512 free
 3 0x10551000 512 malloc

Note the mismatched skipfree/skipmalloc counts. (Skip is a misnomer here; should be "short-circuit".) Note also that the pointer is used in setaddrbucket but not in stkbucket, so these will all impact the same recent_alloc_bytes, IIUC. The mismatched short-circuits would then explain both the incorrect totals and the non-multiple of 2048, I think. Back in a few hours.
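The sort | uniq -c view described here is easy to reproduce in Go for anyone replaying this analysis. A minimal sketch, assuming a "ptr size event" line format for the ad-hoc instrumentation (the exact log format above is not specified):

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
)

// tally counts identical log lines, giving the same per-line counts
// as sort | uniq -c.
func tally(log string) map[string]int {
	counts := map[string]int{}
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		if line := strings.TrimSpace(sc.Text()); line != "" {
			counts[line]++
		}
	}
	return counts
}

func main() {
	// A few hypothetical instrumentation lines for one pointer.
	log := `0x10551000 1024 malloc
0x10551000 1024 free
0x10551000 1152 skipfree
0x10551000 1152 skipfree
0x10551000 1024 skipmalloc`

	counts := tally(log)
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Strings(keys) // stable, sorted output like sort | uniq -c
	for _, k := range keys {
		fmt.Printf("%4d | %s\n", counts[k], k)
	}
}
```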
MergedInto: 6566

runtime.new is only called by code generated by 5g as the compiled form of the Go expression new(T). At a given call site, T is fixed - there are no type variables in Go.

I see how the bug can be caused by nomemprof confusion now. All the skips you saw are buggy skips, since the allocation of the buckets themselves (which is what we were trying to skip) is no longer done with malloc.

If you have a meant-to-be-profiled N-byte pointer p that misses its MProf_Malloc, the MProf_Free won't find it; no harm done. But if you have a profiled N-byte pointer p that hits its MProf_Malloc but misses its MProf_Free, the hash table entry created by setaddrbucket in MProf_Malloc will not be removed from the hash table. When the pointer is later reallocated and also profiled (perhaps as a different size), the next MProf_Malloc for that address will create a new bucket, earlier in the hash chain, so a matched MProf_Free will find the new bucket, and the math will still be okay. But if p is later used for a profiled M-byte allocation that misses its MProf_Malloc, the MProf_Free will find the stale bucket entry left by the earlier missed MProf_Free. At that point (perhaps much later), it will subtract the new size from the old bucket, causing the strange behavior we saw. It will also subtract one from the bucket count, but that's actually a belated but correct update: the old p really did get freed.
That is:

(1) p allocated and profiled with size N1 and stack S1
(2) MProf_Malloc calls setaddrbucket(p, b1 for stack S1)
(3) MProf_Free spuriously missed
(4) repeat any number of times:
    either: p reallocated and not profiled
    or: p reallocated and profiled with size Nx and stack Sx
        MProf_Malloc calls setaddrbucket(p, bx for stack Sx)
        MProf_Free calls getaddrbucket, updates counts in bx correctly
(5) p allocated and profiled with size N2 and stack S2
(6) MProf_Malloc spuriously missed, should have set b2 for stack S2
(7) MProf_Free calls getaddrbucket, finds b1 for stack S1
    b1.recent_frees++
    b1.recent_free_bytes += N2

The b1.recent_frees++ is late but correct; it should have happened in step (3). The b1.recent_free_bytes += N2 is late and incorrect: the one that should have happened in step (3) should have used N1, not N2. This is why the recent_frees numbers always look fine but the recent_free_bytes numbers are incorrect.

Status changed to Duplicate.
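The steps above can be replayed in a few lines. This is a toy model with hypothetical names (bucket, addrhash, staleFree), not the runtime's real mprof code; it shows how a missed MProf_Free followed by a missed MProf_Malloc at the same address leaves the free count correct but charges the wrong byte size to the stale bucket.

```go
package main

import "fmt"

// bucket is a simplified stand-in for the runtime's profiling bucket.
type bucket struct {
	recentFrees     uint32
	recentFreeBytes uint32
}

// staleFree replays the steps above: a 2048-byte allocation at address
// p whose MProf_Free is missed, then an n2-byte reallocation at p
// whose MProf_Malloc is missed.
func staleFree(n2 uint32) *bucket {
	const p = uintptr(0x10551000)

	b1 := &bucket{}
	addrhash := map[uintptr]*bucket{} // setaddrbucket/getaddrbucket stand-in

	// (1)-(2): p allocated and profiled with size 2048; setaddrbucket(p, b1).
	addrhash[p] = b1
	// (3): MProf_Free spuriously missed; the entry for p is now stale.

	// (5)-(6): p reallocated and profiled with size n2, but MProf_Malloc
	// is spuriously missed, so addrhash still maps p to the stale b1.
	// (7): MProf_Free finds b1 and charges it n2 bytes.
	if b := addrhash[p]; b != nil {
		b.recentFrees++         // late but correct: the old p did get freed
		b.recentFreeBytes += n2 // wrong: step (3) should have added 2048
		delete(addrhash, p)
	}
	return b1
}

func main() {
	b := staleFree(1152)
	// The free count looks fine, but the byte count is no longer a
	// multiple of the 2048-byte size class: the signature in the profile.
	fmt.Println(b.recentFrees, b.recentFreeBytes, b.recentFreeBytes%2048)
}
```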
Merged into issue #6566. |
This issue was closed.