
runtime: "fatal: morestack on g0" on amd64 after upgrade to Go 1.21 #62440

Closed
VirrageS opened this issue Sep 4, 2023 · 58 comments
Labels: compiler/runtime Issues related to the Go compiler and/or runtime. FixPending Issues that have a fix which has not yet been reviewed or submitted. NeedsFix The path to resolution is known, but the work has not been done.


VirrageS commented Sep 4, 2023

Does this issue reproduce with the latest release?

Yes.

What did you do?

Here is the full code to reproduce the problem: https://github.com/VirrageS/fluent-bit-go-bug.
I will try to make it smaller, but for now this is all I have.

When running this project with:

docker build --build-arg="GO_VERSION=1.20" -t fluentbit .
docker run -t fluentbit

it runs correctly.

But when I change GO_VERSION to 1.21, it fails with fatal: morestack on g0:

docker build --build-arg="GO_VERSION=1.21" -t fluentbit .
docker run -t fluentbit

It looks like a regression in Go 1.21.
I found similar issues, but those seem to be quite old and are not tied to Go 1.21.

I straced the binary and this is what I got:

Details

18    <... set_robust_list resumed>)    = 0
18    prctl(PR_SET_NAME, "flb-pipeline" <unfinished ...>
18    <... prctl resumed>)              = 0
18    futex(0x55c46520b460, FUTEX_WAKE_PRIVATE, 2147483647 <unfinished ...>
18    <... futex resumed>)              = 0
18    mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f93877fd000
18    munmap(0x7f93877fd000, 2097152 <unfinished ...>
18    <... munmap resumed>)             = 0
18    mmap(NULL, 4190208, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f93875fe000
18    munmap(0x7f93875fe000, 8192 <unfinished ...>
18    <... munmap resumed>)             = 0
18    munmap(0x7f9387800000, 2084864 <unfinished ...>
18    <... munmap resumed>)             = 0
18    mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0 <unfinished ...>
18    <... mmap resumed>)               = 0x7f9387400000
18    epoll_create1(EPOLL_CLOEXEC)      = 8
18    pipe([9, 10])                     = 0
18    epoll_ctl(8, EPOLL_CTL_ADD, 9, {EPOLLIN|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=3559108056, u64=140272896028120}} <unfinished ...>
18    <... epoll_ctl resumed>)          = 0
18    futex(0x55c46520b480, FUTEX_WAKE_PRIVATE, 2147483647) = 0
18    epoll_create1(EPOLL_CLOEXEC <unfinished ...>
18    <... epoll_create1 resumed>)      = 11
18    pipe([12, 13])                    = 0
18    epoll_ctl(11, EPOLL_CTL_ADD, 12, {EPOLLIN|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=2269204480, u64=140271606124544}} <unfinished ...>
18    <... epoll_ctl resumed>)          = 0
18    pipe([14, 15])                    = 0
18    epoll_ctl(11, EPOLL_CTL_ADD, 14, {EPOLLIN|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=2269286400, u64=140271606206464}} <unfinished ...>
18    <... epoll_ctl resumed>)          = 0
18    pipe([16, 17])                    = 0
18    epoll_ctl(11, EPOLL_CTL_ADD, 16, {EPOLLIN|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=2269286592, u64=140271606206656}} <unfinished ...>
18    <... epoll_ctl resumed>)          = 0
18    mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f9386bff000
18    mprotect(0x7f9386c00000, 8388608, PROT_READ|PROT_WRITE|PROT_EXEC <unfinished ...>
18    <... mprotect resumed>)           = 0
18    clone(child_stack=0x7f93873f54f0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID <unfinished ...>
18    <... clone resumed>, parent_tid=[19], tls=0x7f93873ff700, child_tidptr=0x7f93873ff9d0) = 19
18    futex(0x7f93874150a8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
18    <... futex resumed>)              = 0
18    futex(0x7f93874150b0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
18    <... futex resumed>)              = 0
18    getpid()                          = 10
18    openat(AT_FDCWD, "/etc/localtime", O_RDONLY|O_CLOEXEC <unfinished ...>
18    <... openat resumed>)             = 18
18    fstat(18,  <unfinished ...>
18    <... fstat resumed>{st_mode=S_IFREG|0644, st_size=118, ...}) = 0
18    fstat(18,  <unfinished ...>
18    <... fstat resumed>{st_mode=S_IFREG|0644, st_size=118, ...}) = 0
18    read(18, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\0"..., 4096) = 118
18    lseek(18, -62, SEEK_CUR <unfinished ...>
18    <... lseek resumed>)              = 56
18    read(18, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\0"..., 4096) = 62
18    close(18 <unfinished ...>
18    <... close resumed>)              = 0
18    write(15, "T\0\0\0\0\0\0\0[2023/09/04 10:18:13] [ "..., 4096 <unfinished ...>
18    <... write resumed>)              = 4096
18    pipe( <unfinished ...>
18    <... pipe resumed>[18, 19])       = 0
18    epoll_ctl(8, EPOLL_CTL_ADD, 18, {EPOLLIN|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=3559107648, u64=140272896027712}} <unfinished ...>
18    <... epoll_ctl resumed>)          = 0
18    write(15, "m\0\0\0\0\0\0\0[2023/09/04 10:18:13] [ "..., 4096 <unfinished ...>
18    <... write resumed>)              = 4096
18    write(15, "7\0\0\0\0\0\0\0[2023/09/04 10:18:13] [ "..., 4096 <unfinished ...>
18    <... write resumed>)              = 4096
18    write(15, "7\0\0\0\0\0\0\0[2023/09/04 10:18:13] [ "..., 4096 <unfinished ...>
18    <... write resumed>)              = 4096
18    timerfd_create(CLOCK_MONOTONIC, 0 <unfinished ...>
18    <... timerfd_create resumed>)     = 20
18    timerfd_settime(20, TFD_TIMER_ABSTIME, {it_interval={tv_sec=10, tv_nsec=0}, it_value={tv_sec=16839370, tv_nsec=0}},  <unfinished ...>
18    <... timerfd_settime resumed>NULL) = 0
18    epoll_ctl(8, EPOLL_CTL_ADD, 20, {EPOLLIN|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=2269405184, u64=140271606325248}} <unfinished ...>
18    <... epoll_ctl resumed>)          = 0
18    write(15, "=\0\0\0\0\0\0\0[2023/09/04 10:18:13] [ "..., 4096) = 4096
18    write(15, "X\0\0\0\0\0\0\0[2023/09/04 10:18:13] [ "..., 4096 <unfinished ...>
18    <... write resumed>)              = 4096
18    pipe( <unfinished ...>
18    <... pipe resumed>[21, 22])       = 0
18    epoll_ctl(8, EPOLL_CTL_ADD, 21, {EPOLLIN|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=3559552000, u64=140272896472064}} <unfinished ...>
18    <... epoll_ctl resumed>)          = 0
18    openat(AT_FDCWD, "/sys/devices/system/cpu/online", O_RDONLY|O_CLOEXEC <unfinished ...>
18    <... openat resumed>)             = 23
18    read(23,  <unfinished ...>
18    <... read resumed>"0-11\n", 8192) = 5
18    close(23)                         = 0
18    openat(AT_FDCWD, "/proc/stat", O_RDONLY <unfinished ...>
18    <... openat resumed>)             = 23
18    fstat(23, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
18    read(23,  <unfinished ...>
18    <... read resumed>"cpu  531693645 1691808 251625914"..., 1024) = 1024
18    close(23)                         = 0
18    pipe( <unfinished ...>
18    <... pipe resumed>[23, 24])       = 0
18    epoll_ctl(8, EPOLL_CTL_ADD, 23, {EPOLLIN|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=3559483392, u64=140272896403456}}) = 0
18    futex(0x55c46520b45c, FUTEX_WAKE_PRIVATE, 2147483647 <unfinished ...>
18    <... futex resumed>)              = 0
18    epoll_create1(EPOLL_CLOEXEC)      = 25
18    pipe( <unfinished ...>
18    <... pipe resumed>[26, 27])       = 0
18    epoll_ctl(25, EPOLL_CTL_ADD, 26, {EPOLLIN|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=2269552640, u64=140271606472704}}) = 0
18    pipe( <unfinished ...>
18    <... pipe resumed>[28, 29])       = 0
18    epoll_ctl(11, EPOLL_CTL_ADD, 28, {EPOLLIN|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=2269286784, u64=140271606206848}}) = 0
18    mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f93863fe000
18    mprotect(0x7f93863ff000, 8388608, PROT_READ|PROT_WRITE|PROT_EXEC <unfinished ...>
18    <... mprotect resumed>)           = 0
18    clone(child_stack=0x7f9386bf44f0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID <unfinished ...>
18    <... clone resumed>, parent_tid=[20], tls=0x7f9386bfe700, child_tidptr=0x7f9386bfe9d0) = 20
18    pipe( <unfinished ...>
18    <... pipe resumed>[30, 31])       = 0
18    epoll_ctl(8, EPOLL_CTL_ADD, 30, {EPOLLIN|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=3559484160, u64=140272896404224}} <unfinished ...>
18    <... epoll_ctl resumed>)          = 0
18    rt_sigprocmask(SIG_SETMASK, NULL,  <unfinished ...>
18    <... rt_sigprocmask resumed>[], 8) = 0
18    rt_sigprocmask(SIG_SETMASK, ~[],  <unfinished ...>
18    <... rt_sigprocmask resumed>NULL, 8) = 0
18    sched_getaffinity(18, 32,  <unfinished ...>
18    <... sched_getaffinity resumed>[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]) = 8
18    sigaltstack(NULL, {ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=0}) = 0
18    sigaltstack({ss_sp=0xc000180000, ss_flags=0, ss_size=32768},  <unfinished ...>
18    <... sigaltstack resumed>NULL)    = 0
18    rt_sigprocmask(SIG_SETMASK, [],  <unfinished ...>
18    <... rt_sigprocmask resumed>NULL, 8) = 0
18    gettid( <unfinished ...>
18    <... gettid resumed>)             = 18
18    futex(0x7f93d3dcc900, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
18    <... futex resumed>)              = 1
18    futex(0x7f93d3dcc148, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
18    <... futex resumed>)              = 1
18    futex(0xc000100d48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
18    <... futex resumed>)              = -1 EAGAIN (Resource temporarily unavailable)
18    --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=10, si_uid=0} ---
18    rt_sigreturn({mask=[]} <unfinished ...>
18    <... rt_sigreturn resumed>)       = -1 EAGAIN (Resource temporarily unavailable)
18    write(1, "[flb-go] plugin parameter = ''\n", 31 <unfinished ...>
18    <... write resumed>)              = 31
18    timerfd_create(CLOCK_MONOTONIC, 0 <unfinished ...>
18    <... timerfd_create resumed>)     = 36
18    timerfd_settime(36, TFD_TIMER_ABSTIME, {it_interval={tv_sec=1, tv_nsec=0}, it_value={tv_sec=16839361, tv_nsec=0}},  <unfinished ...>
18    <... timerfd_settime resumed>NULL) = 0
18    epoll_ctl(8, EPOLL_CTL_ADD, 36, {EPOLLIN|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=3559107912, u64=140272896027976}} <unfinished ...>
18    <... epoll_ctl resumed>)          = 0
18    timerfd_create(CLOCK_MONOTONIC, 0 <unfinished ...>
18    <... timerfd_create resumed>)     = 37
18    timerfd_settime(37, TFD_TIMER_ABSTIME, {it_interval={tv_sec=5, tv_nsec=0}, it_value={tv_sec=16839365, tv_nsec=0}},  <unfinished ...>
18    <... timerfd_settime resumed>NULL) = 0
18    epoll_ctl(8, EPOLL_CTL_ADD, 37, {EPOLLIN|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=2269406144, u64=140271606326208}}) = 0
18    timerfd_create(CLOCK_MONOTONIC, 0 <unfinished ...>
18    <... timerfd_create resumed>)     = 38
18    timerfd_settime(38, TFD_TIMER_ABSTIME, {it_interval={tv_sec=1, tv_nsec=0}, it_value={tv_sec=16839361, tv_nsec=0}},  <unfinished ...>
18    <... timerfd_settime resumed>NULL) = 0
18    epoll_ctl(8, EPOLL_CTL_ADD, 38, {EPOLLIN|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=2269396208, u64=140271606316272}}) = 0
18    write(15, "<\0\0\0\0\0\0\0[2023/09/04 10:18:13] [ "..., 4096 <unfinished ...>
18    <... write resumed>)              = 4096
18    timerfd_create(CLOCK_MONOTONIC, 0 <unfinished ...>
18    <... timerfd_create resumed>)     = 39
18    timerfd_settime(39, TFD_TIMER_ABSTIME, {it_interval={tv_sec=1, tv_nsec=0}, it_value={tv_sec=16839361, tv_nsec=0}},  <unfinished ...>
18    <... timerfd_settime resumed>NULL) = 0
18    epoll_ctl(8, EPOLL_CTL_ADD, 39, {EPOLLIN|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=2269405504, u64=140271606325568}} <unfinished ...>
18    <... epoll_ctl resumed>)          = 0
18    timerfd_create(CLOCK_MONOTONIC, 0) = 40
18    timerfd_settime(40, TFD_TIMER_ABSTIME, {it_interval={tv_sec=1, tv_nsec=500000000}, it_value={tv_sec=16839361, tv_nsec=0}},  <unfinished ...>
18    <... timerfd_settime resumed>NULL) = 0
18    epoll_ctl(8, EPOLL_CTL_ADD, 40, {EPOLLIN|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=2269406304, u64=140271606326368}}) = 0
18    timerfd_create(CLOCK_MONOTONIC, 0 <unfinished ...>
18    <... timerfd_create resumed>)     = 41
18    timerfd_settime(41, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=250000000}, it_value={tv_sec=16839360, tv_nsec=0}}, NULL) = 0
18    epoll_ctl(8, EPOLL_CTL_ADD, 41, {EPOLLIN|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=2269406464, u64=140271606326528}} <unfinished ...>
18    <... epoll_ctl resumed>)          = 0
18    write(7, "\1\0\0\0\0\0\0\0", 8)   = 8
18    epoll_wait(8,  <unfinished ...>
18    <... epoll_wait resumed>[{EPOLLIN, {u32=2269406464, u64=140271606326528}}], 256, -1) = 1
18    read(41, "\1\0\0\0\0\0\0\0", 8)   = 8
18    epoll_wait(8,  <unfinished ...>
18    <... epoll_wait resumed>[], 256, 0) = 0
18    epoll_wait(8,  <unfinished ...>
18    <... epoll_wait resumed>[{EPOLLIN, {u32=2269406464, u64=140271606326528}}], 256, -1) = 1
18    read(41, "\1\0\0\0\0\0\0\0", 8)   = 8
18    epoll_wait(8, [], 256, 0)         = 0
18    epoll_wait(8, [{EPOLLIN, {u32=2269406464, u64=140271606326528}}], 256, -1) = 1
18    read(41, "\1\0\0\0\0\0\0\0", 8)   = 8
18    epoll_wait(8, [], 256, 0)         = 0
18    epoll_wait(8, [{EPOLLIN, {u32=2269406464, u64=140271606326528}}], 256, -1) = 1
18    read(41, "\1\0\0\0\0\0\0\0", 8)   = 8
18    epoll_wait(8, [], 256, 0)         = 0
18    epoll_wait(8,  <unfinished ...>
18    <... epoll_wait resumed>[{EPOLLIN, {u32=2269396208, u64=140271606316272}}, {EPOLLIN, {u32=2269405504, u64=140271606325568}}, {EPOLLIN, {u32=2269406304, u64=140271606326368}}, {EPOLLIN, {u32=3559107912, u64=140272896027976}}, {EPOLLIN, {u32=2269406464, u64=140271606326528}}], 256, -1) = 5
18    read(40,  <unfinished ...>
18    <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8
18    epoll_wait(8,  <unfinished ...>
18    <... epoll_wait resumed>[{EPOLLIN, {u32=2269396208, u64=140271606316272}}, {EPOLLIN, {u32=2269405504, u64=140271606325568}}, {EPOLLIN, {u32=3559107912, u64=140272896027976}}, {EPOLLIN, {u32=2269406464, u64=140271606326528}}], 256, 0) = 4
18    read(41, "\1\0\0\0\0\0\0\0", 8)   = 8
18    epoll_wait(8, [{EPOLLIN, {u32=2269396208, u64=140271606316272}}, {EPOLLIN, {u32=2269405504, u64=140271606325568}}, {EPOLLIN, {u32=3559107912, u64=140272896027976}}], 256, 0) = 3
18    read(36, "\1\0\0\0\0\0\0\0", 8)   = 8
18    epoll_wait(8, [{EPOLLIN, {u32=2269396208, u64=140271606316272}}, {EPOLLIN, {u32=2269405504, u64=140271606325568}}], 256, 0) = 2
18    read(38, "\1\0\0\0\0\0\0\0", 8)   = 8
18    epoll_wait(8, [{EPOLLIN, {u32=2269405504, u64=140271606325568}}], 256, 0) = 1
18    read(39, "\1\0\0\0\0\0\0\0", 8)   = 8
18    openat(AT_FDCWD, "/proc/stat", O_RDONLY) = 42
18    fstat(42, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
18    read(42, "cpu  531693681 1691808 251625926"..., 1024) = 1024
18    close(42)                         = 0
18    getpid()                          = 10
18    epoll_wait(8, [], 256, 0)         = 0
18    epoll_wait(8,  <unfinished ...>
18    <... epoll_wait resumed>[{EPOLLIN, {u32=2269406464, u64=140271606326528}}], 256, -1) = 1
18    read(41, "\1\0\0\0\0\0\0\0", 8)   = 8
18    epoll_wait(8, [], 256, 0)         = 0
18    epoll_wait(8, [{EPOLLIN, {u32=2269406464, u64=140271606326528}}], 256, -1) = 1
18    read(41, "\1\0\0\0\0\0\0\0", 8)   = 8
18    epoll_wait(8, [], 256, 0)         = 0
18    epoll_wait(8, [{EPOLLIN, {u32=2269406464, u64=140271606326528}}], 256, -1) = 1
18    read(41, "\1\0\0\0\0\0\0\0", 8)   = 8
18    epoll_wait(8, [], 256, 0)         = 0
18    epoll_wait(8, [{EPOLLIN, {u32=3559107912, u64=140272896027976}}, {EPOLLIN, {u32=2269396208, u64=140271606316272}}, {EPOLLIN, {u32=2269405504, u64=140271606325568}}, {EPOLLIN, {u32=2269406464, u64=140271606326528}}], 256, -1) = 4
18    read(41, "\1\0\0\0\0\0\0\0", 8)   = 8
18    epoll_wait(8, [{EPOLLIN, {u32=3559107912, u64=140272896027976}}, {EPOLLIN, {u32=2269396208, u64=140271606316272}}, {EPOLLIN, {u32=2269405504, u64=140271606325568}}], 256, 0) = 3
18    read(36, "\1\0\0\0\0\0\0\0", 8)   = 8
18    write(27, "@fD\207\223\177\0\0", 8) = 8
18    write(10, "P5E\207\223\177\0\0", 8 <unfinished ...>
18    <... write resumed>)              = 8
18    epoll_wait(8, [{EPOLLIN, {u32=2269396208, u64=140271606316272}}, {EPOLLIN, {u32=2269405504, u64=140271606325568}}, {EPOLLIN, {u32=3559108056, u64=140272896028120}}], 256, 0) = 3
18    read(38, "\1\0\0\0\0\0\0\0", 8)   = 8
18    epoll_wait(8,  <unfinished ...>
18    <... epoll_wait resumed>[{EPOLLIN, {u32=2269405504, u64=140271606325568}}, {EPOLLIN, {u32=3559108056, u64=140272896028120}}], 256, 0) = 2
18    read(39, "\1\0\0\0\0\0\0\0", 8)   = 8
18    openat(AT_FDCWD, "/proc/stat", O_RDONLY <unfinished ...>
18    <... openat resumed>)             = 42
18    fstat(42,  <unfinished ...>
18    <... fstat resumed>{st_mode=S_IFREG|0444, st_size=0, ...}) = 0
18    read(42,  <unfinished ...>
18    <... read resumed>"cpu  531693722 1691808 251625947"..., 1024) = 1024
18    close(42 <unfinished ...>
18    <... close resumed>)              = 0
18    getpid( <unfinished ...>
18    <... getpid resumed>)             = 10
18    epoll_wait(8, [{EPOLLIN, {u32=3559108056, u64=140272896028120}}, {EPOLLIN, {u32=3559483392, u64=140272896403456}}], 256, 0) = 2
18    read(9, "P5E\207\223\177\0\0", 8) = 8
18    write(2, "fatal: morestack on g0\n", 23) = 23
18    --- SIGTRAP {si_signo=SIGTRAP, si_code=SI_KERNEL} ---
18    +++ killed by SIGTRAP (core dumped) +++

What did you expect to see?

Same behavior as with GO_VERSION=1.20.

What did you see instead?

fatal: morestack on g0

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Sep 4, 2023
@seankhliao
Member

I think it was mentioned here #62130 (comment)

Member

prattmic commented Sep 4, 2023

Thanks for the reference @seankhliao. This sounds like the same issue mentioned in that comment, but overall unrelated to #62130. There is no PHP involved here as far as I can tell.

Fluent Bit doesn't seem like the kind of project that would be doing fancy stack manipulation. @VirrageS or @PettitWesley, do you know if Fluent Bit has some sort of non-standard behavior we should know about?

Contributor

aarzilli commented Sep 4, 2023

Fluent Bit doesn't seem like the kind of project that would be doing fancy stack manipulation

It's in there but I don't know if it's relevant for this particular instance.

Member

prattmic commented Sep 4, 2023

It looks like there is stack manipulation going on.

The stack pointer changes quite a bit between calls:

Thread 8 "flb-pipeline" hit Breakpoint 1, runtime.cgocallback () at /usr/local/go/src/runtime/asm_amd64.s:918
918     /usr/local/go/src/runtime/asm_amd64.s: No such file or directory.
(gdb) i r sp
sp             0x7fe7ce7f30f8      0x7fe7ce7f30f8
(gdb) c
Continuing.
Thread 8 "flb-pipeline" hit Breakpoint 1, runtime.cgocallback () at /usr/local/go/src/runtime/asm_amd64.s:918
918     /usr/local/go/src/runtime/asm_amd64.s: No such file or directory.
(gdb) i r sp
sp             0x7fe7cdad2a38      0x7fe7cdad2a38
(gdb) bt
#0  runtime.cgocallback () at /usr/local/go/src/runtime/asm_amd64.s:918
#1  0x00007fe819480bfd in crosscall2 () at /usr/local/go/src/runtime/cgo/asm_amd64.s:38
#2  0x00007fe8195b8459 in FLBPluginFlush (data=0x7fe7cda8b640, length=1121, tag=0x7fe7cda00180 "cpu.local") at _cgo_export.c:78
#3  0x000055d57a2d9f74 in proxy_go_output_flush (ctx=0x7fe819c23310, data=0x7fe7cda8b640, size=1121, tag=0x7fe7cda13ef0 "cpu.local", tag_len=9) at /src/fluent-bit/src/proxy/go/go.c:142
#4  0x000055d579cc846c in proxy_cb_flush (event_chunk=0x7fe7cda4b320, out_flush=0x7fe7cda53550, i_ins=0x7fe819ca7400, out_context=0x7fe819c23310, config=0x7fe819c3a400)
    at /src/fluent-bit/src/flb_plugin_proxy.c:57
#5  0x000055d579c1fedd in output_pre_cb_flush () at /src/fluent-bit/include/fluent-bit/flb_output.h:590
#6  0x000055d57a7776c7 in co_init () at /src/fluent-bit/lib/monkey/deps/flb_libco/amd64.c:117
#7  0x0000000000000000 in ?? ()

And we can see that the outermost frame in the backtrace is something called co_init, from a coroutine library used by Fluent Bit: https://github.com/fluent/fluent-bit/blob/master/lib/flb_libco/

So this is effectively the same issue as #62130, though from a completely different library.

Author

VirrageS commented Sep 4, 2023

Fluent Bit doesn't seem like the kind of project that would be doing fancy stack manipulation. @VirrageS or @PettitWesley, do you know if Fluent Bit has some sort of non-standard behavior we should know about?

I'm not really sure :(

It's in there but I don't know if it's relevant for this particular instance.

Looking at the results, most of those seem unrelated, but I might be missing something.

Actually, there is something like https://github.com/fluent/fluent-bit/blob/f7731f3f81b1b32eef36aaa6786d82593d35a841/lib/flb_libco/sjlj.c#L99.


I have bisected the problem to ef0dedc. There was already an issue with this commit: #59678. Maybe there is still something lingering there? Or maybe this commit just uncovered some bug in how fluent-bit interacts with the stack.

Member

prattmic commented Sep 4, 2023

Thanks for the quick bisect; I was about to suggest ef0dedc (https://go.dev/cl/495855) as the reason this case is suddenly causing problems, even though this use case was already unsupported.

Suppose that these calls always occur on a C-created thread (seems very likely given the Go code is a shared object). Prior to CL 495855, every cgocallback would go to needm and fetch a new M, which it initializes with the current stack. Upon return, the M is dropped and the stack can be freely changed without causing issues for the next call.

After CL 495855, the M is initialized on the first callback with the stack at that time. Upon the next call, the M has been cached and the stack bounds are not updated, so if the stack has been changed we will trip over stack bounds checks.

cc @cherrymui @ianlancetaylor @mknyszek @golang/runtime

Regarding #62130 (comment), it isn't immediately obvious to me why this situation is unworkable. Provided that this is a callback from C-to-Go (with no prior Go components up the stack), I don't see why we should care that the stack bounds of the thread have changed. It seems OK to me that such a callback could re-initialize the stack bounds if they have changed since the last call.
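The failure mode described above can be modeled in a few lines. This is an editorial sketch with invented names (`extraM`, `spInBounds`) and made-up addresses, not actual runtime code; it only illustrates why bounds cached on the first callback reject an SP from a stack that C code switched to later:

```go
package main

import "fmt"

// Toy model of the behavior described above: an extra M caches
// its g0 stack bounds on the first C->Go callback. If C code
// later switches to a coroutine stack, the SP of the next
// callback falls outside the cached bounds and the runtime's
// check trips ("fatal: morestack on g0").
type extraM struct {
	stackLo, stackHi uint64 // bounds cached at first callback
}

// spInBounds mimics the bounds check: the callback's stack
// pointer must lie within the cached g0 stack.
func (m *extraM) spInBounds(sp uint64) bool {
	return sp > m.stackLo && sp <= m.stackHi
}

func main() {
	// First callback runs on the pthread stack; bounds are cached.
	m := &extraM{stackLo: 0x7000_0000, stackHi: 0x7080_0000}
	fmt.Println(m.spInBounds(0x7040_0000)) // same stack: passes

	// A later callback after libco swapped to a coroutine stack:
	// the SP is on a different allocation entirely.
	fmt.Println(m.spInBounds(0x6000_0000)) // different stack: fails
}
```

Before CL 495855 this could not happen, because each callback got a fresh M with bounds taken from the stack in use at that moment.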

Member

prattmic commented Sep 4, 2023

As an aside, CL 495855 gets the stack bounds from pthread, but this coroutine library doesn't appear to do anything to update pthread's view, so those bounds won't be accurate at all. In the case of this application, the first call appears to not be on a coroutine, so it works OK. It is only later calls that move to a coroutine.

@gopherbot

Change https://go.dev/cl/525455 mentions this issue: runtime: allow update of system stack bounds


PettitWesley commented Sep 5, 2023

@prattmic Fluent Bit uses simple coroutines that pause and resume, that's all: https://github.com/fluent/fluent-bit/blob/master/DEVELOPER_GUIDE.md#concurrency

https://github.com/fluent/fluent-bit/blob/master/include/fluent-bit/flb_coro.h

I'm not an expert on it unfortunately. @matthewfala can help answer questions. If you can give me specific questions, I can go find the answers from the other maintainers.


matthewfala commented Sep 5, 2023

@prattmic @PettitWesley I'm not too sure how the Go plugins work in Fluent Bit; however, for the C plugins there is "fancy stack manipulation" to get the coroutines to work. On one thread, Fluent Bit swaps multiple tasks/coroutines in and out to allow non-blocking async networking with a JavaScript-like event loop. You can see the assembly for the swap function used on x86 below:

https://github.com/fluent/fluent-bit/blob/master/lib/flb_libco/amd64.c#L88-L105

There are similar files for other platforms.

Member

cherrymui commented Sep 5, 2023

Regarding #62130 (comment), it isn't immediately obvious to me why this situation is unworkable. Provided that this is a callback from C-to-Go (with no prior Go components up the stack), I don't see why we should care that the stack bounds of the thread have changed. It seems OK to me that such a callback could re-initialize the stack bounds if they have changed since the last call.

For #62130 (comment), I think it is more about what we (promise to) support. For cgo in general, we probably never really thought about what we support or not if C code switches stacks.

If the C code switches stacks while there are Go frames on the stack, it may be tricky/difficult. If there is no Go frame at all when the C code switches stacks, it is probably fine. We could set a different g0 stack bounds each time (probably based on the current SP, which would be imprecise, but probably fine).

How big are C coroutine stacks? If they are small (like goroutine stacks), there is a risk of overflow, as the Go runtime generally assumes C stacks are large.
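The "set bounds based on the current SP" idea can be sketched as follows. Everything here is invented for illustration (the `boundsFromSP` name and the 32 KiB guess are not from the runtime); the imprecision of such a window is exactly the trade-off mentioned above:

```go
package main

import "fmt"

// Hypothetical sketch: derive g0 stack bounds from the SP
// observed at callback entry, rather than trusting bounds cached
// on an earlier (possibly different) stack. The room assumed
// below SP is a guess, which is why such bounds are imprecise:
// a small coroutine stack could still overflow the window.
const assumedRoomBelowSP = 32 << 10 // 32 KiB, an invented guess

func boundsFromSP(sp uint64) (lo, hi uint64) {
	return sp - assumedRoomBelowSP, sp
}

func main() {
	lo, hi := boundsFromSP(0x6000_8000)
	fmt.Printf("lo=%#x hi=%#x\n", lo, hi)
}
```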

@cherrymui cherrymui added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Sep 5, 2023
@cherrymui cherrymui added this to the Backlog milestone Sep 5, 2023

matthewfala commented Sep 5, 2023

Stack size per coroutine is configurable in Fluent Bit: https://github.com/fluent/fluent-bit/blob/master/src/flb_config.c#L152C6-L152C34

Fluent Bit Documentation on setting coro stack size

Stack size depends on platform/OS. By default it is 1.5x to 3x PTHREAD_STACK_MIN (the minimum pthread stack size plus some buffer), or the page size, whichever is larger.

Calculated by:

(3 * STACK_FACTOR * PTHREAD_STACK_MIN) / 2

Please see:
Stack configuration page size - https://github.com/fluent/fluent-bit/blob/master/src/flb_config.c#L293-L298
Stack size default calculation - https://github.com/fluent/fluent-bit/blob/master/include/fluent-bit/flb_coro.h#L64-L78
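Plugging the formula above into code gives a sketch of the default. The constants here are assumptions for illustration only: PTHREAD_STACK_MIN and the page size vary by platform, and STACK_FACTOR is taken as 1, not Fluent Bit's actual value.

```go
package main

import "fmt"

// Illustrative only: recompute the default coroutine stack size
// from the formula quoted above, under assumed platform values.
const (
	pthreadStackMin = 16 << 10 // assume 16 KiB PTHREAD_STACK_MIN
	pageSize        = 4 << 10  // assume 4 KiB pages
	stackFactor     = 1        // assumed; real factor scales 1.5x up to 3x
)

func defaultCoroStackSize() int {
	size := (3 * stackFactor * pthreadStackMin) / 2
	if size < pageSize {
		return pageSize // take whichever is larger
	}
	return size
}

func main() {
	fmt.Println(defaultCoroStackSize()) // 24576 with these assumptions
}
```

With these assumptions the default is only 24 KiB, which illustrates the overflow concern above: the Go runtime generally assumes C stacks are much larger than that.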

Member

prattmic commented Sep 8, 2023

@VirrageS could you verify that your application works properly with https://go.dev/cl/525455?

(If you aren't familiar with building the toolchain from a CL, you can use gotip download 525455: https://pkg.go.dev/golang.org/dl/gotip)


sergystepanov commented Sep 8, 2023

@prattmic Just wanted to say that the gotip version fixed the same issue in our project (tested on Linux; 1.21 on Windows didn't have this problem). Thanks.
(Also, it would be great if you decide to tackle that libco problem in PHP Fibers, so we don't have to use silly workarounds like #39411 or eventually move away from Go/cgo completely.)

@VirrageS
Author

@prattmic Sorry for the late response.

I have just tested the CL and it works perfectly!

In case someone else wants to confirm, I have added a branch to my repro repo that downloads the CL, builds the plugin using gotip, and runs fluent-bit with the built plugin: https://github.com/VirrageS/fluent-bit-go-bug/tree/cl525455.

@gopherbot

Change https://go.dev/cl/527317 mentions this issue: Revert "runtime: allow update of system stack bounds on callback from C thread"

@gopherbot

Change https://go.dev/cl/527316 mentions this issue: Revert "runtime: set stackguard1 on extra M g0"

@prattmic prattmic reopened this Sep 11, 2023
gopherbot pushed a commit that referenced this issue Sep 11, 2023
This reverts CL 527056.

CL 525455 breaks darwin, alpine, and android. This CL must be reverted
in order to revert that CL.

For #62440.

Change-Id: I4e1b16e384b475a605e0214ca36c918d50faa22c
Reviewed-on: https://go-review.googlesource.com/c/go/+/527316
Reviewed-by: Cherry Mui <cherryyz@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
gopherbot pushed a commit that referenced this issue Sep 11, 2023
… C thread"

This reverts CL 525455. The test fails to build on darwin, alpine, and
android.

For #62440.

Change-Id: I39c6b1e16499bd61e0f166de6c6efe7a07961e62
Reviewed-on: https://go-review.googlesource.com/c/go/+/527317
Auto-Submit: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
@dmitshur dmitshur modified the milestones: Backlog, Go1.22 Sep 11, 2023
@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. labels Sep 11, 2023
rhnasc pushed a commit to inloco/loki that referenced this issue Apr 12, 2024
grafana#11904)

**Issue**:

Since version `grafana/fluent-bit-grafana-loki` >= `2.9.2` we could not
run the client: during the startup/forward process the
container/service crashes and exits with the error message:

`fatal: morestack on g0`

refs: 
- grafana#10944
- golang/go#62440


**What this PR does / why we need it**:
A fix was released in Go 1.22.0, and after some tests (updating the
Go version used for the fluent-bit base image), it seems to be working
fine now.

**Which issue(s) this PR fixes**:
Fixes grafana#10944
Contributor

podtserkovskiy commented May 2, 2024

@prattmic I've just run the original repro, and it panicked on go1.22.2 on my first attempt.
Then I tried again, but it didn't panic.
So I can confirm that the repro still triggers the crash, but I don't know how to reproduce it reliably.

Maybe it makes sense to leave it running for a while; it may eventually fail.

I'm attaching my logs in case they are valuable.

Logs
➜  git clone git@github.com:VirrageS/fluent-bit-go-bug.git
➜  cd fluent-bit-go-bug
➜  podman machine init
➜  podman machine set --rootful
➜  podman machine start
➜  podman build --build-arg="GO_VERSION=1.22.2" -t fluentbit .
➜  podman run -t fluentbit
Fluent Bit v2.1.8
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2024/05/02 19:04:07] [ info] [fluent bit] version=2.1.8, commit=1d83649441, pid=1
[2024/05/02 19:04:07] [ info] [storage] ver=1.4.0, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2024/05/02 19:04:07] [ info] [cmetrics] version=0.6.3
[2024/05/02 19:04:07] [ info] [ctraces ] version=0.3.1
[2024/05/02 19:04:07] [ info] [input:cpu:cpu.0] initializing
[2024/05/02 19:04:07] [ info] [input:cpu:cpu.0] storage_strategy='memory' (memory only)
[flb-go] plugin parameter = ''
[2024/05/02 19:04:07] [ info] [sp] stream processor started
[2024/05/02 19:04:07] [ info] [output:stdout:stdout.0] worker #0 started
[0] cpu.local: [[1714676647.156467772, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000, "cpu0.p_cpu"=>0.000000, "cpu0.p_user"=>0.000000, "cpu0.p_system"=>0.000000, "cpu1.p_cpu"=>0.000000, "cpu1.p_user"=>0.000000, "cpu1.p_system"=>0.000000, "cpu2.p_cpu"=>0.000000, "cpu2.p_user"=>0.000000, "cpu2.p_system"=>0.000000, "cpu3.p_cpu"=>0.000000, "cpu3.p_user"=>0.000000, "cpu3.p_system"=>0.000000, "cpu4.p_cpu"=>0.000000, "cpu4.p_user"=>0.000000, "cpu4.p_system"=>0.000000}]
[0] cpu.local: [2024-05-02 19:04:07.156467772 +0000 UTC, {"cpu_p": 0, "system_p": 0, "cpu0.p_system": 0, "cpu1.p_system": 0, "cpu3.p_cpu": 0, "cpu4.p_user": 0, "cpu0.p_cpu": 0, "cpu2.p_cpu": 0, "cpu3.p_system": 0, "user_p": 0, "cpu0.p_user": 0, "cpu2.p_user": 0, "cpu3.p_user": 0, "cpu4.p_system": 0, "cpu1.p_cpu": 0, "cpu1.p_user": 0, "cpu2.p_system": 0, "cpu4.p_cpu": 0, }
[0] cpu.local: [[1714676648.157772622, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000, "cpu0.p_cpu"=>0.000000, "cpu0.p_user"=>0.000000, "cpu0.p_system"=>0.000000, "cpu1.p_cpu"=>0.000000, "cpu1.p_user"=>0.000000, "cpu1.p_system"=>0.000000, "cpu2.p_cpu"=>0.000000, "cpu2.p_user"=>0.000000, "cpu2.p_system"=>0.000000, "cpu3.p_cpu"=>0.000000, "cpu3.p_user"=>0.000000, "cpu3.p_system"=>0.000000, "cpu4.p_cpu"=>0.000000, "cpu4.p_user"=>0.000000, "cpu4.p_system"=>0.000000}]
[0] cpu.local: [2024-05-02 19:04:08.157772622 +0000 UTC, {"cpu0.p_cpu": 0, "cpu0.p_user": 0, "cpu1.p_cpu": 0, "cpu2.p_system": 0, "cpu3.p_user": 0, "cpu4.p_cpu": 0, "user_p": 0, "cpu1.p_user": 0, "cpu2.p_cpu": 0, "cpu1.p_system": 0, "cpu3.p_cpu": 0, "cpu4.p_user": 0, "cpu4.p_system": 0, "cpu_p": 0, "system_p": 0, "cpu0.p_system": 0, "cpu2.p_user": 0, "cpu3.p_system": 0, }
[0] cpu.local: [[1714676649.157505176, {}], {"cpu_p"=>0.200000, "user_p"=>0.000000, "system_p"=>0.200000, "cpu0.p_cpu"=>1.000000, "cpu0.p_user"=>0.000000, "cpu0.p_system"=>1.000000, "cpu1.p_cpu"=>0.000000, "cpu1.p_user"=>0.000000, "cpu1.p_system"=>0.000000, "cpu2.p_cpu"=>0.000000, "cpu2.p_user"=>0.000000, "cpu2.p_system"=>0.000000, "cpu3.p_cpu"=>0.000000, "cpu3.p_user"=>0.000000, "cpu3.p_system"=>0.000000, "cpu4.p_cpu"=>1.000000, "cpu4.p_user"=>1.000000, "cpu4.p_system"=>0.000000}]
[0] cpu.local: [2024-05-02 19:04:09.157505176 +0000 UTC, {"user_p": 0, "cpu2.p_system": 0, "cpu3.p_user": 0, "cpu4.p_system": 0, "system_p": 0.2, "cpu0.p_system": 1, "cpu1.p_cpu": 0, "cpu3.p_system": 0, "cpu4.p_user": 1, "cpu_p": 0.2, "cpu0.p_user": 0, "cpu1.p_user": 0, "cpu1.p_system": 0, "cpu2.p_cpu": 0, "cpu0.p_cpu": 1, "cpu2.p_user": 0, "cpu3.p_cpu": 0, "cpu4.p_cpu": 1, }
[0] cpu.local: [[1714676650.161379825, {}], {"cpu_p"=>0.400000, "user_p"=>0.200000, "system_p"=>0.200000, "cpu0.p_cpu"=>0.000000, "cpu0.p_user"=>0.000000, "cpu0.p_system"=>0.000000, "cpu1.p_cpu"=>1.000000, "cpu1.p_user"=>1.000000, "cpu1.p_system"=>0.000000, "cpu2.p_cpu"=>0.000000, "cpu2.p_user"=>0.000000, "cpu2.p_system"=>0.000000, "cpu3.p_cpu"=>1.000000, "cpu3.p_user"=>1.000000, "cpu3.p_system"=>0.000000, "cpu4.p_cpu"=>0.000000, "cpu4.p_user"=>0.000000, "cpu4.p_system"=>0.000000}]
[0] cpu.local: [2024-05-02 19:04:10.161379825 +0000 UTC, {"cpu3.p_system": 0, "cpu4.p_cpu": 0, "cpu_p": 0.4, "user_p": 0.2, "cpu0.p_user": 0, "cpu0.p_system": 0, "cpu2.p_cpu": 0, "cpu3.p_user": 1, "system_p": 0.2, "cpu0.p_cpu": 0, "cpu1.p_user": 1, "cpu2.p_user": 0, "cpu3.p_cpu": 1, "cpu4.p_user": 0, "cpu1.p_cpu": 1, "cpu1.p_system": 0, "cpu2.p_system": 0, "cpu4.p_system": 0, }
[0] cpu.local: [2024-05-02 19:04:11.156551268 +0000 UTC, {"system_p": 0, "cpu0.p_user": 0, "cpu1.p_user": 0, "cpu2.p_system": 1, "cpu4.p_system": 0, "cpu3.p_user": 0, "user_p": 0.2, "cpu1.p_cpu": 0, "cpu1.p_system": 0, "cpu2.p_cpu": 1, "cpu2.p_user": 0, "cpu0.p_cpu": 0, "cpu0.p_system": 0, "cpu3.p_cpu": 1, "cpu3.p_system": 1, "cpu4.p_user": 0, "cpu_p": 0.2, "cpu4.p_cpu": 0, }
[0] cpu.local: [[1714676651.156551268, {}], {"cpu_p"=>0.200000, "user_p"=>0.200000, "system_p"=>0.000000, "cpu0.p_cpu"=>0.000000, "cpu0.p_user"=>0.000000, "cpu0.p_system"=>0.000000, "cpu1.p_cpu"=>0.000000, "cpu1.p_user"=>0.000000, "cpu1.p_system"=>0.000000, "cpu2.p_cpu"=>1.000000, "cpu2.p_user"=>0.000000, "cpu2.p_system"=>1.000000, "cpu3.p_cpu"=>1.000000, "cpu3.p_user"=>0.000000, "cpu3.p_system"=>1.000000, "cpu4.p_cpu"=>0.000000, "cpu4.p_user"=>0.000000, "cpu4.p_system"=>0.000000}]
[0] cpu.local: [[1714676652.158199801, {}], {"cpu_p"=>0.200000, "user_p"=>0.000000, "system_p"=>0.200000, "cpu0.p_cpu"=>0.000000, "cpu0.p_user"=>0.000000, "cpu0.p_system"=>0.000000, "cpu1.p_cpu"=>0.000000, "cpu1.p_user"=>0.000000, "cpu1.p_system"=>0.000000, "cpu2.p_cpu"=>0.000000, "cpu2.p_user"=>0.000000, "cpu2.p_system"=>0.000000, "cpu3.p_cpu"=>0.000000, "cpu3.p_user"=>0.000000, "cpu3.p_system"=>0.000000, "cpu4.p_cpu"=>1.000000, "cpu4.p_user"=>0.000000, "cpu4.p_system"=>1.000000}]
[0] cpu.local: [2024-05-02 19:04:12.158199801 +0000 UTC, {"cpu2.p_user": 0, "cpu4.p_cpu": 1, "cpu1.p_system": 0, "cpu2.p_cpu": 0, "cpu2.p_system": 0, "cpu0.p_system": 0, "cpu1.p_cpu": 0, "cpu1.p_user": 0, "cpu3.p_system": 0, "cpu4.p_user": 0, "cpu0.p_cpu": 0, "cpu0.p_user": 0, "cpu3.p_user": 0, "cpu3.p_cpu": 0, "cpu4.p_system": 1, "cpu_p": 0.2, "user_p": 0, "system_p": 0.2, }
[0] cpu.local: [[1714676653.157756456, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000, "cpu0.p_cpu"=>1.000000, "cpu0.p_user"=>1.000000, "cpu0.p_system"=>0.000000, "cpu1.p_cpu"=>0.000000, "cpu1.p_user"=>0.000000, "cpu1.p_system"=>0.000000, "cpu2.p_cpu"=>0.000000, "cpu2.p_user"=>0.000000, "cpu2.p_system"=>0.000000, "cpu3.p_cpu"=>0.000000, "cpu3.p_user"=>0.000000, "cpu3.p_system"=>0.000000, "cpu4.p_cpu"=>0.000000, "cpu4.p_user"=>0.000000, "cpu4.p_system"=>0.000000}]
[0] cpu.local: [2024-05-02 19:04:13.157756456 +0000 UTC, {"cpu1.p_cpu": 0, "cpu1.p_user": 0, "cpu2.p_cpu": 0, "cpu2.p_user": 0, "cpu4.p_cpu": 0, "cpu4.p_user": 0, "user_p": 0, "cpu3.p_user": 0, "cpu1.p_system": 0, "cpu2.p_system": 0, "cpu3.p_system": 0, "cpu0.p_cpu": 1, "system_p": 0, "cpu0.p_user": 1, "cpu0.p_system": 0, "cpu3.p_cpu": 0, "cpu4.p_system": 0, "cpu_p": 0, }
[0] cpu.local: [[1714676654.156478069, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000, "cpu0.p_cpu"=>0.000000, "cpu0.p_user"=>0.000000, "cpu0.p_system"=>0.000000, "cpu1.p_cpu"=>0.000000, "cpu1.p_user"=>0.000000, "cpu1.p_system"=>0.000000, "cpu2.p_cpu"=>0.000000, "cpu2.p_user"=>0.000000, "cpu2.p_system"=>0.000000, "cpu3.p_cpu"=>0.000000, "cpu3.p_user"=>0.000000, "cpu3.p_system"=>0.000000, "cpu4.p_cpu"=>0.000000, "cpu4.p_user"=>0.000000, "cpu4.p_system"=>0.000000}]
[0] cpu.local: [2024-05-02 19:04:14.156478069 +0000 UTC, {"cpu0.p_user": 0, "cpu0.p_system": 0, "cpu2.p_system": 0, "cpu4.p_system": 0, "cpu2.p_user": 0, "cpu3.p_system": 0, "cpu4.p_cpu": 0, "cpu_p": 0, "system_p": 0, "cpu1.p_cpu": 0, "cpu1.p_user": 0, "cpu2.p_cpu": 0, "cpu3.p_cpu": 0, "cpu3.p_user": 0, "user_p": 0, "cpu0.p_cpu": 0, "cpu1.p_system": 0, "cpu4.p_user": 0, }
[0] cpu.local: [[1714676655.158473270, {}], {"cpu_p"=>0.200000, "user_p"=>0.000000, "system_p"=>0.200000, "cpu0.p_cpu"=>0.000000, "cpu0.p_user"=>0.000000, "cpu0.p_system"=>0.000000, "cpu1.p_cpu"=>0.000000, "cpu1.p_user"=>0.000000, "cpu1.p_system"=>0.000000, "cpu2.p_cpu"=>0.000000, "cpu2.p_user"=>0.000000, "cpu2.p_system"=>0.000000, "cpu3.p_cpu"=>0.000000, "cpu3.p_user"=>0.000000, "cpu3.p_system"=>0.000000, "cpu4.p_cpu"=>0.000000, "cpu4.p_user"=>0.000000, "cpu4.p_system"=>0.000000}]
[0] cpu.local: [2024-05-02 19:04:15.15847327 +0000 UTC, {"system_p": 0.2, "cpu1.p_cpu": 0, "cpu2.p_user": 0, "cpu4.p_system": 0, "cpu_p": 0.2, "cpu0.p_user": 0, "cpu2.p_system": 0, "cpu3.p_cpu": 0, "cpu3.p_user": 0, "user_p": 0, "cpu0.p_system": 0, "cpu4.p_cpu": 0, "cpu0.p_cpu": 0, "cpu1.p_user": 0, "cpu1.p_system": 0, "cpu2.p_cpu": 0, "cpu3.p_system": 0, "cpu4.p_user": 0, }
[0] cpu.local: [[1714676656.156483299, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000, "cpu0.p_cpu"=>0.000000, "cpu0.p_user"=>0.000000, "cpu0.p_system"=>0.000000, "cpu1.p_cpu"=>0.000000, "cpu1.p_user"=>0.000000, "cpu1.p_system"=>0.000000, "cpu2.p_cpu"=>0.000000, "cpu2.p_user"=>0.000000, "cpu2.p_system"=>0.000000, "cpu3.p_cpu"=>0.000000, "cpu3.p_user"=>0.000000, "cpu3.p_system"=>0.000000, "cpu4.p_cpu"=>0.000000, "cpu4.p_user"=>0.000000, "cpu4.p_system"=>0.000000}]
[0] cpu.local: [2024-05-02 19:04:16.156483299 +0000 UTC, {"system_p": 0, "cpu3.p_cpu": 0, "user_p": 0, "cpu0.p_cpu": 0, "cpu2.p_cpu": 0, "cpu2.p_user": 0, "cpu4.p_cpu": 0, "cpu_p": 0, "cpu0.p_system": 0, "cpu1.p_cpu": 0, "cpu1.p_user": 0, "cpu3.p_user": 0, "cpu0.p_user": 0, "cpu1.p_system": 0, "cpu2.p_system": 0, "cpu3.p_system": 0, "cpu4.p_user": 0, "cpu4.p_system": 0, }
[0] cpu.local: [[1714676657.160387001, {}], {"cpu_p"=>0.200000, "user_p"=>0.200000, "system_p"=>0.000000, "cpu0.p_cpu"=>0.000000, "cpu0.p_user"=>0.000000, "cpu0.p_system"=>0.000000, "cpu1.p_cpu"=>0.000000, "cpu1.p_user"=>0.000000, "cpu1.p_system"=>0.000000, "cpu2.p_cpu"=>0.000000, "cpu2.p_user"=>0.000000, "cpu2.p_system"=>0.000000, "cpu3.p_cpu"=>0.000000, "cpu3.p_user"=>0.000000, "cpu3.p_system"=>0.000000, "cpu4.p_cpu"=>0.000000, "cpu4.p_user"=>0.000000, "cpu4.p_system"=>0.000000}]
[0] cpu.local: [2024-05-02 19:04:17.160387001 +0000 UTC, {"cpu4.p_user": 0, "cpu0.p_system": 0, "cpu1.p_system": 0, "cpu1.p_user": 0, "cpu2.p_cpu": 0, "cpu3.p_system": 0, "user_p": 0.2, "cpu0.p_cpu": 0, "cpu2.p_system": 0, "cpu3.p_cpu": 0, "cpu4.p_cpu": 0, "cpu_p": 0.2, "cpu0.p_user": 0, "cpu2.p_user": 0, "cpu3.p_user": 0, "cpu4.p_system": 0, "system_p": 0, "cpu1.p_cpu": 0, }
[0] cpu.local: [2024-05-02 19:04:18.15886653 +0000 UTC, {"cpu0.p_system": 0, "cpu1.p_cpu": 0, "cpu1.p_user": 0, "cpu1.p_system": 0, "cpu3.p_cpu": 0, "cpu3.p_user": 0, "cpu0.p_cpu": 0, "system_p": 0.2, "cpu2.p_cpu": 0, "cpu_p": 0.2, "cpu2.p_system": 0, "cpu3.p_system": 0, "cpu4.p_cpu": 0, "cpu4.p_user": 0, "cpu4.p_system": 0, "user_p": 0, "cpu2.p_user": 0, "cpu0.p_user": 0, }
[0] cpu.local: [[1714676658.158866530, {}], {"cpu_p"=>0.200000, "user_p"=>0.000000, "system_p"=>0.200000, "cpu0.p_cpu"=>0.000000, "cpu0.p_user"=>0.000000, "cpu0.p_system"=>0.000000, "cpu1.p_cpu"=>0.000000, "cpu1.p_user"=>0.000000, "cpu1.p_system"=>0.000000, "cpu2.p_cpu"=>0.000000, "cpu2.p_user"=>0.000000, "cpu2.p_system"=>0.000000, "cpu3.p_cpu"=>0.000000, "cpu3.p_user"=>0.000000, "cpu3.p_system"=>0.000000, "cpu4.p_cpu"=>0.000000, "cpu4.p_user"=>0.000000, "cpu4.p_system"=>0.000000}]
[0] cpu.local: [[1714676659.157350102, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000, "cpu0.p_cpu"=>0.000000, "cpu0.p_user"=>0.000000, "cpu0.p_system"=>0.000000, "cpu1.p_cpu"=>0.000000, "cpu1.p_user"=>0.000000, "cpu1.p_system"=>0.000000, "cpu2.p_cpu"=>0.000000, "cpu2.p_user"=>0.000000, "cpu2.p_system"=>0.000000, "cpu3.p_cpu"=>0.000000, "cpu3.p_user"=>0.000000, "cpu3.p_system"=>0.000000, "cpu4.p_cpu"=>0.000000, "cpu4.p_user"=>0.000000, "cpu4.p_system"=>0.000000}]
[0] cpu.local: [2024-05-02 19:04:19.157350102 +0000 UTC, {"cpu4.p_user": 0, "system_p": 0, "cpu0.p_cpu": 0, "cpu0.p_user": 0, "cpu1.p_system": 0, "cpu2.p_cpu": 0, "cpu2.p_system": 0, "cpu4.p_cpu": 0, "cpu4.p_system": 0, "cpu0.p_system": 0, "cpu3.p_cpu": 0, "cpu3.p_user": 0, "user_p": 0, "cpu2.p_user": 0, "cpu_p": 0, "cpu1.p_cpu": 0, "cpu1.p_user": 0, "cpu3.p_system": 0, }
[0] cpu.local: [[1714676660.156450090, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000, "cpu0.p_cpu"=>0.000000, "cpu0.p_user"=>0.000000, "cpu0.p_system"=>0.000000, "cpu1.p_cpu"=>0.000000, "cpu1.p_user"=>0.000000, "cpu1.p_system"=>0.000000, "cpu2.p_cpu"=>0.000000, "cpu2.p_user"=>0.000000, "cpu2.p_system"=>0.000000, "cpu3.p_cpu"=>0.000000, "cpu3.p_user"=>0.000000, "cpu3.p_system"=>0.000000, "cpu4.p_cpu"=>0.000000, "cpu4.p_user"=>0.000000, "cpu4.p_system"=>0.000000}]
[0] cpu.local: [2024-05-02 19:04:20.15645009 +0000 UTC, {"cpu2.p_user": 0, "cpu3.p_cpu": 0, "cpu3.p_system": 0, "cpu4.p_user": 0, "cpu4.p_system": 0, "system_p": 0, "cpu0.p_system": 0, "cpu2.p_cpu": 0, "cpu1.p_system": 0, "cpu2.p_system": 0, "user_p": 0, "cpu0.p_user": 0, "cpu1.p_cpu": 0, "cpu_p": 0, "cpu0.p_cpu": 0, "cpu1.p_user": 0, "cpu3.p_user": 0, "cpu4.p_cpu": 0, }
[0] cpu.local: [[1714676661.159350000, {}], {"cpu_p"=>0.200000, "user_p"=>0.200000, "system_p"=>0.000000, "cpu0.p_cpu"=>0.000000, "cpu0.p_user"=>0.000000, "cpu0.p_system"=>0.000000, "cpu1.p_cpu"=>0.000000, "cpu1.p_user"=>0.000000, "cpu1.p_system"=>0.000000, "cpu2.p_cpu"=>0.000000, "cpu2.p_user"=>0.000000, "cpu2.p_system"=>0.000000, "cpu3.p_cpu"=>0.000000, "cpu3.p_user"=>0.000000, "cpu3.p_system"=>0.000000, "cpu4.p_cpu"=>0.000000, "cpu4.p_user"=>0.000000, "cpu4.p_system"=>0.000000}]
[0] cpu.local: [2024-05-02 19:04:21.15935 +0000 UTC, {"cpu_p": 0.2, "cpu0.p_cpu": 0, "cpu1.p_cpu": 0, "cpu1.p_system": 0, "cpu3.p_user": 0, "system_p": 0, "cpu1.p_user": 0, "cpu2.p_cpu": 0, "cpu2.p_user": 0, "cpu4.p_cpu": 0, "user_p": 0.2, "cpu0.p_user": 0, "cpu2.p_system": 0, "cpu4.p_system": 0, "cpu0.p_system": 0, "cpu3.p_cpu": 0, "cpu3.p_system": 0, "cpu4.p_user": 0, }
[0] cpu.local: [[1714676662.159564073, {}], {"cpu_p"=>0.200000, "user_p"=>0.000000, "system_p"=>0.200000, "cpu0.p_cpu"=>0.000000, "cpu0.p_user"=>0.000000, "cpu0.p_system"=>0.000000, "cpu1.p_cpu"=>0.000000, "cpu1.p_user"=>0.000000, "cpu1.p_system"=>0.000000, "cpu2.p_cpu"=>0.000000, "cpu2.p_user"=>0.000000, "cpu2.p_system"=>0.000000, "cpu3.p_cpu"=>0.000000, "cpu3.p_user"=>0.000000, "cpu3.p_system"=>0.000000, "cpu4.p_cpu"=>1.000000, "cpu4.p_user"=>1.000000, "cpu4.p_system"=>0.000000}]
[0] cpu.local: [2024-05-02 19:04:22.159564073 +0000 UTC, {"cpu2.p_cpu": 0, "cpu4.p_system": 0, "cpu_p": 0.2, "cpu1.p_system": 0, "cpu2.p_system": 0, "cpu3.p_system": 0, "cpu4.p_user": 1, "cpu4.p_cpu": 1, "cpu0.p_cpu": 0, "cpu0.p_user": 0, "cpu0.p_system": 0, "cpu3.p_cpu": 0, "cpu3.p_user": 0, "user_p": 0, "system_p": 0.2, "cpu1.p_cpu": 0, "cpu1.p_user": 0, "cpu2.p_user": 0, }
[0] cpu.local: [[1714676663.156365850, {}], {"cpu_p"=>0.200000, "user_p"=>0.200000, "system_p"=>0.000000, "cpu0.p_cpu"=>1.000000, "cpu0.p_user"=>0.000000, "cpu0.p_system"=>1.000000, "cpu1.p_cpu"=>1.000000, "cpu1.p_user"=>0.000000, "cpu1.p_system"=>1.000000, "cpu2.p_cpu"=>0.000000, "cpu2.p_user"=>0.000000, "cpu2.p_system"=>0.000000, "cpu3.p_cpu"=>0.000000, "cpu3.p_user"=>0.000000, "cpu3.p_system"=>0.000000, "cpu4.p_cpu"=>0.000000, "cpu4.p_user"=>0.000000, "cpu4.p_system"=>0.000000}]
[0] cpu.local: [2024-05-02 19:04:23.15636585 +0000 UTC, {"cpu1.p_system": 1, "cpu2.p_user": 0, "cpu2.p_system": 0, "cpu4.p_cpu": 0, "cpu4.p_user": 0, "cpu_p": 0.2, "cpu0.p_system": 1, "cpu3.p_system": 0, "cpu0.p_user": 0, "cpu1.p_user": 0, "cpu0.p_cpu": 1, "cpu1.p_cpu": 1, "cpu2.p_cpu": 0, "cpu3.p_cpu": 0, "cpu3.p_user": 0, "cpu4.p_system": 0, "user_p": 0.2, "system_p": 0, }
[0] cpu.local: [[1714676664.158276800, {}], {"cpu_p"=>0.200000, "user_p"=>0.000000, "system_p"=>0.200000, "cpu0.p_cpu"=>0.000000, "cpu0.p_user"=>0.000000, "cpu0.p_system"=>0.000000, "cpu1.p_cpu"=>0.000000, "cpu1.p_user"=>0.000000, "cpu1.p_system"=>0.000000, "cpu2.p_cpu"=>1.000000, "cpu2.p_user"=>0.000000, "cpu2.p_system"=>1.000000, "cpu3.p_cpu"=>0.000000, "cpu3.p_user"=>0.000000, "cpu3.p_system"=>0.000000, "cpu4.p_cpu"=>0.000000, "cpu4.p_user"=>0.000000, "cpu4.p_system"=>0.000000}]
[0] cpu.local: [2024-05-02 19:04:24.1582768 +0000 UTC, {"user_p": 0, "cpu0.p_cpu": 0, "cpu3.p_cpu": 0, "cpu4.p_cpu": 0, "cpu4.p_system": 0, "system_p": 0.2, "cpu3.p_user": 0, "cpu3.p_system": 0, "cpu4.p_user": 0, "cpu0.p_user": 0, "cpu0.p_system": 0, "cpu2.p_user": 0, "cpu2.p_system": 1, "cpu_p": 0.2, "cpu1.p_cpu": 0, "cpu1.p_user": 0, "cpu1.p_system": 0, "cpu2.p_cpu": 1, }
[0] cpu.local: [[1714676665.158120998, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000, "cpu0.p_cpu"=>0.000000, "cpu0.p_user"=>0.000000, "cpu0.p_system"=>0.000000, "cpu1.p_cpu"=>0.000000, "cpu1.p_user"=>0.000000, "cpu1.p_system"=>0.000000, "cpu2.p_cpu"=>0.000000, "cpu2.p_user"=>0.000000, "cpu2.p_system"=>0.000000, "cpu3.p_cpu"=>0.000000, "cpu3.p_user"=>0.000000, "cpu3.p_system"=>0.000000, "cpu4.p_cpu"=>1.000000, "cpu4.p_user"=>0.000000, "cpu4.p_system"=>1.000000}]
[0] cpu.local: [2024-05-02 19:04:25.158120998 +0000 UTC, {"cpu_p": 0, "cpu0.p_user": 0, "cpu1.p_user": 0, "cpu1.p_system": 0, "cpu2.p_user": 0, "cpu3.p_cpu": 0, "cpu4.p_system": 1, "cpu0.p_system": 0, "cpu3.p_system": 0, "user_p": 0, "system_p": 0, "cpu0.p_cpu": 0, "cpu1.p_cpu": 0, "cpu2.p_system": 0, "cpu3.p_user": 0, "cpu4.p_cpu": 1, "cpu2.p_cpu": 0, "cpu4.p_user": 0, }
[0] cpu.local: [[1714676666.159100989, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000, "cpu0.p_cpu"=>0.000000, "cpu0.p_user"=>0.000000, "cpu0.p_system"=>0.000000, "cpu1.p_cpu"=>0.000000, "cpu1.p_user"=>0.000000, "cpu1.p_system"=>0.000000, "cpu2.p_cpu"=>0.000000, "cpu2.p_user"=>0.000000, "cpu2.p_system"=>0.000000, "cpu3.p_cpu"=>0.000000, "cpu3.p_user"=>0.000000, "cpu3.p_system"=>0.000000, "cpu4.p_cpu"=>0.000000, "cpu4.p_user"=>0.000000, "cpu4.p_system"=>0.000000}]
[0] cpu.local: [2024-05-02 19:04:26.159100989 +0000 UTC, {"cpu0.p_user": 0, "cpu1.p_system": 0, "cpu2.p_user": 0, "cpu3.p_cpu": 0, "cpu_p": 0, "user_p": 0, "system_p": 0, "cpu0.p_system": 0, "cpu1.p_cpu": 0, "cpu1.p_user": 0, "cpu2.p_cpu": 0, "cpu2.p_system": 0, "cpu3.p_system": 0, "cpu4.p_user": 0, "cpu4.p_cpu": 0, "cpu0.p_cpu": 0, "cpu3.p_user": 0, "cpu4.p_system": 0, }
[0] cpu.local: [[1714676667.160238229, {}], {"cpu_p"=>0.200000, "user_p"=>0.000000, "system_p"=>0.200000, "cpu0.p_cpu"=>0.000000, "cpu0.p_user"=>0.000000, "cpu0.p_system"=>0.000000, "cpu1.p_cpu"=>0.000000, "cpu1.p_user"=>0.000000, "cpu1.p_system"=>0.000000, "cpu2.p_cpu"=>0.000000, "cpu2.p_user"=>0.000000, "cpu2.p_system"=>0.000000, "cpu3.p_cpu"=>0.000000, "cpu3.p_user"=>0.000000, "cpu3.p_system"=>0.000000, "cpu4.p_cpu"=>0.000000, "cpu4.p_user"=>0.000000, "cpu4.p_system"=>0.000000}]
[0] cpu.local: [2024-05-02 19:04:27.160238229 +0000 UTC, {"cpu2.p_cpu": 0, "cpu4.p_user": 0, "cpu_p": 0.2, "cpu0.p_system": 0, "cpu1.p_cpu": 0, "cpu4.p_cpu": 0, "cpu4.p_system": 0, "user_p": 0, "cpu2.p_system": 0, "cpu3.p_user": 0, "system_p": 0.2, "cpu2.p_user": 0, "cpu3.p_system": 0, "cpu1.p_system": 0, "cpu3.p_cpu": 0, "cpu0.p_cpu": 0, "cpu0.p_user": 0, "cpu1.p_user": 0, }
[0] cpu.local: [[1714676668.158483175, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000, "cpu0.p_cpu"=>0.000000, "cpu0.p_user"=>0.000000, "cpu0.p_system"=>0.000000, "cpu1.p_cpu"=>0.000000, "cpu1.p_user"=>0.000000, "cpu1.p_system"=>0.000000, "cpu2.p_cpu"=>0.000000, "cpu2.p_user"=>0.000000, "cpu2.p_system"=>0.000000, "cpu3.p_cpu"=>0.000000, "cpu3.p_user"=>0.000000, "cpu3.p_system"=>0.000000, "cpu4.p_cpu"=>0.000000, "cpu4.p_user"=>0.000000, "cpu4.p_system"=>0.000000}]
[0] cpu.local: [2024-05-02 19:04:28.158483175 +0000 UTC, {"system_p": 0, "cpu0.p_user": 0, "cpu3.p_system": 0, "cpu4.p_system": 0, "cpu_p": 0, "user_p": 0, "cpu0.p_cpu": 0, "cpu1.p_cpu": 0, "cpu1.p_system": 0, "cpu3.p_cpu": 0, "cpu4.p_user": 0, "cpu0.p_system": 0, "cpu1.p_user": 0, "cpu2.p_cpu": 0, "cpu2.p_user": 0, "cpu2.p_system": 0, "cpu3.p_user": 0, "cpu4.p_cpu": 0, }
[0] cpu.local: [[1714676669.156824496, {}], {"cpu_p"=>0.200000, "user_p"=>0.200000, "system_p"=>0.000000, "cpu0.p_cpu"=>0.000000, "cpu0.p_user"=>0.000000, "cpu0.p_system"=>0.000000, "cpu1.p_cpu"=>0.000000, "cpu1.p_user"=>0.000000, "cpu1.p_system"=>0.000000, "cpu2.p_cpu"=>0.000000, "cpu2.p_user"=>0.000000, "cpu2.p_system"=>0.000000, "cpu3.p_cpu"=>1.000000, "cpu3.p_user"=>0.000000, "cpu3.p_system"=>1.000000, "cpu4.p_cpu"=>0.000000, "cpu4.p_user"=>0.000000, "cpu4.p_system"=>0.000000}]
[0] cpu.local: [2024-05-02 19:04:29.156824496 +0000 UTC, {"system_p": 0, "cpu1.p_system": 0, "cpu3.p_cpu": 1, "cpu3.p_system": 1, "cpu0.p_system": 0, "cpu1.p_user": 0, "cpu2.p_cpu": 0, "cpu2.p_user": 0, "cpu4.p_user": 0, "user_p": 0.2, "cpu0.p_cpu": 0, "cpu0.p_user": 0, "cpu4.p_system": 0, "cpu3.p_user": 0, "cpu4.p_cpu": 0, "cpu_p": 0.2, "cpu1.p_cpu": 0, "cpu2.p_system": 0, }
[0] cpu.local: [2024-05-02 19:04:30.157574362 +0000 UTC, {"cpu_p": 0.2, "system_p": 0.2, "cpu0.p_system": 1, "cpu2.p_cpu": 0, "cpu4.p_cpu": 0, "cpu1.p_system": 0, "cpu2.p_user": 0, "cpu4.p_user": 0, "cpu4.p_system": 0, "cpu0.p_user": 0, "cpu1.p_cpu": 0, "cpu1.p_user": 0, "cpu3.p_cpu": 0, "cpu3.p_user": 0, "user_p": 0, "cpu0.p_cpu": 1, "cpu2.p_system": 0, "cpu3.p_system": 0, }
[0] cpu.local: [[1714676670.157574362, {}], {"cpu_p"=>0.200000, "user_p"=>0.000000, "system_p"=>0.200000, "cpu0.p_cpu"=>1.000000, "cpu0.p_user"=>0.000000, "cpu0.p_system"=>1.000000, "cpu1.p_cpu"=>0.000000, "cpu1.p_user"=>0.000000, "cpu1.p_system"=>0.000000, "cpu2.p_cpu"=>0.000000, "cpu2.p_user"=>0.000000, "cpu2.p_system"=>0.000000, "cpu3.p_cpu"=>0.000000, "cpu3.p_user"=>0.000000, "cpu3.p_system"=>0.000000, "cpu4.p_cpu"=>0.000000, "cpu4.p_user"=>0.000000, "cpu4.p_system"=>0.000000}]
[0] cpu.local: [[1714676671.158952436, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000, "cpu0.p_cpu"=>0.000000, "cpu0.p_user"=>0.000000, "cpu0.p_system"=>0.000000, "cpu1.p_cpu"=>0.000000, "cpu1.p_user"=>0.000000, "cpu1.p_system"=>0.000000, "cpu2.p_cpu"=>0.000000, "cpu2.p_user"=>0.000000, "cpu2.p_system"=>0.000000, "cpu3.p_cpu"=>0.000000, "cpu3.p_user"=>0.000000, "cpu3.p_system"=>0.000000, "cpu4.p_cpu"=>1.000000, "cpu4.p_user"=>1.000000, "cpu4.p_system"=>0.000000}]
[0] cpu.local: [2024-05-02 19:04:31.158952436 +0000 UTC, {"cpu1.p_user": 0, "cpu2.p_cpu": 0, "cpu3.p_cpu": 0, "cpu1.p_system": 0, "cpu2.p_system": 0, "cpu3.p_system": 0, "user_p": 0, "cpu0.p_cpu": 0, "cpu0.p_system": 0, "cpu3.p_user": 0, "cpu4.p_cpu": 1, "cpu_p": 0, "system_p": 0, "cpu1.p_cpu": 0, "cpu4.p_system": 0, "cpu0.p_user": 0, "cpu2.p_user": 0, "cpu4.p_user": 1, }
[0] cpu.local: [[1714676672.158878509, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000, "cpu0.p_cpu"=>0.000000, "cpu0.p_user"=>0.000000, "cpu0.p_system"=>0.000000, "cpu1.p_cpu"=>0.000000, "cpu1.p_user"=>0.000000, "cpu1.p_system"=>0.000000, "cpu2.p_cpu"=>0.000000, "cpu2.p_user"=>0.000000, "cpu2.p_system"=>0.000000, "cpu3.p_cpu"=>0.000000, "cpu3.p_user"=>0.000000, "cpu3.p_system"=>0.000000, "cpu4.p_cpu"=>0.000000, "cpu4.p_user"=>0.000000, "cpu4.p_system"=>0.000000}]
[0] cpu.local: [2024-05-02 19:04:32.158878509 +0000 UTC, {"cpu4.p_user": 0, "cpu_p": 0, "cpu0.p_user": 0, "cpu1.p_user": 0, "cpu2.p_cpu": 0, "cpu2.p_system": 0, "cpu4.p_cpu": 0, "system_p": 0, "cpu1.p_cpu": 0, "cpu3.p_cpu": 0, "cpu3.p_system": 0, "cpu4.p_system": 0, "user_p": 0, "cpu0.p_cpu": 0, "cpu0.p_system": 0, "cpu1.p_system": 0, "cpu2.p_user": 0, "cpu3.p_user": 0, }
[0] cpu.local: [[1714676673.157961831, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000, "cpu0.p_cpu"=>0.000000, "cpu0.p_user"=>0.000000, "cpu0.p_system"=>0.000000, "cpu1.p_cpu"=>0.000000, "cpu1.p_user"=>0.000000, "cpu1.p_system"=>0.000000, "cpu2.p_cpu"=>0.000000, "cpu2.p_user"=>0.000000, "cpu2.p_system"=>0.000000, "cpu3.p_cpu"=>0.000000, "cpu3.p_user"=>0.000000, "cpu3.p_system"=>0.000000, "cpu4.p_cpu"=>0.000000, "cpu4.p_user"=>0.000000, "cpu4.p_system"=>0.000000}]
[0] cpu.local: [2024-05-02 19:04:33.157961831 +0000 UTC, {"cpu4.p_cpu": 0, "cpu4.p_user": 0, "system_p": 0, "cpu0.p_system": 0, "cpu1.p_user": 0, "cpu1.p_system": 0, "cpu3.p_cpu": 0, "cpu0.p_user": 0, "cpu1.p_cpu": 0, "cpu4.p_system": 0, "user_p": 0, "cpu2.p_cpu": 0, "cpu2.p_system": 0, "cpu3.p_system": 0, "cpu_p": 0, "cpu0.p_cpu": 0, "cpu2.p_user": 0, "cpu3.p_user": 0, }
[0] cpu.local: [[1714676674.156378902, {}], {"cpu_p"=>0.400000, "user_p"=>0.200000, "system_p"=>0.200000, "cpu0.p_cpu"=>1.000000, "cpu0.p_user"=>1.000000, "cpu0.p_system"=>0.000000, "cpu1.p_cpu"=>0.000000, "cpu1.p_user"=>0.000000, "cpu1.p_system"=>0.000000, "cpu2.p_cpu"=>0.000000, "cpu2.p_user"=>0.000000, "cpu2.p_system"=>0.000000, "cpu3.p_cpu"=>0.000000, "cpu3.p_user"=>0.000000, "cpu3.p_system"=>0.000000, "cpu4.p_cpu"=>0.000000, "cpu4.p_user"=>0.000000, "cpu4.p_system"=>0.000000}]
[0] cpu.local: [2024-05-02 19:04:34.156378902 +0000 UTC, {"cpu4.p_system": 0, "system_p": 0.2, "cpu0.p_cpu": 1, "cpu2.p_cpu": 0, "cpu2.p_user": 0, "cpu4.p_cpu": 0, "cpu4.p_user": 0, "user_p": 0.2, "cpu0.p_system": 0, "cpu1.p_cpu": 0, "cpu1.p_user": 0, "cpu1.p_system": 0, "cpu3.p_user": 0, "cpu0.p_user": 1, "cpu2.p_system": 0, "cpu3.p_cpu": 0, "cpu3.p_system": 0, "cpu_p": 0.4, }
[0] cpu.local: [[1714676675.156705350, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000, "cpu0.p_cpu"=>0.000000, "cpu0.p_user"=>0.000000, "cpu0.p_system"=>0.000000, "cpu1.p_cpu"=>0.000000, "cpu1.p_user"=>0.000000, "cpu1.p_system"=>0.000000, "cpu2.p_cpu"=>0.000000, "cpu2.p_user"=>0.000000, "cpu2.p_system"=>0.000000, "cpu3.p_cpu"=>1.000000, "cpu3.p_user"=>1.000000, "cpu3.p_system"=>0.000000, "cpu4.p_cpu"=>0.000000, "cpu4.p_user"=>0.000000, "cpu4.p_system"=>0.000000}]
[0] cpu.local: [2024-05-02 19:04:35.15670535 +0000 UTC, {"user_p": 0, "cpu0.p_cpu": 0, "cpu1.p_user": 0, "cpu1.p_system": 0, "cpu4.p_system": 0, "system_p": 0, "cpu0.p_system": 0, "cpu1.p_cpu": 0, "cpu3.p_cpu": 1, "cpu3.p_system": 0, "cpu_p": 0, "cpu2.p_cpu": 0, "cpu4.p_cpu": 0, "cpu4.p_user": 0, "cpu0.p_user": 0, "cpu2.p_user": 0, "cpu2.p_system": 0, "cpu3.p_user": 1, }
[0] cpu.local: [2024-05-02 19:04:36.158000633 +0000 UTC, {"cpu0.p_user": 0, "cpu1.p_cpu": 0, "cpu3.p_cpu": 0, "cpu3.p_system": 0, "cpu4.p_system": 0, "cpu_p": 0, "cpu0.p_system": 0, "cpu1.p_user": 0, "cpu4.p_cpu": 0, "user_p": 0, "system_p": 0, "cpu0.p_cpu": 0, "cpu1.p_system": 0, "cpu2.p_user": 0, "cpu3.p_user": 0, "cpu4.p_user": 0, "cpu2.p_cpu": 0, "cpu2.p_system": 0, }
[0] cpu.local: [[1714676676.158000633, {}], {"cpu_p"=>0.000000, "user_p"=>0.000000, "system_p"=>0.000000, "cpu0.p_cpu"=>0.000000, "cpu0.p_user"=>0.000000, "cpu0.p_system"=>0.000000, "cpu1.p_cpu"=>0.000000, "cpu1.p_user"=>0.000000, "cpu1.p_system"=>0.000000, "cpu2.p_cpu"=>0.000000, "cpu2.p_user"=>0.000000, "cpu2.p_system"=>0.000000, "cpu3.p_cpu"=>0.000000, "cpu3.p_user"=>0.000000, "cpu3.p_system"=>0.000000, "cpu4.p_cpu"=>0.000000, "cpu4.p_user"=>0.000000, "cpu4.p_system"=>0.000000}]
[0] cpu.local: [[1714676677.157879914, {}], {"cpu_p"=>0.200000, "user_p"=>0.000000, "system_p"=>0.200000, "cpu0.p_cpu"=>0.000000, "cpu0.p_user"=>0.000000, "cpu0.p_system"=>0.000000, "cpu1.p_cpu"=>1.000000, "cpu1.p_user"=>1.000000, "cpu1.p_system"=>0.000000, "cpu2.p_cpu"=>0.000000, "cpu2.p_user"=>0.000000, "cpu2.p_system"=>0.000000, "cpu3.p_cpu"=>0.000000, "cpu3.p_user"=>0.000000, "cpu3.p_system"=>0.000000, "cpu4.p_cpu"=>0.000000, "cpu4.p_user"=>0.000000, "cpu4.p_system"=>0.000000}]
[0] cpu.local: [2024-05-02 19:04:37.157879914 +0000 UTC, {"cpu1.p_cpu": 1, "cpu2.p_system": 0, "cpu3.p_cpu": 0, "system_p": 0.2, "cpu2.p_cpu": 0, "cpu4.p_system": 0, "cpu0.p_user": 0, "cpu2.p_user": 0, "cpu3.p_user": 0, "cpu3.p_system": 0, "cpu4.p_cpu": 0, "cpu0.p_cpu": 0, "user_p": 0, "cpu0.p_system": 0, "cpu1.p_user": 1, "cpu1.p_system": 0, "cpu4.p_user": 0, "cpu_p": 0.2, }
[0] cpu.local: [[1714676678.157762653, {}], {"cpu_p"=>0.200000, "user_p"=>0.200000, "system_p"=>0.000000, "cpu0.p_cpu"=>0.000000, "cpu0.p_user"=>0.000000, "cpu0.p_system"=>0.000000, "cpu1.p_cpu"=>0.000000, "cpu1.p_user"=>0.000000, "cpu1.p_system"=>0.000000, "cpu2.p_cpu"=>0.000000, "cpu2.p_user"=>0.000000, "cpu2.p_system"=>0.000000, "cpu3.p_cpu"=>0.000000, "cpu3.p_user"=>0.000000, "cpu3.p_system"=>0.000000, "cpu4.p_cpu"=>0.000000, "cpu4.p_user"=>0.000000, "cpu4.p_system"=>0.000000}]
runtime: morestack on g0, stack [0xffff370aa700 0xffff370b2b00], sp=0xffff370aaa80, called from
runtime.(*limiterEvent).stop(0x4000029750?, 0x4, 0x81afd47090?)
	/usr/local/go/src/runtime/mgclimit.go:448 +0x254 fp=0xffff370aaa80 sp=0xffff370aaa80 pc=0xffff82fcbfd4
runtime.pidleget(0x0?)
	/usr/local/go/src/runtime/proc.go:6572 +0x1ac fp=0xffff370aaac0 sp=0xffff370aaa80 pc=0xffff82ff792c
runtime.exitsyscallfast_pidle()
	/usr/local/go/src/runtime/proc.go:4685 +0x34 fp=0xffff370aaae0 sp=0xffff370aaac0 pc=0xffff82ff2be4
runtime.exitsyscallfast.func1()
	/usr/local/go/src/runtime/proc.go:4628 +0x24 fp=0xffff370aab00 sp=0xffff370aaae0 pc=0xffff82ff2aa4
runtime.systemstack(0x4000080808)
	/usr/local/go/src/runtime/asm_arm64.s:243 +0x6c fp=0xffff370aab10 sp=0xffff370aab00 pc=0xffff8301918c

fatal error: morestack on g0

runtime stack:
runtime.throw({0xffff8315d43a?, 0x0?})
	/usr/local/go/src/runtime/panic.go:1023 +0x40 fp=0x4000025f80 sp=0x4000025f50 pc=0xffff82fe5ae0
runtime.badmorestackg0.func1()
	/usr/local/go/src/runtime/proc.go:533 +0xcc fp=0x4000025fe0 sp=0x4000025f80 pc=0xffff82fe914c
runtime.switchToCrashStack0()
	/usr/local/go/src/runtime/asm_arm64.s:283 +0x3c fp=0x4000025ff0 sp=0x4000025fe0 pc=0xffff8301920c

goroutine 18 gp=0x4000082700 m=7 mp=0x4000080808 [syscall, locked to thread]:
runtime.goexit({})
	/usr/local/go/src/runtime/asm_arm64.s:1222 +0x4 fp=0x40000bffe0 sp=0x40000bffe0 pc=0xffff8301b634

goroutine 17 gp=0x4000002700 m=1 mp=0x400003e008 [syscall, locked to thread]:
runtime.goexit({})
	/usr/local/go/src/runtime/asm_arm64.s:1222 +0x4 fp=0x400004cfe0 sp=0x400004cfe0 pc=0xffff8301b634

goroutine 2 gp=0x4000002c40 m=nil [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	/usr/local/go/src/runtime/proc.go:402 +0xc8 fp=0x4000038f90 sp=0x4000038f70 pc=0xffff82fe8988
runtime.goparkunlock(...)
	/usr/local/go/src/runtime/proc.go:408
runtime.forcegchelper()
	/usr/local/go/src/runtime/proc.go:326 +0xb8 fp=0x4000038fd0 sp=0x4000038f90 pc=0xffff82fe8818
runtime.goexit({})
	/usr/local/go/src/runtime/asm_arm64.s:1222 +0x4 fp=0x4000038fd0 sp=0x4000038fd0 pc=0xffff8301b634
created by runtime.init.6 in goroutine 1
	/usr/local/go/src/runtime/proc.go:314 +0x24

goroutine 3 gp=0x4000003180 m=nil [GC sweep wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	/usr/local/go/src/runtime/proc.go:402 +0xc8 fp=0x4000039760 sp=0x4000039740 pc=0xffff82fe8988
runtime.goparkunlock(...)
	/usr/local/go/src/runtime/proc.go:408
runtime.bgsweep(0x4000062000)
	/usr/local/go/src/runtime/mgcsweep.go:278 +0xa0 fp=0x40000397b0 sp=0x4000039760 pc=0xffff82fd40a0
runtime.gcenable.gowrap1()
	/usr/local/go/src/runtime/mgc.go:203 +0x28 fp=0x40000397d0 sp=0x40000397b0 pc=0xffff82fc82a8
runtime.goexit({})
	/usr/local/go/src/runtime/asm_arm64.s:1222 +0x4 fp=0x40000397d0 sp=0x40000397d0 pc=0xffff8301b634
created by runtime.gcenable in goroutine 1
	/usr/local/go/src/runtime/mgc.go:203 +0x6c

goroutine 4 gp=0x4000003340 m=nil [GC scavenge wait]:
runtime.gopark(0x4000062000?, 0xffff831c46d8?, 0x1?, 0x0?, 0x4000003340?)
	/usr/local/go/src/runtime/proc.go:402 +0xc8 fp=0x4000039f60 sp=0x4000039f40 pc=0xffff82fe8988
runtime.goparkunlock(...)
	/usr/local/go/src/runtime/proc.go:408
runtime.(*scavengerState).park(0xffff83394740)
	/usr/local/go/src/runtime/mgcscavenge.go:425 +0x5c fp=0x4000039f90 sp=0x4000039f60 pc=0xffff82fd1a8c
runtime.bgscavenge(0x4000062000)
	/usr/local/go/src/runtime/mgcscavenge.go:653 +0x44 fp=0x4000039fb0 sp=0x4000039f90 pc=0xffff82fd1fd4
runtime.gcenable.gowrap2()
	/usr/local/go/src/runtime/mgc.go:204 +0x28 fp=0x4000039fd0 sp=0x4000039fb0 pc=0xffff82fc8248
runtime.goexit({})
	/usr/local/go/src/runtime/asm_arm64.s:1222 +0x4 fp=0x4000039fd0 sp=0x4000039fd0 pc=0xffff8301b634
created by runtime.gcenable in goroutine 1
	/usr/local/go/src/runtime/mgc.go:204 +0xac

goroutine 5 gp=0x4000003c00 m=nil [finalizer wait]:
runtime.gopark(0x0?, 0x0?, 0xb8?, 0x85?, 0xffff83017204?)
	/usr/local/go/src/runtime/proc.go:402 +0xc8 fp=0x4000038580 sp=0x4000038560 pc=0xffff82fe8988
runtime.runfinq()
	/usr/local/go/src/runtime/mfinal.go:194 +0x108 fp=0x40000387d0 sp=0x4000038580 pc=0xffff82fc7378
runtime.goexit({})
	/usr/local/go/src/runtime/asm_arm64.s:1222 +0x4 fp=0x40000387d0 sp=0x40000387d0 pc=0xffff8301b634
created by runtime.createfing in goroutine 1
	/usr/local/go/src/runtime/mfinal.go:164 +0x80

@prattmic
Member

prattmic commented May 3, 2024

Hrm, thanks for confirming the repro. I will reopen.

@prattmic prattmic reopened this May 3, 2024
@RichieSams

Possibly related to #67108?

@podtserkovskiy
Contributor

podtserkovskiy commented May 3, 2024

@prattmic I've improved the repro; now it takes 2 to 10 minutes and 300 containers to reliably reproduce it 😀
The new repro with improved config: https://github.com/podtserkovskiy/fluent-bit-go-bug
I do this on an M1 Mac with podman instead of docker.

Repro steps

git clone git@github.com:podtserkovskiy/fluent-bit-go-bug.git
cd fluent-bit-go-bug
# create a larger VM
podman machine init --cpus 8 --memory 16000 --rootful --now
podman build --build-arg="GO_VERSION=1.22.2" -t fluentbit .
# the `repeat` command isn't available in bash; you can do the same with a for loop
repeat 300 podman run -d -t fluentbit
# wait a minute or more
podman ps --all | grep Exited 
# now you can grep for logs with `podman logs $container_id`
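For shells without zsh's `repeat` builtin, the 300-container step above can be sketched with a plain for loop (the `run_n` helper name is mine, not part of the repro):

```shell
#!/usr/bin/env bash
# run_n COUNT CMD...: run CMD COUNT times (bash replacement for zsh's `repeat`).
run_n() {
  local n=$1
  shift
  for ((i = 0; i < n; i++)); do
    "$@"
  done
}

# e.g., start 300 detached containers:
# run_n 300 podman run -d -t fluentbit
```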

UPD: I managed to disable the noisy logs; the panic stack trace no longer disappears.

@prattmic
Member

prattmic commented May 8, 2024

Thanks @podtserkovskiy, I can reproduce locally on my Linux machine as well.

$ git clone git@github.com:podtserkovskiy/fluent-bit-go-bug.git
$ cd fluent-bit-go-bug
$ docker build --build-arg="GO_VERSION=1.22.2" -t fluentbit .
$ for i in $(seq 1 25); do docker run --rm -t fluentbit &; done
...
runtime: morestack on g0, stack [0x7f0518777690 0x7f051877fa90], sp=0x7f0518777fa8, called from    
runtime.netpoll(0x0?)    
        /usr/local/go/src/runtime/netpoll_epoll.go:98 +0x390 fp=0x7f0518777fb0 sp=0x7f0518777fa8 pc=0x7f05629d1d30    
runtime.startTheWorldWithSema(0x0, {0x5?, 0x88?})    
        /usr/local/go/src/runtime/proc.go:1559 +0x45 fp=0x7f0518778018 sp=0x7f0518777fb0 pc=0x7f05629dae45    
runtime.gcStart.func3()    
        /usr/local/go/src/runtime/mgc.go:747 +0x32 fp=0x7f0518778050 sp=0x7f0518778018 pc=0x7f05629b8ef2    
runtime.systemstack(0xc000100008)    
        /usr/local/go/src/runtime/asm_amd64.s:509 +0x47 fp=0x7f0518778060 sp=0x7f0518778050 pc=0x7f0562a079e7    
        
fatal error: morestack on g0    
        
runtime stack:    
runtime.throw({0x7f0562b67c0f?, 0x0?})    
        /usr/local/go/src/runtime/panic.go:1023 +0x5e fp=0xc000029f88 sp=0xc000029f58 pc=0x7f05629d545e    
runtime.badmorestackg0.func1()    
        /usr/local/go/src/runtime/proc.go:533 +0xe5 fp=0xc000029fe0 sp=0xc000029f88 pc=0x7f05629d8c45    
runtime.switchToCrashStack0()    
        /usr/local/go/src/runtime/asm_amd64.s:559 +0x34 fp=0xc000029ff0 sp=0xc000029fe0 pc=0x7f0562a07a74    
        
goroutine 18 gp=0xc000102380 m=6 mp=0xc000100008 [running, locked to thread]:    
runtime.systemstack_switch()    
        /usr/local/go/src/runtime/asm_amd64.s:474 +0x8 fp=0xc000407a98 sp=0xc000407a88 pc=0x7f0562a07988    
runtime.gcStart({0x7f0562c39320?, 0xf4479?, 0x629a2c01?})    
        /usr/local/go/src/runtime/mgc.go:746 +0x47f fp=0xc000407b58 sp=0xc000407a98 pc=0x7f05629b8d5f    
runtime.mallocgc(0xf4479, 0x0, 0x0)    
        /usr/local/go/src/runtime/malloc.go:1308 +0x814 fp=0xc000407be0 sp=0xc000407b58 pc=0x7f05629ab8d4                                        
runtime.gobytes(0x7f0518672c00, 0xf4479)                                                                                                         
        /usr/local/go/src/runtime/string.go:308 +0x51 fp=0xc000407c28 sp=0xc000407be0 pc=0x7f05629f2e71                                          
github.com/fluent/fluent-bit-go/output._Cfunc_GoBytes(...)                                                                                       
        _cgo_gotypes.go:109    
github.com/fluent/fluent-bit-go/output.NewDecoder.func1(0x7f0518672c00, 0xf4479)    
        /go/pkg/mod/github.com/fluent/fluent-bit-go@v0.0.0-20230731091245-a7a013e2473c/output/decoder.go:65 +0x45 fp=0xc000407c58 sp=0xc000407c28 pc=0x7f0562b62065    
github.com/fluent/fluent-bit-go/output.NewDecoder(0x7f0518672c00, 0xf4479)    
        /go/pkg/mod/github.com/fluent/fluent-bit-go@v0.0.0-20230731091245-a7a013e2473c/output/decoder.go:65 +0xec fp=0xc000407cd8 sp=0xc000407c58 pc=0x7f0562b61fac    
main.FLBPluginFlush(0xc000082000?, 0x0?, 0x7f0518600100)    
        /go/src/plugin/main.go:37 +0x2d fp=0xc000407e30 sp=0xc000407cd8 pc=0x7f0562b62f0d    
_cgoexp_77dc9c78b3ed_FLBPluginFlush(0x7f05187780d0)    
        _cgo_gotypes.go:83 +0x25 fp=0xc000407e58 sp=0xc000407e30 pc=0x7f0562b63365                                                               
runtime.cgocallbackg1(0x7f0562b63340, 0x7f05187780d0, 0x0)                                                                                       
        /usr/local/go/src/runtime/cgocall.go:403 +0x2a5 fp=0xc000407f18 sp=0xc000407e58 pc=0x7f05629a2705                                        
runtime.cgocallbackg(0x7f0562b63340, 0x7f05187780d0, 0x0)                                                                                        
        /usr/local/go/src/runtime/cgocall.go:322 +0x138 fp=0xc000407f90 sp=0xc000407f18 pc=0x7f05629a23b8                                        
runtime.cgocallbackg(0x7f0562b63340, 0x7f05187780d0, 0x0)                                                                                        
        <autogenerated>:1 +0x2b fp=0xc000407fb8 sp=0xc000407f90 pc=0x7f0562a0bdcb                                                                
runtime.cgocallback(0x0, 0x0, 0x0)                                                                                                               
        /usr/local/go/src/runtime/asm_amd64.s:1079 +0xcd fp=0xc000407fe0 sp=0xc000407fb8 pc=0x7f0562a0976d                                       
runtime.goexit({})    
        /usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc000407fe8 sp=0xc000407fe0 pc=0x7f0562a099c1   
...

@prattmic
Member

prattmic commented May 8, 2024

@podtserkovskiy Just to clarify, though you said that you are on a Mac, I assume podman is actually running the code inside a Linux VM? Is it linux-arm64 or linux-amd64?

@prattmic
Member

prattmic commented May 8, 2024

Since Go 1.21, during cgo callback (calls from C to Go, which fluent-bit uses), we query pthread for the current stack bounds to set proper bounds on our stack checks when performing low-level runtime operations which need to occur on the stack Go is called with (normal goroutine execution uses a separately allocated stack).
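As a rough illustration of the kind of query described above, here is a minimal C sketch of looking up the current thread's stack bounds on glibc (the actual runtime code differs; `get_stack_bounds` is a hypothetical helper, and `pthread_getattr_np` is a GNU extension not available on all platforms):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical helper sketching the pthread stack-bounds query.
 * pthread_attr_getstack reports the LOW end of the stack plus its
 * size; the high end is addr + size. */
static void get_stack_bounds(uintptr_t *lo, uintptr_t *hi) {
    pthread_attr_t attr;
    void *addr = 0;
    size_t size = 0;
    pthread_getattr_np(pthread_self(), &attr);
    pthread_attr_getstack(&attr, &addr, &size);
    pthread_attr_destroy(&attr);
    *lo = (uintptr_t)addr;
    *hi = (uintptr_t)addr + size;
}
```

Any local variable of the querying thread should fall inside the reported bounds, which is the property the runtime relies on when deciding whether the bounds are usable.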

I instrumented cgo callbacks to see how much stack space we are given, and what I've found is that prior to crash, we've been called with very little space remaining on the stack. The most extreme case I've found:

...
[0.687053793 P 0] M 7 procid 13 cgocallback enter g0 stack 0x7f0c8debbe90 0x7f0c8dec4290 size 0x8400 sp 0x7f0c8dec3e90 remaining size 0x8000
[0.688720318 P 0] M 7 procid 13 cgocallback exit
[0.788935339 P 0] M 7 procid 13 cgocallback enter g0 stack 0x7f0c8debbe90 0x7f0c8dec4290 size 0x8400 sp 0x7f0c8dec3e90 remaining size 0x8000
[0.893441995 P 0] M 7 procid 13 cgocallback exit
[0.894090923 P 0] M 7 procid 13 cgocallback enter g0 stack 0x7f0c8debbe90 0x7f0c8dec4290 size 0x8400 sp 0x7f0c8debbf90 remaining size 0x100
[0.894098755 P 0] M 7 procid 13 morestack on g0
...

The total stack size is 0x8400 bytes (slightly over 32KB, which seems larger than implied by #62440 (comment), but not too far off), but the remaining stack space (the difference between the lower stack bound and sp) is only 0x100 (256) bytes.

I don't know precisely how much stack space we need (we allocate 16KB for threads created by Go, so that is the upper bound), but 256 bytes is certainly not enough!

Does fluent-bit have a minimum stack space guarantee for calls into Go? It seems that this needs more investigation on the fluent-bit side to see why there is so little stack space remaining.

P.S. Why did this work before Go 1.21? I don't think it did. Prior to 1.21, we didn't get the real stack bounds at all, we just assumed that there was at least 32KB. So we likely just silently overflowed the stack and corrupted whatever happened to come after.

@podtserkovskiy
Contributor

Just to clarify, though you said that you are on a Mac, I assume podman is actually running the code inside a Linux VM? Is it linux-arm64 or linux-amd64?

Yes it runs linux-arm64 VM on Mac, but in production we have this problem on real linux/amd64 hosts.

@podtserkovskiy
Contributor

@prattmic thank you very much for the investigation.

For this fast repro I set fluent-bit's coro_stack_size option to 19660 (~20 KiB) just to make the repro reliable. I hope this doesn't fundamentally change the environment.

By default it's ~200 KiB, and my first successful repro attempt (on the original repro https://github.com/VirrageS/fluent-bit-go-bug) was with this number.

I'm not sure how coro_stack_size affects the stack size at the very moment when fluent-bit makes calls from C to Go.

BTW, as I understand it, we use the default coro_stack_size of ~200 KiB when running fluent-bit on Linux, but let's double-check with @shpprfb

Why did this work before Go 1.21? I don't think it did. Prior to 1.21, we didn't get the real stack bounds at all, we just assumed that there was at least 32KB. So we likely just silently overflowed the stack and corrupted whatever happened to come after

This is a good question, let's ask @shpprfb as he knows more than I about fluent-bit.

We can also build our .so with custom Go with instrumented cgo callbacks on a prod-like environment and share results if it makes sense.

Shall we do this?

@shpprfb

shpprfb commented May 9, 2024

I've run it on our normal workload now built with 1.22.2 and there really seems to be no change.

@podtserkovskiy
Setting coro_stack_size to ~2MB (-s 2457600) or ~20MB (-s 24576000) doesn't seem to make reproduction any more reliable. I saw it run for about an hour, sometimes for ~10 minutes, and sometimes it crashes almost immediately with these values.

Heres an example backtrace:

Stack
May 09 05:32:41 fluent-bit[740296]: I0509 05:32:41.147693  740330 metrics.go:266] submitting telemetry to backend
May 09 05:33:09 fluent-bit[740296]: runtime: morestack on g0, stack [0x7f53071ba4f0 0x7f53071c28f0], sp=0x7f53071ba828, called from
May 09 05:33:09 fluent-bit[740296]: runtime.exitsyscallfast.func1()
May 09 05:33:09 fluent-bit[740296]:         go/1.22.2/linux_amd64/src/runtime/proc.go:4627 +0x2c fp=0x7f53071ba830 sp=0x7f53071ba828 pc=0x7f5351f9fb8c
May 09 05:33:09 fluent-bit[740296]: runtime.systemstack(0xc000100008)
May 09 05:33:09 fluent-bit[740296]:         go/1.22.2/linux_amd64/src/runtime/asm_amd64.s:509 +0x47 fp=0x7f53071ba840 sp=0x7f53071ba830 pc=0x7f5351fc7287
May 09 05:33:09 fluent-bit[740296]: fatal error: morestack on g0
May 09 05:33:09 fluent-bit[740296]: runtime stack:
May 09 05:33:09 fluent-bit[740296]: runtime.throw({0x7f535266c674?, 0x0?})
May 09 05:33:09 fluent-bit[740296]:         go/1.22.2/linux_amd64/src/runtime/panic.go:1023 +0x5e fp=0xc000037f88 sp=0xc000037f58 pc=0x7f5351f93a7e
May 09 05:33:09 fluent-bit[740296]: runtime.badmorestackg0.func1()
May 09 05:33:09 fluent-bit[740296]:         go/1.22.2/linux_amd64/src/runtime/proc.go:533 +0xe5 fp=0xc000037fe0 sp=0xc000037f88 pc=0x7f5351f97265
May 09 05:33:09 fluent-bit[740296]: runtime.switchToCrashStack0()
May 09 05:33:09 fluent-bit[740296]:         go/1.22.2/linux_amd64/src/runtime/asm_amd64.s:559 +0x34 fp=0xc000037ff0 sp=0xc000037fe0 pc=0x7f5351fc7314
May 09 05:33:09 fluent-bit[740296]: goroutine 18 gp=0xc000007c00 m=6 mp=0xc000100008 [syscall, locked to thread]:
May 09 05:33:09 fluent-bit[740296]: runtime.goexit({})
May 09 05:33:09 fluent-bit[740296]:         go/1.22.2/linux_amd64/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0002b3fe8 sp=0xc0002b3fe0 pc=0x7f5351fc9261
May 09 05:33:09 fluent-bit[740296]: goroutine 8 gp=0xc0000061c0 m=nil [select, 5 minutes]:
May 09 05:33:09 fluent-bit[740296]: runtime.gopark(0xc000547f18?, 0x2?, 0xd0?, 0x7d?, 0xc000547ea4?)
May 09 05:33:09 fluent-bit[740296]:         go/1.22.2/linux_amd64/src/runtime/proc.go:402 +0xce fp=0xc00054dd38 sp=0xc00054dd18 pc=0x7f5351f96a0e
May 09 05:33:09 fluent-bit[740296]: runtime.selectgo(0xc00054df18, 0xc000547ea0, 0x0?, 0x0, 0x0?, 0x1)
May 09 05:33:09 fluent-bit[740296]:         go/1.22.2/linux_amd64/src/runtime/select.go:327 +0x725 fp=0xc00054de58 sp=0xc00054dd38 pc=0x7f5351fa7de5
May 09 05:33:09 fluent-bit[740296]: ourqueue.(*Queue).clearCounters(0xc0000764e0, 0x0?)
May 09 05:33:09 fluent-bit[740296]:         build_dir/ourqueue/queue.go:524 +0xe5 fp=0xc00054dfc0 sp=0xc00054de58 pc=0x7f535248edc5
May 09 05:33:09 fluent-bit[740296]: ourqueue.(*Queue).Heartbeat.gowrap2()
May 09 05:33:09 fluent-bit[740296]:         build_dir/ourqueue/queue.go:477 +0x25 fp=0xc00054dfe0 sp=0xc00054dfc0 pc=0x7f535248e7e5
May 09 05:33:09 fluent-bit[740296]: runtime.goexit({})
May 09 05:33:09 fluent-bit[740296]:         go/1.22.2/linux_amd64/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc00054dfe8 sp=0xc00054dfe0 pc=0x7f5351fc9261
May 09 05:33:09 fluent-bit[740296]: created by ourqueue.(*Queue).Heartbeat in goroutine 22
May 09 05:33:09 fluent-bit[740296]:         build_dir/ourqueue/queue.go:477 +0xcb
May 09 05:33:09 fluent-bit[740296]: goroutine 17 gp=0xc000006700 m=1 mp=0xc000050008 [syscall, 5 minutes, locked to thread]:
May 09 05:33:09 fluent-bit[740296]: runtime.goexit({})
May 09 05:33:09 fluent-bit[740296]:         go/1.22.2/linux_amd64/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc00005efe8 sp=0xc00005efe0 pc=0x7f5351fc9261
May 09 05:33:09 fluent-bit[740296]: goroutine 2 gp=0xc000006c40 m=nil [force gc (idle), 5 minutes]:
May 09 05:33:09 fluent-bit[740296]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
May 09 05:33:09 fluent-bit[740296]:         go/1.22.2/linux_amd64/src/runtime/proc.go:402 +0xce fp=0xc00004afa8 sp=0xc00004af88 pc=0x7f5351f96a0e
May 09 05:33:09 fluent-bit[740296]: runtime.goparkunlock(...)
May 09 05:33:09 fluent-bit[740296]:         go/1.22.2/linux_amd64/src/runtime/proc.go:408
May 09 05:33:09 fluent-bit[740296]: runtime.forcegchelper()
May 09 05:33:09 fluent-bit[740296]:         go/1.22.2/linux_amd64/src/runtime/proc.go:326 +0xb8 fp=0xc00004afe0 sp=0xc00004afa8 pc=0x7f5351f96898
May 09 05:33:09 fluent-bit[740296]: runtime.goexit({})
May 09 05:33:09 fluent-bit[740296]:         go/1.22.2/linux_amd64/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc00004afe8 sp=0xc00004afe0 pc=0x7f5351fc9261
May 09 05:33:09 fluent-bit[740296]: created by runtime.init.6 in goroutine 1
May 09 05:33:09 fluent-bit[740296]:         go/1.22.2/linux_amd64/src/runtime/proc.go:314 +0x1a
May 09 05:33:09 fluent-bit[740296]: goroutine 3 gp=0xc000007180 m=nil [GC sweep wait]:

Now that there is a somewhat clear repro again, does the go team have everything it needs for reproducing, causing a crash, attaching a debugger, etc?

@prattmic
Member

prattmic commented May 9, 2024

I also tried adjusting coro_stack_size and regardless of that value, I always see a total stack size of 0x8400 (33792) bytes on the calls into Go. It anecdotally feels like it takes longer to reproduce with higher values, but I haven't collected statistics, so that may just be confirmation bias.

I am curious to see what the C frames prior to the Go call look like on the crashing case where there is barely any stack space remaining. Unfortunately, I've been unable to reproduce inside of debugger thus far. If anyone else would like attempt this, that would be great. You can set a breakpoint on runtime.badmorestackg0 to catch the location that fatal panic occurs.

@podtserkovskiy I will clean up and mail the patch I am using shortly. It should be usable to at least verify you are seeing the same symptoms as me in production.

@prattmic
Member

prattmic commented May 9, 2024

I believe I found the bug, which is in Go.

I was thrown off by the pthread stack bound lookup we do, it turns out that isn't relevant.

The coroutine stack allocation is at https://github.com/fluent/fluent-bit/blob/v2.1.8/lib/flb_libco/amd64.c#L142-L147. It is just malloc (the context switch is above, co_swap_function). At no point does the library change the pthread stack bounds (in hindsight, I'm not sure that this is even possible).

So when we lookup the pthread stack bounds, we will fail the bounds check at https://cs.opensource.google/go/go/+/master:src/runtime/cgocall.go;l=273-279;drc=a878d3dfa0f9d7cd1de26e3df9eb3983a9f64b53;bpv=0;bpt=1 and avoid using the pthread bounds. Instead we will use the dummy bounds computed above (assume we're near the top of a ~32KB stack).
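To make the failing bounds check concrete, here is a small C sketch (assuming glibc; `sp_on_pthread_stack` is a hypothetical helper, not fluent-bit or runtime code) of why a malloc'ed coroutine stack can never pass a pthread-bounds membership test:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Return 1 if sp lies within the current thread's pthread stack
 * bounds. A coroutine stack obtained from malloc (as flb_libco's
 * co_create does) lives on the heap, outside those bounds, so the
 * check fails and the runtime falls back to dummy ~32KB bounds
 * computed around sp. */
static int sp_on_pthread_stack(uintptr_t sp) {
    pthread_attr_t attr;
    void *addr = 0;
    size_t size = 0;
    pthread_getattr_np(pthread_self(), &attr);
    pthread_attr_getstack(&attr, &addr, &size);
    pthread_attr_destroy(&attr);
    return sp >= (uintptr_t)addr && sp < (uintptr_t)addr + size;
}
```

A local variable's address passes this check; a pointer near the top of a heap-allocated "stack" does not, which is exactly the situation on every fluent-bit callback.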

But if we assume a 32KB stack, why does it look like we're called with almost no space?

The problem is the fast path at the start of stack update: https://cs.opensource.google/go/go/+/master:src/runtime/cgocall.go;l=224-227;drc=a878d3dfa0f9d7cd1de26e3df9eb3983a9f64b53;bpv=0;bpt=1

Consider this sequence:

  1. fluent-bit allocates a 16KB stack at [0x100000, 0x104000).
  2. fluent-bit calls Go with SP near the top of this stack. Let's assume 0x103900.
  3. callbackUpdateSystemStack finds that this stack isn't in the current stack bounds, and doesn't match the pthread stack bounds, so it uses ~32KB dummy bounds of [0xfb900, 0x103d00).
  4. Go call completes and returns to fluent-bit.
  5. fluent-bit frees the stack.
  6. To make a new call, fluent-bit allocates a new 16KB stack, this time at [0xf9000, 0xfd000).
  7. fluent-bit calls Go with SP near the top of this stack. Let's assume 0xfc900
  8. callbackUpdateSystemStack finds that this SP is within the old stack bounds, so it hits the fast path and does not change the stack bounds.

We're now using a completely different stack allocation, but with the old stack bounds. In the old stack bounds we only appear to have 0x1000 bytes of stack space remaining.

I selected arbitrary numbers here; the overlap could of course be even smaller. It also does not depend on the fluent-bit stack being <32KB. The same could occur with larger stack sizes, provided that the old stack is freed before the subsequent call, allowing the new stack to get allocated in a partially overlapping region.
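The sequence above can be modeled with plain integers. This is a toy sketch of the pre-fix fast-path behavior (the names `update_bounds_fastpath` and `apparent_room` are invented for illustration; the real logic lives in `callbackUpdateSystemStack` in runtime/cgocall.go):

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the pre-fix fast path: if the incoming sp falls
 * inside the previously recorded g0 bounds, the bounds are left
 * untouched -- even if the C side has since freed that stack and
 * a new, partially overlapping stack was allocated. */
struct bounds { uintptr_t lo, hi; };

static void update_bounds_fastpath(struct bounds *b, uintptr_t sp,
                                   uintptr_t stk_lo, uintptr_t stk_hi) {
    if (sp >= b->lo && sp < b->hi)
        return;              /* fast path: keep the stale bounds */
    b->lo = stk_lo;          /* slow path: adopt the new bounds */
    b->hi = stk_hi;
}

/* How much room the recorded bounds claim is left below sp. */
static uintptr_t apparent_room(const struct bounds *b, uintptr_t sp) {
    return sp - b->lo;
}
```

Replaying the numbers from the sequence: the first call (sp=0x103900) records dummy bounds [0xfb900, 0x103d00); the second call on the overlapping stack (sp=0xfc900) hits the fast path, keeps the stale bounds, and appears to have only 0x1000 bytes of room.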

I believe the right approach here is to avoid this fast path and always attempt to update the stack bounds. I will send a CL.

@gopherbot

Change https://go.dev/cl/584597 mentions this issue: runtime: always update stack bounds on cgocallback

@shpprfb

shpprfb commented May 9, 2024

@prattmic Nice, I was following a similar path from your stacktrace:

I see theres a stack switch from g to g0 in goroutine 18:

runtime.systemstack_switch()    
        /usr/local/go/src/runtime/asm_amd64.s:474 +0x8 fp=0xc000407a98 sp=0xc000407a88 pc=0x7f0562a07988    

Which runs on the g stack, invokes the stack switch thunk, which results in

runtime.systemstack(0xc000100008)    
        /usr/local/go/src/runtime/asm_amd64.s:509 +0x47 fp=0x7f0518778060 sp=0x7f0518778050 pc=0x7f0562a079e

being invoked on the g0 stack. If the g0 stack range is [0x7f0518777690 0x7f051877fa90], there seems to be a strange discrepancy: the g0 SP that runtime.systemstack runs with is 0x7f0518778050, which leaves only 0x7f0518778050 - 0x7f0518777690 = 0x9C0 (~2.5K) bytes of stack space for what looks to be garbage collection.

When looking at the stack switching thunk https://github.com/golang/go/blob/master/src/runtime/asm_amd64.s#L513 why is the saved SP ((g_sched+gobuf_sp)(DX), SP) pointing into the middle of the g0 stack? This suggests that either there are frames below runtime.systemstack that the crashstack debug handler isn't printing, or SP was errantly saved/clobbered by previous stack management operations.

Is it expected behavior to switch right to the end of the g0 stack allocation for calls that run on g0?

This leads me to think this is a pure Go issue since this isn't running on the FluentBit stack at exception time at all. FluentBit should have no bearing on the g0 stack size or what SP is used when switching to g0... right?

Does the g0 stack need to grow, or is the SP we enter with on the g0 stack just wrong?

@cherrymui
Member

the crashstack debug handler isn't printing

Probably this. This is because a stack switching function like systemstack isn't CALLed by the stack frame above it. So it doesn't make sense to unwind to the frame above it, and the unwinder doesn't know how to do it in general.

Is it expected behavior to switch right to the end of the g0 stack allocation for calls that run on g0?

I'm not sure I understand the question. Generally, it is expected that we switch to the g0 stack wherever it is currently at. Usually (especially for a pure Go binary), there aren't many frames on the g0 stack, so we switch to somewhere very close to the top. But if Go is called e.g. from a very deep C stack, we cannot overwrite the C frames, so we switch to right below the C frames.

Does the g0 stack need to grow, or is the SP we enter with on the g0 stack just wrong?

g0 stack doesn't grow. We just check its bounds. So morestack on g0 is always just crashing.

For this particular bug, it is not that the SP is wrong, but that we get the g0 stack bounds wrong, so it thinks it goes out of bounds.

@shpprfb

shpprfb commented May 10, 2024

Interesting, that makes sense. I didn't realize the g0 stack location was the original stack that called into Go in the first place which is the FluentBit coroutine stack. Thank you for clarifying.

@prattmic
Member

@gopherbot Please backport to 1.21 and 1.22. This bug causes random crashes on cgocallback.

@gopherbot

Change https://go.dev/cl/585337 mentions this issue: [release-branch.go1.21] runtime: always update stack bounds on cgocallback

@gopherbot

Change https://go.dev/cl/585935 mentions this issue: [release-branch.go1.22] runtime: always update stack bounds on cgocallback
