runtime: nice -20 causes thread exhaustion on ARM #17803
Comments
Are you able to paste the complete text of the panic message?
It takes up to a day for it to crash. But I have a large number of bug reports from people who are using the program, and they look similar.
Thanks for posting the repro. What does ulimit -a print for that account?
pi@raspberrypi:~ $ ulimit -a
Which kernel? (Sorry if this has already been answered.)
Thank you for the investigation. Dave, I am not sure the issue is isolated to 1.7; as far as I remember I had crashes, albeit very rarely, on 1.6 too. I'm not 100% sure it was the same thread issue, but the reports were similar. Go 1.5/1.4 were very solid.
@shahinv -20 is the highest priority on Linux; I suspect you are starving the other processes and the kernel on the system, and your system calls are not completing in a timely manner. That said, I would expect to see those syscalls in the goroutine trace. Did you trim the backtrace above, or does it really skip from goroutine 36 to goroutine 3227?
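On Linux you can watch the OS thread count of a process directly via /proc; this is a generic sketch, using the current shell's PID as a stand-in for the program under suspicion:

```shell
# Each entry under /proc/<pid>/task is one OS thread of that process.
# $$ (this shell) is used here for demonstration; in practice substitute
# the target PID, e.g. $(pidof program).
ls /proc/$$/task | wc -l
```

Sampling this in a loop while the program runs shows whether the thread count grows steadily before the crash.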
@shahinv Can you try to reproduce a crash with GOTRACEBACK=crash? That will cause the traceback to contain more information.
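For reference, running with that environment variable looks like the following (./program is a placeholder for the actual binary); enabling core dumps as well lets gdb inspect the aborted process afterwards:

```shell
# Allow core dumps, then run with GOTRACEBACK=crash so a fatal runtime
# error prints all goroutines (including runtime frames) and aborts.
ulimit -c unlimited
GOTRACEBACK=crash ./program
```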
@quentinmit no, I have not trimmed the panic message. I have attached another one for your reference. Sure, I will try to reproduce it with GOTRACEBACK=crash and will share the result shortly.

runtime/cgo: pthread_create failed: Resource temporarily unavailable
goroutine 0 [idle]:
goroutine 34 [syscall]:
goroutine 1 [IO wait, 2 minutes]:
goroutine 17 [syscall, 1587 minutes, locked to thread]:
goroutine 19 [sleep]:
goroutine 7 [sleep]:
goroutine 23 [syscall]:
goroutine 24 [sleep]:
goroutine 26 [sleep]:
goroutine 31 [sleep, 2 minutes]:
goroutine 7102 [IO wait]:
goroutine 7103 [IO wait]:
goroutine 5228 [runnable]:
trap 0x0
runtime/cgo: pthread_create failed: Resource temporarily unavailable
goroutine 0 [idle]:
goroutine 18 [syscall]:
goroutine 1 [IO wait, 13 minutes]:
goroutine 17 [syscall, 584 minutes, locked to thread]:
goroutine 2 [force gc (idle), 3 minutes]:
goroutine 3 [GC sweep wait]:
goroutine 4 [finalizer wait, 539 minutes]:
goroutine 6 [sleep]:
goroutine 7 [sleep]:
goroutine 75 [GC worker (idle), 1 minutes]:
goroutine 12 [sleep]:
goroutine 13 [sleep]:
goroutine 73 [GC worker (idle), 1 minutes]:
goroutine 15 [sleep]:
goroutine 35 [IO wait]:
goroutine 37 [sleep]:
goroutine 72 [GC worker (idle), 1 minutes]:
goroutine 69 [sleep]:
goroutine 74 [GC worker (idle), 1 minutes]:
goroutine 2917 [runnable]:
trap 0x0
SIGQUIT: quit
goroutine 0 [idle]:
SIGQUIT: quit
goroutine 0 [idle]:
SIGQUIT: quit
goroutine 0 [idle]:
SIGQUIT: quit
goroutine 0 [idle]:
SIGQUIT: quit
goroutine 0 [idle]:
@davecheney @quentinmit Is there anything I can do to help the investigation?
@shahinv is the source of your program available for me to try to reproduce the problem?
@davecheney Unfortunately the program is large (http://www.nanodlp.com) and it is hard to reproduce the issue (it requires a couple of hours). I will try to isolate the issue and share easier-to-debug source code ASAP.
Thank you
Have you tested that the code contains no data races and properly handles every error?
Also worth seeing is how many OS threads the program uses with and without the nice adjustment. Perhaps it is indeed using too many OS threads (i.e. unbounded concurrent cgo calls or IO) and adjusting nice just exacerbates the problem.
@davecheney I am checking the program for races frequently; as it slows down the program, I usually test it for under an hour. If more would be helpful to detect a race, please let me know.
@minux I have seen a couple of completely different sets of behaviours, as I have been forced to change frameworks and rewrite code to make sure none of them is the issue. Currently I see the issue with two tight loops with a sleep inside.
The program collects serial output from external devices, and that works fine. But to synchronize jobs I search the output to see whether a given string is available before moving to the next job. Sleep time and time waited correlate with crashes. Before each read and write I was using a Mutex lock; even removing the locks does not make any difference. Increasing the sleep time (100,000,000 ns) or a faster response from the external device helps but never solves the issue.
The number of OS threads could increase up to 150. The number of goroutines (runtime.NumGoroutine()) also increases gradually, and I have seen sporadic high CPU usage until the time of the crash; it usually crashes at around 800 goroutines. Removing this loop fixes the issue. As a workaround, instead of watching for the string in a loop, I lock the main module using a Mutex; the code which reads output from the serial port searches for the string directly and unlocks the main module.
The second place which helped me solve the issue (as the first one was not easily reproducible for me): I had a loop similar to the one below.
for !LayerExist() {
    time.Sleep(100000000)
}
func LayerExist() bool {
    _, err := os.Stat("file")
    if err != nil {
        return false
    }
    return true
}
The number of OS threads remains 9-15 (which is normal), but the number of goroutines (runtime.NumGoroutine()) increases gradually, again with sporadic high CPU usage until the crash, usually at around 800 goroutines. Reproducing this one is much easier: make sure the file does not exist and it will crash the program sooner or later.
Removing nice completely fixes the issues. Usual processor usage should not be higher than 0.5, and IO usage is also very low. But removing nice caused another issue: the program directly generates pulses to move a stepper motor, and the rate of uneven steps increased substantially. You can see what I am talking about in this video (during 40~50s): https://www.youtube.com/watch?v=uybLNu0zv28&list=PLP9pvG98FyFs8Jz131i5iXVoY0O9QIQyF&index=2
How does your program collect serial output data from the external devices?
In Go, each blocking syscall occupies an OS thread; if these operations are slower than expected because the program has been reniced, that could explain the thread exhaustion you see.
Why are you renicing your program? What about its behaviour or interaction with other programs on this machine drove you to adopt this as a solution?
It uses github.com/tarm/serial to communicate with serial devices. I do not think data collection is the issue; it is somewhere we check its result, because when this check is off everything works fine. Also, the second case which I described above happens with serial communication totally off. I have isolated the issues, and in both cases a simple loop with a wait inside causes the issue; unfortunately, when I move these loops elsewhere to reproduce the issue, it does not happen. Lots of different SLA 3D printers are supported by the software, so naturally it becomes complex.
OK, let's rule out serial transfer. Are you able to share the exact code which triggers the failure?
By the way, you still haven't explained why nice -20 is necessary.
… Lots of different SLA 3D printers are supported by the software, so naturally it becomes complex.
Users could use different hardware for the movements: RAMPS (through serial), Trinamic drivers, etc. (serial or i2c), or directly generating pulses on the RPi GPIO.
Even for a basic printer, the following functionality is usually used: a serial connection to RAMPS, a serial connection to a projector, a cgo call to DispmanX, generating PWM (through DMA) for the shutter (right now I use pigpio), and GPIO for external buttons, fault detection, etc.
Unfortunately the extracted code does not cause the issue. Having -20 helps to achieve better pulses; even with nice -20 it is limited to 2KHz with Go.
What I'm suspecting is that renicing the process to the highest priority is blocking some other operating system action, which causes you to run out of threads.
I'm not really sure what to suggest; without renicing your process you do not get the pseudo real time behaviour you want, but with renicing you make your system unstable -- not really an improvement.
I want to see you succeed using Go in this application, but without a way to reproduce it I cannot suggest anything.
If you are on a platform that supports perf(1), it may be able to generate some useful data.
Finally, are you able to share the source of your program? Keeping in mind that Go is not a real time programming language, there may be some easy performance wins that can be discovered via code review that will improve the throughput of your program to the point that you do not need to renice it.
…On Sat, 26 Nov 2016, 21:19 shahinv wrote:
Unfortunately the extracted code does not cause the issue.
About nice -20: when generating pulses, the duration of the pulses must be even to get smooth movement. You can hear some strange sounds between 40~50s; it is the sound of uneven steps.
https://www.youtube.com/watch?v=uybLNu0zv28&list=PLP9pvG98FyFs8Jz131i5iXVoY0O9QIQyF&index=2
Having -20 helps to achieve better pulses. Even with nice -20 it is limited to 2KHz with Go.
http://www.nanodlp.com/forum/uploads/d71e78db4925cdb08f8635312226312e.png
@davecheney thank you for putting time into solving this issue. I had found workarounds for the issues before filing the bug report, but it would be great if the source of the problem could be fixed; even if not, this thread could help fellow programmers with the same issue avoid renicing. As the issue was rarely reported on go1.6, I will try go1.8 soon to see if the new SSA backend helps.
Getting too late for 1.8, especially since there's a workaround.
What version of Go are you using (go version)?
go version go1.7.1 linux/arm
What operating system and processor architecture are you using (go env)?
Linux/ARM (RPi 2/3)
What did you do?
sudo nice -n -20 program
What did you expect to see?
Program works correctly
What did you see instead?
The number of threads increases randomly until the program crashes with the following error: "runtime/cgo: pthread_create failed: Resource temporarily unavailable".
gdb reports a large number of stalled threads:
33 Thread 0x671ff460 (LWP 7863) "printer" 0x000a7aa8 in runtime.futex ()
32 Thread 0x669ff460 (LWP 7864) "printer" 0x000a7aa8 in runtime.futex ()
31 Thread 0x679ff460 (LWP 7821) "printer" 0x000a7aa8 in runtime.futex ()
30 Thread 0x681ff460 (LWP 7820) "printer" 0x000a7aa8 in runtime.futex ()
29 Thread 0x689ff460 (LWP 7811) "printer" 0x000a7aa8 in runtime.futex ()
28 Thread 0x691ff460 (LWP 7810) "printer" 0x000a7aa8 in runtime.futex ()
27 Thread 0x6a1ff460 (LWP 7806) "printer" 0x000a7aa8 in runtime.futex ()
26 Thread 0x699ff460 (LWP 7807) "printer" 0x000a7aa8 in runtime.futex ()
25 Thread 0x6a9ff460 (LWP 7804) "printer" 0x000a7aa8 in runtime.futex ()
24 Thread 0x6b1ff460 (LWP 7802) "printer" 0x000a7aa8 in runtime.futex ()
23 Thread 0x6b9ff460 (LWP 7801) "printer" 0x000a7aa8 in runtime.futex ()
22 Thread 0x6c1ff460 (LWP 7775) "printer" 0x000a7aa8 in runtime.futex ()
21 Thread 0x6c9ff460 (LWP 7774) "printer" 0x000a7aa8 in runtime.futex ()
20 Thread 0x6d1ff460 (LWP 7771) "printer" 0x000a7aa8 in runtime.futex ()
19 Thread 0x6d9ff460 (LWP 7770) "printer" 0x000a7aa8 in runtime.futex ()
Before crashes the load was usually around 7~10; an RPi's normal load should be less than 0.2.
I have found that time.Sleep() is part of the issue, especially inside tight loops. Longer delays make the issue appear much less frequently, and it gets better or worse depending on the structure of the code. I could not reproduce it on amd64.
Removing "nice" makes the issue go away completely.
Unfortunately I could not prepare a small program that reproduces the issue.