Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: spinning CPU on NetBSD calling compat_60__lwp_park #22968

Closed
bradfitz opened this issue Dec 1, 2017 · 15 comments
Closed

runtime: spinning CPU on NetBSD calling compat_60__lwp_park #22968

bradfitz opened this issue Dec 1, 2017 · 15 comments
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. OS-NetBSD
Milestone

Comments

@bradfitz
Copy link
Contributor

bradfitz commented Dec 1, 2017

The NetBSD builder's buildlet process boots up immediately spinning, using 100% CPU.

$ gomote create netbsd-amd64-8branch
$ gomote ssh user-bradfitz-netbsd-amd64-8branch-0
...
# dtruss -n buildlet 2>&1 | head -500 > x
# cat x
PID/LWP    SYSCALL(args)                 = return
   505/2:  compat_60__lwp_park(0xC42003DE58, 0x0, 0xC42002E398)          = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE60, 0x0)                 = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE00, 0x0)                 = 0 0
   505/2:  compat_60__lwp_park(0xC42003DE58, 0x0, 0xC42002E398)          = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE60, 0x0)                 = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE00, 0x0)                 = 0 0
   505/2:  compat_60__lwp_park(0xC42003DE58, 0x0, 0xC42002E398)          = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE60, 0x0)                 = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE00, 0x0)                 = 0 0
   505/2:  compat_60__lwp_park(0xC42003DE58, 0x0, 0xC42002E398)          = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE60, 0x0)                 = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE00, 0x0)                 = 0 0
   505/2:  compat_60__lwp_park(0xC42003DE58, 0x0, 0xC42002E398)          = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE60, 0x0)                 = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE00, 0x0)                 = 0 0
   505/2:  compat_60__lwp_park(0xC42003DE58, 0x0, 0xC42002E398)          = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE60, 0x0)                 = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE00, 0x0)                 = 0 0
   505/2:  compat_60__lwp_park(0xC42003DE58, 0x0, 0xC42002E398)          = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE60, 0x0)                 = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE00, 0x0)                 = 0 0
   505/2:  compat_60__lwp_park(0xC42003DE58, 0x0, 0xC42002E398)          = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE60, 0x0)                 = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE00, 0x0)                 = 0 0
   505/2:  compat_60__lwp_park(0xC42003DE58, 0x0, 0xC42002E398)          = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE60, 0x0)                 = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE00, 0x0)                 = 0 0
   505/2:  compat_60__lwp_park(0xC42003DE58, 0x0, 0xC42002E398)          = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE60, 0x0)                 = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE00, 0x0)                 = 0 0
   505/2:  compat_60__lwp_park(0xC42003DE58, 0x0, 0xC42002E398)          = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE60, 0x0)                 = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE00, 0x0)                 = 0 0
   505/2:  compat_60__lwp_park(0xC42003DE58, 0x0, 0xC42002E398)          = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE60, 0x0)                 = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE00, 0x0)                 = 0 0
   505/2:  compat_60__lwp_park(0xC42003DE58, 0x0, 0xC42002E398)          = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE60, 0x0)                 = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE00, 0x0)                 = 0 0
   505/2:  compat_60__lwp_park(0xC42003DE58, 0x0, 0xC42002E398)          = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE60, 0x0)                 = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE00, 0x0)                 = 0 0
   505/2:  compat_60__lwp_park(0xC42003DE58, 0x0, 0xC42002E398)          = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE60, 0x0)                 = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE00, 0x0)                 = 0 0
   505/2:  compat_60__lwp_park(0xC42003DE58, 0x0, 0xC42002E398)          = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE60, 0x0)                 = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE00, 0x0)                 = 0 0
   505/2:  compat_60__lwp_park(0xC42003DE58, 0x0, 0xC42002E398)          = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE60, 0x0)                 = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE00, 0x0)                 = 0 0
   505/2:  compat_60__lwp_park(0xC42003DE58, 0x0, 0xC42002E398)          = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE60, 0x0)                 = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE00, 0x0)                 = 0 0
   505/2:  compat_60__lwp_park(0xC42003DE58, 0x0, 0xC42002E398)          = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE60, 0x0)                 = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE00, 0x0)                 = 0 0
   505/2:  compat_60__lwp_park(0xC42003DE58, 0x0, 0xC42002E398)          = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE60, 0x0)                 = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE00, 0x0)                 = 0 0
   505/2:  compat_60__lwp_park(0xC42003DE58, 0x0, 0xC42002E398)          = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE60, 0x0)                 = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE00, 0x0)                 = 0 0
   505/2:  compat_60__lwp_park(0xC42003DE58, 0x0, 0xC42002E398)          = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE60, 0x0)                 = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE00, 0x0)                 = 0 0
   505/2:  compat_60__lwp_park(0xC42003DE58, 0x0, 0xC42002E398)          = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE60, 0x0)                 = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE00, 0x0)                 = 0 0
   505/2:  compat_60__lwp_park(0xC42003DE58, 0x0, 0xC42002E398)          = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE60, 0x0)                 = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE00, 0x0)                 = 0 0
   505/2:  compat_60__lwp_park(0xC42003DE58, 0x0, 0xC42002E398)          = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE60, 0x0)                 = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE00, 0x0)                 = 0 0
   505/2:  compat_60__lwp_park(0xC42003DE58, 0x0, 0xC42002E398)          = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE60, 0x0)                 = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE00, 0x0)                 = 0 0
   505/2:  compat_60__lwp_park(0xC42003DE58, 0x0, 0xC42002E398)          = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE60, 0x0)                 = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE00, 0x0)                 = 0 0
   505/2:  compat_60__lwp_park(0xC42003DE58, 0x0, 0xC42002E398)          = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE60, 0x0)                 = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE00, 0x0)                 = 0 0
   505/2:  compat_60__lwp_park(0xC42003DE58, 0x0, 0xC42002E398)          = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE60, 0x0)                 = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE00, 0x0)                 = 0 0
   505/2:  compat_60__lwp_park(0xC42003DE58, 0x0, 0xC42002E398)          = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE60, 0x0)                 = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE00, 0x0)                 = 0 0
   505/2:  compat_60__lwp_park(0xC42003DE58, 0x0, 0xC42002E398)          = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE60, 0x0)                 = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE00, 0x0)                 = 0 0
   505/2:  compat_60__lwp_park(0xC42003DE58, 0x0, 0xC42002E398)          = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE60, 0x0)                 = 0 0
   505/2:  clock_gettime(0x3, 0xC42003DE00, 0x0)                 = 0 0
....

This seemed to start when I rebuilt the NetBSD-amd64 8.0-BETA VM image yesterday, but I'm not sure.

Filing while I investigate.

/cc @ianlancetaylor @aclements @bsiegert @krytarowski @zoulasc

@bradfitz bradfitz added NeedsFix The path to resolution is known, but the work has not been done. OS-NetBSD labels Dec 1, 2017
@bradfitz bradfitz added this to the Go1.10 milestone Dec 1, 2017
@bradfitz
Copy link
Contributor Author

bradfitz commented Dec 1, 2017

Okay, if I build the buildlet with Go 1.9, it doesn't spin.

So this appears to be some regression at Go master.

@krytarowski
Copy link
Contributor

krytarowski commented Dec 1, 2017

While there, it looks like golang uses clock_gettime(), while it should use newer syscall __clock_gettime50().

@bradfitz
Copy link
Contributor Author

bradfitz commented Dec 2, 2017

Update:

Even with the Go 1.9-built buildlet, we're still getting a column of red failing builds for netbsd because now the go_bootstrap binary is spinning, taking 100% CPU while the rest of the build tries to complete & pass its tests. But many tests fail while competing with a 100% CPU spinner. The spin is the same:

buildlet-netbsd-amd64-8branch-rnc9c727e# dtruss -n go_bootstrap 2>&1 | head -50 > x
buildlet-netbsd-amd64-8branch-rnc9c727e# cat x
PID/LWP    SYSCALL(args)                 = return
  2140/2:  clock_gettime(0x3, 0xC42003FE00, 0x0)                 = 0 0
  2140/2:  compat_60__lwp_park(0xC42003FE58, 0x0, 0xC420030398)          = 0 0
  2140/2:  clock_gettime(0x3, 0xC42003FE60, 0x0)                 = 0 0
  2140/2:  clock_gettime(0x3, 0xC42003FE00, 0x0)                 = 0 0
  2140/2:  compat_60__lwp_park(0xC42003FE58, 0x0, 0xC420030398)          = 0 0
  2140/2:  clock_gettime(0x3, 0xC42003FE60, 0x0)                 = 0 0
  2140/2:  clock_gettime(0x3, 0xC42003FE00, 0x0)                 = 0 0
  2140/2:  compat_60__lwp_park(0xC42003FE58, 0x0, 0xC420030398)          = 0 0
  2140/2:  clock_gettime(0x3, 0xC42003FE60, 0x0)                 = 0 0
  2140/2:  clock_gettime(0x3, 0xC42003FE00, 0x0)                 = 0 0
  2140/2:  compat_60__lwp_park(0xC42003FE58, 0x0, 0xC420030398)          = 0 0
  2140/2:  clock_gettime(0x3, 0xC42003FE60, 0x0)                 = 0 0
  2140/2:  clock_gettime(0x3, 0xC42003FE00, 0x0)                 = 0 0
  2140/2:  compat_60__lwp_park(0xC42003FE58, 0x0, 0xC420030398)          = 0 0
...

Because this all worked the other day before I rebuilt the NetBSD image, I'm curious if any changes in NetBSD-8 happened recently that make Go unhappy.

@gopherbot
Copy link

Change https://golang.org/cl/81637 mentions this issue: dashboard: disable NetBSD builders

gopherbot pushed a commit to golang/build that referenced this issue Dec 2, 2017
Until issues are fixed. No point in wasting resources on column of
red.

Updates golang/go#22968

Change-Id: I2de5a4d8aca6fcbdab2f43d87818c7d8f333f8bc
Reviewed-on: https://go-review.googlesource.com/81637
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@bradfitz bradfitz modified the milestones: Go1.10, Unplanned Dec 2, 2017
@zoulasc
Copy link
Contributor

zoulasc commented Dec 2, 2017

I don't know if/what changed on 8. I can reproduce this on HEAD in one of the regexp tests and with i386 binaries on amd64 in many tests. I am investigating.

@bradfitz
Copy link
Contributor Author

bradfitz commented Dec 2, 2017

Thanks, @zoulasc.

@zoulasc
Copy link
Contributor

zoulasc commented Dec 2, 2017

Here's a patch to make go use the new lwp_park call on NetBSD (untested on arm) and to use CLOCK_MONOTONIC for nanotime(). This also saves a syscall each time we need to sleep for a semaphore...

https://www.netbsd.org/~christos/go-lwp-park-clock-monotonic.diff

With this fix and a tiny kernel patch (kern_lwp.c:1.191) go on amd64 passes all the tests.

@bradfitz
Copy link
Contributor Author

bradfitz commented Dec 3, 2017

Thanks. Is there a pullup to the NetBSD-8 branch I should wait for for http://cvsweb.netbsd.org/bsdweb.cgi/src/sys/kern/kern_lwp.c.diff?r1=1.190&r2=1.191&only_with_tag=MAIN ?

Btw, the nanotime change already happened in 2065685.

So were the processes spinning because lwp_park was using the wrong clock and the clock was going backwards sometimes, or ... ?

@zoulasc
Copy link
Contributor

zoulasc commented Dec 3, 2017

206585 changes amd64/i386 but not arm. There is a different bug somewhere, perhaps in the compat32 code since amd64 now works. I don't think that processes were spinning because the clock went backwards, because before the changes both nanotime and lwp_park were using CLOCK_REALTIME. This change makes nanotime() to use CLOCK_MONOTONIC, but the lwp_park call is not using absolute time anymore, so it is not calling nanotime(), which is an improvement anyway.

The kern_lwp change fixes a different issue where we ended up having processes stuck exiting. I will ask for a pullup shortly.

@gopherbot
Copy link

Change https://golang.org/cl/81715 mentions this issue: runtime: make NetBSD lwp_park use monotonic time

@gopherbot
Copy link

Change https://golang.org/cl/81735 mentions this issue: dashboard: re-enable NetBSD/amd64 builder

@bradfitz
Copy link
Contributor Author

bradfitz commented Dec 4, 2017

@zoulasc, thank you so much. The amd64 builder is back & happy. So the spin was in fact my fault when I submitted the incomplete 2065685. I had missed semasleep's use of nanotime.

I verified that 386 works too on my local VM. We won't have 386 builders until pullup 413 lands in the netbsd-8 branch (status: https://releng.netbsd.org/cgi-bin/req-8.cgi) so we can have virtio scsi/rnd on GCE, where we run most of our builders.

It's possible I might find time to get a NetBSD Raspberry Pi builder running, but it's a low priority. I filed #22979 for that.

The 386 binaries still spin (and thus don't pass all.bash) when running on an amd64 kernel, though. If you're interested, that might be another bug to hunt.

gopherbot pushed a commit to golang/build that referenced this issue Dec 4, 2017
Updates golang/go#22968
Updates golang/go#20852

Change-Id: Ida2bc7213d418735e897d0bae8e5fba9ae18fd6c
Reviewed-on: https://go-review.googlesource.com/81735
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@krytarowski
Copy link
Contributor

We have users of golang on NetBSD/evbarm so it is definitely worth the trouble.

@zoulasc
Copy link
Contributor

zoulasc commented Dec 4, 2017

Thanks @bradfitz for keeping NetBSD/go going. I will keep looking for both the netbsd32 emulation bug and the 'message too long' issue on i386 -- I think that there is still a minor issue with i386 that needs to be fixed. The non-traditional syscall use by go makes it a good fuzzer :-)

@bradfitz
Copy link
Contributor Author

bradfitz commented Dec 4, 2017

@zoulasc, higher priority than the netbsd32 emulation is #22981 if you have time.

@golang golang locked and limited conversation to collaborators Dec 4, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. OS-NetBSD
Projects
None yet
Development

No branches or pull requests

4 participants