
x/build: run DragonflyBSD VMs on GCE? #23060

Open
bradfitz opened this issue Dec 8, 2017 · 55 comments
Assignees
Labels
Builders (x/build issues: builders, bots, dashboards), help wanted, NeedsFix (the path to resolution is known, but the work has not been done), new-builder, umbrella
Milestone

Comments

@bradfitz
Contributor

bradfitz commented Dec 8, 2017

Looks like Dragonfly now supports virtio:

https://leaf.dragonflybsd.org/cgi/web-man?command=virtio&section=4

So it should run on GCE?

If somebody could prepare make.bash scripts that script the install and produce bootable images, we could run it on GCE.

See the netbsd, openbsd, and freebsd directories as examples: https://github.com/golang/build/tree/master/env

(The script must run on Linux and use qemu to do the image creation.)
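A minimal sketch of such a script, assuming GNU tar, QEMU, and a DragonFly release ISO (names and sizes here are illustrative; the real scripts would drive the installer non-interactively):

qemu-img create -f raw disk.raw 16G
qemu-system-x86_64 -m 2048 -smp 2 \
  -drive file=disk.raw,format=raw,if=virtio \
  -cdrom dfly-x86_64-X.Y.Z_REL.iso -boot d \
  -netdev user,id=n0 -device virtio-net-pci,netdev=n0 \
  -nographic
tar -Sczf dragonfly-amd64-gce.tar.gz disk.raw   # GCE image: sparse tarball containing disk.raw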

/cc @tdfbsd

@bradfitz bradfitz added the help wanted and NeedsFix labels Dec 8, 2017
@gopherbot gopherbot added this to the Unreleased milestone Dec 8, 2017
@gopherbot gopherbot added the Builders (x/build issues: builders, bots, dashboards) label Dec 8, 2017
@bradfitz
Contributor Author

bradfitz commented Dec 9, 2017

More background on the dfly users list:
http://lists.dragonflybsd.org/pipermail/users/2017-December/313731.html

@bradfitz
Contributor Author

bradfitz commented Dec 9, 2017

In that thread, @rickard-von-essen says:

I have a working packer build of DragonFly BSD https://github.com/boxcutter/bsd.

The most interesting parts are the boot_command
https://github.com/boxcutter/bsd/blob/master/dragonflybsd.json#L5
and actual installer script https://github.com/boxcutter/bsd/blob/master/http/install.sh.dfly

@bradfitz
Contributor Author

bradfitz commented Aug 7, 2018

/cc @dmitshur

@bradfitz
Contributor Author

bradfitz commented Nov 2, 2018

Update: I just ran Dragonfly (5.2.2) at home on QEMU/KVM with virtio-scsi and virtio-net and it works fine.

So it should work fine on GCE, of course (which we already heard).

At this point I'm thinking we should just do this builder "by hand" for now, with a readme file of notes. I'll prepare the image by hand, then shut it down and copy its disk to a GCE image (uploading it as a sparse tarball).

We can automate it with expect or whatnot later. Perfect is the enemy of good, etc.
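For reference, the manual path amounts to something like this (bucket and image names are placeholders; GCE expects the tarball to contain the disk under the name disk.raw):

tar -Sczf dragonfly.tar.gz disk.raw
gsutil cp dragonfly.tar.gz gs://my-bucket/dragonfly.tar.gz
gcloud compute images create dragonfly-amd64 \
    --source-uri gs://my-bucket/dragonfly.tar.gz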

@bradfitz
Contributor Author

bradfitz commented Nov 2, 2018

I shut down my KVM/QEMU instance, copied its disk to a new GCE image, and created a GCE VM. It kernel panics on boot (over serial) with:

panic() at panic+0x236 0xffffffff805f8666 
panic() at panic+0x236 0xffffffff805f8666 
vfs_mountroot() at vfs_mountroot+0xfe 0xffffffff80672c7e 
mi_startup() at mi_startup+0x84 0xffffffff805c2a64 
Debugger("panic")
CPU0 stopping CPUs: 0x0000000e
 stopped
Stopped at      Debugger+0x7c:  movb    $0,0xe67a49(%rip)
db> 

So, uh, not as easy as I'd hoped.

@bradfitz
Contributor Author

bradfitz commented Nov 2, 2018

If we already have to do the whole double-virtualization thing for Solaris (#15581 (comment)) anyway, perhaps we could just reuse that mechanism to run Dragonfly in qemu/kvm under GCE.
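For context, the mechanism in question is a special license attached to the boot image, which is how GCE gates nested virtualization (resource names here are illustrative):

gcloud compute images create nested-virt-image \
    --source-disk my-disk --source-disk-zone us-central1-a \
    --licenses https://compute.googleapis.com/compute/v1/projects/vm-options/global/licenses/enable-vmx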

@cnst

cnst commented Dec 16, 2018

I tried working on this earlier this year (back in 2018-02) and had the image creation scripted, but I hit the same issue: it would work fine on my machines with vanilla QEMU, including with the disk accessible to DFly through DragonFly's vtscsi(4) under a local QEMU configured per the magic described at http://wiki.netbsd.org/tutorials/how_to_setup_virtio_scsi_with_qemu/, but it still wouldn't work on GCE with GCE's virtio_scsi. Is there any info on how GCE's virtio_scsi differs from QEMU's?

I've also tried running DragonFly BSD side by side with FreeBSD with CAMDEBUG, but it didn't seem to reveal anything obvious, although the underlying CAM logic does seem to be quite different, so it's probably the one to blame. I didn't run out of ideas, but I did run out of time back in February, and recently my GCE credits ran out as well.

Nested virtualisation sounds interesting. Does it require Linux on GCE, or would FreeBSD also work?
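The NetBSD wiki recipe referenced above boils down to attaching the disk through a virtio-scsi controller instead of virtio-blk, roughly (device and drive IDs are the conventional QEMU ones):

qemu-system-x86_64 -m 1024 \
  -device virtio-scsi-pci,id=scsi0 \
  -drive file=dfly.img,format=raw,if=none,id=hd0 \
  -device scsi-hd,drive=hd0,bus=scsi0.0 \
  -nographic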

@tuxillo
Contributor

tuxillo commented Feb 14, 2019

@cnst do you have instructions on how you tried DragonFly on GCE?

@gopherbot

Change https://golang.org/cl/162959 mentions this issue: dashboard, buildlet: add a disabled builder with nested virt, for testing

gopherbot pushed a commit to golang/build that referenced this issue Feb 15, 2019

This adds a linux-amd64 COS builder that should be just like our
existing linux-amd64 COS builder except that it's using a forked image
that has the VMX license bit enabled for nested virtualization. (GCE
appears to be using the license mechanism as some sort of opt-in
mechanism for features that aren't yet GA; might go away?)

Once this is in, it won't do any new builds as regular+trybot builders
are disabled. But it means I can then use gomote + debugnewvm to work
on preparing the other four image types.

Updates golang/go#15581 (solaris)
Updates golang/go#23060 (dragonfly)
Updates golang/go#30262 (riscv)
Updates golang/go#30267 (fuchsia)
Updates golang/go#23824 (android)

Change-Id: Ic55f17eea17908dba7f58618d8cd162a2ed9b015
Reviewed-on: https://go-review.googlesource.com/c/162959
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
@tuxillo
Contributor

tuxillo commented Feb 17, 2019

I've tried it myself and it seems DragonFly is unable to find the disk.
We're working on it already: https://bugs.dragonflybsd.org/issues/3175

@gopherbot

Change https://golang.org/cl/163057 mentions this issue: buildlet: change image name for COS-with-vmx buildlet

gopherbot pushed a commit to golang/build that referenced this issue Feb 19, 2019
The COS image I'd forked from earlier didn't have CONFIG_KVM or
CONFIG_KVM_INTEL enabled in its kernel, so even though I'd enabled the
VMX license bit for the VM, the kernel was unable to use it.

Now I've instead rebuilt the ChromiumOS "lakitu" board with a modified
kernel config:

   https://cloud.google.com/container-optimized-os/docs/how-to/building-from-open-source

More docs later. Still tinkering. Nothing uses this yet.

Updates golang/go#15581 (solaris)
Updates golang/go#23060 (dragonfly)
Updates golang/go#30262 (riscv)
Updates golang/go#30267 (fuchsia)
Updates golang/go#23824 (android)

Change-Id: Id2839066e67d9ddda939d96c5f4287af3267a769
Reviewed-on: https://go-review.googlesource.com/c/163057
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
@gopherbot

Change https://golang.org/cl/163301 mentions this issue: env/linux-x86-vmx: add new Debian host that's like Container-Optimized OS + vmx

gopherbot pushed a commit to golang/build that referenced this issue Feb 21, 2019

This adds scripts to create a new builder host image that acts like
Container-Optimized OS (has docker, runs konlet on startup) but with a
Debian 9 kernel + userspace that permits KVM for nested
virtualization.

Updates golang/go#15581 (solaris)
Updates golang/go#23060 (dragonfly)
Updates golang/go#30262 (riscv)
Updates golang/go#30267 (fuchsia)
Updates golang/go#23824 (android)

Change-Id: Ib1d3a250556703856083c222be2a70c4e8d91884
Reviewed-on: https://go-review.googlesource.com/c/163301
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
@gopherbot

Change https://golang.org/cl/202478 mentions this issue: dashboard: update Dragonfly tip policy for ABI change, add release builder

gopherbot pushed a commit to golang/build that referenced this issue Oct 21, 2019

From golang/go#34958 (comment) :

> Go's DragonFly support policy is that we support the latest stable
> release primarily, but also try to keep DragonFly master passing, in
> prep for it to become the latest stable release.
>
> But that does mean we need one more builder at the moment.

Updates golang/go#34958
Updates golang/go#23060

Change-Id: I84be7c64eac593dee2252c397f9529deea13605a
Reviewed-on: https://go-review.googlesource.com/c/build/+/202478
Reviewed-by: Tobias Klauser <tobias.klauser@gmail.com>
Reviewed-by: Bryan C. Mills <bcmills@google.com>
@bradfitz
Contributor Author

@tuxillo, looks like no progress on that bug, eh?

@tuxillo
Contributor

tuxillo commented Oct 21, 2019

Thanks for the reminder, I kind of forgot about this one. It's been a tough one anyway. I'll check with the team again next week to see if we can do something.

@cnst

cnst commented Oct 21, 2019

@bradfitz I have some time to work on it again, but my credits expired, and trying to sign up for a new account required some sort of extra verification. Is there a way to get the credits again to work on this? Also, is there any way to reproduce this bug outside of the Google environment? As per my 2018 comments, our driver works just fine in regular KVM using NetBSD's instructions for activating the codepath.

@bradfitz
Contributor Author

bradfitz commented Oct 21, 2019

GCP has a Free Tier these days:
https://cloud.google.com/free/

Compute Engine:

1 f1-micro instance per month (US regions only — excluding Northern Virginia [us-east4])
30 GB-months HDD
5 GB-months snapshot in select regions
1 GB network egress from North America to all region destinations per month (excluding China and Australia)

There's no way to reproduce it locally. GCP uses KVM but doesn't use QEMU, and its implementation of virtio-scsi etc. isn't open source.

@cnst

cnst commented Oct 21, 2019

@bradfitz How long does it take to recompile the kernel on this free instance? A few hours? It was already taking too long even on non-micro GCP instances compared to 15-year-old hardware.

I think it'd be great if there was a way to reproduce this problem locally, because our virtio-scsi drivers work just fine with anything but the proprietary GCP implementation.

Would it be helpful to provide automation for any other cloud provider?

@bradfitz
Contributor Author

@cnst, I didn't imagine you'd be using the f1-micro instance for compilation. I thought you'd use your normal development environment to build and then use the f1-micro to test-boot the results on GCE until it worked.

@tuxillo
Contributor

tuxillo commented Oct 23, 2019

@cnst what I did in my tests was to download the latest IMG, null-mount it, build a kernel with modifications, and install it into the mountpoint. Then I used gcloud/gsutil to upload the img and create the disk and the instance. You can retrieve the console output with gcloud, IIRC.
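The GCE half of that flow, spelled out with gcloud/gsutil (zone, bucket, and image names are placeholders):

tar -Sczf dfly.tar.gz disk.raw                  # GCE wants a sparse tarball containing disk.raw
gsutil cp dfly.tar.gz gs://my-bucket/dfly.tar.gz
gcloud compute images create dfly-test --source-uri gs://my-bucket/dfly.tar.gz
gcloud compute instances create dfly-vm --zone us-central1-a --image dfly-test
gcloud compute instances get-serial-port-output dfly-vm --zone us-central1-a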

@massar

massar commented Jul 21, 2022

Did you try adding a static entry after purging the discovered ones?

Also, if there is any form of IPv6, how does that act? Have you tried pinging ff02::1? ;)

Does putting the iface in promisc mode help?
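Spelled out as commands, using the gateway address and MAC that appear in the captures below (adjust to the actual environment):

arp -d -a                              # purge the discovered entries
arp -s 10.128.0.1 42:01:0a:80:00:01    # add a static entry for the gateway
ping6 ff02::1%vtnet0                   # ping the all-nodes link-local multicast group
ifconfig vtnet0 promisc                # try promiscuous mode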

@rsc
Contributor

rsc commented Jul 21, 2022

Thanks for the suggestions. Static ARP didn't help before, I hadn't tried IPv6, and tcpdump reported at startup that it could not put the interface in promiscuous mode at all.

Oddly, in the hour or so I have left the VM sitting here, it has fixed itself for UDP. This is unfortunate in the sense that I don't know what changed, which won't help the next time I create a VM, but it's working at the moment. I can't see anything different (except obviously the lack of ARP messages and the presence of UDP traffic):

[root@buildlet ~]# host swtch.com
18:00:55.963586 42:01:0a:80:00:0e > 42:01:0a:80:00:01, ethertype IPv4 (0x0800), length 69: 10.128.0.14.2112 > 169.254.169.254.53: 38747+ A? swtch.com. (27)
18:00:56.090490 42:01:0a:80:00:01 > 42:01:0a:80:00:0e, ethertype IPv4 (0x0800), length 133: 169.254.169.254.53 > 10.128.0.14.2112: 38747 4/0/0 A 216.239.38.21, A 216.239.32.21, A 216.239.36.21, A 216.239.34.21 (91)
swtch.com has address 216.239.38.21
swtch.com has address 216.239.32.21
swtch.com has address 216.239.36.21
swtch.com has address 216.239.34.21
18:00:56.091320 42:01:0a:80:00:0e > 42:01:0a:80:00:01, ethertype IPv4 (0x0800), length 69: 10.128.0.14.2720 > 169.254.169.254.53: 21334+ AAAA? swtch.com. (27)
18:00:56.212072 42:01:0a:80:00:01 > 42:01:0a:80:00:0e, ethertype IPv4 (0x0800), length 181: 169.254.169.254.53 > 10.128.0.14.2720: 21334 4/0/0 AAAA 2001:4860:4802:36::15, AAAA 2001:4860:4802:32::15, AAAA 2001:4860:4802:38::15, AAAA 2001:4860:4802:34::15 (139)
swtch.com has IPv6 address 2001:4860:4802:36::15
swtch.com has IPv6 address 2001:4860:4802:32::15
swtch.com has IPv6 address 2001:4860:4802:38::15
swtch.com has IPv6 address 2001:4860:4802:34::15
18:00:56.212775 42:01:0a:80:00:0e > 42:01:0a:80:00:01, ethertype IPv4 (0x0800), length 69: 10.128.0.14.1056 > 169.254.169.254.53: 28769+ MX? swtch.com. (27)
18:00:56.347300 42:01:0a:80:00:01 > 42:01:0a:80:00:0e, ethertype IPv4 (0x0800), length 184: 169.254.169.254.53 > 10.128.0.14.1056: 28769 5/0/0 MX ALT4.ASPMX.L.GOOGLE.com. 10, MX ALT1.ASPMX.L.GOOGLE.com. 5, MX ALT2.ASPMX.L.GOOGLE.com. 10, MX ASPMX.L.GOOGLE.com. 1, MX ALT3.ASPMX.L.GOOGLE.com. 10 (142)
swtch.com mail is handled by 10 ALT4.ASPMX.L.GOOGLE.com.
swtch.com mail is handled by 5 ALT1.ASPMX.L.GOOGLE.com.
swtch.com mail is handled by 10 ALT2.ASPMX.L.GOOGLE.com.
swtch.com mail is handled by 1 ASPMX.L.GOOGLE.com.
swtch.com mail is handled by 10 ALT3.ASPMX.L.GOOGLE.com.
[root@buildlet ~]# arp -an
? (10.128.0.1) at 42:01:0a:80:00:01 on vtnet0 permanent [ethernet]
? (10.128.0.1) at (incomplete) on vtnet0 permanent published [ethernet]
[root@buildlet ~]# netstat -rn
Routing tables

Internet:
Destination        Gateway            Flags    Refs      Use  Netif Expire
default            10.128.0.1         UGSc        1        0 vtnet0       
10.128.0.1/32      vtnet0             ULSc        2        0 vtnet0       
10.128.0.14/32     link#1             UC          0        0 vtnet0       
127.0.0.1          127.0.0.1          UH          0        0    lo0       

Internet6:
Destination                       Gateway                       Flags      Netif Expire
::1                               ::1                           UH          lo0       
fe80::%vtnet0/64                  link#1                        UC       vtnet0       
fe80::4001:aff:fe80:e%vtnet0      42:01:0a:80:00:0e             UHL         lo0       
fe80::%lo0/64                     fe80::1%lo0                   Uc          lo0       
fe80::1%lo0                       link#2                        UHL         lo0       
ff01::/32                         ::1                           U           lo0       
ff02::%vtnet0/32                  link#1                        UC       vtnet0       
ff02::%lo0/32                     ::1                           UC          lo0       
[root@buildlet ~]# ifconfig -a
vtnet0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=2a<TXCSUM,VLAN_MTU,JUMBO_MTU>
        ether 42:01:0a:80:00:0e
        inet6 fe80::4001:aff:fe80:e%vtnet0 prefixlen 64 scopeid 0x1
        inet 10.128.0.14 netmask 0xffffffff broadcast 10.128.0.14
        media: Ethernet 1000baseT <full-duplex>
        status: active
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
        options=43<RXCSUM,TXCSUM,RSS>
        inet 127.0.0.1 netmask 0xff000000
        inet6 ::1 prefixlen 128
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x2
        groups: lo
[root@buildlet ~]# route -nv show
Routing tables

Internet:
Destination          Gateway              Flags 
default              10.128.0.1          UG     
10.128.0.1           42:01:0a:80:00:01   UH     
10.128.0.1           42:01:0a      U      
10.128.0.14          link#1              U      
127.0.0.1            127.0.0.1           UH     
169.254.169.254      10.128.0.1          UGH    

Internet6:
Destination          Gateway              Flags 
::1                  ::1                 UH     
fe80::%vtnet0        link#1              U      
fe80::4001:aff:fe80:e%vtnet0 42:01:0a:80:00:0e UH     
fe80::%lo0           fe80::1%lo0         U      
fe80::1%lo0          link#2              UH     
ff01::               ::1                 U      
ff02::%vtnet0        link#1              U      
ff02::%lo0           ::1                 U      
[root@buildlet ~]# 

Now that I notice it, the line in route -nv show that has a half-MAC address is a bit odd. But it was there when things were broken and remains there now that they are working.

There is no obvious explanation for what changed. The only traffic shown by the background tcpdump between an hour ago and when things were working just now is ARP requests from the router for the VM's IP address, and the VM replying, one round trip per minute like clockwork.

I started two more VMs. One was working at boot (first time!). The other came up in the "TCP is fine, UDP is broken" state.

@massar

massar commented Jul 21, 2022

The 'published' flag might indicate proxy ARP... that would be quite interesting, and might be the case in your environment: you are in one VLAN, and the gateway is actually in another VLAN.

I guess that your local box at least is not playing proxy_arp... but the remote one might...

@paulzhol
Member

paulzhol commented Jul 21, 2022

@rsc <TXCSUM,VLAN_MTU,JUMBO_MTU>: having TXCSUM alone, without RXCSUM, looks strange on the virtio NIC.
We have had it disabled (both tx and rx hardware checksum offloading) on the FreeBSD builders for many years now:
https://github.com/golang/build/blob/d35cb804da1f71ec56603f818a96dd0b43e14da5/env/freebsd-amd64/loader.conf#L6
It is also recommended for pfSense (a FreeBSD-based firewall appliance).

@rsc
Contributor

rsc commented Jul 22, 2022

Thanks @paulzhol, I will see what effect that has. I've found that ifdown/ifup/dhclient vtnet0 seems to "correct" the problem, so another option I am trying is just doing that as needed (up to 10 times) before trying to download the buildlet.
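A sketch of that reset-as-needed loop (the probe command is illustrative: host does a UDP DNS query, which is exactly what breaks):

n=0
while [ $n -lt 10 ]; do
    host -W 3 metadata.google.internal >/dev/null 2>&1 && break   # UDP DNS works; proceed
    ifconfig vtnet0 down
    ifconfig vtnet0 up
    dhclient vtnet0
    n=$((n+1))
done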

@paulzhol
Member

@rsc, the manpage mentions why RXCSUM is disabled.
Maybe you can add an ifconfig -txcsum vtnet0 to your current flow instead of disabling it via the bootloader.
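Both variants side by side, for comparison (the loader.conf line is FreeBSD's vtnet(4) tunable used by the builders, linked above; the runtime command is the suggestion here and disables only TX checksum offload):

# /boot/loader.conf, applied at boot:
#   hw.vtnet.csum_disable="1"
# or at runtime, per interface:
ifconfig vtnet0 -txcsum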

@rsc
Contributor

rsc commented Jul 22, 2022

Disabling TXCSUM did not help, but thanks for the suggestion. I have left it disabled.

I just did 10 runs of all.bash.
3 came up OK the first time. 4 required one reset. 3 required two resets.
So it looks like a reset has about a 50% chance of working.
The buildlet script is willing to do up to 10 and then it powers down the machine.
This should be good enough, if dissatisfying.

@paulzhol
Member

Another point to consider: cmd/buildlet lowers the MTU to 1460 on FreeBSD, OpenBSD, and Plan 9 (all GCE VMs?):
https://github.com/golang/build/blob/4864e2e8a08906f74b4ee3a973596fd7a93e9273/cmd/buildlet/buildlet.go#L440-L448
Your tcpdump shows mss 1420 in the first host -T for the flow 10.128.0.14.1216 > 169.254.169.254.53 in both SYN and SYN+ACK, but after you reset the ARP tables, the 10.128.0.14.4880 > 169.254.169.254.53 flow shows mss 1460 in the SYN and mss 1420 in the returning SYN+ACK.
I don't really have a good story for how this could trigger the small UDP packets to send these ARPs, but it still looks strange enough.
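For reference, the adjustment in the linked code is equivalent to running this on the interface (the builder does it programmatically):

ifconfig vtnet0 mtu 1460    # match GCE's 1460-byte network MTU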

@gopherbot

Change https://go.dev/cl/419083 mentions this issue: dashboard: add new, unused dragonfly-amd64-622 builder

@gopherbot

Change https://go.dev/cl/419081 mentions this issue: env/dragonfly-amd64: add scripts for building GCE VM image

@gopherbot

Change https://go.dev/cl/419084 mentions this issue: dashboard: use dragonfly on GCE for dragonfly-amd64 builds

@tuxillo
Contributor

tuxillo commented Jul 23, 2022

Glad to see progress on this issue!

I see there are problems with UDP and vtnet, but it is not clear to me how to reproduce them. Is there anything we, from DragonFlyBSD, should do or investigate?

I've also seen that you created a GCE image for 6.2.2; are you going to follow RELEASE only? And what do we do with the reverse builder?

@rsc
Contributor

rsc commented Jul 29, 2022

@tuxillo, would you be willing to review https://go-review.googlesource.com/c/build/+/419081/ to see if it looks like it makes sense?

To answer your questions:

When the image boots on GCP - a completely standard build, a VM configured with just the Dragonfly install CD should be enough to reproduce - it just can't do any UDP traffic at all. UDP traffic triggers ARP requests for the gateway instead. So 'host -W 3 google.com' times out for example, but 'host -T -W 3 google.com' works fine. This is the state after bringing up vtnet0 at boot on something like half the times it boots. I don't understand what could possibly cause that failure mode, honestly. It could be Dragonfly or it could be something about the virtio network device on Google Cloud's side.

I used a standard release for reproducibility. Over at https://farmer.golang.org/builders we have a list of the builders for other systems and we typically have a few different release versions as needed for supportability. The idea is that we'd add a new builder for new releases and retire the old ones. Does that seem like a reasonable plan to you?

We haven't changed over from the reverse builder yet, but once we do I will post here. At that point you can retire the reverse builder, with our gratitude for keeping it running for so long.

Thanks!

@tuxillo
Contributor

tuxillo commented Jul 30, 2022

@tuxillo, would you be willing to review https://go-review.googlesource.com/c/build/+/419081/ to see if it looks like it makes sense?

@rsc the patch looks good to me, and it's far better than what I could provide, which was nothing. :-) It also helps me understand the image-creation process from your side.

To answer your questions:

When the image boots on GCP - a completely standard build, a VM configured with just the Dragonfly install CD should be enough to reproduce - it just can't do any UDP traffic at all. UDP traffic triggers ARP requests for the gateway instead. So 'host -W 3 google.com' times out for example, but 'host -T -W 3 google.com' works fine. This is the state after bringing up vtnet0 at boot on something like half the times it boots. I don't understand what could possibly cause that failure mode, honestly. It could be Dragonfly or it could be something about the virtio network device on Google Cloud's side.

I can see you're using "DHCP mtu 1460" when setting up the vtnet network interface, but I don't know why. We have two DHCP clients: one is dhclient, which comes from OpenBSD and is a bit outdated, and the other is dhcpcd. We have known issues with dhclient in virtual environments (see https://bugs.dragonflybsd.org/issues/3317); not sure if this affects GCE VMs too.
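For a quick test, switching a live interface from dhclient to dhcpcd looks like this (a sketch; whether dhcpcd avoids the GCE issue is untested):

pkill dhclient            # stop the OpenBSD-derived client
dhcpcd vtnet0             # let dhcpcd configure the interface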

Is there a way I can pick up the already generated image and boot it myself in GCP so I can try? Or should I generate a new one myself? Also, I'd need the network configuration to use in GCP to get a setup as close as possible to the one you had.

I used a standard release for reproducibility. Over at https://farmer.golang.org/builders we have a list of the builders for other systems and we typically have a few different release versions as needed for supportability. The idea is that we'd add a new builder for new releases and retire the old ones. Does that seem like a reasonable plan to you?

Our release model is very typical: a point release is the stable version, i.e. RELEASE-6.2, which is then tagged for minors (.2, .3, and so on), and this is done twice a year.

Then we have our "master" branch, which is what you'd call "tip", I think, but the difference is that most of the DFly developers use this one, so normally it is pretty stable. Ideally, if you don't mind, under amd64 (we only support one arch atm) we'd have something like what the freebsd builder has. For example, "6_2" and "BE" (bleeding-edge) or tip, whatever you want to call it.

We haven't changed over from the reverse builder yet, but once we do I will post here. At that point you can retire the reverse builder, with our gratitude for keeping it running for so long.

Sure thing, thanks!

Thanks!

@tuxillo
Contributor

tuxillo commented Jul 30, 2022

Not directly related, but I also discovered an easy way to panic the kernel:

root@buildlet:~ # ifconfig vtnet0 mtu 16384
panic: overflowed mbuf 0xfffff8037c5bec00
cpuid = 8
Trace beginning at frame 0xfffff8037cf9c6e8
m_free() at m_free+0x351 0xffffffff806be5c1 
m_free() at m_free+0x351 0xffffffff806be5c1 
m_freem() at m_freem+0x15 0xffffffff806be845 
vtnet_newbuf() at vtnet_newbuf+0x4b 0xffffffff80a71e9b 
vtnet_init() at vtnet_init+0x108 0xffffffff80a73848 
vtnet_ioctl() at vtnet_ioctl+0x213 0xffffffff80a73d23 
Debugger("panic")

CPU8 stopping CPUs: 0x0000feff
 stopped
Stopped at      Debugger+0x7c:  movb    $0,0xbcc819(%rip)
db> 
db>

Thanks for reporting, created: https://bugs.dragonflybsd.org/issues/3320

@rsc
Contributor

rsc commented Aug 2, 2022

I can see you're using "DHCP mtu 1460" when setting up the vtnet netwok interface, but I don't know why.

I tried that because FreeBSD was setting the smaller MTU as well. Not setting it didn't help.

We have two DHCP clients, one is dhclient which comes from OpenBSD and it is a bit outdated and the other one is dhcpcd. We have known issues with dhclient in virtual environments (see https://bugs.dragonflybsd.org/issues/3317), not sure if this affects GCE VMs too.

Thanks for this tip. I will give dhcpcd a try.

@rsc
Contributor

rsc commented Aug 2, 2022

Then we have our "master" branch which is what you'd call "tip" I think, but the difference is that most of the DFly developers use this one, so normally it is pretty stable. Ideally, if you don't mind, under amd64 (we only support one arch atm) we'd have something what the freebds builder has. For example, "6_2" and "BE" (bleeding-edge) or tip, whatever you want to call it.

The only problem with bleeding-edge is that it means we have to keep rebuilding the image at regular intervals, which we could do, but it's a bit of a pain. It also means that results change when the builder changes, whereas we try to keep the builder constant and have only our Go tree changing. For comparison, as I understand it we do not have any FreeBSD builder tracking the dev branch, just numbered releases.

I will work on getting you precise directions for GCP.

@rsc
Contributor

rsc commented Aug 2, 2022

This bug is going to auto-close in a little while but we still won't have moved off the reverse builder yet. I'll post here when we have.

gopherbot pushed a commit to golang/build that referenced this issue Aug 2, 2022
Now that Dragonfly runs on GCE, we can do that and retire the
one very slow reverse builder we are using today.

For golang/go#23060.

Change-Id: I2bd8c8be6735212ba6a8023327864b79dea08cf3
Reviewed-on: https://go-review.googlesource.com/c/build/+/419081
Auto-Submit: Russ Cox <rsc@golang.org>
Run-TryBot: Russ Cox <rsc@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Heschi Kreinick <heschi@google.com>
@tuxillo
Contributor

tuxillo commented Aug 2, 2022

The only problem with bleeding-edge is that it means we have to keep rebuilding the image at regular intervals, which we could do, but it's a bit of a pain. It also means that results change when the builder changes, whereas we try to keep the builder constant and have only our Go tree changing. For comparison, as I understand it we do not have any FreeBSD builder tracking the dev branch, just numbered releases.

A good compromise perhaps is to rebuild bleeding-edge only when we bump the __DragonFly_version macro (https://github.com/DragonFlyBSD/DragonFlyBSD/blob/master/sys/sys/param.h#L244), which we only do when there are significant changes; you can see the version history in that header file.
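For scripting that policy, the current value of the macro can be read straight out of the installed header, e.g.:

awk '/#define[[:space:]]+__DragonFly_version/ { print $3 }' /usr/include/sys/param.h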

I will work on getting you precise directions for GCP.

Okay thanks.

@tuxillo
Contributor

tuxillo commented Aug 2, 2022

This bug is going to auto-close in a little while but we still won't have moved off the reverse builder yet. I'll post here when we have.

Sure, let me know.

@gopherbot

Change https://go.dev/cl/420756 mentions this issue: dashboard: rename dragonfly-amd64 builder to dragonfly-amd64-622

gopherbot pushed a commit to golang/build that referenced this issue Aug 2, 2022
Two reasons: first, the builder is pinned to 6.2.2.
Second, the reverse builder is still dialing in and
confusing the coordinator. Make a clean break with the past.

For golang/go#23060.

Change-Id: Ia19cb6ef3fefef323b41c14298ef8dbc90a6e27b
Reviewed-on: https://go-review.googlesource.com/c/build/+/420756
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Run-TryBot: Russ Cox <rsc@golang.org>
Auto-Submit: Russ Cox <rsc@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
@rsc rsc reopened this Aug 2, 2022
@rsc
Contributor

rsc commented Aug 2, 2022

I started the GCE builder but it has yet to complete a build. After make.bash it is supposed to upload the tree back to the coordinator, and in 5 minutes it is only able to transfer about 130 MB, which turns out not to be the whole thing. Perhaps this is the MTU thing, or perhaps it is something else. I am going to try to reproduce slow network uploads in a simpler context. We may turn the reverse builder back on in the interim. I will keep this issue posted.

@rsc
Contributor

rsc commented Aug 3, 2022

We have our first 'ok' on build.golang.org for dragonfly-amd64-622. We still need to figure out the upload slowness (worked around for now by disabling that upload) and perhaps also the boot-time network issue (which may be related), but it's working, and much more scalable.

@tuxillo, please feel free to shut down the reverse builder, and thanks again for keeping it running for so long!

@rsc
Contributor

rsc commented Aug 3, 2022

Leaving this issue open for the networking issues.

@daftaupe

daftaupe commented Jan 6, 2023

Not directly related, but I also discovered an easy way to panic the kernel:

root@buildlet:~ # ifconfig vtnet0 mtu 16384
panic: overflowed mbuf 0xfffff8037c5bec00
cpuid = 8
Trace beginning at frame 0xfffff8037cf9c6e8
m_free() at m_free+0x351 0xffffffff806be5c1 
m_free() at m_free+0x351 0xffffffff806be5c1 
m_freem() at m_freem+0x15 0xffffffff806be845 
vtnet_newbuf() at vtnet_newbuf+0x4b 0xffffffff80a71e9b 
vtnet_init() at vtnet_init+0x108 0xffffffff80a73848 
vtnet_ioctl() at vtnet_ioctl+0x213 0xffffffff80a73d23 
Debugger("panic")

CPU8 stopping CPUs: 0x0000feff
 stopped
Stopped at      Debugger+0x7c:  movb    $0,0xbcc819(%rip)
db> 
db>

This should be fixed with https://gitweb.dragonflybsd.org/dragonfly.git/commit/20bf50996e30140ca0d813694090469045bba0c4 for what it's worth.

This has also been merged into the DragonFly_RELEASE_6_4 branch.
