proposal: all: add plan9/arm64 port #57540

Closed
psilva261 opened this issue Jan 1, 2023 · 42 comments

@psilva261
Member

Hi!

It would be great to support plan9/arm64. With plan9front it could be used on the MNT Reform and on the Raspberry Pi. The implementation would be similar to plan9/arm.

Also there's a first draft which can already be used. Most tests pass but more work is needed.

Greetings, Philip

@seankhliao seankhliao changed the title plan9/arm64 port proposal: all: add plan9/arm64 port Jan 1, 2023
@gopherbot gopherbot added this to the Proposal milestone Jan 1, 2023
@ianlancetaylor
Contributor

CC @golang/release

See https://go.dev/wiki/PortingPolicy.

@heschi
Contributor

heschi commented Jan 3, 2023

The porting policy says:

The cost must be balanced by an overall benefit in the form of potential new users or use cases for Go.

Is there a significant demand for this?

@psilva261
Member Author

People have asked about Go on plan9/arm64 a few times on the Plan 9 mailing lists, which is also how I found out that a port wouldn't be that much effort. And of course, applications are generally in demand on Plan 9. For a possible builder, there has at least already been an offer to find a hosting option.

@DeedleFake

I ran into this when I was playing around with 9front on a Pi 4 not too long ago. I was quite disappointed. I had a number of things I wanted to experiment with but was unable to because of the lack of support.

@bcmills
Contributor

bcmills commented Jan 4, 2023

On the one hand, given that plan9 already supports 386, amd64, and arm, the arm64 variant seems like a missing cell in an otherwise-complete matrix.

On the other hand, the existing plan9 ports are already on pretty shaky ground — I count 21 open plan9 issues in the “Test Flakes” project as of right now. (I re-triaged those issues in December to attempt to enable watchflakes monitoring for plan9, but we are currently skipping those builders anyway pending some watchflakes improvements to be able to exclude them from the “broken commit” heuristics.)

So I wonder: would adding arm64 support make cross-arch fixes more likely, or would it contribute yet another set of plan9 failure modes needing triage?

@psilva261
Member Author

That's true. It's difficult to say, also since the tests aren't fixed completely yet, especially speaking of the package cmd/go. But apart from this there is overlap, e.g. TestServerEmptyBodyRace is problematic like it used to be on other platforms (#22540). With a faster CPU and more RAM it works though. So this might be an example of a test that could be optimized cross-arch/-platform.

I've also seen a problem with Lstat (#42115) as well as TestRenameOverwriteDest (#13844) while doing a first test run with a RAM disk. I could gather a bit more data with a more stable setup and fixes, although this would take some time.

@bcmills
Contributor

bcmills commented Jan 5, 2023

(CC @golang/plan9)

@bcmills
Contributor

bcmills commented Jan 6, 2023

The other major source of friction (at least for me) with the existing plan9 ports is the lack of scalable builders: today we don't have any plan9 builder that runs on scalable VMs (compare #29801), the plan9-arm builder (the most reliable of the three) is extremely slow, and gomote ssh doesn't work for ~any of them (#42117, #53571).

So when things do go wrong, the overhead for anybody on the Go team to create (or even verify!) a fix is quite high.

Would the builder for a plan9/arm64 port have these same limitations?

@oridb
Contributor

oridb commented Jan 10, 2023

This port wouldn't change any of those existing issues -- as far as I'm aware, there's nothing (beyond the hardware/virtual machines) that would be different about this.

However, I think it's worth having a separate conversation about what it would take to improve those issues; I don't see any difficulties around running 9front on GCP vms, or standing up some new external VMs, and I'd be willing to help maintain the images for that; I think @majiru has expressed some interest in improving gomote.

A while ago I asked for a builder key so I could try setting up a GCP 9front builder, but I was told to wait a few months, and never really heard back; I probably should have pinged.

@millerresearch
Contributor

the plan9-arm builder (the most reliable of the three) is extremely slow

This is pretty much entirely dist test overhead, performing staleness tests on the build cache. See #24300 and #57734.

@millerresearch
Contributor

As to the subject of the proposal: I support it (in the sense of agreeing it should happen) but it needs a couple of people in the 9front community to support it (in the sense of making a commitment to provide and baby-sit a builder, and fix bugs which are induced from time to time by changes to the mainstream ports).

@psilva261
Member Author

By the way, at least the fs related tests run much more stably when locking the OS thread in https://github.com/golang/go/blob/master/src/syscall/pwd_plan9.go (psilva261@de519d7). The plan9/arm builder seems to work anyway though. But together with a small fix in https://github.com/golang/go/blob/master/src/cmd/go/testdata/script/work_env.txt (psilva261@822f6af) and GOMAXPROCS=1, the cmd/go tests work as well. However, ../test fails consistently for now.

To add another option, it would also be possible to have an additional ssh server that provides text console access using drawterm.

@psilva261
Member Author

So by now we have additional 9front developers: me, @oridb and @majiru

I've also run numerous tests. With GO_TEST_SHORT='true' GO_TEST_TIMEOUT_SCALE=3 and GOMAXPROCS=2 I'm getting a failure rate similar to the existing arm builder.

Although I noticed with GOMAXPROCS=1 the failure rate got really low, just 2% (n>100). In both cases testenv.CPUIsSlow() was changed to return true for runtime.GOARCH == "arm64" && runtime.GOOS == "plan9" though.
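For readers unfamiliar with that tweak, here is a minimal standalone sketch of what such a change could look like. `cpuIsSlow` is a hypothetical stand-in written for illustration (internal/testenv cannot be imported from outside the Go tree), and the architecture list in it is an assumption, not a copy of the real function.

```go
package main

import (
	"fmt"
	"runtime"
)

// cpuIsSlow is a hypothetical stand-in for a testenv-style predicate that
// tests consult before choosing timeouts or iteration counts.
func cpuIsSlow(goos, goarch string) bool {
	// Illustrative list of architectures treated as slow (assumed).
	switch goarch {
	case "arm", "mips", "mipsle", "mips64", "mips64le", "wasm":
		return true
	}
	// The experimental tweak described above: also treat plan9/arm64 as slow.
	return goos == "plan9" && goarch == "arm64"
}

func main() {
	fmt.Println("this machine slow:", cpuIsSlow(runtime.GOOS, runtime.GOARCH))
	fmt.Println("plan9/arm64 slow:", cpuIsSlow("plan9", "arm64"))
	fmt.Println("linux/arm64 slow:", cpuIsSlow("linux", "arm64"))
}
```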

Speaking about 9front on GCP, with amd64 I got a failure rate of 7% with GOMAXPROCS=2 while ignoring cmd/go. (n=100)

GOMAXPROCS_1.log
GOMAXPROCS_2.log
amd64_GOMAXPROCS_2.log

@millerresearch
Contributor

Thanks to @psilva261 @oridb and @majiru for your contributions. Limiting GOMAXPROCS may be masking some of the subtler concurrency bugs (and putting less stress on possible Plan 9 concurrency bugs). But you've made good progress. Have you tried booting with *ncpu=1 as an alternative? Because some of the tests temporarily increase GOMAXPROCS, using *ncpu=1 is the only way to ensure there's no real concurrency.

Has your debugging uncovered anything useful for other Plan 9 platforms? (The change to Fixwd is an example.)

@psilva261
Member Author

True, this really is just a workaround. I need to try *ncpu=1 actually. No, not yet.

Unfortunately I noticed the 7% is difficult to reproduce. That was on a VPS with shared CPUs, with dedicated CPU cores this went to 21%. (Also n=100 and EPYC 7713) I wonder if frequency scaling could also have an effect.
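One way to judge how much such rates can move by chance alone is a confidence interval on the failure proportion. The sketch below computes an approximate 95% Wilson score interval, assuming the runs are independent and that the percentages correspond to exactly 7 and 21 failures out of 100 (the thread only gives percentages). The function name `wilson` is hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// wilson returns the approximate 95% Wilson score interval for
// k failures observed in n independent runs.
func wilson(k, n int) (lo, hi float64) {
	const z = 1.96 // normal quantile for a two-sided 95% interval
	nf := float64(n)
	p := float64(k) / nf
	denom := 1 + z*z/nf
	center := (p + z*z/(2*nf)) / denom
	half := z * math.Sqrt(p*(1-p)/nf+z*z/(4*nf*nf)) / denom
	return center - half, center + half
}

func main() {
	for _, k := range []int{7, 21} {
		lo, hi := wilson(k, 100)
		fmt.Printf("%d/100 failures: %.1f%% to %.1f%%\n", k, 100*lo, 100*hi)
	}
}
```

With only 100 runs per configuration the intervals are wide, so small differences between setups are hard to distinguish from noise.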

@millerresearch
Contributor

By the way, at least the fs related tests run much more stable when locking the OS thread in https://github.com/golang/go/blob/master/src/syscall/pwd_plan9.go (psilva261@de519d7).

I've opened an issue #58802 about this.

@rsc
Contributor

rsc commented Mar 15, 2023

The plan9 port is becoming increasingly difficult to maintain and holds back non-plan9 development. I have been wondering whether it should be moved out of the main repository or if there should be a push to try to bring it back closer to parity.

Part of this seems to be due to Plan 9 itself. As Go has matured, it has started making more sophisticated use of operating system functionality - still relying only on the intersection of Linux, Windows, and the BSDs - and Plan 9 has not kept up, with the result that Go on Plan 9 is getting progressively worse.

First, the plan9 port is particularly slow. Testing some work recently on Plan 9, I found that the plan9-386 gomote runs make.rc 10X slower than the linux-386 gomote runs make.bash (990s vs 107s). That makes trying to work on the plan9 port significantly more painful. Do we know why it's so much slower, and is there any plan to fix it? It looks to me like plan9-386 runs on an e2-standard-8 while linux-386 runs on an e2-standard-16, but that would only explain a factor of 2, not 10. Perhaps the problem is the much slower file system, although with 32 GB of memory on the e2-standard-8 perhaps the builder could run in ramfs. Perhaps it is something else. Making it faster to debug plan9 problems would probably be the most significant single improvement that could be made.

Second, the Plan 9 kernel does not provide some now-standard functionality that other operating systems do, and workarounds from user space do not work terribly well. For example, for #58802, it seems like the right solution is an rfork RFPWD flag that causes parent and child to share a working directory when set. Then the Go thread creation would set RFPWD, just as it sets RFMEM to share memory. Instead we have a complicated locking and inter-thread communication mechanism.

As another example, for #58894, the telemetry implementation depends on writable, shared mmap segments for an efficient counter implementation, and I don't believe Plan 9 has those. My plan is to not do any telemetry on Plan 9.

There are also many plan9-specific issues open. A quick skim of the issue tracker turns up:

Many of these are stale, but many of them are current too.

I obviously am partial to Plan 9 and reluctant to write this, but I've watched for years as various members of the Go team hit Plan 9-specific bugs they have to chase down / work around / disable tests for. That effort seems disproportionate to its benefit. It's by no means an every-day occurrence, but weird bugs on Plan 9 happen at a noticeably higher rate than on the other operating systems, especially when you consider the relatively low number of Plan 9 users.

Given that the current plan9 ports are not keeping up with Go generally, I am not completely enthusiastic about adding another port. As I mentioned at the start of the comment, I have been wondering if instead we should encourage Go for Plan 9 to be maintained as an out-of-tree port.

Thoughts?

@oridb
Contributor

oridb commented Mar 15, 2023

For what it's worth -- I've posted several times on the golang-dev list to request a builder key for a 9front machine, but haven't heard back. (https://groups.google.com/g/golang-dev/c/y-6KXft6Yeg/m/UMRsKz_jBwAJ)

I'd like to be able to get testing done and lower the burden on the go team. Doing things like fixing 'gomote' would be useful for this, and I think we'd need access to a builder in order to do this.

@majiru

majiru commented Mar 15, 2023

Running make.rc on 9front in a 4 core libvirt VM with 8GB of RAM, a ways from the 8 cores and 32G of RAM of the Plan 9 GCP runner, completes in just 354 seconds. Now this could be down to differences in the specific CPU used, but this seems more than I would expect for something like that. If @oridb can get keys we can get better comparisons, but this shows promise I think.

When looking through that issue list, it seems a pretty fair share of these issues are intermittent. Which makes me wonder if these are bugs with the os itself. I would be curious to see if we see the same rate of intermittent failures when using a 9front runner. Surely this is worth a try before tossing the Plan 9 target away?

@psilva261
Member Author

True, performance seems to have the most direct leverage to lower failure rates.

Regarding Plan 9 itself, while failure modes are somewhat shared between the various ports, when using 9front the really low level ones are almost not present though (things like sys: trap: ...). Common errors are:

  • internal/singleflight: TestDoAndForgetUnsharedRace
  • net, net/http: Read/Write deadline related
  • os/exec: TestContextCancel
  • time: TestTicker

A 9front builder on amd64 would surely be much easier to troubleshoot because of available performance and assuming that at least some corner-cases seem to be less likely. In fact 8 GB is enough even when using ramfs for all build artifacts. With that running make.rc is possible in 231 seconds (4 cores).

Regarding arm64 itself, that's quite a disadvantage of this port. Higher frequencies and fixed frequencies seem to be officially supported for Raspberry Pi within a small range, but I haven't had time to look into that yet. Actually, faster hardware that is already supported would be available in 2024.

Probably skipping/fixing individual tests isn't ideal. (But I could also fine-tune/fix individual tests on a regular basis.) Generally, seeing why the net, net/http tests often time out seems promising though. Setting up listeners looks time consuming. Also speaking about performance, profiling support on Plan 9 might be helpful, although that might take some time to implement as well.

@oridb
Contributor

oridb commented Mar 15, 2023

It's also worth noting that @pixelherodev has reported that building Go sometimes causes fossil to flake out; some of the issues mentioned seem consistent with file system oddity (eg, #50583, #21977). 9front has moved away from Fossil, so it would be interesting to see if there are issues on what we use.

@millerresearch
Contributor

@rsc:

I obviously am partial to Plan 9 and reluctant to write this, but I've watched for years as various members of the Go team hit Plan 9-specific bugs they have to chase down / work around / disable tests for. That effort seems disproportionate to its benefit.

I think that depends on what you see as the benefit. Actual use of go on Plan 9 for production applications is probably vanishingly small. But if go is meant to be a language for portable (OS agnostic) software, having an outlier OS or two should encourage developers to think beyond the linux/bsd/windows enclave. What happens when the next wonderful new OS appears on the scene needing a go port ... "oh sorry, go is really just for linux-alikes with aggressively optimised filesystems now".

As I mentioned at the start of the comment, I have been wondering if instead we should encourage Go for Plan 9 to be maintained as an out-of-tree port.

I don't think that would be practical. You'd have however-many internal go developers devoted (some fulltime) to making implementation changes without any need to think about Plan 9 feasibility, and two or three Plan 9 occasional volunteers trying to keep up by reverse-engineering what's changed and figuring out how to adapt it to Plan 9 or adapt Plan 9 to cope.

... the plan9-386 gomote runs make.rc 10X slower than the linux-386 gomote runs make.bash (990s vs 107s). That makes trying to work on the plan9 port significantly more painful. Do we know why it's so much slower, and is there any plan to fix it?

Correct me if I'm wrong @0intro, but I think the plan9-386 builder is limited to running on one CPU because of some undiagnosed glitches observed on a VM but not on real hardware. I don't know how to debug this without access to the VM environment google uses (is it GCE?).

... perhaps the builder could run in ramfs.

Plan 9 ramfs is not high performance, by design. (The man page says "This program is useful mainly as an example of how to write a user-level file server.") The particular problem with filesystem performance is the go implementation assumption that walking a source tree of several thousand files two or three times is a "no-op". I've done some experiments with interpolating a write-back directory cache, but the 9p model sets a limit on how quickly you can walk+open+read+clunk or even walk+stat+clunk. I think Plan 9 is not completely on its own here: for example some of the openbsd servers seem pretty slow too.

For example, for #58802, it seems like the right solution is an rfork RFPWD flag that causes parent and child to share a working directory when set. Then the Go thread creation would set RFPWD, just as it sets RFMEM to share memory. Instead we have a complicated locking and inter-thread communication mechanism.

I had an idea for a simpler workaround, by keeping a "chdir epoch" in each M to keep track of which threads need to sync the working dir with less locking, but you'd still need to lock g to m between chdir and the following syscall. I would go for RFPWD as a better solution.
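The "chdir epoch" idea can be modelled in a few lines. This is only a toy model under stated assumptions, not runtime code: `cwdState`, `thread`, `chdir`, and `sync` are hypothetical names, and the `dir` field of `thread` stands in for the per-process working directory the kernel tracks. As the comment notes, a real implementation would still have to keep the goroutine locked to its thread between the sync and the following syscall.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// cwdState models the process-wide view of the working directory.
type cwdState struct {
	epoch uint64 // bumped on every chdir; read atomically on the fast path
	mu    sync.Mutex
	dir   string
}

// thread models one OS thread (an M) with its own kernel-level directory.
type thread struct {
	dir   string // directory this thread is currently in
	epoch uint64 // epoch at which dir was last brought up to date
}

// chdir changes directory on thread t and publishes a new epoch.
func (s *cwdState) chdir(t *thread, dir string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.dir = dir
	t.dir = dir
	t.epoch = atomic.AddUint64(&s.epoch, 1)
}

// sync brings t up to date before a path-relative operation. The common
// case is a single atomic load with no lock; it reports whether a resync
// was needed.
func (s *cwdState) sync(t *thread) bool {
	if t.epoch == atomic.LoadUint64(&s.epoch) {
		return false
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	t.dir = s.dir
	t.epoch = atomic.LoadUint64(&s.epoch)
	return true
}

func main() {
	s := &cwdState{dir: "/"}
	a, b := &thread{dir: "/"}, &thread{dir: "/"}

	s.chdir(a, "/tmp")
	fmt.Println("a resynced:", s.sync(a))
	fmt.Println("b resynced:", s.sync(b), "now in", b.dir)
}
```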

... the telemetry implementation depends on writable, shared mmap segments for an efficient counter implementation, and I don't believe Plan 9 has those

Is segattach(2) not sufficient for this?

I think another big example of missing functionality in the Plan 9 implementation is the network poller. A lot of the intermittent net and http test problems I think are because of this. Would it be worth doing a Plan 9 network poller to try to get the semantics of network connecting and deadlines to match the other platforms?

There are also many plan9-specific issues open.

Yes. Many are stale, or so intermittent they can't realistically be replicated and diagnosed. Could we have an amnesty to close issues that haven't been seen in, say, a year?

@bcmills
Contributor

bcmills commented Mar 16, 2023

The particular problem with filesystem performance is the go implementation assumption that walking a source tree of several thousand files two or three times is a "no-op".

FWIW, we've been reducing the reliance on that assumption over time. For example, the package and module indexes added in cmd/go in Go 1.19 should substantially reduce the number of filesystem ops needed for many go commands.
(That said, the index is also more memory intensive on Plan 9 due to the lack of an mmap equivalent.)

@bcmills
Contributor

bcmills commented Mar 16, 2023

Many are stale, or so intermittent they can't realistically be replicated and diagnosed. Could we have an amnesty to close issues that haven't been seen in, say, a year?

Many of the issues that are still open are for tests that failed only on Plan 9, for which Skip calls were added. I've been pushing to make more of those Skip calls platform-independent (by adding more general predicates in internal/testenv and internal/platform and using those), but maintaining Skip calls for failing tests is still a significant source of overhead in maintaining a number of ports, plan9 included. A good starting point for closing out those issues would be to remove the skips and run the test enough times to confirm that the observed failure mode no longer reproduces.
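To illustrate the difference between a platform-specific skip and a capability predicate, here is a standalone sketch. `hasSymlink` is a hypothetical function written in the spirit of the internal/testenv helpers (which cannot be imported outside the Go tree), and its platform list is an assumption for illustration.

```go
package main

import "fmt"

// hasSymlink is a hypothetical capability predicate. Tests ask "can this
// platform do X?" instead of naming an operating system, so the list of
// affected platforms lives in one place.
func hasSymlink(goos string) (ok bool, reason string) {
	switch goos {
	case "plan9": // illustrative entry (assumed)
		return false, "symlinks are not supported on " + goos
	}
	return true, ""
}

func main() {
	for _, goos := range []string{"linux", "plan9"} {
		if ok, reason := hasSymlink(goos); !ok {
			fmt.Printf("%s: skip (%s)\n", goos, reason)
			continue
		}
		fmt.Printf("%s: run\n", goos)
	}
}
```

In a real test the same check would be written against runtime.GOOS and end in t.Skip(reason), so removing a skip for a platform means editing the predicate once rather than every test.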

If there are issues that haven't reproduced in a long time and don't have any skipped tests, I think it would be fine to close those out. The method I recommend for that is:

  1. Ensure that the issue has an appropriate watchflakes pattern and is in the Test Flakes project.
  2. Mark the issue WaitingForInfo.
  3. Subscribe to the issue for updates.
  4. Wait for the bot to close the issue due to the WaitingForInfo label, and check to make sure watchflakes didn't report any more failures in the meantime.

That will at least help to narrow down the open issues to the ones that still reproduce, and there are still several of those. Since watchflakes updates issues when matching failures are seen, you can get a good idea of which failure modes are still frequent by sorting those issues by last update:
https://github.com/golang/go/issues?q=is%3Aissue+is%3Aopen+project%3Agolang%2F20+label%3AOS-Plan9+sort%3Aupdated-desc+-label%3AWaitingForInfo

@millerresearch
Contributor

@oridb:

It's also worth noting that @pixelherodev has reported that building Go sometimes causes fossil to flake out; some of the issues mentioned seem consistent with file system oddity (eg, #50583, #21977).

#50583 was occurring on ramfs, so not relevant for fossil. #21977 was a failing test TestSparseFiles which was meant to be bypassed for Plan 9 (where sparse files are not visibly supported). Anyway the test was removed later in 2017 so I've closed that issue.

@bcmills
Contributor

bcmills commented Mar 16, 2023

I don't know how to debug this without access to the VM environment google uses (is it GCE?).

Yes, it is GCE for most of the amd64 and 386 builders.

if go is meant to be a language for portable (OS agnostic) software, having an outlier OS or two should encourage developers to think beyond the linux/bsd/windows enclave.

FWIW, I suspect that js/wasm and wasip1/wasm (#58141) also fit that role to at least some extent.

@millerresearch
Contributor

@majiru:

When looking through that issue list, it seems a pretty fair share of these issues are intermittent. Which makes me wonder if these are bugs with the os itself. I would be curious to see if we see the same rate of intermittent failures when using a 9front runner.

Yes, some are flaws of the OS or the platform (h/w or virtual). I too would like to see comparisons between 9front and Plan 9, but they must be on the same hardware or virtual platform to be meaningful. When running tests on Plan 9 with real intel hardware (4-core i7) I don't see any of the flakiness observed with the plan9-386 builder on a VM.

@millerresearch
Contributor

@psilva261:

Regarding Plan 9 itself, while failure modes are somewhat shared between the various ports, when using 9front the really low level ones are almost not present though (things like sys: trap: ...).

I also don't see the sys:trap errors with Plan 9 running on real hardware. It would be interesting to observe 9front on the same virtual environment that the plan9-386 uses.

@majiru

majiru commented Mar 16, 2023

@millerresearch I agree completely that the comparisons should be made on the same resources. However, lacking the ability to run them right now, I figured a quick gut check of build time would suffice to show that things are better than what @rsc alluded to.

Again, it seems we've been stuck waiting for keys. If that current approach for allowing us to stand up a runner is not possible, could we look at other ways in which we can get a 9front runner going?

@millerresearch
Contributor

@bcmills:

For example, the package and module indexes added in cmd/go in Go 1.19 should substantially reduce the number of filesystem ops needed for many go commands.

I was thinking especially of staleness tests. Executing

go list '-f={{if .Stale}}\tSTALE {{.ImportPath}}: {{.StaleReason}}{{end}}' cmd

does 11566 walks, 3498 stats, 5693 opens, 18457 reads, and so on: 49546 filesystem RPC ops altogether. Removing the checkNotStale calls from dist test has indeed made a huge improvement to test times. The bootstrap build still seems to call it a few times.

@0intro
Member

0intro commented Mar 16, 2023

The Plan 9 port has been mainly a voluntary effort. For the few people working on it over the years, it was sometimes hard to keep up with the evolution of Go. I agree that there is still a number of issues to be fixed, but nothing that can't be overcome.

Unfortunately, I don't have as much time as I would like to work on Go. However, I see new people like @oridb, @majiru or @psilva261 who are willing to contribute. I'd like to see a 9front builder running, since the Plan 9 port aims to support both Plan 9 and 9front. It would be much easier to reproduce the issues for the people working on them.

Recently, @millerresearch did a lot of work to fix some of the issues and the Plan 9 port is in much better shape now.

I agree it's too early to add a new arm64 port. The priority should be to stabilize the current 386, amd64 and arm ports.

@0intro
Member

0intro commented Mar 16, 2023

The 386 and amd64 builders are currently running in QEMU/KVM virtual machines (with VirtIO disk and VirtIO net). They are limited to a single CPU because we experienced issues with multiprocessing enabled.

@psilva261
Member Author

psilva261 commented Mar 16, 2023

@0intro Actually I've been running the tests with 9front/amd64 mentioned before on Linode which uses KVM behind the scenes for both shared and dedicated CPU cores with VirtIO SCSI disk and VirtIO net from what I can see through % pci.

FWIW when trying *ncpu=1 at least on arm64, the results were comparable to GOMAXPROCS=1. (3% failure rate for > 100 tests, based on upstream changes from December. The recent changes would need more/different tweaks though also speaking of failure modes mentioned above but I didn't really look into that yet)

@psilva261
Member Author

Although I realize probably it would be more interesting to check for sys:trap errors on 9front/386 anyway

@oridb
Contributor

oridb commented Mar 18, 2023

Also, while this thread is getting very off topic -- is there a way that we can run the 9front builder in a "silent mode" where the 9front folks get notifications on flakes, but nobody on the go project is expected to take a look?

I wouldn't be surprised to find out that there are quirks or bugs we should shake out before we expect anyone to pay attention, and I want to keep the maintenance burden as low as possible.

@rsc
Contributor

rsc commented Mar 29, 2023

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc
Contributor

rsc commented Apr 12, 2023

It sounds like the sentiment is we should focus on the existing plan9 ports before adding a new one. Do I have that right?

@psilva261
Member Author

I think so, also considering that there are quite a few options to set up a 9front/amd64 builder, which should make it feasible to stabilize the tests more quickly. (It would be great though if eventually arm64 could be added once the existing ones are more stable.)

@henesy

henesy commented Apr 13, 2023

I would also be willing to help with the plan9 builders and the arm64 port (MNT Reform)

@rsc
Contributor

rsc commented Apr 19, 2023

Based on the discussion above, this proposal seems like a likely decline.
— rsc for the proposal review group

@rsc
Contributor

rsc commented May 3, 2023

No change in consensus, so declined.
— rsc for the proposal review group

Projects
Status: Declined