
internal/poll: transparently support new linux io_uring interface #31908

Open
johanbrandhorst opened this issue May 8, 2019 · 35 comments
Labels: compiler/runtime, NeedsInvestigation, OS-Linux
Milestone

Comments

@johanbrandhorst
Member

johanbrandhorst commented May 8, 2019

A document on the latest Linux I/O syscall interface has been making the rounds of discussion forums on the internet:

http://kernel.dk/io_uring.pdf

I wanted to open a discussion on whether we could (and should) add transparent support for this on supported Linux kernels.

LWN article: https://lwn.net/Articles/776703/

@ianlancetaylor ianlancetaylor changed the title from "syscall: transparently support new linux io_uring interface" to "internal/poll: transparently support new linux io_uring interface" on May 8, 2019
@ianlancetaylor
Contributor

It should be feasible to fit this approach into our current netpoll framework. For some programs I think it would reduce the number of threads doing file I/O. It could also potentially reduce the number of system calls required to read from the network.

I'm concerned about coordinating access to the ring. The approach seems designed for high-performance communication between the application and the kernel, but it seems easiest to use for an application that uses a single thread for I/O, or in which each I/O thread uses its own ring. In Go, of course, each goroutine acts independently, and it seems infeasible for each thread to have a separate ring, so goroutines will need to coordinate their access to the I/O ring. That's fine, but on high-GOMAXPROCS systems I would worry about contention for the ring, contention that I think doesn't exist in the current epoll framework.

@ianlancetaylor ianlancetaylor added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label May 8, 2019
@ianlancetaylor ianlancetaylor added this to the Unplanned milestone May 8, 2019
@ianlancetaylor
Contributor

Thinking about this further, it's not clear that it makes sense to use this new interface for network I/O. It seems that it can only handle a fixed number of concurrent I/O requests, and it's going to be quite hard to make that work transparently and efficiently in Go programs. Without knowing what the Go program plans to do, we can't allocate the ring appropriately.

If that is true, we would only use it for file I/O, where we can reasonably delay I/O operations when the ring is full. In that case it would not fit into the netpoll system. Instead, we might in effect send all file I/O requests to a single goroutine which would add them to the ring, while a second goroutine would sleep until events were ready and wake up the goroutine waiting for them. That should limit the number of threads we need for file I/O and reduce the number of system calls.
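For illustration, a rough sketch of what that submitter/completer split might look like outside the runtime. The uring, uringSQE, uringCQE, uringSubmit and uringWaitCQE names are hypothetical placeholders, not existing runtime or library APIs; the snippet assumes the "sync" and "syscall" packages are imported.

// Sketch only: uring, uringSQE, uringCQE, uringSubmit and uringWaitCQE
// are hypothetical placeholders.
type ioResult struct {
	n     int32
	errno syscall.Errno
}

type ioRequest struct {
	sqe  uringSQE
	done chan ioResult
}

type fileIOService struct {
	mu      sync.Mutex
	pending map[uint64]chan ioResult // SQE user_data -> waiting requester
	nextID  uint64
}

// submitLoop is the single goroutine that owns the submission side of the ring.
func (s *fileIOService) submitLoop(ring *uring, reqs <-chan ioRequest) {
	for req := range reqs {
		s.mu.Lock()
		s.nextID++
		id := s.nextID
		s.pending[id] = req.done
		s.mu.Unlock()

		req.sqe.userData = id
		uringSubmit(ring, req.sqe) // may delay the request if the SQ ring is full
	}
}

// completeLoop sleeps until completions arrive, then wakes the requesters.
func (s *fileIOService) completeLoop(ring *uring) {
	for {
		cqe := uringWaitCQE(ring)
		s.mu.Lock()
		done := s.pending[cqe.userData]
		delete(s.pending, cqe.userData)
		s.mu.Unlock()

		res := ioResult{n: cqe.res}
		if cqe.res < 0 {
			res = ioResult{errno: syscall.Errno(-cqe.res)}
		}
		done <- res
	}
}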

@CAFxX
Contributor

CAFxX commented May 8, 2019

Without knowing what the Go program plans to do, we can't allocate the ring appropriately

I was wondering if we could use the approach Go maps use when growing: when the ring becomes full, we allocate a new, bigger one and start submitting requests to the new one. Once the old one no longer has pending requests, we deallocate it.
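A minimal sketch of that growth scheme; newRing, trySubmit, pendingOps, closeRing and the uring/uringSQE types are hypothetical placeholders, not an existing API.

// Sketch only: all identifiers below are hypothetical placeholders.
type growableRing struct {
	active *uring   // new submissions go here
	old    []*uring // previous rings still draining in-flight requests
}

func (r *growableRing) submit(sqe uringSQE) {
	if !trySubmit(r.active, sqe) {
		// Active ring is full: allocate a bigger one and keep the old
		// ring around until its in-flight requests complete.
		bigger := newRing(2 * r.active.entries)
		r.old = append(r.old, r.active)
		r.active = bigger
		trySubmit(bigger, sqe)
	}
	// Retire old rings once they have fully drained.
	kept := r.old[:0]
	for _, ring := range r.old {
		if pendingOps(ring) == 0 {
			closeRing(ring)
		} else {
			kept = append(kept, ring)
		}
	}
	r.old = kept
}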

it seems easiest to use for an application that uses a single thread for I/O, or in which each I/O thread uses its own ring. In Go of course each goroutine is acting independently, and it seems infeasible for each thread to have a separate ring

Can you elaborate on the "infeasible" part? Assuming having multiple rings is feasible, wouldn't having per-P rings work (with the appropriate fallback slow paths in case the per-P ring is full)? I'm not that familiar with the poller; is the problem that the model mismatch is too big?

@coder543

coder543 commented May 9, 2019

If that is true, we would only use it for file I/O

File I/O is the main reason I'm interested in io_uring. My understanding is that the current Linux kernel abstractions for async networking work just fine, but making file I/O appear to be interruptible so that goroutines don't block entire OS threads requires using a pool of I/O OS threads. If io_uring could remove the need for a pool of OS threads for file I/O, that seems to make it worthwhile.

Whether having a ring per P (as @CAFxX suggests) or just having a single thread dedicated to managing a ring... either solution seems fine. Unless there's some measurable advantage to io_uring for networking, I don't think it would be that important to switch the networking code at this point.

@ianlancetaylor
Contributor

Using a ring per P seems possible, but it means that when a P is idle, some M will have to be sleeping in an io_uring getevents call. And then there will have to be a way to wake up that M if we need it, and some way to avoid confusion if the P gets assigned to a different M. It may be doable, but it seems pretty complicated.

@JimChengLin

JimChengLin commented May 26, 2019

I think it is a killer feature of Linux 5.1. libaio (a wrapper around the Linux kernel's native async file I/O API) is broken, or at least flawed: it cannot do async buffered file I/O and can block unexpectedly. The author of io_uring claims that all of the problems libaio has are solved. It would be great if Go could use the new feature seamlessly.

@dahankzter

Apparently io_uring brings many benefits to networked I/O as well. Is that not the case, or is it just hard to accommodate the sizing of the ring?

@ianlancetaylor
Contributor

@dahankzter See the discussion above. When I looked at io_uring earlier I did not see how to fit it into the Go runtime for network I/O. I can certainly see how io_uring would work well for some network programs, but only ones that know exactly how many concurrent I/O requests they expect to have in flight. The Go runtime can never know that. This issue is about transparently using io_uring, and I don't see how to do that in Go for network connections. If you see how to do it, please explain. Thanks.

@axboe

axboe commented Jan 7, 2020

The fixed limit has been removed, and another networking concern was the fixed CQ ring size and the difficulty in sizing that. The latter was fixed with the IORING_FEAT_NODROP support, which guarantees that completion events are never dropped. I think that should take care of most of the concerns?
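For reference, a program can check whether the running kernel makes that guarantee by inspecting the feature bits returned by io_uring_setup(2). A minimal detection sketch using raw syscalls on linux/amd64 follows; the struct layout and constants are copied from the uapi header from memory and should be verified against linux/io_uring.h before use.

package main

import (
	"fmt"
	"syscall"
	"unsafe"
)

// Mirrors struct io_uring_params (verify against <linux/io_uring.h>).
type ioUringParams struct {
	sqEntries    uint32
	cqEntries    uint32
	flags        uint32
	sqThreadCPU  uint32
	sqThreadIdle uint32
	features     uint32
	wqFD         uint32
	resv         [3]uint32
	sqOff        [40]byte // struct io_sqring_offsets
	cqOff        [40]byte // struct io_cqring_offsets
}

const (
	sysIOUringSetup  = 425    // io_uring_setup on linux/amd64
	ioringFeatNoDrop = 1 << 1 // IORING_FEAT_NODROP
)

func main() {
	var p ioUringParams
	fd, _, errno := syscall.Syscall(sysIOUringSetup, 8, uintptr(unsafe.Pointer(&p)), 0)
	if errno != 0 {
		fmt.Println("io_uring unavailable:", errno)
		return
	}
	defer syscall.Close(int(fd))
	fmt.Println("IORING_FEAT_NODROP:", p.features&ioringFeatNoDrop != 0)
}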

One thing I've been toying with is the idea of having a shared backend in terms of threads between rings. I think that'd work nicely and would allow you to just set up one ring per thread and not worry too much about overhead.

@ianlancetaylor
Contributor

@axboe Thanks for the update. Where is the current API documented?

@axboe

axboe commented Jan 7, 2020

I wrote an update here:

https://kernel.dk/io_uring-whatsnew.pdf

and the man pages in liburing are also up-to-date.

@spacejam

spacejam commented Jan 13, 2020

I'm working on rio, a pure-Rust misuse-resistant io_uring library, and have been thinking a lot about edge cases that folks are likely to hit. One is exactly the overflow issue @axboe mentions that is addressed with IORING_FEAT_NODROP, which I work around on earlier kernel versions by blocking submission on the current number of in-flight requests and guaranteeing that no more requests than the CQ size are ever in flight (without that flag, completions could be dropped if the CQ overflows).
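In Go terms, that pre-NODROP workaround amounts to a counting semaphore sized to the CQ ring. A tiny sketch, where cqEntries is the CQ ring size and submitSQE/deliver are hypothetical helpers rather than a real API:

// inFlight caps concurrent requests at the CQ ring size so the CQ can
// never overflow on kernels without IORING_FEAT_NODROP.
var inFlight = make(chan struct{}, cqEntries)

func submit(sqe uringSQE) {
	inFlight <- struct{}{} // blocks once cqEntries requests are in flight
	submitSQE(sqe)
}

func onCompletion(cqe uringCQE) {
	<-inFlight // a CQ slot is free again
	deliver(cqe)
}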

In testing io_uring with sled I spin up thousands of database instances, each getting its own uring. This quickly hits ENOMEM, making that approach infeasible under this kind of test (details here). So you may run into this if you go for a ring per goroutine. Maybe a ring per Go processor (the P in proc.go) would be OK? Just be aware of this ENOMEM tendency.

Care needs to be taken when working with linked ops. If a linked SQE fails due to, say, a TCP client disconnecting, it will cancel everything down the chain, even if it wrote some bytes into the buffer passed to io_uring. On newer kernel versions you can use HARDLINK instead of LINK to write the partial data into downstream sockets/files/etc. even when the previous write received an error due to a disconnection.

Regarding concurrent access, it's not too complex. Just make sure that the shared submission queue tail gets bumped with a Release memory ordering after you've written data into all SQE slots before that point. If you're using SQPOLL you don't even need a syscall to submit (but you do need to check the SQ's flags to see whether the kernel spun down the polling thread due to inactivity, in which case you do need to issue a syscall to kick it off again; it goes away after 1s of inactivity). Completions are cheap to reap because you can quickly copy the sequence of ready ones and set the CQ's head value with a Release ordering afterward.
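A sketch of that tail bump in Go, assuming the SQ ring has already been mmap'd and that sqTail, sqArray, sqes and mask point into it (illustrative names only, uringSQE is a hypothetical SQE type); Go's sync/atomic store is strong enough to serve as the Release publication:

// publishSQEs writes the prepared SQEs into the shared ring and then
// makes them visible to the kernel by bumping the tail.
// (uses "sync/atomic")
func publishSQEs(sqTail *uint32, sqArray []uint32, sqes []uringSQE, mask uint32, prepared []uringSQE) {
	tail := atomic.LoadUint32(sqTail)
	for i, sqe := range prepared {
		idx := (tail + uint32(i)) & mask
		sqes[idx] = sqe    // fill the SQE slot
		sqArray[idx] = idx // point the ring slot at that SQE
	}
	// The tail store must come after the SQE writes; the atomic store
	// provides the Release ordering the kernel side relies on.
	atomic.StoreUint32(sqTail, tail+uint32(len(prepared)))
}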

io_uring will change everything. It allows us to dramatically reduce our syscalls (even without SQPOLL mode that spins up a kernel thread, negating the need to do a syscall to submit events). This is vital in a post-meltdown/spectre world where our syscalls have gotten much more expensive. It allows us to do real async file IO without O_DIRECT (but it also works great with O_DIRECT if you're building storage engines etc...). It lets us do things like write broadcast proxies where there's a read followed by a DRAIN barrier, then many socket writes that read from that same buffer. The goldrush has begun :)

@coder543

coder543 commented Jan 13, 2020

@spacejam do you have any interesting experiences that you can talk about from looking into using io_uring for async file I/O? I still find that to be the more compelling use case, since Linux did not have a proper story for async file I/O prior to io_uring, in my limited understanding.

I’m definitely interested in seeing benchmarks of io_uring for networking compared to other async networking approaches, since if io_uring is implemented for file I/O, I would speculate that it might only be an incremental amount of work to support io_uring for network I/O as well.

@axboe

axboe commented Jan 13, 2020

@coder543 It actually started as async file I/O, since async file I/O previously only worked with O_DIRECT on Linux. With io_uring, it works for both buffered and O_DIRECT I/O.

io_uring also supports network I/O. As of 5.4, you can do the regular read/readv or write/writev on the socket, or you can use sendmsg/recvmsg. 5.5 adds connect/accept support, and 5.6 will have support for send/recv as well.

@spacejam

@coder543 For bulk writing and reading on my 7th-gen Lenovo X1 Carbon with NVMe + full-disk encryption + ext4, using io_uring (without registered buffers or file descriptors, and without SQPOLL mode), I can hit 5gbps writing sequentially with O_DIRECT and 6.5gbps reading sequentially with O_DIRECT. This is the o_direct example in the rio repo. Using a threadpool or a single thread doing the same reads/writes, also with O_DIRECT but with synchronous operations, I can scrape 2gbps. dd can hit 4gbps for large writes on my system with default settings. I'll be collecting and publishing more statistics as I shift gears from ergonomics around rio to throughput and giving long-tail latency a haircut for various interesting IO patterns, especially around what I'm able to measure for buffer/fd registration under different kinds of workloads.

@johanbrandhorst
Member Author

High level overview: https://lwn.net/Articles/810414/

@acln0
Contributor

acln0 commented Feb 14, 2020

I think the focus should be on file I/O only, to begin with. Network I/O works just fine using netpoll already. If the model works out for file I/O, we can try it for network I/O also (modulo complications related to deadlines, but those don't seem insurmountable).

A ring per P sounds good to me. Each ring would maintain a cache of off-heap completion tokens which identify the parked goroutine and provide a slot to write the result of the operation to. These tokens would function much like pollDescs do for the existing netpoll implementation, but they would be cached per-P.

To perform I/O, a goroutine acquires the usual locks through internal/poll, then enters the runtime. It creates an SQE and submits it to the ring attached to its P, then parks itself. When the completion is handled, the goroutine is woken, much like netpoll currently does.

Some pseudocode:


The completion token:

//go:notinheap
type uringCompletionToken struct {
	link *uringCompletionToken // links to next token in per-P cache
	cg   guintptr              // the goroutine waiting for the completion
	res  int32                 // written by CQ handler, read by cg
}

SQ submission, parking the goroutine doing the submission:

	_g_ := getg().m.curg
	_p_ := _g_.m.p

	tok := _p_.uringTokenCache.alloc()
	tok.cg = guintptr(_g_)
	sqe := uringSQE{
		op: _IORING_OP_SOMETHING,
		fd: int32(fd),

		// etc. etc.

		userData: uint64(uintptr(unsafe.Pointer(tok))),
	}
	if errno := uringSubmit(_p_.uring, sqe); errno != 0 {
		// fall back out into regular internal/poll code path
		return 0, 0, notHandled
	}

	gopark(..., ..., waitReasonIOWait, ..., ...)

	// Now that we have woken up, interpret results from the token.
	var (
		n     int
		errno int
	)
	if tok.res < 0 {
		errno = int(-tok.res)
	} else {
		n = int(tok.res)
	}

	// Might have woken on another P: reload, then free the token.
	_p_ = getg().m.p
	_p_.uringTokenCache.free(tok)

	return n, errno, handled

Handling completions:

	cq := r.nextCQ()
	tok := (*uringCompletionToken)(unsafe.Pointer(uintptr(cq.data)))
	tok.res = cq.res
	goready((*g)(unsafe.Pointer(uintptr(tok.cg))), 4)

The question of who / what handles completions remains.

If I understand things correctly, after parking the current goroutine, we enter the scheduler and we are executing findrunnable with a P, which means that we can check per-P rings rather cheaply. If we have unhandled completions in the local ring, we can add the associated goroutines to the local runqueue and start execution right away.

If findrunnable, knowing that there is work pending on the local ring, observes an empty completion ring, it should arrange for something to wait on the ring, so that the P can carry on doing other things. This is the one tricky bit.

I think we can leverage the existing netpoll infrastructure to solve this problem. We associate a non-blocking event file descriptor (as in eventfd(2)) to each ring using io_uring_register(IORING_REGISTER_EVENTFD). We register these event file descriptors with netpoll, and we teach netpoll to handle them differently from regular file descriptors: if the M executing netpoll receives a notification on a file descriptor which is part of the uring eventfd set, then it looks for a *uring in the epoll user data field rather than a *pollDesc, as it currently does. It handles the notification by returning the eventfd to quiescent state, reading completions from associated rings, and pushing the associated goroutines to the global runqueue. If netpoll observes spurious notifications from the eventfds, it should know to ignore them, if polling for completions wasn't requested explicitly for the associated rings.
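A sketch of that eventfd registration step, using raw syscalls on linux/amd64 and golang.org/x/sys/unix. Nothing here is existing runtime code, and the constants mirror the uapi header from memory, so double-check them against linux/io_uring.h.

import (
	"syscall"
	"unsafe"

	"golang.org/x/sys/unix"
)

const (
	sysIOUringRegister    = 427 // io_uring_register on linux/amd64
	ioringRegisterEventfd = 4   // IORING_REGISTER_EVENTFD
)

// registerEventfd attaches a non-blocking eventfd to the ring so a
// poller (e.g. netpoll) can be told when completions are posted.
func registerEventfd(ringFD int) (int, error) {
	efd, err := unix.Eventfd(0, unix.EFD_NONBLOCK|unix.EFD_CLOEXEC)
	if err != nil {
		return -1, err
	}
	fd := int32(efd)
	_, _, errno := syscall.Syscall6(sysIOUringRegister,
		uintptr(ringFD), ioringRegisterEventfd,
		uintptr(unsafe.Pointer(&fd)), 1, 0, 0)
	if errno != 0 {
		unix.Close(efd)
		return -1, errno
	}
	// efd can now be added to the epoll set; every CQE posted to the
	// ring makes it readable.
	return efd, nil
}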


I have insufficient knowledge of the runtime to know whether what I am describing is, in fact, feasible. It is all very hand-wavy, and I haven't attempted an implementation (yet?).

What do you think?

@agnivade
Contributor

Went with the "single thread for I/O" approach and built a small POC here: https://github.com/agnivade/frodo :) Nothing serious, just something for me to learn about io_uring and do something with it.

There are definitely some hairy edge cases due to my limited experience with cgo. But it's at a stage where you can start playing around. I have thrown in a benchmark just for the sake of it, but it doesn't mean anything - it's not an apples-to-apples comparison, and the cgo boundary alone accounts for an order-of-magnitude difference.

@hodgesds

I've also been working on a pure Go library that has initial tests passing for reading from files. I wanted to have somewhat working code before looking at where to integrate it into the runtime. However, I've run into a few things that are rather difficult with Go's memory model, regarding memory barriers and dealing with multiple writers to the submit ring, which would be less of an issue in the solution proposed by @acln0.

@git001

git001 commented Jun 8, 2020

@axboe

I wrote an update here:

https://kernel.dk/io_uring-whatsnew.pdf

and the man pages in liburing are also up-to-date.

The git link points to Facebook Workplace; is this intentional?

https://l.workplace.com/l.php?u=http%3A%2F%2Fgit.kernel.dk%2Fliburing&h=AT3WDhsryImUEhlRMqW221yX-_Eo4Kru9lc1LpqTgxLCOVuCFAptdNw27wNawPNlBzYop81S1n9JiJr_ONyV6TLVtMz-Snbaj3ZU9Woj_R6HQCnYL2makW80B3VORrzUO4GcmxNb

@pete-woods

@spacejam do you have any interesting experiences that you can talk about from looking into using io_uring for async file I/O? I still find that to be the more compelling use case, since Linux did not have a proper story for async file I/O prior to io_uring, in my limited understanding.

I’m definitely interested in seeing benchmarks of io_uring for networking compared to other async networking approaches, since if io_uring is implemented for file I/O, I would speculate that it might only be an incremental amount of work to support io_uring for network I/O as well.

I saw ScyllaDB adopted it; they talked about it in this article: https://www.scylladb.com/2020/05/05/how-io_uring-and-ebpf-will-revolutionize-programming-in-linux/

@Iceber

Iceber commented Oct 16, 2020

I'm working on an easy-to-use io_uring library for Go (https://github.com/iceber/iouring-go). Both file I/O and socket I/O work fine, but the testing and documentation aren't complete yet!

@mvdan
Member

mvdan commented Oct 16, 2020

@Iceber I don't think it's an option to have the os or net packages depend on a vendored third party package, though. Unless you were offering to contribute parts of it, like in #31908 (comment). It seems like the difficulty is more with fitting this into Go's runtime, than implementing the use of io_uring itself.

@Iceber

Iceber commented Oct 16, 2020

@mvdan Yes, this is just an experiment for now, and it would be best for the standard library to provide these features.

But really integrating io_uring into net would be a huge and probably difficult change, and I don't really think net is going to use io_uring!
For file I/O you can go with third-party packages such as iouring-go, e.g. in a file server.

I'm also going to try to develop iouring-net, which combines io_uring and networking to achieve more efficient asynchronous networking with goroutines.
Obviously the third-party package iouring-net is not for production; it's just a thought and an experiment.


@godzie44

godzie44 commented Dec 13, 2021

Hi, I am working on a go-uring library. In addition to the liburing port itself, it provides a backend for I/O operations (called a reactor). With it, you can implement the net.Listener and net.Conn interfaces and compare the reactor (with an io_uring inside) against the standard mechanism, the netpoller. There is a benchmark using an echo server as the example, and the same benchmark also compares the go-uring library against liburing. The results suggest that the ring can at least be an interesting alternative to the netpoller.

@mappu

mappu commented Jan 13, 2022

Windows 11 is introducing a new ioring API that seems almost identical [1] to io_uring - so any work in this area might be applicable on both Windows and Linux.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 7, 2022
@bohdantrotsenko

As of September 2022, can the Go runtime detect whether the kernel supports io_uring (and switch to using it)?

@ianlancetaylor
Contributor

No. This issue is still open.

@aktau
Contributor

aktau commented Apr 13, 2023

Today I read an experience report on Hacker News where someone describes a way they found to work with io_uring that resulted in actual performance improvements over using epoll(2): https://news.ycombinator.com/item?id=35548968. Might be worth keeping in mind when experimenting with this in the netpoller.

@pawelgaczynski

Hi. I wrote a web framework that is based on io_uring: https://github.com/pawelgaczynski/gain. It is entirely written in Go and achieves really high performance. In a test environment based on an m6i.xlarge AWS EC2 instance, kernel 5.15.0-1026-aws and Go 1.19, it achieved 490k req/s in the plaintext TechEmpower benchmark. Gain uses an internal port of the liburing library. It is not a full port: it focuses primarily on the networking part, but I don't see any problem extending the implementation to include the rest of the liburing-supported operations and publishing the port on GitHub as a standalone package. It would also be worthwhile to create an additional layer of abstraction to use the liburing port in a more idiomatic way, as the port is currently very close to the prototype implemented in C and may not be the most intuitive for Go programmers. If anyone is interested in using my liburing port or would like to help develop it, please contact me by creating an issue in the Gain repository.

@pawelgaczynski

Hi. I have implemented and published an almost full port of the liburing library to the Go programming language (no cgo):

https://github.com/pawelgaczynski/giouring

The giouring API is very similar to the liburing API, so the liburing documentation is also valid for giouring (see README.md for more details).

@ocodista

Still no plans to use io_uring instead of epoll in Go's runtime?

@ericvolp12

ericvolp12 commented Jan 11, 2024

I wasn't sure whether to create a new issue or comment here, but we've run into some significant bottlenecks in the current netpoll implementation around syscall.EpollWait when working on very large (192-core) systems that have thousands of TCP sockets making fast requests.

I did a full writeup here, but the tl;dr is that in our use case we've got >1,500 TCP connections to ScyllaDB shards and >1,000 client connections talking to our ConnectRPC service. Everything is basically rack-local, so requests are generally sub-millisecond, and hundreds if not >1k sockets become ready simultaneously, causing a bottleneck in EpollWait (~65% of CPU usage in our CPU profile) since it only buffers 128 FDs per call (I'm assuming due to historical limits on epoll calls in the kernel).

We had to solve this by breaking up our binary across many containers on the same host (8 containers, each pinned to 24 different cores), which made EpollWait completely disappear from our CPU profiles.

@jakebailey

I brought this up in the bluesky thread that the above came from, but one thing to consider is that Docker has now disabled io_uring support for security reasons: moby/moby@891241e

This syncs the seccomp profile with changes made to containerd's default
profile in 1.

The original containerd issue and PR mention:

Security experts generally believe io_uring to be unsafe. In fact
Google ChromeOS and Android have turned it off, plus all Google
production servers turn it off. Based on the blog published by Google
below it seems like a bunch of vulnerabilities related to io_uring can
be exploited to breakout of the container.

2

Other security researchers also hold this opinion: see 3 for a
blackhat presentation on io_uring exploits.

For the record, these syscalls were added to the allowlist in 4.

This has prompted users of io_uring to have to rewrite things back to epoll so that their code works in containers (e.g. bun oven-sh/bun#7470).

I think it's extremely common for Go code to be running within containers, and if a large bulk of them can't use the feature, that seems pretty unfortunate. Hopefully things get worked out such that io_uring isn't getting bulk disabled for being generally unsafe.
