
internal/poll: transparently support new linux io_uring interface #31908

Open
johanbrandhorst opened this issue May 8, 2019 · 35 comments
Labels: compiler/runtime, NeedsInvestigation, OS-Linux
Milestone

Comments

@johanbrandhorst
Member

johanbrandhorst commented May 8, 2019

A document on the latest Linux I/O syscall interface has been making the rounds of discussion forums on the internet:

http://kernel.dk/io_uring.pdf

I wanted to open a discussion on whether we could (and should) add transparent support for this on supported Linux kernels.

LWN article: https://lwn.net/Articles/776703/

@ianlancetaylor ianlancetaylor changed the title from "syscall: transparently support new linux io_uring interface" to "internal/poll: transparently support new linux io_uring interface" on May 8, 2019
@ianlancetaylor
Contributor

It should be feasible to fit this approach into our current netpoll framework. For some programs I think it would reduce the number of threads doing file I/O. It could also potentially reduce the number of system calls required to read from the network.

I'm concerned about coordinating access to the ring. The approach seems designed for high-performance communication between the application and the kernel, but it seems easiest to use for an application that uses a single thread for I/O, or in which each I/O thread uses its own ring. In Go, of course, each goroutine acts independently, and it seems infeasible for each thread to have a separate ring, so goroutines will need to coordinate their access to the I/O ring. That's fine, but on high-GOMAXPROCS systems I would worry about contention for the ring, contention that I think doesn't exist in the current epoll framework.

@ianlancetaylor ianlancetaylor added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label May 8, 2019
@ianlancetaylor ianlancetaylor added this to the Unplanned milestone May 8, 2019
@ianlancetaylor
Contributor

Thinking about this further, it's not clear that it makes sense to use this new interface for network I/O. It seems that it can only handle a fixed number of concurrent I/O requests, and it's going to be quite hard to make that work transparently and efficiently in Go programs. Without knowing what the Go program plans to do, we can't allocate the ring appropriately.

If that is true, we would only use it for file I/O, where we can reasonably delay I/O operations when the ring is full. In that case it would not fit into the netpoll system. Instead, we might in effect send all file I/O requests to a single goroutine which would add them to the ring, while a second goroutine would sleep until events were ready and wake up the goroutine waiting for them. That should limit the number of threads we need for file I/O and reduce the number of system calls.
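For illustration, a rough sketch of what that submitter/completer split might look like outside the runtime. The uring, uringSQE, uringCQE, uringSubmit and uringWaitCQE names are hypothetical placeholders, not existing runtime or library APIs; the snippet assumes the "sync" and "syscall" packages are imported.

// Sketch only: uring, uringSQE, uringCQE, uringSubmit and uringWaitCQE
// are hypothetical placeholders.
type ioResult struct {
	n     int32
	errno syscall.Errno
}

type ioRequest struct {
	sqe  uringSQE
	done chan ioResult
}

type fileIOService struct {
	mu      sync.Mutex
	pending map[uint64]chan ioResult // SQE user_data -> waiting requester
	nextID  uint64
}

// submitLoop is the single goroutine that owns the submission side of the ring.
func (s *fileIOService) submitLoop(ring *uring, reqs <-chan ioRequest) {
	for req := range reqs {
		s.mu.Lock()
		s.nextID++
		id := s.nextID
		s.pending[id] = req.done
		s.mu.Unlock()

		req.sqe.userData = id
		uringSubmit(ring, req.sqe) // may delay the request if the SQ ring is full
	}
}

// completeLoop sleeps until completions arrive, then wakes the requesters.
func (s *fileIOService) completeLoop(ring *uring) {
	for {
		cqe := uringWaitCQE(ring)
		s.mu.Lock()
		done := s.pending[cqe.userData]
		delete(s.pending, cqe.userData)
		s.mu.Unlock()

		res := ioResult{n: cqe.res}
		if cqe.res < 0 {
			res = ioResult{errno: syscall.Errno(-cqe.res)}
		}
		done <- res
	}
}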

@CAFxX
Contributor

CAFxX commented May 8, 2019

Without knowing what the Go program plans to do, we can't allocate the ring appropriately

I was wondering if we could use the approach Go maps use when growing: when the ring becomes full, we allocate a new, bigger one and start submitting requests to the new one. Once the old one no longer has pending requests, we deallocate it.
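A minimal sketch of that growth scheme; newRing, trySubmit, pendingOps, closeRing and the uring/uringSQE types are hypothetical placeholders, not an existing API.

// Sketch only: all identifiers below are hypothetical placeholders.
type growableRing struct {
	active *uring   // new submissions go here
	old    []*uring // previous rings still draining in-flight requests
}

func (r *growableRing) submit(sqe uringSQE) {
	if !trySubmit(r.active, sqe) {
		// Active ring is full: allocate a bigger one and keep the old
		// ring around until its in-flight requests complete.
		bigger := newRing(2 * r.active.entries)
		r.old = append(r.old, r.active)
		r.active = bigger
		trySubmit(bigger, sqe)
	}
	// Retire old rings once they have fully drained.
	kept := r.old[:0]
	for _, ring := range r.old {
		if pendingOps(ring) == 0 {
			closeRing(ring)
		} else {
			kept = append(kept, ring)
		}
	}
	r.old = kept
}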

it seems easiest to use for an application that uses a single thread for I/O, or in which each I/O thread uses its own ring. In Go of course each goroutine is acting independently, and it seems infeasible for each thread to have a separate ring

Can you elaborate on the "infeasible" part? Assuming having multiple rings is feasible, wouldn't having per-P rings work (with the appropriate fallback slow paths in case the per-P ring is full)? I'm not that familiar with the poller; is the problem that the model mismatch is too big?

@coder543

coder543 commented May 9, 2019

If that is true, we would only use it for file I/O

File I/O is the main reason I'm interested in io_uring. My understanding is that the current Linux kernel abstractions for async networking work just fine, but making file I/O appear to be interruptible so that goroutines don't block entire OS threads requires using a pool of I/O OS threads. If io_uring could remove the need for a pool of OS threads for file I/O, that seems to make it worthwhile.

Whether having a ring per P (as @CAFxX suggests) or just having a single thread dedicated to managing a ring... either solution seems fine. Unless there's some measurable advantage to io_uring for networking, I don't think it would be that important to switch the networking code at this point.

@ianlancetaylor
Contributor

Using a ring per P seems possible, but it means that when a P is idle, some M will have to be sleeping in an io_uring getevents call. And then there will have to be a way to wake up that M if we need it, and some way to avoid confusion if the P gets assigned to a different M. It may be doable, but it seems pretty complicated.

@JimChengLin

JimChengLin commented May 26, 2019

I think it is a killer feature of Linux 5.1. libaio (a wrapper around the Linux kernel's native async file I/O API) is broken, or at least flawed: it cannot do async buffered file I/O and can block unexpectedly. The author of io_uring claims that all of the problems libaio has are solved. It would be great if Go could use the new feature seamlessly.

@dahankzter

Apparently io_uring brings many benefits to networked I/O as well. Is that not the case, or is it just hard to accommodate the sizing of the ring?

@ianlancetaylor
Contributor

@dahankzter See the discussion above. When I looked at io_uring earlier I did not see how to fit it into the Go runtime for network I/O. I can certainly see how io_uring would work well for some network programs, but only ones that know exactly how many concurrent I/O requests they expect to have in flight. The Go runtime can never know that. This issue is about transparently using io_uring, and I don't see how to do that in Go for network connections. If you see how to do it, please explain. Thanks.

@axboe

axboe commented Jan 7, 2020

The fixed limit has been removed, and another networking concern was the fixed CQ ring size and the difficulty in sizing that. The latter was fixed with the IORING_FEAT_NODROP support, which guarantees that completion events are never dropped. I think that should take care of most of the concerns?
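For reference, a program can check whether the running kernel makes that guarantee by inspecting the feature bits returned by io_uring_setup(2). A minimal detection sketch using raw syscalls on linux/amd64 follows; the struct layout and constants are copied from the uapi header from memory and should be verified against linux/io_uring.h before use.

package main

import (
	"fmt"
	"syscall"
	"unsafe"
)

// Mirrors struct io_uring_params (verify against <linux/io_uring.h>).
type ioUringParams struct {
	sqEntries    uint32
	cqEntries    uint32
	flags        uint32
	sqThreadCPU  uint32
	sqThreadIdle uint32
	features     uint32
	wqFD         uint32
	resv         [3]uint32
	sqOff        [40]byte // struct io_sqring_offsets
	cqOff        [40]byte // struct io_cqring_offsets
}

const (
	sysIOUringSetup  = 425    // io_uring_setup on linux/amd64
	ioringFeatNoDrop = 1 << 1 // IORING_FEAT_NODROP
)

func main() {
	var p ioUringParams
	fd, _, errno := syscall.Syscall(sysIOUringSetup, 8, uintptr(unsafe.Pointer(&p)), 0)
	if errno != 0 {
		fmt.Println("io_uring unavailable:", errno)
		return
	}
	defer syscall.Close(int(fd))
	fmt.Println("IORING_FEAT_NODROP:", p.features&ioringFeatNoDrop != 0)
}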

One thing I've been toying with is the idea of having a shared backend in terms of threads between rings. I think that'd work nicely and would allow you to just set up one ring per thread and not worry too much about overhead.

@ianlancetaylor
Contributor

@axboe Thanks for the update. Where is the current API documented?

@axboe

axboe commented Jan 7, 2020

I wrote an update here:

https://kernel.dk/io_uring-whatsnew.pdf

and the man pages in liburing are also up-to-date.

@spacejam

spacejam commented Jan 13, 2020

I'm working on rio, a pure-Rust misuse-resistant io_uring library, and have been thinking a lot about edge cases that folks are likely to hit. One is exactly the overflow issue @axboe mentions that is addressed with IORING_FEAT_NODROP, which I work around on earlier kernel versions by blocking submission on the current number of in-flight requests and guaranteeing that no more requests than the CQ size are ever in flight (without that flag, completions could be dropped if the CQ overflows).
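In Go terms, that pre-NODROP workaround amounts to a counting semaphore sized to the CQ ring. A tiny sketch, where cqEntries is the CQ ring size and submitSQE/deliver are hypothetical helpers rather than a real API:

// inFlight caps concurrent requests at the CQ ring size so the CQ can
// never overflow on kernels without IORING_FEAT_NODROP.
var inFlight = make(chan struct{}, cqEntries)

func submit(sqe uringSQE) {
	inFlight <- struct{}{} // blocks once cqEntries requests are in flight
	submitSQE(sqe)
}

func onCompletion(cqe uringCQE) {
	<-inFlight // a CQ slot is free again
	deliver(cqe)
}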

In testing io_uring with sled I spin up thousands of database instances, each getting its own uring. This quickly hits ENOMEM, making that approach infeasible under this kind of test (details here). So you may run into this if you go for a ring per goroutine. Maybe a ring per Go processor (the P in proc.go) would be OK? Just be aware of this ENOMEM tendency.

Care needs to be taken when working with linked ops. If a linked SQE fails due to, say, a TCP client disconnecting, it will cancel everything down the chain, even if it wrote some bytes into the buffer passed to io_uring. On newer kernel versions you can use HARDLINK instead of LINK to write the partial data into downstream sockets/files/etc. even when the previous write received an error due to a disconnection.

Regarding concurrent access, it's not too complex. Just make sure that the shared submission queue tail gets bumped with a Release memory ordering after you've written data into all SQE slots before that point. If you're using SQPOLL you don't even need a syscall to submit (but you do need to check the SQ's flags to see whether the kernel spun down the polling thread due to inactivity, in which case you do need to issue a syscall to kick it off again; it goes away after 1s of inactivity). Completions are cheap to reap because you can quickly copy the sequence of ready ones and set the CQ's head value with a Release ordering afterward.
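A sketch of that tail bump in Go, assuming the SQ ring has already been mmap'd and that sqTail, sqArray, sqes and mask point into it (illustrative names only, uringSQE is a hypothetical SQE type); Go's sync/atomic store is strong enough to serve as the Release publication:

// publishSQEs writes the prepared SQEs into the shared ring and then
// makes them visible to the kernel by bumping the tail.
// (uses "sync/atomic")
func publishSQEs(sqTail *uint32, sqArray []uint32, sqes []uringSQE, mask uint32, prepared []uringSQE) {
	tail := atomic.LoadUint32(sqTail)
	for i, sqe := range prepared {
		idx := (tail + uint32(i)) & mask
		sqes[idx] = sqe    // fill the SQE slot
		sqArray[idx] = idx // point the ring slot at that SQE
	}
	// The tail store must come after the SQE writes; the atomic store
	// provides the Release ordering the kernel side relies on.
	atomic.StoreUint32(sqTail, tail+uint32(len(prepared)))
}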

io_uring will change everything. It allows us to dramatically reduce our syscalls (even without SQPOLL mode that spins up a kernel thread, negating the need to do a syscall to submit events). This is vital in a post-meltdown/spectre world where our syscalls have gotten much more expensive. It allows us to do real async file IO without O_DIRECT (but it also works great with O_DIRECT if you're building storage engines etc...). It lets us do things like write broadcast proxies where there's a read followed by a DRAIN barrier, then many socket writes that read from that same buffer. The goldrush has begun :)

@coder543

coder543 commented Jan 13, 2020

@spacejam do you have any interesting experiences that you can talk about from looking into using io_uring for async file I/O? I still find that to be the more compelling use case, since Linux did not have a proper story for async file I/O prior to io_uring, in my limited understanding.

I’m definitely interested in seeing benchmarks of io_uring for networking compared to other async networking approaches, since if io_uring is implemented for file I/O, I would speculate that it might only be an incremental amount of work to support io_uring for network I/O as well.

@axboe

axboe commented Jan 13, 2020

@coder543 It actually started as async file I/O, since async file I/O previously only worked with O_DIRECT on Linux. With io_uring, it works for both buffered and O_DIRECT I/O.

io_uring also supports network I/O. As of 5.4, you can do the regular read/readv or write/writev on the socket, or you can use sendmsg/recvmsg. 5.5 adds connect/accept support, and 5.6 will have support for send/recv as well.

@spacejam

@coder543 For bulk writing and reading on my 7th-gen Lenovo X1 Carbon with NVMe + full-disk encryption + ext4, using io_uring (without registered buffers or file descriptors, and without SQPOLL mode), I can hit 5gbps writing sequentially with O_DIRECT and 6.5gbps reading sequentially with O_DIRECT. This is the o_direct example in the rio repo. Using a threadpool or a single thread doing the same reads/writes, also with O_DIRECT but with synchronous operations, I can scrape 2gbps. dd can hit 4gbps for large writes on my system with default settings. I'll be collecting and publishing more statistics as I shift gears from ergonomics around rio to throughput and giving long-tail latency a haircut for various interesting IO patterns, especially around what I'm able to measure for buffer/fd registration under different kinds of workloads.

@johanbrandhorst
Member Author

High level overview: https://lwn.net/Articles/810414/

@acln0
Contributor

acln0 commented Feb 14, 2020

I think the focus should be on file I/O only, to begin with. Network I/O works just fine using netpoll already. If the model works out for file I/O, we can try it for network I/O also (modulo complications related to deadlines, but those don't seem insurmountable).

A ring per P sounds good to me. Each ring would maintain a cache of off-heap completion tokens which identify the parked goroutine and provide a slot to write the result of the operation to. These tokens would function much like pollDescs do for the existing netpoll implementation, but they would be cached per-P.

To perform I/O, a goroutine acquires the usual locks through internal/poll, then enters the runtime. It creates an SQE and submits it to the ring attached to its P, then parks itself. When the completion is handled, the goroutine is woken, much like netpoll currently does.

Some pseudocode:


The completion token:

//go:notinheap
type uringCompletionToken struct {
	link *uringCompletionToken // links to next token in per-P cache
	cg   guintptr              // the goroutine waiting for the completion
	res  int32                 // written by CQ handler, read by cg
}

SQ submission, parking the goroutine doing the submission:

	_g_ := getg().m.curg
	_p_ := _g_.m.p

	tok := _p_.uringTokenCache.alloc()
	tok.cg = guintptr(_g_)
	sqe := uringSQE{
		op: _IORING_OP_SOMETHING,
		fd: int32(fd),

		// etc. etc.

		userData: uint64(uintptr(unsafe.Pointer(tok))),
	}
	if errno := uringSubmit(_p_.uring, sqe); errno != 0 {
		// fall back out into regular internal/poll code path
		return 0, 0, notHandled
	}

	gopark(..., ..., waitReasonIOWait, ..., ...)

	// Now that we have woken up, interpret results from the token.
	var (
		n     int
		errno int
	)
	if tok.res < 0 {
		errno = int(-tok.res)
	} else {
		n = int(tok.res)
	}

	// Might have woken on another P: reload, then free the token.
	_p_ = getg().m.p
	_p_.uringTokenCache.free(tok)

	return n, errno, handled

Handling completions:

	cq := r.nextCQ()
	tok := (*uringCompletionToken)(unsafe.Pointer(uintptr(cq.data)))
	tok.res = cq.res
	goready((*g)(unsafe.Pointer(uintptr(tok.cg))), 4)

The question of who / what handles completions remains.

If I understand things correctly, after parking the current goroutine, we enter the scheduler and we are executing findrunnable with a P, which means that we can check per-P rings rather cheaply. If we have unhandled completions in the local ring, we can add the associated goroutines to the local runqueue and start execution right away.

If findrunnable, knowing that there is work pending on the local ring, observes an empty completion ring, it should arrange for something to wait on the ring, so that the P can carry on doing other things. This is the one tricky bit.

I think we can leverage the existing netpoll infrastructure to solve this problem. We associate a non-blocking event file descriptor (as in eventfd(2)) to each ring using io_uring_register(IORING_REGISTER_EVENTFD). We register these event file descriptors with netpoll, and we teach netpoll to handle them differently from regular file descriptors: if the M executing netpoll receives a notification on a file descriptor which is part of the uring eventfd set, then it looks for a *uring in the epoll user data field rather than a *pollDesc, as it currently does. It handles the notification by returning the eventfd to quiescent state, reading completions from associated rings, and pushing the associated goroutines to the global runqueue. If netpoll observes spurious notifications from the eventfds, it should know to ignore them, if polling for completions wasn't requested explicitly for the associated rings.
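A sketch of that eventfd registration step, using raw syscalls on linux/amd64 and golang.org/x/sys/unix. Nothing here is existing runtime code, and the constants mirror the uapi header from memory, so double-check them against linux/io_uring.h.

import (
	"syscall"
	"unsafe"

	"golang.org/x/sys/unix"
)

const (
	sysIOUringRegister    = 427 // io_uring_register on linux/amd64
	ioringRegisterEventfd = 4   // IORING_REGISTER_EVENTFD
)

// registerEventfd attaches a non-blocking eventfd to the ring so a
// poller (e.g. netpoll) can be told when completions are posted.
func registerEventfd(ringFD int) (int, error) {
	efd, err := unix.Eventfd(0, unix.EFD_NONBLOCK|unix.EFD_CLOEXEC)
	if err != nil {
		return -1, err
	}
	fd := int32(efd)
	_, _, errno := syscall.Syscall6(sysIOUringRegister,
		uintptr(ringFD), ioringRegisterEventfd,
		uintptr(unsafe.Pointer(&fd)), 1, 0, 0)
	if errno != 0 {
		unix.Close(efd)
		return -1, errno
	}
	// efd can now be added to the epoll set; every CQE posted to the
	// ring makes it readable.
	return efd, nil
}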


I have insufficient knowledge of the runtime to know whether what I am describing is, in fact, feasible. It is all very hand-wavy, and I haven't attempted an implementation (yet?).

What do you think?

@agnivade
Contributor

Went with the "single thread for I/O" approach and built a small POC here: https://github.com/agnivade/frodo :) Nothing serious, just something for me to learn about io_uring and do something with it.

There are definitely some hairy edge cases due to my limited experience with cgo. But it's at a stage where you can start playing around. I have thrown in a benchmark just for the sake of it, but it doesn't mean anything - it's not an apples-to-apples comparison, and the cgo boundary alone accounts for an order-of-magnitude difference.

@hodgesds

I've also been working on a pure Go library that has initial tests passing for reading from files. I wanted to have somewhat working code before looking at where to integrate it into the runtime. However, I've run into a few things that are rather difficult with Go's memory model, regarding memory barriers and dealing with multiple writers to the submit ring, which would be less of an issue in the solution proposed by @acln0.

@git001

git001 commented Jun 8, 2020

@axboe

I wrote an update here:

https://kernel.dk/io_uring-whatsnew.pdf

and the man pages in liburing are also up-to-date.

The git link points to Facebook Workplace; is this intentional?

https://l.workplace.com/l.php?u=http%3A%2F%2Fgit.kernel.dk%2Fliburing&h=AT3WDhsryImUEhlRMqW221yX-_Eo4Kru9lc1LpqTgxLCOVuCFAptdNw27wNawPNlBzYop81S1n9JiJr_ONyV6TLVtMz-Snbaj3ZU9Woj_R6HQCnYL2makW80B3VORrzUO4GcmxNb

@pete-woods

@spacejam do you have any interesting experiences that you can talk about from looking into using io_uring for async file I/O? I still find that to be the more compelling use case, since Linux did not have a proper story for async file I/O prior to io_uring, in my limited understanding.

I’m definitely interested in seeing benchmarks of io_uring for networking compared to other async networking approaches, since if io_uring is implemented for file I/O, I would speculate that it might only be an incremental amount of work to support io_uring for network I/O as well.

I saw ScyllaDB adopted it; they talked about it in this article: https://www.scylladb.com/2020/05/05/how-io_uring-and-ebpf-will-revolutionize-programming-in-linux/

@Iceber

Iceber commented Oct 16, 2020

I'm working on an easy-to-use io_uring library for Go (https://github.com/iceber/iouring-go). Both file I/O and socket I/O work fine, but the testing and documentation aren't complete yet!

@mvdan
Member

mvdan commented Oct 16, 2020

@Iceber I don't think it's an option to have the os or net packages depend on a vendored third party package, though. Unless you were offering to contribute parts of it, like in #31908 (comment). It seems like the difficulty is more with fitting this into Go's runtime, than implementing the use of io_uring itself.

@Iceber

Iceber commented Oct 16, 2020

@mvdan Yes, this is just an experiment for now, and it would be best for the standard library to provide these features.

But really integrating io_uring into net would be a huge and probably difficult change, and I don't really think net is going to use io_uring!
For file I/O you can go with third-party packages such as iouring-go, e.g. in a file server.

I'm also going to try to develop iouring-net, which combines io_uring and networking to achieve more efficient asynchronous networking with goroutines.
Obviously the third-party package iouring-net is not for production; it's just a thought and an experiment.


@godzie44

godzie44 commented Dec 13, 2021

Hi, I am working on a go-uring library. In addition to the liburing port itself, it provides a backend for I/O operations (called a reactor). With it, you can implement the net.Listener and net.Conn interfaces and compare the reactor (with an io_uring inside) against the standard mechanism, the netpoller. There is a benchmark using an echo server as the example, and the same benchmark also compares the go-uring library against liburing. The results suggest that the ring can at least be an interesting alternative to the netpoller.

@mappu

mappu commented Jan 13, 2022

Windows 11 is introducing a new ioring API that seems almost identical [1] to io_uring - so any work in this area might be applicable on both Windows and Linux.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 7, 2022
@bohdantrotsenko

As of September 2022, can the Go runtime detect whether the kernel supports io_uring (and switch to using it)?

@ianlancetaylor
Contributor

No. This issue is still open.

@aktau
Contributor

aktau commented Apr 13, 2023

Today I read an experience report on Hacker News where someone describes a way they found to work with io_uring that resulted in actual performance improvements over using epoll(2): https://news.ycombinator.com/item?id=35548968. Might be worth keeping in mind when experimenting with this in the netpoller.

@pawelgaczynski

Hi. I wrote a web framework that is based on io_uring: https://github.com/pawelgaczynski/gain. It is entirely written in Go and achieves really high performance. In a test environment based on an m6i.xlarge AWS EC2 instance, kernel 5.15.0-1026-aws and Go 1.19, it achieved 490k req/s in the plaintext TechEmpower benchmark. Gain uses an internal port of the liburing library. It is not a full port: it focuses primarily on the networking part, but I don't see any problem extending the implementation to include the rest of the liburing-supported operations and publishing the port on GitHub as a standalone package. It would also be worthwhile to create an additional layer of abstraction to use the liburing port in a more idiomatic way, as the port is currently very close to the prototype implemented in C and may not be the most intuitive for Go programmers. If anyone is interested in using my liburing port or would like to help develop it, please contact me by creating an issue in the Gain repository.

@pawelgaczynski

Hi. I have implemented and published an almost full port of the liburing library to the Go programming language (no cgo):

https://github.com/pawelgaczynski/giouring

The giouring API is very similar to the liburing API, so the liburing documentation is also valid for giouring (see README.md for more details).

@ocodista

Still no plans to use io_uring instead of epoll in Go's runtime?

@ericvolp12

ericvolp12 commented Jan 11, 2024

I wasn't sure whether to create a new issue or comment here, but we've run into some significant bottlenecks in the current netpoll implementation around syscall.EpollWait when working on very large (192-core) systems that have thousands of TCP sockets making fast requests.

I did a full writeup here, but the tl;dr is that in our use case we've got >1,500 TCP connections to ScyllaDB shards and >1,000 client connections talking to our ConnectRPC service. Everything is basically rack-local, so requests are generally sub-millisecond, and hundreds if not >1k sockets become ready simultaneously, causing a bottleneck in EpollWait (~65% of CPU usage in our CPU profile) since it only buffers 128 FDs per call (I'm assuming due to historical limits on epoll calls in the kernel).

We had to solve this by breaking up our binary across many containers on the same host (8 containers, each pinned to 24 different cores), which made EpollWait completely disappear from our CPU profiles.

@jakebailey

I brought this up in the bluesky thread that the above came from, but one thing to consider is that Docker has now disabled io_uring support for security reasons: moby/moby@891241e

This syncs the seccomp profile with changes made to containerd's default
profile in 1.

The original containerd issue and PR mention:

Security experts generally believe io_uring to be unsafe. In fact
Google ChromeOS and Android have turned it off, plus all Google
production servers turn it off. Based on the blog published by Google
below it seems like a bunch of vulnerabilities related to io_uring can
be exploited to breakout of the container.

2

Other security researchers also hold this opinion: see 3 for a
blackhat presentation on io_uring exploits.

For the record, these syscalls were added to the allowlist in 4.

This has prompted users of io_uring to have to rewrite things back to epoll so that their code works in containers (e.g. bun oven-sh/bun#7470).

I think it's extremely common for Go code to be running within containers, and if a large bulk of them can't use the feature, that seems pretty unfortunate. Hopefully things get worked out such that io_uring isn't getting bulk disabled for being generally unsafe.
