os, internal/poll, runtime: how to use /dev/net/tun on Linux #30426

Closed
zx2c4 opened this issue Feb 27, 2019 · 17 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Linux
Milestone

Comments

@zx2c4
Contributor

zx2c4 commented Feb 27, 2019

Go 1.12 brought SyscallConn() for os.File. In theory that should let us OpenFile on /dev/net/tun, and then use SyscallConn() to do all of the TUN-specific ioctls for setting up the device, giving it a name, and setting some properties. From then on, it's supposed to be a matter of Read, Write, and Close. Since we don't need to call Fd() on the os.File at any point, we gain the benefits of using netpoll (which is epoll behind the scenes).

In addition to allowing the scheduler to make better decisions and not allocating an OS thread for every IO operation, netpoll also lets us call Read in one goroutine and Close in another, and the currently running Read will return immediately with an error saying that it's been closed. This is terrific for shutting down gracefully. To illustrate, here's something that does not work as a consequence of using Fd:

        fd, err := os.OpenFile("/dev/net/tun", os.O_RDWR, 0)
        if err != nil {
                log.Fatal(err)
        }
        
        var ifr [unix.IFNAMSIZ + 64]byte
        copy(ifr[:], []byte("cheese"))
        *(*uint16)(unsafe.Pointer(&ifr[unix.IFNAMSIZ])) = unix.IFF_TUN
        
        _, _, errno := unix.Syscall(
                unix.SYS_IOCTL,
                uintptr(fd.Fd()),
                uintptr(unix.TUNSETIFF),
                uintptr(unsafe.Pointer(&ifr[0])),
        )
        if errno != 0 {
                log.Fatal(errno)
        }

        wait := sync.WaitGroup{}
        wait.Add(1)
        go func() {
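                var err error
                b := [2000]byte{}
                for {
                        // Plain assignment: := here would shadow err and the log below would print <nil>.
                        _, err = fd.Read(b[:])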
                        if err != nil {
                                break
                        }
                }
                log.Print("Read errored: ", err)
                wait.Done()
        }()
        time.Sleep(time.Second * 3)
        log.Print("Closing")
        err = fd.Close()
        if err != nil {
                log.Print("Close errored: " , err)
        }
        wait.Wait()
        log.Print("Exiting")

The problem with the above code is that fd.Read(b[:]) never returns after fd.Close() executes, and so the program hangs forever. Thanks to SyscallConn in Go 1.12, we can fix that problem like this:

        fd, err := os.OpenFile("/dev/net/tun", os.O_RDWR, 0)
        if err != nil {
                log.Fatal(err)
        }
        
        var ifr [unix.IFNAMSIZ + 64]byte
        copy(ifr[:], []byte("cheese"))
        *(*uint16)(unsafe.Pointer(&ifr[unix.IFNAMSIZ])) = unix.IFF_TUN
        
        var errno syscall.Errno
        s, _ := fd.SyscallConn()
        s.Control(func(fd uintptr) {
                _, _, errno = unix.Syscall(
                        unix.SYS_IOCTL,
                        fd,
                        uintptr(unix.TUNSETIFF),
                        uintptr(unsafe.Pointer(&ifr[0])),
                )
        })
        if errno != 0 {
                log.Fatal(errno)
        }

        wait := sync.WaitGroup{}
        wait.Add(1)
        go func() {
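                var err error
                b := [2000]byte{}
                for {
                        // Plain assignment: := here would shadow err and the log below would print <nil>.
                        _, err = fd.Read(b[:])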
                        if err != nil {
                                break
                        }
                }
                log.Print("Read errored: ", err)
                wait.Done()
        }()
        time.Sleep(time.Second * 3)
        log.Print("Closing")
        err = fd.Close()
        if err != nil {
                log.Print("Close errored: " , err)
        }
        wait.Wait()
        log.Print("Exiting")

This works as expected with regard to fd.Read(b[:]) getting cancelled. (In Go 1.11, I previously worked around this by manually polling on a cancellation pipe and the tun fd with some pretty gnarly ugliness. I've been eagerly awaiting the Go 1.12 release to stop having to play those games.)

There's a big problem, however: netpoll's use of epoll doesn't seem to agree with the Linux tun driver's tun_chr_poll. Consider the following program:

package main

import "log"
import "os"
import "unsafe"
import "time"
import "syscall"
import "os/exec"
import "sync"
import "golang.org/x/sys/unix"

func main() {
	fd, err := os.OpenFile("/dev/net/tun", os.O_RDWR, 0)
	if err != nil {
		log.Fatal(err)
	}

	var ifr [unix.IFNAMSIZ + 64]byte
	copy(ifr[:], []byte("cheese"))
	*(*uint16)(unsafe.Pointer(&ifr[unix.IFNAMSIZ])) = unix.IFF_TUN

	var errno syscall.Errno
	s, _ := fd.SyscallConn()
	s.Control(func(fd uintptr) {
		_, _, errno = unix.Syscall(
			unix.SYS_IOCTL,
			fd,
			uintptr(unix.TUNSETIFF),
			uintptr(unsafe.Pointer(&ifr[0])),
		)
	})
	if errno != 0 {
		log.Fatal(errno)
	}

	wait := sync.WaitGroup{}
	wait.Add(1)
	go func() {
		var err error
		c := exec.Command("sh", "-c", "ip link set up cheese && ip a a 192.168.9.2/24 dev cheese")
		c.Start()
		c.Wait()
		exec.Command("sh", "-c", "ping -c 4 -f 192.168.9.1; ip link set down cheese; ip a f dev cheese").Start()
		b := [2000]byte{}
		for {
			var n int
			n, err = fd.Read(b[:]) // assign the outer err so the log below reports it
			if err != nil {
				break
			}
			log.Printf("Read %d bytes", n)
		}
		log.Print("Read errored: ", err)
		wait.Done()
	}()
	time.Sleep(time.Second * 15)
	log.Print("Closing")
	err = fd.Close()
	if err != nil {
		log.Print("Close errored: ", err)
	}
	wait.Wait()
	log.Print("Exiting")
}

This is supposed to work, but actually the call to Read winds up blocking and not returning any data, and only ever returns upon the call to Close. The above program can be "fixed" by adding fd.Fd() just above the go func() { line, in order to remove fd from netpoll. This, however, incurs the pre-SyscallConn-era problem of Close not being able to cancel Read, and loses the other nice benefits of netpoll.

Anybody familiar with netpoll's particular use of epoll interested in taking a look under the hood?

zx2c4-bot pushed a commit to WireGuard/wireguard-go that referenced this issue Feb 27, 2019
So this mostly reverts the switch to Sysconn for Linux.

Issue: golang/go#30426
@ianlancetaylor ianlancetaylor changed the title netpoll doesn't like linux's /dev/net/tun runtime: netpoll doesn't like linux's /dev/net/tun Feb 27, 2019
@ianlancetaylor
Contributor

The way that netpoll uses epoll is straightforward. It adds descriptors to the epoll descriptor using EPOLL_CTL_ADD with EPOLLIN | EPOLLOUT | EPOLLRDHUP | EPOLLET. It shouldn't be hard to try writing a C program to see how epoll behaves with /dev/net/tun.

@mikioh mikioh changed the title runtime: netpoll doesn't like linux's /dev/net/tun os: netpoll doesn't like linux's /dev/net/tun Feb 27, 2019
@mikioh
Contributor

mikioh commented Feb 27, 2019

Just to be sure, can you please confirm that:

  • "ip tuntap add $devname tun user $username" works well on your node under the test,
  • "ip link show" displays the tun interface configured by your program during the test.

@zx2c4
Contributor Author

zx2c4 commented Feb 27, 2019

Yes.

thinkpad ~ # ip tuntap add mode tun name cheese user zx2c4
thinkpad ~ # ip link show dev cheese
150: cheese: <POINTOPOINT,MULTICAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 500
    link/none 

@zx2c4
Contributor Author

zx2c4 commented Feb 27, 2019

Using the same flags as Go's usage of epoll, I'm able to reproduce this in C. Here's the working blocking case as a baseline:

#include <sys/types.h>
#include <sys/epoll.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <net/if.h>
#include <linux/if_tun.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
	char buf[2000];
	ssize_t len;
	int tunfd, ret;
	struct ifreq ifreq = {
		.ifr_name = "cheese",
		.ifr_flags = IFF_TUN
	};
	
	tunfd = open("/dev/net/tun", O_RDWR);
	if (tunfd < 0) {
		perror("open(/dev/net/tun)");
		return 1;
	}
	ret = ioctl(tunfd, TUNSETIFF, &ifreq);
	if (ret < 0) {
		perror("ioctl(IFF_TUN)");
		return 1;
	}
	system("ip link set up cheese && ip a a 192.168.9.2/24 dev cheese");
	popen("ping -c 4 -f 192.168.9.1; ip link set down cheese; ip a f dev cheese", "r");
	while ((len = read(tunfd, buf, sizeof(buf))) >= 0)
		printf("Read %zd bytes\n", len);
	return 0;
}

Here's the broken epoll case:

#include <sys/types.h>
#include <sys/epoll.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <net/if.h>
#include <linux/if_tun.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
	char buf[2000];
	ssize_t len;
	int tunfd, efd, ret;
	struct ifreq ifreq = {
		.ifr_name = "cheese",
		.ifr_flags = IFF_TUN
	};
	struct epoll_event event = {
		.events = EPOLLIN | EPOLLOUT | EPOLLRDHUP | EPOLLET
	};
	
	tunfd = open("/dev/net/tun", O_RDWR);
	if (tunfd < 0) {
		perror("open(/dev/net/tun)");
		return 1;
	}
	ret = fcntl(tunfd, F_GETFL);
	if (ret < 0) {
		perror("F_GETFL");
		return 1;
	}
	ret = fcntl(tunfd, F_SETFL, ret | O_NONBLOCK);
	if (ret < 0) {
		perror("F_SETFL");
		return 1;
	}
	efd = epoll_create1(0);
	if (efd < 0) {
		perror("epoll_create1");
		return 1;
	}
	ret = epoll_ctl(efd, EPOLL_CTL_ADD, tunfd, &event);
	if (ret < 0) {
		perror("epoll_ctl");
		return 1;
	}
	ret = ioctl(tunfd, TUNSETIFF, &ifreq);
	if (ret < 0) {
		perror("ioctl(IFF_TUN)");
		return 1;
	}
	system("ip link set up cheese && ip a a 192.168.9.2/24 dev cheese");
	popen("ping -c 4 -f 192.168.9.1; ip link set down cheese; ip a f dev cheese", "r");
	for (;;) {
		len = read(tunfd, buf, sizeof(buf));
		if (len < 0 && errno == EAGAIN) {
			ret = epoll_wait(efd, &event, 1, -1);
			if (ret < 0) {
				perror("epoll_wait");
				return 1;
			}
			continue;
		}
		if (len < 0)
			break;
		printf("Read %zd bytes\n", len);
	}
	return 0;
}

@zx2c4
Contributor Author

zx2c4 commented Feb 27, 2019

Interestingly, it appears that removing EPOLLET fixes things. That's not surprising as level triggering is basically the same as ordinary poll.
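The difference between edge-triggered and level-triggered registration mentioned above can be seen on an ordinary pipe, without a TUN device. The following is a minimal sketch, not a reproduction of the tun behavior: it registers the read end of a pipe with the flag set quoted earlier in this thread, writes one byte, and polls twice without reading. The helper names readyCounts and waitRetry are made up for this illustration.

```go
package main

import (
	"fmt"
	"syscall"
)

// epollET is EPOLLET (1 << 31) as a uint32. It is spelled out here because
// the constant in package syscall is negative on some architectures and
// then does not fit EpollEvent.Events.
const epollET uint32 = 1 << 31

// waitRetry calls epoll_wait and retries if a signal interrupts it.
func waitRetry(epfd int, events []syscall.EpollEvent, msec int) (int, error) {
	for {
		n, err := syscall.EpollWait(epfd, events, msec)
		if err != syscall.EINTR {
			return n, err
		}
	}
}

// readyCounts registers the read end of a pipe, writes one byte, and then
// polls twice without reading. It returns the event count of each poll.
func readyCounts(extra uint32) (int, int, error) {
	var p [2]int
	if err := syscall.Pipe2(p[:], syscall.O_NONBLOCK|syscall.O_CLOEXEC); err != nil {
		return 0, 0, err
	}
	defer syscall.Close(p[0])
	defer syscall.Close(p[1])

	epfd, err := syscall.EpollCreate1(syscall.EPOLL_CLOEXEC)
	if err != nil {
		return 0, 0, err
	}
	defer syscall.Close(epfd)

	ev := syscall.EpollEvent{
		Events: syscall.EPOLLIN | syscall.EPOLLOUT | syscall.EPOLLRDHUP | extra,
		Fd:     int32(p[0]),
	}
	if err := syscall.EpollCtl(epfd, syscall.EPOLL_CTL_ADD, p[0], &ev); err != nil {
		return 0, 0, err
	}
	if _, err := syscall.Write(p[1], []byte{1}); err != nil {
		return 0, 0, err
	}

	events := make([]syscall.EpollEvent, 1)
	first, err := waitRetry(epfd, events, 1000)
	if err != nil {
		return 0, 0, err
	}
	// The byte is still unread. Level triggering reports it again;
	// edge triggering stays silent until something new happens.
	second, err := waitRetry(epfd, events, 0)
	return first, second, err
}

func main() {
	f, s, err := readyCounts(epollET)
	fmt.Println("edge-triggered: ", f, s, err)
	f, s, err = readyCounts(0)
	fmt.Println("level-triggered:", f, s, err)
}
```

With edge triggering the second poll reports nothing even though data is still pending, which is why an edge-triggered consumer has to read until EAGAIN before waiting again.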

@mikioh mikioh changed the title os: netpoll doesn't like linux's /dev/net/tun os, internal/poll: netpoll doesn't like linux's /dev/net/tun Feb 27, 2019
@mikioh
Contributor

mikioh commented Feb 27, 2019

it appears that removing EPOLLET fixes things

Sounds like you need to find out some good way to accommodate either a) blocking I/O w/ level-triggered notification, or b) non-blocking I/O w/ edge-triggered notification; the current runtime-integrated network poller is designed for just the latter.

Marking a tun/tap device file as non-blocking is unlikely to make it work with the current runtime-integrated network poller: tun_ring_recv in drivers/net/tun.c always returns EAGAIN when the argument noblock is true.

A naive fix might be to make the epoll registration adaptive by referring to the target file capability for non-blocking I/O.

@mikioh mikioh changed the title os, internal/poll: netpoll doesn't like linux's /dev/net/tun os, internal/poll, runtime: netpoll doesn't like linux's /dev/net/tun Feb 27, 2019
@zx2c4
Contributor Author

zx2c4 commented Feb 27, 2019

tun_ring_recv in drivers/net/tun.c always returns EAGAIN when the argument noblock is true.

Are we reading the same source? If noblock is true and a buffer is available, it returns the buffer with the error set to 0. Otherwise it returns EAGAIN:

static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
{       
        DECLARE_WAITQUEUE(wait, current);
        void *ptr = NULL;
        int error = 0;
        
        ptr = ptr_ring_consume(&tfile->tx_ring);
        if (ptr)
                goto out;
        if (noblock) {
                error = -EAGAIN;
                goto out;
        }
        
        //[...]

out:    
        *err = error;
        return ptr;
}
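The same non-blocking contract (return data if some is queued, otherwise fail with EAGAIN) can be observed from Go on any non-blocking descriptor. The sketch below uses a pipe as a stand-in for the tun device, which is an assumption made so that it runs without privileges; tryRead and emptyThenReady are hypothetical helper names.

```go
package main

import (
	"fmt"
	"syscall"
)

// tryRead performs a single non-blocking read. It reports ready=false when
// the kernel answers EAGAIN, meaning that nothing is queued right now.
func tryRead(fd int, buf []byte) (n int, ready bool, err error) {
	n, err = syscall.Read(fd, buf)
	if err == syscall.EAGAIN {
		return 0, false, nil
	}
	if err != nil {
		return 0, false, err
	}
	return n, true, nil
}

// emptyThenReady reads from an empty non-blocking pipe, queues three bytes,
// and reads again.
func emptyThenReady() (readyBefore bool, nAfter int, readyAfter bool, err error) {
	var p [2]int
	if err = syscall.Pipe2(p[:], syscall.O_NONBLOCK|syscall.O_CLOEXEC); err != nil {
		return
	}
	defer syscall.Close(p[0])
	defer syscall.Close(p[1])

	buf := make([]byte, 16)
	if _, readyBefore, err = tryRead(p[0], buf); err != nil {
		return
	}
	if _, err = syscall.Write(p[1], []byte("abc")); err != nil {
		return
	}
	nAfter, readyAfter, err = tryRead(p[0], buf)
	return
}

func main() {
	before, n, after, err := emptyThenReady()
	fmt.Println(before, n, after, err)
}
```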

@ianlancetaylor
Contributor

If /dev/net/tun doesn't support EPOLLET, then I don't see a reasonable way to make it work with Go's poller.

@mikioh
Contributor

mikioh commented Feb 28, 2019

@zx2c4,

A workaround: https://play.golang.org/p/iu6ayVT3Yfe

@bcmills bcmills added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Feb 28, 2019
@bcmills bcmills added this to the Unplanned milestone Feb 28, 2019
@zx2c4
Contributor Author

zx2c4 commented Mar 4, 2019

@mikioh Is that workaround remotely safe to do? That is, if an fd is in netpoll, and then you manually twiddle it to be nonblocking, won't netpoll tweak out? Or just become really inefficient? Generally the epoll ET pattern is something like:

for (;;) {
    while ((ret = read(fd, ...)) >= 0)
        ...
    if (ret < 0 && errno == EAGAIN)
        epoll_wait(efd, ...);
}

If you put the fd into blocking mode, the reads will just block forever, and so it'll never return EAGAIN and epoll basically won't be used. This sounds like in theory it would make cancellation very difficult, since that read(fd) call just hangs there until a packet comes in. And if Go thinks it can epoll, it might not spawn a thread for the blocking call, which could then starve other goroutines.

Is this analysis correct? Or does Go somehow use epoll internally in a way that makes ET+blocking acceptable?
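The edge-triggered pattern outlined above (read until EAGAIN, then wait for the next edge) can be written out in Go roughly as follows. This is a sketch on a non-blocking pipe, not Go's actual netpoll code; readAll and demo are hypothetical names. If the descriptor were in blocking mode, the Read call would never return EAGAIN and the loop would never reach EpollWait, which is the concern raised in the comment.

```go
package main

import (
	"fmt"
	"syscall"
	"time"
)

const epollET uint32 = 1 << 31 // EPOLLET, spelled out to fit a uint32

// readAll drains fd until EAGAIN, then sleeps in epoll_wait until the next
// edge, and repeats until the writer closes its end (read returns 0).
func readAll(epfd, fd int) (int, error) {
	buf := make([]byte, 4)
	events := make([]syscall.EpollEvent, 1)
	total := 0
	for {
		n, err := syscall.Read(fd, buf)
		switch {
		case err == syscall.EAGAIN:
			// Drained: only now is it safe to wait for a new edge.
			if _, werr := syscall.EpollWait(epfd, events, -1); werr != nil && werr != syscall.EINTR {
				return total, werr
			}
		case err == syscall.EINTR:
			// Interrupted by a signal: just retry the read.
		case err != nil:
			return total, err
		case n == 0:
			return total, nil // EOF
		default:
			total += n
		}
	}
}

// demo wires a pipe to epoll and feeds it two chunks from another goroutine.
func demo() (int, error) {
	var p [2]int
	if err := syscall.Pipe2(p[:], syscall.O_NONBLOCK|syscall.O_CLOEXEC); err != nil {
		return 0, err
	}
	defer syscall.Close(p[0])

	epfd, err := syscall.EpollCreate1(syscall.EPOLL_CLOEXEC)
	if err != nil {
		syscall.Close(p[1])
		return 0, err
	}
	defer syscall.Close(epfd)

	ev := syscall.EpollEvent{
		Events: syscall.EPOLLIN | syscall.EPOLLRDHUP | epollET,
		Fd:     int32(p[0]),
	}
	if err := syscall.EpollCtl(epfd, syscall.EPOLL_CTL_ADD, p[0], &ev); err != nil {
		syscall.Close(p[1])
		return 0, err
	}

	go func() {
		for _, chunk := range []string{"hello ", "world"} {
			syscall.Write(p[1], []byte(chunk))
			time.Sleep(20 * time.Millisecond)
		}
		syscall.Close(p[1]) // the reader sees EOF after draining
	}()
	return readAll(epfd, p[0])
}

func main() {
	n, err := demo()
	fmt.Println(n, err)
}
```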

@zx2c4
Contributor Author

zx2c4 commented Mar 4, 2019

You had some comments the other day about this working, then not working, on the BSDs, but I can't find them now for some reason. What was the verdict of that? In my quick trials with code similar to the OP's, I was able to Close() the file from one goroutine and have the read canceled in the other. I thought this was a decent enough indication that things were working fine on the BSDs. From further inspecting what's going on, though, it looks like all the BSDs examine the file descriptor and then might actually wind up disabling polling under certain conditions. Are we hitting these conditions? But if that's the case, why does the cancellation appear to work?

@zx2c4
Contributor Author

zx2c4 commented Mar 4, 2019

Is this analysis correct? Or does Go somehow use epoll internally in a way that makes ET+blocking acceptable?

It looks like your workaround code actually doesn't work at all. Extend that timeout from 3 seconds to 10 seconds, so that there's time for the broadcast packet stuff to stop happening. That way the file is actually closed during a period when there isn't new data. Then, you'll see the same hang that we had in Go 1.11, which I'm forced to solve with this monstrosity.

@mikioh
Contributor

mikioh commented Mar 6, 2019

[I deleted my previous comments mentioning BSDs because I was confused a bit, sorry.]

@zx2c4,

I skimmed the Linux kernel code a bit and realized that the byte sequence (or character) interface of the tun device doesn't support epoll: as your example code shows, the first epoll_pwait always returns EPOLLERR regardless of EPOLLET or blocking/non-blocking I/O; see tun_chr_poll in drivers/net/tun.c. I expected vfs_poll in fs/eventpoll.c to handle poll-capable files well, but tun_chr_poll returns EPOLLERR for devices that are not NETREG_REGISTERED, i.e. /dev/net/tun device files. Right now, I have no good idea for accommodating device files that are poll-capable but not epoll-capable.

So, a workaround would be to use your own poll for such device files: https://play.golang.org/p/D3B8KBeW10y

PS: On BSD variants, the tun or similar software interfaces are well integrated with kqueue, so that's the reason I was confused initially, sorry for the confusion.
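Whether a descriptor already reports EPOLLERR at registration time can be checked with a small probe. The sketch below is an illustration only: reportsEpollErr and probePipe are hypothetical helpers, and because opening /dev/net/tun needs privileges, the demonstration uses a pipe whose read end has been closed, which is another case where Linux sets EPOLLERR. On a tun descriptor one would presumably run the same probe before and after TUNSETIFF; that variant is not exercised here.

```go
package main

import (
	"fmt"
	"syscall"
)

const epollET uint32 = 1 << 31 // EPOLLET, spelled out to fit a uint32

// reportsEpollErr registers fd in a fresh epoll instance with the flag set
// quoted in this thread and reports whether an immediate poll carries EPOLLERR.
func reportsEpollErr(fd int) (bool, error) {
	epfd, err := syscall.EpollCreate1(syscall.EPOLL_CLOEXEC)
	if err != nil {
		return false, err
	}
	defer syscall.Close(epfd)

	ev := syscall.EpollEvent{
		Events: syscall.EPOLLIN | syscall.EPOLLOUT | syscall.EPOLLRDHUP | epollET,
		Fd:     int32(fd),
	}
	if err := syscall.EpollCtl(epfd, syscall.EPOLL_CTL_ADD, fd, &ev); err != nil {
		return false, err
	}
	events := make([]syscall.EpollEvent, 1)
	n, err := syscall.EpollWait(epfd, events, 0)
	if err != nil {
		return false, err
	}
	return n > 0 && events[0].Events&syscall.EPOLLERR != 0, nil
}

// probePipe probes the write end of a pipe while a reader exists, and again
// after the read end has been closed.
func probePipe() (before, after bool, err error) {
	var p [2]int
	if err = syscall.Pipe2(p[:], syscall.O_CLOEXEC); err != nil {
		return
	}
	defer syscall.Close(p[1])

	if before, err = reportsEpollErr(p[1]); err != nil {
		syscall.Close(p[0])
		return
	}
	syscall.Close(p[0]) // no readers left: the write end is now in an error state
	after, err = reportsEpollErr(p[1])
	return
}

func main() {
	before, after, err := probePipe()
	fmt.Println(before, after, err)
}
```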

@crvv
Contributor

crvv commented Mar 7, 2019

tun works well with the runtime poller on my machine.
The code doesn't work because the fd is added to the poller before the ioctl.
It should use x/sys/unix.Open to open the file, then ioctl, SetNonblock and os.NewFile.

Please see #22939

@mikioh
Contributor

mikioh commented Mar 7, 2019

@crvv,

Oh, nice; that means that calling ioctl w/ IFF_XXX makes the device file NETREG_REGISTERED?

@zx2c4
Contributor Author

zx2c4 commented Mar 7, 2019

Nice observation. This seems to work correctly:

package main

import "log"
import "os"
import "unsafe"
import "time"
import "os/exec"
import "sync"
import "golang.org/x/sys/unix"

func main() {
        tunfd, err := unix.Open("/dev/net/tun", unix.O_RDWR, 0)
        if err != nil {
                log.Fatal(err)
        }

        var ifr [unix.IFNAMSIZ + 64]byte
        copy(ifr[:], []byte("cheese"))
        *(*uint16)(unsafe.Pointer(&ifr[unix.IFNAMSIZ])) = unix.IFF_TUN
        _, _, errno := unix.Syscall(
                unix.SYS_IOCTL,
                uintptr(tunfd),
                uintptr(unix.TUNSETIFF),
                uintptr(unsafe.Pointer(&ifr[0])),
        )

        if errno != 0 {
                log.Fatal(errno)
        }
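        // Non-blocking mode is what lets os.NewFile hand the fd to the runtime poller.
        if err := unix.SetNonblock(tunfd, true); err != nil {
                log.Fatal(err)
        }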

        fd := os.NewFile(uintptr(tunfd), "/dev/net/tun")

        wait := sync.WaitGroup{}
        wait.Add(1)
        go func() {
                var err error
                c := exec.Command("sh", "-c", "ip link set up cheese && ip a a 192.168.9.2/24 dev cheese")
                c.Start()
                c.Wait()
                exec.Command("sh", "-c", "ping -c 4 -f 192.168.9.1; ip link set down cheese; ip a f dev cheese").Start()
                b := [2000]byte{}
                for {
                        var n int
                        n, err = fd.Read(b[:])
                        if err != nil {
                                break
                        }
                        log.Printf("Read %d bytes", n)
                }
                log.Print("Read errored: ", err)
                wait.Done()
        }()
        time.Sleep(time.Second * 15)
        log.Print("Closing")
        err = fd.Close()
        if err != nil {
                log.Print("Close errored: ", err)
        }
        wait.Wait()
        log.Print("Exiting")
}

@mikioh mikioh changed the title os, internal/poll, runtime: netpoll doesn't like linux's /dev/net/tun os, internal/poll, runtime: how to use /dev/net/tun on Linux Mar 7, 2019
@mikioh
Contributor

mikioh commented Mar 7, 2019

Closing, thanks @crvv for the valuable information.

@mikioh mikioh closed this as completed Mar 7, 2019
nsd20463 added a commit to mistsys/tuntap that referenced this issue Sep 17, 2019
Running the tuntap in Go 1.13 resulted in Interface.Read() returning
a "not pollable" error (from the runtime's poll.ErrNotPollable).
This, it turns out, is due to /dev/net/tun in linux not being pollable
(in the epoll sense) until after the TUNSETIFF ioctl has been done.

The right fix, done here, is to open /dev/net/tun as a raw file
descriptor, and ioctl it before constructing an *os.File which gets
added to the poll set when Read() is called.

 See golang/go#30426
 and golang/go#30624

and go source code commit a5fdd58c84b6b0a1ae5a53faebc0550024e3a066
which adds ErrNotPollable and exposes this error which otherwise
was getting silently thrown away.

This code works properly on the AP, too (master branch, using go 1.12.9,
but it should work a long way back)
@golang golang locked and limited conversation to collaborators Mar 6, 2020