Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: os/exec: better support for user namespaces #50098

Open
outofforest opened this issue Dec 10, 2021 · 19 comments
Open

proposal: os/exec: better support for user namespaces #50098

outofforest opened this issue Dec 10, 2021 · 19 comments

Comments

@outofforest
Copy link

I have this scenario:

  1. I want to start new process inside new user namespace (aka rootless container).
  2. unshare syscall allows me to map root user only.
  3. podman is able to map subids too by calling newuidmap and newgidmap (it is implemented here: https://github.com/containers/storage/tree/main/pkg/unshare/unshare_linux.go)
  4. To make it work they implemented complex wrapper around exec.Cmd plus some C code to call unshare in the middle of the process, after setting full mapping set
  5. For some reason calling syscall.Unshare(syscall.CLONE_NEWUSER) returns invalid argument error even if called from goroutine pinned to thread.

It would be nice to have more "goish" way to do it, like this:

	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUSER,
		UidMappings: []syscall.SysProcIDMap{
			{
				HostID:      1000,
				ContainerID: 0,
				Size:        1,
			},
			{
				HostID:      1,
				ContainerID: 100000,
				Size:        65536,
			},
		},
		GidMappings: []syscall.SysProcIDMap{
			{
				HostID:      1000,
				ContainerID: 0,
				Size:        1,
			},
			{
				HostID:      1,
				ContainerID: 100000,
				Size:        65536,
			},
		},
	}
@seankhliao seankhliao changed the title Better support for user namespaces, affected package: exec os/exec: better support for user namespaces Dec 10, 2021
@ianlancetaylor
Copy link
Contributor

This is supposed to be done by using the Unshareflags field of syscall.SysProcAttr. Why does that not suffice? Thanks.

@ianlancetaylor ianlancetaylor added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Dec 11, 2021
@outofforest
Copy link
Author

This is supposed to be done by using the Unshareflags field of syscall.SysProcAttr. Why does that not suffice? Thanks.

Behavior is the same no matter if I use Cloneflags or Unshareflags. They both work if only root uid is mapped but fails if I try to map subids too. This is by design because that's the limitation of linux unshare for unprivileged user. If I try to map both scopes I get error: fork/exec /tmp/executor-468678108: operation not permitted.

Software like podman use alternative implementation of exec.Cmd which calls newuidmap, newgidmap, reexec and C wrapper to bypass this limitation. It would be nice to have it implemented in go directly.

@ianlancetaylor
Copy link
Contributor

What would we have to change to make this work? Thanks.

@outofforest
Copy link
Author

What would we have to change to make this work? Thanks.

I believe ready to use implementation is here: https://github.com/containers/storage/tree/main/pkg/unshare/

@outofforest
Copy link
Author

outofforest commented Dec 11, 2021

BTW: the best approach IMO would be to implement fork. Yes, I know it's dangerous and golang team refused to do it many time but actually this is the best way to switch to namespace:

  1. Fork the process
  2. Inside parent set id mappings
  3. Inside child call unshare

Whenever someone shares this idea golang team answers that the correct way of doing this is to use exec.Cmd. But:

  1. as I shown you above, it doesn't work anyway
  2. it makes us impossible to provide "container as a library".

Imagine this situation:
I want to create a library which allows to do some operations inside namespace.
To do it now I have to run separate binary using exec.Cmd. All existing software do that by reexecuting /proc/self/exe. But this is not a good way if you want to create a library, reexecuting current binary from a library is extremely weird idea. Now I do it by embedding binary code inside library, saving it later in tmp location and executing it in a namespace. This is sooooo weird solution but the only one available because fork is not present.

@ianlancetaylor
Copy link
Contributor

It's infeasible to implement fork in the Go standard library. After a program forks, there is only a single thread running in the child process. That means that the basic assumptions of the Go runtime fall apart. Almost any ordinary Go operation can potentially fail in that case. The syscall code that starts a child is very carefully written to avoid problems. We can't permit ordinary Go code to run between the fork and the execve.

Thanks for pointing to the package. There is a lot there, and I don't know what matters. It would help this proposal a great deal if you or somebody could write down exactly what would need to be added to syscall.ProcSysAttr to make things work. In your initial comment you mentioned Cloneflags, UidMappings, and GidMappings, but all of those already exist. What do we need that is new?

@outofforest
Copy link
Author

@ianlancetaylor

I'm not a C expert but linked library works this way:

  1. New process is started using exec.Cmd. It reexecutes /proc/self/exe so the same binary is executed again in this case, but in general one it should allow to start any binary.
  2. Executed go binary is prepared in some tricky way. As far as I understand, it runs C function before go runtime takes control:
    https://github.com/containers/storage/blob/main/pkg/unshare/unshare_gccgo.go#L5
// #cgo CFLAGS: -Wall -Wextra
// extern void _containers_unshare(void);
// void __attribute__((constructor)) init(void) {
//   _containers_unshare();
// }
import "C"
  1. _containers_unshare C function does the real stuff related to creating a namespace: https://github.com/containers/storage/blob/main/pkg/unshare/unshare.c#L299

And this is the flow:

  1. Binary is reexecuted so we have a parent and child.
  2. Child starts the C code, sends its PID to parent (https://github.com/containers/storage/blob/main/pkg/unshare/unshare.c#L319) and waits (https://github.com/containers/storage/blob/main/pkg/unshare/unshare.c#L328).
  3. Parent takes child process ID and calls newuidmap and newgidmap linux binaries to set full mappings (both root user and subids if defined in /etc/subuid and /etc/subgid): https://github.com/containers/storage/blob/main/pkg/unshare/unshare_linux.go#L266
    https://github.com/containers/storage/blob/main/pkg/unshare/unshare_linux.go#L217
    It may be done by unprivileged user because newuidmap and newgidmap have appropriate capability or sticky bit set, depending on linux distro.
  4. Now child gets a signal (using pipe) to continue.
  5. Child reexecutes itself again (this time in C code): https://github.com/containers/storage/blob/main/pkg/unshare/unshare.c#L372
  6. Finally after 2 reexecs we have go binary running inside user namespce with both root and subids being mapped.

In this setup flags for unshare are passed using _Containers-unshare env variable:
https://github.com/containers/storage/blob/main/pkg/unshare/unshare_linux.go#L86
https://github.com/containers/storage/blob/main/pkg/unshare/unshare.c#L304

Entire process is sooooo complicated...
It would be much easier if go executable could call unshare before real go runtime starts.

@ianlancetaylor
Copy link
Contributor

We are not going to permit calling ordinary Go code between fork and execve.

I don't know what the newuidmap and newgidmap binaries are.

For this proposal to move forward we're going to need more precise details as to exactly what needs to be implemented in the syscall package. I'm going to put this on hold for now.

@outofforest
Copy link
Author

I don't know what the newuidmap and newgidmap binaries are.

https://man7.org/linux/man-pages/man1/newuidmap.1.html
https://man7.org/linux/man-pages/man1/newgidmap.1.html

For this proposal to move forward we're going to need more precise details as to exactly what needs to be implemented in the syscall package. I'm going to put this on hold for now.

I think I provided all the details on how the flow should look like. But once again:

  1. fork
  2. call newuidmap and newgidmap in parent process
  3. call unshare in child process before go runtime takes control over execution

go is frequently used to develop containerization engines so it would be nice to implement this once for all.

@ianlancetaylor
Copy link
Contributor

Why is it necessary to call an external binary to make this work? Why can't we do this entirely with system calls?

@outofforest
Copy link
Author

outofforest commented Dec 16, 2021

Why is it necessary to call an external binary to make this work? Why can't we do this entirely with system calls?

Unprivileged user may create only a single mapping: id in container -> current uid/gid on host. Providing more mappings causes "permission denied" error, even if mapped ids are specified in /etc/subuid or /etc/subgid. This is by design in linux.
Calling newuidmap / newgidmap bypasses this limitation because those binaries have sticky bit or capability set (depending on distro) so they always run with privileges required to set extended mappings.

hown3d pushed a commit to hown3d/go that referenced this issue Aug 3, 2022
Beforehand, unprivileged user could only create id mappings like this:
<ID in user namespace> -> current uid/gid on the host

Using the new{uid,gid}map executables, this limitation can be bypassed,
because those executables have setuid and setgid sticky bits or
capabilities set.

Fixes golang#50098

Signed-off-by: Höhl, Lukas <lukas.hoehl@accso.de>
hown3d pushed a commit to hown3d/go that referenced this issue Aug 3, 2022
Beforehand, unprivileged user could only create id mappings like this:
<ID in user namespace> -> current uid/gid on the host

Using the new{uid,gid}map executables, this limitation can be bypassed,
because those executables have setuid and setgid sticky bits or
capabilities set.

Fixes golang#50098

Signed-off-by: Höhl, Lukas <lukas.hoehl@accso.de>
@hown3d
Copy link

hown3d commented Aug 3, 2022

Hey everyone, I tried to come up with an implementation for this issue:
master...hown3d:go:master

I tried to follow the logic of util-linux's implementation of unshare, although I ran into an issue:
The newuidmap and newgidmap executables need to be fork/exec'ed while the child of our current fork is waiting for the idmap writes.
Since the previous fork didn't return yet, the ForkLock will still be locked and we run into a deadlock.

go/src/syscall/exec_unix.go

Lines 200 to 216 in 29b9a32

ForkLock.Lock()
// Allocate child status pipe close on exec.
if err = forkExecPipe(p[:]); err != nil {
ForkLock.Unlock()
return 0, err
}
// Kick off child.
pid, err1 = forkAndExecInChild(argv0p, argvp, envvp, chroot, dir, attr, sys, p[1])
if err1 != 0 {
Close(p[0])
Close(p[1])
ForkLock.Unlock()
return 0, Errno(err1)
}
ForkLock.Unlock()

I think if that problem is sorted out, the implementation can go on further. Can someone with a bit more experience on the stdlib help out?

@seankhliao seankhliao modified the milestones: Unplanned, Proposal Aug 20, 2022
@seankhliao seankhliao changed the title os/exec: better support for user namespaces proposal: os/exec: better support for user namespaces Aug 20, 2022
fho added a commit to simplesurance/baur that referenced this issue Dec 15, 2022
@kevin-matthew
Copy link

I've been running into operation not permitted anytime I want to do anything advanced when setting UidMappings (note: uid/gid interchangeable in this post) similar to #50098 (comment)

A little debugging reveals it comes from when go tries to write to the child process's /proc/{childpid}/uid_map, the write returns EPERM. Looking through user_namespaces(7), it basically list many reasons why this can happen. However many of the (most likely) reasons deal with capabilities.

So to best logically deduce what's happening, I looked at the source code for the newuidmap(1) utility, as this utility worked just fine when setting uid mappings. I ended up here. When I compiled this utility without the HAVE_SYS_CAPABILITY_H flag then it fails with write returning EPERM, the same as go.

Conclusively, in a non-root environment, doing anything advanced with UidMappings will always fail with EPERM until capabilities are set prior to writeUidGidMappings being called.

In other words, we need to implement the go-equivalent of this mess:

	int cap;
	struct __user_cap_header_struct hdr = {_LINUX_CAPABILITY_VERSION_3, 0};
	struct __user_cap_data_struct data[2] = {{0}};

	if (strcmp(map_file, "uid_map") == 0) {
		cap = CAP_SETUID;
	} else if (strcmp(map_file, "gid_map") == 0) {
		cap = CAP_SETGID;
	} else {
		fprintf(stderr, _("%s: Invalid map file %s specified\n"), Prog, map_file);
		exit(EXIT_FAILURE);
	}

	/* Align setuid- and fscaps-based new{g,u}idmap behavior. */
	if (geteuid() == 0 && geteuid() != ruid) {
		if (prctl(PR_SET_KEEPCAPS, 1, 0, 0, 0) < 0) {
			fprintf(stderr, _("%s: Could not prctl(PR_SET_KEEPCAPS)\n"), Prog);
			exit(EXIT_FAILURE);
		}

		if (seteuid(ruid) < 0) {
			fprintf(stderr, _("%s: Could not seteuid to %d\n"), Prog, ruid);
			exit(EXIT_FAILURE);
		}
	}

	/* Lockdown new{g,u}idmap by dropping all unneeded capabilities. */
	memset(data, 0, sizeof(data));
	data[0].effective = CAP_TO_MASK(cap);
	data[0].permitted = data[0].effective;
	if (capset(&hdr, data) < 0) {
		fprintf(stderr, _("%s: Could not set caps\n"), Prog);
		exit(EXIT_FAILURE);
	}

@outofforest
Copy link
Author

outofforest commented Jul 8, 2023

1.5 year later, having more experience in running programs in namespaces using go I think there is no good solution for this, as the limitation comes from the kernel. There are 4 possible ways to deal with it:

  1. Run your software as root
  2. Assign some extra capabilities to the binary (haven't tested but should be possible)
  3. Run a magic daemon as root to serve non-root clients (this is what docker does)
  4. Execute newuidmap and newgidmap binaries to assign mappings, those binaries must be owned by root and have sticky bit set (this is how podman approaches the problem)

@seankhliao seankhliao removed the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Jul 9, 2023
@ianlancetaylor
Copy link
Contributor

My current understanding of this issue, which may be mistaken, is that we don't need to change any user visible API in the standard library. This issue is about the fact that the UIDMappings and GIDMappings fields only work if the program is running as root. It is possible to implement them by having the parent process run newuidmap and/or newgidmap to set up the mappings. This has to be done after the child process has forked but before the child process has exec'ed the new program.

Does that sound right? If that's right, this doesn't need to be a proposal.

(As an aside, I don't see why ForkLock is a problem here. The parent process would be running in forkAndExecInChild, and would cause forkAndExecInChild1 again specifically to run newuidmap and newgidmap.)

@outofforest
Copy link
Author

@ianlancetaylor

This has to be done after the child process has forked but before the child process has exec'ed the new program.

And this is the problem. Because, at the moment, we have no way to inject the logic into exec.Cmd to do this.

@ianlancetaylor
Copy link
Contributor

If we know precisely what has to be done, we can change the syscall package to do it. We don't have to support having the program inject arbitrary code. Even if there were a reasonable way to do that, it's not what we want.

@outofforest
Copy link
Author

I agree that adding code calling newuidmap or newgidmap to standard library is not a good idea, but then developers should have a way to inject this logic into the right place. Otherwise ability to configure namespaces is limited.

@ianlancetaylor
Copy link
Contributor

I am actually suggesting that perhaps we could change the syscall package to invoke newuidmap or newgidmap when appropriate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Hold
Development

No branches or pull requests

5 participants