We’ve been using Sysbox (https://github.com/nestybox/sysbox) for our Buildkite b...

lima · on Jan 17, 2022

Never head of Sysbox before. At a first glance, the comparison table in their GitHub repo and on their website[1] has a number of inaccuracies which makes me question the quality of their engineering:

— They claim that their solution has the same isolation level ("4 stars") than gVisor, unlike "standard containers", which are "2 stars" only (with Firecracker and Kubevirt being "5 stars). This is very wrong - as far as I can tell, they use regular Linux namespaces with some light eBPF-based filesystem emulation, while the vast majority of syscalls is still handled by the host kernel. Sorry, but this is still "2 stars" and far away from the isolation guarantees provided by gVisor (fully emulating the kernel in userspace, which is at the same level or even better than Firecracker) and nowhere close to a VM.

— Somehow, regular VMs (Kubevirt) get a "speed" rating of only "2 stars" - worse than gVisor ("3 stars") and Firecracker ("4 stars"), even though they both rely on virtually the same virtualization technology. If anything, gVisor is the slowest but most efficient solution while QEMU maintains some performance advantage over Firecracker[2]. These are basically random scores, it's not a good first impression–if you do a detailed comparison like that, at least do a proper evaluation before giving your own product the best score!

— They claim that "standard containers" cannot run a full OS. This isn't true - while it's typically a bad idea, this works just fine with rootless podman and, more recently, rootless docker. Allowing this is the whole point of user namespaces, after all! Maybe their custom procfs does a better job of pretending to be a VM - but it's simply false that you can't do these things without. You can certainly run a full OS inside Kata/Firecracker, too, I've actually done that.

Nitpicking over rating scales aside, the claim that their solution offers large security improvements over any other solution with user namespaces isn't true and the whole thing seems very marketing-driven. The isolation offered by user namespaces is still very weak and not comparable to gVisor or Firecracker (both in production use by Google/AWS for untrusted workloads!). False marketing is a big red flag, especially for something as critical as a container runtime.

Anyone who wants unprivileged system containers might want to look into rootless docker or podman rather than this.

[1]: https://www.nestybox.com

[2]: https://www.usenix.org/system/files/nsdi20-paper-agache.pdf

lox · on Jan 17, 2022

I don't spend a lot of time on those comparison-style charts if I'm honest, but that is good (and valid) feedback for them. I also hadn't heard of it, I discovered sysbox via jpettazo's updated post at https://jpetazzo.github.io/2015/09/03/do-not-use-docker-in-d..., he's an advisor of nestybox the company that develops sysbox.

For the CI/CD usecase on AWS, sysbox presented the right balance of trade-offs between something like Firecracker (which would require bare metal hosts on AWS) and the docker containers that already existed. We specifically need to run privileged containers so that we could run docker-in-docker for CI workloads, so rootless docker or podman wouldn't have helped. Sysbox lets us do that with a significant improvement in security to just running privileged docker containers as most CI environments end up doing.

Just switching their docker-in-docker CI job containers to sysbox would have mitigated 4 of the compromises from the article with nearly zero other configuration changes.

lima · on Jan 17, 2022

> We specifically need to run privileged containers so that we could run docker-in-docker for CI workloads, so rootless docker or podman wouldn't have helped.

rootless docker works inside an unprivileged container (that's how our CI works).

HenAm · on Jan 20, 2022

How do you run rootless docker in unprivileged container ? Here it says that the privilege is still a must.

https://docs.docker.com/engine/security/rootless/#rootless-d...

pritambaral · on Jan 17, 2022

> They claim that "standard containers" cannot run a full OS. ... this works just fine with rootless podman and, more recently, rootless docker.

> Anyone who wants unprivileged system containers might want to look into rootless docker or podman rather than this.

Perhaps I'm missing something, but I have been running full OS userlands using "standard containers" in production for years, via LXD[1].

[1]: https://linuxcontainers.org/

ctalledo · on Jan 18, 2022

LXD is great, but one nice feature of Sysbox is that it's an OCI-based runtime, and therefore integrates with Docker, K8s, etc. In a way, Sysbox turns Docker containers or Kubernetes pods into LXD-like containers, although there are differences.

lima · on Jan 17, 2022

LXD uses privileged containers, though - this exposes a lot more attack surface, since uid 0 inside the container equals uid 0 outside.

ylk · on Jan 18, 2022

That’s not true:

> By default containers are unprivileged […]

https://linuxcontainers.org/lxd/docs/master/security/#contai...

As for LXC:

> LXC containers can be of two kinds:

> - Privileged containers

> - Unprivileged containers

> […]

> The latter has been introduced back in LXC 1.0 (February 2014) […]

https://linuxcontainers.org/lxc/security/

_joel · on Jan 17, 2022

It can do both. https://linuxcontainers.org/lxc/security/

pritambaral · on Jan 18, 2022

I have always used it for unprivileged containers, and I remember using it as such in 2016. IIRC both podman and rootless docker are much more recent.

ctalledo · on Jan 17, 2022

Thanks for the feedback; I am one of the developers of Sysbox. Some answers to the above comments:

- Regarding the container isolation, Sysbox uses a combination of Linux user-namespace + partial procfs & sysfs emulation + intercepting some sensitive syscalls in the container (using seccomp-bpf). It's fair to say that gVisor performs better isolation on syscalls, but it's also fair to say that by adding Linux user-ns and procfs & sysfs emulation, Sysbox isolates the container in ways that gVisor does not. This is why we felt it was fair to put Sysbox at a similar isolation rating as gVisor, although if you view it from purely a syscall isolation perspective it's fair to say that gVisor offers better isolation. Also, note that Sysbox is not meant to isolate workloads in multi-tenant environments (for that we think VM-based approaches are better). But in single-tenant environments, Sysbox does void the need for privileged containers in many scenarios because it allows well isolated containers/pods to run system workloads such as Docker and even K8s (which is why it's often used in CI infra).

- Regarding the speed rating, we gave Firecracker a higher speed rating than KubeVirt because while they both use hardware virtualization, the latter run microVMs that are highly optimized and have much less overhead that full VMs that typically run on KubeVirt. While QEMU may be faster than Firecracker in some metrics in a one-instance comparison, when you start running dozens of instances per host, the overhead of the full VM (particularly memory overhead) hurts its performance (which is the reason Firecracker was designed).

- Regarding gVisor performance, we didn't do a full performance comparison vs. KubeVirt, so we may stand corrected if gVisor is in fact slower than KubeVirt when running multiple instances on the same host (would appreciate any more info you may have on such a comparison, we could not find one).

- Regarding the claim that standard containers cannot run a full OS, what the table in the GH repo is indicating is that Sysbox allows you to create unprivileged containers (or pods) that can run system software such as Docker, Kubernetes, k3s, etc. with good isolation and seamlessly (no privileged container, no changes in the software inside the container, and no tricky container entrypoints). To the best of our knowledge, it's not possible to run say Kubernetes inside a regular container unless it's a privileged container with a custom entrypoint. Or inside a Firecracker VM. If you know otherwise, please let us know.

- Regarding "The claim that their solution offers large security improvements over any other solution with user namespaces isn't true". Where do you see that claim? The table explicitly states that there are solutions that provide stronger isolation.

- Regarding "The isolation offered by user namespaces is still very weak and not comparable to gVisor or Firecracker". User namespaces by itself mitigates several recent CVEs for containers, so it's a valuable feature. It may not offer VM-level isolation, but that's not what we are claiming. Furthermore, Sysbox uses the user-ns as a baseline, but adds syscall interception and procfs & sysfs emulation to further harden the isolation.

- "False marketing is a big red flag, especially for something as critical as a container runtime." That's not what we are doing.

- Rootless Docker/Podman are great, but they work at a different level than Sysbox. Sysbox is an enhanced "runc", and while Sysbox itself runs as true root on the host (i.e., Sysbox is not rootless), the containers or pods it creates are well isolated and void the need for privileged containers in many scenarios. This is why several companies use it in production too.

lima · on Jan 18, 2022

Thank you for taking the time to reply - happy to discuss this! :)

> It's fair to say that gVisor performs better isolation on syscalls, but it's also fair to say that by adding Linux user-ns and procfs & sysfs emulation, Sysbox isolates the container in ways that gVisor does not.

Have a look at what gVisor actually does: https://gvisor.dev/docs/architecture_guide/security

It fully implements a subset of the Linux kernel ABI in userspace, including procfs and sysfs and even memory and process management. No untrusted code ever interacts with the host kernel. Filesystem and network access goes through an IPC protocol and is handled by the gVisor processes on the host, which in turns runs inside a user namespace and a seccomp sandbox for defense in depth.

This is a much, much stronger level of isolation than your approach or, arguably, even VMs (the trade-off is performance). "Sysbox isolates the container in ways that gVisor does not" just isn't true.

The sysbox approach is one kernel bug away from host system compromise, same as using regular containers. Emulating procfs and sysfs and using user namespaces takes away some of the attack surface and is great defense in depth, but does not provide isolation from the host kernel.

> Also, note that Sysbox is not meant to isolate workloads in multi-tenant environments (for that we think VM-based approaches are better)

I've read numerous claims that sysbox is suitable for untrusted workloads, for instance in [1] and [2].

It's a nice product and certainly much, much better than running docker-in-docker using privileged containers, but given the significant remaining attack surface, this claim could put your customers at risk and should come with a big disclaimer.

> While QEMU may be faster than Firecracker in some metrics in a one-instance comparison, when you start running dozens of instances per host, the overhead of the full VM (particularly memory overhead) hurts its performance (which is the reason Firecracker was designed)

Firecracker was designed for memory efficiency, faster cold start times and security (by virtue of being written in a memory-safe language). It means you can run more containers per host, but the actual workload performance overhead is identical to "normal" VMs and, in some cases, even slightly higher since Firecracker lacks some of the optimization that has gone into QEMU.

> Regarding gVisor performance, we didn't do a full performance comparison vs. KubeVirt, so we may stand corrected if gVisor is in fact slower than KubeVirt when running multiple instances on the same host (would appreciate any more info you may have on such a comparison, we could not find one).

KubeVirt is just plain QEMU VMs using libvirt, which have been compared to gVisor quite extensively[3][4]. There's almost no overhead for memory/CPU and quite a lot of overhead for syscalls (but with big improvements recently with the introduction of VFS2 and soon LisaFS[5]). It's a classic trade-off - gVisor is more secure and efficient than QEMU, allowing a much larger number of instances to run on a host by virtue of better cooperation with the host kernel scheduler and memory management, but for raw performance, a QEMU VM always wins.

> Regarding the claim that standard containers cannot run a full OS, what the table in the GH repo is indicating is that Sysbox allows you to create unprivileged containers (or pods) that can run system software such as Docker, Kubernetes, k3s, etc. with good isolation and seamlessly (no privileged container, no changes in the software inside the container, and no tricky container entrypoints). To the best of our knowledge, it's not possible to run say Kubernetes inside a regular container unless it's a privileged container with a custom entrypoint. Or inside a Firecracker VM. If you know otherwise, please let us know.

Firecracker runs a full Linux kernel inside the VM, so it could always run regular Docker, Kubernetes or anything else. See [6] for a practical example.

For containers, this used to be the case, but the situation improved in recent kernel releases.

For podman, almost every combination works - running systemd unprivileged, running podman inside podman, or even running rootless-podman-in-rootless-podman[7] and so does Kubernetes-in-rootless-{podman,docker}[8] (requiring very recent kernel features, though - notably cgroupsv2 and unprivileged overlayfs).

Running docker:dind-rootless inside unprivileged Docker containers also works, however, it requires "--security-opt seccomp=unconfined".

Sysbox definitely got to that point earlier and has better usability.

> - Regarding "The claim that their solution offers large security improvements over any other solution with user namespaces isn't true". Where do you see that claim? The table explicitly states that there are solutions that provide stronger isolation.

Apologies, then, for misinterpreting that.

[1]: https://blog.nestybox.com/2020/10/06/related-tech-comparison...

[2]: https://github.com/nestybox/sysbox/issues/120#issuecomment-9...

[3]: https://object-storage-ca-ymq-1.vexxhost.net/swift/v1/6e4619...

[4]: https://www.scitepress.org/Papers/2021/104405/104405.pdf

[5]: https://gvisor.dev/blog/2021/12/02/running-gvisor-in-product...

[6]: https://github.com/innobead/kubefire

[7]: https://www.redhat.com/sysadmin/podman-inside-container

[8]: https://kind.sigs.k8s.io/docs/user/rootless

ctalledo · on Jan 18, 2022

Thanks again for the detailed response.

> Have a look at what gVisor actually does

I am aware of what it does, though I had missed the fact that the Sentry and/or Gopher run within a user-ns (could not find this in the docs). Had also missed the fact that it does perform procfs/sysfs emulation (makes sense), so I stand corrected on that. In light of this, I'll modify the Sysbox GH table to show gVisor as having a stronger isolation rating (in fact, our Sysbox blog comparing technologies [1] did give gVisor a stronger isolation rating).

> the sysbox approach is one kernel bug away from host system compromise

All approaches are one bug away from host system compromise (gVisor, VMs, etc.), though I agree that approaches like gVisor and VMs have a reduced attack surface.

> I've read numerous claims that sysbox is suitable for untrusted workloads

It's not a black or white determination in my view. Users choose based on their environments & needs. We always make it clear to our users that VM-based approaches provide stronger isolation, per the Sysbox GH repo:

"Isolation wise, it's fair to say that Sysbox containers provide stronger isolation than regular Docker containers (by virtue of using the Linux user-namespace and light-weight OS shim), but weaker isolation than VMs (by sharing the Linux kernel among containers)."

> Firecracker runs a full Linux kernel inside the VM, so it could always run regular Docker, Kubernetes or anything else

That's good to know (thanks), though the table in the Sysbox GH repo meant to compare Sysbox against Kata + Firecracker (since Kata is a container runtime). To the best of my knowledge running Docker, K8s, k3s, etc. inside a Kata container is not easy (see [1] and [2]).

> For containers, this used to be the case, but the situation improved in recent kernel releases.

It's correct that rootless docker/podman approaches are improving as far as what workloads they can run inside containers, although they still have several limitations [3], [4].

With Sysbox, most of these limitations don't apply because the solution works at the more basic "runc" level, Sysbox itself is rootful, and it uses some of the techniques I mentioned before (user-ns, procfs & sysfs virtualization, syscall trapping, UID-shifting, etc.) to make the container resemble a "real host" while providing good isolation.

Good discussion, please let me know of any more feedback.

[1] https://github.com/kata-containers/kata-containers/issues/20... [2] https://github.com/daniel-noland/docker-in-kata [3] https://docs.docker.com/engine/security/rootless/#known-limi... [4] https://github.com/containers/podman/blob/main/rootless.md

xmodem · on Jan 17, 2022

Could you elaborate a bit more how you get the containers into their own IAM roles?

lox · on Jan 17, 2022

Yup, we have a sidecar process/container that runs for each job and assumes an AWS IAM Role for that specific pipeline (with constraints like whether it’s an approved PR as well). The credentials are provided to the job container via a volume mount. This allows us to have shared agents with very granular roles per-pipeline and job.

nijave · on Jan 17, 2022

Not sure if this applies to the parent, but one way this Buildkite

Queues map pipelines to agents. Agents can be assigned IAM roles. If you want a certain build to run as an IAM role, you give it a queue where the agents have that role. For AWS, Buildkite has as a Cloud Formation stack that sets up auto scaling groups and some other resources for your agents to run.

xmodem · on Jan 17, 2022

Most CI systems will have some way of assigning builds to groups of agents. But it would in some cases be useful to grant different privileges to different containers running on the same agent, which is what I understood OP to have.

selecsosi · on Jan 17, 2022

For ECS: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/...

orf · on Jan 17, 2022

AWS has IAM service accounts for containers. Comes for free with EKS, not sure how you’d do it without EKS.

Basically it adds a signed web identity file into the container which can be used to assume roles.

otterley · on Jan 17, 2022

Amazon ECS also offers task roles which do the same thing: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/...

captn3m0 · on Jan 17, 2022

Ref: https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-f...