
cyphar • today at 4:58 PM

> If you've somehow bypassed AppArmor and cgroup mechanisms then any UID/GID remapping is irrelevant. At this point you're in a position to directly manage memory.

Not really. User namespaces (despite all of the issues that unprivileged user namespaces have caused) provide an additional layer of protection, as lots of privilege checks are either based on kuid/kgid or are userns-aware. These are some of the deepest security models that Linux has (in the sense that every codepath that operates on kernel objects involves a kuid/kgid check and possibly a capability check), so making full use of them is fairly prudent.
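To make the kuid/kgid point a bit more concrete, here is a rough sketch of the translation that a user namespace mapping describes. It parses text in the /proc/<pid>/uid_map format (three columns: first ID inside the namespace, first ID outside, range length) and maps an in-namespace ID to the outer one. The function names and the 100000 starting offset are made up for illustration, and this is a userspace model of the idea, not what the kernel actually runs.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map style text into (inside, outside, count) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        inside, outside, count = (int(f) for f in fields)
        entries.append((inside, outside, count))
    return entries


def to_outer_id(entries, inner_id):
    """Translate an in-namespace ID to the outer ID, or None if it is unmapped."""
    for inside, outside, count in entries:
        if inside <= inner_id < inside + count:
            return outside + (inner_id - inside)
    return None


# Hypothetical mapping: in-namespace IDs 0..65535 map to outer IDs 100000..165535.
example_map = "         0     100000      65536\n"
entries = parse_id_map(example_map)
root_outside = to_outer_id(entries, 0)
unmapped = to_outer_id(entries, 65536)
```

With a mapping like this, "root" inside the container is an unprivileged ID (100000 here) as far as checks against the host are concerned, and IDs outside the mapped range have no valid translation at all.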

AppArmor is not a particularly strong security boundary (it's better than nothing, but there are all sorts of issues with having path-based policies and so they mostly act as a final layer of defence against dumb attacks). cgroups are mostly just resource limits, but the devices cgroup (and devices eBPF filter) is a security barrier that prevents obviously bad stuff like direct write access to your host drive. However, those are not the only kinds of protections we need or use in containers, and breaking just those is not enough to "directly manage memory" (but /dev/kmem is inaccessible to user namespaced processes so if that is something you're worried about, user namespaces are another good layer of defence ;)).

It should also be noted that LXC is not the only runtime to support this; the OCI ecosystem supports it too and has for quite a long time now (and the latest release of Kubernetes officially supports isolated user namespaces). Most of my container runtime talks in the past decade have had a slide telling people to use user namespaces, but sadly they are not widely used yet.
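For reference, on the OCI side this is just a user namespace entry plus ID mappings in the runtime config. Below is a minimal sketch that builds the relevant fragment of config.json; the field names (`namespaces`, `uidMappings`, `gidMappings`, `containerID`, `hostID`, `size`) are from the OCI runtime spec, while the helper name and the 100000/65536 values are just placeholders I picked. A real config lists the other namespaces and many more fields.

```python
import json


def userns_fragment(host_start, size=65536):
    """Build only the user-namespace part of an OCI runtime config (hypothetical helper)."""
    mapping = {"containerID": 0, "hostID": host_start, "size": size}
    return {
        "linux": {
            # A real config would also list pid, mount, network, etc.
            "namespaces": [{"type": "user"}],
            "uidMappings": [dict(mapping)],
            "gidMappings": [dict(mapping)],
        }
    }


fragment = userns_fragment(100000)
rendered = json.dumps(fragment, indent=2)
print(rendered)
```

On Kubernetes the equivalent knob is setting `hostUsers: false` in the pod spec, and the runtime picks the mappings for you.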

On the topic of whether containers are a security boundary, I consider them to be fairly secure these days if you use reasonable defaults (user namespaces, reasonable seccomp rules, ideally non-root inside the container). The main reason we struggle in ways that BSD Jails and Solaris Zones do not is that containers on Linux require putting together a lot of disparate components, and while this does mean that you can harden non-container programs using them, it opens the door to more bugs. If we had a way to consolidate more operations and information in-kernel, things would be much easier to secure (one perennial issue is the non-atomicity of switching to the container security zone).

(Disclaimer: I am a maintainer of runc.)
