68 comments
  • pss3143d

    I loved this hands-on presentation Containers From Scratch by Liz Rice from a few years ago: https://www.youtube.com/watch?v=8fi7uSYlOdc.

    Today, Linux containers in (less than) 100 lines of shell by Michael Kerrisk was published: https://www.youtube.com/watch?v=4RUiVAlJE2w.
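    Both talks build on the fact that a process's namespace memberships are visible under /proc. A minimal sketch (Linux-only, no special privileges assumed) of inspecting them:

```python
# List the namespaces the current process belongs to by reading the
# magic symlinks under /proc/self/ns (Linux-specific; the exact set of
# entries varies by kernel version).
import os

def current_namespaces() -> dict[str, str]:
    """Map each namespace type (mnt, net, pid, ...) to its identity link."""
    ns_dir = "/proc/self/ns"
    return {name: os.readlink(os.path.join(ns_dir, name))
            for name in sorted(os.listdir(ns_dir))}

if __name__ == "__main__":
    for ns_type, ident in current_namespaces().items():
        # Two processes share a namespace iff these links match,
        # e.g. "net -> net:[4026531840]"
        print(f"{ns_type:8s} -> {ident}")
```

    Running it inside and outside a container shows different inode numbers for each namespace type, which is the whole trick.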

    • seungwoolee5183d

      Michael Kerrisk wrote a series of articles about Linux namespaces on lwn.net [0]

      [0]: https://lwn.net/Articles/531114/#series_index

    • Brian_K_White3d

      That bash/busybox demo is awesome. The code is at: https://man7.org/tlpi/code/ (/tlpi-dist/consh/ in the tar)

      I still used lxc-utils in my rc script, which now seems like positively cheating; at that point you may as well use docker.

    • 2d
      [deleted]
    • 3d
      [deleted]
    • Brian_K_White3d

      On my birthday, while attending Arisia in January 2010, I wrote a single rc script with about 30 non-boilerplate lines of bash (the 3 functions) that does:

        * start all enabled containers on boot
        * stop all running containers at shutdown (ie gracefully wait for them all to shut themselves down before letting the host proceed to shut itself down)
        * start/stop/status any specified container on command
        * list all containers (known/configured, running or not)
        * every container has a gnu screen console
        * simple config file per container to define network & root dir etc.
      
      (these are the latest versions of the wiki page and the referenced rclxc package, but I created the wiki page and the script on Jan 18 2010, despite the wiki history. The weird link for the rpm is because home:aljex no longer exists on the opensuse build service)

      https://en.opensuse.org/SDB:LXC

      https://anna.lysator.liu.se/pub/opensuse/repositories/home%3...

      A whopping 3 files in the package; one is just a symlink, and another is just a single rmdir command. No daemon; the script only runs to do something. Not even systemd, just plain old sysv init.

      I never developed it beyond essentially proof of concept because my company's owner listened to vmware salespeople, but I did use it in quasi-production for a year or two (some developer VMs, a few internal services, 20 or so customers).

      But to me it did prove the concept and I would have liked to just work on that instead of using vmware or anything else. I completely gag when I look at kubernetes or even just podman when I had this so long ago and got so much function out of so little code and complication.

      I mean, it would obviously get larger and more complicated as it grew to handle more cases and supply more features. I think I just always want to stop at the 90/10 place, where you get 90% of the functionality with 10% of the code and the remaining 10% of the functionality requires 10x the initial code. I feel like once you cross that point you have wandered off the track and are doing bad engineering in some way; you need to go back, figure out where you started driving in your sleep, and get back on track solving the problem of getting the necessary job done in some sensible way.

      • kubafu3d

        > I think I just always want to stop at the 90/10 place where you get 90% of the functionality with 10% of the code, and the remaining 10% of the functionality requires 10x the initial code.

        And that should be the right approach 90% of the time. Thanks for your comment!

  • Joker_vD3d

    > Importantly, we designed Styrolite with full awareness that Linux namespaces were never intended as hard security boundaries—a fact that explains why container escape vulnerabilities continue to emerge. Our approach acknowledges these limitations while providing a more robust foundation.

    So what do you do, exactly?

    • klysm3d

      Say “it’s probably fine” and hope that the people building the foundational systems are protecting us

      • Joker_vD3d

        No, I mean, what do the Edera developers do differently, in order to provide more robust foundation with this new container runtime called Styrolite? They still use Linux namespaces, as far as I can tell from TFA.

        • denhamparry3d

          Edera developer here: we use Styrolite to run containers with Edera Protect. Edera Protect creates Zones to isolate processes from other Zones, so that if someone were to break out of a container, they'd only see the zone's processes, not the host operating system or the hardware on the machine. The key difference between us and other isolation implementations is that there is no performance degradation, you don't have to rebuild your container images, and we don't require specific hardware (e.g. you can run Edera Protect on bare metal, on public cloud instances, and everything in between).

          • xmodem3d

            What underlying primitives are you relying on to provide isolation, if not Linux namespaces?

            How does your approach compare to Google's gVisor?

            • asmor2d

              It's Xen, and they even explain why it's not KVM here: https://github.com/edera-dev/krata/blob/main/FAQ.md

            • sys_call3d

              gVisor emulates a kernel in userspace, providing some isolation but still relying on a shared host kernel. The recent Nvidia GPU container toolkit vulnerability allowed privilege escalation and a container escape to the host because of a shared inode.

              Styrolite runs containers in a fully isolated virtual machine guest with its own, non-shared kernel, isolated from the host kernel. Styrolite doesn't run a userspace kernel that traps syscalls; it runs a type 1 hypervisor for better performance and security. You can read more in our whitepaper: http://arxiv.org/abs/2501.04580

              • xmodem3d

                Thanks for the explanation. So you are using virtualisation-based techniques. I had incorrectly inferred from other comments that you were not.

                I skimmed the paper and it suggests your hypervisor can work without CPU-based virtualisation support - that's pretty neat.

                Many cloud environments do not have support for nested virtualisation extensions available (and also it tends to suck, so you shouldn't use it for production even if it is available). So there aren't many good options for running containers from different security domains on the same cloud instance. gVisor has been my go-to for that up until now. I will be sure to give this a shot!

              • 0x1ceb00da3d

                So it's a lightweight way of running docker images inside a virtual machine?

                • sys_call2d

                  Yes, precisely. This also provides container operators with the benefits of a hypervisor, like memory ballooning and dynamically allocating CPU and memory to workloads, improving resource utilization and reducing the current node-overprovisioning patterns.

              • klysm2d

                So it’s a VM?

          • znpy3d

            > Edera Protect creates Zones to isolate processes from other Zones

            What do you mean by "zone" exactly?

            • sys_call3d

              A zone is jargon for a virtual machine guest environment (an homage to Solaris Zones). Styrolite and Edera run containers inside virtual machine guests for improved isolation and resource management.

              • znpy2d

                > an homage to Solaris Zones

                i asked specifically because the word "zones" reminded me of solaris zones :)

                > Styrolite and Edera runs containers inside virtual machine guests for improved isolation and resource management.

                do you have your own vmm or is it firecracker with makeup and a wig?

              • klysm2d

                How exactly is this an improvement over VMs?

                • sys_call2d

                  We run unmodified containers in a VM guest environment, so you get the developer ergonomics of containers with the security and hardware controls of a VMM.

    • flkenosad3d

      Anyone know if it's possible to update the Linux kernel so that namespaces are hard security boundaries? I wonder what that would entail.

      • eyberg3d

        When we speak of 'hard security boundaries', most people in this space are comparing to existing hardware-backed isolation such as virtual machines. There are many container escapes each year because the chunk of API that containers are required to cover is so large, but more importantly because they don't have isolation at the CPU level (e.g. Intel VT-x instructions such as VMREAD, VMWRITE, VMLAUNCH, VMXOFF, VMXON).

        This is what the entire public cloud is built on. You don't often read articles where someone is talking about breaking VM isolation on AWS and spying on the other tenants on the server.

        • 3d
          [deleted]
        • vaylian2d

          > There are many container escapes each year because the chunk of api that they are required to cover is so large

          What API? The kernel syscall API?

          If we assume for a moment that there are no bugs in the Linux namespace implementation, would containers be as safe as virtual machines?

          • eyberg2d

            No. As I'm responding to this, Qualys just announced three new bypasses today: https://seclists.org/oss-sec/2025/q1/253

            • vaylian2d

              Sorry, can you elaborate? Your answer is not really clear. Why is it not possible for Linux namespaces to be secure?

        • flaminHotSpeedo3d

          > This is what the entire public cloud is built on.

          Well... The entire public cloud except Azure. They've been caught multiple times for vulnerabilities stemming from the lack of hardware backed isolation between tenants.

          • richardwhiuk3d

            Azure has the same level of isolation for VMs at a hardware level as AWS.

            • flaminHotSpeedo2d

              How Azure isolates VMs is completely unrelated, because containers are not VMs. And if you meant to assert that Azure uses hardware-assisted isolation between tenants in general, that was not the case for Azurescape [1] or ChaosDB [2].

              [1] https://unit42.paloaltonetworks.com/azure-container-instance...

              [2] https://www.wiz.io/blog/chaosdb-explained-azures-cosmos-db-v...

              • richardwhiuk2d

                It is the case for VMs that customers create.

                  It hasn't always been the case for managed services, but I don't think that's true for AWS either.

                • flaminHotSpeedo2d

                  Unmanaged VMs created directly by customers still aren't relevant to this discussion. The whole point here is that everyone else uses some form of hardware-assisted isolation between tenants, even in managed services that vend containers or other higher-order compute primitives (i.e. Lambda, Cloud Functions, and hosted notebooks/shells).

                  Between first- and second-hand experience I can confidently say that, at a bare minimum, the majority of managed services at AWS, GCP, and even OCI use VMs to isolate tenant workloads. Not sure about OCI, but at least in GCP and AWS, security teams that review your service will assume that customers will break out of containers no matter how the container capabilities/permissions/configs are locked down.

      • GardenLetter273d

        A lot of use cases don't want that though. It's nice having lightweight network namespaces for example, just to separate the network stack for tunneling but still have X and Wayland working fine with the applications running there.

      • fulafel2d

        Have a look at gVisor for one approach.

    • z3t43d

      Once you have set up the namespaces, you drop all capabilities, so that if the program gets hacked while it's running it can do very little.
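      (For the curious: a capability set is just a 64-bit mask, reported in hex in the CapEff line of /proc/<pid>/status. A minimal sketch of decoding one; the name table below is a hand-picked subset of linux/capability.h, for illustration only:)

```python
# Decode a Linux capability bitmask (the hex value shown in the CapEff
# field of /proc/<pid>/status) into capability names.
# NOTE: the table is a hand-picked subset of linux/capability.h,
# not the full list.
CAP_NAMES = {
    0: "CAP_CHOWN", 1: "CAP_DAC_OVERRIDE", 3: "CAP_FOWNER",
    5: "CAP_KILL", 6: "CAP_SETGID", 7: "CAP_SETUID",
    12: "CAP_NET_ADMIN", 13: "CAP_NET_RAW",
    19: "CAP_SYS_PTRACE", 21: "CAP_SYS_ADMIN",
}

def decode_caps(mask: int) -> list[str]:
    """Return the names of the set capability bits (unknown bits as cap_N)."""
    return [CAP_NAMES.get(bit, f"cap_{bit}")
            for bit in range(64) if mask & (1 << bit)]

# "Drop all capabilities" leaves an empty effective set:
print(decode_caps(0x0))      # []
# CAP_NET_ADMIN (bit 12) | CAP_NET_RAW (bit 13):
print(decode_caps(0x3000))   # ['CAP_NET_ADMIN', 'CAP_NET_RAW']
```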

      • denhamparry3d

        Edera developer here. I agree! But there are instances where we need to run with additional capabilities, and we're also dependent on people knowing how to do the right thing. We're trying to improve this by setting safer defaults, while also improving the overall performance and efficiency of running containers.

      • znpy3d

        honest question: how is this any better than running non-root containers?

        They can do very little anyway, that way.

        • sys_call3d

          Non-root containers still operate under a shared kernel. Non-root containers that run under a vulnerable kernel can lead to privilege escalation and container escapes.

          Styrolite is a container runtime engine that runs containers in a virtual machine guest environment with no shared kernel state. It uses a type 1 hypervisor to fully isolate a running container from the node and other containers. It's similar to Firecracker or Kata containers, but doesn't require bare metal instances (runs on standard EC2, etc) and utilizes paravirtualization.

        • 2d
          [deleted]
  • seungwoolee5183d

    When I was digging into containers (i.e. Linux namespace capabilities), lwn.net's series of articles helped me a lot.

    https://lwn.net/Articles/531114/#series_index

  • shortrounddev23d

    I've seen many examples of people creating containers for Linux; I wish it were comparably easy to create containers for Windows. The fundamental software exists on Windows (AppContainers are how UWP apps work), but the documentation around AppContainers is very sparse/opaque because Microsoft doesn't want you to use AppContainers to make a general-purpose sandbox environment like Snap or Flatpak; they want you to write UWP apps. It would be immensely helpful if you could run any arbitrary win32 or higher application in a sandboxed AppContainer where the NT system calls only had access to, say, the application's local folder and its %APPDATA% folder.

    Alas, I think that Microsoft has simply given up on native application support on Windows. Currently the only good way to write native apps for Windows is still Win32/MFC and Winforms.

    In fact, I think that secretly even Microsoft knows that everyone hates their UI frameworks/runtimes (and the fact that Microsoft deprecates them 2 years into their lifespan) because Microsoft STILL provides modern .Net 8/9 bindings for Winforms in 2025. If only they would just replace the GDI renderer with Direct2D, it would be literally perfect

    • pjmlp3d

      Windows containers exist; they are based on job objects, and Microsoft took the approach of using the same APIs the docker world expects, as a means to integrate with DevOps container-world expectations.

      https://learn.microsoft.com/en-us/virtualization/windowscont...

      You missed GDI+. The Direct2D API is a COM mess that we only put up with because of DirectX, and the DirectX team doesn't like .NET, thus nothing like XNA or Managed DirectX will ever happen again.

      WPF also exists, and since Build 2025 it has regained parity with WinUI among the official Windows GUI frameworks that aren't in maintenance mode (unlike Forms and MFC).

      However, WinUI 3.0 with WinAppSDK has been a mess of a project since Project Reunion was announced back in 2021; after almost four years it is still a shadow of the UWP tooling. This is where I agree with you: it was so badly managed that nowadays only the Windows development team really cares about it, most likely because their jobs depend on having to use WinUI.

      But if you so wish to go through the pains of WinUI, there is Win2D.

      • shortrounddev23d

        While windows containers exist, the documentation surrounding them at the API level is sparse. Anything from Azure just tells you to use docker.

        As far as I can tell GDI+ is still software-rendered? DirectX COM objects aren't difficult to work with at all; I've never understood why people hate them so much. The point of using Direct2D would be to provide hardware rendering for Winforms.

        WPF is OK compared to WinUI 3, but it still suffers from XAML.

        • pjmlp3d

          Because the API was designed to be compatible with Docker tooling.

          GDI and GDI+ have been hardware accelerated for years now:

          https://learn.microsoft.com/en-us/windows-hardware/drivers/d...

          Maybe because COM tooling sucks. In C++ land, Microsoft re-invents its approach to COM every couple of years, and it is too much C/C++-style instead of a proper modern C++ approach to handling COM.

          Meanwhile in .NET land, the DirectX team couldn't care less, and leaves it to the community to make the interop work without issues.

          The XAML hate comes mostly from outside traditional Windows developer circles.

          • shortrounddev22d

            Also, the hardware acceleration in GDI and especially GDI+ is not complete. Text rendering in GDI+ is still handled in software, and only some operations in GDI are hardware accelerated.

          • shortrounddev23d

            yes but the point is to not have to use docker to containerize an app; it would be nice to be able to containerize an app with a built-in runtime or something that is just literally not docker. Microsoft could solve so many of its security issues with an equivalent to Snap.

            Again, I don't get what the COM hate is. In DirectX, it's basically just become a simple way to manage the life cycle of an object.

            And XAML hate is the hill I'm willing to die on. UI should be defined in either a DOM or a winforms-like API, but not a mix between the two. XAML is just straight up one of the worst things Microsoft has created.

  • m00dy3d

    We are an algorithmic trading company [0], and our trading strategies are primarily built as pure Rust libraries. We've been searching for a way to sandbox the strategies we host, as not all of them are signed or open source for verification. Styrolite seems like a promising solution to address this issue, so we’re planning to give it a try.

    [0]: https://cycletop.xyz

    • denhamparry3d

      Edera developer here! Thank you for sharing; any feedback you have would be great! Edera Protect is written in Rust too, and our focus is on performance as well as isolation.

  • pzmarzly3d

    Why not use any of the existing OCI runtimes? They take a well-defined[0] JSON description as input, and are pretty well contained (a single static binary). And because they are separate binaries, not libraries, you don't need to worry about things like thread safety or FD leaking.

    [0] https://github.com/opencontainers/runtime-spec/blob/main/con...
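    For a sense of scale, a stripped-down sketch of the config.json such a runtime consumes (fields abbreviated, paths illustrative; see the spec linked above for the full schema):

```json
{
  "ociVersion": "1.0.2",
  "process": {
    "terminal": false,
    "user": { "uid": 0, "gid": 0 },
    "args": [ "sh" ],
    "cwd": "/"
  },
  "root": { "path": "rootfs", "readonly": true },
  "linux": {
    "namespaces": [
      { "type": "pid" },
      { "type": "mount" },
      { "type": "network" }
    ]
  }
}
```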

    • zamalek3d

      "I don't need the full capabilities of OCI." In my (now very much stagnating) Nix-like pet project[1] I merely want a hermetic build environment. Rolling my own container runtime was no more difficult than what would likely be a nightmare of emulating a complete OCI container for the simple purpose I'm after.

      Simple problems need simple solutions, and OCI is really complex. I was initially overjoyed by the prospect of deleting my code, but it looks like this project doesn't have rootless/shadowutils support yet (which is solely useful for not having to worry about su or caps during development).

      [1]: https://github.com/porkg/porkg/tree/rs

    • r3trohack3r3d

      I’m currently exploring this for an AI context because I haven’t found a better solution for letting K8S manage AI workloads that need direct GPU access on OSx

      • denhamparry3d

        Edera developer here. Edera Protect is being developed to manage access to the GPU hardware on a Node with the containers running your workloads. We talk a lot about isolation between containers, but we're also focused on adding this isolation throughout the stack, from containers/processes down to hardware.

      • pm903d

        You're running a kubernetes cluster with nodes that are running OSx?

      • brcmthrowaway3d

        Why are you building AI anything

    • harha_3d

      The beginning of the article answers your question.

  • infogulch2d

    How does this compare to the recently discussed Landrun?

    https://news.ycombinator.com/item?id=43445662

  • cedws3d

    Isn’t the gold standard of containerisation gVisor? Can’t get much more restrictive than proxying and filtering syscalls. As far as I remember it’s the default runtime on GKE.

    • denhamparry3d

      Edera developer here. gVisor is restrictive, but at a cost in performance. Personally, I'd say Edera Protect goes one level deeper: we create Edera Protect Zones to provide isolation, where each Zone is isolated from the OS and hardware of the machine running the container. So we don't proxy or filter syscalls; the isolation sits a layer deeper. We are also focused on ensuring that Edera Protect is as performant as (if not better than) running a container today with containerd.

      Finally, if you wanted to, you could run gVisor within Edera Protect, but we feel that Edera Protect already provides the security benefits that gVisor offers.

      • cedws2d

        Thanks, but what is a “Protect Zone” at a technical level? Why does it provide stronger isolation than syscall filtering?

      • raesene92d

        How would you say it compares to Firecracker?

    • raesene93d

      If you want better isolation than is provided by Linux namespaces et al., then yep, something like gVisor or Firecracker (https://firecracker-microvm.github.io/) likely provides a better level of isolation.

    • sys_call3d

      gVisor runs a userspace kernel that proxies syscalls to a shared host kernel. Running an "application kernel" in userspace impacts performance because every operation goes through two schedulers. Virtual-machine isolation is more restrictive because it doesn't share any kernel state with other containers. We have a whitepaper that compares the performance of gVisor and Styrolite/Edera if you want to see the differences: http://arxiv.org/abs/2501.04580

  • TechDebtDevin3d

    Cookie consent card won't disappear. Brave mobile.