Poking Holes in Cheese


The fine art of herding spherical mount namespaces in vacuum

This post is somewhat technical but may help application developers to peek under the hood of snapd and see the system in a more complete way. Hopefully it will be of use to more than my closest friends and colleagues.

Snap execution environment, 2.0

It’s not really 2.0, but it sounds nice to say so, since many people have discussed the existing environment in various formats and with varying degrees of correctness and depth. Here I’ll try to describe how the system operates in some more detail, and I will finish with a description of some new and shiny elements that are on the horizon and will be reaching you soon.

Historical digression

You can skip this section if you are familiar with virtual machines and containers.

The execution environment is everything that snap applications see at runtime. You may think that this is a very ordinary concept; after all, every application runs on top of your computer and sees its various technical aspects. It can see the various files, devices, running application processes and everything else that we have conjured as abstractions, conveniences or now-historic design aberrations.

Modern technology upset that model with some advancements. We are now familiar with, and used to, the concept of virtualisation, where a program mimics a whole computer and allows entire operating systems and countless applications to run inside modelled, virtual hardware. As an application developer you can now run your program inside your physical computer or inside one of perhaps many virtual computers. Each computer may present a different set of files, devices and processes to a running application.

Later on, virtualisation was joined by another technology, now called containers. Containers take the idea of using software to create an apparition of a separate computer in which software can execute, but take a lot of the overhead away by reusing the core of the operating system among the different apparitions, or containers. This allowed large-scale deployments where hundreds or thousands of containers can host programs on a single powerful physical computer.

But containers also brought other possibilities with them. You no longer had to have a whole separate computer; you could choose to isolate one aspect from other containers but share another. The idea that you can share the same operating system core, or kernel, amongst separate virtual machines brought us to containers. The same idea, extended to containers, allows a large and complex variety of possible arrangements where programs see their execution environment as either private or shared with others.

Snap application containers

You can skip this section if you are familiar with how snapd uses mount namespaces.

Snap applications, with the very special exception of classically confined snaps, live in an execution environment that is one of the many possible arrangements of the share-or-not-share switchboard enabled by the container APIs available in Linux. Unlike typical containers, which give the impression of a more-or-less separate place, isolated from your computer, snaps try to give the impression of an integrated space. Snaps run on your machine, physical or virtual, and they see most of the same things. The only things that they see differently are different for technical reasons and are, hopefully, out of the way.

One aspect of this imaginary switchboard is something known as the mount namespace. The mount namespace is one of the many namespaces available in Linux. Rest assured that snapd doesn’t use any other namespace, so we can safely ignore them here. This means that any aspect of your computer managed by those other namespaces is exactly the same as on the rest of your computer, and snap applications seem to run directly on top of your computer and not in some other box.
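If you want to check whether two processes really inhabit the same mount namespace, the kernel exposes a symbolic link at /proc/<pid>/ns/mnt whose target names the namespace. Below is a minimal sketch of that check. The function names are mine and purely illustrative, and comparing only the number in the link target is a simplification that is good enough for a quick look.

```python
import os
import re

# The link target looks like "mnt:[4026531840]".
_MNT_LINK = re.compile(r"^mnt:\[(\d+)\]$")


def parse_mnt_ns_link(target: str) -> int:
    """Extract the namespace number from a /proc/<pid>/ns/mnt link target."""
    match = _MNT_LINK.match(target)
    if match is None:
        raise ValueError(f"not a mount namespace link: {target!r}")
    return int(match.group(1))


def mnt_ns_id(pid="self") -> int:
    """Read the mount namespace number of a process (Linux only).

    Looking at processes of other users may require extra privileges.
    """
    return parse_mnt_ns_link(os.readlink(f"/proc/{pid}/ns/mnt"))


def same_mount_namespace(pid_a, pid_b) -> bool:
    """True if both processes report the same mount namespace number."""
    return mnt_ns_id(pid_a) == mnt_ns_id(pid_b)
```

Calling same_mount_namespace("self", some_pid) with the PID of a process started from a strictly confined snap would be expected to return False, while two ordinary desktop processes would normally return True.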

The mount namespace governs all the files, directories and other filesystem entries visible to the processes inhabiting it. My desktop files are stored in the directory /home/zyga/Desktop. This directory is stored in a filesystem which is itself stored on one of the partitions of my hard drive. Those facts are encoded in the mount table managed by the kernel. The details are perhaps known to you, but the key aspect is that the mount table controls what a given process sees as it traverses the filesystem, all the way from the root directory / down to /home, to /home/zyga and finally to /home/zyga/Desktop.
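To make the mount table a little less abstract, here is a small sketch that parses text in the format of /proc/self/mountinfo and answers the question "which mount governs this path?". The sample table, the device names and the helper names are all made up for illustration. The longest-prefix lookup is a simplification: it does not unescape special characters in mount points and it ignores the finer points of stacked mounts.

```python
from typing import List, NamedTuple


class MountEntry(NamedTuple):
    mount_point: str
    fs_type: str
    source: str


# Hypothetical two-line mount table; real ones are much longer.
SAMPLE = """\
22 1 8:1 / / rw,relatime shared:1 - ext4 /dev/sda1 rw
45 22 8:3 / /home rw,relatime shared:28 - ext4 /dev/sda3 rw
"""


def parse_mountinfo(text: str) -> List[MountEntry]:
    """Parse mountinfo-formatted text into a list of entries."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # A lone "-" separates the per-mount fields from the filesystem fields.
        before, after = line.split(" - ", 1)
        mount_point = before.split()[4]
        fs_type, source = after.split()[:2]
        entries.append(MountEntry(mount_point, fs_type, source))
    return entries


def governing_mount(entries: List[MountEntry], path: str) -> MountEntry:
    """Return the entry whose mount point is the longest prefix of path.

    When two entries share a mount point, the later one wins.
    """
    best = None
    for entry in entries:
        point = entry.mount_point
        covers = point == "/" or path == point or path.startswith(point + "/")
        if covers and (best is None or len(point) >= len(best.mount_point)):
            best = entry
    if best is None:
        raise LookupError(f"no mount covers {path!r}")
    return best
```

With the sample table, /home/zyga/Desktop resolves to the filesystem mounted at /home, while /etc/fstab falls through to the root filesystem. On a real system you would feed it the contents of /proc/self/mountinfo instead of SAMPLE.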

Using mount namespaces, one process can see a mounted hard drive and another process a virtual filesystem, a network share or something entirely different. This ability gives us the power to shape the filesystem for each application. If one application needs a particular library or data file or anything else, we can simply put it there. If another application needs a different library in the same spot, then as long as those applications inhabit two distinct snaps, it can be done.

The mount namespace is, perhaps unfortunately, more complex than other namespaces in that it is not a real share-or-not-share binary choice. It’s a flexible and complex web of dependencies and behaviours that describes not only how things seem to be in the filesystem but also how changes in a particular spot affect all the other namespaces. When you plug in a removable hard drive and see the files in your file manager, it is so because the file manager runs in the same namespace as the system service that mounted the hard drive. Using namespaces one could construct a world where the daemon inhabits a different mount namespace but can still mount the removable hard drive and make it appear in another mount namespace (perhaps in a totally different place) because of a particular sharing arrangement. That other mount namespace could also be adjusted to unmount the removable hard drive, but only for itself, leaving it available to the rest of the system.
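The kernel calls these sharing arrangements mount propagation, and you can see how each mount is configured in the optional fields of /proc/self/mountinfo: shared:N marks a mount that exchanges events with its peers, master:N marks a mount that only receives them, and a mount with neither is private. Here is a rough sketch that classifies a single line. The function name and the sample lines in the usage note are my own and only illustrative.

```python
def propagation(mountinfo_line: str) -> str:
    """Classify the propagation type recorded in one mountinfo line."""
    # Optional fields sit between the mount options (7th field) and " - ".
    before, _ = mountinfo_line.split(" - ", 1)
    optional = before.split()[6:]
    tags = {field.split(":")[0] for field in optional}
    if "unbindable" in tags:
        return "unbindable"
    shared = "shared" in tags
    slave = "master" in tags
    if shared and slave:
        return "shared+slave"
    if shared:
        return "shared"
    if slave:
        return "slave"
    return "private"
```

For example, a line such as "310 300 8:17 / /media/usb rw,relatime master:5 - vfat /dev/sdb1 rw" is classified as "slave". That is the arrangement from the removable drive story above: mounts made on the master side show up here, but unmounting here does not travel back.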

The last aspect of the mount namespace that is essential to understanding the execution environment is that, using bind mounts, one can make a part of a filesystem appear in another place. One can, for instance, make a part of a large external hard drive appear in a directory in your home folder. Or make a part of your regular system, say the /home directory, visible to a snap application while the rest of the system is totally different. You can bind mount directories or individual files. This is used extensively by snapd to piece together a partially writable world out of large read-only spaces and a single writable space that appears spread around individual files and directories in /etc, /var and /home.
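For the curious, a bind mount is just the mount(2) system call with the MS_BIND flag. The sketch below calls it through ctypes. This is not how snapd does it (snapd is not written in Python); the helper names are mine, and actually running bind_mount needs root on Linux. The target must already exist and be of the same kind as the source: a file for a file, a directory for a directory.

```python
import ctypes
import ctypes.util
import os

# Flag values from <sys/mount.h> on Linux.
MS_BIND = 4096
MS_REC = 16384


def bind_mount_args(source: str, target: str, recursive: bool = False):
    """Build the five arguments of mount(2) for a bind mount."""
    flags = MS_BIND | (MS_REC if recursive else 0)
    # Filesystem type and data are ignored for bind mounts, so pass NULL.
    return (os.fsencode(source), os.fsencode(target), None, flags, None)


def bind_mount(source: str, target: str, recursive: bool = False) -> None:
    """Make source also visible at target. Requires root privileges."""
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    libc.mount.argtypes = (
        ctypes.c_char_p,
        ctypes.c_char_p,
        ctypes.c_char_p,
        ctypes.c_ulong,
        ctypes.c_void_p,
    )
    if libc.mount(*bind_mount_args(source, target, recursive)) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), target)
```

The recursive flag corresponds to MS_REC, which also carries along anything that is mounted below the source directory.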

The new and shiny

Over the past few months snapd has been growing an array of means to shape mount namespaces. All of this is mostly under the hood and materializes as fixes to obscure bugs or long-standing feature requests for the content sharing interface. One big request was to allow shaping the execution environment more freely and with greater access to places outside of the small part of the otherwise read-only filesystem carved out for writable data.

The concept of putting a writable veil over read-only space is not new. It was pioneered by bootable Linux distributions that filled a read-only medium with what looks like a typical system installation and then used some features of the kernel to make it appear writable, storing the actual changes in memory.

The kernel has a long history of attempts to overlay one filesystem over another, but all of them were plagued by various issues. While they looked fine at a quick glance, on closer inspection the incorrect behavior was breaking more sophisticated programs. Fast forward to today and even the most advanced solution that allows this still struggles with security features and cannot be used universally, even on the most recent kernel versions.

Snapd needed a robust solution, even if that solution is not as conceptually elegant as the overlay filesystem. The solution we ended up using works correctly with path-based LSMs (Linux Security Modules) such as AppArmor and works fine on just about any kernel that a random Linux distribution offers now or has offered in the past few years.

Using bind mounts and writable in-memory temporary filesystems, one can achieve pretty much any desired layout. To be more accurate, any layout can be achieved as long as its description is applied by a special program and not through the regular operations for creating files and directories. The idea is very simple. Assume you are in a read-only filesystem (a CD-ROM or… a snap). You want to write to a non-existent file or directory. Bind mount the parent directory aside. We will call that a safe-keeping directory. Mount a new, empty in-memory filesystem over the original parent directory, completely hiding any content inside. Using the safe-keeping, bind-mounted version of the parent directory as a reference, create empty directories and files in the new in-memory filesystem that mimic the now-hidden original. Bind mount each file and directory from the safe-keeping directory to the corresponding empty placeholder in the new in-memory filesystem. Finally, unmount the safe-keeping directory.
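The recipe above can be written down as a plan of operations. The sketch below only computes the plan and touches nothing on disk, which makes it easy to read and test. It is my own illustration of the steps, not snapd's actual code; the operation names, the function name and the paths in the usage note are made up, and it only knows about plain files and directories.

```python
from typing import Dict, List, Tuple

Op = Tuple[str, ...]


def plan_writable_mimic(parent: str, entries: Dict[str, str], safe: str) -> List[Op]:
    """Plan the mount operations that turn a read-only parent writable.

    parent  -- the read-only directory we want to create something in
    entries -- existing names inside parent, mapped to "dir" or "file"
    safe    -- the safe-keeping directory (must exist and be mountable)
    """
    ops: List[Op] = [
        ("bind", parent, safe),  # keep the original reachable on the side
        ("tmpfs", parent),       # hide the original under an empty tmpfs
    ]
    for name in sorted(entries):
        kind = entries[name]
        if kind not in ("dir", "file"):
            raise ValueError(f"unsupported entry kind for {name!r}: {kind!r}")
        placeholder = f"{parent}/{name}"
        # An empty placeholder of the right kind, then the original on top.
        ops.append(("mkdir" if kind == "dir" else "touch", placeholder))
        ops.append(("bind", f"{safe}/{name}", placeholder))
    ops.append(("umount", safe))  # the safe-keeping copy is no longer needed
    return ops
```

Calling plan_writable_mimic("/ro/parent", {"lib": "dir", "config": "file"}, "/tmp/safe-keeping") yields seven operations: one bind, one tmpfs, a placeholder plus a bind for each of the two entries, and the final unmount. After they are applied, /ro/parent looks the same as before, except that new files and directories can now be created in it.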

The in-memory filesystem is tmpfs, and if you are familiar with it you will know that all the writes are discarded when the machine is rebooted or powered off. For us that is not a problem because we will always bind mount something inherently writable (e.g. the per-snap writable data directory) or other read-only content (e.g. part of another snap).

This scheme is also fully reversible so we can undo the changes (effectively unmount each item inside and then unmount the in-memory filesystem) to restore the original view.
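The undo direction is even simpler and can be sketched the same way, as a plan rather than as real system calls. Again, the function and operation names are hypothetical and the sketch assumes that nothing else was mounted inside the directory in the meantime.

```python
from typing import Iterable, List, Tuple


def plan_undo(parent: str, names: Iterable[str]) -> List[Tuple[str, str]]:
    """Plan the unmounts that restore the original view of parent.

    names -- the entries that were bind mounted inside the tmpfs
    """
    # First peel off every bind mount inside the tmpfs ...
    ops = [("umount", f"{parent}/{name}") for name in sorted(names, reverse=True)]
    # ... then the tmpfs itself, which reveals the original directory.
    ops.append(("umount", parent))
    return ops
```

The order matters: the in-memory filesystem cannot be unmounted normally while other things are still mounted inside it.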

All of those mount operations are performed in the mount namespace of the affected snap. Other snaps and the rest of the system are not going to see them.

The future

This new tool in our toolbox opens the path to several exciting new features. I will write about them separately, but to spark your interest I will enumerate them here:

  • A much richer content interface (plug-in snaps, themes, etc.) where content can be automatically aggregated in a spool-like directory.
  • No more hacks for creating directories in $SNAP_DATA or placeholders in $SNAP.
  • An open avenue towards generalized layout configuration for any snap (e.g. put $SNAP/usr/lib over the regular /usr/lib).