---
title: "Unveiling Whiteout Files: Do you know how file deletions are handled between layers of a Docker image?"
date: !!timestamp '2023-11-09 15:35:00'
image: /post/unveiling-whiteout-files/og.webp
tags:
  - kernel
  - container
  - docker
  - linux
---

Union file systems are a mechanism for merging two or more file systems and presenting them to the user as one, under a single mount point. The main idea is to be able to alter the contents of the first file system (e.g. the contents of a CD-ROM) by writing all changes (additions, deletions, modifications) to the second (which could be a disk partition, a USB stick, ...).

While adding and modifying may seem trivial, deleting is not. So in this article, let's explore what *whiteout files* are and how they can simulate the deletion of a file.

Another common use of union file systems is in containers: container images are made up of layers. If you launch a `php` container and then an `nginx` container, both from images based on `debian`, you will only download the underlying `debian` image once. Files from the `debian` image may still be modified or deleted by an image such as `php` or `nginx`, thanks to the union file system!

## Understanding Union File Systems

Union file systems share a number of concepts, which we will illustrate with the following diagram:

![File access by layer](overlayfs.png)

Here we see a two-layer file system; in the jargon, the layers are called *branches*. They are named *lower* for the lowest layer, *upper* for the layer placed on top of the *lower* layer, and *merged* for the resulting view.

Some implementations support more than 2 branches, sometimes with complex access and modification policies.

When a file is deleted from the union, a so-called *whiteout file* is placed in the *upper* layer to indicate that this file should no longer be displayed in the *merged* view. The same concept applies to folders, which are then referred to as *opaque directories*.
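The lookup rule described above can be sketched with a toy model in plain shell. This is only an illustration, not how any real union file system is implemented: the `resolve` helper, the directory layout and the use of plain `.wh.<name>` files as markers are all assumptions, chosen so that the sketch runs without root and without a real mount.

```shell
#!/bin/sh
# Toy model of a union lookup. Assumption: whiteouts are plain files named
# ".wh.<name>", so the sketch needs neither root nor a real union mount.
demo=$(mktemp -d)
mkdir -p "$demo/lower/etc" "$demo/upper/etc"
echo "from lower" > "$demo/lower/etc/hosts"
echo "from lower" > "$demo/lower/etc/motd"
echo "from upper" > "$demo/upper/etc/hosts"   # modified copy shadows lower
: > "$demo/upper/etc/.wh.motd"                # whiteout hides lower's motd

# resolve ROOT PATH: print the branch file backing PATH in the merged view,
# or return 1 if the path is hidden or does not exist (hypothetical helper).
resolve() {
    root=$1 path=$2
    dir=$(dirname "$path") name=$(basename "$path")
    if [ -e "$root/upper/$dir/.wh.$name" ]; then
        return 1                              # deleted in the union
    elif [ -e "$root/upper/$path" ]; then
        echo "$root/upper/$path"              # upper wins
    elif [ -e "$root/lower/$path" ]; then
        echo "$root/lower/$path"              # fall through to lower
    else
        return 1
    fi
}

cat "$(resolve "$demo" etc/hosts)"            # prints "from upper"
resolve "$demo" etc/motd || echo "etc/motd is hidden"
```

The order of the checks is the whole point: the marker in *upper* is consulted before either copy of the file, which is what lets a read-only *lower* file appear to be deleted.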
When accessing a file in the *lower* branch that has not been modified in *upper*, the *lower* file is accessed directly.

When a file is modified, its entire contents are first copied from the *lower* branch to the *upper* branch. A file that is added, overwritten or modified will therefore have its entire contents in the *upper* layer.

## History

The concept of the *whiteout file* goes back to the early development of union file systems.

[**Translucent File System**](http://mcvoy.com/lm/papers/SunOS.tfs.pdf) is probably the first implementation of the *whiteout file* concept. Developed by David Hendricks in the 1980s for SunOS 3, it was meant to let the users of a machine build on the base system and make modifications without affecting other users, and without having access to other users' files.

The first *union mounts* were implemented in 4.4BSD, in the 90s.

The best-known implementation today is *UnionFS*, by Erez Zadok. It was expected to become the implementation used in the Linux kernel but, like *aufs*, its code and design were never convincing enough to be fully integrated.

It wasn't until 2014 that a *union mount* was integrated into the Linux kernel: OverlayFS. It arrived in kernel 3.18, after more than 4 years of rewrites and structural improvements to reach the demanding and uncompromising level required for inclusion in the official kernel.

{{% card color="info" title="What issues complicate the implementation of a union file system?" %}}

One of the trickiest problems is finding a way to represent file and folder deletions: the marker has to be a valid file (with or without metadata), as the information needs to be stored in a concrete way.

In many implementations, a file whose name starts with `.wh.` serves as the *whiteout file*, which can conflict with the user's own file names (or restrict the user's choice of file names).
A similar problem applies to folders: should every file contained in the folder be deleted individually, or is the mere presence of an *opaque directory* enough to prevent discovery?

Memory usage can quickly get out of hand, especially if the implementation allows many branches: for the system to perform well, the topology of each file system has to be kept in memory.

Implementing `mmap(2)` is a nightmare: when two processes `mmap(2)` the same file and one of them modifies it, we normally expect both processes to see the modification. But the first process to write creates a new file in the writable branch, which makes it difficult to reconcile the pointers of the two processes.

*Hard link* management raises a similar problem: every pointer to the updated content should be modified in the write layer, but there is no pointer index, so the files to be updated are not easy to find.

And let's not forget that the underlying file systems of each branch don't necessarily have the same constraints (file name length, extended attributes, metadata, accent encoding, etc.), so the implementation has to juggle between them while returning consistent errors where appropriate.

And there are many more besides, not least `readdir(2)`, which needs to remain stable despite the turbulence that can occur between two calls.

See this series of articles summarizing the different implementations, their choices and differences: , .

{{% /card %}}

In what follows, we'll concentrate mainly on how OverlayFS operates, drawing parallels with the other implementations as far as possible.

## Whiteout Files in Practice

First of all, you need to know how to set up such a file system.
Here's a general example of how to create a simple union between a read-only and a read/write file system:

```
mount -t overlay -olowerdir=/lower,upperdir=/upper,workdir=/work ignored /merged
```

The type to use is `overlay`. The `lowerdir` option gives the location of the folder(s) to be combined in read-only mode (separated by `:` when there are several), and the `upperdir` option gives the directory holding the read/write file system. Don't forget the `workdir` option: it must point to an empty directory on the same file system as `upperdir`.

We end the call with the source device, which is unused in our case (`ignored` or any other string will do), and finally the folder on which our union will be mounted: `/merged` in the example.

### Usage in Containerization

Let's analyze a running Docker container to learn more.

First, we check that we're using the `overlay2` *storage driver*:

```
42sh$ docker info | grep "Storage Driver"
Storage Driver: overlay2
```

This is the case here (depending on your kernel configuration, Docker may have chosen a different *driver*), so let's start the analysis:

```
42sh$ docker container run --rm -it debian
incntr$ mount | grep "on / "
overlay on / type overlay (rw,relatime,lowerdir=/var/lib/docker/overlay2/l/B62UNV3UB3X4TBWQMM6XCMM6W5:/var/lib/docker/overlay2/l/V6HGFN3C3PEW6CZ6XWRSHHDKJH,upperdir=/var/lib/docker/overlay2/2a353708e5b16ea7775cf1a33dd23ce31430faaa504bcde5508691b230f9d700/diff,workdir=/var/lib/docker/overlay2/2a353708e5b16ea7775cf1a33dd23ce31430faaa504bcde5508691b230f9d700/work)
```

Note that 2 `lowerdir` entries are used. These are symbolic links pointing to the folders that identify the layers. The link names are random: their purpose is to give a shortened path to each layer's file system, because the number of characters that can be passed to the `mount(2)` system call is limited.
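To read the branch order out of such an option string more comfortably, here is a minimal sketch. The `list_lower_branches` helper and the shortened paths in `opts` are hypothetical, and the parsing assumes that no path contains a `,` or a `:`.

```shell
#!/bin/sh
# Hypothetical helper: print the lowerdir branches found in a comma-separated
# overlay option string, one per line, in the order they are written
# (leftmost entry first).
list_lower_branches() {
    printf '%s\n' "$1" | tr ',' '\n' | sed -n 's/^lowerdir=//p' | tr ':' '\n'
}

# Shortened, made-up paths standing in for the real mount options.
opts='rw,relatime,lowerdir=/l/B62UNV:/l/V6HGFN,upperdir=/u/diff,workdir=/u/work'
list_lower_branches "$opts"
```

With the made-up string above, the helper prints `/l/B62UNV` and then `/l/V6HGFN`, each on its own line.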
The lowest branch (furthest to the right in the `lowerdir` parameter) contains the single layer of our `debian` image, while the branch furthest to the left overlays a few configuration files required to run the container (`/etc/hosts`, `resolv.conf`, ...).

The read/write branch is also stored in the `/var/lib/docker/overlay2` folder, and its identifier is visible in the path. The `upperdir` is the `diff` folder, while the `workdir` is the `work` folder, under the same layer ID.

We can also see the folders used by inspecting our container:

```
42sh$ docker container inspect youthful_wilbur | jq '.[0].GraphDriver.Data'
```

```json
{
  "LowerDir": "/var/lib/docker/overlay2/22753d0d81...8706f1a31-init/diff:/var/lib/docker/overlay2/2cc3656c06...c0fb91d6/diff",
  "MergedDir": "/var/lib/docker/overlay2/22753d0d81...8706f1a31/merged",
  "UpperDir": "/var/lib/docker/overlay2/22753d0d81...8706f1a31/diff",
  "WorkDir": "/var/lib/docker/overlay2/22753d0d81...8706f1a31/work"
}
```

If you test with an image that has more layers, you'll get more `lowerdir` entries, one per layer. Feel free to run the same series of commands with the `python` image, for example.

### Adding files

At this point, if we look at the contents of our `upperdir` folder, we can see that it's empty. This is normal, since we haven't made any changes.

In our previously launched container, let's make a modification by adding a file:

```
incntr$ echo "newfile" > /root/foobar
```

```
42sh$ tree /var/lib/docker/overlay2/2a353708e5...91b230f9d700/diff
/var/lib/docker/overlay2/2a353708e5...91b230f9d700/diff
└── root
    └── foobar
```

Our new file has been added, as you'd expect, to the read/write branch. It is the only entry there, even though it is far from the only file in the tree seen from inside the container.
### Modifying files

If we make a change to a file, for example by appending a line, the write branch does not store just the difference: it stores the whole file, as modified:

```
incntr$ echo "Welcome to the container" >> /etc/issue
```

```
42sh$ tree /var/lib/docker/overlay2/2a353708e5...91b230f9d700/diff
/var/lib/docker/overlay2/2a353708e5...91b230f9d700/diff
└── etc
    └── issue
```

```
42sh$ cat /var/lib/docker/overlay2/2a353708e5...91b230f9d700/diff/etc/issue
Debian GNU/Linux 11 \n \l

Welcome to the container
```

### Deleting files

When you delete a file that exists only in the write branch (one you've just added, for example), nothing special is needed: removing it from the write branch makes it disappear from the mounted tree.

Deleting a file that comes from a read-only branch is another matter: the file has to be hidden using a marker. This marker differs depending on the *storage driver*: in `OverlayFS`, a deletion is materialized by a character special file (a character device) of the same name.

```
incntr$ rm /etc/adduser.conf
```

```
42sh$ tree /var/lib/docker/overlay2/2a353708e5...91b230f9d700/diff
/var/lib/docker/overlay2/2a353708e5...91b230f9d700/diff
└── etc
    └── adduser.conf
```

```
42sh$ cat /var/lib/docker/overlay2/2a353708e5...91b230f9d700/diff/etc/adduser.conf
cat: No such device or address
42sh$ stat /var/lib/docker/overlay2/2a353708e5...91b230f9d700/diff/etc/adduser.conf
  File: /var/lib/docker/overlay2/2a353708e5...91b230f9d700/diff/etc/adduser.conf
  Size: 0    Blocks: 0    IO Block: 4096    character special file
Device: fe0bh/65035d    Inode: 515773    Links: 2    Device type: 0,0
```

Note the `Device type: 0,0` here: the whiteout is a character device whose major and minor numbers are both 0. To create a similar file ourselves, we would need to use:

```
42sh$ mkdir /var/lib/docker/overlay2/2a353708e5...91b230f9d700/diff/bin
42sh$ mknod /var/lib/docker/overlay2/2a353708e5...91b230f9d700/diff/bin/sh c 0 0
```

{{% card color="danger" title="Caution, undefined behavior!"
%}}
Running this `mknod` command while the union is mounted elsewhere is not guaranteed to make the `/bin/sh` file disappear: any modification made to the branches from outside the mounted file system leads to explicitly undefined behavior.
{{% /card %}}

### Deletion on `unionfs` and AuFS

The concept of the *whiteout file*, as we have seen, differs depending on the file system. It turns out that, although it was OverlayFS that made it into the Linux kernel after many ups and downs, Docker uses the AuFS format to represent deletions in the archives used to distribute layers. It is therefore important to know it too.

Instead of using a special file, AuFS creates a regular file named `.wh.<filename>`, where `<filename>` is the name of the file to be hidden.

To adapt to the *storage driver* in use, Docker converts[^MOBYWHITEOUT] the *whiteout files* it encounters into the expected format when the archive is unpacked.

[^MOBYWHITEOUT]: See the source code

## Conclusion

You may not have thought you wanted to know what *whiteout files* were all about, but I hope this article has given you a glimpse into the complexity of both *union mounts* and the software that takes advantage of their different implementations.

In particular, you now know why it's pointless to delete a large file in a layer other than the one that contributed it, for example:

```dockerfile
RUN wget https://dumps.wikimedia.org/enwiki/enwiki-pages-articles-multistream.xml.bz2
RUN ... # some other stuff
RUN rm enwiki-pages-articles-multistream.xml.bz2
```

Each `RUN` creates a separate layer, so our `enwiki-pages-articles-multistream.xml.bz2` file will still be distributed with the first layer of our image; a *whiteout file* is merely added in the layer corresponding to the third `RUN`.
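To make the cost visible, here is a toy model that stands in for the first and third layers with two plain tar archives. It is only a sketch: real image layers carry more than this, and the file names, the 1 MiB payload and the `layerN.tar` naming are all assumptions made for illustration.

```shell
#!/bin/sh
# Toy model of image layers as tar archives. Assumptions: the file names, the
# 1 MiB payload and the "layerN.tar" naming are made up for illustration.
work=$(mktemp -d)
cd "$work" || exit 1

# Layer 1: stands in for the RUN that downloads the large file.
mkdir layer1
dd if=/dev/zero of=layer1/dump.bin bs=1024 count=1024 2>/dev/null
tar -cf layer1.tar -C layer1 .

# Layer 3: stands in for the RUN that deletes it; the deletion is recorded
# as an empty ".wh.<filename>" marker, in the AuFS style.
mkdir layer3
: > layer3/.wh.dump.bin
tar -cf layer3.tar -C layer3 .

tar -tf layer3.tar            # the listing includes ./.wh.dump.bin
wc -c layer1.tar layer3.tar   # layer1.tar still carries the full payload
```

The third archive only adds a marker and nothing is ever subtracted from the first one. The usual way around this is to download, use and delete the file within a single `RUN`, so that it never ends up in any layer.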