Two approaches to file exclusions

Posted Thursday, 26-Sep-2024 by Ingo Karkat

context

For a long time, there used to be either commercial backup software offering a simple, slick GUI for the layman home user, or arcane script solutions for IT-savvy people. Neither of these was really attractive to me. I read with interest how a home-grown solution using just rsync and hard links could create incremental backups without wasting a whole lot of disk space, but it wasn't until this idea had been implemented by Borg and Borgmatic (the latter being a wrapper that offers configuration-driven backups) that I put it to use. I've been using Borg for several years now to create backups; first for my wife's data (as she's the worst kind of layperson, not even fully aware of the need for backups until it's too late), then also for the encrypted data container hosting my (portable) personal data.

For my wife's data, it was important that all of her important files would be covered (so it basically had to be her user's whole home directory, wherever she had write access); on the other hand, I didn't want to include too many temporary or cached files, as that would just prolong the backup and thereby reduce the chance of it concluding successfully. So I basically ran the first backups manually, noted files and directories that weren't important, and then wrote exclude patterns for those.

When she got a new laptop, I had the idea to reuse the same excludes in order to transfer her user data onto the new machine. Thus, userdata-migrater was born.

problem

I had assumed that the file exclusion worked similarly to that of other tools, like tar or the rsync invocation I used for the migrater. Boy, was I wrong! When it's just a few files, not complete directory hierarchies, it's easy to overlook files that should have been excluded, but weren't. I had skimmed the $ borg help patterns output and noticed that it offers five different styles:

File patterns support these styles: fnmatch, shell, regular expressions, path prefixes and path full-matches. By default, fnmatch is used for --exclude patterns and shell-style is used for the experimental --pattern option.

More choices should be good, eh? Well, no. At one point I thought that all I needed to do was to switch to shell-style patterns and prepend **/ everywhere (totally forgetting that this would break the reuse of the excludes with rsync in my migrater). But even that led to surprises when to-be-excluded files were picked up by the backup. (See the history of my excludes file evolution for the gruesome details.)

a revelation

I had "verified" the backups by listing their contents, but an important lesson only sank in when I actually had to restore some files (lost due to frantic fat-fingering shortly before shutting down the computer and calling it a day). Only then did I realize that the whole path to the backup root had been included. This is even documented in the (way too long) help on patterns:

File paths in Borg archives are always stored normalized and relative. This means that [...] borg create /path/to/repo /home/user will store all files as home/user/.../file.ext.

My inner perfectionist didn't like that. With my encrypted data container, this could (theoretically) lead to different paths depending on the system the backup was made on, if the home directory's location differed between systems. (I have such systems, but luckily I don't mount my data there yet.) This is rather easy to work around: change into the backup root directory and then use . as the backup root.

What's more, it challenged my assumptions about excludes matching, and suddenly explained why certain excludes that were supposedly anchored to the backup root (like /.dbus or /Dropbox) were still included in the backup!
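The effect is easy to reproduce with Python's fnmatch module (which, as discussed in the next section, appears to underlie Borg's default style; real Borg does some pattern normalization of its own, so this is only an approximation of its behavior, with made-up example paths):

```python
import fnmatch

# Borg stores paths relative and normalized: backing up /home/user
# stores files as e.g. "home/user/.dbus/session".
stored = "home/user/.dbus/session"

# An exclude that looks anchored to the backup root never matches,
# because the stored path starts with "home/user/", not "/".
print(fnmatch.fnmatchcase(stored, "/.dbus/*"))          # False

# Only a pattern covering the full stored path matches.
print(fnmatch.fnmatchcase(stored, "home/user/.dbus/*"))  # True
```

So the "anchored" /.dbus exclude silently matched nothing at all, and the files ended up in the backup.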

syntax deep dive

Let's go back to $ borg help patterns (emphases mine):

Fnmatch, selector fm:
This is the default style for --exclude and --exclude-from. These patterns use a variant of shell pattern syntax, with '*' matching any number of characters, '?' matching any single character, '[...]' matching any single character specified, including ranges, and '[!...]' matching any character not specified. For the purpose of these patterns, the path separator (backslash for Windows and '/' on other systems) is not treated specially.
Shell-style patterns, selector sh:
This is the default style for --pattern and --patterns-from. Like fnmatch patterns these are similar to shell patterns. The difference is that the pattern may include **/ for matching zero or more directory levels, * for matching zero or more arbitrary characters with the exception of any path separator.
Regular expressions, selector re:
Regular expressions similar to those found in Perl are supported. Unlike shell patterns regular expressions are not required to match the full path and any substring match is sufficient.
Path prefix, selector pp:
This pattern style is useful to match whole sub-directories. The pattern pp:root/somedir matches root/somedir and everything therein.
Path full-match, selector pf:
This pattern style is (only) useful to match full paths.
[...] For a path to match a pattern, the full path must match, or it must match from the start of the full path to just before a path separator.

I think that the root cause of the pattern matching problems starts with the use of Python's fnmatch module. Its documentation highlights its use for Unix filename pattern matching, and the fifth sentence notes that separators are not treated specially, and that the glob module should be used instead.
I'm ignorant of the history of borg, but the fact that shell-style patterns were introduced later and are the default for --pattern and --patterns-from hints at the curse of backwards compatibility: the (bad) default for --exclude and --exclude-from could not easily be changed any longer.
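That footgun takes two lines to demonstrate; here fnmatch's * happily crosses path separators, whereas a shell-style * (approximated below by hand with a regex where * becomes [^/]*) stops at the first slash. The paths are made-up examples:

```python
import fnmatch
import re

# fnmatch: '*' matches any characters, including '/'.
print(fnmatch.fnmatchcase("src/build/tmp/junk.log", "src/*.log"))  # True

# Shell-style globbing: '*' stops at path separators.
shell_style = re.compile(r"src/[^/]*\.log\Z")
print(bool(shell_style.match("src/build/tmp/junk.log")))  # False
print(bool(shell_style.match("src/junk.log")))            # True
```

A pattern like src/*.log therefore excludes far more than a shell user would expect.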

It doesn't help that the devs saw the need for three additional styles, adding even more choices. Though the pf: documentation alludes to corner cases, neither it nor the pp: style comes with examples or use cases showing where they might be useful and where the same exclusion would be difficult to model with the other styles. The whole document is technical rather than user-centric. (It's good that implementation details are mentioned, especially as they relate to possible security impacts, but I'm not happy about the overall structure.)

I'll say more about regular expressions later; I'm horrified that this style also throws the otherwise consistent anchoring overboard. Could it be that unanchored matching simply is the default of the library function?!
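The guess is at least plausible: Python's re module (a likely suspect, Borg being written in Python; I haven't dug through Borg's source to confirm which function it calls) defaults to unanchored substring search, and anchoring has to be requested explicitly:

```python
import re

path = "home/user/projects/cache/data.bin"

# re.search finds the pattern anywhere in the string...
print(bool(re.search(r"cache", path)))     # True

# ...only explicit anchoring restricts it.
print(bool(re.fullmatch(r"cache", path)))  # False
print(bool(re.search(r"^cache", path)))    # False
```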

comparison with rsync

How does borg compare with rsync, whose conventions I subconsciously assumed to be applicable here as well?

if the pattern starts with a / then it is anchored to a particular spot in the hierarchy of files, otherwise it is matched against the end of the pathname. This is similar to a leading ^ in regular expressions. Thus "/foo" would match a name of "foo" at either the "root of the transfer" (for a global rule) or in the merge-file’s directory (for a per-directory rule). An unqualified "foo" would match a name of "foo" anywhere in the tree because the algorithm is applied recursively from the top down; it behaves as if each path component gets a turn at being the end of the filename. Even the unanchored "sub/foo" would match at any point in the hierarchy where a "foo" was found within a directory named "sub".

We see here that rsync has a concept of root of the transfer; unlike borg it doesn't concatenate the source dirspec to the file path for the matching. To me, that makes a lot more sense.
rsync differentiates between anchored (via leading /) and unanchored patterns. They also need to dive into technical details (the recursion down the path), but that makes it clear that unanchored patterns can match at any position, and the example given highlights that this isn't limited to filename matches, but that even subdir/filename combinations can be handled.
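The "each path component gets a turn" algorithm can be sketched in a few lines (a simplification: real rsync also distinguishes * from ** and supports per-directory merge rules, and the function name here is my invention):

```python
import fnmatch

def rsync_style_match(pattern: str, path: str) -> bool:
    """Approximate rsync's anchoring rules for a transfer-relative path."""
    if pattern.startswith("/"):
        # Anchored: must match from the root of the transfer.
        return fnmatch.fnmatchcase(path, pattern[1:])
    # Unanchored: try the pattern starting at every path component.
    parts = path.split("/")
    return any(fnmatch.fnmatchcase("/".join(parts[i:]), pattern)
               for i in range(len(parts)))

print(rsync_style_match("foo", "a/b/foo"))        # True  (matches anywhere)
print(rsync_style_match("sub/foo", "x/sub/foo"))  # True  (unanchored combo)
print(rsync_style_match("/foo", "a/foo"))         # False (anchored to root)
print(rsync_style_match("/foo", "foo"))           # True
```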

a ’*’ matches any path component, but it stops at slashes.
use ’**’ to match anything, including slashes.
if the pattern contains a / (not counting a trailing /) or a "**", then it is matched against the full pathname, including any leading directories. If the pattern doesn’t contain a / or a "**", then it is matched only against the final component of the filename. (Remember that the algorithm is applied recursively so "full filename" can actually be any portion of a path from the starting directory on down.)

rsync supports (extended) shell globbing out of the box; the paragraph also reiterates the anchoring part from above. As a user, I very much prefer this.
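The * vs. ** distinction is easy to express as a translation to regular expressions (a hypothetical converter sketch, not rsync's actual implementation, and ignoring character classes):

```python
import re

def glob_to_regex(pattern: str) -> str:
    """Translate a glob where '*' stops at '/' but '**' does not."""
    out, i = [], 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")        # '**' matches anything, incl. slashes
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")     # '*' stops at path separators
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return "".join(out) + r"\Z"

print(bool(re.match(glob_to_regex("src/*.log"), "src/a/b.log")))   # False
print(bool(re.match(glob_to_regex("src/**.log"), "src/a/b.log")))  # True
```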

a trailing "dir_name/***" will match both the directory (as if "dir_name/" had been specified) and everything in the directory (as if "dir_name/**" had been specified). This behavior was added in version 2.6.7.

rsync has also added its own extensions; unlike borg's style prefixes, this doesn't force the user into an immediate decision, as it only covers a special case. However, I'd love to see an example of when this would be useful here as well.

conclusion

As in The curse of different regexp dialects, shell globbing is an old and oft-reimplemented idea, and unfortunately tooling even today isn't consistent in its use. I suspect that borg's developers chose the fnmatch library without putting too much thought into it (pattern matching, how hard can it be?), and then were forced to add more styles to address the deficiencies of the original approach, without being able to fully move away from it. I've surely fallen prey to this, too. One can only hope that such issues surface soon enough, before it's too late to make incompatible changes.

Versioning of configuration can help with that, but as excludes can also be passed as command-line arguments, adding a version is a challenge here. Introducing style prefixes is an obvious way to address this, but I dislike that it pushes the complexity towards the user. (And adds another level of escaping-the-prefix complexity that most users will likely be ignorant of.) Is there really a need to support multiple matching schemes? I don't think so.

Regular expressions are (depending on the dialect, but in general) more powerful than shell globs, but how many occasions are there where this actually is required? And even those still leave some use cases that cannot practically be addressed (and may add security issues on their own like excessive pattern recursion slowing down a server).
I'm very fond of the original Unix architecture of small, dedicated executables for every task. I think I would build in a sane shell globbing syntax, but offer piping the file list through an external tool as the (very) advanced escape hatch. Setting up a grep [--perl-regexp] command would be a bit more effort than selecting regular expressions via a style prefix (and the additional process a bit less efficient), but it would be a truly generic option, even allowing for scripts to apply arbitrarily complex filtering rules. Programming this shouldn't be any more difficult than using a matcher library — even AWK has coprocessing support!
When the initial choice of matcher syntax is made with diligence and under consideration of all common use cases, there may very well never arise the need for such advanced coproc filtering, which would be ideal. The best feature is the one that's never needed (but could be trivially added to an extensible design).
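Such an external-filter escape hatch could look like this (a hypothetical sketch; the function and its interface are my invention, and it assumes a POSIX grep on the PATH):

```python
import subprocess

def filter_paths(paths, grep_args):
    """Pipe newline-separated paths through an external grep command."""
    result = subprocess.run(
        ["grep", *grep_args],
        input="\n".join(paths) + "\n",
        capture_output=True, text=True, check=False,
    )
    return result.stdout.splitlines()

paths = ["home/user/doc.txt", "home/user/.cache/tmp.bin"]
# Exclude anything under a .cache directory via an inverted regexp match.
print(filter_paths(paths, ["-v", "-E", r"(^|/)\.cache/"]))
# → ['home/user/doc.txt']
```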

Ingo Karkat, 26-Sep-2024

ingo's blog is licensed under Attribution-ShareAlike 4.0 International
