Ingo Karkat - The curse of different regexp dialects

The curse of different regexp dialects

Posted Monday, 18-Apr-2022 by Ingo Karkat

Regular expressions are one of the greatest inventions in computer science. From quasi-literal text searches (find me all occurrences of foo or bar followed by any number) to complex validation patterns (e.g. for email addresses), they let you specify text matches succinctly through a combination of literal characters and special atoms and bounds (POSIX name; they're called multi items in Vim and quantifiers in Perl, so you'll likely already see where this is going…), and some escaping via backslashes to differentiate between the two. Anything that takes literal text can be made much more powerful by accepting a regular expression pattern instead, while most forms of literal text can still be passed with minimal escaping. Or, in the words of Jeff Atwood, they generally look like what they are matching. The downside of the terse syntax is poor readability; paired with a lack of commenting abilities, this makes them very difficult to understand. Some even deride regexps as write-only, and the fact that the Perl programming language (powerful yet often criticized for its weird syntax) is particularly fond of them may or may not be pure coincidence.

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
Jamie Zawinski

This blog article and this question on the software engineering StackExchange explain that the famous quote is more a criticism of regexp misuse (when all you have is a hammer…) than a general condemnation. Jeff Atwood has some great tips for sensible uses. I fully agree; every programmer should know them inside and out, and non-programmers would also benefit a lot from at least some basic knowledge, as even word processors have them (though Microsoft Word calls them wildcards and uses an unconventional syntax).

differences

Unfortunately, regular expression syntax is not something you can learn once and be done with. Starting from the initial idea (from the 1950s!) of matching arbitrary characters or sets of character classes a varying number of times, not only have there been differences in how these were encoded (with a backslash or without), but many useful extensions were introduced as well (\{N,M} can model all sorts of multiplicities, but it's immensely useful to have additional shortcuts like \? for 0 or 1, and \+ for 1 or more). Text processing greatly benefited from the addition of assertions for start-of-line (^) and end-of-line ($), sometimes also with different ones for start-of-file and end-of-file.
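The equivalence between the shortcuts and the explicit bounds can be checked quickly. This is a minimal sketch in Python, whose Perl-derived dialect writes these multis without backslashes (?, + and {N,M} instead of Vim's \?, \+ and \{N,M}); the sample words and the helper name matches are made up for illustration.

```python
import re

# Each pair holds a shortcut and the explicit bound it abbreviates.
pairs = [
    (r"colou?r", r"colou{0,1}r"),  # ? is shorthand for {0,1}
    (r"ab+c", r"ab{1,}c"),         # + is shorthand for {1,}
    (r"ab*c", r"ab{0,}c"),         # * is shorthand for {0,}
]
samples = ["color", "colour", "colouur", "ac", "abc", "abbbc"]


def matches(pattern, words):
    """Return the words that the pattern matches completely."""
    return [w for w in words if re.fullmatch(pattern, w)]


for shortcut, explicit in pairs:
    print(shortcut, matches(shortcut, samples), explicit, matches(explicit, samples))
```

Both spellings of each pair select the same sample words; only the notation differs.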

So POSIX alone came up with three sets: simple (deprecated), basic, and extended. The latter two are still in use, with many command-line tools supporting both (naturally, with inconsistent command-line flags, for example -G|--basic-regexp vs. -E|--extended-regexp in grep, but sed having no flag for the basic default and using -E|-r|--regexp-extended for the switch). AWK just supports the extended type, but (at least in its GNU implementation) extends the POSIX definition with additional boundary assertions, like \< and \> for beginning of word and end of word, which are identical to Vim's, but elsewhere often combined into a single \b word boundary assertion. AWK also has the opposite \B assertion for text within a word (which is missing in Vim).

Then there are character classes, which can be specified with single letters like \d or as [0-9] (inside what Vim calls a collection, AWK a character list, Perl a bracketed character class, and POSIX a bracket expression), or as a POSIX [:digit:], which has to be put inside [...] (and can be combined and negated, like [^[:digit:][:upper:],.]). But in Vim, \d cannot be used in there (a common beginner's mistake is to write [\d\u,.] and expect that to match numbers, uppercase letters, comma, and the period), whereas Perl-style dialects do accept such shorthands inside brackets. The POSIX class names have non-standard extensions, too. The Wikipedia table on character classes shows the exceptions and inconsistencies quite well.
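Whether a shorthand like \d works inside brackets is itself dialect-specific. As a small illustration, Python's re module (a Perl-derived dialect) does accept \d inside a bracketed class, but it has neither Vim's \u for uppercase letters nor the POSIX [:digit:] names, so the uppercase range is spelled out here. The sample text is made up.

```python
import re

# In Python's Perl-derived dialect, \d is allowed inside a bracketed class.
pattern_shorthand = re.compile(r"[\dA-Z,.]+")
# The spelled-out equivalent for plain ASCII input.
pattern_explicit = re.compile(r"[0-9A-Z,.]+")

text = "abc 1,234.5 XYZ def"
found_shorthand = pattern_shorthand.findall(text)
found_explicit = pattern_explicit.findall(text)

print(found_shorthand)
print(found_explicit)
```

For ASCII-only input, both patterns pick out the same runs of digits, uppercase letters, commas, and periods.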

As long as you stay within one programming language, this is still manageable: you deal with the language itself and maybe a different dialect in your editor / IDE. But the typical programmer also uses a variety of scripting languages, and many tools also expose the regexp dialect of their underlying programming language through their user interfaces. In practice, you'll be making those stupid syntax mistakes and consulting man pages or cheat sheets all the time. And you'll suffer from the poor readability and virtually non-existent debugging support every single time. The worst problems are those where the pattern matches most of the time, and only fails for some obscure corner case, maybe because of some forgotten or doubled-up backslash. Less tenacious developers just throw in the towel and never grow their knowledge and use of regexps beyond the very basics, which is sad because these can be really helpful and time-saving, both during (mass-)editing of text, and inside programs to efficiently parse text. It always saddens me to realize a colleague has spent the whole morning doing a tedious manual change through dozens of files. Had he asked me, we would have jointly developed, tested (on a subset), and then rolled out (Vim's :argdo is great for this) a regexp substitution, and this usually doesn't take longer than 5 to 15 minutes!

more differences

On top of that hairy mess, life is made miserable by so many more differences that I don't have the space here to cover them in depth:

greedy vs. non-greedy matching: the first is the standard, matching as many characters as possible, but it can be useful to match as few as possible; a common beginner's mistake is extracting quoted strings via ".*" — this matches the entire "foo" is "bar", not just each "foo" and "bar". In this simple case, the problem can be avoided by excluding the double quote: "[^"]*", but that's not always possible (and it's less readable)
non-capturing groups and named captures: for text processing, referencing parts of the matches in a replacement is essential, but captures can also be useful inside the pattern itself (for matching stuff surrounded by the same but variable text); on the other hand, grouping is often necessary just to overcome the default precedence (e.g. \(foo\)\+ to match multiple occurrences of foo). With non-capturing groups (\%(...\) in Vim, (?:...) in Perl), the capturing groups can be reserved for those parts that need to be referenced. Without them, it's counting braces in \(\(...\|\(..\|..\)\)\+\) (not fun). Named captures even let you assign arbitrary identifiers to groups instead of referencing by number; this helps a lot when you programmatically assemble regexp fragments (and therefore don't know how many previous capturing groups there are), or in parsing situations where elements may appear in different orders. Perl (the perennial forerunner in special features with ever more obscure syntax) even has a \g-1 relative referencing to previous groups as yet another alternative.
newline handling: basically, whether . also matches a newline character; for grep, this is never an issue; Vim has a dedicated set of atoms prefixed with \_ (e.g. \_. is the multi-line variant of .), with added complexity around the inversion of collections ignoring a contained \n; most other dialects (sed, AWK) do match newlines
platform-dependent newlines: does \n just match the linefeed character or also the CR-LF combination used on Windows? This also depends on the platform it's run on; many Unix tools need \r\?\n to be able to process Windows-style text files
multi-line mode: applications may work on one line at a time, or entire files of text; depending on that, the ^ and $ anchors usually have different semantics, and alternative assertions may be offered, or a flag to change the matching mode
Unicode: in the beginning, all text was 7-bit ASCII; then, Internationalization happened and different scripts had to be incorporated; as long as the locale is correct and specifies the right character encoding (mostly UTF-8 today), a . should always match a single character. It's getting interesting for things like \w and [[:alnum:]], though. Perl excels here with dedicated atoms; in most other dialects, character classes typically include Unicode variants (like Japanese numerals for numbers, or a full-width space for whitespace); therefore, the assumption that an explicit [0-9] is always identical to \d or [[:digit:]] does not hold
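Several of these differences can be observed within a single dialect. The following is a minimal sketch using Python's re module; the flags shown (DOTALL, MULTILINE, ASCII) are Python's spelling of these switches, other dialects name and default them differently, and all sample strings are made up.

```python
import re

# Greedy vs. non-greedy: .* grabs as much as possible, .*? as little.
text = '"foo" is "bar"'
greedy = re.findall(r'".*"', text)       # one match spanning the whole text
lazy = re.findall(r'".*?"', text)        # each quoted string separately
excluded = re.findall(r'"[^"]*"', text)  # same result without a lazy multi

# Non-capturing group for precedence; only the digits are captured.
m = re.fullmatch(r"(?:foo)+(\d+)", "foofoo42")
captured = m.groups()

# Newline handling: . does not match \n unless DOTALL is set.
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"a.b", "a\nb", re.DOTALL)

# Platform-dependent newlines: \r?\n accepts both LF and CR-LF line ends.
lines = re.split(r"\r?\n", "one\r\ntwo\nthree")

# Multi-line mode: ^ only matches at the start of the text unless
# MULTILINE is set.
starts_default = re.findall(r"^\w+", "one\ntwo")
starts_multiline = re.findall(r"^\w+", "one\ntwo", re.MULTILINE)

# Unicode: for str patterns, \d also matches non-ASCII decimal digits.
fullwidth_three = "\uff13"  # FULLWIDTH DIGIT THREE
unicode_digit = re.fullmatch(r"\d", fullwidth_three)
bracket_digit = re.fullmatch(r"[0-9]", fullwidth_three)
ascii_digit = re.fullmatch(r"\d", fullwidth_three, re.ASCII)

print(greedy, lazy, excluded, captured, lines, starts_default, starts_multiline)
```

The last three results show why [0-9] and \d cannot be treated as interchangeable: only the Unicode-aware \d accepts the full-width digit.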

PCRE

Through its ubiquity and pattern-matching power, Perl has long been the gold standard of regular expression dialects. Wikipedia summarizes it well:

Because of its expressive power and (relative) ease of reading, many other utilities and programming languages have adopted syntax similar to Perl's — for example, Java, JavaScript, Julia, Python, Ruby, Qt, Microsoft's .NET Framework, and XML Schema. [...] Perl-derivative regex implementations are not identical and usually implement a subset of features found in Perl 5.0, released in 1994. Perl sometimes does incorporate features initially found in other languages. For example, Perl 5.10 implements syntactic extensions originally developed in PCRE and Python.

The Perl Compatible Regular Expressions (PCRE) have been developed to make (most of) Perl's regexp capabilities available to other tools. As a C library, it can be readily incorporated into almost any tool, and the BSD-licensing also allows use by commercial software; GNU grep supports PCRE syntax via its -P|--perl-regexp option.

The situation isn't perfect; there are still differences between Perl and PCRE:

While PCRE originally aimed at feature-equivalence with Perl, the two implementations are not fully equivalent. During the PCRE 7.x and Perl 5.9.x phase, the two projects have coordinated development, with features being ported between them in both directions.

As well as inconsistencies; for example:

This feature was subsequently adopted by Perl, so now named groups can also be defined using (?<name>...) or (?'name'...), as well as (?P<name>...). Named groups can be backreferenced with, for example: (?P=name) (Python syntax) or \k'name' (Perl syntax).
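The Python spelling mentioned in the quote can be tried directly in Python's re module. This is a small sketch; the group names quote and body and the sample strings are chosen purely for illustration.

```python
import re

# (?P<quote>...) names the group; (?P=quote) refers back to the text it matched,
# so the closing quote character must be the same as the opening one.
pattern = re.compile(r"(?P<quote>['\"])(?P<body>.*?)(?P=quote)")

single = pattern.search("say 'hello' now")
mixed = pattern.search('x = "it\'s"')

print(single.group("quote"), single.group("body"))
print(mixed.group("quote"), mixed.group("body"))
```

In the second sample, the apostrophe inside the double-quoted text does not end the match, because the backreference insists on the same character that opened it.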

readability

Still, widespread PCRE support would achieve way more consistency, expressiveness and power. It offers in-line comments via (?#...) syntax. Spreading a dense regexp across lines can help a lot with readability; for example, grep -o -P -e '/\*.*?\*/' /dev/null can be documented in-line as:

grep -o -P -e '/\*(?# Match the opening delimiter).*?(?# Match a minimal number of characters.)\*/(?# Match the closing delimiter.)' /dev/null

It should also be possible to embed newlines in the inline comments to distribute the regexp across multiple lines, but this doesn't work (in GNU grep 3.4):

$ grep -P '/\*(?# Match the opening delimiter
).*?(?# Match a minimal number of characters.
)\*/(?# Match the closing delimiter.
)' /dev/null
grep: the -P option only supports a single pattern

This looks like bad interference with grep's feature to take multiple newline-delimited patterns in a single -e argument; a typical bug when retrofitting new features into an existing design. I guess it could be fixed, but I haven't seen much use of --perl-regexp, inline comments are still a rather obscure feature in PCRE, and maybe the bug hasn't even been reported so far.

Perl's /x and /xx modifiers allow completely free-form whitespacing and commenting, and by switching to the special pcre2grep (part of the pcre2-utils package), we can do this:

pcre2grep '(?x)
    /\*     # Match the opening delimiter.
    .*?     # Match a minimal number of characters.
    \*/     # Match the closing delimiter.
' /dev/null

Now that's a lot better in terms of readability and documentation, right?
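The same free-form style is available in other Perl-derived dialects. As a sketch, Python's re module offers it through the re.VERBOSE flag (or an embedded (?x)); the sample source string below is made up, and findall plays the role of grep's -o here.

```python
import re

# With re.VERBOSE, whitespace in the pattern is ignored and # starts a comment.
c_comment = re.compile(
    r"""
    /\*     # Match the opening delimiter.
    .*?     # Match a minimal number of characters.
    \*/     # Match the closing delimiter.
    """,
    re.VERBOSE,
)

source = "int a; /* first */ int b; /* second */"
comments = c_comment.findall(source)
print(comments)
```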

switching dialects

As seen above, many tools (unfortunately inconsistently) offer command-line options to switch the regexp dialect; -E|--extended-regexp being the most common in GNU tools. Then there are pattern flags or modifiers as in Perl; originally appended to a /<pattern>/, but alternatively embedded in the regexp itself ((?x)). Vim has a similar concept (funnily called magicness) that influences the amount of literal interpretation vs. necessary escaping (and can be influenced by a global option (now deprecated, as that was a really bad idea) and embedded (but again very different) atoms; \v \m \M \V). These can simplify interactive searches a lot; for example, a literal search can be written \V[****] instead of \[\*\*\*\*\], but when put into programs, these are confusing and need to be normalized when concatenating regexp fragments.
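Dialects without a very-nomagic switch usually offer a helper that escapes a literal before it is concatenated with other regexp fragments. The sketch below uses Python's re.escape as one such helper; it is an analogy to Vim's \V, not an equivalent, and the surrounding \s* fragments are just an illustrative choice.

```python
import re

literal = "[****]"

# Unescaped, the text is silently read as a bracketed class holding only *.
as_class = re.fullmatch(literal, "*")
as_literal = re.fullmatch(literal, literal)

# re.escape() adds the backslashes, so the fragment can be concatenated safely.
escaped = re.escape(literal)
pattern = re.compile(r"^\s*" + escaped + r"\s*$")

print(escaped)
print(bool(pattern.match("  [****]  ")), bool(pattern.match("[***]")))
```

Note that the unescaped literal does not even raise an error; it just matches something else, which is exactly the kind of corner case that is hard to spot.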

story

My motivation for this ~~rant~~plea has been a rather simple enhancement to one of my AWK-based tools. I needed to convert any ^ and $ anchors to newlines, but keep escaped (i.e. literal) ^ and $ characters as-is. In Vim syntax, the search looks like this: \%(\%(^\|[^\\]\)\%(\\\\\)*\\\)\@<!\\[^$], but it requires a negative lookbehind (\@<! in Vim, (?<!...) in Perl), which AWK's extended regexp does not have. To emulate that, I had to implement a separate AWK function with 5 lines of code that cleverly did one substitution and then partially undid that. And all of that for an infrequent use of two literal characters in the input (so if untested, bugs could linger a long time until they would appear, forcing me to write an automated test and therefore making this even more costly). Had AWK supported PCRE, I would have been done in a minute! (In the end, I figured out a way to avoid the escaping by inverting a previous parsing step. Overall, this dead end cost me more than an hour.)
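For comparison, here is how the same task might look in a Perl-derived dialect. This is a hedged sketch in Python, not the AWK function described above, and it deliberately uses a different technique than a lookbehind: Python's lookbehind must be fixed-width, so instead the pattern consumes each backslash together with the character it escapes, and only anchors that remain unconsumed are replaced. The function name anchors_to_newlines is hypothetical.

```python
import re


def anchors_to_newlines(pattern):
    """Replace unescaped ^ and $ with newlines; keep escaped ones as-is."""

    def replace(match):
        token = match.group(0)
        # A token starting with a backslash is an escape pair: keep it verbatim.
        return token if token.startswith("\\") else "\n"

    # Scanning left to right, \\. swallows a backslash plus the next character,
    # so an escaped anchor is never seen as a bare ^ or $.
    return re.sub(r"\\.|[\^$]", replace, pattern, flags=re.DOTALL)


print(repr(anchors_to_newlines("^foo$")))
print(repr(anchors_to_newlines(r"a\$b")))
print(repr(anchors_to_newlines(r"a\\$")))
```

The third call covers the corner case of an escaped backslash directly before an anchor: the backslash pair is kept and the following $ is still converted.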

With my background in Vim and regular exposure to complex regexps, it has happened numerous times that I struggled with a text substitution that would have been a piece of cake in Vim (or Perl). Though there's a natural progression from sed to AWK to a full-blown scripting language like Perl or Python, there's always overhead involved (at the very least, command-line argument parsing has to be rewritten, and there's pressure to rewrite the tests in the target language's favored testing framework, too). I'm unwilling to do that just because of limited regexp capabilities.

summary

With a history of close to 25 years, PCRE is readily available and easily consumable (heck, even Perl itself can use PCRE through the re::engine::PCRE module). I'd love to have every tool offer this dialect as an option, with the hope that over time it will become the default, and all the inferior dialects can be laid to rest.

Also, after 40+ years, there should have been more than enough time to agree on what the individual regexp elements are called (to ease learning). Right now, too many tutorials teach regular expressions without mentioning which dialect they cover (or the unsuspecting users don't realize that this makes a huge difference). For historical reasons, the legacy inconsistencies will have to stay in the code for a long time (likely forever), but nothing prevents implementers from adding the PCRE dialect on top, so that there's a way out of this mess.

Ingo Karkat, 18-Apr-2022

ingo's blog is licensed under Attribution-ShareAlike 4.0 International