Excursions in tool writing

Posted Monday, 04-Apr-2022 by Ingo Karkat

prelude

It all started with the accidental discovery of a style error (via ShellCheck)1, complaining that items+="$item" is missing the parentheses (items+=("$item")) for appending an element to an array. I then learned that += can be used (in Bash and Korn shell) for appending to a string. As that's much less ugly than foo="${foo}text", I wanted to automatically convert all such assignments throughout my scripts.

I needed a regular expression search across all Bash scripts to identify the places. Bash scripts can be identified by the #!/bin/bash shebang in their first line. So one way to implement this is one search to identify the scripts, generating a list of files, and then a second search on those files for all occurrences. (Another approach would be one combined search, but asserting both matches across lines is impossible with line-based tools like grep.)

1 And here, not even one sentence into the article, my first excursion from the excursion began. My Vim mapping that searches for a keyword and offers corresponding links for inclusion did not give the expected result, so I tried troubleshooting the custom Google search I had set up years ago2, then investigated free alternatives, and attempted to register for Bing's search API (until it asked for a credit card despite search being a "free" service).

I finally made several enhancements to my (unpublished) Vim plugin, so that I could do automatically in 2 seconds what would have taken 60 seconds manually (open a new browser tab → Google search → right mouse button on the search result → Copy Link → close browser tab → switch back to Vim → paste → re-type title) — poor ROI, but it felt right.

2 Here, I went down yet another short side path, creating a curlAsBrowser command because the Google search would not accept a manual curl invocation. In order to fake a browser's User-Agent, I installed a JavaScript library called top-user-agents, and then wrote a small shell wrapper around it.

3 What do you do when you're getting in too deep? For me, the solution is a comfortable todo list. I can open a HUD terminal with a single key press and submit a note quickly via $ tt add "do that later"
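The two += flavors mentioned in the prelude can be seen side by side in a minimal Bash sketch (variable names made up for illustration):

```shell
item='!'

greeting='hi'
greeting+="$item"       # string append: equivalent to greeting="${greeting}${item}"
echo "$greeting"        # → hi!

items=('a' 'b')
items+=("$item")        # array append: the parentheses add a new element
echo "${#items[@]}"     # → 3
echo "${items[2]}"      # → !
```

Without the parentheses, items+="$item" would silently append the string to element 0 instead — exactly the mistake ShellCheck flags.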

design

In order to match shebangs, I need a regular expression search that can be limited to certain lines (the shebang has to be in the first line) and that yields a list of matching files. If only every tool had regexes as powerful as Vim's (which can easily assert a match in line 1 via \%1l; cp. :help /\%l)! But even the gold standard PCRE apparently struggles with that. Common Unix tools like grep only offer simple stuff (and that, confusingly, in two variants, basic and "extended"). But sed has separate addressing, so a shebang search can be done via 1{ /^#!/ }. On the other hand, sed doesn't have grep's --files-with-matches option (but GNU sed can print the current filename via the F command).
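As a sketch of that sed approach (file names made up; the F command and --separate are GNU sed extensions):

```shell
# two demo scripts: one Bash, one Perl
mkdir -p /tmp/shebang-demo
printf '#!/bin/bash\necho hi\n' > /tmp/shebang-demo/a
printf '#!/usr/bin/perl\nprint "hi";\n' > /tmp/shebang-demo/b

# -n suppresses the default printing, --separate restarts line numbering for
# each input file (so address 1 means each file's first line), and F prints
# the name of the current input file
sed -n --separate '1{ /^#!\/bin\/bash/F }' /tmp/shebang-demo/*
# → /tmp/shebang-demo/a
```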

So, let's write a sedf wrapper that simplifies a sed-based search yielding matching files, and use that to build a find-shebang command. That one will take find arguments (where to start searching, maybe limiting to newer files, and, important for my goal, the ability to do another search on the found files via -exec grep ... +), and one custom argument to restrict the shebangs (here: to Bash).

If all of this sounds like overkill, it certainly is for this particular task — a few manual steps would achieve this much faster than even getting started on the first script would. But the motivation for toolsmithing is identifying generic tools that help achieve common tasks. More universal and modular tools offer more potential applicability in the future, and with that the hope that the effort will pay off.

sedf

$ sedf -e '1{ /#!\/bin\/sh/F }' shell-basics/bin/*
shell-basics/bin/log
shell-basics/bin/logf

That small first command wasn't a lot of effort, but also doesn't save much: some crucial sed options (-n, --separate) no longer need to be supplied, the command fails (like grep) when there are no matches, and reported files are deduplicated (so that the patterns do not need to care about matching only once per file).

While testing that first step towards find-shebang, I soon realized that getting the candidate files (likely under source control) also isn't trivial; a search over all of my repositories would likely also pick up version control metadata, backups, etc. Reusing the ack-grep tool as a provider for candidate files would eliminate that problem — the best tool is the one I don't have to write and maintain.

Almost immediately into my investigation of ack, I realized that it already can detect shell scripts based on the shebang: ack --type=shell. There's no need for my find-shebang! Instead, I added some additional type definitions; its customizability is amazing:

~/.ackrc
--type-set
posix:firstlinematch:/^#!/bin/sh\b/
--type-set
bash:ext:bash
--type-set
bash:firstlinematch:/^#!.*\bbash\b/
--type-set
bashexcluded:firstlinematch:/^#!.*\b(?:t?c|k|z|fi)?sh\b/
--type-set
shebang:firstlinematch:/^#!/

ackfind

I still liked the idea of combining a pattern search with find conditions and actions, and with my renewed infatuation with ack, I started the corresponding ackfind, with this interface:

ackfind --help
Print files that would be searched by ack [in DIRECTORY[s]] [considering
ACK-OPTIONS (like --type) and ignoring version control directories by default],
additionally applying FIND-OPTION(s), too (files must be accepted both by ack
and find then).

Usage: ackfind [ACK-OPTIONS ...] [DIRECTORY ... [FIND-OPTIONS ...]] [-?|-h|--help] 

The algorithm behind the command is simple: run ack and run find, then accept all filespecs reported by both. The comm command can select the lines present in both of two files (as long as their contents are sorted); we can use Bash process substitution and readarray to obtain the result:

ackfind
readarray -t filespecs < <(
    comm -12 \
        <(findViaAck | sort) \
        <(findViaFind | sort)
)
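The intersection can be tried in isolation; here with two hypothetical file lists standing in for the findViaAck / findViaFind helpers:

```shell
# comm -12 suppresses lines unique to the first (-1) and second (-2) input,
# leaving only the lines common to both sorted lists
readarray -t filespecs < <(
    comm -12 \
        <(printf '%s\n' a.sh b.sh c.sh | sort) \
        <(printf '%s\n' b.sh c.sh d.sh | sort)
)
printf '%s\n' "${filespecs[@]}"
# → b.sh
# → c.sh
```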

By leveraging the power of existing commands, that part is already done, with just minimal boilerplate code added. But the requirement to be able to use find's -exec[dir] (and the lesser-known -ok[dir] variant, which prompts for confirmation) transparently then added a lot of complexity. First in parsing: reading multiple arguments as a combined shell command, with varying delimiters (either ; or +), and allowing repeats of those, is inherently complex.

The challenge lies in the trade-off between a simple but limited implementation and perfect support of the underlying interface. By naming the command after find, there's the inherent promise that it acts just like the original. Joel Spolsky has a nice article about The Law of Leaky Abstractions and how dangerous it is when implementation specifics become unhidden and affect the user. On the other hand, interfaces often have exotic special cases and obscure features that few people know and even fewer use, so is it really necessary to support those, potentially at great cost? But can you tell whether something is unnecessary fluff, or whether it could actually become the key feature that saves your hide a month from now? That is hard to decide, and depends on the situation and on how much time is available.

For the ackfind implementation, I've evaluated the implementation alternatives as follows:

So a roll-your-own implementation would be a lot of bland code and still miss the most salient feature of xargs. Leveraging xargs is the way to go, also to satisfy my inner perfectionist ;-).

xargsdir

The missing per-directory splitting can be handled in a modular way by introducing a drop-in replacement command:

xargsdir --help
Build and execute command lines from standard input; the specified COMMAND is
run from the subdirectory containing each file when limiting input lines /
arguments to 1. Else, it builds a command line to process more than one file,
but any given invocation of COMMAND will only use files that exist in the same
subdirectory. In other words, a drop-in replacement for xargs with semantics of
find's -execdir.

Usage: xargsdir [XARGS-OPTIONS] [COMMAND [INITIAL-ARGUMENTS]] [-?|-h|--help]

hairiness

And just when I thought I was almost done, I got stuck on (what I perceive as) inconsistencies in xargs's feature set around the -I replace-str option. What is innocuously noted as Implies -x and -L 1 in the man page means that at most one input line is used per command line. In other words, accumulating arguments (as with -exec … +) and the {} placeholder are mutually exclusive! Really? No way around it?

This problem made me aware that xargs has separate limits both on the input / parsing side (-L) and on the output side (-n|--max-args). So far, my mental model had been that xargs parses the input into items (whitespace-separated by default, but more robustly delimited via newlines or null characters), and then builds up command lines to the maximum or configured number. Apparently, as soon as a placeholder is introduced, this clean separation is gone: all arguments that need to go together must be placed on a single line, delimited by whitespace, with whitespace within items escaped — the hairy default parsing that the man page already recommends against. If appending the arguments at the end is fine for you, though, there's no problem with splitting.
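The difference is easy to reproduce (echo standing in for the real command; -d is a GNU xargs option):

```shell
# -n 2 packs up to two newline-delimited items into each invocation
printf '%s\n' a b c d | xargs -d '\n' -n 2 echo
# → a b
# → c d

# -I implies -L 1: each input line becomes its own invocation, so arguments
# can no longer accumulate up to the command-line length limit
printf '%s\n' a b c d | xargs -I {} echo "[{}]"
# → [a]
# → [b]
# → [c]
# → [d]
```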

Others have asked about this already, and the Unix & Linux StackExchange question has workaround suggestions (going through another shell and using "$@"). It echoes my complaints about command inconsistencies (whether it's different regexp support there or the placement and number of {} placeholders here), and the resulting leaky abstractions that make life much more difficult.
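That workaround looks roughly like this: a throwaway inner shell receives the accumulated items as "$@" and can place them wherever it wants, all within a single invocation:

```shell
# the trailing _ fills $0 of the inner shell; the piped items arrive as its
# positional parameters and get wrapped individually by the loop
printf '%s\n' a b c | xargs -d '\n' sh -c 'for arg in "$@"; do printf "[%s]\n" "$arg"; done' _
# → [a]
# → [b]
# → [c]
```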

placeholderArguments

I wasn't aware of the workarounds at the time (and, deep into the rabbit hole, didn't give myself the time to research the problem). I already had a placeholderArgument command that almost helps here: it replaces a {} with the (single) last argument.

placeholderArgument --help
Execute COMMAND while replacing all occurrences of {} in its argument with the
LAST-ARGUMENT. (Can be useful when xargs doesn't support the -I replace-str (as
in Busybox xargs).)
Usage: placeholderArgument COMMAND [COMMAND-ARG ...] [{}] [COMMAND-ARG ...] LAST-ARGUMENT [-?|-h|--help]

So I wrote a placeholderArguments (plural) derivative that supports multiple arguments.

placeholderArguments --help
Execute COMMAND(s) while replacing all occurrences of {} in its
arguments with ARGUMENT(s).
COMMAND (and its ARGUMENT(s)) can be specified in various ways: As one quoted
argument with -c|--command, arguments after --exec until a ';', separated by
special arguments --, or (especially useful in scripts when you know the
${#ARGS[@]} but cannot influence the contents) by number of arguments.

Usage: placeholderArguments [-r|--no-run-if-empty] [-?|-h|--help] ...
Usage: placeholderArguments [...] -c|--command "COMMANDLINE [...] [{}] [...]" [-c ...] [ARGUMENT ...]
Usage: placeholderArguments [...] --exec SIMPLECOMMAND [COMMAND-ARG ...] [{}] [COMMAND-ARG ...] ; [ARGUMENT ...]
Usage: placeholderArguments [...] -- SIMPLECOMMAND [COMMAND-ARG ...] [{}] [COMMAND-ARG ...] -- [ARGUMENT ...]
Usage: placeholderArguments [...] -n|--command-arguments N SIMPLECOMMAND [COMMAND-ARG ...] [{}] [COMMAND-ARG ...] [ARGUMENT ...]

    --no-run-if-empty|-r
			Don't run COMMANDs  if there are no ARGUMENTs. By
			default, COMMANDs will be executed without any supplied
			arguments.

Can be useful in combination with xargs when you want to build up arguments up
until the limit but cannot simply append the arguments - "xargs -I replace-str"
implies -L 1 and therefore will only take one input line per command, forcing
you to abandon -d '\n' / -0 and into whitespace issues:
Example: cat FILE [...] | xargs -d '\n' placeholderArguments --exec dump-args -v -- \[ {} \] \;

The flexible parsing options should allow any kind of use case. In my ackfind command, I'll simply embellish the xargs command if necessary:

ackfind
execCommand=(placeholderArguments --command-arguments "${#execCommand[@]}" "${execCommand[@]}")

out of the rabbit hole

Whew, that was much more effort than expected! But I'm satisfied that I …

surprise

Satisfied with my new tools, I can finally put them to use! ack's PCRE backreferences make it easy to write a regular expression that searches for code appending to a scalar variable (though it's as cryptic as regular expressions can be). The -t|--type argument limits the considered files to Bash scripts.

$ ack -t bash '\b(\w+)="\$(\{\1\}|\1[^[:alnum:]"])'
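The pattern can be sanity-checked without ack, e.g. via grep's PCRE mode on made-up sample lines (assuming a GNU grep built with -P support):

```shell
# the first two lines append to their own variable and should match;
# the third references a different variable and should not
printf '%s\n' 'items="${items}x"' 'items="$items x"' 'other="$unrelated"' \
    | grep -P '\b(\w+)="\$(\{\1\}|\1[^[:alnum:]"])'
# → items="${items}x"
# → items="$items x"
```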

Wait, where do I need the find part? Turns out that the problem can already be solved completely within ack! Only my initial approach of separating the shebang discovery from the grepping would have required a find part in between. In the heat of the moment, I didn't realize that what I was building isn't actually needed (for now at least). Such is life :-).

However, should I want to limit the search to files modified within the last week, I can now do so easily, with minimal extensions to the above command:

$ ackfind -t bash . -ctime -7 -exec ack '\b(\w+)="\$(\{\1\}|\1[^[:alnum:]"])' {} +

Ingo Karkat, 04-Apr-2022

ingo's blog is licensed under Attribution-ShareAlike 4.0 International
