Excursions in tool writing

Posted Monday, 04-Apr-2022 by Ingo Karkat

prelude

It all started with the accidental discovery of a style error (via ShellCheck)1, complaining that items+="$item" is missing the parentheses (items+=("$item")) for appending an element to an array. I then learned that += can be used (in Bash and Korn shell) for appending to a string. As that's much less ugly than foo="${foo}text", I wanted to automatically convert all such assignments throughout my scripts.

I needed a regular expression search across all Bash scripts to identify the places. Bash scripts can be identified by the #!/bin/bash shebang in their first line. So one way to implement this is one search to identify the scripts, generating a list of files, and then a second search on those files for all occurrences. (Another approach would be one combined search, but asserting both matches across lines is impossible with line-based tools like grep.)

1 And here, not even one sentence into the article, my first excursion from the excursion began. My Vim mapping that searches for a keyword and offers corresponding links for inclusion did not give the expected result, so I tried troubleshooting the custom Google search I had set up years ago2, then investigated free alternatives, and attempted to register for Bing's search API (until it asked for a credit card despite search being a "free" service).

I finally made several enhancements to my (unpublished) Vim plugin, so that I could do automatically in 2 seconds what would have taken 60 seconds manually (open a new browser tab → Google search → right mouse button on the search result → Copy Link → close browser tab → switch back to Vim → paste → re-type title) — poor ROI, but it felt right.

2 Here, I went down yet another short side path, creating a curlAsBrowser command because the Google search would not accept a manual curl invocation. In order to fake a browser's User-Agent, I installed a JavaScript library called top-user-agents, and then wrote a small shell wrapper around it.

3 What do you do when you're getting in too deep? For me, the solution is a comfortable todo list. I can open a HUD terminal with a single key press and submit a note quickly via $ tt add "do that later"
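The two += flavors mentioned in the prelude can be seen side by side in a minimal Bash sketch (variable names made up for illustration):

```shell
item='!'

greeting='hi'
greeting+="$item"       # string append: equivalent to greeting="${greeting}${item}"
echo "$greeting"        # → hi!

items=('a' 'b')
items+=("$item")        # array append: the parentheses add a new element
echo "${#items[@]}"     # → 3
echo "${items[2]}"      # → !
```

Without the parentheses, items+="$item" would silently append the string to element 0 instead — exactly the mistake ShellCheck flags.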

design

In order to match shebangs, I need a regular expression search that can be limited to certain lines (the shebang has to be in the first line) and that yields a list of matching files. If only every tool had regexes as powerful as Vim's (which can easily assert a match in line 1 via \%1l; cp. :help /\%l)! But even the gold standard PCRE apparently struggles with that. Common Unix tools like grep only offer simple stuff (and that, confusingly, in two variants, basic and "extended"). But sed has separate addressing, so a shebang search can be done via 1{ /^#!/ }. On the other hand, sed doesn't have grep's --files-with-matches option (but GNU sed can print the current filename via the F command).
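As a sketch of that sed approach (file names made up; the F command and --separate are GNU sed extensions):

```shell
# two demo scripts: one Bash, one Perl
mkdir -p /tmp/shebang-demo
printf '#!/bin/bash\necho hi\n' > /tmp/shebang-demo/a
printf '#!/usr/bin/perl\nprint "hi";\n' > /tmp/shebang-demo/b

# -n suppresses the default printing, --separate restarts line numbering for
# each input file (so address 1 means each file's first line), and F prints
# the name of the current input file
sed -n --separate '1{ /^#!\/bin\/bash/F }' /tmp/shebang-demo/*
# → /tmp/shebang-demo/a
```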

So, let's write a sedf wrapper that simplifies a sed-based search yielding matching files, and use that to build a find-shebang command. That one will take find arguments (where to start searching, maybe limiting to newer files, and, important for my goal, the ability to do another search on the found files via -exec grep ... +), and one custom argument to restrict the shebangs (here: to Bash).

If all of this sounds like overkill, it certainly is for this particular task — a few manual steps would achieve this much faster than even getting started on the first script would. But the motivation for toolsmithing is identifying generic tools that help achieve common tasks. More universal and modular tools offer more potential applicability in the future, and with that the hope that the effort will pay off.

sedf

$ sedf -e '1{ /#!\/bin\/sh/F }' shell-basics/bin/*
shell-basics/bin/log
shell-basics/bin/logf

That small first command wasn't a lot of effort, but also doesn't save much: some crucial sed options (-n, --separate) no longer need to be supplied, the command fails (like grep) when there are no matches, and reported files are deduplicated (so that the patterns do not need to care about matching only once per file).

While testing that first step towards find-shebang, I soon realized that getting the candidate files (likely under source control) also isn't trivial; a search over all of my repositories would likely also pick up version control metadata, backups, etc. Reusing the ack-grep tool as a provider for candidate files would eliminate that problem — the best tool is the one I don't have to write and maintain.

Almost immediately into my investigation of ack, I realized that it already can detect shell scripts based on the shebang: ack --type=shell. There's no need for my find-shebang! Instead, I added some additional type definitions; its customizability is amazing:

~/.ackrc
--type-set
posix:firstlinematch:/^#!/bin/sh\b/
--type-set
bash:ext:bash
--type-set
bash:firstlinematch:/^#!.*\bbash\b/
--type-set
bashexcluded:firstlinematch:/^#!.*\b(?:t?c|k|z|fi)?sh\b/
--type-set
shebang:firstlinematch:/^#!/

ackfind

I still liked the idea of combining a pattern search with find conditions and actions, and with my renewed infatuation with ack, I started the corresponding ackfind, with this interface:

ackfind --help
Print files that would be searched by ack [in DIRECTORY[s]] [considering
ACK-OPTIONS (like --type) and ignoring version control directories by default],
additionally applying FIND-OPTION(s), too (files must be accepted both by ack
and find then).

Usage: ackfind [ACK-OPTIONS ...] [DIRECTORY ... [FIND-OPTIONS ...]] [-?|-h|--help] 

The algorithm behind the command is simple: run ack and run find, then accept all filespecs reported by both. The comm command can select the lines present in both of two files (as long as their contents are sorted); we can use Bash process substitution and readarray to obtain the result:

ackfind
readarray -t filespecs < <(
    comm -12 \
        <(findViaAck | sort) \
        <(findViaFind | sort)
)
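The intersection can be tried in isolation; here with two hypothetical file lists standing in for the findViaAck / findViaFind helpers:

```shell
# comm -12 suppresses lines unique to the first (-1) and second (-2) input,
# leaving only the lines common to both sorted lists
readarray -t filespecs < <(
    comm -12 \
        <(printf '%s\n' a.sh b.sh c.sh | sort) \
        <(printf '%s\n' b.sh c.sh d.sh | sort)
)
printf '%s\n' "${filespecs[@]}"
# → b.sh
# → c.sh
```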

By leveraging the power of existing commands, that part is already done, with just minimal boilerplate code added. But the requirement to be able to use find's -exec[dir] (and the lesser-known -ok[dir] variant, which prompts for confirmation) transparently then added a lot of complexity. First in parsing: reading multiple arguments as a combined shell command, with varying delimiters (either ; or +), and allowing repeats of those, is inherently complex.

The challenge lies in the trade-off between a simple but limited implementation and perfect support of the underlying interface. By naming the command after find, there's the inherent promise that it acts just like the original. Joel Spolsky has a nice article about The Law of Leaky Abstractions and how dangerous it is when implementation specifics become unhidden and affect the user. On the other hand, interfaces often have exotic special cases and obscure features that few people know and even fewer use, so is it really necessary to support those, potentially at great cost? But can you tell whether something is unnecessary fluff, or whether it could actually become the key feature that saves your hide a month from now? That is hard to decide, and depends on the situation and on how much time is available.

For the ackfind implementation, I've evaluated the implementation alternatives as follows:

So a roll-your-own implementation would be a lot of bland code and still miss the most salient feature of xargs. Leveraging xargs is the way to go, also to satisfy my inner perfectionist ;-).

xargsdir

The missing per-directory splitting can be handled in a modular way by introducing a drop-in replacement command:

xargsdir --help
Build and execute command lines from standard input; the specified COMMAND is
run from the subdirectory containing each file when limiting input lines /
arguments to 1. Else, it builds a command line to process more than one file,
but any given invocation of COMMAND will only use files that exist in the same
subdirectory. In other words, a drop-in replacement for xargs with semantics of
find's -execdir.

Usage: xargsdir [XARGS-OPTIONS] [COMMAND [INITIAL-ARGUMENTS]] [-?|-h|--help]

hairiness

And just when I thought I was almost done, I got stuck on (what I perceive as) inconsistencies in xargs's feature set around the -I replace-str option. What is innocuously noted as Implies -x and -L 1 in the man page means that at most one input line is used per command line. In other words, accumulating arguments (as with -exec … +) and the {} placeholder are mutually exclusive! Really? No way around it?

This problem made me aware that xargs has separate limits both on the input / parsing side (-L) and on the output side (-n|--max-args). So far, my mental model had been that xargs parses the input into items (whitespace-separated by default, but more robustly delimited via newlines or null characters), and then builds up command lines to the maximum or configured number. Apparently, as soon as a placeholder is introduced, this clean separation is gone: all arguments that need to go together must be placed on a single line, delimited by whitespace, with whitespace within items escaped — the hairy default parsing that the man page already recommends against. If appending the arguments at the end is fine for you, though, there's no problem with splitting.
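The difference is easy to reproduce (echo standing in for the real command; -d is a GNU xargs option):

```shell
# -n 2 packs up to two newline-delimited items into each invocation
printf '%s\n' a b c d | xargs -d '\n' -n 2 echo
# → a b
# → c d

# -I implies -L 1: each input line becomes its own invocation, so arguments
# can no longer accumulate up to the command-line length limit
printf '%s\n' a b c d | xargs -I {} echo "[{}]"
# → [a]
# → [b]
# → [c]
# → [d]
```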

Others have asked about this already, and the Unix & Linux StackExchange question has workaround suggestions (going through another shell and using "$@"). It echoes my complaints about command inconsistencies (whether it's different regexp support there or the placement and number of {} placeholders here), and the resulting leaky abstractions that make life much more difficult.
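That workaround looks roughly like this: a throwaway inner shell receives the accumulated items as "$@" and can place them wherever it wants, all within a single invocation:

```shell
# the trailing _ fills $0 of the inner shell; the piped items arrive as its
# positional parameters and get wrapped individually by the loop
printf '%s\n' a b c | xargs -d '\n' sh -c 'for arg in "$@"; do printf "[%s]\n" "$arg"; done' _
# → [a]
# → [b]
# → [c]
```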

placeholderArguments

I wasn't aware of the workarounds at the time (and, deep into the rabbit hole, didn't give myself the time to research the problem). I already had a placeholderArgument command that almost helps here: it replaces a {} with the (single) last argument.

placeholderArgument --help
Execute COMMAND while replacing all occurrences of {} in its argument with the
LAST-ARGUMENT. (Can be useful when xargs doesn't support the -I replace-str (as
in Busybox xargs).)
Usage: placeholderArgument COMMAND [COMMAND-ARG ...] [{}] [COMMAND-ARG ...] LAST-ARGUMENT [-?|-h|--help]

So I wrote a placeholderArguments (plural) derivative that supports multiple arguments.

placeholderArguments --help
Execute COMMAND(s) while replacing all occurrences of {} in its
arguments with ARGUMENT(s).
COMMAND (and its ARGUMENT(s)) can be specified in various ways: As one quoted
argument with -c|--command, arguments after --exec until a ';', separated by
special arguments --, or (especially useful in scripts when you know the
${#ARGS[@]} but cannot influence the contents) by number of arguments.

Usage: placeholderArguments [-r|--no-run-if-empty] [-?|-h|--help] ...
Usage: placeholderArguments [...] -c|--command "COMMANDLINE [...] [{}] [...]" [-c ...] [ARGUMENT ...]
Usage: placeholderArguments [...] --exec SIMPLECOMMAND [COMMAND-ARG ...] [{}] [COMMAND-ARG ...] ; [ARGUMENT ...]
Usage: placeholderArguments [...] -- SIMPLECOMMAND [COMMAND-ARG ...] [{}] [COMMAND-ARG ...] -- [ARGUMENT ...]
Usage: placeholderArguments [...] -n|--command-arguments N SIMPLECOMMAND [COMMAND-ARG ...] [{}] [COMMAND-ARG ...] [ARGUMENT ...]

    --no-run-if-empty|-r
			Don't run COMMANDs  if there are no ARGUMENTs. By
			default, COMMANDs will be executed without any supplied
			arguments.

Can be useful in combination with xargs when you want to build up arguments up
until the limit but cannot simply append the arguments - "xargs -I replace-str"
implies -L 1 and therefore will only take one input line per command, forcing
you to abandon -d '\n' / -0 and into whitespace issues:
Example: cat FILE [...] | xargs -d '\n' placeholderArguments --exec dump-args -v -- \[ {} \] \;

The flexible parsing options should allow any kind of use case. In my ackfind command, I'll simply embellish the xargs command if necessary:

ackfind
execCommand=(placeholderArguments --command-arguments "${#execCommand[@]}" "${execCommand[@]}")

out of the rabbit hole

Whew, that was much more effort than expected! But I'm satisfied that I …

surprise

Satisfied with my new tools, I can finally put them to use! ack's PCRE backreferences make it easy to write a regular expression that searches for code appending to a scalar variable (though it's as cryptic as regular expressions can be). The -t|--type argument limits the considered files to Bash scripts.

$ ack -t bash '\b(\w+)="\$(\{\1\}|\1[^[:alnum:]"])'
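The pattern can be sanity-checked without ack, e.g. via grep's PCRE mode on made-up sample lines (assuming a GNU grep built with -P support):

```shell
# the first two lines append to their own variable and should match;
# the third references a different variable and should not
printf '%s\n' 'items="${items}x"' 'items="$items x"' 'other="$unrelated"' \
    | grep -P '\b(\w+)="\$(\{\1\}|\1[^[:alnum:]"])'
# → items="${items}x"
# → items="$items x"
```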

Wait, where do I need the find part? Turns out that the problem can already be solved completely within ack! Only my initial approach of separating the shebang discovery from the grepping would have required a find part in between. In the heat of the moment, I didn't realize that what I was building isn't actually needed (for now at least). Such is life :-).

However, should I want to limit the search to files modified within the last week, I can now do so easily, with minimal extensions to the above command:

$ ackfind -t bash . -ctime -7 -exec ack '\b(\w+)="\$(\{\1\}|\1[^[:alnum:]"])' {} +

Ingo Karkat, 04-Apr-2022

ingo's blog is licensed under Attribution-ShareAlike 4.0 International
