Playing with set -e in shell scripts

2023-02-25

At one of my previous teams, we laughed at how ~~much of an overkill~~ state-of-the-art some of our practices were. In particular, for a shell script to be considered ~~a cargo-cult~~ state-of-the-art, it had to include this line somewhere at the beginning:

set -Eeuxo pipefail

I’ll focus on the -e part—an option called errexit. man dash gives the following description:

If not interactive, exit immediately if any untested command fails. The exit status of a command is considered to be explicitly tested if the command is used to control an if, elif, while, or until; or if the command is the left hand operand of an && or || operator.

I will refer to the term tested command as defined above throughout the post.

The good things

The above description sounds useful. For example, I wouldn’t want my backup script to silently ignore that cp failed and then happily run rm on the original files.

And it’s still possible to act upon a failure by using tested commands. For a trivial example, to try until cp in the backup script succeeded, I could write it this way:

until cp original backup; do
    sleep 1
done
rm original

Setting errexit isn’t the only way to achieve this quit-on-error behavior. One can manually check the exit code of every command in a script and then decide to quit—like so:

cp original backup || exit

The bad things

It turns out, however, that errexit isn’t that amazing. What prompted me to write this post is that I’ve recently noticed that ShellCheck provides the --enable all flag, which surely sounds like a ~~yak-shaving contest~~ state-of-the-art thing. Obviously, it reported new errors when I ran it on a script that previously passed with flying colors. In particular, the following snippet was marked with error SC2310.

if ! there_is_rebase_in_progress; then
    return
fi
# some other commands follow

To quickly quote ShellCheck’s documentation:

ShellCheck found a function used as a condition in a script where set -e is enabled. This means that the function will run without set -e, and will power through any errors.
This applies to if, while, and until statements, commands negated with !, as well as the left-hand side of || and &&. It does not matter how deeply the command is nested in such a structure.

To rephrase: because this is a tested function, the errors that happen inside it will be ignored; neither the function nor the script will terminate because of an error. This is—broken? I can’t name a single thing that’s good about this. And ShellCheck does not report it by default!

Why is this the case? My first guess would be the following. Let’s remember that—despite errexit being set—one can still do checks like until cp mentioned above. An extremely crude way of implementing that would be to stop errexit from taking effect before the check and enable it again afterward. Voilà. But that’s just a wild guess.

One thing that complicates the matter is that explicitly setting errexit again in the tested function has no effect: errors will be ignored, even in a subshell.

Ideal behaviour

Let’s think what would the ideal behavior here. The following three statements hold when errexit is set:

an explicit non-zero return from an untested function should terminate the script;
an explicit non-zero return from a tested function should not terminate the script;
an error inside an untested function should terminate both the function and the script.

I think the following two statements should hold as well:

an error inside a tested function should terminate the function;
an error inside a tested function should terminate the script.

Any untested error in execution would then terminate the script, which is robust and easy to reason about. To allow an error to happen one would need to make it tested. To ignore an error and continue the function, one can do this:

some_command || :  # colon is same as true, a no-op

If one wanted to allow an error to happen inside the function and terminate the function but not the script, one would need to make an explicit return:

some_command || return $some_code

Workarounds

At this point, you are aware the problem even exists, and you can make an informed decision about whether this is something that bothers you or not. If it does, I’ve come up with some workarounds.

1: Don’t use errexit

IMHO this is throwing the baby out with the water. Every single command has to be tested individually.

2: Don’t be fancy

Obviously this issue doesn’t affect a script if it doesn’t use tested functions or subshells. I would guesstimate the vast majority of scripts is not affected, simply because they don’t even define any functions. Using functions in a tested context is somewhat uncommon—and plain odd in the case of subshells.

So the workaround is to avoid functions and subshells, or at least avoid them in tested contexts.

If one needs some kind of status propagation one can use stdout, stderr, or other stream, or a variable.

3: Move tested functions to scripts

Probably the simplest workaround: one can define their tested functions in other files and run them as scripts. Then they will run as any other program would, and the disabled state of errexit will not leak to children. One could even place them in a dedicated directory alongside an explanatory README.

4: Source and run

Run the function with something akin to:

sh -ec ". ${this-script}; some_function"

Similar to the previous one, but still allowing the script to be self-contained—well, mostly. The script must^* be a library, otherwise sourcing it will trigger side-effects. This in turn means that the script was itself sourced by something else, and one needs to resort to some shenanigans to know the path to the script, as POSIX doesn’t provide a built-in way to obtain it.

* I think there is a way to make the script truly self-contained. See this short article on modulinos in bash. However, I’m not sure how a POSIX-compliant modulino would look like.

5: Test everything in tested functions

If a shell function is executed and its exit status is explicitly tested, all commands of the function are considered to be tested as well.

The above phrasing of the issue—found in FreeBSD’s sh manual—gave me an idea that one might just test all the commands inside the affected function explicitly. If the surface area to test is small, it might be a good solution. It needs to me remembered though, that the surface area grows recursively with every function call made in the function.

6: `set -e` guards

We can perform the following sequence: (1) disable errexit before calling the function; (2) enable errexit inside the function; (3) enable errexit back after the function call. See the example below. For this to work, we need to define the function using a subshell—note the parentheses in lieu of curly braces—so setting errexit doesn’t propagate outwards to the caller.

some_function() (
    set -e
    false # terminates and returns 1
    return 2
)

set +e; some_function; exit_code="$?"; set -e

if test "${exit_code}" = 1; then
    echo "This is echoed"
fi

The first consideration is that this approach is intrusive: one needs to change braces to parentheses and run set -e at the beginning for every affected function, which might be a little error prone.

Second, subshells are slow^†. That said—unless the code is extremely subshell-heavy, it’s not going to be noticeable.

† See this ShellCheck entry on subshell performance for more details. Out of curiosity I ran the benchmark presented there on both dash and bash. In both cases the version without subshells was significantly faster—double digit milliseconds versus single digit seconds. However, in both cases dash was at least two times faster than bash.

Subshells have a nice property of isolating changes made to variables inside them, cutting down on mutable global state. This alone is a good reason to use them to define all the functions. Non-POSIX local keyword provided by some shells achieves the same purpose.

7: Substitution sandbox (dash-specific)

If an author of a script knows they are targeting only dash shell, they can leverage its quirk, which probably is a bug, but I prefer to call it a happy little accident. For whatever reason, the problem disappears if command substitution is used to assign stdout of a function to a variable. If there’s no need to save the output, it can be assigned to _ variable, however, no one sane reading this will understand it without an explanatory comment. Let’s see it in action:

if _="$(some_function)"; then
    echo "This is echoed on success"
fi

If one wanted to do something only in case of failure, they can use ! to invert the exit code:

if ! _="$(some_function)"; then
    echo "This is echoed on failure"
fi

However, this collapses all 255 nonzero exit codes into one, losing information. A simple workaround is to use a no-op command in the success branch—like so:

if _="$(some_function)"; then
    :
else
    exit_code="$?"
fi

One could also want to save the exit code in both success and failure branches. This can be shortened to arrive at this one-liner. Behold the substitution sandbox:

_="$(some_function)" && ec="$?" || ec="$?"

This is actually really neat. One is free to use stdout for whatever they wish, there’s no need to make your hands dirty with global variables, no modifications to the function are required, and it’s not that noisy syntax-wise. Too bad it doesn’t work in any other shell I tried, it would make for a nice idiom. Substitution sandbox, it does have a nice ring to it, doesn’t it?

Closing thoughts

Man this is fucked up. I like shell. I want to like it. But it’s painful sometimes.

If there’s one takeway here, it’s this: use ShellCheck. You will quote everything like a madman. It will not catch everything. But it is good.

By the way, hello world! That’s my first post, and hopefully not the last.