Playing with set -e in shell scripts
2023-02-25
At one of my previous teams, we laughed at how much of an overkill some of our state-of-the-art practices were. In particular, for a shell script to be considered cargo-cult state-of-the-art, it had to include this line somewhere at the beginning:
    set -Eeuxo pipefail
I’ll focus on the -e part—an option called errexit. man dash gives the following description:

    If not interactive, exit immediately if any untested command fails. The
    exit status of a command is considered to be explicitly tested if the
    command is used to control an if, elif, while, or until; or if the
    command is the left hand operand of an && or || operator.
I will refer to the term tested command as defined above throughout the post.

The good things

The above description sounds useful. For example, I wouldn’t want my backup script to silently ignore that cp failed and then happily run rm on the original files.

And it’s still possible to act upon a failure by using tested commands. For a trivial example, to retry until the cp in the backup script succeeds, I could write it this way:

    until cp original backup; do
        sleep 1
    done
    rm original

Setting errexit isn’t the only way to achieve this quit-on-error behavior. One can manually check the exit code of every command in a script and then decide to quit—like so:

    cp original backup || exit
The bad things

It turns out, however, that errexit isn’t that amazing. What prompted me to write this post is that I’ve recently noticed that ShellCheck provides the --enable all flag, which surely sounds like a yak-shaving contest state-of-the-art thing. Obviously, it reported new errors when I ran it on a script that previously passed with flying colors. In particular, the following snippet was marked with error SC2310.

    if ! there_is_rebase_in_progress; then
        return
    fi
    # some other commands follow
To quickly quote ShellCheck’s documentation:

    ShellCheck found a function used as a condition in a script where set -e
    is enabled. This means that the function will run without set -e, and
    will power through any errors.

    This applies to if, while, and until statements, commands negated with !,
    as well as the left-hand side of || and &&. It does not matter how deeply
    the command is nested in such a structure.

To rephrase: because this is a tested function, the errors that happen inside it will be ignored; neither the function nor the script will terminate because of an error.

This is—broken? I can’t name a single thing that’s good about this. And ShellCheck does not report it by default! Why is this the case? My first guess would be the following. Let’s remember that—despite errexit being set—one can still do checks like the until cp mentioned above. An extremely crude way of implementing that would be to stop errexit from taking effect before the check and enable it again afterward. Voilà. But that’s just a wild guess.

One thing that complicates the matter is that explicitly setting errexit again in the tested function has no effect: errors will be ignored, even in a subshell.
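To make the failure mode concrete, here is a minimal sketch (the broken function and its commands are made up for illustration). In both dash and bash it prints both messages:

    #!/bin/sh
    set -e

    broken() {
        set -e    # no effect here: the function is called in a tested context
        false     # fails, yet execution continues
        echo "still running"
    }

    if broken; then
        echo "broken reported success"
    fi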
Other problems

There are other problems with errexit—ones that our state-of-the-art line fixed—that I won’t focus on in this post. First, commands registered with trap are not fired if an error terminating the execution happens in a function. Bash provides the set -E option for this.

And if you’re wondering how error handling looks in pipelines, well, they just power through all errors. I don’t know what you expected. Bash makes it possible to change that with the -o pipefail option. For POSIX, you’d need to refactor your code to use named pipes—or use a proper language already.
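A quick bash sketch of that pipeline behavior:

    #!/bin/bash
    set -e
    false | true    # pipeline succeeds: only the last command's status counts
    echo "reached"

    set -o pipefail
    false | true    # now the pipeline returns 1 and errexit terminates the script
    echo "never reached"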
Ideal behaviour

Let’s think about what the ideal behavior would be here. The following three statements hold when errexit is set:

- A failing untested command terminates the script.
- A failing tested command terminates nothing; its exit status simply drives the test.
- A failing command inside a tested function or subshell is ignored entirely, and execution continues.

I think the following two statements should hold as well:

- A failing untested command inside a tested function should terminate the function, but not the script.
- The function should then return the exit status of the failed command, so the caller’s test still works.

Any untested error in execution would then terminate the script, which is robust and easy to reason about. To allow an error to happen one would need to make it tested. To ignore an error and continue the function, one can do this:

    some_command || : # colon is same as true, a no-op

If one wanted to allow an error to happen inside the function and terminate the function but not the script, one would need to make an explicit return:

    some_command || return $some_code
Workarounds

At this point, you are aware the problem even exists, and you can make an informed decision about whether this is something that bothers you or not. If it does, I’ve come up with some workarounds.

1: Don’t use errexit

IMHO this is throwing the baby out with the bathwater. Every single command has to be tested individually.
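A sketch of what that looks like, reusing the backup example (the tar step is made up):

    cp original backup || exit 1
    tar -czf backup.tar.gz backup || exit 1    # hypothetical extra step
    rm original || exit 1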
2: Don’t be fancy

Obviously this issue doesn’t affect a script if it doesn’t use tested functions or subshells. I would guesstimate that the vast majority of scripts are not affected, simply because they don’t even define any functions. Using functions in a tested context is somewhat uncommon—and plain odd in the case of subshells. So the workaround is to avoid functions and subshells, or at least avoid them in tested contexts. If one needs some kind of status propagation one can use stdout, stderr, or another stream, or a variable.
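For instance, a sketch of propagating status through a variable (the function and the .git check are made up): the function is called untested, so errexit keeps working inside it, and only a plain test lands in the tested context:

    rebase_in_progress=no
    check_rebase() {
        # a failing command in here still terminates the script
        if test -d .git/rebase-merge; then
            rebase_in_progress=yes
        fi
    }

    check_rebase    # untested call, so errexit stays fully active inside
    if test "$rebase_in_progress" = yes; then
        echo "rebase in progress"
    fi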
3: Move tested functions to scripts

Probably the simplest workaround: one can define their tested functions in other files and run them as scripts. Then they will run as any other program would, and the disabled state of errexit will not leak to children. One could even place them in a dedicated directory alongside an explanatory README.
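A sketch of that layout, assuming a hypothetical helpers directory. The helper is a separate, executable file, say helpers/there_is_rebase_in_progress:

    #!/bin/sh
    # a separate process: set -e is fully effective regardless of the caller
    set -e
    test -d .git/rebase-merge

The call site in the main script can then stay a tested command:

    if ! helpers/there_is_rebase_in_progress; then
        return
    fi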
4: Source and run

Run the function with something akin to:

    sh -ec ". ${this_script}; some_function"

Similar to the previous one, but still allowing the script to be self-contained—well, mostly. The script must* be a library, otherwise sourcing it will trigger side-effects. This in turn means that the script was itself sourced by something else, and one needs to resort to some shenanigans to know the path to the script, as POSIX doesn’t provide a built-in way to obtain it.

* I think there is a way to make the script truly self-contained. See this short article on modulinos in bash. However, I’m not sure what a POSIX-compliant modulino would look like.
5: Test everything in tested functions

    If a shell function is executed and its exit status is explicitly
    tested, all commands of the function are considered to be tested as
    well.

The above phrasing of the issue—found in FreeBSD’s sh manual—gave me an idea that one might just test all the commands inside the affected function explicitly. If the surface area to test is small, it might be a good solution. It needs to be remembered, though, that the surface area grows recursively with every function call made in the function.
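A sketch with placeholder commands; every line carries its own explicit test, so the suppressed errexit no longer matters:

    some_function() {
        command_one || return 1
        command_two || return 1
        command_three    # the last command's status becomes the function's status
    }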
6: set -e guards

We can perform the following sequence: (1) disable errexit before calling the function; (2) enable errexit inside the function; (3) enable errexit back after the function call. See the example below. For this to work, we need to define the function using a subshell—note the parentheses in lieu of curly braces—so setting errexit doesn’t propagate outwards to the caller.

    some_function() (
        set -e
        false # terminates and returns 1
        return 2
    )
    set +e; some_function; exit_code="$?"; set -e
    if test "${exit_code}" = 1; then
        echo "This is echoed"
    fi

The first consideration is that this approach is intrusive: one needs to change braces to parentheses and run set -e at the beginning of every affected function, which might be a little error prone.

Second, subshells are slow†. That said—unless the code is extremely subshell-heavy—it’s not going to be noticeable.

† See this ShellCheck entry on subshell performance for more details. Out of curiosity I ran the benchmark presented there on both dash and bash. In both cases the version without subshells was significantly faster—double-digit milliseconds versus single-digit seconds. However, in both cases dash was at least two times faster than bash.

Subshells have a nice property of isolating changes made to variables inside them, cutting down on mutable global state. This alone is a good reason to use them to define all the functions. The non-POSIX local keyword provided by some shells achieves the same purpose.
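A tiny illustration of that isolation (names are arbitrary):

    f() (
        tmp="scratch"    # assignment is confined to the subshell
    )
    tmp="original"
    f
    echo "$tmp"          # prints: original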
7: Substitution sandbox (dash-specific)

If an author of a script knows they are targeting only the dash shell, they can leverage its quirk, which probably is a bug, but I prefer to call it a happy little accident. For whatever reason, the problem disappears if command substitution is used to assign stdout of a function to a variable. If there’s no need to save the output, it can be assigned to the _ variable; however, no one sane reading this will understand it without an explanatory comment. Let’s see it in action:

    if _="$(some_function)"; then
        echo "This is echoed on success"
    fi
If one wanted to do something only in case of failure, they can use ! to invert the exit code:

    if ! _="$(some_function)"; then
        echo "This is echoed on failure"
    fi
However, this collapses all 255 nonzero exit codes into one, losing information. A simple workaround is to use a no-op command in the success branch—like so:

    if _="$(some_function)"; then
        :
    else
        exit_code="$?"
    fi
_="$(some_function)" && ec="$?" || ec="$?"
Closing thoughts

Man, this is fucked up. I like shell. I want to like it. But it’s painful sometimes.

If there’s one takeaway here, it’s this: use ShellCheck. You will quote everything like a madman. It will not catch everything. But it is good.

By the way, hello world! That’s my first post, and hopefully not the last.