Monday! I am slightly jet-lagged, just like this post. Interesting things happened but I had no time to collect my thoughts and share them.
The interface re-mapping branch got another round of reviews and slight tweaks. The configuration code (think “snap get, snap set”) is now enrolled in the same mechanism and can map “system” in the API to “core” in the on-disk state.
There are some small tweaks elsewhere, additional unit tests, updates to stale comments, the usual stuff that gets addressed as a branch evolves.
Meanwhile a new topic begins to grow on the side. There is a particular integration test that is unreliable, racy, perhaps broken in ways we don’t yet understand.
The test exercises a feature where snap developers can customize how a service
is stopped. Normally the process gets killed with SIGTERM
but as
systemd.service(5) and systemd.kill(5) manual pages show, there are many
options to choose from.
You can pick the signal to use. The signal can be sent to just the main process or to all of the children as well. We want to test each possible combination to ensure it works.
We have a test executable, a shell script really, that responds to signals. The executable is used by a number of services in a test snap, each one with a distinct combination of available options.
It looks great on paper but it doesn’t work in practice. It often fails with
puzzling lack of output. We had to investigate as it was constantly getting in
the way of core18
work.
The service program sets up handlers for a number of signals and then loops chatting “I’m alive”, waiting for them to arrive.
The act of establishing the handler was racing with the test asking systemd to stop the service. This is because systemd already considered the service “ready” in absence of information to the contrary (as soon as the process is there really).
I tweeted about it sometime last week. This is just the on-going investigation. We fixed that source of raciness and hoped for the best. It was still racy.
The next thing we found was that shell is terrible. It is a constant reminder that any code written in shell is just waiting to bash you (ha ha) on the head.
This apparently also includes an extremely trivial two line script that uses
trap to run a command on signal delivery. You see, we got lucky because we
used a loop that did echo "running"; sleep 1
. If we had used sleep 1h
the
bug would have been apparent.
Shell doesn’t run the signal handler when sleep is invoked. In fact by using
strace
we can clearly see that signal handlers are blocked before running
commands requiring a new process.
Fortunately I had a version of the test program written in Python that didn’t have that issue. We switched to it hoping it would be the end of the story.
It isn’t, the test is still flaky. It’s interesting how reality challenges assumptions of simplicity.
The suspicion now is on how the test itself operates and how logging helpers extract the relevant fragment of the log for analysis. Exploration continues in https://github.com/snapcore/snapd/pull/5526