Daily 2018-07-16

3 minute read Published:

Monday, 16th of July 2018

Monday! I am slightly jet-lagged, just like this post. Interesting things happened but I had no time to collect my thoughts and share them.

The interface re-mapping branch got another round of reviews and slight tweaks. The configuration code (think “snap get, snap set”) is now enrolled in the same mechanism and can map “system” in the API to “core” in the on-disk state.

There are some small tweaks elsewhere, additional unit tests, updates to stale comments, the usual stuff that gets addressed as a branch evolves.

Meanwhile a new topic begins to grow on the side. There is a particular integration test that is unreliable, racy, perhaps broken in ways we don’t yet understand.

The test exercises a feature where snap developers can customize how a service is stopped. Normally the process gets killed with SIGTERM but as systemd.service(5) and systemd.kill(5) manual pages show, there are many options to choose from.

You can pick the signal to use. The signal can be sent to just the main process or to all of the children as well. We want to test each possible combination to ensure it works.

We have a test executable, a shell script really, that responds to signals. The executable is used by a number of services in a test snap, each one with a distinct combination of available options.

It looks great on paper but it doesn’t work in practice. It often fails with puzzling lack of output. We had to investigate as it was constantly getting in the way of core18 work.

The service program sets up handlers for a number of signals and then loops chatting “I’m alive”, waiting for them to arrive.

The act of establishing the handler was racing with the test asking systemd to stop the service. This is because systemd already considered the service “ready” in absence of information to the contrary (as soon as the process is there really).

I tweeted about it sometime last week. This is just the on-going investigation. We fixed that source of raciness and hoped for the best. It was still racy.

The next thing we found was that shell is terrible. It is a constant reminder that any code written in shell is just waiting to bash you (ha ha) on the head.

This apparently also includes an extremely trivial two line script that uses trap to run a command on signal delivery. You see, we got lucky because we used a loop that did echo "running"; sleep 1. If we had used sleep 1h the bug would have been apparent.

Shell doesn’t run the signal handler when sleep is invoked. In fact by using strace we can clearly see that signal handlers are blocked before running commands requiring a new process.

Fortunately I had a version of the test program written in Python that didn’t have that issue. We switched to it hoping it would be the end of the story.

It isn’t, the test is still flaky. It’s interesting how reality challenges assumptions of simplicity.

The suspicion now is on how the test itself operates and how logging helpers extract the relevant fragment of the log for analysis. Exploration continues in https://github.com/snapcore/snapd/pull/5526

comments powered by Disqus