A stabilized approach to systems orchestration
December 19, 2020
Abstract
Configuration management is a term that is usually used to describe a declarative approach to systems administration. Declarative configuration allows you to write down the intended state of a system. Changes are communicated to the rest of the team through a series of commits. The benefits of these tools are immense, but validating changes is slow and prone to error. Sequencing operations is also possible, but at the expense of eye-watering complexity. In this session we explore the following topics:
Preamble
I will never be a pilot, but I have a good deal of admiration for these men and women. One of the features of this profession that inspires me is the personal nature of working tight formation with the rest of the crew. Another is their habit of verbally cross-checking their actions with procedures and information they are receiving. These verbal and manual confirmations work just as well when operating solo as they do when working with a copilot.
Pilots also learn some outstanding methods of ordering priorities. The motto that most directly speaks to this is,
Aviate, Navigate, Communicate
What does it mean to "Aviate"? That means fly the airplane.
Simulator training
Never bust minimums
Train for all scenarios known to be problematic
Dan Gryder is a flight instructor who has taken up the cause of studying and learning from accident reports in general aviation. Several hard personal experiences in his life put him on a mission to give pilots in single-engine aircraft everything they need to avoid a loss of control.
Part of his technique is to import the habits and tools that commercial airlines have. One of the interesting experiments he conducted was to ask airline pilots which skill was more important: a) stall recovery or b) energy management. Can you guess what they said?
The important skill is avoiding loss of control. Do you think the pilots who work for Southwest trade stories at the bar about adventures while stalling a 737? I hope not!
Here is what Dan had to say to those in general aviation:
Learn to define and honor the 1.3 buffer at all times. Define it, placard it, honor it. It is what the airlines do every day. Do you think they memorize all those speeds? No, they are clearly defined and placarded for them at all times.
What is the "tool" or the "placard" he is referring to?
On a small aircraft, the tool [in this case] is a bright piece of tape on the airspeed indicator. Now when the pilot is distracted or under pressure, one thing he does not have to remember is the minimum maneuvering speed. This maneuvering speed is calculated ahead of time to allow up to a 30 degree bank angle. When something unexpected happens, that piece of tape on the airspeed indicator will help him keep the machine flying, all the way to the end.
Dan continues,
The airline record is impressive...as they now train and check all possible scenarios (called maneuvers) known to be problematic over the course of time.
Build/test source code | < 10 seconds |
Trial deployment | < 90 seconds |
Push to all users | < 10 minutes |
Software engineers don't have standard operating procedures, but every well-managed project has a substantial list of rules and processes to follow. Don't believe me? Try submitting a patch to your favorite open-source project and find out how much correction you receive.
The strategy a software project employs should be composed of everything they need to maintain a stabilized approach:
I'm assuming that you ship tests with the code, but maybe this doesn't make sense for what you're doing. There are many times where you put together some specialized or very slow regressions and then throw them away after they have served their purpose.
The point is that your approach to development is able to transition from one stable condition to another stable condition. To go back to the example of aviation: a stabilized approach lightens the workload and gives you situational awareness. How so? You've done all the figuring in advance.
Test configuration on an existing host | < 10 seconds |
Provision new infrastructure | < 90 seconds |
Commit, propagate | < 10 minutes |
This slide shows systems configuration when framed from the perspective of software engineering.
I think there is a strong case to be made that productivity in software development and systems administration correlates directly with the time it takes to validate a change. Delay in feedback changes your approach to development.
If testing a change to a host takes more than, say, 10 seconds, it's not providing interactive feedback. If you can't iterate on a problem efficiently, you will inevitably compensate by manually testing fragments of code and configuration outside of your repository.
As with software development, there is a strategy to systems administration
{% for file in ["miniupnpd.conf", "dhcpd.conf", "mail/smtpd.conf"] %}
/etc/{{file}}:
  file.managed:
    - source: salt://home/{{file|replace('/', '_')}}
{% endfor %}
Templates, variables
Massive APIs
Modules and extensions
Product roadmap (features, bugs)
The longer I worked with large frameworks like Salt the more I valued their capabilities, and the more I found the framework itself to be an obstacle to what I was trying to accomplish.
Far too often we commit the change in order to test the change.
# step 1
/usr/local/bin/mysql_install_db:
  cmd.run:
    - creates: /var/mysql
# step 2
Dependencies, sequencing
Progressive status?
Arid programming environment
Eventually configuration management frameworks grow large enough to be called orchestration frameworks. What's not to like?
Orchestration is an advanced topic for configuration management, but all of you do this already. It's called scripting.
ssh 10.5.5.1 < base-cfg.sh
ssh 10.5.5.1 < configure-wordpress.sh
What's missing?
Let's try our hand at configuring a system using some scripts. This is a solution that is too simple. The first reason it is too simple is that we didn't ship the resources the scripts will need:
We also need a convention for associating configuration with each host, and hopefully a means of running only the part we are trying to test.
Add/upgrade packages |
Install files, directories, symlinks |
Enable/start/restart services |
Map units of work into a profile for each host |
There are a few operations that configuration management systems must be able to accomplish. We need a mechanism for adding packages, installing files, and controlling services. With only these three things, you can accomplish some valuable tasks, such as
Also, we have a versioned history of configuration changes that the rest of your team can follow. These are critical capabilities, and it is for good reason that configuration management has become mainstream.
With these fundamentals in mind, let's try again.
alias rexec="ssh -T -S /tmp/control $1"
rexec -fN -M                  # start control master
rexec mkdir /tmp/staging      # scratch space
case $1 in
  192.168.0.2)                # copy files, run scripts
    tar cf - util wpconfig | rexec tar xf - -C /tmp/staging
    rexec < base-cfg.sh
    rexec < configure-wordpress.sh
    ;;
esac
rexec rm -r /tmp/staging      # clean up
rexec -O exit                 # end control master
As crude as this tiny framework is, it has some advantages:
What are we missing?
/tmp
In short, this works! But notice this:
Stages configuration and utilities on the remote machine
Secure access to remote files: rinstall(1)
Ability to run everything, or use a pattern to match labels: pln(5)
rset(1) is a tool that provides conventions and a way to execute scripts with access to particular resources.
As we have already observed, the ability to execute scripts is not sufficient. You also need a collection of utilities or utility libraries. rset(1) creates a temporary directory populated with the tools and materials you will need for the task at hand.
It comes with a server for access to large files, and some built-in utilities that know how to install or modify files.
rset(1) uses its own container format. This is different, and I think it sets rset(1) apart from other attempts at a minimalist configuration management system.
Blocks of configuration can be selected individually
Label names beginning with [0-9a-z] are excluded by default:
root_tasks:
→ crontab - <<-EOF
→ ~ 1 * * * /usr/local/bin/renewcert
→ EOF
Parameters apply to subsequent labels
interpreter=/bin/sh -x
Progressive Label Notation is a tab-indented file format that allows you to organize configuration.
The content of each label is indented with tabs. If you do not have a capable text editor, this might be a problem for you. Why tabs?
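One property of the shell that pairs well with tab indentation, offered here as an observation and not necessarily the design rationale: the `<<-` here-document operator strips leading tabs (and only tabs) from the body and the terminator. That is why a tab-indented block such as the crontab example can still end its here-document cleanly. A minimal sketch, where `heredoc_demo` is a name made up for illustration and the tabs are generated with printf so they survive copy and paste:

```shell
# heredoc_demo: feed sh a small script whose here-document body and
# terminator are both indented with one tab (\t). Because the operator
# is <<- the tabs are stripped before cat sees the text.
heredoc_demo() {
    printf 'cat <<-EOF\n\tfirst line\n\tEOF\n' | sh
}

heredoc_demo
```

With spaces instead of tabs the terminator would not be recognized, which is one practical reason to insist on real tab characters.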
routes.pln associates configurations with each hostname
vm2.eradman.com: vm2/
→ vm2.pln
→ wordpress.pln
172.16.0.5: alpine/
→ alpine_vm.pln
Directories listed after the hostname labels are copied to the staging directory
The "top-level" configuration file is called routes.pln by default.
Paths after the : are directories (configuration files, scripts, libraries, anything) that you want staged on the remote host.
Dynamic inventory is a feature that you can handle yourself. rset reads a file. Use any means you'd like to generate this file if need be.
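As a rough illustration of generating this file yourself, the sketch below turns a flat host list into entries laid out like the routes.pln example above. The input format (hostname, directory, then one or more .pln files per line) and the function name `gen_routes` are assumptions made up for this example; they are not part of rset.

```shell
# gen_routes: read "hostname directory pln-file..." records on stdin and
# print one routes entry per record. The input format is hypothetical.
gen_routes() {
    while read -r host dir files; do
        # skip blank lines and comment lines
        case $host in ''|'#'*) continue ;; esac
        printf '%s: %s\n' "$host" "$dir"
        # each referenced .pln file goes on its own tab-indented line
        for f in $files; do
            printf '\t%s\n' "$f"
        done
    done
}

# Usage: in practice the input could come from a database query or an API.
gen_routes <<EOF
# host            directory  files
vm2.eradman.com   vm2/       vm2.pln wordpress.pln
172.16.0.5        alpine/    alpine_vm.pln
EOF
```

Redirecting the output to routes.pln before each run would keep the generated inventory out of the tool itself, which seems to be the point of the paragraph above.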
├── _rutils          # utilities always staged
│   ├── rinstall
│   └── rsub
├── _sources         # files served over http
└── routes.pln       # configuration mapping
Extend core functionality on all hosts using _rutils
Built-in web server serves files under _sources
Put everything else in directories specified in routes.pln
rset(1) has three methods of providing access to files:
_rutils will show up in the current working directory on the remote host
_sources is accessible over a local port-forward to a built-in web server (access is not restricted, but large files are fine)
Directories specified in routes.pln will be copied (push only, remote hosts cannot request this content)
rinstall(1)
./rinstall xa10/pf.conf /etc/pf.conf \
&& pfctl -f /etc/pf.conf
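The `&&` in the command above only makes sense if the installer reports success when, and only when, the target was actually updated. The sketch below shows that pattern with a stand-in helper; `install_if_changed` is a hypothetical function written for illustration, not the rinstall implementation, and it ignores ownership, modes, and remote sources.

```shell
# install_if_changed SRC DST: copy SRC over DST only when contents differ.
# Returns 0 if DST was created or updated, 1 if it was already current.
# (Hypothetical stand-in used to illustrate the exit-status convention.)
install_if_changed() {
    src=$1 dst=$2
    if [ -f "$dst" ] && cmp -s "$src" "$dst"; then
        return 1
    fi
    cp "$src" "$dst"
}

# Usage: the follow-up command runs only when the file was updated.
demo=$(mktemp -d)
echo "pass in all" > "$demo/pf.conf.new"
install_if_changed "$demo/pf.conf.new" "$demo/pf.conf" && echo "reload needed"
install_if_changed "$demo/pf.conf.new" "$demo/pf.conf" || echo "no change"
rm -r "$demo"
```

A real tool would also want to distinguish "no change" from a genuine error, which this sketch does not attempt.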
rsub(1)
./rsub /etc/firefox/unveil.main <<-CONF
/usr/local/heimdal/lib r
/usr/lib r
CONF
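To make the idea of a managed block of text concrete, here is a small sketch of one way such a block could be maintained. The function name `managed_block` and the marker strings are assumptions invented for this example; rsub's actual markers and behavior may differ.

```shell
# managed_block FILE: replace the marked block in FILE with the text on
# stdin, appending the block if the file has none yet.
# The marker strings below are hypothetical.
managed_block() {
    file=$1
    begin='# BEGIN MANAGED BLOCK' end='# END MANAGED BLOCK'
    new=$(mktemp)
    # copy every line that sits outside an existing marked block
    [ -f "$file" ] && awk -v b="$begin" -v e="$end" '
        $0 == b { skip = 1; next }
        $0 == e { skip = 0; next }
        !skip' "$file" > "$new"
    # append the fresh block read from stdin
    { echo "$begin"; cat; echo "$end"; } >> "$new"
    # overwrite in place so existing file permissions are preserved
    cat "$new" > "$file"
    rm "$new"
}

# Usage: running this twice leaves exactly one block in the file.
demo=$(mktemp -d)
echo "/usr/lib r" | managed_block "$demo/unveil.main"
echo "/usr/lib r" | managed_block "$demo/unveil.main"
cat "$demo/unveil.main"
rm -r "$demo"
```

Because the block is rewritten as a unit, repeated runs converge on the same file content, which is the property that makes this safer than appending lines with a plain redirect.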
Some solutions are too simple. Landing on a remote host with a staging directory and /bin/cp is not enough. Two built-in utilities are automatically shipped:
rinstall: exits 0 only if file was installed or changed! [Very important]
rsub: uses stdin to create a block of managed text
Development environment is not authoritative
Critical systems on the edge of a network
Observation requires a terminal on both sides
In aviation there's a funny term for making a final approach with the engines idling. This is sometimes called a "dead stick" landing, and bringing a functional aircraft down this way is not very safe. There are a couple of reasons jets land under power; the most significant is that it takes precious time to spool up the engines. A powered final approach gives the pilot the capability to adjust or abort.
In some environments every configuration task is like a forced landing because your personal test environment is slow to react and is not the same as the official deployment mechanism from top of tree.
This is what running configuration before you commit does; it gives you the ability to assess your current approach and to adjust course.
There are some solutions that client-server systems seem to be well suited for. I don't like agent-based configuration for several reasons:
# jumphost/hostname.wg0
wgport 111 wgkey JUMP_HOST_PRIVATE_KEY
wgpeer ROAMING_HOST1_PUBLIC_KEY wgaip 10.0.0.20/32
wgpeer ROAMING_HOST2_PUBLIC_KEY wgaip 10.0.0.21/32
inet 10.0.0.1/24
# thinkpad10/hostname.wg0
wgkey ROAMING_HOST1_PRIVATE_KEY
wgpeer JUMP_HOST_PUBLIC_KEY wgendpoint proxy.xyz.com 111 wgaip 0.0.0.0/0
inet 10.0.0.20/24
Even if you have mobile clients, you probably don't need a pull-based configuration scheme. I say this because we now have WireGuard to build links. WireGuard is revolutionary in its simplicity.
wg(4) interface guarantees the identity of interfaces
The only thing you might want to add is a cron job that sends a ping in order to establish the tunnel.
rset(1) doesn't provide flags for controlling connection options. It doesn't need to, because options such as ConnectTimeout, ProxyJump and anything else you can imagine can be specified in ssh_config(5).
Factor out common or complex operations into dedicated utilities
Stage configuration data, scripts, and utilities on the remote host
Map units of work into a profile for each host
I have heard it said that there is a paradox with respect to learning a subject. The first maxim is this: the more you know, the more you can see there is to learn. The second is that once you've mastered a topic you finally see how simple it was all along.
May I submit that configuration management simply means that we have a way of associating configuration with each host. Orchestration is really a flamboyant term for scripting with configuration data, scripts, and utilities already staged.