PFCTs: Post-facto Configuration Tinkerers (and Docker)

A PFCT is a class of software system used in server infrastructure management, typically addressing some subset of deployment, testing, monitoring, scalability, availability, and management problems. Some call them orchestration systems, others call them remote execution systems or configuration managers; still other subsets are known as cloud, cluster, infrastructure or virtualization management systems.

As potentially powerful tools for the operation (newspeak: 'devops', i.e. development and operations) of large computer systems, they have been receiving significant interest of late. Unfortunately, open and broadly informed discussion of their true properties appears to be relatively thin on the ground, and many consider them laced with hidden issues.

In other words:

Therefore, encouraged by the frequency of recent Hacker News discussions about these beasts, I have attempted to review the field and highlight the properties of the available systems to:

Scope

The scope of the survey is limited to systems which explicitly include:

  1. An open source development and distribution model.
    (This excludes vendor-specific cloud management solutions, such as those offered by major Linux vendors)
  2. Capacity to manage multiple systems concurrently.
    (This excludes RC systems such as openrc and systemd that are beginning to add support for managing containers and virtual machines directly, in a host-centric manner)
  3. Some form of functional dependency resolution.
    (Because all nontrivial systems include inter-service dependencies, and the ongoing management of these is critical for facilitating testing (within the developer workflow) and continuous deployment practices (for actual staging and production environments))
  4. Agnosticism toward languages and build systems (i.e. no single-language solutions, because TMTOWTDI)

Timeline

(javascript inline graphviz to be inserted here!)

Roughly: cfengine -> puppet -> chef -> ansible/salt

Comparison

Ansible

Looks: Young

Description:

"Ansible is significantly different from Puppet. For starters it is push-based rather than pull-based as a lot of Puppet setups are. The other important difference is that Ansible has top-to-bottom ordering. That's what first attracted me." [1]
"Though one of the core technology decisions behind Ansible is that you don't need much to get it running. You can scale things up as needed, such as fireball mode. And you can boot strap that process using Ansible. You don't even need DNS. Hosts can register back with Ansible on start up, this works today with their AWX product and cloud-init's phone home feature. Also your inventory of hosts can simply be queried from sources, such as AWS's inventory list, so in that case you can't have a host that's unaccounted for." [1]
"a manifest-ordered tool" [1]
"It's still using Python scripts to wrap the non-idempotent unix primitives - but at least it's clean, reusable code." [1]
"Basically with Ansible all the declarative stuff in ansible is there and you'll be able to do all those things you want to do from Puppet and Chef. However you can also do the app deployment stuff that you would typically do with Fabric or Capistrano." [1]
"the docs for beginners are excellent." [1]
"Ansible Pushes code from the controller to your nodes, while salt, puppet and chef all pull state from a master somewhere." [1]
"I see it as a 'nice Chef' or 'usable Puppet'." [1]

Criticism:

"Puppet works so well, Ansible seems like a typical 'not invented here' tragedy that frequently occurs in open source. Rather than coalesce around something that works and has years of experience and knowledge and bug fixing sunk into it, people create something from scratch. To be specific..." [1]
"Ansible has the disadvantage of unclear, implicit dependencies." [1]
"I've had experience with Chef, Puppet and Ansible. Ansible is the least complex, and we're using it daily. Re: Ansible community dynamic - I've gotten unfriendly feedback a few times and agree with the negative reputation. Aside from community, Ansible is a big step up, and I suspect Salt would be as well." [1]
"A bit slow and custom loops/dsl sometimes gets confusing. Managing hosts file is mostly the only "infrastructure" you need, but still is annoying. No Windows support (yet)" [1]

CFEngine

Looks: Grizzled

Description:

"[CFEngine] really invented the Prolog-style-Config-Mgmt [about] 20+ years ago" [1]
"one of the first configuration management systems that was deployed in anything approaching widespread use" [2]

Criticism:

"Its utterly atrocious. Combining global booleans as a weird sort of declarative flow control to make up for the lack of explicit dependencies between objects that everything else has is horrific. Thousands and thousands of these: (debian_6|debian_7).(role_postgres|role_postgres_exception):: And ordering? Nah, just run it 3 times in a row. :\" [rhys_rhaven]

Chef

Looks: Mature

Description:

"Chef is 'explicit' about ordering (I explicitly say exactly what is going to happen)" [1]
"one of the founding principles of Chef in that Adam Jacobs got tired of, in large systems, when you get one link in your dependency tree wrong you wind up with code which will pass tests and get deployed to prod then accidentally get broken the next day as the random ordering changes." [1]
"I don't quite understand custom DSLs though for configuration management. Giving users a library of idempotent code components like chef does I think is way better than a custom language that is almost but not quite or maybe turing complete. At some point you are going to want to iterate and loop over stuff and if there is anything that ant has taught us is that imperative things are better handled with imperative language constructs. Trying to shoehorn everything into a declarative format is the wrong approach." [1]
"The benefit with Chef is that it is always Ruby. There is no dropping in/out of anything other than Ruby. As a Ruby programmer I quite like that. It doesn't fight the language it is embedded in and uses all the language idioms to great effect." [1]

Criticism:

"With [resource dependencies in Chef], I would see "okay, C happens after B and A, but does C _really_ need to happen after A?" I don't know. I can't answer that. It requires much deeper domain knowledge of that infrastructure to answer that question. This leads to more of a nightmare [for me] months down the road, when I don't quite remember if something needs to happen after something else." [1]
"I can appreciate the simple chef/ansible way of provisioning things, but starting with some size I think that explicit dependencies are the only way to go. Puppet/salt allow me to do things that are quite hard to do in chef." [1]
"My major pain point was the complexity of Chef Cookbooks in comparison to Ansible Playbooks. I could hardly wrap my mind around how to write my own Cookbooks after exploring some of the ones I used (Ruby, rbenv, Git, Nginx.) Another major thing for me at least was the documentation, it seemed like Ansible had better documentation to me vs Chef. Finally another thing was the product offerings of Chef vs Ansible. Chef has some paid versions that offer more features, whereas Ansible is feature complete for free and they offer a GUI and Support instead which I preferred." [1]
"Tried Chef for a few weeks, hated every moment of it, switched to Ansible and it all made complete sense." [1]
"Ruby DSL is hard if you don't know Ruby. Lots of infrastructure to manage (if not using hosted Chef). On the fly orchestration requires 3rd party tools or Enterprise License." [1]

Puppet/Vagrant

Looks: Mature

Description:

"a configuration management language that is strongly influenced by Prolog and logic programming" [1]
"Puppet [order of execution] I would call 'implicit' based on solving a dependency graph and a bit of random hash order evaluation." [1]
"Puppet has implicit ordering (for example, a file will depend on its parent directory) as well as explicit (arrow syntax, new in 3.0, I think). But really, I don't understand the complaint. Writing explicit dependencies literally what Puppet -- or any other config management system -- is about. Puppet cannot possibly know what your intent is. So you have to tell it." [1]

Criticism:

"Puppet suffers hugely from the influence from the fact that ordering is unpredictable and that you have to use arrows or requires to add order. People accidentally forget to add ordering, the resulting code works most of the time, and then one day you start running into an issue because some implicit dependency either didn't happen, or happened in the wrong order."
"That's the main reason we've switched over from Puppet to Ansible. It has less abstraction layers, but at least its execution order is predictable. The easy learning curve for new colleagues and the compact syntax are also nice benefits." [1]
"The undeterministic declarative system (Puppet) has the disadvantage of unclear, missing dependencies." [1]
"Puppet's unpredictable ordering should be a write-and-forget benefit, but in reality it's very difficult to reason about. In the end, my script imposed a top-to-bottom ordering with requires. Also, developing configuration scripts against Vagrant is a huge pain. While `puppet parser validate` checks for syntax errors, most of the logical, dependency-based errors reveal themselves after Vagrant takes a good minute to fire up." [1]
"Having used Puppet for a few years, my biggest complaint is that it's not quite modular enough. For example, in one system I have a way of declaring an "app". An app may have some dependencies (Postgres, Memcached), settings, users, etc. I have two choices, I think; either have one large, monolithic define allowing me to instantiate the app from a bunch of config which means that every setting has to be a variable that's passed around and used by the define. Or I can split it up into pieces. The problem with the latter approach is that eventually, much the configuration needs to be collected into just a few actual config files. Let's imagine that the app uses a single "app.conf". Normally you would build it using a template. But with the second approach, you would have to build it out of fragments that come from multiple defines; probably you would collect them in a conf.d type folder, and then use an "exec" definition to concatenate them into "app.conf"." [1]
"Despite Puppet's nearly six year head start, Salt boasts more contributors to its code base (as per Ohloh.net), a superior comment-to-code ratio, an increase in year-over-year commits, and a lower barrier to entry for new contributors" [1]
"Puppet attempts to create idempotent actions to use as primitives, but unfortunately they're written in a weird dialect of ruby and tend to rely on a bunch of Puppet internals in poor separation-of-concern ways (disclaimer: I used to be a Puppet developer) and I think that Chef has analogous problems." [1]
"Crazy slow, linguistically poor, parser breaks in every release, really hard to test locally." [1]
"But yes, crazy slow alone destroys everything. Typical "fixes" include going masterless, yet there's no standardised distribution methods so you need to invent that yourself. Embedding all files into catalogs, thus turning network overhead into CPU overhead etc. Not to mention in the insane memory usage client side, i.e. on every single box." [1]
"It layers a custom DSL on top of a perfectly adequate language, uses standard terms like classes in non-standard ways, takes away the linear top/down flow that most programmers are used to, forces sequencing through notification chains, steamrolls over error messages willy-nilly, etc. Although I'm a bit biased so a few more data points would be helpful." [1]
"I tried working with Puppet long time ago. The idea of having 20 minute window for pushing out changes never seemed attractive to me." [1]
"We're in the process of switching from Puppet to SaltStack. It's a change measured in light-years." [1]
"In theory Puppet would be good for the Linux servers at least because it lets you declare things in an abstract way that can hinge on variables like distro, release, etc. In practice the Puppet language is only tolerable to the extent that it provides (or helps you create) abstractions for everything, and now you have two problems as they say." [1]
"We are moving from puppet to salt and I'm half way through and so far my git commits looks like this over the past month: puppet repo -14000 lines salt repo +1600 lines" [1]
"We could have likely saved 3-5k lines of code from a puppet rewrite from scratch, but it still wouldn't have been as simple as the Salt or Ansible code." [1]
"Puppet doesn't have native support for a lot of things, which require us to either implement it in puppet's DSL, or in custom ruby, which the upstream won't take. For instance: git, gems, pip, virtualenv, npm, etc. etc.. Puppet doesn't have looping. I'm always told: "Iteration is evil. Puppet is a declarative language and if you're needing to loop you're doing something wrong." But it's simply not true. Looping making things insanely simpler. Puppet isn't executed in order, even for the same service in the same environment across systems. You have to very diligently manage every require for ordering, and no one does it right. This had lead to systems unable to run first runs really often, which causes problems with autoscaling. I don't enjoy spending my time cleaning this up often. Puppet's DSL is full of little gotchas that constantly cause issues for developers who aren't very familiar with Puppet." [1]
"lack of simple search function vs complexity of exported resources... Another real issue is the slowness of compile process, which happens on the master. But it's OK for "smaller" deployments - like if you don't go above 10-20k nodes." [1]
"Custom DSL is json-y which for some is easier than Ruby. Scaling problems because puppetmaster compiles the manifests (instead of having nodes compile). 2 tools/interfaces for config vs orchestration (mcollective) gets confusing and not very consistent with features." [1]

Salt

Looks: Young

Description:

"Salt started life as a remote execution system: a class of software applications written to address concerns of the form, 'I have this command I want to run across 1,000 servers. I want the command to run on all of those systems within a five second window. It failed on three of them, and I need to know which three.'" [1]
"Salt is mcollective + puppet in one. So yes, you can use it as a config management system (it works very nicely as one). You can also use just the remote execution portion, and continue using puppet/chef/etc." [2]
"Salt builds configuration management on top of the remote executions system. The philosophy is that the two aspects of system management are fundamentally linked. Salt is also intended to be fast, easy to use and lean, lightweight and easy to set up. Salt differs from other remote execution systems because it is built on ZeroMQ instead of [an] AMQP system or SSH (like tentakel). This difference makes Salt much faster and lighter weight than SSH based systems. As for configuration management, Salt is made with the goal of being easier to get going. Many community member have been able to pick up Salt quickly and have expressed that it is easy to learn and maintain." [3]
"And salt will even output the command results for you in JSON. There's massive potential there for using salt for monitoring not just deployment." [4]

Criticism:

"There is also Salt. That had the "look really fast and responsive configuration" because it has the 0mq based distribution mechanism. But then Ansible added that too as a feature. Salt looked at and said "ok fine, we'll add SSH only option too". So now they are both basically solving a similar problem along with the older tools (Puppet, Chef etc)" [1]
"Haven't used Salt much, as I understand it they do take Puppets model. Even if you want that (and I don't) there are still a lot of ways you could do a better job, so I'm somewhat open to Salt." [1]
"Although it supports Windows in a sense, it's very rough around the edges. Many modules will fail or have weird edge cases on Windows. I've gotten to the point where the only module I really trust to work 100% of the time is cmd.run (which executes arbitrary shell commands). That said, it's been a total win so far. I've almost completely replaced ad hoc Windows server provisioning with version controlled, documented Salt states. It's glorious." [1]
"I've always been a bit wary of salt after: this. Perhaps unfairly so... yet, I'm not entirely put at ease by: this. Did salt ever move to a secure transport? Then there's the (linked above, inline) issue with RSA exponent." [1]
"it's darn hard to upgrade ... The command line utilities are prone to user error: they return success during failure, they return no output and success because your states took too long to run, and it got bored. You can look up the job ID, but it's painful... The errors are utterly useless. In particular, Jinja rendering errors tend to reference incorrect locations in files, returning nonsense such as use of an undefined variable on a blank line... The output is useless too: you get a (very) verbose listing of everything that succeeded or failed. Telling if anything failed is the trick: it's buried in all the successes. (Terminal find is my friend here, but still, you have to be careful to watch out for boundaries between runs and not read an old run's output.) As discussed, the return code won't help you here... AFAICT, you need to be a particular user, and there is really no ACLs to speak of. All of our Salt stuff currently runs as a single user. People inevitably step on each others' toes... Non-responsive nodes are not mentioned in the output: they're the same as if they didn't exist! This results in some really wacky stuff happening. If you have variables that are lists of machines, the machine simply won't be in the list. This means if you need N of some type of machine, that list will be empty. (This often then triggers the aforementioned unreadable jinja error output, if you assume the list to be non-empty.)... There is little capability for actual processing on the master itself. Sometimes, you need to coordinate the actions of several nodes together, such as generating keys for each node, and then distributing all keys to all nodes." [1]
"Not as mature so it can't do some advanced stuff Puppet/Chef can do. Last I looked at web UI (Halite) it was not much to look at. Hardly any integration into 3rd party tools (most favor Puppet)" [1]
"Salt is still very much in development. There are multiple open bugs on multiple core features (win repo comes to mind) which simply do not work as documented, period." [1]

Common limitations

Relevant Industry Trends

Other people's thoughts...

Conclusion: A Refined Problem Statement?

After surveying the field, I think we can at least humbly attempt a refined version of the original problem statement: something that will let us direct our development efforts in the most agreeable of directions, toward honestly solving a broader problem for a larger number of people.

So what is it that we need? Something that:

No, not Docker!

What's that we hear at the back of the room? A murmur from the San Franciscans? Out with it!

"Yes, yes! But what about Docker?"

I've purposefully saved mentioning it until the end because it's an interesting case, and because it fell outside the scope of the above summary:

Having slammed it for its imperfections, one must concede that docker does beat the PFCTs at one thing: solid output. When you say "give me environment x", you get it, and fast.
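
(The speed is easy to demonstrate with the stock example; this assumes a working docker host and the standard ubuntu base image.)

    docker pull ubuntu                    # fetch a known base image once
    docker run -i -t ubuntu /bin/bash     # seconds later: a shell in a pristine environment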

This positive property derives from the general approach: post-facto procedural configuration tinkering is infinitely more bug-ridden than just spinning up a container with fully known contents... or is it?

In fact, if what goes on in the container (honest Freudian typo: containter!) after spinning up uses the network to get you to a desired state, then it's just as flaky and fallible.

For example, if your sexy demonstration dockerfile includes lines like RUN apt-get install -y memcached, then your builds are not reliably repeatable: make no mistake, you're building on a house of cards should the network fail or the world move on from present-day apt.

Yes, you can work around this by pre-caching packages in your image so that apt-get doesn't touch the network. No, docker doesn't enforce that at the moment. I would argue this is actually a huge problem.
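
A minimal sketch of that workaround, assuming a debs/ directory shipped alongside the dockerfile (the package names and versions are illustrative, with libevent standing in for whatever the dependency chain actually contains):

    # On a machine with network access, fetch the exact packages once,
    # dependencies included, into the build context:
    mkdir -p debs && cd debs
    apt-get download memcached libevent-2.0-5

    # The dockerfile then installs from those local copies only, so the
    # image build itself never touches the network:
    #   ADD debs /tmp/debs
    #   RUN dpkg -i /tmp/debs/*.deb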

What's to be done? Obviously, it would be a huge, error-prone, largely tedious barrier for new users if docker asked everyone packaging a service to write pre-caching steps into their dockerfiles and generate new source images... instead of using FROM ubuntu-someversion they'd be writing services FROM ubuntu-someversion-withprecachingtobuild-x-version-of-y-service. That's going to get unwieldy really quickly, and it breaks the attraction of the apparent simplicity of the abstractions inherent in docker itself.

Clearly then, we need to rethink. Could the common logic for pre-caching all dependencies of software builds on these various platforms be generalized and attached to the concept of a platform, such that any build process using FROM ubuntu-* would actually run transparently twice: once to pre-cache the required dependencies (to 'build' the 'service package'), and once to actually generate the resulting service (to 'instantiate' the 'service package')?
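
As a sketch of how that might feel in practice (the pfct-build tool and all of its flags are entirely hypothetical):

    # Pass 1: build with network access, recording every dependency that
    # gets fetched and bundling the lot into a self-contained 'service package'
    pfct-build --record-deps --out myapp.servicepkg ./myapp/

    # Pass 2: instantiate the service from the package alone, with the
    # network disabled, proving the result repeats anywhere, any time
    pfct-build --offline --from myapp.servicepkg ./myapp/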

Would this additional complexity really provide anything tangible?

I think so.

Firstly, your service packages can be tiny (relative to an entire OS filesystem snapshot tarred up for portability between infrastructure providers) yet reliable (in the face of vanished network locations or a total lack of network access), leading to absolute build repeatability across arbitrary infrastructure and time. That eliminates an entire class of bugs, which is, in anyone's estimation, an efficient approach to take when designing a modern 'devops' workflow.

Secondly, you have made target platforms first-class citizens within your development process and infrastructure. This opens up a world of possibilities:

Finally, you've shifted the goalposts far enough that you may as well use a different codebase to docker, which means you don't have to bother learning go. =)

But there's more...

Aside from docker's absence of service-level dependencies (last time I checked), even if they do exist they are likely to be only a partial implementation of desirable service-oriented architecture (SOA) properties. Which is a rather enterprisey way of saying: it's going to be missing features. Which features?

The corosync/pacemaker stack gets us the first three points by allowing neat management of arbitrary service interdependencies, including real-time failover support, through a declarative configuration syntax. This seems a logical layer upon which to build commonly desired service availability features at the devops process level, without resorting to software service-specific attempts at replication, clustering or failover (which typically each come with their own learning curve, re-releases, security gotchas and whatnot). Basically, if you can put the service in a box and control the box, declare when the box is working ("responds to ping? responds to http query?"), corosync/pacemaker can make it highly available after a fashion. This seems a useful thing to do, since we can then give this to service authors targeting most common platforms with arbitrarily complex inter-service dependency hierarchies for free. If it's stable enough for German air traffic control, then chances are it's stable enough for your purposes.
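
For a flavour of that declarative syntax, here is a minimal pacemaker sketch in the crm shell (resource names and addresses are illustrative): a floating IP, a web server whose "working" state is defined by its monitor check, and explicit dependencies between the two.

    # keep a floating IP alive, checking it every 10 seconds
    crm configure primitive p_ip ocf:heartbeat:IPaddr2 \
        params ip=192.168.1.100 op monitor interval=10s
    # keep apache running; "responds to http query?" is its monitor action
    crm configure primitive p_web ocf:heartbeat:apache \
        op monitor interval=30s
    # the web server runs where the IP is, and starts after it
    crm configure colocation web_with_ip inf: p_web p_ip
    crm configure order ip_before_web inf: p_ip p_web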

The final point, multi-site support, is one that needs to be considered by the overall devops process. Instead of assuming that all instances of all services are always visible on the same infrastructure, centrally managed, never suffering network connectivity issues, and never requiring any inherent changes to devops process security, the overall devops process should be redesigned as a parallel system that is highly available (permitting failure and automatic recovery) yet demands rigorous trust at every point.

Can we achieve this by making services and platforms the solid rocks we base operations processes on (through versioning/packaging/holistic dependency inclusion/cryptographic signatures)?
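
(Nothing exotic is required for the cryptographic part, at least; the file names below are hypothetical.)

    sha256sum --check myapp.servicepkg.sha256            # contents are exactly what was built
    gpg --verify myapp.servicepkg.sig myapp.servicepkg   # and were signed by a key we trust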

Actual Conclusion

Most stuff out there is sucky, and needs a rethink.


To (realistically) add to the mess or (optimistically, hah!) resolve it, I've created a new tool (actually, the second iteration thereof) known as cims (Cloud Infrastructure Management Service) which attempts to implement the different approach outlined above.

It's not perfect or beautiful, but after analyzing what's going on around the world I do believe it is at least unique in that it's heading in the right direction, being adequately holistic in its outlook.

I'm hoping to get it open sourced in the near future, but for the moment you can check out the documentation and discuss @ Hacker News. The general approach is as follows.

Elements within the Cloud Infrastructure Management Service

At the very least, hopefully some of you enjoyed this line of thinking, buzzword-free. To be clear, this is a consensus-building exercise on the nature of the problems in this area, not a "here is the one true solution" proposal, though we could work forwards from here... absence of code equals bounty of potential :)

— Walter @ this-domain

External links