PFCTs: Post-facto Configuration Tinkerers
Here lie some thoughts on the systems that run the systems that, in truth, run some of the most important software and services in the world today.
The term PFCT (post-facto configuration tinkerers) is used in jest to refer to the current (i.e. previous) generation of systems, for two reasons: such a wide range of terms has been used for systems that in truth fulfil only parts of this spectrum of functions, and we know that they do not really do things 'right' (e.g. too much manual process, limited portability to other platforms, etc.).
This is a list of features you should ask for in your infrastructure abstraction system.
- Improved workflow
- Automated deployment (Continuous Deployment/Continuous Integration)
- Automated scaling
- Automated failover/high availability
- Automated service discovery (read: functional service topology dependency resolution)
- Resource and capacity planning support
- Real-time least-cost-overhead selection among third-party infrastructure providers that meet some disparate set of performance requirements, etc.
- Arbitrary OS support
Presently, I'm not aware of any open source tools actually offering this complete feature set in a decently adaptable way.
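To make the provider selection item in the list above a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the Offer fields, the provider names, the prices and the two requirements (latency and bandwidth) are hypothetical, and a real system would pull live pricing and measurements rather than a hard-coded list.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Offer:
    """One hypothetical quote from a third-party infrastructure provider."""
    provider: str
    hourly_cost: float      # currency units per hour (illustrative)
    latency_ms: float       # measured or advertised latency
    bandwidth_mbps: float   # measured or advertised bandwidth


def cheapest_offer(
    offers: Iterable[Offer],
    max_latency_ms: float,
    min_bandwidth_mbps: float,
) -> Optional[Offer]:
    """Return the lowest-cost offer that meets every requirement, or None."""
    eligible = [
        o for o in offers
        if o.latency_ms <= max_latency_ms
        and o.bandwidth_mbps >= min_bandwidth_mbps
    ]
    # min() with default=None avoids raising when nothing qualifies.
    return min(eligible, key=lambda o: o.hourly_cost, default=None)


if __name__ == "__main__":
    # Made-up quotes; none of these names or numbers refer to real providers.
    quotes = [
        Offer("provider-a", 0.10, 40.0, 100.0),
        Offer("provider-b", 0.05, 120.0, 1000.0),
        Offer("provider-c", 0.08, 30.0, 500.0),
    ]
    print(cheapest_offer(quotes, max_latency_ms=50.0, min_bandwidth_mbps=100.0))
```

The point of the sketch is only that requirements act as a filter and cost acts as the ranking. The hard parts (keeping the inputs current, and accounting for migration overhead) are exactly what current tools leave to manual process.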
Network Related Features
Excerpt from Software-Defined Networking: A Comprehensive Survey, Kreutz et al. Version 1.0, May 31, 2014 follows.
See also the OpenDaylight and Project Floodlight SDN controllers, Giacomo Bernadi of NGI Italy's 'SDN for real!' RIPE 2014 presentation, and the Open Networking Foundation (OpenFlow).
There are also some insights into attempting to get SDN working with current high-end vendor gear (optical and layer 3) in the 'NTT GIN SDN' RIPE 2014 presentation.
- There is a clear trend towards segregating state from services where possible and using supervisory processes to reset services to a known good state, either regularly or upon detected deviation from established behavior. This is not a new approach, but it is rarely well deployed in practice, owing to architectural challenges in most software that prevent this style of process control. The approach was recently fairly well summarized in the September 2014 free PDF book Stuff Goes Bad: Erlang in Anger:
Failures will happen no matter what, whether they're developer, operator, or hardware-related. It is rarely practical or even possible to get rid of all errors in a program or a system. If you can deal with some errors rather than preventing them at all cost, then most undefined behaviours of a program can go in that "deal with it" approach.
This is where the "Let it Crash" idea comes from: Because you can now deal with failure, and because the cost of weeding out all of the complex bugs from a system before it hits production is often prohibitive, programmers should only deal with the errors they know how to handle, and leave the rest for another process (a supervisor) or the virtual machine to deal with.
Given that most bugs are transient, simply restarting processes back to a state known to be stable when encountering an error can be a surprisingly good strategy.
It seems, then, that perhaps what the world is looking for is a simple process by which to apply Erlang-like control processes to arbitrary software, without substantially raising the cognitive load on developers or infrastructure managers, or unduly locking them into a single operating system, platform, network topology or other architectural paradigm.
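As a rough illustration of what an Erlang-like control process for arbitrary software might look like, the following Python sketch restarts a command whenever it exits abnormally, and gives up once too many restarts happen within a time window (loosely modelled on the restart limits Erlang supervisors use). The function name, its parameters and the injectable run hook are all assumptions of this sketch, not an existing tool's API, and it deliberately ignores the harder problem of resetting external state.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=3, window_seconds=60.0,
              backoff_seconds=0.0, run=subprocess.call):
    """Run cmd, restarting it after every non-zero exit.

    Returns True if the command eventually exits cleanly, or False once
    max_restarts restarts have already been used within window_seconds.
    The run argument is a hypothetical hook so the launcher can be swapped.
    """
    restart_times = []
    while True:
        exit_code = run(cmd)
        if exit_code == 0:
            return True
        now = time.monotonic()
        # Only restarts inside the sliding window count against the limit.
        restart_times = [t for t in restart_times if now - t < window_seconds]
        if len(restart_times) >= max_restarts:
            return False  # escalate: something is persistently wrong
        restart_times.append(now)
        time.sleep(backoff_seconds)


if __name__ == "__main__":
    # A stand-in "service" that always crashes, to show the give-up path.
    crashing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print("recovered:", supervise(crashing, max_restarts=2))
```

The supervised program needs no knowledge of the supervisor, which is the property that matters here. The restart limit is what keeps "let it crash" from turning into an endless crash loop when the fault is not transient.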
Some related discussions
Books, Papers and Videos
walter at-symbol-thinggy this-domain