What is operability, how does it promote resilience, and how does building operability into your services drive Continuous Delivery adoption?
In this series, Steve Smith explains what operability is, how to design Service Level Objectives, how to measure leading and trailing indicators of operability, and why You Build It You Run It matters so much to Continuous Delivery as well as operability.
This is part 1 of the Build Operability In series
Bad things happen when problems are protected by a force field of tediousness – Ben Goldacre
The origins of the 20th century, pre-Internet IT As A Cost Centre organisational model can be traced to the suzerainty of cost accounting, and the COBIT management and governance framework. COBIT has recommended sequential Plan-Build-Run phases to maximise resource efficiency since its launch in 1996. The justification was the high compute costs and high transaction costs of a release in the 1990s.
With IT As A Cost Centre, Plan happens in a Product department and Build-Run happens in an IT department. IT will have separate Delivery and Operations groups, with competing goals:
- Delivery will be responsible for building features
- Operations will be responsible for running services
Delivery and Operations will consist of functionally-oriented teams of specialists. Delivery will have multiple development teams. Operations will have Database, Network, and Server teams to administer resources, a Service Transition team to check operational readiness prior to launch, and a Production Support team to respond to live incidents. This can be referred to as You Build It, They Run It.
Siloisation causes Discontinuous Delivery
In High Velocity Edge, Dr. Steven Spear warns that over-specialisation leads to siloisation, causing functional areas to “operate more like sovereign states”. Disparate Delivery and Operations teams with orthogonal priorities will inject multiple handoffs into a technology value stream, and create queues that disproportionately inflate lead times. Each handoff means waiting for a downstream team to become free to complete a task, and may generate rework for upstream teams. The delays and rework incurred will constrain flow such that product demand cannot be satisfied.
These problems are exacerbated when development teams in different technology value streams depend on the same Operations teams. Whenever multiple production releases stall with a particular Operations team, the short-term countermeasure will be a contentious management escalation that increases global coordination costs, and interrupt-driven toil for that specific team. The medium-term countermeasure will be increasing the size of the Operations teams, which has its own financial costs.
For example, a development team needs an operational readiness check of their service prior to launch. That can only be done by the Service Transition team, and the review is delayed by a fortnight until they are available. The completed review identifies a litany of corrective actions for the development team, the day before live launch. As more development teams are added, the Service Transition team come under tremendous strain and lead times for new services substantially worsen.
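The queue's effect on lead times can be illustrated with Little's Law, which states that average lead time equals work in progress divided by throughput. The figures below are hypothetical, but they show how a shared team with a deep queue inflates every dependent team's lead time:

```python
# Little's Law illustration: average lead time = WIP / throughput.
# All figures are hypothetical.

def lead_time_days(wip_items: float, throughput_per_day: float) -> float:
    """Average lead time, in days, for items queued with a shared team."""
    return wip_items / throughput_per_day

# A Service Transition team completing 1 review per day, with 10 queued reviews:
print(lead_time_days(10, 1.0))  # 10.0 days before a new review even starts

# Doubling demand without adding capacity doubles everyone's wait:
print(lead_time_days(20, 1.0))  # 20.0 days
```

Adding more development teams increases work in progress while throughput stays fixed, which is why lead times for new services worsen disproportionately.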
Ron Westrum defined a pathological culture in A Typology of Organisational Cultures as power-oriented interactions, with low cooperation and neglected responsibilities. The fundamentally opposed incentives, nomenclature, and risk appetites within Delivery and Operations will cause a pathological culture to slowly emerge over time. Delays and rework will gradually increase until Discontinuous Delivery becomes permanent.
Robustness breeds inoperability
Most IT cost centres try to achieve reliability by Optimising For Robustness, which means prioritising a higher Mean Time Between Failures (MTBF) over a lower Mean Time To Repair (MTTR). This is based on the idea that a production environment is a complicated system, in which homogeneous service processes have predictable, repeatable interactions, and failures are preventable.
Reliability is dependent on operability, which can be defined as the ease of safely operating a production system. Optimising For Robustness produces an underinvestment in operability, due to the following:
- A Diffusion Of Responsibility between Delivery and Operations. When Operations teams are accountable for operational readiness and incident response, Delivery teams have little reason to work on operability
- A Normalisation of Deviation within Delivery and Operations. When failures are tolerated as rare and avoidable, Delivery and Operations teams will pursue cost savings rather than the ability to degrade on failure
In Resilience and Precarious Success, Mary Patterson et al state “fundamental goals (such as safety) tend to be sacrificed with increasing pressure to achieve acute goals (faster, better, and cheaper)”. Delivery and Operations teams incentivised to work faster and cheaper will create brittle systems, with symptoms including:
- Inadequate telemetry – an inability to detect abnormal conditions
- Fragile architecture – an inability to limit blast radius on failure
- Operator burnout – an inability to perform heroics on demand
- Blame games – an inability to learn from experience
Underinvesting in operability is ill-advised as failures are entirely unavoidable. A production environment is actually a complex system, in which heterogeneous service processes have unpredictable, unrepeatable interactions, and failures are inevitable. As Richard Cook explains in How Complex Systems Fail, “the complexity of these systems makes it impossible for them to run without multiple flaws”.
A production environment is perpetually in a state of near-failure. Its flaws will be tightly coupled and interconnected, and if left unresolved they could lead to catastrophe. A failure occurs when multiple flaws unexpectedly coalesce and impede a business function, and the costs can be steep for a brittle, inoperable service. Inadequate telemetry widens the sunk cost duration from failure start to detection. A fragile architecture expands the opportunity cost duration from detection until resolution, and the overall cost per unit time. Operator burnout increases all costs involved, and blame games allow similar failures to occur in the future.
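These cost durations can be made concrete with a toy model, in which the total cost of a failure is the sunk cost duration plus the opportunity cost duration, multiplied by a cost per unit time. All figures here are hypothetical:

```python
# Toy cost-of-failure model; durations in minutes, costs in GBP per minute.
# All figures are hypothetical, for illustration only.

def failure_cost(detect_mins: float, resolve_mins: float, cost_per_min: float) -> float:
    """Total cost = sunk cost (failure start -> detection)
    plus opportunity cost (detection -> resolution)."""
    return (detect_mins + resolve_mins) * cost_per_min

# Inoperable service: a telemetry gap delays detection,
# and a fragile architecture delays resolution.
inoperable = failure_cost(detect_mins=45, resolve_mins=120, cost_per_min=100.0)  # 16500.0

# Operable service: ubiquitous telemetry and an adaptive architecture
# shrink both durations for the same failure.
operable = failure_cost(detect_mins=2, resolve_mins=15, cost_per_min=100.0)  # 1700.0

print(inoperable, operable)
```

The model is deliberately simple, but it shows why inadequate telemetry and a fragile architecture multiply the cost of the same underlying failure.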
Resilience needs operability
Optimising For Resilience is a more effective reliability strategy. This means prioritising a lower MTTR over a higher MTBF. The ability to quickly adapt to failures is more important than fewer failures, although some failure classes should never occur and some safety-critical systems should never fail.
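The trade-off can be illustrated with the steady-state availability formula, availability = MTBF / (MTBF + MTTR). The figures below are hypothetical, but they show how a service that fails more often yet recovers quickly can be more available than one that fails rarely and recovers slowly:

```python
# Illustrative only: steady-state availability as a function of MTBF and MTTR.
# All figures are hypothetical, in hours.

def availability(mtbf: float, mttr: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf / (mtbf + mttr)

# Optimising For Robustness: rare failures, slow recovery.
robust = availability(mtbf=1000.0, mttr=8.0)     # ~0.9921

# Optimising For Resilience: more frequent failures, rapid recovery.
resilient = availability(mtbf=250.0, mttr=0.25)  # ~0.9990

print(f"robust:    {robust:.4%}")
print(f"resilient: {resilient:.4%}")
```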
Resilience can be thought of as graceful extensibility. In The Theory of Graceful Extensibility, David Woods defines it as “a blend of graceful degradation and software extensibility”. A complex system with high graceful extensibility will continue to function, whereas a brittle system would collapse.
Graceful extensibility is derived from the capacity for adaptation in a system. Adaptive capacity can be created when work is effectively managed to rapidly reveal new problems, problems are quickly solved and produce new knowledge, and new local knowledge is shared throughout an organisation. These can be achieved by improving the operability of a system via:
- An adaptive architecture
- Incremental deployments
- Automated provisioning
- Ubiquitous telemetry
- Chaos Engineering
- You Build It You Run It
- Post-incident reviews
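As a minimal sketch of the first ingredient above, an adaptive architecture can degrade gracefully by serving the last known-good response when a dependency fails, rather than collapsing. The cache and function names here are hypothetical:

```python
# Minimal graceful degradation sketch: fall back to a stale cached value
# when a dependency fails. Names and values are hypothetical.

from typing import Callable, Dict

cache: Dict[str, str] = {}

def with_fallback(key: str, fetch_live: Callable[[], str]) -> str:
    """Try the live dependency; on failure, degrade gracefully to the last good value."""
    try:
        value = fetch_live()
        cache[key] = value  # remember the last good response
        return value
    except Exception:
        if key in cache:
            return cache[key]  # stale but available: the service keeps functioning
        raise  # no fallback available: surface the failure

def failing_dependency() -> str:
    raise RuntimeError("dependency down")

# A healthy dependency populates the cache...
print(with_fallback("price", lambda: "42.00"))   # 42.00
# ...then a failing dependency degrades to the cached value instead of collapsing.
print(with_fallback("price", failing_dependency))  # 42.00
```

A brittle system would propagate the dependency failure to every caller; a gracefully extensible one continues to function in a degraded mode.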
Investing in operability creates a production environment in which services can gracefully extend on failure. Ubiquitous telemetry will minimise sunk cost duration, an adaptive architecture will decrease opportunity cost duration, operator health will aid all aspects of failure resolution, and post-incident reviews will produce shareable knowledge for other operators. The result will be what Ron Westrum describes as a generative culture of performance-oriented interactions, high cooperation, and shared risks.
Dr. W. Edwards Deming said in Out Of The Crisis that “you cannot inspect quality into a product”. The same is true of operability. You cannot inspect operability into a product. Building operability in from the outset will remove handoffs, queues, and coordination costs between Delivery and Operations teams in a technology value stream. This will eliminate delays and rework, and allow Continuous Delivery to be achieved.
This is part 1 of the Build Operability In series
- Build Operability In
- Build Operability In – Measures
- Build Operability In – Architecture
- Build Operability In – Telemetry
- Build Operability In – Operational Readiness
- Build Operability In – You Build It You Run It
- Build Operability In – Learning
Thanks as usual to Thierry de Pauw for reviewing this series