What is operability, how does it promote resilience, and how does building operability into your services drive Continuous Delivery adoption?
In this series, Steve Smith explains what operability is, how to design Service Level Objectives, how to measure leading and trailing indicators of operability, and why Build It Run It matters so much to Continuous Delivery as well as operability.
This is part 1 of the Build Operability In series
Bad things happen when problems are protected by a force field of tediousness – Ben Goldacre
The origins of the 20th century, pre-Internet IT As A Cost Centre organisational model can be traced to the suzerainty of cost accounting and to the COBIT management and governance framework. COBIT has recommended sequential Plan-Build-Run phases to maximise resource efficiency since its launch in 1996. The justification was the high compute costs and high transaction costs of a release in the 1990s.
With IT As A Cost Centre, Plan happens in a Product department and Build-Run happens in an IT department. IT will have separate Delivery and Operations groups, with competing goals:
- Delivery will be responsible for building features
- Operations will be responsible for running services
Delivery and Operations will consist of functionally-oriented teams of specialists. Delivery will have multiple development teams. Operations will have Database, Network, and Server teams to administer resources, a Service Transition team to check operational readiness prior to launch, and a Production Support team to respond to live incidents.
Siloisation causes Discontinuous Delivery
In The High-Velocity Edge, Dr. Steven Spear warns that over-specialisation leads to siloisation, and causes functional areas to “operate more like sovereign states”. Delivery and Operations teams with orthogonal priorities will create multiple handoffs in a technology value stream. A handoff means waiting in a queue for a downstream team to complete a task, and that task could inadvertently produce more upstream work.
Furthermore, the fundamentally opposed incentives, nomenclature, and risk appetites within Delivery and Operations teams will cause a pathological culture to emerge over time. This is defined by Ron Westrum in A Typology of Organisational Cultures as a culture of power-oriented interactions, with low cooperation and neglected responsibilities.
These problems will be exacerbated at scale, when a non-trivial number of Delivery teams are working in different technology value streams and depend on the same Operations teams. When multiple release candidates are blocked on an Operations team, the countermeasure will be a contentious management escalation, which will increase coordination costs for the Delivery teams and interrupt-driven toil for the Operations team.
The goal of Continuous Delivery is to achieve a deployment throughput that satisfies product demand. Disparate Delivery and Operations teams will inject delays and rework into a technology value stream that disproportionately inflate lead times. If product demand dictates a throughput target of weekly deployments or more, Discontinuous Delivery is inevitable.
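The effect of handoffs on lead time can be illustrated with a minimal sketch. The `Step` type, the step names, and every number of hours below are hypothetical and chosen only for illustration. The sketch assumes lead time is simply the sum of working time and queueing time across the steps of one value stream.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    team: str
    work_hours: float   # time spent actively working on the task
    queue_hours: float  # time the task waited before the team picked it up


def lead_time_hours(steps):
    """Total elapsed time: working time plus queueing time for every step."""
    return sum(s.work_hours + s.queue_hours for s in steps)


def handoff_count(steps):
    """Number of times work crosses from one team to a different team."""
    return sum(1 for a, b in zip(steps, steps[1:]) if a.team != b.team)


def queue_share(steps):
    """Fraction of the lead time that was spent waiting in queues."""
    total = lead_time_hours(steps)
    return sum(s.queue_hours for s in steps) / total if total else 0.0


# Hypothetical value stream for one release candidate (invented numbers).
value_stream = [
    Step("build feature", "Delivery", 16, 0),
    Step("provision database", "Operations", 2, 40),
    Step("operational readiness check", "Operations", 4, 72),
    Step("fix readiness findings", "Delivery", 6, 8),
    Step("deploy to production", "Operations", 1, 24),
]

print(lead_time_hours(value_stream), handoff_count(value_stream))
print(round(queue_share(value_stream), 2))
```

With these invented numbers, three handoffs produce a lead time of 173 hours, which is longer than a week, and roughly 83% of that time is spent waiting rather than working.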
Robustness breeds inoperability
Most IT cost centres try to achieve reliability by Optimising For Robustness, which means prioritising a higher Mean Time Between Failures (MTBF) over a lower Mean Time To Repair (MTTR). This is based on the idea that a production environment is a complicated system, in which homogeneous service processes have predictable, repeatable interactions, and failures are preventable.
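As a rough illustration of the two measures, here is a minimal sketch that derives MTBF and MTTR from a list of incidents. The `Incident` type and the incident data are hypothetical. The sketch assumes one common set of definitions: MTTR is the mean downtime per failure, MTBF is the uptime in the observation window divided by the number of failures, and steady-state availability is MTBF / (MTBF + MTTR).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime
    resolved: datetime


def total_downtime(incidents):
    return sum((i.resolved - i.started for i in incidents), timedelta())


def mttr(incidents):
    """Mean Time To Repair: mean downtime per failure."""
    if not incidents:
        raise ValueError("MTTR is undefined without any incidents")
    return total_downtime(incidents) / len(incidents)


def mtbf(incidents, window_start, window_end):
    """Mean Time Between Failures: uptime in the window per failure."""
    if not incidents:
        raise ValueError("MTBF is undefined without any incidents")
    uptime = (window_end - window_start) - total_downtime(incidents)
    return uptime / len(incidents)


def availability(mtbf_value, mttr_value):
    """Steady-state availability under the definitions assumed above."""
    return mtbf_value / (mtbf_value + mttr_value)


# Hypothetical 30 day window with three incidents of 1, 2, and 3 hours.
window_start = datetime(2024, 1, 1)
window_end = datetime(2024, 1, 31)
incidents = [
    Incident(datetime(2024, 1, 5, 9), datetime(2024, 1, 5, 10)),
    Incident(datetime(2024, 1, 14, 22), datetime(2024, 1, 15, 0)),
    Incident(datetime(2024, 1, 27, 13), datetime(2024, 1, 27, 16)),
]

repair = mttr(incidents)
between = mtbf(incidents, window_start, window_end)
print(repair, between, availability(between, repair))
```

Under these definitions, Optimising For Robustness tries to push `between` up, whereas the alternative is to push `repair` down.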
Reliability is dependent on operability, which can be defined as the ease of safely operating a production system. Optimising For Robustness produces an underinvestment in operability, due to the following:
- A Diffusion Of Responsibility between Delivery and Operations. When Operations teams are accountable for operational readiness and incident response, Delivery teams have little reason to work on operability
- A Normalisation of Deviance within Delivery and Operations. When failures are tolerated as rare and avoidable, Delivery and Operations teams will pursue cost savings rather than an ability to degrade on failure
That underinvestment in operability will result in Delivery and Operations teams creating brittle, inoperable production systems. Symptoms of brittleness will include:
- Inadequate telemetry – an inability to detect abnormal conditions
- Fragile architecture – an inability to limit blast radius on failure
- Operator burnout – an inability to perform heroics on demand
- Blame games – an inability to learn from experience
Optimising For Robustness is ill-advised, as failures are unavoidable. A production environment is actually a complex system, in which heterogeneous service processes have unpredictable, unrepeatable interactions, and failures are inevitable. As Richard Cook explains in How Complex Systems Fail, “the complexity of these systems makes it impossible for them to run without multiple flaws”. A production environment is perpetually in a state of near-failure.
A failure occurs when multiple flaws unexpectedly coalesce and impede a business function, and the costs can be steep for a brittle, inoperable service. Inadequate telemetry widens the sunk cost duration from failure start to detection. A fragile architecture expands the opportunity cost duration from detection until resolution, and the overall cost per unit time. Operator burnout increases all costs involved, and blame games allow similar failures to occur in the future.
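The relationship between these durations and the cost of a failure can be sketched as follows. This is a deliberately simple model, not a prescribed formula: it assumes a constant cost per minute for the whole failure, and the function name, timestamps, and cost figure are hypothetical.

```python
from datetime import datetime, timedelta


def failure_cost(started, detected, resolved, cost_per_minute):
    """Split a failure into its two durations and price it linearly.

    Assumption: the cost per minute is constant from failure start to
    resolution. Real failures rarely behave this neatly.
    """
    if not started <= detected <= resolved:
        raise ValueError("expected started <= detected <= resolved")
    sunk = detected - started          # failure start to detection
    opportunity = resolved - detected  # detection to resolution
    minutes = (sunk + opportunity) / timedelta(minutes=1)
    return {
        "sunk_cost_duration": sunk,
        "opportunity_cost_duration": opportunity,
        "total_cost": minutes * cost_per_minute,
    }


# Hypothetical failure: 45 minutes to detect, 90 more minutes to resolve.
result = failure_cost(
    started=datetime(2024, 3, 1, 10, 0),
    detected=datetime(2024, 3, 1, 10, 45),
    resolved=datetime(2024, 3, 1, 12, 15),
    cost_per_minute=100,
)
print(result)
```

In this model, better telemetry shrinks `sunk_cost_duration`, while an architecture that limits blast radius shrinks both `opportunity_cost_duration` and `cost_per_minute`.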
Resilience needs operability
Optimising For Resilience is a more effective reliability strategy. This means prioritising a lower MTTR over a higher MTBF. The ability to quickly adapt to failures is more important than having fewer failures, although some failure classes should never occur and some safety-critical systems should never fail.
Resilience can be thought of as graceful extensibility. In The Theory of Graceful Extensibility, David Woods defines it as “a blend of graceful degradation and software extensibility”. A complex system with high graceful extensibility will continue to function, whereas a brittle system would collapse.
Graceful extensibility is derived from the capacity for adaptation in a system. Adaptive capacity can be created when work is effectively managed to rapidly reveal new problems, problems are quickly solved and produce new knowledge, and new local knowledge is shared throughout an organisation. These can be achieved by improving the operability of a system via:
- An adaptive architecture
- Incremental deployments
- Automated provisioning
- Ubiquitous telemetry
- Chaos Engineering
- You Build It You Run It
- Post-incident reviews
Investing in operability creates a production environment in which services can gracefully extend on failure. Ubiquitous telemetry will minimise sunk cost duration, an adaptive architecture will decrease opportunity cost duration, operator health will aid all aspects of failure resolution, and post-incident reviews will produce shareable knowledge for other operators. The result will be what Ron Westrum describes as a generative culture of performance-oriented interactions, high cooperation, and shared risks.
Dr. W. Edwards Deming said in Out Of The Crisis that “you cannot inspect quality into a product”. The same is true of operability. You cannot inspect operability into a product. Building operability in from the outset will remove handoffs, queues, and coordination costs between Delivery and Operations teams in a technology value stream. This will eliminate delays and rework, and allow Continuous Delivery to be achieved.
The Build Operability In series:
- Build Operability In
- Build Operability In – Measures
- Build Operability In – Architecture
- Build Operability In – Telemetry
- Build Operability In – Operational Readiness
- Build Operability In – You Build It You Run It
- Build Operability In – Learning
Thanks as usual to Thierry de Pauw for reviewing this series.