Why is it important to measure operability? What should the trailing and leading indicators of operability be?
In this article, Steve Smith looks at system-level measures of operability that can be aligned with organisational goals.
This is part 2 of the Build Operability In series
In How To Measure Anything, Douglas Hubbard states organisations have a Measurement Inversion, and waste their time measuring variables with a low information value. This is certainly true of IT reliability, which is usually badly measured if at all. By proxy, this includes operability as well.
In many organisations, reliability is measured by equating failures with recorded production incidents. Incident durations are calculated for Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR), or there is just an overall incident count. These are classic vanity measures. They are easy to implement and understand, but they have a low information value due to the following:
- Quantitative measures such as incident count do not reflect business drivers, such as the percentage of unrecoverable user errors
- Manual recording of incidents in a ticket system can be affected by data inaccuracies and cognitive biases, such as confirmation bias and recency bias
- Goodhart’s Law means measuring incidents will result in fewer incident reports. People adjust their behaviours based on how they are measured, and measuring incidents will encourage people to suppress incident reports with potentially valuable information.
If operability is to be built into applications, there is a need to identify trailing and leading indicators of operability that are holistic and actionable. Measures of operability that encourage system-level collaboration rather than individual productivity will pinpoint where improvements need to be made.
Effective leading and trailing indicators of software delivery should be visualised and publicly communicated throughout an organisation, via internal websites and dashboards. Information radiators help engineers, managers, and executives understand at a glance the progress being made and its alignment with organisational goals. Transparency also reduces the potential for accidents and bad behaviours. As Louis Brandeis said in Other People’s Money, “sunlight is said to be the best of disinfectants; electric light the most efficient policeman”.
Availability as a trailing indicator
Failures should be measured in terms of application availability, not production incidents. Availability measurements are easy to implement with automated time series metrics collection, easy to understand, and have a high information value. Measurements can be designed to distinguish between full and partial degradation, and between unrecoverable and recoverable user errors.
Availability should be measured via availability targets. An availability target represents a desired level of availability, and is usually expressed as a number of nines, e.g. 99.9%, 99.99%, or 99.999%. Each additional nine represents an order of magnitude more engineering effort. An availability target should be chosen by a product manager, based on their maximum tolerance for degradation. This will be based on user expectations, revenue potential, and application functionality.
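To make the cost of each nine concrete, an availability target can be translated into the amount of full downtime it tolerates. The following is a minimal sketch; the function name `allowed_downtime` and the 30 day period are assumptions chosen for illustration.

```python
from datetime import timedelta


def allowed_downtime(availability_target_percent: float, period: timedelta) -> timedelta:
    """Return the maximum full downtime an availability target tolerates over a period."""
    if not 0 < availability_target_percent <= 100:
        raise ValueError("availability target must be greater than 0 and at most 100")
    unavailable_fraction = (100 - availability_target_percent) / 100
    return period * unavailable_fraction


# Illustrative period only: 30 days is an assumption, not a recommendation.
period = timedelta(days=30)
for target in (99.0, 99.5, 99.9, 99.99, 99.999):
    budget = allowed_downtime(target, period)
    print(f"{target}% tolerates {budget.total_seconds() / 60:.1f} minutes of downtime per 30 days")
```

Under these assumptions a 99.5% target tolerates 3.6 hours of full downtime in 30 days, and each additional nine shrinks that tolerance by a factor of ten.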
For example, a media organisation called MediaTech has 99.0%, 99.5%, and 99.9% as its desired levels of availability. A product manager for a Games service has to weigh up product characteristics and the cost of engineering to decide on their availability target. They select 99.5% for at least the first 3 months of service lifetime.
Availability should be measured as Request Success Rate, as described by Betsy Beyer et al in Site Reliability Engineering. Request Success Rate is an effective approximation of degradation for customer-facing or back office applications, provided there is a well-defined notion of successful and unsuccessful work. It covers partial and full downtime for an application, and is more fine-grained than simply using uptime versus downtime. When an application has a Request Success Rate lower than its availability target, it is considered a failure. The average time to restore availability can be tracked as a Mean Time To Repair metric, and visualised in a graph alongside availability.
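As a rough sketch of how Mean Time To Repair could be derived from availability data rather than from tickets, the code below walks a time-ordered series of Request Success Rate samples. The names `repair_durations` and `mean_time_to_repair` are hypothetical, and the sketch assumes a failure starts at the first sample below the target and ends at the first subsequent sample at or above it. A failure still in progress at the end of the series is ignored.

```python
from datetime import datetime, timedelta


def repair_durations(samples, target):
    """Return the duration of each completed failure.

    samples: time-ordered (timestamp, request_success_rate_percent) pairs.
    """
    durations = []
    failure_started = None
    for timestamp, rate in samples:
        if rate < target and failure_started is None:
            failure_started = timestamp          # availability dropped below target
        elif rate >= target and failure_started is not None:
            durations.append(timestamp - failure_started)  # availability restored
            failure_started = None
    return durations


def mean_time_to_repair(durations):
    """Return the mean repair duration, or None when there were no failures."""
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


# Made-up samples, one per minute, against a 99.5% target.
start = datetime(2024, 1, 1, 12, 0)
rates = [99.9, 99.0, 99.2, 99.6, 99.9, 98.0, 99.7]
samples = [(start + timedelta(minutes=i), rate) for i, rate in enumerate(rates)]
mttr = mean_time_to_repair(repair_durations(samples, target=99.5))
print(mttr)
```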
At MediaTech, the Games service communicates with upstream consumers via an HTTPS API. Its availability is constantly measured by Request Success Rate, which is implemented by checking the percentage of upstream requests that produce an HTTP response code lower than HTTP 500. When the Request Success Rate over 15 minutes is lower than the availability target of 99.5%, it is considered a failure and a production incident is raised. An availability graph can be used to illustrate availability, incidents, and time to repair as a trailing indicator of operability.
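The Games service check could be sketched as follows. The `Request` type, the function names, and the sample data are assumptions for illustration, and a real implementation would more likely query a time series metrics store than hold requests in memory. Returning `None` for a window without traffic is also an assumption, as the text does not say how an empty window should be treated.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

AVAILABILITY_TARGET = 99.5        # percent, as chosen for the Games service
WINDOW = timedelta(minutes=15)    # measurement window from the example


@dataclass(frozen=True)
class Request:
    timestamp: datetime
    status_code: int


def request_success_rate(requests, window_end, window=WINDOW):
    """Percentage of requests in the window with a status code below 500.

    Returns None when the window contains no requests.
    """
    window_start = window_end - window
    in_window = [r for r in requests if window_start < r.timestamp <= window_end]
    if not in_window:
        return None
    successes = sum(1 for r in in_window if r.status_code < 500)
    return 100 * successes / len(in_window)


def is_failure(rate, target=AVAILABILITY_TARGET):
    """A window is a failure when its success rate is below the target."""
    return rate is not None and rate < target


# Made-up traffic: 1000 requests in the last 15 minutes, 6 of them HTTP 503.
now = datetime(2024, 1, 1, 12, 0)
requests = [
    Request(now - timedelta(seconds=i % 900), 503 if i < 6 else 200)
    for i in range(1000)
]
rate = request_success_rate(requests, window_end=now)
print(rate, is_failure(rate))
```

Note that HTTP 4xx responses count as successes here, because the check only treats codes of 500 and above as unsuccessful work.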
Leading indicators of operability
Failures cannot be predicted in a production environment as it is a complex, adaptive system. In addition, it is easy to infer a false narrative of past behaviours from quantitative data. The insights uncovered from an availability trailing indicator and the right leading indicators can identify inoperability prior to a production incident, and they can be pattern matched to select the best heuristic for the circumstances.
A leading indicator should be split into an automated check and one or more exploratory tests. This allows for continuous discovery of shallow data, and frees up people to examine contextual, richer data with a higher information value. Those exploratory tests might be part of an operational readiness assessment, or a Chaos Day dedicated to particular applications. Leading indicators of operability can include:
Learning is a vital leading indicator of operability. An organisation is more likely to produce operable, reliable applications if it fosters a culture of continuous learning and experimentation. After a production incident, nothing should be more important than everyone in the organisation having the opportunity to accumulate new knowledge, for their colleagues as well as themselves.
The initial automated check of learning should be whether a post-incident review is published within 24 hours of an incident. This is easy to automate with a timestamp comparison between a post-incident review document and the central incident system, easy to communicate across an organisation, and highly actionable. It will uncover incident reviews that do not happen, are not publicly published, or happen too late to prevent information decay.
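The timestamp comparison might look like the sketch below. The function name, the status labels, and the choice of which incident timestamp starts the 24 hour clock are all assumptions; in practice both timestamps would come from the incident system and the document store.

```python
from datetime import datetime, timedelta

REVIEW_DEADLINE = timedelta(hours=24)


def review_status(incident_at, review_published_at, now):
    """Classify a post-incident review against the 24 hour deadline.

    incident_at: incident timestamp from the central incident system
        (whether this is the start or the resolution is an assumption to settle locally).
    review_published_at: publication timestamp of the review, or None if unpublished.
    """
    due_by = incident_at + REVIEW_DEADLINE
    if review_published_at is None:
        return "pending" if now <= due_by else "missing"
    return "on_time" if review_published_at <= due_by else "late"


incident_at = datetime(2024, 1, 1, 9, 0)
print(review_status(incident_at, datetime(2024, 1, 1, 17, 0), now=datetime(2024, 1, 3, 9, 0)))
print(review_status(incident_at, None, now=datetime(2024, 1, 3, 9, 0)))
```

The "missing" and "late" results correspond to reviews that do not happen or are not published, and reviews that happen too late.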
Another learning check should be the throughput of operability tasks, comprising the lead time to complete a task and the interval between completing tasks. Tasks should be created and stored in a machine readable format during operational readiness assessments, Chaos Days, exploratory testing, and other automated checks of operability. Task lead time should not be more than a week, and task interval should not exceed the interval of the most frequent learning source. For example, if operational readiness assessments occur every 90 days and Chaos Days occur every 30 days, then at least one operability task should be completed per month.
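A throughput check over machine readable tasks could be sketched as below. The `Task` type and function names are hypothetical, only completed tasks are considered, and treating fewer than two completed tasks as passing the interval check is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

MAX_LEAD_TIME = timedelta(days=7)   # "not more than a week"


@dataclass(frozen=True)
class Task:
    created_at: datetime
    completed_at: datetime


def slow_tasks(tasks):
    """Return tasks whose lead time (creation to completion) exceeds a week."""
    return [t for t in tasks if t.completed_at - t.created_at > MAX_LEAD_TIME]


def longest_completion_interval(tasks):
    """Return the longest gap between consecutive completions, or None if fewer than two tasks."""
    completions = sorted(t.completed_at for t in tasks)
    gaps = [later - earlier for earlier, later in zip(completions, completions[1:])]
    return max(gaps, default=None)


def interval_within_limit(tasks, learning_source_intervals):
    """Check the longest gap against the most frequent learning source."""
    limit = min(learning_source_intervals)
    longest = longest_completion_interval(tasks)
    return longest is None or longest <= limit


# Learning sources from the example: assessments every 90 days, Chaos Days every 30 days.
sources = [timedelta(days=90), timedelta(days=30)]
tasks = [
    Task(datetime(2024, 1, 2), datetime(2024, 1, 5)),
    Task(datetime(2024, 1, 20), datetime(2024, 2, 1)),   # 12 day lead time
    Task(datetime(2024, 3, 10), datetime(2024, 3, 12)),  # 40 days after the previous completion
]
print(len(slow_tasks(tasks)), interval_within_limit(tasks, sources))
```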
The Build Operability In series:
- Build Operability In
- Build Operability In – Measures
- Build Operability In – Architecture
- Build Operability In – Telemetry
- Build Operability In – Operational Readiness
- Build Operability In – You Build It You Run It
- Build Operability In – Learning
Thanks as usual to Thierry de Pauw for reviewing this series.