Why do disparate Delivery and Operations teams result in long-term Discontinuous Delivery of inoperable applications? Why is it important to balance support cost effectiveness with operability incentives? What is You Build It You Run It, and why does it have such a positive impact on operability?
In this article Steve Smith looks at the traditional support model of You Build It Ops Run It, and how its modern equivalent You Build It You Run It powers Continuous Delivery and application operability
This is part 7 of the Build Operability In series
You Build It Ops Run It
An organisation modelled on IT As A Cost Centre and Plan-Build-Run will have an Operations group in its IT department. Operations teams will be responsible for all Run activities, including deployments and production support for all applications. This can be referred to as You Build It Ops Run It. For example, consider a technology value stream comprising 1 development team in Delivery and an Application Operations team.
You Build It Ops Run It usually involves multi-level production support, in line with the ITIL v3 Service Operation standard:
The Service Desk will receive customer requests, and Operations Bridge will monitor dashboards and receive alerts. Both L1 teams will be trained to resolve simple technology issues, and to escalate more complicated tickets to L2. Application Operations will respond to incidents that require technology specialisation, and when necessary will escalate to an L3 Delivery team to contribute their expertise to an incident.
Cost accounting in IT As A Cost Centre creates a funding divide. A Delivery team will be budgeted under Capital Expenditure (CapEx), whereas Operations teams will be under Operational Expenditure (OpEx). An Operations team member will be paid a flat standby rate and a per-incident callout rate. A Delivery team member will not be paid for standby, and might be unofficially compensated per-callout with time off in lieu. Operations will be under continual pressure to reduce OpEx spending, and the Service Desk, Ops Bridge, and/or Application Operations might be outsourced to third party suppliers.
In ITSM and why three-tier support should be replaced with Swarming, Jon Hall argues “the current organizational structure of the vast majority of IT support organisations is fundamentally flawed”. Multi-level support in You Build It Ops Run It means non-trivial tickets will go from Service Desk or Ops Bridge through triage queues until the best-placed responder team is found. Repeated, unilateral ticket reassignments can occur between teams and individuals. Those handoffs can increase incident resolution time by hours, days, or even weeks. Rework can also be incurred as Application Operations introduce workarounds and data fixes, which await resolution in a Delivery backlog for months before prioritisation.
In addition, You Build It Ops Run It has major disadvantages for fast customer feedback and iterative product development:
- Long deployment lead times – handoffs with Application Operations will inflate lead times by hours or days
- High knowledge synchronisation costs – Delivery team application knowledge and Application Operations incident knowledge will be lost in handoffs, without substantial synchronisation efforts
- Focus on outputs – software will be built as an output, with little to no understanding of product hypotheses or customer outcomes
- Fragile architecture – applications will be architected without limits on failure blast radius, and exposed to high impact incidents
- Inadequate telemetry – dashboards and alerts created by Application Operations in isolation will only be able to use operational metrics
- Traffic ignorance – challenges involved in managing live traffic will be localised and unable to inform design decisions
- Restricted collaboration – Application Operations and Delivery teams will find joint incident response hard, due to differences in ways of working and tools, and lack of Delivery team access to production
- Unfair on-call expectations – Delivery team members will be expected to be available out of hours without compensation for the inconvenience, and disruption to their lives
These problems can be traced back to incentives. With Application Operations responsible for production support, a Delivery team will be unaware of or uninvolved in production incidents. Application Operations cannot build operability into applications they do not own, and a Delivery team will have little reason to prioritise operational features. As a result, inoperability is inevitable.
You Build It Ops Run It injects substantial delays and rework into a technology value stream. This is likely to constrain Continuous Delivery if product demand is high. If weekly or fewer deployments are sufficient to meet demand, then Continuous Delivery is possible. However, if product demand calls for more than weekly deployments then You Build It Ops Run It can only lead to Discontinuous Delivery.
You Build It You Run It
The alternative is for a Delivery team to assume responsibility for its Run activities, including deployments and production support. This is often referred to as You Build It You Run It.
You Build It You Run It consists of single-level swarming support, with developers on-call. There is also a Service Desk to handle customer requests. The toolchain needs to include anomaly detection, alert notifications, messaging, and incident management tools, such as Prometheus, PagerDuty, Slack, and ServiceNow.
As with You Build It Ops Run It, Service Desk is an L1 team that receives customer requests and will resolve simple technology issues wherever possible. A development team in Delivery is also L1, and they will monitor dashboards, receive alerts, and respond to incidents. Service Desk should escalate tickets for particular website pages or user journeys into the incident management system, which would be linked to applications.
Delivery engineering costs and on-call support will both be paid out of CapEx, and Operations teams such as Service Desk will be under OpEx. As with You Build It Ops Run It, the Service Desk team might be outsourced to reduce OpEx costs. CapEx funding for You Build It You Run It will compel a product manager to balance their desired availability with on-call costs. OpEx funding for Delivery on-call should be avoided wherever possible, as it encourages product managers to artificially minimise risk tolerance and select high availability targets irregardless of on-call costs.
Swarming support means Delivery prioritising incident resolution over feature development, in line with the Continuous Delivery practice of Stop The Line and the Toyota Andon Cord. This encourages developers to limit failure blast radius wherever possible, and prevents them from deploying changes mid-incident that might exacerbate a failure. Swarming also increases learning, as it ensures developers are able to uncover perishable mid-incident information, and cross-pollinate their skills.
You Build It You Run It also has the following advantages for product development:
- Short deployment lead times – lead times will be minimised due to no handoffs
- Minimal knowledge synchronisation costs – developers will be able to easily share application and incident knowledge, to better prepare themselves for future incidents
- Focus on outcomes – teams will be empowered to deliver outcomes that test product hypotheses, and iterate based on user feedback
- Short incident resolution times – incident response will be quickened by no support ticket handoffs or rework
- Adaptive architecture – applications will be architected to limit failure blast radius, including bulkheads and circuit breakers
- Product telemetry – dashboards and alerts will be continually updated by developers, to be multi-level and tailored to the product context
- Traffic knowledge – an appreciation of the pitfalls and responsibilities inherent in managing live traffic will be factored into design work
- Rich situational awareness – developers will respond to incidents with the same context, ways of working, and tooling
- Clear on-call expectations – developers will be aware they are building applications they themselves will support, and they should be remunerated
You Build It You Run It creates the right incentives for operability. When Delivery is responsible for their own deployments and production support, product owners will be more aware of operational shortfalls, and pressed by developers to prioritise operational features alongside product ideas. Ensuring that application availability is the responsibility of everyone will improve outcomes and accelerate learning, particularly for developers who in IT As A Cost Centre are far removed from actual customers. Empowering delivery teams to do on-call 24×7 is the only way to maximise incentives to build operability in.
You Build It You Run It costs
The most common criticism of You Build It You Run It is that it is too expensive. Paying Delivery team members for L1 on-call standby and callout can seem costly, particularly when You Build It Ops Run It allows for L1-2 production support to be outsourced to cheaper third party suppliers. This perception should not be surprising, given David Wood’s assertion in The Flip Side Of Resilience that “graceful extensibility trades off with robust optimality”. Implementing You Build It You Run to increase adaptive capacity for future incidents may look wasteful, particularly if incidents are rare.
A more holistic perspective would be to treat production support as revenue insurance for availability targets, and consider risk in terms of revenue impact instead of incident count. A production support policy will cover:
- Availability protection
- Availability restoration on loss
You Build It You Run It maximises incentives for Delivery teams to focus from the outset on protecting availability, and it guarantees the callout of an L1 Delivery engineer to restore availability on loss. This should be demonstrable with a short Time To Restore (TTR), which could be measured via availability time series metrics or incident duration. That high level of risk coverage will come at a higher premium. This means You Build It You Run It will be more cost effective for applications with higher availability targets and greater potential for revenue loss.
You Build It Ops Run It offers a lower level of risk coverage at a lower premium, with weak incentives to protect application availability and an L2 Application Operations team to restore application availability. This will produce a higher TTR than You Build It You Run It. This may be acceptable for applications with lower availability targets and/or limited potential for revenue loss.
The cost effectiveness of a production support policy can be calculated per availability target by comparing its availability restoration capability with support cost. For example, at Fruits R Us there are 3 availability targets with estimated maximum revenue losses on availability target loss. Fruits R Us has a Delivery team with an on-call cost of £3K per calendar month and a TTR of 20 minutes, and an Application Operations team with a cost of £1.5K per month and a TTR of 1 hour.
Projected availability loss per team is a function of TTR and the £ maximum availability loss per availability target, and lower losses can be calculated for the Delivery team due to its shorter TTR.
At 99.0%, Application Operations is as cost effective at availability restoration of a 7 hour 12 minute outage as the Delivery team, and Fruits R Us might consider the merits of You Build It Ops Run It. However, this would mean Application Operations would be unable to build operability in and increase availability protection, and the Delivery team would have few incentives to contribute.
At 99.5%, the Delivery team is more cost effective at availability restoration of a 3 hour 36 minute outage than Application Operations.
At 99.9%, the Delivery team is far more cost effective at availability restoration of a 43 minute 12 second outage. The 1 hour TTR of Application Operations means their £ projected availability loss is greater than the £ maximum availability loss at 99.9%. You Build It You Run It is the only choice.
This is part 7 of the Build Operability In series
- Build Operability In
- Build Operability In – Availability Targets
- Build Operability In – Measures
- Build Operability In – Architecture [TBA]
- Build Operability In – Telemetry [TBA]
- Build Operability In – Operational Readiness [TBA]
- Build Operability In – You Build It You Run It
- Build Operability In – You Build It You Run It At Scale
- Build Operability In – Implementing You Build It You Run It At Scale
- Build Operability In – Learning [TBA]
Thanks as usual to Thierry de Pauw for reviewing this series. Thanks also to Dan North for knowledge synchronisation cost as a term