Why does Operations production support become less effective as Delivery teams and applications increase in scale? How can You Build It You Run It be applied to 10+ teams and applications without an overwhelming support cost? How can operability incentives be preserved for so many teams?
In this article, Steve Smith looks at why You Build It Ops Run It At Scale struggles so badly, and how You Build It You Run It At Scale can be designed to balance operability incentives with support costs.
This is part 8 of the Build Operability In series
You Build It Ops Run It At Scale
An IT As A Cost Centre organisation beholden to Plan-Build-Run will have a Delivery group responsible for building applications, and an Operations group responsible for deploying applications and production support. When there are 10+ Delivery teams and applications, this can be referred to as You Build It Ops Run It At Scale. For example, imagine a single technology value stream used by 10 delivery teams, and each team builds a separate customer-facing application.
You Build It Ops Run It At Scale roles and responsibilities are no different from You Build It Ops Run It. L1 and L2 Operations teams will be paid standby and callout costs out of Operational Expenditure (OpEx), while L3 Delivery team members, supporting on a best endeavours basis, are unpaid. The key difference at scale is Operations workload. In particular, Application Operations will have to manage deployments and L2 incident response for 10+ applications. It will be extremely difficult for Application Operations to keep track of when a deployment is required, which alert corresponds to which application, and which Delivery team can help with a particular application.
You Build It Ops Run It At Scale magnifies the problems with You Build It Ops Run It, with a negative impact on both Continuous Delivery and operability:
- Long time to restore – support ticket handoffs between Ops Bridge, Application Operations, and multiple Delivery teams will delay availability restoration on failure
- No focus on customer outcomes – applications will be built as outputs only, with little time for product hypotheses
- Fragile architecture – failure scenarios will not be designed into user journeys and applications, increasing failure blast radius
- Inadequate telemetry – dashboards and alerts from Application Operations will only be able to show low-level operational metrics
- Traffic ignorance – applications will be built with little knowledge of how traffic flows through different dependencies
- Restricted collaboration – incident response between Application Operations and multiple Delivery teams will be hampered by different ways of working
- Trapped learnings – knowledge acquired by Application Operations in post-incident reviews will not easily reach Delivery teams
- Unfair on-call expectations – Delivery team members will be expected to do unpaid on-call out of hours
These problems will make it less likely that application availability targets can consistently be met, and will increase Time To Restore (TTR) on availability loss. Production incidents will be more frequent, and revenue impact will potentially be much greater. This is a direct result of the lack of operability incentives. Application Operations cannot build operability into 10+ applications they do not own. Delivery teams will have little reason to do so when they have little to no responsibility for incident response.
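The revenue impact of a slow Time To Restore can be made concrete with a back-of-envelope calculation. The sketch below is illustrative only, with invented figures for revenue per minute, incident frequency, and TTR:

```python
# Back-of-envelope estimate of annual revenue at risk from unavailability.
# All figures are assumptions for illustration, not from the article.

def revenue_at_risk(revenue_per_minute: float,
                    incidents_per_year: int,
                    ttr_minutes: float) -> float:
    """Annual revenue exposed to loss, as a function of Time To Restore."""
    return revenue_per_minute * incidents_per_year * ttr_minutes

# Assume £1,000 of revenue per minute and 12 production incidents a year.
slow_restore = revenue_at_risk(1000, 12, 90)  # ticket handoffs: 90 min TTR
fast_restore = revenue_at_risk(1000, 12, 10)  # team swarming: 10 min TTR
print(f"£{slow_restore:,.0f} vs £{fast_restore:,.0f}")
```

Under these assumed numbers, cutting TTR from 90 minutes to 10 minutes reduces annual revenue at risk from £1,080,000 to £120,000.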
A Theory Of Constraints lens on Continuous Delivery shows that reducing rework and queue times is key to deployment throughput. With 10+ Delivery teams and applications the Application Operations workload will become intolerable, and team member burnout will be a real possibility. Queue time for deployments will mount up, and the countermeasure to release candidates blocking on Application Operations will be time-consuming management escalations. If product demand calls for more than weekly deployments, the rework and delays incurred in Application Operations will result in long-term Discontinuous Delivery.
You Build It You Run It At Scale
You Build It You Run It At Scale means 10+ Delivery teams are responsible for their own deployments and production support. It is the You Build It You Run It approach, applied to multiple teams and multiple applications.
There is an L1 Service Desk team to handle customer requests. Each Delivery team is on L2 support for their applications, and creates their own monitoring dashboard and alerts. There should be a consistent toolchain for anomaly detection and alert notifications across all Delivery teams, which can incorporate those dashboards and alerts.
The Service Desk team will tackle customer complaints and resolve simple technology issues. When an alert fires, a Delivery team will practise Stop The Line by halting feature development and swarming on the problem within the team. That cross-functional collaboration means a problem can be quickly isolated and diagnosed, and the whole team creates new knowledge it can incorporate into future work. If the Service Desk cannot resolve an issue, they should be able to route it to the appropriate Delivery team via an application mapping in the incident management system.
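The application mapping in the incident management system can be as simple as a lookup from application to owning team. A minimal sketch, with entirely hypothetical application and team names:

```python
# Hypothetical application-to-team mapping the Service Desk could use to
# route an unresolved issue to the owning Delivery team. Names are
# illustrative only; real mappings would live in the incident management
# system, not in code.

APPLICATION_OWNERS = {
    "checkout": "team-payments",
    "search": "team-discovery",
    "recommendations": "team-discovery",
    "loyalty": "team-retention",
}

def route_incident(application: str) -> str:
    """Return the Delivery team on call for an application's incident.

    Unmapped applications fall back to a Service Desk triage queue,
    rather than being dropped silently.
    """
    return APPLICATION_OWNERS.get(application, "service-desk-triage")

print(route_incident("checkout"))        # team-payments
print(route_incident("legacy-billing"))  # service-desk-triage
```

Keeping the mapping in one shared system means an alert reaches the owning team in a single hop, rather than via the ticket handoffs of You Build It Ops Run It At Scale.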
In On-Call At Any Size, Susan Fowler et al warn “multiple rotations is a key scaling challenge, requiring active attention to ensure practices remain healthy and consistent”. Funding is the first You Build It You Run It At Scale practice that needs attention. On-call support for each Delivery team should be charged to the CapEx budget for that team. This will encourage each product manager to regularly work on the delicate trade-off between protecting their desired availability target out of hours and on-call costs. Central OpEx funding must be avoided, as it eliminates the need for product managers to consider on-call costs at all.
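The trade-off a product manager weighs can be sketched as simple arithmetic: out of hours coverage multiplied by standby and callout rates gives the annual cost charged to the team's CapEx budget. All rates and hours below are invented assumptions:

```python
# Illustrative model of the on-call cost trade-off a product manager owns.
# Standby and callout rates, and coverage hours, are assumptions for
# illustration only.

def annual_on_call_cost(standby_rate_per_hour: float,
                        callout_rate_per_hour: float,
                        standby_hours_per_week: float,
                        expected_callout_hours_per_week: float) -> float:
    """Annual on-call cost for one team, charged to its CapEx budget."""
    weekly = (standby_rate_per_hour * standby_hours_per_week
              + callout_rate_per_hour * expected_callout_hours_per_week)
    return weekly * 52

# 24x7 coverage: 128 standby hours a week (168 hours minus 40 working hours)
full_cover = annual_on_call_cost(5.0, 50.0, 128, 2)
# Weekday evenings only: roughly 30 standby hours a week
partial_cover = annual_on_call_cost(5.0, 50.0, 30, 1)
print(f"24x7: £{full_cover:,.0f}, evenings only: £{partial_cover:,.0f}")
```

Putting numbers like these in front of a product manager each quarter is what makes the trade-off between an out of hours availability target and on-call costs a deliberate product decision rather than a hidden central OpEx cost.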
You Build It You Run It At Scale has the following advantages:
- Fast incident resolution – an alert will be immediately assigned to the team that owns the application, which can rapidly swarm to recover from failure and minimise TTR
- Short deployment lead times – deployments can be performed on demand by a Delivery team, with no handoffs involved
- Focus on outcomes – teams are encouraged to work in smaller batches, towards customer outcomes and product hypotheses
- Adaptive architecture – applications can be designed with failure scenarios in mind, including circuit breakers and feature toggles to reduce failure blast radius
- Product telemetry – application dashboards and alerts can be constantly updated to include the latest product metrics
- Situational awareness – teams will have a prior understanding of normal versus abnormal live traffic conditions that can be relied on during incident response
- Cumulative learning – teams will be able to convert new operational information into knowledge that can be used during application design and disseminated to other teams
- Fair on-call compensation – team members will be remunerated for the disruption to their lives incurred by supporting applications
In Accelerate, Dr. Nicole Forsgren et al found “high performance is possible with all kinds of systems, provided that systems – and the teams that build and maintain them – are loosely coupled”. Accelerate research showed the key to high performance is for a team to be able to independently test and deploy its applications, with negligible coordination with other teams. You Build It You Run It enables a team to increase its throughput and achieve Continuous Delivery, by removing rework and queue times associated with deployments and production support. You Build It You Run It At Scale enables an organisation to increase overall throughput while simultaneously increasing the number of teams. This allows an organisation to move faster as it adds more people, which is a true competitive advantage.
You Build It You Run It At Scale creates a healthy engineering culture, in which product development consists of a balance between product ideas and operational features. 10+ Delivery teams with on-call responsibilities will be incentivised to care about operability and consistently meeting availability targets, while increasing delivery throughput to meet product demand. Delivery teams doing 24×7 on-call at scale will be encouraged to build operability into all their applications, from inception to retirement.
You Build It You Run It At Scale can incur high support costs. It can be cost effective if a compromise is struck between deployment targets, availability targets, and on-call costs that does not weaken operability incentives for Delivery teams.
Production support as revenue insurance
Production support should be thought of as revenue protection insurance. As insurance policies, You Build It Ops Run It At Scale and You Build It You Run It At Scale are opposites in terms of risk coverage and costs.
You Build It Ops Run It At Scale offers a low degree of risk coverage, limits deployment throughput, and has a potential for revenue loss on unavailability that should not be underestimated. You Build It You Run It At Scale has a higher degree of risk coverage, with no limits on deployment throughput and a short TTR to minimise revenue losses on failure.
You Build It You Run It becomes more cost effective as product demand and reliability needs increase, as deployment targets and availability targets are ratcheted up, and the need for Continuous Delivery and operability becomes ever more apparent. The right revenue insurance policy should be chosen based on the number of teams and applications, and the range of availability targets. The fuzzy model below can be used to distinguish when You Build It You Run It At Scale is appropriate – when availability targets are demanding and the number of teams and applications is 10+.
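The shape of that fuzzy model can be sketched as a simple decision function over team count and availability target. The thresholds below are illustrative assumptions, not the article's exact model:

```python
# A minimal sketch of the kind of fuzzy decision model described above:
# choose a production support model from the number of Delivery teams and
# the availability target. Thresholds are assumptions for illustration.

def support_model(teams: int, availability_target: float) -> str:
    """Suggest a production support model for an organisation."""
    demanding = availability_target >= 0.999  # e.g. 99.9% or above
    if demanding and teams >= 10:
        return "You Build It You Run It At Scale"
    if demanding:
        return "You Build It You Run It"
    return "You Build It Ops Run It"

print(support_model(12, 0.999))   # You Build It You Run It At Scale
print(support_model(4, 0.9999))   # You Build It You Run It
print(support_model(3, 0.99))     # You Build It Ops Run It
```

In practice the boundaries are fuzzy rather than hard cut-offs, but the inputs are the same: the more demanding the availability targets and the more teams and applications involved, the stronger the case for You Build It You Run It At Scale.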
This is part 8 of the Build Operability In series
- Build Operability In
- Build Operability In – Availability Targets
- Build Operability In – Measures
- Build Operability In – Architecture [TBA]
- Build Operability In – Telemetry [TBA]
- Build Operability In – Operational Readiness [TBA]
- Build Operability In – You Build It You Run It
- Build Operability In – You Build It You Run It At Scale
- Build Operability In – Implementing You Build It You Run It At Scale
- Build Operability In – Learning [TBA]
Thanks as usual to Thierry de Pauw for reviewing this series