How can professionalism in software delivery be measured?
There is a way to measure the performance of an IT organization: the DORA KPIs.
If you optimize for the DORA KPIs, you increase the efficiency of your IT organization as well: the better your performance against these KPIs, the more efficient your organization will be.
Performing high in regard to these KPIs will lead to:
- An improved ROI of new features
- A higher product quality
- A higher focus of the team on product features
- And a faster feature delivery
Let’s dive into it.
One of the main questions in modern IT organizations is: How can professionalism in software delivery be measured? And how mature are we in comparison with other companies in our industry?
If you want to measure the maturity and performance of your IT organization, you need some metrics or KPIs. But which KPIs make sense?
Luckily, this question has already been raised by many people, and there is an answer to it. Of course, the most important measure of performance is business success, but there are additional ways to get a deeper understanding of your performance. Let’s look at:
The DevOps Research and Assessment (DORA) KPIs
The DevOps Research and Assessment (DORA) KPIs are based on research; the science behind them is laid out in the book “Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations”, co-authored by Nicole Forsgren, Jez Humble, and Gene Kim and published in 2018.
Since 2014, yearly reports have been published that give insights into the performance of different IT organizations. You can even self-assess the performance of your IT organization via a questionnaire.
What makes these DORA KPIs so interesting is that the teams with the best results are twice as likely to meet or exceed their goals!
So these performance indicators look like a good starting point if you want to know how your organization currently performs in comparison with others, and what impact improvement measures and activities would have.
There are four main KPIs:
- Lead Time for Changes
- Deployment Frequency
- Change Failure Rate
- Time to Restore Service
Additionally, for a cloud and hosting provider like IONOS, it is very important to have Availability as a KPI. It is not part of the four core KPIs, and its treatment has changed over time: since 2021 it has been part of operational performance as a reliability metric, which tells you how well your services meet user expectations, such as availability and performance.
What are these KPIs about?
Lead Time for Changes
The amount of time it takes a code commit to get into production. It tells you how fast you can deliver changes to your customers. The goal is to reduce lead time as much as possible. It is measured in units of time (minutes, hours, days, weeks, months, or even years).
Deployment Frequency
How often an organization successfully releases to production. A higher deployment frequency usually demands a higher degree of automation and automated tests, including high test coverage. The theoretical maximum number of deployments equals the number of changes made to the code (such as merged pull requests or commits).
Change Failure Rate
The percentage of deployments causing a failure in production. The goal is a low change failure rate. It acts as a counter-metric to deployment frequency: if you deploy more often, your change failure rate may increase, especially if you don’t have deployment automation in place.
Time to Restore Service
Measures how long it takes an organization to recover from a failure in production. The smaller the changes you ship to production, the easier it is to find the root cause of an outage. And the better the debuggability of your system, the faster you can find the root cause and apply a fix.
Availability
The percentage of time your service is available to your customers. As stated above, since 2021 it has been part of operational performance as a reliability metric. Especially for companies running services, this is a very important metric.
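To make the definitions concrete, here is a minimal sketch of how these KPIs could be computed from a deployment and incident log. The data, field layout, and time window are invented purely for illustration:

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical deployment log: (commit time, production deploy time, caused a failure?)
deployments = [
    (datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 11), False),
    (datetime(2024, 1, 3, 10), datetime(2024, 1, 3, 14), True),
    (datetime(2024, 1, 8, 9), datetime(2024, 1, 8, 10), False),
    (datetime(2024, 1, 10, 9), datetime(2024, 1, 10, 12), False),
]
# Hypothetical incident log: (outage start, service restored)
incidents = [
    (datetime(2024, 1, 3, 14), datetime(2024, 1, 3, 15, 30)),
]

# Lead Time for Changes: commit-to-production duration (median)
lead_time = median(deploy - commit for commit, deploy, _ in deployments)

# Deployment Frequency: deployments per week over the observed window
window = deployments[-1][1] - deployments[0][1]
deploys_per_week = len(deployments) * 7 / (window.days or 1)

# Change Failure Rate: share of deployments causing a production failure
change_failure_rate = sum(failed for *_, failed in deployments) / len(deployments)

# Time to Restore Service: mean outage duration
time_to_restore = sum((end - start for start, end in incidents), timedelta()) / len(incidents)

# Availability: share of the window the service was up
downtime = sum((end - start for start, end in incidents), timedelta())
availability = 1 - downtime / window
```

In practice, these numbers would come from your CI/CD system and your incident management tooling rather than hand-written lists.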
The KPIs can be grouped into two main categories: at a high level, Deployment Frequency and Lead Time for Changes measure velocity, while Change Failure Rate, Time to Restore Service, and Availability measure stability.
We want to optimize both velocity and stability to be successful and deliver customer value. With these KPIs, we have a proper tool to look at the performance of our organization.
The monolith tinkering part
IONOS operates a ~14-year-old monolith that serves the main parts of the Cloud API for the Compute Engine product. This REST API is our customers’ main access point to the Cloud product.
Back in 2019, the performance of the IONOS Cloud software stack, measured with the DORA KPIs, was as follows:
Lead Time for Changes: 2-4 weeks
Deployment Frequency: 1x per month
Change Failure Rate: 25%-50%
Time to Restore Service: 1-5 days
Looking at the self-assessment, the IONOS cloud platform was performing better than only 22% of the rest of the industry. In other words, it was a low performer at that time in terms of software delivery performance.
So it was really time to optimize the delivery performance to ensure ongoing business success.
Let’s look first at the (simplified) value stream and its individual steps, going from dev to delivering value to the customer.
Where is there room for improvement? Where are we losing time due to bottlenecks, bad quality, or limited resources (e.g. build agents, flaky tests, test environments)?
If we optimize this value stream, we directly optimize our lead time.
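As a toy illustration of this kind of value-stream analysis, summing the average time a change spends in each stage gives the lead time, and the largest stage is the bottleneck to attack first. The stage names and hours below are invented, not IONOS's actual numbers:

```python
# Hypothetical average time (in hours) a change spends in each value-stream stage
stages = {
    "coding": 8,
    "code review": 24,       # waiting for reviewers is often a hidden queue
    "ci build & tests": 2,
    "manual testing": 40,
    "deployment": 4,
}

# Total lead time is the sum of all stages; the bottleneck is the largest one.
lead_time_hours = sum(stages.values())
bottleneck = max(stages, key=stages.get)
```

In this invented example, automating manual testing would cut lead time far more than speeding up the CI build.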
Lead time can be optimized in the following areas: coding, testing, and continuous integration & delivery. To achieve that, the team implemented a lot of measures. The following is just a short list to give an impression of what can be and has been done:
Coding
- Create time and headroom for quality (e.g. a 20% rule of thumb) so engineers can continuously optimize and improve in every development iteration.
- Work with TDD, pair and mob programming to share knowledge, reduce bugs, and get rid of knowledge silos that slow down software delivery.
- Work on a maintainable, easy-to-change codebase to keep up the pace of code changes by applying and training Clean Code practices, running coding katas, and establishing a Software Craftsmanship mindset.
- Use coding-efficiency measures such as AI-assisted coding, sufficiently powerful developer machines, state-of-the-art IDEs (integrated development environments), etc.
Testing
- Follow the test pyramid (unit, integration, e2e, etc.) to be able to move fast in refactorings and deployments.
- Get rid of manual testing as much as possible.
- Fight flaky tests (e.g. with a quarantine pattern) to avoid false positives, unnecessary build reruns, and the broken-window effect.
- Ensure sufficient local test and staging environments to shorten the developers’ cycle time.
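The quarantine pattern for flaky tests can be sketched as follows. This is our own illustrative decorator, not a specific library or IONOS's actual tooling: a quarantined test still runs (so you keep the signal), but its failure no longer breaks the build while the underlying flakiness is fixed.

```python
import functools

def quarantine(reason):
    """Downgrade a flaky test's failure to a report instead of a build breaker."""
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            try:
                test_fn(*args, **kwargs)
                return "passed"
            except Exception:
                # Still visible in the logs, but the build stays green.
                print(f"QUARANTINED FAILURE ({reason}): {test_fn.__name__}")
                return "quarantined"
        return wrapper
    return decorator

@quarantine(reason="flaky timeout, tracked in a ticket")
def test_sometimes_times_out():
    assert False  # simulate the flaky failure

result = test_sometimes_times_out()  # build stays green
```

The important part of the pattern is the exit condition: quarantined tests need an owner and a deadline, otherwise the quarantine silently becomes a graveyard.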
Continuous Integration & Delivery
- Have fast and reliable builds, and a scaling(!) build infrastructure to avoid wait time.
- Simplify version control: no heavyweight branching models like GitFlow and no long-living feature branches, to avoid losing time in merges.
- Pay attention to code reviews as a potential bottleneck that slows down software delivery.
- Build solid deployment automation (incl. DB migrations).
Deployment Frequency
If software is shipped more frequently, both technical and business risk are reduced. The technical risk shrinks because the amount of change per deployment is smaller, and so is the likelihood of introducing a problem. Additionally, new problems can be identified faster and more easily because of the smaller change sets. Business-wise, a high deployment frequency allows us to ship customer value, in the form of fixed issues and new features, at a higher frequency, which increases customer happiness.
To achieve this we did the following:
- Increased the deployment frequency step by step (to see where it starts to break): from once a month to once every 2 weeks, to once a week, to twice a week.
- Built up automated and reliable tests with 80%+ test coverage.
- Implemented automated deployment-unit builds (see build pipelines), which had previously been done manually with a complicated, long checklist.
- Shipped bits and bytes continuously (built once, deployed multiple times, no rebuilds of the software during delivery) and decoupled the release of a feature from the deployment of new software (aka feature toggles).
- Implemented infrastructure as code (e.g. Helm-chart-based configuration).
- Aligned the way of deploying across all test environments: the same way of shipping the software for all environments (before, we had up to 4 different deployment processes).
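Decoupling release from deployment via feature toggles can be sketched in a few lines. The toggle store and feature name below are illustrative, not IONOS's actual setup:

```python
# A toggle store; in practice this would be backed by config or a toggle service.
FEATURE_TOGGLES = {
    "new_billing_flow": False,  # code is already deployed, feature not yet released
}

def is_enabled(feature: str) -> bool:
    return FEATURE_TOGGLES.get(feature, False)

def create_invoice(order_id: str) -> str:
    # Both code paths are deployed; the toggle decides which one runs.
    if is_enabled("new_billing_flow"):
        return f"new flow invoice for {order_id}"
    return f"legacy flow invoice for {order_id}"
```

Releasing the feature later becomes a configuration change (flipping the toggle) instead of a new deployment, and a bad feature can be switched off without rolling back.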
Change Failure Rate
We reduced the change failure rate of the monolith’s deployments by:
- Increasing the deployment frequency: smaller change sets lower the risk of a failure in production, and smaller changes can be tested more easily in advance.
- Automating the rollout procedure (using infrastructure as code).
- Introducing smoke tests that check the success of a deployment right after the rollout in production, to get fast feedback.
- Working hard on test-system/production parity to avoid surprises like different setups, different ways of deploying, configuration mismatches, etc.
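A post-rollout smoke test can be as simple as checking a handful of critical routes. In this sketch, the `fetch` callable stands in for an HTTP GET returning a status code, and the routes are hypothetical examples, not our real API paths:

```python
# Hypothetical critical routes checked right after every production rollout.
CRITICAL_ROUTES = ("/health", "/v1/status")

def smoke_test(fetch, routes=CRITICAL_ROUTES) -> bool:
    # Any non-200 answer right after the rollout marks the deployment as failed.
    return all(fetch(route) == 200 for route in routes)
```

Injecting `fetch` keeps the check testable; in production it would be a real HTTP client with a short timeout, wired into the deployment pipeline.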
Time to Restore Service
We optimized the time to restore service by:
- Automating deployments and rollbacks (incl. DB migrations): in case of an issue, we were able to simply roll back to the previous version.
- Optimizing the system design and building automated failover and redundancy for different components (see also the twelve-factor app principles).
- Improving monitoring and alerting to see and find issues faster: if you are flying blind, you cannot find or recognize issues quickly.
- Working with smaller change sets per deployment, which helps to find issues and restore service operations faster.
- Shortening lead time and increasing deployment frequency, which lead to a shorter time to restore service because bugs can be found and fixes rolled out faster.
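The automated deploy-and-rollback flow can be sketched as a deploy-verify-rollback loop. Here `deploy`, `rollback`, and `smoke_ok` are placeholders for real tooling (e.g. a Helm upgrade/rollback plus a smoke-test run), not actual commands:

```python
def release(deploy, rollback, smoke_ok) -> str:
    """Deploy, verify in production, and roll back automatically on failure."""
    deploy()
    if smoke_ok():
        return "released"
    rollback()  # automated rollback: previous version restored in minutes
    return "rolled back"
```

Because the rollback is automated and includes the DB migrations, restoring service does not depend on a human finding the right runbook at 3 a.m.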
And finally, we improved our availability by:
- Introducing zero-downtime deployments, so no more downtime during a rollout (back then we operated a single instance of the monolith).
- Building a scalable, resilient system architecture (e.g. geo-redundancy, circuit breakers, automated failovers, etc.).
- Running multiple instances of every service (at least two by default): a no-brainer, but …
- Securing our REST APIs against DoS attacks and accidental misuse via rate limits (also for internal systems), because system overloads are often caused by missing or too-relaxed rate limits.
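One common way to implement such rate limits is a token bucket; the minimal sketch below is illustrative, not the actual IONOS implementation:

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity`, refilling `rate` tokens per second."""

    def __init__(self, rate: float, capacity: int, now=time.monotonic):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = float(capacity)
        self.now = now            # injectable clock, makes the limiter testable
        self.last = now()

    def allow(self) -> bool:
        t = self.now()
        # Refill proportionally to the elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # request rejected, e.g. answered with HTTP 429
```

A bucket per client (API key or IP) keeps one misbehaving caller, internal or external, from overloading the service for everyone else.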
After we introduced zero-downtime deployments, lowered the change failure rate, and improved the time to restore service, the availability of the service improved as a result of these measures. To reach a certain SLA level, you may additionally need to improve the system architecture (geo-redundancy, horizontal scalability, etc.).
Where are we today?
- Deployment Frequency: We improved our deployment frequency so much that we can do daily deployments most of the time. This greatly improved our products, our quality, our efficiency and effectiveness, and our customer and developer experience. We moved from 12 to ~120 deployments of the monolith per year.
- Lead Time for Changes: We can deliver changes in hours instead of days or weeks. A full cycle for the monolith takes only 45 minutes.
- Change Failure Rate: We have a much lower change failure rate, and every change has much less impact because of the smaller change sets per deployment.
- Time to Restore Service: In case of a failure, we can restore the service in minutes instead of hours. On average, we were able to recover from major incidents 30x faster; in total, we only needed 20% of the time to restore customer happiness compared to the years before.
- Availability: We were able to increase the availability of our service to meet industry standards and to reduce its volatility as well.
Summary: As we deploy more often, we have smaller change sets. When a change fails, we are able to recover much faster. This results in higher availability, decreased negative customer impact, and less wasted resources for service restoration.
From low to high performer
Based on the numbers below, we became a high performer, performing better than 81% of the rest of our industry: above the industry average and 3.5x better than before.
Deployment Frequency: >10x more (daily vs. monthly)
Lead Time for Changes: >10x faster (hours vs. weeks)
Change Failure Rate: 0-15% vs. 25%-50%
Time to Restore Service: 30x faster (minutes vs. days)
Availability: 99.99% vs. 96%-99%
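To put the availability jump into perspective, each availability level translates directly into an allowed-downtime budget per year:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def allowed_downtime_minutes(availability: float) -> float:
    """Yearly downtime budget implied by an availability level."""
    return (1 - availability) * MINUTES_PER_YEAR

# 99.99% allows roughly 53 minutes of downtime per year,
# while 96% allows roughly two weeks.
four_nines = allowed_downtime_minutes(0.9999)
old_level = allowed_downtime_minutes(0.96)
```

In other words, moving from 96% to 99.99% shrinks the tolerated yearly downtime from about 14.6 days to under an hour.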
What to keep in mind?
- DORA KPIs are powerful metrics to self-assess the performance of the software delivery.
- Optimize your value stream by identifying anti-patterns, from processes to system architecture.
- Every optimization should have a measurable impact on the KPIs.
- You are never done! (see: Theory of Constraints by Eliyahu Goldratt)
If you optimize for the DORA KPIs, you improve the efficiency of your IT organization: the better you perform against these KPIs, the more efficient your organization will be.
If you would like to take part in this further journey, please consider joining us, and please also consider using our products.
If you want to dig deeper into this area then the following books can be recommended:
- Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations (Nicole Forsgren PhD, Jez Humble, Gene Kim | 2018)
- An explanation of the DevOps Research and Assessment KPIs, including the scientific approach behind them.
- The DevOps Handbook: How to Create World-Class Agility, Reliability, & Security in Technology Organizations (Gene Kim, Jez Humble, et al. | 2021)
- An overview of modern software engineering and software delivery practices.
- The Goal: A Process of Ongoing Improvement (Eliyahu M. Goldratt, Jeff Cox | 2012)
- It explains the Theory of Constraints: after optimizing one bottleneck, you will face the next one.