Originally published at DevOps.com
Today, most DevOps teams do not think much about their approach to risk. The typical attitude focuses on what to add to reduce risk further: “Can we add more tests? Can we deploy more carefully?” More tests, more processes and less risk are all obviously good things. Or are they?
As a programmer, I’ve seen a lot of projects with so many tests and so many processes that they actually become counterproductive. That overhead presumably does reduce risk, but nobody knows by how much. I call this attitude ‘blind risk avoidance.’ This avoidant thinking usually builds up over time as we react to failures. It’s a very natural reaction.
But when risk avoidance becomes your standard modus operandi, your delivery cycle gets slower and slower over time. Your team starts to feel trapped in molasses but, as time goes on, they not only train themselves to tolerate this but actively justify it as a good thing. In the meantime, nobody pays attention to the fact that your software is losing its competitive edge and that you’re being out-innovated and out-executed by others. Nobody pays enough attention to engineering team attrition as people drop off, one by one, to join more modern software development teams with less baggage. The ultimate conundrum lies here: If you move too fast, you break things. But if you move too slowly, your business and your team are doomed.
So what is the optimal approach for DevOps teams to take when it comes to risk? The truth is, neither extreme is desirable. Blind risk avoidance leads to a failure of value delivery. On the other hand, blindly ignoring risk means crashing and burning.
The ideal balance lies in neither avoiding risk entirely nor ignoring the reality of risk. Instead, a sensible middle ground can be found in taming risk.
Moving forward, DevOps teams and developers must focus on taming risk to deliver software confidently with data-driven actions, speed and quality.
Automation Can Reduce Operational Overhead
Taming risk should become a larger part of DevOps culture. One of the most famous adopters of this approach is Google’s site reliability engineering (SRE) team.
Google’s SRE team embraces two key principles when it comes to taming risk: Automation for reduced operational overhead and structured and rational decision-making for better, data-driven choices.
To explore this mindset of risk-taming at Google, let’s dive into how the SREs use automation to slash waste, save time and reduce repetitious, low-value tasks and tests.
At Google, SREs are software engineers with an extremely low tolerance for repetitive, reactive work. (Aren’t we all?) But here’s the difference at Google: Thanks to a workplace culture of taming risk, operations that do not add value to a service are avoided or automated away.
The vast majority of DevOps teams hold the same beliefs as Google: That automation cuts down on waste and saves everyone time.
We all know automation allows for lower operational overhead, frees up software engineers to proactively assess and improve the services they support and boosts team morale. No one wants to spend their time or energy on repetitive, low-value tasks. Automation frees up DevOps teams to expend their energy on what matters most: Continuous quality.
Bottom line? Automation is integral to the idea of taming risk. Automation of repetitive, low-value tasks is the perfect way to embrace a mindset of taming, not avoiding or ignoring risk entirely.
Another way that Google’s SRE team tames risk is through structured and rational decision-making based on data.
Data-driven decisions are crucial to the success of any software engineering team, project or mission. Without proper data and feedback, there’s no way for software engineers to make the best, most calculated decisions and mitigate risks.
A data-driven decision methodology can—and should—be applied to DevOps testing cycles.
Another key aspect of data-driven decisions that speaks directly to the concept of taming risk? Service level objectives (SLOs).
A service level objective is an element of a service level agreement that measures the performance of the service provider. SLOs are created (and agreed upon) between the service provider and the customer and help to mitigate disputes between providers and customers.
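To make the idea concrete, here is a minimal sketch of the arithmetic behind an availability SLO. The 99.9% target and the request counts are hypothetical examples, not figures from any real agreement:

```python
# Sketch: the error budget implied by an availability SLO.
# Numbers here are illustrative, not from a real service.

def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Report how much of the failure budget an SLO period has consumed."""
    # Failures the SLO tolerates, e.g. 0.1% of traffic for a 99.9% target.
    allowed_failures = round(total_requests * (1 - slo_target))
    consumed = failed_requests / allowed_failures if allowed_failures else float("inf")
    return {
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,              # 1.0 means the budget is spent
        "budget_remaining": max(0.0, 1.0 - consumed),
    }

# A 99.9% SLO over 1,000,000 requests tolerates 1,000 failures;
# 250 observed failures leave 75% of the budget.
status = error_budget(0.999, 1_000_000, 250)
print(status["allowed_failures"])   # 1000
print(status["budget_remaining"])   # 0.75
```

The useful framing is that the SLO turns “how reliable is reliable enough?” from an argument into a number both sides can check.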
Now, let’s take the concept of SLOs and apply them to engineering and QA teams.
In traditional organizations, engineering and QA are set up to interact with an intentional, healthy amount of tension. One team builds software and the other team tries to tear it down. While this kind of necessary and healthy conflict between teams can lead to better results, the inherent push and pull between engineering and QA can also create an excess of internal infighting.
Engineering teams will always advocate for more velocity or blind ignorance of risk, above all else. On the flip side, QA will always advocate for quality, or blind avoidance of risk, above all else.
Of course, neither side is entirely correct (nor incorrect).
Instead, the solution to this tension lies somewhere in the middle: The taming of risk, not ignorance or avoidance of risk. This is where an SLO, or the numeric target that represents ‘somewhere in the middle,’ comes into play. For engineering teams focused on velocity and QA teams focused on quality, creating an SLO—a happy medium to work toward with regard to risk—will create a more balanced workplace culture. In short, this approach can be considered a taming of risk.
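One way to picture the SLO as that shared numeric target is a simple release policy: ship while the service is within its objective, stabilize when it is not. This is a sketch with hypothetical thresholds, not a prescription:

```python
# Sketch: an SLO as the shared target between engineering (velocity)
# and QA (quality). The 99.9% target and rates are hypothetical.

def release_decision(slo_target: float, observed_success_rate: float) -> str:
    """Ship while the service meets its SLO; stabilize when it does not."""
    if observed_success_rate >= slo_target:
        return "ship"        # within budget: engineering keeps releasing
    return "stabilize"       # budget spent: quality work takes priority

print(release_decision(0.999, 0.9995))  # ship
print(release_decision(0.999, 0.9980))  # stabilize
```

Neither team “wins” permanently; the data decides which mode the team is in this week.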
‘Donut’ Slow Velocity
Now, we’re finally going to get to the donuts. To demonstrate how embracing a mindset of taming, not avoiding, risk can create better, happier and faster DevOps teams, let’s take a look at Krispy Kreme Donuts.
Krispy Kreme churns out millions of delicious, hot, glazed donuts every single day. However, during production, it’s inevitable that one or more of those donuts will be flawed. (I know, you’re asking, “Is there really such a thing as a flawed donut?!” For the purposes of this analogy, let’s say yes, there is.)
One common donut defect occurs when the donuts bunch up on the production line and clog it. Now, let’s assume this common error is caused by the speed of the line. If the Krispy Kreme team attempts to solve this problem by pushing the line even faster, tolerances get tighter and the risk of even more smashed, defective donuts only rises. That’s certainly not a solution to this issue.
On the other hand, if the Krispy Kreme team slows down the donut production line speed, there will be a smoother—albeit slower—operation.
Is Krispy Kreme better served by slowing down its entire production line and sacrificing its velocity to eliminate the risk of a potential defective donut? Of course not!
If that’s not true, then is Krispy Kreme better served by cranking up their production line speed to go as fast as possible, regardless of the risk of defective donuts? Again, of course not!
For sweet success, Krispy Kreme must embrace a mindset of taming risk. By baking careful measurement and an iterative feedback loop into their donut production process, Krispy Kreme is better able to create delicious, high-quality donuts at an optimal speed, while accepting that some flawed donuts will slip through. This balance of speed and quality creates the best economic outcome, and the mindset and measurement allow risk-taming to keep evolving. As the production line changes, the optimal speed must evolve with it.
DevOps teams should take this same approach. Tame the risk without sacrificing speed and quality. ‘Donut’ slow velocity to fix a few minor issues!
If you’re still not convinced, here’s another way to think about taming risk that is decidedly less delicious than warm, glazed donuts: Taxes.
The IRS balances velocity and accuracy during tax season as it processes millions of tax returns. While suspect returns are flagged for audits, the IRS simply cannot catch every single flawed return. It’s neither economical nor possible to run a detailed audit on every problematic return, every single year.
So, instead, the IRS tames its risk with automation and data-driven decisions. The IRS focuses on creating and using methods to detect patterns that correlate with flawed tax returns. Then, once automation flags these problem returns, experts can devote their precious time and mental energy to tackling the biggest of them.
Again, this approach to taming risk creates an inherent balance between quality and speed and allows for an evolution of the process as economic activities and laws evolve.
Whether you love eating donuts or doing your taxes, the bottom line is clear. Most organizations need to balance potential risks with maintaining speed, quality and team morale. DevOps teams are no different.
How to Tame Risk in DevOps Teams
So how can we, as DevOps teams, tame the risk?
The basic idea here is to do smarter testing, not just more testing. The good news is that if you have a reasonable level of automation, the data coming out of your existing automation can make your testing smarter. Approaches like predictive test selection let you spend your compute hours more meaningfully, improving speed and quality at the same time. Approaches like flaky test insights let your precious engineering resources focus on the most impactful work to improve the quality of your tests.
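As a small illustration of mining existing automation data, here is a sketch of a basic flaky-test detector. It assumes you can export (test name, commit, result) records from your CI system; the record format and test names are hypothetical:

```python
# Sketch: flagging likely-flaky tests from CI history.
# Assumes CI can export (test_name, commit_sha, passed) records;
# the format and data below are illustrative.

from collections import defaultdict

def find_flaky_tests(records):
    """A test that both passed and failed on the same commit is likely flaky."""
    outcomes = defaultdict(set)  # (test, commit) -> set of observed results
    for test, commit, passed in records:
        outcomes[(test, commit)].add(passed)
    # Two distinct outcomes on identical code means the test, not the code, changed.
    return sorted({test for (test, _), results in outcomes.items() if len(results) == 2})

ci_history = [
    ("test_login", "abc123", True),
    ("test_login", "abc123", False),   # same commit, different result: flaky
    ("test_search", "abc123", True),
    ("test_search", "def456", False),  # code changed; could be a real failure
]
print(find_flaky_tests(ci_history))  # ['test_login']
```

Even a crude signal like this lets you rank which tests deserve engineering attention first, rather than chasing every red build.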
The Power of Taming Risk
What is the right balance between speed and quality? This is the constant question DevOps teams face. But here’s how I answer this riddle: Taking calculated risks is the most logical and rational approach in the workplace.
It’s easy to avoid risk. Anyone can skirt around potential pitfalls by avoiding new solutions and sticking to the same old methodologies. But it’s so much more satisfying to take a calculated risk rather than ignore or avoid the risk outright.
If you want to improve the velocity and the happiness of your DevOps team, it’s time to embrace a new mindset: Tame the risk, gain control.