
Handling Failure During Your Evolution towards CD & DevOps

Published Nov 29, 2018

In this guest post, Paul Swartout, an IT expert and the author of Continuous Delivery and DevOps – A Quickstart Guide, explains how failure should be handled during the journey of CD and DevOps adoption.

As you go along your journey of adopting CD and DevOps, things will occasionally go wrong; this is inevitable and is nothing to be afraid or ashamed of. There may be situations that you didn't foresee, or steps in the existing process that were not picked up during the elephant exposure. It might be as simple as a problem within the chosen toolset, which isn't doing what you had hoped it would or is simply buggy.

Your natural reaction may be to hide such failures or at least not broadcast the fact that a failure has occurred. This is not a wise thing to do. You and your team are working hard to instill a sense of openness and honesty, so the worst thing you can do is the exact opposite. Think back to what we covered previously in relation to failing fast in terms of finding defects; the same approach works here as well.

Admitting defeat, curling up in a fetal position, and lying in the corner whimpering is also not an option. As with any change, things go wrong, so review the situation, review your options, and move forward. Once you have a way to move around or through the problem, communicate this. Ensure you're candid about what the problem is and what is being done to overcome it. This will show others how to react and deal with change—a sort of lead-by-example.

You might be concerned that admitting failures might give the laggards more ammunition to derail the adoption; however, their win will be short-lived once the innovators and followers have found a solution. Hold fast, stand your ground, and have faith.

If you're using agile techniques such as Scrum or Kanban to drive the CD and DevOps adoption, you should be able to change direction relatively quickly without impeding progress.

Okay, so this is all very positive mental attitude (PMA) stuff and may be seen by the more cynical among you as management hot air and platitudes, so let's look at a scenario involving a fictional IT company called ACME systems.

ACME systems implemented a deployment transaction model to manage dependencies and ensure only one change went through to the production system at any one point in time. This worked well for a while, but things started to slow down. Automated integration tests that were previously working started to fail intermittently; defects were being raised in areas of functionality that were previously seen as bulletproof.

This slowdown started to impact the wider R&D team's ability to deliver, and the noise level started to rise, especially from the vocal laggards. Open and honest discussions between all concerned ensued and, after much debate, it transpired that the source of the problem was quite simple: dependency and change management was not keeping up with the speed of delivery.

In essence, there was no sure way of determining which software asset change would be completed before another software asset change and there was no simple way to try out different scenarios in terms of integration. What it boiled down to was this: if changes within asset A had a dependency on changes within asset B, then asset B needed to go live first to allow for full integration testing.

However, if asset A was ready first, it would have to sit and wait—sometimes for days or weeks. The deployment transaction was starting to hinder Continuous Delivery (CD).
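
The ordering constraint itself is simple to reason about. Here's a minimal sketch (not ACME's actual tooling; the asset names and dependency map are invented for illustration) of how a safe deploy order can be derived from the dependency graph. The real pain was not working out the order, but the fact that an asset that was ready had to sit in the queue until everything it depended on had gone live.

```python
# Minimal sketch: deriving a safe deploy order from asset dependencies.
# The asset names and dependency map below are hypothetical.
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# "changes in asset A depend on changes in asset B" => B must go live first
dependencies = {
    "asset_a": {"asset_b"},  # asset A waits for asset B
    "asset_b": set(),        # asset B has no dependencies
    "asset_c": {"asset_b"},  # asset C also waits for asset B
}

# static_order() yields assets so that dependencies always come first
deploy_order = list(TopologicalSorter(dependencies).static_order())
print(deploy_order)  # e.g. ['asset_b', 'asset_a', 'asset_c']
```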

Here's a simple process that ACME systems called the deployment transaction:
[Figure 1: The deployment transaction]

Everyone had agreed that the deployment transaction worked well and provided a working alternative to dependency hell. When used in anger, however, it exposed a flaw that started to cause real and painful problems. Even if features could be switched off through feature flags, there was no way to fully test integration without deploying everything to the live production environment.

This had not been a problem previously, as the speed of releases had been very slow and assets had been clumped together. ACME systems could now deploy to production very quickly, which raised a new problem: in which order should assets be deployed? Many discussions took place and complicated options were looked at, but in the end the solution was quite simple: move the boundary of the deployment transaction and allow for full integration testing before assets went to production.

It was then down to the various R&D teams to manually work out in which order things should be deployed.

The following diagram depicts the revised deployment transaction boundary:
[Figure 2: The revised deployment transaction boundary]
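
The shape of the change is also easy to sketch in code. The following is a hypothetical illustration (the stage functions and environment names are invented, not ACME's real pipeline) of the revised boundary: every asset in the transaction is deployed to a shared integration environment and tested together, and production is only touched once the whole set has passed.

```python
# Hypothetical sketch of the revised deployment transaction boundary.
# Stage functions are placeholders; the point is that full integration
# testing now happens inside the transaction, before production.

def build(asset):
    print(f"building {asset}")

def deploy_to_integration(asset):
    print(f"deploying {asset} to the shared integration environment")

def run_integration_tests(assets):
    print(f"running full integration tests across {assets}")
    return True  # placeholder result

def deploy_to_production(asset):
    print(f"deploying {asset} to production")

def deployment_transaction(assets_in_deploy_order):
    # Old boundary: only the production deployment sat inside the
    # transaction, so integration could only be proven in the live system.
    # New boundary: integration deployment and testing come first, and
    # production is only touched once the whole set has passed.
    for asset in assets_in_deploy_order:
        build(asset)
        deploy_to_integration(asset)
    if run_integration_tests(assets_in_deploy_order):
        for asset in assets_in_deploy_order:
            deploy_to_production(asset)

deployment_transaction(["asset_b", "asset_a"])
```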

So ACME had a potential showstopper, which could have completely derailed their CD and DevOps adoption. The problem became very visible and many questions were asked. The followers started to doubt the innovators, and the laggards became vocal. With some good old-fashioned collaboration and honest discussions, the issue was quickly overcome.

Again, open and honest communication and courageous dialogue is the key. If you keep reviewing and listening to what people are saying, you have a much better opportunity to see potential hurdles before they completely block your progress.


Another thing that may scupper your implementation and erode trust is inconsistent results.

Processes that are not repeatable

There is a tendency for those of a technical nature to automate everything they touch, such as the automated building of an engineer's workstation, automated building of software, and automated switching on of the coffee machine when the office lights come on. This is nothing new and there is nothing wrong with this approach as long as the process is repeatable and provides consistent results. If the results are not consistent, others will be reluctant to use the automation you spent many hours, days, or weeks pulling together.

When it comes to CD and DevOps, the same approach should apply, especially when you're looking at tooling. You need to trust the results that you are getting time and time again.

Some believe that internal tooling and labor-saving solutions or processes that aren't out in the hostile customer world don't have to be of production quality as they're only going to be used by people within the business, mostly by techies. This is 100 percent wrong.

Let's look at a very simple example: if you're a software engineer, you will use an IDE to write code and a compiler to generate the binary to deploy. If you're a database administrator (DBA), you'll use a SQL admin program to manage your databases and write SQL. You expect these tools to work 100 percent of the time and produce consistent, repeatable results: you open a source file and the IDE opens it for editing; you execute some SQL and the SQL admin tool runs it on the server.

If your tools keep crashing or produce unexpected results, you will be a bit upset (putting it politely) and will no doubt refrain from using said tools again. It may even drive you insane.

The same goes for the tools (technical and non-technical) you build and/or implement for your CD and DevOps adoption. These tools have to be as good as (if not better than) the software your teams are creating.

The users of the implemented tools/processes must be confident that when they perform the same actions over and over again, they get the same results. As this confidence grows, so does trust in the tool/process. Ultimately, the tool/process will be taken for granted and people will use it without a second thought.

Consequently, people will also trust the fact that if the results differ from the last run, something bad has been introduced (for example, a software bug has been created) that needs immediate attention.
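
Repeatability can be checked rather than assumed. The sketch below is hypothetical — it assumes the build output is meant to be byte-for-byte reproducible and invents a small JSON baseline file — but it shows the idea: fingerprint the output of a run and compare it with the last known good run, so a differing result is flagged immediately rather than discovered later.

```python
# Hypothetical sketch: flag a run whose output differs from the last run.
# Assumes the artifact is meant to be byte-for-byte reproducible; the
# baseline file name and artifact path are invented for illustration.
import hashlib
import json
from pathlib import Path

BASELINE_FILE = Path("last_known_good.json")

def fingerprint(artifact_path: str) -> str:
    """Hash the produced artifact so two runs can be compared exactly."""
    return hashlib.sha256(Path(artifact_path).read_bytes()).hexdigest()

def result_matches_last_run(artifact_path: str) -> bool:
    baseline = json.loads(BASELINE_FILE.read_text()) if BASELINE_FILE.exists() else {}
    current = fingerprint(artifact_path)
    matches = baseline.get(artifact_path) == current  # False on first run
    baseline[artifact_path] = current
    BASELINE_FILE.write_text(json.dumps(baseline, indent=2))
    return matches

# A False result means something has changed and needs immediate attention.
```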

Consider how much confidence and trust will be eroded if the tool/process continually fails or provides different and/or unexpected results. You must therefore be very confident that the tooling/processes are fit for purpose.

We have already covered the potential hurdles you'll encounter in terms of corporate guidelines, red tape, and standards. Just think what fun you will have convincing the gatekeepers that CD and DevOps is not risky when you can't provide consistent results for repeatable tasks. Okay, maybe fun is not the correct word; maybe pain is a better one.

Another advantage of consistent, repeatable results comes into play when looking at metrics. If you can trust the fact that to deploy the same asset to the same server takes the same amount of time each time you deploy it, you can start to spot problems (for example, if it starts taking longer to deploy, there may be an infrastructure issue or something fundamental has changed in the configuration).
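
As a simple hypothetical sketch of that idea (the 25 percent threshold and the sample history are invented), a deployment that takes noticeably longer than its usual duration can be flagged automatically:

```python
# Hypothetical sketch: use consistent deploy times to spot trouble early.
# The threshold and the sample history are invented for illustration.
from statistics import median

def deploy_looks_slow(current_seconds: float,
                      history_seconds: list[float],
                      tolerance: float = 0.25) -> bool:
    """Flag a deploy that took noticeably longer than it usually does."""
    if len(history_seconds) < 5:        # not enough history to judge yet
        return False
    baseline = median(history_seconds)  # robust against the odd outlier
    return current_seconds > baseline * (1 + tolerance)

# The same asset to the same server normally takes around two minutes...
history = [118, 121, 119, 122, 120, 117]
print(deploy_looks_slow(180, history))  # True -> worth investigating
```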

All in all, it may sound boring and not very innovative, but with consistent and repeatable results, you can stop worrying about the mundane and move your attention to the problems that need solving, such as the very real requirement to recruit new people into a transforming or transformed business.

If you enjoyed reading this article, you can check out Continuous Delivery and DevOps – A Quickstart Guide. This book covers real-world scenarios and provides examples, tricks, and tips to ease your journey of CD and DevOps adoption. You will be guided through the various stages of adoption to ensure you are set up for success and are fully aware of the work, tools, techniques, and effort required to realize the massive benefits CD and DevOps can bring.
