Anatomy of a crisis

This is a story about a failure. Why on earth should we dwell on that? Well, at Reload we believe that we can make a difference through excellence, honesty and transparency. We strive to continually challenge and improve ourselves and our surroundings. These values are deeply embedded in our culture. This has been an opportunity to learn - for ourselves, and maybe for you as well. And it’s an interesting story - and stories about failures are way too rare, in our opinion. Failure is only the opportunity to begin again, only this time more wisely.

Context

Reload exists in a world of fast-moving web development projects. We are a development agency with many clients whose web solutions, which we develop and manage, vary in how critical they are. Over time we have had our share of minor fuck-ups, and we have managed to learn from most of them. But until recently, we hadn’t really experienced a real crisis.

A true crisis

We experienced a rather serious crisis in October. This is the story of how two lines of code for one of our major clients changed everything - and how we ended up sending out personal, confidential information via email to ~18,000 members of the union. Members who were ALSO journalists, as the client is the Danish Union of Journalists. This is NOT the publicity you are looking for ;-) This was the kind of mistake where headlines were made, Twitter storms started brewing and the media were calling for comments. That was a first for us. The timing wasn’t great either. We had just been through a rough period in the project and in the relationship with the client, but now things were going fine. And then, on a seemingly innocent Tuesday, with just the push of a button - two weeks after the actual launch - shit hit the fan! Horrifying in its own right, but then a strange thing happened. Because this became both the worst and the proudest moment in our seven-year company history.

Fear cripples

Different people in different company cultures have different fears when it comes to the consequences of a crisis. Fear is usually not very constructive and gives you tunnel vision. Key personnel might fear losing their reputation at best and their job at worst. All those fears can cripple us and cause us to react in a way that is not the best in the given situation. Therefore, the big difference between people, as well as companies, is how the crisis is handled. Are people able to keep a cool head, and does the company they work for encourage this kind of behaviour? How a company handles a crisis says a lot about its culture.

What happened

We had worked on the project for about half a year, the client had been working on it for about a year and a half, and while we had formally launched the site, we had yet to inform the members about it. It had been a difficult project, both technically and emotionally, and everybody was really excited about the grand finale. The real test of the system was to notify all the members of their shiny new site, letting them log in and try it out. We were all ready to ride into the sunset. We did our best to give the members, the heart and soul of the organisation, the best possible experience. To make it easy for the members, who had complained about the old site for years and were eagerly waiting for the new one to arrive, we would send out the information beforehand. This was really the culmination of months of work and planning. Sure, we had launched the site, but this was the grand finale.

Of course the product owner knew how tricky the operation was and had tested it quite a bit. It worked, but still, it’s always a bit scary to push the button and then, irreversibly, send 18,000 emails. She pushed the button, and 18,000 emails started going out in batches of 10 at a time. Everything seemed fine; the peace lasted for a sum total of 47 minutes until disaster struck.

In this case, the project manager and the rest of the team were dining happily at lunch, blissfully ignorant of the approaching storm. After unsuccessful attempts at getting our attention on our chat, the product owner called us, and the project manager scrambled the team for a quick situation report. Tensions were high, as we did not know what was wrong at that point. Some users had reported being logged in as another user when using the link in the email. So we knew that 18,000 emails had been sent out, possibly exposing access to confidential member data to all of them.

The nerves

Even though we might laugh a little about it now, it was not a funny situation to be in. Customers were angry and afraid (the union received 2,000 confused and not-so-happy emails that day) - and what should we tell them? You can probably imagine the situation at the office. Have we done something really stupid? Are there any legal repercussions? How much harm have we actually inflicted? Will the client hate us? Will my boss?

Did we panic and run around the office screaming? No, luckily we didn't. Though the stress of the situation did affect us, we were quite composed, and the entire office came together as one, everyone helping in any way they could, either solving the problem or getting coffee and snacks to the people solving it.

It quickly dawned on the team that this wasn’t going to fix itself and that it was a non-trivial issue. We had to take immediate action. We noticed that we had quickly given each other informal roles. One was responsible for communicating with the client, getting important information from them and keeping them informed of the progress. A couple of others focused on containing the incident and looking for the root cause. It was a great display of teamwork under a kind of pressure we rarely experience, and it taught us a thing or two about our company.

The technical details of the crisis

As mentioned, the send-out was divided into batches. The information we had at the time of the crisis was this: some members (not all) would be logged in to a different account when using their one-time login link. It was really like looking for a needle in a haystack.

We quickly decided that it did not make sense for everyone to look for the needle. Instead, one person focused on what damage the error had already caused and tried to patch that up. One of the damages was that one-time login links had been exposed to people who did not own the profile behind the link. Somehow we needed to invalidate those links. At no point had we taken the site offline. We had just launched the site and were quite proud of it, so it felt like quite a hit to have to take it down. This was probably one of our mistakes. The site should ideally have been taken offline at the first sign of trouble. In the end, 732 people managed to log in during the period. If we had taken the site offline immediately, that number might have been dramatically lower. We did eventually take the site offline, 18 minutes after we received the initial report that something was wrong. 18 valuable minutes.
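To give a sense of how little code the “take it offline” step actually requires, here is a minimal sketch, assuming a Drupal 7 site and drush access; it is illustrative, not the exact commands we ran:

    <?php
    // Hypothetical emergency script, e.g. run with `drush php-script`, on a
    // Drupal 7 site. Enabling maintenance mode takes the site offline for
    // regular visitors, so the damage cannot spread while the investigation
    // continues.
    variable_set('maintenance_mode', 1);

    // Clear caches so the maintenance page is served immediately.
    drupal_flush_all_caches();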

After this, the project manager called the client to let them know what was going on. As we learned, communication is key. At this point, it is perfectly acceptable to say something like: “I have no idea what caused the problem, we are working on finding that out, but until we do, we have taken the site offline to stop the problem from getting worse”. No one expects you to find the issue right away.

We then invalidated the links and were ready to open the site up again. Or were we? We were just about to put the site online again when we remembered something. Sessions had to be invalidated as well, because 732 people had logged in and might still have an active session, meaning they could still have access to another person’s profile. Alright, sessions invalidated, we were now ready and we put the site online again.
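For the session part, here is a sketch of the kind of step involved, again assuming a Drupal 7 site with the default database-backed session handling, and again illustrative rather than the exact code we used:

    <?php
    // Emptying the sessions table ends every active session on a Drupal 7 site
    // that uses the default database session handler. Everyone is logged out,
    // so nobody keeps access to a profile that is not their own; users simply
    // have to log in again.
    db_truncate('sessions')->execute();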

Trying to find the root cause of the problem

During this part of the situation we remembered another possible problem. What about the people who had logged in and changed either the password or the email address? They would potentially still have access to a profile that was not theirs. Another setback.

We reset all passwords and reset the email addresses to their original values. At the same time we rolled back any other profile changes that might have been made during the incident. This was a bit easier than it normally would have been, as we could simply import the member data once again. Everything was back in order. We searched for the culprit for hours on end, and when we finally found it, we were happy to realise that it was indeed really hard to spot. It wasn’t just something that had gotten past us because we had been unfocused.
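As a rough illustration of what such a bulk reset can look like, assuming a Drupal 7 site; the hypothetical snippet below only covers the password part, while the email addresses and other fields came back with the fresh member import:

    <?php
    // Give every account a new random password so that any credentials changed
    // during the incident stop working. In Drupal 7, user_save() hashes the
    // plain-text value passed in 'pass'.
    $uids = db_query('SELECT uid FROM {users} WHERE uid > 0')->fetchCol();
    foreach ($uids as $uid) {
      $account = user_load($uid);
      user_save($account, array('pass' => user_password(20)));
    }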

Can you spot the error?

If you can, I am sure there is a job open for you with us, because none of us could. It is even harder for you than it was for us: we had a hard time spotting it, and we knew how the system behind it worked. This was the line that caused all the commotion:

[Image 1: the offending line of code]

[Image 2]

As it turns out, the email-action instance is reused within a batch, meaning that when we overwrote this variable with the mail whose tokens had already been replaced, the same variable was used in the next run, causing 10 people to receive identical emails. So here is the solution: two simple lines that could have prevented all that mess.
[Image 3: the two-line fix]
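Since the original screenshots are not reproduced here, the sketch below recreates the pattern in a simplified, hypothetical form - the class and token names are made up, but the mechanics are the ones described above: one email-action object is shared by a whole batch, its template is overwritten with the token-replaced text on the first run, and every following recipient in the batch gets the first recipient’s personal email, one-time login link and all. One way to express the fix is to work on a copy of the action instead of mutating the shared instance.

    <?php
    // Simplified, hypothetical reconstruction of the bug - not the actual
    // project code. One EmailAction object is reused for a whole batch of 10.
    class EmailAction {
      // Template containing personal tokens, including the one-time login link.
      public $message = 'Hi [user:name], log in here: [user:one-time-login-url]';
    }

    function send_batch(array $users, EmailAction $action) {
      foreach ($users as $user) {
        // BUG: the first iteration replaces the tokens and overwrites the shared
        // template, so every later user in this batch receives the first user's
        // personal message - including that user's one-time login link.
        $action->message = str_replace(
          array('[user:name]', '[user:one-time-login-url]'),
          array($user['name'], $user['login_url']),
          $action->message
        );
        mail($user['email'], 'Your new member site is ready', $action->message);
      }
    }

    // The fix, in essence: clone the shared action (or re-read the template)
    // each time, so the token replacement never touches the shared instance.
    function send_batch_fixed(array $users, EmailAction $action) {
      foreach ($users as $user) {
        $personal = clone $action;
        $personal->message = str_replace(
          array('[user:name]', '[user:one-time-login-url]'),
          array($user['name'], $user['login_url']),
          $personal->message
        );
        mail($user['email'], 'Your new member site is ready', $personal->message);
      }
    }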

 

It took us many hours of debugging to find them. The PM relayed the findings and the data, sent the list of affected people once we were able to retrieve it, and tried to identify any loose ends or blind spots. On top of that, our CEO took on the role of external communicator and spoke to the press and the important stakeholders inside the client's organisation.

How do these kinds of errors even occur?  

Even if we do everything we possibly can to avoid errors, they will still happen and once in a while we will need to handle them. We have found that when developing any kind of software, it is important to weigh the amount of testing needed. This decision should be based on a number of factors, two of the most important ones being the criticality of the application and the possible repercussions, should it go wrong.

If you are like most of our clients, you would be telling me that your website is the most critical one in the world and that everything should always work. We get that most clients see their website as absolutely critical, but let’s face it: lives are not lost when their website does not work. If, on the other hand, you are developing software for a space shuttle, you want to make sure that the software is absolutely flawless. You cannot just update the software on the fly. Once it's installed in the spacecraft and launched into space, there's no turning back. And if something does go wrong with space shuttle software, the consequences might be millions of dollars worth of damage or, in the worst case, the loss of human lives.

A funny example: In 1999 NASA lost its Mars Climate Orbiter because a team at Lockheed Martin had delivered thruster data in imperial units (pound-force seconds), whereas the rest of the teams worked in metric units (newton-seconds). The spacecraft was lost. While most of our developers would probably argue that their software is sometimes out of this world, we must admit that we are not in a situation where it makes sense to be absolutely sure that everything works every time we release something. It's simply not realistic, cost-wise.

What we learned

So after all this was over and done, we thought to ourselves: gosh, that wasn’t fun, but we actually did quite OK. But maybe we should reflect on and internalise the findings.

As a result, we have condensed our learnings from that day down to a couple of “rules” and “roles”. The five rules of fuck-ups, you might call them :-)

Take these points home with you, should you ever find yourself in a similar situation:

  1. Take quick and decisive action
    Focus on limiting the damage first, not on fixing the problem. Make the hard choice when it is the right choice. Address the problem internally, then communicate vigorously to the client. Coordinate with the client on how to communicate this externally if needed. If you need to issue a public statement, do it soon thereafter. Remember, it’s always better for you to control the conversation than to let the press do it, so make sure you beat the press to it. The quicker you respond to an issue, the faster it runs through the news cycle. A quick response also shows that your company is concerned, proactive and in control of the situation.
     
  2. Be transparent, honest and open
    In these cases, silence is not golden. Ignoring or being tight-lipped about a public mishap can do as much damage to your company’s reputation as the mishap itself. Companies who come forward and address their mistakes in an open and candid way regain the public’s trust and help mitigate any reputational damage that has already occurred.  Be open about the mistake as well as how the company is handling it and be sure to make someone accessible for further questions, comments or concerns.
     
  3. Constant communication is key
    Keep the client aware of the status, even if it hasn't changed since you last spoke. Coordinate with the client on who speaks to the press. It’s often not a good idea to have multiple spokespersons, but sometimes it's necessary.

    When it comes to negative publicity, the public and media will be most interested in how your company is involved and how it will affect the company and/or client in the future. Your public statement should cover what you have done to mitigate the issue and/or what steps you will be taking to do so, when you expect the issue to be resolved and how your company will prevent this type of issue in the future.
     

  4. Appoint roles
    Make sure everyone involved understands what their role entails.
     
  5. Don't panic
    If it doesn’t kill you, it makes you stronger.

The 4 most important roles

We have learned that it is a very good idea to have clearly defined roles. We did not verbalise the roles; they emerged naturally. Even so, verbalising the roles is always a good idea. It can clear up confusion that you do not want occurring in the middle of a crisis.

Damage controller

The damage controller focuses solely on limiting the damage done by the crisis. It is important for this person to keep a cool head and think through all possible exploits and security holes that might have been opened up by the error. It is also very important for this person to talk things through with another developer or a project manager.

The fixer

The fixer focuses on uncovering exactly how the error happened and what should be done to fix it. The focus is not on fixing it right away. The important job for the fixer is to uncover potential exploits or security holes and report them to the damage controller for closing. Often, the damage controller cannot anticipate the repercussions of the error without knowing its root cause.

The coordinator

The coordinator is responsible for coordinating information from the client, removing potential roadblocks and making sure that the team talks things through and doesn't get tunnel vision, ignoring important aspects of the error. It is also the coordinator's job to talk to the client, giving them a status on what is being done. This is very important. There is nothing worse for the client than having screaming customers, no way of doing anything about it and complete radio silence from the team working on the solution.

The communicator

The communicator is responsible for communicating with external parties, for instance the press and key stakeholders, both in your own organisation and in the client’s organisation.

How could this be our proudest moment, you ask?

Because, even if we were totally unprepared for this situation, we actually handled it pretty well and our company values passed the test. When we ask our co-workers to give an example of when they were really proud to be a Reloader, this situation gets mentioned all the time. It was a defining moment for us, because everyone came together to work out a solution as one. Tensions were high, but people were focused rather than afraid. The rest of Reload helped and were very supportive.

This shows that a crisis like this is not only a bad thing. When handled correctly, it can have a positive effect on the company culture and even on the reputation of individuals and the company as a whole. We have even received positive feedback from people outside our company, who followed the situation and read the written report we published the same day. Of course, there was a lot of negative press, but in the end we think we came out on top.

The Danish Union of Journalists is still our client. I do believe we all came out of this stronger.

Everyone makes mistakes all the time - and that’s OK. But I really believe that the only real mistake is the one from which we learn nothing.

"It's always helpful to learn from your mistakes because then your mistakes seem worthwhile" - Garry Marshall

[Poster: overview of the roles]

Download the poster here 

This article was originally a talk from Drupal Camp 2017 in Aarhus.