Did you watch the Super Bowl? There was an 84 Lumber commercial showing a mother and daughter risking their lives trying to get to the United States from Mexico. It showed their long and arduous journey through the desert, only for them to face a giant wall separating them from America.
Particularly poignant after what the now-president said about Mexican immigrants while campaigning, the ad was purposeful in showing one type of extremely arduous immigrant experience. It was gut-wrenching in a way that made you want to see what happens next to the mother and daughter, which drove traffic to the 84 Lumber site where the rest of the video could be seen.
The only problem is that they didn’t predict they would be so successful, and all the incoming traffic crashed their site.
Predicting the Unpredictable
This kind of traffic spike would have been hard to predict or model given their existing business model of a lumber company. Yet, with so much in the news these days about immigration and the wall being proposed by Trump, the ad sparked a major emotional response and a correspondingly huge amount of page hits.
Other instances of this happening, although not as emotionally moving, are not infrequent. Victoria’s Secret had a similar and possibly more costly outage on Black Friday in 2011.
And there are sites like Is It Down Right Now that inform you if Netflix, Facebook, Twitter, Amazon, and other major sites are not working for just you, or if they’re down for everyone.
The question is how can you prepare a website or web app for sudden popularity or success beyond what you forecast or even beyond your wildest dreams? What measures can we take as software engineers to prevent such catastrophic failure? Can you prepare? Can you design? Test?
Predictions Can’t Always be Trusted
You can’t really predict a catastrophe such as a sudden crash of your software due to an outpouring of interest, as happened to 84 Lumber. Despite the touting of Big Data and how powerful it is in giving meaningful insights and the ability to make accurate predictions, we are unfortunately predicting many things that really should not be predicted. As a result, we falsely think that if an event has a very low probability, we don’t need to worry about it. But these kinds of predictions give us a false sense of security.
No matter how small the probability of occurrence, if an occurrence results in major customer loss or plunge in stocks, it could result in the death of the company, and you absolutely must estimate and mitigate that risk. If you want to learn more about mitigating that risk, in June at Better Software West in Las Vegas, my co-presenter Moss Drake and I will be giving a tutorial on Risk Management.
As Nassim Taleb notes in his work on the precautionary principle, when the cost of ruin is infinite, you cannot apply typical risk management principles, i.e., probability of occurrence x expected loss. “Ruin is forever.” This is because if the loss is infinite, then any nonzero probability x infinity is still infinite.
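To make the point concrete, here is a minimal sketch of that arithmetic in Python. The function name `expected_loss` and the sample numbers are illustrative assumptions, not part of any risk management standard.

```python
import math


def expected_loss(probability: float, loss: float) -> float:
    """Classic risk estimate: probability of the event times the loss if it occurs."""
    return probability * loss


# A bounded loss shrinks as the event gets rarer (sample figures are made up).
for p in (0.1, 0.001, 1e-9):
    print(f"p={p}: bounded loss -> {expected_loss(p, 1_000_000):.4f}")

# An unbounded ("ruin") loss does not shrink: any nonzero probability leaves it infinite.
for p in (0.1, 0.001, 1e-9):
    print(f"p={p}: ruin -> {expected_loss(p, math.inf)}")
```

No matter how small you make the probability, the second loop keeps printing `inf`, which is why the usual multiplication gives no guidance once ruin is on the table.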
Take, for example, a tsunami. What is the probability that a tsunami will happen in Japan? If you consider how many tsunamis there have been in the last 10 years, the chance is minuscule. But in most coastal towns in Japan, if a tsunami does happen, it could mean hundreds of people losing their lives and livelihoods. Hence, they have prepared shelters as well as sea walls to blunt its effects.
Another example closer to home is earthquakes. There have been 6 major earthquakes in the San Francisco Bay Area in the last 40 years.
If we apply standard risk management, we could calculate, based on past data, that on any particular day there is a probability of 6 earthquakes / roughly 13,900 days since 1979 = .000432, or a .04 percent chance. So why do people buy earthquake insurance?
Let’s say you pay a $1,500 per year premium on a $500,000 replacement cost. Should you pay $1,500 per year against a risk of .000432 x $500,000, which is $216? It sounds like a very stupid decision. Why fork out $1,500 per year when the calculated loss is only $216? Because most people emotionally equate earthquakes with ruin or death and want to avoid that at all costs. Bottom line: you can’t use standard risk management when the consequences are grave.
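The back-of-the-envelope numbers above can be reproduced in a few lines. This is only a sketch of the arithmetic as stated in the text; the constant names are hypothetical, and the probability is the per-day estimate quoted above.

```python
# Figures quoted in the text; the probability is the per-day estimate.
DAILY_QUAKE_PROBABILITY = 0.000432
REPLACEMENT_COST = 500_000  # dollars
ANNUAL_PREMIUM = 1_500      # dollars

# Standard risk management: probability x loss.
calculated_loss = DAILY_QUAKE_PROBABILITY * REPLACEMENT_COST
premium_ratio = ANNUAL_PREMIUM / calculated_loss

print(f"Calculated loss: ${calculated_loss:,.0f}")
print(f"The premium is {premium_ratio:.1f}x the calculated loss")
```

On paper the premium costs several times the calculated loss, yet people still buy the policy, because the figure says nothing about how unrecoverable the loss would be.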
So, with this in mind, you have to ask: what are the effects on my business in the event of a tsunami on my website? Just as you would measure the force of a tsunami’s water (mass x acceleration), in this case you’d measure the number of users and their acceleration, or rate of accessing your website or web app. What’s important here is the number and the acceleration.
For the case of 84 Lumber, they could have come up with some predictions based on the number of Super Bowl viewers (mass), as well as the rate of acceleration after seeing the commercial.
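A prediction like that could start as a rough calculation. The sketch below is one hypothetical way to do it: the function `peak_requests_per_second` and every input value are illustrative assumptions, not measured 84 Lumber or Super Bowl figures.

```python
def peak_requests_per_second(viewers: int, visit_fraction: float,
                             window_seconds: float, requests_per_visit: int) -> float:
    """Rough peak load if a share of the audience visits within a short window."""
    visitors = viewers * visit_fraction             # the "mass"
    visits_per_second = visitors / window_seconds   # how fast the mass arrives
    return visits_per_second * requests_per_visit


# All inputs are assumptions chosen for illustration only.
estimate = peak_requests_per_second(
    viewers=100_000_000,    # assumed order of magnitude for the TV audience
    visit_fraction=0.01,    # assume 1% of viewers go to the site
    window_seconds=300,     # assume they arrive within five minutes of the ad
    requests_per_visit=20,  # assume page, assets, and video requests per visit
)
print(f"{estimate:,.0f} requests/second")
```

Shrinking `window_seconds` is what turns a busy day into a tsunami: the same number of visitors arriving in one minute instead of five means five times the load.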
Design and Test for It
When you are preparing for the worst, you must also design for the worst. Luckily, with today’s elastic clouds and the plethora of services from cloud vendors, you don’t need a roomful of servers dedicated to such a response. Once you have your infrastructure set up to withstand the tsunami, use your predictions to set up performance tests with the right mass and acceleration (users and rate of hitting your website or executing any life or death function or workflow).
Again, you have to determine what means life or death for your business. For some, it may be e-commerce. For healthcare.gov, maybe it’s registration, or an interface to a database for lookups such as Social Security numbers or credit agency records.
Once you’ve identified the mass and what the mass is doing (your user workflows), think about the worst-case scenario for accelerating and executing those workflows. What is critical is that the speed of ramp-up resembles a tsunami’s waves rushing your shores. Ramp up the users in your performance test to meet those predictions.
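As a sketch of what the ramp-up might look like, the snippet below builds a linear ramp schedule that you could feed into whatever load testing tool you use. The helper `ramp_schedule` and the user counts are hypothetical, and it assumes the ramp length is a multiple of the step size.

```python
def ramp_schedule(peak_users: int, ramp_seconds: int, step_seconds: int = 10):
    """Yield (elapsed_seconds, target_users) pairs for a linear ramp to peak_users.

    Assumes ramp_seconds is a multiple of step_seconds.
    """
    for elapsed in range(step_seconds, ramp_seconds + 1, step_seconds):
        yield elapsed, round(peak_users * elapsed / ramp_seconds)


# Same peak, very different acceleration (numbers are illustrative).
gentle = list(ramp_schedule(peak_users=50_000, ramp_seconds=1800))
tsunami = list(ramp_schedule(peak_users=50_000, ramp_seconds=60))

print("gentle ramp, first step: ", gentle[0])
print("tsunami ramp, first step:", tsunami[0])
print("tsunami ramp, full schedule:", tsunami)
```

Both schedules reach the same peak, but the tsunami ramp asks for thousands of users in the first ten seconds while the gentle one asks for a few hundred. A test that only uses the gentle ramp can pass while saying nothing about whether autoscaling reacts quickly enough.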
Lastly, don’t forget to model the undertow. As waves come in and hit your walls, they’ll combine with other waves as they approach. The effect is similar to a call center where callers get a busy signal. If you call an airline and you get a busy signal, what do you do? You call back! This makes the call center load accelerate even faster due to people rushing back. Many users couldn’t access the 84 Lumber site the first time, so what did they do? They probably tried again in 10 seconds or maybe even less.
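The undertow can be approximated with a very simple model. The sketch below is a toy simulation under stated assumptions: the function `simulate_retries`, the fixed retry delay, the retry limit, and the idea that failures are spread evenly across everyone attempting in a given second are all simplifications chosen for illustration.

```python
def simulate_retries(arrivals_per_second, capacity_per_second,
                     retry_delay=10, max_retries=3):
    """Return the offered load per second when failed users come back and retry.

    Requests beyond capacity fail; each failed user returns retry_delay seconds
    later, up to max_retries times, then gives up.
    """
    horizon = len(arrivals_per_second) + retry_delay * max_retries
    # pending[t][n] = users returning at second t for their n-th retry
    pending = [[0.0] * (max_retries + 1) for _ in range(horizon + retry_delay)]
    offered_load = []
    for t in range(horizon):
        fresh = arrivals_per_second[t] if t < len(arrivals_per_second) else 0
        attempts = [fresh] + pending[t][1:]  # index = retries already used
        total = sum(attempts)
        offered_load.append(total)
        if total > capacity_per_second:
            failed_share = (total - capacity_per_second) / total
            for n in range(max_retries):  # users out of retries give up
                pending[t + retry_delay][n + 1] += attempts[n] * failed_share
    return offered_load


# Illustrative numbers: 1,000 new users/second for 30 s against capacity for 600.
load = simulate_retries([1000] * 30, capacity_per_second=600)
print(f"peak new arrivals: 1000/s, peak offered load: {max(load):.0f}/s")
print(f"load still arriving after new users stop: {load[30]:.0f}/s")
```

In this toy run the site never sees more than 1,000 new users per second, yet the load it has to absorb climbs to about 1,800 per second, and retries keep hammering it after new arrivals stop. That is the load your performance test should reproduce, not just the first wave.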
How Will You Be Remembered?
Regardless of how low the chances of a tsunami, you must prepare for it as if it means life or death for your business. What are those life or death events for your business and the way your software enables customers and clients to access you? Would a performance crash or security breach be recoverable? If your software fails your users, what will they do? How will that impact their image or impression of you and your company? When you show up in the news, what will people remember about you?