Thanks for sticking with me through my eight predictions for 2021! Here’s a quick recap of the first seven:
- Ransomware victims will face penalties.
- Your digital reputation will come under attack.
- Zerologon will continue to cause pain for IT pros.
- People will remember the hard way that they have Group Policy.
- A rebound in M&As will make more people realize just how hard a tenant-to-tenant migration is.
- Transitional and project-based employees will increase the risk to intellectual property (IP).
- Microsoft 365 Multi-Geo will send multi-nationals down the rabbit hole.
Today’s my final post in the series! I’ll cover prediction #8: Increased cloud service and telco outages will drive renewed interest in bare-minimum hybrid business continuity plans.
I know, I know! Availability issues simply aren’t sexy or exciting, like new Office 365 features or the latest cyberattack techniques. But they are awfully important, especially as you seek the digital resilience that is critical to any organization’s continued success in our modern reality. So, please stick with me here, and I’ll have a special surprise for you at the end of the post.
Service outages are everywhere these days.
Sorry, but we’re having trouble signing you in.
A transient error has occurred. Please try again.
Starting around 5:30 p.m. ET on Monday, September 28, people around the world began getting the modern equivalent of the blue screen of death: errors when trying to sign in to any application that relies on Azure AD for authentication, including Microsoft 365 solutions like Teams as well as third-party applications. The outage ended up lasting for more than six agonizingly long hours.
This was neither the first nor the last serious Microsoft service interruption in 2020. Back on March 3, the U.S. East Azure region experienced problems across most services, again for more than six hours. On October 1, Exchange and Outlook caused issues for users, primarily in Europe. You can check out this Azure service history for a complete list.
I’m not picking on Microsoft; in fact, given the sudden spike in usage, only Microsoft could have handled the load with SO FEW disruptions! The truth is, plenty of other cloud services have experienced crippling downtime as well. In one of the most recent, on November 25, Amazon Web Services (AWS) suffered a multi-hour outage in its eastern U.S. operations that affected everything from corporate software applications to major publishers. Google was not immune, either. March was a particularly bad month for them. First, Gmail, Drive, Docs, Sheets, Slides, Hangouts Chat and Meet services went down — just as newly remote students and teachers were relying on them most. Then, customers suffered network connectivity issues and elevated error rates across multiple Google Cloud Platform services for 14 long hours.
In May, Adobe Creative Cloud went down for nearly a full business day, leaving customers unable to use cloud-based tools or access documents; a similar outage in October led one user to post a YouTube video of his frustrations. In June, a serious IBM Cloud outage brought many Big Blue customers, including a number of popular websites, to an abrupt halt; here’s a handy history of IBM Cloud outages. Meanwhile, of course, availability issues with Zoom became almost legendary in 2020.
Cloud service outages can bring your business to a standstill.
Imagine getting an error like the one above over and over again, especially when you have an urgent deadline pending. It’s bad enough for any individual, but now imagine the problem from the standpoint of the organization. None of your teachers can provide instruction to students. None of your creative team members can do their work. Multiple critical projects have to be put on hold because teams can’t communicate or collaborate.
Moreover, your customers won’t care one little bit that a cloud service provider is the cause of the problems they’re experiencing. When it’s the day before Thanksgiving and their iRobot Roomba vacuum cleaner app isn’t working, or they can’t read The Washington Post even though they paid for their subscription, they’ll blame you, not Amazon or Microsoft. Therefore, your brand — and your bottom line — can suffer serious damage.
What does “high availability” even mean these days?
The Holy Grail of availability is “five 9s” (99.999%), which works out to a little over five minutes of service outage per year. Decades ago, on-prem solutions like TDM PBX phone systems came close to meeting this goal, often achieving better than four 9s through measures like top-quality hardware and redundancy.
But in the cloud, a much wider range of factors can affect service availability. Recent service outages have been attributed to misconfigurations, software bugs, defects in updates that didn’t appear until they were deployed at scale, cooling plant issues, spikes in traffic, malicious attacks, and many other issues. I seriously doubt that any cloud service today offers a service level agreement (SLA) guaranteeing five 9s of availability. For instance, the SLAs for both the IBM Cloud and Amazon Compute begin offering customers a credit at four 9s. Microsoft guarantees only three 9s of availability for its Azure Active Directory Basic and Premium services.
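To put those 9s in perspective, here is a minimal sketch that converts an availability percentage into the downtime it permits over a year. The function name and the choice of a 365-day year are my own assumptions for illustration; real SLAs define their own measurement windows, exclusions and credit rules, so treat this as back-of-the-envelope arithmetic rather than a reading of any provider's contract.

```python
# Back-of-the-envelope downtime budget for a given availability level.
# Assumption: a 365-day year (leap years ignored), measured in minutes.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year that an availability level permits."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    levels = [("three 9s", 99.9), ("four 9s", 99.99), ("five 9s", 99.999)]
    for label, pct in levels:
        minutes = allowed_downtime_minutes(pct)
        print(f"{label} ({pct}%): about {minutes:.1f} minutes of downtime per year")
```

By this rough math, three 9s allows close to nine hours of downtime a year, four 9s a bit under an hour, and five 9s just over five minutes. A single outage of more than six hours, like the Azure AD incident described earlier, would use up most of a three 9s budget if it were measured across a full year.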
Moreover, SLAs often don’t cover everything you think they should. For instance, an almost four-day Amazon outage in April 2011 did not breach Amazon’s EC2 SLA, because that SLA guaranteed only “99.95% availability of the service within a Region over a trailing 365 period.”
Building digital resilience in our modern reality.
What does all this mean for you in 2021? Well, cloud service outages won’t magically disappear; in fact, they could get worse. Of course, you should read your SLAs carefully, but even if an SLA does kick in, a 10% credit on your bill won’t really do much to offset the damage that service downtime could do to your business. Remote work isn’t going away, either. While coronavirus vaccines might enable a partial return to the office, you will still need to empower your workforce to function effectively from home (and while traveling, when that resumes!). In fact, more businesses are considering having smaller hubs at different geographical locations, and all those people will still need to communicate and collaborate.
The recipe for success given these stark realities is the theme I mentioned way back in my original 2021 predictions post: building digital resilience. Digital resilience means resisting the temptation to simply move everything to the cloud. Instead, organizations must take the time to carefully determine the bare minimum data they need to operate without cloud access, and then build an appropriate hybrid model into their digitization strategy and disaster recovery plans. Many organizations in critical industries have already had to think this through. Others are behind in this area: they have never had so many users working from home, so cloud outages have never loomed so large for them.
This effort requires gaining deep insight into your IT ecosystem and, ideally, making it as clean and well-organized as possible. Then you can better understand workflows and other user activity, so you can take steps to ensure business continuity by developing a corporate strategy for maintaining an on-prem Active Directory and local data stores. You’ll also need to educate users about how to make wise (and yet legally responsible) choices about what data to sync locally so they can keep working during outages.
Thanks for sticking with me through the details of all eight of my predictions. I hope you found them thought-provoking and useful. As promised, I have a reward for your fortitude and patience: a wiki of computer error messages written entirely in haiku. Here’s my favorite:
Three things are certain:
Death, taxes, and lost data.
Guess which has occurred.
— David Dixon
May your 2021 be filled with joy, humor and — of course — digital resilience!