Early in my career, I had the opportunity to tour a corporate disaster recovery (DR) center in the heart of Center City Philadelphia. The facility, housed in a solidly built World War II-era tank factory, had every kind of system available, from IBM and Honeywell mainframes to rooms of PC servers. Here’s how it worked: in the event of an outage caused by a natural or manmade disaster, clients’ IT managers would run down to the facility, backup tapes in hand, to reload and restore operations on the backup center’s systems. The process was expected to take between 24 and 48 hours – there were even rooms with cots for the admins’ extended stay.
These days, a 24-hour turnaround at a DR center is unacceptable unless a company is dealing with a truly ruinous regional event, such as Hurricane Sandy. Yet in a highly interconnected digital economy, chain reactions can erupt from relatively mundane events, such as power outages or errant software updates.
It’s probably a freak of semantic fate that one of the year’s biggest legacy-system snafus occurred at an airline named “Delta.” In recent years, many organizations – insurance companies especially – have been trying to straddle the (little-d) “delta” between enormous stores of legacy assets and 2010s-era digital demands.
To recap, on Monday, August 8, Delta suffered a power failure at its Atlanta data center that took down critical systems worldwide, grounding flights for hours, forcing the cancellation of more than 1,000 flights that day alone and hundreds more in the days that followed, and leaving passengers stranded in terminals around the globe.
Tellingly, the Wall Street Journal’s coverage of the outage focused less on the power failure itself than on Delta’s aging, tangled technology, suggesting that this kind of meltdown is simply the price of running on legacy infrastructure.
But does it have to be?
There are many high-availability and resiliency solutions on the market today that promise to bring systems back up to a specific recovery point objective within seconds, if not subseconds. The catch is that this subsecond capability works well for a specific product or environment – an Oracle database, say – but does little for environments that depend on multiple systems from multiple vendors, each relying on partners with their own systems and backup plans.
Too many points of failure. As insurers move to digital configurations, in which they depend on multiple partners and pull together multiple legacy computing environments, they, too, may face “Delta” situations. While insurers likely won’t leave customers stranded in airport waiting areas, there are other risks – an inability to catch fraudulent claims as they are filed, to monitor impending weather events, or to respond quickly to emergencies.
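The arithmetic behind “too many points of failure” is worth spelling out. If an end-to-end service only works when every system in a chain is up, the individual availabilities multiply, so even a handful of individually solid systems can add up to a surprising amount of expected downtime each year. Here is a minimal back-of-the-envelope sketch in Python; the system names and uptime figures are purely hypothetical, not drawn from any real insurer.

```python
# Back-of-the-envelope sketch of why serial dependencies hurt availability.
# All system names and uptime figures below are hypothetical.
import math

# Hypothetical per-system availability (fraction of time each is up).
components = {
    "policy_admin": 0.9995,
    "claims_core": 0.999,
    "partner_api": 0.998,
    "payment_gateway": 0.9993,
    "data_warehouse": 0.999,
}

# If every system must be up for the end-to-end service to work,
# the availabilities multiply.
overall = math.prod(components.values())
downtime_hours = (1 - overall) * 24 * 365

print(f"End-to-end availability: {overall:.3%}")              # about 99.5%
print(f"Expected downtime: {downtime_hours:.0f} hours/year")  # about 45 hours
```

Five systems that each look respectable on their own still leave roughly two days of expected downtime a year – before accounting for any partner’s partners.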
Here are some rules of the road for more effective DR/BC in the digital age:
• Don’t let your DR/BC plan be siloed. Make sure your organization has an enterprise-wide disaster recovery and business continuity strategy that encompasses all key systems in all departments and units. Coordination between all stakeholders is key.
• Review your partners’ DR/BC plans and solutions, and make sure they are in sync with your own. A partner’s faulty or inadequate plan will impact your own in unforeseen ways. This includes your cloud providers – get to know their DR/BC plans intimately, and don’t be afraid to make waves if they’re not as comprehensive as you would like.
• Be ready for the extraordinary, but plan for the ordinary. Most events that affect systems are not spectacular hurricanes or fires; they are mundane problems such as power glitches or botched software updates. Build in contingencies that absorb what should be mere hiccups (see the sketch after this list).
• Simplify, simplify, simplify. It may be easier said than done in today’s interconnected digital economy, but the fewer points of failure, the better. “Stuff breaks, we know, but the whole idea is to try to have fewer points of failure,” Scott Holland, principal at The Hackett Group, is quoted as telling WSJ. “And you become a hell of a lot more agile, able to plug new systems in faster and more safely. Our data show that if you’re less complex, you’re more efficient and run at a lower cost.”
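To make the “plan for the ordinary” point concrete, here is a minimal sketch of the kind of contingency that absorbs routine hiccups: bounded retries with backoff against a primary system, then failover to a standby before the caller ever sees an error. The endpoints, failure rate, and retry thresholds are all hypothetical placeholders, not a prescription.

```python
import random
import time

# Hypothetical endpoints: a primary system and a warm standby.
PRIMARY = "https://claims.example.com/api"
STANDBY = "https://claims-dr.example.com/api"

def call_endpoint(url: str) -> str:
    """Stand-in for a real service call; fails randomly to simulate a transient glitch."""
    if random.random() < 0.3:
        raise ConnectionError(f"transient failure talking to {url}")
    return f"OK from {url}"

def call_with_contingency(retries: int = 3, base_delay: float = 0.5) -> str:
    # Absorb ordinary hiccups with bounded retries and exponential backoff...
    for attempt in range(retries):
        try:
            return call_endpoint(PRIMARY)
        except ConnectionError:
            time.sleep(base_delay * (2 ** attempt))
    # ...and only then fail over to the standby, rather than failing the caller.
    return call_endpoint(STANDBY)

if __name__ == "__main__":
    print(call_with_contingency())
```

The specific pattern matters less than the principle: the more routine failure modes a system can absorb automatically, the fewer of them ever get the chance to snowball into a “Delta” situation.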