System Outages
Risk Management Perspectives from an Unexpected Outage
(Published: 20 July 2024)
The CrowdStrike outage is a timely reminder of what should be among the top risks for any software company that provides and supports mission-critical applications, and for the clients that operate them.
From a software company’s perspective, one of its top risks should be that a program change or upgrade fails and causes system failure and business disruption for its clients, with potentially serious financial, operational and reputational repercussions and even personal harm. The company’s Board should define a clear risk appetite that does not tolerate such an event, and should require robust key controls to be established and implemented to mitigate the risk. These would include stringent program change controls under which every upgrade or change is thoroughly tested against all operating systems, especially those used by major clients.
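As an illustration only, a change control of this kind could be expressed as a simple release gate: a change is cleared for rollout only when it has been tested, and has passed, on every operating system the vendor supports. The sketch below is a minimal example under assumed conditions; the names `TestResult` and `release_blockers`, and the operating system labels, are hypothetical and do not describe any vendor's actual process.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TestResult:
    os_name: str   # operating system the change was tested on
    passed: bool   # True only if the full test suite passed


def release_blockers(supported_os, major_client_os, results):
    """Return the reasons a change must not be released (empty list = clear)."""
    outcome = {}
    for r in results:
        # Be conservative: a single failed run on an OS marks that OS as failed.
        outcome[r.os_name] = outcome.get(r.os_name, True) and r.passed

    blockers = []
    for os_name in sorted(supported_os):
        tag = " (major client platform)" if os_name in major_client_os else ""
        if os_name not in outcome:
            blockers.append(f"{os_name}: not tested{tag}")
        elif not outcome[os_name]:
            blockers.append(f"{os_name}: tests failed{tag}")
    return blockers


# Hypothetical example: one platform fails and one was never tested.
supported = {"windows-11", "windows-server-2022", "ubuntu-22.04"}
major = {"windows-11"}
results = [TestResult("windows-11", False), TestResult("ubuntu-22.04", True)]

blockers = release_blockers(supported, major, results)
for reason in blockers:
    print("BLOCKED:", reason)
```

The point of the design is that an untested platform is treated the same as a failed one: the absence of evidence blocks the release, rather than being assumed safe.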
From the client’s perspective, one of its top risks should be that its business, operations and mission-critical systems are down and unavailable for an extended period. Clients often rely on their software vendors to ensure that the necessary stringent quality controls are followed. Companies that depend heavily on their IT systems being operational at all times are more exposed to, and more heavily impacted by, such unexpected disruptions. One key control is to ensure that Business Continuity Plans (BCPs) are established and current. A BCP is a document containing the critical information an organization needs to continue operating during an unplanned event (such as the CrowdStrike outage): it states the essential functions of the business, identifies which processes must be sustained, and details how to maintain them. BCPs typically fall back on manual processes, which may be inefficient but at least ensure that operations continue, albeit at a reduced level, until the system is fully recovered. This is therefore a timely reminder for everyone to pull their BCPs out of the drawer, dust them off, update them and test them thoroughly to ensure that the organization’s essential processes continue to operate in an emergency.