Thursday, October 6, 2011

Resilience and Business Continuity

Resilience is perhaps the most important aspect of a solid business continuity program, but in practice operational continuity becomes the measure of resilience.

The question we are asking today is: which industry sectors or which companies are most resilient?

Measures of resilience
The word resilience is derived from the Latin resilire (present participle resiliens), to spring back: the power of something to return to its original form or to recover from adversity. In practical terms it might be the ability of something to remain continuous or to revert to its normal state of operation. Taken literally, then, we would be most interested in measuring how long it takes for an operation that is suffering a disruption to resume normal processing.

In operational risk there are actually two measures we concern ourselves with in this case. The first is how often something fails and the second is how long it takes to repair that failure, or to be technical:

The Mean Time Between Failures, or MTBF, which over a statistically predictable time horizon should tell us how many failures we can expect. Say, in a thousand trials or maybe a million, how often did the service fail?

Now, things that go bump in the night, as they say, aren't so bad if they are fixed swiftly. Given this, we also need to measure the Mean Time to Repair (MTTR), or the estimated time to repair, as the measure is so often called.

Figure 1 : Up State and Down State

In Figure 1 you should be able to clearly see the up state represented as red bars. You can equivalently tally up how often the system is down over its event horizon and, if you cared to, accumulate the number of seconds (or milliseconds, if that is applicable) it takes to come back online after each outage.
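
The tally described above can be sketched in a few lines of Python. This is only an illustration: the outage log, the 30-day window and the function name mtbf_and_mttr are all made up for the example, and MTBF is taken here as total up time divided by the number of failures, which is one common convention rather than the only one.

```python
from datetime import datetime, timedelta


def mtbf_and_mttr(outages, window_start, window_end):
    """Return (MTBF, MTTR) as timedeltas for one observation window.

    outages is a list of (down_at, up_at) datetime pairs, assumed to be
    non-overlapping and to fall entirely inside the window.
    """
    if not outages:
        raise ValueError("no failures observed; MTBF is undefined for this window")
    # Accumulate the time spent in the down state across all outages.
    total_down = sum((up - down for down, up in outages), timedelta())
    total_up = (window_end - window_start) - total_down
    failures = len(outages)
    return total_up / failures, total_down / failures


# Hypothetical log: two outages inside a 30-day window.
start = datetime(2011, 9, 1)
end = start + timedelta(days=30)
log = [
    (datetime(2011, 9, 5, 2, 0), datetime(2011, 9, 5, 3, 0)),      # 1 hour down
    (datetime(2011, 9, 20, 14, 0), datetime(2011, 9, 20, 17, 0)),  # 3 hours down
]
mtbf, mttr = mtbf_and_mttr(log, start, end)
print("MTBF:", mtbf, "MTTR:", mttr)
```

With this made-up log the service fails on average once every 358 hours of up time and takes 2 hours to repair.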

This form of availability is normally expressed as up time over a one-year horizon. So if I were to tell you the service is 99.885845% reliable, you might think that is a great performance statistic, but in reality you can expect 10 hours' worth of accumulated outage in a single year of operation.

Is that good?

Well, let's hope it truly is a cumulative measure over 365 days, especially if you depend on the service you are measuring, and jet engines are an example of such a service. For those of you who dislike flying, fret not: the majority of turbines being manufactured today are running "6 Nines" and most commercial aircraft are able to fly on a single engine once they are cruising.
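
The arithmetic behind these availability figures is easy to check. The sketch below assumes a non-leap year of 8,760 hours and uses the standard steady-state relation availability = MTBF / (MTBF + MTTR); the function names are just labels chosen for this example.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability_pct):
    """Expected accumulated outage per year for a given availability in percent."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


def availability_pct(mtbf_hours, mttr_hours):
    """Steady-state availability in percent from MTBF and MTTR."""
    return 100 * mtbf_hours / (mtbf_hours + mttr_hours)


ten_hours = annual_downtime_hours(99.885845)
six_nines_seconds = annual_downtime_hours(99.9999) * 3600

print(f"99.885845% -> {ten_hours:.2f} hours of outage per year")
print(f"99.9999%   -> {six_nines_seconds:.1f} seconds of outage per year")
```

The first figure comes out at about 10 hours, matching the example above, while "6 Nines" leaves roughly half a minute of outage per year.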

Building resilience
There are several key tips for building resilience into operations, and some industry sectors seem to have a better grip on this than others.

Six specific points to consider for continuous operation would include:

[1] Failure is normal operation
If you want to improve operational effectiveness in your systems, design them in a failed environment rather than a successful one. In effect, assume volatility is the norm and assume inherent risk is high.  Systems that are built to operate in such places seem to be more robust when they are run in normal states of operation.

[2] Adding Redundancy
Adding redundancy to the design is perhaps the easiest way to increase the mean time between failures, but it may not be the most cost-effective way to achieve this. Nonetheless, for mission-critical applications resilient firms seem to employ this tactic.

Figure 2 : Designing with failure in mind

In the simple example above, a system that has a 50% chance of operating in the next hour will increase its chance of operating to 75% when two such systems, (A) and (B), are made available side by side, assuming they fail independently. Serially arranged processes, on the other hand, are much more problematic. They are a weaker design and usually increase the failure rate in firms.
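
The 75% figure, and the weakness of a serial arrangement, can be reproduced with a short sketch. It assumes the components fail independently of one another, which real systems often violate; the function names are illustrative only.

```python
from math import prod


def parallel(reliabilities):
    """Redundant design: the system works if at least one component works."""
    return 1 - prod(1 - r for r in reliabilities)


def serial(reliabilities):
    """Chained design: the system works only if every component works."""
    return prod(reliabilities)


dual = parallel([0.5, 0.5])   # systems (A) and (B) side by side
chain = serial([0.5, 0.5])    # the same two systems in sequence

print(f"parallel: {dual:.0%}, serial: {chain:.0%}")
```

Two 50% systems in parallel give 75%, while the same two systems chained in series give only 25%.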

[3] Reducing Tail Events
Tail events can often be avoided in a natural manner by setting up firewalls between specific risk factors and their outcomes. For example, many explosions require three elements to be present before combustion occurs: oxygen, a fuel source and an ignition point. Separating the causal elements immediately reduces the likelihood of a catastrophic incident, and this is the underlying reason why ignition sources are banned or controlled in combustible environments.

[4] Operational Simplification
Businesses which are able to simplify processes, authority lines and machine elements seem to increase their reliability, and they do this by lowering the mean time to repair. Additionally, overly complex services or systems with too many moving parts seem prone to a higher chance of compound error.
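
One way to picture compound error is to assume every moving part must work for the whole service to work, and that parts fail independently. Under those simplifying assumptions (the 99% per-part figure is invented for illustration), reliability decays quickly as parts are added:

```python
def system_reliability(part_reliability, parts):
    """Chance that every one of `parts` independent parts works at once."""
    return part_reliability ** parts


# Hypothetical parts that each work 99% of the time.
for parts in (1, 10, 100):
    print(parts, "parts:", round(system_reliability(0.99, parts), 3))
```

Ten such parts already pull the service down to roughly 90%, and a hundred leave it working only a little over a third of the time.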

[5] Reducing Dependencies
Outsourcing generally builds dependence rather than resilience. It may lower unit costs of production, but it can also interfere with inherent risk and control. Outsourced functions need careful review and sometimes the establishment of a redundancy program, which will sadly reduce the cost benefits that motivated the outsourcing in the first place.

[6] Creative Industries
Businesses that embrace resilience seem to be those which are generally more creative; they reject bureaucracy and consequently adapt to change more rapidly. These types of business models also tend to make decisions swiftly because central authority is devolved, and they aspire to thrive on diversity rather than stumble over it.

Resilience in practice
Looking across a large set of industry sectors, taking in mining, construction, energy production, military applications, hospitality, logistics, air traffic control, finance, manufacturing and information technology, some sectors seem more prone to failure than others. This is partly due to the environment and inherent risks that surround a business, or to a lack of standards; other sectors simply seem to have appalling resilience benchmarks.

Military Applications
Two industries which are worth looking at from a resilience perspective are the military and the airlines.

Why?

Well, both these industries operate in relatively volatile environments, which results in equipment being designed for failure. In reality, no one would want to go to battle with guns that jammed or communication devices that failed if they were dropped or became wet. The US in particular has extremely high standards for both defining and testing equipment, set out in MIL-STD-810. This standard addresses an incredible array of potential catastrophe factors, from shock to fungus, and is in itself at the leading edge of resilience design.

Airlines
The airline industry is also heavily regulated by standards that ensure equipment is able to operate in the most hostile environments, that failures result in shutdowns rather than explosive responses, and that continuous operation is over five times what is normally expected.

Figure 3 : Sample FAR Standards

Figure 3 highlights a sample of the types of tests which are carried out on turbines before they are commercially sold, and you can check out General Electric's test bed for a successful Engine Icing Stall Test by following this <LINK>. Personally I am impressed with this little lot, and much can be learned from the quality control General Electric seems to employ.

Financial Services
You guessed it: finance was going to appear in this blog post, seeing that is where I spend most of my time working. In my opinion, financial services would rate pretty low in our resilience award list, and it would be fair to say that if some of these banks were to run airlines, many of us would never want to fly again.

On the resilience preparedness scale, manufacturing is probably more widely dispersed in resilience quality when viewed from a global perspective, and in some parts of the world construction would also fall into that bucket of disarray. However, the banking sector has really been showing deteriorating standards when it comes to resilience, and in 2011 many jurisdictions have been unable to run through a single event horizon without living through a critical failure.
