FMEA - Real life example analysis

FMEA - Introduction

Failure mode and effects analysis (FMEA) is the process of reviewing as many components, assemblies, and subsystems as possible to identify potential failure modes in a system and their causes and effects. For each component, the failure modes and their resulting effects on the rest of the system are recorded in a specific FMEA worksheet. There are numerous variations of such worksheets. An FMEA can be a qualitative analysis, but may be put on a quantitative basis when mathematical failure rate models are combined with a statistical failure mode ratio database. It was one of the first highly structured, systematic techniques for failure analysis. It was developed by reliability engineers in the late 1950s to study problems that might arise from malfunctions of military systems. An FMEA is often the first step of a system reliability study.

A successful FMEA activity helps identify potential failure modes based on experience with similar products and processes—or based on common physics of failure logic. It is widely used in development and manufacturing industries in various phases of the product life cycle. Effects analysis refers to studying the consequences of those failures on different system levels.


Functional analyses are needed as an input to determine correct failure modes at all system levels, for both functional FMEA and piece-part (hardware) FMEA. An FMEA is used to structure mitigation for risk reduction, based on reducing the severity of the failure (mode) effect, lowering the probability of failure, or both. The FMEA is in principle a fully inductive (forward logic) analysis; however, the failure probability can only be estimated or reduced by understanding the failure mechanism. Hence, an FMEA may include information on causes of failure (deductive analysis) to reduce the possibility of occurrence by eliminating identified causes.

What happened to British Airways?



On Saturday May 27th, all power went down in one of British Airways’ key data centers. The problem wasn’t fixed, and operations weren’t restored, until some point the following day, Sunday 28th.

Approximately 800 flights are thought to have been canceled, with estimates of around 75,000 travelers affected in some way. The total cost for British Airways is likely to come in at over £50m, with the most pessimistic estimates suggesting it could go as high as £100m.

This is a huge dent in customer trust in British Airways and a very large bill to foot. On Saturday 24th of June, the Guardian was reporting that many travelers still hadn’t been reimbursed for their losses, with BA and insurers both refusing to pay up. British Airways refunded one particular family the £350 cost of their flight, but the family claim this has only gone some way to covering their losses, with a further £700 they won’t see again.

There are many cost-related effects we can see from the information available, but the financial ramifications of brand damage are much harder to measure. Not everyone is worried, mind. In the last year, Delta Air Lines, Lufthansa, and Air France have all experienced similar outages, and significant brand damage doesn’t seem to have occurred.

Nonetheless, BA has suffered some significant losses and had to hire many emergency staff. We know this much to be true.

Why did it happen?

The details of the case are not wholly out in the open. It’s likely that we won’t know exactly what went wrong unless a report is published in the months to come.

Initially, the incident was reported as resulting from a power surge, which suggested an unexpected freak event. However, a few days after the incident an official statement was released informing us of the real cause.

According to The Times, a power supply unit at the centre of the outage was in perfect working order and was deliberately shut down, which triggered the disturbance. The paper reported that an investigation of the episode will therefore likely focus on human error.

According to The Times, the incident likely concerned a so-called uninterruptible power supply, or UPS, which is designed to deliver a smooth flow of power from the mains, with a fall-back to a battery-powered back-up and a diesel generator.

What caused the damage was that the power was restored in an uncontrolled fashion

He went on to say that the massive surge of power from the uninterruptible power supply (UPS) caused physical damage to British Airways’ hardware, which was the root cause of the data center being shut down for such a long period of time.

Ultimately, this came down to human error. A contractor disconnected the UPS and then, presumably in panic, reconnected it in an irresponsible way, failing to follow the correct procedures.

This crisis is a process failure. Clear as day.

But why did the process fail? In the media, there has been much discussion about outsourcing. The Daily Mail reported that the engineer in question was employed by CBRE Global Workplace Solutions. The GMB union made a similar point, as reported by the Financial Times, noting that British Airways had cut hundreds of IT staff in Britain in 2016 to outsource their work to India.

Which is a fair point. It has become fairly standard across many industries to have multiple bases across the globe. Whether outsourcing is part of the problem or not seems somewhat beside the point when it comes to understanding the incident and how to prevent it, even if it’s an important discussion in other ways.

This was a process failure. The clue is in the name: an uninterruptible power supply is not meant to be interrupted, so we can assume some process wasn’t properly implemented.

How can FMEA help us prevent this happening again?



Presumably, there was an instruction not to interrupt the uninterruptible power supply. I don’t know this for sure, but I will gladly bet on it.

So, I think it is reasonable for us to presume there was a process of some description in place.

Was the problem that the process wasn’t being followed? In which case, I recommend British Airways read up on process adherence and create an account with Process Street for the future.

If, on the other hand, the problem was with a lack of adequate controls, then performing a Failure Mode and Effects Analysis is the correct way forward.

Perhaps British Airways should conduct more than one FMEA? If this failure was caused by the contractor, as the Daily Mail reported, then one FMEA should presumably cover the process for onboarding contractors to work within the company’s systems.

Another FMEA could look at the security processes in place surrounding the UPS. Why was one contractor even able to disconnect this crucial supply? Where was the trained supervision?

A further FMEA could be conducted to look at the specific task the contractor was performing. What was the process being followed? Had this process been documented in advance and then reviewed by British Airways staff?

Without wanting to pretend I know better than British Airways – because mistakes always happen – it seems that the effective review and improvement of just one of these processes could have resulted in controls being put in place which may have prevented this whole sorry affair from occurring.

After all, the severity rating for this kind of catastrophe would be very high, so it would likely show up as a high-priority failure mode. It would score low on occurrence, but probably high on detection, meaning the failure is hard to catch before it does harm. If we had an S of 9, an O of 2, and a D of 8, the RPN (risk priority number, the product S × O × D) would be 9 × 2 × 8 = 144. This would have raised red flags immediately even if the criticality score had been lower.
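To make the arithmetic concrete, here is a minimal Python sketch of the risk priority number calculation, which multiplies the severity (S), occurrence (O), and detection (D) ratings. The `FailureMode` class, its field names, the failure mode description, and the 1 to 10 rating scale are illustrative assumptions rather than anything taken from British Airways or a specific FMEA standard; the ratings are the hypothetical ones used above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One hypothetical row of an FMEA worksheet (illustrative only)."""
    description: str
    severity: int    # S: how bad the effect is (assumed 1-10 scale)
    occurrence: int  # O: how likely the failure is (assumed 1-10 scale)
    detection: int   # D: how hard it is to detect (assumed 1-10 scale)

    def __post_init__(self):
        # Reject ratings outside the assumed 1-10 scale.
        for name in ("severity", "occurrence", "detection"):
            value = getattr(self, name)
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be between 1 and 10, got {value}")

    @property
    def rpn(self) -> int:
        # Risk priority number: the product of the three ratings.
        return self.severity * self.occurrence * self.detection


# Hypothetical ratings from the discussion above: S = 9, O = 2, D = 8.
ups_failure = FailureMode(
    description="UPS disconnected and reconnected without following procedure",
    severity=9,
    occurrence=2,
    detection=8,
)
print(ups_failure.rpn)  # 9 * 2 * 8 = 144
```

Because the RPN is a simple product, a control that improves detection (lowering D) or reduces the chance of the failure (lowering O) brings the score down even when the severity of the outcome cannot be changed.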

Moreover, GMB claimed that hundreds of IT staff were let go only last year. If you’re changing so many employees, this would seem a large enough change to warrant an FMEA, as per our recommendations above. Had British Airways followed the prescribed process management techniques, they could have put controls in place last year to prevent this costly incident.

The price of not performing effective reviews and not putting well-mapped controls in place is probably going to be much greater than any of the savings they made through their restructuring and outsourcing.
