Not long ago, reliability was considered engineering alchemy, an “Alice in Wonderland” science. Today, reliability is being treated as a true engineering discipline. It is such a popular term that it has given birth to an entire industry that has produced countless titles on the subject. Several professional societies have been founded and the lecture circuit is full of reliability engineers promising to decode the science of reliability.
Reliability and its design methodology have had a long and fruitful existence. They were employed in the 1940s and 1950s to design complex systems and measure risk in exotic military projects. In the 1960s, reliability tools were refined and became a base alloy in the program that saw Neil Armstrong place the first of what was to be many footprints on the moon. The 1970s brought the golden age of commercial nuclear power production. During this period, reliability stood as a silent sentinel over reactor design and associated safety systems design. Over the past two decades, reliability has made and continues to make its mark as a successful design characteristic in any process, system, or component.
Somewhere during the past 20 years, perhaps when words like Chernobyl, Bhopal, and Challenger filled the headlines, the expectations of industrial and manufacturing process plants were reordered and owners began to view their investments with a highly demanding economic eye. This is not to say that economics was never the top order of the day, but the emphasis and the associated costs placed on environmental protection, process safety management, worker health, and plant availability sounded the wake-up call. This forced owners and managers to look at new ways to keep their plants profitable. It was then that the forgotten stepchild known as maintenance was given the recognition it deserved. If keeping the plant running and profitable were the kingdom, maintenance would need to be the keys to that kingdom.
Over a 30-year period, reliability-centered maintenance (RCM) developed into a strategic framework for addressing process failures, using the civil airline industry as its teacher. John Moubray and his book Reliability-centered Maintenance (Industrial Press, New York, 1992) broke new ground by developing a systematic approach to understanding and preventing failure.
This book introduced the most revered of the maintenance acronyms—RCM—into the lexicon of maintenance and, almost single-handedly, produced some of the most sweeping changes in how equipment reliability was viewed within the maintenance function. RCM was shown to be a series of well researched and executed processes that promised a greater understanding of why things fail and, more importantly, how to take measures to prevent the consequences of failures.
A major problem with implementation of the RCM process is that it is often applied far too broadly to yield practical results, and the price for such a protracted endeavor is typically far more than an organization with serious equipment reliability issues can bear. (Moubray notes that “the quickest and biggest short-term returns are usually achieved when RCM is applied to assets or processes suffering from intractable problems which have serious consequences.”)
What is needed is a practical tool to allow managers to quickly understand the value of reliability and how reliability impacts profit. In 1993, H. Paul Barringer, a Houston-based reliability consultant, realized the difficulty of making the RCM process work and posed the question: “Can your plant afford a reliability improvement program?”
Barringer observed that few, if any, organizations could afford to employ the entire RCM process without first understanding how unreliability affects the bottom line.
Fortunately, a practical reliability tool can be extracted from Moubray, Barringer, and the past 30 years of experience and research, and we will not need rocket scientists to use it in a cost-effective manner.
Reliability is most commonly defined as the probability that equipment or a process will function without failure, when operated correctly, for a given period of time under stated conditions. Simply put, the fewer equipment or process failures a plant sustains, the more reliable the plant.
Reduced to a single word, reliability is dependability. Many industries carry the additional burden of ensuring that plant reliability is kept in the forefront of day-to-day operations. Employee safety, public approval, and demonstrated environmental safeguards lie at the very core of an industry's existence.
The accident at the Three Mile Island power plant is stark testimony that reliability, when used as a design characteristic, works. If Reactor-2 had been designed without inherent stability and reliability, chances are you would be reading this article by candlelight.
Thinking of reliability as an engineering problem, one can imagine a team of engineers searching for better equipment designs and working out solutions to eliminate weak points within system processes. When considering reliability from a business aspect, the focus shifts away from reliability and toward the financial issue of controlling the cost of unreliability. Quantifying reliability in this way sets the stage for the examination of operating risks when monetary values are included. Measuring the reliability of industrial processes and equipment by quantifying the cost of unreliability places reliability under the more-recognizable banner of business impact.
It is not a difficult thought process that leads us to the conclusion that higher plant reliability lies in the ability to reduce equipment failure costs. The motivation for a plant to improve reliability by addressing unreliability is clear: Reduce equipment failures, reduce costs due to unreliability, and generate more profit. It is under this preamble that a sound business commitment to plant reliability begins to step out of the shadows and take shape.
We have now defined reliability as a plant engineering characteristic, and, more importantly, defined it in terms of business impact. In order to improve reliability, we first must understand the very nature of its measurement—failure.
Moubray defines failure as “the inability of any asset to fulfil a function to a standard of performance which is acceptable to the user.” This is the definition we will use, but we will raise it from the asset level to the plant level.
We shall define failure as the loss or reduction in performance of a system, process, or piece of equipment that causes a loss or reduction in the ability of the plant to meet a consumer demand. This definition focuses attention on the systems vital to making the plant profitable, while the standard definition could lead some people to believe that all equipment is equal. The loss of a pawn in a game of chess does not represent the loss of the game. It is a calculated risk taken in a strategic effort to win the game and it is, after all, a pawn. In other words, the probability of meeting consumer demand has been increased as equipment within a process is evaluated based on its impact to the financial health of the company.
Mathematically, reliability is the probability that no production-interrupting failure occurs over a given future time interval and is stated as:

R = e^(-λt)

where:

R = reliability

e = 2.71828..., the base of natural logarithms

λ = failure rate, the reciprocal of mean time between failures, or 1/MTBF

t = the given time interval for which the prediction is sought
For the purpose of calculating the cost of unreliability of industrial equipment, mean time between failure (MTBF) can be defined as the time interval of the study divided by the number of production-interrupting failure events recorded during the study.
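These two definitions make the reliability calculation a few lines of code. Here is a minimal sketch in Python; the failure count and study interval are illustrative numbers, not data from any particular plant:

```python
import math

def failure_rate(study_hours, failure_events):
    """Failure rate: production-interrupting failures divided by the
    study interval, per hour (the reciprocal of MTBF)."""
    return failure_events / study_hours

def reliability(lam, t_hours):
    """R = e^(-lambda * t): probability of no production-interrupting
    failure over the interval t."""
    return math.exp(-lam * t_hours)

# Illustrative numbers: 4 failures observed over one year of operation (8760 h).
lam = failure_rate(8760, 4)
mtbf = 1.0 / lam                 # 2190 hours between failures
print(f"MTBF = {mtbf:.0f} h")
print(f"R over one year = {reliability(lam, 8760):.3f}")  # e^-4, about 0.018
```

Note that with four failures a year the chance of getting through the year without one is under 2 percent, which foreshadows the availability-versus-reliability gap discussed later in the article.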
The good, the bad, and the ugly
We have defined reliability (the good) as requiring the measurement of failure (the bad). There remains only one obstacle to putting the above equation to work. We must glean failure data from industries that do not understand how to accumulate coherent equipment failure data for the purpose of relating it to cost (the ugly).
Plant engineers and maintenance practitioners typically maintain that good failure data does not exist, or would require extraordinary effort to secure. This is simply not true. Failure data exists all around them in varying degrees of usefulness. Many plants have been accruing failure data under the guise of operating logs, work orders, environmental reports, etc. The paradigm persists because plant management does not see the data as a tool for solving problems and, as a result, rarely treats or analyzes the data in an economical manner. The problem is compounded by the fact that operators, maintenance personnel, supervisors, and managers fail to acquire data in a manner conducive to analysis.
The net result is a vast bank of quite useful information, haphazardly recorded and poorly structured. When equipment or process failures cause enough of a financial concern to warrant study, engineers can look forward to hours of sifting piles of incoherent data in search of an answer.
Substantial amounts of failure data exist in various places awaiting use for improving the reliability of processes and equipment. Start with common sense data now, then couple it with a progressive data recovery program. With these elements in place, the road to an integrated and structured maintenance management program that recognizes plant reliability as its mission will no longer be elusive.
Acquiring failure data
Robert Abernethy, in his book The New Weibull Handbook (self-published, North Palm Beach, FL, 1996), maintains that acquiring equipment failure data has three basic requirements. He goes on to explain that commercial businesses require the addition of two more elements.
In order to illustrate this concept, we need to get back to basics. It is a common philosophy (especially among investors) that the mission of the maintenance component of any facility is to keep the plant producing. In other words, protect the investment.
This translates well into the mission of reliability and gives us our newest characteristic: protect the integrity of the process. It can only follow that plant processes are maintained by protecting system function and system functions are protected by maintaining equipment.
In order to establish a beachhead for reliability improvement, we need to define failure in terms of the overall mission. For ease of illustration, we shall consider the primary loop, the secondary loop, and the power transmission stages of power generation in a nuclear power plant as the three high-level processes under which failure has the greatest financial impact.
In order to hold the study to an unambiguous time interval, we shall fix the time for each process with consideration to quality of failure data available for that time interval, then normalize the failure rate.
The time interval calculation assumes that the plant runs 24 hours per day, 365 days per year or 8760 hours per year. The number of failures was counted for the time interval to calculate the MTBF. Failure rate is calculated by taking the reciprocal of MTBF.
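The normalization step can be sketched as follows. The study intervals and failure counts below are hypothetical placeholders (the article's plant data is not reproduced here); only the 8760-hour operating year is taken from the text:

```python
HOURS_PER_YEAR = 8760  # 24 h/day x 365 days, per the article's assumption

# Hypothetical study intervals and production-interrupting failure counts
# for the three high-level processes; illustrative numbers only.
processes = {
    "primary loop":       {"study_hours": 3 * HOURS_PER_YEAR, "failures": 4},
    "secondary loop":     {"study_hours": 2 * HOURS_PER_YEAR, "failures": 5},
    "power transmission": {"study_hours": 1 * HOURS_PER_YEAR, "failures": 2},
}

results = {}
for name, p in processes.items():
    mtbf = p["study_hours"] / p["failures"]   # MTBF = study interval / failure count
    lam = 1.0 / mtbf                          # failure rate, per hour
    results[name] = {
        "mtbf": mtbf,
        "lambda": lam,
        "failures_per_year": lam * HOURS_PER_YEAR,  # normalized to a common year
    }
    print(f"{name:20s} MTBF = {mtbf:7.0f} h   lambda = {lam:.2e}/h   "
          f"~{lam * HOURS_PER_YEAR:.1f} failures/yr")
```

Normalizing each rate to failures per year puts processes studied over different intervals on a common basis for comparison.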
With the failure rates known, we can determine the production time lost from the failures and begin to determine the cost of unreliability.
In our example, we have established the three critical processes in making a power plant financially feasible. The criticality of the systems and equipment that make up these processes carries its own weight with regard to personnel and environmental safety. In understanding the financial ramifications of unreliability, it is important that the average corrective time for failures be determined for the purpose of estimating process downtime. This total average downtime equates to lost production time and, consequently, lost revenue.
In order to prove the value of this tool, the worth of its assumptions must be addressed. The most salient assumption must be that there is some net worth in examining the power generation process from the highest level. The purpose of a commercial power plant is not to answer the question: Are we smart enough to tame a nuclear fission reaction in populated areas while not managing to render a 700-square-mile area uninhabitable for 1.6 million years? The purpose is to supply electricity to the local grid for economic profit without rendering a 700-square-mile area uninhabitable for 1.6 million years, even when individual equipment fails. Again, back to our chess game. We play, even though we know that individual pieces will be lost in pursuit of winning the game. The cost of unreliability quantifies the losses expected from playing the game.
It also must be assumed that the number of failures in any given time interval will generally follow true to history. Unless some extraordinary effort is taken, the number of failures will not change. Corrective repair times will remain relatively constant for the same reason.
To make the translations to the cost of unreliability there is a question that needs to be answered. Should the costs of scheduled outages be included in the cost of unreliability?
Absolutely, for two reasons: the plant is not on the local power grid making money, and it is spending money rapidly to renew its assets. For an investor, the plant is in failure mode, skewered with a double-edged sword buried to the hilt. These facts must be accepted when placing a dollar value on a plant.
Assuming that 10 megawatts of electrical capacity translates into $5 million of potential gross profit, a nuclear power plant rated at 1200 electrical megawatts of output will yield a gross margin of $600 million per year or $68,493.15 per hour. When this loss is multiplied by the lost time due to failure, the hammer of unreliability is felt hard upon the anvil of business impact. The blacksmith takes another stroke when the cost of maintenance is added to gross margin loss.
Here we have represented the primary loop as a $25,000 per hour maintenance cost burden, the secondary loop as a $15,000 per hour cost burden, and the power transmission loop as an $8,500 per hour cost burden. These maintenance costs take into account the price of working with radioactive materials, additional personnel training and equipment, and the cost of returning the plant to full power operations. When the lost time due to the failure of the process is put into financial terms, it becomes apparent the cost of unreliability represents a substantial burden on the economic feasibility of the plant.
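The arithmetic behind the cost of unreliability can be sketched as follows. The per-process maintenance rates and the $5 million per 10 megawatts assumption are from the article; the split of downtime hours across the three processes is a hypothetical allocation, chosen only so that the total matches the 78 forced-outage hours used in the availability example:

```python
HOURS_PER_YEAR = 8760

# Article's assumption: 10 MW of capacity yields ~$5M/yr gross profit; 1200-MWe plant.
RATED_MW = 1200
GROSS_MARGIN_PER_YEAR = (RATED_MW / 10) * 5_000_000        # $600M per year
MARGIN_PER_HOUR = GROSS_MARGIN_PER_YEAR / HOURS_PER_YEAR   # ~$68,493.15 per hour

# (downtime hours, maintenance burden in $/h); the hour split is hypothetical,
# the maintenance rates are the article's.
downtime = {
    "primary loop":       (40, 25_000),
    "secondary loop":     (26, 15_000),
    "power transmission": (12, 8_500),
}

total = 0.0
for name, (hours, maint_rate) in downtime.items():
    cost = hours * (MARGIN_PER_HOUR + maint_rate)  # lost margin plus maintenance
    total += cost
    print(f"{name:20s} {hours:3d} h   ${cost:,.0f}")
print(f"Total cost of unreliability: ${total:,.0f}")
```

Under these assumptions, 78 hours of process downtime costs roughly $6.8 million, the bulk of it lost gross margin rather than maintenance spending.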
From this data model, two highly revealing values can be calculated: annual plant availability (the fraction of time the plant has the opportunity to make money) and plant reliability (the probability that the plant will get through the interval without a production-interrupting failure).
Availability = Uptime / Total Time = (8760 - 78) / 8760 = 99.1 percent

R = e^(-λt) = e^(-399.55 × 10^-6 × 8760) = 0.030 = 3.0 percent
These numbers speak volumes. These calculations show that while the plant is generally available to produce electricity, it has only a 3 percent probability of meeting a year-long operational commitment without incurring a forced outage or reduction in power generation. The price for this plant reliability comes to $6.8 million. This is the cost of unreliability.
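Both figures are quick to verify from the 78 downtime hours and the failure rate used in the example:

```python
import math

HOURS_PER_YEAR = 8760
DOWNTIME_HOURS = 78          # forced-outage hours from the example
FAILURE_RATE = 399.55e-6     # lambda: production-interrupting failures per hour

availability = (HOURS_PER_YEAR - DOWNTIME_HOURS) / HOURS_PER_YEAR
reliability = math.exp(-FAILURE_RATE * HOURS_PER_YEAR)

print(f"Availability = {availability:.1%}")   # 99.1%
print(f"Reliability  = {reliability:.1%}")    # ~3.0%
```

The gap between the two numbers is the whole point: a plant can be available 99 percent of the time and still be almost certain to suffer a forced outage somewhere in the year.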
It is easy to see why many power organizations publish quarterly plant availability reports to their boards of directors showing availability to be high while complaining that the price of maintenance continues to be excessive. The real truth of the matter is that owners are spending inordinate amounts of money to pay for a number that, when taken alone, means little to the bottom line.
We have presented a practical and simple tool for understanding why reliability is a vital ingredient of plant operations and maintenance. What started as an esoteric term for design engineers has become a signpost pointing the way to the high country. Knowing the cost of unreliability and where, within the context of process criticality, these costs are incurred will allow plant management to address and prioritize process failure issues, knowing the financial impact to their plant. MT