If what you're doing isn't enhancing the value of your organization, what's the point?
Many in our field define the output of maintenance operations in terms of "reliability, availability and maintainability." Why, though, can't we simply say that the output of maintenance should be to improve the value of the organization? But how do we define value? Let's use that universal KPI favored by senior executives: ROI (Return On Investment). In other words, if it does not add value, don't do it.
This article focuses on this type of approach in resolving some common maintenance issues. Using logic, statistics and well-accepted techniques, we can improve maintenance decision-making. The result is better selection of maintenance tactics, improved equipment reliability and increased company value.
Silos, silos, silos
RCM (Reliability Centered Maintenance), CMMS (Computerized Maintenance Management Systems), EAMs (Enterprise Asset Management systems) and CM (Condition Monitoring) have been used for many years to help select and implement maintenance tactics. But, all have flaws that inhibit their effective use; here are some of these drawbacks—plus suggestions for remedies.
- RCM pays little attention to historical data and is often seen as competing with the main historical data source (the CMMS). Any objective view, though, sees them as complementary. How, for example, could we better track the occurrence frequency of RCM failure modes than on a CMMS work order? And how better to start an RCM analysis than by examining failure modes that have actually occurred? The resolution lies in linking the RCM and CMMS databases.
Fig. 1. A maintenance manager should receive this type of simple failure cost report every month.
- An RCM reality is that failure modes (FMs) are not comprehensive; they're dependent on "what if" scenarios applied by RCM analysts. Studies show that the gap between the FMs predicted and those actually observed can be huge. Resolving this situation requires enhancing the RCM database by adding actual maintenance experience, as it happens, from the CMMS.
- Most failure analysis predicts the future by looking backwards (i.e. at the CMMS history). This approach, however, typically omits, for example, the RCM concept of potential failures (PFs). There are two reasons for this. First, the CMMS does not allow for such data to be collected. Second, the technicians are not trained to recognize actual PFs. Solving this problem calls for minor modifications to the CMMS work order process and a modicum of training at the proverbial coal face (i.e. for those working in day-to-day maintenance operations).
- CM data should prompt intelligent decision-making. Three steps are involved. First, collect the right data (and stop collecting the wrong data). Second, do the right analysis. Third, use the data for making the right decision. In our experience, upwards of 70% of the data has no predictive ability and, of course, key data is missing. A critical point in resolution of the problem is that CM data must relate to the failure mode; the question is how to demonstrate this.
No one questions the need to predict failures.
The simple statement above raises a number of issues.
- Which failures do we predict? Refer to the objective of maintenance ("adding value") and use "Cost of Failure" as the primary determining factor, defined as:
Cost of Failure = Cost of Emergency Repair + Cost of Lost Revenue + Penalty Costs, Reputation Costs, Fines and Reparations
Recognizing some needed substitutions ("mission readiness" instead of revenue, for example), we know that an emergency failure greatly inflates repair costs. Safety costs, environmental costs, political embarrassment, etc. also need to be factored in. This leads to a simple failure cost report (Fig. 1) that every maintenance manager should have on his/her desk each month!
That type of report draws our attention to the overall cost of the failures rather than frequency and duration. Bad Actors are redefined as "Bad Cost Actors."
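As a minimal sketch of how such a monthly report could be assembled, the Python below applies the Cost of Failure formula to each failure and ranks assets by total cost rather than by failure count. The asset names, field names and dollar figures are hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class FailureEvent:
    """One failure, with the cost components from the Cost of Failure formula."""
    asset: str
    emergency_repair: float   # cost of emergency repair
    lost_revenue: float       # cost of lost revenue (or a substitute such as mission readiness)
    penalties: float = 0.0    # penalty costs, reputation costs, fines and reparations

    @property
    def cost_of_failure(self) -> float:
        return self.emergency_repair + self.lost_revenue + self.penalties


def failure_cost_report(events):
    """Total cost of failure per asset, highest cost first ("Bad Cost Actors")."""
    totals = {}
    for event in events:
        totals[event.asset] = totals.get(event.asset, 0.0) + event.cost_of_failure
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical month of failures; every figure here is made up for illustration.
events = [
    FailureEvent("Pump P-101", emergency_repair=4_000, lost_revenue=1_000),
    FailureEvent("Pump P-101", emergency_repair=3_000, lost_revenue=2_000),
    FailureEvent("Pump P-101", emergency_repair=2_000, lost_revenue=500),
    FailureEvent("Compressor C-7", emergency_repair=30_000, lost_revenue=90_000, penalties=15_000),
]

for asset, cost in failure_cost_report(events):
    print(f"{asset:<16} {cost:>12,.0f}")
```

In this made-up data the pump fails three times and the compressor only once, yet the compressor tops the list because its single failure costs far more. That is the shift from "Bad Actors" to "Bad Cost Actors."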
- How do we measure resistance to failure? This has great theoretical importance, but as a practical issue it is complex and not well understood. By substituting "performance" as a proxy for "resistance to failure," the concept becomes simpler and easier to understand. As illustrated in Fig. 2, therefore, a 1000-GPM pump has "failed" if it pumps "only" 999 gallons per minute (the required amount being necessary for feedstock supply, cooling purposes, etc.). Thus, an instance of functional failure (FF) can be quickly defined and just as easily recorded on the work order.
Fig. 2. The concept of "performance" is easier to understand than "resistance to failure."
- As performance slides down the slippery slope of the PF curve, the point of acceleration in the rate of degradation is often clearly apparent in practice, thus suggesting the PF point. Accordingly, we can define a specific condition value for the PF point (in the Fig. 2 example, it's 1100 gallons per minute). Using a "Pass/Fail" inspection greatly eases data collection and analysis. Equally important is the fact that the PF acts as a warning signal needing a maintenance response.
- What if the PF and FF points are not predictable? Or what if they're identical? Let's use electronic and electrical equipment as an example. The PF and FF points clearly exist, but they are simultaneous. Condition monitoring will not help, except to advise of complete failure. Here, we respond with standby units, plug-out/plug-in replacements and similar techniques. (Note the diagram in Fig. 3.)
Fig. 3. Condition monitoring will not help when PF and FF points are unpredictable or identical.
- How does age fit into the equation? As Nowlan and Heap pointed out, age directly impacts failure in only a small number of cases. Yet, intuitively, we feel that age is an important factor. As an alternative, let's define it as "working age." This has two implications for probability of failure: load and stress (negative) and out-of-service state (positive). Load is often difficult to track accurately, so we default to operating hours (which equals total time minus out-of-service time). This requires us to record "suspensions" on the work order (as an alternative to a PF or an FF).
- How do we relate the multiple streams of CM data to the FM? Here we can use Proportional Hazards Modeling, a statistical technique showing which variables have a predictive impact on the failure mode, and which have little or none. This technique is built into EXAKT, a product developed by Dr. Andrew Jardine at the University of Toronto. Repeated use of this tool suggests that most CM data has little or no relationship to the incidence of failure and, thus, can be ignored as a predictor. Such data does not need collecting. At the same time, key data, such as working age and other condition variables, are frequently missing.
- In predicting failure, the predictive ability of CM data must be accurate and consistent. EXAKT achieves this by providing a probability of failure in a given period (completion of a mission, prior to a maintenance shutdown, etc.), and applying a statistical test showing confidence levels. Relating the three elements of failure probability, confidence levels and cost of failure provides strong insight into the "best" maintenance tactic to follow. Low confidence levels prompt both conservative action (to pre-empt the FF point) and enhanced data collection, especially when the cost of failure is high.
- These shortcomings of CM as a true predictor of failure prompt the development of a better approach. The required output is improved reliability analysis; but there are already many effective reliability analysis tools on the market. What is the missing link? Let's call it a reliability database—collecting and holding the many data inputs ready for the analysis.
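To make the proportional hazards idea discussed above more concrete, here is a minimal sketch in Python. It is not EXAKT's implementation, which this article does not describe. It assumes a generic Weibull proportional hazards form with covariates held constant over the prediction horizon, and the parameter values (`beta`, `eta`, `gammas`) and covariates are hypothetical. In practice they would be estimated from the PF, FF and suspension history.

```python
import math


def cumulative_hazard(age, beta, eta, gammas, covariates):
    """Cumulative hazard H(t) = (t / eta)**beta * exp(sum(gamma_i * z_i)).

    Assumes a Weibull baseline and covariates that stay constant over time.
    """
    risk_score = sum(g * z for g, z in zip(gammas, covariates))
    return (age / eta) ** beta * math.exp(risk_score)


def failure_probability(age, horizon, beta, eta, gammas, covariates):
    """Probability of failing within `horizon`, given survival to working age `age`."""
    h_now = cumulative_hazard(age, beta, eta, gammas, covariates)
    h_later = cumulative_hazard(age + horizon, beta, eta, gammas, covariates)
    return 1.0 - math.exp(-(h_later - h_now))


# Hypothetical model: the first covariate (say, iron in oil, ppm) is predictive;
# the second (say, a temperature reading) has a zero coefficient and is not.
p = failure_probability(
    age=8_000, horizon=500, beta=2.0, eta=10_000,
    gammas=[0.04, 0.0], covariates=[25.0, 80.0],
)
print(f"Probability of failure in the next 500 hours: {p:.1%}")
```

A covariate whose coefficient is zero (or statistically indistinguishable from zero) leaves the probability unchanged whatever its reading, which is the sense in which such data "can be ignored as a predictor."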
Key to the reliability database are the sources of data. Here is a proposed structure:
- Historical data, primarily from the CMMS, but with simple modifications to the work order to accommodate the missing FF, PF and Suspension data. Also, as explained later, this adds a cross-reference to the appropriate RCM database record.
Fig. 4. Living RCM software links data sources and acts as a data traffic cop, collecting, cleansing and storing data to create a reliability database.
- Current status data—primarily from the CM sensors. These will give us (along with PLCs, SCADA and others) the best insight into the current equipment conditions.
- Expected data. These will tell us which failures should realistically be expected, based on equipment assessment and operating context as in the RCM database.
To accommodate these data sources, Living RCM (LRCM) software (as shown in Fig. 4) has been developed. It links the data sources and acts as a data traffic cop, collecting, cleansing and storing the data to create a reliability database. This is the feedstock for commercial reliability tools (such as EXAKT and others).
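What might one record in such a reliability database hold? The sketch below is one possible shape, not the LRCM schema (which is not published here). It assumes a work-order event that carries the PF/FF/Suspension flag, a cross-reference to the RCM record and the working age (total time minus out-of-service time). The class, field and function names are hypothetical, and the Pass/Fail thresholds come from the Fig. 2 pump example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class EventType(Enum):
    POTENTIAL_FAILURE = "PF"
    FUNCTIONAL_FAILURE = "FF"
    SUSPENSION = "S"   # taken out of service without a PF or an FF


@dataclass
class WorkOrderEvent:
    asset: str
    event_type: EventType
    total_hours: float                    # elapsed hours since installation or renewal
    out_of_service_hours: float           # hours the asset was not operating
    rcm_record_id: Optional[str] = None   # cross-reference to the RCM failure mode

    @property
    def working_age(self) -> float:
        """Operating hours = total time minus out-of-service time."""
        return self.total_hours - self.out_of_service_hours


def classify_flow(flow_gpm, pf_limit=1100.0, ff_limit=1000.0):
    """Pass/Fail check for the Fig. 2 pump; returns None while the pump is healthy."""
    if flow_gpm < ff_limit:
        return EventType.FUNCTIONAL_FAILURE   # e.g. 999 GPM from a 1000-GPM pump
    if flow_gpm <= pf_limit:
        return EventType.POTENTIAL_FAILURE    # at or below the PF point
    return None


# Hypothetical inspection: 1,050 GPM measured after one year with 760 hours of downtime.
event = WorkOrderEvent(
    asset="Pump P-101",
    event_type=classify_flow(1050.0),
    total_hours=8_760,
    out_of_service_hours=760,
    rcm_record_id="RCM-001",
)
print(event.event_type.value, event.working_age)
```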
Fig. 5. The advantages of integrating CMMS and RCM are quite evident.
Earlier, we linked the RCM and CMMS databases. These are clearly complementary in prompting a better understanding of failure and reliability. Contrary to common practice, the best output of an RCM analysis is not dusty tomes in the engineering office, but an improved work order. Similarly, a satisfactory output of a work order is an improved RCM record—where the work adds new knowledge or a new failure mode. Looking at the activity flow in Fig. 5, the advantages of integrating CMMS and RCM are quite evident:
- Inspection prompts identification of measurable potential failures.
- This prompts creation of a PM Work Order (or often, an on-the-spot remedial or preventive action).
- PM tasks are specifically designed to prevent a functional failure. If we cannot tie the PM tasks to the prevention of a functional failure, we need to challenge the value of the PM.
- LRCM prompts the technician completing the work order not to use the typical Fault Code (the value of which is highly questionable, and in our experience rarely used), but rather to access the failure mode in the RCM database and insert it in the work order.
- If a "significant" task adds to our knowledge (e.g. a new failure mode or new effects compared to the RCM record), a temporary record is created by LRCM and awaits validation by the RCM analysis team.
- Add the RCM record number to the work order to track the occurrence and frequency of the RCM failure mode, a very valuable analysis aid.
- Finally, an unexpected occurrence of a failure mode in a critical piece of equipment demands several responses: the repair of the equipment, the repair of the RCM record AND the repair of the RCM logic AND all the other records that used the same logic. Ease of access to the RCM database from the CMMS thus becomes critical to creating a regime of Living Reliability.
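The work-order-to-RCM feedback loop in the steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the LRCM software itself: the record numbers, work order numbers and the `process_work_orders` function are hypothetical.

```python
from collections import Counter

# Hypothetical RCM database: record number -> failure mode description.
rcm_records = {
    "RCM-001": "Impeller worn",
    "RCM-002": "Seal leaking",
}

# Hypothetical completed work orders:
# (work order number, RCM record number or None, technician's description).
work_orders = [
    ("WO-1001", "RCM-001", ""),
    ("WO-1002", "RCM-002", ""),
    ("WO-1003", "RCM-001", ""),
    ("WO-1004", None, "Coupling misaligned"),
]


def process_work_orders(work_orders, rcm_records):
    """Count occurrences per RCM failure mode and queue unknown modes for review."""
    occurrences = Counter()
    pending_validation = []
    for wo_number, record_id, description in work_orders:
        if record_id in rcm_records:
            occurrences[record_id] += 1
        else:
            # New knowledge: hold a temporary record until the RCM team validates it.
            pending_validation.append(
                {"work_order": wo_number, "failure_mode": description}
            )
    return occurrences, pending_validation


occurrences, pending = process_work_orders(work_orders, rcm_records)
for record_id, count in occurrences.most_common():
    print(record_id, rcm_records[record_id], count)
print("Awaiting RCM validation:", pending)
```

The occurrence counts give the RCM analysts observed frequencies per failure mode, while the pending list captures failure modes the original "what if" analysis did not anticipate.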
Fig. 6. The higher the Risk Ratio, the greater the PM's leverage in reducing risk. Likewise, the higher the PM's ROI, the more value is added to the company.
Does maintenance improvement actually happen?
The bottom line? Does application of the techniques described here add value—does it improve decision-making? Here are several indicators:
- Does a new maintenance tactic reduce costs? EXAKT's cost function compares failure cost with cost of preventive repair. The cost model optimizes the combination of preventive work and run to failure, compared to the current mix of maintenance tactics. A second modeling option provides the optimum balance of PM and run to failure (RTF) to achieve the minimum downtime, or (model three) to achieve a given minimum level of reliability. Industrial experience shows cost reductions of about 20% to 40% of current maintenance costs, using the customers' cost data as the baseline. Substantial cost reduction certainly is achievable.
- Is the quality of maintenance improved? Refer to one of RCM's fundamentals. A key insight is the use of PFs to prevent FFs, an easily measurable and self-checking KPI that is built into the analysis program. It would be a remarkable improvement if our vibration analysis or oil analysis programs could tell us whether they are doing their job properly. Or not.
- Is business decision-making improved? To answer this, examine the following basic logic:
- Applying the cost of failure to the probability of failure provides a practical definition of "Risk." The do-nothing scenario can be called the "Run Risk."
Run Risk = Cost of Failure × Probability of Failure
- Next, calculate the PM cost (using parallel logic, but different cost numbers) times the probability of doing the PM (which is 100%). We can then define the "PM Risk."
- By comparing the Run Risk to the PM Risk, we can develop a Risk Ratio.
Risk Ratio = Run Risk ÷ PM Risk
Fig. 7. By tracking the change in the Risk Ratio through time, we can determine, before the next scheduled maintenance shutdown or end of the mission, if the Risk Ratio trend exceeds operating parameters.
We can now decide whether a $40,000 investment in a PM to avoid the Run Risk of $200,000 (comprising a 25% probability of an $800,000 failure) is a good decision: a Risk Ratio of 5:1. Or should we spend $360,000 to eliminate a 15% probability of a $3 million failure (a Run Risk of $450,000), a Risk Ratio of roughly 1.3:1 (see Fig. 6)? It is apparent that the higher the Risk Ratio, the greater the PM's leverage in reducing risk. Likewise, the higher the PM's ROI, the more value is added to the company.
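The arithmetic behind these two cases is easy to reproduce. The short Python sketch below implements the Run Risk, PM Risk and Risk Ratio definitions given above. The function names are our own, chosen for illustration.

```python
def run_risk(cost_of_failure, probability_of_failure):
    """Run Risk = Cost of Failure x Probability of Failure (the do-nothing scenario)."""
    return cost_of_failure * probability_of_failure


def pm_risk(pm_cost, probability_of_pm=1.0):
    """PM Risk = PM cost x probability of doing the PM (100% once we commit to it)."""
    return pm_cost * probability_of_pm


def risk_ratio(cost_of_failure, probability_of_failure, pm_cost):
    """Risk Ratio = Run Risk / PM Risk."""
    return run_risk(cost_of_failure, probability_of_failure) / pm_risk(pm_cost)


# The two cases from the text.
case_a = risk_ratio(cost_of_failure=800_000, probability_of_failure=0.25, pm_cost=40_000)
case_b = risk_ratio(cost_of_failure=3_000_000, probability_of_failure=0.15, pm_cost=360_000)

print(f"Case A Risk Ratio: {case_a:.2f}:1")
print(f"Case B Risk Ratio: {case_b:.2f}:1")
```

Case A works out to 5:1. Case B works out to 1.25:1, which rounds to the 1.3:1 quoted in the text.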
As a logical next step, we can establish if the Risk Ratio (in this case 5:1) violates the organization's risk limits policy. In addition, by tracking the change in the Risk Ratio through time (Fig. 7), we see, before the next scheduled shutdown or end of the mission, whether the Risk Ratio trend exceeds operating parameters.
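One simple way to run that check is sketched below. It assumes a straight-line trend fitted by least squares to past Risk Ratio observations and projected to the shutdown date. That is a modeling assumption made here for illustration, not a method prescribed by this article. The observation history and the risk limit are hypothetical.

```python
def linear_fit(points):
    """Least-squares slope and intercept for (time, risk_ratio) observations."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def projected_ratio(points, at_time):
    """Risk Ratio projected to `at_time`, assuming the linear trend continues."""
    slope, intercept = linear_fit(points)
    return slope * at_time + intercept


def exceeds_limit_before(points, shutdown_time, limit):
    """True if the latest observation or the projection at shutdown is above the limit."""
    latest = points[-1][1]
    return max(latest, projected_ratio(points, shutdown_time)) > limit


# Hypothetical history: (operating hours, Risk Ratio observed at that time).
history = [(0, 1.0), (100, 2.0), (200, 3.0)]

print(projected_ratio(history, at_time=500))
print(exceeds_limit_before(history, shutdown_time=500, limit=5.0))
```

If the check returns true, the PM is a candidate to be brought forward rather than left for the scheduled shutdown.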
A stronger foundation
We have shown how solid business logic leads to maintenance improvements and reduced maintenance costs. Providing an objective assessment of business risks offers a strong foundation for improved decision-making. Each step is straightforward and can be implemented with minimal change to an existing business process. Moreover, with each step, we get closer to our goal: to demonstrate, quite clearly, that maintenance is truly a provider of business value.
Ben Stevens is president of OMDEC Inc., based in Godfrey, ON, Canada. OMDEC (Optimal Maintenance Decisions Inc.) provides asset management consulting, training and software solutions for clients around the globe.