All maintenance professionals know the hollow feeling in the pit of the stomach that accompanies sudden and unexpected silence out in the plant. Sooner or later, the process will roll to a stop. When it does, how will you view the subsequent diagnosis and repair? Will the breakdown be a problem, or will it provide an opportunity?
If you see the cessation of production simply as another in a seemingly unending series of maintenance crises, you probably will repair or replace the failed component as quickly as possible and your follow-up will be minimal. In this mindset, you are the fireman and the quicker you put out the fire, the happier everyone will be. This type of rapid response historically has been the yardstick by which maintenance organizations are measured. But, if you don’t take the time to correctly determine the causes of the downtime so that a failure analysis can be performed, you are by default setting up your next breakdown. When a machine or component fails, it is pointing out a weakness in your process. If you take the time to investigate the clues, you have the opportunity to improve your maintenance reality.
Notes on the concept of blame
Unfortunately, some organizations consider an interruption of the process to be the signal to begin the search for the guilty. If you find yourself in such a culture, you likely will have limited success in gathering the factual input required to positively impact your maintenance reality.
A majority of process failures involve human error at some level. Consequently, if employees—hourly or salaried—do not feel the level of trust that is required for total honesty, the process interruption investigation will not get to the root cause of the failure. Even worse, you may follow bad data up the wrong road and end up changing some other part of your process that was sufficient as it was. This could lead to subsequent process interruptions—from issues at the original failure site that have not been resolved, as well as from the new weakness in the process you have inadvertently created. In technical maintenance terms, this is called “chasing the rabbit.”
Step #1: Acquiring and interpreting facts
A successful process interruption investigation is, in its simplest form, the unbiased acquisition and interpretation of a set of facts. Since the interruption of a manufacturing process is a complex affair, however, the gathering of the data surrounding it is by no means a simple matter.
An incident must be looked at from several angles so that no important clue is neglected. And, while a good deal of information must be gathered, the plant cannot be down indefinitely. When production ceases, the business is not making money. That’s a condition that does not need to continue any longer than absolutely necessary. As your process interruption protocols develop and evolve, hourly and salaried employees who have been trained in the acquisition of pertinent data will gain speed at their tasks, and downtimes due to data collection can be held to a minimum.
The following activities should be part of your investigative approach to process interruptions. It is important to note that the production or maintenance supervisor on the scene is a poor choice to gather all of the necessary facts. Such individuals should be supervising during the upset condition. Other qualified employees must be assigned to gather the bulk of the information.
Think safety first.
A breakdown is an upset condition, so the potential for an injury is much greater than usual. The work area must be assessed for safety concerns by a competent person (or persons) prior to the gathering of data or the start of repairs. This is a very important step. Although the plant is down and everyone is in a hurry to get it back up, if a deliberate decision is not made to carefully evaluate the situation for potential and actual hazards, an injury could easily occur. Members of the Safety Committee are excellent candidates for this role.
Take photographs.
Every maintenance department needs to have access to a good-quality digital camera. As a tool, it is as valuable as a set of wrenches or a welder, and it costs less than either. A complete photographic record should be made of the point of failure. Be certain to take shots from all angles, and get as many close-ups as possible. If your particular product or one of your raw materials was involved in the breakdown due to a misfeed or some other reason, capture images of the widget or substance in question before it is removed from the machine.
Retrieve computer printouts.
If your process is under computer control, any printouts that are available to you should be retrieved. Some examples of the types of information that might be obtainable include piece number, piece size, piece composition, date, time, production speed, ambient temperature, previous interruptions, upstream and downstream upsets and operator number. If you have a condition monitoring system, you may be able to access and record bearing temperatures, lubrication flow rates, lubricant temperatures, electrical spikes and other anomalies. The point is to gather all the data available. It is better to have facts you don’t need than to be missing a key piece of information that you could have obtained—but didn’t.
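For plants that capture this information electronically, the list above can be sketched as a simple record. This is only an illustration: the class name, field names, units and sample values are all hypothetical, and a real control or condition monitoring system will dictate what is actually available.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime
import json


@dataclass
class InterruptionSnapshot:
    """Data pulled from the control and monitoring systems at the time of a stop.

    All field names and units are assumptions made for illustration.
    """
    machine_id: str
    timestamp: datetime
    operator_number: str
    piece_number: str
    production_speed: float       # pieces per minute (assumed unit)
    ambient_temp_c: float         # degrees Celsius (assumed unit)
    bearing_temps_c: dict = field(default_factory=dict)
    previous_interruptions: list = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the snapshot so it exists in a tangible, shareable form."""
        record = asdict(self)
        record["timestamp"] = self.timestamp.isoformat()
        return json.dumps(record, indent=2, sort_keys=True)


# Hypothetical example values.
snapshot = InterruptionSnapshot(
    machine_id="PRESS-07",
    timestamp=datetime(2024, 1, 15, 14, 32),
    operator_number="4412",
    piece_number="A-1093",
    production_speed=42.0,
    ambient_temp_c=18.5,
    bearing_temps_c={"drive_end": 82.5, "non_drive_end": 61.0},
    previous_interruptions=["2024-01-12 misfeed"],
)

print(snapshot.to_json())
```

Keeping every field in one record, even those that seem unimportant at the time, follows the principle stated above: it is better to have facts you don't need than to be missing one you could have obtained.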
Make video records.
Many organizations have invested in video monitoring as one way of controlling their processes. If you have access to video data, it should be someone’s assigned task during a cessation of production to recover the video data. During downtime, operators often find themselves with idle time. One of these hourly professionals could be assigned the task of burning the DVD or recording the videotape.
Recover the failed part.
This reminder seems obvious, but you would be surprised at how often the failed component is misplaced or thrown away. More than one replaced part has been found in the dumpster, on the catwalk or on the back of the maintenance truck. Failed components more often than not contain their own record of why they failed. It is imperative that the failed part be recovered, preserved and tagged so that it can be analyzed, either by your personnel or by factory reps or consultants.
Interview personnel.
The machine operator should be interviewed while the incident is fresh in everyone’s mind. Has the machine been running smoothly? Was it in automatic or manual? Have there been feed problems? Is the operator the primary or a back-up? Has the operator been newly trained, or is he/she a veteran?
The run-time maintenance personnel will need to be interviewed concerning any mechanical calls made to the area prior to the failure. Did offsets or adjustments have to be made? Were any unusual sounds or odors detected? Was there a new vibration? The personnel dispatched to perform the repair should be interviewed after the work is completed. They can provide information about machine condition and unexpected repair issues, and they can provide a summary of the necessary follow-up.
Assess recent corrective work orders.
The corrective work and emergency work maintenance files for the failed machine or component should be assessed. If corrective work has been performed recently on or around the affected area, this is an important piece of information. Was the correct part installed? Was the part installed correctly? Was there an SMP (standard maintenance procedure)? Why was the corrective work being performed in the first place? The answers to these questions may provide clues.
Review PM and PdM work orders.
The preventive maintenance (PM) and predictive maintenance (PdM) files for the failed machine or component should be reviewed. Are the PMs current? Has a different maintenance professional been assigned to the machine? Has a changing condition been detected or monitored?
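The question "Are the PMs current?" can be answered mechanically if PM completion dates are kept in a searchable form. The sketch below assumes each PM task is stored as a name, a last-completed date and an interval in days. That layout, the function name and the sample tasks are hypothetical and are not drawn from any particular maintenance management system.

```python
from datetime import date, timedelta


def overdue_pms(pm_tasks, today):
    """Return the names of PM tasks whose last completion is older than their interval.

    pm_tasks is a list of (name, last_done, interval_days) tuples (assumed layout).
    """
    overdue = []
    for name, last_done, interval_days in pm_tasks:
        if today - last_done > timedelta(days=interval_days):
            overdue.append(name)
    return overdue


# Hypothetical PM history for the failed machine.
tasks = [
    ("Lubricate drive bearings", date(2024, 1, 2), 30),
    ("Inspect belt tension", date(2023, 10, 1), 90),
]

late = overdue_pms(tasks, today=date(2024, 1, 15))
print(late)
```

Any task the check flags is a lead to follow up with the technician assigned to the machine. It is not a conclusion in itself.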
Check similar machines.
If there are similar machines in other parts of your plant or company, their history should be checked to ensure that the breakdown you have experienced is not part of a much larger issue. If you are a single-site operation, vendors, colleagues or factory reps are possible options for consultation. Online message boards are another possibility, as are individual vendor websites.
Record the data.
This cannot be overstated: If your facts do not exist in some tangible form, your facts simply do not exist. Memories fade very quickly, and nothing can corrupt data faster than word-of-mouth transfers of information. But even while urging you to write “it” down, I must warn you against the very real danger of having the medium become the message. If the breakdown report, or whatever you choose to call it, becomes the point of the exercise, you have wasted your time and your company’s money. The only purpose for the gathering of these facts is so a competent team of professionals can analyze them.
Think outside the box.
A quick and informal brainstorming session at the conclusion of the information collection period may provide further clues. Is it colder or hotter than usual? Has a new parts vendor been added? Could the failure in question be the result of another recent failure?
Step #2: Evaluating the data
Once you have gathered the facts concerning the process interruption, the next step is to evaluate the data. This portion of the process may be several days or weeks removed from the actual breakdown, depending on how long it has taken to examine the failed parts. If you have one, your reliability engineer should lead the team, as he/she has been trained in the proper issue resolution methodology. The team also should include a machine operator, at least one of the maintenance technicians who performed the repairs and the planner. While you may choose to have additional members, keep in mind that groups with more than six participants sometimes lose effectiveness.
The team will conduct an RCFA (Root Cause Failure Analysis). As was the case with an FMEA (Failure Modes and Effects Analysis), performing an RCFA can be a daunting task, but don’t let it scare you. The idea is to try to determine the ultimate circumstance or set of circumstances that led to the interruption of the process, so that the disruption can be avoided in the future. (By the way, a good source on the subject of RCFA development is http://www.bill-wilson.net/b35.html)
The time and trouble required to properly quantify and analyze a process interruption are well worth the effort. Just remember that your perceived maintenance reality (i.e., what you think is happening) has a way of becoming your actual reality, especially when it has been committed to paper. Thus, it is vital for everyone on the team to set aside agendas and preconceived ideas and try to get to the real root of the matter. Your process and your business depend on it.