Power Engineering

High-Stakes Troubleshooting


06/01/2006

High-stakes troubleshooting can be harrowing, but using a systematic process can ease the pressure, increase your chances of success and make it easier every time.

By Geoff Edelman, Kepner-Tregoe Inc.

Late one afternoon, with no warning, a nuclear power plant in the Eastern United States experienced an automatic shutdown. The first indication of a problem came when the plant shut down due to a trip signal from the main turbine. Because about 20 different logic signals can lead to this trip, the plant’s troubleshooters faced a real challenge.

Coincidentally, the plant support services supervisor and senior mechanical design engineer had just completed two weeks of training to become certified in a troubleshooting process, using a systematic, rational process to solve problems. They were the logical choice to lead the team searching for what caused the trip.

The troubleshooters quickly assembled a team of mechanical, electrical and turbine engineers; an instrument and control supervisor; and a lead engineer for the turbine systems. The two team leaders started using their newly acquired process knowledge, asking, in strict order, a series of questions designed to elicit relevant information from the group about the problem. For the next four hours, they helped the team create a detailed problem description, formulate possible causes and test those causes against the facts.

By the end of the session, the cause that best explained all the facts with the fewest assumptions emerged as the simultaneous closure of all turbine stop valves and control valves.

“No way,” said the manufacturer’s main turbine experts and many other workers who were not part of the team. All eight valves could never have closed simultaneously. Some suggested the problem was caused by a row of commutator brushes on the main generator, which had been installed just 20 minutes before the trip. Others blamed a lightning strike 12 hours earlier.

The team hung tough. It proved, by testing these and other “pet causes” against the problem specification, that none could account for the facts that were known about the problem. And those facts couldn’t be disputed. An inspection of the cables in the indicated valve-protection logic revealed a short in the cable that transmits a signal that activates the valve-protection switches. The short gave an erroneous indication that the valves had closed, even though they hadn’t, and in response a trip signal had been sent to the turbine. The cable was replaced, the turbine was restarted, and the plant was back in business.
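The test the team applied can be sketched in a few lines of Python. This is a minimal illustration, not the plant’s actual analysis: the fact labels, the verdicts assigned to each cause and the fourth, unnamed cause are all invented for the example. A cause is dropped if it fails to account for any known fact, and the survivors are ranked by how many assumptions they need.

```python
# Verdicts a team might record when testing one cause against one fact.
EXPLAINS, ASSUMES, FAILS = "explains", "assumes", "fails"


def rank_causes(verdicts):
    """Return the causes that survive testing, fewest assumptions first.

    verdicts maps each cause to a {fact: verdict} dict. A cause that FAILS
    on any fact is eliminated; the rest are sorted by assumption count.
    """
    survivors = []
    for cause, by_fact in verdicts.items():
        if FAILS in by_fact.values():
            continue  # cannot account for a known fact
        assumptions = sum(1 for v in by_fact.values() if v == ASSUMES)
        survivors.append((assumptions, cause))
    return [cause for _, cause in sorted(survivors)]


# Hypothetical facts and verdicts, for illustration only.
FACT_A = "all eight valves indicated closed at the same moment"
FACT_B = "trip occurred late afternoon, not at the time of the storm"

verdicts = {
    "short in valve-protection cable": {FACT_A: EXPLAINS, FACT_B: EXPLAINS},
    "new commutator brushes": {FACT_A: FAILS, FACT_B: EXPLAINS},
    "lightning strike 12 hours earlier": {FACT_A: ASSUMES, FACT_B: FAILS},
    "some other cause (hypothetical)": {FACT_A: ASSUMES, FACT_B: ASSUMES},
}

ranked = rank_causes(verdicts)
print(ranked)
```

Keeping the verdicts in one table makes it harder for a pet cause to survive on the strength of whoever argues for it.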

Since his troubleshooting baptism by fire, the design engineer has learned a lot about helping problem solvers under time pressure find and prove cause quickly and efficiently. He and five other expert troubleshooters from several nuclear and fossil power generators identify below eight key actions that make the difference between taking shots in the dark and hitting the bull’s-eye on the first try.

Think first, act later

When a plant goes down, the pressure is on. Every troubleshooter has heard: “Do something. I don’t care what, just do something.” And, every good troubleshooter knows it is the worst possible advice.

Jens Friedrichsen, senior system engineer at Exelon Nuclear’s Quad Cities Nuclear Power Station, knows firsthand just how much trouble the shot-in-the-dark approach can create. “At one time,” he explains, “our troubleshooting and root-cause analysis consisted of determining every possible way a problem could have been caused and then physically dispositioning each one. We did find and fix most problems with this method, but it was very time-consuming and expensive.”

In 2001, Exelon Nuclear initiated a program to upgrade the skills of troubleshooters at its Quad Cities Plant. Seven employees, from operations, maintenance, training and engineering, went through a train-the-trainer program and began teaching analytic troubleshooting methods back at the plant. Friedrichsen contrasts the way problems are now solved with the old, trial-and-error approach. “Following preventive maintenance, an out-of-service emergency diesel generator output breaker had failed to close four times in a row, after various repair attempts over a four-hour period. Then the breaker successfully closed four times in a row over another four-hour period. No one knew whether the problem had actually been ‘fixed’ or if another set of failures would occur. If we couldn’t get the generator up and running, we were going to have to shut down the reactor. That would have cost $250,000 a day. And if we got it up and running and then the breaker had failed to close, we would have incurred additional costs of regulatory scrutiny: another $50,000 to $100,000.”

The new “think first, act later” approach to troubleshooting paid off, Friedrichsen says. Despite pressure to get the generator back in service, the troubleshooting team compared the breaker-closure attempts, looking for the differences and changes that could explain why the problem occurred in some attempts but not others. They identified a probable cause (loose wiring connections during the initial four closure attempts, tighter connections in the next round), tested it against the problem specification and got the generator back online without incident.
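The comparison the team made can be pictured as a small Python sketch. It assumes each closure attempt has been written down as a record of observed conditions; the field names and the technician column are hypothetical, and only the loose-versus-tight wiring pattern comes from the account above. The function keeps the conditions whose values never overlap between failed and successful attempts.

```python
def distinguishing_factors(attempts):
    """Find conditions that separate failed attempts from successful ones.

    attempts is a list of dicts, each with a boolean "closed" key plus any
    number of observed conditions. A condition is kept only if no value seen
    in a failed attempt was also seen in a successful one.
    """
    failed = [a for a in attempts if not a["closed"]]
    passed = [a for a in attempts if a["closed"]]
    if not failed or not passed:
        return {}  # nothing to compare against
    keys = set().union(*(a.keys() for a in attempts)) - {"closed"}
    factors = {}
    for key in sorted(keys):
        failed_values = {a.get(key) for a in failed}
        passed_values = {a.get(key) for a in passed}
        if failed_values.isdisjoint(passed_values):
            factors[key] = (failed_values, passed_values)
    return factors


# Four failures followed by four successes; the technician field is invented.
attempts = (
    [{"closed": False, "wiring": "loose", "technician": t} for t in "ABAB"]
    + [{"closed": True, "wiring": "tight", "technician": t} for t in "ABAB"]
)

factors = distinguishing_factors(attempts)
print(factors)
```

A condition that shows up on both sides, such as the technician here, drops out, which is the point of comparing every attempt rather than only the failures.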

One problem at a time

A major obstacle to successful problem solving under time pressure is failing to identify the one problem that needs to be solved. It’s up to the facilitator guiding the team through the troubleshooting process to make sure that before the problem analysis begins, team members agree on an accurate, specific statement of a single, top-priority problem.

“Often,” Friedrichsen points out, “a system can remain operational even if there are several ongoing problems within it. Then, along comes a problem that disables the entire system. Under time pressure, the goal isn’t to solve all of these problems. It’s to identify and solve the one that caused the system to fail.”


For successful troubleshooting, information should be gathered in an orderly, step-by-step sequence.

Friedrichsen illustrates with the example of a moisture separator drain-tank-level controller plagued for years with constantly oscillating levels. Numerous unsuccessful attempts had been made to tune the control system to reduce the oscillations. Then, one day, the level rose so high it no longer registered on the indicator. Even so, the high-level alarm failed to go off. To manage the problem, the control system’s instrument tubing was back-filled, which caused the level indicator to return to normal, although it was still oscillating. This loss-of-indication problem began to recur every five months.

It was not obvious which problem to resolve first. After careful consideration, the team concluded that the periodic (every five months) off-scale high-level indication was the critical issue. The oscillation and the failure of the high-level alarm could wait, and did, until the cause (a three-drip-per-hour leak on the controller’s reference leg) was found and corrected.

The right questions and answers

When the newly trained troubleshooters at the nuclear plant introduced earlier in the story began to question the team investigating the main turbine trip, team members were frustrated. The process they were using asked the team members to identify all the “is nots” of the problem: the places it didn’t occur, the times it hadn’t been seen, the problems that might have been occurring but weren’t. This approach to fact-gathering was totally alien to them, as it can be to many. But in most cases it is the only way to gain a complete picture of a problem.

Asking “is not” questions in four dimensions (what, where, when and how large) gives troubleshooters a basis for comparison. In turn, this can significantly narrow the search for cause. If one of two identical pieces of equipment begins to malfunction, there must be a difference between the two that is related to the cause of the problem. Perhaps one is in a high-traffic area or near a heat source. Or, perhaps the one with the problem wasn’t installed according to the manufacturer’s specs. Once differences between the “is” (the malfunctioning unit) and the “is not” (the one without problems) are identified, those differences can be examined for evidence of what may have caused the problem. Did placing the pump too close to a heat source cause the rubber gasket to dry out? Is that why this pump leaks while one on the other side of the room doesn’t? The same logic holds true for information relating to the nature of the defect, the place it is noticed, the time it first showed up and so on.
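One way to keep the four dimensions in view is to record the specification as a simple structure and check it for gaps before anyone proposes a cause. The Python sketch below is an assumption about how such a record might look, not a published template; the class name, the dimension keys and the pump entries are illustrative.

```python
from dataclasses import dataclass, field

# "extent" stands in for the "how large" dimension.
DIMENSIONS = ("what", "where", "when", "extent")


@dataclass
class ProblemSpec:
    statement: str
    is_facts: dict = field(default_factory=dict)      # where the problem IS seen
    is_not_facts: dict = field(default_factory=dict)  # where it could be but IS NOT

    def missing(self):
        """Dimensions that still lack an IS entry or an IS NOT entry."""
        return [
            d for d in DIMENSIONS
            if d not in self.is_facts or d not in self.is_not_facts
        ]


spec = ProblemSpec("Pump leaks at the gasket")
spec.is_facts["what"] = "the pump installed near the heat source"
spec.is_not_facts["what"] = "the identical pump across the room"
spec.is_facts["where"] = "at the rubber gasket"
spec.is_not_facts["where"] = "at the pipe connections"

print(spec.missing())
```

Here the check reports that the “when” and “extent” questions have not been answered yet, which tells the facilitator what to ask next.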

To understand the value of a tight problem specification and the value of gathering “is not” data, consider the following. A crane was stopping exactly 12 minutes after it began raising a load. It never stopped sooner than 12 minutes into the lift and it never stopped when it was lowering the load. When the group investigating the problem asked what was different between the raising and lowering operations, it identified a controller, unique to the former, which was losing its signal. The manufacturer was asked what would cause the controller to lose its signal after exactly 12 minutes. After consulting his controller programmers, the manufacturer’s representative discovered that a feature had been mistakenly installed on the controller. Once the feature was disabled, the crane worked fine.

The expression “garbage in, garbage out” is just as true for troubleshooting as it is for computers. The ability to find true cause is only as good as the information put into the problem specification. Whenever possible, check, double-check and triple-check the facts people provide, and identify those employees who have been around the longest and have the most reliable memories.

One process

During an outage, emotions (and adrenaline) run high. Often, when the troubleshooting team first assembles, ideas, especially about cause, are thrown up, shot down and sometimes brought up again. Without a shared systematic process for tackling problems, the team can go in circles indefinitely, wasting time and money while struggling to get a handle on the situation.

When everyone on a troubleshooting team uses the same process, order is quickly restored. Information is gathered in an orderly, step-by-step sequence. Everyone on the team is on the same page, gathering information, developing possible causes, then testing those causes to determine which is most probable and, finally, verifying true cause. No jumping back and forth, no going over ground that’s already been covered, no fixating on pet causes. It’s the most efficient way we know of to get to the root of a problem.

To get the highest return on a troubleshooting process, the appropriate users, including new employees, must be trained as soon as they come on board. The troubleshooting process must be factored into work systems and documentation, have management’s involvement and, most importantly, be backed by a reward and recognition system that encourages its use.

Even when all employees have been trained in a systematic process, a strong personality is required to keep them on track. As a member of the issue-response team at Constellation Energy Group’s Nine-Mile Point plant, internal consultant Steve Davis is often called on to facilitate under emergency conditions. Davis believes the key to success in such circumstances is being firm without being dictatorial.

“Sometimes the team gets frustrated. It seems as though you are asking all these questions and getting nowhere, and they get impatient. In reality, you are moving toward cause much faster by following the process steps. If you just keep bringing them back to the process, asking the questions, you’ll get there.”

Steve Osborn, performance engineer at Entergy’s Grand Gulf Nuclear Station, believes you have to give groups some leeway or you risk losing them. This is especially important with troubleshooters who are just learning to use a systematic process. “Let them talk for a while,” he advises. “They’ll use technical jargon, jump to cause, defend their pet theories and try to impress one another with their content knowledge. When they’ve gotten all that out of their system, then you need to lead them down the process road.”

Terry Larbes, manager, technical services for Electric Energy Inc., says he is known for being forthright and candid, which he believes is a crucial quality in a facilitator. “I allow the group some latitude to roam before I bring them back into process. But I always make a public note of it when they begin wandering off the main trail. Later, it will be a learning experience for them.”

The right people

Nine-Mile Point plant’s issue-response team is a multi-disciplinary group that goes into action as soon as an outage occurs. “The disciplined approach we use has resulted in many benefits,” says Davis. “We have a dedicated core group in the outage control center. We augment the core team with knowledgeable people that we bring in to help us define the problem, gather the facts and develop possible causes. Sometimes we bring in other station personnel who have special expertise in the problem area. Or, as when we had a maintenance problem on a diesel generator, we’ll bring in the vendor because it’s the content expert.”

The definition of “expert” will vary from problem to problem: in some cases it will be the vendor, in others the customer, in still other cases engineering or maintenance personnel. Sometimes the best information and insights come not from technically savvy individuals but from generalists in the field.

Friedrichsen also stresses the importance of having the right people, and enough of them, on the team. “Sometimes managers want to assign the analysis to a separate group of people because the knowledgeable individuals are too busy managing the problem. But if this second group doesn’t have the facts, or if the event is still unfolding, they really can’t do a proper job of analytic troubleshooting. It often takes less than one hour to create a problem specification and test possible causes, if the right people are assigned. Even for busy or critical personnel, this is the best investment of an hour they can possibly make.”

Beyond the fix

Seasoned troubleshooters know that many problems could be prevented if people thought more carefully about the consequences of their actions. They call this “thinking beyond the fix.” Based on his experience, Davis has seen positive results after personnel are trained to think about the changes they are making and how those changes could create new problems. “By taking a few preventive actions, they manage to eliminate new problems before they ever occur,” he says.

Just as finding cause follows a well-thought-out, rational process, the best fix should be chosen systematically. Following a shared, decision-making template will help the team determine which option best meets the objectives with the least risk. And, in cases where the choices are limited, it can focus the group’s thinking on ways to lessen the impact of those risks that can’t be avoided.
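A decision template of this kind can be as simple as a table. The Python sketch below is one possible form, assuming the team weights its objectives, scores each option against them and lists each option’s risks as a probability and an impact. The options, weights and numbers are invented for illustration. Benefit and risk are reported side by side rather than merged into one figure, so the team still makes the final call.

```python
def compare_options(options, weights):
    """Return (name, weighted benefit, expected risk) rows, best benefit first.

    weights maps each objective to its importance. Each option carries a
    "scores" dict (how well it meets each objective) and a "risks" list of
    (probability, impact) pairs.
    """
    rows = []
    for name, option in options.items():
        benefit = sum(weights[obj] * option["scores"][obj] for obj in weights)
        risk = sum(prob * impact for prob, impact in option["risks"])
        rows.append((name, benefit, risk))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical objectives and options.
weights = {"back online quickly": 10, "low cost": 5}
options = {
    "option A": {
        "scores": {"back online quickly": 10, "low cost": 6},
        "risks": [(0.25, 20)],
    },
    "option B": {
        "scores": {"back online quickly": 8, "low cost": 9},
        "risks": [(0.5, 50)],
    },
}

rows = compare_options(options, weights)
for name, benefit, risk in rows:
    print(name, benefit, risk)
```

In this made-up case the two options score almost the same on benefit, and the risk column is what separates them.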

Terry Larbes says that the concept of thinking beyond the fix, or avoiding future problems, has been used with great success in his organization. “Many of the maintenance crews use the ideas in their start-of-shift meetings,” says Larbes, “which helps them focus on working safely and not doing anything to cause new problems.”

Lessons learned

Nuclear power plants have a leg up on many other industries when it comes to record keeping. Having access to a database of operating experience from the entire industry makes it easier to locate the root-cause analysis and corrective-action plan for a problem that is new to a site but not to the industry. Osborn has made use of it for everything from equipment malfunctions to animal intrusions.

For those not in nuclear, creating and using a comprehensive database of all problem resolutions and corrective action plans is one of the best ways to keep from reinventing the problem-solving wheel. Bill Brockman, senior engineer at Duke Energy Generation Services (formerly Cinergy Generation Resources), says that at Duke Energy (formerly Cinergy), and particularly at the older, coal-fired plants, this information resided in the heads of employees who have been there for years. The only way newer troubleshooters got it was if they happened to mention a problem they were involved in or asked an “old timer” for information. Now decades of troubleshooting history have been computerized, and more than 1,000 people at Duke Energy have instant access to it.

Not just for emergencies

Often, employees learn new skills in a training program, then return to work and promptly forget all they’ve learned. The best way to become a crack troubleshooter is to use the same systematic problem-solving process every time a cause-unknown deviation occurs, not just when the plant goes down.

Larbes believes that before serving on a team charged with solving a major problem, employees should be given experience solving less critical problems. And, if they have had the opportunity to use the problem-solving process privately, while being coached by their instructor, they will be much more comfortable participating in a group problem analysis.

Once adopted, the troubleshooting process should make people’s work easier, not more difficult. A cumbersome, hard-to-use system will never take root and become part of the organization’s culture. Osborn was at the Grand Gulf Nuclear Station in the 1980s when the problem-solving/decision-making program was first introduced.

“Back then,” he recounts, “emphasis was placed on using the full process: crossing every ‘t’ and dotting every ‘i.’ Now we’re more interested in having mechanics that can stand out there with the questions on a card and solve a problem on a pump.” It works at Grand Gulf, says Osborn, because of the maturity of the organization; most people have been using the same systematic process for so long that it’s simply “the way we do things.”

Another way to make systematic troubleshooting a way of life is to show new troubleshooters how to apply those parts of the process that are needed and not insist that everyone use the full set of problem-solving tools every time. Not all changes cause problems, but all problems are caused by change. That’s why Bill Brockman says that, in an emergency, troubleshooters can save a great deal of time and effort by focusing on changes. “Especially in emergencies,” says Brockman, “a sudden change usually causes the problem: a transformer that fails and shuts down a machine, a pressure spike that ruptures a pipe, and so on. Getting a comprehensive list of changes and looking closely at what is distinctive about the problem situation versus a normal situation (mentally comparing ‘is’ versus ‘is not’) is a great place to start when you have limited time.”
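Where a plant keeps a timestamped change log, building that list of changes can be partly automated. The Python sketch below assumes such a log exists as (time, description) pairs; the entries, the failure time and the 24-hour window are hypothetical. It returns the changes that fall inside the window before the failure, most recent first. A change being recent makes it a candidate, not a proven cause, so each one still has to be tested against the facts.

```python
from datetime import datetime, timedelta


def recent_changes(change_log, failure_time, window):
    """List changes made within `window` before the failure, newest first."""
    earliest = failure_time - window
    candidates = [
        (when, what)
        for when, what in change_log
        if earliest <= when <= failure_time
    ]
    return sorted(candidates, reverse=True)


# Hypothetical change log and failure time.
failure_time = datetime(2006, 6, 1, 15, 0)
change_log = [
    (datetime(2006, 5, 29, 9, 0), "control software updated"),
    (datetime(2006, 6, 1, 3, 0), "storm passed over the site"),
    (datetime(2006, 6, 1, 14, 40), "component replaced on generator"),
]

shortlist = recent_changes(change_log, failure_time, timedelta(hours=24))
for when, what in shortlist:
    print(when.isoformat(), what)
```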

Author

Geoff Edelman (gedelman@kepner-tregoe) is director of Kepner-Tregoe Inc.’s Energy Group. Kepner-Tregoe is an international consulting and skill development firm that specializes in strategy formulation and implementation, problem-solving and decision-making, project management and human performance management.
