Troubleshooting is a form of problem solving, often applied to repair failed products or processes. It is a logical, systematic search for the source of a problem so that the problem can be solved and the product or process made operational again. The first step is to identify the symptoms. Determining the most likely cause is then a process of elimination, in which potential causes of the problem are ruled out. Finally, troubleshooting requires confirmation that the solution restores the product or process to its working state. In general, troubleshooting is the identification or diagnosis of “trouble” in the management flow of a corporation or a system caused by a failure of some kind. The problem is initially described as symptoms of malfunction, and troubleshooting is the process of determining and remedying the causes of these symptoms.
A system can be described in terms of its expected, desired or intended behavior (usually, for artificial systems, its purpose). Events or inputs to the system are expected to generate specific results or outputs. (For example, selecting the “print” option from various computer applications is intended to result in a hardcopy emerging from some specific device.) Any unexpected or undesirable behavior is a symptom. Troubleshooting is the process of isolating the specific cause or causes of the symptom. Frequently the symptom is a failure of the product or process to produce any results. (Nothing was printed, for example.) Once the cause has been isolated, corrective action can be taken to prevent further failures of a similar kind.
Usually troubleshooting is applied to something that has suddenly stopped working, since its previously working state forms the expectations about its continued behavior. So the initial focus is often on recent changes to the system or to the environment in which it exists. (For example, a printer that “was working when it was plugged in over there”.) However, there is a well-known principle that correlation does not imply causation. (For example, the failure of a device shortly after it has been plugged into a different outlet doesn’t necessarily mean that the events were related. The failure could have been a matter of coincidence.) Therefore troubleshooting demands critical thinking rather than magical thinking.
It is useful to consider the common experience of light bulbs. A light bulb “burns out” more or less at random; eventually the repeated heating and cooling of its filament, and fluctuations in the power supplied to it, cause the filament to crack or vaporize. The same principle applies to most other electronic devices, and similar principles apply to mechanical devices. Some failures are part of the normal wear and tear of components in a system.
Isolating single-component failures that cause reproducible symptoms is relatively straightforward. However, many problems only occur as a result of multiple failures or errors. This is particularly true of fault-tolerant systems, or those with built-in redundancy. Features that add redundancy, fault detection and failover to a system may also be subject to failure, and enough different component failures in any system will “take it down.”
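The point about redundancy can be illustrated with a small sketch. The model below is purely hypothetical: it assumes a service with a primary unit, a backup unit and a failover switch (all names invented for illustration), and it enumerates which combinations of failed parts bring the service down.

```python
from itertools import combinations

# Hypothetical component names, invented for this illustration.
PARTS = ["primary", "backup", "failover_switch"]


def system_up(failed):
    """Assumed model: the service is up if the primary works, or if
    both the backup and the failover switch work."""
    if "primary" not in failed:
        return True
    return "backup" not in failed and "failover_switch" not in failed


def fatal_failure_sets(size):
    """Return every combination of `size` failed parts that takes the service down."""
    return [combo for combo in combinations(PARTS, size)
            if not system_up(set(combo))]


print(fatal_failure_sets(1))  # []
print(fatal_failure_sets(2))  # [('primary', 'backup'), ('primary', 'failover_switch')]
```

Under these assumptions no single failure produces a symptom, so a broken failover switch can go unnoticed until the primary also fails. At that point the troubleshooter faces two faults at once.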
Even in simple systems the troubleshooter must always consider the possibility that there is more than one fault. (Serial substitution, in which each component is replaced in turn and the old component is swapped back in when the symptom persists, can fail to resolve such cases. More importantly, replacing any component with a defective one can actually increase the number of problems rather than eliminate them.)
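A minimal sketch shows why serial substitution with swap-back can miss multiple faults. It assumes a toy model in which the system works only when every component is good and every spare is known to be good. The component names and both function names are hypothetical.

```python
def system_works(components):
    """Assumed model: the system works only if every component is good."""
    return all(components.values())


def serial_substitution(components):
    """Replace one component at a time with a known-good spare.

    If the symptom persists, the original part is swapped back in.
    Returns the name of the component whose replacement fixed the
    system, or None if no single replacement did.
    """
    for name in components:
        original = components[name]
        components[name] = True        # fit a known-good spare
        if system_works(components):
            return name
        components[name] = original    # symptom persists: restore old part
    return None


single_fault = {"cable": True, "power_supply": False, "controller": True}
double_fault = {"cable": False, "power_supply": False, "controller": True}

print(serial_substitution(single_fault))  # power_supply
print(serial_substitution(double_fault))  # None
```

With one fault the procedure finds the bad part. With two faults each good spare is removed again because the other fault keeps the symptom alive, so the search ends without a result. Leaving each known-good spare in place until the symptom clears would avoid this, at the cost of replacing parts that may not have been faulty.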