Method and apparatus for generating a telemetric impulsional response fingerprint for a computer system
One embodiment of the present invention provides a system for generating telemetric impulsional response fingerprints for an electronic system. The system operates by first determining a steady-state response of the electronic system under specified initial conditions. Next, the system introduces a sudden impulse step change to a parameter of the electronic system and then measures the dynamic response of the electronic system to the sudden impulse step change. The system then generates a multiparametric representation from the steady-state response and the dynamic response wherein the multiparametric representation simultaneously displays the steady-state response and the dynamic response.
The subject matter of this application is related to the subject matter in a co-pending non-provisional application by Kenny C. Gross and Lawrence G. Votta Jr. entitled, “Method and Apparatus for Monitoring and Recording Computer System Performance Parameters,” having Ser. No. 10/272,680 and filing date 17 Oct. 2002, which is incorporated herein by reference; and to the subject matter in a co-pending non-provisional application by Kenny C. Gross, Lawrence G. Votta Jr., and Adam Porter entitled, “Detecting and Correcting a Failure Sequence in a Computer System Before a Failure Occurs,” having Ser. No. 10/777,532 and filing date 11 Feb. 2004, which is incorporated herein by reference.
BACKGROUND
1. Field of the Invention
The present invention relates to techniques for enhancing reliability within computer systems. More specifically, the present invention relates to a method and an apparatus for proactively monitoring computer system components for faults by using telemetric impulsional response fingerprints.
2. Related Art
As electronic commerce grows increasingly prevalent, businesses rely more and more on enterprise computing systems to process ever-larger volumes of electronic transactions. A failure in one of these enterprise computing systems can be disastrous, potentially resulting in millions of dollars of lost business. More importantly, a failure can seriously undermine consumer confidence in a business, making customers less likely to purchase goods and services from the business. Hence, it is critically important to ensure high availability in such enterprise computing systems.
To achieve high availability in enterprise computing systems, it is necessary to capture unambiguous diagnostic information that can quickly pinpoint the source of defects in hardware or software. If systems have too little event monitoring, service engineers may be unable to quickly identify the source of a problem when it crops up at a customer site. This can lead to increased downtime, which can adversely impact customer satisfaction and loyalty.
One approach to address this problem is to monitor all aspects of a customer's data center and to send the monitored signals to a central monitoring center. This enables system administrators at the monitoring center to identify problematic discrepancies in system performance parameters and, if necessary, to direct service personnel to handle discrepancies more efficiently.
Existing continuous telemetry systems perform proactive fault monitoring of computer systems through passive surveillance, which does not impact the monitored system in any way. This approach can catch many types of faults. However, there are other latent faults that may appear only during dynamic stimulation. An analogy for these latent faults is a car that has a problem with acceleration: the problem may not reveal itself while the car is idling or cruising at a uniform speed.
Hence, what is needed is a method and an apparatus for proactively monitoring a computer system for faults without the shortcomings described above.
SUMMARY
One embodiment of the present invention provides a system for generating telemetric impulsional response fingerprints for an electronic system. The system operates by first determining a steady-state response of the electronic system under specified initial conditions. Next, the system introduces a sudden impulse step change to a parameter of the electronic system and then measures the dynamic response of the electronic system to the sudden impulse step change. The system then generates a multiparametric representation from the steady-state response and the dynamic response wherein the multiparametric representation simultaneously displays the steady-state response and the dynamic response.
In a variation of this embodiment, determining the steady-state response of the electronic system involves making measurements using a continuous system telemetry harness.
In a further variation, determining the steady-state response of the electronic system involves monitoring temperature, voltage, current, and/or vibration at multiple points within the electronic system.
In a further variation, introducing the sudden impulse step change involves changing a load, a temperature, a voltage, and/or a vibration within the electronic system.
In a further variation, measuring the dynamic response of the electronic system involves normalizing the dynamic response for measured system parameters.
In a further variation, generating the multiparametric representation involves creating a Kiviat diagram, which displays both the steady state response and the dynamic response.
In a further variation, the system detects incipient problems in the electronic system by comparing the multiparametric response representation with a standard multiparametric representation derived from a known good electronic system.
BRIEF DESCRIPTION OF THE FIGURES
The following description is presented to enable any person skilled in the art to make and use the invention, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present invention. Thus, the present invention is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.
The data structures and code described in this detailed description are typically stored on a computer readable storage medium, which may be any device or medium that can store code and/or data for use by a computer system. This includes, but is not limited to, magnetic and optical storage devices such as disk drives, magnetic tape, CDs (compact discs) and DVDs (digital versatile discs or digital video discs), and computer instruction signals embodied in a transmission medium (with or without a carrier wave upon which the signals are modulated). For example, the transmission medium may include a communications network, such as the Internet.
Overview
Research has shown that a computer field replaceable unit (FRU) can exhibit a wide range of subtle, incipient problems which can be amplified and easily spotted if one examines the dynamic response of the FRU just before and just after a well defined dynamic-stimulus perturbation. The present invention provides a method and apparatus for creating telemetric impulsional response fingerprints (TIRF) which can be used to detect such incipient problems, and to thereby enhance reliability, availability, and serviceability of enterprise computer systems.
The TIRF provides a new and unique “active probe” machine-learning technique that leverages continuous system telemetry to provide dynamic, multivariate “fingerprints” for FRUs that can be (1) compared with previous TIRFs for the same FRU, or (2) compared with TIRFs for “Golden FRUs” generated from FRUs that are certified to be operating nearly perfectly. These fingerprints can be used to recognize very subtle failure precursors, such as aging processes, degrading sensors, delamination of bonded components, solder-joint cracking, deterioration of socket connectors, and other mechanisms that may not show up during conventional ongoing reliability testing (ORT) or reliability quality testing (RQT) test sequences.
A continuous system telemetry harness (CSTH) has been developed (see U.S. patent application Ser. No. 10/272,680 entitled “Method and Apparatus for Monitoring and Recording Computer System Performance Parameters,” filed 17 Oct. 2002). The CSTH monitors temperatures, voltages, and currents throughout a system, as well as some discrete performance metrics extracted from the operating system. The CSTH provides signals that can be used to enhance root cause analysis (RCA) following system failures. The signals can also be monitored in real-time for early warning of the onset of problems.
For the above listed types of reactive and proactive surveillance techniques for FRUs and systems, the telemetry is passive and does not disturb the monitored system in any way.
The present invention leverages the CSTH while extending significantly the range of its diagnostic coverage with a new dynamic probe technique. The new dynamic probe technique described below provides a wealth of diagnostic information relating to the health of components, FRUs, and integrated systems.
The system described herein generates a TIRF for an FRU by:
- (1) introducing a sudden impulse step change in one or more operational parameters (e.g. load, temperature, voltage) associated with the FRU;
- (2) measuring the dynamic response of all monitorable parameters following the impulse; and
- (3) creating a multiparametric Kiviat diagram (also known as a spider plot) that contrasts the post-impulse behavior with the “reference” behavior. The Kiviat diagram provides a “dynamic fingerprint” for the FRU.
The TIRF provides a unique and concise representation of the dynamic response of the FRU to a controlled perturbation under specified initial conditions. The TIRF can be represented as a vector of signal values collected from sensors, arranged in a specific order, and normalized to represent the post-perturbation vs. pre-perturbation behavior as a multivariate “fingerprint” for that FRU. Furthermore, the TIRF can be plotted in Kiviat diagram format as a human visualization aid to very readily highlight exactly where any problems appear. As such, the TIRF provides a dynamic perturbation response signature of a given FRU under unified values of initial conditions. Note that each FRU can have several TIRFs corresponding to different types of perturbations.
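The vector form described above can be illustrated with a minimal sketch. The description does not specify how the normalization is performed, so the post/pre ratio used here is an assumption, as are the function name tirf_vector and the signal names in the usage note.

```python
from typing import Dict, List

def tirf_vector(pre: Dict[str, float],
                post: Dict[str, float],
                order: List[str]) -> List[float]:
    """Build a fingerprint vector from two sensor snapshots.

    Each entry is the post-perturbation value divided by the
    pre-perturbation value for one signal. The ratio is an assumed
    normalization; the signals are emitted in the fixed order given
    by `order` so that two fingerprints can be compared entry by entry.
    """
    vector = []
    for name in order:
        baseline = pre[name]
        if baseline == 0:
            # A ratio is undefined for a zero baseline.
            raise ValueError(f"zero pre-perturbation value for {name!r}")
        vector.append(post[name] / baseline)
    return vector
```

With hypothetical signals, a call such as tirf_vector({"cpu_temp": 50.0, "core_v": 1.2}, {"cpu_temp": 60.0, "core_v": 1.2}, ["cpu_temp", "core_v"]) yields an entry above 1 for the temperature, which rose after the perturbation, and an entry of 1 for the voltage, which did not change.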
An FRU TIRF is a very concise multivariate descriptor of a given FRU under specified conditions. Moreover, the collection of TIRFs for FRUs may have great potential to increase availability of complex enterprise servers. Along with standard long-duration online tests of FRUs in ORT and RQT, TIRFs can be generated very quickly to represent important diagnostic information about the FRU's dynamic operability and, in most cases, can be obtained without taking the FRU out of service.
Electronic System Under Test
Soft variables 103 can include metrics such as load, throughput, and transaction latencies. These variables are typically derived from the operating system of electronic system under test 102. Physical variables 104 include temperature, voltage, current, and vibration within electronic system under test 102. Canary variables 105 include synthetic user-transactions and quality of performance values for these synthetic transactions.
The testing methodology involves first establishing a steady-state for soft variables 103, physical variables 104, and canary variables 105. After the steady state has been established, the system takes a pre-perturbation snapshot of the system parameters. Next, the system applies a sudden impulse change to one or more of the variables. For example, one or more voltages applied to the system can be changed, or the load applied from the canary variables might be stepped to a maximum value to stress the system.
After the sudden impulse has been applied, the system measures the dynamic response of electronic system under test 102, and takes a post-perturbation snapshot of the system parameters. The system then uses the pre-perturbation and the post-perturbation parameters to generate a multiparametric response representation. This representation can be in the form of a Kiviat diagram.
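The testing sequence described above might be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: read_sensors and apply_step are hypothetical callables standing in for the telemetry harness and the perturbation mechanism, and the sketch assumes a steady state has already been established before it is called.

```python
from typing import Callable, Dict, List

Snapshot = Dict[str, float]

def run_impulse_test(read_sensors: Callable[[], Snapshot],
                     apply_step: Callable[[], None],
                     response_samples: int = 5) -> Dict[str, object]:
    """Run one perturbation test and return the collected data.

    read_sensors: hypothetical callable returning current parameter values.
    apply_step:   hypothetical callable that applies the sudden step change.
    """
    # Pre-perturbation snapshot (steady state assumed to be established).
    pre = read_sensors()
    # Sudden impulse step change to one or more variables.
    apply_step()
    # Dynamic response: a short series of samples taken after the step.
    response: List[Snapshot] = [read_sensors() for _ in range(response_samples)]
    # Post-perturbation snapshot.
    post = read_sensors()
    return {"pre": pre, "response": response, "post": post}
```

Passing the two operations in as callables keeps the sequence independent of how a particular system is instrumented, so the same routine can be exercised against a simulated system.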
Generating a Multiparametric Response Representation
After the sudden impulse step change, the system measures the dynamic response of electronic system under test 102 (step 206). The system then takes a post-perturbation snapshot of the system parameters (step 208). Finally, the system generates a multiparametric response representation (a Kiviat diagram) from the pre-perturbation snapshot and the post-perturbation snapshot (step 210). This Kiviat diagram can then be compared with a previous Kiviat diagram taken from electronic system under test 102, or it can be compared with a Kiviat diagram that was generated from a known good electronic system, to determine if electronic system under test 102 has any incipient failures.
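For illustration, the geometry of a Kiviat diagram can be sketched as follows: each monitored parameter is assigned its own spoke, the spokes are spaced at equal angles, and the parameter's normalized value is the distance along the spoke. The function name kiviat_vertices is hypothetical, and the sketch computes only the polygon vertices; drawing them is left to whatever plotting tool is available.

```python
import math
from typing import List, Tuple

def kiviat_vertices(values: List[float]) -> List[Tuple[float, float]]:
    """Return (x, y) polygon vertices for a Kiviat (spider) plot.

    values: one normalized value per monitored parameter, in a fixed
            order. Parameter i is placed on a spoke at angle 2*pi*i/n.
    """
    n = len(values)
    if n < 3:
        # Fewer than three spokes do not enclose a polygon.
        raise ValueError("a Kiviat polygon needs at least three parameters")
    vertices = []
    for i, value in enumerate(values):
        angle = 2.0 * math.pi * i / n
        vertices.append((value * math.cos(angle), value * math.sin(angle)))
    return vertices
```

Calling this once with the pre-perturbation values and once with the post-perturbation values gives two polygons on the same spokes, which is one way the two behaviors could be displayed together.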
Normalized Temperature Response
Multiparametric Response Representation
The inner and outer polygons in the Kiviat diagrams represent the minimum and maximum values for the monitored parameters. These polygons can be compared with polygons from a previous test of the same electronic system, or can be compared with polygons from a test on a known good system, to determine whether any incipient failures exist within the electronic system.
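A minimal sketch of this comparison follows. The description does not state how two polygons are compared, so the per-parameter absolute tolerance used here is an assumption, as are the names envelope and flag_deviations.

```python
from typing import Dict, List, Tuple

def envelope(samples: List[Dict[str, float]],
             order: List[str]) -> Tuple[List[float], List[float]]:
    """Return (inner, outer): per-parameter minimum and maximum values.

    samples: measurements of the monitored parameters over one test.
    order:   fixed parameter order, one entry per Kiviat spoke.
    """
    inner = [min(s[name] for s in samples) for name in order]
    outer = [max(s[name] for s in samples) for name in order]
    return inner, outer

def flag_deviations(test: List[float],
                    reference: List[float],
                    order: List[str],
                    tolerance: float = 0.05) -> List[str]:
    """List parameters whose value differs from the reference polygon.

    The absolute-difference threshold is an assumed criterion, chosen
    only to make the comparison concrete.
    """
    flagged = []
    for name, t, r in zip(order, test, reference):
        if abs(t - r) > tolerance:
            flagged.append(name)
    return flagged
```

The reference polygon can come either from an earlier test of the same system or from a known good system; the comparison itself is the same in both cases.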
The foregoing descriptions of embodiments of the present invention have been presented for purposes of illustration and description only. They are not intended to be exhaustive or to limit the present invention to the forms disclosed. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art. Additionally, the above disclosure is not intended to limit the present invention. The scope of the present invention is defined by the appended claims.
Claims
1. A method for generating telemetric impulsional response fingerprints for an electronic system, comprising:
- determining a steady-state response of the electronic system under specified initial conditions;
- introducing a sudden impulse step change to a parameter of the electronic system;
- measuring a dynamic response of the electronic system to the sudden impulse step change; and
- generating a multiparametric representation, which simultaneously displays the steady state response and the dynamic response.
2. The method of claim 1, wherein determining the steady-state response of the electronic system involves making measurements through a continuous system telemetry harness.
3. The method of claim 1, wherein determining the steady-state response of the electronic system involves monitoring at least one of temperature, voltage, current, and vibration at multiple points within the electronic system.
4. The method of claim 1, wherein introducing the sudden impulse step change involves changing at least one of a load, a temperature, a voltage, and a vibration within the electronic system.
5. The method of claim 1, wherein measuring the dynamic response of the electronic system involves normalizing the dynamic response for measured system parameters.
6. The method of claim 1, wherein generating a multiparametric response representation involves creating a Kiviat diagram, which displays both the steady state response and the dynamic response.
7. The method of claim 1, further comprising detecting incipient problems in the electronic system by comparing the multiparametric representation with a standard multiparametric representation derived from a known good electronic system.
8. A computer-readable storage medium storing instructions that when executed by a computer cause the computer to perform a method for generating telemetric impulsional response fingerprints for an electronic system, the method comprising:
- determining a steady-state response of the electronic system under specified initial conditions;
- introducing a sudden impulse step change to a parameter of the electronic system;
- measuring a dynamic response of the electronic system to the sudden impulse step change; and
- generating a multiparametric representation, which simultaneously displays the steady state response and the dynamic response.
9. The computer-readable storage medium of claim 8, wherein determining the steady-state response of the electronic system involves making measurements through a continuous system telemetry harness.
10. The computer-readable storage medium of claim 8, wherein determining the steady-state response of the electronic system involves monitoring at least one of temperature, voltage, current, and vibration at multiple points within the electronic system.
11. The computer-readable storage medium of claim 8, wherein introducing the sudden impulse step change involves changing at least one of a load, a temperature, a voltage, and a vibration within the electronic system.
12. The computer-readable storage medium of claim 8, wherein measuring the dynamic response of the electronic system involves normalizing the dynamic response for measured system parameters.
13. The computer-readable storage medium of claim 8, wherein generating a multiparametric response representation involves creating a Kiviat diagram, which displays both the steady state response and the dynamic response.
14. The computer-readable storage medium of claim 8, the method further comprising detecting incipient problems in the electronic system by comparing the multiparametric representation with a standard multiparametric representation derived from a known good electronic system.
15. An apparatus for generating telemetric impulsional response fingerprints for an electronic system, comprising:
- a determining mechanism configured to determine a steady-state response of the electronic system under specified initial conditions;
- a step-change mechanism configured to introduce a sudden impulse step change to a parameter of the electronic system;
- a measuring mechanism configured to measure a dynamic response of the electronic system to the sudden impulse step change; and
- a generating mechanism configured to generate a multiparametric representation, which simultaneously displays the steady state response and the dynamic response.
16. The apparatus of claim 15, wherein determining the steady-state response of the electronic system involves making measurements through a continuous system telemetry harness.
17. The apparatus of claim 15, wherein determining the steady-state response of the electronic system involves monitoring at least one of temperature, voltage, current, and vibration at multiple points within the electronic system.
18. The apparatus of claim 15, wherein introducing the sudden impulse step change involves changing at least one of a load, a temperature, a voltage, and a vibration within the electronic system.
19. The apparatus of claim 15, wherein measuring the dynamic response of the electronic system involves normalizing the dynamic response for measured system parameters.
20. The apparatus of claim 15, wherein generating a multiparametric response representation involves creating a Kiviat diagram, which displays both the steady state response and the dynamic response.
Type: Application
Filed: Aug 11, 2005
Publication Date: Feb 15, 2007
Inventors: Aleksey Urmanov (San Diego, CA), Anton Bougaev (Lafayette, IN), Kenny Gross (San Diego, CA)
Application Number: 11/203,361
International Classification: F24J 2/40 (20060101);