Scalable Prediction Failure Analysis For Memory Used In Modern Computers
One embodiment provides a method for scalable predictive failure analysis. Embodiments of the method may include gathering memory information for memory on a user computer system having at least one processor. Further, the method includes selecting one or more memory-related parameters. Further still, the method includes calculating based on the gathering and the selecting, a single bit error value for the scalable predictive failure analysis through calculations for each of the one or more memory-related parameters that utilize the memory information. Yet further, the method includes setting, based on the calculating, the single bit error value for the user computer system.
Correctable memory errors are becoming a major issue in modern personal computers, especially since supported memory sizes often reach terabytes instead of gigabytes. To that end, complex predictive failure analyses are desirable in order to anticipate and prevent mild to catastrophic system failures involving data loss and damage due to memory errors.
BRIEF SUMMARY
One embodiment provides a method for scalable predictive failure analysis. Embodiments of the method may include gathering memory information for memory on a user computer system having at least one processor. Further, the method includes selecting one or more memory-related parameters from a plurality. Further still, the method includes calculating based on the gathering and the selecting, a single bit error value for the scalable predictive failure analysis through calculations for each of the one or more memory-related parameters that utilize the memory information. Yet further, the method includes setting, based on the calculating, the single bit error value for the user computer system.
Another embodiment provides a computer program product for scalable predictive failure analysis. The computer program product includes a computer readable storage device. Further, the computer program product includes first program instructions to gather memory information for memory on a user computer system having at least one processor. Further still, the computer program product includes second program instructions to select one or more memory-related parameters. Yet further, the computer program product includes third program instructions to calculate based on the gather and the select (i.e., performing the instructions to gather and to select), a single bit error value for the scalable predictive failure analysis through calculations for each of the one or more memory-related parameters that utilize the memory information. Still further, the computer program product includes fourth program instructions to set, based on the calculate (i.e., performing the instructions to calculate), the single bit error value for the user computer system, wherein the first, second, third, and fourth program instructions are stored on the computer readable storage device.
Another embodiment provides a system for scalable predictive failure analysis. The system includes a processor, a computer readable memory and a computer readable storage device. Further, the system includes first program instructions to gather memory information for memory on a user computer system having at least one processor, wherein the memory may be the same as, part of, or different from the computer readable memory. Further still, the system includes second program instructions to select one or more memory-related parameters. Yet further, the system includes third program instructions to calculate, based on the gather and the select, a single bit error value for the scalable predictive failure analysis through calculations for each of the one or more memory-related parameters that utilize the memory information. Further still, the system includes fourth program instructions to set, based on the calculate, the single bit error value for the user computer system. The first, second, third, and fourth program instructions of the system are stored on the computer readable storage device for execution by the processor via the computer readable memory.
So that the manner in which the above recited features, advantages and objects of the present disclosure are attained and can be understood in detail, a more particular description of this disclosure, briefly summarized above, may be had by reference to the embodiments thereof which are illustrated in the appended drawings.
It is to be noted, however, that the appended drawings illustrate only example embodiments of this disclosure, and, therefore, are not to be considered limiting of its scope, for this disclosure may admit to other equally effective embodiments.
The following is a detailed description of example embodiments with accompanying drawings. The example embodiments are in such detail as to communicate the invention. However, the amount of detail offered is not intended to limit the anticipated variations of embodiments; on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the present invention as defined by the appended claims.
Generally speaking, systems, methods and media for scalable predictive failure analysis (SPFA) for single bit errors (SBE) in memory are disclosed. Embodiments include gathering, for a user computer system, memory information, such as memory size, synchronous dynamic random access memory (SDRAM) technology on the module, module packaging, memory failure mode and vendor quality. Calculation of the SBE value ensues through combining calculation(s) for each of the selected memory-related parameters, wherein the selecting optionally occurs subsequent or prior to the gathering. The calculated SBE value is set and valid for the user computer system until powering down or changing memory components in the user computer system. Accordingly, the SBE value is scalable because the value is determined for the particular user computer system—not simply a fixed, generic value. Alerts, whether audible or visible, may occur based on comparing counted SBEs to the scalable SBE value. The alerts provide credible predictive failure analysis to avert system memory failures while incorporating the realities of the unique complexities for the particular user computer system.
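The comparison of counted SBEs against the scalable SBE value can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the class name `SbeMonitor`, its method names, and the use of a sliding time window are assumptions made for the example. The alert condition follows the "at least equals" wording of the claims.

```python
from collections import deque

# Assumed counting window; the description speaks of counts within a 24-hour window.
WINDOW_SECONDS = 24 * 60 * 60


class SbeMonitor:
    """Hypothetical monitor: counts SBEs in a sliding window and compares
    the count to the SBE value set for this particular system."""

    def __init__(self, sbe_value, window_seconds=WINDOW_SECONDS):
        self.sbe_value = sbe_value
        self.window_seconds = window_seconds
        self._timestamps = deque()  # times (in seconds) of recent SBEs

    def record_error(self, now):
        """Record one SBE at time `now`; return True if an alert is due."""
        self._timestamps.append(now)
        # Discard errors that have aged out of the counting window.
        while now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        # Alert when the counted SBEs at least equal the set SBE value.
        return len(self._timestamps) >= self.sbe_value
```

A caller would feed each detected error into `record_error` and raise a visible or audible alert when it returns `True`; otherwise monitoring simply continues.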
In general, the routines executed to implement the embodiments of the invention may be part of a specific application, component, program, module, object, or sequence of instructions. The computer program of the present invention typically is comprised of a multitude of instructions that will be translated by the native computer into a machine-readable format and hence executable instructions. Also, programs are comprised of variables and data structures that either reside locally to the program or are found in memory or on storage devices. In addition, various programs described herein may be identified based upon the application for which they are implemented in a specific embodiment of the invention. However, it should be appreciated that any particular program nomenclature herein is used merely for convenience, and thus the invention should not be limited to use solely in any specific application identified and/or implied by such nomenclature.
While specific embodiments will be described below with reference to particular configurations of hardware and/or software, those of skill in the art will realize that embodiments of the present invention may advantageously be implemented with other substantially equivalent hardware, software systems, manual operations, or any combination of any or all of these. The invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In a preferred embodiment, the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc. Moreover, embodiments of the invention may also be implemented via parallel processing using a parallel computing architecture, such as one using multiple discrete systems (e.g., plurality of computers, etc.) or an internal multiprocessing architecture (e.g., a single system with parallel processing capabilities).
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Aspects of embodiments of the invention described herein may be stored or distributed on computer-readable medium as well as distributed electronically over the Internet or over other networks, including wireless networks. Data structures and transmission of data (including wireless transmission) particular to aspects of the invention are also encompassed within the scope of the invention. Furthermore, the invention can take the form of a computer program product accessible from a computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. The medium may be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
Each software program described herein may be operated on any type of data processing system, such as a personal computer, server, etc. A data processing system suitable for storing and/or executing program code may include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements may include local memory employed during execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution. Input/output (I/O) devices (including but not limited to keyboards, displays, pointing devices, etc.) may be coupled to the system either directly or through intervening I/O controllers. Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks, including wireless networks. Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters.
Turning now to the drawings,
Regardless of individual logic location, the system 100 has accessible logic to gather memory information for memory 105 on the user computer system 100. The gathering module 110 gathers memory information, such as memory size, synchronous dynamic random access memory (SDRAM) technology on the module, module packaging, memory failure mode and vendor quality, for memory 105 under test on the particular user computer system 100. For example, memory information for memory 105 could be a module size of 2 GB for a single-rank dual in-line memory module (DIMM). Below, further discussion of memory information occurs in combination with discussion of selected memory-based parameters.
The system 100 also includes logic, denominated as a configuration module 120 in
In communication with both the gathering and configuration modules 110, 120, the calculation module 130 includes logic to calculate a combination of the selected memory-related parameters. The SPFA uses the selected number of memory-related parameters, which one considers critical to maintain a functioning memory subsystem, in order to calculate the SBE value. The setting module 140 then sets the calculated SBE value for the system 100. Evaluation of exemplary memory-related parameters and combination of the same for calculation of the SBE value now ensues.
Memory module size is a memory-related parameter for possible inclusion in the SPFA calculation for the memory 105. For such, the following exemplary scale is provided for a correctable SBE value based on the actual capacity of each module or module-pairs installed in the system:
Referring to Table 1, and assuming x=256 SBEs for a baseline PFA count within a 24-hour window, a larger memory 105 DIMM logically permits more SBEs before meeting or exceeding a set SBE value, i.e., a threshold. For example, the memory-based parameter for memory module size would allow 256 SBEs for a 2 GB DIMM, 512 SBEs for a 4 GB DIMM, 1024 SBEs for an 8 GB DIMM, 2048 SBEs for a 16 GB DIMM, and 4096 SBEs for a 32 GB DIMM before a memory failure is signaled by a visual and/or audio alert through use of the detection and comparison modules 115, 145.
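The size-based scale in this example grows in proportion to DIMM capacity, which can be sketched as below. The function name `size_sbe_threshold` and the constants are hypothetical, and applying the same proportion to capacities other than the five listed is an extrapolation rather than something the description states.

```python
BASELINE_SBE = 256     # x in the example: allowed SBEs for the baseline DIMM
BASELINE_SIZE_GB = 2   # capacity of the baseline DIMM in the example


def size_sbe_threshold(size_gb, baseline=BASELINE_SBE):
    """Scale the baseline count in proportion to DIMM capacity
    (2 GB -> x, 4 GB -> 2x, 8 GB -> 4x, and so on)."""
    if size_gb < BASELINE_SIZE_GB or size_gb % BASELINE_SIZE_GB != 0:
        raise ValueError(f"unsupported DIMM size: {size_gb} GB")
    return baseline * size_gb // BASELINE_SIZE_GB


for size in (2, 4, 8, 16, 32):
    print(size, "GB ->", size_sbe_threshold(size), "SBEs")
```

The loop reproduces the five values quoted above: 256, 512, 1024, 2048 and 4096.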
In addition to memory module size, another possibly selected memory-related parameter for inclusion in the calculation of the SBE value is SDRAM technology on the memory module 105. For such, the following exemplary scale is provided:
Referring to Table 2, and assuming y=1024 for a baseline PFA count within a 24-hour window, memory 105 DIMM with a lesser rank permits a higher SBE value. For example, the memory-based parameter for SDRAM technology would allow 1024 SBEs for a single-rank DIMM, 823 SBEs for a dual-rank DIMM, and 640 SBEs for a quad-rank DIMM before alerting the user or another system in network communication with the system 100 of memory failure of a module or other memory device needing repair or replacement, whereupon the latter at least suggests a new SBE value should be re-set by re-calculation.
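Because the rank-based values quoted here do not follow a simple multiplier, a lookup table is the most direct way to represent them. The sketch below uses only the three values stated in the example; the names `RANK_SBE` and `rank_sbe_threshold` are hypothetical.

```python
# Example thresholds quoted in the text for y = 1024, keyed by number of ranks.
RANK_SBE = {1: 1024, 2: 823, 4: 640}


def rank_sbe_threshold(ranks):
    """Return the example SBE threshold for a DIMM with the given rank count."""
    try:
        return RANK_SBE[ranks]
    except KeyError:
        raise ValueError(f"no example threshold for {ranks}-rank DIMMs") from None
```

A real deployment would populate the table with values chosen by the system designers rather than these illustrative figures.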
Still another memory-related parameter for inclusion in the calculation of the SBE value is module packaging of the memory 105 on the particular user computer system 100. For such, the following exemplary scale is provided:
IBM® Chipkill™ is an advanced error checking and correcting (ECC) computer technology that has the ability to correct multi-bit memory errors on a single SDRAM. Referring to Table 3, and assuming z=256 for a baseline PFA count within a 24-hour window, memory 105 DIMM with additional advanced ECC protection, i.e., Chipkill™, affords a higher SBE value due to this individual PFA metric. For example, the memory-based parameter regarding Chipkill™ would allow 256 SBEs for an x8 DIMM with no Chipkill™, 512 SBEs for an x8 DIMM with Chipkill™, and 640 SBEs for an x4 DIMM with Chipkill™.
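The packaging parameter depends on two pieces of gathered information, the SDRAM device width and whether Chipkill protection is present, so it can be keyed on that pair. This is an illustrative sketch using only the three combinations named in the example; `PACKAGING_SBE` and `packaging_sbe_threshold` are hypothetical names.

```python
# Example thresholds quoted in the text for z = 256,
# keyed by (device width, Chipkill present).
PACKAGING_SBE = {
    ("x8", False): 256,
    ("x8", True): 512,
    ("x4", True): 640,
}


def packaging_sbe_threshold(device_width, chipkill):
    """Return the example SBE threshold for a packaging/ECC combination."""
    key = (device_width, bool(chipkill))
    if key not in PACKAGING_SBE:
        raise ValueError(f"no example threshold for {device_width}, chipkill={chipkill}")
    return PACKAGING_SBE[key]
```

Combinations not listed in the example, such as an x4 DIMM without Chipkill, are rejected rather than guessed.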
Yet another memory-related parameter for optional inclusion in the calculation of the SBE value is memory failure mode of the memory 105 on the particular user computer system 100. Here, this memory-related parameter regards single count reduction for a single memory address. That is, a correctable SBE that occurs repeatedly at the same memory address on memory 105 DIMM is counted as one failure instead of counting the repeats as multiple failures.
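The single count reduction described here amounts to counting distinct failing addresses instead of raw error events. A minimal sketch, assuming the error log is available as a sequence of addresses (the function name `count_unique_sbe` is hypothetical):

```python
def count_unique_sbe(error_addresses):
    """Count correctable SBEs so that repeats at the same address count once."""
    return len(set(error_addresses))


# Four logged events, but only two distinct addresses.
log = [0x1000, 0x1000, 0x2000, 0x1000]
print(count_unique_sbe(log))  # prints 2
```

The reduced count, rather than the raw event count, would then be compared against the SBE value.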
Another example of a memory-related parameter for optional inclusion in the calculation of the SBE value is vendor quality of the memory 105 on the particular user computer system 100. For such, the following exemplary scale is provided:
Table 4 represents a memory vendor quality/reliability matrix on a per product basis. A memory vendor can have multiple products, each of which could have a different quality/reliability rating. A quality scale rating, such as that of Table 4, may be used for calculating the SBE value. A memory 105 DIMM from a supplier with a lower quality score yields a lower PFA threshold value for this memory-related parameter. A lower quality score would require replacement or repair sooner as compared to a higher quality score, provided all other PFA memory-related parameters contributing to the SBE value are constant.
For calculation purposes, combination of the selected memory-related parameters may be through simple addition, multiplication, a mixture of the two, or any other combination method so as to yield a reliable, relative, and meaningful SBE value for SPFA. For example, the foregoing five memory-related parameters may calculate an SBE value according to: PFA(sum)=PFA(a)+PFA(b)+PFA(c)+PFA(d)+PFA(e), where PFA(a) through PFA(e) denote the thresholds contributed by the five parameters. The value of each memory-related PFA threshold and time window(s) should be defined by the subject matter expert on the system design team. That is, the illustrative tables provided herein are intended solely as examples and are neither the only nor necessarily the appropriate values to use. Whether applied during a hardware built-in memory test, a power-on memory test (i.e., the power-on self-test, POST), system run time, or a memory diagnostic test, this disclosure enables a selectable and scalable PFA for memory 105 that thwarts consequences of memory failures for a particular user computer system 100.
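The combination step can be sketched as a small function that takes the per-parameter thresholds and a combination method. The function name `combine_thresholds` and the `method` argument are hypothetical; only addition and multiplication are shown, and the usage values are the example figures quoted earlier for a 4 GB, single-rank, x8 DIMM with Chipkill.

```python
import math


def combine_thresholds(values, method="sum"):
    """Combine per-parameter PFA thresholds into one SBE value.

    `values` holds one threshold per selected memory-related parameter.
    """
    values = list(values)
    if not values:
        raise ValueError("at least one memory-related parameter must be selected")
    if method == "sum":
        return sum(values)
    if method == "product":
        return math.prod(values)
    raise ValueError(f"unknown combination method: {method}")


# Size (4 GB), rank (single) and packaging (x8 with Chipkill) example thresholds.
selected = [512, 1024, 512]
print(combine_thresholds(selected))  # prints 2048
```

Whether a sum, a product, or a mixture yields a meaningful value depends on how the individual thresholds are defined, which the description leaves to the system design team.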
In the depicted embodiment, the computer system 200 includes a processor 202, storage 204, memory 206, a user interface adapter 208, and a display adapter 210 connected to a bus 212 or other interconnect. The bus 212 facilitates communication between the processor 202 and other components of the computer system 200, as well as communication between components. Processor 202 may include one or more system central processing units (CPUs) or processors to execute instructions, such as an IBM® PowerPC® processor, an Intel® Pentium® processor, an Advanced Micro Devices, Inc. processor or any other suitable processor. IBM and PowerPC are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Intel and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. The processor 202 may utilize storage 204, which may be non-volatile storage such as one or more hard drives, tape drives, diskette drives, CD-ROM drive, DVD-ROM drive, or the like. The processor 202 may also be connected to memory 206 via bus 212, such as via a memory controller hub (MCH). System memory 206 may include volatile memory such as random access memory (RAM) or double data rate (DDR) synchronous dynamic random access memory (SDRAM). In the disclosed systems, for example, a processor 202 may execute instructions to perform functions, such as by gathering memory information and selecting memory-related parameters for inclusion for SPFA calculations. Information before, during or after calculations may temporarily or permanently be stored in storage 204 or memory 206.
Turning now to
Returning to
BIOS 480 is coupled to ISA bus 440, and incorporates the necessary processor executable code for a variety of low-level system functions and system boot functions. BIOS 480 can be stored in any computer readable medium, including magnetic storage media, optical storage media, flash memory, random access memory, read only memory, and communications media conveying signals encoding the instructions (e.g., signals from a network). In order to attach computer system 401 to another computer system to copy files over a network, LAN card 430 is coupled to PCI bus 425 and to PCI-to-ISA bridge 435. Similarly, to connect computer system 401 to an ISP to connect to the Internet using a telephone line connection, modem 475 is connected to serial port 464 and PCI-to-ISA Bridge 435.
While the computer systems described in
Another embodiment of the disclosure is implemented as a program product for use within a device such as, for example, those systems and methods depicted in
In general, the routines executed to implement the embodiments of this disclosure, may be part of an operating system or a specific application, component, program, module, object, or sequence of instructions. The computer program of this disclosure typically comprises a multitude of instructions that will be translated by the native computer into a machine-readable format and hence executable instructions. Also, programs are comprised of variables and data structures that either reside locally to the program or are found in memory or on storage devices. In addition, various programs described hereinafter may be identified based upon the application for which they are implemented in a specific embodiment of this disclosure. However, it should be appreciated that any particular program nomenclature that follows is used merely for convenience, and thus this disclosure should not be limited to use solely in any specific application identified and/or implied by such nomenclature.
While the foregoing is directed to example embodiments of this disclosure, other and further embodiments of this disclosure may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.
Claims
1. A method for scalable predictive failure analysis, the method comprising:
- gathering memory information for memory on a user computer system having at least one processor;
- selecting one or more memory-related parameters;
- calculating, based on the gathering and the selecting, a single bit error value for the scalable predictive failure analysis through calculations for each of the one or more memory-related parameters that utilize the memory information; and
- setting, based on the calculating, the single bit error value for the user computer system.
2. The method of claim 1, further comprising detecting, subsequent to the setting, one or more single bit errors for the memory.
3. The method of claim 1, further comprising comparing, subsequent to the setting, a counted number of single bit errors for the memory to the value.
4. The method of claim 1, further comprising alerting, subsequent to the setting, if a counted number of single bit errors for the memory at least equals the single bit error value.
5. The method of claim 1, further comprising returning to sleep, subsequent to the setting, if a counted number of single bit errors for the memory fails to exceed the single bit error value.
6. The method of claim 1, further comprising re-setting, according to the method, the single bit error value for the user computer system upon a memory replacement.
7. The method of claim 1, further comprising reporting the single bit error value and any results from the method on a display associated with the user computer system.
8. A computer program product for scalable predictive failure analysis, the computer program product comprising:
- a computer readable storage device;
- first program instructions to gather memory information for memory on a user computer system having at least one processor;
- second program instructions to select one or more memory-related parameters;
- third program instructions to calculate based on the gather and the select, a single bit error value for the scalable predictive failure analysis through calculations for each of the one or more memory-related parameters that utilize the memory information;
- fourth program instructions to set, based on the calculate, the single bit error value for the user computer system; and
- wherein the first, second, third, and fourth program instructions are stored on the computer readable storage device.
9. The computer program product of claim 8, further comprising fifth program instructions to detect, subsequent to the set, one or more single bit errors for the memory; and wherein the fifth program instructions are stored on the computer readable storage device.
10. The computer program product of claim 8, further comprising fifth program instructions to compare, subsequent to the set, a counted number of single bit errors for the memory to the value; and wherein the fifth program instructions are stored on the computer readable storage device.
11. The computer program product of claim 8, further comprising fifth program instructions to alert, subsequent to the set, if a counted number of single bit errors for the memory at least equals the single bit error value; and wherein the fifth program instructions are stored on the computer readable storage device.
12. The computer program product of claim 8, further comprising fifth program instructions to return to sleep, subsequent to the set, if a counted number of single bit errors for the memory fails to exceed the single bit error value; and wherein the fifth program instructions are stored on the computer readable storage device.
13. The computer program product of claim 8, further comprising fifth program instructions to re-set, according to the method, the single bit error value for the user computer system upon a memory replacement; and wherein the fifth program instructions are stored on the computer readable storage device.
14. A system for scalable predictive failure analysis, the system comprising:
- a processor, a computer readable memory and a computer readable storage device;
- first program instructions to gather memory information for memory on a user computer system having at least one processor;
- second program instructions to select one or more memory-related parameters;
- third program instructions to calculate based on the gather and the select, a single bit error value for the scalable predictive failure analysis through calculations for each of the one or more memory-related parameters that utilize the memory information;
- fourth program instructions to set, based on the calculate, the single bit error value for the user computer system; and
- wherein the first, second, third, and fourth program instructions are stored on the computer readable storage device for execution by the processor via the computer readable memory.
15. The system of claim 14, further comprising fifth program instructions to detect, subsequent to the set, one or more single bit errors for the memory; and wherein the fifth program instructions are stored on the computer readable storage device for execution by the processor via the computer readable memory.
16. The system of claim 14, further comprising fifth program instructions to compare, subsequent to the set, a counted number of single bit errors for the memory to the value; and wherein the fifth program instructions are stored on the computer readable storage device for execution by the processor via the computer readable memory.
17. The system of claim 14, further comprising fifth program instructions to alert, subsequent to the set, if a counted number of single bit errors for the memory at least equals the single bit error value; and wherein the fifth program instructions are stored on the computer readable storage device for execution by the processor via the computer readable memory.
18. The system of claim 14, further comprising fifth program instructions to return to sleep, subsequent to the setting, if a counted number of single bit errors for the memory fails to exceed the single bit error value; and wherein the fifth program instructions are stored on the computer readable storage device for execution by the processor via the computer readable memory.
19. The system of claim 14, further comprising fifth program instructions to re-set, according to the method, the single bit error value for the user computer system upon a memory replacement; and wherein the fifth program instructions are stored on the computer readable storage device for execution by the processor via the computer readable memory.
20. The system of claim 14, further comprising fifth program instructions to report the single bit error value and any results from the method on a display associated with the user computer system; and wherein the fifth program instructions are stored on the computer readable storage device.
Type: Application
Filed: Oct 26, 2010
Publication Date: Apr 26, 2012
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION (Armonk, NY)
Inventors: Tu T. Dang (Cary, NC), Michael C. Elles (Apex, NC), Juan Q. Hernandez (Garner, NC), Dwayne A. Lowe (Durham, NC), Challis L. Purrington (Raleigh, NC)
Application Number: 12/912,735
International Classification: G06F 11/00 (20060101);