Functional interrupt mitigation for fault tolerant computer
A new method for the detection and correction of environmentally induced functional interrupts (or “hangs”) in computers or microprocessors, caused by external sources of single event upsets (SEU) which propagate into the internal control functions, or circuits, of the microprocessor. This method is named Hardened Core (or H-Core) and is based upon the addition of an environmentally hardened circuit to the computer system, connected to the microprocessor to provide monitoring and to interrupt or reset the microprocessor when a functional interrupt occurs. The Hardened Core method can be combined with another method for the detection and correction of single bit errors or faults induced in a computer or microprocessor by external sources of SEUs. This method is named Time-Triple Modular Redundancy (TTMR) and is based upon the idea that very long instruction word (VLIW) style microprocessors provide externally controllable parallel computing elements which can be used to combine time redundant and spatially redundant fault error detection and correction techniques. This method is completed in a single microprocessor, which substitutes for the traditional multi-processor redundancy techniques, such as Triple Modular Redundancy (TMR).
This application is a reissue of U.S. Pat. No. 7,237,148 B2, issued Jun. 26, 2007. This application claims priority to U.S. Provisional Patent No. 60/408,205, filed on Sep. 5, 2002, entitled “Functional Interrupt Mitigation for Fault Tolerant Computer,” naming David Czajkowski as first named inventor and Darrell Sellers as second named inventor, which is hereby incorporated by reference in its entirety.
BACKGROUND OF THE INVENTION
During use, microprocessors may be exposed to external conditions which may cause internal data bits within or being processed by the microprocessor to change. Commonly, these events are classified as single event upsets (SEU). Conditions giving rise to SEU may include ambient radiation (including protons, x-rays, neutrons, cosmic rays, electrons, alpha particles, etc.), electrical noise (including voltage spikes, electromagnetic interference, wireless high frequency signals, etc.), and/or improper sequencing of electronic signals or other similar events. The effects of SEU conditions can include the processing of incorrect data, or the microprocessor may temporarily or permanently hang, which may be referred to as a single event functional interrupt (SEFI).
A number of solutions to avoid or correct for these events have been developed, and include modifying the manufacturing process for the microprocessor. For example, microprocessors may utilize temporal redundancy or spatial redundancy in an effort to mitigate the likelihood of SEUs. While these systems have proven somewhat effective in reducing or avoiding SEU and SEFI events, several shortcomings have been identified. For example, using spatial redundancy in a triple modular redundant design allows three microprocessors to operate in parallel to detect and correct for single event upsets and functional interrupts, but requires two additional microprocessors and support circuits (e.g., memory), causing additional power and synchronization problems. Another solution is to manufacture the microprocessor integrated circuits (IC) on radiation tolerant processes, which historically lag commercial devices by two to three generations. More specifically, today's radiation-tolerant IC production processes produce devices utilizing 0.35 micrometer geometries while non-radiation tolerant devices typically utilize 0.13 micrometer geometries. The effect of the larger geometry is much slower performance and higher power consumption for the microprocessor.
In light of the foregoing, there is an ongoing need for high performance, low power consumption radiation tolerant systems and devices that mitigate the problem of single event functional interrupt (SEFI), also known as environmentally induced hangs.
BRIEF SUMMARY OF THE INVENTION
The present application discloses fault tolerant circuits and companion software routines for use in computer systems, and methods of use. In one embodiment, a computer system with improved fault tolerance from microprocessor hangs is disclosed and includes a microprocessor, a fault tolerant software maintenance routine configured to send a periodic output signal from the microprocessor to a separate circuit (termed a “Hardened Core” or “H-Core”) in communication with the microprocessor, the Hardened Core circuit configured to monitor the periodic signal, the control lines (reset, non-maskable interrupt, interrupts, etc.) of the microprocessor wired through the Hardened Core circuit in a manner that allows the Hardened Core to selectively and sequentially activate each control line when the periodic signal from the microprocessor is not received on the periodic schedule, and a set of software repair routines comprised of known instructions which provide a stop to all existing microprocessor instructions and force a controlled restart, where the repair routines are operational at the control line interrupt vector memory addresses of the microprocessor.
In another embodiment, a computer system with improved fault tolerance from microprocessor hangs is disclosed and includes a microprocessor, a fault tolerant software maintenance routine configured to send a periodic output signal from the microprocessor to a separate circuit (termed “Hardened Core with Power Cycle”) in communication with the microprocessor, the Hardened Core with Power Cycle configured to monitor the periodic signal, the control lines (reset, non-maskable interrupt, interrupts, etc.) of the microprocessor wired through the Hardened Core with Power Cycle circuit in a manner that allows the Hardened Core with Power Cycle circuit to selectively and sequentially activate each control line when periodic signal from microprocessor is not received on a periodic schedule, the power supply lines of the microprocessor wired through the Hardened Core with Power Cycle circuit in a manner that allows the Hardened Core with Power Cycle circuit to selectively turn off and then on the power supply lines when the periodic signal from the microprocessor is not received on a periodic schedule, and a set of software repair routines comprised of known instructions which provide a stop to all existing microprocessor instructions and force a controlled restart, where repair routines are operational at the control line interrupt vector memory addresses of the microprocessor.
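The monitoring behavior described in the two embodiments above can be illustrated with a small software model. The following Python sketch is only an illustration under stated assumptions: the class name `HardenedCoreModel`, its methods, the line names, and the ordering of the control lines from least to most disruptive (with a power cycle as the last step) are hypothetical and are not taken from the disclosure, which describes a hardware circuit and does not fix a particular order.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HardenedCoreModel:
    """Toy model of a heartbeat monitor with escalating recovery.

    All names here are illustrative assumptions, not part of the disclosure.
    """
    period: float                  # expected heartbeat interval, in seconds
    control_lines: List[str]       # assumed order: least to most disruptive
    last_heartbeat: float = 0.0
    next_line: int = 0             # index of the next control line to activate
    log: List[str] = field(default_factory=list)

    def heartbeat(self, now: float) -> None:
        # Stands in for the periodic signal sent by the maintenance routine.
        self.last_heartbeat = now
        self.next_line = 0         # a healthy processor resets the escalation

    def poll(self, now: float) -> Optional[str]:
        # Returns the control line activated at time `now`, or None if the
        # periodic signal arrived within the expected period.
        if now - self.last_heartbeat <= self.period:
            return None
        index = min(self.next_line, len(self.control_lines) - 1)
        line = self.control_lines[index]
        self.next_line += 1
        self.last_heartbeat = now  # allow one more period for recovery
        self.log.append(line)
        return line


if __name__ == "__main__":
    core = HardenedCoreModel(
        period=1.0,
        control_lines=["interrupt", "nmi", "reset", "power_cycle"],
    )
    core.heartbeat(0.5)
    for t in (1.0, 2.0, 3.5, 5.0, 6.5):
        print(t, core.poll(t))
```

In this model each missed period activates the next line in the list, and a received heartbeat returns the monitor to the start of the list. Omitting `"power_cycle"` from `control_lines` models the first embodiment, which has no power supply control.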
In another embodiment, a software and hardware computer system with improved fault tolerance from microprocessor data errors and microprocessor hangs is disclosed and includes a very long instruction word microprocessor, a fault tolerant software routine comprising a first instruction and a second instruction, each inserted into two spatially separate functional computational units in the VLIW microprocessor at two different clock cycles and stored in a memory device in communication with the microprocessor, the first and second instructions being identical, a software instruction to compare the first and second instructions in the memory device in communication with a VLIW microprocessor compare or branch units, and configured to perform an action if the first and second instructions match, the fault tolerant software routine comprising a third instruction inserted into a third spatially separate functional computational unit in the VLIW microprocessor at a third different clock cycle and stored in a third memory device in communication with the microprocessor, the first, second, and third instructions being identical, and the software instruction to compare the first, second, and third instructions in the memory devices in communication with a VLIW microprocessor compare or branch units, and configured to perform an action if any of the first, second and third instructions match; plus a fault tolerant software maintenance routine configured to send a periodic output signal from the VLIW microprocessor to a separate circuit (termed “Hardened Core”) in communication with the VLIW microprocessor, the Hardened Core circuit configured to monitor the periodic signal, the control lines (reset, non-maskable interrupt, interrupts, etc.) of the microprocessor wired through the Hardened Core circuit in a manner that allows the Hardened Core to selectively and sequentially activate each control signal when the periodic signal from the microprocessor is not received on the periodic schedule, and a set of software repair routines comprised of known instructions which provide a stop to all VLIW microprocessor instructions and force a controlled restart, where the repair routines are operational at the control line interrupt vector memory addresses of the VLIW microprocessor.
The Hardened Core system disclosed herein is a fault detection and correction system capable of being implemented with any microprocessor. In one embodiment, the microprocessor control signals, typically reset(s) and interrupt(s), are electrically connected through the Hardened Core circuit, wherein the signals are activated when the Hardened Core circuit does not receive a periodic timer signal from the microprocessor, which is generated by software routine(s) in the microprocessor software.
In alternate embodiments, the Hardened Core circuit 200 may include an application specific integrated circuit (ASIC) or other electronic circuit implementation.
Another embodiment is the combination of a Time-Triple Modular Redundancy (TTMR) system (disclosed herein), providing single bit error detection and correction in the microprocessor, with a Hardened Core system providing functional interrupt fault recovery. The TTMR system is capable of being implemented in very long instruction word (VLIW) microprocessors. In one embodiment, the VLIW microprocessor includes specialized software routines known as “ultra long instruction word” and/or “software controlled instruction level parallelism.” These software routines include parallel functional units configured to execute instructions simultaneously wherein the instruction scheduling decisions are moved to the software compiler. The TTMR system combines time redundant and spatially redundant (including TMR and/or Master/Shadow architectures) instruction routines together on a single VLIW microprocessor.
Referring again to
At a later clock cycle or time interval T3, a compare instruction 616 is then sent from the software controller unit 600 to the branch or compare unit 618 within or in communication with the CPU 602. Exemplary branch or compare units 618 may include, without limitation, at least one comparator in communication with the CPU 602. The branch or compare unit 618 accesses and compares the two instructions retained within the memory devices in communication with arithmetic logic units 608, 612, respectively. If the two instructions stored within the memory devices in communication with the arithmetic logic units 608, 612 match, no error has occurred and the instruction is accepted and performed. If a discrepancy is detected between the instructions 606, 610, respectively, stored within the memory devices in communication with the arithmetic logic units 608, 612, a third instruction 620 is sent from the software controller unit 600 to a third arithmetic logic unit 622 within or in communication with the CPU 602 and retained within a third memory device in communication therewith. The third instruction 620 is sent from the software controller unit 600 to the third arithmetic logic unit 622 at a later clock cycle or time interval T4 as compared with time interval T3. The instructions 606, 610, 620, respectively, are identical instructions sent at different time intervals, T1, T2, T4, respectively. Those skilled in the art will appreciate that any number of instructions greater than 1 may be sent from the software controller unit 600 to the CPU 602, thereby permitting a comparison of instructions to occur within the CPU 602. The instructions stored within the memory devices in communication with the respective arithmetic logic units 608, 612, 622 are compared and any match therein is assumed to be a correct instruction; thereafter, the instruction may be performed.
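The compare-then-vote sequence just described can be sketched in software. The Python example below is a simplified illustration, not the disclosed VLIW implementation: each "functional unit" is modeled as a hypothetical callable returning a value, the comparison is made on those returned values, and the function name `ttmr_execute` and the exception `UnrecoverableFault` are assumptions introduced here. The text above does not say what happens when none of the three copies match, so raising an exception in that case is also an assumption.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


class UnrecoverableFault(Exception):
    """Raised when no two of the three results agree (assumed behavior)."""


def ttmr_execute(units: List[Callable[[], T]]) -> T:
    """Run the same operation on three modeled functional units and vote.

    The third unit is only used when the first two results disagree.
    """
    if len(units) != 3:
        raise ValueError("this sketch expects exactly three functional units")

    first = units[0]()     # time interval T1, first unit
    second = units[1]()    # time interval T2, second unit

    if first == second:    # compare step at T3: results agree, accept
        return first

    third = units[2]()     # time interval T4, third unit, only on mismatch
    if third == first or third == second:
        return third       # any match is assumed to be correct

    raise UnrecoverableFault("all three results differ")


if __name__ == "__main__":
    healthy = lambda: 6 * 7
    upset = lambda: (6 * 7) ^ 1   # models a single flipped bit in one copy
    print(ttmr_execute([healthy, upset, healthy]))
```

Running the third copy only after a mismatch keeps the common, error-free path at two executions and one comparison, which is the time-redundant part of the scheme; using distinct units for each copy is the spatially redundant part.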
Like the previous embodiment, the TTMR system disclosed herein permits a second instruction 630 and a third instruction 640 to be completed in parallel with the first instruction 606 when three or more parallel functional units are available.
Implementation and control of the TTMR system takes place through software control of the VLIW microprocessor. TTMR software code can be developed using a variety of methods, which are dependent upon the individual microprocessor development environment and operating system(s). As shown in
In the combined embodiment, the TTMR system may include or otherwise incorporate a Hardened Core system, where the microprocessor 104 of
Claims
1. A computer system with improved tolerance to microprocessor functional interrupts induced by environmental sources, comprising: a microprocessor not required to be radiation hardened; an array of memory, volatile or non-volatile, connected to said microprocessor; a hardened core circuit, designed to withstand environmentally induced faults, and connected to said microprocessor, in a manner allowing for said microprocessor's interrupt control, reset control, data bus, and address bus signals to connect to said hardened core circuit, and for said hardened core's status, interrupt(s) output, reset output(s) and/or power cycle output signals to connect to said microprocessor; a microprocessor software routine configured to send a predetermined timer signal from the microprocessor to the said hardened core circuit on a predetermined time period; a hardened core circuit function configured to read the predetermined timer signal from said microprocessor on the predetermined time period and activate said microprocessor's interrupt and or reset control input signals if timer signal is not received within the predetermined time period to provide for removal of said microprocessor from functionally interrupted state; a microprocessor software routine located at said microprocessor's interrupt or reset vector addresses, configured to restart the microprocessor's application software.
2. The system of claim 1 further comprising a microprocessor software routine configured to send maintenance data to the microprocessor memory prior to functional interrupt and configured to read said maintenance data from the microprocessor memory after microprocessor's removal from functionally interrupted state and use maintenance data to restart microprocessor's application software routines.
3. The system of claim 2 further comprising a microprocessor software routine configured to read said hardened core status signal(s), and to determine if interrupt or reset activation was a result of hardened core activation and then restart application software routines, or normal interrupt or reset and then continue with normal application software operation.
4. The system of claim 3 further comprising a microprocessor software routine configured to halt all currently operating application software threads.
5. The system of claim 4 further comprising a microprocessor software routine configured to read hardened core status signal(s), and to determine if multiple functional interrupts occurred within the predetermined time period and then to restart all microprocessor software and hardware if multiple functional interrupts occurred within the predetermined time period, or, if single functional interrupt occurred in the predetermined time period then to read maintenance data stored in said memory and provide a controlled restart of selected application software.
6. A computer system with improved fault tolerance from microprocessor, data errors and functional interrupts, comprising: a microprocessor not required to be radiation hardened; an array of memory, volatile or non-volatile, connected to said microprocessor; a fault tolerant software routine configured to send a first instruction and at least a second instruction to the microprocessor, the first and at least the second instructions being identical and being inserted into spatially separated functional computational units of the microprocessor at different clock cycles; a first and at least a second memory device in communication with the microprocessor, the first memory device configured to store the first instruction, the second memory device configured to store at least the second instruction; a software instruction to compare the first instruction to at least the second instruction; a comparator to compare the first instruction to the second instruction; a hardened core circuit, designed to withstand environmentally induced faults, and connected to said microprocessor, in a manner allowing for said microprocessor's interrupt control, reset control, data bus, and address bus signals to connect to said hardened core circuit, and for said hardened core's status, interrupt output(s), reset output(s) and/or power cycle output signals to connect to said microprocessor; a microprocessor software routine configured to send a predetermined timer signal from the microprocessor to the said hardened core circuit on a predetermined time period; a hardened core circuit function configured to read the predetermined timer signal from said microprocessor in the predetermined time period and activate said microprocessor's interrupt and or reset control input signals if the timer signal is not received within the predetermined time period to provide for removal of said microprocessor from a functionally interrupted state; and a microprocessor software routine located at said microprocessor's interrupt or reset vector addresses, configured to restart the microprocessor's application software.
7. The system of claim 6 further comprising a third instruction sent by the fault tolerant software routine to the microprocessor, the third instruction stored in a third memory device in communication with the microprocessor.
8. The system of claim 7 wherein the software instruction directs the comparator to compare the first, second, and third instruction.
9. The system of claim 8 wherein a match of any of the first, second, and third instructions is accepted by the microprocessor.
10. The system of claim 6 wherein the microprocessor comprises a very long instruction word (VLIW) microprocessor.
11. A software and hardware computer system with improved fault tolerance from microprocessor data errors and functional interrupts, comprising: a very long instruction word (VLIW) microprocessor not required to be radiation hardened; an array of memory, volatile or non-volatile, connected to said microprocessor; a fault tolerant software routine comprising a first instruction and a second instruction, each inserted into two spatially separate functional computational units in the VLIW microprocessor at two different clock cycles and stored in a memory device in communication with the microprocessor, the first and second instructions being identical; a software instruction to compare the first and second instructions in the memory device in communication with a VLIW microprocessor compare or branch units, and configured to perform an action if the first and second instructions match, the fault tolerant software routine comprising a third instruction inserted into a third spatially separate functional computational unit in the VLIW microprocessor at a third different clock cycle and stored in a third memory device in communication with the microprocessor, the first, second, and third instructions being identical; the software instruction to compare the first, second, and third instructions in the memory devices in communication with a VLIW microprocessor compare or branch units, and configured to perform an action if any of the first, second and third instructions match; a hardened core circuit, designed to withstand environmentally induced faults, and connected to said microprocessor, in a manner allowing for said microprocessor's interrupt control, reset control, data bus, and address bus signals to connect to said hardened core circuit, and for said hardened core's status, interrupt output(s), reset output(s) and/or power cycle output signals to connect to said microprocessor; a microprocessor software routine configured to send a predetermined timer signal from the microprocessor to the said hardened core circuit on a predetermined time period; a hardened core circuit function configured to read the predetermined timer signal from said microprocessor in the predetermined time period and activate said microprocessor's interrupt and or reset control input signals if the timer signal is not received within the predetermined time period to provide for removal of said microprocessor from functionally interrupted state; and a microprocessor software routine located at said microprocessor's interrupt or reset vector addresses, configured to restart the microprocessor's application software.
12. A computer system with improved tolerance to microprocessor functional interrupts induced by environmental sources, comprising: a microprocessor not required to be radiation hardened; an array of memory, volatile or non-volatile, connected to said microprocessor; a hardened core circuit, designed to withstand environmentally induced faults, and connected to said microprocessor, in a manner allowing for said microprocessor's interrupt control, reset control, data bus, and address bus signals to connect to said hardened core circuit, and for said hardened core's status, interrupt output and power cycle output signals to connect to said microprocessor; a microprocessor software routine configured to send a predetermined timer signal from the microprocessor to the said hardened core circuit on a predetermined time period; a hardened core circuit function configured to read the predetermined timer signal from said microprocessor on the predetermined time period and activate said microprocessor's interrupt and reset control input signals if timer signal is not received within the predetermined time period to provide for removal of said microprocessor from functionally interrupted state; a microprocessor software routine located at said microprocessor's interrupt or reset vector addresses, configured to restart the microprocessor's application software.
13. A computer system with improved fault tolerance from microprocessor, data errors and functional interrupts, comprising: a microprocessor not required to be radiation hardened; an array of memory, volatile or non-volatile, connected to said microprocessor; a fault tolerant software routine configured to send a first instruction and at least a second instruction to the microprocessor, the first and at least the second instructions being identical and being inserted into spatially separated functional computational units of the microprocessor at different clock cycles; a first and at least a second memory device in communication with the microprocessor, the first memory device configured to store the first instruction, the second memory device configured to store at least the second instruction; a software instruction to compare the first instruction to at least the second instruction; a comparator to compare the first instruction to the second instruction; a hardened core circuit, designed to withstand environmentally induced faults, and connected to said microprocessor, in a manner allowing for said microprocessor's interrupt control, reset control, data bus, and address bus signals to connect to said hardened core circuit, and for said hardened core's status, interrupt output and power cycle output signals to connect to said microprocessor; a microprocessor software routine configured to send a predetermined timer signal from the microprocessor to the said hardened core circuit on a predetermined time period; a hardened core circuit function configured to read the predetermined timer signal from said microprocessor in the predetermined time period and activate said microprocessor's interrupt and reset control input signals if the timer signal is not received within the predetermined time period to provide for removal of said microprocessor from a functionally interrupted state; and a microprocessor software routine located at said microprocessor's interrupt or reset vector addresses, configured to restart the microprocessor's application software.
14. A software and hardware computer system with improved fault tolerance from microprocessor data errors and functional interrupts, comprising: a very long instruction word (VLIW) microprocessor not required to be radiation hardened; an array of memory, volatile or non-volatile, connected to said microprocessor; a fault tolerant software routine comprising a first instruction and a second instruction, each inserted into two spatially separate functional computational units in the VLIW microprocessor at two different clock cycles and stored in a memory device in communication with the microprocessor, the first and second instructions being identical; a software instruction to compare the first and second instructions in the memory device in communication with a VLIW microprocessor compare or branch units, and configured to perform an action if the first and second instructions match, the fault tolerant software routine comprising a third instruction inserted into a third spatially separate functional computational unit in the VLIW microprocessor at a third different clock cycle and stored in a third memory device in communication with the microprocessor, the first, second, and third instructions being identical; the software instruction to compare the first, second, and third instructions in the memory devices in communication with a VLIW microprocessor compare or branch units, and configured to perform an action if any of the first, second and third instructions match; a hardened core circuit, designed to withstand environmentally induced faults, and connected to said microprocessor, in a manner allowing for said microprocessor's interrupt control, reset control, data bus, and address bus signals to connect to said hardened core circuit, and for said hardened core's status, interrupt output and power cycle output signals to connect to said microprocessor; a microprocessor software routine configured to send a predetermined timer signal from the microprocessor to the said hardened core circuit on a predetermined time period; a hardened core circuit function configured to read the predetermined timer signal from said microprocessor in the predetermined time period and activate said microprocessor's interrupt and reset control input signals if the timer signal is not received within the predetermined time period to provide for removal of said microprocessor from functionally interrupted state; and a microprocessor software routine located at said microprocessor's interrupt or reset vector addresses, configured to restart the microprocessor's application software.
15. A computer system with improved tolerance to microprocessor functional interrupts induced by environmental sources, comprising: a microprocessor not required to be radiation hardened; said microprocessor further comprising power supply lines, a power cycle control unit coupled to said microprocessor power supply lines to selectively provide for removal and return of power to said microprocessor, an array of memory, volatile or non-volatile, connected to said microprocessor; a hardened core circuit, designed to withstand environmentally induced faults, and connected to said microprocessor, in a manner allowing for said microprocessor's interrupt control, reset control, data bus, and address bus signals to connect to said hardened core circuit, and for said hardened core's status, interrupt output and power cycle output signals to connect to said microprocessor; a microprocessor software routine configured to send a predetermined timer signal from the microprocessor to the said hardened core circuit on a predetermined time period; a hardened core circuit function configured to read the predetermined timer signal from said microprocessor on the predetermined time period and activate said microprocessor's interrupt or reset control input signals if timer signal is not received within the predetermined time period to provide for removal of said microprocessor from functionally interrupted state and also to generate an activation signal to said power cycle control unit to remove and return power to said microprocessor; a microprocessor software routine located at said microprocessor's interrupt or reset vector addresses, configured to restart the microprocessor's application software.
4132975 | January 2, 1979 | Koike |
4199810 | April 22, 1980 | Gunckel et al. |
4670880 | June 2, 1987 | Jitsukawa et al. |
4817094 | March 28, 1989 | Lebizay et al. |
4943969 | July 24, 1990 | Criswell |
4956807 | September 11, 1990 | Hosaka et al. |
4959836 | September 25, 1990 | Berard et al. |
5233613 | August 3, 1993 | Allen et al. |
5235220 | August 10, 1993 | Takizawa |
5345583 | September 6, 1994 | Davis |
5414722 | May 9, 1995 | Tollum |
5594865 | January 14, 1997 | Saitoh |
5604755 | February 18, 1997 | Bertin et al. |
5706423 | January 6, 1998 | Sugimoto |
5822515 | October 13, 1998 | Baylocq |
5864663 | January 26, 1999 | Stolan |
6754846 | June 22, 2004 | Rasmussen et al. |
6901532 | May 31, 2005 | DeRuiter et al. |
7036059 | April 25, 2006 | Carmichael et al. |
7237148 | June 26, 2007 | Czajkowski et al. |
7318169 | January 8, 2008 | Czajkowski |
20040153747 | August 5, 2004 | Czajkowski |
20040250178 | December 9, 2004 | Munguia et al. |
20050055607 | March 10, 2005 | Czajkowski et al. |
20050138485 | June 23, 2005 | Osecky et al. |
20050172196 | August 4, 2005 | Osecky et al. |
2 903 614 | September 1982 | GB |
- Mukherjee et al.; “Detailed Design and Evaluation of Redundant Multithreading Alternatives”; International Conference on Computer Architecture, Proceedings of the 29th Annual International Symposium on Computer Architecture; published 2002; pp. 99-110.
- Reinhardt et al.; “Transient Fault Detection Via Simultaneous Multithreading”; International Conference on Computer Architecture, Proceedings of the 27th Annual International Symposium on Computer Architecture; published 2000; pp. 25-36.
- Specification SMD 5962-00538, Defense Supply Agency, Sep. 2000.
- Brochure: S950 3U cPCI Radiation Tolerant PowerPC SBC, Aitech Defense Systems, Inc., Chatsworth, CA, Publication No. S950T1208R23 (date not on brochure).
Type: Grant
Filed: Jun 24, 2009
Date of Patent: Apr 26, 2011
Assignee: Space Micro, Inc. (San Diego, CA)
Inventors: David R. Czajkowski (Encinitas, CA), Darrell Sellers (San Diego, CA)
Primary Examiner: Joshua A Lohn
Attorney: Continuum Law
Application Number: 04/590,147
International Classification: G06F 11/00 (20060101);