Abstract: An apparatus includes at least one processing device comprising a processor coupled to a memory, with the processing device being configured to maintain at least first and second journals for respective first and second different types of input-output requests, to move one or more entries between the first journal and the second journal under one or more specified conditions, to perform a clean-up operation for at least one of the first and second journals in conjunction with the moving of the one or more entries, and responsive to a failure occurring during the clean-up operation, to execute a contention resolution algorithm to resolve logical address range lock contentions between different entries of the first and second journals. The processing device illustratively comprises a storage controller of a storage system. The storage system may be, for example, a source storage system configured to carry out a synchronous replication process with a target storage system.
Abstract: Data associated with a write request is stored at a storage device of multiple solid-state storage devices. A determination as to whether the data stored at the storage device is readable is made by determining whether a number of subsequent programming operations have been performed since the data was stored at the storage device. A notification that the stored data is readable from the storage device is generated upon determining that the data is readable.
Type:
Grant
Filed:
May 4, 2021
Date of Patent:
February 15, 2022
Assignee:
Pure Storage, Inc.
Inventors:
Gordon James Coleman, Andrew R. Bernat, Peter E. Kirkpatrick
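The readability check in the abstract above can be sketched roughly as follows. This is only an illustration, not the patented method: the class name, the `required_followups` parameter, and the idea of returning newly readable write IDs to the caller are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class ReadinessTracker:
    """Tracks when written data becomes readable on one storage device.

    `required_followups` is a hypothetical parameter: the number of later
    programming operations that must complete before earlier data is
    treated as readable.
    """
    required_followups: int
    program_count: int = 0
    pending: dict = field(default_factory=dict)  # write_id -> count at write time

    def program(self, write_id: str) -> list:
        """Record one programming operation; return writes that just became readable."""
        self.program_count += 1
        self.pending[write_id] = self.program_count
        ready = [w for w, at in self.pending.items()
                 if self.program_count - at >= self.required_followups]
        for w in ready:
            del self.pending[w]
        return ready  # the caller would generate one notification per entry
```

With `required_followups=2`, a write is reported as readable only after two further programming operations have been recorded.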
Abstract: Example techniques for obtaining data for identifying a fault are described. In response to receiving a fault message corresponding to a first device, a computing device determines a first set of data to be obtained for identifying the fault. The first set of data to be obtained is determined based on a workload of the computing device. The first set of data is then obtained.
Type:
Grant
Filed:
April 19, 2021
Date of Patent:
February 1, 2022
Assignee:
Hewlett Packard Enterprise Development LP
Abstract: Systems, methods, and computer-readable media are described for utilizing breakpoint value-based fingerprints of failing regression test cases to determine specific components of a System Under Test (SUT) that are causing a fault such as specific lines of source code. A failing test case from a regression run is selected and fault localization and inverse combinatorics techniques are employed to generate a set of failing test cases around the selected failing test case. A set of test fingerprints corresponding to the set of failing test cases is compared to a set of test fingerprints corresponding to a set of passing test cases to determine breakpoints that are indicated as being encountered during execution of at least one failing test case and that are not encountered during execution of any of the passing test cases. Specific lines of source code that correspond to these breakpoints are then identified as causing the fault.
Type:
Grant
Filed:
June 13, 2019
Date of Patent:
January 25, 2022
Assignee:
INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors:
Andrew Hicks, Dale E. Blue, Ryan Thomas Rawlins, Steven M. Partlow
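The fingerprint comparison in the abstract above amounts to a set difference. A minimal sketch, assuming each test fingerprint is represented as a set of breakpoint identifiers (the `file:line` strings and the function name are hypothetical):

```python
def suspicious_breakpoints(failing, passing):
    """Return breakpoints hit by at least one failing test and by no passing test.

    `failing` and `passing` are lists of fingerprints; each fingerprint is
    assumed to be a set of breakpoint identifiers.
    """
    hit_by_failing = set().union(*failing)
    hit_by_passing = set().union(*passing)
    return hit_by_failing - hit_by_passing


failing = [{"a.c:10", "a.c:20"}, {"a.c:10", "b.c:5"}]
passing = [{"a.c:10"}, {"b.c:7"}]
print(sorted(suspicious_breakpoints(failing, passing)))
```

The source lines behind the returned breakpoints would then be the candidates for the fault.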
Abstract: A fault recoverable computer system including an instruction table having a plurality of processor instructions. The system also includes at least one sensor arranged to monitor an environmental condition and output sensor data. A monitor module is arranged to receive sensor data and/or processor state information. A testing module is arranged to perform a plurality of self-tests including a first self-test of the computer system and, if the first self-test fails, output a failure notification. A recovery module is arranged to update the instruction table in response to receiving the failure notification. The update includes replacing a first processor instruction arranged to perform a first function with a replacement set of processor instructions configured to alternatively perform the first function.
Abstract: Feedback relating to errors in memory operations on a plurality of memory cells is received by a memory sub-system. At least one processing level corresponding to a program distribution is updated based on the feedback to adjust an error measure between pages of the plurality of memory cells and to adjust a read window budget within a page of the plurality of cells. The updating of the at least one processing level is based on information for the at least one processing level that is stored in a data-structure.
Type:
Grant
Filed:
November 21, 2020
Date of Patent:
December 28, 2021
Assignee:
Micron Technology, Inc.
Inventors:
Michael Sheperek, Bruce A. Liikanen, Larry J. Koudele, James P. Crowley, Stuart A. Bell
Abstract: There are provided a memory system and a method for operating the same. A memory system includes: a controller for queuing a plurality of commands and outputting control signals in response to the plurality of queued commands; and a memory device for performing a program operation in response to the control signals, wherein, when the program operation fails, the controller holds the plurality of queued commands.
Abstract: A system and method for configuring fault tolerance in nonvolatile memory (NVM) are operative to set a first threshold value, declare one or more portions of NVM invalid based on an error criterion, track the number of declared invalid NVM portions, determine if the tracked number exceeds the first threshold value, and if the tracked number exceeds the first threshold value, perform one or more remediation actions, such as issue a warning or prevent backup of volatile memory data in a hybrid memory system. In the event of backup failure, an extent of the backup can still be assessed by determining the amount of erased NVM that has remained erased after the backup, or by comparing a predicted backup end point with an actual endpoint.
Type:
Grant
Filed:
July 19, 2019
Date of Patent:
December 14, 2021
Assignee:
Netlist, Inc.
Inventors:
Scott H. Milton, Jeffrey C. Solomon, Kenneth S. Post
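The threshold check in the abstract above might look like the sketch below. The class and method names are hypothetical, and blocking the backup is just one of the remediation actions the abstract mentions.

```python
class InvalidBlockTracker:
    """Counts NVM portions declared invalid and compares against a threshold."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.invalid = set()  # IDs of portions declared invalid so far

    def declare_invalid(self, block_id: int) -> bool:
        """Mark a portion invalid; return whether backup is still allowed."""
        self.invalid.add(block_id)  # a set keeps repeated declarations from double counting
        return self.backup_allowed()

    def backup_allowed(self) -> bool:
        # Remediation (here: refusing backup) applies only once the count
        # exceeds the threshold, not when it merely equals it.
        return len(self.invalid) <= self.threshold
```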
Abstract: Topology and performance metrics of a storage system are monitored for anomalies. The storage system includes a set of disk array enclosures (DAEs) connected to a host server. Each DAE is chained to another DAE. Upon detecting an anomaly associated with a DAE, log collection is triggered to obtain logs from the DAE and logs in other DAEs upstream and downstream of the DAE.
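As a rough illustration of the log-collection scope described above, the sketch below picks the anomalous DAE plus its neighbors in the chain. The `hops` parameter is an assumption; the abstract does not say how far upstream and downstream the collection reaches.

```python
def daes_to_collect(chain, anomalous, hops=1):
    """Return the DAEs whose logs should be collected, in chain order.

    `chain` lists DAE names from the host server outward; `hops` is a
    hypothetical limit on how many neighbors to include on each side.
    """
    i = chain.index(anomalous)
    upstream = chain[max(0, i - hops):i]
    downstream = chain[i + 1:i + 1 + hops]
    return upstream + [anomalous] + downstream


chain = ["dae0", "dae1", "dae2", "dae3"]
print(daes_to_collect(chain, "dae1"))
```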
Abstract: An illustrative data storage management system is aware that certain data storage resources for storing/serving primary data operate in a partnered configuration. Illustrative components of the data storage management system analyze the failover status of the partnered primary data storage resources to determine which is currently serving/storing primary data and/or snapshots targeted for backup. When detecting that a first partnered primary data storage resource has failed over to a second primary data storage resource, the example storage manager changes the assignment of backup resources that are pre-administered for the targeted data. Accordingly, the example storage manager assigns backup resources, including at least one media agent, that are associated with the second primary data storage resource, and which are “closer” thereto from a geography and/or network topology perspective, even if the pre-administered backup resources are available for backup.
Abstract: Methods and systems for networked systems are provided. A reinforcement learning (RL) agent is deployed during runtime of a networked system having at least a first component and a second component. The RL agent detects a first degradation signal in response to an error associated with the first component and a second degradation signal from the second component, the second degradation signal generated in response to the error. The RL agent identifies from a learned data structure an action for fixing degradation, at both the first component and the second component; and continues to update the learned data structure, upon successful and unsuccessful attempts to fix degradation associated with the first component and the second component.
Abstract: An information handling system may include a processor and a basic input/output system comprising a program of instructions executable by the processor and configured to cause the processor to determine whether a captured stop error code, captured in connection with an operating system stop error occurring during a previous boot session of the information handling system, exists on a memory accessible to the basic input/output system and, responsive to the captured stop error code existing on the memory, read the captured stop error code and perform a remedial action based on the captured stop error code.
Type:
Grant
Filed:
February 3, 2020
Date of Patent:
October 26, 2021
Assignee:
Dell Products L.P.
Inventors:
Arifullah Syed Shah, Ibrahim Sayyed, Steven A. Downum
Abstract: A system of verifying execution sequence integrity of an execution flow includes a monitoring system in communication with one or more sensors of a system being monitored, where the monitoring system includes one or more electronic devices, and a computer-readable storage medium having one or more programming instructions. When executed, the one or more programming instructions cause at least one of the electronic devices to receive from the sensors, a parameter value for each of one or more parameters that pertain to an operational state of the system, combine the received parameters to generate a combination value, apply a hashing algorithm to the combination value to generate a temporary hash value, search a data store for a result code associated with the temporary hash value, and in response to the result code associated with the temporary hash value indicating that the temporary hash value is incorrect, generate a fault notification.
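The combine, hash, and look-up sequence in the abstract above could be sketched as follows. SHA-256, the `key=value` joining scheme, the dictionary used as the data store, and treating an unknown hash as incorrect are all assumptions made for illustration.

```python
import hashlib


def fingerprint(params: dict) -> str:
    """Combine parameter values in a fixed key order and hash the result."""
    combined = "|".join(f"{k}={params[k]}" for k in sorted(params))
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()


def check_state(params: dict, result_codes: dict):
    """Return "fault" if the hash maps to an incorrect result code, else None.

    Assumption: a hash that is missing from the store counts as incorrect.
    """
    code = result_codes.get(fingerprint(params), "incorrect")
    return "fault" if code == "incorrect" else None


good = {"valve": "open", "rpm": 1200}
store = {fingerprint(good): "correct"}
print(check_state(good, store))
print(check_state({"valve": "closed", "rpm": 1200}, store))
```

Sorting the keys keeps the combination value stable regardless of the order in which sensor readings arrive.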
Abstract: Methods, apparatuses, and systems related to a memory device are described. The memory device may include a non-volatile (NV) memory and a controller. The controller may be configured to predict a temperature of the NV memory based on a real-time temperature of the controller. Based on the predicted temperature of the NV memory, the controller may execute a remedial action to reduce an actual temperature of the NV memory for executing an upcoming operation.
Abstract: The present invention discloses a method and device for managing a storage system. Specifically, in one embodiment of the present invention, there is proposed a method for managing a storage system, the storage system comprising a buffer device and a plurality of storage devices. The method comprises: receiving an access request with respect to the storage system; determining that a storage device among the plurality of storage devices has failed; and in response to the access request being an access request with respect to the failed storage device, serving the access request with data in the buffer device so as to reduce internal data access in the storage system. In one embodiment of the present invention, there is proposed a device for managing a storage system.
Abstract: A backup control method is proposed to include: (A) two control units executing firmware such that the control units respectively operate in a master mode and a slave mode; (B) the control unit that operates in the master mode generating a health signal when executing the firmware; (C) a logic arithmetic unit determining, based on the health signal, whether the control unit that operates in the master mode functions normally; and (D) when the control unit that operates in the master mode is determined to not function normally, the logic arithmetic unit controlling a light emitting element to emit light, and notifying the control unit that operates in the slave mode such that the control unit which operates in the slave mode enters the master mode.
Abstract: A first computer system identifies a failure case from collected information, the collected information corresponding to one or more functions of a second system. The first computer system analyzes the failure case to determine a failure pattern. The first computer system determines whether the determined failure pattern corresponds to a stored failure pattern of a plurality of stored failure patterns in a database. In response to determining that the determined failure pattern corresponds to the stored failure pattern, the first computer system determines a remediation plan corresponding to the stored failure pattern and utilizes the remediation plan to automatically remediate the failure case.
Type:
Grant
Filed:
October 31, 2017
Date of Patent:
September 21, 2021
Assignee:
PayPal, Inc.
Inventors:
Rani Fields, Laxmikant Sharma, Scott Donald Sivi, Subhadra Tatavarti
Abstract: A computer system includes a circuit board, one or more connectors/sockets and a first controller. The connectors/sockets are disposed on the circuit board. The first controller is configured to receive information corresponding to parameters of the circuit board and/or the connectors/sockets before booting up the computer system to run an operating system (OS).
Abstract: An exemplary method for analyzing an error log file includes receiving an error log file at a processor, the error log file including a plurality of distinct entries, determining a token pattern of each entry by tokenizing each of the distinct entries, and grouping the plurality of distinct entries into groups having similar token patterns.
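A minimal sketch of the tokenize-and-group idea above. As a simplification, it groups entries whose token patterns are identical rather than merely similar; the `<NUM>` and `<HEX>` placeholders and both function names are hypothetical.

```python
import re
from collections import defaultdict


def token_pattern(entry: str) -> tuple:
    """Tokenize on whitespace, replacing variable fields with placeholders."""
    tokens = []
    for word in entry.split():
        if re.fullmatch(r"0x[0-9a-fA-F]+", word):
            tokens.append("<HEX>")
        elif re.fullmatch(r"\d+", word):
            tokens.append("<NUM>")
        else:
            tokens.append(word)
    return tuple(tokens)


def group_entries(entries):
    """Group log entries that share the same token pattern."""
    groups = defaultdict(list)
    for entry in entries:
        groups[token_pattern(entry)].append(entry)
    return dict(groups)


entries = ["disk 3 failed at 0x1f", "disk 7 failed at 0xA0", "fan 2 stalled"]
for pattern, members in group_entries(entries).items():
    print(" ".join(pattern), "->", len(members))
```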
Abstract: Systems and methods are disclosed herein for monitoring, detecting, and mitigating hardware and software failures. An error detection module monitors the execution of software processes and detects failures of the monitored processes. The error detection module may monitor reboot events and correlate reboot events with failures of the monitored software processes. If a monitored process fails, the error detection module may log the failure and its cause. If the same process has failed numerous times, causing the user device to experience a reboot loop, remedial action may be taken based on the cause of the failure.
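The reboot-loop logic above can be illustrated with a small counter. This is a sketch under stated assumptions: the `loop_threshold` value, the class name, and the `remediate:<cause>` return string are hypothetical stand-ins for whatever remedial action a real system would take.

```python
from collections import Counter


class FailureLog:
    """Logs process failures and flags a probable reboot loop."""

    def __init__(self, loop_threshold: int = 3):
        self.loop_threshold = loop_threshold
        self.failures = Counter()  # process -> failures correlated with a reboot
        self.causes = {}           # process -> most recent logged cause

    def record(self, process: str, cause: str, followed_by_reboot: bool):
        """Log one failure; return a remediation hint once the threshold is hit."""
        if followed_by_reboot:
            self.failures[process] += 1
        self.causes[process] = cause
        if self.failures[process] >= self.loop_threshold:
            return f"remediate:{cause}"
        return None
```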