Method and apparatus for failover detection and recovery using gratuitous address resolution messages
An approach for efficient failover detection includes detecting an attempt by a first server to transition from a standby mode to an active mode, diagnosing a loss of connectivity to the first server in a control plane as a cause of the attempt, and transitioning to a standby mode based on the diagnosed cause of the attempt.
Consumer demand for Internet services has led to the widespread deployment of application services by service providers in nearly every industry. The growth of such application services has also led to the growth of increasingly complex hardware and software systems supporting a wide assortment of end-user devices and computing environments. Despite their increasingly complex and diverse computing needs, service providers must nevertheless satisfy consumer demand for reliable and continuous access to applications. Maintaining high levels of service availability has led to the widespread use of redundant application and server configurations in which an application service fails over to one or more standby servers if an active server fails. However, the failover process is susceptible to instability if a fault affects the control plane between the active and standby servers.
Based on the foregoing, there is a need for efficient failover detection and recovery.
Various exemplary embodiments are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings in which like reference numerals refer to similar elements and in which:
An apparatus, method, and software for efficient failover detection and recovery are described. In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It is apparent, however, to one skilled in the art that the present invention may be practiced without these specific details or with an equivalent arrangement. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the present invention.
Although the various exemplary embodiments are described with respect to efficient failover detection and recovery for a telecommunications service provider, it is contemplated that these embodiments have applicability to systems operated by different organizations and to other operations wherein application services are provided.
For illustrative purposes, networks 107-113 may be any suitable wireline and/or wireless network, and be managed by one or more service providers. For example, telephony network 107 may include a circuit-switched network, such as the public switched telephone network (PSTN), an integrated services digital network (ISDN), a private branch exchange (PBX), or other like network. Wireless network 111 may employ various technologies including, for example, code division multiple access (CDMA), enhanced data rates for global evolution (EDGE), general packet radio service (GPRS), mobile ad hoc network (MANET), global system for mobile communications (GSM), Internet protocol multimedia subsystem (IMS), universal mobile telecommunications system (UMTS), etc., as well as any other suitable wireless medium, e.g., microwave access (WiMAX), wireless fidelity (WiFi), satellite, and the like. Meanwhile, data network 113 may be any local area network (LAN), metropolitan area network (MAN), wide area network (WAN), the Internet, or any other suitable packet-switched network, such as a commercially owned, proprietary packet-switched network, such as a proprietary cable or fiber-optic network.
Although depicted as separate entities, networks 107-113 may be completely or partially contained within one another, or may embody one or more of the aforementioned infrastructures. For instance, service provider network 107 may embody circuit-switched and/or packet-switched networks that include facilities to provide for transport of circuit-switched and/or packet-based communications. It is further contemplated that networks 107-113 may include components and facilities to provide for signaling and/or bearer communications between the various components or facilities of system 100. In this manner, networks 107-113 may embody or include portions of a signaling system 7 (SS7) network, or other suitable infrastructure to support control and signaling functions.
According to exemplary embodiments, UE 101 may be utilized to communicate over system 100 and may include any computing device capable of sending and/or receiving information over one or more of networks 107-113. For instance, UE 101 may include mobile devices (or terminals) which may be any cellular phones, radiophones, satellite phones, smart phones, wireless phones, or any other suitable mobile devices, such as personal digital assistants (PDAs), pocket personal computers, tablets, customized hardware, etc. Further, computing devices may include Voice over Internet Protocol (VoIP) phones, skinny client control protocol (SCCP) phones, session initiation protocol (SIP) phones, Internet Protocol (IP) phones, personal computers, softphones, workstations, terminals, servers, etc.
As used herein, the term “application service” refers to the combination of servers, network devices, and server-side applications and operating system components that together constitute a software service. For instance, application service 105 is hosted at one or more servers 115a-115n (collectively referred to as servers 115). The components of application service 105 may include server hardware, networks interconnecting the servers to each other and to networks 107-113, application software running on servers 115, and operating system or virtualization software. In one embodiment, application service 105 is engineered to provide a fault-tolerant software system. Fault tolerance refers to the attribute of a system that provides continuous service in the presence of faults. Different fault tolerance techniques exist to circumvent various kinds of failures. Hardware fault tolerance may be achieved through the use of redundant elements (e.g., multiple servers); software fault tolerance may be achieved through error checking, duplication of data, and software design practices. To achieve hardware fault tolerance, one or more software elements may coordinate the fault detection and recovery phases. For example, middleware and distributed software components may be implemented at each of servers 115. Application service 105 may also rely heavily on replicated databases that are a part of or connected to servers 115. In one embodiment, middleware may ensure the synchronization of software execution states across one or more redundant servers.
It is contemplated that the server applications may be hosted, or otherwise programmed, using various application and/or server configurations to ensure a high level of availability. As used herein, the term “availability” or “high availability (HA)” may be used to refer to computer hardware or software systems that are capable of providing service most of the time. Availability may be quantified in terms of the number of “9s” or “nines.” A system (e.g., typical desktop or server) with “3 nines” has 99.9% availability, which roughly corresponds to nine hours of unavailability (also referred to herein as “downtime” or “outage”) per year. A system (e.g., enterprise server) with “4 nines” has 99.99% availability, which roughly corresponds to one hour of downtime per year. A system (e.g., carrier class server) with “5 nines” has 99.999% availability, which roughly corresponds to five minutes of downtime per year. Finally, a system (e.g., carrier switch equipment) with “6 nines” has 99.9999% availability, which roughly corresponds to thirty-one seconds of downtime per year. To ensure end-user satisfaction, application providers try to design systems such that the duration and frequency of outages are low enough for end-users not to perceive the outages as a problem. That is, application providers design system availability to ensure end-user perception of a high degree of service availability.
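The downtime figures above follow directly from the availability percentage. As a rough illustration only (the function name and the 365-day year are assumptions of this sketch, not part of the disclosure), the annual downtime for a given number of nines may be computed as follows:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: a non-leap, 365-day year


def annual_downtime_seconds(nines: int) -> float:
    """Return the yearly downtime, in seconds, permitted by `nines` nines.

    For example, nines=3 means 99.9% availability, so the service may be
    unavailable for a fraction 10**-3 of the year.
    """
    if nines < 1:
        raise ValueError("nines must be a positive integer")
    return SECONDS_PER_YEAR / 10 ** nines


if __name__ == "__main__":
    for n in (3, 4, 5, 6):
        seconds = annual_downtime_seconds(n)
        print(f"{n} nines: {seconds / 3600:.2f} h, {seconds / 60:.2f} min, {seconds:.1f} s")
```

Under these assumptions, three nines come to about 8.76 hours per year, four nines to about 52.6 minutes, five nines to about 5.26 minutes, and six nines to about 31.5 seconds, which matches the rounded figures given above.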
In one embodiment, fault tolerance may be provided by only using one of the servers in a cluster at one time. The remaining servers are not used until the primary server fails. Upon failure, the primary server is disabled or “fails over” and one of the other servers becomes active. As used herein, the terms “active” or “active mode” and “standby” or “standby mode” may be used to describe a primary server and one or more backup servers, respectively. An application service with only one active and one standby server may be described as “1+1” redundancy. An application service with multiple active and/or standby servers may be described as “N+M” redundancy, where N is the number of active servers and M is the number of standby servers. In one embodiment, an active server is one that is actively responding to incoming client requests. The active server performs synchronization and replication to the standby servers. Standby servers track or otherwise monitor the active server. In case of a failover, the standby server begins handling incoming client requests with minimal or no service disruption.
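The active/standby arrangement described above may be pictured with a small model. The following is a minimal sketch under stated assumptions: the class names, the "disabled" mode label, and the rule of promoting the first healthy standby are all hypothetical choices made for illustration and are not taken from the disclosure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Server:
    name: str
    mode: str = "standby"   # "active", "standby", or "disabled" (hypothetical labels)
    healthy: bool = True


@dataclass
class Cluster:
    servers: List[Server] = field(default_factory=list)

    def active(self) -> List[Server]:
        """Return the servers currently in active mode."""
        return [s for s in self.servers if s.mode == "active"]

    def fail(self, name: str) -> None:
        """Disable the named server and, if it was active, promote a standby."""
        failed = next(s for s in self.servers if s.name == name)
        was_active = failed.mode == "active"
        failed.healthy = False
        failed.mode = "disabled"
        if was_active:
            # Assumption: the first healthy standby in list order is promoted.
            for candidate in self.servers:
                if candidate.healthy and candidate.mode == "standby":
                    candidate.mode = "active"
                    break
```

For a 1+2 arrangement such as `Cluster([Server("115a", "active"), Server("115b"), Server("115c")])`, calling `fail("115a")` leaves server 115b as the only active server in this sketch.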
In one embodiment, server 115a operates in an active mode and servers 115b-115n are in standby mode. Thus, server 115a may be described as the active server and servers 115b-115n may be described as standby servers. If server 115a fails, one (or more) of servers 115b-115n may transition to an active mode. Once server 115a recovers from the failure, server 115a may become another standby server. Alternatively, server 115a may force a failover such that it again becomes the active server and servers 115b-115n revert to standby mode. As used herein, the term “failure” may refer to a deviation from expected behavior that, unless corrected, impacts the availability or quality (e.g., delay, appearance, etc.) of an application service. The term “fault” may be used to more particularly describe a cause of a failure. For instance, a fault may include technical issues such as equipment or device malfunctions.
As shown, servers 115 may be connected by one or more networks. In one embodiment, each server maintains two network interfaces. Network interfaces 117a-117n (collectively referred to as network interfaces 117) connect to a network 119 that handles customer traffic (also referred to herein as “data plane”). Network interfaces 121a-121n (collectively referred to as network interfaces 121) connect to network 123 that handles operation, administrative and management traffic (also referred to herein as “control plane”). For example, network 119 may be utilized by UE 101 to access application service 105 and network 123 may be utilized by servers 115 to exchange data/state replication, process synchronization, and system monitoring traffic. The data plane and control plane networks may be separate networks (as shown). Alternatively, in one embodiment, the data plane and control plane networks utilize the same physical network infrastructure, but may occupy different logical or virtual partitions of the infrastructure.
In one embodiment, networks 119 and 123 are fault-tolerant networks. For example, network 119 may employ an N+M redundancy built using switches 125a-125n (collectively referred to as switches 125). In case one switch (e.g., switch 125a) fails, one or more of switches 125b-125n may take over the switching functions. Network 123 may similarly be a fault-tolerant network employing extra redundancy switches 127a-127n (collectively referred to as switches 127). In one embodiment, networks 119 and 123 are high-speed gigabit Ethernet™ networks.
UE 101 may access application service 105 via a virtual network address shared by servers 115. The virtual network address is a publicly accessible address that resolves to the hardware address of the current active server. In one embodiment, the virtual network address is a public IP address (i.e., routable via networks 107-113) that resolves via an address resolution process to the MAC address of the active server (e.g., server 115a). For example, each switch maintains the hardware address information of the current active server in an address resolution cache that is updated whenever a server failover occurs. In one embodiment, servers 115 and switches 125, 127 may utilize a gratuitous address resolution message to update the hardware address associated with the virtual network address. For example, networks 119 and 123 may utilize the Address Resolution Protocol (ARP) to control hardware address caches. A gratuitous address resolution message may refer to an unsolicited message containing address resolution information. In one embodiment, the server attempting to transition to an active mode may broadcast or multicast such a message onto a network in order to force all devices connected to the network to update their cache information. For example, a standby server may broadcast a gratuitous ARP message when a failover occurs in order to cause client requests to be forwarded to the standby server after it has transitioned to active mode. The gratuitous ARP message forces switches 125 and 127 to remove an entry for the failed server and insert a new entry associating the virtual IP address with the hardware address of the standby server.
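To make the gratuitous ARP message described above more concrete, the following sketch assembles the bytes of one such frame for IPv4 over Ethernet. The helper names, the example addresses, and the choice of the request opcode are assumptions of this illustration; the disclosure does not specify a frame layout beyond the use of ARP. The sketch only builds the frame; transmitting it would require a raw socket and suitable privileges, which are omitted here.

```python
import socket
import struct

ETHERTYPE_ARP = 0x0806
BROADCAST_MAC = b"\xff" * 6


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes(int(part, 16) for part in mac.split(":"))


def build_gratuitous_arp(virtual_ip: str, new_active_mac: str) -> bytes:
    """Build an Ethernet frame carrying a gratuitous ARP request.

    In a gratuitous ARP, the sender and target protocol addresses are both
    the announced IP address, and the frame goes to the broadcast MAC.
    """
    src_mac = mac_to_bytes(new_active_mac)
    ip = socket.inet_aton(virtual_ip)

    ethernet_header = BROADCAST_MAC + src_mac + struct.pack("!H", ETHERTYPE_ARP)
    arp_payload = struct.pack(
        "!HHBBH6s4s6s4s",
        1,            # hardware type: Ethernet
        0x0800,       # protocol type: IPv4
        6,            # hardware address length
        4,            # protocol address length
        1,            # opcode: request (some stacks send opcode 2, reply, instead)
        src_mac,      # sender hardware address: the server becoming active
        ip,           # sender protocol address: the virtual IP address
        b"\x00" * 6,  # target hardware address: not meaningful here
        ip,           # target protocol address: the same virtual IP address
    )
    return ethernet_header + arp_payload
```

A call such as `build_gratuitous_arp("192.0.2.10", "02:00:00:00:00:0b")` (both values are placeholders) yields a 42-byte frame: a 14-byte Ethernet header followed by a 28-byte ARP payload.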
Following a failover, application service 105 migrates from the faulty server to one of the standby servers. In the process, the standby server becomes the new active server. In one embodiment, the application data and critical state information has been replicated across all servers 115 beforehand such that the failover is transparent to UE 101. The application data may be stored locally by each of the servers or on separate databases (not shown). In one embodiment, data replication/duplication, state preservation and synchronization activities are conducted via a control plane network (e.g., network 123). As described next in relation to
In one embodiment, server hardware 131 and operating system 133 include internetworking components (not shown) that allow servers 115 to share a virtual network address. As described earlier, client devices may access application service 105 at the virtual network address. For example, application service 105 may be accessed by UE 101 at a single public IP address. In one embodiment, the virtual network address is an IP address associated with the active server. Address resolution information shared by servers 115 and switches 125 associates the virtual network address with the hardware address of the active server so that all customer traffic is forwarded to the active server. If the active server fails, one of the standby servers transitions to an active mode and sends a gratuitous address resolution message to associate itself with the virtual network address.
High-availability middleware 135 may reside in each server to coordinate fault-tolerance activities across servers 115. In one embodiment, high-availability middleware 135 includes database management and state preservation functions. The middleware software may, for instance, contain direct interfaces to operating system 133 and to hardware devices (accessed through operating system 133 via device drivers). In one embodiment, high-availability middleware 135 makes decisions whether and when to replicate application data and state variables.
Applications 129 may or may not be aware of other instances of the application executing on other servers. For example, high-availability middleware 135 may transparently perform data and state replication. In one embodiment, high-availability middleware 135 may access the application data and state information on each server and ensure that the data and state information is identical. High-availability middleware 135 may also synchronize execution checkpoints and state variables of standby servers to the active server. Alternatively, the replication and synchronization activities may occur from an active server to one or more standby servers.
Although reference is made herein to a model of an application server in which the operating system runs directly on the platform hardware, it is contemplated that the operating system may also run within a virtualized environment. In one embodiment, applications 129, high-availability middleware 135, and operating system 133 may execute within a virtual machine (VM) running on a hypervisor or other virtualization software. It is contemplated that virtual application servers may take advantage of efficient failover and recovery as described in the present disclosure.
In one embodiment, application instances 137 on active server 139 drive the synchronization and replication with application instances 141 on standby server 143. Each application instance may perform some sort of data replication and state synchronization with its corresponding standby instance. Rapid synchronization and replication of databases and state information, for instance, may occur via a control plane. Replication may be synchronous or asynchronous. Synchronous replication, for instance, may be implemented by means of atomic data write operations (i.e., either a write occurs on both or neither of servers 139 and 143). Synchronous replication may be desirable to ensure zero data loss, but may lead to service disruptions if a control plane failure occurs. In another embodiment, asynchronous (i.e., write operation is considered complete as soon as local storage acknowledges it) or semi-synchronous (i.e., write operation is considered complete as soon as local storage acknowledges it and a standby server acknowledges that it has received the write into memory or log file) replication may be implemented. The present disclosure is independent of the particular form of replication utilized.
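The three replication modes described above differ only in which acknowledgements must arrive before a write is treated as complete. The following sketch captures that distinction; the enum, the function, and its boolean parameters are hypothetical names introduced for illustration.

```python
from enum import Enum


class ReplicationMode(Enum):
    SYNCHRONOUS = "synchronous"
    SEMI_SYNCHRONOUS = "semi-synchronous"
    ASYNCHRONOUS = "asynchronous"


def write_is_complete(mode: ReplicationMode,
                      local_ack: bool,
                      standby_received: bool,
                      standby_committed: bool) -> bool:
    """Decide whether a write may be reported as complete.

    local_ack:         local storage acknowledged the write
    standby_received:  the standby acknowledged receipt into memory or a log file
    standby_committed: the standby applied the write as well
    """
    if mode is ReplicationMode.ASYNCHRONOUS:
        return local_ack
    if mode is ReplicationMode.SEMI_SYNCHRONOUS:
        return local_ack and standby_received
    # Synchronous: the write must take effect on both servers or on neither.
    return local_ack and standby_committed
```

When the control plane is down, the standby acknowledgements never arrive, so in this sketch only asynchronous writes complete. This illustrates why synchronous replication may lead to service disruption during a control plane failure.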
The approach of system 100 stems, in part, from the recognition that there may be situations where an active server has not failed, but a standby server may mistake a failure affecting communication via the control plane for a failure of the active server. Such failures may create an undesirable feedback loop that causes the servers to alternately fail over without reaching a stable state. For example, a fault affecting network 123 may cause the active server (e.g., server 115a) to perform repeated failovers with a standby server (e.g., server 115b). Unable to communicate via the control plane, each server will transition to the active mode and broadcast a gratuitous address resolution message to take ownership of the virtual network address. In this manner, servers 115a and 115b may alternately control the virtual network address and neither server stably transitions to a standby mode. Furthermore, replication and synchronization of critical state information is interrupted during this period because neither server stably transitions to the active mode.
To address these problems, system 100 utilizes failover platform 145 to implement an efficient failover and recovery mechanism in case of a control plane fault. This may be achieved by a method including: detecting an attempt by a first server to transition from a standby mode to an active mode, diagnosing a loss of connectivity to the first server in the control plane as a cause of the attempt, and causing a second server to transition from an active mode to a standby mode based on the diagnosed cause of the attempt.
As used herein, “diagnosis” may be used to refer to an analysis of one or more events and system parameters to determine the nature and location of a reported fault. In one embodiment, the second server (e.g., server 115a) may determine the receipt of one or more gratuitous ARP messages broadcast by the first server (e.g., server 115b) as being related to a fault in the control plane (e.g., network 123) connecting the first and second servers. For example, the second server may determine that the first server is attempting to transition to the active mode because it no longer sees the second server via the control plane. In one embodiment, a device driver for a network interface reports that a network fault has occurred in the control plane. The fault may also interfere with one or more data/state replication processes, which also may report a fault. In one embodiment, the second server may transition to a standby mode and allow the first server to transition to an active mode.
As described above, the second server may receive one or more gratuitous address resolution messages from the first server via a data plane (e.g., data network 119). The data plane may refer to a high-speed LAN (e.g., gigabit Ethernet™) to which both the first and second servers are connected. In one embodiment, the data plane may include a highly reliable, fault tolerant fabric switching network infrastructure with redundant network links. Such a network infrastructure may employ high-speed LAN switches that can be rapidly re-programmed. For example, the receipt of a gratuitous ARP message may cause the switches and the network interfaces of servers 115 to clear an existing address resolution entry in an ARP cache and re-program it with a new entry based on the first server's MAC address.
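The cache update described above, in which an existing address resolution entry is cleared and re-programmed, may be sketched as a simple mapping from network address to hardware address. The class and method names below are hypothetical, and real switches and network interfaces implement this in hardware or firmware rather than in application code.

```python
from typing import Dict, Optional


class AddressResolutionCache:
    """Toy IP-to-MAC cache, as might be kept by a switch or a server interface."""

    def __init__(self) -> None:
        self._entries: Dict[str, str] = {}

    def resolve(self, ip: str) -> Optional[str]:
        """Return the hardware address currently cached for ip, if any."""
        return self._entries.get(ip)

    def on_gratuitous_arp(self, announced_ip: str, sender_mac: str) -> Optional[str]:
        """Replace any entry for announced_ip and return the old MAC, if any."""
        previous = self._entries.pop(announced_ip, None)  # clear the stale entry
        self._entries[announced_ip] = sender_mac          # re-program with the new owner
        return previous
```

After the first server announces itself, `resolve()` for the virtual network address returns the first server's MAC address, so subsequent customer traffic is forwarded to it.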
In one embodiment, the second server determines a failure to synchronize to one or more applications at the first server via the control plane. For instance, high-availability middleware 135 of the second server (e.g., server 115a) may unsuccessfully attempt to replicate application data and critical state information to the first server (e.g., server 115b). In one embodiment, high-availability middleware 135 of the second server (e.g., server 115a) may attempt to communicate with the first server (e.g., server 115b), but may receive an error message from operating system 133.
In one embodiment, the second server (e.g., server 115a) may utilize the information of the received gratuitous ARP message and the information indicating that replication of critical state information via the control plane was unsuccessful to determine that the first server (e.g., server 115b) is attempting to transition to the active state because of a failure in the control plane. The second server (e.g., server 115a) may then transition from its current active mode to a standby mode, thereby allowing the first server (e.g., server 115b) to stably transition to active mode. In one embodiment, the second server (e.g., server 115a) may remain in standby mode until it detects a resumption of connectivity in the control plane. The second server (e.g., server 115a) may then re-synchronize its application data and critical state information with the first server (e.g., server 115b). In one embodiment, the second server (e.g., server 115a) may force a failover to return to active mode.
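The behavior of the second server described above may be summarized as a small state machine. This is a minimal sketch, assuming hypothetical class, method, and attribute names; in particular, the `in_sync` flag merely stands in for the actual re-synchronization of application data and state, and the case of a gratuitous ARP message arriving without a replication failure is left unhandled because the description above does not address it.

```python
class SecondServer:
    """Sketch of the originally active server (e.g., server 115a)."""

    def __init__(self, revert_when_restored: bool = True) -> None:
        self.mode = "active"
        self.in_sync = True
        # Assumption: whether to force a failover back to active is a policy flag.
        self.revert_when_restored = revert_when_restored

    def on_events(self, garp_from_peer: bool, replication_failed: bool) -> None:
        """Yield to the peer only when both indications point at the control plane."""
        if self.mode == "active" and garp_from_peer and replication_failed:
            self.mode = "standby"
            self.in_sync = False  # replication was interrupted by the fault

    def on_control_plane_restored(self) -> None:
        """Re-synchronize first, then optionally force a failover back to active."""
        if self.mode != "standby":
            return
        self.in_sync = True  # stands in for re-synchronizing data and state
        if self.revert_when_restored:
            self.mode = "active"
```

Constructing the server with `revert_when_restored=False` models the alternative in which the second server simply remains a standby server after connectivity resumes.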
Control logic 201 provides the logic for executing one or more processes and storing information generated by the various modules. In one embodiment, identical instances of control logic 201 reside at each of servers 115 and are activated when the server transitions to an active mode. Alternatively, or in addition, control logic 201 may reside on a separate system connected to each of the servers 115 and networks 119 and 123. If a failure affects connectivity at the control plane, control logic 201 facilitates a stable transition of a standby server to an active mode. Once connectivity at the control plane resumes, control logic 201 may cause that server to revert to standby mode.
Communication module 203 may be utilized to transmit or receive messages via the data and control planes. In one embodiment, communication module 203 interacts with or includes components of operating system 133 that perform device driver functions for a server. Communication module 203 may, for instance, be utilized to send broadcast, multicast, or unicast messages via one or more network interfaces of a server. In one embodiment, communication module 203 may also be utilized to alert or report the receipt of messages to an application, operating system component, or middleware. For instance, communication module 203 may report the receipt of a gratuitous ARP message received from a standby server attempting to transition to an active mode. Control logic 201 may utilize the report to trigger other modules of failover platform 145 to take appropriate diagnostic or recovery actions.
Detection module 205 may be utilized to detect an attempt by a standby server to transition to an active mode. In one embodiment, detection module 205 determines that a standby server is attempting to transition from a standby to an active mode based on the receipt of one or more gratuitous ARP messages from the standby server. Detection module 205 may receive a report of such messages from communication module 203. In response, detection module 205 may interact with one or more other modules (e.g., diagnosis module 213) to help determine whether a feedback loop has been or will be created.
Synchronization module 207 may be utilized to synchronize the instances of one or more applications provided as a service at the active and standby servers. In one embodiment, synchronization module 207 performs replication of data and critical state information related to the operation of the applications. For instance, synchronization module 207 may perform either synchronous or asynchronous replication with respect to one or more databases connected to (or a part of) servers 115. In one embodiment, synchronization occurs via a control plane. For instance, synchronization module 207 may interact with communication module 203 to exchange replicated data and critical state information via network 123.
Address module 209 may be utilized to manage one or more virtual network addresses at which an application service (e.g., application service 105) is accessible. For instance, the virtual network addresses may be publicly routable IP addresses distributed via networks 107-113 to UE 101. In one embodiment, address module 209 manages the network-to-hardware address resolution of a virtual network address. For instance, address module 209 may cause communication module 203 to broadcast a gratuitous ARP message whenever the active server changes. A network device forwarding packets addressed to the virtual network address must accurately resolve the network address to the actual hardware of the active server. In one embodiment, address module 209 updates the hardware address the virtual network address is resolved to by broadcasting the updated information to switches 125 on network 119. Based on the information contained within the broadcast, switches 125 and other servers update their respective cached address information. For instance, the gratuitous ARP message may indicate a new hardware address to be associated with the virtual network address. Based on this information, a receiving server or network device may create or modify an entry in its routing/forwarding tables.
Transition module 211 may be utilized to change the state of a server or an application executing at a server from active to standby mode or vice versa. In one embodiment, transition module 211 is a part of a high-availability middleware that interfaces with the server operating system (e.g., operating system 133) and the hosted application (e.g., applications 129). In one embodiment, transition module 211 performs various state preservation functions for an instance of the application running on a server that is about to transition. In one embodiment, high-availability middleware 135 performs a coordinated failover of the application by synchronizing each of the instances of the application running on the various servers prior to transitioning. In cases where the transition is triggered by a control plane fault, transition module 211 may be limited to performing state preservation with respect to a local storage. For instance, critical application data and state information may be saved to local volatile or non-volatile media.
Diagnosis module 213 may be utilized to diagnose a fault in the control plane as a cause of an attempt by a standby server to transition to an active mode. In one embodiment, diagnosis module 213 receives an indication from communication module 203 that the control plane has suffered a fault. Alternatively, or in addition, diagnosis module 213 may receive notification from synchronization module 207 that a replication or state synchronization step has been unsuccessful because of a control plane failure. If diagnosis module 213 also learns of a standby server attempting to transition to an active mode, diagnosis module 213 may inferentially determine that the control plane fault reported by communication module 203 or synchronization module 207 is the cause of the attempted transition. In one embodiment, diagnosis module 213 may learn of the standby server's attempted transition based on receipt of one or more gratuitous address resolution messages from the standby server. In another embodiment, diagnosis module 213 may first learn of the receipt of one or more gratuitous address resolution messages before learning of a control plane fault. For example, diagnosis module 213 may proactively check whether any faults have been reported for the control plane upon receipt of a gratuitous address resolution message. If there are reports of issues affecting connectivity to the standby server that is transmitting the gratuitous address resolution messages, diagnosis module 213 may diagnose the cause of the attempt to be related to the control plane fault.
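As described above, the diagnosis does not depend on whether the control plane fault report or the gratuitous address resolution message arrives first. The following sketch illustrates one way such order-independent correlation might look; the class name, method names, and the returned label are hypothetical and are not drawn from the disclosure.

```python
from typing import Optional, Set


class DiagnosisSketch:
    """Correlates control plane fault reports with gratuitous ARP senders."""

    def __init__(self) -> None:
        self._unreachable_peers: Set[str] = set()  # peers with reported control plane faults
        self._garp_senders: Set[str] = set()       # standby servers seen sending GARP messages

    def report_control_plane_fault(self, peer: str) -> Optional[str]:
        """Record a fault affecting connectivity to peer, then try to diagnose."""
        self._unreachable_peers.add(peer)
        return self._diagnose(peer)

    def report_gratuitous_arp(self, sender: str) -> Optional[str]:
        """Record a GARP from sender, then check for already-reported faults."""
        self._garp_senders.add(sender)
        return self._diagnose(sender)

    def _diagnose(self, peer: str) -> Optional[str]:
        # Both indications must concern the same peer for the inference to hold.
        if peer in self._unreachable_peers and peer in self._garp_senders:
            return "control-plane-fault"
        return None
```

In this sketch, whichever report arrives second for a given standby server produces the diagnosis, while a gratuitous ARP message from a server with no reported connectivity issue yields no diagnosis.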
Diagnosis module 213 may utilize transition module 211 to force the active server to failover if diagnosis module 213 determines that the standby server is attempting to transition to active mode because it can no longer communicate with the active server. After a failover occurs, diagnosis module 213 may periodically check if the control plane fault has been repaired. If the fault is repaired, diagnosis module 213 may utilize recovery module 215 to cause the server that was originally active to revert to active mode.
Recovery module 215 may be utilized to perform a full recovery once connectivity via the control plane is restored. In one embodiment, a full recovery may mean restoration of the respective modes of servers belonging to an application service to their original modes prior to the occurrence of the control plane failure. Recovery module 215 may, for instance, force such a recovery if it receives an indication that communication via the control plane has resumed. Recovery module 215 may then utilize synchronization module 207 to re-synchronize state information and replicate all data between the instances of the application running on servers 115. In one embodiment, recovery module 215 may utilize transition module 211 to transition back to an active mode. As described above, transition module 211 may, in turn, utilize address module 209 and communication module 203 to broadcast one or more other gratuitous address resolution messages via the data plane.
In another embodiment, recovery module 215 may be utilized to perform a partial recovery even if the control plane fault continues to affect communication among the servers. For instance, application service 105 may rely on asynchronous or semi-synchronous replication techniques. In such cases, recovery may simply entail a failover to a standby server. Recovery module 215 may still utilize transition module 211 to cause the transition of the active server to standby mode, but may not require the server to transition back to active mode when the control plane fault is repaired.
In accordance with an exemplary embodiment of the present disclosure, server 401a may diagnose a control plane fault as being a cause of the receipt of a gratuitous ARP message (“GARP”) via data network 405. As described earlier, a gratuitous ARP message may be seen as an attempt by a server to transition to active mode because it indicates an attempt to actively respond to client access requests based on a virtual IP address. For example, a fault affecting control network 407 may be diagnosed as the reason why standby server 401b is attempting to transition to an active mode. In such a scenario, active server 401a may fail over to standby mode and allow server 401b to stably transition to active mode. When connectivity via control network 407 resumes, server 401a may attempt to revert to active mode by sending another set of gratuitous ARP messages onto data network 405. In one embodiment, the decision to revert to active mode may be determined by the control logic 201 of the failover platform 145 based on the state of the network links. For instance, the control logic 201 may determine that the servers would become unstable if server 401a reverts to active mode (e.g., due to flapping or continuous failover). To prevent instability, the control logic 201 may determine to leave the servers in the failed over state and wait for manual reconfiguration.
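As a non-limiting illustration of how a gratuitous ARP message for a shared virtual IP address might be recognized, the following Python sketch packs and parses the 28-byte ARP payload for Ethernet and IPv4 as laid out in RFC 826. A message is treated as gratuitous when its sender and target protocol addresses are equal. The function names `build_garp` and `parse_garp`, the example addresses, and the use of a request opcode are assumptions made for the example; the sketch operates on raw payload bytes and does not capture frames from a network.

```python
import socket
import struct
from typing import Optional, Set

# RFC 826 payload for Ethernet/IPv4: htype, ptype, hlen, plen, oper,
# sender MAC, sender IP, target MAC, target IP (28 bytes in total).
ARP_FORMAT = "!HHBBH6s4s6s4s"
ARP_LENGTH = struct.calcsize(ARP_FORMAT)


def build_garp(mac: bytes, ip: str) -> bytes:
    """Build a gratuitous ARP request payload announcing `ip` at `mac`."""
    addr = socket.inet_aton(ip)
    # Sender IP and target IP are the same address; target MAC is zeroed.
    return struct.pack(ARP_FORMAT, 1, 0x0800, 6, 4, 1, mac, addr, b"\x00" * 6, addr)


def parse_garp(payload: bytes, virtual_ips: Set[str]) -> Optional[str]:
    """Return the announced virtual IP if `payload` is a gratuitous ARP
    message for one of `virtual_ips`; otherwise return None."""
    if len(payload) < ARP_LENGTH:
        return None
    htype, ptype, hlen, plen, _oper, _sha, spa, _tha, tpa = struct.unpack(
        ARP_FORMAT, payload[:ARP_LENGTH]
    )
    # Only Ethernet (1) carrying IPv4 (0x0800) is handled in this sketch.
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        return None
    # Gratuitous: the sender announces its own address.
    if spa != tpa:
        return None
    ip = socket.inet_ntoa(spa)
    return ip if ip in virtual_ips else None
```

In this sketch, an active server holding the hypothetical virtual address 10.0.0.5 could call `parse_garp(payload, {"10.0.0.5"})` on each ARP payload received via the data network and treat a non-None result as an attempt by the peer to transition to active mode.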
Computer system 500 may be coupled via bus 501 to a display 511, such as a cathode ray tube (CRT), liquid crystal display, active matrix display, or plasma display, for displaying information to a computer user. An input device 513, such as a keyboard including alphanumeric and other keys, is coupled to bus 501 for communicating information and command selections to processor 503. Another type of user input device is a cursor control 515, such as a mouse, a trackball, or cursor direction keys, for communicating direction information and command selections to processor 503 and for controlling cursor movement on display 511.
According to an embodiment of the invention, the processes described herein are performed by computer system 500, in response to processor 503 executing an arrangement of instructions contained in main memory 505. Such instructions can be read into main memory 505 from another computer-readable medium, such as storage device 509. Execution of the arrangement of instructions contained in main memory 505 causes processor 503 to perform the process steps described herein. One or more processors in a multi-processing arrangement may also be employed to execute the instructions contained in main memory 505. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions to implement the embodiment of the invention. Thus, embodiments of the invention are not limited to any specific combination of hardware circuitry and software.
Computer system 500 also includes a communication interface 517 coupled to bus 501. Communication interface 517 provides a two-way data communication coupling to a network link 519 connected to a local network 521. For example, communication interface 517 may be a digital subscriber line (DSL) card or modem, an integrated services digital network (ISDN) card, a cable modem, a telephone modem, or any other communication interface to provide a data communication connection to a corresponding type of communication line. As another example, communication interface 517 may be a local area network (LAN) card (e.g. for Ethernet™ or an Asynchronous Transfer Mode (ATM) network) to provide a data communication connection to a compatible LAN. Wireless links can also be implemented. In any such implementation, communication interface 517 sends and receives electrical, electromagnetic, or optical signals that carry digital data streams representing various types of information. Further, communication interface 517 can include peripheral interface devices, such as a Universal Serial Bus (USB) interface, a PCMCIA (Personal Computer Memory Card International Association) interface, etc. Although a single communication interface 517 is depicted in
Network link 519 typically provides data communication through one or more networks to other data devices. For example, network link 519 may provide a connection through local network 521 to a host computer 523, which has connectivity to a network 525 (e.g. a wide area network (WAN) or the global packet data communication network now commonly referred to as the “Internet”) or to data equipment operated by a service provider. Local network 521 and network 525 both use electrical, electromagnetic, or optical signals to convey information and instructions. The signals through the various networks and the signals on network link 519 and through communication interface 517, which communicate digital data with computer system 500, are exemplary forms of carrier waves bearing the information and instructions.
Computer system 500 can send messages and receive data, including program code, through the network(s), network link 519, and communication interface 517. In the Internet example, a server (not shown) might transmit requested code belonging to an application program for implementing an embodiment of the invention through network 525, local network 521 and communication interface 517. Processor 503 may execute the transmitted code while it is being received and/or store the code in storage device 509, or other non-volatile storage for later execution. In this manner, computer system 500 may obtain application code in the form of a carrier wave.
The term “computer-readable medium” as used herein refers to any medium that participates in providing instructions to processor 503 for execution. Such a medium may take many forms, including but not limited to non-volatile media, volatile media, and transmission media. Non-volatile media include, for example, optical or magnetic disks, such as storage device 509. Volatile media include dynamic memory, such as main memory 505. Transmission media include coaxial cables, copper wire and fiber optics, including the wires that comprise bus 501. Transmission media can also take the form of acoustic, optical, or electromagnetic waves, such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, CD-RW, DVD, any other optical medium, punch cards, paper tape, optical mark sheets, any other physical medium with patterns of holes or other optically recognizable indicia, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave, or any other medium from which a computer can read.
Various forms of computer-readable media may be involved in providing instructions to a processor for execution. For example, the instructions for carrying out at least part of the embodiments of the invention may initially be borne on a magnetic disk of a remote computer. In such a scenario, the remote computer loads the instructions into main memory and sends the instructions over a telephone line using a modem. A modem of a local computer system receives the data on the telephone line and uses an infrared transmitter to convert the data to an infrared signal and transmit the infrared signal to a portable computing device, such as a personal digital assistant (PDA) or a laptop. An infrared detector on the portable computing device receives the information and instructions borne by the infrared signal and places the data on a bus. The bus conveys the data to main memory, from which a processor retrieves and executes the instructions. The instructions received by main memory can optionally be stored on storage device either before or after execution by processor.
In one embodiment, chip set 600 includes a communication mechanism such as a bus 601 for passing information among the components of chip set 600. Processor 603 has connectivity to bus 601 to execute instructions and process information stored in, for example, a memory 605. Processor 603 may include one or more processing cores with each core configured to perform independently. A multi-core processor enables multiprocessing within a single physical package. Examples of a multi-core processor include two, four, eight, or greater numbers of processing cores. Alternatively or in addition, processor 603 may include one or more microprocessors configured in tandem via bus 601 to enable independent execution of instructions, pipelining, and multithreading. Processor 603 may also be accompanied by one or more specialized components to perform certain processing functions and tasks such as one or more digital signal processors (DSP) 607, or one or more application-specific integrated circuits (ASIC) 609. A DSP 607 typically is configured to process real-world signals (e.g., sound) in real time independently of processor 603. Similarly, an ASIC 609 can be configured to perform specialized functions not easily performed by a general-purpose processor. Other specialized components to aid in performing the inventive functions described herein include one or more field programmable gate arrays (FPGA) (not shown), one or more controllers (not shown), or one or more other special-purpose computer chips.
Processor 603 and accompanying components have connectivity to memory 605 via bus 601. Memory 605 includes both dynamic memory (e.g., RAM, magnetic disk, writable optical disk, etc.) and static memory (e.g., ROM, CD-ROM, etc.) for storing executable instructions that when executed perform the inventive steps described herein for failover detection and recovery. Memory 605 also stores the data associated with or generated by the execution of the inventive steps.
While certain exemplary embodiments and implementations have been described herein, other embodiments and modifications will be apparent from this description. Accordingly, the invention is not limited to such embodiments, but rather to the broader scope of the presented claims and various obvious modifications and equivalent arrangements.
Claims
1. A method comprising:
- detecting an attempt by a first server to transition from a standby mode to an active mode;
- diagnosing a loss of connectivity to the first server in a control plane as a cause of the attempt;
- transitioning to a standby mode based on the diagnosed cause of the attempt;
- receiving one or more gratuitous address resolution messages from the first server via a data plane; and
- determining a failure to synchronize to one or more applications at the first server via the control plane,
- wherein the attempt is detected based on the one or more gratuitous address resolution messages.
2. The method of claim 1, further comprising:
- detecting a resumption of connectivity via the control plane to the first server; and
- re-synchronizing the one or more applications with the first server via the control plane.
3. The method of claim 2, further comprising:
- transitioning to the active mode based on the detected resumption of connectivity; and
- transmitting one or more other gratuitous address resolution messages via the data plane.
4. The method of claim 1, wherein the control and data planes correspond to respective first and second communication links to the first server.
5. The method of claim 1, wherein the one or more gratuitous address resolution messages comprise one or more virtual network addresses shared by the first server and a second server.
6. The method of claim 5, wherein the one or more virtual network addresses are shared virtual Internet protocol (IP) addresses.
7. The method of claim 1, wherein the first server hosts the one or more applications with a second server.
8. An apparatus comprising at least one processor configured to:
- detect an attempt by a first server to transition from a standby mode to an active mode;
- diagnose a loss of connectivity to the first server in a control plane as a cause of the attempt;
- transition to a standby mode based on the diagnosed cause of the attempt;
- receive one or more gratuitous address resolution messages from the first server via a data plane; and
- determine a failure to synchronize to one or more applications at the first server via the control plane,
- wherein the attempt is detected based on the one or more gratuitous address resolution messages.
9. The apparatus of claim 8, wherein the apparatus is further configured to:
- detect a resumption of connectivity via the control plane to the first server; and
- re-synchronize the one or more applications with the first server via the control plane.
10. The apparatus of claim 9, wherein the apparatus is further configured to:
- transition to the active mode based on the detected resumption of connectivity; and
- transmit one or more other gratuitous address resolution messages via the data plane.
11. The apparatus of claim 8, wherein the control and data planes correspond to respective first and second communication links to the first server.
12. The apparatus of claim 8, wherein the one or more gratuitous address resolution messages comprise one or more virtual network addresses shared by the first server and a second server.
13. The apparatus of claim 12, wherein the one or more virtual network addresses are shared virtual Internet protocol (IP) addresses.
14. The apparatus of claim 8, wherein the first server hosts the one or more applications with a second server.
15. A system comprising:
- a first server;
- a second server;
- a failover platform configured to: detect an attempt by the first server to transition from a standby mode to an active mode, diagnose a loss of connectivity to the first server in a control plane as a cause of the attempt, transition to a standby mode based on the diagnosed cause of the attempt, receive one or more gratuitous address resolution messages from the first server via a data plane, and determine a failure to synchronize to one or more applications at the first server via the control plane,
- wherein the control plane connects the first and second servers via a first communication link and the data plane connects the first and second servers via a second communication link, and
- wherein the attempt is detected based on the one or more gratuitous address resolution messages.
16. The system of claim 15, wherein the failover platform is further configured to:
- detect a resumption of connectivity via the control plane to the first server; and
- re-synchronize the one or more applications with the first server via the control plane.
17. The system of claim 16, wherein the failover platform is further configured to:
- transition to the active mode based on the detected resumption of connectivity; and
- transmit one or more other gratuitous address resolution messages via the data plane.
18. The system of claim 15, wherein the one or more gratuitous address resolution messages comprise one or more virtual network addresses shared by the first server and the second server.
19. The system of claim 18, wherein the one or more virtual network addresses are shared virtual Internet protocol (IP) addresses.
20. The system of claim 15, wherein the first server hosts the one or more applications with the second server.
6049825 | April 11, 2000 | Yamamoto |
6108300 | August 22, 2000 | Coile |
20020083036 | June 27, 2002 | Price |
20040268175 | December 30, 2004 | Koch |
20060206611 | September 14, 2006 | Nakamura |
20060245411 | November 2, 2006 | Chen |
20070070975 | March 29, 2007 | Otani |
20080120177 | May 22, 2008 | Moscirella |
20080244281 | October 2, 2008 | Felter |
20090037763 | February 5, 2009 | Adhya |
20090037998 | February 5, 2009 | Adhya |
20100271933 | October 28, 2010 | Li |
20110271136 | November 3, 2011 | Abbot |
20130185416 | July 18, 2013 | Larkin |
20140289399 | September 25, 2014 | Shimokuni |
20140317440 | October 23, 2014 | Biermayr |
20150081767 | March 19, 2015 | Evens |
- How does gratuitous ARP work? Network Engineering Stack Exchange. May 2, 2014 [retrieved on Mar. 23, 2016]. Retrieved from the Internet <URL: http://networkengineering.stackexchange.com/questions/7713/how-does-gratuitous-arp-work>.
- Bhide, Anupam. Elnozahy, Elmootazbellah. Morgan, Stephen. A Highly Available Network File Server. Proceedings of the Winter USENIX Conference. 1991.
- What is Gratuitous ARP? Jul. 16, 2014 [retrieved on Mar. 23, 2016]. Retrieved from the Internet <URL: https://supportforums.cisco.com/discussion/12257536/what-gratuitous-arp>.
Type: Grant
Filed: Jan 24, 2014
Date of Patent: Nov 1, 2016
Patent Publication Number: 20150212909
Assignee: Verizon Patent and Licensing Inc. (Basking Ridge, NJ)
Inventor: Eric Sporel (Westford, MA)
Primary Examiner: Gabriel Chu
Assistant Examiner: Paul Contino
Application Number: 14/163,166
International Classification: G06F 11/20 (20060101); G06F 11/16 (20060101); H04L 29/12 (20060101);