SYSTEM AND METHOD FOR MANAGING DATA CENTER ALARMS

- ALCATEL-LUCENT USA INC.

Systems, methods, architectures, mechanisms and/or apparatus to manage alarm generation associated with event-sourcing objects or entities at a data center in accordance with a hierarchy of failure relationships of the event-sourcing objects or entities, wherein alarms normally generated in response to a received event are suppressed (i.e., not generated) if the source of the event is not the root cause of the event.

Description
FIELD OF THE INVENTION

The invention relates to the field of network and data center management and, more particularly but not exclusively, to management of event data in networks, data centers and the like.

BACKGROUND

Data Center (DC) architecture generally consists of a large number of compute and storage resources that are interconnected through a scalable Layer-2 or Layer-3 infrastructure. In addition to this networking infrastructure running on hardware devices, the DC network includes software networking components (v-switches) running on general purpose compute, and dedicated hardware appliances that supply specific network services such as load balancers, ADCs, firewalls, IPS/IDS systems and so on. The DC infrastructure can be owned by an Enterprise or by a service provider (referred to as a Cloud Service Provider or CSP), and shared by a number of tenants. Compute and storage infrastructure are virtualized in order to allow different tenants to share the same resources. Each tenant can dynamically add/remove resources from the global pool to/from its individual service.

Virtualized services as discussed herein generally describe any type of virtualized compute and/or storage resources capable of being provided to a tenant. Moreover, virtualized services also include access to non-virtual appliances or other devices using virtualized compute/storage resources, data center network infrastructure and so on. The various embodiments are adapted to improve event-related processing within the context of data centers, networks and the like.

Within the context of a typical data center arrangement, a tenant entity such as a bank or other entity has provisioned for it a number of virtual machines (VMs) which are accessed via a Wide Area Network (WAN) using Border Gateway Protocol (BGP). At the same time, thousands of other virtual machines may be provisioned for hundreds or thousands of other tenants. The scale associated with such a data center may be enormous; thousands of virtual machines may be created and/or destroyed each day per tenant demand.

Each of the virtual ports, virtual machines, virtual switches, virtual switch controllers and other objects or entities within the data center (virtual and otherwise) generates event data in response to many different types of conditions. Of critical importance is event data associated with the failure of an object or entity such that an alarm may be generated and delivered to a client, system operator or other alarm processing entity to indicate this failure.

Unfortunately, the enormous number of objects or entities providing event data results in the generation of a very large number of alarms, such that the timely processing of alarms becomes very difficult. This may result in missed opportunities for system optimization, degraded customer experience, longer repair/replace cycles and so on.

SUMMARY

Various deficiencies in the prior art are addressed by systems, methods, architectures, mechanisms and/or apparatus to manage alarm generation associated with event-sourcing objects or entities at a data center in accordance with a hierarchy of failure relationships of the event-sourcing objects or entities. In various embodiments, alarms normally generated in response to a received event are suppressed (i.e., not generated) if the source of the event is not the root cause of the event. A determination as to root cause may be made according to a hierarchy of failure relationships of objects/entities at the data center.

A method according to one embodiment comprises defining a hierarchy of failure relationships of DC entities, each of said failure relationships comprising a higher-level DC entity and a lower level DC entity, each lower level DC entity necessarily failing in response to failure of a corresponding higher-level DC entity; in response to received events indicative of failed DC entities, correlating failed higher-level entities corresponding to failed lower level DC entities to identify thereby root cause failed DC entities; and generating failure alarms associated with said root cause failed DC entities.

In various embodiments, alarms associated with lower-level objects/entities that have necessarily failed due to the failure of a corresponding higher-level object/entity are suppressed. In various embodiments, the hierarchy of failure relationships is indicated using a relational graph. In various embodiments, the relational graph includes one or more trees. In various embodiments, alarm generation is not suppressed in the case of preferred or priority entities, services, customers and the like.

BRIEF DESCRIPTION OF THE DRAWINGS

The teachings herein can be readily understood by considering the following detailed description in conjunction with the accompanying drawings, in which:

FIG. 1 depicts a high-level block diagram of a system benefiting from various embodiments;

FIG. 2 depicts an exemplary management system suitable for use in the system of FIG. 1;

FIG. 3 depicts a flow diagram of methods according to various embodiments;

FIG. 4 graphically depicts a hierarchy of failure relationships of DC entities supporting an exemplary virtualized service useful in understanding the embodiments; and

FIG. 5 depicts a high-level block diagram of a computing device suitable for use in performing the functions described herein.

To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures.

DETAILED DESCRIPTION OF THE INVENTION

The invention will be discussed within the context of systems, methods, architectures, mechanisms and/or apparatus adapted to reduce alarm processing burdens by selectively suppressing alarms associated with lower-level objects or entities that necessarily failed due to the failure of a corresponding higher-level object or entity. That is, alarms associated with objects or entities below a hierarchically superior object or entity may be suppressed in various circumstances, such as where a hierarchically superior object or entity has failed thereby causing failure of corresponding hierarchically inferior objects or entities.

For example, a failure of a virtual switch supporting (i.e., hierarchically above) a number of virtual machines in a data center will result in the generation of alarms indicative of the failure of the virtual switch, the failure of each of the virtual machines, the failure of the virtual ports supported by the virtual machines and so on. In various embodiments, only the alarm associated with the failed virtual switch (the root cause) is generated, while the alarms associated with the necessarily failed virtual machines and virtual ports are suppressed.

However, it will be appreciated by those skilled in the art that the invention has broader applicability than described herein with respect to the various embodiments.

Virtualized services as discussed herein generally describe any type of virtualized compute and/or storage resources capable of being provided to a tenant. Moreover, virtualized services also include access to non-virtual appliances or other devices using virtualized compute/storage resources, data center network infrastructure and so on. The various embodiments are adapted to improve event-related processing within the context of data centers, networks and the like. The various embodiments advantageously improve such processing even as the nature of virtual machines, the mixed virtual and real provisioning of VMs and the like make such processing more complex. Moreover, as data center sizes scale up, the resources necessary to perform such correlation become enormous and the process cannot be handled in an efficient manner.

FIG. 1 depicts a high-level block diagram of a system benefiting from various embodiments. Specifically, FIG. 1 depicts a system 100 comprising a plurality of data centers (DC) 101-1 through 101-X (collectively data centers 101) operative to provide compute and storage resources to numerous customers having application requirements at residential and/or enterprise sites 105 via one or more networks 102.

The customers having application requirements at residential and/or enterprise sites 105 interact with the network 102 via any standard wireless or wireline access networks to enable local client devices (e.g., computers, mobile devices, set-top boxes (STBs), storage area network components, Customer Edge (CE) routers, access points and the like) to access virtualized compute and storage resources at one or more of the data centers 101.

The networks 102 may comprise any of a plurality of available access network and/or core network topologies and protocols, alone or in any combination, such as Virtual Private Networks (VPNs), Long Term Evolution (LTE), Border Network Gateway (BNG), Internet networks and the like.

The various embodiments will generally be described within the context of IP networks enabling communication between provider edge (PE) nodes 108. Each of the PE nodes 108 may support multiple data centers 101. That is, the two PE nodes 108-1 and 108-2 depicted in FIG. 1 as communicating between networks 102 and DC 101-X may also be used to support a plurality of other data centers 101.

The data center 101 (illustratively DC 101-X) is depicted as comprising a plurality of core switches 110, a plurality of service appliances 120, a first resource cluster 130, a second resource cluster 140, and a third resource cluster 150.

Each of, illustratively, two PE nodes 108-1 and 108-2 is connected to each of the, illustratively, two core switches 110-1 and 110-2. More or fewer PE nodes 108 and/or core switches 110 may be used; redundant or backup capability is typically desired. The PE routers 108 interconnect the DC 101 with the networks 102 and, thereby, other DCs 101 and end-users 105. The DC 101 is generally organized in cells, where each cell can support thousands of servers and virtual machines.

Each of the core switches 110-1 and 110-2 is associated with a respective (optional) service appliance 120-1 and 120-2. The service appliances 120 are used to provide higher layer networking functions such as providing firewalls, performing load balancing tasks and so on.

The resource clusters 130-150 are depicted as compute and/or storage resources organized as racks of servers implemented either by multi-server blade chassis or individual servers. Each rack holds a number of servers (depending on the architecture), and each server can support a number of processors. A set of network connections connect the servers with either a Top-of-Rack (ToR) or End-of-Rack (EoR) switch. While only three resource clusters 130-150 are shown herein, hundreds or thousands of resource clusters may be used. Moreover, the configuration of the depicted resource clusters is for illustrative purposes only; many more and varied resource cluster configurations are known to those skilled in the art. In addition, specific (i.e., non-clustered) resources may also be used to provide compute and/or storage resources within the context of DC 101.

Exemplary resource cluster 130 is depicted as including a ToR switch 131 in communication with a mass storage device(s) or storage area network (SAN) 133, as well as a plurality of server blades 135 adapted to support, illustratively, virtual machines (VMs). Exemplary resource cluster 140 is depicted as including an EoR switch 141 in communication with a plurality of discrete servers 145. Exemplary resource cluster 150 is depicted as including a ToR switch 151 in communication with a plurality of virtual switches 155 adapted to support, illustratively, the VM-based appliances.

In various embodiments, the ToR/EoR switches are connected directly to the PE routers 108. In various embodiments, the core or aggregation switches 120 are used to connect the ToR/EoR switches to the PE routers 108. In various embodiments, the core or aggregation switches 120 are used to interconnect the ToR/EoR switches. In various embodiments, direct connections may be made between some or all of the ToR/EoR switches.

A VirtualSwitch Control Module (VCM) running in the ToR switch gathers connectivity, routing, reachability and other control plane information from other routers and network elements inside and outside the DC. The VCM may also run on a VM located in a regular server. The VCM then programs each of the virtual switches with the specific routing information relevant to the virtual machines (VMs) associated with that virtual switch. This programming may be performed by updating L2 and/or L3 forwarding tables or other data structures within the virtual switches. In this manner, traffic received at a virtual switch is propagated from the virtual switch toward an appropriate next hop over an IP tunnel between the source hypervisor and destination hypervisor. The ToR switch performs just tunnel forwarding without being aware of the service addressing.

Generally speaking, the “end-users/customer edge equivalents” for the internal DC network comprise either VM or server blade hosts, service appliances and/or storage areas. Similarly, the data center gateway devices (e.g., PE routers 108) offer connectivity to the outside world; namely, the Internet, VPNs (IP VPNs/VPLS/VPWS), other DC locations, Enterprise private networks or (residential) subscriber deployments (BNG, Wireless (LTE and the like), Cable) and so on.

In addition to the various elements and functions described above, the system 100 of FIG. 1 further includes a Management System (MS) 190. The MS 190 is adapted to support various management functions associated with the data center or, more generically, telecommunication network or computer network resources. The MS 190 is adapted to communicate with various portions of the system 100, such as one or more of the data centers 101. The MS 190 may also be adapted to communicate with other operations support systems (e.g., Element Management Systems (EMSs), Topology Management Systems (TMSs), and the like, as well as various combinations thereof).

The MS 190 may be implemented at a network node, network operations center (NOC) or any other location capable of communication with the relevant portion of the system 100, such as a specific data center 101 and various elements related thereto. The MS 190 may be implemented as a general purpose computing device or specific purpose computing device, such as described below with respect to FIG. 5.

FIG. 2 depicts an exemplary management system suitable for use as the management system of FIG. 1. As depicted in FIG. 2, MS 190 includes one or more processor(s) 210, a memory 220, a network interface 230N, and a user interface 230I. The processor(s) 210 is coupled to each of the memory 220, the network interface 230N, and the user interface 230I.

The processor(s) 210 is adapted to cooperate with the memory 220, the network interface 230N, the user interface 230I, and the support circuits 240 to provide various management functions for a data center 101 and/or the system 100 of FIG. 1.

The memory 220, generally speaking, stores programs, data, tools and the like that are adapted for use in providing various management functions for the data center 101 and/or the system 100 of FIG. 1.

The memory 220 includes various management system (MS) programming modules 222 and MS databases 223 adapted to implement network management functionality such as discovering and maintaining network topology, processing VM related requests (e.g., instantiating, destroying, migrating and so on) and the like.

The memory 220 includes a rules engine 228 (e.g., DROOLS) operable to process events generated by virtualized and/or non-virtualized objects, entities, protocols and the like within the data center against a data structure representing a current hierarchical failure relationship of these objects or entities, to identify thereby events which may be repetitive or extraneous such that corresponding alarms need not be generated. That is, where events associated with the failure of a parent entity (e.g., a virtual switch) necessarily result in the failure of a child entity (e.g., a port of the virtual switch), the generation of an alarm associated with the virtual switch failure renders the generation of an alarm associated with the virtual switch port failure unnecessary since the virtual switch port failure may be assumed where the virtual switch itself has failed.

The memory 220 also includes a failure relationship engine 229 operable to construct a data structure or otherwise define the hierarchy of failure relationships in a manner suitable for use by the rules engine 228. Generally speaking, the hierarchy of failure relationships identifies hierarchically higher level objects, entities, protocols and the like which, upon failure, necessarily cause the failure of corresponding hierarchically lower level objects, entities, protocols and the like.

In various embodiments, the rules engine 228 suppresses alarms normally appropriate in view of received event data streams and/or other information where such alarms are related to a hierarchically lower level object, entity, protocol and the like which has necessarily failed due to the failure of a corresponding hierarchically higher level object, entity, protocol and the like.
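
By way of illustration only, the per-event suppression decision applied by a rules engine such as rules engine 228 may be sketched as follows. The sketch is in Python, all names are assumptions, and it presumes only that the current hierarchy of failure relationships is available as a mapping from each child entity to its parent entities and that the set of currently failed entities is known; it is not a description of any particular rules engine such as DROOLS.

    # Illustrative sketch only; names (parent_of, failed, should_generate_alarm)
    # are assumptions and do not come from any specific rules-engine API.
    def should_generate_alarm(source_id, parent_of, failed):
        """Return True if a failure event from source_id warrants an alarm.

        parent_of: mapping from a child entity identifier to the identifiers of
                   its parent entities in the current hierarchy of failure
                   relationships.
        failed:    set of entity identifiers currently known to be failed.
        """
        for parent_id in parent_of.get(source_id, ()):
            if parent_id in failed:
                # A hierarchically higher level entity has failed, so this entity
                # necessarily failed as well; the alarm is suppressed.
                return False
        return True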

In various embodiments, the MS programming module 222, rules engine 228 and failure relationship engine 229 are implemented using software instructions which may be executed by a processor (e.g., processor(s) 210) for performing the various management functions depicted and described herein.

The network interface 230N is adapted to facilitate communications with various network elements, nodes and other entities within the system 100, DC 101 or other network to support the management functions performed by MS 190.

The user interface 230I is adapted to facilitate communications with one or more user workstations (illustratively, user workstation 250), for enabling one or more users to perform management functions for the system 100, DC 101 or other network.

As described herein, memory 220 includes the MS programming module 222, MS databases 223, rules engine 228 and failure relationship engine 229 which cooperate to provide the various functions depicted and described herein. Although primarily depicted and described herein with respect to specific functions being performed by and/or using specific ones of the engines and/or databases of memory 220, it will be appreciated that any of the management functions depicted and described herein may be performed by and/or using any one or more of the engines and/or databases of memory 220.

The MS programming 222 adapts the operation of the MS 190 to manage various network elements, DC elements and the like such as described above with respect to FIG. 1, as well as various other network elements (not shown) and/or various communication links therebetween. The MS databases 223 are used to store topology data, network element data, service related data, VM related data, BGP related data, IGP related data and any other data related to the operation of the Management System 190. The MS programming 222 may implement various service aware manager (SAM) or network manager functions.

Events and Event Logs

Each virtual and nonvirtual object/entity generating events (i.e., each event source object/entity) communicates these events to the MS 190 or other entity via respective event streams. The MS 190 processes the event streams as described herein and, additionally, maintains an event log associated with each of the individual event stream sources. In various embodiments, combined event logs are maintained.

Each event log generally includes data fields providing, for each event, (1) a timestamp, (2) an event source object/entity identifier, (3) any parent object/entity identifiers (optional), (4) an event type indicator, and other information as appropriate.

The timestamp is based upon the time the event was generated, the time the event was received and logged, or some other relevant timestamp criteria.

The event source object/entity identifier identifies the object/entity generating the event. The identifier may comprise, illustratively, a Universal Unique Identifier (UUID), an IP address or any other suitable identifier.

The optional parent object/entity identifiers identify any parent objects/entities associated with the event source object/entity. Specifically, most source objects/entities are associated with one or more parent objects/entities, wherein a failure of a parent object/entity necessarily results in a failure of any child object/entities. Thus, the parent object/entity identifiers identify those objects/entities in a failure relationship with the source object/entity, wherein the parent objects/entities comprise hierarchically higher level entities having failure relationships with the corresponding and hierarchically lower level source (i.e., child) entity.

The event type indicator indicates the type of event generated by the event source object/entity. Various types of events may be generated. For example, nonvirtual object/entity sourced events may comprise events such as UP, DOWN, SUSPEND, OFF-LINE, ON-LINE, FAIL, RESTORE, INITIALIZED and so on; virtual object/entity, virtual machine (VM) and VM-appliance sourced events may comprise events such as UP, DOWN, SUSPEND, STOP, CRASH, DESTROY, CREATE and so on; and IGP/BGP sourced events may comprise events such as New Prefix, Prefix Withdrawn, Prefix Unreachable, Prefix Redundancy Changed and so on. Other examples will be known to those skilled in the art.
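
By way of illustration only, an event log record carrying the fields described above might be represented as in the following Python sketch; the field names and the particular event type values shown are assumptions for illustration, not a required format.

    # Illustrative event log record; field names are assumptions only.
    from dataclasses import dataclass, field
    from enum import Enum
    from typing import List

    class EventType(Enum):
        UP = "UP"
        DOWN = "DOWN"
        FAIL = "FAIL"
        RESTORE = "RESTORE"
        PREFIX_UNREACHABLE = "PREFIX_UNREACHABLE"   # e.g., an IGP/BGP sourced event

    @dataclass
    class EventRecord:
        timestamp: float                                      # time generated or time received/logged
        source_id: str                                        # e.g., a UUID or an IP address
        event_type: EventType                                 # type of event generated by the source
        parent_ids: List[str] = field(default_factory=list)   # optional parent object/entity identifiers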

In various embodiments, each event source object/entity has knowledge of one or more respective parent objects/entities. In these embodiments, the event source object/entity includes parent object/entity identifiers within some or all of the events generated by the source object/entity.

In various embodiments, some or all of the event source objects/entities do not possess knowledge of respective parent objects/entities. However, current parent information for each of the event source objects/entities may be associated with each received event such that the parent information may be included within the event logs. The current parent information may be derived from provisioning information, stored correlation information and/or other management information. This information may be stored in, illustratively, the MS database 223 or other location.

Current Hierarchy of Failure Relationships

In various embodiments, current parent information for event source objects/entities may be retrieved or derived from information within a currently maintained hierarchy of failure relationships of some or all objects/entities within the DC.

The current hierarchy of failure relationships may be organized according to any of a number of data structures or formats, such as discussed in more detail herein. The current hierarchy of failure relationships, however organized, is substantially continually updated in response to changes in the state of the various real and/or virtual objects/entities within the DC, such as due to provisioning changes, object/entity failures, object/entity capability changes or service degradations and so on, to provide thereby a relatively instantaneous or current “snapshot” of parent/child failure relationships of the various objects/entities within the DC. Thus, the current hierarchy of failure relationships may be used to identify, for each event source object/entity, any corresponding parent objects/entities contemporaneously associated with an event source object/entity generating an event to be logged. This contemporaneous parent/child information may be included within the event log(s) associated with incoming events.

In various embodiments, the current hierarchy of failure relationships may be formed using a table of associations, using one or more directed trees, using a forest of directed trees or using some other structure. The current hierarchy of failure relationships may be maintained by the failure relationship engine 229, MS programming 222 or other module within MS 190.
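
By way of illustration only, one possible in-memory form of the current hierarchy of failure relationships is a child-to-parents map (a directed graph that reduces to a forest of directed trees when each entity has a single parent), updated as entities are provisioned, migrated or torn down. The Python sketch below is an assumption for illustration; the class and method names are not taken from any particular implementation.

    # Illustrative maintenance of the hierarchy of failure relationships as a
    # child-to-parents map; all names are assumptions for illustration only.
    class FailureHierarchy:
        def __init__(self):
            self.parents = {}          # child entity id -> set of parent entity ids

        def add_relationship(self, parent_id, child_id):
            # Record that failure of parent_id necessarily fails child_id.
            self.parents.setdefault(child_id, set()).add(parent_id)

        def remove_entity(self, entity_id):
            # Invoked when an entity is torn down or migrated away.
            self.parents.pop(entity_id, None)
            for parent_set in self.parents.values():
                parent_set.discard(entity_id)

        def parents_of(self, entity_id):
            return self.parents.get(entity_id, set())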

Thus, received events may be logged in a manner including event source object/entity identification along with corresponding parent object/entity information.

Suppression of Non-Root Cause Alarms

In various embodiments, the rules engine 228 or other module within MS 190 correlates event streams provided by child entities/objects to event streams provided by corresponding parent entities/objects to identify events, event streams and the like which are repetitive or extraneous (e.g., child entity failure events corresponding to parent entity failure events). Alarms normally generated in response to received events are suppressed where the received events are deemed to be repetitive or extraneous. That is, an alarm normally generated in response to a received event from a source object/entity is suppressed where the root cause of the received event is another object/entity, such as a corresponding parent object/entity. Thus, rather than generating multiple alarms associated with multiple objects/entities, a single alarm will be generated from the object/entity that is the root cause of the various events.

Thus, in various embodiments, a current hierarchical representation of the object/entities within the data center is maintained by the failure relationship engine 229 and used by the rules engine 228 to suppress the generation of extraneous alarms.

The various embodiments described herein contemplate that substantially all received events are logged, irrespective of the event source. However, in various embodiments, the logging of events from some event sources may also be suppressed. For example, in various embodiments some or all of the events or event types received from specific object/entities are not logged, such as warning/failure events associated with hierarchically lowest or lower order object/entities where corresponding warning/failure events of hierarchically superior objects/entities are received.

FIG. 3 depicts a flow diagram of a method according to one embodiment. Specifically, the method 300 of FIG. 3 contemplates various steps performed by, illustratively, the rules engine 228, failure relationship engine 229 and/or other MS programming mechanisms 222 associated with the management system 190. In various embodiments, the rules engine 228, failure relationship engine 229 and/or other MS programming mechanisms 222 are separate entities, partially combined or combined into a single functional module. In various embodiments, these functions are performed within the context of a general management function, an event processing function, an alarm generation function or other function.

At step 310, the method 300 constructs or updates a relational graph or other data structure defining a hierarchy of failure relationships of various virtual and nonvirtual objects/entities within the data center. Referring to box 315, virtual objects/entities may comprise virtual machines (VMs) or VM-based appliances, BGP/IGP or other protocols, user or supervisory services, or other virtual objects/entities. Similarly, nonvirtual objects/entities may comprise computation resources, memory resources, communication resources, communication protocols, user or supervisory services/implementations and other nonvirtual objects/entities.

In various embodiments, the relational graph or other data structure defining a hierarchy of failure relationships is constructed during initial provisioning of the data center and updated as new resources are instantiated, torn down, migrated and the like. Thus, in various embodiments, the relational graph or other data structure defining the hierarchy of failure relationships is continually updated in response to changes in data center resources, changes in use of data center resources, changes in status of data center resources (i.e., failure events and the like) and/or other management information.

At step 320, MS 190 receives event streams from one or more event-sourcing objects or entities within the data center. The event streams may comprise failure events, warning events, status events and so on. Of particular interest within the context of the various embodiments are failure events. Other embodiments may utilize failure events and warning events. Referring to box 325, event streams may comprise VM events, BGP events, IGP events, service events, network element events, network link events and/or other events.

At step 330, the hierarchical failure relationships of objects/entities associated with failure events are correlated to identify root cause failed objects/entities. That is, for those objects/entities deemed to be failed as indicated by a respective failure event, the relational graph or other data structure defining the hierarchy of failure relationships is used to correlate failed higher-level objects/entities to corresponding failed lower-level objects/entities. In this manner, the failed higher-level objects/entities that are the root cause of corresponding failed lower-level objects/entities are identified.

At step 340, alarms are generated for root cause failed objects/entities. Optionally, other alarms may be generated. Generally speaking, those alarms associated with failed lower-level objects/entities are suppressed where a corresponding higher-level object/entity has also failed. Referring to box 345, alarms may be generated for a root cause object/entity, a priority object/entity, a priority object/entity type, a priority service, a priority customer and/or other exceptions to the above-described alarm suppression mechanisms.
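
By way of illustration only, steps 330 and 340 may be combined into a single alarm-generation pass such as the following Python sketch, in which the parent_of map stands in for the relational graph of step 310 and the is_priority predicate stands in for the priority entity, entity type, service and customer exceptions of box 345; all names are assumptions for illustration.

    # Illustrative combination of steps 330 and 340: alarms are generated for root
    # cause failed entities and for priority exceptions; all other failure alarms
    # are suppressed. Names are assumptions for illustration only.
    def generate_alarms(failed, parent_of, is_priority=lambda entity_id: False):
        alarms = []
        for entity_id in failed:
            # A failed entity is a root cause if none of its parents in the
            # hierarchy of failure relationships is also failed.
            is_root_cause = not any(p in failed for p in parent_of.get(entity_id, ()))
            if is_root_cause or is_priority(entity_id):
                alarms.append(("FAILURE_ALARM", entity_id))
            # Non-root-cause, non-priority failures generate no alarm.
        return alarms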

In various embodiments, only events of interest within the event streams are further processed. For example, events of interest may comprise fault/failure events, whereas events of no interest may comprise status updates, warnings and the like. In particular, events of interest may comprise one or more of a VM fault/failure event, a VM fault/failure recovery event, a BGP fault/failure event, a BGP fault/failure recovery event, an IGP fault/failure event, an IGP fault/failure recovery event, or some other type of fault/failure event or recovery therefrom.

In various embodiments, a threshold level is adapted such that both warnings and fault/failure events are deemed to be events of interest.
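
By way of illustration only, such a threshold may be realized as a simple filter over incoming events, as in the Python sketch below; the event type groupings and the assumption that each event carries an event_type attribute are for illustration only.

    # Illustrative events-of-interest filter; type groupings are assumptions only.
    WARNING_TYPES = {"SUSPEND", "Prefix Redundancy Changed"}
    FAILURE_TYPES = {"DOWN", "FAIL", "CRASH", "Prefix Unreachable"}

    def events_of_interest(events, include_warnings=False):
        # With the threshold lowered (include_warnings=True), warnings are also
        # deemed events of interest in addition to fault/failure events.
        keep = FAILURE_TYPES | (WARNING_TYPES if include_warnings else set())
        return [e for e in events if e.event_type in keep]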

The various embodiments operate to reduce the problem space, required resources and processing time associated with processing alarms by suppressing unnecessary or superfluous alarms. Thus, various embodiments contemplate suppressing the generation of repetitive/extraneous alarms, avoiding the processing of events associated with some object/entities, avoiding the processing of some types of events, or any combination thereof.

The various embodiments described herein may be advantageously employed within a number of applications such as the following, any of which may be implemented as a revenue generating application of a data center owner or service provider: (1) On-demand historic failure analysis; (2) Analysis of historic data to improve DC performance; (3) Analysis of historic data to improve customer experience or performance; (4) Analysis of historic data to enable customers to more precisely define necessary virtual resources, thereby avoiding waste and improving experience; and/or other applications.

Multiple failure relationship hierarchies may be used to identify potential or actual root cause failures (or, conversely, the impact of the event of interest on other objects/entities) associated with failures or service degradations of interest to the system operator, client, user and so on. In various embodiments, the hierarchy of failure relationships is indicated using a relational graph. In various embodiments, the relational graph includes one or more trees. One or more of these multiple failure relationship hierarchies may be used to identify those alarms to be generated, those alarms to be suppressed and/or other information pertaining to the processing of various event streams.

FIG. 4 graphically depicts a hierarchy of failure relationships of DC entities supporting an exemplary virtualized service useful in understanding the embodiments. Specifically, FIG. 4 depicts virtual and nonvirtual DC objects/entities supporting a Virtual Private Routed Network (VPRN) service as well as the parent/child failure relationships between the various DC objects/entities.

Referring to FIG. 4, it can be seen that a top level VPRN service 410 is a higher-level object with respect to a DVRS site 450 and a provider edge (PE) router 470. PE router 470 is a higher-level object with respect to SAP2 471, which is a higher-level object with respect to external BGP unreachable events 472. DVRS site 450 is a higher-level object with respect to SAP1 451 and SDP 481, which is a higher-level object with respect to internal BGP unreachable events 422. Label Switched Path (LSP) monitor 480 is also a higher-level object with respect to Service Distribution Path (SDP) 481.

SAP1 451 is a higher-level object with respect to a first virtual machine (VM 1) 452, which is a higher-level object with respect to first virtual port (VP1.1) 453 and second virtual port (VP1.2) 454 of the first VM 452. Each of the first 453 and second 454 virtual ports is a higher-level object with respect to internal BGP unreachable events 422.

Interior Gateway Protocols (IGPs) 420, Route Reflectors (RR) 430 and Border Gateway Protocol (BGP) sites (e.g., DVRS and PE) 440 are all higher-level objects with respect to a BGP peer 421, which is a higher-level object with respect to internal BGP unreachable events 422.

A first hypervisor port 460 is a higher-level object with respect to a TCP session 461, which is a higher-level object with respect to a virtual switch 462, which is a higher-level object with respect to first VM 452.

Thus, FIG. 4 depicts the various parent/child failure relationships among a number of DC objects/entities forming an exemplary VPRN service 410. The failure of any object/entity representing a higher-level or parent object/entity in a failure relationship with one or more corresponding lower-level or child objects/entities will necessarily result in the failure of the lower-level or child objects/entities. Further, it can be seen that multiple levels or tiers within a hierarchy of failure relationships are provided. Further, it can be seen that an object/entity may have failure relationships with one or more corresponding higher-level or parent objects/entities, one or more lower-level or child objects/entities or any combination thereof.
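
By way of illustration only, applying the root cause correlation described above to a portion of the FIG. 4 hierarchy shows how a single alarm results from a cascading failure. In the Python sketch below, the entity labels and the plain child-to-parents map are assumptions used only to mirror part of FIG. 4; a failure of virtual switch 462 that also produces failure events from VM 452 and virtual ports 453 and 454 yields a root cause alarm for the virtual switch alone.

    # Illustrative root cause identification over part of the FIG. 4 hierarchy,
    # using a plain child -> parents map; labels are shorthand for the figure.
    parents = {
        "VM_452":             {"virtual_switch_462", "SAP1_451"},
        "VP1.1_453":          {"VM_452"},
        "VP1.2_454":          {"VM_452"},
        "virtual_switch_462": {"TCP_session_461"},
    }

    failed = {"virtual_switch_462", "VM_452", "VP1.1_453", "VP1.2_454"}
    root_causes = {e for e in failed
                   if not any(p in failed for p in parents.get(e, ()))}
    print(root_causes)   # -> {'virtual_switch_462'}; all other failure alarms are suppressed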

FIG. 5 depicts a high-level block diagram of a computing device, such as a processor in a telecom network element, suitable for use in performing functions described herein such as those associated with the various elements described herein with respect to the figures.

As depicted in FIG. 5, computing device 500 includes a processor element 503 (e.g., a central processing unit (CPU) and/or other suitable processor(s)), a memory 504 (e.g., random access memory (RAM), read only memory (ROM), and the like), a cooperating module/process 505, and various input/output devices 506 (e.g., a user input device (such as a keyboard, a keypad, a mouse, and the like), a user output device (such as a display, a speaker, and the like), an input port, an output port, a receiver, a transmitter, and storage devices (e.g., a persistent solid state drive, a hard disk drive, a compact disk drive, and the like)).

It will be appreciated that the functions depicted and described herein may be implemented in hardware and/or in a combination of software and hardware, e.g., using a general purpose computer, one or more application specific integrated circuits (ASIC), and/or any other hardware equivalents. In one embodiment, the cooperating process 505 can be loaded into memory 504 and executed by processor 503 to implement the functions as discussed herein. Thus, cooperating process 505 (including associated data structures) can be stored on a computer readable storage medium, e.g., RAM memory, magnetic or optical drive or diskette, and the like.

It will be appreciated that computing device 500 depicted in FIG. 5 provides a general architecture and functionality suitable for implementing functional elements described herein or portions of the functional elements described herein.

It is contemplated that some of the steps discussed herein may be implemented within hardware, for example, as circuitry that cooperates with the processor to perform various method steps. Portions of the functions/elements described herein may be implemented as a computer program product wherein computer instructions, when processed by a computing device, adapt the operation of the computing device such that the methods and/or techniques described herein are invoked or otherwise provided. Instructions for invoking the inventive methods may be stored in tangible and non-transitory computer readable medium such as fixed or removable media or memory, and/or stored within a memory within a computing device operating according to the instructions.

Various modifications may be made to the systems, methods, apparatus, mechanisms, techniques and portions thereof described herein with respect to the various figures, such modifications being contemplated as being within the scope of the invention. For example, while a specific order of steps or arrangement of functional elements is presented in the various embodiments described herein, various other orders/arrangements of steps or functional elements may be utilized within the context of the various embodiments. Further, while modifications to embodiments may be discussed individually, various embodiments may use multiple modifications contemporaneously or in sequence, compound modifications and the like.

Although various embodiments which incorporate the teachings of the present invention have been shown and described in detail herein, those skilled in the art can readily devise many other varied embodiments that still incorporate these teachings. Thus, while the foregoing is directed to various embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof. As such, the appropriate scope of the invention is to be determined according to the claims.

Claims

1. A method of managing alarms at a data center (DC), comprising:

defining a hierarchy of failure relationships of DC entities, each of said failure relationships comprising a higher-level DC entity and a lower level DC entity, each lower level DC entity necessarily failing in response to failure of a corresponding higher-level DC entity;
in response to received events indicative of failed DC entities, correlating failed higher-level entities corresponding to failed lower level DC entities to identify thereby root cause failed DC entities; and
generating failure alarms associated with said root cause failed DC entities.

2. The method of claim 1, wherein an alarm normally generated in response to a received event is suppressed unless the event source comprises a root cause of the received event.

3. The method of claim 1, further comprising suppressing failure alarm generation associated with failed DC entities that do not comprise root cause failed DC entities.

4. The method of claim 1, further comprising for each generated failure alarm associated with a root cause failed DC entity, suppressing failure alarm generation associated with failed DC entities in a corresponding failure relationship with the root cause failed DC entity.

5. The method of claim 1, further comprising generating failure alarms associated with failed DC entities comprising priority DC entities.

6. The method of claim 1, further comprising generating failure alarms associated with failed DC entities comprising priority entity types.

7. The method of claim 1, further comprising generating failure alarms associated with failed DC entities associated with a priority service.

8. The method of claim 1, further comprising generating failure alarms associated with failed DC entities associated with a priority customer.

9. The method of claim 1, wherein received events are included within event streams processed by a rules engine.

10. The method of claim 1, wherein said root cause failed DC entities are selected as those higher-level failed DC entities corresponding to lower-level failed entities.

11. The method of claim 10, wherein selecting root cause failed DC entities comprises identifying a minimum number of higher order failed DC entities hierarchically corresponding to a group of failed DC entities.

12. The method of claim 10, wherein said relational graph is formed as a directed tree structure.

13. The method of claim 12, wherein a first directed tree represents a data center object failure hierarchy, a second directed tree represents a Border Gateway Protocol (BGP) failure hierarchy, and a third directed tree represents an Interior Gateway Protocol (IGP) failure hierarchy.

14. The method of claim 12, wherein the entities comprise data center objects, wherein a first directed tree represents a data center object hard failure hierarchy and a second directed tree represents a data center object soft failure hierarchy.

15. The method of claim 12, wherein the DC entities comprise Border Gateway Protocol (BGP) objects, wherein a first directed tree represents a BGP object hard failure hierarchy and a second directed tree represents a BGP object soft failure hierarchy.

16. The method of claim 1, wherein the DC entities comprise any virtual or nonvirtual event-sourcing entity in the data center.

17. The method of claim 1, wherein the entities comprise any of a virtual machine (VM), a VM-based appliance, a virtual router (VR) and a virtual service.

18. An apparatus for managing alarms at a data center, the apparatus comprising:

a processor configured for: defining a hierarchy of failure relationships of DC entities, each of said failure relationships comprising a higher-level DC entity and a lower level DC entity, each lower level DC entity necessarily failing in response to failure of a corresponding higher-level DC entity;
in response to received events indicative of failed DC entities, correlating failed higher-level entities corresponding to failed lower level DC entities to identify thereby root cause failed DC entities; and
generating failure alarms associated with said root cause failed DC entities.

19. A tangible and non-transient computer readable storage medium storing instructions which, when executed by a computer, adapt the operation of the computer to perform a method for managing alarms at a data center, the method comprising:

defining a hierarchy of failure relationships of DC entities, each of said failure relationships comprising a higher-level DC entity and a lower level DC entity, each lower level DC entity necessarily failing in response to failure of a corresponding higher-level DC entity;
in response to received events indicative of failed DC entities, correlating failed higher-level entities corresponding to failed lower level DC entities to identify thereby root cause failed DC entities; and
generating failure alarms associated with said root cause failed DC entities.

20. A computer program product wherein computer instructions, when executed by a processor in a network element, adapt the operation of the network element to provide a method for managing alarms at a data center, the method comprising:

defining a hierarchy of failure relationships of DC entities, each of said failure relationships comprising a higher-level DC entity and a lower level DC entity, each lower level DC entity necessarily failing in response to failure of a corresponding higher-level DC entity;
in response to received events indicative of failed DC entities, correlating failed higher-level entities corresponding to failed lower level DC entities to identify thereby root cause failed DC entities; and
generating failure alarms associated with said root cause failed DC entities.
Patent History
Publication number: 20150170508
Type: Application
Filed: Dec 16, 2013
Publication Date: Jun 18, 2015
Applicant: ALCATEL-LUCENT USA INC. (Murray Hill, NJ)
Inventor: Vyacheslav Lvin (Mountain View, CA)
Application Number: 14/107,258
Classifications
International Classification: G08B 29/00 (20060101);