EVENT BASED SERVICE DISCOVERY AND ROOT CAUSE ANALYSIS
A system uses event correlation to identify components belonging to a same service or service domain. The system correlates events by generating covariance matrices or by performing sequence mining with temporal databases in order to discover event patterns that occur sequentially in a fixed time window. Components corresponding to the correlated events are identified as being part of a same service domain and can be indicated in a service domain data structure, such as a topology. The system utilizes the identified service domains during root cause analysis. The system can determine an anomalous event occurring at a lowest layer component in a service domain as a root cause or can determine an anomalous event which occurs first in an identified event sequence of a service domain as a root cause. After identifying the root cause event, the system suppresses notifications of events occurring at other components in the service domain.
The disclosure generally relates to the field of information security, and more particularly to software development, installation, and management.
Information related to interconnections among components in a system is often used for root cause analysis of system issues. For example, a network administrator or network management software may utilize network topology and network events to aid in troubleshooting issues and outages. Network topology describes connections between physical components of a network and may not describe relationships between software components. Events are generated by a variety of sources or components, including hardware and software. Events may be specified in messages that can indicate numerous activities, such as an application finishing a task or a server failure.
Aspects of the disclosure may be better understood by referencing the accompanying drawings.
The description that follows includes example systems, methods, techniques, and program flows that embody aspects of the disclosure. However, it is understood that this disclosure may be practiced without these specific details or be practiced in other environments. For instance, this disclosure refers to performing root cause analysis using identified service domains in illustrative examples. Aspects of this disclosure can also be applied to using identified service domains for determining single points of failure in a system or identifying other weaknesses, such as load balancing issues, for a system. In other instances, well-known instruction instances, protocols, structures, and techniques have not been shown in detail in order not to obfuscate the description.
INTRODUCTION
Without knowledge of a system, detecting relationships across domains for components of a service can be difficult in the absence of topology information. While some techniques for cross domain correlation of components are available, they require significant effort to pre-load or pre-define the correlation scope. Additionally, some forms of cross domain correlation are limited in the domains in which they can operate and may still require front-end design or hard coding by experts to function properly. These hard-coded correlation techniques fail when deployed to new domains (or new network configurations or network technologies).
OVERVIEW
To provide improved identification of service components and root cause analysis, a system uses event correlation to identify components belonging to a same service or service domain. The system correlates events by generating covariance matrices or by performing sequence mining with temporal databases in order to discover event patterns (or episodes of events) that occur sequentially in a time window. Components corresponding to the correlated events are identified as being part of a same service domain and can be indicated in a service domain data structure, such as a topology. The system utilizes the identified service domains during root cause analysis. The system can determine an anomalous event occurring at a lowest layer component in a service domain as a root cause or can determine an anomalous event which occurs first in an identified event sequence of a service domain as a root cause. After identifying the root cause event, the system suppresses notifications of events occurring at other components in the service domain to avoid providing superfluous notifications through network management software to an administrator.
Terminology
The term “component” as used in the description below encompasses both hardware and software resources. The term component may refer to a physical device such as a computer, server, router, etc.; a virtualized device such as a virtual machine or virtualized network function; or software such as an application, a process of an application, a database management system, etc. A component may include other components. For example, a server component may include a web service component which includes a web application component.
The description below refers to an “event” to describe a message, indication, or notification of an event. An event is an occurrence in a system or in a component of the system at a point in time. An event often relates to resource consumption and/or state of a system or system component. As examples, an event may be that a file was added to a file system, that a number of users of an application exceeds a threshold number of users, that an amount of available memory falls below a memory amount threshold, or that a component stopped responding or failed. An event indication can reference or include information about the event and is communicated by an agent or probe to a component/agent/process that processes event indications. Example information about an event includes an event type/code, application identifier, time of the event, severity level, event identifier, event description, etc.
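For illustration only, an event indication carrying the kinds of fields listed above might be modeled as a simple record; the field names, types, and example values below are assumptions made for this sketch and are not part of the disclosure.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class EventIndication:
    """Hypothetical event indication record; field names are illustrative."""
    event_id: str                                 # event identifier
    event_type: str                               # event type/code, e.g. "HIGH_CPU"
    component_id: str                             # component that generated the event
    timestamp: datetime                           # time of the event
    severity: str = "info"                        # severity level
    description: str = ""                         # event description
    metrics: dict = field(default_factory=dict)   # optional metric values

# Example: an agent reporting a high processor load event at a hypervisor.
example = EventIndication(
    event_id="evt-0001",
    event_type="HIGH_CPU",
    component_id="hypervisor-01",
    timestamp=datetime.now(timezone.utc),
    severity="warning",
    description="Processor load exceeded threshold",
    metrics={"cpu_pct": 97.0},
)
```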
The description below refers to correlating events or event correlation. The process of event correlation involves identifying events that have a connection or relationship to one another, such as a temporal connection, cause-and-effect relationship, statistical relationship, etc. Correlating events or event correlation as used herein refers to the identification of this existing relationship and does not include modifying events to establish a connection or relationship.
The description below uses the term “service domain” to refer to a collection of resources or components which are utilized in providing a service, such as an application, a database, a web server, etc. For example, a service domain can include a cloud storage application, a virtual machine which executes the application, a hypervisor underlying the virtual machine, a server hosting the hypervisor, and a router which connects the server to a network.
Example Illustrations
At stage A, the event collector 105 receives events from components in the network 104 and stores them in the event database 106. The event collector 105 may receive the events from agents of the components in the network 104.
At stage B, the event correlator 107 retrieves and correlates events in the event database 106 to identify components for service domains 108. Event correlation refers to the identification of a relationship or statistical connection between two or more events. Events can be correlated based on a determination that a first event caused a second event, that a first series of events caused a second series of events, that two events often occur nearly simultaneously, etc. The event correlator 107 can also correlate events based on a statistical, causal, or probability analysis using a statistical correlation/covariance matrix, as described in more detail below.
A correlation between events indicates a relationship between the corresponding components. Event correlation can reveal component relationships which may not be apparent from network topology information, and these relationships can be identified without requiring extensive manual input by an administrator. The event correlator 107 uses the determined relationships to identify components which are part of a same service domain. The event correlator 107 indicates the components in the service domains 108, which include the example service domain 1 115.
At stage C, the root cause analyzer 109 performs root cause analysis using the service domains 108 and events in the event database 106. The root cause analyzer 109 may monitor the event database 106 to identify one or more anomalous events occurring at the components. An anomalous event is an event that indicates a network occurrence or condition that deviates from a normal or expected value or outcome. For example, an event may have an attribute value that exceeds or falls below a determined threshold or required value, or an event may indicate that a component shut down or restarted prior to a scheduled time. Additionally, an anomalous event may be an event that indicates a network issue such as a component failure.
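As a minimal sketch of the kind of check described above (not the disclosed implementation), anomalous events can be flagged by matching failure event types or by comparing event metrics to thresholds. The event types, metric names, and threshold values here are hypothetical and reuse the illustrative EventIndication record from the Terminology section.

```python
# Hypothetical thresholds; a real deployment would configure these per component or metric.
UPPER_THRESHOLDS = {"cpu_pct": 90.0, "response_ms": 500.0}
LOWER_THRESHOLDS = {"free_mem_mb": 256.0}
FAILURE_EVENT_TYPES = {"COMPONENT_FAILURE", "UNEXPECTED_RESTART"}

def is_anomalous(event) -> bool:
    """Return True if the event indicates a failure or a metric outside its threshold."""
    if event.event_type in FAILURE_EVENT_TYPES:
        return True
    for name, value in event.metrics.items():
        if name in UPPER_THRESHOLDS and value > UPPER_THRESHOLDS[name]:
            return True
        if name in LOWER_THRESHOLDS and value < LOWER_THRESHOLDS[name]:
            return True
    return False
```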
After identifying one or more anomalous events, the root cause analyzer 109 identifies one or more service domains from the service domains 108 which include components corresponding to the anomalous events. The root cause analyzer 109 then utilizes the identified service domain(s) to aid in the root cause analysis process. For example, if an anomalous event, such as a slow response time, occurred at the application 102, the root cause analyzer 109 identifies the service domain 1 115 from the service domains 108. The root cause analyzer 109 then identifies related components in the service domain 1 115 and retrieves events for those components from the event database 106. In one implementation, the root cause analyzer 109 identifies an anomalous event occurring at a lowest layer component in the service domain 1 115 and outputs that event as a root cause event 111. For example, if a high processor load event was occurring at the hypervisor, which is a lower layer component than the application 102, the root cause analyzer 109 prioritizes the high processor load event as the root cause and outputs that event as the root cause event 111. In another implementation, the root cause analyzer 109 may utilize an event sequence or pattern indicated in the service domain 1 115 to identify which component typically starts the series of events resulting in an anomaly. If the event sequence is typically instigated by the application 102, the root cause analyzer 109 outputs an event at the application 102 as the root cause event 111. The root cause analyzer 109 may also output related events 112 which occur at other components in the service domain 1 115; however, notifications for the related events 112 may be suppressed or deemphasized so that the root cause event 111 is the event surfaced to an administrator.
If the “Session 1” of the service domain 1 201 fails or encounters an issue, root cause analysis of the session can be simplified by limiting the analysis to the components in the service domain 1 201. Additionally, other information about the service domain 1 201 may be utilized to identify a root cause. In one implementation, root causes are inferred based on a lowest layer component in the service domain 1 201 which is experiencing an issue. For example, if the “Group 1” IP-multicast is experiencing an issue and the router is experiencing an issue, it is determined that the router issue is the root cause of problems for the service domain 1 201, as the router is at a lower layer than the IP-multicast. A component's layer can be determined based on a component type, an assigned or logical OSI layer, etc. Additionally, a component's layer can be determined relative to other components. For example, a virtual machine is considered a higher layer than the hypervisor on which it executes. Similarly, a server which executes the hypervisor is at a higher layer than a router which it uses for transmitting network traffic.
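One way to realize the layer comparison described above, sketched here under the assumption that each component carries a type, is a static mapping from component type to a numeric layer, with lower numbers denoting lower layers; the types and values are illustrative only.

```python
# Hypothetical component-type-to-layer mapping; lower numbers denote lower layers.
COMPONENT_TYPE_LAYER = {
    "router": 0,
    "server": 1,
    "hypervisor": 2,
    "virtual_machine": 3,
    "application": 4,
}

def lower_layer_component(component_a: dict, component_b: dict) -> dict:
    """Return whichever component sits at the lower layer, judged by its type."""
    layer_a = COMPONENT_TYPE_LAYER.get(component_a["type"], 99)
    layer_b = COMPONENT_TYPE_LAYER.get(component_b["type"], 99)
    return component_a if layer_a <= layer_b else component_b

# Example: a router issue outranks a virtual machine issue as the inferred root cause.
router = {"id": "router-7", "type": "router"}
vm = {"id": "vm-12", "type": "virtual_machine"}
assert lower_layer_component(router, vm) == router
```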
After determining the root cause, alarms or notifications for other components in the service domain 1 201 can be suppressed, e.g., not displayed to a user. Furthermore, if “Session 2” of the service domain 2 202 is also experiencing issues, alarms or notifications for other components in the service domain 2 202 can also be suppressed and, ultimately, only a single event or notification identifying the root cause is presented, thereby avoiding overloading an interface of network management software with notifications. Events for components of the service domains may be suppressed until the issue is resolved and then event notification may continue as normal.
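A minimal sketch of that suppression behavior, assuming a lookup from component identifiers to the service domains that contain them: while a root cause remains open for a domain, notifications for the domain's other components are withheld, and normal notification resumes once the issue is resolved. The class and method names are hypothetical.

```python
class NotificationSuppressor:
    """Withhold notifications for components of service domains with an open root cause."""

    def __init__(self):
        self.open_root_causes = {}  # service_domain_id -> root cause event

    def open_issue(self, domain_id, root_cause_event):
        self.open_root_causes[domain_id] = root_cause_event

    def resolve_issue(self, domain_id):
        self.open_root_causes.pop(domain_id, None)  # resume normal notification

    def should_notify(self, event, domains_by_component):
        """Notify only if the event is itself a root cause or no containing domain is suppressed."""
        for domain_id in domains_by_component.get(event.component_id, []):
            root = self.open_root_causes.get(domain_id)
            if root is not None and root.event_id != event.event_id:
                return False
        return True
```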
At stage A, the event correlator 307 generates a first matrix, the matrix 301, based on event correlation. The event correlator 307 may use events from an event log from a first or most recent time period to generate the matrix 301. For example, the event correlator 307 may use events generated in the previous 10 minutes or events from a first 30 minutes of operation of the components. Since the matrix 301 is based on correlation from just a single time period, the matrix 301 is treated as a hypothesis and is tested/validated as additional events are received and analyzed.
During stage B, the event correlator 307 continues collecting and analyzing events over multiple time periods to generate the set of covariance matrices 302. As additional matrices are generated and correlations identified, the statistical power of the correlations increases, thereby decreasing the risk of making a Type II error. A Type II error refers to the failure to reject a false null hypothesis, that is, the failure to detect a true effect; in this instance, failing to detect a genuine correlation between events of two components such as the correlation indicated in the matrix 301. Statistical power is inversely related to beta, where beta is the probability of making a Type II error (power = 1 − β). The event correlator 307 may continue collecting and analyzing events over multiple time periods until the probability of making a Type II error falls below a threshold or, stated differently, until the statistical power has exceeded a threshold. In general, the consistency with which a correlation is identified over the multiple time periods indicates the confidence which can be placed in the identified correlation. For example, if the event correlator 307 generates three matrices over three time periods and a threshold-satisfying correlation appears in all three, then the event correlator 307 can have high confidence in the correlation and the correlation likely has high statistical power. After arriving at a statistically sound result, the event correlator 307 can output a matrix based on an aggregation of the set of covariance matrices 302 or a list of related components identified based on the correlation.
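The following is a minimal sketch of that validation idea, assuming events have already been bucketed into per-component event counts over sub-intervals of each time period. A normalized correlation matrix stands in for the covariance matrix so a single threshold can be applied; the threshold and period counts are illustrative, not values from the disclosure.

```python
import numpy as np

def correlation_matrix(counts):
    """counts: array of shape (num_components, num_buckets) of event counts for one period."""
    return np.corrcoef(counts)

def consistently_correlated(period_counts, threshold=0.8, min_periods=3):
    """Keep component pairs whose correlation exceeds the threshold in at least
    min_periods of the per-period matrices (validating the matrix 301 hypothesis)."""
    hits = {}
    for counts in period_counts:
        corr = correlation_matrix(counts)
        n = corr.shape[0]
        for i in range(n):
            for j in range(i + 1, n):
                if corr[i, j] >= threshold:
                    hits[(i, j)] = hits.get((i, j), 0) + 1
    return [pair for pair, count in hits.items() if count >= min_periods]

# Synthetic example: components 0 and 1 emit events together; component 2 is independent.
rng = np.random.default_rng(0)
period_counts = []
for _ in range(4):
    base = rng.poisson(5, size=12).astype(float)
    period_counts.append(np.vstack([base, base + rng.normal(0, 0.5, 12), rng.poisson(5, 12)]))
print(consistently_correlated(period_counts))  # expected to include the pair (0, 1)
```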
The event correlator 407 uses sequence mining on the temporal listing of events 402 in order to discover the patterns (or episodes of events) that occur sequentially in a fixed time window. This approach allows discovery of patterns which occur repeatedly with a high confidence index, which increases confidence that the mined pattern is causal rather than coincidental. The event correlator 407 may mine the data using Apriori-style algorithms to identify sequences. If an event pattern or episode is recognized within a specified timeframe with a high confidence index on causality (based on factors like number of repetitions, probabilistic distribution, etc.), then that episode is a set of events that occur one after the other and are correlated. Components associated with that set of events are then indicated as being part of a same service domain.
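A minimal sketch of windowed episode mining in that spirit (not the disclosed implementation): events are grouped into fixed time windows, candidate serial episodes are counted across windows level by level in an Apriori-like fashion, and episodes whose support meets a repetition threshold are kept. The window length and support threshold are assumptions.

```python
from collections import defaultdict

def windowize(events, window_seconds):
    """Group (timestamp_seconds, component_id, event_type) tuples into fixed time windows."""
    windows = defaultdict(list)
    for ts, comp, etype in sorted(events):
        windows[int(ts // window_seconds)].append((ts, comp, etype))
    return list(windows.values())

def occurs_in_order(window, episode):
    """True if the episode's (component, event_type) pairs appear in order within the window."""
    idx = 0
    for _, comp, etype in window:
        if (comp, etype) == episode[idx]:
            idx += 1
            if idx == len(episode):
                return True
    return False

def mine_episodes(events, window_seconds=300, min_support=3, max_len=3):
    """Apriori-like level-wise mining of serial episodes that repeat across windows."""
    windows = windowize(events, window_seconds)
    items = {(comp, etype) for w in windows for _, comp, etype in w}
    level = [(item,) for item in items]
    frequent, frequent_items = [], []
    while level:
        counts = {ep: sum(occurs_in_order(w, ep) for w in windows) for ep in level}
        kept = [ep for ep, c in counts.items() if c >= min_support]
        frequent.extend(kept)
        if not frequent_items:
            frequent_items = [ep[0] for ep in kept]   # frequent single events
        # extend surviving episodes by one frequent item (Apriori-style candidate generation)
        level = [ep + (item,) for ep in kept for item in frequent_items if len(ep) < max_len]
    return frequent
```

Episodes returned by such a miner correspond to candidate event sequences of a service domain; components appearing together in a kept episode would be grouped into the same domain.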
A network management system (“system”) retrieves events from an event log for analysis (502). The system may query an event database to retrieve events or may subscribe to an event management service which forwards batches of events to the system. The system may sort the events into a chronological order, filter for events of a particular type, or otherwise prepare the collection of events for analysis.
The system begins operations for multiple time periods represented by the events (504). The system may divide or split the events into time periods for processing. For example, the system may split the events into collections of five-minute periods. Alternatively, in some implementations, the system may divide the events into sets of a number of events, e.g., 100 events per set. The time period or collection of events currently being processed is hereinafter referred to as “events for the selected time period.”
The system identifies correlations of events for the selected time period (506). The system analyzes the events and may generate a covariance matrix for components represented in the events or perform sequence mining on the events. The system may compare/combine correlations based on the events from the selected time period to correlations generated based on events from previous time periods. The system can then generate a cumulative set of event correlations based on the analysis performed across the different time periods.
The system determines whether any event correlations satisfy a statistical threshold (508). The system may compare values representing a probability of the statistical correlations to one or more thresholds to determine whether any of the correlations have a satisfactory statistical power or confidence. Additionally, as described above, the system may determine whether the probability of making a Type II error has been sufficiently reduced for one or more of the event correlations. For event sequences, the system can determine whether the event sequence has occurred a threshold number of times or a sufficient number of times to satisfy a statistical probability that the sequence is not a random occurrence and represents correlated events.
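For the sequence case, one possible way (among many) to quantify that a repetition count is not a random occurrence is a binomial tail test: given an assumed baseline probability that the sequence appears in any single window by chance, compute how unlikely the observed number of occurrences is. The baseline probability and significance level below are assumptions for illustration only.

```python
from math import comb

def binomial_tail(occurrences, windows, p_chance):
    """Probability of at least `occurrences` hits in `windows` windows under chance alone."""
    return sum(comb(windows, k) * p_chance**k * (1 - p_chance)**(windows - k)
               for k in range(occurrences, windows + 1))

def sequence_is_significant(occurrences, windows, p_chance=0.05, alpha=0.01):
    """Treat the event sequence as correlated if chance alone is unlikely to explain it."""
    return binomial_tail(occurrences, windows, p_chance) <= alpha

# Example: a sequence observed in 9 of 10 windows is far beyond what chance would produce.
print(sequence_is_significant(occurrences=9, windows=10))  # True under these assumptions
```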
If no correlations satisfy the threshold, the system waits for an additional time period of events (510). If analyzing events from a log, the system may select a collection of events from a next time period. Alternatively, the system waits until a subsequent time period has elapsed and retrieves events for that time period or waits until another batch of events is received from an event management system. The system then continues operations at block 504.
If there are correlations which satisfy the threshold, the system generates service domains based on the threshold-satisfying event correlations (512). The system identifies components corresponding to the event correlations and generates a service domain comprising the components. The service domain may be a topology, graph data structure, or a listing which identifies the components as belonging to a same service domain. The system may include information in the service domain data structure such as identified event sequences, service or network layers associated with each of the components, statistical strength of event correlations, etc. After generating at least a first service domain based on the event correlations, the system is prepared to begin root cause analysis utilizing the generated service domain, represented by the operations at blocks 514, 516, 518, and 520. The system also continues refining and validating the event correlations and the generated service domains. For example, the system may add or remove components in the service domains based on additional event correlation. As a result, the system also returns to block 510 to continue performing event correlation in parallel with the root cause analysis operations.
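A minimal sketch of a service domain data structure of the kind described, holding the components, their relationships, and correlation metadata; the field names are illustrative, and the same information could equally be kept as a topology or a flat listing.

```python
from dataclasses import dataclass, field

@dataclass
class ServiceDomain:
    """Illustrative service domain record with components, edges, and correlation metadata."""
    domain_id: str
    components: dict = field(default_factory=dict)      # component_id -> {"type", "layer"}
    edges: set = field(default_factory=set)             # (component_id, component_id) relationships
    event_sequence: list = field(default_factory=list)  # identified (component_id, event_type) order
    correlation_strength: float = 0.0                   # statistical strength of the correlations

# Example domain mirroring the application/hypervisor/router discussion above.
domain = ServiceDomain(domain_id="service-domain-1")
domain.components.update({
    "router-7": {"type": "router", "layer": 0},
    "hypervisor-01": {"type": "hypervisor", "layer": 2},
    "app-102": {"type": "application", "layer": 4},
})
domain.edges.update({("router-7", "hypervisor-01"), ("hypervisor-01", "app-102")})
domain.event_sequence = [("app-102", "SLOW_RESPONSE"), ("hypervisor-01", "HIGH_CPU")]
domain.correlation_strength = 0.92
```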
The system detects an occurrence of an anomalous event (514). Block 514 is depicted with a dashed outline to represent that the system continually monitors for the occurrence of anomalous events as a background operation and that the operations of blocks 516, 518, and 520 may be triggered each time an anomalous event is detected. The system may detect an anomalous event by identifying an event which indicates that a component issue or failure has occurred or by comparing metrics in an event to preestablished performance thresholds.
The system selects at least a first service domain related to the anomalous event (516). The system determines the component corresponding to the event based on a component identifier or other indicator in the event associated with the component. The system then searches the generated service domains with the component identifier to retrieve one or more service domains which include the component. The system selects at least a first service domain for which to perform root cause analysis but can also perform root cause analysis for all affected service domains in parallel, as the service domains are likely all experiencing a same root cause since they share the anomalous component.
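One simple way to support that search, continuing the illustrative ServiceDomain sketch above, is an index from component identifier to the service domains that contain it; the names are hypothetical.

```python
from collections import defaultdict

def build_component_index(service_domains):
    """Map each component identifier to the service domains that include it."""
    index = defaultdict(list)
    for d in service_domains:
        for component_id in d.components:
            index[component_id].append(d.domain_id)
    return index

# Example: look up every service domain containing the component named in an anomalous event.
index = build_component_index([domain])
affected = index.get("app-102", [])   # -> ["service-domain-1"]
```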
The system retrieves events related to components in the service domain (518). The system identifies all components in the service domain and then queries an event database to retrieve recent events for the components. The system may structure the query to retrieve only anomalous events for the components.
The system identifies a root cause of the anomalous event based on the events within the service domain (520). In one implementation, the system can perform root cause analysis by identifying a lowest layer component in the service domain which is experiencing an anomalous event and designating that component as the root cause. In another implementation, the system may analyze an event sequence indicated in the service domain to identify the earliest event in the sequence which matches an event of one of the components. For example, if the sequence begins with an event of type 3 at a component A, the system determines whether an event of type 3 recently occurred at the component A. The system may continue through the sequence to determine whether there is a matching recent event for each event in the sequence. If there is a matching sequence of events, the system determines that the event at the component corresponding to the first event in the sequence is the root cause. The system outputs the root cause event identified as a result of the analysis and suppresses other alarms or events for the service domain. The system may also perform automated remedial actions to correct the issue. For example, if the root cause event was a router issue, the system may remotely reboot the router or invoke a script for resetting a port on the router. After identifying the root cause, the system returns to block 514 until another anomalous event is detected.
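The two root cause strategies described above can be sketched as follows, reusing the illustrative EventIndication and ServiceDomain records from earlier; this is an interpretation under those assumptions, not the disclosed implementation.

```python
def root_cause_by_lowest_layer(domain, anomalous_events):
    """Pick the anomalous event whose component sits at the lowest layer of the domain."""
    candidates = [e for e in anomalous_events if e.component_id in domain.components]
    if not candidates:
        return None
    return min(candidates, key=lambda e: domain.components[e.component_id]["layer"])

def root_cause_by_sequence(domain, recent_events):
    """If recent events match the domain's identified sequence, the first matching event
    is treated as the root cause; otherwise return None."""
    matched, idx = [], 0
    for event in sorted(recent_events, key=lambda e: e.timestamp):
        if idx < len(domain.event_sequence) and \
                (event.component_id, event.event_type) == domain.event_sequence[idx]:
            matched.append(event)
            idx += 1
    return matched[0] if idx == len(domain.event_sequence) and matched else None
```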
Variations
The flowcharts are provided to aid in understanding the illustrations and are not to be used to limit the scope of the claims. The flowcharts depict example operations that can vary within the scope of the claims. Additional operations may be performed; fewer operations may be performed; the operations may be performed in parallel; and the operations may be performed in a different order. For example, the operations depicted in blocks 506 and 520 can be performed in parallel or concurrently, as event correlation continues alongside root cause analysis.
As will be appreciated, aspects of the disclosure may be embodied as a system, method or program code/instructions stored in one or more machine-readable media. Accordingly, aspects may take the form of hardware, software (including firmware, resident software, micro-code, etc.), or a combination of software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” The functionality presented as individual modules/units in the example illustrations can be organized differently in accordance with any one of platform (operating system and/or hardware), application ecosystem, interfaces, programmer preferences, programming language, administrator preferences, etc.
Any combination of one or more machine readable medium(s) may be utilized. The machine readable medium may be a machine readable signal medium or a machine readable storage medium. A machine readable storage medium may be, for example, but not limited to, a system, apparatus, or device that employs any one of or a combination of electronic, magnetic, optical, electromagnetic, infrared, or semiconductor technology to store program code. More specific examples (a non-exhaustive list) of the machine readable storage medium would include the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a machine readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. A machine readable storage medium is not a machine readable signal medium.
A machine readable signal medium may include a propagated data signal with machine readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A machine readable signal medium may be any machine readable medium that is not a machine readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a machine readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as the Java® programming language, C++ or the like; a dynamic programming language such as Python; a scripting language such as Perl programming language or PowerShell script language; and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on a stand-alone machine, may execute in a distributed manner across multiple machines, and may execute on one machine while providing results and/or accepting input on another machine.
The program code/instructions may also be stored in a machine readable medium that can direct a machine to function in a particular manner, such that the instructions stored in the machine readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
While the aspects of the disclosure are described with reference to various implementations and exploitations, it will be understood that these aspects are illustrative and that the scope of the claims is not limited to them. In general, techniques for event-based identification of service domains and root cause analysis as described herein may be implemented with facilities consistent with any hardware system or hardware systems. Many variations, modifications, additions, and improvements are possible.
Plural instances may be provided for components, operations or structures described herein as a single instance. Finally, boundaries between various components, operations and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the disclosure. In general, structures and functionality presented as separate components in the example configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements may fall within the scope of the disclosure.
Use of the phrase “at least one of” preceding a list with the conjunction “and” should not be treated as an exclusive list and should not be construed as a list of categories with one item from each category, unless specifically stated otherwise. A clause that recites “at least one of A, B, and C” can be infringed with only one of the listed items, multiple of the listed items, and one or more of the items in the list and another item not listed.
Claims
1. A method comprising:
- correlating events generated in a network to identify components of a service domain;
- based on detecting a first event indicating an issue at a first component in the service domain, retrieving a first set of event notifications indicating events occurring at the components within the service domain;
- identifying a root cause of the issue at the first component based, at least in part, on the components indicated in the service domain and the first set of event notifications; and
- suppressing generation of event notifications for the components in the service domain.
2. The method of claim 1, wherein identifying the root cause of the first event based, at least in part, on the components indicated in the service domain and the first set of event notifications comprises:
- identifying anomalous events in the first set of event notifications;
- identifying a lowest layer component of the service domain which experienced a second event of the identified anomalous events; and
- indicating the second event as the root cause of the issue at the first component.
3. The method of claim 2 further comprising:
- identifying component types of the components in the service domain; and
- assigning the components to layers based, at least in part, on the component types of the components in the service domain.
4. The method of claim 1, wherein identifying the root cause of the first event based, at least in part, on the components indicated in the service domain and the first set of event notifications comprises:
- identifying a sequentially first event indicated in an event sequence associated with the service domain;
- identifying a second event in the first set of event notifications which matches the sequentially first event indicated in the event sequence; and
- indicating the second event as the root cause of the issue at the first component.
5. The method of claim 1, wherein correlating the events generated in the network to identify components of the service domain comprises:
- identifying a repeated event sequence in the events generated in the network; and
- indicating components corresponding to the event sequence as the components of the service domain.
6. The method of claim 1, wherein correlating the events generated in the network to identify components of the service domain comprises:
- generating a first covariance matrix with entries indicating a covariance of events between components in the network;
- identifying the entries in the first covariance matrix which exceed a threshold; and
- indicating components corresponding to the entries in the first covariance matrix which exceed the threshold as the components of the service domain.
7. The method of claim 6 further comprising:
- generating a set of covariance matrices over multiple operational periods of the components in the network; and
- validating the entries in the first covariance matrix which exceed the threshold based, at least in part, on entries in the set of covariance matrices generated over the multiple operational periods.
8. The method of claim 1 further comprising, after determining that the issue at the first component has been resolved, resuming generation of event notifications for the components in the service domain.
9. The method of claim 1 further comprising generating a graph data structure to represent the components of the service domain, wherein the graph data structure comprises nodes to represent the components and edges to indicate relationships between the components.
10. One or more non-transitory machine-readable media comprising program code, the program code to:
- identify a first event pattern in an event log comprising events generated by devices in a network;
- determine a first service domain based, at least in part, on devices corresponding to events in the first event pattern;
- based on detecting an issue at a first device in the first service domain, determine whether an event generated by the first device matches an event in the first event pattern; and
- based on determining that an event generated by the first device matches an event in the first event pattern, indicate a second device in the first service domain as a root cause of the issue at the first device, wherein the second device corresponds to a first event in the first event pattern.
11. The machine-readable media of claim 10, wherein the program code to identify the first event pattern in the event log comprises program code to:
- divide the events in the event log into multiple collections of events based, at least in part, on a time window; and
- determine that the first event pattern repeats in a threshold number of the collections of events prior to determining the first service domain.
12. The machine-readable media of claim 10 further comprising program code to:
- identify a second event pattern in the event log; and
- determine a second service domain based, at least in part, on devices corresponding to events in the second event pattern.
13. The machine-readable media of claim 10, wherein the first event pattern is a temporally sequential series of events.
14. The machine-readable media of claim 10 further comprising program code to suppress events for the devices of the first service domain after detecting the issue at the first device.
15. An apparatus comprising:
- a processor; and
- a machine-readable medium having program code executable by the processor to cause the apparatus to: correlate events generated in a network to identify components of one or more service domains; detect a first event indicating an issue at a first component in the network; identify at least a first service domain of the one or more service domains which comprises the first component; and identify a root cause of the issue at the first component based, at least in part, on the components indicated in the first service domain.
16. The apparatus of claim 15 further comprising program code to:
- identify a second service domain which comprises the first component; and
- suppress generation of event notifications for the components in the first service domain and the second service domain until the issue at the first component is resolved.
17. The apparatus of claim 15, wherein the program code to identify the root cause of the first event based, at least in part, on the components indicated in the first service domain comprises program code to:
- retrieve a first set of event notifications for anomalous events occurring at the components in the first service domain;
- identify a lowest layer component of the first service domain which experienced a second event indicated in the first set of event notifications; and
- indicate the second event as the root cause of the issue at the first component.
18. The apparatus of claim 17 further comprising program code to:
- identify component types of the components in the first service domain; and
- assign the components to layers based, at least in part, on the component types of the components in the first service domain.
19. The apparatus of claim 15, wherein the program code to identify the root cause of the first event based, at least in part, on the components indicated in the first service domain comprises program code to:
- identify a sequentially first event indicated in an event sequence associated with the first service domain;
- identify a second event occurring at one of the components in the first service domain which matches the sequentially first event indicated in the event sequence; and
- indicate the second event as the root cause of the issue at the first component.
20. The apparatus of claim 15, wherein the program code to correlate events generated in a network to identify components of one or more service domains comprises program code to:
- identify one or more event sequences in the events generated in the network; and
- generate a service domain for each of the identified event sequences which correspond to different sets of components.
Type: Application
Filed: Sep 28, 2018
Publication Date: Apr 2, 2020
Inventors: Balram Reddy Kakani (Hyderabad), Ravindra Kumar Puli (Hyderabad), Smrati Gupta (San Jose, CA)
Application Number: 16/145,553