METHOD AND APPARATUS FOR TELECOMMUNICATIONS NETWORK PERFORMANCE ANOMALY EVENTS DETECTION AND NOTIFICATION
In order to provide an early and more accurate determination of network problems, current NPI OMs are compared with samples of recent historical NPI OMs so that changes in the NPI OM are detected based on current overall network conditions rather than on conditions that may have existed at statistically insignificant earlier operational periods. By constantly adjusting a performance threshold, against which the current NPI OM is compared, by using a smaller and very recent sampling of NPIs (in the case of sudden and abrupt performance-NPI degradation detection) or a larger and greater number of NPIs over a wider time period (in the case of slow and persistent NPI degradation detection) to establish the threshold, detection results are more accurate and meaningful.
This application is a continuation of U.S. application Ser. No. 13/383,971, filed on Dec. 21, 2012, which claims the priority of PCT/US2010/042192, filed on Jul. 15, 2010, which claims the priority of U.S. Provisional application No. 61/225,672, filed Jul. 15, 2009, the entire contents of which are incorporated fully herein by reference.
FIELD OF THE INVENTION
The invention pertains to the detection of network performance anomaly events based on Network Performance Indicator (NPI) Operational Measurements (OM).
BACKGROUND OF THE INVENTION
As communications technology has evolved, communications technology users have become increasingly reliant on the ability to communicate almost instantaneously with others all over the globe. With this technology seemingly available everywhere, users of network resources have come to perceive performance delays of as little as 2-3 seconds as unacceptable. Time delays in data transfers and dropped phone calls in mobile telephone systems irritate and alienate customers, and thus service providers try to pay close attention to performance problems and correct them as quickly as possible.
Operational Measurements (OM's) in the context of network performance are network parameters that are measured and used as Network Performance Indicators (NPI's). These measurements can include call success rates, call termination rates, Quality of Service (QOS) measurements, traffic and routing measurements, network outage statistics, and the like. These OM's are typically measured over a fixed period of time, referred to as “OM transfer periods”.
Early detection of network performance anomalies could help avoid network outage events. A slow and persistent degradation of NPIs can indicate an issue such as a memory leak. Additionally, simultaneous large, abrupt, and sudden changes in, for example, the call success rates from multiple NPIs can indicate the onset of outage events (the outage can be partial, i.e., losing >10% of capacity, or total). Therefore, it would be desirable to utilize the NPI process to help avoid or reduce the outage downtime of the network, and other problems such as memory leaks, by devising a way to automatically process the NPIs to detect the occurrence of slow and persistent NPI OM degradation, severe and sudden degradation in NPI OM, and potential outage events, and to raise an appropriate log or alarm to alert the operator of the observed performance anomaly so that it can be investigated and dealt with in a timely manner.
There are many relevant existing stochastic process control algorithms that are routinely used in various industries to monitor product quality such as Shewhart, EWMA, and Page's CUSUM control charts. However, these standard quality control algorithms only deal with detecting deviations of the monitored quality metric from a fixed (known or unknown) mean value that is constant over time. In the NPI performance anomaly detection problem, the mean value of success rates can fluctuate slowly over time in normal operation (e.g., due to the change in traffic level or services usage pattern during the day), and thus only a statistically significant large and abrupt degradation, or a slow but steady degradation, from the most recent average success rates would indicate a possible onset of a new outage. This time-varying statistical characteristic of the NPI prevents direct application of these traditional stochastic process control algorithms.
SUMMARY OF THE INVENTION
In order to provide an early and more accurate determination of network problems, current NPI OMs are compared with samples of recent historical NPI OMs so that changes in the NPI OM are detected based on current overall network conditions rather than on conditions that may have existed at statistically insignificant earlier operational periods. By constantly adjusting a performance threshold, against which the current NPI OM is compared, by using a smaller and very recent sampling of NPIs (in the case of sudden and abrupt performance-NPI degradation detection) or a larger and greater number of NPIs over a wider time period (in the case of slow and persistent NPI degradation detection) to establish the threshold, detection results are more accurate and meaningful.
Other aspects and features of the present invention will become apparent to those ordinarily skilled in the art upon review of the following description of specific embodiments of the invention in conjunction with the accompanying figures.
The present invention will now be described in connection with an exemplary embodiment for a mobile (cellular) telephone network. However, it should be noted that the present invention is broadly applicable to many other types of network degradation detection schemes and to networks other than mobile telephone networks.
Core network 101 includes a mobile switching center 101-1, a data support node 101-2, a home location register 101-3, and other network functionality 101-4. In addition, however, core network 101 also includes network degradation detection processor 101-5, which can comprise a severe and abrupt NPI degradation detection node, an NPI slow and persistent degradation detection node, or a combination of both. Network degradation processor 101-5 is a processor that is configurable to perform the steps described in connection with
As shown at step 201, processing begins by reading and storing the current (most recent) NPI OM, e.g., an operational measurement related to call success rate for the telephone network. In a typical large telephone network operating normally, call failures, SMS failures, handover failures, etc., will be occurring every hour, but on a per-second basis, there typically will not be very many such failures. Thus, in such a system, the NPI OMs will be taken every few minutes, every 10 minutes, etc. It is understood that the choice of how often to take such measurements is within the discretion of the network operator.
At step 203, the mean (average) value of the last n immediately preceding NPI OMs is calculated by adding up the NPI OM values of the last n immediately-preceding NPI OMs and dividing the sum by n. For the purpose of detecting severe and abrupt degradation, the value of n is small, e.g., 2 or 3, so that the current NPI OM value is being compared to only the last few NPI OMs rather than a larger window spanning a larger time period. Thus, for example, if the network has NPI OMs taken every 10 minutes, and the current NPI OM value is taken at time t and if n is decided to be 3, then the NPI OM values at t-10 minutes, t-20 minutes, and t-30 minutes would be combined and then divided by 3 to determine the mean value for the purpose of step 203. In the example given above, it is suggested that the value of n should be small, e.g., 2 or 3. However, it is understood that the value of n can be changed depending on the needs of the network operator, and an n value of 10 could, for example, still be considered “small” for the purpose of this invention.
At step 205, the variance value of the last n NPI OMs is calculated by taking the standard deviation of the last n NPI OMs (the NPI OMs taken at t-10 minutes, t-20 minutes, and t-30 minutes in this example). As is clear, the processes of steps 203 and 205 are calculations of a moving average for the mean and variance values of the NPI OMs within the moving average window defined by the value of n.
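Steps 203 and 205 can be illustrated with a minimal Python sketch. The function name `window_stats`, the sample values, and the choice of the population standard deviation (`pstdev`) are assumptions made for illustration only; the text does not specify which estimator is used.

```python
from statistics import fmean, pstdev

def window_stats(previous, n=3):
    """Return (mean, standard deviation) of the last n NPI OM values.

    `previous` holds the NPI OMs that immediately precede the current one,
    oldest first. The population standard deviation is an assumed choice.
    """
    if len(previous) < n:
        raise ValueError("need at least n preceding NPI OM samples")
    window = previous[-n:]
    return fmean(window), pstdev(window)

# Hypothetical call success rates (percent) at t-30, t-20 and t-10 minutes.
mean, std = window_stats([98.0, 99.0, 97.0], n=3)
```

Calling the function again after each new measurement, with the newest sample appended to `previous`, moves the window forward by one measurement period.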
At step 207, a Severe and Abrupt Performance Degradation Threshold (SAPDT) is calculated using the moving averages calculated in steps 203 and 205. The threshold essentially identifies what was, in this example, a “normal” rate of call success (and thus call failures) over the last n (3 in this example) NPI measurement periods and establishes a predetermined rate of call success (and failure) that will be considered as acceptable. Conversely, this also establishes the point at which the rate of call success (and failure) has become unacceptable. This enables a comparison of what was “normal” degradation over the previous n sample periods with what the current level of degradation is (described below with respect to steps 211 and 213). Specific examples of algorithms for performing the calculation of the SAPDT are provided later in this application.
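One way to picture the threshold calculation and the subsequent comparison is the following sketch. It assumes, purely for illustration, that the SAPDT is the moving mean minus a multiple `c` of the moving standard deviation; the multiplier `c`, the function names, and the sample values are hypothetical and are not taken from the algorithms referred to in this application.

```python
from statistics import fmean, pstdev

def sapdt(previous, n=3, c=3.0):
    """Assumed threshold form: mean - c * std over the last n NPI OMs."""
    window = previous[-n:]
    return fmean(window) - c * pstdev(window)

def is_severe_degradation(current, previous, n=3, c=3.0):
    """True when the current NPI OM has dropped below the SAPDT."""
    return current < sapdt(previous, n, c)

# Hypothetical call success rates (percent) for the three preceding periods.
previous = [98.0, 99.0, 97.0]
threshold = sapdt(previous)
alarm = is_severe_degradation(90.0, previous)      # large, sudden drop
no_alarm = is_severe_degradation(97.5, previous)   # within normal variation
```

Because the threshold is recomputed from the most recent window at every measurement period, it follows the normal drift of the success rate during the day. A practical implementation would likely also need a floor on the standard deviation so that a perfectly flat window does not make every small dip look severe.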
At step 211, a comparison is made to determine if the current NPI OM has crossed the SAPDT, indicating the existence of a severe performance degradation relative to the current moving average window. If the comparison indicates the existence of a severe performance degradation, the process proceeds to step 213 where a severe performance degradation alarm is triggered, and any action desired can be taken by the network operator or other monitoring entity. If the comparison indicates no existence of a severe performance degradation, the process proceeds back to step 201, where the moving window is “moved” and the process begins again on the next current NPI and on the new set of n NPI OMs. Using the process of the embodiment described with respect to
In another embodiment of the invention, described now with respect to
Referring to
As can be seen, if an SPD alarm has been issued for a particular NPI OM, this fact is conveyed to a summing process 307 (for example, if an SPD alarm has been issued for a particular NPI, a “1” can be forwarded to the summing process 307, and if no SPD alarm has been issued for a particular NPI, a “0” can be forwarded to the summing process 307).
At step 307, the summing process determines how many of the NPI OMs are indicating an SPD condition as indicated by the issuance of an SPD alarm. At step 309, a determination is made as to whether or not a Potential Outage Alarm threshold has been met. This threshold can be arbitrarily set by the network operator so that a certain number of simultaneous SPD alarms during the same NPI measurement period must occur before an alarm condition is considered to exist. For example, if the Potential Outage Alarm threshold is set at 3, and if the system is set to issue a Potential Outage Alarm if the threshold is exceeded, then if 4 or more NPI OMs cause SPD alarms to be triggered at the same time, at step 313 a potential outage alarm is issued so that investigations and/or corrective actions can be instituted, and then at step 311 the summing process 307 is reset. If the potential outage alarm threshold is not exceeded (i.e., if the number of SPD alarms issued for a particular NPI measurement period is 3 or less), at step 311 the summing process 307 is reset (e.g., the sum is returned to zero to await the next NPI measurement period data).
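The summing and comparison of steps 307 through 313 can be sketched as follows. The function name and the per-NPI labels are hypothetical; the "exceeded" comparison and the threshold of 3 follow the example given above.

```python
def potential_outage(spd_flags, threshold=3):
    """Return True when the count of SPD alarms in one NPI measurement
    period exceeds the Potential Outage Alarm threshold.

    `spd_flags` holds one value per NPI: 1 if an SPD alarm was issued for
    that NPI in this period, 0 otherwise. The count is computed afresh on
    every call, which plays the role of resetting the sum at step 311.
    """
    count = sum(1 for flag in spd_flags if flag)
    return count > threshold

# Hypothetical NPIs for one measurement period.
period_flags = {
    "call_setup": 1,
    "handover": 1,
    "sms_delivery": 1,
    "paging": 1,
    "location_update": 0,
}
outage = potential_outage(period_flags.values())   # 4 alarms > 3
```

With only three of the flags set, the same call returns False, matching the "3 or less" case in the text.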
To summarize the operations performed by the processes of
An additional embodiment of the invention is described with reference to
Referring to
At step 407, the slope of the data trend line resulting from the smoothed data is determined by, for example, using a linear least squares fit for some or all of the data trend line. At step 409, the determined slope is compared with a predetermined Slope and Persistent Degradation threshold. At step 411, if the determined slope meets or exceeds the predetermined Slope and Persistent Degradation threshold, a Long Term Performance Anomaly alarm is triggered so that investigation and/or corrective measures can be taken. If at step 411 it is determined that the determined slope does not meet or exceed the threshold, the process continues back to step 401 to perform the same steps for the next current NPI value.
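The slope test can be illustrated with a short sketch. It assumes the smoothing is a simple moving average, that samples are evenly spaced (so the sample index serves as the time axis), and that the threshold is expressed as a positive rate of decline per measurement period; these choices, the function names, and the data are illustrative assumptions rather than details taken from this application.

```python
def moving_average(values, window):
    """Simple moving average used here as the smoothing (low pass) step."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

def least_squares_slope(ys):
    """Slope of the least squares line through (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    return sxy / sxx

def long_term_anomaly(values, window, decline_threshold):
    """True when the smoothed trend falls at least decline_threshold per period."""
    trend = moving_average(values, window)
    return -least_squares_slope(trend) >= decline_threshold

# Hypothetical success rate losing 0.5 percentage points per period.
declining = [100.0 - 0.5 * i for i in range(20)]
flat = [99.0] * 20
slope = least_squares_slope(moving_average(declining, 5))
```

Smoothing before fitting keeps ordinary daily fluctuation from producing a spurious slope, at the cost of a detection delay roughly proportional to the window length.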
For any of the alarm conditions described with respect to
What follows are examples of specific algorithms and elements that can be used to perform the processes described in
An NPI OMs dynamic thresholding approach is described that automatically detects an onset/abatement of network performance anomaly events utilizing Network Performance Indicators Operation Measurement (NPI OMs) as input data. Network performance anomaly events considered in this document are the following:
1. Severe and abrupt degradation of the NPI OM,
2. Potential network outage events,
3. Slow and persistent degradation of the NPI OM.
The following are the NPI performance anomaly onset and abatement events defined in this algorithm:
A1i) Severe and abrupt performance degradation event detected for the ith NPI
A2i) Slow and persistent performance degradation event detected for the ith NPI
A3) Network outage event detected (detection of simultaneous severe and abrupt performance degradation from the NPIs in some non-empty set K)
R1i) Severe and abrupt performance anomaly event recovery detected for the ith NPI
R2i) Slow and persistent performance anomaly event recovery detected for the ith NPI
R3) Network outage event recovery detected (detection of recovery from performance degradation for every NPI in the non-empty set K)
R4i) Recovery to long term average performance event detected for the ith NPI
The G/U and CDMA Voice Core Network Performance Indicator (NPI) process provides measurement of various call success rates at the end of every k OM transfer periods. A slow and persistent degradation of NPIs can indicate an issue such as a memory leak. Early detection of the network performance anomaly problem could help avoid network outage events. Additionally, simultaneous large, abrupt, and sudden changes in the call success rates from multiple NPIs can indicate the onset of an outage event (the outage can be partial, i.e., losing >10% of capacity, or total). Therefore, it is of interest to utilize the NPI process to help avoid or reduce the outage downtime of the network by devising an algorithm to automatically process the NPI data in order to detect the occurrence of slow and persistent NPI OM degradation, severe and sudden degradation in NPI OM, and potential outage events, and to raise an appropriate log or alarm to alert the operator of the observed performance anomaly so that it can be investigated and dealt with in a timely manner.
3. Detection of Severe and Abrupt NPI Degradation and Potential Outage Event
An algorithm for detecting severe and abrupt degradation of NPI and for detecting a potential outage event is summarized in the next two Sections. Section 3.1 presents an algorithm assuming floating point arithmetic is used, whereas Section 3.2 presents an algorithm for when the calculation is to be performed using integer arithmetic.
3.1 Summary of the Severe and Abrupt NPI Degradation/Outage Detection Algorithm (Floating Point Implementation)
3.2 Summary of a Severe & Abrupt NPI Degradation/Outage Detection Algorithm (Integer Arithmetic Implementation)
4.1 Summary of a Slow and Persistent NPI Degradation Detection Algorithm (Floating Point Implementation)
4.2 Summary of a Slow and Persistent NPI Degradation Detection Algorithm (Integer Arithmetic Implementation: Signed 32 Bits)
Once the system has entered the network performance anomaly state and the alarm has been set, in order to declare that the network performance anomaly event has abated and the system has entered the ‘normal’ state it is necessary to make sure that the system performance has reverted back to its most recent long term average performance. To achieve this recovery detection goal, a test statistic constructed from a 7-day moving average estimate of the mean and variance of each of the NPI OMs can be used. Suppose there are J samples of the OMs over the 7-day period, then the sample mean value of the m-th NPI at the time instant k is given by
Let $q_{m,k} := \sum_{i=k-J+1}^{k} u_{m,i}^{2}$; then the sample variance of the m-th NPI at the time instant k is given by
With the above recursive relations for sample mean and sample variance, it is straight forward to construct a recovery to long term performance detection algorithm.
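The recursive relations themselves are not reproduced in this text. As a hedged illustration only, a sliding window of J samples can be maintained with two running sums, the sum of the samples $s_{m,k}$ and the sum of squares $q_{m,k}$, each updated by adding the newest sample and subtracting the one that leaves the window. The class name `WindowStats`, the use of the unbiased $(J-1)$ divisor, and the sample data below are assumptions, not the application's own formulas.

```python
from collections import deque

class WindowStats:
    """Running mean and variance of one NPI over the last J samples."""

    def __init__(self, J):
        if J < 2:
            raise ValueError("window length J must be at least 2")
        self.J = J
        self.samples = deque()
        self.s = 0.0   # running sum of the samples in the window
        self.q = 0.0   # running sum of squares (q_{m,k} in the text)

    def update(self, u):
        """Add the newest sample and drop the oldest once the window is full."""
        self.samples.append(u)
        self.s += u
        self.q += u * u
        if len(self.samples) > self.J:
            old = self.samples.popleft()
            self.s -= old
            self.q -= old * old

    def mean(self):
        return self.s / len(self.samples)

    def variance(self):
        """Sample variance from the running sums (assumed J-1 divisor)."""
        n = len(self.samples)
        if n < 2:
            return 0.0
        m = self.s / n
        # Clamp at zero to absorb small negative floating point residue.
        return max(0.0, (self.q - n * m * m) / (n - 1))

stats = WindowStats(J=3)
for value in [1.0, 2.0, 3.0, 4.0]:
    stats.update(value)   # window now holds 2.0, 3.0, 4.0
```

Each update costs constant time regardless of J, which matters for a 7-day window. The subtraction in the variance can lose precision when the mean is large relative to the spread, so a production version might periodically recompute the sums from the stored samples.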
5.1 Summary of a Recovery of NPI Long Term Average Performance Detection Algorithm (Floating Point Implementation)
5.2 Summary of a Recovery of NPI Long Term Average Performance Detection Algorithm (Integer Arithmetic Implementation)
Use the same ordering of the M NPI as the severe & abrupt NPI degradation/outage detection.
As set forth above, a scheme for network performance anomaly detection has been disclosed based on 1) detecting severe and sudden change in the NPI OM and 2) detecting slow and persistent degradation of the NPI OM. Furthermore, utilizing multiple NPIs helps reduce the false alarm probability while maximizing the probability of outage detection. Each NPI has two network performance degradation thresholds. The first dynamically adapts to changes in the most recent mean and variance values of the NPI success rate in the severe and abrupt NPI degradation detection. The severe and abrupt performance degradation decision rule compares the most recent measurement of NPI call success rate against this dynamic threshold value in order to determine if there is a statistically significant large and sudden change from the most recent mean value. At any particular time instant at which the NPI call success rate drops below the threshold, a network performance anomaly alarm is issued for that NPI.
The second dynamic threshold for network performance anomaly detection uses a low pass filter (i.e., a long moving average window) to smooth out the normal daily fluctuation of the NPIs in order to discriminate a real from a fictitious slow downward trend in the NPI OM performance. To ascertain that the network performance anomaly event has abated, three abatement dynamic thresholds are proposed. The first two thresholds concern the detection of the abatement event related to the severe and abrupt NPI degradation detection and the slow and persistent NPI degradation detection algorithms. The last abatement dynamic threshold is used to check whether the network performance has recovered to its long term average performance level. Once a network performance anomaly alarm has been set, it can only be cleared when the relevant NPI OM value exceeds all three abatement thresholds.
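The clearing rule can be sketched in a few lines. The function name and the threshold values are hypothetical, and the three thresholds are taken as already computed; how each one is derived is not shown here.

```python
def alarm_can_clear(current, abatement_thresholds):
    """True only when the current NPI OM exceeds every abatement threshold."""
    return all(current > t for t in abatement_thresholds)

# Hypothetical abatement thresholds (percent call success rate):
# severe and abrupt, slow and persistent, long term average recovery.
thresholds = (95.0, 96.0, 96.5)

cleared = alarm_can_clear(97.0, thresholds)        # above all three
still_raised = alarm_can_clear(96.2, thresholds)   # fails the third
```

Requiring all three comparisons to pass keeps an alarm from clearing on a brief rebound while the longer term level is still depressed.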
The above-described steps can be implemented using standard well-known programming techniques. The novelty of the above-described embodiment lies not in the specific programming techniques but in the use of the steps described to achieve the described results. Software programming code which embodies the present invention is typically stored in permanent storage. In a client/server environment, such software programming code may be stored with storage associated with a server. The software programming code may be embodied on any of a variety of known media for use with a data processing system, such as a diskette, or hard drive, or CD ROM. The code may be distributed on such media, or may be distributed to users from the memory or storage of one computer system over a network of some type to other computer systems for use by users of such other systems. The techniques and methods for embodying software program code on physical media and/or distributing software code via networks are well known and will not be further discussed herein.
It will be understood that each element of the illustrations, and combinations of elements in the illustrations, can be implemented by general and/or special purpose hardware-based systems that perform the specified functions or steps, or by combinations of general and/or special-purpose hardware and computer instructions.
These program instructions may be provided to a processor to produce a machine, such that the instructions that execute on the processor create means for implementing the functions specified in the illustrations. The computer program instructions may be executed by a processor to cause a series of operational steps to be performed by the processor to produce a computer-implemented process such that the instructions that execute on the processor provide steps for implementing the functions specified in the illustrations. Accordingly, the figures support combinations of means for performing the specified functions, combinations of steps for performing the specified functions, and program instruction means for performing the specified functions.
Although the present invention has been described with respect to a specific preferred embodiment thereof, various changes and modifications may be suggested to one skilled in the art and it is intended that the present invention encompass such changes and modifications as fall within the scope of the appended claims.
Claims
1. A system for detecting performance anomaly events in a communication network, the system comprising at least one processor coupled to the communication network and configured:
- to determine a dynamic performance degradation threshold based on a plurality of Network Performance Indication Operations Measurements (NPI OMs) immediately preceding current NPI OMs;
- to compare the current NPI OMs with the dynamic performance degradation threshold; and
- to generate an indication of an alarm condition when the comparison indicates a performance degradation condition.
2. The system of claim 1, wherein the at least one processor is configured to acquire the current NPI OMs.
3. The system of claim 1, wherein the processor is configured to repeatedly acquire NPI OMs during NPI measurement periods, and wherein the dynamic performance degradation threshold is determined using an average value and a variance value of the NPI OMs during no more than ten NPI measurement periods immediately preceding a current measurement period.
4. The system of claim 1, wherein the at least one processor is configured to repeatedly acquire NPI OMs during NPI measurement periods, and wherein the dynamic performance degradation threshold is determined using an average value and a variance value of the NPI OMs during no more than five NPI measurement periods immediately preceding a current measurement period.
5. The system of claim 1, wherein the performance degradation condition is deemed to be a severe and abrupt performance degradation.
6. The system of claim 1, wherein the at least one processor is configured:
- to monitor a number of indications of alarm conditions generated during a particular NPI measurement period;
- to compare the number of indications of alarm conditions during the particular NPI measurement period with a potential outage alarm threshold value; and
- to generate a potential outage alarm when the comparison of the number of indications of alarm conditions and the potential outage alarm threshold value indicates a potential outage condition.
7. The system of claim 6, wherein the potential outage alarm threshold corresponds to occurrence of a predetermined plurality of alarm condition indications being generated within one NPI measurement period.
8. The system of claim 6, wherein the potential outage alarm threshold is predetermined.
9. The system of claim 8, wherein the predetermined potential outage alarm threshold is provided to the at least one processor.
10. The system of claim 1, wherein the at least one processor is configured:
- to compare NPI OMs made after generation of an indication of an alarm condition to at least one abatement threshold; and
- to cease generation of the alarm condition when the comparison indicates an abatement condition.
11. The system of claim 10, wherein:
- the at least one processor is configured to compare the NPI OMs made after generation of an indication of an alarm condition to at least one abatement threshold by comparing the NPI OMs to multiple abatement thresholds; and
- the at least one processor is configured to cease generation of the alarm condition when the comparison indicates an abatement condition by ceasing generation of the alarm condition only when the comparison to all abatement thresholds indicates an abatement condition.
12. The system of claim 11, wherein the multiple abatement thresholds comprise:
- a first abatement threshold deemed to indicate abatement of a severe and sudden degradation;
- a second abatement threshold deemed to indicate abatement of a slow and persistent degradation; and
- a third abatement threshold deemed to indicate recovery of long term average performance.
13. A system for detecting performance anomaly events in a communication network comprising at least one processor configured:
- to determine a data trend line based on a moving average of Network Performance Indication Operations Measurements (NPI OMs) taken over a preceding interval of time;
- to determine a slope of the data trend line;
- to compare the slope of the data trend line with a degradation threshold; and
- to generate an indication of an alarm condition when the comparison indicates a long-term performance anomaly.
14. The system of claim 13, wherein the long-term performance anomaly is deemed to be a slow and persistent degradation of performance.
15. The system of claim 13, wherein the degradation threshold is predetermined.
16. The system of claim 15, wherein the predetermined degradation threshold is provided to the at least one processor.
17. The system of claim 13, wherein the at least one processor is configured to acquire the current NPI OMs.
18. The system of claim 17, wherein the at least one processor is configured to repeatedly acquire NPI OMs during NPI measurement periods and to determine the data trend line based on a moving average of NPI OMs acquired by the at least one processor during more than ten NPI measurement periods immediately preceding a current measurement period.
19. The system of claim 13, wherein the at least one processor is configured:
- to compare NPI OMs made after generation of an indication of an alarm condition to at least one abatement threshold; and
- to cease generation of the alarm condition when the comparison indicates an abatement condition.
20. The system of claim 19, wherein:
- the at least one processor is configured to compare the NPI OMs made after generation of an indication of an alarm condition to at least one abatement threshold by comparing the NPI OMs to multiple abatement thresholds; and
- the at least one processor is configured to cease generation of the alarm condition when the comparison indicates an abatement condition by ceasing generation of the alarm condition only when the comparison to all abatement thresholds indicates an abatement condition.
21. The system of claim 20, wherein the multiple abatement thresholds comprise:
- a first abatement threshold deemed to indicate abatement of a severe and sudden degradation;
- a second abatement threshold deemed to indicate abatement of a slow and persistent degradation; and
- a third abatement threshold deemed to indicate recovery of long term average performance.
Type: Application
Filed: Sep 18, 2014
Publication Date: Jan 1, 2015
Inventors: Channarong TONTINUTTANANON (Richardson, TX), Kuntaporn SAIYOS (Seattle, WA), Deborah CASE (Dallas, TX), Peter WENZEL (Plano, TX), Aamir SATTAR (Santa Clara, CA)
Application Number: 14/489,925
International Classification: H04W 24/08 (20060101); H04L 12/24 (20060101);