METHODS AND SYSTEMS FOR DETECTION OF DATA ANOMALIES
The disclosure presents computational methods and systems for detecting and correcting, or deleting, data anomalies in data generated by information technology business management (“ITBM”) systems. In one aspect, a method receives a record of data generated by an ITBM system. The record of data includes data types and associated numerical values recorded over a number of time periods. The method detects one or more data anomalies in the record of data based on the numerical values of the data types in the time periods and reports a set of the one or more data anomalies to a user. The method also enables a user to correct the data anomalies in the recorded data based on the user's decision to selectively correct or delete each of the data anomalies.
The present disclosure is directed to computational methods and systems for detecting and correcting, or deleting, anomalous data generated by information technology business management systems.
BACKGROUND
In recent years, the number of enterprises purchasing information technology (“IT”) services from IT service providers has steadily increased. For example, a large number of enterprises purchase cloud computing services from IT service providers, because cloud computing enables enterprises to cut costs and decrease time to market while eliminating a heavy investment in IT and operating expenses. Many of the IT services purchased by enterprises are network-based services that appear to a user as real server hardware, but, in fact, are virtual machines (“VMs”) simulated by software running on one or more real computers. The VMs are not bound to physical resources. Instead, VMs are virtual resources that may be moved around and scaled up or down as needed without affecting the user's experience. For example, an IT service provider that allocates resources of a data center to satisfy high-demand periods for an enterprise's software may reallocate these same resources at other times when demand decreases. VMs enable IT service providers to maximize use of resources and lower the costs of IT services purchased by enterprises.
In order to manage the cost of IT services more efficiently and effectively, IT service providers try to determine how much each service provided to an enterprise actually costs the provider. As a result, IT service providers must consider as many cost drivers as can possibly be monitored in order to accurately assess the cost of IT services. Accurately assessing the cost of IT services is typically handled by IT management software and systems that enable IT service providers to model and track the total cost of delivering and maintaining the IT services they provide to enterprises. IT cost transparency solutions integrate financial information, such as labor, software licensing costs, hardware acquisition and depreciation, and data center facilities charges, from general ledger systems and combine it with operational data from monitoring, asset management, and project portfolio management systems to provide a single, integrated view of IT costs by service, department, general ledger line item and project. Costs, budgets, performance metrics and changes to data points are tracked over time in order to identify trends in the data and the impact of changes to underlying cost drivers, so that managers are better able to address cost drivers responsible for increasing IT costs and to improve planning. However, the data collected and used to monitor cost drivers may be very large and may contain data anomalies that affect the cost assessment of IT services. IT service providers seek methods and systems that can detect anomalies in the data tracked to assess IT costs, warn IT financial controllers of the anomalies, and allow the anomalies to be corrected.
SUMMARY
This disclosure presents computational methods and systems for detecting and correcting, or deleting, data anomalies in data generated to assess IT service cost by information technology business management (“ITBM”) systems. In one aspect, a computational method receives a record of data generated by an ITBM system. The record of data includes data types and associated numerical values recorded over a number of time periods. The method detects one or more data anomalies in the data and reports the data anomalies to a user. The method also enables a user to correct the data anomalies based on a user's decision to selectively correct or delete each of the data anomalies.
This disclosure presents computational methods and systems for detecting anomalous data in data generated in the course of tracking various information technology (“IT”) services.
The IT service provider 104 uses an IT business management (“ITBM”) system to generate a record 108 of data extracted from one or more data sources maintained by the IT service provider. The ITBM system uses adaptors to fetch data from data sources and uses data-management operators to perform various selected operations on the data in order to monitor and determine the cost of using IT services, as described in greater detail below. For example, typical data-management operators perform data manipulations, such as filtering, aggregating, joining and correlating. The disclosure describes ITBM systems that include one or more anomaly operators that may be placed anywhere within an adaptor/operator flow of the ITBM system in order to detect anomalies in the data generated in the flow and to correct, or delete, the anomalies from the data.
It should be noted at the outset that the data and the anomaly operators described below are not, in any sense, abstract or intangible. Instead, the data is necessarily digitally encoded and stored in a physical data-storage computer-readable medium, such as an electronic memory, mass-storage device, or other physical, tangible, data-storage device and medium. It should also be noted that the currently described data-processing and data-storage methods cannot be carried out manually by a human analyst, because of the complexity and vast numbers of intermediate results generated for processing and analysis of even quite modest amounts of data. Instead, the methods described herein are necessarily carried out by electronic computing systems on electronically or magnetically stored data, with the results of the data processing and data analysis digitally encoded and stored in one or more tangible, physical, data-storage devices and media.
It should be noted that ITBM systems are not limited to the adaptors and data-management operators described above with reference to
Anomaly operators may be incorporated into an ITBM system in any number of different ways in order to detect anomalous data in records output from adaptors and/or data-management operators.
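As a minimal illustrative sketch, and not a description of any particular ITBM system, an adaptor/operator flow may be modeled as an adaptor that produces a record followed by a chain of operators that each take a record and return a record. An anomaly operator can then be placed at any position in the chain. The row layout, the function names, and the placeholder test (a value more than ten times the median) are assumptions made only for illustration; the placeholder test is not the outlier detection technique described below.

```python
from statistics import median
from typing import Callable, Dict, List

Row = Dict[str, object]   # hypothetical row layout: data_type, period, value
Record = List[Row]
Operator = Callable[[Record], Record]


def sample_adaptor() -> Record:
    """Stand-in for an adaptor that fetches rows from a data source."""
    values = [100.0, 102.0, 98.0, 5000.0, 101.0, -1.0]
    return [{"data_type": "labor", "period": t + 1, "value": v}
            for t, v in enumerate(values)]


def drop_negative(record: Record) -> Record:
    """Stand-in for a filtering data-management operator."""
    return [row for row in record if row["value"] >= 0]


def make_anomaly_operator(flagged: Record, factor: float = 10.0) -> Operator:
    """Build an operator that notes suspicious rows and passes data through.

    The test used here (value > factor * median) is only a placeholder.
    """
    def anomaly_operator(record: Record) -> Record:
        mid = median(row["value"] for row in record)
        flagged.extend(row for row in record if row["value"] > factor * mid)
        return record
    return anomaly_operator


def run_flow(adaptor: Callable[[], Record], operators: List[Operator]) -> Record:
    """Run the adaptor, then apply each operator in order."""
    record = adaptor()
    for operator in operators:
        record = operator(record)
    return record


flagged: Record = []
output = run_flow(sample_adaptor, [drop_negative, make_anomaly_operator(flagged)])
```

Because every operator shares the same record-in, record-out shape, the anomaly operator in this sketch could equally be placed before the filter or after a later aggregation step.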
Anomaly operators described below use outlier detection to detect anomalies in the numerical values of a record of data output from an adaptor or a data-management operator. Outlier detection may be applied to the numerical quantities associated with one or more data types. It should be noted that the methods may be applied to data collected in a batch of time periods output from an adaptor or a data-management operator. Alternatively, the methods may be applied to the data of a current batch of time periods joined with historical data from any number of preceding time periods, in order to detect data anomalies over a much larger interval of time and not just in the time periods of the current batch. For example, a database adaptor that collects data for a data type on a monthly basis may join the data for the current month with the data for any number of previous months in order to detect anomalies in the context of a larger interval of time. Consider the record in
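The joining of a current batch with historical data described above may be sketched as follows. This is an illustrative assumption about data layout only: each data type maps to a list of (value, time period) pairs, and the names `join_with_history` and `max_periods` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, int]            # (numerical value, time period)
Series = Dict[str, List[Point]]      # hypothetical layout: data type -> points


def join_with_history(history: Series, batch: Series,
                      max_periods: Optional[int] = None) -> Series:
    """Join the current batch with preceding time periods, per data type."""
    joined: Series = {}
    for data_type in sorted(set(history) | set(batch)):
        points = history.get(data_type, []) + batch.get(data_type, [])
        points.sort(key=lambda point: point[1])   # order by time period
        if max_periods is not None and max_periods > 0:
            points = points[-max_periods:]        # keep the most recent periods
        joined[data_type] = points
    return joined


history = {"labor": [(100.0, 1), (102.0, 2), (98.0, 3)]}
batch = {"labor": [(5000.0, 4)], "licenses": [(40.0, 4)]}
joined = join_with_history(history, batch)
```

In this sketch the value recorded for period 4 can be judged against three earlier periods rather than in isolation.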
The following description presents an outlier detection technique for detecting anomalous numerical data associated with one or more data types output from either an adaptor or an operator of an adaptor/operator flow. Consider a set of M points $\{x_i\}_{i=1}^{M}$ associated with one or more data types. The number of points M may represent a number of points output from an adaptor or data-management operator for a number of recent time periods, or a number of points output from an adaptor or data-management operator for recent time periods joined with a number of points associated with time periods that precede the recent time periods. In certain implementations, $x_i = (V_i, T_i)$, where $V_i$ is the numerical value of the data type, $T_i$ may be the time period when the data type is recorded, and M is the number of time periods. For example, for the record of
$$\{d(x_p, x_i)\}_{i=1}^{M-1} \tag{1}$$
The k points $x_i$ with the k shortest distances in the set of Equation (1) form a set of k-nearest neighboring points to the point $x_p$. The set of k-nearest neighbor points is denoted by $N_p$ ($x_p \notin N_p$) and is referred to as the neighborhood of point $x_p$. The data type associated with the point $x_p$ is identified as an anomaly when the point $x_p$ is outside the neighborhood $N_p$, as determined by Equation (2), which is evaluated from the following quantities:
- the center of the neighborhood $N_p$;
- an average of the distance of the point $x_p$ to each point in the neighborhood $N_p$; and
- an average distance between the points in the neighborhood $N_p$.
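The neighborhood quantities named above can be computed as in the following minimal sketch. It rests on stated assumptions rather than on the disclosed equations: the distance is taken as the absolute difference of numerical values, the center of the neighborhood is taken as the arithmetic mean of the neighbor values, both averages are arithmetic means, and ties in distance are broken by position in the list. The function name and the returned keys are hypothetical. The sketch stops at computing the quantities and leaves the final outlier comparison, and any tolerance, to the caller.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, List


def neighborhood_stats(values: List[float], p: int, k: int) -> Dict[str, object]:
    """Quantities for the k-nearest neighborhood of the point at index p."""
    if not 2 <= k <= len(values) - 1:
        raise ValueError("k must satisfy 2 <= k <= M - 1")
    # Distances from the point p to each of the other M - 1 points.
    distances = [(abs(values[p] - v), i) for i, v in enumerate(values) if i != p]
    # The k shortest distances define the neighborhood (p itself is excluded).
    neighbors = [i for _, i in sorted(distances)[:k]]
    neighbor_values = [values[i] for i in neighbors]
    center = mean(neighbor_values)   # assumption: center is the arithmetic mean
    return {
        "neighbors": neighbors,
        "center": center,
        "distance_to_center": abs(values[p] - center),
        "avg_distance_to_neighbors": mean(
            abs(values[p] - v) for v in neighbor_values),
        "avg_inner_distance": mean(
            abs(a - b) for a, b in combinations(neighbor_values, 2)),
    }


monthly_values = [100.0, 102.0, 98.0, 5000.0, 101.0]
stats = neighborhood_stats(monthly_values, p=3, k=3)
```

For the sample values, the point 5000.0 lies far from the center of its three nearest neighbors, while the average distance among those neighbors is small, which is the kind of contrast an outlier test can exploit.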
In certain implementations, the distance d may be a Euclidean distance denoted by $\|\cdot\|$ or the square of the Euclidean distance $\|\cdot\|^2$. In other implementations, the distance d may simply be a function of the points. For example, $d(x_i, x_j) = |V_i - V_j|$ and
In alternative implementations, a user-selected tolerance, denoted by TOL, may be included in order to avoid classifying every value whose point lies outside the neighborhood $N_p$ as an outlier. For example, certain values may be on the outside edge of the neighborhood $N_p$ but should not necessarily be considered outliers. As a result, in alternative implementations, the point $x_p$ is outside the neighborhood of k-nearest neighbors and the value $V_p$ may be identified as an outlier when
where TOL is a user-selected tolerance.
After the outliers have been identified, the outliers are rank ordered. In certain implementations, the outliers may be rank ordered according to their distance from the center
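As a minimal sketch of rank ordering, assuming each detected outlier already carries its distance from the center of its neighborhood, the outliers may be sorted by that distance and the highest ranked ones selected for reporting. The `Outlier` fields and the function name `top_ranked` are hypothetical.

```python
from typing import List, NamedTuple


class Outlier(NamedTuple):
    data_type: str
    period: int
    value: float
    distance_to_center: float


def top_ranked(outliers: List[Outlier], n: int) -> List[Outlier]:
    """Rank outliers by distance from the center, largest first, and keep n."""
    ranked = sorted(outliers, key=lambda o: o.distance_to_center, reverse=True)
    return ranked[:n]


detected = [
    Outlier("labor", 4, 5000.0, 4899.0),
    Outlier("licenses", 2, 75.0, 35.0),
    Outlier("facilities", 7, 900.0, 410.0),
]
reported = top_ranked(detected, n=2)
```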
When data types with anomalous values are detected, a user may decide to replace the anomalous numerical values with the mean or median of the numerical values over a number of time periods for the same data type, or the user may decide to delete all the data associated with the anomalous value. For example, suppose the numerical value $V_o$ for a particular data type has been identified as an outlier as described above with reference to Equations (2) or (4). The user may decide to replace the numerical value $V_o$ with the mean of the M numerical values for the same data type calculated according to
Alternatively, the user may decide to replace the numerical value with the median of the M numerical values:
In other implementations, an anomalous value Vo may be replaced by a maximum or minimum of the numerical values with the anomalous value excluded:
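The correction options described above (mean, median, maximum, minimum, or deletion) may be sketched as follows. This is an illustrative assumption, not the disclosed equations: the function and parameter names are hypothetical, and the `exclude_anomaly` flag is introduced because the text states the exclusion of the anomalous value explicitly only for the maximum and minimum.

```python
from statistics import mean, median
from typing import List


def replacement_value(values: List[float], o: int, strategy: str,
                      exclude_anomaly: bool = True) -> float:
    """Compute a replacement for the anomalous value at index o."""
    if exclude_anomaly:
        pool = [v for i, v in enumerate(values) if i != o]
    else:
        pool = list(values)
    strategies = {"mean": mean, "median": median, "max": max, "min": min}
    if strategy not in strategies:
        raise ValueError(f"unknown strategy: {strategy}")
    return strategies[strategy](pool)


def apply_decision(values: List[float], o: int, decision: str) -> List[float]:
    """Return a corrected copy: delete the entry or replace it in place."""
    if decision == "delete":
        return values[:o] + values[o + 1:]
    corrected = list(values)
    corrected[o] = replacement_value(values, o, decision)
    return corrected


monthly_values = [100.0, 102.0, 98.0, 5000.0, 101.0]
with_mean = apply_decision(monthly_values, 3, "mean")
with_median = apply_decision(monthly_values, 3, "median")
deleted = apply_decision(monthly_values, 3, "delete")
```

The sketch returns a new list in each case, so the original record stays available until the user's decision is confirmed.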
Although the above disclosure has been described in terms of particular implementations, it is not intended that the disclosure be limited to these implementations. Modifications within the spirit of the disclosure will be apparent to those skilled in the art. For example, any of a variety of different implementations can be obtained by varying any of many different design and development parameters, including programming language, underlying operating system, modular organization, control structures, data structures, and other such design and development parameters.
It is appreciated that the previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims
1. A system for correcting data anomalies comprising:
- one or more processors;
- one or more data-storage devices; and
- a routine stored in the data-storage devices and executed using the one or more processors, the routine
- receiving a record of data output from an adaptor or a data-management operator of an information technology business management system, the record of data including data types and associated numerical values recorded over a number of time periods;
- detecting one or more data anomalies in the record of data based on the numerical values of the data types in the time periods;
- reporting a set of the one or more data anomalies to a user; and
- correcting the data anomalies in the recorded data based on a user's decision to selectively correct or delete each of the data anomalies.
2. The system of claim 1, wherein the adaptor or the data-management operator are incorporated in an adaptor/operator flow of the information technology business management system.
3. The system of claim 1, wherein detecting the one or more data anomalies further comprises:
- for each data type,
- collecting a set of data points over the time periods, each data point represents a numerical value and time period for the data type;
- detecting outlier data points in the set of data points based on a distance from each data point to a center of a neighborhood of nearest data points in the set;
- when one or more outlier data points are detected, rank ordering the outlier data points; and
- identifying a set of highest ranked outlier data points based on the rank order of the one or more outlier data points, the set of highest ranked outlier data points correspond to the set of one or more data anomalies reported to the user.
4. The system of claim 3, wherein detecting outlier data points in the set of data points of the data type further comprises
- for each data point,
- determining the neighborhood of nearest data points in the set;
- calculating the center of the neighborhood;
- calculating a sum of average distances from the data point to each data point in the neighborhood;
- calculating an average distance between data points in the neighborhood; and
- identifying the data point as an outlier data point when the distance from the data point to the center of the neighborhood is greater than a ratio of average distance from the data point to each data point in the neighborhood to the average distance between nearest data points.
5. The system of claim 1, wherein correcting the data anomalies further comprises:
- for each of the one or more data anomalies, replacing an outlier numerical value with one of the following:
- a mean value of the numerical values recorded over the time intervals, excluding the outlier numerical value;
- a median value of the numerical values recorded over the time intervals;
- a maximum value of the numerical values recorded over the time intervals, excluding the outlier numerical value; and
- a minimum value of the numerical values recorded over the time intervals, excluding the outlier numerical value.
6. A method stored in one or more data-storage devices and executed using one or more processors that detects data anomalies, the method comprising:
- receiving a record of data output from an adaptor or a data-management operator of an information technology business management system, the record of data including data types and associated numerical values recorded over a number of time periods;
- detecting one or more data anomalies in the record of data based on the numerical values of the data types in the time periods;
- reporting a set of the one or more data anomalies to a user; and
- correcting the data anomalies in the recorded data based on a user's decision to selectively correct or delete each of the data anomalies.
7. The method of claim 6, wherein the adaptor or the data-management operator are incorporated in an adaptor/operator flow of the information technology business management system.
8. The method of claim 6, wherein detecting the one or more data anomalies further comprises:
- for each data type,
- collecting a set of data points over the time periods, each data point represents a numerical value and time period for the data type;
- detecting outlier data points in the set of data points based on a distance from each data point to a center of a neighborhood of nearest data points in the set;
- when one or more outlier data points are detected, rank ordering the outlier data points; and
- identifying a set of highest ranked outlier data points based on the rank order of the one or more outlier data points, the set of highest ranked outlier data points correspond to the set of one or more data anomalies reported to the user.
9. The method of claim 8, wherein detecting outlier data points in the set of data points of the data type further comprises for each data point,
- determining the neighborhood of nearest data points in the set;
- calculating the center of the neighborhood;
- calculating a sum of average distances from the data point to each data point in the neighborhood;
- calculating an average distance between data points in the neighborhood; and
- identifying the data point as an outlier data point when the distance from the data point to the center of the neighborhood is greater than a ratio of average distance from the data point to each data point in the neighborhood to the average distance between nearest data points.
10. The method of claim 6, wherein correcting the data anomalies further comprises:
- for each of the one or more data anomalies, replacing an outlier numerical value with one of the following:
- a mean value of the numerical values recorded over the time intervals, excluding the outlier numerical value;
- a median value of the numerical values recorded over the time intervals;
- a maximum value of the numerical values recorded over the time intervals, excluding the outlier numerical value; and
- a minimum value of the numerical values recorded over the time intervals, excluding the outlier numerical value.
11. A computer-readable medium encoded with machine-readable instructions that implement a method carried out by one or more processors of a computer system to perform the operations of
- receiving a record of data output from an adaptor or a data-management operator of an information technology business management system, the record of data including data types and associated numerical values recorded over a number of time periods;
- detecting one or more data anomalies in the record of data based on the numerical values of the data types in the time periods;
- reporting a set of the one or more data anomalies to a user; and
- correcting the data anomalies in the recorded data based on a user's decision to selectively correct or delete each of the data anomalies.
12. The computer-readable medium of claim 11, wherein the adaptor or the data-management operator are incorporated in an adaptor/operator flow of the information technology business management system.
13. The computer-readable medium of claim 11, wherein detecting the one or more data anomalies further comprises:
- for each data type,
- collecting a set of data points over the time periods, each data point represents a numerical value and time period for the data type;
- detecting outlier data points in the set of data points based on a distance from each data point to a center of a neighborhood of nearest data points in the set;
- when one or more outlier data points are detected, rank ordering the outlier data points; and
- identifying a set of highest ranked outlier data points based on the rank order of the one or more outlier data points, the set of highest ranked outlier data points correspond to the set of one or more data anomalies reported to the user.
14. The computer-readable medium of claim 13, wherein detecting outlier data points in the set of data points of the data type further comprises for each data point,
- determining the neighborhood of nearest data points in the set;
- calculating the center of the neighborhood;
- calculating a sum of average distances from the data point to each data point in the neighborhood;
- calculating an average distance between data points in the neighborhood; and
- identifying the data point as an outlier data point when the distance from the data point to the center of the neighborhood is greater than a ratio of average distance from the data point to each data point in the neighborhood to the average distance between nearest data points.
15. The computer-readable medium of claim 11, wherein correcting the data anomalies further comprises:
- for each of the one or more data anomalies, replacing an outlier numerical value with one of the following:
- a mean value of the numerical values recorded over the time intervals, excluding the outlier numerical value;
- a median value of the numerical values recorded over the time intervals;
- a maximum value of the numerical values recorded over the time intervals, excluding the outlier numerical value; and
- a minimum value of the numerical values recorded over the time intervals, excluding the outlier numerical value.
Type: Application
Filed: Mar 18, 2014
Publication Date: Sep 24, 2015
Applicant: VMware, Inc. (Palo Alto, CA)
Inventors: Al Yaros (Herzliya), Eyal Cohen (Herzilya), Evgeny Etkin (Herzliya), Asaf Abramovitz (Herzliya)
Application Number: 14/218,544