Anomaly Analysis For Software Distribution
A population of devices provides telemetry data and receives software changes or updates. Event buckets for respective events are found. Event buckets have counts of event instances, where each event instance is an occurrence of a corresponding event reported as telemetry by a device. Records of the software changes are provided, each change record representing a software change on a corresponding device. The event buckets are analyzed to identify which indicate an anomaly. Based on the change records and the identified event buckets, correlations between the software changes and the identified event buckets are found.
This application is a continuation patent application of copending application with Ser. No. 14/676,214, (attorney docket no. 356778.01) filed Apr. 1, 2015, entitled “ANOMALY ANALYSIS FOR SOFTWARE DISTRIBUTION”, which is now allowed. The aforementioned application(s) are hereby incorporated herein by reference.
BACKGROUND
Devices that run software usually require updates over time. The need for software updates may be driven by many factors, such as fixing bugs, adding new functionality, improving performance, maintaining compatibility with other software, and so forth. While many techniques have been used for updating software, an update typically involves changing the source code of a program, compiling the program, and distributing the program to devices where the updated program will be executed.
It is becoming more common for programs to be compiled for multiple types of devices and operating systems. Executable code compiled from a same source code file might end up executing on devices with different types of processors, and different types or versions of operating systems. Updates for such cross-platform programs can be difficult to assess.
In addition, the increasing network connectivity of devices has led to higher rates of updating by software developers and more frequent reporting of performance-related data (telemetry) by devices.
In a short time period, a device might receive many software updates and might transmit many telemetry reports to a variety of telemetry collectors. A software distribution system might rapidly issue many different software updates to many different devices. As devices provide feedback telemetry about performance, crashes, stack dumps, execution traces, etc., around the same time, many software components on the devices might be changing. Therefore, it can be difficult for a software developer to use the telemetry feedback to decide whether a particular software update created or fixed any problems. If an anomaly is occurring on some devices, it can be difficult to determine whether any particular software update is implicated, any conditions under which an update might be linked to an anomaly, or what particular code-level changes in a software update are implicated. In short, high rates of software updating and telemetry reporting, perhaps by devices with varying architectures and operating systems, has made it difficult to find correlations between software updates (or source code changes) and anomalies manifested in telemetry feedback.
Techniques related to finding anomalies in telemetry data and finding correlations between anomalies and software updates are discussed below.
SUMMARY
The following summary is included only to introduce some concepts discussed in the Detailed Description below. This summary is not comprehensive and is not intended to delineate the scope of the claimed subject matter, which is set forth by the claims presented at the end.
A population of devices provides telemetry data and receives software changes or updates. Event buckets for respective events are found. Event buckets have counts of event instances, where each event instance is an occurrence of a corresponding event reported as telemetry by a device. Records of the software changes are provided, each change record representing a software change on a corresponding device. The event buckets are analyzed to identify which indicate an anomaly. Based on the change records and the identified event buckets, correlations between the software changes and the identified event buckets are found.
Many of the attendant features will be explained below with reference to the following detailed description considered in connection with the accompanying drawings.
The present description will be better understood from the following detailed description read in light of the accompanying drawings, wherein like reference numerals are used to designate like parts in the accompanying description.
An update 102 can be implemented in a variety of ways. An update 102 can be a package configured or formatted to be parsed and applied by an installation program or service running on a device 104. An update 102 might be one or more files that are copied to appropriate file system locations on devices 104, possibly replacing prior versions. An update 102 might be a script or command that reconfigures software on the device 104 without necessarily changing executable code on the device 104. For example, an update 102 might be a configuration file or other static object used by corresponding software on the device 104. An update 102 can be anything that changes the executable or application. Commonly, an update 102 will involve replacing, adding, or removing at least some executable code on a device 104.
An update service 100, in its simplest form, provides software updates 102 to the devices 104. For example, an update service 100 might be an HTTP (hypertext transfer protocol) server servicing file download requests. An update service 100 might be more complex, for instance, a so-called software store or marketplace that application developers use to propagate updates. An update service 100 might also be a backend service working closely with a client-side component to transparently select, transmit, and apply updates. In one embodiment, the update service 100 is a peer-to-peer service where peers share updates with each other. In another embodiment, an update service 100 is a network application or service running on a group of servers, for instance as a cloud service, that responds to requests from devices 104 by transmitting software updates 102. Any known technology may be used to implement an update service 100. Moreover, as shown in
In one embodiment, an update service 100 includes an update distributor 112 and an updates database 114. The updates database 114 may have records for respective updates 102. Each update record identifies the corresponding update 102, information about the update 102 such as an intended target (e.g., target operating system, target hardware, software version, etc.), a location of the update 102, and other related data as discussed further below. The update distributor 112 cooperates with the updates database 114 to determine what updates are to be made available to, or transferred to, any particular device 104. As will be described further below, finding correlations between updates and anomalies can be facilitated by keeping track of which particular updates 102 have been installed on which particular devices 104. This can be tracked in the updates database 114 or elsewhere. For this purpose, in one embodiment, each time a particular update 102 is provided to a particular device 104, an update instance record is stored identifying the particular device 104, the update 102, and when the update was provided. In some embodiments, this information is obtained indirectly, for instance from logs received from a device 104, perhaps well after an update has been applied.
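By way of illustration only, the following is a minimal sketch of such bookkeeping, assuming a simple SQLite-backed updates database; the table and column names are hypothetical and not taken from the system described here.

```python
# Minimal sketch (not the described system): record one update instance each
# time a particular update is provided to a particular device, using a
# hypothetical SQLite table.
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("updates.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS update_instances (
        update_id TEXT NOT NULL,
        device_id TEXT NOT NULL,
        applied_at TEXT NOT NULL
    )
""")

def record_update_instance(update_id: str, device_id: str) -> None:
    """Store which update went to which device, and when."""
    conn.execute(
        "INSERT INTO update_instances (update_id, device_id, applied_at) VALUES (?, ?, ?)",
        (update_id, device_id, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
```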
The telemetry reports 108 are any communications pushed or pulled from the devices 104. Telemetry reports 108 indicate respective occurrences on the devices 104, and are collected by a telemetry collector 116 and then stored in a telemetry database 118. Examples of types of telemetry reports 108 that can be used are: operating system crash dumps (possibly including stack traces), application crash dumps, system log files, application log files, execution trace files, patch logs, performance metrics (e.g., CPU load, memory usage, cache collisions, network performance statistics, etc.), or any other information that can be tied to software behavior on a device 104. As discussed in detail further below, in one embodiment, text communications published on the Internet are mined as a telemetry source. As will become apparent, embodiments described herein can improve the diagnostic value, with respect to updates, of telemetry sources that provide information about general system health or that are not specific to or limited to any particular application or software. By using such telemetry sources on a large scale, it may become possible to find correlations in the aggregate that might not be discoverable on an individual device basis.
Regardless of the type of telemetry data, a telemetry collection service 106 will usually have some collection mechanism, e.g., a telemetry collector 116, and a storage or database—e.g., a telemetry database 118—that collates and stores incoming telemetry reports 108. Due to potentially large volumes of raw telemetry data, in some embodiments, key items of data can be extracted from incoming telemetry reports 108 and stored in the telemetry database 118 before disposing of the raw telemetry data.
A device 104 may also have one or more telemetry reporting elements 142. As discussed above, any type of telemetry data can be emitted in telemetry reports 108. Any available reporting frameworks may be used, for instance Crashlytics™, TestFlight™, Dr. Watson™, the OS X Crash Reporter™, BugSplat™, the Application Crash Report for Android Tool, etc. A reporting element 142 can also be any known tool or framework, such as the diagnostic and logging service of OS X or the Microsoft Windows instrumentation tools, which will forward events and errors generated by applications or other software. A reporting element 142 can also be an application or other software unit that itself sends a message or other communication when the application or software unit encounters diagnostic-relevant information such as a performance glitch, a repeating error or failure, or other events. In other words, software can self-report. The telemetry reports 108 from the reporting elements 142 may go to multiple respective collection services 106, or they may go to a same collection service 106.
The telemetry reports 108 from a device 104 will at times convey indications of events that occur on the device. The flow of telemetry reports 108 can collectively function like a signal that includes features that correspond to events or occurrences on the device. Such events or anomaly instances can, as discussed above, be any type of information that is potentially helpful for evaluating software performance, such as operating system or application crashes, errors self-reported by software or handled by the device's operating system, excessive or spikey storage, processor, or network usage, reboots, patching failures, repetitive user actions (e.g., repeated attempts to install an update, repeatedly restarting an application, etc.).
In any case, the process begins at step 180 by forming event buckets. Each event bucket is a collection of event instances that correspond to a same event. For example, an event bucket may consist of a set of event instances that each represent a system crash, or a same performance trait (e.g., poor network throughput), or any error generated by a same binary executable code (or a same function, or binary code compiled from a same source code file), or a particular type of error such as a memory violation, etc.
An event bucket can be formed in various ways, for example, by specifying one or more properties of event instance records, by a clustering algorithm that forms clusters of similar event instance records, by selecting an event instance record and searching for event instance records that have some attributes in common with the selected event record, and so forth. Event instance records are generally dated (the terms "date" and "time" are used interchangeably herein and refer to a date, a date and time, a period of time, etc.). Therefore, each bucket is a chronology of event instances, and counts of those event instances can be obtained for respective regular intervals of time.
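As a rough illustration of this kind of bucketing, the sketch below groups hypothetical event instance records by a shared key and counts instances per day; the record fields (a "binary" name and a "timestamp") are assumptions made for the example only.

```python
# Illustrative sketch: group hypothetical event instance records that correspond
# to a same event (here, keyed by faulting binary) and count instances per day.
from collections import Counter, defaultdict

def form_buckets(event_instances, key=lambda e: e["binary"]):
    """Group event instance records into buckets of the same event."""
    buckets = defaultdict(list)
    for instance in event_instances:
        buckets[key(instance)].append(instance)
    return buckets

def daily_counts(bucket):
    """Counts of event instances per regular interval (per day), as a chronology."""
    return Counter(instance["timestamp"].date() for instance in bucket)
```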
As will be described later, one or more filters appropriate to the type of telemetry data in a bucket may be applied to remove event instance records that would be considered noise or otherwise unneeded. For instance, records may be filtered by date, operating system type or version, processor, presence of particular files on a device or in a stack trace, and/or any other condition.
At step 182, each event bucket is statistically analyzed to identify which buckets indicate anomalies, or more specifically, to identify any anomalies in each bucket. Various types of analysis may be used. For example, a regression analysis may be used. In one embodiment, anomalies are identified using a modified cumulative distribution function (CDF) with supervised machine learning. A standard CDF determines the probability that number X is found in the series of numbers S. For example, X may be a count of event instances for a time period Z, such as a day, and the series S is the number of events, for example for previous days 0 to Z−1. Additionally, certain time periods may be removed from the series S if they were independently determined to be a true-positive anomaly (e.g., they may have been identified by a developer). To identify an anomaly, an anomaly threshold is set, for instance to 95% or 100%; if a candidate anomaly (an X) has a probability (of not being in the series S) that exceeds the anomaly threshold, then an anomaly has been identified.
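A minimal sketch of such a test is shown below, assuming the historical daily counts are roughly normally distributed; the exclusion of known true-positive days and the 95% threshold follow the description above, while the data layout is an assumption.

```python
# Minimal sketch of the CDF-style test: is today's count improbably high given
# the counts of prior days (excluding days already confirmed as anomalies)?
from statistics import NormalDist, mean, stdev

def is_anomalous(todays_count, history, known_anomaly_days=frozenset(), threshold=0.95):
    """history maps day -> event instance count for one bucket."""
    series = [count for day, count in history.items() if day not in known_anomaly_days]
    if len(series) < 2 or stdev(series) == 0:
        return False  # not enough historical variation to judge
    dist = NormalDist(mean(series), stdev(series))
    return dist.cdf(todays_count) > threshold
```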
Returning to
At step 186, correlations are found between the updates and the event buckets with anomalies. Any of a variety of heuristics may be used to find correlations or probabilities of correlations. That is, one or more heuristics are used to determine the probability that a particular anomaly is attributable to, or correlated with, a particular software update. Among the heuristics that can be used, some examples are: determining the probability that a particular binary (or other software element) in a software update is a source of a telemetry point; determining the extent to which an anomaly aligns with the release of a software update; finding a deviation between an update deployment pattern and an anomaly pattern; determining whether an anomaly is persistent after deployment of an update (possibly without intervention). In one embodiment, each applicable heuristic can be given a weight. The weight is adjusted via a feedback loop when reported anomalies are marked by a user as either (i) a true-positive or (ii) a false-positive. These heuristics and weights can be combined to compute a probability or percentage for each respective anomaly and software update pair. In this way, many possible updates can be evaluated against many concurrent anomalies and the most likely causes can be identified. For run-time optimization, a software update can be excluded from calculation once the software update has reached a significant deployment with no issues or, in the case of an update intended to fix an issue, when the corresponding issues are determined to have been resolved.
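The following sketch illustrates one way such a weighted combination with feedback could look; the heuristic names, the weighted-average formula, and the learning rate are illustrative assumptions rather than the exact method described.

```python
# Illustrative sketch: combine per-heuristic scores (each in [0, 1]) into one
# correlation probability for an (anomaly, update) pair, and nudge the weights
# when a user marks a reported anomaly as a true- or false-positive.
heuristic_weights = {
    "binary_in_update": 1.0,
    "release_alignment": 1.0,
    "deployment_pattern_fit": 1.0,
    "persistence_after_update": 1.0,
}

def correlation_score(heuristic_scores):
    """Weighted average of the heuristic scores for one anomaly/update pair."""
    total = sum(heuristic_weights[name] for name in heuristic_scores)
    if total == 0:
        return 0.0
    return sum(heuristic_weights[name] * score
               for name, score in heuristic_scores.items()) / total

def apply_feedback(heuristic_scores, true_positive, learning_rate=0.1):
    """Raise weights of heuristics that fired on confirmed anomalies; lower them otherwise."""
    direction = 1.0 if true_positive else -1.0
    for name, score in heuristic_scores.items():
        heuristic_weights[name] = max(0.0, heuristic_weights[name] + direction * learning_rate * score)
```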
Within the concept of finding links between updates and anomalies, a number of variations of the algorithm in
As discussed with reference to
The update store 220 includes an update table 222 storing records of updates, an update instance table 224 storing records of which updates were applied to which devices and when, and a change table 226 storing records of changes associated with an update. The update instance records and the change records link back to related update records. The update records store information about the updates and can be created when an update is released or can be initiated by a user when analysis of an update is desired. The update records can store any information about the updates, such as versions, release dates, locations of the bits, files, packages, information about target operating systems or architectures, etc. The update instance records store any information about an update applied to a device, such as a time of the update, any errors generated by applying the update, and so on.
The change table 226 stores records indicating which assets are changed by which updates. An asset is any object on a device, such as an executable program, a shared library, a configuration file, a device driver, etc. In one embodiment, the change table 226 is automatically populated by a crawler or other tool that looks for new update records, opens the related package, identifies any affected assets, and records same in the change table 226. The changes can also be obtained from a manifest, if available. In one embodiment, a source code control system is accessed to obtain information about assets in the update. If an executable file is in the update, the identity of the executable file (e.g., filename and version) is passed to the source code control system. The source code control system returns information about any source code files that were changed since the executable file was last compiled. In other words, the source code control system provides information about any source code changes that relate to the update. As will be described further below, this can help extend the correlation analysis to particular source code files or even functions or lines of source code.
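As an illustration of the crawler idea, the sketch below opens a hypothetical update package, lists the assets it would add or replace, and records them against the update; the .zip packaging and the change-table schema are assumptions made for the example.

```python
# Rough sketch of the crawler idea: open an update package, list the assets it
# would add or replace, and record them against the update in a change table.
import sqlite3
import zipfile

conn = sqlite3.connect("updates.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS changes (
        update_id TEXT NOT NULL,
        asset_name TEXT NOT NULL
    )
""")

def populate_change_table(update_id: str, package_path: str) -> None:
    """Record which assets a package changes."""
    with zipfile.ZipFile(package_path) as package:
        for asset_name in package.namelist():
            conn.execute(
                "INSERT INTO changes (update_id, asset_name) VALUES (?, ?)",
                (update_id, asset_name),
            )
    conn.commit()
```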
Returning to
It may be noted that in the case of an application crash telemetry source with stack traces, to help manage the scale of telemetry data, a list of binaries, their versions, and bounding dates can be used to find all failures (event instances) within the bounding dates that have one of the binaries in their respective stack traces.
In one embodiment, a telemetry source can be provided by monitoring public commentary. User feedback is obtained by crawling Internet web pages and accessing public sources (e.g., Twitter™ messages) for text or multimedia content that makes reference to software that is subject to updates or that might be affected by updates. For a given update, a set of static and dynamic keywords is searched for in content to identify content that is relevant. Static keywords are any words that are commonly used when people discuss software problems, for instance "install", "reboot", "crashing", "hanging", "uninstall", etc. Dynamic keywords are generated according to the update, for example, the name or identifier of an update, names of files affected by the update, a publisher of the update, a name of an operating system, and others. When keywords sufficiently match an item of user-authored content, a record is created indicating the time of the content, the related update (if any), perhaps whether the content was found (using known text analysis techniques) to be a positive or negative comment, or other information that might indicate whether a person is having difficulty or success with an update. Counts of dated identified user feedback instances can then serve as another source of telemetry buckets that can be correlated with updates.
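A simplified sketch of this keyword matching follows; the static keyword list mirrors the examples above, while the dynamic-keyword fields and the match threshold are assumptions.

```python
# Simplified sketch: flag a user-authored post as relevant to an update when
# enough static and dynamic keywords appear in its text.
import re

STATIC_KEYWORDS = {"install", "reboot", "crashing", "hanging", "uninstall"}

def dynamic_keywords(update):
    """Keywords derived from the update itself: its name, publisher, and affected files."""
    return ({update["name"].lower(), update["publisher"].lower()}
            | {filename.lower() for filename in update["files"]})

def is_relevant(text, update, min_hits=2):
    """True when the post mentions enough of the keywords to be worth recording."""
    words = set(re.findall(r"[\w.]+", text.lower()))
    return len(words & (STATIC_KEYWORDS | dynamic_keywords(update))) >= min_hits
```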
In cases where a telemetry bucket has items that represent an event that has occurred prior to the date of an update, prior event instances can be used to determine whether the recent event instances in the bucket are normal or not. For example, a regular event spike might indicate that a recent event spike is not an anomaly. A baseline historical rate of event instances can also serve as a basis for filtering a bucket, for example, using a standard deviation of the historical rate.
As noted above, spikes in a bucket (e.g., rapid increases in occurrences over time) can serve as features potentially indicating anomalies. Spikes can be detected in various ways. In the case of application crash telemetry, there will often be a many-to-many relationship between application crashes and event buckets. So, for a given crash or failure, related buckets are found. Then, the buckets are iterated over to find the distribution of the failure or crash among that bucket's updates. Hit counts are obtained from the release date back to a prior date (e.g., 60 days earlier) and are adjusted according to the distribution of the crash or failure among the bucket's updates. Then, pivoting on the corresponding update release date, the hit count mean and variance prior to the update are computed. These are used to determine the cumulative probability of the hit counts after the update was released. Buckets without a sufficiently high hit probability (e.g., 95%) are filtered out.
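Condensed into a sketch, the pivot test might look as follows, assuming daily hit counts are available per bucket and are roughly normal before the release; the 60-day lookback and 95% cutoff come from the description above, while the data layout is an assumption.

```python
# Condensed sketch of the pivot test: characterize daily hit counts before the
# update's release, then ask how improbable the post-release level is.
from datetime import timedelta
from statistics import NormalDist, mean, stdev

def post_release_probability(daily_hits, release_date, lookback_days=60):
    """daily_hits maps date -> (adjusted) hit count for one bucket."""
    start = release_date - timedelta(days=lookback_days)
    before = [count for day, count in daily_hits.items() if start <= day < release_date]
    after = [count for day, count in daily_hits.items() if day >= release_date]
    if len(before) < 2 or not after or stdev(before) == 0:
        return 0.0
    dist = NormalDist(mean(before), stdev(before))
    return dist.cdf(mean(after))

def keep_bucket(daily_hits, release_date, cutoff=0.95):
    """Filter out buckets whose post-release hit counts are not improbably high."""
    return post_release_probability(daily_hits, release_date) >= cutoff
```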
When the buckets have been filtered, one or more heuristics, such as statistical regression, are applied to determine whether any of the buckets indicate that an anomaly has occurred. Analysis for potential regression conditions can be prioritized, for example, in this order: the percentage of times an update is present when a crash occurs; the probability that a spike (potential anomaly) does not occur periodically and that the spike is not related to an installation reboot; the possibility that crashes in a crash bucket have caused overall crashes of the binary to rise; the possibility that a crash spike is related to an operating system update and not to third-party software; the probability that a spike is consistent rather than temporary; and the probability that a spike is maximal over an extended time (e.g., months). As suggested above, an anomaly analysis can also depend on whether an event corresponding to a bucket is a new event or has occurred previously. The buckets determined to have anomalies are passed to an update correlator 264.
The update correlator 264 receives the anomalous buckets and accesses update data, such as the update store 220. Updates can be selected based on criteria used for selecting the telemetry buckets, based on a pivot date, based on a user selection, or any new updates can be selected. Correlations between the updates and the anomalous telemetry buckets are then computed using any known statistical correlation techniques, including those discussed elsewhere herein.
As mentioned above, if detailed update information is available (see change table 226), then depending on the granularity of that information, more specific correlations with anomalies can be found. A source code correlator 266 can be provided to receive correlations between updates and anomalies. The source code correlator 266 determines the correlation between the source-code changes of a software update and the update's correlated telemetry anomaly bucket. Each correlation between a source-code change and an anomaly is assigned a rank based on the level of match between the source-code change and the source code emitting the telemetry points in the anomaly bucket. If a direct match is not found, a search of the call graph (or similar source-code analysis) for the source code that emitted the telemetry and for the changed source-code list is performed to arrive at a prioritized list of source-code changes that could be causes of the anomaly.
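One possible sketch of this ranking is shown below: changed source files that directly emitted the telemetry rank first, files reachable from the emitting code in a call graph rank next, and everything else ranks last; the file-level call-graph representation is an assumption.

```python
# Rough sketch of the ranking: changed source files that directly emitted the
# telemetry rank first; files reachable from the emitting code via a call graph
# rank next; everything else ranks last.
from collections import deque

def _reachable(start_files, target, call_graph):
    """Breadth-first search over a file-level call graph (file -> files it calls)."""
    seen, queue = set(start_files), deque(start_files)
    while queue:
        for callee in call_graph.get(queue.popleft(), ()):
            if callee == target:
                return True
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return False

def rank_source_changes(emitting_files, changed_files, call_graph):
    """Prioritized list of changed source files that could be causes of the anomaly."""
    ranked = []
    for changed in changed_files:
        if changed in emitting_files:
            ranked.append((0, changed))   # direct match
        elif _reachable(emitting_files, changed, call_graph):
            ranked.append((1, changed))   # indirect match via the call graph
        else:
            ranked.append((2, changed))   # no known link
    return [filename for _, filename in sorted(ranked)]
```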
Returning to
The update monitor 270 can also be used to evaluate the effectiveness of updates that are flagged as fixes. A pre-existing event (bucket) can be linked to an update and the effectiveness of the update can be evaluated with feedback provided through a user interface of the update monitor 270. An update can be deemed effective if the associated event/anomaly drops below a threshold occurrence rate.
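A small sketch of that effectiveness test, under the assumption that the anomaly's daily counts and a threshold occurrence rate are available:

```python
# Small sketch of the effectiveness test: a fix is deemed effective if the linked
# anomaly's average daily count falls below a threshold after the fix is released.
from statistics import mean

def update_effective(daily_counts, fix_release_date, threshold_rate):
    """daily_counts maps date -> event count for the anomaly linked to the fix."""
    after = [count for day, count in daily_counts.items() if day >= fix_release_date]
    return bool(after) and mean(after) < threshold_rate
```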
The update monitor 270 can also be implemented as a client with a user interface that a developer uses to look for issues that might be related to updates identified by the developer. Or, a user can select an anomaly and see which updates, source code files, etc., are associated with the anomaly. In one embodiment a dashboard type of user interface is provided, which can display the overall health of the devices as reflected by anomalies in the analysis database 268. Anomalies can be evaluated by their extent (by geography, device type, etc.), their rate of increase, and so on. The dashboard can display alerts when an update is determined to be causing an anomaly. In the case where an update is flagged as a fix intended to remedy an anomaly, the dashboard can provide an indication that the update was successful.
In one embodiment the correlation engines operate as online algorithms, processing telemetry data and update data as it becomes available. Once a correlation engine 260 has determined all anomalies at the current time, it feeds the relevant information into the above-mentioned reporting dashboard (via the analysis database 268). The update monitor 270 or other tool can include ranking logic to rank anomalies correlated to updates. Types of anomalies, updates, source code files, etc., can be assigned weights to help prioritize the ranking. Other criteria may be used, such as the magnitude or rate of an anomaly, the duration of an anomaly, the number of anomalies correlated to a same update, and so forth. Anomalies and/or updates can be displayed in the order of their priority ranking to help developers and others to quickly identify the most important issues that need to be addressed.
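As a sketch of such ranking logic, the following assigns each correlated anomaly a weighted score over a few of the criteria mentioned; the particular weights and field names are illustrative assumptions.

```python
# Illustrative sketch of the priority ranking: score each correlated anomaly over
# a few of the criteria mentioned and display the highest-scoring ones first.
PRIORITY_WEIGHTS = {"magnitude": 0.5, "duration_days": 0.2, "same_update_anomalies": 0.3}

def priority_score(anomaly):
    """anomaly is a record (dict) carrying the ranking criteria as numeric fields."""
    return sum(weight * anomaly.get(criterion, 0)
               for criterion, weight in PRIORITY_WEIGHTS.items())

def ranked_anomalies(anomalies):
    """Highest-priority anomalies first, e.g., for the dashboard."""
    return sorted(anomalies, key=priority_score, reverse=True)
```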
Feedback can inform the supervised learning discussed above. The client application 320 displays user interfaces 322, as shown in
The anomaly summary user interface 322 might include: a name or identifier of the anomaly; summary information, such as a date the anomaly was identified, the type of anomaly (e.g., a spike), and scope of the anomaly (e.g., an affected operating system or processor); features of the anomaly like the probability that detection of the anomaly is correct and a measure of the anomaly; a listing of correlated updates and their occurrence rate; a telemetry graph showing counts of instances of the anomaly over time; and telemetry statistics such as an average daily count, counts before and after an update, unique machines reporting the event, and others.
Embodiments and features discussed above can also be realized in the form of information stored in volatile or non-volatile computer or device readable hardware. This is deemed to include at least hardware such as optical storage (e.g., compact-disk read-only memory (CD-ROM)), magnetic storage, flash read-only memory (ROM), or any current or future means of storing digital information. The stored information can be in the form of machine executable instructions (e.g., compiled executable binary code), source code, bytecode, or any other information that can be used to enable or configure computing devices to perform the various embodiments discussed above. This is also deemed to include at least volatile memory such as random-access memory (RAM) and/or virtual memory storing information such as central processing unit (CPU) instructions during execution of a program carrying out an embodiment, as well as non-volatile media storing information that allows a program or executable to be loaded and executed. The embodiments and features can be performed on any type of computing device, including portable devices, workstations, servers, mobile wireless devices, and so on.
Claims
1. A method performed by one or more computing devices, the method comprising:
- accessing a plurality of types of telemetry data sources, each telemetry data source comprising a different type of software performance feedback received from devices;
- forming pluralities of data sets from each of the respective telemetry data sources, each data set comprising indicia of counts of software event instances on the devices as a function of time from a corresponding telemetry data source;
- analyzing each data set to determine whether the data set comprises an anomaly;
- computing correlations of software updates of the devices with the data sets determined to comprise respective anomalies, wherein multiple software updates are applied to the devices during timespans that at least partially overlap with timespans of the data sets; and
- automatically controlling distribution, application, or availability of one or more of the software updates based on the one or more of the software updates having been sufficiently correlated with one or more of the data sets determined to comprise anomalies.
2. A method according to claim 1, wherein the controlling comprises sending messages to the devices, the messages identifying a software update.
3. A method according to claim 1, wherein the timing of distribution of a software update to the devices depends on the correlations.
4. A method according to claim 1, further comprising accessing source code data linking a source code file of an update with an element in the pluralities of data sets, and, based thereon, selecting one of the software updates.
5. A method according to claim 1, wherein the automatically controlling comprises sending a message that causes a software update to be sent to at least some of the devices.
6. A method according to claim 5, wherein the software update either installs or uninstalls code corresponding to a source code file determined to link the software update with an anomaly.
7. A method performed by one or more server computers accessible to user computing devices via a data network, the method comprising:
- receiving from the user computing devices, via the data network, telemetry data reports associated with a software product, each computing device comprising the software product installed therein and a reporting module that builds and transmits the telemetry data reports, and each telemetry data report comprising diagnostic information of the software product;
- storing the diagnostic information from the telemetry data reports in a data store;
- accessing software update information to identify software updates for the software product that have been applied to the user computing devices;
- analyzing the diagnostic information in the data store to identify anomalies associated with the software product;
- determining correlations between the anomalies and the identified software updates; and
- based on the correlations, communicating with the user computing devices to automatically install and/or uninstall a software update configured to update the software product.
8. A method according to claim 7, wherein the software update comprises one of the identified software updates.
9. A method according to claim 7, wherein the telemetry data reports comprise indicia of events on the user computing devices and respective times thereof, and wherein the correlations are determined according to the times of the events.
10. A method according to claim 9, wherein the correlations are determined further according to frequencies of the events.
11. A method according to claim 7, wherein the telemetry data comprises crash data captured in association with crashes of the software product on the user computing devices.
12. A method according to claim 7, further comprising, based on the correlations, inhibiting rollout of a software update correlated with an anomaly.
13. A method according to claim 7, further comprising accessing update data identifying which software updates have been applied to which user computing devices and determining the correlations according to the update data.
14. A method according to claim 13, wherein the update data is derived from the telemetry data reports.
15. A method according to claim 7, wherein the software product comprises the reporting module.
16. A method performed by one or more server devices, the method comprising:
- accessing a data store to obtain telemetry data stored in the data store, the telemetry data associated with a software application, the telemetry data provided by client computing devices via a network, each client computing device comprising the software application installed thereon, the telemetry data comprising diagnostic information generated by execution of the software application on the client computing devices;
- based on the diagnostic information, identifying anomalies of the software application and times associated with the anomalies;
- based on the times associated with the anomalies, automatically selecting a software update configured to update the software application; and
- transmitting a message configured to cause the selected software update to be installed or uninstalled on at least some of the client computing devices.
17. A method according to claim 16, wherein the selected software update comprises a package configured to add, remove, or replace an executable file corresponding to the software application.
18. A method according to claim 16, wherein the telemetry data comprises information derived from stack traces of the software application.
19. A method according to claim 16, further comprising enabling display of a user interface, the user interface comprising a graph representing an identified anomaly.
20. A method according to claim 19, the user interface further comprising a graphic representation of the selected software update.
Type: Application
Filed: Mar 8, 2017
Publication Date: Jun 22, 2017
Inventors: Aarthi Thangamani (Redmond, WA), Bryston Nitta (Redmond, WA), Chris Day (Kirkland, WA), Divyesh Shah (Redmond, WA), Nimish Aggarwal (Redmond, WA)
Application Number: 15/453,782