TRACKING ERROR PROPAGATION ACROSS MICROSERVICES BASED APPLICATIONS USING DISTRIBUTED ERROR STACKS
A method of performing error analysis in a system comprising microservices comprises identifying a root cause error span from among a plurality of error spans for a trace associated with a user-request, wherein an error span is a span that returns an error to a microservice initiating a call resulting in the span, and wherein a root cause error span is an error span associated with an error originating microservice. The method further comprises determining a call path associated with the root cause error span, where the call path comprises a chain of spans starting at the root cause error span, and where each subsequent span in the chain is a parent span of a prior span. Subsequently the method comprises mapping each span in the chain to a span error frame to create an error stack and rendering an image of the error stack.
Computer systems may run applications or services that are provided via a server or cloud. A client computer system may send a request to a server that retrieves application installation files from an underlying database. The applications or services can be developed and deployed as a single unit or as multiple units, such as a collection of microservices. Applications that are developed as a single unit may be monolithic applications that include a user interface and data access codes combined into a single program from a single platform. Monolithic applications are self-contained and independent from other computing applications. With the advent of cloud computing, however, these large centralized monolithic systems are being decoupled and distributed to address scalability needs and to allow companies to deliver value faster.
Microservices or microservice architecture is a software development method of developing software applications as a suite of independently deployable smaller cooperating services. The cooperating services run processes and communicate to serve a business goal to form an enterprise application. More specifically, in a microservice architecture, an application is developed as a collection of small services; each service implements business capabilities, runs in its own process and communicates via APIs (e.g., HTTP APIs) or messaging. Each microservice can be deployed, upgraded, scaled, and restarted independent of other services in the application, typically as part of an automated system, enabling frequent updates to live applications without impacting end customers.
With the rise of cloud native applications, which include microservices, there has been not only a shift in the manner in which software is built and deployed, but also in the manner in which it is monitored and observed. Microservices based applications have to deal with dramatically increased complexity and many more layers of abstraction compared to previous generations of monolithic applications. Compared to monolithic applications, microservice architectures introduce complexity in network communication, feature short lifecycles, and require resiliency in dynamic environments.
As companies begin to increasingly rely on microservice architectures, they run into operational complexity and struggle to efficiently monitor their environments. One of the challenges associated with microservice architectures, for example, is getting visibility in distributed workflows without a central orchestrator. Traditional monitoring and troubleshooting tools, designed as symptom-based solutions with single purpose capabilities, result in information overload and are unable to keep up with the dynamic nature of cloud native applications. Thus, systems that can efficiently and accurately monitor microservice architectures and microservices based applications are the subject of considerable innovation.
SUMMARY
This Summary is provided to introduce a selection of concepts in a simplified form that is further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
The challenges associated with monitoring, troubleshooting and tracking errors in microservice architectures have led to the rise of distributed tracing, which deals with tracing user requests through applications that are distributed. One of the challenges in implementing microservice observability with distributed tracing systems is pinpointing where errors originated or determining the root cause of an error. In a microservice architecture, an error associated with a particular request may originate in one of the services and might manifest as errors across all the services that participate in that request. Conventional tracing and monitoring systems are unable to systematically track the causal chain of events that result in an error or degraded response in a distributed system.
Embodiments of the present invention allow clients of a monitoring platform to track the causal chain of operations that resulted in an error or possibly a degraded response being returned to a client in response to a request. In order to track the causal chain of operations, the monitoring platform may utilize distributed error stacks. A distributed error stack (also referred to as a “trace error signature”) is a type of signature that summarizes a causal chain of operations encountered during trace execution that resulted in an error, and, further, also summarizes information pertaining to the root cause of the error. Capturing summary information for a causal chain of operations that followed from a root cause originating the error using a distributed error stack provides an application developer insight into types of errors occurring in the environment and helps assess impact.
In one or more embodiments, in order to generate distributed error stacks, the monitoring platform needs to keep track of and summarize the errors generated by the various services in a microservices-based application for discrete ranges of time. For each trace, the monitoring platform first identifies one or more root cause error spans belonging to that trace. A trace corresponds to a user request and is composed of one or more spans that are generated as the request propagates from one service to the next in response to calls made by the services in a distributed application. A span that returns an error to a caller service is considered an “error span,” and one or more of the error spans may be a “root cause error span,” which is an error span that is associated with the error originator.
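The identification step can be illustrated with a minimal sketch. This passage does not fix a particular detection rule, so the sketch assumes one plausible heuristic: an error span is treated as a root cause error span when none of its child spans is itself an error span. The `Span` fields and the function name `find_root_cause_error_spans` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None for the root span
    service: str
    is_error: bool  # True if the span returned an error to its caller


def find_root_cause_error_spans(trace: List[Span]) -> List[Span]:
    """Return error spans that have no error span among their children.

    This is an assumed heuristic, not a rule stated in the text.
    """
    # IDs of spans that have at least one child span in error.
    parents_with_error_child = {
        s.parent_id for s in trace if s.is_error and s.parent_id is not None
    }
    return [
        s for s in trace
        if s.is_error and s.span_id not in parents_with_error_child
    ]


# Hypothetical trace: the error starts in "payment" and propagates upstream.
trace = [
    Span("a", None, "frontend", True),
    Span("b", "a", "checkout", True),
    Span("c", "b", "payment", True),
    Span("d", "b", "email", False),
]
root_causes = find_root_cause_error_spans(trace)
```

Under this assumption, only the span in the hypothetical "payment" service qualifies, because the "frontend" and "checkout" spans each have a failing child and therefore merely relay the error.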
In one or more embodiments, having determined a root cause error span, the monitoring platform determines a call path associated with it. The call path comprises a chain of spans (associated with calls made by services in the application) starting at the root cause error span, where each subsequent span in the chain is a parent span of a prior span in the chain.
In one or more embodiments, having determined the call path, the monitoring platform may create a distributed error stack by mapping each span in the chain to a span error frame that comprises summary information for an associated span. The distributed error stack comprises a stack of span error frames where the top-most frame is associated with the root cause error span (which corresponds to the error originator).
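The call path and error stack construction described above can be sketched as follows. The sketch assumes that the chain runs from the root cause error span through every ancestor up to the root span, and that a span error frame summarizes a span by service, operation, and an error kind. The field names, the `SpanErrorFrame` type, and the `build_error_stack` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None for the root span
    service: str
    operation: str
    error_kind: str  # hypothetical summary field, e.g. a status code


@dataclass(frozen=True)
class SpanErrorFrame:
    service: str
    operation: str
    error_kind: str


def build_error_stack(
    root_cause: Span, spans_by_id: Dict[str, Span]
) -> List[SpanErrorFrame]:
    """Walk parent links from the root cause span, one frame per span.

    Index 0 of the returned list is the top-most frame (the error originator).
    """
    stack: List[SpanErrorFrame] = []
    visited = set()  # guards against malformed traces with parent cycles
    current: Optional[Span] = root_cause
    while current is not None and current.span_id not in visited:
        visited.add(current.span_id)
        stack.append(
            SpanErrorFrame(current.service, current.operation, current.error_kind)
        )
        current = spans_by_id.get(current.parent_id) if current.parent_id else None
    return stack


spans = [
    Span("a", None, "frontend", "POST /checkout", "500"),
    Span("b", "a", "checkout", "placeOrder", "500"),
    Span("c", "b", "payment", "charge", "402"),
]
by_id = {s.span_id: s for s in spans}
error_stack = build_error_stack(by_id["c"], by_id)
```

Keeping the originator at index 0 mirrors the description of the top-most frame; a renderer could then draw the list from top to bottom without reordering it.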
In an embodiment, a visual representation of the distributed error stack is rendered and presented to a client through a graphical user interface (GUI), which allows a client to efficiently determine the microservice that the error originated in and the call path that the error propagated through before ultimately reaching the user (either as an error or a degraded response to a user-request).
According to one or more embodiments of the present disclosure, a method is provided for performing error analysis in a computer system comprising microservices. The method comprises identifying a first root cause error span from among a plurality of error spans for a trace associated with a user-request processed by the microservices, wherein an error span is a span that returns an error to a microservice that initiates a call to the span, and wherein a root cause error span is an error span associated with a microservice that originated the error. Further the method comprises determining a first call path associated with the first root cause error span, wherein the first call path comprises a first chain of spans starting at the first root cause error span, and wherein each subsequent span in the first chain of spans is a parent span of a prior span in the first chain. Subsequently, the method comprises mapping each span in the first chain of spans to a span error frame to create a first error stack from the first chain of spans, wherein a span error frame associated with the first root cause error span comprises a top-most frame in the first error stack, and wherein a span error frame comprises summary information associated with a respective span. Thereafter, the method comprises rendering an image of the first error stack for display within a graphical user interface (GUI).
According to a second embodiment of the present disclosure a non-transitory computer-readable medium having computer-readable program code embodied therein for causing a computer system to perform error analysis in a system comprising microservices is provided. The method comprises identifying a first root cause error span from among a plurality of error spans for a trace associated with a user-request processed by the microservices, wherein an error span is a span that returns an error to a microservice that initiates a call to the span, and wherein a root cause error span is an error span associated with a microservice that originated the error. Further the method comprises determining a first call path associated with the first root cause error span, wherein the first call path comprises a first chain of spans starting at the first root cause error span, and wherein each subsequent span in the first chain of spans is a parent span of a prior span in the first chain. Subsequently, the method comprises mapping each span in the first chain of spans to a span error frame to create a first error stack from the first chain of spans, wherein a span error frame associated with the first root cause error span comprises a top-most frame in the first error stack, and wherein a span error frame comprises summary information associated with a respective span. Thereafter, the method comprises rendering an image of the first error stack for display within a GUI.
According to a third embodiment of the present disclosure a system is provided for performing error analysis in a network comprising microservices. The system comprises a processing device communicatively coupled with a memory and configured to: (a) identify a first root cause error span from among a plurality of error spans for a trace associated with a user request processed by the microservices, wherein an error span is a span that returns an error to a microservice that initiates a call to the span, and wherein a root cause error span is an error span associated with a microservice that originated the error; (b) determine a first call path associated with the first root cause error span, wherein the first call path comprises a first chain of spans starting at the first root cause error span, and wherein each subsequent span in the first chain of spans is a parent span of a prior span in the first chain; (c) map each span in the first chain of spans to a span error frame to create a first error stack from the first chain of spans, wherein a span error frame associated with the first root cause error span comprises a top-most frame in the first error stack, and wherein a span error frame comprises summary information associated with a respective span; and (d) render an image of the first error stack for display in a GUI.
The invention now will be described more fully hereinafter with reference to the accompanying drawings, which form a part hereof, and which show, by way of illustration, specific embodiments by which the invention may be practiced. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art. Among other things, the invention may be embodied as methods or devices. Accordingly, the invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. The following detailed description is, therefore, not to be taken in a limiting sense.
Throughout the specification and claims, the following terms take the meanings explicitly associated herein, unless the context clearly dictates otherwise. The phrase “in one embodiment” as used herein does not necessarily refer to the same embodiment, though it may. Furthermore, the phrase “in another embodiment” as used herein does not necessarily refer to a different embodiment, although it may. Thus, as described below, various embodiments of the invention may be readily combined, without departing from the scope or spirit of the invention.
In addition, as used herein, the term “or” is an inclusive “or” operator, and is equivalent to the term “and/or,” unless the context clearly dictates otherwise. The term “based on” is not exclusive and allows for being based on additional factors not described, unless the context clearly dictates otherwise. In addition, throughout the specification, the meaning of “a,” “an,” and “the” include plural references. The meaning of “in” includes “in” and “on.”
Reference will now be made in detail to embodiments, examples of which are illustrated in the accompanying drawings. While the embodiments will be described in conjunction with the drawings, it will be understood that they are not intended to limit the embodiments. On the contrary, the embodiments are intended to cover alternatives, modifications and equivalents. Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a thorough understanding. However, it will be recognized by one of ordinary skill in the art that the embodiments may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects of the embodiments.
Some regions of the detailed descriptions which follow are presented in terms of procedures, logic blocks, processing and other symbolic representations of operations on data bits within a computer memory. These descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. In the present application, a procedure, logic block, process, or the like, is conceived to be a self-consistent sequence of steps or instructions leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, although not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated in a computer system.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussions, it is appreciated that throughout the present invention, discussions utilizing the terms such as “determining,” “identifying,” “creating,” “grouping,” “associating,” “computing,” “reducing,” “storing,” “extracting,” “sampling,” “inputting,” “performing,” “using,” “applying,” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
The description below provides a discussion of computers and other devices that may include one or more modules. As used herein, the term “module” or “block” may be understood to refer to software, firmware, hardware, and/or various combinations thereof. It is noted that the blocks and modules are exemplary. The blocks or modules may be combined, integrated, separated, and/or duplicated to support various applications. Also, a function described herein as being performed at a particular module or block may be performed at one or more other modules or blocks and/or by one or more other devices instead of or in addition to the function performed at the described particular module or block. Further, the modules or blocks may be implemented across multiple devices and/or other components local or remote to one another. Additionally, the modules or blocks may be moved from one device and added to another device, and/or may be included in both devices. Any software implementations of the present invention may be tangibly embodied in one or more storage media, such as, for example, a memory device, a floppy disk, a compact disk (CD), a digital versatile disk (DVD), or other devices that may store computer code.
Embodiments are described herein according to the following outline:
- 1.0 Definitions
- 2.0 General Overview
- 3.0 Data Collection
- 4.0 Using Distributed Error Stacks To Track Error Propagation
- 4.1 Full-Context Service Graph
- 4.2 Full-Fidelity Service Level Indicators
- 4.3 Distributed Error Stacks
- 4.4 Mapping Error Stacks Across Multiple Attributes
- 5.0 Probe Exclusion
- 6.0 Logs, Traces and Metrics
1.0 Definitions
The term “trace” as used herein refers to a record of the manner in which a single user request, also referred to as a transaction, propagates from one microservice (hereinafter, interchangeably, referred to as “service”) to the next in a distributed application. A transaction is generally described as an end-to-end request-response flow, from the making of the user's initial request to receiving the final response. A transaction often involves the interaction of multiple services. A trace is the record of a transaction and each trace is identified using a unique trace identifier (“Trace ID”). The trace follows the course of a request or transaction from its source to its ultimate destination in a distributed system. A trace may be conceptualized as a highly dimensional structured log that captures the full graph of user-generated and background request execution within an application, and contains valuable information about interactions as well as causality.
The term “span” as used herein refers to the primary building block of a trace, representing an individual unit of work done in a distributed system. A trace is composed of one or more spans where a span represents a call within the request. It is appreciated that a call may be to a separate microservice or a function within a microservice. The trace captures the work done by each microservice as a collection of linked spans all sharing the same unique Trace ID. Each component of the distributed system may contribute a span—a named, timed operation representing a piece of the workflow. A span may also include a unique span ID, a service name, operation, duration (latency), and additional annotations and attributes (e.g., tags). The annotations and attributes can describe and contextualize the work being done under a span.
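As an illustration of the fields listed above, a span might be modeled as a small record. This is a minimal sketch, not a schema defined by the text: the field names, the microsecond timestamps, and the sample values are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Span:
    trace_id: str             # shared by every span in the same trace
    span_id: str              # unique within the trace
    parent_id: Optional[str]  # None for the root span
    service: str
    operation: str
    start_us: int             # assumed unit: microseconds
    end_us: int
    tags: Dict[str, str] = field(default_factory=dict)

    @property
    def duration_us(self) -> int:
        """Latency of the operation, derived from its timestamps."""
        return self.end_us - self.start_us


span = Span(
    trace_id="t-1",
    span_id="s-2",
    parent_id="s-1",
    service="checkout",
    operation="placeOrder",
    start_us=1_000,
    end_us=4_500,
    tags={"region": "east", "error": "true"},
)
```

Deriving the duration from start and end times, rather than storing it separately, is one design choice among several; some tracing formats carry the duration as its own field.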
The term “tags” as used herein refers to key:value pairs that enable user-defined annotation of spans in order to query, filter and comprehend trace data. Tag information is typically included with each span and there may be different levels of tag information included in a span.
“Global tags” represent properties of a user-request (e.g. tenant name, tenant level, client location, environment type, etc.) and can be extracted from any span of the trace based on configured rules. A global tag for a particular span in a trace may be attributed to the other spans in a trace, because all spans within a single trace would comprise the same global attributes. For example, if one span within a trace comprises a tag relating it to a request from a “gold” level “tenant,” it may be inferred that all the other spans in the same trace are associated with the same request and, accordingly, from the same “gold” level “tenant.” Accordingly, the “tenant:gold” key-value pair or tag may be attributed to all the other spans in the same trace.
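The attribution of global tags across a trace can be sketched as follows. The sketch assumes that each span's tags are available as a dictionary and that the set of global tag keys is supplied by configuration; the function name `attribute_global_tags` is hypothetical.

```python
from typing import Dict, List, Set


def attribute_global_tags(
    span_tags: List[Dict[str, str]], global_keys: Set[str]
) -> List[Dict[str, str]]:
    """Copy each global tag found on any span onto all spans of one trace."""
    global_tags: Dict[str, str] = {}
    for tags in span_tags:
        for key in global_keys:
            if key in tags:
                # Assumption: the first value seen for a global key wins.
                global_tags.setdefault(key, tags[key])
    # Non-global tags (e.g. "region") are left untouched on each span.
    return [{**tags, **global_tags} for tags in span_tags]


# Tags of two spans belonging to the same trace.
trace_tags = [
    {"tenant": "gold", "region": "east"},
    {"region": "west"},
]
enriched = attribute_global_tags(trace_tags, global_keys={"tenant"})
```

After the call, the second span also carries `tenant: gold`, while each span keeps its own `region` value.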
“Service-level tags” represent a non-global property of the service or the infrastructure that the associated span (which served a portion of the request) executed on, e.g., service version, host name, region, etc. Spans that executed on different services may have different values for the same tag, e.g. tag “region” may take different values in two services: a span in Service A may be attributed to “region:east”, and a span in Service B attributed to “region:west”. Also, multiple instances of the same service can serve different parts of the request and so the same service may be associated with different service-level tags in relation to those different parts.
“Span-level tags” comprise attributes that are specific to a particular span.
The term “root span” as used herein refers to the first span in a trace. A span without a parent is called a root span.
The term “child span” as used herein refers to any span that follows a root span, including a child of a child.
The term “parent span” as used herein refers to any span that executes a call (to a different service or a function within the same service) that generates another span, wherein the span executing the call is the “parent span” and the span generated in response to the call is the “child span.” Each span may typically comprise information identifying its parent span, which along with the Trace ID, may be used to consolidate spans associated with the same user-request into a trace.
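The consolidation of spans into traces can be sketched in a few lines. The sketch assumes that each span record carries a Trace ID, a span ID, and the ID of its parent span; the `SpanRecord` type and both function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional


class SpanRecord(NamedTuple):
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks a root span


def group_into_traces(spans: List[SpanRecord]) -> Dict[str, List[SpanRecord]]:
    """Group spans that share a Trace ID."""
    traces: Dict[str, List[SpanRecord]] = defaultdict(list)
    for span in spans:
        traces[span.trace_id].append(span)
    return dict(traces)


def children_index(trace: List[SpanRecord]) -> Dict[Optional[str], List[str]]:
    """Map each parent span ID to the IDs of its child spans."""
    children: Dict[Optional[str], List[str]] = defaultdict(list)
    for span in trace:
        children[span.parent_id].append(span.span_id)
    return dict(children)


spans = [
    SpanRecord("t1", "a", None),
    SpanRecord("t2", "x", None),
    SpanRecord("t1", "b", "a"),
    SpanRecord("t1", "c", "a"),
]
traces = group_into_traces(spans)
tree_t1 = children_index(traces["t1"])
```

In `tree_t1`, the entry under the key `None` lists the root span, and the entry under `"a"` lists its two child spans.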
2.0 General Overview
One of the fundamental shifts in modern day computing has been the shift from monolithic applications to microservices-based architectures. As previously mentioned, this is the shift from an application being hosted together (e.g. on a single system) to each piece of an application being hosted separately (e.g. distributed).
Microservices were created in order to overcome the issues and constraints of monolithic applications. Monolithic applications have a tendency over time to grow in size. As applications become larger and larger, the tight coupling between components results in slower and more challenging deployments. Because of the tight coupling, the potential for a failure of the entire application due to a recently deployed feature is high. In some cases, deployments can take several months to a year, greatly reducing the number of features that can be rolled out to users. This tight coupling also makes it difficult to reuse and replace components because of the effect they may have on other components throughout the application.
Microservices address these issues by being small in scope and modular in design. Modular design results in components being loosely coupled, which offers enormous benefits from the standpoint of being fault tolerant and independently deployable. The result is functionality that can be frequently deployed and continuously delivered. The loosely coupled modules without a central orchestrator in a microservice architecture, however, lead to considerable challenges in terms of monitoring, troubleshooting, and tracking errors.
These challenges have led to the rise of observability, a new generation of monitoring, the foundation for which is built on distributed tracing. Distributed tracing, also called distributed request tracing, is a method used to profile and monitor applications, especially those built using a microservices architecture. Distributed tracing helps pinpoint where failures occur and what causes poor performance. Distributed tracing, as the name implies, deals with tracing user requests through applications that are distributed. A trace represents a single user request, also referred to as a transaction, and represents the entire lifecycle of a request as it moves across all of the services or components of a distributed system.
A given request typically comprises one span (e.g. root Span A 202) for the overall request and a child span for each outbound call made to another service, database, or a function within the same microservice etc. as part of that request. For example, in the example of
Distributed tracing data is generated through the instrumentation of microservice based applications, libraries, and frameworks. Some common open source instrumentation specifications include OpenTracing and OpenCensus. The instrumentation handles the creation of unique trace and span IDs, keeps track of duration, adds metadata, and handles context data. Handling context data, also known as context propagation, is critical: it is responsible for passing context such as the trace ID between function/microservice calls, thereby enabling an observer to view the entire transaction at each stop along the way. Context propagation may, for example, be based on Representational State Transfer (“REST”). REST is header-based and requires a transaction to pass headers between service-to-service calls. In order to work properly, all services within a request must use the same context propagation format.
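Header-based context propagation can be illustrated with a minimal sketch. The header names `x-trace-id` and `x-parent-span-id` are hypothetical placeholders, not the names used by any particular propagation format, and the helper functions are assumptions made for illustration.

```python
import uuid
from typing import Dict, Optional, Tuple

TRACE_HEADER = "x-trace-id"         # hypothetical header name
PARENT_HEADER = "x-parent-span-id"  # hypothetical header name


def extract_context(headers: Dict[str, str]) -> Tuple[str, Optional[str]]:
    """Read the trace context from incoming headers.

    A new Trace ID is generated when the request carries none, which is
    the case for the first service that handles a user request.
    """
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    return trace_id, headers.get(PARENT_HEADER)


def inject_context(trace_id: str, span_id: str) -> Dict[str, str]:
    """Build the headers to attach to an outbound call."""
    return {TRACE_HEADER: trace_id, PARENT_HEADER: span_id}


# Service A receives a request without context and starts a trace.
trace_id, parent_of_a = extract_context({})
span_a = uuid.uuid4().hex[:16]
outgoing_headers = inject_context(trace_id, span_a)

# Service B receives the call from A and continues the same trace.
trace_id_b, parent_of_b = extract_context(outgoing_headers)
```

Because both services read and write the same two headers, the span created in service B carries the Trace ID of the request and names the calling span in service A as its parent. A service using different header names would break the chain, which is why a common format matters.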
Once the code has been instrumented and context propagation has been implemented using a standard format, the trace data generated by the services can be collected and analyzed to monitor and troubleshoot the microservice based applications generating the trace data.
3.0 Data Collection
The tasks 301 and 302 may be instrumented using open source or common commercial tracing libraries (e.g., from tracing applications such as Jaeger or Zipkin), in-house formats, or auto-instrumentation. Each task may be configured to generate spans that describe the processing of a portion of a request as the request traverses through the various tasks (or services) on the client-side.
An agent 303 is typically configured at the client-side host or service for receiving spans collected from the various tasks on the client-side and transmitting the spans to a collector 304. An agent may receive generated spans locally using User Datagram Protocol (UDP). Tasks 302 may comprise instrumented tasks that are not using an agent and may be configured to send spans directly to the collector 304. While spans can be collected from the client-side tasks without configuring an agent (e.g., in the case of Client B), using an agent can provide benefits including batching, buffering, and updating trace libraries.
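The batching and buffering benefit mentioned above can be sketched as follows. This is an assumed, simplified design: the `BatchingAgent` class, its size-based flush rule, and the `send` callback standing in for transmission to a collector are all hypothetical.

```python
from typing import Callable, Dict, List


class BatchingAgent:
    """Buffer spans locally and forward them in batches."""

    def __init__(
        self, send: Callable[[List[Dict[str, str]]], None], batch_size: int = 100
    ) -> None:
        self._send = send  # stands in for the transport to a collector
        self._batch_size = batch_size
        self._buffer: List[Dict[str, str]] = []

    def report(self, span: Dict[str, str]) -> None:
        """Accept one span and flush when the batch is full."""
        self._buffer.append(span)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Send any buffered spans as a single batch."""
        if self._buffer:
            self._send(self._buffer)
            self._buffer = []  # start a fresh list; the sent batch is not reused


received: List[List[Dict[str, str]]] = []
agent = BatchingAgent(received.append, batch_size=2)
for i in range(3):
    agent.report({"span_id": f"s{i}"})
agent.flush()  # forward the remaining partial batch
```

A production agent would typically also flush on a timer so that spans from quiet services are not held indefinitely; that is omitted here for brevity.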
Batches of span data collected by agent 303 are received at the collector 304. The collector may be implemented within a client's on-prem software or in the cloud (e.g., in an AWS VPC). Traces often generate duplicative data that is not relevant for monitoring or troubleshooting. Collector 304 avoids redundancies by sampling the data before processing and storing it. The collector 304 runs the span data through a processing pipeline and may store it in a specified storage or analytics backend such as monitoring service 306. Monitoring service 306 receives and analyzes the span data for monitoring and troubleshooting purposes. It should be noted that, in addition to monitoring service 306, tracing data might also be simultaneously transmitted to other types of storage and monitoring back-end services.
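The text does not specify how the collector samples. One common approach, shown here purely as an assumption, is to make the decision a deterministic function of the Trace ID so that all spans of a trace are kept or dropped together. The function name `keep_trace` and the choice of SHA-256 are illustrative.

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Decide whether to keep a trace, consistently for all of its spans.

    The Trace ID is hashed to a value in [0, 1) and compared with the rate,
    so the same Trace ID always yields the same decision.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


decision_first_span = keep_trace("trace-123", 0.25)
decision_later_span = keep_trace("trace-123", 0.25)
```

Hashing rather than drawing a random number per span avoids storing partial traces, since every span of a given trace reaches the same verdict regardless of arrival order.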
Monitoring service 306 may be a Software as a Service (SaaS) based service offering or, alternatively, it may also be implemented as an on-prem application. Monitoring service 306 receives the observability data collected by collector 304 and provides critical insights into the collected trace data to a client, who may be an application owner or developer.
4.0 Using Distributed Error Stacks to Track Error Propagation
Historically, there have been several challenges associated with implementing an analytics tool such as monitoring service 306 within a heterogeneous distributed system. One of the challenges in implementing microservice observability with distributed tracing is pinpointing where errors originated or determining the root cause of an error. In a microservice architecture, an error associated with a particular request may originate in one of the services and might manifest as errors across all the services that participate in that request. For example, a front-end service in a microservice architecture may return errors to the user, but the errors may have originated within a back-end service and propagated upstream through the distributed system to the front-end. Conventional methods of tracing are unequipped to efficiently track the manner in which errors originate and propagate through microservices based applications, and do not provide enough support for diagnosing problems. Further, conventional monitoring tools are inadequate because, among other issues, the alerts generated by the monitoring tools are typically threshold driven and lead to alert storms for the same underlying incident.
As mentioned above, any request that the user initiates would generate an associated trace. It is appreciated that each user request will be assigned its own Trace ID, which will then propagate to all the various spans that are generated during the servicing of that request. Each service may process a portion of the request and generate one or more spans depending on the manner in which instrumentation is configured for a respective service. The Trace ID may then be used by the server to group the spans together into a trace with that Trace ID. So, for example, the user's checkout transaction may generate a call at Front-end service 404, which may in turn generate calls to various microservices including CheckoutService 406. CheckoutService 406 may, in turn, generate calls to other services such as PaymentService 408, EmailService 410, ShippingService 412. Each of these calls passes the Trace ID to the respective service being called, wherein each service in the call path could potentially generate several child spans.
It should be noted that a service does not necessarily need to make calls to other services—a service may also generate calls to itself (or, more specifically, to different operations and sub-functions within the same service), which would also generate spans with the same Trace ID. Through context propagation then, each of the spans generated (either by a service making a call to another service or a service making a call to various operations and sub-functions within itself) is passed the Trace ID associated with the request. Eventually, the spans generated from a single user request would be consolidated (e.g., by a collector 304 or a monitoring service 306 of
As noted above, conventional distributed tracing methodologies are not equipped to efficiently track the manner in which errors originate and propagate through a microservice architecture such as the one shown in
Embodiments of the present invention advantageously allow clients to track the causal chain of operations that resulted in an error or possibly a degraded response being returned to a client in response to a request. Embodiments of the present invention further provide automated analyses that assist application owners and developers in isolating availability incidents and performance bottlenecks. Identifying performance bottlenecks may comprise isolating the component or layer causing errors or performance issues and the source microservice (including its dimensions or attributes, such as client level, client name, environment type, etc.) causing the errors and outages. Identifying the source microservice(s) of an error is advantageous because often errors propagate upstream without providing any clear indication to an application owner using a monitoring service (such as monitoring service 306) of where the error started. Embodiments of the present invention further help accelerate root-cause analysis by allowing an application owner or developer to drill down and get further insight into high-level anomalous behavior by providing exemplar traces for the anomalous behavior.
4.1 Full-Context Service Graph
Embodiments of the present invention create a full-context application topology graph (interchangeably referred to as a “service graph” herein) entirely from distributed tracing information.
A user may submit a request at front-end service 502; the user's request at front-end service 502 may set off a chain of subsequent calls. For example, a request entered by the user at the front-end may generate a call from front-end service 502 to recommendation service 504, which in turn may generate a further call to product catalog service 506. As noted previously, a chain of calls to service a request may also comprise calls that a service makes to internal sub-functions or operations within the same service.
Each edge in the service graph 500 (e.g., edges 522, 524 and 526) represents a dependency. Front-end service 502 depends on recommendation service 504 because it calls recommendation service 504. Similarly, recommendation service 504 depends on product catalog service 506 because it makes a call to product catalog service 506. The directionality of the edge represents a dependency of a calling node on the node that is being called. Each of the calls passes the Trace ID for the request to the respective service being called. Further, each service called in the course of serving the request could potentially generate several spans (associated with calls to itself or other services). Each of the spans generated will then carry the Trace ID associated with the request, thereby, propagating the context for the trace. Spans with the same Trace ID are, thereafter, grouped together to compose a trace.
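The grouping of spans by Trace ID and the derivation of dependency edges described above can be illustrated with a minimal sketch. The `Span` fields, the service names, and the helper names `group_into_traces` and `service_edges` are assumptions chosen for illustration and are not taken from the described embodiments.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    # Hypothetical minimal span record; real spans carry more fields.
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None for the root span of a trace
    service: str
    operation: str


def group_into_traces(spans):
    """Group spans that carry the same Trace ID into one trace."""
    traces = defaultdict(list)
    for span in spans:
        traces[span.trace_id].append(span)
    return dict(traces)


def service_edges(trace):
    """Derive caller -> callee edges between distinct services in one trace."""
    by_id = {s.span_id: s for s in trace}
    edges = set()
    for span in trace:
        parent = by_id.get(span.parent_id)
        # Calls a service makes to itself do not create a graph edge.
        if parent is not None and parent.service != span.service:
            edges.add((parent.service, span.service))
    return edges


spans = [
    Span("t1", "a", None, "frontend", "/product"),
    Span("t1", "b", "a", "frontend", "request/GetRecommendations"),
    Span("t1", "c", "b", "recommendation", "/GetRecommendations"),
    Span("t1", "d", "c", "productcatalog", "/GetProducts"),
    Span("t2", "e", None, "frontend", "/cart"),
]
traces = group_into_traces(spans)
edges = service_edges(traces["t1"])
```

In this sketch the edge direction runs from the calling service to the called service, matching the dependency direction described for the service graph.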
In some embodiments, the GUI comprising service graph 500 may be configured so that the nodes themselves provide a visual indication regarding the number of errors that originated at a particular node versus errors that propagated through the particular node but originated elsewhere. For example, as seen in the service graph of
It is appreciated that conventional monitoring technologies would not provide adequate means for a client to distinguish between errors that originated at recommendation service 504 versus errors that propagated through recommendation service 504 but originated elsewhere. The service graph 500 allows clients to visualize an overview showing errors that originated at recommendation service 504 as compared with errors that simply propagated through recommendation service 504. As shown in
Similarly, solid-filled region 560 within product catalog service node 506 represents all the errors that originated at product catalog service. Note that all the errors returned by product catalog service 506 originated at product catalog service. In other words, product catalog service 506 does not have errors from another downstream service propagating through it because it does not make calls to another service that is further downstream in the execution pipeline. Conversely, front-end service 502 only comprises a partially-filled region 564 because all the errors observed at front-end service 502 propagated to it from other downstream services (e.g., recommendation service 504, currency service 530, product catalog service 506, etc.). Front-end service 502 was not the originator of any errors in the example shown in
In an embodiment, more detailed information pertaining to the errors including distributed error stacks associated with application topology graph 500 may be displayed in a side-bar 589, which will be discussed in detail below.
4.2 Full-Fidelity Service Level Indicators
Embodiments of the present invention are also able to provide clients with full fidelity service level indicators (SLIs) associated with each of the services. The SLIs are aggregated and extracted for the various services in a microservices architecture so that the behavior of applications may be understood. An SLI is a quantitative measure of some aspect of the level of service that is provided. Most clients consider request latency—how long it takes to return a response to a request—as a key SLI. Other common SLIs include the error rate (often expressed as a fraction of all requests received) and system throughput, typically measured in requests per second. The measurements are often aggregated, e.g., raw trace data is collected over a measurement window and then turned into a rate, average, or percentile.
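As an illustration of how raw measurements collected over a window may be turned into a rate or percentile, the following sketch computes request, error, and latency SLIs from a list of observations. The nearest-rank percentile definition, the function names, and the sample data are assumptions made for the example; the embodiments do not prescribe a particular percentile method.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def compute_slis(requests, window_seconds):
    """requests: list of (latency_ms, is_error) pairs observed within the window."""
    if not requests:
        raise ValueError("no requests observed in the measurement window")
    latencies = [latency for latency, _ in requests]
    errors = sum(1 for _, is_error in requests if is_error)
    return {
        "request_count": len(requests),
        "requests_per_second": len(requests) / window_seconds,
        "error_rate": errors / len(requests),  # fraction of all requests
        "p50_ms": percentile(latencies, 50),
        "p90_ms": percentile(latencies, 90),
        "p95_ms": percentile(latencies, 95),
    }


# Hypothetical sample: 20 requests in a 10 second window, 2 of which failed.
sample = [(10 * i, i in (5, 15)) for i in range(1, 21)]
slis = compute_slis(sample, window_seconds=10)
```

With this definition, a p95 value of 190 ms means that 95% of the sampled requests were served in 190 ms or less.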
Referring to
As shown in
For example, the Request-related SLIs 610 comprise information regarding the rate of requests and number of requests serviced by recommendation service 606 during a specific time duration. The time duration over which the SLIs are calculated may be adjusted using drop-down menu 622. The time duration over which SLIs are calculated can vary, for example, from 5 minutes to 3 days. As indicated by the time axis on graph 628, for this example, a time window of 30 minutes (from 9:09 to 9:39 a.m.) is selected.
In an embodiment, the pop-up window 608 also provides the client information pertaining to Errors-related SLIs 612. In the example of
In an embodiment, the pop-up window 608 also provides the client information pertaining to Latency-related SLIs 614 and a graphical representation 620 of the same. For example, SLI p95 indicates that for 95% of the users, the latency for servicing the requests was less than 467 ms. Latency-related SLIs also include information regarding p90 and p50 percentiles. Graph 620, in the example of
In one embodiment of the present invention, the pop-up window 608 also displays information pertaining to errors for which the selected service was the root-cause. The Root Cause information 616 includes the number of errors for which the selected service (e.g., recommendation service 606 in the example of
For the example of
As mentioned in connection with
As seen in
As seen in dialog box 808, SLIs pertaining to number of requests (or calls) that were made, the number of those that returned in errors, and the latency associated with servicing the requests are provided. Further, similar to the feature discussed in connection with
It should be noted that the latency value 820 of 49 ms shown in
It should be noted that while the terms “mouse-over effect,” “hovering the cursor over,” “clicking” and “double-clicking” have been used herein to represent well-known GUI interactions and exemplary actions that a user may take to access the various features, embodiments of the present invention are not limited to these user interactions. The features of the disclosed embodiments may be accessed in several different ways, including, but not limited to, keyboard shortcuts, right-clicking and selecting an option from a drop-down menu, selecting objects via touch-screen capability, etc.
In some embodiments, other modes of analysis besides full-fidelity are also supported. For example, a first mode of analysis may comprise a “fast and full-fidelity” mode that supports a fast and full-fidelity analysis for a limited set of tags, e.g., the global tags or attributes of the trace. This mode may be useful for clients that need accurate SLI information for a limited set of high-value tags despite the expensive indexing it may entail. A second mode of analysis may comprise a “fast and approximate” mode that provides fast and approximate SLI information on any dimension of data for clients that want to rapidly analyze problem areas in their application. Further, a third mode of analysis may comprise a “slow and full-fidelity” mode where a full-fidelity analysis can be conducted on any dimension of data (or attribute). The “slow and full-fidelity” mode may sacrifice speed for accuracy, and may be used by clients that need a more thorough analysis across all dimensions. In one embodiment, all three modes may be supported by the monitoring platform simultaneously.
In some embodiments, information gleaned from the computed SLIs may be used to visually represent nodes and edges within a service graph in a way that allows clients to intuit the health of the services and the flow of traffic between services. For example, the service graph 500 of the GUI may be configured to indicate the diminishing health status of a particular node or a particular edge by changing the pattern or shade (or color) of a respective node or edge. A health of a node may be inferred on the basis of how many calls made to a node resulted in an error. Similarly, a health of a dependency (or edge) may be inferred on the basis of how many calls represented by the edge resulted in an error.
Unhealthy nodes (e.g., nodes that fail to meet their Service Level Objectives because they return a high percentage of errors when called) may be represented using a darker shade (or an appropriate color, e.g., red). Further, by way of example, if a particular call to a service results in a high number of errors, the edge associated with the call may also be changed to the darker shade (or an appropriate color, e.g., red); any other visual attribute may be used instead.
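One way to picture the mapping from error counts to a visual attribute is the short sketch below. The three shade names and the 5% threshold standing in for a Service Level Objective are purely hypothetical; the embodiments leave the thresholds and the visual attributes open.

```python
def error_ratio(calls, errors):
    """Fraction of calls to a node (or along an edge) that resulted in an error."""
    return errors / calls if calls else 0.0


def health_shade(calls, errors, slo_max_error_ratio=0.05):
    """Map an error ratio to a shade; the threshold is an assumed SLO value."""
    ratio = error_ratio(calls, errors)
    if ratio > slo_max_error_ratio:
        return "dark"    # unhealthy: the assumed SLO is not met
    if ratio > 0:
        return "medium"  # some errors, still within the assumed SLO
    return "light"       # no errors observed


node_shade = health_shade(calls=100, errors=20)
edge_shade = health_shade(calls=100, errors=2)
```

The same function can be applied to a node (calls received by a service) or to an edge (calls flowing along one dependency).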
Referencing
Similarly, referencing
In one embodiment, the GUI for the service graph may be configured so that the thickness of an edge may be dynamic and represent the number of requests flowing between services. Similarly, in one embodiment of the present invention, the size of the node may be dynamic and represent the amount of traffic the node is receiving. For example, front-end service 502, recommendation service 504 and product catalog service 506, all receive more traffic than currency service 530. Similarly, the dependency 522 in
Embodiments of the present disclosure compute full-fidelity SLIs across multiple dimensions in real-time. Further, embodiments of the present invention support high dimensionality and high cardinality tags. In one embodiment, the GUI of
Similarly, drop-down on-screen menus 934, 936 and 932, relating to incident, tenant-level and environment respectively, all provide further dimensions across which SLIs may be computed. SLIs may be computed for each of these dimensions and also for each combination of dimensions. In an embodiment, for each combination of dimensions selected using one or more of the drop-down menus, the client may be able to determine the computed SLIs (e.g., by hovering a cursor over the various nodes and edges of the graph after the dimensions have been selected using the drop-down menus shown in
It should be noted that clients might have different attributes or dimensions that may be of interest for their respective application. In an embodiment, the monitoring platform can be configured to provide insight into client-specific dimensions. Consequently, the specific attributes or dimensions available in each of the drop-down menus may vary by client.
Referencing
Further, in one or more embodiments, the user may be able to click on any of the interactive graphical representations of the SLIs (e.g., graph 628 in
Further, an interactive graphical view 1020 of the SLIs is also provided, and similar to features associated with graph 704 in
It should be noted that GUI 1000 also provides the user the ability to select any combination of dimensions through drop-down menus 1050 in order to compute more particularized SLIs for each of the services.
4.3 Distributed Error Stacks
As noted previously, one of the challenges associated with distributed tracing is diagnosing the root cause of error events. Referencing
In order to generate distributed error stacks (interchangeably referred to as “trace error signatures”), embodiments of the present invention keep track of and summarize the errors generated by the various services in a microservices-based application for discrete time-ranges. As will be explained further below, embodiments of the present invention subsequently perform a statistical aggregate of the trace error signatures across different traces to keep track of the number of times in a given time window an error associated with each trace error signature occurred.
To keep track of and summarize the errors generated by the various services in a microservices-based application for discrete time-ranges, embodiments of the present invention need to determine error spans associated with a trace. A span generated by a service is considered an “error span” if it represents an operation that returned an error to its caller (e.g., a microservice that made the call that resulted in the error). An error span may be explicitly marked as an error span during the instrumentation process or may be inferred to be an error span based on some type of heuristic, e.g., if inspecting the span tags for a span returns an HTTP status code that is >=400. In one embodiment, the application owner can customize what constitutes an error span by using a query language. For example, the application owner or developer may want to regard only spans that have an HTTP status code >500 as an error span.
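The error span determination described above might be sketched as follows. The tag names `error` and `http.status_code` and the `status_predicate` parameter are assumptions for illustration; the query language mentioned above is represented here by a plain Python callable rather than by any real query syntax.

```python
def is_error_span(tags, status_predicate=lambda status: status >= 400):
    """Decide whether a span is an error span from its tags.

    An explicit flag set during instrumentation takes precedence; otherwise
    the HTTP status code is inspected using a customizable predicate.
    """
    if "error" in tags:  # hypothetical explicit marker
        return bool(tags["error"])
    status = tags.get("http.status_code")  # hypothetical tag name
    return status is not None and status_predicate(int(status))


# Default heuristic: any status code >= 400 counts as an error.
default_404 = is_error_span({"http.status_code": 404})

# Customized rule: only status codes greater than 500 count as errors.
strict = lambda status: status > 500
strict_404 = is_error_span({"http.status_code": 404}, status_predicate=strict)
strict_503 = is_error_span({"http.status_code": 503}, status_predicate=strict)
```

Passing the rule in as a predicate keeps the heuristic separate from the span inspection, which is one simple way to let an application owner change the definition without touching the rest of the pipeline.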
A root cause error span is an error span that does not have a child span that is also an error span.
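This definition can be applied directly to the spans of one trace, as in the sketch below. The dictionary keys `span_id`, `parent_id`, and `is_error` and the sample trace are assumptions made for the example.

```python
def find_root_cause_error_spans(spans):
    """Return the error spans that have no child span that is also an error span."""
    # Collect the IDs of all spans that are the parent of at least one error span.
    parents_of_error_spans = {
        s["parent_id"] for s in spans if s["is_error"] and s["parent_id"] is not None
    }
    return [
        s for s in spans
        if s["is_error"] and s["span_id"] not in parents_of_error_spans
    ]


# Hypothetical trace: a -> b -> c and a -> d -> e, where d is not an error span.
trace = [
    {"span_id": "a", "parent_id": None, "is_error": True},
    {"span_id": "b", "parent_id": "a", "is_error": True},
    {"span_id": "c", "parent_id": "b", "is_error": True},
    {"span_id": "d", "parent_id": "a", "is_error": False},
    {"span_id": "e", "parent_id": "d", "is_error": True},
]
root_cause_ids = [s["span_id"] for s in find_root_cause_error_spans(trace)]
```

In this example both `c` and `e` qualify, which shows how a single trace can hold more than one root cause error span.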
In one embodiment of the present invention, a span error signature (interchangeably referred to as a “span error frame”) is created to summarize the type of error returned by the error span to its caller. In order to create a span error signature, embodiments of the present invention map a span to some key attributes that summarize the type of error associated with the span (e.g., a span may either be associated with the root cause originating service for the error or with a service through which the error propagates). An exemplary span error signature may be represented as the following exemplary tuple: {serviceName, operationName, statusCode, isError, isRootCause}, where serviceName represents the name of the service that executes a call resulting in the error span, operationName represents the name of the specific operation within the service that made the call, statusCode represents the status code of the error (e.g., “503”), isError is a ‘TRUE/FALSE’ flag that indicates whether a span is an error span, and isRootCause is a ‘TRUE/FALSE’ flag that indicates whether the error span is the root cause error span. It should be noted that the tuple used to represent the span error signature may include other identifying dimensions as well. Furthermore, in alternate embodiments of the present invention, the tuple used to represent a span error signature may contain fewer elements. For example, the tuple may comprise only the serviceName, statusCode, isError and isRootCause elements.
It should be noted that a span error signature may be calculated even for non-error spans, where the error related fields (e.g., isError, isRootCause) may be empty or comprise ‘FALSE’ values.
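The exemplary tuple can be modeled as an immutable, hashable record so that frames can later be compared and counted. The sketch below uses Python naming for the five fields of the tuple; the input dictionary keys and the `to_frame` helper are hypothetical.

```python
from typing import NamedTuple


class SpanErrorFrame(NamedTuple):
    # Mirrors {serviceName, operationName, statusCode, isError, isRootCause}.
    service_name: str
    operation_name: str
    status_code: str
    is_error: bool
    is_root_cause: bool


def to_frame(span, is_root_cause=False):
    """Map a span (a dict with assumed keys) to its span error signature."""
    is_error = bool(span.get("is_error", False))
    return SpanErrorFrame(
        service_name=span["service"],
        operation_name=span["operation"],
        status_code=str(span.get("status_code", "")),
        is_error=is_error,
        # A non-error span can never be a root cause error span.
        is_root_cause=is_error and is_root_cause,
    )


failing = to_frame(
    {"service": "productcatalogservice", "operation": "/GetProducts",
     "status_code": 503, "is_error": True},
    is_root_cause=True,
)
healthy = to_frame({"service": "frontend", "operation": "/product", "status_code": 200})
```

Because the record is a tuple, two spans that map to the same field values produce equal frames, which is what allows signatures to be aggregated across traces.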
In one or more embodiments, trace error signatures are created by identifying one or more root cause error spans associated with a particular trace (from among all the error spans for the trace), determining a call path for each identified root cause error span, and, for each call path identified, mapping each span in the call path to a span error signature.
At block 1102, a root cause error span is identified from among a plurality of error spans associated with a particular trace. It should be noted that a single trace can comprise multiple root cause error spans as will be discussed further in connection with
At block 1104, a call path is determined for the root cause error span, wherein the call path comprises a chain (or series) of spans starting at the root cause error span, and wherein each subsequent span in the chain is a parent span of the prior span in the chain. Stated differently, each subsequent span error signature will belong to the caller of the previous span in the list (e.g., any given span in the chain has a child-of reference to the subsequent span).
In one embodiment, the chain or series of spans starts from the root cause error span and extends to include the last span in the call path that is an error span and has a parent that is a non-error span. In this embodiment, the span error chain ends at the child of the first non-error span encountered along the call path. In other words, the span error chain only comprises spans that are error spans.
In a different embodiment, the chain of spans is extended all the way to the root span of the trace regardless of whether it is an error span or not. Extending the chain all the way to the root span may be beneficial in cases where errors manifest as degraded payload at a service (e.g., a front end service) even if the root span itself is not marked as an error span. In yet another embodiment, different criteria are also possible; for example, the chain may be extended beyond the last non-error parent span but may stop short of the root span of the trace.
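The two main termination criteria for the chain can be contrasted in a small sketch. The span dictionaries, the `call_path` helper, and the `extend_to_root` flag are assumptions for illustration; other stopping criteria are possible, as noted above.

```python
def call_path(root_cause_id, spans_by_id, extend_to_root=False):
    """Walk parent references upward, starting at the root cause error span.

    With extend_to_root=False the chain stops before the first non-error span,
    so it contains only error spans. With extend_to_root=True the chain runs
    to the root span of the trace regardless of error status.
    """
    path = []
    current = spans_by_id.get(root_cause_id)
    while current is not None:
        if not extend_to_root and not current["is_error"]:
            break
        path.append(current)
        current = spans_by_id.get(current["parent_id"])
    return path


# Hypothetical trace: root span "a" is not marked as an error, "b" and "c" are.
spans_by_id = {
    "a": {"span_id": "a", "parent_id": None, "is_error": False},
    "b": {"span_id": "b", "parent_id": "a", "is_error": True},
    "c": {"span_id": "c", "parent_id": "b", "is_error": True},
}
errors_only = [s["span_id"] for s in call_path("c", spans_by_id)]
to_root = [s["span_id"] for s in call_path("c", spans_by_id, extend_to_root=True)]
```

The second variant keeps the non-error root span in the chain, which corresponds to the case where an error surfaces at the front end only as a degraded payload.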
At block 1106, a distributed error stack (or “trace error signature”) is created from the chain of spans comprising the call path by mapping each span in the chain to a span error frame (the “span error signature”). As mentioned above, a span error signature comprises summary information for an associated span and can be created for both error and non-error spans. The top-most frame of the distributed error stack will belong to the root cause error span and the bottom-most frame will belong to the last span in the chain (which may either be the root span of the trace or the last span in the call path that is an error span, depending on implementation).
Accordingly, a trace error signature is a signature that summarizes a type of error encountered during trace execution. In an embodiment, a trace error signature comprises an ordered list of span error signatures starting from the root cause error span all the way to the last span in the call path that is an error span and has a parent that is a non-error span. Accordingly, the first span error signature in the ordered list, or the top-most frame in the distributed stack, will belong to the root cause error span. In this embodiment, the bottom-most frame comprises the error signature of the child of the first non-error span encountered along the call path. In a different embodiment, the bottom-most frame of the stack comprises the error signature of the root span of the trace regardless of whether it is an error span or not; this may be done, for example, to provide the developer visibility in instances where intermediary services in a call path may have absorbed an error, but the error still manifests at an upstream service as a degraded or incorrect payload.
A single trace can comprise several trace error signatures, one for each chain of spans (that constitutes a call path) extending out from one or more root cause error spans. It should be noted, however, that trace error signatures may be unique even if they share the same call path if, for example, the error status codes returned by a service in response to one or more calls in the call path are different. Accordingly, there may be cases where the call path for two different trace error signatures is identical, but the stacks are distinct from each other because one or more spans within the call path of one error stack comprise a different error status code from a corresponding span in the other error stack. For example, two call paths may traverse the same set of services, but the trace error signatures may be distinct from each other because a span generated by a service in the first call path may generate an error with a “500” status code while a span generated by the same service in the second call path may generate an error with a “503” status code.
In certain cases, two or more call paths may converge on the same ancestor but comprise different trace error signatures because they start from a different root cause span.
As noted previously, a single trace may comprise multiple trace error signatures because it may have multiple root cause error spans.
At block 1202, one or more root cause error spans belonging to the same trace are identified from among a plurality of error spans. For example, a monitoring service may receive potentially millions of spans from a client. To identify trace error signatures related to a particular trace, one or more root cause error spans comprising a common Trace ID may be identified from the set of incoming spans.
At block 1204, a call path is determined for each root cause error span in the trace, wherein the call path comprises a chain (or series) of spans starting at a root cause error span, and wherein each subsequent span in the chain is a parent span of the prior span in the chain. As noted previously, each chain may extend to include the last span in the call path that is an error span and has a parent that is a non-error span or it may be extended further to include the root span of the trace regardless of whether it is an error span or not.
At block 1206, for each chain of spans determined at block 1204, a distributed error stack (or “trace error signature”) is created from each chain of spans by mapping each span in the chain to a span error frame (the “span error signature”). The top-most frame of each distributed error stack will belong to a respective root cause error span for the chain and the bottom-most frame will belong to the last span in the respective chain (which may either be the root span of the trace or the last span in the respective call path that is an error span).
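The three blocks described above can be sketched end to end for a single trace. This is a minimal illustration under stated assumptions: the `Span` and `Frame` records, the `error_stacks` function, and the sample trace are hypothetical, and the chain is terminated at the last consecutive error span (one of the embodiments described earlier).

```python
from typing import NamedTuple, Optional


class Span(NamedTuple):
    span_id: str
    parent_id: Optional[str]
    service: str
    operation: str
    status_code: str
    is_error: bool


class Frame(NamedTuple):
    service: str
    operation: str
    status_code: str
    is_error: bool
    is_root_cause: bool


def error_stacks(trace):
    """Build one distributed error stack per root cause error span in a trace."""
    by_id = {s.span_id: s for s in trace}
    has_error_child = {s.parent_id for s in trace if s.is_error}
    stacks = []
    for span in trace:
        # Block 1202: a root cause error span is an error span with no error child.
        if not span.is_error or span.span_id in has_error_child:
            continue
        # Blocks 1204 and 1206: follow parent references and map spans to frames.
        frames = []
        current = span
        while current is not None and current.is_error:
            frames.append(Frame(current.service, current.operation,
                                current.status_code, True,
                                current.span_id == span.span_id))
            current = by_id.get(current.parent_id)
        stacks.append(tuple(frames))
    return stacks


# Hypothetical trace with two failing calls to the same operation.
trace = [
    Span("r", None, "frontend", "/product", "503", True),
    Span("a", "r", "frontend", "request/GetProducts", "503", True),
    Span("b", "a", "productcatalog", "/GetProducts", "503", True),
    Span("c", "r", "frontend", "request/GetRecommendations", "503", True),
    Span("d", "c", "recommendation", "/GetRecommendations", "503", True),
    Span("e", "d", "productcatalog", "/GetProducts", "503", True),
]
stacks = error_stacks(trace)
```

Both stacks start at the same service and operation, yet they are distinct signatures because their call paths differ.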
In one or more embodiments, where each service may generate multiple spans, only a single span error frame (or span error signature) is maintained in the distributed error stack for each service even though the service may generate several error spans. The single span error frame that is generated for each service may summarize all the frames generated by that service. This is beneficial in instances where an application owner is primarily interested in monitoring communication between services (and not as interested in calls that a service makes to operations and functions within the same service).
It should be noted that various other alternatives for representing the tuple of information for a span error signature are also possible. For example, in certain embodiments, a client may customize the span error signature. Rather than hard-coded fields, a client may be allowed to project their own desired tags to customize what constitutes a span error signature. By way of example, a client may even include complex expressions evaluated from one or many tags or include regex-based matching on a single tag in the span error signature. In an embodiment, the client may also be provided a GUI through which the client can further customize the span error signature.
Further, in certain embodiments, a client may also be able to customize the trace error signature or distributed error stacks, e.g., by defining a “traversal specification” that indicates exactly how many frames from the call path should be included in the stack and/or whether consecutive spans generated by the same service in the call path should be combined into a single frame.
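A traversal specification of the kind described above might be sketched as two optional settings applied to a stack of frames. The function name, its parameters, and the choice to keep the first frame of each run (the one closest to the root cause) as the summary are assumptions; the embodiments do not fix how frames from the same service are summarized.

```python
from itertools import groupby


def apply_traversal_spec(frames, max_frames=None, merge_same_service=False):
    """frames: list of (service, operation, status_code), root cause first."""
    if merge_same_service:
        # Collapse each run of consecutive frames from one service into a
        # single frame; here the first frame of the run stands in for the run.
        frames = [next(run) for _, run in groupby(frames, key=lambda f: f[0])]
    if max_frames is not None:
        frames = frames[:max_frames]
    return list(frames)


# Hypothetical stack with two consecutive frames for two of the services.
stack = [
    ("productcatalog", "/GetProducts", "503"),
    ("recommendation", "request/GetProducts", "503"),
    ("recommendation", "/GetRecommendations", "503"),
    ("frontend", "request/GetRecommendations", "503"),
    ("frontend", "/product", "503"),
]
per_service = apply_traversal_spec(stack, merge_same_service=True)
top_two = apply_traversal_spec(stack, max_frames=2, merge_same_service=True)
```

Merging before truncating means the frame limit counts services rather than individual spans, which suits an owner who mainly cares about communication between services.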
In an embodiment, the user may be able to customize the trace error signature to include the error message generated in response to a particular error. Thereafter, machine learning based intelligent grouping may be able to summarize the error messages to control the cardinality of the signatures.
Referring back to
In the current example, referencing
Referencing
Embodiments of the present invention perform a statistical aggregate of the trace error signatures across different traces to maintain the number of times in a given time window an error associated with each trace error signature occurred. Each distributed error stack computed and its associated count is displayed in the error panel 589.
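The aggregation can be pictured as a counter keyed by time window and signature. In the sketch below, the fixed 60 second window, the `aggregate` function, and the sample signatures are assumptions for illustration; signatures only need to be hashable, for example tuples of frames.

```python
from collections import Counter


def aggregate(observations, window_seconds=60):
    """Count occurrences of each trace error signature per time window.

    observations: iterable of (timestamp_seconds, signature) pairs.
    """
    counts = Counter()
    for timestamp, signature in observations:
        # Align each observation to the start of its window.
        window_start = int(timestamp // window_seconds) * window_seconds
        counts[(window_start, signature)] += 1
    return counts


# Hypothetical signatures, each reduced to (service, status_code) frames.
sig_a = (("productcatalog", "503"), ("recommendation", "503"), ("frontend", "503"))
sig_b = (("recommendation", "503"), ("frontend", "503"))
counts = aggregate([(5, sig_a), (30, sig_a), (10, sig_b), (65, sig_a)])
```

A panel could then list each signature together with its count for the selected window.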
For example, the error panel 589 displays information pertaining to 5 error stacks, recommendation service error stack 555, product catalog service error stack 550, product catalog service error stack 596, currency service error stack 597, and payment service error stack 598.
Two of the error stacks in the exemplary GUI of
In an embodiment, each of the error stacks shown in
In one embodiment, a client may be provided an interface to filter the distributed error stacks by various criteria. For example, a client may be able to filter and display distributed error stacks that contain only certain specified services. Alternatively, a client may be able to filter distributed error stacks based on whether a particular service is associated with a root cause error span (e.g., show all distributed error stacks where “recommendation service” is the originator).
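Both filtering criteria can be expressed as simple predicates over a stack whose first frame belongs to the root cause error span. The helper names and the reduced `(service, status_code)` frames below are assumptions for illustration.

```python
def contains_service(stack, service):
    """True if any frame in the stack belongs to the given service."""
    return any(frame[0] == service for frame in stack)


def originates_at(stack, service):
    """True if the top-most frame (the root cause) belongs to the given service."""
    return bool(stack) and stack[0][0] == service


# Hypothetical stacks; each frame is (service, status_code), root cause first.
stacks = [
    (("recommendation", "503"), ("frontend", "503")),
    (("productcatalog", "503"), ("recommendation", "503"), ("frontend", "503")),
    (("currency", "500"), ("frontend", "500")),
]
through_recommendation = [s for s in stacks if contains_service(s, "recommendation")]
rooted_at_recommendation = [s for s in stacks if originates_at(s, "recommendation")]
```

The first filter returns every stack the service takes part in, while the second returns only the stacks where that service is the originator.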
All the error stacks 580 provide information about errors that occur within the application topology represented by graph 500 for a given time-range. It should be noted that for the exemplary GUI of
The top-most frame of an error stack, as displayed in the GUI, is associated with the root cause error span. Error stack 550 in
The “503” error status code generated by the product catalog service 506 (for stack 550), for example, propagates upstream to the recommendation service 504 as a “503” error before it finally shows up at the front end service 502 as a “503” error. It should be noted that there are two frames within stack 550 associated with each of recommendation service 504 and front end service 502. As noted previously, a service may often execute calls to itself (e.g., different operations within the same service). Accordingly, there may be multiple consecutive frames in an error stack associated with a single service.
In an embodiment, only a single error frame may be displayed within the distributed error stack for each service even though the service may make multiple calls to itself. The single span error frame that is created for each service may summarize all the frames generated by that service.
Error stack 555 in
Error stack 1496 in
Meanwhile, error stack 1497 in
The on-screen GUI of
Similarly, for error stack 555, a client may visually take note that an error originating at recommendation service 504 may propagate to front end service 502 via edge 522.
For error stack 1496 (the expanded view of error stack 596 in
Finally, for error stack 1497 (the expanded view of error stack 597 in
Conventional monitoring tools do not provide application owners clear indications of where an error started. For example, conventional monitoring tools would not be able to differentiate errors associated with error stack 550 as compared with error stack 1496 (represented as collapsed stack 596 in
Similarly, conventional monitoring tools may also not be able to distinguish between errors associated with error stack 555 as compared with error stack 550. Both types of errors terminate at front-end service providing an error status code of “503.” Nevertheless, the error associated with stack 555 originates at recommendation service 504 while the error associated with stack 550 originates at product catalog service 506. Without the root cause analysis capabilities provided by embodiments of the present invention, a user may not be able to distinguish between the two different originators of errors.
In some embodiments, the error stacks 580 may be filtered and grouped across multiple dimensions. As discussed previously, embodiments of the present invention support high dimensionality and high cardinality tags providing a developer the ability to analyze the distributed error stacks 580 across several dimensions of data, e.g., multiple categories of dimensions including, but not limited to, workflow 630, environment 632, incident 634 and tenant-level 636 (as discussed in connection with
By helping clients summarize the errors occurring in the environment over a requested time range and different dimensions, embodiments of the present invention advantageously help the clients prioritize investigations and effort based upon the frequency, impact and possibly unusualness of the errors.
Using the dynamic application topology graph 500, the distributed error stacks 580, and the various associated features, embodiments of the present invention advantageously enable a client to visually analyze the entire application and provide actionable insights that greatly accelerate problem isolation and root cause analysis while preserving context as an application owner navigates through the monitoring platform. The service graph 500 provided by embodiments of the present invention is constructed entirely using distributed tracing information. The service graph 500 also supports high dimensionality and high cardinality tags providing a developer the ability to advantageously analyze the graph across several dimensions of data (e.g., multiple categories of dimensions including, but not limited to, workflow 630, environment 632, incident 634 and tenant-level 636 as discussed in connection with
Furthermore, embodiments of the present invention, unlike conventional monitoring services, are able to ingest 100% of all incoming traces in real-time using a streaming pipeline, where the ingested traces are used to construct the service graph and distributed error stacks, and provide other analytics information.
As also discussed in connection with
Referencing
In an embodiment, a user can hover a cursor over or select a particular error stack, which results in the accompanying service graph 1500 brightening or otherwise highlighting the nodes and edges associated with the trace for the selected error stack (while dimming other nodes and edges). As shown in the exemplary GUI of
Note that, in an embodiment, edge 1506 may also be brightened in
In accordance with some embodiments, a client is provided the capability of loading up exemplar traces for each distributed error stack. For example, in an embodiment, a client may be able to click at any point of time within a given heat-map (e.g., heat-map 1530) to load up an exemplar trace.
Trace 1600 of
Note that even though both error spans 1606 and 1614 are generated by product catalog service, they are associated with different respective error stacks because the call path for each is different. For example, the call path associated with error stack 1597 (as seen in
Similarly, the call path associated with error stack 1598 starting at the root span for the trace is as follows: “frontend” service executing operation “/product” (generating root span 1602) makes a call to operation “request/GetRecommendations” within “frontend” service (generating span 1608), which in turn makes a call to operation “/GetRecommendations” within “recommendation service,” (generating span 1610), which then makes a call to operation “request/GetProducts” within “recommendation service,” (generating span 1612), which finally makes a call to operation “/GetProducts” within “product catalog service” thereby generating root cause error span 1614.
Note that the operation and the service that generates the root cause error spans 1606 and 1614 are the same (namely operation “/GetProducts” within service “productcatalogservice”), but they are associated with different error stacks because the call path for each is different. Referring to service graph 1500, it is clear that the call path for root cause error span 1614 is associated with nodes 1510, 1509 and 1507, and edges 1505 and 1504.
The exemplary GUI of
Further, note that for each span, the GUI of
The exemplary GUI of
Note that loading exemplar traces associated with a particular error stack advantageously allows users to gather further information regarding a type of problem that is prevalent in the environment. In conventional monitoring platforms, the user is unable to determine whether an error is caused by a one-off trace or is an example of behavior that is prevalent in the environment.
The exemplary trace 1700 illustrated in
The tags associated with each span may be global tags, service-level tags or span-level tags comprised within a span error frame. For example, tag 1702 (“environment”=“prod”), tag 1722 (“incident”=“version_errors”), and tag 1712 (“tenant”=“gold”) are global tags, which are associated with a user request. Global tags associated with any span in trace 1700 can be attributed to all the spans within trace 1700. Accordingly, the values for tags 1702, 1722 and 1712 may be inherited by all the spans in trace 1700. Other types of tags, e.g., service-level tags, do not share the inheritance property of global-level tags. For example, service-level tag 1716 (region=“us-west-1”) may comprise different values for one or more spans in a trace. Finally, tags 1706 (comprising error status code values), 1708 (root_cause_error flag), and 1704 (error span flag) are span-level tags that comprise heuristics to identify an error span and determine a span error frame for the associated error span (as discussed in connection with
In an embodiment, selecting any one of the service nodes within the application topology provides error information in the side-bar 1910 associated with the selected node. For example, selecting recommendation service node 1960 will provide error information in the side-bar 1910 specific to recommendation service 1960 for a given duration (as selected using exemplary drop-down menu 1906). This information may comprise SLIs 1915 associated with recommendation service node 1960. The side-bar 1910 may also comprise distributed error stacks 1960 and 1970 for the selected service node, where the selected service may either be associated with the root cause error span (as is the case with stack 1960) or where the selected service may be associated with one of the frames in the error stack (as is the case with stack 1970). It should be noted that the heat-maps (e.g., heat-map 1971) for each error stack are dynamic and visually indicate the intensity of error activity for the selected time duration. For example, darker shades of color for portions of the heat-map indicate higher error activity than lighter shades. Alternatively, instead of shades of color, different patterns may be used for different levels of error activity.
Further, as previously discussed in connection with
In an embodiment, the exemplary GUI of
In an embodiment, when “workflow” is selected as the breakdown category using menu 2110, error stacks 2150 are broken down to show the workflows associated with each respective error stack. “Workflows,” as mentioned previously, are a dimension or attribute of the user request. Breaking down errors by workflows, for example, allows a client to efficiently determine which of the workflows (in the user-request) resulted in the most errors. As seen in
4.4 Mapping Error Stacks Across Multiple Attributes
As mentioned in connection with
In some embodiments of the present invention, the sessionization module is able to ingest, process and store 100% of the spans received from the collector 2204 in real-time. By comparison, conventional monitoring systems do not accept 100% of the incoming spans or traces; instead, they sample incoming spans (or traces) to calculate SLIs at the root level of a trace before discarding the spans. Embodiments of the present invention, by comparison, comprise an ingestion streaming pipeline that is able to ingest and consolidate all the incoming spans into traces in real-time, and is further able to use advanced compression methods to store all the traces.
As noted above, the sessionization module 2206 collects all the traces within a first time window Y 2280 (e.g., a duration of M minutes) using the time-stamps for the traces and then emits them to a mapping module 2220. In one or more embodiments, the mapping module 2220 receives all the traces generated within a predetermined time window Y 2280 and maps each trace to one or more {trace error signature, global tags} pairs, where each trace can comprise multiple different trace error signatures. In other words, mapping module 2220 receives traces from the sessionization module 2206 for a given time duration and emits several {trace error signature, global tags} pairs associated with the received traces. Note that each pair comprises a single trace error signature, but can comprise multiple global tags (or attributes).
In order to perform the mapping, the mapping module 2220 first computes one or more trace error signatures (or error stacks) associated with each trace and also determines the global-level attributes for each trace. Global tags, as noted previously, represent properties of a user-request (e.g., tenant name, tenant level, client location, environment type, etc.) and each trace is associated with its own respective unique set of global tags that comprise valuable information regarding the user-request (e.g., global tags 1650 as discussed in connection with
In different embodiments, mapping module 2220 may also be able to map traces to other hierarchies of tags, e.g., service-level tags, team-level tags, span-level tags, etc.
Subsequently, mapping module 2220 also receives and maps traces generated within the next time window 2285 (e.g., Y+M mins or secs) to {trace error signature, global tags} pairs. Thereafter, the process is repeated for all subsequent time windows in increments of M mins or secs.
For each time window, each unique {trace error signature, global tags} pair emitted by mapping module 2220 is then reduced and associated with a count value using reduction module 2222. Mapping module 2220 and reduction module 2222 together are able to aggregate all the distributed error stacks by the associated global attributes. Because potentially millions of traces may be collected for any given time window, there may be several thousand {trace error signature, global tags} pairs generated for each time window. These duplicate pairs may be reduced using reduction module 2222 and an associated count may be maintained that keeps track of the number of times any particular pair occurs for a given time window.
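As a minimal sketch of the mapping and reduction described above, and assuming each trace has already been resolved to its list of trace error signatures and its dictionary of global tags, the two steps might look as follows. The type aliases and the function names map_trace and reduce_window are hypothetical, and a signature is encoded here simply as a tuple of frame labels for illustration.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, List, Tuple

Signature = Tuple[str, ...]         # hypothetical encoding of an error stack
Tags = FrozenSet[Tuple[str, str]]   # global tags as hashable (key, value) pairs
Pair = Tuple[Signature, Tags]


def map_trace(signatures: Iterable[Signature], global_tags: Dict[str, str]) -> List[Pair]:
    """Emit one {trace error signature, global tags} pair per signature in a trace."""
    tags = frozenset(global_tags.items())
    return [(signature, tags) for signature in signatures]


def reduce_window(pairs: Iterable[Pair]) -> Counter:
    """Collapse duplicate pairs within one time window and keep a count for each."""
    return Counter(pairs)


# Illustrative traces for a single time window.
sig_a: Signature = ("productcatalogservice:/GetProducts", "frontend:/product")
sig_b: Signature = ("cartservice:/GetCart", "frontend:/cart")
gold = {"environment": "prod", "tenant": "gold"}
silver = {"environment": "prod", "tenant": "silver"}
window_traces = [([sig_a], gold), ([sig_a], gold), ([sig_a, sig_b], silver)]

emitted = [pair for sigs, tags in window_traces for pair in map_trace(sigs, tags)]
counts = reduce_window(emitted)
```

Because the tags are carried in the key, the same error stack observed under different global tags is counted separately, which is what later permits a breakdown by any tag.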
The pairing created by mapping module 2220 allows error stacks to be associated with tag (or attribute) values. Subsequently, the reduction process (using reduction module 2222) allows a count value to be maintained for each unique {trace error signature, tags} pair. Aggregating trace error signatures by all the attributes allows users to filter the accumulated error information by any dimension, e.g., a client may filter for error stacks where “environment=prod” and “tenant=gold.”
The pairing and reduction processes enable embodiments of the present invention to support breaking down the distributed error stacks by different dimensions (e.g., global tags). For example, a user may be able to break down error stacks by tenant-level. Specifically, a client is able to determine the number of times an error stack occurs per attribute value (e.g., tenant-level) over time. For example, a client may create queries through an exemplary user-interface 2282 that enables the client to query and filter all the error stacks across several dimensions.
A client may send in a request to retrieve error stacks that match a particular set of filters through query interface 2282. A request may ask to retrieve all error stacks that originate from or include a particular service and, further, include a set of tag filters as well. For example, the client query may request error stacks originating from a specified service (e.g., recommendation service) where “environment=prod” and “tenant=gold.” The filter may be defined based on a) the services comprising the frames in the error stack, where each service may either be an “error originating” service or an “error propagating” service (a service through which the error passes); and b) a set of tag or dimensional filters. In response to the query, the monitoring platform may access databases 2224 to return a list of error stacks that match the filtering criteria, and the number of traces for each error stack over the queried time window.
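The filtering just described may be illustrated with the following sketch, which assumes the reduced counts are held in a dictionary keyed by (error stack, global tags) and that each error stack is a tuple of (service, operation) frames with the top-most (error originating) frame first. The function query_error_stacks and its parameters are hypothetical names chosen for the example.

```python
from typing import Dict, FrozenSet, Tuple

Frame = Tuple[str, str]             # (service, operation)
Stack = Tuple[Frame, ...]           # top-most (error originating) frame first
Tags = FrozenSet[Tuple[str, str]]


def query_error_stacks(counts: Dict[Tuple[Stack, Tags], int],
                       service: str,
                       tag_filters: Dict[str, str],
                       originating_only: bool = False) -> Dict[Stack, int]:
    """Return matching error stacks with their trace counts summed across tag sets."""
    wanted = set(tag_filters.items())
    results: Dict[Stack, int] = {}
    for (stack, tags), count in counts.items():
        services = [frame[0] for frame in stack]
        if originating_only:
            service_ok = services[0] == service      # error originating service
        else:
            service_ok = service in services         # originating or propagating
        if service_ok and wanted <= tags:
            results[stack] = results.get(stack, 0) + count
    return results


# Illustrative data; the counts are invented for the example.
stack_1: Stack = (("productcatalogservice", "/GetProducts"),
                  ("recommendationservice", "/GetRecommendations"),
                  ("frontend", "/product"))
stack_2: Stack = (("recommendationservice", "/GetRecommendations"),
                  ("frontend", "/product"))
gold = frozenset({("environment", "prod"), ("tenant", "gold")})
silver = frozenset({("environment", "prod"), ("tenant", "silver")})
counts = {(stack_1, gold): 5, (stack_1, silver): 2, (stack_2, gold): 3}

filters = {"environment": "prod", "tenant": "gold"}
including = query_error_stacks(counts, "recommendationservice", filters)
originating = query_error_stacks(counts, "recommendationservice", filters,
                                 originating_only=True)
```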
In an embodiment, during the mapping and reduction process, summary metrics other than the count may also be maintained for each {trace error signature, global tags} pair. For example, a “sum of request duration” metric may be maintained, which allows clients to estimate duration percentiles for each error stack. There may also be other user-provided metrics that are extracted from the tags of the trace and can be maintained for each {trace error signature, global tags} pair.
In an embodiment, the monitoring platform maintains exemplar traces in storage for each {trace error signature, global tags} pair, which allows the user to load up exemplar traces (as was discussed in connection with
The data associated with the {trace error signature, global tags} pairs and the associated count for each time window is then normalized and persisted in the back-end in storage module 2224. There may be several different ways the trace data is persisted in the back-end. For example, in an embodiment, the data may be stored in a non-normalized (or non-optimized) fashion where for each window of time, each unique {trace error signature, global tags} pair is stored individually.
In a different embodiment, however, the storage for the {error stack, tags} pairs may be optimized. One method of optimizing storage for the {error stack, tags} pairs is assigning each unique pair of {error stack, tags} to an ID. When a client sends in a request to retrieve error stacks that match a particular filter, a first request on the back-end is generated to collect the unique IDs of error stacks that match the filter for a particular range of time. The result of the first request is then used to generate a second request on the back-end that summarizes the number of traces (or other metrics) for each of the unique IDs. A list of error stacks and associated metrics is then returned to the client that matches the user's filter over a specified range of time. This process of normalizing the data allows the error-related data to be stored efficiently without unnecessary duplication.
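One possible reading of this ID-based scheme is sketched below using in-memory dictionaries in place of a real back-end store. The class ErrorStackStore and its methods are hypothetical. The first lookup collects the IDs of pairs matching the tag filter within the time range, and the second summarizes counts for those IDs.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Set, Tuple

Stack = Tuple[str, ...]
Tags = FrozenSet[Tuple[str, str]]


class ErrorStackStore:
    """Stores each unique {error stack, tags} pair once and counts by (window, ID)."""

    def __init__(self) -> None:
        self._ids: Dict[Tuple[Stack, Tags], int] = {}
        self._pairs: Dict[int, Tuple[Stack, Tags]] = {}
        self._counts: Dict[Tuple[int, int], int] = defaultdict(int)

    def record(self, window_start: int, stack: Stack, tags: Dict[str, str], count: int) -> int:
        key = (stack, frozenset(tags.items()))
        pair_id = self._ids.setdefault(key, len(self._ids))  # reuse the ID if seen before
        self._pairs[pair_id] = key
        self._counts[(window_start, pair_id)] += count
        return pair_id

    def matching_ids(self, tag_filters: Dict[str, str], start: int, end: int) -> Set[int]:
        """First request: IDs of pairs matching the filter in [start, end)."""
        wanted = set(tag_filters.items())
        return {pid for (window, pid) in self._counts
                if start <= window < end and wanted <= self._pairs[pid][1]}

    def summarize(self, ids: Set[int], start: int, end: int) -> Dict[Stack, int]:
        """Second request: trace counts per error stack for the given IDs."""
        totals: Dict[Stack, int] = defaultdict(int)
        for (window, pid), count in self._counts.items():
            if pid in ids and start <= window < end:
                totals[self._pairs[pid][0]] += count
        return dict(totals)


store = ErrorStackStore()
stack = ("productcatalogservice:/GetProducts", "frontend:/product")
first_id = store.record(0, stack, {"tenant": "gold"}, 4)
second_id = store.record(60, stack, {"tenant": "gold"}, 6)
other_id = store.record(60, stack, {"tenant": "silver"}, 1)

ids = store.matching_ids({"tenant": "gold"}, 0, 120)
summary = store.summarize(ids, 0, 120)
```

In the sketch the pair itself is stored once, and each time window contributes only an ID and a count.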
In an embodiment, a client may be able to enter dimensional criteria in an exemplary GUI 2300 to filter out spans with user-specified attributes. For example, in the exemplary GUI 2300 of
At block 2402, a plurality of spans from a collector are ingested, where each span comprises at least a Trace ID associated with a user-request. In one embodiment, the spans are ingested in real-time.
At block 2404, the spans are grouped into one or more traces in discrete windows of time. A sessionization process is used whereby, using the time-stamps and the Trace IDs associated with each trace, traces are created from the incoming spans such that each session of the sessionization process creates traces associated with a corresponding window of time.
At block 2406, for a first window of time, trace error signatures for each trace in the first window are computed and the associated set of attributes for each trace are determined, wherein the attributes comprise information pertaining to a user-request associated with a corresponding trace.
At block 2408, for the first time window, each pair of computed trace error signature with its associated set of attributes is emitted using, for example, a mapping module 2220.
At block 2410, for the first time window, duplicate pairs of trace error signature and associated attributes are reduced and a count is maintained for the number of occurrences of each pair within the first time window.
At block 2412, the unique pairs of trace error signature and associated attribute values are stored along with their respective counts. Thereafter, the process is repeated for subsequent windows of time.
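The grouping of spans into traces by window of time (block 2404) may be sketched as follows. The sketch is a batch simplification rather than a streaming pipeline, and it assumes that a trace is assigned to the window containing the earliest start time among its spans, an assignment rule that the description above does not specify. The function name sessionize and the dictionary span format are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def sessionize(spans: List[dict], window_seconds: int) -> Dict[int, Dict[str, List[dict]]]:
    """Group spans into traces by Trace ID, then bucket traces into fixed windows.

    Assumption: a trace belongs to the window containing its earliest span start.
    """
    by_trace: Dict[str, List[dict]] = defaultdict(list)
    for span in spans:
        by_trace[span["trace_id"]].append(span)

    windows: Dict[int, Dict[str, List[dict]]] = defaultdict(dict)
    for trace_id, trace_spans in by_trace.items():
        earliest = min(s["start"] for s in trace_spans)
        window_start = int(earliest // window_seconds) * window_seconds
        windows[window_start][trace_id] = trace_spans
    return dict(windows)


# Illustrative spans with start times in seconds.
incoming = [
    {"trace_id": "t1", "span_id": "a", "start": 5.0},
    {"trace_id": "t1", "span_id": "b", "start": 61.0},   # same trace, later span
    {"trace_id": "t2", "span_id": "c", "start": 70.0},
]
sessions = sessionize(incoming, window_seconds=60)
```

Each window's traces could then be passed to the signature computation, pair emission and reduction of blocks 2406 through 2412.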
5.0 Probe Exclusion
Trace data received by a monitoring service (e.g., monitoring service 306) typically includes requests made to a microservices-based system by a user, but it may also include requests from other originators of data such as requests made by a system or a third party service. Probes are “bots” (software applications that run automated tasks or scripts) that generate automated requests (and associated traces) as opposed to requests (and associated traces) that originate from actual user actions. There are many examples of probes including: health checks (e.g., Kubernetes checking the health of a microservice), monitoring (e.g., Prometheus scraping metrics), and others (e.g., getting credentials).
Often clients are more interested in analyzing or viewing metrics related to user flows or user requests without the results being polluted by probe requests. Further, some types of probe requests (e.g., runscope) may be problematic because they can often trigger expected failures. If the monitoring platform is unable to distinguish between expected failures and unexpected failures, these failures, when aggregated by a conventional monitoring platform, may appear to a client as though problematic incidents are occurring in the environment. In other instances, the probes may only be providing metrics related to, for example, the health of a service and may not be as interesting to an application owner that is focused on troubleshooting availability and performance related issues.
In order to address this shortcoming of conventional monitoring platforms, embodiments of the present invention provide clients the ability to exclude probe requests from the displayed aggregated results. The probe exclusion feature is intended to help the application owners and developers distinguish between traces generated from probes and traces generated from actual user requests to an application and its constituent services. Further, it prevents SLIs computed for any given service in a microservice architecture from being polluted by probe information. Embodiments of the present invention can automatically exclude probe related requests from the default SLI computation so that clients can get a view of the “real” traffic, while allowing the “true” full-fidelity view if desired (which includes probes).
Embodiments of the present invention may provide a toggle radio button 680 (shown in
In one or more embodiments of the present invention, monitoring service 306 may also configure a component running within a collector (e.g., collector 304) that will gather metrics for each type of request (e.g., user, probe, etc.) even if the actual traces are not all sampled. This would enable full-fidelity aggregate calculations on request types along with exemplars even if all the traces cannot be transmitted from the collector to the monitoring service 306. In some embodiments, filtering of traces for all (or some) classes of bot-initiated requests can be enabled at the collector to preserve bandwidth and storage costs.
Probe exclusion is typically based upon out-of-the box standardized rules that can be further customized based on specific environment properties. In one embodiment of the present invention, the monitoring service 306, for example, pre-populates definitions for the traces that need to be excluded as probe-related traces at the back-end of the platform (e.g., using a regex based rule system or a JSON query language). A probe may, for example, be defined based on any attribute or tag of any span within a trace. If any span within a trace for example, meets the definitional criteria, the entire trace associated with that span would be considered a probe trace.
In a different embodiment, an application owner using the monitoring platform may be able to define the traces to be filtered out as probe traces through a GUI dialog box. Again, a probe may be defined by the client based on any attribute or tag of any span within a trace.
As noted above, a monitoring platform may need to pre-define traces related to the “cartservice” service comprising an http.method tag with a value of “GET” as a probe trace. This definition may be created using either a GUI, a query language such as JSON or a regex based rule system. Once the probe has been defined, a JSON payload may, for example, be sent to the back-end of the platform to configure the probe. Subsequently, when the spans are ingested by the monitoring service 306, any spans where the http.method tag has a value of “GET” would be identified and the trace associated with that span would be flagged as a probe trace. These probe traces may then be excluded by a client using the probe toggle switch 2602. It should be noted that there is no limit to the number of probes that may be pre-defined and enabled in the monitoring platform. Also, as noted above, a probe may be defined based on any tag or attribute of any span within a trace.
Alternatively, if the client elects to delete a probe, the GUI may provide an interface that allows a selected probe to be deleted.
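The probe definition and flagging described above may be illustrated with the following sketch of a regex based rule. The ProbeRule structure, its field names and the span dictionary format are hypothetical and are not the platform's actual rule format. As described, a trace is treated as a probe trace if any one of its spans satisfies any enabled rule.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ProbeRule:
    tag: str                       # tag (attribute) name to inspect
    pattern: str                   # regular expression the tag value must fully match
    service: Optional[str] = None  # restrict the rule to one service, if given


def span_matches(rule: ProbeRule, span: dict) -> bool:
    if rule.service is not None and span["service"] != rule.service:
        return False
    value = span["tags"].get(rule.tag)
    return value is not None and re.fullmatch(rule.pattern, value) is not None


def is_probe_trace(trace_spans: List[dict], rules: List[ProbeRule]) -> bool:
    """A trace is a probe trace if any span meets the criteria of any rule."""
    return any(span_matches(rule, span) for span in trace_spans for rule in rules)


def visible_traces(traces: Dict[str, List[dict]], rules: List[ProbeRule],
                   exclude_probes: bool) -> List[str]:
    """Apply the probe toggle: drop flagged traces only when exclusion is enabled."""
    return [trace_id for trace_id, spans in traces.items()
            if not (exclude_probes and is_probe_trace(spans, rules))]


rules = [ProbeRule(tag="http.method", pattern="GET", service="cartservice")]
traces = {
    "t1": [{"service": "frontend", "tags": {"http.method": "GET"}},
           {"service": "cartservice", "tags": {"http.method": "GET"}}],
    "t2": [{"service": "frontend", "tags": {"http.method": "GET"}},
           {"service": "cartservice", "tags": {"http.method": "POST"}}],
}
with_toggle = visible_traces(traces, rules, exclude_probes=True)
full_fidelity = visible_traces(traces, rules, exclude_probes=False)
```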
6.0 Logs, Traces, and Metrics
As mentioned above, the shift from monolithic applications to microservices-based architectures has increased the usefulness of analyzing traces in a distributed system. In other implementations, the tracing data analyzed herein may be coupled with log data and/or metrics data, in order to provide users with a more complete picture of the system. For example, the trace data may be coupled with log or other data from a data intake and query system. One example of a data intake and query system is the event-based data intake and query SPLUNK® ENTERPRISE system developed by Splunk Inc. of San Francisco, Calif. The SPLUNK® ENTERPRISE system is the leading platform for providing real-time operational intelligence that enables organizations to collect, index, and search machine-generated data from various websites, applications, servers, networks, and mobile devices that power their businesses. The SPLUNK® ENTERPRISE system is particularly useful for analyzing data which is commonly found in system log files, network data, and other data input sources. One example of the SPLUNK® ENTERPRISE system is described in more detail in U.S. patent application Ser. No. 15/011,651, entitled “TOKENIZED HTTP EVENT COLLECTOR”, filed on 31 Jan. 2016, and U.S. patent application Ser. No. 14/503,335, entitled “GENERATING REPORTS FROM UNSTRUCTURED DATA”, filed on 30 Sep. 2014, each of which is hereby incorporated by reference in its entirety. In another example, the data intake and query system may be based on a distributed or cloud-based service, for example, as described in U.S. patent application Ser. No. 15/665,148, entitled “QUERY PROCESSING USING QUERY-RESOURCE USAGE AND NODE UTILIZATION DATA”, filed on 31 Jul. 2017.
In one embodiment, the trace data may be ingested into the data intake and query system, or may be coupled with outputs from the data intake and query system, e.g., from searches that may be based on trace data and run on the data intake and query system. In some embodiments, the data intake and query system described above may be integrated with or into the monitoring platform that analyzes trace data, e.g., monitoring service 306. Monitoring service 306 may, accordingly, comprise a full suite of services including, for example, analyzing trace data, ingesting and analyzing log data, ingesting metrics data and providing insights generated from the metrics data, including aggregating and/or correlating trace data, log data, and metrics data, in order to gain insights into a computing platform.
As described above, the trace data may be sent to systems configured to ingest and search data, such as the data intake and query systems described above. Additionally, other implementations may use a stream processor that can perform transformations and other operations on incoming data prior to, or as an alternative to, ingestion of the data. In some implementations, the system may also be configured to ingest metrics data and may be optimized to ingest, query, and generate insights from metrics data, as described in U.S. patent application Ser. No. 15/339,863, titled STORING AND QUERYING METRICS DATA, and published as U.S. Patent Application Publication No. 2018/0089286.
In other implementations, metrics may be generated by instrumentation and sent to a SaaS-based processing system. For example, software may be instrumented to send metrics to a gateway or to another source, where metrics may be aggregated, queried, and alerted. Some examples of these systems are described in U.S. Pat. No. 9,846,632, titled, REAL-TIME REPORTING BASED ON INSTRUMENTATION OF SOFTWARE, the entirety of which is herein incorporated by reference. As above, the trace data may be paired with data from the data intake and query system, metrics generated by instrumentation, and other data sources, and correlated in various ways to provide insights. For example, as a broad-based correlation example, the metrics data may be used in a thresholding comparison to determine that there is an issue that needs attention, the trace data may be used to determine which component or microservice requires attention, and log data from the data intake and query system may be used to determine exactly why the component or microservice needs attention. Other correlations and uses for the combination of metrics data, log data, and event data are also contemplated herein. As noted above, all the various features and services may be provided within an integrated monitoring platform, e.g., monitoring service 306.
While the principles of the invention have been described above in connection with specific apparatus, it is to be clearly understood that this description is made only by way of example and not as a limitation on the scope of the invention. Further, the foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as may be suited to the particular use contemplated.
Claims
1. A method of performing error analysis in a computer system comprising microservices, the method comprising:
- identifying a first root cause error span from among a plurality of error spans of a trace associated with a user request processed by the microservices, wherein an error span is a span that returns an error to a microservice that initiates a call resulting in the span, and wherein a root cause error span is an error span associated with a microservice that originated the error;
- determining a first call path associated with the first root cause error span, wherein the first call path comprises a first chain of spans starting at the first root cause error span, and wherein each subsequent span in the first chain is a parent span of a prior span in the first chain;
- mapping each span in the first chain to a span error frame to create a first error stack from the first chain of spans, wherein a span error frame associated with the first root cause error span comprises a top-most frame in the first error stack, and wherein a span error frame comprises summary information associated with a respective span; and
- rendering an image of the first error stack for display within a graphical user interface (“GUI”), wherein the image is operable to indicate the microservice that originated the error.
2. The method of claim 1, wherein the identifying the first root cause error span from among the plurality of error spans comprises:
- identifying an error span that does not have a child span that is also an error span.
3. The method of claim 1, wherein the first error stack comprises an ordered list of span error frames corresponding to the first chain in the first call path.
4. The method of claim 1, wherein the summary information comprises: a name of a microservice initiating a call that resulted in the respective span; and an error status code associated with the respective span.
5. The method of claim 1, wherein the summary information comprises one or more fields selected from a group consisting of: a name of a microservice initiating a call that resulted in the respective span; an error status code associated with the respective span; a name of an operation within the microservice initiating the call that resulted in the respective span; a flag identifying the respective error span as an error span; and a flag identifying the respective error span as the root cause error span.
6. The method of claim 1, wherein a bottom-most frame in the first error stack is associated with a last span in the call path that is an error span.
7. The method of claim 1, wherein a bottom-most frame in the first error stack is associated with a root span of the trace.
8. The method of claim 1, further comprising:
- displaying the image of the first error stack visually through the GUI, wherein the image is operable to indicate the microservice that originated the error and a chain of associated microservices through which the error propagated.
9. The method of claim 1, further comprising:
- determining a second root cause error span among the plurality of error spans of the trace;
- determining a second call path associated with the second root cause error span, wherein the second call path comprises a second chain of spans starting at the second root cause error span, wherein each subsequent error span in the second chain is a parent span of a prior error span in the second chain; and
- mapping each span in the second chain to a span error frame to create a second error stack from the second chain, wherein a span error frame associated with the second root cause error span comprises a top-most frame in the first error stack; and
- rendering an image of the second error stack for display within the GUI.
10. The method of claim 1, further comprising:
- determining a second root cause error span among the plurality of error spans of the trace;
- determining a second call path associated with the second root cause error span, wherein the second call path comprises a second chain of spans starting at the second root cause error span, wherein each subsequent error span in the second chain is a parent span of a prior error span in the second chain; and
- mapping each span in the second chain to a span error frame to create a second error stack from the second chain, wherein a span error frame associated with the second root cause error span comprises a top-most frame in the first error stack, wherein the first error stack and the second error stack share one or more span error frames in common; and
- rendering an image of the second error stack for display within the GUI.
11. The method of claim 1, further comprising:
- determining a second root cause error span among the plurality of error spans of the trace;
- determining a second call path associated with the second root cause error span, wherein the second call path comprises a second chain of spans starting at the second root cause error span, wherein each subsequent error span in the second chain is a parent span of a prior error span in the second chain;
- mapping each span in the second chain to a span error frame to create a second error stack from the second chain, wherein a span error frame associated with the second root cause error span comprises a top-most frame in the first error stack;
- rendering an image of the second error stack for display within the GUI; and
- displaying the images of the first error stack and the second error stack visually through the GUI, wherein each span error frame in the first error stack and each span error frame in the second error stack are displayed alongside a portion of summary information from a respective span error frame.
12. The method of claim 1, further comprising:
- tracking a number of errors associated with the first error stack over a duration of time; and
- displaying the image of the first error stack through the GUI, wherein the image comprises a heat-map, wherein the heat-map visually represents a manner in which the number of errors associated with the first error stack fluctuates over the duration of time.
13. The method of claim 1, further comprising:
- tracking a number of errors associated with the first error stack over a duration of time; and
- displaying the image of the first error stack through the GUI, wherein the image comprises a heat-map, wherein the heat-map visually represents a manner in which the number of errors associated with the first error stack fluctuates over the given duration of time, and wherein the heat-map is interactive and operable to display exemplar traces representative of an error type associated with the first error stack responsive to a user action.
14. The method of claim 1, further comprising:
- tracking a number of errors associated with the first error stack over a given duration of time;
- displaying the image of the first error stack through the GUI; and
- displaying the number of errors that occurred during the given duration of time alongside the image of the first error stack within the GUI.
15. The method of claim 1, further comprising:
- tracking a number of errors associated with the first error stack over a given duration of time;
- identifying a plurality of attributes associated with the trace;
- creating an association between the plurality of attributes and the first error stack;
- displaying the image of the first error stack within the GUI alongside the number of errors associated with the first error stack that occurred during the given time interval; and
- providing a user an option through a GUI element of the GUI to fragmentize the first error stack and the count in accordance with criteria associated with the plurality of attributes.
16. A non-transitory computer-readable medium having computer-readable program code embodied therein for causing a computer system to perform a method for performing error analysis in a system comprising microservices, the method comprising:
- identifying a first root cause error span from among a plurality of error spans of a trace associated with a user request processed by the microservices, wherein an error span is a span that returns an error to a microservice that initiates a call resulting in the span, and wherein a root cause error span is an error span associated with a microservice that originated the error;
- determining a first call path associated with the first root cause error span, wherein the first call path comprises a first chain of spans starting at the first root cause error span, and wherein each subsequent span in the first chain is a parent span of a prior span in the first chain;
- mapping each span in the first chain to a span error frame to create a first error stack from the first chain of spans, wherein a span error frame associated with the first root cause error span comprises a top-most frame in the first error stack, and wherein a span error frame comprises summary information associated with a respective span; and
- rendering an image of the first error stack for display within a graphical user interface (“GUI”), wherein the image is operable to indicate the microservice that originated the error.
17. The non-transitory computer-readable medium of claim 16, wherein the identifying the first root cause error span from among the plurality of error spans comprises:
- identifying an error span that does not have a child span that is also an error span.
18. The non-transitory computer-readable medium of claim 16, wherein the first error stack comprises an ordered list of span error frames corresponding to the first chain in the first call path.
19. The non-transitory computer-readable medium of claim 16, wherein the summary information comprises: a name of a microservice initiating a call that resulted in the respective span; and an error status code associated with the respective span.
20. The non-transitory computer-readable medium of claim 16, wherein the summary information comprises one or more fields selected from a group consisting of: a name of a microservice initiating a call that resulted in the respective span; an error status code associated with the respective span; a name of an operation within the microservice initiating the call that resulted in the respective span; a flag identifying the respective error span as an error span; and a flag identifying the respective error span as the root cause error span.
21. The non-transitory computer-readable medium of claim 16, wherein a bottom-most frame in the first error stack is associated with a last span in the call path that is an error span.
22. The non-transitory computer-readable medium of claim 16, wherein a bottom-most frame in the first error stack is associated with a root span of the trace.
23. A system for performing error analysis in a network comprising microservices, the system comprising:
- a processing device communicatively coupled with a memory and configured to:
- identify a first root cause error span from among a plurality of error spans of a trace associated with a user request processed by the microservices, wherein an error span is a span that returns an error to a microservice that initiates a call resulting in the span, and wherein a root cause error span is an error span associated with a microservice that originated the error;
- determine a first call path associated with the first root cause error span, wherein the first call path comprises a first chain of spans starting at the first root cause error span, and wherein each subsequent span in the first chain is a parent span of a prior span in the first chain;
- map each span in the first chain to a span error frame to create a first error stack from the first chain of spans, wherein a span error frame associated with the first root cause error span comprises a top-most frame in the first error stack, and wherein a span error frame comprises summary information associated with a respective span; and
- render an image of the first error stack for display within a graphical user interface (“GUI”), wherein the image is operable to indicate the microservice that originated the error.
24. The system of claim 23, wherein to identify the first root cause error span from among the plurality of error spans, the processing device is further configured to:
- identify an error span that does not have a child span that is also an error span.
25. The system of claim 23, wherein the first error stack is an ordered list of span error frames corresponding to the first chain.
26. The system of claim 23, wherein the summary information comprises: a name of a microservice initiating a call that resulted in the respective span; and an error status code associated with the respective span.
27. The system of claim 23, wherein the summary information comprises one or more fields selected from a group consisting of: a name of a microservice initiating a call that resulted in the respective span; an error status code associated with the respective span; a name of an operation within the microservice initiating the call that resulted in the respective span; a flag identifying the respective error span as an error span; and a flag identifying the respective error span as the root cause error span.
28. The system of claim 23, wherein a bottom-most frame in the first error stack is associated with a last span in the first call path that is an error span.
29. The system of claim 23, wherein a bottom-most frame in the first error stack is associated with a root span of the trace.
30. The system of claim 23, wherein the processing device is further configured to:
- display the image of the first error stack through the GUI, wherein the image is operable to indicate the microservice that originated the error and a chain of associated microservices through which the error propagated.
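For illustration only, the steps recited in the claims above (identifying a root cause error span as an error span with no child error span, walking parent links to form the chain, and mapping each span to a span error frame) might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the `Span` and `SpanErrorFrame` classes, their field names, the `extend_to_root` parameter, and the plain-text `render_error_stack` function (a stand-in for rendering an image in a GUI) are all hypothetical. The `service` field is simply whatever microservice name is recorded on the span.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    # Hypothetical span record; field names are assumptions for this sketch.
    span_id: str
    parent_id: Optional[str]  # None for the root span of the trace
    service: str              # microservice name recorded on the span
    operation: str
    status_code: int
    is_error: bool


@dataclass
class SpanErrorFrame:
    # Summary information for one frame of the error stack.
    service: str
    operation: str
    status_code: int
    is_error: bool
    is_root_cause: bool


def find_root_cause_error_spans(spans: List[Span]) -> List[Span]:
    """Return error spans that have no child span that is also an error span."""
    parents_of_error_spans = {s.parent_id for s in spans if s.is_error}
    return [s for s in spans
            if s.is_error and s.span_id not in parents_of_error_spans]


def build_error_stack(root_cause: Span, spans: List[Span],
                      extend_to_root: bool = False) -> List[SpanErrorFrame]:
    """Walk parent links from the root cause span; top-most frame comes first.

    By default the walk stops at the last ancestor that is an error span.
    With extend_to_root=True it continues to the root span of the trace.
    Assumes parent links form a tree (no cycles).
    """
    by_id = {s.span_id: s for s in spans}
    frames: List[SpanErrorFrame] = []
    current: Optional[Span] = root_cause
    while current is not None and (current.is_error or extend_to_root):
        frames.append(SpanErrorFrame(
            service=current.service,
            operation=current.operation,
            status_code=current.status_code,
            is_error=current.is_error,
            is_root_cause=current is root_cause,
        ))
        current = by_id.get(current.parent_id)  # None once past the root span
    return frames


def render_error_stack(frames: List[SpanErrorFrame]) -> str:
    """Text stand-in for the rendered image: one line per frame."""
    lines = []
    for frame in frames:
        tag = " [root cause]" if frame.is_root_cause else ""
        lines.append(f"{frame.service}.{frame.operation} -> {frame.status_code}{tag}")
    return "\n".join(lines)


# Made-up example trace: gateway -> checkout -> payment, plus gateway -> catalog.
trace = [
    Span("a", None, "gateway", "GET /cart", 200, False),
    Span("b", "a", "checkout", "placeOrder", 503, True),
    Span("c", "b", "payment", "charge", 500, True),
    Span("d", "a", "catalog", "listItems", 200, False),
]
root_causes = find_root_cause_error_spans(trace)
stack = build_error_stack(root_causes[0], trace)
print(render_error_stack(stack))
```

In this sketch the default behavior corresponds to a bottom-most frame associated with the last error span in the call path, while `extend_to_root=True` corresponds to a bottom-most frame associated with the root span of the trace.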
Type: Application
Filed: Nov 1, 2019
Publication Date: May 6, 2021
Inventors: Mayank AGARWAL (Kirkland, WA), Steven FLANDERS (Nashua, NH), Justin SMITH (San Francisco, CA), Gergely DANYI (Redwood City, CA)
Application Number: 16/672,174