INSTRUMENTATION TRACE CAPTURE TECHNIQUE

Info

Publication number: 20220012161
Type: Application
Filed: Jul 12, 2021
Publication Date: Jan 13, 2022
Inventors: Nicholas DeMonner (Redwood City, CA), David Michael Renie (Atherton, CA), David Marcin (Saratoga, CA), Margaret Henry (Redwood City, CA)
Application Number: 17/373,192

Abstract

An instrumentation trace capture technique enables software developers to monitor, diagnose and solve errors associated with application development and production. A client library of an investigative platform is loaded in a user application executing on a virtual machine instance of a virtualized computing environment. The client library interacts with an agent of the platform to instrument executable code of the user application and, to that end, loads a capture configuration that specifies, inter alia, methods and associated arguments, variables and data structures (values), to instrument. The client library inspects the executable code to determine portions of the code to instrument based on the capture configuration, which describes a degree of fidelity (e.g., a frequency) of the executable code and data to trace at runtime. Capture points of the runtime application are implemented as callbacks to the client library, which are registered with a runtime system executing the user application.

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a continuation of U.S. patent application Ser. No. 16/926,268, entitled INSTRUMENTATION TRACE CAPTURE TECHNIQUE, filed on Jul. 10, 2020 by Nicholas DeMonner et al., which application is hereby incorporated by reference.

BACKGROUND

Technical Field

The present disclosure relates to software application development and production and, more specifically, to an investigative platform having observability tools configured to diagnose and solve errors associated with software application development and production.

Background Information

Conventional observability tools are typically used in both software development and production environments to infer internal states of an executing software application (e.g., executable code) from knowledge of external outputs. However, these tools generally have a limited view/observation of information for a user (software developer) to obtain sufficient information (e.g., internal state information) about executable code to correctly diagnose a malfunction. That is, the tools typically collect information, such as logs, metrics and traces, from the executable code at runtime with insufficient detail and independently. As a result, an integrated view of sufficient fidelity across the collected information is not possible to aid the malfunction diagnosis, especially with respect to a historical view of specific operations manifesting the malfunction. For example, the tools may capture exceptions raised by the executable code that indicate a malfunction, but the root cause may be buried in a history of specific data values and processing leading to the exception. As such, examining a voluminous history of invocations and data changes across the collected information is often necessary to successfully diagnose the malfunction. Moreover, in production these tools are not generally configured for arbitrarily detailed information capture in an “always on” manner, but rather are typically used for testing or similar short-lived activities and then turned off.

In addition, an issue may arise during use of the tools for which there is no “visibility,” and where the time to address and repair such an issue (problem) and its impact may depend on how quickly the developer can acquire visibility of the problem. A typical approach involves the software developer receiving notification of the problem in the production application, finding and examining relevant source code, defining and installing new points for collecting information about the code, deploying code with these new points, reviewing subsequently collected information, inferring what portions of the code may be creating the malfunction, and finally implementing any corrections to the code typically iteratively until the malfunction ceases. The developer may thereafter review any issues related to collected information that is logged and reported but may find nothing abnormal in the collected information. This approach may be continually repeated to no avail, which often hampers and even discourages problem solving. As a result, there is a need for on-demand, arbitrarily detailed trace capture based on always-on historical capture during production and in development environments. Such capture would enable gathering of enough detail when necessary and rendering of the voluminous collected information efficiently with sufficiently integrated view for effective diagnosis and root cause determination.

BRIEF DESCRIPTION OF THE DRAWINGS

The above and further advantages of the embodiments herein may be better understood by referring to the following description in conjunction with the accompanying drawings in which like reference numerals indicate identically or functionally similar elements, of which:

FIG. 1 is a block diagram of a virtualized computing environment;

FIG. 2 is a block diagram of the virtual machine instance;

FIG. 3 is a block diagram of an investigative platform;

FIG. 4 illustrates a workflow for instrumenting executable code using a capture configuration in accordance with an instrumentation trace capture technique; and

FIG. 5 illustrates a workflow for instrumenting executable code using a nesting level threshold in accordance with the instrumentation trace capture technique.

OVERVIEW

The embodiments described herein are directed to an instrumentation trace capture technique configured to enable software developers to monitor, diagnose and solve errors associated with application development and production. A client library of an investigative platform is loaded in a user application executing on a virtual machine instance of a virtualized computing environment or, for other embodiments, on an actual computer/machine. The client library interacts with a separate agent process of the platform to instrument executable code (e.g., symbolic text, interpreted bytecodes, machine code and the like visible to the client library) of the user application and, to that end, loads a capture configuration that specifies information such as, inter alia, methods and associated arguments, variables and data structures (values), to instrument. The client library inspects the executable code to determine portions of the code to instrument based on rules or heuristics of the capture configuration, which describe a degree of fidelity (e.g., a frequency) of the executable code and information to trace at runtime. Capture points of the runtime application are implemented as callback functions (callbacks) to the client library, which are registered with a runtime system executing the user application.

Illustratively, the client library may examine a language runtime stack and associated call history during a capture interval, i.e., a method execution event triggering the callback, and gather symbolic information, e.g., symbols and associated source code (when available) from the runtime system, invocations of methods, arguments/variables (including local and instance variables) and return values of the methods, as well as any exceptions raised based on a capture filter. In an embodiment, the capture filter is a table having identifiers associated with the methods to instrument, such that presence of a particular identifier in the table results in trace capture of the method associated with the identifier during the capture interval. When an exception is raised, the client library captures detailed information for every method in the stack, even if it was not instrumented in detail initially. The client library may also inspect language runtime internals to determine values of data structures used by the application. In an embodiment, the capture configuration for data structures involves walking the structures based on a defined level of nesting (e.g., depth of the data structures) which may be specified per data structure type, instance, method, etc. All gathered information and executed executable code are transferred to the agent process via shared memory and/or Inter Process Communication (such as message passing via sockets, pipes and the like) to isolate the capture from the executing user application. The captured trace information may be reported graphically and interactively to a user via a user interface infrastructure of the investigative platform.

Description

The disclosure herein is generally directed to an investigative platform having observability tools that enable software developers to monitor, investigate, diagnose and remedy errors as well as other deployment issues including code review associated with application development and production. In this context, an application (e.g., a user application) denotes a collection of interconnected software processes or services, each of which provides an organized unit of functionality expressed as instructions or operations, such as symbolic text, interpreted bytecodes, machine code and the like, which is defined herein as executable code and which is associated with and possibly generated from source code (i.e., human readable text written in a high-level programming language) stored in repositories. The investigative platform may be deployed and used in environments (such as, e.g., production, testing, and/or development environments) to facilitate creation of the user application, wherein a developer may employ the platform to provide capture and analysis of the operations (contextualized as “traces”) to aid in executable code development, debugging, performance tuning, error detection, and/or anomaly capture managed by issue.

In an exemplary embodiment, the investigative platform may be used in a production environment which is executing (running) an instance of the user application. The user application cooperates with the platform to capture traces (e.g., execution of code and associated data/variables) used to determine the cause of errors, faults and inefficiencies in the executable code and which may be organized by issue typically related to a common root cause. To that end, the investigative platform may be deployed on hardware and software computing resources, ranging from laptop/notebook computers, desktop computers, and on-premises (“on-prem”) compute servers to, illustratively, data centers of virtualized computing environments.

FIG. 1 is a block diagram of a virtualized computing environment 100. In one or more embodiments described herein, the virtualized computing environment 100 includes one or more computer nodes 120 and intermediate or edge nodes 130 collectively embodied as one or more data centers 110 interconnected by a computer network 150. The data centers may be cloud service providers (CSPs) deployed as private clouds or public clouds, such as deployments from Amazon Web Services (AWS), Google Compute Engine (GCE), Microsoft Azure, typically providing virtualized resource environments. As such, each data center 110 may be configured to provide virtualized resources, such as virtual storage, network, and/or compute resources that are accessible over the computer network 150, e.g., the Internet. Each computer node 120 is illustratively embodied as a computer system having one or more processors 122, a main memory 124, one or more storage adapters 126, and one or more network adapters 128 coupled by an interconnect, such as a system bus 123. The storage adapter 126 may be configured to access information stored on storage devices 127, such as magnetic disks, solid state drives, or other similar media including network attached storage (NAS) devices and Internet Small Computer Systems Interface (iSCSI) storage devices. Accordingly, the storage adapter 126 may include input/output (I/O) interface circuitry that couples to the storage devices over an I/O interconnect arrangement, such as a conventional peripheral component interconnect (PCI) or serial ATA (SATA) topology.

The network adapter 128 connects the computer node 120 to other computer nodes 120 of the data centers 110 over local network segments 140 illustratively embodied as shared local area networks (LANs) or virtual LANs (VLANs). The network adapter 128 may thus be embodied as a network interface card having the mechanical, electrical and signaling circuitry needed to connect the computer node 120 to the local network segments 140. The intermediate node 130 may be embodied as a network switch, router, firewall or gateway that interconnects the LAN/VLAN local segments with remote network segments 160 illustratively embodied as point-to-point links, wide area networks (WANs), and/or virtual private networks (VPNs) implemented over a public network (such as the Internet). Communication over the network segments 140, 160 may be effected by exchanging discrete frames or packets of data according to pre-defined protocols, such as the Transmission Control Protocol/Internet Protocol (TCP/IP) and the User Datagram Protocol (UDP), although other protocols, such as the OpenID Connect (OIDC) protocol, the HyperText Transfer Protocol Secure (HTTPS), HTTP/2, and the Google Remote Procedure Call (gRPC) protocol may also be advantageously employed.

The main memory 124 includes a plurality of memory locations addressable by the processor 122 and/or adapters for storing software programs (e.g., user applications, processes and/or services) and data structures associated with the embodiments described herein. As used herein, a process (e.g., a user mode process) is an instance of a software program (e.g., a user application) executing in the operating system. The processor and adapters may, in turn, include processing elements and/or circuitry configured to execute the software programs, including an instance of a virtual machine and a hypervisor 125, and manipulate the data structures. The virtual machine instance (VMI) 200 is managed by the hypervisor 125, which is a virtualization platform configured to mask low-level hardware operations and provide isolation from one or more guest operating systems executing in the VMI 200. In an embodiment, the hypervisor 125 is illustratively the Xen hypervisor, although other types of hypervisors, such as the Hyper-V hypervisor and/or VMware ESX hypervisor, may be used in accordance with the embodiments described herein. As will be understood by persons of skill in the art, in other embodiments, the instance of the user application may execute on an actual (physical) machine.

It will be apparent to those skilled in the art that other types of processing elements and memory, including various computer-readable media, may be used to store and execute program instructions pertaining to the embodiments described herein. Also, while the embodiments herein are described in terms of software programs, processes, services and executable code stored in memory or on storage devices, alternative embodiments also include the code, services, processes and programs being embodied as logic, components, and/or modules consisting of hardware, software, firmware, or combinations thereof.

FIG. 2 is a block diagram of the virtual machine instance (VMI) 200. In an embodiment, guest operating system (OS) 210 and associated user application 220 may run (execute) in the VMI 200 and may be configured to utilize system (e.g., hardware) resources of the data center 110. The guest OS 210 may be a general-purpose operating system, such as FreeBSD, Microsoft Windows®, macOS®, and similar operating systems; however, in accordance with the embodiments described herein, the guest OS is illustratively the Linux® operating system. A guest kernel 230 of the guest OS 210 includes a guest OS network protocol stack 235 for exchanging network traffic, such as packets, over computer network 150 via a network data path established by the network adapter 128 and the hypervisor 125. Various data center processing resources, such as processor 122, main memory 124, storage adapter 126, and network adapter 128, among others, may be virtualized for the VMI 200, at least partially with the assistance of the hypervisor 125. The hypervisor may also present a software interface for processes within the VMI to communicate requests directed to the hypervisor to access the hardware resources.

A capture infrastructure 310 of the investigative platform may be employed (invoked) to facilitate visibility of the executing user application 220 by capturing and analyzing traces of the running user application, e.g., captured operations (e.g., functions and/or methods) of the user application and associated data/variables (e.g., local variables, passed parameters/arguments, etc.). In an embodiment, the user application 220 may be created (written) using an interpreted programming language such as Ruby, although other compiled and interpreted programming languages, such as C++, Python, Java, PHP, and Go, may be advantageously used in accordance with the teachings described herein. Illustratively, the interpreted programming language has an associated runtime system 240 within which the user application 220 executes and may be inspected. The runtime system 240 provides application programming interfaces (APIs) to monitor and access/capture/inspect (instrument) operations of the user application so as to gather valuable information or “signals” from the traces (captured operations and associated data), such as arguments, variables and/or values of procedures, functions and/or methods. A component of the capture infrastructure (e.g., a client library) cooperates with the programming language's runtime system 240 to effectively instrument (access/capture/inspect) the executable code of the user application 220.

As described further herein, for runtime systems 240 that provide first-class support of callback functions (“callbacks”), callbacks provided by the client library may be registered by the user application process of the guest OS 210 when the executable code is loaded to provide points of capture for the running executable code. Reflection capabilities of the runtime system 240 may be used to inspect file path(s) of the executable code and enumerate the loaded methods at events needed to observe and capture the signals. Notably, a fidelity of the captured signals may be configured based on a frequency of one or more event-driven capture intervals and/or a selection/masking of methods/functions to capture, as well as selection/masking, type, degree and depth of associated data to capture. The event-driven intervals invoke the callbacks, which filter information to capture. The events may be triggered by method invocation, method return, execution of a new line of code, or raising of exceptions, or may be periodic (i.e., time-based). For languages that do not provide such first-class callback support, a compiler may be modified to insert callbacks as “hooks” such that, when processing the executable code, the modified compiler may generate code to provide initial signals passed in the callbacks to the client library, as well as to provide results from the callbacks to the client library. In other embodiments, the callbacks may be added at runtime by employing proxy methods (i.e., wrapping invocations of the methods to include callbacks at entry and/or exit of the methods) in the executable code. Moreover, the client library (which is contained in the same process running the user application 220) may examine main memory 124 to locate and amend (rewrite) the executable code and enable invocation of the callbacks to facilitate instrumentation on behalf of the investigative platform.
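As a minimal sketch of callback-based capture points, the following uses Python's standard `sys.settrace` hook as a stand-in for the runtime system's callback API. The names `capture_callback`, `captured`, `run_with_capture` and `add` are hypothetical and are not taken from the disclosure; an embodiment for another runtime would use that runtime's own tracing API.

```python
import sys

captured = []  # hypothetical in-memory trace buffer (assumption)

def capture_callback(frame, event, arg):
    """Callback invoked by the interpreter on method-execution events."""
    name = frame.f_code.co_name
    if event == "call":
        # Method invocation: record the arguments visible as locals.
        captured.append(("call", name, dict(frame.f_locals)))
    elif event == "return":
        # Method return: arg holds the return value.
        captured.append(("return", name, arg))
    elif event == "exception":
        # Raised exception: arg is (type, value, traceback).
        captured.append(("exception", name, arg[0].__name__))
    return capture_callback  # keep receiving events for this frame

def add(a, b):
    return a + b

def run_with_capture(func, *args):
    """Register the callback, run func, then restore the previous hook."""
    previous = sys.gettrace()
    sys.settrace(capture_callback)
    try:
        return func(*args)
    finally:
        sys.settrace(previous)

result = run_with_capture(add, 2, 3)
```

Here the callback sees the invocation of `add` with its arguments and, later, its return value; "line" events also reach the callback but are ignored in this sketch.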

FIG. 3 is a block diagram of the investigative platform 300. In one or more embodiments, the investigative platform 300 includes the capture infrastructure 310 in communication with (e.g., connected to) an analysis and persistent storage (APS) infrastructure 350 as well as a user interface (UI) infrastructure 360 via computer network 150. Illustratively, the capture infrastructure 310 includes a plurality of components, such as the client library 320 and an agent 330, that interact (e.g., through the use of callbacks) to instrument the running executable code visible to the client library, initially analyze traces captured through instrumentation, compress and thereafter send the traces via the computer network 150 to the APS infrastructure 350 for comprehensive analysis and storage. The APS infrastructure 350 of the investigative platform 300 is configured to provide further multi-faceted and repeatable processing, analysis and organization, as well as persistent storage, of the captured traces. The UI infrastructure 360 allows a user to interact with the investigative platform 300 and examine traces via comprehensive views distilled by the processing, analysis and organization of the APS infrastructure 350. The capture infrastructure 310 illustratively runs in a VMI 200a on a computer node 120a that is separate and apart from a VMI 200b and computer node 120b on which the APS infrastructure 350 runs. Note, however, that the infrastructures 310 and 350 of the investigative platform 300 may run in the same or different data center 110.

In an embodiment, the client library 320 may be embodied as a software development kit (SDK) that provides a set of tools including a suite of methods that software programs, such as user application 220, can utilize to instrument and analyze the executable code. The client library 320 illustratively runs in the same process of the user application 220 to facilitate such executable code instrumentation and analysis (work). To reduce performance overhead costs (e.g., manifested as latencies that may interfere with user application end user experience) associated with executing the client library instrumentation in the user application process, i.e., allocating the data center's processing (e.g., compute, memory and networking) resources needed for such work, the client library queries the runtime system 240 via an API to gather trace signal information from the system, and then performs a first dictionary compression and passes the compressed signal information to an agent 330 executing in a separate process. The agent 330 is thus provided to mitigate the impact of work performed by the client library 320, particularly with respect to potential failures of the user application.

Illustratively, the agent 330 is spawned as a process of the guest OS 210 that is separate from the user application 220 and provides process isolation to retain captured traces in the event of user process faults, as well as to prevent unexpected processing resource utilization or errors from negatively impacting execution of the user application 220. As much processing as possible of the captured traces of the executable code is offloaded from the client library 320 to the agent 330 because overhead and latency associated with transmission of information (e.g., the captured traces) between operating system processes is minimal as compared to transmission of the information over the computer network 150 to the APS infrastructure 350. In an embodiment, the client library 320 and agent 330 may communicate (e.g., transmit information) via an Inter Process Communication (IPC) mechanism 340, such as shared memory access or message passing of the captured trace signals. Thereafter, the agent 330 may perform further processing on the captured traces, such as a second dictionary compression across captured traces, and then send the re-compressed captured traces to the APS infrastructure 350 of the investigative platform 300 over the computer network 150 for further processing and/or storage.
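The disclosure does not specify the dictionary compression scheme. As one possible illustration, and purely as an assumption, the sketch below replaces repeated method identifiers with small integer indices before the signals are serialized for hand-off to the agent. The function names `compress` and `decompress` and the payload layout are hypothetical.

```python
import json

def compress(events):
    """Dictionary-encode (method identifier, value) pairs."""
    table = {}    # method identifier -> integer index
    encoded = []
    for method, value in events:
        index = table.setdefault(method, len(table))
        encoded.append((index, value))
    # The dictionary is emitted once, ordered by index.
    return {"dictionary": sorted(table, key=table.get), "events": encoded}

def decompress(payload):
    """Invert compress(); also accepts a payload that passed through JSON."""
    names = payload["dictionary"]
    return [(names[index], value) for index, value in payload["events"]]

events = [("Order.total", 3), ("Order.total", 4), ("User.login", "ann")]
payload = compress(events)

# Serialized form that could be handed to the agent over a pipe or socket.
message = json.dumps(payload)
restored = decompress(json.loads(message))
```

Each identifier is transmitted once in the dictionary regardless of how often the method is invoked, which is the property that makes such an encoding attractive for repetitive trace signals.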

The embodiments described herein are directed to an instrumentation trace capture technique configured to enable software developers to monitor, diagnose and solve errors associated with application development and production. A user links the client library 320 to the user application 220, e.g., after the client library is loaded into a process of the application and, thereafter, the client library (at initialization and thereafter on-demand) loads a capture configuration that specifies information such as, inter alia, methods and associated arguments, variables and data structures (values) to instrument as well as a fidelity of capture (i.e., a frequency and degree or amount of the information detail to gather of the running application) expressed as rules. Essentially, the capture configuration acts as a filter to define the type and degree of information to capture. The client library 320 inspects the executable code to determine portions of the code to instrument based on the rules or heuristics of the capture configuration. Capture points of the runtime application are implemented as callbacks to the client library 320 which, as noted, are registered with the runtime system executing the user application 220 and invoked according to the capture configuration. The capture configuration may be loaded from various sources, such as from the agent 330, the APS infrastructure 350, and/or via user-defined sources such as files, environment variables and graphically via the UI infrastructure 360.

FIG. 4 illustrates a workflow 400 for instrumenting executable code 410 using a capture configuration 420 in accordance with the instrumentation trace capture technique. Since there is only a finite amount of processing resources available for the client library 320 to perform its work, the technique optimizes the use of the processing resources in accordance with the capture configuration 420, which describes a degree of fidelity of executable code 410 and information to capture at runtime as traces of the executing methods and data of the executable code. In one or more embodiments, default rules or heuristics 425 of the configuration 420 are employed to dynamically capture the traces 450, wherein the default heuristics 425 may illustratively specify capture of (i) all methods 430 of the executable code 410 as well as (ii) certain dependencies on one or more third-party libraries 460 that are often mis-invoked (i.e., called with incorrect parameters or usage). A capture filter 426 is constructed (i.e., generated) from the capture configuration based on the heuristics. Changes to the capture configuration 420 may be reloaded during the capture interval and the capture filter re-generated. In this manner, the executable code 410 may be effectively re-instrumented on-demand as the capture filter screens the traces 450 to capture.

Illustratively, the capture filter 426 may be embodied as a table having identifiers associated with methods to instrument, such that presence of a particular identifier in the table results in trace capture of the method associated with the identifier during the capture interval. That is, the capture filter is queried (e.g., the capture table is searched) during the capture interval to determine whether methods of the event driving the capture interval are found. If the method is found in the capture filter 426, a trace 450 is captured (i.e., recorded). Notably the method identifiers may depict the runtime system representation of the method (e.g., symbols) or a memory address for a compiled user application and runtime environment. In an embodiment, the capture filter may be extended to include capture filtering applied to arguments, variables, data structures and combinations thereof.
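The relationship between the capture configuration and the capture filter may be sketched as follows. The configuration keys (`include_prefixes`, `curated_methods`), the class name `CaptureFilter` and the method identifiers are hypothetical and chosen only for illustration; the sketch assumes that method identifiers are dotted symbol names.

```python
class CaptureFilter:
    """Table of method identifiers generated from a capture configuration."""

    def __init__(self, config, loaded_methods):
        self.loaded_methods = list(loaded_methods)
        self.reload(config)

    def reload(self, config):
        """Re-generate the table, e.g., after the configuration changes."""
        prefixes = tuple(config.get("include_prefixes", ()))
        self.table = {m for m in self.loaded_methods if m.startswith(prefixes)}
        self.table.update(config.get("curated_methods", ()))

    def should_capture(self, method_id):
        """Queried during the capture interval: presence means capture."""
        return method_id in self.table

loaded = ["app.orders.total", "app.users.login", "json.loads", "re.compile"]
capture_filter = CaptureFilter(
    {"include_prefixes": ["app."], "curated_methods": ["re.compile"]}, loaded
)
before = [m for m in loaded if capture_filter.should_capture(m)]

# On-demand re-instrumentation: narrow the capture to one module.
capture_filter.reload({"include_prefixes": ["app.orders."]})
after = [m for m in loaded if capture_filter.should_capture(m)]
```

Because the lookup is a set membership test, the per-event cost of consulting the filter stays small even when many methods are loaded.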

A default capture configuration is based on providing a high fidelity (i.e., capture a high trace detail) where there is a high probability of error. As such, the capture configuration may trade off “high-signal” information (i.e., information very useful to debugging, analyzing and resolving errors) against consistently capturing a same level of detail of all invoked methods. For example, the third-party libraries 460 (such as, e.g., a standard string library or regular expression library) are typically widely used by software developers and, thus, are generally more reliable and mature than the user application 220 but are also likely to have incorrect usage by the user application. As a result, the heuristics 425 primarily focus on methods 430 of the user application's executable code 410 based on the assumption that it is less developed and thus more likely to be where errors or failures arise. The heuristics 425 (and capture filter 426) are also directed to tracing invocation of methods of the third-party libraries 460 by the user application via a curated list 465 of methods 470 of the third-party library having arguments/variables (arg/var) 472 and associated values 474 deemed as valuable (high-signal) for purposes of debugging and analysis. Notably, the curated list 465 may be folded into the capture filter 426 during processing/loading of the capture configuration 420. That is, the curated list includes high-signal methods of the third-party library most likely to be mis-invoked (e.g., called with incorrect calling parameters) and, thus, benefits debugging and analysis of the user application 220 that uses the curated high-signal method. The technique utilizes the available processing resources to capture these high-signal method/value traces 450.
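A curated list of this kind may be pictured as a mapping from third-party method identifiers to the argument names considered high-signal. In the sketch below, `CURATED_LIST`, `capture_curated_call` and the stand-in library function `parse` are hypothetical; only `inspect.signature` and `Signature.bind` are standard Python.

```python
import inspect

# Hypothetical curated list: method identifier -> high-signal argument names.
CURATED_LIST = {"datelib.parse": ("text", "fmt")}

def parse(text, fmt="%Y-%m-%d", strict=True):
    """Stand-in for a third-party library method (hypothetical)."""
    return (text, fmt, strict)

def capture_curated_call(method_id, func, *args, **kwargs):
    """Return the high-signal argument values for a curated method, else None."""
    wanted = CURATED_LIST.get(method_id)
    if wanted is None:
        return None  # not curated: this library call is not traced
    bound = inspect.signature(func).bind(*args, **kwargs)
    bound.apply_defaults()  # include defaults the caller relied on
    return {name: bound.arguments[name] for name in wanted}

trace = capture_curated_call("datelib.parse", parse, "12/31/2021")
skipped = capture_curated_call("datelib.format", parse, "12/31/2021")
```

The captured pair of `text` and `fmt` values makes a mis-invocation visible (the text does not match the default format), while the low-signal `strict` argument is not recorded.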

Illustratively, the client library 320 may examine a language runtime stack 480 and associated call history 482 using, e.g., inspection APIs, to query the runtime system during a capture interval to gather symbolic information, i.e., symbols and associated source code (when available), from the runtime system 240, invocations of methods 430, 470, associated arguments/variables 432, 472 (including local and instance variables), return values 434, 474 of the methods, and any exceptions being raised. Notably, the gathered symbolic information of a captured trace may include one or more of (i) high-level programming text processed by the runtime system, which may be derived (generated) from source code stored in repositories, and (ii) symbols as labels representing one or more of the methods, variables, data and state of the executable code. When an exception is raised, the client library 320 captures detailed information for every method in the stack 480, even if it was not instrumented in detail initially as provided in the capture configuration 420. That is, fidelity of trace capture is automatically increased during the capture interval in response to detecting a raised exception. Note that in some embodiments, this automatic increase in trace capture detail may be overridden in the capture configuration. In some embodiments, the runtime system executable code 410 may have limited human readability (i.e., may not be expressed in a high-level programming language) and, in that event, mapping of symbols and references from the executable code 410 to source code used to generate the executable code may be gathered from the repositories by the APS infrastructure 350 and associated with the captured trace.
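The automatic increase in fidelity on a raised exception can be illustrated, under the assumption of a Python runtime, by walking the traceback attached to the exception and recording the local variables of every frame. The function and variable names below are hypothetical.

```python
def capture_stack_on_exception(exc):
    """Record method name, line and locals for every frame in the traceback."""
    frames = []
    tb = exc.__traceback__
    while tb is not None:
        frame = tb.tb_frame
        frames.append({
            "method": frame.f_code.co_name,
            "line": tb.tb_lineno,
            # repr() keeps the capture serializable for arbitrary objects.
            "locals": {k: repr(v) for k, v in frame.f_locals.items()},
        })
        tb = tb.tb_next
    return frames

def parse_quantity(text):
    return int(text)  # raises ValueError for non-numeric text

def order_total(price, quantity_text):
    quantity = parse_quantity(quantity_text)
    return price * quantity

def run():
    try:
        return order_total(10, "three")
    except ValueError as exc:
        return capture_stack_on_exception(exc)

detail = run()
```

Every method between the handler and the raising call appears in `detail`, together with the argument values that led to the failure, whether or not those methods were selected by the capture filter.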

The client library 320 may also inspect language runtime internals to determine values of data structures used by the application 220. In an embodiment, the capture configuration 420 for data structures may involve “walking” the structures and capturing information based on a defined level of nesting (e.g., a nested depth of the data structures) which may be specified per data structure type, instance and/or method as provided in the capture configuration 420. As stated previously for language implementations that do not provide first-class callback support, a compiler may be modified to insert callbacks as “hooks” such that, when processing the executable code 410, the modified compiler may generate code to provide initial signals passed in the callbacks to the client library 320 which may inspect the stack 480 directly (e.g., examine memory locations storing the stack). In other embodiments, the client library may add callbacks at runtime in the executable code via proxy methods (e.g., wrapping invocations of the methods to include the callbacks at entry and/or exit of the methods).
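Proxy methods of the kind described above may be sketched as a wrapper installed over an existing method at runtime. The helper `instrument`, the buffer `trace_log` and the example class `Cart` are hypothetical and serve only to illustrate callbacks at method entry and exit.

```python
import functools

trace_log = []  # hypothetical trace buffer (assumption)

def instrument(cls, method_name):
    """Replace cls.method_name with a proxy that reports entry and exit."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def proxy(self, *args, **kwargs):
        trace_log.append(("enter", method_name, args, kwargs))
        try:
            result = original(self, *args, **kwargs)
        except Exception as exc:
            trace_log.append(("raise", method_name, type(exc).__name__))
            raise  # the application still sees the original exception
        trace_log.append(("exit", method_name, result))
        return result

    setattr(cls, method_name, proxy)

class Cart:
    def __init__(self):
        self.items = []

    def add(self, sku, qty=1):
        self.items.append((sku, qty))
        return len(self.items)

instrument(Cart, "add")
count = Cart().add("A-1", qty=2)
```

The wrapped method behaves as before from the caller's point of view; only the entry and exit callbacks are added around the original invocation.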

FIG. 5 illustrates a workflow 500 for instrumenting executable code using a nesting level threshold for gathering information from data structures in accordance with the instrumentation trace capture technique. Assume a method 530 has arguments/variables (arg/var) 532, the values 534 of which are captured by the client library 320 and associated with data structures 540 (such as an array or list) having elements 550a-x with runtime values 560a-x. However, one or more of these elements 550a may contain references to further elements 550b of the data structures 540 arranged to create various levels of nesting (“nesting levels”) leading to complex hierarchies of information across the data structures of the user application. To control the processing time spent probing memory to walk the data structures 540 and capture the runtime values 560a-x, and thereby manage capture interval processing overhead, a reasonable limit on the amount of information to capture is needed. To that end, the technique defines a nesting level threshold 570 at which the client library 320 terminates recursion into the elements 550a-x and runtime values 560 of the data structures 540, e.g., runtime values 560a-b are captured, but not value 560x for element 550x. Beyond the threshold, instead of capturing the actual values 534 of the arguments/variables 532, the client library records metadata 580 based on the executable code that describes the recursive arguments as, e.g., types of data structures 540 and constituent elements 550x, without capturing their runtime values 560x.
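
The walk of FIG. 5 may be sketched, under simplifying assumptions, as a depth-limited recursion over Python containers. The function name walk, the parameter max_depth (standing in for the nesting level threshold 570) and the "__truncated__" marker used for the metadata 580 are choices made for this sketch only.

```python
def walk(value, max_depth, depth=0):
    """Capture nested values down to max_depth; record only metadata below it."""
    if isinstance(value, (dict, list, tuple)):
        if depth >= max_depth:
            # Threshold reached: describe the structure, do not capture its values.
            return {
                "__truncated__": True,
                "type": type(value).__name__,
                "size": len(value),
            }
        if isinstance(value, dict):
            return {key: walk(item, max_depth, depth + 1) for key, item in value.items()}
        return [walk(item, max_depth, depth + 1) for item in value]
    return value  # scalar: capture the runtime value itself


order = {"id": 7, "items": [{"sku": "a", "tags": ["x", "y"]}]}
snapshot = walk(order, max_depth=2)
```

With max_depth=2, the outer dictionary and the items list are walked, whereas the dictionary inside the list is replaced by a record of its type and size. The depth limit also bounds recursion into self-referential structures.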

In an embodiment, the nesting level threshold 570 may be dynamically adjusted depending on the type of elements 550a-x of the data structure 540 as provided in the capture configuration. The capture configuration 420 enables expression of the fact that capture of a particular element 550 (e.g., data structure 540) is critical to a user by enabling a large number of nesting levels (e.g., 5) for the element to be captured, despite the substantial amount of information captured and processing time consumed. In addition, the capture configuration 420 may express a nesting path according to the data structures 540 declared in the executable code 410 and an expression of values 560 of the elements 550 of the path as a selection/mask formulated as, e.g., a dictionary of key/value (element) pairs, wherein certain values of the pairs are deemed high-signal and desirous of capture. Essentially, the capture configuration 420 enables various forms of expression to define capture of an argument/variable 532 and data structure 540 by matching: (i) a type of the variable/data structure; (ii) a nesting path, together with a maximum depth of capture of the data structure 540; or (iii) a value 560 of one or more elements 550 of the structure 540, to achieve a desired information capture.
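
The three forms of matching may be sketched as rules that each select a capture depth. The Rule class, its field names and the dotted path notation are hypothetical and do not reflect the actual format of the capture configuration 420; the first matching rule is assumed to win.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Rule:
    depth: int                       # nesting levels to capture when the rule matches
    type_name: Optional[str] = None  # (i) match by type of the variable/structure
    path: Optional[str] = None       # (ii) match by nesting path
    key: Optional[str] = None        # (iii) match by value of an element ...
    value: Any = None                #       ... stored under this key


def matches(rule, path, obj):
    if rule.type_name is not None and type(obj).__name__ != rule.type_name:
        return False
    if rule.path is not None and rule.path != path:
        return False
    if rule.key is not None:
        if not isinstance(obj, dict) or obj.get(rule.key) != rule.value:
            return False
    return True


def depth_for(rules, path, obj, default=1):
    """Return the depth of the first matching rule, else the default depth."""
    for rule in rules:
        if matches(rule, path, obj):
            return rule.depth
    return default


rules = [
    Rule(depth=5, path="order.items"),             # critical structure: walk deeply
    Rule(depth=3, key="status", value="failed"),   # high-signal value
    Rule(depth=2, type_name="list"),
]
```

A depth obtained in this manner could then serve as the threshold for a depth-limited walk of the matched structure.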

Upon completion of executable code instrumentation, the client library 320 gathers captured trace signal information, e.g., in cooperation with the runtime system 240, and performs the first dictionary compression of the captured information (traces). The captured traces and the executable code are transferred via shared memory (in-core) and/or via the IPC mechanism 340 to the agent 330 in order to isolate the capture from the executing user application 220. Upon receipt, the agent 330 may perform the second dictionary compression across the captured traces and send the re-compressed captured traces over the computer network 150 to the APS infrastructure 350. The captured trace information may then be reported graphically and interactively to a user via the UI infrastructure 360.
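
The two stages of dictionary compression may be pictured, in greatly simplified form, as follows. This sketch assumes that the first stage replaces repeated symbols within a single trace by indices into a per-trace table, and that the second stage merges those tables into one table shared across traces; the function names and data layout are illustrative only and do not describe the actual encoding used by the client library 320 or the agent 330.

```python
def compress_trace(symbols):
    """Stage 1 (per trace): return (table, codes) with each symbol stored once."""
    table, index, codes = [], {}, []
    for symbol in symbols:
        if symbol not in index:
            index[symbol] = len(table)
            table.append(symbol)
        codes.append(index[symbol])
    return table, codes


def recompress(traces):
    """Stage 2 (across traces): merge per-trace tables into one shared table."""
    shared, index, recoded = [], {}, []
    for table, codes in traces:
        remap = []
        for symbol in table:
            if symbol not in index:
                index[symbol] = len(shared)
                shared.append(symbol)
            remap.append(index[symbol])
        recoded.append([remap[code] for code in codes])
    return shared, recoded


def decompress(shared, recoded):
    return [[shared[code] for code in codes] for codes in recoded]


trace_a = ["get", "user", "get"]
trace_b = ["user", "save", "user"]
stage_one = [compress_trace(trace_a), compress_trace(trace_b)]
shared_table, stage_two = recompress(stage_one)
```

In this example the symbol "user" is stored once per trace after the first stage and once overall after the second stage, which is the kind of cross-trace redundancy a second pass can remove.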

While there have been shown and described illustrative embodiments for enabling software developers to monitor, diagnose and solve errors associated with application development and production using an instrumentation trace capture technique, it is to be understood that various other adaptations and modifications may be made within the spirit and scope of the embodiments herein. For example, embodiments have been shown and described herein with relation to capturing information based on a defined level of nesting which may be specified per data structure type, instance and/or method. However, the embodiments in their broader sense are not so limited, and may, in fact, allow for quantifying the cost associated with defining the nesting level.

For instance, the instrumentation trace capture technique may enable user selection of a default nesting level. Depending on the amount of time needed to capture the information (which impacts execution of the user application 220), the default level can be adjusted (e.g., dialed down). The technique also provides manual configuration through the UI infrastructure 360 that enables the user to specify a level of detail capture for a specific method or variable despite the additional cost of processing resource consumption. A tradeoff is thus required between performance and resource utilization. A simple measure of the tradeoff may be the time required to perform the desired capture, which impacts execution performance of the user application but is easiest to quantify.
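
Dialing down the default nesting level against a time budget may be sketched as below. The function tune_depth, its parameters and the notion of a fixed per-capture budget are assumptions for illustration; the clock is injectable so that the behavior can be checked deterministically.

```python
import time


def tune_depth(capture, sample, start_depth, budget_seconds, clock=time.perf_counter):
    """Lower the nesting level until one capture of sample fits within the budget."""
    depth = start_depth
    while depth > 0:
        started = clock()
        capture(sample, depth)
        if clock() - started <= budget_seconds:
            return depth
        depth -= 1  # too slow: dial the default level down
    return 0


# Deterministic demonstration with a simulated clock: a capture at depth d
# is assumed to cost d seconds.
now = [0.0]


def fake_capture(sample, depth):
    now[0] += float(depth)


chosen = tune_depth(fake_capture, None, start_depth=5, budget_seconds=2.0,
                    clock=lambda: now[0])
```

Under the simulated cost model, the level is reduced from 5 until a capture completes within the 2 second budget, which occurs at level 2.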

The foregoing description has been directed to specific embodiments. It will be apparent, however, that other variations and modifications may be made to the described embodiments, with the attainment of some or all of their advantages. For instance, it is expressly contemplated that the components and/or elements described herein can be implemented as software encoded on a tangible (non-transitory) computer-readable medium (e.g., disks, and/or electronic memory) having program instructions executing on a computer, hardware, firmware, or a combination thereof. Accordingly, this description is to be taken only by way of example and not to otherwise limit the scope of the embodiments herein. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the embodiments herein.

Claims

1. A method comprising:

instrumenting executable code of an application to record information of the application executing on a computer node having a memory, wherein the executable code is instrumented in accordance with a capture configuration determining a fidelity of the recorded information;

generating a capture filter defining the information to capture from the application execution based on event-driven capture intervals established as callbacks into a process of the application;

recording the application execution during the capture interval according to the capture filter, wherein the recording includes information having (i) high-level programming text processed by a runtime system within which the application executes and (ii) symbols, values and associated arguments of methods of the application;

compressing the recording; and

storing the recording via an agent executing on a same operating system as the application.

2. The method of claim 1 wherein recording the application execution during the capture interval further comprises:

searching the capture filter to determine a presence of one or more identifiers of the information to capture from the application execution; and

in response to determining the presence of an identifier of the information to capture, recording the application execution.

3. The method of claim 2 wherein recording the application execution during the capture interval further comprises:

querying the runtime system via an application programming interface to determine a trace of the methods and associated arguments of the methods for the application invoked during the capture interval.

4. The method of claim 1 further comprising:

in response to an exception raised during the capture interval, increasing the fidelity of the recorded application execution.

5. The method of claim 1 wherein recording the application execution during the capture interval further comprises:

determining whether an expression stored in the capture configuration matches a path of a data structure defined in the executable code; and

in response to determining that the expression stored in the capture configuration matches the path, recording a runtime value of the data structure.

6. The method of claim 1 wherein recording the application execution during the capture interval further comprises:

capturing runtime values of a data structure by walking elements of the data structure in the memory according to a maximum nested depth stored in the capture configuration.

7. The method of claim 6, further comprising:

dynamically adjusting the maximum nested depth stored in the capture configuration based on one of a type of a first element of the data structure and runtime value of the first element of the data structure.

8. The method of claim 1 wherein recording the application execution during the capture interval further comprises:

determining whether an expression stored in the capture configuration matches a type of an element of a data structure defined in the executable code; and

in response to determining that the expression stored in the capture configuration matches the type of the element, capturing a value of the element of the structure.

9. The method of claim 1 wherein the recorded information includes one or more traces of the application method executed during the execution of the application.

10. The method of claim 1 wherein the recorded information includes one or more of (i) high-level programming text processed by a runtime system executing the application and (ii) symbols as labels representing one or more of methods, variables, data and state of the executable code.

11. A non-transitory computer readable medium including program instructions for execution on one or more processors, the program instructions configured to:

instrument executable code of an application to record information of the application executing on a computer node having a memory, wherein the executable code is instrumented in accordance with a capture configuration determining a fidelity of the recorded information;

generate a capture filter defining the information to capture from the application execution based on event-driven capture intervals established as callbacks into a process of the application;

record the application execution during the capture interval according to the capture filter, wherein the recording includes information having (i) high-level programming text processed by a runtime system within which the application executes and (ii) symbols, values, and associated arguments of methods of the application;

compress the recording; and

store the recording via an agent executing on a same operating system as the application.

12. The non-transitory computer readable medium of claim 11, wherein the program instructions configured to record the application execution during the capture interval are further configured to:

search the capture filter to determine a presence of one or more identifiers of the information to capture from the application execution; and

in response to determining the presence of an identifier of the information to capture, record the application execution.

13. The non-transitory computer readable medium of claim 12, wherein the program instructions configured to record the application execution during the capture interval are further configured to:

query the runtime system via an application programming interface to determine a trace of the methods and associated arguments of the methods for the application invoked during the capture interval.

14. The non-transitory computer readable medium of claim 11, wherein the program instructions configured to record the application execution during the capture interval are further configured to:

in response to an exception raised during the capture interval, increase the fidelity of the recorded application execution.

15. The non-transitory computer readable medium of claim 11, wherein the program instructions configured to record the application execution during the capture interval are further configured to:

determine whether an expression stored in the capture configuration matches a path of a data structure defined in the executable code; and

in response to determining that the expression stored in the capture configuration matches the path, capture a runtime value of the data structure.

16. The non-transitory computer readable medium of claim 11, wherein the program instructions configured to record the application execution during the capture interval are further configured to:

capture runtime values of a data structure by walking elements of the data structure in the memory according to a maximum nested depth stored in the capture configuration.

17. The non-transitory computer readable medium of claim 11, wherein the program instructions configured to record the application execution during the capture interval are further configured to:

determine whether an expression stored in the capture configuration matches a type of an element of a data structure defined in the executable code; and

in response to determining that the expression stored in the capture configuration matches the type of the element, record a value of the element of the structure.

18. The non-transitory computer readable medium of claim 16 wherein the recorded information includes one or more of (i) high-level programming text processed by a runtime system executing the application and (ii) symbols as labels representing one or more of methods, variables, data and state of the executable code.

19. The non-transitory computer readable medium of claim 11, wherein the program instructions configured to compress the recording are further configured to use two stages of dictionary-based compression for the recorded symbols, values, and associated arguments of the methods, wherein a first stage of the dictionary-based compression is performed in the client library, and a second stage of the dictionary-based compression is performed in the agent process.

20. A system comprising:

a node including a processor and a memory, the memory including an application having executable code linked to a client library with program instructions configured to: instrument the executable code to record information of the application executing on the node according to a capture configuration determining a fidelity of the recorded information; generate a capture filter defining the information to capture from the application execution based on event-driven capture intervals established as callbacks into a process of the application; record the application execution during the capture interval according to the capture filter, wherein the recording includes information having (i) high-level programming text processed by a runtime system within which the application executes and (ii) symbols, values, and associated arguments of methods of the application; compress the recording; and store the recording via an agent executing on a same operating system as the application.