METHODS AND SYSTEMS FOR INTELLIGENT SAMPLING OF NORMAL AND ERRONEOUS APPLICATION TRACES

Info

Publication number: 20220291982
Type: Application
Filed: Jul 13, 2021
Publication Date: Sep 15, 2022
Applicant: VMware, Inc. (Palo Alto, CA)
Inventors: Arnak Poghosyan (Yerevan), Ashot Nshan Harutyunyan (Yerevan), Naira Movses Grigoryan (Yerevan), Clement Pang (Palo Alto, CA), George Oganesyan (Yerevan), Karen Avagyan (Yerevan)
Application Number: 17/374,682

Abstract

Computer-implemented methods and systems described herein perform intelligent sampling of application traces generated by an application. Computer-implemented methods and systems determine different sampling rates based on frequency of occurrence of normal traces and erroneous traces of the application. The sampling rates for low frequency normal and erroneous traces are larger than the sampling rates for high frequency normal and erroneous traces. The relatively larger sampling rates for low frequency traces ensure that low frequency traces are sampled in sufficient numbers and are not passed over during sampling of the application traces. The sampled normal and erroneous traces are stored in a data storage device.

Description

CROSS-REFERENCE TO RELATED APPLICATION

This application claims the benefit of U.S. Provisional Application No. 63/155,349, filed Mar. 3, 2021.

TECHNICAL FIELD

This disclosure is directed to automated methods and systems for intelligent sampling of application traces.

BACKGROUND

Electronic computing has evolved from primitive, vacuum-tube-based computer systems, initially developed during the 1940s, to modern electronic computing systems in which large numbers of multi-processor computer systems, such as server computers, work stations, and other individual computing systems are networked together with large-capacity data storage devices and other electronic devices to produce geographically distributed computing systems with hundreds of thousands, millions, or more components that provide enormous computational bandwidths and data-storage capacities. These large, distributed computing systems include data centers and are made possible by advances in computer networking, distributed operating systems and applications, data-storage appliances, computer hardware, and software technologies. The number and size of data centers have continued to grow to meet the increasing demand for information technology (“IT”) services, such as running applications for organizations that provide business services, web services, and other cloud services to millions of customers each day.

Management tools have been developed to collect traces of applications and aid system administrators and application owners with detecting performance problems with applications executed in distributed computing systems. An application trace, or simply a “trace,” is a representation of a workflow executed by an application, such as the workflow of application components of a distributed application. Application owners analyze application traces to detect performance problems with their applications. For example, a distributed application may have multiple application components executed in VMs or containers on one or more hosts of a data center. The application traces are stored and used by administrators and application developers to troubleshoot performance problems and perform root cause analysis.

Storage of application traces for a plurality of applications executing in a distributed computing environment over time creates an increasing demand for available data storage space. For example, a typical distributed application that serves hundreds of thousands of clients each day generates hundreds of thousands of corresponding application traces that are stored in data storage devices each day. For application owners, storing an enormous number of application traces increases the costs of operation. In addition, application traces that reveal performance problems associated with execution of an application, called erroneous traces, often occur with far lower frequencies than normal application traces that indicate normal execution of an application. As a result, system administrators and application developers sift through millions of application traces to identify the small number of erroneous traces, which is expensive and time consuming. Typical management tools employ sampling procedures that sample and store a fraction of the application traces in an effort to reduce the storage space occupied by application traces and reduce the amount of time and cost associated with identifying erroneous traces. However, these sampling procedures fail to distinguish between the different types of traces. As a result, infrequently generated erroneous traces are often missed during sampling, which makes troubleshooting a performance problem a more challenging task. One approach is to store all erroneous traces. However, in certain situations the number of erroneous traces far exceeds the number of normal application traces, which eventually leads to the same problem of not having enough storage space available for normal and erroneous traces.
Application owners and system administrators seek computer-implemented methods and systems that, in general, reduce the number of stored application traces, do not undersample or miss low frequency erroneous traces, and reduce the number of stored erroneous traces when erroneous traces outnumber normal application traces.

SUMMARY

Computer-implemented methods and systems described herein perform intelligent sampling of normal and erroneous traces of an application. A set of trace data associated with the application is retrieved from a data storage device. The trace data may be stored in a trace database or temporarily stored in a buffer. Computer-implemented methods and systems determine sampling rates for sampling normal traces in the set and for sampling erroneous traces in the set. The different sampling rates are inversely proportional to the frequency of occurrence of the normal traces and erroneous traces. The sampling rates are used to obtain sampled normal traces and sampled erroneous traces. The sampling rates ensure that less frequently occurring normal traces are sampled at higher sampling rates than more frequently occurring normal traces and that less frequently occurring erroneous traces are sampled at higher sampling rates than more frequently occurring erroneous traces. The sampled traces are stored in a data storage device.
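The inverse relationship between sampling rate and frequency of occurrence summarized above can be illustrated with a minimal sketch. The sketch is not the specific rate calculation of this disclosure. The function names, the `scale` parameter, the cap at a rate of 1.0, and the grouping key are hypothetical and are assumed only for illustration.

```python
import random
from collections import Counter


def inverse_frequency_rates(labels, scale):
    """Return a sampling rate per group, inversely proportional to the
    group's relative frequency and capped at 1.0 (assumed form)."""
    counts = Counter(labels)
    total = len(labels)
    rates = {}
    for group, count in counts.items():
        frequency = count / total          # relative frequency of the group
        rates[group] = min(1.0, scale / frequency)
    return rates


def sample_traces(traces, key, scale, rng):
    """Keep each trace with the probability assigned to its group."""
    rates = inverse_frequency_rates([key(t) for t in traces], scale)
    sampled = [t for t in traces if rng.random() < rates[key(t)]]
    return sampled, rates


if __name__ == "__main__":
    # Hypothetical set of trace data: one frequent normal group, one less
    # frequent normal group, and one rare erroneous group.
    traces = (
        [{"kind": "normal-fast"}] * 900
        + [{"kind": "normal-slow"}] * 90
        + [{"kind": "error"}] * 10
    )
    sampled, rates = sample_traces(
        traces, key=lambda t: t["kind"], scale=0.01, rng=random.Random(0)
    )
    for kind, rate in sorted(rates.items()):
        kept = sum(1 for t in sampled if t["kind"] == kind)
        print(f"{kind}: rate={rate:.4f}, kept={kept}")
```

Under this assumed form, each group contributes roughly `scale` times the total number of traces to the sample, so a rare erroneous group is not passed over while a frequent normal group is heavily thinned.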

DESCRIPTION OF THE DRAWINGS

FIG. 1 shows an architectural diagram for various types of computers.

FIG. 2 shows an Internet-connected distributed computer system.

FIG. 3 shows cloud computing.

FIG. 4 shows generalized hardware and software components of a general-purpose computer system.

FIGS. 5A-5B show two types of virtual machine (“VM”) and VM execution environments.

FIG. 6 shows an example of an open virtualization format package.

FIG. 7 shows example virtual data centers provided as an abstraction of underlying physical-data-center hardware components.

FIG. 8 shows virtual-machine components of a virtual-data-center management server and physical servers of a physical data center.

FIG. 9 shows a cloud-director level of abstraction.

FIG. 10 shows virtual-cloud-connector nodes.

FIG. 11 shows an example server computer used to host three containers.

FIG. 12 shows an approach to implementing containers on a VM.

FIG. 13 shows an example of a virtualization layer located above a physical data center.

FIGS. 14A-14B show an example of a distributed application and an example application trace.

FIGS. 15A-15B show examples of erroneous traces for the distributed application represented in FIG. 14A.

FIG. 16A shows an example graphical-user interface (“GUI”) that enables a user to select an application and input sampling rates for sampling normal and erroneous traces of an application.

FIG. 16B shows an example of a computer system that executes machine-readable instructions for sampling traces.

FIG. 17 shows an example set of trace data generated by an application.

FIG. 18 shows an example calculation of traces sampled from a set of trace data sorted according to trace type.

FIG. 19 shows an example of erroneous traces partitioned into sets of erroneous traces based on error status codes.

FIG. 20 shows an example of a set of trace data sorted according to trace durations.

FIG. 21 shows an example of partitioning duration-sorted traces.

FIG. 22 shows an example histogram constructed from a set of trace data.

FIG. 23 shows an example of normal traces and an example of erroneous traces of the same trace type.

FIG. 24 shows an example of a set of trace data partitioned into normal traces and erroneous traces based on erroneous and normal status.

FIG. 25 shows an example set of erroneous traces partitioned according to different error codes.

FIGS. 26A-26D show plots of modified Gini indices versus sampling parameters.

FIG. 27 is a flow diagram illustrating an example implementation of a “method for sampling traces of an application.”

FIG. 28 is a flow diagram illustrating an example implementation of the “determine a normal trace sampling rate and an erroneous trace sampling rate” procedure performed in FIG. 27.

FIG. 29 is a flow diagram illustrating an example implementation of the “determine sampling rates based on trace type and/or duration” procedure performed in FIG. 28.

FIG. 30 is a flow diagram illustrating an example implementation of the “determine hybrid-sampling rates” procedure performed in FIG. 29.

FIG. 31 is a flow diagram illustrating an example implementation of the “determine trace-type sampling rates” procedure performed in FIG. 29.

FIG. 32 is a flow diagram illustrating an example implementation of the “determine duration-sampling rates” procedure performed in FIG. 29.

FIG. 33 is a flow diagram illustrating an example implementation of the “determine normal and erroneous trace sampling rates” procedure in FIG. 28.

FIG. 34 is a flow diagram illustrating an example implementation of the “determine normal trace sampling rate” procedure in FIG. 28.

FIG. 35 is a flow diagram illustrating an example implementation of the “sample normal traces using the normal trace sampling rate and the erroneous trace sampling rate” procedure performed in FIG. 27.

FIG. 36 is a flow diagram illustrating an example implementation of the “sample traces using hybrid-sampling rates” procedure performed in FIG. 35.

FIG. 37 is a flow diagram illustrating an example implementation of the “sample traces using trace-type sampling rates” procedure performed in FIG. 35.

FIG. 38 is a flow diagram illustrating an example implementation of the “sample traces using duration-sampling rates” procedure performed in FIG. 35.

FIG. 39 is a flow diagram illustrating an example implementation of the “sample traces using normal and erroneous sampling rates” procedure performed in FIG. 35.

DETAILED DESCRIPTION

This disclosure presents computer-implemented methods and systems that intelligently sample application traces generated by applications running in a distributed computing system. In the first subsection, computer hardware, complex computational systems, and virtualization are described. Computer-implemented methods and systems for intelligent sampling of normal and erroneous application traces are described below in the second subsection.

Computer Hardware, Complex Computational Systems, and Virtualization

The term “abstraction” as used to describe virtualization below is not intended to mean or suggest an abstract idea or concept. Computational abstractions are tangible, physical interfaces that are implemented, ultimately, using physical computer hardware, data storage devices, and communications systems. Instead, the term “abstraction” refers, in the current discussion, to a logical level of functionality encapsulated within one or more concrete, tangible, physically-implemented computer systems with defined interfaces through which electronically-encoded data is exchanged, process execution launched, and electronic services are provided. Interfaces may include graphical and textual data displayed on physical display devices as well as computer programs and routines that control physical computer processors to carry out various tasks and operations and that are invoked through electronically implemented application programming interfaces (“APIs”) and other electronically implemented interfaces.

FIG. 1 shows a general architectural diagram for various types of computers. Computers that receive, process, and store log messages may be described by the general architectural diagram shown in FIG. 1, for example. The computer system contains one or multiple central processing units (“CPUs”) 102-105, one or more electronic memories 108 interconnected with the CPUs by a CPU/memory-subsystem bus 110 or multiple busses, a first bridge 112 that interconnects the CPU memory-subsystem bus 110 with additional busses 114 and 116, or other types of high-speed interconnection media, including multiple, high-speed serial interconnects. These busses or serial interconnections, in turn, connect the CPUs and memory with specialized processors, such as a graphics processor 118, and with one or more additional bridges 120, which are interconnected with high-speed serial links or with multiple controllers 122-127, such as controller 127, that provide access to various different types of mass-storage devices 128, electronic displays, input devices, and other such components, subcomponents, and computational devices. It should be noted that computer-readable data storage devices include optical and electromagnetic disks, electronic memories, and other physical data storage devices.

Of course, there are many different types of computer-system architectures that differ from one another in the number of different memories, including different types of hierarchical cache memories, the number of processors and the connectivity of the processors with other system components, the number of internal communications busses and serial links, and in many other ways. However, computer systems generally execute stored programs by fetching instructions from memory and executing the instructions in one or more processors. Computer systems include general-purpose computer systems, such as personal computers (“PCs”), various types of server computers and workstations, and higher-end mainframe computers, but may also include a plethora of various types of special-purpose computing devices, including data-storage systems, communications routers, network nodes, tablet computers, and mobile telephones.

FIG. 2 shows an Internet-connected distributed computer system. As communications and networking technologies have evolved in capability and accessibility, and as the computational bandwidths, data-storage capacities, and other capabilities and capacities of various types of computer systems have steadily and rapidly increased, much of modern computing now generally involves large distributed systems and computers interconnected by local networks, wide-area networks, wireless communications, and the Internet. FIG. 2 shows a typical distributed system in which a large number of PCs 202-205, a high-end distributed mainframe system 210 with a large data-storage system 212, and a large computer center 214 with large numbers of rack-mounted server computers or blade servers are all interconnected through various communications and networking systems that together comprise the Internet 216. Such distributed computing systems provide diverse arrays of functionalities. For example, a PC user may access hundreds of millions of different web sites provided by hundreds of thousands of different web servers throughout the world and may access high-computational-bandwidth computing services from remote computer facilities for running complex computational tasks.

Until recently, computational services were generally provided by computer systems and data centers purchased, configured, managed, and maintained by service-provider organizations. For example, an e-commerce retailer generally purchased, configured, managed, and maintained a data center including numerous web server computers, back-end computer systems, and data-storage systems for serving web pages to remote customers, receiving orders through the web-page interface, processing the orders, tracking completed orders, and other myriad different tasks associated with an e-commerce enterprise.

FIG. 3 shows cloud computing. In the recently developed cloud-computing paradigm, computing cycles and data-storage facilities are provided to organizations and individuals by cloud-computing providers. In addition, larger organizations may elect to establish private cloud-computing facilities in addition to, or instead of, subscribing to computing services provided by public cloud-computing service providers. In FIG. 3, a system administrator for an organization, using a PC 302, accesses the organization's private cloud 304 through a local network 306 and private-cloud interface 308 and accesses, through the Internet 310, a public cloud 312 through a public-cloud services interface 314. The administrator can, in either the case of the private cloud 304 or public cloud 312, configure virtual computer systems and even entire virtual data centers and launch execution of application programs on the virtual computer systems and virtual data centers in order to carry out any of many different types of computational tasks. As one example, a small organization may configure and run a virtual data center within a public cloud that executes web servers to provide an e-commerce interface through the public cloud to remote customers of the organization, such as a user viewing the organization's e-commerce web pages on a remote user system 316.

Cloud-computing facilities are intended to provide computational bandwidth and data-storage services much as utility companies provide electrical power and water to consumers. Cloud computing provides enormous advantages to small organizations without the resources to purchase, manage, and maintain in-house data centers. Such organizations can dynamically add and delete virtual computer systems from their virtual data centers within public clouds in order to track computational-bandwidth and data-storage needs, rather than purchasing sufficient computer systems within a physical data center to handle peak computational-bandwidth and data-storage demands. Moreover, small organizations can completely avoid the overhead of maintaining and managing physical computer systems, including hiring and periodically retraining information-technology specialists and continuously paying for operating-system and database-management-system upgrades. Furthermore, cloud-computing interfaces allow for easy and straightforward configuration of virtual computing facilities, flexibility in the types of applications and operating systems that can be configured, and other functionalities that are useful even for owners and administrators of private cloud-computing facilities used by a single organization.

FIG. 4 shows generalized hardware and software components of a general-purpose computer system, such as a general-purpose computer system having an architecture similar to that shown in FIG. 1. The computer system 400 is often considered to include three fundamental layers: (1) a hardware layer or level 402; (2) an operating-system layer or level 404; and (3) an application-program layer or level 406. The hardware layer 402 includes one or more processors 408, system memory 410, various different types of input-output (I/O) devices 410 and 412, and mass-storage devices 414. Of course, the hardware level also includes many other components, including power supplies, internal communications links and busses, specialized integrated circuits, many different types of processor-controlled or microprocessor-controlled peripheral devices and controllers, and many other components. The operating system 404 interfaces to the hardware level 402 through a low-level operating system and hardware interface 416 generally comprising a set of non-privileged computer instructions 418, a set of privileged computer instructions 420, a set of non-privileged registers and memory addresses 422, and a set of privileged registers and memory addresses 424. In general, the operating system exposes non-privileged instructions, non-privileged registers, and non-privileged memory addresses 426 and a system-call interface 428 as an operating-system interface 430 to application programs 432-436 that execute within an execution environment provided to the application programs by the operating system. The operating system, alone, accesses the privileged instructions, privileged registers, and privileged memory addresses.
By reserving access to privileged instructions, privileged registers, and privileged memory addresses, the operating system can ensure that application programs and other higher-level computational entities cannot interfere with one another's execution and cannot change the overall state of the computer system in ways that could deleteriously impact system operation. The operating system includes many internal components and modules, including a scheduler 442, memory management 444, a file system 446, device drivers 448, and many other components and modules. To a certain degree, modern operating systems provide numerous levels of abstraction above the hardware level, including virtual memory, which provides to each application program and other computational entities a separate, large, linear memory-address space that is mapped by the operating system to various electronic memories and mass-storage devices. The scheduler orchestrates interleaved execution of various different application programs and higher-level computational entities, providing to each application program a virtual, stand-alone system devoted entirely to the application program. From the application program's standpoint, the application program executes continuously without concern for the need to share processor devices and other system devices with other application programs and higher-level computational entities. The device drivers abstract details of hardware-component operation, allowing application programs to employ the system-call interface for transmitting and receiving data to and from communications networks, mass-storage devices, and other I/O devices and subsystems. The file system 446 facilitates abstraction of mass-storage-device and memory devices as a high-level, easy-to-access, file-system interface. 
Thus, the development and evolution of the operating system has resulted in the generation of a type of multi-faceted virtual execution environment for application programs and other higher-level computational entities.

While the execution environments provided by operating systems have proved to be an enormously successful level of abstraction within computer systems, the operating-system-provided level of abstraction is nonetheless associated with difficulties and challenges for developers and users of application programs and other higher-level computational entities. One difficulty arises from the fact that there are many different operating systems that run within various different types of computer hardware. In many cases, popular application programs and computational systems are developed to run on only a subset of the available operating systems and can therefore be executed within only a subset of the different types of computer systems on which the operating systems are designed to run. Often, even when an application program or other computational system is ported to additional operating systems, the application program or other computational system can nonetheless run more efficiently on the operating systems for which the application program or other computational system was originally targeted. Another difficulty arises from the increasingly distributed nature of computer systems. Although distributed operating systems are the subject of considerable research and development efforts, many of the popular operating systems are designed primarily for execution on a single computer system. In many cases, it is difficult to move application programs, in real time, between the different computer systems of a distributed computer system for high-availability, fault-tolerance, and load-balancing purposes. The problems are even greater in heterogeneous distributed computer systems which include different types of hardware and devices running different types of operating systems. 
Operating systems continue to evolve, as a result of which certain older application programs and other computational entities may be incompatible with more recent versions of operating systems for which they are targeted, creating compatibility issues that are particularly difficult to manage in large distributed systems.

For all of these reasons, a higher level of abstraction, referred to as the “virtual machine,” (“VM”) has been developed and evolved to further abstract computer hardware in order to address many difficulties and challenges associated with traditional computing systems, including the compatibility issues discussed above. FIGS. 5A-B show two types of VM and virtual-machine execution environments. FIGS. 5A-B use the same illustration conventions as used in FIG. 4. FIG. 5A shows a first type of virtualization. The computer system 500 in FIG. 5A includes the same hardware layer 502 as the hardware layer 402 shown in FIG. 4. However, rather than providing an operating system layer directly above the hardware layer, as in FIG. 4, the virtualized computing environment shown in FIG. 5A features a virtualization layer 504 that interfaces through a virtualization-layer/hardware-layer interface 506, equivalent to interface 416 in FIG. 4, to the hardware. The virtualization layer 504 provides a hardware-like interface to VMs, such as VM 510, in a virtual-machine layer 511 executing above the virtualization layer 504. Each VM includes one or more application programs or other higher-level computational entities packaged together with an operating system, referred to as a “guest operating system,” such as application 514 and guest operating system 516 packaged together within VM 510. Each VM is thus equivalent to the operating-system layer 404 and application-program layer 406 in the general-purpose computer system shown in FIG. 4. Each guest operating system within a VM interfaces to the virtualization layer interface 504 rather than to the actual hardware interface 506. The virtualization layer 504 partitions hardware devices into abstract virtual-hardware layers to which each guest operating system within a VM interfaces. 
The guest operating systems within the VMs, in general, are unaware of the virtualization layer and operate as if they were directly accessing a true hardware interface. The virtualization layer 504 ensures that each of the VMs currently executing within the virtual environment receives a fair allocation of underlying hardware devices and that all VMs receive sufficient devices to progress in execution. The virtualization layer 504 may differ for different guest operating systems. For example, the virtualization layer is generally able to provide virtual hardware interfaces for a variety of different types of computer hardware. This allows, as one example, a VM that includes a guest operating system designed for a particular computer architecture to run on hardware of a different architecture. The number of VMs need not be equal to the number of physical processors or even a multiple of the number of processors.

The virtualization layer 504 includes a virtual-machine-monitor module 518 (“VMM”) that virtualizes physical processors in the hardware layer to create virtual processors on which each of the VMs executes. For execution efficiency, the virtualization layer attempts to allow VMs to directly execute non-privileged instructions and to directly access non-privileged registers and memory. However, when the guest operating system within a VM accesses virtual privileged instructions, virtual privileged registers, and virtual privileged memory through the virtualization layer 504, the accesses result in execution of virtualization-layer code to simulate or emulate the privileged devices. The virtualization layer additionally includes a kernel module 520 that manages memory, communications, and data-storage machine devices on behalf of executing VMs (“VM kernel”). The VM kernel, for example, maintains shadow page tables on each VM so that hardware-level virtual-memory facilities can be used to process memory accesses. The VM kernel additionally includes routines that implement virtual communications and data storage devices as well as device drivers that directly control the operation of underlying hardware communications and data storage devices. Similarly, the VM kernel virtualizes various other types of I/O devices, including keyboards, optical-disk drives, and other such devices. The virtualization layer 504 essentially schedules execution of VMs much like an operating system schedules execution of application programs, so that the VMs each execute within a complete and fully functional virtual hardware layer.

FIG. 5B shows a second type of virtualization. In FIG. 5B, the computer system 540 includes the same hardware layer 542 and operating system layer 544 as the hardware layer 402 and the operating system layer 404 shown in FIG. 4. Several application programs 546 and 548 are shown running in the execution environment provided by the operating system 544. In addition, a virtualization layer 550 is also provided, in computer 540, but, unlike the virtualization layer 504 discussed with reference to FIG. 5A, virtualization layer 550 is layered above the operating system 544, referred to as the “host OS,” and uses the operating system interface to access operating-system-provided functionality as well as the hardware. The virtualization layer 550 comprises primarily a VMM and a hardware-like interface 552, similar to hardware-like interface 508 in FIG. 5A. The hardware-like interface 552, equivalent to interface 416 in FIG. 4, provides an execution environment for a number of VMs 556-558, each including one or more application programs or other higher-level computational entities packaged together with a guest operating system.

In FIGS. 5A-5B, the layers are somewhat simplified for clarity of illustration. For example, portions of the virtualization layer 550 may reside within the host-operating-system kernel, such as a specialized driver incorporated into the host operating system to facilitate hardware access by the virtualization layer.

It should be noted that virtual hardware layers, virtualization layers, and guest operating systems are all physical entities that are implemented by computer instructions stored in physical data storage devices, including electronic memories, mass-storage devices, optical disks, magnetic disks, and other such devices. The term “virtual” does not, in any way, imply that virtual hardware layers, virtualization layers, and guest operating systems are abstract or intangible. Virtual hardware layers, virtualization layers, and guest operating systems execute on physical processors of physical computer systems and control operation of the physical computer systems, including operations that alter the physical states of physical devices, including electronic memories and mass-storage devices. They are as physical and tangible as any other component of a computer system, such as power supplies, controllers, processors, busses, and data storage devices.

A VM or virtual application, described below, is encapsulated within a data package for transmission, distribution, and loading into a virtual-execution environment. One public standard for virtual-machine encapsulation is referred to as the “open virtualization format” (“OVF”). The OVF standard specifies a format for digitally encoding a VM within one or more data files. FIG. 6 shows an OVF package. An OVF package 602 includes an OVF descriptor 604, an OVF manifest 606, an OVF certificate 608, one or more disk-image files 610-611, and one or more device files 612-614. The OVF package can be encoded and stored as a single file or as a set of files. The OVF descriptor 604 is an XML document 620 that includes a hierarchical set of elements, each demarcated by a beginning tag and an ending tag. The outermost, or highest-level, element is the envelope element, demarcated by tags 622 and 623. The next-level element includes a reference element 626 that includes references to all files that are part of the OVF package, a disk section 628 that contains meta information about all of the virtual disks included in the OVF package, a network section 630 that includes meta information about all of the logical networks included in the OVF package, and a collection of virtual-machine configurations 632 which further includes hardware descriptions of each VM 634. There are many additional hierarchical levels and elements within a typical OVF descriptor. The OVF descriptor is thus a self-describing, XML file that describes the contents of an OVF package. The OVF manifest 606 is a list of cryptographic-hash-function-generated digests 636 of the entire OVF package and of the various components of the OVF package. The OVF certificate 608 is an authentication certificate 640 that includes a digest of the manifest and that is cryptographically signed. 
Disk image files, such as disk image file 610, are digital encodings of the contents of virtual disks and device files 612 are digitally encoded content, such as operating-system images. A VM or a collection of VMs encapsulated together within a virtual application can thus be digitally encoded as one or more files within an OVF package that can be transmitted, distributed, and loaded using well-known tools for transmitting, distributing, and loading files. A virtual appliance is a software service that is delivered as a complete software stack installed within one or more VMs that is encoded within an OVF package.
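The role of the OVF manifest described above, a list of digests over the components of the package, can be illustrated with a minimal Python sketch. The file names, the choice of SHA-256, and the line layout `SHA256(name)= digest` are assumptions made for illustration and are not taken from this description; signing of the manifest by the certificate is omitted.

```python
import hashlib


def manifest_lines(files: dict[str, bytes]) -> list[str]:
    """Return one digest line per package component (hypothetical layout)."""
    lines = []
    for name, content in sorted(files.items()):
        digest = hashlib.sha256(content).hexdigest()
        lines.append(f"SHA256({name})= {digest}")
    return lines


def verify(files: dict[str, bytes], lines: list[str]) -> bool:
    """Recompute the digests and compare them with the stored manifest."""
    return manifest_lines(files) == lines


# Hypothetical package contents: a descriptor and one disk image.
package = {
    "appliance.ovf": b"<Envelope></Envelope>",
    "disk1.vmdk": b"\x00\x01\x02",
}
manifest = manifest_lines(package)
```

A receiver that recomputes the digests can detect a component that was altered after the manifest was generated, because `verify` then returns `False`.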

The advent of VMs and virtual environments has alleviated many of the difficulties and challenges associated with traditional general-purpose computing. Machine and operating-system dependencies can be significantly reduced or eliminated by packaging applications and operating systems together as VMs and virtual appliances that execute within virtual environments provided by virtualization layers running on many different types of computer hardware. A next level of abstraction, referred to as virtual data centers or virtual infrastructure, provides a data-center interface to virtual data centers computationally constructed within physical data centers.

FIG. 7 shows virtual data centers provided as an abstraction of underlying physical-data-center hardware components. In FIG. 7, a physical data center 702 is shown below a virtual-interface plane 704. The physical data center consists of a virtual-data-center management server computer 706 and any of various different computers, such as PC 708, on which a virtual-data-center management interface may be displayed to system administrators and other users. The physical data center additionally includes generally large numbers of server computers, such as server computer 710, that are coupled together by local area networks, such as local area network 712 that directly interconnects server computers 710 and 714-720 and a mass-storage array 722. The physical data center shown in FIG. 7 includes three local area networks 712, 724, and 726 that each directly interconnects a bank of eight server computers and a mass-storage array. The individual server computers, such as server computer 710, each includes a virtualization layer and runs multiple VMs. Different physical data centers may include many different types of computers, networks, data-storage systems and devices connected according to many different types of connection topologies. The virtual-interface plane 704, a logical abstraction layer shown by a plane in FIG. 7, abstracts the physical data center to a virtual data center comprising one or more device pools, such as device pools 730-732, one or more virtual data stores, such as virtual data stores 734-736, and one or more virtual networks. In certain implementations, the device pools abstract banks of server computers directly interconnected by a local area network.

The virtual-data-center management interface allows provisioning and launching of VMs with respect to device pools, virtual data stores, and virtual networks, so that virtual-data-center administrators need not be concerned with the identities of physical-data-center components used to execute particular VMs. Furthermore, the virtual-data-center management server computer 706 includes functionality to migrate running VMs from one server computer to another in order to optimally or near optimally manage device allocation and to provide fault tolerance and high availability by migrating VMs to most effectively utilize underlying physical hardware devices, to replace VMs disabled by physical hardware problems and failures, and to ensure that multiple VMs supporting a high-availability virtual appliance are executing on multiple physical computer systems so that the services provided by the virtual appliance are continuously accessible, even when one of the multiple virtual appliances becomes compute bound, data-access bound, suspends execution, or fails. Thus, the virtual data center layer of abstraction provides a virtual-data-center abstraction of physical data centers to simplify provisioning, launching, and maintenance of VMs and virtual appliances as well as to provide high-level, distributed functionalities that involve pooling the devices of individual server computers and migrating VMs among server computers to achieve load balancing, fault tolerance, and high availability.

FIG. 8 shows virtual-machine components of a virtual-data-center management server computer and physical server computers of a physical data center above which a virtual-data-center interface is provided by the virtual-data-center management server computer. The virtual-data-center management server computer 802 and a virtual-data-center database 804 comprise the physical components of the management component of the virtual data center. The virtual-data-center management server computer 802 includes a hardware layer 806 and virtualization layer 808 and runs a virtual-data-center management-server VM 810 above the virtualization layer. Although shown as a single server computer in FIG. 8, the virtual-data-center management server computer (“VDC management server”) may include two or more physical server computers that support multiple VDC-management-server virtual appliances. The virtual-data-center management-server VM 810 includes a management-interface component 812, distributed services 814, core services 816, and a host-management interface 818. The host-management interface 818 is accessed from any of various computers, such as the PC 708 shown in FIG. 7. The host-management interface 818 allows the virtual-data-center administrator to configure a virtual data center, provision VMs, collect statistics and view log files for the virtual data center, and to carry out other, similar management tasks. The host-management interface 818 interfaces to virtual-data-center agents 824, 825, and 826 that execute as VMs within each of the server computers of the physical data center that is abstracted to a virtual data center by the VDC management server computer.

The distributed services 814 include a distributed-device scheduler that assigns VMs to execute within particular physical server computers and that migrates VMs in order to most effectively make use of computational bandwidths, data-storage capacities, and network capacities of the physical data center. The distributed services 814 further include a high-availability service that replicates and migrates VMs in order to ensure that VMs continue to execute despite problems and failures experienced by physical hardware components. The distributed services 814 also include a live-virtual-machine migration service that temporarily halts execution of a VM, encapsulates the VM in an OVF package, transmits the OVF package to a different physical server computer, and restarts the VM on the different physical server computer from a virtual-machine state recorded when execution of the VM was halted. The distributed services 814 also include a distributed backup service that provides centralized virtual-machine backup and restore.

The core services 816 provided by the VDC management server VM 810 include host configuration, virtual-machine configuration, virtual-machine provisioning, generation of virtual-data-center alerts and events, ongoing event logging and statistics collection, a task scheduler, and a device-management module. Each of the physical server computers 820-822 also includes a host-agent VM 828-830 through which the virtualization layer can be accessed via a virtual-infrastructure application programming interface (“API”). This interface allows a remote administrator or user to manage an individual server computer through the infrastructure API. The virtual-data-center agents 824-826 access virtualization-layer server information through the host agents. The virtual-data-center agents are primarily responsible for offloading certain of the virtual-data-center management-server functions specific to a particular physical server to that physical server computer. The virtual-data-center agents relay and enforce device allocations made by the VDC management server VM 810, relay virtual-machine provisioning and configuration-change commands to host agents, monitor and collect performance statistics, alerts, and events communicated to the virtual-data-center agents by the local host agents through the interface API, and carry out other, similar virtual-data-management tasks.

The virtual-data-center abstraction provides a convenient and efficient level of abstraction for exposing the computational devices of a cloud-computing facility to cloud-computing-infrastructure users. A cloud-director management server exposes virtual devices of a cloud-computing facility to cloud-computing-infrastructure users. In addition, the cloud director introduces a multi-tenancy layer of abstraction, which partitions VDCs into tenant-associated VDCs that can each be allocated to an individual tenant or tenant organization, both referred to as a “tenant.” A given tenant can be provided one or more tenant-associated VDCs by a cloud director managing the multi-tenancy layer of abstraction within a cloud-computing facility. The cloud services interface (308 in FIG. 3) exposes a virtual-data-center management interface that abstracts the physical data center.

FIG. 9 shows a cloud-director level of abstraction. In FIG. 9, three different physical data centers 902-904 are shown below planes representing the cloud-director layer of abstraction 906-908. Above the planes representing the cloud-director level of abstraction, multi-tenant virtual data centers 910-912 are shown. The devices of these multi-tenant virtual data centers are securely partitioned in order to provide secure virtual data centers to multiple tenants, or cloud-services-accessing organizations. For example, a cloud-services-provider virtual data center 910 is partitioned into four different tenant-associated virtual-data centers within a multi-tenant virtual data center for four different tenants 916-919. Each multi-tenant virtual data center is managed by a cloud director comprising one or more cloud-director server computers 920-922 and associated cloud-director databases 924-926. Each cloud-director server computer or server computers runs a cloud-director virtual appliance 930 that includes a cloud-director management interface 932, a set of cloud-director services 934, and a virtual-data-center management-server interface 936. The cloud-director services include an interface and tools for provisioning multi-tenant virtual data centers on behalf of tenants, tools and interfaces for configuring and managing tenant organizations, tools and services for organization of virtual data centers and tenant-associated virtual data centers within the multi-tenant virtual data center, services associated with template and media catalogs, and provisioning of virtualization networks from a network pool. Templates are VMs that each contains an OS and/or one or more VMs containing applications. A template may include much of the detailed contents of VMs and virtual appliances that are encoded within OVF packages, so that the task of configuring a VM or virtual appliance is significantly simplified, requiring only deployment of one OVF package. 
These templates are stored in catalogs within a tenant's virtual-data center. These catalogs are used for developing and staging new virtual appliances and published catalogs are used for sharing templates in virtual appliances across organizations. Catalogs may include OS images and other information relevant to construction, distribution, and provisioning of virtual appliances.

Considering FIGS. 7 and 9, the VDC-server and cloud-director layers of abstraction can be seen, as discussed above, to facilitate employment of the virtual-data-center concept within private and public clouds. However, this level of abstraction does not fully facilitate aggregation of single-tenant and multi-tenant virtual data centers into heterogeneous or homogeneous aggregations of cloud-computing facilities.

FIG. 10 shows virtual-cloud-connector nodes (“VCC nodes”) and a VCC server, components of a distributed system that provides multi-cloud aggregation and that includes a cloud-connector server and cloud-connector nodes that cooperate to provide services that are distributed across multiple clouds. VMware vCloud™ VCC servers and nodes are one example of VCC server and nodes. In FIG. 10, seven different cloud-computing facilities 1002-1008 are shown. Cloud-computing facility 1002 is a private multi-tenant cloud with a cloud director 1010 that interfaces to a VDC management server 1012 to provide a multi-tenant private cloud comprising multiple tenant-associated virtual data centers. The remaining cloud-computing facilities 1003-1008 may be either public or private cloud-computing facilities and may be single-tenant virtual data centers, such as virtual data centers 1003 and 1006, multi-tenant virtual data centers, such as multi-tenant virtual data centers 1004 and 1007-1008, or any of various different kinds of third-party cloud-services facilities, such as third-party cloud-services facility 1005. An additional component, the VCC server 1014, acting as a controller, is included in the private cloud-computing facility 1002 and interfaces to a VCC node 1016 that runs as a virtual appliance within the cloud director 1010. A VCC server may also run as a virtual appliance within a VDC management server that manages a single-tenant private cloud. The VCC server 1014 additionally interfaces, through the Internet, to VCC node virtual appliances executing within remote VDC management servers, remote cloud directors, or within the third-party cloud services 1018-1023. The VCC server provides a VCC server interface that can be displayed on a local or remote terminal, PC, or other computer system 1026 to allow a cloud-aggregation administrator or other user to access VCC-server-provided aggregate-cloud distributed services. 
In general, the cloud-computing facilities that together form a multiple-cloud-computing aggregation through distributed services provided by the VCC server and VCC nodes are geographically and operationally distinct.

As mentioned above, while the virtual-machine-based virtualization layers, described in the previous subsection, have received widespread adoption and use in a variety of different environments, from personal computers to enormous, distributed computing systems, traditional virtualization technologies are associated with computational overheads. While these computational overheads have steadily decreased, over the years, and often represent ten percent or less of the total computational bandwidth consumed by an application running above a guest operating system in a virtualized environment, traditional virtualization technologies nonetheless involve computational costs in return for the power and flexibility that they provide.

While a traditional virtualization layer can simulate the hardware interface expected by any of many different operating systems, OSL virtualization essentially provides a secure partition of the execution environment provided by a particular operating system. As one example, OSL virtualization provides a file system to each container, but the file system provided to the container is essentially a view of a partition of the general file system provided by the underlying operating system of the host. In essence, OSL virtualization uses operating-system features, such as namespace isolation, to isolate each container from the other containers running on the same host. In other words, namespace isolation ensures that each application is executed within the execution environment provided by a container to be isolated from applications executing within the execution environments provided by the other containers. A container cannot access files that are not included in the container's namespace and cannot interact with applications running in other containers. As a result, a container can be booted up much faster than a VM, because the container uses operating-system-kernel features that are already available and functioning within the host. Furthermore, the containers share computational bandwidth, memory, network bandwidth, and other computational resources provided by the operating system, without the overhead associated with computational resources allocated to VMs and virtualization layers. Again, however, OSL virtualization does not provide many desirable features of traditional virtualization. 
As mentioned above, OSL virtualization does not provide a way to run different types of operating systems for different groups of containers within the same host and OSL-virtualization does not provide for live migration of containers between hosts, high-availability functionality, distributed resource scheduling, and other computational functionality provided by traditional virtualization technologies.

FIG. 11 shows an example server computer used to host three containers. As discussed above with reference to FIG. 4, an operating system layer 404 runs above the hardware 402 of the host computer. The operating system provides an interface, for higher-level computational entities, that includes a system-call interface 428 and the non-privileged instructions, memory addresses, and registers 426 provided by the hardware layer 402. However, unlike in FIG. 4, in which applications run directly above the operating system layer 404, OSL virtualization involves an OSL virtualization layer 1102 that provides operating-system interfaces 1104-1106 to each of the containers 1108-1110. The containers, in turn, provide an execution environment for an application that runs within the execution environment provided by container 1108. The container can be thought of as a partition of the resources generally available to higher-level computational entities through the operating system interface 430.

FIG. 12 shows an approach to implementing the containers on a VM. FIG. 12 shows a host computer similar to that shown in FIG. 5A, discussed above. The host computer includes a hardware layer 502 and a virtualization layer 504 that provides a virtual hardware interface 508 to a guest operating system 1202. Unlike in FIG. 5A, the guest operating system interfaces to an OSL-virtualization layer 1204 that provides container execution environments 1200-1208 to multiple application programs.

Note that, although only a single guest operating system and OSL virtualization layer are shown in FIG. 12, a single virtualized host system can run multiple different guest operating systems within multiple VMs, each of which supports one or more OSL-virtualization containers. A virtualized, distributed computing system that uses guest operating systems running within VMs to support OSL-virtualization layers to provide containers for running applications is referred to, in the following discussion, as a “hybrid virtualized distributed computing system.”

Running containers above a guest operating system within a VM provides advantages of traditional virtualization in addition to the advantages of OSL virtualization. Containers can be quickly booted in order to provide additional execution environments and associated resources for additional application instances. The resources available to the guest operating system are efficiently partitioned among the containers provided by the OSL-virtualization layer 1204 in FIG. 12, because there is almost no additional computational overhead associated with container-based partitioning of computational resources. At the same time, many of the powerful and flexible features of the traditional virtualization technology can be applied to VMs in which containers run above guest operating systems, including live migration from one host to another, various types of high-availability and distributed resource scheduling, and other such features. Containers provide share-based allocation of computational resources to groups of applications with guaranteed isolation of applications in one container from applications in the remaining containers executing above a guest operating system. Moreover, resource allocation can be modified at run time between containers. The traditional virtualization layer provides for flexible scaling over large numbers of hosts within distributed computing systems and a simple approach to operating-system upgrades and patches. Thus, the use of OSL virtualization above traditional virtualization in a hybrid virtualized distributed computing system, as shown in FIG. 12, provides many of the advantages of both a traditional virtualization layer and OSL virtualization.

Computer-Implemented Methods and Systems for Performing Intelligent Sampling of Normal and Erroneous Application Traces

A distributed application comprises multiple VMs or containers that run application components simultaneously on one or more host server computers of a distributed computing system. The components are typically executed separately in the VMs or containers. The server computers are networked together so that information processing performed by the distributed application is distributed over the server computers, allowing the VMs or containers to exchange data. The distributed application can be scaled to satisfy changing demands by increasing or decreasing the number of VMs or containers. As a result, a typical distributed application can process multiple requests from multiple clients at the same time.

FIG. 13 shows an example of a virtualization layer 1302 that is executed in a physical data center 1304. For the sake of illustration, the virtualization layer 1302 is shown separated from the physical data center 1304 by a virtual-interface plane 1306. The physical data center 1304 is an example of a distributed computing system. The physical data center 1304 comprises physical objects, including an administration computer system 1308, any of various computers, such as PC 1310, on which a virtual data center (“VDC”) management interface may be displayed to system administrators and other users, computers, such as computers 1312-1319, data storage devices, and network devices. Each computer may have multiple network interface cards (“NIC”) that provide high bandwidth and networking to other computers and data storage devices in the physical data center 1304. The computers may be mounted in racks (not shown) that are networked together to form server-computer groups within the data center 1304. The example physical data center 1304 includes three computer groups, each of which has eight computers. For example, computer group 1320 comprises interconnected computers 1312-1319 that are connected to a mass-storage array 1322 via a switch (not shown). Within each computer group, certain computers are grouped together to form clusters. Each cluster provides an aggregated set of resources, such as processors, memory, and disk space, (i.e., resource pool) to objects in the virtualization layer 1302. Physical data centers are not limited to the example physical data center 1304. Different physical data centers may include many different types of computers, networks, data-storage systems, and devices connected according to many different types of connection topologies.

The virtualization layer 1302 includes virtual objects, such as VMs, applications, and containers, hosted by the computers in the physical data center 1304. The virtualization layer 1302 also includes a virtual network (not illustrated) comprising virtual switches, virtual routers, load balancers, and virtual NICs. Certain computers host VMs and containers as described above. For example, computer 1318 hosts two containers identified as Cont₁ and Cont₂; a cluster of computers 1313 and 1314 hosts five VMs identified as VM₁, VM₂, VM₃, VM₄, and VM₅; and computer 1324 hosts four VMs identified as VM₇, VM₈, VM₉, and VM₁₀. Other computers may host applications as described above with reference to FIG. 4. For example, computer 1326 hosts a standalone application identified as App₁.

In FIG. 13, the VMs VM₁, VM₂, VM₃, VM₄, and VM₅ are application components of a distributed application executed on the cluster of server computers 1313 and 1314. The resources of the server computers 1313 and 1314 provide a resource pool for the five VMs. The VMs enable different software components of the distributed application to run on different operating systems, share the same pool of resources, and share data. The VMs VM₁, VM₂, VM₃, VM₄, and VM₅ may provide web services to customers. For example, VM₁ may provide frontend services that enable users to purchase items sold by an owner of the distributed application over the Internet. VMs VM₂-VM₅ execute backend operations that complete each user's purchase, such as collecting money from a user's bank, charging a user's credit card, updating a user's information, updating the owner's inventory, and arranging for products to be shipped to the user. The VMs VM₇, VM₈, VM₉, and VM₁₀ execute a second distributed application on the server computer 1324. Containers Cont₁ and Cont₂ execute components of a third distributed application on the server computer 1318.

Application tracing tracks an application's flow and data progression with the results for each execution of the application presented in a separate application trace. An application trace, also called a “trace,” represents a workflow executed by an application or a distributed application. A trace represents how a request, such as a user or client request, propagates through components of a distributed application or through services provided by each component of a distributed application. A trace consists of one or more spans. Each span represents an amount of time spent executing a service or performance of a function of the application. Application traces may be used in troubleshooting to identify interesting patterns or performance problems with the application itself, the resources used to execute the application, and the network.
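The relationship between a trace and its spans can be sketched as a small data structure. The following Python sketch is illustrative only: the class names, field names, and time units are assumptions and do not correspond to any particular tracing library.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    service: str      # service or component that performed the work
    operation: str    # function performed by the service
    start: float      # start time (hypothetical units, e.g., milliseconds)
    end: float        # end time in the same units

    @property
    def duration(self) -> float:
        """Amount of time spent executing the service or function."""
        return self.end - self.start


@dataclass
class Trace:
    trace_id: str
    spans: list[Span] = field(default_factory=list)

    def add(self, span: Span) -> None:
        self.spans.append(span)

    def time_per_service(self) -> dict[str, float]:
        """Total span time attributed to each service in this trace."""
        totals: dict[str, float] = {}
        for span in self.spans:
            totals[span.service] = totals.get(span.service, 0.0) + span.duration
        return totals


# Hypothetical trace of one client request.
trace = Trace("t-1")
trace.add(Span("Service1", "handle_request", 0.0, 4.0))
trace.add(Span("Service2", "lookup", 1.0, 2.5))
trace.add(Span("Service1", "respond", 3.0, 4.0))
```

Summing span durations per service is one simple way such a structure could support troubleshooting, since it shows where the time of a request was spent.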

FIGS. 14A-14B show an example of a distributed application and an example application trace. FIG. 14A shows an example of five services provided by a distributed application. The services are represented by blocks identified as Service₁, Service₂, Service₃, Service₄, and Service₅. The services may be web services provided to customers. For example, Service₁ may be a web server that enables a user to purchase items sold by the application owner and communicates with other services. The services Service₂, Service₃, Service₄, and Service₅ are computational services that perform different functions to complete the user's request. The components perform different functions of a distributed application and are executed in separate VMs on one or more server computers or using shared resources of a resource pool provided by a cluster of server computers. For example, services Service₁, Service₂, Service₃, Service₄, and Service₅ are performed by five application components in VMs VM₁, VM₂, VM₃, VM₄, and VM₅, respectively, of FIG. 13. Directional arrows 1401-1405 represent requests for a service provided by the services Service₁, Service₂, Service₃, Service₄, and Service₅. For example, directional arrow 1401 represents a user's request for a service offered by Service₁, such as a functionality provided by a web server. After a request has been issued by the user, directional arrows 1403 and 1404 represent Service₁ requests for execution of services or functions performed by Service₂ and Service₃. Dashed directional arrows 1406 and 1407 represent responses. For example, Service₂ sends a response to Service₁ indicating that the operations performed by Service₃ and Service₄ have been completed. Service₁ then requests services from Service₅, as represented by directional arrow 1405, and provides a response to the user, as represented by directional arrow 1407.

FIG. 14B shows an example trace of the distributed application represented in FIG. 14A. Directional arrow 1408 is a time axis. The order in which services are executed is listed in column 1409. The services perform different functions indicated in parentheses, with Service₁ and Service₂ performing more than one function. Each bar represents a time span, which is an amount of time (i.e., duration) spent performing one of the functions provided by a service. Unshaded bars 1410-1412 represent spans of time spent executing the different functions performed by Service₁. For example, bar 1410 represents the span of time Service₁ spends interacting with a user. Bar 1411 represents the span of time Service₁ spends interacting with the services provided by Service₂. Hash marked bars 1414-1415 represent spans of time spent executing Service₂ with services Service₃ and Service₄. Shaded bar 1416 represents a span of time spent executing Service₃. Dark hash marked bar 1418 represents a span of time spent executing Service₄. Cross-hatched bar 1420 represents a span of time spent executing Service₅.

Traces are classified according to trace type, which is given by the span of the first service, operation, or function performed by an application. The first span is called the “root span,” which is used as the trace type and is denoted by TT. For example, the span 1410 of the trace shown in FIG. 14B is the root span of the trace and is used to identify the trace type. A trace may also be classified by the order in which different services or functions are performed by an application. For example, the ordered sequence of services or functions listed in column 1409 may be used to define a trace type denoted by a 7-tuple: (Service₁, Service₁, Service₂, Service₃, Service₄, Service₁, Service₅). Each trace has a corresponding duration, or total time of the trace, denoted by D. The duration is the amount of time taken by the application to complete a request or perform a series of functions requested by a client. For example, time interval 1422 is the duration D of the trace shown in FIG. 14B, which represents the total amount of time taken to execute the services in FIG. 14A.
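The two ways of deriving a trace type, and the duration D, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the `SpanRecord` type, the use of the root span's service name as the label for TT, and all timing values are hypothetical and were chosen only to reproduce the 7-tuple given above.

```python
from typing import NamedTuple


class SpanRecord(NamedTuple):
    service: str   # service that executed the span
    start: float   # start time (hypothetical units)
    end: float     # end time in the same units


def root_span(spans):
    """The span that starts first is treated as the root span."""
    return min(spans, key=lambda s: s.start)


def sequence_type(spans):
    """Trace type as the ordered tuple of services, sorted by start time."""
    return tuple(s.service for s in sorted(spans, key=lambda s: s.start))


def duration(spans):
    """Duration D: time from the earliest start to the latest end."""
    return max(s.end for s in spans) - min(s.start for s in spans)


# Hypothetical timings; only the ordering follows the example in the text.
example = [
    SpanRecord("Service1", 0.0, 10.0),
    SpanRecord("Service1", 1.0, 6.0),
    SpanRecord("Service2", 1.5, 5.5),
    SpanRecord("Service3", 2.0, 3.0),
    SpanRecord("Service4", 3.5, 5.0),
    SpanRecord("Service1", 6.5, 9.0),
    SpanRecord("Service5", 7.0, 8.5),
]

tt = root_span(example).service   # label standing in for the trace type TT
seq = sequence_type(example)      # ordered-sequence trace type
d = duration(example)             # duration D of the trace
```

Classifying by root span groups together all traces that begin with the same operation, while classifying by ordered sequence distinguishes traces that begin the same way but follow different paths through the services.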

Modern distributed applications generate enormous numbers of traces each day. For example, a shopping website may be accessed and used hundreds of thousands of times each day, resulting in storage of hundreds of thousands of corresponding traces in a data storage device. Many of the traces may be nearly identical and correspond to nearly identical operations performed by an application. Traces that correspond to normal operations performed by an application are identified as normal traces. On the other hand, erroneous traces, which are used to troubleshoot performance of the application and identify a root cause of a problem with the application, are often produced with a much lower frequency than normal traces.

FIGS. 15A-15B show examples of erroneous traces for the distributed application represented in FIG. 14A. In FIG. 15A, dashed line bars 1501-1504 represent normal spans for services provided by Service₁, Service₂, Service₄, and Service₅ as represented by spans 1515, 1518, 1512, and 1520 in FIG. 14B. Spans 1506 and 1508 represent shortened spans for Service₂ and Service₄. No spans are present for Service₁ and Service₅, as indicated by dashed bars 1503 and 1504, indicating that the application components that perform Service₁ and Service₅ failed to execute. The trace illustrated in FIG. 15A is identified as an erroneous trace. In FIG. 15B, a latency pushes the spans 1512 and 1520 associated with executing corresponding Service₁ and Service₅ to later times.

Application traces may be assigned status codes that indicate whether execution of a particular operation or response to a client request by a corresponding application is a success or a failure. In one implementation, erroneous traces may be identified by corresponding HTTP (“hyper-text transfer protocol”) status codes. HTTP is a protocol used to transfer data over the World Wide Web and is part of an Internet protocol suite that defines commands and services used for transmitting webpage data. In one implementation, traces are assigned, or tagged with, HTTP status codes that indicate the status of specific HTTP requests associated with execution of an application. In particular, traces tagged with client error status codes 4XX (i.e., the request contains bad syntax or cannot be fulfilled) and server error status codes 5XX (i.e., the server failed to fulfil an apparently valid request), where each X represents a digit, are erroneous traces. For example, when data has been successfully transmitted by the application, or application components, to a client or vice versa, the corresponding trace may be tagged with the HTTP status code 200, indicating a success, and the trace is identified as a normal trace. On the other hand, when data has not been successfully transmitted between the application and a client or between application components, the corresponding trace is an erroneous trace that is tagged with the HTTP status code 400, indicating a failed or bad request.
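As a minimal sketch of the HTTP-based tagging described above, the following Python functions map a status code to a normal or erroneous designation. The function names are hypothetical, and the labels "norm" and "err" are assumed here as simple binary designations.

```python
def is_erroneous_http(status_code: int) -> bool:
    """Return True for 4XX (client error) and 5XX (server error) status codes."""
    return 400 <= status_code <= 599


def tag_trace(status_code: int) -> str:
    """Return 'err' for an erroneous trace and 'norm' for a normal trace."""
    return "err" if is_erroneous_http(status_code) else "norm"
```

For instance, a trace tagged with status code 200 would be designated "norm" and a trace tagged with status code 400 would be designated "err" under these assumptions.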

In another implementation, when hardware and/or a network used by an application experiences particular failures, user-defined status codes may be used to tag corresponding traces as erroneous traces. For example, if CPU usage or memory usage spikes or drops below a threshold while an application is executing, a corresponding trace may be tagged as erroneous. In another example, when data packets are dropped by one or more VMs executing application components of an application, a corresponding trace may be tagged as an erroneous trace.

In another implementation, user-defined status codes may be used to tag spans of traces. An erroneous trace contains one or more spans that have been tagged with an error. For example, spans 1506 and 1508 in FIG. 15A are tagged with error_tag=TRUE. As a result, the trace represented by FIG. 15A is tagged as an erroneous trace. In another example, the latency-shifted spans 1512 and 1520 in FIG. 15B are tagged with error_tag=TRUE. As a result, the trace represented by FIG. 15B is tagged as an erroneous trace. Alternatively, when a distributed application performs without issue, as described above with reference to the example in FIG. 14B, the corresponding spans may be tagged with error_tag=FALSE and the trace is tagged as a normal trace.
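The span-level rule above, in which a trace is erroneous when at least one of its spans carries an error tag, may be sketched as follows. The dictionary layout and the function name are assumptions made for illustration; only the `error_tag` key comes from the description above.

```python
def is_erroneous_trace(span_tags):
    """span_tags: iterable of dicts, one per span, each with a boolean 'error_tag'.

    A span without an 'error_tag' entry is treated as not tagged with an error
    (an assumption of this sketch).
    """
    return any(tags.get("error_tag", False) for tags in span_tags)
```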

Erroneous traces of an application tend to have shorter or longer durations than the typical trace duration. For example, during typical execution of an application a corresponding trace has a duration, D, that falls between lower and upper limits denoted by D_l<D<D_u, where D_l is a lower time limit and D_u is an upper time limit. When D≤D_l or D_u≤D, performance of the application is abnormal, and the corresponding trace is identified as an erroneous trace. The upper and lower time limits may be the upper and lower thresholds of a histogram constructed as described below with reference to Equations (6a) and (6b) under histogram creation.

In recent years, application management tools have been developed to apply different sampling procedures that reduce the amount of storage dedicated to storing traces. The sampling procedures include rate-based sampling and duration-based sampling. Rate-based sampling, also called “probabilistic sampling,” stores a fixed percentage of the generated traces. Duration-based sampling stores traces with durations that are greater than a predefined threshold. However, these conventional sampling procedures fail to distinguish the different trace types and durations during sampling, which leads to information distortion. Information distortion occurs when infrequently occurring traces are not included in the sampled traces. For example, conventional trace sampling procedures fail to consider the frequencies of different trace types and trace durations. Erroneous traces are often infrequently generated and contain information that is useful in troubleshooting a performance problem with an application. Because conventional sampling procedures do not make a distinction between high and low frequency generated trace types and trace durations, there is a risk that sampled traces obtained using conventional sampling procedures will not contain any, or not contain a sufficient representation, of erroneous traces, resulting in a loss of potentially important information needed in troubleshooting performance of an application. As a result, troubleshooting performance problems without a sufficient representation of erroneous traces leads to inaccurate representation of a performance problem and misleads troubleshooting algorithms and system administrators in detecting the root cause of the performance problem. One approach is to use error-based sampling, which stores only erroneous traces. However, in certain situations the number of erroneous traces far exceeds the number of normal application traces, which eventually leads to the same problem of not having enough storage space available for normal and erroneous traces.

Computer-implemented methods and systems described below perform intelligent sampling of normal and erroneous traces. The traces are generated for an application. The sampling and compression described below may be performed in real time on a stream of traces or performed on traces read from a trace database. Computer-implemented intelligent sampling described below stores enough normal and erroneous traces across the different trace types and different durations regardless of frequency to enable accurate troubleshooting of performance problems without information distortion created by conventional sampling procedures. In particular, computer-implemented intelligent sampling methods and systems described below generate different sampling rates for normal and erroneous traces. The sampling rates for low frequency normal and erroneous traces are larger than the sampling rates for high frequency normal and erroneous traces, which ensures that low frequency traces are sampled in sufficient numbers and are not passed over during sampling. Troubleshooting and root cause analysis are applied to the sampled erroneous traces to identify the source of performance problems with the application and the application components. Computer-implemented methods and systems may then employ remedial measures to correct the performance problems. For example, VMs or containers executing application components may be migrated to different hosts to increase performance. Additional VMs or containers may be started to alleviate the workloads on already existing VMs and containers. Network bandwidth may be increased to reduce latency between peer VMs.

FIG. 16A shows an example graphical-user interface (“GUI”) 1600 that enables a user to select an application and input sampling rates for sampling normal and erroneous traces of the selected application. A user, such as a system administrator or application owner, selects an application from a list of applications provided in window 1602. For example, highlighted entry 1604 indicates a user has selected “Application 6” by clicking on the application name with the cursor. The example GUI 1600 includes two ways the user may input a sampling rate. The sampling rate is the percentage (i.e., fraction) of normal and erroneous traces that are to be sampled from a set of trace data and stored in a data storage device. When trace types and durations are known, a user may choose to sample by trace type and/or trace duration by clicking on button 1606. When sampling does not depend on trace type or trace durations, a user may choose to sample based on normal and erroneous status of the traces alone by clicking on button 1608. After clicking on button 1606 or button 1608, the user clicks on button 1609 and selects one of the preset sampling levels identified as “conservative” 1610, “aggressive” 1611, and “super aggressive” 1613. Conservative, aggressive, and super aggressive sampling rates correspond to different fractions of traces sampled from runtime traces or a database of traces for an application. For example, a conservative sampling rate is used to sample and store a larger number of traces than an aggressive sampling rate, and an aggressive sampling rate is used to sample and store a larger number of traces than a super aggressive sampling rate. In this example, conservative, aggressive, and super aggressive sampling rates are preset and correspond to storing 15%, 10%, and 5% of the traces in the data storage device, respectively. Rather than using one of the preset sampling rates, the user may also choose to input a sampling level by clicking on button 1614 and entering a sampling level (i.e., sample rate) in field 1616. When a user would like to set an overall sampling rate of the set of trace data and set an erroneous trace sampling rate, instead of clicking on either button 1606 or 1608, the user clicks on button 1618 and enters an overall sampling rate in field 1620 and an erroneous trace sampling rate in field 1622. A user may choose to sample the set of trace data stored in a database of traces by clicking on button 1624 and entering a location of the database in field 1626. Alternatively, the user may choose to sample runtime traces of “Application 6” as the traces are generated by clicking on button 1628. When a user clicks on the “Execute sampling” button 1630, sampling is executed on the traces in accordance with the user's selections as described below.

Computer-implemented methods and systems for intelligent sampling of application traces described below are encoded in machine-readable instructions that are executed in a computer system, such as a server computer. FIG. 16B shows an example of a computer system 1632 that executes machine-readable instructions for sampling traces produced by an application 1634. The traces may be sent directly from the application to the computer system 1632 as indicated by directional arrow 1636. Alternatively, the traces may be stored in a trace database 1638, as indicated by directional arrow 1640, and the computer system 1632 reads the traces from the database 1638 as indicated by directional arrow 1642. The computer system 1632 applies the user-selected sampling rate to the traces as described below and stores sampled normal traces and erroneous traces in the data storage device 1644, thereby reducing the overall number of traces. Troubleshooting a performance problem with the application 1634 is performed on the sampled erroneous traces stored in the data storage device 1644. Troubleshooting is a systematic technique in which the erroneous traces are used to identify a performance problem with the application 1634 or a performance problem with the hardware or network of a distributed computing system used to run the application 1634. When a performance problem has been identified, computer-implemented methods and systems execute remedial measures to correct the problem.

Sampling Known Trace Types and Durations with Normal and Erroneous Traces

Computer-implemented methods described below perform three different processes for sampling normal and erroneous traces with known trace types and durations. One process performs trace-type sampling of normal and erroneous traces based on frequencies of trace types. A second process performs sampling of erroneous and normal traces based on durations of traces independent of the trace type. A third process performs a hybrid trace-type and duration sampling of normal and erroneous traces. Each process is described separately below.

Trace-Type Sampling of Known Trace Types with Normal and Erroneous Traces

FIG. 17 shows an example set of trace data 1702 generated by an application. The trace data 1702 may be stored in a trace database or sent directly to the computer system 1632 and stored in a buffer. Each row represents information associated with a trace. Each trace is assigned a trace identification (“ID”). Column 1704 is a list of trace IDs assigned to the traces in the trace data. Column 1706 is a list of durations of the traces. Column 1708 lists the trace type (i.e., root span). Columns 1710 list K different services or functions as described above with reference to FIG. 14B. Each entry in columns 1710 contains a span-tuple (span_name(k), t_s, t_e), where span_name(k) is a span name, t_s is a start time of the span, and t_e is an end time of the span. Entries in column 1711 record the status codes of the traces. The status codes indicate whether a trace is a normal trace, denoted by “norm,” or an erroneous trace, denoted by “err.” In one implementation, the status codes may be simple binary norm and err designations. In another implementation, the status codes for erroneous traces may contain more information, such as HTTP status codes or user-defined status codes. For example, trace 1712 has a trace ID “ef167gp7,” a trace duration of 00:12.46 seconds, a trace type “Service₁,” which is the name of the root span of the trace, and an “err” status code indicating that the trace 1712 is an erroneous trace.

The traces recorded in a set of trace data are sorted into groups of traces with the same trace type independent of trace durations and status code. The number of traces in each group of traces is counted. The traces of each trace type are partitioned into normal traces and erroneous traces. For each trace type, a normal trace-type sampling rate is determined for the normal traces and an erroneous trace-type sampling rate is determined for the erroneous traces. Suppose a set of trace data contains N traces with M different trace types (i.e., M≤N). Let N_m be the number of traces with the m-th trace type, where index m=1, . . . , M. Let N_n^(m) be the number of normal traces in the group of m-th trace types and N_e^(m) be the number of erroneous traces in the group of m-th trace types, where N_m=N_e^(m)+N_n^(m). A frequency of occurrence of normal traces of the m-th trace type is

$\begin{matrix} p_{n}^{(m)} = \frac{N_{n}^{(m)}}{N_{m}} & (1 a) \end{matrix}$

and frequency of occurrence of erroneous traces in the m-th trace type is

$\begin{matrix} p_{e}^{(m)} = \frac{N_{e}^{(m)}}{N_{m}} & (1 b) \end{matrix}$

The normal trace-type sampling rate of each of the normal traces of the m-th trace type is

$\begin{matrix} h_{n}^{(m)} = 1 - \left( p_{n}^{(m)} \right)^{\beta_{n}} & (2 a) \end{matrix}$

where 0≤β_n is called the “normal trace-type sampling parameter,”

and the erroneous trace-type sampling rate of each of the erroneous traces of the m-th trace type is

$\begin{matrix} h_{e}^{(m)} = 1 - \left( p_{e}^{(m)} \right)^{\beta_{e}} & (2 b) \end{matrix}$

where 0≤β_e is called the “erroneous trace-type sampling parameter.”

The normal trace-type sampling rate decreases as the frequency of occurrence of normal traces with the m-th trace type increases. Similarly, the erroneous trace-type sampling rate decreases as the frequency of occurrence of erroneous traces with the m-th trace type increases. Each trace type has associated sampling rates represented by Equations (2a) and (2b). The normal trace-type sampling rate in Equation (2a) is the fraction of normal traces that belong to the m-th trace type and are sampled and stored in a data storage device. The erroneous trace-type sampling rate in Equation (2b) is the fraction of erroneous traces that belong to the m-th trace type and are sampled and stored in a data storage device.

Returning to FIG. 17, the traces of the set of trace data 1702 are sorted according to trace types to obtain sorted trace types 1714. For the sake of illustration, the durations and span information are omitted. The trace types are denoted by TT_m, where m=1, . . . , M, and M is the number of different trace types in the trace data. Traces of the same trace type may be normal or erroneous traces. For example, normal trace 1716 is a TT₂ trace type with a normal status code “norm” 1718. By contrast, erroneous trace 1720 is also a TT₂ trace type with an erroneous status code “err” 1722.

The trace-type sampling parameters β_n and β_e correspond to the amounts of normal and erroneous traces sampled and are determined based on the user-selected sampling rate as described below. Note that in one implementation β_n≠β_e and in another implementation β_n=β_e. For example, “conservative” sampling corresponds to β=1, “aggressive” sampling corresponds to β=0.5, and “super aggressive” sampling corresponds to β=0.25, where β represents β_n and β_e.

The number of normal traces of the m-th trace type stored in the data storage device is given by:

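$\begin{matrix} \overline{N}_{n}^{(m)} = N_{n}^{(m)} \times h_{n}^{(m)} & (3 a) \end{matrix}$

The number of erroneous traces of the m-th trace type stored in the data storage device is given by

$\begin{matrix} \overline{N}_{e}^{(m)} = N_{e}^{(m)} \times h_{e}^{(m)} & (3 b) \end{matrix}$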

The numbers of sampled traces N̄_n^(m) and N̄_e^(m) are rounded to the nearest integer. The N̄_n^(m) sampled normal traces are randomly sampled from the N_n^(m) normal traces and are stored in a data storage device as described below. The remaining unsampled normal traces of the m-th trace type (i.e., N_n,rem=N_n^(m)−N̄_n^(m)) are discarded by deleting the remaining normal traces from the data storage device or from a buffer where traces are temporarily stored during sampling. The N̄_e^(m) sampled erroneous traces are randomly sampled from the N_e^(m) erroneous traces and are stored in a data storage device as described below. The remaining unsampled erroneous traces of the m-th trace type (i.e., N_e,rem=N_e^(m)−N̄_e^(m)) are discarded by deleting the remaining erroneous traces from the data storage device or from a buffer where traces are temporarily stored during sampling.
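The trace-type sampling steps described above (grouping by trace type, partitioning by status, computing frequencies, sampling rates, and sample counts, and then sampling at random) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the function name, the tuple layout `(trace_id, trace_type, status)`, and the `seed` parameter are hypothetical, and the erroneous sampling rate is assumed to be computed from the erroneous frequency of occurrence of Equation (1b).

```python
import random
from collections import defaultdict


def trace_type_sample(traces, beta_n=1.0, beta_e=1.0, seed=0):
    """traces: list of (trace_id, trace_type, status), status is 'norm' or 'err'."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible

    # Group traces by trace type, then partition each group by status code.
    groups = defaultdict(lambda: {"norm": [], "err": []})
    for trace in traces:
        groups[trace[1]][trace[2]].append(trace)

    sampled = []
    for by_status in groups.values():
        n_m = len(by_status["norm"]) + len(by_status["err"])
        for status, beta in (("norm", beta_n), ("err", beta_e)):
            members = by_status[status]
            if not members:
                continue
            p = len(members) / n_m   # frequency of occurrence, Equations (1a)/(1b)
            h = 1.0 - p ** beta      # sampling rate, Equations (2a)/(2b)
            # Number of traces to keep. Python's round() resolves exact .5 ties
            # to the nearest even integer.
            keep = round(len(members) * h)
            sampled.extend(rng.sample(members, keep))
    return sampled
```

Traces not returned by the function correspond to the unsampled traces that are discarded. A streaming variant would have to estimate the frequencies of occurrence over a window of traces rather than over a complete set, which this sketch does not attempt.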

FIG. 18 shows an example calculation of the number of traces sampled from a set of trace data 1802 already sorted according to trace type. The set of trace data comprises N traces and M different groups of traces where the traces in each group are the same trace type. For example, group of traces 1804 has N_m traces with the same trace type TT_m. For each group of traces, the traces are partitioned into normal traces and erroneous traces based on the corresponding status codes. In FIG. 18, the group of traces 1804 is partitioned according to status codes into normal traces 1806 and erroneous traces 1808. There are N_n^(m) normal traces 1806 and N_e^(m) erroneous traces 1808. Frequencies of occurrence 1810 and 1812 are computed as described above with reference to Equations (1a) and (1b) for the normal and erroneous traces 1806 and 1808, respectively. Sampling rates 1814 and 1816 are computed as described above with reference to Equations (2a) and (2b) for the normal and erroneous traces 1806 and 1808, respectively. The normal trace-type sampling rate 1814 is used to sample and store N̄_n^(m) of the normal traces 1806. The erroneous trace-type sampling rate 1816 is used to sample and store N̄_e^(m) of the erroneous traces 1808.

The trace-type sampling rates represented by Equations (2a) and (2b) ensure that rarely occurring normal and erroneous trace types are sampled at higher sampling rates than more frequently occurring normal and erroneous trace types. Suppose the m-th trace type contains 1,000 traces (i.e., N_m=1,000) with 145 erroneous traces (i.e., N_e^(m)=145) and 855 normal traces (i.e., N_n^(m)=855). The frequency of occurrence of the erroneous traces of trace type TT_m is p_e=0.145 and the frequency of occurrence of the normal traces of trace type TT_m is p_n=0.855. The following table shows the normal and erroneous trace-type sampling rates using the same value for the sampling parameter (i.e., β=β_e=β_n):

Table of Normal and Erroneous Trace-type Sampling Rates

| Status Code | Conservative (β = 1) | Aggressive (β = 0.5) | Sup. Agg. (β = 0.25) |
|---|---|---|---|
| normal | 0.145 | 0.075 | 0.038 |
| erroneous | 0.855 | 0.619 | 0.383 |

The entries in the above table show that as the sampling parameter decreases, the sampling rates also decrease. Note also that the less frequently occurring erroneous traces are sampled with larger sampling rates than the more frequently occurring normal traces across the conservative, aggressive, and super aggressive sampling rates.

In an alternative implementation, the erroneous trace types may be further partitioned based on the types of status codes, such as HTTP error status codes or user-defined error status codes described above. A frequency of occurrence of erroneous traces with error status code u in the m-th trace type is

$\begin{matrix} p_{e, u}^{(m)} = \frac{N_{e, u}^{(m)}}{N_{m}} & (4 a) \end{matrix}$

where

- subscript u denotes a particular error status code;
- u=1, . . . , U; and
- U is the total number of error status codes.

For example, error status code u may represent one of the HTTP error status codes 4XX and 5XX or a user-defined error status code. The erroneous trace-type sampling rate of the error status code u is given by

$\begin{matrix} h_{e, u}^{(m)} = 1 - \left( p_{e, u}^{(m)} \right)^{\beta_{e}} & (4 b) \end{matrix}$

The number of erroneous traces of the m-th trace type with error status code u that are sampled and stored in the data storage device is given by

$\begin{matrix} \overline{N}_{e, u}^{(m)} = N_{e, u}^{(m)} \times h_{e, u}^{(m)} & (4 c) \end{matrix}$

The sampling rate represented by Equation (4b) ensures that rarely occurring erroneous traces are sampled at higher sampling rates than more frequently occurring erroneous traces.
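The per-status-code computation of Equations (4a) through (4c) can be sketched as follows for a single trace type. The function name and argument layout are hypothetical and chosen only for illustration.

```python
from collections import Counter


def error_code_sample_counts(error_codes, n_m, beta_e=1.0):
    """Return the number of erroneous traces to keep for each error status code.

    error_codes: status codes of the erroneous traces of one trace type.
    n_m: total number of traces (normal and erroneous) of that trace type.
    """
    keep = {}
    for code, n_eu in Counter(error_codes).items():
        p = n_eu / n_m               # Equation (4a)
        h = 1.0 - p ** beta_e        # Equation (4b)
        keep[code] = round(n_eu * h)  # Equation (4c), rounded to an integer
    return keep


# Illustrative input: 100 traces with code 404 and 10 with code 500 out of 1,000.
counts = error_code_sample_counts([404] * 100 + [500] * 10, n_m=1000)
```

With this illustrative input, the rarer status code is sampled at a rate of 0.99 while the more common status code is sampled at a rate of 0.9.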

FIG. 19 shows an example of the erroneous traces 1808 in FIG. 18 partitioned into U sets of erroneous traces based on U error status codes. For example, erroneous traces 1901 have error status code err₁, erroneous traces 1902 have error status code err_u, and erroneous traces 1903 have error status code err_U. Erroneous trace-type sampling rates 1904-1906 are computed for corresponding sets of erroneous traces 1901-1903, as described above with reference to Equations (4a) and (4b). The erroneous trace-type sampling rates 1904-1906 are used to obtain N̄_e,1 erroneous traces 1908, N̄_e,u erroneous traces 1909, and N̄_e,U erroneous traces 1910.

For each trace type, the normal and erroneous traces have separate compression ratios and compression rates. A modified Gini index for the fraction of normal traces sampled from the set of trace data across the M different trace types is given by

$G_{n}^{(\beta)} = \frac{{\overline{N}}_{n}}{N} \text{ where } {\overline{N}}_{n} = \sum_{m = 1}^{M} {\overline{N}}_{n}^{(m)}$

The compression rate across normal traces with different trace types is given by

C_n^(β)=1−G_n^(β) (5a)

A modified Gini index for the fraction of erroneous traces sampled from the set of trace data across the M different trace types is given by

$G_{e}^{(\beta)} = \frac{{\overline{N}}_{e}}{N} \text{ where } {\overline{N}}_{e} = \sum_{m = 1}^{M} {\overline{N}}_{e}^{(m)}$

The compression rate across erroneous traces with different trace types is given by

C_e^(β)=1−G_e^(β) (5b)

A modified Gini index for the fraction of normal and erroneous traces sampled from the set of trace data across the M different trace types is given by

$G^{(\beta)} = \frac{\overline{N}}{N}$

where N̄=N̄_e+N̄_n.

The compression rate is given by

C^(β)=1−G^(β) (5c)

Diversity of frequencies of occurrence may be measured by the modified Gini index. For example, trace-type sampling may be selected when the modified Gini index satisfies the following condition:

G^(β)≤Th_G (5d)

where

- G^(β) represents G_e^(β) or G_n^(β);
- β_e=β_n=β; and
- Th_G is a modified Gini index threshold (e.g., Th_G=0.1, 0.05, or 0.01).

When the condition given in Equation (5d) is not satisfied, trace-type information is not adequate for investigating performance of an application.
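The modified Gini index and the compression rates of Equations (5a) through (5c) reduce to simple ratios, as the following sketch shows. The function names are hypothetical, and the per-trace-type sample counts are assumed to have been computed already.

```python
def modified_gini(sampled_counts, total_traces):
    """Fraction of the N traces that are sampled across all trace types.

    sampled_counts: per-trace-type numbers of sampled traces.
    total_traces: N, the total number of traces in the set of trace data.
    """
    return sum(sampled_counts) / total_traces


def compression_rate(sampled_counts, total_traces):
    """Compression rate, i.e., one minus the modified Gini index."""
    return 1.0 - modified_gini(sampled_counts, total_traces)


def trace_type_sampling_selected(gini, threshold=0.1):
    """Condition (5d): select trace-type sampling when the index is at most Th_G."""
    return gini <= threshold
```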

Duration Sampling of Normal and Erroneous Traces

Computer-implemented methods perform duration sampling on trace durations independent of the trace type. Erroneous traces usually have short durations or long durations. Traces of the trace data are sorted based on duration. For example, the traces may be sorted from shortest (longest) duration to longest (shortest) duration. The duration-sampling rates described below are used to separately sample normal and erroneous traces in corresponding bins of a histogram, where each bin corresponds to a time interval.

FIG. 20 shows an example of the set of trace data 1702 sorted according to trace durations to obtain duration-sorted traces 2002. For the sake of illustration, trace types, span information, and status information are omitted. The durations are denoted by D_n, where n=1, . . . , N. In this example, the traces are sorted from longest duration to shortest duration with D₁ representing the longest trace duration and D_N representing the shortest duration.

Computer-implemented methods compute upper and lower thresholds for distinguishing normal traces from erroneous traces of the duration-sorted traces. Traces with durations between the upper and lower thresholds are identified as normal traces. Traces with durations that are greater than the upper threshold or less than the lower threshold are identified as erroneous traces. A histogram is constructed for the traces, with normal traces having durations that fall between the lower and upper thresholds and erroneous traces having durations that are less than the lower threshold or greater than the upper threshold.

Upper and lower quantiles are used to partition the duration-sorted traces into three groups of traces. The upper and lower quantiles are given by

M(upper)=q_1−s

M(lower)=q_s

where 0≤s≤1 (e.g., s=0.05 or s=0.1).

The lower quantile q_s is a time that partitions the duration-sorted traces such that a fraction s of the traces have durations that are less than or equal to the quantile q_s. The upper quantile q_1−s is a time that partitions the duration-sorted traces such that a fraction s of the traces have durations that are greater than or equal to the quantile q_1−s. For example, if s=0.1, the lower quantile q_0.1 denotes a time that partitions the duration-sorted traces such that 10% of the traces have durations that are less than or equal to q_0.1 and the upper quantile q_0.9 denotes a time that partitions the duration-sorted traces such that 10% of the traces have durations that are greater than or equal to q_0.9. Upper distances are computed for traces with durations that are greater than or equal to the upper quantile by

dist(upper)=|data(upper)−M(upper)| (6a)

and lower distances are computed for traces with durations that are less than or equal to the lower quantile by

dist(lower)=|data(lower)−M(lower)| (6b)

where

data(upper) represents a trace duration that is greater than or equal to M(upper); and

data(lower) represents a trace duration that is less than or equal to M(lower).

A mean average deviation (“MAD”) is computed for the set of upper distances and is denoted by MAD (upper). A MAD is computed for the set of lower distances and is denoted by MAD (lower). Upper and lower thresholds for the duration-sorted traces are computed as follows:

Th_upper=min(M(upper)+Γ×MAD(upper), max(duration)) (7a)

and

Th_lower=max(M(lower)−Γ×MAD(lower), min(duration)) (7b)

where

- 0<Γ<1 (e.g., Γ=0.25, 0.20, or 0.30);
- max(duration) is the maximum trace duration; and
- min(duration) is the minimum trace duration.

A trace duration D_n is identified as an outlier if the trace duration satisfies one of the following conditions:

D_n>Th_upper (8a)

D_n<Th_lower (8b)

A histogram is constructed from traces with durations that satisfy the following condition:

Th_upper≥D_n≥Th_lower (8c)
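The threshold computation of Equations (6a) through (7b) can be sketched as follows. Several choices in this sketch are assumptions not fixed by the description above: the quantile is taken as an element of the sorted list at a rounded index, MAD is interpreted as the mean absolute deviation about the mean, and the function names are hypothetical.

```python
import statistics


def quantile(sorted_values, q):
    """Quantile taken at a rounded index of an ascending list (assumed convention)."""
    return sorted_values[round(q * (len(sorted_values) - 1))]


def mean_abs_deviation(values):
    """Mean absolute deviation about the mean (assumed interpretation of MAD)."""
    mean = statistics.fmean(values)
    return statistics.fmean([abs(v - mean) for v in values])


def duration_thresholds(durations, s=0.1, gamma=0.25):
    """Return (Th_lower, Th_upper) for a collection of trace durations."""
    data = sorted(durations)
    m_lower = quantile(data, s)
    m_upper = quantile(data, 1.0 - s)
    upper_dist = [abs(d - m_upper) for d in data if d >= m_upper]  # Equation (6a)
    lower_dist = [abs(d - m_lower) for d in data if d <= m_lower]  # Equation (6b)
    th_upper = min(m_upper + gamma * mean_abs_deviation(upper_dist), data[-1])  # (7a)
    th_lower = max(m_lower - gamma * mean_abs_deviation(lower_dist), data[0])   # (7b)
    return th_lower, th_upper


# Illustrative durations of 1, 2, ..., 100 seconds.
th_lower, th_upper = duration_thresholds([float(d) for d in range(1, 101)])
```

Durations above `th_upper` or below `th_lower` would then be treated as outliers, and the remaining durations would be used to construct the histogram.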

FIG. 21 shows an example of partitioning duration-sorted traces 2102. Directional arrow 2104 represents increasing durations of the traces, with trace 2106 having the maximum duration max(duration) and trace 2108 having the minimum duration min(duration). Mark 2110 represents an upper quantile q_1−s. Mark 2112 represents a lower quantile q_s. The quantiles q_1−s and q_s partition the duration-sorted traces 2102 into three groups of traces. The first group comprises the fraction s of the traces with durations greater than or equal to q_1−s. The second group comprises the fraction s of the traces with durations less than or equal to q_s. The third group comprises the remaining fraction 1−2s of the traces with durations between q_s and q_1−s. Distances are calculated for traces 2114 according to Equation (6b). For example, directional arrow 2116 represents a distance between trace duration 2118 and the lower quantile q_s 2112. Distances are calculated for traces 2120 according to Equation (6a). For example, directional arrow 2122 represents a distance between trace duration 2124 and the upper quantile q_1−s 2110. MAD(lower) is computed for the distances associated with traces 2114. MAD(upper) is computed for the distances associated with traces 2120. Lower threshold 2126 is computed using Equation (7b). Upper threshold 2128 is computed using Equation (7a). In this example, traces 2130 with durations that are less than the lower threshold 2126 are identified as outliers and traces 2132 with durations that are greater than the upper threshold 2128 are identified as outliers.

Traces with durations that satisfy either of the conditions given by Equations (8a) and (8b) are erroneous traces that lie within the upper interval (Th_upper, max(duration)] and the lower interval [min(duration), Th_lower), respectively. The range of time between the lower and upper thresholds is partitioned into B equal-duration intervals denoted by [c_b−1, c_b) for b=1, . . . , B−1, and [c_B−1, c_B], where c₀=Th_lower and c_B=Th_upper. Each bin of the histogram corresponds to a time interval. A trace with a duration that satisfies the condition given by Equation (8c) is identified as a normal trace. A normal trace that lies within one of the intervals is assigned to a bin that corresponds to the interval. The number of traces in each bin is counted and denoted by n_b, where b=1, . . . , B. For example, n_b represents the total number of traces in the interval [c_b−1, c_b) and n_B represents the total number of traces in the interval [c_B−1, c_B]. The number of erroneous traces that lie within the lower interval [min(duration), Th_lower) is denoted by n_s, and these traces form a short-duration bin of erroneous traces. The number of traces that lie within the upper interval (Th_upper, max(duration)] is denoted by n_L, and these traces form a long-duration bin of erroneous traces. A histogram of traces is constructed by counting the number of traces in each bin.

FIG. 22 shows an example histogram constructed from a set of trace data. Horizontal axis 2202 represents time. Vertical axis 2204 represents number of traces. Time axis 2202 is partitioned into intervals between a lower threshold 2206 and an upper threshold 2208. In this example, the time axis 2202 includes a minimum duration 2210 and a maximum duration 2212. Unshaded bars represent the number of normal traces with durations that lie within the intervals (i.e., the number of traces that lie within the corresponding bins). For example, FIG. 22 shows a magnified view of intervals 2214 and 2216. Bar 2218 represents the number of traces, n_b, in the interval 2214. Bar 2220 represents the number of traces, n_b+1, in the interval 2216. Shaded bar 2222 represents the number of erroneous traces in the lower interval [min(duration), Th_lower). Shaded bar 2224 represents the number of erroneous traces in the upper interval (Th_upper, max(duration)].

A histogram may also be constructed for the trace durations using the t-digest approach described in “Computing extremely accurate quantiles using t-digests,” T. Dunning et al., arXiv.org, Cornell University, Feb. 11, 2019. Instead of storing the entire set of trace data based on trace durations, t-digest stores only the results of data clustering, such as centroids of clusters and trace counts in each cluster.

A histogram of traces in the B bins is given by

Hist(B)={n_S, n_1, . . . , n_B, n_L}

where

n_b is the number of traces in the b-th bin with durations in the interval [c_b−1, c_b) for b=1, . . . , B−1;

n_B is the number of traces in the B-th bin with durations in the interval [c_B−1, c_B];

n_S is the number of short-duration traces (i.e., erroneous traces) in the interval [min(duration), Th_lower); and

n_L is the number of long-duration traces (i.e., erroneous traces) in the interval (Th_upper, max(duration)].

The frequency of occurrence of traces in the b-th bin of the histogram is given by:

$\begin{matrix} p_{b} = \frac{n_{b}}{N_{H}} & (9) \end{matrix}$

where

$N_{H} = \sum_{b = 1}^{B} n_{b} + n_{S} + n_{L}$

The normal duration-sampling rate for normal traces in the b-th bin is given by

$\begin{matrix} r_{b} = 1 - (p_{b})^{α_{n}} & (10) \end{matrix}$

where α_n≥0 is called the “normal duration sampling parameter.”

The normal duration-sampling rate in Equation (10) is the fraction of traces to be sampled from the b-th bin and stored in a data storage device. The frequency of occurrence of traces in the short-duration bin (i.e., the S-th bin) of the histogram is given by:

$\begin{matrix} p_{S} = \frac{n_{S}}{N_{H}} & (11 a) \end{matrix}$

The frequency of occurrence of traces in the L-th bin of the histogram is given by:

$\begin{matrix} p_{L} = \frac{n_{L}}{N_{H}} & (11 b) \end{matrix}$

The short-duration sampling rate for traces in the S-th bin is given by

$\begin{matrix} h_{S} = 1 - (p_{S})^{α_{e}} & (12 a) \end{matrix}$

and the long-duration sampling rate for traces in the L-th bin is given by

$\begin{matrix} h_{L} = 1 - (p_{L})^{α_{e}} & (12 b) \end{matrix}$

where α_e≥0 is called the “erroneous duration sampling parameter.”

The erroneous duration-sampling rates in Equations (12a) and (12b) are the fractions of traces to be sampled from the corresponding S-th and L-th bins and stored in a data storage device.

Note that in one implementation α_n≠α_e and in another implementation α_n=α_e. The duration-sampling parameter α corresponds to an amount of trace sampling based on the user-selected sampling level described above. For example, “conservative” sampling corresponds to α=1, “aggressive” sampling corresponds to α=0.5, and “super aggressive” sampling corresponds to α=0.25. The duration-sampling parameter α may be selected to provide the user-selected sampling level as described below.

The normal and erroneous duration-sampling rates in Equations (10), (12a), and (12b) may be different for each bin and are inversely related to the frequency of occurrence of the traces in each bin. For example, suppose a histogram comprises 10,000 traces with 460 traces in a bin B₁ (i.e., n₁=460) and 2,035 traces in a bin B₂ (i.e., n₂=2,035). The frequency of occurrence of traces in B₁ is p₁=0.046 and the frequency of occurrence of traces in B₂ is p₂=0.204. The following table shows the duration-sampling rates for the example traces in B₁ and B₂:

Table of Duration-sampling Rates

| Bins | Conservative (α = 1) | Aggressive (α = 0.5) | Super aggressive (α = 0.25) |
| --- | --- | --- | --- |
| B₁ | 0.954 | 0.786 | 0.537 |
| B₂ | 0.796 | 0.548 | 0.328 |

Note that the less frequently occurring traces in the bin B₁ are sampled with a larger duration-sampling rate than the more frequently occurring traces in the bin B₂ across the conservative, aggressive, and super aggressive sampling levels.
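The rates in the table follow directly from the form 1−p^α of Equation (10). The short sketch below recomputes them; the function name `duration_sampling_rate` is hypothetical, and the frequencies are taken as the rounded values 0.046 and 0.204 used in the example above.

```python
def duration_sampling_rate(p, alpha):
    """Sampling rate 1 - p**alpha for a bin with frequency of occurrence p."""
    return 1.0 - p ** alpha


# Sampling levels and their duration-sampling parameters from the text.
levels = {"conservative": 1.0, "aggressive": 0.5, "super aggressive": 0.25}

# Rounded frequencies of occurrence for bins B1 and B2 from the example.
frequencies = {"B1": 0.046, "B2": 0.204}

rates = {
    name: {level: duration_sampling_rate(p, alpha) for level, alpha in levels.items()}
    for name, p in frequencies.items()
}
```

Because p lies between 0 and 1, decreasing α moves p^α toward 1 and lowers the rate, which is why the "super aggressive" level stores the fewest traces.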

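The number of normal traces sampled from the b-th bin and stored in the data storage device is given by:

$\begin{matrix} {\overline{n}}_{b} = n_{b} \times r_{b} & (13) \end{matrix}$

where ${\overline{n}}_{b}$ is rounded to the nearest integer number.

The number of erroneous traces sampled from the S-th and L-th bins and stored in the data storage device is given by:

$\begin{matrix} {\overline{n}}_{S} = n_{S} \times h_{S} & (14 a) \end{matrix}$

$\begin{matrix} {\overline{n}}_{L} = n_{L} \times h_{L} & (14 b) \end{matrix}$

where ${\overline{n}}_{S}$ and ${\overline{n}}_{L}$ are rounded to the nearest integer number.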

The remaining unsampled traces are discarded by deleting the unsampled traces from a data storage device.
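As an illustration of Equations (13), (14a), and (14b), the sketch below converts a sampling rate into a number of traces to keep and then draws that many traces from a bin. The text does not state how individual traces are chosen within a bin, so uniform random selection is an assumption here, and the helper names are hypothetical.

```python
import math
import random


def sampled_count(n_bin, rate):
    """Number of traces to keep: n_bin * rate, rounded to the nearest integer.

    Halves are rounded up; Python's built-in round() would round halves to even.
    """
    return int(math.floor(n_bin * rate + 0.5))


def sample_bin(traces, rate, rng):
    """Keep a uniformly random subset of a bin (assumption: selection is uniform)."""
    return rng.sample(traces, sampled_count(len(traces), rate))


rng = random.Random(0)                      # fixed seed so the sketch is repeatable
bin_traces = [f"trace-{i}" for i in range(460)]
kept = sample_bin(bin_traces, 0.786, rng)   # "aggressive" rate for bin B1
discarded = set(bin_traces) - set(kept)     # unsampled traces to be deleted
```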

Returning to FIG. 22, a frequency of occurrence p_b 2226 is computed for traces with durations in the interval [c_b−1, c_b) 2214. A duration-sampling rate r_b 2228 is computed for traces in the interval [c_b−1, c_b) 2214 (i.e., the corresponding b-th bin). The number of traces sampled from the b-th bin is ${\overline{n}}_{b}$ 2230. The set of sampled normal traces 2232 is obtained by sampling each bin of the traces. A frequency of occurrence p_S 2234 and an erroneous duration sampling rate 2236 are computed for traces with durations in the short-duration interval [min(duration), Th_lower). The number of traces sampled from the short-duration interval is ${\overline{n}}_{S}$ 2238. A frequency of occurrence p_L 2240 and an erroneous duration sampling rate 2242 are computed for traces with durations in the long-duration interval (Th_upper, max(duration)]. The number of traces sampled from the long-duration interval is ${\overline{n}}_{L}$ 2244.

The modified Gini index equals the fraction of traces sampled from the bins. For the normal traces, the modified Gini index is given by

$G_{n}^{(α)} = \frac{{\overline{N}}_{H}}{N_{H}}$

where

${\overline{N}}_{H} = \sum_{b = 1}^{B} {\overline{n}}_{b}$ and $N_{H} = \sum_{b = 1}^{B} n_{b}$

The compression rate across the traces with normal durations is given by

C_n^(α)=1−G_n^(α) (15a)

For the erroneous traces, the modified Gini index is given by

$G_{e}^{(α)} = \frac{{\overline{n}}_{L} + {\overline{n}}_{S}}{n_{L} + n_{S}}$

The compression rate across the traces with erroneous durations is given by

C_e^(α)=1−G_e^(α) (15b)

A modified Gini index for the fraction of normal and erroneous traces sampled from the set of trace data across the M different trace types:

$G^{(α)} = \frac{{\overline{N}}_{H} + {\overline{n}}_{L} + {\overline{n}}_{S}}{N}$

The compression rate across erroneous traces with different trace types is given by

C^(α)=1−G^(α) (15c)

where α_e=α_n=α.
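The modified Gini indices and compression rates of Equations (15a)-(15c) reduce to simple ratios of kept traces to total traces. The sketch below is illustrative only: the function names are hypothetical, and the overall denominator N is assumed to be the total number of normal and erroneous traces.

```python
def ratio(kept, total):
    """Fraction of traces kept; defined as 0.0 for an empty group."""
    return kept / total if total else 0.0


def gini_and_compression(normal_total, normal_kept, error_total, error_kept):
    """Return modified Gini indices (fraction stored) and compression rates (fraction discarded)."""
    g_n = ratio(normal_kept, normal_total)
    g_e = ratio(error_kept, error_total)
    # Assumption: N is the total number of normal plus erroneous traces.
    g = ratio(normal_kept + error_kept, normal_total + error_total)
    return {
        "G_n": g_n, "C_n": 1.0 - g_n,
        "G_e": g_e, "C_e": 1.0 - g_e,
        "G": g, "C": 1.0 - g,
    }


# Toy counts: 400 normal traces with 110 kept, 20 erroneous traces with 14 kept.
summary = gini_and_compression(400, 110, 20, 14)
```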

Hybrid Sampling of Known Trace Types and Durations with Normal and Erroneous Traces

When both trace types and trace durations are important for troubleshooting performance of an application, a hybrid combination of trace-type sampling and duration-based sampling may be applied across different trace types and different trace durations for normal and erroneous traces.

A set of trace data is sorted into different trace types as described above with reference to FIG. 17. The traces of each trace type are partitioned into normal traces and erroneous traces as described above with reference to FIG. 18. Normal and erroneous trace-type sampling rates are computed for each of the M trace types. For the normal traces with the m-th trace type, a frequency of occurrence of normal traces p_n^(m) is computed as described above with reference to Equation (1a) and FIG. 18. For the erroneous traces, a frequency of occurrence of erroneous traces p_e^(m) is computed as described above with reference to Equation (1b) and FIG. 18. Separate histograms are constructed for the normal traces and for the erroneous traces as described above with reference to FIGS. 20 and 21. A frequency of occurrence is computed for normal traces in each bin of a histogram of the normal traces of the m-th trace type as follows:

$\begin{matrix} p_{n, b}^{(m)} = \frac{n_{n, b}^{(m)}}{N_{n, H}^{(m)}} & (16 a) \end{matrix}$

where

- subscript n denotes normal traces;
- b=1, . . . , B;
- n_n,b^(m) is the number of normal traces with the m-th trace type in the b-th bin; and

$N_{n, H}^{(m)} = \sum_{b = 1}^{B} n_{n, b}^{(m)}$

A frequency of occurrence is computed for erroneous traces in each bin of a histogram of erroneous traces of the m-th trace type as follows:

$\begin{matrix} p_{e, b}^{(m)} = \frac{n_{e, b}^{(m)}}{N_{e, H}^{(m)}} & (16 b) \end{matrix}$

where

- subscript e denotes erroneous traces; and
- n_e,b^(m) is the number of erroneous traces with the m-th trace type in the b-th bin; and

$N_{e, H}^{(m)} = \sum_{b = 1}^{B} n_{e, b}^{(m)}$

FIG. 23 shows an example set of normal traces 2302 and an example set of erroneous traces 2304 for the same m-th trace type. Traces in the sets 2302 and 2304 have the same trace type TT_m. The traces in each set of normal and erroneous traces are sorted according to trace duration as described above with reference to FIG. 20. Trace durations of erroneous traces are denoted by D_N_e^(m), where N_e^(m) 2306 is the number of erroneous traces of the m-th trace type. A frequency of occurrence of the traces p_e^(m) 2308 is determined for the set 2302. FIG. 23 shows an example erroneous trace histogram 2310 constructed from the set of erroneous traces 2302. The range of time is partitioned into B equal-duration intervals denoted by [c_b−1, c_b), for b=1, . . . , B−1, and [c_B−1, c_B] as described above with reference to FIG. 22. The number of erroneous traces in each interval (i.e., bin) is counted and denoted by n_e,b^(m). A set of frequencies of occurrence 2312 is computed for the bins, as described above with reference to Equation (16b), to obtain frequencies of occurrence of erroneous traces of the m-th trace type. Trace durations of normal traces are denoted by D_N_n^(m), where N_n^(m) 2314 is the number of normal traces of the m-th trace type. A frequency of occurrence of the traces p_n^(m) 2316 is determined for the set 2304. A set of frequencies of occurrence 2318 is computed for the bins of a normal trace histogram constructed for the set of normal traces 2304, as described above with reference to Equation (16a), to obtain frequencies of occurrence of normal traces of the m-th trace type.

A normal hybrid sampling rate for each bin of the m-th set of normal traces is given by

$\begin{matrix} h_{n, b}^{(m)} = 1 - (p_{n}^{(m)})^{β_{n}} (p_{n, b}^{(m)})^{α_{n}} & (17 a) \end{matrix}$

where α_n≥0 and β_n≥0 are normal trace sampling parameters.

The normal hybrid-sampling rate in Equation (17a) may be different for each bin of each group of traces and is inversely related to the frequency of occurrence of the traces in each bin and each group of traces.

An erroneous hybrid sampling rate for each bin of the m-th set of erroneous traces is given by

$\begin{matrix} h_{e, b}^{(m)} = 1 - (p_{e}^{(m)})^{β_{e}} (p_{e, b}^{(m)})^{α_{e}} & (17 b) \end{matrix}$

where α_e≥0 and β_e≥0 are erroneous trace sampling parameters.

The erroneous hybrid-sampling rate in Equation (17b) may be different for each bin of each group of traces and is inversely related to the frequency of occurrence of the traces in each bin and each group of traces.

There are many ways in which the sampling parameters in Equations (17a) and (17b) may be selected for sampling. In one implementation, α_e=β_e and α_n=β_n, but α_e≠α_n. In another implementation, α_e=α_n and β_e=β_n, but α_e≠β_e. In another implementation, the sampling parameters are the same with α_e=β_e=α_n=β_n. In still another implementation, the sampling parameters are different with α_e≠β_e≠α_n≠β_n.

The number of normal traces sampled from the b-th bin and stored in the data storage device is given by:

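$\begin{matrix} {\overline{n}}_{n, b}^{(m)} = n_{n, b}^{(m)} \times h_{n, b}^{(m)} & (18 a) \end{matrix}$

and the number of erroneous traces sampled from the b-th bin and stored in the data storage device is given by:

$\begin{matrix} {\overline{n}}_{e, b}^{(m)} = n_{e, b}^{(m)} \times h_{e, b}^{(m)} & (18 b) \end{matrix}$

where ${\overline{n}}_{n, b}^{(m)}$ and ${\overline{n}}_{e, b}^{(m)}$ are rounded to the nearest integer number.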

Remaining unsampled traces are discarded by deleting the unsampled traces from a data storage device.
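A hedged sketch of the hybrid rate of Equations (17a) and (17b) and the per-bin counts of Equations (18a) and (18b) follows. The function names and the toy numbers are hypothetical; the same function serves normal and erroneous traces because the two equations share one form.

```python
import math


def hybrid_rate(p_type, p_bin, beta, alpha):
    """Hybrid rate 1 - p_type**beta * p_bin**alpha for one bin of one trace type."""
    return 1.0 - (p_type ** beta) * (p_bin ** alpha)


def hybrid_sampled_count(n_bin, p_type, p_bin, beta, alpha):
    """Traces kept from the bin, rounded to the nearest integer (halves round up)."""
    return int(math.floor(n_bin * hybrid_rate(p_type, p_bin, beta, alpha) + 0.5))


# Toy case: a trace type making up 25% of traces, and a bin holding 4% of that type.
rate = hybrid_rate(p_type=0.25, p_bin=0.04, beta=0.5, alpha=0.5)
kept = hybrid_sampled_count(40, p_type=0.25, p_bin=0.04, beta=0.5, alpha=0.5)
```

Setting beta to 0 removes the trace-type factor and leaves the duration-sampling form 1−p^α, while setting alpha to 0 leaves only the trace-type factor, so the hybrid rate covers both earlier schemes as special cases.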

The modified Gini index of the normal traces equals the fraction of normal traces sampled from the bins:

$G_{n}^{(β, α)} = \frac{{\overline{N}}_{n}}{N_{n}}$

where

${\overline{N}}_{n} = \sum_{m = 1}^{M} \sum_{b = 1}^{B} {\overline{n}}_{n, b}^{(m)}$ and $N_{n} = \sum_{m = 1}^{M} \sum_{b = 1}^{B} n_{n, b}^{(m)}$

The compression rate for normal hybrid sampling of the traces is given by

C_n^(β,α)=1−G_n^(β,α) (19a)

The modified Gini index of the erroneous traces equals the fraction of erroneous traces sampled from the bins:

$G_{e}^{(β, α)} = \frac{{\overline{N}}_{e}}{N_{e}}$

where

${\overline{N}}_{e} = \sum_{m = 1}^{M} \sum_{b = 1}^{B} {\overline{n}}_{e, b}^{(m)}$ and $N_{e} = \sum_{m = 1}^{M} \sum_{b = 1}^{B} n_{e, b}^{(m)}$

The compression rate for erroneous hybrid sampling of the traces is given by

C_e^(β,α)=1−G_e^(β,α) (19b)

A modified Gini index for the fraction of normal and erroneous traces sampled from the set of trace data across the M different trace types is given by:

$G^{(β, α)} = \frac{\overline{N}}{N}$

where

- β_e=β_n=β;
- α_e=α_n=α; and
- N=N_e+N_n.

The compression rate across normal and erroneous traces with different trace types is given by

C^(β,α)=1−G^(β,α) (19c)

Sampling Based Only on Normal and Erroneous Traces

Trace sampling may be performed on a set of trace data regardless of trace type and/or trace duration. A set of trace data is partitioned into normal traces and erroneous traces. Let N be the total number of traces in a set of trace data. Let N_n be the total number of normal traces in the set of trace data. Let N_e be the total number of erroneous traces in the set of trace data.

FIG. 24 shows an example of the set of trace data 1702 partitioned into normal traces 2402 and erroneous traces 2404 based only on the erroneous and normal status of the traces in the set 1702. For example, column 1711 lists the normal and erroneous status codes of the traces in the set 1702, such as entries 2406 and 2408. The set of trace data 1702 contains a total of N traces. The normal trace data 2402 contains N_n normal traces. The erroneous trace data 2404 contains N_e erroneous traces.

A frequency of occurrence of normal traces is given by

$\begin{matrix} p_{n} = \frac{N_{n}}{N} & (20 a) \end{matrix}$

and a frequency of occurrence of erroneous traces is given by

$\begin{matrix} p_{e} = \frac{N_{e}}{N} & (20 b) \end{matrix}$

where N=N_n+N_e.

The normal sampling rate for sampling the normal traces is given by

$\begin{matrix} h_{n} = 1 - (p_{n})^{β_{n}} & (21 a) \end{matrix}$

The erroneous sampling rate for sampling the erroneous traces is given by

$\begin{matrix} h_{e} = 1 - (p_{e})^{β_{e}} & (21 b) \end{matrix}$

The sampling rates represented by Equations (21a) and (21b) ensure that rarely occurring normal and erroneous traces are sampled at higher sampling rates than more frequently occurring normal and erroneous traces.

The number of normal traces stored in the data storage device is given by:

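$\begin{matrix} {\overline{N}}_{n} = N_{n} \times h_{n} & (22 a) \end{matrix}$

The number of erroneous traces stored in the data storage device is given by

$\begin{matrix} {\overline{N}}_{e} = N_{e} \times h_{e} & (22 b) \end{matrix}$

The numbers of traces ${\overline{N}}_{n}$ and ${\overline{N}}_{e}$ are rounded to the nearest integer number.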
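The status-only scheme of Equations (20a)-(22b) can be illustrated with a few lines of code. This is a sketch under stated assumptions: the function name `status_sampling` and the toy counts are hypothetical.

```python
import math


def status_sampling(n_normal, n_error, beta_n, beta_e):
    """Return (h_n, h_e, kept_normal, kept_error) for status-only sampling."""
    total = n_normal + n_error
    p_n, p_e = n_normal / total, n_error / total      # frequencies of occurrence
    h_n = 1.0 - p_n ** beta_n                         # normal sampling rate
    h_e = 1.0 - p_e ** beta_e                         # erroneous sampling rate
    kept_normal = int(math.floor(n_normal * h_n + 0.5))
    kept_error = int(math.floor(n_error * h_e + 0.5))
    return h_n, h_e, kept_normal, kept_error


# Toy case: 9,000 normal and 1,000 erroneous traces with beta_n = beta_e = 1.
h_n, h_e, kept_normal, kept_error = status_sampling(9000, 1000, 1.0, 1.0)
```

In this toy case the rare erroneous traces are sampled at a much higher rate than the common normal traces, so both groups end up with the same number of stored traces.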

The sampling parameters β_n and β_e correspond to the amount of normal and erroneous traces sampled and are based on user-selected sampling rates described below. Note that in one implementation β_n≠β_e and in another implementation β_n=β_e. For example, “conservative” sampling corresponds to β=1, “aggressive” sampling corresponds to β=0.5, and “super aggressive” sampling corresponds to β=0.25, where β represents β_n and β_e. The sampling parameters β_n and β_e are determined based on the user-selected sampling rate as described below.

The modified Gini index for the normal sampling rate is given by

$G_{n}^{(β)} = \frac{{\overline{N}}_{n}}{N_{n}}$

The compression rate across normal traces is given by

C_n^(β)=1−G_n^(β) (23a)

The modified Gini index for the erroneous sampling rate is given by

$G_{e}^{(β)} = \frac{{\overline{N}}_{e}}{N_{e}}$

The compression rate across erroneous traces is given by

C_e^(β)=1−G_e^(β) (23b)

The sampling parameters β_n and β_e may be independently selected based on desired sampling rates h_n and h_e or compression rates. A modified Gini index for the fraction of normal and erroneous traces sampled from the set of trace data is given by:

$G^{(β)} = \frac{\overline{N}}{N}$

where N=N_e+N_n.

The compression rate is given by

C^(β)=1−G^(β) (23c)

In another implementation, the set of erroneous traces may be partitioned further based on error codes, such as HTTP error status codes 4XX and 5XX or user-defined error status codes. FIG. 25 shows an example of the set of erroneous traces 2404 partitioned into U sets of erroneous traces, each set of erroneous traces corresponding to a different error code. In the example of FIG. 25, three sets 2501-2503 of the U sets of erroneous traces are represented. The erroneous traces in each set have the same error code. For example, set 2501 has error code err₁, set 2502 has error code err_u, and set 2503 has error code err_U.

A frequency of occurrence of erroneous traces is given by

$\begin{matrix} p_{e, u} = \frac{N_{e, u}}{N_{e}} & (24) \end{matrix}$

where u=1, . . . , U.

The erroneous sampling rate for sampling the set of erroneous traces with error code u is given by

$\begin{matrix} h_{e, u} = 1 - (p_{e, u})^{β_{e}} & (25) \end{matrix}$

The modified Gini index for the erroneous sampling rate with error status code u is given by

$G_{e, u}^{(β)} = \frac{{\overline{N}}_{e, u}}{N_{e}}$

The compression rate across all error status codes is given by

$\begin{matrix} C_{e}^{(β)} = 1 - G_{e}^{(β)} where G_{e}^{(β)} = \sum_{u = 1}^{U} G_{e, u}^{(β)} & (26) \end{matrix}$
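Equations (24)-(26) can be illustrated as follows. The function name, the error codes, and the counts are hypothetical and chosen only for illustration; the stored count per error code is assumed to be the count times the rate, rounded to the nearest integer as in the earlier equations.

```python
import math


def error_code_sampling(counts, beta_e):
    """Per-error-code sampling rates and kept counts, plus the compression rate C_e."""
    n_e = sum(counts.values())
    per_code = {}
    for code, n in counts.items():
        p = n / n_e                                   # frequency of the error code
        h = 1.0 - p ** beta_e                         # sampling rate for the code
        kept = int(math.floor(n * h + 0.5))           # assumed rounding to nearest integer
        per_code[code] = (h, kept)
    g_e = sum(kept for _, kept in per_code.values()) / n_e   # sum of per-code Gini terms
    return per_code, 1.0 - g_e


# Hypothetical counts of erroneous traces per HTTP status code.
per_code, c_e = error_code_sampling({"404": 800, "500": 160, "503": 40}, beta_e=1.0)
```

The rarest error code receives the highest sampling rate, so infrequent failure modes are not crowded out by a single dominant error code.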

Sampling based on a User Selected Overall Sampling Rate and an Erroneous Trace Sampling Rate

In this implementation, a user selects an overall sampling rate, h, of a set of trace data and selects an erroneous trace sampling rate h_e. Computer-implemented methods and systems described below determine a normal trace sampling rate h_n. The number of traces that are sampled and stored for an overall sampling rate h is given by

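$\overline{N} = h \times N$

The remaining unsampled traces, $N - \overline{N}$, are discarded. The number of sampled traces in terms of the numbers of sampled normal and erroneous traces is given by

$\begin{matrix} \overline{N} = {\overline{N}}_{e} + {\overline{N}}_{n} = h \times N & (27) \end{matrix}$

Let h_n and h_e be the normal and erroneous trace sampling rates. The relationships between the numbers of normal and erroneous traces to be sampled and the normal and erroneous trace sampling rates are given by

$\begin{matrix} {\overline{N}}_{n} = h_{n} \times N_{n} & (28 a) \end{matrix}$

$\begin{matrix} {\overline{N}}_{e} = h_{e} \times N_{e} & (28 b) \end{matrix}$

Substituting Equations (28a) and (28b) into Equation (27) and dividing by N gives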

h_n×p_n+h_e×p_e=h (29)

where p_n=N_n/N and p_e=N_e/N.

The frequencies of occurrences p_nand p_eare determined as described above with reference to Equations (20a) and (20b). When a user selects the erroneous trace sampling rate, h_e, the normal trace sampling rate is given by

$\begin{matrix} h_{n} = \frac{h - h_{e} \times p_{e}}{p_{n}} & (30) \end{matrix}$

When h_n>0, the user-selected erroneous trace sampling rate h_e can be used to sample erroneous traces of a set of traces. Alternatively, when h_n≤0, an alert is triggered in a GUI, such as on a monitor of a system administrator, and the normal traces are sampled with a preset normal trace sampling rate. When the normal and erroneous trace sampling rates are known, sampling is performed as described above with reference to FIG. 24 and Equations (20a)-(21b).

Suppose a set of trace data contains 260 normal traces and 170 erroneous traces. As a result, the frequency of occurrence of normal traces is approximately p_n=0.6 and the frequency of occurrence of erroneous traces is approximately p_e=0.4. When a user selects an overall sampling rate of h=0.30, the following table represents various combinations of normal and erroneous trace sampling rates that may be used:

Table of Erroneous and Normal Sampling Rates for Overall Sampling Rate of h = 0.30

| h_e | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| h_n | 0.43 | 0.37 | 0.3 | 0.24 | 0.17 | 0.1 | 0.04 | −0.027 | −0.092 |

The Table shows that when a user selects an erroneous trace sampling rate h_e less than or equal to 0.7, the corresponding normal sampling rate h_n is positive and therefore acceptable. However, when a user selects erroneous sampling rates 0.8 and 0.9, the corresponding normal sampling rates are negative valued, which triggers an alert that is displayed on a system administrator's monitor. The normal trace sampling rate may then be set to a default normal trace sampling rate, such as 0.04 or 0.1. Suppose a user selects an erroneous trace sampling rate of 0.60, which, according to the Table, corresponds to a normal trace sampling rate of 0.1. These sampling rates will produce an overall sampling rate of h=0.30, which corresponds to sampling and storing 30% of the traces in the set of trace data.
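The relationship in Equation (30) and the fallback to a preset rate can be sketched as below. The function names and the boolean alert flag are hypothetical stand-ins for the GUI alert described above, and the default rate of 0.04 is one of the example defaults mentioned in the text.

```python
def normal_rate(h, h_e, n_normal, n_error):
    """Normal trace sampling rate implied by overall rate h and erroneous rate h_e."""
    total = n_normal + n_error
    p_n, p_e = n_normal / total, n_error / total
    return (h - h_e * p_e) / p_n


def choose_rates(h, h_e, n_normal, n_error, default_h_n=0.04):
    """Return (h_n, h_e, alert); alert is True when the implied h_n is not positive."""
    h_n = normal_rate(h, h_e, n_normal, n_error)
    if h_n <= 0:
        # Hypothetical stand-in for the GUI alert: fall back to the preset rate.
        return default_h_n, h_e, True
    return h_n, h_e, False


# Example from the text: 260 normal traces, 170 erroneous traces, h = 0.30.
ok = choose_rates(0.30, 0.6, 260, 170)
too_high = choose_rates(0.30, 0.8, 260, 170)
```

With the example counts, an erroneous rate of 0.6 leaves room for a normal rate of about 0.10, while an erroneous rate of 0.8 already exceeds the overall budget and triggers the fallback.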

In another implementation, the processes described above may be performed independent of user selections for the overall and erroneous trace sampling rates. In particular, normal and erroneous sampling rates may be preset and used based on certain metrics violating a corresponding threshold. For example, RED metrics, such as request rate, error rate, and duration, are associated with services in an application. When the error rate, for example, is less than 10%, 4% of normal traces are sampled and stored (i.e., h_n=0.04) and 1% of erroneous traces are sampled and stored (i.e., h_e=0.01). On the other hand, when the error rate is greater than 10%, 4% of normal traces are sampled and stored (i.e., h_n=0.04) and 6% of erroneous traces are sampled and stored (i.e., h_e=0.06).

For a user-selected sampling rate h, the modified Gini index G^(β)=h and the sampling parameter β is obtained as described below. The modified Gini index for the normal sampling rate is given by

$\begin{matrix} G_{n}^{(β)} = \frac{{\overline{N}}_{n}}{N_{n}} & (31 a) \end{matrix}$

The modified Gini index for the erroneous sampling rate is given by

$\begin{matrix} G_{e}^{(β)} = \frac{{\overline{N}}_{e}}{N_{e}} & (31 b) \end{matrix}$

The compression rate is given by Equation (23c).

Sampling Parameters

In one implementation, the GUI 1500 in FIG. 15 may include fields that enable a user to input values for the sampling parameters α and β. For example, the GUI 1500 may include fields that enable a user to define “conservative” sampling as α=β=1, “aggressive” sampling as α=β=0.5, and “super aggressive” sampling as α=β=0.25.

In another implementation, the sampling parameters are determined based on the user-selected sampling level input via the GUI in FIG. 15. The sampling rates and corresponding compression rates depend on the modified Gini index G^(γ) defined via a set of parameters γ={γ₁, . . . , γ_τ}, where γ_i∈γ. In the following discussion, the generalized sampling parameter γ represents α, β, or (β, α), where the sampling parameter α represents the erroneous sampling parameter α_e or the normal sampling parameter α_n, and the sampling parameter β represents the erroneous sampling parameter β_e or the normal sampling parameter β_n. The efficiency of a sampling rate depends on the value of the parameters γ_i that will produce a user-selected sampling level as described above with reference to FIG. 15. Alternatively, a user selects a sampling level that has a corresponding sampling rate and a corresponding parameter γ_i. For example, suppose a user defines a “conservative” sampling rate as storing 15% of unsampled traces. The optimal parameter value γ₀ satisfies the following condition on the modified Gini index:

$G^{(γ_{0})} \approx 0.15$

The parameter γ₀ is used as the sampling parameter. Suppose a user defines an “aggressive” sampling rate as storing 10% of unsampled traces. The optimal parameter value γ₁ satisfies the following condition on the modified Gini index:

$G^{(γ_{1})} \approx 0.10$

The parameter γ₁ is used as the sampling parameter. Suppose a user defines a “super aggressive” sampling rate as storing 5% of unsampled traces. The optimal parameter value γ₂ satisfies the following condition:

$G^{(γ_{2})} \approx 0.05$

The parameter γ₂ is used as the sampling parameter. Optimization of the sampling parameter γ is solved based on the latest historical set of trace data. When traces of an application exhibit static behavior, the set of optimal parameters γ is hard coded for long-term use. In the case of an application with highly dynamic behavior, the optimal parameters γ are regularly redetermined.

The optimal parameters and corresponding modified Gini indices (i.e., percentage of sampled traces) may be computed in advance. When a user selects a particular sampling level the corresponding parameter may be obtained from the predetermined relationships between the optimal parameters and the modified Gini indices (i.e., percentage of sampled traces).

FIG. 26A shows a plot of modified Gini indices versus trace-type sampling parameters β. Horizontal axis 2602 represents a range of trace-type sampling parameters β. Vertical axis 2604 represents a range of modified Gini indices. Curve 2606 represents the modified Gini index as a function of the trace-type sampling parameter β. Table I shows trace-type sampling parameters β and the corresponding modified Gini index values:

TABLE I Modified Gini index versus trace-type sampling parameter β

| β | 0.02 | 0.025 | 0.03 | 0.035 | 0.04 | 0.045 | 0.05 | 0.055 | 0.06 | 0.065 | 0.07 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| G^(β) | 0.046 | 0.058 | 0.07 | 0.08 | 0.09 | 0.10 | 0.11 | 0.12 | 0.13 | 0.14 | 0.15 |

Relations (β, G^(β)) in Table I may be stored in a data storage device and retrieved from the data storage device based on a user-selected sampling level. The trace-type sampling parameter β that corresponds to the modified Gini index closest to the user-selected sampling level is used to obtain the sampling rate, such as in Equations (2a), (2b), (21a), and (21b). For example, when a user selects a sampling level of 15% (i.e., modified Gini index of 0.15), the corresponding trace-type sampling parameter 0.07 (i.e., β=0.07) is retrieved from Table I and used to obtain the sampling rate, such as in Equations (2a), (2b), (21a), and (21b). When a user selects a sampling level of 10% (i.e., modified Gini index of 0.10), the corresponding trace-type sampling parameter 0.045 (i.e., β=0.045) is retrieved from Table I and used to obtain the sampling rate, such as in Equations (2a), (2b), (21a), and (21b). When a user selects a sampling level of 5% (i.e., closest modified Gini index is 0.046), the corresponding trace-type sampling parameter 0.02 (i.e., β=0.02) is retrieved from Table I and used to obtain the sampling rate, such as in Equations (2a), (2b), (21a), and (21b).
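The table lookup described above amounts to a nearest-value search over stored (parameter, modified Gini index) pairs. A minimal sketch follows, assuming the relations of Table I are held in memory as a list of pairs; the names `TABLE_I` and `parameter_for_level` are hypothetical.

```python
# (beta, modified Gini index) pairs copied from Table I.
TABLE_I = [
    (0.02, 0.046), (0.025, 0.058), (0.03, 0.07), (0.035, 0.08),
    (0.04, 0.09), (0.045, 0.10), (0.05, 0.11), (0.055, 0.12),
    (0.06, 0.13), (0.065, 0.14), (0.07, 0.15),
]


def parameter_for_level(level, table):
    """Return the parameter whose modified Gini index is closest to the selected level."""
    parameter, _gini = min(table, key=lambda row: abs(row[1] - level))
    return parameter


beta_15 = parameter_for_level(0.15, TABLE_I)   # user selects a 15% sampling level
beta_10 = parameter_for_level(0.10, TABLE_I)   # user selects a 10% sampling level
beta_05 = parameter_for_level(0.05, TABLE_I)   # user selects a 5% sampling level
```

The same function applies unchanged to a stored copy of Table II or Table IV, because each of those tables also maps a single parameter to a modified Gini index.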

FIG. 26B shows a plot of modified Gini indices versus duration-sampling parameters α. Horizontal axis 2608 represents a range of duration-sampling parameters α. Vertical axis 2610 represents a range of modified Gini indices. Curve 2612 represents the modified Gini index as a function of the duration-sampling parameter α. Table II shows duration-sampling parameters α and the corresponding modified Gini indices:

TABLE II Modified Gini index versus duration-sampling parameter α

| α | 0.01 | 0.02 | 0.035 | 0.04 | 0.05 | 0.06 | 0.07 | 0.08 | 0.09 | 0.1 | 0.11 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| G^(α) | 0.015 | 0.03 | 0.051 | 0.059 | 0.072 | 0.086 | 0.099 | 0.11 | 0.13 | 0.14 | 0.15 |

Relations (α, G^(α)) in Table II may be stored in a data storage device and retrieved from the data storage device based on a user-selected sampling level. The duration-sampling parameter α that corresponds to the modified Gini index closest to the user-selected sampling level is used to obtain the duration-sampling rate in Equations (12a) and (12b). For example, when a user selects a sampling level of 15% (i.e., modified Gini index of 0.15), the corresponding duration-sampling parameter 0.11 (i.e., α=0.11) is retrieved from Table II and used to obtain the duration-sampling rate in Equations (12a) and (12b). When a user selects a sampling level of 10% (i.e., closest modified Gini index is 0.099), the corresponding duration-sampling parameter 0.07 (i.e., α=0.07) is retrieved from Table II and used to obtain the duration-sampling rate in Equations (12a) and (12b). When a user selects a sampling level of 5% (i.e., closest modified Gini index is 0.051), the corresponding duration-sampling parameter 0.035 (i.e., α=0.035) is retrieved from Table II and used to obtain the duration-sampling rate in Equations (12a) and (12b).

FIG. 26C shows a plot of modified Gini indices versus trace-type and duration-sampling parameters β and α. Axis 2614 represents a range of trace-type sampling parameters β. Axis 2616 represents a range of duration-sampling parameters α. Axis 2618 represents a range of modified Gini indices. Curve 2620 represents the modified Gini index as a function of the sampling parameters β and α. Table III shows sampling parameters β and α and the corresponding modified Gini indices:

TABLE III Modified Gini index versus sampling parameters β and α

| α | 0.02 | 0.07 | 0.05 | 0.03 | 0.08 | 0.01 | 0.06 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| β | 0.01 | 0.01 | 0.02 | 0.03 | 0.03 | 0.04 | 0.04 |
| G^(β,α) | 0.049 | 0.1 | 0.1 | 0.1 | 0.15 | 0.1 | 0.15 |

Relations ((β, α), G^(β,α)) in Table III may be stored in a data storage device and retrieved from the data storage device based on a user-selected sampling level. The sampling parameters β and α that correspond to the modified Gini index closest to the user-selected sampling level are used to obtain the hybrid sampling rate in Equations (17a) and (17b). For example, when a user selects a sampling level of 15% (i.e., modified Gini index of 0.15), the corresponding sampling parameters β=0.04 and α=0.06 are retrieved from Table III and used to obtain the hybrid sampling rate in Equations (17a) and (17b). When a user selects a sampling level of 10% (i.e., modified Gini index is 0.1), a combination of the sampling parameters β and α is retrieved from Table III and used to obtain the hybrid sampling rate in Equations (17a) and (17b). Table III shows that different combinations of sampling parameters may be used for a modified Gini index of 0.1. In one implementation, when multiple combinations of sampling parameters are available, rather than using different sampling parameters, the number of different parameters may be reduced by using sampling parameters that are equal, such as α=β=0.03. When a user selects a sampling level of 5% (i.e., closest modified Gini index is 0.049), the corresponding sampling parameters β=0.01 and α=0.02 are retrieved from Table III and used to obtain the hybrid sampling rate in Equations (17a) and (17b).

FIG. 26D shows a plot of modified Gini indices versus a sampling parameter. In this example, the same value is used for the trace-type sampling parameter β and the duration-sampling parameter α (i.e., α=β). Horizontal axis 2622 represents a range of sampling parameters α=β. Vertical axis 2624 represents a range of modified Gini indices. Curve 2626 represents the modified Gini index as a function of the sampling parameter. Table IV shows the sampling parameter α (i.e., α=β) and the corresponding modified Gini index values:

TABLE IV Modified Gini index versus sampling parameter α (α = β)

| α | 0.011 | 0.013 | 0.015 | 0.025 | 0.027 | 0.029 | 0.037 | 0.041 | 0.043 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| G^(α,α) | 0.04 | 0.049 | 0.057 | 0.093 | 0.097 | 0.1 | 0.14 | 0.147 | 0.153 |

Relations (α, G^(α,α)) in Table IV may be stored in a data storage device and retrieved from the data storage device based on a user-selected sampling level. The sampling parameter α that corresponds to the modified Gini index closest to the user-selected sampling level is used to obtain the hybrid sampling rate in Equations (17a) and (17b) with α=β. For example, when a user selects a sampling level of 15% (i.e., the closest modified Gini index is 0.147), the corresponding sampling parameter α=β=0.041 is retrieved from Table IV and used to obtain the hybrid sampling rate in Equations (17a) and (17b). When a user selects a sampling level of 10% (i.e., modified Gini index is 0.1), the corresponding hybrid sampling parameter α=β=0.029 is retrieved from Table IV and used to obtain the hybrid sampling rate in Equations (17a) and (17b). When a user selects a sampling level of 5% (i.e., closest modified Gini index is 0.049), the corresponding sampling parameter α=β=0.013 is retrieved from Table IV and used to obtain the hybrid sampling rate in Equations (17a) and (17b).

Optimizing Sampling Parameters

In practice, historical optimization of the sampling parameters α and β is not feasible due to the dynamic nature of applications. Instead, the compression rate of the sampling rate is monitored over a recent time window selected by a user. The duration of the time window may be one-half hour, one hour, two hours, twelve hours, or sixteen hours. The compression rate C^(γ) is calculated for the corresponding sampling rate applied in the time window. After the compression rate has been calculated for the time window, a difference is calculated between the compression rate and a user-selected compression rate as follows:

Δ=|C^(γ)−C_s| (32)

where

γ represents the sampling parameter (i.e., γ=α, γ=β, or γ=(α, β));

C^(γ) is the compression rate of the sampling rate with sampling parameter γ of traces over the recent time period; and

C_s is the user-selected compression rate.

The user-selected compression rate is given by C_s=1−G_s, where G_s is the modified Gini index that corresponds to the user-selected sampling level. For example, when a user selects a sampling level of 15%, the modified Gini index is G_s=0.15, and the user-selected compression rate is C_s=0.85.

When the difference satisfies the following condition

Δ≤Th_Opt (33)

where Th_Opt is the optimization threshold (e.g., Th_Opt=0.01, 0.02, or 0.05), the sampling rate is unchanged. On the other hand, when Δ>Th_Opt, the sampling parameter of the sampling rate is adjusted using the following function:

factor (Δ)=2−exp(−10×Δ) (34a)

where

- 0≤Δ≤100; and
- 1≤factor≤2.

The factor in Equation (34a) is used to compute an adjusted sampling parameter as follows:

γ_adj=factor×γ (34b)

Alternatively, the factor in Equation (34a) is used to compute an adjusted sampling parameter as follows:

γ_adj=(1/factor)×γ (34c)

The adjusted sampling parameter of Equation (34b) or (34c) replaces a previously used sampling parameter in the sampling rates described above.
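The adjustment described by Equations (32) through (34c) may be sketched as follows. This is an illustrative sketch only: the function names and the increase flag are hypothetical, and because the text does not state when Equation (34b) is preferred over Equation (34c), that choice is left to the caller as an assumption.

```python
import math


def adjustment_factor(delta):
    """Equation (34a): factor = 2 - exp(-10 * delta); equals 1 when delta is 0."""
    return 2.0 - math.exp(-10.0 * delta)


def adjust_sampling_parameter(gamma, compression_rate, sampling_level,
                              threshold=0.01, increase=True):
    """Return the sampling parameter to use in the next time window.

    gamma            -- current sampling parameter
    compression_rate -- C^(gamma) measured over the recent time window
    sampling_level   -- user-selected level G_s as a fraction (e.g., 0.15)
    threshold        -- optimization threshold Th_Opt
    increase         -- hypothetical flag: True applies Equation (34b),
                        False applies Equation (34c)
    """
    target = 1.0 - sampling_level            # C_s = 1 - G_s
    delta = abs(compression_rate - target)   # Equation (32)
    if delta <= threshold:                   # Equation (33): leave unchanged
        return gamma
    factor = adjustment_factor(delta)
    return gamma * factor if increase else gamma / factor
```

For example, with γ=0.03, a measured compression rate of 0.95, and a sampling level of 15%, the difference is Δ=0.1, the factor is 2−exp(−1), which is approximately 1.632, and Equation (34b) gives an adjusted parameter of approximately 0.049.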

Sampling Normal and Erroneous Traces

A trace is randomly sampled based on a Bernoulli distribution, where the probability of a success (i.e., sampling the trace) is the sampling rate r and the probability of a failure (i.e., discarding the trace) is the probability 1−r, and where r represents the sampling rate associated with the trace described above. The BRBNG receives as input the sampling rate r and, based on r, randomly outputs a number 1 for a success with probability r or randomly outputs a number 0 for a failure with probability 1−r. For each trace in a set of traces, the sampling rate r is input to BRBNG. When the BRBNG outputs a number 1, the trace is sampled by storing the trace in a data storage device. On the other hand, when the BRBNG outputs a number 0, the trace is discarded or deleted from memory or from a data storage device. Note that assignment of the values 1 and 0 may be reversed provided 0 is associated with probability of a success r and 1 is associated with probability of a failure 1−r. In an alternative implementation, a random number generator (e.g., pseudo-random number generator) is used to output a random number, R, for each trace, where 0≤R≤1. When R≤r, the trace is sampled by storing the trace in a data storage device. On the other hand, when R>r, the trace is discarded or deleted from memory or from a data storage device.
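The random-number alternative described above may be sketched in Python as follows. This is a minimal sketch, not the disclosed implementation: the names bernoulli and sample_traces are hypothetical, and the rate_of callback is an assumed way of looking up the sampling rate r associated with each trace.

```python
import random


def bernoulli(r, rng=random):
    """Return 1 (success) with probability r and 0 (failure) with probability 1 - r.

    random() returns a float in [0.0, 1.0), so the strict comparison never
    succeeds for r = 0 and always succeeds for r = 1.
    """
    return 1 if rng.random() < r else 0


def sample_traces(traces, rate_of, rng=random):
    """Keep each trace whose Bernoulli draw is a success; discard the rest.

    rate_of -- hypothetical callback mapping a trace to its sampling rate r
    """
    return [trace for trace in traces if bernoulli(rate_of(trace), rng) == 1]
```

Passing a seeded random.Random instance as rng makes a sampling run reproducible, which may be convenient when checking that the fraction of retained traces approaches r.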

The computer-implemented methods described below with reference to FIGS. 27-39 are stored in one or more data storage devices as machine-readable instructions that, when executed by one or more processors of a computer system, such as the computer system shown in FIG. 1, sample traces of an application executed in a distributed computing system.

FIG. 27 is a flow diagram illustrating an example implementation of a “method for sampling traces of an application.” In block 2701, a set of trace data is retrieved from data storage, such as a data storage device or a buffer. In block 2702, a “determine a normal trace sampling rate and an erroneous trace sampling rate” procedure is performed. An example implementation of the “determine a normal trace sampling rate and an erroneous trace sampling rate” procedure is described below with reference to FIG. 28. In block 2703, a “sample normal traces using the normal trace sampling rate and the erroneous trace sampling rate” procedure is performed. An example implementation of the “sample normal traces using the normal trace sampling rate and the erroneous trace sampling rate” procedure is described below with reference to FIG. 35. In block 2704, troubleshooting is performed on the erroneous traces to identify the source of performance problems with the application. Remedial measures may be employed to correct the performance problems. For example, VMs or containers executing application components may be migrated to different hosts to increase performance. Additional VMs or containers may be started to alleviate the workloads on already existing VMs and containers. Network bandwidth may be increased to reduce latency between peer VMs. In decision block 2705, when sampling continues, the operations represented by blocks 2702-2704 are repeated.

FIG. 28 is a flow diagram illustrating an example implementation of the “determine a normal trace sampling rate and an erroneous trace sampling rate” procedure performed in block 2702. In decision block 2801, when sampling is based on trace type and/or duration, control flows to block 2802. In block 2802, a “determine sampling rates based on trace type and/or duration” procedure is performed. An example implementation of the “determine sampling rates based on trace type and/or duration” procedure is described below with reference to FIG. 29. In decision block 2803, when sampling is based on normal and erroneous traces alone, control flows to block 2804. In block 2804, a “determine normal and erroneous trace sampling rates” procedure is performed. An example implementation of the “determine normal and erroneous trace sampling rates” procedure is described below with reference to FIG. 33. In decision block 2805, when an overall sampling rate and an erroneous trace sampling rate are given by a user, control flows to block 2806. In block 2806, a “determine normal trace sampling rate” procedure is performed. An example implementation of the “determine normal trace sampling rate” procedure is described below with reference to FIG. 34. In decision block 2807, when a red metric, such as request rate, error rate, or duration, violates a threshold, control flows to block 2808. Otherwise, control flows to block 2809. In block 2809, normal and erroneous sampling rates are selected based on the red metric. In block 2810, already determined or preset normal and erroneous sampling rates continue to be used.

FIG. 29 is a flow diagram illustrating an example implementation of the “determine sampling rates based on trace type and/or duration” procedure performed in block 2802. In decision block 2901, when a user has selected trace-type sampling of the application traces, control flows to block 2902. In block 2902, a “determine trace-type sampling rates” procedure is performed. An example implementation of the “determine trace-type sampling rates” procedure is described below with reference to FIG. 31. The trace-type sampling rates (“TTSR”) are returned and used to perform sampling of the application traces in block 2703 of FIG. 27. In block 2903, a “determine hybrid-sampling rates” procedure is performed. An example implementation of the “determine hybrid-sampling rates” procedure is described below with reference to FIG. 30. Blocks 2904 and 2905 are a while loop in which a “sample traces using hybrid-sampling rates” procedure is performed on the application traces while the duration of the time spent sampling, t, in block 2907 is less than the duration of a period of time T_p. An example implementation of the “sample traces using hybrid-sampling rates” procedure is described below with reference to FIG. 36. In block 2908, compression rates C^{(β, α)} are computed according to Equations (19a) and (19b) for hybrid sampling rates in Equations (17a) and (17b). In decision block 2909, when the compression rate obtained in block 2908 satisfies the condition |C^(β,α)−C_s|<ε, where ε is a small user-selected positive number (e.g., 0.01, 0.05, or 0.1) and C_s is the user-selected compression rate, the hybrid-sampling rates (“HSR”) obtained in block 2903 are returned and used to perform sampling of the application traces in block 2703 of FIG. 27.
Otherwise, control flows to block 2910. In block 2910, an alert is displayed in a GUI, or sent in an email to an administrator or application developer, indicating that the hybrid sampling failed to satisfy the user-selected compression rate. In block 2911, a “determine duration-sampling rates” procedure is performed. An example implementation of the “determine duration-sampling rates” procedure is described below with reference to FIG. 32. Blocks 2912 and 2913 are a while loop in which a “sample traces using duration-sampling rates” procedure is performed on the application traces while the duration of the time spent sampling, t, in block 2914 is less than the duration of a period of time T_p. An example implementation of the “sample traces using duration-sampling rates” procedure is described below with reference to FIG. 38. In block 2915, compression rates C^(α) are computed according to Equations (15a) and (15b) for duration sampling rates in Equations (12a) and (12b). In decision block 2916, when the compression rate obtained in block 2915 satisfies the condition |C^(α)−C_s|<ε, the duration-sampling rates (“DSR”) obtained in block 2911 are returned and used to perform sampling of the application traces in block 2703 of FIG. 27. Otherwise, control flows to block 2917. In block 2917, an alert is displayed in a GUI, or sent in an email to an administrator or application developer, indicating that the duration sampling failed to satisfy the user-selected compression rate. In decision block 2918, when the condition |C^(α)−C_s|<|C^(β,α)−C_s| is satisfied, the compression rate for duration sampling is closer to the user-selected compression rate than the compression rate for hybrid sampling and the duration-sampling rates obtained in block 2911 are returned. Otherwise, the compression rate for hybrid sampling is closer to the user-selected compression rate than the compression rate for duration sampling and the hybrid-sampling rates obtained in block 2903 are returned.
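The selection between hybrid-sampling rates and duration-sampling rates in decision blocks 2909, 2916, and 2918 may be summarized by the following sketch. It is a simplification under stated assumptions: the function name is hypothetical, both compression rates are taken as already measured, and the alerts of blocks 2910 and 2917 and the sampling loops are omitted.

```python
def choose_sampling_rates(c_hybrid, c_duration, c_target, eps=0.05):
    """Return "hybrid" or "duration" following decision blocks 2909, 2916, 2918.

    c_hybrid   -- compression rate measured for hybrid sampling
    c_duration -- compression rate measured for duration sampling
    c_target   -- user-selected compression rate C_s
    eps        -- small user-selected positive tolerance
    """
    hybrid_gap = abs(c_hybrid - c_target)
    duration_gap = abs(c_duration - c_target)
    if hybrid_gap < eps:            # decision block 2909
        return "hybrid"
    if duration_gap < eps:          # decision block 2916
        return "duration"
    # Decision block 2918: neither meets the tolerance, so use the closer one.
    return "duration" if duration_gap < hybrid_gap else "hybrid"
```

In this sketch hybrid sampling is preferred whenever it meets the tolerance, which mirrors the order in which the flow diagram evaluates the two conditions.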

FIG. 30 is a flow diagram illustrating an example implementation of the “determine hybrid-sampling rates” procedure performed in block 2903. In block 3001, sampling parameters are determined based on the user-selected sampling rate as described above with reference to Table III or Table IV. In block 3002, traces are sorted according to trace type to obtain groups of traces as described above with reference to FIG. 17. A loop beginning with block 3003 repeats the operations represented by blocks 3004-3010. In block 3004, frequencies of occurrence of normal and erroneous traces in a group of traces are determined as described above with reference to Equations (16a) and (16b). In block 3005, normal and erroneous traces of the group of traces are sorted according to trace duration as described above with reference to FIG. 23. In block 3006, a normal trace histogram and an erroneous trace histogram are constructed as described above with reference to FIGS. 21-23. In block 3007, a frequency of occurrence is determined for traces in each bin of the normal trace histogram as described above with reference to FIG. 23. In block 3008, a normal hybrid sampling rate is computed for each bin of the normal trace histogram from the frequencies of occurrence as described above with reference to Equation (17a). In block 3009, a frequency of occurrence is determined for traces in each bin of the erroneous trace histogram as described above with reference to FIG. 23. In block 3010, an erroneous hybrid sampling rate is computed for each bin of the erroneous trace histogram from the frequencies of occurrence as described above with reference to Equation (17b). In decision block 3011, blocks 3004-3010 are repeated for another group of traces.

FIG. 31 is a flow diagram illustrating an example implementation of the “determine trace-type sampling rates” procedure performed in block 2902. In block 3101, a sampling parameter β is determined based on the user-selected sampling rate as described above with reference to Table I. In block 3102, traces are sorted according to trace type to obtain groups of traces as described above with reference to FIG. 17. A loop beginning with block 3103 repeats the operations represented by blocks 3104-3108. In block 3104, a group of traces is partitioned into normal traces and erroneous traces. In block 3105, a frequency of occurrence of the normal traces is determined as described above with reference to Equation (1a). In block 3106, a normal trace-type sampling rate is computed from the frequency of occurrence of traces obtained in block 3105 and the sampling parameter β according to Equation (2a). In block 3107, a frequency of occurrence of the erroneous traces is determined as described above with reference to Equation (1b). In block 3108, an erroneous trace-type sampling rate is computed from the frequency of occurrence of traces obtained in block 3107 and the sampling parameter β according to Equation (2b). In decision block 3109, blocks 3104-3108 are repeated for another group of traces.

FIG. 32 is a flow diagram illustrating an example implementation of the “determine duration-sampling rates” procedure performed in block 2911. In block 3201, a sampling parameter α is determined based on the user-selected sampling rate as described above with reference to Table II. In block 3202, traces are sorted according to trace duration as described above with reference to FIG. 20. In block 3203, a histogram is constructed as described above with reference to FIG. 21. A loop beginning with block 3204 repeats the operations represented by blocks 3205 and 3206. In block 3205, a frequency of occurrence of traces in a bin of the histogram is determined as described above with reference to Equation (9). In block 3206, a duration-sampling rate is computed from the frequency of occurrence of traces obtained in block 3205 and the sampling parameter α according to Equation (10). In decision block 3207, blocks 3205 and 3206 are repeated for another bin of the histogram. In block 3208, a frequency of occurrence of traces in the lower bin is determined as described above with reference to Equation (11a). In block 3209, a short duration sampling rate is computed from the frequency of occurrence obtained in block 3208 and the sampling parameter α according to Equation (12a). In block 3210, a frequency of occurrence of traces in the upper bin is determined as described above with reference to Equation (11b). In block 3211, a long duration sampling rate is computed from the frequency of occurrence obtained in block 3210 and the sampling parameter α according to Equation (12b).

FIG. 33 is a flow diagram illustrating an example implementation of the “determine normal and erroneous trace sampling rates” procedure in block 2804. In block 3301, the set of trace data is partitioned into normal traces and erroneous traces as described above with reference to FIG. 24. In block 3302, a frequency of occurrence is determined as described above with reference to Equation (20a). In block 3303, a normal trace sampling rate is determined as described above with reference to Equation (21a). In block 3304, a frequency of occurrence is determined as described above with reference to Equation (20b). In block 3305, an erroneous trace sampling rate is determined as described above with reference to Equation (21b).

FIG. 34 is a flow diagram illustrating an example implementation of the “determine normal trace sampling rate” procedure in block 2806. In block 3401, the set of trace data is partitioned into normal traces and erroneous traces as described above with reference to FIG. 24. In block 3402, a frequency of occurrence of normal traces is determined as described above with reference to Equation (20a). In block 3403, a frequency of occurrence of erroneous traces is determined as described above with reference to Equation (20b). In block 3404, given an overall sampling rate and an erroneous trace sampling rate, a normal trace sampling rate is determined as described above with reference to Equation (30). In decision block 3405, when the normal trace sampling rate is less than zero, control flows to block 3406. In block 3406, the normal trace sampling rate is set to a default sampling rate.
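Blocks 3404 through 3406 may be illustrated with the following sketch. Equation (30) is not reproduced in this passage, so the formula used here is an assumption: it supposes that the overall sampling rate is the frequency-weighted sum of the normal and erroneous trace sampling rates and solves that balance for the normal rate. The function name and the default rate are likewise hypothetical.

```python
def normal_trace_sampling_rate(f_normal, f_erroneous, overall_rate,
                               erroneous_rate, default_rate=0.01):
    """Solve an ASSUMED balance for the normal trace sampling rate.

    Assumption (stand-in for Equation (30), which is not shown here):
        overall_rate = f_normal * normal_rate + f_erroneous * erroneous_rate
    where f_normal and f_erroneous are frequencies of occurrence (fractions).
    """
    if f_normal == 0:
        return default_rate  # no normal traces; nothing to solve for
    rate = (overall_rate - f_erroneous * erroneous_rate) / f_normal
    # Blocks 3405-3406: a negative result is replaced by a default rate.
    return default_rate if rate < 0 else rate
```

Under this assumption, a negative result arises when the erroneous traces alone would exceed the overall sampling budget, which is the case handled by the default rate in block 3406.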

FIG. 35 is a flow diagram illustrating an example implementation of the “sample normal traces using the normal trace sampling rate and the erroneous trace sampling rate” procedure performed in block 2703. In decision block 3501, when hybrid sampling has been selected, control flows to block 3502. In block 3502, the “sample traces using hybrid-sampling rates” procedure described below with reference to FIG. 36 is performed. In decision block 3503, when trace-type sampling has been selected, control flows to block 3504. In block 3504, the “sample traces using trace-type sampling rates” procedure described below with reference to FIG. 37 is performed. In decision block 3505, when duration sampling has been selected, control flows to block 3506. In block 3506, the “sample traces using duration-sampling rates” procedure described below with reference to FIG. 38 is performed. In block 3507, a “sample traces using normal and erroneous trace sampling rates” procedure is performed. An example implementation of the “sample traces using normal and erroneous trace sampling rates” procedure is described below with reference to FIG. 39. In block 3508, a compression rate that corresponds to the sampling rate is computed over a time window. In decision block 3509, when Δ>Th_Opt, control flows to block 3510 and the sampling parameter is adjusted as described above with reference to Equation (34b) or Equation (34c).

FIG. 36 is a flow diagram illustrating an example implementation of the “sample traces using hybrid-sampling rates” procedure performed in block 3502. This procedure is performed in block 3502 for normal traces and for erroneous traces. A loop beginning with block 3601 repeats the operations represented by blocks 3602-3608 for each group of traces. A loop beginning with block 3602 repeats the operations represented by blocks 3603-3607 for each bin of the normal trace histogram and each bin of the erroneous trace histogram obtained in block 3006 of FIG. 30. A loop beginning with block 3603 repeats the operations represented by blocks 3604-3606 for each trace in the bin. In block 3604, a success (e.g., “1”) or a failure (e.g., “0”) is computed with the BRBNG for the sampling rate associated with the trace. In decision block 3605, when output of the BRBNG is a success, control flows to block 3606. Otherwise, the output of the BRBNG is a failure and the trace is discarded. In block 3606, the trace is stored in a normal trace database for the application when the trace is a normal trace and is stored in an erroneous trace database when the trace is an erroneous trace. The normal and erroneous databases are persisted in a data storage device. In decision block 3607, the operations represented by blocks 3604-3606 are repeated for the traces in the bin. In decision block 3608, the operations represented by blocks 3603-3607 are repeated for another bin. In decision block 3609, the operations represented by blocks 3602-3608 are repeated for another group of traces.
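The nested loops over groups, bins, and traces may be flattened into a single pass in which each trace looks up the rate for its own group, error status, and bin. The sketch below is illustrative only: the trace fields, the rates mapping, and the shared bin_edges list are hypothetical, and a single set of bin edges is assumed for all histograms even though the procedure above builds separate normal and erroneous histograms per group. Two in-memory lists stand in for the normal and erroneous trace databases.

```python
import bisect
import random


def sample_with_hybrid_rates(traces, bin_edges, rates, rng=random):
    """Sample traces using per-(group, error status, bin) sampling rates.

    traces    -- dicts with hypothetical keys "type", "duration", "error"
    bin_edges -- sorted upper edges of the duration bins (assumed shared)
    rates     -- maps (trace type, is erroneous, bin index) to a rate r
    Returns (normal_db, erroneous_db) as lists of retained traces.
    """
    normal_db, erroneous_db = [], []
    for trace in traces:
        # Index of the first bin whose upper edge is >= the trace duration.
        bin_index = bisect.bisect_left(bin_edges, trace["duration"])
        rate = rates[(trace["type"], trace["error"], bin_index)]
        if rng.random() < rate:  # Bernoulli success: keep the trace
            target = erroneous_db if trace["error"] else normal_db
            target.append(trace)
        # On failure the trace is simply not stored, i.e., discarded.
    return normal_db, erroneous_db
```

A missing key in rates raises a KeyError rather than silently discarding the trace, which was chosen here so that an incomplete rate table is noticed early.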

FIG. 37 is a flow diagram illustrating an example implementation of the “sample traces using trace-type sampling rates” procedure performed in block 3504. This procedure is performed in block 3504 for normal traces and for erroneous traces. A loop beginning with block 3701 repeats the operations represented by blocks 3702-3706 for each group of traces. A loop beginning with block 3702 repeats the operations represented by blocks 3703-3705 for each trace in the group. In block 3703, a success (e.g., “1”) or a failure (e.g., “0”) is computed with the BRBNG for the sampling rate associated with the trace given by Equation (2a) or Equation (2b). In decision block 3704, when output of the BRBNG is a success, control flows to block 3705. Otherwise, the output of the BRBNG is a failure and the trace is discarded. In block 3705, the trace is stored in a normal trace database for the application when the trace is a normal trace and is stored in an erroneous trace database when the trace is an erroneous trace. The normal and erroneous databases are persisted in a data storage device. In decision block 3706, the operations represented by blocks 3703-3705 are repeated for each of the traces in the group. In decision block 3707, the operations represented by blocks 3702-3706 are repeated for another group of traces.

FIG. 38 is a flow diagram illustrating an example implementation of the “sample traces using duration-sampling rates” procedure performed in block 3506. This procedure is performed in block 3506 for normal traces and for erroneous traces. A loop beginning with block 3801 repeats the operations represented by blocks 3802-3806 for each bin of the histogram obtained in FIG. 22. A loop beginning with block 3802 repeats the operations represented by blocks 3803-3805 for each trace in the bin. In block 3803, a success (e.g., “1”) or a failure (e.g., “0”) is computed with the BRBNG for the sampling rate associated with the trace. In decision block 3804, when output of the BRBNG is a success, control flows to block 3805. Otherwise, the output of the BRBNG is a failure and the trace is discarded. In block 3805, the trace is stored in a normal trace database for the application when the trace is a normal trace and is stored in an erroneous trace database when the trace is an erroneous trace. The normal and erroneous databases are persisted in a data storage device. In decision block 3806, the operations represented by blocks 3803-3805 are repeated for each of the traces in the bin. In decision block 3807, the operations represented by blocks 3802-3806 are repeated for another bin.

FIG. 39 is a flow diagram illustrating an example implementation of the “sample traces using normal and erroneous trace sampling rates” procedure performed in block 3507. This procedure is performed in block 3507 for normal traces and for erroneous traces. A loop beginning with block 3901 repeats the operations represented by blocks 3902-3904 for each trace in the normal traces and again for each trace in the erroneous traces. In block 3902, a success (e.g., “1”) or a failure (e.g., “0”) is computed with the BRBNG for the sampling rate associated with the trace. In decision block 3903, when output of the BRBNG is a success, control flows to block 3904. Otherwise, the output of the BRBNG is a failure and the trace is discarded. In block 3904, the trace is stored in a normal trace database for the application when the trace is a normal trace and is stored in an erroneous trace database when the trace is an erroneous trace. The normal and erroneous databases are persisted in a data storage device. In decision block 3905, the operations represented by blocks 3902-3904 are repeated for each of the traces in the normal traces and the erroneous traces of the set of trace data.

It is appreciated that the previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these embodiments will be apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method stored in one or more data storage devices and executed using one or more processors of a computer system for sampling a set of traces of an application executed in a distributed computing system, the method comprising:

retrieving a set of trace data associated with the application from a data storage device;

determining sampling rates for sampling normal traces in the set and for sampling erroneous traces in the set, wherein the different sampling rates are inversely proportional to the frequency of occurrence of the normal traces and erroneous traces;

sampling the traces using the sampling rates to obtain sampled normal traces and sampled erroneous traces, wherein less frequently occurring normal traces are sampled at higher sampling rates than more frequently occurring normal traces and less frequently occurring erroneous traces are sampled at higher sampling rates than more frequently occurring erroneous traces; and

storing the sampled traces in a data storage device.

2. The method of claim 1 wherein determining the sampling rates comprises:

sorting the traces according to trace type to obtain one or more groups of traces, each group of traces having a different associated trace type; and

for each group of traces, partitioning the group of traces into normal traces and erroneous traces, determining a frequency of occurrence of normal traces in the group, determining a frequency of occurrence of erroneous traces in the group, constructing a normal trace histogram of the normal traces, constructing an erroneous trace histogram of the erroneous traces, determining a frequency of occurrence of normal traces in each bin of the normal trace histogram, determining a frequency of occurrence of erroneous traces in each bin of the erroneous trace histogram, determining a normal hybrid sampling rate for each bin of the normal trace histogram based on the frequency of occurrence of normal traces in each bin and the frequency of occurrence of the normal traces, and determining an erroneous hybrid sampling rate for each bin of the erroneous trace histogram based on the frequency of occurrence of erroneous traces in each bin and the frequency of occurrence of the erroneous traces.

3. The method of claim 1 wherein determining the sampling rates comprises:

sorting the traces according to trace type to obtain one or more groups of traces, each group of traces having a different associated trace type;

receiving a sampling level via a graphical user interface;

determining a trace-type sampling parameter based on the user-selected sampling level; and

for each group of traces, partitioning the group of traces into normal traces and erroneous traces, determining a frequency of occurrence of normal traces in the group of traces, determining a normal trace-type sampling rate based on the frequency of occurrence of normal traces, determining a frequency of occurrence of erroneous traces in the group of traces, and determining an erroneous trace-type sampling rate based on the frequency of occurrence of erroneous traces.

4. The method of claim 1 wherein determining the sampling rates comprises:

constructing a histogram of traces based on the durations, each bin of the histogram corresponding to a time interval and containing traces with durations in the time interval;

determining a frequency of occurrence of normal traces in each bin of the histogram;

for each bin of the histogram, determining a duration-sampling rate based on the frequency of occurrence of traces in the bin and the duration-sampling parameter;

determining a frequency of occurrence of traces in a lower bin;

computing a short duration sampling rate from the frequency of occurrence of traces in the lower bin;

determining a frequency of occurrence of traces in an upper bin; and

computing a long duration sampling rate from the frequency of occurrence of traces in the upper bin.

5. The method of claim 1 wherein determining the sampling rates comprises:

partitioning the set of trace data into normal traces and erroneous traces;

determining a frequency of occurrence of the normal traces;

determining a normal trace sampling rate based on the frequency of occurrence of the normal traces;

determining a frequency of occurrence of the erroneous traces; and

determining an erroneous trace sampling rate based on the frequency of occurrence of the erroneous traces.

6. The method of claim 1 wherein determining the sampling rates comprises:

partitioning the set of trace data into normal traces and erroneous traces;

determining a frequency of occurrence of the normal traces;

determining a frequency of occurrence of the erroneous traces; and

determining a normal trace sampling rate based on the frequency of occurrence of the normal traces, frequency of occurrence of the erroneous traces, an overall sampling rate, and erroneous trace sampling rate.

7. The method of claim 1 wherein sampling the traces using the sampling rates comprises sampling normal traces with a normal trace sampling rate, wherein the normal trace sampling rate is inversely proportional to a frequency of occurrence of the normal traces.

8. The method of claim 1 wherein sampling the traces using the sampling rates comprises sampling erroneous traces with an erroneous trace sampling rate, wherein the erroneous trace sampling rate is inversely proportional to a frequency of occurrence of the erroneous traces.

9. The method of claim 1 further comprising:

performing troubleshooting on the sampled erroneous traces to identify a performance problem with the application; and

executing remedial measures to correct the performance problem.

10. A computer system for sampling application traces of an application executed in a distributed computer system, the system comprising:

one or more processors;

one or more data storage devices; and

machine-readable instructions stored in the one or more data storage devices that when executed using the one or more processors control the system to perform operations comprising: retrieving a set of trace data associated with the application from a data storage device; determining sampling rates for sampling normal traces in the set and for sampling erroneous traces in the set, wherein the different sampling rates are inversely proportional to the frequency of occurrence of the normal traces and erroneous traces; sampling the traces using the sampling rates to obtain sampled normal traces and sampled erroneous traces, wherein less frequently occurring normal traces are sampled at higher sampling rates than more frequently occurring normal traces and less frequently occurring erroneous traces are sampled at higher sampling rates than more frequently occurring erroneous traces; and storing the sampled traces in a data storage device.

11. The computer system of claim 10 wherein determining the sampling rates comprises:

sorting the traces according to trace type to obtain one or more groups of traces, each group of traces having a different associated trace type; and

for each group of traces, partitioning the group of traces into normal traces and erroneous traces, determining a frequency of occurrence of normal traces in the group, determining a frequency of occurrence of erroneous traces in the group, constructing a normal trace histogram of the normal traces, constructing an erroneous trace histogram of the erroneous traces, determining a frequency of occurrence of normal traces in each bin of the normal trace histogram, determining a frequency of occurrence of erroneous traces in each bin of the erroneous trace histogram, determining a normal hybrid sampling rate for each bin of the normal trace histogram based on the frequency of occurrence of normal traces in each bin and the frequency of occurrence of the normal traces, and determining an erroneous hybrid sampling rate for each bin of the erroneous trace histogram based on the frequency of occurrence of erroneous traces in each bin and the frequency of occurrence of the erroneous traces.

12. The computer system of claim 10 wherein determining the sampling rates comprises:

sorting the traces according to trace type to obtain one or more groups of traces, each group of traces having a different associated trace type;

receiving a sampling level via a graphical user interface;

determining a trace-type sampling parameter based on the user-selected sampling level; and

for each group of traces, partitioning the group of traces into normal traces and erroneous traces, determining a frequency of occurrence of normal traces in the group of traces, determining a normal trace-type sampling rate based on the frequency of occurrence of normal traces, determining a frequency of occurrence of erroneous traces in the group of traces, and determining an erroneous trace-type sampling rate based on the frequency of occurrence of erroneous traces.
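
By way of illustration only, the trace-type operations recited above might be sketched in Python as follows. The mapping from a user-selected sampling level to a trace-type sampling parameter (`LEVEL_TO_PARAMETER`), the level names, the use of the whole trace set as the frequency denominator, and the capped inverse-frequency form of the rate are all assumptions made for the sketch. The claim fixes none of them.

```python
from collections import defaultdict

# Hypothetical mapping from a GUI sampling level to a sampling parameter.
LEVEL_TO_PARAMETER = {"low": 0.01, "medium": 0.05, "high": 0.2}


def inverse_frequency_rate(frequency, parameter):
    """Assumed rate: parameter / frequency, capped at 1; 0 if no traces."""
    if frequency <= 0.0:
        return 0.0
    return min(1.0, parameter / frequency)


def trace_type_rates(traces, level):
    """traces is a list of (trace_type, has_error) pairs.

    Returns {trace_type: (normal_rate, erroneous_rate)}.
    """
    parameter = LEVEL_TO_PARAMETER[level]
    groups = defaultdict(list)
    for trace_type, has_error in traces:  # sort traces by trace type
        groups[trace_type].append(has_error)
    total = len(traces)
    rates = {}
    for trace_type, error_flags in groups.items():
        erroneous = sum(error_flags)          # partition the group
        normal = len(error_flags) - erroneous
        rates[trace_type] = (
            inverse_frequency_rate(normal / total, parameter),
            inverse_frequency_rate(erroneous / total, parameter),
        )
    return rates
```

With 90 normal and 5 erroneous traces of one type and 5 normal traces of a second type, the "medium" level in this sketch yields a normal rate of about 0.056 for the first type and a rate of 1 for both of the rare sub-populations.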

13. The computer system of claim 10 wherein determining the sampling rates comprises:

constructing a histogram of traces based on durations of the traces, each bin of the histogram corresponding to a time interval and containing traces with durations in the time interval;

determining a frequency of occurrence of normal traces in each bin of the histogram;

for each bin of the histogram, determining a duration-sampling rate based on the frequency of occurrence of traces in the bin and a duration-sampling parameter;

determining a frequency of occurrence of traces in a lower bin;

computing a short duration sampling rate from the frequency of occurrence of traces in the lower bin;

determining a frequency of occurrence of traces in an upper bin; and

computing a long duration sampling rate from the frequency of occurrence of traces in the upper bin.
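
By way of illustration only, the duration-based operations recited above might be sketched in Python as follows. The sketch assumes that the "lower bin" and "upper bin" are the first and last bins of the histogram, that bin boundaries are supplied by the caller as `edges`, and that each rate is a hypothetical duration-sampling parameter divided by the bin frequency, capped at 1. None of these choices is fixed by the claim.

```python
from bisect import bisect_right
from collections import Counter


def duration_sampling_rates(durations, edges, parameter=0.02):
    """Return {bin_index: rate} for a duration histogram.

    edges are sorted interior boundaries. Bin 0 holds durations below
    edges[0]; the last bin holds durations at or above edges[-1].
    """
    total = len(durations)
    # bisect_right maps each duration to the index of its bin.
    counts = Counter(bisect_right(edges, d) for d in durations)
    rates = {}
    for bin_index in range(len(edges) + 1):
        frequency = counts.get(bin_index, 0) / total if total else 0.0
        # Assumed form: inverse frequency, capped at 1; empty bins get 0.
        rates[bin_index] = min(1.0, parameter / frequency) if frequency > 0 else 0.0
    return rates


def short_and_long_rates(durations, edges, parameter=0.02):
    """Rates for the lowest (short-duration) and highest (long-duration) bins."""
    rates = duration_sampling_rates(durations, edges, parameter)
    return rates[0], rates[len(edges)]
```

In this sketch, a set in which 2% of traces are very short, 96% are typical, and 2% are very long gives the two tail bins a rate of 1 and the middle bin a rate of about 0.021.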

14. The computer system of claim 10 wherein determining the sampling rates comprises:

partitioning the set of trace data into normal traces and erroneous traces;

determining a frequency of occurrence of the normal traces;

determining a normal trace sampling rate based on the frequency of occurrence of the normal traces;

determining a frequency of occurrence of the erroneous traces; and

determining an erroneous trace sampling rate based on the frequency of occurrence of the erroneous traces.
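
By way of illustration only, the partitioning and rate determination recited above might be sketched in Python as follows. The `Trace` record, its `has_error` field, and the capped inverse-frequency form `parameter / frequency` are assumptions of the sketch and are not taken from the claim.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    """Hypothetical minimal trace record used only for this sketch."""
    trace_id: int
    has_error: bool


def inverse_frequency_rate(frequency, parameter):
    """Assumed rate: parameter / frequency, capped at 1; 0 if no traces."""
    if frequency <= 0.0:
        return 0.0
    return min(1.0, parameter / frequency)


def normal_and_erroneous_rates(traces, parameter=0.05):
    """Partition traces, compute both frequencies, and derive both rates."""
    total = len(traces)
    if total == 0:
        return 0.0, 0.0
    erroneous = sum(1 for t in traces if t.has_error)
    normal = total - erroneous
    return (
        inverse_frequency_rate(normal / total, parameter),
        inverse_frequency_rate(erroneous / total, parameter),
    )
```

For a set of 95 normal traces and 5 erroneous traces, this sketch yields a normal trace sampling rate of about 0.053 and an erroneous trace sampling rate of 1, so the rarer erroneous traces are all retained.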

15. The computer system of claim 10 wherein determining the sampling rates comprises:

partitioning the set of trace data into normal traces and erroneous traces;

determining a frequency of occurrence of the normal traces;

determining a frequency of occurrence of the erroneous traces; and

determining a normal trace sampling rate based on the frequency of occurrence of the normal traces, the frequency of occurrence of the erroneous traces, an overall sampling rate, and an erroneous trace sampling rate.
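
By way of illustration only, one way to relate the four quantities recited above is to assume that the overall sampling rate is the frequency-weighted average of the normal and erroneous trace sampling rates, that is, overall = f_normal × r_normal + f_erroneous × r_erroneous. Solving for r_normal gives the Python sketch below. The weighted-average relationship and the function name are assumptions and are not stated in the claim.

```python
def normal_rate_from_overall(f_normal, f_erroneous, overall_rate, erroneous_rate):
    """Solve overall = f_normal * r_normal + f_erroneous * r_erroneous for r_normal.

    The result is clamped to [0, 1] so that it remains a valid sampling rate
    when the erroneous traces alone would exhaust the overall budget.
    """
    if f_normal <= 0.0:
        return 0.0  # no normal traces to sample
    rate = (overall_rate - f_erroneous * erroneous_rate) / f_normal
    return max(0.0, min(1.0, rate))
```

For example, with 90% normal traces, 10% erroneous traces, an erroneous trace sampling rate of 1, and an overall sampling rate of 0.19, this sketch returns a normal trace sampling rate of approximately 0.1.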

16. The computer system of claim 10 wherein sampling the traces using the sampling rates comprises sampling normal traces with a normal trace sampling rate, wherein the normal trace sampling rate is inversely proportional to a frequency of occurrence of the normal traces.

17. The computer system of claim 10 wherein sampling the traces using the sampling rates comprises sampling erroneous traces with an erroneous trace sampling rate, wherein the erroneous trace sampling rate is inversely proportional to a frequency of occurrence of the erroneous traces.
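
By way of illustration only, applying a previously determined sampling rate to a group of traces might be sketched in Python as follows. The sketch assumes independent random selection of each trace with probability equal to the rate. The function name and the optional `seed` argument are hypothetical, and the claims do not prescribe a particular selection mechanism.

```python
import random


def sample_traces(traces, rate, seed=None):
    """Keep each trace independently with probability `rate`.

    random.Random.random() returns a value in [0.0, 1.0), so a rate of 1
    keeps every trace and a rate of 0 keeps none.
    """
    rng = random.Random(seed)
    return [trace for trace in traces if rng.random() < rate]
```

The same function could be called once with the normal trace sampling rate on the normal traces and once with the erroneous trace sampling rate on the erroneous traces, with both results then written to storage.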

18. The computer system of claim 10 wherein the operations further comprise:

performing troubleshooting on the sampled erroneous traces to identify a performance problem with the application; and

executing remedial measures to correct the performance problem.

19. A non-transitory computer-readable medium encoded with machine-readable instructions that when executed by one or more processors of a computer system perform operations comprising:

retrieving a set of trace data associated with an application from a data storage device;

determining sampling rates for sampling normal traces in the set and for sampling erroneous traces in the set, wherein the different sampling rates are inversely proportional to the frequency of occurrence of the normal traces and erroneous traces;

sampling the traces using the sampling rates to obtain sampled normal traces and sampled erroneous traces, wherein less frequently occurring normal traces are sampled at higher sampling rates than more frequently occurring normal traces and less frequently occurring erroneous traces are sampled at higher sampling rates than more frequently occurring erroneous traces; and

storing the sampled traces in a data storage device.

20. The medium of claim 19 wherein determining the sampling rates comprises:

sorting the traces according to trace type to obtain one or more groups of traces, each group of traces having a different associated trace type; and

for each group of traces, partitioning the group of traces into normal traces and erroneous traces, determining a frequency of occurrence of normal traces in the group, determining a frequency of occurrence of erroneous traces in the group, constructing a normal trace histogram of the normal traces, constructing an erroneous trace histogram of the erroneous traces, determining a frequency of occurrence of normal traces in each bin of the normal trace histogram, determining a frequency of occurrence of erroneous traces in each bin of the erroneous trace histogram, determining a normal hybrid sampling rate for each bin of the normal trace histogram based on the frequency of occurrence of normal traces in the bin and the frequency of occurrence of the normal traces in the group, and determining an erroneous hybrid sampling rate for each bin of the erroneous trace histogram based on the frequency of occurrence of erroneous traces in the bin and the frequency of occurrence of the erroneous traces in the group.

21. The medium of claim 19 wherein determining the sampling rates comprises:

sorting the traces according to trace type to obtain one or more groups of traces, each group of traces having a different associated trace type;

receiving a sampling level via a graphical user interface;

determining a trace-type sampling parameter based on the user-selected sampling level; and

for each group of traces, partitioning the group of traces into normal traces and erroneous traces, determining a frequency of occurrence of normal traces in the group of traces, determining a normal trace-type sampling rate based on the frequency of occurrence of normal traces, determining a frequency of occurrence of erroneous traces in the group of traces, and determining an erroneous trace-type sampling rate based on the frequency of occurrence of erroneous traces.

22. The medium of claim 19 wherein determining the sampling rates comprises:

constructing a histogram of traces based on durations of the traces, each bin of the histogram corresponding to a time interval and containing traces with durations in the time interval;

determining a frequency of occurrence of normal traces in each bin of the histogram;

for each bin of the histogram, determining a duration-sampling rate based on the frequency of occurrence of traces in the bin and a duration-sampling parameter;

determining a frequency of occurrence of traces in a lower bin;

computing a short duration sampling rate from the frequency of occurrence of traces in the lower bin;

determining a frequency of occurrence of traces in an upper bin; and

computing a long duration sampling rate from the frequency of occurrence of traces in the upper bin.

23. The medium of claim 19 wherein determining the sampling rates comprises:

partitioning the set of trace data into normal traces and erroneous traces;

determining a frequency of occurrence of the normal traces;

determining a normal trace sampling rate based on the frequency of occurrence of the normal traces;

determining a frequency of occurrence of the erroneous traces; and

determining an erroneous trace sampling rate based on the frequency of occurrence of the erroneous traces.

24. The medium of claim 19 wherein determining the sampling rates comprises:

partitioning the set of trace data into normal traces and erroneous traces;

determining a frequency of occurrence of the normal traces;

determining a frequency of occurrence of the erroneous traces; and

determining a normal trace sampling rate based on the frequency of occurrence of the normal traces, the frequency of occurrence of the erroneous traces, an overall sampling rate, and an erroneous trace sampling rate.

25. The medium of claim 19 wherein sampling the traces using the sampling rates comprises sampling normal traces with a normal trace sampling rate, wherein the normal trace sampling rate is inversely proportional to a frequency of occurrence of the normal traces.

26. The medium of claim 19 wherein sampling the traces using the sampling rates comprises sampling erroneous traces with an erroneous trace sampling rate, wherein the erroneous trace sampling rate is inversely proportional to a frequency of occurrence of the erroneous traces.

27. The medium of claim 19 wherein the operations further comprise:

performing troubleshooting on the sampled erroneous traces to identify a performance problem with the application; and

executing remedial measures to correct the performance problem.