INCAST DROP CAUSE TELEMETRY

Aspects of the subject disclosure relate to ways to capture packet metadata following an incast event. In some implementations, a method of the subject technology can include steps for receiving a plurality of data packets at a network device, storing each of the plurality of packets in a buffer, and detecting a packet drop event for one or more incoming packets, wherein the one or more incoming packets are not stored in the buffer. In some aspects, the method can further include steps for indicating a marked packet from among the received data packets, dequeuing each of the plurality of packets in the buffer, and capturing metadata for each dequeued packet until the marked packet is dequeued.

Description
CROSS REFERENCE TO RELATED APPLICATION

This application claims the benefit of U.S. Provisional Application No. 61/900,324, filed Nov. 5, 2013, entitled “SYSTEMS AND METHODS FOR DETERMINING METRICS AND WORKLOAD MANAGEMENT,” which is incorporated herein by reference in its entirety.

BACKGROUND

Field of the Invention

The subject technology relates to data gathering for packets that are enqueued and dequeued in a buffer and in particular, for collecting packet metadata for use in analyzing incast events.

Introduction:

As data centers grow in the number of server nodes and operating speed of the interconnecting network, it has become challenging to ensure reliable packet delivery. Moreover, the workload in large data centers is generated by an increasingly heterogeneous mix of applications, such as search, retail, high-performance computing and storage, and social networking.

There are two main causes of packet loss/drops: (1) drops due to congestion episodes, particularly “incast” events, and (2) corruption on the channel due to increasing line rates. Packet losses can cause timeouts at the transport and application levels, leading to a loss of throughput and an increase in flow transfer times and the number of aborted jobs.

BRIEF DESCRIPTION OF THE DRAWINGS

Certain features of the subject technology are set forth in the appended claims. However, the accompanying drawings, which are included to provide further understanding, illustrate disclosed aspects and together with the description serve to explain the principles of the subject technology. In the drawings:

FIG. 1 illustrates an example network device, according to certain aspects of the subject technology.

FIG. 2 illustrates an example of a network configuration in which an incast event can occur, according to some implementations.

FIG. 3 illustrates a conceptual block diagram of a buffer implemented in a network device, according to some aspects.

FIG. 4 illustrates a block diagram of an example method for capturing packet metadata, according to some implementations.

DETAILED DESCRIPTION

The detailed description set forth below is intended as a description of various configurations of the subject technology and is not intended to represent the only configurations in which aspects of the disclosure can be practiced. The appended drawings are incorporated herein and constitute a part of the detailed description. The detailed description includes specific details for the purpose of providing a more thorough understanding of the subject technology. However, it will be clear and apparent that the subject technology is not limited to the specific details set forth herein and may be practiced without these details. In some instances, structures and components are shown in block diagram form in order to avoid obscuring the concepts of the subject technology.

Overview:

A computer network is a geographically distributed collection of nodes interconnected by communication links and segments for transporting data between endpoints, such as personal computers and workstations. Many types of networks are available, with the types ranging from local area networks (LANs) and wide area networks (WANs) to overlay and software-defined networks, such as virtual extensible local area networks (VXLANs).

LANs typically connect nodes over dedicated private communication links located in the same geographic region, such as a building or campus. WANs, on the other hand, typically connect geographically dispersed nodes over long-distance communications links, such as common carrier telephone lines, optical lightpaths, synchronous optical networks (SONET), or synchronous digital hierarchy (SDH) links. LANs and WANs can include layer 2 (L2) and/or layer 3 (L3) networks and devices.

The Internet is an example of a WAN that connects disparate networks throughout the world, providing global communication between nodes on various networks. The nodes typically communicate over the network by exchanging discrete frames or packets of data according to predefined protocols, such as the Transmission Control Protocol/Internet Protocol (TCP/IP). In this context, a protocol can refer to a set of rules defining how the nodes interact with each other. Computer networks may be further interconnected by an intermediate network node, such as a router, to extend the effective “size” of each network.

Transmission Control Protocol (TCP) is widely used to provide reliable, ordered delivery of data from one network entity to another. More particularly, TCP is frequently relied upon to implement Internet applications, such as, for example, the World Wide Web, e-mail, and file transfer. In a high-bandwidth, low-latency network utilizing TCP, multiple servers may independently send data to a single common receiver. When the multiple senders transmit data to the receiver simultaneously, congestion, or incast congestion, can occur if the receiver is not capable of receiving the quantity of data being transmitted.

Description:

The congestion episode termed “incast” or “fan-in” congestion can lead to bursty losses and TCP timeouts. Essentially, incasting occurs when multiple sources simultaneously transfer data to a common client/receiver, overwhelming the buffers to which the client is connected. Incast can cause severe losses of throughput and vastly increase flow transfer times, making its prevention an important factor in ensuring reliable packet delivery across data center interconnect fabrics.

Two approaches are conventionally implemented to address the incast problem in data centers: (1) reducing the duration of TCP timeouts using high resolution timers (HRTs), and (2) increasing switch buffer sizes to reduce loss events.

The use of HRTs is designed to drastically reduce the minimum retransmission timeout (min-RTO), and thus the amount of time a TCP source remains timed out after bursty packet losses. However, high-resolution timers can be difficult to implement, especially in virtual-machine-rich environments. For instance, reducing min-RTO can require making operating-system-specific changes to the TCP stack, imposing potentially serious deployment challenges because of the widespread use of closed-source operating systems (such as Windows) and legacy operating systems.

The other approach to the incast problem is to reduce packet losses using switches with very large buffers. However, increasing switch buffer sizes is very expensive, and increases latency and power dissipation. Moreover, the large, high-bandwidth buffers needed for high-speed data center switches require expensive, complex, and power-hungry memories. In terms of performance, while large buffers can reduce packet drops, and hence timeouts due to incast, they may also increase the latency of short messages, potentially leading to the violation of service-level agreements (SLAs) for latency-sensitive applications.

In some other implementations, incast events may be predicted, for example, by monitoring the rate at which packets are dequeued from a buffer as compared to the buffer fill rate. However, such information is often of limited use because a given buffer can, on average, be empty; thus, time-varying measurements based on bandwidth utilization, or on buffer use, may be too coarse-grained to yield insight into the actual cause(s) of an incast event.

Accordingly, there remains a need to better understand the network conditions that exist just before and during the occurrence of an incast event.

The subject technology addresses the foregoing need by providing a way to capture data about packets enqueued just before an incast occurrence. With information regarding enqueued packets, network administrators can better analyze and understand the conditions leading to an overflow event. Enqueued packet information can yield clues as to the systemic cause of an incast event, for example, by providing information regarding the source(s) and/or destination(s) of buffered packets, as well as information identifying the application(s) with which they are associated.

In some aspects, the subject technology can be implemented by capturing packet metadata for packets residing in a buffer when an incast occurs. As discussed in further detail below, packet metadata for each dequeued packet can be captured, up to the last packet that was added to the buffer before the incast is detected. In some implementations, a last packet stored to the buffer can be marked or “flagged” upon the detection of an incast event. Thereafter, packet metadata is captured (e.g., packet header information can be recorded) as each packet is subsequently dequeued. The recordation/capturing of dequeued packet metadata can continue until it is determined that the flagged packet has been dequeued. Thus, a “snapshot” of packet metadata, e.g., representing all packets in the filled buffer (before the incast event), can be recorded for later analysis. A brief introductory description of example systems and networks for which metadata information can be captured, as illustrated in FIGS. 1 and 2, is disclosed herein.
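
By way of illustration only, the following minimal sketch (in Python; all names are hypothetical and not part of the disclosure) shows one way such a mark-and-capture mechanism could behave, assuming a simple fixed-capacity FIFO queue with tail drop:

    from collections import deque

    class IncastSnapshotBuffer:
        """Illustrative fixed-capacity FIFO buffer that records a metadata
        'snapshot' of its contents after a packet drop is detected."""

        def __init__(self, capacity):
            self.capacity = capacity
            self.queue = deque()      # packets awaiting transmission
            self.marked = None        # last packet enqueued before the drop
            self.capturing = False    # True from drop detection until the marked packet leaves
            self.snapshot = []        # captured metadata records

        def enqueue(self, packet):
            """Store a packet; on overflow, drop it and mark the last stored packet."""
            if len(self.queue) < self.capacity:
                self.queue.append(packet)
                return True
            # Drop event: the incoming packet is not stored (tail drop).
            if self.queue and not self.capturing:
                self.marked = self.queue[-1]   # flag the most recently enqueued packet
                self.capturing = True
            return False

        def dequeue(self):
            """Remove the oldest packet; capture its metadata until the marked packet is out."""
            packet = self.queue.popleft()
            if self.capturing:
                self.snapshot.append({k: packet.get(k) for k in
                                      ("src", "dst", "vlan", "tenant", "proto")})
                if packet is self.marked:
                    self.capturing = False    # snapshot of the filled buffer is complete
            return packet

    # Example: a buffer of four packets overflows on the fifth arrival,
    # so metadata is captured for the four buffered packets as they drain.
    buf = IncastSnapshotBuffer(capacity=4)
    for i in range(5):
        buf.enqueue({"src": f"10.0.0.{i}", "dst": "10.0.1.1",
                     "vlan": 100, "tenant": "t1", "proto": "TCP"})
    while buf.queue:
        buf.dequeue()
    print(buf.snapshot)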

FIG. 1 illustrates an example network device 110 (e.g., a router) suitable for implementing the present invention. Network device 110 includes a master central processing unit (CPU) 162, interfaces 168, and bus 115 (e.g., a PCI bus). When acting under the control of appropriate software or firmware, CPU 162 is responsible for executing packet management, error detection, and/or routing functions, such as miscabling detection functions, for example. CPU 162 can accomplish all these functions under the control of software including an operating system and any appropriate applications software. CPU 162 may include one or more processors 163 such as a processor from the Motorola family of microprocessors or the MIPS family of microprocessors. In alternative aspects, processor 163 is specially designed hardware for controlling the operations of router 110. In a specific implementation, memory 161 (such as non-volatile RAM and/or ROM) also forms part of CPU 162. However, there are many different ways in which memory could be coupled to the system.

Interfaces 168 are typically provided as interface cards (sometimes referred to as “line cards”). Generally, they control the sending and receiving of data packets over the network and sometimes support other peripherals used with router 110. Among the interfaces that may be provided are Ethernet interfaces, frame relay interfaces, cable interfaces, DSL interfaces, token ring interfaces, and the like. In addition, various very high-speed interfaces may be provided such as fast token ring interfaces, wireless interfaces, Ethernet interfaces, Gigabit Ethernet interfaces, ATM interfaces, HSSI interfaces, POS interfaces, FDDI interfaces and the like.

Although the system shown in FIG. 1 illustrates an example of a network device implementation, it is not the only network device architecture on which the subject technology may be implemented. For example, an architecture having a single processor that handles communications as well as routing computations, etc. is often used. Further, other types of interfaces and media can also be implemented.

FIG. 2 illustrates a data center network structure in which an environment 200 includes Top of Rack (TOR) switches 202-208, aggregate switches 210 and 212, aggregate routers 214 and 216, an access router 218, and Internet 220. Furthermore, FIG. 2 illustrates an example of how one of the TOR switches, TOR switch 208, can be connected to a plurality of servers 222-226. However, it is contemplated that each of TOR switches 202-208 can be similarly connected to the same plurality of servers 222-226 and/or different servers. In various embodiments, the environment 200 may represent a basic topology, at an abstract level, of a data center network. As shown, each of the TOR switches 202-208 may be connected to each of the aggregate switches 210 and 212. For instance, TOR switch 202 can be connected to both aggregate switch 210 and aggregate switch 212. Moreover, each of aggregate switches 210 and 212 can be connected to each of the aggregate routers 214 and 216, which may be connected to the access router 218. Lastly, access router 218 can be connected to Internet 220. It is contemplated that any number of TOR switches 202-208, aggregate switches 210 and 212, aggregate routers 214 and 216, and access routers 218 can be implemented in environment 200.

In various aspects, a network data center may be a facility used to house computer systems and associated components, such as TOR switches 202-208, aggregate switches 210 and 212, aggregate routers 214 and 216, and/or access router 218, for example. Moreover, TOR switches 202-208 can refer to small port-count switches that are situated on, or near, the top of a rack included in a network data center. In addition, aggregate switches 210 and 212 can be used to increase the link speed beyond the limits of any single cable or port.

As stated above, each of TOR switches 202-208 may be connected to a plurality of servers 222-226. Although three servers 222-226 are shown in FIG. 2, it is contemplated that each TOR switch 202-208 can be connected to any number of servers 222-226. In this embodiment, TOR switch 208 is representative of TOR switches 202-206 and it may be directly connected to servers 222-226. TOR switch 208 may be connected to dozens of servers 222-226. In one embodiment, the number of servers 222-226 under the same TOR switch 208 is from 44 to 48, and the TOR switch 208 is a 48-port Gigabit switch with one or multiple 10 Gigabit uplinks.

In the environment 200, such as a network data center, data may be stored on multiple servers 222-226. Incast congestion can occur when a file, or a portion thereof, is fetched from multiple servers 222-226. More specifically, incast congestion may occur when multiple senders (i.e., servers 222-226), which may be operating under the same TOR switch 202-208, send data to a single receiver either simultaneously or at approximately the same time. In various implementations, the receiver can include any type of server and/or computing device. Even when the senders transmit data to the receiver simultaneously, incast congestion may be avoided if the number of senders or the amount of data transmitted by each sender is relatively small. However, when the amount of data transmitted by the senders exceeds the available buffering at the receiver's access port, data packets that were transmitted by a sender may be lost and therefore not received by the receiver. Hence, throughput can decline due to one or more TCP connections experiencing timeouts caused by data packet drops and/or loss.

For instance, assume that the environment 200 includes ten servers 222-226 and an allocator that assigns one or more of the servers 222-226 to provide data in response to a request for that data. In various embodiments, if the servers 222-226 send their respective data packets to a receiver at approximately the same time, the receiver may not have available bandwidth to receive the data packets (i.e., incast congestion). As a result, data packets may be lost and the server(s) 222-226 that transmitted the lost data packet(s) may need to retransmit them. Accordingly, if the receiver requested a particular piece of data from the servers 222-226, the receiver may need to wait for the lost data packets to be retransmitted in order to receive the data responsive to the request. That is, the performance of environment 200 may be dependent upon the TCP connections between servers 222-226 and the receiver. Therefore, the time associated with retransmitting the lost data packets may cause unneeded delay in the environment 200.

FIG. 3 illustrates an example of a buffer (queue) of a network device 300 (e.g., similar to network device 110, discussed above with respect to FIG. 1). As illustrated, network device 300 includes buffer 302, which stores multiple packets, e.g., packets ‘A,’ ‘B,’ ‘C,’ and ‘D.’ Network device 300 also includes multiple network connections, e.g., a dequeue channel, which removes data from buffer 302, and multiple enqueue channels, from which incoming data is stored into buffer 302.

In practice, network device 300 receives data via the multiple enqueue channels, and stores the data in buffer 302. When properly functioning, network device 300 will dequeue the data in buffer 302 at a rate that is equal to, or faster than, the rate at which new data is being stored or added to buffer 302. However, in some instances new data (e.g., packets) is stored to buffer 302 at a rate exceeding that at which stored data (packets) can be dequeued. In such instances, buffer 302 can fill to capacity, and subsequently received packets, such as packet ‘E,’ are dropped. As discussed above, an incast event can occur when multiple enqueue channels are used to push data/packets onto buffer 302 faster than the data/packets can be dequeued.

To better understand the nature of an incast event, it can be helpful to know more about the network conditions preceding the event, for example, by observing the packets stored in buffer 302 before the incast event occurred. In practice, data may be collected about the packets stored to buffer 302, for example, by capturing packet header metadata for each packet as it is dequeued from buffer 302. The storing/capturing of packet metadata can be initiated by the detection of an incast event and can be continued, for example, until a marked/flagged packet is dequeued. In such implementations, the marked/flagged packet can be the packet last stored to buffer 302 before the incast event was detected. That is, upon detection of an incast event, the packet last stored to buffer 302 can be flagged/marked, e.g., to indicate a time immediately preceding a packet drop. In some implementations, the marking/flagging of the last packet stored to buffer 302 can be performed by modifying packet header information of the marked packet.

In the example illustrated in FIG. 3, packet ‘D’ is the last packet stored to buffer 302. Packet ‘E’ represents the first packet dropped after buffer 302 is filled, e.g., due to data incast. As illustrated, packet ‘D’ has been marked by network device 300 such that a bit in the packet header has been flipped, distinguishing packet ‘D’ from the other packets in buffer 302. It is understood that the implementation depicted by FIG. 3 merely illustrates one example marking process; depending on implementation, the manner and/or process in which packet marking is performed may vary.
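
As a purely illustrative sketch in Python (the byte offset, bit position, and choice of header field below are assumptions made for this example, not taken from the disclosure), a last-enqueued packet could be flagged by flipping a single header bit that is normally clear:

    def mark_packet(header: bytearray, byte_offset: int = 1, bit: int = 0) -> None:
        """Flip one bit in the packet header to flag the packet (assumes the bit
        is normally clear, can safely be repurposed, and is cleared on egress)."""
        header[byte_offset] ^= (1 << bit)

    def is_marked(header: bytes, byte_offset: int = 1, bit: int = 0) -> bool:
        """Check the flag bit as each packet is dequeued."""
        return bool(header[byte_offset] & (1 << bit))

    hdr = bytearray(b"\x45\x00\x00\x28")   # hypothetical first header bytes
    mark_packet(hdr)
    assert is_marked(hdr)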

Further to the example of FIG. 3, as each of the stored packets is dequeued, the respective metadata information for each packet can be captured/recorded and stored for later analysis. By better understanding the nature of the packets contained in buffer 302 when an incast event occurs, network administrators may better troubleshoot the causes of incast events.

It is understood that packet metadata information may be analyzed locally or remotely (e.g., across one or more remote collectors), depending on the desired implementation. That is, packet metadata may be stored and/or analyzed locally, e.g., on the network device at which the metadata information is captured. Alternatively, any portion of the captured metadata information may be sent to one or more remote systems/collectors for further storage and/or analysis.
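
As one hedged example of the remote-collection option, a minimal Python sketch (assuming JSON-over-UDP and a hypothetical collector address, neither of which is specified by the disclosure) might look like:

    import json
    import socket

    def export_snapshot(snapshot, collector=("198.51.100.10", 9999)):
        """Send each captured metadata record to a remote collector as a UDP datagram."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        try:
            for record in snapshot:
                sock.sendto(json.dumps(record).encode("utf-8"), collector)
        finally:
            sock.close()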

FIG. 4 illustrates an example block diagram of a process 400 that can be used to implement aspects of the subject technology. Process 400 begins with step 402, in which one or more data packets are received at a network device. It is understood that the network device can include any of a variety of network-enabled, processor-based devices, such as one or more switches (e.g., TOR switches) or routers, etc.

In step 404, each of the received data packets is stored in a buffer (e.g., a queue) associated with the network device. For example, the packets can be stored in a queue or buffer as they are processed/routed, e.g., before being dequeued and transmitted/routed to another node or network end-point.

Subsequently, in decision step 406, it is determined whether a packet drop has occurred. If in decision step 406 it is determined that no packet drop has been detected, process 400 proceeds back to step 404, in which incoming packets continue to be stored in a queue of the network device. Alternatively, if in decision step 406 it is determined that a packet drop has been detected, process 400 proceeds to step 408, in which a packet presently stored in the queue (buffer) is marked, indicating a point in time before the drop event. As discussed in further detail below, the marked packet can be used to identify a time frame for which packet information (for dequeued packets) is to be captured/collected.

Although any packet in the queue can be marked, in some implementations, the marked packet is the last packet enqueued before the drop event was detected. That is, the most recent packet stored to the buffer is identified and marked, for example, by modifying one or more bits in the packet header.

In step 410, packets stored in the buffer prior to the drop event are dequeued. In some implementations, the packets are dequeued in a particular order, such as in a first-in-first-out order. As such, the marked packet is the last packet to be dequeued from among the packets residing in the buffer when the packet drop was detected. In this manner, packet data (e.g., packet metadata) is captured for all packets residing in the buffer when the drop (incast event) occurred. In certain implementations, the capturing of packet metadata is stopped after the marked packet has been dequeued, e.g., once a ‘snapshot’ of buffered metadata has been captured.

Subsequently, in step 412, captured metadata information is analyzed, for example, to better understand the circumstances preceding the incast event. In some implementations, a network administrator, or other user diagnosing the cause of a packet drop event, may find such information useful, for example, in determining what applications or network paths/links are associated with the incast. For example, captured packet metadata can contain information indicating one or more originating applications, source/origination addresses, destination addresses, tenant network identifier(s), virtual local area network (VLAN) identification(s), etc. By better understanding the network conditions leading to an incast, network administrators are provided more information with which to diagnose network problems.
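
By way of illustration, one hedged way to picture such a per-packet record in Python (the field names and layout below are assumptions; the description lists the kinds of information captured, not a specific format) is:

    from dataclasses import dataclass, asdict
    from typing import Optional

    @dataclass
    class PacketMetadata:
        """One captured record per dequeued packet."""
        src_addr: str                     # source/origination address
        dst_addr: str                     # destination address
        protocol: str                     # protocol type
        vlan_id: Optional[int] = None     # VLAN identification
        tenant_id: Optional[str] = None   # tenant network identifier
        src_app: Optional[str] = None     # originating application, if known

    record = PacketMetadata(src_addr="10.0.0.5", dst_addr="10.0.1.1",
                            protocol="TCP", vlan_id=100, tenant_id="tenant-a")
    print(asdict(record))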

It is understood that any specific order or hierarchy of steps in the processes disclosed is an illustration of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged, or that only a portion of the illustrated steps be performed. Some of the steps may be performed simultaneously. For example, in certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein, but are to be accorded the full scope consistent with the language claims, wherein reference to an element in the singular is not intended to mean “one and only one” unless specifically so stated, but rather “one or more.”

A phrase such as an “aspect” does not imply that such aspect is essential to the subject technology or that such aspect applies to all configurations of the subject technology. A disclosure relating to an aspect may apply to all configurations, or one or more configurations. A phrase such as an aspect can refer to one or more aspects and vice versa. A phrase such as a “configuration” does not imply that such configuration is essential to the subject technology or that such configuration applies to all configurations of the subject technology. A disclosure relating to a configuration may apply to all configurations, or one or more configurations. A phrase such as a configuration may refer to one or more configurations and vice versa.

The word “exemplary” is used herein to mean “serving as an example or illustration.” Any aspect or design described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects or designs.

Claims

1. A computer-implemented method comprising:

receiving a plurality of data packets at a network device;
storing, by the network device, each of the plurality of packets in a buffer;
detecting, by the network device, a packet drop event for one or more incoming packets, wherein the one or more incoming packets are not stored in the buffer;
indicating a marked packet from among the plurality of received data packets, wherein the marked packet indicates a last packet enqueued prior to the packet drop event;
dequeuing each of the plurality of packets in the buffer; and
capturing metadata for each dequeued packet until the marked packet is dequeued.

2. The computer-implemented method of claim 1, further comprising:

determining a cause of an incast event at the network device based on the metadata.

3. The computer-implemented method of claim 1, further comprising:

sending the captured metadata for one or more of the dequeued packets to one or more remote collectors for further processing.

4. The computer-implemented method of claim 1, further comprising:

identifying one or more applications associated with one or more of the plurality of packets in the buffer prior to the drop event.

5. The computer-implemented method of claim 1, wherein the packet drop event corresponds with an incast event at the network device.

6. The computer-implemented method of claim 1, wherein identifying the marked packet further comprises:

modifying, by the network device, packet header information associated with the marked packet.

7. The computer-implemented method of claim 1, wherein the metadata for each dequeued packet comprises one or more of the following: a destination address, an origination address, a tenant network identifier, a protocol type, or a virtual local area network (VLAN) identification.

8. The computer-implemented method of claim 1, wherein indicating the marked packet from among the plurality of received data packets, further comprises:

modifying packet header information of the marked packet.

9. A system for capturing metadata information after an incast event, the system comprising:

a memory; and
one or more processors coupled to the memory, wherein the one or more processors are configured to perform operations comprising: receiving a plurality of data packets at a network device; storing, by the network device, each of the plurality of packets in a buffer; detecting, by the network device, a packet drop event for one or more incoming packets, wherein the one or more incoming packets are not stored in the buffer; indicating a marked packet from among the plurality of received data packets, wherein the marked packet indicates a last packet enqueued prior to the packet drop event; dequeuing each of the plurality of packets in the buffer; and capturing metadata for each dequeued packet until the marked packet is dequeued.

10. The system of claim 9, wherein the one or more processors are further configured to perform operations comprising:

determining a cause of an incast event at the network device based on the metadata.

11. The system of claim 9, wherein the one or more processors are further configured to perform operations comprising:

sending the captured metadata for one or more of the dequeued packets to one or more remote collectors for further processing.

12. The system of claim 9, wherein the one or more processors are further configured to perform operations comprising:

identifying one or more applications associated with one or more of the plurality of packets in the buffer prior to the drop event.

13. The system of claim 9, wherein the packet drop event corresponds with an incast event at the network device.

14. The system of claim 9, wherein identifying the marked packet further comprises:

modifying, by the network device, packet header information associated with the marked packet.

15. The system of claim 9, wherein the metadata for each dequeued packet comprises one or more of the following: a destination address, an origination address, a tenant network identifier, a protocol type, or a virtual local area network (VLAN) identification.

16. The system of claim 9, wherein indicating the marked packet from among the plurality of received data packets, further comprises:

modifying packet header information of the marked packet.

17. A non-transitory computer-readable storage medium comprising instructions stored therein, which when executed by one or more processors, cause the processors to perform operations comprising:

receiving a plurality of data packets at a network device;
storing, by the network device, each of the plurality of packets in a buffer;
detecting, by the network device, a packet drop event for one or more incoming packets, wherein the one or more incoming packets are not stored in the buffer;
indicating a marked packet from among the plurality of received data packets, wherein the marked packet indicates a last packet enqueued prior to the packet drop event;
dequeuing each of the plurality of packets in the buffer; and
capturing metadata for each dequeued packet until the marked packet is dequeued.

18. The non-transitory computer-readable storage medium of claim 17, wherein the processors are further configured to perform operations comprising:

determining a cause of an incast event at the network device based on the metadata.

19. The non-transitory computer-readable storage medium of claim 17, wherein the one or more processors are further configured to perform operations comprising:

sending the captured metadata for one or more of the dequeued packets to one or more remote collectors for further processing.

20. The non-transitory computer-readable storage medium of claim 17, wherein the processors are further configured to perform operations comprising:

identifying one or more applications associated with one or more of the plurality of packets in the buffer prior to the drop event.

21. The non-transitory computer-readable storage medium of claim 17, wherein the packet drop event corresponds with an incast event at the network device.

22. The non-transitory computer-readable storage medium of claim 17, wherein identifying the marked packet further comprises:

modifying, by the network device, packet header information associated with the marked packet.

23. The non-transitory computer-readable storage medium of claim 17, wherein the metadata for each dequeued packet comprises one or more of the following: a destination address, an origination address, a tenant network identifier, a protocol type, or virtual local area network (VLAN) identification.

Patent History
Publication number: 20150124824
Type: Application
Filed: Sep 11, 2014
Publication Date: May 7, 2015
Inventors: Thomas J. Edsall (Los Gatos, CA), Mohammadreza Alizadeh Attar (Santa Clara, CA)
Application Number: 14/484,181
Classifications
Current U.S. Class: Processing Of Address Header For Routing, Per Se (370/392); Queuing Arrangement (370/412)
International Classification: H04L 12/823 (20060101); H04L 12/863 (20060101); H04L 12/741 (20060101);