Port congestion notification in a switch
A congestion notification mechanism provides a congestion status for all destinations in a switch at each ingress port. Data is stored in a memory subsystem queue associated with the destination port at the ingress side of the crossbar. A cell credit manager tracks the amount of data in this memory subsystem for each destination. If the count for any destination exceeds a threshold, the credit manager sends an XOFF signal to the XOFF masks. A lookup table in the XOFF masks maintains the status for every switch destination based on the XOFF signals. An XON history register receives the XOFF signals to allow queuing procedures that do not allow a status change to XON during certain states. Flow control signals directly from the memory subsystem are allowed to flow to each XOFF mask, where they are combined with the lookup table status to provide a congestion status for every destination.
This application is a continuation-in-part application based on U.S. patent application Ser. No. 10/020,968, entitled “Deferred Queuing in a Buffered Switch,” filed on Dec. 19, 2001, which is hereby incorporated by reference.
This application is related to U.S. patent application entitled “Fibre Channel Switch,” Ser. No. ______, attorney docket number 3194, filed on even date herewith with inventors in common with the present application. This related application is hereby incorporated by reference.
FIELD OF THE INVENTION
The present invention relates to congestion notification in a switch. More particularly, the present invention relates to maintaining and updating a congestion status for all destination ports within a switch.
BACKGROUND OF THE INVENTION
Fibre Channel is a switched communications protocol that allows concurrent communication among servers, workstations, storage devices, peripherals, and other computing devices. Fibre Channel can be considered a channel-network hybrid, containing enough network features to provide the needed connectivity, distance and protocol multiplexing, and enough channel features to retain simplicity, repeatable performance and reliable delivery. Fibre Channel is capable of full-duplex transmission of frames at rates extending from 1 Gbps (gigabits per second) to 10 Gbps. It is also able to transport commands and data according to existing protocols such as Internet protocol (IP), Small Computer System Interface (SCSI), High Performance Parallel Interface (HIPPI) and Intelligent Peripheral Interface (IPI) over both optical fiber and copper cable.
In a typical usage, Fibre Channel is used to connect one or more computers or workstations together with one or more storage devices. In the language of Fibre Channel, each of these devices is considered a node. One node can be connected directly to another, or the nodes can be interconnected by means of a Fibre Channel fabric. The fabric can be a single Fibre Channel switch, or a group of switches acting together. Technically, the N_Ports (node ports) on each node are connected to F_Ports (fabric ports) on the switch. Multiple Fibre Channel switches can be combined into a single fabric. The switches connect to each other via E_Ports (expansion ports), forming an interswitch link, or ISL.
Fibre Channel data is formatted into variable length data frames. Each frame starts with a start-of-frame (SOF) indicator and ends with a cyclical redundancy check (CRC) code for error detection and an end-of-frame indicator. In between are a 24-byte header and a variable-length data payload field that can range from 0 to 2112 bytes. The switch uses a routing table and the source and destination information found within the Fibre Channel frame header to route the Fibre Channel frames from one port to another. Routing tables can be shared between multiple switches in a fabric over an ISL, allowing one switch to know when a frame must be sent over the ISL to another switch in order to reach its destination port.
Fibre Channel switches are required to deliver frames to any destination in the same order that they arrive from a source. One common approach to ensure in-order delivery in this context is to process frames in strict temporal order at the input or ingress side of a switch. This is accomplished by managing the input buffer as a first in, first out (FIFO) buffer. Sometimes, however, a switch encounters a frame that cannot be delivered due to congestion at the destination port. This frame remains at the top of the buffer until the destination port becomes un-congested, even when the next frame in the FIFO is destined for a port that is not congested and could be transmitted immediately. This condition is referred to as head of line blocking.
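Head of line blocking can be illustrated with a minimal sketch of a strict FIFO input buffer. The function name `strict_fifo_drain` and the representation of a frame as a (name, destination port) pair are assumptions made only for this illustration.

```python
from collections import deque

def strict_fifo_drain(frames, congested):
    """Send frames in strict FIFO order, stopping at the first frame
    whose destination port is congested.

    frames    -- iterable of (name, destination_port) pairs (hypothetical layout)
    congested -- set of destination ports that cannot accept data
    Returns (sent, still_waiting).
    """
    queue = deque(frames)
    sent = []
    # Only the frame at the head of the buffer may ever be transmitted.
    while queue and queue[0][1] not in congested:
        sent.append(queue.popleft())
    return sent, list(queue)

frames = [("f0", 3), ("f1", 7), ("f2", 3), ("f3", 5)]
sent, waiting = strict_fifo_drain(frames, congested={7})
```

In this example frames f2 and f3 are destined for uncongested ports, yet both remain in the buffer behind f1, which is the blocking condition described above.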
Various techniques have been proposed to deal with the problem of head of line blocking. Scheduling algorithms, for instance, do not use true FIFOs. Rather, they search the input FIFO buffer looking for matches between waiting data and available output ports. If the top frame is destined for a busy port, the scheduling algorithm merely scans the FIFO buffer for the first frame that is destined for an available port. Such algorithms must take care to avoid sending Fibre Channel frames out of order. Another approach is to divide the input buffer into separate buffers for each possible destination. However, this requires large amounts of memory and a good deal of complexity in large switches having many possible destination ports. A third approach is the deferred queuing solution described in detail in the incorporated references. Deferred queuing requires that all incoming data frames that are destined for a congested port be placed in a deferred queue, which keeps these frames from unduly interfering with frames destined for uncongested ports. This technique requires a dependable method for determining the congestion status for all destinations at each input port.
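The role of the congestion status in deferred queuing can be sketched as follows. This is a simplified illustration with a static congestion status, not the procedure of the incorporated references; the name `split_by_congestion` and the frame layout are assumptions.

```python
def split_by_congestion(frames, is_congested):
    """Separate incoming frames into those that may be sent immediately
    and those that are placed on a deferred queue.

    frames       -- iterable of (name, destination_port) pairs (hypothetical layout)
    is_congested -- callable returning True when a destination is congested
    """
    sent, deferred = [], []
    for name, dest in frames:
        if is_congested(dest):
            deferred.append((name, dest))   # wait here without blocking others
        else:
            sent.append((name, dest))       # arrival order is preserved per destination
    return sent, deferred

frames = [("f0", 3), ("f1", 7), ("f2", 3), ("f3", 7), ("f4", 5)]
sent, deferred = split_by_congestion(frames, lambda dest: dest == 7)
```

The sketch depends entirely on `is_congested` returning an accurate answer for every destination, which is why a dependable congestion status at each input port is needed.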
Congestion and blocking are especially troublesome when the destination port is an E_Port providing an interswitch link to another switch. One reason that the E_Port can become congested is that the input port on the second switch has filled up its input buffer. The flow control between the switches prevents the first switch from sending any more data to the second switch. Oftentimes the input buffer on the second switch becomes filled with frames that are all destined for a single congested port on that second switch. This filled buffer has congested the ISL, so that the first switch cannot send any data to the second switch, including data that is destined for an un-congested port on the second switch. Several manufacturers have proposed the use of virtual channels to prevent the situation where congestion on an interswitch link is caused by traffic to a single destination. In these proposals, traffic on the link is divided into several virtual channels, and no virtual channel is allowed to interfere with traffic on the other virtual channels. Inrange Technologies Corporation has proposed a technique for flow control over virtual channels that is described in the incorporated Fibre Channel Switch application. This flow control technique monitors the congestion status of all destination ports at the downstream switch. If a destination port becomes congested, the flow control process determines which virtual channel on the ISL is affected, and sends an XOFF message so informing the upstream switch. The upstream switch will then stop sending data on the affected virtual channel.
Like the deferred queuing solution, the virtual channel flow control solution requires that every input port in the downstream switch know the congestion status of all destinations in the switch. Unfortunately, the existing solutions for providing this information are not satisfactory, as they do not easily present accurate congestion status information to each of the ingress ports in a switch.
SUMMARY OF THE INVENTION
The foregoing needs are met, to a great extent, by the present invention, which provides a method for noticing port congestion and informing ingress ports of the congestion. The present invention utilizes a switch that submits data to a crossbar component for making connections to a destination port. Before data is submitted to the crossbar, it is stored in a virtual output queue structure in a memory subsystem. A separate virtual output queue is maintained for each destination within the switch. When a connection is made over the crossbar to a destination port, data is removed from the virtual output queue associated with that destination port and transmitted over the connection. When a destination port becomes congested, flow control within the switch will prevent data from leaving the virtual output queues associated with that destination.
The present invention utilizes a cell credit manager at the ingress to the switch. The cell credit manager tracks credits associated with each virtual output queue in order to obtain knowledge about the amount of data within each queue. If the credit count in the cell credit manager drops below a threshold value, the cell credit manager views the associated port as a congested port and asserts an XOFF signal. The XOFF signal includes three components: an internal switch destination address for the relevant destination port, an XOFF/XON status bit, and a validity signal to indicate that a valid XOFF signal is being sent.
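The credit tracking and the three-part XOFF signal can be sketched as follows. The class and field names, the initial credit value, and the rule that an XON is signaled once the count is back at or above the threshold are all assumptions made for illustration; the text above specifies only the XOFF condition.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class XoffSignal:
    sda: int      # internal switch destination address
    xoff: bool    # XOFF/XON status bit: True = XOFF, False = XON
    valid: bool   # validity indicator

class CellCreditManager:
    """Hypothetical per-destination credit counter."""

    def __init__(self, initial_credits, threshold):
        self.initial = initial_credits
        self.threshold = threshold
        self.credits = {}      # sda -> remaining credits
        self.xoffed = set()    # destinations currently signaled XOFF

    def _check(self, sda):
        count = self.credits[sda]
        if count < self.threshold and sda not in self.xoffed:
            self.xoffed.add(sda)
            return XoffSignal(sda, True, True)
        # Assumption: XON is signaled once the count recovers to the threshold.
        if count >= self.threshold and sda in self.xoffed:
            self.xoffed.discard(sda)
            return XoffSignal(sda, False, True)
        return None            # no status change, nothing to send

    def cell_sent(self, sda):
        """A cell entered the queue for sda: consume one credit."""
        self.credits[sda] = self.credits.get(sda, self.initial) - 1
        return self._check(sda)

    def credit_returned(self, sda):
        """A cell left the queue for sda: return one credit."""
        self.credits[sda] = self.credits.get(sda, self.initial) + 1
        return self._check(sda)
```

A signal is produced only when the status of a destination changes, so each XOFF mask sees one update per transition rather than one per cell.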
The XOFF signal of the cell credit manager is received by a plurality of XOFF mask modules. One XOFF mask is utilized at each ingress to the switch. Each XOFF mask receives the XOFF signal, and assigns the designated destination port to the indicated XOFF/XON status. The XOFF mask maintains the status for every destination port in a look up table that assigns a single bit to each port. If the bit assigned to a port is set to “1,” the port has an XOFF status. If the bit is “0,” the port has an XON status and is free to receive data.
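A one-bit-per-destination lookup table of this kind can be sketched with a single integer used as a bit field. The class name `XoffMask` and its method names are assumptions for illustration only.

```python
class XoffMask:
    """Hypothetical lookup table holding one status bit per destination.
    Bit value 1 means XOFF; bit value 0 means XON."""

    def __init__(self):
        self.bits = 0

    def apply(self, sda, xoff, valid=True):
        """Record the status carried by an XOFF signal for destination sda."""
        if not valid:
            return                      # ignore signals not marked valid
        if xoff:
            self.bits |= (1 << sda)     # set bit: destination is XOFF
        else:
            self.bits &= ~(1 << sda)    # clear bit: destination is XON

    def is_xoff(self, sda):
        return bool((self.bits >> sda) & 1)

mask = XoffMask()
mask.apply(sda=12, xoff=True)
mask.apply(sda=40, xoff=True, valid=False)   # not valid, so no change
```

After these two calls only destination 12 reads as XOFF; every other destination remains XON and free to receive data.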
The present invention recognizes that the XOFF mask should not set the status of the destination port to XON during certain portions of the deferred queuing procedure. Consequently, the present invention utilizes an XON history register that also tracks the current status of all ports. This XON history register receives the XOFF signals from the cell credit manager and reflects those changes in its own lookup table. The values in the look up table in the XON history register are then used to periodically update the values in the look up table in the XOFF mask.
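The interaction between the two lookup tables can be sketched as follows. The flag `hold_xon`, standing in for "certain portions of the deferred queuing procedure", and the method names are assumptions; the actual conditions and update timing are not specified here.

```python
class MaskWithHistory:
    """Hypothetical pairing of an XOFF mask with an XON history register."""

    def __init__(self):
        self.mask = 0          # XOFF mask lookup table (1 = XOFF)
        self.history = 0       # XON history register lookup table
        self.hold_xon = False  # assumed flag: True while XON changes are not allowed

    def on_signal(self, sda, xoff):
        bit = 1 << sda
        if xoff:
            self.history |= bit
            self.mask |= bit           # XOFF is always safe to apply
        else:
            self.history &= ~bit       # history always follows the signal
            if not self.hold_xon:
                self.mask &= ~bit      # mask changes to XON only when allowed

    def sync(self):
        """Periodic update: copy the history values into the mask."""
        self.mask = self.history

m = MaskWithHistory()
m.on_signal(3, xoff=True)
m.hold_xon = True
m.on_signal(3, xoff=False)   # recorded in history, withheld from the mask
held_mask, held_history = m.mask, m.history
m.hold_xon = False
m.sync()
```

While the hold is in effect the mask still reports destination 3 as XOFF even though the history register has recorded the XON; the later update brings the mask back into agreement.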
The present invention also recognizes flow control signals directly from the memory subsystem that request that all data stop flowing to that subsystem. When these signals are received, a “gross_xoff” signal is sent to the XOFF mask. The XOFF mask is then able to combine the results of this signal with the status of every destination port as maintained in its lookup table. When another portion of the switch wishes to determine the status of a particular port, the internal switch destination address is submitted to the XOFF mask. This address is used to reference the status of that destination in the lookup table, and the result is ORed with the value of the gross_xoff signal. The resulting signal indicates the status of the indicated destination port.
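The status lookup described above reduces to a single OR operation. In this sketch the lookup table is held as an integer bit field, and the function name `destination_status` is an assumption.

```python
def destination_status(lookup_bits, sda, gross_xoff):
    """Return True (XOFF) or False (XON) for one destination.

    lookup_bits -- integer bit field, bit i is 1 when destination i is XOFF
    sda         -- internal switch destination address being queried
    gross_xoff  -- True when the memory subsystem has asked all traffic to stop
    """
    table_bit = (lookup_bits >> sda) & 1
    return bool(table_bit) or bool(gross_xoff)

table = 1 << 9   # only destination 9 is marked XOFF in the table
```

With gross_xoff deasserted only destination 9 reads as XOFF; with gross_xoff asserted every destination reads as XOFF regardless of its table bit.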
The present invention utilizes a single cell credit manager to track the inputs to the memory subsystem for a plurality of ports. Since each port has its own XOFF mask, the XOFF signals must be sent to the XOFF mask for each port that the cell credit manager tracks. Other cell credit managers exist within the switch. The present invention utilizes a special bus to transfer XOFF signals between the various cell credit managers within a switch. In addition, the present invention provides a technique for a stop_all signal to be shared with all XOFF masks utilizing a single memory subsystem. This signal will ensure that when the gross_xoff signal is set, it will prevent all traffic from flowing into the memory subsystem.
BRIEF DESCRIPTION OF THE DRAWINGS
1. Switch 100
The present invention is best understood after examining the major components of a Fibre Channel switch, such as switch 100 shown in
Switch 100 is a director class Fibre Channel switch having a plurality of Fibre Channel ports 110. The ports 110 are physically located on one or more I/O boards inside of switch 100. Although
In the preferred embodiment, each board 120, 122 also contains four port protocol devices (or PPDs) 130. These PPDs 130 can take a variety of known forms, including an ASIC, an FPGA, a daughter card, or even a plurality of chips found directly on the boards 120, 122. In the preferred embodiment, the PPDs 130 are ASICs, and can be referred to as the FCP ASICs, since they are primarily designed to handle Fibre Channel protocol data. Each PPD 130 manages and controls four ports 110. This means that each I/O board 120, 122 in the preferred embodiment contains sixteen Fibre Channel ports 110.
The I/O boards 120, 122 are connected to one or more crossbars 140 designed to establish a switched communication path between two ports 110. Although only a single crossbar 140 is shown, the preferred embodiment uses four or more crossbar devices 140 working together. In the preferred embodiment, crossbar 140 is cell-based, meaning that it is designed to switch small, fixed-size cells of data. This is true even though the overall switch 100 is designed to switch variable length Fibre Channel frames.
The Fibre Channel frames are received on a port, such as input port 112, and are processed by the port protocol device 130 connected to that port 112. The PPD 130 contains two major logical sections, namely a protocol interface module 150 and a fabric interface module 160. The protocol interface module 150 receives Fibre Channel frames from the ports 110 and stores them in temporary buffer memory. The protocol interface module 150 also examines the frame header for its destination ID and determines the appropriate output or egress port 114 for that frame. The frames are then submitted to the fabric interface module 160, which segments the variable-length Fibre Channel frames into fixed-length cells acceptable to crossbar 140.
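Segmentation of a variable-length frame into fixed-length cells, and the reverse operation at egress, can be sketched as follows. The 64-byte cell payload size and the zero padding of the final cell are assumptions chosen for illustration; cell headers are omitted.

```python
def segment(frame: bytes, cell_payload: int = 64):
    """Split a frame into fixed-length cell payloads.
    The last cell is padded with zero bytes (an assumed padding scheme)."""
    cells = []
    for offset in range(0, len(frame), cell_payload):
        chunk = frame[offset:offset + cell_payload]
        cells.append(chunk.ljust(cell_payload, b"\x00"))
    return cells

def reassemble(cells, frame_length: int) -> bytes:
    """Concatenate cell payloads and trim the padding using the known length."""
    return b"".join(cells)[:frame_length]

frame = bytes(range(150))            # a 150-byte stand-in for a frame
cells = segment(frame)               # three 64-byte cells
restored = reassemble(cells, len(frame))
```

Because the padding is indistinguishable from data, the sketch carries the original length alongside the cells, in the same spirit as the packet length passed with each packet.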
The fabric interface module 160 then transmits the cells to an ingress memory subsystem (iMS) 180. A single iMS 180 handles all frames received on the I/O board 120, regardless of the port 110 or PPD 130 on which the frame was received.
When the ingress memory subsystem 180 receives the cells that make up a particular Fibre Channel frame, it treats that collection of cells as a variable length packet. The iMS 180 assigns this packet a packet ID (or “PID”) that indicates the cell buffer address in the iMS 180 where the packet is stored. The PID and the packet length are then passed on to the ingress Priority Queue (iPQ) 190, which organizes the packets in iMS 180 into one or more queues, and submits those packets to crossbar 140. Before submitting a packet to crossbar 140, the iPQ 190 submits a “bid” to arbiter 170. When the arbiter 170 receives the bid, it configures the appropriate connection through crossbar 140, and then grants access to that connection to the iPQ 190. The packet length is used to ensure that the connection is maintained until the entire packet has been transmitted through the crossbar 140, although the connection can be terminated early.
A single arbiter 170 can manage four different crossbars 140. The arbiter 170 handles multiple simultaneous bids from all iPQs 190 in the switch 100, and can grant multiple simultaneous connections through crossbar 140. The arbiter 170 also handles conflicting bids, ensuring that no output port 114 receives data from more than one input port 112 at a time.
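The handling of conflicting bids can be sketched as follows. The first-bid-wins policy shown here is an assumption made only for illustration; the arbitration policy actually used by arbiter 170 is not described in this section.

```python
def arbitrate(bids):
    """Grant at most one ingress per egress port.

    bids -- list of (ingress, egress) pairs in arrival order (hypothetical layout)
    Returns (granted, rejected), where granted maps egress -> ingress.
    """
    granted = {}
    rejected = []
    for ingress, egress in bids:
        if egress in granted:
            rejected.append((ingress, egress))   # output already connected
        else:
            granted[egress] = ingress            # assumed policy: first bid wins
    return granted, rejected

granted, rejected = arbitrate([(1, 10), (2, 10), (3, 11)])
```

The two non-conflicting bids are granted simultaneously, while the second bid for output 10 must wait for a later arbitration cycle.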
The output or egress memory subsystem (eMS) 182 receives the data cells comprising the packet from the crossbar 140, and passes a packet ID to an egress priority queue (ePQ) 192. The egress priority queue 192 provides scheduling, traffic management, and queuing for communication between egress memory subsystem 182 and the PPD 130 in egress I/O board 122. When directed to do so by the ePQ 192, the eMS 182 transmits the cells comprising the Fibre Channel frame to the egress portion of PPD 130. The fabric interface module 160 then reassembles the data cells and presents the resulting Fibre Channel frame to the protocol interface module 150. The protocol interface module 150 stores the frame in its buffer, and then outputs the frame through output port 114.
In the preferred embodiment, crossbar 140 and the related components are part of a commercially available cell-based switch chipset, such as the nPX8005 or “Cyclone” switch fabric manufactured by Applied Micro Circuits Corporation of San Diego, Calif. More particularly, in the preferred embodiment, the crossbar 140 is the AMCC S8705 Crossbar product, the arbiter 170 is the AMCC S8605 Arbiter, the iPQ 190 and ePQ 192 are AMCC S8505 Priority Queues, and the iMS 180 and eMS 182 are AMCC S8905 Memory Subsystems, all manufactured by Applied Micro Circuits Corporation.
2. Port Protocol Device 130
a) Link Controller Module 300
The LCM 300 uses a SERDES chip (such as the Gigablaze SERDES available from LSI Logic Corporation, Milpitas, Calif.) to convert between the serial data used by the port 110 and the 10-bit parallel data used in the rest of the protocol interface 150. The LCM 300 performs all low-level link-related functions, including clock conversion, idle detection and removal, and link synchronization. The LCM 300 also performs arbitrated loop functions, checks frame CRC and length, and counts errors.
b) Memory Controller Module 310
The memory controller module 310 is responsible for storing the incoming data frame on the inbound frame buffer memory 320. Each port 110 on the PPD 130 is allocated a separate portion of the buffer 320. Alternatively, each port 110 could be given a separate physical buffer 320. This buffer 320 is also known as the credit memory, since the BB_Credit flow control between switch 100 and the upstream device is based upon the size or credits of this memory 320. The memory controller 310 identifies new Fibre Channel frames arriving in credit memory 320, and shares the frame's destination ID and its location in credit memory 320 with the inbound routing module 330.
The routing module 330 of the present invention examines the destination ID found in the frame header of the frames and determines the switch destination address (SDA) in switch 100 for the appropriate destination port 114. The router 330 is also capable of routing frames to the SDA associated with one of the microprocessors 124 in switch 100. In the preferred embodiment, the SDA is a ten-bit address that uniquely identifies every port 110 and processor 124 in switch 100. A single routing module 330 handles all of the routing for the PPD 130. The routing module 330 then provides the routing information to the memory controller 310.
As shown in
c) Queue Control Module 400
The queue control module 400 stores the routing results received from the inbound routing module 330. When the credit memory 320 contains multiple frames, the queue control module 400 decides which frame should leave the memory 320 next. In doing so, the queue module 400 utilizes procedures that avoid head-of-line blocking.
The queue control module 400 has four primary components, namely the deferred queue 402, the backup queue 404, the header select logic 406, and the XOFF mask 408. These components work in conjunction with the XON History register 420 and the cell credit manager or credit module 440 to control ingress queuing and to assist in managing flow control within switch 100. The deferred queue 402 stores the frame headers and locations in buffer memory 320 for frames waiting to be sent to a destination port 114 that is currently busy. The backup queue 404 stores the frame headers and buffer locations for frames that arrive at the input port 112 while the deferred queue 402 is sending deferred frames to their destination. The header select logic 406 determines the state of the queue control module 400, and uses this determination to select the next frame in credit memory 320 to be submitted to the FIM 160. To do this, the header select logic 406 supplies to the memory read module 350 a valid buffer address containing the next frame to be sent. The functioning of the backup queue 404, the deferred queue 402, and the header select logic 406 are described in the incorporated Fibre Channel Switch application.
The XOFF mask 408 contains a congestion status bit for each port 110 in the switch. The XON history register 420 is used to delay updating the XOFF mask 408 under certain conditions. These two components 408, 420 and their interaction with the cell credit manager 440 and FIM 160 are described in more detail below.
d) Fabric Interface Module 160
When a Fibre Channel frame is ready to be submitted to the ingress memory subsystem 180 of I/O board 120, the queue control 400 passes the frame's routed header and pointer to the memory read portion 350. This read module 350 then takes the frame from the credit memory 320 and provides it to the fabric interface module 160. The fabric interface module 160 converts the variable-length Fibre Channel frames received from the protocol interface 150 into fixed-sized data cells acceptable to the cell-based crossbar 140. Each cell is constructed with a specially configured cell header appropriate to the cell-based switch fabric. When using the Cyclone switch fabric of Applied Micro Circuits Corporation, the cell header includes a starting sync character, the switch destination address of the egress port 114 and a priority assignment from the inbound routing module 330, a flow control field and ready bit, an ingress class of service assignment, a packet length field, and a start-of-packet and end-of-packet identifier.
When necessary, the preferred embodiment of the fabric interface 160 creates fill data to compensate for the speed difference between the memory controller 310 output data rate and the ingress data rate of the cell-based crossbar 140. This process is described in more detail in the incorporated Fibre Channel Switch application.
Egress data cells are received from the crossbar 140 and stored in the egress memory subsystem 182. When these cells leave the eMS 182, they enter the egress portion of the fabric interface module 160. The FIM 160 then examines the cell headers, removes fill data, and concatenates the cell payloads to re-construct Fibre Channel frames with extended SOF/EOF codes. If necessary, the FIM 160 uses a small buffer to smooth gaps within frames caused by cell header and fill data removal.
In the preferred embodiment, there are multiple links between each PPD 130 and the iMS 180. Each separate link uses a separate FIM 160. Preferably, each port 110 on the PPD 130 is given a separate link to the iMS 180, and therefore each port 110 is assigned a separate FIM 160.
e) Outbound Processor Module 450
The FIM 160 then submits the frames to the outbound processor module (OPM) 450. A separate OPM 450 is used for each port 110 on the PPD 130. The outbound processor module 450 checks each frame's CRC, and handles the necessary buffering between the fabric interface 160 and the ports 110 to account for their different data transfer rates. The primary job of the outbound processor modules 450 is to handle data frames received from the cell-based crossbar 140 that are destined for one of the Fibre Channel ports 110. This data is submitted to the link controller module 300, which replaces the extended SOF/EOF codes with standard Fibre Channel SOF/EOF characters, performs 8b/10b encoding, and sends data frames through its SERDES to the Fibre Channel port 110.
The components of the PPD 130 can communicate with the microprocessor 124 on the I/O board 120, 122 through the microprocessor interface module (MIM) 360. Through the microprocessor interface 360, the microprocessor 124 can read and write registers on the PPD 130 and receive interrupts from the PPDs 130. This communication occurs over a microprocessor communication path 362. The microprocessor 124 also uses the microprocessor interface 360 to communicate with the ports 110 and with other processors 124 over the cell-based switch fabric.
3. Queues
a) Class of Service Queue 280
I/O Board 264 has a single egress memory subsystem 182 to hold all of the data received from the crossbar 140 (not shown) for its sixteen ports 110. The data in eMS 182 is controlled by the egress priority queue 192 (also not shown). In the preferred embodiment, the ePQ 192 maintains the data in the eMS 182 in a plurality of output class of service queues (O_COS_Q) 280. Data for each port 110 on the I/O Board 264 is kept in a total of “n” O_COS queues, with the number n reflecting the number of virtual channels 240 defined to exist with the ISL 230. When cells are received from the crossbar 140, the eMS 182 and ePQ 192 add the cell to the appropriate O_COS_Q 280 based on the destination SDA and priority value assigned to the cell. This information was placed in the cell header as the cell was created by the ingress FIM 160. The cells are then removed from the O_COS_Q 280 and are submitted to the PPD 262 for the egress port 114, which converts the cells back into a Fibre Channel frame and sends it across the ISL 230 to the downstream switch 270.
b) Virtual Output Queue 290
The frame enters switch 270 over the ISL 230 through ingress port 112. This ingress port 112 is actually the second port (labeled port 1) found on the first PPD 272 (labeled PPD 0) on the first I/O Board 274 (labeled I/O Board 0) on switch 270. Like the I/O board 264 on switch 260, this I/O board 274 contains a total of four PPDs 130, with each PPD 130 containing four ports 110. With a total of thirty-two I/O boards 120, 122, switch 270 has the same five hundred and twelve ports as switch 260.
When the frame is received at port 112, it is placed in credit memory 320. The D_ID of the frame is examined, and the frame is queued and a routing determination is made as described above. Assuming that the destination port on switch 270 is not XOFFed according to the XOFF mask 408 servicing input port 112, the frame will be subdivided into cells and forwarded to the ingress memory subsystem 180.
The iMS 180 is organized and controlled by the ingress priority queue 190, which is responsible for ensuring in-order delivery of data cells and packets. To accomplish this, the iPQ 190 organizes the data in its iMS 180 into a number (“m”) of different virtual output queues (V_O_Qs) 290. To avoid head-of-line blocking, a separate V_O_Q 290 is established for every destination within the switch 270. In switch 270, this means that there are at least five hundred forty-four V_O_Qs 290 (five hundred twelve physical ports 110 and thirty-two microprocessors 124) in iMS 180. The iMS 180 places incoming data on the appropriate V_O_Q 290 according to the switch destination address assigned to that data by the routing module 330 in PPD 272.
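The queue count and the placement of data by switch destination address can be sketched as follows. The constants follow the figures given above; the class name `IngressMemory` and the assumption that destination addresses are numbered 0 through 543 are made for illustration only.

```python
from collections import defaultdict, deque

IO_BOARDS = 32
PPDS_PER_BOARD = 4
PORTS_PER_PPD = 4
MICROPROCESSORS = 32

NUM_PORTS = IO_BOARDS * PPDS_PER_BOARD * PORTS_PER_PPD   # physical ports
NUM_VOQS = NUM_PORTS + MICROPROCESSORS                    # one queue per destination

class IngressMemory:
    """Hypothetical model of an iMS holding one virtual output queue per destination."""

    def __init__(self):
        self.voqs = defaultdict(deque)

    def enqueue(self, sda, cell):
        # Assumption: destinations are numbered 0 .. NUM_VOQS - 1.
        if not 0 <= sda < NUM_VOQS:
            raise ValueError("unknown switch destination address")
        self.voqs[sda].append(cell)

    def dequeue(self, sda):
        """Remove the oldest cell for sda, or return None if its queue is empty."""
        queue = self.voqs[sda]
        return queue.popleft() if queue else None

ims = IngressMemory()
ims.enqueue(7, "cell-a")
ims.enqueue(300, "cell-b")
ims.enqueue(7, "cell-c")
```

Data for destination 300 can be removed without waiting on the two cells queued for destination 7, which is how separate queues avoid head-of-line blocking.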
Data in the V_O_Qs 290 is handled like the data in O_COS_Qs 280, such as by using round robin servicing. When data is removed from a V_O_Q 290, it is submitted to the crossbar 140 and provided to an eMS 182 on the switch 270.
c) Virtual Input Queue 282
4. Flow Control in Switch
a) XOFF Flow Control between iMS 180 and eMS 182
The cell-based switch fabric used in the preferred embodiment of the present invention can be considered to include the memory subsystems 180, 182, the priority queues 190, 192, the cell-based crossbar 140, and the arbiter 170. As described above, these elements can be obtained commercially from companies such as Applied Micro Circuits Corporation. This switch fabric utilizes a variety of flow control mechanisms to prevent internal buffer overflows, to control the flow of cells into the cell-based switch fabric, and to receive flow control instructions to stop cells from exiting the switch fabric.
XOFF internal flow control within the cell-based switch fabric is shown as communication 500 in
This flow control works as follows. When cell occupancy of an O_COS_Q 280 reaches a threshold, an XOFF signal is generated internal to the switch fabric to stop transmission of data from the iMS 180 to these O_COS_Qs 280. The preferred Cyclone switch fabric uses three different thresholds, namely a routine threshold, an urgent threshold, and an emergency threshold. Each threshold creates a corresponding type of XOFF signal to the iMS 180.
Unfortunately, since the V_O_Qs 290 in iMS 180 are not organized into the individual class of services for each possible output port 114, the XOFF signal generated by the eMS 182 cannot simply turn off data for a single O_COS_Q 280. In fact, due to the manner in which the cell-based fabric addresses individual ports, the XOFF signal is not even specific to a single congested port 110. Rather, in the case of the routine XOFF signal, the iMS 180 responds by stopping all cell traffic to the group of four ports 110 found on the PPD 130 that contains the congested egress port 114. Urgent and Emergency XOFF signals cause the iMS 180 and Arbiter 170 to stop all cell traffic to the affected egress I/O board 122. In the case of routine and urgent XOFF signals, the eMS 182 is able to accept additional packets of data before the iMS 180 stops sending data. Emergency XOFF signals mean that new packets arriving at the eMS 182 will be discarded.
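The three threshold levels and the scope of traffic each one stops can be sketched as follows. The threshold values passed in, the function names, and the assumption that ports are numbered consecutively so that each PPD holds four adjacent ports and each I/O board holds sixteen are all illustrative.

```python
def xoff_level(occupancy, routine, urgent, emergency):
    """Classify queue occupancy against three ascending thresholds."""
    if occupancy >= emergency:
        return "emergency"
    if occupancy >= urgent:
        return "urgent"
    if occupancy >= routine:
        return "routine"
    return None

def stopped_ports(level, egress_port, ports_per_ppd=4, ports_per_board=16):
    """Return the range of egress ports to which cell traffic is stopped.
    Assumes consecutive port numbering within a PPD and within an I/O board."""
    if level is None:
        return range(0)
    # Routine XOFF stops the four ports on the PPD; urgent and
    # emergency XOFF stop the whole egress I/O board.
    group = ports_per_ppd if level == "routine" else ports_per_board
    base = egress_port - (egress_port % group)
    return range(base, base + group)

routine_scope = list(stopped_ports("routine", egress_port=6))
urgent_scope = list(stopped_ports("urgent", egress_port=22))
```

Even the mildest level therefore stops traffic to three uncongested neighbors of the congested port, which is the lack of per-port precision noted above.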
b) Backplane Credit Flow Control
The iPQ 190 also uses a backplane credit flow control 510 (shown in
Note that even though only a single O_COS_Q 280 is not sending data, the iPQ 190 only maintains credits on a port 110 basis, not a class of service basis. Thus, the affected iPQ 190 will stop sending all data to the port 114, including data with a different class of service that could be transmitted over the port 114. In addition, since the iPQ 190 services an entire I/O board 120, all traffic to that egress port 114 from any of the ports 110 on that board 120 is stopped. Other iPQs 190 on other I/O boards 120, 122 can continue sending packets to the same egress port 114 as long as those other iPQs 190 have backplane credits for that port 114.
Thus, the backplane credit system 510 can provide some internal switch flow control from ingress to egress on the basis of a virtual channel 240, but it is inconsistent. If two ingress ports 112 on two separate I/O boards 120, 122 are each sending data to different virtual channels 240 on the same ISL 230, the use of backplane credits will flow control those channels 240 differently. One of those virtual channels 240 might have an XOFF condition. Packets to that O_COS_Q 280 will back up, and backplane credits will not be returned. The lack of backplane credits will cause the iPQ 190 sending to the XOFFed virtual channel 240 to stop sending data. Assuming the other virtual channel does not have an XOFF condition, credits from its O_COS_Q 280 to the other iPQ 190 will continue, and data will flow through that channel 240. However, if the two ingress ports 112 sending to the two virtual channels 240 utilize the same iPQ 190, the lack of returned backplane credits from the XOFFed O_COS_Q 280 will stop traffic to all virtual channels 240 on the ISL 230.
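The per-port credit accounting that produces this inconsistency can be sketched as follows. The class name, the credit limit of two, and the method names are assumptions for illustration; the sketch models only the property that each iPQ keeps its own credit count per egress port.

```python
class IngressPriorityQueue:
    """Hypothetical iPQ keeping backplane credits per egress port,
    with no distinction between classes of service."""

    def __init__(self, credits_per_port):
        self.limit = credits_per_port
        self.credits = {}   # egress port -> remaining credits

    def send(self, port):
        """Send one packet to port if a credit is available."""
        remaining = self.credits.get(port, self.limit)
        if remaining == 0:
            return False    # blocked for every class of service on this port
        self.credits[port] = remaining - 1
        return True

    def credit_returned(self, port):
        """The egress side returned a credit for port."""
        remaining = self.credits.get(port, self.limit)
        self.credits[port] = min(self.limit, remaining + 1)

board_a = IngressPriorityQueue(credits_per_port=2)
board_b = IngressPriorityQueue(credits_per_port=2)
results_a = [board_a.send(9), board_a.send(9), board_a.send(9)]
result_b = board_b.send(9)
```

Here the first iPQ exhausts its credits for port 9 and is blocked, while the iPQ on the other board still holds credits for the same port and continues to send.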
c) Input to Fabric Flow Control 520
The cell-based switch fabric must be able to stop the flow of data from its data source (i.e., the FIM 160) whenever the iMS 180 or a V_O_Q 290 maintained by the iPQ 190 is becoming full. The switch fabric signals this XOFF condition by setting the RDY (ready) bit to 0 on the cells it returns to the FIM 160, shown as 520 on
There are three situations where the switch fabric may request an XOFF or XON state change. In every case, flow control cells 520 are sent by the eMS 182 to the egress portion of the FIM 160 to inform the PPD 130 of this updated state. These flow control cells use the RDY bit in the cell header to indicate the current status of the iMS 180 and its related queues 290.
In the first of the three different situations, the iMS 180 may fill up to its threshold level. In this case, no more traffic should be sent to the iMS 180. When a FIM 160 receives the flow control cells 520 indicating this condition, it sends a congestion signal (or “gross_xoff” signal) 522 to the XOFF mask 408 in the memory controller 310. This signal informs the MCM 310 to stop all data traffic to the iMS 180, as described in more detail below. The FIM 160 will also broadcast an external signal called STOP_ALL 164 to the FIMs 160 on its PPD 130, as well as to the other three PPDs 130 on its I/O board 120. The STOP_ALL congestion signal 164 may take the same form as the gross_xoff congestion signal 522, or it may be differently formatted. The interconnection between the PPDs 130 and the STOP_ALL signal 164 is shown in
In the second case, a single V_O_Q 290 in the iMS 180 fills up to its threshold. When this occurs, the signal 520 back to the PPD 130 will behave just as it did in the first case, with the generation of a gross_xoff congestion signal 522 and a STOP_ALL congestion signal 164. Thus, the entire iMS 180 stops receiving data, even though only a single V_O_Q 290 has become congested.
The third case involves a failed link between a FIM 160 and the iMS 180. Flow control cells indicating this condition will cause a gross_xoff signal 522 to be sent only to the MCM 310 for the corresponding FIM 160. No STOP_ALL signal 164 is sent in this situation.
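By way of illustration only, the three situations above can be summarized in a short Python sketch. The names `FabricCondition` and `congestion_signals` are hypothetical and do not appear in the described hardware; the sketch merely records which of the two congestion signals the FIM 160 raises in each case.

```python
from enum import Enum, auto


class FabricCondition(Enum):
    """Hypothetical labels for the three situations described above."""
    IMS_FULL = auto()      # first case: the iMS 180 reaches its threshold
    VOQ_FULL = auto()      # second case: a single V_O_Q 290 reaches its threshold
    LINK_FAILED = auto()   # third case: failed link between a FIM 160 and the iMS 180


def congestion_signals(condition):
    """Return (gross_xoff, stop_all) as raised by the FIM for a condition."""
    if condition in (FabricCondition.IMS_FULL, FabricCondition.VOQ_FULL):
        # Both cases stop the entire iMS: gross_xoff 522 plus STOP_ALL 164.
        return (True, True)
    if condition is FabricCondition.LINK_FAILED:
        # Only the MCM 310 of the corresponding FIM is told; no STOP_ALL.
        return (True, False)
    raise ValueError(f"unknown condition: {condition!r}")
```

In this sketch the first two cases are indistinguishable at the signal level, which mirrors the observation that the entire iMS 180 stops receiving data even when only one V_O_Q 290 is congested.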
d) Output from Fabric Flow Control 530
When an egress portion of a PPD 130 wishes to stop traffic coming from the eMS 182, it signals an XOFF to the switch fabric by sending a cell from the input FIM 160 to the iMS 180, which is shown as flow control 530 on
The PPD 130 might desire to stop the flow of data from the eMS 182 for several reasons. First, an internal buffer within the egress portion of the FIM 160 may be approaching an overflow condition. Second, the egress portion of the PIM 150 may have received a switch-to-switch flow control signal. This signal may request stopping the flow of data over the entire link. Alternatively, the signal may reflect only a desire to stop traffic over a particular virtual channel 240 on a link. Regardless of the reason, when the FIM 160 needs to stop data traffic from the eMS 182, the FIM 160 sends an XOFF to the switch fabric in an ingress cell header directed toward iMS 180. The iMS 180 extracts each XOFF instruction from the cell header, and sends it to the eMS 182, directing the eMS 182 to XOFF or XON a particular O_COS_Q 280. If the O_COS_Q 280 is sending a packet to the FIM 160, it finishes sending the packet. The eMS 182 then stops sending fabric-to-port or fabric-to-microprocessor packets to the FIM 160.
5. Congestion Notification
a) XOFF Mask 408
The XOFF mask 408 shown in
Each XOFF mask 408 contains a separate status bit for all destinations within the switch 100. In one embodiment of the switch 100, there are five hundred and twelve physical ports 110 and thirty-two microprocessors 124 that can serve as a destination for a frame. Hence, the XOFF mask 408 uses a 544 by 1 look up table 410 to store the “XOFF” status of each destination. If a bit in XOFF look up table 410 is set, the port 110 corresponding to that bit is busy and cannot receive any frames.
In the preferred embodiment, the XOFF mask 408 returns a status for a destination by first receiving the switch destination address for that port 110 or microprocessor 124 on SDA input 412. The look up table 410 is examined for the SDA on input 412, and if the corresponding bit is set, the XOFF mask 408 asserts a signal on “defer” output 414, which indicates to the rest of the queue control module 400 that the selected port 110 or processor 124 is busy. This construction of the XOFF mask 408 is the preferred way to store the congestion status of possible destinations at each port 110. Other ways are possible, as long as they can quickly respond to a status query about a destination with the congestion status for that destination.
In the preferred embodiment, the output of the XOFF look up table 410 is not the sole source for the defer signal 414. In addition, the XOFF mask 408 receives the gross_xoff signal 522 from its associated FIM 160. This signal 522 is ORed with the output of the lookup table 410 in order to generate the defer signal 414. This means that whenever the gross_xoff signal 522 is set, the defer signal 414 will also be set, effectively stopping all traffic to the iMS 180. In another embodiment (not shown), a force defer signal that is controlled by the microprocessor 124 is also able to cause the defer signal 414 to go on. When the defer signal 414 is set, it informs the header select logic 406 and the remaining elements of the queue module 400 that the port 110 having the address on next frame header output 415 is congested, and this frame should be stored on the deferred queue 402.
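The behavior of the XOFF mask 408 described above can be modeled, purely as a software sketch, in Python. The class `XoffMask` and its method names are assumptions made for illustration; the actual mask is a hardware lookup table. The sketch shows a 544-entry, one-bit-per-destination table whose output is ORed with the gross_xoff signal 522 to produce the defer signal 414.

```python
NUM_PORTS = 512            # physical ports 110 in the described embodiment
NUM_MICROPROCESSORS = 32   # microprocessors 124 that can be destinations
NUM_DESTINATIONS = NUM_PORTS + NUM_MICROPROCESSORS  # 544 entries


class XoffMask:
    """Illustrative software model of the XOFF mask 408 (names are hypothetical)."""

    def __init__(self, num_destinations=NUM_DESTINATIONS):
        # One status bit per destination; True means XOFF (busy).
        self.table = [False] * num_destinations
        # Models the gross_xoff signal 522 from the associated FIM 160.
        self.gross_xoff = False

    def set_status(self, sda, xoff):
        """Record the XOFF/XON status for one switch destination address."""
        self.table[sda] = xoff

    def defer(self, sda):
        """Model of defer output 414: table bit ORed with gross_xoff."""
        return self.table[sda] or self.gross_xoff
```

A query for any destination returns a defer indication whenever `gross_xoff` is set, regardless of the table contents, which corresponds to stopping all traffic to the iMS 180.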
b) XON History Register 420
The XON history register 420 is used to record the history of the XON status of all destinations in the switch 100. Under the procedure established for deferred queuing, the XOFF mask 408 cannot be updated with an XON event when the queue control 400 is servicing deferred frames in the deferred queue 402. During that time, whenever a port 110 changes status from XOFF to XON, the XOFF mask 408 will ignore (or not receive) the XOFF signal 452 from the cell credit manager 440 and will therefore not update its lookup table 410. The signal 452 from the cell credit manager 440 will, however, update the lookup table 422 within the XON history register 420. Thus, the XON history register 420 maintains the current XON status of all ports 110. When the update signal 416 is made active by the header select 406 portion of the queue control module 400, the entire content of the lookup table 422 in the XON history register 420 is transferred to the lookup table 410 of the XOFF mask 408. Registers within the table 422 containing a zero (having a status of XON) will cause corresponding registers within the XOFF mask lookup table 410 to be reset to zero. The dual register setup allows for XOFFs to be written directly to the XOFF mask 408 at any time the cell credit manager 440 requires traffic to be halted, and causes XONs to be applied only when the logic within the queue control module 400 allows for a change to an XON value. While a separate queue control module 400 and its associated XOFF mask 408 are necessary for each port 110 in the PPD 130, only one XON history register 420 is necessary to service all four ports 110 in the PPD 130, which again is shown in
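The interplay between the two lookup tables 410, 422 may be easier to follow in a small Python model. This is a sketch under stated assumptions: the class `MaskWithHistory` and the flag `servicing_deferred` are hypothetical names, and the flag stands in for whatever logic in the queue control module 400 blocks XON updates while the deferred queue 402 is being serviced.

```python
class MaskWithHistory:
    """Illustrative model of an XOFF mask 408 paired with the XON history register 420."""

    def __init__(self, num_destinations=544):
        self.xoff_mask = [0] * num_destinations     # models lookup table 410
        self.xon_history = [0] * num_destinations   # models lookup table 422
        # Assumption: a simple flag represents "deferred frames are being serviced".
        self.servicing_deferred = False

    def on_event(self, sda, xoff):
        """Apply an XOFF/XON event 452 from the cell credit manager 440."""
        # The history register always records the current status.
        self.xon_history[sda] = 1 if xoff else 0
        if xoff:
            # XOFFs may be written to the mask at any time.
            self.xoff_mask[sda] = 1
        elif not self.servicing_deferred:
            # XONs reach the mask only when the queue logic permits.
            self.xoff_mask[sda] = 0

    def update(self):
        """Model of update signal 416: copy table 422 into table 410."""
        self.xoff_mask = list(self.xon_history)
```

An XON that arrives while `servicing_deferred` is set is held in the history table only, and takes effect in the mask when `update()` is invoked.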
c) Cell Credit Manager 440
The cell credit manager or credit module 440 sets the XOFF/XON status of the possible destination ports 110 in the lookup tables 410, 422 of the XOFF mask 408 and the XON history register 420. To update these tables 410, 422, the cell credit manager 440 maintains a cell credit count of every cell in the virtual output queues 290 of the iMS 180. Every time a cell addressed to a particular SDA leaves the FIM 160 and enters the iMS 180, the FIM 160 informs the credit module 440 through a cell credit event signal 442. The credit module 440 then decrements the cell count for that SDA. Every time a cell for that destination leaves the iMS 180, the credit module 440 is again informed and adds a credit to the count for the associated SDA. The iPQ 190 sends this credit information back to the credit module 440 by sending a cell containing the cell credit back to the FIM 160 through the eMS 182. The FIM 160 then sends an increment credit signal 442 to the cell credit manager 440. This cell credit flow control is designed to prevent the occurrence of more drastic levels of flow control from within the cell-based switch fabric described above, since these flow control signals 500-520 can result in multiple blocked ports 110, shutting down an entire iMS 180, or even the loss of data.
In the preferred embodiment, the cell credits are tracked through increment and decrement credit events 442 received from FIM 160. These events are stored in dedicated increment FIFOs 444 and decrement FIFOs 446. Each FIM 160 is associated with a separate increment FIFO 444 and a separate decrement FIFO 446, although ports 1-3 are shown as sharing FIFOs 444, 446 for the sake of simplicity. Decrement FIFOs 446 contain SDAs for cells that have entered the iMS 180. Increment FIFOs 444 contain SDAs for cells that have left the iMS 180. These FIFOs 444, 446 are handled in round robin format, decrementing and incrementing the credit count that the credit module 440 maintains for each SDA in its cell credit accumulator 447. In the preferred embodiment, the cell credit accumulator 447 is able to handle one increment event from one of the FIFOs 444 and one decrement event from one of the FIFOs 446 at the same time. An event select logic services the FIFOs 444, 446 in a round robin manner while monitoring the status of each FIFO 444, 446 so as to avoid giving access to the accumulator 447 to empty FIFOs 444, 446.
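One possible software analogue of the event select logic is sketched below in Python. The names `RoundRobinSelector` and `service_cycle` are hypothetical, and the sketch assumes that each FIFO simply holds SDAs as integers. It services the FIFOs in round robin order, skips empty FIFOs, and applies at most one increment and one decrement per cycle.

```python
from collections import deque


class RoundRobinSelector:
    """Round robin selection over a list of FIFOs, skipping empty ones."""

    def __init__(self, fifos):
        self.fifos = fifos
        self.next_index = 0  # FIFO to be examined first on the next cycle

    def select(self):
        """Return the next SDA from a non-empty FIFO, or None if all are empty."""
        n = len(self.fifos)
        for offset in range(n):
            i = (self.next_index + offset) % n
            if self.fifos[i]:
                self.next_index = (i + 1) % n
                return self.fifos[i].popleft()
        return None


def service_cycle(counts, inc_selector, dec_selector):
    """Apply at most one increment and one decrement event to the counts."""
    inc_sda = inc_selector.select()
    dec_sda = dec_selector.select()
    if inc_sda is not None:
        counts[inc_sda] += 1   # a cell for this SDA has left the iMS
    if dec_sda is not None:
        counts[dec_sda] -= 1   # a cell for this SDA has entered the iMS
```

Keeping separate selectors for the increment and decrement FIFOs is one way to model the accumulator 447 handling one event of each kind at the same time.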
The accumulator 447 maintains separate credit counts for each SDA, with each count reflecting the number of cells contained within the iMS 180 for a given SDA. A compare module 448 detects when the count for an SDA within accumulator 447 crosses an XOFF or XON threshold stored in threshold memory 449. When a threshold is crossed, the compare module 448 causes a driver to send the appropriate XOFF or XON event 452 to the XOFF mask 408 and the XON history register 420. If the count gets too low, then that SDA is XOFFed. This means that Fibre Channel frames that are to be routed to that SDA are held in the credit memory 320 by queue control module 400. After the SDA is XOFFed, the credit module 440 waits for the count for that SDA to rise to a certain level, and then the SDA is XONed, which instructs the queue control module 400 to release frames for that destination from the credit memory 320. The XOFF and XON thresholds in threshold memory 449 can be different for each individual SDA, and are programmable by the processor 124.
When an XOFF event or an XON event occurs, the credit module 440 sends an XOFF instruction 452 to the XON history register 420 and all four XOFF masks 408 in its PPD 130. In the preferred embodiment, the XOFF instruction 452 is a three-part signal identifying the SDA, the new XOFF status, and a validity signal.
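The threshold comparison and the three-part XOFF instruction 452 can be illustrated with the following Python sketch. The names `XoffInstruction` and `CreditAccumulator` are hypothetical. For brevity the sketch assumes a single pair of thresholds and a single starting credit value shared by all SDAs, whereas the described threshold memory 449 may hold different values for each SDA; it also assumes that a threshold is "crossed" when the count reaches it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class XoffInstruction:
    """Three-part signal: destination, new XOFF status, validity."""
    sda: int
    xoff: bool
    valid: bool = True


class CreditAccumulator:
    """Illustrative per-SDA credit counts with XOFF/XON threshold comparison."""

    def __init__(self, full_credit, xoff_threshold, xon_threshold):
        self.full_credit = full_credit
        self.xoff_threshold = xoff_threshold  # count at or below this: XOFF
        self.xon_threshold = xon_threshold    # count at or above this: XON
        self.counts = {}
        self.xoffed = set()

    def _compare(self, sda):
        count = self.counts[sda]
        if sda not in self.xoffed and count <= self.xoff_threshold:
            self.xoffed.add(sda)
            return XoffInstruction(sda, True)
        if sda in self.xoffed and count >= self.xon_threshold:
            self.xoffed.discard(sda)
            return XoffInstruction(sda, False)
        return None  # no threshold crossed, nothing to send

    def decrement(self, sda):
        """A cell for this SDA entered the iMS: consume one credit."""
        self.counts[sda] = self.counts.get(sda, self.full_credit) - 1
        return self._compare(sda)

    def increment(self, sda):
        """A cell for this SDA left the iMS: return one credit."""
        self.counts[sda] = self.counts.get(sda, self.full_credit) + 1
        return self._compare(sda)
```

Using distinct XOFF and XON thresholds gives hysteresis, so that a destination hovering near a single threshold does not generate a stream of alternating events.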
In the above description, each cell credit manager 440 receives communications from the FIMs 160 on its PPD 130 regarding the cells that each FIM 160 submits to the iMS 180. The FIMs 160 also report back to the cell credit manager 440 when those cells are submitted by the iMS 180 over the crossbar 140. As long as the system works as described, the cell credit managers 440 are able to track the status of all cells submitted to the iMS 180. Even though each cell credit manager 440 is only tracking cells related to its PPD 130 (approximately one fourth of the total cells passing through the iMS 180), this information could be used to implement a useful congestion notification system.
Unfortunately, the preferred embodiment ingress memory system 180 manufactured by AMCC does not return cell credit information to the same FIM 160 that submitted the cell. In fact, the cell credit relating to a cell submitted by the first FIM 160 on the first PPD 130 might be returned by the iMS 180 to the last FIM 160 on the last PPD 130. Consequently, the cell credit managers 440 cannot assume that each decrement credit event 442 they receive relating to a cell entering the iMS 180 will ever result in a related increment credit event 442 being returned to it when that cell leaves the iMS 180. The increment credit event 442 may very well end up at another cell credit manager 440.
To overcome this issue, an alternative embodiment of the present invention has the four cell credit managers 440 on an I/O board 120, 122 combine their cell credit events 442 in a master/slave relationship. In this embodiment, each board 120, 122 has a single “master” cell credit manager 441 and three “slave” cell credit managers 440. When a slave unit 440 receives a cell credit event signal 442 from a FIM 160, the signal 442 is forwarded to the master cell credit manager 441 over a special XOFF bus 454 (as seen in
The master cell credit manager 441 is solely responsible for maintaining the credit counts and for comparing the credit counts with the threshold values stored in its threshold memory 449. When a threshold is crossed, the master unit 441 sends an XOFF or XON event 452 to its associated XON history register 420 and XOFF masks 408. In addition, the master unit 441 sends an instruction to the slave cell credit managers 440 to send the same XOFF or XON event 452 to their XON history registers 420 and XOFF masks 408. In this manner, the four cell credit managers 440, 441 send the same XOFF/XON event 452 to all four XON history registers 420 and all sixteen XOFF masks 408 on the I/O board 120, 122, effectively unifying the cell credit congestion notification across the board 120, 122.
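The master/slave arrangement can be sketched in Python as follows. The class and method names are hypothetical, the XOFF bus 454 is modeled as a direct method call, and the list `delivered` stands in for the XON history register 420 and XOFF masks 408 on each PPD 130. A single threshold pair is assumed for all SDAs.

```python
class SlaveCreditManager:
    """Forwards credit events to the master; relays congestion events locally."""

    def __init__(self, master):
        self.master = master
        self.delivered = []  # events sent to the local history register and masks

    def on_credit_event(self, sda, delta):
        # Forwarded to the master (a direct call models the XOFF bus 454).
        self.master.on_credit_event(sda, delta)

    def deliver(self, sda, xoff):
        self.delivered.append((sda, xoff))


class MasterCreditManager:
    """Sole keeper of the credit counts and threshold comparisons."""

    def __init__(self, full_credit, xoff_threshold, xon_threshold):
        self.full_credit = full_credit
        self.xoff_threshold = xoff_threshold
        self.xon_threshold = xon_threshold
        self.counts = {}
        self.xoffed = set()
        self.slaves = []
        self.delivered = []

    def on_credit_event(self, sda, delta):
        """delta is -1 for a decrement event and +1 for an increment event."""
        count = self.counts.get(sda, self.full_credit) + delta
        self.counts[sda] = count
        if sda not in self.xoffed and count <= self.xoff_threshold:
            self.xoffed.add(sda)
            self._broadcast(sda, True)
        elif sda in self.xoffed and count >= self.xon_threshold:
            self.xoffed.discard(sda)
            self._broadcast(sda, False)

    def _broadcast(self, sda, xoff):
        self.delivered.append((sda, xoff))   # the master's own PPD
        for slave in self.slaves:            # instruct each slave to do the same
            slave.deliver(sda, xoff)
```

Because only the master holds counts, an increment event arriving at a different PPD than the one that produced the matching decrement event is still applied to the same count.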
Due to error probabilities, there is a possibility that the cell credit counts in accumulator 447 may drift from actual values over time. The present invention overcomes this issue by periodically re-syncing these counts. To do this, the FIM 160 toggles a ‘state’ bit in the headers of all cells sent to the iMS 180 to reflect a transition in the system's state. At the same time, the credit counters in cell credit accumulator 447 are restored to full credit. Since each of the cell credits returned from the iMS 180/eMS 182 includes an indication of the value of the state bit in the cell, it is possible to differentiate credits relating to cells sent before the state change. Any credits received by the FIM 160 that do not have the proper state bit are ignored. After the iMS 180 recognizes the state change, credits will only be returned for those cells indicating the new state. In the preferred embodiment, this changing of the state bit and the re-syncing of the credit in cell credit accumulator 447 occurs approximately every eight minutes, although this time period is adjustable under the control of the processor 124.
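The re-syncing procedure can be illustrated with a minimal Python sketch. The class `ResyncingCounter` is a hypothetical name, and the sketch tracks a single credit count rather than one per SDA. It shows how toggling the state bit and restoring full credit allows credits for cells sent before the transition to be recognized and ignored.

```python
class ResyncingCounter:
    """Illustrative credit counter with a state bit used for periodic re-sync."""

    def __init__(self, full_credit):
        self.full_credit = full_credit
        self.count = full_credit
        self.state_bit = 0

    def send_cell(self):
        """Consume a credit; return the state bit stamped in the cell header."""
        self.count -= 1
        return self.state_bit

    def credit_returned(self, state_bit):
        """Accept a returned credit only if it carries the current state bit."""
        if state_bit != self.state_bit:
            return False  # credit for a cell sent before the state change
        self.count += 1
        return True

    def resync(self):
        """Toggle the state bit and restore the counter to full credit."""
        self.state_bit ^= 1
        self.count = self.full_credit
```

Ignoring stale credits after `resync()` prevents the count from exceeding full credit when credits for cells sent under the old state arrive late.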
The many features and advantages of the invention are apparent from the above description. Numerous modifications and variations will readily occur to those skilled in the art. For instance, persons of ordinary skill could easily reconfigure the various components described above into different elements, each of which has a slightly different functionality than those described. The component reconfigurations do not fundamentally alter the present invention. Since such modifications are possible, the invention is not to be limited to the exact construction and operation illustrated and described. Rather, the present invention should be limited only by the following claims.
Claims
1. A method for congestion notification within a switch comprising:
- a) maintaining a plurality of lookup tables having multiple entries, each entry containing a congestion status for a different destination in the switch;
- b) sending a congestion update to the plurality of lookup tables, the congestion update containing a destination identifier and an updated congestion status; and
- c) updating the entry in the lookup table corresponding to the destination identifier using the updated congestion status.
2. The method of claim 1, wherein each lookup table contains an entry for all available destinations in the switch.
3. The method of claim 2, wherein a separate lookup table is maintained at each ingress to the switch.
4. The method of claim 1, wherein the switch is a Fibre Channel switch.
5. The method of claim 1, further comprising:
- d) maintaining an indicator of an amount of data within a buffer for each destination; and
- e) triggering the sending of the congestion update when the indicator passes a threshold value.
6. The method of claim 5, wherein a credit module maintains the indicators and sends the congestion updates.
7. The method of claim 6, wherein the credit module uses a single indicator for each destination to track data entering the switch from a plurality of ingress ports.
8. The method of claim 7, wherein
- i) data from each ingress port passes through a fabric module before entering the buffer,
- ii) each fabric module submits a first credit event to the credit module for each grouping of data submitted to the buffer, and
- iii) the credit module uses the first credit event to alter the indicator so as to reflect additional data entering the buffer.
9. The method of claim 8, wherein
- iv) the buffer informs the fabric module each time a grouping of data leaves the buffer,
- v) the fabric module responds to such information from the buffer by submitting a second credit event to the credit module, and
- vi) the credit module uses the second credit event to alter the indicator so as to reflect data leaving the buffer.
10. The method of claim 9, wherein the first credit event is a decrement event decreasing a value of the indicator, and the second credit event is an increment event increasing the value of the indicator.
11. The method of claim 9, wherein a plurality of fabric modules submit first and second credit events to the credit module, which stores the credit events in a plurality of FIFOs.
12. The method of claim 11, wherein the credit events are retrieved from the FIFOs and applied to the indicator.
13. The method of claim 8, wherein each lookup table responds to a switch destination address by returning the congestion status for the destination associated with the switch destination address.
14. The method of claim 13, wherein the congestion status returned by the lookup table is combined with a congestion indicator generated by the fabric module to return a final congestion status for the switch destination address.
15. The method of claim 14, wherein a first fabric module shares the congestion indicator with a second fabric module within the switch, with the second fabric module submitting the congestion indicator to at least one additional lookup table.
16. The method of claim 7, wherein each congestion update from the credit module is sent to the lookup tables used by the plurality of ingress ports for which the credit module maintains the indicators.
17. The method of claim 16, wherein the credit module is a master credit module, further comprising a plurality of slave credit modules each serving a different subset of ingress ports on the switch.
18. The method of claim 17, wherein each slave credit module receives information on the data entering the buffer from its own subset of ingress ports and forwards that information to the master credit module.
19. The method of claim 18, wherein the master credit module uses the information received from the slave credit modules to maintain the indicators, and furthermore wherein the master credit module directs the slave credit modules to submit congestion updates to their subset of served ports.
20. The method of claim 5, wherein different threshold values are maintained for different destinations, and further wherein the grouping of data is a fixed-sized data cell.
21. The method of claim 1, wherein each lookup table responds to a switch destination address by returning the congestion status for the destination associated with the switch destination address.
22. The method of claim 21, wherein the congestion status returned by the lookup table is combined with a congestion indicator to return a final congestion status.
23. A method for congestion notification within a switch comprising:
- a) maintaining at each ingress port a lookup table having multiple entries, each entry containing a congestion status for a different destination in the switch, each lookup table containing entries for all available destinations in the switch, each lookup table returning the congestion status in response to a status query for a particular destination;
- b) maintaining at a first module an indicator of an amount of data submitted for each destination; and
- c) when the indicator passes a threshold value, sending a congestion update from the first module to a first lookup table, the congestion update containing a destination identifier and an updated congestion status; and
- d) updating the entry in the first lookup table corresponding to the destination identifier using the updated congestion status.
24. The method of claim 23, wherein the first module services a plurality of ports and their associated lookup tables, with all data passing through the serviced ports being reflected in the indicators of the first module.
25. The method of claim 24, wherein data from each serviced port passes through a separate second module, each second module submitting credit events to the first module reflecting data being submitted to and exiting a memory subsystem.
26. The method of claim 25, wherein cell credit events are stored by the first module in FIFOs to be later applied to the indicators for each destination.
27. The method of claim 25, wherein the congestion status returned by the lookup table is combined with a congestion signal generated by the second module to return a final congestion status.
28. The method of claim 27, wherein the congestion signal is in response to an XOFF/XON signal from the memory subsystem.
29. The method of claim 23, wherein different threshold values are maintained for different destinations.
30. A method for sharing congestion information in a switch comprising:
- a) interfacing with an ingress memory subsystem for a crossbar component through a plurality of fabric interfaces, the crossbar component handling data in predefined units;
- b) associating a set of fabric interfaces to a credit module;
- c) transmitting a first data event from one of the fabric interfaces to the credit module when a unit of data for a destination is submitted to the ingress memory subsystem, the first data event identifying the destination;
- d) transmitting a second data event from one of the fabric interfaces to the credit module when the ingress memory subsystem informs the fabric interface that a unit of data has been submitted to the crossbar from the ingress memory subsystem;
- e) using the first and second data events at the credit module to track a congestion status for the destinations in the switch.
31. The method of claim 30, further comprising:
- f) sending a congestion event from the credit module to a plurality of ingress ports to indicate a change in the congestion status for a destination.
32. The method of claim 31, further comprising:
- g) upon receiving a flow control signal from the ingress memory subsystem, sending a congestion signal from one of the fabric interfaces to one of the ingress ports.
33. The method of claim 32, further comprising:
- h) sending the congestion signal from the one of the fabric interfaces to a second fabric interface, and then sending the congestion signal from the second fabric interface to a second ingress port.
34. A method for distributing information regarding port congestion on a switch having a switch fabric and a plurality of I/O boards, each board having a plurality of ports, the method comprising
- a) submitting incoming data on a first I/O board to the switch fabric via a single ingress memory subsystem;
- b) organizing the ingress memory subsystem so as to establish a separate queue for each destination on the switch;
- c) monitoring an amount of data in each queue in the ingress memory subsystem;
- d) submitting a congestion event to each port on the first I/O board when the amount of data in a first queue passes a threshold value; and
- e) maintaining at each port a destination lookup table containing a congestion value for each destination on the switch based upon the congestion events.
35. The method of claim 34, wherein each I/O board has a plurality of protocol devices servicing a plurality of ports, and further wherein a credit module on each protocol device performs the monitoring step based on the amount of data in each queue that originated from ports on its protocol device, wherein the credit module submits the congestion event to each port on its protocol device.
36. The method of claim 34, wherein each I/O board has a plurality of protocol devices each servicing a plurality of ports, and further wherein slave credit modules on at least some of the protocol devices submit information to a master credit module concerning the data entering each queue that originated from ports on its protocol device, wherein the master credit module instructs the slave credit modules to submit the congestion event to each port that it services.
37. The method of claim 36, wherein
- i) all data passes through a fabric interface before being submitted to the ingress memory subsystem,
- ii) multiple fabric interfaces exist on each protocol device,
- iii) the fabric interfaces receive congestion signals from the ingress memory subsystem, and
- iv) the fabric interfaces submit a fabric congestion signal to at least one port after receiving congestion signals from the ingress memory subsystem.
38. The method of claim 37, wherein the fabric interfaces indicate to each other when fabric congestion signals are created.
39. The method of claim 37, wherein the fabric interfaces track the data entering and leaving the ingress memory subsystem and all the fabric interfaces on a single protocol device report this information to a single credit module.
40. A data communication switch having a plurality of destinations comprising:
- a) a crossbar component; and
- b) a plurality of I/O boards, each I/O board having i) a memory subsystem for queuing data for submission to the crossbar component, ii) a credit component for tracking an amount of data within the memory subsystem for each destination, and iii) a plurality of protocol devices, each protocol device having (1) a plurality of ports, (2) a port congestion indicator at each port, the port congestion indicator having an indication of a congestion status for each destination in the switch, and (3) a congestion communication link connecting the credit component with each of the port congestion indicators.
41. The switch of claim 40, wherein each destination in the switch has a switch destination address and further wherein the credit component contains a decrement FIFO containing switch destination addresses and an increment FIFO containing switch destination addresses.
42. The switch of claim 41, wherein a first switch destination address for a first destination is added to the decrement FIFO when a unit of data for the first destination is submitted to the memory subsystem and further wherein the first switch destination address for the first destination is added to the increment FIFO when the unit of data for the first destination exits the memory subsystem.
43. The switch of claim 42, wherein each port communicates to the memory subsystem through a fabric interface module, and further wherein the fabric interface modules submit the switch destination addresses to the FIFOs.
44. The switch of claim 43, wherein the fabric interface modules receive flow control signals from the memory subsystem.
45. The switch of claim 44, wherein each fabric interface module sends congestion signals to an associated port congestion indicator upon receipt of the flow control signals.
46. The switch of claim 45, wherein each fabric interface module sends a congestion signal to the other fabric interface modules on its I/O board upon receipt of the flow control signals.
47. The switch of claim 40, wherein the credit component submits a congestion event to at least one of the port congestion indicators when an amount of data in the memory subsystem for a first destination crosses a threshold.
48. The switch of claim 47, wherein the credit component is a master credit component, and further comprising a plurality of slave credit components, wherein when the master credit component submits the congestion event to the at least one port congestion indicator, the master credit component also submits an instruction to the slave credit components to submit the congestion event to other port congestion indicators.
49. The switch of claim 40, wherein each port communicates to the memory subsystem through a fabric interface module, and further wherein the fabric interface modules communicate to the credit components events related to data entering and leaving the memory subsystem.
50. The switch of claim 40, wherein the port congestion indicator is an XOFF mask lookup table.
51. A data communication switch having a plurality of destinations comprising:
- a) a crossbar component; and
- b) at least one I/O board having i) an output queuing means for queuing data for submission to the crossbar component, ii) a plurality of ports, iii) a congestion indicator means at each port for indicating a congestion status for each destination in the switch, iv) a congestion signaling means for signaling a need to update the congestion indicator means with a new congestion status for at least one port.
52. The switch of claim 51, further comprising a means for sharing the congestion signaling means with all ports on an I/O board.
53. The switch of claim 51, wherein the destinations include the ports and at least one microprocessor.
Type: Application
Filed: Jun 21, 2004
Publication Date: Apr 28, 2005
Inventors: Scott Carlsen (Mount Laurel, NJ), Anthony Tornetta (King of Prussia, PA), Steven Schmidt (Westampton, NJ)
Application Number: 10/873,329