Dynamically updating routing information while avoiding deadlocks and preserving packet order after a configuration change
A system for allowing dynamic changing of routing information of a network interconnect while avoiding deadlocks and preserving packet ordering. A network resiliency system detects when an error in the network interconnect occurs and dynamically generates new routing information for the routers that factors in the detected error. The network resiliency system then directs the network interconnect to enter a quiescent state in which no packets are transiting through the network interconnect. After the network interconnect enters the quiescent state, the network resiliency system directs the loading of the new routing information into the routing tables of the network interconnect and then directs the network interconnect to start injecting request packets into the network interconnect.
This invention was made with government support under (identify the contract) awarded by the Defense Advanced Research Projects Agency (DARPA). The government has certain rights in the invention.
CROSS-REFERENCE TO RELATED APPLICATIONS
This application is related to U.S. patent application Ser. No. 13/104,778, now U.S. Pat. No. 8,854,951, entitled DYNAMICALLY UPDATING ROUTING INFORMATION WHILE AVOIDING DEADLOCKS AND PRESERVING PACKET ORDER AFTER A LINK ERROR, filed concurrently herewith, which is incorporated herein by reference.
BACKGROUND
Massively parallel processing (“MPP”) systems may have tens and even hundreds of thousands of nodes connected via a communications mechanism. Each node may include one or more processors (e.g., an AMD Opteron processor), memory (e.g., between 1-32 gigabytes), and a communications interface (e.g., HyperTransport technology) connected via a network interface controller (“NIC”) to a router with router ports. Each router may be connected via its router ports to some number of other routers and then to other nodes to form a routing topology (e.g., torus, hypercube, and fat tree) that is the primary system network interconnect. Each router may include routing tables specifying how to route incoming packets from a source node to a destination node. The nodes may be organized into modules (e.g., a board) with a certain number (e.g., 4) of nodes and routers each, and the modules may be organized into cabinets with multiple (e.g., 24) modules in each cabinet. Such systems may be considered scalable when an increase in the number of nodes results in a proportional increase in their computational capacity. An example network interconnect for an MPP system is described in Alverson, R., Roweth, D., and Kaplan, L., “The Gemini System Interconnect,” 2010 IEEE Annual Symposium on High Performance Interconnects, pp. 83-87, Mountain View, Calif., Aug. 18-20, 2010, which is hereby incorporated by reference.
The nodes of an MPP system may be designated as service nodes or compute nodes. Compute nodes are primarily used to perform computations. A service node may be dedicated to providing operating system and programming environment services (e.g., file system services, external Input/Output (“I/O”), compilation, editing, etc.) to application programs executing on the compute nodes and to users logged in to the service nodes. The operating system services may include I/O services (e.g., access to mass storage), processor allocation services, program launch services, log in capabilities, and so on. The service nodes and compute nodes may employ different operating systems that are customized to support the processing performed by the node.
An MPP system may include a supervisory system comprising a hierarchy of controllers for monitoring components of the MPP system as described in U.S. Patent Application No. 2008/0134213, entitled “Event Notifications Relating to System Failures in Scalable Systems,” filed on Sep. 18, 2007, which is hereby incorporated by reference. At the lowest level of the hierarchy, the supervisory system may include a controller associated with each node that is implemented as software that may execute on the node or on special-purpose controller hardware. At the next lowest level of the hierarchy, the supervisory system may include a controller for each module that may be implemented as software that executes on special-purpose controller hardware. At the next lowest level of the hierarchy, the supervisory system may include a controller for each cabinet that also may be implemented in software that executes on special-purpose controller hardware. The supervisory system may optionally include other levels of controllers for groups of cabinets. At the top of the hierarchy is a controller designated as the supervisory controller or system management workstation, which provides a view of the overall status of the components of the multiprocessor system. The hierarchy of controllers forms a tree organization with the supervisory controller being the root and the controllers of the nodes being the leaf controllers. Each controller communicates between its parent and child controller using a supervisory communication network that is independent of (or out of band from) the primary system network interconnect. For example, the supervisory communication network may be a high-speed Ethernet network.
The controllers monitor the status of the nodes, network interface controllers, and routers. A leaf controller (or node controller) may monitor the status of the hardware components of the node and the system services executing on the node. The next higher level controller (module controller or L0 controller) may monitor the status of the leaf controllers of the nodes of the module, power to the module, and so on. The next higher level controller (cabinet controller or L1 controller) may monitor the status of the next lower level controllers, power to the cabinet, cooling of the cabinet, and so on.
The routing logic of the tiles routes the flits based on a routing table for each of the tiles. Each routing table contains 32 entries, and each entry includes a match and a mask. The routing logic at an input port of a tile applies the match of each entry in sequence to each packet to find the first matching entry. The routing logic then routes the packet (on a flit-by-flit basis) to an output port identified by the mask of that matching entry. Other router architectures may have one or more routing tables per router and may not be tile-based. Each routing table may also have any number of entries (e.g., 64 or 128).
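The first-match lookup described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the router's actual logic: the description does not give the bit-level encoding of the match and mask fields, so the sketch assumes that a match is a value compared against selected destination bits and that the mask is a bitmask whose lowest set bit names the output port. The names RouteEntry, match_value, match_bits, port_mask, and lookup are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class RouteEntry:
    match_value: int  # assumed: destination bits this entry expects
    match_bits: int   # assumed: which destination bits are compared
    port_mask: int    # assumed: bitmask whose lowest set bit names the output port


def lookup(table: Sequence[RouteEntry], destination: int) -> Optional[int]:
    """Return the output port of the first matching entry, or None if none matches."""
    for entry in table:  # entries are applied in sequence; the first match wins
        if (destination & entry.match_bits) == entry.match_value:
            if entry.port_mask == 0:
                return None  # a matching entry that names no port (assumed invalid)
            # Isolate the lowest set bit and convert it to a port index.
            return (entry.port_mask & -entry.port_mask).bit_length() - 1
    return None


# Hypothetical two-entry table: destinations 4..7 leave on port 1,
# everything else falls through to a catch-all entry on port 3.
table = [
    RouteEntry(match_value=0b0100, match_bits=0b1100, port_mask=0b0010),
    RouteEntry(match_value=0b0000, match_bits=0b0000, port_mask=0b1000),
]
```

Because entries are tried in order, a more specific entry placed ahead of a catch-all entry takes precedence over it.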
The routing tables of a network interconnect are typically initialized to avoid deadlocks and to ensure proper ordering of packets. A deadlock may occur, for example, when routers along a routing path cannot send a flit because other routers along the routing path are full and cannot send a flit because other routers are full. There are well-known routing algorithms for avoiding deadlocks such as that described in U.S. Pat. No. 5,533,198, entitled “Direction Order Priority Routing of Packets Between Nodes in a Networked System.” When routed through a network, certain types of packets need to have their order of delivery guaranteed. For example, a program may store data in a remote memory location and later load that data from that same remote memory location. To store the data, the processor executing the program sends a store request via the network to the remote memory location. To load the data, the processor sends a load request via the network to the remote memory location. If the requests were to travel on different routes through the network, it might be possible (e.g., depending on network congestion) for the load request to arrive at the remote memory location before the store request. In such a case, the load request would load the old value from the remote memory location. Networks employ various techniques to ensure that “ordered packets” are received in the same order as they were sent. For example, a network may ensure that ordered packets each travel through the same route. Unordered packets, in contrast, do not depend on their ordering for proper functioning. For example, two load requests to the same memory location will function properly regardless of which is received first (assuming no intervening store request).
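The store-then-load hazard described above can be reproduced with a small Python model. It is a toy sketch under assumptions: each packet is a tuple of send time, route latency, operation, and value, the single memory location is named "x", and the latencies are invented for illustration. The function name deliver is hypothetical.

```python
from typing import Dict, List, Tuple

# (send_time, route_latency, operation, value); all fields are illustrative.
Packet = Tuple[int, int, str, str]


def deliver(memory: Dict[str, str], packets: List[Packet]) -> List[str]:
    """Apply packets to memory in arrival order and return the values that loads observed."""
    # Arrival time is send time plus route latency; ties keep send order.
    arrivals = sorted(packets, key=lambda p: (p[0] + p[1], p[0]))
    loaded = []
    for _send_time, _latency, operation, value in arrivals:
        if operation == "store":
            memory["x"] = value
        else:
            loaded.append(memory["x"])
    return loaded


# Both requests take the same route, so the same latency applies to each.
same_route = [(0, 5, "store", "new"), (1, 5, "load", "")]
# The load takes a faster route than the store and overtakes it.
different_routes = [(0, 5, "store", "new"), (1, 2, "load", "")]
```

With the same route, the load observes the stored value; with different routes, the later load arrives first and observes the old value, which is the reordering that ordered packets must avoid.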
Links of a network can fail for various reasons. For example, a link may simply break or become disconnected at one end, or the router to which a link is connected may lose power. Whenever a link fails, the network is no longer fully connected. In such a case, ordered packets may not be able to travel on the same route. Various techniques have been used to recover from failed links. One technique terminates all jobs executing on each node, then restarts the system with new routes that avoid failed links and restarts the terminated job, which may continue from a checkpoint. Another technique may have redundant links, and when a link fails, the technique routes packets onto the redundant link. However, if the redundant link also fails, then another approach needs to be used such as restarting the system.
A method, a system, and a computer-readable storage device are provided to allow dynamic changing of routing information of a network interconnect while avoiding deadlocks and preserving packet ordering. In some embodiments, a network resiliency system detects when an error in the network interconnect occurs and dynamically generates new routing information for the routers that factors in the detected error. For example, if a link is reported as having failed, the new routing information identifies routes that bypass the failed link. The network resiliency system may use a conventional routing algorithm that avoids deadlocks to generate the new routing information. Because the network interconnect may have thousands of routers, the loading of the new routing information into the routers may happen over a period of time. During this period, the routing information of the various routers may be in an inconsistent state because some routers have new routing information while other routers still have the old routing information. Although the old routing information and the new routing information may each separately avoid deadlocks, the mixture of old and new routing information may not avoid such deadlocks. Moreover, during this period, the ordering of ordered packets may not be guaranteed because of the mixture of old and new routing information.
Although conventional techniques may avoid such deadlocks and preserve packet ordering by reinitializing the entire network including processors and the network interconnect, the network resiliency system does so without having to reinitialize the entire network. When the network resiliency system receives an indication of a failure in the network interconnect, the network resiliency system generates new routing information that factors in the failure. The network resiliency system then directs the network interconnect to enter a quiescent state in which no packets are transiting through the network interconnect. To achieve this quiescent state, the network resiliency system suppresses the injection of request packets into the network interconnect. Although the injection of the request packets is suppressed, the network resiliency system allows response packets to be injected into the network interconnect and allows those request packets that have already been injected (are in transit) to continue to their destination. The network resiliency system thus allows already-injected request packets and their response packets to be delivered. Once all the request packets and their responses are delivered, the network interconnect is in a quiescent state. After the network interconnect enters the quiescent state, the network resiliency system directs the loading of the new routing information into the routing tables of the network interconnect. After the loading of the new routing information has been confirmed (i.e., the routing information is in a consistent state), the network resiliency system directs the network interconnect to restart injecting request packets into the network interconnect. These injected request packets will be routed according to the new routing information only, thus avoiding deadlocks and preserving packet ordering.
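The quiesce, load, and resume sequence described above can be illustrated with a toy Python model. It is a sketch under assumptions rather than the described implementation: the interconnect is reduced to a single queue, every delivered request produces exactly one response, and the model checks the queue directly to detect quiescence, which is a simplification. The names Interconnect and reroute are hypothetical.

```python
from collections import deque


class Interconnect:
    """Toy model: one FIFO of in-transit packets tagged with the routing version in use."""

    def __init__(self):
        self.in_transit = deque()
        self.suppress_requests = False
        self.routing_version = "old"
        self.delivered = []

    def inject(self, kind, packet_id):
        """Inject a packet; request packets are refused while suppression is on."""
        if kind == "request" and self.suppress_requests:
            return False
        self.in_transit.append((kind, packet_id, self.routing_version))
        return True

    def step(self):
        """Deliver one packet; a delivered request triggers its response packet."""
        if not self.in_transit:
            return
        kind, packet_id, version = self.in_transit.popleft()
        self.delivered.append((kind, packet_id, version))
        if kind == "request":
            self.inject("response", packet_id)  # responses are always allowed

    def is_quiescent(self):
        return not self.in_transit


def reroute(net, new_version):
    """Suppress requests, drain in-transit traffic, swap routing, then resume."""
    net.suppress_requests = True
    while not net.is_quiescent():  # already-injected requests and responses drain
        net.step()
    net.routing_version = new_version  # no packet is in transit during the swap
    net.suppress_requests = False
```

In this model, every packet injected before reroute is delivered under the old routing version and every packet injected afterward under the new one, so no packet ever sees a mixture of the two.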
In some embodiments, the network resiliency system may be implemented primarily using a supervisory system that includes a supervisory controller at the highest level of a supervisory system hierarchy and local controllers near or at the lowest level of the hierarchy. When a network interconnect error is detected by a controller, the controller routes an indication of the error up the hierarchy to the supervisory controller. The supervisory controller then directs the generating of the new routing information that factors in the error. The supervisory controller then notifies each local controller to suppress the injecting of request packets into the network interconnect. The local controllers may set a flag on the network interface controllers to effect this suppressing. When this flag is set, the network interface controllers buffer new request packets received from a processor without injecting them into the network interconnect, but allow response packets to be injected into the network interconnect. When the buffer is full of request packets, the programs executing on the processor will eventually stall waiting for receipt of the response packet corresponding to a request packet that has not yet been injected into the network interconnect. Because the network interconnect may not have an effective way of signaling when no packets are currently in transit, the network resiliency system waits for confirmation that each local controller has suppressed the injecting of request packets into the network interconnect and then starts a timer to allow the already-injected request packets and any response packets to complete their routes through the network interconnect. The network resiliency system may set the timer based on the maximum anticipated time it would take for a request packet and its response packet to travel through the network interconnect. When the timer expires, the network resiliency system may assume that the network interconnect is in a quiescent state.
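The flag-controlled buffering and the drain timer described above might look like the following Python sketch. Several details are assumptions made for illustration: the buffer capacity, the choice to inject buffered requests in order when the flag is cleared, and the computation of the timer as the sum of the worst-case request and response transit times plus a margin (the description says only that the timer is based on the maximum anticipated time). The names NetworkInterfaceController, send_request, send_response, set_suppress, and quiesce_timeout are hypothetical.

```python
class NetworkInterfaceController:
    """Toy NIC: buffers request packets while the suppress flag is set."""

    def __init__(self, buffer_capacity):
        self.suppress_requests = False
        self.buffer_capacity = buffer_capacity
        self.buffered = []  # requests held back while suppressed
        self.injected = []  # packets handed to the interconnect

    def send_request(self, packet):
        """Return False when the buffer is full, in which case the sender must stall."""
        if not self.suppress_requests:
            self.injected.append(packet)
            return True
        if len(self.buffered) < self.buffer_capacity:
            self.buffered.append(packet)
            return True
        return False

    def send_response(self, packet):
        """Response packets are injected regardless of the flag."""
        self.injected.append(packet)
        return True

    def set_suppress(self, flag):
        """Set or clear the flag; clearing it injects buffered requests in order (assumed)."""
        self.suppress_requests = flag
        if not flag:
            self.injected.extend(self.buffered)
            self.buffered.clear()


def quiesce_timeout(max_request_transit_s, max_response_transit_s, margin_s=0.0):
    """Assumed formula: worst-case request transit plus worst-case response transit plus margin."""
    return max_request_transit_s + max_response_transit_s + margin_s
```

A supervisory controller using this sketch would start a timer of quiesce_timeout(...) seconds only after every local controller has confirmed that the flag is set, since requests injected before that point still need time to complete.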
In some embodiments, when the network interconnect enters the quiescent state, the network resiliency system of the supervisory controller requests the local controllers to have the new routing information loaded into the routers. Because the request and the subsequent loading may take a variable amount of time, the network resiliency system executing on the supervisory controller waits until each local controller responds that the routing information has been successfully loaded. At that point, the network resiliency system at the supervisory controller requests the local controllers to start allowing request packets to be injected into the network interconnect. The local controllers then reset the flag to allow the injection of request packets. Any programs waiting for the network interface controller buffer to no longer be full will detect that the buffers are no longer full and start sending requests. Any programs, to the extent that they do not send requests, would continue their execution during the process of generating the new routing information, entering the quiescent state, and loading the new routing information. In general, the network resiliency system seeks to avoid the termination of processes created by the operating system while the routing information is being dynamically created and loaded.
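The confirmation handshake between the supervisory controller and the local controllers can be sketched in Python as follows. This is a simplified, single-threaded illustration under assumptions: messages over the out-of-band network are modeled as direct method calls, each local controller always acknowledges, and the command names and the names LocalController, broadcast_and_wait, and reroute are hypothetical.

```python
from typing import Callable, List, Optional


class LocalController:
    """Toy local controller that acknowledges each command it handles."""

    def __init__(self, name: str):
        self.name = name
        self.suppress_requests = False
        self.routes = "v1"
        self.log: List[str] = []

    def handle(self, command: str, payload: Optional[str] = None) -> str:
        if command == "quiesce":
            self.suppress_requests = True
        elif command == "load_routes":
            self.routes = payload
        elif command == "unquiesce":
            self.suppress_requests = False
        else:
            raise ValueError(f"unknown command: {command}")
        self.log.append(command)
        return self.name  # acknowledgment carries the controller's name


def broadcast_and_wait(controllers, command, payload=None) -> None:
    """Send a command to every local controller and require an acknowledgment from each."""
    pending = {c.name for c in controllers}
    for controller in controllers:
        pending.discard(controller.handle(command, payload))
    if pending:
        raise RuntimeError(f"no acknowledgment from: {sorted(pending)}")


def reroute(controllers, new_routes: str, drain_timeout_s: float,
            wait: Callable[[float], None]) -> None:
    """Supervisory sequence: quiesce, wait for traffic to drain, load routes, unquiesce."""
    broadcast_and_wait(controllers, "quiesce")
    wait(drain_timeout_s)  # timer stands in for detecting that no packets are in transit
    broadcast_and_wait(controllers, "load_routes", new_routes)
    broadcast_and_wait(controllers, "unquiesce")
```

Passing time.sleep as the wait argument gives a real delay, while a test can pass a stub that only records the requested duration. Each phase begins only after every local controller has acknowledged the previous one, which keeps the routers from running with a mixture of old and new routing information while request packets are being injected.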
In some embodiments, the network resiliency system may generate new routing information in a distributed manner. When requested by the supervisory controller, each local controller may generate routing information for the routers controlled by that local controller. The local controllers may access a central configuration store (e.g., database) to access the current or anticipated configuration for the network interconnect. Each local controller stores its routing information locally while waiting for the supervisory controller to request loading or installing of the routing information in the routers. Alternatively, the network resiliency system may rely on a computing system other than the hierarchy of controllers to generate the new routing information and store the new routing information in a central store. In such a case, each local controller may retrieve its new routing information from the central store when requested by a supervisory controller to load routing information into the routers.
In some embodiments, the network resiliency system allows for planned changing of the configuration of the network interconnect. For example, the configuration may be changed to add additional routers and links, to upgrade existing routers and links, and to remove routers and links. The network resiliency system allows for the changes in configuration to be accomplished dynamically in much the same way as when a network interconnect error is detected as described above. The network resiliency system executing at the supervisory controller first receives a notification that a change has been made (e.g., add a new blade) or that a change is to be made (e.g., remove a blade) to the configuration of the network interconnect. If a blade is to be removed, the programs executing on the nodes of the blade should be stopped prior to receipt of the notification. Upon receiving such a notification, the network resiliency system updates the central configuration store with the new configuration of the network interconnect and then requests that new routing information be generated based on the configuration. The network resiliency system then requests that the network interconnect enter a quiescent state. After the network interconnect enters the quiescent state, the network resiliency system then directs the loading of the new routing information into the routers and then directs the network interconnect to exit the quiescent state by starting to inject request packets into the network interconnect. If a blade is to be removed, a person can then physically remove the blade. If a blade was added, a person can direct the booting of the operating system on the nodes of the added blade. In this way, the configuration of the network interconnect can be dynamically changed without having to bring down the entire network interconnect and its connected nodes.
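A planned removal of a router can be walked through with the following Python sketch. It is illustrative only and rests on assumptions: the configuration store is a plain dictionary, the topology is an adjacency map, and a breadth-first search stands in for the routing algorithm. Breadth-first search does not by itself guarantee deadlock freedom, so it is a placeholder for the deadlock-avoiding algorithm the description relies on. The names next_hops, remove_router, and apply_planned_change are hypothetical.

```python
from collections import deque


def next_hops(links, source):
    """Breadth-first search from source; map each reachable router to its first hop."""
    first_hop = {}
    visited = {source}
    queue = deque()
    for neighbor in sorted(links.get(source, ())):
        visited.add(neighbor)
        first_hop[neighbor] = neighbor
        queue.append(neighbor)
    while queue:
        router = queue.popleft()
        for neighbor in sorted(links.get(router, ())):
            if neighbor not in visited:
                visited.add(neighbor)
                first_hop[neighbor] = first_hop[router]  # inherit the first hop
                queue.append(neighbor)
    return first_hop


def remove_router(links, router):
    """Return a copy of the topology without the router and without links to it."""
    return {r: {n for n in neighbors if n != router}
            for r, neighbors in links.items() if r != router}


def apply_planned_change(config_store, router_to_remove):
    """Follow the order of steps in the description and return the steps taken."""
    steps = []
    config_store["links"] = remove_router(config_store["links"], router_to_remove)
    steps.append("update configuration store")
    new_tables = {r: next_hops(config_store["links"], r) for r in config_store["links"]}
    steps.append("generate routing information")
    steps.append("quiesce")  # placeholder: request suppression and drain happen here
    config_store["tables"] = new_tables
    steps.append("load routing information")
    steps.append("unquiesce")
    return steps


# Hypothetical four-router ring: A-B, B-C, C-D, D-A.
ring = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"A", "C"}}
```

In the ring, router A initially reaches C through B; after B is removed from the configuration, the regenerated table sends that traffic through D instead, and only then would a person physically remove the blade.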
The devices on which the network resiliency system may be implemented may include a central processing unit and memory and may include, particularly in the case of the system management workstation, input devices (e.g., keyboard and pointing devices), output devices (e.g., display devices), and storage devices (e.g., disk drives). Computer-readable media includes computer-readable storage media and data transmission media. The computer-readable storage media includes memory and other storage devices that may have recorded upon or may be encoded with computer-executable instructions or logic that implement the network resiliency system. The data transmission media is media for transmitting data using signals or carrier waves (e.g., electromagnetism) via a wire or wireless connection. Various functions of the network resiliency system may also be implemented on devices using discrete logic or logic embedded as an application-specific integrated circuit. The devices on which the network resiliency system is implemented are computing devices.
The network resiliency system may be described in the general context of computer-executable instructions, such as program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, objects, components, data structures, and so on that perform particular tasks or implement particular abstract data types. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments.
Although the subject matter has been described in language specific to structural features and/or acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. The connection devices may include routers, switches, network switches, switching hubs, switching devices, routing devices, network routers, packet switches, connectors, sub-components of such connection devices, and so on. The components of a network may be connected via wired or wireless connections. Although the data routed through the network is described as being organized as packets with flits, the data may be organized in other ways (e.g., packets without any subdivision into flits, packets with a fixed number of sub-packets or flits, or fixed-sized packets). Accordingly, the invention is not limited except as by the appended claims.
Claims
1. A method for changing configuration of a network interconnect having routers connected via links, the network interconnect connecting nodes that are not part of the network interconnect, the network interconnect for routing packets that are sent from source nodes and injected into the network interconnect for delivery to destination nodes, comprising:
- receiving an indication of a change to the configuration of the network interconnect;
- directing the network interconnect to enter a quiescent state in which the injecting of request packets sent from nodes into the network interconnect is suppressed while the injecting of response packets sent from nodes into the network interconnect is allowed and already-injected request packets are allowed to continue through the network interconnect, wherein a response packet is sent from a node in response to a request packet; and
- after the network interconnect enters the quiescent state, initializing components based on the changed configuration; directing the network interconnect to install new routing information based on the changed configuration; and after the new routing information is installed in the network interconnect, directing the network interconnect to enter an unquiescent state by allowing the injecting of request packets sent from nodes into the network interconnect.
2. The method of claim 1 wherein programs executing on nodes connected to the network interconnect continue to execute after the injecting of request packets is suppressed to the extent that the suppressing does not interfere with execution of those programs.
3. The method of claim 1 wherein the change to the configuration is selected from a group of changes consisting of replacing a router, adding a router, and removing a router.
4. The method of claim 1 wherein the change to the configuration is selected from a group of changes consisting of replacing a link, adding a link, and removing a link.
5. The method of claim 1 wherein the network interconnect connects a plurality of processors of the nodes, each processor being connected to a router through a network interface controller and each processor being connected to a local controller through a network that is out-of-band from the network interconnect, and the suppressing of the injecting of request packets includes the local controller directing the network interface controller to not send request packets to the router.
6. The method of claim 1 including after directing the network interconnect to enter a quiescent state, waiting for a timeout period that is based on a maximum time for a request packet and its response packet to transit through the network interconnect.
7. The method of claim 1 including after receiving the indication that a change is to be made to the configuration of the network interconnect, computing new routing information based on the change to be made.
8. The method of claim 7 wherein the network interconnect connects a plurality of processors of the nodes, each processor being connected to a router through a network interface controller and each processor being connected to a local controller through a network that is out-of-band from the network interconnect, and the installing of the new routing information into the network interconnect includes distributing via the out-of-band network a request to use the new routing information to each local controller and waiting for a response to the request to use the new routing information from each local controller.
9. A computer-readable storage device containing computer-executable instructions for changing configuration of a network interconnect having routers connected via links, the network interconnect connecting processors that are not part of the network interconnect, by a method comprising:
- directing the network interconnect to enter a quiescent state in which no request packets or response packets are in transit by suppressing the injecting of request packets sent from processors into the network interconnect, but allowing the injecting of response packets sent from processors into the network interconnect wherein a response packet is sent from a processor in response to a request packet; and
- after the network interconnect enters the quiescent state, directing the network interconnect to install new routing information based on the changed configuration; and after the new routing information is installed into the network interconnect, directing the network interconnect to enter an unquiescent state in which request packets and response packets are sent through the network interconnect.
10. The computer-readable storage device of claim 9 wherein the change to the configuration is based on a change to a router, to a link, or to both.
11. The computer-readable storage device of claim 9 wherein programs executing on processors connected to the network interconnect continue to execute after injecting of request packets is suppressed to the extent that the suppressing does not interfere with execution of those programs.
12. The computer-readable storage device of claim 9 wherein the network interconnect connects a plurality of processors, each processor being connected to a router through a network interface controller and each processor being connected to a local controller through a network that is out-of-band from the network interconnect, and the entering of the quiescent state includes suppressing the injecting of request packets into the network interconnect by the local controller directing the network interface controller to not send request packets to the router.
13. The computer-readable storage device of claim 12 wherein a program executing on a processor detects a buffer full condition that prevents the program from sending requests for injecting request packets onto the network interconnect.
14. The computer-readable storage device of claim 9 including determining that the network interconnect has entered the quiescent state based on waiting for a timeout period that is derived from a maximum time for a request packet and its response packet to transit through the network interconnect.
15. A system for adapting routing in a network interconnect to a change in configuration of the network interconnect, comprising:
- a plurality of processors connected to routers of the network interconnect via a network interconnect controller wherein the processors are outside of the network interconnect;
- local controllers that are each connected to a processor and that in response to receiving a request to quiesce, suppress the injection of request packets by the processor into the network interconnect, but allow the injection of response packets by the processor into the network interconnect, in response to receiving a request to unquiesce, allow the injection of request packets by the processor into the network interconnect wherein a response packet is sent by the processor in response to a request packet, and in response to receiving a request to install routing information, direct the installing of routing information into the routers; and
- a supervisory controller that is connected to the local controllers via a network that is out-of-band from the network interconnect, the supervisory controller for sending a request to each local controller to quiesce, receiving an indication that the configuration of the network interconnect has changed, sending a request to each local controller to install in the routers routing information that factors in the changed configuration, and after the routing information is installed, sending a request to each local controller to unquiesce.
16. The system of claim 15 wherein a quiescent state is entered when no packets are in transit in the network interconnect.
17. The system of claim 15 wherein the network interconnect controller allows response packets to be injected into the network interconnect and the network interconnect allows already-injected request packets to continue in transit.
18. The system of claim 15 wherein before sending a request to the local controllers to quiesce, the supervisory controller receives an indication that a change to the configuration is to be made.
19. The system of claim 15 wherein after the network interconnect enters a quiescent state, the supervisory controller indicates that the change to the configuration can be made.
20. A method for quiescing a network interconnect having routers connected via links, the network interconnect connecting nodes that are not part of the network interconnect, the network interconnect for routing packets that are sent from source nodes and injected into the network interconnect for delivery to destination nodes, comprising:
- receiving an indication to quiesce the network interconnect; and
- after receiving the indication to quiesce, suppressing the injecting into the network interconnect of request packets by source nodes; allowing the injecting of response packets into the network interconnect by source nodes for delivery to destination nodes prior to entering a quiescent state, each response packet being in response to a request packet; and allowing the delivery to destination nodes of both request packets and response packets that were injected into the network interconnect by source nodes before the indication to quiesce was received.
21. The method of claim 20 wherein programs executing on the nodes connected to the network interconnect continue to execute after the injecting of request packets is suppressed to the extent that the suppressing does not interfere with execution of those programs.
22. The method of claim 20 including after receiving the indication to quiesce, indicating that the network interconnect has quiesced after waiting for a timeout period that is based on a maximum time for a request packet and its response packet to transit through the network interconnect.
5533198 | July 2, 1996 | Thorson |
6907011 | June 14, 2005 | Miller et al. |
7283463 | October 16, 2007 | Miller et al. |
7565566 | July 21, 2009 | Davies et al. |
7761696 | July 20, 2010 | Bhattacharyya et al. |
7984453 | July 19, 2011 | Alverson et al. |
8407703 | March 26, 2013 | Ault et al. |
8572624 | October 29, 2013 | Heller et al. |
8675639 | March 18, 2014 | Berman |
20050044268 | February 24, 2005 | Johnston-Watt et al. |
20080134213 | June 5, 2008 | Alverson et al. |
20090006829 | January 1, 2009 | Cai et al. |
20090264125 | October 22, 2009 | Rofougaran |
20100290457 | November 18, 2010 | Dasoju et al. |
20110294472 | December 1, 2011 | Bramwell et al. |
20130047034 | February 21, 2013 | Salomon et al. |
- U.S. Appl. No. 13/104,778, filed May 10, 2011, Godfrey et al.
- Alverson, R., Roweth, D., and Kaplan, L., “The Gemini System Interconnect,” 2010 IEEE Symposium on High Performance Interconnects, pp. 83-87, Mountain View, CA, Aug. 18-20, 2010.
- Final Office Action for U.S. Application No. 13/104,778, Mail Date Oct. 29, 2013, 34 pages.
- Non-Final Office Action for U.S. Application No. 13/104,778, Mail Date Apr. 25, 2013, 33 pages.
Type: Grant
Filed: May 10, 2011
Date of Patent: Oct 6, 2015
Patent Publication Number: 20120287821
Assignee: Cray Inc. (Seattle, WA)
Inventors: Aaron F. Godfrey (Eagan, MN), Christopher B. Johns (Edina, MN)
Primary Examiner: Dustin Nguyen
Application Number: 13/104,799
International Classification: G06F 15/177 (20060101); H04L 12/751 (20130101); H04L 12/703 (20130101);