DYNAMIC DISTRIBUTION OF NODES ON A MULTI-NODE COMPUTER SYSTEM
A method and apparatus dynamically distribute I/O nodes on a multi-node computing system. An I/O configuration mechanism located in the service node of a multi-node computer system controls the distribution of the I/O nodes. The I/O configuration mechanism uses job information located in a job record to initially configure the I/O node distribution. The I/O configuration mechanism further monitors the I/O performance of the executing job and then dynamically adjusts the I/O node distribution based on that performance.
1. Technical Field
The disclosure and claims herein generally relate to multi-node computer systems, and more specifically relate to dynamic distribution of compute nodes with respect to I/O nodes on a multi-node computer system.
2. Background Art
Supercomputers and other multi-node computer systems continue to be developed to tackle sophisticated computing jobs. One type of multi-node computer system is a massively parallel computer system. A family of such massively parallel computers is being developed by International Business Machines Corporation (IBM) under the name Blue Gene. The Blue Gene/L system is a high density, scalable system in which the current maximum number of compute nodes is 65,536. The Blue Gene/L node consists of a single ASIC (application specific integrated circuit) with 2 CPUs and memory. The full computer is housed in 64 racks or cabinets with 32 node boards in each rack.
Computer systems such as Blue Gene have a large number of nodes, each with its own processor and local memory. The nodes are connected with several communication networks. One communication network connects the nodes in a logical tree network. In the logical tree network, the nodes are connected to an input-output (I/O) node at the top of the tree.
In Blue Gene, there are 2 compute nodes per node card with 2 processors each. A node board holds 16 node cards and each rack holds 32 node boards. A node board has slots to hold 2 I/O cards that each have 2 I/O nodes. Thus, fully loaded node boards have 4 I/O nodes for 32 compute nodes. The nodes on two node boards can be configured in a virtual tree network that communicates with the I/O nodes. For two node boards there may be 8 I/O nodes that correspond to 64 compute nodes. If the I/O node slots are not fully populated, then there could be 2 I/O nodes for 64 compute nodes. Thus the distribution of I/O nodes to compute nodes may vary between 1/64 and 1/8. The I/O node to compute node ratios can accordingly be defined as 1/8, 1/32, 1/64 or 1/128 (I/O/compute). In the prior art, the distribution of the I/O nodes is static once a block is configured.
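The arithmetic behind these ratios can be checked with a short sketch. The per-card and per-board counts are the ones quoted in the paragraph above; the constant and function names are illustrative only and are not part of the Blue Gene software.

```python
from fractions import Fraction

# Hardware counts quoted in the paragraph above.
COMPUTE_NODES_PER_NODE_CARD = 2
NODE_CARDS_PER_NODE_BOARD = 16
IO_CARDS_PER_NODE_BOARD = 2
IO_NODES_PER_IO_CARD = 2


def io_to_compute_ratio(io_nodes: int, compute_nodes: int) -> Fraction:
    """Return the I/O-node-to-compute-node ratio as a reduced fraction."""
    if io_nodes <= 0 or compute_nodes <= 0:
        raise ValueError("node counts must be positive")
    return Fraction(io_nodes, compute_nodes)


compute_per_board = COMPUTE_NODES_PER_NODE_CARD * NODE_CARDS_PER_NODE_BOARD  # 32
io_per_board = IO_CARDS_PER_NODE_BOARD * IO_NODES_PER_IO_CARD                # 4

# One fully loaded node board: 4 I/O nodes for 32 compute nodes.
full_board = io_to_compute_ratio(io_per_board, compute_per_board)
# Two fully loaded node boards: 8 I/O nodes for 64 compute nodes.
two_full_boards = io_to_compute_ratio(2 * io_per_board, 2 * compute_per_board)
# Two partially populated node boards: 2 I/O nodes for 64 compute nodes.
two_sparse_boards = io_to_compute_ratio(2, 2 * compute_per_board)

print(full_board, two_full_boards, two_sparse_boards)  # 1/8 1/8 1/32
```

By the same arithmetic, a single I/O node serving 64 compute nodes corresponds to the 1/64 ratio.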
The Blue Gene computer can be partitioned into multiple, independent blocks. Each block is used to run one job at a time. A block consists of a number of ‘processing sets’ (psets). Each pset has an I/O node and a group of compute nodes. The compute nodes run the user application, and the I/O nodes are used to access external files and networks.
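The block and pset relationship can be pictured with a minimal data-structure sketch. The class names, node labels, and the round-robin assignment of compute nodes to psets are assumptions made for illustration; the text does not say how compute nodes are grouped within a block.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Pset:
    """One processing set: a single I/O node plus its group of compute nodes."""
    io_node: str
    compute_nodes: List[str] = field(default_factory=list)


@dataclass
class Block:
    """An independent partition that runs one job at a time."""
    psets: List[Pset]

    def ratio(self) -> str:
        """Describe the block as '<I/O nodes>/<compute nodes>' (not reduced)."""
        compute = sum(len(p.compute_nodes) for p in self.psets)
        return f"{len(self.psets)}/{compute}"


def build_block(io_nodes: List[str], compute_nodes: List[str]) -> Block:
    """Create one pset per I/O node and deal compute nodes out round-robin.

    The round-robin policy is an assumption made for this sketch.
    """
    if not io_nodes:
        raise ValueError("a block needs at least one I/O node")
    psets = [Pset(io_node=io) for io in io_nodes]
    for index, node in enumerate(compute_nodes):
        psets[index % len(psets)].compute_nodes.append(node)
    return Block(psets)


compute = [f"c{i}" for i in range(64)]
block = build_block(["io0", "io1"], compute)
print(block.ratio(), [len(p.compute_nodes) for p in block.psets])
```

Rebuilding the same 64 compute nodes with a different list of I/O nodes yields a block with a different ratio, which is the quantity that remains fixed once a block is configured in the prior art.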
With the communication networks as described above, applications or “jobs” loaded on nodes execute on a fixed I/O to compute node ratio. Without a way to dynamically distribute the I/O nodes to adjust the ratio of I/O to compute nodes based on the I/O characteristics of work being performed on the system, multi-node computer systems will continue to suffer from reduced efficiency.
BRIEF SUMMARY
An apparatus and method are described for dynamic distribution of compute nodes versus I/O nodes on a multi-node computing system. An I/O configuration mechanism located in the service node of a multi-node computer system controls the distribution of the I/O nodes. The I/O configuration mechanism uses job information located in a job record to initially configure the I/O node distribution. The I/O configuration mechanism further monitors the I/O performance of the executing job to then dynamically adjust the I/O node distribution based on the I/O performance of the executing job.
The description and examples herein are directed to a massively parallel computer system such as the Blue Gene architecture, but the claims herein expressly extend to other parallel computer systems with multiple processors arranged in a network structure.
The foregoing and other features and advantages will be apparent from the following more particular description, and as illustrated in the accompanying drawings.
The disclosure will be described in conjunction with the appended drawings, where like designations denote like elements, and:
The description and claims herein are directed to a method and apparatus for dynamic distribution of compute nodes versus I/O nodes on a multi-node computing system. An I/O configuration mechanism located in the service node of a multi-node computer system controls the distribution of the I/O nodes. The I/O configuration mechanism uses job information located in a job record to initially configure the I/O node distribution. The I/O configuration mechanism further monitors the I/O performance of the executing job to then dynamically adjust the I/O node distribution based on the I/O performance of the executing job. The examples herein will be described with respect to the Blue Gene/L massively parallel computer developed by International Business Machines Corporation (IBM).
The Blue Gene/L computer system structure can be described as a compute node core with an I/O node surface, where communication to 1024 compute nodes 110 is handled by each I/O node 170 that has an I/O processor connected to the service node 140. The I/O nodes 170 have no local storage. The I/O nodes are connected to the compute nodes through the logical tree network and also have functional wide area network capabilities through a gigabit ethernet network (See
Again referring to
The service node 140 communicates through the control system network 150 dedicated to system management. The control system network 150 includes a private 100-Mb/s Ethernet connected to an Ido chip 180 located on a node board 120 that handles communication from the service node 140 to a number of nodes. This network is sometimes referred to as the JTAG network since it communicates using the JTAG protocol. All control, test, and bring-up of the compute nodes 110 on the node board 120 is governed through the JTAG port communicating with the service node.
The service node includes a job scheduler 142 for allocating and scheduling work and data placement on the compute nodes. The job scheduler 142 loads a job record 144 from data storage 138 for placement on the compute nodes. The job record 144 includes a job and related information as described more fully below. The service node further includes an I/O configuration mechanism 146 that dynamically distributes I/O nodes on a multi-node computing system. The I/O configuration mechanism 146 uses job information located in the job record 144 to initially configure the I/O node distribution. The I/O configuration mechanism further monitors the I/O performance of the executing job to then dynamically adjust the I/O node distribution based on the I/O performance of the executing job.
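The initial configuration step can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the record fields mirror the job information named elsewhere in this description (job description, historical I/O utilization, application control parameters), while the dictionary keys `io_profile` and `io_ratio`, the utilization thresholds, the default ratio, and the order of precedence are all assumptions.

```python
from dataclasses import dataclass, field
from fractions import Fraction
from typing import Dict, Optional

# Ratios named in the background section, densest I/O first.
SUPPORTED_RATIOS = [Fraction(1, 8), Fraction(1, 32), Fraction(1, 64), Fraction(1, 128)]


@dataclass
class JobRecord:
    """Hypothetical job record layout used only for this sketch."""
    name: str
    job_description: Dict[str, str] = field(default_factory=dict)
    historical_io_utilization: Optional[float] = None  # assumed scale: 0.0 to 1.0
    application_control_parameters: Dict[str, Fraction] = field(default_factory=dict)


def initial_io_ratio(record: JobRecord, default: Fraction = Fraction(1, 64)) -> Fraction:
    """Pick a starting I/O-to-compute ratio from the job record (assumed policy)."""
    # 1. An explicit request from the application takes precedence (assumption).
    requested = record.application_control_parameters.get("io_ratio")
    if requested in SUPPORTED_RATIOS:
        return requested
    # 2. A job description that flags heavy I/O gets the densest ratio.
    if record.job_description.get("io_profile") == "heavy":
        return SUPPORTED_RATIOS[0]
    # 3. Otherwise fall back on how the job behaved in earlier runs.
    usage = record.historical_io_utilization
    if usage is not None:
        if usage >= 0.75:   # made-up threshold
            return Fraction(1, 8)
        if usage >= 0.40:   # made-up threshold
            return Fraction(1, 32)
    return default


record = JobRecord("climate-model", historical_io_utilization=0.82)
print(initial_io_ratio(record))  # 1/8
```

A record with no I/O information at all falls through to the default ratio, leaving any later correction to the run-time monitoring.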
Stored in RAM 214 are a class routing table 221, an application program (or job) 222, and an operating system kernel 223. The class routing table 221 stores data for routing data packets on the collective network or tree network as described more fully below. The application program is loaded on the node by the control system to perform a user designated task. The application program typically runs in parallel with application programs running on adjacent nodes. The operating system kernel 223 is a module of computer program instructions and routines for an application program's access to other resources of the compute node. The tasks to be performed by an operating system on a compute node in a massively parallel computer are typically fewer and less complex than those of an operating system on a typical stand-alone computer. The operating system may therefore be quite lightweight by comparison with operating systems of general purpose computers, a pared down version as it were, or an operating system developed specifically for operations on a particular massively parallel computer. Operating systems that may usefully be improved or simplified for use in a compute node include UNIX, Linux, Microsoft XP, AIX, IBM's i5/OS, and others as will occur to those of skill in the art.
The compute node 110 of
The data communications adapters in the example of
Again referring to
The collective network partitions in a manner akin to the torus network. When a user partition is formed, an independent collective network is formed for the partition; it includes all nodes in the partition (and no nodes in any other partition). In the collective network, each node contains a class routing table that is used in conjunction with a small header field in each packet of data sent over the network to determine a class. The class is used to locally determine the routing of the packet. With this technique, multiple independent collective networks are virtualized in a single physical network with one or more I/O nodes for the virtual network. Two standard examples of this are the class that connects a small group of compute nodes to an I/O node and a class that includes all the compute nodes in the system. In Blue Gene, the physical routing of the collective network is static and in the prior art the virtual network was static after being configured. As described herein, the I/O configuration mechanism (
As illustrated in the above example, the I/O configuration mechanism dynamically distributes I/O nodes to blocks of compute nodes in a massively parallel computer system. In the previous example, the determination to distribute an additional I/O node to the node block may have been based on data in the job record or on real-time I/O needs determined by monitoring the job execution. First, upon loading the job, the I/O configuration mechanism could have detected from the job description that the job has extensive I/O needs and then distributed the additional I/O node from a block that has fewer I/O demands or is not being used. Second, the historical I/O utilization 516 may have indicated that the job typically requires a large amount of I/O resources and thus would execute more efficiently with an additional I/O node. Third, the I/O configuration mechanism may have determined from the job record that the application will assert control with application control parameters 518 (
An apparatus and method are described herein that dynamically distribute I/O nodes on a multi-node computing system. The I/O configuration mechanism monitors the I/O performance of the executing job and then dynamically redistributes the I/O nodes based on that performance to increase the efficiency of the multi-node computer system.
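The run-time adjustment can be sketched as a single monitoring step that follows the sequence recited in the claims: compare I/O demand against a threshold, suspend the job, re-allocate an I/O node, reset the block, and resume the job. The `NodeBlock` class, the `rebalance` function, the 0.8 threshold, the pool of spare I/O nodes, and the sample values are hypothetical; the reset is represented only by a log entry.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NodeBlock:
    """Hypothetical view of a block: compute node count and assigned I/O nodes."""
    compute_nodes: int
    io_nodes: List[str]
    job_state: str = "running"
    log: List[str] = field(default_factory=list)


def rebalance(block: NodeBlock, io_demand: float, spare_io_nodes: List[str],
              threshold: float = 0.8) -> bool:
    """Add one spare I/O node to the block when demand exceeds the threshold.

    io_demand is an assumed utilization figure between 0.0 and 1.0.
    Returns True if the block was reconfigured.
    """
    if io_demand <= threshold or not spare_io_nodes:
        return False
    block.job_state = "suspended"                # suspend the job
    block.log.append("suspend")
    block.io_nodes.append(spare_io_nodes.pop())  # re-allocate an I/O node
    block.log.append("reallocate")
    block.log.append("reset")                    # stand-in for resetting the block structure
    block.job_state = "running"                  # resume the job
    block.log.append("resume")
    return True


spares = ["io7"]
block = NodeBlock(compute_nodes=64, io_nodes=["io0"])
for sample in (0.35, 0.92, 0.95):  # made-up monitoring samples
    changed = rebalance(block, sample, spares)
    print(sample, changed, f"{len(block.io_nodes)}/{block.compute_nodes}")
```

In this sketch the third sample also exceeds the threshold but changes nothing, because the pool of spare I/O nodes is already empty; a fuller version would take the I/O node from a block with lower I/O demand.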
One skilled in the art will appreciate that many variations are possible within the scope of the claims. Thus, while the disclosure has been particularly shown and described above, it will be understood by those skilled in the art that these and other changes in form and details may be made therein without departing from the spirit and scope of the claims.
Claims
1. A multi-node computer system comprising:
- a plurality of compute nodes that each comprise a processor and memory;
- a plurality of input/output (I/O) nodes connected to the plurality of compute nodes that provide I/O communication to network resources;
- a job executing on a set of compute nodes from the plurality of compute nodes, wherein the set of compute nodes has a number of associated I/O nodes that form a ratio of I/O nodes to compute nodes; and
- an I/O configuration mechanism that dynamically adjusts the ratio of I/O nodes to compute nodes based on I/O characteristics of the job executing on the set of compute nodes.
2. The multi-node computer system of claim 1 wherein the multi-node computer system is a massively parallel computer system.
3. The multi-node computer system of claim 1 wherein the I/O characteristics of the job are determined by real-time monitoring of the job's I/O characteristics and wherein the I/O configuration mechanism dynamically updates the ratio by suspending the job and re-allocating additional nodes of the plurality of I/O nodes to be associated with the executing job.
4. The multi-node computer system of claim 1 wherein the I/O characteristics of the job are determined from information stored in a job record selected from the following: job description, historical I/O utilization, and application control parameters.
5. The multi-node computer system of claim 4 wherein the I/O characteristics of the job are used for the initial ratio of I/O nodes to compute nodes upon beginning execution of the job.
6. The multi-node computer system of claim 1 wherein the plurality of compute nodes are arranged in a virtual tree network and further comprising an I/O node that connects to the top of the tree network to allow the compute nodes to communicate with a service node of a massively parallel computer system.
7. The multi-node computer system of claim 6 wherein the virtual tree network is determined by a class routing table on the node.
8. A computer implemented method for an I/O configuration mechanism to distribute I/O nodes in a multi-node computer system, the method comprising the steps of:
- monitoring the I/O characteristics of an executing job on a block of nodes with one or more I/O nodes in the multi-node computer system;
- determining whether an I/O demand on the one or more I/O nodes is above a threshold; and
- dynamically updating the I/O configuration to adjust a ratio of I/O nodes to compute nodes for the block of nodes.
9. The computer implemented method of claim 8 further comprising the steps of:
- suspending the job;
- re-allocating the ratio of I/O nodes;
- resetting the block structure with a new allocation of I/O nodes that adjusts the ratio; and
- resuming the job.
10. The computer implemented method of claim 8 further comprising the steps of:
- examining a job record associated with the job for I/O needs of the job; and
- dynamically allocating I/O nodes to the job based on the job record.
11. The computer implemented method of claim 10 wherein the step of examining the job record further comprises the steps of:
- examining the job description for I/O needs;
- examining a job execution history for I/O needs; and
- allowing an application to control the I/O configuration with application control parameters.
12. The computer implemented method of claim 8 wherein the plurality of compute nodes are arranged in a virtual tree network and further comprising an I/O node that connects to the top of the tree network to allow the compute nodes to communicate with a service node of a massively parallel computer system.
13. The computer implemented method of claim 12 wherein the virtual tree network is determined by a class routing table in the node.
14. A computer implemented method for an I/O configuration mechanism to distribute I/O nodes in a multi-node computer system, the method comprising the steps of:
- examining a job record associated with a job for I/O needs of the job;
- dynamically allocating I/O nodes to the job based on the job record;
- monitoring the I/O characteristics of an executing job on a block of nodes with one or more I/O nodes in the multi-node computer system;
- determining whether an I/O demand on the one or more I/O nodes is above a threshold; and
- dynamically updating the I/O configuration to adjust a ratio of I/O nodes to compute nodes for the block of nodes by performing the steps of: suspending the job; re-allocating the ratio of I/O nodes; resetting the block structure with a new allocation of I/O nodes that adjusts the ratio; and resuming the job.
15. A computer-readable article of manufacture comprising:
- a job for execution on a set of compute nodes chosen from a plurality of compute nodes with a plurality of input/output (I/O) nodes connected to the plurality of compute nodes that provide I/O communication to network resources, wherein the set of compute nodes has a number of associated I/O nodes that form a ratio of I/O nodes to compute nodes;
- an I/O configuration mechanism that dynamically adjusts the ratio of I/O nodes to compute nodes based on the I/O characteristics of the job executing on the set of compute nodes, wherein each of the plurality of compute nodes comprises a processor and memory; and
- tangible computer recordable media bearing the job scheduler.
16. The article of manufacture of claim 15 wherein the I/O characteristics of the job are determined by real-time monitoring of the job's I/O characteristics and wherein the I/O configuration mechanism dynamically updates the ratio of I/O nodes to compute nodes by suspending the job and re-allocating additional nodes of the plurality of I/O nodes to be associated with the executing job.
17. The article of manufacture of claim 15 wherein the I/O characteristics of the job are determined from information stored in a job record selected from the following: job description, historical I/O utilization, and application I/O control.
18. The article of manufacture of claim 17 wherein the I/O characteristics of the job are used for the initial ratio of I/O nodes to compute nodes upon beginning execution of the job.
19. The article of manufacture of claim 15 wherein the plurality of compute nodes are arranged in a virtual tree network and further comprising an I/O node that connects to the top of the tree network to allow the compute nodes to communicate with a service node of a massively parallel computer system.
20. The article of manufacture of claim 19 wherein the virtual tree network is determined by a class routing table in the node.
Type: Application
Filed: Dec 12, 2007
Publication Date: Jun 18, 2009
Inventors: Eric Lawrence Barsness (Pine Island, MN), David L. Darrington (Rochester, MN), Amanda Peters (Rochester, MN), John Matthew Santosuosso (Rochester, MN)
Application Number: 11/955,067
International Classification: G06F 9/46 (20060101);