Partitionable Networked Computer
A computer comprising a plurality of processing nodes is provided. Each processing node has at least one processor configured to process input data to generate an array of data items. The processing nodes are arranged in cliques in which each processing node of a clique is connected to each other processing node in the clique by first and second clique links. The cliques are inter-connected in rings such that each processing node is a member of a single clique and a single ring. The processing nodes of all cliques are configured to exchange in each exchange step of a machine learning collective via the respective first and second clique links at least two data items with the other processing node(s) in its clique, and all processing nodes are configured to reduce each received data item with the data item in the corresponding position in the array on that processing node.
The present application claims priority to United Kingdom Patent Application No. 1904265.4, filed on Mar. 27, 2019, which is hereby incorporated by reference in its entirety.
FIELDThe present disclosure relates to computer network topology processing nodes connected in a computer particularly but not exclusively for facilitating portioning for machine learning/artificial intelligence tenant applications.
BACKGROUNDCollectives are routines which are commonly used when processing data in a computer. They are routines which enable data to be shared and processed across multiple different processes, which may be running on the same processing node or different processing nodes. For example, if one process reads data from a data store it can use a “broadcast” process to share that data with other processes. Another example is when the result of a particular function is needed on multiple processes. A “reduction” is a result which has required the application of a compute function to a data value from each of multiple processes. “Gather” and “Scatter” collectives handle more than one data item. Certain collectives have become increasingly important in processing machine learning applications.
MPI (Message Passing Interface) is a message passing standard which can be applied to many parallel computing architectures. MPI defines a number of collectives applicable to machine learning. Two such collective are termed “Reduce” and “Allreduce”. A Reduce operation enables a result of a compute function acting on multiple data values from different source processes to be provided at a single receiving process. Note that a receiving process may be one of the source processes, and that there may be multiple receiving processes. The Allreduce collective reduces the data values from multiple source processes and distributes the results to all the source processes, (which are acting as receiving processes for the reduced result). According to the MPI Standard, the Allreduce collective may be implemented by reducing the data values from all source processes in a reduce collective (e.g. at one of the processes) and then broadcasting the result to each source process.
The aim with the architecture of
To understand the implementation of the Allreduce collective, assume that the first node N0 has generated a “partial” labelled Δ 0. The “partial” is a set of delta weights. This is stored in the storage capability 202 ready to be exchanged in an Allreduce collective. In reality, there may be a vector of partials, where each partial corresponds to a computation on the processing node. In a simple streaming line Allreduce algorithm, the forward links are used for “reduce” and the backward links are used for “broadcast”. The algorithm starts with the processing node at one end (the left hand node in
Furthermore, the backward links are not utilised for broadcast until the fully reduced result has been obtained at the end node. However, if the partials vectors are large, due to the pipelined effect, the lead data item of the reduced result, being the reduction of the first partials from the partial vectors at each node, will return to the starting node well before that starting node has finished sending the data items of its partial, so there may be substantial overlap of activity on all forward and backward links.
In a modification to this algorithm, which represents a small improvement, processing nodes at each end of the line can start to transmit their partials towards a central node, with the reduction being completed at the central nodes. In that case, the result is broadcast back to the end nodes. Note that in this scenario, there would be a reversal in the direction of movement, for example between nodes N2 and N3, and N3 and N4 on both the forward and backward links. If a line is closed into a ring (by connecting the final node PN4 to the first node PN0 on both the backward and forward links), a pipeline algorithm can serialise reduction and broadcast in the same direction, so that the two logical rings formed by the bi-directional links can each operate independently on half of the data. That is, each partial vector is split into two and a first half Δ A is reduced on the forward links (as in
Machine learning workloads call for massively parallel compute, with efficient communication between the processing nodes. One form of topology which provides efficient ring connections is a torus. An n by n torus is shown in
Using rings in this dimension, an alternative approach to implement Allreduce is to use a reduce scatter collective followed by an Allreduce collective. A paper authored by Jain and Sabharwal entitled “Optimal Bucket Algorithms for large MPI collectives on torus interconnects” (ICS' 10, June 2-4, Tsukuba) presents bucket based algorithms for Allgather, reduce-scatter and Allreduce collectives assuming bi-directional links between processing nodes in a torus interconnected processor. This approach operates on the basis that there are multiple data values (partial fragments) to be handled in each step. The algorithm described by Jain et al splits partials into to 2p equal fragments on each processor, where p is the number of logical or virtual rings which can be utilised in the computer. p logical rings are pipelined in each direction, each ring in one direction starting at a different processor. In the reduce-scatter collective, each process starts with an initial partial. It is assumed that a reference here to a process is to a process carried out on a processing node. A partial vector has multiple fragments—each fragment may be for example one or more data value. The corresponding fragments of all partials are reduced and the reductions are then distributed across the processes. In the Allgather collective, every process receives all elements from all other processes. The reduce-scatter collective reduces all partials and stores each reduction on a respective node—see
As discussed in Jain's paper, torus interconnects are attractive interconnection architectures for distributed memory supercomputers. In the above discussion, collectives have been explained in the context of communication between processes. In a distributed super computer, processing nodes are interconnected, and each processing node may be responsible for one or more process in the context of collectives. A torus interconnect is a type of mesh interconnect with processing nodes arranged in an array of n dimensions, with each node connected to its nearest neighbours, and corresponding nodes on opposite edges of the array also connected. Bi-directional communication links exist between interconnected processing nodes.
The algorithms for implementing collectives which are discussed in the above-referenced paper authored by Jain and Sabharwal are applied on torus connected architectures. This allows the collectives to process different parts of the partial vectors in rings in different dimensions at the same time, making the process bandwidth efficient.
Based on the study in the Jain paper, dividing the partial vector into 8n fragments (2 dimensions ×2 directions×2 phases) {reduce, broadcast}×n nodes in a ring enables bandwidth optimality. According to one AllReduce algorithm, the subdivision associated with each dimension is used as follows:
One half of the partial vectors are first reduced using the two rings along the x dimension, then along the y dimension. The other half uses the dimensions in the opposite order. Then the results are broadcast likewise using dimensions in turn, but in the opposite order.
A torus can be created in three dimensions when it is referred to as hyper-torus. A hyper-torus of n×n×n nodes forms 6n rings, each of length n and has thirds of the data (namely the partial vectors) using the dimension in non-overlapping orders xyz, yzx, zxy. Note that in this case z represents the dimension perpendicular to the plane of x and y. Partial vectors are divided into 12n fragments to optimise the AllReduce algorithm in this case.
In a n×n torus, all the rings are used all the time. Rectangular toroids can also be constructed such as 2 n×n. In this case, a certain type of twisted torus allows equal ring length in both dimensions. The rings are the length of the longer dimension, 2 n.
While the torus and twisted torus allow bandwidth optimality for implementation of the AllReduce algorithm, they suffer from a defect which renders them unsuitable for implementing multitenant algorithms in a computer network. Moreover, in any particular construction the configuration is preset at a certain size (number of nodes). It is not easy to provide torus and twisted torus architectures as partitions of a larger network, and maintain usable rings in the partitions.
It is an objective of the present invention to provide a configuration of interconnected processing nodes which overcome these defects, particularly but not exclusively for use in processing functions in machine learning, such as collectives.
While the topologies and configurations described herein are particularly effective for the efficient implementation of Allreduce, they may also be advantageously used for other machine learning collectives and other types of parallel programs.
SUMMARYAn aspect of the invention provides a computer comprising a plurality of processing nodes arranged in respective front and rear layers, each layer comprising a two-dimensional array of processing nodes, each processing node having a set of activatable links which, when activated, enable the transmission of data items between the processing node and an adjacent processing node connected via the activated link and, when not activated, prevent the transmission of data items between the processing node and the adjacent processing node connected via the inactive link,
the set of activatable links comprising for each processing node in the first and second layer a respective link which connects the processing node to each adjacent node in the array, and to a facing processing node in the second or first layer respectively; and an allocation engine configured to receive an allocation instruction and connected to the processing nodes to selectively activate the links to connect at least a group of the processing nodes in a configuration in which:
(i) links between adjacent nodes within each of the first and second layers respectively are activated;
(ii) links between facing nodes are only activated for edge processing nodes of the group; and
(iii) links between processing nodes outside the group and adjacent processing nodes in the group are deactivated.
The set of activatable links may comprise two such respective links connecting the processing node to its facing processing node. In that case, two links are activated between corner facing nodes of the group.
The links may be bi-directional links.
The two-dimensional array may be an array of n by m processing nodes, and wherein the group comprises an array of p×q processing nodes in each layer where at least one of the following conditions is satisfied: p is less than n or q is less than m.
In some embodiments, m equals n. In some embodiments p equals q.
In some embodiments, the links may be SERDES links or similar, wherein each link when activated has a fixed power requirement largely independent of data traffic. That is, the link draws power when activated even when no data is being transmitted, and when deactivated consumes no power. The links may operate to transmit data by a variation in voltage from a powered voltage level on the link.
The allocation engine may comprise one or more processor configured to execute allocation computer code responsive to a user request. This processor may be a separate master machine, or could be implemented by code on each of the processing nodes themselves without the need for a separate allocation machine.
The invention also provides in another aspect a method of configuring a computer comprising a plurality of processing nodes arranged in respective front and rear layers, each layer comprising a two-dimensional array of processing node, each processing node having a set of activatable links which, when activated, enable the transmission of data items between the processing node and an adjacent processing node connected via the activated link and, when not activated, prevent the transmission of data items between the processing node and the adjacent processing node connected via the inactive link, the set of activatable links comprising for each processing node in the first and second layer a respective link which connects the processing node to each adjacent node in the array, and to a facing processing node in the second or first layer respectively, the method comprising:
selectively activating the links of each processing node in at least a group of the processing nodes to generate a networked configuration of processing nodes in which:
(i) links between adjacent nodes within each of the first and second layers respectively activated;
(ii) links between facing nodes are only activated for edge processing nodes of the group; and
(iii) links between processing nodes outside the group and adjacent the processing nodes are deactivated.
Links may be selectively activated by providing power to the link.
The invention provides in a further aspect a method of operating a group of processing nodes in a networked configuration, the method comprising configuring a computer to generate the networked configuration in accordance with the method defined above, and
operating the group of processing nodes in the networked configuration using m rings in each of two dimensions, where each ring is formed by n nodes, where n is the number of edge processing nodes in the networked configuration.
The method of operating the group may comprise dividing a partial vector generated at each processing node of the configuration into fragments and implementing a logical ring for each corresponding fragment in the partial vector to implement an Allreduce collective.
The Allreduce collective may be implemented by a reduce-scatter collective followed by an Allgather collective in the logical rings.
The logical rings may be implemented in forwards and backwards directions in each dimension.
The invention provides in a further aspect a “cushion” computer which comprises a plurality of processing nodes arranged in respective front and rear layers, each layer comprising a two-dimensional array of processing nodes, each processing node having a set of activated links which enable the transmission of data items between the processing node and an adjacent processing node connected via the activated link, wherein the processing nodes are connected in a configuration in which:
(i) adjacent nodes within each of the first and second layers are connected by activated links;
(ii) edge-processing nodes in each of the first and second layers are connected by activated links to their facing node in the other layer; and
(iii) any links between processing nodes outside the configuration and processing nodes in the configuration are deactivated such that the transmission of data items is prevented between processing nodes of the configuration and processing nodes outside the configuration.
The processing nodes of the configuration can be operated to form a set of connected rings in each of X and Y directions, wherein each ring comprises the same number of processing nodes.
Each processing node may comprise memory configured to store an array of data items (such as a vector or tensor) ready to be exchanged in the reduce scatter phase, wherein each data item is respectively positioned in the array with corresponding data items being respectively positioned at corresponding locations in the arrays of other processing nodes. The array may be a partial vector “partial” (a vector of partial results) in the reduce-scatter phase or a “result” (a vector of fully reduced partials) in the Allgather phase.
The processing nodes may each be programmed to transmit data items to its adjacent processing node in each ring in a forwards (or backwards, respectively) direction in the reduce-scatter phase. The data items which may be transmitted in each step are termed a “fragment”. A fragment is piece of the vector—as described herein, vectors are divided into fragments to make use of logical rings formed in the physically connected rings.
Each array may represent at least part of a vector of partial deltas, each partial delta representing an adjustment to a value stored at each processing node. Each processing node may be programmed to generate the vector of partial deltas in a compute step. Each processing node may programmed to divide its vector into two sub arrays for respective utilisation of the two rings in X and Y directions at each node.
For a better understanding of the present invention to show how the same may be carried into effect, reference will now be made by way of example to the accompanying drawings.
Aspects of the present invention have been developed in the context of a multi-tile processor which is designed to act as an accelerator for machine learning workloads. The accelerator comprises a plurality of interconnected processing nodes. Each processing node may be a single multi-tile chip, a package of multiple chips, or a rack of multiple packages. The aim herein is to devise a machine which is highly efficient at deterministic (repeatable) computation. Processing nodes are interconnected in a manner which enable collectives, especially broadcast and Allreduce, to be efficiently implemented.
One particular application is to update models when training a neural network using distributed processing. In this context, distributed processing utilises multiple processing nodes which are in different physical entities, such as chips or packages or racks. That is transmitting data between the processing nodes requires messages to be exchanged over physical links.
The challenges in developing a topology dedicated to machine learning differ from those in the general field of high performance computing (HPC) networks. HPC networks usually emphasise on demand asynchronous all-to-all personalised communication, so dynamic routing and bandwidth over provisioning are normal. Excess bandwidth may be provisioned in a HPC network with the aim of reducing latency rather than to provide bandwidth. Over provisioning of active communication links waste power which could contribute to compute performance. A commonly used link today draws power when activated whether or not it is being used to transmit data.
Torus computer networks have been discussed above as a machine topology which is particularly adapted to super computing machine workloads, such as ML workloads.
In ML workloads, inter chip communication is currently dominated by broadcast and Allreduce collectives. The broadcast collective can be implemented by a scatter collective followed by an Allgather collective, and the Allreduce collective can be implemented by a reduce-scatter collective followed by an Allgather collective. In this context, the term inter-chip denotes any communication between processing nodes which are connected via external communication links. As mentioned, these processing nodes may be chips, packages or racks.
One particularly useful algorithm which has been developed to process the Allreduce collective in machine learning algorithms has been described above. This algorithm uses rings in a torus structure to efficiently exchange and process fragments of partial vectors. One challenge which arises however is that such torus computer networks are not easy to partition. One objective of the novel computer network discussed herein is to enable a large network of many processing nodes to be partitioned into one or more smaller networks without any or any significant hardware involvement in the partitioning. To achieve the above objectives in ML applications, rings in two dimensions need to be preserved in the allocated portions. With present torus computer networks this is not possible, as any partitioning of the torus computer network would involve breaking one or more of the rings in the torus structure.
One reason for partitioning very large networks is to enable one or more tenant to use the network. A tenant may be described as a computer program distributed across processing nodes of the partition, the program being self-contained and controlled by a particular owner. Another requirement of partition networks is that there is no “leakage” between partitions. That is, it is highly undesirable for an owner of one tenant to be exposed to code or messages from another tenant of the partition.
The front layer comprises an n×n array or grid of processing nodes which in the embodiment of
If the 6×6 array were to be used as a single computer, the links between the front and rear faces on the central nodes would be rendered inactive. Furthermore, only one of the link between the facing layers would be activated at the edge nodes (see for example nodes 404 and its facing node 414). However, both of the link between the facing corner nodes PNF/PNR would be activated. That is, for each node in an active computer topology four of the six links would be active, and the remaining two would be inactive. It should be evident from this description that in the case of the central node 400 the inactive links are LF5 and LF6, and the active links are L1 through L4.
As a result of this connectivity, the word “cushion” has been coined to describe the computer network in its activated state. This is because it is joined only at its edges and not between its facing layers.
The activated computer network in this form provides two dimensions of n rings of length 2 n (where n=6 in the maximum configuration of
This provides for an optimally-efficient implementation of an AllReduce algorithm as discussed earlier herein. Reference will now be made to
In the vertical (Y) direction, the extreme most left hand ring is formed by nodes Na1, Na2, Na3 (in the front face) and N′a3, N′a2 and N′a1 (in the rear face). Once again, the ring can operate in two directions due to the bi-directional links.
The nodes with a dark outline (the 4x4 group of nodes in the bottom left-hand corner of the array) represent a partitioned cushion that is an allocated set of nodes from the main set. The corner nodes of the front layer are labelled 420 and the corner nodes of the rear layer are labelled 422. The four active links for each node are shown in black or dark grey illustrating whether they form part of a vertical or horizontal link respectively. The facing corner nodes are each connected by two links, a black and a dark grey because these form the terminations of both horizontal and vertical rings. The remaining edge nodes on the horizontal edge connect the facing layers only through a single activated link (dark grey) which forms the termination of vertical rings only. Conversely, the edge nodes on the vertical edges are connected only through one black link which represents the termination of horizontal rings.
The links between corresponding facing central nodes are de-activated (light grey). Furthermore, the links of nodes outside the partitioned cushion (for example, the vertical links from nodes 420 and 422 are de-activated). Note, however, that the de-activation of these links would not prevent the forming of a second cushion comprising the nodes in the upper two rows of the array.
Partitioning of the array into one or more partitioned cushions can be carried out in any suitable way.
A particular advantage of being able to partition the network into separate “cushions” is that each cushion has multiple rings, which can be used to efficiently implement a ring Allreduce collective.
The Allreduce collective has been described above and is illustrated in
In step one, the first fragment (the A0) in each virtual ring is transferred from its node to the next adjacent node where it is reduced with the corresponding fragment at that node. That is, RA0 moves from N0 to N1 where it is reduced into R(A0+A1). Once again, the “+” sign is used here as a shorthand for any combinatorial function. Note that in the same step the A 0 fragments of each virtual ring will simultaneously be being transmitted. That is, the link between N1 and N2 is used to transmit YA0, the link between N2 and N3 is used to transmit GA0, et cetera. In the next step, the corresponding reduced fragments are transmitted over the forward links to their next adjacent node. For example, R(A0+A1) is transmitted from N1 to N2, and Y(A0+A1) is transmitted from N2 to N3. Note that for reasons of clarity not all fragments are numbered, nor are all transmissions numbered in
The beginning of the Allgather phase starts by a transmission from the last to the first node in each virtual ring. Thus, the final reduction for the R fragments ends on node N5 ready for the first step of the Allgather phase. The final reduction of the Y fragments correspondingly ends up on the node N0. In the next step of the Allgather phase, the reduced fragments are transmitted again to their next adjacent node. Thus the fully reduced R fragment is now also at N2, the fully reduced Y fragment is now also at N3 and so on. In this way, each node ends up at the end of the Allgather phase with all fully reduced fragments R, Y, G, B, P, L of the partial.
Implementation of the algorithm can be achieved if the computation required for the reduction can be concealed behind the pipeline latency. Note that in forming suitable rings in a computer for implementation of Allreduce, a tour of the ring must visit each node in the ring only once. The algorithm described above uses six “pipelined” logical rings in a physical one-dimensional ring. The principle can be extended to two dimensions in the cushion. For full bandwidth utilisation, not only are rings used in a horizontal direction, but also in a vertical direction. Consider for example the node Na1. The partial at node Na1 is split into two halves. Note for the purpose of this discussion that the effect of bi-directionality in the links is ignored for now. In fact, everything that is described herein is mirrored in the opposite direction. Reverting to node Na1 the partial to be transmitted from this node is split into two halves. A first half of the partial is reduce-scattered (according to the one-dimensional line algorithm described above) around the ring in the X direction, and the other half is reduce-scattered around the ring in the Y direction. The same thing is happening at each node in the cushion. Note that (according to the one-dimensional algorithm described above), in fact, what is transmitted in each step on the X ring or Y ring respectively is a corresponding fragment for that step of the logical ring. When the reduce-scatter operations in the X and Y rings have been completed, the algorithm reverts back to the first node in each ring (consider node Na1) and swaps over the partials. That is, it now reduce-scatters the first set of the reductions resulting from the first pass round the X ring in the Y direction, and conversely reduce-scatters the first pass of reductions from the Y ring around the X ring.
After this phase, the fully reduced fragments are subject to an Allgather step, according to the one-dimensional ring algorithm described above. The first sets of the fully reduced fragments are Allgathered round the Y rings, and the second set of the fully reduced fragments are Allgathered around the X rings. Then, once again, the partially gathered sets are swapped and an Allgather process takes a first set around the X rings, and the second set around the Y rings.
The final result is a fully reduced complete vector at each node. As the number of nodes in the rings in both the X and Y directions are the same, the process has full bandwidth utilisation in all phases (that is all links are operating).
Each node is capable of implementing a processing or compute function. Each node could be implemented as a single processor. It is more likely, however, that each node will be implemented as a single chip or package of chips, wherein each chip comprises multiple processors. There are many possible different manifestations of each individual node. In one example, a node may be constituted by an intelligence processing unit of the type described in British applications with publication numbers GB2569843; GB2569430; GB2569275; the contents of which are herein incorporated by reference. However, the techniques described herein may be used on any type of processor constituting the nodes. What is outlined herein is a computer topology which can be partitioned in an efficient manner to maintain “rings” which are useful in executing collectives in machine learning models. Furthermore, the links could be manifest in any suitable way, subject only to the criteria that they are bi-directional. As mentioned, one particular category of communication link is a SERDES link which has a power requirement which is independent of the amount of data that is carried over the link, or the time spent carrying that data. SERDES is an acronym for Serializer/DeSerializer and such links are known. In order to transmit a signal on a wire of such links, power is required to be applied to the wire to change the voltage in order to generate the signal. A SERDES link has the characteristic that power is continually applied to the wire to maintain it at a certain voltage level, such that signals may be conveyed by a variation in that voltage level (rather than by a variation between 0 and an applied voltage level). Thus, there is a fixed power for a bandwidth capacity on a SERDES link whether it is used or not. A SERDES link is implemented at each end by circuitry which connects a link layer device to a physical link such as copper wires. This circuitry is sometimes referred to as PHY (physical layer). PCIe (Peripheral Component Interconnect Express) is an interface standard for connecting high speed computers.
Deactivation of the links in non-allocated partitions can save power. While in theory the links could be deactivated to consume effectively no power in an allocated cushion, in practice the activation time and non-deterministic nature of machine learning applications can render dynamic activation during program execution as problematic. As a consequence, the present inventor has determined that it is better to make use of the fact that the chip to chip link power consumption is essentially constant for any particular configuration, and that therefore the best optimisation is to maximise the use of the physical links by maintaining chip to chip traffic within a “cushion” concurrent with IPU activity as far as is possible.
SERDES PHYs are full duplex (that is a 16 Gbit per second PHY supports 16 Gbits per second in each direction simultaneously), so full link bandwidth utilisation implies balanced bi-directional traffic. Moreover, note that there is significant advantage in using direct chip to chip communication as compared with indirect communication such as via switches. Direct chip to chip communication is much more power efficient than switched communication.
Another factor to be taken into consideration is the bandwidth requirement between nodes. An aim is to have sufficient bandwidth to conceal inter node communication behind the computations carried out at each node for distributed machine learning.
The links are physical links provided by suitable buses or wires as mentioned above. In one manifestation, each processing node has a set of wires extending out of it for connecting it to another processing node. This may be done for example by one or more interface of each processing node having one or more port to which one or more physical wire is connected.
In another manifestation, the links may be constituted by on-board wires. For example, a single board may support a group of chips, for example four chips. Each chip has an interface with ports connectable to the other chips. Connections may be formed between the chips by soldering wires onto the board according to a predetermined method. Note that the concepts and techniques described herein are particularly useful in that context, because they make maximise use of links which have been pre soldered between chips on a printed circuit board.
The concepts and techniques described herein are particularly useful because they enable optimum use to be made of non-switchable links. A configuration may be built by connecting up the processing nodes as described herein using the fixed non switchable links between the nodes.
In some embodiments, in order to use the configuration, a set of parallel programs are generated. The set of parallel programs contain node level programs, that is programs designated to work on particular processing nodes in a configuration. The set of parallel programs to operate on a particular configuration may be generated by a compiler. It is the responsibility of the compiler to generate node level programs which correctly define the links to be used for each data transmission step for certain data. These programs include one or more instruction for effecting data transmission in a data transmission stage which uses a link identifier to identify the link to be used for that transmission stage. For example, a processing node may have two or three active links at any one time (double that if the links are simultaneously bidirectional). The link identifier causes the correct link to be selected for the data items for that transmission stage. Note that each processing node may be agnostic of the actions of its neighbouring nodes—the exchange activity is pre compiled for each exchange stage.
Note also that links do not have to be switched—there is no need for active routing of the data items at the time at which they are transmitted, or to change the connectivity of the links.
As mentioned above, the configurations of computer networks described herein are to enhance parallelism in computing. In this context, parallelism is achieved by loading node level programs into the processing nodes of the configuration which are intended to be executed in parallel, for example to train an artificial intelligence model in a distributed manner as discussed earlier. It will be readily be appreciated however that this is only one application of the parallelism enabled by the configurations described herein. One scheme for achieving parallelism is known as “bulk synchronous parallel” (BSP) computing. According to a BSP protocol, each processing node performs a compute phase and an exchange phase which follows the compute phase. During the compute phase, each processing nodes performs its computation tasks locally but does not exchange the results of its computations with the other processing nodes. In the exchange phase, each processing node is permitted to exchange the results of its computations from the preceding compute phase with the other processing nodes in the configuration. A new compute phase is not commenced until the exchange phase has been completed on the configuration. In this form of BSP protocol, a barrier synchronisation is placed at the juncture transitioning from the compute phase into the exchange phase, or transitioning from the exchange phase into the compute phase or both.
In the present embodiments, when the exchange phase is initiated, each processing node executes an instruction to exchange data with its adjacent nodes, using the link identifier established by the compiler for that exchange phase. The nature of the exchange phase can be established by using the MPI message passing standard discussed earlier. For example, a collective may be recalled from a library, such as the all reduced collective. In this way, the compiler has precompiled node level programs which control the links over which the partial vectors are transmitted (or respective fragments of the partial vectors are transmitted).
It will readily be apparent that other synchronisation protocols may be utilised.
While particular embodiments have been described, other applications and variants of the disclosed techniques may become apparent to a person skilled in the art once given the disclosure herein. The scope of the present disclosure is not limited by the described embodiments but only by the accompanying claims.
Claims
1. A computer comprising a plurality of processing nodes, each processing nodes having at least one processor configured to process input data to generate output data in the form of an array of data items; the plurality of processing nodes arranged in cliques in which each processing node of a clique is connected to each other processing node in the clique by first and second clique links, the cliques being inter-connected in rings such that each processing node is a member of a single clique and a single ring, the processing nodes being configured to exchange data items in respective exchange steps of a machine learning collective, wherein the processing nodes of all cliques are configured to exchange in each exchange step via the respective first and second clique links at least two data items with the other processing node(s) in its clique, and all processing nodes are configured to reduce each received data item with the data item in the corresponding position in the array on that processing node.
2. The computer according to claim 1, wherein the machine learning collective is an Allreduce collective and each processing node is configured to exchange data items in exchange steps of an Allgather phase, following a reduce scatter phase of the Allreduce collective, wherein in each step of the Allgather phase reduced data items are exchanged between processing nodes in a clique.
3. The computer according to claim 1, wherein each processing node comprises memory configured to store an array of data items ready to be exchanged in the reduce scatter phase, wherein each data item is respectively positioned in the array with corresponding data items being respectively positioned at corresponding locations in the arrays of other processing nodes.
4. The computer according to claim 1, wherein the processing nodes are each configured to transmit data items in a forwards direction to its adjacent processing node in the ring in at least some of the exchange steps in the reduce-scatter phase.
5. The computer according to claim 4, wherein the processing nodes are configured to transmit data items to their forwards adjacent processing node in the ring for all exchange steps of the reduce scatter phase apart from a first step, in which no data items are transmitted between processing nodes connected in a ring.
6. The computer according to claim 1, wherein the array at each processing node comprises two sub arrays and processing nodes are inter-connected by bi-directional links, wherein in each exchange step of the reduce scatter phase, all processing nodes are configured to exchange with the other processing node(s) of their clique, two data items from one sub array and two further data items from the other sub array wherein the two data items and the further two data items are exchanged over the same bi-directional link in opposite directions.
7. The computer according to claim 6, wherein the processing nodes are each configured to transmit data items in a forwards direction to its adjacent processing node in the ring in at least some of the exchange steps in the reduce-scatter phase and wherein in at least some exchange steps of the reduce-scatter phase each processing node is configured to transmit data items to its adjacent backwards processing node in the ring, wherein the transmission in each of the forwards and backwards direction from each processing node is carried out on the same bi-directional link.
8. The computer according to claim 3, wherein each array represents at least part of a vector of partial deltas, each partial delta representing an adjustment to a value stored at each processing node.
9. The computer according to claim 8, wherein each processing node is configured to generate the vector of partial deltas in a compute step.
10. The computer according to claim 9, wherein each processing node is configured to divide the vector into two sub arrays for separate exchange and reduction in the reduce-scatter phase.
11. The computer according to claim 9, wherein each processing node is configured to generate the vector of partial deltas by carrying out a compute function on a set of values and a batch of incoming deltas, the partial deltas being the output of the compute function.
12. The computer according to claim 11, which is configured to implement a machine learning model wherein the incoming batch data is training data, and the values are weights of the machine learning model.
13. A method of operating a computer comprising a plurality of processing nodes, each processing node having at least one processor configured to process input data to generate output data in the form of an array of data items, the plurality of processing nodes arranged in cliques in which each processing node of a clique is connected to each other processing node in the clique by first and second clique links, the cliques being interconnected in rings such that each processing node is a member of a single clique and a single ring, the method comprising exchanging data item in respect of exchange steps of a first phase of a machine learning collective, wherein in each exchange step the processing nodes of all cliques exchange via the respective first and second clique links at least two data items with the other processing nodes in its clique, and all processing nodes reduce each received data item with the data item in the corresponding position in the array on that processing node.
14. The method according to claim 13, wherein the machine learning collective is an Allreduce collective and each processing node exchanges data items in exchange steps of an Allgather phase, following a reduce scatter phase of the Allreduce collective, wherein in each step of the Allgather phase reduced data items are exchanged between processing nodes in a clique.
15. The method according to claim 13, wherein each processing node comprises memory configured to store an array of data items ready to be exchanged in the reduce scatter phase, wherein each data item is respectively positioned in the array with corresponding data items being respectively positioned at corresponding locations in the arrays of other processing nodes.
16. The method according to claim 13, wherein each processing node transmits data items in a forwards direction to its adjacent processing node in the ring in at least some of the exchange steps in the reduce-scatter phase.
17. The method according to claim 16, wherein each processing node transmits data items to their forwards adjacent processing node in the ring for all exchange steps of the reduce scatter phase apart from a first step, in which no data items are transmitted between processing nodes connected in a ring.
18. The method according to claim 13, wherein the array at each processing node comprises two sub arrays and processing nodes are inter-connected by bi-directional links, wherein in each exchange step of the reduce scatter phase, all processing nodes exchange with the other processing node(s) of their clique, two data items from one sub array and two further data items from the other sub array wherein the two data items and the further two data items are exchanged over the same bi-directional link in opposite directions.
19. The method according to claim 18, wherein each processing node transmits data items in a forwards direction to its adjacent processing node in the ring in at least some of the exchange steps in the reduce-scatter phase and wherein in at least some exchange steps of the reduce-scatter phase each processing node transmits data items to its adjacent backwards processing node in the ring, wherein the transmission in each of the forwards and backwards direction from each processing node is carried out on the same bi-directional link.
20. The method according to claim 15, wherein each array represents at least part of a vector of partial deltas, each partial delta representing an adjustment to a value stored at each processing node.
Type: Application
Filed: Mar 26, 2020
Publication Date: Oct 1, 2020
Inventor: Simon KNOWLES (Bristol)
Application Number: 16/831,572