Parallel processing mechanism for multi-processor systems
A multi-processor computing device is provided that has at least two processing subsystems which each comprise a processor unit and at least one further component. In each processing subsystem, the processor unit is connected to the further component via a first link, and can be connected to at least one processor unit of another processing subsystem via a second link. The first and second links are physically decoupled, and the processing subsystems can simultaneously send data over the first and second links. There are further provided corresponding processing subsystems and multi-processor computing methods.
1. Field of the Invention
The invention generally relates to multi-processor computing devices and corresponding methods, and in particular to a technique for implementing parallel processing mechanisms.
2. Description of the Related Art
Multi-processor systems are generally used to increase computing capability by building systems that have more than one processor to perform the central processing tasks. Two structurally different concepts are known: SMP (Symmetric Multi-Processing) and MPP (Massively Parallel Processing).
SMP systems have multiple identical processors that share the memory and make use of a global address space. Communication between the processors is done using a shared parallel bus. Usually, the parallelization of the applications is done by the operating system by assigning the different tasks to the various processors. However, SMP systems suffer from low scalability since the number of processors is limited by the capacity of the shared bus.
As mentioned above, the scalability of systems like those shown in
Other SMP techniques to increase the scalability include the NUMA (Non-Uniform Memory Access) and the COMA (Cache Only Memory Architecture) architectures. However, these techniques introduce undesired asymmetry to the I/O and graphics systems.
MPP systems have a plurality of computer nodes, each of which is an independent processor-memory group that runs its own operating system. There is no common address space, so communication between the nodes requires message buses or even networks. MPP systems are easily scalable but are difficult to program, since each application program has to handle the parallel processing itself.
Thus, conventional techniques are either limited with respect to the scalability, or are difficult to implement. The lack of flexibility in implementing the parallel processing mechanisms often results from the fact that conventional systems have the parallelization mechanism hardwired into the system.
SUMMARY OF THE INVENTION
An improved multi-processing technique is provided that may allow for high performance parallel processing in easily scalable structures implementing flexible parallelization mechanisms.
In one embodiment, there is provided a multi-processor computing device that comprises at least two processing subsystems. Each processing subsystem comprises a processor unit and at least one further component. In each one of the at least two processing subsystems, the processor unit is connected to the at least one further component via at least one first link. Further, the processor unit in each one of the at least two processing subsystems is adapted to be connected to at least one processor unit of another one of the at least two processing subsystems via at least one second link. The at least one first link and the at least one second link are physically decoupled. The at least two processing subsystems are capable of simultaneously sending data over the at least one first link and the at least one second link.
According to another embodiment, a processing subsystem for use in a multi-processor computing device is provided. The processing subsystem comprises a processor unit and at least one further component. The processor unit is connected to the at least one further component via at least one first link. The processor unit is further adapted to be connected to at least one processor unit of another processing subsystem via at least one second link. The at least one first link and the at least one second link are physically decoupled. The processing subsystem is capable of simultaneously sending data over the at least one first link and the at least one second link.
In a further embodiment, there is provided a multi-processor computing method. The multi-processor computing method comprises operating a first and a second processing subsystem of a multi-processor computing device. The first and second processing subsystems each comprise a processor unit and at least one further component. Operating the first and second processing subsystems comprises simultaneously sending data over at least one first link between the processor unit and a respective further component of one of the first and second processing subsystems, and at least one second link between the processor units of the first and second processing subsystems. The at least one first link and the at least one second link are physically decoupled.
In still a further embodiment, a computer-readable storage medium stores instructions that, when executed on a multi-processor computing device that has at least two processing subsystems which each comprise a processor unit and at least one further component, cause the multi-processor computing device to simultaneously send data over at least one first link between the processor unit and a respective further component of one of the processing subsystems, and at least one second link between the processor units of the processing subsystems. The at least one first link and the at least one second link are physically decoupled.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings are incorporated into and form a part of the specification for the purpose of explaining the principles of the invention. The drawings are not to be construed as limiting the invention to only the illustrated and described examples of how the invention can be made and used. Further features and advantages will become apparent from the following and more particular description of the invention, as illustrated in the accompanying drawings, wherein:
The illustrative embodiments of the present invention will be described with reference to the figure drawings wherein like elements and structures are indicated by like reference numbers.
As will be described in more detail below, the embodiments make use of processing subsystems that have a link structure which makes it possible to easily scale the system to increase the degree of parallelization in a flexible manner.
Referring to
Thus, the arrangement of
Having dedicated links for each function allows these functions to use their links in a deterministic way so that no transfer is interrupted by other functions and each link has its full dedicated bandwidth without the need to share the bandwidth with other functions. This enables the processing subsystem 200 to perform highly concurrent transfers, and in addition makes the system highly scalable simply by adding further processing subsystems to a multi-processor computing device.
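The bandwidth argument can be illustrated with a deliberately simplified model. The sketch below is not taken from the specification: the function names, the bandwidth figure of 1.0 (an arbitrary unit), and the assumption that a shared bus divides its bandwidth evenly among active transfers are all hypothetical, and arbitration overhead is ignored.

```python
def per_transfer_bandwidth(link_bandwidth: float, transfers: int, shared: bool) -> float:
    """Bandwidth seen by each concurrent transfer (hypothetical, idealized model)."""
    if shared:
        # One shared bus: concurrent transfers divide the bandwidth evenly.
        return link_bandwidth / transfers
    # Dedicated links: every transfer keeps the full bandwidth of its own link.
    return link_bandwidth


def aggregate_bandwidth(link_bandwidth: float, links: int, shared: bool) -> float:
    """Total bandwidth available to the system under the same idealized model."""
    return link_bandwidth if shared else link_bandwidth * links


# Four functions transferring at once, 1.0 arbitrary units per link or bus.
shared_each = per_transfer_bandwidth(1.0, 4, shared=True)
dedicated_each = per_transfer_bandwidth(1.0, 4, shared=False)
dedicated_total = aggregate_bandwidth(1.0, 4, shared=False)
print(shared_each, dedicated_each, dedicated_total)
```

Under these assumptions, adding a dedicated link adds bandwidth without reducing what the existing links provide, which is the scalability property described above.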
One or more of the links shown in
It is noted that the arrangement of
Referring now to
The graphics subsystem 300 performs the necessary graphic operations. Various functionality modifications and implementations are possible. For instance, the graphics subsystem can be a standard graphics adapter card, a special chip which is directly coupled to the CPU, an external graphics subsystem, or it may be integrated on the CPU. Further, the connection to the CPU link may be different in the various embodiments. For instance, the CPU link may interface directly with the graphics subsystem, or it may require a bridge system.
In the embodiment of
While not limited to the embodiments of
It is to be noted that other embodiments may differ from the arrangement of
Similarly, the type of internal links between the processor units 410, 430, 450 and the graphics subsystems 405, 425, 445 may vary in other embodiments. Examples of such embodiments will be described in more detail below.
As shown in
Referring now to
In the embodiment, each component 405, 410, 415, 425, 430, 435, 445, 450, 455 of each processing subsystem 400, 420, 440 can communicate with any other component of its own processing subsystem 400, 420, 440 or any other processing subsystem 400, 420, 440. For instance, the processor unit 410 of the processing subsystem 400 may communicate with the graphics subsystem 425 of processing subsystem 420 by forming a data path 510 which includes the processor unit 430 of the processing subsystem 420. The processor unit 430 routes any communication received from one of the two components to the other one.
In another example, the graphics subsystem 405 of the processing subsystem 400 is allowed to communicate with the graphics subsystem 425 of the processing subsystem 420 by forming a data path 500. Any communication through this path is routed by the processor units 410 and 430.
It is to be noted that the routing may be completely transparent to the software. That is, the software just needs to provide the addresses of the receiving component so that from a software perspective, each processor unit 410, 430, 450 can communicate with any other component directly. There is no difference with respect to whether a component communicates with another component of the same processing subsystem, or with a component of a foreign processing subsystem.
That is, each processor unit of each processing subsystem can select one of its internal or external links (e.g., Link0, Link1, Link2 or Link3) to send data in response to receiving an address of the target component from a software function. Further, each processor unit can route data from one link to another link dependent on the address of the target component.
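The address-driven link selection can be sketched in software as follows. This is only an illustrative model, not the link hardware itself: the `(subsystem, component)` address format, the class and function names, the assignment of Link0 and Link1 to internal components and Link2 and Link3 to external connections, and the assumption that every pair of processor units is directly connected are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical address format: (subsystem id, component name).
Address = Tuple[int, str]


class ProcessorUnit:
    """Toy model of a processor unit that picks a link from a target address."""

    def __init__(self, subsystem_id: int,
                 local_links: Dict[str, str],
                 remote_links: Dict[int, str]) -> None:
        self.subsystem_id = subsystem_id
        self.local_links = local_links    # component name -> internal link
        self.remote_links = remote_links  # other subsystem id -> external link

    def select_link(self, target: Address) -> str:
        """Return the link on which data for `target` should be sent."""
        subsystem, component = target
        if subsystem == self.subsystem_id:
            return self.local_links[component]
        return self.remote_links[subsystem]


def trace_path(units: Dict[int, ProcessorUnit], source_id: int,
               target: Address) -> List[Tuple[int, str]]:
    """List (subsystem, link) hops, assuming directly connected processor units."""
    hops = [(source_id, units[source_id].select_link(target))]
    target_subsystem = target[0]
    if target_subsystem != source_id:
        # The receiving processor unit forwards the data onto its internal link.
        hops.append((target_subsystem,
                     units[target_subsystem].select_link(target)))
    return hops


internal = {"graphics": "Link0", "memory": "Link1"}  # assumed link assignment
units = {
    400: ProcessorUnit(400, internal, {420: "Link2", 440: "Link3"}),
    420: ProcessorUnit(420, internal, {400: "Link2", 440: "Link3"}),
    440: ProcessorUnit(440, internal, {400: "Link2", 420: "Link3"}),
}
path = trace_path(units, 400, (420, "graphics"))
print(path)
```

The caller only supplies the target address. Whether the target is local or in another processing subsystem is resolved inside `select_link`, which mirrors the transparency to software described above.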
This functionality allows any parallel processing mechanism to be applied flexibly, simply by using accordingly adapted software. There is then no need to re-configure the hardware. Thus, the parallelization method to be used is not hardwired into the system, but is implemented purely by means of software. Consequently, various parallelization mechanisms can be used on the same hardware platform without requiring any hardware modifications.
It is to be noted that the software just provides the target addresses, and the routing is done by the underlying link hardware. The software does not need to be responsible for the routing, nor is the routing visible to the components.
In a further embodiment, the performance can be increased still further by selecting a software-implemented parallelization mechanism that minimizes the communication between the processing subsystems, since this reduces access latencies.
The following description provides some examples of how good use can be made of the graphics subsystems 405, 425, 445. While not limited to these examples, embodiments will be discussed (i) where each graphics subsystem is directly connected to a physical monitor device, (ii) where just one graphics subsystem is connected to a monitor but the graphics workload is split across all graphics subsystems, and (iii) where multiple monitor devices are used in an SMP-like arrangement. In the latter case, the processor units share the workload of a performance intensive operation regardless of whether the operation is graphics related or not.
Taking first the multiple monitor embodiment,
The arrangement of
In the embodiment of
In other words, taking the example of having multiple viewports as shown in
Turning now to the single monitor embodiment mentioned above,
To achieve this, the present embodiment splits the amount of processing work per frame into multiple workloads which are then distributed to all processing subsystems. The frame may be tiled in many different ways, and the processing may be interleaved. Examples of how a frame may be split are given in
In the embodiment of
It is to be noted that in other embodiments, the frame regions may be arranged in any other configuration, and there is then no need for the frame regions to be of the same size or superficial extent.
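One possible way to split a frame into regions can be sketched as below. This is a minimal illustration under stated assumptions: the function name, the `(x, y, width, height)` region format, the choice of full-width bands stacked from top to bottom, and the 1280 by 1024 frame size are hypothetical and not taken from the specification. The remainder handling shows that the regions need not all have the same size.

```python
from typing import List, Tuple

# Hypothetical region format: (x, y, width, height) in pixels.
Region = Tuple[int, int, int, int]


def split_into_bands(width: int, height: int, count: int) -> List[Region]:
    """Split a frame into `count` full-width bands stacked top to bottom."""
    base, extra = divmod(height, count)
    regions = []
    y = 0
    for index in range(count):
        # The first `extra` bands get one additional row, so bands may differ in size.
        band_height = base + (1 if index < extra else 0)
        regions.append((0, y, width, band_height))
        y += band_height
    return regions


# One band per processing subsystem, for an assumed 1280 x 1024 frame.
regions = split_into_bands(1280, 1024, 3)
print(regions)
```

Each resulting region could then be assigned to one processing subsystem. Other tilings, such as rectangular tiles smaller than the frame in both dimensions, would only change how the region list is generated.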
Referring, however, back to the arrangements of
Referring to the flow chart of
Once all processing subsystems have rendered their frame region into their private frame buffer (which may be located in the graphics memory 320) in step 920, the results are copied to the master graphics subsystem 405 via data paths 710, 720 in step 930. The copied pixel data are then merged into the frame buffer of the graphics subsystem 405 (step 940) so that the frame pixel data can be displayed on the monitor 700.
While the copying of step 930 is shown in
That is, embodiments may exist where the graphics subsystems have a direct link between them to merge the data. Alternatively, the rendered frame region data can be combined at the monitor output.
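The render, copy, and merge sequence can be modeled in a few lines. The sketch is hypothetical throughout: frame buffers are plain nested lists, "rendering" just fills a region with a constant value, and the function names and the tiny 4 by 3 frame are chosen only for illustration. It does not model the links over which the copy would travel.

```python
from typing import List, Tuple

Region = Tuple[int, int, int, int]  # hypothetical (x, y, width, height)
Pixels = List[List[int]]


def render_region(region: Region, value: int) -> Pixels:
    """Stand-in for rendering: fill a private buffer the size of the region."""
    _, _, width, height = region
    return [[value] * width for _ in range(height)]


def merge_into_master(master: Pixels, region: Region, pixels: Pixels) -> None:
    """Copy one rendered region into its place in the master frame buffer."""
    x, y, width, height = region
    for row in range(height):
        master[y + row][x:x + width] = pixels[row]


# A 4 x 3 frame split into three one-row regions, one per processing subsystem.
regions = [(0, 0, 4, 1), (0, 1, 4, 1), (0, 2, 4, 1)]
private_buffers = [render_region(region, value)
                   for value, region in enumerate(regions, start=1)]

master = [[0] * 4 for _ in range(3)]
for region, pixels in zip(regions, private_buffers):
    merge_into_master(master, region, pixels)
print(master)
```

In this model the master buffer plays the role of the frame buffer of the graphics subsystem that drives the monitor, and each private buffer is merged without touching the rows owned by the other regions.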
As mentioned above, the discussed multi-monitor or single-monitor arrangements are merely non-limiting embodiments. In general, the parallel-processing approach of the embodiments is generic in the sense that it is not restricted to the use of graphics. In other words, embodiments exist that may run standard SMP applications. Taking for instance the hardware arrangement of
As mentioned above, the number of processing subsystems in the multi-processor computing devices of the embodiments is not limited to three. Further, a processing subsystem may contain more than one graphics subsystem for certain requirements. Respective embodiments will now be discussed with reference to FIGS. 10 to 12.
Referring first to
Referring to the flow chart of
The data and commands for the primitives of the respective frame region are then sent from each individual processor unit to the respective private graphics subsystem using the full uninterrupted bandwidth of the respective link. Once all processing subsystems have rendered their frame region into their private frame buffers, the results are merged into the frame buffers of the graphics subsystems 405, 445, respectively. Then the two different frames are simultaneously displayed, one at the monitor 1020 and the other at monitor 1030.
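Deciding which primitives belong to which frame region can be sketched with a simple bounding-box test. The specification does not state how this decision is made, so everything below is an assumption: the axis-aligned bounding boxes, the `(x0, y0, x1, y1)` and `(x, y, width, height)` formats, and the function and primitive names are hypothetical.

```python
from typing import List, Tuple

BBox = Tuple[int, int, int, int]    # hypothetical (x0, y0, x1, y1) bounding box
Region = Tuple[int, int, int, int]  # hypothetical (x, y, width, height)


def overlaps(bbox: BBox, region: Region) -> bool:
    """True if the bounding box intersects the frame region."""
    bx0, by0, bx1, by1 = bbox
    rx, ry, rw, rh = region
    return bx0 < rx + rw and bx1 > rx and by0 < ry + rh and by1 > ry


def bin_primitives(primitives: List[Tuple[str, BBox]],
                   regions: List[Region]) -> List[List[str]]:
    """For each region, list the primitives that must be rendered in it."""
    bins: List[List[str]] = [[] for _ in regions]
    for name, bbox in primitives:
        for index, region in enumerate(regions):
            if overlaps(bbox, region):
                bins[index].append(name)
    return bins


regions = [(0, 0, 100, 50), (0, 50, 100, 50)]
primitives = [
    ("tri_top", (10, 10, 20, 20)),
    ("tri_bottom", (10, 60, 20, 70)),
    ("tri_span", (10, 40, 20, 60)),  # crosses the boundary between regions
]
bins = bin_primitives(primitives, regions)
print(bins)
```

A primitive that crosses a region boundary appears in both bins, so each processing subsystem would send it to its own graphics subsystem and render only the part inside its region.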
It is noted that in particular the copying of the pixel data for each viewport can occur in parallel.
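The independence of the per-viewport copies can be modeled with a thread pool. This is a hedged software analogy only: the viewport names, the data layout, and the use of threads are hypothetical, and the threads merely stand in for copies that would travel over separate links in the hardware.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Pixels = List[List[int]]


def compose_viewport(item: Tuple[str, List[Pixels]]) -> Tuple[str, Pixels]:
    """Stack the regions rendered for one viewport, top to bottom."""
    name, parts = item
    frame: Pixels = []
    for rows in parts:
        frame.extend(rows)
    return name, frame


# Two viewports, each assembled from regions rendered by two processing subsystems.
rendered: Dict[str, List[Pixels]] = {
    "viewport_a": [[[1, 1]], [[2, 2]]],
    "viewport_b": [[[3, 3]], [[4, 4]]],
}

# The viewports share no data, so their copies can proceed concurrently.
with ThreadPoolExecutor(max_workers=len(rendered)) as pool:
    frames = dict(pool.map(compose_viewport, rendered.items()))
print(frames)
```

Because neither viewport reads or writes the other's buffers, no synchronization between the two copies is needed in this model.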
Referring now to
As apparent from the foregoing description of the various embodiments, a highly parallel system architecture is shown which allows for highly efficient parallel processing of regular computational tasks as well as graphics processing. All parallelization is done by software and no hardwired parallelization mechanism is imposed. This makes the system very flexible and adaptable to the needs of the software.
Further, the use of multiple parallel links leads to the availability of a huge overall system bandwidth and therefore makes highly concurrent operations possible. Further, the usage of processing subsystems makes the system very scalable in regard to the number of processing subsystems used in the interconnection topology. The topology is transparent to the software.
It is further to be noted that the use of completely software-implemented parallel processing mechanisms also allows different parallelization mechanisms to be combined in one system. Further, it is to be noted that in any of the above embodiments, the processors may comprise multiple processor cores.
While the invention has been described with respect to the physical embodiments constructed in accordance therewith, it will be apparent to those skilled in the art that various modifications, variations and improvements of the present invention may be made in the light of the above teachings and within the purview of the appended claims without departing from the spirit and intended scope of the invention. In addition, those areas with which those of ordinary skill in the art are believed to be familiar have not been described herein, in order not to unnecessarily obscure the invention. Accordingly, it is to be understood that the invention is not to be limited by the specific illustrative embodiments, but only by the scope of the appended claims.
Claims
1. A multi-processor computing device comprising:
- at least two processing subsystems each comprising a processor unit and at least one further component,
- wherein in each one of said at least two processing subsystems, the processor unit is connected to said at least one further component via at least one first link,
- wherein in each one of said at least two processing subsystems, the processor unit is further adapted to be connected to at least one processor unit of another one of said at least two processing subsystems via at least one second link,
- wherein said at least one first link and said at least one second link are physically decoupled, and
- wherein said at least two processing subsystems are capable of simultaneously sending data over said at least one first link and said at least one second link.
2. The multi-processor computing device of claim 1, wherein each processor unit of said at least two processing subsystems is adapted to select one of said first and second links to send data, in response to receiving an address of a target component within any one of said at least two processing subsystems, said target component being the intended recipient of said data.
3. The multi-processor computing device of claim 2, wherein the processor units of said at least two processing subsystems are adapted to receive said address of said target component from a software function.
4. The multi-processor computing device of claim 2, wherein each processor unit of said at least two processing subsystems is capable of routing data from one of said first and second links to another one of said first and second links dependent on said address of said target component.
5. The multi-processor computing device of claim 1, wherein said at least one further component is a graphics subsystem adapted to perform graphics operations.
6. The multi-processor computing device of claim 5, wherein said graphics subsystem is a graphics adapter card.
7. The multi-processor computing device of claim 6, wherein said graphics subsystem comprises a PCI (Peripheral Component Interconnect) Express interface unit.
8. The multi-processor computing device of claim 5, wherein said graphics subsystem is an integrated circuit chip directly coupled to the respective processor unit via said at least one first link.
9. The multi-processor computing device of claim 5, wherein said graphics subsystem is a subunit of the respective processor unit, integrated on the same chip as the respective processor unit.
10. The multi-processor computing device of claim 5, wherein said graphics subsystem is a graphics interface unit capable of interfacing to an external graphics system.
11. The multi-processor computing device of claim 5, wherein said graphics subsystem comprises a graphics processor adapted to perform graphics processing.
12. The multi-processor computing device of claim 11, wherein said graphics processor is adapted to be connected to a display unit.
13. The multi-processor computing device of claim 5, wherein said graphics subsystem comprises a graphics memory.
14. The multi-processor computing device of claim 5, wherein the processor units of said at least two processing subsystems are adapted to form a data path from a graphics subsystem of a first one of said processing subsystems to a graphics subsystem of a second one of said processing subsystems, said data path comprising a first link between the graphics subsystem of the first processing subsystem and the processor unit of the first processing subsystem, a second link between the processor unit of the first processing subsystem and the processor unit of the second processing subsystem, and another first link between the processor unit of the second processing subsystem and the graphics subsystem of the second processing subsystem.
15. The multi-processor computing device of claim 5, wherein the processor units of said at least two processing subsystems are adapted to form a data path from the processor unit of a first one of said processing subsystems to a graphics subsystem of a second one of said processing subsystems, said data path comprising a second link between the processor unit of the first processing subsystem and the processor unit of the second processing subsystem, and a first link between the processor unit of the second processing subsystem and a graphics subsystem of the second processing subsystem.
16. The multi-processor computing device of claim 5, wherein the graphics subsystems of each of said at least two processing subsystems are capable of being connected to an individual display device, and each graphics subsystem is adapted to perform graphics operations solely for the display device to which it is connected.
17. The multi-processor computing device of claim 5, wherein a graphics subsystem of one of said at least two processing subsystems is adapted to perform graphics operations for a display device connected to a graphics subsystem of another one of said at least two processing subsystems.
18. The multi-processor computing device of claim 17, wherein said graphics subsystem of said one processing subsystem is adapted to perform all of the graphics operations necessary for said display device connected to said graphics subsystem of said other processing subsystem.
19. The multi-processor computing device of claim 17, wherein said graphics subsystem of said one processing subsystem is adapted to perform graphics operations necessary to display a frame region at said display device connected to said graphics subsystem of said other processing subsystem, while said graphics subsystem of said other processing subsystem is adapted to perform graphics operations necessary to display another frame region at said display device.
20. The multi-processor computing device of claim 19, wherein a graphics subsystem of a third processing subsystem is adapted to perform graphics operations necessary to display a third frame region at said display device connected to said graphics subsystem of said other processing subsystem.
21. The multi-processor computing device of claim 20, wherein the frame regions are of the same superficial extent.
22. The multi-processor computing device of claim 20, wherein the frame regions have the same dimensions.
23. The multi-processor computing device of claim 20, wherein the frame regions are arranged to horizontally split the entire frame.
24. The multi-processor computing device of claim 20, wherein at least one of said frame regions has a horizontal dimension less than the entire frame, and a vertical dimension less than the entire frame.
25. The multi-processor computing device of claim 19, wherein the processor units of said one and said other processing subsystems are adapted to preprocess data to be displayed to decide which primitives are to be rendered in the respective frame region.
26. The multi-processor computing device of claim 25, wherein the processor units of said one and said other processing subsystems are adapted to send data and/or commands to the graphics subsystem connected to the respective processor unit via a first link.
27. The multi-processor computing device of claim 26, wherein the graphics subsystems are adapted to render the respective frame regions in response to receiving said data and/or commands.
28. The multi-processor computing device of claim 27, wherein the processing subsystems are adapted to copy rendered pixel data from the graphics subsystem of said one processing subsystem to the graphics subsystem of said other processing subsystem.
29. The multi-processor computing device of claim 28, wherein the processing subsystems are adapted to copy the rendered pixel data via the processor units of the processing subsystems.
30. The multi-processor computing device of claim 28, wherein the processing subsystems are adapted to copy the rendered pixel data via a dedicated link between the graphics subsystems of the processing subsystems.
31. The multi-processor computing device of claim 28, wherein the graphics subsystem of said other processing subsystem is adapted to merge the copied pixel data with its own rendered pixel data to display the merged pixel data at said display device.
32. The multi-processor computing device of claim 27, wherein the processing subsystems are adapted to merge pixel data rendered by the graphics subsystem of said one processing subsystem and pixel data rendered by the graphics subsystem of said other processing subsystem at a line synch output to said display device.
33. The multi-processor computing device of claim 5, wherein said at least two processing subsystems comprises a first and a second processing subsystem having their respective graphics subsystems connected to an individual display device, and a third and a fourth processing subsystem not having their respective graphics subsystems connected to a display device, wherein said third and fourth processing subsystems are adapted to perform graphics operations for the display devices at the graphics subsystems of the first and second processing subsystems, respectively.
34. The multi-processor computing device of claim 33, adapted to simultaneously perform the operation of the first and third processing subsystems, and the operation of the second and fourth processing subsystems.
35. The multi-processor computing device of claim 5, wherein at least one of said processing subsystems comprises two or more graphics subsystems separately and independently connected to the processor unit of the processing subsystem.
36. The multi-processor computing device of claim 1, wherein said at least one further component is a memory unit.
37. The multi-processor computing device of claim 1, wherein in each one of said at least two processing subsystems, the processor unit is connected to two components of the respective processing subsystem via two separate first links, and wherein in each one of said at least two processing subsystems, the processor unit is further adapted to be connected to two processor units of other processing subsystems via two separate second links.
38. The multi-processor computing device of claim 37, wherein said two components are a graphics subsystem adapted to perform graphics processing, and a memory unit.
39. The multi-processor computing device of claim 1, capable of running SMP (Symmetric Multi-Processing) applications.
40. The multi-processor computing device of claim 1, further comprising at least one interface unit to interface to at least one system component other than said at least two processing subsystems, wherein at least one of said at least two processing subsystems is adapted to be connected to said at least one interface unit.
41. The multi-processor computing device of claim 40, wherein said at least one interface unit is a system bridge.
42. The multi-processor computing device of claim 1, wherein said first and second links are HyperTransport™ compliant links.
43. A processing subsystem for use in a multi-processor computing device, the processing subsystem comprising:
- a processor unit; and
- at least one further component,
- wherein the processor unit is connected to said at least one further component via at least one first link,
- wherein the processor unit is further adapted to be connected to at least one processor unit of another processing subsystem via at least one second link,
- wherein said at least one first link and said at least one second link are physically decoupled, and
- wherein said processing subsystem is capable of simultaneously sending data over said at least one first link and said at least one second link.
44. A multi-processor computing method comprising:
- operating a first and a second processing subsystem of a multi-processor computing device, said first and second processing subsystems each comprising a processor unit and at least one further component,
- wherein operating said first and second processing subsystems comprises:
- simultaneously sending data over at least one first link between the processor unit and a respective further component of one of said first and second processing subsystems, and at least one second link between the processor units of said first and second processing subsystems, said at least one first link and said at least one second link being physically decoupled.
45. A computer-readable storage medium storing instructions that, when executed on a multi-processor computing device having at least two processing subsystems each comprising a processor unit and at least one further component, cause said multi-processor computing device to simultaneously send data over at least one first link between the processor unit and a respective further component of one of said processing subsystems, and at least one second link between the processor units of said processing subsystems, said at least one first link and said at least one second link being physically decoupled.
Type: Application
Filed: Feb 18, 2005
Publication Date: May 4, 2006
Applicant:
Inventor: Uwe Kranich (Kirchheim)
Application Number: 11/061,427
International Classification: G06F 15/00 (20060101);