Active memory data compression system and method
An integrated circuit active memory device receives task commands from a component in a host computer system that may include the active memory device. The host system includes a memory controller coupling the active memory device to a host CPU and a mass storage device. The active memory device includes a command engine issuing instructions responsive to the task commands to either an array control unit or a DRAM control unit. The instructions provided to the DRAM control unit cause data to be written to or read from a DRAM and coupled to or from either the processing elements or a host/memory interface. The processing elements execute instructions provided by the array control unit to decompress data written to the DRAM through the host/memory interface and compress data read from the DRAM through the host/memory interface.
This invention relates to memory devices, and, more particularly, to techniques for efficiently transferring data to and from active memory devices.
BACKGROUND OF THE INVENTION
A common computer processing task involves sequentially processing large numbers of data items, such as data corresponding to each of a large number of pixels in an array. Processing data in this manner normally requires fetching each item of data from a memory device, performing a mathematical or logical calculation on that data, and then returning the processed data to the memory device. Performing such processing tasks at high speed is greatly facilitated by a high data bandwidth between the processor and the memory devices. The data bandwidth between a processor and a memory device is proportional to the width of a data path between the processor and the memory device and the frequency at which the data are clocked between the processor and the memory device. Therefore, increasing either of these parameters will increase the data bandwidth between the processor and memory device, and hence the rate at which data can be processed.
An active memory device is a memory device having its own processing resource. It is relatively easy to provide an active memory device with a wide data path, thereby achieving a high memory bandwidth. Conventional active memory devices have been provided for mainframe computers in the form of discrete memory devices having dedicated processing resources. However, it is now possible to fabricate a memory device, particularly a dynamic random access memory (“DRAM”) device, and one or more processors on a single integrated circuit chip. Single chip active memories have several advantageous properties. First, the data path between the DRAM device and the processor can be made very wide to provide a high data bandwidth between the DRAM device and the processor. In contrast, the data path between a discrete DRAM device and a processor is normally limited by constraints on the size of external data buses. Further, because the DRAM device and the processor are on the same chip, the speed at which data can be clocked between the DRAM device and the processor can be relatively high, which also maximizes data bandwidth. The cost of an active memory fabricated on a single chip is also less than the cost of a discrete memory device coupled to an external processor.
An active memory device can be designed to operate at a very high speed by parallel processing data using a large number of processing elements (“PEs”) each of which processes a respective group of the data bits. One type of parallel processor is known as a single instruction, multiple data (“SIMD”) processor. In a SIMD processor, each of a large number of PEs simultaneously receives the same instructions, but they each process separate data. The instructions are generally provided to the PEs by a suitable device, such as a microprocessor. The advantages of SIMD processing are simple control, efficient use of available data bandwidth, and minimal logic hardware overhead. Another parallel processing architecture is multiple instruction, multiple data (“MIMD”) in which a large number of processing elements process separate data using separate instructions.
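The SIMD arrangement described above, in which every PE executes the identical instruction on its own operand, can be illustrated with a small software model; the instruction mnemonics and the four-lane width below are illustrative assumptions, not part of this description:

```python
# Minimal software model of SIMD execution: every processing element (PE)
# executes the same instruction, but on its own operand. The instruction
# names and lane count here are illustrative only.
def simd_execute(instruction, lanes):
    """Apply one instruction simultaneously across all PE lanes."""
    if instruction == "add1":
        return [x + 1 for x in lanes]
    if instruction == "double":
        return [x * 2 for x in lanes]
    raise ValueError(f"unknown instruction: {instruction}")

# Each of the four PEs holds different data...
data = [10, 20, 30, 40]
# ...but all receive the identical instruction stream.
for instr in ["add1", "double"]:
    data = simd_execute(instr, data)
```

The single control stream is what keeps SIMD control logic simple: there is one instruction fetch for the whole array regardless of how many lanes it has.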
A high performance active memory device can be implemented by fabricating a large number of SIMD PEs or MIMD PEs and a DRAM on a single chip, and coupling each of the PEs to respective groups of columns of the DRAM. The instructions are provided to the PEs from an external device, such as a host microprocessor. The number of PE's included on the chip can be very large, thereby resulting in a massively parallel processor capable of processing vast amounts of data.
In operation, data to be operated on by the PEs are first written to the DRAM, generally from an external source such as a disk, network or input/output (“I/O”) device in a host computer system. In response to common instructions passed to all of the PEs, the PEs fetch respective groups of data to be operated on by the PEs, perform the operations called for by the instructions, and then pass data corresponding to the results of the operations back to the DRAM. After they have been written to the DRAM, the results data can be either coupled back to the external source or processed further in a subsequent operation. By operating on the data using active memory devices, particularly active memory devices using SIMD PEs and MIMD PEs, the data can be processed very efficiently. If the same data were operated on by a microprocessor or other central processing unit (“CPU”), it would be necessary to couple substantially smaller blocks of data from the memory device to the CPU for processing, and then write substantially smaller blocks of results data back to the memory device. The wider data bus and faster data transfer speeds made possible by using an active memory instead of a conventional memory result in a significantly higher data bandwidth.
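The fetch–operate–write-back cycle just described can be modeled in a few lines; the DRAM is represented as a plain list, and the PE count, group size, and squaring operation are assumptions chosen only for illustration:

```python
# Toy model of the PE processing cycle: operand data is first written to
# DRAM, each PE fetches its own group of that data, all PEs apply the
# same operation, and the results are written back to DRAM.
DRAM = list(range(8))          # operand data previously written from the host
NUM_PES = 4                    # illustrative PE count
GROUP = len(DRAM) // NUM_PES   # elements handled by each PE

results = []
for pe in range(NUM_PES):
    group = DRAM[pe * GROUP:(pe + 1) * GROUP]   # each PE fetches its group
    results.extend(x * x for x in group)        # common instruction: square
DRAM = results                                  # results written back to DRAM
```

In hardware the per-PE loop body runs in parallel across all PEs at once; the sequential loop here is only a convenience of the software model.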
Although an active memory device allows much more efficient processing of data stored in memory, the processing speed of a computer system using active memory devices is somewhat limited by the time required to transfer operand data to the active memory for processing and the time required to transfer results data from the active memory after the operand data has been processed. During such data transfer operations, active memory devices are essentially no more efficient than passive memory devices that also require data stored in the memory device to be transferred to and from an external device, such as a CPU.
There is therefore a need for a system and method for allowing data to be more efficiently transferred between active memory devices and an external system.
SUMMARY OF THE INVENTION
An integrated circuit active memory device includes a memory device and an array of processing elements, such as SIMD or MIMD processing elements, coupled to the memory device. Compressed data transferred through a host/memory interface port are first written to the memory device. The processing elements then decompress the data stored in the memory device and write the decompressed data to the memory device. The processing elements also read data from the memory device, compress the data read from the memory device, and then write the compressed data to the memory device. The compressed data are then transferred through the host/memory interface. Instructions are preferably provided to the processing elements by an array control unit, and memory commands are preferably issued to the memory device through a memory control unit. The array control unit and the memory control unit preferably execute instructions provided by a command engine responsive to task commands provided to the active memory device by a host computer system.
BRIEF DESCRIPTION OF THE DRAWINGS
The active memory device 10 includes a first in, first out (“FIFO”) buffer 38 that receives high level task commands from the host system 14, which may also include a task address. The received task commands are buffered by the FIFO buffer 38 and passed to a command engine 40 at the proper time and in the order in which they are received. The command engine 40 generates respective sequences of instructions corresponding to the received task commands. These instructions are at a lower level than the task commands. The instructions are coupled from the command engine 40 to either a processing element (“PE”) FIFO buffer 44 or a dynamic random access memory (“DRAM”) FIFO buffer 48 depending upon whether the commands are PE commands or DRAM commands.
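The routing of instructions from the command engine 40 into either the PE FIFO buffer 44 or the DRAM FIFO buffer 48 can be sketched as follows; the instruction tags and payload strings are hypothetical, chosen only to show the dispatch structure:

```python
from collections import deque

# Sketch of the command engine dispatching lower-level instructions to
# the PE FIFO or the DRAM FIFO based on instruction type. The tag names
# and payloads are illustrative, not from the specification.
pe_fifo, dram_fifo = deque(), deque()

def dispatch(instruction):
    """Route a (kind, payload) instruction to the proper FIFO buffer."""
    kind, payload = instruction
    if kind == "PE":
        pe_fifo.append(payload)
    elif kind == "DRAM":
        dram_fifo.append(payload)
    else:
        raise ValueError(f"unknown instruction kind: {kind}")

for instr in [("PE", "load"), ("DRAM", "activate_row"), ("PE", "mul")]:
    dispatch(instr)
```

Because each buffer is first in, first out, instructions of each type reach their control unit in the order the command engine generated them.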
If the instructions are PE instructions, they are passed to the PE FIFO buffer 44 and then from the buffer 44 to a processing array control unit (“ACU”) 50. The ACU 50 subsequently passes microinstructions to an array of PEs 54. The PEs 54 preferably operate as SIMD processors in which all of the PEs 54 receive and simultaneously execute the same instructions, but they may do so on different operands. However, the PEs 54 may alternatively operate as MIMD processors or some other type of processors.
If the instructions from the command engine 40 are DRAM instructions, they are passed to the DRAM FIFO buffer 48 and then to a DRAM Control Unit (“DCU”) 60. The DCU 60 couples memory commands and addresses to a DRAM 64 to read data from and write data to the DRAM 64. In the embodiment shown in
The ACU 50 executes intrinsic routines each containing several microinstructions responsive to the command from the FIFO buffer 44. These microinstructions are stored in a program memory 70, which is preferably loaded at power-up or at some other time based on specific operations that the active memory device 10 is to perform. Control and address (“C/A”) signals are coupled to the program memory 70 from the ACU 50. A memory map 80 of the program memory 70 according to one embodiment is shown in
In operation, in response to each task command from the host system 14, the command engine 40 executes respective sequences of instructions stored in an internal program memory (not shown). The instructions generally include both code that is executed by the command engine 40 and PE instructions that are passed to the ACU 50. Each of the PE instructions that are passed to the ACU 50 is generally used to address the program memory 70 to select the first microinstruction in an intrinsic 84 corresponding to the PE instruction. Thereafter, the ACU 50 couples command and address signals to the program memory 70 to sequentially read from the program memory 70 each microinstruction in the intrinsic 84 being executed. As mentioned above, a portion of each microinstruction from the program memory 70 is executed by the PEs 54 to operate on data received from the register files 68.
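The way a PE instruction selects an intrinsic's microinstruction sequence from the program memory 70 can be modeled as a table lookup followed by a sequential read-out; the intrinsic names and microinstruction mnemonics below are hypothetical:

```python
# Toy model of the program memory: each intrinsic is a named sequence of
# microinstructions loaded at power-up. The intrinsic names and
# mnemonics are made up for illustration.
PROGRAM_MEMORY = {
    "vector_add": ["fetch_a", "fetch_b", "alu_add", "store"],
    "vector_neg": ["fetch_a", "alu_neg", "store"],
}

def run_intrinsic(name):
    """Select an intrinsic, then sequentially read out its
    microinstructions, as the ACU does by stepping addresses."""
    issued = []
    for micro in PROGRAM_MEMORY[name]:   # ACU steps through program memory
        issued.append(micro)             # each microinstruction goes to the PEs
    return issued
```

Reloading the table at power-up is what lets the same hardware support different sets of operations for different applications.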
With further reference to
In a typical processing task, the host system 14 passes a relatively large volume of data to the DRAM 64 through the HMI port 90, often from the mass storage device 24. The host system 14 then passes task commands to the active memory device 10, which cause subsets of operand data to be read from the DRAM 64 and operated on by the PEs 54. Results data generated from the operations performed by the PEs 54 are then written to the DRAM 64. After all of the subsets of data have been processed by the PEs 54, the relatively large volume of results data are read from the DRAM 64 and passed to the host system 14 through the HMI port 90. Also, of course, the DRAM 64 may simply be used as system memory for the host system 14 without the PEs 54 processing any of the data stored in the DRAM 64.
As mentioned above, the time required to transfer relatively large volumes of data from the host system 14 to the DRAM 64 and from the DRAM 64 to the host system 14 can markedly slow the operating speed of a system using active memory devices. If the data could be transferred through the HMI port 90 at a more rapid rate, the operating efficiency of the active memory device 10 could be materially increased.
According to one embodiment of the invention, the host system 14 transfers compressed data through the HMI port 90 to the DRAM 64. The compressed data are then transferred to the PEs 54, which execute a decompression algorithm to decompress the data. The decompressed data are then stored in the DRAM 64 and operated on by the PEs 54, as previously explained. The results data are then stored in the DRAM 64. When the data stored in the DRAM 64 are to be transferred to the host system 14, the data are first transferred to the PEs 54, which execute a compression algorithm to compress the data. The compressed data are then stored in the DRAM 64 and subsequently transferred to the host system 14 through the HMI port 90. By transferring only compressed data through the HMI port 90, the data bandwidth to and from the DRAM 64 is markedly increased.
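The decompress-on-ingest, compress-on-egress scheme described above can be sketched as follows. The specification does not name a particular compression algorithm, so Python's standard `zlib` codec stands in here for whatever algorithm the PEs actually execute, and the dictionary-as-DRAM is likewise illustrative:

```python
import zlib

# Sketch of the transfer scheme: only compressed bytes cross the
# host/memory interface (HMI) port. The PEs decompress data arriving
# from the host and compress data departing to the host. zlib is a
# stand-in codec; the real algorithm runs as PE microcode.
DRAM = {}

def ingest(compressed_blob):
    """Host writes compressed data to DRAM; PEs then decompress it."""
    DRAM["working"] = zlib.decompress(compressed_blob)

def egress():
    """PEs compress the DRAM contents; only compressed bytes leave."""
    return zlib.compress(DRAM["working"])

payload = b"pixel data " * 100          # highly redundant, so it compresses well
ingest(zlib.compress(payload))          # compressed bytes cross the HMI port
out = egress()                          # compressed bytes cross it again
```

Because only the compressed representation crosses the port in either direction, the effective bandwidth gain tracks the compression ratio of the data.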
The PEs 54 preferably compress and decompress the data by executing microinstructions stored in the program memory 70. As previously mentioned, some of the intrinsics 84 (
A single active memory device 10 may be used in a computer system as shown in
The operation of the computer system shown in
After sufficient time has elapsed for the active memory devices 10 to complete the task of compressing the read data stored in the designated page and making the compressed data available to the HMI port 90, direct memory access (“DMA”) operations to the mass storage device 24′ are initiated at 124. In this regard, the DMA operations may be initiated at a rate that is faster than the mass storage device 24′ can complete the operations. The DMA operations are simply stored as a list of DMA operations that are sequentially completed, which is detected at 126. Each DMA operation causes the compressed data stored in the DRAM 64 to be sequentially coupled to the mass storage device 24′ through the HMI port 90 and memory controller 18′. The “page to disk” task is then completed at 128.
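The queued-DMA behavior just described, where operations may be issued faster than the mass storage device completes them, might be sketched like this; the page names and the list standing in for the disk are assumptions for illustration:

```python
from collections import deque

# Sketch of the "page to disk" DMA list: operations are enqueued as
# fast as the host issues them, then drained in order at whatever rate
# the mass storage device sustains. All names are illustrative.
dma_queue = deque()
disk = []

def issue_dma(compressed_page):
    """Host issues a DMA operation; it is queued, not completed yet."""
    dma_queue.append(compressed_page)

def disk_service():
    """Mass storage sequentially completes the queued DMA operations."""
    while dma_queue:
        disk.append(dma_queue.popleft())

for page in ["page0.z", "page1.z", "page2.z"]:   # issued at full rate
    issue_dma(page)
disk_service()                                    # completed at disk rate
```

Decoupling issue rate from completion rate this way keeps the host from stalling on the comparatively slow mass storage device.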
A “memory page from disk” algorithm that is the reverse of the operation shown in
After the data from the mass storage device 24 have been downloaded to the DRAM 64 and decompressed, the memory device index I is decremented at 158, and a determination is made at 160 whether I=1, corresponding to the data being transferred from the mass storage device 24 to the first active memory device 10-1. If not, the operation returns to 150 to repeat the process described above. If all of the data have been transferred from the mass storage device 24, the operation branches to 170 where it waits for all of the downloaded data to be decompressed by the PEs 54 and stored in the respective DRAM 64. The operation then exits through 174.
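The reverse "memory page from disk" loop, counting the device index I down toward the first active memory device, can be sketched as follows; the device count and naming are illustrative, and the step numbers in the comments follow the description above:

```python
# Sketch of the "memory page from disk" loop: data is downloaded and
# decompressed into each active memory device in turn, with the device
# index I counting down to the first device. Device count is illustrative.
NUM_DEVICES = 4
downloaded = []

I = NUM_DEVICES
while I >= 1:
    downloaded.append(f"device-{I}")   # download and decompress into device I
    I -= 1                             # decrement the device index (step 158)
# Once I passes 1, every device has received its data (the check at 160),
# and the operation waits for decompression to finish before exiting.
```

Indexing the devices this way lets one loop serve any number of active memory devices on the shared bus.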
Although only the “page to disk” and the “memory page from disk” operations have been described herein, it will be understood that other operations can also occur, and corresponding intrinsics 84 are stored in the program memory 70 to assist in carrying out these operations. For example, intrinsics 84 could be provided that cause the PEs 54 to compress and/or decompress all of the data stored in the DRAM 64, or to compress and/or decompress data stored in the DRAM 64 only within certain ranges of addresses. Other operations in which the PEs 54 compress or decompress data will be apparent to one skilled in the art and, of course, can also be carried out in the active memory device 10.
From the foregoing it will be appreciated that, although specific embodiments of the invention have been described herein for purposes of illustration, various modifications may be made without deviating from the spirit and scope of the invention. For example, rather than transfer the compressed data from the HMI port 90 to the DRAM 64 prior to being decompressed by the PEs 54, it may be possible to transfer the compressed data directly from the HMI port 90 to the register files 68 or some other component (not shown) before being decompressed by the PEs 54. Similarly, rather than storing data compressed by the PEs 54 in the DRAM 64 before transferring the compressed data through the HMI port 90, it may be possible to store the data compressed by the PEs 54 in the register files 68 or some other location prior to being transferred through the HMI port 90. As another example, instead of or in addition to transferring the data from the active memory device 10 to the mass storage device 24, it may be transferred to other components, such as the host CPU 20, a graphics processor (not shown), etc., through a DMA operation or some other operation. Furthermore, as mentioned above, the PEs 54 need not be SIMD PEs, but instead can be other types of processing devices such as multiple instruction, multiple data (“MIMD”) processing elements. Accordingly, the invention is not limited except as by the appended claims.
Claims
1. An integrated circuit active memory device comprising:
- a memory device having a data bus containing a plurality of data bus bits;
- an array of processing elements each of which is coupled to a respective group of the data bus bits, each of the processing elements having an instruction input coupled to receive processing element instructions for controlling the operation of the processing elements;
- a host interface port operable to transfer data to and from the active memory device; and
- a control unit being operable to receive task commands and to generate corresponding sequences of instructions responsive to each of the task commands to control the operation of the memory device and the processing elements, at least some of the instructions generated by the control unit causing the processing elements to either decompress data transferred to the active memory device through the host interface port and then store the decompressed data in the memory device or to compress data transferred from the memory device that is to be transferred from the active memory device through the host interface port.
2-43. (canceled)
Type: Application
Filed: May 9, 2006
Publication Date: Sep 14, 2006
Inventor: Dean Klein (Eagle, ID)
Application Number: 11/431,455
International Classification: G06F 13/12 (20060101);