CONFIGURATION OF ONE OR MORE MEMORY PROCESSING UNITS
Disclosed embodiments include a computational memory system. The computational memory system includes a master controller configured to receive a configuration function from a host CPU and convert the received configuration function into one or more lower level configuration functions. The computational memory system also includes at least one computational memory chip, wherein the at least one computational memory chip includes a plurality of processor subunits and a plurality of memory banks formed on a common substrate. The master controller is adapted to configure the at least one computational memory chip using the one or more lower level configuration functions.
This application claims the benefit of priority of U.S. Provisional Patent Application No. 63/198,426, filed on Oct. 16, 2020; U.S. Provisional Patent Application No. 63/198,429, filed on Oct. 16, 2020; U.S. Provisional Patent Application No. 63/230,212, filed on Aug. 6, 2021; U.S. Provisional Patent Application No. 63/092,647, filed on Oct. 16, 2020; U.S. Provisional Patent Application No. 63/092,658, filed on Oct. 16, 2020; U.S. Provisional Patent Application No. 63/092,671, filed on Oct. 16, 2020; U.S. Provisional Patent Application No. 63/092,682, filed on Oct. 16, 2020; U.S. Provisional Patent Application No. 63/092,689, filed on Oct. 16, 2020; and U.S. Provisional Patent Application No. 63/093,968, filed on Oct. 20, 2020; all of which are incorporated herein by reference in their entirety.
BACKGROUND
Technical Field
The present disclosure relates generally to apparatuses for facilitating memory-intensive operations. In particular, the present disclosure relates to memory appliances that include hardware chips comprising both processing elements and dedicated memory banks.
Background Information
As processor speeds and memory sizes both continue to increase, a significant limitation on effective processing speeds is the von Neumann bottleneck. The von Neumann bottleneck results from throughput limitations inherent in conventional computer architecture. In particular, data transfer from memory to the processor is often bottlenecked compared to actual computations undertaken by the processor. Accordingly, the number of clock cycles to read and write from memory increases significantly with memory-intensive processes. These clock cycles result in lower effective processing speeds because reading and writing from memory consumes clock cycles that cannot be used for performing operations on data. Moreover, the computational bandwidth of the processor is generally larger than the bandwidth of the buses that the processor uses to access the memory.
These bottlenecks are particularly pronounced for memory-intensive processes, such as neural network and other machine learning algorithms; database construction, indexing, searching, and querying; and other tasks that include more reading and writing operations than data processing operations.
Additionally, the rapid growth in volume and granularity of available digital data has created opportunities to develop machine learning algorithms and has enabled new technologies. However, it has also brought cumbersome challenges to the world of databases and parallel computing. For example, the rise of social media and the Internet of Things (IoT) creates digital data at a record rate. This new data can be used to create algorithms for a variety of purposes, ranging from new advertising techniques to more precise control methods of industrial processes. However, the new data has been difficult to store, process, analyze, and handle.
New data resources can be massive, sometimes on the order of peta- to zettabytes. Moreover, the growth rate of these data resources may exceed data processing capabilities. Therefore, data scientists have turned to parallel data processing techniques to tackle these challenges. In an effort to increase computation power and handle the massive amount of data, scientists have attempted to create systems and methods capable of parallel intensive computing. But these existing systems and methods have not kept up with the data processing requirements, often because the techniques employed are limited by their demand for additional resources for data management, integration of segregated data, and analysis of the sectioned data.
The present disclosure describes solutions for mitigating or overcoming one or more of the problems set forth above, among other problems in the prior art.
SUMMARY
In an embodiment, an information transfer system may include a master controller (XMC) configured to issue a command in the form of a memory protocol data packet, and wherein the master controller is configured to generate routing information indicating whether the memory protocol data packet is to be processed according to a first protocol or according to a second protocol different from the first protocol. The information transfer system may also include a slave controller (XSC) configured to receive the memory protocol data packet from the master controller, wherein the slave controller is configured to use the routing information to selectively cause the memory protocol data packet to be processed according to the first protocol or according to the second protocol.
In an embodiment, a computational memory system may include a master controller configured to receive a configuration function from a host CPU and convert the received configuration function into one or more lower level configuration functions. The system may also include at least one computational memory chip, wherein the at least one computational memory chip includes a plurality of processor subunits and a plurality of memory banks formed on a common substrate; wherein the master controller is adapted to configure the at least one computational memory chip using the one or more lower level configuration functions.
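As a purely illustrative sketch (not the disclosed implementation), the decomposition of a host-level configuration function into lower-level configuration functions might be modeled as follows; the function names, register offsets, and per-subunit layout are hypothetical assumptions.

    # Hypothetical sketch: a master controller expands one host-level configuration
    # function into per-subunit, lower-level configuration writes. Names and register
    # offsets are illustrative assumptions, not the disclosed implementation.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class LowLevelConfig:
        subunit_id: int   # target processor subunit on the computational memory chip
        register: int     # hypothetical configuration register offset
        value: int        # value to program

    def expand_config_function(num_subunits: int, kernel_id: int, base_addr: int) -> List[LowLevelConfig]:
        """Convert a single host configuration request into lower level configuration functions."""
        ops = []
        for sub in range(num_subunits):
            ops.append(LowLevelConfig(sub, register=0x00, value=kernel_id))                 # select program
            ops.append(LowLevelConfig(sub, register=0x04, value=base_addr + sub * 0x1000))  # per-subunit data region
        return ops

    # The master controller would then issue each LowLevelConfig to the chip,
    # e.g., over the memory interface described elsewhere in this disclosure.
    low_level = expand_config_function(num_subunits=4, kernel_id=7, base_addr=0x8000)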
In an embodiment, a computational memory system may include a controller configured to receive a configuration function from a host CPU. The system may also include at least one computational memory chip, wherein the at least one computational memory chip includes a plurality of processor subunits and a plurality of memory banks formed on a common substrate; wherein the controller is adapted to multicast the configuration function to two or more of the plurality of processor subunits.
In an embodiment, a computational memory system may comprise at least one computational memory chip including one or more processor subunits and one or more memory banks formed on a common substrate. The at least one computational memory chip may be configured to store one or more portions of an embedding table in the one or more memory banks, the embedding table including one or more feature vectors. The one or more processor subunits may be configured to receive a sparse vector indicator from a host external to the at least one computational memory chip and, based on the received sparse vector indicator and the one or more portions of the embedding table, generate one or more vector sums.
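The embedding-table operation can be pictured with a short sketch; here the sparse vector indicator is assumed, for illustration only, to be a list of row indices into a stored portion of the embedding table.

    # Illustrative sketch: gather the feature vectors selected by a sparse vector
    # indicator (assumed here to be a list of row indices) and sum them.
    embedding_table = [            # one stored portion of an embedding table
        [0.1, 0.2, 0.3],           # feature vector 0
        [0.4, 0.5, 0.6],           # feature vector 1
        [0.7, 0.8, 0.9],           # feature vector 2
    ]

    def vector_sum(sparse_indicator, table):
        """Sum the feature vectors selected by the sparse vector indicator."""
        width = len(table[0])
        acc = [0.0] * width
        for row in sparse_indicator:
            for i in range(width):
                acc[i] += table[row][i]
        return acc

    result = vector_sum([0, 2], embedding_table)   # -> [0.8, 1.0, 1.2]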
In an embodiment, a data processing unit may include a data analysis unit configured to acquire a plurality of data elements from a memory, evaluate each of the plurality of data elements relative to at least one criteria, and generate an output that includes a plurality of validity indicators identifying a first plurality of data elements among the plurality of data elements that validly satisfy the at least one criteria and identifying a second plurality of data elements among the plurality of data elements that do not validly satisfy the criteria. The data processing unit may also include a data packer configured to generate, based on the output of the data analysis unit, a packed data output including the first plurality of data elements and omitting the second plurality of data elements.
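A minimal sketch of the data analysis unit and data packer behavior, assuming a simple threshold criterion and list-based data; all names are illustrative.

    # Illustrative sketch of the data analysis unit and data packer. The criterion
    # (a threshold test) and the data layout are assumptions chosen for clarity.
    def analyze(data_elements, criterion):
        """Return one validity indicator per data element."""
        return [1 if criterion(e) else 0 for e in data_elements]

    def pack(data_elements, validity):
        """Produce a packed output containing only the validly-satisfying elements."""
        return [e for e, v in zip(data_elements, validity) if v]

    elements = [12, 3, 47, 8, 25]
    validity = analyze(elements, criterion=lambda x: x > 10)   # [1, 0, 1, 0, 1]
    packed = pack(elements, validity)                          # [12, 47, 25]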
In an embodiment, a processor-to-processor data transfer system may include a first processor programmed to: load data from a memory using a memory mapped interface; generate a data packet for transferring at least some of the loaded data via a non-memory mapped stream interface; and send the generated data packet, including at least some of the loaded data, to a second processor.
In an embodiment, a computational memory chip may include a plurality of processor subunits and a plurality of memory banks formed on a common substrate, and wherein each processor subunit among the plurality of processor subunits is associated with one or more dedicated memory banks from among the plurality of memory banks; at least one originating processor subunit among the plurality of processor subunits; at least one consumer processor subunit among the plurality of processor subunits; and a stream interface, wherein the stream interface is configured to transfer to the at least one consumer processor subunit data generated by the at least one originating processor subunit.
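The following sketch models, under simplifying assumptions, an originating subunit that loads data through a memory-mapped interface and forwards it to a consumer subunit over a stream interface (modeled here as a queue); the packet format is hypothetical.

    # Illustrative sketch of processor-to-processor transfer via a stream interface.
    from queue import Queue

    memory = {0x100: 11, 0x104: 22, 0x108: 33}   # memory-mapped view (address -> word)
    stream_interface = Queue()                    # stands in for the on-chip stream interface

    def originating_subunit(addresses):
        payload = [memory[a] for a in addresses]          # memory-mapped loads
        packet = {"dst": "consumer-0", "data": payload}   # hypothetical packet format
        stream_interface.put(packet)                      # send over the (non-memory-mapped) stream interface

    def consumer_subunit():
        packet = stream_interface.get()                   # receive the generated data
        return sum(packet["data"])                        # consume it (example operation)

    originating_subunit([0x100, 0x104, 0x108])
    print(consumer_subunit())   # 66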
In an embodiment, a computational memory system is disclosed. The system may comprise: at least one computational memory chip including at least one processor subunit and at least one memory bank formed on a common substrate; and a data invalidity detector configured to receive a sequence of data units and invalidity metadata relating to the sequence of data units, and wherein the data invalidity detector is further configured to generate an invalidity bitmap associated with the sequence of data units based on the received invalidity metadata, and append the invalidity bitmap to the sequence of data units to provide a mapped data segment to be stored in the at least one memory bank of the at least one computational memory chip.
In an embodiment, a method of storing a sequence of data units in one or more of a plurality of memory banks may comprise: receiving, using a data invalidity detector, the sequence of data units and invalidity metadata relating to the sequence of data units; generating, using the data invalidity detector, an invalidity bitmap associated with the sequence of data units based on the received invalidity metadata; modifying the sequence of data units by appending the invalidity bitmap to the sequence of data units to provide a mapped data segment; and storing the modified sequence of data units in the one or more memory banks of at least one computational memory chip.
In an embodiment, a non-transitory computer readable medium may store instructions executable by at least one processor to cause the at least one processor to perform a method. The method may comprise receiving a sequence of data units and invalidity metadata relating to the sequence of data units; generating an invalidity bitmap associated with the sequence of data units based on the received invalidity metadata; and appending the invalidity bitmap to the sequence of data units to provide a mapped data segment to be stored in at least one memory bank of at least one computational memory chip.
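A minimal sketch of the invalidity-bitmap flow, assuming the invalidity metadata is a list of invalid unit positions; the segment layout shown is illustrative only.

    # Illustrative sketch: build an invalidity bitmap from invalidity metadata and
    # append it to the sequence of data units to form the mapped data segment.
    def build_invalidity_bitmap(invalid_positions):
        """One bit per data unit; 1 marks an invalid unit."""
        bitmap = 0
        for pos in invalid_positions:
            bitmap |= 1 << pos
        return bitmap

    def map_segment(data_units, invalid_positions):
        bitmap = build_invalidity_bitmap(invalid_positions)
        return {"units": list(data_units), "invalidity_bitmap": bitmap}

    segment = map_segment([10, 20, 30, 40], invalid_positions=[1, 3])
    # segment["invalidity_bitmap"] == 0b1010; the mapped data segment would then be
    # stored in a memory bank of the computational memory chip.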
Consistent with other disclosed embodiments, non-transitory computer-readable storage media may store program instructions, which are executed by at least one processing device and perform any of the methods described herein.
The foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the claims.
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate various disclosed embodiments. In the drawings:
The following detailed description refers to the accompanying drawings. Wherever convenient, the same reference numbers are used in the drawings and the following description to refer to the same or similar parts. While several illustrative embodiments are described herein, modifications, adaptations and other implementations are possible. For example, substitutions, additions or modifications may be made to the components illustrated in the drawings, and the illustrative methods described herein may be modified by substituting, reordering, removing, or adding steps to the disclosed methods. Accordingly, the following detailed description is not limited to the disclosed embodiments and examples. Instead, the proper scope is defined by the appended claims.
Processor Architecture
As used throughout this disclosure, the term “hardware chip” refers to a semiconductor wafer (such as silicon or the like) on which one or more circuit elements (such as transistors, capacitors, resistors, and/or the like) are formed. The circuit elements may form processing elements or memory elements. A “processing element” refers to one or more hardware-based logic circuit elements that, together, perform at least one logic function (such as an arithmetic function, a logic gate, other Boolean operations, or the like). A processing element may be a general-purpose processing element (such as a configurable plurality of transistors) or a special-purpose processing element (such as a particular logic gate or a plurality of circuit elements designed to perform a particular logic function). A “memory element” refers to one or more circuit elements that can be used to store data. A “memory element” may also be referred to as a “memory cell.” A memory element may be dynamic (such that electrical refreshes are required to maintain the stored data), static (such that data is maintained without the need for refresh for as long as power is supplied), or non-volatile.
One or more processing elements may form a processor subunit. A “processor subunit” may comprise a smallest grouping of processing elements (e.g., logic circuitry) that may execute at least one task or instruction (e.g., of a processor instruction set). For example, a processor subunit may comprise one or more general-purpose processing elements configured to execute instructions together, one or more general-purpose processing elements paired with one or more special-purpose processing elements configured to execute instructions in a complementary fashion, or the like. The processor subunits may be arranged on a substrate (e.g., a wafer) in an array. Although the “array” may comprise a rectangular shape, any arrangement of the subunits in the array may be formed on the substrate.
Memory elements may be joined to form memory banks. For example, a memory bank may comprise one or more lines of memory elements linked along at least one wire (or other conductive connection). Furthermore, the memory elements may be linked along at least one additional wire in another direction. For example, the memory elements may be arranged along wordlines and bitlines, as explained below. Although the memory bank may comprise lines, any arrangement of the elements in the bank may be used to form the bank on the substrate. Moreover, one or more banks may be electrically joined to at least one memory controller to form a memory array. Although the memory array may comprise a rectangular arrangement of the banks, any arrangement of the banks in the array may be formed on the substrate.
One or more processing elements (resources) paired together with one or more memory elements (resources) formed on a common substrate of a hardware chip may be referred to as a “memory processing module” (MPM), a “memory processing element,” and/or “computational memory.” The processing and memory elements of a memory processing module are fabricated on a common substrate, e.g., as an integrated circuit, and the processing elements may be spatially distributed among the memory elements. In a memory processing module, the processing elements can process data stored in associated local memory elements. The memory processing module may include other elements including, for example, at least one controller. A plurality of memory processing modules may be combined and manufactured on the same substrate to form a “memory processing chip.”
A legacy controller (also referred to as a memory controller) refers to a controller that is used to interface with memory resources, such as memory chips, using legacy memory interface protocols for reading from and writing to the memory chips. A DDR family controller is an example of a legacy controller.
A memory processing module (MPM) unaware memory controller refers to a controller that communicates with an MPM without being aware of the processing resources of the MPM. In other words, the communication may occur independent from and without information relating to the processing resources of the MPM. A legacy controller or a memory controller is an example of such an unaware memory controller. In the context of this document, the terms “legacy controller,” “MPM unaware controller,” and “DDR4 controller” are used in an interchangeable manner. DDR4 is an example of a memory interface standard or protocol.
An MPM controller is a controller that is configured to access and/or communicate with and/or configure and/or command at least some of the processing resources and the memory resources of the MPM. In the context of this document, an MPM controller is referred to as a “master controller” (MC, XMC).
As further used throughout this disclosure, a “bus” refers to any communicative connection between elements of a substrate. For example, a wire or a line (forming an electrical connection), an optical fiber (forming an optical connection), or any other connection conducting communications between components may be referred to as a “bus.”
Conventional processors pair general-purpose logic circuits with shared memories. The shared memories may store both instruction sets for execution by the logic circuits as well as data used for and resulting from execution of the instruction sets. As described below, some conventional processors use a caching system to reduce delays in performing pulls from the shared memory; however, conventional caching systems remain shared. Conventional processors include central processing units (CPUs), graphics processing units (GPUs), various application-specific integrated circuits (ASICs), or the like.
As shown in
Moreover, processing unit 110 communicates with shared memory 140a and memory 140b. For example, memories 140a and 140b may represent memory banks of shared dynamic random access memory (DRAM). Although depicted with two banks, most conventional memory chips include between eight and sixteen memory banks. Accordingly, processor subunits 120a and 120b may use shared memories 140a and 140b to store data that is then operated upon by processor subunits 120a and 120b. This arrangement, however, results in the buses between memories 140a and 140b and processing unit 110 acting as a bottleneck when the clock speeds of processing unit 110 exceed data transfer speeds of the buses. This is generally true for conventional processors, resulting in lower effective processing speeds than the stated processing speeds based on clock rate and number of transistors.
As shown in
Moreover, processing unit 210 communicates with shared memories 250a, 250b, 250c, and 250d. For example, memories 250a, 250b, 250c, and 250d may represent memory banks of shared DRAM. Accordingly, the processor subunits of processing unit 210 may use shared memories 250a, 250b, 250c, and 250d to store data that is then operated upon by the processor subunits. This arrangement, however, results in the buses between memories 250a, 250b, 250c, and 250d and processing unit 210 acting as a bottleneck, similar to the bottleneck described above for CPUs.
The memory module 301 can activate a cyclic redundancy check (CRC) for each chip's burst of data, to protect the chip interface. A cyclic redundancy check is an error-detecting code commonly used in digital networks and storage devices to detect accidental changes to raw data. Blocks of data get a short check value attached, based on the remainder of a polynomial division of the block's contents. In this case, an original CRC 426 is calculated by the DDR controller 308 over the 8 bytes of data 422 in a chip's burst (one row in the current figure) and sent with each data burst (each row, to a corresponding chip) as a ninth byte in the chip's burst transmission. When each chip 300 receives data, each chip 300 calculates a new CRC over the data and compares the new CRC to the received original CRC. If the CRCs match, the received data is written to the chip's memory 302. If the CRCs do not match, the received data is discarded, and an alert signal is activated. A conventional alert signal includes an ALERT_N signal.
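As an illustration of the per-burst CRC protection described above, the following sketch computes a CRC-8 over an 8-byte burst and re-checks it at the receiver. The polynomial x^8 + x^2 + x + 1 is commonly cited for the DDR4 write CRC, but the bit ordering and initial value here are simplified assumptions rather than the exact JEDEC formulation.

    # Simplified sketch of per-burst CRC protection: compute a CRC-8 over the 8 data
    # bytes, send it as a ninth byte, and recompute/compare at the receiver.
    def crc8(data: bytes, poly: int = 0x07, init: int = 0x00) -> int:
        crc = init
        for byte in data:
            crc ^= byte
            for _ in range(8):
                crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
        return crc

    burst = bytes([0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88])
    original_crc = crc8(burst)            # computed by the controller, sent as byte 9

    received = bytearray(burst)
    received[2] ^= 0x01                   # simulate a single-bit transmission error
    if crc8(bytes(received)) != original_crc:
        pass                              # the chip would discard the data and assert the alert signal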
Additionally, when writing data to a memory module 301, an original parity 428A is normally calculated over the (exemplary) transmitted command 428B and address 428C. Each chip 300 receives the command 428B and address 428C, calculates a new parity, and compares the original parity to the new parity. If the parities match, the received command 428B and address 428C are used to write the corresponding data 422 to the memory module 301. If the parities do not match, the received data 422 is discarded, and an alert signal (e.g., ALERT_N) is activated.
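A correspondingly small sketch of command/address parity checking, assuming even parity and illustrative field widths:

    # Simplified sketch of command/address parity: the controller computes even parity
    # over the command and address bits; each chip recomputes it and compares.
    def even_parity(value: int) -> int:
        return bin(value).count("1") & 1

    command, address = 0b101, 0x1F3A
    original_parity = even_parity((command << 17) | address)   # sent along with CMD+ADDR

    # Receiver side: a mismatch would cause the data to be discarded and the alert signal asserted.
    assert even_parity((command << 17) | address) == original_parity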
Overview of Disclosed Memory Processing Modules and Associated Appliances
In the example of
A DDR controller 608 may also be operationally connected to each of the memory banks 600, e.g., via an MPM slave controller 623. Alternatively and/or in addition to the DDR controller 608, a master controller 622 can be operationally connected to each of the memory banks 600, e.g., via the DDR controller 608 and memory controller 623. The DDR controller 608 and the master controller 622 may be implemented in an external element 620. Additionally and/or alternatively, a second memory interface 618 may be provided for operational communication with the MPM 610.
While the MPM 610 of
Each MPM 610 may include one processing module 612 or more than one processing module 612. In the example of
Each memory bank 600 may be configured with any suitable number of memory arrays 602. In some cases, a bank 600 may include only a single array. In other cases, a bank 600 may include two or more memory arrays 602, four or more memory arrays 602, etc. Each of the banks 600 may have the same number of memory arrays 602. Alternatively, different banks 600 may have different numbers of memory arrays 602.
Various numbers of MPMs 610 may be formed together on a single hardware chip. In some cases, a hardware chip may include just one MPM 610. In other cases, however, a single hardware chip may include two, four, eight, sixteen, 32, 64, etc. MPMs 610. In the particular example represented by
One or more XRAM chips 624, typically a plurality of XRAM chips 624, such as sixteen XRAM chips 624, may be configured together to provide a dual in-line memory module (DIMM) 626. A traditional DIMM may be referred to as a RAM stick, which typically includes eight or nine dynamic random-access memory chips (integrated circuits) constructed as/on a printed circuit board (PCB) and having a 64-bit data path. In contrast to traditional memory, the disclosed memory processing modules 610 include at least one computational component (e.g., processing module 612) coupled with local memory elements (e.g., memory banks 600). As multiple MPMs may be included on an XRAM chip 624, each XRAM chip 624 may include a plurality of processing modules 612 spatially distributed among associated memory banks 600. To acknowledge the inclusion of computational capabilities (together with memory) within the XRAM chip 624, each DIMM 626 including one or more XRAM chips (e.g., sixteen XRAM chips, as in the
As shown in
The DDR controller 608 and the master controller 622 are examples of controllers in a controller domain 630. A higher level domain 632 may contain one or more additional devices, user applications, host computers, other devices, protocol layer entities, and the like. The controller domain 630 and related features are described in the sections below. In a case where multiple controllers and/or multiple levels of controllers are used, the controller domain 630 may serve as at least a portion of a multi-layered module domain, which is also further described in the sections below.
In the architecture represented by
The location of processing elements 612 among memory banks 600 within the XRAM chips 624 (which are incorporated into XDIMMs 626 that are incorporated into IMPUs 628 that are incorporated into memory appliance 640) may significantly relieve the bottlenecks associated with CPUs, GPUs, and other conventional processors that operate using a shared memory. For example, a processor subunit 612 may be tasked to perform a series of instructions using data stored in memory banks 600. The proximity of the processing subunit 612 to the memory banks 600 can significantly reduce the time required to perform the prescribed instructions using the relevant data.
As shown in
The architecture described in
In addition to a fully parallel implementation, at least some of the instructions assigned to each processor subunit may be overlapping. For example, a plurality of processor subunits 612 on an XRAM chip 624 (or within an XDIMM 626 or IMPU 628) may execute overlapping instructions as, for example, an implementation of an operating system or other management software, while executing non-overlapping instructions in order to perform parallel tasks within the context of the operating system or other management software.
For purposes of various structures discussed in more detail below, the Joint Electron Device Engineering Council (JEDEC) Standard No. 79-4C defines the DDR4 SDRAM specification, including features, functionalities, AC and DC characteristics, packages, and ball/signal assignments. The latest version at the time of this application is January 2020, available from JEDEC Solid State Technology Association, 3103 North 10th Street, Suite 240 South, Arlington, Va. 22201-2107, www.jedec.org, and is incorporated by reference in its entirety herein.
The DDR4 “ALERT_n” is an input/output pin and signal having multiple functions, such as serving as a CRC error flag and a Command and Address Parity error flag when used as an output signal. If there is an error in the CRC, ALERT_n goes LOW for a defined time interval and then goes back HIGH. If there is an error in the Command Address Parity Check, ALERT_n goes LOW for a relatively long period until the ongoing SDRAM internal recovery transaction is complete. During Connectivity Test mode this pin functions as an input. (JEDEC 79-4C, page 6).
Referring to the drawings,
The architecture of MPM 610 may lead to challenges that do not exist in conventional architecture implementations. Thus, new systems and methods are required to solve these new technical problems. One challenge that does not exist in conventional memory chips results at least in part from local processing of data that changes the data in memory. For example, the processing subunit 612 may read data from memory array-0 602-0, read data from memory array-1 602-1, perform a calculation on the read data, generate a result, and write the result to memory array-2 602-2. Thus, the previously stored data (e.g., not the entire data, but a portion) within the memory array-2 602-2 has changed from an originally stored value to a new value. As conventional ECC methods (for example, described above) are designed to detect changes in the data, these conventional methods fail (give false positives/indication of error) in the current case of intentionally changing the stored data.
One solution may be to perform ECC within each bank 600, when data is received for writing to each corresponding memory array 602. An original ECC is calculated by the local ECC module 804 across the received data, and then both the received data and corresponding original ECC are stored in each bank 600. When data is read from a memory array 602, the corresponding original ECC is also read from the memory array 602, a new ECC is calculated across the read data, and then the new ECC is compared to the original ECC (typically by the corresponding local ECC module 804) to determine (detect) if an error has occurred in the data (while stored). Note, this implementation is not limiting, and the ECC may be stored in any memory array/bank or an alternative location.
When the data in a memory array 602 (e.g., a portion) is changed, this is typically done by writing new data over the old data/data to be changed. The method is the same as described above for writing data to the memory array: An ECC is calculated by the local ECC module 804 across the received data, and then both the received data and corresponding original ECC are stored in each bank 600, in this case replacing the old ECC with a new ECC that is valid for the current data.
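The write, read-and-check, and overwrite steps can be sketched as follows; the disclosure does not tie the local ECC module 804 to a particular code, so a simple XOR checksum stands in for the ECC calculation here.

    # Sketch of the per-bank ECC flow: store data with its original ECC, recompute and
    # compare a new ECC on read, and replace the ECC whenever the data is overwritten.
    class Bank:
        def __init__(self):
            self.rows = {}                          # address -> (data, ecc)

        @staticmethod
        def _ecc(data: bytes) -> int:               # placeholder for the local ECC module 804
            acc = 0
            for b in data:
                acc ^= b
            return acc

        def write(self, addr, data: bytes):
            self.rows[addr] = (bytes(data), self._ecc(data))   # store data with its original ECC

        def read(self, addr):
            data, original_ecc = self.rows[addr]
            if self._ecc(data) != original_ecc:                # new ECC vs. original ECC
                raise ValueError("ECC error detected in stored data")
            return data

    bank = Bank()
    bank.write(0x0, b"\x01\x02\x03")        # initial write stores data + ECC
    bank.write(0x0, b"\x09\x08\x07")        # a processing result overwrites data and its ECC
    assert bank.read(0x0) == b"\x09\x08\x07"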
In alternative implementations, the ECC (original and/or new) can be calculated in any area, for example, within the MPM 610, by the processing module/processor subunit 612, within a multi-layered module, and by a controller (such as slave controller 1012).
In the context of this document, ECC modules are also referred to as “ECC calculators”. In addition to the local ECC module 804 in the banks 600 and the ECC module 312 which may be included in the DDR controller 308, there may be provided other additional and/or alternative ECC modules (calculators) anywhere in the controller domain 630 and/or the higher-level domain 632.
The MPM local ECC module 804, the controller ECC module 312, and the one or more other ECC modules may perform the same ECC calculations, perform different ECC calculations, perform ECC calculations on the same type and/or size of data units, perform calculations on different types and/or different sizes of data units, may be of the same complexity, may differ in complexity, may be of the same error correction capability, may differ in error correction capability, may be of the same error detection capability, and may be of different error detection capabilities. In a non-limiting example, an additional other ECC calculator may perform ECC calculation on larger data units (for example, a frame versus a memory bank line or page), may detect more errors, may correct more errors, and the like, as compared to the local ECC module 804 and the ECC module 312. In another example, one or more other ECC modules may correct errors that were detected by the local ECC module 804, may be used to validate ECC error calculations or detection made by the local ECC module 804, and the like.
Errors occur in information stored in memory. Embodiments may provide effective solutions for error detection and error correction (ECC), especially with respect to the disclosed architecture. ECC information may indicate whether or not an error occurred. The ECC information may be of different resolutions, for example, an error found or detected in a memory address or a memory array 602, in a memory bank 600, and/or in a processing module 612 associated with an outcome of an ECC process. The latter is an example of ECC information that requires more access to the MPM 610 or more steps to find the exact ECC error.
The outcome of the ECC (ECC processing), an ECC indicator, may be output (transmitted) using any means available to the MPM 610. For example, the ECC indicator may be output using a memory protocol over a first memory interface between the MPM 610 and the memory controller. One implementation of the first memory interface is described elsewhere in this document (in reference to
In another implementation, additionally and/or alternatively to the first memory interface, a second memory interface (second channel) 618 is in operational communication with the MPM 610. The second memory interface 618 may be a second channel and may be a shared communication path or a dedicated communication path. The second interface 618 may be a wired or a wireless interface. The second interface 618 may communicate with the controller domain and/or with a higher-level domain. The second interface may provide out-of-band communication, may be a dedicated pin, and the like.
Referring now to
While shown as separate elements in the current figure, as described elsewhere in this description, the memory controller 1102 may be a sub-element of the master controller 622. Commands and addresses may be shared between all MPMs 610. The number of MPMs may differ from 16. Communication may be implemented using the DDR4 standard. Communication protocols other than DDR4 may be used.
In an alternative implementation, the communication may differ from the parallel communication using shared command and address channel 1304 of the current figure.
In the current figure, the data 122 is being sent with a corresponding ECC 124 to the MPMs 610. For example, each data unit 122 may be 64 bits with a corresponding 8 bits of ECC 124. In this case, the data unit and ECC can be received by the MPM 610 and written to memory. In comparison, in an alternative or additional implementation described in reference to
If an error is detected in the MPM 610, for example an ECC error, a variety of techniques can be used to notify that an error has occurred. For example, notifying the controller domain 1106, such as the MPM controller (XMC, 1104) that a bit has flipped in memory/an ECC error has occurred using in-band communications (
Refer to the drawings,
In an alternative implementation,
In an alternative implementation,
According to the teachings of the present embodiment there is provided a computational memory system including: at least one computational memory module (610) including one or more processor subunits (UMIND, 612) and one or more memory banks (600) formed on a common substrate, and one or more local error correction code (ECC) modules (804) operationally connected to the at least one computational memory module (610) and configured to calculate an original ECC based on received data.
In an optional embodiment, one or more of the local ECC modules (804) is formed on the common substrate. In another optional embodiment, one or more of the local ECC modules (804) is deployed respectively in one or more of the processor subunits (612). In another optional embodiment, each of the local ECC modules (804) is associated with one of the memory banks (600). In another optional embodiment, each of the one or more local ECC modules (804) is associated with each of one or more of the memory banks (600).
In another optional embodiment, the received data is from a slave controller (SC, 1122, XSC, 1124). In another optional embodiment, the received data is from the one or more processor subunits (612).
In another optional embodiment, one or more of the local ECC modules (804) are configured to store the original ECC in a location selected from the group consisting of: in one or more of the memory banks (600) that are associated with the ECC module (804), in one or more of the memory banks (600) other than memory banks that are associated with the ECC module (804), and in a location of the computational memory module (610) other than the memory banks (600).
In another optional embodiment, the received data is stored in one or more of the memory banks (600) as stored data, and one or more of the local ECC modules (804) are configured to store the original ECC in association with the stored data.
In another optional embodiment, one or more of the local ECC modules (804) are configured to calculate a new ECC based on the stored data, to determine if an error in memory has occurred and the stored data has changed to a new value. In another optional embodiment, one or more local ECC modules (804) are configured to generate an ECC indicator based on a comparison of the new ECC to the original ECC. In another optional embodiment, one or more local ECC modules (804) are configured to transmit the ECC indicator using a method selected from the group consisting of: in-band (124) with transmission of the stored data, out-of-band (1304) from transmission of the stored data, and via a second memory interface (618).
In another optional embodiment, one or more of the local ECC modules (804) are configured to transmit the ECC indicator using a data link layer communication channel (DLL 1100).
The present embodiments use a memory protocol for conveying a signal to a controller, where the signal may have different meanings. The different meanings may include conventional error notification and innovative data transfer.
The following non-limiting example is provided now so the reader has a general overview of one implementation. When using a memory processing module/chip, an indication is needed that the memory processing module has data to output. In the current embodiment, the ALERT_N signal is used to indicate that the memory processing module has data to output, for example that an H4 FIFO is not empty. This enables reads of the memory processing module by an external element to be done only when processed data (processed in the memory processing module) is ready, instead of an external element needing to poll the memory processing module. This implementation can save power consumption and/or allow accurate arbitration between reads and writes. One feature of the current embodiment using the ALERT_N signal is to keep the DDR4 chip and DIMM I/O compatible with the JEDEC standards.
Since the ALERT_N signal functionality has been changed, a solution is needed to indicate a data integrity error (for example, a CRC or parity error), which is conventionally indicated by the ALERT_N signal. One solution described below is to convert data integrity error indications (of CRC or parity errors) to interrupt data link layer (DLL) frames.
Referring again to the drawings,
In general, if an error is detected, that is, if an error has occurred in transition to memory or in memory, the elements external to the memory must be informed, so the external elements can take appropriate action (for example, correct the error, or discard the data). In the conventional memory module 301 described above, the error signal S0 is activated (asserted 914) as a conventional alert signal, such as activating the DDR4 ALERT_N signal, and transmitted as a first signal S1 (error signal) to the external element 900 (to the DDR4 controller 904).
Referring again to the drawings,
In the current implementation, the internal element 1010 (for example the memory processing module 610) includes a data link layer (DLL) communication channel 1100 (not shown in the current figure). The DLL communications channel 1100 may notify the external element 1000 that data is waiting in the internal element 1010, for example, in the output FIFO 1024, for the external element 1000 to initiate a transfer of the data from the internal element 1010 to the external element 1000. In an exemplary implementation, an XRAM memory processing module 610 uses the ALERT_N hardware pin and signal to send a FIFO not empty signal (a data ready signal, asserting sixth signal S6) from the internal element 1010 to the XMC (master controller 1002) in the external element 1000. In the context of this document, references to “ALERT_N” are generally to the hardware pin and signal functionality, as one skilled in the art will be aware of based on context. As the ALERT_N signal is also needed to notify the external element 1000 of the assertion 914 of the error signal S0, a solution is needed to use a single, common hardware pin and signal for multiple signals, in this case both conventional error signaling S0 and data ready (FIFO not empty 1020) signal S6.
At least in part to solve this problem, the error signal S0 is not transmitted as the first signal S1 directly to the DDR4 controller 904 (shown in the current figure as a dashed line). Instead, the error signal S0 is transmitted as a fourth signal S4, e.g., via the optional glue logic 1020 to the internal cause register 1022. Based on the received error signal (S0, S4), the cause register 1022 transfers error indication data to the output FIFO 1024, where the error indication data is enqueued and awaits transfer from the internal element 1010 to the external element 1000. As there is now data in the output FIFO 1024, the data ready (FIFO is not empty) module 1026 asserts the data ready (FIFO not empty) signal, sixth signal S6. The data ready signal S6 is transmitted via logic 1036 as an eighth signal S8 (standard alert signal) to the external element 1000, in this case to the XMC master controller 1002. The logic 1036 is used for backward compatibility, as the conventional error signal S0 can also be transferred, shown as a second signal (error signal) S2, via the logic 1036 to the external element 1000. Note, in the current exemplary implementation, the signals are active negative, so a logical AND can be used for the logic 1036.
When the XMC master controller 1002 receives the data ready signal (via the eighth signal S8), the XMC master controller 1002 can initiate a transfer of the data from the internal element 1010 to the external element 1000. This data transfer can be via the DLL communication channel 1100.
Note that the signaling (such as the sixth signal S6) is an assertion notification, separate from the transfer of data, such as error indication data, from the output FIFO 1024 (for example, to the external element 1000). The transfer of data from the output FIFO 1024 is not shown in the current figure.
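A simplified software model of this alert path, with signals treated as active-low levels (0 = asserted) to match the note that a logical AND can implement logic 1036; all names are illustrative.

    # Simplified model: a detected error is routed to the cause register, enqueued as
    # error-indication data in the output FIFO, and the resulting "FIFO not empty"
    # condition drives the shared alert output through an AND of active-low signals.
    from collections import deque

    output_fifo = deque()
    cause_register = {"crc_parity_error": False}

    def on_error_detected():                       # error signal S0 -> S4
        cause_register["crc_parity_error"] = True
        output_fifo.append({"type": "interrupt", "cause": "crc_parity_error"})

    def data_ready_n():                            # sixth signal S6 (active low)
        return 0 if output_fifo else 1

    def alert_n(error_n: int) -> int:              # logic 1036: AND of active-low inputs
        return error_n & data_ready_n()

    assert alert_n(error_n=1) == 1                 # nothing pending: alert deasserted
    on_error_detected()
    assert alert_n(error_n=1) == 0                 # FIFO not empty: alert asserted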
There may be provided a method for using a memory protocol pin of a memory processing module 610 (910, 1010) for conveying a signal to a controller, wherein the signal may have different meanings, and wherein the controller may differentiate between the different meanings using context. The memory protocol pin is a pin that is or was used, in a legacy (conventional) memory chip, to convey a dedicated signal to the controller. The controller may be external to the memory processing module 610, such as the external element 1000 master controller 1002. The different meanings may, for example, flag different events or different statuses of the memory processing module 610. The context may be provided by the memory processing module 610 or otherwise be known by the controller. One example is using the ALERT_N signal and memory protocol pin for various purposes, such as conveying a signal having different meanings.
A memory processing module 610 may use the ALERT_N pin to output the ALERT_N signal to indicate that a CRC (or, optionally, a parity) error occurred. The memory processing module 610 may also use the ALERT_N pin to output the ALERT_N signal to indicate that the memory processing module 610 has content to output (for example, the output FIFO 1024 is not empty). The external controller (the master controller 1002) may determine whether the ALERT_N signal is indicative of the data integrity (CRC or parity) error or indicates that the memory processing module 610 has content (data) to output, based on other signals output by the memory processing module 610, based on timing (different time windows may be allocated for the different purposes of the ALERT_N signals), and the like.
This method of using the ALERT_N signal to indicate that an output element (for example a FIFO such as the output FIFO 1024) is not empty induces the external controller (the master controller 1002) to initiate transfer of the content (data). This indication method may save energy, for example, as the external element 1000 (master controller 1002) does not need to poll the internal element 1010 (output FIFO 1024) to determine a status of the internal element 1010. The ALERT_N signal and pin still can be used to provide an indication of a parity error in a chip testing mode and when CRC/Parity check is enabled.
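On the receiving side, the master controller's interpretation of the shared ALERT_N signal by mode and context might be sketched as follows; the mode names and handler callbacks are assumptions for illustration.

    # Illustrative sketch of the master controller (XMC) side: in chip testing mode the
    # alert is treated as a conventional CRC/parity error flag, while in XRAM mode it is
    # treated as a "data ready" indication that triggers a read instead of polling.
    CHIP_TEST_MODE, XRAM_MODE = "chip_test", "xram"

    def handle_alert_n(asserted: bool, mode: str, read_dll_frame, handle_error):
        if not asserted:
            return                       # nothing pending; no polling of the output FIFO needed
        if mode == CHIP_TEST_MODE:
            handle_error()               # standard ALERT_N semantics (CRC/parity error)
        elif mode == XRAM_MODE:
            read_dll_frame()             # initiate transfer of the waiting data (e.g., over DLL channel 1100)

    handle_alert_n(True, XRAM_MODE,
                   read_dll_frame=lambda: print("XMC: fetch from H4 output FIFO"),
                   handle_error=lambda: print("XMC: CRC/parity error"))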
In further detail, the XSC 1012 can check for CRC errors 426 when data is written via the conventional DDR4 interface, and for parity errors (428A, 438A) on the command and address lines (CMD+ADDR) over the DDR4 interface. In case of a CRC or parity error, the ALERT_N error signal S0 is asserted for a few cycles to notify the DDR4 controller 904 that an error has occurred. The XRAM 1010 functionality, as modified from conventional memory, includes modifying the ALERT_N signal for two modes:
1. Chip testing mode
2. XRAM mode
1. During chip testing mode, the ALERT_N signal acts as a standard ALERT_N signal as defined in the DDR4 specification.
2. During XRAM mode, the ALERT_N signal can be set (asserted) when:
2.1. H4 (1016) output FIFO (1024) is not empty
The ALERT_N signal (as the eighth signal S8) is used to signal the XMC 1002 that there is data to fetch from the H4 (1016) output FIFO 1024. Without this indication, a conventional implementation would require the XMC 1002 to continuously poll the H4 (1016) output FIFO 1024, which results in a waste of resources and power.
For the XMC 1002 to catch events of CRC/parity errors, the CRC/parity error indication coming from the XSC 1012 (the error signal S0) can be connected as well (via the fourth signal S4) to the H4 (1016) interrupts cause register 1022 as an interrupt source (via optional glue logic 1020 that converts the signal (the error signal S0, S4) to a standard interrupt source behavior). Unless the CRC/parity interrupt cause is masked, the H4 (1016) will convert this interrupt into a message, and eventually the message will be put into an interrupt DLL frame S10, for transfer from the internal element 1010 to the external element 1000.
A feature of using the ALERT_N pin to indicate “H4 FIFO not empty” is that it retains compatibility with the chip+DIMM standard I/O. Alternatively, a pin other than ALERT_N could be used, keeping the CRC+PARITY ERROR indications as-is via the ALERT_N pin. However, in this alternative case, compatibility with the standard chip+DIMM I/O may not be retained, and proprietary I/O will be needed.
According to the teachings of the present embodiment there is provided a computer communications alert system including: a slave controller (XSC, 1124), and a master controller (XMC, 1104) including a memory controller (DDR, 1102), the master controller (1104) configured for communication with the slave controller (1124). In a first mode (chip testing, conventional) the slave controller (1124) is configured to activate an error signal (S1, S2), and the memory controller (1102) is configured to handle the error signal. In a second mode (XRAM mode) the slave controller (1124) is configured to activate a data ready signal (S6), and the master controller (1104) is configured to handle the data ready signal. Both the error signal and the data ready signal are sent as a pre-defined alert signal (S8) from the slave controller (1124).
In an optional embodiment, the system further includes: one or more processor subunits (UMIND, 612), and one or more memory banks (600) formed on a common substrate. In another optional embodiment, the memory controller (1102) is a DDR4 controller.
In another optional embodiment, the first mode is a chip testing mode, and the slave controller (1124) is configured to activate the error signal as a pre-defined signal being a DDR4 ALERT_N signal. In another optional embodiment, the second mode is a data transfer mode, and the slave controller (1124) is configured to activate the data ready signal as the pre-defined signal being a DDR4 ALERT_N signal.
In another optional embodiment, the second mode is a data transfer mode, and the slave controller (1124) is configured to activate the data ready signal when data is ready to be transferred to the master controller (1104). In another optional embodiment, the master controller (1104) is configured, based on receiving the data ready signal, to initiate reading data from the slave controller (1124). In another optional embodiment, reading data is via a data link layer communication channel (DLL 1100).
In another optional embodiment, further including an output buffer (1024), and the slave controller (1124) is configured to activate the data ready signal based on the output buffer (1024) having data to send.
Refer to the drawings,
A feature of the current implementation is accessing processing modules 612 (processor subunits) using a redirection scheme in which one or more redirect addresses are allocated in each memory bank 600, and packets addressed to any of the redirect addresses are not written to the memory bank 600, but are instead written to a virtual port (for example, a DLL port) for handling by the processing module 612. The DLL port can be implemented by hardware implemented in the processor subunit domain, for example, by the MPM slave controller (1012, 1124, XSC).
The external controller 1106 includes an MPM controller 1104 handling the OSI logical layer 2 data link layer (DLL) 1114A functionality and an MPM memory controller 1102 handling the OSI logical layer 1 physical layer 1112A functionality. The MPM controller 1104 may be implemented by the master controller 622 (XMC, 1002) and the MPM memory controller 1102 may be implemented by the DDR controller 904. As described elsewhere in this document, the master controller (622, 1002, 1104) may include a memory controller (DDR4 controller, 608, 1102). In this case, the DDR controller 608 handles conventional layer 1 physical functions. The master controller 622 is optional. When used, the master controller 622 can implement additional functionality such as the DLL channel 1100. When the master controller 622 is not used, or bypassed, the DDR controller 608 can be used for conventional communications between the higher-level domain 632, via the external controller 1106, to the internal element 1110. The DDR controller 608 may be implemented as a sub-element of the MPM controller 1104 (a sub-element of the master controller XMC 622). The MPM controller 1104 and/or the MPM memory controller 1102 can be implemented in an FPGA as part of the IMPU 628. In a typical, non-limiting implementation, an IMPU 628 includes a single FPGA implementing four external controllers 1106, each external controller 1106 associated with a corresponding DIMM 626, and each external controller including an MPM controller 1104 (XMC 622) with an MPM memory controller 1102 sub-element.
The internal element 1110 includes an MPM slave controller 1124 handling the OSI data link layer (DLL) 1114B functionality and an MPM memory slave controller 1122 handling the physical layer 1112B functionality, similar to the description of the external controller domain. The MPM slave controller 1124 may be implemented by the slave controller 1012 (XSC) and the MPM memory slave controller 1122 may be implemented by a DDR slave/internal controller (912, 608). In this case, the DDR slave/internal controller handles physical functions. The slave controller 1012 is optional. When used, the slave controller 1012 can implement additional functionality such as the DLL channel 1100. When the slave controller 1012 is not used, or bypassed, the DDR slave/internal controller can be used for conventional communications between the higher-level domain 632, via the external controller 1106, to the internal element 1110. The DDR slave/internal controller may be implemented as a sub-element of the MPM slave controller 1124 (a sub-element of the slave controller XSC 1012). The MPM slave controller 1124 and/or the MPM memory slave controller 1122 can be implemented in an MPM 610 (XRAM chip 624). In a typical, non-limiting implementation, each MPM 610 includes one or more slave controllers 1012, each slave controller 1012 including the DDR slave/internal controller sub-element.
One or more DLL communications frames S10 are used to transmit information from the external controller 1106 to be received by the internal element 1110, and to transmit information from the internal element 1110 to be received by the external controller 1106. The information may include data, commands, and a combination of data and commands.
Note that the “DLL communications channel” or “DLL channel” 1100 of the current description is a protocol layer that communicates between nodes on a network segment across the physical layer. The innovative DLL channel 1100 can be implemented as a layer on top of an existing protocol, in this case the DDR layer protocol. Conventional DLL implementations communicate between nodes in the controller domain 630 and above, such as the higher-level domain 632, with legacy controllers (for example, the DDR controller 608) being a last node which then sends data to memory (for example, the conventional memory module 301). In contrast, the memory processing module MPM 610 includes nodes (for example, the processing module UMIND 612), and the DLL channel 1100 can be used to communicate with the MPM 610, with nodes internal to the MPM 610 such as the processing module 612, and within the MPM 610 from one node to another (for example, from a first processing module to a second processing module). That is, where a conventional DLL is used to communicate with a memory controller 308, and the received data is then sent from the memory controller 308 to a memory address, the DLL channel 1100 can be used (for example, by the master controller XMC 622) to communicate with the MPM 610. The DLL channel 1100 communication includes data write commands, data read commands, and instructions such as operations on memory, operations to be executed by the processing module 612, and the like, as described elsewhere in this document and the referenced documents.
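One way to picture layering the DLL channel on top of the memory protocol is the following transmit-side sketch, in which the master controller wraps a DLL frame as ordinary write data addressed to a redirect address (a concrete redirect address example, 0x3FF, is discussed below); the header bytes and helper names are hypothetical.

    # Illustrative transmit-side sketch: the master controller wraps a DLL frame and
    # issues it as ordinary write data, addressed so the slave side will divert it to
    # DLL handling rather than to a memory bank.
    DLL_PORT_ADDRESS = 0x3FF                       # virustrative virtual DLL port (per-bank redirect address)

    def xmc_send_dll(payload: bytes, target_subunit: int, ddr_write):
        frame = bytes([0xD1, target_subunit, len(payload)]) + payload   # hypothetical header
        ddr_write(DLL_PORT_ADDRESS, frame)         # carried over the existing DDR data path

    # ddr_write stands in for the legacy memory-protocol write primitive:
    xmc_send_dll(b"\x10\x20", target_subunit=3,
                 ddr_write=lambda addr, data: print(hex(addr), data.hex()))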
Referring to the drawings,
As noted above, in conventional implementations a legacy memory controller 1102 sends data write (and read) commands to a legacy memory slave controller 1122, which then writes to (or reads from) given memory address(es). In contrast, the DLL communications channel 1100 uses the DLL communications frame (S10, 1224) to communicate from the MPM controller XMC 1104 to the MPM slave controller XSC 1124. In a case where the DLL communication is a conventional write or read, the legacy memory slave controller SC 1122 can be modified to process the data 122, which in this case can be conventional data 122 (and not a DLL communications frame (S10, 1224)), and perform a conventional write to, or read from, memory. In a case where the data 122 is a DLL communications frame (S10, 1224), that is, is other than a conventional write or read, the legacy memory slave controller SC 1122 can pass/send the DLL communications for further processing. The DLL frame (S10, 1224) is then processed in the internal element 1110, typically in the MPM slave controller XSC 1124. Processing may include parsing the DLL frame (S10, 1224), reading the header, acting based on the header information, unpacking the frame, unpacking the payload, etc. Based on the DLL communication (header, payload, etc.), at least a portion of the DLL communication can be sent to one or more nodes internal to the MPM (610, 910, 1010), such as one or more processing modules 612. This applies in particular to DLL communications that are instructions for execution and processing by the processing module UMIND 612.
In a preferred implementation, a memory address is used as a virtual DLL port, for example, a predefined memory address such as the memory address 0x3FF (last row of each memory bank). When the legacy memory slave controller SC 1122 processes the data units 122 and sees that the address in the data 122 matches the predefined address (the external element 1000 is communicating on the DLL port), the memory slave controller SC 1122 passes the DLL frame (S10, 1224) to the MPM slave controller XSC 1124 for handling. Correspondingly, if a memory address in the data 122 fails to match the pre-defined address, the data is processed by the legacy memory slave controller SC 1122. The MPM slave controller XSC 1124 processes the data 122 as a DLL frame (S10, 1224), that is, it generates a DLL frame (S10, 1224) from the data 122.
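A corresponding receive-side sketch of the redirection, assuming the predefined address 0x3FF and a trivial frame handler; the dispatch logic, not the exact hardware partitioning, is the point of the illustration.

    # Illustrative receive-side sketch: the legacy memory slave controller (SC) compares
    # the incoming address against the predefined DLL port address. A match diverts the
    # data unit to the MPM slave controller (XSC) for DLL-frame handling; otherwise it is
    # treated as an ordinary memory write.
    DLL_PORT_ADDRESS = 0x3FF
    bank_memory = {}

    def xsc_handle_dll(data: bytes):
        # Parse the data as a DLL frame and act on it (e.g., forward to a processing module).
        print("XSC: handling DLL frame", data.hex())

    def sc_receive(address: int, data: bytes):
        if address == DLL_PORT_ADDRESS:            # communication on the virtual DLL port
            xsc_handle_dll(data)
        else:                                      # conventional memory write
            bank_memory[address] = data

    sc_receive(0x010, b"\xaa\xbb")                 # ordinary write to the memory bank
    sc_receive(0x3FF, b"\xd1\x03\x02\x10\x20")     # diverted to DLL handling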
As described above, the current figure is exemplary. Other frames may be used for the DLL frame 1224. Additional, alternative, fewer, or more fields may be used for the header 1226-nnH and payload 1226-nnP fields.
The DLL frame 1224 may include a frame type indicator, access type field (for example read or write), one or more addresses (including memory address, processor subunit address, unicast address, multicast address, broadcast address, group identifier of a group of entities (processor subunits and/or memory banks or a combination thereof) that are the source or the target of the frame, source address indicative of the source of the frame, target address indicative of the target of the frame, length of frame field, protected indicator that indicates whether the frame or a part thereof is checksum or error protected, valid indicator indicative of which part of the payload is valid, frame control field, and the like. Different types of traffic may use the same type of DLL frames or other types of DLL frames. The header 1226-nnH may be used to indicate the frame type, or another portion of the DLL frame 1224 may indicate the frame type.
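To make the field list above concrete, the following sketch packs a subset of those fields into a hypothetical frame layout; field order, widths, and encodings are assumptions, as the disclosure leaves them open.

    # Sketch of one possible DLL frame layout covering a subset of the listed fields.
    from dataclasses import dataclass

    @dataclass
    class DllFrame:
        frame_type: int       # e.g., read, write, interrupt, flow control, statistics
        access_type: int      # read or write
        target_address: int   # memory address, processor subunit, or group identifier
        source_address: int
        length: int           # length of the payload in bytes
        protected: bool       # whether the frame carries checksum/error protection
        payload: bytes

        def pack(self) -> bytes:
            header = bytes([self.frame_type, self.access_type,
                            self.target_address & 0xFF, self.source_address & 0xFF,
                            self.length & 0xFF, int(self.protected)])
            return header + self.payload

    frame = DllFrame(frame_type=1, access_type=0, target_address=0x12,
                     source_address=0x01, length=2, protected=True, payload=b"\x10\x20")
    wire_bytes = frame.pack()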
The DLL frame 1224 may be used to send a variety of communications between the external element 1000 (external domain 1106) and the internal element 1010 (internal element 1110). Some non-limiting examples include sending processed data, interrupts, flow control, and statistics. The external domain 1106 may use the DLL port to send DLL communications to the internal element 1110, the internal element 1110 may use the DLL port to send communications to the external domain 1106, and the internal element may use the DLL communications channel to communicate internally, for example from a first processing subunit 612 to a second processing subunit 612.
According to the teachings of the present embodiment there is provided an information transfer system including: a master controller (XMC, 1104) configured to issue a command in the form of a memory protocol data packet (122), and wherein the master controller (1104) is configured to generate routing information indicating whether the memory protocol data packet is to be processed according to a first protocol or according to a second protocol different from the first protocol, and a slave controller (XSC, 1124), configured to receive the memory protocol data packet from the master controller (1104). The slave controller (1124) is configured to use the routing information to selectively cause the memory protocol data packet to be processed according to the first protocol or according to the second protocol.
In an optional embodiment, the first protocol is a DDR4 protocol.
In another optional embodiment, the slave controller further includes a memory protocol controller (SC, 1122) configured to process the memory protocol data packet according to the first protocol. In another optional embodiment, the memory protocol controller is a DDR4 controller (1102, 904).
In another optional embodiment, the routing information is provided out-of-band from the data packet. In another optional embodiment, the routing information is provided in-band with the data packet. In another optional embodiment, the routing information is provided in the address field.
In another optional embodiment, the data packet includes at least an address field and a data field. In another optional embodiment, the data packet includes a data field, and the master controller (1104) generates an address field associated with the data packet. In another optional embodiment, the slave controller (1124) is configured to cause the memory protocol data packet to be processed according to the second protocol in response to detection of at least one re-route value in the address field. In another optional embodiment, the slave controller (1124) is configured to cause the memory protocol data packet to be processed according to the first protocol in response to detection in the address field of an address value not designated as a re-route value.
In another optional embodiment, the routing information is a memory address. In another optional embodiment, the routing information is a range of memory addresses. In another optional embodiment, the routing information is a command to configure a memory register in the slave controller (1124). For example, a command can be sent to the memory processing module (210) to set a register indicating that all subsequent communications are not reads or writes, but DLL communications, and should be handled as appropriate by the slave controller (XSC, 1124).
In another optional embodiment, the first protocol includes a physical layer protocol. In another optional embodiment, the second protocol is an Open Systems Interconnection (OSI) logical layer 2 protocol.
In another optional embodiment, further comprising a memory storage (602) operationally connected to the slave controller (1124), wherein the first protocol includes writing data from the data field to at least one memory address, from the address field, in the memory storage (602). In another optional embodiment, further comprising a processing subunit (612) operationally connected to the slave controller (1124). In another optional embodiment, the second protocol includes routing at least the data field to the processing subunit (612).
In another optional embodiment, the second protocol includes: parsing the data field to extract a frame (DLL frame, 1224), parsing the frame to identify a frame header and a frame payload, and executing a function based on the frame header.
In another optional embodiment, further including a processing subunit (612) operationally connected to the slave controller (1124), and the function includes sending at least the frame payload to the processing subunit (612).
In another optional embodiment, the routing information is a memory address, and the second protocol includes re-directing the data field associated with the memory address to a processing subunit operationally connected to the slave controller (1124).
In another optional embodiment, the first protocol includes handling the data section via a first data path and the second protocol includes handling the data section via a second data path. In another optional embodiment, the first data path includes writing the data section to a memory storage (602) and the second data path includes processing the data section by a processing subunit (212).
According to the teachings of the present embodiment there is provided a computational memory system comprising at least one computational memory module (610) including one or more processor subunits (UMIND, 612) and one or more memory banks (600) formed on a common substrate. A slave controller (XSC, 1124) is in operational communication with the processor subunits (612) and memory banks (600). The slave controller (1124) is configured to receive a memory protocol data packet including a data field, receive an address field associated with the data packet, receive routing information indicating whether the memory protocol data packet is to be processed according to a first protocol or according to a second protocol, and based on the routing information, selectively cause the memory protocol data packet to be processed according to the first protocol or according to the second protocol.
In an optional embodiment, further including: a master controller (XMC, 1104) configured to generate the memory protocol data packet, generate the address field, generate the routing information, and in operational communication with the slave controller.
According to the teachings of the present embodiment there is provided an information transfer system including: a first slave controller (SC, 1122), a first master controller (MC, 1102) configured for communications with the first slave controller (1122) using a pre-defined protocol, a second slave controller (XSC, 1124), and a second master controller (XMC, 1104) configured for communications with the second slave controller (1124) via the first master controller (1102) and the first slave controller (1122) using a higher-level protocol (DLL 1100). The higher-level protocol lies on top of the pre-defined protocol, the pre-defined protocol using a data packet including an indicator (for example, an address) and a data section (for example, data). If the indicator has a first value, the data section is processed according to a first procedure (for example, a conventional DDR write/read). If the indicator has a second value, or lacks the first value, the data section is processed according to a second procedure (for example, the DLL communications channel 1100).
In an optional embodiment, the pre-defined protocol is selected from the group consisting of: a memory transfer protocol, DDR3, and DDR4. In another optional embodiment, the pre-defined protocol includes a physical layer protocol. In another optional embodiment, the higher-level protocol is a data link layer (DLL) protocol.
In another optional embodiment, the given indicator is selected from the group consisting of: a codeword, a memory address, and a range of memory addresses. In another optional embodiment, the first procedure includes the first slave controller (1122) and the first master controller (1102) processing the data according to the pre-defined protocol (for example, DDR4). In another optional embodiment, the first slave controller (1122) and the first master controller (1102) implement the pre-defined protocol.
In another optional embodiment, further including a memory storage (202) operationally connected to the first slave controller (1122). In another optional embodiment, the first procedure includes parsing the data packet to extract a memory address and memory data and writing the memory data to the memory address in the memory storage (202).
In another optional embodiment, the second procedure includes the first slave controller (1122) passing the data to the second slave controller (1124) and the master controller (1102) passing the data to a second master controller (1104). In another optional embodiment, the second slave controller (1124) and the second master controller (1104) implement the higher-level protocol.
In another optional embodiment, further including a processing subunit (612) operationally connected to the second slave controller (1124). In another optional embodiment, the second procedure includes parsing the data section to extract a higher-level protocol frame (DLL frame, 1224), and sending the higher-level protocol frame (1224) to the processing subunit (612), the higher-level protocol frame (1224) including a frame header (1226-nnH) and a frame payload (1226-nnP).
In another optional embodiment, the second procedure includes: parsing the data section to extract a higher-level protocol frame (DLL frame, 1224), parsing the higher-level protocol frame (1224) to identify a frame header and a frame payload, and executing a function based on the frame header.
In another optional embodiment, further including a processing subunit (612) operationally connected to the second slave controller (1124), and the function includes sending the frame payload to the processing subunit (612).
In another optional embodiment, the indicator is a memory address, and the second procedure includes re-directing the data section associated with the memory address to a processing subunit (612) operationally connected to the second slave controller (1124).
In another optional embodiment, the first procedure includes handling the data section via a first data path and the second procedure includes handling the data section via a second data path.
In another optional embodiment, the first data path includes writing the data section to a memory storage (602) and the second data path includes processing the data section by a processing subunit (612).
In another optional embodiment, the first procedure includes processing the data section according to the predefined protocol, and the second procedure includes processing the data section other than by the predefined protocol.
According to the teachings of the present embodiment there is provided a computational memory system including: at least one computational memory chip including one or more processor subunits (UMIND, 612) and one or more memory banks (602) formed on a common substrate, a slave controller (XSC, 1124) in operational communication with the processor subunits and memory banks, and a master controller (XMC, 1104) configured to communicate data with the slave controller (1124). The slave controller (1124) processes the data, and if the data includes an indicator having a first value then the slave controller (1124) executes a first procedure including sending the data to one or more of the memory banks. If the indicator has a second value, or lacks the first value, the data section is processed according to a second procedure (for example, the DLL communications channel 1100).
According to the teachings of the present embodiment there is provided a data packet structure for use in transferring information between a master controller (1104) within a computer system and a slave controller (1124) within a memory processing module (610), the data packet structure including: an indicator and a data section. If the indicator has a first value the data section includes data for a predefined protocol, and if the indicator has a second value the data section is processed to extract from the data section a higher-level protocol frame header and frame payload.
Refer now to
For simplicity, the current figure uses one of the master controllers 622, one of the DDR controllers 608, and eight of the MPMs (610-0 to 610-7). The master controller 622 may access the MPMs 610 directly. The DDR controller 608 may access the MPMs 610 directly. Alternatively, the master controller 622 may access the MPMs indirectly, for example via a legacy controller such as the DDR controller 608.
In a typical case where a number of MPMs 610 are configured in a chip 624, and a number of chips are configured into a DIMM 626, the DIMM 626 may communicate with the controller domain 630, which in turn communicates with the higher-level domain 632.
There may be provided a method for communicating with MPMs 610 using a legacy MPM unaware memory controller (for example, the DDR controller 608) and/or an MPM memory controller (for example, the master controller, XMC 622).
The MPM controller 622 may communicate with the MPM 610 using a memory protocol and/or by using other means such as by using dedicated communication paths, by using a communication protocol that may be used to communicate with the computational resources (for example, the processing module 612), by using a data link layer communication channel 1100, by embedding or otherwise utilizing memory protocols for communicating with the processing resources (for example, the processing module 612) of the MPM 610, by using a communication protocol that may be used for communicating with both computational and memory resources of the MPM 610, by utilizing one or more dedicated ports or resources of the MPM 610 for communication, and the like.
In a case where multiple controllers and/or multiple levels of controllers are used, the controller domain 630 is at least a portion of a multi-layered module domain 1800. In this case, a module that is closest to the first controller is the initial module 1802, other modules are termed additional modules 1804. The modules that are closest to the processor subunits are termed last additional modules 1806. In the current figure, the initial module is the master controller 622, the additional modules are 1804-A, 1804-B, 1804-C, 1804-D, 1804-E, and 1804-F, and the last modules are 1804-E, and 1804-F. An example of an additional module is the H4 controller 1016.
There may be two, three, or more layers of modules. A module in the multi-layered module domain may differ from another module. For example, a module of one domain may differ from a module of another domain by traffic management resources and/or processing resources, and the like. The modules may be arranged in any manner that differs from a multi-layered domain. There may be any number of modules per layer, including more than one initial module, one or more than two intermediate additional modules, more or fewer than eight last additional modules, and the like. One or more initial modules 1802 may be coupled to zero, one, or a plurality of additional (1804, 1806) modules. In a case where there are three or more levels of modules, the additional modules between the initial module(s) 1802 and the last module(s) 1806 are also referred to as intermediate modules. The intermediate modules are additional modules 1804 other than the last modules 1806. In the current figure, the intermediate modules are 1804-A, 1804-B, 1804-C, and 1804-D.
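As one way to picture the multi-layered module domain 1800, the sketch below models initial, intermediate, and last modules as a simple tree and routes a command from the initial module toward a target MPM. The topology only loosely follows the figure, and the class and method names are hypothetical illustrations, not the disclosed controller design.

```python
# Hypothetical tree model of a multi-layered module domain 1800:
# an initial module (e.g., master controller 622) feeds intermediate modules,
# which feed last modules, which are coupled to MPMs 610.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Module:
    name: str
    children: List["Module"] = field(default_factory=list)
    mpm_ids: List[int] = field(default_factory=list)  # MPMs served at the last layer

    def route(self, target_mpm: int, command: bytes) -> bool:
        """Forward a command toward the module that serves target_mpm."""
        if target_mpm in self.mpm_ids:
            # A last module would deliver the command to the MPM here.
            print(f"{self.name} delivers command to MPM {target_mpm}")
            return True
        return any(child.route(target_mpm, command) for child in self.children)


# Example topology loosely following the figure: one initial module,
# two intermediate modules, and last modules 1804-E / 1804-F serving MPMs 610-0..7.
last_e = Module("1804-E", mpm_ids=[0, 1, 2, 3])
last_f = Module("1804-F", mpm_ids=[4, 5, 6, 7])
intermediate = [Module(n, children=[last_e, last_f]) for n in ("1804-A", "1804-B")]
initial = Module("master controller 622", children=intermediate)

initial.route(5, b"\x01")   # routed via an intermediate module to 1804-F
```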
There may be provided more than one first controller (initial module 1802), the first controller may be merged or integrated with any of the modules, and the like. There may be any connectivity between modules, any relationship of modules of different layers, and the like. The first controller may interface with the exterior of the MPM 610. There may be any connectivity between the first controller, one or more modules, the processor subunits, and memory banks. For example—one or more memory banks may be dedicated to and coupled via one or more buses to each processor subunit. Processor subunits may be coupled to each other.
One or more elements, such as MPMs 610, may be coupled to a given controller module, and in turn one or more of the given controller modules may be coupled to a given additional module.
Examples of processor subunit, controllers, and memory banks are illustrated in PCT patent application publication WO2019025862, and/or PCT patent application PCT/IB2019/001005.
The processor subunit and/or the first controller and/or the modules may be configured and/or programmed in any one of the following manners, for example: in advance (for example, during a boot process or after the boot process), once per time period, in response to an event, in bursts, not in runtime, in runtime, by a host entity, and the like. Examples of non-runtime programming are illustrated in PCT patent application publication WO2019025862 and PCT patent application PCT/IB2019/001005. For example, the processor subunit and/or the first controller and/or any module may be configured and/or programmed by receiving one or more commands and/or related data (for example, a neural network module) that should be executed during a certain time period, or may be programmed one command after the other.
Any memory bank and/or any one of the first controller and/or any module may receive and store commands and/or related data aimed at another entity (for example, the processor subunit) and then send the commands and/or related data to that entity. The first controller may receive and store commands and/or related data for any module. The first module may receive and store one or more commands and/or related data for the second modules.
Refer to other portions of this description where there are provided methods for communicating between the memory banks, processor subunits, modules, the first controller, and other entities such as the memory controller.
Refer now to
Refer now to
During operation of a computer system, various processors and/or system components may be configured (e.g., using commands such as CONFIG, etc.) to prepare the processors and/or other components for performing tasks or functions. In conventional systems, there may be just one processor or a small number of processors and/or system components to be configured. In such cases, the configuration process may be completed quickly and without requiring significant computational resources or time.
In the case of a computational memory system, as disclosed herein, however, each included computational memory chip (e.g., XRAM 624) may include many processor subunits 612 that may be individually configured. For example, in the example of
In some cases, configurations can be time sensitive, so delays in the configuration process may have negative consequences including, for example, reductions (some of which may be significant and even debilitating) in system throughput.
Each of the processor subunits 612 may be configured to operate on data stored in one or more corresponding, dedicated memory banks 600. For example, a processor subunit 612 may include one or more arithmetic logic units (ALU) to operate on stored data. Each processor subunit 612 may also be associated with one or more memory controllers and various other supporting components.
Described in the sections below are techniques for configuring the processor subunits included in one or more memory processing modules 2150-2164 (or 610) of the described computational memory systems. While the disclosed techniques are described relative to the configuration of processor subunits included in the disclosed computational memory systems, the disclosed techniques may be used to configure other processor-based systems, especially computational systems including two or more processors. The disclosed techniques are targeted toward increasing efficiency of a configuration process in multi-processor systems. In some cases, the disclosed techniques may significantly reduce bandwidth consumption relative to one or more communication links between a host computer 710 and configurable processors, such as processor subunits 612. As a result, configuration time and configuration overhead may be reduced, leading to improved system processing performance.
At a general level, the disclosed configuration techniques may include a mapping between higher level configuration information (e.g., provided by host computer 710) and lower level configuration information that can be used to configure individual processor subunits 612. As described in detail below, one or more controllers, such as master controller 622, can receive the higher level configuration information from the host 710 and convert (or assist in the conversion of) that higher level configuration information to lower level configuration information, which can then be used to configure the processor subunits. In such a case, the master controller 622 may relieve the host 710 of a significant portion of the configuration burden. The host 710 may still control the configuration process associated with the processor subunits 612, but that control can occur at a configuration function level. The master controller 622 and other components may provide a bridge between the configuration function level and the lower level configuration information used to accomplish the configuration of individual processor subunits 612. For example, the master controller 622 may be equipped to modify configurations according to run time values from previous configurations, thus avoiding back and forth communications between the master controller 622 and the host 710 between configurations.
In operation, master controller 2120 may be configured to receive a configuration function from host 2110 (e.g., a host CPU, GPU, etc.) and convert the received configuration function into one or more lower level configuration functions for use in configuring any of the processor subunits associated with MPMs 2150-2164. The high level configuration function supplied by host 2110 may have various forms and/or formats. For example, such configuration functions may include one or more CONFIG functions, associated data, etc.
Master controller 2120 may use the one or more lower level configuration functions to configure (or to cause configuration of) one or more processor subunits 612 included among the MPMs 2150-2164, which may be included on one or more computational memory chips 624. Configuring the processor subunits 612 of one or more computational memory chips 624 in this way (e.g., by pre-loading the processor subunits with lower level configuration functions/information) may prepare the processor subunits to perform a function associated with the one or more lower level configuration functions.
Host 2110 may communicate the configuration function to the master controller 2120 at any suitable time. In some cases, host 2110 may generate a configuration function to be sent to the master controller (XMC) 2120 during an initialization process. Host 2110 may provide a different or updated configuration function at other times as well, including run-time, etc. For example, the processor subunits 612 may be configured according to the lower level configuration information during a boot process, after a boot process, at particular time intervals, in response to a triggering event, during burst periods, during runtime, etc. Examples of non-runtime programming are illustrated in PCT patent application publication WO2019025862 and PCT patent application PCT/IB2019/001005. Each processor subunit 612 may also be configured and/or programmed by receiving one or more commands and related data (for example, a neural network module) to be executed during a certain time period, or in response to receiving one or more commands to be programmed one command after another in series. A response of a processing subunit to a command may trigger a new command, producing a continuous stream of commands in which one set of commands causes a second set of commands (for example, in a hash join operation).
Host 2110 may communicate configuration functions directly to one or more components associated with the master controller 2120 or may provide the configuration functions via an intermediary device (e.g., a FIFO buffer) or another interface. The configuration functions supplied by the host 2110 may be associated with one or more applications running or configured to run on host 2110.
Master controller 2120 may be provided between a user application—or any other higher-level entity associated with host 2110—and the MPMs 2150-2164. In some cases, one or more additional controllers, such as a DDR4 controller 2130 and/or RCD controller 2140 may be provided between master controller 2120 and the MPMs.
The master controller may include various components to assist in converting the higher level configuration functions to lower level configuration information or functions. For example, in some cases, the master controller 2120 may include one or more configuration accelerators 2122 that participate in the conversion of configuration functions to low level configuration information. Such accelerators may be configured to operate in parallel to increase configuration throughput. In some cases, the accelerators may be allocated to particular MPMs among MPMs 2150-2164, particular computational memory chips 624, etc. The allocation may remain fixed or may change over time. Load balancing may be employed to balance the workload of the different configuration accelerators.
In addition to managing the conversion of the higher level configuration functions to lower level configuration information, master controller 2120 may also be configured to manage communications traffic among various components of the computational memory system. Such traffic may include communications between master controller 2120 and any of the DDR4 controller 2130, RCD 2140, MPMs 2150-2164. The controlled traffic may also include communications between processor subunits 612 and their dedicated memory banks 600. The controlled communications may include, e.g., data traffic (from an arithmetic logic unit (ALU)), register file reads, interrupts, etc.
Master controller 2120 may configure the processor subunits 612 in any suitable manner. In some cases, all processor subunits 612 in a computational memory system (e.g., across XRAM chips, XDIMMs, XIMPUs, etc.) may be similarly configured. In other cases, the processor subunits 612 associated with any component of the computational memory system (e.g., XIMPU, XDIMM, XRAM, etc.) may be configured in common, while processor subunits associated with other components may be configured differently. Master controller 2120 may selectively configure any individual processor subunit 612, such that even processor subunits onboard the same XRAM chip 624 may share a common configuration or may be configured differently.
Master controller 2120 may also include or have access to a configuration information table (CIT) 2220, a configuration logic module 2230, and a states data structure 2240. In some examples, the configuration function accelerators 2122, the configuration information table 2220, the configuration logic module 2230, and the states data structure 2240 may be integrated together. In other cases, one or more of these components may be provided as a separate component.
The one or more microcontrollers 2124 may be configured to receive configuration functions from the host computer and provide those configuration functions to the configuration accelerators 2122 via a configuration functions input 2210. The received instructions may be queued in a conditional function queue 2212 or a non-conditional function queue 2214. Functions included in the queues may be retrieved (e.g., by configuration logic module 2230) for conversion to lower level configuration information on a FIFO basis, for example.
Lower level configuration functions may be associated with one or more tasks or functions running (or to be run) on the host CPU. For example, as part of running a particular application or preparing to run a particular application on the host CPU, the host CPU may generate one or more higher level configuration functions associated with one or more general tasks to be completed by a computational memory system. As noted, master controller 2120 may convert the configuration functions from the host CPU into lower level configuration information/functions that prepare the processor subunits 612 of the computational memory system to perform one or more low level tasks aimed toward completion of the one or more general tasks requested by the host CPU. In some embodiments, configuring one or more MPMs 2150-2164 may include preparing one or more of the plurality of MPMs to perform a function associated with the one or more lower level configuration information/functions. The master controller 2120 may be adapted to configure different MPMs with different lower level configuration functions. Furthermore, the master controller 2120 may be adapted to configure different MPMs with similar lower level configuration functions to be executed at different times.
Conversion of the higher level configuration functions issued by the host to lower level configuration information/functions may be performed using any suitable technique. In some cases, one or more maps linking the higher level configuration functions to the lower level configuration functions may be employed. Configuration information table 2220 may provide such a mapping between the higher level and lower level configuration functions. For example, associated with various higher level configuration functions expected to be generated by the host CPU, configuration information table 2220 may store corresponding low-level configuration functions/information. Such low-level configuration information/functions may include configuration commands. In addition to specifying a command to be performed and including an operand to be used in executing the command, the configuration commands may also identify a particular destination for the command. For example, a configuration command may identify a particular MPM 610, a particular processor subunit 612, and/or a higher level entity such as an XDIMM, XIMPU, etc. The configuration commands may also specify conditions that must be met before a process can proceed to a next step (e.g., execution of a subsequent command). The configuration commands may include a parameter override, which indicates how to override the configuration command with kernel input parameters.
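A possible shape for the entries described above is sketched below, assuming a hypothetical Python representation; the field names (destination, opcode, operand, condition, parameter override) follow the description, while the concrete types and the example mapping are illustrative assumptions.

```python
# Hypothetical sketch of a configuration information table (CIT 2220) entry.
# Field names follow the description above; types and values are illustrative.
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ConfigCommand:
    destination: str                 # e.g., "MPM3", "subunit 612-17", "XDIMM0"
    opcode: str                      # low-level command to perform
    operand: int                     # operand used when executing the command
    condition: Optional[str] = None  # condition to meet before the next step
    parameter_override: Optional[str] = None  # how kernel inputs override the command


# Mapping from a higher level configuration function (as issued by the host)
# to the low-level configuration commands stored in the CIT.
configuration_information_table: Dict[str, List[ConfigCommand]] = {
    "CONFIG_A": [
        ConfigCommand("MPM0", "LOAD_KERNEL", operand=0x10),
        ConfigCommand("MPM0.subunit0", "SET_REGISTER", operand=0x3,
                      condition="kernel_loaded"),
    ],
}
```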
In the example of
The configuration logic module 2230 may also calculate addresses for fetching low-level configuration information from the configuration information table (CIT) 2220. Such addresses may represent or may be based on a mapping between host configuration functions and low-level configuration information. The mapping between a host configuration function and the corresponding low-level configuration information may include a parametrized mapping (using parameters) that allows changes to be made to the configuration function mappings. The host, for example, may update the configuration mappings from time to time.
The configuration logic module 2230 may also include parameter override logic. In addition to converting higher level to lower level configuration functions, configuration logic 2230 may also access and/or manage a states data structure 2240 that may keep track of a current CIT address and of parameters loaded once when the kernel is loaded. Such parameters may be used both for conditions and for the parameter override logic.
After the configuration information (e.g., from CIT 2220) is retrieved, the host computer may manage the configuration of the overall system at a configuration function level (e.g., by issuing general configuration functions and allowing the master controller 2120 to implement those functions using appropriate lower level configuration functions/information drawn from CIT 2220). The host computer may cause dynamic changes in the configuration of one or more MPMs 2150-2164 by sending configuration functions to the master controller, which in turn converts each configuration function to the low level configuration functions provided to the one or more MPMs. The low level configuration functions may be executed by the MPMs during runtime.
A configuration function may or may not allow an execution of an independently executable kernel. A kernel is a single execution context and is composed of one or more configurations. A configuration function may allow an execution of only a part of such a kernel. Different kernels may be slightly different from one another and may require slightly different configurations. The difference may be represented by slight low-level configuration differences. In some cases, however, for some different kernels, low-level configuration information may be re-used.
Each low level configuration function may be stored at a specific address within CIT 2220. In response to a received (or fetched) configuration function, the configuration logic module 2230 may be configured to calculate one or more addresses included in the CIT 2220 based on the configuration function received from a host CPU. For example, a configuration function “A” may be associated with an application running on the host CPU. The configuration function A may be stored at an address “0000” of CIT 2220. During runtime, the configuration logic module 2230 may use an address calculator to calculate the address “0000” based on the fetched host configuration function. With the calculated address, the configuration logic can retrieve the configuration function “A” from CIT 2220, and the retrieved function can be sent to one or more appropriate MPMs 2150-2164.
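The address calculation and retrieval flow described above can be illustrated as follows. This is a minimal sketch that assumes the hypothetical CIT representation from the earlier sketch, a toy address calculator, and a stubbed transport; it is not the disclosed configuration logic.

```python
# Illustrative runtime flow of the configuration logic module 2230:
# map a fetched host configuration function to a CIT address, retrieve the
# low-level configuration commands, and dispatch them to their destinations.
# The address table and the send_to_mpm() helper are assumptions.

def calculate_cit_address(configuration_function: str) -> int:
    # Toy address calculator: a real implementation would use a parametrized
    # mapping between host configuration functions and CIT addresses.
    table = {"A": 0x0000, "B": 0x0040, "G": 0x0080}
    return table[configuration_function]


def configure(configuration_function: str, cit: dict, send_to_mpm) -> None:
    address = calculate_cit_address(configuration_function)
    low_level_commands = cit[address]      # fetch low-level configuration info
    for command in low_level_commands:
        send_to_mpm(command.destination, command)


# Example usage with a stubbed transport:
# configure("A", cit={0x0000: [...]}, send_to_mpm=lambda dest, cmd: None)
```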
In some cases, as noted, in addition to a configuration function address, CIT 2220 may also store one or more computational memory chip identifiers, function operands, conditions for executing a next configuration, and/or parameter override logic. The information stored in CIT 2220 may be used before or during execution of one or more configuration functions. The host CPU 2110 may be configured to update information stored in the CIT 2220.
As noted above, master controller 2120 may be configured to communicate the lower level configuration functions/information to XRAM chips 624, MPMs 2150-2164, or to processor subunits 612. In some cases, one or more additional controllers or components may be used to facilitate the configuration process. For example, a DDR4 controller 2130/608 or other type of legacy controller may serve one or more roles in the configuration process, such as serving as an intermediary for communications between the master controller 2120 and one or more slave controllers, such as slave controllers 613 or 623. Each MPM may include a slave controller 613 and a slave controller 623. Alternatively, a single slave controller 613 and a single slave controller 623 may be associated with each XRAM chip 624, to be shared by the MPMs disposed on the XRAM chip. In other cases, some MPMs on an XRAM chip may include dedicated slave controllers 613 and/or 623, while other MPMs on the same XRAM chip share slave controllers 613 and/or 623.
There may be provided a method for virtually opening a port, such as a DLL port for communications with the processor subunits via a communication stack that includes a physical layer and a data link layer. The data link layer may be used to send frames of different types including but not limited to frame control frames, information frames, unicast frames, multicast frames, broadcast frames and the like. In some cases, DDR4 controller 2130 may be responsible for the physical layer, while the master controller 2120 may be responsible for the data link layer. In some cases, the DDR4 controller 2130 may provide configuration information or functions to any of the MPMs using the DDR4 memory protocol (in a sense, treating the MPMs as memory banks).
More specifically, it may be beneficial to access processor subunits using a redirection scheme in which one or more redirect addresses are allocated in each memory bank, and packets aimed at any of the redirect addresses are not written to the memory bank, but rather are written to a virtual port (for example, a DLL port) that is implemented by hardware in the processor subunit domain, such as a FIFO. The hardware may include more than a single FIFO, may be dedicated to one of the processor subunits, may be shared between two or more processor subunits, or may be located outside any of the processor subunits, etc.
Thus, while the DDR4 controller treats an access request to the FIFO as an access request to a memory bank (and applies all the timing constraints related to writing to a memory bank to such an access), the slave controller redirects the access request to the processing subunit. In some cases, it may be more effective to perform the access while virtually toggling between memory banks, as the latency associated with DDR4 writing to different banks is lower than the latency associated with writing content to different lines of the same memory bank. The processor subunits and/or components of the processor subunits (for example, registers) can be associated with addresses that allow efficient access to the processor subunits and/or components of the processor subunits.
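The redirect-to-FIFO behavior can be pictured with the small sketch below; it assumes a hypothetical set of redirect addresses and a simple queue standing in for the hardware FIFO in the processor subunit domain, and is not the disclosed hardware.

```python
# Hypothetical model of redirect addresses backed by a FIFO virtual port.
# REDIRECT_ADDRESSES and the queue-based FIFO are illustrative assumptions.
from collections import deque

REDIRECT_ADDRESSES = {0x3FF, 0x7FF}       # assumed redirect addresses, one per bank


class RedirectingSlaveController:
    def __init__(self):
        self.memory = {}                  # stands in for the memory bank array
        self.virtual_port_fifo = deque()  # hardware FIFO in the subunit domain

    def write(self, address: int, data: bytes) -> None:
        if address in REDIRECT_ADDRESSES:
            # Not written to the memory bank: queued for the processor subunit.
            self.virtual_port_fifo.append(data)
        else:
            self.memory[address] = data

    def drain_virtual_port(self):
        # The processor subunit (or its DLL handler) consumes queued packets.
        while self.virtual_port_fifo:
            yield self.virtual_port_fifo.popleft()
```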
The processing subunits may manage different types of communication and/or use different types of frames. The different types of communication may include data, responses to a request to access a register, flow control traffic, and interrupts. The different types of communications may be managed by using different types of arbitrators, controllers, and/or different queues to provide a communication solution. A data link layer builder may generate and receive data link layer data structures, such as frames, and may operate with elements such as an interrupt queue, registers that can be read, an ALU, an interrupt manager, an ALU queue, a credit unit for managing credits that may be used to control the traffic in the MPU, a read response queue, and a processor subunit.
In some cases, as noted above, a configuration function may be a conditional function. To handle such cases, the master controller 2120 may include a configuration logic module 2230 configured to fetch the conditional function from the conditional function queue 2212 after completion of a process upon which the conditional function depends. For example, a process may depend on a predetermined system event (e.g., a hardware or software trigger or an interrupt), and the conditional configuration function may be fetched after detection of the system event. One or more indicators of the presence of an event, event condition, completion event, etc. may be provided to the configuration logic module 2230 by events input 2211. The configuration logic 2230 may also be configured to fetch the conditional function from the conditional function queue 2212 after completion of a task upon which the conditional function depends.
In some embodiments, a configuration function may include a non-conditional function and may be stored in the non-conditional function queue 2214. Non-conditional configuration functions may be used to configure one or more MPMs without conditions, such as dependence on the completion of one or more processes.
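The two queues and the event-gated fetch can be sketched as follows, assuming a hypothetical Python model in which observed events are collected in a set; the queue item structure and event names are illustrative assumptions.

```python
# Illustrative model of the conditional function queue 2212 and the
# non-conditional function queue 2214 feeding the configuration logic 2230.
# Event names and the structure of queued items are assumptions.
from collections import deque

non_conditional_queue = deque()   # fetched FIFO, no preconditions
conditional_queue = deque()       # items are (required_event, config_function)
observed_events = set()           # populated from events input 2211


def fetch_next_configuration_function():
    """Return the next function ready for conversion, or None."""
    # A conditional function is fetched only after the event it depends on.
    if conditional_queue and conditional_queue[0][0] in observed_events:
        _, function = conditional_queue.popleft()
        return function
    if non_conditional_queue:
        return non_conditional_queue.popleft()
    return None


# Example usage:
non_conditional_queue.append("CONFIG_A")
conditional_queue.append(("MPM1_done", "CONFIG_B"))
observed_events.add("MPM1_done")
```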
Master controller 2120 may also control the timing by which one or more processor subunits are configured. Such timing may be influenced, at least in part, by whether conditional configuration functions are required to enable one or more of the processor subunits to perform a particular task.
During a second phase 2303, an MPM0 is configured by configuration function F after receiving output(MPM1,H) and then executes a related process to provide an output (i.e., Output(MPM0,F)) related to function F (2308). Also in second phase 2303, an MPM2 is configured by configuration function H after receiving output(MPM2,F) and then executes a related process to provide an output (i.e., Output(MPM2,H)) related to function H (2310).
During a third phase 2305, an MPM5 is configured by configuration function G after receiving output(MPM0,F) and output(MPM2,H) and then executes a related process to provide an output (i.e., Output(MPM5,G)) related to function G (2312).
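The phased scheme in which one MPM's output gates the configuration of another can be illustrated with a small dependency-driven loop. The function and MPM names follow the example above, while the scheduling code and the configure_and_run() callback are hypothetical assumptions.

```python
# Illustrative dependency-gated configuration following the phases above:
# MPM0 is configured with function F only after output(MPM1, H) is available,
# MPM2 with H after output(MPM2, F), and MPM5 with G after both outputs exist.

steps = [
    # (target MPM, configuration function, required outputs)
    ("MPM0", "F", {("MPM1", "H")}),
    ("MPM2", "H", {("MPM2", "F")}),
    ("MPM5", "G", {("MPM0", "F"), ("MPM2", "H")}),
]

# Outputs assumed to have been produced in an earlier phase.
available_outputs = {("MPM1", "H"), ("MPM2", "F")}


def run_phases(steps, available_outputs, configure_and_run):
    pending = list(steps)
    while pending:
        ready = [s for s in pending if s[2] <= available_outputs]
        if not ready:
            break                      # remaining steps still wait on outputs
        for target, function, requirements in ready:
            configure_and_run(target, function)        # configure, then execute
            available_outputs.add((target, function))  # output(target, function)
            pending.remove((target, function, requirements))
```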
In some cases, rather than individually configuring one or more processor subunits, master controller 2120 may be configured to multicast a configuration function or information to multiple processor subunits, such that all of the multiple processor subunits are similarly configured. The multiple processor subunits may include any group of processor subunits in a single XRAM chip or spanning across multiple XRAM chips within a computational memory system including multiple XIPHOS memory appliances. In such cases, master controller 2120 can receive a configuration function generated by a host CPU and, in turn, multicast the configuration function to two or more of the plurality of processor subunits. For example, as shown in
In some cases, the multicast configuration functions may be converted to one or more low level configuration functions prior to multicasting. In other cases, however, it is the configuration function issued by the host that is multicast without conversion. Such multicasting can still relieve the host of a significant configuration burden, especially when a large number of processor subunits (e.g., 10, 100, 1,000, 10,000, etc.) is involved.
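A minimal sketch of multicasting a single configuration function to a group of processor subunits follows; the group definition and the send() transport are hypothetical assumptions and stand in for whichever addressing and transport mechanism is used.

```python
# Illustrative multicast of one configuration function to many processor
# subunits; the group definition and the send() transport are assumptions.

def multicast_configuration(configuration_function, subunit_group, send):
    """Send the same configuration function to every subunit in the group."""
    for subunit_id in subunit_group:
        send(subunit_id, configuration_function)


# Example: configure subunits 0..63 of an XRAM chip identically.
subunit_group = range(64)
multicast_configuration("CONFIG_A", subunit_group,
                        send=lambda sid, fn: None)   # stub transport
```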
The multicasting master controller 2120 may be located onboard a computational memory chip (e.g., XRAM chip 624) or may be located external to the computational memory chips. Additionally, the multicasting master controller 2120 may include a field programmable gate array and may include one or more sub-controllers, such as a DDR controller.
In some cases, a multicasting configuration system may include at least one configuration processor adapted to configure at least a first processor subunit among a plurality of processor subunits. This configuration may be based on an output generated by at least a second processor subunit among the plurality of processor subunits. The at least one configuration processor may be located onboard a computational memory chip or may be located external to the at least one computational memory chip.
In some embodiments, the various systems described herein may be implemented in association with computational memory operations performed on feature vectors. As used herein, a feature vector may include a series of values (e.g., numbers) representative of an object, condition, state, etc. In some embodiments, a plurality of feature vectors may be represented in an embedding table. For example, in machine learning or other applications, an embedding table of feature vectors may be used to reduce the dimensionality of a sparse vector. As used herein, a sparse vector may refer to any vector for representing data that includes at least one zero-value entry. In some embodiments, the sparse vector may be represented by a sparse vector indicator, which may include any information used to represent a sparse vector. These sparse vectors typically store data in an inefficient manner as a large number of zero-value elements are often required to represent data. As a result, performing operations using these sparse vectors can require significant computational bandwidth due to their size. In some embodiments, an embedding table, which may include a plurality of feature vectors, may be used to reduce the dimensionality of a sparse vector. To perform this reduction, relevant feature vectors may be summed to generate an output feature vector, which may represent information from the sparse vector in a more efficient manner.
Using conventional systems and methods, various inefficiencies may exist associated with dimensionality reduction operations due to the configuration of processing and memory storage elements. For example, depending on the application, embedding tables may include a relatively large number of elements, which may need to be stored and accessed by a processor. Accordingly, to perform the dimensionality reduction operations described above, intensive summing operations may be required. The bandwidth limitations between the memory device storing the embedding table and the processing device performing the operations may therefore limit the speed and efficiency when performing the dimensionality reductions.
The disclosed embodiments may resolve these and other technical problems by storing embedding tables in memory banks of one or more memory processing modules (MPMs), as described above. Each MPM may include multiple memory banks and multiple processor subunits, as well as one or more controllers and/or modules to control the MPM, manage traffic to and from the MPM, and manage traffic to and from the memory banks and/or the multiple processor sub-units. In some embodiments, the processor subunits, the controllers, and/or the modules may be arranged in a hierarchical manner such that the processor subunits have the most processing power and may perform more significant bandwidth reduction. For example, the processing subunits may be configured to perform reduction operations between parts of feature vectors and/or between results of processing feature vectors. In some embodiments, it may further be beneficial to store the same segments of different feature vectors of an embedding table in a sequential manner in one or more memory banks. Accordingly, processor subunits may be assigned computing operations associated with specific subsets of feature vectors, and the relevant subsets of feature vector elements may be stored in memory banks associated with those subunits.
Embedding table 2620 may include a plurality of feature vectors, such as feature vectors 2622 and 2624. In this example, embedding table 2620 may include 6 feature vectors, each having 3 elements. In the current example, the feature vector 2622 corresponds to “male” and has an exemplary value of [4, 5, 2]. The feature vector 2624 corresponds to “Israel” and has a value of [4, 9, 9]. These feature vectors may be used to reduce the dimensionality of sparse vector 2610 and may be generated in various ways. For example, a machine learning model may be trained to recognize inputs in the form of sparse vectors and output representative feature vectors, such as those included in embedding table 2620. The feature vectors included in an embedding table may then be used, for example, to reduce the dimensionality of an input sparse vector (e.g., sparse vector 2610) by summing feature vectors from the embedding table implicated by non-zero values in the sparse vector.
In the example of
The vector summation may be performed on an element-by-element basis such that output vector 2630 has the same dimensionality as feature vectors 2622 and 2624. In particular, the first element of feature vector 2622 may be added to the first element of feature vector 2624 to generate the first element of output vector 2630 (in this example, adding 4+4 to get a first element of 8), and so on. Accordingly, the resulting output vector 2630 may be a representation of sparse vector 2610 with a reduced dimensionality. Output vector 2630 may then be used to perform calculations more efficiently than if sparse vector 2610 were used.
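The element-wise summation in the worked example above can be written compactly. The sketch below uses plain Python lists and the example values from the description (feature vectors [4, 5, 2] and [4, 9, 9]); the placement of these vectors within the table, the remaining table values, the sparse vector contents, and the helper names are illustrative assumptions.

```python
# Illustrative dimensionality reduction: sum the feature vectors of an
# embedding table that are implicated by the non-zero entries of a sparse
# vector. Values follow the worked example above; other values are assumed.

embedding_table = [
    [4, 5, 2],   # feature vector 2622 ("male"), per the example above
    [1, 7, 3],
    [2, 2, 8],
    [4, 9, 9],   # feature vector 2624 ("Israel"), per the example above
    [6, 1, 5],
    [3, 3, 3],
]

sparse_vector = [1, 0, 0, 1, 0, 0]   # non-zero entries implicate rows 0 and 3


def reduce_dimensionality(sparse_vector, embedding_table):
    implicated = [row for value, row in zip(sparse_vector, embedding_table) if value]
    # Element-by-element summation of the implicated feature vectors.
    return [sum(elements) for elements in zip(*implicated)]


output_vector = reduce_dimensionality(sparse_vector, embedding_table)
# output_vector == [8, 14, 11]  (first element: 4 + 4 = 8, as in the example)
```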
It is to be understood that the dimensionality reduction technique illustrated in
The disclosed embodiments may provide, among other benefits, improved speed, efficiency, and power consumption for performing dimensionality reductions or other operations involving stored feature vectors. For example, a computational memory chip may be used to store and process one or more feature vector elements of an embedding table.
As noted above, a sparse vector indicator may include any information used to represent a sparse vector. In some embodiments, the sparse vector indicator may be a sparse vector. For example, the sparse vector indicator may be a vector including one or more zero values and one or more non-zero values, similar to sparse vector 2610. Alternatively or additionally, the sparse vector indicator may be another form of data used to represent a sparse vector. For example, the sparse vector indicator may be a set of indices corresponding to non-zero values in an associated sparse vector. Using the example from
Computational memory chip 2700 may include one or more processor subunits and one or more memory banks, which may be formed on a common substrate. For example, computational memory chip 2700 may include various processor subunits 2712a, 2712b, and 2712c, and various memory banks 2714a, 2714b, and 2714c, as shown in
In some embodiments, computational memory chip 2700 may include a controller 2720 (which may correspond to slave controller 613 or various other controllers described herein). Controller 2720 may be configured to facilitate communications between computational memory chip 2700 and host 2730, between computational memory chip 2700 and other computational memory chips (e.g., within one or more dual in-line memory modules, such as DIMM 626 described above), between various processing subunits, or various other elements described herein. In some embodiments, controller 2720 may be configured to receive the sparse vector indication (or a portion of a sparse vector indication) and facilitate processing of the sparse vector indicator by the various processing subunits, such as processing subunits 2712a, 2712b, and 2712c. Further, controller 2720 may be configured to acquire one or more generated vector sums and provide the acquired one or more generated vector sums to the host external to the computational memory chip. While controller 2720 is shown as being included in computational memory chip 2710, it is to be understood that various other arrangements may be used, including those described above. For example, the disclosed systems may include a plurality of computational memory chips configured to generate the one or more vector sums. Accordingly, the system may include one or more controllers (e.g., master controller 622 and/or DDR controller 608) configured to perform the various operations described above with respect to controller 2720 across multiple computational memory chips. Computational memory chip 2710 may further include an interface 2722, which may include any structure allowing transfer of data and other information among processor subunits and/or between processor subunits and one or more controllers.
Computational memory chip 2710 may be configured to generate one or more vector sums based on a stored embedding table. Accordingly, computational memory chip 2710 may store one or more portions of an embedding table in the one or more memory banks (e.g., memory banks 2714a, 2714b, 2714c, etc.). In some embodiments, two or more computational memory chips 2710 may be used to generate vector sums. Accordingly, the embedding table may be stored across a plurality of memory banks included in each of the two or more computational memory chips. The one or more processor subunits (e.g., processor subunits 2712a, 2712b, 2712c, etc.) may be configured to receive the sparse vector indicator from the host external to the at least one computational memory chip and, based on the received sparse vector indicator and the one or more portions of the embedding table, generate one or more vector sums.
As used herein, a vector sum may include any result of one or more summation operations performed with respect to elements of one or more feature vectors. In some embodiments, the vector sums may be a complete result of a dimensionality reduction of a sparse vector. For example, a vector sum may correspond to an output vector, such as output vector 2630 described above. Accordingly, the one or more vector sums may constitute a complete sum of full feature vectors included in the embedding table. The full feature vectors summed together may be identified in the embedding table based on the sparse vector indicator, as described above with respect to
In some embodiments, the one or more vector sums may refer to portions of summations, such as partial sums of full feature vectors, complete sums of partial feature vectors, partial sums of partial feature vectors, or any other partial summation, which may be combined or further summed with other vector sums to form output vector 2630. This additional combination or summation may be performed by host 2730 or may be performed by one or more controllers prior to providing the vector sums to host 2730.
In embodiments where the vector sums constitute partial sums of full feature vectors included in the embedding table (e.g., a summation of all elements of only a portion of the feature vectors), the full feature vector sums may need to be added to other full feature vector sums. For example, if a sparse vector indicator indicates that four feature vector sums should be added together, a vector sum may include sums of the entire first two feature vectors, which may need to be added together with a sum of the entire second two feature vectors. In other words, the one or more vector sums may include a plurality of intermediate summations, and each of the plurality of intermediate summations may represent a summation of a subset of feature vectors in the embedding table implicated by the sparse vector indicator. Accordingly, host 2730 may append or concatenate multiple vector sums to create an output vector.
In some embodiments, the vector sums may include complete summations of partial feature vectors. For example, the vector sum may represent a summation of a first segment (i.e., the first L elements, as described below) of the feature vectors, which may need to be further summed with the first segment of one or more additional feature vectors to generate a summation of a first segment of the feature vectors. This summation may then need to be combined (e.g., concatenated) with summations of additional segments of feature vectors to produce the full output vector. As another example, the vector sums may only be a partial summation of a segment of the feature vectors. In other words, the one or more vector sums may include a plurality of intermediate summations, wherein each of the plurality of intermediate summations represents a summation of partial feature vectors in the embedding table implicated by the sparse vector indicator. Accordingly, the vector sums may then be added with other partial summations of partial feature vectors. The resulting summation of partial feature vectors may then be concatenated with other summations of partial feature vectors to generate the output vector.
Accordingly, the vector sums described herein may refer to any form of summation, whether they include a full summation or intermediate summation of one or more partial or full feature vectors. The vector sum may then be combined with various other vector sums, if needed, to produce a full output vector. Further, the vector sum may refer to an output from a single processor subunit, or a combination of outputs from a plurality of processor subunits, which may or may not be included on the same computational memory chip.
In some embodiments, the embedding table may be stored in a manner to efficiently distribute summation operations among the plurality of processor subunits. For example, a particular processor subunit may be assigned to sum together particular elements of particular feature vectors. Accordingly, it may be beneficial to store those particular elements in a memory bank associated with the processor subunit. Further, it may be beneficial to store other portions of the embedding table in other locations, such that the processor subunit needs to access only the feature vector elements relevant to its assigned task, which may improve or maximize the efficiency for the particular processor subunit. This may be repeated across all processor subunits on one or more computational memory chips to increase the efficiency of the system as a whole.
In some embodiments, elements of embedding table 2800 may be stored in different locations based on which processor subunit will process the elements. In other words, the one or more processor subunits and the one or more memory banks of the at least one computational memory chip may be arranged into a plurality of computational memory groups. Each computational memory group may include at least one processor subunit and one or more of the memory banks dedicated to the at least one processor subunit and may be configured to store a different sub-portion of the one or more portions of the embedding table. The sub-portions may be allocated in various ways. In some embodiments, the sub-portions may represent at least one complete feature vector from the embedding table. Accordingly, one or more full feature vectors, such as feature vector 2810 may be stored sequentially in a memory bank. Alternatively, the various feature vectors may be split apart and the sub-portions may represent at least one partial feature vector from the embedding table. In other words, the feature vectors may be stored non-sequentially in the one or more memory banks.
By splitting the feature vectors into segments, a processor subunit (or group of processor subunits) may be dedicated to summing elements associated with particular segments. Accordingly, a first computational memory group may be configured to store a first segment of a first feature vector and a first segment of a second feature vector. For example, memory processing module 2710 may be configured to store segments 2822 and 2832 in memory bank 2714a. A dimensionality of the first segment of the first feature vector may be the same as a dimensionality of the first segment of the second feature vector, as indicated above. Accordingly, a processor subunit associated with the first computational memory group may perform summations associated with the first segments of the first and second feature vectors. In some embodiments, the first computational memory group may be configured to store first segments of additional feature vectors, such as a first segment of a third feature vector and a first segment of a fourth feature vector. The segments stored in the first computational memory group may be the first segments of all feature vectors in embedding table 2800 (i.e., segments 2822 through 2842), or may be a subset of the first segments of the feature vectors. For example, if the number of feature vectors (K) is less than the number of rows in a memory bank, all of the L-long segments of the K feature vectors may be stored in one memory bank. Otherwise, in some embodiments, the L-long segments may be stored across multiple memory banks and partial summations from multiple memory banks may be summed together.
Similarly, second segments of the feature vectors may be stored in other computational memory groups. Accordingly, a second computational memory group different from the first computational memory group may be configured to store a second segment of the first feature vector and a second segment of the second feature vector. For example, this may include one or more of segments 2824, 2834, and 2844 (as well as any other second segments of feature vectors of embedding table 2800), which may be stored in memory bank 2714b. Accordingly, summations associated with these second segments of the feature vectors may be performed by processor subunit 2712b. The remaining segments of the feature vectors may be split and stored in a nonsequential manner relative to the original feature vectors in a similar manner. Accordingly, the elements of the feature vectors may be split and distributed according to which processor subunits will access the elements to perform summation operations. As a result, the vector elements may be stored in memory banks directly accessible by processor subunits performing summation operations using those elements. The resulting vector sums from each processor subunit may then be combined (e.g., summed together, concatenated, etc.) in order to generate the resulting output vector.
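The following non-limiting Python sketch illustrates one possible software model of the segment-wise storage and summation described above. The names (e.g., embedding_table, groups, group_vector_sum), the constants K, L, and NUM_SEGMENTS, and the use of plain Python lists in place of memory banks and processor subunits are assumptions made for illustration only and do not represent the disclosed hardware.

```python
# Illustrative model: an embedding table with K feature vectors, each split
# into NUM_SEGMENTS segments of L elements. Each "computational memory group"
# stores one segment position for all feature vectors and sums only those
# segments for the feature vectors implicated by the sparse vector indicator.

K = 4             # number of feature vectors in the embedding table (assumed)
L = 3             # segment length (assumed)
NUM_SEGMENTS = 2  # number of segments per feature vector (assumed)

# embedding_table[k] is the k-th feature vector (length L * NUM_SEGMENTS)
embedding_table = [[float(k * 10 + e) for e in range(L * NUM_SEGMENTS)]
                   for k in range(K)]

# Distribute: group g stores segment g of every feature vector
groups = [
    {k: embedding_table[k][g * L:(g + 1) * L] for k in range(K)}
    for g in range(NUM_SEGMENTS)
]

def group_vector_sum(group, indices):
    """Sum the stored segments for the feature vectors implicated by the
    sparse vector indicator (here, a list of non-zero indices)."""
    total = [0.0] * L
    for k in indices:
        total = [a + b for a, b in zip(total, group[k])]
    return total

# Sparse vector indicator: indices of the non-zero entries (assumed values)
sparse_indices = [0, 2, 3]

# Each group produces a vector sum for its segment; the host (or a controller)
# concatenates the per-segment sums into the full output vector.
output_vector = []
for group in groups:
    output_vector.extend(group_vector_sum(group, sparse_indices))

print(output_vector)
```

In this model, each computational memory group produces the vector sum for its segment position, and the per-segment sums are concatenated to form the output vector, mirroring the distribution of summation work described above.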
In some embodiments, in order to reduce the sizes of sums generated during the sum calculations, the sums may be quantized. These quantized vector sums may then be provided as an output to the host external to the computational memory chip. The host external to the computational memory chip may be configured to combine the quantized vector sums. For example, this may include converting a 14-bit sum to an 11-bit sum before it is output from the computational memory chip. As one example, three bits may be used to indicate the position of the sum bits, and the remaining eight bits may include the sum bits. The quantization may provide a desired tradeoff between accumulation accuracy and MPM-to-environment bandwidth. In particular, consecutive columns may refer to the number of embedding columns that are concatenated before moving to the next row, while consecutive rows may refer to the number of rows that are concatenated before moving to the next column. The advantage of a high number of consecutive columns is that more data can be read from the same memory line without paying the penalty of row activation. The disadvantage is that the accumulation may be done by the modules, and not by the processor subunits, which may result in less bandwidth reduction by the processor subunits. The advantage of a high number of consecutive rows is that the accumulation will be done near the memory banks. The disadvantage is that each computational memory chip may need to receive more elements, and the same elements may also be transmitted to other computational memory chips (as they have different columns), which may result in increased bandwidth outside of the computational memory chip. If the number of consecutive columns is low, each index could be multicast to different column groups.
A numerical example is provided below. A feature element may be 8 bits long (int8). At the end of the accumulation there may be M bits: M = 8 + log2(max_acc). It should be noted that when adding, for example, two int8 numbers, the result is 9 bits. Or, for example, when adding N elements, add log2(N) bits to the initial size (int8). The format of the sum outputted from the computational memory chip may include location information of length ceil(log2(M−8)) followed by D sum value bits. The D bits may be rounded before transmitting them. Each address to the memory may be paired with a sample ID and the last sample ID that was fully transmitted when this address was transmitted. The accumulator may finish the accumulation on all the samples that are equal to or lower than the last sample ID.
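A Python sketch of one possible reading of the quantization format described above is shown below, in which ceil(log2(M−8)) location bits record how far the accumulated sum was shifted and D rounded sum bits carry the value. The function name, the rounding scheme, and the example values are assumptions for illustration and not a definitive specification of the disclosed format.

```python
import math

def quantize_sum(acc, max_acc, d_bits=8):
    """Encode an accumulated sum as location bits plus D rounded sum bits.

    acc      -- the accumulated (non-negative) sum value
    max_acc  -- maximum number of int8 elements accumulated
    d_bits   -- number of sum value bits kept (D)
    """
    m_bits = 8 + math.ceil(math.log2(max_acc))     # full accumulator width M
    loc_bits = math.ceil(math.log2(m_bits - 8))    # bits used to encode the shift

    # Shift the sum so it fits in d_bits, rounding the discarded low bits.
    shift = max(acc.bit_length() - d_bits, 0)
    rounded = (acc + (1 << (shift - 1))) >> shift if shift else acc
    if rounded >= (1 << d_bits):                   # rounding overflowed d_bits
        rounded >>= 1
        shift += 1
    return {"location": shift, "location_bits": loc_bits,
            "sum_bits": rounded, "total_bits": loc_bits + d_bits}

# Example: a 14-bit sum (max_acc = 64, so M = 14) reduced to
# 3 location bits + 8 sum bits = 11 bits, matching the example above.
print(quantize_sum(acc=13_000, max_acc=64))
```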
The memory/processing unit may be manufactured by a first manufacturing process that better fits memory cells than logic cells. For example, the memory cells manufactured by the first manufacturing process may exhibit a critical dimension that is smaller, and even much smaller (for example by a factor that exceeds 2, 3, 4, 5, 6, 7, 8, 9, 10, and the like) than the critical dimension of a logic circuit manufactured by the first manufacturing process. For example, the first manufacturing process may be an analog manufacturing process, the first manufacturing process may be a DRAM manufacturing process, and the like.
In some data processing operations, the number of output data elements may be smaller (and may be significantly smaller) than the number of input data elements. Such a situation may arise relative to many different types of data operations. As one example, a structured query language (SQL) operation may be used to perform various types of data tasks relative to a database (e.g., updating data, retrieving data, retrieving data entries according to a designated filter, etc.). In some cases, input data elements acquired from a database or elsewhere may be processed using vector processing techniques performed, e.g., by a single instruction multiple data (SIMD) vector processor. Based on a particular type of operation performed relative to the input data elements by the vector processor, some of the data output elements generated or identified by the vector processor may not include valid data. In one example, such invalid data may refer to input data elements that do not meet certain reference criteria applied by the vector processor (e.g., reference criteria applied as part of a filtering operation).
Outputting invalid output data elements (or irrelevant data elements) and continuing to process such data elements can impact performance of a computing system. For example, transfer of invalid or irrelevant data values to memory or other processing elements on a chip or to remote locations/assets can waste communication bandwidth and memory resources. As a result, various processes of the computing system can be significantly slowed.
The data processing system described in the sections below is aimed at increasing the efficiency of a computing system, for example, with respect to output data elements. Rather than transmitting invalid or irrelevant data elements (e.g., those that do not satisfy filter criteria or another type of criteria), the disclosed system is configured to omit invalid data elements from a system output. Such an approach may speed computing operations and leave communication buses more available for transmission of valid data or communications. In addition to outputting valid data elements and omitting invalid data elements, the disclosed system may also be configured to generate an output that includes validity metadata representative of the output data elements (e.g., metadata identifying valid data elements among a set of data, invalid data elements among a set of data, or both).
As shown in
Memory 2910 can include any suitable type of memory device for storing data. As shown in
Memory 2910 may be configured to store a database including a plurality of values arranged over a plurality of rows and a plurality of columns (e.g., a relational database). In some cases, column values of the database may be stored sequentially in the memory. Values stored in the database may be acquired from memory for analysis by the data analysis unit 2912. In one example, the plurality of data elements 2916 acquired by the data analysis unit 2912 may include a plurality of row values from a single column of a database stored in memory 2910.
As noted above, the data analysis unit 2912 may include a vector processor. In some cases, the vector processor may include at least one single instruction, multiple data (SIMD) configured processor. The SIMD processor can receive an SIMD command from a controller 2922 configured to communicate with the data analysis unit. The SIMD processor may use the received SIMD command in processing the input data elements 2916. In some cases, the SIMD command may specify a type of operation and/or a certain criteria for the data analysis unit 2912 to use in evaluating the input data elements 2916. The SIMD processor may be configured to process in parallel a plurality of the input data elements 2916. In addition to evaluating the acquired input data elements 2916 (e.g., using a filter operation, Boolean operation, or other function relevant to a database query), the data analysis unit may be further configured to execute one or more of logic, algebraic, or string operations relative to the plurality of input data elements 2916.
Data analysis unit 2912 may include any suitable hardware for providing at least the described evaluation functionality. For example, the data analysis unit 2912 may include one or more processors, one or more field programmable gate arrays, one or more application specific integrated circuits (ASICs), buffers, cache, registers, accelerators, etc.
As noted above, the data analysis unit 2912, which may include a vector processor or other types of processors, can operate on data acquired from memory 2910. In some cases, the acquisition of input data elements 2916 from memory 2910 may result from a query relative to a database fully or partially stored in the memory 2910. In some cases, the database query may involve the data analysis unit performing a filter function as part of the analysis of the input data elements 2916. In other cases, the evaluation of the plurality of data elements performed by the data analysis unit 2912 may include a scan, filter, join, aggregate, or sort operation, combinations thereof, and/or other operations for executing a desired query. One or more commands associated with the query may be provided to the data analysis unit 2912 by controller 2922. In the case of a filter query, the command(s) provided by controller 2922 to data analysis unit 2912 may include a filter command, and the at least one criteria used by the data analysis unit 2912 to evaluate the values included in the input data elements 2916 may be provided to the data analysis unit 2912 in association with the filter command. In some cases, the at least one criteria may include one or more reference values to be compared by the data analysis unit 2912 to data values included in the plurality of data elements 2916. The reference values may include numerical values, string values, etc.
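A minimal Python sketch of a filter-style evaluation consistent with the description above is provided below. The function name analyze, the supported operations, and the example values are assumptions for illustration; an actual data analysis unit 2912 would implement such an evaluation in hardware (e.g., as SIMD operations against one or more reference values).

```python
def analyze(data_elements, reference, op="gt"):
    """Evaluate input data elements against a reference criterion and return
    the (unchanged) output data elements plus one validity indicator each,
    mirroring the filter-style output of the data analysis unit described above."""
    if op == "gt":
        validity = [1 if v > reference else 0 for v in data_elements]
    elif op == "eq":
        validity = [1 if v == reference else 0 for v in data_elements]
    else:
        raise ValueError(f"unsupported operation: {op}")
    return data_elements, validity

# Example: filter a column of row values against the reference value 10
elements, valid = analyze([4, 17, 10, 42], reference=10, op="gt")
print(elements)   # [4, 17, 10, 42]
print(valid)      # [0, 1, 0, 1]
```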
The data analysis unit 2912 may be configured to output various types of information. Returning to
It should be noted that data evaluation need not be the only function provided by data analysis unit 2912. For example, in some cases, data analysis unit 2912 may be configured to perform one or more additional operations, such as various arithmetic or algebraic operations, including addition, subtraction, multiplication, among others. Such operations may be performed, for example, as inter-column operations.
Turning to data packer 2914, one role of the data packer may include condensing output data to provide packed data output 2920 such that invalid or irrelevant data elements are not passed on to memory or to another operation. Condensing the data output in this way may significantly reduce the bandwidth and/or processing required to transfer and/or store data results from one or more processes. In addition to a condensed or packed data output, the data packer 2914 may also output validity metadata to assist in retaining information about the validity/invalidity of data elements that were evaluated by data analysis unit 2912.
In the example shown in
Data packer 2914 may include suitable hardware elements for providing the described data packing functionality. In some cases, data packer 2914 may include one or more hardware-based accelerators, processors, gate arrays, and/or logic-based components. The packed data output 2920 may be stored in one or more output FIFOs.
Data packer 2914 may also include one or more accumulators configured to accumulate one or more data segments of a predetermined size before including the accumulated one or more data segments in the packed data output 2920. In some cases, the predetermined data segment size may include 1 bit, 8 bits, 16 bits, 32 bits, or 64 bits or more.
In some cases, data packer 2914 may be configured to use a packing mask included in a predicate register to accumulate one or more data segments of a predetermined size before including the accumulated one or more data segments in the packed data output 2920. The predetermined size in this case may correspond to a size of the predicate register.
Different data types may be supported by the described data analysis system. For example, supported data type widths may include 1 bit for boolean/predicate, 8 bits for INT8/UINT8, 16 bits for INT16/UINT16, 32 bits for INT32/UINT32/FP32, and 64 bits for INT64/UINT64. When data is pushed to an output FIFO or pulled from an input FIFO, the data may be packed/unpacked to reduce transfer bandwidth.
In a one-bit example, each push instruction may contribute L bits of output from data packer 2914. For example, after 64 push instructions, when 64×L bits have been accumulated, the data may be pushed into an output FIFO. If a flush command is issued, the push to the output FIFO is done without waiting to accumulate the whole data lane.
In an eight-bit example, each push instruction may contribute 8×L bits of output from data packer 2914. For example, after 8 push instructions, when 64×L bits have been accumulated, the data may be pushed into an output FIFO. If a flush command is issued, the push to the output FIFO is done without waiting to accumulate the whole data lane.
In a sixteen-bit example, each push instruction may contribute 16×L bits of output from data packer 2914. For example, after 4 push instructions, when 64×L bits have been accumulated, the data may be pushed into an output FIFO. If a flush command is issued, the push to the output FIFO is done without waiting to accumulate the whole data lane.
In a 32-bit example, each push instruction may contribute 32×L bits of output from data packer 2914. For example, after 2 push instructions, when 64×L bits have been accumulated, the data may be pushed into an output FIFO. If a flush command is issued, the push to the output FIFO is done without waiting to accumulate the whole data lane.
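The following Python sketch models the width-dependent accumulation described in the preceding examples, assuming each push instruction contributes width×L bits and that an output FIFO line holds 64×L bits. The class name, attribute names, and the choice of L = 4 lanes are illustrative assumptions only.

```python
class OutputAccumulator:
    """Accumulate packed data until a full FIFO line (64 * L bits) is ready.

    Mirrors the width-dependent push counts above: 64 pushes for 1-bit data,
    8 for 8-bit, 4 for 16-bit, 2 for 32-bit. Purely illustrative."""

    def __init__(self, width_bits, lanes):
        self.width_bits = width_bits
        self.lanes = lanes
        self.line_bits = 64 * lanes
        self.buffered_bits = 0
        self.fifo = []   # records the size of each line pushed (stand-in for packed data)

    def push(self):
        # Each push instruction contributes width_bits * lanes bits of packed data.
        self.buffered_bits += self.width_bits * self.lanes
        if self.buffered_bits >= self.line_bits:
            self.fifo.append(self.line_bits)
            self.buffered_bits -= self.line_bits

    def flush(self):
        # A flush pushes whatever has accumulated without waiting for a full line.
        if self.buffered_bits:
            self.fifo.append(self.buffered_bits)
            self.buffered_bits = 0

acc = OutputAccumulator(width_bits=8, lanes=4)   # L = 4 lanes (assumed)
for _ in range(8):                               # 8 push instructions of 8*L bits each
    acc.push()
print(acc.fifo)            # [256] -- one full 64*L-bit line (64 * 4 = 256 bits)
print(acc.buffered_bits)   # 0
```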
As noted above, in addition to providing packed data output 2920, data packer 2914 may also output validity metadata 2940. The validity metadata 2940 may be provided or transmitted together with the packed data output 2920 or may be provided or transmitted separately from packed data output 2920 (e.g., as part of an out of band communication).
Various types of information relating to the validity of data elements evaluated by data analysis unit 2912 may be provided by the validity metadata 2940. For example, in some cases, validity metadata 2940 may identify valid data elements included in packed data output 2920. Identification of valid data elements may be provided using various types of identifiers associated with valid data elements included in packed data output 2920. Such identifiers may include data element indices, vector indices, row/column indices, etc. Validity metadata 2940 may also identify particular data elements (e.g., data element Out_DE_2) that the data packer 2914 omitted from packed data output 2920. Identification of invalid data elements may be provided using various types of identifiers associated with invalid data elements excluded from packed data output 2920. Such identifiers may include data element indices, vector indices, row/column indices, etc. In some cases, validity metadata 2940 may include identifiers associated with both the valid data elements included in packed data output 2920 and the invalid data elements excluded from packed data output 2920. The identifiers of validity metadata 2940 may take the form of a validity bitmap including Boolean values associated with the data elements included in packed data output 2920 and/or excluded from packed data output 2920. The validity metadata 2940 may also include one or more identifiers associated with a source location of the valid data elements, a source location of the invalid data elements, or both.
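A simplified Python sketch of the packing behavior described above is shown below: invalid elements are omitted from the packed output while validity metadata (here, a validity bitmap together with the source indices of the valid elements) is produced alongside it. The function name pack and the metadata layout are assumptions for illustration.

```python
def pack(data_elements, validity_indicators):
    """Drop invalid elements and emit validity metadata.

    Returns the packed (valid-only) output together with a Boolean validity
    bitmap and the source indices of the valid elements. Purely illustrative."""
    packed = [v for v, ok in zip(data_elements, validity_indicators) if ok]
    metadata = {
        "bitmap": list(validity_indicators),
        "valid_indices": [i for i, ok in enumerate(validity_indicators) if ok],
    }
    return packed, metadata

packed_out, metadata = pack([4, 17, 10, 42], [0, 1, 0, 1])
print(packed_out)   # [17, 42]  -- invalid elements omitted from the packed output
print(metadata)     # {'bitmap': [0, 1, 0, 1], 'valid_indices': [1, 3]}
```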
In some cases, the validity metadata 2940 may replicate or otherwise correspond to the validity indicators 2918. For example, the validity metadata 2940 may include the same plurality of validity indicators 2918 included in the output of the data analysis unit 2912. In other cases, validity metadata 2940 may be provided in a form or format different from validity indicators 2918. The validity metadata 2940 may map the output data elements 2932 outputted from a chip that includes the data analysis unit 2912 or transferred from another processing block inside the chip, identify the origin of the data elements (e.g., a processing entity including the data analysis unit 2912), and/or indicate an order of the valid output data elements in the packed output data 2920.
The output generated by data packer 2914 may be provided to various destinations. For example, the data packer output may be transmitted to one or more destination chips external to a chip on which the data packer resides. The data packer output may also be provided to a FIFO external to the chip on which the data packer resides. External destination chips configured to receive the data packer output may include a memory chip, a computational memory chip, one or more processors, a host processor, a CPU, a GPU, one or more communication controllers, etc.
In one example, the data packer output (including packed data output 2920 and/or validity metadata 2940) may be provided to one or more blocks 2950. One such block 2950 may include a computational memory chip 3110, as shown in
In computer systems including at least one memory and at least one processor, the processor may access the memory to retrieve various data values, operate on the data values, and store the operation resultants in the memory. For example, one standard technique for a processor to read from and write to a memory includes the use of a memory mapped interface to facilitate data transfer associated with memory reads and writes. Using the memory mapped interface, a processor can load data from a shared memory (e.g., a RAM), operate on the loaded data and use the memory mapped interface to write the operation resultant back to the shared memory. The same processor or another processor can later access the address where the operation resultant was stored, retrieve the resultant, and perform operations relative to the resultant.
Memory accesses using a memory mapped interface can lead to bottlenecks. For example, there may be limited pathways for data transfer between a shared memory and a processor, and all communications relating to the reads and writes must travel the same limited communication paths. Such operations can lead to reduced system efficiency and throughput and may be especially problematic where multiple processors are arranged to use the same shared memory. In such a configuration, there is competition for the shared memory, as all processors must access the same shared memory and use the same communication paths into and out of the shared memory to accomplish a data transfer from one processor to another. That is, rather than being able to transfer data (e.g., an operation resultant) directly from a first processor to a second processor such that the second processor can operate on the resultant, the first processor must first store the resultant in the shared memory via the memory mapped interface. The second processor must then retrieve the resultant from the shared memory using the address where the first processor stored the resultant. Using the shared memory in this way (e.g., as an intermediate repository) results in inefficiencies, especially with respect to the communication bandwidth involving the shared memory.
The disclosed processor-to-processor communication systems are aimed at increasing the efficiency of data transfers among processors. For example, the disclosed systems may accomplish processor-to-processor data transfers that bypass the shared memory and do not rely upon a memory mapped interface. Rather, the disclosed system may transfer data (e.g., data loaded from a shared memory, data generated as a result of one or more operations on data loaded from a shared memory, etc.) from one processor to one or more other processors via a stream interface.
The disclosed processor-to-processor communication systems may include various architectures. In some cases, as represented by
In some cases, source processor 3220 and destination processor 3230 may be located on different chips (i.e., different substrates). In other cases, however, source processor 3220 and destination processor 3230 may be formed on a common substrate (e.g., as part of a single chip). Memory 3210 may represent any suitable memory device. In some cases, memory 3210 may include a shared memory, such as a random access memory (RAM, etc.).
In some cases, source processor 3220 and destination processor 3230 may be included on the same or different computational memory chips (e.g., XRAM chips 624, as represented by
Thus, in the example shown in
Among the processor subunits included on the computational memory chip 624 of
In one example, source processor 3220 may retrieve data from memory bank 3352 and perform one or more operations (e.g., arithmetic operation, logic operation, etc.) relative to the retrieved data to provide an operation resultant. Rather than, for example, writing the operation resultant back to memory bank 3352 as part of a memory-mapped data transfer from source processor 3220 to destination processor 3230, source processor 3220 may use a stream interface 3360 to directly transfer the operation resultant to destination processor 3230. Stream interface 3360, which may include a stream crossbar router 3362 among other components, may be configured to transfer data generated by at least one originating processor subunit (e.g., source processor 3220) to at least one processor subunit (e.g., destination processor 3230). In some cases, the transfer of the data generated by the at least one originating processor subunit (e.g., source processor 3220) may occur in response to an execution of a write command by the at least one originating processor subunit (e.g., source processor 3220).
To facilitate data transfer via the stream interface 3360, each of the plurality of processor subunits included on a computational memory chip 624 (or more broadly in a computational memory system including multiple computational memory chips 624) may be associated with a unique destination identifier. For example, in a system with multiple processors, each processor may be identified by a unique destination processor identifier (for example, a number, string, etc.). When a source processor sends data to a destination processor, the sender processor may package or otherwise associate the destination processor identifier with the data to be sent. The sent data may be routed between processors, for example, using one or more stream crossbar routers according to the destination processor identifier.
The stream interface 3360 may be used to transfer data from a source processor subunit (e.g., source processor 3220) to various types of entities. In some cases, as described above, the destination entity may include a processor subunit located on the same or different chip as the source processor subunit. In some cases, the stream interface 3360 may be used to send data from the source processor subunit to at least one destination asset. In some cases, the at least one destination asset includes a memory bank, such as memory bank 3354, dedicated to a processor subunit (e.g., destination processor 3230) other than the originating processor subunit. In other cases, the stream interface 3360 may be configured to transfer data from a source processor subunit to a remote destination asset that is remotely located relative to the computational memory chip 624 on which the source processor subunit resides. For example, such a remote destination asset may include at least one processor subunit disposed on a different computational memory chip (e.g., an XRAM 3380), at least one memory bank disposed on a different computational memory chip (e.g., XRAM 3380), a host CPU 3370, etc.
Stream interface 3360 may include various components. In some cases, as shown in
Various other components may be included in the disclosed processor-to-processor communication system to facilitate data transfers between processors. For example, one or more output FIFO buffers 3382 may be incorporated into the processor-to-processor communication system. When a source processor generates a memory write command, data to be written need not be transferred to a shared memory entry with an address. Instead, the data may be pushed to an output FIFO (allocated to the source processor) with a destination processor identifier that identifies a destination processor that should receive the data. The output FIFO may also be referred to as an O-FIFO. The data pushed to the output FIFO buffer (e.g., O-FIFO 3382) may then be sent to the destination processor 3230 (or retrieved by the destination processor 3230) via stream crossbar router 3362. Note that inclusion of output FIFOs is optional, as the data may be sent directly from a source processor to a destination processor via the stream interface (e.g., as a result of a write command). Using an output FIFO, however, may be desirable to allow the source processor to continue operation if a destination processor cannot immediately accept the data.
In some cases, stream interface 3360 may include a point-to-point communication conduit. In one example, the point-to-point communication conduit avoids components such as routers and switches, but includes a direct path from source to consumer. The stream interface also includes a communication conduit other than a memory bus. In some cases, the stream interface may include a communication conduit passing through one or more switches or routers.
In operation, the stream interface 3360 may be configured to transfer data generated by a source processor subunit to at least one output FIFO buffer 3382 along with an identifier associated with at least one processor subunit or with at least one destination asset (on or off chip) where the data is to be transferred. For example, in some cases, the at least one destination asset may include a memory bank associated with a processor subunit different from the source processor subunit. For example, each of the plurality of processor subunits included on a computational memory chip 624 may be allocated at least one output FIFO buffer 3382. In some cases, each of the plurality of processor subunits may be allocated two or more output FIFO buffers 3382.
One or more input FIFO buffers 3384 may also be included in the disclosed processor-to-processor communication system. For example, the data sent to a destination processor (e.g., destination processor 3230) (or consumer processor) from a source processor (e.g., source processor 3220) may be received by the destination processor as received data to an input FIFO (I-FIFO) 3384, which may be included among one or more I-FIFOs allocated to the destination/consumer processor. Use of an input FIFO is optional. Where included, however, an I-FIFO buffer may be used to store the data generated by at least one originating/source processor subunit (e.g., source processor 3220) and transferred by the stream interface 3360 until at least one processor subunit (e.g., destination processor 3230) is ready to operate on the data. Each of the plurality of processor subunits included on a computational memory chip 624 may be allocated at least one input FIFO buffer 3384. In some cases, each of the plurality of processor subunits may be allocated two or more input FIFO buffers 3384.
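The following Python sketch is a toy model of the O-FIFO, stream crossbar router, and I-FIFO flow described above, in which data written by a source processor subunit is tagged with a destination identifier and routed to the addressed destination's input FIFO. The class name, method names, and processor identifiers are assumptions for illustration and do not reflect the actual hardware interfaces.

```python
from collections import deque

class StreamCrossbar:
    """Toy model of destination-identifier routing between processor subunits.

    Each processor subunit has an output FIFO and an input FIFO; the crossbar
    moves entries from any output FIFO to the input FIFO named by the entry's
    destination identifier. Purely illustrative."""

    def __init__(self, processor_ids):
        self.o_fifos = {pid: deque() for pid in processor_ids}
        self.i_fifos = {pid: deque() for pid in processor_ids}

    def write(self, source_id, dest_id, data):
        # A write command pushes (destination, data) to the source's O-FIFO.
        self.o_fifos[source_id].append((dest_id, data))

    def route(self):
        # The crossbar drains O-FIFOs into the addressed I-FIFOs.
        for o_fifo in self.o_fifos.values():
            while o_fifo:
                dest_id, data = o_fifo.popleft()
                self.i_fifos[dest_id].append(data)

    def read(self, dest_id):
        # The destination pops from its I-FIFO when ready to operate on the data.
        return self.i_fifos[dest_id].popleft() if self.i_fifos[dest_id] else None

xbar = StreamCrossbar(["pu0", "pu1"])
xbar.write("pu0", "pu1", data=123)   # operation resultant sent directly, no shared memory
xbar.route()
print(xbar.read("pu1"))              # 123
```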
Where O-FIFOs and I-FIFOs are included, each processor (e.g., each processor subunit aboard a computational memory chip 624) may be associated with a set of one or more O-FIFOs 3382 and/or with a set of one or more I-FIFOs 3384. Allocating more than a single O-FIFO and/or I-FIFO to a processor may simplify management of different types of traffic and/or management of traffic between different pairs of source and destination processors. For example, as shown in the example of
Similarly, as shown in the example of
Grouping O-FIFOs and I-FIFOs and associating the different groups with different ports of the stream crossbar routers may enable each group to service a different type of traffic without blocking traffic of another set. For example, as shown in
As shown in the example of
As shown in this example, stream crossbar router-2 is communicatively coupled to a second subset of I-FIFOs of all processors, to a second subset of O-FIFOs of all processors, to a first subset of I-FIFOs of all processors, to a first subset of O-FIFOs of all processors, to a second multi-channel DMA, to a third multi-channel DMA, and to all other crossbar routers (Stream crossbar router-1, Stream crossbar router-3 and Stream crossbar router-4).
As shown in this example, stream crossbar router-3 is communicatively coupled to a third subset of I-FIFOs of all processors, to a third subset of O-FIFOs of all processors, to a fourth subset of I-FIFOs of all processors, to a fourth subset of O-FIFOs of all processors, to a fifth multi-channel DMA, to an eighth multi-channel DMA, and to all other crossbar routers (Stream crossbar router-1, Stream crossbar router-2 and Stream crossbar router-4).
As shown in this example, stream crossbar router-4 is communicatively coupled to a second subset of I-FIFOs of all processors, to a second subset of O-FIFOs of all processors, to a third subset of I-FIFOs of all processors, to a third subset of O-FIFOs of all processors, to a fourth multi-channel DMA, to a sixth multi-channel DMA, and to all other crossbar routers (Stream crossbar router-1, Stream crossbar router-2 and Stream crossbar router-3).
One of ordinary skill in the art will recognize that more or fewer stream crossbar routers, I-FIFO sub-sets, O-FIFO sub-sets, and/or multi-channel DMAs may be included in the disclosed processor-to-processor communication system, and the configuration shown in
In managed languages such as Java and C#, a null value is unusable and may cause a null pointer exception when dereferenced. In other languages, like C and C++, undefined values (that is, values that are uninitialized or derived from undefined values) are unusable, and their use may cause various problems such as silent data corruption, altered control flow, or a segmentation fault. Similarly, Structured Query Language (SQL) is often used for communicating with databases. SQL allows a value to be undefined. Database processors, however, need to know whether a certain data unit is invalid or not. Therefore, there is a need to keep track of null value locations in data.
A known method for identifying an invalid or null value includes allocating a predefined value to represent an invalid value. For example, when the data unit is one byte long, one value out of 0-255 may be allocated as an invalid flag. When using this method to encode a data unit, if a valid data unit happens to have the predefined value, it is necessary to change the data unit's value so that it is not misinterpreted as invalid. This process is inefficient, especially when dealing with large databases. Thus, there is a growing need to identify and manage invalid data units in an effective manner.
A computational memory system and a method for managing invalid data units in a memory processing module (MPM) is disclosed below. Examples of MPMs are illustrated in PCT patent application publication WO2019025862, and/or PCT patent application PCT/IB2019/001005. An invalidity bit may be referred to as a NULL bit. An invalid value may be referred to as a NULL value.
In some embodiments, the disclosed computational memory system may include at least one computational memory chip including at least one processor subunit and at least one memory bank formed on a common substrate. For example, as discussed above, a memory processing module (MPM) 610, may be implemented on a chip to include at least one processing element (e.g., a processor subunit) local to associated memory elements formed on the chip (see
In some embodiments, the computational memory system may include a data invalidity detector configured to receive a sequence of data units and invalidity metadata relating to the sequence of data units. The disclosed data invalidity detector may be configured to execute one or more instructions to perform various functions, as disclosed below. For example, in some embodiments, the data invalidity detector may reside on the at least one computational memory chip. By way of example, one or more of MPMs 610 may include the data invalidity detector. As discussed above, the at least one processor subunit 612, the at least one memory bank (e.g., 600-0, 600-1, etc.), and/or the data invalidity detector may be formed on a single hardware chip.
In some embodiments, the data invalidity detector may be implemented by the at least one processor subunit on the at least one computational memory chip. For example, at least one processor subunit 612 on one or more of MPMs 610 may execute one or more instructions stored in a memory bank (e.g., 600-0, 600-1, etc.) to perform one or more functions of the data invalidity detector. The functions performed by the data invalidity detector will be described below.
In some embodiments, the data invalidity detector may be implemented by at least one supporting microprocessor disposed on the at least one computational memory chip. For example, as discussed above with respect to
In some embodiments, the data invalidity detector may be implemented by at least one supporting microprocessor located outside of the at least one computational memory chip. In some embodiments, the data invalidity detector may be implemented by a host CPU outside of the at least one computational memory chip. For example one or more central processing units (CPUs) (e.g., 100, see
In some embodiments, the data invalidity detector may also be configured to generate an invalidity bitmap associated with a sequence of data units based on the received invalidity metadata.
As illustrated in
In some embodiments, a number of bits included in the invalidity bitmap may equal a number of data units in the sequence of data units. In some embodiments, the invalidity bitmap may include a series of values equal to either zero or one to indicate which of the sequence of data units includes a null value. It is contemplated that in some embodiments, invalidity bitmap 3530 may include a plurality of bits. In some embodiments, a total number of bits in invalidity bitmap 3530 may be the same as a total number of data units 3516 in data chunk 3512. For example, when data chunk 3512 includes M data units 3516 (e.g., 3516-1, 3516-2, . . . , 3516-M, M is an integer), a number of bits in invalidity bitmap 3530 may also equal M. As one example, when data chunk 3512 includes five data units 3516 (e.g., 3516-1, 3516-2, 3516-3, 3516-4, 3516-5), invalidity bitmap 3530 may also include five bits (e.g., 3530-1, 3530-2, 3530-3, 3530-4, 3530-5). It is also contemplated that the values of bits (e.g., 3530-1, 3530-2, etc.) in invalidity bitmap 3530 may be 0 or 1 to indicate whether the corresponding data units 3516 (e.g., 3516-1, 3516-2, etc.) are valid or invalid. For example, when a fourth data unit 3516 (e.g., 3516-4) in data chunk 3512 has a null value, a fourth bit (e.g., 3530-4) in invalidity bitmap 3530 may take a value 1 to indicate that the fourth data unit 3516-4 includes a null value. By way of another example, when a second data unit 3516 (e.g., 3516-2) in data chunk 3512 is valid (e.g., not a null value), the second bit (e.g., 3530-2) in invalidity bitmap 3530 may take the value 0 to indicate that the second data unit (e.g., 3516-2) does not include a null value.
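The following Python sketch shows one way an invalidity bitmap could be derived from invalidity metadata and placed at the beginning of a sequence of data units to form a mapped data segment, consistent with the bitmap description above and the appending described in the following paragraphs. The function name and the representation of the invalidity metadata as per-unit Boolean flags are assumptions for illustration.

```python
def build_mapped_data_segment(data_units, invalidity_metadata):
    """Derive an invalidity bitmap and prepend it to a sequence of data units.

    invalidity_metadata is assumed to be one flag per data unit (True = null);
    the bitmap holds one bit per data unit (1 = null, 0 = valid) and is placed
    at the beginning of the mapped data segment. Purely illustrative."""
    bitmap = [1 if is_null else 0 for is_null in invalidity_metadata]
    return {"invalidity_bitmap": bitmap, "data_chunk": list(data_units)}

segment = build_mapped_data_segment(
    data_units=[7, 3, None, 12, None],
    invalidity_metadata=[False, False, True, False, True],
)
print(segment["invalidity_bitmap"])   # [0, 0, 1, 0, 1]
print(segment["data_chunk"])          # [7, 3, None, 12, None]
```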
In some embodiments, the data invalidity detector may be configured to append the invalidity bitmap to the sequence of data units to provide a mapped data segment. In some embodiments, the invalidity bitmap may be appended to a beginning of the sequence of data units. For example, as illustrated in
In some embodiments, the mapped data segment may be stored in the at least one memory bank of the at least one computational memory chip. For example, data invalidity detector 3520 may be configured to store mapped data segment 3514 in at least one memory bank (e.g., 600-0, 600-1, 600-2, 600-3, etc.) Data invalidity detector 3520 may be configured to store mapped data segment 3514 in a single memory bank instead of storing portions of mapped data segment 3514 in different memory banks. As will be described below, storing mapped data segment 3514 in a single memory bank may help reduce latency. In some embodiments, the mapped data segment may be sized to be stored fully within one line of memory included within the at least one memory bank. Additionally or alternatively, in some embodiments, the mapped data segment may be sized to be stored fully within at least one cache associated with the at least one computational memory chip. By way of example, mapped data segment 3514 may have a size smaller than that of a single line of a memory bank (e.g., 600-0, 600-1, 600-2, 600-3, etc.) As discussed above, each of the one or more memory banks (e.g., 600-0, 600-1, 600-2, 600-3, etc.) may include one or more lines of memory elements. It is contemplated that in some embodiments, mapped data segment 3514 may be sized so that it may be stored in a single line of memory elements within a memory bank (e.g., 600-1). In other embodiments, mapped data segment may be sized so that it may be stored on one or more lines of memory elements of a single memory bank (e.g., 600-1). The one or more memory banks (e.g., 600-0, 600-1, 600-2, 600-3, etc.) may constitute one or more caches associated with MPM 610 (e.g., computational memory chip 610). Sizing mapped data segment 3514 as described above may help reduce latency.
In some embodiments, the mapped data segment may have a size equal to or less than an integer division of a line of memory included in the at least one memory bank. Doing so may ensure that an integer number (e.g., 1, 2, 3, . . . N) number of mapped data segments 3514 may be stored in a single memory bank (e.g., 600-0, 600-1, 600-2, 600-3, etc.)
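A small Python sketch of the sizing constraint described above is shown below: the mapped data segment size is checked against an integer division of the memory line so that a whole number of segments fits in one line. The byte-level granularity and the example sizes are assumptions for illustration.

```python
def segments_per_line(line_size_bytes, bitmap_bytes, unit_size_bytes, units_per_chunk):
    """Return how many mapped data segments fit exactly in one memory line.

    A mapped data segment here is the invalidity bitmap plus its data chunk;
    sizes are illustrative assumptions only."""
    segment_size = bitmap_bytes + unit_size_bytes * units_per_chunk
    if line_size_bytes % segment_size != 0:
        raise ValueError(f"{segment_size}-byte segments do not evenly divide "
                         f"a {line_size_bytes}-byte line")
    return line_size_bytes // segment_size

# Example: a 1024-byte line holding segments of an 8-byte bitmap plus 56 one-byte units
print(segments_per_line(1024, bitmap_bytes=8, unit_size_bytes=1, units_per_chunk=56))  # 16
```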
In some embodiments, the data invalidity detector may be configured to generate a sequence of mapped data segments to be stored together in the at least one memory bank of the at least one computational memory chip. For example, as illustrated in
For example, storing portions of mapped data segments 3514-1, 3514-2, etc., in different memory banks may result in lower utilization and higher latency. As one example, when using a DRAM memory bank, invalidity bitmap (e.g., 3530-1, 3530-2, etc.) may need to be updated whenever a data unit 3516 is added to or updated in data chunks 3512-1, 3512-2, etc. To update invalidity bitmap (e.g., 3530-1), it may be necessary to jump to a beginning of the data chunk 3512-1. Positioning invalidity bitmap 3530-1 at one line of the memory bank (e.g., 600-0) and positioning data chunk 3512-1 in another line of memory bank 600-0 or in another line of another memory bank (e.g., 600-2) may substantially slow the writing process. This is because jumping to the beginning of data chunk 3512-1 may require deactivating a first memory line storing data chunk 3512-1, activating a second memory line storing invalidity bitmap 3530-1, and updating the invalidity bitmap (e.g., 3530-1). Furthermore, following the update of the invalidity bitmap (e.g., 3530-1), it may be necessary to deactivate the second memory line and reactivate the first memory line. These processes, involving activation and deactivation of memory lines, may introduce latency in the process of updating the one or more mapped data segments 3514-1, 3514-2, etc. To avoid such increased latency, as discussed above, it may be beneficial to store both mapped data segments 3514-1 and 3514-2 in the same memory bank (e.g., 600-1) and more beneficially in a same line of the same memory bank (e.g., 600-1).
In some embodiments, at least one processor subunit may be configured to perform one or more operations relative to at least one data value included in the mapped data segment and at least one corresponding value in the invalidity bitmap. For example, a processor subunit (e.g., 612) may be configured to perform one or more operations using both the data in a data unit 3516 and its associated validity indicator in invalidity bitmap 3530.
In some embodiments, the one or more operations may include a comparison operation. For example, as illustrated in
By way of an example, invalidity value compatible comparator 3642 may implement a truth table as illustrated in Table 1 below. In Table 1, various operations are identified in the “Operation” column. The first value, a null flag associated with the first value, the second value, and a null flag associated with the second value are identified in the second through fifth columns of Table 1, respectively. The null flags associated with the first and second values indicate whether or not the first and second values, respectively, are valid. A result of the comparison operation is identified in the “Value” column under “Comparison Result,” and a null flag associated with the comparison result is identified in the Null Flag column associated with “Comparison Result.” The null flag associated with the comparison result indicates whether the comparison result itself is valid or invalid.
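Table 1 is not reproduced here; the Python sketch below shows one way an invalidity-value-compatible comparison could behave, following the common SQL-style convention in which a comparison involving a null operand yields a null result while an explicit null test remains valid. This convention and the function name are assumptions for illustration and are not asserted to match the contents of Table 1.

```python
def null_compatible_compare(a, a_null, b, b_null, op):
    """Comparison that carries null flags through its result.

    Returns (result_value, result_null_flag). If either operand is flagged
    null, the comparison result is flagged null; a null test ("is_null") is
    always valid. Convention assumed for illustration only."""
    if op == "is_null":                  # null test: result is the flag itself
        return a_null, False
    if a_null or b_null:
        return False, True               # value is irrelevant; result flagged null
    ops = {
        ">":  lambda x, y: x > y,
        "<":  lambda x, y: x < y,
        "==": lambda x, y: x == y,
        "!=": lambda x, y: x != y,
    }
    return ops[op](a, b), False

print(null_compatible_compare(5, False, 3, False, ">"))          # (True, False)
print(null_compatible_compare(5, False, None, True, ">"))        # (False, True)
print(null_compatible_compare(None, True, 0, False, "is_null"))  # (True, False)
```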
In some embodiments, the at least one processor subunit may be configured to perform at least one logic operation relative to data stored in its one or more dedicated memory banks. For example, as illustrated in Table 1 above, processor subunit 612 may be configured to perform one or more logic operations, such as, determining whether value A is greater than value B, determining whether value A is not equal to value B, determining whether value A is NULL, and so forth.
In some embodiments, the one or more operations may include an arithmetic operation. For example, as illustrated in
By way of an example, the invalidity value compatible ALU 3644 may implement a truth table as illustrated in Table 2 below. In Table 2, various operations are identified in the “Operation” column. The first value, a null flag associated with the first value, the second value, and a null flag associated with the second value are identified in the second through fifth columns of Table 2, respectively. The null flags associated with the first and second values indicate whether or not the first and second values, respectively, are valid. A result of the calculation operation is identified in the “Value” column under “Calculation Result,” and a null flag associated with the calculation result is identified in the Null Flag column associated with “Calculation Result.” The null flag associated with the calculation result indicates whether the calculation result itself is valid or invalid.
In some embodiments, the at least one processor subunit may be configured to perform at least one arithmetic operation relative to data stored in its one or more dedicated memory banks. For example, as illustrated in Table 2 above, processor subunit 612 may be configured to perform one or more arithmetic operations, such as, add values A and B, subtract (e.g. subtract value A from value B or vice-versa), multiply values A and B, and so forth. In some embodiments, the at least one processor subunit may be further configured to store a result of the at least one arithmetic operation in its one or more dedicated memory banks. For example, processor subunit 612 may be configured to store the calculation result of one or more arithmetic operations (e.g., add, multiply, etc.) in one or more of memory banks, for example, 600-0, 600-1, 600-2, etc. It is contemplated that in some embodiments, the at least one processor subunit may be associated with one or more dedicated memory banks from among the plurality of memory banks. For example, a first processor subunit 612 may be associated with a dedicated memory bank 600-0 from among the memory banks 600-0, 600-1, 600-2, etc. First processor subunit 612 may be configured to read data from and/or write data to its dedicated memory bank 600-0. As another example, a second processor subunit 612 may be associated with a dedicated memory bank 600-2 from among the memory banks 600-0, 600-1, 600-2, etc. Second processor subunit 612 may be configured to read data from and/or write data to its dedicated memory bank 600-2. It is contemplated that, when first processor unit 612, for example, performs a calculation, first processor unit 612 may be configured to store the calculation result in its dedicated memory bank 600-0. It is also contemplated that each processor subunit 612 may be associated with more than one dedicated memory bank. Thus, for example, processor subunit 612 may be associated with two dedicated memory banks 600-3 and 600-4.
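In the same spirit as the comparison sketch above, the Python sketch below illustrates an invalidity-value-compatible arithmetic operation that propagates null flags, with the result and its null flag stored back into a list standing in for a dedicated memory bank. The null-propagation convention and all names are assumptions for illustration and are not asserted to match the contents of Table 2.

```python
def null_compatible_alu(a, a_null, b, b_null, op):
    """Arithmetic that propagates null flags.

    Returns (result_value, result_null_flag); if either operand is flagged
    null, the result is flagged null. Convention assumed for illustration."""
    if a_null or b_null:
        return 0, True
    ops = {"add": a + b, "sub": a - b, "mul": a * b}
    return ops[op], False

# A processor subunit might store the (value, null flag) pair back into its
# dedicated memory bank, modeled here as a simple list of results.
dedicated_bank = []
dedicated_bank.append(null_compatible_alu(6, False, 7, False, "mul"))   # (42, False)
dedicated_bank.append(null_compatible_alu(6, False, 0, True, "add"))    # (0, True)
print(dedicated_bank)
```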
In some embodiments, the one or more operations may include an aggregation operation. Processor subunit (e.g., 612) may include an aggregator that may be configured to aggregate one or more of comparison result 3618-1 and/or calculation result 3618-2. For example, as illustrated in
In some embodiments, processor subunit (e.g., 612) may also be configured to handle the required output in case of aggregation on an empty table, that is when there are no results 3618-1 or 3618-2 provided to output unit or aggregator 3646. Table 3 illustrates some of the results when no first or second value is provided to output unit or aggregator 3646.
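The Python sketch below illustrates one possible aggregation behavior, including the empty-table case, following the common SQL convention in which COUNT over an empty input yields a valid zero while SUM, MIN, and MAX yield a null result. This convention is an assumption for illustration and is not asserted to match the contents of Table 3.

```python
def aggregate(values_with_null_flags, op):
    """Aggregate (value, null_flag) pairs, skipping null entries.

    For an empty input (aggregation on an empty table), COUNT yields 0 and is
    valid, while SUM/MIN/MAX yield a null result. Convention assumed for
    illustration; returns (result_value, result_null_flag)."""
    valid = [v for v, is_null in values_with_null_flags if not is_null]
    if op == "count":
        return len(valid), False
    if not valid:
        return None, True
    if op == "sum":
        return sum(valid), False
    if op == "min":
        return min(valid), False
    if op == "max":
        return max(valid), False
    raise ValueError(f"unsupported aggregation: {op}")

rows = [(5, False), (None, True), (9, False)]
print(aggregate(rows, "sum"))    # (14, False) -- null entry skipped
print(aggregate([], "sum"))      # (None, True) -- aggregation on an empty table
print(aggregate([], "count"))    # (0, False)
```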
In some embodiments, the at least one processor subunit may be configured to update the mapped data segment and at least one corresponding value in the invalidity bitmap after performing the at least one logic operation. For example, as discussed above, processor subunit 612 may be configured to perform one or more logic operations as illustrated in Table 1. After performing the logic operations, processor subunit 612 may be configured to provide a comparison result of the logic operation and its associated invalidity flag to output unit or aggregator 3646. When output unit 3646 receives a single value and its associated invalidity flag, output unit 3646 may generate an updated mapped data segment 3514′ that may include an updated data chunk 3512′, including the comparison result (e.g., 3619-1) of the logic operation, and an updated invalidity bitmap 3530′, including an associated invalidity flag (e.g., 3631-1). For example, output unit 3646 may add the comparison result as a data unit 3619-1 to data chunk 3512′ and may also add the invalidity flag 3631-1 associated with the comparison result to invalidity bitmap 3530′.
The invalidity bitmap may be sent from a processor subunit (e.g., 612) to a controller of the MPU in any manner (e.g., in-band or out-of-band). Some lines of the memory banks (e.g., 600-0, 600-1, 600-2, etc.) may store data chunks 3512 while other lines may store data units 3516 that may not be associated with invalidity bitmaps 3530. The lines may be marked or otherwise associated with metadata indicative of the type of data (data chunks or not) stored in each line. Moreover, instead of marking single lines, the marking may be per each sequence of consecutive memory lines.
In step 3702, process 3700 may include receiving, using a data invalidity detector, the sequence of data units and invalidity metadata relating to the sequence of data units. For example, data invalidity detector 3520 may receive a sequence of data units 3516 of data chunk 3512. Data invalidity detector 3520 may additionally receive invalidity metadata 3522, for example, from a database associated with host system 710 and/or from a data table. The database or data table may store the sequence of data units (e.g., 3516-1, 3516-2, etc.) in association with their respective invalidity indicators or flags. As discussed above, data units 3516 may be stored in one column and their associated invalidity indicators or flags may be stored in a different column in the database or data table.
In step 3704, process 3700 may include generating, using the data invalidity detector, an invalidity bitmap associated with the sequence of data units based on the received invalidity metadata. For example, data invalidity detector 3520 may be configured to generate a bitmap sequence 3530. Each value (e.g., 3530-1, 3530-2, etc.) in bitmap sequence 3530 may indicate whether a corresponding data unit 3516-1, 3516-2, etc., is valid or invalid. For example, values 3530-1, 3530-2, etc., may be set to a value indicating invalidity (e.g., 1) when corresponding data units 3516-1, 3516-2, respectively, contain a null value.
In step 3706, process 3700 may include modifying the sequence of data units by appending the invalidity bitmap to the sequence of data units to provide a mapped data segment. For example, data invalidity detector 3520 may be configured to append invalidity bitmap 3530 to a corresponding data chunk 3512. Data invalidity detector 3520 may be configured to append invalidity bitmap 3530 to data chunk 3512 to provide mapped data segment 3514. Data invalidity detector 3520 may be configured to append invalidity bitmap 3530 to a beginning of the sequence of data units 3516 in data chunk 3512. Thus, for example, mapped data segment 3514 (see
In step 3708, process 3700 may include storing the modified sequence of data units in the one or more memory banks of at least one computational memory chip. For example, data invalidity detector 3520 may be configured to store mapped data segment 3514 in at least one memory bank (e.g., 600-0, 600-1, 600-2, 600-3, etc.) Mapped data segment 3514 may be stored in a single memory bank instead of storing portions of mapped data segment 3514 in different memory banks. As described above, storing mapped data segment 3514 in a single memory bank may help reduce latency.
Disclosed embodiments may include the following:
A computational memory system comprising: a master controller configured to receive a configuration function from a host CPU and convert the received configuration function into one or more lower level configuration functions; and at least one computational memory chip, wherein the at least one computational memory chip includes a plurality of processor subunits and a plurality of memory banks formed on a common substrate; wherein the master controller is adapted to configure the at least one computational memory chip using the one or more lower level configuration functions.
wherein configuring the at least one computational memory chip includes preparing one or more of the plurality of processor subunits to perform a function associated with the one or more lower level configuration functions.
wherein the master controller is adapted to configure a first processor subunit of the at least one computational memory chip according to a first lower level configuration function that is different from a second lower level configuration function used to configure a second processor subunit of the at least one computational memory chip.
wherein the master controller is adapted to communicate with one or more slave controllers to configure the at least one computational memory chip.
wherein the master controller is adapted to communicate with at least one DDR controller to configure the at least one computational memory chip.
wherein the at least one DDR controller is configured to communicate with one or more slave controllers included on the at least one computational memory chip.
wherein the master controller includes one or more accelerators that convert the received configuration function into the one or more lower level configuration functions.
wherein each of the one or more accelerators includes at least one microprocessor.
wherein the one or more accelerators are configured to operate in parallel.
wherein the at least one computational memory chip is included on at least one DIMM.
wherein the at least one DIMM is included on an intense memory processing unit (IMPU).
wherein each of the plurality of processor subunits is associated with one or more dedicated memory banks.
wherein the configuration function from the host CPU is associated with an application running on the host CPU.
wherein the received configuration function is a conditional function.
wherein the conditional function is stored in a conditional function queue.
wherein the master controller includes a configuration logic configured to fetch the conditional function from the conditional function queue after completion of a process upon which the conditional function depends.
wherein the master controller includes a configuration logic configured to fetch the conditional function from the conditional function queue after detection of a predetermined system event.
wherein the master controller includes a configuration logic configured to fetch the conditional function from the conditional function queue after completion of a task upon which the conditional function depends.
wherein the received configuration function is a non-conditional function.
wherein the non-conditional function is stored in a non-conditional function queue.
wherein the at least one computational memory chip includes at least a first computational memory chip and a second computational memory chip, and wherein the master controller is adapted to configure the second computational memory chip according to a second lower level configuration function while the first computational memory chip performs a function associated with a first lower level configuration function.
wherein the one or more lower level configuration functions are stored in a configuration information table (CIT).
wherein the master controller includes a configuration logic configured to calculate one or more addresses of the CIT based on the configuration function received from a host CPU.
wherein the configuration logic is configured to retrieve the lower level configuration functions using the calculated one or more addresses of the CIT.
wherein the CIT also stores one or more of a computational memory chip identifier, a function operand, a condition for a next configuration execution, or a parameter override.
wherein the host CPU is configured to update information stored in the CIT.
wherein the master controller includes a configuration logic further including a parameter override logic.
wherein the master controller includes a configuration logic configured to manage a states data structure.
wherein the configuration function is received from the host CPU during an initialization process.
A computational memory system comprising: a controller configured to receive a configuration function from a host CPU; and a plurality of computational memory modules, wherein each of the plurality of computational memory modules includes a processor subunit and one or more memory banks formed on a common substrate; wherein the controller is adapted to multicast the configuration function to two or more respective processor subunits of the plurality of computational memory modules.
wherein the controller is located onboard a computational memory chip together with the plurality of computational memory modules.
wherein the controller is located external to a computational memory chip that includes the plurality of computational memory modules.
wherein the controller includes a field programmable gate array.
wherein the controller includes a DDR controller.
further including at least one configuration processor adapted to configure at least a first processor subunit among the plurality of computational memory modules based on an output generated by at least a second processor subunit among the plurality of computational memory modules.
wherein the at least one configuration processor is located onboard a computational memory chip that includes the plurality of computational memory modules.
wherein the at least one configuration processor is located external to a computational memory chip that includes the plurality of computational memory modules.
wherein the controller is adapted to convert the configuration function into at least one lower level configuration function and multicast the at least one lower level configuration function to two or more respective processor subunits of the plurality of computational memory modules.
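A similarly hedged sketch of the multicast behavior follows; the controller name and the subunit interface are assumptions. The point is simply that one configuration function, or a lower level function derived from it, is delivered to two or more processor subunits in a single logical operation rather than being issued once per subunit.

```python
class MulticastController:
    """Illustrative only: broadcast a (possibly lowered) configuration
    function to a selected group of processor subunits."""
    def __init__(self, subunits):
        self.subunits = subunits  # objects exposing a configure(...) method

    def multicast(self, config_func, targets=None):
        targets = self.subunits if targets is None else targets
        for subunit in targets:          # same function, many destinations
            subunit.configure(config_func)
```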
A computational memory system, comprising: at least one computational memory chip including one or more processor subunits and one or more memory banks formed on a common substrate; wherein the at least one computational memory chip is configured to store one or more portions of an embedding table in the one or more memory banks, the embedding table including one or more feature vectors; and wherein the one or more processor subunits are configured to receive a sparse vector indicator from a host external to the at least one computational memory chip and, based on the received sparse vector indicator and the one or more portions of the embedding table, generate one or more vector sums.
wherein the sparse vector indicator includes a sparse vector including one or more zero values and one or more non-zero values.
wherein the sparse vector indicator includes a set of indices corresponding to non-zero values in an associated sparse vector.
wherein the computational memory system further includes one or more controllers configured to acquire the one or more generated vector sums and provide the acquired one or more generated vector sums to the host external to the computational memory chip.
wherein each processor subunit among the one or more processor subunits is associated with one or more dedicated memory banks from among the one or more memory banks.
wherein the host external to the computational memory chip includes a CPU.
wherein the host external to the computational memory chip includes a GPU.
wherein a sparse vector associated with the sparse vector indicator has a dimensionality higher than a dimensionality of any of the one or more vector sums.
wherein the at least one computational memory chip includes two or more memory banks and wherein the embedding table is stored across the two or more memory banks.
wherein the one or more vector sums constitute sums of full feature vectors included in the embedding table.
wherein the full feature vectors summed together are identified in the embedding table based on the sparse vector indicator.
wherein the one or more vector sums constitute sums of partial feature vectors included in the embedding table.
wherein the one or more vector sums include a plurality of intermediate summations, wherein each of the plurality of intermediate summations represents a summation of a subset of feature vectors in the embedding table implicated by the sparse vector indicator.
wherein the one or more vector sums include a plurality of intermediate summations, wherein each of the plurality of intermediate summations represents a summation of partial feature vectors in the embedding table implicated by the sparse vector indicator.
wherein the one or more vector sums are represented by output from a single processor subunit among the one or more processor subunits.
wherein the one or more vector sums are represented by output from two or more processor subunits among the one or more processor subunits.
wherein the at least one computational memory chip constitutes a plurality of computational memory chips, and wherein the one or more vector sums are represented by output from two or more processor subunits included on different computational memory chips among the plurality of computational memory chips.
wherein the one or more processor subunits and the one or more memory banks of the at least one computational memory chip are arranged into a plurality of computational memory groups, wherein each computational memory group includes at least one processor subunit and one or more of the memory banks dedicated to the at least one processor subunit.
wherein each of the computational memory groups is configured to store a different sub-portion of the one or more portions of the embedding table.
wherein the sub-portion represents at least one complete feature vector from the embedding table.
wherein the sub-portion represents at least one partial feature vector from the embedding table.
wherein a first computational memory group is configured to store a first segment of a first feature vector and a first segment of a second feature vector.
wherein the first computational memory group is configured to store a first segment of a third feature vector and a first segment of a fourth feature vector.
wherein a second computational memory group different from the first computational memory group is configured to store a second segment of the first feature vector and a second segment of the second feature vector.
wherein the first segment of the first feature vector and the first segment of the second feature vector have a common dimensionality.
wherein the common dimensionality is commensurate with a number of accumulators associated with the at least one computational memory chip.
wherein the one or more vector sums are quantized within the at least one computational memory chip.
wherein the quantized vector sums are provided as an output to the host external to the computational memory chip.
wherein the host external to the computational memory chip is configured to combine the quantized vector sums.
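The embedding-table clauses above describe what is, in effect, an in-memory embedding-bag operation: the host sends an indicator of the non-zero entries of a sparse vector, and each computational memory group sums the feature vectors, or segments of feature vectors, it stores in its dedicated banks, returning low-dimensional partial sums instead of the high-dimensional sparse vector. A minimal sketch, assuming a NumPy-style table split across groups (all names hypothetical):

```python
import numpy as np

class MemoryGroup:
    """Hypothetical sketch of one computational memory group holding a slice
    (rows and/or a column segment) of an embedding table."""
    def __init__(self, rows, segment):
        self.rows = rows          # which feature-vector indices this group stores
        self.segment = segment    # its segment of each stored feature vector

    def partial_sum(self, sparse_indices):
        # Sum only the locally stored (partial) feature vectors implicated by
        # the sparse vector indicator; indices not stored here are ignored.
        hit = [i for i, r in enumerate(self.rows) if r in sparse_indices]
        return self.segment[hit].sum(axis=0) if hit else np.zeros(self.segment.shape[1])

# Example: a 6-row, 8-wide embedding table split over two groups by column segment.
table = np.arange(48, dtype=np.float32).reshape(6, 8)
g0 = MemoryGroup(rows=range(6), segment=table[:, :4])   # first segment of every vector
g1 = MemoryGroup(rows=range(6), segment=table[:, 4:])   # second segment of every vector

sparse_indicator = {1, 3, 4}                 # indices of non-zero sparse-vector entries
result = np.concatenate([g.partial_sum(sparse_indicator) for g in (g0, g1)])
assert np.allclose(result, table[[1, 3, 4]].sum(axis=0))  # host recombines partial sums
```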
A data processing unit, comprising: a data analysis unit configured to acquire a plurality of data elements from a memory, evaluate each of the plurality of data elements relative to at least one criteria, and generate an output that includes a plurality of validity indicators identifying a first plurality of data elements among the plurality of data elements that validly satisfy the at least one criteria and identifying a second plurality of data elements among the plurality of data elements that do not validly satisfy the criteria; and a data packer configured to generate, based on the output of the data analysis unit, a packed data output including the first plurality of data elements and omitting the second plurality of data elements.
wherein the plurality of validity indicators further identify which of the plurality of data elements do not validly satisfy the at least one criteria.
wherein the data analysis unit includes a vector processor.
wherein the data packer output further includes validity metadata.
wherein the validity metadata includes one or more identifiers associated with the first plurality of data elements.
wherein the validity metadata includes one or more identifiers associated with the second plurality of data elements.
wherein the validity metadata includes one or more identifiers associated with the first plurality of data elements and also includes one or more identifiers associated with the second plurality of data elements.
wherein the validity metadata includes a validity bitmap including Boolean values associated with each of the plurality of data elements.
wherein the validity metadata includes the plurality of validity indicators included in the output of the data analysis unit.
wherein the data packer output is transmitted to one or more destination chips external to a chip on which the data packer resides.
wherein the destination chip includes a memory chip.
wherein the destination chip includes a computational memory chip.
wherein the computational memory chip includes at least one processor subunit and one or more corresponding, dedicated memory banks formed on a common substrate.
further including a controller configured to communicate one or more commands to the data analysis unit.
wherein the one or more commands include a filter command, and wherein the at least one criteria is associated with the filter command.
wherein the at least one criteria includes one or more reference values to be compared to data values included in the plurality of data elements.
wherein the one or more reference values are numerical values.
wherein the one or more reference values are string values.
wherein the memory is configured to store a database including a plurality of values arranged over a plurality of rows and a plurality of columns.
wherein column values are stored sequentially in the memory.
wherein the plurality of data elements include a plurality of row values from a single column of the database.
wherein the data analysis unit is configured to evaluate the plurality of data elements in parallel.
wherein the evaluation performed by the data analysis unit results from a query relative to a database stored in the memory, and wherein the evaluation includes a filter function.
wherein the evaluation of the plurality of data elements includes at least one of a scan, filter, join, aggregate, or sort operation.
wherein the data analysis unit includes a single instruction, multiple data (SIMD) configured processor.
wherein an SIMD command for the data analysis unit is received from a controller configured to communicate with the data analysis unit.
wherein the plurality of validity indicators included in the output of the data analysis unit are included in a bit mask.
wherein the bit mask specifies which of the plurality of data elements satisfy the at least one criteria and which of the plurality of data elements do not satisfy the at least one criteria.
wherein the plurality of validity indicators identify row values in a column of data of a database that satisfy the at least one criteria.
wherein the plurality of validity indicators identify row values in a column of data of a database that satisfy the at least one criteria and further identify row values in a column of data of a database that do not satisfy the at least one criteria.
wherein the plurality of validity indicators include single bit values.
wherein the data analysis unit is further configured to execute one or more of logic, algebraic, or string operations relative to the plurality of data elements.
wherein the data analysis unit includes one or more field programmable gate arrays.
wherein the data analysis unit includes one or more ASICs.
wherein the data packer uses the plurality of validity indicators included in the output of the data analysis unit to identify the first plurality of data elements to include in the output of the data packer.
wherein the data packer comprises one or more hardware-based accelerators.
wherein the data packer is configured to accumulate one or more data segments of a predetermined size before including the accumulated one or more data segments in the output of the data packer.
wherein the predetermined size is 1 bit, 8 bits, 16 bits, 32 bits, or 64 bits.
wherein the output generated by the data packer is provided to a FIFO external to a chip on which the data packer resides.
wherein the data packer is configured to use a packing mask included in a predicate register to accumulate one or more data segments of a predetermined size before including the accumulated one or more data segments in the output of the data packer.
wherein the predetermined size is associated with a size of the predicate register.
further including the memory.
wherein the data packer output further includes validity metadata, and wherein the validity metadata includes at least one source identifier associated with a source location for the first plurality of data elements and one or more identifiers associated with the first plurality of data elements.
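The data analysis unit and data packer clauses describe a filter-then-compact pipeline: elements are compared against one or more reference values, a bitmap of validity indicators is produced, and only the elements marked valid are emitted, optionally together with validity metadata. A minimal sketch of that flow, assuming NumPy and hypothetical function names:

```python
import numpy as np

def analyze(data, predicate):
    """Data analysis unit sketch: evaluate every element against a criterion
    (here a simple comparison) and return a Boolean validity bitmap."""
    return predicate(data)          # element-wise, i.e. SIMD-like evaluation

def pack(data, validity, keep_metadata=True):
    """Data packer sketch: emit only the valid elements; optionally attach
    validity metadata (here the bitmap itself and the surviving indices)."""
    packed = data[validity]
    if not keep_metadata:
        return packed, None
    metadata = {"bitmap": validity, "indices": np.flatnonzero(validity)}
    return packed, metadata

# Example: filter one column of a column-stored table against a reference value.
column = np.array([3, 17, 5, 42, 8])
bitmap = analyze(column, lambda x: x > 10)       # e.g. a filter command's criterion
packed, meta = pack(column, bitmap)
# packed -> [17, 42]; meta["indices"] -> [1, 3]; meta["bitmap"] -> [F, T, F, T, F]
```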
A processor-to-processor data transfer system, comprising: a first processor programmed to: load data from a memory using a memory mapped interface; generate a data packet for transferring processed data via a non-memory mapped stream interface; and send the generated data packet, including the processed data, to a second processor.
wherein the first processor is located on a first chip different from a second chip on which the second processor is located.
wherein the first processor and the second processor are located on a common substrate.
wherein the memory is a shared memory.
wherein the memory is a random access memory.
wherein the first processor includes a first processor subunit disposed on a first computational memory chip together with a first group of one or more memory banks dedicated to the first processor.
wherein the second processor includes a second processor subunit disposed on a second computational memory chip different from the first computational memory chip, and wherein the second computational memory chip includes a second group of one or more memory banks dedicated to the second processor subunit.
wherein the second processor includes a second processor subunit disposed on the first computational memory chip together with the first processor subunit, and wherein the first computational memory chip includes a second group of one or more memory banks dedicated to the second processor subunit.
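The processor-to-processor transfer above pairs a memory-mapped load on the producing side with a non-memory-mapped, packetized stream to the consuming side. The packet layout and method names in this sketch are assumptions; it only illustrates that two-interface split.

```python
from dataclasses import dataclass

@dataclass
class StreamPacket:
    dest_id: int        # unique destination identifier of the consumer
    payload: bytes      # processed data carried by the stream, not by shared memory

class Producer:
    """Hypothetical first processor: loads via a memory-mapped view, processes,
    then ships the result as a packet over a stream link."""
    def __init__(self, memory, link):
        self.memory = memory        # memory-mapped interface stand-in (bytearray)
        self.link = link            # stream interface stand-in (e.g. a deque)

    def transfer(self, addr, length, dest_id):
        data = bytes(self.memory[addr:addr + length])        # memory-mapped load
        processed = data.upper()                              # placeholder processing
        self.link.append(StreamPacket(dest_id, processed))    # stream send
```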
A computational memory chip, comprising: a plurality of processor subunits and a plurality of memory banks formed on a common substrate, and wherein each processor subunit among the plurality of processor subunits is associated with one or more dedicated memory banks from among the plurality of memory banks; at least one originating processor subunit among the plurality of processor subunits; at least one consumer processor subunit among the plurality of processor subunits; and a stream interface, wherein the stream interface is configured to transfer to the at least one consumer processor subunit data generated by the at least one originating processor subunit.
wherein each of the plurality of processor subunits is associated with a unique destination identifier.
wherein each of the plurality of processor subunits is configured to perform logic or arithmetic operations relative to data stored in its one or more dedicated memory banks.
wherein the stream interface is configured to transfer the data generated by the at least one originating processor subunit without the use of a memory mapped interface and a shared memory.
wherein the transfer of the data generated by the at least one originating processor subunit occurs in response to an execution of a write command by the at least one originating processor subunit.
further including at least one destination asset, wherein the stream interface is configured to transfer to the at least one destination asset data generated by the at least one originating processor subunit.
wherein the at least one destination asset includes a memory bank dedicated to a processor subunit other than the originating processor subunit.
wherein the stream interface is configured to transfer to a remote destination asset, remotely located relative to the computational memory chip, data generated by the at least one originating processor subunit.
wherein the remote destination asset includes at least one processor subunit disposed on a different computational memory chip.
wherein the remote destination asset includes at least one memory bank disposed on a different computational memory chip.
wherein the remote destination asset includes a host CPU.
wherein the transfer of the data to the remote destination asset occurs via at least one direct memory access (DMA) module.
wherein the stream interface includes a stream crossbar router.
wherein the stream crossbar router includes a routing matrix.
wherein the stream crossbar router is configured to enable independent, simultaneous communication between two or more pairs of processor subunits among the plurality of processor subunits.
wherein the stream interface is AXI stream protocol compliant.
further including at least one output FIFO buffer.
wherein the stream interface is configured to transfer the data generated by the at least one originating processor subunit to the at least one output FIFO buffer along with an identifier associated with the at least one consumer processor subunit.
wherein the stream interface is configured to transfer the data generated by the at least one originating processor subunit to the at least one output FIFO buffer along with an identifier associated with at least one destination asset onboard the computational memory chip.
wherein the at least one destination asset includes a memory bank associated with a processor subunit different from the at least one originating processor subunit.
wherein each of the plurality of processor subunits is allocated at least one output FIFO buffer.
wherein each of the plurality of processor subunits is allocated two or more output FIFO buffers.
further including at least one input FIFO buffer.
wherein the at least one input FIFO buffer is allocated to the at least one consumer processor subunit.
wherein the at least one input FIFO buffer is configured to store the data generated by the at least one originating processor subunit and transferred by the stream interface until the at least one consumer processor subunit is ready to operate on the data.
wherein each of the plurality of processor subunits is allocated at least one input FIFO buffer.
wherein each of the plurality of processor subunits is allocated two or more input FIFO buffers.
further including a plurality of input FIFO buffers grouped into a plurality of input FIFO subsets including one or more input FIFO buffers from among the plurality of input FIFO buffers, and wherein each of the plurality of input FIFO subsets is coupled to a different port of the stream interface.
further including a plurality of output FIFO buffers grouped into a plurality of output FIFO subsets including one or more output FIFO buffers from among the plurality of output FIFO buffers, and wherein each of the plurality of output FIFO subsets is coupled to a different port of the stream interface.
wherein the stream interface includes a point-to-point communication conduit.
wherein the stream interface includes a communication conduit other than a memory bus.
wherein the stream interface includes a communication conduit passing through one or more switches or routers.
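The stream-interface clauses describe per-subunit output and input FIFO buffers joined by a crossbar that routes by destination identifier and can carry several subunit-to-subunit transfers independently. A small sketch under those assumptions, with all names hypothetical:

```python
from collections import deque

class StreamCrossbar:
    """Illustrative crossbar router: each processor subunit is allocated an
    output FIFO and an input FIFO; routing moves entries from output FIFOs to
    the input FIFO named by each entry's destination identifier."""
    def __init__(self, num_subunits):
        self.out_fifos = [deque() for _ in range(num_subunits)]
        self.in_fifos = [deque() for _ in range(num_subunits)]

    def send(self, src, dest, data):
        # Originating subunit writes data together with the consumer's identifier.
        self.out_fifos[src].append((dest, data))

    def route(self):
        # One routing pass; distinct source/destination pairs do not block one
        # another, modeling independent, simultaneous transfers.
        for fifo in self.out_fifos:
            while fifo:
                dest, data = fifo.popleft()
                self.in_fifos[dest].append(data)

    def receive(self, dest):
        # Consumer drains its input FIFO when it is ready to operate on the data.
        return self.in_fifos[dest].popleft() if self.in_fifos[dest] else None
```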
A computational memory system comprising: at least one computational memory chip including at least one processor subunit and at least one memory bank formed on a common substrate; and a data invalidity detector configured to receive a sequence of data units and invalidity metadata relating to the sequence of data units, and wherein the data invalidity detector is further configured to generate an invalidity bitmap associated with the sequence of data units based on the received invalidity metadata, and append the invalidity bitmap to the sequence of data units to provide a mapped data segment to be stored in the at least one memory bank of the at least one computational memory chip.
wherein the data invalidity detector resides on the at least one computational memory chip.
wherein the data invalidity detector is implemented by the at least one processor subunit on the at least one computational memory chip.
wherein the data invalidity detector is implemented by at least one supporting microprocessor disposed on the at least one computational memory chip.
wherein the data invalidity detector is implemented by at least one supporting microprocessor located outside of the at least one computational memory chip.
wherein the data invalidity detector is implemented by a host CPU outside of the at least one computational memory chip.
wherein the invalidity bitmap indicates which of the sequence of data units includes a null value.
wherein a number of values included in the invalidity bitmap equals a number of data units included in the sequence of data units.
wherein the invalidity bitmap includes a series of values equal to either zero or one to indicate which of the sequence of data units includes a null value.
wherein a number of bits included in the invalidity bitmap equals a number of data units in the sequence of data units.
wherein the invalidity bitmap is appended to a beginning of the sequence of data units.
wherein the invalidity bitmap is appended to an end of the sequence of data units.
wherein the invalidity bitmap is associated with the sequence of data units.
wherein the mapped data segment is sized to be stored fully within one line of memory included within the at least one memory bank.
wherein the mapped data segment has a size equal to or less than an integer division of a line of memory included in the at least one memory bank.
wherein the mapped data segment is sized to be stored fully within at least one cache associated with the at least one computational memory chip.
wherein the data invalidity detector is configured to generate a sequence of mapped data segments to be stored together in the at least one memory bank of the at least one computational memory chip.
wherein at least one processor subunit is configured to perform one or more operations relative to at least one data value included in the mapped data segment and at least one corresponding value in the invalidity bitmap.
wherein the one or more operations include a comparison operation.
wherein the one or more operations include an arithmetic operation.
wherein the one or more operations include an aggregation operation.
wherein the at least one processor subunit is associated with one or more dedicated memory banks from among the plurality of memory banks.
wherein the at least one processor subunit is configured to perform at least one arithmetic operation relative to data stored in its one or more dedicated memory banks.
wherein the at least one processor subunit is further configured to store a result of the at least one arithmetic operation in its one or more dedicated memory banks.
wherein the at least one processor subunit is configured to perform at least one logic operation relative to data stored in its one or more dedicated memory banks.
wherein the at least one processor subunit is configured to update the mapped data segment and at least one corresponding value in the invalidity bitmap after performing the at least one logic operation.
A method of storing a sequence of data units in one or more of a plurality of memory banks, the method comprising: receiving, using a data invalidity detector, the sequence of data units and invalidity metadata relating to the sequence of data units; generating, using the data invalidity detector, an invalidity bitmap associated with the sequence of data units based on the received invalidity metadata; modifying the sequence of data units by appending the invalidity bitmap to the sequence of data units to provide a mapped data segment; and storing the modified sequence of data units in the one or more memory banks of at least one computational memory chip.
A non-transitory computer readable medium storing instructions executable by at least one processor to cause the at least one processor to perform a method, the method comprising: receiving a sequence of data units and invalidity metadata relating to the sequence of data units; generating an invalidity bitmap associated with the sequence of data units based on the received invalidity metadata; and appending the invalidity bitmap to the sequence of data units to provide a mapped data segment to be stored in at least one memory bank of at least one computational memory chip.
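The invalidity-bitmap clauses and their method and medium counterparts above amount to the following: derive a per-unit null bitmap from the invalidity metadata, append the bitmap to the sequence of data units, and store the combined mapped data segment so later operations can consult validity alongside the data. A minimal sketch, with the function and parameter names assumed for illustration only:

```python
def build_mapped_segment(data_units, invalid_indices, append_at_end=True):
    """Hypothetical data invalidity detector: one bitmap value per data unit
    (1 = null/invalid, 0 = valid), appended to the sequence of data units."""
    bitmap = [1 if i in invalid_indices else 0 for i in range(len(data_units))]
    return data_units + [bitmap] if append_at_end else [bitmap] + data_units

# Example: the metadata source reported units 1 and 3 as null.
segment = build_mapped_segment([10, None, 30, None, 50], invalid_indices={1, 3})
# segment -> [10, None, 30, None, 50, [0, 1, 0, 1, 0]]
```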
The foregoing description has been presented for purposes of illustration. It is not exhaustive and is not limited to the precise forms or embodiments disclosed. Modifications and adaptations will be apparent to those skilled in the art from consideration of the specification and practice of the disclosed embodiments. Additionally, although aspects of the disclosed embodiments are described as being stored in memory, one skilled in the art will appreciate that these aspects can also be stored on other types of computer readable media, such as secondary storage devices, for example, hard disks or CD ROM, or other forms of RAM or ROM, USB media, DVD, Blu-ray, 4K Ultra HD Blu-ray, or other optical drive media.
Computer programs based on the written description and disclosed methods are within the skill of an experienced developer. The various programs or program modules can be created using any of the techniques known to one skilled in the art or can be designed in connection with existing software. For example, program sections or program modules can be designed in or by means of .Net Framework, .Net Compact Framework (and related languages, such as Visual Basic, C, etc.), Java, C++, Objective-C, HTML, HTML/AJAX combinations, XML, or HTML with included Java applets.
Moreover, while illustrative embodiments have been described herein, the scope of the present disclosure includes any and all embodiments having equivalent elements, modifications, omissions, combinations (e.g., of aspects across various embodiments), adaptations, and/or alterations as would be appreciated by those skilled in the art based on the present disclosure. The limitations in the claims are to be interpreted broadly based on the language employed in the claims and are not limited to examples described in the present specification or during the prosecution of the application. The examples are to be construed as non-exclusive. Furthermore, the steps of the disclosed methods may be modified in any manner, including by reordering steps and/or inserting or deleting steps. It is intended, therefore, that the specification and examples be considered as illustrative only, with a true scope and spirit being indicated by the following claims and their full scope of equivalents.
Claims
1-26. (canceled)
27. A computational memory system comprising:
- a master controller configured to receive a configuration function from a host CPU and convert the received configuration function into one or more lower level configuration functions; and
- at least one computational memory chip, wherein the at least one computational memory chip includes a plurality of processor subunits and a plurality of memory banks formed on a common substrate,
- wherein the master controller is adapted to configure the at least one computational memory chip using the one or more lower level configuration functions.
28. The system according to claim 27, wherein configuring the at least one computational memory chip includes preparing one or more of the plurality of processor subunits to perform a function associated with the one or more lower level configuration functions.
29. The system according to claim 28, wherein the master controller is adapted to configure a first processor subunit of the at least one computational memory chip according to a first lower level configuration function that is different from a second lower level configuration function used to configure a second processor subunit of the at least one computational memory chip.
30. The system according to claim 27, wherein the master controller is adapted to communicate with one or more slave controllers to configure the at least one computational memory chip.
31. The system according to claim 27, wherein the master controller is adapted to communicate with at least one DDR controller to configure the at least one computational memory chip.
32. The system according to claim 27, wherein the master controller includes one or more accelerators that convert the received configuration function into the one or more lower level configuration functions.
33. The system according to claim 27, wherein the at least one computational memory chip is included on at least one DIMM.
34. The system according to claim 27, wherein each of the plurality of processor subunits is associated with one or more dedicated memory banks.
35. The system according to claim 27, wherein the configuration function from the host CPU is associated with an application running on the host CPU.
36. The system according to claim 27, wherein the received configuration function is a conditional function stored in a conditional function queue, and the master controller includes a configuration logic configured to fetch the conditional function from the conditional function queue after an event selected from the group consisting of:
- completion of a process upon which the conditional function depends,
- detection of a predetermined system event, and
- completion of a task upon which the conditional function depends.
37. The system according to claim 27, wherein the at least one computational memory chip includes at least a first computational memory chip and a second computational memory chip, and wherein the master controller is adapted to configure the second computational memory chip according to a second lower level configuration function while the first computational memory chip performs a function associated with a first lower level configuration function.
38. The system according to claim 27, wherein the one or more lower level configuration functions are stored in a configuration information table (CIT).
39. The system according to claim 38, wherein the master controller includes a configuration logic configured to calculate one or more addresses of the CIT based on the configuration function received from a host CPU.
40. The system according to claim 39, wherein the configuration logic is configured to retrieve the lower level configuration functions using the calculated one or more addresses of the CIT.
41. The system according to claim 27, wherein the master controller includes a configuration logic further including a parameter override logic.
42. The system according to claim 27, wherein the master controller includes a configuration logic configured to manage a states data structure.
43. The system according to claim 27, wherein the configuration function is received from the host CPU during an initialization process.
44. A computational memory system comprising:
- a controller configured to receive a configuration function from a host CPU; and
- a plurality of computational memory modules, wherein each of the plurality of computational memory modules includes a processor subunit and one or more memory banks formed on a common substrate,
- wherein the controller is adapted to multicast the configuration function to two or more respective processor subunits of the plurality of computational memory modules.
45. The system according to claim 44, further including at least one configuration processor adapted to configure at least a first processor subunit among the plurality of computational memory modules based on an output generated by at least a second processor subunit among the plurality of computational memory modules.
46. The system according to claim 44, wherein the controller is adapted to convert the configuration function into at least one lower level configuration function and multicast the at least one lower level configuration function to two or more respective processor subunits of the plurality of computational memory modules.
Type: Application
Filed: Apr 7, 2023
Publication Date: Aug 3, 2023
Applicant: NEUROBLADE LTD. (Tel Aviv)
Inventors: Eliad HILLEL (Herzliya), Ilan MAYER-WOLF (Tel Aviv), Hillel SRETER (Tel Aviv), Gal DAYAN (Hod Hasharon)
Application Number: 18/297,199