PROCESSING STRUCTURED AND UNSTRUCTURED DATA USING OFFLOAD PROCESSORS
Methods of processing structured data, unstructured data, or both are disclosed. Processing structured data can include providing an in-memory database with at least one of a plurality of modules connected to a memory bus of a server; executing database functions with at least one processor on the module; and directing database queries to the at least one module with a CPU of the server. Processing unstructured data can include executing data processing tasks with a CPU connected to a memory bus; and directing parallel computation tasks to a plurality of modules connected to the memory bus.
This application claims the benefit of U.S. Provisional Patent Application 61/650,373 filed May 22, 2012, the contents of which are incorporated by reference herein.
TECHNICAL FIELDThe present disclosure relates generally to servers capable of efficiently processing structured and unstructured data. More particularly, systems supporting offload or auxiliary processing modules that can be physically connected to a system memory bus to process data independent of a host processor of the server are described.
BACKGROUNDEnterprises store and process their large amounts of data in a variety of ways. One manner in which enterprises store data is by using relational databases and corresponding relational database management systems (RDBMS). Such data, usually referred to as structured data, may be collected, normalized, formatted and stored in an RDBMS. Tools based on standardized data languages such as the Structured Query Language (SQL) may be used for accessing and processing structured data. However, it is estimated that such formatted structured data represents only a tiny fraction of an enterprise's stored data. Organizations are becoming increasingly aware that substantial information and knowledge resides in unstructured data (i.e., “Big Data”) repositories. Accordingly, simple and effective access to both structured and unstructured data are seen as necessary for maximizing the value of enterprise informational resources.
However, conventional platforms that are currently being used to handle structured and unstructured data can substantially differ in their architecture. In-memory processing and Storage Area Network (SAN)-like architectures are used for traditional SQL queries, while commodity or shared nothing architectures (each computing node, consisting of a processor, local memory, and disk resources, shares nothing with other nodes in the computing cluster) are usually used for processing unstructured data. An architecture that supports both structured and unstructured queries can better handle current and emerging Big Data applications.
SUMMARYMethods can include processing structured and/or unstructured data. A method of processing structured data can include providing an in-memory database with at least one of a plurality of modules connected to a memory bus in a first server; executing database functions with at least one processor on the module; and connecting a central processing unit (CPU) in the first server to the modules by the memory bus, and directing database queries to the at least one module.
A method of processing unstructured data can include providing a plurality of modules connected to a memory bus that each include at least one processor; connecting a central processing unit (CPU) to the modules by the memory bus; executing data processing tasks with the CPU; and directing parallel computation tasks to a plurality of the modules.
Data processing and analytics for enterprise server or cloud based data systems, including both structured or unstructured data, can be efficiently implemented on offload processing modules connected to a memory bus, for example, by insertion into a socket for a Dual In-line Memory Module (DIMM). Such modules can be referred to as Xocket™ In-line Memory Modules (XIMMs), and can have multiple “wimpy” cores associated with a memory channel. Using one or more XIMMs it is possible to execute lightweight data processing tasks without intervention from a main server processor. As will be discussed, XIMM modules have high efficiency context switching, high parallelism, and can efficiently process large data sets. Such systems as a whole are able to handle large database searches at a very low power when compared to traditional high power “brawny” server cores. Advantageously, by accelerating implementation of MapReduce or similar algorithms on unstructured data and by providing high performance virtual shared disk for structured queries, a XIMM based architecture capable of partitioning tasks is able to greatly improve data analytic performance.
Traditionally, an in-memory database that handles structured queries looks for the query in its physical memory, and in the absence of the query therein will perform a disk read to a database. The disk read might result in the page cache of the kernel being populated with the disk file. The process might perform a complete copy of the page(s) (in a paged memory system) into its process buffer or it might perform a mmap operation, wherein a pointer to the entry in the page cache storing the pages is created and stored in the heap of the process. The latter is less time consuming and more efficient than the former.
The inclusion of a XIMM supported data retrieval framework can effectively extend the overall available size of the in-memory space available with the assembly. In case a structured query is being handled by one of the said CPUs (108), and the CPU is unable to retrieve it from its main memory, it will immediately trap into a memory map (mmap routine) that will handle disk reads and populate the page cache.
The embodiment described herein can modify the mmap routine of standard operating systems to execute code corresponding to a driver for a removable computation module driver, in this case a XIMM driver. The XIMM driver in turn identifies the query and transfers the query to one or more XIMMs in the form of memory reads/writes. A XIMM can house a plurality of offload processors (112a to 112c) that can receive the memory read/write commands containing the structured query. One or more of the offload processors (112a to 112c) can perform a search for the query in its available cache and local memory and return the result. According to embodiments, an mmap query can further be modified to allow the transference of the query to XIMMs that are not in the same server but in the same rack. The mmap abstraction in such a case can perform a remote direct memory access (RDMA) or a similar network memory read for accessing data present in a XIMM that is in the same rack. As a second XIMM (i.e., offload processors 112a to 112c of server 140a) is connected to a first XIMM (i.e., offload processors 112a to 112c of server 140) through a top of rack switch 120, the latency of a system according to an embodiment can be the combined latency of the network interface cards (NICs) of the two servers (i.e., 100 in 140 and 100 in 140a), the latency of the TOR switch 120, and a response time of the second XIMM.
Despite the additional hops, embodiments can provide a system that can have a much lower latency than a read from a hard disk drive. The described architecture can provide orders of magnitude improvement over a single structured query in-memory database. By allowing for non-frequently used data of one of the servers to be pushed to another XIMM located in the same rack, the effective in-memory space can be increased beyond conventional limits, and a large assembly having a large physical, low latency memory space can be created. This embodiment may further be improved by allowing for transparent sharing of pages across multiple XIMMs. Embodiments can further improve the latency between server-to-server connections by mediating them by XIMMs. The XIMMs can act as intelligent switches to offload and bypass the TOR switch.
Conventional data intensive computing platforms for handling large volumes of unstructured data can use a parallel computing approach combining multiple processors and disks in large commodity computing clusters connected with high-speed communications switches and networks. This can allow the data to be partitioned among the available computing resources and processed independently to achieve performance and scalability based on the amount of data. A variety of distributed architectures have been developed for data-intensive computing and several software frameworks have been proposed to process unstructured data. One such programming model for processing large data sets with a parallel, distributed algorithm on a multiple servers or clusters is commonly known as MapReduce. Hadoop is a popular open-source implementation of MapReduce that is widely used by enterprises for search of unstructured data.
Map and Reduce tasks are computationally intensive and have very tight dependency on each other. While Map tasks are small and independent tasks that can run in parallel, the Reduce tasks include fetching intermediate (key, value) pairs that result from each Map function, sorting and merging intermediate results according to keys and applying Reduce functions to the sorted intermediate results. Reducer (212) can perform the Reduce function only after it receives the intermediate results from all the Mappers (208). Thus, the Shuffle step (210) (communicating the Map results to Reducers) often becomes the bottleneck in Hadoop workloads and introduces latency.
In an embodiment, a XIMM based architecture can improve Hadoop (or similar data processing) performance in two ways. Firstly, intrinsically parallel computational tasks can be allocated to the XIMMs (e.g., modules with offload processors), leaving the number crunching tasks to “brawny” (e.g., x86) cores. This is illustrated in more detail in
A XIMM based architecture, according to an embodiment, can reduce the intrinsic bottleneck of most Hadoop (or similar data processing) workloads (the Shuffle phase) by driving the I/O backplane to its full capacity. In a conventional Hadoop system, the TaskTracker (228) described in
The following example(s) provide illustration and discussion of exemplary hardware and data processing systems suitable for implementation and operation of the foregoing discussed systems and methods. In particular hardware and operation of wimpy cores or computational elements connected to a memory bus and mounted in DIMM or other conventional memory socket is discussed.
The computation elements or offload processors can be accessible through memory bus 405. In this embodiment, the module can be inserted into a Dual Inline Memory Module (DIMM) slot on a commodity computer or server using a DIMM connector (407), providing a significant increase in effective computing power to system 400. The module (e.g., XIMM) may communicate with other components in the commodity computer or server via one of a variety of busses including but not limited to any version of existing double data rate standards (e.g., DDR, DDR2, DDR3, etc.)
This illustrated embodiment of the module 402 contains five offload processors (400a, 400b, 400c, 400d, 400e) however other embodiments containing greater or fewer numbers of processors are contemplated. The offload processors (400a to 400e) can be custom manufactured or one of a variety of commodity processors including but not limited to field-programmable grid arrays (FPGA), microprocessors, reduced instruction set computers (RISC), microcontrollers or ARM processors. The computation elements or offload processors can include combinations of computational FPGAs such as those based on Altera, Xilinx (e.g., Artix™ class or Zynq® architecture, e.g., Zynq® 7020), and/or conventional processors such as those based on Intel Atom or ARM architecture (e.g., ARM A9). For many applications, ARM processors having advanced memory handling features such as a snoop control unit (SCU) are preferred, since this can allow coherent read and write of memory. Other preferred advanced memory features can include processors that support an accelerator coherency port (ACP) that can allow for coherent supplementation of the cache through an FPGA fabric or computational element.
Each offload processor (400a to 400e) on the module 402 may run one of a variety of operating systems including but not limited to Apache or Linux. In addition, the offload processors (400a to 400e) may have access to a plurality of dedicated or shared storage methods. In this embodiment, each offload processor can connect to one or more storage units (in this embodiments, pairs of storage units 404a, 404b, 404c and 404d). Storage units (404a to 404d) can be of a variety of storage types, including but not limited to random access memory (RAM), dynamic random access memory (DRAM), sequential access memory (SAM), static random access memory (SRAM), synchronous dynamic random access memory (SDRAM), reduced latency dynamic random access memory (RLDRAM), flash memory, or other emerging memory standards such as those based on DDR4 or hybrid memory cubes (HMC).
In this embodiment, one of the Zynq® computational FPGAs (416a to 416e) can act as arbiter providing a memory cache, giving an ability to have peer to peer sharing of data (via memcached or OMQ memory formalisms) between the other Zynq® computational FPGAs (416a to 416e). Traffic departing for the computational FPGAs can be controlled through memory mapped I/O. The arbiter queues session data for use, and when a computational FPGA asks for address outside of the provided session, the arbiter can be the first level of retrieval, external processing determination, and predictors set.
Operation of one embodiment of a module 430 (e.g., XIMM) using an ARM A9 architecture is illustrated with respect to
The following table (Table 1) illustrates potential states that can exist in the scheduling of queues/threads to XIMM processors and memory such as illustrated in
These states can help coordinate the complex synchronization between processes, network traffic, and memory-mapped hardware. When a queue is selected by a traffic manager a pipeline coordinates swapping in the desired L2 cache (440), transferring the reassembled 10 data into the memory space of the executing process. In certain cases, no packets are pending in the queue, but computation is still pending to service previous packets. Once this process makes a memory reference outside of the data swapped, a scheduler can require queued data from a network interface card (NIC) to continue scheduling the thread. To provide fair queuing to a process not having data, the maximum context size is assumed as data processed. In this way, a queue must be provisioned as the greater of computational resource and network bandwidth resource, for example, each as a ratio of an 800 MHz A9 and 3 Gbps of bandwidth. Given the lopsidedness of this ratio, the ARM core is generally indicated to be worthwhile for computation having many parallel sessions (such that the hardware's prefetching of session-specific data and TCP/reassembly offloads a large portion of the CPU load) and those requiring minimal general purpose processing of data.
Essentially zero-overhead context switching is also possible using modules as disclosed in
In operation, metadata transport code can relieve a main or host processor from tasks including fragmentation and reassembly, and checksum and other metadata services (e.g., accounting, IPSec, SSL, Overlay, etc.). As 10 data streams in and out, L1 cache 437 can be filled during packet processing. During a context switch, the lock-down portion of a translation lookaside buffer (TLB) of an L1 cache can be rewritten with the addresses corresponding to the new context. In one very particular implementation, the following four commands can be executed for the current memory space.
MRC p15, 0, r0, c10, c0, 0; read the lockdown register
BIC r0, r0, #1; clear preserve bit
MCR p15, 0, r0, c10, c0, 0; write to the lockdown register;
write to the old value to the memory mapped Block RAM
This is a small 32 cycle overhead to bear. Other TLB entries can be used by the XIMM stochastically.
Bandwidths and capacities of the memories can be precisely allocated to support context switching as well as applications such as Openflow processing, billing, accounting, and header filtering programs.
For additional performance improvements, the ACP 434 can be used not just for cache supplementation, but hardware functionality supplementation, in part by exploitation of the memory space allocation. An operand can be written to memory and the new function called, through customizing specific Open Source libraries, so putting the thread to sleep and a hardware scheduler can validate it for scheduling again once the results are ready. For example, OpenVPN uses the OpenSSL library, where the encrypt/decrypt functions 439 can be memory mapped. Large blocks are then available to be exported without delay, or consuming the L2 cache 440, using the ACP 434. Hence, a minimum number of calls are needed within the processing window of a context switch, improving overall performance.
It should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
It is also understood that the embodiments of the invention may be practiced in the absence of an element and/or step not specifically disclosed. That is, an inventive feature of the invention may be elimination of an element.
Accordingly, while the various aspects of the particular embodiments set forth herein have been described in detail, the present invention could be subject to various changes, substitutions, and alterations without departing from the spirit and scope of the invention.
Claims
1. A method of processing structured data, comprising:
- providing an in-memory database with at least one of a plurality of modules connected to a memory bus in a first server;
- executing database functions with at least one processor on the module; and
- connecting a central processing unit (CPU) in the first server to the modules by the memory bus, and directing database queries to the at least one module.
2. The method of processing structured data of claim 1, further including communicating between modules without accessing any CPU of the first server.
3. The method of processing structured data of claim 1, further including:
- the modules are mounted on different servers in a same rack; and
- communicating between modules of different racks with a top of the rack switch.
4. The method of processing structured data of claim 1, further including executing a memory mapping operation to transfer a database query from the CPU to at least one module in the form of a memory read or write.
5. The method of processing structured data claim 1, wherein providing the in-memory database includes inserting the modules into dual-in-line-memory-module (DIMM) sockets.
6. A method of processing unstructured data, comprising the steps of:
- providing a plurality of modules connected to a memory bus that each include at least one processor;
- connecting a central processing unit (CPU) to the modules by the memory bus;
- executing data processing tasks with the CPU; and
- directing parallel computation tasks to a plurality of the modules.
7. The method of processing unstructured data of claim 6, further including:
- executing a Map/Reduce algorithm, including collecting data parsed with a plurality of the modules, and storing results of a Map step in a main memory
8. The method of processing unstructured data of claim 6, further including defining a massively parallel input/output (I/O) mid-plane with a plurality of the modules.
9. The method of processing unstructured data of claim 6, further including:
- the modules are mounted on different servers in a same rack;
- processing a Map/Reduce algorithm with multiples of the servers; and
- exchanging intermediate (key, value) pairs across the multiple servers through switching by the modules.
10. The method of processing unstructured data of claim 6, wherein providing the plurality of modules connected to a memory bus includes inserting the modules into dual-in-line-memory-module (DIMM) sockets.
Type: Application
Filed: May 22, 2013
Publication Date: Nov 28, 2013
Inventor: Parin Bhadrik Dalal (Milpitas, CA)
Application Number: 13/900,303
International Classification: G06F 17/30 (20060101);