Array Processor Operation Patents (Class 712/16)
-
Patent number: 8934332Abstract: A system is disclosed for concurrently processing order sensitive data packets. A first data packet from a plurality of sequentially ordered data packets is directed to a first offload engine. A second data packet from the plurality of sequentially ordered data packets is directed to a second offload engine, wherein the second data packet is sequentially subsequent to the first data packet. The second offload engine receives information from the first offload engine, wherein the information reflects that the first offload engine is processing the first data packet. Based on the information received at the second offload engine, the second offload engine processes the second data packet so that critical events in the processing of the first data packet by the first offload engine occur prior to critical events in the processing of the second data packet by the second offload engine.Type: GrantFiled: February 29, 2012Date of Patent: January 13, 2015Assignee: International Business Machines CorporationInventors: Ronald E. Fuhs, Scott M. Willenborg
-
Patent number: 8904152Abstract: Efficient computation of complex multiplication results and very efficient fast Fourier transforms (FFTs) are provided. A parallel array VLIW digital signal processor is employed along with specialized complex multiplication instructions and communication operations between the processing elements which are overlapped with computation to provide very high performance operation. Successive iterations of a loop of tightly packed VLIWs are used allowing the complex multiplication pipeline hardware to be efficiently used. In addition, efficient techniques for supporting combined multiply accumulate operations are described.Type: GrantFiled: May 26, 2011Date of Patent: December 2, 2014Assignee: Altera CorporationInventors: Nikos P. Pitsianis, Gerald George Pechanek, Ricardo Rodriguez
-
Patent number: 8892850Abstract: Methods, apparatuses, and computer program products for endpoint-based parallel data processing with non-blocking collective instructions in a parallel active messaging interface (‘PAMI’) of a parallel computer are provided. Embodiments include establishing by a parallel application a data communications geometry, the geometry specifying a set of endpoints that are used in collective operations of the PAMI, including associating with the geometry a list of collective algorithms valid for use with the endpoints of the geometry. Embodiments also include registering in each endpoint in the geometry a dispatch callback function for a collective operation and executing without blocking, through a single one of the endpoints in the geometry, an instruction for the collective operation.Type: GrantFiled: January 17, 2011Date of Patent: November 18, 2014Assignee: International Business Machines CorporationInventors: Charles J. Archer, Michael A. Blocksome, Bob R. Cernohous, Joseph D. Ratterman, Brian E. Smith
-
Publication number: 20140337601Abstract: An array processor composed of processor cells that are programmed by a controlling unit, and that are reprogrammed when a cell has finished a current data processing operation, even while other cell continue to process data with their current programming.Type: ApplicationFiled: May 13, 2014Publication date: November 13, 2014Applicant: PACT XPP TECHNOLOGIES AGInventors: Martin Vorbach, Armin Nuckel
-
Patent number: 8886916Abstract: Endpoint-based parallel data processing with non-blocking collective instructions in a PAMI of a parallel computer is disclosed. The PAMI is composed of data communications endpoints, each including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task. The compute nodes are coupled for data communications through the PAMI. The parallel application establishes a data communications geometry specifying a set of endpoints that are used in collective operations of the PAMI by associating with the geometry a list of collective algorithms valid for use with the endpoints of the geometry; registering in each endpoint in the geometry a dispatch callback function for a collective operation; and executing without blocking, through a single one of the endpoints in the geometry, an instruction for the collective operation.Type: GrantFiled: November 8, 2012Date of Patent: November 11, 2014Assignee: International Business Machines CorporationInventors: Charles J. Archer, Michael A. Blocksome, Bob R. Cernohous, Joseph D. Ratterman, Brian E. Smith
-
Patent number: 8880809Abstract: Embodiments are described for a method for controlling access to memory in a processor-based system comprising monitoring a number of interference events, such as bank contentions, bus contentions, row-buffer conflicts, and increased write-to-read turnaround time caused by a first core in the processor-based system that causes a delay in access to the memory by a second core in the processor-based system; deriving a control signal based on the number of interference events; and transmitting the control signal to one or more resources of the processor-based system to reduce the number of interference events from an original number of interference events.Type: GrantFiled: October 29, 2012Date of Patent: November 4, 2014Assignee: Advanced Micro Devices Inc.Inventors: Gabriel Loh, James O'Connor
-
Patent number: 8874878Abstract: Described embodiments provide a packet classifier for a network processor that generates tasks corresponding to each received packet. The packet classifier includes a scheduler to generate contexts corresponding to tasks received by the packet classifier from processing modules of the network processor. The packet classifier processes threads of instructions, each thread of instructions corresponding to a context received from the scheduler, and each thread associated with a data flow. A thread status table has N entries to track up to N active threads. Each status entry includes a valid status indicator, a sequence value, a thread indicator and a flow indicator. A sequence counter generates a sequence value for each data flow of each thread and is incremented when processing of a thread is started, and is decremented when a thread is completed. Instructions are processed in the order in which the threads were started for each data flow.Type: GrantFiled: November 28, 2012Date of Patent: October 28, 2014Assignee: LSI CorporationInventors: Deepak Mital, James Clee, Jerry Pirog
-
Publication number: 20140281374Abstract: In a parallel computer, a plurality of logical planes formed of compute nodes of a subcommunicator may be identified by: for each compute node of the subcommunicator and for a number of dimensions beginning with a first dimension: establishing, by a plane building node, in a positive direction of the first dimension, all logical planes that include the plane building node and compute nodes of the subcommunicator in a positive direction of a second dimension, where the second dimension is orthogonal to the first dimension; and establishing, by the plane building node, in a negative direction of the first dimension, all logical planes that include the plane building node and compute nodes of the subcommunicator in the positive direction of the second dimension.Type: ApplicationFiled: March 12, 2013Publication date: September 18, 2014Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATIONInventor: INTERNATIONAL BUSINESS MACHINES CORPORATION
-
Patent number: 8830829Abstract: Disclosed are methods, systems, paradigms and structures for processing data packets in a communication network by a multi-core network processor. The network processor includes a plurality of multi-threaded core processors and special purpose processors for processing the data packets atomically, and in parallel. An ingress module of the network processor stores the incoming data packets in the memory and adds them to an input queue. The network processor processes a data packet by performing a set of network operations on the data packet in a single thread of a core processor. The special purpose processors perform a subset of the set of network operations on the data packet atomically. An egress module retrieves the processed data packets from a plurality of output queues based on a quality of service (QoS) associated with the output queues, and forwards the data packets towards their destination addresses.Type: GrantFiled: December 2, 2013Date of Patent: September 9, 2014Assignee: Unbound Networks, Inc.Inventors: Damon Finney, Ashok Mathur
-
Patent number: 8832413Abstract: A processing system includes processors and dynamically configurable communication elements (DCCs) coupled together in an interspersed arrangement. A source device may transfer a data item through an intermediate subset of the DCCs to a destination device. The source and destination devices may each correspond to different processors, DCCs, or input/output devices, or mixed combinations of these. In response to detecting a stall after the source device begins transfer of the data item to the destination device and prior to receipt of all of the data item at the destination device, a stalling device is operable to propagate stalling information through one or more of the intermediate subset towards the source device. In response to receiving the stalling information, at least one of the intermediate subset is operable to buffer all or part of the data item.Type: GrantFiled: May 29, 2013Date of Patent: September 9, 2014Assignee: Coherent Logix, IncorporatedInventors: Michael B. Doerr, William H. Hallidy, David A. Gibson, Craig M. Chase
-
Patent number: 8825924Abstract: A computer array (10) has a plurality of computers (12). The computers (12) communicate with each other asynchronously, and the computers (12) themselves operate in a generally asynchronous manner internally. When one computer (12) attempts to communicate with another it goes to sleep until the other computer (12) is ready to complete the transaction, thereby saving power and reducing heat production. A plurality of read lines (18), write lines (20) and data lines (22) interconnect the computers (12). When one computer (12) sets a read line (18) high and the other computer sets a corresponding write line (20) then data is transferred on the data lines (22). When both the read line (18) and corresponding write line (20) go low this allows both communicating computers (12) to know that the communication is completed. An acknowledge line (72) goes high to restart the computers (12).Type: GrantFiled: March 4, 2011Date of Patent: September 2, 2014Assignee: Array Portfolio LLCInventor: Charles H. Moore
-
Patent number: 8769244Abstract: Uniforming of the processing load is efficiently realized. Each processing element configuring an SIMD parallel computer system includes a data storage module that stores data processed or transferred, a number-of-data-sets storage device that stores number of data sets, and a front data storage device that stores the front data. Each processing element further includes a control processor that compares the number of data sets stored in one processing element with the number of data sets stored in the own processing element, and issues a data distribution leveling instruction that designates an action for updating contents of the data storage module, the number-of-data-sets storage device, and the front data storage device according to a rule determined based on a comparison result of the own processing element and that of the other processing elements and an action for moving the data stored in the one processing element to the own processing element.Type: GrantFiled: April 8, 2009Date of Patent: July 1, 2014Assignee: Nec CorporationInventor: Shorin Kyo
-
Patent number: 8761188Abstract: In the provided architecture, one or more multi-threaded processors may be combined with hardware blocks. The resulting combination allows for data packets to undergo a processing sequence having the flexibility of software programmability with the high-performance of dedicated hardware. For example, a multi-threaded processor can control the high-level tasks of a processing sequence, while the computationally intensive events (e.g., signal processing filters, matrix operations, etc.) are handled by dedicated hardware blocks.Type: GrantFiled: April 30, 2008Date of Patent: June 24, 2014Assignee: Altera CorporationInventors: Anargyros Krikelis, Martin Roberts
-
Publication number: 20140164734Abstract: A method and circuit arrangement utilize inactive non-pipelined operation resources in one processing core of a multi-core processing unit to execute non-pipelined instructions on behalf of another processing core in the same processing unit. Adjacent processing cores in a processing unit may be coupled together such that, for example, when one processing core's non-pipelined execution sequencer is busy, that processing core may issue into another processing core's non-pipelined execution sequencer if that other processing core's non-pipelined execution sequencer is idle, thereby providing intermittent concurrent execution of multiple non-pipelined instructions within each individual processing core.Type: ApplicationFiled: December 6, 2012Publication date: June 12, 2014Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATIONInventors: Adam J. Muff, Paul E. Schardt, Robert A. Shearer, Matthew R. Tubbs
-
Patent number: 8751772Abstract: Hardware and software techniques for interrupt detection and response in a scalable pipelined array processor environment are described. Utilizing these techniques, a sequential program execution model with interrupts can be maintained in a highly parallel scalable pipelined array processing containing multiple processing elements and distributed memories and register files. When an interrupt occurs, interface signals are provided to all PEs to support independent interrupt operations in each PE dependent upon the local PE instruction sequence prior to the interrupt. Processing/element exception interrupts are supported and low latency interrupt processing is also provided for embedded systems where real time signal processing is required. Further, a hierarchical interrupt structure is used allowing a generalized debug approach using debug interrupts and a dynamic debut monitor mechanism.Type: GrantFiled: June 13, 2013Date of Patent: June 10, 2014Assignee: Altera CorporationInventors: Edwin Franklin Barry, Patrick R. Marchand, Gerald George Pechanek, Larry D. Larsen
-
Patent number: 8683182Abstract: Systems and apparatuses are presented relating a programmable processor comprising an execution unit that is operable to decode and execute instructions received from an instruction path and partition data stored in registers in the register file into multiple data elements, the execution unit capable of executing group data handling operations that re-arrange data elements in different ways in response to data handling instructions, the execution unit further capable of executing a plurality of different group floating-point and group integer arithmetic operations that each arithmetically operates on the multiple data elements stored in registers in the register file to produce a catenated result that is returned to a register in the register file, wherein the catenated result comprises a plurality of individual results.Type: GrantFiled: June 11, 2012Date of Patent: March 25, 2014Assignee: Microunity Systems Engineering, Inc.Inventors: Craig Hansen, John Moussouris, Alexia Massalin
-
Publication number: 20140075154Abstract: The invention provides hardware based techniques for switching processing tasks of software programs for execution on a multi-core processor. Invented techniques involve a hardware logic based controller for assigning, adaptive to program processing loads, tasks for processing by cores of a multi-core fabric as well as configuring a set of multiplexers to appropriately interconnect cores of the fabric and program task specific segments at fabric memories, to arrange efficient inter-task communication as well as transferring of activating and de-activating task memory images among the multi-core fabric. The invention thereby provides an efficient, hardware-automated runtime operating system for multi-core processors, minimizing any need to use processing capacity of the cores for traditional operating system software functions.Type: ApplicationFiled: September 3, 2013Publication date: March 13, 2014Inventor: Mark Henrik Sandstrom
-
Patent number: 8665727Abstract: A computer-implemented method is described for determining cost in a non-blocking routing network that provides routing functionality using a single level of a plurality of multiplexers in each row of the routing network. The method includes assigning a respective numerical value, represented by bits, to each row of the routing network. A number of bits that differ between the respective numerical values of each pair of rows of the routing network indicates a number of row traversals necessary to traverse from a first row of the pair to a second row of the pair. A signal routing cost is computed from the number of bits that differ between the respective numerical values of the first row and the second row of the routing network. The calculated signal routing cost is provided to a placement module.Type: GrantFiled: June 21, 2010Date of Patent: March 4, 2014Assignee: Xilinx, Inc.Inventor: Stephen M. Trimberger
-
Patent number: 8656141Abstract: An integrated circuit includes a plurality of tiles. Each tile includes a pipelined processor configured to process multiple streams of instructions for the processor; and a switch including switching circuitry to forward data over data paths from other tiles to one or more pipeline stages of the processor and to switches of other tiles. At least some of the data is forwarded based on one or more streams of instructions for the switch.Type: GrantFiled: December 13, 2005Date of Patent: February 18, 2014Assignee: Massachusetts Institute of TechnologyInventor: Anant Agarwal
-
Patent number: 8638805Abstract: Described embodiments provide for restructuring a scheduling hierarchy of a network processor having a plurality of processing modules and a shared memory. The scheduling hierarchy schedules packets for transmission. The network processor generates tasks corresponding to each received packet associated with a data flow. A traffic manager receives tasks provided by one of the processing modules and determines a queue of the scheduling hierarchy corresponding to the task. The queue has a parent scheduler at each of one or more next levels of the scheduling hierarchy up to a root scheduler, forming a branch of the hierarchy. The traffic manager determines if the queue and one or more of the parent schedulers of the branch should be restructured. If so, the traffic manager drops subsequently received tasks for the branch, drains all tasks of the branch, and removes the corresponding nodes of the branch from the scheduling hierarchy.Type: GrantFiled: September 30, 2011Date of Patent: January 28, 2014Assignee: LSI CorporationInventors: Balakrishnan Sundararaman, Shashank Nemawarkar, David Sonnier, Shailendra Aulakh, Allen Vestal
-
Patent number: 8625422Abstract: Disclosed are methods, systems, paradigms and structures for processing data packets in a communication network by a multi-core network processor. The network processor includes a plurality of multi-threaded core processors and special purpose processors for processing the data packets atomically, and in parallel. An ingress module of the network processor stores the incoming data packets in the memory and adds them to an input queue. The network processor processes a data packet by performing a set of network operations on the data packet in a single thread of a core processor. The special purpose processors perform a subset of the set of network operations on the data packet atomically. An egress module retrieves the processed data packets from a plurality of output queues based on a quality of service (QoS) associated with the output queues, and forwards the data packets towards their destination addresses.Type: GrantFiled: March 5, 2013Date of Patent: January 7, 2014Assignee: Unbound NetworksInventors: Damon Finney, Ashok Mathur
-
Patent number: 8607029Abstract: A dynamic reconfigurable circuit including a plurality of processing elements each provided with an arithmetic data input port, a configuration data input port and an output port, a data network that is coupled to the arithmetic data input ports and the output ports of the plurality of processing elements, a configuration memory that is coupled via a configuration path to the configuration data input port of a first processor element being at least one of the plurality of processing elements, and an immediate value network that is independent from the data network and that is coupled to the configuration data input port of a second processor element being at least one of the plurality of processing elements. An internal register of a third processor element is coupled to the immediate value network so that data stored in the internal register can be outputted to the immediate value network.Type: GrantFiled: December 16, 2008Date of Patent: December 10, 2013Assignee: Fujitsu Semiconductor LimitedInventor: Shin-ichi Sutou
-
Patent number: 8572353Abstract: Communicating among cores in a computing system comprising a plurality of cores, each core comprising a processor and a switch, includes: routing a packet from an origin core to a destination core over a route including multiple cores; and at each core in the route before the destination core, routing the packet to the next core in the route according to a respective symbol in a sequence of multiple symbols. The respective symbol has a first symbol value indicating a single likely direction and the respective symbol has a second symbol value indicating multiple less likely directions.Type: GrantFiled: September 20, 2010Date of Patent: October 29, 2013Assignee: Tilera CorporationInventors: Ian Rudolf Bratt, Carl G. Ramey, Matthew Mattina
-
Publication number: 20130283007Abstract: A multi-node video signal processor (VSPN) is describes that tightly couples multiple multi-cycle state machines (hardware assist units) to each processor and each memory in each node of an N node scalable array processor. VSPN memory hardware assist instructions are used to initiate multi-cycle state machine functions, to pass parameters to the multi-cycle state machines, to fetch operands from a node's memory, and to control the transfer of results from the multi-cycle state machines.Type: ApplicationFiled: June 11, 2013Publication date: October 24, 2013Applicant: ALTERA CORPORATIONInventors: Gerald George Pechanek, Mihailo M. Stojancic
-
Patent number: 8549259Abstract: Systems, methods and articles of manufacture are disclosed for performing a vector collective operation on a parallel computing system that includes multiple compute nodes and a network connecting the compute nodes that includes an ALU. A collective operation may be performed to determine displacements for the vector collective operation. Descriptors for the vector collective operation may be generated based on the displacements. The vector collective operation may then be performed using the descriptors.Type: GrantFiled: September 15, 2010Date of Patent: October 1, 2013Assignee: International Business Machines CorporationInventors: Charles J. Archer, Michael A. Blocksome, Joseph D. Ratterman, Brian E. Smith
-
Patent number: 8549258Abstract: A configurable processing apparatus includes a plurality of processing units, at least an instruction synchronization control circuit, and at least a configuration memory. Each processing apparatus has a stall-output signal generating circuit to output a stall-output signal, wherein the stall-output signal indicates that an unexpected stall is occurred in the processing unit. The processing unit has a stall-in signal, and an external circuit of the processing unit can control whether the processing unit is stalled according to the stall-in signal. The instruction synchronization control circuit generates the stall-in signals to the processing units in response to a content stored in the configuration memory and the stall-output signals of the processing units, so as to determine operation modes and instruction synchronization of the processing units.Type: GrantFiled: February 7, 2010Date of Patent: October 1, 2013Assignee: Industrial Technology Research InstituteInventors: Tzu-Fang Lee, Chien-Hong Lin, Jing-Shan Liang, Chi-Lung Wang
-
Patent number: 8532288Abstract: A cryptographic engine for modulo N multiplication, which is structured as a plurality of almost identical, serially connected Processing Elements, is controlled so as to accept input in blocks that are smaller than the maximum capability of the engine in terms of bits multiplied at one time. The serially connected hardware is thus partitioned on the fly to process a variety of cryptographic key sizes while still maintaining all of the hardware in an active processing state.Type: GrantFiled: December 1, 2006Date of Patent: September 10, 2013Assignee: International Business Machines CorporationInventors: Camil Fayad, John K. Li, Siegfried K. H. Sutter, Phil C. Yeh
-
Patent number: 8489857Abstract: A parallel processing architecture comprising a cluster of embedded processors that share a common code distribution bus. Pages or blocks of code are concurrently loaded into respective program memories of some or all of these processors (typically all processors assigned to a particular task) over the code distribution bus, and are executed in parallel by these processors. A task control processor determines when all of the processors assigned to a particular task have finished executing the current code page, and then loads a new code page (e.g., the next sequential code page within a task) into the program memories of these processors for execution. The processors within the cluster preferably share a common memory (1 per cluster) that is used to receive data inputs from, and to provide data outputs to, a higher level processor. Multiple interconnected clusters may be integrated within a common integrated circuit device.Type: GrantFiled: November 5, 2010Date of Patent: July 16, 2013Assignee: Schism Electronics, L.L.C.Inventors: Richard F. Hobson, Bill Ressl, Allan R. Dyck
-
Patent number: 8489858Abstract: Hardware and software techniques for interrupt detection and response are provided in a scalable pipelined array processor environment. Utilizing these techniques, a sequential program execution model with interrupts can be maintained in a highly parallel scalable pipelined array processing containing multiple processing elements and distributed memories and register files. When an interrupt occurs, interface signals are provided to all PEs to support independent interrupt operations in each PE dependent upon the local PE instruction sequence prior to the interrupt. Processing/element exception interrupts are supported and low latency interrupt processing is also provided for embedded systems where real time signal processing is required. Further, a hierarchical interrupt structure is used allowing a generalized debug approach using debug interrupts and a dynamic debug monitor mechanism.Type: GrantFiled: March 12, 2012Date of Patent: July 16, 2013Assignee: Altera CorporationInventors: Edwin Franklin Barry, Patrick R. Marchand, Gerald George Pechanek, Larry D. Larsen
-
Patent number: 8484276Abstract: Techniques are disclosed for converting data into a format tailored for efficient multidimensional fast Fourier transforms (FFTS) on single instruction, multiple data (SIMD) multi-core processor architectures. The technique includes converting data from a multidimensional array stored in a conventional row-major order into SIMD format. Converted data in SIMD format consists of a sequence of blocks, where each block interleaves s rows such that SIMD vector processors may operate on s rows simultaneously. As a result, the converted data in SIMD format enables smaller-sized 1D FFTs to be optimized in SIMD multi-core processor architectures.Type: GrantFiled: March 18, 2009Date of Patent: July 9, 2013Assignee: International Business Machines CorporationInventors: David G. Carlson, Travis M. Drucker, Timothy J. Mullins, Jeffrey S. McAllister, Nelson Ramirez
-
Publication number: 20130166876Abstract: A method and apparatus are described for using a previous column pointer to read a subset of entries of an array in a processor. The array may have a plurality of rows and columns of entries, and each entry in the subset may reside on a different row of the array. A previous column pointer may be generated for each of the rows of the array based on a plurality of bits indicating the number of valid entries in the subset to be read, the previous column pointer indicating whether each entry is in a current column or a previous column. The entries in the subset may be read and re-ordered, and invalid entries in the subset may be replaced with nulls. The valid entries and nulls may then be outputted.Type: ApplicationFiled: December 21, 2011Publication date: June 27, 2013Applicant: ADVANCED MICRO DEVICES, INC.Inventors: Srikanth Arekapudi, Shloke Hajela
-
Patent number: 8468323Abstract: A computer array (10) has a plurality of computers (12). The computers (12) communicate with each other asynchronously, and the computers (12) themselves operate in a generally asynchronous manner internally. When one computer (12) attempts to communicate with another it goes to sleep until the other computer (12) is ready to complete the transaction, thereby saving power and reducing heat production. The sleeping computer (12) can be awaiting data or instructions (12). In the case of instructions, the sleeping computer (12) can be waiting to store the instructions or to immediately execute the instructions. In the later case, the instructions are placed in an instruction register (30a) when they are received and executed therefrom, without first placing the instructions first into memory. The instructions can include a micro-loop (100) which is capable of performing a series of operations repeatedly.Type: GrantFiled: March 21, 2011Date of Patent: June 18, 2013Assignee: ARRAY Portfolio LLCInventors: Charles H. Moore, Jeffrey Arthur Fox, John W. Rible
-
Patent number: 8464025Abstract: A signal processing apparatus able to raise a processing capability in processing accompanying access to a storing means is provided. Stream control units (SCU) 203—0 to 203—3 access data at an external memory system or local memories 204—0 to 204—3 according to a thread under control from a host processor. Processor units (PU) arrays 202—0 to 202—3 perform image processing by a different thread from the thread of the SCUs 203—0 to 203—3.Type: GrantFiled: May 22, 2006Date of Patent: June 11, 2013Assignee: Sony CorporationInventors: Yuji Yamaguchi, Masatoshi Imai, Toshiharu Noda, Naosuke Asari, Tomoo Mitsunaga, Mitsuharu Ohki, Kazumasa Ito, Hidetoshi Nagano, Sumito Arakawa, Kei Ito
-
Patent number: 8417733Abstract: Embodiments of the present invention provide techniques, including systems, methods, and computer readable medium, for dynamic atomic bitsets. A dynamic atomic bitset is a data structure that provides a bitset that can grow or shrink in size as required. The dynamic atomic bitset is non-blocking, wait-free, and thread-safe.Type: GrantFiled: March 4, 2010Date of Patent: April 9, 2013Assignee: Oracle International CorporationInventor: Nathan Reynolds
-
Publication number: 20130086354Abstract: Methods, apparatuses and storage device associated with cache and/or socket sensitive breadth-first iterative traversal of a graph by parallel threads, are disclosed. In embodiments, a vertices visited array (VIS) may be employed to track graph vertices visited. VIS may be partitioned into VIS sub-arrays, taking into consideration cache sizes of LLC, to reduce likelihood of evictions. In embodiments, potential boundary vertices arrays (PBV) may be employed to store potential boundary vertices for a next iteration, for vertices being visited in a current iteration. The number of PBV generated for each thread may take into consideration a number of sockets, over which the processor cores employed are distributed. In various embodiments, the threads may be load balanced; further data locality awareness to reduce inter-socket communication may be considered, and/or lock-and-atomic free update operations may be employed. Other embodiments may be disclosed or claimed.Type: ApplicationFiled: September 27, 2012Publication date: April 4, 2013Inventors: Nadathur Rajagopalan Satish, Changkyu Kim, Jatin Chhuagani, Jason D. Sewall
-
Patent number: 8370605Abstract: A system includes first and second processors, first and second graphics processing units (GPUs), one or more peripheral devices, a switch matrix, and processor-readable memory. The switch matrix comprises programmable data paths between the processors, the GPUs, and the peripheral devices. Software encoded in the process-readable memory includes a first operating system (OS) executed by the first processor, a second OS executed by the second processor, a matrix scheduling engine, and a media interface switch (MIS) engine. The first OS boots faster than the second OS. The matrix scheduling engine runs on both OSs and configures the data paths in the switch matrix to couple the processors and the GPUs, and to couple the processors and the peripheral devices. The MIS engine runs on the operating systems, detects presence of the peripheral devices, and configures the data paths in the switch matrix to couple the processors and the peripheral devices.Type: GrantFiled: November 11, 2009Date of Patent: February 5, 2013Assignee: Sunman Engineering, Inc.Inventors: Allen Nejah, Gholam Reza Golshan, George W. Harvey
-
Publication number: 20130019082Abstract: An array processor includes processing elements arranged in to form a rectangular array. Inter-cluster communication paths are mutually exclusive. Due to the mutual exclusivity of the data paths, communications between the processing elements of each cluster may be combined in a single inter-cluster path, thus eliminating half the wiring required for the path. The length of the longest communication path is not directly determined by the overall dimension of the array, as in conventional torus arrays. Rather, the longest communications path is limited by the inter-cluster spacing. Transpose elements of an N×N torus may be combined in clusters and communicate with one another through intra-cluster communications paths. Transpose operation latency is eliminated in this approach. Each PE may have a single transmit port and a single receive port. Thus, the individual PEs are decoupled from the array topology.Type: ApplicationFiled: September 14, 2012Publication date: January 17, 2013Applicant: ALTERA CORPORATIONInventors: Gerald G. Pechanek, Charles W. Kurak, JR.
-
Publication number: 20120311360Abstract: In one embodiment, a multi-core processor includes multiple cores and an uncore, where the uncore includes various logic units including a cache memory, a router, and a power control unit (PCU). The PCU can clock gate at least one of the logic units and the cache memory when the multi-core processor is in a low power state to thus reduce dynamic power consumption.Type: ApplicationFiled: May 31, 2011Publication date: December 6, 2012Inventors: Srikanth Balasubramanian, Tessil Thomas, Satish Shrimali, Baskaran Ganesan
-
Publication number: 20120311299Abstract: A novel massively parallel supercomputer of hundreds of teraOPS-scale includes node architectures based upon System-On-a-Chip technology, i.e., each processing node comprises a single Application Specific Integrated Circuit (ASIC). Within each ASIC node is a plurality of processing elements each of which consists of a central processing unit (CPU) and plurality of floating point processors to enable optimal balance of computational performance, packaging density, low cost, and power and cooling requirements. The plurality of processors within a single node individually or simultaneously work on any combination of computation or communication as required by the particular algorithm being solved. The system-on-a-chip ASIC nodes are interconnected by multiple independent networks that optimally maximizes packet communications throughput and minimizes latency.Type: ApplicationFiled: August 3, 2012Publication date: December 6, 2012Applicant: International Business Machines CorporationInventors: Matthias A. Blumrich, Dong Chen, George L. Chiu, Thomas M. Cipolla, Paul W. Coteus, Alan G. Gara, Mark E. Giampapa, Philip Heidlberger, Gerard V. Kopcsay, Lawrence S. Mok, Todd E. Takken
-
Patent number: 8327114Abstract: In some embodiments, processor-to-processor and/or broadcast proxies are designated in a microprocessor matrix comprising a plurality of mesh-interconnected matrix processors when default processor-to-processor or broadcast routing algorithms used by data switches within the matrix to route messages would not deliver the messages to all intended recipients. The broadcast proxies broadcast messages within individual non-overlapping broadcast domains of the matrix. P-to-P and broadcast proxies may be designated as part of a boot-time testing/initialization sequence. Improving system fault tolerance allows improving semiconductor processing yields, which may be of particular significance in relatively large integrated circuits including large numbers of relatively-complex matrix processors.Type: GrantFiled: July 7, 2008Date of Patent: December 4, 2012Assignee: OvicsInventors: Sorin C Cismas, Ilie Garbacea
-
Patent number: 8312053Abstract: Embodiments of the present invention provide techniques, including systems, methods, and computer readable medium, for dynamic atomic arrays. A dynamic atomic array is a data structure that provides an array that can grow or shrink in size as required. The dynamic atomic array is non-blocking, wait-free, and thread-safe. The dynamic atomic array may be used to provide arrays of any primitive data type as well as complex types, such as objects.Type: GrantFiled: September 11, 2009Date of Patent: November 13, 2012Assignee: Oracle International CorporationInventor: Nathan Reynolds
-
Patent number: 8276116Abstract: An algebra operation method includes the steps of converting algebra operations for a plurality of objects which appear in a program into an algebra operation sequence object described using object access data used to access the plurality of objects and object state data used to store states associated with the plurality of objects without immediately evaluating the algebra operations, determining a function to be applied to the algebra operation sequence object, and evaluating the algebra operations by executing the function by designating an argument group required for the function in response to a call of a substitute operator.Type: GrantFiled: June 7, 2007Date of Patent: September 25, 2012Assignee: Canon Kabushiki KaishaInventor: Yasuhiro Nakahara
-
Patent number: 8200948Abstract: An apparatus and method are provided for performing re-arrangement operations on data. The data processing apparatus has a register data store with a plurality of registers for storing data, and processing logic for performing a sequence of operations on data including at least one re-arrangement operation. The processing logic has scalar processing logic for performing scalar operations and SIMD processing logic for performing SIMD operations. The SIMD processing logic is responsive to a re-arrangement instruction specifying a family of re-arrangement operations to perform a selected re-arrangement operation from that family on a plurality of data elements constituted by data in one or more registers identified by the re-arrangement instruction. The selected re-arrangement operation is dependent on at least one parameter provided by the scalar processing logic, that parameter identifying a data element width for the data elements on which the selected re-arrangement operation is performed.Type: GrantFiled: December 4, 2007Date of Patent: June 12, 2012Assignee: ARM LimitedInventors: Daniel Kershaw, Dominic Hugo Symes, Alastair Reid
-
Patent number: 8195921Abstract: A microprocessor capable of decoding a plurality of instructions associated with a plurality of threads is disclosed. The microprocessor may comprise a first array comprising a first plurality of microcode operations associated with an instruction from within the plurality of instructions, the first array capable of delivering a first predetermined number of microcode operations from the first plurality of microcode operations. The microprocessor may further comprise a second array comprising a second plurality of microcode operations, the second array capable of providing one or more of the second plurality of microcode operations in the event that the instruction decodes into more than the first predetermined number of microcode operations. The microprocessor may further comprise an arbiter coupled between the first and second arrays, where the arbiter may determine which thread from the plurality of threads accesses the second array.Type: GrantFiled: July 9, 2008Date of Patent: June 5, 2012Assignee: Oracle America, Inc.Inventors: Robert Golla, Manish Shah
-
Patent number: 8185719Abstract: Each possessor node in an array of nodes has a respective local node address, and each local node address comprises a plurality of components having an order of addressing significance from most to least significant. Each node comprises: mapping means configured to map each component of the local node address onto a respective routing direction, and a switch arranged to receive a message having a destination node address identifying a destination node. The switch comprises: means for comparing the local node address to the destination node address to identify a the most significant non-matching component; and means for routing the message to another node, on the condition that the local node address does not match the destination node address, in the direction mapped to the most significant non-matching component.Type: GrantFiled: November 18, 2010Date of Patent: May 22, 2012Assignee: XMOS LimitedInventor: Michael David May
-
Publication number: 20120110302Abstract: A method, a system and a computer program product for effectively accelerating loop iterators using speculative execution of iterators. An Efficient Loop Iterator (ELI) utility detects initiation of a target program and initiates/spawns a speculative iterator thread at the start of the basic code block ahead of the code block that initiates a nested loop. The ELI utility assigns the iterator thread to a dedicated processor in a multi-processor system. The speculative thread runs/executes ahead of the execution of the nested loop and calculates indices in a corresponding multidimensional array. The iterator thread adds all the precomputed indices to a single queue. As a result, the ELI utility effectively enables a multidimensional loop to be replaced by a single dimensional loop. At the beginning of (or during) each iteration of the iterator, the ELI utility “dequeues” an entry from the queue to use the entry to access the array upon which the ELI utility iterates.Type: ApplicationFiled: November 2, 2010Publication date: May 3, 2012Applicant: IBM CorporationInventors: Ganesh Bikshandi, Dibyendu Das, Smruti Ranjan Sarangi
-
Publication number: 20120096238Abstract: The present invention discloses a circuit and a method for parallel perforation in rate matching, which can reduce the perforation processing time delay to satisfy the requirements of a Long Term Evolution (LTE). Both the circuit and the method can adopt three selector arrays and three register groups. Specifically, the first selector array is configured to remove null bits in input data and output the remaining data to the first register group; the second selector array is configured to combine the first register group and the third register group and then output the combined data to the second register group; during the combination, the valid data in the third register group are preferentially selected, and then the data in the first register group are selected; when the second register group is full, the data therein are output to the exterior as the results of the perforation processing.Type: ApplicationFiled: June 29, 2010Publication date: April 19, 2012Applicant: ZTE CORPORATIONInventor: Ziyu Wen
-
Patent number: 8161267Abstract: Hardware and software techniques for interrupt detection and response in a scalable pipelined array processor environment are described. Utilizing these techniques, a sequential program execution model with interrupts can be maintained in a highly parallel scalable pipelined array processing containing multiple processing elements and distributed memories and register files. When an interrupt occurs, interface signals are provided to all PEs to support independent interrupt operations in each PE dependent upon the local PE instruction sequence prior to the interrupt. Processing/element exception interrupts are supported and low latency interrupt processing is also provided for embedded systems where real time signal processing is required. Further, a hierarchical interrupt structure is used allowing a generalized debug approach using debut interrupts and a dynamic debut monitor mechanism.Type: GrantFiled: November 30, 2010Date of Patent: April 17, 2012Assignee: Altera CorporationInventors: Edwin Franklin Barry, Patrick R. Marchand, Gerald George Pechanek, Larry D. Larsen
-
Patent number: 8151090Abstract: A systolic data processing apparatus includes a processing element (PE) array and control unit. The PE array comprises a plurality of PEs, each PE executing a thread with respect to different data according to an input instruction and pipelining the instruction at each cycle for executing a program. The control unit inputs a new instruction to a first PE of the PE array at each cycle.Type: GrantFiled: February 17, 2009Date of Patent: April 3, 2012Assignee: Samsung Electronics Co., Ltd.Inventors: Gi-Ho Park, Shin-Dug Kim, Jung-Wook Park, Hoon-Mo Yang, Sung-Bae Park
-
Patent number: 8151089Abstract: A multiplicity of processor elements that are arranged in rows and columns individually execute data processing in accordance with instruction codes that are individually set as data and supply event data as output. A state control unit is composed of a plurality of units that successively switch the instruction codes of the multiplicity of processor elements in accordance with a computer program and the event data, these state control units communicating with each other to realize linked operation as necessary. An event distributing means distributes event data to this plurality of state control units that intercommunicate to realize linked operation, whereby the plurality of state control units can realize linked operation to control a large-scale state transition.Type: GrantFiled: October 29, 2003Date of Patent: April 3, 2012Assignee: Renesas Electronics CorporationInventors: Taro Fujii, Koichiro Furuta, Masato Motomura, Kenichiro Anjo, Yoshikazu Yabe, Toru Awashima, Takao Toi, Noritsugu Nakamura