Array Processor Operation Patents (Class 712/16)
  • Patent number: 8934332
    Abstract: A system is disclosed for concurrently processing order sensitive data packets. A first data packet from a plurality of sequentially ordered data packets is directed to a first offload engine. A second data packet from the plurality of sequentially ordered data packets is directed to a second offload engine, wherein the second data packet is sequentially subsequent to the first data packet. The second offload engine receives information from the first offload engine, wherein the information reflects that the first offload engine is processing the first data packet. Based on the information received at the second offload engine, the second offload engine processes the second data packet so that critical events in the processing of the first data packet by the first offload engine occur prior to critical events in the processing of the second data packet by the second offload engine.
    Type: Grant
    Filed: February 29, 2012
    Date of Patent: January 13, 2015
    Assignee: International Business Machines Corporation
    Inventors: Ronald E. Fuhs, Scott M. Willenborg
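    Illustrative sketch of the ordering idea in the abstract above, assuming software threads stand in for the offload engines and a per-packet event carries the "information" passed between them. The function name process_in_order, the upper-casing step, and the use of threading.Event are hypothetical choices made for this example and do not come from the patent.

```python
import threading


def process_in_order(packets):
    """Process packets concurrently while keeping critical events in order."""
    log = []
    # One event per packet; set once that packet's critical event has happened.
    done = [threading.Event() for _ in packets]

    def engine(i, pkt):
        prepared = pkt.upper()      # non-critical work, free to overlap
        if i > 0:
            done[i - 1].wait()      # wait for the preceding engine's signal
        log.append(prepared)        # critical event, must happen in order
        done[i].set()               # tell the next engine it may proceed

    threads = [threading.Thread(target=engine, args=(i, p))
               for i, p in enumerate(packets)]
    # Start in reverse to show that start order does not decide the result.
    for t in reversed(threads):
        t.start()
    for t in threads:
        t.join()
    return log
```

    Only the critical step is serialized; the preparatory work of each engine can still run in parallel with the others.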
  • Patent number: 8904152
    Abstract: Efficient computation of complex multiplication results and very efficient fast Fourier transforms (FFTs) are provided. A parallel array VLIW digital signal processor is employed along with specialized complex multiplication instructions and communication operations between the processing elements which are overlapped with computation to provide very high performance operation. Successive iterations of a loop of tightly packed VLIWs are used allowing the complex multiplication pipeline hardware to be efficiently used. In addition, efficient techniques for supporting combined multiply accumulate operations are described.
    Type: Grant
    Filed: May 26, 2011
    Date of Patent: December 2, 2014
    Assignee: Altera Corporation
    Inventors: Nikos P. Pitsianis, Gerald George Pechanek, Ricardo Rodriguez
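    Illustrative sketch of the arithmetic the abstract above refers to: a complex multiplication built from real multiplies and adds, and the radix-2 butterfly that an FFT applies repeatedly. It shows only the math, not the VLIW instructions or the overlapped communication between processing elements; the names complex_mul and butterfly are hypothetical.

```python
def complex_mul(ar, ai, br, bi):
    """Return (ar + j*ai) * (br + j*bi) as a (real, imag) pair."""
    return ar * br - ai * bi, ar * bi + ai * br


def butterfly(x0, x1, w):
    """Radix-2 butterfly on (real, imag) pairs: x0 + w*x1 and x0 - w*x1."""
    tr, ti = complex_mul(x1[0], x1[1], w[0], w[1])
    return (x0[0] + tr, x0[1] + ti), (x0[0] - tr, x0[1] - ti)
```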
  • Patent number: 8892850
    Abstract: Methods, apparatuses, and computer program products for endpoint-based parallel data processing with non-blocking collective instructions in a parallel active messaging interface (‘PAMI’) of a parallel computer are provided. Embodiments include establishing by a parallel application a data communications geometry, the geometry specifying a set of endpoints that are used in collective operations of the PAMI, including associating with the geometry a list of collective algorithms valid for use with the endpoints of the geometry. Embodiments also include registering in each endpoint in the geometry a dispatch callback function for a collective operation and executing without blocking, through a single one of the endpoints in the geometry, an instruction for the collective operation.
    Type: Grant
    Filed: January 17, 2011
    Date of Patent: November 18, 2014
    Assignee: International Business Machines Corporation
    Inventors: Charles J. Archer, Michael A. Blocksome, Bob R. Cernohous, Joseph D. Ratterman, Brian E. Smith
  • Publication number: 20140337601
    Abstract: An array processor composed of processor cells that are programmed by a controlling unit, and that are reprogrammed when a cell has finished a current data processing operation, even while other cells continue to process data with their current programming.
    Type: Application
    Filed: May 13, 2014
    Publication date: November 13, 2014
    Applicant: PACT XPP TECHNOLOGIES AG
    Inventors: Martin Vorbach, Armin Nuckel
  • Patent number: 8886916
    Abstract: Endpoint-based parallel data processing with non-blocking collective instructions in a PAMI of a parallel computer is disclosed. The PAMI is composed of data communications endpoints, each including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task. The compute nodes are coupled for data communications through the PAMI. The parallel application establishes a data communications geometry specifying a set of endpoints that are used in collective operations of the PAMI by associating with the geometry a list of collective algorithms valid for use with the endpoints of the geometry; registering in each endpoint in the geometry a dispatch callback function for a collective operation; and executing without blocking, through a single one of the endpoints in the geometry, an instruction for the collective operation.
    Type: Grant
    Filed: November 8, 2012
    Date of Patent: November 11, 2014
    Assignee: International Business Machines Corporation
    Inventors: Charles J. Archer, Michael A. Blocksome, Bob R. Cernohous, Joseph D. Ratterman, Brian E. Smith
  • Patent number: 8880809
    Abstract: Embodiments are described for a method for controlling access to memory in a processor-based system comprising monitoring a number of interference events, such as bank contentions, bus contentions, row-buffer conflicts, and increased write-to-read turnaround time caused by a first core in the processor-based system that causes a delay in access to the memory by a second core in the processor-based system; deriving a control signal based on the number of interference events; and transmitting the control signal to one or more resources of the processor-based system to reduce the number of interference events from an original number of interference events.
    Type: Grant
    Filed: October 29, 2012
    Date of Patent: November 4, 2014
    Assignee: Advanced Micro Devices Inc.
    Inventors: Gabriel Loh, James O'Connor
  • Patent number: 8874878
    Abstract: Described embodiments provide a packet classifier for a network processor that generates tasks corresponding to each received packet. The packet classifier includes a scheduler to generate contexts corresponding to tasks received by the packet classifier from processing modules of the network processor. The packet classifier processes threads of instructions, each thread of instructions corresponding to a context received from the scheduler, and each thread associated with a data flow. A thread status table has N entries to track up to N active threads. Each status entry includes a valid status indicator, a sequence value, a thread indicator and a flow indicator. A sequence counter generates a sequence value for each data flow of each thread and is incremented when processing of a thread is started, and is decremented when a thread is completed. Instructions are processed in the order in which the threads were started for each data flow.
    Type: Grant
    Filed: November 28, 2012
    Date of Patent: October 28, 2014
    Assignee: LSI Corporation
    Inventors: Deepak Mital, James Clee, Jerry Pirog
  • Publication number: 20140281374
    Abstract: In a parallel computer, a plurality of logical planes formed of compute nodes of a subcommunicator may be identified by: for each compute node of the subcommunicator and for a number of dimensions beginning with a first dimension: establishing, by a plane building node, in a positive direction of the first dimension, all logical planes that include the plane building node and compute nodes of the subcommunicator in a positive direction of a second dimension, where the second dimension is orthogonal to the first dimension; and establishing, by the plane building node, in a negative direction of the first dimension, all logical planes that include the plane building node and compute nodes of the subcommunicator in the positive direction of the second dimension.
    Type: Application
    Filed: March 12, 2013
    Publication date: September 18, 2014
    Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
    Inventor: INTERNATIONAL BUSINESS MACHINES CORPORATION
  • Patent number: 8830829
    Abstract: Disclosed are methods, systems, paradigms and structures for processing data packets in a communication network by a multi-core network processor. The network processor includes a plurality of multi-threaded core processors and special purpose processors for processing the data packets atomically, and in parallel. An ingress module of the network processor stores the incoming data packets in the memory and adds them to an input queue. The network processor processes a data packet by performing a set of network operations on the data packet in a single thread of a core processor. The special purpose processors perform a subset of the set of network operations on the data packet atomically. An egress module retrieves the processed data packets from a plurality of output queues based on a quality of service (QoS) associated with the output queues, and forwards the data packets towards their destination addresses.
    Type: Grant
    Filed: December 2, 2013
    Date of Patent: September 9, 2014
    Assignee: Unbound Networks, Inc.
    Inventors: Damon Finney, Ashok Mathur
  • Patent number: 8832413
    Abstract: A processing system includes processors and dynamically configurable communication elements (DCCs) coupled together in an interspersed arrangement. A source device may transfer a data item through an intermediate subset of the DCCs to a destination device. The source and destination devices may each correspond to different processors, DCCs, or input/output devices, or mixed combinations of these. In response to detecting a stall after the source device begins transfer of the data item to the destination device and prior to receipt of all of the data item at the destination device, a stalling device is operable to propagate stalling information through one or more of the intermediate subset towards the source device. In response to receiving the stalling information, at least one of the intermediate subset is operable to buffer all or part of the data item.
    Type: Grant
    Filed: May 29, 2013
    Date of Patent: September 9, 2014
    Assignee: Coherent Logix, Incorporated
    Inventors: Michael B. Doerr, William H. Hallidy, David A. Gibson, Craig M. Chase
  • Patent number: 8825924
    Abstract: A computer array (10) has a plurality of computers (12). The computers (12) communicate with each other asynchronously, and the computers (12) themselves operate in a generally asynchronous manner internally. When one computer (12) attempts to communicate with another it goes to sleep until the other computer (12) is ready to complete the transaction, thereby saving power and reducing heat production. A plurality of read lines (18), write lines (20) and data lines (22) interconnect the computers (12). When one computer (12) sets a read line (18) high and the other computer sets a corresponding write line (20) high, data is transferred on the data lines (22). When both the read line (18) and corresponding write line (20) go low, both communicating computers (12) know that the communication is completed. An acknowledge line (72) goes high to restart the computers (12).
    Type: Grant
    Filed: March 4, 2011
    Date of Patent: September 2, 2014
    Assignee: Array Portfolio LLC
    Inventor: Charles H. Moore
  • Patent number: 8769244
    Abstract: The processing load is leveled efficiently. Each processing element configuring an SIMD parallel computer system includes a data storage module that stores data processed or transferred, a number-of-data-sets storage device that stores the number of data sets, and a front data storage device that stores the front data. Each processing element further includes a control processor that compares the number of data sets stored in one processing element with the number of data sets stored in the own processing element, and issues a data distribution leveling instruction that designates an action for updating contents of the data storage module, the number-of-data-sets storage device, and the front data storage device according to a rule determined based on a comparison result of the own processing element and that of the other processing elements and an action for moving the data stored in the one processing element to the own processing element.
    Type: Grant
    Filed: April 8, 2009
    Date of Patent: July 1, 2014
    Assignee: Nec Corporation
    Inventor: Shorin Kyo
  • Patent number: 8761188
    Abstract: In the provided architecture, one or more multi-threaded processors may be combined with hardware blocks. The resulting combination allows for data packets to undergo a processing sequence having the flexibility of software programmability with the high-performance of dedicated hardware. For example, a multi-threaded processor can control the high-level tasks of a processing sequence, while the computationally intensive events (e.g., signal processing filters, matrix operations, etc.) are handled by dedicated hardware blocks.
    Type: Grant
    Filed: April 30, 2008
    Date of Patent: June 24, 2014
    Assignee: Altera Corporation
    Inventors: Anargyros Krikelis, Martin Roberts
  • Publication number: 20140164734
    Abstract: A method and circuit arrangement utilize inactive non-pipelined operation resources in one processing core of a multi-core processing unit to execute non-pipelined instructions on behalf of another processing core in the same processing unit. Adjacent processing cores in a processing unit may be coupled together such that, for example, when one processing core's non-pipelined execution sequencer is busy, that processing core may issue into another processing core's non-pipelined execution sequencer if that other processing core's non-pipelined execution sequencer is idle, thereby providing intermittent concurrent execution of multiple non-pipelined instructions within each individual processing core.
    Type: Application
    Filed: December 6, 2012
    Publication date: June 12, 2014
    Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
    Inventors: Adam J. Muff, Paul E. Schardt, Robert A. Shearer, Matthew R. Tubbs
  • Patent number: 8751772
    Abstract: Hardware and software techniques for interrupt detection and response in a scalable pipelined array processor environment are described. Utilizing these techniques, a sequential program execution model with interrupts can be maintained in a highly parallel scalable pipelined array processor containing multiple processing elements and distributed memories and register files. When an interrupt occurs, interface signals are provided to all PEs to support independent interrupt operations in each PE dependent upon the local PE instruction sequence prior to the interrupt. Processing/element exception interrupts are supported and low latency interrupt processing is also provided for embedded systems where real time signal processing is required. Further, a hierarchical interrupt structure is used allowing a generalized debug approach using debug interrupts and a dynamic debug monitor mechanism.
    Type: Grant
    Filed: June 13, 2013
    Date of Patent: June 10, 2014
    Assignee: Altera Corporation
    Inventors: Edwin Franklin Barry, Patrick R. Marchand, Gerald George Pechanek, Larry D. Larsen
  • Patent number: 8683182
    Abstract: Systems and apparatuses are presented relating a programmable processor comprising an execution unit that is operable to decode and execute instructions received from an instruction path and partition data stored in registers in the register file into multiple data elements, the execution unit capable of executing group data handling operations that re-arrange data elements in different ways in response to data handling instructions, the execution unit further capable of executing a plurality of different group floating-point and group integer arithmetic operations that each arithmetically operates on the multiple data elements stored in registers in the register file to produce a catenated result that is returned to a register in the register file, wherein the catenated result comprises a plurality of individual results.
    Type: Grant
    Filed: June 11, 2012
    Date of Patent: March 25, 2014
    Assignee: Microunity Systems Engineering, Inc.
    Inventors: Craig Hansen, John Moussouris, Alexia Massalin
  • Publication number: 20140075154
    Abstract: The invention provides hardware based techniques for switching processing tasks of software programs for execution on a multi-core processor. Invented techniques involve a hardware logic based controller for assigning, adaptive to program processing loads, tasks for processing by cores of a multi-core fabric as well as configuring a set of multiplexers to appropriately interconnect cores of the fabric and program task specific segments at fabric memories, to arrange efficient inter-task communication as well as transferring of activating and de-activating task memory images among the multi-core fabric. The invention thereby provides an efficient, hardware-automated runtime operating system for multi-core processors, minimizing any need to use processing capacity of the cores for traditional operating system software functions.
    Type: Application
    Filed: September 3, 2013
    Publication date: March 13, 2014
    Inventor: Mark Henrik Sandstrom
  • Patent number: 8665727
    Abstract: A computer-implemented method is described for determining cost in a non-blocking routing network that provides routing functionality using a single level of a plurality of multiplexers in each row of the routing network. The method includes assigning a respective numerical value, represented by bits, to each row of the routing network. A number of bits that differ between the respective numerical values of each pair of rows of the routing network indicates a number of row traversals necessary to traverse from a first row of the pair to a second row of the pair. A signal routing cost is computed from the number of bits that differ between the respective numerical values of the first row and the second row of the routing network. The calculated signal routing cost is provided to a placement module.
    Type: Grant
    Filed: June 21, 2010
    Date of Patent: March 4, 2014
    Assignee: Xilinx, Inc.
    Inventor: Stephen M. Trimberger
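    Illustrative sketch of the cost rule in the abstract above: the number of differing bits between two row numbers (their Hamming distance) gives the number of row traversals, and a routing cost is derived from it. The function names and the cost_per_traversal weight are hypothetical; the abstract does not say how the bit count is scaled into a cost.

```python
def row_traversals(row_a: int, row_b: int) -> int:
    """Count the bits that differ between two row numbers."""
    return bin(row_a ^ row_b).count("1")


def routing_cost(row_a: int, row_b: int, cost_per_traversal: float = 1.0) -> float:
    """Scale the traversal count by an assumed per-traversal weight."""
    return cost_per_traversal * row_traversals(row_a, row_b)
```

    For example, rows 0b0101 and 0b0110 differ in two bit positions, so two row traversals separate them under this rule.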
  • Patent number: 8656141
    Abstract: An integrated circuit includes a plurality of tiles. Each tile includes a pipelined processor configured to process multiple streams of instructions for the processor; and a switch including switching circuitry to forward data over data paths from other tiles to one or more pipeline stages of the processor and to switches of other tiles. At least some of the data is forwarded based on one or more streams of instructions for the switch.
    Type: Grant
    Filed: December 13, 2005
    Date of Patent: February 18, 2014
    Assignee: Massachusetts Institute of Technology
    Inventor: Anant Agarwal
  • Patent number: 8638805
    Abstract: Described embodiments provide for restructuring a scheduling hierarchy of a network processor having a plurality of processing modules and a shared memory. The scheduling hierarchy schedules packets for transmission. The network processor generates tasks corresponding to each received packet associated with a data flow. A traffic manager receives tasks provided by one of the processing modules and determines a queue of the scheduling hierarchy corresponding to the task. The queue has a parent scheduler at each of one or more next levels of the scheduling hierarchy up to a root scheduler, forming a branch of the hierarchy. The traffic manager determines if the queue and one or more of the parent schedulers of the branch should be restructured. If so, the traffic manager drops subsequently received tasks for the branch, drains all tasks of the branch, and removes the corresponding nodes of the branch from the scheduling hierarchy.
    Type: Grant
    Filed: September 30, 2011
    Date of Patent: January 28, 2014
    Assignee: LSI Corporation
    Inventors: Balakrishnan Sundararaman, Shashank Nemawarkar, David Sonnier, Shailendra Aulakh, Allen Vestal
  • Patent number: 8625422
    Abstract: Disclosed are methods, systems, paradigms and structures for processing data packets in a communication network by a multi-core network processor. The network processor includes a plurality of multi-threaded core processors and special purpose processors for processing the data packets atomically, and in parallel. An ingress module of the network processor stores the incoming data packets in the memory and adds them to an input queue. The network processor processes a data packet by performing a set of network operations on the data packet in a single thread of a core processor. The special purpose processors perform a subset of the set of network operations on the data packet atomically. An egress module retrieves the processed data packets from a plurality of output queues based on a quality of service (QoS) associated with the output queues, and forwards the data packets towards their destination addresses.
    Type: Grant
    Filed: March 5, 2013
    Date of Patent: January 7, 2014
    Assignee: Unbound Networks
    Inventors: Damon Finney, Ashok Mathur
  • Patent number: 8607029
    Abstract: A dynamic reconfigurable circuit including a plurality of processing elements each provided with an arithmetic data input port, a configuration data input port and an output port, a data network that is coupled to the arithmetic data input ports and the output ports of the plurality of processing elements, a configuration memory that is coupled via a configuration path to the configuration data input port of a first processor element being at least one of the plurality of processing elements, and an immediate value network that is independent from the data network and that is coupled to the configuration data input port of a second processor element being at least one of the plurality of processing elements. An internal register of a third processor element is coupled to the immediate value network so that data stored in the internal register can be outputted to the immediate value network.
    Type: Grant
    Filed: December 16, 2008
    Date of Patent: December 10, 2013
    Assignee: Fujitsu Semiconductor Limited
    Inventor: Shin-ichi Sutou
  • Patent number: 8572353
    Abstract: Communicating among cores in a computing system comprising a plurality of cores, each core comprising a processor and a switch, includes: routing a packet from an origin core to a destination core over a route including multiple cores; and at each core in the route before the destination core, routing the packet to the next core in the route according to a respective symbol in a sequence of multiple symbols. The respective symbol has a first symbol value indicating a single likely direction and the respective symbol has a second symbol value indicating multiple less likely directions.
    Type: Grant
    Filed: September 20, 2010
    Date of Patent: October 29, 2013
    Assignee: Tilera Corporation
    Inventors: Ian Rudolf Bratt, Carl G. Ramey, Matthew Mattina
  • Publication number: 20130283007
    Abstract: A multi-node video signal processor (VSPN) is described that tightly couples multiple multi-cycle state machines (hardware assist units) to each processor and each memory in each node of an N node scalable array processor. VSPN memory hardware assist instructions are used to initiate multi-cycle state machine functions, to pass parameters to the multi-cycle state machines, to fetch operands from a node's memory, and to control the transfer of results from the multi-cycle state machines.
    Type: Application
    Filed: June 11, 2013
    Publication date: October 24, 2013
    Applicant: ALTERA CORPORATION
    Inventors: Gerald George Pechanek, Mihailo M. Stojancic
  • Patent number: 8549259
    Abstract: Systems, methods and articles of manufacture are disclosed for performing a vector collective operation on a parallel computing system that includes multiple compute nodes and a network connecting the compute nodes that includes an ALU. A collective operation may be performed to determine displacements for the vector collective operation. Descriptors for the vector collective operation may be generated based on the displacements. The vector collective operation may then be performed using the descriptors.
    Type: Grant
    Filed: September 15, 2010
    Date of Patent: October 1, 2013
    Assignee: International Business Machines Corporation
    Inventors: Charles J. Archer, Michael A. Blocksome, Joseph D. Ratterman, Brian E. Smith
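    Illustrative sketch of how displacements can drive a vector (variable-length) collective. The abstract above does not say how the displacements are computed; this example assumes the common convention that each node's displacement is the exclusive prefix sum of the per-node element counts. The names displacements and gather_v are hypothetical, and a plain list stands in for the receive buffer.

```python
from itertools import accumulate


def displacements(counts):
    """Exclusive prefix sum: the offset at which each node's data starts."""
    return list(accumulate(counts, initial=0))[:-1]


def gather_v(chunks):
    """Place variable-length chunks into one buffer at their displacements."""
    counts = [len(c) for c in chunks]
    displs = displacements(counts)
    buf = [None] * sum(counts)
    for chunk, d in zip(chunks, displs):
        buf[d:d + len(chunk)] = chunk
    return buf
```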
  • Patent number: 8549258
    Abstract: A configurable processing apparatus includes a plurality of processing units, at least an instruction synchronization control circuit, and at least a configuration memory. Each processing unit has a stall-output signal generating circuit to output a stall-output signal, wherein the stall-output signal indicates that an unexpected stall has occurred in the processing unit. The processing unit has a stall-in signal, and an external circuit of the processing unit can control whether the processing unit is stalled according to the stall-in signal. The instruction synchronization control circuit generates the stall-in signals to the processing units in response to a content stored in the configuration memory and the stall-output signals of the processing units, so as to determine operation modes and instruction synchronization of the processing units.
    Type: Grant
    Filed: February 7, 2010
    Date of Patent: October 1, 2013
    Assignee: Industrial Technology Research Institute
    Inventors: Tzu-Fang Lee, Chien-Hong Lin, Jing-Shan Liang, Chi-Lung Wang
  • Patent number: 8532288
    Abstract: A cryptographic engine for modulo N multiplication, which is structured as a plurality of almost identical, serially connected Processing Elements, is controlled so as to accept input in blocks that are smaller than the maximum capability of the engine in terms of bits multiplied at one time. The serially connected hardware is thus partitioned on the fly to process a variety of cryptographic key sizes while still maintaining all of the hardware in an active processing state.
    Type: Grant
    Filed: December 1, 2006
    Date of Patent: September 10, 2013
    Assignee: International Business Machines Corporation
    Inventors: Camil Fayad, John K. Li, Siegfried K. H. Sutter, Phil C. Yeh
  • Patent number: 8489857
    Abstract: A parallel processing architecture comprising a cluster of embedded processors that share a common code distribution bus. Pages or blocks of code are concurrently loaded into respective program memories of some or all of these processors (typically all processors assigned to a particular task) over the code distribution bus, and are executed in parallel by these processors. A task control processor determines when all of the processors assigned to a particular task have finished executing the current code page, and then loads a new code page (e.g., the next sequential code page within a task) into the program memories of these processors for execution. The processors within the cluster preferably share a common memory (1 per cluster) that is used to receive data inputs from, and to provide data outputs to, a higher level processor. Multiple interconnected clusters may be integrated within a common integrated circuit device.
    Type: Grant
    Filed: November 5, 2010
    Date of Patent: July 16, 2013
    Assignee: Schism Electronics, L.L.C.
    Inventors: Richard F. Hobson, Bill Ressl, Allan R. Dyck
  • Patent number: 8489858
    Abstract: Hardware and software techniques for interrupt detection and response are provided in a scalable pipelined array processor environment. Utilizing these techniques, a sequential program execution model with interrupts can be maintained in a highly parallel scalable pipelined array processor containing multiple processing elements and distributed memories and register files. When an interrupt occurs, interface signals are provided to all PEs to support independent interrupt operations in each PE dependent upon the local PE instruction sequence prior to the interrupt. Processing/element exception interrupts are supported and low latency interrupt processing is also provided for embedded systems where real time signal processing is required. Further, a hierarchical interrupt structure is used allowing a generalized debug approach using debug interrupts and a dynamic debug monitor mechanism.
    Type: Grant
    Filed: March 12, 2012
    Date of Patent: July 16, 2013
    Assignee: Altera Corporation
    Inventors: Edwin Franklin Barry, Patrick R. Marchand, Gerald George Pechanek, Larry D. Larsen
  • Patent number: 8484276
    Abstract: Techniques are disclosed for converting data into a format tailored for efficient multidimensional fast Fourier transforms (FFTs) on single instruction, multiple data (SIMD) multi-core processor architectures. The technique includes converting data from a multidimensional array stored in a conventional row-major order into SIMD format. Converted data in SIMD format consists of a sequence of blocks, where each block interleaves s rows such that SIMD vector processors may operate on s rows simultaneously. As a result, the converted data in SIMD format enables smaller-sized 1D FFTs to be optimized in SIMD multi-core processor architectures.
    Type: Grant
    Filed: March 18, 2009
    Date of Patent: July 9, 2013
    Assignee: International Business Machines Corporation
    Inventors: David G. Carlson, Travis M. Drucker, Timothy J. Mullins, Jeffrey S. McAllister, Nelson Ramirez
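    Illustrative sketch of the layout change in the abstract above: a row-major 2D array is regrouped into blocks of s rows, and within each block the rows are interleaved element by element so that one vector of width s holds the same column of s different rows. The function name to_simd_format and the requirement that the row count be a multiple of s are assumptions made for this example.

```python
def to_simd_format(rows, s):
    """Interleave each group of s rows element by element."""
    if s <= 0 or len(rows) % s != 0:
        raise ValueError("row count must be a positive multiple of s")
    out = []
    for start in range(0, len(rows), s):
        block = rows[start:start + s]
        # zip(*block) yields one tuple per column: (row0[c], row1[c], ...)
        for column in zip(*block):
            out.extend(column)
    return out
```

    With s = 2, the rows [1, 2, 3] and [4, 5, 6] become 1, 4, 2, 5, 3, 6, so each adjacent pair can be loaded as one two-lane vector.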
  • Publication number: 20130166876
    Abstract: A method and apparatus are described for using a previous column pointer to read a subset of entries of an array in a processor. The array may have a plurality of rows and columns of entries, and each entry in the subset may reside on a different row of the array. A previous column pointer may be generated for each of the rows of the array based on a plurality of bits indicating the number of valid entries in the subset to be read, the previous column pointer indicating whether each entry is in a current column or a previous column. The entries in the subset may be read and re-ordered, and invalid entries in the subset may be replaced with nulls. The valid entries and nulls may then be outputted.
    Type: Application
    Filed: December 21, 2011
    Publication date: June 27, 2013
    Applicant: ADVANCED MICRO DEVICES, INC.
    Inventors: Srikanth Arekapudi, Shloke Hajela
  • Patent number: 8468323
    Abstract: A computer array (10) has a plurality of computers (12). The computers (12) communicate with each other asynchronously, and the computers (12) themselves operate in a generally asynchronous manner internally. When one computer (12) attempts to communicate with another it goes to sleep until the other computer (12) is ready to complete the transaction, thereby saving power and reducing heat production. The sleeping computer (12) can be awaiting data or instructions (12). In the case of instructions, the sleeping computer (12) can be waiting to store the instructions or to immediately execute the instructions. In the latter case, the instructions are placed in an instruction register (30a) when they are received and executed therefrom, without first placing the instructions into memory. The instructions can include a micro-loop (100) which is capable of performing a series of operations repeatedly.
    Type: Grant
    Filed: March 21, 2011
    Date of Patent: June 18, 2013
    Assignee: ARRAY Portfolio LLC
    Inventors: Charles H. Moore, Jeffrey Arthur Fox, John W. Rible
  • Patent number: 8464025
    Abstract: A signal processing apparatus able to raise a processing capability in processing accompanying access to a storing means is provided. Stream control units (SCU) 203_0 to 203_3 access data at an external memory system or local memories 204_0 to 204_3 according to a thread under control from a host processor. Processor unit (PU) arrays 202_0 to 202_3 perform image processing by a different thread from the thread of the SCUs 203_0 to 203_3.
    Type: Grant
    Filed: May 22, 2006
    Date of Patent: June 11, 2013
    Assignee: Sony Corporation
    Inventors: Yuji Yamaguchi, Masatoshi Imai, Toshiharu Noda, Naosuke Asari, Tomoo Mitsunaga, Mitsuharu Ohki, Kazumasa Ito, Hidetoshi Nagano, Sumito Arakawa, Kei Ito
  • Patent number: 8417733
    Abstract: Embodiments of the present invention provide techniques, including systems, methods, and computer readable medium, for dynamic atomic bitsets. A dynamic atomic bitset is a data structure that provides a bitset that can grow or shrink in size as required. The dynamic atomic bitset is non-blocking, wait-free, and thread-safe.
    Type: Grant
    Filed: March 4, 2010
    Date of Patent: April 9, 2013
    Assignee: Oracle International Corporation
    Inventor: Nathan Reynolds
  • Publication number: 20130086354
    Abstract: Methods, apparatuses and storage device associated with cache and/or socket sensitive breadth-first iterative traversal of a graph by parallel threads, are disclosed. In embodiments, a vertices visited array (VIS) may be employed to track graph vertices visited. VIS may be partitioned into VIS sub-arrays, taking into consideration cache sizes of LLC, to reduce likelihood of evictions. In embodiments, potential boundary vertices arrays (PBV) may be employed to store potential boundary vertices for a next iteration, for vertices being visited in a current iteration. The number of PBV generated for each thread may take into consideration a number of sockets, over which the processor cores employed are distributed. In various embodiments, the threads may be load balanced; further data locality awareness to reduce inter-socket communication may be considered, and/or lock-and-atomic free update operations may be employed. Other embodiments may be disclosed or claimed.
    Type: Application
    Filed: September 27, 2012
    Publication date: April 4, 2013
    Inventors: Nadathur Rajagopalan Satish, Changkyu Kim, Jatin Chhuagani, Jason D. Sewall
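    The traversal in the abstract above can be illustrated with a minimal single-threaded sketch. It assumes an adjacency-list graph and keeps only the two structures the abstract names: a visited array (`vis`) and a per-iteration list of potential boundary vertices (`pbv`). The function name `bfs_levels`, the `depth` output, and the absence of VIS partitioning, threads, and socket awareness are simplifications of this sketch, not details from the application.

```python
def bfs_levels(adj, source):
    """Level-synchronous BFS; returns the depth of each vertex (-1 if unreachable)."""
    vis = [False] * len(adj)      # VIS: vertices visited so far
    depth = [-1] * len(adj)
    vis[source] = True
    depth[source] = 0
    frontier = [source]
    level = 0
    while frontier:
        # Phase 1: collect potential boundary vertices for the next iteration.
        # Duplicates and already-visited vertices are allowed at this stage.
        pbv = []
        for v in frontier:
            pbv.extend(adj[v])
        # Phase 2: filter the PBV list against VIS to form the real frontier.
        level += 1
        next_frontier = []
        for w in pbv:
            if not vis[w]:
                vis[w] = True
                depth[w] = level
                next_frontier.append(w)
        frontier = next_frontier
    return depth
```

    Splitting each iteration into a collect phase and a filter phase is what lets a parallel version defer the VIS check; here both phases simply run in sequence.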
  • Patent number: 8370605
    Abstract: A system includes first and second processors, first and second graphics processing units (GPUs), one or more peripheral devices, a switch matrix, and processor-readable memory. The switch matrix comprises programmable data paths between the processors, the GPUs, and the peripheral devices. Software encoded in the processor-readable memory includes a first operating system (OS) executed by the first processor, a second OS executed by the second processor, a matrix scheduling engine, and a media interface switch (MIS) engine. The first OS boots faster than the second OS. The matrix scheduling engine runs on both OSs and configures the data paths in the switch matrix to couple the processors and the GPUs, and to couple the processors and the peripheral devices. The MIS engine runs on the operating systems, detects presence of the peripheral devices, and configures the data paths in the switch matrix to couple the processors and the peripheral devices.
    Type: Grant
    Filed: November 11, 2009
    Date of Patent: February 5, 2013
    Assignee: Sunman Engineering, Inc.
    Inventors: Allen Nejah, Gholam Reza Golshan, George W. Harvey
  • Publication number: 20130019082
    Abstract: An array processor includes processing elements (PEs) arranged in clusters to form a rectangular array. Inter-cluster communication paths are mutually exclusive. Due to the mutual exclusivity of the data paths, communications between the processing elements of each cluster may be combined in a single inter-cluster path, thus eliminating half the wiring required for the path. The length of the longest communication path is not directly determined by the overall dimension of the array, as in conventional torus arrays. Rather, the longest communications path is limited by the inter-cluster spacing. Transpose elements of an N×N torus may be combined in clusters and communicate with one another through intra-cluster communications paths. Transpose operation latency is eliminated in this approach. Each PE may have a single transmit port and a single receive port. Thus, the individual PEs are decoupled from the array topology.
    Type: Application
    Filed: September 14, 2012
    Publication date: January 17, 2013
    Applicant: ALTERA CORPORATION
    Inventors: Gerald G. Pechanek, Charles W. Kurak, JR.
  • Publication number: 20120311360
    Abstract: In one embodiment, a multi-core processor includes multiple cores and an uncore, where the uncore includes various logic units including a cache memory, a router, and a power control unit (PCU). The PCU can clock gate at least one of the logic units and the cache memory when the multi-core processor is in a low power state to thus reduce dynamic power consumption.
    Type: Application
    Filed: May 31, 2011
    Publication date: December 6, 2012
    Inventors: Srikanth Balasubramanian, Tessil Thomas, Satish Shrimali, Baskaran Ganesan
  • Publication number: 20120311299
    Abstract: A novel massively parallel supercomputer of hundreds of teraOPS-scale includes node architectures based upon System-On-a-Chip technology, i.e., each processing node comprises a single Application Specific Integrated Circuit (ASIC). Within each ASIC node is a plurality of processing elements each of which consists of a central processing unit (CPU) and plurality of floating point processors to enable optimal balance of computational performance, packaging density, low cost, and power and cooling requirements. The plurality of processors within a single node individually or simultaneously work on any combination of computation or communication as required by the particular algorithm being solved. The system-on-a-chip ASIC nodes are interconnected by multiple independent networks that maximize packet communications throughput and minimize latency.
    Type: Application
    Filed: August 3, 2012
    Publication date: December 6, 2012
    Applicant: International Business Machines Corporation
    Inventors: Matthias A. Blumrich, Dong Chen, George L. Chiu, Thomas M. Cipolla, Paul W. Coteus, Alan G. Gara, Mark E. Giampapa, Philip Heidlberger, Gerard V. Kopcsay, Lawrence S. Mok, Todd E. Takken
  • Patent number: 8327114
    Abstract: In some embodiments, processor-to-processor and/or broadcast proxies are designated in a microprocessor matrix comprising a plurality of mesh-interconnected matrix processors when default processor-to-processor or broadcast routing algorithms used by data switches within the matrix to route messages would not deliver the messages to all intended recipients. The broadcast proxies broadcast messages within individual non-overlapping broadcast domains of the matrix. P-to-P and broadcast proxies may be designated as part of a boot-time testing/initialization sequence. Improving system fault tolerance allows improving semiconductor processing yields, which may be of particular significance in relatively large integrated circuits including large numbers of relatively-complex matrix processors.
    Type: Grant
    Filed: July 7, 2008
    Date of Patent: December 4, 2012
    Assignee: Ovics
    Inventors: Sorin C Cismas, Ilie Garbacea
  • Patent number: 8312053
    Abstract: Embodiments of the present invention provide techniques, including systems, methods, and computer readable medium, for dynamic atomic arrays. A dynamic atomic array is a data structure that provides an array that can grow or shrink in size as required. The dynamic atomic array is non-blocking, wait-free, and thread-safe. The dynamic atomic array may be used to provide arrays of any primitive data type as well as complex types, such as objects.
    Type: Grant
    Filed: September 11, 2009
    Date of Patent: November 13, 2012
    Assignee: Oracle International Corporation
    Inventor: Nathan Reynolds
  • Patent number: 8276116
    Abstract: An algebra operation method includes the steps of converting algebra operations for a plurality of objects which appear in a program into an algebra operation sequence object described using object access data used to access the plurality of objects and object state data used to store states associated with the plurality of objects without immediately evaluating the algebra operations, determining a function to be applied to the algebra operation sequence object, and evaluating the algebra operations by executing the function by designating an argument group required for the function in response to a call of a substitute operator.
    Type: Grant
    Filed: June 7, 2007
    Date of Patent: September 25, 2012
    Assignee: Canon Kabushiki Kaisha
    Inventor: Yasuhiro Nakahara
  • Patent number: 8200948
    Abstract: An apparatus and method are provided for performing re-arrangement operations on data. The data processing apparatus has a register data store with a plurality of registers for storing data, and processing logic for performing a sequence of operations on data including at least one re-arrangement operation. The processing logic has scalar processing logic for performing scalar operations and SIMD processing logic for performing SIMD operations. The SIMD processing logic is responsive to a re-arrangement instruction specifying a family of re-arrangement operations to perform a selected re-arrangement operation from that family on a plurality of data elements constituted by data in one or more registers identified by the re-arrangement instruction. The selected re-arrangement operation is dependent on at least one parameter provided by the scalar processing logic, that parameter identifying a data element width for the data elements on which the selected re-arrangement operation is performed.
    Type: Grant
    Filed: December 4, 2007
    Date of Patent: June 12, 2012
    Assignee: ARM Limited
    Inventors: Daniel Kershaw, Dominic Hugo Symes, Alastair Reid
  • Patent number: 8195921
    Abstract: A microprocessor capable of decoding a plurality of instructions associated with a plurality of threads is disclosed. The microprocessor may comprise a first array comprising a first plurality of microcode operations associated with an instruction from within the plurality of instructions, the first array capable of delivering a first predetermined number of microcode operations from the first plurality of microcode operations. The microprocessor may further comprise a second array comprising a second plurality of microcode operations, the second array capable of providing one or more of the second plurality of microcode operations in the event that the instruction decodes into more than the first predetermined number of microcode operations. The microprocessor may further comprise an arbiter coupled between the first and second arrays, where the arbiter may determine which thread from the plurality of threads accesses the second array.
    Type: Grant
    Filed: July 9, 2008
    Date of Patent: June 5, 2012
    Assignee: Oracle America, Inc.
    Inventors: Robert Golla, Manish Shah
  • Patent number: 8185719
    Abstract: Each processor node in an array of nodes has a respective local node address, and each local node address comprises a plurality of components having an order of addressing significance from most to least significant. Each node comprises: mapping means configured to map each component of the local node address onto a respective routing direction, and a switch arranged to receive a message having a destination node address identifying a destination node. The switch comprises: means for comparing the local node address to the destination node address to identify the most significant non-matching component; and means for routing the message to another node, on the condition that the local node address does not match the destination node address, in the direction mapped to the most significant non-matching component.
    Type: Grant
    Filed: November 18, 2010
    Date of Patent: May 22, 2012
    Assignee: XMOS Limited
    Inventor: Michael David May
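    The routing rule in the abstract above can be sketched in a few lines. The sketch assumes addresses are tuples of components ordered from most to least significant, and that each node holds a list mapping component positions to directions; the function name `route`, the tuple encoding, and the direction labels are hypothetical and chosen only for illustration.

```python
def route(local_addr, dest_addr, direction_for):
    """Pick the outgoing direction for a message, or None if it has arrived.

    local_addr, dest_addr: equal-length tuples, most significant component first.
    direction_for: direction_for[i] is the direction mapped to component i.
    """
    for i, (local, dest) in enumerate(zip(local_addr, dest_addr)):
        if local != dest:
            # First mismatch when scanning from the most significant end
            # is the most significant non-matching component.
            return direction_for[i]
    return None  # addresses match: this node is the destination


# Example: components 0 and 2 match, component 1 differs.
print(route((1, 0, 1), (1, 1, 1), ["north", "east", "south"]))  # east
```

    Because each hop resolves the most significant mismatch first, a message corrects one address component at a time until the addresses match.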
  • Publication number: 20120110302
    Abstract: A method, a system and a computer program product for effectively accelerating loop iterators using speculative execution of iterators. An Efficient Loop Iterator (ELI) utility detects initiation of a target program and initiates/spawns a speculative iterator thread at the start of the basic code block ahead of the code block that initiates a nested loop. The ELI utility assigns the iterator thread to a dedicated processor in a multi-processor system. The speculative thread runs/executes ahead of the execution of the nested loop and calculates indices in a corresponding multidimensional array. The iterator thread adds all the precomputed indices to a single queue. As a result, the ELI utility effectively enables a multidimensional loop to be replaced by a single dimensional loop. At the beginning of (or during) each iteration of the iterator, the ELI utility “dequeues” an entry from the queue to use the entry to access the array upon which the ELI utility iterates.
    Type: Application
    Filed: November 2, 2010
    Publication date: May 3, 2012
    Applicant: IBM Corporation
    Inventors: Ganesh Bikshandi, Dibyendu Das, Smruti Ranjan Sarangi
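    The idea in the abstract above, a helper thread that precomputes multidimensional indices into a single queue so the consumer runs a one-dimensional loop, can be sketched as follows. This is an illustration under assumptions: the function name `sum_with_index_queue`, the `DONE` sentinel, and the summation workload are hypothetical, and a Python thread stands in for the dedicated processor described in the application.

```python
import itertools
import queue
import threading


def sum_with_index_queue(array, shape):
    """Sum a 2-D array by draining a queue of precomputed (i, j) indices."""
    index_queue = queue.Queue()
    DONE = object()  # sentinel marking the end of the index stream

    def iterator_thread():
        # Runs ahead of the consumer and enqueues every index of the nested loop.
        for idx in itertools.product(*(range(n) for n in shape)):
            index_queue.put(idx)
        index_queue.put(DONE)

    producer = threading.Thread(target=iterator_thread)
    producer.start()

    total = 0
    while True:  # single-dimensional loop replacing the nested loops
        idx = index_queue.get()
        if idx is DONE:
            break
        i, j = idx
        total += array[i][j]

    producer.join()
    return total
```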
  • Publication number: 20120096238
    Abstract: The present invention discloses a circuit and a method for parallel perforation in rate matching, which can reduce the perforation processing time delay to satisfy the requirements of a Long Term Evolution (LTE). Both the circuit and the method can adopt three selector arrays and three register groups. Specifically, the first selector array is configured to remove null bits in input data and output the remaining data to the first register group; the second selector array is configured to combine the first register group and the third register group and then output the combined data to the second register group; during the combination, the valid data in the third register group are preferentially selected, and then the data in the first register group are selected; when the second register group is full, the data therein are output to the exterior as the results of the perforation processing.
    Type: Application
    Filed: June 29, 2010
    Publication date: April 19, 2012
    Applicant: ZTE CORPORATION
    Inventor: Ziyu Wen
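    A software sketch of the data flow in the abstract above follows. It assumes null bits are represented by `None`, that each input word is processed in one step, and that the third register group holds valid bits left over from the previous step; that last point is an inference of this sketch, since the abstract does not spell out how the third group is filled. The name `puncture` and the list-based registers are hypothetical.

```python
def puncture(words, width):
    """Strip null bits (None) from input words and repack into full-width outputs.

    Returns (outputs, carry): the completed output words and any leftover bits.
    """
    outputs = []
    carry = []  # stands in for the third register group (assumed leftover store)
    for word in words:
        # First selector array: drop null bits; result is the first register group.
        first_group = [bit for bit in word if bit is not None]
        # Second selector array: carried bits are selected first, then new bits.
        combined = carry + first_group
        # Emit the second register group each time it fills.
        while len(combined) >= width:
            outputs.append(combined[:width])
            combined = combined[width:]
        carry = combined
    return outputs, carry
```

    Taking the carried bits ahead of the new ones preserves the original bit order across word boundaries.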
  • Patent number: 8161267
    Abstract: Hardware and software techniques for interrupt detection and response in a scalable pipelined array processor environment are described. Utilizing these techniques, a sequential program execution model with interrupts can be maintained in a highly parallel scalable pipelined array processor containing multiple processing elements and distributed memories and register files. When an interrupt occurs, interface signals are provided to all PEs to support independent interrupt operations in each PE dependent upon the local PE instruction sequence prior to the interrupt. Processing element exception interrupts are supported and low latency interrupt processing is also provided for embedded systems where real time signal processing is required. Further, a hierarchical interrupt structure is used allowing a generalized debug approach using debug interrupts and a dynamic debug monitor mechanism.
    Type: Grant
    Filed: November 30, 2010
    Date of Patent: April 17, 2012
    Assignee: Altera Corporation
    Inventors: Edwin Franklin Barry, Patrick R. Marchand, Gerald George Pechanek, Larry D. Larsen
  • Patent number: 8151090
    Abstract: A systolic data processing apparatus includes a processing element (PE) array and control unit. The PE array comprises a plurality of PEs, each PE executing a thread with respect to different data according to an input instruction and pipelining the instruction at each cycle for executing a program. The control unit inputs a new instruction to a first PE of the PE array at each cycle.
    Type: Grant
    Filed: February 17, 2009
    Date of Patent: April 3, 2012
    Assignee: Samsung Electronics Co., Ltd.
    Inventors: Gi-Ho Park, Shin-Dug Kim, Jung-Wook Park, Hoon-Mo Yang, Sung-Bae Park
  • Patent number: 8151089
    Abstract: A multiplicity of processor elements that are arranged in rows and columns individually execute data processing in accordance with instruction codes that are individually set as data and supply event data as output. A state control unit is composed of a plurality of units that successively switch the instruction codes of the multiplicity of processor elements in accordance with a computer program and the event data, these state control units communicating with each other to realize linked operation as necessary. An event distributing means distributes event data to this plurality of state control units that intercommunicate to realize linked operation, whereby the plurality of state control units can realize linked operation to control a large-scale state transition.
    Type: Grant
    Filed: October 29, 2003
    Date of Patent: April 3, 2012
    Assignee: Renesas Electronics Corporation
    Inventors: Taro Fujii, Koichiro Furuta, Masato Motomura, Kenichiro Anjo, Yoshikazu Yabe, Toru Awashima, Takao Toi, Noritsugu Nakamura