Single Instruction, Multiple Data (simd) Patents (Class 712/22)
  • Patent number: 8108846
    Abstract: A mechanism is provided for performing scalar operations using a SIMD data parallel execution unit. With the mechanisms of the illustrative embodiments, scalar operations in application code are identified that may be executed using vector operations in a SIMD data parallel execution unit. The scalar operations are converted, such as by a static or dynamic compiler, into one or more vector load instructions and one or more vector computation instructions. In addition, control words may be generated to adjust the alignment of the scalar values for the scalar operation within the vector registers to which these scalar values are loaded using the vector load instructions. The alignment amounts for adjusting the scalar values within the vector registers may be statically or dynamically determined.
    Type: Grant
    Filed: May 28, 2008
    Date of Patent: January 31, 2012
    Assignee: International Business Machines Corporation
    Inventor: Michael K. Gschwind
  • Patent number: 8108653
    Abstract: A processor includes a compute array comprising a first plurality of compute engines serially connected along a data flow path such that data flows between successive compute engines at successive times. The first plurality of compute engines includes an initial compute engine and a final compute engine. The data flow path includes a recirculation path connecting the final compute engine to the initial compute engine with no compute engine therebetween.
    Type: Grant
    Filed: February 5, 2010
    Date of Patent: January 31, 2012
    Assignee: Analog Devices, Inc.
    Inventors: Boris Lerner, Douglas Garde
  • Patent number: 8103854
    Abstract: A control processor is used for fetching and distributing single instruction multiple data (SIMD) instructions to a plurality of processing elements (PEs). One of the SIMD instructions is a thread start (Tstart) instruction, which causes the control processor to pause its instruction fetching. A local PE instruction memory (PE Imem) is associated with each PE and contains local PE instructions for execution on the local PE. Local PE Imem fetch, decode, and execute logic are associated with each PE. Instruction path selection logic in each PE is used to select between control processor distributed instructions and local PE instructions fetched from the local PE Imem. Each PE is also initialized to receive control processor distributed instructions. In addition, local hold generation logic is associated with each PE. A PE receiving a Tstart instruction causes the instruction path selection logic to switch to fetch local PE Imem instructions.
    Type: Grant
    Filed: April 12, 2010
    Date of Patent: January 24, 2012
    Assignee: Altera Corporation
    Inventors: Gerald George Pechanek, Edwin Franklin Barry, Mihailo M. Stojancic
  • Patent number: 8099584
    Abstract: Parallelism in a parallel processing subsystem is exploited in a scalable manner. A problem to be solved can be hierarchically decomposed into at least two levels of sub-problems. Individual threads of program execution are defined to solve the lowest-level sub-problems. The threads are grouped into one or more thread arrays, each of which solves a higher-level sub-problem. The thread arrays are executable by processing cores, each of which can execute at least one thread array at a time. Thread arrays can be grouped into grids of independent thread arrays, which solve still higher-level sub-problems or an entire problem. Thread arrays within a grid, or entire grids, can be distributed across all of the available processing cores as available in a particular system implementation.
    Type: Grant
    Filed: May 2, 2011
    Date of Patent: January 17, 2012
    Assignee: NVIDIA Corporation
    Inventors: John R. Nickolls, Stephen D. Lew
  • Patent number: 8085264
    Abstract: A method for multiple queue output buffering in a raster stage of a graphics processor. The method includes receiving a graphics primitive for rasterization in a raster stage of a graphics processor. The graphics primitive is rasterized at a first level to generate a plurality of tiles of pixels related to the graphics primitive. Each tile is then rasterized to determine related sub-portions of each tile. The related sub-portions are transferred to a plurality of output queues. The related sub-portions are subsequently output on a per queue basis and on a per clock cycle basis.
    Type: Grant
    Filed: July 26, 2006
    Date of Patent: December 27, 2011
    Assignee: NVIDIA Corporation
    Inventors: Franklin C. Crow, Jeffrey R. Sewall
  • Patent number: 8086830
    Abstract: An arithmetic processing apparatus capable of performing an arithmetic operation for generating a condition flag commonly referred to by using a condition flag generated on an arithmetic operation unit basis in as few steps as possible is provided. The arithmetic processing apparatus, which processes multiple data in parallel based on single instruction, includes: processing elements capable of performing a common arithmetic operation based on the evaluation result of the instruction stored in the instruction register; and a condition flag arithmetic operation unit capable of performing one of the logical operation and the comparison operation on the condition flag retained in each processing element, transferring the operation result to each processing element, and updating the condition flag based on the operation result.
    Type: Grant
    Filed: August 24, 2005
    Date of Patent: December 27, 2011
    Assignee: Panasonic Corporation
    Inventors: Takeshi Furuta, Hideshi Nishida, Takeshi Tanaka
  • Patent number: 8082419
    Abstract: According to some embodiments, a technique provides for the execution of an instruction that includes receiving residual data of a first image and decoded pixels of a second image, zero-extending a plurality of unsigned data operands of the decoded pixels producing a plurality of unpacked data operands, adding a plurality of signed data operands of the residual data to the plurality of unpacked data operands producing a plurality of signed results; and saturating the plurality of signed results producing a plurality of unsigned results.
    Type: Grant
    Filed: March 30, 2004
    Date of Patent: December 20, 2011
    Assignee: Intel Corporation
    Inventors: Bradley C. Aldrich, Nigel C. Paver, Murli Ganeshan
  • Patent number: 8078836
    Abstract: In-lane vector shuffle operations are described. In one embodiment a shuffle instruction specifies a field of per-lane control bits, a source operand and a destination operand, these operands having corresponding lanes, each lane divided into corresponding portions of multiple data elements. Sets of data elements are selected from corresponding portions of every lane of the source operand according to per-lane control bits. Elements of these sets are copied to specified fields in corresponding portions of every lane of the destination operand. Another embodiment of the shuffle instruction also specifies a second source operand, all operands having corresponding lanes divided into multiple data elements. A set selected according to per-lane control bits contains data elements from every lane portion of a first source operand and data elements from every corresponding lane portion of the second source operand. Set elements are copied to specified fields in every lane of the destination operand.
    Type: Grant
    Filed: December 30, 2007
    Date of Patent: December 13, 2011
    Assignee: Intel Corporation
    Inventors: Zeev Sperber, Robert Valentine, Benny Eitan, Doron Orenstein
  • Patent number: 8074058
    Abstract: The present invention provides extended precision in SIMD arithmetic operations in a processor having a register file and an accumulator. A first set of data elements and a second set of data elements are loaded into first and second vector registers, respectively. Each data element comprises N bits. Next, an arithmetic instruction is fetched from memory. The arithmetic instruction is decoded. Then, the first vector register and the second vector register are read from the register file. The present invention executes the arithmetic instruction on corresponding data elements in the first and second vector registers. The resulting element of the execution is then written into the accumulator. Then, the resulting element is transformed into an N-bit width element and written into a third register for further operation or storage in memory. The transformation of the resulting element can include, for example, rounding, clamping, and/or shifting the element.
    Type: Grant
    Filed: June 8, 2009
    Date of Patent: December 6, 2011
    Assignee: MIPS Technologies, Inc.
    Inventors: Timothy J. Van Hook, Peter Hsu, William A. Huffman, Henry P. Moreton, Earl A. Killian
  • Patent number: 8069334
    Abstract: The present invention provides histogram calculation for images and video applications using a SIMD and VLIW processor with vector Look-Up Table (LUT) operations. This provides a speed up of histogram calculation by a factor of N times over a scalar processor where the SIMD processor could perform N LUT operations per instruction. Histogram operation is partitioned into a vector LUT operation, followed by vector increment, vector LUT update, and at the end by reduction of vector histogram components. The present invention could be used for intensity, RGBA, YUV, and other type of multi-component images.
    Type: Grant
    Filed: March 12, 2009
    Date of Patent: November 29, 2011
    Inventor: Tibet Mimar
  • Patent number: 8060725
    Abstract: A processor architecture for multimedia applications includes processor clusters providing vectorial data processing capability. Processing elements in the processor clusters process both data with a bit length N and data with bit lengths N/2, N/4, and so on according to a Single Instruction Multiple Data (SIMD) function. A load unit loads into the processor clusters data to be processed according to a same instruction. An intercluster data path exchanges data between the processor clusters. The intercluster data path is scalable to activate selected processor clusters. The processor operates simultaneously on SIMD, scalar and vectorial data.
    Type: Grant
    Filed: June 26, 2007
    Date of Patent: November 15, 2011
    Assignees: STMicroelectronics S.R.L., STMicroelectronics N.V.
    Inventors: Francesco Pappalardo, Giuseppe Notarangelo, Elena Salurso, Elio Guidetti
  • Patent number: 8060726
    Abstract: A SIMD microprocessor, which can be included in an image processing apparatus using an image processing method used therein, includes a global processor and multiple processor elements controlled by the global processor. Each single processor element of the multiple processor elements includes multiple operation units. The global processor is configured to control the multiple processing elements to uniformly change a configuration of the multiple operation units in the single processor element to determine a number of data units of operation according to the multiple operation units either operated individually or in cooperation with each other in the single processor element and a width of data processed per data unit of operation performed in the single processor element. A processor element number is assigned per data unit of operation to the single processor element to use for executing an operation.
    Type: Grant
    Filed: February 27, 2008
    Date of Patent: November 15, 2011
    Assignee: Ricoh Company, Ltd.
    Inventor: Tomoaki Ozaki
  • Patent number: 8060724
    Abstract: Executing a first memory access instruction with update by an N-bit processor includes accessing at least one source register of a plurality of registers, wherein the accessing includes accessing a first register, wherein each register of the plurality of registers includes a main portion of N bits and an extension portion of M bits, wherein the main portion of the first register includes a first address operand. The execution of the first instruction further includes forming a memory access address using the first address operand; using the memory access address as an address for a memory access; producing an updated address operand; and writing the updated address operand to the main portion of the first register. The producing includes accessing an extension portion of a source register of the at least one source register to obtain modifying information and using the modifying information in the producing an updated address operand.
    Type: Grant
    Filed: August 15, 2008
    Date of Patent: November 15, 2011
    Assignee: Freescale Semiconductor, Inc.
    Inventor: William C. Moyer
  • Patent number: 8056069
    Abstract: A method, computer program product, and information handling system for generating loop code to execute on Single-Instruction Multiple-Datapath (SIMD) architectures, where the loop contains multiple non-stride-one memory accesses that operate over a contiguous stream of memory is disclosed. A preferred embodiment identifies groups of isomorphic statements within a loop body where the isomorphic statements operate over a contiguous stream of memory over the iteration of the loop. Those identified statements are then converted into virtual-length vector operations. Next, the hardware's available vector length is used to determine a number of virtual-length vectors to aggregate into a single vector operation for each iteration of the loop. Finally, the aggregated, vectorized loop code is converted into SIMD operations.
    Type: Grant
    Filed: September 17, 2007
    Date of Patent: November 8, 2011
    Assignee: International Business Machines Corporation
    Inventors: Alexandre E. Eichenberger, Kai-Ting Amy Wang, Peng Wu
  • Patent number: 8041927
    Abstract: A processor (and method) of processing multiple data by a single instruction includes first and second register sets each of which includes a plurality of registers, and an arithmetic unit to rearrange data being registered in the first and second register sets according to a relative size of an absolute value of the data between the first and second register sets so that the relative size is defined before executing an instruction considering the relative size.
    Type: Grant
    Filed: April 7, 2009
    Date of Patent: October 18, 2011
    Assignee: NEC Corporation
    Inventor: Yusuke Kobayashi
  • Patent number: 8032735
    Abstract: A method includes, in a processor, loading/moving a first portion of bits of a source into a first portion of a destination register and duplicate that first portion of bits in a subsequent portion of the destination register.
    Type: Grant
    Filed: November 5, 2010
    Date of Patent: October 4, 2011
    Assignee: Intel Corporation
    Inventor: Patrice Roussel
  • Patent number: 8028150
    Abstract: A method and system for decoding and modifying processor instructions in runtime according to certain rules in order to separately control processing elements embedded within a multi-processor array using a single instruction. The present invention allows multiple processing elements and/or execution units in a multi-processor array to perform different operations, based upon a variable or variables such as their location in the multi-processor array, while accepting a single instruction as an input.
    Type: Grant
    Filed: November 16, 2007
    Date of Patent: September 27, 2011
    Inventors: Shlomo Selim Rakib, Yoram Zarai
  • Patent number: 8024550
    Abstract: Disclosed is an SIMD-type microprocessor comprising a processor element group, plural processor elements with an operation part and a register file being arranged therein and a processor element control signal generator configured to output a processor element control signal controlling an operation of the processor element, wherein a feed part configured to feed a processor element control signal output from the processor element control signal generator to the processor element is provided at a center of the processor element group.
    Type: Grant
    Filed: January 21, 2009
    Date of Patent: September 20, 2011
    Assignee: Ricoh Company, Ltd.
    Inventor: Hidehito Kitamura
  • Patent number: 8024531
    Abstract: An ascending ordered list without duplication is generated based on a value list divided and held by multiple memory modules. An information processing system has multiple PMMs (Processor Memory Modules), and the PMMs are interconnected via a data transmission path. The memory in the PMM has a list of values, which are ordered in ascending or descending order without duplication. The PMM determines, for a storage value in the value list (LOCAL_LIST) held by the PMM, whether or not the memory module is a representative module representing one or more memory modules holding the storage value based on rankings determined for the individual PMMs and the value lists received from the other PMMs, and if the memory module is determined to be the representative module (RV-0 . . . RV-7), associates to the storage value and stores information indicating that the memory module is the representative module.
    Type: Grant
    Filed: April 17, 2006
    Date of Patent: September 20, 2011
    Assignee: Turbo Data Laboratories, Inc.
    Inventor: Shinji Furusho
  • Patent number: 8024549
    Abstract: A data processor apparatus comprises a plurality of data receiving means each for receiving data from a data source; a computational element coupleable to each of said data receiving means for performing an operation on said data; and a controller for controlling the flow of data from each data receiving means to the computational element.
    Type: Grant
    Filed: March 3, 2006
    Date of Patent: September 20, 2011
    Assignee: Mtekvision Co., Ltd.
    Inventor: Malcolm Stewart
  • Publication number: 20110211726
    Abstract: This invention provides a system and method for processing discrete image data within an overall set of acquired image data based upon a focus of attention within that image. The result of such processing is to operate upon a more limited subset of the overall image data to generate output values required by the vision system process. Such output value can be a decoded ID or other alphanumeric data. The system and method is performed in a vision system having two processor groups, along with a data memory that is smaller in capacity than the amount of image data to be read out from the sensor array. The first processor group is a plurality of SIMD processors and at least one general purpose processor, co-located on the same die with the data memory. A data reduction function operates within the same clock cycle as data-readout from the sensor to generate a reduced data set that is stored in the on-die data memory.
    Type: Application
    Filed: May 17, 2010
    Publication date: September 1, 2011
    Applicant: COGNEX CORPORATION
    Inventors: Michael C. Moed, E. John McGarry
  • Patent number: 8010953
    Abstract: Performing scalar operations using a SIMD data parallel execution unit is provided. With the mechanisms of the illustrative embodiments, scalar operations in application code are identified that may be executed using vector operations in a SIMD data parallel execution unit. The scalar operations are converted, such as by a static or dynamic compiler, into one or more vector load instructions and one or more vector computation instructions. In addition, control words may be generated to adjust the alignment of the scalar values for the scalar operation within the vector registers to which these scalar values are loaded using the vector load instructions. The alignment amounts for adjusting the scalar values within the vector registers may be statically or dynamically determined.
    Type: Grant
    Filed: April 4, 2006
    Date of Patent: August 30, 2011
    Assignee: International Business Machines Corporation
    Inventor: Michael K. Gschwind
  • Publication number: 20110208946
    Abstract: Disclosed are various embodiments of a stream processing unit for single instruction multiple data (SIMD) processing, wherein the stream processing unit executes a stage of a Multiply-Accumulate calculation. In one embodiment, the stream processing unit comprises a plurality of scalar arithmetic logic units (ALUs) configured to receive data having a plurality of data types. The number and type of scalar ALUs corresponds to an SIMD factor. In one embodiment, the scalar ALUs are executed sequentially with a delay being introduced in between execution of each of the scalar ALUs, wherein the delay corresponds to the SIMD factor.
    Type: Application
    Filed: May 4, 2011
    Publication date: August 25, 2011
    Applicant: VIA TECHNOLOGIES, INC.
    Inventors: Boris Prokopenko, Timour Paltashev, Derek Gladding
  • Patent number: 8001506
    Abstract: A disclosed image processing apparatus includes a SIMD microprocessor in which multiple processor elements are arranged in one dimension, each of the processor elements including multiple access registers arranged in stages for storing image data; and multiple data processing devices corresponding one-to-one with the stages of the access registers, arranged in one dimension in the same direction as the processor elements, and configured to read and write image data from/to the access registers. The access registers of each of the stages, each of which access registers is included in a different one of the processor elements, are connected with a common line. Wiring outlets, each of which connects the common line of a different one of the stages to a corresponding data processing device, are individually disposed within the SIMD microprocessor in such a manner that each wiring outlet has a shortest possible distance to the corresponding data processing device.
    Type: Grant
    Filed: January 21, 2009
    Date of Patent: August 16, 2011
    Assignee: Ricoh Company, Ltd.
    Inventor: Tomoaki Ozaki
  • Patent number: 7996572
    Abstract: Systems and methods of managing transactions provide for receiving a first flush command at a first I/O hub, wherein the first flush command is dedicated to non-posted transactions. One embodiment further provides for halting an inbound ordering queue of the first I/O hub with regard to non-posted transactions in response to the first flush command and flushing a non-posted transaction from an outgoing buffer of the first I/O hub to a second I/O hub while the inbound ordering queue is halted with regard to non-posted transactions.
    Type: Grant
    Filed: June 2, 2004
    Date of Patent: August 9, 2011
    Assignee: Intel Corporation
    Inventors: Robert G. Blankenship, Robert J. Greiner, Herbert H. J. Hum, Kenneth C. Creta, Buderya S. Acharya
  • Publication number: 20110191567
    Abstract: Improvements Relating to Single Instruction Multiple Data (SIMD) Architectures A parallel processor for processing a plurality of different processing instruction streams in parallel is described. The processor comprises a plurality of data processing units; and a plurality of SIMD (Single Instruction Multiple Data) controllers, each connectable to a group of data processing units of the plurality of data processing units, and each SIMD controller arranged to handle an individual processing task with a subgroup of actively connected data processing units selected from the group of data processing units. The parallel processor is arranged to vary dynamically the size of the subgroup of data processing units to which each SIMD controller is actively connected under control of received processing instruction streams, thereby permitting each SIMD controller to be actively connected to a different number of processing units for different processing tasks.
    Type: Application
    Filed: May 20, 2009
    Publication date: August 4, 2011
    Inventors: John Lancaster, Martin Whitaker
  • Publication number: 20110185150
    Abstract: Systems and methods for performing single instruction multiple data (SIMD) operations on a data set. The methods may include examining a structure of the data set to determine what reorganization may be necessary to facilitate SIMD processing. The method may include selecting a stored bit mask corresponding to the organization of the data set and loading the bit mask into an application specific register (ASR). Subsequently, the data may be reorganized inline according to the ASR as the data is loaded into the SIMD functional unit such that the SIMD functional unit may operate on the data set. The results of the SIMD operation may be written to a results register.
    Type: Application
    Filed: January 26, 2010
    Publication date: July 28, 2011
    Applicant: SUN MICROSYSTEMS, INC.
    Inventor: Lawrence A. Spracklen
  • Publication number: 20110185151
    Abstract: A parallel processor is described which is operated in a SIMD manner. The processor comprises: a plurality of processing elements connected in a string and grouped into a plurality of processing units, wherein each processing unit comprises a plurality of processing elements which each have direct interconnections with all of the other processing elements within the respective processing unit, the interconnections enabling data transfer between any two elements within a unit to be effected in a single clock cycle.
    Type: Application
    Filed: May 20, 2009
    Publication date: July 28, 2011
    Inventors: Martin Whitaker, John Lancaster
  • Publication number: 20110173414
    Abstract: In parallel processing devices, for streaming computations, processing of each data element of the stream may not be computationally intensive and thus processing may take relatively small amounts of time to compute as compared to memory accesses times required to read the stream and write the results. Therefore, memory throughput often limits the performance of the streaming computation. Generally stated, provided are methods for achieving improved, optimized, or ultimately, maximized memory throughput in such memory-throughput-limited streaming computations. Streaming computation performance is maximized by improving the aggregate memory throughput across the plurality of processing elements and threads. High aggregate memory throughput is achieved by balancing processing loads between threads and groups of threads and a hardware memory interface coupled to the parallel processing devices.
    Type: Application
    Filed: March 23, 2011
    Publication date: July 14, 2011
    Applicant: NVIDIA Corporation
    Inventors: Norbert Juffa, Brett W. Coon
  • Publication number: 20110161548
    Abstract: A cache manager receives a request for data, which includes a requested effective address. The cache manager determines whether the requested effective address matches a most recently used effective address stored in a mapped tag vector. When the most recently used effective address matches the requested effective address, the cache manager identifies a corresponding cache location and retrieves the data from the identified cache location. However, when the most recently used effective address fails to match the requested effective address, the cache manager determines whether the requested effective address matches a subsequent effective address stored in the mapped tag vector. When the cache manager determines a match to a subsequent effective address, the cache manager identifies a different cache location corresponding to the subsequent effective address and retrieves the data from the different cache location.
    Type: Application
    Filed: December 29, 2009
    Publication date: June 30, 2011
    Applicant: International Business Machines Corporation
    Inventors: Brian Flachs, Barry L. Minor, Mark Richard Nutter
  • Publication number: 20110161624
    Abstract: Mechanisms are provided for performing a floating point collect and operate for a summation across a vector for a dot product operation. A routing network placed before the single instruction multiple data (SIMD) unit allows the SIMD unit to perform a summation across a vector with a single stage of adders. The routing network routes the vector elements to the adders in a first cycle. The SIMD unit stores the results of the adders into a results vector register. The routing network routes the summation results from the results vector register to the adders in a second cycle. The SIMD unit then stores the results from the second cycle in the results vector register.
    Type: Application
    Filed: December 29, 2009
    Publication date: June 30, 2011
    Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
    Inventors: Brian K. Flachs, Seiji Maeda, Steven Osman
  • Publication number: 20110153983
    Abstract: According to a first aspect, efficient data transfer operations can be achieved by: decoding by a processor device, a single instruction specifying a transfer operation for a plurality of data elements between a first storage location and a second storage location; issuing the single instruction for execution by an execution unit in the processor; detecting an occurrence of an exception during execution of the single instruction; and in response to the exception, delivering pending traps or interrupts to an exception handler prior to delivering the exception.
    Type: Application
    Filed: December 22, 2009
    Publication date: June 23, 2011
    Inventors: Christopher J. Hughes, Yen-Kuang (Y.K.) Chen, Mayank Bomb, Jason W. Brandt, Mark J. Buxton, Mark J. Charney, Srinivas Chennupaty, Jesus Corbal, Martin G. Dixon, Milind B. Girkar, Jonathan C. Hall, Hideki (Saito) Ido, Peter Lachner, Gilbert Neiger, Chris J. Newburn, Rajesh S. Parthasarathy, Bret L. Toll, Robert Valentine, Jeffrey G. Wiedemeier
  • Patent number: 7966475
    Abstract: A data processor comprises a plurality of processing elements arranged for parallel processing of data, and a controller for controlling the plurality of processing elements. The controller is operable to determine respective status information for a plurality of processing threads, and to control processing of the processing threads by the plurality of processors in dependence upon such status information.
    Type: Grant
    Filed: January 10, 2007
    Date of Patent: June 21, 2011
    Assignee: Rambus Inc.
    Inventors: Dave Stuttard, Dave Williams, Eamon O'Dea, Gordon Faulds, John Rhoades, Ken Cameron, Phil Atkin, Paul Winser, Russell David, Ray McConnell, Tim Day, Trey Greer
  • Patent number: 7962718
    Abstract: A permutation instruction generates vector elements for a destination register using identified source and destination registers. A plurality of partial table lookups corresponding to an extended table produces a plurality of intermediate results. At least one source register stores a plurality of index values corresponding to the extended table. Out-of-range index values are values that are not contained in at least one additional source register and result in a predetermined constant value being stored into a predetermined vector element of the destination register. The index values are adjusted between the partial table lookups. A final result is formed by performing a logic function with the plurality of intermediate results. The final result is thereby formed without a full table lookup of each element of the final result.
    Type: Grant
    Filed: October 12, 2007
    Date of Patent: June 14, 2011
    Assignee: Freescale Semiconductor, Inc.
    Inventor: William C. Moyer
  • Publication number: 20110138151
    Abstract: Disclosed is a mixed mode parallel processor system in which N number of processing elements PEs, capable of performing SIMD operation, are grouped into M (=N÷S) processing units PUs performing MIMD operation. In MIMD operation, P out of S memories in each PU, which S memories inherently belong to the PEs, where P<S, operate as an instruction cache. The remaining memories operate as data memories or as data cache memories. One out of S sets of general-purpose registers, inherently belonging to the PEs, directly operates as a general register group for the PU. Out of the remaining S?1 sets, T set or a required number of sets, where T<S?1, are used as storage registers that store tags of the instruction cache.
    Type: Application
    Filed: February 18, 2011
    Publication date: June 9, 2011
    Applicant: NEC CORPORATION
    Inventor: Shorin KYO
  • Patent number: 7958332
    Abstract: A controller operable to control an array of processing elements comprises a retrieval unit operable to retrieve instruction items for each of a plurality of instructions streams, each instruction stream having a plurality of instructions items, a combining unit operable to combine the plurality of instruction streams into a serial instruction stream, and a distribution unit operable to distribute the serial instruction stream to an array of processing elements.
    Type: Grant
    Filed: March 13, 2009
    Date of Patent: June 7, 2011
    Assignee: Rambus Inc.
    Inventors: Dave Stuttard, Dave Williams, Eamon O'Dea, Gordon Faulds, John Rhoades, Ken Cameron, Phil Atkin, Paul Winser, Russell David, Ray McConnell, Tim Day, Trey Greer
  • Patent number: 7941570
    Abstract: An article of manufacture, apparatus, and a method for facilitating input/output (I/O) processing for an I/O operation at a host computer system configured for communication with a control unit. The method includes the host computer system obtaining a transport command word (TCW) for an I/O operation having both input and output data. The TCW specifies a location of the output data and a location for storing the input data. The host computer system forwards the I/O operation to the control unit for execution. The host computer system gathers the output data responsive to the location of the output data specified by the TCW, and then forwards the output data to the control unit for use in the execution of the I/O operation. The host computer system receives the input data from the control unit and stores the input data at the location specified by the TCW.
    Type: Grant
    Filed: February 14, 2008
    Date of Patent: May 10, 2011
    Assignee: International Business Machines Corporation
    Inventors: John R. Flanagan, Daniel F. Casper, Catherine C. Huang, Matthew J. Kalos, Ugochukwu C. Njoku, Dale F. Riedy, Gustav E. Sittmann
  • Publication number: 20110107060
    Abstract: Systems, methods and articles of manufacture are disclosed for transposing array data on a SIMD multi-core processor architecture. A matrix in a SIMD format may be received. The matrix may comprise a SIMD conversion of a matrix M in a conventional data format. A mapping may be defined from each element of the matrix to an element of a SIMD conversion of a transpose of matrix M. A SIMD-transposed matrix T may be generated based on matrix M and the defined mapping. A row-wise algorithm may be applied to T, without modification, to operate on columns of matrix M.
    Type: Application
    Filed: November 4, 2009
    Publication date: May 5, 2011
    Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
    Inventors: Jeffrey S. McAllister, Mark A. Bransford, Timothy J. Mullins, Nelson Ramirez
  • Patent number: 7937567
    Abstract: Parallelism in a parallel processing subsystem is exploited in a scalable manner. A problem to be solved can be hierarchically decomposed into at least two levels of sub-problems. Individual threads of program execution are defined to solve the lowest-level sub-problems. The threads are grouped into one or more thread arrays, each of which solves a higher-level sub-problem. The thread arrays are executable by processing cores, each of which can execute at least one thread array at a time. Thread arrays can be grouped into grids of independent thread arrays, which solve still higher-level sub-problems or an entire problem. Thread arrays within a grid, or entire grids, can be distributed across all of the available processing cores as available in a particular system implementation.
    Type: Grant
    Filed: November 1, 2006
    Date of Patent: May 3, 2011
    Assignee: Nvidia Corporation
    Inventors: John R. Nickolls, Stephen D. Lew
  • Publication number: 20110099352
    Abstract: There is provided a method of performing single instruction multiple data (SIMD) operations. The method comprises storing a plurality of arrays in memory for performing SIMD operations thereon; determining a total number of SIMD operations to be performed on the plurality of arrays; loading a counter with the total number of SIMD operations to be performed on the plurality of arrays; enabling a plurality of arithmetic logic units (ALUs) to perform a first number of operations on first elements of the plurality of arrays; performing the first number of operations on first elements of the plurality of arrays using the plurality of ALUs; decrementing the counter by the first number of operations to provide a remaining number of operations; and enabling a number of the plurality of ALUs to perform the remaining number of operations on second elements of the plurality of arrays.
    Type: Application
    Filed: November 23, 2009
    Publication date: April 28, 2011
    Applicant: Mindspeed Technologies, Inc.
    Inventor: Patrick D. Ryan
  • Publication number: 20110093682
    Abstract: An apparatus includes an instruction decoder, first and second source registers and a circuit coupled to the decoder to receive packed data from the source registers and to pack the packed data responsive to a pack instruction received by the decoder. A first packed data element and a second packed data element are received from the first source register. A third packed data element and a fourth packed data element are received from the second source register. The circuit packs packing a portion of each of the packed data elements into a destination register resulting with the portion from second packed data element adjacent to the portion from the first packed data element, and the portion from the fourth packed data element adjacent to the portion from the third packed data element.
    Type: Application
    Filed: December 22, 2010
    Publication date: April 21, 2011
    Inventors: ALEXANDER PELEG, YAAKOV YAARI, MILLIND MITTAL, LARRY M. MENNEMEIER, BENNY EITAN
  • Publication number: 20110087860
    Abstract: Parallel data processing systems and methods use cooperative thread arrays (CTAs), i.e., groups of multiple threads that concurrently execute the same program on an input data set to produce an output data set. Each thread in a CTA has a unique identifier (thread ID) that can be assigned at thread launch time. The thread ID controls various aspects of the thread's processing behavior such as the portion of the input data set to be processed by each thread, the portion of an output data set to be produced by each thread, and/or sharing of intermediate results among threads. Mechanisms for loading and launching CTAs in a representative processing core and for synchronizing threads within a CTA are also described.
    Type: Application
    Filed: December 17, 2010
    Publication date: April 14, 2011
    Applicant: NVIDIA Corporation
    Inventors: John R. Nickolls, Stephen D. Lew
  • Patent number: 7925861
    Abstract: A data processor comprises a plurality of processing elements arranged in a first plurality of single instruction multiple data (SIMD) processing arrays, and comprises a second plurality of controllers for transferring instructions to the processing arrays. Each controller is operable to retrieve a plurality of incoming instruction streams in parallel with one another and operable to supply incoming instruction streams to one of a plurality of processing arrays.
    Type: Grant
    Filed: January 31, 2007
    Date of Patent: April 12, 2011
    Assignee: Rambus Inc.
    Inventors: Dave Stuttard, Dave Williams, Eamon O'Dea, Gordon Faulds, John Rhoades, Ken Cameron, Phil Atkin, Paul Winser, Russell David, Ray McConnell, Tim Day, Trey Greer
  • Publication number: 20110083000
    Abstract: A data processing architecture includes an input device that receives an incoming stream of data packets. A plurality of processing elements are operable to process data received from the input device. The input device is operable to distribute data packets in whole or in part to the processing elements in dependence upon the data processing bandwidth of the processing elements.
    Type: Application
    Filed: December 10, 2010
    Publication date: April 7, 2011
    Inventors: John Rhoades, Ken Cameron, Paul Winser, Ray McConnell, Gordon Faulds, Simon McIntosh-Smith, Anthony Spencer, Jeff Bond, Matthias Dejaegher, Danny Halamish, Gajinder Panesar
  • Patent number: 7916864
    Abstract: A graphics processing unit is programmed to carry out cryptographic processing so that fast, effective cryptographic processing solutions can be provided without incurring additional hardware costs. The graphics processing unit can efficiently carry out cryptographic processing because it has an architecture that is configured to handle a large number of parallel processes. The cryptographic processing carried out on the graphics processing unit can be further improved by configuring the graphics processing unit to be capable of both floating point and integer operations.
    Type: Grant
    Filed: February 8, 2006
    Date of Patent: March 29, 2011
    Assignee: NVIDIA Corporation
    Inventor: Norbert Juffa
  • Patent number: 7917727
    Abstract: An input/output system transfers data packets to and from a SIMD array of processing elements (PEs) such that different sizes of data packets are transferred to respective ones of the PEs. The packets are transferred in batches to respective different addresses in the array under the control of the PEs. Transfer to or from the array may be carried out when either a batch or part of a batch is ready for transfer. The decision to transfer either full or part batches is made in dependence upon the speed of the PEs and the speed and intermittency of the data packets.
    Type: Grant
    Filed: May 23, 2007
    Date of Patent: March 29, 2011
    Assignee: Rambus, Inc.
    Inventors: John Rhoades, Ken Cameron, Paul Winser, Ray McConnell, Gordon Faulds, Simon McIntosh-Smith, Anthony Spencer, Jeff Bond, Matthias Dejaegher, Danny Halamish, Gajinder Panesar
  • Publication number: 20110072238
    Abstract: The present invention provides a method for reducing program memory size required for a dual-issue processor with a scalar processor plus a SIMD vector processor. Coding the map of next group of instruction pairs in a no-operation (NOP) instruction of scalar and vector processor reduces the cases where one of the scalar or vector opcode being a NOP opcode. NOP for either scalar or vector processor defines the next 13 instructions as scalar-plus-vector, scalar-followed-by-scalar, or vector-followed-by-vector so that execution unit performs accordingly until next NOP or a branch instruction.
    Type: Application
    Filed: September 20, 2009
    Publication date: March 24, 2011
    Inventor: Tibet Mimar
  • Patent number: 7908461
    Abstract: A data processing system includes an associative memory device containing n-cells, each of the n-cells includes a processing circuit. A controller is utilized for issuing one of a plurality of instructions to the associative memory device, while a clock device is utilized for outputting a synchronizing clock signal comprised of a predetermined number of clock cycles per second. The clock device outputs the synchronizing clock signal to the associative memory device and the controller which globally communicates one of the plurality of instructions to all of the n-cells simultaneously, within one of the clock cycles.
    Type: Grant
    Filed: December 19, 2007
    Date of Patent: March 15, 2011
    Assignee: Allsearch Semi, LLC
    Inventors: Gheorghe Stefan, Dan Tomescu
  • Patent number: 7903810
    Abstract: A method and apparatus are disclosed for efficiently scrambling one or more bytes of data according to DSL standards on a processor. This is achieved by providing an instruction for scrambling one or more bytes of data according to the DSL standards. Accordingly, the invention advantageously provides a processor with the ability to scramble data with a single instruction thus allowing for more efficient and faster scrambling operations for subsequent modulation and transmission.
    Type: Grant
    Filed: September 22, 2004
    Date of Patent: March 8, 2011
    Assignee: Broadcom Corporation
    Inventors: Mark Taunton, Timothy Martin Dobson
  • Patent number: 7904905
    Abstract: A system and method is disclosed for efficiently executing single program multiple data (SPMD) programs in a microprocessor. A micro single instruction multiple data (SIMD) unit is located within the microprocessor. A job buffer that is coupled to the micro SIMD unit dynamically allocates tasks to the micro SIMD unit. The SPMD programs each comprise a plurality of input data streams having moderate diversification of control flows. The system executes each SPMD program once for each input data stream of the plurality of input data streams.
    Type: Grant
    Filed: November 14, 2003
    Date of Patent: March 8, 2011
    Assignee: STMicroelectronics, Inc.
    Inventor: Stefano Cervini