Single Instruction, Multiple Data (simd) Patents (Class 712/22)
-
Patent number: 8108846Abstract: A mechanism is provided for performing scalar operations using a SIMD data parallel execution unit. With the mechanisms of the illustrative embodiments, scalar operations in application code are identified that may be executed using vector operations in a SIMD data parallel execution unit. The scalar operations are converted, such as by a static or dynamic compiler, into one or more vector load instructions and one or more vector computation instructions. In addition, control words may be generated to adjust the alignment of the scalar values for the scalar operation within the vector registers to which these scalar values are loaded using the vector load instructions. The alignment amounts for adjusting the scalar values within the vector registers may be statically or dynamically determined.Type: GrantFiled: May 28, 2008Date of Patent: January 31, 2012Assignee: International Business Machines CorporationInventor: Michael K. Gschwind
-
Patent number: 8108653Abstract: A processor includes a compute array comprising a first plurality of compute engines serially connected along a data flow path such that data flows between successive compute engines at successive times. The first plurality of compute engines includes an initial compute engine and a final compute engine. The data flow path includes a recirculation path connecting the final compute engine to the initial compute engine with no compute engine therebetween.Type: GrantFiled: February 5, 2010Date of Patent: January 31, 2012Assignee: Analog Devices, Inc.Inventors: Boris Lerner, Douglas Garde
-
Patent number: 8103854Abstract: A control processor is used for fetching and distributing single instruction multiple data (SIMD) instructions to a plurality of processing elements (PEs). One of the SIMD instructions is a thread start (Tstart) instruction, which causes the control processor to pause its instruction fetching. A local PE instruction memory (PE Imem) is associated with each PE and contains local PE instructions for execution on the local PE. Local PE Imem fetch, decode, and execute logic are associated with each PE. Instruction path selection logic in each PE is used to select between control processor distributed instructions and local PE instructions fetched from the local PE Imem. Each PE is also initialized to receive control processor distributed instructions. In addition, local hold generation logic is associated with each PE. A PE receiving a Tstart instruction causes the instruction path selection logic to switch to fetch local PE Imem instructions.Type: GrantFiled: April 12, 2010Date of Patent: January 24, 2012Assignee: Altera CorporationInventors: Gerald George Pechanek, Edwin Franklin Barry, Mihailo M. Stojancic
-
Patent number: 8099584Abstract: Parallelism in a parallel processing subsystem is exploited in a scalable manner. A problem to be solved can be hierarchically decomposed into at least two levels of sub-problems. Individual threads of program execution are defined to solve the lowest-level sub-problems. The threads are grouped into one or more thread arrays, each of which solves a higher-level sub-problem. The thread arrays are executable by processing cores, each of which can execute at least one thread array at a time. Thread arrays can be grouped into grids of independent thread arrays, which solve still higher-level sub-problems or an entire problem. Thread arrays within a grid, or entire grids, can be distributed across all of the available processing cores as available in a particular system implementation.Type: GrantFiled: May 2, 2011Date of Patent: January 17, 2012Assignee: NVIDIA CorporationInventors: John R. Nickolls, Stephen D. Lew
-
Patent number: 8085264Abstract: A method for multiple queue output buffering in a raster stage of a graphics processor. The method includes receiving a graphics primitive for rasterization in a raster stage of a graphics processor. The graphics primitive is rasterized at a first level to generate a plurality of tiles of pixels related to the graphics primitive. Each tile is then rasterized to determine related sub-portions of each tile. The related sub-portions are transferred to a plurality of output queues. The related sub-portions are subsequently output on a per queue basis and on a per clock cycle basis.Type: GrantFiled: July 26, 2006Date of Patent: December 27, 2011Assignee: NVIDIA CorporationInventors: Franklin C. Crow, Jeffrey R. Sewall
-
Patent number: 8086830Abstract: An arithmetic processing apparatus capable of performing an arithmetic operation for generating a condition flag commonly referred to by using a condition flag generated on an arithmetic operation unit basis in as few steps as possible is provided. The arithmetic processing apparatus, which processes multiple data in parallel based on single instruction, includes: processing elements capable of performing a common arithmetic operation based on the evaluation result of the instruction stored in the instruction register; and a condition flag arithmetic operation unit capable of performing one of the logical operation and the comparison operation on the condition flag retained in each processing element, transferring the operation result to each processing element, and updating the condition flag based on the operation result.Type: GrantFiled: August 24, 2005Date of Patent: December 27, 2011Assignee: Panasonic CorporationInventors: Takeshi Furuta, Hideshi Nishida, Takeshi Tanaka
-
Patent number: 8082419Abstract: According to some embodiments, a technique provides for the execution of an instruction that includes receiving residual data of a first image and decoded pixels of a second image, zero-extending a plurality of unsigned data operands of the decoded pixels producing a plurality of unpacked data operands, adding a plurality of signed data operands of the residual data to the plurality of unpacked data operands producing a plurality of signed results; and saturating the plurality of signed results producing a plurality of unsigned results.Type: GrantFiled: March 30, 2004Date of Patent: December 20, 2011Assignee: Intel CorporationInventors: Bradley C. Aldrich, Nigel C. Paver, Murli Ganeshan
-
Patent number: 8078836Abstract: In-lane vector shuffle operations are described. In one embodiment a shuffle instruction specifies a field of per-lane control bits, a source operand and a destination operand, these operands having corresponding lanes, each lane divided into corresponding portions of multiple data elements. Sets of data elements are selected from corresponding portions of every lane of the source operand according to per-lane control bits. Elements of these sets are copied to specified fields in corresponding portions of every lane of the destination operand. Another embodiment of the shuffle instruction also specifies a second source operand, all operands having corresponding lanes divided into multiple data elements. A set selected according to per-lane control bits contains data elements from every lane portion of a first source operand and data elements from every corresponding lane portion of the second source operand. Set elements are copied to specified fields in every lane of the destination operand.Type: GrantFiled: December 30, 2007Date of Patent: December 13, 2011Assignee: Intel CorporationInventors: Zeev Sperber, Robert Valentine, Benny Eitan, Doron Orenstein
-
Patent number: 8074058Abstract: The present invention provides extended precision in SIMD arithmetic operations in a processor having a register file and an accumulator. A first set of data elements and a second set of data elements are loaded into first and second vector registers, respectively. Each data element comprises N bits. Next, an arithmetic instruction is fetched from memory. The arithmetic instruction is decoded. Then, the first vector register and the second vector register are read from the register file. The present invention executes the arithmetic instruction on corresponding data elements in the first and second vector registers. The resulting element of the execution is then written into the accumulator. Then, the resulting element is transformed into an N-bit width element and written into a third register for further operation or storage in memory. The transformation of the resulting element can include, for example, rounding, clamping, and/or shifting the element.Type: GrantFiled: June 8, 2009Date of Patent: December 6, 2011Assignee: MIPS Technologies, Inc.Inventors: Timothy J. Van Hook, Peter Hsu, William A. Huffman, Henry P. Moreton, Earl A. Killian
-
Patent number: 8069334Abstract: The present invention provides histogram calculation for images and video applications using a SIMD and VLIW processor with vector Look-Up Table (LUT) operations. This provides a speed up of histogram calculation by a factor of N times over a scalar processor where the SIMD processor could perform N LUT operations per instruction. Histogram operation is partitioned into a vector LUT operation, followed by vector increment, vector LUT update, and at the end by reduction of vector histogram components. The present invention could be used for intensity, RGBA, YUV, and other type of multi-component images.Type: GrantFiled: March 12, 2009Date of Patent: November 29, 2011Inventor: Tibet Mimar
-
Patent number: 8060725Abstract: A processor architecture for multimedia applications includes processor clusters providing vectorial data processing capability. Processing elements in the processor clusters process both data with a bit length N and data with bit lengths N/2, N/4, and so on according to a Single Instruction Multiple Data (SIMD) function. A load unit loads into the processor clusters data to be processed according to a same instruction. An intercluster data path exchanges data between the processor clusters. The intercluster data path is scalable to activate selected processor clusters. The processor operates simultaneously on SIMD, scalar and vectorial data.Type: GrantFiled: June 26, 2007Date of Patent: November 15, 2011Assignees: STMicroelectronics S.R.L., STMicroelectronics N.V.Inventors: Francesco Pappalardo, Giuseppe Notarangelo, Elena Salurso, Elio Guidetti
-
Patent number: 8060726Abstract: A SIMD microprocessor, which can be included in an image processing apparatus using an image processing method used therein, includes a global processor and multiple processor elements controlled by the global processor. Each single processor element of the multiple processor elements includes multiple operation units. The global processor is configured to control the multiple processing elements to uniformly change a configuration of the multiple operation units in the single processor element to determine a number of data units of operation according to the multiple operation units either operated individually or in cooperation with each other in the single processor element and a width of data processed per data unit of operation performed in the single processor element. A processor element number is assigned per data unit of operation to the single processor element to use for executing an operation.Type: GrantFiled: February 27, 2008Date of Patent: November 15, 2011Assignee: Ricoh Company, Ltd.Inventor: Tomoaki Ozaki
-
Patent number: 8060724Abstract: Executing a first memory access instruction with update by an N-bit processor includes accessing at least one source register of a plurality of registers, wherein the accessing includes accessing a first register, wherein each register of the plurality of registers includes a main portion of N bits and an extension portion of M bits, wherein the main portion of the first register includes a first address operand. The execution of the first instruction further includes forming a memory access address using the first address operand; using the memory access address as an address for a memory access; producing an updated address operand; and writing the updated address operand to the main portion of the first register. The producing includes accessing an extension portion of a source register of the at least one source register to obtain modifying information and using the modifying information in the producing an updated address operand.Type: GrantFiled: August 15, 2008Date of Patent: November 15, 2011Assignee: Freescale Semiconductor, Inc.Inventor: William C. Moyer
-
Patent number: 8056069Abstract: A method, computer program product, and information handling system for generating loop code to execute on Single-Instruction Multiple-Datapath (SIMD) architectures, where the loop contains multiple non-stride-one memory accesses that operate over a contiguous stream of memory is disclosed. A preferred embodiment identifies groups of isomorphic statements within a loop body where the isomorphic statements operate over a contiguous stream of memory over the iteration of the loop. Those identified statements are then converted into virtual-length vector operations. Next, the hardware's available vector length is used to determine a number of virtual-length vectors to aggregate into a single vector operation for each iteration of the loop. Finally, the aggregated, vectorized loop code is converted into SIMD operations.Type: GrantFiled: September 17, 2007Date of Patent: November 8, 2011Assignee: International Business Machines CorporationInventors: Alexandre E. Eichenberger, Kai-Ting Amy Wang, Peng Wu
-
Patent number: 8041927Abstract: A processor (and method) of processing multiple data by a single instruction includes first and second register sets each of which includes a plurality of registers, and an arithmetic unit to rearrange data being registered in the first and second register sets according to a relative size of an absolute value of the data between the first and second register sets so that the relative size is defined before executing an instruction considering the relative size.Type: GrantFiled: April 7, 2009Date of Patent: October 18, 2011Assignee: NEC CorporationInventor: Yusuke Kobayashi
-
Patent number: 8032735Abstract: A method includes, in a processor, loading/moving a first portion of bits of a source into a first portion of a destination register and duplicate that first portion of bits in a subsequent portion of the destination register.Type: GrantFiled: November 5, 2010Date of Patent: October 4, 2011Assignee: Intel CorporationInventor: Patrice Roussel
-
Patent number: 8028150Abstract: A method and system for decoding and modifying processor instructions in runtime according to certain rules in order to separately control processing elements embedded within a multi-processor array using a single instruction. The present invention allows multiple processing elements and/or execution units in a multi-processor array to perform different operations, based upon a variable or variables such as their location in the multi-processor array, while accepting a single instruction as an input.Type: GrantFiled: November 16, 2007Date of Patent: September 27, 2011Inventors: Shlomo Selim Rakib, Yoram Zarai
-
Patent number: 8024550Abstract: Disclosed is an SIMD-type microprocessor comprising a processor element group, plural processor elements with an operation part and a register file being arranged therein and a processor element control signal generator configured to output a processor element control signal controlling an operation of the processor element, wherein a feed part configured to feed a processor element control signal output from the processor element control signal generator to the processor element is provided at a center of the processor element group.Type: GrantFiled: January 21, 2009Date of Patent: September 20, 2011Assignee: Ricoh Company, Ltd.Inventor: Hidehito Kitamura
-
Patent number: 8024531Abstract: An ascending ordered list without duplication is generated based on a value list divided and held by multiple memory modules. An information processing system has multiple PMMs (Processor Memory Modules), and the PMMs are interconnected via a data transmission path. The memory in the PMM has a list of values, which are ordered in ascending or descending order without duplication. The PMM determines, for a storage value in the value list (LOCAL_LIST) held by the PMM, whether or not the memory module is a representative module representing one or more memory modules holding the storage value based on rankings determined for the individual PMMs and the value lists received from the other PMMs, and if the memory module is determined to be the representative module (RV-0 . . . RV-7), associates to the storage value and stores information indicating that the memory module is the representative module.Type: GrantFiled: April 17, 2006Date of Patent: September 20, 2011Assignee: Turbo Data Laboratories, Inc.Inventor: Shinji Furusho
-
Patent number: 8024549Abstract: A data processor apparatus comprises a plurality of data receiving means each for receiving data from a data source; a computational element coupleable to each of said data receiving means for performing an operation on said data; and a controller for controlling the flow of data from each data receiving means to the computational element.Type: GrantFiled: March 3, 2006Date of Patent: September 20, 2011Assignee: Mtekvision Co., Ltd.Inventor: Malcolm Stewart
-
Publication number: 20110211726Abstract: This invention provides a system and method for processing discrete image data within an overall set of acquired image data based upon a focus of attention within that image. The result of such processing is to operate upon a more limited subset of the overall image data to generate output values required by the vision system process. Such output value can be a decoded ID or other alphanumeric data. The system and method is performed in a vision system having two processor groups, along with a data memory that is smaller in capacity than the amount of image data to be read out from the sensor array. The first processor group is a plurality of SIMD processors and at least one general purpose processor, co-located on the same die with the data memory. A data reduction function operates within the same clock cycle as data-readout from the sensor to generate a reduced data set that is stored in the on-die data memory.Type: ApplicationFiled: May 17, 2010Publication date: September 1, 2011Applicant: COGNEX CORPORATIONInventors: Michael C. Moed, E. John McGarry
-
Patent number: 8010953Abstract: Performing scalar operations using a SIMD data parallel execution unit is provided. With the mechanisms of the illustrative embodiments, scalar operations in application code are identified that may be executed using vector operations in a SIMD data parallel execution unit. The scalar operations are converted, such as by a static or dynamic compiler, into one or more vector load instructions and one or more vector computation instructions. In addition, control words may be generated to adjust the alignment of the scalar values for the scalar operation within the vector registers to which these scalar values are loaded using the vector load instructions. The alignment amounts for adjusting the scalar values within the vector registers may be statically or dynamically determined.Type: GrantFiled: April 4, 2006Date of Patent: August 30, 2011Assignee: International Business Machines CorporationInventor: Michael K. Gschwind
-
Publication number: 20110208946Abstract: Disclosed are various embodiments of a stream processing unit for single instruction multiple data (SIMD) processing, wherein the stream processing unit executes a stage of a Multiply-Accumulate calculation. In one embodiment, the stream processing unit comprises a plurality of scalar arithmetic logic units (ALUs) configured to receive data having a plurality of data types. The number and type of scalar ALUs corresponds to an SIMD factor. In one embodiment, the scalar ALUs are executed sequentially with a delay being introduced in between execution of each of the scalar ALUs, wherein the delay corresponds to the SIMD factor.Type: ApplicationFiled: May 4, 2011Publication date: August 25, 2011Applicant: VIA TECHNOLOGIES, INC.Inventors: Boris Prokopenko, Timour Paltashev, Derek Gladding
-
SIMD image forming apparatus for minimizing wiring distance between registers and processing devices
Patent number: 8001506Abstract: A disclosed image processing apparatus includes a SIMD microprocessor in which multiple processor elements are arranged in one dimension, each of the processor elements including multiple access registers arranged in stages for storing image data; and multiple data processing devices corresponding one-to-one with the stages of the access registers, arranged in one dimension in the same direction as the processor elements, and configured to read and write image data from/to the access registers. The access registers of each of the stages, each of which access registers is included in a different one of the processor elements, are connected with a common line. Wiring outlets, each of which connects the common line of a different one of the stages to a corresponding data processing device, are individually disposed within the SIMD microprocessor in such a manner that each wiring outlet has a shortest possible distance to the corresponding data processing device.Type: GrantFiled: January 21, 2009Date of Patent: August 16, 2011Assignee: Ricoh Company, Ltd.Inventor: Tomoaki Ozaki -
Patent number: 7996572Abstract: Systems and methods of managing transactions provide for receiving a first flush command at a first I/O hub, wherein the first flush command is dedicated to non-posted transactions. One embodiment further provides for halting an inbound ordering queue of the first I/O hub with regard to non-posted transactions in response to the first flush command and flushing a non-posted transaction from an outgoing buffer of the first I/O hub to a second I/O hub while the inbound ordering queue is halted with regard to non-posted transactions.Type: GrantFiled: June 2, 2004Date of Patent: August 9, 2011Assignee: Intel CorporationInventors: Robert G. Blankenship, Robert J. Greiner, Herbert H. J. Hum, Kenneth C. Creta, Buderya S. Acharya
-
Publication number: 20110191567Abstract: Improvements Relating to Single Instruction Multiple Data (SIMD) Architectures A parallel processor for processing a plurality of different processing instruction streams in parallel is described. The processor comprises a plurality of data processing units; and a plurality of SIMD (Single Instruction Multiple Data) controllers, each connectable to a group of data processing units of the plurality of data processing units, and each SIMD controller arranged to handle an individual processing task with a subgroup of actively connected data processing units selected from the group of data processing units. The parallel processor is arranged to vary dynamically the size of the subgroup of data processing units to which each SIMD controller is actively connected under control of received processing instruction streams, thereby permitting each SIMD controller to be actively connected to a different number of processing units for different processing tasks.Type: ApplicationFiled: May 20, 2009Publication date: August 4, 2011Inventors: John Lancaster, Martin Whitaker
-
Publication number: 20110185150Abstract: Systems and methods for performing single instruction multiple data (SIMD) operations on a data set. The methods may include examining a structure of the data set to determine what reorganization may be necessary to facilitate SIMD processing. The method may include selecting a stored bit mask corresponding to the organization of the data set and loading the bit mask into an application specific register (ASR). Subsequently, the data may be reorganized inline according to the ASR as the data is loaded into the SIMD functional unit such that the SIMD functional unit may operate on the data set. The results of the SIMD operation may be written to a results register.Type: ApplicationFiled: January 26, 2010Publication date: July 28, 2011Applicant: SUN MICROSYSTEMS, INC.Inventor: Lawrence A. Spracklen
-
Publication number: 20110185151Abstract: A parallel processor is described which is operated in a SIMD manner. The processor comprises: a plurality of processing elements connected in a string and grouped into a plurality of processing units, wherein each processing unit comprises a plurality of processing elements which each have direct interconnections with all of the other processing elements within the respective processing unit, the interconnections enabling data transfer between any two elements within a unit to be effected in a single clock cycle.Type: ApplicationFiled: May 20, 2009Publication date: July 28, 2011Inventors: Martin Whitaker, John Lancaster
-
Publication number: 20110173414Abstract: In parallel processing devices, for streaming computations, processing of each data element of the stream may not be computationally intensive and thus processing may take relatively small amounts of time to compute as compared to memory accesses times required to read the stream and write the results. Therefore, memory throughput often limits the performance of the streaming computation. Generally stated, provided are methods for achieving improved, optimized, or ultimately, maximized memory throughput in such memory-throughput-limited streaming computations. Streaming computation performance is maximized by improving the aggregate memory throughput across the plurality of processing elements and threads. High aggregate memory throughput is achieved by balancing processing loads between threads and groups of threads and a hardware memory interface coupled to the parallel processing devices.Type: ApplicationFiled: March 23, 2011Publication date: July 14, 2011Applicant: NVIDIA CorporationInventors: Norbert Juffa, Brett W. Coon
-
Publication number: 20110161548Abstract: A cache manager receives a request for data, which includes a requested effective address. The cache manager determines whether the requested effective address matches a most recently used effective address stored in a mapped tag vector. When the most recently used effective address matches the requested effective address, the cache manager identifies a corresponding cache location and retrieves the data from the identified cache location. However, when the most recently used effective address fails to match the requested effective address, the cache manager determines whether the requested effective address matches a subsequent effective address stored in the mapped tag vector. When the cache manager determines a match to a subsequent effective address, the cache manager identifies a different cache location corresponding to the subsequent effective address and retrieves the data from the different cache location.Type: ApplicationFiled: December 29, 2009Publication date: June 30, 2011Applicant: International Business Machines CorporationInventors: Brian Flachs, Barry L. Minor, Mark Richard Nutter
-
Publication number: 20110161624Abstract: Mechanisms are provided for performing a floating point collect and operate for a summation across a vector for a dot product operation. A routing network placed before the single instruction multiple data (SIMD) unit allows the SIMD unit to perform a summation across a vector with a single stage of adders. The routing network routes the vector elements to the adders in a first cycle. The SIMD unit stores the results of the adders into a results vector register. The routing network routes the summation results from the results vector register to the adders in a second cycle. The SIMD unit then stores the results from the second cycle in the results vector register.Type: ApplicationFiled: December 29, 2009Publication date: June 30, 2011Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATIONInventors: Brian K. Flachs, Seiji Maeda, Steven Osman
-
Publication number: 20110153983Abstract: According to a first aspect, efficient data transfer operations can be achieved by: decoding by a processor device, a single instruction specifying a transfer operation for a plurality of data elements between a first storage location and a second storage location; issuing the single instruction for execution by an execution unit in the processor; detecting an occurrence of an exception during execution of the single instruction; and in response to the exception, delivering pending traps or interrupts to an exception handler prior to delivering the exception.Type: ApplicationFiled: December 22, 2009Publication date: June 23, 2011Inventors: Christopher J. Hughes, Yen-Kuang (Y.K.) Chen, Mayank Bomb, Jason W. Brandt, Mark J. Buxton, Mark J. Charney, Srinivas Chennupaty, Jesus Corbal, Martin G. Dixon, Milind B. Girkar, Jonathan C. Hall, Hideki (Saito) Ido, Peter Lachner, Gilbert Neiger, Chris J. Newburn, Rajesh S. Parthasarathy, Bret L. Toll, Robert Valentine, Jeffrey G. Wiedemeier
-
Patent number: 7966475Abstract: A data processor comprises a plurality of processing elements arranged for parallel processing of data, and a controller for controlling the plurality of processing elements. The controller is operable to determine respective status information for a plurality of processing threads, and to control processing of the processing threads by the plurality of processors in dependence upon such status information.Type: GrantFiled: January 10, 2007Date of Patent: June 21, 2011Assignee: Rambus Inc.Inventors: Dave Stuttard, Dave Williams, Eamon O'Dea, Gordon Faulds, John Rhoades, Ken Cameron, Phil Atkin, Paul Winser, Russell David, Ray McConnell, Tim Day, Trey Greer
-
Patent number: 7962718Abstract: A permutation instruction generates vector elements for a destination register using identified source and destination registers. A plurality of partial table lookups corresponding to an extended table produces a plurality of intermediate results. At least one source register stores a plurality of index values corresponding to the extended table. Out-of-range index values are values that are not contained in at least one additional source register and result in a predetermined constant value being stored into a predetermined vector element of the destination register. The index values are adjusted between the partial table lookups. A final result is formed by performing a logic function with the plurality of intermediate results. The final result is thereby formed without a full table lookup of each element of the final result.Type: GrantFiled: October 12, 2007Date of Patent: June 14, 2011Assignee: Freescale Semiconductor, Inc.Inventor: William C. Moyer
-
Publication number: 20110138151Abstract: Disclosed is a mixed mode parallel processor system in which N number of processing elements PEs, capable of performing SIMD operation, are grouped into M (=N÷S) processing units PUs performing MIMD operation. In MIMD operation, P out of S memories in each PU, which S memories inherently belong to the PEs, where P<S, operate as an instruction cache. The remaining memories operate as data memories or as data cache memories. One out of S sets of general-purpose registers, inherently belonging to the PEs, directly operates as a general register group for the PU. Out of the remaining S?1 sets, T set or a required number of sets, where T<S?1, are used as storage registers that store tags of the instruction cache.Type: ApplicationFiled: February 18, 2011Publication date: June 9, 2011Applicant: NEC CORPORATIONInventor: Shorin KYO
-
Patent number: 7958332Abstract: A controller operable to control an array of processing elements comprises a retrieval unit operable to retrieve instruction items for each of a plurality of instructions streams, each instruction stream having a plurality of instructions items, a combining unit operable to combine the plurality of instruction streams into a serial instruction stream, and a distribution unit operable to distribute the serial instruction stream to an array of processing elements.Type: GrantFiled: March 13, 2009Date of Patent: June 7, 2011Assignee: Rambus Inc.Inventors: Dave Stuttard, Dave Williams, Eamon O'Dea, Gordon Faulds, John Rhoades, Ken Cameron, Phil Atkin, Paul Winser, Russell David, Ray McConnell, Tim Day, Trey Greer
-
Patent number: 7941570Abstract: An article of manufacture, apparatus, and a method for facilitating input/output (I/O) processing for an I/O operation at a host computer system configured for communication with a control unit. The method includes the host computer system obtaining a transport command word (TCW) for an I/O operation having both input and output data. The TCW specifies a location of the output data and a location for storing the input data. The host computer system forwards the I/O operation to the control unit for execution. The host computer system gathers the output data responsive to the location of the output data specified by the TCW, and then forwards the output data to the control unit for use in the execution of the I/O operation. The host computer system receives the input data from the control unit and stores the input data at the location specified by the TCW.Type: GrantFiled: February 14, 2008Date of Patent: May 10, 2011Assignee: International Business Machines CorporationInventors: John R. Flanagan, Daniel F. Casper, Catherine C. Huang, Matthew J. Kalos, Ugochukwu C. Njoku, Dale F. Riedy, Gustav E. Sittmann
-
Publication number: 20110107060Abstract: Systems, methods and articles of manufacture are disclosed for transposing array data on a SIMD multi-core processor architecture. A matrix in a SIMD format may be received. The matrix may comprise a SIMD conversion of a matrix M in a conventional data format. A mapping may be defined from each element of the matrix to an element of a SIMD conversion of a transpose of matrix M. A SIMD-transposed matrix T may be generated based on matrix M and the defined mapping. A row-wise algorithm may be applied to T, without modification, to operate on columns of matrix M.Type: ApplicationFiled: November 4, 2009Publication date: May 5, 2011Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATIONInventors: Jeffrey S. McAllister, Mark A. Bransford, Timothy J. Mullins, Nelson Ramirez
-
Patent number: 7937567Abstract: Parallelism in a parallel processing subsystem is exploited in a scalable manner. A problem to be solved can be hierarchically decomposed into at least two levels of sub-problems. Individual threads of program execution are defined to solve the lowest-level sub-problems. The threads are grouped into one or more thread arrays, each of which solves a higher-level sub-problem. The thread arrays are executable by processing cores, each of which can execute at least one thread array at a time. Thread arrays can be grouped into grids of independent thread arrays, which solve still higher-level sub-problems or an entire problem. Thread arrays within a grid, or entire grids, can be distributed across all of the available processing cores as available in a particular system implementation.Type: GrantFiled: November 1, 2006Date of Patent: May 3, 2011Assignee: Nvidia CorporationInventors: John R. Nickolls, Stephen D. Lew
-
Publication number: 20110099352Abstract: There is provided a method of performing single instruction multiple data (SIMD) operations. The method comprises storing a plurality of arrays in memory for performing SIMD operations thereon; determining a total number of SIMD operations to be performed on the plurality of arrays; loading a counter with the total number of SIMD operations to be performed on the plurality of arrays; enabling a plurality of arithmetic logic units (ALUs) to perform a first number of operations on first elements of the plurality of arrays; performing the first number of operations on first elements of the plurality of arrays using the plurality of ALUs; decrementing the counter by the first number of operations to provide a remaining number of operations; and enabling a number of the plurality of ALUs to perform the remaining number of operations on second elements of the plurality of arrays.Type: ApplicationFiled: November 23, 2009Publication date: April 28, 2011Applicant: Mindspeed Technologies, Inc.Inventor: Patrick D. Ryan
-
Publication number: 20110093682Abstract: An apparatus includes an instruction decoder, first and second source registers and a circuit coupled to the decoder to receive packed data from the source registers and to pack the packed data responsive to a pack instruction received by the decoder. A first packed data element and a second packed data element are received from the first source register. A third packed data element and a fourth packed data element are received from the second source register. The circuit packs packing a portion of each of the packed data elements into a destination register resulting with the portion from second packed data element adjacent to the portion from the first packed data element, and the portion from the fourth packed data element adjacent to the portion from the third packed data element.Type: ApplicationFiled: December 22, 2010Publication date: April 21, 2011Inventors: ALEXANDER PELEG, YAAKOV YAARI, MILLIND MITTAL, LARRY M. MENNEMEIER, BENNY EITAN
-
Publication number: 20110087860Abstract: Parallel data processing systems and methods use cooperative thread arrays (CTAs), i.e., groups of multiple threads that concurrently execute the same program on an input data set to produce an output data set. Each thread in a CTA has a unique identifier (thread ID) that can be assigned at thread launch time. The thread ID controls various aspects of the thread's processing behavior such as the portion of the input data set to be processed by each thread, the portion of an output data set to be produced by each thread, and/or sharing of intermediate results among threads. Mechanisms for loading and launching CTAs in a representative processing core and for synchronizing threads within a CTA are also described.Type: ApplicationFiled: December 17, 2010Publication date: April 14, 2011Applicant: NVIDIA CorporationInventors: John R. Nickolls, Stephen D. Lew
-
Patent number: 7925861Abstract: A data processor comprises a plurality of processing elements arranged in a first plurality of single instruction multiple data (SIMD) processing arrays, and comprises a second plurality of controllers for transferring instructions to the processing arrays. Each controller is operable to retrieve a plurality of incoming instruction streams in parallel with one another and operable to supply incoming instruction streams to one of a plurality of processing arrays.Type: GrantFiled: January 31, 2007Date of Patent: April 12, 2011Assignee: Rambus Inc.Inventors: Dave Stuttard, Dave Williams, Eamon O'Dea, Gordon Faulds, John Rhoades, Ken Cameron, Phil Atkin, Paul Winser, Russell David, Ray McConnell, Tim Day, Trey Greer
-
Publication number: 20110083000Abstract: A data processing architecture includes an input device that receives an incoming stream of data packets. A plurality of processing elements are operable to process data received from the input device. The input device is operable to distribute data packets in whole or in part to the processing elements in dependence upon the data processing bandwidth of the processing elements.Type: ApplicationFiled: December 10, 2010Publication date: April 7, 2011Inventors: John Rhoades, Ken Cameron, Paul Winser, Ray McConnell, Gordon Faulds, Simon McIntosh-Smith, Anthony Spencer, Jeff Bond, Matthias Dejaegher, Danny Halamish, Gajinder Panesar
-
Patent number: 7916864Abstract: A graphics processing unit is programmed to carry out cryptographic processing so that fast, effective cryptographic processing solutions can be provided without incurring additional hardware costs. The graphics processing unit can efficiently carry out cryptographic processing because it has an architecture that is configured to handle a large number of parallel processes. The cryptographic processing carried out on the graphics processing unit can be further improved by configuring the graphics processing unit to be capable of both floating point and integer operations.Type: GrantFiled: February 8, 2006Date of Patent: March 29, 2011Assignee: NVIDIA CorporationInventor: Norbert Juffa
-
Patent number: 7917727Abstract: An input/output system transfers data packets to and from a SIMD array of processing elements (PEs) such that different sizes of data packets are transferred to respective ones of the PEs. The packets are transferred in batches to respective different addresses in the array under the control of the PEs. Transfer to or from the array may be carried out when either a batch or part of a batch is ready for transfer. The decision to transfer either full or part batches is made in dependence upon the speed of the PEs and the speed and intermittency of the data packets.Type: GrantFiled: May 23, 2007Date of Patent: March 29, 2011Assignee: Rambus, Inc.Inventors: John Rhoades, Ken Cameron, Paul Winser, Ray McConnell, Gordon Faulds, Simon McIntosh-Smith, Anthony Spencer, Jeff Bond, Matthias Dejaegher, Danny Halamish, Gajinder Panesar
-
Publication number: 20110072238Abstract: The present invention provides a method for reducing program memory size required for a dual-issue processor with a scalar processor plus a SIMD vector processor. Coding the map of next group of instruction pairs in a no-operation (NOP) instruction of scalar and vector processor reduces the cases where one of the scalar or vector opcode being a NOP opcode. NOP for either scalar or vector processor defines the next 13 instructions as scalar-plus-vector, scalar-followed-by-scalar, or vector-followed-by-vector so that execution unit performs accordingly until next NOP or a branch instruction.Type: ApplicationFiled: September 20, 2009Publication date: March 24, 2011Inventor: Tibet Mimar
-
Patent number: 7908461Abstract: A data processing system includes an associative memory device containing n-cells, each of the n-cells includes a processing circuit. A controller is utilized for issuing one of a plurality of instructions to the associative memory device, while a clock device is utilized for outputting a synchronizing clock signal comprised of a predetermined number of clock cycles per second. The clock device outputs the synchronizing clock signal to the associative memory device and the controller which globally communicates one of the plurality of instructions to all of the n-cells simultaneously, within one of the clock cycles.Type: GrantFiled: December 19, 2007Date of Patent: March 15, 2011Assignee: Allsearch Semi, LLCInventors: Gheorghe Stefan, Dan Tomescu
-
Patent number: 7903810Abstract: A method and apparatus are disclosed for efficiently scrambling one or more bytes of data according to DSL standards on a processor. This is achieved by providing an instruction for scrambling one or more bytes of data according to the DSL standards. Accordingly, the invention advantageously provides a processor with the ability to scramble data with a single instruction thus allowing for more efficient and faster scrambling operations for subsequent modulation and transmission.Type: GrantFiled: September 22, 2004Date of Patent: March 8, 2011Assignee: Broadcom CorporationInventors: Mark Taunton, Timothy Martin Dobson
-
Patent number: 7904905Abstract: A system and method is disclosed for efficiently executing single program multiple data (SPMD) programs in a microprocessor. A micro single instruction multiple data (SIMD) unit is located within the microprocessor. A job buffer that is coupled to the micro SIMD unit dynamically allocates tasks to the micro SIMD unit. The SPMD programs each comprise a plurality of input data streams having moderate diversification of control flows. The system executes each SPMD program once for each input data stream of the plurality of input data streams.Type: GrantFiled: November 14, 2003Date of Patent: March 8, 2011Assignee: STMicroelectronics, Inc.Inventor: Stefano Cervini