Single Instruction, Multiple Data (SIMD) Patents (Class 712/22)
  • Patent number: 7900025
    Abstract: Mechanisms for implementing a floating point only single instruction multiple data instruction set architecture are provided. A processor is provided that comprises an issue unit, an execution unit coupled to the issue unit, and a vector register file coupled to the execution unit. The execution unit has logic that implements a floating point (FP) only single instruction multiple data (SIMD) instruction set architecture (ISA). The floating point vector registers of the vector register file store both scalar and floating point values as vectors having a plurality of vector elements. The processor may be part of a data processing system.
    Type: Grant
    Filed: October 14, 2008
    Date of Patent: March 1, 2011
    Assignee: International Business Machines Corporation
    Inventor: Michael K. Gschwind
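    The key idea above, keeping scalar floating-point values in the same vector registers used for SIMD data so that one FP-only instruction form serves both, can be sketched in plain C. The 4-element width and the choice of element 0 as the scalar slot are illustrative assumptions rather than details taken from the patent.

        #include <stdio.h>

        #define VLEN 4                      /* assumed vector width, for illustration */

        /* One "floating-point vector register": scalars live in element 0. */
        typedef struct { float e[VLEN]; } fvreg;

        /* A scalar value stored as a vector: element 0 holds the value. */
        static fvreg fv_scalar(float s) {
            fvreg r = {{0}};
            r.e[0] = s;
            return r;
        }

        /* SIMD add: the same instruction form serves vector and scalar data,
           because scalars are just vectors whose useful payload is element 0. */
        static fvreg fv_add(fvreg a, fvreg b) {
            fvreg r;
            for (int i = 0; i < VLEN; i++) r.e[i] = a.e[i] + b.e[i];
            return r;
        }

        int main(void) {
            fvreg v = {{1.0f, 2.0f, 3.0f, 4.0f}};
            fvreg s = fv_scalar(10.0f);
            fvreg sum = fv_add(v, s);       /* scalar add falls out of the vector add */
            printf("%f\n", sum.e[0]);       /* 11.0: the "scalar" result in element 0 */
            return 0;
        }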
  • Publication number: 20110047349
    Abstract: A processor includes a plurality of subfunctional units provided corresponding to respective slots of one or more pieces of operation result data, each piece including a plurality of slots for an SIMD operation, and an enable generating unit configured to, in each piece of operation result data, compare the value of a predetermined slot with the values of the other slots and disable any subfunctional unit to which a value equal to the value of the predetermined slot is inputted. The processor outputs the value of the predetermined slot as the value of the subfunctional units that have been disabled.
    Type: Application
    Filed: March 12, 2010
    Publication date: February 24, 2011
    Applicant: KABUSHIKI KAISHA TOSHIBA
    Inventor: Hiroo HAYASHI
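    A rough software model of the enable-generation scheme above: slot values are compared against a predetermined slot, matching subfunctional units are disabled, and the predetermined slot's result is forwarded in their place. The slot count, the use of slot 0 as the predetermined slot, and the example operation are assumptions made for illustration.

        #include <stdio.h>
        #include <stdint.h>

        #define SLOTS 4

        /* Model of the enable generating unit: slot 0 is treated as the
           predetermined slot; any other slot whose input equals slot 0's input
           has its subfunctional unit disabled and simply reuses slot 0's result. */
        static void simd_op_with_gating(const int32_t in[SLOTS], int32_t out[SLOTS]) {
            int32_t slot0_result = in[0] * in[0];          /* example operation: square */
            out[0] = slot0_result;
            for (int s = 1; s < SLOTS; s++) {
                if (in[s] == in[0]) {
                    out[s] = slot0_result;                 /* unit disabled, result forwarded */
                } else {
                    out[s] = in[s] * in[s];                /* unit enabled, computes normally */
                }
            }
        }

        int main(void) {
            int32_t in[SLOTS] = {3, 3, 5, 3};              /* slots 1 and 3 match slot 0 */
            int32_t out[SLOTS];
            simd_op_with_gating(in, out);
            for (int s = 0; s < SLOTS; s++) printf("%d ", out[s]);
            printf("\n");                                  /* 9 9 25 9 */
            return 0;
        }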
  • Patent number: 7895412
    Abstract: A programmable processing engine processes transient data within an intermediate network station of a computer network. The engine comprises an array of processing elements symmetrically arrayed as rows and columns, and embedded between input and output buffer units with a plurality of interfaces from the array to an external memory. The external memory stores non-transient data organized within data structures, such as forwarding and routing tables, for use in processing the transient data. Each processing element contains an instruction memory that allows programming of the array to process the transient data as processing element stages of baseline or extended pipelines operating in parallel.
    Type: Grant
    Filed: June 27, 2002
    Date of Patent: February 22, 2011
    Assignee: Cisco Technology, Inc.
    Inventors: Darren Kerr, Kenneth Michael Key, Michael L. Wright, William E. Jennings
  • Publication number: 20110040952
    Abstract: Leveling of the processing load is realized efficiently. Each processing element of an SIMD parallel computer system includes a data storage module that stores data to be processed or transferred, a number-of-data-sets storage device that stores the number of data sets, and a front data storage device that stores the front data. Each processing element further includes a control processor that compares the number of data sets stored in another processing element with the number of data sets stored in its own processing element, and issues a data distribution leveling instruction designating both an action for updating the contents of the data storage module, the number-of-data-sets storage device, and the front data storage device according to a rule determined from the comparison between its own processing element and the other processing elements, and an action for moving the data stored in the other processing element to its own processing element.
    Type: Application
    Filed: April 8, 2009
    Publication date: February 17, 2011
    Applicant: NEC CORPORATION
    Inventor: Shorin Kyo
  • Patent number: 7890733
    Abstract: A data processor comprises a plurality of processing elements (PEs), with memory local to at least one of the processing elements, and a data packet-switched network interconnecting the processing elements and the memory to enable any of the PEs to access the memory. The network consists of nodes arranged linearly or in a grid, e.g., in a SIMD array, so as to connect the PEs and their local memories to a common controller. Transaction-enabled PEs and nodes set flags, which are maintained until the transaction is completed and signal status to the controller, e.g., over a series of OR-gates. The processor performs memory accesses on data stored in the memory in response to control signals sent by the controller to the memory. The local memories share the same memory map or space. External memory may also be connected to the “end” nodes interfacing with the network, e.g., to provide cache.
    Type: Grant
    Filed: August 11, 2005
    Date of Patent: February 15, 2011
    Assignee: Rambus Inc.
    Inventor: Ray McConnell
  • Publication number: 20110029756
    Abstract: A method for decoding a codeword in a data stream encoded according to a low density parity check (LDPC) code having an m×j parity check matrix H by initializing variable nodes with soft values based on symbols in the codeword, wherein a graph representation of H includes m check nodes and j variable nodes, and wherein a check node m provides a row value estimate to a variable node j and a variable node j provides a column value estimate to a check node m if H(m,j) contains a 1, computing row value estimates for each check node, wherein amplitudes of only a subset of column value estimates provided to the check node are computed, computing soft values for each variable node based on the computed row value estimates, determining whether the codeword is decoded based on the soft values, and terminating decoding when the codeword is decoded.
    Type: Application
    Filed: July 28, 2009
    Publication date: February 3, 2011
    Inventors: Eric Biscondi, David Hoyle, Tod David Wolf
  • Patent number: 7882325
    Abstract: A single micro-instruction to perform either an N-bit or a 2N-bit load is provided. A microprocessor having an N-bit load port performs either an N-bit load or a 2N-bit load in a single cycle with the same micro-instruction being used for both the N-bit and the 2N-bit load.
    Type: Grant
    Filed: December 21, 2007
    Date of Patent: February 1, 2011
    Assignee: Intel Corporation
    Inventors: Zeev Sperber, Robert Valentine, Ehud Cohen, Doron Orenstien, Benny Eitan
  • Patent number: 7882312
    Abstract: A state engine receives multiple requests from a parallel processor for a shared state. The state engine includes at least one state element and the at least one state element is adapted to operate, atomically, on the shared state in response to a request made by the parallel processor. The request includes at least a command directing the at least one state element on how to perform an operation on the shared state. The state engine also includes a memory connected to the at least one state element and configured to store the shared state.
    Type: Grant
    Filed: November 11, 2003
    Date of Patent: February 1, 2011
    Assignee: Rambus Inc.
    Inventor: Anthony Spencer
  • Patent number: 7882333
    Abstract: A method for loading microcode to a plurality of cores within a processor. The method includes loading the microcode to a first core of the plurality of cores within the processor system and generating a broadcast inter-processor interrupt (IPI) message via the first core. The IPI message causes other cores within the processor system to synchronize respective microcode with the microcode that is loaded into the first core. The synchronizing loads microcode to the plurality of cores without requiring independent loads of microcode to each core.
    Type: Grant
    Filed: November 5, 2007
    Date of Patent: February 1, 2011
    Assignee: Dell Products L.P.
    Inventor: Mukund Khatri
  • Patent number: 7877585
    Abstract: One embodiment of a computing system configured to manage divergent threads in a SIMD thread group includes a stack configured to store state information for processing control instructions. A parallel processing unit is configured to perform the steps of determining if one or more threads diverge during execution of a conditional control instruction. A disable mask allows for the use of conditional return and break instructions in a multithreaded SIMD architecture. Additional control instructions are used to set up thread processing target addresses for synchronization, breaks, and returns.
    Type: Grant
    Filed: August 27, 2007
    Date of Patent: January 25, 2011
    Assignee: NVIDIA Corporation
    Inventors: Brett W. Coon, John R. Nickolls, John Erik Lindholm, Svetoslav D. Tzvetkov
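    The divergence handling summarized above can be approximated in software with a per-thread disable mask that lets both sides of a conditional execute while only enabled threads commit results. The sketch below illustrates the general SIMT masking technique under assumed parameters (an 8-thread group, a simple if/else), not the patented hardware mechanism itself.

        #include <stdio.h>
        #include <stdint.h>

        #define WARP 8

        /* Execute "if (x > 0) y = x*2; else y = -x;" for a SIMD thread group,
           using a disable/active mask so both sides of the branch run but each
           thread only commits results while it is enabled. */
        static void warp_conditional(const int x[WARP], int y[WARP]) {
            uint8_t taken = 0;                              /* threads taking the 'if' side */
            for (int t = 0; t < WARP; t++)
                if (x[t] > 0) taken |= (uint8_t)(1u << t);

            for (int t = 0; t < WARP; t++)                  /* 'if' side, others disabled */
                if (taken & (1u << t)) y[t] = x[t] * 2;

            for (int t = 0; t < WARP; t++)                  /* 'else' side, mask inverted */
                if (!(taken & (1u << t))) y[t] = -x[t];
            /* re-convergence point: all threads enabled again */
        }

        int main(void) {
            int x[WARP] = {1, -2, 3, -4, 5, -6, 7, -8}, y[WARP];
            warp_conditional(x, y);
            for (int t = 0; t < WARP; t++) printf("%d ", y[t]);
            printf("\n");                                   /* 2 2 6 4 10 6 14 8 */
            return 0;
        }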
  • Patent number: 7873812
    Abstract: The new system provides for efficient implementation of matrix multiplication in a SIMD processor. The new system provides the ability to map any element of a source vector register to be paired with any element of a second source vector register for vector operations, specifically vector multiply and vector-multiply-accumulate operations, to implement a variety of matrix multiplications without additional permute or data re-ordering instructions. Operations such as DCT and color-space transformations for video processing can be implemented very efficiently using this system.
    Type: Grant
    Filed: April 5, 2004
    Date of Patent: January 18, 2011
    Inventor: Tibet Mimar
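    A scalar C model of the element-mapping multiply described above, used here to accumulate a 4x4 matrix-vector product without a separate permute step. The control-vector encoding (an index per lane selecting the paired element of the second source) is an assumed illustration of the idea, not the patent's actual instruction format.

        #include <stdio.h>

        #define VLEN 4

        /* Vector multiply in which a per-element map chooses which element of the
           second source is paired with each element of the first source, so no
           separate permute instruction is needed. */
        static void vmul_mapped(const float a[VLEN], const float b[VLEN],
                                const int map[VLEN], float out[VLEN]) {
            for (int i = 0; i < VLEN; i++)
                out[i] = a[i] * b[map[i]];
        }

        int main(void) {
            /* 4x4 matrix (row-major) times vector, accumulated column by column. */
            float m[4][4] = {{1,0,0,0},{0,2,0,0},{0,0,3,0},{0,0,0,4}};
            float x[VLEN] = {1, 2, 3, 4};
            float y[VLEN] = {0, 0, 0, 0};

            for (int col = 0; col < 4; col++) {
                /* broadcast-style map: every lane reads element 'col' of x */
                int map[VLEN] = {col, col, col, col};
                float colvals[VLEN] = {m[0][col], m[1][col], m[2][col], m[3][col]};
                float prod[VLEN];
                vmul_mapped(colvals, x, map, prod);
                for (int i = 0; i < VLEN; i++) y[i] += prod[i];  /* accumulate */
            }
            for (int i = 0; i < 4; i++) printf("%g ", y[i]);     /* 1 4 9 16 */
            printf("\n");
            return 0;
        }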
  • Patent number: 7873794
    Abstract: Disclosed is an apparatus, method, and program product that provides atomic, multi-word load support without incurring additional memory utilization. A double-word is atomically loaded without the use of one or more additional fields and without a lock. An invalidity marker is used in connection with a cache miss time to ascertain whether a loaded double-word has been stored and loaded atomically, and is thus, valid.
    Type: Grant
    Filed: August 21, 2007
    Date of Patent: January 18, 2011
    Assignee: International Business Machines Corporation
    Inventors: Michael Joseph Corrigan, Timothy Joseph Torzewski
  • Publication number: 20110010524
    Abstract: There is provided an SIMD processor array system in which data can be efficiently transferred between processor elements located at different distances. The SIMD processor array system includes a control processor (CP) that is capable of issuing a plurality of instructions at the same time, and a PE array that includes a plurality of mutually connected processing elements (PEs) controlled by the CP. The CP issues an inter-PE data shift instruction to each PE. According to the inter-PE data shift instruction, each PE performs a data sending operation of copying all the contents of the transfer data storing part of an adjoining PE to the transfer data storing part (MBF) of its own PE, and a data fetch operation of copying part or all of the contents of the MBF of the adjoining PE to the transfer data fetch and storing part (RBUF) of its own PE if part of the contents of the MBF of the adjoining PE coincides with the contents of the ID storing part (IDB) of its own PE.
    Type: Application
    Filed: March 4, 2009
    Publication date: January 13, 2011
    Applicant: NEC CORPORATION
    Inventor: Shorin Kyo
  • Publication number: 20110004743
    Abstract: A data processing apparatus 1 has a plurality of registers 10 of the same type of register and a plurality of processing pipelines 40, 50, each processing pipeline 40, 50 being arranged to process instructions. At least one instruction includes a destination register specifier specifying which of said registers is a destination register for storing a processing result of the at least one instruction. Instruction issuing circuitry 26 is configured to issue the at least one instruction for processing by one of the plurality of processing pipelines. The instruction issuing circuitry 26 selects the one of the plurality of processing pipelines to which the candidate instruction is issued in dependence upon the value of the destination register specifier of the candidate instruction.
    Type: Application
    Filed: July 1, 2009
    Publication date: January 6, 2011
    Applicant: ARM Limited
    Inventor: David Raymond Lutz
  • Publication number: 20100332794
    Abstract: An instruction is received indicating first and second operands, each having packed data elements that correspond in respective positions. A first subset of the data elements of the first operand and a first subset of the data elements of the second operand each correspond to a first lane; a second subset of the data elements of the first operand and a second subset of the data elements of the second operand each correspond to a second lane. In response to the instruction, a result is stored that includes: (1) in the first lane, only the lowest-order data elements from the first subset of the first operand interleaved with the corresponding lowest-order data elements from the first subset of the second operand; and (2) in the second lane, only the highest-order data elements from the second subset of the first operand interleaved with the corresponding highest-order data elements from the second subset of the second operand.
    Type: Application
    Filed: June 30, 2009
    Publication date: December 30, 2010
    Inventors: Asaf Hargil, Doron Orenstein
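    One natural reading of the result layout described above, assuming 32-bit elements and two 128-bit lanes (eight elements per operand), modeled in plain C:

        #include <stdio.h>
        #include <stdint.h>

        /* Two operands of 8 x 32-bit elements, viewed as two 4-element lanes.
           Result lane 0: the two lowest elements of lane 0 of each operand,
           interleaved; result lane 1: the two highest elements of lane 1 of
           each operand, interleaved. */
        static void interleave_lanes(const uint32_t a[8], const uint32_t b[8],
                                     uint32_t r[8]) {
            /* lane 0: interleave a[0..1] with b[0..1] */
            r[0] = a[0]; r[1] = b[0]; r[2] = a[1]; r[3] = b[1];
            /* lane 1: interleave a[6..7] with b[6..7] */
            r[4] = a[6]; r[5] = b[6]; r[6] = a[7]; r[7] = b[7];
        }

        int main(void) {
            uint32_t a[8] = {0, 1, 2, 3, 4, 5, 6, 7};
            uint32_t b[8] = {10, 11, 12, 13, 14, 15, 16, 17};
            uint32_t r[8];
            interleave_lanes(a, b, r);
            for (int i = 0; i < 8; i++) printf("%u ", r[i]);
            printf("\n");   /* 0 10 1 11 6 16 7 17 */
            return 0;
        }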
  • Patent number: 7861060
    Abstract: Parallel data processing systems and methods use cooperative thread arrays (CTAs), i.e., groups of multiple threads that concurrently execute the same program on an input data set to produce an output data set. Each thread in a CTA has a unique identifier (thread ID) that can be assigned at thread launch time. The thread ID controls various aspects of the thread's processing behavior such as the portion of the input data set to be processed by each thread, the portion of an output data set to be produced by each thread, and/or sharing of intermediate results among threads. Mechanisms for loading and launching CTAs in a representative processing core and for synchronizing threads within a CTA are also described.
    Type: Grant
    Filed: December 15, 2005
    Date of Patent: December 28, 2010
    Assignee: NVIDIA Corporation
    Inventors: John R. Nickolls, Stephen D. Lew
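    The role of the thread ID described above, selecting which slice of the input and output each thread handles, can be shown with a small sequential C model. The thread count, array size, and strided partitioning are illustrative assumptions; in the actual design the threads run concurrently and synchronize in hardware.

        #include <stdio.h>

        #define NUM_THREADS 4
        #define N 16

        /* Each simulated CTA thread uses its thread ID to pick a strided portion
           of the input, writes its portion of the output, and returns a partial
           sum that would be combined after synchronization in the real design. */
        static float thread_body(int tid, const float *in, float *out) {
            float partial = 0.0f;
            for (int i = tid; i < N; i += NUM_THREADS) {   /* portion selected by tid */
                out[i] = in[i] * 2.0f;
                partial += in[i];
            }
            return partial;
        }

        int main(void) {
            float in[N], out[N], total = 0.0f;
            for (int i = 0; i < N; i++) in[i] = (float)i;

            for (int tid = 0; tid < NUM_THREADS; tid++)    /* concurrent in hardware */
                total += thread_body(tid, in, out);

            printf("sum = %g, out[5] = %g\n", total, out[5]);  /* sum = 120, out[5] = 10 */
            return 0;
        }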
  • Patent number: 7861071
    Abstract: A method of conditionally executing branch instructions which comprise an opcode field defining a type of test to be applied to determine whether or not to execute a branch operation, a control field designating a control store holding a plurality of indicators and a destination field holding information on a branch target address. The method comprises determining from the opcode field whether or not the test will check the state of one indicator or a plurality of indicators in the designated control store, accessing the designated control store to check the state of said one or said plurality of indicators depending on the determination, and generating a branch target address using information in the destination field in dependence on the state of the or each indicator checked.
    Type: Grant
    Filed: May 30, 2002
    Date of Patent: December 28, 2010
    Assignee: Broadcom Corporation
    Inventor: Sophie Wilson
  • Patent number: 7856543
    Abstract: A data processing architecture comprising: an input device for receiving an incoming stream of data packets; and a plurality of processing elements which are operable to process data received thereby; wherein the input device is operable to distribute data packets in whole or in part to the processing elements in dependence upon the data processing bandwidth of the processing elements.
    Type: Grant
    Filed: February 14, 2002
    Date of Patent: December 21, 2010
    Assignee: Rambus Inc.
    Inventors: John Rhoades, Ken Cameron, Paul Winser, Ray McConnell, Gordon Faulds, Simon McIntosh-Smith, Anthony Spencer, Jeff Bond, Matthias Dejaegher, Danny Halamish, Gajinder Panesar
  • Publication number: 20100318766
    Abstract: A processor includes a processing unit capable of executing single-instruction multiple-data operations; a register file configured to store data that is to be supplied to the processing unit and subjected to operations; and a buffer provided separately from the register file, in which an integer “n” number of data columns, each having a plurality of data elements, are written on a column-by-column basis, and from which data elements at the same location are selected and read as “n” data elements from the respective “n” data columns, wherein the “n” data elements read from the buffer are supplied to the processing unit as data to be subjected to a single-instruction multiple-data operation.
    Type: Application
    Filed: June 7, 2010
    Publication date: December 16, 2010
    Applicant: FUJITSU SEMICONDUCTOR LIMITED
    Inventor: Masayuki TSUJI
  • Patent number: 7853778
    Abstract: A method includes, in a processor, loading/moving a first portion of bits of a source into a first portion of a destination register and duplicating that first portion of bits in a subsequent portion of the destination register.
    Type: Grant
    Filed: December 20, 2001
    Date of Patent: December 14, 2010
    Assignee: Intel Corporation
    Inventor: Patrice Roussel
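    A C model of the single load-and-duplicate step described above, assuming a 128-bit destination and a 64-bit first portion (an interpretation chosen for illustration):

        #include <stdio.h>
        #include <stdint.h>
        #include <string.h>

        /* 128-bit destination modeled as two 64-bit halves. */
        typedef struct { uint64_t lo, hi; } xreg;

        /* Load the low 64 bits of the source and duplicate them into the
           upper 64 bits of the destination, as a single "instruction". */
        static xreg load_dup64(const void *src) {
            xreg r;
            memcpy(&r.lo, src, sizeof r.lo);   /* first portion of bits */
            r.hi = r.lo;                       /* duplicated into the subsequent portion */
            return r;
        }

        int main(void) {
            double d = 3.5;
            xreg r = load_dup64(&d);
            double lo, hi;
            memcpy(&lo, &r.lo, sizeof lo);
            memcpy(&hi, &r.hi, sizeof hi);
            printf("%g %g\n", lo, hi);         /* 3.5 3.5 */
            return 0;
        }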
  • Publication number: 20100312989
    Abstract: A processor 2 supporting register renaming has a rename table 20 in which the flag register has multiple tag values associated therewith. These tag values indicate which virtual register corresponds to a destination flag register of the oldest instruction which wrote a still up-to-date value of a subset of the flags.
    Type: Application
    Filed: June 4, 2009
    Publication date: December 9, 2010
    Inventor: James Nolan Hardage
  • Publication number: 20100293534
    Abstract: In one embodiment, the invention is a method and apparatus for use of vectorization instruction sets. One embodiment of a method for generating vector instructions includes receiving source code written in a high-level programming language, wherein the source code includes at least one high-level instruction that performs multiple operations on a plurality of vector operands, and compiling the high-level instruction(s) into one or more low-level instructions, wherein the low-level instructions are in an instruction set of a specific computer architecture.
    Type: Application
    Filed: May 15, 2009
    Publication date: November 18, 2010
    Inventors: Henrique Andrade, Bugra Gedik, Hua Yong Wang, Kun-Lung Wu
  • Patent number: 7831804
    Abstract: A processor architecture includes a number of processing elements for treating input signals. The architecture is organized as a matrix of rows and columns, where each column includes at least one microprocessor block having a computational part and a set of associated processing elements that are able to receive the same input signals. The number of associated processing elements is selectively variable in the direction of the column so as to exploit the parallelism of said signals. Additionally, the processor architecture enables dynamic switching between instruction parallelism and the data-parallel processing typical of vectorial functionality. The architecture can be scaled in various dimensions into an optimal configuration for the algorithm to be executed.
    Type: Grant
    Filed: May 30, 2008
    Date of Patent: November 9, 2010
    Assignee: ST Microelectronics S.R.L.
    Inventors: Francesco Pappalardo, Giuseppe Notarangelo, Elio Guidetti
  • Publication number: 20100281255
    Abstract: In one embodiment of the present invention, a method includes verifying a master processor of a system; validating a trusted agent with the master processor if the master processor is verified; and launching the trusted agent on a plurality of processors of the system if the trusted agent is validated. After execution of such a trusted agent, a secure kernel may then be launched, in certain embodiments. The system may be a multiprocessor server system having a partially or fully connected topology with arbitrary point-to-point interconnects, for example.
    Type: Application
    Filed: June 29, 2010
    Publication date: November 4, 2010
    Inventors: John H. Wilson, Ioannis T. Schoinas, Mazin S. Yousif, Linda J. Rankin, David W. Grawrock, Robert J. Greiner, James A. Sutton, Kushagra Vaid, Willard M. Wiseman
  • Publication number: 20100274989
    Abstract: A method executed by an instruction set on a processor is described. The method includes providing a tbbit instruction, inputting a first index for the tbbit instruction, loading a second value for the tbbit instruction, wherein the second value comprises at least 2^b bits, using selected b bits of the first index to select at least one target bit in the loaded second value, shifting the target bit into the bottom of the first index, and computing a second index based on the shifting of the target bit into the bottom of the first index. Other methods and variations are also described.
    Type: Application
    Filed: December 8, 2008
    Publication date: October 28, 2010
    Inventors: Mayan Moudgill, Sitij Agrawal
  • Publication number: 20100274990
    Abstract: An apparatus and method for performing SIMD multiply-accumulate operations includes SIMD data processing circuitry responsive to control signals to perform data processing operations in parallel on multiple data elements. Instruction decoder circuitry is coupled to the SIMD data processing circuitry and is responsive to program instructions to generate the required control signals. The instruction decoder circuitry is responsive to a single instruction (referred to herein as a repeating multiply-accumulate instruction) having as input operands a first vector of input data elements, a second vector of coefficient data elements, and a scalar value indicative of a plurality of iterations required, to generate control signals to control the SIMD processing circuitry.
    Type: Application
    Filed: September 17, 2009
    Publication date: October 28, 2010
    Inventors: Mladen Wilder, Dominic Hugo Symes, Richard Edward Bruce
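    One plausible interpretation of the repeating multiply-accumulate instruction above, modeled in C as a small FIR-style accumulation driven by the scalar iteration count. The iteration semantics shown here are an assumption made for illustration, not taken from the publication.

        #include <stdio.h>

        #define VLEN 4

        /* Repeating multiply-accumulate: for 'iters' iterations, each lane
           accumulates input[lane + i] * coeff[i], i.e. a small FIR filter
           evaluated for VLEN output positions by one "instruction". */
        static void repeat_mac(const float *input, const float *coeff,
                               int iters, float acc[VLEN]) {
            for (int i = 0; i < iters; i++)              /* scalar iteration count */
                for (int lane = 0; lane < VLEN; lane++)  /* SIMD lanes in parallel */
                    acc[lane] += input[lane + i] * coeff[i];
        }

        int main(void) {
            float input[8] = {1, 2, 3, 4, 5, 6, 7, 8};
            float coeff[3] = {0.5f, 0.25f, 0.25f};
            float acc[VLEN] = {0};
            repeat_mac(input, coeff, 3, acc);
            for (int l = 0; l < VLEN; l++) printf("%g ", acc[l]);
            printf("\n");   /* 1.75 2.75 3.75 4.75 */
            return 0;
        }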
  • Patent number: 7818540
    Abstract: A vector processing system for executing vector instructions, each instruction defining multiple value pairs, an operation to be executed and a modifier, the vector processing system comprising a plurality of parallel processing units, each arranged to receive one of said pairs of values and, when selected, to implement an operation on said value pair to generate a result, each processing unit comprising at least one flag and being selectable in dependence on a condition defined by said at least one flag, wherein the modifier defines the condition under which the parallel processing unit is individually selected.
    Type: Grant
    Filed: May 19, 2006
    Date of Patent: October 19, 2010
    Assignee: Broadcom Corporation
    Inventors: Stephen Barlow, Neil Bailey, Timothy Ramsdale, David Plowman, Robert Swann
  • Patent number: 7818539
    Abstract: A processor implements conditional vector operations in which, for example, an input vector containing multiple operands to be used in conditional operations is divided into two or more output vectors based on a condition vector. Each output vector can then be processed at full processor efficiency without cycles wasted due to branch latency. Data to be processed are divided into two groups based on whether or not they satisfy a given condition by e.g., steering each to one of the two index vectors. Once the data have been segregated in this way, subsequent processing can be performed without conditional operations, processor cycles wasted due to branch latency, incorrect speculation or execution of unnecessary instructions due to predication. Other examples of conditional operations include combining one or more input vectors into a single output vector based on a condition vector, conditional vector switching, conditional vector combining, and conditional vector load balancing.
    Type: Grant
    Filed: August 28, 2006
    Date of Patent: October 19, 2010
    Assignees: The Board of Trustees of the Leland Stanford Junior University, The Massachusetts Institute of Technology
    Inventors: Scott Rixner, John D. Owens, Ujval J. Kapasi, William J. Dally
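    The stream-splitting form of the conditional vector operations described above can be sketched as follows; the element type and the specific condition are illustrative choices.

        #include <stdio.h>

        /* Split an input vector into two output vectors according to a condition
           vector, so that each output can later be processed branch-free. Returns
           the number of elements routed to 'pass'; the rest go to 'fail'. */
        static int cond_vector_split(const int *in, const int *cond, int n,
                                     int *pass, int *fail) {
            int np = 0, nf = 0;
            for (int i = 0; i < n; i++) {
                if (cond[i]) pass[np++] = in[i];
                else         fail[nf++] = in[i];
            }
            return np;
        }

        int main(void) {
            int in[8]   = {5, -1, 7, -3, 9, -2, 4, -8};
            int cond[8] = {1,  0, 1,  0, 1,  0, 1,  0};   /* e.g. "value is positive" */
            int pass[8], fail[8];
            int np = cond_vector_split(in, cond, 8, pass, fail);
            printf("pass:");
            for (int i = 0; i < np; i++) printf(" %d", pass[i]);
            printf("\nfail:");
            for (int i = 0; i < 8 - np; i++) printf(" %d", fail[i]);
            printf("\n");
            return 0;
        }

    Because each output vector is then homogeneous with respect to the condition, subsequent passes over pass[] and fail[] need no per-element branches, which is the efficiency point made in the abstract.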
  • Patent number: 7818541
    Abstract: A data processing architecture comprising: an input device for receiving an incoming stream of data packets; and a plurality of processing elements which are operable to process data received thereby; wherein the input device is operable to distribute data packets in whole or in part to the processing elements in dependence upon the data processing bandwidth of the processing elements.
    Type: Grant
    Filed: May 23, 2007
    Date of Patent: October 19, 2010
    Assignee: Clearspeed Technology Limited
    Inventors: John Rhoades, Ken Cameron, Paul Winser, Ray McConnell, Gordon Faulds, Simon McIntosh-Smith, Anthony Spencer, Jeff Bond, Matthias Dejaegher, Danny Halamish, Gajinder Panesar
  • Patent number: 7818548
    Abstract: Methods and software are presented for processing data in a programmable processor, involving (a) decoding instructions for execution using an execution unit operable to execute instructions by partitioning data stored in registers in a register file into multiple data elements, the instructions selected from an instruction set that includes group arithmetic instructions and group data handling instructions, (b) in response to decoding different group arithmetic instructions, executing a plurality of different group floating-point and group integer arithmetic operations that each arithmetically operates on multiple data elements stored in registers in the register file to produce a catenated result that is returned to a register in the register file, wherein the catenated result comprises a plurality of individual results, and (c) in response to decoding different group data handling instructions, executing group data handling operations that re-arrange data elements in different ways.
    Type: Grant
    Filed: July 27, 2007
    Date of Patent: October 19, 2010
    Assignee: Microunity Systems Engineering, Inc.
    Inventors: Craig Hansen, John Moussouris, Alexia Massalin
  • Patent number: 7814297
    Abstract: A data processing apparatus comprises data processing logic operable to perform data processing operations specified by program instructions. The data processing logic (140) has a plurality of functional units (142, 144, 146) configured to execute in parallel on data received from a data source. A decoder (130) is responsive to a single program instruction to control the data processing logic (140) to concurrently execute the single program instruction on each of a plurality of vector elements of each of a respective plurality of vector input operands (310, 320) received from the data source using the plurality of functional units (142, 144, 146).
    Type: Grant
    Filed: July 26, 2005
    Date of Patent: October 12, 2010
    Assignee: ARM Limited
    Inventor: Martinus Cornelis Wezelenburg
  • Publication number: 20100250897
    Abstract: The invention relates to a parallel processor which comprises elementary processors (3) disposed according to a topology with a predetermined position within this topology and capable of simultaneously executing the same instruction on different data, the instruction relating to at least one operand and/or providing at least one result. The instruction comprises, for each operand and/or each result, information relating to the position of a field of action within a data structure of the table of dimension M type and the parallel processor comprises means (41, 42, 43) for calculating the address of each operand and/or each result within each elementary processor, as a function of the position of the field of action and of the position of the elementary processor within the topology.
    Type: Application
    Filed: June 26, 2008
    Publication date: September 30, 2010
    Applicant: Thales
    Inventor: Gérard Gaillat
  • Patent number: 7805561
    Abstract: A single instruction, multiple data (“SIMD”) computer system includes a central control unit coupled to 256 processing elements (“PEs”) and to 32 static random access memory (“SRAM”) devices. Each group of eight PEs can access respective groups of eight columns in a respective SRAM device. Each PE includes a local column address register that can be loaded through a data bus of the respective PE. A local column address stored in the local column address register is applied to an AND gate, which selects either the local column address or a column address applied to the AND gate by the central control unit. As a result, the central control unit can globally access the SRAM device, or a specific one of the eight columns that can be accessed by each PE can be selected locally by the PE.
    Type: Grant
    Filed: January 16, 2009
    Date of Patent: September 28, 2010
    Assignee: Micron Technology, Inc.
    Inventor: Jon Skull
  • Publication number: 20100241824
    Abstract: Techniques are disclosed for converting data into a format tailored for efficient multidimensional fast Fourier transforms (FFTs) on single instruction, multiple data (SIMD) multi-core processor architectures. The technique includes converting data from a multidimensional array stored in conventional row-major order into SIMD format. Converted data in SIMD format consists of a sequence of blocks, where each block interleaves s rows such that SIMD vector processors may operate on s rows simultaneously. As a result, the converted data in SIMD format enables smaller-sized 1D FFTs to be optimized on SIMD multi-core processor architectures.
    Type: Application
    Filed: March 18, 2009
    Publication date: September 23, 2010
    Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
    Inventors: David G. Carlson, Travis M. Drucker, Timothy J. Mullins, Jeffrey S. McAllister, Nelson Ramirez
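    The row-interleaving conversion described above can be modeled in plain C. The element type, matrix shape, and SIMD width s used below are assumptions for illustration.

        #include <stdio.h>

        /* Convert a rows x cols row-major matrix into a sequence of blocks in
           which each block interleaves 's' consecutive rows element-by-element,
           so that an s-wide SIMD unit can process s rows at once.
           Assumes rows is a multiple of s. */
        static void to_simd_format(const float *in, float *out,
                                   int rows, int cols, int s) {
            for (int b = 0; b < rows / s; b++)            /* block of s rows */
                for (int c = 0; c < cols; c++)            /* column within the block */
                    for (int r = 0; r < s; r++)           /* row within the block */
                        out[(b * cols + c) * s + r] = in[(b * s + r) * cols + c];
        }

        int main(void) {
            /* 4 x 3 matrix, interleaved two rows at a time (s = 2). */
            float in[12] = {1, 2, 3,
                            4, 5, 6,
                            7, 8, 9,
                           10,11,12};
            float out[12];
            to_simd_format(in, out, 4, 3, 2);
            for (int i = 0; i < 12; i++) printf("%g ", out[i]);
            printf("\n");   /* 1 4 2 5 3 6 7 10 8 11 9 12 */
            return 0;
        }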
  • Patent number: 7802079
    Abstract: A parallel data processing apparatus using a SIMD array of processing elements is disclosed. The apparatus makes use of a register in order to control issuance of instructions to the processing elements in the array.
    Type: Grant
    Filed: June 29, 2007
    Date of Patent: September 21, 2010
    Assignee: Clearspeed Technology Limited
    Inventors: Dave Stuttard, Dave Williams, Eamon O'Dea, Gordon Faulds, John Rhoades, Ken Cameron, Phil Atkin, Paul Winser, Russell David, Ray McConnell, Tim Day, Trey Greer
  • Patent number: 7797517
    Abstract: Reference architecture instructions are translated into target architecture operations. Sequences of operations, in a predicted execution order in some embodiments, form traces. In some embodiments, a trace is based on a plurality of basic blocks. In some embodiments, a trace is committed or aborted as a single entity. Sequences of operations are optimized by fusing collections of operations; fused operations specify a same observable function as respective collections, but advantageously enable more efficient processing. In some embodiments, a collection comprises multiple register operations. Fusing a register operation with a branch operation in a trace forms a fused reg-op/branch operation. In some embodiments, branch instructions translate into assert operations. Fusing an assert operation with another operation forms a fused assert operation. In some embodiments, fused operations only set architectural state, such as high-order portions of registers, that is subsequently read before being written.
    Type: Grant
    Filed: November 17, 2006
    Date of Patent: September 14, 2010
    Assignee: Oracle America, Inc.
    Inventor: John Gregory Favor
  • Patent number: 7788468
    Abstract: A “cooperative thread array,” or “CTA,” is a group of multiple threads that concurrently execute the same program on an input data set to produce an output data set. Each thread in a CTA has a unique thread identifier assigned at thread launch time that controls various aspects of the thread's processing behavior such as the portion of the input data set to be processed by each thread, the portion of an output data set to be produced by each thread, and/or sharing of intermediate results among threads. Different threads of the CTA are advantageously synchronized at appropriate points during CTA execution using a barrier synchronization technique in which barrier instructions in the CTA program are detected and used to suspend execution of some threads until a specified number of other threads also reaches the barrier point.
    Type: Grant
    Filed: December 15, 2005
    Date of Patent: August 31, 2010
    Assignee: NVIDIA Corporation
    Inventors: John R. Nickolls, Stephen D. Lew, Brett W. Coon, Peter C. Mills
  • Patent number: 7788471
    Abstract: A system and method for performing vector arithmetic is disclosed. The method includes loading two operand vectors, each composed of a number of vector elements, into two storage locations. A selected arithmetic operation is performed on the operand vectors to produce a result vector having the number of vector elements. Each vector element of the result vector is associated with an arithmetic logic cell that has a first input that can receive any vector element from the first vector and a second input that can receive any vector element from the second vector. Accordingly each vector element of the result vector is a function of any two individual vector elements of the operand vectors. By applying the operand vector elements to the appropriate arithmetic logic cells, and by selecting the appropriate arithmetic operation, complex vector operations can be performed efficiently.
    Type: Grant
    Filed: September 18, 2006
    Date of Patent: August 31, 2010
    Assignee: Freescale Semiconductor, Inc.
    Inventor: Chengke Sheng
  • Patent number: 7783862
    Abstract: One embodiment of the present invention is a processor that processes inductive doubling SIMD instructions, which processor includes: an Instruction Fetch Unit that loads a SIMD instruction and applies it as input to a SIMD Instruction Decode Unit; wherein the SIMD Instruction Decode Unit decodes the applied SIMD instruction and produces output signals including SIMD field width identification signals and one or more SIMD half-operand modifier signals.
    Type: Grant
    Filed: August 6, 2007
    Date of Patent: August 24, 2010
    Assignee: International Characters, Inc.
    Inventor: Robert D. Cameron
  • Publication number: 20100211758
    Abstract: A microprocessor that can perform sequential processing in data array units includes: a load store unit that, when a fetched instruction is a load instruction for data, loads a data sequence including the designated data from a data memory in memory-width units and, based on an analysis result of the instruction, specifies data scheduled to be designated by a future load instruction; and a data temporary storage unit that stores the use-scheduled data specified by the load store unit.
    Type: Application
    Filed: December 29, 2009
    Publication date: August 19, 2010
    Applicant: KABUSHIKI KAISHA TOSHIBA
    Inventors: Masato Sumiyoshi, Takashi Miyamori, Shunichi Ishiwata, Katsuyuki Kimura, Takahisa Wada, Keiri Nakanishi, Yasuki Tanabe, Ryuji Hada
  • Publication number: 20100211858
    Abstract: An application specific processor to implement a Viterbi decode algorithm for channel decoding functions of received symbols. The Viterbi decode algorithm is at least one of a bit-serial decode algorithm and a block-based decode algorithm. The application specific processor includes a Load-Store, Logical and De-puncturing (LLD) slot that performs a Load-Store function, a Logical function, a De-puncturing function, and a Trace-back Address generation function; a Branch Metric Compute (BMU) slot that performs Radix-2 branch metric computations, Radix-4 branch metric computations, and Squared Euclidean branch metric computations; and an Add-Compare-Select (ACS) slot that performs Radix-2 path metric computations, Radix-4 path metric computations, best state computations, and decision bit generation. The LLD slot, the BMU slot and the ACS slot operate in a software-pipelined manner to enable high-speed Viterbi decoding functions.
    Type: Application
    Filed: February 18, 2010
    Publication date: August 19, 2010
    Applicant: SAANKHYA LABS PVT LTD
    Inventors: Anindya Saha, Hemant Mallapur, Santhosh Billava, Smitha Bmv
  • Patent number: 7774189
    Abstract: A system and method for implementing a unified model for integration systems is presented. A user provides inputs to an integrated language engine for placing operator components and arc components onto a dataflow diagram. Operator components include data ports for expressing data flow, and also include meta-ports for expressing control flow. Arc components connect operator components together so that data and control information can flow between the operator components. The dataflow diagram is a directed acyclic graph that expresses an application without including artificial boundaries during the application design process. Once the integrated language engine generates the dataflow diagram, the integrated language engine compiles the dataflow diagram to generate application code.
    Type: Grant
    Filed: December 1, 2006
    Date of Patent: August 10, 2010
    Assignee: International Business Machines Corporation
    Inventors: Amir Bar-Or, Michael James Beckerle
  • Patent number: 7774600
    Abstract: In one embodiment of the present invention, a method includes verifying an initiating logical processor of a system; validating a trusted agent with the initiating logical processor if the initiating logical processor is verified; and launching the trusted agent on a plurality of processors of the system if the trusted agent is validated. After execution of such a trusted agent, a secure kernel may then be launched, in certain embodiments. The system may be a multiprocessor server system having a partially or fully connected topology with arbitrary point-to-point interconnects, for example.
    Type: Grant
    Filed: December 27, 2007
    Date of Patent: August 10, 2010
    Assignee: Intel Corporation
    Inventors: John H. Wilson, Ioannis T. Schoinas, Mazin S. Yousif, Linda J. Rankin, David W. Grawrock, Robert J. Greiner, James A. Sutton, Kushagra Vaid, Willard M. Wiseman
  • Patent number: 7769980
    Abstract: Arithmetic/logic units (ALUs) provided corresponding to entries each include an MIMD instruction decoder that generates a group of control signals in accordance with a Multiple Instruction-Multiple Data (MIMD) instruction, an MIMD register that stores data designating the MIMD instruction, and an inter-ALU communication circuit. The amount and direction of movement through the inter-ALU communication circuit are set by data bits stored in a movement data register. Data movement and arithmetic/logic operations can be executed with the amount of movement and the operation instruction set individually for each ALU unit. Therefore, in a Single Instruction-Multiple Data type processing device, Multiple Instruction-Multiple Data operation can be executed at high speed in a flexible manner.
    Type: Grant
    Filed: August 16, 2007
    Date of Patent: August 3, 2010
    Assignee: Renesas Technology Corp.
    Inventors: Toshinori Sueyoshi, Masahiro Iida, Mitsutaka Nakano, Fumiaki Senoue, Katsuya Mizumoto
  • Patent number: 7769989
    Abstract: A processor architecture, for example, a SIMD processor architecture, includes at least two arithmetic/logic units to implement data processing, a data memory arrangement or a memory device interface to a memory arrangement to store data of different data types, an addressing unit to generate access addresses for the data to be stored in the data memory arrangement, and an address memory arrangement to store access addresses. The access addresses are logically linked to the given data type of the data, and/or a distribution of the data to the arithmetic/logic units is dependent on the access addresses, and/or a storage of the output data as the data is dependent on the access addresses.
    Type: Grant
    Filed: September 1, 2006
    Date of Patent: August 3, 2010
    Assignee: Trident Microsystems (Far East) Ltd.
    Inventors: Carsten Noeske, Matthias Vierthaler
  • Patent number: 7770005
    Abstract: In one embodiment of the present invention, a method includes verifying an initiating logical processor of a system; validating a trusted agent with the initiating logical processor if the initiating logical processor is verified; and launching the trusted agent on a plurality of processors of the system if the trusted agent is validated. After execution of such a trusted agent, a secure kernel may then be launched, in certain embodiments. The system may be a multiprocessor server system having a partially or fully connected topology with arbitrary point-to-point interconnects, for example.
    Type: Grant
    Filed: December 27, 2007
    Date of Patent: August 3, 2010
    Assignee: Intel Corporation
    Inventors: John H. Wilson, Ioannis T. Schoinas, Mazin S. Yousif, Linda J. Rankin, David W. Grawrock, Robert J. Greiner, James A. Sutton, Kushagra Vaid, Willard M. Wiseman
  • Publication number: 20100180100
    Abstract: A microprocessor includes a direct memory access (DMA) engine which is responsive to pairs of block indices associated with one or more blocks in a first logical plane and transfers the one or more blocks between the first logical plane, a second logical plane, and a physical memory space according to the pairs of block indices. The logical planes represent two-dimensional fields of data such as those found in images and videos. The microprocessor further comprises a cache memory which updates its content with one or more cache-blocks in the neighborhood of the one or more blocks, improving the operation of the cache memory by increasing cache hits. The DMA engine may further operate on n-dimensional blocks in an n-dimensional logical space. The microprocessor further includes special-purpose instructions, operative on a single-instruction-multiple-data (SIMD) computation unit, especially tailored to perform matrix operations.
    Type: Application
    Filed: January 13, 2009
    Publication date: July 15, 2010
    Inventors: Tsung-Hsin Lu, Carl Alberola, Rajesh Chhabria, Zhenyu Zhou
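    The block-index addressing described above can be illustrated with a simple C model that copies one block out of a row-major 2D plane given its (block row, block column) pair; the plane and block dimensions are assumed for illustration.

        #include <stdio.h>
        #include <stdint.h>

        #define PLANE_W 16          /* width of the 2D logical plane, in pixels */
        #define BLOCK_W 4
        #define BLOCK_H 4

        /* Copy the block addressed by the pair (block_row, block_col) from a
           row-major 2D plane into a contiguous buffer, as a DMA engine driven
           by block indices would. */
        static void dma_read_block(const uint8_t *plane, int block_row, int block_col,
                                   uint8_t *dst) {
            int y0 = block_row * BLOCK_H;
            int x0 = block_col * BLOCK_W;
            for (int y = 0; y < BLOCK_H; y++)
                for (int x = 0; x < BLOCK_W; x++)
                    dst[y * BLOCK_W + x] = plane[(y0 + y) * PLANE_W + (x0 + x)];
        }

        int main(void) {
            uint8_t plane[PLANE_W * PLANE_W];
            for (int i = 0; i < PLANE_W * PLANE_W; i++) plane[i] = (uint8_t)i;

            uint8_t block[BLOCK_W * BLOCK_H];
            dma_read_block(plane, 1, 2, block);           /* block index pair (1, 2) */
            printf("%u %u\n", block[0], block[BLOCK_W]);  /* 72 88 */
            return 0;
        }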
  • Patent number: 7751557
    Abstract: A method and apparatus are disclosed for efficiently de-scrambling one or more bytes of data according to DSL standards on a processor. This is achieved by providing an instruction for de-scrambling one or more bytes of data according to the DSL standards. Accordingly, the invention advantageously provides a processor with the ability to de-scramble data with a single instruction thus allowing for more efficient and faster de-scrambling operations for subsequent processing.
    Type: Grant
    Filed: September 22, 2004
    Date of Patent: July 6, 2010
    Assignee: Broadcom Corporation
    Inventors: Mark Taunton, Timothy Martin Dobson
  • Patent number: 7739479
    Abstract: A method of providing physics data within a game program or simulation using a hardware-based physics processing unit having a unique architecture designed to efficiently calculate physics-related data.
    Type: Grant
    Filed: November 19, 2003
    Date of Patent: June 15, 2010
    Assignee: NVIDIA Corporation
    Inventors: Jean Pierre Bordes, Curtis Davis, Monier Maher, Manju Hegde, Otto A. Schmid
  • Publication number: 20100146241
    Abstract: An apparatus and method for processing data includes an array of processing elements to simultaneously perform operations on multiple data elements using a single instruction. A grouping module assigns each processing element within the array to one of several groups. A modification module designates how each group of processing elements should handle the single instruction. This enables each group of processing elements to handle the single instruction differently. Each processing element is configured to handle the single instruction based on the group the processing element belongs to.
    Type: Application
    Filed: December 9, 2008
    Publication date: June 10, 2010
    Applicant: Novafora, Inc.
    Inventors: Shlomo Selim Rakib, Yoram Zarai