Single Instruction, Multiple Data (SIMD) Patents (Class 712/22)
  • Patent number: 8443170
    Abstract: An apparatus and method for performing SIMD multiply-accumulate operations includes SIMD data processing circuitry responsive to control signals to perform data processing operations in parallel on multiple data elements. Instruction decoder circuitry is coupled to the SIMD data processing circuitry and is responsive to program instructions to generate the required control signals. The instruction decoder circuitry is responsive to a single instruction (referred to herein as a repeating multiply-accumulate instruction) having as input operands a first vector of input data elements, a second vector of coefficient data elements, and a scalar value indicative of a plurality of iterations required, to generate control signals to control the SIMD processing circuitry.
    Type: Grant
    Filed: September 17, 2009
    Date of Patent: May 14, 2013
    Assignee: ARM Limited
    Inventors: Mladen Wilder, Dominic Hugo Symes, Richard Edward Bruce
  • Patent number: 8438322
    Abstract: A processing module includes a fetch and decode module, an instruction register, a data register, an execution module, and a millimeter wave (MMW) transceiver section. The fetch and decode module is operable to fetch and decode an instruction of a program and to identify data associated with the instruction. The execution module is operable to execute the instruction upon the data associated with the instruction. The MMW transceiver section is operable to wirelessly receive at least one of the instruction and the data associated with the instruction from memory.
    Type: Grant
    Filed: August 30, 2008
    Date of Patent: May 7, 2013
    Assignee: Broadcom Corporation
    Inventors: Ahmadreza (Reza) Rofougaran, Timothy W. Markison
  • Patent number: 8438366
    Abstract: Multiple data processing instructions instruct a computing device to process multiple data including first data and second data. When a multiple data processing instruction is decoded, two allocatable registers are selected. One is used to store the result of a processing operation performed on first data by one processing unit, and the other is used to store the result of a processing operation performed on second data by another processing unit. Those stored processing results are then transferred to result registers. Normal data processing instructions, on the other hand, instruct a processing operation on third data. When a normal data processing instruction is decoded, one allocatable register is selected and used to store the result of processing that a processing unit performs on the third data. The stored processing result is then transferred to a result register.
    Type: Grant
    Filed: August 2, 2010
    Date of Patent: May 7, 2013
    Assignee: Fujitsu Limited
    Inventors: Yasunobu Akizuki, Toshio Yoshida
  • Patent number: 8429656
    Abstract: Methods and apparatuses are presented for graphics operations with thread count throttling, involving operating a processor to carry out multiple threads of execution, wherein the processor comprises at least one execution unit capable of supporting up to a maximum number of threads, obtaining a defined memory allocation size for allocating, in at least one memory device, a thread-specific memory space for the multiple threads, obtaining a per thread memory requirement corresponding to the thread-specific memory space, determining a thread count limit based on the defined memory allocation size and the per thread memory requirement, and sending a command to the processor to cause the processor to limit the number of threads carried out by the at least one execution unit to a reduced number of threads, the reduced number of threads being less than the maximum number of threads.
    Type: Grant
    Filed: November 2, 2006
    Date of Patent: April 23, 2013
    Assignee: NVIDIA Corporation
    Inventors: Jerome F. Duluk, Jr., Bryon S. Nordquist
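The limit calculation in this abstract can be illustrated with a minimal sketch. It assumes the limit is simply the number of per-thread regions that fit in the allocation, capped at the hardware maximum; the function and parameter names are hypothetical and do not come from the patent.

```python
def thread_count_limit(allocation_bytes: int, per_thread_bytes: int, max_threads: int) -> int:
    """Return how many threads may run so that their thread-specific memory fits.

    Assumption: the limit is the integer quotient of the allocation size and
    the per-thread requirement, never exceeding the execution unit's maximum.
    """
    if per_thread_bytes <= 0:
        raise ValueError("per_thread_bytes must be positive")
    return min(max_threads, allocation_bytes // per_thread_bytes)
```

For example, under these assumptions `thread_count_limit(1024, 100, 32)` yields a limit of 10 threads, below the maximum of 32.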
  • Patent number: 8429380
    Abstract: A processor includes a plurality of subfunctional units provided corresponding to respective slots of one or more pieces of operation result data including a plurality of slots for an SIMD operation; and an enable generating unit configured to, in each of the one or more pieces of the operation result data, compare a value of a predetermined slot with a value of a slot other than the predetermined slot, and disable one or more subfunctional units to which the value equal to the value of the predetermined slot is inputted, and the processor outputs the value of the predetermined slot as the value of the one or more subfunctional units which have been disabled.
    Type: Grant
    Filed: March 12, 2010
    Date of Patent: April 23, 2013
    Assignee: Kabushiki Kaisha Toshiba
    Inventor: Hiroo Hayashi
  • Patent number: 8429667
    Abstract: Optimum load distribution processing is selected and executed based on settings made by a user in consideration of load changes caused by load distribution in a plurality of asymmetric cores, by using: a controller having a plurality of cores, and configured to extract, for each LU, a pattern showing the relationship between a core having an LU ownership and a candidate core as an LU ownership change destination based on LU ownership management information; to measure, for each LU, the usage of a plurality of resources; to predict, for each LU based on the measurement results, a change in the usage of the plurality of resources and overhead to be generated by transfer processing itself; to select, based on the respective prediction results, a pattern that matches the user's setting information; and to transfer the LU ownership to the core belonging to the selected pattern.
    Type: Grant
    Filed: February 29, 2012
    Date of Patent: April 23, 2013
    Assignee: Hitachi, Ltd.
    Inventors: Junji Ogawa, Yusuke Nonaka, Yuko Matsui
  • Patent number: 8413151
    Abstract: One embodiment of the present invention sets forth a technique for selectively spawning threads within a multiprocessing system. A computation work distributor (CWD), within the system, is responsible for performing the detailed work needed to spawn a thread grid. A request to the CWD to spawn a thread grid includes a predicate table, which includes an array of flags used to indicate which thread indices should have an associated thread block spawned and which should not. Greater efficiency is achieved by only spawning thread blocks that should perform useful computation.
    Type: Grant
    Filed: December 19, 2007
    Date of Patent: April 2, 2013
    Assignee: NVIDIA Corporation
    Inventors: John A. Stratton, David Luebke
  • Patent number: 8392789
    Abstract: A method for decoding a codeword in a data stream encoded according to a low density parity check (LDPC) code having an m×j parity check matrix H by initializing variable nodes with soft values based on symbols in the codeword, wherein a graph representation of H includes m check nodes and j variable nodes, and wherein a check node m provides a row value estimate to a variable node j and a variable node j provides a column value estimate to a check node m if H(m,j) contains a 1, computing row value estimates for each check node, wherein amplitudes of only a subset of column value estimates provided to the check node are computed, computing soft values for each variable node based on the computed row value estimates, determining whether the codeword is decoded based on the soft values, and terminating decoding when the codeword is decoded.
    Type: Grant
    Filed: July 28, 2009
    Date of Patent: March 5, 2013
    Assignee: Texas Instruments Incorporated
    Inventors: Eric Biscondi, David Hoyle, Tod David Wolf
  • Patent number: 8370817
    Abstract: A mechanism is provided for optimizing scalar code executed on a single instruction multiple data (SIMD) engine by aligning the slots of SIMD registers. With the mechanism, a compiler is provided that parses source code and, for each statement in the program, generates an expression tree. The compiler inspects all storage inputs to scalar operations in the expression tree to determine their alignment in the SIMD registers. This alignment is propagated up the expression tree from the leaves. When the alignments of two operands in the expression tree are the same, the resulting alignment is the shared value. When the alignments of two operands in the expression tree are different, one operand is shifted. For shifted operands, a shift operation is inserted in the expression tree. The executable code is then generated for the expression tree and shifts are inserted where indicated.
    Type: Grant
    Filed: May 27, 2008
    Date of Patent: February 5, 2013
    Assignee: International Business Machines Corporation
    Inventors: Alexandre E. Eichenberger, John Kevin Patrick O'Brien
  • Patent number: 8341328
    Abstract: A single instruction, multiple data (“SIMD”) computer system includes a central control unit coupled to 256 processing elements (“PEs”) and to 32 static random access memory (“SRAM”) devices. Each group of eight PEs can access respective groups of eight columns in a respective SRAM device. Each PE includes a local column address register that can be loaded through a data bus of the respective PE. A local column address stored in the local column address register is applied to an AND gate, which selects either the local column address or a column address applied to the AND gate by the central control unit. As a result, the central control unit can globally access the SRAM device, or a specific one of the eight columns that can be accessed by each PE can be selected locally by the PE.
    Type: Grant
    Filed: September 27, 2010
    Date of Patent: December 25, 2012
    Assignee: Micron Technology, Inc.
    Inventor: Jon Skull
  • Publication number: 20120297163
    Abstract: A system and method for automatically migrating the execution of work units between multiple heterogeneous cores. A computing system includes a first processor core with a single instruction multiple data micro-architecture and a second processor core with a general-purpose micro-architecture. A compiler predicts that execution of a function call in a program migrates at a given location to a different processor core. The compiler creates a data structure to support moving live values associated with the execution of the function call at the given location. An operating system (OS) scheduler schedules at least code before the given location in program order to the first processor core. In response to receiving an indication that a condition for migration is satisfied, the OS scheduler moves the live values to a location indicated by the data structure for access by the second processor core and schedules code after the given location to the second processor core.
    Type: Application
    Filed: May 16, 2011
    Publication date: November 22, 2012
    Inventors: Mauricio Breternitz, Patryk Kaminski, Keith Lowery, Anton Chernoff, Dz-Ching Ju
  • Patent number: 8316190
    Abstract: Computers and other computing machines and information appliances having a modified computer architecture and program structure which enables the operation of an application program concurrently or simultaneously on a plurality of computers interconnected via a communications link or network using a special distributed runtime (DRT), and that provides for a redundant array of independent computing systems that include computer code distribution using code-striping onto the plurality of the computers or computing machines. A redundant array of independent computing systems operating in concert and code-striping features.
    Type: Grant
    Filed: March 19, 2008
    Date of Patent: November 20, 2012
    Assignee: Waratek Pty. Ltd.
    Inventor: John M. Holt
  • Publication number: 20120265964
    Abstract: Disclosed is a data processing device capable of efficiently performing an arithmetic process on variable-length data and an arithmetic process on fixed-length data. The data processing device includes first PEs of SIMD type, SRAMs provided respectively for the first PEs, and second PEs. The first PEs each perform an arithmetic operation on data stored in a corresponding one of the SRAMs. The second PEs each perform an arithmetic operation on data stored in corresponding ones of the SRAMs. Therefore, the SRAMs can be shared so as to efficiently perform the arithmetic process on variable-length data and the arithmetic process on fixed-length data.
    Type: Application
    Filed: February 3, 2012
    Publication date: October 18, 2012
    Inventors: Kan Murata, Hideyuki Noda, Masaru Haraguchi
  • Patent number: 8286180
    Abstract: Method and apparatus are provided for synchronizing execution of a plurality of threads on a multi-threaded processor. Each thread is provided with a number of synchronization points corresponding to points where it is advantageous or preferable that execution should be synchronized with another thread. Execution of a thread is paused when it reaches a synchronization point until at least one other thread with which it is intended to be synchronized reaches a corresponding synchronization point. Execution is subsequently resumed. Where an executing thread branches over a section of code which included a synchronization point then execution is paused at the end of the branch until the at least one other thread reaches the synchronization point of the end of the corresponding branch.
    Type: Grant
    Filed: August 24, 2007
    Date of Patent: October 9, 2012
    Assignee: Imagination Technologies Limited
    Inventor: Yoong Chert Foo
  • Publication number: 20120254845
    Abstract: System and method for vectorizing combinations of program operations. Program code is received that includes a combination of individually vectorizable program portions that collectively implement a first computation. Each individually vectorizable program portion has at least one array input and at least one array output. The combination of individually vectorizable program portions is transformed into a single vectorizable program portion that is or includes a functional composition of the combination of individually vectorizable program portions. Vectorized executable code implementing the first computation is generated based on the single vectorizable program portion. The generated executable code is directed to SIMD (Single-Instruction-Multiple-Data) computing units of a target processor.
    Type: Application
    Filed: March 30, 2011
    Publication date: October 4, 2012
    Inventors: Haoran Yi, Brady C. Duggan, Robert E. Dye, Adam L. Bordelon, Jeffrey L. Kodosky
  • Publication number: 20120254585
    Abstract: Methods and apparatus for double precision division/inversion vector computations on Single Instruction Multiple Data (SIMD) computing platforms are described. In one embodiment, an input argument is represented by an exponent portion and a fraction portion. These portions are scaled, inverted, and multiplied to generate an inverse version of the input argument. In an embodiment, the inversion of the exponent portion may be done by changing the sign of the exponent. Other embodiments are also described.
    Type: Application
    Filed: December 25, 2009
    Publication date: October 4, 2012
    Inventors: Andrey Kolesov, Valery Kuriakin, Maria Guseva
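The exponent/fraction split described in this abstract can be sketched in scalar form. This is only an illustration under stated assumptions: it uses Python's standard `math.frexp` and `math.ldexp` to separate and recombine the two portions, whereas the patent targets vectorized double-precision hardware; the function name `invert` is hypothetical.

```python
import math

def invert(x: float) -> float:
    """Compute 1/x by inverting the fraction and negating the exponent.

    math.frexp returns (fraction, exponent) with x == fraction * 2**exponent
    and 0.5 <= |fraction| < 1 for nonzero finite x.
    """
    fraction, exponent = math.frexp(x)
    # 1/x == (1/fraction) * 2**(-exponent): the exponent is inverted by a sign change.
    return math.ldexp(1.0 / fraction, -exponent)
```

With this sketch, `invert(8.0)` gives `0.125`; a zero input raises `ZeroDivisionError` because the fraction portion is zero.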
  • Patent number: 8261043
    Abstract: A method and apparatus are provided to perform efficient merging operations of two or more streams of data by using SIMD instructions. Streams of data are merged together in parallel and with mitigated or removed conditional branching. The merge operations of the streams of data include Merge AND and Merge OR operations.
    Type: Grant
    Filed: May 12, 2009
    Date of Patent: September 4, 2012
    Assignee: SAP AG
    Inventors: Hiroshi Inoue, Moriyoshi Ohara, Hideaki Komatsu
  • Patent number: 8255886
    Abstract: A method for analyzing and presenting in a graphical manner single instruction, multiple data (SIMD) instructions involves disassembling a stream of machine instructions into a stream of assembly language instructions. Instruction objects “M” and “N” are created to represent SIMD instructions “M” and “N” from the stream of instructions. Instruction objects “M” and “N” include multiple data objects corresponding to the multiple data items of the respective SIMD instruction. Different colors are assigned to at least two of the multiple data objects of instruction object “M.” If a data item of SIMD instruction “N” is based on a data item of SIMD instruction “M,” the color from the source object is automatically assigned to the target object. Dependencies between data items of instruction “M” and “N” are annotated by arrows between corresponding data objects. Other embodiments are described and claimed.
    Type: Grant
    Filed: June 30, 2008
    Date of Patent: August 28, 2012
    Assignee: Intel Corporation
    Inventor: Peter Lachner
  • Publication number: 20120210098
    Abstract: Systems and methods of enabling virtual calls in a single instruction multiple data (SIMD) environment may involve detecting a virtual call of a function and using a single dispatch of the function to invoke the virtual call for two or more channels of the virtual call. In one example, it is determined that the two or more channels share a common target address and a single dispatch of the function is conducted with respect to the common target address. The process may be iterated for additional channels of the virtual call that share a common target address.
    Type: Application
    Filed: February 16, 2011
    Publication date: August 16, 2012
    Inventors: Wei-Yu Chen, Guei-Yuan Lueh, Subramaniam Maiyuran
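The grouping of channels by common target address can be sketched as follows. This is a hypothetical model, not the patented mechanism: Python callables stand in for target addresses, each callable takes a list of per-channel arguments and returns a list of results, and the name `dispatch_virtual_call` is invented for illustration.

```python
def dispatch_virtual_call(targets, args):
    """Invoke each distinct target once for all channels that share it.

    targets[i] is the callable (stand-in for a target address) for channel i;
    args[i] is that channel's argument. Returns (results, dispatch_count).
    """
    results = [None] * len(targets)
    pending = list(range(len(targets)))
    dispatches = 0
    while pending:
        target = targets[pending[0]]
        # Channels that share the common target are handled by a single dispatch.
        group = [ch for ch in pending if targets[ch] is target]
        outputs = target([args[ch] for ch in group])
        for ch, out in zip(group, outputs):
            results[ch] = out
        # Iterate over the remaining channels, which have other targets.
        pending = [ch for ch in pending if targets[ch] is not target]
        dispatches += 1
    return results, dispatches
```

Four channels that resolve to two distinct targets therefore need two dispatches rather than four.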
  • Patent number: 8239659
    Abstract: Techniques for vector completion mask (VCM) handling are provided. A data structure includes a mask field for each operand of a particular operation. A processor attempts to execute the operation with multiple operands, which are identified in the data structure by the mask fields. If operands are successfully retrieved for execution with the operation, then the corresponding mask field within the data structure is cleared. The processor can reset if any field remains set within the data structure and can re-process the operation with operands that were not previously handled with the operation.
    Type: Grant
    Filed: September 29, 2006
    Date of Patent: August 7, 2012
    Assignee: Intel Corporation
    Inventors: Stephan Jourdan, Michael Fetterman, Michael Cornaby, Per Hammarlund, Ronak Signhal, Glenn Hinton
  • Patent number: 8239847
    Abstract: General-purpose distributed data-parallel computing using high-level computing languages is described. Data parallel portions of a sequential program written in a high-level language are automatically translated into a distributed execution plan. Map and reduction computations are automatically added to the plan. Patterns in the sequential program can be automatically identified to trigger map and reduction processing. Direct invocation of map and reduction processing is also provided. One or more portions of the reduce computation are pushed to the map stage and dynamic aggregation is inserted when possible. The system automatically identifies opportunities for partial reductions and aggregation, but also provides a set of extensions in a high-level computing language for the generation and optimization of the distributed execution plan. The extensions include annotations to declare functions suitable for these optimizations.
    Type: Grant
    Filed: March 18, 2009
    Date of Patent: August 7, 2012
    Assignee: Microsoft Corporation
    Inventors: Yuan Yu, Pradeep Kumar Gunda, Michael A Isard
  • Patent number: 8234250
    Abstract: A method and apparatus for deduplication of files of a storage system is described. During a gathering phase, a file may be simultaneously processed by two or more threads to produce and store content identifiers for data blocks of the file. Each file may be sub-divided into multiple file sub-portions, each file sub-portion comprising a predetermined number of data blocks. A thread may be assigned to each sub-portion of a file for processing the data blocks. The currently assigned sub-portion for each thread may be recorded and used upon a system crash to restart each scanner thread at the currently assigned sub-portion to minimize the data blocks that are re-processed. The size of a file sub-portion may be predetermined based on the organization of inode data structures representing the files (e.g., based on the maximum number of pointers that an indirect block in the inode data structure may contain).
    Type: Grant
    Filed: September 17, 2009
    Date of Patent: July 31, 2012
    Assignee: NetApp, Inc.
    Inventors: Alok Sharma, Praveen Killamsetti, Bipul Raj
  • Patent number: 8225075
    Abstract: Method, apparatus, and program means for shuffling data. The method of one embodiment comprises receiving a first operand having a set of L data elements and a second operand having a set of L control elements. For each control element, data from a first operand data element designated by the individual control element is shuffled to an associated resultant data element position if its flush to zero field is not set and a zero is placed into the associated resultant data element position if its flush to zero field is set.
    Type: Grant
    Filed: October 8, 2010
    Date of Patent: July 17, 2012
    Assignee: Intel Corporation
    Inventors: William W. Macy, Jr., Eric L. Debes, Patrice L. Roussel, Huy V. Nguyen
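A minimal sketch of the shuffle described in this abstract follows. It assumes each control element can be modeled as an (index, flush-to-zero flag) pair; that representation and the function name `shuffle` are illustrative choices, not the encoding used by the patent.

```python
def shuffle(data, controls):
    """Shuffle data elements according to per-position control elements.

    controls[i] is a pair (source_index, flush_to_zero). Result position i
    receives data[source_index], or 0 when the flush-to-zero flag is set.
    """
    if len(data) != len(controls):
        raise ValueError("data and controls must both have L elements")
    return [0 if flush else data[index] for index, flush in controls]
```

For example, `shuffle([10, 20, 30, 40], [(3, False), (0, False), (1, True), (2, False)])` produces `[40, 10, 0, 30]`: the third position is zeroed because its flag is set.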
  • Patent number: 8219783
    Abstract: An SIMD type microprocessor is disclosed. The SIMD type microprocessor includes plural PEs (processor elements) each of which provides an ALU (arithmetic and logic unit) for lower-order bits, an ALU for upper-order bits, a control circuit for lower-order bits, a control circuit for upper-order bits, a range determining circuit for lower-order bits, and a range determining circuit for upper-order bits. The SIMD type microprocessor further includes a global processor, a range designation bus for lower-order bits which connects the global processor to the range determining circuit for lower-order bits, and a range designation bus for upper-order bits which connects the global processor to the range determining circuit for upper-order bits.
    Type: Grant
    Filed: July 3, 2008
    Date of Patent: July 10, 2012
    Assignee: Ricoh Company, Ltd.
    Inventor: Kazuhiko Hara
  • Patent number: 8214808
    Abstract: A system and method for speculative assistance to a thread in a heterogeneous processing environment is provided. A first set of instructions is identified in a source code representation (e.g., a source code file) that is suitable for speculative execution. The identified set of instructions is analyzed to determine the processing requirements. Based on the analysis, a processor type is identified that will be used to execute the identified first set of instructions. The processor type is selected from more than one processor type included in the heterogeneous processing environment. The heterogeneous processing environment includes more than one heterogeneous processing core in a single silicon substrate. The various processing cores can utilize different instruction set architectures (ISAs). An object code representation is then generated for the identified first set of instructions with the object code representation being adapted to execute on the determined type of processor.
    Type: Grant
    Filed: May 7, 2007
    Date of Patent: July 3, 2012
    Assignee: International Business Machines Corporation
    Inventors: Michael Norman Day, Michael Karl Gschwind, John Kevin Patrick O'Brien, Kathryn O'Brien
  • Publication number: 20120166762
    Abstract: Provided are a computing apparatus and method based on SIMD architecture capable of supporting various SIMD widths without wasting resources. The computing apparatus includes a plurality of configurable execution cores (CECs) that have a plurality of execution modes, and a controller for detecting a loop region from a program, determining a Single Instruction Multiple Data (SIMD) width for the detected loop region, and determining an execution mode of the processor according to the determined SIMD width.
    Type: Application
    Filed: July 8, 2011
    Publication date: June 28, 2012
    Inventors: Jae Un Park, Suk-Jin Kim, Scott Mahlke, Yong-Jun Park
  • Publication number: 20120159120
    Abstract: A microprocessor configured to execute programs divided into discrete phases, comprising: a scheduler for scheduling program instructions to be executed on the processor; a plurality of resources for executing programming instructions issued by the scheduler; wherein the scheduler is configured to schedule each phase of the program only after receiving an indication that execution of the preceding phase of the program has been completed. By splitting programs into multiple phases and providing a scheduler that is able to determine whether execution of a phase has been completed, each phase can be separately scheduled and the results of preceding phases can be used to inform the scheduling of subsequent phases.
    Type: Application
    Filed: May 19, 2011
    Publication date: June 21, 2012
    Inventor: Yoong Chert Foo
  • Publication number: 20120151183
    Abstract: An embodiment may include circuitry to execute, at least in part, a first list of instructions and/or to concurrently process, at least in part, first and second buffers. The execution of the first list of instructions may result, at least in part, from invocation of a first function call. The first list of instructions may include at least one portion of a second list of instructions interleaved, at least in part, with at least one other portion of a third list of instructions. The portions may be concurrently carried out, at least in part, by one or more sets of execution units of the circuitry. The second and third lists of instructions may implement, at least in part, respective algorithms that are amenable to being invoked by separate respective function calls. The concurrent processing may involve, at least in part, complementary algorithms.
    Type: Application
    Filed: December 8, 2010
    Publication date: June 14, 2012
    Inventors: James D. Guilford, Wajdi K. Feghali, Vinodh Gopal, Gilbert M. Wolrich, Erdinc Ozturk, Martin G. Dixon, Deniz Karakoyunlu, Kahraman D. Akdemir
  • Publication number: 20120151145
    Abstract: A method for optimizing processing in a SIMD core. The method comprises processing units of data within a working domain, wherein the processing includes one or more working items executing in parallel within a persistent thread. The method further comprises retrieving a unit of data from within a working domain, processing the unit of data, retrieving other units of data when processing of the unit of data has finished, processing the other units of data, and terminating the execution of the working items when processing of the working domain has finished.
    Type: Application
    Filed: December 13, 2010
    Publication date: June 14, 2012
    Applicant: Advanced Micro Devices, Inc.
    Inventor: Alexander M. Lyashevsky
  • Publication number: 20120151182
    Abstract: In one embodiment, a processor can perform a function call from a main program to a function that is to operate on at least one vector-type operand, in which only scalar values are passed to the function, and input values to the function including the at least one vector-type operand are to be renamed from virtual registers identified in the function to physical registers of a vector register file, and output values from the function including the at least one vector-type operand are to be renamed from virtual registers identified in the function to physical registers of the vector register file. Other embodiments are described and claimed.
    Type: Application
    Filed: December 9, 2010
    Publication date: June 14, 2012
    Inventor: Tomasz Madajczak
  • Patent number: 8200941
    Abstract: A method includes, in a processor, loading/moving a first portion of bits of a source into a first portion of a destination register and duplicating that first portion of bits in a subsequent portion of the destination register.
    Type: Grant
    Filed: April 15, 2011
    Date of Patent: June 12, 2012
    Assignee: Intel Corporation
    Inventor: Patrice Roussel
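The load-and-duplicate behavior can be modeled on plain integers. This sketch assumes the "first portion" is the low `portion_width` bits of the source and that the destination holds exactly two portions; both the width parameter and the function name are hypothetical.

```python
def load_and_duplicate(source_bits: int, portion_width: int = 64) -> int:
    """Copy the low portion of the source into both halves of the destination.

    The destination is modeled as an integer of 2 * portion_width bits.
    """
    low = source_bits & ((1 << portion_width) - 1)
    # Place the same portion in the first and in the subsequent portion.
    return (low << portion_width) | low
```

With an 8-bit portion, `load_and_duplicate(0xABCD, 8)` evaluates to `0xCDCD`.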
  • Patent number: 8200913
    Abstract: An information processing system includes a plurality of PMM and data transmission paths for connection between the PMM and transmitting a value of a PMM to another PMM. A memory of each PMM holds a list of values of first items arranged in the ascending order or descending order without overlap and/or a list of values of the second item to be shared. A memory module of each PMM transmits a value contained in the value list to another PMM, receives a value contained in the value list from the another PMM, references the value list of the first item and the value list of the second item of the another PMM, and generates a list of common values considering the values contained in the value lists of the first item and the second item of all the other PMM.
    Type: Grant
    Filed: January 25, 2005
    Date of Patent: June 12, 2012
    Assignee: Turbo Data Laboratories, Inc.
    Inventor: Shinji Furusho
  • Patent number: 8200948
    Abstract: An apparatus and method are provided for performing re-arrangement operations on data. The data processing apparatus has a register data store with a plurality of registers for storing data, and processing logic for performing a sequence of operations on data including at least one re-arrangement operation. The processing logic has scalar processing logic for performing scalar operations and SIMD processing logic for performing SIMD operations. The SIMD processing logic is responsive to a re-arrangement instruction specifying a family of re-arrangement operations to perform a selected re-arrangement operation from that family on a plurality of data elements constituted by data in one or more registers identified by the re-arrangement instruction. The selected re-arrangement operation is dependent on at least one parameter provided by the scalar processing logic, that parameter identifying a data element width for the data elements on which the selected re-arrangement operation is performed.
    Type: Grant
    Filed: December 4, 2007
    Date of Patent: June 12, 2012
    Assignee: ARM Limited
    Inventors: Daniel Kershaw, Dominic Hugo Symes, Alastair Reid
  • Patent number: 8200940
    Abstract: A system and method for successfully performing reduction operations in a multi-threaded SIMD (single-instruction multiple-data) system while one or more threads are disabled allows for the reduction operations to be performed without a performance penalty compared with performing the same operation with all of the threads enabled. The source data for each intermediate computation of the reduction operation is remapped by a configurable crossbar as needed to avoid using invalid data from the disabled threads. The remapping function is transparent to the user and enables correct execution of order invariant reduction operations and order dependent prefix-reduction operations.
    Type: Grant
    Filed: June 30, 2008
    Date of Patent: June 12, 2012
    Assignee: NVIDIA Corporation
    Inventor: John Erik Lindholm
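The idea of remapping source data so that a reduction never consumes values from disabled threads can be sketched in software. This is an assumption-laden illustration: a list comprehension stands in for the configurable crossbar, the pairwise tree shape is one possible choice, and `masked_reduce` is a hypothetical name.

```python
from operator import add

def masked_reduce(values, active, op=add):
    """Reduce the values of enabled threads only, using a pairwise tree.

    values[i] is thread i's datum and active[i] says whether thread i is enabled.
    """
    # Remap step: route only the enabled threads' data into contiguous slots.
    live = [v for v, on in zip(values, active) if on]
    if not live:
        raise ValueError("no enabled threads to reduce")
    # Pairwise tree reduction; adjacent pairing preserves operand order.
    while len(live) > 1:
        paired = [op(live[i], live[i + 1]) for i in range(0, len(live) - 1, 2)]
        if len(live) % 2:
            paired.append(live[-1])  # odd element carries over to the next level
        live = paired
    return live[0]
```

For eight threads holding 1 through 8 with threads 1 and 4 (zero-based) disabled, the sketch sums 1, 3, 4, 6, 7, and 8 to give 29.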
  • Patent number: 8190867
    Abstract: A processor comprising a register file, and a decoder to decode an instruction to specify a first source register having a first packed signed 16-bit integers, and to specify a second source register having a second packed signed 16-bit integers. A functional unit to generate a result to be stored in a specified destination. The result including a third packed 8-bit integers including an integer for each integer in the first packed integers, and an integer for each integer in the second packed integers. The integers corresponding to the first packed integers next to one another in the result. The integers corresponding to the second packed integers next to one another. A highest order integer of the result corresponding to a highest order integer of the first packed integers. A lowest order integer of the result corresponding to a lowest order integer of the second packed integers.
    Type: Grant
    Filed: May 16, 2011
    Date of Patent: May 29, 2012
    Assignee: Intel Corporation
    Inventors: Alexander Peleg, Yaakov Yaari, Millind Mittal, Larry M. Mennemeier, Benny Eitan
  • Patent number: 8180998
    Abstract: A system for performing data-parallel operations and task-parallel operations. A first switch fabric node (SFN) includes first and second lane processing engines (LPEs). The first LPE includes a first set of lane processing units (LPUs) configured to perform data-parallel operations, where each LPU performs a set of operations, and each LPU within the first set of LPUs uses a different set of data for the set of operations. The second LPE includes a second set of LPUs configured to perform task-parallel operations, where each LPU performs a different set of operations. A processing control engine (PCE) is configured to distribute instructions and data to the first LPE and the second LPE. Advantageously, data-parallel operations and task-parallel operations are able to be performed on the same processor simultaneously.
    Type: Grant
    Filed: September 10, 2008
    Date of Patent: May 15, 2012
    Assignee: NVIDIA Corporation
    Inventors: Monier Maher, Christopher Lamb, Sanjay J. Patel, Peter Hsu
  • Patent number: 8176290
    Abstract: A memory controller, on receiving a write request to write write-data into an address of a second memory region issued by a processor, determines whether read-data requested to be read from an address of a first memory region by the processor is matched with the write-data requested to be written into the address of the second memory region, and if the read-data is matched with the write-data, prevents the write-data from being written into the address of the second memory region.
    Type: Grant
    Filed: June 11, 2009
    Date of Patent: May 8, 2012
    Assignee: Kabushiki Kaisha Toshiba
    Inventors: Takahisa Wada, Katsuyuki Kimura, Shunichi Ishiwata, Takashi Miyamori, Ryuji Hada, Keiri Nakanishi, Yasuki Tanabe, Masato Sumiyoshi
  • Patent number: 8171263
    Abstract: A parallel data processing apparatus using a SIMD array of processing elements is disclosed. The apparatus makes use of a register in order to control issuance of instructions to the processing elements in the array.
    Type: Grant
    Filed: June 29, 2007
    Date of Patent: May 1, 2012
    Assignee: Rambus Inc.
    Inventors: Dave Stuttard, Dave Williams, Eamon O'Dea, Gordon Faulds, John Rhoades, Ken Cameron, Phil Atkin, Paul Winser, Russell David, Ray McConnell, Tim Day, Trey Greer
  • Patent number: 8171464
    Abstract: An approach is provided for vectorizing misaligned references in compiled code for SIMD architectures that support only aligned loads and stores. In this framework, a loop is first simdized as if the memory unit imposes no alignment constraints. The compiler then inserts data reorganization operations to satisfy the actual alignment requirements of the hardware. Finally, the code generation algorithm generates SIMD codes based on the data reorganization graph, addressing realistic issues such as runtime alignments, unknown loop bounds, residual iteration counts, and multiple statements with arbitrary alignment combinations. Loop peeling is used to reduce the computational overhead associated with misaligned data. A loop prologue and epilogue are peeled from individual iterations in the simdized loop, and vector-splicing instructions are applied to the peeled iterations, while the steady-state loop body incurs no additional computational overhead.
    Type: Grant
    Filed: May 16, 2008
    Date of Patent: May 1, 2012
    Assignee: International Business Machines Corporation
    Inventors: Alexandre E. Eichenberger, Kai-Ting Amy Wang, Peng Wu
  • Patent number: 8161280
    Abstract: In one embodiment of the present invention, a method includes verifying a master processor of a system; validating a trusted agent with the master processor if the master processor is verified; and launching the trusted agent on a plurality of processors of the system if the trusted agent is validated. After execution of such a trusted agent, a secure kernel may then be launched, in certain embodiments. The system may be a multiprocessor server system having a partially or fully connected topology with arbitrary point-to-point interconnects, for example.
    Type: Grant
    Filed: June 29, 2010
    Date of Patent: April 17, 2012
    Assignee: Intel Corporation
    Inventors: John H. Wilson, Ioannis T. Schoinas, Mazin S. Yousif, Linda J. Rankin, David W. Grawrock, Robert J. Greiner, James A. Sutton, Kushagra Vaid, Willard M. Wiseman
  • Patent number: 8161271
    Abstract: Embodiments of the invention provide logic within the store data path between a processor and a memory array. The logic may be configured to misalign vector data as it is stored to memory. By misaligning vector data as it is stored to memory, memory bandwidth may be maximized while processing bandwidth required to store vector data misaligned is minimized. Furthermore, embodiments of the invention provide logic within the load data path which allows vector data which is stored misaligned to be aligned as it is loaded into a vector register. By aligning misaligned vector data as it is loaded into a vector register, memory bandwidth may be maximized while processing bandwidth required to align misaligned vector data may be minimized.
    Type: Grant
    Filed: July 11, 2007
    Date of Patent: April 17, 2012
    Assignee: International Business Machines Corporation
    Inventors: David Arnold Luick, Eric Oliver Mejdrich, Adam James Muff
  • Patent number: 8161267
    Abstract: Hardware and software techniques for interrupt detection and response in a scalable pipelined array processor environment are described. Utilizing these techniques, a sequential program execution model with interrupts can be maintained in a highly parallel scalable pipelined array processor containing multiple processing elements (PEs) and distributed memories and register files. When an interrupt occurs, interface signals are provided to all PEs to support independent interrupt operations in each PE dependent upon the local PE instruction sequence prior to the interrupt. Processing element exception interrupts are supported and low latency interrupt processing is also provided for embedded systems where real time signal processing is required. Further, a hierarchical interrupt structure is used allowing a generalized debug approach using debug interrupts and a dynamic debug monitor mechanism.
    Type: Grant
    Filed: November 30, 2010
    Date of Patent: April 17, 2012
    Assignee: Altera Corporation
    Inventors: Edwin Franklin Barry, Patrick R. Marchand, Gerald George Pechanek, Larry D. Larsen
  • Patent number: 8146092
    Abstract: A controller having a plurality of cores extracts, for each logical unit (LU), a pattern showing the relationship between a core having an LU ownership and a candidate core as an LU ownership change destination based on LU ownership management information; measures, for each LU, the usage of a plurality of resources; predicts, for each LU based on the measurement results, a change in the usage of the plurality of resources and overhead to be generated by transfer processing itself; selects, based on the respective prediction results, a pattern that matches the user's setting information; and transfers the LU ownership to the core belonging to the selected pattern.
    Type: Grant
    Filed: March 2, 2009
    Date of Patent: March 27, 2012
    Assignee: Hitachi, Ltd.
    Inventors: Junji Ogawa, Yusuke Nonaka, Yuko Matsui
  • Patent number: 8146150
    Abstract: Multi-node and multi-processor security management is described in this application. Data may be secured in a TPM of any one of a plurality of nodes, each node including one or more processors. The secured data may be protected using hardware hooks to prevent unauthorized access to the secured information. A security hierarchy may be put in place to protect certain memory addresses from access by requiring permission by VMM, OS, ACM or processor hardware. The presence of secured data may be communicated to each of the nodes to ensure that data is protected. Other embodiments are described.
    Type: Grant
    Filed: December 31, 2007
    Date of Patent: March 27, 2012
    Assignee: Intel Corporation
    Inventors: Mahesh S. Natu, Sham Datta
  • Patent number: 8131981
    Abstract: A data processing system, apparatus and method for performing fractional multiply operations is disclosed. The system includes a memory that stores instructions for SIMD operations and a processing core. The processing core includes registers that store operands for the fractional multiply operations. A coprocessor included in the processing core performs the fractional multiply operations on the operands and stores the result in a destination register that is also included in the processing core.
    Type: Grant
    Filed: August 12, 2009
    Date of Patent: March 6, 2012
    Assignee: Marvell International Ltd.
    Inventors: Nigel C. Paver, Bradley C. Aldrich
  • Patent number: 8127112
    Abstract: A data processing architecture includes an input device that receives an incoming stream of data packets. A plurality of processing elements are operable to process data received from the input device. The input device is operable to distribute data packets in whole or in part to the processing elements in dependence upon the data processing bandwidth of the processing elements.
    Type: Grant
    Filed: December 10, 2010
    Date of Patent: February 28, 2012
    Assignee: Rambus Inc.
    Inventors: John Rhoades, Ken Cameron, Paul Winser, Ray McConnell, Gordon Faulds, Simon McIntosh-Smith, Anthony Spencer, Jeff Bond, Matthias Dejaegher, Danny Halamish, Gajinder Panesar
  • Patent number: 8122227
    Abstract: A data processing circuit contains an instruction execution circuit that has an instruction set that comprises a SIMD instruction. The instruction execution circuit comprises a plurality of arithmetic circuits, arranged to perform N respective identical operations in parallel in response to the SIMD instruction. The SIMD instruction selects a first one and a second one of the registers. The SIMD instruction defines a first and second series of N respective SIMD instruction operands of the SIMD instruction from the addressed registers. Each arithmetic circuit receives a respective first operand and a respective second operand from the first and second series respectively, when executing the SIMD instruction. The instruction execution circuit is arranged for selecting the first and second series so that they partially overlap. Preferably, the position of the operands of at least one of the series is under program control, preferably under control of operand data.
    Type: Grant
    Filed: November 2, 2005
    Date of Patent: February 21, 2012
    Assignee: Silicon Hive B.V.
    Inventor: Antonius A. M. Van Wel
  • Patent number: 8112691
    Abstract: The generation of Fletcher/Adler partial checksums is transformed from a space that requires integer multiplications and additions to a space that requires only integer additions and shifts on a single SIMD pipeline capable processor. This transformation permits the use of Fletcher/Adler checksums on processors where the performance of SIMD instructions is sub-optimal, on CMT processors that support a single SIMD pipeline as well as other processors that can be configured by executing software to implement SIMD operations for a single SIMD pipeline. The implementation of the process with this transformation on a general-purpose computer system transforms that general-purpose computer system into a special-purpose computer system that uses a single SIMD pipeline to generate a Fletcher/Adler checksum. The elimination of integer multiplications in the generation of the partial checksums results in a significant improvement in performance.
    Type: Grant
    Filed: March 25, 2008
    Date of Patent: February 7, 2012
    Assignee: Oracle America, Inc.
    Inventor: Lawrence A. Spracklen
  • Patent number: 8112614
    Abstract: Parallel data processing systems and methods use cooperative thread arrays (CTAs), i.e., groups of multiple threads that concurrently execute the same program on an input data set to produce an output data set. Each thread in a CTA has a unique identifier (thread ID) that can be assigned at thread launch time. The thread ID controls various aspects of the thread's processing behavior such as the portion of the input data set to be processed by each thread, the portion of an output data set to be produced by each thread, and/or sharing of intermediate results among threads. Mechanisms for loading and launching CTAs in a representative processing core and for synchronizing threads within a CTA are also described.
    Type: Grant
    Filed: December 17, 2010
    Date of Patent: February 7, 2012
    Assignee: NVIDIA Corporation
    Inventors: John R. Nickolls, Stephen D. Lew
  • Patent number: RE43825
    Abstract: A system and method forward data between processing elements. A first processing element includes an address register that stores a first memory address. A forwarding storage element is coupled to the first processing element. A second processing element, coupled to the forwarding storage element, transmits a second memory address to the forwarding storage element. The forwarding storage element transmits the second memory address to the first processing element, and the first processing element compares the second memory address with the first memory address.
    Type: Grant
    Filed: November 19, 2007
    Date of Patent: November 20, 2012
    Assignee: The United States of America as Represented by the Secretary of the Navy
    Inventors: Joel Zvi Apisdorf, Sam Brandon Sandbote, Michael Daniel Poole
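
The abstract of patent 8112691 above contrasts two ways of computing Fletcher/Adler partial checksums: one built on integer multiplications and one built on additions alone. As a rough illustration of that contrast only, the Python sketch below computes the standard Adler-32 checksum in both forms. It is not the patented transformation, it uses no SIMD, and it omits the shift-based steps the abstract mentions; the function names `adler32_running` and `adler32_weighted` are hypothetical and chosen purely for this example.

```python
import zlib

MOD = 65521  # largest prime below 2**16; the standard Adler-32 modulus


def adler32_running(data: bytes) -> int:
    """Addition-only form: keep two running sums, one byte at a time."""
    a, b = 1, 0
    for byte in data:
        a = (a + byte) % MOD  # a = 1 + sum of all bytes seen so far
        b = (b + a) % MOD     # b = sum of every intermediate value of a
    return (b << 16) | a


def adler32_weighted(data: bytes) -> int:
    """Multiplication form: each byte is weighted by its distance from the end."""
    n = len(data)
    a = (1 + sum(data)) % MOD
    # Byte i (0-based) contributes to (n - i) of the running values of a,
    # and the initial 1 contributes to all n of them.
    b = (n + sum((n - i) * byte for i, byte in enumerate(data))) % MOD
    return (b << 16) | a


if __name__ == "__main__":
    sample = b"Single Instruction, Multiple Data"
    # Both forms should agree with each other and with zlib's implementation.
    print(hex(adler32_running(sample)), hex(adler32_weighted(sample)))
    print(adler32_running(sample) == zlib.adler32(sample))
```

The weighted form exposes independent per-byte products, which is why it maps naturally onto parallel lanes; the running form shows that the same value can be reached without any multiplication. Both can be checked against `zlib.adler32` from the Python standard library.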