Single Instruction, Multiple Data (SIMD) Patents (Class 712/22)
  • Publication number: 20140189296
    Abstract: A loop remainder mask instruction indicates a current iteration count of a loop as a first operand, an iteration limit of a loop as a second operand, and a destination. The loop contains iterations and each iteration includes a data element of the array. A processor receives the loop remainder mask instruction, decodes the instruction for execution, and stores a result of the execution in the destination. The result indicates a number of data elements of the array past an end of a preceding portion of the array that are to be handled separately from the preceding portion, the end of the preceding portion being where the current iteration count is recorded.
    Type: Application
    Filed: December 14, 2011
    Publication date: July 3, 2014
    Inventors: Elmoustapha Ould-Ahmed-Vall, Robert Valentine, Jesus Corbal, Andrey Naraikin, Suleyman Sair, Asaf Hargil, Miland B. Girkar, Bret T. Toll, Mark J. Charney
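As a rough illustration of the loop-remainder idea in the abstract above, the scalar Python sketch below computes which lanes are still needed once the full-width vector iterations are done. The function name `loop_remainder_mask`, the 8-lane width, and the choice to encode the result as a bit mask are assumptions made for this sketch; they are not taken from the application.

```python
def loop_remainder_mask(current: int, limit: int, lanes: int = 8) -> int:
    """Return a bit mask with one set bit per remaining iteration, capped at `lanes`.

    `current` is the iteration count already reached and `limit` is the
    iteration limit of the loop (both hypothetical operand names).
    """
    remaining = max(0, limit - current)   # elements past the preceding portion
    active = min(remaining, lanes)        # never enable more lanes than exist
    return (1 << active) - 1              # low `active` bits set


# A 19-element array processed 8 lanes at a time leaves 3 elements over.
n, lanes = 19, 8
i = (n // lanes) * lanes                  # iterations covered by full vectors
mask = loop_remainder_mask(i, n, lanes)
print(bin(mask))
```

A masked vector operation could then use such a mask to handle the tail in one pass instead of falling back to a scalar cleanup loop.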
  • Patent number: 8769244
    Abstract: Uniforming of the processing load is efficiently realized. Each processing element configuring an SIMD parallel computer system includes a data storage module that stores data processed or transferred, a number-of-data-sets storage device that stores number of data sets, and a front data storage device that stores the front data. Each processing element further includes a control processor that compares the number of data sets stored in one processing element with the number of data sets stored in the own processing element, and issues a data distribution leveling instruction that designates an action for updating contents of the data storage module, the number-of-data-sets storage device, and the front data storage device according to a rule determined based on a comparison result of the own processing element and that of the other processing elements and an action for moving the data stored in the one processing element to the own processing element.
    Type: Grant
    Filed: April 8, 2009
    Date of Patent: July 1, 2014
    Assignee: NEC Corporation
    Inventor: Shorin Kyo
  • Publication number: 20140181467
    Abstract: Methods, media, and computer systems are provided. The method includes, the media includes control logic for, and the computer system includes a processor with control logic for overriding an execution mask of SIMD hardware to enable at least one of a plurality of lanes of the SIMD hardware. Overriding the execution mask is responsive to a data parallel computation and a diverged control flow of a workgroup.
    Type: Application
    Filed: December 21, 2012
    Publication date: June 26, 2014
    Applicant: ADVANCED MICRO DEVICES, INC.
    Inventors: Timothy G. Rogers, Bradford M. Beckmann, James M. O'Connor
  • Patent number: 8762691
    Abstract: A data processing apparatus includes a plurality of processing elements arranged in a single instruction multiple data array. The apparatus includes an instruction controller operable to receive instructions from a plurality of instructions streams, and to transfer instructions from those instructions streams to the processing elements in the array, such that the data processing apparatus is operable to process a plurality of processing threads substantially in parallel with one another. A data transfer controller is provided which is operable to control transfer of data between the internal memory units associated with the processing elements, and memory external to the array.
    Type: Grant
    Filed: June 29, 2007
    Date of Patent: June 24, 2014
    Assignee: Rambus Inc.
    Inventors: Dave Stuttard, Dave Williams, Eamon O'Dea, Gordon Faulds, John Rhoades, Ken Cameron, Phil Atkin, Paul Winser, Russell David, Ray McConnell, Tim Day, Trey Greer
  • Patent number: 8756270
    Abstract: A mechanism is provided in a collective acceleration unit for performing a collective operation to distribute or collect data among a plurality of participant nodes. The mechanism receives an input collective packet for a collective operation from a neighbor node within a collective tree. The input collective packet comprises a tree identifier and an input data field and wherein the collective tree comprises a plurality of sub trees. The mechanism maps the tree identifier to an index within the collective acceleration unit. The index identifies a portion of resources within the collective acceleration unit and is associated with a set of neighbor nodes in a given sub tree within the collective tree. For each neighbor node the collective acceleration unit stores destination information. The collective acceleration unit performs an operation on the input data field using the portion of resources to effect the collective operation.
    Type: Grant
    Filed: April 24, 2012
    Date of Patent: June 17, 2014
    Assignee: International Business Machines Corporation
    Inventors: Lakshminarayana B. Arimilli, Bernard C. Drerup, Paul F. Lecocq, Hanhong Xue
  • Patent number: 8751655
    Abstract: A mechanism is provided in a collective acceleration unit for performing a collective operation to distribute or collect data among a plurality of participant nodes. The mechanism receives an input collective packet for a collective operation from a neighbor node within a collective tree. The input collective packet comprises a tree identifier and an input data field and wherein the collective tree comprises a plurality of sub trees. The mechanism maps the tree identifier to an index within the collective acceleration unit. The index identifies a portion of resources within the collective acceleration unit and is associated with a set of neighbor nodes in a given sub tree within the collective tree. For each neighbor node the collective acceleration unit stores destination information. The collective acceleration unit performs an operation on the input data field using the portion of resources to effect the collective operation.
    Type: Grant
    Filed: March 29, 2010
    Date of Patent: June 10, 2014
    Assignee: International Business Machines Corporation
    Inventors: Lakshminarayana B. Arimilli, Bernard C. Drerup, Paul F. Lecocq, Hanhong Xue
  • Patent number: 8745358
    Abstract: Method, apparatus, and program means for performing bitstream buffer manipulation with a SIMD merge instruction. The method of one embodiment comprises determining whether any unprocessed data bits for a partial variable length symbol exist in a first data block. A shift merge operation is performed to merge the unprocessed data bits from the first data block with a second data block. A merged data block is formed. A merged variable length symbol comprised of the unprocessed data bits and a plurality of data bits from the second data block is extracted from the merged data block.
    Type: Grant
    Filed: September 4, 2012
    Date of Patent: June 3, 2014
    Assignee: Intel Corporation
    Inventors: Julien Sebot, William W. Macy, Eric Debes, Huy V. Nguyen
  • Patent number: 8732437
    Abstract: Systems and methods for performing single instruction multiple data (SIMD) operations on a data set. The methods may include examining a structure of the data set to determine what reorganization may be necessary to facilitate SIMD processing. The method may include selecting a stored bit mask corresponding to the organization of the data set and loading the bit mask into an application specific register (ASR). Subsequently, the data may be reorganized inline according to the ASR as the data is loaded into the SIMD functional unit such that the SIMD functional unit may operate on the data set. The results of the SIMD operation may be written to a results register.
    Type: Grant
    Filed: January 26, 2010
    Date of Patent: May 20, 2014
    Assignee: Oracle America, Inc.
    Inventor: Lawrence A. Spracklen
  • Patent number: 8725990
    Abstract: A configurable SIMD engine in a video processor for executing video processing operations. The engine includes a SIMD component having a plurality of inputs for receiving input data and a plurality of outputs for providing output data. A plurality of execution units are included in the SIMD component. Each of the execution units comprises a first and a second data path, and is configured for selectively implementing arithmetic operations on a set of low precision or high precision inputs. Each of the execution units has a first configuration and a second configuration, such that the first data path and the second data path are combined to produce a single high precision output in the first configuration, and such that the first data path and the second data path are partitioned to produce a respective first low precision output and second low precision output in the second configuration.
    Type: Grant
    Filed: November 4, 2005
    Date of Patent: May 13, 2014
    Assignee: Nvidia Corporation
    Inventors: Ashish Karandikar, Pooja Agarwal
  • Patent number: 8726252
    Abstract: A compiler of a single instruction multiple data (SIMD) information handling system (IHS) identifies “if-then-else” statements that offer opportunity for conditional branch conversion. The SIMD IHS employs a processor or processors to execute the executable program. During execution, the processor generates and updates SIMD lane mask information to track and manage the conditional branch loops of the executing program. The processor saves branch addresses and employs SIMD lane masks to identify conditional branch loops with different branch conditions than previous conditional branch loops. The processor may reduce SIMD IHS processing time during processing of compiled code of the original “if-then-else” statements. The processor continues processing next statements inline after all SIMD lanes are complete, while providing speculative and parallel processing capability for multiple data operations of the executable program.
    Type: Grant
    Filed: January 28, 2011
    Date of Patent: May 13, 2014
    Assignee: International Business Machines Corporation
    Inventors: Alexandre E. Eichenberger, Brian Flachs, Dorit Nuzman, Ira Rosen, Ulrich Weigand, Ayal Zaks
  • Patent number: 8713286
    Abstract: A processor device is disclosed and includes a memory and a sequencer that is responsive to the memory. The sequencer supports very long instruction word (VLIW) type instructions and at least one VLIW instruction packet uses a number of operands during execution. The processor device further includes a plurality of instruction execution units responsive to the sequencer and a plurality of register files. Each of the plurality of register files includes a plurality of registers and the plurality of register files are coupled to the plurality of instruction execution units. Further, each of the plurality of register files includes a number of data read ports and the number of data read ports of each of the plurality of register files is less than the number of operands used by the at least one VLIW instruction packet.
    Type: Grant
    Filed: April 26, 2005
    Date of Patent: April 29, 2014
    Assignee: QUALCOMM Incorporated
    Inventors: Muhammad Ahmed, Erich Plondke, Lucian Codrescu, William C. Anderson
  • Patent number: 8700884
    Abstract: A processor in a data processing system executes a permutation instruction which identifies a first source register, at least one other source register, and a destination register. The first source register stores at least one in-range index value for the at least one other source register and at least one out-of-range index value for the at least one other source register. The at least one other source register stores a plurality of vector element values, wherein each in-range index value indicates which vector element value of the at least one other source register is to be stored into a corresponding vector element of the destination register. Each out-of-range index value is used to indicate which one of at least two predetermined constant values is to be stored into a corresponding vector element of the destination register. Partial table lookups using a permutation instruction shorten the time required to retrieve data.
    Type: Grant
    Filed: October 12, 2007
    Date of Patent: April 15, 2014
    Assignee: Freescale Semiconductor, Inc.
    Inventors: William C. Moyer, Imran Ahmed, Dan E. Tamir
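The permutation behaviour described in the abstract above can be emulated in a few lines of scalar Python. This is only a sketch: the function name `permute_with_constants`, the two constants `0x00` and `0xFF`, and the rule that the low bit of an out-of-range index selects the constant are all assumptions chosen for illustration, since the abstract does not specify an encoding.

```python
def permute_with_constants(indices, table, constants=(0x00, 0xFF)):
    """Emulate a table-lookup permute with constant fill for out-of-range indices."""
    n = len(table)
    out = []
    for idx in indices:
        if 0 <= idx < n:
            # In-range index: copy the selected table element.
            out.append(table[idx])
        else:
            # Out-of-range index: pick one of two predetermined constants.
            # Assumption: the low bit of the index chooses which one.
            out.append(constants[idx & 1])
    return out


table = [10, 11, 12, 13]
print(permute_with_constants([0, 3, 8, 9], table))
```

Because out-of-range lanes receive a known constant rather than garbage, partial results from several such lookups over different table segments could be combined afterwards, which is one plausible reading of the "partial table lookups" mentioned in the abstract.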
  • Patent number: 8688959
    Abstract: Method, apparatus, and program means for shuffling data. The method of one embodiment comprises receiving a first operand having a set of L data elements and a second operand having a set of L control elements. For each control element, data from a first operand data element designated by the individual control element is shuffled to an associated resultant data element position if its flush to zero field is not set and a zero is placed into the associated resultant data element position if its flush to zero field is set.
    Type: Grant
    Filed: September 10, 2012
    Date of Patent: April 1, 2014
    Assignee: Intel Corporation
    Inventors: William W. Macy, Jr., Eric L. Debes, Patrice L. Roussel, Huy V. Nguyen
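A scalar Python emulation may make the shuffle rule in the abstract above easier to follow. The layout of a control element used here (bit 7 as the flush-to-zero flag, the remaining bits as the source index) is an assumption for this sketch, as is the function name `shuffle_bytes`.

```python
def shuffle_bytes(data, controls):
    """Shuffle `data` according to `controls`, zeroing lanes whose flag is set."""
    if len(data) != len(controls):
        raise ValueError("data and controls must have the same number of elements")
    out = []
    for c in controls:
        if c & 0x80:
            # Flush-to-zero field set: the result lane becomes zero.
            out.append(0)
        else:
            # Field clear: copy the data element the control element designates.
            out.append(data[(c & 0x7F) % len(data)])
    return out


print(shuffle_bytes([10, 20, 30, 40], [3, 0, 0x80, 1]))
```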
  • Patent number: 8688957
    Abstract: A system and method are configured to detect conflicts when converting scalar processes to parallel processes (“SIMDifying”). Conflicts may be detected for an unordered single index, an ordered single index and/or ordered pairs of indices. Conflicts may be further detected for read-after-write dependencies. Conflict detection is configured to identify operations (i.e., iterations) in a sequence of iterations that may not be done in parallel.
    Type: Grant
    Filed: December 21, 2010
    Date of Patent: April 1, 2014
    Assignee: Intel Corporation
    Inventors: Mikhail Smelyanskiy, Yen-Kuang Chen, Daehyun Kim, Christopher J. Hughes, Victor W. Lee
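For the unordered single-index case mentioned in the abstract above, a conflict arises when two iterations in the same vector touch the same index, for example in a histogram update. The scalar Python sketch below shows one way such conflicts could be reported; the per-lane bit-mask output format and the name `conflict_masks` are assumptions for illustration, not details from the patent.

```python
def conflict_masks(indices):
    """For each lane, return a bit mask of earlier lanes that use the same index."""
    masks = []
    for i, idx in enumerate(indices):
        m = 0
        for j in range(i):
            if indices[j] == idx:
                m |= 1 << j          # lane j must complete before lane i
        masks.append(m)
    return masks


# Lanes 0, 2 and 3 all update index 5, so lanes 2 and 3 depend on earlier lanes.
print(conflict_masks([5, 2, 5, 5]))
```

Lanes whose mask is zero have no earlier lane touching the same index and can run together; the others would need to be serialized or deferred to a later pass.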
  • Patent number: 8688958
    Abstract: A processor has a plurality of PEs (processing elements) that operate in parallel based on operation commands and an information collection unit that collects the data of the plurality of PEs, wherein each of the plurality of PEs holds data and a condition flag, supplies the data and the condition flag to the information collection unit upon receiving an operation command, and upon receiving an update request for updating the condition flag, updates the condition flag in accordance with the update request that was received; and the information collection unit, upon receiving the data and the condition flags, selects one PE based on a predetermined order of priority from among the PEs for which the received condition flags are active and both supplies the data of the selected PE as collection result data and supplies an update request for updating the condition flag of the PE that was selected.
    Type: Grant
    Filed: January 14, 2010
    Date of Patent: April 1, 2014
    Assignee: NEC Corporation
    Inventor: Shohei Nomoto
  • Patent number: 8661225
    Abstract: A data processing apparatus and method are provided for handling vector instructions. The data processing apparatus has a register data store with a plurality of registers arranged to store data elements. A vector processing unit is then used to execute a sequence of vector instructions, with the vector processing unit having a plurality of lanes of parallel processing and having access to the register data store in order to read data elements from, and write data elements to, the register data store during the execution of the sequence of vector instructions. A skip indication storage maintains a skip indicator for each of the lanes of parallel processing. The vector processing unit is responsive to a vector skip instruction to perform an update operation to set within the skip indication storage the skip indicator for a determined one or more lanes.
    Type: Grant
    Filed: January 19, 2010
    Date of Patent: February 25, 2014
    Assignee: ARM Limited
    Inventors: Andreas Björklund, Erik Persson, Ola Hugosson
  • Patent number: 8656376
    Abstract: A method for providing intrinsic supports for a VLIW DSP processor with distributed register files comprises the steps of: generating a program representation with cluster information on instructions of the DSP processor, wherein the cluster information is provided by a program with cluster intrinsic coding; identifying data stream operations indicating parallel instruction sequences applied on different data sets in the program representation; identifying data sharing relations indicating data shared by the data stream operations in the program representation; identifying data aggregation relations indicating results aggregated from the data stream operations in the program representation; and performing register allocation for the DSP processor according to the identified data stream operations, the data sharing relations and the data aggregation relations.
    Type: Grant
    Filed: September 1, 2011
    Date of Patent: February 18, 2014
    Assignee: National Tsing Hua University
    Inventors: Jenq Kuen Lee, Chi Bang Kuan
  • Publication number: 20140032879
    Abstract: Search circuitry responsive to a single instruction for undertaking a step of a search of a data array for an extreme value therein, a method of searching a data array to identify an extreme value therein and a location thereof and a single-instruction, multiple-data (SIMD) processing unit incorporating the search circuitry or the method. In one embodiment, the search circuitry includes: (1) a comparison element configured to compare two values in the data array, (2) multiplexers coupled to the comparison element and configured to select a more extreme value of the two values and a location in the data array of the more extreme value and (3) an incrementer configured to increment a counter associated with the search.
    Type: Application
    Filed: July 26, 2012
    Publication date: January 30, 2014
    Applicant: VeriSilicon Holdings Co., Ltd
    Inventor: Stephen E. Jarboe
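The three elements listed in the abstract above (compare, select value and location, increment a counter) map onto a simple search loop. The Python sketch below is a software analogue only, assuming a search for the maximum; the name `find_max` and the tie-breaking rule (the first occurrence wins) are choices made for this illustration.

```python
def find_max(data):
    """Return (value, location) of the largest element in a non-empty list."""
    best, best_loc = data[0], 0
    counter = 1
    while counter < len(data):
        candidate = data[counter]
        if candidate > best:                 # comparison element
            best, best_loc = candidate, counter   # select value and its location
        counter += 1                         # incrementer
    return best, best_loc


print(find_max([3, 9, 2, 9]))
```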
  • Patent number: 8638805
    Abstract: Described embodiments provide for restructuring a scheduling hierarchy of a network processor having a plurality of processing modules and a shared memory. The scheduling hierarchy schedules packets for transmission. The network processor generates tasks corresponding to each received packet associated with a data flow. A traffic manager receives tasks provided by one of the processing modules and determines a queue of the scheduling hierarchy corresponding to the task. The queue has a parent scheduler at each of one or more next levels of the scheduling hierarchy up to a root scheduler, forming a branch of the hierarchy. The traffic manager determines if the queue and one or more of the parent schedulers of the branch should be restructured. If so, the traffic manager drops subsequently received tasks for the branch, drains all tasks of the branch, and removes the corresponding nodes of the branch from the scheduling hierarchy.
    Type: Grant
    Filed: September 30, 2011
    Date of Patent: January 28, 2014
    Assignee: LSI Corporation
    Inventors: Balakrishnan Sundararaman, Shashank Nemawarkar, David Sonnier, Shailendra Aulakh, Allen Vestal
  • Patent number: 8639914
    Abstract: An apparatus includes an instruction decoder, first and second source registers and a circuit coupled to the decoder to receive packed data from the source registers and to unpack the packed data responsive to an unpack instruction received by the decoder. A first packed data element and a third packed data element are received from the first source register. A second packed data element and a fourth packed data element are received from the second source register. The circuit copies the packed data elements into a destination register resulting with the second packed data element adjacent to the first packed data element, the third packed data element adjacent to the second packed data element, and the fourth packed data element adjacent to the third packed data element.
    Type: Grant
    Filed: December 29, 2012
    Date of Patent: January 28, 2014
    Assignee: Intel Corporation
    Inventors: Alexander Peleg, Yaakov Yaari, Millind Mittal, Larry M. Mennemeier, Benny Eitan
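The destination layout described in the abstract above is an interleave of elements from the two source registers. A minimal Python sketch follows, assuming the elements are taken from the low end of each source; the abstract does not say which half is used, and the name `unpack_interleave` is hypothetical.

```python
def unpack_interleave(src1, src2, count=2):
    """Interleave the first `count` elements of src1 and src2."""
    out = []
    for a, b in zip(src1[:count], src2[:count]):
        # Each element from src1 is followed by the matching element from src2.
        out.extend([a, b])
    return out


print(unpack_interleave(["a0", "a1", "a2", "a3"], ["b0", "b1", "b2", "b3"]))
```

In the abstract's terms, `a0` and `a1` play the role of the first and third packed data elements, and `b0` and `b1` the second and fourth.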
  • Patent number: 8635432
    Abstract: There is provided an SIMD processor array system in which data can be efficiently transferred between processor elements located at different distances. The SIMD processor array system includes a control processor (CP) that is capable of issuing a plurality of instructions at the same time, and a PE array that includes a plurality of mutually-connected processing elements (PEs) to be controlled by the CP. The CP issues an inter-PE data shift instruction to each PE. According to the inter-PE data shift instruction, each PE performs a data sending operation of copying all the contents of a transfer data storing part of an adjoining PE to a transfer data storing part (MBF) of the own PE, and a data fetch operation of copying part or all of the contents of the MBF of the adjoining PE to a transfer data fetch and storing part (RBUF) of the own PE if part of the contents of the MBF of the adjoining PE coincide with the contents of an ID storing part (IDB) of the own PE.
    Type: Grant
    Filed: March 4, 2009
    Date of Patent: January 21, 2014
    Assignee: NEC Corporation
    Inventor: Shorin Kyo
  • Publication number: 20140013077
    Abstract: A method and apparatus for efficiently processing data in various formats in a single instruction multiple data (“SIMD”) architecture is presented. Specifically, a method to unpack fixed-width bit values in a bit stream to a fixed-width byte stream in a SIMD architecture is presented. A method to unpack variable-length byte packed values in a byte stream in a SIMD architecture is presented. A method to decompress a run length encoded compressed bit-vector in a SIMD architecture is presented. A method to return the offset of each bit set to one in a bit-vector in a SIMD architecture is presented. A method to fetch bits from a bit-vector at specified offsets relative to a base in a SIMD architecture is presented. A method to compare values stored in two SIMD registers is presented.
    Type: Application
    Filed: September 10, 2013
    Publication date: January 9, 2014
    Applicant: Oracle International Corporation
    Inventors: Amit Ganesh, Shasank K. Chavan, Vineet Marwah, Jesse Kamp, Anindya C. Patthak, Michael J. Gleeson, Allison L. Holloway, Roger Macnicol
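The first method named in the abstract above, unpacking fixed-width bit values into a byte stream, can be stated as a scalar reference in Python. This sketch says nothing about how the application vectorizes the operation; the least-significant-bit-first packing order, the limit of 8 bits per value, and the name `unpack_fixed_width` are assumptions.

```python
def unpack_fixed_width(packed: bytes, width: int, count: int) -> bytes:
    """Expand `count` values of `width` bits each (LSB-first) into one byte apiece."""
    if not 1 <= width <= 8:
        raise ValueError("width must be between 1 and 8 bits")
    bits = int.from_bytes(packed, "little")   # treat the stream as one big integer
    mask = (1 << width) - 1
    return bytes((bits >> (i * width)) & mask for i in range(count))


# Five 3-bit values 1..5 packed LSB-first occupy 15 bits, i.e. two bytes.
packed = bytes([0xD1, 0x58])
print(list(unpack_fixed_width(packed, 3, 5)))
```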
  • Publication number: 20140013078
    Abstract: A method and apparatus for efficiently processing data in various formats in a single instruction multiple data (“SIMD”) architecture is presented. Specifically, a method to unpack fixed-width bit values in a bit stream to a fixed-width byte stream in a SIMD architecture is presented. A method to unpack variable-length byte packed values in a byte stream in a SIMD architecture is presented. A method to decompress a run length encoded compressed bit-vector in a SIMD architecture is presented. A method to return the offset of each bit set to one in a bit-vector in a SIMD architecture is presented. A method to fetch bits from a bit-vector at specified offsets relative to a base in a SIMD architecture is presented. A method to compare values stored in two SIMD registers is presented.
    Type: Application
    Filed: September 10, 2013
    Publication date: January 9, 2014
    Applicant: Oracle International Corporation
    Inventors: Amit Ganesh, Shasank K. Chavan, Vineet Marwah, Jesse Kamp, Anindya C. Patthak, Michael J. Gleeson, Allison L. Holloway, Roger Macnicol
  • Patent number: 8612732
    Abstract: One embodiment of the present invention sets forth a technique for translating application programs written using a parallel programming model for execution on a multi-core graphics processing unit (GPU) so that they can be executed by a general purpose central processing unit (CPU). Portions of the application program that rely on specific features of the multi-core GPU are converted by a translator for execution by a general purpose CPU. The application program is partitioned into regions of synchronization independent instructions. The instructions are classified as convergent or divergent and divergent memory references that are shared between regions are replicated. Thread loops are inserted to ensure correct sharing of memory between various threads during execution by the general purpose CPU.
    Type: Grant
    Filed: March 19, 2009
    Date of Patent: December 17, 2013
    Assignee: NVIDIA Corporation
    Inventors: Vinod Grover, Bastiaan Joannes Matheus Aarts, Michael Murphy, Boris Beylin, Jayant B. Kolhe, Douglas Saylor
  • Patent number: 8612507
    Abstract: A computing device includes: a deciding unit which, in computation of values of nodes on a lattice in a direction where a value of m representing a horizontal axis coordinate of the lattice increases, decides dummy nodes to be added to m=n−1, so as to enable values of nodes on m=n to be calculated by adding the dummy nodes to m=n−1 and executing a vector operation through the use of the SIMD function by using values of nodes on m=n−1 and values of the added dummy nodes; an adding unit adding the dummy nodes decided by the deciding unit to m=n−1; and a calculating unit calculating the values of the nodes present on m=n by executing the vector operation through the use of the SIMD function by using the values of the nodes on m=n−1 and the values of the dummy nodes added by the adding unit.
    Type: Grant
    Filed: April 16, 2010
    Date of Patent: December 17, 2013
    Assignee: NS Solutions Corporation
    Inventor: Hiroki Takeshita
  • Patent number: 8605099
    Abstract: A technique to increase memory bandwidth for throughput applications. In one embodiment, memory bandwidth can be increased, particularly for throughput applications, without increasing interconnect trace or pin count by pipelining pages between one or more memory storage areas on half cycles of a memory access clock.
    Type: Grant
    Filed: March 31, 2008
    Date of Patent: December 10, 2013
    Assignee: Intel Corporation
    Inventor: Eric Sprangle
  • Patent number: 8601246
    Abstract: An apparatus includes an instruction decoder, first and second source registers and a circuit coupled to the decoder to receive packed data from the source registers and to unpack the packed data responsive to an unpack instruction received by the decoder. A first packed data element and a third packed data element are received from the first source register. A second packed data element and a fourth packed data element are received from the second source register. The circuit copies the packed data elements into a destination register resulting with the second packed data element adjacent to the first packed data element, the third packed data element adjacent to the second packed data element, and the fourth packed data element adjacent to the third packed data element.
    Type: Grant
    Filed: June 27, 2002
    Date of Patent: December 3, 2013
    Assignee: Intel Corporation
    Inventors: Alexander Peleg, Yaakov Yaari, Millind Mittal, Larry M. Mennemeier, Benny Eitan
  • Publication number: 20130318324
    Abstract: A minicore-based reconfigurable processor and a method of flexibly processing multiple data using the same are provided. The reconfigurable processor includes minicores, each of the minicores including function units configured to perform different operations, respectively. The reconfigurable processor further includes a processing unit configured to activate two or more function units of two or more respective minicores, among the minicores, that are configured to perform an operation of a single instruction multiple data (SIMD) instruction, the processing unit further configured to execute the SIMD instruction using the activated two or more function units.
    Type: Application
    Filed: February 13, 2013
    Publication date: November 28, 2013
    Applicant: Samsung Electronics Co., Ltd.
    Inventor: Dong-Kwan SUH
  • Patent number: 8595467
    Abstract: Mechanisms are provided for performing a floating point collect and operate for a summation across a vector for a dot product operation. A routing network placed before the single instruction multiple data (SIMD) unit allows the SIMD unit to perform a summation across a vector with a single stage of adders. The routing network routes the vector elements to the adders in a first cycle. The SIMD unit stores the results of the adders into a results vector register. The routing network routes the summation results from the results vector register to the adders in a second cycle. The SIMD unit then stores the results from the second cycle in the results vector register.
    Type: Grant
    Filed: December 29, 2009
    Date of Patent: November 26, 2013
    Assignee: International Business Machines Corporation
    Inventors: Brian K. Flachs, Seiji Maeda, Steven Osman
  • Patent number: 8587568
    Abstract: An integrated circuit device includes a host I/F, an information register, and a control section. The information register stores wave selection information for selecting waveform information which defines a waveform of a drive signal of the electro-optical device. Waveform information selected by the wave selection information stored in the information register from among a plurality of pieces of waveform information is loaded to an information memory at the time of manufacturing an electronic apparatus including the electro-optical device. The control section controls the display of the electro-optical device on the basis of the waveform information read from the information memory at the time of an actual operation of the electronic apparatus.
    Type: Grant
    Filed: March 24, 2010
    Date of Patent: November 19, 2013
    Assignee: Seiko Epson Corporation
    Inventor: Hideki Ogawa
  • Patent number: 8578387
    Abstract: An embodiment of a computing system is configured to process data using a multithreaded SIMD architecture that includes heterogeneous processing engines to execute a program. The program is constructed of various program instructions. A first type of the program instructions can only be executed by a first type of processing engine and a third type of program instructions can only be executed by a second type of processing engine. A second type of program instructions can be executed by the first and the second type of processing engines. An assignment unit may be configured to dynamically determine which of the two processing engines executes any program instructions of the second type in order to balance the workload between the heterogeneous processing engines.
    Type: Grant
    Filed: July 31, 2007
    Date of Patent: November 5, 2013
    Assignee: Nvidia Corporation
    Inventors: Peter C. Mills, Stuart F. Oberman, John Erik Lindholm, Samuel Liu
  • Patent number: 8572355
    Abstract: One embodiment of the present invention sets forth a method for executing a non-local return instruction in a parallel thread processor. The method comprises the steps of receiving, within the thread group, a first long jump instruction and, in response, popping a first token from the execution stack. The method also comprises determining whether the first token is a first long jump token that was pushed onto the execution stack when a first push instruction associated with the first long jump instruction was executed, and when the first token is the first long jump token, jumping to the second instruction based on the address specified by the first long jump token, or, when the first token is not the first long jump token, disabling the active thread until the first long jump token is popped from the execution stack.
    Type: Grant
    Filed: September 13, 2010
    Date of Patent: October 29, 2013
    Assignee: Nvidia Corporation
    Inventors: Guillermo Juan Rozas, Brett W. Coon
  • Patent number: 8560809
    Abstract: According to some embodiments, a technique provides for the execution of an instruction that includes receiving residual data of a first image and decoded pixels of a second image, zero-extending a plurality of unsigned data operands of the decoded pixels producing a plurality of unpacked data operands, adding a plurality of signed data operands of the residual data to the plurality of unpacked data operands producing a plurality of signed results; and saturating the plurality of signed results producing a plurality of unsigned results.
    Type: Grant
    Filed: November 15, 2011
    Date of Patent: October 15, 2013
    Assignee: Intel Corporation
    Inventors: Bradley C. Aldrich, Nigel C. Paver, Murli Ganeshan
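    The zero-extend, add, and saturate sequence described above can be sketched in scalar form. The 8-bit pixel width, the function name `add_residual_saturate`, and the use of Python lists in place of packed registers are assumptions; the abstract does not state operand widths.

```python
def add_residual_saturate(decoded_pixels, residuals):
    """Add signed residuals to unsigned pixels and clamp to 0..255.

    Assumes 8-bit unsigned pixels; Python integers stand in for the wider
    signed lanes that zero-extension would produce in hardware.
    """
    out = []
    for pixel, residual in zip(decoded_pixels, residuals):
        if not 0 <= pixel <= 255:
            raise ValueError("pixel must fit in 8 unsigned bits")
        widened = pixel              # zero-extension: value unchanged, lane wider
        total = widened + residual   # signed intermediate result
        out.append(max(0, min(255, total)))  # saturate back to unsigned 8 bits
    return out
```

    With pixels `[250, 3, 128]` and residuals `[10, -5, 4]`, the first lane saturates high, the second saturates low, and the third is an ordinary sum.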
  • Patent number: 8539202
    Abstract: A method includes, in a processor, loading/moving a first portion of bits of a source into a first portion of a destination register and duplicating that first portion of bits in a subsequent portion of the destination register.
    Type: Grant
    Filed: June 12, 2012
    Date of Patent: September 17, 2013
    Assignee: Intel Corporation
    Inventor: Patrice Roussel
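    A minimal sketch of a load-and-duplicate operation follows. The choice of the low-order bits as the "first portion", the 64-bit portion size, and the name `load_and_duplicate` are assumptions for illustration; the abstract does not fix any of them.

```python
def load_and_duplicate(source, portion_bits=64):
    """Copy the low `portion_bits` of `source` into both halves of the result.

    The result is modeled as a Python integer twice as wide as the portion
    (128 bits for the assumed default).
    """
    mask = (1 << portion_bits) - 1
    low = source & mask                   # first portion of the source
    return (low << portion_bits) | low    # same bits in the next portion
```

    Under these assumptions, `load_and_duplicate(0x1122334455667788AABBCCDDEEFF0011)` keeps the low 64 bits and repeats them in the high 64 bits.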
  • Patent number: 8539201
    Abstract: Systems, methods and articles of manufacture are disclosed for transposing array data on a SIMD multi-core processor architecture. A matrix in a SIMD format may be received. The matrix may comprise a SIMD conversion of a matrix M in a conventional data format. A mapping may be defined from each element of the matrix to an element of a SIMD conversion of a transpose of matrix M. A SIMD-transposed matrix T may be generated based on matrix M and the defined mapping. A row-wise algorithm may be applied to T, without modification, to operate on columns of matrix M.
    Type: Grant
    Filed: November 4, 2009
    Date of Patent: September 17, 2013
    Assignee: International Business Machines Corporation
    Inventors: Jeffrey S. McAllister, Timothy J. Mullins, Nelson Ramirez, Mark A. Bransford
  • Patent number: 8532288
    Abstract: A cryptographic engine for modulo N multiplication, which is structured as a plurality of almost identical, serially connected Processing Elements, is controlled so as to accept input in blocks that are smaller than the maximum capability of the engine in terms of bits multiplied at one time. The serially connected hardware is thus partitioned on the fly to process a variety of cryptographic key sizes while still maintaining all of the hardware in an active processing state.
    Type: Grant
    Filed: December 1, 2006
    Date of Patent: September 10, 2013
    Assignee: International Business Machines Corporation
    Inventors: Camil Fayad, John K. Li, Siegfried K. H. Sutter, Phil C. Yeh
  • Publication number: 20130227249
    Abstract: A three-dimensional (3D) permute unit for a single-instruction-multiple-data stacked processor includes a first vector permute subunit and a second vector permute subunit. The first and second vector permute subunits are arranged in different layers of a 3D chip package. The vector permute subunits are each configured to process a portion of at least two input vectors. A first contact sub-field of the first vector permute subunit is configured to connect output ports of a first crossbar of the first vector permute subunit, holding an intermediate result of the first vector permute subunit, to a second contact sub-field of the second vector permute subunit. A first contact sub-field of the second vector permute subunit is configured to connect output ports of a first crossbar of the second vector permute subunit, holding an intermediate result of the second vector permute subunit, to a second contact sub-field of the first vector permute subunit.
    Type: Application
    Filed: February 25, 2013
    Publication date: August 29, 2013
    Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
    Inventor: INTERNATIONAL BUSINESS MACHINES CORPORATION
  • Patent number: 8521994
    Abstract: An apparatus includes an instruction decoder, first and second source registers and a circuit coupled to the decoder to receive packed data from the source registers and to pack the packed data responsive to a pack instruction received by the decoder. A first packed data element and a second packed data element are received from the first source register. A third packed data element and a fourth packed data element are received from the second source register. The circuit packs a portion of each of the packed data elements into a destination register, with the portion from the second packed data element adjacent to the portion from the first packed data element, and the portion from the fourth packed data element adjacent to the portion from the third packed data element.
    Type: Grant
    Filed: December 22, 2010
    Date of Patent: August 27, 2013
    Assignee: Intel Corporation
    Inventors: Alexander Peleg, Yaakov Yaari, Millind Mittal, Larry M. Mennemeier, Benny Eitan
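    The element ordering in this abstract can be shown with a short sketch. It assumes 32-bit source elements and takes the low half of each one by truncation; the abstract only says "a portion", so the width, the truncation rule, and the name `pack` are illustrative assumptions.

```python
def pack(src1, src2, elem_bits=32):
    """Pack the low half of each element of src1, then src2, in order.

    src1 holds the first and second elements, src2 the third and fourth.
    Lists stand in for packed registers (an assumption of this sketch).
    """
    half_mask = (1 << (elem_bits // 2)) - 1
    # Keeping source order places the second portion next to the first,
    # and the fourth next to the third.
    return [element & half_mask for element in list(src1) + list(src2)]
```

    For instance, `pack([0x00010002, 0x00030004], [0x00050006, 0x00070008])` yields the four low halves in their original order.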
  • Patent number: 8510536
    Abstract: Techniques for vector completion mask (VCM) handling are provided. A data structure includes a mask field for each operand of a particular operation. A processor attempts to execute the operation with multiple operands, which are identified in the data structure by the mask fields. If operands are successfully retrieved for execution with the operation, then the corresponding mask field within the data structure is cleared. The processor can reset if any field remains set within the data structure and can re-process the operation with operands that were not previously handled with the operation.
    Type: Grant
    Filed: June 28, 2012
    Date of Patent: August 13, 2013
    Assignee: Intel Corporation
    Inventors: Stephan Jourdan, Michael Fetterman, Michael Cornaby, Per Hammarlund, Ronak Signhal, Glenn Hinton
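    The completion-mask idea can be sketched as a retry loop. The names `run_with_completion_mask` and `try_fetch`, the `(ok, value)` return convention, and the `max_passes` limit are assumptions introduced here, not details taken from the patent.

```python
def run_with_completion_mask(operands, try_fetch, max_passes=10):
    """Process operands, clearing each mask field once its operand succeeds.

    `try_fetch(operand)` is a hypothetical callback returning (ok, value).
    Fields still set after a pass mark operands to re-process on the next one.
    """
    mask = [True] * len(operands)     # one mask field per operand, initially set
    results = [None] * len(operands)
    for _ in range(max_passes):
        for i, operand in enumerate(operands):
            if not mask[i]:
                continue              # already handled in an earlier pass
            ok, value = try_fetch(operand)
            if ok:
                results[i] = value
                mask[i] = False       # clear the field on success
        if not any(mask):
            break                     # nothing left to re-process
    return results, mask
```

    Returning the mask alongside the results lets a caller see which operands, if any, were never handled within the pass limit.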
  • Patent number: 8493979
    Abstract: Executing a single instruction/multiple data (SIMD) instruction of a program to process a vector of data wherein each element of the packet vector corresponds to a different received packet.
    Type: Grant
    Filed: December 30, 2008
    Date of Patent: July 23, 2013
    Assignee: Intel Corporation
    Inventors: Bryan E. Veal, Travis T. Schluessler
  • Patent number: 8495253
    Abstract: An article of manufacture, apparatus, and a method for facilitating input/output (I/O) processing for an I/O operation at a host computer system configured for communication with a control unit. The method includes the host computer system obtaining a transport command word (TCW) for an I/O operation having both input and output data. The TCW specifies a location of the output data and a location for storing the input data. The host computer system forwards the I/O operation to the control unit for execution. The host computer system gathers the output data responsive to the location of the output data specified by the TCW, and then forwards the output data to the control unit for use in the execution of the I/O operation. The host computer system receives the input data from the control unit and stores the input data at the location specified by the TCW.
    Type: Grant
    Filed: March 30, 2011
    Date of Patent: July 23, 2013
    Assignee: International Business Machines Corporation
    Inventors: John R. Flanagan, Daniel F. Casper, Catherine C. Huang, Matthew J. Kalos, Ugochukwu C. Njoku, Dale F. Riedy, Gustav E. Sittmann
  • Publication number: 20130185538
    Abstract: A processor includes a scalar processor core and a vector coprocessor core coupled to the scalar processor core. The scalar processor core is configured to retrieve an instruction stream from program storage. The instruction stream includes scalar instructions executable by the scalar processor core and vector instructions executable by the vector coprocessor core. The scalar processor core is configured to pass the vector instructions to the vector coprocessor core. The vector coprocessor core is configured to process a plurality of data values in parallel while executing each vector instruction passed by the scalar processor core. The vector coprocessor core includes a plurality of processing paths arranged in parallel to process the data values. Each of the processing paths includes an execution unit. Each of the execution units is configured to communicate a result of processing to each other of the execution units.
    Type: Application
    Filed: July 13, 2012
    Publication date: July 18, 2013
    Applicant: TEXAS INSTRUMENTS INCORPORATED
    Inventors: Ching-Yu Hung, Shinri Inamori, Jagadeesh Sankaran, Peter Chang
  • Patent number: 8478969
    Abstract: In one embodiment, the present invention includes a processor having multiple execution units, at least one of which includes a circuit having a multiply-accumulate (MAC) unit including multiple multipliers and adders, and to execute a user-level multiply-multiply-accumulate instruction to populate a destination storage with a plurality of elements each corresponding to an absolute value for a pixel of a pixel block. Other embodiments are described and claimed.
    Type: Grant
    Filed: September 24, 2010
    Date of Patent: July 2, 2013
    Assignee: Intel Corporation
    Inventor: Eric S. Sprangle
  • Publication number: 20130166878
    Abstract: Operation parallelism of a data processor is enhanced by floating-point inner product execution units compatible with single instruction multiple data (SIMD). An operating system that can significantly enhance the level of operation parallelism per instruction while maintaining efficiency of floating-point length-4 vector inner product execution units is implemented. The floating-point length-4 vector inner product execution units are defined in the minimum width (32 bits for single precision) even where an extensive operating system becomes available, and compose the inner product execution units to be compatible with SIMD. The mutually augmenting effects of the inner product execution units and SIMD-compatible composition enhances the level of operation parallelism dramatically.
    Type: Application
    Filed: November 28, 2012
    Publication date: June 27, 2013
    Applicant: RENESAS ELECTRONICS CORPORATION
    Inventor: Renesas Electronics Corporation
  • Patent number: 8464025
    Abstract: A signal processing apparatus capable of raising processing capability in processing that accompanies access to a storing means is provided. Stream control units (SCUs) 203_0 to 203_3 access data at an external memory system or local memories 204_0 to 204_3 according to a thread under control from a host processor. Processor unit (PU) arrays 202_0 to 202_3 perform image processing by a different thread from the thread of the SCUs 203_0 to 203_3.
    Type: Grant
    Filed: May 22, 2006
    Date of Patent: June 11, 2013
    Assignee: Sony Corporation
    Inventors: Yuji Yamaguchi, Masatoshi Imai, Toshiharu Noda, Naosuke Asari, Tomoo Mitsunaga, Mitsuharu Ohki, Kazumasa Ito, Hidetoshi Nagano, Sumito Arakawa, Kei Ito
  • Publication number: 20130145120
    Abstract: Method, apparatus, and program means for performing bitstream buffer manipulation with a SIMD merge instruction. The method of one embodiment comprises determining whether any unprocessed data bits for a partial variable length symbol exist in a first data block. A shift merge operation is performed to merge the unprocessed data bits from the first data block with a second data block. A merged data block is formed. A merged variable length symbol comprised of the unprocessed data bits and a plurality of data bits from the second data block is extracted from the merged data block.
    Type: Application
    Filed: January 29, 2013
    Publication date: June 6, 2013
    Inventors: Yen-Kueng Chen, William W. Macy, Jr., Matthew Holliman, Eric L. Debes, Minerva M. Yeung
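    A rough sketch of the shift-merge step in this abstract is given below. It assumes an MSB-first bitstream in which the unprocessed bits are the low-order bits of the first block, and it uses Python integers in place of fixed-width SIMD registers; `shift_merge`, `extract_symbol`, and `block_bits` are hypothetical names.

```python
def shift_merge(first_block, unprocessed, second_block, block_bits=64):
    """Merge the leftover low bits of first_block with all of second_block.

    Returns the merged value and its length in bits.
    """
    leftover = first_block & ((1 << unprocessed) - 1)   # unprocessed tail bits
    merged = (leftover << block_bits) | second_block    # tail first, then new block
    return merged, unprocessed + block_bits


def extract_symbol(merged, merged_bits, symbol_bits):
    """Read the leading symbol_bits of the merged block."""
    return merged >> (merged_bits - symbol_bits)
```

    With 8-bit blocks, merging the last 3 bits of `0b10110101` with `0b11000000` gives an 11-bit block whose leading 5 bits form a symbol that spans both blocks.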
  • Patent number: 8458442
    Abstract: A structure (and method) including a plurality of coprocessing units and a controller that selectively loads data for processing on the plurality of coprocessing units, using a compound loading instruction. The compound loading instruction includes a plurality of low-level software instructions that preliminarily processes input data in a manner predetermined to simulate an effect of a single hardware loading instruction that would provide optimal loading of complex matrix data by loading input data in accordance with the effect of multiplying i·i=−1.
    Type: Grant
    Filed: August 26, 2009
    Date of Patent: June 4, 2013
    Assignee: International Business Machines Corporation
    Inventors: Alexandre E. Eichenberger, Michael Karl Gschwind, John A. Gunnels, Fred Gehrung Gustavson, Brett Olsson
  • Publication number: 20130138917
    Abstract: Method, apparatus, and program means for performing bitstream buffer manipulation with a SIMD merge instruction. The method of one embodiment comprises determining whether any unprocessed data bits for a partial variable length symbol exist in a first data block. A shift merge operation is performed to merge the unprocessed data bits from the first data block with a second data block. A merged data block is formed. A merged variable length symbol comprised of the unprocessed data bits and a plurality of data bits from the second data block is extracted from the merged data block.
    Type: Application
    Filed: January 29, 2013
    Publication date: May 30, 2013
    Inventors: Yen-Kueng Chen, William W. Macy, Jr., Matthew Holliman, Eric L. Debes, Minerva M. Yeung
  • Patent number: 8447953
    Abstract: A microprocessor architecture comprises a plurality of processing elements arranged in a single instruction multiple data (SIMD) array, wherein each processing element includes a plurality of execution units, each of which is operable to process an instruction of a particular instruction type, a serial processor which includes a plurality of execution units, each of which is operable to process an instruction of a particular instruction type, and an instruction controller operable to receive a plurality of instructions, and to distribute received instructions to the execution units in dependence upon the instruction types of the received instructions. The execution units of the serial processor are operable to process respective instructions in parallel.
    Type: Grant
    Filed: February 7, 2006
    Date of Patent: May 21, 2013
    Assignee: Rambus Inc.
    Inventor: Leon David Wildman
  • Publication number: 20130124824
    Abstract: Method, apparatus, and program means for performing bitstream buffer manipulation with a SIMD merge instruction. The method of one embodiment comprises determining whether any unprocessed data bits for a partial variable length symbol exist in a first data block. A shift merge operation is performed to merge the unprocessed data bits from the first data block with a second data block. A merged data block is formed. A merged variable length symbol comprised of the unprocessed data bits and a plurality of data bits from the second data block is extracted from the merged data block.
    Type: Application
    Filed: December 27, 2012
    Publication date: May 16, 2013
    Inventors: Yen-Kueng Chen, William W. Macy, Matthew Holliman, Eric L. Debes, Minerva M. Yeung